Compare commits
5 Commits
832f84f34b...bfbb5c4185
Author | SHA1 | Date |
---|---|---|
Malte Grosse | bfbb5c4185 | |
Malte Grosse | 49afe43ad7 | |
Malte Grosse | 8cfc367923 | |
Malte Grosse | 3002313bf0 | |
Malte Grosse | 0716270cbb | |
Binary file not shown.
After Width: | Height: | Size: 20 KiB |
Binary file not shown.
After Width: | Height: | Size: 10 KiB |
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 38 KiB |
@@ -1,4 +1,106 @@
# Training Environment

Currently under heavy development, coming soon... stay tuned!

This documentation is for advanced users who are familiar with the following tools: git, Python/R, CUDA, PyTorch/TensorFlow, and basic container concepts.

![repos](./res/training.svg)

## Overview

Two worker agents are available with:

- 12 physical CPUs
- 40 GB memory
- 20 GB Nvidia GPU memory
- 100 GB HDD disk space

Only two pipelines can run in parallel, which ensures that the promised hardware resources are available to each of them. Additional jobs are placed in a queue and released on a first-in, first-out (FIFO) basis. Storage is not persistent: the results of every build or training job need to be saved somewhere external.

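
The queueing behaviour described above can be pictured with a small toy model. This is only a sketch of FIFO scheduling with two parallel slots; it is not how the CI server is implemented, and the job names and durations are made up for illustration.

```python
from collections import deque

MAX_PARALLEL = 2  # from the text: only two pipelines run at the same time


def schedule(jobs):
    """Simulate FIFO scheduling of (name, duration) jobs on MAX_PARALLEL slots.

    Returns a list of (name, start, end) tuples in the order the jobs start.
    """
    queue = deque(jobs)            # waiting jobs, oldest first
    free_at = [0] * MAX_PARALLEL   # time at which each slot becomes free
    timeline = []
    while queue:
        name, duration = queue.popleft()  # FIFO: the oldest job goes first
        # Pick the slot that becomes free earliest.
        slot = min(range(MAX_PARALLEL), key=lambda i: free_at[i])
        start = free_at[slot]
        end = start + duration
        free_at[slot] = end
        timeline.append((name, start, end))
    return timeline


if __name__ == "__main__":
    # Hypothetical jobs with durations in minutes.
    jobs = [("job-a", 30), ("job-b", 10), ("job-c", 20), ("job-d", 5)]
    for name, start, end in schedule(jobs):
        print(f"{name}: starts at minute {start}, ends at minute {end}")
```

In this model a third job only starts once one of the two running jobs has finished, so a long-running training job delays everything queued behind it.
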
## Development

### Git

Create a new git repository and commit your latest code here: https://git.sandbox.iuk.hdm-stuttgart.de/

Repositories can be private or public, depending on your use case.

### CI

Connect your newly created repository here: https://ci.sandbox.iuk.hdm-stuttgart.de/

1. After logging in, click on "+ Add repository"
![repos](./res/sandbox-ci-repos.png)
2. Enable the specific repository

3. Go to the repositories [overview site](https://ci.sandbox.iuk.hdm-stuttgart.de/repos) and select your enabled repository
4. Go to settings (click the settings icon)
![repos](./res/sandbox-ci-settings.png)
5. Set a reasonable timeout in minutes (e.g. 360 minutes for 6 hours) in case a training run crashes or hangs
6. Add further settings such as secrets or container registries; see the official [documentation](https://woodpecker-ci.org/docs/usage/project-settings) for details

### Pipeline File

An example script can be found here:

https://git.sandbox.iuk.hdm-stuttgart.de/grosse/test-ci

1. Create a new file named `.woodpecker.yml` in your repository (or a different path, depending on the repository settings above)
2. The content can look like the following:

```
steps:
  # Step 1: run the training script inside the NVIDIA TensorFlow container
  "train":
    image: nvcr.io/nvidia/tensorflow:23.10-tf2-py3
    commands:
      - echo "starting python script"
      - python run.py
  # Step 2: compress the trained model and upload it to the sandbox storage
  "compress and upload":
    image: alpine:3
    commands:
      - apk --no-cache add zip curl
      - zip mymodel.zip mymodel.keras
      - curl -F fileUpload=@mymodel.zip https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload
```
See the official [documentation](https://woodpecker-ci.org/docs/usage/workflow-syntax) for the syntax.

In general, a pipeline consists of several steps, and each step can use a different container environment. In the example above, the first step uses an official TensorFlow container with Python 3 to run the training script. The second step compresses the model and uploads it to the temporary sandbox storage.

3. Commit and push
4. See the current state of the pipelines at the [overview site](https://ci.sandbox.iuk.hdm-stuttgart.de/repos)

### Exporting trained model

We provide internal storage that keeps uploaded files for 3 months.
You can upload a file either with a simple curl command, `curl -F fileUpload=@mymodel.zip https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload`, or with a simple Python script:

```
import requests

myurl = 'https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload'

print("uploading file")
# Open the model in binary mode; the context manager closes the file after the upload.
with open('mymodel.keras', 'rb') as model_file:
    files = {
        'fileUpload': ('mymodel.keras', model_file, 'application/octet-stream')
    }
    response = requests.post(myurl, files=files)

# Print the HTTP status and the response body returned by the storage service.
print(response, response.text)
```

The upload returns a JSON object containing the download URL of your uploaded file:

```
{"PublicUrl":"https://storage.sandbox.iuk.hdm-stuttgart.de/upload/49676006-94e4-4da6-be3f-466u786768979/mymodel.keras","Size":97865925,"Expiration":"2024-03-30T00:00:00Z"}
```

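
To pick the download URL and the expiration date out of such a response, something like the following sketch may help. It assumes the response body always has the three fields shown in the sample above (`PublicUrl`, `Size`, `Expiration`); the helper name `parse_upload_response` is hypothetical and not part of any sandbox tooling.

```python
import json
from datetime import datetime, timezone

# Sample response body copied from the example above.
sample = (
    '{"PublicUrl":"https://storage.sandbox.iuk.hdm-stuttgart.de/upload/'
    '49676006-94e4-4da6-be3f-466u786768979/mymodel.keras",'
    '"Size":97865925,"Expiration":"2024-03-30T00:00:00Z"}'
)


def parse_upload_response(text):
    """Return (download URL, size in MiB, expiration datetime) from the JSON body."""
    data = json.loads(text)
    # The trailing "Z" means UTC; rewrite it so older Python versions can parse it.
    expires = datetime.fromisoformat(data["Expiration"].replace("Z", "+00:00"))
    size_mib = data["Size"] / (1024 * 1024)
    return data["PublicUrl"], size_mib, expires


url, size_mib, expires = parse_upload_response(sample)
print(f"download from: {url}")
print(f"size: {size_mib:.1f} MiB, available until {expires.date()}")
```

In the upload script, the same function could be called with `response.text`, for example to print the download URL at the end of the pipeline log.
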
## Troubleshooting

- The first pull of an external container image can take quite a while, depending on its size, because some registries (such as Docker Hub) limit the download speed. The Sandbox git also supports hosting container images...
- Choose a sensible way to log progress during training, so that the output does not flood the logs
- Training exits after 60 minutes: increase the maximum duration in the CI repository settings

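
One way to keep the log volume manageable is to report progress only every N steps. The sketch below uses Python's standard `logging` module; the interval `LOG_EVERY`, the helper `should_log`, and the dummy loss value are assumptions for illustration and would be replaced by your own training loop.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("train")

LOG_EVERY = 100  # hypothetical interval; tune it to the length of your training


def should_log(step, total_steps, every=LOG_EVERY):
    """Return True for every `every`-th step and for the final step."""
    return step % every == 0 or step == total_steps


def train(total_steps):
    """Placeholder loop; returns the number of log lines that were written."""
    logged = 0
    for step in range(1, total_steps + 1):
        loss = 1.0 / step  # dummy value standing in for a real training loss
        if should_log(step, total_steps):
            log.info("step %d/%d loss=%.4f", step, total_steps, loss)
            logged += 1
    return logged


if __name__ == "__main__":
    train(1000)
```

With these settings a run of 1000 steps writes 10 log lines instead of 1000.
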
## Useful Links

- [Sandbox GIT](https://git.sandbox.iuk.hdm-stuttgart.de/)
- [Sandbox CI](https://ci.sandbox.iuk.hdm-stuttgart.de)
- [Git](https://git-scm.com/docs/gittutorial)
- [Woodpecker Syntax](https://woodpecker-ci.org/docs/2.3/usage/workflow-syntax)
- [PyTorch](https://pytorch.org/docs/stable/index.html)
- [TensorFlow](https://www.tensorflow.org/versions/r2.15/api_docs/python/tf)
- [NVIDIA PyTorch Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
- [NVIDIA TensorFlow Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)