diff --git a/src/sandbox/res/sandbox-ci-repos.png b/src/sandbox/res/sandbox-ci-repos.png
new file mode 100644
index 0000000..4f4c54c
Binary files /dev/null and b/src/sandbox/res/sandbox-ci-repos.png differ
diff --git a/src/sandbox/res/sandbox-ci-settings.png b/src/sandbox/res/sandbox-ci-settings.png
new file mode 100644
index 0000000..945e008
Binary files /dev/null and b/src/sandbox/res/sandbox-ci-settings.png differ
diff --git a/src/sandbox/res/training.svg b/src/sandbox/res/training.svg
new file mode 100644
index 0000000..9c70de9
--- /dev/null
+++ b/src/sandbox/res/training.svg
@@ -0,0 +1,4 @@
+
+
+
+SandboxScientific UserAccess SandboxLogin to Gitcreate Repositoryclone repositorycreate model & set parametercreate woodpecker.ymlgit commit & pushci/cd add new repositoryWoodpecker start trainingafter training upload model to datapoolprovide linkDatapoolaccess modelShut Down ServerGIT
\ No newline at end of file
diff --git a/src/sandbox/training.md b/src/sandbox/training.md
index 4e0973b..39f1692 100644
--- a/src/sandbox/training.md
+++ b/src/sandbox/training.md
@@ -1,4 +1,106 @@
 # Training Environment
-Currently under heavy development, coming soon....stayed tuned!
+This documentation is for advanced users who are familiar with the following tools: git, Python/R, CUDA, PyTorch/TensorFlow, and who have basic container knowledge.
+![repos](./res/training.svg)
+## Overview
+Two worker agents are available with
+- 12 physical CPUs
+- 40 GB memory
+- 20 GB Nvidia GPU memory
+- 100 GB HDD disk space
+Only two pipelines can run in parallel, so that each job gets the promised hardware resources. Additional jobs are queued and released in FIFO order. Storage is not persistent - the output of every build/training job needs to be saved somewhere external.
+
+
+## Development
+
+### Git
+Create a new git repository and commit your latest code here: https://git.sandbox.iuk.hdm-stuttgart.de/
+
+Repositories can be private or public, depending on your use case.
+
+
+### CI
+Connect your newly created repository here: https://ci.sandbox.iuk.hdm-stuttgart.de/
+1. After login, click on "+ Add repository"
+![repos](./res/sandbox-ci-repos.png)
+2. Enable the specific repository
+
+3. Go to the repositories [overview site](https://ci.sandbox.iuk.hdm-stuttgart.de/repos) and select your enabled repository
+4. Go to settings (click the settings icon)
+![repos](./res/sandbox-ci-settings.png)
+5. Set a reasonable timeout in minutes (e.g. 360 minutes for 6 hours) in case a training job crashes or hangs
+6. Add further settings such as secrets or container registries; see the official [documentation](https://woodpecker-ci.org/docs/usage/project-settings) for details
+
+
+### Pipeline File
+An example script can be found here:
+
+ https://git.sandbox.iuk.hdm-stuttgart.de/grosse/test-ci
+
+
+1. Create a new file named `.woodpecker.yml` in your repository (or a different name, depending on the repository settings above)
+2. The content can look like the following:
+
+```
+steps:
+  "train":
+    image: nvcr.io/nvidia/tensorflow:23.10-tf2-py3
+    commands:
+      - echo "starting python script"
+      - python run.py
+  "compress and upload":
+    image: alpine:3
+    commands:
+      - apk --no-cache add zip curl
+      - zip mymodel.zip mymodel.keras
+      - curl -F fileUpload=@mymodel.zip https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload
+```
+See the official [documentation](https://woodpecker-ci.org/docs/usage/workflow-syntax) for the syntax.
+
+Generally, a pipeline consists of several steps, and each step can use a different container environment. In the example above, the first step uses an official TensorFlow container with Python 3 to run the Python training script. The second step compresses the model and pushes it to the temporary sandbox storage.
+3. Commit and push
+4. See the current state of the pipelines on the [overview site](https://ci.sandbox.iuk.hdm-stuttgart.de/repos)
+
+
+
+### Exporting trained model
+We provide internal storage from which files are disposed of after 3 months.
+You can upload a file either with a simple curl command `curl -F fileUpload=@mymodel.zip https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload` or with a simple Python script
+
+```
+import requests
+import os
+
+myurl = 'https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload'
+
+print("uploading file")
+files = {
+    'fileUpload':('mymodel.keras', open('mymodel.keras', 'rb'),'application/octet-stream')
+}
+
+response = requests.post(myurl, files=files)
+print(response,response.text)
+
+```
+
+which returns JSON with the download URL of your uploaded file.
+
+```
+{"PublicUrl":"https://storage.sandbox.iuk.hdm-stuttgart.de/upload/49676006-94e4-4da6-be3f-466u786768979/mymodel.keras","Size":97865925,"Expiration":"2024-03-30T00:00:00Z"}
+
+```
+## Troubleshooting:
+- The first pull of an external container image can take quite a while, depending on its size, because some registries (like Docker Hub) limit the download speed. The Sandbox git also supports hosting container images...
+- Choose a proper way to output some reasonable logs during your training, so it won't spam the logs too heavily
+- Training exits after 60 minutes: increase the maximum duration in the CI repository settings
+
+## Useful Links
+- [Sandbox GIT](https://git.sandbox.iuk.hdm-stuttgart.de/)
+- [Sandbox CI](https://ci.sandbox.iuk.hdm-stuttgart.de)
+- [Git](https://git-scm.com/docs/gittutorial)
+- [Woodpecker Syntax](https://woodpecker-ci.org/docs/2.3/usage/workflow-syntax)
+- [PyTorch](https://pytorch.org/docs/stable/index.html)
+- [TensorFlow](https://www.tensorflow.org/versions/r2.15/api_docs/python/tf)
+- [NVIDIA PyTorch Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
+- [NVIDIA Tensorflow Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
\ No newline at end of file
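
The JSON upload response documented under "Exporting trained model" can be read programmatically, for example to log the download link at the end of a pipeline step. The following is a minimal sketch, assuming the response always carries the `PublicUrl`, `Size` and `Expiration` fields seen in the single sample in the patch; `parse_upload_response` is a hypothetical helper name and not part of the sandbox tooling.

```python
import json
from datetime import datetime, timezone


def parse_upload_response(body: str):
    """Extract download URL, size in bytes and expiry from the upload response.

    Assumption: the field names match the one sample response in the docs.
    """
    data = json.loads(body)
    # The sample timestamp ends in "Z", so it is treated as UTC here.
    expires = datetime.strptime(
        data["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return data["PublicUrl"], int(data["Size"]), expires


# Sample response copied from the documentation; no network access needed.
sample = (
    '{"PublicUrl":"https://storage.sandbox.iuk.hdm-stuttgart.de/upload/'
    '49676006-94e4-4da6-be3f-466u786768979/mymodel.keras",'
    '"Size":97865925,"Expiration":"2024-03-30T00:00:00Z"}'
)

url, size, expires = parse_upload_response(sample)
print("download url:", url)
print("size in bytes:", size)
print("expires on:", expires.date())
```

In a real pipeline, `response.text` from the `requests.post` call would take the place of `sample`.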