Merge branch 'master' into dev_environment

This commit is contained in:
Malte Grosse 2024-05-14 18:20:57 +09:00
commit 72bf8f6137
4 changed files with 107 additions and 1 deletions


@ -1,4 +1,106 @@
# Training Environment
This documentation is for advanced users who are familiar with the following tools: git, Python/R, CUDA, PyTorch/TensorFlow, and basic container concepts.
![repos](./res/training.svg)
## Overview
Two worker agents are available with:
- 12 physical CPUs
- 40 GB memory
- 20 GB Nvidia GPU memory
- 100 GB HDD disk space
Only two pipelines can run in parallel, which ensures that each one gets the promised hardware resources. Additional jobs are placed in a queue and released on a first-in, first-out (FIFO) basis. Storage is not persistent: the output of every build/training job needs to be saved to an external location.
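The queueing behaviour can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: all jobs are submitted at the same time, each running job occupies one of two slots, and the function name `simulate_fifo` and the durations are made up for illustration. The actual scheduling is done by the CI system, not by this code.

```python
from collections import deque


def simulate_fifo(job_durations, workers=2):
    """Return {job_id: (start, end)} for jobs served first-in, first-out.

    Assumption: all jobs are submitted at time 0; durations use any time unit.
    """
    free_at = [0] * workers                  # time at which each slot becomes free
    queue = deque(enumerate(job_durations))  # jobs in submission order
    schedule = {}
    while queue:
        job_id, duration = queue.popleft()   # oldest waiting job goes first
        slot = min(range(workers), key=lambda i: free_at[i])  # earliest free slot
        start = free_at[slot]
        free_at[slot] = start + duration
        schedule[job_id] = (start, free_at[slot])
    return schedule


if __name__ == "__main__":
    # Three hypothetical jobs lasting 60, 30 and 45 minutes
    for job, (start, end) in simulate_fifo([60, 30, 45]).items():
        print(f"job {job}: starts at {start} min, ends at {end} min")
```

In this example the third job has to wait until one of the first two finishes, so long-running jobs delay everything queued behind them.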
## Development
### Git
Create a new git repository and push your latest code to it here: https://git.sandbox.iuk.hdm-stuttgart.de/
Repositories can be private or public, depending on your use case.
### CI
Connect your newly created repository here: https://ci.sandbox.iuk.hdm-stuttgart.de/
1. After logging in, click on "+ Add repository"
![repos](./res/sandbox-ci-repos.png)
2. Enable the specific repository
3. Go to the repositories [overview site](https://ci.sandbox.iuk.hdm-stuttgart.de/repos) and select your enabled repository
4. Go to settings (clicking the settings icon)
![repos](./res/sandbox-ci-settings.png)
5. Set a reasonable timeout in minutes (e.g. 360 minutes for 6 hours) so that a training job that crashes or hangs is stopped
6. Add further settings such as secrets or container registries as needed; see the official [documentation](https://woodpecker-ci.org/docs/usage/project-settings) for details
### Pipeline File
An example script can be found here:
https://git.sandbox.iuk.hdm-stuttgart.de/grosse/test-ci
1. Create a new file `.woodpecker.yml` in your repository (or use a different name, depending on the repository settings above)
2. The content can look like the following:
```
steps:
  "train":
    image: nvcr.io/nvidia/tensorflow:23.10-tf2-py3
    commands:
      - echo "starting python script"
      - python run.py
  "compress and upload":
    image: alpine:3
    commands:
      - apk --no-cache add zip curl
      - zip mymodel.zip mymodel.keras
      - curl -F fileUpload=@mymodel.zip https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload
```
See the official [documentation](https://woodpecker-ci.org/docs/usage/workflow-syntax) for the syntax.
In general, a pipeline consists of several steps, and each step can use a different container environment. In the example above, the first step uses an official TensorFlow container with Python 3 to run the training script. The second step compresses the model and uploads it to the temporary sandbox storage.
3. Commit and push
4. Check the current state of the pipelines on the [overview site](https://ci.sandbox.iuk.hdm-stuttgart.de/repos)
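The compression in the second pipeline step could alternatively be done in Python, for instance at the end of `run.py`, using only the standard library. The following is a minimal sketch, not part of the example repository; the helper name `compress_model` is hypothetical, and the file names are taken from the pipeline example above.

```python
import zipfile
from pathlib import Path


def compress_model(model_path, archive_path):
    """Write the file at model_path into a new zip archive and return its path."""
    model_path = Path(model_path)
    archive_path = Path(archive_path)
    # ZIP_DEFLATED enables compression; the default (ZIP_STORED) only bundles files
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores just the file name, without any leading directories
        zf.write(model_path, arcname=model_path.name)
    return archive_path


if __name__ == "__main__":
    # Assumes the training script has already written mymodel.keras
    print(compress_model("mymodel.keras", "mymodel.zip"))
```

Doing this inside the training step removes the need to install `zip` in a second container, although the upload still has to happen before the pipeline ends because storage is not persistent.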
### Exporting trained model
We provide internal storage from which files are removed after three months.
You can upload a file either with a simple curl command, `curl -F fileUpload=@mymodel.zip https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload`, or with a short Python script:
```
import requests

# Upload endpoint of the temporary sandbox storage
myurl = 'https://share.storage.sandbox.iuk.hdm-stuttgart.de/upload'

print("uploading file")
# Open the model in binary mode; the context manager closes it after the upload
with open('mymodel.keras', 'rb') as f:
    files = {'fileUpload': ('mymodel.keras', f, 'application/octet-stream')}
    response = requests.post(myurl, files=files)
print(response, response.text)
```
The upload returns a JSON response containing the download URL of your uploaded file:
```
{"PublicUrl":"https://storage.sandbox.iuk.hdm-stuttgart.de/upload/49676006-94e4-4da6-be3f-466u786768979/mymodel.keras","Size":97865925,"Expiration":"2024-03-30T00:00:00Z"}
```
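To use the download URL later, for example to print it in the pipeline log, the response text can be parsed. The sketch below assumes the field names and timestamp format seen in the sample response above and assumes that `Size` is given in bytes; the helper name `parse_upload_response` is hypothetical.

```python
import json
from datetime import datetime, timezone


def parse_upload_response(text):
    """Extract download URL, size in MiB and expiration date from the response."""
    data = json.loads(text)
    # Timestamp format assumed from the sample response, e.g. 2024-03-30T00:00:00Z
    expires = datetime.strptime(data["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "url": data["PublicUrl"],
        "size_mib": round(data["Size"] / 2**20, 1),  # assumes Size is in bytes
        "expires": expires.replace(tzinfo=timezone.utc),
    }


if __name__ == "__main__":
    sample = (
        '{"PublicUrl":"https://storage.sandbox.iuk.hdm-stuttgart.de/upload/'
        '49676006-94e4-4da6-be3f-466u786768979/mymodel.keras",'
        '"Size":97865925,"Expiration":"2024-03-30T00:00:00Z"}'
    )
    info = parse_upload_response(sample)
    print(f"download: {info['url']}")
    print(f"size: {info['size_mib']} MiB, expires: {info['expires']:%Y-%m-%d}")
```

In the upload script, `response.text` would be passed to this function in place of the sample string.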
## Troubleshooting
- The first pull of an external container image can take quite a while, depending on its size, because registries such as Docker Hub limit the download speed. The Sandbox git also supports hosting container images...
- Log at a reasonable rate during training so that the output does not flood the pipeline logs
- Training exits after 60 minutes: increase the maximum duration (timeout) in the CI repository settings
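One way to keep the log volume reasonable is to log only every N-th step instead of every step. The sketch below uses Python's standard `logging` module; the names `should_log` and `train`, the interval of 100 steps, and the placeholder loss are assumptions for illustration, not part of any framework.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("training")


def should_log(step, total_steps, every=100):
    """Log the first step, every `every`-th step, and the final step."""
    return step == 1 or step % every == 0 or step == total_steps


def train(total_steps=1000, every=100):
    """Run a dummy loop and return how many log lines were written."""
    logged = 0
    for step in range(1, total_steps + 1):
        loss = 1.0 / step  # placeholder for a real training step
        if should_log(step, total_steps, every):
            log.info("step %d/%d loss=%.4f", step, total_steps, loss)
            logged += 1
    return logged


if __name__ == "__main__":
    train()
```

With Keras, a similar effect can be had without custom code: `model.fit(..., verbose=2)` prints one line per epoch instead of a continuously updating progress bar.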
## Useful Links
- [Sandbox GIT](https://git.sandbox.iuk.hdm-stuttgart.de/)
- [Sandbox CI](https://ci.sandbox.iuk.hdm-stuttgart.de)
- [Git](https://git-scm.com/docs/gittutorial)
- [Woodpecker Syntax](https://woodpecker-ci.org/docs/2.3/usage/workflow-syntax)
- [PyTorch](https://pytorch.org/docs/stable/index.html)
- [TensorFlow](https://www.tensorflow.org/versions/r2.15/api_docs/python/tf)
- [NVIDIA PyTorch Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
- [NVIDIA TensorFlow Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)