Hi there. I have been trying to train a YOLOv5s model on my own dataset (~25,000 images). However, as profiling nears completion, the training job exits with error code 137 (OOMKilled). I am on an Enterprise plan and have set the job limits to the maximum, but the job is still killed. The relevant log output:
Profiling 99% done
/app/run-python-with-venv.sh: line 17: 8 Killed /app/$VENV_NAME/.venv/bin/python3 -u $ARGS
Application exited with code 137 (OOMKilled)
2024-01-24T01:25:24.247Z logger=server level=error Failed job execution
Error: Job 15488525 finished
    at /home/node/studio/build/server/server/start-daemon.js:211:48
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
Application exited with code 1
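For context, my understanding is that 137 follows the usual shell convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which is what the kernel's OOM killer sends. Below is a minimal Python sketch of that decoding. The helper name `describe_exit_code` is my own and is not part of any tool mentioned here, and it assumes a POSIX system where the signal names are available:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit code (hypothetical helper for illustration).

    By shell convention, a code above 128 means the process was
    terminated by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name, e.g. 9 -> SIGKILL on POSIX systems.
            name = signal.Signals(signum).name
        except ValueError:
            # Unknown signal number on this platform; fall back to the number.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(1))
```

If that reading is right, the first failure (137) is the training process being killed for exceeding its memory limit, and the later "code 1" is just the job wrapper reporting that failure.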
I also tried SSD MobileNet, in case the issue was specific to YOLOv5, but that job wouldn't start because the estimated memory requirement was far too high (I suspect a memory leak). I have searched the forums but couldn't find a fix for either problem. Is this something I'm doing wrong, or are these bugs? If they are bugs, when can we expect a fix? I have deadlines to meet on this project. My project ID is 331781. Thank you!