How to fix "Application exited with code 137 (OOMKilled)"

SMCoder775 · January 24, 2024, 3:54am

Hi there. I have been trying to train a YOLOv5s model on some of my data (~25000 images). However, when it nears the end of profiling, the training exits with error code 137 (OOMKilled). I have enterprise and have set the job limits to the max, but it still cuts out anyway.

Profiling 99% done
/app/run-python-with-venv.sh: line 17:     8 Killed                  /app/$VENV_NAME/.venv/bin/python3 -u $ARGS
Application exited with code 137 (OOMKilled)
2024-01-24T01:25:24.247Z logger=server level=error Failed job execution
Error: Job 15488525 finished
    at /home/node/studio/build/server/server/start-daemon.js:211:48
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
Application exited with code 1
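
For context on the log above: by the usual shell and container convention, an exit code above 128 means the process was terminated by a signal, and 137 is 128 + 9, i.e. signal 9 (SIGKILL), which is what the Linux out-of-memory killer sends. A minimal Python sketch of that decoding, assuming a Linux host; `describe_exit_code` is a hypothetical helper name, not part of any platform tooling:

```python
import signal

def describe_exit_code(code: int) -> str:
    """Interpret a process exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name (e.g. 9 -> SIGKILL).
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited on its own with status {code}"

print(describe_exit_code(137))
print(describe_exit_code(1))
```

The trailing "Application exited with code 1" is the job runner reporting the failure, not a second crash of the training process.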

I also tried SSD Mobilenet, in case the issue was with YOLOv5, but the job wouldn’t start because the estimated memory required was far too high (I suspect a memory leak). I have looked on the forums but couldn’t find a fix for either problem. Is it something I’m doing wrong, or are these bugs? If so, when can we expect a fix (I have some deadlines to meet with this project)? My project ID is 331781. Thank you!

SMCoder775 · January 24, 2024, 4:58pm

Recently my project ran out of GPU runtime. Could this be why? Also, how can I increase my GPU runtime? I am using the free trial of the enterprise edition.

louis · January 24, 2024, 8:53pm

Hello @SMCoder775,

You can contact our Sales Team on this page: Pricing

In short, the issue you are running into is an out-of-memory issue.
There are several options to fix it:

Reduce your dataset size
Reduce your batch size
Change your model architecture
Train your model locally and then import it using the BYOM feature.
Or, train on a bigger machine (you’ll need to contact the Sales team to adjust your plan).
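
As an illustration of the first option, a reproducible random subset of the image list can be drawn before uploading. This is only a sketch under assumptions: the file names are made up, and `subsample` is a hypothetical helper, not a platform feature.

```python
import random

def subsample(items, keep, seed=0):
    """Return a reproducible random subset of `items` with at most `keep` entries."""
    if keep >= len(items):
        return list(items)
    rng = random.Random(seed)  # fixed seed so the same subset is drawn every time
    return sorted(rng.sample(items, keep))

# Hypothetical file names standing in for a ~25000-image dataset.
all_images = [f"img_{i:05d}.jpg" for i in range(25000)]
subset = subsample(all_images, keep=20000)
print(len(subset))
```

For object detection data, the label files would need to be filtered with the same selection so that images and annotations stay paired.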

Best,

Louis

SMCoder775 · January 26, 2024, 3:43am

Hi @louis, sorry for the late reply, but after reducing my dataset by around 5000 images, the estimated memory has not changed at all (still at 57344 for YOLOv5s). The batch size also does not change the estimated memory, and the only other architecture I can use, SSD Mobilenet, says that the memory required is 142459Mi. Is this right?
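
To put those estimates in more familiar units, here is a small conversion sketch. It assumes the YOLOv5s figure is in MiB like the SSD Mobilenet one, which the post does not state; `mib_to_gib` is a hypothetical helper name.

```python
def mib_to_gib(mib: float) -> float:
    """Convert mebibytes to gibibytes (1 GiB = 1024 MiB)."""
    return mib / 1024

# Estimates quoted in the post; the unit of the first is an assumption.
estimates_mib = {"YOLOv5s": 57344, "SSD Mobilenet": 142459}

for name, mib in estimates_mib.items():
    print(f"{name}: {mib_to_gib(mib):.2f} GiB")
```

Under that assumption the estimates come to 56 GiB and roughly 139 GiB.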

louis · January 26, 2024, 5:30pm

Hello @SMCoder775,

Thanks for your feedback.
@matkelcey has been working on a fix.
It will land in production in the next release.

Best,

Louis

SMCoder775 · January 26, 2024, 7:33pm

Cool, thank you so much!