Help! Getting Error: Application exited with code 137 (OOMKilled)

Hello All,

I am trying to run my object detection training and I am getting this error:

Application exited with code 137 (OOMKilled)

I am assuming it’s because I am using too much RAM, as I’ve previously seen in other topics?

Any help would be greatly appreciated! Thanks!

Edit:
OK, I deleted about 400 pictures; hopefully it’s going to work now. At least I’m seeing it pass epochs now. Before, it just gave me:

Creating job… OK (ID: 2165071)

Copying features from processing blocks…
Job started
Copying features from DSP block…
Copying features from DSP block OK
Copying features from processing blocks OK

Job started
Splitting data into training and validation sets…
Splitting data into training and validation sets OK

Training model…
Training on 367 inputs, validating on 92 inputs
Building model and restoring weights for fine-tuning…
Finished restoring weights
Fine tuning…
Attached to job 2165071…
Attached to job 2165071…
Attached to job 2165071…
Attached to job 2165071…
Attached to job 2165071…
Attached to job 2165071…

Application exited with code 137 (OOMKilled)

Now I’m seeing the epochs, crossing fingers.

Have to present this tomorrow in front of my class :frowning:. Sucks having to delete 400+ pictures :frowning:

Still love edge impulse though, amazing.

Edit:

Now I’m receiving this message:

ERR: DeadlineExceeded - Job was active longer than specified deadline Try decreasing the number of windows or reducing the number of training cycles. If the error persists then you can contact support at hello@edgeimpulse.com to increase this time limit.

2nd Edit:

OK, I had to do 10 epochs with 0.01 and an 80/20 split; it would time out or give me errors past 10 epochs. We are making a chair occupancy system (kind of like a smart parking garage system, but for chairs). We are students, and we are trying to make it easier for students to find seating at school instead of walking around for 20+ minutes looking for an open seat.

If we could please get some increased time or increased resources for our project, we WOULD HUGELY appreciate it. We are using a Jetson Nano 2GB and plan on adding additional features via Python code to submit seating detections to our web server, so students can go to the website and see the seats. Right now we are starting small (3 chairs), but we eventually plan to scale this up if we have time and it’s possible. Thank you guys, I hope someone reads this <3.
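For anyone planning a similar setup, here is a minimal sketch of how detections might be submitted from the Nano to a web server using only the Python standard library. The endpoint URL, the JSON layout, and the function names are all hypothetical (nothing in this thread specifies them); the labels `seatempty` and `seattaken` are the two classes mentioned later in the thread.

```python
import json
import urllib.request
from typing import Dict

# Hypothetical endpoint: replace with the URL of your own web server.
SEATS_URL = "http://example.com/api/seats"


def build_seat_update(seat_states: Dict[str, str], url: str = SEATS_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request describing each chair's state."""
    body = json.dumps({"seats": seat_states}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_seat_update(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code reported by the server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Build a request for three chairs; nothing is sent until send_seat_update() is called.
example = build_seat_update(
    {"chair_1": "seatempty", "chair_2": "seattaken", "chair_3": "seatempty"}
)
print(example.get_method(), example.full_url)
```

Keeping the request construction separate from the network call makes it easy to check the payload on the Nano before a server exists.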

Hello @jumill87,

I just increased your time limit to 60 minutes.
Good luck with your school project :wink:

Regards,

Louis

Thank you so much! We won’t let you down!

Sincerely,

Justin

Will extending the time allow us to have a larger data set? Or does it just allow more time for what we currently have? Would it work with 1,100 pictures like we previously had?

Would like to know so we don’t end up doing 8 hours of extra work only for it to not work. Thanks again for your help. I’m asking because if that is the case, I’m afraid our ML model won’t be as accurate as it needs to be, especially when we start doing a whole classroom. If so :frowning:, I am thinking it will be better to just do it solely on the Jetson Nano so we can collect larger data sets.

Sincerely,

Justin

I think the time limit is for the training of the models.
This will depend on how many classes are in the 1100 images.

I have 2 classes, seatempty/seattaken. However, I was using a lot of bounding boxes (trying to make it as accurate as possible) and was still getting Python errors and timeout errors. I’m assuming this has to do with the processing power/RAM usage. I’m honestly not sure how long it would have taken. I just want to know before I take another 700+ pictures only to have it not work again.

Thank you for your reply!

I really do not know how much time is required to train on these images. As we mentioned before, the limit is per job. I would say give it a try and see if 1,000 images can be trained in 60 minutes.

I have. I tried increasing the images to just over 500 and was getting a Python error on line 10 or something like that. I didn’t save the error as I needed to present to class, so I had to delete another 200+ images and retrain it.

So I’m guessing this is just for basic applications and you would need more resources to do more, or something is bugging out for me… I was able to do 25 epochs 2 days ago when he increased my time, but yesterday, after trying to add more data (images/bounding boxes), it kept giving me Python errors after epoch 1, or it would error out at epoch 23 (definitely not taking 60 minutes, more like 20-25 minutes). IDK what is going on, but I’m guessing I’ll just have to do everything on the Nano from now on.

I thought this would be a great tool to show people who don’t know about it yet, but it’s actually causing me more stress than anything when you think it’s going to train your model and then it crashes (not due to time running out or resources; it just gives me some Python error on line 10, I believe, or something like that. I should have saved the error, but I was in a hurry). The model I have trained now uses only about 423 images, I believe; when I tried increasing that to 600+ it gave me the same errors.

Any help would be appreciated, but I’m guessing what we’re trying to do is just too much for Edge Impulse.

Thanks for the reply.

Would it be possible to share the python error you are facing, please?
I will see if we can increase the training time. @louis is this still possible?

The OOMKilled issue is a memory-related issue.
By default, we use containers with 4GB of RAM for training.
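As background on why code 137 points at memory: by the usual shell and container convention, an exit status above 128 means the process was terminated by signal number (status - 128), and 137 - 128 = 9, which is SIGKILL, the signal the Linux out-of-memory killer sends. A minimal sketch of that decoding follows; the function name is illustrative and not part of any Edge Impulse tooling.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "signal {}".format(signum)
        return "terminated by {}".format(name)
    if code == 0:
        return "exited normally"
    return "exited with error status {}".format(code)


print(describe_exit_code(137))  # on Linux: terminated by SIGKILL
```

The exit code alone only says the process was killed; it is the accompanying OOMKilled label that identifies memory as the cause.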

I’ll increase the default for your project too (while keeping the training time limit to 60 minutes). That should help, let me know if you have any more issues.

Regards,

Louis

Thank you so much guys, really appreciate the help! I will keep you updated; if I get the Python error again I will post it ASAP. Once I deleted more images it went away and I was able to train again. Maybe it had something to do with the images or the number of bounding boxes I was using?

Really appreciate everything and all the help!
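On whether the number of images could matter for memory, a rough back-of-the-envelope sketch may help. It assumes every image is decoded to a float32 array and held in memory at once, and it assumes a 320x320 RGB input size; neither assumption is confirmed anywhere in this thread, and the real training job also needs memory for the model, activations, and labels, so treat the result as a lower bound at best.

```python
def estimate_dataset_bytes(num_images, width, height, channels=3, bytes_per_value=4):
    """Bytes needed to hold every image in memory as a dense array (float32 by default)."""
    return num_images * width * height * channels * bytes_per_value


# 4 GB default container size mentioned in this thread.
CONTAINER_LIMIT_BYTES = 4 * 1024**3

# 459 = 367 training + 92 validation inputs from the log above; 1100 is the original data set.
# The 320x320 input size is an assumption, not a value taken from this thread.
for count in (459, 1100):
    needed = estimate_dataset_bytes(count, 320, 320)
    share = needed / CONTAINER_LIMIT_BYTES
    print("{} images: {:.2f} GiB ({:.0%} of the limit)".format(count, needed / 1024**3, share))
```

Under these assumptions the raw pixel data grows linearly with the image count, so more than doubling the data set more than doubles that part of the footprint.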

@jumill87 - also answered on another thread, but just for people finding this thread instead: we’ve upped the memory limits for all jobs to be less stringent (they can go over memory limits without being killed immediately) and this should resolve all OOMKilled issues. We’re monitoring actively to see if any others happen and can tweak the limits if that’s the case.