Getting errors training model

Project ID: 79347

Hello, I am trying to retrain our model and now it's giving me errors like this:

Creating job… OK (ID: 2281197)

Scheduling job in cluster…
Job started
Splitting data into training and validation sets…
Splitting data into training and validation sets OK

Training model…
Training on 552 inputs, validating on 139 inputs
Building model and restoring weights for fine-tuning…
Finished restoring weights
Fine tuning…
Attached to job 2281197…
Attached to job 2281197…
Attached to job 2281197…
Attached to job 2281197…
Attached to job 2281197…
Attached to job 2281197…
Attached to job 2281197…
Attached to job 2281197…
Attached to job 2281197…
Attached to job 2281197…
Epoch 1 of 15, loss=0.338769, val_loss=0.41559038
/app/run-python-with-venv.sh: line 21: 10 Killed PATH=$PATH:/app/$VENV_NAME/.venv/bin /app/$VENV_NAME/.venv/bin/python3 -u $ARGS
Application exited with code 137 (OOMKilled)

Job failed (see above)
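For reference, exit code 137 in a log like this means the process was killed with SIGKILL (128 + 9), which is what the kernel's out-of-memory killer sends when a job exceeds its memory limit — the "OOMKilled" in the message says the same thing. A quick shell demonstration of where the number comes from:

```shell
# Exit code 137 = 128 + 9 (SIGKILL). This is how the OOM killer
# terminates a container or job that runs over its memory limit.
sleep 30 &            # start a throwaway background process
kill -9 $!            # kill it the same way the OOM killer would
wait $!               # collect its exit status
echo "exit code: $?"  # prints "exit code: 137"
```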

Any help would be appreciated. Thank you.

Ok so,

I've found out that if we get above 600 images, I get this error every time. It happened to me last time and now it's happening again. I've had to delete another 300 images that I spent about 3-4 hours labeling, and then another 2-3 hours trying to train and reconfigure the model. I don't understand what's going on, but it kinda sucks having to take all these pictures and then delete them because the model won't train. It gives me that error every time I try to add more pictures/bounding boxes to our dataset.
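For what it's worth, the symptom (fine below ~600 images, OOMKilled above) is consistent with a pipeline that holds the whole dataset in RAM, since memory then grows linearly with image count. A back-of-envelope sketch — the resolution, channel count, and dtype below are illustrative assumptions, not values from this project:

```python
# Rough peak-memory estimate for raw image tensors held in RAM.
# 320x320 RGB and float32 (4 bytes/value) are guesses; substitute
# your project's actual input size.
def batch_bytes(n_images, width, height, channels=3, bytes_per_value=4):
    """Bytes needed to hold n_images as dense float32 tensors."""
    return n_images * width * height * channels * bytes_per_value

# 600 images at 320x320 RGB as float32, before any model overhead:
print(f"{batch_bytes(600, 320, 320) / 1024**2:.0f} MB")  # 703 MB
```

Doubling the image count doubles this figure, which is why a fixed per-job memory limit shows up as a fairly sharp dataset-size threshold.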

  • Justin

Hi @jumill87,

Looks like your training went through. Do you still have the issue?

Aurelien

It went through after deleting about 300+ pictures. When we add more pictures to try to make our model more accurate, we get that error every time. I don't know what I'm doing wrong or what is happening. I had to get it working temporarily so we could test our Twilio application today. Eventually, I'm guessing the problem will occur again, as it did in the past when we had 1,100+ pictures and numerous bounding boxes.

So yes, for now it works, but when we take more pictures and add bounding boxes I'm afraid it will error again, which is why I've cloned the project so we have a working copy to run if it happens again.

Sincerely,

Justin

We have higher memory/compute time limits with our enterprise subscription.
If you’re working on an industrial use case feel free to ping me in DM.

Aurelien

Sadly it's not an industrial use case; it's a project I'm doing with some friends. We're trying to make a seat occupancy system that uses object detection to show whether chairs are occupied or not. We hate looking for empty seats/places to work, so we're trying to build a system that somehow uploads the detection results to a website where people can see how many chairs are available in a room (we have no idea how to do that, we are not web devs :frowning:).

Edge has increased our training memory and training time, and I am still getting this error when training our model. So is my guess correct that it's too many images/bounding boxes? And would you know a starting point, or have any guidance on how to upload/transfer the results to a website database of some sort? We are really lost on that last part. So far our best option is to use the Twilio SMS system to let people know how many chairs are available in a room.
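On the website question, one minimal pattern is to POST each inference result to a small web backend that stores the latest count per room, and have the web page read from that. A sketch using only the Python standard library — the URL, endpoint, and payload fields are invented for illustration, and the server side (e.g. a tiny Flask app writing to SQLite) is assumed to exist:

```python
import json
from urllib import request

def build_occupancy_request(room, free_chairs,
                            url="https://example.com/api/occupancy"):
    """Build a JSON POST carrying the latest chair count for a room.

    The endpoint and field names here are hypothetical placeholders,
    not a real API.
    """
    payload = json.dumps({"room": room, "free_chairs": free_chairs})
    return request.Request(url,
                           data=payload.encode("utf-8"),
                           headers={"Content-Type": "application/json"},
                           method="POST")

# After each detection pass you would send it, e.g.:
#   request.urlopen(build_occupancy_request("room-101", 4))
```

The same payload could just as easily feed the Twilio SMS path, so both notification channels can share one piece of counting code.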

Thanks for the reply regardless, I appreciate it.

Justin

@jumill87 - we've made the memory limits for all jobs less stringent (jobs can now go over their memory limit without being killed immediately), which should resolve these OOMKilled issues. We're monitoring actively and will tweak the limits further if more show up.

Hello, I am having the same issue:
Training model…
Training on 776 inputs, validating on 194 inputs
Attached to job 13823588…
Attached to job 13823588…
Attached to job 13823588…
Attached to job 13823588…
/app/run-python-with-venv.sh: line 17: 11 Killed /app/$VENV_NAME/.venv/bin/python3 -u $ARGS
Application exited with code 137 (OOMKilled)

Job failed (see above)