Transfer learning issue - Application exited with code 137 (OOMKilled)

Hi,

I am trying to retrain my model after uploading new data.

I’ve tried running the transfer learning phase 4 times, and I am getting the following error each time:

/app/run-python-with-venv.sh: line 21:    10 Killed                  PATH=$PATH:/app/$VENV_NAME/.venv/bin /app/$VENV_NAME/.venv/bin/python3 -u $ARGS

Application exited with code 137 (OOMKilled)

Job failed (see above)
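
From what I understand, exit code 137 is 128 + 9 (SIGKILL): the container went over its memory limit and the kernel's OOM killer terminated the Python process, rather than the script failing on its own. A minimal Python sketch (nothing Edge-Impulse-specific) of where the 137 comes from:

    import subprocess

    # A process killed with SIGKILL (signal 9) is reported by a shell as
    # exit code 128 + 9 = 137, the same code the job runner prints above.
    result = subprocess.run(["bash", "-c", "kill -9 $$"])
    print(result.returncode)  # -9: Python's encoding of "killed by signal 9"
    print(128 + 9)            # 137: how a shell reports the same death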

Would appreciate your help here :slight_smile:
Thanks!

Full training output:

 Splitting data into training and validation sets...
Splitting data into training and validation sets OK

Training on 1104 inputs, validating on 277 inputs
Training model...
Epoch 1/5
Epoch 20% done
Epoch 52% done
Epoch 85% done
35/35 - 43s - loss: 0.4939 - accuracy: 0.8995 - val_loss: 0.1280 - val_accuracy: 0.9747 - 43s/epoch - 1s/step
Epoch 2/5
Epoch 29% done

Attached to job 2194082...
Epoch 64% done
Epoch 97% done
35/35 - 39s - loss: 0.0138 - accuracy: 0.9973 - val_loss: 0.0431 - val_accuracy: 0.9856 - 39s/epoch - 1s/step
Epoch 3/5
Epoch 29% done
Epoch 61% done
Epoch 97% done
35/35 - 39s - loss: 0.0287 - accuracy: 0.9946 - val_loss: 0.1016 - val_accuracy: 0.9783 - 39s/epoch - 1s/step
Epoch 4/5
Epoch 29% done
Epoch 64% done
Epoch 100% done
35/35 - 39s - loss: 0.0116 - accuracy: 0.9964 - val_loss: 0.0608 - val_accuracy: 0.9783 - 39s/epoch - 1s/step
Epoch 5/5
Epoch 29% done
Epoch 64% done
Epoch 100% done
35/35 - 39s - loss: 0.0028 - accuracy: 1.0000 - val_loss: 0.0620 - val_accuracy: 0.9819 - 39s/epoch - 1s/step

Initial training done.
Fine-tuning best model for 10 epochs...
Epoch 1/10
Epoch 20% done
Epoch 52% done
Epoch 85% done
35/35 - 42s - loss: 0.0063 - accuracy: 0.9982 - val_loss: 0.0755 - val_accuracy: 0.9783 - 42s/epoch - 1s/step
Epoch 2/10
Epoch 29% done
Epoch 64% done
Epoch 100% done
35/35 - 39s - loss: 9.9456e-04 - accuracy: 0.9991 - val_loss: 0.0749 - val_accuracy: 0.9783 - 39s/epoch - 1s/step
Epoch 3/10
Epoch 29% done
Epoch 64% done
Epoch 100% done
35/35 - 39s - loss: 6.2040e-05 - accuracy: 1.0000 - val_loss: 0.0638 - val_accuracy: 0.9819 - 39s/epoch - 1s/step
Epoch 4/10
Epoch 29% done
Epoch 61% done
Epoch 97% done
35/35 - 39s - loss: 4.1405e-05 - accuracy: 1.0000 - val_loss: 0.0591 - val_accuracy: 0.9819 - 39s/epoch - 1s/step
Epoch 5/10
Epoch 29% done
Epoch 61% done
Epoch 97% done
35/35 - 39s - loss: 8.1886e-05 - accuracy: 1.0000 - val_loss: 0.0612 - val_accuracy: 0.9783 - 39s/epoch - 1s/step
Epoch 6/10
Epoch 29% done
Epoch 61% done
Epoch 97% done
35/35 - 39s - loss: 3.0306e-04 - accuracy: 1.0000 - val_loss: 0.0757 - val_accuracy: 0.9783 - 39s/epoch - 1s/step
Epoch 7/10
Epoch 29% done
Epoch 61% done
Epoch 97% done
35/35 - 39s - loss: 1.6120e-05 - accuracy: 1.0000 - val_loss: 0.0570 - val_accuracy: 0.9783 - 39s/epoch - 1s/step
Epoch 8/10
Epoch 29% done
Epoch 61% done
Epoch 97% done
35/35 - 39s - loss: 4.3634e-05 - accuracy: 1.0000 - val_loss: 0.0512 - val_accuracy: 0.9783 - 39s/epoch - 1s/step
Epoch 9/10
Epoch 29% done
Epoch 64% done
Epoch 100% done
35/35 - 39s - loss: 6.4518e-05 - accuracy: 1.0000 - val_loss: 0.0471 - val_accuracy: 0.9856 - 39s/epoch - 1s/step
Epoch 10/10
Epoch 29% done
Epoch 61% done
Epoch 97% done
35/35 - 40s - loss: 2.7638e-05 - accuracy: 1.0000 - val_loss: 0.0480 - val_accuracy: 0.9856 - 40s/epoch - 1s/step
Finished training

Saving best performing model...
Still saving model...
Still saving model...
Still saving model...
Still saving model...
Converting TensorFlow Lite float32 model...
Attached to job 2194082...
Converting TensorFlow Lite int8 quantized model...
Attached to job 2194082...
Attached to job 2194082...
Calculating performance metrics...
Calculating inferencing time...
Calculating inferencing time OK
Profiling float32 model...
Profiling 34% done
Profiling 68% done
Profiling float32 model (tflite)...
/app/run-python-with-venv.sh: line 21:    10 Killed                  PATH=$PATH:/app/$VENV_NAME/.venv/bin /app/$VENV_NAME/.venv/bin/python3 -u $ARGS

Application exited with code 137 (OOMKilled)

Job failed (see above)

Hi @OfirSagi,

Can you please provide me with your project ID so we can take a look?

Thanks!

Hi Jenny,

the project ID is 68230.

Thank you!

Hi @jenny,

An update: we removed ~200 pictures from the dataset and ran the transfer learning again, but we are still getting the same OOMKilled error.

We also created a new project with only half the pictures (project ID 82611) and are getting the same error there. In the past we were able to train the model with more than 1,000 pictures; now we are getting an error with only 600.

Can you please help us?
Appreciate it!

Hi @OfirSagi,

Thank you for your patience; I am looking at your project now.

Thanks @jenny,

Waiting for your reply :slight_smile:


Hi @OfirSagi,

I have alerted our engineering team to the issue. In the meantime, I have fixed your project by increasing the training job’s memory allocation, and I am re-training your project now.

Please let me know if you have any further questions!
– Jenny

@OfirSagi, your project has now trained successfully and is ready for deployment.

Thank you for your patience!

Hi @jenny,
Thanks for your reply!

I can see you trained project ID 82611 - thank you.

I’ve now tried re-training it with the ‘data augmentation’ option turned on, but got the same error:
/app/run-python-with-venv.sh: line 21:    10 Killed                  PATH=$PATH:/app/$VENV_NAME/.venv/bin /app/$VENV_NAME/.venv/bin/python3 -u $ARGS

Application exited with code 137 (OOMKilled)

Job failed (see above)

Job ID was 2214269 if that helps.

Would appreciate further help here :slight_smile:

@OfirSagi - we’ve made the memory limits for all jobs less stringent (jobs can now go over their memory limit without being killed immediately), which should resolve all OOMKilled issues. We’re actively monitoring for any further occurrences and can tweak the limits if needed.
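
To make that concrete (a simplified sketch, not our actual job infrastructure): under a strict cap, the first allocation past the limit fails outright, which is roughly what an OOMKill looked like from inside a job. Simulating a hard cap with Python's resource module on Linux:

    import resource

    # Illustrative only: cap this process's address space at ~512 MiB, the
    # way a strict container limit would. (A real container OOM kill arrives
    # as SIGKILL; RLIMIT_AS instead surfaces as a MemoryError.)
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (512 * 1024 * 1024, hard))

    try:
        buf = bytearray(1024 * 1024 * 1024)  # 1 GiB: past the cap
    except MemoryError:
        print("allocation refused: over the (simulated) limit")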

@janjongboom I am having the same issue. My project ID is 95737. Can you please increase the limit?

Hello @naveen,

As you are part of the Edge Impulse Expert Network, you can transfer ownership of your project to the Edge Impulse Expert organization; you will then be able to run your jobs with enterprise performance.

Let me know if that fixes your issue.

Regards,

Louis

Hi @louis,

My project was created under the Edge Impulse Experts organization and Enterprise Performance is enabled. The training completes without any issues, but the OOMKilled exception occurs while profiling, so I can’t see the final confusion matrix and visualization. Running the tests and deployment also throws the same error.

0% done
/app/run-python-with-venv.sh: line 21:    10 Killed                  PATH=$PATH:/app/$VENV_NAME/.venv/bin /app/$VENV_NAME/.venv/bin/python3 -u $ARGS

Application exited with code 137 (OOMKilled)

Best,
Naveen

I have reduced the dataset to a tenth of its size, but I still get the same error.

Hello @naveen,

I can see that your model has been successfully trained now.

@dansitu, do we still have that memory-leak issue in the MobileNetV2 SSD transfer learning when the dataset is too big? I can’t remember.
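
If it helps to check on your side in the meantime, a small Keras callback along these lines (MemoryLogger is just an illustrative name, not part of our SDK) prints peak resident memory per epoch, so a leak shows up as steady growth:

    import resource
    import tensorflow as tf

    class MemoryLogger(tf.keras.callbacks.Callback):
        # Print peak resident set size after each epoch; a leak shows up
        # as monotonic growth from one epoch to the next.
        def on_epoch_end(self, epoch, logs=None):
            peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            # ru_maxrss is KiB on Linux (bytes on macOS)
            print(f"epoch {epoch + 1}: peak RSS ~{peak_kib / 1024:.0f} MiB")

    # usage: model.fit(x, y, epochs=10, callbacks=[MemoryLogger()])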

I have also asked our core engineering team to have a look; we should not have OOM errors in Studio. See here: Application exited with code 137 (OOMKilled) - #35 by janjongboom

Regards,

Louis


Sorry for the late reply; I was travelling and just got back to my computer! Thank you for the bug report. The OOM error during training has been fixed, and I haven’t seen this happen during profiling before. Sounds like the core engineering team is on it!

Warmly,
Dan
