Application exited with code 137 (OOMKilled) - again

I am attempting to train a classification model with transfer learning.

I am getting the “Application exited with code 137 (OOMKilled)” error, and am trying to understand why.

This model classifies the Dobble card deck.
I have 56 classes, and am using MobileNetV2 with a 160x160 input.
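For context, the transfer-learning setup I have in mind is along these lines (a rough Keras sketch of my local workflow, not the exact Edge Impulse training pipeline; the head layers and hyperparameters are assumptions):

```python
import tensorflow as tf

NUM_CLASSES = 56
INPUT_SHAPE = (160, 160, 3)  # MobileNetV2 at 160x160 RGB

# Pre-trained MobileNetV2 backbone, frozen for the initial training phase
base = tf.keras.applications.MobileNetV2(
    input_shape=INPUT_SHAPE, include_top=False, weights='imagenet')
base.trainable = False

# Small classification head on top of the frozen features
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',  # assumes one-hot labels
    metrics=['accuracy'])
```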

Here is a more verbose version of the output:

...
Finished training

Saving best performing model...
Still saving model...
Still saving model...
Still saving model...
Still saving model...
Converting TensorFlow Lite float32 model...
Attached to job 2441794...
Converting TensorFlow Lite int8 quantized model...
Attached to job 2441794...
Calculating performance metrics...
Calculating inferencing time...
Calculating inferencing time OK
Profiling float32 model...
Profiling float32 model (tflite)...
/app/run-python-with-venv.sh: line 21:    10 Killed                  PATH=$PATH:/app/$VENV_NAME/.venv/bin /app/$VENV_NAME/.venv/bin/python3 -u $ARGS
Application exited with code 137 (OOMKilled)

Job failed (see above)

Project ID : 90777

Just an informative note: this model is over-fitting as it stands, so it needs data augmentation …
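For what it is worth, by “data augmentation” I mean simple geometric and photometric jitter on the training images, roughly like this (a minimal Keras sketch, assuming TF 2.6+ preprocessing layers; the exact transforms and ranges are placeholders, not what Edge Impulse applies):

```python
import tensorflow as tf

# Random flips/rotations/zoom/contrast, applied only at training time
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal_and_vertical'),
    tf.keras.layers.RandomRotation(0.2),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.1),
])
```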

Mario.

Hi @AlbertaBeef ,

56 Classes & 560 images with augmentation on seems to be a big task for the current available compute instances. Could you at least try reducing the number of classes and try again?

Thanks,
Clinton

Clinton,
Thank you for your response …

This training takes only 10 minutes on a normal computer (without GPU) using Keras.
I am struggling to understand why this does not work in the cloud with Edge Impulse.

This error occurs whether or not I have “data augmentation” enabled.
Reducing the number of classes (i.e. cards) would render this model useless for the card game.
If this is truly a limitation of Edge Impulse, then I will find another use case to try.
Before I do that, what is the maximum number of classes supported by Edge Impulse?

You can see in the log that I shared previously that the training and fine-tuning have completed.
The error seems to occur during the profiling step:

Profiling float32 model (tflite)...
/app/run-python-with-venv.sh: line 21:    10 Killed                  PATH=$PATH:/app/$VENV_NAME/.venv/bin /app/$VENV_NAME/.venv/bin/python3 -u $ARGS
Application exited with code 137 (OOMKilled)

Job failed (see above)

Is there a way to bypass this profiling step?

Thanks in advance,
Mario.

Hi @AlbertaBeef,

I have just increased the compute resources for your project. Could you kindly try again to see whether you still get the same error?

Regarding the maximum number of classes, there isn’t really a theoretical limit, but increasing the number of classes comes with tradeoffs: you will need more data to train your model, which can run into time or compute resource limits, and more classes can also make your model more complex and larger, so it might not ultimately fit on your target device.

Thanks,
Clinton


Clinton,
Thank you, the training phase passed, and the (poor) results are as expected.
If I understand correctly, the profiling step is the most demanding in terms of compute resources?

The classification of this dataset is something that I have already solved:
The Dobble Challenge - Hackster.io
Training The Dobble Challenge - Hackster.io

I wanted to attempt to achieve the same results with Edge Impulse.
As is, the model is overfitting and needs more data, which I have implemented in a separate project (project ID: 89394).
This project, however, runs out of time, as you already predicted.

Regards,
Mario.

Hi @AlbertaBeef ,

Based on the failed jobs reported here on the forum, feature generation and the profiling step are the most common failure points, so I would say they are the ones that consume the most compute.

I have just increased your time limit for project 89394. Let me know if this helps.

Thanks,
Clinton


@oduor_c ,
Thank you very much 🙂
This allowed the training and fine-tuning to complete, achieving 99% accuracy! However, the job now crashes at the profiling step (like the previous project).
Regards,
Mario.

Hi @AlbertaBeef ,

Great performance! To solve the crash at the profiling step, I have just increased the compute resources for your project. You can now retrain your model; let me know if you face any other issues.

Thanks,
Clinton


@oduor_c,
With the increased compute resources, the training completed with the following results:

  • float32: 99.8% accuracy
  • int8: 97.8% accuracy

With my test data, which was captured from a live card game (under completely different conditions from the training data), I am getting:

  • 93.2% accuracy

Is this using the int8 model or the float32 model?
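If it helps, I can also check locally by running both exported TFLite files through the TFLite interpreter on my test set, roughly like this (a sketch only; the file names are placeholders, and it assumes each image is already resized to 160x160 and normalized the same way as during training):

```python
import numpy as np
import tensorflow as tf

def classify(model_path, image):
    """Run one preprocessed 160x160 RGB image through a TFLite model
    and return the predicted class index."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    x = image.astype(np.float32)
    if inp['dtype'] == np.int8:
        # int8-quantized model: map the float input into the int8 range
        scale, zero_point = inp['quantization']
        x = np.round(x / scale + zero_point).astype(np.int8)

    interpreter.set_tensor(inp['index'], x[np.newaxis, ...])
    interpreter.invoke()
    return int(np.argmax(interpreter.get_tensor(out['index'])[0]))

# Hypothetical file names for the exported float32 and int8 models:
# classify('model_float32.lite', img), classify('model_int8.lite', img)
```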
Best Regards,
Mario.

Hi @AlbertaBeef,

This is the float32 model only.

Aurelien


@aurel,
Thank you for the clarification.
Mario.


Hi @AlbertaBeef - we’ve upped the memory limits for all jobs to be less stringent (they can go over memory limits without being killed immediately) and this should resolve all OOMKilled issues. We’re monitoring actively to see if any others happen and can tweak the limits if that’s the case.
