Application exited with code 2 (OOMKilled) issue

TechDevTom · June 29, 2021, 7:15pm

Hello all!

I’ve run into a shiny new issue.

I’ve been retraining my model with some of the data my camera trap is giving me, and downloading the latest modelfile.eim to my Raspberry Pi 4 has been going smoothly. But tonight, when I tried to download the latest modelfile.eim to my device using the command “sudo edge-impulse-linux-runner --download modelfile.eim”, I got the following output:

Edge Impulse Linux runner v1.2.6

[RUN] Downloading model...
[BLD] Created build job with ID 1035434
[BLD] Writing templates OK
[BLD] Scheduling job in cluster...
[BLD] Exporting TensorFlow Lite model...
[BLD] Job started
[BLD] Exporting TensorFlow Lite model OK
[BLD]
[BLD] Removing clutter...
[BLD] Removing clutter OK
[BLD]
[BLD] Copying output...
[BLD] Copying output OK
[BLD]
[BLD] Job started
[BLD] Building binary...
[BLD] arm-linux-gnueabihf-g++ -MD -Wall -g -Wno-strict-aliasing -I. -Isource -Imodel-parameters -Itflite-model -Ithird_party/ -Os -DNDEBUG -g -DEI_CLASSIFIER_USE_FULL_TFLITE=1 -Iedge-impulse-sdk/tensorflow-lite -std=c++14 -c source/main.cpp -o source/main.o
[BLD] arm-linux-gnueabihf-g++ -MD -Wall -g -Wno-strict-aliasing -I. -Isource -Imodel-parameters -Itflite-model -Ithird_party/ -Os -DNDEBUG -g -DEI_CLASSIFIER_USE_FULL_TFLITE=1 -Iedge-impulse-sdk/tensorflow-lite -std=c++14 -c tflite-model/tflite-trained.cpp -o tflite-model/tflite-trained.o
[BLD] arm-linux-gnueabihf-g++: internal compiler error: Killed (program cc1plus)
[BLD] Please submit a full bug report,
[BLD] with preprocessed source if appropriate.
[BLD] See <file:///usr/share/doc/gcc-6/README.Bugs> for instructions.
[BLD] make: *** [source/main.o] Error 4
[BLD] Makefile:67: recipe for target 'source/main.o' failed
[BLD] Application exited with code 2 (OOMKilled)
[RUN] Failed to run impulse Failed to build binary

Can anyone help me with this please? I’m not sure what I should be doing next, as I’m not aware that I’ve done anything differently and everything’s been working nicely so far!

Thanks in advance for any help. I’m having great success using Edge Impulse otherwise.

aurel · June 30, 2021, 8:06am

Hi Tom,

Could you share your project ID?

Thanks,
Aurelien

janjongboom · June 30, 2021, 8:16am

@TechDevTom Apologies! We restricted memory limits for some deployment targets, and this has broken object detection on some Linux targets. We’re reverting this now, and it should be fixed within an hour or so.

TechDevTom · June 30, 2021, 8:45am

@aurel seems like @janjongboom is on the case!

Could I ask why this was done? Are some projects taking up too much memory, and if so, is there anything we as the people using Edge Impulse can do to lower our memory usage? Is it in relation to the amount of data we’re processing for our models?

janjongboom · June 30, 2021, 9:07am

Hi @TechDevTom this is now released.

No, definitely not on the user’s side, and there’s no reason to make smaller projects - we wanted to make the resource allocation more explicit in our code base, and by accident halved the memory we allocated for deployment blocks.

TechDevTom · June 30, 2021, 9:10am

Cheers, fixed on my end, hoorah! Back to gathering data and hoping my model now does not recognise plant pots and plants as animals/birds.

Ah, I see. These things happen, I know all too well.

TechDevTom · August 12, 2021, 8:36pm

Hey @janjongboom, sorry to necro an older thread, but I’m having an issue where I’m getting another OOMKilled message when retraining my object detection model:

Application exited with code 137 (OOMKilled)

Is this a memory issue again, or should I open up a new thread and ask for help?
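
For readers comparing the two error codes in this thread: on Linux, an exit status above 128 conventionally means the process was ended by a signal, with the signal number being the status minus 128. So 137 corresponds to signal 9 (SIGKILL), which is what the kernel sends when a process is killed for exceeding a memory limit, whereas the earlier code 2 is an ordinary non-zero exit status. A minimal Python sketch of that decoding is below; the helper name describe_exit_code is made up for illustration and is not part of any Edge Impulse tool.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the common 128 + signal convention.

    Hypothetical helper for illustration only.
    """
    if code > 128:
        signum = code - 128
        try:
            # Map the number to a name such as SIGKILL where the platform knows it.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))  # the training job's status from this post
print(describe_exit_code(2))    # the build job's status from the first post
```

Either status can appear next to the OOMKilled label in the job output, since that label is reported by the build environment rather than derived from the number alone.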

aurel · August 13, 2021, 9:13am

Hi @TechDevTom,

Could you give it another try? I enabled our enterprise performance feature as you have a large dataset.
Let us know if that helps,

Aurelien

TechDevTom · August 13, 2021, 9:31am

@aurel I’ve just set it off now and will see how it goes.

Regarding the enterprise performance feature, is that something that I’ll need to pay for? What are the limits on dataset size or image count with normal performance versus enterprise performance?

TechDevTom · August 13, 2021, 10:27am

@aurel Sorry but no luck! Same error, just after it starts on the first epoch.

TechDevTom · August 17, 2021, 12:26pm

Hey @aurel, sorry to bother you again, but I’m still not having any success building my model. Have you got any suggestions that might help?

aurel · August 17, 2021, 2:14pm

Hi @TechDevTom,

Sorry for the late reply. This is actually related to a TensorFlow memory leak (https://github.com/tensorflow/models/issues/9981); we are following up with the team. In the meantime, the best solution is to reduce the size of the dataset; 100 images should already work well.
We’ll keep you posted.
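
One possible way to trim a local image folder down to about 100 images before re-uploading is sketched below in Python. This is only an illustration under stated assumptions: the function name sample_images and the folder names in the usage comment are hypothetical, it assumes the images sit in a single flat folder, and it does not carry over any bounding box labels, which would need to be handled separately.

```python
import random
import shutil
from pathlib import Path


def sample_images(src_dir, dst_dir, n=100, seed=0,
                  exts=(".jpg", ".jpeg", ".png")):
    """Copy a random sample of up to n images from src_dir into dst_dir.

    Hypothetical helper: returns the sorted file names that were copied.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    # Sort first so the same seed always picks the same files.
    images = sorted(
        p for p in src.iterdir()
        if p.is_file() and p.suffix.lower() in exts
    )
    rng = random.Random(seed)
    chosen = rng.sample(images, min(n, len(images)))
    dst.mkdir(parents=True, exist_ok=True)
    for p in chosen:
        shutil.copy2(p, dst / p.name)
    return sorted(p.name for p in chosen)


# Example usage (folder names are placeholders):
# kept = sample_images("camera_trap_images", "camera_trap_subset", n=100)
```

A fixed seed keeps the subset reproducible, so a later retraining run can use the same 100 images for comparison.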

Aurelien

TechDevTom · August 17, 2021, 3:23pm

Hey @aurel, no worries. And I see, well, I guess I’ll just wait until it’s fixed. I’m afraid for my application 100 images wasn’t working so well, but I may take what I’ve learned and create a new project to see if some of the newer images I’m taking can create a more efficient model.

Cheers!

janjongboom · April 19, 2022, 5:04pm

@TechDevTom - we’ve upped the memory limits for all jobs to be less stringent (they can go over memory limits without being killed immediately) and this should resolve all OOMKilled issues. We’re monitoring actively to see if any others happen and can tweak the limits if that’s the case.

TechDevTom · May 5, 2022, 6:33pm

Awesome, have given it a blast and I’m seeing no more issues, cheers @janjongboom!