Application exited with code 137 (OOMKilled)

HShroff · October 27, 2021, 4:53am

@janjongboom sir, I am also having the same issue.
Can you please help me too?

My project ID is 53685

janjongboom · October 27, 2021, 8:45am

Quick fix (while we fix this for real):

1. Go to Dashboard > Export, and export with ‘Retain crops and splits’ enabled.
2. Create a new project, and upload the exported files to the project.
3. Done.

This will have all the smaller split-up files already (rather than trying to crop the large audio files in the DSP process).
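
For anyone who would rather prepare the smaller files locally before uploading, the same idea can be sketched in Python. This is only an illustration under stated assumptions, not the code the export uses: the function name `split_wav`, the 10-second chunk length, and the output file naming are all hypothetical, and it assumes uncompressed WAV input.

```python
import wave
from pathlib import Path


def split_wav(src: Path, out_dir: Path, chunk_seconds: float = 10.0) -> list[Path]:
    """Split a WAV file into consecutive chunks of at most chunk_seconds each.

    Hypothetical helper: the name, chunk length and output naming are
    assumptions made for illustration only.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        frames_per_chunk = int(reader.getframerate() * chunk_seconds)
        index = 0
        while True:
            # Read one chunk at a time so the whole file is never held in memory.
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            dst = out_dir / f"{src.stem}.{index:04d}.wav"
            with wave.open(str(dst), "wb") as writer:
                # Copy the audio format of the source file to each chunk.
                writer.setnchannels(reader.getnchannels())
                writer.setsampwidth(reader.getsampwidth())
                writer.setframerate(reader.getframerate())
                writer.writeframes(frames)
            written.append(dst)
            index += 1
    return written
```

Calling `split_wav(Path("recording.wav"), Path("chunks"))` would write `chunks/recording.0000.wav`, `chunks/recording.0001.wav`, and so on; the last chunk is simply shorter if the recording does not divide evenly.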

HShroff · October 27, 2021, 1:06pm

Thanks sir. It is working well now!

MARSK · January 14, 2022, 11:58am

Hello there, (It’s my first post on the platform)

I wanted to know if there is a way to run the process using my local machine’s resources. I was facing a similar issue, and that was while my dataset still needed to be enlarged by about 200%. So, if I could somehow get access to the code and just run it locally in, say, PyCharm, that would be great, since I also do not want to overburden your servers’ resources.

Thanks!

MARSK · January 14, 2022, 12:01pm

And sir, unlike what you mentioned earlier, the features are NOT generated in my case at all, so unfortunately I cannot train the model as of now. Is this a bug?

okosa20d · March 17, 2022, 8:22pm

Hi, I’m having the same issue (Application exited with code 137) when I train the network for 20 epochs. It works for 15 epochs though. Please, could you help me out?

louis · March 18, 2022, 9:27am

Hello @okosa20d ,

It seems that your training was successful with 20 epochs (on your project ID 79163).
I also slightly increased the performance of your training jobs. Let me know if you still have any issues.

Regards,

Louis

asma · March 22, 2022, 7:27am

Hi, I’m happy that I’m using such a great tool for tinyML. I’m really enjoying it. I first did a scratch project with about 163 images with bounding-box annotations, and I can see the model output and its classification result. The project ID is 87553, which is working fine from a workflow perspective.

Later I experimented with about 438 images with transfer learning. Initially I went with 100 epochs and got this code 137 error. I thought it was running out of the 20 minutes of allocated time, so I reduced it to 25 epochs, but I’m still facing the same error. The project ID is 88632.

Now, I see that the transfer learning is still attached to some job since yesterday. It would be a great help to know whether I’m running out of resources or what I’m doing wrong?

Thank you in advance!

louis · March 22, 2022, 8:19am

Hello @asma ,

I increased your compute resources and the compute time limit of your project.
Indeed, by default, we use containers with 4 GB of RAM to run the training jobs, which was probably too small for your project.
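
As background on the error code itself: by shell convention, an exit status above 128 means the process was terminated by a signal whose number is the status minus 128. So 137 corresponds to signal 9 (SIGKILL), which is what a process receives when it is killed for exceeding its memory limit. A minimal Python sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and not part of any Edge Impulse tooling.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a process exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name on this platform, e.g. SIGKILL for 9.
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with status {code}"


print(describe_exit_code(137))
```

Note that SIGKILL alone does not prove an out-of-memory condition; it is the "OOMKilled" label reported alongside the code that points to the memory limit.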

However, the job that is still attached seems odd. Can you try to cancel the job and run it again? It will take into account your new performance settings.

Regards,

Louis

asma · March 22, 2022, 9:41am

Thank you for the support @louis and for increasing the compute resources. I’m able to train it with 100 epochs now.

mette.lvl · March 28, 2022, 3:04pm

Hi,
I have the same issue (project ID 38986). I’m training on 2.35 hours of audio data. I tried many different window lengths to make it run, without success. Is there any solution to this yet?

louis · March 28, 2022, 7:37pm

Hello @mette.lvl,

I just updated your job limits.

Let us know how it goes.

Regards,

Louis

mette.lvl · March 29, 2022, 8:51am

Hi Louis,

Thanks a lot!! I tried a couple of times now, also logging out and back in, and I keep getting the same error. Any suggestions as to what I am doing wrong here?

louis · March 29, 2022, 2:49pm

Hello @mette.lvl ,

It seems that your MFCC page takes ages to compute even one window size. I am not sure about the reason.
Could you try to delete all the blocks in your Impulse (in the Create Impulse tab, remove them all and add them again)? Alternatively, exporting all the data in your project and importing it into a new project might work too (I’d need to increase the performance settings on your second project, so if you do that, please give me your project ID).

Let me know if the trick worked.

Regards,

Louis

mette.lvl · April 1, 2022, 12:33pm

Still no luck, Louis.
I tried deleting the entire impulse and building it up in various ways.
I also exported all of my code into a zip, which my MacBook is unable to unzip. And I cannot find any easy way to upload my existing dataset into a new project. It was quite a pain to upload the data in batches for each label originally. I’m on the verge of giving up on EdgeImpulse.

mette.lvl · April 1, 2022, 1:10pm

It seems that it is always sample #32 that fails - perhaps you can tell me which one that is and I can try and delete it from the dataset?

newmanreece · April 13, 2022, 12:36pm

Hi there,
I seem to be getting the same error but when I’m training the model.

It seems to finish training and then an issue occurs afterwards.
This is my project ID: 94279

louis · April 13, 2022, 3:08pm

Hello @newmanreece ,

I just created an internal ticket for our infra team.

Regards,

Louis

janjongboom · April 19, 2022, 5:02pm

Hi @newmanreece - we’ve upped the memory limits for all jobs to be less stringent (they can go over memory limits without being killed immediately) and this should resolve all OOMKilled issues. We’re monitoring actively to see if any others happen and can tweak the limits if that’s the case.

V_Deepa · July 26, 2022, 12:21pm

Same issue. Project ID: 122758