Training crashed after 2 days

Question/Issue:
After completing model training in just over 2 days, the conversion process from float32 to int8 ran out of memory and crashed.

Project ID:
676816

Context/Use case:
I trained my model, and the job crashed after training completed, while converting the model to the int8 data format.

Summary:
Object detection training completed after just over 2 days, but the job then ran out of memory and was killed (exit code 137, OOMKilled) while converting and profiling the int8 model, so no deployable model was produced.

Steps to Reproduce:
Click the Save and Train button on the Object Detection page.

Expected Results:
I expected to have a deployable model.

Actual Results:
I have nothing but an error that says to increase the memory used, and I don't see where I can control that.

Reproducibility:
I don't know yet; I'll let you know in 2 days when it completes or fails again.

Environment:

  • Platform: [e.g., Raspberry Pi, nRF9160 DK, etc.]
  • Build Environment Details: [e.g., Arduino IDE 1.8.19 ESP32 Core for Arduino 2.0.4]
  • OS Version: [e.g., Ubuntu 20.04, Windows 10]
  • Edge Impulse Version (Firmware): [e.g., 1.2.3]
    To find out Edge Impulse Version:
  • if you have pre-compiled firmware: run edge-impulse-run-impulse --raw and type AT+INFO. Look for Edge Impulse version in the output.
  • if you have a library deployment: inside the unarchived deployment, open model-parameters/model_metadata.h and look for EI_STUDIO_VERSION_MAJOR, EI_STUDIO_VERSION_MINOR, and EI_STUDIO_VERSION_PATCH (see the example snippet after this list)
  • Edge Impulse CLI Version: [e.g., 1.5.0]
  • Project Version: [e.g., 1.0.0]
  • Custom Blocks / Impulse Configuration: [Describe custom blocks used or impulse configuration]
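
For reference, in a library deployment the version macros in model-parameters/model_metadata.h look roughly like the excerpt below; the numeric values shown are illustrative only, not taken from this project.

/* Excerpt (illustrative) from model-parameters/model_metadata.h in an
   unarchived library deployment. The three defines together give the
   Edge Impulse Studio version, e.g. 1.2.3. The numbers below are
   placeholders, not this project's actual version. */
#define EI_STUDIO_VERSION_MAJOR    1
#define EI_STUDIO_VERSION_MINOR    2
#define EI_STUDIO_VERSION_PATCH    3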

Logs/Attachments:
Epoch    Train loss    Validation loss    Precision    Recall    F1
59       0.00707       0.00856            0.91         0.74      0.82
Finished training

Finished training

Converting TensorFlow Lite float32 model…
Converting TensorFlow Lite int8 quantized model…
Loading data for profiling…
Loading data for profiling OK

Calculating performance metrics…
Calculating inferencing time…
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Calculating performance metrics…
Calculating inferencing time…
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Calculating inferencing time OK
Calculating inferencing time OK
Calculating float32 accuracy…
Calculating float32 accuracy…
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Calculating inferencing time OK
Calculating float32 accuracy…
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Profiling 20% done
Profiling 43% done
Profiling 68% done
Profiling 89% done

/app/run-python-with-venv.sh: line 17: 24531 Killed /app/$VENV_NAME/.venv/bin/python3 -u $ARGS
Application exited with code 137 (OOMKilled)
2025-05-04T07:52:53.866600658Z Train job failed: out of memory.
2025-05-04T07:52:53.866625930Z Please contact support if this issue persists or increase train job memory in the project dashboard.
2025-05-04T07:52:53.897Z level=error logger=server msg="Failed job execution"
Error: Job 32325144 finished
at /home/node/studio/build/server/server/start-daemon.js:290:48
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2025-05-04T07:52:53.897Z level=error logger=server msg="Failed job execution"
Error: Job 32325144 finished
at /home/node/studio/build/server/server/start-daemon.js:290:48
Application exited with code 1
Job failed (see above)

Additional Information:

I tried running the build again after increasing the memory allowed for the job. It failed in a similar way to the previous attempt, but reached the failure sooner.

Here is the end of the log output from the latest attempt:
Epoch    Train loss    Validation loss    Precision    Recall    F1
59       0.00712       0.00687            0.94         0.71      0.81
Finished training

Finished training

Converting TensorFlow Lite float32 model…
Converting TensorFlow Lite int8 quantized model…
Loading data for profiling…
Loading data for profiling OK

Calculating performance metrics…
Calculating inferencing time…
Calculating inferencing time…
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Calculating inferencing time OK
Calculating float32 accuracy…
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Calculating inferencing time OK
Calculating float32 accuracy…
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Profiling 19% done
Profiling 40% done
Profiling 60% done
Profiling 84% done

/app/run-python-with-venv.sh: line 17: 6819 Killed /app/$VENV_NAME/.venv/bin/python3 -u $ARGS
Application exited with code 137 (OOMKilled)
2025-05-15T00:23:11.455327595Z Train job failed: out of memory.
2025-05-15T00:23:11.455350553Z Please contact support if this issue persists or increase train job memory in the project dashboard.
2025-05-15T00:23:11.471Z level=error logger=jobs msg="Failed to set finished on job" parentId=676816 id=32934881
error: password authentication failed for user "studio-service"
at /home/node/studio/node_modules/pg-pool/index.js:45:11
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async PgDB.finishProjectJob (/home/node/studio/build/server/shared/db/pg_db.js:4678:9)
at async /home/node/studio/build/server/server/jobs/jobs.js:635:21
2025-05-15T00:23:11.475Z level=error logger=server msg="Failed to mark job as finished" id={"kind":"project","parentId":676816,"internalId":32934880}
at /home/node/studio/node_modules/pg-pool/index.js:45:11
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async PgDB.finishProjectJob (/home/node/studio/build/server/shared/db/pg_db.js:4678:9)
2025-05-15T00:23:11.475Z level=error logger=server msg="Failed to mark job as finished" id={"kind":"project","parentId":676816,"internalId":32934880}
error: password authentication failed for user "studio-service"
at /home/node/studio/node_modules/pg-pool/index.js:45:11
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async PgDB.finishProjectJob (/home/node/studio/build/server/shared/db/pg_db.js:4678:9)
at async publishJobFinished (/home/node/studio/build/server/server/start-daemon.js:161:41)
at async /home/node/studio/build/server/server/start-daemon.js:259:37
2025-05-15T00:23:11.476Z level=error logger=server msg="Failed job execution"
Error: Job 32934881 finished
at /home/node/studio/build/server/server/start-daemon.js:290:48
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
error: password authentication failed for user "studio-service"
at async publishJobFinished (/home/node/studio/build/server/server/start-daemon.js:161:41)
at async /home/node/studio/build/server/server/start-daemon.js:259:37
2025-05-15T00:23:11.476Z level=error logger=server msg="Failed job execution"
Error: Job 32934881 finished
Application exited with code 1
Job failed (see above)

Hi @ereichPwc

Oh, this is strange looking. Thanks for reaching out; let me check with our infra and platform team.

Best

Eoin

FYI @ferjm_ei and @nabilkoroghli

Hi Eoin,
Do you have any status regarding this issue?

Thanks,
Edwin

Hi @ereichPwc

I got confirmation that the failure during your initial job was an infrastructure issue; we had approximately 2 minutes of downtime while a patch was applied and rolled back.

It was unfortunate that this coincided with your long-running job; we don't often have issues like this, but it was unlucky that your job was running during that window.

The second failure looks like you need to increase the training time limit for your job. That would mean moving to the enterprise tier, or increasing the job duration with your sales rep. Are you already engaged with solutions / sales?

Best

Eoin

Hi Eoin,
Thanks for looking into this and responding.

Regarding your comments about the second failure:

  • The training time limit is set to 1 week, so I don't see how that made it stop after 1 day.
  • I am on the enterprise tier.
  • I had doubled the memory limit before I ran the second training, which stopped in half the time.

I am currently set to the maximum memory supported by the system.