Job 31717273 shows complete, log shows it crashed

Question/Issue:
After I trained a model using YOLOv5, the dashboard claimed the job completed successfully and showed the green checkmark (job 31717273). When I tried to deploy the model, the deployment failed (job 31753618). When I checked the log for job 31717273, it showed that the job ended by raising exceptions and then reported success.

Project ID:
Project 663946
Job 31717273

Context/Use case:
My previously trained model was an object detection model using FOMO. I followed the tutorial for adding YOLOv5 without incident and created an impulse using it. I used the default YOLOv5 settings to train the model, and the dashboard claimed success. When I tried to deploy it, it failed.

Summary:
Training with the default YOLOv5 settings failed, but the dashboard didn’t recognize the failure and claimed success.

Steps to Reproduce:

  1. Add YOLOv5 to project.
  2. Create an impulse with YOLOv5.
  3. Train model with YOLOv5 defaults.
  4. Wait for training to complete (in my case about 8 hours).

Expected Results:
I expected the training to complete with a working model, or for the dashboard to recognize that the training failed.

Actual Results:
The training failed and the dashboard didn’t recognize that it failed.
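
Until the job status can be trusted, one stopgap is to scan the saved job log for a traceback before deploying. This is only a minimal sketch: it assumes the log is available as plain text and that the failure appears as a standard Python traceback. The function name and the sample log lines are hypothetical and are not taken from job 31717273.

```python
import re

# Header line Python prints at the start of an uncaught exception's traceback.
TRACEBACK_RE = re.compile(r"^\s*Traceback \(most recent call last\):", re.MULTILINE)


def log_shows_crash(log_text: str) -> bool:
    """Return True if the log text contains a Python traceback header."""
    return TRACEBACK_RE.search(log_text) is not None


# Hypothetical log excerpts, for illustration only.
clean_log = "Epoch 1/10\nEpoch 2/10\nJob completed\n"
crashed_log = (
    "Epoch 1/10\n"
    "Traceback (most recent call last):\n"
    '  File "train.py", line 1, in <module>\n'
    "RuntimeError: example failure\n"
    "Job completed\n"
)

print(log_shows_crash(clean_log))
print(log_shows_crash(crashed_log))
```

A log that ends with a success message but also contains a traceback header, like the second sample, is the case described here.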

Reproducibility:

  • [ ] Always
  • [ ] Sometimes
  • [ ] Rarely
    I have only tried this once, as it takes 8 hours to train.

I tried running the training a second time, and this time the dashboard knows it crashed, so I guess that’s an improvement, but I don’t know why it’s crashing after more than 1.5 days of training.

To review the log, see job 31761555.

Hi @ereichPwc

Thanks for reporting this. Hmmm, it can be hard to review a job after a day or so has passed, and this one is much longer. It sounds like an intermittent infrastructure issue to me, but let me confirm this with our infra team. Are you OK with me trying to run this one again?

Best

Eoin