Job 31717273 shows complete, log shows it crashed

Question/Issue:
I trained a model using YOLOv5, and the dashboard claims the job completed successfully and shows the green checkmark (job 31717273). When I tried to deploy the model, the deployment failed (job 31753618). When I checked the log for job 31717273, it shows that the job ended by raising exceptions and then reported success.

Project ID:
Project 663946
Job 31717273

Context/Use case:
My previous model was an object detection model trained with FOMO. I followed the tutorial for adding YOLOv5 without incident and created an impulse using it. I used the default YOLOv5 settings to train the model, and the dashboard claimed success. When I tried to deploy the model, the deployment failed.

Summary:
Training with the default YOLOv5 settings failed, but the dashboard didn’t recognize the failure and claimed success.

Steps to Reproduce:

  1. Add YOLOv5 to project.
  2. Create an impulse with YOLOv5.
  3. Train model with YOLOv5 defaults.
  4. Wait for training to complete (in my case about 8 hours).

Expected Results:
I expected the training to complete with a working model, or for the dashboard to recognize that the training failed.

Actual Results:
The training failed and the dashboard didn’t recognize that it failed.
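
For anyone who wants to check their own jobs for the same mismatch, here is a minimal sketch of the check I did by eye. It assumes the job log has been saved as plain text and that the crash shows up as a standard Python traceback or a line starting with an exception name. The patterns and the function names `find_crash_lines` and `job_really_succeeded` are my own illustration, not part of any Edge Impulse tool, and the real log may format errors differently.

```python
import re

# Assumed crash markers: a standard Python traceback header, or a line that
# begins with a class name ending in "Error" or "Exception".
# These are guesses about the log format, not a documented schema.
CRASH_PATTERNS = [
    re.compile(r"^Traceback \(most recent call last\):"),
    re.compile(r"^\w+(\.\w+)*(Error|Exception)\b"),
]


def find_crash_lines(log_text):
    """Return (line_number, line) pairs that look like a crash."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        stripped = line.strip()
        if any(pattern.search(stripped) for pattern in CRASH_PATTERNS):
            hits.append((number, stripped))
    return hits


def job_really_succeeded(reported_success, log_text):
    """Trust the reported status only if the log shows no crash markers."""
    return reported_success and not find_crash_lines(log_text)
```

Calling `job_really_succeeded(True, log_text)` on a log that contains a traceback returns `False`, which is the situation I hit with job 31717273: a success status on top of a log that ends in exceptions.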

Reproducibility:

  • [ ] Always
  • [ ] Sometimes
  • [ ] Rarely
    I have only tried this once, as it takes 8 hours to train.

I tried running the training a second time, and this time the dashboard knows it crashed, so I guess that’s an improvement, but I don’t know why it’s crashing after over 1.5 days of training.

To review the log, see job 31761555.