Slow training times for image models

Hi all,

We’re seeing a regression in training time for image models. We’ve pinpointed the release that introduced it and are now investigating the root cause :slight_smile:

Note that training models still works, but it is a lot slower than normal.

We’ve identified a fix and are running the test suite on staging now; it should hopefully be live in an hour or two.

Some background: for image models we save the model after every epoch, so we can find the epoch with the lowest loss and use that model (this is what happens when you see ‘Saving best performing model…’ during training). The save format was recently changed from HDF5 to the TensorFlow SavedModel format, which is a lot slower to write. Since this runs on every epoch, we were spending most of the training cycles just saving models. We’ve reverted image models back to HDF5.
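
For anyone curious, here is a minimal sketch of that kind of per-epoch "keep the best model" checkpointing in Keras. The model, data, and file names are purely illustrative (not our actual pipeline); the point is that the format the checkpoint is written in depends on the file path:

```python
# Illustrative sketch only: per-epoch checkpointing of the best model in Keras.
import tensorflow as tf

# Toy image model, just so the example is self-contained.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# An ".h5" path makes Keras write the checkpoint in the (fast) HDF5 format;
# a plain directory path would use the TensorFlow SavedModel format instead,
# which is much slower to write and would run once per epoch here.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5",        # HDF5 checkpoint file (hypothetical name)
    monitor="val_loss",     # keep only the epoch with the lowest loss
    save_best_only=True,
)

# Hooked into training like this (datasets omitted here):
# model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=[checkpoint])
```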

This change is now live in production :slight_smile: