I created a version of one of my active projects and restored it multiple times so that I have several copies of the project in parallel. For each project, I retrained the model as suggested by the interface and then, without making any other changes, ran Model Testing. I noticed that each restored project ends up with a different (and lower) final accuracy. Why is this happening, and how can I avoid it? I would like the projects to start from the same conditions so that I can then compare their accuracy trends.
Could you share a project ID so I can have a look?
My first suggestion would be to increase the number of epochs.
As the weights are initialized randomly, this can in theory happen when the number of training cycles (epochs) is low. There is an illustration of this just below.
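To illustrate the point (a generic Keras sketch, not Edge Impulse's training code): fixing the initializer seed removes this particular source of run-to-run variation.

```python
# Sketch: with an explicit initializer seed, two freshly built layers start
# from identical weights; without a seed, each build starts from different weights.
import numpy as np
import tensorflow as tf

def make_kernel(seed=None):
    init = tf.keras.initializers.GlorotUniform(seed=seed)
    layer = tf.keras.layers.Dense(16, kernel_initializer=init)
    layer.build(input_shape=(None, 8))  # materialize the weights
    return layer.get_weights()[0]       # kernel matrix, shape (8, 16)

w1, w2 = make_kernel(seed=42), make_kernel(seed=42)
print(np.allclose(w1, w2))   # True: same seed, same initial weights
w3, w4 = make_kernel(), make_kernel()
print(np.allclose(w3, w4))   # almost certainly False: unseeded builds differ
```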
Also, can you make sure that:
- the confidence threshold in Model Testing is the same across projects,
- you are not generating a new split between training and testing data for each project,
- you are comparing the same model version (quantized vs. float32).
Thank you for using Edge Impulse! This is a known issue: the order of the samples changes each time a project is restored, which results in a slightly different training run.
While we work on a fix, one way to work around the issue is to export your data from the original project manually, then re-upload it to a new project using the uploader’s --concurrency flag to set the concurrency to 1, as shown in Uploader - Edge Impulse Documentation.
Setting concurrency to 1 will ensure the samples are uploaded in their original order.
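For reference, the command would look roughly like this (a minimal sketch assuming the standard edge-impulse-uploader CLI; the paths and --category value are placeholders for your own exported folders):

```bash
# Re-upload the exported data one file at a time so the original sample order
# is preserved; repeat with --category testing for the testing folder.
edge-impulse-uploader --category training --concurrency 1 exported-data/training/*
```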
@louis I haven’t modified the confidence threshold in Model Testing, nor performed a split between training and testing data, and I am sure I am comparing the same model version.
Thanks for the tip, @dansitu. I tried it, but I still end up with a different final accuracy; it’s likely due to a different training process resulting from a different weight initialization, as @louis suggested.
In this situation, if I proceed with training the projects in parallel (modifying the networks independently and comparing them), am I making a methodological error, or are the networks and their respective results still comparable? Keeping in mind, of course, that the dataset is identical across the projects.
Sorry to hear you are still having the issue. We actually use a fixed seed for random number generation, so the initialized weights should be the same between runs. It seems like there is some underlying issue here, so our team will take a deeper look.
In answer to your question: yes, it’s reasonable to compare models despite some minor variation, with the caveat that you won’t be able to reason about differences in performance that fall within the natural range of variation between training runs.
To fully account for variation, even with properly deterministic training, you would perform cross-validation: train multiple models with different train/test splits, random seeds, and so on, which lets you quantify how much performance varies with these properties. In practice, with an adequately sized dataset the differences should be fairly minimal, and cross-validation can be quite slow, so we don’t recommend it for most projects.
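For anyone who does want to go down that route, here is a rough sketch of k-fold cross-validation outside of Edge Impulse, assuming you have already extracted your features X and integer labels y as NumPy arrays (it uses scikit-learn and Keras and is not part of the Studio workflow):

```python
# Generic k-fold cross-validation sketch (not Edge Impulse specific).
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

def build_model(input_dim, num_classes, seed):
    tf.random.set_seed(seed)  # fix the seed so each fold's run is repeatable
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def cross_validate(X, y, num_classes, n_splits=5):
    scores = []
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(kfold.split(X, y)):
        model = build_model(X.shape[1], num_classes, seed=fold)
        model.fit(X[train_idx], y[train_idx], epochs=30, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(acc)
    # The spread across folds shows the natural run-to-run variation.
    return np.mean(scores), np.std(scores)
```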
I will keep you updated with our investigation into why you are seeing differences between training runs.
We have created a fix for the sample ordering issue when restoring projects; it should be deployed over the next few days.
Aside from that, I wasn’t able to reproduce the discrepancy you are seeing. However, one thing you might try is setting ENSURE_DETERMINISM = True in your Expert mode code. This will prevent any variation between training runs, at a small cost to training speed and potentially model accuracy.
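For context, a minimal sketch of where that flag sits in an Expert mode (Keras) script; the explicit seeding lines below are only illustrative of what deterministic training generally involves, not Edge Impulse’s actual internals:

```python
# Expert mode (Keras) training script: enabling deterministic training.
# ENSURE_DETERMINISM is read by the training pipeline; the manual seeding below
# is illustrative only and is not required when the flag is set.
import random
import numpy as np
import tensorflow as tf

ENSURE_DETERMINISM = True  # prevents run-to-run variation, may train a bit slower

# Illustrative manual equivalent: fixed seeds plus deterministic kernels.
random.seed(1)
np.random.seed(1)
tf.random.set_seed(1)
tf.config.experimental.enable_op_determinism()  # TensorFlow 2.9+
```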