Different accuracy with Restore

MARIAGG · January 3, 2024, 10:09am

Hello everyone,

I’ve performed versioning on one of my active projects and restored it multiple times to have several projects in parallel. For each project, I retrained as suggested by the interface and then, without making any changes, conducted model testing. I noticed that each restored project had a different (and lower) final accuracy. Why is that happening? How can I avoid it? I would like my projects to start from the same conditions to then compare the accuracy trends.

Thanks in advance to anyone who responds.

Maria

louis · January 3, 2024, 3:25pm

Hello @MARIAGG,

Could you share a project ID so I can have a look?
My first suggestion would be to increase the number of training cycles (epochs).
As the weights are initialized randomly, this can in theory happen when you have a low number of training cycles.

Also, can you make sure:

The confidence threshold in Model Testing is the same
You don’t perform a split between your training and testing data for each project
You compare the same model version (quantized vs float32)
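
To see why the confidence threshold matters when comparing projects, here is a minimal Python sketch. It assumes that a prediction whose top score falls below the threshold is scored as "uncertain" and counted as not correct. The function name and the sample scores are hypothetical and are not taken from Edge Impulse's code.

```python
def accuracy_with_threshold(probabilities, labels, threshold):
    """Fraction of samples whose top class is correct AND meets the threshold.

    Assumption: a prediction below the threshold is treated as "uncertain"
    and therefore counted as not correct.
    """
    correct = 0
    for probs, label in zip(probabilities, labels):
        top_class = max(range(len(probs)), key=lambda i: probs[i])
        if probs[top_class] >= threshold and top_class == label:
            correct += 1
    return correct / len(labels)


# Hypothetical softmax outputs for four samples and their true labels.
probabilities = [[0.9, 0.1], [0.55, 0.45], [0.3, 0.7], [0.2, 0.8]]
labels = [0, 0, 1, 0]

print(accuracy_with_threshold(probabilities, labels, 0.5))
print(accuracy_with_threshold(probabilities, labels, 0.6))
```

The same predictions give a lower accuracy at the higher threshold, so two projects are only comparable when the threshold is identical.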

Best,

Louis

dansitu · January 3, 2024, 5:35pm

Hi @MARIAGG,

Thank you for using Edge Impulse! This is a known issue: the order of samples is changing each time we restore the project, resulting in a slightly different training run.

While we work on a fix, one way to work around the issue is to export your data from the original project manually, then re-upload it to a new project using the uploader’s --concurrency flag to set concurrency to 1, as described on the Uploader page of the Edge Impulse documentation.

Setting concurrency to 1 will ensure the samples are uploaded in their original order.
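
As a rough illustration of why concurrency affects ordering, the Python sketch below simulates uploads with a thread pool. It is not the uploader's actual implementation: the function `simulated_upload`, the file names, and the random delays are all assumptions made for the example.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_upload(samples, concurrency):
    """Return the order in which the simulated uploads finish."""
    finished = []
    lock = threading.Lock()

    def upload(name):
        # Random delay stands in for variable network latency.
        time.sleep(random.uniform(0.0, 0.01))
        with lock:
            finished.append(name)

    # Leaving the `with` block waits for all submitted uploads to complete.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for name in samples:
            pool.submit(upload, name)
    return finished


# Hypothetical file names.
samples = [f"sample_{i}.wav" for i in range(8)]

print(simulated_upload(samples, concurrency=1))  # same order as `samples`
print(simulated_upload(samples, concurrency=4))  # order may differ between runs
```

With a single worker, each upload finishes before the next one starts, so the original order is preserved. With several workers, whichever upload finishes first is recorded first.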

Warmly,
Dan

MARIAGG · January 5, 2024, 12:41pm

Hello everyone and thank you for your responses.

@louis I haven’t modified the confidence threshold in Model Testing, nor performed a split between training and testing data, and I am sure I am comparing the same model version.

Thanks for the tip, @dansitu. I tried it, but I still end up with a different final accuracy. This is likely due to a different training process resulting from a different weight initialization, as suggested by louis.

In this situation, if I proceed with parallel training (thus modifying the networks simultaneously and comparing them), am I making a methodological error, or are the networks and their respective results still comparable? Bear in mind, of course, that the dataset is identical across the various projects.

Thanks again,
Maria

dansitu · January 5, 2024, 2:48pm

Hi Maria,

Sorry to hear you are still having the issue. We actually use a fixed seed for random number generation, so the initialized weights should be the same between runs. It seems like there is some underlying issue here, so our team will take a deeper look.
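
To illustrate how a fixed seed and sample order interact, here is a small, self-contained Python sketch. It uses a toy one-parameter model trained with per-sample gradient descent, not the actual Edge Impulse training code. The dataset, the `train` function, and the hyperparameters are hypothetical.

```python
import random


def train(samples, seed=42, epochs=3, learning_rate=0.1):
    """Fit y = w * x with per-sample SGD and return the final weight."""
    # A fixed seed makes the initial weight identical on every call.
    w = random.Random(seed).uniform(-1.0, 1.0)
    for _ in range(epochs):
        for x, y in samples:
            gradient = 2.0 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= learning_rate * gradient
    return w


# Hypothetical toy dataset; the points do not lie on one line, so order matters.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]

same_order_a = train(data)
same_order_b = train(data)
reordered = train(list(reversed(data)))

print(same_order_a == same_order_b)  # identical seed and order
print(same_order_a == reordered)     # identical seed, different order
```

With the same seed and the same sample order, the two runs end at the same weight. Changing only the order of the samples gives a different final weight, even though the initial weight is unchanged.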

In answer to your question: yes, it’s reasonable to compare between models despite some minor variations—with the caveat that you won’t be able to reason about differences in performance that fall within the natural range of variation between different models.

To fully account for variation, even with properly deterministic training, you would perform cross-validation: train multiple models with different train/test splits, random seeds, etc. This would allow you to account for variations in performance due to changes in these properties. However, in practice, with an adequately sized dataset the differences should be fairly minimal. Cross-validation can be quite slow, so we don’t recommend it for most projects.
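
One simple way to estimate the "natural range of variation" is to train the same configuration several times and look at the spread of the results. The Python sketch below uses made-up accuracy values and a rough two-standard-deviation rule of thumb. Both are assumptions for illustration, not a recommendation from Edge Impulse and not a formal statistical test.

```python
import statistics

# Hypothetical test accuracies from five training runs of the same model,
# each with a different seed or train/test split.
baseline_runs = [0.91, 0.93, 0.92, 0.90, 0.94]

mean_accuracy = statistics.mean(baseline_runs)
spread = statistics.stdev(baseline_runs)  # sample standard deviation


def within_noise(accuracy_a, accuracy_b, spread, k=2.0):
    """Rough heuristic: is the gap no larger than k standard deviations?"""
    return abs(accuracy_a - accuracy_b) <= k * spread


print(f"mean={mean_accuracy:.3f} spread={spread:.3f}")
print(within_noise(0.92, 0.93, spread))
print(within_noise(0.92, 0.99, spread))
```

If the gap between two models falls within this spread, it is hard to attribute it to the change you made rather than to run-to-run variation.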

I will keep you updated with our investigation into why you are seeing differences between training runs.

Warmly,
Dan

dansitu · January 5, 2024, 6:05pm

Hi Maria,

We have created a fix for the sample ordering issue when restoring projects; this should be deployed within the next few days.

Aside from that, I wasn’t able to reproduce the discrepancy you are seeing. However, one thing you might try is setting ENSURE_DETERMINISM = True in your Expert mode code. This will prevent any variation between training runs, at a small cost to performance and potentially model accuracy.

Warmly,
Dan