Reproduction overfitting issue

**Question/Issue:** I have been trying to reproduce ARM's keyword spotting implementation for microcontrollers and I am running into an overfitting issue. The implementation is exactly the same, only ported to Keras (as it is supposed to be in Keras mode), and the dataset is also the same. I do not see any overfitting in the original ARM implementation when running it locally in PyCharm.

**Project ID:** 118659

Context/Use case:

Hello @Sina,

Did you solve your issue?
I cannot see a strong overfit in your project. I checked the difference between your validation accuracy and your testing accuracy and they are fairly similar (87% vs 84%).
Also, did you make sure to keep the same DSP parameters between your impulse and the ones in the GitHub repo? I have trouble seeing where the DSP parameters are set in the ML-KWS-for-MCU repo.

Let me know,




Hello @louis,

Thank you for getting back to me.
Well not yet, but trying to figure out what causes this.
The overfitting can be seen while training the network: the training accuracy goes to 100 percent, but the validation accuracy doesn't go any higher than ~90-91 percent.
I actually set the normalization window size to 0, and it seems to work a little better (the training accuracy tends to increase a little more slowly), but still not as expected.
Yes, finding the DSP parameters is a bit hard; however, some of them can be found here:
Regarding the FFT size, I am still not sure whether I should set it to 512 or 1024. Other than that, everything matches. However, in the original project a set of background noise clips is mixed into 80% of the audio data at 0.1 volume. This could definitely affect performance, so I add Gaussian noise with std=0.1 to somewhat compensate for it.
As you might see, the testing accuracy should reach ~94-95%, which is a lot different from what I am getting.
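For reference, the augmentation I described above (background noise mixed into 80% of samples at 0.1 volume) can be sketched on raw audio with NumPy. The function name and details here are my own illustration, not the actual ML-KWS-for-MCU code:

```python
import numpy as np

def mix_background(sample, noise_clips, prob=0.8, volume=0.1, rng=None):
    """Sketch of the ARM-style augmentation: with probability `prob`,
    add a randomly cropped background-noise clip scaled by `volume`.
    (Illustrative only; not the actual ML-KWS-for-MCU code.)"""
    rng = rng or np.random.default_rng()
    if rng.random() >= prob:
        return sample  # leave this sample clean
    clip = noise_clips[rng.integers(len(noise_clips))]
    if len(clip) >= len(sample):
        # crop a random window of the noise clip to the sample length
        start = rng.integers(0, len(clip) - len(sample) + 1)
        clip = clip[start:start + len(sample)]
    else:
        # pad short clips with silence
        clip = np.pad(clip, (0, len(sample) - len(clip)))
    return sample + volume * clip
```

The important point is that this runs on the raw waveform, before the DSP stage, unlike adding Gaussian noise afterwards.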

Thank you for any feedback,


One more thing I would like to mention here is the learning rate scheduler. I defined a scheduler function that determines the learning rate at each specific epoch. However, it seems something is wrong: I get a validation accuracy of 90.7% at my last epoch, but after the job is done and profiled, the accuracy drops to 87.7%, which is almost the number I get for the first learning rate. Is there anything I am missing? Isn't it supposed to take the best performance (accuracy) and report it?
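For context, an epoch-indexed step schedule in Keras is usually written as a plain function handed to a `LearningRateScheduler` callback; the sketch below is only an illustration, and the epoch boundaries and rates are placeholders (5e-4 is the starting rate mentioned in this thread, the rest are invented):

```python
# Epoch-indexed step schedule. Boundaries and values are illustrative
# placeholders; 5e-4 matches the starting rate discussed in the thread.
def lr_schedule(epoch, lr=None):
    if epoch < 20:
        return 5e-4
    if epoch < 40:
        return 1e-4
    return 2e-5

# In Keras this plugs into training via:
# model.fit(..., callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```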

Thank you in advance,


@dansitu or @matkelcey, do you have an idea?

For the overfitting issue, I am not completely sure why; maybe a different version of TensorFlow does not perform exactly the same. Is your validation set the same size in both methods?

And for the drop after training when using a learning rate scheduler, I am not sure either. I think we apply a sort of post-processing that is supposed to fine-tune the model but may be incompatible with the learning rate change.

I’ll ask one of our ML experts to have a look at your questions.

Best regards,



It is kind of weird, actually. I tried to account for any scenario that might affect the training and validation accuracy, but I still see some odd behavior in Keras. The validation size is the same as the one I have in the ARM implementation (10% of the dataset). After a certain point, the training accuracy goes high (even to 100%) even with a steady learning rate (0.0005), a behavior I do not see in my local training with the ARM implementation.
I should mention that I am using the exact same dataset, except that in the ARM implementation some background noise is mixed into the training samples. That is the only difference, and I removed that part of the ARM implementation so that I could make a fair comparison, but the issue persists.
I would really appreciate someone's help here.

Thank you,

Hi @Sina,

Thanks for using Edge Impulse! I just checked your project and your validation accuracy is 91.6%, so it looks like you resolved the issue with the mismatch? That said, here are some notes that should help improve performance overall:

Steps per epoch
First up, I noticed that you are setting steps_per_epoch=1 in your call to fit(). This means that you are only using one batch of data per epoch. You have 32047 samples and a batch size of 100, so you are currently throwing away almost 99.7% of your dataset.

I would recommend removing this parameter altogether. It’s almost guaranteed to cause overfitting, since you are training your model on the same 100 samples every epoch and ignoring most of your data.
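The arithmetic behind that 99.7% figure can be checked directly (sample and batch counts taken from your project, assuming one full pass per epoch once steps_per_epoch is removed):

```python
import math

samples = 32047      # dataset size from the project
batch_size = 100

# A full pass over the data needs this many batches per epoch:
full_epoch_steps = math.ceil(samples / batch_size)

# With steps_per_epoch=1, only one batch (100 samples) is seen per epoch:
fraction_ignored = 1 - (1 * batch_size) / samples

print(full_epoch_steps)            # 321
print(round(fraction_ignored, 3))  # 0.997
```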

Dataset repeating
In the code your dataset is being repeated 10 times:

train_dataset = train_dataset.repeat(10)

Since each epoch goes through the dataset once, there’s no need to do this—you can just multiply your number of epochs by 10. Plus, since you set steps_per_epoch=1, you were only using the first 100 samples of the dataset during each epoch anyway, so this repeat was having no effect. I’d recommend leaving this one out.

Validation frequency
With validation_freq=50 you’re running validation only every 50 epochs; to get some insight into potential overfitting it’s useful to leave this on for every epoch.

Number of epochs
Since your dataset was being reduced to only 100 samples, you had a very high number of epochs set. With more samples a lower number will work.

New model version

After making these changes in a new model version I’m seeing 88% validation accuracy after 5 epochs, which is promising—you can try training for more epochs and see how high you can get it.

Reasons for drop in accuracy between the logs and the final model

There are a couple of things that can cause a drop in accuracy between the last epoch and the final printed value. They are:

  1. By default the UI displays the accuracy for the quantized model, which has been reduced in precision so that it runs more quickly and takes up less space when deployed to an embedded device. Sometimes the quantized model has lower accuracy than the original model.
  2. At the end of training, we use the model from whichever epoch has the lowest validation loss. This may not be the model from the final epoch. It may also not be the model with the highest accuracy, since loss is a better measure of the model’s overall representation of the dataset.
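Point 2 can be illustrated with some made-up per-epoch logs (the numbers below are invented for illustration; in Keras this behavior corresponds to something like a `ModelCheckpoint(save_best_only=True, monitor='val_loss')` callback restoring the best epoch):

```python
# Invented (val_loss, val_accuracy) per epoch, for illustration only.
history = [
    (0.62, 0.85),
    (0.48, 0.90),  # lowest validation loss: this is the model kept
    (0.55, 0.91),  # highest accuracy, but NOT the epoch restored
]

# Pick the epoch with the lowest validation loss:
best_epoch = min(range(len(history)), key=lambda e: history[e][0])
print(best_epoch, history[best_epoch][1])  # reports 0.90, not the 0.91 peak
```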

I hope this helps! Let me know if you have any more questions.



Hi @dansitu,

I really appreciate you taking the time to go through my project here.
There are some issues that I would like to share with you:
Steps per epoch
Well, actually, at first I did not change steps per epoch and the training went through the whole dataset. I checked that in the ARM implementation they take random samples for each batch at every iteration, so there is some redundancy of samples across batches. It seems this was my mistake; I was not aware of how steps per epoch actually works.
Validation frequency
Because I had misunderstood steps per epoch, I set the validation frequency to 50.
New model version
Originally, my first version was the same as the new model you just described, and the training accuracy goes to 100 percent while the validation accuracy doesn't go any higher than ~87-90%; since training reaches 100%, there is no room left to improve the accuracy. That is why I tried to intervene with steps per epoch (which, it seems, was wrong).
However, if you set the learning rate to, say, 0.0005 (the starting rate) and train for 20-30 or more epochs, the training accuracy goes to 100% but the validation accuracy gets stuck. In the ARM implementation, on the other hand, I do not reach this training accuracy even after 10000 epochs (no more than 96%), and the validation accuracy reaches 93% with a learning rate of 0.0005. If I then lower the learning rate (to 0.0001), both accuracies go higher (validation accuracy around ~94-95%). Even with the lower learning rate, the training accuracy still has room to improve (it does not reach 100%).
This is my real problem. I have been trying to figure out whether it is a fault in my implementation (though I don't think I have missed anything).



I just wanted to show you this for convenience:

The validation accuracy starts to stagnate at this point when the learning rate (LR) is 0.0005.
I changed the LR to 0.0001 and then to 0.00002, and as can be seen in the following shot, the training accuracy has no room left to improve:

All these results can also be found in the project's job section.
On the other hand, I get a validation accuracy of ~93-94% for LR=0.0005 in the ARM implementation (implemented with TensorFlow and tf-slim).

Just one more thing:
If steps per epoch = 1 means only the first 100 samples (the first batch) are used, how does the validation accuracy get better at each epoch? I got 91.6% validation accuracy with steps per epoch = 1, and a test accuracy of 88.7%. Doesn't that seem like nonsense? With only one batch, how can the model train and still perform well on a new set of data?

Thank you,

Hi Sina,

Thanks for the extra detail. I’ll go over your last question first:

If steps per epoch = 1 means only the first 100 samples (the first batch) are used, how does the validation accuracy get better at each epoch? I got 91.6% validation accuracy with steps per epoch = 1, and a test accuracy of 88.7%. Doesn't that seem like nonsense? With only one batch, how can the model train and still perform well on a new set of data?

This is the type of thing you can only really understand through experimentation and exploration. If the variance in the dataset is low, it might be that 100 samples is sufficient to train a reasonable model. There’s no rule that says a larger dataset is always better. For example, imagine your dataset contains a lot of label noise (e.g. mislabelled samples). It might be the case that by luck, the 100 samples selected in your 1 batch per epoch happen to be less noisy ones.

Alternatively, they might happen to coincide well with the samples selected for the validation dataset. Since you have a lot of classes and you are only selecting 10% of your dataset for validation, you may be ending up with an imbalanced validation dataset, and the model you are training with 100 samples just happens to work well for that dataset.
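To see how a random 10% split can come out uneven, here is a quick simulation (the class count and sizes are rough stand-ins for a KWS dataset, not your exact split):

```python
import random

random.seed(0)
# 12 classes of ~2670 samples each, roughly 32k samples in total.
labels = [c for c in range(12) for _ in range(2670)]
val = random.sample(labels, k=len(labels) // 10)  # random 10% validation split

counts = [val.count(c) for c in range(12)]
# Expected ~267 per class; unlucky draws leave some classes
# noticeably under- or over-represented.
print(min(counts), max(counts))
```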

In terms of reproducing Arm’s results, I took a look at their benchmarks and it looks like their models’ performance is not far off from yours:

You’re getting 84% test accuracy, which is around 10% less than Arm’s results. You also mentioned that you are not mixing background noise into your samples. It’s definitely reasonable to assume you might be able to improve performance by 10% by mixing in background noise, especially since you are currently having a problem with overfitting. Adding Gaussian noise is not as good, especially since it happens after signal processing has been applied—so it can only approximate full-spectrum white noise, not any other kind of realistic background noise.

We actually have a transform block you can use to easily mix in background noise; perhaps you’d like to try it:




Thank you so much for the complete response, and for taking the time to go through the details of my questions.
The transformation block sounds interesting and is definitely a powerful and insightful part of Edge Impulse, giving developers that kind of room to maneuver.
I will try this transform block, and for others' reference, I will keep updating this post if any improvement is achieved with its help.

Cheers to you and the Edge Impulse team,


Thank you Sina, always happy to help! Let me know how it goes and if you run into any roadblocks.



Quick Update
Based on what @dansitu suggested, I utilized the transformation block (based on the GitHub repository that contains an example of a mix-background-noise transformation block). Using it, I injected some noise (from the same YouTube link provided) with just one output, imported it into my project (an ARM KWS DS-CNN reproduction), and got 95% accuracy on the test dataset with a confidence threshold of 0.5.
The transformation block works like a miracle, and I am pretty sure that by injecting more varied noise into the training dataset I would get even better accuracy on the test dataset; this will most definitely be more efficient when using the ML app on a real dev board.
Please feel free to reach out to me if you have any questions.