Model Overfitting? - Continuous speech recognition with ESP32 for numbers (0-9)

With this project I want to find out whether it's possible to detect the spoken numbers “zero” to “nine” using an ESP32 processor, a MEMS microphone, and a 1D CNN model.

While there are a lot of examples that distinguish 3 or 4 different keywords, I have not seen a working example that distinguishes 10 different labels. Understanding that this may be too complex for such a small device, I will at least try to train it to recognize one specific voice.

  • Lolin D32 Pro (ESP32) and an INMP441 I2S MEMS Microphone for sample generation and inference.
  • CLion and PlatformIO as development platform (downloading the Arduino deployment package)
  • A Python script to download sample files from the ESP32 via WiFi (a sketch of such a script follows this list)
  • EdgeImpulse (of course :-))
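
For reference, a minimal sketch of what such a download script could look like, assuming the ESP32 firmware serves the recorded WAV files over HTTP (the IP address, the /samples path and the file names are hypothetical, not taken from the actual project):

```python
# Minimal sketch: fetch recorded sample files from the ESP32 over WiFi.
# The IP address, the /samples endpoint and the file-name format are
# hypothetical placeholders -- adapt them to the actual firmware.
import requests

ESP32_HOST = "http://192.168.1.50"   # hypothetical IP of the Lolin D32 Pro

def download_sample(name: str) -> None:
    """Download one WAV file and store it locally."""
    resp = requests.get(f"{ESP32_HOST}/samples/{name}", timeout=10)
    resp.raise_for_status()
    with open(name, "wb") as f:
        f.write(resp.content)

if __name__ == "__main__":
    # e.g. 120 samples of the keyword "zero"
    for i in range(120):
        download_sample(f"zero_{i:03d}.wav")
```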

More information on the pipeline and source code is available on GitHub.

Situation

I have 120 voice samples per number, each 700 ms long; the voice starts randomly within the first 100 ms and lasts between 200 and 500 ms. The impulse is created with a 600 ms window size and a 20 ms window increase, so I get 10 windows per sample, which increases the dataset size.

Using the 1D CNN with various MFCC parameters I get 94% accuracy on validation data, around 84% on testing, and only about 50% on the ESP32, with confidence values below 40% in most cases.

Over-fitting at Impulse Generation?

Over-fitting: CNN networks can detect a pattern in an image (the MFCC) regardless of its position, so is the 20 ms window increase really needed, or does it just increase over-fitting?
(I think it is needed: on the ESP32 the maximum value of EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW is 4, so I have 250 ms blocks; with the 100 ms start offset it at least increases the likelihood that a relevant voice snippet is detected.)

Overfitting - as it's only my voice?
I am using the default 1D model, which is pretty simple and may not be able to handle 10 keywords. So I will try to train it with my voice only (which would be fine for my use case).

Is this a realistic assumption at all?

As an alternative, could I first train the model with numbers spoken by different speakers, and then use a kind of “transfer learning” to fine-tune it with my voice? If yes, how could I do this in EdgeImpulse?

Overfitting - but why do I get 84% on test and only 50% on the device?
I am using the same software for sampling and inference, so the sound quality is identical.
What could be wrong that causes this drop in performance when moving to the device?

Areas I tried:

  • I have changed the MFCC parameters (coefficients, frame length) and EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW - with some improvement, but never reaching 80%
  • I tried changing the model (more neurons + dropout)

Any feedback is welcome - it's an experiment and a learning exercise.
Maybe a very simple edge device could be shipped “pre-trained”, then trained for its user with a mobile app, and handle much more complex tasks later…

Hi @christian42, thanks for the writeup. I've quickly looked through your project and see a validation accuracy of ~68%, so if that's the case it's definitely overfit compared to the training set (or you've been iterating on it a bit). I think there are a few root causes:

  1. You generate 5 windows for every training sample (with the 20 ms increase), so you'll have a lot of data that looks very similar (see the sketch after this list). I'd actually lower this, e.g. with a 100 ms increase, so you only generate two.
  2. The number of parameters in the 1D CNN might not be enough to properly capture the wide spectrum of outcomes you're trying to classify. I think you want to make the layers bigger (e.g. 16/32 instead of 8/16) or switch to a 2D Conv network.
  3. Increase the dropout in the network if you see overfitting happen.
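
As a quick illustration of point 1, a small sketch comparing the overlap between neighbouring windows for a 20 ms and a 100 ms window increase (using the 600 ms window from above):

```python
# With a 600 ms window moved forward in small steps, neighbouring windows
# share most of their audio, so the generated windows look nearly identical.
WINDOW_MS = 600

for increase_ms in (20, 100):
    overlap_ms = WINDOW_MS - increase_ms
    print(f"{increase_ms} ms increase: neighbouring windows share "
          f"{overlap_ms} ms ({overlap_ms / WINDOW_MS:.0%} of the window)")
```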

Regarding:

and only 50% on the device?

  1. How do you evaluate this? Is this just by observing the output? Are you sure you’re logging every inference (so 3x per second or so?)? Could you do a quick screencast with you talking and the results displaying?
  2. There is a moving-average filter in place when using continuous audio classification. It smooths the results, but you might try to disable it on small datasets (a conceptual sketch of the idea follows this list): https://docs.edgeimpulse.com/docs/responding-to-your-voice#poor-performance-due-to-unbalanced-dataset.
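
For intuition, here is a conceptual sketch of what such a moving-average filter does to per-slice predictions (this is only an illustration of the idea, not the actual SDK implementation):

```python
# Conceptual sketch: average each class probability over the last few slices.
# A single confident slice gets diluted by its quieter neighbours, which is
# why disabling the filter can help on small datasets.
from collections import deque

class MovingAverageFilter:
    def __init__(self, num_classes: int, history: int = 4):
        self.history = [deque(maxlen=history) for _ in range(num_classes)]

    def update(self, probabilities):
        smoothed = []
        for buf, p in zip(self.history, probabilities):
            buf.append(p)
            smoothed.append(sum(buf) / len(buf))
        return smoothed

maf = MovingAverageFilter(num_classes=10)
maf.update([0.05] * 10)                       # quiet slice
smoothed = maf.update([0.90] + [0.01] * 9)    # confident "zero" slice
print(round(smoothed[0], 2))                  # ~0.48 instead of 0.90
```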

FYI, I did a quick iteration on your model and here’s how I get to 89.3% accuracy on the testing set: https://studio.edgeimpulse.com/studio/17635 (I’ve added you as a collaborator).

Basically:

  • 100 ms window increase.
  • Switch the MFCC frame length to 0.02, coefficients to 13, and increase the normalization window size to 301. Reset the frequency parameters to their defaults.
  • 2D CNN with 0.5 dropout, and time/frequency masking disabled (a sketch of such a network follows this list).
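
For reference, a hedged Keras sketch of what such a 2D CNN could look like (layer sizes and the number of MFCC frames are assumptions for illustration, not the exact architecture in the project above):

```python
from tensorflow.keras import layers, models

NUM_FRAMES = 30   # assumption: depends on MFCC frame length/stride and the 600 ms window
NUM_COEFFS = 13   # MFCC coefficients as suggested above
NUM_CLASSES = 10  # "zero" .. "nine"

# Small 2D CNN over the MFCC "image", with 0.5 dropout after each pooling stage.
model = models.Sequential([
    layers.Reshape((NUM_FRAMES, NUM_COEFFS, 1), input_shape=(NUM_FRAMES * NUM_COEFFS,)),
    layers.Conv2D(8, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Dropout(0.5),
    layers.Conv2D(16, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```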

janjongboom: Thank you very much for your feedback, really appreciated :slight_smile:

I believe the main trick was increasing the impulse window increase to 100 ms.

Now, when I deploy your updated model to the ESP32 I get a “Sample buffer overrun”, even with SLICES_PER_MODEL_WINDOW=2 (the minimum value). It seems to be too much for the ESP32.

I played with the CNN model and the MFCC stride/window parameters to reduce the processing time. When going back to the 1D model with a few more neurons than the default, dropout at 0.5, and MFCC with a 20 ms stride and 200 ms window, it works reasonably well (much better, but not perfect). Any increase in MFCC detail results in a “Sample buffer overrun”.

My thinking:
I need SLICES_PER_MODEL_WINDOW=4 at minimum to get good results.
That means the total processing time on the ESP32, taking this output as an example:
Predictions (DSP: 131 ms., Classification: 32 ms., Anomaly: 0 ms.)
is 131 + 32 = 163 ms per processed slice; 4 slices = 163 * 4 = 652 ms > 600 ms impulse.
So the ESP32 takes 652 ms to process a 600 ms impulse… buffer overrun…
Correct?
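
The same budget check as a small sketch, using the timing numbers from the log line above:

```python
# Real-time budget check: with continuous inference, DSP + classification for
# all slices of a window must finish before the next 600 ms of audio arrive.
DSP_MS = 131
CLASSIFICATION_MS = 32
SLICES_PER_WINDOW = 4
WINDOW_MS = 600

per_slice_ms = DSP_MS + CLASSIFICATION_MS        # 163 ms per inference
total_ms = per_slice_ms * SLICES_PER_WINDOW      # 652 ms for one 600 ms window

print(f"{total_ms} ms of processing for {WINDOW_MS} ms of audio")
print("sample buffer overrun" if total_ms > WINDOW_MS else "keeps up")
```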

Ways to solve:
I need to play around with the MFCC parameters (which take most of the time) or the model parameters until I get below 600 ms. Reducing the model is not the best path to take, as it needs the parameters to capture the different labels and it only accounts for a minority of the time. So the MFCC is the critical part to work on. Correct? Maybe it is also possible to speed up processing on the ESP32 (although I already use both CPUs of the ESP32: one for recording, one for inference)?

Idea:
To save some time, it would be good to have a way to set up a “custom device” for latency estimation in EdgeImpulse (with some tool for calibration), so it's easy to check the on-device performance of the MFCC and NN classifier blocks for my own device. Would this be possible?

I will try removing the “moving-average filter” tomorrow.

Yeah, basically correct. If you want to do a quick test, set the frame stride to 0.02; it reduces the number of features by 2x and should result in a smaller neural network.
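
A quick sketch of why that halves the feature count (assuming a 600 ms window, 0.02 s frame length and 13 coefficients; the exact frame count in the Studio may differ by one):

```python
# Approximate number of MFCC features for a 600 ms window and 13 coefficients,
# comparing a 0.01 s and a 0.02 s frame stride.
WINDOW_S = 0.6
FRAME_LENGTH_S = 0.02
NUM_COEFFS = 13

def num_features(frame_stride_s: float) -> int:
    frames = int((WINDOW_S - FRAME_LENGTH_S) / frame_stride_s) + 1
    return frames * NUM_COEFFS

print(num_features(0.01))  # 767 features (59 frames x 13)
print(num_features(0.02))  # 390 features (30 frames x 13)
```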

To save some time, it would be good to have a way to set up a “custom device” for latency estimation in EdgeImpulse (with some tool for calibration), so it's easy to check the on-device performance of the MFCC and NN classifier blocks for my own device. Would this be possible?

Good idea, definitely something we’d consider.