With this project I want to find out whether it is possible to detect the spoken numbers “zero” to “nine” using an ESP32 processor, a MEMS microphone and a 1D CNN model.
While there are a lot of examples that distinguish 3 or 4 different keywords, I have not seen a working example that distinguishes 10 different labels. Understanding that this could be too complex for such a small device, I will at least try to train it to listen to one specific voice.
- Lolin D32 Pro (ESP32) and an INMP441 I2S MEMS Microphone for sample generation and inference.
- CLion and PlatformIO as development platform (using the Arduino deployment package)
- Python script to download sample files from the ESP32 via WiFi
- EdgeImpulse (of course :-))
More information on the pipeline and source code is available at: GitHub
Situation
I have 120 voice samples per number, each with a duration of 700 ms; the voice starts randomly within the first 100 ms and the spoken word itself is 200-500 ms long. The impulse is created with a 600 ms window size and a 20 ms window increase, so I get 10 windows per sample, which increases the effective training set size.
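For reference, this is roughly how I picture the windowing as a plain sliding window; the exact count in EdgeImpulse can differ, e.g. when it pads or shifts the data, so treat the numbers as an estimate:

```python
# Rough sliding-window estimate of how the window increase multiplies the
# training data. EdgeImpulse Studio can produce more windows than this plain
# formula when it pads/shifts the samples, so this is only an illustration.
sample_ms   = 700   # length of each recording
window_ms   = 600   # impulse window size
increase_ms = 20    # window increase (stride)
samples_per_label = 120

windows_per_sample = (sample_ms - window_ms) // increase_ms + 1
print(f"windows per sample (no padding): {windows_per_sample}")
print(f"training windows per label:      {windows_per_sample * samples_per_label}")
```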
Using the 1D CNN with various MFCC parameters, I get 94% accuracy on the validation data, around 84% on the test set, and only about 50% on the ESP32, with classifier scores below 40% in the majority of cases.
Overfitting at Impulse Generation?
Overfitting: CNNs can detect a pattern in an image (here the MFCC matrix) regardless of its position, so is the 20 ms window increase really needed, or does it just increase overfitting?
(I think it is needed: on the ESP32 the maximum value of EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW is 4, so I get 250 ms blocks; given the up to 100 ms random start offset, the window increase at least raises the likelihood that a relevant voice snippet is captured.)
Overfitting - as it's only my voice?
I am using the default 1D model (sketched below), which is pretty simple and may not be able to handle 10 keywords. So I try to train it with my voice only (which would be fine for my use case).
Is this a realistic assumption at all?
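For reference, my understanding of the architecture is roughly the following. This is my own Keras sketch, not the exact code generated by EdgeImpulse; the layer sizes, frame count and coefficient count are assumptions:

```python
# Rough sketch of an EdgeImpulse-style 1D conv keyword model for 10 labels.
# Layer sizes, NUM_FRAMES and NUM_COEFFS are assumptions, not the exact
# values from my project.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10        # "zero" .. "nine"
NUM_COEFFS  = 13        # MFCC coefficients per frame (assumption)
NUM_FRAMES  = 30        # MFCC frames per 600 ms window (assumption)

model = models.Sequential([
    # The flattened MFCC features are reshaped back into (frames, coefficients)
    # so the 1D convolution runs along the time axis.
    layers.Reshape((NUM_FRAMES, NUM_COEFFS), input_shape=(NUM_FRAMES * NUM_COEFFS,)),
    layers.Conv1D(8, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling1D(pool_size=2, strides=2, padding='same'),
    layers.Dropout(0.25),
    layers.Conv1D(16, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling1D(pool_size=2, strides=2, padding='same'),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation='softmax'),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```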
As an alternative, could I first train the model on numbers spoken by different speakers, and then use a kind of “transfer learning” to fine-tune it with my own voice? If yes, how could I do that in EdgeImpulse?
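In plain Keras I imagine the idea would look roughly like this; this is only a sketch of the concept, and the function and argument names are made up:

```python
# Sketch of the fine-tuning idea: start from a digit model trained on many
# speakers, freeze the convolutional feature extractor and retrain only the
# classification head on my own voice. Function and argument names are
# placeholders, not part of the actual project.
import tensorflow as tf

def fine_tune_on_my_voice(model, x_my_voice, y_my_voice):
    """model: the pre-trained 1D CNN (e.g. the sketch above);
    x_my_voice / y_my_voice: MFCC windows and one-hot labels of my recordings."""
    # Freeze everything except the final classification layer.
    for layer in model.layers[:-1]:
        layer.trainable = False

    # Re-compile with a small learning rate so the head adapts
    # without destroying the pre-trained features.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_my_voice, y_my_voice,
              epochs=30, batch_size=32, validation_split=0.2)
    return model
```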
Overfitting - but why do I have 84% on test and only 50% on the device?
I am using the same software for sampling and inference, so the sound quality should be identical.
What could be wrong that I get this drop in performance when moving to the device?
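One sanity check I can run with the sample files downloaded over WiFi (file names below are placeholders): compare basic statistics of a recording taken on the device with a training sample, to rule out differences in gain, DC offset or clipping:

```python
# Quick sanity check: compare basic audio statistics of a sample recorded on
# the ESP32 with one used for training. Assumes 16-bit mono WAV files;
# file names are placeholders.
import wave
import numpy as np

def audio_stats(path):
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float32)
    return {
        "sample_rate": rate,
        "duration_ms": 1000 * len(samples) / rate,
        "rms": float(np.sqrt(np.mean(samples ** 2))),
        "peak": float(np.max(np.abs(samples))),
        "dc_offset": float(np.mean(samples)),
        "clipped_samples": int(np.sum(np.abs(samples) >= 32767)),
    }

print("training sample:", audio_stats("training_sample.wav"))
print("device sample:  ", audio_stats("device_sample.wav"))
```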
Areas I tried:
- I have changed the MFCC parameters (number of coefficients, frame length) and EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW - with some improvement, but never reaching 80% (see the sketch after this list for how these parameters shape the input features)
- I tried to change the model (increasing the number of neurons and adding dropout)
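For context, the MFCC frame length, frame stride and number of coefficients directly determine the size of the feature matrix the 1D CNN sees per window. A small sketch of that relationship, with example values only, not necessarily my exact settings:

```python
# How the MFCC parameters change the input the 1D CNN sees per 600 ms window.
# Example values only, not necessarily my exact settings.
window_ms    = 600
frame_len_ms = 20    # MFCC frame length
frame_stride = 20    # MFCC frame stride
n_coeffs     = 13    # number of cepstral coefficients

n_frames = (window_ms - frame_len_ms) // frame_stride + 1
print(f"MFCC matrix: {n_frames} frames x {n_coeffs} coefficients "
      f"= {n_frames * n_coeffs} input features")
```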
Any feedback is welcome - it's an experiment and learning exercise.
Maybe one day a very simple edge device is shipped “pre-trained”, then trained for its user with a mobile app, and can handle much more complex tasks later…