Question/Issue:
I’m working on a sound classification project that combines keyword spotting (e.g., “Aakash”) with environmental sound detection (e.g., door knock, traffic noise, background noise). I’m not getting the desired accuracy even after applying augmentation techniques, and I’d like suggestions for improving model performance.
Project ID:
700599
Context/Use case:
The goal is to deploy a sound recognition system on edge devices that can recognize when someone says “Aakash” and also detect general sounds like “door knock” or “traffic sound” for use in a smart wearable for the deaf and mute.
Steps Taken:
Collected and uploaded 1h 46m 55s of labeled audio samples across 4 classes.
Applied augmentations such as background noise, pitch shift, time stretch, and random silence (a rough equivalent is sketched after these steps).
Used Transfer Learning (Keyword Spotting) model with MFCC features and also experimented with 1D CNN and raw audio.
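Roughly, those augmentations amount to something like the following (a minimal Python sketch assuming librosa and numpy; the parameter ranges are illustrative, not the exact values used):

```python
import numpy as np
import librosa

def augment(audio, sr):
    """Illustrative versions of the augmentations listed above."""
    # Pitch shift by up to +/- 2 semitones
    audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=np.random.uniform(-2, 2))
    # Time stretch between 0.9x and 1.1x
    audio = librosa.effects.time_stretch(audio, rate=np.random.uniform(0.9, 1.1))
    # Mix in low-level background noise
    audio = audio + np.random.randn(len(audio)) * np.random.uniform(0.002, 0.01)
    # Zero out a short random window ("random silence")
    start = np.random.randint(0, max(1, len(audio) - sr // 10))
    audio[start:start + sr // 10] = 0.0
    return audio
```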
Expected Outcome:
A single model capable of classifying both keywords and general sounds with >90% accuracy on all classes.
Actual Outcome:
Accuracy is inconsistent. “Aakash” class performs poorly (~60%), while environmental sounds show ~75–80% accuracy. Performance is not reliable on-device (ESP32-S3).
Reproducibility:
[ ] Always
Environment:
**Platform:** ESP32-S3, STM32
Build Environment Details: Arduino IDE 1.8.19
OS Version: Windows 11
Edge Impulse Version (Firmware): 1.2.3
Edge Impulse CLI Version: 1.15.5
Project Version: 1.0.0
Custom Blocks / Impulse Configuration: MFCC preprocessing, MFE, Transfer Learning (Keyword Spotting), spectrogram preprocessing for the 1D CNN variant, and MobileNetV2 0.35
Additional Information:
I am open to using a pretrained model like YAMNet if it improves results, but I’d prefer a unified model if possible.
Can Edge Impulse suggest:
The best model architecture for mixed sound types?
Whether to split into 2 models or merge all data into one?
Ways to improve classification for spoken words like “Aakash”?
I just had a look at your project and I see that your data samples for “door knock” and “Aakash” are pretty clean; they don’t have any background noise in them.
It could be a good idea to mix them with other background noise (different from your noise class). This forces the NN to focus on the keyword or sound itself and learn the distinction. However, it could create some confusion with your noise class if you don’t have enough samples.
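If you want to do that mixing offline before uploading, something along these lines works as a starting point (a minimal sketch assuming numpy and soundfile with mono WAV files; the file names and SNR value are only placeholders):

```python
import numpy as np
import soundfile as sf

def mix_with_noise(clean_path, noise_path, snr_db, out_path):
    """Overlay a background-noise recording on a clean sample at a target SNR."""
    clean, sr = sf.read(clean_path)
    noise, sr_n = sf.read(noise_path)
    assert sr == sr_n, "resample first if the sample rates differ"

    # Loop or trim the noise so it matches the clean clip length
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    # Scale the noise so the mix hits the requested signal-to-noise ratio
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = clean + scale * noise

    # Normalize to avoid clipping and write the result
    mixed = mixed / max(1.0, np.max(np.abs(mixed)))
    sf.write(out_path, mixed, sr)

# e.g. mix an "Aakash" clip with street noise at 10 dB SNR
# mix_with_noise("aakash_01.wav", "street_noise.wav", 10, "aakash_01_noisy.wav")
```

Mixing each clean clip with a few different noise recordings at varying SNRs usually gives the model more realistic examples than a single synthetic noise floor.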
Also, how did you record your sounds? Depending on the microphone used, your recordings may differ from what the ESP32 captures, and that can have a big impact on accuracy.
I’d suggest recording some data samples with your ESP32 and having a look at the data explorer. If the new samples sit far away from the others, that indicates your model has not generalized well enough.
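To make that comparison more concrete before looking at the data explorer, you can also compute a few basic statistics for an on-device recording versus one of your training clips (a quick sketch assuming librosa; the file names are placeholders):

```python
import numpy as np
import librosa

def summarize(path, sr=16000, n_mfcc=13):
    """Load a clip and return its RMS level and mean MFCC vector."""
    audio, _ = librosa.load(path, sr=sr)
    rms = float(np.sqrt(np.mean(audio ** 2)))
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return rms, mfcc

rms_train, mfcc_train = summarize("aakash_train.wav")
rms_esp, mfcc_esp = summarize("aakash_esp32.wav")

print(f"RMS level: train={rms_train:.4f}  esp32={rms_esp:.4f}")
print("Mean-MFCC distance:", float(np.linalg.norm(mfcc_train - mfcc_esp)))
```

A large gap in level or in the MFCC distance is a hint that the ESP32 microphone chain differs from your training recordings.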
Next, have you tried the EON Tuner to see what combination of DSP and learning block parameters it suggests?