Suggestions to improve accuracy and performance of my sound classification project using both keyword spotting and environmental sounds

Question/Issue:
I’m working on a sound classification project that combines both keyword spotting (e.g., “Aakash”) and environmental sounds (e.g., door knock, traffic noise, background noise). I’m not getting the desired accuracy even after using augmentation techniques. I want suggestions to improve my model performance.

Project ID:
700599

Context/Use case:
The goal is to deploy a sound recognition system on edge devices that can recognize when someone says “Aakash” and also detect general sounds like “door knock” or “traffic sound” for use in a smart wearable for the deaf and mute.

Steps Taken:

  1. Collected and uploaded 1h 46m 55s of labeled audio samples across 4 classes.
  2. Applied augmentations such as added background noise, pitch shift, time stretch, and random silence (roughly as in the sketch after this list).
  3. Used the Transfer Learning (Keyword Spotting) model with MFCC features, and also experimented with a 1D CNN and raw audio input.
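
The augmentations in step 2 were applied with a small offline script, roughly along the lines of this sketch (librosa-based; the shift/stretch values shown are illustrative rather than my exact settings, and the background-noise mixing step is omitted here):

```python
import numpy as np
import librosa

rng = np.random.default_rng(42)

def augment(y, sr):
    """Return a few augmented variants of a mono clip y sampled at sr."""
    out = {}
    # Pitch shift by +/- 2 semitones
    out["pitch_up"] = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    out["pitch_down"] = librosa.effects.pitch_shift(y, sr=sr, n_steps=-2)
    # Time stretch: rate > 1 speeds the clip up, rate < 1 slows it down
    out["faster"] = librosa.effects.time_stretch(y, rate=1.1)
    out["slower"] = librosa.effects.time_stretch(y, rate=0.9)
    # Random silence: zero out a random ~100 ms segment
    gap = sr // 10
    start = int(rng.integers(0, max(1, len(y) - gap)))
    y_sil = y.copy()
    y_sil[start:start + gap] = 0.0
    out["silence_gap"] = y_sil
    return out

# Example usage (file name is a placeholder)
y, sr = librosa.load("aakash_001.wav", sr=16000)
variants = augment(y, sr)
```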

Expected Outcome:
A single model capable of classifying both keywords and general sounds with >90% accuracy on all classes.

Actual Outcome:
Accuracy is inconsistent. The “Aakash” class performs poorly (~60%), while the environmental sound classes reach ~75–80%. Performance is not reliable on-device (ESP32-S3).

Reproducibility:

  • [ ] Always

Environment:

  • **Platform:** ESP32-S3, STM32
  • Build Environment Details: Arduino IDE 1.8.19
  • OS Version: Windows 11
  • Edge Impulse Version (Firmware): 1.2.3
  • Edge Impulse CLI Version: 1.15.5
  • Project Version: 1.0.0
  • Custom Blocks / Impulse Configuration: MFCC and MFE preprocessing, Transfer Learning (Keyword Spotting), spectrogram preprocessing for the 1D CNN variant, and MobileNetV2 0.35

Additional Information:
I am open to using a larger pretrained model like YAMNet if it improves results, but I’d prefer a single unified model if possible.
Can Edge Impulse suggest:

  • The best model architecture for mixed sound types?
  • Whether to split into 2 models or merge all data into one?
  • Ways to improve classification for spoken words like “Aakash”?

Hello @Akshata,

Good questions.

I just had a look at your project and I see that your data samples for door knock and aakash are pretty clean; they don’t have any background noise in them.
It could be a good idea to mix those samples with other background noise (other than your noise class); see the sketch below. It will force the NN to focus on the keyword or sound and learn to make the distinction. However, this could create some confusion with your noise class if you don’t have enough samples.
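
A minimal sketch of what I mean, in Python (the file names, noise source, and SNR values are just placeholders, not something taken from your project):

```python
import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise clip into a clean clip at a target signal-to-noise ratio (dB)."""
    # Loop or trim the noise so it matches the clean clip length
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    # Scale the noise so that 10 * log10(P_clean / P_noise) == snr_db
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    mixed = clean + scale * noise
    # Normalize only if the mix would clip
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Create noisy copies of a clean "aakash" sample at a few SNRs
# (assumes both files are mono and share the same sample rate)
clean, sr = sf.read("aakash_001.wav")
noise, _ = sf.read("cafe_noise.wav")   # background noise NOT from your noise class
for snr in (20, 10, 5):
    sf.write(f"aakash_001_snr{snr}.wav", mix_at_snr(clean, noise, snr), sr)
```

Generating a couple of noisy copies per clean sample (e.g. at 20, 10, and 5 dB SNR) and uploading them alongside the originals is usually enough to see whether it helps.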

Also, how did you record your sounds? Depending on the microphone used, the recordings may differ from what the ESP32 captures, and this can have a big impact on accuracy.

I’d suggest recording some data samples with your ESP32 and having a look at the Data Explorer. If your new samples sit far away from the others, it indicates that your model has not generalized well enough.

Next, have you tried the EON Tuner to see what combination of DSP and learning block parameters it suggests?

Let me know if that helps.

Best,

Louis

Title:
Need help choosing model & DSP settings for keyword and environmental sound classification on STM32

Message:
Hello Team,

I’m working on a sound classification project using Edge Impulse + STM32. I want to detect:

  • Environmental sounds like door_knock, traffic, etc.
  • A keyword (Aakash) when someone calls it out

My current model performs well on clean data, but it struggles in real-world conditions (with background noise, etc.). I have a few questions:

  1. Should I use a single model for both environmental and keyword detection or split it into two?
  2. What’s the best model type for this case — transfer learning, keyword spotting, custom CNN, or something else?
  3. What audio DSP parameters (frame length, window size, etc.) work best for such mixed inputs?
  4. Can the EON Tuner help optimize this use case, and what combos should I try?
  5. What are the best ways to make the model robust to noise and avoid false triggers from general speech?
  6. What’s a realistic model size and latency I should aim for on STM32?

Any tips, architecture suggestions, or references to similar projects would be really helpful!

Thanks,
Akshata