Bad classification performance after deployment

Question/Issue:
Bad performance after deployment

Project ID:
867419

Context/Use case:
I am trying to classify ringtone-like audio melodies.

Steps Taken:

  1. Used the actual device to capture audio to an SD card (my own code; the
    general pattern is sketched after this list).
    I now have 45 audio samples for each melody (with different background sounds).

  2. Set up and training
    I currently have 3 melodies to classify. I tried a few different building blocks
    but settled on MFCC. Because these kinds of sounds contain a lot of quick tones
    and small time variations, I increased the time resolution:
    Frame length 0.0125
    Frame stride 0.005
    There are about 4000 features now (a rough feature-count check follows the list),
    but the separation is very good: clearly separated clusters.
    Training reaches 100% accuracy within a couple of epochs.

  3. Deployment
    Deploying to an ESP32-S3 device.
    The EON Compiler does not seem to work at all, so I am using TFLite.
    After making some changes to the example code (I posted a bug report about this
    on the forum), I matched the inference code to the same capture method my
    dataset recorder uses (the loop is sketched further below).
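
For illustration, the capture side follows roughly this pattern (a simplified sketch, not my exact code: the pin numbers are placeholders, and it assumes a mono I2S MEMS microphone with the legacy ESP-IDF I2S driver as shipped in Arduino-ESP32 2.x):

```cpp
#include <Arduino.h>
#include <SD.h>
#include <driver/i2s.h>

// Hypothetical wiring -- adjust to your board.
static const int PIN_I2S_BCK = 41;
static const int PIN_I2S_WS  = 42;
static const int PIN_I2S_DIN = 2;
static const int PIN_SD_CS   = 5;

static const uint32_t SAMPLE_RATE = 16000;

void setup() {
  Serial.begin(115200);
  if (!SD.begin(PIN_SD_CS)) { Serial.println("SD init failed"); return; }

  // Legacy I2S driver configuration for a mono MEMS microphone.
  i2s_config_t cfg = {};
  cfg.mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX);
  cfg.sample_rate = SAMPLE_RATE;
  cfg.bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT;
  cfg.channel_format = I2S_CHANNEL_FMT_ONLY_LEFT;
  cfg.communication_format = I2S_COMM_FORMAT_STAND_I2S;
  cfg.dma_buf_count = 4;
  cfg.dma_buf_len = 512;

  i2s_pin_config_t pins = {};
  pins.mck_io_num = I2S_PIN_NO_CHANGE;
  pins.bck_io_num = PIN_I2S_BCK;
  pins.ws_io_num = PIN_I2S_WS;
  pins.data_out_num = I2S_PIN_NO_CHANGE;
  pins.data_in_num = PIN_I2S_DIN;

  i2s_driver_install(I2S_NUM_0, &cfg, 0, NULL);
  i2s_set_pin(I2S_NUM_0, &pins);

  // Record ~2 s of raw 16-bit PCM straight to the SD card.
  File f = SD.open("/sample.raw", FILE_WRITE);
  int16_t buf[512];
  size_t total = 0;
  const size_t want = SAMPLE_RATE * 2 * sizeof(int16_t);
  while (total < want) {
    size_t got = 0;
    i2s_read(I2S_NUM_0, buf, sizeof(buf), &got, portMAX_DELAY);
    f.write((const uint8_t *)buf, got);
    total += got;
  }
  f.close();
  Serial.println("recording done");
}

void loop() {}
```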

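As a rough sanity check on where the ~4000 features come from (only the frame length and stride are my actual settings; the 1000 ms window and 20 cepstral coefficients below are assumptions to make the arithmetic concrete):

```cpp
#include <cstdio>

int main() {
  const double window_s  = 1.0;     // assumed inference window (1000 ms)
  const double frame_len = 0.0125;  // frame length from Studio (s)
  const double stride    = 0.005;   // frame stride from Studio (s)
  const int    coeffs    = 20;      // assumed number of MFCC coefficients

  // Number of frames that fit in the window at this stride.
  const int frames = (int)((window_s - frame_len) / stride) + 1;  // 198
  std::printf("%d frames x %d coefficients = %d features\n",
              frames, coeffs, frames * coeffs);                   // 3960
  return 0;
}
```
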
It works, but classification is very poor: one melody is sometimes detected, but usually the wrong one comes out, and the accuracy scores are never good.
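
For reference, the matched inference loop follows roughly the standard Edge Impulse run_classifier pattern (a sketch, not my exact code; the include name and the capture_audio() helper are placeholders for the project-specific parts):

```cpp
// The include name depends on your Studio project; capture_audio() is a
// hypothetical helper that fills the buffer using the same I2S
// configuration as the dataset recorder.
#include <your_project_inferencing.h>

extern void capture_audio(int16_t *buf, size_t n);

static int16_t audio[EI_CLASSIFIER_RAW_SAMPLE_COUNT];

// Callback the SDK uses to pull samples from the buffer as floats.
static int get_audio_data(size_t offset, size_t length, float *out_ptr) {
  return numpy::int16_to_float(audio + offset, out_ptr, length);
}

void setup() {
  Serial.begin(115200);
}

void loop() {
  capture_audio(audio, EI_CLASSIFIER_RAW_SAMPLE_COUNT);

  signal_t signal;
  signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
  signal.get_data = &get_audio_data;

  ei_impulse_result_t result;
  if (run_classifier(&signal, &result, false) == EI_IMPULSE_OK) {
    for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
      ei_printf("%s: %.2f\n", result.classification[ix].label,
                result.classification[ix].value);
    }
  }
}
```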

Using the model on my mobile phone, the results are also very bad.

Is this a case of overfitting? My instinct says these sounds should be easy to separate.
I thought 45 recordings per melody would be enough. Help?

Environment:

  • Platform: UM Feather ESP32-S3

Using the web platform to train and deploy as an Arduino library.

After long experimentation, I solved the main issue.

For anyone interested or running into the same problem: my sounds/melodies are generated synthetically, and I built a dataset recorder that plays the melody while recording it along with some ambient noise; I then bring the recorder to different environments.

The main reason it worked well in Studio but not when deployed was that each sound always started at the same time in each recording. While the tests scored really high, this did not translate at all to the embedded device, which uses a moving window for inference.

I have changed the recorder setup so it adds some random space (300-400 ms) before and after each recording; each melody is about 1-2 s long.
This creates a much better fit for sliding-window inference. Using a window size of 1000 ms with a stride of 200 ms, I get good cropping / data augmentation that also works well with the sliding window (see the sketch below).
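
The recorder change itself is tiny; conceptually it is just random pre/post padding around each playback (a sketch with hypothetical helper functions standing in for the real recorder code):

```cpp
#include <Arduino.h>

// recordStart()/recordStop()/playMelody() are hypothetical stand-ins
// for the actual recorder code; the random padding is the point here.
extern void recordStart(const char *filename);
extern void recordStop();
extern void playMelody(int melodyId);

void recordOneSample(int melodyId, const char *filename) {
  recordStart(filename);

  // 300-400 ms of random ambient-only lead-in, so the melody onset
  // lands at a different offset in every recording...
  delay(random(300, 401));
  playMelody(melodyId);      // the melody itself is ~1-2 s long
  delay(random(300, 401));   // ...and random tail padding as well.

  recordStop();
}
```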

greets
