Hello, I have been experimenting with edge impulse multi-labeling (Multi-label - Edge Impulse Documentation)
But I’m getting terrible results with an audio event classifier that works perfectly with exactly the same data, model, and window size when I use separate audio samples instead.
Multi-labeling:
Error: 27.3% (215 / 787) Actual label: sneeze. Predicted label: other
Separate samples:
Error: 1.1% (2 / 174) Actual label: sneeze. Predicted label: other
I analyzed the results and found that multi-labeling produces wrong statistics because it mislabels windows under every configuration.
I will explain the problem with each setting:
- Use label at the end of the window
  Docs: "works well for scenarios where the primary interest lies in the resulting state or activity of the window, such as recognizing sustained motions or transitions"
  This results in many misclassifications, because the window is labelled by how it ends. Even a small amount of the next class at the end of the window relabels the whole window to that event.
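To make the failure mode concrete, here is an illustrative sketch (my own simplification, not Edge Impulse internals) where a window is represented as a list of per-sample labels and the whole window takes the label of its final sample:

```python
# Sketch of "use label at the end of the window": the whole window is
# assigned whatever label its last sample carries.
def label_end_of_window(window_labels):
    """window_labels: per-sample labels across one window, in time order."""
    return window_labels[-1]

# A window that is 95% "sneeze" but ends with a few "other" samples
# is labelled "other", which then counts as a sneeze misclassification
# in the statistics even if the model predicted sneeze correctly.
window = ["sneeze"] * 95 + ["other"] * 5
print(label_end_of_window(window))  # -> other
```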
- Use label X if anywhere present in the window
  Docs: "useful for detecting short or sparse events that may not occupy the full window but are critical to capture when they occur"
  - When I select "sneeze": any window containing even a tiny amount of sneeze is labelled sneeze (5/95, 10/90, 20/80 …)
  - When I select "other": any window containing even a tiny amount of other is labelled other (5/95, 10/90, 20/80 …)
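The same kind of sketch (again my own simplification, with hypothetical names) shows why "anywhere present" is just as blunt: a single matching sample anywhere in the window is enough to claim the whole window.

```python
# Sketch of "use label X if anywhere present in the window": the window
# is labelled X if X appears at all, regardless of how little of it there is.
def label_if_anywhere(window_labels, target, fallback="other"):
    """Assign `target` if it appears anywhere in the window, else `fallback`."""
    return target if target in window_labels else fallback

# A window that is only 5% sneeze is still labelled "sneeze", so a model
# that (reasonably) predicts "other" is counted as wrong.
window = ["other"] * 95 + ["sneeze"] * 5
print(label_if_anywhere(window, "sneeze"))  # -> sneeze
```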
So no matter what I select, I get a devastating amount of misclassification. I understand that I can manually inspect whether the model still works, but I need the statistics to be reliable so I can evaluate models by the numbers and not by feeling.
My suggestion:
Instead of a coarse ON/OFF checkbox, allow the user to enter a percentage that a label must occupy within the window for the window to be classified as that label.
I would try setting 70% for sneeze: sneeze would have to occupy at least 70% of the window for the window to qualify as sneeze. I'm not sure it fixes the issue, but I think it would. The checkbox method essentially sets this value to either 0% or 100%, which leads to misclassification either way.
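The suggested behaviour can be sketched like this (a minimal illustration of the proposal, with hypothetical names; the 70% threshold is the example value from above):

```python
# Sketch of the proposed threshold-based labelling: assign `target` only
# if it occupies at least `threshold` of the window's samples.
def label_by_fraction(window_labels, target, threshold=0.70, fallback="other"):
    frac = window_labels.count(target) / len(window_labels)
    return target if frac >= threshold else fallback

# 80% sneeze clears the 70% threshold; 20% sneeze does not.
print(label_by_fraction(["sneeze"] * 80 + ["other"] * 20, "sneeze"))  # -> sneeze
print(label_by_fraction(["sneeze"] * 20 + ["other"] * 80, "sneeze"))  # -> other
```

The existing checkboxes then correspond to the two degenerate thresholds: "anywhere present" behaves like threshold just above 0%, and requiring the label at 100% of the window is the other extreme.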
Edit:
I noticed I can check both checkboxes simultaneously. I assume that means a 50/50 chance of classifying as either one. That improved the results, but I manually checked the errors and they are still mostly mislabeling rather than genuine model mistakes.
Error: 9.3% (38 / 407) Actual label: sneeze. Predicted label: other