Performance calibration provides an option for “Simulated real world audio: Samples taken from your testing dataset and layered on top of artificial background noise”.
But that noise cannot be configured.
In my case, my training and real use-case audio volume is very low. The default artificial background noise completely overpowers my signal, which makes this calibration impossible to use.
I would like the ability to disable the noise in calibration, or to manually adjust the noise level.
@tuoman Thanks for the feedback. If you need to work around this in your project, it is possible to upload your own representative background noise audio to use. Apart from letting you calibrate against noise that matches the level of your input samples, recording some real-world noise from an environment where you expect your model to run is likely to yield even better results.
That workaround is also not as convenient for quick tests, as I must manually label ~10 minutes of data.
I can also normalize all my audio data, which solves the issue (and which I have already done), but that is not very convenient either.
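For reference, this is roughly what I mean by normalizing: a minimal sketch using numpy and soundfile, where the file names and the target level are just placeholders.

```python
import numpy as np
import soundfile as sf

def rms_normalize(path_in, path_out, target_dbfs=-20.0):
    """Scale a clip so its RMS level reaches a fixed target (placeholder value)."""
    audio, sr = sf.read(path_in)                 # float samples in [-1, 1]
    rms = np.sqrt(np.mean(audio ** 2))
    if rms > 0:
        gain = 10 ** (target_dbfs / 20.0) / rms  # gain needed to hit the target RMS
        audio = np.clip(audio * gain, -1.0, 1.0)
    sf.write(path_out, audio, sr)

# Placeholder paths, just to show the intended usage.
rms_normalize("raw/clip_001.wav", "normalized/clip_001.wav")
```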
I think performance calibration without noise is a valuable feature, as noise resilience and spurious detections are two different problems.
I currently have excellent noise resilience, but I’m dealing with a spurious detection issue at a smaller window size.
With a 500ms window my detections are solid, but I cannot count events accurately: the events occur with 50-100ms spacing, so a single 500ms window can span multiple events (roughly 5-10 of them).
I tried to solve that with a smaller 200ms window that just fits the events, but now my detections are very spurious.
If you had some samples with these closely spaced events in your test set, it would enable you to check accuracy and the spurious detections over more representative samples, and to experiment with the EON Tuner to see what might improve this. Performance calibration is very useful for tuning the post-processing parameters (suppression period, averaging, confidence score acceptance), but more generally the multi-label approach can help you tune the model itself before getting to post-processing.
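To illustrate what those post-processing parameters do, here is a rough sketch of the idea in plain Python. This is not the actual Performance Calibration implementation, and the parameter values and scores are only examples:

```python
def post_process(confidences, stride_ms=100, avg_n=4, threshold=0.8, suppression_ms=500):
    """Smooth per-window confidence scores with a moving average, fire a
    detection when the average crosses the acceptance threshold, then
    suppress further detections for the suppression period.
    A rough sketch only, not the actual Performance Calibration code."""
    detections = []
    last_fire_ms = -suppression_ms  # allow a detection at t = 0
    for i, _ in enumerate(confidences):
        t_ms = i * stride_ms  # one score per classifier step
        avg = sum(confidences[max(0, i - avg_n + 1):i + 1]) / min(avg_n, i + 1)
        if avg >= threshold and t_ms - last_fire_ms >= suppression_ms:
            detections.append(t_ms)
            last_fire_ms = t_ms
    return detections

# Made-up scores from a continuous run, one per 100ms step.
# Prints [500]: a single detection once the smoothed score stays high.
print(post_process([0.1, 0.2, 0.9, 0.95, 0.97, 0.9, 0.3, 0.1]))
```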
The model performs very well in Model testing: accuracy 99.10%, F1-score 0.99, precision 0.99, recall 0.99, false positive error 0.6% (3 / 467).
I can easily reach >99% with a 200ms or 500ms window.
But when I move to performance calibration with a 200ms window, I get far worse results:
Mean FAR: 11.7%, Mean FRR: 17.2%
With a 500ms window I get excellent scores: Mean FAR: 0.7%, Mean FRR: 2.9%.
Same data; the only difference is the window size.
I assume this is because Performance calibration uses continuous classification instead of isolated windows, which is a more difficult task with a shorter window.
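My mental model of the difference is roughly this (a sketch only, not how the SDK or Studio actually windows the stream; the sample rate and timings are placeholders):

```python
import numpy as np

def sliding_windows(signal, sr, window_ms=200, stride_ms=100):
    """Yield overlapping windows the way a continuous stream is classified,
    instead of the isolated, pre-cut clips used in Model testing."""
    win = int(sr * window_ms / 1000)
    hop = int(sr * stride_ms / 1000)
    for start in range(0, len(signal) - win + 1, hop):
        yield signal[start:start + win]

# With a 200ms window and 100ms stride, a 50-100ms event only ever fills a
# small part of each window, so every individual decision has less context
# than a single 500ms window would have.
stream = np.zeros(16000)  # placeholder: 1 second of audio at 16 kHz
windows = list(sliding_windows(stream, sr=16000))
```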
As mentioned above, the multi-label approach for your testing data could help with testing this more easily: with a multi-label sample you’re able to test with the continuous classification windowing that you’d get in real life.
What is your use-case? Are you trying to detect an event (e.g. a keyword) or a continuous noise (e.g. a fire alarm ringing)? If it’s discrete events, you’ve got to be careful with choosing your window size, window increase, and also whether you’ve set the “Zero-Pad Samples” setting. That last one adds zeroes to your data if a sample isn’t long enough to fill the full window length, which can be useful sometimes, but for continuous use-cases it means you’ll be training your model to sometimes see silence at the end of a sample even though in real life there isn’t any. This can cause real-world accuracy problems.
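As a rough illustration of the Zero-Pad Samples idea (not the actual DSP block code; the values are placeholders):

```python
import numpy as np

def zero_pad(sample, sr, window_ms=500):
    """Pad a short clip with zeros so it fills the full window length.
    A rough illustration only, not the actual processing block code."""
    win = int(sr * window_ms / 1000)
    if len(sample) < win:
        pad = np.zeros(win - len(sample), dtype=sample.dtype)
        sample = np.concatenate([sample, pad])  # silence appended at the end
    return sample

# A 200ms event padded to a 500ms window ends with 300ms of silence that the
# model learns to expect, even though a live stream never contains it.
padded = zero_pad(np.zeros(3200), sr=16000)  # 200ms at 16 kHz -> 500ms
```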
My audio events are 50-200ms long. I use only one window size per project; e.g. if I use a 500ms window, all of the samples are labelled as 500ms clips. The stride is (window size) / 2.
I posted the results for the different window sizes in my previous message.
I can try the multi-label approach, but I doubt it will fix this. I don’t see why normal labelling works fine with a 500ms window but not with a 200ms window.
But I do see why a smaller window produces less reliable results in continuous classification. The long window works perfectly, but I cannot use it because it cannot count accurately. The short window can count, but it is too unreliable.
But I will trust your advice. It will take a while, as I need to relabel 1 hour of training material; I have already manually split the labelled data into 500ms windows.
It may be good enough to start by creating one single, longer multi-label sample rather than spending a long time relabelling the entire dataset. Find a section that shows the issue you’re facing, with multiple events in a short period of time, and upload it to your test set, then view it in the “Live Classification” area and you’ll see whether each model variant detects the events correctly. I’m not sure that using multi-label for your training dataset will improve things, but using it for some test data will give you a better picture of which model works better in the real world.