6. (MFCC) Conclusion
Although it reported an excellent training accuracy of 99.9%, when tested on other audio fragments it classified many “non-ring” sounds as “ring”. This makes it unsuitable for detecting “rings” due to too many false positives.
In response to this post, I will share the MFE results…
In this response you will find the outcome of testing with the MFE block instead of the MFCC block.
Note that I have used exactly the same set of training data and test data for this comparison.
6. (MFE) Conclusion
Although reporting a lower training accuracy (98.3%) compared to MFCC (99.9%), testing on other audio fragments gave better results:
slightly better accuracy (90.76%) compared to MFCC (88.49%)
but, more importantly: NO FALSE POSITIVES!
there are also some “false negatives” (especially for one ring audio fragment).
So the MFE outcome is certainly promising and more useful than MFCC (despite MFCC’s higher training accuracy).
@janvda Really hopeful results! You’re still overfitting by a bit, maybe a dropout layer of 0.1 after flatten would help? We’re also adding some data augmentation to audio data next week, that should help with smaller datasets.
Some other tips that would help on the inferencing side to make this robust:
If you collect a second of data, don’t classify once, but use the sliding window approach to classify a bunch of slices of the data.
If you see >70% ring windows this is a very strong indication that there was a ring.
Together that will probably get you a deployment with barely any false positives / negatives.
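The sliding-window voting idea above can be sketched in a few lines of Python. Here `window_labels` stands for the per-window outputs of the classifier (the classification function itself is not shown); the 70% threshold is the figure suggested above.

```python
def ring_detected(window_labels, threshold=0.7):
    """Decide 'ring' if more than `threshold` of the classified
    windows in the recording were labelled 'ring' (>70% rule)."""
    if not window_labels:
        return False
    ring_fraction = window_labels.count("ring") / len(window_labels)
    return ring_fraction > threshold

# Example: 4 of 5 windows classified as 'ring' (80% > 70%) -> detection
print(ring_detected(["ring", "ring", "other", "ring", "ring"]))  # True
```

This is only a sketch of the voting step; in a real deployment the labels would come from running inference on each slice of the one-second buffer.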
I think one of the issues is that there are in fact not really 2 classes but 4:
no sound (only background noise)
ring
sound X (e.g. person speaking / radio)
ring + sound X
So I think the issue is that the model categorizes “ring + sound X” as “other” when “sound X” is more “dominant” than the “ring” sound in the audio fragment.
I understand that one way to overcome this problem is to also train the network with all relevant combinations: “ring + sound X” and “sound X without ring”.
Yes, I understand. So one “ring” classification is not sufficient to conclude that the doorbell rang; for that you need at least 2 “ring” classifications in a row (or even more complex conditions, like at least 3 ring classifications in a series of 5 subsequent classifications).
Of course we must take into account that a ring can be short (a few hundred milliseconds long), so we must ensure that multiple classifications are done covering any period X (where X is the minimal ring duration we want to be able to recognize).
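The “k out of n subsequent classifications” rule described above can be sketched like this (a generic k-of-n sliding check over the label stream, not any library function):

```python
def doorbell_rings(labels, k=3, n=5):
    """True if any run of n consecutive classifications contains
    at least k 'ring' labels (the k-of-n rule, default 3 of 5)."""
    for i in range(len(labels) - n + 1):
        if labels[i:i + n].count("ring") >= k:
            return True
    return False

# 3 'ring' labels within one run of 5 -> detection
print(doorbell_rings(["other", "ring", "ring", "other", "ring", "other"]))  # True
```

Setting `k=2, n=2` gives the simpler “2 in a row” condition mentioned first.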
@janvda Yeah, e.g. take a 600 ms recording, then classifying 500 ms windows with a 20 ms frame stride would give you the 5 classifications. Based on your dataset it looks like that would work.
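The number of classifications follows from simple window arithmetic, sketched below. Note that, strictly, a 600 ms recording with 500 ms windows and a 20 ms stride yields 6 windows (at offsets 0, 20, …, 100 ms); a 25 ms stride gives exactly 5.

```python
def num_windows(recording_ms, window_ms, stride_ms):
    """Number of full windows of length `window_ms` that fit in a
    recording, stepping by `stride_ms` each time."""
    if recording_ms < window_ms:
        return 0
    return (recording_ms - window_ms) // stride_ms + 1

print(num_windows(600, 500, 20))  # 6 windows at 0, 20, ..., 100 ms
print(num_windows(600, 500, 25))  # 5 windows at 0, 25, ..., 100 ms
```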