Multiple Audio Snippet Recognition Capabilities?

bklein · September 15, 2020, 6:25pm

I watched the audio recognition video using the running faucet example. My interest is in metal detector target recognition. Various target metal characteristics result in a series of audio classifications, usually identified by audio frequency. Some targets have variations - like a bobby pin or nail would have a double blip. With some detectors, different metals are identified with different audio frequencies. My question is, what type of processing power and what kind of delay is required to identify one of many possible target sounds? The video just showed one sound vs. background noise. It didn’t really discuss the processing delays needed or whether multiple sound IDs are even possible/practical with this technique.
Many detectorists are getting hard of hearing and would appreciate something perhaps like a light or handle vibration that would indicate the target ID in real time.

janjongboom · September 16, 2020, 9:44am

Hi @bklein, the same principles as in the video also apply to multi-class audio. Maybe you’ll need to build a larger neural network to accurately predict the outcomes, but that requires a bit of toying around. We’ve recently added cropping and splitting of data to the studio (https://www.edgeimpulse.com/blog/crop-split-data/) which should make it easier to detect discrete events like the blips.

Maybe @dansitu also has something to add, he’s been busy on the audio pipeline!

dansitu · September 16, 2020, 11:57pm

Hi @bklein,

This is a great application and I’d love to see Edge Impulse working for detectorists!

If I understand correctly, it sounds like you’re interested in training a model to recognize more than one class of sound in addition to background noise. This is absolutely possible, and I encourage you to experiment. You’ll have to think about the following:

As a rule, the more classes you add the larger your model will need to be in order to perform well. You can experiment with adding and removing layers and filters to find the ideal balance between accuracy and model size.
If your model is having trouble distinguishing between subtly different sounds, you might need to add more resolution to the MFCC block by increasing the number of coefficients or reducing the frame length/stride.
For balanced training and evaluation, try to collect an equal number of training and testing samples for each class of sound.
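
To get a feel for the MFCC trade-off mentioned above, here is a rough sketch in Python of how the number of coefficients and the frame length/stride change the amount of data fed to the model. The function name `mfcc_feature_shape` and the example values (a 1-second window, 20 ms frames, 13 or 20 coefficients) are assumptions chosen for illustration, and the frame count assumes no padding at the end of the window, so the exact numbers produced by the studio may differ.

```python
def mfcc_feature_shape(window_ms, frame_length_ms, frame_stride_ms, num_coefficients):
    """Estimate (frames, coefficients) for one audio window.

    Hypothetical helper: assumes frames are taken every `frame_stride_ms`
    and that only frames fitting entirely inside the window are kept
    (no padding).
    """
    if window_ms < frame_length_ms:
        raise ValueError("window must be at least one frame long")
    num_frames = (window_ms - frame_length_ms) // frame_stride_ms + 1
    return num_frames, num_coefficients


# Illustrative settings only, all durations in milliseconds.
baseline = mfcc_feature_shape(1000, 20, 20, 13)  # coarser time resolution
finer = mfcc_feature_shape(1000, 20, 10, 20)     # shorter stride, more coefficients

for name, (frames, coeffs) in (("baseline", baseline), ("finer", finer)):
    print(f"{name}: {frames} frames x {coeffs} coefficients = {frames * coeffs} features")
```

Under these assumptions, halving the stride and raising the coefficient count roughly triples the number of input features, which gives the model more detail to separate similar blips but also means more processing per window.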

Try some things out and let me know how it goes—this is an area I’m personally interested in, and I’m happy to help with some pointers along the way.

Warmly,
Dan