Hi. I have some questions about using Edge Impulse for audio classification. Would you please help me answer the questions below?
- Since audio files may have different amplitudes, normalizing the audio for training/testing/validation and for inference seems like a good way to achieve better overall results.
Does Edge Impulse have an option to normalize audio for the training/testing/validation processes? If so, how do I enable it?
- I'll acquire audio for inference from a standard microphone (at the same sample rate as the audio files in my dataset). Are there any recommended pre-processing steps for the audio files before uploading them to Edge Impulse for training/testing/validation? For example, should I use mono audio, or apply any other specific pre-processing techniques to these files before using them as a dataset in Edge Impulse?
Thanks in advance.
If you are using some sort of frequency domain conversion (e.g. spectrogram, MFE, MFCC), you should not need to normalize your audio, as the process of converting to the frequency domain manages the normalization for you.
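That said, if you do want to normalize raw audio yourself before uploading (for example, for purely time-domain processing), simple peak normalization is usually enough. This is a generic sketch of the idea, not an Edge Impulse feature:

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale a list of float samples (-1.0..1.0) so the loudest
    sample lands at target_peak. A silent clip is returned unchanged."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # all-zero clip: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

# Example: a quiet clip gets scaled up so its loudest sample is 0.9
quiet = [0.1, -0.3, 0.2]
print(peak_normalize(quiet))
```

Applying the same normalization to both your dataset and your inference-time buffers keeps the two distributions consistent, which is the main point of normalizing in the first place.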
For preprocessing, I recommend the following:
- Stick with mono (unless there is a specific reason you need stereo)
- Use the lowest sampling rate that meets your needs. If you are deploying to a microcontroller, many I2S and PDM mics have a fixed sampling rate, so go with that.
- I highly recommend doing some data augmentation, especially for keyword spotting. I have an example notebook that does augmentation for keyword spotting here (ei-keyword-spotting/ei-audio-dataset-curation.ipynb at master · ShawnHymel/ei-keyword-spotting · GitHub). Feel free to use that as a starting point.
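As a sketch of the mono recommendation above: a stereo recording can be downmixed by averaging the left and right channels. This hypothetical helper assumes interleaved integer PCM samples (as you'd read from a 16-bit WAV):

```python
def stereo_to_mono(interleaved):
    """Downmix interleaved stereo samples [L0, R0, L1, R1, ...]
    to mono by averaging each L/R pair of integer PCM samples."""
    assert len(interleaved) % 2 == 0, "expected an even number of samples"
    return [(interleaved[i] + interleaved[i + 1]) // 2
            for i in range(0, len(interleaved), 2)]

# Example: two stereo frames become two mono samples
print(stereo_to_mono([100, 200, -50, -150]))  # [150, -100]
```

In practice a tool like sox or ffmpeg does the same downmix (plus resampling) in one command, but the averaging above is all that "convert to mono" means.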
@shawn_edgeimpulse , thank you very much for the information and notebook!
@shawn_edgeimpulse , I have one more question, if you'll allow me, of course.
I'm planning to detect gunshot sounds. Regarding microphone sample rate, I'll use common microphones with an analog output, which lets me read their signals with the microcontroller's internal ADC. I plan to use an ESP32, which supports sample rates up to 200 ksps (according to what I've read). Assuming I'll have enough RAM to buffer the audio windows, I'm not sure which sample rate would be adequate.
In that case, which .wav sample rate would you recommend? I've converted some of the .wav files in my dataset to an 8000 Hz sample rate, and to my ears the major sound features remained, but I'd like your opinion/feedback on it.
Thanks in advance.
Human voice is generally in the 300-3000 Hz range. Per the Nyquist criterion, you need to sample at more than 2x the maximum expected frequency to be able to reconstruct (and/or analyze) the signal without aliasing. So, for human voice, that means a sampling rate above 6 kHz. We often add a ~2 kHz buffer to arrive at the common 8 kHz sampling rate, the bare minimum for recording voice in digital formats (e.g. .wav).
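The arithmetic above can be written out directly. Note the 2 kHz guard band is a practical convention (it yields the common 8 kHz telephony rate), not part of the Nyquist criterion itself:

```python
def min_sample_rate(max_freq_hz, guard_hz=2000):
    """Minimum practical sampling rate: more than 2x the highest
    frequency of interest (Nyquist), plus a guard band."""
    return 2 * max_freq_hz + guard_hz

print(min_sample_rate(3000))   # 8000: the common 8 kHz voice rate
print(min_sample_rate(22050))  # roughly CD-quality territory
```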
Gunshots contain a lot of broadband energy, often in excess of 40-50 kHz (see this paper for a spectrogram). So, to fully capture and analyze a gunshot, you're going to need a very high sampling rate; something like 100 kHz would be ideal.
Now, you can start making compromises. If you sample at 16 kHz, you're going to lose a lot of information, and gunshots will produce spectrograms that look a lot like other loud "bang" noises (e.g. fireworks). As you raise the sampling rate (e.g. up to a common 44.1 kHz), you'll capture more detail at the higher frequencies (at 44.1 kHz, you can capture information up to 22.05 kHz). This will make your dataset better, as it contains more detail, but you'll need more processing power.
Frequency vs. processing power is a tradeoff you will have to consider very carefully. My advice is to capture the sounds using a high-quality, high-sample-rate microphone (at least 44.1 kHz, preferably up to 100 kHz), and make sure you also capture things that merely sound like gunshots, so that you can train the model to classify those similar sounds as non-gunshots.
From there, you can see if a neural network can identify the difference at higher sample rates. Then, downsample your dataset to something more reasonable (e.g. 44.1 kHz, then 16 kHz) to see if your NN can still identify gunshot sounds. My prediction is that as you decrease your sample rate, your NN accuracy will also decrease (especially if you include "sounds like" samples, such as fireworks).
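To run that experiment you'd downsample your clips offline. Real resamplers (e.g. sox, librosa, or scipy.signal.decimate) apply a proper anti-aliasing low-pass filter first; the sketch below only illustrates the idea of integer-factor decimation, using a crude boxcar average as a stand-in for the filter (e.g. 48 kHz to 16 kHz is a factor of 3):

```python
def decimate(samples, factor):
    """Naive downsample by an integer factor: average each block of
    `factor` samples (a crude low-pass) and keep one value per block.
    Real tools use a proper anti-aliasing filter instead of averaging."""
    out = []
    for i in range(0, len(samples) - factor + 1, factor):
        block = samples[i:i + factor]
        out.append(sum(block) / factor)
    return out

sig = [0.0, 3.0, 6.0, 9.0, 12.0, 15.0]  # toy 6-sample signal
print(decimate(sig, 3))  # [3.0, 12.0]
```

For the actual dataset you'd want the filtered version (e.g. `sox in.wav -r 16000 out.wav`), but the workflow is the same: downsample, retrain, and compare accuracy at each rate.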
@shawn_edgeimpulse , thank you very much for the explanation!
Shawn, may I ask a question? Would this type of microphone help? It has a very high sample rate; I'm thinking of using it for noise tests. https://www.dodotronic.com/product/ultramic-um250k This type of microphone lets you listen to bats and other animals.
I didn’t find it here in Brazil, but I’m looking to buy one.
Yes, that looks like a good microphone for recording frequencies up to ~125kHz. I don’t have personal experience with it, but it’s probably worth trying it for things like recording gunshot sounds.
If I manage to buy one, I promise to come back here with noise test results!