Mixing target sounds with background audio at a relative volume

Hi, I'm working on a project to detect a target sound versus background noise. Based on some videos and suggestions here, I've heard it's good to normalize the sounds, meaning bring all samples to a similar average volume, and then to mix background noise into the target files for better inference in real-world situations.

So, taking the latter topic first, I assume we don't want to mix background into the target sounds at the same volume, since it is, after all, intended to be background. What's a reasonable volume ratio here?

Then, I was wondering about a preprocessing shortcut. I have some longish target and background recordings. If I split them into ~3 second segments, I can measure the average volume of each small file. Then, rather than normalizing the sounds and mixing them at some recommended volume ratio, I could simply select a subset of segments that are already at the recommended ratio and just mix those.
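Something like the following is what I have in mind for the split-and-measure step (just a sketch, assuming numpy/soundfile are available; the file name is made up):

```python
# Rough sketch: split a long WAV into ~3 s chunks and report each chunk's
# RMS level in dBFS, assuming mono (or mixed-down) float data via soundfile.
import numpy as np
import soundfile as sf

def split_and_measure(path, segment_s=3.0):
    data, sr = sf.read(path)
    if data.ndim > 1:                  # mix stereo down to mono for the level estimate
        data = data.mean(axis=1)
    seg_len = int(segment_s * sr)
    results = []
    for i in range(0, len(data) - seg_len + 1, seg_len):
        seg = data[i:i + seg_len]
        rms = np.sqrt(np.mean(seg ** 2))
        db = 20 * np.log10(max(rms, 1e-12))    # dBFS relative to full scale (1.0)
        out_name = f"{path.rsplit('.', 1)[0]}_{i // seg_len:04d}.wav"
        sf.write(out_name, seg, sr)            # save the ~3 s segment
        results.append((i / sr, db))
    return results

# Example: print per-segment levels for one hypothetical file
for start, db in split_and_measure("target_long.wav"):
    print(f"{start:7.1f} s  {db:6.1f} dBFS")
```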

At this point I have thousands of little files, but it's not likely Edge Impulse will let me build a model from all of them. So if the recommended mix ratio is, say, 70/30, could I just conveniently choose a set of target segments whose mean volume is ~3 dB higher (roughly double the power) than the mean background volume?
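For reference, here is the quick dB arithmetic I'm relying on; the answer depends on whether the 70/30 split is treated as an amplitude ratio or a power ratio (just a sanity-check snippet, not from any Edge Impulse docs):

```python
# Quick check of the dB arithmetic behind "70/30".
import math

amp_ratio = 0.7 / 0.3
print(20 * math.log10(amp_ratio))   # ~7.4 dB if 70/30 is an amplitude ratio
print(10 * math.log10(amp_ratio))   # ~3.7 dB if 70/30 is a power/energy ratio
print(10 * math.log10(2))           # +3 dB = double the power (double amplitude is +6 dB)
```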

Hope this makes sense!

Hi @braddo,

If it helps, I have a mixing script that does what you are looking for here: https://github.com/ShawnHymel/ei-keyword-spotting/blob/master/ei-audio-dataset-curation.ipynb. It does background audio file sampling/mixing, but no normalization. Keep in mind that if you normalize during training, you will likely need to normalize during inference.

I use a 90/10 volume ratio for sound/background. Feel free to play with that. I doubt Edge Impulse will implement such a feature, but it's a good suggestion. For now, I recommend doing your mixing via a script to generate new data.
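In case a sketch helps, the core of that kind of mixing step looks roughly like this (a simplified sketch, not the notebook itself; the file names are placeholders, and it assumes mono WAVs at the same sample rate):

```python
# Minimal sketch: mix a background clip under a target clip at a 90/10
# amplitude ratio and write the result out as new training data.
import numpy as np
import soundfile as sf

def mix(target_path, background_path, out_path, target_gain=0.9, bg_gain=0.1):
    target, sr = sf.read(target_path)
    bg, bg_sr = sf.read(background_path)
    assert sr == bg_sr, "resample first if the sample rates differ"
    bg = np.resize(bg, target.shape)          # loop/trim background to the target length
    mixed = target_gain * target + bg_gain * bg
    mixed = np.clip(mixed, -1.0, 1.0)         # guard against clipping
    sf.write(out_path, mixed, sr)

mix("target_0001.wav", "background_0001.wav", "mixed_0001.wav")
```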

Thank you Shawn, I have seen your script on your GitHub from one of your video series. In the end, I didn't mix backgrounds in, because the sounds already contain various backgrounds. I did normalize, because the "found" sounds were all at different volumes, some already with peaks aligned to 0 dB and some in need of a boost. Ultimately they should be aligned with the intended capture device, but I wanted to try with my found sounds first to get a feel for the process. I found ffmpeg to be a great tool here, although writing simple scripts was harder than expected: plenty of examples exist, but the batch-processing syntax only works in certain shells. In the end I had to fall back on Excel for a few tasks, generating a 1000-line batch script with the file name substituted into each line. Whatever works!
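For anyone hitting the same shell-syntax problem, one shell-agnostic alternative is to drive ffmpeg from a short Python loop instead of a generated batch file. This is just a sketch: the folder names are made up, and it uses ffmpeg's loudnorm (loudness) filter rather than the peak-to-0 dB approach I actually used, which would need a volumedetect pass first to measure the gain:

```python
# Sketch: batch-normalize WAVs by looping in Python and calling ffmpeg per file,
# avoiding shell-specific batch syntax entirely.
import pathlib
import subprocess

src = pathlib.Path("raw_sounds")         # hypothetical input folder
dst = pathlib.Path("normalized_sounds")  # hypothetical output folder
dst.mkdir(exist_ok=True)

for wav in sorted(src.glob("*.wav")):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav),
         "-af", "loudnorm=I=-23:TP=-2",   # target loudness / true-peak ceiling
         str(dst / wav.name)],
        check=True,
    )
```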
