Keyword Spotting: Should Background Noise Be Added to the Dataset?

I’ve set up a keyword matching impulse that works pretty well. After recording hundreds of samples, my keyword matching works great… if I’m sitting at my desk where I did the majority of the recordings. But when I move my device (ESP32S3) to a different location, such as outside, or even just into a situation with background noise (like the air conditioning running), the keyword matching accuracy plummets.

I was looking into expanding my dataset further when I ran across the EI documentation for generating synthetic (AI text-to-speech) datasets. It struck me that I may have been taking the wrong approach to my dataset. I had been attempting to make pristine audio samples, with high-quality microphones, noise filtered out, pop filters, etc. But the synthetic dataset tutorial indicates that background noise might actually be exactly what’s missing from my dataset.

Quoting that article:

Finally, we iterate through all the options generated, call the Google TTS API to generate the desired sample, and apply noise to it, saving locally with metadata.

It then goes on to demonstrate that the “noisy” samples are the ones ultimately uploaded to the model, rather than the clean keyword samples.

So my big question: Should I refactor my dataset to add background noise to all my samples, rather than trying to remove background noise?

The synthetic dataset tutorial has excellent code samples; I could use my existing dataset and add noise to all the samples (perhaps even multiple times with different background noises). But I want to verify that my hypothesis holds water before doing all that work.
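For context, something like the rough sketch below is what I have in mind, assuming 16 kHz mono WAV files and numpy/soundfile; the folder names, the noise file, and the 0.1 mixing gain are just placeholder assumptions, not values from the tutorial:

```python
# Rough sketch: mix a background-noise recording into each existing keyword
# sample. Assumes 16 kHz mono WAVs; paths and the 0.1 gain are placeholders.
from pathlib import Path

import numpy as np
import soundfile as sf

KEYWORD_DIR = Path("dataset/keyword")            # hypothetical clean samples
NOISE_FILE = Path("noise/air_conditioner.wav")   # hypothetical noise recording
OUT_DIR = Path("dataset/keyword_noisy")
OUT_DIR.mkdir(parents=True, exist_ok=True)

noise, noise_sr = sf.read(NOISE_FILE, dtype="float32")

for wav_path in sorted(KEYWORD_DIR.glob("*.wav")):
    audio, sr = sf.read(wav_path, dtype="float32")
    assert sr == noise_sr, "resample the noise file to match the samples first"

    # Repeat the noise if it is shorter than the keyword clip.
    if len(noise) < len(audio):
        noise_long = np.tile(noise, int(np.ceil(len(audio) / len(noise))))
    else:
        noise_long = noise

    # Take a random slice of noise the same length as the keyword sample.
    start = np.random.randint(0, len(noise_long) - len(audio) + 1)
    noise_slice = noise_long[start:start + len(audio)]

    # Mix at a fixed gain (tune this, or derive it from a target SNR),
    # then clip so the result stays within full scale.
    mixed = np.clip(audio + 0.1 * noise_slice, -1.0, 1.0)

    sf.write(OUT_DIR / wav_path.name, mixed, sr)
```

Running that once per noise file (air conditioning, traffic, etc.) would let me generate the multiple noisy variants I mentioned.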

For the record, I think my confusion stemmed from most of the getting-started tutorials, like the “Responding to your Voice” documentation and video, which seem to indicate that having clean samples is preferable.

The screenshots in that documentation show clearly separated keyword samples with zero noise between them.

I was under the impression that Edge Impulse combined the “noise” dataset with the desired keyword datasets automatically. Now, from the more advanced documentation, it seems that I was probably incorrect in assuming that was a “behind the scenes” capability of Edge Impulse.

Now that I’m looking for it, I’m finding more references to adding noise, such as in the documentation for pre-built-datasets.

This is a prebuilt dataset for a keyword spotting system based on a subset of data in the Google Speech Commands Dataset, with added noise from the Microsoft Scalable Noisy Speech Dataset.

Hi @quicksketch - apologies for the delayed response. I’m glad you found a resolution on your own!

In general, you want your training/testing data to match the data your model will see in your production environment. So if you plan to run your model in a noisy environment, you will want to introduce noise into your training/testing data instead of removing noise to create clean samples as you initially did.

Thank you for pointing out that your confusion came from that specific tutorial; I can understand why. Edge Impulse does automatically apply noise if the “Data augmentation” checkbox is selected under the “Audio training options” section when configuring your learning block.

I can’t believe I overlooked that. How interesting. So now that I’ve added background noise manually I should probably turn off the “Add noise” feature by setting it to “None”?

For reference, when setting “Add noise” to Low or High, does that pull samples from the noise dataset? Or is it literally “random noise” like static being added?

So now that I’ve added background noise manually I should probably turn off the “Add noise” feature by setting it to “None”?

Since you already added noise, the data augmentation in the learning block is likely not needed. A lot of machine learning comes down to empirical results, though. You can always train multiple models, one with data augmentation on and one with it off, to see which performs best for your particular situation.

For reference, when setting “Add noise” to Low or High, does that pull samples from the noise dataset? Or is it literally “random noise” like static being added?

Instead of applying noise directly to the raw audio samples, we apply “noise” to the features generated by the processing blocks, e.g. the spectrograms produced by the MFCC or MFE blocks, using a technique called SpecAugment. If you’re curious about the topic, you can find the paper that introduces the technique here: SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.

From the abstract: “The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps.”
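If it helps to see the masking idea concretely, here is a rough numpy sketch of the frequency- and time-masking steps (the paper’s time-warping step is omitted, and the mask widths are arbitrary example values, not what the Low/High settings actually use):

```python
import numpy as np

def spec_augment(spec, max_freq_mask=8, max_time_mask=10, rng=None):
    """Mask one random block of frequency channels and one block of time steps."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_time, n_freq = spec.shape

    # Frequency masking: zero a contiguous band of frequency channels.
    f = int(rng.integers(0, max_freq_mask + 1))
    f0 = int(rng.integers(0, max(1, n_freq - f)))
    spec[:, f0:f0 + f] = 0.0

    # Time masking: zero a contiguous block of time steps.
    t = int(rng.integers(0, max_time_mask + 1))
    t0 = int(rng.integers(0, max(1, n_time - t)))
    spec[t0:t0 + t, :] = 0.0

    return spec

# Example: augment a fake 61 x 40 MFE-style spectrogram (time steps x mel bands).
augmented = spec_augment(np.random.rand(61, 40).astype("float32"))
```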

Thanks @brianmcfadden for your responses.

A lot of machine learning comes down to empirical results

You’re right; I’ve really had mixed results. I saw a dramatic improvement by adding noise to almost all my samples (I downloaded the dataset, applied background noise via a Python script, then re-uploaded it). I couldn’t believe how much better it got. But when I tried expanding that approach by generating 3 samples per original recording, each with a different noise file, matching accuracy dropped substantially.

It’s surprising that when you find something that makes a big improvement, it only seems to work up until a certain point; then you have to find other ways to optimize. I’ll play with the “Data Augmentation” settings and see what gets better results, as you suggest.

Thanks again for your help!

One final comment regarding your last message.

Something to dive into is the concept of model capacity vs. problem complexity, and growing your dataset and model together. Essentially a larger model has more capacity to solve a more complex problem.

It could be that your model architecture had enough capacity to process your dataset with a single variation of noise. When you added additional variations and created more data samples, however, your problem became too complex for that particular architecture. Of course, I haven’t seen your architecture so there could be more to it. Regardless, it’s a concept worth knowing about.

A general approach is growing the dataset and model together. Start with a small dataset and increase the model size (capacity) to improve performance, up until the point where you start to see overfitting. At that point, you want to add more samples to your dataset so that the model generalizes better, but eventually you’ll reach a point where the model no longer has enough capacity and performance degrades. So you increase the model size… then repeat.
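As a purely illustrative sketch (not our actual Keras learning block), the “model size” knob can be as simple as a width multiplier on a small convolutional network; you bump it up once the smaller model starts to underfit the growing dataset:

```python
import tensorflow as tf

def build_kws_model(input_shape=(61, 40, 1), n_classes=3, width=1):
    """Small conv net for keyword spotting whose capacity scales with `width`."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(8 * width, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(16 * width, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

# Start small; when a larger dataset makes the small model underfit,
# retrain with width=2, width=3, ... and compare validation accuracy.
model = build_kws_model(width=1)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```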

That is really interesting information. Thank you again!

For what it’s worth, in my situation where background noise was added manually, disabling all Data Augmentation seemed to yield a marginally better result than having it on.