Wake word: add silence requirement

Hi, I am new to edge ML and am simply trying to build a wake word system.

I find one glaring problem with wake words, even "OK Google".

If I say "OK Google" as a request, I pause before and after the request.
However, if I tell someone "OK Google is stupid" or "my wake word is OK Google",
this will also fire the wake word.
I can see two ways to get around this:

1. Record lots of phrases like "is wakeword", "wakeword is", and similar terms, then tag them as incorrect.
2. The easy way: require non-speech (i.e. silence or background noise) for a short period before and after the wake word.

Would simply increasing the window size be the best way to implement the second approach?

Thanks :slight_smile:

Hi @greg_dickson, the way we do it is to add some gating logic, e.g. look for a pattern like:

[ noise … wakeword … noise ]

where the two noise classes need to be seen in the last 3 seconds or so.

Some example code is here: https://github.com/edgeimpulse/example-standalone-inferencing-c/blob/b175eead42ec7cdde89749360d2355903b277ac1/source/main.c#L288 (not for the faint of heart :wink: ). @dansitu and team are working on something to better profile and configure this type of post-processing from the Studio.
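To make the idea concrete, here is a minimal sketch of that kind of gating logic in C. The class indices, history length, and inference rate are all hypothetical (your model's label order will differ), and the real linked example is more involved; this just shows a wakeword detection being accepted only when it is bracketed by noise in the recent history:

```c
#include <stdbool.h>

/* Hypothetical class indices -- your model's label order may differ. */
enum { CLS_NOISE = 0, CLS_WAKEWORD = 1, CLS_UNKNOWN = 2 };

#define HISTORY_LEN 12  /* ~3 s of history at one inference per 250 ms */

static int history[HISTORY_LEN];
static int history_count = 0;

/* Record the top class of the latest inference window and report whether
 * the recent history contains noise -> wakeword -> noise, in order. */
bool gated_wakeword(int top_class)
{
    /* Shift the history left and append the newest result. */
    if (history_count < HISTORY_LEN) {
        history[history_count++] = top_class;
    } else {
        for (int i = 1; i < HISTORY_LEN; i++)
            history[i - 1] = history[i];
        history[HISTORY_LEN - 1] = top_class;
    }

    /* Tiny state machine over the history buffer:
     * 0: want noise, 1: want wakeword, 2: want trailing noise, 3: fire. */
    int state = 0;
    for (int i = 0; i < history_count; i++) {
        if (state == 0 && history[i] == CLS_NOISE)         state = 1;
        else if (state == 1 && history[i] == CLS_WAKEWORD) state = 2;
        else if (state == 2 && history[i] == CLS_NOISE)    state = 3;
    }
    return state == 3;
}
```

Calling this once per inference window means a bare "wakeword" with no leading noise never fires, while noise, wakeword, noise in sequence does.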

Thanks Jan ( @janjongboom )
I’ll check it out.
As long as it's not Python, I'm happy :astonished:

I guess this could be used to combine two wake words, like Alexa and stop, into
[noise … Alexa stop … noise] as well,
allowing the wake word to be easily extended to simple combinations.

Thanks for the link.

Hi @greg_dickson! This is definitely a tricky problem—there are tons of ways the wakeword could be interleaved with other audio, whether it’s background noise or other words by the same speaker. In addition to Jan’s suggestion, here are a couple of techniques used in production deployments:

  • When a wakeword is identified by your MCU, wake up a larger processor and perform speech recognition on the surrounding few seconds of audio (maybe even via a cloud service) to obtain a transcription, then use a natural language processing model to understand whether the user was intentionally invoking the system.

  • Use a speaker recognition model to identify who was speaking so that you can better understand whether a wakeword was spoken in isolation (vs. on top of other speakers' voices).

Thanks! I'm using Rhasspy at the moment, which handles the second stage using Kaldi or DeepSpeech.
However, doing as you suggested for the couple of seconds after the wake word is triggered, to make sure no other word was spoken, would not be so hard.
I am using AEC to cut most of the background noise, so that helps.

Is there a large performance hit from extending the window to 2 seconds and zero-padding all the wakeword data, and/or recording for 2 seconds with silence added?

I guess in a closed system you could split the training data by speaker as well and have the AI separate on them, so you'd have
greg_spoke and simone_spoke as different triggers.

@janjongboom and @dansitu
Thanks heaps guys.
I slapped together a wakeword intent and implemented it with the example very easily.
The C++ code is very extensive, but the example is easy to understand and works really well.
What seemed like black magic before is now starting to make sense.

Thank you, and all the others involved, so much for making such a wonderful entry point to AI.
Well done.


That’s awesome to hear, very glad you’ve been able to make progress!

Is there a large performance hit from extending the window to 2 seconds and zero-padding all the wakeword data, and/or recording for 2 seconds with silence added?

It’ll double the inference time, so not ideal. To detect silence I’d probably just use some other algorithm in your application code, for example looking at the current loudness level.


Thanks @dansitu
Yep, that is what I thought after having something to work with.
Even sequential wakewords like "wakeword stop" could work like that: just refine the output based on timing and confidence at the last stage.
In training, milliseconds matter a lot, but at the user interface an actual delay in the response to increase accuracy is not a big hit.
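For what it's worth, that timing-and-confidence refinement for a "wakeword stop" sequence could be as small as something like this. All the names and thresholds here are hypothetical, not anything from the Edge Impulse SDK:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds -- all of these would need tuning. */
#define MIN_GAP_MS     200   /* "stop" must trail the wakeword by at least this */
#define MAX_GAP_MS     1500  /* ...and by at most this */
#define MIN_CONFIDENCE 0.8f

typedef struct {
    uint32_t timestamp_ms;  /* when this window was classified */
    float confidence;       /* model score for the detected class */
} detection_t;

/* Fire only when "stop" follows the wakeword inside the timing window
 * and both detections clear the confidence threshold. */
bool sequence_detected(detection_t wake, detection_t stop)
{
    if (wake.confidence < MIN_CONFIDENCE || stop.confidence < MIN_CONFIDENCE)
        return false;
    if (stop.timestamp_ms <= wake.timestamp_ms)
        return false;
    uint32_t gap = stop.timestamp_ms - wake.timestamp_ms;
    return gap >= MIN_GAP_MS && gap <= MAX_GAP_MS;
}
```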
The most annoying thing about AI is that when it gets it wrong you have to go around the whole loop again. False positives have a much higher cost than negatives,
especially emotionally. (Haha, the anthropomorphism is sinking in deeper.)

@greg_dickson Just FYI, we're currently working on a feature to give insight into a bunch of real-world metrics, including the number of false positives, and give you the ability to model things like post-processing filters, thresholds, and silence requirements :slight_smile: More details during Imagine!


Looking forward to that. I have implemented a simple approach that saves a wakeword on success and saves it as unknown on a false positive, but having an incremental version and a way to add a confidence level to the re-training would be extra cool: a way to re-train on the fly.

I'd love a system where you could introduce a new user, and that user would not only train the general system as they go but also get a specific tag, so the system could differentiate between users over time. That is pretty out there for a TinyML system though.
But then again,
you can't create what you can't imagine.

@greg_dickson FYI see @dansitu’s keynote here from 43:50 about what we’re working on: https://youtu.be/jTDjO1xf-yc?t=2630

Thanks Jan, I did watch that and I am very excited about the future. Pity I'm an old guy, but young people are going to really enjoy what you're doing. Thank you for allowing us access to these cutting-edge tools. I watched the first day all the way through but haven't gotten through the other two days. Great job, heaps of information. You all must be exhausted.