Keyword detector on M5 Stick C

I successfully implemented a keyword detector on an M5 Stick C (ESP32) using the Edge Impulse toolchain. I used Shawn Hymel's Colab script to select and upload samples from the Google dataset for the keywords Stop and Go, and then deployed the model as an Arduino app.
After making it work, I have a few questions and remarks:

  1. The Google voice files and background sounds are sampled at 16 bits, so that is the basis for the inference calculation. The microphone on the M5 Stick C delivers only 14 bits. After feature extraction I would expect to get the same features as in the trained model, just with lower amplitude, and the model obviously works. Still, I wonder whether the recognition rate is expected to be lower than with 16-bit sounds or the same, and what the technical reasoning behind that is.

  2. I wonder what influence the technical characteristics of the microphone have. In other words, if I created the training samples with the M5 Stick C microphone itself, would I get a better recognition rate? The success rate I currently get when running the NN classifier is in the low 80s (percent). When running the model on the Stick I get between 50 and 95% probability and relatively few misses.

  3. I used the nano_ble33_sense_microphone_continuous.ino example as basis to create a library that I then can include in my app if I want to use the keyword function. I replaced the microphone function and created a task that continuously fills the inference buffers. From there I am using the same code as in the example, with one exception: I made the static bool microphone_inference_record(void)
    function non-blocking, so that it returns immediately if no new data is available.
    I would strongly suggest changing the code generator of your toolbox in the same way, as that makes integration into real code much easier. In my library I simply provide a processLoop() function that can be called from the application's main loop. As long as it is called at least as often as the buffers get filled, there are no misses; if no new data is available, the function returns immediately.

  1. In general, the amplitude can matter depending on your normalization settings, but I wouldn't expect 2 bits to make a significant difference. If you want to run an experiment, save off a test sample from your M5 and compare inference results between the raw sample and the same sample left-shifted by 2 bits.

What does your confusion matrix look like on the Model Testing page? What do you mean exactly when you say your success rate is in the low 80s?

That aside, do you have Noise and Other Words classes? Those are typically very helpful, especially when real deployment performance doesn’t match the confusion matrix. For noise, you can just record ambient noise in your office; I wouldn’t suggest a lot of loud noises like in some public datasets.

  3. run_classifier_continuous and signal_from_buffer are intended to let users handle various kinds of asynchronous patterns.
    signal_from_buffer(buffer, /*put size here*/ , &signal);

    EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);

Thanks for the reply.
I probably will do a test with samples from the Stick to see if there is a difference.

By success rate I meant the correct predictions in the confusion matrix. Sorry for using the wrong term. And yes, I am using background noise and other words as well, all from the Google repository. The background noise in my case is relatively special, so I might try with background noise recordings from that particular environment.

The code section I think should be changed is this part of the microphone_inference_record(void) function in the examples:

while (inference.buf_ready == 0) {

The function blocks there. A better approach would be to return immediately if no data is ready, so that other work can be done while waiting for data.

Hi @tanner87661,

Regarding your first 2 questions: how are you converting from 14-bit to 16-bit? Do you take the sign into account? The value -1 in 14-bit two’s complement is 0x3FFF (all 14 bits set). If you use that directly as a 16-bit sample, it is interpreted as 16383 instead of -1.

The amplitude itself doesn’t have to differ at a different bit depth. What differs is the dynamic range, i.e. the size of the quantization step between two adjacent sample values.

On your 3rd point: the continuous sampling flow is deterministic, and we want to detect audio buffer overruns. So our flow is basically:

  • Wait for audio data
  • Run inference
  • Handle classified output

In the last step the user can add any extra functionality, as long as it completes within the time it takes to fill the next audio buffer.
If you want to run other functionality in parallel, I would suggest creating a separate thread running at a lower priority.

Finally, I did a video on this project for my YouTube channel:

Great work and presentation @tanner87661!!! :clap: