Classifying live data: Best moving average approach for 1sec hotword detection?

Hi there!

By carefully following along with the super lovely Responding to your voice tutorial, I’ve managed to build my own hotword detection classifier. Yay! :partying_face:! I’ve also managed to run it on the Arduino Nano 33 BLE Sense and via WebAssembly in the browser and Node.js. Yay II!

But now I am wondering: what is the correct way to apply the moving average?

The tutorial seems to have a hint on the moving average topic (see “Poor performance due to unbalanced dataset?”), but at least for me it stays a bit vague. My dataset is balanced, with roughly 1500 WAV files of 1-second duration for every class (noise, unknown and my-hotword).

Hence, could someone please elaborate on what the best moving average approach would be for the “Hello World” example, and why?

Also, a dedicated moving average code example would be terrific, so that one can study and understand it in detail while tweaking parameters. Maybe extending the Classification of existing audio files on desktop example with a moving average to detect “Hello World” in WAV files of arbitrary duration?

Many thanks!

PS. Also many thanks for the super nice Edge Impulse product! Now I can finally design and create my own simple classifiers in an easy way! This is exactly what I have been looking for for ages! Please keep up the good work!


Thanks for the great feedback, and we hope you’ll continue to be delighted! We actually implement the moving average filter for you, as discussed in: https://docs.edgeimpulse.com/docs/responding-to-your-voice#poor-performance-due-to-unbalanced-dataset and that function is available as source code when you export your model (it includes our C++ SDK). Hope this helps.
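
As a rough conceptual sketch (an illustration only, not the actual SDK code), such a filter keeps the last few confidence values per label in a ring buffer and replaces each new score with the running average, for example:

// Conceptual sketch only -- not the SDK implementation.
// Keeps the last MAF_WINDOW confidence values per label in a ring
// buffer and replaces each new score with the running average.
#include <cstddef>

#define MAF_WINDOW 4   // number of past classifications to average over
#define NUM_LABELS 3   // e.g. noise, unknown, my-hotword

struct MovingAverageFilter {
    float history[NUM_LABELS][MAF_WINDOW] = {{ 0 }};
    size_t pos = 0;      // next slot to overwrite in the ring buffer
    size_t filled = 0;   // how many slots hold valid data so far

    // Smooth the raw per-label scores in place
    void apply(float scores[NUM_LABELS]) {
        size_t count = (filled < MAF_WINDOW) ? filled + 1 : MAF_WINDOW;
        for (size_t label = 0; label < NUM_LABELS; label++) {
            history[label][pos] = scores[label];
            float sum = 0.0f;
            for (size_t i = 0; i < count; i++) {
                sum += history[label][i];
            }
            scores[label] = sum / count;
        }
        pos = (pos + 1) % MAF_WINDOW;
        if (filled < MAF_WINDOW) {
            filled++;
        }
    }
};

A larger MAF_WINDOW smooths the output more (a single noisy slice is less likely to trigger a false detection) at the cost of reacting a bit more slowly to the hotword.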

Hi @b_g, we actually already run a moving-average filter when you run in continuous mode (both in the mobile client and on device). The implementation is here: https://github.com/edgeimpulse/inferencing-sdk-cpp/blob/68877720efa5c16a2e82160f9f4db09711607f60/classifier/ei_run_classifier.h#L131

And you can disable it (to see the difference) by passing in false as the last parameter to run_classifier_continuous: https://github.com/edgeimpulse/inferencing-sdk-cpp/blob/68877720efa5c16a2e82160f9f4db09711607f60/classifier/ei_run_classifier.h#L195
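
In the continuous Arduino example that would look roughly like this (a sketch reusing the signal, result and debug_nn variables from that example, with the last argument as the enable flag for the filter):

// Run one slice with the built-in moving-average filter disabled, so you
// can compare the raw per-slice confidences against the smoothed ones.
EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn, false);
if (r != EI_IMPULSE_OK) {
    ei_printf("ERR: Failed to run classifier (%d)\n", r);
    return;
}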

Hi @yodaimpulse and @janjongboom! Many thanks for the pointers! Got it.

Last related question:
I have a window of 1000 ms and currently all the samples of my hotword are “centered” inside the 1000 ms. The duration of the hotword sound varies (see plot below; I removed everything under 300 ms from the training data) …

Now I’m wondering why exactly you slice into 4 chunks of 250 ms. Is this simply a compromise for performance on the hardware, or is there a theoretical reason I don’t get? Would it not be better to classify more often to further increase accuracy?

For others who come across the moving average question: you can see the moving average nicely in action in the “Arduino library” export, in the example “nano_ble33_sense_microphone_continuous”:

void loop()
{
    // Record one slice of audio (EI_CLASSIFIER_SLICE_SIZE samples)
    bool m = microphone_inference_record();
    if (!m) {
        ei_printf("ERR: Failed to record audio...\n");
        return;
    }

    // Wrap the slice buffer in a signal_t for the classifier
    signal_t signal;
    signal.total_length = EI_CLASSIFIER_SLICE_SIZE;
    signal.get_data = &microphone_audio_signal_get_data;
    ei_impulse_result_t result = {0};

    // Classify this slice; the moving-average filter is applied
    // inside run_classifier_continuous
    EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);
    if (r != EI_IMPULSE_OK) {
        ei_printf("ERR: Failed to run classifier (%d)\n", r);
        return;
    }

    // Only print once per full model window, i.e. every
    // EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW slices
    if (++print_results >= (EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW)) {
        // print the predictions
        ei_printf("Predictions ");
        ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
            result.timing.dsp, result.timing.classification, result.timing.anomaly);
        ei_printf(": \n");
        for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
            ei_printf("    %s: %.5f\n", result.classification[ix].label,
                      result.classification[ix].value);
        }
#if EI_CLASSIFIER_HAS_ANOMALY == 1
        ei_printf("    anomaly score: %.3f\n", result.anomaly);
#endif

        print_results = 0;
    }
}

Hi @b_g, it’s a trade-off between inferences per second and performance. A classification step consists of:

  1. Building a spectrogram of some sort (MFCC, MFE or a normal spectrogram)
  2. Normalizing the data
  3. Running the classifier

When we do continuous audio classification we can shorten step 1 significantly (because we only compute part of the spectrogram), but steps 2 & 3 need to run every time we process a slice (on the full spectrogram), so 4 slices is the balance point where we have enough time to do all this on most development boards. If you have a dev board that is faster you can freely increase this (see model_metadata.h for the macro). Just look at the DSP + NN timings after each inference; if they are way less than 250 ms you can safely increase it.
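
On a faster board the change could look roughly like this (a sketch only: it assumes a 1000 ms model window and that defining the macro before including the generated inferencing header overrides the default; if your SDK version doesn’t pick that up, change the value in model_metadata.h directly, and the header name below is just a placeholder):

// More slices per window -> more inferences per second and a smoother
// moving average. Assumes a 1000 ms model window.
#define EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW 8    // 1000 ms / 8 = 125 ms slices
#include <my_project_inferencing.h>                // placeholder for your generated library

// After each run_classifier_continuous() call, check that the work per
// slice still fits into the slice length, otherwise the board cannot
// keep up with the incoming audio.
static void check_slice_timing(const ei_impulse_result_t &result) {
    int slice_ms = 1000 / EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW;
    int busy_ms  = result.timing.dsp + result.timing.classification;
    if (busy_ms >= slice_ms) {
        ei_printf("WARN: %d ms of DSP+NN work per %d ms slice -- lower the slice count\n",
                  busy_ms, slice_ms);
    }
}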


Great + many thanks for the explanation! :+1: :pray:
