Website Accuracy higher than on Nano 33 BLE Sense (Speech)

Hello there,

I have created a project on Edge Impulse, in order to detect different voice commands.
The voice commands range from 0.5 to 1.5 seconds, all recorded with the Nano 33 BLE Sense microphone, with variations in intonation, pronunciation, distance and angle relative to the mic. After that, all samples were cropped to contain only the actual keyword, removing any in-between noise.

Here is the breakdown:

  1. 21 classes (4 classes for 4 types of noise)
  2. 50 samples per class, totalling 1351 samples
  3. 1215 training samples, 136 testing samples
  4. 624 features per class

After a multitude of tests with tweaks to the MFCC and CNN settings, I got to 92.1% accuracy on the testing samples; the only misrecognised samples were the different types of noise, which is fine.
Then I built it as an Arduino library, added it to the Arduino IDE, compiled and uploaded it, and the on-device accuracy is actually quite bad.

I edited the code so that instead of displaying every prediction class and its certainty percentage, it only displays the name of a class if that class had a classification probability above 80%. I also tried 85, 90, 95 and 98. It is not good at all and it misclassifies quite frequently.
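
Roughly, the check I added looks like this (a minimal sketch; the result fields are the standard ones from the generated Edge Impulse library, and the threshold constant is just a placeholder name):

// Only print a class name when its probability clears a fixed threshold
// (I tried 0.80, 0.85, 0.90, 0.95 and 0.98 here).
const float CONFIDENCE_THRESHOLD = 0.80f;

for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    if (result.classification[ix].value > CONFIDENCE_THRESHOLD) {
        ei_printf("%s\n", result.classification[ix].label);
    }
}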


Now, going back to the beginning: before building this NN and reaching 92.1% accuracy, I had a model with the same number of commands, with 50 samples each, but only 1 brown-noise class, so there were 17 classes in total. With different parameters I reached 99.1% accuracy on the 10% validation set.

The voice commands are as follows:

  1. "What is the Rainfall Forecast? "
  2. “Temperature”
  3. “Humidity”
  4. “Altitude”
  5. “Heat Index”
  6. “Current time”
    etc.

Now, the issue I have is that if I record some samples with the Arduino Nano mic (same as before) and then use those for classification in “Model Testing,” it gets them all correct, with 100% accuracy.

Can I get some help figuring out what I am doing wrong? Would I need to reduce the number of classes because there are too many? Should I change the train/test split significantly?

Hi @Ciprian,

If you are seeing high accuracy on your validation and test set in the Studio but poor accuracy on the device, it often means you have an overfit model or you have data that is not representative of the actual operating environment.

Can you share the project ID so I can take a look at your dataset and model?

In my experience, if you are not using transfer learning, 50 samples per class is far too low. I found that I needed at least 1000 samples per class to get something that starts to work.

If you record keywords with a silent background, you may find that you have false triggers if there is background noise during inference. To fix this, you can mix various background noises into your keyword samples. I put together a script to help with this: https://github.com/ShawnHymel/ei-keyword-spotting
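
The core idea of that script is just sample-wise mixing of a scaled noise clip into each keyword clip. Here is a rough C++ sketch of the same operation (the script itself works offline on WAV files; the function name below is purely illustrative):

#include <algorithm>
#include <cstddef>
#include <cstdint>

// Mix a background-noise buffer into a keyword buffer of the same length.
// Both buffers are 16-bit PCM; a noise_gain of roughly 0.1-0.3 keeps the
// keyword dominant while still exposing the model to background noise.
void mix_noise(int16_t *keyword, const int16_t *noise, size_t n, float noise_gain) {
    for (size_t i = 0; i < n; i++) {
        int32_t mixed = keyword[i] + static_cast<int32_t>(noise_gain * noise[i]);
        // Clamp to the int16 range to avoid wrap-around clipping artifacts
        keyword[i] = static_cast<int16_t>(std::min<int32_t>(32767, std::max<int32_t>(-32768, mixed)));
    }
}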

Hello Shawn,

Many thanks for your swift reply.
Here is the project ID for the 17 classes, 99% accuracy. I performed a 50/50 split and now it’s 97.3% accurate. ID: 87158

Meanwhile, the project with 21 classes (4 types of noise) has 92% accuracy with a 90/10 split. ID: 86846

I am running the Nano 33 BLE Sense in the same environment as the one where I recorded the training samples. In both cases there is close to no background noise.

1000 samples per class would be quite tedious to collect, especially since there are 21 classes, so 21,000 samples would take extremely long to gather.

Hi @Ciprian,

Thank you for the IDs. I took a look at your projects, and I think you should either try to get more data or try “transfer learning (keyword spotting).” 50 samples per class is simply not enough for training a deep NN from scratch.

Thanks for the reply. Do you know whether switching from the Nano 33 mic to a phone/PC (recording one long file in Audacity and then splitting it, which would greatly speed up the process) would have any negative effect on the MFCC/CNN training, considering that the audio quality will differ between the microphones?

I forgot to specify that I technically do not need live, immediate recognition.
My system has a button, and once the button is pressed the speech recognition starts; then, depending on what it recognised, it outputs something over serial.
It is fine for me if it records a section/frame once I press the button, then classifies it, and then performs the serial output based on that classification.

Will the second scenario help with the low-sample CNN situation? Is there a way to modify the library example (not the continuous one) to listen for longer and to start listening as soon as the button is pressed?
So far, the non-continuous inferencing example gives you a notice in the Serial Monitor, then listens for about half a second, then classifies the recorded audio.

@shawn_edgeimpulse In order to better demonstrate the issue I have done the following:

  1. Removed the different types of noise, leaving only brown noise.
  2. Removed 2 classes/voice commands.
  3. Changed the MFCC to 20 coefficients.
  4. Changed the CNN layers/neurons/dropout rate.
  5. Used a 60/40% split (still 50 samples per class).

With all these changes I reached 99.7% accuracy.
If you watch the following YouTube video, you can see that on the PC, in the browser, it is quite accurate: https://youtu.be/9Ni6wb2YSDY

If you watch the following video of the Arduino Nano 33 BLE Sense, you can see it is not doing so great: https://youtu.be/N6IQMh3YtUM

I noticed that by going from 3 slices (1000 ms / 3, the default) to 1 slice it gets better, although it is still not perfect, and getting the timing right for when it starts listening is a bit difficult.
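
For reference, I changed this near the top of the continuous example sketch; the macro name below is taken from my copy of the downloaded library example:

// Each model window is split into this many slices; with a 1000 ms window,
// the default of 3 means each inference only adds ~333 ms of new audio.
// I changed it from 3 to 1 so every inference covers a full window.
#define EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW 1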

Is there a reason for the lower performance? I would have guessed that the browser simulates the Cortex-M4 conditions for a proper comparison, but maybe that is not the case.

But as I said, I do not need live recognition; I can record, classify and then output, if I can get some help setting that up.

Please let me know what would be best to do now; I was thinking of trying the Google dataset from the GitHub link you provided. I'll be awaiting your reply.

Hi @Ciprian,

The PC version does seem a little more accurate; however, if you watch closely, you can still see a lot of false positives for other words. For example, when you say “altitude,” the “discomfort” class was triggered briefly at 1:00 in the video. These false hits happen throughout the video.

The quality of the Arduino microphone does make a difference, but in my experience, I have found that if you are using MFCCs as your features, it will still work decently well if you train with recordings from something else (e.g. a smartphone). Using recordings from different microphones, I imagine, would only help make a more robust model.

Pressing a button will not help accuracy. That only serves to save on power/battery life so that you don’t have to perform inference all the time. The only thing it might help is that if you want to do single inference (not continuous), the button will trigger a recording time so that you can guarantee your captured audio will not be split between 2 windows.

You likely need an “other” class. Without it, inference is forced to choose one of the classes you gave it. If an utterance is split between windows, or you say something that is not one of the keywords, the model will still pick the label with the closest match, which is probably not what you want to happen. That's where the Google Speech Commands dataset comes in: it gives you a starting place for creating an “other” class to catch the things you say that aren't keywords.

Hope that helps!

@shawn_edgeimpulse Thanks for the reply.
I have most definitely seen that there are false-positive triggers for other classes, but the idea is that, outside of “altitude” and “discomfort,” it was working quite well, and the false positives were under 90%. That means that with a >90% if-statement I only get an output when a specific class has high probability.

Regarding “the button will trigger a recording time so that you can guarantee your captured audio will not be split between 2 windows”: can you help me modify the example from the library that only listens for 1 second? I've already added immediate recording after the button press, but the time frame is still very short. How can I increase that timing?

Lastly, I am on a strict deadline; I can dedicate a maximum of 1 to 2 weeks (circa 60 hours) to getting this to work. Since you have experience, do you think it is plausible to get it working properly within this timeframe? Or should I revert to far fewer classes, or just drop the speech recognition because it will not get any better? I need something that can recognise the different classes quite accurately, without any false positives (over 90% probability, let's say), but the background will not contain more than brown noise or distant voice chatter.

Thanks for your help!

Hi @Ciprian,

If you download the Arduino library for your project, take a look at the nano_ble33_sense_microphone example. This will show you how to record a snippet of audio before performing inference on that recording.

ei_printf("Starting inferencing in 2 seconds...\n");

delay(2000);

ei_printf("Recording...\n");

bool m = microphone_inference_record();
if (!m) {
    ei_printf("ERR: Failed to record audio...\n");
    return;
}

ei_printf("Recording done\n");

You can simply remove that 2 second delay and add something that waits for a button to be pressed instead. I also added a couple of lines that would turn on an LED while recording. For example:

while (digitalRead(BUTTON_PIN) == HIGH);  // wait here until the button is pressed (pin reads LOW)

digitalWrite(LED_PIN, HIGH);
ei_printf("Recording...\n");

bool m = microphone_inference_record();
if (!m) {
    ei_printf("ERR: Failed to record audio...\n");
    return;
}

digitalWrite(LED_PIN, LOW);
ei_printf("Recording done\n");

I honestly have not worked with 10+ classes for keyword spotting, so I don’t know how well it will function. In my experience, as you add more classes, your model will get more complex, so you will probably need to continue tweaking the hyperparameters and architecture until you find something that works. You will also need more data if you hope to achieve 90+% accuracy.

[Edit] I do think your project is feasible in the time you have outlined. It will likely require gathering more data and trying different hyperparameters to find something that works.

Hey @shawn_edgeimpulse, thanks for your reply.
I have already made the changes in nano_ble33_sense_microphone, as I mentioned (“I've added the immediate recording after button”). The issue is that the recording window is only 1 second; how can I change that to 2 seconds? It should be somewhere in the library's .h files, but I don't know where. Some of my classes are 1.5-2 seconds long, so having a 2 second window would make the recognition better.

Hi @Ciprian,

The length of the recording for any single sample is given by the “Window size” that is set in your project in the Studio.

If you go to Model testing, you can see that some of your 2 second samples are actually several windows, each of which goes through inference.

As a result, you’ll find that your model is not actually looking for the entire phrase “precipitation forecast.” Rather, it is trained to look for a piece of that phrase at any given time.

If you’d like to make it so that your model looks for entire phrases up to 2 seconds long, I recommend increasing the “Window size” to 2000 ms. However, there is a good chance you will run out of RAM on your Arduino Nano 33 BLE Sense when you do that. Doing keyword spotting on 1 second windows already pushes that board to the limit in terms of RAM and CPU processing.
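
As a rough sanity check (assuming the usual constants from the generated library; both names also appear in the microphone examples):

// Window length per inference, derived from the generated model constants.
// At 16 kHz, a 1000 ms window is 16,000 int16 samples (~32 KB just for the
// raw audio buffer); a 2000 ms window doubles that before the DSP and NN
// working memory are even counted, which is why RAM becomes the limit.
uint32_t window_ms = (uint32_t)EI_CLASSIFIER_RAW_SAMPLE_COUNT * 1000 / EI_CLASSIFIER_FREQUENCY;
ei_printf("Window size: %lu ms\n", (unsigned long)window_ms);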

Hey @shawn_edgeimpulse,
Thanks to your help I have been able to get it working as intended. Thank you for that.

One question though, how can I change the following code:

      ei_printf("Predictions ");
      ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
                result.timing.dsp, result.timing.classification, result.timing.anomaly);
      ei_printf(": \n");
      for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        ei_printf("    %s: %.5f\n", result.classification[ix].label, result.classification[ix].value);
      }

Specifically, this section:

      for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        ei_printf("    %s: %.5f\n", result.classification[ix].label, result.classification[ix].value);
      }

So that rather than printing the % for every class, it only displays the one class that had the highest prediction %?
Thanks in advance!


EDIT:
Don’t worry about it, I did it like this:

      // print only the class with the highest prediction
      String highest_class = "";
      float highest_prediction = 0;
      for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        if (result.classification[ix].value > highest_prediction) {
          highest_class = String(result.classification[ix].label);
          highest_prediction = result.classification[ix].value;
        }
      }
      Serial.println("Highest Prediction Class is: " + highest_class + "\nAccuracy is: " + String(highest_prediction));

And it looks like the following:
[Screenshot of the Serial Monitor output]


Hi @Ciprian,
Could you please send the full code for the Arduino Nano? I have a similar project… thanks!


Hi @Ciprian,

Great work! Yes, the easiest solution is to loop through all your inference result values to find the highest one. You could also pick the target class (or classes) you care about and do a simple threshold on that, assuming the threshold is greater than 0.5. For example:

if (result.classification[2].value >= 0.5) {
  Serial.println("Class 2 predicted");
}

The softmax function at the end of each classification makes it so all the output values sum to 1. As a result, if any class prediction value is over 0.5, that has to be the highest value.
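
If you would rather not depend on the class index (the index follows the label ordering in the generated library), you can also match on the label string. A small sketch, with "temperature" standing in for one of your actual labels:

// Threshold on a class by its label instead of a hard-coded index.
for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    if (strcmp(result.classification[ix].label, "temperature") == 0 &&
        result.classification[ix].value >= 0.5) {
        Serial.println("Temperature command detected");
    }
}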

Hey @ShawnHymel, thanks for the info on the Softmax function, I wasn’t aware of the higher than 0.5 trick.

As a separate question, though: what would you say is the best number of coefficients for the MFCC block? Online sources say 8-13 is typical, with 20 used for tonally complex languages like Chinese.

I have been playing around with different values to try to improve the accuracy. Any tips?

Same with data augmentation: for my type of project, would you say data augmentation could help? I have tried with and without and haven't really seen any change, maybe a few tenths of a percent in the final accuracy, but the test split is also random (I believe), so it could also come from that.

Hi Ciprian,

I’ve only tried MFCCs with 8-13 coefficients. I see little difference in the simple keyword spotting demos that I have made, but keep playing with the hyperparameters to see if they improve accuracy. I honestly find that every project is different. Often, if things aren’t improving after tweaking a few hyperparameters, it’s likely a problem with your dataset: you’ll need to collect more and/or better data that more accurately represents your operating environment.

Data augmentation is often a good bet. I find that the “data augmentation” checkbox in the Studio sometimes helps. The script I linked to earlier will help create a curated dataset that’s been augmented by mixing in random snippets of background noises. You can add your own classes and background noises, if you wish. That will help you build a dataset that’s more representative of your operating environment.

@shawn_edgeimpulse
Hey there Shawn, I promise this is the last time I’m disturbing you, haha.

I have continued working on it, taking into account everything I was able to gather from about 50 hours of work and 45-50 iterations.
I have gotten it to work perfectly with 1,150 samples and 9 voice commands consisting of actual phrases.

And although I have tested it on-device and it works wonderfully, I have wondered how it would actually perform with more than 140 training cycles. I cannot run more than 140-150 cycles because of the 20 min time limit; however, in the past, models that were able to run 500-600 training cycles would actually reach better validation accuracy and lower loss after the initial 200 cycles.

I would like to ask whether it is possible to increase the available training time from 20 min to 60-80 min, just to perform one single run with 500 training cycles and see how it performs. The project ID is 89778. The extended time could be granted for only 24 h and then removed, at your discretion. Please let me know if this would be possible; I can even pay a fee if necessary.

Hi @Ciprian,

I’ve increased your project training time to 60 min for 89778. Let me know if that works for you!

@shawn_edgeimpulse That’s amazing, huge thanks! I really appreciate the support you have provided.

I just have a few “theory” questions, just so I can get a better understanding of the library:

  1. The if/else check for “Has anomaly”: what does the “anomaly” imply here? What type of issue with the code does it indicate?

  2. In the code below, what could cause a failure to record audio? Can it be both hardware and software?

      bool m = microphone_inference_record();
      if (!m) {
        ei_printf("ERR: Failed to record audio...\n");
        return;
      }
  3. In the code below, what could cause the classifier to not run?
      EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);
      if (r != EI_IMPULSE_OK) {
        ei_printf("ERR: Failed to run classifier (%d)\n", r);
        return;
      }
  4. In the code below, what could cause the sampling to fail?
    if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {
        ei_printf("ERR: Failed to setup audio sampling\r\n");
        return;
    }
  5. What could cause the PDM library to fail to initialise?
    if (!PDM.begin(1, EI_CLASSIFIER_FREQUENCY)) {
        ei_printf("Failed to start PDM!");
        microphone_inference_end();

        return false;
    }
  6. I can see that the PDM library can be initialised as either mono or stereo. Has anyone run the same tests with stereo instead of mono and observed the model's performance?

Many thanks in advance!

Hi @Ciprian,

  1. Anomaly detection uses cluster analysis to determine whether the newly acquired sample is part of a cluster. A score of 0.0 means that it’s right on the border of the cluster. Negative scores mean “within the cluster” and positive scores mean “outside the cluster.” A default threshold of 0.3 is set to mean “definitely outside the cluster.” I struggled with the same thing, and you can read my discussion with Jan here. (There is a short sketch of the usual check at the end of this reply.)

  2. Are you working with continuous or non-continuous inference? Did you modify the example code given by Edge Impulse? My guess is you’re doing continuous inference, as that’s the only way for that function to fail. If you look at the definition (found here), you can see that it’s likely caused by the sample buffer being overrun. This happens if you have too many other things going on in the background or after inference. In other words, the audio buffer is filling up and inference (e.g. run_classifier_continuous()) was not called in time to process/empty that buffer. So, the buffer overfills, causing you to miss samples (or, in this case, fail the function altogether). In my experience, MCUs (like the nRF52840 on the 33 BLE) can barely handle doing continuous KWS. If you try to use it to perform other actions, you’ll start overflowing the audio buffer.

3, 4, 5 - I’m honestly not sure. Did you modify the Arduino example that came with the downloaded .zip library from Edge Impulse? The nano_ble33_sense_* examples should run out of the box without modification. If they are not running (without modification), make sure that you’re using a Nano 33 BLE Sense (and not the Nano 33 BLE). If you have another Nano 33 BLE Sense, try that one.

  6. I don’t think anyone has tried it with stereo. That could be a pain to set up (as you’d need to connect another PDM microphone to the same interface), but it might work. Right now, I think the Studio only supports mono, as we extract MFCCs from only one channel to feed to the model.
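
For point 1, in the example sketches this typically shows up as a guarded check like the one below (EI_CLASSIFIER_HAS_ANOMALY and result.anomaly come from the generated library; 0.3 is the default threshold mentioned above):

#if EI_CLASSIFIER_HAS_ANOMALY == 1
    // result.anomaly is the cluster-distance score described in point 1:
    // negative = inside the cluster, positive = outside; anything above
    // ~0.3 is treated as "definitely outside", i.e. an anomalous sample.
    if (result.anomaly > 0.3) {
        ei_printf("Anomaly detected (score: %.3f)\n", result.anomaly);
    }
#endif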