Data format for audio recogniton

Nikhil · June 7, 2021, 3:18am

Hello, i want more clarity about the data used in training and testing in audio recognition examples.

Currently I receive PCM audio samples, which I get after interfacing PDM microphone with nRF DK. Than I have created a python script to generate .wav files using the above PCM samples for listening audio.

So what type of data should I use in the audio examples to recognize audio ?

Should I use the raw PCM data or can I directly pass the 2seconds .wav file to train and test ?

Or

Should I generate .wav files from the PCM data and use that .wav files to recognize voice by passing it to edge pulse using python script. If yes is there any script available to pass .wav file to edge impulse for testing.

OR

Should I get data from the generated .wav file (mentioned in 2nd option ) using the python scipy module.

Can anyone guide me, which data should I use to train and test.

Thanks in advance

louis · June 7, 2021, 11:59am

Hello @Nikhil,

I haven’t used this microphone (don’t know which libraries are available) but to collect the training data you have several options
You could store in a buffer the 2 seconds raw you want to sample and then upload it to EI studio (either .wav format or directly the raw data).
You can also use a python script to upload the data.

On my side, I would upload .wav file but other options should work.

The only thing you cannot use at the moment is the CLI data forwarder because it only works on sensors with lower sampling frequencies.

Regards,

Louis

Nikhil · June 7, 2021, 3:08pm

Thanks @louis for the valuable information. Currently I am reading the .wav files in python script using scipy module and passing data to edge impulse.

I think this can suit me as well, it will save time of reading data from the .wav file in python script.

I would like to know is there any way of uploading .wav files using python script ?

louis · June 7, 2021, 3:34pm

Hello @Nikhil,

This is a public project I wrote for some benchmarks with TinyML Perfs: https://studio.edgeimpulse.com/public/29349/latest

Here is the Jupyter Notebook I used to push the data that are present in the public project above to EI studio: https://colab.research.google.com/drive/1g9TPZcXVkOxCdl-my72oqlioIPKgOOnM?usp=sharing

I mostly used the script from @ShawnHymel from his Coursera course.
Particularly this one on week 3: https://www.coursera.org/lecture/introduction-to-embedded-machine-learning/audio-data-capture-wMOOj

I hope this can give you an example (on how to upload .wav files to Edge Impulse using a python script)

Regards,

Louis

Nikhil · June 7, 2021, 4:18pm

Thanks @louis for guidance.

ShawnHymel · June 7, 2021, 4:39pm

Hi @Nikhil, if it helps, here is the repo where I keep the Jupyter Notebook script for curating and uploading .wav data: https://github.com/ShawnHymel/ei-keyword-spotting. I usually run it in Colab.

Please note that it was specifically designed for keyword spotting, so it downloads the Google Speech Commands dataset and mixes those with whatever samples you’re using to create a dataset that is divided into keyword1, keyword2, etc. and noise and unknown categories (where unknown is “all other words that are not one of the keywords”). Hope that helps!

Nikhil · June 9, 2021, 6:30am

Hello @louis
My project in the end will be deploying a trained edge impulse model on nRF52840 DK. So i think i should be using 2s raw PCM data to train and test my model, as i won’t be able to generate .wav files on development board while inference.

Just wanted to know once more will the raw PCM data will work for audio classification like cough, laugh detection ?

janjongboom · June 9, 2021, 8:53am

Hi @Nikhil, yeah you can just stick PCM values into the values array like this:

{
    "protected": {
        "ver": "v1",
        "alg": "HS256",
        "iat": 1564128599
    },
    "signature": "xxx",
    "payload": {
        "device_type": "MY_DEVICE",
        "interval_ms": 0.0625,
        "sensors": [
            { "name": "audio", "units": "wav" }
        ],
        "values": [
            -1, 121, 381, -211
        ]
    }
}

Nikhil · June 9, 2021, 8:56am

Thanks @janjongboom
Exactly what i was looking for.

Nikhil · June 12, 2021, 8:59am

Hello,

I have successfully trained my model using int16 PCM values. But during deployment, while inference i need to provide float buffer as input.

I have read " docs https://docs.edgeimpulse.com/docs/running-your-impulse-locally-1 " it states that i can convert the int to float using numpy function.

numpy::int16_to_float(features + offset, out_ptr, length);

Inside this function, its basically converting my values to range of -1.0 to 1.0 .

So i dont think my model will do inference on the input float values as its far different in range of int values on which model is trained.

So i am just converting my float values from int values without division. Will it work ?
Below my current buffer is int16 pointer to int buffer.

features[i] = (float)*(current_buff + i);

Or should i once again train my model by passing values in this format (int values/32768).

janjongboom · June 22, 2021, 6:37am

@Nikhil yes, you can just cast to float as well, the exact format that we expect is here under ‘Raw features’ which is just int16 range (but as a float).

int16_to_float is a helper function which you can use to save memory (when using it in a signal_t get_data function), as you don’t need to convert beforehand and thus save 2x the memory (as we page data in and then convert when needed, thus your audio buffer can be 16-bit int (2 bytes) rather than 32-bit float (4 bytes)).

StevenFenton · February 17, 2025, 1:12pm

@janjongboom Hi, I have a similiar query related to this (posted a question some days ago). The ‘Raw Features’ values you say INT16 range but as a float. Are you therefore saying that the raw feature values have already been converted to ‘float’ (int16_to_float). It’s just confusing as the WAV shows the range as a signed int16_t value (negative 32768 through positive 32767, -1->0.9999 resp). Have you already done the INT16_TO_FLOAT conversion on the Raw feature values, so they can be copied directly into the float buffer? I understand then that as we refill the buffer (via the signal callback), we’d need to convert the incoming audio from int16 to float prior to calling run_classifier.