Is there a minimum sample size for audio classification?

Our project is attempting to run a classifier on a small window of audio sampled at 8 kHz with 8-bit resolution.

This blog post is very helpful:

However, I am trying to find out whether the signal_t struct has uint8_t fields for the individual samples, or whether we are meant to define our own structs in our source code.

Either way, our system is not able to handle floats, and even if it could, RAM constraints force us to consider the viability of making inferences on a buffer that is of type uint8_t.

Is this possible?

Thank you in advance!

Hi @Markrubianes, underneath we use floats to calculate the features, so we need floats at some point, but you can page these in whenever required and keep your internal audio data in int8. See and find the numpy::int16_to_float reference. There’s an int8 version of this too.
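A minimal sketch of that idea, assuming a callback-style signal that hands out samples as floats on request. Everything named here (`demo_signal_t`, `audio_buffer`, `get_audio_data`, `make_audio_signal`) is a hypothetical stand-in written for illustration, not the SDK's own definitions, and the plain cast with no scaling is also an assumption; whatever scaling you pick has to match how the training data was prepared.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for a signal type: a total length plus a callback
// that fills `out` with `length` floats starting at sample `offset`.
struct demo_signal_t {
    size_t total_length;
    int (*get_data)(size_t offset, size_t length, float *out);
};

// The audio stays in RAM as 8-bit samples (tiny buffer for illustration).
static int8_t audio_buffer[8] = {-128, -64, -1, 0, 1, 64, 100, 127};

// Convert only the requested slice to float, so a full float copy of the
// window never has to exist at once.
static int get_audio_data(size_t offset, size_t length, float *out) {
    assert(out != NULL);
    if (offset + length > sizeof(audio_buffer)) {
        return -1; // request runs past the end of the buffer
    }
    for (size_t i = 0; i < length; i++) {
        // Assumption: plain cast, no normalization.
        out[i] = static_cast<float>(audio_buffer[offset + i]);
    }
    return 0;
}

static demo_signal_t make_audio_signal() {
    demo_signal_t signal;
    signal.total_length = sizeof(audio_buffer);
    signal.get_data = &get_audio_data;
    return signal;
}
```

The point of the design is that the consumer asks for small ranges, so the only float storage needed is the caller's scratch array for the range it is currently working on.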

Thank you! It’s helpful to know how run_classifier() gathers its raw data. I take it that raw data is promoted to floats when using run_classifier_continuous() as well?

This brings up some concerns that I haven’t yet considered.

First, I’m not sure if we can page in samples because 1) our device has no virtual memory and 2) we’re doing continuous inferencing using the double buffer technique you explain here:
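For reference, a rough sketch of a double buffer that keeps the samples as int8 the whole time, so nothing has to be written out to flash. The names (`DoubleBuffer`, `on_sample`, `take_ready_slice`, `SLICE_SAMPLES`) and the idea of calling `on_sample` from an ADC interrupt are assumptions made for illustration; this is not the implementation from the linked post.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Tiny slice size for illustration; a real slice would be far larger.
constexpr size_t SLICE_SAMPLES = 4;

struct DoubleBuffer {
    int8_t buffers[2][SLICE_SAMPLES];
    volatile uint8_t write_index; // buffer the sampler is currently filling
    volatile size_t fill_count;   // samples already written to that buffer
    volatile bool slice_ready;    // a full slice is waiting to be processed
    volatile bool overrun;        // previous slice was not consumed in time
};

// Called once per sample, e.g. from an ADC interrupt (hypothetical hook).
void on_sample(DoubleBuffer &db, int8_t sample) {
    assert(db.fill_count < SLICE_SAMPLES);
    db.buffers[db.write_index][db.fill_count] = sample;
    db.fill_count = db.fill_count + 1;
    if (db.fill_count >= SLICE_SAMPLES) {
        if (db.slice_ready) {
            db.overrun = true; // processing fell behind the sampler
        }
        db.write_index = static_cast<uint8_t>(db.write_index ^ 1);
        db.fill_count = 0;
        db.slice_ready = true;
    }
}

// Called from the main loop; returns the completed slice, or nullptr if
// no new slice has finished since the last call.
const int8_t *take_ready_slice(DoubleBuffer &db) {
    if (!db.slice_ready) {
        return nullptr;
    }
    db.slice_ready = false;
    return db.buffers[db.write_index ^ 1];
}
```

The `overrun` flag is the useful diagnostic here: if it is ever set, feature extraction plus inference on one slice took longer than it takes to record the next one.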

We have plenty of flash and very limited RAM, but I don’t know if continuously writing pages to flash and reading them back will be fast enough, or if the memory will even hold up to that many read/write cycles.

If we go by the “peak RAM” performance metric that the studio generates for us after we have a complete, trained impulse, can we then safely assume that peak RAM is an upper bound (maximum stack size) that includes the promotion to floats and operations on those types? If so, then we will know how to at least manage the volatile memory problem.
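To put numbers on the raw-sample storage alone, here is the arithmetic for the 8 kHz, 1 second window discussed in this thread. It does not try to reproduce the studio's peak RAM figure, and `CHUNK_SAMPLES` is a made-up scratch size chosen only to show the effect of converting to float in small pieces.

```cpp
#include <stddef.h>
#include <stdint.h>

constexpr size_t SAMPLE_RATE_HZ = 8000; // from the question
constexpr size_t WINDOW_MS = 1000;      // 1 second window from the thread
constexpr size_t CHUNK_SAMPLES = 256;   // hypothetical float scratch size

constexpr size_t WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_MS / 1000;

// Whole window held as 8-bit samples.
constexpr size_t int8_window_bytes = WINDOW_SAMPLES * sizeof(int8_t);
// Whole window held as floats (what a naive up-front conversion would need).
constexpr size_t float_window_bytes = WINDOW_SAMPLES * sizeof(float);
// 8-bit window plus one small float scratch buffer converted on demand.
constexpr size_t chunked_bytes =
    int8_window_bytes + CHUNK_SAMPLES * sizeof(float);

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");
static_assert(int8_window_bytes == 8000, "1 byte per sample");
static_assert(float_window_bytes == 32000, "4 bytes per sample");
static_assert(chunked_bytes == 9024, "8000 + 256 * 4");
```

So keeping the window in 8-bit form and converting a chunk at a time costs about 9 KB for the samples in this example, versus 32 KB for a full float copy.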

The other issue is that we’re using a Cortex-M4 core, not M4F, so we don’t have floating point acceleration, and it runs at 48 MHz, not 80 MHz. Judging by the performance metrics for continuous audio listed here, will our downgraded hardware increase the latency by a factor that pushes us out of the realm of being able to inference continuously on a 1 second window? Maybe we can achieve a single inference within a one second window, but then I suppose that eliminates the possibility of having a rolling average?
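One way to frame the viability question before measuring on the real board is a back-of-the-envelope budget: continuous inference only works if processing one slice takes less time than recording the next. The sketch below scales a reference per-slice time by the clock ratio and by a software-float penalty. The function name, the struct, and every number you feed it are assumptions; in particular the soft-float penalty is unknown until measured, and clock scaling is only a first-order guess.

```cpp
#include <assert.h>

struct TimingEstimate {
    float per_slice_ms; // estimated processing time per slice on the target
    float budget_ms;    // time available before the next slice is ready
    bool fits;          // true if processing finishes within the budget
};

// Rough estimate only: linear clock scaling times a guessed penalty for
// doing float math in software instead of on an FPU.
TimingEstimate estimate_continuous(float ref_ms_per_slice, float ref_mhz,
                                   float target_mhz, float soft_float_penalty,
                                   float window_ms, int slices_per_window) {
    assert(target_mhz > 0.0f && slices_per_window > 0);
    TimingEstimate e;
    e.per_slice_ms =
        ref_ms_per_slice * (ref_mhz / target_mhz) * soft_float_penalty;
    e.budget_ms = window_ms / static_cast<float>(slices_per_window);
    e.fits = e.per_slice_ms < e.budget_ms;
    return e;
}
```

For example, a hypothetical 60 ms per slice at 80 MHz becomes roughly 100 ms at 48 MHz with no penalty, which fits a 250 ms budget (1 second window, 4 slices); a 10x soft-float penalty would push it to about 1000 ms and it would no longer fit. Real measurements on the target should replace these guesses as soon as they are available.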

We’ll eventually discover the actual metrics, but I want to at least pose the question of viability of what we’re trying to achieve. I’d like to know if we should start looking into hardware upgrades now, or consider if non-continuous sampling can meet the needs of our application.

Thank you!