Word Spotting project help

palmering · May 12, 2023, 1:13am

Hi,

I would like to apologize for posting this topic in the Blog section. I don’t know how to move it, so I am posting it again in the right place.

I’m new to Edge Impulse and voice recognition.
I’m part of a project about a voice recognition product.
The basic idea is that the device listens for a word (just one word at a time) and shows that word on a display. The device should be able to recognize at least 250 words (the more the better), in one language only (English or Spanish).
I have some questions:

Does spotting more words require more memory?
Is 120 MHz, 1 MB of Flash, and 256 KB of RAM (as on some Cortex-M4 microcontrollers) enough for 250/500 words, or should I use a more powerful microcontroller?
Would it be necessary to train the model for each word? What about different voices (men, women, kids)?
What is the cost of using that library for a commercial product?

Thank you!

louis · May 12, 2023, 9:05am

Hello @palmering,

Recognizing 250 words will be tough.
To get good accuracy, you will need a lot of data and probably a complex NN architecture.
Maybe some models that detect your words already exist and can be ported to edge devices.
Feel free to check whether an existing model suits you.

Does spotting more words require more memory?

Not necessarily. (Well, a bit, to store the full buffer of classes with their associated probabilities, but it’s not linear.) However, the more words you have, the harder it will be to get good accuracy, so you’ll need a more complex NN architecture, probably with more layers, kernels, etc. This will increase the number of weights and the compute usage.
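To make that concrete, here is a rough Python sketch that counts the weights of a small keyword-spotting network as the number of words grows. Everything in it is an assumption for illustration only: the two 1D-convolution layers, the channel counts, the 13 MFCC coefficients over 50 frames, and the helper names are hypothetical and are not what Edge Impulse generates for a given project.

```python
def conv1d_params(in_ch, out_ch, kernel):
    # Weights plus one bias per output channel.
    return in_ch * out_ch * kernel + out_ch


def dense_params(n_in, n_out):
    # Fully connected layer: weights plus one bias per output.
    return n_in * n_out + n_out


def model_params(n_classes, channels=(8, 16), kernel=3, n_mfcc=13, n_frames=50):
    """Parameter count of a hypothetical conv1d -> pool -> ... -> dense model."""
    total = 0
    in_ch = n_mfcc
    frames = n_frames
    for out_ch in channels:
        total += conv1d_params(in_ch, out_ch, kernel)
        frames //= 2  # assume a pooling layer halves the time axis
        in_ch = out_ch
    # Final classifier: one output per word.
    total += dense_params(frames * in_ch, n_classes)
    return total


def output_buffer_bytes(n_classes, bytes_per_score=4):
    # One score per class (float32 assumed).
    return n_classes * bytes_per_score


if __name__ == "__main__":
    for n_classes, channels in [(10, (8, 16)), (250, (8, 16)), (250, (32, 64))]:
        params = model_params(n_classes, channels=channels)
        # Assuming int8 quantization, roughly 1 byte of flash per weight.
        print(n_classes, channels, params, "weights,",
              output_buffer_bytes(n_classes), "bytes of class scores")
```

In this sketch the score buffer stays tiny even at 250 classes. The weight count grows mostly because the layers feeding the classifier have to get wider to keep the words separable.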

Is 120 MHz, 1 MB of Flash, and 256 KB of RAM (as on some Cortex-M4 microcontrollers) enough for 250/500 words, or should I use a more powerful microcontroller?

I would go with something bigger.

Would it be necessary to train the model for each word?

To recognize 250 words, you will need to train one big model that can classify the 250 words.
Keep in mind that developing a good model takes time and many iterations.
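
As a minimal sketch of what "one model that classifies all the words" means at inference time: the model outputs one score per word, and the device displays the highest-scoring word. The labels, the example scores, and the 0.6 confidence threshold below are hypothetical values chosen for illustration, and the threshold itself is an assumption rather than something required by the library.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_word(logits, labels, threshold=0.6):
    """Return the most likely label, or None if the model is not confident."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None  # nothing is shown on the display
    return labels[best]


if __name__ == "__main__":
    labels = ["hola", "gracias", "adios"]  # would be 250+ entries in practice
    print(pick_word([0.1, 4.0, 0.2], labels))
    print(pick_word([1.0, 1.1, 0.9], labels))
```

With 250 words the code stays the same; only the length of `labels` and of the score vector changes.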

What about different voices (men, women, kids)?

It depends on the use case. For a model constrained to a specific use case, it is sometimes worthwhile to scope your dataset. But in general, for human voice models, the more variety the better, to reduce bias. This also applies to any general-purpose model.

What is the cost of using that library for a commercial product?

Edge Impulse lets you use the downloaded library in commercial products. If you are interested, feel free to reach out (louis@edgeimpulse.com) to discuss our enterprise features, which will save you time and complexity. I will be able to redirect you to the right person.

Best,

Louis