Audio data samples acquisition question

DmitrSou · August 8, 2023, 7:40pm

I’m a newbie in ML and have some questions about data acquisition and model training:

1. In the «Responding to your voice» tutorial, the «Segment length» is set to 1 sec, and all the segments have a small lead-in and lead-out. There is even a «Shift samples» checkbox that automatically shifts the segments by a small random amount. My key phrase, for example, is 1.5 sec long. Am I right that the optimal length for me is 2 sec, so that there is room «to shift» for better recognition?
2. Should all the segments have the same length, or may they differ? For example, I have datasets labelled «Unknown» and «Noise» with segments 1 sec long (from the Keyword spotting dataset). Is it OK to use them together with my custom dataset «HelloWorld», whose segments are 2 sec long? Or should all the segments in the dataset be the same length (normalized to 2 sec)?
3. Is it OK to mix lengths within the same label (for example, segments varying from 1.5 to 2.5 sec)?
4. It’s said that «Neural networks need to learn patterns in data sets, and the more data the better.» Does this mean that, for the best performance, it is worth recording samples from as many people as possible, of different genders and ages?

MMarcial · August 10, 2023, 1:24am

RE #1. The Impulse Input Data Block controls how each Sample is processed. If you set Window Size to 1000 ms, the Model is trained on 1 s windows. For Samples longer than 1 s, Window Increase sets how far the window moves forward for each successive window cut from the Sample. The actual voice can occur anywhere within the 1 s window. Varying where the voice falls within the window will make your Model more robust.
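
To make the Window Size / Window Increase idea concrete, here is a minimal sketch of generic sliding-window slicing. It is not Studio's actual code: the function name `window_starts` and its parameters are hypothetical, and it simply skips any trailing part of a Sample that does not fill a whole window.

```python
def window_starts(sample_ms, window_ms, increase_ms):
    """Return the start offset (in ms) of every full window cut from one Sample.

    Hypothetical helper for illustration only; all durations are in milliseconds.
    """
    if sample_ms < window_ms:
        # The Sample is too short to hold even one full window.
        return []
    # Slide the window forward by increase_ms until it no longer fits.
    return list(range(0, sample_ms - window_ms + 1, increase_ms))


# A 2000 ms Sample, a 1000 ms window, and a 500 ms increase
print(window_starts(2000, 1000, 500))  # [0, 500, 1000]
```

With these numbers, one 2 s recording yields three overlapping 1 s windows, and the spoken phrase sits at a different position in each of them.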

RE #2. The Samples can be of different lengths. When enabled, Zero-Pad Data pads a Sample that is shorter than the Window Size with zeros so that it fills the window.
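
As a rough illustration of zero-padding, the sketch below appends zeros to a short Sample until it reaches the window length. The helper name `zero_pad` is hypothetical, and placing the zeros at the end is an assumption; this thread does not say where Studio puts them.

```python
def zero_pad(samples, window_len):
    """Pad a list of audio values with trailing zeros up to window_len.

    Hypothetical helper for illustration only. Samples that are already
    long enough are returned unchanged.
    """
    missing = max(0, window_len - len(samples))
    return list(samples) + [0] * missing


# A 3-value Sample padded to a 5-value window
print(zero_pad([7, -2, 4], 5))  # [7, -2, 4, 0, 0]
```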

RE #3. Yes, you can use mixed lengths in your Samples. Window Increase and Zero-Pad Data will make the Samples compatible with the training process.

RE #4. It depends on your goal. Your training data should match how you intend to use the Model.
The Studio NN Classifier has Data Augmentation settings that you should use to make the trained Model more robust.
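
For a feel of what audio data augmentation does in general, here is a small sketch of two common techniques: a random time shift and added noise. It does not reproduce Studio's own augmentation options, which are not described in this thread; the function names and parameters are hypothetical.

```python
import random


def random_shift(samples, max_shift, rng):
    """Shift audio left or right by up to max_shift positions, filling gaps with zeros."""
    n = len(samples)
    shift = rng.randint(-max_shift, max_shift)
    shift = max(-n, min(n, shift))  # never shift further than the clip length
    if shift > 0:
        # Move the audio later in time: zeros first, tail dropped.
        return [0.0] * shift + list(samples[:n - shift])
    if shift < 0:
        # Move the audio earlier in time: head dropped, zeros last.
        return list(samples[-shift:]) + [0.0] * (-shift)
    return list(samples)


def add_noise(samples, amplitude, rng):
    """Add uniform random noise in [-amplitude, amplitude] to every value."""
    return [s + rng.uniform(-amplitude, amplitude) for s in samples]


rng = random.Random(0)  # fixed seed so the example is repeatable
clip = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
augmented = add_noise(random_shift(clip, 2, rng), 0.05, rng)
print(len(augmented) == len(clip))  # True: the clip length is preserved
```

Each training pass can then see a slightly different version of the same recording, which is the same idea as the «Shift samples» checkbox mentioned in the question.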

For a general purpose Model to be deployed to the general public, make your training data as varied as possible:
- Different people
- Different genders
- Samples recorded inside a room with a TV on, in a bathroom (with echo), etc.
- Samples recorded outside in a park, in an active football stadium, etc.
- Samples recorded in a soundproof room
- Different accents
- Different ages
- Different races
- Different cultures
- Samples recorded from a microphone, over a phone, etc.
For a user-specific Model, train only on Samples from that user, which gives your Model a layer of security. For example, you may want only your own voice to open the door lock of your corner office's private bathroom.