Audio-Recording - Understanding the Time Domain

Would it be possible to get an overview of the different time values used in audio processing, with a bit more detail and context? What I have in mind:

Time Series Data:

  • WindowSize
  • WindowIncrease

MFCC:

  • FrameLength
  • FrameStride
  • WindowSize (I think that is only for the MEL)

Arduino:

  • EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW

NN-Settings:

  • Warp time axis

It’s in your documentation, but I am not 100% sure I really understand the different meanings.

Time-series data:

  • Window size - this is the length in milliseconds of a single window used during training. Once your model is trained, you’ll need this much data for every classification. E.g. if you’re listening for a keyword you probably want 1 second windows (most keywords fit in that), if you do scene detection (‘am I in the kitchen?’) maybe 2 seconds, but it’s a variable to play with.
  • Window increase - if your training data is longer than your window size (e.g. kitchen_sounds01.wav is 10 seconds, but your window size is 2 seconds) we create multiple windows out of the sample. The ‘window increase’ determines the step between the start of one window and the start of the next. So with window size 2000 ms, window increase 1000 ms and a 5000 ms long sample you’ll get 0-2000 ms, 1000-3000 ms, 2000-4000 ms and 3000-5000 ms = 4 windows.
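
To make the windowing arithmetic concrete, here is a minimal Python sketch. The function name is made up for illustration, and it assumes that a trailing piece shorter than a full window is simply dropped, which this thread does not confirm:

```python
def split_into_windows(sample_ms, window_size_ms, window_increase_ms):
    """Return (start_ms, end_ms) for every full window that fits in the sample.

    Assumption: a trailing partial window is discarded.
    """
    if window_size_ms <= 0 or window_increase_ms <= 0:
        raise ValueError("window size and window increase must be positive")
    windows = []
    start = 0
    while start + window_size_ms <= sample_ms:
        windows.append((start, start + window_size_ms))
        start += window_increase_ms  # step to the start of the next window
    return windows


# The example from above: 5000 ms sample, 2000 ms windows, 1000 ms increase.
print(split_into_windows(5000, 2000, 1000))
# [(0, 2000), (1000, 3000), (2000, 4000), (3000, 5000)]
```

A smaller window increase gives more (overlapping) windows from the same recording; a window increase equal to the window size gives no overlap at all.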

MFCC:

  • Frame length - when creating a spectrogram you’re creating a time x frequency matrix. Every time column in the matrix covers one frame length of audio (the value is in seconds, so 0.02 = 20 ms with the default config).
  • Frame stride - this is the same idea as ‘window increase’ above, but applied to the time columns in the spectrogram.

So for a 1000 ms long sample with frame length 0.02 (20 ms) and frame stride 0.02 (20 ms) you’ll get 50 time columns (0.00-0.02, 0.02-0.04, 0.04-0.06, etc.). With frame stride 0.01 (10 ms) the columns overlap and you’ll get 99 time columns (0.00-0.02, 0.01-0.03, 0.02-0.04, etc.). [1]

[1] Note that, because of a bug when we first implemented the algorithm (which is hard to change now with 10K projects in the wild), we discard the last time column, so subtract 1 from these counts.
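
The column counts above follow from a simple formula, sketched here in Python. The function and the `drop_last` flag are hypothetical names for illustration, not part of any SDK; `drop_last=True` mimics the "subtract 1" behaviour from the footnote:

```python
def count_time_columns(sample_s, frame_length_s, frame_stride_s, drop_last=False):
    """Number of spectrogram time columns for a sample of the given length."""
    # Work in whole microseconds to avoid floating-point rounding surprises.
    sample = round(sample_s * 1_000_000)
    length = round(frame_length_s * 1_000_000)
    stride = round(frame_stride_s * 1_000_000)
    if sample < length:
        return 0
    # One column for the first frame, plus one for every stride that still fits.
    columns = (sample - length) // stride + 1
    if drop_last:
        columns = max(columns - 1, 0)
    return columns


print(count_time_columns(1.0, 0.02, 0.02))                  # 50
print(count_time_columns(1.0, 0.02, 0.01))                  # 99
print(count_time_columns(1.0, 0.02, 0.01, drop_last=True))  # 98
```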

Arduino:

  • EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW - The MFCC / MFE / spectrogram calculation can take a long time (on some targets 500 ms for a 1000 ms window), so we could only do one classification per second (you also need some time to call the classifier). That’s not great, because you’ll potentially miss events that are half in one second, half in another. To mitigate that we can calculate the spectrogram in smaller slices (e.g. 250 ms slices), then stitch them together, then classify. Now you can classify every 250 ms, because the time spent will be ~125 ms for the slice of the spectrogram + time to classify (say 50 ms), and you never miss any events. The macro here determines the slice length. If your window size is 1000 ms and EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW is 4, the slice length is 250 ms. If it’s 3, it’s 333 ms, etc.
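
Here is a rough Python sketch of the slicing idea, not the actual SDK code. Both function names and the placeholder slice values are invented for illustration; a real pipeline would compute spectrogram columns for each slice and feed every stitched window to the classifier:

```python
from collections import deque


def slice_length_ms(window_size_ms, slices_per_model_window):
    """Length of one slice in ms, e.g. 1000 ms / 4 slices = 250 ms."""
    return window_size_ms / slices_per_model_window


def stitched_windows(slice_features, slices_per_model_window):
    """Yield a full window every time a new slice arrives.

    slice_features: iterable of per-slice feature blocks (placeholders here).
    """
    # Keep only the most recent slices; the oldest one drops out automatically.
    recent = deque(maxlen=slices_per_model_window)
    for features in slice_features:
        recent.append(features)
        if len(recent) == slices_per_model_window:
            yield list(recent)  # this is what would be classified


print(slice_length_ms(1000, 4))  # 250.0
for window in stitched_windows(["s0", "s1", "s2", "s3", "s4", "s5"], 4):
    print(window)
```

After the first full window, every new slice produces another window to classify, so consecutive windows overlap by three slices when there are 4 slices per window.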

NN Settings:

  • Warp time axis - here we take the spectrogram, choose a point (let’s say at 400 ms), split the spectrogram in two (A = 0-400 ms, B = 400-1000 ms), and stitch them back together as B+A. As you can see, this is not useful for speech, but for background noise, scene detection, etc. it is, as it helps harden the neural network. Like all other data augmentation options, this is done at random for each sample during each training cycle, so it also artificially increases the size of your training set.
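
The split-and-swap described above can be sketched in a few lines of Python. This only illustrates the description in this thread, with the spectrogram modelled as a plain list of time columns; the function names are hypothetical and the real augmentation code may differ:

```python
import random


def swap_halves(columns, split_index):
    """Split into A = columns[:split_index], B = columns[split_index:], return B + A."""
    return columns[split_index:] + columns[:split_index]


def warp_time_axis(columns, rng=random):
    """Pick a random split point and swap the two parts (assumed behaviour)."""
    if len(columns) < 2:
        return list(columns)  # nothing to swap
    split_index = rng.randrange(1, len(columns))  # both parts stay non-empty
    return swap_halves(columns, split_index)


# Ten columns standing in for 100 ms each; splitting at index 4 matches the
# 400 ms example: A = columns 0-3, B = columns 4-9, result is B + A.
columns = list(range(10))
print(swap_halves(columns, 4))  # [4, 5, 6, 7, 8, 9, 0, 1, 2, 3]
print(warp_time_axis(columns))  # a different rotation on each call
```

Because the order of events inside the window changes, this only makes sense for classes where the order does not matter, such as steady background noise.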

THANK YOU!!!
Very clear - it might be good to put this in your online documentation.
I misunderstood the Warp time axis option from your documentation (could be my fault): I thought it shifts the frame a bit, which would allow generating more samples when using speech.
