Audio sampling best practices question: short and targeted versus long and generic?

Hello All,

My first project is a bark detector, and I currently have 200 one-second samples recorded on the target hardware (ESP32). A bark lasts about 250 milliseconds, and each sample contains one bark, several barks, or continuous barking, with background noise filling the rest. The model has two classes: barks and background. If I train with a one-second window to capture the entire sample, model testing is about 97% accurate; however, real-life results are rather poor. I think the poor results might be because the default sample size on the ESP32 (per EI's example code) is 2048 samples (~46 ms), with an inference time of ~1000 ms. With this setup, there is a fairly high probability of missing a 250 ms bark within each ~1000 ms.
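A quick back-of-the-envelope check of this concern, as a minimal Python sketch. It assumes (none of this is stated in the post) a 44.1 kHz sample rate, which is what "2048 samples ≈ 46 ms" implies, that capture and inference run strictly one after the other, and that barks start at uniformly random times. Any partial overlap counts as a "hit", so the result is an optimistic upper bound on how many barks the model even gets to see.

```python
# Rough estimate of how often a periodic capture window overlaps a bark.
# ASSUMPTIONS: 44.1 kHz sample rate, capture and inference run back to back
# (no recording during inference), bark onsets uniformly random in time.

SAMPLE_RATE_HZ = 44_100   # assumed; 2048 samples ~= 46 ms implies ~44.1 kHz
CAPTURE_SAMPLES = 2048    # per-read buffer size from the example code
INFERENCE_MS = 1000.0     # inference time reported in the post
BARK_MS = 250.0           # typical bark length reported in the post


def capture_ms(samples: int, rate_hz: int) -> float:
    """Duration of one capture buffer in milliseconds."""
    return 1000.0 * samples / rate_hz


def overlap_probability(window_ms: float, cycle_ms: float, event_ms: float) -> float:
    """Probability that an event of length event_ms overlaps (even partially)
    a window of length window_ms that repeats every cycle_ms."""
    return min(1.0, (window_ms + event_ms) / cycle_ms)


window = capture_ms(CAPTURE_SAMPLES, SAMPLE_RATE_HZ)
cycle = window + INFERENCE_MS  # one capture followed by one inference
p = overlap_probability(window, cycle, BARK_MS)

print(f"window = {window:.1f} ms, cycle = {cycle:.1f} ms")
print(f"fraction of time listening = {window / cycle:.1%}")
print(f"P(bark overlaps a capture window) = {p:.1%}")
```

Under these assumptions the device is listening only about 4% of the time, and fewer than a third of barks overlap a capture window at all. That would fit the pattern of high test accuracy but poor real-world results.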

Three questions:

  1. Should I increase the 2048 sample size on the ESP32?

  2. Should I trim the bark samples so they include only the barks and no background noise? If so, should I reduce the window to 500 ms (or some other length)?

  3. Any other tips to consider?
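For questions 1 and 2, one common way to avoid losing audio while the model runs is to keep recording into a ring buffer and classify overlapping windows. The sketch below is a language-neutral illustration in Python, not Edge Impulse code. On the ESP32 it would need porting to C++, with capture running concurrently with inference (for example from an I2S/DMA-driven task; that setup is an assumption here). The Edge Impulse SDK's continuous-audio examples, built around `run_classifier_continuous`, follow a similar idea; check the SDK documentation for the exact API. `read_chunks()` and `classify()` below are hypothetical placeholders.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def sliding_windows(
    chunks: Iterable[List[int]],
    window_samples: int,
    hop_samples: int,
) -> Iterator[List[int]]:
    """Yield overlapping windows of window_samples, one every hop_samples.

    chunks is the stream of audio reads (e.g. 2048-sample buffers); the deque
    acts as a ring buffer that keeps only the most recent window_samples.
    """
    if not 0 < hop_samples <= window_samples:
        raise ValueError("need 0 < hop_samples <= window_samples")
    ring: Deque[int] = deque(maxlen=window_samples)
    since_last = 0  # samples received since the last emitted window
    for chunk in chunks:
        for sample in chunk:
            ring.append(sample)  # oldest sample drops off automatically
            since_last += 1
            if len(ring) == window_samples and since_last >= hop_samples:
                yield list(ring)
                since_last = 0


# Hypothetical usage: 16 kHz audio, 1 s window, 250 ms hop.
# for window in sliding_windows(read_chunks(), 16_000, 4_000):
#     if classify(window) == "bark":  # classify() is a placeholder
#         ...
```

The hop must be no shorter than one inference, or the classifier falls behind the audio: with a 1 s window and a 250 ms hop, for example, each inference has to finish in under 250 ms. As long as the window minus the hop is at least the bark length, every bark falls entirely inside at least one window.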

Thanks!

Hi @ReubenStr1

You can increase the sample size up to 10 seconds, so try moving to a longer sample size.

No, but make sure the samples are balanced 50:50 between barks and background (non-barking).
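One way to reach a 50:50 split is random undersampling: keep every sample of the smaller class and a random subset of the larger one. This is a minimal Python sketch, assuming the dataset is available as a list of (file path, label) pairs; it is not an Edge Impulse feature. Undersampling throws data away, so recording more of the minority class is usually the better first option when that is practical.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (file path, label); the layout is an assumption


def balance_by_undersampling(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Randomly drop samples from larger classes until every class has as
    many samples as the smallest one."""
    by_label: Dict[str, List[Sample]] = {}
    for path, label in samples:
        by_label.setdefault(label, []).append((path, label))
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced: List[Sample] = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))  # n picks, no repeats
    rng.shuffle(balanced)  # mix classes so they are not grouped by label
    return balanced
```

For example, with 3 hypothetical bark files and 5 background files, the result keeps all 3 barks and 3 randomly chosen background files.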

Best

Eoin