How to deal with "distributed/repeating" sounds

Hi, I’m building an audio recognition model but having a bit of difficulty. I think the reason is that the target sound is itself repetitive. I’m trying to recognize the sound of a dog drinking… “lap lap lap”. Each individual lap has a certain signature that should be possible to recognize (the simple model isn’t so good at that, unfortunately), but if that sound also repeats every half second or so, the model could be much more certain that the dog is taking a drink. Is there a way to set the parameters in the feature definition or the modeling to detect such a repeated signature? I suppose I could handle that part in the microcontroller code, but speech is often not just one 200 ms sound either; it’s a series of them, each of which presumably increases the confidence of the score, so I’m hoping there is a built-in way. Ideally, I’d like to get a signal when the drinking has gone on for a certain amount of time.

Thanks all!

Hi @braddo

Welcome to the forum! If you are just starting out, here are some steps you can take for your project:

1. Data Collection

  • Diverse Dataset: Collect a diverse set of audio samples of the dog drinking sound, ensuring you capture the sound in various environments and with different background noises. Include samples with varying intervals between the “lap” sounds.
  • Augmentation: Consider augmenting your dataset to include more variations of the drinking sound. Edge Impulse supports data augmentation.
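If you want to generate extra variants offline before uploading, something like the following works. This is a minimal sketch in Python using librosa and soundfile; the file name, sample rate, and augmentation amounts are placeholders to adjust for your own recordings:

```python
import numpy as np
import librosa
import soundfile as sf

# Load a recorded clip (path is a placeholder; point it at your own data)
audio, sr = librosa.load("dog_lap_001.wav", sr=16000, mono=True)

# 1. Add low-level background noise
noisy = audio + 0.005 * np.random.randn(len(audio))

# 2. Shift the clip in time so "lap" onsets land at different offsets
shifted = np.roll(audio, int(0.1 * sr))  # 100 ms circular shift

# 3. Stretch the tempo slightly to vary the lap-to-lap interval
stretched = librosa.effects.time_stretch(audio, rate=0.9)

for name, variant in [("noisy", noisy), ("shifted", shifted), ("stretched", stretched)]:
    sf.write(f"dog_lap_001_{name}.wav", variant, sr)
```

Time-stretching is particularly relevant here, since it varies the lap-to-lap interval your model will need to tolerate.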

2. Signal Processing and Feature Extraction

  • Windowing: In the “Create impulse” step, you can define your window size and increase the window overlap in the processing block. A window long enough to span two or three laps (e.g. 1–1.5 s, given a lap roughly every half second) lets the model see the repetition itself rather than a single isolated lap.
  • MFCCs: Use the MFCC (Mel Frequency Cepstral Coefficient) block for feature extraction. MFCCs are effective at capturing the timbral aspects of audio, which helps differentiate sounds like a dog’s drinking pattern.
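For intuition about what the MFCC block produces, here is a rough offline equivalent in Python with librosa. The frame length, hop, and coefficient count here are assumptions, so match them to the values configured in your processing block:

```python
import librosa

# Parameters loosely mirroring a typical audio MFCC configuration;
# the exact values should come from your impulse settings.
audio, sr = librosa.load("dog_lap_001.wav", sr=16000, mono=True)

mfccs = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13,       # number of cepstral coefficients per frame
    n_fft=512,       # 32 ms analysis window at 16 kHz
    hop_length=256,  # 16 ms hop -> 50% overlap between frames
)

print(mfccs.shape)  # (13, n_frames): one coefficient vector per time frame
```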

3. Model Selection and Training

  • Choosing a Model: For audio pattern recognition, especially with temporal sequences, using a neural network that can capture time dependencies is crucial. Edge Impulse offers different neural network architectures, including CNNs and RNNs. An LSTM (Long Short-Term Memory) layer can be particularly effective for detecting sequences in time-series data.
  • Training: Train your model with the collected and augmented dataset. Use the validation set to tune the model’s hyperparameters for better performance.
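As a sketch of what a CNN + LSTM architecture might look like if you edit the network in expert (Keras) mode, assuming 13 MFCC coefficients over roughly 62 frames per window (both dimensions are assumptions; match them to your actual feature output):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(62, 13)),                          # frames x MFCC coefficients
    layers.Conv1D(16, kernel_size=3, activation="relu"),   # learns the local "lap" signature
    layers.MaxPooling1D(2),
    layers.LSTM(32),                                       # captures repetition across frames
    layers.Dropout(0.25),
    layers.Dense(2, activation="softmax"),                 # e.g. "drinking" vs "noise"
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```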

4. Deployment and Inference

  • Running Inference: Deploy your trained model to your target device. Edge Impulse supports deployment to a wide range of devices, including microcontrollers.
  • Post-Processing: Implement post-processing on your microcontroller to analyze the model’s inference results over time. For instance, you could write a simple algorithm to detect a sequence of high-confidence “lap” detections with the expected time intervals to confirm the drinking action.
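That counting logic could look something like the following. It is sketched in Python for readability (on a microcontroller you would port it to your C++ firmware loop), and every threshold is an assumption to tune against your own data:

```python
from collections import deque

# All thresholds below are assumptions; tune them against real recordings.
CONFIDENCE_THRESHOLD = 0.8   # per-window "drinking" score that counts as a hit
HISTORY_SECONDS = 3.0        # rolling window over recent inference results
MIN_HITS = 4                 # hits required inside the rolling window
MIN_DURATION = 2.0           # sustained seconds before raising the signal

hit_times = deque()
drinking_since = None

def on_inference(score, now):
    """Feed one 'drinking' class score per inference, with a monotonic timestamp."""
    global drinking_since

    if score >= CONFIDENCE_THRESHOLD:
        hit_times.append(now)

    # Forget hits that have fallen out of the rolling window
    while hit_times and now - hit_times[0] > HISTORY_SECONDS:
        hit_times.popleft()

    if len(hit_times) >= MIN_HITS:
        if drinking_since is None:
            drinking_since = now          # repetition just became convincing
        elif now - drinking_since >= MIN_DURATION:
            print(f"Drinking detected for {now - drinking_since:.1f} s")
    else:
        drinking_since = None             # repetition broke off; reset
```

Calling `on_inference(score, time.monotonic())` once per inference window means you only get a signal after several high-confidence laps have accumulated and persisted, which is exactly the “drinking has gone on for a certain amount of time” condition you described.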

5. Optimization and Testing

  • Continuous Improvement: Based on testing, you may find that adjustments to the feature extraction parameters, model architecture, or post-processing logic are necessary. Edge Impulse makes iterating on your design straightforward with the EON Tuner.
  • EON Tuner: Utilize the EON Tuner in Edge Impulse to optimize your model’s performance and resource usage, ensuring it runs efficiently on your target device.

See also Increasing model performance - Edge Impulse Documentation

Best

Eoin
