Impulse performance on STM32WL55x extremely slow

This an offshoot from Can't verify STM32WL project successfully

Thank you @AlexE for taking a look at this.

Here’s my current DSP block

Here is a screenshot of the serial output.


I used the EON Tuner to optimize for the TI-LAUNCHXL CC1352P because it’s the processor that matches ours the closest. We’re running a Cortex-M4 at 48 MHz, but it does not have an FPU, so floating point operations is software simulated. I expected some cost to performance for this reason, but not to the extent we’re experiencing. We’re doing keyword spotting over 8-bit, 8kHz audio.

Unfortunately, floating point emulation does incur a fairly heavy latency penalty, especially for MFCC, but there are some settings tweaks that can hopefully get acceptable latency without too much accuracy hit:

  • Are you using int8 or float32 version of the NN (you choose that on the Deployment screen). For sure pick int8
  • Compare the default NN settings to those suggested by the EON Tuner. The default may be faster with only a slight accuracy hit
  • Try these changes, in this order. The first changes hopefully will have the biggest latency impact with the least accuracy impact. I would profile each change and stop once you have acceptable latency
  1. Change the normalization window much lower. Try 5 to start and see if your accuracy is still good. You can even try 0, although 5 will get you most of the processing gain but still give you some decent normalization. Try 11, 21, etc if your accuracy suffers too much.
  2. Turn off pre emphasis by setting the coefficient to 0.
  3. Set frame stride = frame length. (in your case, 32 mS). I find overlapping frames has little benefit to accuracy
  4. Set FFT to 64. It’s surprising, but you can get away with skipping samples and usually still have enough frequency resolution
  5. Lower the filter number, and reduce the number coefficients as well, to keep about the same ratio (13/32)

Good luck, and let me know how these changes go!

1 Like

In addition, consider an MFE block as well. It does not have any normalization so can be calculated faster. We have customers on Cortex-M0+ using it.

1 Like

Thank you for the great tips and advice! @AlexE’s suggestions shaved about 500 milliseconds from the estimated performance, although it still took about four seconds for DSP on my processor for one second of audio. The only thing so far that seems to get us under that one second window is overclocking, but we will also try the MFE block that @janjongboom suggested as well.