ESP32-S3: SIMD Acceleration (ESP-NN/ESP-DSP) Not Effective for run_classifier

Question/Issue:
Hello Edge Impulse Team and community,

I am seeking expert advice on a performance issue when deploying a model to an ESP32-S3. I have worked through a detailed debugging process and isolated the problem, but I now need your help to understand the root cause within the EI SDK.

Project ID:
693087

Context/Use case:
I have developed a real-time motion detection application on an ESP32-S3 microcontroller using ESP-IDF v5.4. My model, developed on Edge Impulse, uses 3-axis accelerometer data to classify among ‘walk’, ‘slipforward’, and ‘slipright’.

The impulse is configured as follows:

  • Input: Time-series data (2000ms window, 100Hz frequency).
  • DSP Block 1: Low-pass Filter (Cut-off: 5Hz).
  • DSP Block 2: Wavelet Analysis (Wavelet: db4, Level: 1).
  • Learning Block: Classification (Neural Network).

The goal is to leverage the ESP32-S3’s SIMD capabilities via ESP-NN and ESP-DSP to achieve the best possible inference performance.

Steps Taken:

  • I exported the C++ library from a fully trained and validated model (99.8% accuracy on the platform).
  • I integrated the library into my ESP-IDF project and fixed several initial bugs, including a critical data type mismatch (I was passing raw integers instead of floats to the feature buffer) and a memory alignment issue (the input buffer is now 16-byte aligned via __attribute__((aligned(16)))).
  • I implemented a rigorous A/B test by toggling the ESP-NN and ESP-DSP settings between ANSI C and Optimized in menuconfig.
  • To get precise results, I am measuring performance using esp_cpu_get_cycle_count() wrapped around the run_classifier() call (see the sketch just after this list).
  • After observing no speedup, I created a final diagnostic test to validate my environment: it directly compares a manual C dot-product loop against the dsps_dotprod_f32_ae32 function from the official ESP-DSP library (the diagnostic is sketched under Logs/Attachments below).
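
For completeness, the timing wrapper around run_classifier() looks roughly like the sketch below (simplified: the code that fills the feature buffer from the accelerometer is omitted, and everything else is the standard API from the generated Edge Impulse C++ library):

```cpp
#include <stdio.h>
#include "esp_cpu.h"                                        // esp_cpu_get_cycle_count()
#include "edge-impulse-sdk/classifier/ei_run_classifier.h"  // run_classifier(), signal_t

// 2000 ms window @ 100 Hz x 3 axes = 600 floats; 16-byte aligned for SIMD loads.
static float features[EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE] __attribute__((aligned(16)));

static void time_run_classifier(void)
{
    // Wrap the float buffer in a signal_t (the data-type fix: floats, not raw ints).
    signal_t signal;
    numpy::signal_from_buffer(features, EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE, &signal);

    ei_impulse_result_t result = { 0 };

    uint32_t start = esp_cpu_get_cycle_count();
    EI_IMPULSE_ERROR err = run_classifier(&signal, &result, false);
    uint32_t cycles = esp_cpu_get_cycle_count() - start;

    if (err == EI_IMPULSE_OK) {
        printf("Total CPU cycles: %lu\n", (unsigned long)cycles);
        printf("Timing breakdown (us): DSP=%lld, NN=%lld, Anomaly=%lld\n",
               (long long)result.timing.dsp_us,
               (long long)result.timing.classification_us,
               (long long)result.timing.anomaly_us);
    }
}
```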

Expected Outcome:
I expected to see a significant reduction in CPU cycles for the run_classifier() call when compiling with the Optimized settings, especially after fixing the memory alignment. The result.timing.dsp_us and result.timing.classification_us values should have been much lower.

Actual Outcome:

  1. The A/B test on my project’s run_classifier() function shows no discernible performance difference between the ANSI C and Optimized builds. The total execution time remains ~9.2 ms / ~2.2 million cycles in both configurations.
  2. However, the direct diagnostic test was a success. The results were:
  • Manual C Loop: ~13,400 cycles
  • ESP-DSP SIMD Function: ~4,100 cycles
  • This shows a ~3.2x speedup and confirms that my development environment is set up correctly and that SIMD acceleration works at the library level.

The issue therefore seems to be that the run_classifier() pipeline is not using these optimized backends for my specific model.
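
As a quick way to check the compile-time side of this, a probe like the one below can at least show whether the ESP-NN kernel path is compiled into the export at all. Note this is a hedged sketch: the EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN macro name is taken from my reading of the edge-impulse-sdk sources and may differ between SDK releases.

```cpp
// Compile-time probe (assumption: the SDK gates its ESP-NN kernels behind
// EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN; adjust the name if your export differs).
#include "edge-impulse-sdk/classifier/ei_run_classifier.h"

#if defined(EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN) && (EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN == 1)
#pragma message("ESP-NN optimized kernels are enabled in this Edge Impulse build")
#else
#pragma message("ESP-NN optimized kernels are NOT enabled in this Edge Impulse build")
#endif
```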

Reproducibility:

  • [*] Always
  • [ ] Sometimes
  • [ ] Rarely

Environment:

  • **Platform:** ESP32-S3

  • Build Environment Details: ESP-IDF v5.4

  • OS Version: Windows 11

  • Edge Impulse Version (Studio): 1.71.56
    #define EI_STUDIO_VERSION_MAJOR 1
    #define EI_STUDIO_VERSION_MINOR 71
    #define EI_STUDIO_VERSION_PATCH 56

  • Edge Impulse CLI Version: 1.32.1

  • Project Version: [e.g., 1.0.0]

  • Custom Blocks / Impulse Configuration:
    My Impulse workflow consists of the following key parts:
    Input block: time series data, window size 2000ms, window increment 1ms, sampling frequency 100Hz.
    DSP block 1: Low-pass filter with a cut-off frequency of 5Hz.
    DSP block 2: Wavelet Analysis using Daubechies 4 (db4).
    Learning Block: A Classification (DNN) model.

Logs/Attachments:
run_classifier test:
// ANSI C Mode
Total CPU cycles: 2188623
Timing breakdown (us): DSP=7907, NN=1006, Anomaly=0

// Optimized Mode
Total CPU cycles: 2197387
Timing breakdown (us): DSP=7999, NN=1017, Anomaly=0

Environment check (SIMD works as expected):
Manual C Loop cycles: 13338 (result: 1786949.250000)
ESP-DSP SIMD cycles: 4118 (result: 1786949.250000)
Conclusion: SIMD is working and is 3.24 times faster!
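
For reference, the diagnostic that produced the numbers above is essentially the following sketch (simplified; the vector length and ramp test data are arbitrary placeholders, so the printed result value will differ from the log above):

```cpp
#include <stdio.h>
#include "esp_cpu.h"   // esp_cpu_get_cycle_count()
#include "esp_dsp.h"   // dsps_dotprod_f32_ae32()

#define VEC_LEN 1024

static float a[VEC_LEN] __attribute__((aligned(16)));
static float b[VEC_LEN] __attribute__((aligned(16)));

static void compare_dotprod(void)
{
    // Fill both vectors with simple ramp data.
    for (int i = 0; i < VEC_LEN; i++) {
        a[i] = (float)i * 0.01f;
        b[i] = (float)(VEC_LEN - i) * 0.01f;
    }

    // Manual C loop
    uint32_t start = esp_cpu_get_cycle_count();
    float manual = 0.0f;
    for (int i = 0; i < VEC_LEN; i++) {
        manual += a[i] * b[i];
    }
    uint32_t manual_cycles = esp_cpu_get_cycle_count() - start;

    // Optimized ESP-DSP kernel
    float optimized = 0.0f;
    start = esp_cpu_get_cycle_count();
    dsps_dotprod_f32_ae32(a, b, &optimized, VEC_LEN);
    uint32_t dsp_cycles = esp_cpu_get_cycle_count() - start;

    printf("Manual C Loop cycles: %lu (result: %f)\n", (unsigned long)manual_cycles, manual);
    printf("ESP-DSP SIMD cycles: %lu (result: %f)\n", (unsigned long)dsp_cycles, optimized);
    printf("Speedup: %.2fx\n", (float)manual_cycles / (float)dsp_cycles);
}
```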

Additional Information:
The diagnosis strongly suggests that the problem is not my development environment or hardware, but the C++ library that Edge Impulse generated for my model (low-pass filter + wavelet analysis): its internal implementation does not appear to link against or invoke the SIMD-optimized ESP-NN and ESP-DSP backends provided by ESP-IDF.
My primary question is: Are there known reasons or configurations that would prevent the Wavelet Analysis DSP block (or the subsequent neural network) from using the ESP-NN/ESP-DSP SIMD backends on an ESP32-S3, even when they are enabled in menuconfig? It appears the linkage between the high-level Edge Impulse pipeline and the low-level optimized libraries is not happening as expected in my project.

Hello, @haowan!
Thanks for the detailed report.
First things first: we do not have ESP-DSP acceleration enabled - it has been in our backlog for a while, but is not yet implemented.

I’ve tested ESP-NN acceleration with larger networks (e.g. MobileNet) and confirmed it works. For small networks such as yours, the gains are hard to measure: ESP-NN quotes an optimization ratio of about 2.77 for fully connected layers on the S3 (see GitHub - espressif/esp-nn: Optimised Neural Network functions for Espressif chipsets), which in your case is roughly the difference between 0.001 s and 0.0003 s of NN time. Your own timing breakdown shows the neural network accounts for only ~1 ms of the ~9.2 ms total; the remaining ~8 ms is the DSP stage, which (as noted above) is not accelerated, so the total latency barely changes between the two builds.

Do you have a use case that justifies the need for such extremely low latency?