Question/Issue:
Hello Edge Impulse Team and community,
I am seeking some expert advice on a performance issue I’m facing when deploying a model to an ESP32-S3. I’ve been working through a very detailed debugging process and have successfully isolated the problem, but now I need your help to understand the root cause within the EI SDK.
Project ID:
693087
Context/Use case:
I have developed a real-time motion detection application on an ESP32-S3 microcontroller using ESP-IDF v5.4. My model, developed on Edge Impulse, uses 3-axis accelerometer data to classify between ‘walk’, ‘slipforward’, and ‘slipright’.
The impulse is configured as follows:
- Input: Time-series data (2000ms window, 100Hz frequency).
- DSP Block 1: Low-pass Filter (Cut-off: 5Hz).
- DSP Block 2: Wavelet Analysis (Wavelet: db4, Level: 1).
- Learning Block: Classification (Neural Network).
The goal is to leverage the ESP32-S3’s SIMD capabilities via ESP-NN and ESP-DSP to achieve the best possible inference performance.
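For reference, here is how I read that configuration mapping onto the constants in the exported library's `model_metadata.h`. The numbers below are derived from the window settings (2000 ms at 100 Hz, 3 axes), not copied from the generated file, so treat them as an assumption:

```cpp
// Sketch only: expected shape of the exported model_metadata.h constants,
// inferred from the impulse settings above (not copied from the real file).
#define EI_CLASSIFIER_INTERVAL_MS              10    // 1000 ms / 100 Hz
#define EI_CLASSIFIER_RAW_SAMPLE_COUNT        200    // 2000 ms window / 10 ms per sample
#define EI_CLASSIFIER_RAW_SAMPLES_PER_FRAME     3    // accX, accY, accZ
#define EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE    600    // 200 samples * 3 axes
```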
Steps Taken:
- I exported the C++ library from a fully trained and validated model (99.8% accuracy on the platform).
- I integrated the library into my ESP-IDF project and fixed several initial bugs, including a critical data type mismatch (incorrectly passing raw integers instead of floats to the feature buffer) and a memory alignment issue (the input buffer is now 16-byte aligned via `__attribute__((aligned(16)))`).
- I implemented a rigorous A/B test by toggling the ESP-NN and ESP-DSP settings between `ANSI C` and `Optimized` in `menuconfig`.
- To get precise results, I am measuring performance with `esp_cpu_get_cycle_count()` wrapped around the `run_classifier()` call (see the measurement sketch after this list).
- After observing no speedup, I created a final diagnostic test to validate my environment. It directly compares a manual C dot-product loop against the `dsps_dotprod_f32_ae32` function from the official ESP-DSP library (sketched under Logs/Attachments below).
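For context, this is roughly the measurement harness I use. It is a simplified sketch rather than my full application: `features`, `get_feature_data`, and `benchmark_run_classifier` are placeholder names, and `EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE` comes from the exported `model_metadata.h`:

```cpp
#include <string.h>
#include "esp_cpu.h"
#include "edge-impulse-sdk/classifier/ei_run_classifier.h"

// 16-byte aligned float feature buffer, filled from the accelerometer driver.
static float features[EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE] __attribute__((aligned(16)));

// Callback that hands slices of the feature buffer to the SDK.
static int get_feature_data(size_t offset, size_t length, float *out_ptr) {
    memcpy(out_ptr, features + offset, length * sizeof(float));
    return 0;
}

static void benchmark_run_classifier(void) {
    signal_t signal;
    signal.total_length = EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE;
    signal.get_data = &get_feature_data;

    ei_impulse_result_t result = { 0 };

    // Cycle count around the whole pipeline (DSP + NN).
    uint32_t start = esp_cpu_get_cycle_count();
    EI_IMPULSE_ERROR err = run_classifier(&signal, &result, false);
    uint32_t cycles = esp_cpu_get_cycle_count() - start;

    ei_printf("run_classifier returned %d, total CPU cycles: %lu\n",
              err, (unsigned long)cycles);
    ei_printf("Timing breakdown (us): DSP=%d, NN=%d, Anomaly=%d\n",
              (int)result.timing.dsp_us,
              (int)result.timing.classification_us,
              (int)result.timing.anomaly_us);
}
```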
Expected Outcome:
I expected to see a significant reduction in CPU cycles for the `run_classifier()` call when compiling with the `Optimized` settings, especially after fixing the memory alignment. The `result.timing.dsp_us` and `result.timing.classification_us` values should have been much lower.
Actual Outcome:
- The A/B test on my project’s `run_classifier()` function shows no discernible performance difference between the `ANSI C` and `Optimized` builds. The total execution time remains ~9.2 ms (~2.2 million cycles) in both configurations.
- However, the direct diagnostic test was a success. The results were:
  - Manual C loop: ~13,400 cycles
  - ESP-DSP SIMD function: ~4,100 cycles
- This shows a ~3.2x speedup and confirms that my development environment is set up correctly and that SIMD acceleration is working at the library level.
The issue seems to be that the `run_classifier()` pipeline is not utilizing these optimized backends for my specific model.
Reproducibility:
- [x] Always
- [ ] Sometimes
- [ ] Rarely
Environment:
- **Platform:** ESP32-S3
- **Build Environment Details:** ESP-IDF v5.4
- **OS Version:** Windows 11
- **Edge Impulse Version (Firmware):** Studio 1.71.56 (from the exported library: `EI_STUDIO_VERSION_MAJOR 1`, `EI_STUDIO_VERSION_MINOR 71`, `EI_STUDIO_VERSION_PATCH 56`)
- **Edge Impulse CLI Version:** 1.32.1
- **Project Version:** [e.g., 1.0.0]
- **Custom Blocks / Impulse Configuration:** My impulse consists of the following key parts:
  - Input block: time-series data, window size 2000 ms, window increment 1 ms, sampling frequency 100 Hz.
  - DSP block 1: Low-pass filter with a cut-off frequency of 5 Hz.
  - DSP block 2: Wavelet Analysis using Daubechies 4 (db4).
  - Learning block: a Classification (DNN) model.
Logs/Attachments:
`run_classifier()` test:
// ANSI C Mode
Total CPU cycles: 2188623
Timing breakdown (us): DSP=7907, NN=1006, Anomaly=0
// Optimized Mode
Total CPU cycles: 2197387
Timing breakdown (us): DSP=7999, NN=1017, Anomaly=0
Environment check (SIMD works as expected):
Manual C Loop cycles: 13338 (result: 1786949.250000)
ESP-DSP SIMD cycles: 4118 (result: 1786949.250000)
Conclusion: SIMD is working and is 3.24 times faster!
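For anyone who wants to reproduce this environment check, it is essentially the following. This is a simplified sketch: the vector length and test data are illustrative and are not the exact values behind the numbers above:

```cpp
#include <stdio.h>
#include "esp_cpu.h"
#include "dsps_dotprod.h"   // ESP-DSP dot-product API

#define TEST_LEN 1024       // illustrative vector length

static float vec_a[TEST_LEN] __attribute__((aligned(16)));
static float vec_b[TEST_LEN] __attribute__((aligned(16)));

static void simd_environment_check(void) {
    for (int i = 0; i < TEST_LEN; i++) {
        vec_a[i] = i * 0.50f;
        vec_b[i] = i * 0.25f;
    }

    // Plain C reference loop.
    uint32_t t0 = esp_cpu_get_cycle_count();
    float c_result = 0.0f;
    for (int i = 0; i < TEST_LEN; i++) {
        c_result += vec_a[i] * vec_b[i];
    }
    uint32_t c_cycles = esp_cpu_get_cycle_count() - t0;

    // ESP-DSP assembly-optimized dot product.
    float simd_result = 0.0f;
    t0 = esp_cpu_get_cycle_count();
    dsps_dotprod_f32_ae32(vec_a, vec_b, &simd_result, TEST_LEN);
    uint32_t simd_cycles = esp_cpu_get_cycle_count() - t0;

    printf("Manual C Loop cycles: %lu (result: %f)\n", (unsigned long)c_cycles, c_result);
    printf("ESP-DSP SIMD cycles: %lu (result: %f)\n", (unsigned long)simd_cycles, simd_result);
}
```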
Additional Information:
This diagnosis strongly suggests that the problem is not with my development environment or hardware, but is specific to the C++ library that Edge Impulse generated for my model (low-pass filter + wavelet analysis): its internal implementation does not appear to link against or invoke the SIMD-optimized ESP-NN and ESP-DSP backends provided by ESP-IDF.
My primary question is: Are there known reasons or configurations that would prevent the Wavelet Analysis DSP block (or the subsequent neural network) from using the ESP-NN/ESP-DSP SIMD backends on an ESP32-S3, even when they are enabled in `menuconfig`? It appears the linkage between the high-level Edge Impulse pipeline and the low-level optimized libraries is not happening as expected in my project.
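One additional check I can run on my side, in case it is useful: verify at compile time whether the ESP-NN kernel path is actually part of the build. The macro name below is my assumption of the flag the Edge Impulse SDK uses for its ESP-NN integration; please correct me if the SDK gates this differently:

```cpp
// Assumption: EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN is the define the SDK uses to
// route TFLite Micro kernels to ESP-NN; verify the exact name in the exported
// edge-impulse-sdk sources and the component CMakeLists.
#if defined(EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN)
    ei_printf("ESP-NN kernel path compiled in\n");
#else
    ei_printf("ESP-NN kernel path NOT compiled in (reference kernels only)\n");
#endif
```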