STM32 - final elf binary is 60 times bigger than expected

@hwidvorakinfo Is ARM_ALL_FFT_TABLES, or any other ARM_* macro, defined in your compiler defines by any chance? I’ve asked our embedded team to comment here as well.

Hi @hwidvorakinfo, just to be sure: are you building the exact same project in STM32CubeIDE as in the SW4STM32 environment? Or did you swap out the model as well?

No, there is no ARM_ALL_FFT_TABLES macro defined in the project.

arm_common_tables.h contains preprocessor conditions like this one:

#if !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_FFT_ALLOW_TABLES)
/* Double Precision Float CFFT twiddles */
#if !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES) || defined(ARM_TABLE_BITREV_1024)
extern const uint16_t armBitRevTable[1024];
#endif /* !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES) */

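None of the macros ARM_DSP_CONFIG_TABLES, ARM_FFT_ALLOW_TABLES and ARM_ALL_FFT_TABLES is defined. In my opinion, the first part of the condition, !defined(ARM_DSP_CONFIG_TABLES), does the dirty trick here: it is true whenever ARM_DSP_CONFIG_TABLES is undefined, so the table is declared regardless of the other macros.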
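The behaviour of that guard can be checked in isolation. The following is a minimal sketch in C that copies the inner condition from the excerpt above and compiles it with none of the ARM_* macros defined, as in this project. The marker macro `TABLE_BITREV_1024_ENABLED` and the helper `bitrev_1024_enabled()` are hypothetical names used only for illustration; they are not part of CMSIS-DSP.

```c
#include <assert.h>

/* Same condition as the inner guard in arm_common_tables.h.
 * None of the three macros is defined in this translation unit. */
#if !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES) || defined(ARM_TABLE_BITREV_1024)
#define TABLE_BITREV_1024_ENABLED 1   /* hypothetical marker, not a CMSIS macro */
#else
#define TABLE_BITREV_1024_ENABLED 0
#endif

/* Compile-time check: the guard passes purely because
 * ARM_DSP_CONFIG_TABLES is undefined. */
static_assert(TABLE_BITREV_1024_ENABLED == 1,
              "guard is open when ARM_DSP_CONFIG_TABLES is undefined");

/* Hypothetical helper so the result can also be inspected at run time. */
int bitrev_1024_enabled(void)
{
    return TABLE_BITREV_1024_ENABLED;
}
```

Adding `#define ARM_DSP_CONFIG_TABLES` at the top of this sketch, without any of the other macros, would flip the marker to 0 and make the static assertion fail, which is the opt-in behaviour the per-table macros rely on.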

Hello @Arjan, yes, the exact same project. I took the entire https://github.com/ShawnHymel/ei-keyword-spotting project.

Hi @hwidvorakinfo, can you try this? Go to config.hpp and, at the top of the file, just below the include guard, put #define EIDSP_USE_CMSIS_DSP 0

So it should look like this:

#ifndef _EIDSP_CPP_CONFIG_H_
#define _EIDSP_CPP_CONFIG_H_

#define EIDSP_USE_CMSIS_DSP 0

#ifndef EIDSP_USE_CMSIS_DSP

Let me know if that helps.

Note that this disables all of CMSIS-DSP, which probably makes classification too slow to run on the target, but at least we’ll have a baseline.

edit: @hwidvorakinfo If you could zip up your complete project and email it to jan@edgeimpulse.com I’ll also have a look.

Hello @AlexEEE, it works like a charm!

/Library/Developer/CommandLineTools/usr/bin/make --no-print-directory post-build
Generating hex and Printing size information:
arm-none-eabi-objcopy -O ihex "H7_Beast_ML_CM7.elf" "H7_Beast_ML_CM7.hex"
arm-none-eabi-size "H7_Beast_ML_CM7.elf"
 text	   data	    bss	    dec	    hex	filename
84776	   1052	  17100	 102928	  19210	H7_Beast_ML_CM7.elf 

Thank you very much, you guys at Edge Impulse. I am really impressed by your effort supporting me!

Would it be possible to add this macro to the STM32 pack during deployment, or is this a special case where it would do more harm than good?

So, we’d like to have our cake and eat it too! As Jan pointed out, CMSIS provides some impressive performance enhancements by using the DSP hardware on ARM chips. Unfortunately, the latest CMSIS library opts for the fastest possible speed at the expense of ROM. (More detail than is probably interesting here, but they’re using a mixed-radix FFT.)

However, they have a prior version that, with some patching, is almost as fast (and still uses HW acceleration), with the added benefit of very little ROM cost. (TMI: radix-2 FFT with one table for all FFT sizes.)
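Why one table can serve all sizes: for power-of-two lengths, the twiddle factors of a smaller FFT are a strided subset of the twiddle factors of the largest one. The sketch below illustrates only that indexing idea in C; it is not the CMSIS-DSP code or the patch itself. `N_MAX`, the two small tables and all function names are hypothetical, and the table is hardcoded for a maximum size of 8 to keep the example short.

```c
#include <assert.h>

#define N_MAX 8u  /* illustrative maximum FFT size; real tables are far larger */

/* cos and sin of 2*pi*k/N_MAX for k = 0 .. N_MAX/2 - 1 (hardcoded values). */
static const float tw_cos[N_MAX / 2] = { 1.0f, 0.70710678f, 0.0f, -0.70710678f };
static const float tw_sin[N_MAX / 2] = { 0.0f, 0.70710678f, 1.0f,  0.70710678f };

/* Step through the shared table for an n-point FFT.
 * n must be a power of two with 2 <= n <= N_MAX. */
unsigned twiddle_stride(unsigned n)
{
    assert(n >= 2u && n <= N_MAX && (n & (n - 1u)) == 0u);
    return N_MAX / n;
}

/* Real part of W_n^k = cos(2*pi*k/n), for k < n/2. */
float twiddle_re(unsigned n, unsigned k)
{
    assert(k < n / 2u);
    return tw_cos[k * twiddle_stride(n)];
}

/* Imaginary part of W_n^k = -sin(2*pi*k/n), for k < n/2. */
float twiddle_im(unsigned n, unsigned k)
{
    assert(k < n / 2u);
    return -tw_sin[k * twiddle_stride(n)];
}
```

For example, `twiddle_re(4, 1)` and `twiddle_im(4, 1)` read entry 2 of the shared table and give 0 and -1, the expected twiddle for a 4-point FFT. The ROM cost is that of the largest table only, at the price of a multiplication in the index.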

We’re working on this patch and once it’s released, the EIDSP_USE_CMSIS_DSP flag will cost far less ROM.

(PS love the name of your elf file, good choice :smile: )


The interesting part is that we don’t see this happening on other targets, which is why I’d be very interested in seeing your full project. E.g. on an STM32L4 target I see ~10K for CMSIS-DSP with the latest SDK.

It is named after the board I am working on :slightly_smiling_face:

Beast_H7 - STM32H757 + 32 MB SDRAM + 32 MB QSPI flash + DA14531MOD (BLE) + USB-C with UART/USB converter + 2 analog buffered inputs + much more


That is a beast indeed!

@janjongboom I just emailed you the link to download the entire project

@janjongboom I found that I need the following macros defined, and thus these tables linked:

#define ARM_TABLE_BITREV_1024
#define ARM_TABLE_TWIDDLECOEF_F32_4096
#define ARM_TABLE_TWIDDLECOEF_Q15_4096
#define ARM_TABLE_TWIDDLECOEF_Q31_4096
#define ARM_TABLE_REALCOEF_F32
#define ARM_TABLE_REALCOEF_Q15
#define ARM_TABLE_REALCOEF_Q31
#define ARM_TABLE_RECIP_Q15
#define ARM_TABLE_RECIP_Q31
#define ARM_TABLE_SIN_F32
#define ARM_TABLE_SIN_Q15
#define ARM_TABLE_SIN_Q31 

Can I expect the same set in every Edge Impulse model, or do the requirements vary a lot from model to model?

Awesome update!

No, we’ll be generating these automatically in the very near future (a PR is already open), based on the DSP config.


I went through this same painful process of finding out that CMSIS-DSP needs to be configured manually to include only the data and functionality that is needed. In my particular case (M4 platform) the culprit was the FFT tables. It turns out the entry function into the 32-bit float FFT has a switch statement covering every FFT size; the compiler sees this and concludes it has to include all of the FFT tables, which for me added something like 120 kB of overhead.
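The mechanism can be sketched in a few lines of C. This is an illustration of the pattern, not the CMSIS-DSP source: the tables, their sizes and the function names are hypothetical stand-ins. Because the FFT length is only known at run time, every `case` stays reachable, so every table it names stays referenced and ends up in the image.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for per-size FFT tables (real ones are much larger). */
static const float table_16[16] = { 0 };
static const float table_32[32] = { 0 };
static const float table_64[64] = { 0 };

/* Run-time dispatch on the FFT length. Each case references its own table,
 * so none of them can be discarded even if the application only ever
 * passes one length. */
const float *select_table(unsigned fft_len, size_t *len_out)
{
    assert(len_out != NULL);
    switch (fft_len) {
    case 16: *len_out = 16; return table_16;
    case 32: *len_out = 32; return table_32;
    case 64: *len_out = 64; return table_64;
    default: *len_out = 0;  return NULL;   /* unsupported length */
    }
}

/* Total bytes of table data kept alive by select_table(). */
size_t rom_bytes_referenced(void)
{
    return sizeof table_16 + sizeof table_32 + sizeof table_64;
}
```

Guarding each `case` and its table with a per-size macro, which is what the ARM_TABLE_* defines do, removes the reference at compile time and lets the unused tables drop out.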

I’m very happy to hear that these flags will be added automatically in the future!


@tennies The super weird thing is that it seems to be linker dependent. On some targets the increase is 10K for CMSIS-DSP flash usage, and then on another target it doubles the flash usage - weird, but yes, should be fixed soon!

@janjongboom a great feature for this (or the next) update would be to export all the macros that must be defined to a dedicated text file.

For example, all the macros needed for the run_inference() function.

Why do I ask? Because I do not use the deployed pack in STM32CubeIDE but in what is more or less a bare IDE environment, and a list of all the required macros would make the integration much easier.

@hwidvorakinfo In general everything is already included in model_metadata.h and dsp/config.hpp - no need to set anything else unless we can’t autodetect your MCU and you want to enable HW acceleration through CMSIS / ARC DSPs.

FWIW, these are the flags I’ve found I needed when using the 32-bit float FFTs.

General flags needed:
ARM_DSP_CONFIG_TABLES;ARM_FAST_ALLOW_TABLES;ARM_ALL_FAST_TABLES;ARM_FFT_ALLOW_TABLES

(In my experience, ARM_ALL_FAST_TABLES covers sin, cos and the like without increasing code size substantially.)

This one is needed for all 32-bit float RFFTs:
ARM_TABLE_REALCOEF_F32

And the macros below must be defined for every FFT size you are using (where <NFFT> is the RFFT size):
ARM_TABLE_TWIDDLECOEF_F32_<NFFT/2>;ARM_TABLE_BITREVIDX_FLT_<NFFT/2>;ARM_TABLE_TWIDDLECOEF_RFFT_F32_<NFFT>

(Note that MFCC features use 2 FFT sizes, one for the FFT and one for the DCT.)
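As a worked instance of the pattern above, here is what the flag set might look like written out as C defines for a single 256-point float RFFT. The size 256 is an assumption chosen for illustration (so <NFFT/2> becomes 128), and the macro names are obtained by substituting into the pattern from this post rather than checked against a particular CMSIS-DSP release. In practice these would usually go into the compiler's global defines so that every CMSIS-DSP source file sees them.

```c
/* General flags: switch CMSIS-DSP to opt-in table selection. */
#define ARM_DSP_CONFIG_TABLES
#define ARM_FAST_ALLOW_TABLES
#define ARM_ALL_FAST_TABLES
#define ARM_FFT_ALLOW_TABLES

/* Needed for all 32-bit float RFFTs. */
#define ARM_TABLE_REALCOEF_F32

/* Per-size flags, assuming a 256-point RFFT (NFFT = 256, NFFT/2 = 128). */
#define ARM_TABLE_TWIDDLECOEF_F32_128
#define ARM_TABLE_BITREVIDX_FLT_128
#define ARM_TABLE_TWIDDLECOEF_RFFT_F32_256
```

A second FFT size, such as the one MFCC uses for the DCT, would need its own three per-size defines added in the same way.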

In addition, you need to exclude the following files from compilation, or the compiler will complain about undefined variables:

arm_rfft_*q*.c
arm_cfft_radix4*.c
arm_cfft_radix2*.c

@AlexEEE does this align with what you found?