STM32 - final elf binary is 60 times bigger than expected

janjongboom · April 15, 2021, 1:11pm

@hwidvorakinfo Is ARM_ALL_FFT_TABLES by any chance set in your compiler defines, or is any other ARM_* define set there? I’ve asked our embedded team to comment here as well.

Arjan · April 15, 2021, 1:41pm

Hi @hwidvorakinfo, just to be sure. Are you building the exact same project in the STM32CubeIDE as in the SW4STM32 environment? Or did you swap out the model as well?

hwidvorakinfo · April 15, 2021, 3:39pm

No, there is no ARM_ALL_FFT_TABLES macro defined in the project.

arm_common_tables.h contains preprocessor conditions like this one:

#if !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_FFT_ALLOW_TABLES)
/* Double Precision Float CFFT twiddles */
#if !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES) || defined(ARM_TABLE_BITREV_1024)
extern const uint16_t armBitRevTable[1024];
#endif /* !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES) */

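None of the macros ARM_DSP_CONFIG_TABLES, ARM_FFT_ALLOW_TABLES and ARM_ALL_FFT_TABLES is defined. In my opinion, the first part of the condition, !defined(ARM_DSP_CONFIG_TABLES), is what does the dirty trick here: it is true whenever that macro is missing, so the table is included regardless of the other macros.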
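
To make the effect of that condition concrete, here is a minimal C sketch that mirrors the inner #if from the snippet above as an ordinary boolean expression. The function name table_declared and its parameters are made up for illustration; they are not part of CMSIS-DSP.

```c
#include <stdbool.h>

/* Mirrors: !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES)
            || defined(ARM_TABLE_BITREV_1024)
   Each parameter stands for "is this macro defined?". Hypothetical helper. */
bool table_declared(bool dsp_config_tables,
                    bool all_fft_tables,
                    bool table_bitrev_1024)
{
    return !dsp_config_tables || all_fft_tables || table_bitrev_1024;
}
```

With nothing defined, table_declared(false, false, false) is true, which matches the situation described here. Only once ARM_DSP_CONFIG_TABLES is defined do the per-table macros decide anything.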

hwidvorakinfo · April 15, 2021, 3:41pm

Hello @Arjan, yes, the exact same project. I took the entire https://github.com/ShawnHymel/ei-keyword-spotting project.

zAlexE · April 15, 2021, 4:00pm

Hi @hwidvorakinfo, can you try this? Go to config.hpp and, at the top of the file, just below the include guard, put #define EIDSP_USE_CMSIS_DSP 0

So it should look like this:

#ifndef _EIDSP_CPP_CONFIG_H_
#define _EIDSP_CPP_CONFIG_H_

#define EIDSP_USE_CMSIS_DSP 0

#ifndef EIDSP_USE_CMSIS_DSP

Let me know if that helps
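
For readers wondering why the placement matters, here is a small sketch of the override pattern. It assumes the existing #ifndef EIDSP_USE_CMSIS_DSP block picks a default when nothing is set; the body of that block below is a placeholder, not the real contents of config.hpp, and cmsis_dsp_enabled is a hypothetical helper.

```c
/* Manual override, placed before the #ifndef block. */
#define EIDSP_USE_CMSIS_DSP 0

#ifndef EIDSP_USE_CMSIS_DSP
/* Placeholder for whatever default the SDK would choose.
   Skipped here because the macro is already defined above. */
#define EIDSP_USE_CMSIS_DSP 1
#endif

/* Hypothetical helper that reports the value that ended up in effect. */
int cmsis_dsp_enabled(void)
{
    return EIDSP_USE_CMSIS_DSP;
}
```

Because the later block is guarded by #ifndef, the earlier definition wins and cmsis_dsp_enabled() returns 0. Defining the macro after the guarded block instead would redefine it rather than pre-empt the default.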

janjongboom · April 15, 2021, 4:25pm

Note that that disables all of CMSIS-DSP and that’s probably too slow to run classification on the target, but at least we’ll have a baseline.

edit: @hwidvorakinfo If you could zip up your complete project and email it to jan@edgeimpulse.com I’ll also have a look.

hwidvorakinfo · April 15, 2021, 4:27pm

Hello @AlexEEE, it works like a charm!

/Library/Developer/CommandLineTools/usr/bin/make --no-print-directory post-build
Generating hex and Printing size information:
arm-none-eabi-objcopy -O ihex "H7_Beast_ML_CM7.elf" "H7_Beast_ML_CM7.hex"
arm-none-eabi-size "H7_Beast_ML_CM7.elf"
 text	   data	    bss	    dec	    hex	filename
84776	   1052	  17100	 102928	  19210	H7_Beast_ML_CM7.elf

Thank you very much, you guys at Edge Impulse. I am really impressed by your effort supporting me!

Would it be possible to add this macro to the STM32 pack during deployment, or is it too specialized and would it do more harm than good?

zAlexE · April 15, 2021, 4:34pm

So, we’d like to have our cake and eat it too! Like Jan pointed out, CMSIS provides some impressive performance enhancements via usage of DSP hardware on ARM chips. Unfortunately, the latest CMSIS library opts for the fastest possible speed at the expense of ROM (more detail than is probably interesting here, but they’re using a mixed radix FFT)

However, they have a prior version that, with some patching, is almost as fast (and still uses HW acceleration), BUT, has the added benefit of very little ROM cost. (TMI: radix 2 FFT with one table for all FFT sizes)

We’re working on this patch and once it’s released, the EIDSP_USE_CMSIS_DSP flag will cost far less ROM.

(PS love the name of your elf file, good choice)

janjongboom · April 15, 2021, 4:39pm

The interesting part is that we don’t see this happening on other targets, which is why I’d be very interested in seeing your full project. E.g. on a STM32L4 target I see ~10K for CMSIS-DSP with the latest SDK.

hwidvorakinfo · April 15, 2021, 4:39pm

It is named after the board I am working on:

Beast_H7 - STM32H757 + 32 MB SDRAM + 32 MB QSPI flash + DA14531MOD (BLE) + USB-C with UART/USB converter + 2 analog buffered inputs + much more

zAlexE · April 15, 2021, 4:40pm

That is a beast indeed!

hwidvorakinfo · April 15, 2021, 4:42pm

@janjongboom I just emailed you the link to download the entire project

hwidvorakinfo · April 19, 2021, 12:25pm

@janjongboom I found that I need the following macros defined so that the corresponding tables are linked:

#define ARM_TABLE_BITREV_1024
#define ARM_TABLE_TWIDDLECOEF_F32_4096
#define ARM_TABLE_TWIDDLECOEF_Q15_4096
#define ARM_TABLE_TWIDDLECOEF_Q31_4096
#define ARM_TABLE_REALCOEF_F32
#define ARM_TABLE_REALCOEF_Q15
#define ARM_TABLE_REALCOEF_Q31
#define ARM_TABLE_RECIP_Q15
#define ARM_TABLE_RECIP_Q31
#define ARM_TABLE_SIN_F32
#define ARM_TABLE_SIN_Q15
#define ARM_TABLE_SIN_Q31

Is this something I can expect for every Edge Impulse model, or do the requirements vary a lot from model to model?

janjongboom · April 19, 2021, 12:39pm

Awesome update!

No, we’ll be generating these automatically in the very near future (a PR is already open), based on the DSP config.

tennies · April 19, 2021, 8:59pm

I went through this same painful process of finding out that CMSIS-DSP needs to be configured manually to only include the data/functionality that is needed. For my particular case (M4 platform) I found the culprit to be the FFT tables. It turns out the entry function into the 32-bit float FFT has a switch statement over all FFT sizes; every case references its own table, so all of the FFT tables end up in the binary, which for me added something like 120 kB of overhead.
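
A minimal sketch of that mechanism, with made-up names: the demo_* tables, DEMO_TABLE_* macros and demo_select_table are hypothetical and only imitate the pattern, they are not the CMSIS-DSP sources. Each case of the switch references a table, so a table can only be left out if its case is compiled out as well.

```c
#include <stddef.h>

/* Hypothetical configuration: only the 256-point table is enabled. */
#define DEMO_TABLE_256

#ifdef DEMO_TABLE_64
static const float demo_twiddle_64[64] = {0};
#endif
#ifdef DEMO_TABLE_256
static const float demo_twiddle_256[256] = {0};
#endif
#ifdef DEMO_TABLE_1024
static const float demo_twiddle_1024[1024] = {0};
#endif

/* Returns the table for fft_len, or NULL if that size was not enabled. */
const float *demo_select_table(unsigned fft_len)
{
    switch (fft_len) {
#ifdef DEMO_TABLE_64
    case 64:   return demo_twiddle_64;
#endif
#ifdef DEMO_TABLE_256
    case 256:  return demo_twiddle_256;
#endif
#ifdef DEMO_TABLE_1024
    case 1024: return demo_twiddle_1024;
#endif
    default:   return NULL;
    }
}
```

Without the #ifdef guards around the cases, all three tables would be referenced from one function and the toolchain would have to keep them, even if only one size is ever requested at run time.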

I’m very happy to hear that these flags will be added automatically in the future!

janjongboom · April 20, 2021, 6:17am

@tennies The super weird thing is that it seems to be linker dependent. On some targets the increase is 10K for CMSIS-DSP flash usage, and then on another target it doubles the flash usage - weird, but yes, should be fixed soon!

hwidvorakinfo · April 20, 2021, 7:43am

@janjongboom a great feature for this (or the next) update would be to export all macros that must be defined to a dedicated text file.

For example, all macros needed for the run_inference() function.

Why do I ask? Because I do not use the deployed pack in STM32CubeIDE but in a, let’s say, bare IDE environment, and a list of all the required macros would make the integration much easier.

janjongboom · April 20, 2021, 7:56am

@hwidvorakinfo In general everything is already included in model_metadata.h and dsp/config.hpp - no need to set anything else unless we can’t autodetect your MCU and you want to enable HW acceleration through CMSIS / ARC DSPs.

tennies · April 20, 2021, 1:54pm

FWIW, these are the flags I’ve found I needed when using the 32-bit float FFTs:

General flags needed:
ARM_DSP_CONFIG_TABLES;ARM_FAST_ALLOW_TABLES;ARM_ALL_FAST_TABLES;ARM_FFT_ALLOW_TABLES

(In my experience, ARM_ALL_FAST_TABLES catches sin, cos and the like without increasing code size substantially.)

This one is needed for all 32-bit float RFFTs:
ARM_TABLE_REALCOEF_F32

And the macros below must be defined for every FFT size you are using (where <NFFT> is the RFFT size):
ARM_TABLE_TWIDDLECOEF_F32_<NFFT/2>;ARM_TABLE_BITREVIDX_FLT_<NFFT/2>;ARM_TABLE_TWIDDLECOEF_RFFT_F32_<NFFT>

(Note that MFCC features use two FFT sizes, one for the FFT and one for the DCT.)

In addition, you need to exclude the following files from compilation, or the compiler will complain about undefined variables:

arm_rfft_*q*.c
arm_cfft_radix4*.c
arm_cfft_radix2*.c
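
Since the per-size names follow a fixed pattern, they can be generated rather than typed by hand. The sketch below simply formats the names from this post for a given RFFT length; the function rfft_f32_table_flags is a hypothetical helper and does not verify that CMSIS-DSP actually ships a table for that size.

```c
#include <stdio.h>

/* Writes the per-size defines for one 32-bit float RFFT length into buf,
   separated by ';'. Returns the snprintf result, i.e. the number of
   characters needed, excluding the terminating NUL. */
int rfft_f32_table_flags(unsigned nfft, char *buf, size_t buf_size)
{
    unsigned half = nfft / 2u; /* the complex FFT underneath is half size */

    return snprintf(buf, buf_size,
                    "ARM_TABLE_TWIDDLECOEF_F32_%u;"
                    "ARM_TABLE_BITREVIDX_FLT_%u;"
                    "ARM_TABLE_TWIDDLECOEF_RFFT_F32_%u",
                    half, half, nfft);
}
```

For an MFCC block this would be called once per FFT size in use, and the results appended to the general flags listed earlier in this post.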

janjongboom · April 20, 2021, 4:47pm

@AlexEEE does this align with what you found?