STM32 - final elf binary is 60 times bigger than expected

hwidvorakinfo · April 15, 2021, 4:42pm

@janjongboom I just emailed you the link to download the entire project

hwidvorakinfo · April 19, 2021, 12:25pm

@janjongboom I found that I need the following macros defined so that the corresponding tables get linked:

#define ARM_TABLE_BITREV_1024
#define ARM_TABLE_TWIDDLECOEF_F32_4096
#define ARM_TABLE_TWIDDLECOEF_Q15_4096
#define ARM_TABLE_TWIDDLECOEF_Q31_4096
#define ARM_TABLE_REALCOEF_F32
#define ARM_TABLE_REALCOEF_Q15
#define ARM_TABLE_REALCOEF_Q31
#define ARM_TABLE_RECIP_Q15
#define ARM_TABLE_RECIP_Q31
#define ARM_TABLE_SIN_F32
#define ARM_TABLE_SIN_Q15
#define ARM_TABLE_SIN_Q31

Can I expect this same set for every Edge Impulse model, or do the requirements vary a lot from model to model?

janjongboom · April 19, 2021, 12:39pm

Awesome update!

No, we’re automatically creating these in the very near future (PR is open already) based on DSP config.

tennies · April 19, 2021, 8:59pm

I went through this same painful process of finding out that CMSIS-DSP needs to be configured manually to only include the data/functionality that is needed. For my particular case (M4 platform) I found the culprit to be the FFT tables. It turns out the entry function into the 32-bit float FFT has a switch statement covering all FFT sizes; the compiler sees this and decides it needs to include all of the FFT tables, which for me added something like 120 kB of overhead.
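
To illustrate the mechanism described above, here is a minimal sketch. It is not the CMSIS-DSP source, and all `demo_`/`DEMO_` names are hypothetical. Each `case` references its own table, so an unguarded switch keeps every table alive; wrapping each case in a per-table macro, in the spirit of the `ARM_TABLE_*` defines, leaves only the selected tables referenced.

```c
#include <stddef.h>

/* Hypothetical stand-ins for large FFT lookup tables (names are illustrative). */
#define DEMO_TABLE_64           /* enable only the 64-point table */
/* #define DEMO_TABLE_128 */    /* undefined: table and case are compiled out */

#if defined(DEMO_TABLE_64)
static const float demo_table_64[64] = {0};
#endif
#if defined(DEMO_TABLE_128)
static const float demo_table_128[128] = {0};
#endif

/* Returns 0 and sets *table on success, -1 if the size was not compiled in. */
static int demo_fft_init(int n_fft, const float **table) {
    switch (n_fft) {
#if defined(DEMO_TABLE_64)
        case 64:
            *table = demo_table_64;
            return 0;
#endif
#if defined(DEMO_TABLE_128)
        case 128:
            *table = demo_table_128;
            return 0;
#endif
        default:
            *table = NULL;
            return -1;
    }
}
```

In this sketch, asking for a size whose table was compiled out returns an error at run time rather than silently pulling the table back in.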

I’m very happy to hear that these flags will be added automatically in the future!

janjongboom · April 20, 2021, 6:17am

@tennies The super weird thing is that it seems to be linker dependent. On some targets the increase is 10K for CMSIS-DSP flash usage, and then on another target it doubles the flash usage - weird, but yes, should be fixed soon!

hwidvorakinfo · April 20, 2021, 7:43am

@janjongboom a great feature for this (or the next) update would be to export all macros that must be defined to a dedicated text file.

For example, all macros needed for run_inference() function.

Why do I ask for it? Because I do not use the deployed pack in STM32CubeIDE but in, let’s say, a bare IDE environment, and a list of all needed macros would make the implementation much easier.

janjongboom · April 20, 2021, 7:56am

@hwidvorakinfo In general everything is already included in model_metadata.h and dsp/config.hpp - no need to set anything else unless we can’t autodetect your MCU and you want to enable HW acceleration through CMSIS / ARC DSPs.

tennies · April 20, 2021, 1:54pm

FWIW, these are the flags I’ve found I needed using the 32-bit float FFTs

General flags needed:
ARM_DSP_CONFIG_TABLES;ARM_FAST_ALLOW_TABLES;ARM_ALL_FAST_TABLES;ARM_FFT_ALLOW_TABLES

(In my experience, ARM_ALL_FAST_TABLES catches sin, cos and the like without increasing code size substantially.)

This one is needed for all 32-bit float RFFTs:
ARM_TABLE_REALCOEF_F32

And the below macros must be defined for every FFT size you are using (where <NFFT> is the RFFT size)
ARM_TABLE_TWIDDLECOEF_F32_<NFFT/2>;ARM_TABLE_BITREVIDX_FLT_<NFFT/2>;ARM_TABLE_TWIDDLECOEF_RFFT_F32_<NFFT>

(Note MFCC features use 2 FFT sizes, one for the FFT and one for the DCT)

In addition, you need to exclude the following files from compilation, or the compiler will complain about undefined variables:

arm_rfft_*q*.c
arm_cfft_radix4*.c
arm_cfft_radix2*.c
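
As a rough illustration of the naming scheme above, the hypothetical helper below (not part of CMSIS-DSP or the Edge Impulse SDK) builds the per-size define list for one 32-bit float RFFT length. It assumes the pattern holds as described: half-size suffixes for the twiddle and bit-reversal macros, full size for the RFFT twiddle macro.

```c
#include <stdio.h>

/* Hypothetical helper: writes the three distinct per-size macro names for an
 * RFFT of length nfft into buf, separated by ';'.
 * Returns the snprintf result (number of characters needed, excluding '\0'). */
static int demo_rfft_f32_macros(unsigned nfft, char *buf, size_t len) {
    return snprintf(buf, len,
                    "ARM_TABLE_TWIDDLECOEF_F32_%u;"
                    "ARM_TABLE_BITREVIDX_FLT_%u;"
                    "ARM_TABLE_TWIDDLECOEF_RFFT_F32_%u",
                    nfft / 2, nfft / 2, nfft);
}
```

For a 256-point RFFT this yields the `_128`, `_128` and `_256` variants; the general flags and ARM_TABLE_REALCOEF_F32 still have to be added separately.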

janjongboom · April 20, 2021, 4:47pm

@AlexEEE does this align with what you found?

zAlexE · April 20, 2021, 5:20pm

I’m not sure if those are the exact same macros that we’re looking at, but they look correct. And if you try to use an FFT size that you don’t have the table for, you’ll get a runtime error (or sometimes it won’t even build).

Also, I don’t remember having to remove those files, so that’s different too. @tennies are you using the older arm_cfft_f32.c? My understanding was that arm_cfft_radix4*.c and arm_cfft_radix2*.c are called under the hood by arm_rfft_fast* (which is the function that uses the fast tables).

But in the end, it’s safe to play with those macros to try to reduce ROM size, b/c it will fail in a very obvious way if you’re missing a key table.

tennies · April 20, 2021, 5:51pm

What do you mean by the older arm_cfft_f32.c?
I found that only arm_cfft_radix8 was used by the rfft function, so the others could be excluded. Including them meant having to include a few other tables. In the end, enabling CMSIS-DSP with these options only seemed to add about 6-7kB to the flash size compared to CMSIS-DSP disabled (EIDSP_USE_CMSIS_DSP I think is the flag).

zAlexE · April 20, 2021, 9:47pm

@tennies Actually, I had it backwards. The files that end in radix2/4 are a different set of cfft functions than the ones used by rfft.

So yes, I agree, you can remove those files.

hwidvorakinfo · May 4, 2021, 5:49am

When do you estimate the new enhanced CMSIS-PACK version can be released?

janjongboom · May 4, 2021, 6:29am

We’re putting the final things together - probably in a week or two.

hwidvorakinfo · June 11, 2021, 5:50am

Can I ask what the status is of the CMSIS-PACK built with only the selected tables linked in, in other words the “enhanced CMSIS-PACK” version?

janjongboom · June 11, 2021, 6:55am

@hwidvorakinfo Still in progress. The PR is ready, but we found a bug with some audio models that needed to be ironed out (and then other things got in the way, you know how it goes). I’ll update this thread when released.

janjongboom · September 14, 2021, 3:49pm

So after investigating this and going back and forth between Alex and me (and a bunch of PRs back and forth) we decided not to go the route of manually declaring the FFT tables that are required. On all our fully supported development boards this is done correctly and unused tables are compiled out, so the basic premise already works. If someone (e.g. @hwidvorakinfo) can send me a full STM32CubeIDE project with the issue, we’ll look at all linker flags and compare them with other projects.

janjongboom · September 27, 2021, 2:05pm

So based on looking at @hwidvorakinfo’s project at least this should be enabled:

Under Project > C/C++ Build > Settings > Optimization (both GCC and G++).

janjongboom · October 8, 2021, 2:42pm

Update: today we encountered an issue for a new target where all FFT tables were included. I’ve been tracking this bug down all day, and somewhere the linker gets confused, does not realize that there’s only one path through the CMSIS-DSP RFFT init code, and includes all possible paths (which include all possible FFT tables). Basically this:

static int arm_cmsis_rfft_init(int n_fft) {
    switch (n_fft) {
        case 32:
            // load twiddle tables etc
            return 0;
        case 64:
            // load twiddle tables etc
            return 0;
        // etc
        default:
            return -1; // unsupported size
    }
}

static int ei_dsp_fn3(int n_fft) {
    return arm_cmsis_rfft_init(n_fft);
}

static int ei_dsp_fn2(int n_fft) {
    return ei_dsp_fn3(n_fft);
}

static int ei_dsp_fn1(int n_fft) {
    return ei_dsp_fn2(n_fft);
}

ei_dsp_fn3(128); // knows to only include 128
ei_dsp_fn1(128); // includes all fft tables

// ?!?!

Weird right? This adds 150K of flash in that case.

I’ve put together a PR which inlines the RFFT init, and this resolves the issue. Will get it reviewed sometime next week.
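
A sketch of why inlining can help, assuming a GCC-compatible compiler for the `always_inline` attribute. The `demo_` names are hypothetical and this is not the actual SDK patch. Once every wrapper is inlined into the caller, the literal FFT size is visible at the switch, so an optimizing compiler can fold it down to a single case and stop referencing the other tables. Whether the unused tables then disappear from the binary still depends on the optimization level and linker settings.

```c
/* Force inlining where the attribute is available; plain inline otherwise. */
#if defined(__GNUC__)
#define DEMO_ALWAYS_INLINE static inline __attribute__((always_inline))
#else
#define DEMO_ALWAYS_INLINE static inline
#endif

/* Hypothetical stand-ins for per-size FFT tables. */
static const float demo_table_32[32] = {0};
static const float demo_table_128[128] = {0};

/* Returns the selected size and sets *table, or -1 for an unsupported size. */
DEMO_ALWAYS_INLINE int demo_rfft_init(int n_fft, const float **table) {
    switch (n_fft) {
        case 32:
            *table = demo_table_32;
            return 32;
        case 128:
            *table = demo_table_128;
            return 128;
        default:
            *table = 0;
            return -1;
    }
}

/* Wrapper layers, mirroring the nested calls in the post above. */
DEMO_ALWAYS_INLINE int demo_dsp_fn2(int n_fft, const float **table) {
    return demo_rfft_init(n_fft, table);
}

DEMO_ALWAYS_INLINE int demo_dsp_fn1(int n_fft, const float **table) {
    return demo_dsp_fn2(n_fft, table);
}

/* The constant 128 reaches the switch after inlining. */
static int demo_run(const float **table) {
    return demo_dsp_fn1(128, table);
}
```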

janjongboom · October 13, 2021, 7:34am

The patch above has been merged into master and will be deployed in the next few days (included in any C++ Export or CMSIS-PACK export). This will require you to regenerate features on DSP blocks; the easiest way is to remove and re-add the DSP block and retrain your impulse.

To spot if you have the updated model: you should see #define EI_CLASSIFIER_HAS_FFT_INFO 1 in model_metadata.h.
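
If you would rather check this in code than by eye, a minimal sketch follows. The `demo_`/`DEMO_` names are hypothetical, and in a real project the block would come after the `#include` of your exported model_metadata.h (omitted here so the snippet compiles on its own).

```c
/* Evaluates to 1 only when the exported metadata defines the macro as 1. */
#if defined(EI_CLASSIFIER_HAS_FFT_INFO) && (EI_CLASSIFIER_HAS_FFT_INFO == 1)
#define DEMO_MODEL_HAS_FFT_INFO 1
#else
#define DEMO_MODEL_HAS_FFT_INFO 0
#endif

/* Hypothetical helper so the result can also be queried at run time. */
static int demo_model_has_fft_info(void) {
    return DEMO_MODEL_HAS_FFT_INFO;
}
```

Compiled standalone, without the metadata header, the helper reports 0, which is also what an export from before the patch would give.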