STM32 - final elf binary is 60 times bigger than expected

I trained a very simple NN, and while deploying it to an STM32 environment I got this prediction:

The issue is that this code adds 998 kB to the final ELF binary, even with -Os (optimise for size).

ei_printf("Inferencing settings:\r\n");
ei_printf("\tInterval: %.2f ms.\r\n", (float)EI_CLASSIFIER_INTERVAL_MS);
ei_printf("\tFrame size: %d\r\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
ei_printf("\tSample length: %d ms.\r\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
ei_printf("\tNo. of classes: %d\r\n", (int)(sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0])));
    
int print_results = -(EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW);

while (1)
{
  // Do classification (i.e. the inference part)
  signal_t signal;
  signal.total_length = EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE;
  signal.get_data = &get_signal_data;
  ei_impulse_result_t result = { 0 };
  EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);

  if (r != EI_IMPULSE_OK)
  {
    ei_printf("ERROR: Failed to run classifier (%d)\r\n", r);
    break;
  }

  // Print output predictions
  if (++print_results >= (EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW >> 1))
  {
    // Comment this section out if you don't want to see the raw scores
    ei_printf("Predictions (DSP: %d ms, NN: %d ms)\r\n", result.timing.dsp, result.timing.classification);
    for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++)
    {
      ei_printf("    %s: %.5f\r\n", result.classification[ix].label, result.classification[ix].value);
    }
    print_results = 0;
  }
}

What should I fix there?

I forgot to mention the platform is STM32H757 and the classifier was built for Cortex-M7 core.

@hwidvorakinfo That looks a little insane indeed. Are you building this from STM32CubeIDE?

@janjongboom I built it in the SW4STM32 IDE.

There are some interesting facts:

  1. I started with the https://github.com/ShawnHymel/ei-keyword-spotting project, which I imported into STM32CubeIDE. There, the binary built with -Os is about 220 kB.

  2. I integrated the model deployed from EI into my own project in SW4STM32, converting it to a C++ project along the way.

  3. If I build the same https://github.com/ShawnHymel/ei-keyword-spotting project in SW4STM32, the binary is exactly 1028876 bytes with -Os.

This is very strange behaviour, because both IDEs use the g++ compiler and linker.

I have to dig into the compiler settings etc. This is a suspicious thing.

Any clues kindly welcomed.

Does it generate a map file or does the linker already throw before it can make this?

Yes, there is a mapfile and here it is - I uploaded it to some Czech sharing website. Just open the link.

Ha, all the ROM tables are included:

.rodata 0x0000000008011af8 0xe05c8 Middlewares/edgeimpulse/edge-impulse-sdk/CMSIS/DSP/Source/CommonTables/arm_common_tables.o

How can I make the linker not include them? It is 897 kB of them :man_facepalming:
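For anyone else reading a map file like this: the third column is the size in hex, so 0xe05c8 is 918984 bytes, which is where the roughly 897 kB comes from. Below is a minimal sketch for pulling those fields out of one map line. MapEntry and parse_map_line are names made up for this example, and it assumes the whole entry sits on a single line with no spaces in the object path (real map files sometimes wrap long section names onto the next line).

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// One contribution line from a GNU ld map file (hypothetical helper type):
//   <section> <address> <size> <object file>
struct MapEntry {
    std::string section;
    std::uint64_t address = 0;
    std::uint64_t size = 0;  // size in bytes
    std::string object;
};

// Parses a single-line entry. Returns false if the line does not have
// the four expected fields. Address and size are read as hexadecimal;
// the "0x" prefix is accepted by the stream in hex mode.
bool parse_map_line(const std::string& line, MapEntry& out) {
    std::istringstream in(line);
    MapEntry e;
    if (!(in >> e.section >> std::hex >> e.address >> e.size >> e.object)) {
        return false;
    }
    out = e;
    return true;
}
```

Feeding it the arm_common_tables.o line above gives size == 918984. Summing size per object file over the whole map is a quick way to see what dominates the image.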

@hwidvorakinfo Is ARM_ALL_FFT_TABLES defined by any chance in your compiler defines or something else related to an ARM_* define? I’ve asked our embedded team to comment here as well.

Hi @hwidvorakinfo, just to be sure. Are you building the exact same project in the STM32CubeIDE as in the SW4STM32 environment? Or did you swap out the model as well?

No, there is no ARM_ALL_FFT_TABLES macro defined in the project.

arm_common_tables.h contains preprocessor conditions like this one:

#if !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_FFT_ALLOW_TABLES)
/* Double Precision Float CFFT twiddles */
#if !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES) || defined(ARM_TABLE_BITREV_1024)
extern const uint16_t armBitRevTable[1024];
#endif /* !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES) */

None of the macros ARM_DSP_CONFIG_TABLES, ARM_FFT_ALLOW_TABLES and ARM_ALL_FFT_TABLES is defined. In my opinion it is the first part of the condition, !defined(ARM_DSP_CONFIG_TABLES), that does the dirty trick here: with that macro undefined, the condition is true and every table is compiled in.
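To make the effect of that guard concrete, here is a minimal stand-alone sketch. The #if condition is copied from the header excerpt above; demoBitRevTable, DEMO_TABLE_COMPILED_IN and demo_table_bytes are made-up names for illustration and are not part of CMSIS-DSP.

```cpp
#include <cstddef>
#include <cstdint>

// Leave both lines commented out to mimic the project described in this thread.
// #define ARM_DSP_CONFIG_TABLES   // switch to "opt in per table"
// #define ARM_TABLE_BITREV_1024   // opt back in to this one table

// Same shape of condition as in arm_common_tables.h.
#if !defined(ARM_DSP_CONFIG_TABLES) || defined(ARM_ALL_FFT_TABLES) || defined(ARM_TABLE_BITREV_1024)
// Stand-in for a real table (hypothetical name, dummy contents).
const std::uint16_t demoBitRevTable[1024] = {};
#define DEMO_TABLE_COMPILED_IN 1
#else
#define DEMO_TABLE_COMPILED_IN 0
#endif

// Reports how many bytes of table data this translation unit contains.
std::size_t demo_table_bytes() {
#if DEMO_TABLE_COMPILED_IN
    return sizeof(demoBitRevTable);
#else
    return 0;
#endif
}
```

With nothing defined, demo_table_bytes() returns 2048, because the table is compiled in by default. Defining only ARM_DSP_CONFIG_TABLES makes it return 0, and adding ARM_TABLE_BITREV_1024 on top brings the table back.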

Hello @Arjan, yes, the exact same project. I took the entire https://github.com/ShawnHymel/ei-keyword-spotting project.

Hi @hwidvorakinfo, can you try this? Go to config.hpp and, at the top of the file, just below the include guard, put #define EIDSP_USE_CMSIS_DSP 0

So it should look like this:

#ifndef _EIDSP_CPP_CONFIG_H_
#define _EIDSP_CPP_CONFIG_H_

#define EIDSP_USE_CMSIS_DSP 0

#ifndef EIDSP_USE_CMSIS_DSP

Let me know if that helps

Note that that disables all of CMSIS-DSP and that’s probably too slow to run classification on the target, but at least we’ll have a baseline.

edit: @hwidvorakinfo If you could zip up your complete project and email it to jan@edgeimpulse.com I’ll also have a look.

Hello @AlexEEE, it works like a charm!

/Library/Developer/CommandLineTools/usr/bin/make --no-print-directory post-build
Generating hex and Printing size information:
arm-none-eabi-objcopy -O ihex "H7_Beast_ML_CM7.elf" "H7_Beast_ML_CM7.hex"
arm-none-eabi-size "H7_Beast_ML_CM7.elf"
 text	   data	    bss	    dec	    hex	filename
84776	   1052	  17100	 102928	  19210	H7_Beast_ML_CM7.elf 

Thank you very much, you guys at Edge Impulse. I am really impressed by your effort supporting me!

Would it be possible to add this macro to the STM32 pack during deployment, or is it too specialised and would do more harm than good?

So, we’d like to have our cake and eat it too! As Jan pointed out, CMSIS provides some impressive performance enhancements via usage of DSP hardware on ARM chips. Unfortunately, the latest CMSIS library opts for the fastest possible speed at the expense of ROM (more detail than is probably interesting here, but they’re using a mixed radix FFT).

However, they have a prior version that, with some patching, is almost as fast (and still uses HW acceleration), BUT, has the added benefit of very little ROM cost. (TMI: radix 2 FFT with one table for all FFT sizes)
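For the curious, here is a rough sketch of why a radix-2 FFT can get away with one table for all sizes: the twiddle factors for a size-N transform are every (maxN/N)-th entry of the table built for the largest size. This is a plain textbook radix-2 implementation written for illustration, not the CMSIS code or the patch mentioned above; make_twiddles and fft_radix2 are made-up names.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cplx = std::complex<float>;

// One table for the largest supported size:
// w[k] = exp(-2*pi*i*k / maxN) for k < maxN/2.
std::vector<cplx> make_twiddles(std::size_t maxN) {
    const float kPi = 3.14159265358979323846f;
    std::vector<cplx> w(maxN / 2);
    for (std::size_t k = 0; k < w.size(); ++k) {
        const float angle = -2.0f * kPi * static_cast<float>(k) / static_cast<float>(maxN);
        w[k] = cplx(std::cos(angle), std::sin(angle));
    }
    return w;
}

// In-place radix-2 decimation-in-time FFT.
// data.size() must be a power of two and must not exceed maxN.
void fft_radix2(std::vector<cplx>& data, const std::vector<cplx>& w, std::size_t maxN) {
    const std::size_t n = data.size();
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(data[i], data[j]);
    }
    // Butterflies: a stage of length len needs exp(-2*pi*i*k/len),
    // which is w[k * (maxN / len)] in the shared table.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const std::size_t stride = maxN / len;
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const cplx t = w[k * stride] * data[start + k + len / 2];
                const cplx u = data[start + k];
                data[start + k] = u + t;
                data[start + k + len / 2] = u - t;
            }
        }
    }
}
```

The same make_twiddles(1024) table then serves an 8-point, 256-point or 1024-point transform, so ROM cost is set by the largest size only. A mixed-radix design typically trades that sharing for speed by keeping tables tuned per size.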

We’re working on this patch and once it’s released, the EIDSP_USE_CMSIS_DSP flag will cost far less ROM.

(PS love the name of your elf file, good choice :smile: )

The interesting part is that we don’t see this happening on other targets, which is why I’d be very interested in seeing your full project. E.g. on a STM32L4 target I see ~10K for CMSIS-DSP with the latest SDK.

It is named by the board I am working on :slightly_smiling_face:

Beast_H7 - STM32H757 + 32 MB SDRAM + 32 MB QSPI flash + DA14531MOD (BLE) + USB-C with UART/USB converter + 2 analog buffered inputs + much more

That is a beast indeed!

@janjongboom I just emailed you the link to download the entire project

@janjongboom I found that I need these macros defined, and therefore these tables linked:

#define ARM_TABLE_BITREV_1024
#define ARM_TABLE_TWIDDLECOEF_F32_4096
#define ARM_TABLE_TWIDDLECOEF_Q15_4096
#define ARM_TABLE_TWIDDLECOEF_Q31_4096
#define ARM_TABLE_REALCOEF_F32
#define ARM_TABLE_REALCOEF_Q15
#define ARM_TABLE_REALCOEF_Q31
#define ARM_TABLE_RECIP_Q15
#define ARM_TABLE_RECIP_Q31
#define ARM_TABLE_SIN_F32
#define ARM_TABLE_SIN_Q15
#define ARM_TABLE_SIN_Q31 

Is this something I can expect in every Edge Impulse model, or do the requirements vary from model to model?