Hello everybody! I have a MobileNetV2 model in SavedModel format. When I convert the model to the Linux AARCH64 format and test it on the board (am62xsk), the float32 model turns out to be faster than the int8 one. Why is that?
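For reference, the int8 model was produced with post-training quantization roughly like this (a sketch; the paths and the representative dataset here are placeholders, and my exact settings may differ slightly):

```python
import numpy as np
import tensorflow as tf

# Post-training int8 quantization from the SavedModel (sketch; paths and the
# representative dataset are placeholders for the actual calibration data).
def rep_dataset():
    for _ in range(100):
        # MobileNetV2 default input shape
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("mobilenet_v2_savedmodel")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("mobilenet_v2_int8.tflite", "wb") as f:
    f.write(converter.convert())
```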
The Cortex-A53's 64-bit architecture might be better optimized for float32 operations, possibly due to how floating-point operations are implemented at the hardware level. Int8 is typically faster on MCU-based devices.
Let me check with the embedded team for a more in-depth answer on this particular device. @mateusz, do you have any more details on why float32 would be faster here, or is the fact that it is not an MCU the key factor?
Generally, an int8 model requires additional bit-shift/requantization operations to move values from 8 bits to 32 bits and back, so this is a situation where int8 pays an optimization penalty on 32-bit CPUs. Float models, on the other hand, use the FPU, which (usually) needs a few cycles per operation.
But if the CPU (I don't know the AM62 very well) has an FPU capable of one operation per cycle, then we can end up in a situation where the int8 model is slower than the float one.
Refer to this Google article https://arxiv.org/pdf/1712.05877.pdf for more details.
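To make that overhead concrete, here is a toy NumPy sketch (not actual TFLite kernel code) of the quantized dot-product scheme described in that paper: the int8 values are widened and accumulated in int32, then rescaled back to int8 with a fixed-point multiply and shift, whereas the float32 path is just a multiply-accumulate on the FPU. The scales and zero-points below are made-up example values.

```python
import numpy as np

def float_dot(x, w):
    # float32 path: the FPU just multiply-accumulates
    return np.dot(x, w)

def int8_dot(q_x, q_w, x_zp, w_zp, out_zp, multiplier, shift):
    # 1) widen int8 -> int32 and accumulate in int32
    acc = np.dot(q_x.astype(np.int32) - x_zp, q_w.astype(np.int32) - w_zp)
    # 2) requantize: fixed-point multiply (in int64) then arithmetic right shift
    acc = (acc.astype(np.int64) * multiplier) >> shift
    # 3) add the output zero-point and clamp back to the int8 range
    return np.clip(acc + out_zp, -128, 127).astype(np.int8)

x = np.random.rand(64).astype(np.float32)
w = np.random.rand(64).astype(np.float32)

# made-up quantization parameters for the example
x_scale, x_zp = 1 / 127.0, 0
w_scale, w_zp = 1 / 127.0, 0
out_scale, out_zp = 64 / 127.0, 0

q_x = np.round(x / x_scale).astype(np.int8)
q_w = np.round(w / w_scale).astype(np.int8)

# real kernels precompute (multiplier, shift) so that
# multiplier / 2**shift ~= x_scale * w_scale / out_scale
shift = 31
multiplier = int(round(x_scale * w_scale / out_scale * (1 << shift)))

print("float32:", float_dot(x, w))
print("int8   :", int8_dot(q_x, q_w, x_zp, w_zp, out_zp, multiplier, shift) * out_scale)
```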
We've also opened an internal issue to investigate further and will report back when we have some findings.
Thanks for bringing this to our attention.
The reason float32 was faster in your case is that our TensorFlow Lite EIMs use XNNPACK, which is optimized for float32.
If you try TensorFlow Lite Micro (omit USE_FULL_TFLITE=1 when building from source), int8 is (typically) faster than float32. But it depends on the target, ISA, optimizations, and other factors, as @mateusz mentioned.
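If you want to see the XNNPACK effect yourself, a rough benchmark sketch with the Python TFLite interpreter looks like the following (assuming a recent TensorFlow build, or tflite_runtime on the board; the model file names are placeholders). Passing BUILTIN_WITHOUT_DEFAULT_DELEGATES skips the default delegates, i.e. XNNPACK, so you can compare against the plain builtin kernels:

```python
import time
import numpy as np
import tensorflow as tf  # or tflite_runtime.interpreter on the board

def bench(model_path, disable_xnnpack=False, runs=50):
    kwargs = {"num_threads": 4}
    if disable_xnnpack:
        # skip the default delegates (i.e. XNNPACK) and use builtin kernels
        kwargs["experimental_op_resolver_type"] = (
            tf.lite.experimental.OpResolverType.BUILTIN_WITHOUT_DEFAULT_DELEGATES
        )
    interp = tf.lite.Interpreter(model_path=model_path, **kwargs)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    # random input of the right shape/dtype (int8 or float32)
    if inp["dtype"] == np.int8:
        data = np.random.randint(-128, 128, size=inp["shape"], dtype=np.int8)
    else:
        data = np.random.rand(*inp["shape"]).astype(np.float32)
    interp.set_tensor(inp["index"], data)
    interp.invoke()  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        interp.invoke()
    return (time.perf_counter() - start) / runs * 1000.0

# model file names are placeholders
for path in ("mobilenet_v2_float32.tflite", "mobilenet_v2_int8.tflite"):
    print(path, "with XNNPACK:    %.2f ms" % bench(path))
    print(path, "without XNNPACK: %.2f ms" % bench(path, disable_xnnpack=True))
```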