Hello everybody! I have a MobileNetV2 model in SavedModel format. When I convert the model to the Linux AARCH64 format and test it on the board (am62xsk), the float32 model turns out to be faster than the int8 one. Why is that?
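For reference, the int8 model was produced with post-training quantization roughly like this (a sketch; the paths and the representative dataset here are placeholders, and my exact settings may differ slightly):

```python
import numpy as np
import tensorflow as tf

# Post-training int8 quantization from the SavedModel (sketch; paths and the
# representative dataset are placeholders for the actual calibration data).
def rep_dataset():
    for _ in range(100):
        # MobileNetV2 default input shape
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("mobilenet_v2_savedmodel")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("mobilenet_v2_int8.tflite", "wb") as f:
    f.write(converter.convert())
```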
The Cortex-A53's 64-bit architecture might be better optimized for float32 operations, possibly due to how floating-point operations are implemented at the hardware level. Int8 is typically faster on MCU-based devices.
Let me check with the embedded team for a more in-depth answer on this particular device. @mateusz, do you have any more details on why float32 would be faster here, or is the fact that it is not an MCU the key factor?
Generally, an int8 model requires additional bit-shift/requantization operations to move values from 8 bits to 32 bits and back, so this is a situation where int8 pays an optimization penalty on 32-bit CPUs. Float models, on the other hand, use the FPU, which (usually) needs a few cycles per operation.
But if the CPU (I don't know the AM62 very well) has an FPU capable of one operation per cycle, then we can end up in a situation where the int8 model is slower than the float one.
Refer to this Google article https://arxiv.org/pdf/1712.05877.pdf for more details.
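To make that overhead concrete, here is a toy NumPy sketch (not actual TFLite kernel code) of the quantized dot-product scheme described in that paper: the int8 values are widened and accumulated in int32, then rescaled back to int8 with a fixed-point multiply and shift, whereas the float32 path is just a multiply-accumulate on the FPU. The scales and zero-points below are made-up example values.

```python
import numpy as np

def float_dot(x, w):
    # float32 path: the FPU just multiply-accumulates
    return np.dot(x, w)

def int8_dot(q_x, q_w, x_zp, w_zp, out_zp, multiplier, shift):
    # 1) widen int8 -> int32 and accumulate in int32
    acc = np.dot(q_x.astype(np.int32) - x_zp, q_w.astype(np.int32) - w_zp)
    # 2) requantize: fixed-point multiply (in int64) then arithmetic right shift
    acc = (acc.astype(np.int64) * multiplier) >> shift
    # 3) add the output zero-point and clamp back to the int8 range
    return np.clip(acc + out_zp, -128, 127).astype(np.int8)

x = np.random.rand(64).astype(np.float32)
w = np.random.rand(64).astype(np.float32)

# made-up quantization parameters for the example
x_scale, x_zp = 1 / 127.0, 0
w_scale, w_zp = 1 / 127.0, 0
out_scale, out_zp = 64 / 127.0, 0

q_x = np.round(x / x_scale).astype(np.int8)
q_w = np.round(w / w_scale).astype(np.int8)

# real kernels precompute (multiplier, shift) so that
# multiplier / 2**shift ~= x_scale * w_scale / out_scale
shift = 31
multiplier = int(round(x_scale * w_scale / out_scale * (1 << shift)))

print("float32:", float_dot(x, w))
print("int8   :", int8_dot(q_x, q_w, x_zp, w_zp, out_zp, multiplier, shift) * out_scale)
```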
We've also opened an internal issue to investigate further and will report back when we have some findings.
Thanks for bringing this to our attention.
The reason float32 was faster in your case is that our TensorFlow Lite EIMs use XNNPACK, which is optimized for float32.
If you try TensorFlow Lite Micro (omit USE_FULL_TFLITE=1 when building from source), int8 is (typically) faster than float32. But it depends on the target, ISA, optimizations, and other factors, as @mateusz mentioned.
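If you want to see the XNNPACK effect yourself, a rough benchmark sketch with the Python TFLite interpreter looks like the following (assuming a recent TensorFlow build, or tflite_runtime on the board; the model file names are placeholders). Passing BUILTIN_WITHOUT_DEFAULT_DELEGATES skips the default delegates, i.e. XNNPACK, so you can compare against the plain builtin kernels:

```python
import time
import numpy as np
import tensorflow as tf  # or tflite_runtime.interpreter on the board

def bench(model_path, disable_xnnpack=False, runs=50):
    kwargs = {"num_threads": 4}
    if disable_xnnpack:
        # skip the default delegates (i.e. XNNPACK) and use builtin kernels
        kwargs["experimental_op_resolver_type"] = (
            tf.lite.experimental.OpResolverType.BUILTIN_WITHOUT_DEFAULT_DELEGATES
        )
    interp = tf.lite.Interpreter(model_path=model_path, **kwargs)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    # random input of the right shape/dtype (int8 or float32)
    if inp["dtype"] == np.int8:
        data = np.random.randint(-128, 128, size=inp["shape"], dtype=np.int8)
    else:
        data = np.random.rand(*inp["shape"]).astype(np.float32)
    interp.set_tensor(inp["index"], data)
    interp.invoke()  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        interp.invoke()
    return (time.perf_counter() - start) / runs * 1000.0

# model file names are placeholders
for path in ("mobilenet_v2_float32.tflite", "mobilenet_v2_int8.tflite"):
    print(path, "with XNNPACK:    %.2f ms" % bench(path))
    print(path, "without XNNPACK: %.2f ms" % bench(path, disable_xnnpack=True))
```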