How is inference time related to model size?


I’m training an image classification model on a custom dataset. I ran the EON Tuner to get different models. That’s when I saw this.

As you can see, both models take a 32x32 RGB input.

The model on the left has only 2 conv layers and shows an inference time of 58 ms.
But the model on the right has 3 conv layers and a dense layer and shows an inference time of 30 ms.

The model with fewer layers should be faster, right? Why is this different? Am I missing something?

I’m new to ML, so please enlighten me.

Ramson Jehu K

Hello @Ramson,

Can you tell me which project you are using so I can have a look? Or what’s the device you’ve selected for the latency in the EON Tuner?



Hi Louis,

Here is my public project ID: 74860
I have selected the Cortex-M7 216MHz device.

Ramson Jehu K

Hi @Ramson, here’s the Netron output of these two models:

I guess that because of the extra convolutional layers, the fully-connected layer is quicker to calculate. Looking at the estimated MACCs, it’s ~800K for the model with three layers and ~1.6M for the model with two layers. A bit counterintuitive indeed :slight_smile:

@dansitu might have something to add here too.

Hi @janjongboom,

Thanks for the intuition. How did you estimate the MACCs?

Ramson Jehu K

@Ramson I think we’re using something based off of

Note that we’re switching to benchmarking the actual model, but this is not implemented for the Cortex-M7 216MHz target yet so it falls back to MACC calculation.

@janjongboom, thanks for referring me to the blog. Do you have a script or something similar to check the MACC calculation for various models, or is it just a manual calculation?

What do you mean by benchmarking the actual model? How would it differ from the MACC calculation?

Hi @Ramson! Regarding compute time: as @janjongboom says, it’s not just the number of layers, it’s what’s going on inside them that makes a difference. The two-layer network ends up doing more work than the three-layer one. The number of filters in the first layer is higher in the two-layer network, and these extra channels propagate through the model, creating more inputs for the following layers to convolve.
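To make the point about filter counts concrete, here is a minimal sketch of a MACC counter for a simple Conv2D + max-pool stack. It is not the script EON Tuner uses. The filter counts, the 3x3 kernels, the 2x2 pooling after every conv layer, and the 4 output classes are all assumptions chosen for illustration; they are not read from the project in this thread.

```python
def conv2d_maccs(h, w, c_in, c_out, k):
    """MACCs for a k x k Conv2D with stride 1 and 'same' padding.

    Each of the h * w * c_out outputs needs k * k * c_in multiply-accumulates.
    """
    return h * w * c_out * k * k * c_in


def dense_maccs(n_in, n_out):
    """MACCs for a fully-connected layer: one per weight."""
    return n_in * n_out


def model_maccs(input_hw, input_c, conv_filters, dense_units, n_classes, k=3):
    """Total MACCs for [Conv2D -> 2x2 max-pool]* -> Dense* -> classifier.

    Hypothetical architecture template, assumed for illustration only.
    """
    h = w = input_hw
    c = input_c
    total = 0
    for filters in conv_filters:
        total += conv2d_maccs(h, w, c, filters, k)
        # 2x2 pooling halves each spatial dimension; channels become `filters`.
        h, w, c = h // 2, w // 2, filters
    n = h * w * c  # size of the flattened feature map
    for units in dense_units:
        total += dense_maccs(n, units)
        n = units
    total += dense_maccs(n, n_classes)
    return total


# Two conv layers with many filters vs. three conv layers with fewer filters
# plus a dense layer. Both take a 32x32 RGB input, as in the thread.
two_layer = model_maccs(32, 3, conv_filters=[32, 32], dense_units=[], n_classes=4)
three_layer = model_maccs(32, 3, conv_filters=[16, 16, 16], dense_units=[32], n_classes=4)

print(f"two-layer model:   {two_layer:,} MACCs")
print(f"three-layer model: {three_layer:,} MACCs")
```

With these assumed settings the two-layer model needs more MACCs than the three-layer one, because the second convolution sees 32 input channels instead of 16, and the extra pooling stage in the deeper model shrinks the feature map before the dense layers.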

In terms of benchmarking: MACCs give us a theoretical measure of compute, but the time taken to perform different operations can vary between targets and optimizations. For example, on Arm devices there are vector extensions available that will speed up the Conv2D op, but only for certain filter and input sizes. One model may have fewer MACCs than another, but if its layers are less well covered by the optimizations available on a given target, it may end up running slower.
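As a rough sketch of why a MACC count alone can mislead, the snippet below converts MACCs to a latency estimate using an effective MACCs-per-cycle figure. The MACC counts (~800K and ~1.6M) and the 216 MHz clock come from this thread; the throughput values are made up purely for illustration and are not measurements of any real kernel or device.

```python
CLOCK_HZ = 216_000_000  # Cortex-M7 target clock mentioned in the thread


def estimated_latency_ms(maccs, maccs_per_cycle, clock_hz=CLOCK_HZ):
    """Latency estimate from a MACC count and an effective throughput."""
    cycles = maccs / maccs_per_cycle
    return cycles / clock_hz * 1000.0


# Hypothetical throughputs: model A's layer shapes are assumed to miss the
# optimized kernels, while model B's shapes are assumed to hit them.
model_a_ms = estimated_latency_ms(800_000, maccs_per_cycle=0.25)
model_b_ms = estimated_latency_ms(1_600_000, maccs_per_cycle=1.0)

print(f"model A (fewer MACCs, unoptimized path): {model_a_ms:.1f} ms")
print(f"model B (more MACCs, optimized path):    {model_b_ms:.1f} ms")
```

Under these assumed throughputs the model with half the MACCs comes out slower, which is the kind of effect that benchmarking the actual model on the target captures and a pure MACC estimate does not.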

This is why EON Tuner is so cool—it can automatically take all of this into account when designing a model!


Hi @dansitu,
Thanks for your explanation, now I understand it clearly.
And yes, EON Tuner is really cool. I could experiment with different model architectures.
I’m waiting for EON Tuner support for object detection problems as well.

Thanks and Regards,
Ramson Jehu K

Object detection support should be coming soon :slight_smile:
