Filter object inferences by size of box

It would be helpful when using the FOMO model to be able to set a minimum ‘box’ size for detected objects.

As far as I understand it, the FOMO algorithm divides the input image into a series of 8x8-pixel squares and runs an inference on each square. Where several neighbouring squares are classified as the same object type, the squares are combined, and the resulting collection of squares becomes one ‘box’ in the ei_result.bounding_boxes collection.

Inferences are reported in EI Studio on the Live Classification tab under Detailed Result, where the Width and Height show when several 8x8 squares have been joined together.

In our project we find that inferences with Width = 8 and Height = 8 are very often wrong and would like to suppress them. This is, of course, possible to do within our own code, as we are doing, but there may be value to others in being able to apply a post-processing filter within the model to do this.
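For reference, the filter we currently apply in our own code looks something like this (a sketch only; the types are from the C++ inferencing SDK, and the minimum-size values are just our own choices):

```cpp
// Sketch of the post-inference filter we apply in our own code.
// The minimum-size thresholds are our own choice, not anything from the SDK.
#include "edge-impulse-sdk/classifier/ei_run_classifier.h"

static const uint32_t MIN_BOX_WIDTH  = 9;  // reject single-cell (8x8-ish) detections
static const uint32_t MIN_BOX_HEIGHT = 9;

void handle_detections(const ei_impulse_result_t *result) {
    for (size_t i = 0; i < result->bounding_boxes_count; i++) {
        const ei_impulse_result_bounding_box_t *box = &result->bounding_boxes[i];
        if (box->value == 0) {
            continue; // empty slot
        }
        if (box->width < MIN_BOX_WIDTH || box->height < MIN_BOX_HEIGHT) {
            continue; // single-cell detection: too often wrong for us, so skip it
        }
        // ... act on the detection ...
        ei_printf("%s (%f) at x=%u y=%u w=%u h=%u\r\n",
                  box->label, box->value, box->x, box->y, box->width, box->height);
    }
}
```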

NB When a box in the ei_result.bounding_boxes array is accessed in code, box->width and box->height are always multiples of 7 rather than 8 (as shown in EI Studio). Why is this? The images presented to run_classifier are 150x150, which agrees with the input width and height specified for the model.

Thanks for the feedback!

We don’t have any post-processing filtering of the type you describe, and ideally we’d like to fix the training so these spurious single-cell detections don’t occur in the first place. If there’s more detail you could provide on your specific case, we might be able to fix it directly with some tuning and/or data prep.

Re: 7 vs 8; we don’t explicitly make the output 8 pixels; it’s a side effect of the network doing 1/8th reduction…

e.g. a (96, 96) input is compressed down to (12, 12) so each of those output cells represents 8x8 pixels.

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_7       │ (None, 96, 96, 3) │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Conv1 (Conv2D)      │ (None, 48, 48, 8) │        216 │ input_layer_7[0]… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
....
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ block_6_expand_relu │ (None, 12, 12,    │          0 │ block_6_expand_B… │
│ (ReLU)              │ 48)               │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ head (Conv2D)       │ (None, 12, 12,    │      1,568 │ block_6_expand_r… │
│                     │ 32)               │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ logits (Conv2D)     │ (None, 12, 12, 5) │        165 │ head[0][0]        │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘

and this is always an exact 1/8 reduction when the input is a multiple of 8.

But for (150, 150) we end up with (19, 19) (due to the padding config in the MobileNet head), so each of these output cells actually represents 150/19 ≈ 7.9 pixels.

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_8       │ (None, 150, 150,  │          0 │ -                 │
│ (InputLayer)        │ 3)                │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ Conv1 (Conv2D)      │ (None, 75, 75, 8) │        216 │ input_layer_8[0]… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
...
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ block_6_expand_relu │ (None, 19, 19,    │          0 │ block_6_expand_B… │
│ (ReLU)              │ 48)               │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ head (Conv2D)       │ (None, 19, 19,    │      1,568 │ block_6_expand_r… │
│                     │ 32)               │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ logits (Conv2D)     │ (None, 19, 19, 5) │        165 │ head[0][0]        │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘

( Note: you can see all this info by switching to “Expert mode” and dropping a print(model.summary()) in the code after the call to build_model() )
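The arithmetic on the SDK side works out roughly like this (a sketch only, not the actual SDK scaling code; it assumes the per-cell size gets truncated to whole pixels when mapping back to the input image):

```cpp
// Rough arithmetic (not the actual SDK code) showing why reported
// widths/heights come out as multiples of 7 for a 150x150 input.
#include <cstdio>

int main() {
    const int input_px = 150;              // model input width/height
    const int grid     = 19;               // output cells after the ~1/8 reduction
    const int cell_px  = input_px / grid;  // integer division: 150 / 19 = 7

    printf("exact cell size  : %.3f px\n", (double)input_px / grid); // ~7.895
    printf("integer cell size: %d px\n", cell_px);                   // 7

    // If box sizes are reported as (number of merged cells) * cell_px,
    // every width/height is a multiple of 7 rather than 8:
    for (int cells = 1; cells <= 3; cells++) {
        printf("%d cell(s) -> %d px\n", cells, cells * cell_px);
    }
    return 0;
}
```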

If you switched to (152, 152) these would line up ( 152/19=8 ) and that’s your quickest “fix”.

But if you’re seeing 7x7 in the SDK, but 8x8 in Studio, that’s a bug on our side that I can log.

Cheers,
Mat

Ok, I understand. I’ll probably change our input size to 152x152 and add a single-pixel white border around our 150x150 raw images so the images are a straight multiple of the 8x8 squares. That will avoid needing to scale the image from 150 to 152, which can’t be an efficient use of processor cycles.
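The padding itself should be cheap; something like this is what I have in mind (a sketch only, assuming a packed RGB888 row-major buffer):

```cpp
// Sketch: pad a 150x150 RGB888 frame to 152x152 with a 1-pixel white border,
// so no rescaling is needed. The buffer layout (3 bytes/pixel, row-major) is
// an assumption about the capture pipeline.
#include <cstdint>
#include <cstring>

static const int SRC_W = 150, SRC_H = 150;
static const int DST_W = 152, DST_H = 152;

void pad_to_152(const uint8_t *src, uint8_t *dst) {
    // Fill the whole destination with white first (this covers the 1px border).
    memset(dst, 0xFF, (size_t)DST_W * DST_H * 3);
    // Copy each source row into the destination, offset by (1, 1).
    for (int y = 0; y < SRC_H; y++) {
        const uint8_t *src_row = src + (size_t)y * SRC_W * 3;
        uint8_t *dst_row = dst + ((size_t)(y + 1) * DST_W + 1) * 3;
        memcpy(dst_row, src_row, (size_t)SRC_W * 3);
    }
}
```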

The project objective is to recognize one or more insects on a bait station. We frequently see European Hornets or wasps, which we want to recognize and ignore. We occasionally see Asian Hornets, which we want to recognize and send out an alert. We sometimes see both European and Asian Hornets in the same picture.

I’m finding that the FOMO algorithm often recognizes, say, a 24x16 box as European Hornet with high probability, but sometimes an immediately adjacent 8x8 box is detected as Asian Hornet with low probability. Given the solid identification of European Hornet, we want to accept that and ignore the single 8x8 box that has apparently been misidentified. In our use case we can afford to ignore any weak identifications, as there will likely be another chance to identify the same or a different insect a few seconds later, when the image produces a bigger box with better probability.
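As a sketch, the rule I’d like to apply looks something like this (the confidence thresholds and the “adjacent” margin are just my own guesses):

```cpp
// Sketch of the rule described above: drop a small, low-confidence box when a
// confident detection of a different label sits right next to it.
// Threshold values and the adjacency margin are our own guesses.
#include "edge-impulse-sdk/classifier/ei_run_classifier.h"
#include <cstring>

static const float    STRONG_CONF  = 0.80f; // a confident detection we trust
static const float    WEAK_CONF    = 0.50f; // below this a detection is "weak"
static const uint32_t SMALL_BOX_PX = 8;     // roughly one FOMO cell (7 or 8 px)
static const uint32_t ADJACENT_PX  = 8;     // gap that still counts as adjacent

static bool boxes_adjacent(const ei_impulse_result_bounding_box_t *a,
                           const ei_impulse_result_bounding_box_t *b) {
    // true if the boxes overlap or sit within ADJACENT_PX of each other
    return (a->x <= b->x + b->width  + ADJACENT_PX) && (b->x <= a->x + a->width  + ADJACENT_PX) &&
           (a->y <= b->y + b->height + ADJACENT_PX) && (b->y <= a->y + a->height + ADJACENT_PX);
}

// Returns true if `box` is a lone, weak, single-cell detection next to a
// strong detection of a different class (e.g. a weak 8x8 "Asian Hornet"
// beside a confident 24x16 "European Hornet") and should be ignored.
bool should_ignore(const ei_impulse_result_t *result,
                   const ei_impulse_result_bounding_box_t *box) {
    if (box->width > SMALL_BOX_PX || box->height > SMALL_BOX_PX || box->value >= WEAK_CONF) {
        return false;
    }
    for (size_t i = 0; i < result->bounding_boxes_count; i++) {
        const ei_impulse_result_bounding_box_t *other = &result->bounding_boxes[i];
        if (other == box || other->value < STRONG_CONF) continue;
        if (strcmp(other->label, box->label) != 0 && boxes_adjacent(box, other)) {
            return true;
        }
    }
    return false;
}
```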

One possibility is that we are using too high a resolution and the 8x8 boxes are too small. Would it perhaps be better to use 80x80 or 96x96 so that the 8x8 boxes cover a larger area of the insect? My gut instinct is to use a high resolution, as the key differences between AH and EH seem to disappear to the human eye below 150x150, but perhaps I don’t understand the algorithm well enough.

Re 7 vs 8: yes - in Studio the box dimensions are always multiples of 8, but in the SDK the bounding box width/height values are always multiples of 7, which must be due to the scaling of the input image.