Deploying a Multiple-Object Detection Model on a Xiao ESP32S3 Sense

Hello, community!

I have read posts and blogs in multiple forums about users deploying an object detection model on microcontrollers, mainly the ESP32. But all the object detection tutorials and guides I found were too simple: they detect only about three objects and require single-object images.

What I want to do is deploy an object detection model with 10 to 12 classes that can detect multiple objects in an image on a Xiao ESP32S3 Sense. For example, an indoor object detection model that can detect furniture like a chair, sofa, lamp, TV, and so on. When given an image of a living room taken by the OV3660 camera, it should detect the sofa, chair, TV, etc. "simultaneously". I found the Arduino IDE deployment option in Edge Impulse really interesting, since it automatically creates an Arduino library for the user, making it easier to use the model on microcontrollers.

I followed the model training using these guides, but again, they only use a few classes and require a dataset of single-object images:

I have just begun using Edge Impulse, Arduino, and training ML models, but I am interested in learning more!

Does anyone know of a guide that could help me with that?

Hello, @matbarbosa !
I took a look at the articles you linked, and in the first one there are already instances of multiple detections in a single image.


About the number of classes - again, in tutorials people use only a few classes for two reasons:

  1. it is easier to collect enough data
  2. more classes would require a larger model

If you want to detect furniture like a chair, sofa, lamp, TV, and so on, the best course for you is to take an open-source dataset like PASCAL VOC or similar and parse it to keep only the classes you are interested in. For the classes that are not present in the dataset (I think lamp is not there, for example?) you will either need to find them in other datasets or collect and label the images yourself.
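
If you go down that road, the parsing step is just a bit of XML filtering. Here is a minimal sketch of what it could look like, assuming the standard VOC folder layout; the class names and paths are placeholders you would adjust:

# Minimal sketch: reduce a Pascal VOC-style dataset to a subset of classes.
# The class list and the paths below are placeholders -- adjust them to your copy.
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

KEEP = {"chair", "sofa", "tvmonitor", "diningtable", "pottedplant"}
SRC = Path("VOCdevkit/VOC2012")      # assumed standard VOC layout
DST = Path("voc_furniture_subset")
(DST / "Annotations").mkdir(parents=True, exist_ok=True)
(DST / "JPEGImages").mkdir(parents=True, exist_ok=True)

for xml_file in (SRC / "Annotations").glob("*.xml"):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    # Drop annotations for classes we are not interested in.
    for obj in list(root.findall("object")):
        if obj.findtext("name") not in KEEP:
            root.remove(obj)
    # Keep the image only if at least one wanted object remains.
    if root.find("object") is not None:
        img_name = root.findtext("filename")
        tree.write(DST / "Annotations" / xml_file.name)
        shutil.copy(SRC / "JPEGImages" / img_name, DST / "JPEGImages" / img_name)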

Then choose the model with higher capacity (so alpha 0.35) and train; that should work for 10-12 classes.


Hello @AIWintermuteAI.

Thank you for your response!

Sorry, I wasn't clear in my post. I meant that the object detection model should detect multiple objects even when their bounding boxes overlap or don't occupy most of the image, for example when only half of a lamp or a chair appears in the frame. Do you think the models available in Edge Impulse are suitable for this application?

That said, I gave it a try using the furniture dataset from Ultralytics called HomeObjects-3K (HomeObjects-3K Dataset - Ultralytics YOLO Docs). It has 3,000 images and 12 classes:

  • bed
  • sofa
  • chair
  • table
  • lamp
  • tv
  • laptop
  • wardrobe
  • window
  • door
  • potted plant
  • photo frame

Although it doesn't have objects like an oven or a fridge, it is the best dataset I found, with a good distribution of images across the objects and a decent number of classes. Other datasets had something like 1,000 to 2,000 chair images but only 98 sink images. The only disappointment is that this dataset is only available in YOLO .txt format.
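
For anyone who wants to double-check that balance, a small script like this can count how many boxes each class has in the YOLO .txt label files (the label folder path and the class-index order are assumptions based on the class list above):

# Quick sketch: count per-class boxes in YOLO .txt label files.
# The label folder and the class order are assumptions -- adjust to your download.
from collections import Counter
from pathlib import Path

CLASSES = ["bed", "sofa", "chair", "table", "lamp", "tv", "laptop",
           "wardrobe", "window", "door", "potted plant", "photo frame"]
LABEL_DIR = Path("HomeObjects-3K/labels/train")  # placeholder path

counts = Counter()
for txt in LABEL_DIR.glob("*.txt"):
    for line in txt.read_text().splitlines():
        if line.strip():
            counts[CLASSES[int(line.split()[0])]] += 1  # first column is the class index

for name, n in counts.most_common():
    print(f"{name:>12}: {n} boxes")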

I went digging and found that I can use Roboflow to convert this dataset to Pascal VOC. I imported the dataset folder along with the label files, and everything seemed fine.
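
Roboflow did the job, but the conversion itself is simple enough to script: YOLO stores normalized center/width/height per line, while Pascal VOC wants absolute pixel corners in XML. A rough sketch for a single label file (file names, paths, and the class-index order are assumptions):

# Rough sketch: convert one YOLO .txt label file into a Pascal VOC .xml annotation.
# File names, paths, and the class order are assumptions -- adapt as needed.
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image

CLASSES = ["bed", "sofa", "chair", "table", "lamp", "tv", "laptop",
           "wardrobe", "window", "door", "potted plant", "photo frame"]

def yolo_txt_to_voc_xml(txt_path: Path, img_path: Path, out_path: Path) -> None:
    w, h = Image.open(img_path).size
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = img_path.name
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(w)
    ET.SubElement(size, "height").text = str(h)
    ET.SubElement(size, "depth").text = "3"
    for line in txt_path.read_text().splitlines():
        if not line.strip():
            continue
        cls_id, xc, yc, bw, bh = line.split()
        xc, yc, bw, bh = float(xc) * w, float(yc) * h, float(bw) * w, float(bh) * h
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = CLASSES[int(cls_id)]
        box = ET.SubElement(obj, "bndbox")
        ET.SubElement(box, "xmin").text = str(int(xc - bw / 2))
        ET.SubElement(box, "ymin").text = str(int(yc - bh / 2))
        ET.SubElement(box, "xmax").text = str(int(xc + bw / 2))
        ET.SubElement(box, "ymax").text = str(int(yc + bh / 2))
    ET.ElementTree(root).write(out_path)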

I then created three versions of the dataset. The first one was just a test. Of the other two, one is preprocessed, augmented, and resized to 640x640; the other is only preprocessed and resized to 160x160 (I chose this resolution based on the guides I read before, since it improves the model's latency on the ESP32S3 Sense). The augmented version has 6,443 images and the non-augmented one has 2,686 images, which is expected, since augmentation adds distorted copies of the same images to the dataset. I decided to first test with the second dataset (resized to 160x160 and not augmented).

Then I downloaded the dataset as Pascal VOC and imported it into the Edge Impulse platform, selecting the automatic train/test split. Everything went well. My only concern is that the Pascal VOC folder downloaded to my PC had train, test, and valid folders; I don't know what happened to the "valid" images and how Edge Impulse interpreted them.
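
One thing I might try is merging the "valid" folder into "train" on disk before importing, so no labeled images get silently dropped and Edge Impulse can do its own split. A quick sketch (folder names assumed from the Roboflow Pascal VOC export):

# Quick sketch: merge the Roboflow "valid" split into "train" before importing.
# The export folder name and layout are assumptions.
import shutil
from pathlib import Path

EXPORT = Path("HomeObjects-3K-voc")  # placeholder: Roboflow Pascal VOC export folder

for f in (EXPORT / "valid").iterdir():
    if f.suffix.lower() in {".jpg", ".jpeg", ".png", ".xml"}:
        shutil.move(str(f), str(EXPORT / "train" / f.name))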

I then created an impulse. All the classes seemed to be automatically detected. Good!

I generated the features. All seemed good as well.

When I trained the model (FOMO with alpha 0.35), I ended up with a really bad result: an F1 score of 32.9% and a weird confusion matrix. Most objects are being detected as "background".

Even though it is not decently accurate, I decided to deploy the model to the ESP32S3 Sense anyway. I read that using the EON Compiler can result in errors where the ESP does not use the pseudostatic RAM (PSRAM), but let's try the compiler first.

I deployed the model as an Arduino library, loaded the library into the Arduino IDE, enabled PSRAM, and uploaded the code, but I keep receiving this error message:

CORRUPT HEAP: Bad tail at 0x3c1ea2a4. Expected 0xbaad5678 got 0x00000000

assert failed: multi_heap_free multi_heap_poisoning.c:279 (head != NULL)


Backtrace: 0x40375b85:0x3fcebaf0 0x4037b729:0x3fcebb10 0x40381a3a:0x3fcebb30 0x403806cf:0x3fcebc70 0x403769bf:0x3fcebc90 0x40381a6d:0x3fcebcb0 0x42007d65:0x3fcebcd0 0x4200ba91:0x3fcebcf0 0x4200406d:0x3fcebd10 0x420040ab:0x3fcebe80 0x42004143:0x3fcebea0 0x42004525:0x3fcebf50 0x420045e5:0x3fcebf70 0x4200d88c:0x3fcec070 0x4037c155:0x3fcec090




ELF file SHA256: f2eefe0ac

Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40378712
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce2820,len:0x11bc
load:0x403c8700,len:0xc2c
load:0x403cb700,len:0x3158
entry 0x403c88b8
Edge Impulse Inferencing Demo
Camera initialized

I tried searching for this error message, but I could not find any post or blog that exactly matched my situation. I think it is something related to memory allocation. Let's then try the TensorFlow Lite option instead of the EON Compiler.

It worked! However, the model still performs poorly: it only reliably detects windows, photo frames, potted plants, and lamps. According to the platform, the model should run with about 600 ms of latency, but here I get 405 ms, which is really good since my threshold is 5 seconds. That leaves room for a bigger dataset. This is an output example:

597656) [ x: 16, y: 112, width: 8, height: 8 ]

Predictions (DSP: 10 ms., Classification: 405 ms., Anomaly: 0 ms.): 
Object detection bounding boxes:

  window (0.503906) [ x: 16, y: 104, width: 8, height: 8 ]

  window (0.652344) [ x: 16, y: 120, width: 8, height: 16 ]

Predictions (DSP: 10 ms., Classification: 405 ms., Anomaly: 0 ms.): 
Object detection bounding boxes:

  potted plant (0.519531) [ x: 72, y: 96, width: 16, height: 8 ]

  window (0.519531) [ x: 56, y: 112, width: 8, height: 8 ]

  window (0.527344) [ x: 24, y: 120, width: 16, height: 16 ]
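
From what I understand, these coordinates are in the model's 160x160 input space (FOMO reports centroids as 8x8 grid cells), so to overlay them on the full camera frame they would need to be scaled back up. A tiny sketch of that mapping (the camera resolution below is just an example, not necessarily what the OV3660 is capturing at):

# Tiny sketch: map a FOMO detection from the 160x160 model input space onto the camera frame.
# The camera resolution below is an example value, not necessarily the actual capture size.
MODEL_W, MODEL_H = 160, 160
CAM_W, CAM_H = 240, 240

def scale_box(x, y, w, h):
    sx, sy = CAM_W / MODEL_W, CAM_H / MODEL_H
    return int(x * sx), int(y * sy), int(w * sx), int(h * sy)

# e.g. the window detection at [ x: 16, y: 120, width: 8, height: 16 ]:
print(scale_box(16, 120, 8, 16))  # -> (24, 180, 12, 24)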

I still do not know why the confusion matrix shows such poor performance for the other classes, even with a decent dataset. Any idea why this is happening? What do I need to do to improve it? I tried different settings, but I always ended up with a bad score and a bad confusion matrix.

Hi @matbarbosa,

Thanks for using Edge Impulse! I lead our Research team and @AIWintermuteAI mentioned you were having some trouble getting good results here.

FOMO is designed for simple scenarios: for example, identifying objects against a fixed background in an industrial setting, like a production line.

It's unlikely you'll get good results with your dataset, which contains a large amount of variation. I would recommend trying the biggest object detection model you can in order to get a good baseline for what level of performance is possible, and to ensure the performance you need is actually achievable with the dataset you are using. An easy starting point might be this YOLOv5 custom block: GitHub - edgeimpulse/ml-block-yolov5: YOLOv5 transfer learning model for Edge Impulse

Once you have that baseline you can try with smaller models and see if you can maintain adequate performance.

We'll soon be releasing a new object detection architecture in Edge Impulse called YOLO-Pro; you can find more information here: Edge Impulse Goes Industrial

Warmly,
Dan

Hello @dansitu,

Thank you for your response.

I used the custom block provided at the link you sent and followed all the steps:

  • Installed Python 3 (already installed);
  • Installed Node.js v20 or above (mine was v22.17);
  • Installed the additional Node.js tools by running the "Install Additional Tools for Node.js" installer;
  • Installed the CLI tools;
  • Downloaded the YOLOv5 ML block GitHub repository;
  • Ran the "init" and "push" commands inside the repository folder and obtained the following results:
E:\Users\mathe\Documents\ml-block-yolov5-master>edge-impulse-blocks init
Edge Impulse Blocks v1.33.0
Attaching block to organization 'matbarbosa'

Your new block has been created in 'E:\Users\mathe\Documents\ml-block-yolov5-master'.
When you have finished building your block, run 'edge-impulse-blocks push' to update the block in Edge Impulse.

E:\Users\mathe\Documents\ml-block-yolov5-master>edge-impulse-blocks push
Edge Impulse Blocks v1.33.0
Archiving 'ml-block-yolov5-master'...
Archiving 'ml-block-yolov5-master' OK (8 KB) C:\Users\mathe\AppData\Local\Temp\ei-machine-learning-block-1b48f99cf374dba734b98ac1b72a738f.tar.gz

Uploading block 'YOLOv5' to organization 'matbarbosa'...
(node:31376) [DEP0044] DeprecationWarning: The `util.isArray` API is deprecated. Please use `Array.isArray()` instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Uploading block 'YOLOv5' to organization 'matbarbosa' OK

Building machine learning block 'YOLOv5'...
Connected to job
[spinner-done] Job scheduled at 08 Jul 2025 17:01:29
[spinner] Preparing the environment...
[spinner-done] Job started at 08 Jul 2025 17:01:34

Extracting archive...
Extracting archive OK

Calculating hash of extracted archive...
Calculating hash of extracted archive OK (processed_files=12, hash=976b74878b47d658da0680bf44c448feb5a34f9607fa89d21fb1d29f0dfb9588)

Already has container with this hash in container registry, skipping build

Building machine learning block 'YOLOv5' OK

My only concern is the "Already has container with this hash in container registry, skipping build" message. I accidentally ran the "init" command outside the repository folder once and created a block, but I did not run "push" for it. I don't know if this is something to worry about.

I created a new project, imported my HomeObjects-3K dataset, created an impulse, and inserted the YOLOv5 learning block.

I trained the model with these settings:

  • Epochs (Number of training cycles): 15
  • Training processor: CPU
  • Model size: Nano - 1.9M parameters, 3.87 MB
  • Batch size: 16 (default)
  • Validation set size: 20%
  • Profile int8 model: Checked

I obtained an even worse result (training logs for further investigation: Training Logs - YOLOv5 Edge Impulse - Google Docs):

At least the RAM usage, flash usage, and inference time are still within my project constraints.

I then ran a model test, and it only seemed to detect with higher precision on images containing a single object class or at most three objects.

The biggest problem is that I cannot deploy the model as an Arduino IDE library.

Why did the model result in such low accuracy? YOLO, as far as I understand, is suitable for detecting multiple objects and is pre-trained on the COCO dataset. And how can I deploy this model to my ESP32S3 Sense if I cannot export it as an Arduino library?

Hi @matbarbosa,

It looks like you are still attempting to train a very small model. I’d recommend training as large a model as possible in order to determine whether it’s possible to get good performance on your dataset.

If you get good performance with a large model but not a small one, and your dataset is representative of the task your application needs to perform, you may need to think about reducing the scope of your application, or targeting more capable hardware that can support a larger model.

Warmly,
Dan