Deploying a Multiple-Object Detection Model on a Xiao ESP32S3 Sense

Hello, community!

I have read posts and blogs in multiple forums about users deploying an object detection model on microcontrollers, mainly the ESP32. But all the object detection tutorials and guides I found were too simple: they detect only about three objects and require single-object images.

What I want to do is deploy an object detection model with 10 to 12 classes that can detect multiple objects in an image on a Xiao ESP32S3 Sense. For example, an indoor object detection model that can detect furniture like a chair, sofa, lamp, TV, and so on. When given an image of a living room taken by the OV3660 camera, it should detect the sofa, chair, TV, etc. "simultaneously". I found the Arduino IDE deployment option in Edge Impulse really interesting, since it automatically creates an Arduino library for the user, making it easier to use the model on microcontrollers.

I followed the model training using these guides, but again, they only use a few classes and require a dataset of single-object images:

I have just begun using Edge Impulse, Arduino, and training ML models, but I am interested in learning more!

Does anyone know of a guide that could help me with that?

Hello, @matbarbosa !
I took a look at the articles you linked, and in the first one there are already instances of multiple detections in a single image.


About the number of classes - again, in tutorials people use only a few classes for two reasons:

  1. it is easier to collect enough data
  2. more classes would require a larger model

If you want to detect furniture like a chair, sofa, lamp, TV, and so on, the best course for you is to take an open-source dataset like PASCAL VOC or similar and parse it to keep only the classes you are interested in. For the classes that are not present in the dataset (I think lamp is not there, for example?) you will either need to find them in other datasets or collect and label the images yourself.
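
If you go down that road, the parsing step is just a bit of XML filtering. Here is a minimal sketch of what it could look like, assuming the standard VOC folder layout; the class names and paths are placeholders you would adjust:

# Minimal sketch: reduce a Pascal VOC-style dataset to a subset of classes.
# The class list and the paths below are placeholders -- adjust them to your copy.
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

KEEP = {"chair", "sofa", "tvmonitor", "diningtable", "pottedplant"}
SRC = Path("VOCdevkit/VOC2012")      # assumed standard VOC layout
DST = Path("voc_furniture_subset")
(DST / "Annotations").mkdir(parents=True, exist_ok=True)
(DST / "JPEGImages").mkdir(parents=True, exist_ok=True)

for xml_file in (SRC / "Annotations").glob("*.xml"):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    # Drop annotations for classes we are not interested in.
    for obj in list(root.findall("object")):
        if obj.findtext("name") not in KEEP:
            root.remove(obj)
    # Keep the image only if at least one wanted object remains.
    if root.find("object") is not None:
        img_name = root.findtext("filename")
        tree.write(DST / "Annotations" / xml_file.name)
        shutil.copy(SRC / "JPEGImages" / img_name, DST / "JPEGImages" / img_name)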

Then choose the model with higher capacity (so alpha 0.35) and train; that should work for 10-12 classes.


Hello @AIWintermuteAI.

Thank you for your response!

Sorry, I wasn't clear in my post. I meant that the object detection model should detect multiple objects even when their bounding boxes overlap or don't occupy most of the image, for example when only half of a lamp or a chair appears in the frame. Do you think the models available in Edge Impulse are suitable for this application?

That said, I gave it a try using the furniture dataset from Ultralytics called HomeObjects-3K (HomeObjects-3K Dataset - Ultralytics YOLO Docs). It has 3,000 images and 12 classes:

  • bed
  • sofa
  • chair
  • table
  • lamp
  • tv
  • laptop
  • wardrobe
  • window
  • door
  • potted plant
  • photo frame

Although it doesn't have objects like an oven or a fridge, it is the best dataset I found, with a good distribution of images across the objects and a decent number of classes. Other datasets had something like 1,000 to 2,000 chair images but only 98 sink images. The only disappointment is that this dataset is only available in YOLO .txt format.
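
For anyone who wants to double-check that balance, a small script like this can count how many boxes each class has in the YOLO .txt label files (the label folder path and the class-index order are assumptions based on the class list above):

# Quick sketch: count per-class boxes in YOLO .txt label files.
# The label folder and the class order are assumptions -- adjust to your download.
from collections import Counter
from pathlib import Path

CLASSES = ["bed", "sofa", "chair", "table", "lamp", "tv", "laptop",
           "wardrobe", "window", "door", "potted plant", "photo frame"]
LABEL_DIR = Path("HomeObjects-3K/labels/train")  # placeholder path

counts = Counter()
for txt in LABEL_DIR.glob("*.txt"):
    for line in txt.read_text().splitlines():
        if line.strip():
            counts[CLASSES[int(line.split()[0])]] += 1  # first column is the class index

for name, n in counts.most_common():
    print(f"{name:>12}: {n} boxes")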

I went digging and found that I can use Roboflow to convert this dataset to Pascal VOC. I imported the dataset folder along with the label files, and everything seemed fine.
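
Roboflow did the job, but the conversion itself is simple enough to script: YOLO stores normalized center/width/height per line, while Pascal VOC wants absolute pixel corners in XML. A rough sketch for a single label file (file names, paths, and the class-index order are assumptions):

# Rough sketch: convert one YOLO .txt label file into a Pascal VOC .xml annotation.
# File names, paths, and the class order are assumptions -- adapt as needed.
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image

CLASSES = ["bed", "sofa", "chair", "table", "lamp", "tv", "laptop",
           "wardrobe", "window", "door", "potted plant", "photo frame"]

def yolo_txt_to_voc_xml(txt_path: Path, img_path: Path, out_path: Path) -> None:
    w, h = Image.open(img_path).size
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = img_path.name
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(w)
    ET.SubElement(size, "height").text = str(h)
    ET.SubElement(size, "depth").text = "3"
    for line in txt_path.read_text().splitlines():
        if not line.strip():
            continue
        cls_id, xc, yc, bw, bh = line.split()
        xc, yc, bw, bh = float(xc) * w, float(yc) * h, float(bw) * w, float(bh) * h
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = CLASSES[int(cls_id)]
        box = ET.SubElement(obj, "bndbox")
        ET.SubElement(box, "xmin").text = str(int(xc - bw / 2))
        ET.SubElement(box, "ymin").text = str(int(yc - bh / 2))
        ET.SubElement(box, "xmax").text = str(int(xc + bw / 2))
        ET.SubElement(box, "ymax").text = str(int(yc + bh / 2))
    ET.ElementTree(root).write(out_path)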

I then created three versions of the dataset. The first one was just a test. Of the other two, one is preprocessed, augmented, and resized to 640x640; the other is only preprocessed and resized to 160x160 (I chose this resolution based on the guides I read before, since it improves the model's latency on the ESP32S3 Sense). The augmented version has 6,443 images and the non-augmented one has 2,686 images, which is expected, since augmentation adds distorted copies of the same images to the dataset. I decided to first test with the second dataset (resized to 160x160 and not augmented).

Then I downloaded the dataset as Pascal VOC and imported it into the Edge Impulse platform, selecting the automatic train/test split. Everything went well. My only concern is that the Pascal VOC folder downloaded to my PC had train, test, and valid folders; I don't know what happened to the "valid" images and how Edge Impulse interpreted them.
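
One thing I might try is merging the "valid" folder into "train" on disk before importing, so no labeled images get silently dropped and Edge Impulse can do its own split. A quick sketch (folder names assumed from the Roboflow Pascal VOC export):

# Quick sketch: merge the Roboflow "valid" split into "train" before importing.
# The export folder name and layout are assumptions.
import shutil
from pathlib import Path

EXPORT = Path("HomeObjects-3K-voc")  # placeholder: Roboflow Pascal VOC export folder

for f in (EXPORT / "valid").iterdir():
    if f.suffix.lower() in {".jpg", ".jpeg", ".png", ".xml"}:
        shutil.move(str(f), str(EXPORT / "train" / f.name))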

I then created an impulse. All the classes seemed to be automatically detected. Good!

I generated the features. All seemed good as well.

When I trained the model (FOMO with alpha 0.35), I ended up with a really bad result: an F1 score of 32.9% and a weird confusion matrix. Most objects are being detected as "background".

Even though it is not decently accurate, I decided to deploy the model to the ESP32S3 Sense anyway. I read that using the EON Compiler can result in errors where the ESP does not use the pseudostatic RAM (PSRAM), but let's try the compiler first.

I deployed the model as an Arduino library, loaded the library into the Arduino IDE, enabled PSRAM, and uploaded the code, but I keep receiving this error message:

CORRUPT HEAP: Bad tail at 0x3c1ea2a4. Expected 0xbaad5678 got 0x00000000

assert failed: multi_heap_free multi_heap_poisoning.c:279 (head != NULL)


Backtrace: 0x40375b85:0x3fcebaf0 0x4037b729:0x3fcebb10 0x40381a3a:0x3fcebb30 0x403806cf:0x3fcebc70 0x403769bf:0x3fcebc90 0x40381a6d:0x3fcebcb0 0x42007d65:0x3fcebcd0 0x4200ba91:0x3fcebcf0 0x4200406d:0x3fcebd10 0x420040ab:0x3fcebe80 0x42004143:0x3fcebea0 0x42004525:0x3fcebf50 0x420045e5:0x3fcebf70 0x4200d88c:0x3fcec070 0x4037c155:0x3fcec090




ELF file SHA256: f2eefe0ac

Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40378712
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce2820,len:0x11bc
load:0x403c8700,len:0xc2c
load:0x403cb700,len:0x3158
entry 0x403c88b8
Edge Impulse Inferencing Demo
Camera initialized

I tried searching for this error message, but I could not find any post or blog that exactly matched my situation. I think it is something related to memory allocation. Let's then try the TensorFlow Lite option instead of the EON Compiler.

It worked! However, the model still performs poorly: it only reliably detects windows, photo frames, potted plants, and lamps. According to the platform, the model should run with about 600 ms of latency, but here I get 405 ms, which is really good since my threshold is 5 seconds. That leaves room for a bigger dataset. This is an output example:

597656) [ x: 16, y: 112, width: 8, height: 8 ]

Predictions (DSP: 10 ms., Classification: 405 ms., Anomaly: 0 ms.): 
Object detection bounding boxes:

  window (0.503906) [ x: 16, y: 104, width: 8, height: 8 ]

  window (0.652344) [ x: 16, y: 120, width: 8, height: 16 ]

Predictions (DSP: 10 ms., Classification: 405 ms., Anomaly: 0 ms.): 
Object detection bounding boxes:

  potted plant (0.519531) [ x: 72, y: 96, width: 16, height: 8 ]

  window (0.519531) [ x: 56, y: 112, width: 8, height: 8 ]

  window (0.527344) [ x: 24, y: 120, width: 16, height: 16 ]
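
From what I understand, these coordinates are in the model's 160x160 input space (FOMO reports centroids as 8x8 grid cells), so to overlay them on the full camera frame they would need to be scaled back up. A tiny sketch of that mapping (the camera resolution below is just an example, not necessarily what the OV3660 is capturing at):

# Tiny sketch: map a FOMO detection from the 160x160 model input space onto the camera frame.
# The camera resolution below is an example value, not necessarily the actual capture size.
MODEL_W, MODEL_H = 160, 160
CAM_W, CAM_H = 240, 240

def scale_box(x, y, w, h):
    sx, sy = CAM_W / MODEL_W, CAM_H / MODEL_H
    return int(x * sx), int(y * sy), int(w * sx), int(h * sy)

# e.g. the window detection at [ x: 16, y: 120, width: 8, height: 16 ]:
print(scale_box(16, 120, 8, 16))  # -> (24, 180, 12, 24)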

I still do not know why the confusion matrix shows such poor performance for the other classes, even with a decent dataset. Any idea why this is happening? What do I need to do to improve it? I tried different settings, but I always ended up with a bad score and a bad confusion matrix.

Hi @matbarbosa,

Thanks for using Edge Impulse! I lead our Research team and @AIWintermuteAI mentioned you were having some trouble getting good results here.

FOMO is designed for simple scenarios: for example, identifying objects against a fixed background in an industrial setting, like a production line.

It's unlikely you'll get good results with your dataset, which contains a large amount of variation. I would recommend trying the biggest object detection model you can in order to get a good baseline for what level of performance is possible, and to ensure the performance you need is actually achievable with the dataset you are using. An easy starting point might be this YOLOv5 custom block: GitHub - edgeimpulse/ml-block-yolov5: YOLOv5 transfer learning model for Edge Impulse

Once you have that baseline you can try with smaller models and see if you can maintain adequate performance.

We'll soon be releasing a new object detection architecture in Edge Impulse called YOLO-Pro; you can find more information here: Edge Impulse Goes Industrial

Warmly,
Dan

Hello @dansitu,

Thank you for your response.

I used the custom block provided at the link you sent and followed all the steps:

  • Installed Python 3 (already installed);
  • Installed Node.js v20 or above (mine was v22.17);
  • Installed the additional Node.js tools by running the "Install Additional Tools for Node.js" installer;
  • Installed the CLI tools;
  • Downloaded the YOLOv5 ML block GitHub repository;
  • Ran the "init" and "push" commands inside the repository folder and obtained the following results:
E:\Users\mathe\Documents\ml-block-yolov5-master>edge-impulse-blocks init
Edge Impulse Blocks v1.33.0
Attaching block to organization 'matbarbosa'

Your new block has been created in 'E:\Users\mathe\Documents\ml-block-yolov5-master'.
When you have finished building your block, run 'edge-impulse-blocks push' to update the block in Edge Impulse.

E:\Users\mathe\Documents\ml-block-yolov5-master>edge-impulse-blocks push
Edge Impulse Blocks v1.33.0
Archiving 'ml-block-yolov5-master'...
Archiving 'ml-block-yolov5-master' OK (8 KB) C:\Users\mathe\AppData\Local\Temp\ei-machine-learning-block-1b48f99cf374dba734b98ac1b72a738f.tar.gz

Uploading block 'YOLOv5' to organization 'matbarbosa'...
(node:31376) [DEP0044] DeprecationWarning: The `util.isArray` API is deprecated. Please use `Array.isArray()` instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Uploading block 'YOLOv5' to organization 'matbarbosa' OK

Building machine learning block 'YOLOv5'...
Connected to job
[spinner-done] Job scheduled at 08 Jul 2025 17:01:29
[spinner] Preparing the environment...
[spinner-done] Job started at 08 Jul 2025 17:01:34

Extracting archive...
Extracting archive OK

Calculating hash of extracted archive...
Calculating hash of extracted archive OK (processed_files=12, hash=976b74878b47d658da0680bf44c448feb5a34f9607fa89d21fb1d29f0dfb9588)

Already has container with this hash in container registry, skipping build

Building machine learning block 'YOLOv5' OK

My only concern is the "Already has container with this hash in container registry, skipping build" message. I accidentally ran the "init" command outside the repository folder once and created a block, but I did not run "push" for it. I don't know if this is something to worry about.

I created a new project, imported my HomeObjects-3K dataset, created an impulse, and inserted the YOLOv5 learning block.

I trained the model with these settings:

  • Epochs (Number of training cycles): 15
  • Training processor: CPU
  • Model size: Nano - 1.9M parameters, 3.87 MB
  • Batch size: 16 (default)
  • Validation set size: 20%
  • Profile int8 model: Checked

I obtained an even worse result (training logs for further investigation: Training Logs - YOLOv5 Edge Impulse - Google Docs):

At least the RAM usage, flash usage, and inference time are still within my project constraints.

I then ran a model test, and it only seemed to detect with higher precision on images containing a single object class or at most three objects.

The biggest problem is that I cannot deploy the model as an Arduino IDE library.

Why did the model result in such low accuracy? YOLO, as far as I understand, is suitable for detecting multiple objects and is pre-trained on the COCO dataset. And how can I deploy this model to my ESP32S3 Sense if I cannot export it as an Arduino library?

Hi @matbarbosa,

It looks like you are still attempting to train a very small model. I’d recommend training as large a model as possible in order to determine whether it’s possible to get good performance on your dataset.

If you get good performance with a large model but not a small one, and your dataset is representative of the task your application needs to perform, you may need to think about reducing the scope of your application, or targeting more capable hardware that can support a larger model.

Warmly,
Dan