Size of training/testing data set

Question/Issue:
I would like some guidance on the recommended size of the dataset we should use to train a model.

Project ID:
buzzcopper1

Context/Use case:
My colleagues and I are building a microcontroller device to detect and alert when Asian Hornets visit a detector. The device is described at https://buzzcopper.org. The device currently does hard-coded machine vision processing to detect when an insect is in the field of view and then sends a high-resolution picture to Google Vision for inference. This is neither efficient nor accurate. We wish to build in a C++ AI model to classify insects on the ESP32-S3 device itself. Classification only needs to distinguish between Asian Hornets, European Hornets, bees, wasps and flies.

Steps Taken:
We have already uploaded 1200 images of Asian Hornets, European Hornets, wasps and other insects from Google Cloud Storage, but we have many more we could upload. How many is too many? Do unlabelled images need to be removed, or are they ignored during training?

Expected Outcome:
We want the training/test dataset to be an appropriate size so that accuracy is around 90% and training completes within the one-hour processing limit.

Actual Outcome:
When attempting to train with 1200 images but (so far) only 120 labels, over 60 epochs, the job failed early because it would have taken too long. Another job succeeded over 30 epochs, but the accuracy was unacceptable.

Additional Information:
This is a not-for-profit project

Hi @buzzcopper

I took a look at your project and can see that you have a major class imbalance (there's a quick sketch below for checking your per-class counts). This will cause your model to:

  • Bias heavily toward the dominant class
  • Miss detections or misclassify underrepresented classes (like hornets)
  • Appear to have “high accuracy” but poor real-world performance

[Screenshot from the project showing the class distribution]
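As a quick sanity check before retraining, you can count how many images you have per class. This is a minimal sketch, assuming you keep a local copy of the dataset with one sub-folder per class; the folder layout and names below are assumptions for illustration, not something taken from your project.

```python
# Minimal sketch: count images per class in a local copy of the dataset.
# Assumes one sub-folder per class (folder names below are hypothetical).
from collections import Counter
from pathlib import Path

DATASET_DIR = Path("dataset")  # hypothetical local export of your images
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

counts = Counter()
for class_dir in sorted(p for p in DATASET_DIR.iterdir() if p.is_dir()):
    counts[class_dir.name] = sum(
        1 for f in class_dir.rglob("*") if f.suffix.lower() in IMAGE_EXTS
    )

total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls:20s} {n:5d}  ({n / total:.1%})")
```

If one class holds most of the images, that skew is exactly what produces the "high accuracy but poor real-world performance" effect above.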

I shared this in another one of your posts, but I'm sharing it here again for others to find: we have a dedicated guide for model optimization here.

Start with a smaller dataset that is well balanced, reduce the amount of processing on your images, and make your training preprocessing match what you do on the device for inference. See our guide here for building a dataset and model optimization.
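To make that concrete, here is a minimal sketch of building a smaller balanced subset and resizing the images to the same input the device will see at inference time. The folder layout, the per-class cap of 200 and the 96x96 grayscale target are assumptions for illustration; the key point is that the training-side preprocessing should mirror whatever your ESP32-S3 pipeline actually feeds the model.

```python
# Minimal sketch: build a smaller, balanced subset and preprocess it the same
# way the device will at inference time. The 96x96 grayscale target below is
# an assumption -- use whatever resolution/colour depth your on-device
# pipeline actually produces.
import random
from pathlib import Path
from PIL import Image  # pip install pillow

SRC = Path("dataset")           # hypothetical: one folder per class
DST = Path("dataset_balanced")  # output folder, same layout
PER_CLASS = 200                 # cap each class at the same number of images
TARGET_SIZE = (96, 96)
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

random.seed(0)
for class_dir in (p for p in SRC.iterdir() if p.is_dir()):
    images = [f for f in class_dir.rglob("*") if f.suffix.lower() in IMAGE_EXTS]
    sample = random.sample(images, min(PER_CLASS, len(images)))
    out_dir = DST / class_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)
    for src_img in sample:
        img = Image.open(src_img).convert("L")         # grayscale (assumption)
        img = img.resize(TARGET_SIZE, Image.BILINEAR)  # match on-device input size
        img.save(out_dir / (src_img.stem + ".png"))
```

A smaller set like this should also train comfortably within the processing limit, and you can scale the per-class count back up once the balanced baseline is working.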

Best

Eoin

Hi @buzzcopper, great questions, but could you try to consolidate your questions into one thread in future so we can track them more easily for others with similar issues who find these questions?

Otherwise it can be hard for us to find them in future for follow-up, and it means we can't get all of the context together for others who see these answers. I'm just going to mark a couple as solved and keep one or two open that need follow-up.

@marcpous fyi

Best

Eoin
