Question/Issue:
I would like some guidance on the recommended dataset size for training our model.
Project ID:
buzzcopper1
Context/Use case:
My colleagues and I are building a microcontroller device to detect and alert when Asian Hornets visit a detector. The device is described at https://buzzcopper.org. It currently uses hard-coded machine-vision processing to detect when an insect is in the field of view and then sends a high-resolution picture to Google Vision for inference. This is neither efficient nor accurate. We wish to build a C++ AI model into the ESP32-S3 device itself to classify insects on-device. Classification only needs to distinguish between Asian Hornets, European Hornets, bees, wasps and flies.
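
For context on the kind of on-device integration we are aiming for, here is a minimal sketch of what the classification call could look like, assuming the trained model is exported as an Edge Impulse C++ library and linked into the firmware (the feature buffer and the classify_current_frame() function below are illustrative placeholders, not our current code):

```cpp
// Rough sketch only: assumes the classifier is exported from Edge Impulse
// as a C++ library and compiled into the ESP32-S3 firmware.
#include <string.h>
#include "edge-impulse-sdk/classifier/ei_run_classifier.h"

// Placeholder buffer: in real firmware this would hold the camera frame,
// resized and converted to the model's expected input format.
static float features[EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE];

// Callback used by the SDK to read feature data in chunks.
static int get_signal_data(size_t offset, size_t length, float *out_ptr) {
    memcpy(out_ptr, features + offset, length * sizeof(float));
    return 0;
}

void classify_current_frame() {
    signal_t signal;
    signal.total_length = EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE;
    signal.get_data = &get_signal_data;

    ei_impulse_result_t result = { 0 };
    EI_IMPULSE_ERROR err = run_classifier(&signal, &result, false);
    if (err != EI_IMPULSE_OK) {
        ei_printf("run_classifier failed (%d)\n", err);
        return;
    }

    // One score per class (Asian Hornet, European Hornet, bee, wasp, fly).
    for (size_t i = 0; i < EI_CLASSIFIER_LABEL_COUNT; i++) {
        ei_printf("%s: %.3f\n", result.classification[i].label,
                  result.classification[i].value);
    }
}
```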
Steps Taken:
We have already uploaded 1200 images of Asian Hornets, European Hornets, wasps and other insects from Google Cloud Storage, but we have many more we could upload. How many is too many? Do unlabelled images need to be removed, or are they ignored during training?
Expected Outcome:
We want the training/test dataset to be an appropriate size so that accuracy reaches roughly 90% and training completes within the one-hour processing limit.
Actual Outcome:
When we attempted to train with 1200 images but (so far) only 120 of them labelled, over 60 epochs, the job failed early because it would have taken too long. Another job over 30 epochs succeeded, but the accuracy was unacceptable.
Additional Information:
This is a not-for-profit project