Correct use of metadata key for train/validation split

rnath · March 15, 2026, 1:01pm

Question/Issue:
I am trying to control the training/validation split in Edge Impulse Studio using metadata so that augmented images are never used in the validation set.

In my project, the training dataset contains both raw images and augmented images. I added metadata to distinguish them:

dataset_group = raw
dataset_group = augmented

Currently the dataset contains:

280 images with dataset_group = augmented (161 images in one class and 119 in the other)
59 images with dataset_group = raw (37 images in one class and 22 in the other)

My goal is to train using augmented and raw images but validate only on raw images, ensuring that augmented samples never appear in the validation set.

Following the documentation on using metadata to control train/validation splits, I tried setting:

Split train/validation set on metadata key = dataset_group
Validation set size = 0%

However, the training job fails with an error indicating that no samples are found in the validation set.

I also tried setting Validation set size = 20%, but the same error occurs.

What is the best way in Edge Impulse Studio to ensure that augmented images are never placed in the validation set while still using them during training?

Project ID:

Context/Use case:
I am building an image classification model where I augmented the dataset to improve training performance.

However, I want the validation metrics to reflect performance on real (raw) images only, not augmented ones, because augmented images are derived from the same originals and could bias validation accuracy.

Summary:
When using a metadata key (dataset_group) to control the train/validation split, training fails with the error:

ERROR: No samples in validation set!
Please check your custom validation split.
If you wanted to set validation set explicitly via “Split train/validation set on metadata key” you need to change validation set size to 0

This happens even though the dataset contains both raw and augmented samples.

Steps to Reproduce:

1. Upload a dataset containing raw and augmented images.
2. Add metadata key dataset_group with values raw or augmented.
3. In Training settings → Advanced training settings:
   - Set Split train/validation set on metadata key = dataset_group
   - Set Validation set size = 0%
4. Start training.

Actual Results:
Training fails with:

ERROR: No samples in validation set!
Please check your custom validation split.

This occurs with Validation set size set to either 0% or 20%.

Reproducibility:

[x] Always
[ ] Sometimes
[ ] Rarely

Environment:

Platform: Edge Impulse Studio (cloud training)
Build Environment Details: N/A (training done entirely in Studio)
OS Version: N/A
Edge Impulse Version (Firmware): N/A
Edge Impulse CLI Version: N/A
Project Version: Default Studio configuration
Custom Blocks / Impulse Configuration:
Image classification impulse
Transfer learning with MobileNetV2

Logs/Attachments:

Training log excerpt:
Splitting data into training and validation sets…
Using custom validation split…
Traceback (most recent call last):
  File "/home/train.py", line 389, in <module>
    main_function()
  File "/home/train.py", line 306, in main_function
    train_dataset, validation_dataset, samples_dataset, X_train, X_test, Y_train, Y_test, has_samples, X_samples, Y_samples = ei_tensorflow.training.get_dataset_from_folder(
  File "/app/./resources/libraries/ei_tensorflow/training.py", line 274, in get_dataset_from_folder
    X_train, X_test, Y_train, Y_test, X_train_raw, sample_id_details = split_and_shuffle_data(
  File "/app/./resources/libraries/ei_tensorflow/training.py", line 151, in split_and_shuffle_data
    raise Exception('ERROR: No samples in validation set! '
Exception: ERROR: No samples in validation set!
Please check your custom validation split.
If you wanted to set validation set explictly via "Split train/validation set on metadata key" you need to change validation set size to 0
Application exited with code 1
Job failed (see above)

Additional Information:

Since the metadata key only has two values (raw and augmented), it effectively creates two large buckets. Is metadata splitting intended for many small groups (e.g., per original image or capture session) rather than coarse labels like this?
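
To make the "two large buckets" concern concrete, here is a minimal Python sketch that enumerates every split in which each metadata group stays entirely on one side. It uses the sample counts from this post and does not model Studio's actual splitting algorithm, which is not shown here; the function name group_preserving_splits is made up for illustration.

```python
from itertools import combinations

# Sample counts per dataset_group value, taken from the post above.
group_sizes = {"augmented": 280, "raw": 59}


def group_preserving_splits(sizes):
    """Yield (validation_groups, validation_fraction) for every way of
    assigning whole groups to the validation side.

    Hypothetical helper: it only illustrates the options available to any
    group-preserving split, not what Studio actually does.
    """
    total = sum(sizes.values())
    names = sorted(sizes)
    for k in range(len(names) + 1):
        for val_groups in combinations(names, k):
            val_count = sum(sizes[g] for g in val_groups)
            yield val_groups, val_count / total


splits = list(group_preserving_splits(group_sizes))
for val_groups, fraction in splits:
    label = ", ".join(val_groups) if val_groups else "(empty)"
    print(f"validation = {label}: {fraction:.1%} of samples")
```

With only two metadata values there are just four possible outcomes: an empty validation set, all augmented, all raw, or everything. No intermediate validation size can be reached without breaking a group apart.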

Eoin · April 2, 2026, 11:43am

Welcome to the forum, @rnath, and thanks for the detailed post!

This looks more like a feature request than a bug.

You’re using the feature correctly. The metadata split is designed to keep related samples together, not to explicitly exclude one group from validation.

So if you use dataset_group = raw and dataset_group = augmented, Studio will keep each group together on one side of the split, but it cannot guarantee that the augmented group stays in training only.

For this use case, metadata works better when it identifies the original source of the sample, such as:

original image ID
capture session
video file

That way, a raw image and all of its augmented versions stay together and do not leak across train/validation.
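
As a rough illustration of grouping by original source, the Python sketch below derives a source ID from each filename and splits whole groups between training and validation. The naming scheme ("img001.jpg" for a raw image, "img001_aug2.jpg" for its augmented copies) and the helper names source_id and split_by_source are assumptions made for this example; this is not Studio's implementation.

```python
import random
import re
from collections import defaultdict

# Assumed naming scheme (hypothetical): augmented copies carry an "_aug<N>"
# suffix, raw originals do not.
AUG_SUFFIX = re.compile(r"_aug\d+$")


def source_id(filename):
    """Return the original image ID shared by a raw file and its augmentations."""
    stem = filename.rsplit(".", 1)[0]
    return AUG_SUFFIX.sub("", stem)


def split_by_source(filenames, validation_fraction=0.2, seed=0):
    """Split files so that every source ID lands entirely on one side."""
    groups = defaultdict(list)
    for name in filenames:
        groups[source_id(name)].append(name)

    ids = sorted(groups)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * validation_fraction))

    val = [f for i in ids[:n_val] for f in groups[i]]
    train = [f for i in ids[n_val:] for f in groups[i]]
    return train, val


# Five originals, each with two augmented copies.
files = []
for n in range(1, 6):
    files.append(f"img{n:03d}.jpg")
    files.extend(f"img{n:03d}_aug{k}.jpg" for k in (1, 2))

train_files, val_files = split_by_source(files)
print(len(train_files), "training files,", len(val_files), "validation files")
```

The value returned by source_id is the kind of string that could be stored under a metadata key, so that the split operates on many small groups instead of two large ones.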

If you need raw-only validation, the safest approach is to keep augmented samples in training only and evaluate separately on raw data.
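
One way to sketch this offline in Python, assuming each sample record carries both a source ID and the dataset_group value from the original post (the record layout and the function name raw_only_validation_split are hypothetical): hold out a fraction of the originals, validate on their raw images only, and train on everything else. This sketch also drops the augmented copies of the held-out originals, which is an extra assumption made here to avoid leakage.

```python
import random


def raw_only_validation_split(samples, validation_fraction=0.2, seed=0):
    """Return (train, validation) where validation holds raw samples only.

    Augmented copies of held-out originals are discarded so that nothing
    derived from a validation image is seen during training.
    """
    source_ids = sorted({s["source_id"] for s in samples})
    random.Random(seed).shuffle(source_ids)
    n_val = max(1, round(len(source_ids) * validation_fraction))
    held_out = set(source_ids[:n_val])

    train = [s for s in samples if s["source_id"] not in held_out]
    validation = [
        s for s in samples
        if s["source_id"] in held_out and s["dataset_group"] == "raw"
    ]
    return train, validation


# Hypothetical records: ten originals, each with three augmented copies.
samples = []
for n in range(10):
    sid = f"img{n:03d}"
    samples.append({"file": f"{sid}.jpg", "source_id": sid, "dataset_group": "raw"})
    for k in range(3):
        samples.append(
            {"file": f"{sid}_aug{k}.jpg", "source_id": sid, "dataset_group": "augmented"}
        )

train, validation = raw_only_validation_split(samples)
print(len(train), "training samples,", len(validation), "validation samples")
```

The training side still contains both raw and augmented samples, while the validation side contains only raw images whose originals never appear in training.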

More details here:

Hope this helps!

Best

Eoin