Correct use of metadata key for train/validation split

Question/Issue:
I am trying to control the training/validation split in Edge Impulse Studio using metadata so that augmented images are never used in the validation set.

In my project, the training dataset contains both raw images and augmented images. I added metadata to distinguish them:

  • dataset_group = raw
  • dataset_group = augmented

Currently the dataset contains:

  • 280 images with dataset_group = augmented (161 images in one class and 119 in the other)
  • 59 images with dataset_group = raw (37 images in one class and 22 in the other)

My goal is to train using augmented and raw images but validate only on raw images, ensuring that augmented samples never appear in the validation set.

Following the documentation on using metadata to control train/validation splits, I tried setting:

  • Split train/validation set on metadata key = dataset_group
  • Validation set size = 0%

However, the training job fails with an error indicating that no samples are found in the validation set.

I also tried setting Validation set size = 20%, but the same error occurs.

What is the best way in Edge Impulse Studio to ensure that augmented images are never placed in the validation set while still using them during training?

Project ID:

Context/Use case:
I am building an image classification model where I augmented the dataset to improve training performance.

However, I want the validation metrics to reflect performance on real (raw) images only, not augmented ones, because augmented images are derived from the same originals and could bias validation accuracy.
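To make the goal concrete, here is a minimal sketch of the split I am after, written outside of Studio. The function name `split_raw_only_validation` and the tuple layout are my own illustration and not part of any Edge Impulse API; the counts match my dataset.

```python
import random

def split_raw_only_validation(samples, val_fraction=0.2, seed=0):
    """Hypothetical helper: samples are (sample_id, label, dataset_group) tuples.

    All augmented samples go to training. A per-class fraction of the raw
    samples is held out for validation; the remaining raw samples are trained on.
    """
    rng = random.Random(seed)
    train = [s for s in samples if s[2] == "augmented"]
    validation = []

    # Group raw samples by label so both classes appear in validation.
    raw_by_label = {}
    for s in samples:
        if s[2] == "raw":
            raw_by_label.setdefault(s[1], []).append(s)

    for items in raw_by_label.values():
        rng.shuffle(items)
        n_val = max(1, round(len(items) * val_fraction))
        validation.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, validation

# Same counts as in my project (class names are placeholders).
samples = (
    [(f"aug_a_{i}", "class_a", "augmented") for i in range(161)]
    + [(f"aug_b_{i}", "class_b", "augmented") for i in range(119)]
    + [(f"raw_a_{i}", "class_a", "raw") for i in range(37)]
    + [(f"raw_b_{i}", "class_b", "raw") for i in range(22)]
)
train, validation = split_raw_only_validation(samples)
```

Note that this sketch does not remove augmented copies of a held-out raw image from the training set; doing that would need a per-original identifier, which my metadata does not currently carry.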

Summary:
When using a metadata key (dataset_group) to control the train/validation split, training fails with the error:

ERROR: No samples in validation set!
Please check your custom validation split.
If you wanted to set validation set explicitly via “Split train/validation set on metadata key” you need to change validation set size to 0

This happens even though the dataset contains both raw and augmented samples.

Steps to Reproduce:

  1. Upload a dataset containing raw and augmented images.
  2. Add metadata key dataset_group with values raw or augmented.
  3. In Training settings → Advanced training settings:
     • Set Split train/validation set on metadata key = dataset_group
     • Set Validation set size = 0%
  4. Start training.

Actual Results:
Training fails with:

ERROR: No samples in validation set!
Please check your custom validation split.

This occurs both with Validation set size = 0% and 20%.

Reproducibility:

  • [x] Always
  • [ ] Sometimes
  • [ ] Rarely

Environment:

  • Platform: Edge Impulse Studio (cloud training)
  • Build Environment Details: N/A (training done entirely in Studio)
  • OS Version: N/A
  • Edge Impulse Version (Firmware): N/A
  • Edge Impulse CLI Version: N/A
  • Project Version: Default Studio configuration
  • Custom Blocks / Impulse Configuration:
    • Image classification impulse
    • Transfer learning with MobileNetV2

Logs/Attachments:

Training log excerpt:
Splitting data into training and validation sets…
Using custom validation split…
Traceback (most recent call last):
  File "/home/train.py", line 389, in <module>
    main_function()
  File "/home/train.py", line 306, in main_function
    train_dataset, validation_dataset, samples_dataset, X_train, X_test, Y_train, Y_test, has_samples, X_samples, Y_samples = ei_tensorflow.training.get_dataset_from_folder(
  File "/app/./resources/libraries/ei_tensorflow/training.py", line 274, in get_dataset_from_folder
    X_train, X_test, Y_train, Y_test, X_train_raw, sample_id_details = split_and_shuffle_data(
  File "/app/./resources/libraries/ei_tensorflow/training.py", line 151, in split_and_shuffle_data
    raise Exception('ERROR: No samples in validation set! '
Exception: ERROR: No samples in validation set!
Please check your custom validation split.
If you wanted to set validation set explictly via "Split train/validation set on metadata key" you need to change validation set size to 0
Application exited with code 1
Job failed (see above)

Additional Information:

Since the metadata key only has two values (raw and augmented), it effectively creates two large buckets. Is metadata splitting intended for many small groups (e.g., per original image or capture session) rather than coarse labels like this?
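To illustrate why I suspect this, below is a minimal sketch of a group-wise split in which whole metadata groups are assigned to one side. I do not know how Studio actually implements this; `group_split` is a hypothetical function, and rounding the number of validation groups down is purely my assumption.

```python
import random

def group_split(groups, val_fraction, seed=0):
    """Hypothetical group-wise split: every sample sharing a metadata
    value lands on the same side. Returns (train_indices, val_indices)."""
    rng = random.Random(seed)
    keys = sorted(set(groups))
    rng.shuffle(keys)
    # Assumption: the number of validation groups is rounded down.
    n_val_groups = int(len(keys) * val_fraction)
    val_keys = set(keys[:n_val_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in val_keys]
    val_idx = [i for i, g in enumerate(groups) if g in val_keys]
    return train_idx, val_idx

# My current metadata: only two distinct values.
coarse = ["augmented"] * 280 + ["raw"] * 59
train_idx, val_idx = group_split(coarse, 0.2)
# Two groups at 20% means zero whole groups, so validation is empty.

# Hypothetical finer-grained key: one value per original image.
fine = [f"img_{i % 59}" for i in range(339)]
train_fine, val_fine = group_split(fine, 0.2)
```

Under these assumptions the two-value key leaves the validation set empty, while a key with many distinct values yields a non-empty one. If Studio works along these lines, that would explain the error at 20%, though not necessarily at 0%.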