Hello. One capability I’ve found lacking in Edge Impulse is the ability to choose my validation data. Technically it might be possible using the Keras ‘expert mode’, but I’m not even clear how I would determine which rows of the training tensor correspond to which samples in my dataset. It would be very beneficial to be able to label data as validation data, so I can control the distribution of samples that my model is validated against.
If anyone has any quick tips on how to map from the training matrix X available in the Keras expert mode to the names of the samples in my dataset, that would also be very helpful.
Hi @tennies, why exactly would you want this? Isn’t the random split that we make more beneficial, as it gives a proper indication of performance and cannot be gamed easily? The test set you can build up separately.
There isn’t currently a mechanism for doing this, but we’re planning to support different validation splits (for example, stratified samples across the labels of a classifier). Would this help with your use case or are you looking for something different?
Maybe the simplest way to say it is you may want to control the validation split for the same reasons you would want to control the test split.
The random split, I think, is fine if your data is exactly balanced in terms of the target classes and the characteristics of the data. But often data comes from several distributions with varying characteristics (e.g. different test subjects, test environments, noise, sensor mounting, etc.), or there is a high degree of class imbalance, so that stratifying the data (i.e. separating your data into subgroups before splitting) may be necessary to get a ‘fair’ evaluation.
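As a minimal sketch of the difference (assuming scikit-learn is available locally; this is not something Edge Impulse exposes today), a stratified split keeps the class proportions identical in train and validation, whereas a purely random split can drift on imbalanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)        # 9:1 class imbalance
X = np.arange(len(y)).reshape(-1, 1)     # dummy features, one row per sample

# stratify=y forces both splits to preserve the 9:1 ratio exactly
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(np.bincount(y_tr))   # [72  8] -> still 9:1
print(np.bincount(y_val))  # [18  2] -> still 9:1
```

With `stratify=None` the minority class can easily end up over- or under-represented in a 20% validation split.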
One simple example where you would probably want this is to make sure that all ‘subwindows’ of a sample (i.e. the sliding window applied by edge impulse) belong to the same fold.
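That subwindow constraint can be expressed with a grouped split, for example scikit-learn’s `GroupKFold` (a sketch of the idea, not anything Edge Impulse provides; the group ids here stand in for per-sample ids):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 4 source samples, each sliced into 3 subwindows -> 12 rows
groups = np.repeat(np.arange(4), 3)   # subwindow i belongs to sample groups[i]
X = np.random.rand(12, 5)
y = np.repeat([0, 1, 0, 1], 3)

for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # every subwindow of a given sample lands on exactly one side of the split
    assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))
```

A plain random split would happily put two overlapping subwindows of the same recording into train and validation, which inflates the validation score.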
In a more complicated example, you may have a limited homebrew dataset that is high quality, but you want to leverage some publicly available datasets to make your model more robust. In this case it may be helpful to control your validation data such that it matches your test data more closely, and to include less trusted data, or synthetic data, only in the training set (and verify it actually helps model performance). Otherwise you could end up in the situation where you get high validation performance (on account of the more plentiful mismatched data) but low test performance, and if you’re constantly looking at your test set to tweak model performance, then it kind of defeats the purpose of having a test set.
I think stratifying by class label would help with class imbalance, but being able to stratify by metadata tags would be more powerful.
Again, a hacky way to achieve this would be to have a reliable method to map from the rows in the X/Y tensors to the sample names (I’m not sure if this can be done currently). That way I could encode simple metadata into the names of the samples when I upload them, and decode it in the Keras expert mode to decide the validation split.
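Purely as a hypothetical sketch: if the rows of X came with an aligned list of sample names (no such mapping is exposed today, as far as I know), the metadata encoded in those names could drive the split. The `"label.subjectNN.runN"` naming convention below is invented for illustration:

```python
import numpy as np

# Hypothetical: a list of sample names aligned row-for-row with X
sample_names = [
    "walk.subject01.run1", "walk.subject01.run2",
    "walk.subject02.run1", "idle.subject02.run1",
    "idle.subject03.run1", "idle.subject03.run2",
]
X = np.random.rand(len(sample_names), 8)

def subject_of(name):
    # second dot-separated field carries the subject id in this made-up scheme
    return name.split(".")[1]

# Hold out one whole subject as the validation set
val_subjects = {"subject03"}
val_mask = np.array([subject_of(n) in val_subjects for n in sample_names])

X_val, X_train = X[val_mask], X[~val_mask]
print(X_train.shape, X_val.shape)  # (4, 8) (2, 8)
```

The same boolean mask would be applied to Y, giving full control over which distributions end up in validation.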
I am working on an application using data from IMUs placed on healthy subjects. Currently, I am training a model (on a local GPU machine) using a cross-validation approach and trying to find the best network topology. When I have the final topology, I will train the model and test it using the testing dataset. Because uploading a pre-trained model is not possible, I also need to train the final model topology in Edge Impulse Studio. However, inside EI Studio, the data is randomly split between training and validation datasets. The outcome is that data from the same subject (person) will end up in both the training and validation datasets. This split results in data leakage.
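For reference, the subject-wise split described above can be sketched with scikit-learn’s `GroupShuffleSplit`, which keeps every window from one person entirely on one side of the split (a local workaround, not something EI Studio offers):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 10 subjects, 20 IMU windows each -> 200 rows
subjects = np.repeat([f"s{i}" for i in range(10)], 20)
X = np.random.rand(len(subjects), 6)
y = np.random.randint(0, 2, len(subjects))

# test_size=0.2 holds out 2 of the 10 subjects for validation
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=subjects))

# no subject contributes windows to both sides, so there is no leakage
assert set(subjects[train_idx]).isdisjoint(set(subjects[val_idx]))
```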
My question is, is there already some progress in choosing your validation data?
Another use-case for not randomly distributing the train/validation/test datasets:
When photographing a parking lot with, for example, a 15-minute time-lapse interval over a period of days, the images from any given day should all belong to just one of these sets. This avoids having pictures of the same car, parked in the same space for hours and showing only lighting variations, appear in the training and validation/testing sets simultaneously.
Likewise, photos from a rainy day should all end up in just one of the sets.
So a metadata field would work for this.
When uploading data, perhaps dot notation could be used: label_name.meta01.meta02.meta03.image.jpg
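A minimal sketch of how such a dot-notation filename could be decoded on the training side (the field layout simply follows the proposal above; the example metadata values are invented):

```python
def parse_sample_name(filename):
    # "car.day03.rainy.image.jpg" -> label "car", metadata ["day03", "rainy"]
    parts = filename.split(".")
    label, meta = parts[0], parts[1:-2]  # drop the trailing "image.jpg"
    return label, meta

label, meta = parse_sample_name("car.day03.rainy.image.jpg")
print(label, meta)  # car ['day03', 'rainy']
```

Any of the decoded metadata fields could then be used to group samples into the same split, as in the parking-lot and rainy-day examples.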