Choosing your own validation data

Hello. One capability that I’ve found lacking in Edge Impulse is the ability to choose my own validation data. I think technically it would be possible using the Keras ‘expert mode’, but I’m not even clear how I would distinguish which rows of the training tensor correspond to which samples in my dataset. It would be very beneficial to be able to label data as validation data so I can control the distribution of samples that my trained model is validated against.

If anyone has any quick tips on how to map from the training matrix X available in the Keras expert mode to the names of the samples in my dataset, that would also be very helpful.



Hi @tennies, why exactly would you want this? Isn’t the random split we make more beneficial, since it gives a proper indication of performance and cannot be gamed easily? The test set you can build up separately.

@dansitu can comment as well.

Hi @tennies,

There isn’t currently a mechanism for doing this, but we’re planning to support different validation splits (for example, stratified samples across the labels of a classifier). Would this help with your use case or are you looking for something different?


Hey guys, thanks for the input.

Maybe the simplest way to say it is you may want to control the validation split for the same reasons you would want to control the test split.

I think the random split is fine if your data is exactly balanced in terms of the target classes and the characteristics of the data. But often data comes from several distributions with varying characteristics (e.g. different test subjects, test environments, noise levels, sensor mountings), or there is a high degree of class imbalance, so stratifying the data (i.e. separating it into subgroups before splitting) may be necessary to get a ‘fair’ evaluation.
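Outside of Edge Impulse, a stratified split is a one-liner with scikit-learn. This is just a minimal sketch on synthetic data (the `X` and `y` arrays here are placeholders, not anything Edge Impulse exposes):

```python
# Sketch: a class-stratified validation split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # 100 samples, 8 features (synthetic)
y = np.array([0] * 90 + [1] * 10)      # heavily imbalanced classes, 9:1

# stratify=y preserves the 9:1 class ratio in both splits, so the
# minority class cannot end up entirely on one side by chance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(np.bincount(y_train), np.bincount(y_val))
```

Without `stratify=y`, a purely random 20% split of this dataset could easily contain zero minority-class samples, which is exactly the ‘unfair’ evaluation described above.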

One simple example where you would probably want this is to make sure that all ‘subwindows’ of a sample (i.e. the windows produced by Edge Impulse’s sliding window) belong to the same fold.
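The subwindow case can be sketched with scikit-learn’s group-aware splitters. The `sample_ids` array here is an assumption — some mapping from each window row back to its parent sample, which is exactly the mapping the thread is asking Edge Impulse to expose:

```python
# Sketch: keep all subwindows of a sample in the same fold.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))                     # 12 windows, 4 features each
sample_ids = np.repeat(["a", "b", "c", "d"], 3)  # 3 windows per parent sample (assumed mapping)

# GroupShuffleSplit assigns whole groups (parent samples) to one side,
# so near-duplicate windows never leak across the train/val boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(X, groups=sample_ids))

# No parent sample appears on both sides of the split
assert not set(sample_ids[train_idx]) & set(sample_ids[val_idx])
```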

In a more complicated example, you may have a limited homebrew dataset that is high quality, but you want to leverage some publicly available datasets to make your model more robust. In this case it may be helpful to control your validation data such that it matches your test data more closely, and include less trusted data, or synthetic data, only in the training set (and verify it actually helps model performance). If not, you could end up in a situation where you get high validation performance (on account of the more plentiful mismatched data) but low test performance, and if you’re constantly looking at your test set to tweak model performance then it kind of defeats the purpose of having a test set :stuck_out_tongue:

I think stratifying by class label would help with class imbalance, but being able to stratify by metadata tags would be more powerful.

Again, a hacky way to achieve this would be to just have a reliable method to map from the rows in the X/Y tensors to the sample names (not sure if this can be done currently) - that way I could just encode simple metadata into the names of the samples when I upload them, and decode them in the keras expert mode to decide the validation split.
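To illustrate the hacky workaround: if sample names were recoverable in expert mode, metadata encoded in them at upload time could drive the split. The naming scheme (`<subject>_<condition>_<id>`) and the `sample_names` array below are purely hypothetical — as noted in the thread, expert mode does not currently expose sample names:

```python
# Sketch of the proposed workaround: decode metadata from (hypothetical)
# sample names and hold out one subject's data as the validation set.
import numpy as np

# Assumed: names encoded at upload time as "<subject>_<condition>_<id>"
sample_names = np.array([
    "alice_quiet_001", "alice_noisy_002",
    "bob_quiet_003", "bob_noisy_004",
])

# Decode the subject field from each name
subjects = np.array([name.split("_")[0] for name in sample_names])

# Boolean masks selecting validation rows by metadata, not at random
val_mask = subjects == "bob"
train_mask = ~val_mask
print(sample_names[val_mask])
```

The same masks could then index the X/Y tensors, giving a metadata-driven split instead of a random one.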


Glad to hear this is what you are looking for—I agree with all of your points and we’re working towards adding both metadata and stratified sampling.

There isn’t a mechanism for accessing sample metadata from the training script right now, but we’ll let you know when these metadata-based features are available!