I am working on a model for a client application, using a continuous time-series dataset recorded from healthy subjects. I am trying to build a better model topology with the classification learning block and the spectrogram processing block. Inside EI Studio, however, the data is split randomly between the training and validation datasets. As a result, data from the same subject (person) ends up in both the training and the validation dataset, which is data leakage.
Will this data leakage affect the model? If so, please help me resolve it in EI Studio.
Hi @Keerthivasan, yes, we’re aware. We are (well, @04rlowe is) currently working on a way to do the train/validation split based on sample metadata. You would then tag all samples with subject = XXX metadata (metadata is currently only visible for enterprise projects, but we’d open this up for everyone), and the UI will have an option to do the validation split based on the subject metadata key.
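Until that option exists, the same idea can be tried offline. Below is a minimal sketch of a subject-wise split in plain Python. The sample layout (a dict with a `metadata` dict holding a `subject` key) and the function name `split_by_subject` are assumptions made for illustration; this is not how EI Studio implements its split.

```python
import random
from collections import defaultdict


def split_by_subject(samples, validation_fraction=0.2, seed=0, key="subject"):
    """Split samples so that every subject lands in exactly one set.

    `samples` is a list of dicts such as
    {"name": "rec_001", "metadata": {"subject": "P01"}} (hypothetical layout).
    """
    # Group the samples by subject ID.
    by_subject = defaultdict(list)
    for sample in samples:
        by_subject[sample["metadata"][key]].append(sample)

    if len(by_subject) < 2:
        raise ValueError("need at least two subjects to split by subject")

    # Shuffle the subjects (not the individual samples), reproducibly.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    # Hold out whole subjects; keep at least one subject in each set.
    n_val = max(1, round(len(subjects) * validation_fraction))
    n_val = min(n_val, len(subjects) - 1)

    val = [s for subj in subjects[:n_val] for s in by_subject[subj]]
    train = [s for subj in subjects[n_val:] for s in by_subject[subj]]
    return train, val


# Toy data: 50 recordings from 5 subjects, 10 recordings each.
samples = [
    {"name": f"rec_{i:03d}", "metadata": {"subject": f"P{i % 5:02d}"}}
    for i in range(50)
]
train, val = split_by_subject(samples)
```

Note that the fraction applies to subjects, not samples, so the resulting sample ratio only approximates 80/20 when subjects contribute different numbers of recordings.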
Though the overlap between train and validation is important (and the metadata fix will help with that), the bigger concern is the overlap between train and test. It’s not uncommon that, as long as you’re clean between train and test, the train/validate overlap isn’t as big a concern. My advice: it’s right to be aware of it, but I’d ignore it for now and check how things look with respect to test performance; it might be OK. Mat
I train a model on my local machine (cross-validation, hyperparameter tuning, + experiment tracking). I obtain a final model architecture.
I retrain this final model architecture in EI Studio on the training dataset I uploaded. Of course, the train/validation split there is random, which means data leakage (and overfitting that the validation set cannot reveal). The loss metrics at the end of training therefore don’t tell you anything meaningful.
I use the test set to check my final model performance. Because the test set split is based on patient ID, the test outcome gives metrics that tell you how well the ‘model generalizes’.
(I have performed some comparison checks training/testing on the local machine and EI studio. So far, this approach has given good results.)
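Since this workflow depends on the test set sharing no patients with the training data, it can be worth checking that before uploading. A minimal sketch, assuming you keep one patient ID per sample in a plain list (the helper name `shared_subjects` and the IDs are hypothetical):

```python
def shared_subjects(ids_a, ids_b):
    """Return the subject IDs that occur in both sets, sorted."""
    return sorted(set(ids_a) & set(ids_b))


# One subject ID per sample (toy values).
train_ids = ["P01", "P01", "P02", "P03"]
test_ids = ["P04", "P05"]
random_split_ids = ["P02", "P06"]  # e.g. a randomly drawn validation set

clean = shared_subjects(train_ids, test_ids)          # no overlap
leaky = shared_subjects(train_ids, random_split_ids)  # P02 is in both
```

An empty result for train versus test is what makes the test metrics trustworthy as a measure of generalization to unseen patients.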
Be careful if you use this approach to improve your model, i.e., if you tune the model parameters based on the results on the test set. In that case you indirectly use the test set as a validation set, and you leak information from the test set into your model design (see Regression - Confidence Interval - Q and Regression - Confidence Interval - A). This is not good practice. A solution: once you have a final model architecture, replace the test set with a new test set that was never used during the training/validation phase, and use this new test set to evaluate the final model.
A question that can be asked: in this use case, is a train/validation split needed at all? (After all, we ignore the result from the validation set.) Can I set the validation set size to a lower value, for example from 20% to 1%, so that nearly all of the training data is used to train the model, and then test it on the test set? Yes, you can, but this can become tricky. For an int8 model, a representative dataset is needed (for more info, see Post-training quantization). In EI Studio, the representative dataset is your validation set. If the validation set is not sufficiently large, the performance of the int8 model will suffer (Int8 model: representative dataset).
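To see why a very small representative dataset can hurt, here is a simplified sketch of min/max-based affine int8 quantization in plain Python. It is an illustration only: the per-tensor min/max scheme, the function names, and the toy data are assumptions, not the actual calibration code used by TensorFlow Lite or EI Studio.

```python
def calibrate(values):
    """Derive an affine int8 scale and zero point from calibration data."""
    lo = min(min(values), 0.0)  # keep 0.0 inside the representable range
    hi = max(max(values), 0.0)
    scale = max(hi - lo, 1e-12) / 255.0  # 256 int8 levels span [lo, hi]
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize_roundtrip(x, scale, zero_point):
    """Quantize x to int8 (with clipping) and map it back to float."""
    q = round(x / scale) + zero_point
    q = max(-128, min(127, q))
    return (q - zero_point) * scale


def max_roundtrip_error(data, calibration):
    """Worst-case error on `data` when calibrated on `calibration`."""
    scale, zero_point = calibrate(calibration)
    return max(abs(x - quantize_roundtrip(x, scale, zero_point)) for x in data)


# Toy "activations" covering the range -3.00 .. 3.00.
activations = [i / 100 for i in range(-300, 301)]

# A calibration set that covers the full range ...
large_calibration = activations[::3]
# ... versus a tiny one that only ever saw values near zero.
small_calibration = activations[290:311]

error_large = max_roundtrip_error(activations, large_calibration)
error_small = max_roundtrip_error(activations, small_calibration)
```

With the tiny calibration set, every value outside roughly ±0.1 is clipped, so `error_small` is orders of magnitude larger than `error_large`. A 1% validation set is not guaranteed to behave this badly, but the smaller it is, the more likely it misses part of the input range.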
I haven’t looked into the code in detail, but I assume the train/validation split (function split_and_shuffle_data in libraries/ei_tensorflow/training.py?) is performed the same way as in EI Studio.