Data leakage in training-validation split for raw IMU data

Question/Issue: Does the training-validation split in my case (raw IMU data, 500 ms window size, 20 ms window increase) end up with overlapping data between the two sets?

Project ID:

Context/Use case: So, I have IMU data (x, y, z acceleration values) recorded at 50 Hz, which I'm using to classify into various categories. When training the model, there's an option for a metadata key to avoid any data leakage between training and validation.
But am I correct to say that in my case it's not possible to avoid data leakage on Edge Impulse, since some windows in the training set will overlap with windows in the validation set?
Or is the data first divided into training and validation sets, and then windows generated for each set separately?
Thank you!
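For reference, the parameters in the question imply a very high overlap between consecutive windows. At 50 Hz, a 500 ms window holds 25 samples and a 20 ms increase advances by just 1 sample, so adjacent windows share 96% of their data. A quick sketch of the arithmetic:

```python
# Overlap implied by the question's parameters:
# 50 Hz sampling, 500 ms window, 20 ms window increase.
sample_rate_hz = 50
window_ms = 500
increase_ms = 20

samples_per_window = int(window_ms / 1000 * sample_rate_hz)  # 25 samples
stride_samples = int(increase_ms / 1000 * sample_rate_hz)    # 1 sample

overlap = (samples_per_window - stride_samples) / samples_per_window
print(f"{samples_per_window} samples/window, stride {stride_samples}, "
      f"overlap {overlap:.0%}")  # 25 samples/window, stride 1, overlap 96%
```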


Hi @Rohitashva

For information on spectral features and recommended steps to take on windowing, see:

I will need to double-check with our DSP team on the windowing sequencing for test/training data. I would hope it can't lead to data leakage as you have suggested, but we could clarify this better in the docs, as it isn't explicitly stated. @AlexE, can you offer a concrete answer here? Thanks!

Best

Eoin

Hi @Rohitashva, actually the train/validation split is random across windows, not samples, so if your windows overlap, especially by more than say 30% or so, you may end up with undesirable leakage. You were on the right track looking at metadata keys, but you're not correct in saying it's impossible to avoid leakage.

If you use the metadata keys correctly, you will have no leakage: you will have the windows from one set of samples inside training, and windows from other samples inside validation. Check out this doc for details: Metadata - Edge Impulse Documentation
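Conceptually, splitting by metadata key is a group-aware split: all windows that share a key (e.g. one participant's recording) stay on the same side. A minimal sketch of that idea, with hypothetical names (not the Edge Impulse implementation):

```python
import random

def group_split(window_groups, val_fraction=0.2, seed=0):
    """Split window indices so that windows sharing a group label
    (e.g. one participant's recording) never straddle train/validation."""
    groups = sorted(set(window_groups))
    random.Random(seed).shuffle(groups)
    n_val = max(1, int(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train = [i for i, g in enumerate(window_groups) if g not in val_groups]
    val = [i for i, g in enumerate(window_groups) if g in val_groups]
    return train, val

# Each window labelled with the sample (file) it came from:
labels = ["a", "a", "a", "b", "b", "c", "c", "c"]
train_idx, val_idx = group_split(labels, val_fraction=0.34)
# No group appears on both sides of the split:
assert not {labels[i] for i in train_idx} & {labels[i] for i in val_idx}
```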

Hi @Eoin @AlexE , Thank you for your responses!
I have a follow-up question to your response. When uploading multiple CSV files, does Edge Impulse create sliding windows separately for the data in each CSV file to preserve the temporal relationship, or combined across all the CSV files?
Because in the latter case, it would distort the data by combining readings from two different files collected at different points in time for different participants.

The sliding windows are only created within the same sample (i.e. CSV file). Once the end of a sample is reached, a new set of overlapping windows is created, starting with the first N data points of the next sample.
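The per-file behaviour described above can be sketched as follows, with hypothetical names (illustrative only, not the actual implementation): windowing restarts at each file boundary, so no window ever mixes data from two recordings.

```python
# Sketch of per-file sliding windows: windows never span the boundary
# between two CSV files/samples.
def windows_per_file(files, win, stride):
    """files: list of per-file signals; windowing restarts at each file."""
    out = []
    for signal in files:
        for start in range(0, len(signal) - win + 1, stride):
            out.append(signal[start:start + win])
    return out

file_a = [1, 2, 3, 4, 5]
file_b = [10, 20, 30, 40]
ws = windows_per_file([file_a, file_b], win=3, stride=1)
# Windows come only from within a single file:
assert [1, 2, 3] in ws and [10, 20, 30] in ws
# No window mixes values from both files:
assert not any(set(w) & set(file_a) and set(w) & set(file_b) for w in ws)
```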