Sample names all equal and split dataset

JLannoo · November 23, 2021, 3:37pm

Hi all,

I just captured 60 samples of 10 seconds each of accelerometer values for 3 labels using the data forwarder. The sample names are the same as the labels (so 20 samples for each label).

Is this an issue? Previously a random hash was added to the name.
When splitting the dataset using the button on the dashboard, only samples of one label are moved to the test dataset. Is this because the names are all equal?

Has anyone else encountered this issue? I did this before in another project, and it worked fine there.
Project ID: 63942

Best regards,
Jonas

aurel · November 23, 2021, 4:18pm

Hi @JLannoo,

We’re looking into it, but it seems to be a bug introduced with a recent release.
As a workaround, once you’ve captured all of your data, go to Export to download your entire dataset (it will add a hash to each sample).
Then delete the data from your project and re-import it all from that export.
We’ll keep you posted about the fix.

Aurelien

janjongboom · November 23, 2021, 7:31pm

Hi, we’re rolling out a hotfix for this. I’m a bit too considerate to go change names of data items retroactively, so @aurel’s suggestion is a good one.

JLannoo · December 7, 2021, 3:00pm

Hi all,

I see this problem has been fixed!
But now I notice that the automatic splitting doesn’t split the data “evenly”.
I have three labels with 20 samples each.
After splitting, the test set contains 12 samples, but instead of 4 samples per label it has 3, 3 and 6 samples…
I repeated this a few times and came to the same conclusion each time. And it’s the same split every time!

Is splitting the dataset random or is this done in a certain way? Maybe the seed depends on the dataset itself? I’m only guessing…

Best regards,
Jonas

janjongboom · December 7, 2021, 3:44pm

Hi @JLannoo, it’s a deterministic process based on the hash of the file. Otherwise you could end up with a completely different train/test split every time you perform the split, and then you cannot compare different training cycles.

I’d suggest adding a bit more data; then it’ll converge to a proper split.
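
To illustrate the idea, here is a minimal Python sketch of a hash-based split. It is not the actual Edge Impulse implementation: the 80/20 ratio, the use of SHA-256, the label names, and the `assign_split` / `split_counts` helpers are all assumptions made for the example. It shows why repeating the split gives the same result, and why the per-label counts are uneven with 20 samples per label but even out with more data.

```python
import hashlib
from collections import Counter

TEST_FRACTION = 0.2  # assumed 80/20 train/test ratio


def assign_split(file_hash: str, test_fraction: float = TEST_FRACTION) -> str:
    """Deterministically map a file hash to 'train' or 'test'."""
    digest = hashlib.sha256(file_hash.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_counts(labels, per_label):
    """Count how many samples of each label land in the test set."""
    counts = Counter()
    for label in labels:
        for i in range(per_label):
            # Hypothetical stand-in for the real per-file hash.
            fake_hash = f"{label}.{i}"
            if assign_split(fake_hash) == "test":
                counts[label] += 1
    return counts


labels = ["idle", "wave", "shake"]  # hypothetical label names

# Same input, same split: repeating the process changes nothing.
first = split_counts(labels, 20)
second = split_counts(labels, 20)
print("20 per label:", dict(first), "| repeatable:", first == second)

# With more data, each label's test share approaches the target fraction.
large = split_counts(labels, 2000)
print("2000 per label:", {l: large[l] / 2000 for l in labels})
```

Because each sample is assigned independently from its own hash, nothing forces exactly 4 test samples per label in a small dataset; the 20% share only holds on average.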