Issue with rebalancing dataset

I have a student who is experiencing issues with the rebalancing feature in a project:

Hi Shawn, I want to mention something I noticed. I collected all 4 categories of data, which had a total span of 13 mins 20 seconds (800 secs), as was suggested in the video. After I rebalanced the data set, in data acquisition, the training data was 11 mins 10 seconds and my test data was 5 mins 20 seconds. That total is more than the time I took to collect the data. Test data should have been 2 mins 10 seconds, right? Any issues here?

I have not personally witnessed this. Is there something that they did to cause this extra data to appear?

Hi @ShawnHymel,

Rebalancing the dataset is based on the number of samples (not the total length), so you can end up with more than 20% of the total time in the test set if you have samples with different lengths.
As for the total span, I suspect there were already samples in the test set before rebalancing. If you can share the project ID, I can have a deeper look.
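
To illustrate the count-versus-length point, here is a minimal sketch with made-up sample lengths (the numbers are hypothetical and not taken from any project). The split is exactly 80/20 by sample count, yet the test set holds far more than 20% of the total time because the two longest samples land in it.

```python
# Hypothetical sample lengths in seconds: eight short samples, two long ones.
lengths = [10, 10, 10, 10, 10, 10, 10, 10, 60, 60]

# Split 80/20 by number of samples; assume the last two end up in the test set.
n_test = round(0.2 * len(lengths))
train = lengths[:-n_test]
test = lengths[-n_test:]

test_share_by_count = len(test) / len(lengths)  # share of samples
test_share_by_time = sum(test) / sum(lengths)   # share of total duration

print(f"test share by count: {test_share_by_count:.0%}")
print(f"test share by time:  {test_share_by_time:.0%}")
```

Note that a split like this only moves samples between sets, so train plus test time always equals the total collected time. It cannot explain a total that grows.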



Thanks, @aurel! I’ve passed your response to the student. Here is their project that they made public: Any insights would be appreciated.

Thanks @ShawnHymel.
The rebalancing feature uses the hash of the samples’ filenames to have a deterministic process. With a small number of samples this can lead to a different split than 80/20. I’m checking with the team if it’s something we can fix, will keep you posted.
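
As a rough sketch of what a hash-based deterministic split can look like: the hash function (MD5 here), the bucket mapping, the helper name `assign_split`, and the filenames are all assumptions for illustration, not the actual implementation in the Studio.

```python
import hashlib

def assign_split(filename: str, test_fraction: float = 0.2) -> str:
    """Map a filename to 'train' or 'test' using a hash of its name.

    Hypothetical helper: the real rebalancing code may use a different
    hash and a different threshold rule.
    """
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    # Turn the first 4 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "test" if bucket < test_fraction else "train"

# Made-up filenames; with only a handful of samples the test share
# can land noticeably above or below 20%.
filenames = [f"sample_{i}.json" for i in range(10)]
splits = {name: assign_split(name) for name in filenames}
n_test = sum(1 for s in splits.values() if s == "test")
print(f"{n_test} of {len(filenames)} samples assigned to test")
```

The same filename always gets the same assignment, which is what makes the process deterministic. The 80/20 ratio only holds on average, so small datasets can drift from it.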


@aurel Any word back on why the training + test audio time after the rebalance would be greater than the total captured audio? I’m having another student with the same issue.

Hi @ShawnHymel,

I haven’t found an issue with the total captured audio increasing after rebalancing. Can you share the project ID? I can look directly into our buckets to check the total length and timestamps of the files.


@aurel the project ID is 3735. Any help is appreciated!


From what I see in project 3735, it is accelerometer data, not audio. Just want to make sure that you shared the right project :wink:

Apologies, as I thought the project was for audio. It is supposed to be for accelerometer data.