New features: automatically split data, duplicate detection in ingestion

janjongboom · July 15, 2020, 6:45am

Hi,

I didn’t feel this warranted a full blog post, but we’ve made some changes to the uploader that should be useful. They have been released as part of v1.7.2 of the CLI.

We’ve added a --category split option. If you set this, data is automatically split 80/20 between your training and testing sets. This is a deterministic process (a file always goes in the same category) because we look at the MD5 hash of the file. This should help with automatically building a balanced dataset from existing data.
Duplicate detection is now on by default. When a file already exists in your dataset, the ingestion service will now reject it. You can override this behavior in the CLI through --allow-duplicates. If you send data directly through the ingestion service, you’ll need to enable duplicate detection yourself by setting the x-disallow-duplicates header to 1.
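
To illustrate why hashing makes the --category split deterministic, here is a rough Python sketch. It is not the CLI’s actual code: the function name assign_category and the way the MD5 digest is mapped to a bucket are assumptions for illustration only. The point is just that the same file contents always land in the same category.

```python
import hashlib


def assign_category(file_bytes: bytes, test_percent: int = 20) -> str:
    """Pick 'training' or 'testing' from the MD5 hash of the file contents.

    Hypothetical helper: the real uploader may map the hash differently.
    """
    digest = hashlib.md5(file_bytes).digest()
    # Treat the 16-byte digest as one big integer and reduce it to a bucket 0..99.
    bucket = int.from_bytes(digest, "big") % 100
    # Buckets below test_percent go to testing, the rest to training.
    return "testing" if bucket < test_percent else "training"


if __name__ == "__main__":
    sample = b"example sensor data"
    # Calling it twice on the same bytes gives the same answer.
    print(assign_category(sample), assign_category(sample))
```

Because the category depends only on the file contents, re-uploading the same file never moves it between sets, and over many files the split approaches 80/20 rather than hitting it exactly.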

Hope this helps keep your dataset sane!