I didn’t feel this warranted a full blog post, but we have made some new changes in the uploader that should be useful for people. This has been released as part of v1.7.2 of the CLI.
- We’ve added a --category split option. If you set this, data is automatically split 80/20 between your training and testing sets. The process is deterministic (a file always goes in the same category) because we look at the MD5 hash of the file. This should help with automatically building a balanced dataset from existing data.
- Duplicate detection is now on by default. When a file already exists in your dataset, the ingestion service will now reject it. You can override this behavior through --allow-duplicates. If you send data directly through the ingestion service you’ll need to enable this behavior by setting the
Hope this helps keep your dataset sane!
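
For anyone curious how a hash-based split can be deterministic, here is a rough Python sketch. It is not the CLI's actual code: the function name `assign_category` and the way the MD5 digest is mapped to a bucket are my own assumptions, purely for illustration. The idea is that the same file contents always produce the same hash, so the file always lands in the same category.

```python
import hashlib


def assign_category(data: bytes, training_share: float = 0.8) -> str:
    """Pick 'training' or 'testing' from the MD5 hash of the file contents.

    Hypothetical helper: the real uploader may map the hash differently.
    """
    digest = hashlib.md5(data).digest()
    # Turn the first 4 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "training" if bucket < training_share else "testing"


if __name__ == "__main__":
    sample = b"example sensor data"
    # Same bytes in, same category out, no matter how often you run it.
    print(assign_category(sample))
    print(assign_category(sample))
```

Because the category depends only on the file contents, re-uploading the same file never moves it between sets, and over many files the split comes out close to 80/20.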