Perhaps a feature where a member can hand out a token granting others temporary, guided access to a specific project of theirs so they can contribute data, e.g. upload and label images, or contribute labeled audio samples, to build broader datasets of words and images not found in the available datasets. For instance, a member would ask for help on a social media site with the word “ten” and post a token which, when used, would bring non-members to that member’s project to contribute data by scanning in with their phone, but they wouldn’t have access to any other features. In my opinion it would make it easier to get help building datasets, and it would also give your site a lot of free advertising on social media. Just a thought.
Hi @timlester, thanks for the suggestion! I’d be a bit wary of spam, but it actually would be a pretty cool way to build open datasets. Maybe we should separate this from projects and have some ‘open dataset’ feature that people can contribute to?
It would benefit the community more that way, I’m sure. A member would check the dataset and, if they can’t find a category, help build it up. Perhaps you would have to run large datasets against a pretrained model to help find mislabeled data and curate to a degree by tagging questionable samples for human verification.
Either way you go, a little CAPTCHA window should help fight the spam while opening up the site to non-members.
Yeah, not so worried about the CAPTCHA, but how would you verify that what someone is uploading is actually what they say they’re uploading?
Sounds like a job for machine learning…lol. Wouldn’t it be possible, once a dataset category reaches a minimum number of human-verified entries, to use those entries to build a model that checks new entries and flags questionable ones for human verification? You’re basically doing this already when you run a test dataset against a trained model on your site. A high score from a decent-performing model gets a free pass to production; lower scores have to be verified. The model would ultimately contribute to its own improvement.
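The verification gate you’re describing could be really simple. A minimal sketch, assuming a model that returns a predicted label plus a confidence score (the `triage` function and the 0.90 threshold are my own illustrative choices, not anything Edge Impulse provides):

```python
# Hypothetical confidence gate for community uploads: the model,
# trained on human-verified entries, scores each new upload. Only
# uploads where the prediction agrees with the claimed label at high
# confidence are auto-accepted; everything else goes to a human.

ACCEPT_THRESHOLD = 0.90  # assumed cutoff, would need tuning per category

def triage(claimed_label, predicted_label, confidence,
           threshold=ACCEPT_THRESHOLD):
    """Accept an upload only if the model agrees with the claimed
    label at high confidence; otherwise flag it for human review."""
    if predicted_label == claimed_label and confidence >= threshold:
        return "accept"
    return "flag_for_review"

print(triage("ten", "ten", 0.97))  # accept
print(triage("ten", "six", 0.99))  # flag: label mismatch
print(triage("ten", "ten", 0.55))  # flag: low confidence
```

The key design choice is that a confident but *disagreeing* prediction is still just flagged, never auto-rejected, since the model itself might be wrong on rare classes.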
Makes me wonder what an automated system would do if it had a constant stream of uploads and, every 500 new entries, it automatically ran a training cycle and used that new knowledge to accept or reject further uploads. Would it tend toward a more intelligent model, or a less intelligent one? The most important part would definitely be the first training cycle. Just thinking out loud.
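That feedback loop could be sketched roughly like this. Everything here is hypothetical: `_train` is a toy stand-in (it just remembers which labels it has seen) rather than a real training pipeline, and the acceptance rule is deliberately crude to show the risk you’re hinting at:

```python
# Toy sketch of the "retrain every 500 accepted uploads" loop.
# Note the built-in failure mode: because the model only ever
# retrains on data it already accepted, it can never learn a label
# it didn't see in the seed data, so the quality of the first
# (human-verified) training cycle dominates everything after it.

RETRAIN_INTERVAL = 500  # assumed batch size from the post

class SelfCuratingDataset:
    def __init__(self, seed_data):
        # seed_data: list of (sample, label) pairs verified by humans.
        self.accepted = list(seed_data)
        self.pending_since_train = 0
        self.model = self._train(self.accepted)

    def _train(self, data):
        # Stand-in for real training: the "model" is just the set of
        # labels present in the accepted data so far.
        return {label for _, label in data}

    def submit(self, sample, label):
        # Accept only samples whose label the current model "knows";
        # a real system would gate on prediction confidence instead.
        if label in self.model:
            self.accepted.append((sample, label))
            self.pending_since_train += 1
            if self.pending_since_train >= RETRAIN_INTERVAL:
                self.model = self._train(self.accepted)
                self.pending_since_train = 0
            return True
        return False  # rejected / flagged for a human
```

So the loop tends toward a *narrower* model unless flagged-but-human-approved samples are fed back in, which is probably the answer to your question: human verification has to stay in the loop for the model to keep getting smarter.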
If a smart database that uses incrementally trained models to curate its own datasets is improbable, then you could also go the member-project route: the member who initiated the call for data would curate it themselves, and afterwards, as the price of the service, the newly collected data goes into the public/open dataset and the member becomes responsible for its integrity.
@timlester definitely really good thinking points. We’re thinking about how to add community features around Edge Impulse at the moment, one of them could be read-only projects so you can share not just datasets but also model architecture etc. Then from there have a way to contribute to the project (like a pull request but for data / model changes or something?) etc. Will take your ideas into account when designing this!