Question/Issue:
Hello Edge Impulse Team,
I’m encountering an issue with duplicate data entries being added to the training dataset when retraining an Impulse via a pipeline. Here’s a detailed breakdown of the behavior:
Steps to Reproduce:
- Initial Training:
- Created an Impulse and performed initial training using a CSV file with 15 rows.
- All 15 rows were successfully imported into the project’s dataset.
- First Retrain via Pipeline:
- Used a new CSV file with 5 entirely new rows.
- All 5 rows were successfully imported.
- Second Retrain via Pipeline:
- Reused the same 5-row CSV file from the previous step.
- No data was imported, and the system correctly identified that the data was already present.
- Third Retrain via Pipeline:
- Used a CSV file with 10 rows: 5 rows from the previous dataset (already imported) and 5 new rows.
- All 10 rows were imported, resulting in 5 duplicate entries in the dataset.
Summary of Observed Behavior:
- Impulse correctly skips importing data if the entire CSV has already been imported.
- Impulse imports all rows if the CSV contains partially new data, even if some rows are duplicates.
- This leads to duplicate entries in the project’s dataset.
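To make the pattern concrete, the behavior above looks like an all-or-nothing duplicate check: the import is skipped only when *every* row is already present. This is a minimal sketch of what we seem to be observing — the function and its logic are my guess at the behavior, not Edge Impulse's actual implementation:

```python
def import_csv(existing_rows, csv_rows):
    """Model of the observed import behavior (assumption, not the real code).

    If every incoming row already exists, nothing is imported.
    Otherwise ALL rows are imported, including the duplicates.
    """
    if all(row in existing_rows for row in csv_rows):
        return []  # second retrain: identical CSV, correctly skipped
    return list(csv_rows)  # third retrain: mixed CSV, duplicates slip in
```

Under this model, the third retrain (5 old rows + 5 new rows) imports all 10 rows, matching what we see in the project’s dataset.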
Expected Outcome:
Impulse should deduplicate data during import, even when the CSV contains a mix of new and previously imported rows.
Could you please confirm whether this is expected behavior? If it is, is there a recommended way to prevent duplicates during retraining via pipeline?