Question/Issue:
Hello Edge Impulse Team,
I’m encountering an issue with duplicate data entries being added to the training dataset when retraining an Impulse via a pipeline. Here’s a detailed breakdown of the behavior:
Steps to Reproduce:
- Initial Training:
- Created an Impulse and performed initial training using a CSV file with 15 rows.
- All 15 rows were successfully imported into the project’s dataset.
- First Retrain via Pipeline:
- Used a new CSV file with 5 entirely new rows.
- All 5 rows were successfully imported.
- Second Retrain via Pipeline:
- Reused the same 5-row CSV file from the previous step.
- No data was imported, and the system correctly identified that the data was already present.
- Third Retrain via Pipeline:
- Used a CSV file with 10 rows: 5 rows from the previous dataset (already imported) and 5 new rows.
- All 10 rows were imported, resulting in 5 duplicate entries in the dataset.
Summary of Observed Behavior:
- Impulse correctly skips importing data if the entire CSV has already been imported.
- Impulse imports all rows if the CSV contains partially new data, even if some rows are duplicates.
- This leads to duplicate entries in the project’s dataset.
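To make the pattern concrete, the behavior above looks like an all-or-nothing duplicate check: the import is skipped only when *every* row is already present. This is a minimal sketch of what we seem to be observing — the function and its logic are my guess at the behavior, not Edge Impulse's actual implementation:

```python
def import_csv(existing_rows, csv_rows):
    """Model of the observed import behavior (assumption, not the real code).

    If every incoming row already exists, nothing is imported.
    Otherwise ALL rows are imported, including the duplicates.
    """
    if all(row in existing_rows for row in csv_rows):
        return []  # second retrain: identical CSV, correctly skipped
    return list(csv_rows)  # third retrain: mixed CSV, duplicates slip in
```

Under this model, the third retrain (5 old rows + 5 new rows) imports all 10 rows, matching what we see in the project’s dataset.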
Expected Outcome:
Impulse should deduplicate data during import, even when the CSV contains a mix of new and previously imported rows.
Could you please confirm whether this is expected behavior? If it is, is there a recommended way to prevent duplicates during retraining via pipeline?