Duplicate Data Entries in Impulse Training Dataset During Retraining via Pipeline

Question/Issue:
Hello Edge Impulse Team,

I’m encountering an issue with duplicate data entries being added to the training dataset when retraining an Impulse via pipeline. Here’s a detailed breakdown of the behavior:

Steps to Reproduce:

  1. Initial Training:
  • Created an Impulse and performed initial training using a CSV file with 15 rows.
  • All 15 rows were successfully imported into the project’s dataset.
  2. First Retrain via Pipeline:
  • Used a new CSV file with 5 entirely new rows.
  • All 5 rows were successfully imported.
  3. Second Retrain via Pipeline:
  • Reused the same 5-row CSV file from the previous step.
  • No data was imported, and the system correctly identified that the data was already present.
  4. Third Retrain via Pipeline:
  • Used a CSV file with 10 rows: 5 rows from the previous dataset (already imported) and 5 new rows.
  • All 10 rows were imported, resulting in 5 duplicate entries in the dataset.

Summary of Observed Behavior:

  • Impulse correctly skips importing data if the entire CSV has already been imported.
  • Impulse imports all rows if the CSV contains partially new data, even if some rows are duplicates.
  • This leads to duplicate entries in the project’s dataset.

Expected Outcome:
Impulse should deduplicate data during import, even when the CSV contains a mix of new and previously imported rows.
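
To make the expected row-level behavior concrete, here is a minimal sketch of how I could filter a CSV on my side before handing it to the pipeline. It is only a possible stop-gap, not Edge Impulse's import logic. The helper names (`row_key`, `filter_new_rows`) and the `seen_keys` set are hypothetical; it assumes the CSV has a header row, that each row is an independent sample, and that the set of fingerprints is persisted somewhere between pipeline runs.

```python
import csv
import hashlib
import io


def row_key(row):
    """Return a stable fingerprint for one CSV row (a list of cell strings)."""
    # Join cells with a separator unlikely to appear in the data (assumption).
    joined = "\x1f".join(cell.strip() for cell in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def filter_new_rows(csv_text, seen_keys):
    """Return (header, rows) keeping only rows whose fingerprint is not in seen_keys.

    seen_keys is a set of fingerprints from earlier uploads; it is updated
    in place so the next call also skips the rows returned here.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)  # assumes the first line is a header
    new_rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        key = row_key(row)
        if key not in seen_keys:
            seen_keys.add(key)
            new_rows.append(row)
    return header, new_rows
```

With this, the 10-row file from the third retrain would be reduced to its 5 new rows before upload, provided `seen_keys` already holds the fingerprints of the earlier 5 rows.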

Could you please confirm whether this is expected behavior? If it is, is there a recommended way to prevent duplicates during retraining via pipeline?

Hi @ramdineshjp

That does indeed look like a bug. Thanks for highlighting this; let me log a ticket for this behaviour.

Best

Eoin

Hi @ramdineshjp,

I've logged issue 13651 internally for our Studio team, with a note linking to this post. When it gets resolved, our team should update here.

Best

Eoin

Hi @Eoin, Thanks for the update.