Upload dataset in CSV file to Edge Impulse

Hi @AashiDutt,

For now, the best way is to convert your CSV to JSON by following the data acquisition format documentation. Regarding the API key, can you check that it was copied correctly? It should be 68 characters long.

I will also add the CSV import feature to our backlog.

Aurelien

Hi Aurel,
Thanks for your guidance, but I seem to have run into another error. Could you suggest whether there is a problem in the conversion or somewhere else?
Error: Failed to upload 6ip6n-5u4rm.json Missing protected header

I tried with numerous JSON files but got the same error.

Thanks

@AashiDutt we need some metadata about the device (in the protected header). Here’s an example in Python of converting a CSV file:

test.csv

timestamp,accX,accY,accZ
1596181208106,9.81,0.11,0.13
1596181208122,9.71,0.14,-0.27
1596181208138,9.83,0.12,0.01

convert.py

import csv, json, math, hmac, hashlib

header = None
# timestamps of the first two data rows; used for "iat" and the sampling interval
begin_ts = None
next_ts = None
values = []

HMAC_KEY = "fed53116f20684c067774ebf9e7bcbdc"

# Parse the CSV file
with open('test.csv', newline='') as csvfile:
    rows = csv.reader(csvfile, delimiter=',')
    for row in rows:
        # the first row holds the column names
        if header is None:
            header = row
            continue

        # compare against None so that a timestamp of 0 is handled correctly
        if begin_ts is None:
            begin_ts = int(row[0])
        elif next_ts is None:
            next_ts = int(row[0])

        # skip over timestamp column, and add the rest
        values.append([ float(x) for x in row[1:] ])

# empty signature (all zeros). HS256 gives 32 byte signature, and we encode in hex, so we need 64 characters here
emptySignature = ''.join(['0'] * 64)

# This is the Edge Impulse Data Acquisition Format, it has the protected header
data = {
    "protected": {
        "ver": "v1",
        "alg": "HS256",
        "iat": math.floor(begin_ts / 1000) # epoch time, seconds since 1970 (the timestamp earlier was in ms.)
    },
    "signature": emptySignature,
    "payload": {
        "device_type": "CSV_IMPORTER",
        "interval_ms": next_ts - begin_ts,
        "sensors": [ { "name": x, "units": "m/s2" } for x in header[1:] ],
        "values": values
    }
}

# encode in JSON
encoded = json.dumps(data)

# sign message
signature = hmac.new(bytes(HMAC_KEY, 'utf-8'), msg = encoded.encode('utf-8'), digestmod = hashlib.sha256).hexdigest()

# set the signature again in the message, and encode again
data['signature'] = signature
encoded = json.dumps(data)

print(encoded)
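
Before uploading, it can help to check locally that a generated file has the three top-level sections and that its signature matches. The sketch below mirrors the signing steps of convert.py (HMAC-SHA256 over the JSON encoded with an all-zero signature). The helper names `expected_signature` and `check_document` are made up for this example, and the check only proves the file is self-consistent under Python's `json.dumps`; it is not a statement about how the ingestion service validates files.

```python
import hashlib
import hmac
import json

EMPTY_SIGNATURE = "0" * 64  # same all-zero placeholder as in convert.py


def expected_signature(doc, hmac_key):
    """Recompute the signature the way convert.py produces it."""
    # Replacing an existing key keeps its position, so the key order is unchanged.
    unsigned = dict(doc, signature=EMPTY_SIGNATURE)
    encoded = json.dumps(unsigned)
    return hmac.new(hmac_key.encode("utf-8"), encoded.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def check_document(doc, hmac_key):
    """Return a list of problems found in doc (an empty list means none found)."""
    problems = [f"missing {key}" for key in ("protected", "signature", "payload")
                if key not in doc]
    if problems:
        return problems
    if doc["signature"] != expected_signature(doc, hmac_key):
        problems.append("signature mismatch")
    return problems


HMAC_KEY = "fed53116f20684c067774ebf9e7bcbdc"  # example key from convert.py

doc = {
    "protected": {"ver": "v1", "alg": "HS256", "iat": 1596181208},
    "signature": EMPTY_SIGNATURE,
    "payload": {
        "device_type": "CSV_IMPORTER",
        "interval_ms": 16,
        "sensors": [{"name": "accX", "units": "m/s2"}],
        "values": [[9.81], [9.71]],
    },
}
doc["signature"] = expected_signature(doc, HMAC_KEY)
print(check_document(doc, HMAC_KEY))
```

A file that contains only the payload would be reported as missing `protected` and `signature`, which is the situation the "Missing protected header" error above points to.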

Thanks🙂, will give it a try.

@janjongboom

Any word on importing CSV directly into the Edge Impulse dashboard, without needing to convert the CSV file into a JSON file?

Since we are on the dashboard, all the signing of keys should not be very relevant. CSV import is really important for error testing, but the conversion to JSON is very error-prone.

What about drag and drop into a textarea in the dashboard? Probably not a good idea for huge files, but it would be very nice for quick data checking with smaller datasets.

@Rocksetta The drag&drop is not hard, but the lack of any defined structure in the CSV files is. We’d need to have information about sensors (axes name and units), frequency and preferably also device information in a specific format already in the CSV so it’ll require preprocessing anyway. In that case the structure we can get from either JSON or CBOR format is much better. You can skip over the signing and just set the signature to 0 if you don’t care about this in the JSON files.

@janjongboom I’d also like to see a way to import a CSV directly. From working with Keras, I save a lot of raw captured data as CSV (and it lets me view things in Excel, too). I’m wondering if there’s a way to capture the needed metadata (device info, axes, units, etc.) either in the CSV file (e.g. as the first row) or through some kind of GUI during the uploading process.

For now, the converter script definitely helps. If I’m making a data curation or analysis script/notebook, I can add that at the end and let it spit out JSON for me, which can be uploaded to Edge Impulse.
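
As a rough sketch of that idea, the function below wraps samples that are already in memory (for example at the end of a notebook) in the same structure convert.py builds, and leaves the signature as all zeros, as suggested earlier in the thread for cases where signing does not matter. The function name `build_document` and the device type `NOTEBOOK_EXPORT` are placeholders invented for this example. The `alg` value is copied from convert.py; the thread does not say whether it should change for unsigned files, so check the data acquisition format documentation.

```python
import json
import time

EMPTY_SIGNATURE = "0" * 64  # all-zero placeholder, as in convert.py


def build_document(axis_names, rows, interval_ms, units="m/s2",
                   device_type="NOTEBOOK_EXPORT", iat=None):
    """Wrap in-memory samples in the structure used by convert.py, left unsigned."""
    if any(len(row) != len(axis_names) for row in rows):
        raise ValueError("every row needs one value per axis")
    return {
        "protected": {
            "ver": "v1",
            "alg": "HS256",  # copied from convert.py; an assumption for unsigned files
            "iat": int(time.time()) if iat is None else iat,
        },
        "signature": EMPTY_SIGNATURE,
        "payload": {
            "device_type": device_type,  # hypothetical name, pick your own
            "interval_ms": interval_ms,
            "sensors": [{"name": name, "units": units} for name in axis_names],
            "values": [[float(v) for v in row] for row in rows],
        },
    }


doc = build_document(["accX", "accY", "accZ"],
                     [[9.81, 0.11, 0.13], [9.71, 0.14, -0.27]],
                     interval_ms=16, iat=1596181208)
print(json.dumps(doc, indent=2))
```

Writing the result of `json.dumps(doc)` to a `.json` file gives something that can be passed to the uploader in the same way as the output of convert.py.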

@janjongboom Headings are a fairly standard CSV method for passing labels. See iris

I guess another approach would be for my Arduino to print a properly formatted JSON file over serial. Does anyone have a base example of what the JSON would look like?

@Rocksetta, I understand, but this is not enough data to import into Edge Impulse at the moment. It’s just a bunch of features but this is not raw sensor data. FYI, here under ‘Examples’ there is code in both Python and Node.js to format your data: https://docs.edgeimpulse.com/reference#data-acquisition-format

Note that on Arduino you don’t need to do this by hand. Just print data out over UART and use the Data forwarder to wrap it in a package which we understand.

@ShawnHymel The overhead is pretty minimal with the JSON structure that we have (you can even skip the signature if you don’t care about it), so my feeling is we can stick to that comfortably but I’m open to suggestions if you have a way of encapsulating this more properly in CSV.

Thanks Jan. I didn’t see the Node.js tab in the examples. That makes sense now. It does diverge from my plan of a quick and easy CSV upload to see what happens on the Edge Impulse dashboard side with remote upload, before running remote upload from the Arduino.

So for this situation I think I will just try the remote upload directly from the Arduino.

I would still have on the back burner the ability for the dashboard to allow drag and drop CSV uploading as it would be useful for error checking.

Here is an example of converting a textarea to useable data

https://www.rocksetta.com/tensorflowjs/beginner-examples/tfjs08-knn.html

The key bits of code are below. (It took me a while to figure out how to separate the CSV data.)


<input type="button" value="Seperate Data and Labels" onclick="{
   document.getElementById('myArea01').value = ''
   document.getElementById('myText02').value = ''
   document.getElementById('myText03').value = ''
   document.getElementById('myLabelsSpot').innerHTML = ''

   document.getElementById('myArea02').value = myRemoveLineBreak(document.getElementById('myArea02').value)

   // alert('not yet ready\nUsing the testing data below')
   // myIncoming = new Array()
   myIncoming = document.getElementById('myArea02').value.split(/\r\n|\n/)
   // alert(myIncoming.length)

   myLabelsTemp = new Array(myIncoming.length)
   myInSplit = new Array(myIncoming.length)

   for (var t = 0; t <= myIncoming.length - 1; t++) {
      myInSplit[t] = myIncoming[t].split(',')
   }
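
For comparison, the same line-then-comma splitting can be sketched in Python. This is not a port of the linked page: the helper names are invented here, and treating the last column as the label is an assumption (it matches the layout of the iris dataset mentioned above, but may not match other files).

```python
def split_rows(text):
    """Split pasted CSV text into rows of stripped cells, skipping blank lines."""
    rows = []
    for line in text.splitlines():  # handles both \n and \r\n line endings
        if line.strip():
            rows.append([cell.strip() for cell in line.split(",")])
    return rows


def separate_features_and_labels(rows):
    """Assumes the last column is the label and all other columns are numeric."""
    features = [[float(cell) for cell in row[:-1]] for row in rows]
    labels = [row[-1] for row in rows]
    return features, labels


pasted = "5.1,3.5,1.4,0.2,setosa\r\n\r\n7.0,3.2,4.7,1.4,versicolor\n"
features, labels = separate_features_and_labels(split_rows(pasted))
print(features)
print(labels)
```

A plain `split(",")` is enough for simple numeric data like this; files with quoted fields that contain commas would need the `csv` module instead.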


We now natively support importing CSV files: https://docs.edgeimpulse.com/reference#importing-csv-data

@janjongboom is there any reason that the CSV upload expects a timestamp column and at least 2 rows of data with increasing timestamps? Any chance it could just accept a single row of labels and a single row of data?

This doesn’t work:
filename is the label : myData1.csv

W,X,Y,Z
345,123,456,789

This works fine:
filename is the label : myData2.csv

timestamp,W,X,Y,Z
11111,345,123,456,789
11112,345,123,456,789

Lots of machine learning programs don’t need time-series data. A single row of data can be associated with a label.
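
Where the uploader does insist on a timestamp column and two rows, a file like myData1.csv can be turned into the myData2.csv shape with a few lines of Python. This is only a workaround sketch: the function name and its parameters are invented for this example, the timestamps are synthetic, and repeating the last row simply mirrors what myData2.csv does above.

```python
import csv
import io


def add_timestamp_column(csv_text, start_ms=0, interval_ms=1, min_rows=2):
    """Prepend a synthetic, increasing timestamp column to headed CSV text."""
    header, *rows = list(csv.reader(io.StringIO(csv_text.strip())))
    # Repeat the last data row until there are at least min_rows rows.
    while rows and len(rows) < min_rows:
        rows.append(list(rows[-1]))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp"] + header)
    for index, row in enumerate(rows):
        writer.writerow([start_ms + index * interval_ms] + row)
    return out.getvalue()


print(add_timestamp_column("W,X,Y,Z\n345,123,456,789\n", start_ms=11111))
```

With `start_ms=11111` the output has the same header and rows as myData2.csv.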

Hi @Rocksetta, the reason we did it like this is that we specialize in time-series data, and not so much in other data. But I see your use case, and will add some code to handle it.

edit: This will be fixed in the next patch release, sometime this week.

I appreciate the error messages sent by the Edge Impulse data uploader. It was very easy to debug. Perhaps the errors for a missing timestamp and for fewer than 2 lines of data could be converted to warnings.

@Rocksetta we’ll properly fix it, so you can just upload in the format you mentioned earlier 🙂

@Rocksetta this is now fixed (both in Studio and in CLI v1.13.14).

Awesome, works great using the Portenta with the Vision Shield SD card. I had 14 seconds between samples. Now that the information is on Edge Impulse and I know when the data was sampled, I can edit the labels as needed.

My Arduino sketch used is on my portenta-pro-community-solutions library here.

For acceleration, do the column names have to be timestamp - accX - accY - accZ? When I uploaded my dataset in timestamp - X - Y - Z format, EI_CLASSIFIER_SENSOR was set to EI_CLASSIFIER_SENSOR_FUSION instead of EI_CLASSIFIER_SENSOR_ACCELEROMETER, resulting in an “Invalid model for current sensor” error. If the column names are not the issue, please help me with my next step.

Hi @f20190094,

The column names are likely the issue. Can you please rename them to accX, accY, accZ?

Thanks!
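
If the dataset has many files, the header can be rewritten with a short script instead of by hand. The sketch below only changes the first row. The mapping from X, Y, Z to accX, accY, accZ is an assumption about the original column names, and `rename_columns` is a name made up for this example.

```python
import csv
import io

# Assumed original names on the left, the names suggested above on the right.
RENAMES = {"X": "accX", "Y": "accY", "Z": "accZ"}


def rename_columns(csv_text, renames=RENAMES):
    """Return csv_text with header names replaced according to renames."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    # Only the header row is touched; unknown names are kept unchanged.
    rows[0] = [renames.get(name.strip(), name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


print(rename_columns("timestamp,X,Y,Z\n0,1.0,2.0,3.0\n"))
```

After renaming, the files need to be uploaded again so that the project picks up the new axis names.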

OK, thanks. I will try that.