Hi, I am looking for help understanding some of the parameters of the MFCC processing block.
Here are the parameters I chose in this example.
First of all, let me explain what I have understood so far:
- First, each sample of audio data fed to the block is 0.7 s long (i.e. 11200 samples at a 16 kHz sample rate).
- Then the 0.7 s sample is sliced into 35 audio frames (0.7 / 0.02 = 35), each 0.02 s long, i.e. 320 samples (0.02 * 16000).
- Then each frame is windowed and the pre-emphasis is applied.
- Then the FFT and the rest of the processing are applied.
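To make the arithmetic above concrete, here is a minimal sketch of how I am counting samples and frames. It assumes the frame stride equals the frame length (0.02 s), so frames do not overlap. The variable names are my own and not taken from the block.

```python
# Frame arithmetic as I understand it (variable names are hypothetical).
sample_rate_hz = 16_000
window_s = 0.7          # length of one audio sample fed to the block
frame_length_s = 0.02   # length of one frame
frame_stride_s = 0.02   # ASSUMPTION: stride equals frame length (no overlap)

# round() guards against floating-point error in the products
samples_per_window = round(window_s * sample_rate_hz)
samples_per_frame = round(frame_length_s * sample_rate_hz)
stride_samples = round(frame_stride_s * sample_rate_hz)

# Number of full frames that fit in the window
num_frames = 1 + (samples_per_window - samples_per_frame) // stride_samples

print(samples_per_window, samples_per_frame, num_frames)
```

With a shorter stride the frames would overlap and the frame count would be higher than 35, so the stride value matters for this calculation.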
From this starting point, I have a few questions:
- First, what kind of window is applied to each audio frame? Is it a Hamming window?
- Here the frame length in samples (320) is larger than the FFT length (256). What does that mean? Is the end, the beginning, or both chopped off before the FFT is performed?
- What does the shift value represent in the pre-emphasis section?
- For the filtering, which formula is used: HTK, Slaney, or something else?
- Is the signal normalized at any point before being fed to the neural network?
Thank you very much for any answer!
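To illustrate the question about the frame length versus the FFT length, here is a small sketch of the three truncation options I can imagine. This is only my guess at the possibilities and not a description of what the block actually does. The names `keep_start`, `keep_end`, and `keep_center` are hypothetical.

```python
# Three hypothetical ways to fit a 320-sample frame into a 256-point FFT.
frame = list(range(320))   # stand-in for one frame; values are sample indices
fft_length = 256
excess = len(frame) - fft_length   # samples that cannot fit in the FFT

keep_start = frame[:fft_length]    # chop the end
keep_end = frame[excess:]          # chop the beginning
half = excess // 2
keep_center = frame[half:half + fft_length]  # chop both sides equally

print(excess, keep_start[0], keep_end[0], keep_center[0], keep_center[-1])
```

In every case 64 of the 320 samples are discarded, which is why I would like to know which option (if any) applies.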