I’m wondering if it’s possible to get a pure spectrogram out of the audio processing block rather than the MFCCs. While I know the MFCCs are better for speech, I’ve had some luck (with reduced accuracy) using a 2D CNN on just a spectrogram, where I average every 3-5 bins together. I find that the DCT takes a good chunk of processing power, so I was wondering if there’s a way we can avoid using it.
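To illustrate what I mean, here's a rough numpy sketch of the kind of feature I've been using: a log magnitude spectrogram with adjacent frequency bins averaged together, and no DCT anywhere (function name, frame sizes, and pooling factor are just examples, not anything from the existing block):

```python
import numpy as np

def log_spectrogram_pooled(audio, frame_len=256, hop=128, pool=4):
    """Magnitude STFT -> log spectrogram, with every `pool`
    adjacent frequency bins averaged together (no DCT step)."""
    # Frame the signal and apply a Hann window
    n_frames = 1 + (len(audio) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([audio[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame: (n_frames, frame_len//2 + 1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # Trim leftover bins so the count divides evenly, then average groups
    n_bins = (spec.shape[1] // pool) * pool
    pooled = spec[:, :n_bins].reshape(n_frames, -1, pool).mean(axis=2)
    return np.log(pooled + 1e-10)  # log scale, small offset to avoid log(0)

# 1 s of noise at 16 kHz -> (124 frames, 32 pooled bins)
feat = log_spectrogram_pooled(np.random.randn(16000))
```

The bin pooling is what keeps the feature count manageable for a small 2D CNN without paying for the DCT.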
Additionally, is there a way to get the audio processing block to output a 2D array? Even if it did, anyone who still wanted a 1D CNN could just put a “Flatten” layer first in the NN step, which (I assume) would reproduce the current 1D input. I’d love to play around with 1D vs. 2D CNNs for audio processing to see whether that helps with accuracy or speed at all.
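Just to spell out why I think the Flatten trick is lossless: the 1D and 2D views of the features hold identical values, only the shape differs. A tiny numpy demonstration (the 124×32 shape is just a made-up example):

```python
import numpy as np

# Hypothetical 2D feature output: 124 time frames x 32 frequency bins
spec2d = np.random.randn(124, 32)

# What a 1D pipeline would see after a Flatten layer: one long vector
spec1d = spec2d.flatten()            # shape (3968,)

# The 2D view is recovered by reshaping; no information is lost
restored = spec1d.reshape(124, 32)
assert np.array_equal(spec2d, restored)
```

So a 2D output would be a strict superset of the current behavior, as far as I can tell.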
If there’s no current way to do this, please consider this a feature request. In the meantime, I’ll start learning how to make my own custom processing blocks.