Using CNN + RNN (hybrid models) in Edge Impulse for audio on ESP32

I am working on an embedded audio project (wake word / voice-related task) targeting the ESP32. I’m exploring whether it is possible to implement or approximate a hybrid architecture combining CNNs and RNNs (e.g., CNN + LSTM/GRU) using Edge Impulse, while staying within MCU constraints.

Details:
In audio applications, CNNs are commonly used for feature extraction from spectrograms or MFCCs, while RNNs are often used to capture temporal dependencies.
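To make the question concrete, below is a minimal Keras sketch of the kind of model I have in mind — roughly what I imagine would go into Edge Impulse's expert-mode Keras editor, if that is the right place for it (part of what I'm asking). The input shape, layer sizes, and class count are illustrative placeholders, not values from my actual project:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed (placeholder) feature shape: e.g. 49 MFCC frames x 13 coefficients
# from a 1 s audio window. Adjust to match the actual DSP block output.
TIME_FRAMES = 49
N_MFCC = 13
N_CLASSES = 2  # e.g. "wake word" vs. "noise"

model = models.Sequential([
    layers.Input(shape=(TIME_FRAMES, N_MFCC)),
    # CNN front-end: 1D convolutions over the time axis extract local
    # spectral patterns from short groups of frames.
    layers.Conv1D(16, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    # RNN back-end: a small GRU models how those patterns evolve over time.
    layers.GRU(32, return_sequences=False),
    layers.Dropout(0.25),
    layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The intent is to keep both the convolutional front-end and the GRU small enough to fit ESP32 RAM/flash constraints, but I'm unsure how well recurrent layers survive int8 quantization and deployment to this target, which is why I'm asking before committing to this structure.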

A practical reference is WakeNet from Espressif, which performs very well on the ESP32 for wake-word detection. However, its internal architecture is not publicly documented. Based on its behavior and performance, it seems plausible that it uses some form of hybrid temporal-spatial modeling, potentially combining CNN-like layers with temporal mechanisms (RNNs or alternatives).

Does Edge Impulse currently provide a built-in feature or supported workflow for hybrid CNN + RNN architectures (e.g., a CNN front-end followed by an LSTM/GRU) in audio models targeting MCUs such as the ESP32? If not, is there an officially recommended alternative for achieving similar temporal modeling?