Make deep learning models run fast on embedded hardware

There are huge benefits to running deep learning models “at the edge”, on hardware that is connected directly to sensors.

This is a companion discussion topic for the original entry at

I think this article is rather silly. That STM32 processor is very new and very large compared to most embedded processors.

Actually, it is extremely large and power-hungry compared to most IoT processors.

SiLabs is better for low power per MHz and still has nowhere near the processing power needed.

I think the real issue here is that people are trying to develop ML algorithms only on larger processors with tons of RAM. If you think in terms of the cellular automata used in Wolfram, that would be a much better solution, although it is still going to be extremely power-hungry for any IoT device.

Hi @wher0001, this ST target is not the only thing we support :slight_smile: Support includes low-power SiLabs and Nordic silicon, and deployment on anything from 16-bit MCUs to Cortex-M7 depending on the workload you need. For basic things like machine monitoring you can get pretty far with a high-end 16-bit MCU or low-end Cortex-M0+. Here are some real-life performance metrics with time per inference and RAM usage over some MCUs:

> And very large compared to most embedded processors.

This is changing quickly though: there are 17 billion (!) Cortex-M MCUs shipping every year. Virtually all of them are able to run some form of ML. Here are some of my own takes on that:

Do those numbers include RX/TX over Bluetooth Smart, BT mesh, WiFi, or any other protocol used to get the data out wirelessly?

You have to be able to transmit something or the device is useless.

What HW resources, such as clocks, GPIO, and ADC, are being utilized during these tests to actually get the data to the algorithms? What is their usage?

You have to be able to sample the data to input something into the algorithms.

I’ve been working in the embedded world since we had 64 bytes of RAM. I’ve worked on multiple extremely low-power, battery-operated devices.

The ML algorithms are easy to code up. It’s another thing to have them running on a product that is already RAM and power constrained but still requires signal processing to acquire the data and wireless means to output the data.

Do you have any real-world scenarios like this analyzed?

Sure, but that’s left out here because you need to do that anyway for your control loop. The amount of power you use to get data from the sensor, or off to the network, won’t change just because the control loop runs an ML algorithm. In the end it’s all math, whether you’ve hand-coded it in MATLAB or it’s generated code that runs an ML algorithm (it’s vector multiplications all the way down).
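As a rough illustration of that budget argument, here is a minimal sketch in Python. Every number in it is a made-up placeholder, not a measurement from any real board, and the function name `cycle_energy_uj` is hypothetical. It only shows that when sampling and radio costs are fixed, the difference between two implementations comes down to compute time.

```python
def cycle_energy_uj(sample_uj, compute_ms, active_mw, radio_uj):
    """Energy of one sense-compute-transmit cycle, in microjoules.

    Compute energy is active power (mW) times compute time (ms),
    which is already in microjoules.
    """
    return sample_uj + active_mw * compute_ms + radio_uj


# Hypothetical placeholder values, chosen only to make the arithmetic visible.
SAMPLE_UJ = 50      # energy to read the sensor once
RADIO_UJ = 200      # energy to transmit one result
ACTIVE_MW = 10      # MCU power while computing
HANDCODED_MS = 2    # compute time of a hand-coded algorithm
ML_MS = 5           # compute time of an ML inference

handcoded_uj = cycle_energy_uj(SAMPLE_UJ, HANDCODED_MS, ACTIVE_MW, RADIO_UJ)
ml_uj = cycle_energy_uj(SAMPLE_UJ, ML_MS, ACTIVE_MW, RADIO_UJ)

# Sampling and radio terms cancel; only the compute term remains.
delta_uj = ml_uj - handcoded_uj
print(handcoded_uj, ml_uj, delta_uj)
```

With these placeholders the sampling and radio terms cancel out of the comparison, so the gap depends only on how long each algorithm keeps the core active.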

Just measuring the time spent on inferencing is the fairest way of assessing that.
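To make the "vector multiplications all the way down" point concrete, here is a small sketch, again in Python and with made-up integer data. It is not how any particular SDK generates code; the function names (`dot`, `fir_output`, `dense_layer`) are hypothetical. It shows that one output of a hand-coded FIR filter and one fully connected layer are built from the same multiply-accumulate loop, so counting or timing those operations is a like-for-like comparison.

```python
def dot(a, b):
    """Multiply-accumulate two equal-length vectors; return (result, MAC count)."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc, len(a)


def fir_output(taps, window):
    """One output sample of a hand-coded FIR filter: a single dot product."""
    return dot(taps, window)


def dense_layer(weights, bias, inputs):
    """One fully connected layer with ReLU: one dot product per output neuron."""
    outputs, macs = [], 0
    for row, b in zip(weights, bias):
        acc, n = dot(row, inputs)
        outputs.append(max(0, acc + b))
        macs += n
    return outputs, macs


# Hypothetical integer data standing in for a window of sensor samples.
window = [3, -1, 4, 2]
taps = [1, 2, 2, 1]
weights = [[1, 0, -1, 2], [-2, 1, 0, 1]]
bias = [1, -3]

fir_y, fir_macs = fir_output(taps, window)
nn_y, nn_macs = dense_layer(weights, bias, window)
print(fir_y, fir_macs)
print(nn_y, nn_macs)
```

The layer here simply costs two dot products instead of one, which is why inference time on its own is a reasonable proxy for the extra work.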