
Faster Convolutional Layers for Embedded CNNs

Figure: Convolutional layer in a neural network.

Convolutional neural networks have become essential in the modern algorithmic toolbox of computer vision. Running them on embedded devices, however, poses a challenge. For Arm® processors, Arm provides the versatile Common Microcontroller Software Interface Standard (CMSIS) library, which is optimized for its processors. For example, it runs quantized networks using single instruction, multiple data (SIMD) instructions on Armv7-M CPUs with DSP extensions.
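
To make the SIMD aspect concrete, the sketch below shows how the DSP extension accelerates quantized arithmetic: the __SMLAD instruction multiplies the two signed 16-bit halves of each operand and adds both products to an accumulator in a single instruction. This is a minimal illustration, not part of the CMSIS neural-network API; the helper name dot_product_q15 and the choice of header are assumptions.

```c
/* Illustrative only: a q15 dot product using the Armv7-M DSP extension.
 * __SMLAD performs two signed 16-bit multiply-accumulates per instruction.
 * Assumes a Cortex-M4 target and a CMSIS-Core header providing __SMLAD. */
#include <stdint.h>
#include <string.h>
#include "cmsis_compiler.h"   /* assumption: CMSIS-Core is on the include path */

/* Hypothetical helper name; not a CMSIS function. */
static int32_t dot_product_q15(const int16_t *a, const int16_t *b, uint32_t len)
{
    int32_t  acc   = 0;
    uint32_t pairs = len >> 1;          /* process two elements per iteration */

    while (pairs-- > 0U) {
        uint32_t va, vb;
        memcpy(&va, a, sizeof(va));     /* load two packed q15 values at once */
        memcpy(&vb, b, sizeof(vb));
        acc = (int32_t)__SMLAD(va, vb, (uint32_t)acc);
        a += 2;
        b += 2;
    }
    if (len & 1U) {                     /* handle an odd trailing element */
        acc += (int32_t)(*a) * (int32_t)(*b);
    }
    return acc;
}
```

A convolution kernel built on such a primitive loads two quantized values per 32-bit memory access and sustains two multiply-accumulates per SIMD instruction, which is the kind of throughput the DSP extension is meant to provide.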

In a convolutional neural network, the convolutional layers themselves are usually the main bottleneck. This was the case for one of our customers. We optimized the convolutional layer implementation of CMSIS 5 for our customer’s specific use case and added microarchitecture-level optimizations for the customer’s Cortex®-M4 processor. With this, we sped up the convolutional layers in the customer’s application by 3×.
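
The sketch below illustrates one generic technique behind this kind of tuning: blocking the inner loop so that two output channels reuse each loaded input value, which halves the number of activation loads per multiply-accumulate. It is a simplified illustration under an assumed data layout, not the customer-specific code; all names are hypothetical.

```c
/* Illustrative only: register blocking of a convolution inner loop.
 * Two output channels share each loaded input value, reducing memory
 * traffic on a load/store-limited core such as the Cortex-M4.
 * Function and variable names are hypothetical, not taken from CMSIS. */
#include <stdint.h>

static void conv_inner_block2(const int16_t *input_col,  /* one im2col'd patch   */
                              const int16_t *wt_ch0,     /* weights, out chan 0  */
                              const int16_t *wt_ch1,     /* weights, out chan 1  */
                              uint32_t col_len,
                              int32_t *acc0, int32_t *acc1)
{
    int32_t sum0 = *acc0;
    int32_t sum1 = *acc1;

    for (uint32_t i = 0; i < col_len; i++) {
        int32_t in = input_col[i];      /* loaded once ...           */
        sum0 += in * wt_ch0[i];         /* ... used by channel 0     */
        sum1 += in * wt_ch1[i];         /* ... and again by channel 1 */
    }
    *acc0 = sum0;
    *acc1 = sum1;
}
```

Combining such blocking with packed SIMD multiply-accumulates and keeping the hot operands in registers is the general direction such microarchitecture tuning takes; the exact changes depend on the network and the memory layout of the application.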

Quick facts

Hardware:

  • CPU: Arm Cortex-M4

Operating System:

  • None (bare metal)

Compiler:

  • GNU Compiler Collection

Summary of our results:

  • Speedup of convolutional layers in CMSIS 5 by 3×.

All referenced product or service names and trademarks are the property of their respective owners.