
DNN Model Compression for IoT Domain Specific Hardware Accelerators

Palesi M.; Monteleone S.; Patti D.; Ascia G.; Catania V.
2021-01-01

Abstract

Machine learning techniques, particularly those based on neural networks, are increasingly deployed at the edge of the network on IoT nodes. Unfortunately, the computational capabilities demanded by these applications, together with their energy-efficiency constraints, exceed what embedded general-purpose processors can offer. For this reason, the use of domain-specific hardware accelerators is considered the most viable solution to the unsustainable “Turing tariff” of general-purpose hardware. Starting from the observation that memory and communication traffic account for a large fraction of the overall latency and energy of deep neural network (DNN) inference, this paper proposes a new compression technique aimed at (i) reducing the memory footprint for storing the model parameters of a DNN, and (ii) improving DNN inference latency and energy on resource-constrained IoT devices. The proposed compression technique, named LineCompress, is applied to a set of representative convolutional neural networks (CNNs) for object recognition mapped onto a state-of-the-art domain-specific hardware accelerator targeted at resource-constrained IoT devices. We show that, on average, a 7.4× memory footprint reduction can be obtained; the resulting decrease in memory and communication traffic yields 77% and 87% reductions in inference latency and energy, respectively, trading off efficiency against accuracy.
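This record contains only the abstract, so the internals of LineCompress are not described here. The Python/NumPy sketch below is a purely illustrative, hypothetical example of the general class of model-parameter compression the abstract alludes to (low-bit quantization followed by lossless run-length coding of weight values); it is not the authors' algorithm, and all function names and parameters are assumptions introduced for illustration only.

# Purely illustrative sketch -- NOT the LineCompress algorithm, which this
# record does not describe. It only demonstrates the general idea the abstract
# alludes to: shrinking stored DNN parameters (here via 8-bit quantization plus
# run-length coding of repeated values) so that less data must be fetched from
# memory during inference.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Uniform per-tensor quantization of float32 weights to int8."""
    scale = (float(np.abs(weights).max()) or 1.0) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def rle_encode(q: np.ndarray) -> bytes:
    """Encode a flat int8 array as (value, run-length) byte pairs.
    Pruned/quantized layers contain long runs of zeros, which compress well."""
    flat = q.flatten()
    out = bytearray()
    i = 0
    while i < len(flat):
        v = int(flat[i])
        run = 1
        while i + run < len(flat) and flat[i + run] == flat[i] and run < 255:
            run += 1
        out += bytes([v & 0xFF, run])
        i += run
    return bytes(out)

if __name__ == "__main__":
    # Toy "layer": mostly-zero float32 weights, as produced by pruning.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)
    w[np.abs(w) < 1.0] = 0.0  # crude sparsification for the example
    q, scale = quantize_int8(w)
    blob = rle_encode(q)
    print(f"original: {w.nbytes} B  compressed: {len(blob)} B  "
          f"ratio: {w.nbytes / len(blob):.1f}x")

The compression ratio printed by this toy example depends entirely on the artificial sparsity injected above and bears no relation to the 7.4× figure reported in the abstract, which refers to the authors' own technique and benchmark CNNs.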
2021
Computational modeling
DNN accelerator
DNN model compression
Domain-Specific Accelerator
Energy vs. Performance vs. Accuracy trade-off
Hardware acceleration
Internet of Things
Memory management
Microcontrollers
Neural Networks
Quantization (signal)
System-on-chip

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/523170
Citations
  • Scopus: 21
  • ISI Web of Science: 14