
Improving inference latency and energy of network-on-chip based convolutional neural networks through weights compression

Ascia G.; Catania V.; Monteleone S.; Palesi M.; Patti D.
2020-01-01

Abstract

Network-on-Chip (NoC) based Convolutional Neural Network (CNN) accelerators are energy- and performance-limited by communication traffic. To run an inference, the traffic generated both on-chip and off-chip to fetch the parameters of the network, namely filters and weights, accounts for a large fraction of the energy and latency. This paper presents a technique for compressing the network parameters so as to reduce the traffic required to fetch them, thus improving the overall performance and energy figures of the accelerator. The lossy nature of the proposed compression technique degrades the accuracy of the network, but we show that this degradation is widely justified by the achievable improvements in latency and energy consumption. The proposed technique is applied to several widespread CNN models, for which the trade-off between accuracy, inference latency, and inference energy is discussed. We show that up to 63% inference latency reduction and 67% inference energy reduction can be achieved with less than 5% top-5 accuracy degradation, without the need to retrain the network.
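The record does not include the compression algorithm itself. As a hedged illustration of the class of technique the abstract describes, the sketch below applies lossy uniform post-training quantization to a toy convolutional filter bank; the function names, the 8-bit width, and the quantization scheme are assumptions chosen for illustration, not the paper's actual method.

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 8):
    """Lossily compress float weights to `bits`-bit integer codes
    via uniform quantization over the weight range (illustrative only)."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (2**bits - 1) or 1.0  # grid step; guard all-equal weights
    q = np.round((w - w_min) / scale).astype(np.uint8)  # assumes bits <= 8
    return q, scale, w_min

def dequantize_weights(q: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """Reconstruct approximate float weights on the accelerator side."""
    return q.astype(np.float32) * scale + w_min

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy bank of 64 3x3x3 convolution filters standing in for real CNN weights
    w = rng.normal(0.0, 0.05, size=(64, 3, 3, 3)).astype(np.float32)
    q, scale, w_min = quantize_weights(w, bits=8)
    w_hat = dequantize_weights(q, scale, w_min)
    # 32-bit floats -> 8-bit codes: 4x less parameter traffic to fetch
    print("compression ratio:", w.nbytes / q.nbytes)
    print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```

Shipping 8-bit codes instead of 32-bit floats cuts parameter traffic by 4x at the cost of a bounded reconstruction error, the same accuracy-for-traffic trade the abstract reports, though the paper's own scheme and ratios may differ.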
2020
ISBN: 978-1-7281-7445-7
Keywords: Accuracy/latency/energy trade-off; Approximate deep neural network; Deep neural network accelerator; Weights compression


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/486272
Citations
  • Scopus: 5
  • Web of Science: 5