Boosting deep neural network efficiency with dual-module inference

L Liu, L Deng, Z Chen, Y Wang, S Li, J Zhang, Y Yang, Z Gu, Y Ding, Y Xie
International Conference on Machine Learning, 2020. proceedings.mlr.press
Abstract
Using deep neural networks (DNNs) in machine learning tasks is promising for delivering high-quality results but challenging for meeting stringent latency requirements and energy constraints, because of the memory-bound and compute-bound execution patterns of DNNs. We propose a big-little dual-module inference scheme that dynamically skips unnecessary memory accesses and computations to accelerate DNN inference. Leveraging the noise-resilient property of nonlinear activation functions, we use a lightweight little module that approximates the original DNN layer, termed the big module, to compute activations in the insensitive region, where outputs are more noise-resilient. The expensive memory accesses and computations of the big module can thus be reduced, as its results are only calculated in the sensitive region. For memory-bound models such as recurrent neural networks (RNNs), our method reduces overall memory accesses by 40% on average and achieves a 1.54x to 1.75x speedup on a commodity CPU-based server platform with negligible impact on model quality. In addition, our method reduces the operations of compute-bound models such as convolutional neural networks (CNNs) by 3.02x, with only a 0.5% accuracy drop.
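
To make the skip logic concrete, here is a minimal NumPy sketch for a single fully connected layer with ReLU, where negative pre-activations are insensitive (ReLU zeroes them regardless of their exact value). The function name dual_module_layer, the thresholding rule, and the 1-bit little module are illustrative assumptions, not the authors' implementation; the paper evaluates RNN and CNN workloads on real hardware.

    import numpy as np

    def dual_module_layer(x, W_big, W_little, threshold=0.0):
        """One fully connected ReLU layer with big-little dual-module inference.

        The little module (W_little) cheaply approximates the big module
        (W_big). Outputs whose approximate pre-activation falls in ReLU's
        insensitive region (below `threshold`) keep the little module's
        result; only the remaining "sensitive" outputs touch the big
        module's weights, which is where the memory-access and compute
        savings come from.
        """
        y_little = x @ W_little                  # cheap approximate pre-activations
        sensitive = np.where(y_little > threshold)[0]
        y = y_little.copy()
        y[sensitive] = x @ W_big[:, sensitive]   # exact recompute, sensitive outputs only
        return np.maximum(y, 0.0)                # ReLU: negatives are zeroed anyway

    rng = np.random.default_rng(0)
    d_in, d_out = 256, 128
    W_big = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    # One possible little module: 1-bit quantized weights with a shared scale
    # (an illustrative choice, not necessarily the paper's construction).
    W_little = np.sign(W_big) * np.abs(W_big).mean()
    x = rng.standard_normal(d_in)
    print(dual_module_layer(x, W_big, W_little)[:5])

In a real kernel the savings come from never fetching the skipped columns of W_big from memory, not from the masking itself; the sketch only shows the decision rule that determines which outputs are recomputed exactly.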