1 Introduction
Recent improvements in network and storage devices have given the machine learning community the opportunity to utilize immense data sources, leading to the golden age of AI and deep learning [22]. Since modern Deep Neural Networks (DNNs) require considerable computing resources and are deployed on a variety of compute devices, ranging from high-end servers to mobile devices with limited computational resources, there is a strong need for economical DNNs that fit within these resource constraints [128, 160, 161]. Resource-efficient DNN research has been actively carried out, largely independently, in several research communities, including the machine learning, computer arithmetic, and computing system communities. Recently, DeepMind proposed a resource-efficient deep learning benchmark metric: accuracy reported together with the required memory footprint and number of operations [84].
In this regard, this article surveys resource-efficient techniques for Convolutional Neural Networks (CNNs), one of the most widely used DNN architectures [100], based on a three-level categorization into model-, arithmetic-, and implementation-level techniques, along with various resource efficiency metrics, as shown in Figure 1. Our resource efficiency metrics include the accuracy per parameter, operation, memory footprint, core utilization, memory access, and Joule. When comparing the resource efficiency of a baseline CNN with that of a CNN using resource-efficient techniques, the two CNNs should achieve equivalent accuracy. In other words, it is not fair to compare the resource efficiency of a high-accuracy CNN with that of a low-accuracy CNN, since under our metrics a low-performing CNN can appear significantly more resource-efficient. We categorize techniques as model-level if they compress the CNN model size, arithmetic-level if they utilize reduced-precision arithmetic and/or customized arithmetic rules, and implementation-level if they apply hardware optimization techniques to CNNs (e.g., locating local memory near Processing Elements (PEs)) to improve physical resource efficiency such as the accuracy per compute resource and per Joule.
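To make these comparisons concrete, each metric can be read as a ratio of accuracy to a resource cost; the following formalization is only illustrative (our precise definitions appear in Section 2.2):
\[
E_{\text{param}} = \frac{\text{accuracy}}{\#\text{parameters}}, \qquad
E_{\text{op}} = \frac{\text{accuracy}}{\#\text{operations}}, \qquad
E_{\text{Joule}} = \frac{\text{accuracy}}{\text{energy consumed (J)}},
\]
and analogously for memory footprint, core utilization, and memory accesses. Under this reading, a pruned model that keeps the baseline accuracy while halving the parameter count doubles \(E_{\text{param}}\).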
As shown in Figure 1, CNNs themselves can be considered a resource-efficient technique, since they improve the accuracy per parameter, per operation, and per memory footprint compared to fully connected neural networks. This resource efficiency can be further improved by applying the model-, arithmetic-, and implementation-level techniques. The model- and arithmetic-level techniques can affect the accuracy, since they change either the CNN model structure or the arithmetic rules, whereas the implementation-level techniques generally do not affect the accuracy. The model-level techniques mostly improve abstract resource efficiency, whereas the implementation-level techniques improve physical resource efficiency. Without careful consideration of the intersection between the model- and implementation-level techniques, a CNN model compressed by model-level techniques might require significant runtime compute resources, incurring longer training time and inference latency than the original model [31, 119]. Thus, to optimize performance and energy efficiency on particular hardware, it is essential to consider the joint effect of the model-, arithmetic-, and implementation-level optimizations.
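As a minimal illustration of this intersection (the matrix sizes and density below are our own hypothetical choices; results depend on hardware and library support), the following sketch times a dense matrix product against an unstructured-sparse one, showing why fewer parameters do not automatically mean lower latency:

```python
# Sketch: unstructured pruning shrinks the model, but a sparse kernel is not
# guaranteed to run faster than the dense BLAS kernel it replaces.
import time
import torch

def bench(label, fn, reps=10):
    fn()                                     # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    print(f"{label}: {(time.perf_counter() - t0) / reps * 1e3:.2f} ms")

w = torch.randn(4096, 4096)                  # dense weight matrix
mask = torch.rand_like(w) < 0.1              # keep ~10% of weights (90% pruned)
w_sparse = (w * mask).to_sparse()            # COO sparse representation
x = torch.randn(4096, 256)                   # activation batch

bench("dense matmul ", lambda: w @ x)
bench("sparse matmul", lambda: torch.sparse.mm(w_sparse, x))
# At moderate sparsity the sparse kernel is often slower on commodity
# hardware, so model-level compression must be co-designed with the
# implementation level to actually reduce latency.
```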
Related survey works are as follows. Sze et al. [155] provided a comprehensive tutorial and survey on efficient processing of DNNs, discussing DNN architectures, software frameworks (e.g., PyTorch, TensorFlow, Keras), and implementation methods that optimize the Multiply-and-Accumulate Computations (MACs) of CNNs on given compute platforms. Cheng et al. [35, 36] surveyed model compression techniques, including pruning, low-rank factorization, compact convolution, and knowledge distillation. Deng et al. [45] discussed joint model-compression methods that combine multiple model-level compression techniques, and their efficient implementation on particular computing platforms. Wang et al. [166] surveyed custom hardware implementations of DNNs and evaluated their performance using the Roofline model of Williams et al. [171], recalled below. Hoefler et al. [82] surveyed pruning techniques that generate sparse DNNs and provided a tutorial on how to train such sparse DNNs. Ghimire et al. [58] surveyed model compression methods and computing platforms suitable for accelerating CNNs.
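For reference, the Roofline model [171] bounds the attainable performance \(P\) of a kernel with arithmetic intensity \(I\) (operations per byte moved) on a machine with peak compute throughput \(P_{\text{peak}}\) and peak memory bandwidth \(B_{\text{peak}}\):
\[
P = \min\left(P_{\text{peak}},\; I \times B_{\text{peak}}\right).
\]
Kernels whose intensity falls below the ridge point \(P_{\text{peak}}/B_{\text{peak}}\) are memory-bound, which motivates implementation-level techniques that increase data reuse.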
Unlike previous surveys, we conduct a comprehensive survey of resource-efficient CNN techniques across the model, arithmetic, and implementation levels, clarifying which resource efficiency metrics, as defined in Section 2.2, each technique improves. Such clarification provides machine learning engineers, computer arithmetic designers, software developers, and hardware manufacturers with useful information for improving particular resource efficiencies in their CNN applications. In addition, since advances in fast wireless communication and edge computing affect CNN applications [190], our survey also includes cutting-edge resource-efficient techniques for distributed AI, such as early exiting [160, 161], sketched below.
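As context for readers unfamiliar with early exiting, the following is a minimal sketch (the architecture, module names, and confidence threshold are our own illustrative choices, not taken from [160, 161]): an auxiliary classifier attached to an early layer returns a prediction whenever it is sufficiently confident, so easy inputs skip the remaining layers.

```python
# Minimal early-exit CNN sketch in PyTorch (illustrative architecture only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitCNN(nn.Module):
    def __init__(self, num_classes=10, threshold=0.9):
        super().__init__()
        self.threshold = threshold  # softmax confidence needed to exit early
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.exit1 = nn.Linear(16, num_classes)   # cheap auxiliary classifier
        self.stage2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.exit2 = nn.Linear(32, num_classes)   # final classifier

    def forward(self, x):
        x = self.stage1(x)
        logits1 = self.exit1(x.mean(dim=(2, 3)))  # global average pooling
        if not self.training:                     # exit check at inference time
            conf = F.softmax(logits1, dim=1).max(dim=1).values
            if bool((conf > self.threshold).all()):
                return logits1                    # confident: skip stage2
        x = self.stage2(x)
        return self.exit2(x.mean(dim=(2, 3)))

# Training typically optimizes a weighted sum of losses over all exits;
# deployed systems usually make the exit decision per sample rather than
# per batch, as in this simplified sketch.
```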
The holistic, multi-faceted view of resource-efficient CNN techniques in our survey allows for a better understanding of the available techniques and, as a consequence, better global optimization than previous surveys. The main contributions of our article include the following:
• This article is the first to provide comprehensive survey coverage of recent resource-efficient techniques for CNNs across the model, arithmetic, and implementation levels.
• To the best of our knowledge, our work is the first to provide a comprehensive survey of arithmetic-level techniques for CNNs.
• This work uses multiple resource efficiency metrics to clarify which metrics each technique improves.
• This article describes how higher-level resource-efficient CNN techniques influence lower-level ones (refer to Figure 1).
• We discuss future trends in resource-efficient CNN techniques.
We discuss our resource efficiency metrics for CNNs in Section 2, the model-level resource-efficient techniques in Section 3, the arithmetic-level techniques in Section 4, the implementation-level techniques in Section 5, and the influences between different-level techniques together with future research trends in Section 6, and we present our conclusion in Section 7. Our article excludes higher-level training-procedure manipulation techniques such as one-pass ImageNet [84], bag of freebies [20], and data augmentation. We have predominantly collected papers that have been (1) highly cited or (2) published in world-leading machine learning or computing system conferences/journals (e.g., CORE: \(A/A^*\) or JCR Q1).