
Resource-Efficient Convolutional Networks: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques

Published: 13 July 2023

Abstract

Convolutional neural networks (CNNs) are used in our daily life, including in self-driving cars, virtual assistants, social network services, healthcare services, and face recognition, among others. However, deep CNNs demand substantial compute resources during training and inference. The machine learning community has mainly focused on model-level optimizations such as architectural compression of CNNs, whereas the system community has focused on implementation-level optimizations. In between, various arithmetic-level optimization techniques have been proposed in the computer arithmetic community. This article surveys resource-efficient CNN techniques at the model, arithmetic, and implementation levels and identifies the research gaps across these three levels. Our survey clarifies the influence from higher- to lower-level techniques based on our resource efficiency metric definitions and discusses future trends in resource-efficient CNN research.

1 Introduction

Recent improvements in network and storage devices have provided the machine learning community with the opportunity to utilize immense data sources, leading to the golden age of AI and deep learning [22]. Since modern Deep Neural Networks (DNNs) require considerable computing resources and are deployed on a variety of compute devices, ranging from high-end servers to mobile devices with limited computational resources, there is a strong need for economical DNNs that fit within the resource constraints [128, 160, 161]. Resource-efficient DNN research has been carried out actively and independently in various research communities, including the machine learning, computer arithmetic, and computing system communities. Recently, DeepMind proposed a resource-efficient deep learning benchmark metric that considers accuracy together with the required memory footprint and number of operations [84].
In this regard, this article surveys resource-efficient techniques for Convolutional Neural Networks (CNNs), one of the most widely used DNN architectures [100], based on a three-level categorization: model-, arithmetic-, and implementation-level techniques, along with various resource efficiency metrics as shown in Figure 1. Our resource efficiency metrics include the accuracy per parameter, operation, memory footprint, core utilization, memory access, and Joule. When comparing the resource efficiency of a baseline CNN against a CNN utilizing resource-efficient techniques, the accuracy should be equivalent between the two CNNs. In other words, it is not fair to compare the resource efficiency of a high-accuracy CNN against a low-accuracy CNN, since the resource efficiency is significantly higher in a low-performing CNN under our resource metrics. We categorize techniques as model-level if they compress the CNN model size, as arithmetic-level if they utilize reduced-precision arithmetic and/or customized arithmetic rules, and as implementation-level if they apply hardware optimization techniques to the CNNs (e.g., locating local memory near Processing Elements (PEs)) to improve physical resource efficiency such as the accuracy per compute resource and per Joule.
Fig. 1.
Fig. 1. Survey on resource-efficient CNN techniques based on resource efficiency metrics.
In Figure 1, CNNs can themselves be considered a resource-efficient technique since they improve the accuracy per parameter, per operation, and per memory footprint compared to fully connected neural networks. The resource efficiency of CNNs can be further improved by applying the model-, arithmetic-, and implementation-level techniques. The model- and arithmetic-level techniques can affect the accuracy since they change either the CNN model structure or the arithmetic rule, whereas the implementation-level techniques generally do not affect the accuracy. The model-level techniques mostly contribute to improving abstract resource efficiency, whereas the implementation-level techniques contribute to improving physical resource efficiency. Without careful consideration of the intersection between the model- and the implementation-level techniques, a CNN model compressed by the model-level techniques might require significant runtime compute resources, incurring longer training time and inference latency than the original model [31, 119]. Thus, to optimize the performance and energy efficiency on a particular hardware platform, it is essential to consider the joint effect of the model-, arithmetic-, and implementation-level optimizations.
Related survey works are as follows. Sze et al. [155] provided a comprehensive tutorial and survey toward efficient processing of DNNs, discussing DNN architectures, software frameworks (e.g., PyTorch, TensorFlow, Keras), and the implementation methods optimizing Multiply-and-Accumulate Computations (MACs) of CNNs on given compute platforms. Cheng et al. [35, 36] conducted a survey on the model compression techniques including pruning, low-rank factorization, compact convolution, and knowledge distillation. Deng et al. [45] discussed joint model-compression methods that combined multiple model-level compression techniques, and their efficient implementation on particular computing platforms. Wang et al. [166] provided a survey on custom hardware implementations of DNNs and evaluated their performance using the Roofline model of Williams et al. [171]. Hoefler et al. [82] provided a survey on pruning techniques to generate sparse DNNs and a tutorial of how to train such sparse DNNs. Ghimire et al. [58] provided a survey on model compression methods and computing platforms suitable for accelerating CNNs.
Unlike the previous survey works, we conduct a comprehensive survey on resource-efficient CNN techniques in terms of the model-, arithmetic-, and implementation-level techniques by clarifying which resource efficiency can be improved with particular techniques according to our resource efficiency metrics as defined in Section 2.2. Such clarification provides machine learning engineers, computer arithmetic designers, software developers, and hardware manufacturers with useful information to improve particular resource efficiency for their CNN applications. In addition, since fast wireless communication and edge computing developments affect CNN applications [190], our survey also includes cutting-edge resource-efficient techniques for distributed AI such as early exiting techniques [160, 161]. The holistic and multi-faceted view of resource-efficient CNN techniques from our survey allows for a better understanding of the available techniques and, as a consequence, better global optimization, compared to previous survey works. The main contributions of our article include the following:
This article first provides a comprehensive survey coverage of the recent resource-efficient techniques for CNNs in terms of the model-, arithmetic-, and implementation-level techniques.
To the best of our knowledge, our work is the first to provide a comprehensive survey on arithmetic-level utilization techniques for CNNs.
This work utilizes multiple resource efficiency metrics to clarify which resource efficiency metrics each technique improves.
This article provides the influence of resource-efficient CNN techniques from higher- to lower-level techniques (refer to Figure 1).
We discuss the future trend for resource-efficient CNN techniques.
We discuss our resource efficiency metrics for CNNs in Section 2, the model-level resource-efficient techniques in Section 3, the arithmetic-level techniques in Section 4, the implementation-level techniques in Section 5, and the influences between different-level techniques and the future research trends in Section 6, and we present our conclusion in Section 7. Our article excludes higher-level training procedure manipulation techniques such as one-pass ImageNet [84], bag of freebies [20], and data augmentation. We have predominantly collected papers that have been (1) highly cited or (2) published in world-leading machine learning or computing system conferences/journals (e.g., CORE: \(A/A^*\) or JCR Q1).

2 Background on CNNs and Resource Efficiency

This section gives an overview of CNNs and defines our resource efficiency metrics, in preparation for the description of resource-efficient techniques at the three different levels.

2.1 Deep Learning Overview

Deep learning is defined as “learning multiple levels of representation” [17] and often utilizes CNNs to learn the multiple levels of representation. CNNs are trained using the training dataset, and their prediction accuracy is evaluated using the test dataset [5, 100]. In this section, we describe the perceptron model (i.e., artificial neuron) first and then CNNs later.

2.1.1 Perceptron Model.

The McCulloch-Pitts neuron (a.k.a. M-P neuron) [123], proposed in 1943, was a system mimicking a neuron in the nervous system, receiving multiple binary inputs and producing one binary output based on a threshold. Inspired by the work of McCulloch and Pitts [123], Rosenblatt [139] proposed the “perceptron” model consisting of multiple weights, a summation, and an activation function as shown in Figure 2(a).
Fig. 2.
Fig. 2. Perceptron and neural network model.
Equation (1) describes a perceptron’s firing activity \(y_{out}\) using the inputs \(x_i\) and their associated weights \(w_i\), where i indexes the inputs:
\begin{equation} y_{out} = {\left\lbrace \begin{array}{ll} 1, & \text{if } (\Sigma _{i=1}^{n_{in}} w_i \times x_i \gt threshold) \text{ or } (\Sigma _{i=1}^{n_{in}} w_i \times x_i + bias \gt 0) \\ 0, & \text{if } (\Sigma _{i=1}^{n_{in}} w_i \times x_i \le threshold) \text{ or } (\Sigma _{i=1}^{n_{in}} w_i \times x_i + bias \le 0), \end{array}\right.} \end{equation}
(1)
where \(n_{in}\) is the number of inputs. The function that determines the firing activity is referred to as the activation function, and the bias is proportional to the probability of the firing activation [126]. Since a single perceptron is suitable only for linearly separable problems, a Multi-Layer Perceptron (MLP) model can be used for non-linearly separable problems as shown in Figure 2(b), where \(w_{i,j,(l)}\) represents the weight linking the ith neuron in the \((l-1)\)th layer to the jth neuron in the lth layer. The signal \(s_{j,(l)}\) in Figure 2 follows Equation (2):
\begin{equation} s_{j,(l)} = \Sigma _{i=1}^{n_{in}^{(l-1)}} (w_{i,j,(l)} \times x_{i,(l-1)}) = (\mathbf {W}_{(l)}^T \mathbf {x}_{(l-1)})_j, \end{equation}
(2)
and \(x_{j,(l)} = \theta _{P}(s_{j,(l)})\) , where \(\theta _{P}(s)\) is a perceptron’s activation function that follows Equation (1) (i.e., step function), and \(\mathbf {W}_{(l)}\) consists of the matrix elements, \(w_{i,j,(l)}\) s, for the \(i^{\rm th}\) row and the \(j^{\rm th}\) column.
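As a concrete illustration of Equations (1) and (2), the following minimal NumPy sketch runs a forward pass through a toy MLP with a step activation; the layer sizes and random weights are hypothetical and chosen only for demonstration.

```python
import numpy as np

def step(s):
    # Perceptron activation theta_P from Equation (1): fires 1 if s > 0, else 0.
    return (s > 0).astype(float)

def mlp_forward(x, weights, biases):
    # Forward pass through an MLP, layer by layer (Equation (2)).
    # weights[l] has shape (n_in, n_out); s = W^T x + bias, then x = theta_P(s).
    for W, b in zip(weights, biases):
        s = W.T @ x + b
        x = step(s)
    return x

# Toy 2-4-1 MLP with random weights (illustration only).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 4)), rng.standard_normal((4, 1))]
biases = [rng.standard_normal(4), rng.standard_normal(1)]
print(mlp_forward(np.array([1.0, -0.5]), weights, biases))
```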

2.1.2 Deep Neural Network.

Since it requires tremendous effort for humans to optimize MLPs manually, neural networks that adopt a soft threshold activation function \(\theta _N\) (sigmoid, ReLU, etc.) were proposed so that the weights can be trained from the training data [169, 172]. The term neural network is sometimes used interchangeably with MLP [126]. For clarity, we call an algorithm an MLP if it utilizes a step function for its activation functions and a neural network if it utilizes a soft threshold function. In Figure 2(b), the output from the ith neuron at the lth layer in a neural network employing a soft threshold activation function, \(\theta _N(\cdot)\), can be represented as Equation (3):
\begin{equation} x_{i,(l)} = \theta _N(s_{i,(l)}). \end{equation}
(3)
A neural network allows the weights and the biases to be trained using backpropagation [5]. A neural network model is often referred to as a feed-forward model in that the weights always link the neurons in the current layer to the neurons in the very next layer. In a neural network, the middle layers located between the input and output layer are often referred to as hidden layers (e.g., two hidden layers in Figure 2(b)). A neural network with multiple hidden layers is referred to as a DNN [155].

2.1.3 Training: Backpropagation.

In the forward pass, the neuron outputs are propagated in the forward direction based on the matrix-vector multiplications shown in Equation (2). Likewise, the weights and the biases can be trained in the backward direction using matrix-vector multiplications. This method is called backpropagation. The backpropagation method consists of three steps, allowing a gradient descent algorithm to be implemented efficiently on computers. It finds the activation gradients, \(\delta _{j,(l)}\)s (i.e., the gradients with respect to all the signals, \(s_{j,(l)}\)s, in Equation (2)), in step 1, finds the weight gradients (i.e., the gradients with respect to all the weights) using the activation gradients in step 2, and finally updates the weights using the weight gradients in step 3. All \(\delta _{j,(l-1)}\)s are found in the backward direction using matrix-vector multiplications by replacing \(\mathbf {W}_{(l)}^T\) with \(\mathbf {W}_{(l)}\) and \(x_{j,(l-1)}\) with \(\delta _{j,(l)}\) in Equation (2). After all activation gradients have been found, the weight gradients can be found. Finally, the weights are updated using the weight gradients. Backpropagation requires additional storage for all the weights and activation values. Once a DNN is trained, it is used for inference with the trained weights. Please refer to the work of Abu-Mostafa et al. [5] for further details on the backpropagation method. After a DNN is trained, its accuracy is evaluated using a validation dataset that is unseen during training.
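The three steps can be summarized in a minimal NumPy sketch of one gradient descent update for a fully connected network; the sigmoid activation, squared-error loss, and layer shapes are assumptions made only for this illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(x0, target, Ws, lr=0.1):
    # Forward pass: store signals s and activations x for every layer.
    xs, ss = [x0], []
    for W in Ws:
        s = W.T @ xs[-1]
        ss.append(s)
        xs.append(sigmoid(s))

    # Step 1: activation gradients delta_(l), propagated backward with W (not W^T).
    deltas = [None] * len(Ws)
    deltas[-1] = (xs[-1] - target) * xs[-1] * (1 - xs[-1])   # squared-error output layer
    for l in range(len(Ws) - 2, -1, -1):
        deltas[l] = (Ws[l + 1] @ deltas[l + 1]) * xs[l + 1] * (1 - xs[l + 1])

    # Step 2: weight gradients from the stored activations and activation gradients.
    grads = [np.outer(xs[l], deltas[l]) for l in range(len(Ws))]

    # Step 3: update the weights with gradient descent.
    return [W - lr * g for W, g in zip(Ws, grads)]

# Toy 3-4-2 network, one update step (hypothetical sizes).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]
Ws = backprop_step(np.array([0.5, -0.2, 0.1]), np.array([1.0, 0.0]), Ws)
```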

2.1.4 Convolutional Neural Network.

A CNN employs multiple convolutional layers, and each convolutional layer utilizes multiple filters to perform convolutions independently with respect to each filter as shown in Figure 3, where a filter at a convolutional layer consists of as many kernels as the number of channels at the input layer (e.g., three kernels per filter in Figure 3). For example, each \(3 \times 3\) filter has nine weight parameters and slides from the top-left to the bottom-right position, generating one output value per position, or four output values in total (top-left, top-right, bottom-left, and bottom-right positions) in Figure 3. The outputs from the convolutions are often called feature maps and are fed into activation functions. Modern CNNs such as ResNet [72] often employ a batch normalization layer [90] between the convolutional layer and the ReLU layer to improve the accuracy.
Fig. 3.
Fig. 3. Convolution operations in a CNN.
The CNN is a resource-efficient DNN architecture in terms of the accuracy per parameter and per operation because it leverages two properties: local receptive fields and shared weights [103]. For example, performing convolutions using multiple small kernels extracts multiple local receptive features from the input image during training, and each kernel contains some meaningful pattern from the input image after being trained [189]. Thus, a CNN utilizes far fewer weights than a fully connected DNN, since the kernel’s height and width are generally much smaller than the height and width at the input layer, leading to improved resource efficiency. Notice that a convolutional layer becomes a fully connected layer if the height and the width at the input layer match each kernel’s height and width. The total number of weights in a CNN layer is much smaller than in a fully connected neural network, since the local receptive weights are shared over the entire feature map of a layer.
Training a CNN also uses backpropagation, employing the transpose of the kernel matrices in a filter to update the weights in that filter. The mini-batch gradient descent algorithm, which uses part of the training data to update the weights at each iteration, is widely used to train CNNs. The number of samples used per iteration is referred to as the batch size B (e.g., \(B = 64\) or 128). Each epoch consumes the entire training data and consists of \(N/B\) iterations, where N is the total number of training samples. The mini-batch gradient descent method is a resource-efficient training algorithm in terms of the accuracy per operation, compared to the batch gradient descent method that utilizes the entire training dataset per iteration (i.e., the batch gradient descent method updates the weights once per epoch). For parallel backpropagation with respect to the B samples in one mini-batch, all the weights and all the activation values for the B training samples must be stored to update the weights per mini-batch iteration, requiring \(B \times\) additional activation storage compared to backpropagation using a stochastic gradient descent algorithm that updates the weights per training sample.
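The back-of-the-envelope numbers below, computed with hypothetical layer and dataset sizes, illustrate the weight sharing and the mini-batch bookkeeping described above.

```python
# Weight sharing in a convolutional layer vs a fully connected layer (hypothetical sizes).
H, W, C_in = 32, 32, 3        # input height, width, channels
k, M = 3, 64                  # kernel size and number of filters

conv_weights = M * C_in * k * k                  # shared local receptive fields
fc_weights = (H * W * C_in) * (H * W * M)        # dense layer with the same output size
print(conv_weights, fc_weights)                  # 1,728 vs ~201 million weights

# Mini-batch training: iterations per epoch and activation-storage factor.
N, B = 50_000, 128                               # hypothetical dataset and batch size
print(N // B, "iterations per epoch;", f"{B}x activation storage vs per-sample SGD")
```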

2.2 Resource Efficiency Metrics

Recently, researchers from DeepMind [84] proposed metrics for resource-efficient deep learning benchmarks, including the top-1 accuracy, the required memory footprint for training, and the required number of floating-point operations for training, and evaluated the resource efficiency of deep learning applications by jointly considering the three metrics. The Roofline model [171] discussed attainable performance in terms of the operational intensity, defined as the number of Floating-Point (FP) operations per DRAM access. Motivated by these works [84, 171], our resource efficiency metrics include the accuracy per parameter, per operation, per memory footprint, per core utilization, per memory access, and per Joule as shown in Figure 1.

2.2.1 Accuracy per Parameter.

We consider the accuracy per parameter (i.e., weight) a resource efficiency metric. The accuracy per parameter is an abstract resource efficiency metric since higher accuracy per parameter does not always imply higher physical resource efficiency after its implementation [1, 84].

2.2.2 Accuracy per Operation.

We consider the accuracy per arithmetic operation a resource efficiency metric. This is also an abstract metric, since it can be evaluated prior to the implementation.

2.2.3 Accuracy per Compute Resource.

Instruction-driven architectures such as CPUs or GPUs require substantial memory accesses due to instruction fetch and decode operations, whereas data-driven architectures such as ASICs or FPGAs can minimize the number of memory accesses, resulting in energy efficiency. We further categorize such compute resources into core utilization, memory footprint, and memory access required to operate a CNN on a given computing platform. For example, the memory access can be interpreted as GPU DRAM access (or off-chip memory) for a GPU and as FPGA on-chip memory access (or off-chip memory) for an FPGA.
Accuracy per Core Utilization. The core utilization in this work represents the utilization percentage of the processing cores or PEs.
Accuracy per Memory Footprint. The accuracy per memory footprint is related to both physical and abstract resource efficiency as shown in Figure 1. The memory footprint is proportional to the number of parameters, but it also varies with the precision level applied for arithmetic. For example, if half-precision arithmetic is applied to a CNN, the memory footprint is reduced by \(2\times\) compared to a single-precision CNN.
Accuracy per Memory Access. A computing kernel with low operational intensity cannot approach the peak performance defined by the hardware specification, since the data supply rate from DRAM to the CPU cannot keep up with the data consumption rate of the arithmetic operations. Such kernels are called memory-bound kernels in the work of Williams et al. [171]. The other type, compute-bound kernels, can approach the peak performance defined by the hardware specification. Utilizing reduced-precision arithmetic can improve the performance of both memory-bound kernels, by improving the data supply rate from DRAM to the CPU, and compute-bound kernels, by increasing word-level parallelism on SIMD architectures [105].
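The distinction between memory-bound and compute-bound kernels can be sketched with the Roofline model; the peak throughput and memory bandwidth below are hypothetical platform numbers chosen only for illustration.

```python
def attainable_gflops(op_intensity, peak_gflops, mem_bw_gbs):
    # Roofline model: performance is capped either by compute or by memory bandwidth.
    return min(peak_gflops, op_intensity * mem_bw_gbs)

# Hypothetical platform: 10 TFLOP/s peak, 500 GB/s DRAM bandwidth.
peak, bw = 10_000.0, 500.0
for oi in (2.0, 50.0):                       # FLOPs per byte of DRAM traffic
    bound = "memory" if oi * bw < peak else "compute"
    print(f"OI={oi}: {attainable_gflops(oi, peak, bw):.0f} GFLOP/s ({bound} bound)")
```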

2.2.4 Accuracy per Joule.

Dynamic power consumption is the main factor determining the energy consumed by computationally intensive tasks (e.g., CNN training/inference tasks). The dynamic power consumption, \(P_D\), follows:
\begin{equation} P_D = \#_{TTR} \times C_{CP} \times V^{2}_{CP} \times f_{CP}, \end{equation}
(4)
where \(\#_{TTR}\) is the number of toggled transistors, \(C_{CP}\) is the effective capacitance, \(V_{CP}\) is the operational voltage, and \(f_{CP}\) is the operational frequency for a given computing platform CP. Generally, the required minimum operational voltage is proportional to the operational frequency. Therefore, scaling the voltage together with the frequency can reduce power cubically (a.k.a. Dynamic Voltage and Frequency Scaling [147]). For example, minimizing the operations required to run a CNN at runtime contributes to minimizing \(\#_{TTR}\), resulting in power reduction and energy savings; we discuss the resource-efficient techniques leveraging this in Section 5.3.1.
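The following numeric sketch evaluates Equation (4) with hypothetical transistor-count, capacitance, voltage, and frequency values to show the roughly cubic power saving when frequency and voltage are scaled down together.

```python
def dynamic_power(n_ttr, c_eff, v, f):
    # Equation (4): P_D = #_TTR * C_CP * V_CP^2 * f_CP
    return n_ttr * c_eff * v**2 * f

# Hypothetical values; halving the frequency also allows a lower voltage (DVFS),
# so power drops roughly cubically with the scaling factor.
base = dynamic_power(1e8, 1e-15, 1.0, 2e9)     # ~200 W
scaled = dynamic_power(1e8, 1e-15, 0.5, 1e9)   # ~25 W
print(base, scaled, scaled / base)             # ratio ~ (1/2)^3 = 0.125
```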

2.2.5 Interrelated Influence of Resource Efficiency Metrics.

A higher-level resource efficiency metric influences a lower-level metric as shown in Figure 1. For example, accuracy per parameter affects accuracy per memory footprint and/or per memory access. Fewer parameters require a smaller memory footprint and fewer arithmetic operations, leading to reduced core utilization, memory accesses, and energy consumption. The energy consumption of CNNs on CPUs and GPUs mainly comes from memory access [7, 180]. Therefore, minimizing memory access cost is the main consideration for energy savings in CNNs [155]. For example, most implementation-level resource-efficient techniques focus on minimizing memory access cost to save energy.

3 Model-level Resource-Efficient Techniques

The model-level resource-efficient techniques, mostly developed from the machine learning community, aim at reducing the CNN model size to fit the models to resource-constrained systems such as mobile devices and IoT. We categorize the model-level resource-efficient techniques as shown in Figure 4.
Fig. 4.
Fig. 4. Categorization of model-level resource-efficient techniques.

3.1 Weight Quantization

The weight quantization techniques quantize the weights with a smaller number of bits, improving the accuracy per memory footprint. The training procedure should be amended according to the weight quantization schemes.

3.1.1 Binary Weight Quantization.

The BinaryConnect training scheme [41] allowed a CNN to represent the weights using one bit. In step 1, the weights are encoded to \(\lbrace -1, 1\rbrace\) using a stochastic clipping function. In step 2, the forward pass is performed using the encoded binary weights. In step 3, backpropagation seeks all activation gradients using full precision. In step 4, the weights are updated using full precision, and the training procedure goes back to step 1 for the training using the next mini-batch. This method required only one bit to represent the weights, thus improving the accuracy per memory footprint. In addition, the binary weight quantization also removed the need for multiplication arithmetic operations for MAC operations, improving the accuracy per operation. Moreover, if the activations are also quantized to the binary value, all MAC operations in the CNN can be implemented only with XNOR gates and a counter [42, 135].
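A minimal NumPy sketch of one BinaryConnect-style mini-batch step is shown below; the hard-sigmoid probability used for stochastic binarization, the clipping of the real-valued weights, and the stand-in grad_fn are simplifying assumptions rather than the exact recipe of Courbariaux et al. [41].

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_stochastic(w):
    # Step 1: encode weights to {-1, +1}; P(+1) follows a clipped "hard sigmoid"
    # of the real-valued weight (a common choice, used here as a simplification).
    p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

def binaryconnect_step(w_real, grad_fn, lr=0.01):
    w_bin = binarize_stochastic(w_real)        # step 1: binary weights
    g = grad_fn(w_bin)                         # steps 2-3: forward/backward with w_bin
    return np.clip(w_real - lr * g, -1, 1)     # step 4: update the full-precision copy

# Toy usage: grad_fn stands in for a full forward/backward pass (hypothetical).
w = rng.uniform(-1, 1, size=(4, 4))
w = binaryconnect_step(w, grad_fn=lambda wb: wb * 0.1)
```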

3.1.2 Ternary Weight Quantization.

Li et al. [108] proposed ternary weight networks that utilized ternary weights, improving accuracy, compared to the binary weight networks. All the weights on each layer were quantized into three values, requiring only two bits to represent the quantized weights. The overall training procedure was similar to that of Courbariaux et al. [41], but with ternary valued weights instead of the binary weights. The ternary weight network showed equivalent accuracy to various single-precision networks with MNIST, CIFAR-10, and ImageNet, whereas the binary weight quantization [41] showed minor accuracy loss. Zhu et al. [196] scaled the ternary weights independently for each layer with a layer-wise scaling approach, improving the accuracy further, compared to Li et al. [108].

3.1.3 Mixed Quantization.

Hubara et al. [88] proposed the “Quantized Neural Network,” which quantizes the activations and the weights to arbitrary lower-precision format. For example, quantizing the weights to one bit and the activations to two bits improved the accuracy compared to the binarized CNN of Courbariaux et al. [42].

3.2 Pruning

Pruning unimportant neurons, filters, and channels can save computational resources for CNN applications without sacrificing accuracy, improving the accuracy per parameter and per operation. Coarse-grained pruning methods such as pruning filters or channels are less flexible in reaching a prescribed accuracy but can be implemented efficiently on hardware [114], implying higher physical resource efficiency than fine-grained pruning such as pruning individual weights. Notice that such pruning methods can degrade confidence scores without careful retraining, even though they do not affect top-1 accuracy [184].

3.2.1 Pruning Weights.

In 1990, LeCun et al. [104] proposed a weight pruning method to generate sparse DNNs with fewer weights without losing accuracy. In 2015, the weight pruning approach was revisited [70]: the weights were pruned based on their magnitudes after training, and the pruned CNNs were retrained to regain the lost accuracy. The pruning and retraining procedures could be performed iteratively to prune the weights further. This method reduced the number of weights of AlexNet by \(9 \times\) without losing accuracy. In 2016, Guo et al. [63] noticed that wrongly pruned weights could not be revived and proposed to prune and splice the weights at every mini-batch to reduce the risk of wrong pruning decisions made in earlier mini-batches. For example, the pruned weights were still used in the weight update procedure during backpropagation and were restored when they were reconsidered important. In 2017, Yang et al. [180] proposed an energy-aware weight pruning method in which the energy consumption of a CNN was directly measured to guide the pruning process. In 2019, Frankle and Carbin [53] demonstrated that some pruned models outperformed the original model when the surviving weights were reset to the initial random values used for training the original model and the pruned models were then retrained.
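A minimal NumPy sketch of magnitude-based weight pruning in the spirit of the 2015 approach [70] is given below; the 90% sparsity target is an arbitrary assumption, and the retraining step that recovers accuracy is omitted.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    # Zero out the smallest-magnitude weights; keep a mask so retraining
    # can leave pruned positions at zero.
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"kept {mask.mean():.1%} of weights")
```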

3.2.2 Pruning Neurons.

Instead of pruning individual weights, pruning a neuron can remove a group of the weights belonging to the neuron [85, 120, 152, 188]. In 2015, Srinivas and Babu [152] pruned the redundant neurons having similar weight values in a trained CNN model. For example, the weights in a baseline neuron were compared to the weights in other neurons at the same layer, and the neurons having similar weights to the baseline neuron were fused to the baseline neuron based on a Euclidean distance metric in the weight values between the two neurons. In 2016, Mariet and Sra [120] pruned the redundant neurons based on the “determinantal point process” metric. Hu et al. [85] measured the average percentage of zero activations per neuron and pruned the neurons having a high percentage of zero activations according to a given compression rate. Yu et al. [188] pruned unimportant neurons based on the effect of the pruning error propagation on the final response layer (e.g., the neurons were pruned backward from the final layer to the first layer). The methods of pruning neurons improved the resource efficiency, such as the accuracy per parameter and per operation.

3.2.3 Pruning Filters.

Pruning insignificant filters after training can improve the accuracy per parameter and per operation. The feature maps associated with the pruned filters, and the kernels in the next layer associated with those feature maps, must also be pruned. Unlike pruning weights, pruning filters maintains the dense structure of the DNN, making it more likely to improve physical resource efficiency. Li et al. [109] pruned unimportant filters based on the sum of the absolute weight values in each filter. The pruned CNNs were retrained with the surviving filters to regain the lost accuracy. Yang et al. [181] pruned filters based on a platform-aware magnitude-based metric depending on the resource-constrained device. ThiNet [118] calculated the significance of the filters using the outputs of the next layer and pruned the insignificant filters based on this significance measurement.
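The sketch below illustrates L1-norm filter ranking in the spirit of Li et al. [109] on a single PyTorch Conv2d layer; the keep ratio is a hypothetical choice, and pruning the dependent kernels of the following layer as well as retraining are omitted.

```python
import torch
import torch.nn as nn

def prune_filters_l1(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    # Rank filters by the L1 norm of their weights (sum of absolute values)
    # and keep only the strongest ones. The next layer's kernels that read the
    # removed feature maps would also need pruning (omitted here).
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per filter
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = torch.topk(scores, n_keep).indices.sort().values

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
print(prune_filters_l1(conv, keep_ratio=0.25))   # Conv2d(64, 32, ...)
```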

3.2.4 Pruning Channels.

Unlike pruning filters, pruning channels removes the filters at the current layer and the kernels at the next layer associated with the pruned channels. The network slimming approach [114] pruned insignificant channels, producing compact models while keeping accuracy equivalent to the models prior to pruning. Insignificant channels were identified based on the scaling factors generated by the batch normalization of Ioffe and Szegedy [90]: after the initial training, the channels associated with relatively low scaling factors were pruned, and retraining was then performed to refine the network. He et al. [75] identified unimportant channels in a pre-trained CNN model using LASSO regression and pruned them. This channel pruning brought 5\(\times\) speedup on VGG16 with minor accuracy loss. Lin et al. [112] pruned unimportant channels at runtime based on a decision maker trained by reinforcement learning. Gao et al. [56] proposed another dynamic channel pruning method that dynamically skips the convolution operations associated with unimportant channels.

3.3 Compact Convolution

To improve resource efficiency such as the accuracy per operation and per parameter from computationally intensive convolution operations, many compact convolution methods have been proposed.

3.3.1 Squeezing Channel.

In 2016, Iandola et al. [89] proposed SqueezeNet, in which each network block uses fewer 1\(\times\)1 filters than input channels to reduce the network width in the squeeze stage and then utilizes multiple 1\(\times\)1 and 3\(\times\)3 kernels in the expansion stage. The computational complexity was significantly reduced by squeezing the width, while the accuracy was recovered in the expansion stage. SqueezeNet reduced the number of parameters by \(50 \times\) compared to AlexNet on ImageNet without losing accuracy, improving the accuracy per parameter. Gholami et al. [59] proposed SqueezeNext, which utilized separable convolutions in the expansion stage; a \(k \times k\) filter was divided into a \(k \times 1\) and a \(1 \times k\) filter. Such separable convolutions reduced the number of parameters further compared to SqueezeNet while maintaining AlexNet’s accuracy on ImageNet, improving the accuracy per parameter further.

3.3.2 Depth-Wise Separable Convolution.

Xception [38] utilized depth-wise separable convolutions, which replace 3D convolutions with 2D separable convolutions followed by 1D convolutions (i.e., point-wise convolutions) as shown in Figure 5, to reduce computational complexity. The 2D separable convolutions are performed separately with respect to different channels. Howard et al. [83] proposed MobileNet v1 that utilizes depth-wise separable convolutions with two hyperparameters, “width multiplier and resolution multiplier,” to fit CNNs to resource-constrained devices by fully leveraging the accuracy and resource tradeoff in the CNNs. MobileNet v1 showed equivalent accuracy to GoogleNet and VGG16 on the ImageNet dataset with less computational complexity, improving the accuracy per parameter and per operation.
Fig. 5.
Fig. 5. Depth-wise convolution used in the work of Howard et al. [83].
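The minimal PyTorch sketch below builds a depth-wise separable block like the one in Figure 5 (a depth-wise 3\(\times\)3 convolution with groups equal to the channel count, followed by a 1\(\times\)1 point-wise convolution) and compares its parameter count against a standard 3\(\times\)3 convolution; the channel sizes are hypothetical.

```python
import torch.nn as nn

def depthwise_separable(c_in, c_out, k=3):
    return nn.Sequential(
        # 2D depth-wise convolution: one k x k kernel per input channel.
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        # 1x1 point-wise convolution mixes the channels.
        nn.Conv2d(c_in, c_out, 1, bias=False),
    )

def n_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(128, 256, 3, padding=1, bias=False)
separable = depthwise_separable(128, 256)
print(n_params(standard), n_params(separable))   # 294,912 vs 33,920
```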

3.3.3 Linear Bottleneck Layer.

In general, the manifold of interest (i.e., the subspace formed by the set of activations at each layer) could be embedded in low-dimensional subspaces in CNNs. Inspired by this, Sandler et al. [142] proposed MobileNet v2 consisting of a series of bottleneck layer blocks. Each bottleneck layer block as shown in Figure 6 received lower-dimensional input, expanded the input to high-dimensional intermediate feature maps, and projected the high-dimensional intermediate features onto low-dimensional features. Keeping linearity for the output feature maps was crucial to avoid destroying information from non-linear activations, so linear activation functions were used at the end of each bottleneck block.
Fig. 6.
Fig. 6. Bottleneck layer block used in the work of Sandler et al. [142].

3.3.4 Group Convolution.

In a group convolution method, the input channels are divided into several groups, and the channels in each group participate in convolutions separately from the other groups. For example, input channels divided into three groups require three separate convolutions. Since a group convolution does not mix channels across groups, communication between different groups is performed after the separate convolutions. Group convolution methods [86, 87, 119, 193] reduce the number of MAC operations, improving the accuracy per operation compared to CNNs using regular convolution. In 2012, AlexNet utilized group convolution to train the CNN effectively on two NVIDIA GTX580 GPUs [100]. Surprisingly, AlexNet using group convolution showed superior accuracy to AlexNet using regular convolution, improving the accuracy per operation. In 2017, ResNext [177] utilized group convolution based on ResNet [72] with a cardinality parameter (i.e., the number of groups). In 2018, Zhang et al. [193] noticed that the point-wise convolutions in depth-wise separable convolutions were computationally intensive in practice and proposed ShuffleNet, which applies group convolution to every point-wise convolution to reduce compute complexity further compared to MobileNet v1. ShuffleNet shuffles the output channels from the grouped point-wise convolution so that different grouped convolutions can communicate, demonstrating superior accuracy to MobileNet v1 on the ImageNet and COCO datasets given the same arithmetic operation cost budget. Ma et al. [119] proposed ShuffleNet v2, which improved physical resource efficiency further compared to ShuffleNet [193] by employing equal channel width for input and output channels where applicable and minimizing the number of operations required for \(1 \times 1\) convolutions. Rather than choosing each group randomly and shuffling them, Huang et al. [86] proposed to learn each group for a group convolution during training. The “learned group convolution” was applied to DenseNet [87], and the resulting network improved the accuracy per parameter and per operation compared to ShuffleNet, given a prescribed accuracy.
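The PyTorch sketch below shows how the groups argument reduces the parameter (and MAC) count of a convolution by the number of groups, plus a simplified ShuffleNet-style channel shuffle; the channel counts and group number are hypothetical.

```python
import torch
import torch.nn as nn

c_in, c_out, groups = 128, 128, 4

regular = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)
grouped = nn.Conv2d(c_in, c_out, 3, padding=1, groups=groups, bias=False)
print(sum(p.numel() for p in regular.parameters()),   # 147,456
      sum(p.numel() for p in grouped.parameters()))   #  36,864 (1/groups)

def channel_shuffle(x, groups):
    # ShuffleNet-style shuffle so the next grouped convolution sees
    # channels coming from every group (a simplified sketch).
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

y = channel_shuffle(grouped(torch.randn(1, c_in, 16, 16)), groups)
```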

3.3.5 Octave Convolution.

Chen et al. [32] decomposed feature maps into a higher- and a lower-frequency part to save the feature maps’ memory footprint and reduce the computational cost. The decomposed feature maps were processed by a specific convolution called octave convolution, which performs convolutions on and between the higher- and lower-frequency parts. Applying octave convolution to the ResNet-152 architecture achieved higher accuracy on the ImageNet dataset than regular convolution, improving the accuracy per operation and per memory footprint.

3.3.6 Downsampling.

Qin et al. [132] applied a downsampling approach (e.g., a larger stride size for a convolution) to MobileNet v1, improving the top-1 accuracy by 5.5% over MobileNet v1 on the ILSVRC 2012 dataset, given a 12M arithmetic operations budget.

3.3.7 Low Rank Approximation.

Denton et al. [47] proposed a low rank approximation that compresses the kernel tensors in the convolutional layers and the weight matrices in the fully connected layers by using singular value decomposition. Another low rank approximation [98] used Tucker decomposition to compress the feature maps, resulting in significant reductions in the model size, the inference latency, and the energy consumption. Such low rank approximation methods improve the accuracy per parameter, per operation, and per memory footprint.
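The NumPy sketch below shows the basic idea of low rank approximation on a single weight matrix via truncated SVD, as used for fully connected layers; the matrix size, the chosen rank, and the random Gaussian weights are placeholders (real trained weight matrices are typically much closer to low rank, so the approximation error is smaller in practice).

```python
import numpy as np

def low_rank_compress(W, rank):
    # Truncated SVD: approximate W (m x n) by U_r (m x r) @ V_r (r x n),
    # reducing storage/MACs from m*n to r*(m + n).
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]          # fold the singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512)).astype(np.float32)
U_r, V_r = low_rank_compress(W, rank=64)
print(W.size, U_r.size + V_r.size)        # 524,288 vs 98,304 parameters
print(np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))   # relative reconstruction error
```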

3.4 Knowledge Distillation

The knowledge from a large-scale high-performing model (teacher network) can be transferred to a compact neural network (student network) to improve resource efficiency such as accuracy per parameter and per operation for inference tasks [23, 29, 138]. Buciluă et al. [23] utilized data with labels generated from the teacher model (i.e., a large-scale ensemble model) to train a compact neural network. The compact model was trained with the pseudo training data generated from the teacher model, demonstrating equivalent accuracy to the teacher model. Ba and Caruana [14] noticed that the softmax outputs often caused the student network to ignore the information about categories other than the one with the highest probability, and used the values prior to the softmax layer of the teacher network as training labels to allow the student network to learn from the teacher network more efficiently. Hinton et al. [78] added a “temperature” term for the labels to enrich the information from the teacher network and train the student network more efficiently compared to Ba and Caruana [14]. Romero et al. [138] utilized both labels and intermediate representations from a wider teacher network to compress it into a thinner and deeper student network. A “hint layer” was chosen from the teacher network, and a “guided layer” was chosen from the student network. The student network was then trained so that the deviation between the intermediate representations at the hint layer and the guided layer was minimized. The thinner student network employed \(10.4 \times\) fewer weight parameters than the wider teacher network while improving accuracy. This technique is also known as “hint learning.” Hint learning was applied to both the region proposal and classification components for object detection applications [29].
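A minimal NumPy sketch of temperature-softened distillation targets in the spirit of Hinton et al. [78] is shown below; the temperature value, the example logits, and the omission of the hard-label loss term and the usual \(T^2\) gradient scaling are simplifications for illustration.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_targets(teacher_logits, T=4.0):
    # Softened teacher probabilities: a higher temperature T spreads probability
    # mass over the non-top classes, exposing information the student can learn from.
    return softmax(teacher_logits, T)

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Cross-entropy between softened teacher and student distributions.
    p_t = distillation_targets(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    return -(p_t * log_p_s).sum(axis=-1).mean()

teacher = np.array([[8.0, 2.0, 1.0, -1.0]])   # hypothetical logits for 4 classes
student = np.array([[3.0, 1.5, 0.5, -0.5]])
print(distillation_targets(teacher), kd_loss(student, teacher))
```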

3.5 Neural Architecture Search for Compressed Models

Zoph and Le [198] proposed the Neural Architecture Search (NAS) technique to seek optimal DNN models in the space of hyperparameters of network width, depth, and resolution. When the compute resource budget is limited (e.g., on mobile devices), many NAS variants exploit the tradeoff between accuracy and latency to maximize resource efficiency given a compute resource budget [74, 157, 158, 159, 174]. He et al. [74] proposed a NAS employing reinforcement learning, AutoML, that sampled the least sufficient candidate design space to compress the DNN models. MnasNet [157] utilized reinforcement learning with a reward function balancing accuracy and latency to seek a compact neural network model. Wu et al. [174] proposed a gradient-based NAS that produced a DNN model with \(2.4\times\) model size reduction compared to MobileNet v2 without losing accuracy on the ImageNet dataset. Scheidegger et al. [143] proposed a narrow-space NAS to generate low-resource DNNs satisfying a strict memory budget and inference time requirement for IoT applications. Anderson et al. [13] noticed that conventional NAS might improve abstract resource efficiency rather than physical resource efficiency, and utilized hardware information including the inference latency in a NAS to ensure that the candidate models improve physical resource efficiency in practice. EfficientNet [158] utilized a NAS with compound scaling of depth, width, and resolution to seek optimal DNN models given fixed compute resource budgets. Another NAS utilizing compound scaling, EfficientDet, was proposed for object detection applications [159]. EfficientDet improved accuracy on the COCO dataset with 4 to 9\(\times\) model size reduction compared to state-of-the-art object detectors, improving the accuracy per parameter. Recently, Cai et al. [25] proposed a feed-forward NAS approach that produces a customized DNN given compute resource and latency constraints.

4 Arithmetic-Level Resource-Efficient Techniques

Utilizing lower-precision arithmetic reduces the memory footprint and the time spent transferring data across buses and interconnects [52, 125, 182, 197]. Employing the least sufficient arithmetic precision for CNN applications can improve the accuracy per memory footprint and the accuracy per memory access. We categorize the arithmetic-level resource-efficient techniques into two categories as shown in Figure 7: arithmetic-level techniques for inference and arithmetic-level techniques for training. We discuss the different number formats first and the deployment of these number formats in CNNs later.
Fig. 7.
Fig. 7. Categorization of arithmetic-level resource-efficient techniques.

4.1 Number Formats for CNNs

This section describes various number formats for CNN applications, as preparation for explaining the arithmetic-level resource-efficient techniques. The Fixed-Point (FiP) number format places the binary point at a fixed position between the integer and fraction parts. For example, an eight-bit FiP format, “01.100000,” represents 1.5 in decimal (i.e., \(0\times 2^{1}+ 1\times 2^0 + 1\times 2^{-1} + 0\times 2^{-2} + \cdots + 0\times 2^{-6}\)), and the point between the integer part and the fraction part stays fixed throughout arithmetic operations. Therefore, FiP can be implemented with simple circuits, but the available data range is quite limited [130].
We use the IEEE 754 general-purpose FP standard [4] as our example to explain the FP format and its arithmetic, since this standard is used in most commercially available CPUs and GPUs. The IEEE 754 Floating-Point (IFP) data format [4] consists of a sign, an exponent, and a significand as shown in Equation (5). An FP number has a \((p+1)\)-bit significand (including the hidden one), an e-bit exponent, and one sign bit. The machine epsilon \(\epsilon _{mach}\) is defined as \(2^{-(p+1)}\). The value represented by FP is as follows:
\begin{equation} y_{out} = {\left\lbrace \begin{array}{ll} \text{normal mode:} & (-1)^{sign}\times (1\times 2^0 + d_1\times 2^{-1}+ \cdots + d_p\times 2^{-p}) \times 2^{exponent - bias}\\ \text{subnormal mode:} & (-1)^{sign}\times (d_1\times 2^{-1}+ \cdots + d_p\times 2^{-p}) \times 2^{1 - bias}, \end{array}\right.} \end{equation}
(5)
where \(d_1\), ..., \(d_p\) represent binary digits, the ‘1’ associated with the coefficient \(2^0\) is referred to as the hidden ‘1’, the exponent is stored in offset notation, and the bias is a positive constant. If the stored exponent field is zero, the FP value is represented in subnormal mode. The IEEE 754 standard requires exact rounding for addition, subtraction, multiplication, and division; the FP arithmetic result must be identical to the one obtained by rounding the exactly calculated result. For example, under the IEEE 754 round-to-nearest mode, FP arithmetic follows Equation (6):
\begin{equation} fl(x_1 \odot x_2) = (x_1 \odot x_2) (1 + \epsilon _r), \end{equation}
(6)
where \(|\epsilon _r| \le \epsilon _{mach}\) , \(\odot\) is one of the four arithmetic operations, and \(fl(\cdot)\) represents the result from the FP arithmetic. Notice that quantization quantizes data to lower precision, whereas arithmetic is a rule applied to arithmetic operations between the two operands. For example, the quantization affects the values for the two operands, \(x_1\) and \(x_2,\) in Equation (6), whereas arithmetic affects the rounding error, \(\epsilon _r\) .
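As an illustration of Equations (5) and (6), the NumPy sketch below decodes 16-bit IEEE patterns (p = 10, e = 5, bias = 15; NaN/Inf handling omitted) and checks the relative rounding-error bound on one half-precision division; the chosen bit patterns are arbitrary examples.

```python
import numpy as np

def decode_ifp16(bits: int) -> float:
    # Decode a 16-bit IEEE 754 pattern with Equation (5): p = 10, e = 5, bias = 15.
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exponent == 0:                               # subnormal mode
        value = (frac / 2**10) * 2**(1 - 15)
    else:                                           # normal mode (NaN/Inf ignored here)
        value = (1 + frac / 2**10) * 2**(exponent - 15)
    return -value if sign else value

print(decode_ifp16(0x3C00))   # 1.0
print(decode_ifp16(0x3555))   # ~0.3333, the nearest half-precision value to 1/3

# Equation (6): the relative rounding error of one operation is bounded by eps_mach.
eps_mach = 2.0**-11
x1, x2 = np.float16(1.0), np.float16(3.0)
rel_err = abs(float(x1 / x2) - 1/3) / (1/3)
print(rel_err <= eps_mach)    # True
```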

4.1.1 Half, Single, and Double Precision.

The IEEE FP 32- (IFP32 or single-precision) and 64-bit (IFP64 or double-precision) versions are available on most off-the-shelf conventional processors. Additionally, the IEEE 754 standard includes a 16-bit FP format (IFP16 or half precision) [4]. \(p=52\) , \(e=11\) , and \(bias = 1023\) for IFP64, \(p=23\) , \(e=8\) , and \(bias=127\) for IFP32, and \(p=10\) , \(e=5\) , and \(bias=15\) for IFP16. IFP16 is currently supported in hardware on some modern GPUs to accelerate DNN applications [40, 79].

4.1.2 Brain FP Using 16 Bits (BFloat16).

In 2018, a 16-bit Brain FP format [24, 186] was proposed that was tailored to CNN applications. BFloat16 consists of an eight-bit exponent and a seven-bit significand, supporting a wider dynamic data range than IFP16. BFloat16 is currently supported in hardware in Intel Cooper Lake Xeon processors, NVIDIA A100 GPUs, and Google TPUs.
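A small NumPy sketch of the relationship between IFP32 and BFloat16 is given below; it simply truncates the lower 16 bits of the IFP32 bit pattern (real converters typically round to nearest even, so this is a simplification), keeping the 8-bit exponent and thus the full IFP32 dynamic range.

```python
import numpy as np

def float32_to_bfloat16_trunc(x: np.ndarray) -> np.ndarray:
    # BFloat16 keeps IFP32's 8-bit exponent but only a 7-bit stored significand,
    # so a simple conversion just drops the low 16 bits of the IFP32 pattern.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

x = np.array([1/3, 1000.123, 1e-30], dtype=np.float32)
print(float32_to_bfloat16_trunc(x))   # coarser values, but the same dynamic range
```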

4.1.3 DLFloat.

In the race to design FP formats specific to CNNs, some authors [6, 167] proposed another 16-bit format, DLFloat, consisting of a 6-bit exponent and a 9-bit significand, to provide a better balance between dynamic data range and precision than the IFP16 and BFloat16 formats for some CNN applications.

4.1.4 TensorFloat32 (TF32).

NVIDIA proposed a 19-bit data format, TF32, consisting of a 1-bit sign, an 8-bit exponent, and a 10-bit significand to accelerate CNN applications on A100 GPUs with the same dynamic range support as IFP32 [51]. TF32 tensor cores in an A100 truncate IFP32 operands to 19-bit TF32 format but accumulate them using IFP32 arithmetic for MAC operations.

4.2 Arithmetic-Level Techniques for Inference

This section discusses various resource-efficient arithmetic-level techniques based on pre-trained CNNs for the inference tasks.

4.2.1 Lower-Precision Arithmetic.

Lower-precision FiP arithmetic has been widely used to deploy CNNs on edge devices [179]. The effects of deploying various lower-precision arithmetic to the CNN inference tasks were explored in terms of accuracy and latency [168, 175]. The BitFusion method accelerated CNN inference tasks by employing variable bit-width FiP formats dynamically depending on the different layers [148]. Similarly, Tambe et al. [156] proposed AdaptiveFloat that adjusted dynamic ranges of FP numbers depending on the different layers, resulting in higher energy efficiency than FiP-based methods, given the same accuracy requirement.

4.2.2 Encoding Weights and Using a Lookup Table.

Bordawekar et al. [21] leveraged the fact that the exponent values of most of the weights were located within a narrow range and encoded the frequent exponent values of the weights with fewer bits using a Huffman coding scheme, improving the accuracy per memory footprint for natural language processing applications. A lookup table, located between the memory and FP arithmetic units, is used to convert the encoded exponent values into FP exponent values.

4.2.3 Applying Various Number Format Quantizations to CNNs.

The Residue Number System (RNS) is a parallel and carry-limited number system that transforms a large natural number into several smaller residues. Therefore, RNS was often used to perform parallel and independent calculations on residues without carry propagation, and this parallelism was exploited to accelerate CNN computation [28]. In an RNS-based CNN, the weights of a pre-trained model were transformed to the RNS representation. Recently, RNS was used to replace costly multiplication operations with simple logical operations such as multiplexing and shifting, accelerating CNN applications [115, 140, 141]. The Logarithmic Number System applies the logarithm to the absolute values of the real numbers [130]. The main advantage of the Logarithmic Number System is its capability of transforming multiplications into additions and divisions into subtractions. In 2018, Vogel et al. [165] utilized a five-bit logarithmic format with arbitrary log bases to improve resource efficiency such as accuracy per memory footprint and per operation by replacing costly multiplication operations with simple bit-shift operations. The Posit number format [66] employs multiple separate exponent fields to represent the dynamic range effectively. Recently, CNNs utilizing Posit showed higher accuracy than various FP8 formats on the Mushroom and Iris datasets [26, 27].

4.3 Arithmetic-Level Techniques for Training

This section discusses arithmetic-level resource-efficient techniques used for CNN training tasks. Training CNNs generally requires higher-precision arithmetic due to extremely small weight gradient values [41, 196]. Adjusting arithmetic precision according to different training procedures such as forward propagation, activation gradient updates, and weight updates can accelerate CNN training [64]. Training quantized CNNs often required stochastic rounding schemes [65, 176, 182, 195].

4.3.1 Mixed-Precision Training.

Conventional mixed-precision training applies lower-precision arithmetic to the multiplications in MACs, in both the forward and backward paths, and higher-precision arithmetic to the accumulations in the MACs over the lower-precision quantized operands [95, 125]. The higher-precision outcomes from the MACs are quantized back to a lower-precision format to be used in subsequent operations. In the following (X + Y) formats, X represents the data format used for the MAC operands and Y represents the arithmetic applied to the accumulations in the MAC operations (refer to the work of Gupta and Ranga [64] for details on lower- and higher-precision arithmetic usage).
IFP16 + IFP32. In 2018, Micikevicius et al. [125] noticed that the weights were updated using very small weight gradient values, and applied a lower-precision arithmetic IFP16 to the multiplications and a higher-precision IFP32 to the accumulations for the weight updates. For example, in the mixed-precision training approach in their work [125], IFP16 was used to store weights, activations, activation gradients, and weight gradients, whereas IFP32 was used to keep the weight copies for their updates. Along with accumulating IFP16 operands using IFP32 arithmetic, the use of loss scaling allowed the mixed-precision training to achieve equivalent accuracy to the IFP32 training while reducing the memory footprint.
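A minimal NumPy sketch of this (IFP16 + IFP32) recipe on a toy least-squares problem is shown below: FP16 operands, FP32 accumulation of the MAC, a fixed loss scale so small gradients stay representable when stored in FP16, and an FP32 master copy of the weights. The problem sizes, the constant loss scale, and the single linear layer are assumptions for illustration; real mixed-precision training typically uses dynamic loss scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
W_master = rng.standard_normal((64, 1)).astype(np.float32)   # FP32 master copy
X = rng.standard_normal((256, 64)).astype(np.float16)        # FP16 activations
y = (X.astype(np.float32) @ (0.1 * np.ones((64, 1), np.float32))).astype(np.float16)
loss_scale, lr = 1024.0, 0.01                                 # fixed loss scale (simplified)

for step in range(200):
    W16 = W_master.astype(np.float16)                # FP16 weights for the forward pass
    err = (X @ W16) - y                              # FP16 forward pass and error
    # Scale the loss so tiny gradients survive when stored in FP16;
    # accumulate the MAC in FP32, as in the (IFP16 + IFP32) recipe.
    grad16 = (X.T.astype(np.float32) @ (err.astype(np.float32) * loss_scale)
              / len(X)).astype(np.float16)
    W_master -= lr * grad16.astype(np.float32) / loss_scale   # unscale, update FP32 master
```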
BFloat16 + IFP32. In 2018, mixed-precision CNN training using (BFloat16 + IFP32) was explored by Ying et al. [186]. In 2019, Kalamkar et al. [95] studied BFloat16’s feasibility for mixed-precision training for various CNNs including AlexNet, ResNet, and GAN, among others, and concluded that the (BFloat16 + IFP32) scheme outperformed the (IFP16 + IFP32) scheme since BFloat16 could represent the same dynamic range of data as IFP32 while using fewer bits.
FP8 + DLFloat. In 2018, Wang et al. [167] proposed a mixed-precision training method that applies the 5eFP8 format (one sign bit, a five-bit exponent, and two bits for the significand) to the multiplications and DLFloat to the accumulations in MAC operations. The mixed-precision method improved resource efficiency such as accuracy per memory footprint and accuracy per memory access compared to various (FP16 + IFP32) schemes with different FP16 formats. Compared to the previous (FP16 + IFP32) methods, chunk-based accumulation and stochastic rounding schemes were additionally used to minimize the accuracy loss in the work of Wang et al. [167]. The chunk-based accumulation used 64 values per chunk instead of one long sequential accumulation to reduce rounding errors. Utilizing a stochastic rounding scheme with a limited-precision format for deep learning was proposed earlier by Gupta et al. [65]. Sun et al. [153] noted that (5eFP8 + DLFloat) training degraded accuracy for CNNs utilizing depth-wise convolutions such as MobileNets. To overcome this issue, Sun et al. [153] proposed to employ two different eight-bit FP formats, one for forward and one for backward propagation, to minimize the accuracy degradation for compressed CNNs. This mixed-precision training used 5eFP8 for backpropagation and another eight-bit FP format with (sign, exponent, significand) = (1, 4, 3), 4eFP8, for forward propagation.
DLFloat Only. In 2019, Agrawal et al. [6] employed DLFloat for the entire training procedure, removing the need for data conversions between the multiplications and the accumulations, and found that DLFloat could provide a better balance between dynamic range and precision than IFP16 and BFloat16 for LSTM networks [80] on the Penn Treebank dataset. The DLFloat arithmetic units removed the subnormal mode and supported only the round-to-nearest-up mode to minimize computational complexity. In the work of Agrawal et al. [6], DLFloat arithmetic showed performance equivalent to IFP32 on ResNet-32 using CIFAR-10 and ResNet-50 using ImageNet while using half the bit width of IFP32.
INT8 Only. Yang et al. [182] noticed that previously proposed mixed-precision training schemes did not quantize the data in the batch normalization layer, requiring high-precision FP arithmetic in some parts of the data path. To overcome this issue, a unified INT8-based quantization framework was used to quantize all data paths in a CNN, including weights, activations, gradients, batch normalization, and weight updates, among others, into INT8-based data. However, this training method degraded the accuracy to some extent. In 2020, Zhu et al. [197] improved the accuracy compared to the work of Yang et al. [182] while keeping a unified INT8-based quantization framework. In the work of Zhu et al. [197], the deviation of the activation gradient direction between before and after quantization was minimized by measuring the distance at runtime based on the inner product of the two normalized gradient vectors generated before and after quantization.
Four-Bit Data Only (INT4 + FP4). In 2020, Sun et al. [154] proposed a new training technique using only four bits to represent weights, activations, and their gradient values, requiring mixed-precision INT4-FP4 MAC operations. The weights and activations are stored using INT4 format since their data range is not wide, whereas their gradient values are stored using four-bit FP format (FP4) since their data range is wide. The proposed FP4 format uses one bit for the sign bit and three bits for the exponent (i.e., no mantissa bit) with radix-4 to represent wide-ranged gradient values. Therefore, this training method can improve the accuracy per memory footprint and memory access. A layer-wise gradient scaling approach with a customized rounding scheme was employed to minimize the impact of large quantization errors caused by the four-bit representation for gradient values on the accuracy. This training method brought minor accuracy loss using CIFAR10 and ImageNet datasets with a ResNet18, compared to IFP32.
Layer-Wise Adaptive FiP Training. In 2020, Zhang et al. [192] proposed a layer-wise adaptive quantization scheme. For example, activation gradient distributions at fully connected layers followed a narrower distribution, requiring more bit width for the quantizations. AlexNet was quantized using INT8 for all the weights and activations and both INT8 (22%) and INT16 (78%) for the activation gradients. The quantized AlexNet achieved equivalent accuracy to the one using IFP32 for entire training on the ImageNet dataset.

4.3.2 Block Floating-Point Training.

Block Floating-Point (BFP) format utilizes a shared exponent for a series of numbers in a data block to reduce the data size [170]. Applying BFP to CNNs can improve resource efficiency in terms of accuracy per memory footprint and per memory access. In addition, BFP requires fewer transistors because its adders and multipliers are simpler than FP adders and multipliers, improving the accuracy per Joule. Various CNN training methods using BFP have been proposed to improve resource efficiency.
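The NumPy sketch below quantizes one block of values to a BFP-like representation: a single shared exponent taken from the block maximum and a short signed-integer significand per value. The 8-bit significand, the 64-element block size, and the exponent choice are assumptions for illustration, not a specific published format.

```python
import numpy as np

def to_bfp(block, sig_bits=8):
    # Block Floating Point: one shared exponent per block, short integer significands.
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(block)) + 1e-30)))
    scale = 2.0 ** (shared_exp - (sig_bits - 1))          # LSB weight for this block
    sig = np.clip(np.round(block / scale),
                  -(2 ** (sig_bits - 1)), 2 ** (sig_bits - 1) - 1).astype(np.int32)
    return sig, shared_exp, scale

def from_bfp(sig, scale):
    return sig.astype(np.float32) * scale

rng = np.random.default_rng(0)
block = rng.standard_normal(64).astype(np.float32) * 0.05
sig, exp, scale = to_bfp(block, sig_bits=8)
print(np.max(np.abs(block - from_bfp(sig, scale))))       # small quantization error
```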
Flexpoint. A DNN-optimized BFP format, Flexpoint [99], was proposed by Intel and used with the Nervana neural processors. This BFP format used 5 bits for a shared exponent and 16 bits for each significand in a data block. Flexpoint utilized the format (Flex N) + M, where Flex N represents a variable number of bits for the shared exponent that can change across epochs, and M represents the number of bits for each separate significand. For example, the number of exponent bits is adapted based on the dynamic range of the weight values over the iterations; the dynamic range of the weight values at the current iteration is predicted at the previous iteration. The (Flex N + 16) format produced accuracy equivalent to IFP32 for AlexNet on the ImageNet dataset and for a ResNet on the CIFAR-10 dataset, resulting in significant resource efficiency improvement in terms of accuracy per memory footprint and accuracy per memory access.
BFP + FP Training. Drumond et al. [48] proposed a hybrid use of BFP and FP for CNN training that uses BFP only for MAC operations and FP for the other operations. Such a hybrid training method brought an 8.5 \(\times\) potential throughput improvement with minor accuracy loss for WideResNet28-10 on the CIFAR-100 dataset using a Stratix V FPGA.
Block MiniFloat. Fox et al. [52] noticed that ordinary BFP formats are limited in minimizing data loss with fewer bits and in improving arithmetic density per memory access for CNN applications. To address these two issues, they proposed Block Minifloat (BM) along with a customized hardware circuit design. The BM<e,m> format follows:
\begin{equation} y_{out} = {\left\lbrace \begin{array}{ll} \text{normal mode:} & (-1)^{sign}\times (1\times 2^0 + d_1\times 2^{-1}+ \cdots + d_m\times 2^{-m}) \times 2^{exponent - bias - BIAS_{SE}}\\ \text{subnormal mode:} & (-1)^{sign}\times (d_1\times 2^{-1}+ \cdots + d_m\times 2^{-m}) \times 2^{1 - bias - BIAS_{SE}}, \end{array}\right.} \tag{7} \end{equation}
where \(bias = 2^{e-1} - 1\) and \(BIAS_{SE}\) is a shared exponent value. \(BIAS_{SE}\) is scaled according to the maximum value of the data involved in dot-product operations. For example, BM<2,3> represents a six-bit data format having one sign bit, a two-bit exponent, and a three-bit significand. Such BM variant formats were applied to training. These six-bit BM formats produced equivalent accuracy to IFP32 with far fewer bits on the CIFAR-10 and CIFAR-100 datasets with ResNet-18, resulting in reduced memory traffic and low energy consumption. Therefore, BM improved the resource efficiency in terms of accuracy per memory access.
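The following sketch decodes a BM<e,m> value directly from Equation (7). Treating the all-zero exponent field as the subnormal mode, and the example inputs, are assumptions for illustration; the hardware datapath in the work of Fox et al. [52] differs.

```python
def bm_decode(sign_bit, exp_field, frac_bits, e, bias_se):
    """Decode a BM<e,m> value per Equation (7): `exp_field` is the e-bit biased
    exponent, `frac_bits` the mantissa digits d_1..d_m, `bias_se` the shared
    exponent BIAS_SE of the block."""
    bias = 2 ** (e - 1) - 1
    frac = sum(d * 2.0 ** (-(i + 1)) for i, d in enumerate(frac_bits))
    sign = -1.0 if sign_bit else 1.0
    if exp_field == 0:                              # subnormal mode (assumed encoding)
        return sign * frac * 2.0 ** (1 - bias - bias_se)
    return sign * (1.0 + frac) * 2.0 ** (exp_field - bias - bias_se)

# BM<2,3>: 1 sign bit + 2 exponent bits + 3 significand bits = 6 bits in total.
print(bm_decode(0, 0b10, [1, 0, 1], e=2, bias_se=0))    # normal value: 3.25
print(bm_decode(1, 0b00, [0, 1, 1], e=2, bias_se=0))    # subnormal value: -0.375
```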

5 Implementation-Level Resource-Efficient Techniques

Figure 8 classifies the implementation-level resource-efficient techniques. Most implementation-level techniques have focused on improving energy efficiency and computational speed for MAC operations, since MACs generally occupy more than 90% of the computational workload for both training and inference tasks in CNNs [155]. The implementation-level resource-efficient techniques exploit characteristics of the MACs in CNNs, including data reuse, sparsity of weights and activations, and weight repetition in quantized CNNs.
Fig. 8.
Fig. 8. Categorization of implementation-level resource-efficient techniques.

5.1 Leveraging Data Reuse from Convolution

The weights and the activations are heavily reused in convolution operations. For example, the weights of a filter are reused \(((H - k_H)/stride + 1)\times ((W - k_W)/stride+1)\) times, where \(H=W=4\) (height and width of the input channel) and \(k_H = k_W = 3\) (height and width of a kernel) in Figure 3. Generally, H and W are on the order of hundreds (e.g., 128 or 256), \(k_H\) and \(k_W\) are single digits (e.g., 3 or 5), and the stride is either 1 or 2. For example, if \(H = W= 128\) , \(k_H = k_W= 3\) , and \(stride = 1\) , each filter is reused \(126 \times 126 =\) 15,876 times for convolution operations. Each input element at a convolutional layer is also reused approximately \(M \times k_{H} \times k_{W}\) times, where M is the number of kernels used in the layer. Figure 9 describes the data access patterns for the MAC operations used in convolutional layers. In each MAC in Figure 9(a), the data a, b, and c are read from memory for the multiply-and-add computation, and the result d is written back to memory, where c contains a partial sum for the MAC. To save energy, highly reused data for MACs can be stored in small local memory as shown in Figure 9(b). The power consumption required to access data depends on where the data are located; accessing data from off-chip memory (DRAM) generally requires about two orders of magnitude more energy than accessing on-chip memory [31]. Many research works have presented how to leverage such data reuse properties to improve resource efficiency.
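The reuse factors above can be computed directly, as in the following sketch (no padding assumed; for stride 1 the counts match the examples in the text).

```python
def weight_reuse(H, W, kH, kW, stride=1):
    """Number of output positions at which a filter (and hence each of its weights)
    is reused in a convolution without padding."""
    return ((H - kH) // stride + 1) * ((W - kW) // stride + 1)

def activation_reuse(M, kH, kW):
    """Approximate reuse of one input element: it contributes to M output channels
    and to up to kH*kW output positions per channel."""
    return M * kH * kW

print(weight_reuse(128, 128, 3, 3))         # 126 * 126 = 15876 reuses per filter
print(activation_reuse(M=64, kH=3, kW=3))   # 576 reuses per input element
```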
Fig. 9.
Fig. 9. MAC dataflow.

5.1.1 Employing SRAM Local Memory Near PEs.

The use of SRAM buffers reduces the energy consumed by CNNs by up to two orders of magnitude compared to DRAM. Similar to Figure 9(b), the DianNao architecture [30] employed one Neural Functional Unit (NFU) integrated with three separate local buffers, holding 16 input neurons, \(16 \times 16\) weights, and 16 output neurons, respectively, to optimize the circuitry for MAC operations. The weights and the activations stored in the local memories were reused efficiently with the help of internal registers that hold the partial sums and a circular buffer. The NFU is a three-stage pipelined architecture consisting of the multiplication, adder-tree, and activation stages. In the multiplication stage, 256 multipliers perform the multiplications defined by the weight connections between 16 input and 16 output neurons. In the adder-tree stage, 16 feature map values are generated from the multiplications based on an adder-tree structure. In the activation stage, the 16 feature map values are approximated into the 16 activations by using piece-wise linear function approximation. DianNao with a 65-nm ASIC layout brought up to \(118 \times\) speedup and reduced the energy consumption by \(21 \times\) compared to a 128-bit 2-GHz SIMD processor over the benchmarks used in the work of Chen et al. [30]. A follow-up study adapted DianNao to a supercomputer scale and named it DaDianNao [33]. Since the number of weights is generally larger than the number of input activations for convolution operations, DaDianNao stored a big chunk of weights and shared them among multiple NFUs by using central embedded DRAM to reduce the data movement cost of delivering the weights associated with each NFU.
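A behavioral sketch of such a three-stage datapath is shown below. The piecewise-linear segments are illustrative values, and the sketch models only the dataflow, not the cycle-accurate behavior of the NFU.

```python
import numpy as np

def nfu_forward(inputs, weights, breaks, slopes, intercepts):
    """Behavioural sketch of a DianNao-style NFU pass: a multiplier stage (16x16
    products), an adder-tree stage (16 partial sums), and an activation stage
    using a piecewise-linear approximation."""
    products = weights * inputs[np.newaxis, :]       # stage 1: 256 multiplications
    sums = products.sum(axis=1)                      # stage 2: 16 adder trees
    seg = np.searchsorted(breaks, sums)              # stage 3: pick a linear segment
    return slopes[seg] * sums + intercepts[seg]

# Piecewise-linear approximation of a sigmoid-like activation (illustrative values).
breaks = np.array([-2.0, 2.0])
slopes = np.array([0.0, 0.25, 0.0])
intercepts = np.array([0.0, 0.5, 1.0])

x = np.random.randn(16)
W = np.random.randn(16, 16) * 0.1
print(nfu_forward(x, W, breaks, slopes, intercepts))
```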

5.1.2 Leveraging Spatial Architecture.

Designing PEs and their local memory according to the data reuse properties of MAC operations improved energy efficiency on FPGAs and ASICs [31, 34]. For example, the Google TPU employs a systolic array architecture that sends data directly to an adjacent PE as shown in Figure 9(c) [3]. Chen et al. [31] noticed that the computational throughput and energy consumption of CNNs mainly depended on data movement rather than computation and proposed a “row-stationary” spatial architecture (a variant of Figure 9(c)), Eyeriss, to support parallel processing with minimal data movement energy cost by fully exploiting the data reuse property. For example, the three PEs in the first column in Figure 9(c) can be assigned to compute the first row of the convolution output using a \(3\times 3\) filter: the three elements on each row of the kernel are stored in the local memory of each PE (i.e., the “row-stationary” structure in the work of Sze et al. [155]), and all the elements in the kernel are reused during convolution, generating the first row of the output. In this case, the partial sum values are stored back to the local memory on each PE.

5.1.3 Circuit Optimization.

Exploring binary weights [41] with binary inputs offered the opportunity to use XNOR gates for the efficient implementation of a CNN [134], improving the accuracy per memory footprint and per Joule. In 2021, Zhao et al. [194] proposed hardware-friendly statistic-based quantization units and near-data processing engines to accelerate mixed-precision training schemes by minimizing the number of accesses to higher-precision data.
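The sketch below shows the XNOR-plus-popcount identity that such binary layers rely on, written for unpacked bit vectors; production kernels pack the bits into machine words and use hardware popcount.

```python
import numpy as np

def binary_dot_xnor(w_bits, a_bits):
    """Dot product of {-1,+1} vectors encoded as {0,1} bits, computed as
    2 * popcount(XNOR(w, a)) - N."""
    n = len(w_bits)
    matches = np.sum(~(w_bits ^ a_bits) & 1)    # XNOR, then popcount
    return 2 * int(matches) - n

w = np.random.randint(0, 2, 64)
a = np.random.randint(0, 2, 64)
reference = np.dot(2 * w - 1, 2 * a - 1)        # the same dot product on {-1,+1}
print(binary_dot_xnor(w, a), reference)
```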

5.2 Leveraging Fast Convolution Algorithms

For commercially available CPUs or GPUs, transforming convolution operations into matrix multiplications can leverage data reuse properties to accelerate the convolution operations by utilizing highly optimized BLAS libraries [163]; unrolling the weights of a 3D filter into a 1D vector transforms 3D convolution operations into 2D matrix multiplications (a.k.a. im2col convolution). Mathieu et al. [121] implemented circular convolutions with a Fast Fourier Transform (FFT) algorithm for CNNs. Using FFT, a circular convolution can be implemented with element-wise multiplications between the spectra of the image and the kernel in the frequency domain. A single FFT-based convolution between an \(N \times N\) resolution image and a \(k \times k\) kernel requires \(N^2(6log_2N + 4)\) multiplications (= \(6N^2log_2N\) for two FFTs and one inverse FFT \(+\) \(4N^2\) for element-wise multiplications between complex numbers), whereas conventional convolution takes \(N^2k^2\) multiplications. Since the number of multiplications required for an FFT-based convolution does not depend on the kernel size, FFT-based convolutions are useful when kernel sizes are relatively large (e.g., \(k^2 \gt (6log_2N + 4)\) ). Vasilache et al. [164] demonstrated that CNNs implemented with NVIDIA cuFFT showed 1.4 to 14.5 \(\times\) speedups over cuDNN for various kernel sizes. Highlander and Rodriguez [77] implemented FFT with an overlap-and-add algorithm to reduce the FFT cost from \(O(N^2log_2N)\) to \(O(N^2log_2k)\) , making FFT-based convolutions applicable to smaller kernels. Zhang and Prasanna [191] customized the FFT-based convolution architecture on FPGAs. Notice that such FFT-based convolutions achieve higher accuracy per operation at the cost of additional memory footprint. Lavin and Gray [102] leveraged a Winograd minimal filtering algorithm [173] to reduce the number of multiplication operations required for convolutions and thereby improve the accuracy per operation. Lu et al. [117] proposed an efficient architecture for Winograd-based convolutions on FPGAs. Cong and Xiao [39] reformulated convolution operations between an input tensor and multiple filters into a single matrix multiplication form, named convolution matrix multiplication, to reduce the number of multiplications with the Strassen algorithm.
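As a small illustration of the FFT approach, the sketch below computes a circular convolution via element-wise multiplication in the frequency domain and compares the rough multiplication counts quoted above; it is a NumPy sketch, not an optimized kernel.

```python
import numpy as np

def circular_conv2d_fft(image, kernel):
    """2D circular convolution via element-wise multiplication in the frequency
    domain; the kernel is zero-padded to the image size."""
    padded = np.zeros_like(image)
    padded[:kernel.shape[0], :kernel.shape[1]] = kernel
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

N, k = 64, 5
img = np.random.randn(N, N)
ker = np.random.randn(k, k)
out = circular_conv2d_fft(img, ker)

# Rough multiplication counts from the text: FFT-based vs. direct convolution.
fft_muls = N ** 2 * (6 * np.log2(N) + 4)
direct_muls = N ** 2 * k ** 2
print(out.shape, fft_muls, direct_muls)   # FFT wins once k*k > 6*log2(N) + 4
```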

5.3 Leveraging Sparsity of Weights and Activations

In the forward pass, negative feature map values are converted to zeros by ReLU activation functions, making the activation data structure sparse. In addition, the trained weight values follow a sharp Gaussian distribution centered at zero, locating most of the weights near zero. Quantizing such weights makes the weight data structure sparse, so the sparsity of the weights can be fully exploited in quantized networks such as binarized CNNs [41, 42] and ternary weight CNNs [108, 196].

5.3.1 Skipping Operations During Runtime.

In 2016, several methods to conditionally skip MAC operations were proposed concurrently [10, 31, 113]. Eyeriss [31] employed clock gating to block convolution operations during runtime when either the weight or the activation was detected to be zero, saving computational power. Cnvlutin [10] skipped MAC operations associated with zero activations by employing separate “neuron lanes” for different channels. Similarly, Liu et al. [113] proposed Cambricon-X, which fetches only the activations associated with non-zero weights for convolutions by using “step indexing,” skipping the MAC operations associated with zero weights. Cambricon-X brought a 6 \(\times\) resource efficiency improvement in terms of accuracy per Joule compared to the original DianNao architecture. In 2017, Kim et al. [96] proposed ZeNA, which performs MAC operations only if both the weight and the activation are non-zero. In 2018, Akhlaghi et al. [8] proposed a runtime technique, SnaPEA, that performs the MAC operations associated with positive weights first and those with negative weights later while monitoring the sign of the partial sum. Since the activation values from ReLU are always greater than or equal to zero, the convolution operation can be terminated once the partial sum becomes negative. Notice that such a decision must be made during runtime, since the zero-valued activation patterns depend on the test images. In 2021, another zero-skipping method, GoSPA [44], was proposed, which is similar to ZeNA in that MAC operations are performed only when both the input activation and the weight are non-zero. Deng et al. [44] constructed a “Static Sparsity Filter” module by leveraging the property that the weight values are static while the activation values are dynamic, filtering out zero activations associated with non-zero weights on the fly before MAC operations. Such operation skipping techniques improved the accuracy per Joule, since the transistors associated with skipped operations are not toggled during runtime, saving dynamic power consumption.
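A software sketch of the zero-skipping idea is shown below; the cited accelerators perform the check in the hardware dataflow rather than in a software loop.

```python
import numpy as np

def mac_with_zero_skipping(weights, activations):
    """Accumulate only the products whose weight AND activation are non-zero,
    and report how many MACs were skipped."""
    acc, skipped = 0.0, 0
    for w, a in zip(weights, activations):
        if w == 0.0 or a == 0.0:
            skipped += 1            # in hardware, the PE is simply not toggled
            continue
        acc += w * a
    return acc, skipped

w = np.where(np.random.rand(100) < 0.6, 0.0, np.random.randn(100))   # pruned weights
a = np.maximum(np.random.randn(100), 0.0)                            # post-ReLU activations
result, skipped = mac_with_zero_skipping(w, a)
print(result, f"{skipped}/100 MACs skipped")
```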

5.3.2 Encoding Sparse Weights/Activations/Feature Maps.

Since memory access operations dominate the power consumption in CNN applications, fetching the weights less frequently from memory by encoding and compressing the weights and the activations can improve resource efficiency metrics such as the accuracy per memory footprint, per memory access, and per Joule. Han et al. [68, 69] utilized the Huffman encoding scheme to compress the weights. The quantized CNN reduced the memory footprint of AlexNet on the ImageNet dataset by 35 \(\times\) without losing accuracy. In the work of Han et al. [68, 69], a three-stage compression pipeline was applied to reduce the memory footprint of CNNs as follows. The pruned sparse quantized weights were stored in Compressed Sparse Row (CSR) format and then divided into several groups. The weights in the same group were shared with the average value over the weights in the group, and they were retrained thereafter. Huffman coding was used to compress the weights further. Parashar et al. [129] employed an encoding scheme to compress sparse weights and activations and designed an associated dataflow, SCNN (Compressed-Sparse Convolutional Neural Network), to minimize data transfer and reduce memory footprint. Aimar et al. [7] proposed NullHop, which encodes the sparse feature maps by using two sequentially ordered (i.e., internally indexed) additional storage structures, one for a 3D mask indicating the positions of non-zero values and the other for storing the non-zero values of the feature map sequentially. For example, ‘0’s are marked at the positions of zero values in the 3D mask, and ‘1’s are marked otherwise. Decoding refers to both the 3D mask and the non-zero value list. Rhu et al. [137] presented HashedNet, which utilizes a low-cost hash function to compress sparse activations. The virtualized DNN (vDNN) [136] compressed sparse activation units using the “zero-value compression” technique to minimize the data transfer cost between GPU and CPU. The vDNN allows users to utilize both GPU and CPU memory for DNN training.
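The sketch below illustrates only the CSR step of such pipelines on a toy pruned weight matrix; the weight sharing and Huffman coding stages, which provide most of the additional compression in Deep Compression, are omitted.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy pruned weight matrix: roughly 90% of the weights are zero after pruning.
rng = np.random.default_rng(0)
dense = rng.standard_normal((512, 512)).astype(np.float32)
dense[rng.random((512, 512)) < 0.9] = 0.0

sparse = csr_matrix(dense)      # Compressed Sparse Row storage
dense_bytes = dense.nbytes
csr_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes} B, CSR: {csr_bytes} B, "
      f"ratio: {dense_bytes / csr_bytes:.1f}x")
```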

5.3.3 Decomposing Kernel Matrix.

Li et al. [110] proposed SqueezeFlow, which reduces the number of operations for convolutions by decomposing the kernel matrix into non-zero-valued sub-matrices and zero-valued sub-matrices. This method can improve the accuracy per Joule.

5.4 Leveraging Weight Repetition in Quantized CNNs

Hegde et al. [76] noticed that identical weight values are often repeated in quantized CNNs such as binary weight CNNs [41, 42] and ternary weight CNNs [108, 196] and proposed the Unique Weight CNN Accelerator (UCNN), which reduces the number of memory accesses and the number of operations by leveraging the repeated weight values in quantized CNNs. For example, suppose a \(2 \times 2\) kernel consisting of \(\lbrace k_{1,1}, k_{1,2}, k_{2,1}, k_{2,2}\rbrace\) performs a convolution with the activations \(\lbrace a_{1,1}, a_{1,2}, a_{2,1}, a_{2,2}\rbrace\) . The conventional convolution operation, \(\Sigma _{i=1,j=1}^{i=2,j=2} k_{i,j}\times a_{i,j}\) , requires eight read memory accesses, four multiplications, and three additions. If two of the weights in the kernel are identical (e.g., \(k_{1,1} = k_{2,2}\) and \(k_{1,2} = k_{2,1}\) ), the convolution can be performed as \(k_{1,1}(a_{1,1}+a_{2,2}) + k_{1,2}(a_{1,2}+a_{2,1})\) , requiring six read memory accesses, two multiplications, and three additions. The UCNN improved the accuracy per Joule by 1.2 to 4 \(\times\) for AlexNet and LeNet on the Eyeriss architecture using the ImageNet dataset.
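This factorization generalizes to grouping activations by unique weight value before multiplying, as in the following sketch.

```python
import numpy as np
from collections import defaultdict

def factorized_dot(weights, activations):
    """Group activations by their (repeated) weight value before multiplying,
    as in the example above: k11*(a11+a22) + k12*(a12+a21). Returns the result
    and the number of multiplications actually performed."""
    groups = defaultdict(float)
    for w, a in zip(weights, activations):
        groups[w] += a                      # additions only
    result = sum(w * s for w, s in groups.items())
    return result, len(groups)              # one multiplication per unique weight

k = [0.5, -1.0, -1.0, 0.5]                  # 2x2 kernel with two unique values
a = [1.0, 2.0, 3.0, 4.0]
res, muls = factorized_dot(k, a)
print(res, muls, np.dot(k, a))              # 2 multiplications instead of 4
```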

5.5 Leveraging Innovative Technology

Many research attempts have leveraged innovative computing architecture technologies such as neuromorphic computing and in-memory processing as follows.

5.5.1 Neuromorphic Computing.

Neuromorphic computing mimics the brain, including brain components such as neurons and synapses; furthermore, biological neural systems include axons, dendrites, glial cells, and spiking signal transfer mechanisms [91]. The memristor, or “memory resistor,” is one of the most widely used devices in neuromorphic computing. ISAAC [146] replaced the MAC operation units of the DaDianNao architecture with memristor crossbars. In the crossbars, every wire in the horizontal wire array is connected to every wire in the vertical wire array through a resistor. Voltages of different levels, \(V = [v_1, v_2, \ldots , v_m]\) , are applied to the horizontal wires, which are connected to a vertical wire through the resistors \(R = [1/g_1, 1/g_2, \ldots , 1/g_m]\) . By mapping the input elements to \(v_i\) and the weights to \(g_i\) , where \(i = 1, \ldots , m\) , the output current I on the vertical wire realizes a MAC operation in a layer, \(I = \Sigma _{i=1}^m (v_i \times g_i)\) , based on Kirchhoff’s law. Multiple MAC operations can be performed by collecting the currents from multiple vertical wires. ISAAC employed digital-to-analog converters to convert the input elements into the appropriate voltages and analog-to-digital converters to convert the output currents into digitized feature map values. Due to the lack of reprogrammability of the resistors in the crossbars, the ISAAC architecture was only applicable to inference tasks. ISAAC improved the accuracy per Joule by \(5.5 \times\) compared to a full-fledged DaDianNao. As another neuromorphic computing approach, many research attempts implemented the Hodgkin-Huxley and Morris-Lecar models [81], which describe the activity of neurons using non-linear differential equations, in hardware simulators [15, 16, 49, 60, 149, 150, 151]. Several studies implemented neuromorphic architectures in ASICs, including TrueNorth [9], SpiNNaker [127], Neurogrid [18], BrainScaleS [144], and IFAT [131]. Please refer to the work of Schuman et al. [145] for a comprehensive survey of neuromorphic computing.
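The crossbar MAC can be illustrated with an idealized simulation, as below; real designs such as ISAAC add DACs, ADCs, limited conductance precision, and device noise that this sketch ignores.

```python
import numpy as np

def crossbar_mac(voltages, conductances):
    """Idealized memristor crossbar: the current on each vertical wire is
    I_j = sum_i v_i * g_ij (Kirchhoff's law), i.e., one MAC per column."""
    return voltages @ conductances

v = np.array([0.2, 0.5, 0.1, 0.8])           # input elements mapped to voltages
g = np.abs(np.random.rand(4, 16))            # weights mapped to conductances (1/R)
currents = crossbar_mac(v, g)                # 16 MAC results read out in parallel
print(currents.shape, currents[:3])
```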

5.5.2 In-Memory Processing.

Scaling down the size of transistors has enabled energy-efficient and high-performance von Neumann computing systems. However, further scaling has become quite challenging in the era of sub-10-nm technologies due to physical limitations [11, 54, 124]. To address this challenge, researchers proposed the in-memory processing paradigm, which improves performance and energy efficiency by integrating computation units into memory devices [67, 111, 178]. Several studies proposed to enable in-memory processing to accelerate CNNs [12, 46, 55, 73, 97, 185]. In-memory accelerators can be implemented using non-volatile memory technologies, such as phase change memory [92, 101] and resistive random access memory [37, 61, 62, 94, 133], as well as SRAM devices [185]. Several studies projected that non-volatile-memory-based accelerators can improve performance significantly compared to state-of-the-art microprocessors [61, 62]. The XNOR-SRAM technology [185], which integrates XNOR gates and accumulation logic into SRAM, can fetch data from SRAM and perform a MAC operation in one cycle.

5.6 Adaptive Compute Resource Assignment

This section covers methods that adaptively assign compute resources to the CNN inference workload during runtime to improve resource efficiency. The implementation of a CNN can be adapted to the accuracy requirements of the application by using various runtime techniques as follows.

5.6.1 Early Exiting.

The required depth of a CNN depends on the problem complexity. The “early exiting” technique allows a CNN to classify an object as early as possible by having multiple exit classifier points in a single CNN [128, 160, 161]. The early exiting technique was applied to distributed computing systems, addressing concerns about privacy, response time, and quality of experience [161]. Such early exiting methods minimize the compute resources and the inference latency, improving the accuracy per Joule, per operation, and per core utilization. Please refer to the work of Matsubara et al. [122] for details on early exiting techniques.
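A minimal sketch of confidence-thresholded early exiting is shown below; the stages, exit heads, and threshold are illustrative stand-ins for trained backbone blocks and exit classifiers.

```python
import numpy as np

def predict_with_early_exit(x, stages, exit_heads, confidence=0.9):
    """Run the backbone stage by stage and return as soon as an intermediate
    classifier is confident enough; otherwise fall through to the final exit."""
    for stage, head in zip(stages, exit_heads):
        x = stage(x)
        probs = head(x)
        if probs.max() >= confidence:
            return int(probs.argmax()), "early exit"
    return int(probs.argmax()), "final exit"

# Toy stages and exit heads standing in for trained network blocks and classifiers.
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
stages = [lambda x: np.tanh(x), lambda x: np.tanh(2 * x)]
heads = [lambda x: softmax(x[:10]), lambda x: softmax(3 * x[:10])]
print(predict_with_early_exit(np.random.randn(32), stages, heads))
```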

5.6.2 Runtime Channel Width Adaptation.

Runtime channel width adaptation prunes unimportant filters during runtime. In 2018, Fang et al. [50] presented a single DNN model, NestDNN, that can switch between multiple capacities during runtime according to accuracy and inference latency requirements. During training, unimportant filters were pruned from the original model to generate the smallest possible model, the “seed model.” At each retraining step, some of the pruned filters were added back to the seed model while the filter parameters from the previous training were kept fixed. Since the seed model was descended from the original model, the accuracy for each capacity in NestDNN was higher than that of a model with identical capacity trained from scratch. Similarly, Yu et al. [187] proposed another runtime-switchable DNN model, the Slimmable Neural Network, in which a larger-capacity model shares the filter parameters of a smaller-capacity model.
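The parameter sharing across capacities can be illustrated by slicing the leading channels of the full-width weights, as in the following sketch; the slicing rule is a simplified stand-in for the training procedures of NestDNN and slimmable networks.

```python
import numpy as np

def slimmed_conv_weights(full_weights, width_mult):
    """Take the leading channels of a full-width convolution weight tensor
    (out_ch, in_ch, kH, kW): the smaller capacity reuses (shares) the parameters
    of the larger one instead of being trained from scratch."""
    out_ch, in_ch = full_weights.shape[:2]
    o = max(1, int(out_ch * width_mult))
    i = max(1, int(in_ch * width_mult))
    return full_weights[:o, :i]

full = np.random.randn(64, 32, 3, 3)
print(slimmed_conv_weights(full, 0.25).shape)   # (16, 8, 3, 3)
```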

5.6.3 Runtime Model Switching.

Lee et al. [107] selected the best-performing object detector among multiple DNN detectors during runtime according to dynamic video content to improve the accuracy per core utilization and per Joule. Lou et al. [116] switched between multiple DNNs, generated from the Once-for-All NAS of Cai et al. [25], during runtime according to dynamic workload. For example, when the inference latency of a DNN increased due to a newly assigned workload, a runtime decision maker downgraded the current DNN during runtime to meet the latency constraint. Such runtime model switching approaches are appropriate when memory resources are sufficient, since the multiple DNNs must be preloaded into DRAM.

6 Interrelated Influences, Insights, and Future Trends

This section discusses the influence from higher- to lower-level techniques as shown in Figure 1, resource efficiency metrics according to the techniques, insights into resource-efficient CNNs, and future research trends based on the insights.

6.1 Influences of Model-Level Techniques on Arithmetic-Level Techniques

Weight quantization [41, 108, 196] in the model-level techniques influenced the arithmetic-level techniques as shown in Figure 7. The multiplications using quantized binary weights can be replaced with multiplexers, removing multiplication arithmetic operations. The resource efficiency gained from the model-level techniques can be further improved by utilizing the arithmetic-level techniques. For example, quantized CNNs such as ternary weight and binarized CNNs allowed INT8 arithmetic to be used in training [176, 182]. When reduced-precision CNNs suffered from gradients underflowing to zero, the reduced-precision arithmetic was replaced with hybrid arithmetic using both BFP and FP [48] or with the BM format [52].

6.2 Influences of Model-Level Techniques on Implementation-Level Techniques

Weight quantization in the model-level techniques influenced the implementation-level techniques as shown in Figure 8. Pruning individual weights introduces sparsity that the hardware architecture must handle, whereas pruning filters [109, 118] maintains a dense structure. Weight quantization in the model-level techniques allows a CNN to use fewer bits for weights to save memory resources, requiring customized hardware. For example, EIE [68] is an inference accelerator with weights quantized to four bits. To implement the weight quantization method effectively, EIE utilized weight sharing to reduce the model size further and fit the compressed CNN into on-chip SRAM. Exploring binary weights [42] with binary inputs offered the opportunity to use XNOR gates for the efficient implementation of CNNs [134], improving the accuracy per memory footprint and per Joule. In the work of Tridgell et al. [162], ternary neural networks [108, 196] were implemented on FPGAs by unrolling convolution operations. Since quantized CNNs [108, 196] increase the number of repeated weights in CNNs, UCNN [76] leveraged the repeated weight values in quantized CNNs to improve resource efficiency metrics such as accuracy per memory access and per operation. As the main limitation, weight quantization methods (e.g., [41, 108, 196]) were not suitable for commercially available CPUs and GPUs, since such computing platforms do not support binary and ternary weights in hardware. Therefore, the implementation of weight quantization methods on CPUs or GPUs might not improve the accuracy per Joule, as higher-precision arithmetic was still required in part of the data path during training and inference. The bottleneck structures generated by compact convolutions in other works [89, 142] can be used to reduce the data size transferred between a local device and an edge server for the efficient implementation of edge-based AI [122].

6.3 Influences of Arithmetic-Level Techniques on Implementation-Level Techniques

The arithmetic-level techniques influenced the implementation-level techniques as follows.
First, the research in arithmetic utilization acted as a catalyst for commodity CPUs and GPUs. For example, mixed-precision research [125] laid a foundation for tensor cores in the latest NVIDIA GPUs, which can accelerate the performance of CNN workloads by supporting a fused multiply-add operation and the mixed-precision training capability in hardware [19]. The BFloat16 format [24] designed by Google overcomes the limited accuracy issue of the IFP16 format by providing the same dynamic range as IFP32, and it is supported in hardware in Intel Cooper Lake Xeon processors, NVIDIA A100 GPUs, and Google TPUs. In 2016, NVIDIA Pascal GPUs supported IFP16 arithmetic in hardware to accelerate CNN applications. In 2017, NVIDIA Volta GPUs supported IFP16 tensor cores. In 2020, the NVIDIA Ampere architecture supported tensor cores, TF32, BFloat16, and sparsity acceleration in hardware to accelerate MACs [2]. The Graphcore company developed the Intelligent Processing Unit (IPU), which employs local memory assigned to each processing unit with support for a large number of independently operating hardware threads [93]. The IPU is an efficient computing architecture customized to “fine-grained, irregular computation that exhibits irregular data access.”
Second, the arithmetic-level techniques led to specialized custom accelerators for CNNs. There is ample evidence in the arithmetic-level literature (e.g., [21, 48, 52, 167, 180]) that even smaller operators (e.g., 16 bits or even less) have almost no impact on the accuracy of CNNs. For example, DianNao [30] and DaDianNao [33] were customized to 16-bit FiP arithmetic operators instead of word-size (e.g., 32-bit) FP operators. ISAAC [146] is a fully fledged crossbar-based CNN accelerator architecture, which implemented a memristor-based logic based on resistive memory, suitable for 16-bit arithmetic for CNN workloads. Wang et al. [167] designed their customized 8-bit FP arithmetic multiplications with 16-bit accumulations on an ASIC-based hardware platform with a 14-nm silicon technology to support energy-efficient CNN training. The Eyeriss [31] and SnaPEA [8] accelerators were customized to 16-bit arithmetic. UCNN [76] utilized 8-bit and 16-bit FiP configurations. SCNN [129] utilized 16-bit multiplication and 24-bit accumulation.
Last, the mixed-precision training schemes were accelerated in hardware by minimizing the data conversion overhead between lower- and higher-precision formats in updating weights and activations [194]. Additionally, the stochastic rounding scheme was supported in hardware in the Intel Loihi processor [43] and Graphcore IPU [93], since it was often required for quantizing weights and activations during training [65, 176, 182].

6.4 Resource Efficiency Metrics According to Techniques

Table 1 shows the resource efficiency metrics improved by the specific techniques discussed in Sections 3, 4, and 5, with rows for techniques, columns for metrics, and a cell marked if the corresponding technique improves the corresponding metric (‘ \(\bigcirc\) ’ for a metric improved directly by a technique and ‘E’ for an improvement expected to follow from the directly improved metrics). In Table 1, A/Param, A/Op, A/MF, A/MA, A/CU, and A/Joule represent accuracy per parameter, accuracy per operation, accuracy per memory footprint, accuracy per memory access, accuracy per core utilization, and accuracy per Joule, respectively.
Table 1.
Table 1. Resource Efficiency Metrics According to Techniques

6.5 New Insights into Resource-Efficient CNNs and Future Research Trends

Model-level techniques focus on seeking the least sufficient connections between neurons to reduce the number of parameters, whereas arithmetic-level techniques focus on seeking the least sufficient number of bits to represent the data. We observe that CNNs with fewer connections often perform better than the original model [53], whereas lower-precision CNNs do not perform better than the original model. Therefore, we draw the insight that it is highly probable that reducing the number of weights has a regularization effect, whereas reducing the number of bits used to represent weights, activations, and gradients does not, according to the bias-variance tradeoff [57]. For example, lower-precision arithmetic can incur overfitting due to the increased rounding errors [5, 106] (e.g., the four-bit CNNs in the work of Sun et al. [154] fit ResNet18 better than ResNet50). From this insight, incorporating the arithmetic-level techniques into the model-level techniques will be a promising research direction, since we expect that the overfitting magnified by lower-precision arithmetic can be mitigated by regularizing the model with model-level techniques.
Model-level techniques [70, 75, 78, 159] have generally reduced parameters by 5 to 10 \(\times\) compared to baseline models such as AlexNets, VGGNets, and ResNets, improving the accuracy per parameter. In contrast, the resource efficiency metrics improved by the arithmetic- and implementation-level techniques depend on the computing architecture. For example, reducing the number of bits by \(2 \times\) generally accelerates multiplications by \(4\times\) on ASICs/FPGAs [105]. It has recently been demonstrated that four bits can be sufficient to represent weights, activations, and their gradients with a specific rounding scheme called two-phase rounding [154]. Since four-bit FP arithmetic potentially brings approximately \(64 \times\) speedup to multiplications on ASICs/FPGAs compared to 32-bit FP arithmetic, the arithmetic-level techniques with a special rounding scheme can improve the throughput of training significantly. From this insight, we expect that incorporating the arithmetic-level techniques into ASICs is a promising research direction in the future.
Most resource-efficient CNN studies do not explain the effects of their techniques on the accuracy with respect to other variables such as sample complexity (the number of training samples), target complexity, and CNN complexity. It has recently been reported that increasing the depth of a CNN follows the classic bias-variance tradeoff, whereas increasing the width of a CNN follows a “bell-shaped” variance curve rather than a monotonically increasing pattern [183]. For example, if the CNN width is already saturated (beyond the peak of the variance curve), increasing the CNN width even with reduced-precision arithmetic can improve the accuracy [71]. However, if the CNN width is not yet saturated, increasing the CNN width might hurt the accuracy even with full precision. Increasing the depth of the CNN with reduced-precision arithmetic will have different impacts. In this regard, we suggest a meaningful unexplored research question: given a sample, a target, and a CNN complexity, how can we determine whether we need to increase (or decrease) the depth (or width) to improve the accuracy of reduced-precision CNNs according to the bias-variance tradeoff?

7 Conclusion

To the best of our knowledge, our survey is the first to provide comprehensive coverage of recent resource-efficient CNN techniques based on the three-level hierarchy of model-, arithmetic-, and implementation-level techniques. Our survey also uses multiple resource efficiency metrics to clarify which resource efficiency metrics each technique can improve. For example, most model-level resource-efficient techniques contribute to improving abstract resource efficiency, whereas the arithmetic- and implementation-level techniques directly contribute to improving physical resource efficiency by employing reduced-precision arithmetic and/or optimizing the dataflow of CNN architectures. Therefore, the efficient implementation of model-level techniques on given compute platforms is essential to improve physical resource efficiency. It is our hope that this work will contribute to the machine learning, arithmetic, and system communities by providing a comprehensive survey of various resource-efficient CNN techniques as a guideline for seeking CNN structures that use the least sufficient parameters and the least sufficient precision arithmetic on particular compute platforms, customized to the problem complexity and to the quantity and quality of the training data.

References

[1]
Papers With Code. (n.d.) ImageNet Benchmark (Image Classification on ImageNet). Retrieved March 15, 2023 from https://paperswithcode.com/sota/image-classification-on-imagenet.
[2]
NVIDIA. (n.d.) NVIDIA Ampere Architecture White Paper. Retrieved March 15, 2023 from https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
[3]
Google Cloud. (n.d.) Quantifying the Performance of the TPU, Our First Machine Learning Chip. Retrieved March 15, 2023 from https://cloud.google.com/blog/products/gcp/quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip.
[4]
IEEE. 2019. 754-2019—IEEE Standard for Floating-Point Arithmetic (Revision of IEEE 754-2008). IEEE, Los Alamitos, CA, 1–84.
[5]
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. 2012. Learning From Data. AMLBook.
[6]
A. Agrawal, S. M. Mueller, B. M. Fleischer, X. Sun, N. Wang, J. Choi, and K. Gopalakrishnan. 2019. DLFloat: A 16-b floating point format designed for deep learning training and inference. In Proceedings of the 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH’19). 92–95.
[7]
Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, et al. 2019. NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps. IEEE Transactions on Neural Networks and Learning Systems 30, 3 (2019), 644–656.
[8]
V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh. 2018. SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 662–673.
[9]
Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, et al. 2015. TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34, 10 (2015), 1537–1557.
[10]
Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 1–13.
[11]
Mustafa F. Ali, Robert Andrawis, and Kaushik Roy. 2020. Dynamic read current sensing with amplified bit-line voltage for STT-MRAMs. IEEE Transactions on Circuits and Systems II: Express Briefs 67, 3 (2020), 551–555.
[12]
Mustafa F. Ali, Akhilesh Jaiswal, and Kaushik Roy. 2020. In-memory low-cost bit-serial addition using commodity DRAM technology. IEEE Transactions on Circuits and Systems I: Regular Papers 67, 1 (2020), 155–165.
[13]
Andrew Anderson, Jing Su, Rozenn Dahyot, and David Gregg. 2019. Performance-oriented neural architecture search. In Proceedings of the 2019 International Conference on High Performance Computing and Simulation (HPCS’19). 177–184.
[14]
Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (NeurIPS’14).
[15]
Arindam Basu. 2012. Small-signal neural models and their applications. IEEE Transactions on Biomedical Circuits and Systems 6, 1 (2012), 64–75.
[16]
Arindam Basu, Csaba Petre, and Paul Hasler. 2008. Bifurcations in a silicon neuron. In Proceedings of the 2008 IEEE International Symposium on Circuits and Systems. 428–431.
[17]
Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Isabelle Guyon, Gideon Dror, Vincent Lemaire, Graham Taylor, and Daniel Silver (Eds.), Proceedings of Machine Learning Research, Vol. 27. PMLR, Bellevue, WA, 17–36. http://proceedings.mlr.press/v27/bengio12a.html.
[18]
Ben Varkey Benjamin, Peiran Gao, Emmett McQuinn, Swadesh Choudhary, Anand R. Chandrasekaran, Jean-Marie Bussat, Rodrigo Alvarez-Icaza, John V. Arthur, Paul A. Merolla, and Kwabena Boahen. 2014. Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations. Proceedings of the IEEE 102, 5 (2014), 699–716.
[19]
Pierre Blanchard, Nicholas J. Higham, Florent Lopez, Theo Mary, and Srikara Pranesh. 2020. Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores. SIAM Journal on Scientific Computing 42, 3 (2020), C124–C141.
[20]
Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. YOLOv4: Optimal speed and accuracy of object detection. arxiv:cs.CV/2004.10934 (2020).
[21]
R. Bordawekar, B. Abali, and M. H. Chen. 2021. EFloat: Entropy-coded floating point format for deep learning. arXiv:2102.02705 (2021).
[22]
Léon Bottou and Yann Le Cun. 2004. Large scale online learning. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.
[23]
Cristian Buciluă, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06). ACM, New York, NY, 535–541.
[24]
N. Burgess, J. Milanovic, N. Stephens, K. Monachopoulos, and D. Mansell. 2019. Bfloat16 processing for neural networks. In Proceedings of the 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH’19). 88–91.
[25]
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2020. Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the International Conference on Learning Representations(ICLR’20).
[26]
Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and D. Kudithipudi. 2019. Deep positron: A deep neural network using the posit number system. In Proceedings of the 2019 Design, Automation, and Test in Europe Conference and Exhibition (DATE’19). 1421–1426.
[27]
Zachariah Carmichael, Hamed F. Langroudi, Char Khazanov, Jeffrey Lillie, John L. Gustafson, and Dhireesha Kudithipudi. 2019. Performance-efficiency trade-off of low-precision numerical formats in deep neural networks. In Proceedings of the 2019 Conference for Next Generation Arithmetic (CoNGA’19). ACM, New York, NY, Article 3, 9 pages.
[28]
Chip-Hong Chang, Amir Sabbagh Molahosseini, Azadeh Alsadat Emrani Zarandi, and Tian Fatt Tay. 2015. Residue number systems: A new paradigm to datapath optimization for low-power and high-performance digital signal processing applications. IEEE Circuits and Systems Magazine 15, 4 (2015), 26–44.
[29]
Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Red Hook, NY, 742–751. http://papers.nips.cc/paper/6676-learning-efficient-object-detection-models-with-knowledge-distillation.pdf.
[30]
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGPLAN Notices 49, 4 (Feb. 2014), 269–284.
[31]
Y. Chen, J. Emer, and V. Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 367–379.
[32]
Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. 2019. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’19).
[33]
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE, Los Alamitos, CA, 609–622.
[34]
Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
[35]
Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282 (2017).
[36]
Y. Cheng, D. Wang, P. Zhou, and T. Zhang. 2018. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine 35, 1 (2018), 126–136.
[37]
Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 27–39.
[38]
Francois Chollet. 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).
[39]
Jason Cong and Bingjun Xiao. 2014. Minimizing computation in convolutional neural networks. In Artificial Neural Networks and Machine Learning—ICANN 2014, Stefan Wermter, Cornelius Weber, Włodzisław Duch, Timo Honkela, Petia Koprinkova-Hristova, Sven Magg, Günther Palm, and Alessandro E. P. Villa (Eds.). Springer International Publishing, Cham, Switzerland, 281–290.
[40]
NVIDIA Corporation. 2017. NVIDIA Tesla V100 GPU Architecture. WP-08608-001v1.1. NVIDIA.
[41]
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Red Hook, NY, 3123–3131. http://papers.nips.cc/paper/5647-binaryconnect-training-deep-neural-networks-with-binary-weights-during-propagations.pdf.
[42]
Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arxiv:cs.LG/1602.02830 (2016).
[43]
Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, et al. 2018. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 1 (2018), 82–99.
[44]
Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan. 2021. GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21). 1110–1123.
[45]
Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. 2020. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE 108, 4 (2020), 485–532.
[46]
Quan Deng, Lei Jiang, Youtao Zhang, Minxuan Zhang, and Jun Yang. 2018. DrAcc: A DRAM based accelerator for accurate CNN inference. In Proceedings of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC’18). 1–6.
[47]
Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 1 (NeurIPS’14). MIT Press, Cambridge, MA, 1269–1277.
[48]
Mario Drumond, Tao Lin, Martin Jaggi, and Babak Falsafi. 2018. Training DNNs with hybrid block floating point. In Proceedings of the 32nd Conference on Neural Information Processing Systems.
[49]
D. Dupeyron, S. Le Masson, Y. Deval, G. Le Masson, and J.-P. Dom. 1996. A BiCMOS implementation of the Hodgkin-Huxley formalism. In Proceedings of 5th International Conference on Microelectronics for Neural Networks. 311–316.
[50]
Biyi Fang, Xiao Zeng, and Mi Zhang. 2018. NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision (MobiCom’18). ACM, New York, NY.
[51]
M. Fasi, N. J. Higham, M. Mikaitis, and S. Pranesh. 2021. Numerical behavior of NVIDIA tensor cores. PeerJ Computer Science 7, e330 (2021), 1–19.
[52]
Sean Fox, Seyedramin Rasoulinezhad, Julian Faraone, David Boland, and Philip Leong. 2021. A block minifloat representation for training deep neural networks. In Proceedings of the International Conference on Learning Representations(ICLR’21).
[53]
Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Proceedings of the International Conference on Learning Representations (ICLR’19).
[54]
Adi Fuchs and David Wentzlaff. 2019. The accelerator wall: Limits of chip specialization. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA’19). 1–14.
[55]
Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. TETRIS: Scalable and efficient neural network acceleration with 3D memory. ACM SIGARCH Computer Architecture News 45, 1 (April2017), 751–764.
[56]
Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng Zhong Xu. 2019. Dynamic channel pruning: Feature boosting and suppression. In Proceedings of the International Conference on Learning Representations (ICLR’19).
[57]
Stuart Geman, Elie Bienenstock, and René Doursat. 1992. Neural networks and the bias/variance dilemma. Neural Computation 4, 1 (1992), 1–58.
[58]
Deepak Ghimire, Dayoung Kil, and Seong-Heum Kim. 2022. A survey on efficient convolutional neural networks and hardware acceleration. Electronics 11, 6 (2022), 945.
[59]
Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. 2018. SqueezeNext: Hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR’18).
[60]
Meisam Gholami and Saeed Saeedi. 2015. Digital cellular implementation of Morris-Lecar neuron model. In Proceedings of the 2015 23rd Iranian Conference on Electrical Engineering. 1235–1239.
[61]
Tayfun Gokmen, Murat Onen, and Wilfried Haensch. 2017. Training deep convolutional neural networks with resistive cross-point devices. Frontiers in Neuroscience 11 (2017), 1–13.
[62]
Tayfun Gokmen and Yurii Vlasov. 2016. Acceleration of deep neural network training with resistive cross-point devices: Design considerations. Frontiers in Neuroscience 10 (2016), 1–13.
[63]
Yiwen Guo, Anbang Yao, and Yurong Chen. 2016. Dynamic network surgery for efficient DNNs. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16). Curran Associates, Red Hook, NY, 1387–1395.
[64]
Rishi Raj Gupta and Virender Ranga. 2021. Comparative study of different reduced precision techniques in deep neural network. In Proceedings of International Conference on Big Data, Machine Learning and Their Applications, Shailesh Tiwari, Erma Suryani, Andrew Keong Ng, K. K. Mishra, and Nitin Singh (Eds.). Springer Singapore, Singapore, 123–136.
[65]
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning—Volume 37 (ICML’15). 1737–1746.
[66]
John L. Gustafson and Isaac Yonemoto. 2017. Beating floating point at its own game: Posit arithmetic. Supercomputing Frontiers and Innovations 4, 2 (June 2017), 71–86.
[67]
Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, João Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gómez-Luna, and Onur Mutlu. 2021. SIMDRAM: A framework for bit-serial SIMD processing using DRAM. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’21). ACM, New York, NY, 329–345.
[68]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA’16). IEEE, Los Alamitos, CA, 243–254.
[69]
Song Han, Huizi Mao, and William J. Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR’16).
[70]
Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Red Hook, NY, 1135–1143. http://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf.
[71]
Soheil Hashemi, Nicholas Anthony, Hokchhay Tann, R. Iris Bahar, and Sherief Reda. 2017. Understanding the impact of precision quantization on the accuracy and energy of neural networks. In Proceedings of the Conference on Design, Automation, and Test in Europe (DATE’17). 1478–1483.
[72]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 770–778.
[73]
Mingxuan He, Choungki Song, Ilkon Kim, Chunseok Jeong, Seho Kim, Il Park, Mithuna Thottethodi, and T. N. Vijaykumar. 2020. Newton: A DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-53). 372–385.
[74]
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV’18).
[75]
Y. He, X. Zhang, and J. Sun. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). 1398–1406.
[76]
Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W. Fletcher. 2018. UCNN: Exploiting computational reuse in deep neural networks via weight repetition. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 674–687.
[77]
Tyler Highlander and Andres Rodriguez. 2015. Very efficient training of convolutional neural networks using fast Fourier transform and overlap-and-add. In Proceedings of the British Machine Vision Conference (BMVC’15). Article 160, 9 pages.
[78]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arxiv:stat.ML/1503.02531 (2015).
[79]
Nhut-Minh Ho and Weng-Fai Wong. 2017. Exploiting half precision arithmetic in NVIDIA GPUs. In Proceedings of the 2017 IEEE High Performance Extreme Computing Conference (HPEC’17). 1–7.
[80]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780.
[81]
A. L. Hodgkin and A. F. Huxley. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology 117, 4 (1952), 500–544.
[82]
Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. 2022. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research 22, 1 (July 2022), Article 241, 124 pages.
[83]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arxiv:cs.CV/1704.04861 (2017).
[84]
Huiyi Hu, Ang Li, Daniele Calandriello, and Dilan Gorur. 2021. One pass ImageNet. In Proceedings of the NeurIPS 2021 Workshop on ImageNet: Past, Present and Future.
[85]
Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arxiv:cs.NE/1607.03250 (2016).
[86]
Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2018. CondenseNet: An efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).
[87]
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 2261–2269.
[88]
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research 18, 1 (Jan. 2017), 6869–6898.
[89]
Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arxiv:cs.CV/1602.07360 (2016).
[90]
Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Francis Bach and David Blei (Eds.). Proceedings of Machine Learning Research, Vol. 37. PMLR, Lille, France, 448–456. http://proceedings.mlr.press/v37/ioffe15.html.
[91]
E. M. Izhikevich. 2004. Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks 15, 5 (2004), 1063–1070.
[92]
Bryan L. Jackson, Bipin Rajendran, Gregory S. Corrado, Matthew Breitwisch, Geoffrey W. Burr, Roger Cheek, Kailash Gopalakrishnan, et al. 2013. Nanoscale electronic synapses using phase change devices. ACM Journal on Emerging Technologies in Computing Systems 9, 2 (2013), Article 12, 20 pages.
[93]
Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza. 2019. Dissecting the Graphcore IPU architecture via microbenchmarking. arXiv preprint arXiv:1912.03413 (2019).
[94]
Sung Hyun Jo, Ting Chang, Idongesit Ebong, Bhavitavya B. Bhadviya, Pinaki Mazumder, and Wei Lu. 2010. Nanoscale memristor device as synapse in neuromorphic systems. Nano Letters 10, 4 (2010), 1297–1301.
[95]
Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, et al. 2019. A study of BFLOAT16 for deep learning training. arXiv preprint arXiv:1905.12322 (2019).
[96]
Dongyoung Kim, Junwhan Ahn, and Sungjoo Yoo. 2018. ZeNA: Zero-aware neural network accelerator. IEEE Design & Test 35, 1 (2018), 39–46.
[97]
Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. 2016. Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 380–392.
[98]
Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. 2016. Compression of deep convolutional neural networks for fast and low power mobile applications. In Proceedings of the International Conference on Learning Representations (ICLR’16).
[99]
Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William Constable, Oguz Elibol, et al. 2017. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper/2017/file/a0160709701140704575d499c997b6ca-Paper.pdf.
[100]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, 6 (May 2017), 84–90.
[101]
Duygu Kuzum, Shimeng Yu, and H.-S. Philip Wong. 2013. Synaptic electronics: Materials, devices and applications. Nanotechnology 24, 38 (Sept. 2013), 382001.
[102]
Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 4013–4021.
[103]
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 4 (1989), 541–551.
[104]
Yann LeCun, John Denker, and Sara Solla. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. Touretzky (Ed.), Vol. 2. Morgan-Kaufmann.
[105]
JunKyu Lee, Gregory D. Peterson, Dimitrios S. Nikolopoulos, and Hans Vandierendonck. 2020. AIR: Iterative refinement acceleration using arbitrary dynamic precision. Parallel Computing 97 (2020), 102663.
[106]
JunKyu Lee and Hans Vandierendonck. 2021. Towards lower precision adaptive filters: Facts from backward error analysis of RLS. IEEE Transactions on Signal Processing 69 (2021), 3446–3458.
[107]
JunKyu Lee, Blesson Varghese, Roger Woods, and Hans Vandierendonck. 2021. TOD: Transprecise object detection to maximise real-time accuracy on the edge. In Proceedings of the IEEE International Conference on Fog and Edge Computing. 53–60.
[108]
Fengfu Li, Bo Zhang, and Bin Liu. 2016. Ternary weight networks. In Proceedings of the Workshop on Efficient Methods for Deep Neural Networks in the 30th International Conference on Neural Information Processing Systems (NeurIPS’16).
[109]
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. Pruning filters for efficient ConvNets. In Proceedings of the International Conference on Learning Representations (ICLR’17).
[110]
Jiajun Li, Shuhao Jiang, Shijun Gong, Jingya Wu, Junchao Yan, Guihai Yan, and Xiaowei Li. 2019. SqueezeFlow: A sparse CNN accelerator exploiting concise convolution rules. IEEE Transactions on Computers 68, 11 (2019), 1663–1677.
[111]
Shuangchen Li, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, and Yuan Xie. 2017. DRISA: A DRAM-based reconfigurable in-situ accelerator. In Proceedings of the 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50). 288–301.
[112]
Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime neural pruning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS’17). Curran Associates, Red Hook, NY, 2178–2188.
[113]
S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen. 2016. Cambricon: An instruction set architecture for neural networks. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 393–405.
[114]
Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). 2755–2763.
[115]
Zhi-Gang Liu and Matthew Mattina. 2020. Efficient residue number system based Winograd convolution. In Proceedings of the European Conference on Computer Vision (ECCV’20). 53–68.
[116]
Wei Lou, Lei Xun, Amin Sabet, Jia Bi, Jonathon Hare, and Geoff V. Merrett. 2021. Dynamic-OFA: Runtime DNN architecture switching for performance scaling on heterogeneous embedded platforms. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR’21). 3104–3112.
[117]
Liqiang Lu, Yun Liang, Qingcheng Xiao, and Shengen Yan. 2017. Evaluating fast algorithms for convolutional neural networks on FPGAs. In Proceedings of the 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’17). 101–108.
[118]
J. Luo, J. Wu, and W. Lin. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). 5068–5076.
[119]
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV’18).
[120]
Zelda Mariet and Suvrit Sra. 2016. Diversity networks: Neural network compression using determinantal point processes. In Proceedings of the International Conference on Learning Representations (ICLR’16).
[121]
Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv:1312.5851 (2013).
[122]
Yoshitomo Matsubara, Marco Levorato, and Francesco Restuccia. 2021. Split computing and early exiting for deep learning applications: Survey and research challenges. arxiv:eess.SP/2103.04505 (2021).
[123]
Warren McCulloch and Walter Pitts. 1943. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5 (1943), 127–147.
[124]
Sally A. McKee. 2004. Reflections on the memory wall. In Proceedings of the 1st Conference on Computing Frontiers (CF’04). ACM, New York, NY, 162.
[125]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich K. Elsen, David Garcia, Boris Ginsburg, et al. 2018. Mixed precision training. In Proceedings of the International Conference on Learning Representations (ICLR’18).
[126]
Michael A. Nielsen. 2018. Neural Networks and Deep Learning. Retrieved March 15, 2023 from http://neuralnetworksanddeeplearning.com/.
[127]
Eustace Painkras, Luis A. Plana, Jim Garside, Steve Temple, Francesco Galluppi, Cameron Patterson, David R. Lester, Andrew D. Brown, and Steve B. Furber. 2013. SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation. IEEE Journal of Solid-State Circuits 48, 8 (2013), 1943–1953.
[128]
Priyadarshini Panda, Abhronil Sengupta, and Kaushik Roy. 2016. Conditional deep learning for energy-efficient and enhanced pattern recognition. In Proceedings of the 2016 Design, Automation, and Test in Europe Conference and Exhibition (DATE’16). 475–480.
[129]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17). ACM, New York, NY, 27–40.
[130]
Behrooz Parhami. 2010. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, New York, NY.
[131]
Jongkil Park, Sohmyung Ha, Theodore Yu, Emre Neftci, and Gert Cauwenberghs. 2014. A 65k-neuron 73-Mevents/s 22-pJ/event asynchronous micro-pipelined integrate-and-fire array transceiver. In Proceedings of the 2014 IEEE Biomedical Circuits and Systems Conference (BioCAS’14). 675–678.
[132]
Z. Qin, Z. Zhang, X. Chen, C. Wang, and Y. Peng. 2018. FD-MobileNet: Improved MobileNet with a fast downsampling strategy. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP’18). 1363–1367.
[133]
Malte Rasch, Tayfun Gokmen, Mattia Rigotti, and Wilfried Haensch. 2019. RAPA-ConvNets: Modified convolutional networks for accelerated training on architectures with analog arrays. Frontiers in Neuroscience 13 (July 2019), 1–13.
[134]
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Computer Vision—ECCV 2016, Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Springer International Publishing, Cham, Switzerland, 525–542.
[135]
Arthur J. Redfern, Lijun Zhu, and Molly K. Newquist. 2021. BCNN: A binary CNN with all matrix ops quantized to 1 bit precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 4604–4612.
[136]
Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. VDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-49). IEEE, Los Alamitos, CA, Article 18, 13 pages.
[137]
M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler. 2018. Compressing DMA engine: Leveraging activation sparsity for training deep neural networks. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA’18). 78–91.
[138]
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. FitNets: Hints for thin deep nets. In Proceedings of the International Conference on Learning Representations (ICLR’15).
[139]
F. Rosenblatt. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 6 (1958), 386–408.
[140]
S. Salamat, M. Imani, S. Gupta, and T. Rosing. 2018. RNSnet: In-memory neural network acceleration using residue number system. In Proceedings of the 2018 IEEE International Conference on Rebooting Computing (ICRC’18). 1–12.
[141]
N. Samimi, M. Kamal, A. Afzali-Kusha, and M. Pedram. 2020. Res-DNN: A residue number system-based DNN accelerator unit. IEEE Transactions on Circuits and Systems I: Regular Papers 67, 2 (2020), 658–671.
[142]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).
[143]
Florian Scheidegger, Luca Benini, Costas Bekas, and A. Cristiano I. Malossi. 2019. Constrained deep neural network architecture search for IoT devices accounting for hardware calibration. In Advances in Neural Information Processing Systems 32. 6056–6066.
[144]
Johannes Schemmel, Andreas Grübl, Stephan Hartmann, Alexander Kononov, Christian Mayr, Karlheinz Meier, Sebastian Millner, et al. 2012. Live demonstration: A scaled-down version of the BrainScaleS wafer-scale neuromorphic system. In Proceedings of the 2012 IEEE International Symposium on Circuits and Systems (ISCAS’12). 702.
[145]
Catherine D. Schuman, Thomas E. Potok, Robert M. Patton, J. Douglas Birdwell, Mark E. Dean, Garrett S. Rose, and James S. Plank. 2017. A survey of neuromorphic computing and neural networks in hardware. arXiv preprint arXiv:1705.06963 (2017).
[146]
A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 14–26.
[147]
Li Shang, Alireza S. Kaviani, and Kusuma Bathala. 2002. Dynamic power consumption in Virtex™-II FPGA family. In Proceedings of the 2002 ACM/SIGDA 10th International Symposium on Field-Programmable Gate Arrays (FPGA’02). ACM, New York, NY, 157–164.
[148]
Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. 2018. Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 764–775.
[149]
M. F. Simoni, G. S. Cymbalyuk, M. E. Sorensen, R. L. Calabrese, and S. P. DeWeerth. 2004. A multiconductance silicon neuron with biologically matched dynamics. IEEE Transactions on Biomedical Engineering 51, 2 (2004), 342–354.
[150]
M. F. Simoni and S. P. DeWeerth. 1999. Adaptation in a VLSI model of a neuron. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 46, 7 (1999), 967–970.
[151]
Mario F. Simoni, Gennady S. Cymbalyuk, Michael Elliott Sorensen, Ronald L. Calabrese, and Stephen P. DeWeerth. 2000. Development of hybrid systems: Interfacing a silicon neuron to a leech heart interneuron. In Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems (NeurIPS) 2000, Denver, CO, USA, Todd K. Leen, Thomas G. Dietterich, and Volker Tresp (Eds.). MIT Press, Cambridge, MA, 173–179.
[152]
Suraj Srinivas and R. Venkatesh Babu. 2015. Data-free parameter pruning for deep neural networks. In Proceedings of the British Machine Vision Conference (BMVC’15). Article 31, 12 pages.
[153]
Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Xiaodong Cui, Wei Zhang, and Kailash Gopalakrishnan. 2019. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Red Hook, NY, 4900–4909. http://papers.nips.cc/paper/8736-hybrid-8-bit-floating-point-hfp8-training-and-inference-for-deep-neural-networks.pdf.
[154]
Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Swagath Venkataramani, Xiaodong Cui, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, and Kailash Gopalakrishnan. 2020. Ultra-low precision 4-bit training of deep neural networks. In Advances in Neural Information Processing Systems 33. 1796–1807.
[155]
V. Sze, Y. Chen, T. Yang, and J. S. Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE 105, 12 (2017), 2295–2329.
[156]
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, and Gu-Yeon Wei. 2020. Algorithm-hardware co-design of adaptive floating-point encodings for resilient deep learning inference. In Proceedings of the 2020 57th ACM/IEEE Design Automation Conference (DAC’20). 1–6.
[157]
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19).
[158]
Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML’19), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). Proceedings of Machine Learning Research, Vol. 97. PMLR, Long Beach, CA, 6105–6114. http://proceedings.mlr.press/v97/tan19a.html.
[159]
Mingxing Tan, Ruoming Pang, and Quoc V. Le. 2020. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20).
[160]
Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR’16). 2464–2469.
[161]
Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. 2017. Distributed deep neural networks over the cloud, the edge and end devices. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS’17). 328–339.
[162]
Stephen Tridgell, Martin Kumm, Martin Hardieck, David Boland, Duncan Moss, Peter Zipf, and Philip H. W. Leong. 2019. Unrolling ternary neural networks. ACM Transactions on Reconfigurable Technology and Systems 12, 4 (Oct. 2019), Article 22, 23 pages.
[163]
Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proceedings of the Deep Learning and Unsupervised Feature Learning Workshop (NeurIPS’11).
[164]
Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2015. Fast convolutional nets with fbfft: A GPU performance evaluation. In Proceedings of the International Conference on Learning Representations (ICLR’15).
[165]
S. Vogel, M. Liang, A. Guntoro, W. Stechele, and G. Ascheid. 2018. Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base. In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’18). 1–8.
[166]
Erwei Wang, James J. Davis, Ruizhe Zhao, Ho-Cheung Ng, Xinyu Niu, Wayne Luk, Peter Y. K. Cheung, and George A. Constantinides. 2019. Deep neural network approximation for custom hardware: Where we’ve been, where we’re going. ACM Computing Surveys 52, 2 (May 2019), Article 40, 39 pages.
[167]
Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. 2018. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Red Hook, NY, 7675–7684. http://papers.nips.cc/paper/7994-training-deep-neural-networks-with-8-bit-floating-point-numbers.pdf.
[168]
Peisong Wang, Xiangyu He, Qiang Chen, Anda Cheng, Qingshan Liu, and Jian Cheng. 2021. Unsupervised network quantization via fixed-point factorization. IEEE Transactions on Neural Networks and Learning Systems 32, 6 (2021), 2706–2720.
[169]
P. J. Werbos. 1990. Backpropagation through time: What it does and how to do it. Proceeding of the IEEE 78, 10 (1990), 1550–1560.
[170]
J. H. Wilkinson (Ed.). 1988. The Algebraic Eigenvalue Problem. Oxford University Press, New York, NY.
[171]
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM 52, 4 (April 2009), 65–76.
[172]
H. R. Wilson and Jack D. Cowan. 1972. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal 12 (1972), 1–24.
[173]
Shmuel Winograd. 1980. Arithmetic Complexity of Computations. Society for Industrial and Applied Mathematics.
[174]
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19).
[175]
Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. 2020. Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv:cs.LG/2004.09602 (2020).
[176]
Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. 2018. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680 (2018).
[177]
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 5987–5995.
[178]
Xin Xin, Youtao Zhang, and Jun Yang. 2020. ELP2IM: Efficient and low power bitwise operation processing in DRAM. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA’20). 303–314.
[179]
Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xian-Sheng Hua. 2019. Quantization networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 7308–7316.
[180]
Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2017. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).
[181]
Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. 2018. NetAdapt: Platform-aware neural network adaptation for mobile applications. In Computer Vision—ECCV 2018, Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Springer International Publishing, Cham, Switzerland, 289–304.
[182]
Yukuan Yang, Lei Deng, Shuang Wu, Tianyi Yan, Yuan Xie, and Guoqi Li. 2020. Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks 125 (2020), 70–82.
[183]
Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. 2020. Rethinking bias-variance trade-off for generalization of neural networks. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). Article 998, 11 pages.
[184]
Reza Yazdani, Marc Riera, Jose-Maria Arnau, and Antonio González. 2018. The dark side of DNN pruning. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 790–801.
[185]
S. Yin, Z. Jiang, J. Seo, and M. Seok. 2020. XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks. IEEE Journal of Solid-State Circuits 55, 6 (2020), 1733–1743.
[186]
Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. 2018. Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992 (2018).
[187]
Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. 2019. Slimmable neural networks. In Proceedings of the International Conference on Learning Representations (ICLR’19).
[188]
R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis. 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18). 9194–9203.
[189]
Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Computer Vision—ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, Switzerland, 818–833.
[190]
C. Zhang, P. Patras, and H. Haddadi. 2019. Deep learning in mobile and wireless networking: A survey. IEEE Communications Surveys Tutorials 21, 3 (2019), 2224–2287.
[191]
Chi Zhang and Viktor Prasanna. 2017. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’17). ACM, New York, NY, 35–44.
[192]
Xishan Zhang, Shaoli Liu, Rui Zhang, Chang Liu, Di Huang, Shiyi Zhou, Jiaming Guo, Qi Guo, Zidong Du, Tian Zhi, and Yunji Chen. 2020. Fixed-point back-propagation training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20).
[193]
X. Zhang, X. Zhou, M. Lin, and J. Sun. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18). 6848–6856.
[194]
Yongwei Zhao, Chang Liu, Zidong Du, Qi Guo, Xing Hu, Yimin Zhuang, Zhenxing Zhang, et al. 2021. Cambricon-Q: A hybrid architecture for efficient training. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21). 706–719.
[195]
Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2018. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:cs.NE/1606.06160.
[196]
Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2017. Trained ternary quantization. In Proceedings of the International Conference on Learning Representations (ICLR’17).
[197]
Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, and Junjie Yan. 2020. Towards unified INT8 training for convolutional neural network. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 1966–1976.
[198]
Barret Zoph and Quoc Le. 2017. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR’17).

