
Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision

Published: 10 December 2024

Abstract

Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs in real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate this computational gap and enable ubiquitous embedded intelligence, we focus in this survey on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. We also envision promising future directions and trends, which have the potential to deliver more ubiquitous embedded intelligence. We believe this survey has its merits and can shed light on future research, which can largely help researchers to quickly and smoothly get started in this emerging field.

1 Introduction

With the increasing availability of large-scale datasets and advanced computing paradigms, deep neural networks (DNNs)1 have empowered a wide range of intelligent applications and have demonstrated strong performance [1, 2, 3]. These intelligent applications span from image classification [2] to downstream vision tasks, such as object detection [4], tracking [5], and segmentation [6], to natural language processing (NLP) tasks, such as automatic speech recognition [7], machine translation [8], and question answering [9]. Subsequently, DNNs have been evolving more deeply with increasing numbers of layers in order to maintain state-of-the-art accuracy on target tasks [1, 2, 3]. Novel network structures and advanced training techniques have also emerged, which further push forward the attainable accuracy [10, 11, 12]. These powerful deep learning (DL) networks and advanced training techniques, starting from VGGNet [1] and ResNet [2], mark the emergence of the DL era.
The tremendous breakthroughs of DNNs have attracted a huge amount of attention from both academia and industry to deploy powerful DNNs in real-world embedded computing systems, including mobile phones [13, 14], autonomous vehicles [15, 16], and health care [17, 18], to enable intelligent embedded applications towards embedded intelligence [19]. In practice, this may bring significant benefits. For example, embedded computing systems explicitly allow real-time on-device data processing, which significantly improves processing efficiency and thus delivers enhanced user experience. This also protects data security and privacy since everything can be locally processed without being uploaded to a remote server [19]. Despite these promising benefits, deploying powerful DNNs in real-world embedded computing systems still suffers from several critical limitations. On the one hand, in order to maintain competitive accuracy, recent representative networks have been evolving much deeper with hundreds of layers [2, 3] and, as a result, lead to prohibitive computational complexity [19, 20]. For example, ResNet50 [2], as one of the most representative deep networks, consists of over 4 billion floating-point operations (FLOPs) and 25 million parameters, which requires over 87 MB of on-device storage to deal with one single input image. On the other hand, real-world embedded computing systems, such as mobile phones and autonomous vehicles, typically feature limited available computational resources in order to optimize the on-device power and energy consumption. In light of the above, the evolving network complexity continues to enlarge the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems [20], inevitably making it increasingly challenging to embrace ubiquitous embedded intelligence.
To bridge the aforementioned computational gap towards ubiquitous embedded intelligence, a plethora of model compression techniques have been recently proposed, including network pruning [21, 22, 23], network quantization [24, 25, 26], and network distillation [11, 27, 28], which strive for better accuracy–efficiency trade-offs to accommodate the limited available computational resources in real-world embedded scenarios. For example, network pruning focuses on removing redundant network units, such as weights [29], channels [21], and layers [30], which can boost the efficiency on target hardware with minimal accuracy loss on the target task. In addition to network compression, another parallel alternative is to manually design resource-efficient networks, such as SqueezeNet [31], MobileNets [32, 33], ShuffleNets [34, 35], and GhostNets [36, 37], which have dominated the early progress from the lens of efficient network design. These efficient networks, despite being able to exhibit superior efficiency, highly rely on human expertise to explore novel network structures through trial and error, a process that also involves non-trivial engineering efforts and prohibitive computational resources [38, 39, 40]. To overcome such limitations, recent network design practices have shifted from manual to automated, also referred to as neural architecture search (NAS) or automated machine learning (AutoML), which focuses on automatically exploring novel network structures [41]. The tremendous success of NAS has sparked rich hardware-aware NAS works, such as MnasNet [38], ProxylessNAS [40], FBNet [39], and Once-for-All [42], to automate the design of accurate yet hardware-efficient network solutions. These solutions have shown strong accuracy–efficiency trade-offs and have been widely deployed in real-world embedded computing systems to deliver intelligent services [43].
Apart from the above efficient networks and techniques that typically focus on improving on-device inference efficiency, recent research has also turned its attention to on-device training efficiency [44, 45]. The rationale here is that previous representative networks, despite being able to exhibit superior accuracy, have to be trained for hundreds of epochs, which may require multiple days on powerful graphics processing units (GPUs) [44]. Even worse, the expensive training process on remote GPUs does not allow on-device customization on local hardware, especially in resource-constrained embedded scenarios [45]. Note that local on-device customization has the potential to further improve the attainable accuracy using newly collected data since local sensors continue to collect new data from users over time. To overcome such limitations, several efficient on-device learning techniques have been established recently, such as on-device continual learning [46], on-device transfer learning [44], and on-device federated learning [47], making it possible to train and fine-tune powerful deep networks on local hardware for further performance improvement.
More recently, large language models (LLMs), such as GPT-3 [48] and GPT-4 [49], have demonstrated impressive success across various real-world language processing tasks [50]. However, the strong learning capability of these powerful LLMs also comes at the cost of excessive computational complexity. For example, OpenAI’s GPT-3 [48], one of the most representative LLMs, consists of 175 billion parameters. Furthermore, in order to achieve state-of-the-art performance, recent LLMs continue to evolve to be exponentially larger with ever-increasing model sizes [51, 52]. These make it increasingly challenging to deploy recent powerful LLMs in modern embedded computing systems towards intelligent language processing services. To overcome such limitations, a series of effective techniques have been proposed recently, which focus on alleviating the prohibitive computational complexity of LLMs to explore computation-efficient LLMs, including efficient LLM architecture design [53, 54, 55, 56], efficient LLM compression techniques (i.e., pruning [57, 58], quantization [59, 60], and knowledge distillation [61, 62]), and efficient LLM system design [63, 64, 65].
In parallel to the booming emergence of powerful deep networks and advanced training techniques, a plethora of representative deep learning software frameworks and hardware accelerators have been tailored to facilitate the development of efficient deep learning solutions for embedded computing systems, such as TensorFlow [66], PyTorch [67], Google edge tensor processing units (TPUs) [68], NVIDIA edge GPUs [69], and Intel Neural Compute Stick [70]. These deep learning software programs and hardware units have been extensively adopted in the deep learning era and bring two main benefits. On the one hand, they lift the roadblock for both software and hardware engineers and, thus, allow engineers to quickly develop intelligent embedded applications, such as on-device object detection [4], tracking [5], and segmentation [6], with less domain-specific expertise. On the other hand, they typically feature domain-specific optimization and, thus, can achieve superior accuracy–efficiency trade-offs with minimal engineering efforts. For example, NVIDIA Jetson AGX Xavier, a representative NVIDIA Jetson edge GPU, supports the development of intelligent embedded applications with the precision of INT8 (i.e., 8-bit weights), which can deliver significant efficiency improvement over its full-precision counterpart (32-bit weights) without degrading the accuracy on the target task [69].

1.1 Organization of This Article

In this survey, we summarize recent efficient deep learning infrastructures that may benefit current and future embedded computing systems towards ubiquitous embedded intelligence. In practice, some existing surveys [71, 72, 73, 74] typically focus on efficient deep learning algorithms, which may be out-of-date since recent deep learning infrastructures have been rapidly evolving, especially from the perspective of LLMs. In contrast to [71, 72, 73, 74], we focus on providing a more comprehensive and holistic view of recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks (CNNs) to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss the following recent efficient deep learning infrastructures for embedded computing systems: (1) efficient manual network design, (2) efficient automated network design, (3) efficient network compression, (4) efficient on-device learning, (5) efficient LLMs for embedded computing systems, (6) efficient deep learning software and hardware, and (7) efficient intelligent applications. We believe this survey has its merits and can shed light on future research, helping researchers quickly and smoothly get started in this emerging field. We demonstrate the organization of this survey in Figure 1, which is also summarized as follows:
Fig. 1. The organization of this article. Sections 1 and 9 are not in the figure for the sake of simplicity.
Section 2 extensively discusses recent representative efficient manual networks.
Section 3 extensively discusses recent representative efficient automated networks.
Section 4 extensively discusses recent representative network compression techniques.
Section 5 extensively discusses recent representative on-device learning techniques.
Section 6 extensively discusses recent representative efficient LLMs.
Section 7 extensively discusses recent representative deep learning software and hardware.
Section 8 extensively discusses recent representative intelligent embedded applications.
At the end of each section, we also envision possible future directions in the respective fields, which have the potential to pave the way for future ubiquitous embedded intelligence.

2 Manual Network Design for Embedded Computing Systems

The tremendous success of DNNs highly relies on the prohibitive network complexity, leading to the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems [20]. To bridge this computational gap, one of the most representative solutions is to design computation-efficient DNNs to accommodate the limited computational resources on embedded computing systems. To this end, we systematically discuss recent state-of-the-art efficient manual networks in this section. For better understanding, we divide these efficient networks into two main categories and subsections, including efficient convolutional networks in Section 2.1 and efficient transformers in Section 2.2, since these efficient networks may feature different network structures and target different intelligent embedded applications.

2.1 Manual Convolutional Neural Network Design

Previous state-of-the-art deep convolutional networks, such as AlexNet [76], VGGNet [1], GoogleNet [77], ResNet [2], DenseNet [3], and EfficientNets [78, 79], have pushed forward the attainable accuracy on ImageNet [80] from 57.2% [81] to 87.3% [79], but their network complexity has also increased over time. We note that the convolutional network consists of convolutional layers, pooling layers, and fully connected layers, where most of the network complexity comes from convolutional layers [82]. For example, in ResNet50 [2], more than 99% of FLOPs are from convolutional layers. In light of this, designing efficient convolutional layers is critical to innovating computation-efficient convolutional networks. In practice, there are six typical efficient convolutional layers: pointwise convolution, groupwise convolution, depthwise convolution, dilated convolution, Ghost convolution, and partial convolution.
Pointwise Convolution. Pointwise convolution is a type of convolutional layer with the fixed kernel size of 1 × 1, which performs an element-wise multiplication and addition along the depth dimension. On the one hand, compared with the standard K × K convolutional layer, the pointwise convolutional layer is able to reduce the number of FLOPs and parameters by \(K^2\) times, which significantly improves the efficiency. On the other hand, we note that the output from the pointwise convolutional layer typically has the same spatial dimensions as the input but may have a different number of channels. As such, the pointwise convolutional layer can be used to adjust the intermediate feature maps in terms of the number of channels. Specifically, it can reduce or increase the number of channels, making it a practical technique for compressing or expanding convolutional networks.
Groupwise Convolution. Groupwise convolution is a type of convolutional layer that (1) divides the input feature map into G groups along the depth dimension, (2) performs convolution within each group separately, and (3) concatenates the group outputs along the depth dimension to derive the final output. For example, given an input feature map with the size of \(B \times C \times H \times W\), each kernel in the K × K groupwise convolutional layer has the size of \((C/G) \times K \times K\), which convolves the above G groups of feature maps, respectively. Therefore, compared with the standard K × K convolutional layer, the groupwise convolutional layer is able to reduce the number of FLOPs and parameters by G times.
Depthwise Convolution. Depthwise convolution is a type of convolutional layer that has gained popularity owing to its ability to significantly reduce the number of FLOPs and parameters in convolutional networks. It is a special case of groupwise convolutional layer, in which the number of groups G is equal to the number of input channels. Each input channel is convolved with a unique kernel of \(1 \times K \times K\), after which the outputs from all input channels are concatenated along the depth dimension to derive the final output. In practice, this has the potential to achieve a significant reduction in the number of FLOPs and parameters because the intermediate feature maps may consist of thousands of channels as shown in previous state-of-the-art convolutional networks [2, 3, 78, 79].
Dilated Convolution. Dilated convolution [83], also referred to as atrous convolution, is a type of convolutional layer that is designed to increase the receptive field size. In the dilated convolutional layer, there is an adjustable parameter called the dilation rate, which determines the spacing between kernel elements and can be varied to adjust the size of the receptive field. For example, the 3 × 3 dilated convolutional layer with a dilation rate of 2 covers the same receptive field as the standard 5 × 5 convolutional layer. This allows us to increase the receptive field size to unlock better accuracy without introducing additional computational overheads, such as FLOPs and parameters.
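To make the complexity comparisons above concrete, the following is a minimal PyTorch sketch that counts the parameters of standard, pointwise, groupwise, depthwise, and dilated convolutional layers; the channel and group sizes are illustrative values rather than settings from any particular network, and FLOPs scale with the same ratios for a fixed output resolution.

```python
import torch
import torch.nn as nn

C_in, C_out, K, G = 128, 128, 3, 4  # illustrative sizes, not taken from the survey

def n_params(m):
    return sum(p.numel() for p in m.parameters())

standard  = nn.Conv2d(C_in, C_out, K, padding=1, bias=False)              # C_out * C_in * K * K weights
pointwise = nn.Conv2d(C_in, C_out, 1, bias=False)                         # K^2x fewer weights than standard
groupwise = nn.Conv2d(C_in, C_out, K, padding=1, groups=G, bias=False)    # Gx fewer weights than standard
depthwise = nn.Conv2d(C_in, C_in, K, padding=1, groups=C_in, bias=False)  # one K x K kernel per input channel
dilated   = nn.Conv2d(C_in, C_out, K, padding=2, dilation=2, bias=False)  # 5x5 receptive field at 3x3 cost

for name, m in [("standard", standard), ("pointwise", pointwise),
                ("groupwise", groupwise), ("depthwise", depthwise),
                ("dilated", dilated)]:
    print(f"{name:10s}: {n_params(m):,} parameters")

# For a fixed output resolution, FLOPs follow the same ratios, since every output
# position performs the same per-kernel multiply-accumulate work.
x = torch.randn(1, C_in, 56, 56)
assert standard(x).shape == dilated(x).shape == (1, C_out, 56, 56)
```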
Ghost Convolution. The Ghost convolution [36, 37, 75] is a type of convolutional layer that is designed to generate rich feature maps using cheaper computational resources, as illustrated in Figure 2. The Ghost convolutional layer consists of two sequential parts. The first part corresponds to the standard convolutional layer, in which the number of output channels is rigorously controlled. In the second part, to generate rich feature maps, a series of simple linear operations are applied to the output feature maps from the first part. As a result, the size of the output feature maps still remains the same as the standard convolutional layer, but the total required computational resources, such as the number of FLOPs and parameters, are significantly reduced, as shown in Figure 2.
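The following is a minimal sketch of such a two-part Ghost-style layer in PyTorch; the ratio between the intrinsic and cheaply generated channels and the kernel sizes are illustrative assumptions rather than the exact GhostNet configuration.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost-style two-part convolution: a thin standard conv produces a few
    intrinsic channels, then cheap depthwise ops generate the remaining
    channels. Hyperparameters here are illustrative assumptions."""
    def __init__(self, c_in, c_out, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        c_primary = c_out // ratio    # part 1: strictly limited output channels
        c_cheap = c_out - c_primary   # part 2: generated by cheap linear ops
        self.primary = nn.Conv2d(c_in, c_primary, kernel, padding=kernel // 2, bias=False)
        self.cheap = nn.Conv2d(c_primary, c_cheap, cheap_kernel,
                               padding=cheap_kernel // 2, groups=c_primary, bias=False)

    def forward(self, x):
        y = self.primary(x)
        # Concatenating intrinsic and cheap channels keeps the output size
        # identical to that of a standard convolution with c_out channels.
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 64, 32, 32)
print(GhostConv(64, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```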
Partial Convolution. Partial convolution [84] is designed to reduce computational redundancy and memory access simultaneously. It is built upon the regular convolution, in which only a small number of input channels are convolved to extract representative spatial features, while the remaining input channels are left unchanged. Similar to the Ghost convolution, the resulting output channels are then concatenated along the depth dimension to produce the final output. In practice, partial convolution brings significant computational and memory efficiency since only a small number of input channels are convolved, and it also maintains better on-device resource utilization than the Ghost convolution.
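A minimal sketch of the partial-convolution idea is shown below, assuming a PyTorch setting; the split ratio n_div is an illustrative assumption.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Partial-convolution-style layer: a regular conv is applied to only the
    first 1/n_div of the input channels; the remaining channels pass through
    untouched and are concatenated back. n_div=4 is an assumed split ratio."""
    def __init__(self, channels, n_div=4, kernel=3):
        super().__init__()
        self.c_conv = channels // n_div
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, kernel,
                              padding=kernel // 2, bias=False)

    def forward(self, x):
        x_conv, x_id = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat([self.conv(x_conv), x_id], dim=1)

x = torch.randn(1, 64, 32, 32)
print(PartialConv(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```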
Fig. 2. Comparisons between the standard convolution (left) and the Ghost convolution (right) of GhostNets [36, 37, 75]. Compared with the standard convolutional layer, the Ghost convolutional layer can generate rich features using simple and cheaper linear operations (figure from [36]).
Built on top of the aforementioned efficient convolutional layers and structures, there are several representative families of manually designed efficient convolutional networks, including SqueezeNet [81], MobileNets [32, 85, 86], ShuffleNets [34, 35], CondenseNets [87, 88], GhostNets [36, 37, 75], and FasterNet [84]. We compare the above representative efficient convolutional networks in Figure 3, which are also discussed in the remainder of this section.
Fig. 3. Comparisons of the efficient convolutional networks discussed in Section 2.1, including SqueezeNet [81], MobileNets [32, 85, 86], ShuffleNets [34, 35], CondenseNets [87, 88], and GhostNets [36, 37, 75], in which the accuracy is evaluated on ImageNet [80] and is taken from the respective papers. Note that the convolutional networks in this figure may be trained under different training recipes.
SqueezeNet [81] is stacked using a series of Fire modules, which aims to achieve AlexNet-level accuracy with fewer parameters. Each Fire module consists of two convolutional layers, including one squeeze layer and one expand layer. In the squeeze layer, only pointwise convolutional layers are used to reduce the number of input channels for the subsequent expand layer. Next, the expand layer performs feature expansion using a pair of 1 × 1 and 3 × 3 convolutional layers. SqueezeNet is able to achieve slightly better accuracy on ImageNet than AlexNet (i.e., 57.5% in SqueezeNet vs. 57.2% in AlexNet) using a ×50 smaller model size. SqueezeNet is also more compression friendly than AlexNet. For example, SqueezeNet can be further compressed using the method outlined in [89], which delivers more compact network variants with ×363 to ×510 smaller model sizes and, more importantly, without degrading the accuracy on ImageNet.
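The following is a minimal PyTorch sketch of a Fire-module-style block as described above; the channel counts are illustrative and do not reproduce the full SqueezeNet configuration.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire-module-style block: a 1x1 squeeze layer reduces the channel count,
    then parallel 1x1 and 3x3 expand layers restore it. Channel sizes are
    illustrative, not the exact SqueezeNet configuration."""
    def __init__(self, c_in, c_squeeze, c_expand):
        super().__init__()
        self.squeeze = nn.Conv2d(c_in, c_squeeze, 1)
        self.expand1x1 = nn.Conv2d(c_squeeze, c_expand, 1)
        self.expand3x3 = nn.Conv2d(c_squeeze, c_expand, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.act(self.squeeze(x))
        return torch.cat([self.act(self.expand1x1(s)),
                          self.act(self.expand3x3(s))], dim=1)

x = torch.randn(1, 96, 56, 56)
print(Fire(96, 16, 64)(x).shape)  # torch.Size([1, 128, 56, 56])
```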
MobileNets [32, 85, 86] are a family of lightweight convolutional networks, including MobileNetV1 [32], MobileNetV2 [85], and MobileNeXt [86], which are tailored for mobile devices with limited computational resources. MobileNetV1 is built upon a series of building blocks. Each building block consists of two convolutional layers, including one 3 × 3 depthwise convolutional layer and one 1 × 1 pointwise convolutional layer. With 569 M FLOPs and 4.2 M parameters, MobileNetV1 achieves 70.6% top-1 accuracy on ImageNet. MobileNetV2 is an improved version of MobileNetV1, which aims to unlock higher accuracy with fewer FLOPs and parameters. MobileNetV2 introduces the inverted residual building block that consists of three convolutional layers, including one 1 × 1 pointwise convolutional layer, one 3 × 3 depthwise convolutional layer, and one 1 × 1 pointwise convolutional layer. Here, the inverted residual building block also borrows the residual connection from ResNet [2] to stabilize the training process and improve the accuracy. With 300 M FLOPs and 3.4 M parameters, MobileNetV2 achieves 72.0% top-1 accuracy on ImageNet. MobileNeXt investigates the inverted residual building block in MobileNetV2 and introduces the sandglass block to enhance the accuracy without increasing the network complexity. The sandglass block consists of four convolutional layers, including one 3 × 3 depthwise convolutional layer, one 1 × 1 pointwise convolutional layer, one 1 × 1 pointwise convolutional layer, and one 3 × 3 depthwise convolutional layer. With 300 M FLOPs and 3.4 M parameters, MobileNeXt achieves 74.0% top-1 accuracy on ImageNet.
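The following is a minimal sketch of a MobileNetV2-style inverted residual block (1 × 1 expansion, 3 × 3 depthwise convolution, 1 × 1 linear projection, with an optional residual connection); the expansion ratio and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual block: 1x1 expand -> 3x3 depthwise ->
    1x1 linear project, with a residual connection when shapes allow it.
    expand_ratio=6 follows the commonly reported setting."""
    def __init__(self, c_in, c_out, stride=1, expand_ratio=6):
        super().__init__()
        c_mid = c_in * expand_ratio
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),  # linear projection
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 56, 56])
```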
ShuffleNets [34, 35] are a family of efficient convolutional networks, including ShuffleNetV1 [35] and ShuffleNetV2 [34], which exploit channel shuffling to reduce the network complexity while maintaining competitive accuracy. Specifically, ShuffleNetV1, for the first time, introduces channel shuffling to enhance the information flow across different channels. In practice, the channel shuffling operation is inserted after the 3 × 3 depthwise convolutional layer to shuffle the feature maps from different groups, which is capable of generating richer and more diverse feature maps while not increasing the number of FLOPs and parameters. With 292 M FLOPs and 3.4 M parameters, ShuffleNetV1 achieves 71.5% top-1 accuracy on ImageNet, which is \(+\)0.9% higher than MobileNetV1 under comparable settings of FLOPs. ShuffleNetV2 improves the accuracy and efficiency of ShuffleNetV1 with several architectural modifications. ShuffleNetV2 first leverages channel splitting to divide the input feature maps into two parallel branches, one of which is fed into three convolutional layers, including one 1 × 1 pointwise convolutional layer, one 3 × 3 depthwise convolutional layer, and one 1 × 1 pointwise convolutional layer. After that, the above two branches of feature maps are concatenated along the depth dimension, which are then shuffled using the channel shuffling operation. With 299 M FLOPs and 3.5 M parameters, ShuffleNetV2 is able to achieve 72.6% top-1 accuracy on ImageNet, which is \(+\)1.1% higher than ShuffleNetV1 under comparable settings of FLOPs.
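The channel shuffling operation itself is simple to express; the following is a minimal sketch that interleaves channels across groups via a reshape, transpose, and reshape, adding no FLOPs or parameters beyond the memory movement.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: reshape channels into (groups, C // groups),
    transpose, and flatten, so features from different groups are interleaved."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

x = torch.randn(1, 8, 4, 4)
print(channel_shuffle(x, groups=2).shape)  # torch.Size([1, 8, 4, 4])
```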
CondenseNets [87, 88] are a family of efficient convolutional networks, including CondenseNetV1 [87] and CondenseNetV2 [88], which are built upon another representative convolutional network named DenseNet [3]. CondenseNetV1 enhances the dense connection with a novel module called learned group convolution. Note that the dense connection reuses the features from preceding convolutional layers to enhance the information flow as seen in DenseNet. In contrast, the learned group convolution removes the redundant dense connection between different convolutional layers to reduce network redundancy. With 274 M FLOPs and 2.9 M parameters, CondenseNetV1 achieves 71.0% top-1 accuracy on ImageNet. CondenseNetV2 introduces an alternative named sparse feature reactivation (SFR) to increase feature reuse. Integrated with SFR, each convolutional layer can learn to (1) selectively reuse a set of the most important features from preceding convolutional layers and (2) actively update a set of preceding features to increase their reuse in subsequent convolutional layers. With 146 M FLOPs and 3.6 M parameters, CondenseNetV2 achieves 71.9% top-1 accuracy on ImageNet.
GhostNets [36, 37, 75] are a family of efficient deep convolutional networks, including GhostNetV1 [36, 75] and GhostNetV2 [37], which focus on generating rich feature maps using computationally cheap and simple yet powerful operations. To this end, GhostNetV1 introduces a powerful yet computation-efficient convolution dubbed Ghost convolution as shown in Figure 2, which consists of two sequential parts. The first part corresponds to the standard convolutional layer, in which the number of output channels is rigorously controlled. In the second part, a series of computationally cheap and simple linear operations are applied to the output feature maps from the first part to generate rich feature maps. With only 141 M FLOPs and 5.2 M parameters, GhostNetV1 achieves 73.9% top-1 accuracy on ImageNet. GhostNetV2 introduces a novel hardware-friendly attention mechanism, DFC attention, to enhance the learned feature maps to boost the expressiveness ability, which is seamlessly integrated into GhostNetV1 to push forward the accuracy and efficiency. For example, with 167 M FLOPs and 6.1 M parameters, GhostNetV2 is able to achieve 75.3% top-1 accuracy on ImageNet.
FasterNet [84] is built upon the partial convolution introduced above. In contrast to the above efficient networks that typically optimize the number of FLOPs, FasterNet pioneers the design of efficient networks with optimized FLOPS (i.e., FLOPs per second). The motivation behind FasterNet is that the on-device latency is determined by both FLOPs and FLOPS (i.e., Latency = FLOPs/FLOPS). To this end, FasterNet focuses on increasing the number of FLOPs to maintain competitive accuracy on the target task, while at the same time optimizing FLOPS to maintain competitive efficiency on target hardware. For example, compared with GhostNetV1x1.3, which involves 0.24 G FLOPs and exhibits 75.7% top-1 accuracy on ImageNet, FasterNet-T1 achieves \(+\)0.5% higher top-1 accuracy with many more FLOPs (i.e., 0.85 G) and, more importantly, achieves ×1.7 speedup on ARM processors.
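The following is a small numeric sketch of the latency model behind this argument (Latency = FLOPs/FLOPS); the throughput numbers are hypothetical and chosen only to illustrate how a higher-FLOPs network can still be faster when its operators reach a higher effective throughput on the device.

```python
# Hypothetical illustration of the latency model Latency = FLOPs / FLOPS:
# a network with more FLOPs can still be faster if its operators achieve a
# higher effective throughput. All numbers below are made up for illustration.

def latency_ms(flops, effective_flops_per_s):
    return flops / effective_flops_per_s * 1e3

net_a = {"flops": 0.24e9, "throughput": 10e9}   # fewer FLOPs, memory-bound ops
net_b = {"flops": 0.85e9, "throughput": 60e9}   # more FLOPs, compute-friendly ops

print(f"net_a: {latency_ms(net_a['flops'], net_a['throughput']):.1f} ms")  # 24.0 ms
print(f"net_b: {latency_ms(net_b['flops'], net_b['throughput']):.1f} ms")  # 14.2 ms
```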

2.2 Manual Transformer Design

2.2.1 Transformer for NLP.

In parallel to convolutional networks, the transformer [90] is another well-established branch of DNNs, which exploits multi-head self-attention mechanisms. The transformer was first designed for and applied to NLP tasks, where it has achieved tremendous success. For example, BERT [91], one of the most representative transformers in the field of NLP, achieved state-of-the-art performance across 11 downstream NLP tasks, such as language translation, question answering, and language generation, at the time it was proposed. Furthermore, GPT-3 (Generative Pretrained Transformer 3) [48] pioneers scaling up and pretraining a massive transformer with 175 billion parameters on 45 TB of compressed plaintext data, which unlocks even stronger performance across almost all downstream NLP tasks and, more importantly, without requiring fine-tuning on specific NLP tasks. More recently, GPT-4 [49] has been proposed by OpenAI, which significantly outperforms GPT-3 across a wide range of language processing tasks and has been widely integrated into real-world applications, such as ChatGPT [92], to provide intelligent language processing services. These early transformer-based networks, albeit at the cost of prohibitive computational complexity, have been pushing forward the boundaries of various language processing tasks and dominating recent advances in the field of NLP (see Figure 4).
Fig. 4. Illustration of the key milestones of the transformer, which is originally applied to NLP tasks and has recently gained increasing popularity in the vision community. Here, we mark the vision transformers in red.
Nonetheless, it is quite challenging to deploy powerful transformers in embedded computing systems due to the computational gap between computation-intensive transformers and computation-limited embedded computing systems. For example, as pointed out in [93], to translate a short sentence with only 30 words, a typical transformer model needs to execute 13 G FLOPs, which takes 20 seconds on a Raspberry Pi device. This significantly hinders the user experience in real-world embedded scenarios. To tackle this issue, a series of computation-efficient transformers have emerged, among which TinyBERT [94], MobileBERT [95], DistilBERT [96], Linformer [97], and Reformer [98] are some of the most representative ones. The main intuition behind these efficient transformers is to resolve the memory bottleneck and increase the parallelism, making it possible to deploy NLP workloads on resource-constrained embedded computing systems. Note that, compared with computer vision tasks such as image classification and object detection, running NLP workloads on embedded computing systems is less common due to high inference latency. For example, as demonstrated in [93], running language translation workloads with hardware-tailored transformers on a Raspberry Pi device still takes several seconds, whereas running image classification workloads typically takes milliseconds per image. More recently, inspired by the remarkable success of GPTs [48, 49], transformer-based LLMs have become increasingly popular in the NLP community. To optimize the efficiency of transformer-based LLMs, a plethora of efficient transformer-based LLMs have been proposed, which typically focus on improving the training efficiency [99, 100, 101], inference efficiency [102, 103], and fine-tuning efficiency [104, 105] of transformers in the context of LLMs. For example, to optimize the inference efficiency of transformer-based LLMs, [102] partitions LLMs over different hardware chips in order to fit weights and activation tensors into memory and run computation and memory workloads within the given latency constraint; it also features a simple yet effective strategy to alleviate the communication overheads among different hardware chips for cost-effective and latency-efficient inference.

2.2.2 Transformer for Vision.

Inspired by the tremendous success of the transformer in the field of NLP, researchers have recently applied it to vision tasks, which achieves surprisingly strong performance (see Figure 4). This opens up a new direction and further challenges the dominant role of convolutional networks in vision tasks. Specifically, DETR [106] and Vision Transformer (ViT) [107] are the very early transformers in vision tasks, among which ViT is the most representative. These early pioneers have motivated a myriad of subsequent transformers in various vision tasks, such as image classification [107, 108, 109], object detection [110, 111, 112], semantic segmentation [113, 114, 115, 116], and video analysis [117, 118, 119]. For example, ViT was first proposed in June 2020, which has since gained over 20,000 citations as shown in Google Scholar. The main intuition behind ViT is surprisingly simple and straightforward, which (1) splits the input image into a series of fixed-size patches, (2) linearly embeds each of them, and (3) feeds the resulting sequence of vectors into the standard transformer encoder as illustrated in Figure 5. However, there is no free lunch. The surprisingly strong performance of ViT and its variants comes at the cost of prohibitive computational complexity, which significantly hinders the practical deployments of ViT and its variants in embedded computing systems with limited computational resources.
Fig. 5. Overview of Vision Transformer (ViT) [107], which (1) splits the image into fixed-size patches, (2) linearly embeds each of them, and (3) feeds the sequence of vectors into the encoder (figure from [107]).
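The following is a minimal sketch of the three steps above (patchify, linearly embed, and feed the token sequence into a standard transformer encoder), assuming a PyTorch setting; the patch size, embedding dimension, and depth are illustrative and much smaller than those of the actual ViT models.

```python
import torch
import torch.nn as nn

# Minimal ViT-style pipeline: split the image into fixed-size patches, linearly
# embed them, prepend a class token, add position embeddings, and run a standard
# transformer encoder. All sizes below are illustrative assumptions.
img_size, patch, dim, layers, heads = 224, 16, 192, 4, 3
n_patches = (img_size // patch) ** 2  # 196 patches for a 224x224 image

patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify + linear embed
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
    num_layers=layers)

x = torch.randn(2, 3, img_size, img_size)
tokens = patch_embed(x).flatten(2).transpose(1, 2)            # (2, 196, dim)
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], 1)  # prepend the class token
out = encoder(tokens + pos_embed)                             # (2, 197, dim)
print(out.shape)
```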
To resolve the complexity bottleneck, some recent works have pioneered to design computation-efficient transformers for vision tasks with the aim of reducing the computational complexity while maintaining competitive accuracy. The representative computation-efficient transformers in vision tasks include LeViT [120], MobileFormer [121], MobileViTs [122, 123, 124], EfficientViT [125], EdgeViT [126], EdgeNeXt [127], CastlingViT [128], and FastViT [129]. These computation-efficient vision transformers are summarized and compared in Figure 6.
Fig. 6. Comparisons of efficient vision transformers discussed in Section 2.2, including LeViT [120], MobileFormer [121], MobileViTs [122, 123, 124], EfficientViT [125], EdgeViT [126], EdgeNeXt [127], CastlingViT [128], and FastViT [129], in which the accuracy is evaluated on ImageNet [80] and is taken from the respective papers. Note that the vision transformers here may be trained under different training recipes.
LeViT [120] is a hybrid vision transformer built on top of convolutional networks, which aims to improve the trade-off between accuracy and efficiency. To this end, LeViT introduces several enhancements to shrink down the network size, including (1) a multi-stage transformer architecture that uses attention mechanisms as down-sampling, (2) a computation-efficient patch descriptor that shrinks down the number of features in the early layers, (3) a per-head translation-invariant attention bias that replaces ViT’s positional embeddings, and (4) an efficient MLP-based attention block that improves the network capacity under given computational budgets. With 406 M FLOPs and 9.2 M parameters, LeViT achieves 78.6% top-1 accuracy on ImageNet.
MobileFormer [121] parallelizes MobileNetV2 [85] and transformer [107] with a two-way bridge, which shifts the network design paradigm from series to parallel. The network here is named MobileFormer, where Mobile refers to MobileNetV2 and Former stands for transformer. Mobile takes the image as input and stacks inverted residual blocks that consist of efficient pointwise and depthwise convolutional layers to extract local features. Former takes learnable tokens as input and stacks multi-head attention and feed-forward networks, in which the learnable tokens encode global features of the image. As such, Mobile and Former can communicate through a two-way bridge to fuse local and global features for better expressiveness ability. With 294 M FLOPs and 11.4 M parameters, MobileFormer achieves 77.9% top-1 accuracy on ImageNet.
MobileViTs, including MobileViTv1 [122], MobileViTv2 [123], and MobileViTv3 [124], are a family of efficient hybrid networks that combine the benefits of CNNs (e.g., spatial inductive bias and less sensitivity to data augmentations) and vision transformers (e.g., input-adaptive weighting and global processing). In contrast to mainstream vision transformers, both MobileViTv1 and MobileViTv2 are designed with the aim of low inference latency rather than low FLOPs since the number of FLOPs cannot accurately reflect the inference efficiency on target hardware. To this end, MobileViTv1 introduces a novel block that is able to efficiently and effectively encode both local and global features. In addition, MobileViTv1 replaces local processing in convolutional layers with global processing using transformers, which can lead to better representation capability with fewer parameters and simpler training recipes. Finally, with 5.6 M parameters, MobileViTv1 achieves 78.4% top-1 accuracy on ImageNet. MobileViTv2 introduces a separable self-attention mechanism with linear complexity, which is integrated into MobileViTv1 to boost the accuracy and hardware efficiency. For example, MobileViTv2 achieves 75.6% top-1 accuracy on ImageNet, which is \(+\)0.8% higher than MobileViTv1 while maintaining ×3.2 speedup on iPhone 12. In addition, MobileViTv3 introduces two simple yet effective enhancements: (1) replacing \(3\times 3\) convolutional layers with \(1\times 1\) convolutional layers and (2) scaling up building blocks in terms of the network width. With 927 M FLOPs, MobileViTv3 achieves 76.7% top-1 accuracy on ImageNet, which is \(+\)1.9% higher than MobileViTv1 under similar FLOPs.
EfficientViT [125] investigates high-resolution, low-computation visual recognition tasks using ViT and its variants, and identifies that the complexity bottleneck of ViT and its variants comes from the excessively used softmax attention mechanism. To resolve the complexity bottleneck, EfficientViT challenges the dominant role of softmax attention in vision transformers. It introduces a strong alternative, namely, enhanced linear attention, to replace softmax attention, which demonstrates strong representation capability in local feature extraction while being able to maintain low computational complexity and high hardware efficiency. With 406 M FLOPs and 7.9 M parameters, EfficientViT achieves 78.6% top-1 accuracy on ImageNet.
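The following is a generic sketch of the linear-attention idea that such designs build on: with a non-negative kernel feature map, the attention computation can be reordered so that its cost grows linearly rather than quadratically with the number of tokens. This is a simple ReLU-kernel illustration, not the exact enhanced linear attention module of EfficientViT.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # O(N^2 * d): materializes an N x N attention map.
    attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    return attn @ v

def linear_attention(q, k, v, eps=1e-6):
    # Generic kernel-based linear attention: with a non-negative feature map
    # phi (ReLU here), (phi(q) phi(k)^T) v is reordered as phi(q) (phi(k)^T v),
    # so the cost is O(N * d^2) instead of O(N^2 * d).
    q, k = F.relu(q), F.relu(k)
    kv = k.transpose(-2, -1) @ v                                  # (d, d) summary of keys/values
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # per-token normalizer
    return (q @ kv) / z

q, k, v = (torch.randn(1, 196, 64) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```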
EdgeViT [126] investigates the design of efficient vision transformers from the perspective of on-device deployment, enabling vision transformers to compete with state-of-the-art CNNs in terms of the accuracy–efficiency trade-off. EdgeViT is designed based on an optimal decomposition of self-attention using standard primitive operations, optimizing EdgeViT towards target hardware to achieve superior accuracy–efficiency trade-offs. With 600 M FLOPs and 4.1 M parameters, EdgeViT achieves 74.4% top-1 accuracy on ImageNet, which is \(+\)2.4% higher than MobileNetV2 under comparable latency constraints on the Samsung Galaxy S21 device.
EdgeNeXt [127] is an efficient hybrid network that marries both worlds of convolutional networks and vision transformers. To better encode global information, EdgeNeXt introduces an efficient split depthwise transpose attention (SDTA) encoder to address the issue of limited receptive fields in CNNs without increasing the number of FLOPs and parameters. EdgeNeXt also leverages adaptive kernel sizes to shrink down the network complexity. With 538 M FLOPs and 2.3 M parameters, EdgeNeXt achieves 75.0% top-1 accuracy on ImageNet, which is comparable to MobileViTv1 [122] in terms of both accuracy and on-device latency.
CastlingViT [128] proposes to (1) train ViT and its variants using both linear-angular attention and masked softmax-based quadratic attention and (2) switch to having only linear-angular attention during the inference in order to save computational resources. The linear-angular attention leverages angular kernels to bridge the accuracy gap between linear attention and softmax-based attention. It expands angular kernels where linear terms are kept while complex high-order residuals are approximated. This aligns with the observation in EfficientViT [125] that the complexity bottleneck of ViT and its variants comes from the excessively involved softmax attention mechanism. To address the complexity bottleneck, CastlingViT replaces softmax attention with linear-angular attention to further improve the efficiency of ViT and its variants. With 490 M FLOPs and 10.5 M parameters, CastlingViT achieves 79.6% top-1 accuracy on ImageNet.
FastViT [129] is an efficient hybrid network that combines both CNNs and vision transformer, which aims to marry the best of both and enable state-of-the-art accuracy–efficiency trade-offs. To this end, FastViT introduces a novel token-mixing operator named RepMixer, which is the basic building block of FastViT that leverages structural reparameterization to reduce the memory access cost by removing the less important skip connections. In addition, FastViT also applies training-time over-parameterization and large kernel convolutions to further boost the accuracy with minimal effect on the inference latency. In practice, structural reparameterization enables FastViT to achieve strong accuracy on target tasks during the training process and maintain superior efficiency on target hardware during the on-device inference process. With 700 M FLOPs and 3.6 M parameters, FastViT achieves 75.6% top-1 accuracy on ImageNet.

2.3 Envisioning Future Trends

In this section, we envision the future trends and possible directions of manual network design, including convolutional networks and transformers, summarized as follows.
(1)
Hardware-Aware Optimization. The trend in the field of network design is to reduce the number of FLOPs. However, the number of FLOPs only represents the theoretical complexity and the reduction in the number of FLOPs does not necessarily lead to the inference speedup on target hardware [122, 123, 126, 130, 131]. For example, PiT [132] has ×3 fewer FLOPs than DeiT [133], but both have similar inference latency on iPhone 12 (i.e., DeiT vs. PiT on iPhone 12: 10.99 ms vs. 10.56 ms) [122]. In parallel, the attention mechanisms are powerful plug-in enhancements in various real-world scenarios [134, 135], such as Squeeze-and-Excitation (SE) [31] in vision tasks and self-attention [90] in NLP tasks, which can further boost the attainable accuracy on target tasks while slightly increasing the number of FLOPs. However, DNNs with attention mechanisms, despite being able to push forward the accuracy on target tasks, introduce considerable extra parameters and are difficult to parallelize on target hardware, especially for transformers that are full of self-attention mechanisms. For example, EfficientViT [125] demonstrates that the prohibitive computational complexity of ViT and its variants comes from excessively used softmax attention. In light of the above, we should focus on optimizing more direct efficiency metrics, such as latency and energy, which may directly benefit real-world embedded computing systems.
(2)
Interpretability and Explainability. Recent manually designed DNNs, including efficient convolutional networks and transformers, have been empirically developed through trial and error. The underlying reason is that DNNs suffer from limited interpretability and explainability [136]. Therefore, to find one decent network solution with competitive accuracy, we have to repeat a plethora of training experiments to evaluate the accuracy of possible network configurations [137, 138], thereby necessitating non-trivial computational resources for repeated training workloads [130, 131]. To avoid this, we should focus in the future on addressing the interpretability and explainability of DNNs to facilitate the network design process and minimize the required engineering efforts.
(3)
Hybrid Multi-modal Networks. Compared with vision transformers, convolutional networks are able to maintain superior efficiency on target hardware but may suffer from inferior accuracy on target tasks. In contrast, vision transformers rely heavily on self-attention mechanisms, which are difficult to parallelize on mainstream embedded computing systems [125, 128]. For example, as demonstrated in EdgeViT [126], under similar FLOPs settings, MobileNetV2 [85] is about ×2 faster on the Samsung Galaxy S21 device than MobileViTv1 [122]. This further hinders the practical deployment of vision transformers in real-world embedded scenarios. In parallel, [139] demonstrates that, similar to transformers, graph neural networks, when properly engineered, can also achieve competitive performance in vision tasks. More importantly, these hybrid networks have the potential to handle various modalities (i.e., different types of input), such as text, images, and audio [140]. For example, convolutional networks are particularly effective at handling spatial data, such as images. In contrast, transformers are better suited for sequential data, such as text. Therefore, in order to achieve better accuracy–efficiency trade-offs and allow diverse input modalities, one natural and promising future direction is to continue exploring hybrid multi-modal networks that combine the strengths of existing representative networks, such as convolutional networks, vision transformers, and graph networks.
(4)
Simpler Training Recipes. As demonstrated in [141], the competitive performance of ViT and its variants highly relies on more advanced training recipes, such as pretraining on larger datasets, more training epochs, stronger data augmentations, and stronger regularization strategies. For example, ViT [107] is first pretrained on ImageNet-21k and JFT and then fine-tuned on ImageNet. Note that ImageNet consists of 1,000 categories, whereas ImageNet-21k has 21,000 categories. This makes it more challenging to train vision transformers under regular training settings and significantly increases the total training cost. Therefore, training vision transformers in a more computation-efficient manner and under simpler training recipes is a promising future direction.
(5)
Adversarial Robustness. In addition to efficiency, adversarial robustness is another desirable network property since efficient networks, especially vision transformers [142], are more sensitive to input perturbations and, as a result, are more vulnerable to adversarial attacks than their less efficient counterparts [143]. Adversarial robustness refers to the ability of the network to maintain its accuracy even when encountering adversarial attacks that are intentionally designed to mislead the network. Adversarial robustness is critical in real-world scenarios, especially for environments that are complex and unpredictable, such as autonomous vehicles. Therefore, innovating efficient yet robust DNNs is a promising future direction in the field of network design.

3 Automated Network Design for Embedded Computing Systems

In contrast to manual network design, automated network design, also known as neural architecture search (NAS) [137], has recently flourished, which strives to automate the design of efficient neural networks. In the past decade, NAS has achieved impressive performance in the field of network design, which delivers more advanced networks with both higher accuracy and efficiency than the conventional manual network design (see Section 2). To this end, in this section, we further discuss recent advances in the field of NAS, especially from the perspective of hardware-aware NAS that searches for hardware-efficient network solutions, including modular search space in Section 3.1, search strategy in Section 3.2, and speedup techniques and extensions in Section 3.3.

3.1 Modular Search Space

The search space \(\mathcal {A}\) plays a prominent role in the success of NAS since the search engine of NAS strives to search for top-performing architecture candidates within the predefined search space. This also indicates that the search space determines the upper-performance limit of modern NAS algorithms. However, designing efficient and effective search spaces is quite difficult and challenging since there are a myriad of possible operator candidates (e.g., \(1\times 1\), \(3\times 3\), \(5\times 5\), and \(7\times 7\) convolutional layers) and different network configurations (e.g., the combination strategies of different operator candidates and the network channel layouts) [137, 138, 144]. Therefore, to reduce the search space size and trim down the search complexity, previous state-of-the-art NAS methods [137, 138, 144] often restrict the search space to allow efficient search and leverage modular search spaces, which are coarse-grained in contrast to layer-wise fine-grained search spaces. Previous state-of-the-art NAS methods are based on the following two representative types of modular search spaces, including cell-based search space and block-based search space.
Cell-Based Search Space. The cell-based search space \(\mathcal {A}\) has been dominating the early success in the field of NAS. The cell-based search space was first introduced by NASNet [144] and DARTS [138]. As defined in NASNet, the cell-based search space consists of two types of cell structures, which are denoted as the normal cell and the reduction cell. In practice, both types of cells are encoded into directed acyclic graphs (DAGs) and maintain the same cell structure as illustrated in Figure 7, except that the reduction cell starts with one convolutional layer with the stride of 2 to reduce the input spatial dimension. Once the cell structure is determined at the end of search, it is then repeatedly stacked to derive the final architecture candidate. In addition, DARTS introduces another type of cell-based search space, which has motivated a plethora of subsequent NAS methods that are also built on top of the same cell-based search space, such as RobustDARTS [145], EdgeNAS [130], PC-DARTS [146], P-DARTS [147], DARTS+ [148], DARTS–[149], FairDARTS [150], and \(\beta\)-DARTS [151]. Similar to NASNet, the cell-based search space in DARTS consists of two types of cells, the normal cell and the reduction cell. As shown in Figure 7, each cell has an ordered sequence of nodes, where each node is a latent representation (e.g., a feature map in convolutional networks) and each directed edge has a set of possible operators \(\lbrace o^{(i, j)}\rbrace\) that transform the input \(x^{(i)}\). In contrast to NASNet, the cell in DARTS is assumed to have two different input nodes and one single output node. With this in mind, we are able to mathematically calculate the intermediate node as follows, which is based on all of its predecessors:
\begin{equation} x^{(j)} = \sum _{i \lt j} o^{(i, j)}(x^{(i)}) . \end{equation}
(1)
Finally, the search space of DARTS contains \(6.3 \times 10^{29}\) possible architecture candidates [130].
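The following is a minimal sketch of how a discretized cell evaluates Equation (1): each intermediate node sums the outputs of the operators applied to its predecessor nodes, and the cell output concatenates the intermediate nodes. The operator assignment below is an arbitrary illustrative choice, not a searched cell.

```python
import torch
import torch.nn as nn

# Minimal sketch of Equation (1): each intermediate node sums candidate operators
# applied to all of its predecessor nodes. The operator choices are illustrative
# placeholders rather than a searched DARTS cell.
C = 16
ops = {
    (0, 2): nn.Conv2d(C, C, 3, padding=1),
    (1, 2): nn.Identity(),
    (0, 3): nn.MaxPool2d(3, stride=1, padding=1),
    (2, 3): nn.Conv2d(C, C, 1),
}

def compute_cell(inputs):
    nodes = list(inputs)  # nodes 0 and 1 are the two cell inputs
    for j in range(2, 4):
        nodes.append(sum(op(nodes[i]) for (i, jj), op in ops.items() if jj == j))
    return torch.cat(nodes[2:], dim=1)  # cell output concatenates intermediate nodes

x0 = torch.randn(1, C, 32, 32)
x1 = torch.randn(1, C, 32, 32)
print(compute_cell([x0, x1]).shape)  # torch.Size([1, 32, 32, 32])
```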
Fig. 7. Illustration of the cell-based search space \(\mathcal {A}\) in NASNet [144] and DARTS [138], in which NASNet assigns operator candidates to nodes and DARTS assigns operator candidates to edges (figure from [41]).
Block-Based Search Space. The block-based search space \(\mathcal {A}\) advocates for simple and diverse network topologies as illustrated in Figure 8, in which each architecture candidate consists of multiple sequential operator candidates. As shown in previous NAS practices, the operator candidates in the block-based search space are usually taken from state-of-the-art manual DNNs, such as MobileNets [32, 85] and ShuffleNets [34, 35]. For example, the block-based search space in ProxylessNAS [40] is built on top of MobileNetV2, whereas the block-based search space in HSCoNAS [152] is built on top of ShuffleNetV2. In parallel, HURRICANE [153] demonstrates that different hardware platforms favor different search spaces, based on which HURRICANE introduces a hybrid block-based search space that combines both MobileNetV2 and ShuffleNetV2 to deliver superior architecture solutions. In contrast to the cell-based search space, the block-based search space is hardware friendly, due to which the block-based search space has been widely adopted in previous hardware-aware NAS methods, such as MnasNet [38], ProxylessNAS [40], OFA [42], HSCoNAS [152], SurgeNAS [154], and LightNAS [131]. The intuition behind this is that the architecture candidate in the cell-based search space consists of multiple parallel branches as shown in Figure 7, which introduce additional overheads in terms of the memory access and, as a result, deteriorate the inference efficiency on target hardware according to the roofline analysis [155]. In addition, in contrast to the cell-based search space that repeatedly stacks the same cell structure across the entire network, the block-based search space allows operator diversity within different blocks, encouraging the search for architecture candidates with better accuracy–efficiency trade-offs [156].
Fig. 8. Illustration of the block-based search space \(\mathcal {A}\), which is based on MobileNetV2 (figure from [38]).

3.2 Search Strategy

In this section, we discuss recent state-of-the-art NAS algorithms and divide them into three main categories: reinforcement learning-based search [137], evolutionary algorithm-based search [157], and gradient-based search (also known as differentiable search) [138].
Reinforcement Learning-Based Search. In the field of NAS, to the best of our knowledge, [137] is the first NAS work that opens up the possibility to automate the design of top-performing DNNs, which features reinforcement learning (RL) [159] as the search engine. Specifically, [137] leverages a simple yet effective recurrent neural network (RNN) as the RL controller to generate possible architecture candidates from the search space as shown in Figure 9. The generated architecture candidate is then trained from scratch on the target task to evaluate the accuracy. Next, the accuracy of the generated architecture candidate is fed back to optimize the RNN controller so that it generates better architecture candidates in the next iteration. Once the search process terminates, the well-optimized RNN controller is able to provide DNNs with superior accuracy on the target task. For example, the network generated by the RNN controller achieves 96.35% top-1 accuracy on CIFAR-10, which is comparable to or even better than the family of manually designed DNNs, such as ResNet [2]. The promising performance of [137] marks an important milestone in the field of NAS, pioneering an effective alternative to automate the design of competitive DNNs.
Fig. 9. Illustration of how the recurrent neural network (RNN) controller samples possible convolutional architecture candidates from the search space in reinforcement learning–based NAS (figure from [137]).
Subsequently, based on [137], NASNet [144] introduces the flexible cell-based search space as shown in Figure 7, which further boosts the attainable accuracy on the target task. For example, NASNet achieves 97.6% top-1 accuracy on CIFAR-10, which is \(+\)1.25% higher than [137] while involving fewer parameters (i.e., 37.4 M in [137] vs. 27.6 M in NASNet). Despite the promising performance, [137] and NASNet have to train a large number of possible architecture candidates from scratch, thus inevitably necessitating prohibitive computational resources. For example, to optimize the RNN controller, [137] needs to train 12,800 stand-alone architecture candidates. To overcome such limitations, ENAS [160] proposes an efficient NAS paradigm dubbed parameter sharing, which forces all the architecture candidates to share network weights to eschew training each architecture candidate from scratch. In practice, this leads to significant reduction in terms of search cost, while at the same time still maintaining strong accuracy on the target task. For example, in [137], one single search experiment takes 3\(\sim\)4 days on 450 NVIDIA GTX 1080 Ti GPUs [144]. In contrast, benefiting from the paradigm of parameter sharing, ENAS is able to find one decent network solution with 97.11% top-1 accuracy on CIFAR-10 and, more importantly, in less than 16 hours on one single NVIDIA GTX 1080 Ti GPU. Thanks to the significant search efficiency, the paradigm of parameter sharing has been dominating subsequent breakthroughs in the NAS community [42, 138, 161].
Although early RL-based NAS methods [137, 144, 160] have had tremendous success in automatic network design, they focus on accuracy-only optimization, ignoring other important performance metrics, such as latency and energy. To search for hardware-efficient network solutions, MnasNet [38] formulates the search process as a multi-objective optimization problem that optimizes both accuracy and latency as shown in Figure 10. To achieve this, MnasNet introduces a flexible block-based search space (see Figure 8) and designs an effective multi-objective RL reward function to optimize the RNN controller. Specifically, the goal of MnasNet is to find Pareto-optimal architecture candidates arch in the search space \(\mathcal {A}\) that maximize the predefined multi-objective RL reward, which can be formulated as follows:
\begin{equation} \mathop {\mathrm{maximize}}_{arch \in \mathcal {A}} \,\,\, Accuracy(arch) \times \left[\frac{Latency(arch)}{T}\right]^{w} , \end{equation}
(2)
where \(Accuracy(\cdot)\) and \(Latency(\cdot)\) denote the accuracy on the target task and the latency on target hardware, respectively. In addition, T is the specified latency constraint. It is worth noting that the latency \(Latency(\cdot)\) in MnasNet is directly measured on target hardware, which involves non-trivial engineering efforts given the prohibitively large search space (e.g., \(|\mathcal {A}| \approx 10^{39}\) in MnasNet) [40, 42]. To avoid the tedious on-device latency measurements, we discuss several efficient latency predictors later in this section. Apart from these, w is the coefficient that controls the trade-off magnitude between accuracy and latency, which is defined as follows:
\begin{equation} \begin{aligned}w = {\left\lbrace \begin{array}{ll} \alpha , & \text{if } Latency(arch) \le T\\ \beta , & \text{otherwise} \end{array}\right.} \end{aligned} , \end{equation}
(3)
where \(\alpha\) and \(\beta\) are application-specific hyperparameters to control the trade-off magnitude between accuracy and efficiency. According to the empirical observation that doubling the latency usually brings \(\sim\)5% relative accuracy improvement, MnasNet assigns \(\alpha = \beta = -0.07\). In practice, \(\alpha\) and \(\beta\) are both sensitive and difficult to tune. Even worse, given new hardware devices or new search spaces, \(\alpha\) and \(\beta\) involve additional engineering efforts for hyperparameter tuning. For example, as observed in MobileNetV3 [162], the accuracy changes much more dramatically with latency for small networks. Therefore, to obtain the required architecture candidate that satisfies the specified latency constraint T, we typically need to repeat 7 search experiments to tune \(\alpha\) and \(\beta\) through trial and error [163]. This significantly increases the total search cost by 7×. To eliminate such additional hyperparameter tuning, TuNAS [163] investigates the multi-objective RL reward in Equation (2) and further introduces a similar RL reward function, which can be formulated as follows:
\begin{equation} \mathop {\mathrm{maximize}}_{arch \in \mathcal {A}} \,\,\, Accuracy(arch) + \gamma \times \left|\frac{Latency(arch)}{T} - 1\right| , \end{equation}
(4)
where \(|\cdot |\) is the absolute function and \(\gamma \lt 0\) is a finite negative value, which controls how strongly we enforce the architecture candidate to maintain the latency close to T.
Fig. 10.
Fig. 10. Overview of MnasNet [38] (figure from [38]).
MONAS [164] also introduces a simple yet effective RL reward function that considers optimizing both accuracy and energy, which can be formulated as follows:
\begin{equation} Reward(arch) = \eta \times Accuracy(arch) - (1 - \eta) \times Energy(arch) , \end{equation}
(5)
where \(\eta \in [0, 1]\) is the coefficient to control the trade-off between accuracy and energy. We note that the RL reward function in Equation (5) aims to find the architecture candidate with high accuracy and low energy, which can be generalized to other performance constraints, such as latency.
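All three RL reward functions above are simple scalar functions of the measured (or predicted) accuracy and efficiency of an architecture candidate. The following minimal Python sketch implements Equations (2)–(5); note that MnasNet reports \(\alpha = \beta = -0.07\), whereas the default values used below for \(\gamma\) and \(\eta\) are illustrative placeholders rather than the values used in the cited papers.

```python
def mnasnet_reward(accuracy, latency, T, alpha=-0.07, beta=-0.07):
    """Multi-objective reward of Equations (2)-(3): acc * (lat/T)^w."""
    w = alpha if latency <= T else beta
    return accuracy * (latency / T) ** w

def tunas_reward(accuracy, latency, T, gamma=-0.07):
    """Absolute-value reward of Equation (4): acc + gamma * |lat/T - 1|.
    The default gamma is a placeholder; gamma must be negative."""
    return accuracy + gamma * abs(latency / T - 1.0)

def monas_reward(accuracy, energy, eta=0.5):
    """Weighted reward of Equation (5): eta * acc - (1 - eta) * energy.
    The default eta is a placeholder."""
    return eta * accuracy - (1.0 - eta) * energy

# Example: a candidate with 75.2% accuracy and 30 ms latency under T = 25 ms
# is penalized relative to an equally accurate candidate that meets T.
print(mnasnet_reward(0.752, 30.0, 25.0))   # < 0.752
print(mnasnet_reward(0.752, 25.0, 25.0))   # == 0.752
```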
Evolutionary Algorithm-Based Search. In addition to reinforcement learning–based search, evolutionary algorithm–based search is another popular branch in the NAS literature thanks to its flexibility, conceptual simplicity, and competitive performance [157]. As seen in the very early evolutionary practices [165, 166, 167, 168], the evolutionary algorithm–based search typically consists of four key steps: (1) sampling a set of possible architecture candidates from the search space as the child population; (2) evaluating the architecture candidates in the child population to interpret the performance, such as accuracy and efficiency; (3) reserving the top-k architecture candidates in the latest child population to form the parent population and discarding the architecture candidates with poor performance; and (4) manipulating the architecture candidates in the latest parent population to generate new architecture candidates to form the next-generation child population. These four steps are repeated until the evolutionary process converges.
There are many other aspects in which the evolutionary algorithm may differ, including (1) how to sample the initial population, (2) how to select the parent population, and (3) how to generate the child population from the parent population. Generating the child population from the parent population is of utmost importance in order to produce superior architecture candidates [157]. In practice, to allow efficient exploration and exploitation [170], crossover and mutation are two of the most popular strategies to generate the child population [171, 172]. Specifically, for crossover, two random architecture candidates from the parent population are crossed to produce one new child architecture candidate. For mutation, one randomly selected architecture candidate mutates its operators with a fixed probability. However, the early evolutionary NAS works have to train a large number of stand-alone architecture candidates from scratch to evaluate the accuracy [157] and, as a result, incur non-trivial computational costs [169].
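The following minimal Python sketch instantiates this generic evolutionary loop, including uniform crossover and per-block mutation. Here, evaluate is a stand-in for any scorer (e.g., accuracy queried from a pretrained supernet or an accuracy predictor), sample can be the sample_architecture helper from the earlier search-space sketch, and all hyperparameters are illustrative.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Per-block uniform crossover: each block is copied from either parent."""
    return [rng.choice([a, b]) for a, b in zip(parent_a, parent_b)]

def mutate(arch, block_choices, rng, prob=0.1):
    """Re-sample each block-level choice independently with a fixed probability."""
    mutated = []
    for block in arch:
        block = dict(block)
        for name, values in block_choices.items():
            if rng.random() < prob:
                block[name] = rng.choice(values)
        mutated.append(block)
    return mutated

def evolutionary_search(evaluate, sample, block_choices,
                        generations=20, population=50, topk=10, seed=0):
    """Minimal evolutionary NAS loop: evaluate -> select top-k -> reproduce."""
    rng = random.Random(seed)
    children = [sample(rng) for _ in range(population)]   # initial child population
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scored = sorted(children, key=evaluate, reverse=True)
        parents = scored[:topk]                            # parent population
        top_score = evaluate(parents[0])
        if top_score > best_score:
            best, best_score = parents[0], top_score
        children = []                                      # next-generation children
        while len(children) < population:
            if rng.random() < 0.5:
                children.append(crossover(rng.choice(parents), rng.choice(parents), rng))
            else:
                children.append(mutate(rng.choice(parents), block_choices, rng))
    return best, best_score
```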
To reduce the required computational resources for neural architecture search, [169] introduces the paradigm of one-shot NAS, which has been widely applied in subsequent NAS methods [40, 42, 138] thanks to its significant search efficiency. In parallel to [169], SMASH [173] also proposes a similar one-shot NAS paradigm, but [169] is much more popular in the NAS community. Specifically, [169] designs an effective one-shot supernet as visualized in Figure 11, which consists of all possible architecture candidates in the search space. Therefore, we only need to train the one-shot supernet, after which we can evaluate different architecture candidates in the search space with inherited network weights from the pretrained one-shot supernet as shown in Figure 12. This effectively avoids needing to train a large number of stand-alone architecture candidates from scratch. In practice, the one-shot supernet is simply trained using the standard SGD optimizer with momentum. Once the one-shot supernet is well trained, it is able to quickly and reliably approximate the performance of different architecture candidates using the paradigm of weight sharing [160]. With the well-trained one-shot supernet, it is straightforward and technically easy to leverage the standard evolutionary algorithm to search for top-performing architecture candidates with superior accuracy on the target task [169]. We note that the searched architecture candidates still need to be retrained or fine-tuned on the target task in order to recover the accuracy for further deployment on target hardware.
Fig. 11.
Fig. 11. Overview of the one-shot supernet. Solid lines mean that the operator candidates are enabled whereas dashed lines mean that the operator candidates are part of the search space but disabled. Here, the one-shot supernet contains all possible architecture candidates in the search space (figure from [169]).
Fig. 12.
Fig. 12. Illustration of the architecture candidate evaluation in one-shot NAS [169] (figure from [169]).
SPOS [161] investigates the one-shot NAS [169] and identifies two critical issues. On the one hand, the network weights in the one-shot supernet are deeply coupled during the training process. On the other hand, joint optimization introduces further coupling between architecture candidates and supernet weights. To address these, SPOS proposes the paradigm of single-path one-shot NAS, which uniformly samples one single-path subnetwork from the supernet and trains the sampled single-path subnetwork instead. This brings two main benefits: (1) reducing memory consumption to the single-path level and (2) improving the performance of the final searched architecture candidate. The success of SPOS has motivated a series of follow-up works [42, 152, 153, 174, 175, 176, 177, 178, 179]. Note that all of these follow-up works [42, 152, 153, 174, 175, 176, 177, 178, 179] focus on training an effective and reliable supernet, which then serves as the evaluator to quickly query the performance of different architecture candidates. For example, FairNAS [174] demonstrates that the uniform sampling strategy only ensures soft fairness, and to achieve strict fairness, FairNAS samples multiple single-path subnetworks to enforce that all the operator candidates in the supernet are equally optimized during each training iteration. In parallel, OFA [42] is another representative evolutionary NAS method that aims to train the supernet, after which we are allowed to detach single-path subnetworks from the supernet with inherited network weights for further deployment on target hardware. Note that the detached subnetwork in OFA still needs to be fine-tuned on the target task for several epochs (e.g., 25 epochs) in order to obtain competitive accuracy. To eliminate the fine-tuning process, BigNAS [175] proposes several enhancements to train one single-stage supernet, where the single-path subnetwork detached from the supernet with inherited network weights can achieve superior accuracy without being retrained or fine-tuned on the target task and can be directly deployed on target hardware. This significantly saves the computational resources required for training stand-alone architecture candidates, especially when targeting multiple different deployment scenarios such as multiple different hardware platforms.
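The single-path training scheme can be sketched in a few lines of PyTorch, as shown below: each supernet layer holds several operator candidates, and every training step uniformly samples and executes exactly one candidate per layer. The operator choices, tensor sizes, and placeholder loss are illustrative only, not the exact configuration used in SPOS.

```python
import random
import torch
import torch.nn as nn

class SinglePathLayer(nn.Module):
    """One supernet layer holding several operator candidates (SPOS-style)."""
    def __init__(self, channels):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in (3, 5, 7)        # illustrative operator choices
        ])

    def forward(self, x, choice):
        # Only the sampled operator is executed, so memory stays at the
        # single-path level during supernet training.
        return self.candidates[choice](x)

class SinglePathSupernet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(SinglePathLayer(channels) for _ in range(depth))

    def forward(self, x, choices):
        for layer, choice in zip(self.layers, choices):
            x = layer(x, choice)
        return x

# One supernet training step: uniformly sample a single-path subnetwork.
supernet = SinglePathSupernet()
optimizer = torch.optim.SGD(supernet.parameters(), lr=0.05, momentum=0.9)
x = torch.randn(2, 16, 32, 32)
choices = [random.randrange(3) for _ in supernet.layers]
loss = supernet(x, choices).mean()   # placeholder loss for illustration
optimizer.zero_grad()
loss.backward()
optimizer.step()
```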
Thanks to its search flexibility, evolutionary algorithm-based NAS can be easily extended to search for hardware-efficient architecture candidates, which maximize the accuracy on the target task while satisfying various real-world performance constraints [153], such as latency, energy, memory, and more. Without loss of generality, we consider the following multi-objective optimization:
\begin{equation} \mathop {\mathrm{maximize}}_{arch \in \mathcal {A}} \,\,\, Accuracy(arch) \,\,\, s.t., \,\,\, Constraint_1(arch) \le C_1, \ldots , Constraint_n(arch) \le C_n , \end{equation}
(6)
where \(\lbrace Constraint_i(\cdot)\rbrace _{i=1}^n\) and \(\lbrace C_i\rbrace _{i=1}^n\) are a set of real-world performance constraints.
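In evolutionary search, the constraints in Equation (6) are typically enforced by simply discarding infeasible candidates during sampling, crossover, and mutation. A minimal sketch is given below, where predict_latency_ms and predict_energy_mj in the usage comment are hypothetical efficiency estimators (e.g., lookup tables or learned predictors), not functions defined in the cited works.

```python
def satisfies_constraints(arch, constraints):
    """Check Equation (6): every Constraint_i(arch) <= C_i must hold."""
    return all(measure(arch) <= bound for measure, bound in constraints)

def sample_feasible(sample, constraints, rng, max_tries=1000):
    """Rejection-sample architecture candidates until one meets all constraints."""
    for _ in range(max_tries):
        arch = sample(rng)
        if satisfies_constraints(arch, constraints):
            return arch
    raise RuntimeError("no feasible candidate found; consider relaxing the constraints")

# Example usage with hypothetical latency/energy predictors:
# constraints = [(predict_latency_ms, 25.0), (predict_energy_mj, 40.0)]
# arch = sample_feasible(sample_architecture, constraints, random.Random(0))
```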
Gradient-Based Search. In addition to reinforcement learning–based search and evolutionary algorithm–based search, gradient-based search [138], also known as differentiable search, is another representative branch of NAS, which has since gained increasing popularity in the NAS community and motivated a plethora of subsequent differentiable NAS works [145, 146, 147, 148, 149, 150, 151, 180, 181, 182, 183, 184, 185, 186, 187, 188], thanks to its significant search efficiency [189]. For example, DARTS [138], as the seminal differentiable NAS work, is able to deliver one superior architecture candidate in \(\sim\)1 day on one single NVIDIA GTX 1080 Ti GPU. In contrast to previous non-differentiable NAS practices [137, 144, 160, 169] that highly rely on discrete search spaces, DARTS leverages a list of architecture parameters \(\alpha\) to relax the discrete search space to become continuous. Benefiting from the continuous search space, both the network weights w and the architecture parameters \(\alpha\) can be optimized via alternating gradient descent. Once the differentiable search process terminates, we can interpret the optimal architecture candidate from the architecture parameters \(\alpha\). Specifically, the supernet in DARTS is initialized by stacking multiple over-parameterized cells (see Figure 13 (1)), in which each cell consists of all possible cell structures in the cell-based search space \(\mathcal {A}\). As shown in Figure 13, each cell is represented using the DAG that consists of N nodes \(\lbrace x_i\rbrace _{i=1}^N\). Note that the nodes here correspond to the intermediate feature maps. In addition, the directed edges between \(x_i\) and \(x_j\) correspond to a list of operator candidates \(\lbrace o | o \in \mathcal {O}\rbrace\) in the operator space \(\mathcal {O}\). The directed edges between \(x_i\) and \(x_j\) are also assigned with a list of architecture parameters \(\lbrace \alpha _o^{(i, j)} | o \in \mathcal {O}\rbrace\). Finally, following DARTS, we formulate \(x_j\) as follows:
\begin{equation} x_j = \sum _{o \in \mathcal {O}} \frac{\exp \alpha _o^{(i, j)}}{\sum _{o^{\prime } \in \mathcal {O}} \exp \alpha _{o^{\prime }}^{(i, j)}} o(x_i) . \end{equation}
(7)
Note that the output \(x_j\) is continuous with respect to \(x_i\), \(\alpha\), and w. In light of this, DARTS proposes to optimize \(\alpha\) and w using the following bilevel optimization scheme:
\begin{equation} \mathop {\mathrm{minimize}}_{\alpha } \,\,\, \mathcal {L}_{val}(w^*(\alpha), \alpha) \,\,\, s.t., \,\,\, w^*(\alpha) = \mathop {\mathrm{arg\,min}}_w \mathcal {L}_{train}(w, \alpha) , \end{equation}
(8)
where \(\mathcal {L}_{train}(\cdot)\) and \(\mathcal {L}_{val}(\cdot)\) are the loss functions on the training and validation datasets, respectively. Once the differentiable search process terminates, DARTS determines the optimal architecture candidate by retaining the strongest operator between \(x_i\) and \(x_j\) and removing the others, in which the strength of operator o is defined as \(\exp \alpha _o^{(i, j)} / \sum _{o^{\prime } \in \mathcal {O}} \exp \alpha _{o^{\prime }}^{(i, j)}\). It is worth noting that the searched optimal architecture candidate still needs to be retrained on the target task in order to recover its accuracy for further deployment on target hardware.
Fig. 13.
Fig. 13. Overview of DARTS [138], which consists of four stages, including (1) initializing w and \(\alpha\) in the supernet, (2) optimizing w and \(\alpha\) via alternating gradient descent, (3) discretizing the optimal architecture candidate from the supernet, and (4) re-training the optimal architecture candidate to recover the accuracy.
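The continuous relaxation in Equation (7) and a first-order approximation of the bilevel scheme in Equation (8) can be sketched in PyTorch as follows. The operator set, tensor shapes, learning rates, and placeholder losses are illustrative, and the sketch covers a single edge rather than a full DARTS supernet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of one edge (Equation (7)): a softmax-weighted
    sum over the operator candidates on the edge."""
    def __init__(self, channels):
        super().__init__()
        # Illustrative operator space O; DARTS uses a richer operator set.
        self.ops = nn.ModuleList([
            nn.Identity(),                                   # skip-connect
            nn.Conv2d(channels, channels, 3, padding=1),     # 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),     # 5x5 conv
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def strongest_op(self):
        """Discretization step: keep only the operator with the largest weight."""
        return self.ops[int(self.alpha.argmax())]

# First-order approximation of Equation (8): alternate between a weight step
# on the training split and an architecture step on the validation split.
edge = MixedOp(channels=8)
w_opt = torch.optim.SGD((p for n, p in edge.named_parameters() if n != "alpha"),
                        lr=0.025, momentum=0.9)
a_opt = torch.optim.Adam([edge.alpha], lr=3e-4)

x_train, x_val = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16)
w_opt.zero_grad()
edge(x_train).mean().backward()   # placeholder training loss
w_opt.step()                      # update w
a_opt.zero_grad()
edge(x_val).mean().backward()     # placeholder validation loss
a_opt.step()                      # update alpha
```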
Inspired by the promising performance of DARTS, a plethora of follow-up works [145, 146, 147, 148, 149, 150, 151, 180, 181, 182, 183, 184, 185, 186, 187, 188] have recently emerged that strive to unleash the power of differentiable NAS to deliver superior architecture candidates. For example, in contrast to DARTS, which simultaneously optimizes all operator candidates in the supernet, PC-DARTS [146] introduces partial channel connections to alleviate the excessive memory consumption of DARTS. In addition, DARTS+ [148] investigates the performance collapse issue of DARTS and finds that the performance collapse issue is caused by the over-selection of skip-connect. To tackle this, DARTS+ proposes a simple yet effective early-stopping strategy to terminate the search process upon fulfilling a set of predefined criteria. In parallel, DARTS– [149] also observes that the performance collapse issue of DARTS comes from the over-selection of skip-connect and further leverages an auxiliary skip connection to mitigate the performance collapse issue and stabilize the search process. Apart from these, Single-DARTS [185] and Gold-NAS [184] investigate the bilevel optimization in Equation (8) and point out that the bi-level optimization may end up with suboptimal architecture candidates, based on which Single-DARTS and Gold-NAS turn back to the one-level optimization. To accelerate the search process, GDAS [186] introduces an efficient Gumbel-Softmax [190]–based differentiable sampling approach to reduce the optimization complexity to the single-path level. Similar to GDAS, SNAS [187] also leverages Gumbel-Softmax reparameterization to improve the search process, which can make use of gradient information from generic differentiable loss without sacrificing the completeness of NAS pipelines. PT-DARTS [182] revisits the architecture selection in differentiable NAS and demonstrates that the architecture parameters \(\alpha\) cannot always imply the optimal architecture candidate, based on which PT-DARTS introduces the perturbation-based architecture selection to determine the optimal architecture candidate at the end of search.
The aforementioned differentiable NAS works [145, 146, 147, 148, 149, 150, 151, 180, 181, 182, 183, 184, 185, 186, 187, 188], however, focus on accuracy-only neural architecture search, which indeed demonstrates promising performance in terms of finding the architecture candidate with competitive accuracy but fails to accommodate the limited available computational resources in real-world embedded scenarios. To overcome such limitations, the paradigm of hardware-aware differentiable NAS [130, 191, 192, 193, 194] has recently emerged, which is based on DARTS and focuses on finding top-performing architecture candidates within the cell-based search space that can achieve both high accuracy on target task and high inference efficiency on target hardware. To achieve this goal, one widely adopted approach is to integrate the latency-constrained loss term into the overall loss function to penalize the architecture candidate with high latency, which can be mathematically formulated as follows:
\begin{equation} \mathop {\mathrm{minimize}}_{\alpha } \,\,\, \mathcal {L}_{val}(w^*(\alpha), \alpha) + \lambda \cdot Latency(\alpha) \,\,\, s.t. \,\,\, w^*(\alpha) = \mathop {\mathrm{arg\,min}}_w \mathcal {L}_{train}(w, \alpha) \end{equation}
(9)
where \(\lambda\) is the trade-off coefficient to control the trade-off magnitude between accuracy and latency. As demonstrated in [131, 156], a larger \(\lambda\) ends up with the architecture candidate that maintains low accuracy and low latency, whereas a smaller \(\lambda\) leads to the architecture candidate with high accuracy and high latency. \(Latency(\alpha)\) corresponds to the latency of the architecture candidate encoded by \(\alpha\). We note that the optimization objective in Equation (9) can be easily generalized to jointly optimize other types of hardware performance constraints, such as energy and memory consumption, in which we only need to incorporate \(Energy(\alpha)\) and \(Memory(\alpha)\) into the optimization objective in Equation (9). For example, we can reformulate the optimization objective in Equation (9) as follows to jointly optimize the on-device latency, energy, and memory consumption:
\begin{equation} \mathop {\mathrm{minimize}}_{\alpha } \,\,\, \mathcal {L}_{val}(w^*(\alpha), \alpha) + \lambda _1 \cdot Latency(\alpha) + \lambda _2 \cdot Energy(\alpha) + \lambda _3 \cdot Memory(\alpha) , \end{equation}
(10)
where \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are trade-off coefficients to determine the trade-off magnitudes between accuracy and latency, energy, and memory, respectively.
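In practice, the latency term in Equations (9) and (10) is often instantiated as a differentiable expected latency, that is, the softmax-weighted sum of pre-collected per-operator latencies (e.g., from a lookup table), so that gradients can flow into the architecture parameters. The following sketch illustrates this formulation; the trade-off coefficients and latency numbers are placeholders.

```python
import torch
import torch.nn.functional as F

def expected_latency(alphas, latency_lut):
    """Differentiable latency estimate: for each searchable block, take the
    softmax-weighted sum of the pre-collected per-operator latencies."""
    total = 0.0
    for alpha, op_latencies in zip(alphas, latency_lut):
        total = total + (F.softmax(alpha, dim=0) * op_latencies).sum()
    return total

def hardware_aware_loss(val_loss, alphas, latency_lut,
                        lambda_lat=0.1, lambda_energy=0.0, energy_lut=None):
    """Equations (9)-(10): task loss plus weighted efficiency penalties.
    The coefficients are placeholders and must be tuned per task/hardware."""
    loss = val_loss + lambda_lat * expected_latency(alphas, latency_lut)
    if energy_lut is not None:
        loss = loss + lambda_energy * expected_latency(alphas, energy_lut)
    return loss

# Example: 3 searchable blocks, 3 operator candidates each, latencies in ms.
alphas = [torch.zeros(3, requires_grad=True) for _ in range(3)]
latency_lut = [torch.tensor([1.2, 2.5, 4.1]) for _ in range(3)]
loss = hardware_aware_loss(torch.tensor(0.9), alphas, latency_lut)
loss.backward()   # gradients flow into the architecture parameters alphas
```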
Despite the significant progress to date, the aforementioned hardware-aware differentiable NAS works [130, 191, 192, 193, 194] highly rely on the cell-based search space; these works first determine the optimal cell structure and then repeatedly stack the same cell structure across the entire network [138]. However, as demonstrated in MnasNet [38], such NAS practices suffer from inferior accuracy and efficiency due to the lack of operator diversity. Even worse, the architecture candidates in the cell-based search space have multiple parallel branches as shown in Figure 7, which introduce considerable memory access overheads and, as a result, have difficulty benefitting from the high computational parallelism on mainstream hardware platforms [34, 35]. To overcome such limitations, recent hardware-aware differentiable NAS works [39, 40, 154, 188, 195, 196, 197, 198, 199] have shifted their attention from the cell-based search space (see Figure 7) to the block-based search space (see Figure 8). The most representative include FBNet [39], ProxylessNAS [40], SP-NAS [197], and TF-NAS [195]. Similar to GDAS [186] and SNAS [187], FBNet leverages Gumbel-Softmax reparameterization [190] to relax the discrete search space to be continuous. FBNet collects a simple yet effective latency lookup table to quickly approximate the latency of different architecture candidates. The pre-collected latency lookup table is then integrated into the search process to derive hardware-efficient architecture candidates. However, similar to DARTS [138], FBNet needs to simultaneously optimize all the operator candidates in the supernet during the search process, which is not scalable to large search spaces and suffers from the memory bottleneck [40, 131]. In light of this, ProxylessNAS introduces an effective path-level binarization approach to reduce the memory consumption to the single-path level, which significantly improves search efficiency without compromising search accuracy. In parallel, SP-NAS demonstrates that different operator candidates in the supernet can be viewed as subsets of an over-parameterized superkernel, based on which SP-NAS proposes to encode all the operator candidates into the superkernel. In practice, this explicitly reduces memory consumption to the single-path level, which alleviates the memory bottleneck during the search process. TF-NAS thoroughly investigates the three search freedoms in hardware-aware differentiable NAS: (1) operator-level search, (2) depth-level search, and (3) width-level search as shown in Figure 14, which enables fine-grained architecture search. To obtain hardware-efficient architecture candidates, TF-NAS integrates the pre-collected latency lookup table into the search process. TF-NAS also introduces a simple yet effective bi-sampling search algorithm to accelerate the search process towards enhanced search efficiency.
Fig. 14.
Fig. 14. Overview of TF-NAS [195], which investigates the three search freedoms in conventional hardware-aware differentiable NAS: (1) operator-level, (2) depth-level, and (3) width-level (figure from [195]).
Even so, we should consider not only the explicit search cost, the time required for one single search experiment, but also the implicit search cost, the time required for manual hyperparameter tuning in order to find the desired architecture candidate. This is because, in real-world embedded scenarios such as autonomous vehicles, DNNs must be executed under strict latency constraints (e.g., 24 ms), in which any violation may lead to catastrophic consequences [20, 163]. However, to find the architecture candidate with the latency of 24 ms, the aforementioned hardware-aware differentiable NAS works [39, 40, 130, 154, 188, 191, 192, 193, 194, 195, 196, 197, 198, 199] have to repeat a plethora of search experiments to tune the trade-off coefficient \(\lambda\) (see Equation (9)) through trial and error [131, 156], which significantly increases the total search cost. The intuition behind this is that \(\lambda\), despite being able to trade off between accuracy and latency, is quite sensitive and difficult to control [131, 156]. To overcome such limitations, HardCoRe-NAS [200] leverages an elegant Block Coordinate Stochastic Frank-Wolfe (BCSFW) algorithm [201] to restrict the search direction around the specified latency requirement. In addition, LightNAS [131, 156] introduces a simple yet effective hardware-aware differentiable NAS approach, which investigates the optimization objective in Equation (9) and proposes to optimize the trade-off coefficient \(\lambda\) during the search process in order to satisfy the specified latency requirement. In other words, LightNAS focuses on automatically learning \(\lambda\) that strictly complies with the specified latency requirement, which is able to find the required architecture candidate in one single search (i.e., you only search once) and avoids performing manual hyperparameter tuning over \(\lambda\). The optimization objective of LightNAS is formulated as follows:
\begin{equation} \mathop {\mathrm{minimize}}_{\alpha } \,\,\, \mathcal {L}_{val}(w^*(\alpha), \alpha) + \lambda \cdot \left(\frac{Latency(\alpha)}{T} - 1\right)\,\,\,s.t. \,\,\, w^*(\alpha) = \mathop {\mathrm{arg\,min}}_w \mathcal {L}_{train}(w, \alpha), \end{equation}
(11)
where T is the specified latency requirement. In contrast to previous hardware-aware differentiable NAS works [39, 40, 130, 154, 188, 191, 192, 193, 194, 195, 196, 197, 198, 199], \(\lambda\) in Equation (11) is not a constant but rather a learnable hyperparameter that can be automatically optimized during the search process. For the sake of simplicity, below we use \(\mathcal {L}(w, \alpha , \lambda)\) to denote the optimization objective in Equation (11). Finally, to satisfy the specified latency requirement (i.e., \(Latency(\alpha) = T\)), w and \(\alpha\) are updated using gradient descent [138], whereas \(\lambda\) is updated using gradient ascent as follows:
\begin{equation} {\left\lbrace \begin{array}{ll} w^* = w - lr_w \cdot \frac{\partial \mathcal {L}(w, \alpha , \lambda)}{\partial w}, \,\, \alpha ^* = \alpha - lr_{\alpha } \cdot \frac{\partial \mathcal {L}(w, \alpha , \lambda)}{\partial \alpha } \\ \lambda ^* = \lambda + lr_{\lambda } \cdot \frac{\partial \mathcal {L}(w, \alpha , \lambda)}{\partial \lambda } = \lambda + lr_{\lambda } \cdot \left(\frac{Latency(\alpha)}{T} - 1\right) \end{array}\right.} , \end{equation}
(12)
where \(lr_{w}\), \(lr_{\alpha }\), and \(lr_{\lambda }\) are the learning rates of w, \(\alpha\), and \(\lambda\), respectively. Below we further demonstrate why LightNAS guarantees \(Latency(\alpha) = T\). As shown in LightNAS, a larger \(\lambda\) leads to the architecture candidate with low latency, whereas a smaller \(\lambda\) results in the architecture candidate with high latency. Therefore, if \(Latency(\alpha) \gt T\), the gradient ascent scheme increases \(\lambda\) to reinforce the latency regularization magnitude. As a result, \(Latency(\alpha)\) decreases towards T in the next search iteration. Likewise, if \(Latency(\alpha) \lt T\), the gradient ascent scheme decreases \(\lambda\) to diminish the latency regularization magnitude, after which \(Latency(\alpha)\) increases towards T in the next search iteration. Finally, the search engine ends up with the architecture candidate that strictly satisfies the specified latency requirement (i.e., \(Latency(\alpha) = T\)). More recently, Double-Win NAS [202] proposed deep-to-shallow transformable search to further marry the best of both deep and shallow networks towards an aggressive accuracy–efficiency win–win. Similar to LightNAS [131, 156], the resulting shallow network can also satisfy the specified latency constraint.
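To make the coupled updates in Equations (11) and (12) concrete, the following sketch performs one LightNAS-style search step, assuming that val_loss and latency are the differentiable task loss and the differentiable latency estimate (e.g., the expected latency from the previous sketch), and that w_opt and a_opt are the optimizers over w and \(\alpha\); lr_lambda is a placeholder.

```python
def lightnas_step(val_loss, latency, T, w_opt, a_opt, lam, lr_lambda=0.01):
    """One LightNAS-style update of Equations (11)-(12): gradient descent on the
    network weights w and architecture parameters alpha (through their
    optimizers), plus gradient ascent on the trade-off coefficient lambda."""
    loss = val_loss + lam * (latency / T - 1.0)   # Equation (11)

    w_opt.zero_grad()
    a_opt.zero_grad()
    loss.backward()
    w_opt.step()   # descent step on w
    a_opt.step()   # descent step on alpha

    # Gradient ascent on lambda: d(loss)/d(lambda) = Latency(alpha)/T - 1, so
    # lambda grows when the candidate is slower than T and shrinks otherwise.
    return lam + lr_lambda * (float(latency.detach()) / T - 1.0)
```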
Finally, Table 1 compares previous representative hardware-aware NAS works.

Table 1. Comparisons of Representative Hardware-aware NAS Works

| Method | Search Space | Search Strategy | Search Dataset | Search Cost (GPU Hours) | GPU | Target Hardware | Hardware Modeling | ImageNet FLOPs (M) | ImageNet Top-1 Acc (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MnasNet [38] | Block | Reinforce | ImageNet | 40,000 | V100 | Mobile Phones | N/A | 312 | 75.2 |
| ProxylessNAS [40] | Block | Gradient | ImageNet | 200 | V100 | GPUs, CPUs, and Mobile Phones | LUT | N/A | 75.1 |
| MobileNetV3 [162] | Block | Evolution | ImageNet | N/A | N/A | Mobile Phones | N/A | 219 | 75.2 |
| FBNet [39] | Block | Gradient | ImageNet | 216 | N/A | Mobile Phones | LUT | 375 | 74.9 |
| TuNAS [163] | Block | Reinforce | ImageNet | N/A | N/A | Mobile Phones | LUT | N/A | 75.4 |
| OFA [42] | Block | Evolution | ImageNet | 1,200 | V100 | GPUs, CPUs, Edge GPUs, and Mobile Phones | LUT | 230 | 76.0 |
| SP-NAS [197] | Block | Gradient | ImageNet | 30 | TPU | Mobile Phones | LUT | N/A | 75.0 |
| LA-DARTS [191] | Cell | Gradient | CIFAR-10 | 17 | P100 | GPUs and CPUs | Predictor | 575 | 74.8 |
| MDARTS [194] | Cell | Gradient | CIFAR-10 | \(\sim\)6.5 | Titan XP | Eyeriss | Predictor | N/A | N/A |
| EH-DNAS [203] | Cell | Gradient | CIFAR-10 | 24 | 1080 Ti | Customized Accelerators | Predictor | 840 | 69.6 |
| E-DNAS [204] | Block | Gradient | ImageNet | N/A | V100 | CPUs and DSPs | Predictor | 365 | 76.9 |
| SNAS [198] | Block | Gradient | ImageNet | 30 | N/A | TPUs | Predictor | 1,290 | 79.4 |
| HSCoNAS [152] | Block | Evolution | ImageNet | N/A | N/A | GPUs, CPUs, and Edge GPUs | LUT | N/A | 74.9 |
| DenseNAS [199] | Block | Gradient | ImageNet | 64 | Titan XP | GPUs | LUT | 361 | 75.3 |
| TF-NAS [195] | Block | Gradient | ImageNet | 43 | Titan RTX | GPUs | LUT | 284 | 75.2 |
| HardCoRe-NAS [200] | Block | Gradient | ImageNet | 400 | P100 | GPUs and CPUs | LUT | N/A | 75.7 |
| LightNAS [131] | Block | Gradient | ImageNet | 10 | RTX 3090 | Edge GPUs | Predictor | N/A | 75.2 |
| SurgeNAS [154] | Block | Gradient | ImageNet | 30 | V100 | GPUs, CPUs, and Edge GPUs | Predictor | N/A | 75.5 |
| SPOS [161] | Block | Evolution | ImageNet | 288 | 1080 Ti | GPUs | LUT | 328 | 74.7 |
| HURRICANE [153] | Block | Evolution | ImageNet | N/A | N/A | CPUs, DSPs, and VPUs | LUT | 409 | 75.1 |
| ProxyNAS [176] | Block | Evolution | ImageNet | N/A | N/A | GPUs, CPUs, TPUs, and FPGAs | Predictor | N/A | N/A |

This table roughly compares different hardware-aware NAS works; N/A means that the related data is not reported in the respective paper. Note that the accuracy in this table may be obtained under different training recipes.

3.3 Speedup Techniques and Extensions

In this section, we discuss recent state-of-the-art advances in general speedup techniques and extensions for NAS algorithms, including one-shot NAS enhancements, efficient latency prediction, efficient accuracy prediction, low-cost proxies, zero-cost proxies, efficient transformer search, efficient domain-specific search, and mainstream NAS benchmarks, which have the potential to significantly benefit NAS algorithms and largely facilitate the search process.
Beyond One-Shot NAS. Despite the high search efficiency, one-shot NAS often suffers from poor ranking correlation between one-shot search and stand-alone training. As pointed out in [205], one-shot search results do not necessarily correlate with stand-alone training results across various search experiments. To overcome such limitations, a plethora of one-shot NAS enhancements have been proposed recently [206, 207, 208, 209, 210, 211]. Specifically, [206, 207, 208, 209, 210] resort to few-shot NAS. In contrast to one-shot NAS [160], which only features one supernet, few-shot NAS introduces multiple supernets to explore different regions of the predefined search space, which slightly increases the search cost over one-shot NAS but can deliver much more reliable search results. For example, as shown in [206], with only up to 7 supernets, few-shot NAS can establish new state-of-the-art search results on ImageNet. Among them, [209] demonstrates that zero-cost proxies can be integrated into few-shot NAS, which can further enhance the search process of one-shot NAS and thus produce better search results. More recently, [208] generalizes few-shot NAS to distill LLMs, which focuses on automatically distilling multiple compressed student models under various computational budgets from a large teacher model. In contrast to few-shot NAS, which leverages multiple supernets to improve the ranking correlation performance of one-shot NAS, CLOSE [211] instead features an effective curriculum learning-like schedule to control the parameter-sharing extent within the proposed supernet dubbed CLOSENet, in which the parameter-sharing extent can be flexibly adjusted during the search process and the parameter-sharing scheme is built upon an efficient graph-based encoding scheme.
Efficient Latency Prediction.4 As seen in MnasNet [38], latency is directly measured on target hardware, which is then integrated into the RL reward (see Equation (2)) to penalize the architecture candidate with high latency. The direct on-device latency measurement is indeed accurate. However, it is time-consuming and unscalable to large search spaces [40]. To overcome such limitations, several latency prediction strategies have been proposed recently. For example, ProxylessNAS [40], FBNet [39], and OFA [42] leverage the latency lookup table to approximate the on-device latency, which sums up the latency of all the operator candidates. In addition, HSCoNAS [152, 178] demonstrates that the data movements and communications among different operator candidates introduce additional latency overheads, making the pre-collected latency lookup table inaccurate. To mitigate this issue, HSCoNAS quantifies the latency that corresponds to the intermediate data movements and communications, which is then fed into the pre-collected latency lookup table to achieve more accurate latency prediction performance. However, the latency lookup table is only applicable to the block-based search space, which leads to unreliable latency prediction performance in terms of the cell-based search space [213]. To this end, EdgeNAS [130], LA-DARTS [191], and LC-NAS [192] propose to use learning-based approaches for the purpose of latency prediction. For example, EdgeNAS trains an efficient multi-layer perceptron (MLP) to predict the latency of different architecture candidates in the cell-based search space, which can also be generalized to predict the latency of different architecture candidates in the block-based search space as shown in [131, 156, 176, 212, 214]. BRP-NAS [213] and SurgeNAS [154] introduce graph neural network (GNN)–based latency predictors to achieve more reliable latency prediction performance. The above latency predictors (1) rely on a large number of training samples to achieve decent latency prediction performance (e.g., 100,000 training samples in EdgeNAS) and (2) need to be reconstructed for either new hardware or new search spaces. To avoid these, HELP [215] and MAPLE-Edge [216] focus on building an efficient latency predictor using only a few training samples (e.g., as few as 10 training samples in HELP), which can be generalized to new hardware or new search spaces with only minimal re-engineering efforts. More recently, EvoLP [217] considered an effective self-evolving scheme to construct efficient yet accurate latency predictors, which can adapt to unseen hardware with only minimal re-engineering efforts.
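As a concrete example of the learning-based predictors above, the following PyTorch sketch trains a small MLP that maps a flat architecture encoding to a latency estimate. The encoding dimension, network sizes, and randomly generated training pairs are placeholders for real architecture encodings and on-device measurements.

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Learned latency predictor: a small MLP that maps a flat architecture
    encoding (e.g., one-hot per-block choices) to a latency estimate in ms."""
    def __init__(self, encoding_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(encoding_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, encoding):
        return self.mlp(encoding).squeeze(-1)

# Training on (encoding, measured latency) pairs collected from target hardware;
# the random tensors below stand in for real encodings and measurements.
predictor = LatencyPredictor(encoding_dim=60)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
encodings = torch.rand(256, 60)
latencies = torch.rand(256) * 30.0
for _ in range(10):
    loss = nn.functional.mse_loss(predictor(encodings), latencies)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```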
Efficient Accuracy Prediction. In parallel to latency prediction, accuracy prediction has also received increasing attention from the NAS community [213, 218, 219, 220, 221, 222], which strives to directly predict the accuracy of different architecture candidates in the search space. Specifically, [218] introduces a simple yet effective graph convolutional network (GCN)–based accuracy predictor, which can achieve reliable accuracy prediction performance thanks to GCNs’ strong capability to learn graph-structured data. Similar to [218], BRP-NAS [213] also considers GCNs for reliable accuracy prediction, which introduces transfer learning to further improve the accuracy prediction performance from the pretrained latency predictor. In parallel, [219] leverages the non-neural network (i.e., gradient-boosted decision tree (GBDT)) as the accuracy predictor, which has a stronger capability to learn representations than neural network–based accuracy predictors. In addition, NASLib [220] investigates a wide range of accuracy predictors from learning curve extrapolation, weight-sharing, supervised learning, and zero-cost proxies on three popular NAS benchmarks (i.e., NAS-Bench-101 [223], NAS-Bench-201 [224], and NAS-Bench-NLP [225]). NASLib reveals that different accuracy predictors can be combined to achieve substantially better accuracy prediction performance than any single accuracy predictor. DONNA [221] proposes to build an efficient accuracy predictor, which involves only minimal computational resources and, more importantly, can scale to diverse search spaces. To achieve this, DONNA uses blockwise knowledge distillation to construct an architecture candidate pool in which each architecture candidate only needs to be fine-tuned for several epochs to derive the accuracy rather than being trained from scratch. In contrast to the aforementioned accuracy predictors that feature graph-based encoding schemes, GATES [222, 226] instead models the operations as the transformation of the propagating information, which can effectively mimic the actual data processing of different neural architecture candidates. More importantly, the encoding scheme of GATES can be integrated into the above accuracy predictors to further boost their accuracy prediction performance. Similar to GATES, TA-GATES [227] introduces an effective encoding scheme with analogous modeling of the training process of different neural architecture candidates, which can achieve better accuracy prediction performance than GATES on various representative NAS benchmarks.
Low-Cost Proxies (Learning Curve Extrapolations). Low-cost proxies, also referred to as learning curve extrapolations [231], aim to interpret the accuracy of the given architecture candidate only using its early training statistics, such as the training loss in the first few training epochs, which has motivated a plethora of subsequent works to continue exploring learning curve extrapolation [156, 232, 233, 234, 235, 236, 237]. For example, in contrast to the conventional accuracy predictor that only uses the network configuration as input features, [236] proposes to combine the network configuration and a series of validations of accuracy in the first few training epochs as input features to train a simple regression model, which can be generalized to predict the accuracy of unseen architecture candidates. In addition, [232] introduces Training Speed Estimation (TSE), which simply accumulates the early training statistics to achieve reliable yet computationally inexpensive ranking among different architecture candidates. The work of [156, 237] introduces Batchwise Training Estimation (BTE) and Trained Batchwise Estimation (TBE), both of which consider the fine-grained batchwise training statistics to provide more reliable prediction performance using minimal computational resources. In parallel, [234] introduces Loss Curve Gradient Approximation (LCGA) to rank the accuracy of different architecture candidates with minimal training. The work of [233] introduces NAS-Bench-x11 to unleash the power of learning curve extrapolation by predicting the training trajectories, which can be easily integrated into the aforementioned learning curve extrapolation works to quickly estimate the performance of the given architecture candidate.
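The following sketch illustrates the general recipe behind these low-cost proxies: concatenate the network configuration with early-epoch validation accuracies and fit a lightweight regressor to predict the final accuracy. The simple least-squares model below is a stand-in for the regression models used in the cited works.

```python
import numpy as np

def learning_curve_features(arch_encoding, early_val_acc):
    """Build predictor inputs in the spirit of [236]: concatenate the network
    configuration with validation accuracies from the first few epochs."""
    return np.concatenate([np.asarray(arch_encoding, dtype=np.float32),
                           np.asarray(early_val_acc, dtype=np.float32)])

def fit_linear_extrapolator(features, final_acc):
    """Fit a least-squares regressor mapping early statistics to final accuracy."""
    X = np.stack(features)                        # (num_candidates, feature_dim)
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(X, np.asarray(final_acc, dtype=np.float32), rcond=None)
    return w

def predict_final_acc(w, feature):
    """Predict the final accuracy of an unseen candidate from its early statistics."""
    return float(np.append(feature, 1.0) @ w)
```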
Zero-Cost Proxies.5 In addition to the above low-cost proxies (i.e., learning curve extrapolation), zero-cost proxies have recently flourished [228, 229, 230, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247], which focus on interpreting the performance of the given architecture candidate in training-free manners. Zero-cost proxies, such as EPE [240], Fisher [241], GradNorm [238], Grasp [242], Jacov [243], Snip [244], Synflow [245], ZenScore [246], LRC [228], and NTK [229], can provide reliable performance estimation using only one single mini-batch of data and one single forward/backward propagation pass, which necessitate near-zero computational cost [230, 238, 239]. Thanks to their reliable performance estimation and low cost, these zero-cost proxies have been widely adopted in recent NAS works to accelerate the search process [230, 243, 247]. As demonstrated in [230, 238], combining different zero-cost proxies may lead to more reliable ranking performance estimation than any single zero-cost proxy. For example, as shown in Figure 15, combining LRC and NTK provides more reliable ranking performance estimation than LRC or NTK separately. In light of this, TE-NAS [230] further leverages LRC and NTK to jointly estimate the ranking performance among different architecture candidates in the search space, which quickly ends up with the optimal architecture candidate on ImageNet in less than 4 hours on one single NVIDIA GTX 1080 Ti GPU.
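As a concrete example, the following sketch computes a GradNorm-style zero-cost proxy [238]: one forward/backward pass on a single mini-batch, followed by summing the gradient norms of all weights. The toy candidate network and batch are placeholders; in practice, the proxy is evaluated for every sampled architecture candidate and used to rank them.

```python
import torch
import torch.nn as nn

def gradnorm_proxy(model, inputs, targets, loss_fn=nn.CrossEntropyLoss()):
    """GradNorm-style zero-cost proxy: one forward/backward pass on a single
    mini-batch, then sum the L2 norms of all weight gradients. Higher scores
    are taken to indicate more trainable architecture candidates."""
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    score = 0.0
    for p in model.parameters():
        if p.grad is not None:
            score += p.grad.norm(p=2).item()
    model.zero_grad()
    return score

# Example on a toy candidate network and a single random mini-batch.
candidate = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
print(gradnorm_proxy(candidate, x, y))
```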
Fig. 15.
Fig. 15. Comparisons of different zero-cost proxies on CIFAR-100: (a) LRC [228], (b) NTK [229], and (c) LRC + NTK [230], in which the accuracy is queried from NAS-Bench-201 [224] (figure from [230]).
Efficient Transformer Search. In addition to CNNs, transformers are another important branch of DNNs. Inspired by the tremendous success of NAS in searching for superior CNNs, automated transformer search has gained increasing popularity, which applies NAS techniques to automatically search for superior transformers, including transformers for NLP tasks [93, 249, 250, 251, 252, 253, 254] and vision transformers for vision tasks [248, 255, 256, 257, 258, 259, 260]. Automated transformer search is technically the same as automated convolutional network search, in which both feature the same search pipeline. For example, HAT [93], as one of the state-of-the-art NAS works in the field of NLP, focuses on searching for hardware-efficient transformers for NLP tasks. To achieve this, HAT first initializes an over-parameterized superformer that consists of all possible transformer candidates in the search space, which is technically the same as the supernet in automated convolutional network search. After that, HAT trains the superformer using the standard weight-sharing technique [160], which then serves as the accuracy predictor to quickly interpret the accuracy of different transformer candidates. Next, HAT builds an efficient latency predictor to avoid the tedious on-device latency measurement. Finally, HAT applies the standard evolutionary algorithm to find hardware-efficient transformer candidates with both high accuracy and high efficiency, which is technically the same as OFA [42], which searches for hardware-efficient convolutional networks. Furthermore, due to the tremendous success of vision transformers in vision tasks as discussed in Section 2.2, a plethora of NAS works [248, 255, 256, 257, 258, 259, 260] have been subsequently proposed to automate the design of superior vision transformers. The work of [248], being the first, introduces an evolutionary algorithm-based NAS framework dubbed AutoFormer. Similar to HAT, AutoFormer first constructs an over-parameterized superformer that consists of all possible vision transformer candidates in the search space, which is then trained using the weight entanglement scheme. The difference between weight sharing and weight entanglement is visualized in Figure 16, in which weight entanglement is technically similar to the superkernel in SP-NAS [197]. Finally, AutoFormer applies the standard evolutionary algorithm to explore the optimal vision transformer candidate. These clearly demonstrate that we can easily leverage recent state-of-the-art NAS techniques that focus on searching for competitive CNNs to automate the design of top-performing transformers for both NLP and vision tasks.
Fig. 16.
Fig. 16. Illustration of weight sharing [160] and weight entanglement [248] (figure from [248]).
Efficient Domain-Specific Search. In addition to image classification, NAS can also be applied to a wide range of real-world scenarios, such as object detection [268, 269, 270], semantic segmentation [271, 272, 273], point cloud processing [192, 274, 275, 276], image super-resolution [277, 278], and more. For example, MobileDets [268] are a family of hardware-efficient object detection networks that can deliver promising detection accuracy while maintaining superior detection efficiency on multiple embedded computing systems, including mobile central processing units (CPUs), edge tensor processing units (TPUs), and edge GPUs. MobileDets first construct an enlarged search space that contains a large number of possible object detection networks and then leverage an MnasNet-like reinforcement learning–based search algorithm [38] to find top-performing object detection networks, which also feature the same reward function as TuNAS [163] to trade off between detection accuracy and efficiency. The work of [192] introduces an efficient hardware-aware differentiable NAS framework dubbed LC-NAS, aiming to automate the design of competitive network solutions for point cloud processing. Here, similar to EdgeNAS [130] and LA-DARTS [191], which focus on finding top-performing architecture candidates for image classification, LC-NAS exploits the same cell-based search space and integrates the latency constraint into the optimization objective to penalize the architecture candidate with high latency. These demonstrate that we can easily include domain-specific knowledge (e.g., domain-specific search spaces) into mainstream NAS techniques (e.g., differentiable, evolutionary algorithm–based, and reinforcement learning–based NAS) to search for domain-specific network solutions.
Mainstream NAS Benchmarks. Although NAS has achieved substantial performance improvement across various NLP and vision tasks, fair comparisons between different NAS works are frustratingly hard and still an open issue, as demonstrated in [279]. This is because different NAS works may feature quite different training recipes, such as different training epochs and training enhancements. For example, DARTS+ [148] trains the searched architecture candidate on CIFAR-10 for 2,000 epochs, whereas DARTS [138] only applies 600 training epochs. DARTS+ trains the searched architecture candidate on ImageNet for 800 epochs, with a batch size of 2,048, where AutoAugment [280] is also integrated in order to achieve stronger data augmentations. In contrast, DARTS only applies 250 training epochs with a batch size of 128 by default. We note that, for the same architecture candidate, longer training epochs and stronger data augmentations typically achieve better training accuracy on the target task, as shown in [279]. RandomNAS [281] challenges the effectiveness of early state-of-the-art NAS works and demonstrates that random search, as one strong search baseline to explore random networks, can achieve even better performance on the target task than early state-of-the-art NAS works. In parallel, RandWire [282] shows that randomly wired networks can also exhibit strong accuracy on ImageNet. Therefore, it remains unknown whether the performance improvement of NAS is due to the more advanced training recipe or the search algorithm itself, making it difficult to evaluate and compare the technical contributions of different NAS works [223, 224, 279].
To overcome such limitations, a plethora of tabular and surrogate NAS benchmarks have been proposed: NAS-Bench-101 [223], NAS-Bench-201 [224], NATS-Bench [261], NAS-Bench-301 [262], NAS-Bench-360 [263], NAS-Bench-1Shot1 [264], NAS-Bench-ASR [265], NAS-Bench-Graph [266], NAS-Bench-NLP [225], HW-NAS-Bench [214], NAS-Bench-x11 [233], and NAS-Bench-Suite [267]. We note that NAS benchmarks typically have two important parts, the predefined search space and the related performance metrics for all possible architecture candidates that can be easily queried. In tabular NAS benchmarks [214, 223, 224, 225, 261, 263, 264, 265, 266], all possible architecture candidates are enumerated and trained from scratch on the target task to obtain the performance metrics, such as the training and validation accuracy. In contrast, surrogate NAS benchmarks [214, 233, 262, 267] leverage learning-based methods to predict the performance metrics of different architecture candidates rather than directly enumerating and training all possible architecture candidates on the target task, thus leading to significantly reduced computational resources. In light of this, surrogate NAS benchmarks can be easily extended to deal with larger search spaces than tabular NAS benchmarks (\(10^{18}\) in NAS-Bench-301 [262] vs. 15,625 in NAS-Bench-201 [224]). Finally, we compare and summarize the aforementioned state-of-the-art NAS benchmarks in Table 2.
Table 2. Comparisons of Different NAS Benchmarks

| Benchmark | Search Space Size | Search Space Type | Type (Tabular/Surrogate) | Tasks | Datasets | Metrics |
| --- | --- | --- | --- | --- | --- | --- |
| NAS-Bench-101 [223] | 423k | Cell-Based | Tabular | Image Classification | CIFAR-10 | Training Accuracy, Validation Accuracy, Testing Accuracy, Training Time, and Number of Parameters |
| NAS-Bench-201 [224] | 15.6k | Cell-Based | Tabular | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | Training Accuracy, Validation Accuracy, Testing Accuracy, Training Loss, Validation Loss, Testing Loss, Training Time, Number of FLOPs, and Number of Parameters |
| NATS-Bench [261] | 39.3k | Cell-Based | Tabular | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | Training Accuracy, Validation Accuracy, Testing Accuracy, Training Loss, Validation Loss, Testing Loss, Training Time, Number of FLOPs, and Number of Parameters |
| NAS-Bench-301 [262] | \(10^{18}\) | Cell-Based | Surrogate | Image Classification | CIFAR-10 | Validation Accuracy |
| NAS-Bench-360 [263] | N/A | Cell- and Block-Based | Tabular | 10 Diverse Tasks | 10 Diverse Datasets | N/A |
| NAS-Bench-1Shot1 [264] | 399k | Cell-Based | Tabular | Image Classification | CIFAR-10 | Validation Accuracy |
| NAS-Bench-ASR [265] | 8.2k | Cell-Based | Tabular | Automatic Speech Recognition | TIMIT | CTC Loss, Phoneme Error Rate (PER), On-Device Latency, Number of FLOPs, and Number of Parameters |
| NAS-Bench-Graph [266] | 26.2k | Cell-Based | Tabular | 9 Graph Tasks | 9 Graph Datasets | Training Loss, Validation Loss, Testing Loss, Validation Accuracy, On-Device Latency, and Number of Parameters |
| NAS-Bench-NLP [225] | 14k | Cell-Based | Tabular | Language Understanding | PTB and WikiText-2 | Testing Perplexity, Training Time, and Number of Parameters |
| NAS-Bench-111 [233] | 423k | Cell-Based | Surrogate | Image Classification | CIFAR-10 | Training Accuracy, Validation Accuracy, Testing Accuracy, Training Loss, Validation Loss, and Testing Loss |
| NAS-Bench-311 [233] | \(10^{18}\) | Cell-Based | Surrogate | Image Classification | CIFAR-10 | Same as NAS-Bench-111 |
| NAS-Bench-NLP11 [233] | \(10^{53}\) | Cell-Based | Surrogate | Language Understanding | PTB | Same as NAS-Bench-111 |
| NAS-Bench-Suite [267] | N/A | Cell-Based | Tabular and Surrogate | A suite of 11 tabular and surrogate NAS benchmarks | — | — |
| HW-NAS-Bench [214] | 15.6k | Cell-Based | Tabular | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | On-Device Latency |
| HW-NAS-Bench [214] | \(10^{21}\) | Block-Based | Surrogate | Image Classification | CIFAR-100 and ImageNet | On-Device Latency |

Note that ImageNet-16-120 is a subset of ImageNet that consists of 120 object categories, in which the input image resolution is fixed to 16×16 [224].

3.4 Visions for the Future

In this section, we envision several promising future trends and possible directions in the field of automated network design, summarized as follows:
(1)
General Search Spaces. The success of NAS highly relies on the well-engineered search space, such as the cell-based search space [137, 144, 160] and the block-based search space [38, 39, 40]. In the past, researchers manually designed the search spaces using heuristic-based strategies, which are typically based on existing state-of-the-art networks, such as MobileNets [123, 268] and ShuffleNets [34, 35]. This effectively restricts the search space to improve the search efficiency and delivers competitive architecture candidates with promising accuracy and efficiency. However, this may significantly limit the search performance, which may reject more competitive architecture candidates outside the well-engineered search space. To overcome such limitations, the authors of [283, 284] pioneer designing more general search spaces than the cell-based and block-based search spaces, which, unfortunately, are under-explored since the works of [283, 284] still suffer from human biases. Therefore, one promising future direction in the field of NAS is to innovate and explore more general search spaces to unleash the power of automated network design.
(2)
Fully Automated Architecture Search. The early NAS practices either focus on searching for the optimal architecture candidate [137, 144, 160], the optimal data augmentation [280, 285], the optimal activation function [286, 287], or the optimal training recipe [288, 289]. As demonstrated in FBNetV3 [288] and AutoHAS [289], different architecture candidates may prefer different training recipes, in which jointly searching for the optimal architecture candidate and its tailored training recipe has the potential to push forward the attainable accuracy. This observation can be easily generalized. For example, different architecture candidates may prefer different data augmentations. Therefore, one promising future direction in the field of NAS is fully automated search, which jointly searches for the optimal architecture candidate and its tailored data augmentation, activation function, and training recipe in one single search experiment to maximize the attainable accuracy.
(3)
Multi-task Architecture Search. Previous NAS works typically focus on searching for task-specific architecture candidates that can achieve promising performance in the specified task, such as image classification [138, 146], object detection [268, 269, 270], and semantic segmentation [271, 272, 273]. However, this search paradigm significantly increases the total search cost as the number of tasks grows, since we have to conduct separate search experiments for each task. To alleviate this issue, FBNetV5 [290] takes the first step to search for multi-task architecture candidates that can achieve competitive performance across multiple tasks, including image classification on ImageNet [80], object detection on COCO [291], and semantic segmentation on ADE20K [292]. Nonetheless, this is far from enough since we have a large number of tasks in real-world scenarios. Therefore, one promising future direction in the field of NAS is multi-task search, which automates the design of top-performing architecture candidates that can be generalized to multiple different tasks without being re-engineered (i.e., once for all).
(4)
Dynamic Architecture Search. Previous NAS works [137, 138, 144, 160] typically focus on searching for static neural networks that can only run at fixed computational budgets, which cannot adapt to lower or higher computational complexity. Dynamic neural networks, such as slimmable neural networks [293, 294, 295], are another important branch of DNNs, which can be executed to accommodate different computational resources in real-world environments. This is because, even on the same hardware device, the available computational resources may vary with respect to time. For example, mobile phones may be in low-power or power-saving modes to reduce power consumption. To overcome such limitations, deploying multiple static neural networks on the same hardware device seems to be the first-in-mind solution, which, unfortunately, demands high on-device storage requirements. Therefore, one promising future direction in the field of NAS is to search for top-performing dynamic neural networks, which can instantly, adaptively, and efficiently trade off between accuracy and inference efficiency to accommodate the rapidly changing computational budgets in real-world embedded computing scenarios.
(5)
Hybrid Architecture Search. As discussed in Section 2, both convolutional networks and vision transformers have their own technical merits when applied to vision tasks. Convolutional networks demonstrate superior efficiency on target hardware, whereas vision transformers achieve better accuracy on a target task. In light of this, designing hybrid networks on top of both convolutional networks and vision transformers has the potential to push forward accuracy–efficiency trade-offs. Nonetheless, previous NAS works typically focus on searching for either convolutional networks [137, 138, 144, 160] or vision transformers [248, 256, 257, 260]. To alleviate this, the authors of [296, 297] have taken the very first steps to investigate hybrid architecture search, which, however, still remains under-explored. Therefore, one promising future direction in the field of NAS is to search for competitive hybrid networks that combine the technical merits of both convolutional networks and vision transformers to achieve better accuracy–efficiency trade-offs.
(6)
Explainable Architecture Search. Previous representative NAS works [40, 42, 138, 144] highly rely on the weight-sharing paradigm [160], also known as one-shot NAS, which initializes an over-parameterized supernet that consists of all possible architectures in the search space and then searches for the optimal architecture candidate within the supernet through weight-sharing. Despite the promising search efficiency, one-shot NAS has been widely criticized due to its limited explainability, which implies that weight-sharing may lead to suboptimal architectures due to weight interference. Even worse, the intuition behind one-shot NAS still remains unknown in the NAS community. To alleviate this, a plethora of zero-shot NAS works have been proposed recently [228, 229, 230, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247], which leverage zero-cost proxies to quickly interpret the accuracy of different architectures. However, existing zero-cost proxies still cannot achieve reliable performance estimation as shown in Figure 15. Therefore, one promising future direction in the field of NAS is to develop more explainable NAS techniques and innovate more reliable zero-cost proxies.
(7)
Meta Architecture Search. Meta-learning [298], also referred to as learning-to-learn, aims to facilitate and accelerate common learning-based practices such that the learned model can quickly adapt to unseen tasks/environments using minimal engineering efforts. For example, HELP [215] introduces an efficient meta-learning–based latency predictor, which can be generalized to new hardware platforms using as few as 10 latency measurements. The widely used weight-sharing paradigm in the field of NAS can be considered to be a special case of meta-learning, which takes the over-parameterized supernet as the meta-model. As the weight-sharing paradigm has been dominating recent advances in the field of NAS, one promising future direction is to explore meta-learning to accelerate the search process and enhance the few-shot learning capability [299, 300, 301].

4 Network Compression for Embedded Computing Systems

In addition to designing novel networks, another alternative is to compress existing networks at hand, either manually designed networks or automatically searched networks, to reduce network redundancy, which leads to network variants with better accuracy–efficiency trade-offs. As illustrated in previous relevant literature [82, 302], there are three popular branches of network compression techniques, including network pruning, network quantization, and network distillation. Note that these three branches are complementary to each other as shown in [303], which indicates that they can be combined to further enable better accuracy–efficiency trade-offs. Among them, network pruning and network quantization focus on improving the accuracy–efficiency trade-off from the efficiency perspective, whereas network distillation enhances the accuracy–efficiency trade-off from the accuracy perspective. To this end, we systematically discuss recent state-of-the-art network compression techniques in this section. For better understanding, we divide these network compression techniques into three main categories and subsections, since they feature different algorithms to improve the accuracy–efficiency trade-off from different perspectives: network pruning in Section 4.1, network quantization in Section 4.2, and network distillation in Section 4.3. Note that these network compression techniques can typically generalize across different networks (e.g., convolutional networks and transformers). For example, we can leverage knowledge distillation to enhance the training process of both convolutional networks and transformers towards better training accuracy.

4.1 Network Pruning

The rationale behind network pruning is that DNNs are usually over-parameterized and redundant in terms of network weights and channels [304, 305]. As such, eliminating redundant network weights and channels can largely benefit network efficiency at the cost of minimal accuracy loss, thus making it easier to accommodate the limited computational resources and strict storage budgets in real-world embedded scenarios. Following previous well-established pruning conventions, we divide recent state-of-the-art pruning methods into two main categories according to their pruning granularity, in which non-structured pruning (i.e., weight pruning) is fine-grained whereas structured pruning (i.e., channel pruning and layer pruning) is coarse-grained. As illustrated in Figure 17, weight pruning focuses on removing the redundant weight connections, whereas channel pruning and layer pruning focus on removing the redundant channels and layers. In practice, both non-structured pruning and structured pruning can explore simplified network structures with optimized computational efficiency. Nonetheless, non-structured weight pruning highly relies on specialized hardware accelerators [29] and cannot provide realistic runtime speedups on modern embedded computing systems owing to irregular network sparsity [305, 306]. In contrast, structured channel pruning and layer pruning are coarse-grained and do not introduce irregular network sparsity, which can deliver realistic runtime speedups on modern embedded computing systems. For better coverage, below we further discuss recent representative works in the field of both non-structured pruning and structured pruning, which are also summarized in Figure 19.
Fig. 17.
Fig. 17. Illustration of different structured and non-structured pruning strategies. Weight pruning is non-structured, whereas channel pruning and layer pruning are structured.
Fig. 18.
Fig. 18. Distribution of weight gates [312].
Fig. 19.
Fig. 19. Illustration of non-structured and structured pruning works that have been discussed in Section 4.1.

4.1.1 Non-Structured Pruning.

Non-structured pruning, also referred to as weight pruning, removes the less important network weights, which is typically more fine-grained than structured pruning as illustrated in Figure 17. Applying weight pruning for network compression can be traced back to the early 1990s. For example, Optimal Brain Damage [307] and Optimal Brain Surgeon [308], the very early weight pruning approaches, pioneered the investigation of the efficacy of weight pruning on vanilla fully-connected networks, in which the less important network weights are removed based on the Hessian of the loss function. More recently, [29] proposes a simple yet effective weight pruning technique to compress deep convolutional networks, such as AlexNet [76] and VGGNet [1], instead of vanilla fully-connected networks. Specifically, [29] observes that the network weights with smaller magnitudes typically contribute less to network accuracy, based on which [29] removes the less important network weights with smaller magnitudes. Subsequently, this weight pruning technique is further integrated into Deep Compression [89] to obtain highly compressed networks, making it possible to aggressively reduce network size without sacrificing network accuracy. For example, Deep Compression is able to significantly reduce the network size of VGGNet by ×49, from 552 MB to 11.3 MB, while maintaining comparable accuracy on ImageNet. Nonetheless, the reduction in terms of the network size cannot directly translate into the speedup on target hardware since the resulting compressed networks exhibit highly irregular network sparsity. To overcome such limitations, EIE [306] designs an efficient specialized inference engine to maximize the inference efficiency of compressed networks. In parallel, [309] proposes an efficient data-free weight pruning approach to iteratively remove redundant network weights. In addition, [310] and [311] leverage Variational Dropout and \(L_0\)-norm regularization-based stochastic gates, respectively, to remove the less important network weights.
Weight Importance Criteria. The core of weight pruning is to determine the importance of different network weights, based on which we can easily rank different network weights and remove the less important network weights at the cost of minimal accuracy loss. There have been several representative importance criteria to measure the importance of different network weights after the network is trained. The most straightforward criterion is based on the weight magnitude thanks to its conceptual simplicity and surprisingly strong performance, which leverages the absolute weight \(|w|\) to interpret the importance of different network weights (i.e., the larger, the more important) [29, 313, 314]. The rationale behind magnitude-based weight pruning is that smaller network weights typically contribute less to the output of the network. Other importance criteria include second-order derivative-based [307, 315], Taylor expansion-based [316], and output sensitivity–based [317] strategies. More recently, [312] proposes an effective strategy, namely, gates with differentiable polarization (GDP), which introduces learnable gates to interpret the importance of different network weights. GDP encourages a large margin between exact zero gates and non-zero gates as shown in Figure 18, while still allowing gradient optimization. Finally, GDP removes the network weights with exact zero gates and further merges the remaining non-zero gates into the resulting pruned network without hurting the network accuracy once the optimization process terminates. Despite the impressive progress to date, the design of efficient and effective importance criteria remains under-explored and an open challenge in the community.
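To make the magnitude criterion concrete, the following is a minimal sketch of global magnitude-based weight pruning, assuming PyTorch; the helper name and the global-threshold rule are illustrative rather than the exact procedure of any specific work.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, prune_ratio: float = 0.5):
    """Zero out the prune_ratio fraction of weights with the smallest |w|,
    computed globally over all Conv2d/Linear layers."""
    # Gather the absolute values of all prunable weights.
    all_weights = torch.cat([
        m.weight.detach().abs().flatten()
        for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))
    ])
    # Global magnitude threshold: weights below it are treated as less important.
    k = max(1, int(prune_ratio * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values

    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            mask = (m.weight.detach().abs() > threshold).float()
            m.weight.data.mul_(mask)   # apply the binary pruning mask in place
            masks[name] = mask         # keep masks to re-apply during fine-tuning
    return masks

# Usage (illustrative): masks = magnitude_prune(my_model, prune_ratio=0.9)
```

In practice, the returned masks would be re-applied after each fine-tuning step so that pruned weights stay zero.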
Sparse Network Acceleration. Different from structured pruning, which is hardware-friendly, non-structured pruning, despite being able to maintain competitive accuracy under high compression ratios, introduces considerable irregular network sparsity, making it difficult to parallelize the resulting sparse networks on mainstream hardware systems such as GPUs and CPUs [306]. This indicates that non-structured pruning highly relies on specialized hardware to achieve superior on-device speedups. To this end, a plethora of specialized hardware accelerators [306, 318, 319, 320, 321, 322, 323, 324, 325] and compiler-based optimization techniques [326, 327] have been developed recently to accelerate the on-device inference of sparse networks, which typically focus on improving the irregular memory access on target hardware. For example, Cambricon-X [321] features an access-efficient indexing module to select and transfer irregular network weights to different processing elements (PEs) with reduced bandwidth requirements. This indexing module allows each PE to store irregular network weights for local computation in an asynchronous manner and, as a result, significantly reduces the irregular memory access overheads across different PEs.
Sparse Training Techniques. As shown in [29], weight pruning can effectively lead to efficient sparse network variants that are up to 90% smaller than the unpruned network, significantly alleviating the storage requirements and reducing the computational complexity. Despite the promising efficiency improvement, training the resulting sparse networks is quite challenging, as using conventional training strategies may lead to non-negligible accuracy loss, as demonstrated in [328]. To recover the accuracy, a plethora of sparse training techniques have been developed to train the resulting sparse networks [29, 89, 329, 330, 331, 332]. For example, [29] proposes to fine-tune the pruned sparse network with inherited network weights from the unpruned network. Furthermore, [89] generalizes this fine-tuning strategy to become iterative, in which multiple iterations of pruning and fine-tuning are repeated to recover the attainable accuracy. In addition, [330] investigates the performance collapse issue of training sparse networks, which may easily be trapped in suboptimal local minima. To escape the suboptimal local minima, [330] proposes to traverse extra dimensions between dense and sparse subspaces when training sparse networks. More recently, [331] introduces an alternative sparse training strategy that deviates from the default vanilla training protocol by introducing ghost neurons and skip connections at the early training stage and by strategically modifying the initialization as well as the labels. Below, we introduce another representative branch of weight pruning and sparse training techniques, namely, the lottery ticket hypothesis [328], which demonstrates that the pruned sparse networks, when properly initialized, can be trained from scratch to recover accuracy comparable to the unpruned network.
Lottery Ticket Hypothesis. The lottery ticket hypothesis [328] is a special case of non-structured pruning that has since gained increasing popularity in the pruning community [333, 334, 335, 336, 337, 338]. The lottery ticket hypothesis reveals that: A randomly initialized unpruned network contains sparse subnetworks (i.e., winning tickets) that are initialized such that, when trained in isolation, they can match the accuracy of the unpruned network after training for at most the same number of iterations. The winning tickets can be up to 90% smaller than the unpruned network while at the same time maintaining comparable accuracy. To identify the winning ticket, the lottery ticket hypothesis [328] proposes to leverage the following steps:
(1)
Randomly initialize the unpruned network \(f(x;\theta _0)\), in which \(\theta _0 \sim \mathcal {D}_{\theta }\);
(2)
Train the unpruned network for j iterations, arriving at weights \(\theta _j\);
(3)
Prune p% of the weights in \(\theta _j\), creating the sparse mask \(m \in \lbrace 0, 1\rbrace ^{|\theta _0|}\);
(4)
Reset the remaining weights to their values in \(\theta _0\), creating the winning ticket \(f(x;m \odot \theta _0)\).
As demonstrated in [328], directly pruning p% of the network weights may lead to significant accuracy loss, which also makes the training process unstable. To overcome such limitations, the lottery ticket hypothesis proposes to iteratively repeat the above steps, also referred to as iterative pruning, which repeatedly trains, prunes, and resets the network weights over n rounds, where each round prunes \(p^{\frac{1}{n}}\)% of the network weights. The lottery ticket hypothesis opens up the possibility and provides empirical guidelines to train sparse networks from scratch to match the accuracy of the unpruned network. Subsequently, [333, 334] prove the lottery ticket hypothesis from insightful theoretical perspectives. In parallel, [335] investigates the performance collapse issue of the lottery ticket hypothesis, especially when dealing with deeper networks, such as ResNets [2] and DenseNets [3], based on which [335] proposes an effective modified iterative pruning scheme called rewinding iteration to stabilize the lottery ticket hypothesis. The works of [336, 337, 338] generalize the lottery ticket hypothesis to other types of networks beyond convolutional networks, such as graph networks [336], spiking networks [110], and photonic networks [338].
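As a rough illustration of the four steps above combined with iterative pruning, the following is a minimal sketch assuming PyTorch; train_fn and mask_fn are hypothetical placeholders standing in for ordinary training and global magnitude-based mask computation, and the per-round schedule shown is only one way to approximate pruning p% over n rounds.

```python
import copy
import torch

def apply_mask(model, mask):
    # Zero the pruned weights so the rewound network stays sparse.
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in mask:
                p.mul_(mask[name])

def find_winning_ticket(model, train_fn, mask_fn,
                        final_ratio=0.9, rounds=5, iters_per_round=10_000):
    # Step 1: record the random initialization theta_0.
    theta_0 = copy.deepcopy(model.state_dict())
    # Pruning this fraction of the *remaining* weights each round removes
    # roughly final_ratio of all weights after `rounds` rounds.
    per_round = 1.0 - (1.0 - final_ratio) ** (1.0 / rounds)
    mask = None
    for _ in range(rounds):
        train_fn(model, iters_per_round)          # Step 2: train for j iterations
        mask = mask_fn(model, per_round, mask)    # Step 3: grow the pruning mask m
        model.load_state_dict(theta_0)            # Step 4: rewind weights to theta_0
        apply_mask(model, mask)                   # candidate ticket f(x; m * theta_0)
    return model, mask
```

After the final round, the returned sparse, rewound network is the candidate winning ticket that is then trained in isolation.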
Semi-structured Pruning. In contrast to the above mainstream non-structured pruning methods that introduce considerable irregular network sparsity, semi-structured pruning focuses on removing the less important consecutive weight connections [339]. The resulting semi-structured sparse networks can exhibit less irregular network sparsity than non-structured pruning, which are well supported by some existing deep learning libraries (e.g., cuSPARSElt [340] and TVM [341]) and thus can maintain much higher parallelism and speedups on modern embedded computing systems than non-structured pruning. For example, popular BERT models can achieve about 1.3x to 1.6x runtime inference speedups on NVIDIA A100 GPUs using the optimized sparse tensor cores [340, 342]. Thanks to its superior accuracy–efficiency trade-offs over non-structured pruning, semi-structured pruning has been widely employed to optimize the computational complexity of convolutional networks [343, 344], transformers [342, 345], and large language models [346, 347]. For example, [344] introduces an effective channel permutation scheme to optimize the attainable accuracy of the resulting semi-structured sparse convolutional network. The work of [343] introduces sparse-refined straight-through estimator (SR-STE) to explore the optimal semi-structured sparse convolutional network. In addition to convolutional networks, [342] and [345] introduce the alternating direction method of multipliers (ADMM) and progressive gradient flow to explore the optimal semi-structured sparse transformer for real-world language processing tasks. More recently, [346, 347] investigated semi-structured sparsity to enhance the inference efficiency of large language models. In parallel to the above semi-structured pruning methods that optimize inference efficiency, [348, 349] instead focus on optimizing the training efficiency of semi-structured sparse networks. The works of [350, 351, 352] also focus on exploring dedicated hardware accelerators to further enhance the runtime inference efficiency of semi-structured sparse networks.
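To illustrate the pattern that such libraries accelerate, the following is a minimal sketch of enforcing 2:4 semi-structured sparsity on a weight matrix, assuming PyTorch; it demonstrates the sparsity pattern itself, not the API of cuSPARSELt or TVM.

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every contiguous group of 4."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "input dimension must be a multiple of 4"
    groups = weight.abs().reshape(out_features, in_features // 4, 4)
    # Indices of the two smallest |w| entries in each group of four.
    _, drop_idx = groups.topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups).scatter_(-1, drop_idx, 0.0)
    return weight * mask.reshape(out_features, in_features)

# Example: every group of 4 consecutive weights keeps exactly 2 non-zeros.
w = torch.randn(8, 16)
w_sparse = prune_2_of_4(w)
```

Because the non-zero positions are constrained within each group of four, the pattern can be stored compactly and executed with regular memory access, which is what enables the reported speedups on sparse tensor cores.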

4.1.2 Structured Pruning.

In parallel to non-structured pruning, structured pruning, including channel pruning6 and layer pruning, is another popular branch, which removes the less important channels or layers to reduce the network complexity as shown in Figure 17. Layer pruning can be viewed as a special case of channel pruning: channel pruning becomes layer pruning when all channels in the same layer are removed. We emphasize that non-structured pruning, despite being able to achieve significant compression ratios, introduces considerable irregular network sparsity and the resulting pruned network also features irregular computational patterns, which highly relies on specialized hardware accelerators to achieve realistic speedups, as demonstrated in [306]. In contrast, structured pruning can easily achieve realistic speedups on mainstream hardware, such as GPUs and CPUs, thanks to its high on-device parallelism [305]. This unique technical merit has made structured pruning increasingly popular, especially in the context of designing hardware-friendly network solutions [383]. With all this in mind, below we further elaborate on recent representative structured pruning works, which can be roughly divided into the following four categories: weight-based pruning, activation-based pruning, batch normalization statistics–based pruning, and search-based pruning.
Weight-Based Pruning. Weight-based pruning, also referred to as weight-dependent pruning, determines the importance of different channels based on the corresponding weights, which is technically similar to magnitude-based pruning as discussed in Section 4.1.1. There have been two popular weight-based pruning criteria, including weight norm and weight correlation. Without loss of generality, we can easily calculate the \(L_n\)-norm as \(||w||_n\), where w is the corresponding network weight. For example, [21] proposes to remove the less important channels based on their \(L_1\)-norm values, which indicates that the channel with the smaller \(L_1\)-norm is considered less important and contributes less to the network output. The work of [22] observes that the \(L_2\)-norm can achieve better pruning performance than the \(L_1\)-norm. Furthermore, the work of [23] challenges the empirical assumption in [21, 22] and demonstrates that the channels with smaller \(L_1\)- and \(L_2\)-norm magnitudes are not necessarily less important. To avoid this, [23] instead turns to channel correlation and reveals that the channels close to the geometric median are typically redundant since they represent similar feature maps in the same layer. As a result, removing the channels close to the geometric median only leads to minimal accuracy loss. Inspired by the promising performance of [23], the works of [353, 354] propose to first apply scalar hashing to the weights of each layer and then remove redundant channels based on the corresponding weight similarity. This is because similar channels are of high redundancy in terms of the contributions to the network representation capability. In addition, unlike [21, 22, 23, 353, 354], which measure the channel redundancy in the same layer, [355] investigates the channel redundancy across multiple different layers in order to minimize the accuracy loss. Furthermore, [356] prioritizes removing the channels in more redundant layers rather than globally ranking different channels across all network layers.
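As a concrete illustration of the weight-norm criterion, the following is a minimal sketch of \(L_1\)-norm-based channel ranking for a convolutional layer, assuming PyTorch; it follows the smaller-norm-less-important heuristic rather than the full pipeline of any particular work.

```python
import torch
import torch.nn as nn

def rank_channels_by_l1(conv: nn.Conv2d) -> torch.Tensor:
    """Return output-channel indices sorted from least to most important."""
    # Each output filter has shape (in_channels, kH, kW); its L1-norm is the score.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(scores)  # ascending: pruning candidates come first

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
prune_order = rank_channels_by_l1(conv)
least_important = prune_order[: int(0.3 * conv.out_channels)]  # e.g., prune 30% of channels
```

Physically removing the selected filters (and the matching input channels of the next layer) then yields a smaller dense network that runs without any sparse-hardware support.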
Activation-Based Pruning. Activation-based pruning typically leverages the intermediate activation maps to interpret the importance of different channels, in which activation maps, also known as feature maps, correspond to the output features from one specific network layer. As demonstrated in [383], there have been three representative techniques to determine the importance of different channels in the L-th layer: (1) using the activation maps of the L-th layer, (2) using the activation maps of adjacent layers (e.g., \((L+1)\)-th and \((L+2)\)-th layers), and (3) using the activation maps of the last layer (i.e., the network output).
(1)
Current Layer. To determine the importance of different channels in the L-th layer, [357] proposes a simple yet effective two-step scheme, which first removes different channels and then measures the reconstruction error based on the output activation maps of the L-th layer. Next, the channels whose removal leads to a smaller reconstruction error are considered less important and are removed to reduce the network complexity. Similarly, [358] measures the channel importance according to the decomposition error. Subsequent works utilize the channel independence [359] and post-activation maps [360] to measure channel importance.
(2)
Adjacent Layers. Recent state-of-the-art DNNs are naturally coupled and of sequential layer structures, which indicates that there is significant layer dependency between different adjacent layers. In light of this convention, [361, 362] investigate the dependency between the current layer and the subsequent layer in order to measure the channel importance in the current layer. In parallel, [295, 363] demonstrate that the activation maps of previous layers can also reflect the channel importance in the subsequent layers.
(3)
Last Layer. The channel importance can also be evaluated using the activation maps of the last layer, which correspond to the network output. The rationale behind this is that we are allowed to use the network output to interpret the accuracy of the pruned network. For example, we can simply determine the channel importance based on the reconstruction error [244] and the discrimination of the entire network [384].
Statistics-Based Pruning. Statistics-based pruning refers to pruning that exploits batch normalization statistics [385] to interpret channel importance, which has since gained increasing popularity thanks to its conceptual simplicity and surprisingly strong pruning performance. As shown in previous representative networks [2, 3, 32, 34, 35, 123], batch normalization is a widely used plug-and-play technique to accelerate and stabilize the network training process towards better training convergence while also reducing internal covariate shift to benefit network accuracy. Specifically, batch normalization \(BN(\cdot)\) transforms the input \(x \in \mathbb {R}^{B \times C \times W \times H}\) as follows:
\begin{equation} BN(x) = \gamma \cdot \frac{x - \mu }{\sqrt {\delta ^2 + \epsilon }} + \beta , \end{equation}
(13)
where \(\mu\) and \(\delta\) are the mean and standard deviation of the input x, respectively, and \(\epsilon\) is a small constant (e.g., \(1\times 10^{-9}\)) to avoid zero-division. \(\gamma \in \mathbb {R}^{C}\) and \(\beta \in \mathbb {R}^{C}\) are learnable parameters to scale and shift \(\frac{x - \mu }{\sqrt {\delta ^2 + \epsilon }}\), which are optimized during the training process to recover the input x. Note that \(\gamma\) and \(\beta\) have the same dimension as the number of input channels. As seen in [2, 3, 32, 34, 35, 123], it is common practice to insert one batch normalization layer after one convolutional layer. In light of this, [364] pioneers leveraging batch normalization statistics \(\gamma\) to enable and disable different input channels, and the disabled channels are pruned at the end of the training process. To this end, [364] applies \(L_1\)-norm regularization on \(\gamma\) to achieve sparsity. Similar to [364], Gate Decorator [365] introduces gated batch normalization that leverages batch normalization statistics \(\gamma\) as channel gates to enable and disable different input channels from the previous convolutional layer. To achieve sparsity, Gate Decorator also exploits \(L_1\)-norm regularization to penalize \(\gamma\) during the training process. In addition, Gate Decorator introduces an iterative pruning scheme, which progressively prunes redundant channels during the training process and fine-tunes the resulting pruned network to recover the accuracy. However, as demonstrated in [366], \(L_1\)-norm regularization suffers from inferior discrimination between different channels since \(L_1\)-norm regularization pushes all the scaling factors \(\gamma\) towards zero. To tackle this, in contrast to [364, 365], which regularize \(\gamma\) with \(L_1\)-norm penalty, [366] polarizes \(\gamma\) to enforce a large margin between zero and non-zero \(\gamma\). Furthermore, [367] challenges [364, 365, 366], observing that smaller non-zero \(\gamma\) does not imply that the corresponding channel is less important. Based on this observation, [367] introduces a simple yet effective iterative pruning approach, which (1) prunes the less important channels with exact zero \(\gamma\), (2) rescales the magnitude of \(\gamma\), and (3) fine-tunes the resulting pruned network to recover the accuracy. In addition to the scaling factors \(\gamma\), [368] demonstrates that the shifting factors \(\beta\) can also be leveraged to interpret the channel importance and jointly considering \(\gamma\) and \(\beta\) has the potential to achieve more reliable channel pruning. The aforementioned statistics-based channel pruning works [364, 365, 366, 367, 368] explicitly demonstrate that batch normalization statistics (i.e., \(\gamma\) and \(\beta\)), when properly engineered, can reflect the importance of input channels from the previous convolutional layer.
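In the spirit of the statistics-based methods above, the following is a minimal sketch of an \(L_1\) penalty on the batch normalization scaling factors \(\gamma\) followed by thresholding near-zero channels, assuming PyTorch; the regularization strength and pruning threshold are illustrative values.

```python
import torch
import torch.nn as nn

def bn_gamma_l1_penalty(model: nn.Module, lambda_l1: float = 1e-4):
    """Sparsity-inducing regularizer on BN gamma, added to the task loss."""
    penalty = sum(m.weight.abs().sum()
                  for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return lambda_l1 * penalty

def channels_to_prune(model: nn.Module, threshold: float = 1e-2):
    """After training, collect per-layer channel indices with tiny |gamma|."""
    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            plan[name] = (m.weight.detach().abs() < threshold).nonzero().flatten().tolist()
    return plan

# Training-loop usage (sketch):
#   loss = criterion(model(x), y) + bn_gamma_l1_penalty(model)
#   loss.backward(); optimizer.step()
```

Once training terminates, the channels listed in the pruning plan (and the matching convolutional filters) are removed, and the compact network is fine-tuned to recover accuracy.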
Search-Based Pruning. Inspired by the tremendous success of NAS as discussed in Section 3, a plethora of search-based pruning works have recently emerged, which typically leverage search-based techniques, including reinforcement learning–based [369, 370, 371], evolutionary algorithm–based [372, 373, 374], and gradient-based search [375, 376, 377], to automatically search for the optimal pruning policy instead of using manually designed pruning heuristics. The rationale behind this is that different channels can be alternatively viewed as a list of possible operator candidates, making it possible to generalize the novel findings and advances in the field of NAS to address research challenges in the field of pruning. Specifically, previous search-based channel pruning works can be divided into the following three categories:
(1)
Reinforcement Learning–Based Search. Reinforcement learning is a well-established technique for solving search problems as discussed in Section 3.2. Several pruning works [369, 370, 371] have pioneered the exploitation of reinforcement learning to search for the optimal channel pruning policy. For example, AMC [369] proposes to train an efficient deep deterministic policy gradient (DDPG) [386] agent such that the well-trained DDPG agent can output the optimal layer-wise channel pruning policy to maximize the pre-defined reward function as shown in Figure 20. In addition, AGMC [370] demonstrates that AMC may yield suboptimal pruning policies due to the fixed number of environment states. To tackle this, AGMC leverages graph convolutional networks (GCNs) to encode the pruned network and exploits the graph-based encoder–decoder to automatically learn the optimal environment state. DECORE [371] turns back to multi-agent search, which assigns each agent to one specific network layer to learn better pruning policies.
(2)
Evolutionary Algorithm-Based Search. The evolutionary algorithm is another well-established search technique thanks to its conceptual simplicity, flexibility, and surprisingly strong performance. There have been several pruning works [372, 373, 374] that employ an evolutionary algorithm to automatically search for the optimal channel pruning policy. For example, MetaPruning [372] introduces an efficient two-stage pruning pipeline. In the first stage, MetaPruning trains an over-parameterized PruningNet that consists of all possible pruned network configurations. Note that PruningNet here is technically the same as the supernet in the field of NAS as discussed in Section 3. In the second stage, MetaPruning leverages the well-trained PruningNet to quickly evaluate the accuracy of different pruned networks with inherited weights from the well-trained PruningNet [160]. In addition, an evolutionary engine is integrated to explore the optimal pruned network.
(3)
Gradient-Based Search. Unlike the aforementioned reinforcement learning–based and evolutionary algorithm–based pruning works that explore the optimal pruning policy within the discrete space, gradient-based search instead allows learning of the optimal pruning policy within the continuous space [375, 376, 377]. As a result, gradient-based search is able to maintain much better computational efficiency than reinforcement learning–based and evolutionary algorithm–based counterparts. For example, DSA [377] proposes an efficient differentiable sparsity allocation approach to learn optimal layer-wise pruning ratios with gradient-based optimization, in which each pruning experiment only requires about 5 GPU hours. To relax the discrete search space to become continuous, DSA introduces learnable pruning ratios, which are conceptually the same as the architecture parameters in differentiable NAS [138]. During the training process, the above learnable pruning ratios can be jointly optimized together with the network weights using standard gradient descent.
Fig. 20.
Fig. 20. Overview of AMC [369], which formulates channel pruning as reinforcement learning–based search and instead automatically searches for the less important channels to be pruned (figure from [369]).
In general, search-based pruning is similar to NAS: search-based pruning searches for pruned network structures, whereas NAS searches for stand-alone network structures. This indicates that we can generalize more advanced NAS algorithms to search for better pruned networks.
Layer-Based Pruning. Layer pruning is a special case of channel pruning, which aggressively removes all the channels in the same layer, as shown in Figure 17. Under similar compression ratios, layer pruning can achieve better performance in terms of latency reduction than channel pruning, as demonstrated in [379]. However, there is no free lunch, which indicates that layer pruning may suffer from greater accuracy loss than channel pruning. Note that layer pruning is conceptually and technically similar to channel pruning. Specifically, channel pruning aims to remove the less important channels, whereas layer pruning focuses on removing the less important layers as seen in previous representative layer pruning works [30, 378, 379, 380, 381, 382]. In light of this, the aforementioned channel pruning techniques can be easily generalized to prune redundant layers. For example, [379] introduces several importance criteria from the lens of channel pruning, such as weight magnitudes, activation maps, and batch normalization statistics, which are further combined to reliably determine the less important layers. In addition, [382] investigates the lottery ticket hypothesis [328] from the perspective of layer pruning, which confirms that there also exist winning tickets at initialization in terms of layer pruning. More importantly, the winning tickets here are more environmentally friendly, with lower carbon emissions, while at the same time achieving better training efficiency and adversarial robustness [382]. In addition, several recent methods [387, 388] observe that the intermediate non-linear activation layers can also be grafted with negligible accuracy loss. Based on this observation, [387, 388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. Several recent pruning methods [389, 390, 391, 392] focus on multi-dimensional pruning, which strive to actively prune less important channels, layers, and input resolutions to aggressively trim down the model’s complexity towards enhanced inference efficiency on target hardware. These multi-dimensional pruning methods can achieve much better accuracy–efficiency trade-offs than traditional channel-based and layer-based pruning methods. Similarly, HACScale [393] proposes an effective scaling paradigm to re-scale different channels and layers towards more efficient inference on target hardware.

4.2 Network Quantization

In contrast to network pruning, which aims to reduce the network complexity at the structure level, network quantization focuses on representing the network weights and activations with lower bits, which significantly reduces the network complexity at the precision level. Therefore, the resulting quantized network maintains the same network structure (i.e., the same number of layers and channels), but with lower-bit network weights and activations. In practice, network quantization can be traced back to the 1990s, when early quantization works pioneered quantizing the network weights for Boltzmann machines [447], optical networks [448], and multi-layer perceptrons (MLPs) [449]. Note that quantization has the potential to significantly trim down the network size to accommodate the limited storage in real-world embedded scenarios. For example, Deep Compression [89] is able to reduce the network size of VGGNet by ×49, from 552 MB to 11.3 MB, while delivering comparable accuracy on ImageNet [80]. Thanks to its surprisingly strong performance in reducing the computational complexity and alleviating the storage requirements, renewed research interest in network quantization has emerged since the 2010s [409], which demonstrates that, compared with full-precision weights (i.e., 32 bits), 8-bit quantized weights can effectively accelerate the network inference on mainstream CPUs without significant accuracy degradation. To this end, we discuss recent advances in the field of network quantization in this section, including representative quantized networks and popular quantization-related extensions/implementations, which are also summarized in Figure 21.
Fig. 21.
Fig. 21. Illustration of different network quantization techniques discussed in Section 4.2.

4.2.1 Quantized Networks.

Here, we introduce several representative quantized networks: binarized networks, ternarized networks, INT8 networks, and mixed-precision networks.
Binarized Networks. Binarized networks are built upon only 1-bit weights, which are constrained to be either \(+\)1 or \(-\)1 during forward and backward propagations [24, 25]. This effectively eliminates computation-intensive multiply-accumulate operations, allowing them to be replaced with cheap additions and subtractions, and can lead to significant performance improvements in terms of latency and energy consumption, as demonstrated in [24]. In the relevant literature, BinaryConnect [24] and BinaryNet [25] are the very early seminal binarized networks, which pioneered the investigation of the efficacy of 1-bit weights in order to reduce computational complexity and alleviate the storage bottleneck. BinaryConnect [24] introduces the first binarized network, which explores both deterministic binarization,
\begin{equation} \begin{aligned}w_b = \mathrm{sign}(w) = {\left\lbrace \begin{array}{ll} +1, & \text{if} \,\,\, w \ge 0 \\ -1, & \text{otherwise} \end{array}\right.} \end{aligned} , \end{equation}
(14)
and stochastic binarization to stochastically binarize the network weights,
\begin{equation} \begin{aligned}w_b = {\left\lbrace \begin{array}{ll} +1, & \text{with probability} \,\,\, p=\sigma (w) \\ -1, & \text{with probability} \,\,\, p=1-\sigma (w), \end{array}\right.} \end{aligned} \end{equation}
(15)
where \(\sigma (\cdot)\) is the hard sigmoid function and can be mathematically formulated as follows:
\begin{equation} \sigma (x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \mathrm{max}\left(0, \mathrm{min}\left(1, \frac{x+1}{2}\right)\right) . \end{equation}
(16)
Note that BinaryConnect exploits the above hard sigmoid function rather than the soft version because it is far less computationally expensive and can still yield competitive results. As shown in BinaryConnect, the stochastic binarization is more advanced and can achieve much better quantization accuracy than the deterministic counterpart. So far, BinaryConnect only enables weight-level binarization, whereas the network inputs are still required to be full precision. In light of this, BinaryNet [25] extends BinaryConnect to support both binarized weights and binarized inputs in order to maximize the inference efficiency of binarized networks. XNOR-Net [26] demonstrates that [24, 25] cannot be generalized to large-scale datasets such as ImageNet. To address this, XNOR-Net introduces an effective approach to estimate binarized weights to maintain \(w \approx \alpha \cdot w_b\), after which the estimated \(\alpha\) can be detached from the binarized weight to rescale the input. To further enhance binarization accuracy, a plethora of binarized networks [394, 395, 396, 397, 398, 399, 400] have been proposed. For example, [399, 400] propose learnable activation binarizer [399] and adaptive binary sets [400] to explore more accurate binarized networks.
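The following is a minimal sketch of deterministic weight binarization in the spirit of Equation (14), trained with a straight-through estimator, assuming PyTorch; it is an illustrative layer rather than the reference implementation of BinaryConnect or BinaryNet.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)            # w_b in {-1, +1} (sign(0) maps to 0 here)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: pass gradients where |w| <= 1, clip elsewhere.
        return grad_output * (w.abs() <= 1.0).float()

class BinaryLinear(nn.Linear):
    def forward(self, x):
        w_b = BinarizeSTE.apply(self.weight)          # binarize on the fly
        return nn.functional.linear(x, w_b, self.bias)

layer = BinaryLinear(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()   # gradients flow to the latent full-precision weights
```

The full-precision latent weights are kept only for training; at deployment time, only the 1-bit weights need to be stored.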
Ternarized Networks. In addition to binarized networks, ternarized networks [401, 402] are another representative branch of quantized networks and have gained increasing popularity thanks to their superior accuracy. Specifically, ternarized networks quantize the network weights from 32 bits to 2 bits, in which the 2-bit weights are constrained to \(-\)1, 0, and \(+\)1 in contrast to \(\pm\)1 in binarized networks. As such, ternarized networks can achieve much better accuracy than binarized networks at the cost of slightly increased computational complexity. To achieve this, [401], which introduces the first ternarized network, proposes to quantize the full-precision weights as follows:
\begin{equation} \begin{aligned}w_t = {\left\lbrace \begin{array}{ll} +1, & \text{if} \,\,\, w \gt \Delta \\ 0, & \text{if} \,\,\, |w| \le \Delta \\ -1, & \text{if} \,\,\, w \lt -\Delta \end{array}\right.} \end{aligned} , \end{equation}
(17)
where \(\Delta\) is a positive constant to control the ternarization threshold. To derive the optimal ternarization threshold \(\Delta ^*\), [401] turns back to XNOR-Net [26] and borrows the binarization estimating scheme from XNOR-Net, which introduces an adjustable scaling factor to minimize \(||w - \alpha \cdot w_t||_2^2\). Finally, [401] demonstrates an empirical rule of thumb to derive \(\Delta ^*\) as follows:
\begin{equation} \Delta ^* = 0.7 \cdot E(|w|) \approx \frac{0.7}{n} \sum _{i=1}^n |w_i| , \end{equation}
(18)
where n is the number of elements within w. To boost the ternarization accuracy, [402] further introduces two trained scaling coefficients \(w_l^p\) and \(w_l^n\) for the l-th layer, which are then trained using gradient descent during backward propagation. Once the training process terminates, [402] deploys the ternarized networks on target hardware, including the trained ternarized weights and the corresponding scaling coefficients, reducing the network size by at least ×16. Subsequently, several ternarized networks [403, 404, 405, 406, 407, 408] have been proposed to further improve ternarization accuracy. Among them, [408] demonstrates that the hard ternarization threshold \(\Delta\), despite being simple and effective, often leads to suboptimal results. To avoid this, [408] introduces the paradigm of a soft ternarization threshold, which instead enables the network to automatically determine the optimal ternarization intervals to maximize its accuracy.
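The following is a minimal sketch of threshold-based ternarization following Equations (17) and (18), assuming PyTorch; the scaling factor \(\alpha\) is computed as the mean magnitude of the surviving weights, which minimizes \(||w - \alpha \cdot w_t||_2^2\) for a fixed ternary pattern.

```python
import torch

def ternarize(w: torch.Tensor):
    delta = 0.7 * w.abs().mean()        # Equation (18): rule-of-thumb threshold
    w_t = torch.zeros_like(w)           # Equation (17): {-1, 0, +1} assignment
    w_t[w > delta] = 1.0
    w_t[w < -delta] = -1.0
    # Closed-form scaling factor: mean |w| over the non-zero positions.
    mask = w.abs() > delta
    alpha = w.abs()[mask].mean() if mask.any() else w.abs().mean()
    return w_t, alpha

w = torch.randn(256)
w_t, alpha = ternarize(w)
w_approx = alpha * w_t   # low-bit approximation of the full-precision weights
```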
INT8 Quantization. Binarized and ternarized networks have the potential to achieve \(\times 16\sim\)×32 speedups, which, however, suffer from non-negligible accuracy loss and, even worse, require considerable engineering efforts to design specialized hardware for further deployment. The rationale behind this is that mainstream hardware does not support low-bit quantized networks. To overcome such limitations, an effective alternative is INT8 quantization, which trims down the network weights from 32 bits to 8 bits within the range of \([-128, 127]\). As such, INT8 quantization can lead to about ×4 compression in terms of network size and, more importantly, at the cost of negligible accuracy loss [409]. In addition, thanks to well-established software support (e.g., Google’s TensorFlow Lite and NVIDIA’s TensorRT), we can easily deploy INT8 quantized networks on mainstream hardware, such as mobile devices, CPUs, and edge GPUs, with minimal engineering efforts [68, 69]. For example, as shown in [410], TensorRT allows post-training INT8 quantization, which leverages simple weight calibration to convert pre-trained full-precision weights into 8-bit weights (see Figure 23) with only trivial accuracy loss. Several works [411, 412, 413, 414] have been proposed to further investigate INT8 quantization and improve INT8 quantization accuracy. Among them, [414] pioneers quantizing both weights and activations with 8-bit integers to boost inference efficiency. In addition, [411] evaluates the performance of various INT8 quantized networks on mobile GPUs, based on which [411] introduces a unified INT8 quantization framework that integrates various off-the-shelf INT8 quantization techniques, such as symmetric, asymmetric, per-layer, and per-channel INT8 quantization. The work of [412] investigates the efficiency bottleneck of INT8 quantization and introduces hardware-friendly search space design to enable efficient INT8 quantization. More recently, [450, 451] explore INT8 quantization to compress redundant CNNs for efficient in-memory computing infrastructures. In addition to quantizing CNNs, [413] turns back to transformers and leverages INT8 quantization to quantize computation-intensive transformers in order to boost the inference efficiency for general NLP tasks.
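As a concrete illustration, the following is a minimal sketch of symmetric per-tensor post-training INT8 quantization, assuming PyTorch; the max-calibration rule is a common simplification of the calibration step shown in Figure 23, not TensorRT's actual API.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor scale: map the largest |w| to 127.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale            # approximate recovery of the weights

w = torch.randn(64, 64)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())   # per-tensor quantization error
```

Production toolchains additionally calibrate activation ranges on representative data (e.g., minimizing a divergence between the full-precision and quantized distributions), but the weight-side arithmetic follows this scale-and-round pattern.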
Fig. 22.
Fig. 22. Comparisons between post-training quantization and quantization-aware training. In contrast to post-training quantization, quantization-aware training integrates the quantization loss into the training loss, which allows the optimizer to minimize the quantization loss to improve the quantization accuracy.
Fig. 23.
Fig. 23. TensorRT’s INT8 calibration [410].
Mixed-Precision Networks. Mixed-precision quantization is another well-established branch of network quantization. As shown in [415], mixed-precision quantization allows more fine-grained quantization schemes across different weights and activations. As a result, it can usually achieve better accuracy–efficiency trade-offs than conventional fixed-precision quantization, such as binarized (1-bit), ternarized (2-bit), and INT8 (8-bit) quantization. For example, TBN [415], the very first mixed-precision network, proposes to combine layer-wise ternarized inputs and binarized weights, which delivers surprisingly better accuracy–efficiency trade-offs than stand-alone binarized networks and ternarized networks. The success of TBN has motivated several subsequent mixed-precision quantization works [416, 417] to continue improving quantization accuracy. For example, SYQ [416] proposes to quantize the network weights with 1/2 bits and the intermediate activation with 8 bits, whereas PACT [417] allows 2-bit activations and 2/3/4/5-bit weights. These early mixed-precision quantization works have demonstrated promising performance. Later, we will introduce automated mixed-precision quantization, which exploits automated techniques to search for the optimal bit allocation and can achieve more fine-grained quantization. The works in [418, 419, 420, 421] also consider leveraging mixed-precision quantization to improve the training efficiency of full-precision networks, which can significantly accelerate the training process and, more importantly, achieve comparable accuracy to the full-precision training.

4.2.2 Quantization Extensions and Implementations.

Here, we introduce several quantization extensions and implementations, including quantization-aware training, automated mixed-precision quantization, and quantization-aware hardware accelerators.
Quantization-Aware Training. Quantization-aware training refers to the technique that trains quantized networks, which is fundamentally different from post-training quantization as shown in Figure 22. Note that post-training quantization can achieve satisfactory performance on early networks such as AlexNet [76] and VGGNet [1], which, however, suffers from significant accuracy loss when applied to more advanced lightweight networks such as MobileNets [32, 85] and ShuffleNets [34, 35]. In general, quantization-aware training incorporates the quantization loss into the training loss, which then allows the optimizer to minimize the quantization loss during the training process in order to unlock better quantization accuracy than post-training quantization. In practice, the seminal quantization-aware training work [414] proposes quantizing both weights and activations with 8-bit integers. To maximize the accuracy of INT8 quantized networks, [414] also introduces an effective tailored quantization-aware training approach to train the resulting INT8 quantized networks. Similar to [414], the works of [422, 423] unify and improve quantization-aware training of INT8 quantized networks to minimize accuracy degradation. To generalize quantization-aware training to train other types of quantized networks (e.g., 1-bit and 2-bit networks), a plethora of quantization-aware training works [424, 425, 426, 427] have been proposed, which further push forward the attainable quantization accuracy.
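The following is a minimal sketch of quantization-aware training via fake quantization, assuming PyTorch: the forward pass simulates INT8 rounding while the backward pass uses a straight-through estimator so that the optimizer sees the quantization error; it is illustrative rather than any specific framework's implementation.

```python
import torch
import torch.nn as nn

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    # Simulate symmetric INT8 quantize-dequantize in the forward pass.
    scale = w.detach().abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127) * scale
    # Straight-through estimator: forward uses q, backward treats it as identity.
    return w + (q - w).detach()

class QATLinear(nn.Linear):
    def forward(self, x):
        return nn.functional.linear(x, fake_quant_int8(self.weight), self.bias)

layer = QATLinear(16, 4)
loss = layer(torch.randn(8, 16)).pow(2).mean()
loss.backward()   # gradients reach the latent full-precision weights
```

After training, the latent full-precision weights are quantized once with the same scale and exported as true INT8 tensors for deployment.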
Automated Mixed-Precision Quantization. Early quantization works typically quantize all weights and activations with the same level of precision, such as 1 bit for binarized networks and 2 bits for ternarized networks. Despite the promising performance, early uniform quantization practices suffer from suboptimal accuracy–efficiency trade-offs. For example, as shown in TBN [415], mixed-precision quantization that combines layer-wise ternarized inputs and binarized weights can achieve much better accuracy–efficiency trade-offs than stand-alone binarized networks and ternarized networks. However, determining the optimal mixed-precision quantization strategy is difficult owing to the large number of possible quantization combinations across different layers. To overcome such limitations, recent research has shifted to automated mixed-precision quantization [428, 429, 430, 431, 432, 433, 434, 435, 436] thanks to the tremendous success of NAS, as discussed in Section 3. Among them, [428], the first automated mixed-precision quantization work, follows early differentiable NAS practices [39, 138] to search for the optimal layer-wise precision assignment. HAQ [429] leverages reinforcement learning-based search to explore the huge quantization design space with hardware feedback in the loop, which focuses on finding the optimal layer-wise precision assignment to maximize both quantization accuracy and hardware efficiency. Note that we can easily generalize recent advances in the field of NAS to further improve automated mixed-precision quantization.
Quantization-Aware Accelerators. In contrast to INT8 quantized neural networks, binarized, ternarized, and mixed-precision quantized neural networks are not supported by mainstream hardware, such as mobile devices, CPUs, and edge GPUs. This further demands the design of quantization-aware accelerators to efficiently execute low-bit quantized networks at runtime. To this end, a plethora of representative quantization-aware accelerators have been proposed recently [437, 438, 439, 440, 441, 442, 443, 444, 445, 446], including binarized network-based accelerators [437, 438, 439], ternarized network-based accelerators [440, 441, 442], and mixed-precision network-based accelerators [443, 444, 445, 446]. These quantization-aware accelerators have demonstrated significant efficiency improvements in terms of latency, memory, area, and energy consumption in various real-world embedded scenarios.

4.3 Network Distillation

Network distillation, also referred to as knowledge distillation,7 is another well-established paradigm to further push forward the accuracy–efficiency trade-off, which was initially proposed in [501] and subsequently generalized in [11, 28]. Note that knowledge distillation is a plug-and-play training technique, which has been applied to various tasks to achieve better training performance, such as object detection [502] and language understanding [96]. In contrast to network pruning and network quantization, which focus on improving network efficiency without sacrificing network accuracy, as discussed in Sections 4.1 and 4.2, network distillation instead boosts the accuracy–efficiency trade-off from the accuracy perspective, which aims to improve the network accuracy without changing the network structure. In other words, unlike network pruning and network quantization, which lead to simplified network structures, network distillation leaves the network structure unchanged but typically yields higher accuracy. As shown in [11, 28, 501], knowledge distillation refers to the training process that leverages a larger pre-trained teacher network to benefit the training process of a smaller student network (see Figure 25), which transfers the rich and discriminative knowledge from the larger pre-trained teacher network to the smaller student network to further achieve better accuracy on the target task than simply training the student network alone. Next, we first present the preliminaries of knowledge distillation and then introduce recent representative data-dependent and data-efficient knowledge distillation works. These knowledge distillation works can also be found in Figure 24.
Fig. 24.
Fig. 24. Illustration of data-dependent and data-efficient knowledge distillation (KD) works in Section 4.3.
Fig. 25.
Fig. 25. Illustration of the teacher-student knowledge transfer process in seminal knowledge distillation techniques [11, 28].
Knowledge Distillation Basics. In order to better understand knowledge distillation, we first elaborate on the preliminaries of knowledge distillation, which are mainly based on the most representative knowledge distillation work [11]. As shown in previous state-of-the-art networks [32, 34, 35, 85], the network outputs, also referred to as the network logits, are typically fed into the softmax function to calculate the probability distribution over different categories for further prediction purposes. However, given a pre-trained teacher network, the output logits after the softmax function are discriminative but less informative, which are close to either 1 or 0 (e.g., [0.02, 0.95, 0.01, 0.01, 0.01]). This makes it difficult to directly transfer the discriminative knowledge from the pre-trained teacher network to the student network. The rationale behind this is that the student network is of smaller network size than the teacher network and is thus less capable of learning discriminative knowledge [11]. To mitigate this issue, [11] leverages the distillation temperature T to soften the knowledge from the pre-trained teacher network to facilitate the teacher–student knowledge transfer process, which can be mathematically formulated as follows:
\begin{equation} z_i = \frac{\exp (y_i/T)}{\sum _{j=1}^n \exp (y_j/T)} \,\,\, s.t. \,\,\, i = 1, \ldots , n , \end{equation}
(19)
where \(\lbrace y_i\rbrace _{i=1}^n\) denotes the output logits without softmax and T is the temperature to soften the output logits \(\lbrace y_i\rbrace _{i=1}^n\) as \(\lbrace z_i\rbrace _{i=1}^n\). Note that Equation (19) is equivalent to the standard softmax function when \(T=1\). As shown in [11], larger T can produce softer probability distribution over different categories (e.g., \([0.1, 0.6, 0.1, 0.1, 0.1]\) when \(T=5\) and \([0.2, 0.2, 0.2, 0.2, 0.2]\) when \(T=+\infty\)). In addition, fixing the temperature T to 2 can empirically yield the best performance. Furthermore, [11] exploits the softened knowledge from the pre-trained teacher network to guide the training process of the student network, which can be mathematically formulated as follows:
\begin{equation} \mathop {\mathrm{minimize}}_{w} \,\,\, \mathcal {L}_{train}(x, w) = \mathcal {L}(y, y^*) + \alpha \cdot T^2 \cdot \mathcal {L}(y, z) \,\,\, s.t. \,\,\, y = f_w(x) , \end{equation}
(20)
where x is the input data, \(y^*\) is the ground-truth label, z is the softened knowledge from the pre-trained teacher network, \(\alpha\) is the constant to control the teacher–student distillation magnitude, and \(\mathcal {L}(\cdot)\) is the standard cross entropy loss function. Here, \(f_w(\cdot)\) denotes the student network parameterized by the weights w. As demonstrated in [11], it is important to multiply the teacher–student distillation loss term (i.e., \(\mathcal {L}(y, z)\)) by \(T^2\) because softening with the temperature T scales the gradients of this term by \(1/T^2\) during the training process.
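The following is a minimal sketch of the temperature-softened distillation loss in Equations (19) and (20), assuming PyTorch; the soft-label term is implemented as a KL divergence, which matches the cross-entropy formulation up to a constant, and \(\alpha\) and T follow the notation above.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    # Hard-label term: standard cross entropy against the ground truth y*.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: match the teacher's temperature-softened distribution
    # (Equation (19)); scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean")
    return ce + alpha * (T ** 2) * soft

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)   # in practice: frozen teacher outputs
labels = torch.randint(0, 10, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```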
With the above in mind, we introduce recent representative knowledge distillation practices next, which are built upon [11] and can be roughly divided into the following two categories: data-dependent and data-efficient knowledge distillation.

4.3.1 Data-Dependent Knowledge Distillation.

In this section, we introduce several representative data-dependent knowledge distillation techniques, including logit-based knowledge distillation, intermediate layer-based knowledge distillation, multi-teacher knowledge distillation, teacher-free knowledge distillation, and privileged knowledge distillation.
Knowledge from Logits. Knowledge distillation from logits is one representative branch of knowledge distillation and has been widely applied to improve network accuracy thanks to its conceptual simplicity and surprisingly strong performance. The works of [11, 27] pioneered leveraging the logit-based knowledge from the pre-trained teacher network to facilitate the training process of the less-capable student network. As seen in [11], the logit-based knowledge from the pre-trained teacher network, also referred to as soft labels, corresponds to the output of the teacher network after being fed into the softmax function to calculate the probability distribution over different categories, as shown in Equation (19). Subsequently, [452, 453, 454] investigate the efficacy of knowledge distillation and demonstrate that early knowledge distillation practices [11, 27] only yield suboptimal results since \(\alpha\) and T are fixed for different teacher–student networks, as shown in Equation (20). The works of [455, 456, 457] demonstrate that the accuracy of the student network may significantly degrade when there are large accuracy gaps between teacher and student. To overcome such limitations, [456] introduces an intermediate-sized network (i.e., teacher assistant) to facilitate the knowledge transferred from the pre-trained teacher network to the student network, thus effectively bridging the gap between powerful teacher and less-capable student. In addition to soft labels, [458, 459, 460] demonstrate that noisy labels are also helpful to knowledge distillation, which can be leveraged to further improve the accuracy of the student network.
Knowledge from Intermediate Layers. Knowledge distillation from intermediate layers is another representative branch of knowledge distillation, which provides more fine-grained knowledge to better guide the training process of the smaller student network. The rationale behind this is that intermediate features are also discriminative and can be combined with the final network output to further enhance the feature expressiveness as seen in [503]. Specifically, [28] pioneers investigation of knowledge distillation from intermediate layers and introduces hint learning to improve the training process of the student network, in which hints correspond to the intermediate features of the teacher network. Compared with logit-based knowledge, knowledge from intermediate layers is often richer and more fine-grained, as shown in [28]. A plethora of subsequent knowledge distillation works [461, 462, 463, 464, 465, 466] has been proposed to enhance the knowledge transferred from the pre-trained teacher network to the student network, continuing to explore the rich intermediate features to facilitate the training process of the student network. For example, [466] delves into more fine-grained channel-level knowledge distillation, leading to more fine-grained and discriminative knowledge.
Multi-teacher Knowledge Distillation. The standard knowledge distillation paradigm exploits the pre-trained knowledge from one single teacher network to guide the training process of the less-capable student network [11, 28]. The work of [467] demonstrates that the student network may learn richer and more discriminative knowledge from multiple teacher networks, which push the student network to achieve better accuracy since multiple teacher networks can provide more informative and instructive knowledge than one single teacher network. To this end, [467] proposes to average the network weights of multiple teacher networks (i.e., mean teachers) to better guide the training process of the student network. Similar to [467], several follow-up knowledge distillation works [468, 469, 470, 471] propose to average the output logits of multiple pre-trained teacher networks and then exploit the averaged knowledge to enhance the training process of the student network. In addition, [472, 473, 474] demonstrate that directly averaging the output logits of multiple teacher networks ignores the teacher diversity since different teacher networks may maintain different network capabilities. To avoid this, [472, 473, 474] propose to actively enable and disable different teacher networks through gates during the training process to better guide the student network to learn more discriminative knowledge from different teacher networks.
Teacher-Free Knowledge Distillation. Despite the promising accuracy improvement, previous knowledge distillation works [11, 28] highly rely on off-the-shelf pre-trained teacher networks, which necessitate considerable computational resources to train teacher networks. In addition, to maximize accuracy improvement, it is of utmost importance to design proper teacher networks, leading to additional engineering efforts. To overcome such limitations, several knowledge distillation works [475, 476, 477, 478, 479, 480, 481] have been proposed recently to exclude teacher networks, which instead exploit the knowledge from the student network itself to guide the training process of the student network with teacher-free methods. For example, [475] introduces deep mutual learning, which demonstrates that the pre-trained teacher network is not necessary in the context of knowledge distillation. Instead, [475] demonstrates that an ensemble of student networks can collaboratively learn from each other throughout the training process and, more importantly, can achieve surprisingly better training accuracy than standard knowledge distillation practices [11, 28]. This explicitly indicates that the knowledge from the student network itself can also be leveraged to improve the training process of the student network towards better training accuracy.
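As a rough illustration of the teacher-free, mutual-learning idea in [475], the sketch below trains a cohort of two peer students, each supervised by the ground-truth labels and by the other peer's predicted distribution; the loss weighting is illustrative.

```python
import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, labels, kd_weight=1.0):
    """Deep mutual learning for two peer students (no pre-trained teacher).

    Each network is supervised by the ground truth and by the other network's
    predicted distribution (treated as a constant target via detach()).
    """
    ce_a = F.cross_entropy(logits_a, labels)
    ce_b = F.cross_entropy(logits_b, labels)

    p_a = F.softmax(logits_a.detach(), dim=-1)
    p_b = F.softmax(logits_b.detach(), dim=-1)
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=-1), p_b, reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=-1), p_a, reduction="batchmean")

    return ce_a + kd_weight * kl_a, ce_b + kd_weight * kl_b
```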
Privileged Knowledge Distillation. Privileged knowledge distillation is a special type of knowledge distillation in which the student network has access to additional information or features that are not available to the teacher network during the training process [482]. In contrast to the standard knowledge distillation paradigm [11], privileged knowledge distillation allows the student network to learn from both the teacher network and the additional information that is only available to the student network, which has the potential to further improve the attainable accuracy as demonstrated in [482]. The rationale behind privileged knowledge distillation is that the student network can leverage the additional information to improve its ability to mimic the behavior of the pre-trained teacher network. Furthermore, inspired by [482], several privileged knowledge distillation works [483, 484, 485, 486, 487] have been proposed recently to improve the performance of the student network in various tasks. For example, [484] explores progressive privileged knowledge distillation to embrace better online action detection. In addition, [487] introduces privileged feature distillation to improve the product recommendations of Taobao. These privileged knowledge distillation works clearly demonstrate that the student network may benefit from the additional information and knowledge to achieve better training accuracy on the target task.

4.3.2 Data-Efficient Knowledge Distillation.

In this section, we introduce generative adversarial networks (GAN)-based knowledge distillation and few-sample knowledge distillation, which are of high data efficiency and can perform teacher–student distillation using a small amount of training data.
GAN-based Knowledge Distillation. Despite promising accuracy improvement, knowledge distillation is often data driven and highly relies on sufficient training data to transfer the rich pre-trained knowledge from the teacher network to the student network. This inevitably leads to significant engineering efforts for data preparation, such as data collection, cleaning, and labeling. As seen in the relevant literature, GANs have been applied to a wide range of tasks and are considered to be one of the most effective approaches for generating high-quality synthetic data [504]. In light of this, a plethora of GAN-based knowledge distillation works [488, 489, 490, 491, 492, 493, 494] have been proposed recently to leverage GANs to generate sufficient training data and then use the generated data to train the student network. These GAN-based knowledge distillation works have demonstrated significant data efficiency since the well-optimized generator can be used to produce a large amount of high-quality synthetic data while at the same time achieving promising accuracy improvement on the target task.
Few-Sample Knowledge Distillation. In addition to GAN-based knowledge distillation, another promising direction is to perform efficient knowledge distillation that transfers the rich knowledge from the pre-trained teacher network to the student network with only a small amount of training data or even a few data samples, which can also bring significant data efficiency. To achieve this, several few-sample knowledge distillation works [33, 495, 496, 497, 498, 499, 500] have been proposed recently. Among them, [497] proposes a simple yet effective solution for knowledge distillation using label-free few samples to realize both data efficiency and training/processing efficiency. Specifically, [497] first inserts one \(1\times 1\) convolutional layer at the end of each building block of the student network and then optimizes the inserted \(1\times 1\) convolutional layer to minimize the knowledge distillation loss, which can quickly converge using only a few data samples. More recently, [500] introduces an effective mimicking-then-replacing knowledge distillation technique to quickly train the student network with only a few data samples, which maintains significant data efficiency while still achieving superior training accuracy.
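The block-level alignment idea of [497] can be roughly sketched as follows, where only an appended 1×1 convolution is optimized to match the teacher's block output on a handful of unlabeled samples; the function interface and hyperparameters are illustrative, and for brevity both blocks receive the same input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def align_block_with_few_samples(student_block, teacher_block, unlabeled_batches,
                                 channels, lr=1e-3, steps=100):
    """Few-sample, label-free alignment of one student block to its teacher block.

    Only the appended 1x1 convolution is optimized; the student block itself
    stays frozen, so convergence with only a few samples is feasible.
    """
    adapter = nn.Conv2d(channels, channels, kernel_size=1)
    optimizer = torch.optim.Adam(adapter.parameters(), lr=lr)

    for step in range(steps):
        x = unlabeled_batches[step % len(unlabeled_batches)]
        with torch.no_grad():
            # In practice each block receives the features produced by the
            # preceding (already aligned) blocks; the same input is used here
            # purely for brevity.
            target = teacher_block(x)
            student_out = student_block(x)
        loss = F.mse_loss(adapter(student_out), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adapter
```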

4.4 Visions for the Future

In this section, we envision several promising future trends and possible directions in the field of network compression, which are summarized as follows:
(1)
Automated Teacher–Student Search. Knowledge distillation transfers the rich knowledge from the pre-trained teacher network to the student network to facilitate the training process of the student network, which has achieved promising accuracy improvement [11, 28]. In the past, researchers empirically exploited larger networks as teacher networks and smaller networks as student networks. However, such empirical practices may lead to suboptimal accuracy and cannot always achieve accuracy improvement. The rationale behind this is that different student networks may prefer quite different teacher networks, as shown in [505, 506]. This further motivates us to design the optimal teacher–student network pair to maximize the attainable accuracy of the student network. To achieve this, one promising alternative is to leverage recent advances in the field of NAS to automatically search for the optimal teacher–student network pair.
(2)
Joint Network Compression. To embrace better accuracy–efficiency trade-offs, an intuitive and straightforward approach is sequential network compression, which exploits multiple network compression techniques to progressively reduce network complexity. For example, [507] introduces a simple yet effective sequential compression pipeline, which starts with searching for an efficient network with ProxylessNAS [40] and then applies automated channel pruning [369] and mixed-precision quantization [429] to further trim down the network size. However, this sequential compression pipeline has critical drawbacks that lead to suboptimal results. This is because the searched optimal network is not necessarily optimal for subsequent pruning and quantization. To address this, one promising future direction is joint network compression, which jointly optimizes the network structure, pruning, and quantization to yield the best accuracy–efficiency trade-off.
(3)
Federated Network Compression. Federated learning is an emerging decentralized learning approach that allows multiple hardware devices to collaboratively learn the same network without sharing their raw data [508]. Specifically, federated learning allows the network to be trained locally on each hardware device using its own data, in which only the network updates, rather than the raw data, are sent back to the central server for further aggregation. In light of this, one promising future direction is federated network compression, including federated pruning, federated quantization, and federated distillation, which can significantly enhance data privacy and protect data security while still achieving competitive performance in terms of accuracy and training efficiency.
(4)
Domain-Specific Network Compression. In addition to image classification, there are a wide range of popular downstream applications, such as object detection, tracking, and semantic segmentation, in which the involved networks are still quite computation intensive. This makes it difficult to accommodate the limited available computational resources in real-world embedded scenarios. To tackle this issue, some early practices have attempted to leverage general network compression techniques to compress domain-specific networks. For example, [502, 509, 510] propose to leverage pruning [510], quantization [509], and knowledge distillation [502] to improve the accuracy–efficiency trade-off in real-world object detection scenarios, which, however, are still under-explored and cannot simply generalize to other scenarios. Therefore, one promising future direction is domain-specific network compression, which exploits domain-specific knowledge to largely boost network compression performance towards better accuracy–efficiency trade-offs.
(5)
Mixed-Precision Training. Mixed-precision training refers to the technique that trains the network with both full-precision weights and low-bit weights, which has the potential to significantly improve training efficiency without sacrificing accuracy. For example, PyTorch [67] introduces Automatic Mixed Precision (AMP), which combines 32-bit full-precision weights with 16-bit half-precision weights during the training process. As a result, AMP achieves the same level of accuracy as the stand-alone 32-bit full-precision training while at the same time being able to deliver about ×2 training speedups for convolutional networks [511]. In light of this, one promising future direction is to leverage low-bit mixed-precision training techniques to train full-precision networks, which may aggressively push forward training efficiency without degrading accuracy.
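To make the AMP workflow concrete, the following is a minimal PyTorch training-loop sketch using torch.cuda.amp; the model, optimizer, and train_loader objects are assumed to be defined elsewhere.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

for inputs, labels in train_loader:   # model/optimizer/train_loader assumed defined
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # runs eligible ops in FP16, keeps others in FP32
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then takes the optimizer step
    scaler.update()                   # adjusts the scale factor for the next iteration
```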

5 Efficient On-Device Learning for Embedded Computing Systems

On-device learning consists of two branches: on-device inference and training. On-device inference refers to the process of deploying efficient pre-trained networks on local hardware devices, which allows local hardware devices to run various intelligent inference tasks, such as image classification and object detection. There have been several representative techniques [513, 514] to enable efficient on-device inference. They focus on either designing computation-efficient networks with less redundancy or compressing computation-intensive networks to reduce the computational complexity in order to accommodate the limited on-device computational resources. Note that this article has discussed popular techniques for efficient on-device inference, such as efficient manual/automated network design and efficient network compression. The readers may refer to Sections 2 to 4 for more details.
On-device training refers to the capability of local hardware devices to perform training tasks directly on-device without the need for remote servers [44]. Unlike on-device inference, in which the deployed network always remains static, on-device training may further enhance the deployed network over time. This allows the deployed network to adapt to new data collected from local sensors to achieve better accuracy. Thanks to its strong capability of protecting data privacy and ensuring data security, on-device training has become increasingly popular over the past few years in order to achieve secured embedded intelligence [546]. To this end, we systematically discuss recent state-of-the-art on-device learning techniques (especially on-device training) in this section, including general on-device learning in Section 5.1, on-device continual learning in Section 5.2, on-device transfer learning in Section 5.3, and on-device federated learning in Section 5.4, since these techniques feature different learning algorithms to enhance on-device learning performance. For better understanding, we also summarize these methods in Figure 26. Note that these on-device learning techniques can typically generalize across different networks (e.g., convolutional networks and transformers). For example, we can leverage on-device federated learning to optimize both convolutional networks and transformers on multiple local hardware devices.
Fig. 26.
Fig. 26. Comparisons of efficient on-device learning techniques that have been discussed in Section 5.

5.1 General On-Device Learning

In this section, we introduce recent state-of-the-art works about general on-device learning techniques, including efficient on-device inference and efficient on-device training.
Efficient On-Device Inference. To enable efficient on-device inference, one straightforward approach is to design tiny networks with less redundancy in order to accommodate the limited on-device computational resources. To this end, a plethora of representative tiny networks [512, 513, 514, 515] have been proposed recently, including MicroNets [512], MCUNets [513, 514], and EtinyNet [515]. MCUNetV1 [513], one of the early tiny networks, proposes to jointly design the lightweight tiny network using TinyNAS and the lightweight inference engine using TinyEngine, enabling ImageNet-scale inference on microcontrollers. MCUNetV2 [514] introduces an efficient patch-based inference pipeline to trim down on-device memory consumption for better on-device inference since memory consumption is the key bottleneck of on-device inference. However, in contrast to training large networks, training tiny networks poses significant challenges, as demonstrated in [547]. The rationale here is that existing regularization techniques (e.g., data augmentation and dropout), despite being able to benefit the training process of large networks, may degrade the training performance of tiny networks [547]. To tackle this issue, [547] proposes augmentation of the tiny network itself rather than augmenting the input data, which shows promising accuracy improvement over the standard training scheme.
Efficient On-Device Training. The key difference between on-device training and inference is that on-device training requires saving all of the intermediate activations, which are used to optimize parameters using gradient descent during backward propagation. In contrast, on-device inference that only performs forward propagation does not need to save intermediate activations, which can be progressively released to reduce memory consumption. In light of this, on-device training suffers from non-negligible memory consumption since the activation size grows with respect to the training batch size and training typically involves a large batch size to accelerate the training process. As a result, intermediate activations become the major bottleneck of on-device training, as demonstrated in [44]. For example, under the batch size of 16, the activation size of ResNet50 [2] is ×13.9 larger than its parameter size, as shown in Figure 27. To alleviate the excessive memory consumption caused by intermediate activations, there have been several representative strategies, including gradient checkpointing, activation gradient pruning, and low-bit training.
Fig. 27.
Fig. 27. Comparisons between on-device training and inference in terms of memory consumption. This reveals that the activation size, instead of the parameter size, is the major bottleneck of on-device training, motivating future research to reduce the activation size for efficient on-device training (figure from [44]).
(1)
Gradient Checkpointing. Gradient checkpointing is a simple yet effective memory optimization technique that seeks to reduce training memory consumption at the cost of increased training time [516]. To this end, gradient checkpointing reserves a minimal set of intermediate activations during forward propagation, which are then utilized to re-compute the remaining intermediate activations during backward propagation. As shown in [516], gradient checkpointing has the potential to significantly reduce the training memory consumption from \(O(n)\) to \(O(\sqrt {n})\), where n is the number of network layers. More importantly, gradient checkpointing does not degrade training accuracy since the training behaviors remain the same as the standard training scheme. Several subsequent gradient checkpointing works generalize gradient checkpointing to allow arbitrary computation graphs [517] and train GNNs [518].
(2)
Activation Gradient Pruning. Activation gradient pruning removes less important intermediate activation gradients to optimize training memory consumption [519]. This relies on an empirical observation that most of the intermediate activation gradients during backward propagation are very close to zero and thus have minimal impact on gradient descent [519]. Therefore, pruning these very small activation gradients can effectively reduce training memory consumption at the cost of minimal accuracy loss, which also accelerates the training process. Similar to [519], the work of [520] proposes an efficient gradient filtering scheme, which filters similar activation gradients during backward propagation and only reserves those with unique elements to reduce the number of elements in the activation gradient maps. Another popular approach is to build dynamic sparse computation graphs to eliminate intermediate activations in an input-dependent manner, which can also reduce training memory consumption [521].
(3)
Low-Bit Training. Low-bit training refers to training the given network with low-bit weights (e.g., 16, 8, or even fewer bits) rather than full-precision 32-bit weights, which has the potential to substantially reduce training memory consumption (up to ×32 in the extreme 1-bit case) [82]. The rationale here is that low-bit training can reduce the memory consumption for both network weights and intermediate activations. Specifically, [522], an early exploration, proposes to train the given network with 16-bit weights under stochastic rounding, which leads to ×2 less training memory consumption than the standard full-precision 32-bit training counterpart and, more importantly, maintains comparable training accuracy. The work of [523] introduces an efficient INT8 training pipeline, consisting of loss-aware compensation and backward quantization, to enable tiny on-device training thanks to the well-optimized INT8 quantization on mainstream hardware. The work of [45] proposes to optimize real-quantized graphs, in which an effective memory-efficient sparse update scheme and tiny training engine are integrated to achieve on-device training under 256 KB of memory. It is worth noting that low-bit training is similar to network quantization as discussed in Section 4.2 since both leverage quantized weights to trim down network complexity. This allows recent advanced quantization techniques to be generalized to benefit low-bit on-device training.
We note that the aforementioned strategies can also be combined to further reduce the overall memory consumption during training and accelerate the training process. For example, gradient checkpointing and low-bit training can be easily combined so that the \(O(\sqrt {n})\) activation memory of checkpointing, where n is the number of network layers, is further reduced by a constant factor determined by the chosen bit width.
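As a minimal illustration of gradient checkpointing, the sketch below wraps an illustrative sequential backbone with PyTorch's torch.utils.checkpoint utilities; the same wrapper can be combined with the mixed-precision recipe sketched earlier to stack both savings.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative sequential backbone; only segment-boundary activations are kept in
# memory, and the rest are re-computed during the backward pass.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)])
inputs = torch.randn(32, 1024, requires_grad=True)

segments = 4  # roughly sqrt(depth) segments trade extra compute for less memory
outputs = checkpoint_sequential(model, segments, inputs)
loss = outputs.mean()
loss.backward()  # activations inside each segment are re-computed here
```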

5.2 On-Device Continual Learning

On-device continual learning, also known as on-device lifelong learning or incremental learning, is an advanced learning paradigm that allows the deployed network to continuously learn from the newly collected data to further push forward the attainable accuracy [46, 533]. This is particularly favored in real-world embedded scenarios, especially those with rich local sensors, where embedded devices can continue to collect new data through local sensors over time [44, 45]. The newly collected data is then used to train the deployed network to unlock better performance over time. In other words, we are allowed to utilize the newly collected data to train or fine-tune the deployed network on target hardware itself, which typically leads to better accuracy on the target task. On-device continual learning performs local training and does not need to send back the newly collected data to remote servers, which also protects data privacy and ensures data security. However, on-device continual learning, despite being able to deliver significant benefits, suffers from the catastrophic forgetting issue, which is the tendency to forget the previously learned knowledge when adapting to newly collected data [46]. The rationale is that on-device continual learning must adjust the pre-trained network weights in order to adapt to the newly collected data, which deteriorates the previously learned knowledge accordingly.
To alleviate the catastrophic forgetting issue, a plethora of state-of-the-art on-device continual learning works have been established recently [46, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534], which seek to stabilize the on-device continual learning process and further push forward the attainable accuracy on the target task. Among them, [46], an early exploration, investigates three common continual learning scenarios and demonstrates that it is frustratingly hard to evaluate different continual learning approaches, based on which [46] establishes several evaluation protocols to compare them. It is worth noting that [46] does not specifically target on-device continual learning, but we can easily integrate recent advances from the lens of on-device training (see Section 5.1) into [46] to further enable efficient on-device continual learning. Several subsequent on-device continual learning works [524, 525, 526, 527] explore on-device continual learning on resource-constrained embedded computing systems and have demonstrated promising accuracy improvement. In parallel, [528, 529, 530] attempt to generalize on-device continual learning to benefit audio and speech tasks in real-world embedded scenarios, such as environmental sound classification [528] and automatic speech recognition [530]. Inspired by the tremendous success of vision transformers, [107, 531] also investigate the efficacy of on-device continual learning to continuously improve the accuracy of mainstream vision transformers. The works of [532, 533, 534] delve deeper into on-device continual learning, focusing more on the training pipeline and introducing several on-device training enhancements to maximize accuracy improvement, such as selective weight updates [532], weight freezing [533], and deep network ensembles [534].

5.3 On-Device Transfer Learning

As demonstrated in [44], it is often difficult to directly train DNNs from scratch in real-world embedded scenarios, where the number of collected data samples is quite limited. To tackle this issue, an effective alternative is on-device transfer learning, which fine-tunes networks pre-trained on large-scale datasets. The rationale here is that DNNs pre-trained on large-scale datasets (e.g., ImageNet [80]) can serve as powerful feature extractors for further transfer learning, which only fine-tunes several layers (e.g., batch normalization layers and the last layer), whereas other layers are typically frozen. In contrast to previous on-device learning practices as discussed in Sections 5.1 and 5.2, on-device transfer learning does not require storing most of the memory-intensive intermediate activations. As a result, it maintains significant efficiency in terms of training memory consumption, as illustrated in Figure 28 [44]. Despite the promising memory efficiency, on-device transfer learning is quite challenging and may result in poor accuracy, especially on datasets whose data distribution is far from ImageNet [82].
Fig. 28.
Fig. 28. Comparisons of various on-device transfer learning methods, including TinyTL [44], FT-Norm+Last [537], FT-Last [536], and FT-Full [548]. TinyTL freezes the weights and only optimizes the bias modules. FT-Norm+Last fine-tunes the normalization layers and the last linear layer, whereas FT-Last fine-tunes the last linear layer and FT-Full fine-tunes the full network (figure from [44]).
To overcome such limitations, early transfer learning practices [535, 536] propose to fine-tune all the network layers, which indeed achieves better accuracy but leads to considerable memory consumption due to memory-intensive intermediate activations. To avoid this, several subsequent transfer learning works [537, 538, 539, 540] demonstrate that it is often not necessary to fine-tune all the network layers and that fine-tuning batch normalization layers alone can also achieve strong accuracy on the target task. This fine-tuning paradigm has the potential to significantly reduce the number of trainable parameters during the transfer learning process. In light of this, [537, 538, 539, 540] propose to only optimize learnable parameters in batch normalization layers (see \(\gamma\) and \(\beta\) in Equation (13)), whereas other learnable parameters are frozen during the transfer learning process. For example, [537] leverages batch normalization layers as scale-and-bias patches and then trains the patched parameters, optionally also the last layer, whereas the remaining parameters are left unchanged. Furthermore, [538] reveals that, for those networks with sufficient depth, training only \(\gamma\) and \(\beta\) can reach surprisingly strong accuracy, which demonstrates the expressive power of the learnable parameters in batch normalization layers. However, fewer trainable parameters do not directly translate to superior training memory efficiency, as shown in Figure 27: fine-tuning only the batch normalization layers may still involve a large amount of memory consumption (e.g., 326 MB under a training batch size of 8) to store their memory-intensive intermediate activations [44].
To further alleviate the prohibitive training memory consumption, [44] introduces a simple yet effective transfer learning solution, which exhibits significant training memory efficiency. Specifically, [44] relies on an empirical observation that intermediate activations are only required to update network weights, whereas updating network biases does not involve intermediate activations. This observation also reveals that the training memory bottleneck comes from updating network weights rather than biases. In light of this, [44] proposes to freeze network weights and only update network biases. However, freezing network weights and only updating network biases may lead to significant accuracy loss. To compensate for such accuracy loss due to freezing network weights, [44] introduces an effective lite residual learning scheme, which leverages generalized memory-efficient bias modules to refine memory-intensive intermediate activations. In particular, the lite residual learning scheme can improve the attainable accuracy on the target task and, more importantly, at the cost of negligible memory overheads. Finally, [44] reduces the training memory consumption from more than 250 MB to only 16 MB, making it possible to explore in-memory computing infrastructures to perform memory-efficient transfer learning. Note that we can easily integrate recent advances from the lens of on-device training (see Section 5.1) into the aforementioned transfer learning works towards boosted on-device transfer learning performance.
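The lightweight fine-tuning recipes above can be sketched as follows, in the spirit of FT-Norm+Last [537] and the bias-only idea behind TinyTL [44] (its lite residual modules are omitted); the backbone, head size, and optimizer settings are illustrative.

```python
import torch
import torchvision

# Illustrative ImageNet-pre-trained backbone; any network could be used instead.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new head for the target task

# Freeze everything first; weight updates are what require storing the
# memory-intensive input activations, as discussed above.
for param in model.parameters():
    param.requires_grad = False

# Re-enable only the lightweight pieces: BatchNorm affine parameters (gamma, beta),
# biases, and the new classifier head.
for module in model.modules():
    if isinstance(module, torch.nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = True
for name, param in model.named_parameters():
    if name.endswith(".bias") or name.startswith("fc."):
        param.requires_grad = True

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01, momentum=0.9
)
```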

5.4 On-Device Federated Learning

On-device federated learning is an advanced decentralized learning paradigm that enables efficient training on a large corpus of decentralized data residing on local client devices such as mobile phones and allows multiple local client devices to jointly train the given network without explicitly sharing their raw data [541, 549]. In practice, on-device federated learning has the potential to significantly accelerate the training process as the number of client devices grows. In addition, on-device federated learning is one instance of the more general approach of “bringing the neural network to the data” rather than “bringing the data to the neural network.” As a result, it addresses the fundamental problems of data privacy, security, and ownership [508]. This is particularly favored in real-world embedded scenarios, in which embedded devices can continue to collect new data through local sensors. Thanks to these practical benefits, on-device federated learning has garnered increasing attention from both academia and industry [47]. In the past decade, on-device federated learning has been utilized to empower a plethora of real-world intelligent applications, such as mobile keyboard content suggestions [550], medical image analysis [551], and smart health care infrastructures [552]. As demonstrated in [541], standard on-device federated learning practices typically consist of the following five iterative steps (a minimal code sketch of this workflow is provided after the list):
(1)
Initialization. On-device federated learning begins with the randomly initialized network, namely, the global model, which is shared among local client devices. At the early learning stage, the global model is sent to all local client devices from the centralized server, in which each local client device receives the same copy of the global model.
(2)
Local Training. Once local client devices receive the global model, they start to perform local on-device training, which treats the global model as the local model and then trains the local model using the locally collected data. Note that the locally collected data only resides on the local client device and is not shared among other client devices.
(3)
Model Update. After local on-device training terminates, each local client device generates its respective model update scheme, which should essentially reflect what the local client device has learned from the locally collected data. These model update schemes, instead of the locally collected data, are then sent back to the centralized server for further aggregation, which effectively mitigates the risk of data leakage and protects data privacy.
(4)
Aggregation. The centralized server receives model update schemes from all local client devices, after which the centralized server aggregates the received model update schemes to further produce an improved global model.
(5)
Distribution. The centralized server then distributes the improved global model to all local client devices and these steps repeat until convergence.
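The five-step workflow above can be sketched as a federated-averaging round in the spirit of [541]; the client data loaders and the local training routine are assumptions for illustration, all clients participate in the round, and aggregation is weighted by local dataset size.

```python
import copy
import torch

def federated_averaging_round(global_model, client_loaders, local_train_fn):
    """One FedAvg-style round: distribute, train locally, aggregate, redistribute.

    Each client copies the global model, trains it locally, and returns only its
    updated weights; the server averages them weighted by local dataset size.
    """
    client_states, client_sizes = [], []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)       # (1)(2) distribute + local training
        local_train_fn(local_model, loader)             # assumed local SGD routine
        client_states.append(local_model.state_dict())  # (3) model update, not raw data
        client_sizes.append(len(loader.dataset))

    total = float(sum(client_sizes))
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:                               # (4) weighted aggregation
        # In this simplified sketch, buffers (e.g., BN statistics) are averaged too.
        new_state[key] = sum(
            state[key].float() * (size / total)
            for state, size in zip(client_states, client_sizes)
        )
    global_model.load_state_dict(new_state)             # (5) distribute the improved model
    return global_model
```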
Despite being able to deliver superior learning performance across various real-world embedded tasks, on-device federated learning suffers from critical limitations, especially from the perspective of data transmission [508], posing significant challenges to generalizing on-device federated learning to benefit real-world intelligent embedded applications. In contrast to the centralized server that is equipped with high-end network infrastructures, local client devices in real-world embedded scenarios are often low end with less-capable network infrastructures. In such a case, it may be time-consuming to (1) distribute the global model from the centralized server to local client devices and (2) send back model update schemes from local client devices to the centralized server for further aggregation. To overcome such limitations, a plethora of advanced federated learning techniques have been established recently to accommodate the limited data bandwidth of local client devices [47, 414, 541, 542, 543, 544, 545], which primarily focus on reducing the total data bits transferred between the remote centralized server and local client devices, such as federated averaging [541], gradient compression [542, 543], quantization [414], delayed gradient averaging [544], partial variable training [545], and local training sparsity [47]. Note that we can easily integrate recent advances from the lens of general on-device training techniques (see Section 5.1) into the aforementioned federated learning works to further enhance on-device federated learning.

5.5 Visions for the Future

In this section, we envision several promising future trends and possible directions in the field of on-device learning, which are summarized as follows.
(1)
Offline On-Device Federated Learning. As discussed in Section 5.4, on-device federated learning highly relies on the centralized server for updating local models, which requires stable Internet connectivity for data movement between local devices and the remote server. This may lead to inferior on-device learning efficiency due to the communication overheads between local devices and the remote server, especially when Internet connectivity is limited or unavailable. Therefore, one promising future trend is offline on-device federated learning, which excludes the remote centralized server and instead coordinates the learning tasks directly among local devices. Offline on-device federated learning has the potential to significantly boost on-device learning efficiency.
(2)
Personalized On-Device Learning. As discussed in Section 5.1, on-device learning exhibits strong local personalization, which distinguishes it from its global training counterpart. Personalized on-device learning can bring two-fold benefits. On the one hand, it allows local devices to directly learn from local users to provide user-tailored AI solutions, which protects data privacy since the collected data do not need to be transferred to the cloud. On the other hand, it can achieve better learning accuracy since it can continue to collect rich new personalized training data from local users. Therefore, future research should leverage this unique capability to provide highly personalized on-device learning solutions, in which local devices can actively and quickly adapt themselves to users’ diverse needs to deliver user-tailored services, such as personalized voice assistants.
(3)
Robust On-Device Learning. On-device learning, despite being able to achieve promising success, still suffers from critical limitations, such as poor adversarial robustness [553]. This is particularly important in real-world embedded computing systems, such as embedded visual sensing [554, 555], in which the environments may dynamically change over time. This further makes local on-device learning more vulnerable to adversarial attacks, especially those unseen adversarial attacks, which may significantly degrade the on-device learning performance even when encountering simple adversarial attacks [553]. To overcome such limitations, future research should focus on developing robust on-device learning techniques featuring novel adversarial training algorithms, which can achieve competitive on-device learning performance while also maintaining superior adversarial robustness against well-engineered adversarial attacks or even unseen adversarial attacks.
(4)
Efficient On-Device Learning Ecosystems. As discussed in Section 5.1, on-device learning has gained increasing popularity from both academia and industry thanks to its strong capability to ensure data privacy and security. In light of this, future research should also develop efficient on-device learning ecosystems, including software and hardware frameworks, to further support the development, deployment, and management of on-device learning applications, making it easier for developers to create and optimize models for various on-device learning purposes. For example, [45], one of the most representative on-device learning methods, leverages quantization to trim down training memory consumption. However, mainstream embedded computing systems do not natively support low-bit training, making it difficult for such methods to deliver their full benefits on mainstream embedded platforms.

6 Efficient Large Language Models for Embedded Computing Systems

In the past few years, LLMs, such as GPT-3 [48] and GPT-4 [49], have achieved impressive success across various real-world language processing tasks [50]. However, the strong learning capability of LLMs also comes at the cost of excessive computational complexity. For example, OpenAI’s GPT-3 [48], one of the most representative LLMs, consists of 175 billion parameters. More recently, LLMs have continued to evolve with ever-increasing model sizes in order to achieve state-of-the-art performance [51, 52]. This makes it even more challenging to deploy LLMs on modern embedded computing systems. To this end, we first introduce the preliminaries on LLMs and then discuss recent state-of-the-art advances from the perspective of efficient LLMs in this section, including efficient LLM architecture design in Section 6.2, efficient LLM compression techniques in Section 6.3, and efficient LLM system design in Section 6.4. We also summarize the above state-of-the-art advances regarding efficient LLMs in Figure 29. Finally, in Section 6.5, we envision several promising future directions in the field of efficient LLMs.
Fig. 29.
Fig. 29. Overview of efficient LLM architectures, LLM compression techniques, and LLM systems in Section 6.

6.1 Preliminaries on LLMs

LLMs are emerging machine learning models that are dedicated to understanding, generating, and interacting with human language through leveraging extensive textual data. In practice, LLMs are typically built upon the transformer architecture [90] and heavily rely on self-attention mechanisms to measure the significance of different words in the given sentence regardless of their positional relationships. Thanks to their strong capability to interpret rich information, LLMs can exhibit remarkable performance across a wide range of language processing tasks, such as text summarization, translation, question answering, and conversational response generation. As discussed in [50], recent state-of-the-art LLMs can be divided into three main categories according to their inherent architectures: encoder-only architecture, decoder-only architecture, and encoder–decoder architecture, as follows.
(1)
Encoder-Only Language Models. Encoder-only language models typically focus on transforming the given input text into continuous representations, which can capture and reflect the context of the given input text. These encoder-only language models are usually used for real-world language processing tasks that require understanding or embedding of the given input text, such as sentence classification, named entity recognition, and extractive question answering, where the output does not need to be sequential or generated texts. For example, BERT [91] is one of the most representative encoder-only language models that features masked language modeling during training, which enables the model itself to understand the context from both directions (i.e., left and right context).
(2)
Decoder-Only Language Models. Decoder-only language models typically focus on generating texts based on the given input text, which can interpret the context of the given input text. These decoder-only language models are usually used for real-world language processing tasks in which generating texts is required, such as text generation and language modeling. For example, GPT-3 [48] is one of the most representative decoder-only language models that features advanced auto-regressive training, which can learn to accurately predict the next word in sequence from all the previous words.
(3)
Encoder-Decoder Language Models. Encoder-decoder language models, also known as sequence-to-sequence (seq2seq) models, typically consist of two parts: (1) the encoder to process the given input text and encode it into feature representations and (2) the decoder to explore the above feature representations to generate an output sequence. The encoder-decoder architecture is versatile and suitable for real-world language processing tasks that require transformation of the given input text into different formats, such as language translation, summarization, and dialogue systems. For example, T5 [556] is one of the most representative encoder-decoder language models, which formulates the given task as a text-to-text transformation problem and converts the given input text to the target output text. In parallel, BART [557] is particularly effective in generative and comprehensive tasks thanks to its bidirectional encoder and auto-regressive decoder.

6.2 Efficient LLM Architectures

As discussed in Section 6.1, recent LLMs are typically built upon transformers [90] and heavily rely on self-attention mechanisms to interpret the significance of different words in a given sentence, regardless of their positional relationships. However, self-attention mechanisms, despite their strong capability for language processing, also introduce considerable computational complexity. As pointed out in [599], the quadratic time and memory complexity of self-attention mechanisms may significantly slow down the pre-training, inference, and fine-tuning stages of LLMs. To optimize the prohibitive computational complexity of LLMs, recent state-of-the-art efficient LLMs often focus on exploring computation-efficient self-attention mechanisms.
General Efficient Attention. Some recent works [558, 559, 560, 561, 562, 563] focus on exploring computation-efficient attention mechanisms to optimize the quadratic computational complexity of vanilla self-attention [90]. Among them, [558] introduces clustered attention, which groups different queries into clusters and computes the attention just for the centroids rather than computing the attention for every query. To further improve the above approximation, [558] also employs the computed clusters to identify the keys with the highest attention per query and computes the exact key-query dot products. The work in [559] draws insights from the Nyström method and proposes approximating the standard self-attention mechanism in linear complexity, which can enable applications to longer sequences with even thousands of tokens. The work of [560] demonstrates that the complexity bottleneck of self-attention mainly comes from the computation of partition functions in the denominator of the softmax function and the multiplication of the softmax matrix with the matrix of values. To this end, [560] features an efficient kernel density estimation (KDE) solver to resolve the above complexity bottleneck via subsampling-based fast matrix products, which can approximate the attention in subquadratic time with provable spectral norm bounds. The work of [561] introduces an efficient single-head gated attention mechanism with an exponential moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism, which can exhibit linear time and space complexity while causing minimal performance loss. The work of [562] introduces an efficient universal approximation for self-attention, which can exhibit linear time and space complexity and reveal the theoretical insights behind existing efficient transformers (e.g., Linformer [97]) with linear time and space complexity. The work of [563] introduces an efficient attention approximation mechanism featuring fused low-rank kernel approximation, which can provide sizable runtime performance gains and is also high quality.
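As a rough illustration of the linear-complexity approximations discussed above, the sketch below implements kernelized linear attention with an ELU+1 feature map (one common illustrative choice); causal masking and the numerical refinements used by specific methods are omitted.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear-complexity attention: softmax(QK^T)V is approximated by
    phi(Q) (phi(K)^T V), so the cost grows linearly in sequence length.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    phi_q = F.elu(q) + 1.0                                   # positive feature map phi(.)
    phi_k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", phi_k, v)           # phi(K)^T V: linear in n
    normalizer = torch.einsum("bhnd,bhd->bhn", phi_q, phi_k.sum(dim=2))
    out = torch.einsum("bhnd,bhde->bhne", phi_q, kv)
    return out / (normalizer.unsqueeze(-1) + eps)
```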
Hardware-Aware Efficient Attention. In parallel to the above efficient attention approximation works, some recent works [53, 54, 55, 564, 565, 566] focus on exploring efficient hardware-aware attention mechanisms, which can exhibit considerable efficiency improvement on modern hardware systems. Among them, FlashAttention [54] features an efficient IO-aware exact attention algorithm, which explores tiling to reduce the total number of memory reads and writes between GPU high bandwidth memory and GPU on-chip Static Random Access Memory (SRAM). However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, which is mainly due to the suboptimal work partitioning between different thread blocks and warps on GPUs. To tackle this issue, FlashAttention-2 [55] introduces an improved work partitioning scheme, which (1) tweaks the algorithm to reduce the number of non-MatMul FLOPs, (2) parallelizes the attention computation workloads across different thread blocks, and (3) distributes the work between warps to reduce communications through shared memory. FLASHLINEARATTENTION [566] dives deeper into I/O awareness and introduces an effective hardware-efficient algorithm for linear attention, which trades off memory movement against parallelizability and can even be faster than FlashAttention-2. PagedAttention [53] draws insights from the operating system’s solution to memory fragmentation and sharing through virtual memory with paging and further divides the request’s key-value (KV) cache into different blocks, each of which contains the attention keys and values of a fixed number of tokens. A3 [564] demonstrates that implementing attention mechanisms using matrix-vector multiplication is often suboptimal and further proposes to accelerate attention mechanisms with joint algorithmic approximation and hardware specialization. Similar to A3, ELSA [565] features an effective approximation scheme to significantly reduce the amount of computation workloads by efficiently filtering out relationships that are unlikely to affect the final output.
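From a practitioner's viewpoint, fused attention kernels of this kind are already exposed by mainstream frameworks; for example, the PyTorch call below can dispatch to a FlashAttention-style fused kernel when the hardware, data types, and shapes allow it (a minimal usage sketch with illustrative shapes).

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); half precision is typically required for the
# fused FlashAttention-style backend on recent GPUs.
q = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)

# PyTorch selects an available fused backend internally; is_causal=True applies the
# autoregressive mask used by decoder-only LLMs.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```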

6.3 Efficient LLM Compression

In addition to designing LLMs with efficient architectures, another promising direction is to explore efficient LLM compression techniques to optimize the computational complexity of existing computation-intensive LLMs. With this in mind, we discuss recent state-of-the-art compression techniques for LLMs in this section, including efficient LLM pruning in Section 6.3.1, efficient LLM quantization in Section 6.3.2, and efficient LLM distillation in Section 6.3.3.

6.3.1 Efficient LLM Pruning.

Pruning is one of the most effective strategies to optimize the computational efficiency of LLMs, which removes the less important parameters of LLMs while incurring minimal accuracy loss. Recent state-of-the-art LLM pruning methods can be divided into two main categories, non-structured LLM pruning and structured LLM pruning, as follows.
(1)
Non-structured LLM Pruning. Non-structured LLM pruning removes the less important LLM weights/connections, which can yield more aggressive compression ratios than structured pruning while also exhibiting strong accuracy [58, 346, 567, 568, 569, 570, 571, 572]. For example, SparseGPT [567] shows that LLMs can be pruned to at least 50% sparsity in one shot without any retraining and, more importantly, at minimal accuracy loss. In parallel, Wanda [58] proposes to prune the weights with the smallest magnitudes multiplied with the norms of the corresponding input activations, compared on a per-output basis (a minimal sketch of this scoring rule is provided after this list). More importantly, both SparseGPT and Wanda can generalize to semi-structured pruning [346, 568] towards better hardware parallelism, which can deliver realistic on-device inference speedups with the support of some existing deep learning libraries (e.g., cuSPARSElt [340] and TVM [341]). The work in [569] advocates for reinstating ReLU activation in LLMs and explores sparse patterns in ReLU-based LLMs, which shows that ReLU activation can effectively reduce LLM inference computation overheads by up to three times. In practice, non-structured pruning has also been widely employed to enhance the pre-training and fine-tuning processes of LLMs [570, 571, 572].
(2)
Structured LLM Pruning. In contrast to non-structured LLM pruning, structured LLM pruning can achieve realistic inference speedups on target hardware, which, however, also suffers from more aggressive accuracy loss than non-structured pruning. To tackle this dilemma, recent state-of-the-art structured LLM pruning methods [57, 573, 574, 575, 576, 577, 578, 579] typically feature an additional fine-tuning stage to further recover the attainable accuracy of the pruned LLM. For example, LLM-Pruner [57] employs structural pruning to selectively remove non-critical coupled structures according to their gradient information, which can preserve the majority of the LLM’s functionality while optimizing its computational efficiency. LLM-Pruner recovers the performance of the pruned LLM using another state-of-the-art tuning technique (i.e., LoRA [600]), which merely takes 3 hours with 50K data samples. Similar to LLM-Pruner, ZipLM [574] iteratively identifies and removes LLM components with the worst loss–runtime trade-off, which can yield efficient LLMs and generalize across various runtime constraints. In addition, LoRAShear [575] first creates dependency graphs over LoRA modules and then proceeds to progressive structured pruning on LoRA adaptors, enabling inherent knowledge transfer. To further recover the information lost during pruning, LoRAShear also introduces an effective fine-tuning scheme with dynamic data adaptors to narrow the performance gap between the pruned and non-pruned LLMs. More recently, several LLM layer pruning methods [577, 578, 579] demonstrate that LLM layers are also redundant and can thus be removed to largely enhance the inference efficiency of LLMs at minimal accuracy loss. For example, ShortGPT [578] and Shorted-LLaMA [579] propose to remove the less important LLM layers according to their layer importance scores, whereas LLM-Streamline [577] proposes to replace the less important LLM layers with more lightweight ones.
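The Wanda-style scoring rule referenced in item (1) can be sketched as follows; the calibration interface (per-channel activation norms collected offline) and the layer-wise sparsity level are illustrative assumptions.

```python
import torch

def wanda_prune_linear(weight, act_norm, sparsity=0.5):
    """Prune a linear layer's weight matrix with a Wanda-style metric [58].

    weight:   (out_features, in_features) weight matrix.
    act_norm: (in_features,) L2 norm of each input feature, collected offline
              from a small calibration set (interface assumed for illustration).
    """
    score = weight.abs() * act_norm.unsqueeze(0)      # |W_ij| * ||X_j||_2
    num_prune = int(weight.shape[1] * sparsity)
    # Per-output comparison group: zero out the lowest-scoring weights in each row.
    prune_idx = torch.argsort(score, dim=1)[:, :num_prune]
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask
```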

6.3.2 Efficient LLM Quantization.

Recent state-of-the-art LLM quantization techniques [59, 60, 580, 581, 582, 583, 584, 585, 586] focus on reducing the weights of LLMs from higher to lower bits (e.g., from 32 bits to 8 bits or even 1 bit), which can substantially enhance the inference efficiency of LLMs at the cost of slight accuracy loss. Among them, SmoothQuant [59] introduces an efficient training-free post-training quantization solution to enable 8-bit weight and 8-bit activation quantization for LLMs. Given that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights through an equivalent mathematical transformation. Similar to SmoothQuant, AWQ [60] introduces an efficient hardware-friendly quantization approach for low-bit LLM weight-only quantization, which is built upon an interesting observation that weights are not equally important and protecting only 1% of salient weights can greatly reduce the quantization error. In light of this, AWQ proposes to search for the optimal per-channel scaling scheme that protects the salient weights by observing the activations rather than the weights. SpQR [580] introduces a new compressed format for efficient LLM quantization, which can enable near-lossless compression of LLMs across various model scales while maintaining compression levels comparable to previous quantization methods. SpQR first identifies and isolates the outlier weights that may cause particularly large quantization errors, after which SpQR stores them in high precision while compressing all other weights in 3 to 4 bits. OS+ [581] features channel-wise shifting for asymmetry and channel-wise scaling for concentration, since these operations can be seamlessly migrated into the subsequent quantization modules while maintaining strict equivalence. OS+ also introduces a fast and stable scheme to calculate effective shifting and scaling values, which can further achieve a better quantization burden balance towards better quantization performance.
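The smoothing transformation behind SmoothQuant can be sketched roughly as follows, where a per-channel scale computed from calibration statistics migrates quantization difficulty from activations to weights; the migration-strength hyperparameter alpha and the calibration interface are illustrative.

```python
import torch

def smooth_linear(act_abs_max, linear_weight, alpha=0.5, eps=1e-5):
    """Compute per-input-channel smoothing scales s_j and fold them in offline.

    act_abs_max:   (in_features,) per-channel max |activation| from calibration.
    linear_weight: (out_features, in_features) weight of the following linear layer.
    The preceding layer's output must be divided by s (e.g., folded into its
    weights or normalization parameters) while the linear weight is multiplied
    by s, so the overall function is mathematically unchanged.
    """
    weight_abs_max = linear_weight.abs().amax(dim=0)   # per input channel
    scale = (act_abs_max.clamp(min=eps) ** alpha) / (weight_abs_max.clamp(min=eps) ** (1 - alpha))
    smoothed_weight = linear_weight * scale.unsqueeze(0)   # W' = W * diag(s)
    return scale, smoothed_weight                          # activations then use X' = X / s
```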
OWQ [582] introduces an efficient outlier-aware weight quantization strategy that aims to minimize an LLM’s memory footprint through low-bit quantization. OWQ prioritizes a small subset of structured weights that are sensitive to quantization and stores them in higher bits while applying highly tuned quantization to the remaining dense weights. QuIP [583] introduces quantization with incoherence process, which consists of two independent stages: (1) an adaptive rounding stage to minimize the pre-defined quadratic proxy objective and (2) an efficient pre- and post-processing stage to ensure weight and Hessian incoherence via multiplication by random orthogonal matrices. OmniQuant [584] introduces an omnidirectionally calibrated quantization technique for LLMs, which consists of two novel components: learnable weight clipping (LWC) and learnable equivalent transformation (LET). LWC modulates the extreme weight values by optimizing the clipping threshold and LET eliminates the activation outliers by shifting the challenge of quantization from activations to weights. Both LWC and LET can be seamlessly integrated into an effective differentiable optimization framework featuring block-wise error minimization for both weight-only and weight-activation quantization. The work of [585] dives deeper into LLM quantization and analyzes the effect of LLM quantization with comprehensive experiments to evaluate current state-of-the-art LLM quantization techniques, which systematically summarizes the effect of LLM quantization, provides recommendations to apply LLM quantization techniques, and points out future directions of LLM quantization. In contrast to these LLM quantization works that focus on efficient LLM quantization algorithms, OliVe [586] presents an algorithm/architecture co-designed solution to explore efficient quantized LLMs, which features an outlier–victim pair (OVP) quantization scheme and handles outlier values locally with low hardware overheads and high performance gains. This enables an efficient hardware-aligned OVP encoding scheme, which can be integrated into existing hardware accelerators (e.g., systolic arrays and tensor cores) towards more efficient quantized LLMs for generative inference.

6.3.3 Efficient LLM Distillation.

Another promising direction is to leverage the pre-trained knowledge from large LLMs to enhance the training or fine-tuning process of small LLMs, which can allow small LLMs to maintain as strong a performance as large LLMs while exhibiting superior efficiency. As discussed in [50], recent LLM distillation methods can be divided into two main categories, black-box LLM distillation and white-box LLM distillation, as follows:
(1)
Black-Box LLM Distillation. In the context of black-box distillation, the teacher LLM’s parameters are not available for the student LLM and the student LLM can only see the final output from the teacher LLM. Black-box distillation typically features those commercial LLMs (e.g., GPT-3 [48] and GPT-4 [49]) as the teacher and leverages the predictions from the teacher to further enhance the training or fine-tuning process of small student LLMs [61, 587, 588, 589]. For example, Self-Instruct [61] first generates a large number of instructions, input, and output sequences from GPT-3 using its application programming interfaces (APIs), after which Self-Instruct filters the invalid or similar ones before using them to fine-tune the original GPT-3 model. Finally, Self-Instruct can achieve an absolute improvement of 33% over the original GPT-3 model on Super-NaturalInstructions. Similar to Self-Instruct, [587] uses GPT-4 to first generate rich instruction-following data pairs and then uses the generated data pairs to fine-tune small LLaMA models to improve their performance.
(2)
White-Box LLM Distillation. In the context of white-box distillation, the teacher LLM’s parameters are available to the student LLM, which can also access the hidden intermediate outputs of the teacher LLM. More recently, with the emergence of open-source LLMs, white-box distillation has become more popular and more valuable for the LLM community since the student LLM can potentially benefit from the hidden states of the teacher LLM towards better distillation performance [62, 590, 591, 592]. MiniLLM [62] first replaces the forward Kullback-Leibler divergence (KLD) objective with reverse KLD, which can prevent the student LLM from overestimating the low-probability regions of the teacher distribution. Next, MiniLLM introduces an effective optimization approach to learn the above reverse KLD objective, which can enhance the student LLM to generate high-quality responses. TED [590] presents an effective task-aware layer-wise distillation strategy, which features task-aware filters to align the hidden states of teacher and student at each layer. The above filters select the knowledge from the hidden states that is useful for target tasks, which can further reduce the knowledge gap between teacher and student LLMs. GKD [591] proposes to train the student LLM on its self-generated output sequences along with the feedback from the teacher LLM on such self-generated sequences. GKD also offers the flexibility to employ alternative loss functions between teacher and student, which can enhance the distillation performance of the student even when the student lacks the expressivity to mimic the teacher’s distribution. More recently, [592] introduces token-scaled logit distillation for quantization-aware training of LLMs, which can effectively mitigate overfitting and also largely enhance learning from both the teacher predictions and the ground-truth labels.
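To make the white-box setting more concrete, the short PyTorch sketch below computes a reverse-KLD distillation loss between student and teacher logits, in the spirit of MiniLLM; it is a minimal sketch rather than the authors' implementation, and the vocabulary size, temperature, and toy batch are arbitrary placeholders.

import torch
import torch.nn.functional as F

def reverse_kld_loss(student_logits, teacher_logits, temperature=1.0):
    # KL(student || teacher) = sum_v q(v) * (log q(v) - log p(v)),
    # averaged over token positions.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_p = s_logp.exp()
    return (s_p * (s_logp - t_logp)).sum(dim=-1).mean()

# Toy usage: 4 token positions over a 32k-entry vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = reverse_kld_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())

Unlike the forward KLD used in standard distillation, this objective penalizes the student for placing probability mass where the teacher places very little, which is the mode-seeking behavior that MiniLLM exploits.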

6.4 Efficient LLM Systems

In parallel to the rapid development of efficient LLM algorithms, a plethora of efficient LLM systems and infrastructures have recently emerged [63, 64, 65, 593, 594, 595, 596, 597, 598] that further optimize the generative inference efficiency of LLMs from the perspective of efficient system-level implementations. FlexGen [63] features an efficient high-throughput generation engine for running LLMs on one single GPU with limited memory, which can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, FlexGen also searches for efficient patterns to store and access tensors. Tabi [65] features an inference system with an efficient multi-level inference engine, which can serve queries using small models and optional LLMs for demanding applications. Tabi is particularly optimized for discriminative models (not generative LLMs) in a serving framework, which uses the calibrated confidence score to determine whether to directly return the accurate results of small models or further re-route them to LLMs. DeepSpeed [593] presents a comprehensive system solution for efficient transformer inference, which consists of (1) a multi-GPU inference engine to minimize the runtime latency while maximizing the runtime throughput of both dense and sparse transformers when they fit into aggregate GPU memory and (2) a heterogeneous inference engine that leverages CPU and NVMe memory in addition to GPU memory and computation to enable high inference throughput with large models that do not fit into aggregate GPU memory. FastServe [594] presents a distributed inference serving system for efficient LLM inference that (1) exploits the auto-regressive pattern of LLM inference to enable preemption at the granularity of each output token and (2) explores preemptive scheduling to minimize job completion time with a novel skip-join multi-level feedback queue scheduler. Petals [64] features an efficient collaborative system for runtime inference and fine-tuning of LLMs through joining the resources of multiple parties. In contrast to concurrent LLM inference engines, Petals also natively exposes hidden states of the served model, allowing training and sharing of custom model extensions based on efficient fine-tuning schemes.
S³ [595] demonstrates that designing an inference system with prior knowledge of the output sequence can largely increase the runtime inference throughput of LLMs. To this end, S³ proposes to (1) predict the output sequence length, (2) schedule generation queries based on the prediction to increase runtime resource utilization and throughput, and (3) handle mispredictions when they occur. Thanks to this prior knowledge of the output sequence, S³ achieves much better inference throughput than earlier LLM systems. Splitwise [596] proposes to split the two phases of a typical LLM inference workload onto different hardware, which allows each phase to run on well-suited hardware and to be provisioned with independent computational resources, thereby improving runtime resource utilization across different hardware. Splitwise also optimizes the state transfer across different hardware using the fast back-plane interconnects in today’s GPU clusters to further increase runtime LLM inference throughput. DistServe [597] proposes to disaggregate the prefill and decoding computation to enhance the runtime serving performance of LLMs, which assigns the prefill and decoding workloads to different GPUs and, thus, largely eliminates the prefill-decoding interference towards better runtime inference throughput. DistServe also optimizes the two phases according to the serving cluster’s bandwidth to minimize the communication overheads caused by the disaggregation. Liger [598] features an efficient distributed collaborative inference system for LLMs, which can achieve low inference latency at high throughput on multiple GPUs. To achieve high parallelism and throughput, Liger also introduces an efficient scheduling strategy that effectively schedules the computation and communication kernels of different input requests onto multiple streams across multiple GPUs.
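The scheduling intuition behind S³ can be illustrated with a small, framework-agnostic Python sketch that batches generation requests by predicted output length so that short and long generations do not share a batch; the length predictor, batch size, and Request fields below are hypothetical placeholders rather than components of the actual system.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Request:
    prompt: str
    predicted_len: int = 0  # filled in by the length predictor

def schedule_by_predicted_length(requests: List[Request],
                                 predict_len: Callable[[str], int],
                                 max_batch_size: int = 8) -> List[List[Request]]:
    # Batching similar-length generations reduces the idle time caused by
    # short requests waiting for the longest request in their batch.
    for r in requests:
        r.predicted_len = predict_len(r.prompt)
    ordered = sorted(requests, key=lambda r: r.predicted_len)
    return [ordered[i:i + max_batch_size]
            for i in range(0, len(ordered), max_batch_size)]

# Toy usage with a naive proxy predictor (prompt word count as a stand-in).
reqs = [Request(p) for p in ["hi", "summarize this long article please", "translate: bonjour"]]
for batch in schedule_by_predicted_length(reqs, predict_len=lambda p: len(p.split())):
    print([r.predicted_len for r in batch])

A real serving system would also need the misprediction handling that S³ describes, e.g., evicting and re-queuing requests that exceed their predicted length.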

6.5 Visions for the Future

In this section, we envision several promising future trends and possible directions in the field of efficient LLMs, as follows.
(1)
AutoML for Efficient LLMs. Recent state-of-the-art efficient LLMs are typically built upon manual heuristics, which, despite their efficacy, often require considerable human expertise and engineering efforts. In light of this, one promising future direction is to automatically explore efficient LLMs using automated machine learning (AutoML) techniques [137]. For example, given an efficient LLM, we can leverage AutoML techniques to automatically search for its tailored efficient system implementation towards the optimal on-device inference speedup. Similarly, we can also leverage AutoML techniques to automatically search for its tailored pruning or quantization strategy towards the optimal accuracy–efficiency trade-off. This has the potential to largely push forward the frontier of efficient LLM designs.
(2)
Alternative Structures for Efficient LLMs. Recent state-of-the-art LLMs heavily rely on the self-attention mechanism in transformers [90], which, however, suffers from quadratic time and memory complexity and greatly slows down the pre-training, inference, and fine-tuning stages of LLMs [599]. To tackle this dilemma, several alternative structures have recently emerged (e.g., RWKV [601], Mamba [602], and RetNet [603]), which exhibit optimized computational efficiency and allow researchers to perform efficient language modeling tasks without transformers. For example, RetNet [603] introduces the recurrent representation to enable low-cost inference, which improves the decoding throughput, runtime latency, and GPU memory without sacrificing the language modeling performance. In light of this, one promising future direction is to explore more efficient alternative structures for LLMs, which may deliver considerable efficiency gains over existing transformer-based LLMs without sacrificing the language modeling performance.
(3)
Hardware-Aware Benchmarks for Efficient LLMs. Recent state-of-the-art efficient LLMs are typically optimized in terms of the number of parameters or FLOPs. However, these theoretical complexity metrics cannot accurately reflect the runtime performance on target hardware (e.g., latency and energy). This makes it challenging to fairly compare different efficient LLMs in terms of their runtime inference efficiency on target hardware. One promising future direction is to design hardware-aware benchmarks for efficient LLMs, which may include different hardware performance metrics (e.g., latency and energy) across different hardware systems.
(4)
Infrastructures for Efficient LLMs. Recently, there have been a large number of works on efficient LLM compression, including LLM pruning and LLM quantization. However, they often require specialized hardware accelerators and, thus, cannot achieve realistic on-device inference speedups on modern embedded computing systems. For example, non-structured LLM pruning can remove the less important weights to explore highly sparse LLMs with aggressive compression ratios. However, the resulting sparse LLMs cannot achieve realistic on-device inference speedups due to the irregular network sparsity [57]. Another recent work [586] has also explored accelerating quantized LLMs and achieved promising performance. However, these are far from enough for real-world large-scale deployments. One promising future direction is to design specialized software and hardware infrastructures to further optimize LLMs for efficient on-device inference.

7 Deep Learning Frameworks for Embedded Computing Systems

In the past few years, DNNs have been achieving tremendous success in a myriad of real-world intelligent embedded computing scenarios, such as on-device speech recognition [604, 605], object detection and tracking [606, 607], and autonomous vehicles [608, 609]. A series of customized software programs [66, 67, 610, 611, 612, 613, 614, 615] and hardware frameworks [68, 69, 70, 616, 617, 618] have also been developed to facilitate the deployment of DNNs on embedded computing systems. Therefore, we discuss recent popular deep learning software and hardware frameworks here that bring deep learning to embedded computing systems to embrace ubiquitous embedded intelligence.

7.1 Deep Learning Software Frameworks

In this section, we introduce popular deep learning software frameworks that have been widely used to develop deep learning solutions for embedded computing systems, including TensorFlow [66], PyTorch [67], Caffe [610], MXNet [611], Keras [612], CoreML [613], PaddlePaddle [614], and BigDL [615]. These frameworks are summarized in Table 3.
Table 3.
Software | Created by | Year | Programming Languages | Computation Graph
TensorFlow [66] | Google | 2015 | Python, C++, Java, and JavaScript | Static and Dynamic
PyTorch [67] | Facebook (now Meta) | 2016 | Python and C++ | Dynamic
Caffe [610] | Berkeley | 2014 | C++ | Static
MXNet [611] | Amazon | 2015 | Python, C++, R, Java, Julia, JavaScript, Scala, Go, and Perl | Static and Dynamic
Keras [612] | Personal | 2015 | Python | Static and Dynamic
CoreML [613] | Apple | 2017 | Python, Swift, and Objective-C | Static
PaddlePaddle [614] | Baidu | 2016 | Python and C++ | Static and Dynamic
BigDL [615] | Intel | 2017 | Python and Scala | Dynamic
Table 3. Deep Learning Software Frameworks Discussed in Section 7.1
Note that we refer to the framework under active maintenance if there are new releases within the previous 6 months.
TensorFlow [66] is an open-source deep learning software framework developed by Google, which was released in 2015 and has since become one of the most popular deep learning software frameworks for training and deploying DNNs. TensorFlow, especially TensorFlow Lite, allows developers to easily build and deploy DNNs in a wide range of embedded computing systems, including mobile phones, microcontroller units (MCUs), Raspberry Pi, TPUs, and edge GPUs. TensorFlow also supports various real-world applications, ranging from image and speech recognition to NLP and predictive analytics. With its flexible architecture and vast pre-trained models, TensorFlow has been deemed to be one of the most important tools for researchers and developers in the field of deep learning.
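As a small illustration of this deployment path, the sketch below converts a toy Keras model into a TensorFlow Lite flatbuffer with default optimizations; the model architecture and output file name are arbitrary placeholders, not a recommended configuration.

import tensorflow as tf

# A toy Keras model standing in for a real embedded vision network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Convert to a TensorFlow Lite flatbuffer with default size/latency optimizations.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

The resulting .tflite file can then be executed by the TensorFlow Lite interpreter on mobile phones, MCUs, or edge accelerators.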
PyTorch [67] is an open-source deep learning software framework that is widely used for training and deploying deep neural networks. It was developed by Facebook (now known as Meta) and was released in 2016. One of its key features is the dynamic computation graph, which allows developers to change the computation behavior of DNNs on the fly. This feature distinguishes PyTorch from earlier deep learning software frameworks (e.g., early versions of TensorFlow) that primarily relied on static computation graphs. In addition, PyTorch has a number of high-level features that make it easier to build more complex DNNs. For example, the TorchVision package provides various useful tools and pre-trained models for image and video processing. With its dynamic computation graph, ease of use, and range of high-level features, PyTorch has become an essential software framework for training and deploying DNNs in the deep learning community.
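The dynamic computation graph can be illustrated with a short sketch in which ordinary Python control flow changes the forward pass on a per-input basis; the module below is a toy example, not a realistic workload.

import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    # A toy module whose depth depends on the input at runtime.
    def __init__(self, dim: int = 16):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The graph is rebuilt on every call, so the number of applied
        # layers can depend on the data itself.
        steps = 1 if x.norm() < 1.0 else 3
        for _ in range(steps):
            x = torch.relu(self.layer(x))
        return x

net = DynamicDepthNet()
print(net(torch.zeros(1, 16)).shape)      # takes the 1-step branch
print(net(torch.ones(1, 16) * 5).shape)   # takes the 3-step branch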
Caffe [610] is a popular deep learning software framework developed by Berkeley and released in 2014, which has gained increasing popularity owing to its speed, modularity, and ease of use. One of its key features is its ability to deal with large datasets containing millions of images. Another important feature is its modularity, which allows developers to add or remove components with ease. Caffe also includes a large library comprising hundreds of pre-trained models that can be used to quickly build deep learning applications, such as image classification, object detection, and segmentation. On top of these capabilities, Caffe offers a user-friendly interface that allows developers to train and deploy DNNs without extensive knowledge of deep learning. As a result, Caffe has become an invaluable tool for researchers and developers and has inspired subsequent deep learning software frameworks.
MXNet [611] is an open-source deep learning software framework to train and deploy DNNs, which was developed by Amazon and released in 2015. One of its key technical merits is its distributed training capability, which allows training of DNNs across multiple computation nodes in a computationally efficient manner. MXNet also supports multiple programming languages, such as Python, C++, R, and Julia, which increases its accessibility to researchers and developers with diverse skill levels. MXNet’s integration with big data processing frameworks and tools, such as Apache Spark and Apache Flink, is another important feature that facilitates the integration of deep learning into existing data processing pipelines. Thanks to its scalability, flexibility, and efficiency, MXNet has become a popular option for developers and researchers in the deep learning community.
Keras [612] is an open-source deep learning software framework written in Python, which provides high-level APIs for building and training efficient DNN solutions. It was developed by François Chollet in 2015 and is now maintained by a community of developers. Keras has been integrated into TensorFlow; starting from TensorFlow 2.0, Keras has become the default API to build DNN solutions in TensorFlow. One of the key features of Keras is its modularity, which allows developers and researchers to easily construct and customize DNNs. Keras also allows users to productize DNN solutions on mobile platforms such as iOS and Android, on the web, or on the Java virtual machine. Last but not least, Keras allows training of DNN solutions in an efficient distributed manner on clusters of multiple GPUs and TPUs. These strengths make Keras increasingly popular in both industry and academia.
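As a brief sketch of how little code distributed Keras training requires, the example below wraps model construction in a tf.distribute.MirroredStrategy scope, which replicates the model across the visible GPUs and falls back to the CPU when none are present; the layer sizes here are arbitrary.

import tensorflow as tf
from tensorflow import keras

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Everything created inside the scope is mirrored across devices.
    model = keras.Sequential([
        keras.Input(shape=(28, 28)),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
model.summary()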
CoreML [613] is a deep learning software framework developed by Apple in 2017, which aims to integrate DNNs into Apple commercial products, such as the iPhone, iPad, and Apple Watch. In addition to supporting extensive DNNs with over 30 layer types, CoreML covers standard machine learning models, such as tree ensembles, support vector machines, and generalized linear models. Another key feature of CoreML is its ability to directly run DNNs on the device without the need for cloud-based inference. CoreML also provides a range of optimization techniques, such as quantization and pruning, to reduce the complexity of DNNs. This is particularly important for modern mobile devices, which typically have limited storage and computational resources. Also, CoreML, built on top of advanced technologies such as Metal and Accelerate, seamlessly takes advantage of CPUs and GPUs to provide the maximum inference performance at runtime. These technical features make CoreML the first choice for developing efficient DNN solutions on Apple commercial products.
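As a rough illustration of this workflow, the sketch below converts a traced PyTorch model into a Core ML model with the coremltools package; the exact conversion arguments and output format can vary across coremltools versions, so treat this as an assumed recipe rather than a verified one.

import torch
import torchvision
import coremltools as ct

# Trace a small torchvision model so the converter sees a static graph.
model = torchvision.models.mobilenet_v2().eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# Convert the traced model into an ML Program package for on-device use.
mlmodel = ct.convert(traced,
                     inputs=[ct.TensorType(shape=(1, 3, 224, 224))],
                     convert_to="mlprogram")
mlmodel.save("mobilenet_v2.mlpackage")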
PaddlePaddle [614], also known as Paddle, is an open-source deep learning software framework developed by Baidu, which was released in 2016 to benefit the deep learning community. PaddlePaddle is designed to be an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end tools, and service platforms. PaddlePaddle originated from industrial practice and remains committed to industrial deployment; it has been adopted by a wide range of sectors, including manufacturing, agriculture, and enterprise services. These industrial strengths have motivated an increasing number of developers and researchers to adopt PaddlePaddle to commercialize AI.
BigDL [615] is an open-source deep learning software framework that runs on top of Apache Spark. It was developed by Intel and released in 2017. The goal of BigDL is to provide a high-performance, scalable, and easy-to-use platform, especially for distributed deep learning. To this end, BigDL includes a comprehensive set of features that cover various deep learning applications, including image classification, object detection, and NLP. One of the key features of BigDL is its ability to take full advantage of distributed computing resources, such as CPU, GPU, and FPGA clusters, to accelerate the training of DNNs. BigDL is seamlessly integrated with Apache Spark, which enables users to leverage the distributed computing capability of Spark for data preprocessing and postprocessing. This integration also makes it possible to build end-to-end deep learning pipelines that span from data ingestion to model deployment.

7.2 Deep Learning Hardware Frameworks

In this section, we introduce popular embedded hardware platforms that are designed to run powerful DNNs in embedded scenarios without cloud-based assistance, including NVIDIA Jetson [69], Intel Neural Compute Stick [70], Google Edge TPU [68], Google Coral Dev Board [616], Huawei HiKey 970 [617], and Orange Pi AI Stick Lite [618]. These frameworks are summarized in Table 4.
Table 4.
Hardware | RAM | Storage | Power | Performance | Price | Supported Deep Learning Software
NVIDIA Jetson TX2 [69] | 8 GB LPDDR4 | 32 GB eMMC 5.1 | 7.5–15 W | 1.33 TFLOPS | $399 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson Nano [69] | 4 GB LPDDR4 | 16 GB eMMC 5.1 | 5–10 W | 0.472 TFLOPS | $99 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson AGX Xavier [69] | 32 GB LPDDR4x | 32 GB eMMC 5.1 | 10–30 W | 32 TOPS | $1,099 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson Xavier NX [69] | 8 GB LPDDR4x | 16 GB eMMC 5.1 | 10–20 W | 21 TOPS | $399 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson AGX Orin [69] | 32 GB LPDDR5 | 64 GB eMMC 5.1 | 15–40 W | 275 TOPS | $1,999 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson Orin NX [69] | 16 GB LPDDR5 | 32 GB eMMC 5.1 | 10–25 W | 100 TOPS | $599 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
Intel Neural Compute Stick [70] | N/A | N/A | 0.5–1.5 W | 0.1 TFLOPS | $79 | TensorFlow, Caffe, and MXNet
Google Edge TPU [68] | N/A | N/A | 2 W | 4 TOPS | $75 | TensorFlow and TensorFlow Lite
Google Coral Dev Board [616] | 1 GB LPDDR4 | 8 GB eMMC 5.1 | 1–6 W | 4 TOPS | $149 | TensorFlow and TensorFlow Lite
Huawei HiKey 970 [617] | 6 GB LPDDR4 | 64 GB UFS 2.1 | 6–12 W | 1.88 TOPS | $299 | TensorFlow, PyTorch, and Caffe
Orange Pi AI Stick Lite [618] | N/A | N/A | 1–2 W | 4 TOPS | $69 | TensorFlow, PyTorch, and Caffe
Table 4. Deep Learning Hardware Frameworks Discussed in Section 7.2
Note that the price here refers to the initial price during the product launch, which is subject to fluctuations over time.
NVIDIA Jetson [69] is a series of embedded systems-on-modules (SoMs) designed by NVIDIA for running advanced deep learning workloads, especially the inference of DNNs. NVIDIA Jetson consists of NVIDIA Jetson TX2, NVIDIA Jetson Nano, NVIDIA Jetson AGX Xavier, NVIDIA Jetson Xavier NX, NVIDIA Jetson AGX Orin, and NVIDIA Jetson Orin NX. To accelerate deep learning workloads, NVIDIA Jetson runs on top of NVIDIA’s CUDA parallel computing architectures and features an integrated system-on-chip (SoC) with a powerful NVIDIA GPU, a multi-core CPU, and various high-speed interfaces, including Ethernet, USB, HDMI, and CSI/DSI. More importantly, NVIDIA Jetson is compatible with various deep learning software frameworks, including TensorFlow, PyTorch, Caffe, Keras, and MXNet. Thanks to its advanced architectural design and powerful interfaces, NVIDIA Jetson is able to support a wide range of embedded deep learning applications to accommodate different resource and performance requirements.
Intel Neural Compute Stick [70] is a small, low-power, and cost-effective embedded hardware designed to run deep learning workloads without cloud-based assistance, which was developed by Movidius (now acquired by Intel). The Neural Compute Stick (NCS) is a small USB device that can be connected to a host computer or embedded computing system. NCS features the Myriad 2 Vision Processing Unit (VPU), which is optimized for the inference of DNNs. In addition, NCS is integrated with various high-speed interfaces, including USB 3.0 and Wi-Fi. Developers can use the Intel Movidius SDK, which provides a set of tools for developing, testing, and deploying DNNs on NCS. Furthermore, NCS supports various deep learning software frameworks, including TensorFlow, Caffe, and MXNet. Thanks to its significant flexibility and cost efficiency, NCS makes it possible to deploy advanced DNNs in a wide range of embedded scenarios.
Google Edge TPU [68] is a custom-built ASIC chip to accelerate deep learning workloads on resource-constrained edge computing systems. The Google Edge TPU is designed to seamlessly work together with TensorFlow Lite, a lightweight version of TensorFlow, and is optimized for the inference of DNNs towards enhanced inference efficiency. Google Edge TPU itself cannot work alone and, similar to the Intel Neural Compute Stick, it must be connected to other embedded computing systems, such as Raspberry Pi 4 and the Google Coral Dev Board, to deliver deep learning solutions. The Google Edge TPU is capable of performing up to two trillion floating-point operations per second (TFLOPS) and four trillion operations per second (TOPS) using only two watts of power. This further allows us to build and deploy powerful deep learning solutions on embedded computing systems with limited computational resources. Thanks to its easy integration with other embedded computing systems, powerful performance, and significant efficiency, the Google Edge TPU has gained increasing popularity in the deep learning community for deploying deep learning solutions on embedded computing systems.
Google Coral Dev Board [616] is a single-board computer designed for building embedded deep learning applications. It features an on-board Google Edge TPU, which is a custom-built chip to run TensorFlow Lite models with high performance and low power consumption. The Coral Dev Board has various built-in interfaces, including Audio, Wi-Fi, Bluetooth, Ethernet, and USB 3.0, which enable it to be connected to other embedded computing systems. The Coral Dev Board is also integrated with 1 GB LPDDR4 RAM, 8 GB eMMC 5.1 flash memory, and a MicroSD slot for additional storage. The Coral Dev Board also comes with pre-installed software tools, including TensorFlow Lite, Google Edge TPU API, and various sample applications. These allow users to easily and quickly start building their deep learning solutions. Thanks to its powerful Google Edge TPU, various useful interfaces, and software tools, the Coral Dev Board has become increasingly popular for developing and deploying embedded deep learning solutions.
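As an illustration of how such a board is typically programmed, the sketch below runs an Edge TPU-compiled TensorFlow Lite model through the tflite_runtime interpreter with the Edge TPU delegate; the model path and the libedgetpu.so.1 delegate name follow Coral's Linux convention and are assumptions that depend on the target platform.

import numpy as np
import tflite_runtime.interpreter as tflite

# Load a model compiled for the Edge TPU and attach the Edge TPU delegate.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)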
Huawei HiKey 970 [617] is a high-performance, single-board embedded computer designed by Huawei. The HiKey 970 features a powerful neural processing unit (NPU) to accelerate various deep learning workloads. The HiKey 970 is also integrated with 6 GB LPDDR4 RAM and 64 GB UFS 2.1 flash memory, while at the same time allowing MicroSD extension for additional storage. In addition, the HiKey 970 supports various high-speed interfaces, including Ethernet, USB 3.0, and PCIe 3.0. Furthermore, the HiKey 970 is compatible with popular deep learning software frameworks, such as TensorFlow, PyTorch, and Caffe, allowing users to easily and quickly build on-board deep learning solutions. Thanks to its powerful NPU, rich memory and storage, and extensive connectivity options, the HiKey 970 is suitable for a wide range of intelligent embedded applications, such as robotics, autonomous vehicles, and smart home devices.
Orange Pi AI Stick Lite [618] is a tiny and cost-effective USB stick designed for small to medium-sized deep learning workloads. It is equipped with a single-core Cortex-A7 processor and an NPU that provides hardware acceleration. Note that, similar to the Intel Neural Compute Stick and Google Edge TPU, the Orange Pi AI Stick Lite cannot work alone; it must be connected to a host device using the on-device USB 3.0 interface. The Orange Pi AI Stick Lite supports various deep learning software frameworks, including TensorFlow, PyTorch, and Caffe. Thanks to its cost efficiency, the Orange Pi AI Stick Lite is suitable for embedded computing systems to deal with small to medium-sized deep learning workloads.

7.3 Visions for the Future

In this section, we envision the future trends and possible directions of deep learning software and hardware infrastructures, summarized as follows.
(1)
Integration with Emerging Technologies. In the future, we should consider developing deep learning software and hardware that can be seamlessly integrated with emerging technologies. For example, quantum computing [619, 620, 621] has the potential to deliver significant speedup and computational capability, which can accelerate the training and inference of DNNs. Therefore, it is of paramount importance to explore the potential of integrating deep learning software and hardware with emerging technologies to unlock new possibilities and new advances in various real-world scenarios.
(2)
Democratization of Deep Learning. The democratization of deep learning [622] has emerged as a prominent trend in the deep learning era, with the explicit goal of making deep learning software and hardware more accessible to a wider range of developers and researchers. Therefore, in order to alleviate the technical barrier to entry for building efficient embedded deep learning solutions, we should continue to develop more user-friendly deep learning software and hardware frameworks, democratizing the benefits and advances of deep learning in real-world embedded scenarios.
(3)
Development of Specialized Hardware. Conventional embedded computing systems typically focus on optimizing and accelerating the training and inference of traditional convolutional networks, neglecting recent advances in the deep learning era. Vision Transformer (ViT) [107] is the most representative example, which has opened up a new direction and has been challenging the dominant role of traditional convolutional networks in various real-world vision applications, such as image classification [107, 108, 109], object detection [110, 111, 112], semantic segmentation [113, 114, 115, 116], and video analysis [117, 118, 119]. In order to unleash the promise of ViT and its variants, we should also develop specialized embedded computing systems to accelerate the ViT family rather than only focusing on accelerating complicated convolutional networks.
(4)
Development of More Powerful Hardware. As seen in recent advanced DNNs [623, 624, 625], network complexity has continued to explode. As a result, it continues to enlarge the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems. In parallel, LLMs such as ChatGPT [49] have been achieving impressive success in various NLP tasks, such as language generation, language translation, and question answering, at the cost of pushing network complexity to another unseen level, significantly enlarging the computational gap. These trends demonstrate the necessity of innovating more powerful yet cost-effective embedded computing systems to further bridge the aforementioned computational gap, especially from the hardware perspective.
(5)
Development of Infrastructures for On-Device Training. The convention in the deep learning community has been to (1) first train DNNs on powerful GPUs or the cloud and (2) then deploy the pre-trained DNNs on local embedded computing systems for inference at runtime. Compared with this convention, the emerging paradigm of on-device training enables pre-trained DNNs to adapt to new data collected from local sensors or by the users [45]. As such, users can benefit from customized DNNs without having to transfer the collected data to the cloud, thereby significantly protecting data privacy and security [45]. Nonetheless, conventional embedded computing systems are typically optimized for inference and do not support efficient on-device training owing to the memory bottleneck during training [44, 45]. This motivates us to further develop efficient infrastructures, including specialized deep learning software and hardware, to effectively accommodate future on-device training demands.

8 Deep Learning Applications for Embedded Computing Systems

In the previous sections, we have extensively discussed recent advances towards ubiquitous embedded intelligence from various perspectives of efficient deep learning networks, algorithms, software, and hardware. In this section, we further elaborate on recent popular intelligent deep learning applications in real-world embedded scenarios, spanning from vision to NLP tasks. Note that these intelligent embedded applications rely heavily on the efficient deep networks and efficient deep learning algorithms that have been extensively discussed in the previous sections.

8.1 Computer Vision Applications

Computer vision focuses on interpreting and understanding visual information from real-world environments, such as images and videos, spanning from image classification [1] to downstream vision tasks, such as object detection [4], tracking [5], and segmentation [6]. Below we discuss recent popular intelligent embedded vision applications.
Image Classification. Image classification, also referred to as image recognition, is the most fundamental vision task, which focuses on recognizing the input image based on its visual information [1]. It underpins various intelligent applications in real-world embedded computing systems, such as mobile phones and IoT sensors, enabling these systems to automatically recognize objects, scenes, or patterns within a given image [627]. For example, face recognition [628], person re-identification [503], and hand gesture recognition [629] have been widely integrated into mainstream embedded computing systems, such as mobile phones, ATMs, and intelligent cameras, for the purpose of identity authentication. In practice, image classification typically features deep convolutional networks, such as VGGNet [1], ResNet [2], and DenseNet [3], thanks to their strong capabilities to capture rich visual information, especially on large-scale datasets such as ImageNet [80]. For example, as shown in Figure 30, AlexNet [76], the first of its kind, demonstrates the possibility of leveraging convolutional layers to learn discriminative features from vision inputs and exhibits significantly better recognition performance on ImageNet than previous well-established non-convolutional approaches, such as MLPs and other learning-based techniques. ResNet [2] investigates the degradation problem of very deep convolutional networks and introduces a simple yet effective deep residual learning paradigm, which allows us to significantly increase the network depth for stronger learning capabilities and also marks the booming development of the deep learning era. As a result, ResNet, for the first time, achieves better recognition performance on ImageNet than humans thanks to its significant network depth, as shown in Figure 30.
Fig. 30. Milestones of early convolutional networks [626].
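To connect this discussion with the efficiency concerns raised throughout this survey, the short PyTorch sketch below times the forward pass of a standard ResNet-50 on whatever device is available, which is the kind of measurement typically performed before deploying such a classifier on an embedded platform; the batch size and repetition counts are arbitrary.

import time
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().to(device).eval()
x = torch.randn(1, 3, 224, 224, device=device)

with torch.no_grad():
    for _ in range(3):           # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / 10

print(f"average forward latency on {device}: {elapsed * 1000:.1f} ms")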
Downstream Vision Applications. Downstream vision applications refer to practical, task-specific uses in which the outputs of fundamental vision tasks, such as image classification, are applied to real-world challenges. Popular downstream vision applications in practice include but are not limited to object detection [4], object tracking [5], object segmentation [6], image super-resolution [277, 278], image restoration [630], pose estimation [631], image captioning [632, 633], augmented reality (AR) and virtual reality (VR) [634], and video-related analysis [635]. For example, the work of [630] features memory-oriented structured pruning to optimize on-device memory consumption during runtime image restoration, which can accommodate the limited memory and storage budgets in real-world embedded scenarios. These downstream vision applications have become ubiquitous in real-world embedded scenarios and serve as important components towards ubiquitous embedded intelligence. For example, object detection and tracking have been widely used in recent autonomous vehicles [636] to detect other vehicles and in surveillance systems [637] to detect suspicious persons or activities. These downstream vision applications have also been widely applied to other real-world embedded scenarios, such as smart cities and intelligent health care [638]. To further facilitate the development of intelligent applications, several powerful tools have been proposed recently. For example, Precog [639] introduces an efficient object detection infrastructure to enable real-time object detection on resource-constrained embedded computing systems, such as Raspberry Pi, which also features YOLOv3 [640] to achieve superior on-device object detection accuracy.
From CNNs to Vision Transformers. More recently, vision transformers (ViTs) [107] and their variants have demonstrated surprisingly strong performance in various vision tasks, including, but not limited to, image classification [107, 108, 109], object detection [110, 111, 112], semantic segmentation [113, 114, 115, 116], and video analysis [117, 118, 119], which continue to push forward the state-of-the-art performance over their convolutional counterparts across various vision tasks. Specifically, [107], featuring the very first vision transformer, proposes to divide the input image into a series of smaller image patches (e.g., of 8×8, 16×16, or 32×32 pixels), each of which is then fed into the transformer-based encoder to learn discriminative features. The learned discriminative features are further aggregated and fed into the classification layer to make predictions, as shown in Figure 5. However, despite their strong performance across various vision tasks, ViTs and their variants often exhibit inferior on-device efficiency [139] since they are typically more difficult to parallelize on resource-constrained embedded computing systems than their convolutional counterparts and, thus, inevitably suffer from considerable resource underutilization, as pointed out in [127]. To overcome such limitations, a plethora of resource-efficient vision transformers have recently flourished. We refer interested readers to Section 2.2 for more details about recent representative resource-efficient vision transformers. We emphasize that significant efforts are still required in order to further alleviate the on-device efficiency bottleneck and unleash the promise of modern vision transformers, which are of paramount importance to bring powerful vision transformers to less capable embedded computing systems towards ubiquitous embedded intelligence.
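The patchification step described above can be written in a few lines of PyTorch: a strided convolution splits the image into non-overlapping patches and linearly embeds each one, which is the standard trick used by most ViT implementations; the patch size and embedding width below are illustrative choices only.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Split an image into non-overlapping patches and embed each patch.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride=patch_size convolution is equivalent to flattening each
        # patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])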

8.2 Natural Language Processing Applications

In parallel to vision tasks, NLP is another representative domain that has been widely deployed in real-world embedded scenarios to process auditory and textual inputs, which has largely revolutionized how embedded computing systems interact with users and their surroundings [641]. Embedded computing systems, ranging from traditional IoT devices to wearable and autonomous systems, are transitioning from simple responsive designs to more proactive and interactive ones, which can comprehend context and also anticipate users’ needs based on their linguistic inputs. To this end, below we introduce several representative NLP applications in real-world embedded scenarios.
(1)
Sentiment Analysis. Recent intelligent embedded computing systems, such as wearable devices and intelligent health care infrastructures, have largely featured sentiment analysis, which can effectively capture users’ psychological status through language interactions [642]. This also allows a more comprehensive understanding of users’ emotional well-being, which paves the way for future holistic health ecosystem solutions [643].
(2)
Automatic Speech Recognition. Automatic speech recognition has gained increasing interest in real-world embedded scenarios, such as autonomous vehicles and smart homes, where it can largely facilitate complicated functions through vocal commands. This reduces manual interactions and enhances safety and user convenience [644, 645].
(3)
Conversational Agents. Conversational agents have been playing an important role in recent intelligent embedded computing systems, such as home automation systems [646] and interactive assistant systems [647]. These intelligent conversational agents can comprehend and interpret users’ commands, preferences, and behavioral patterns towards better intelligent services in subsequent interactions.
(4)
Speech-to-Text/Text-to-Speech Synthesis. The integration of text-to-speech (TTS) [253] and speech-to-text (STT) [648] marks an important milestone in enriching human–computer interactions, especially for wearable systems such as mobile phones and intelligent translation devices. TTS synthesizes digital text into speech to provide auditory feedback to users, and STT does the reverse, both of which are particularly important in hands-free environments.
(5)
Real-Time Translation. The emergence of real-time translation has served as an effective technique to eliminate cross-language barriers. More recently, real-time translation has been widely integrated into real-world embedded scenarios, especially wearable communication devices, which can largely facilitate cross-language interactions [649, 650].
To summarize, the integration of NLP and embedded computing systems is more than a simple technical enhancement. It is an important paradigm shift towards ubiquitous embedded intelligence. It can enable real-world embedded computing systems to understand and interpret not only short commands but also longer contexts and conversations, which can ensure seamless and enriched interfaces between humans and embedded computing systems.

8.3 Visions for the Future

In this section, we envision some future trends and possible directions of intelligent embedded applications, which are summarized as follows:
(1)
LLM-Enabled Embedded Applications. LLMs, starting from GPT-3 [48], have attracted considerable interest from both academia and industry, thanks to their surprisingly strong performance across various language tasks. ChatGPT [92], one of the most representative LLM-enabled applications, has achieved promising performance across diverse domains of knowledge. Nonetheless, modern LLMs, despite their promise, require a huge amount of computational resources for both training and inference, making it challenging to deploy powerful LLMs on resource-constrained embedded computing systems. Therefore, modern LLMs can typically only be deployed on remote GPU servers and provide remote services to local users through network connectivity. This, however, is often less convenient and also involves data security/privacy concerns. To overcome such limitations, a plethora of works have been proposed recently to compress computation-intensive LLMs towards better on-device inference efficiency. For example, SmoothQuant [59] and AWQ [60] pioneered quantizing powerful LLMs from higher bits to lower bits in order to reduce their prohibitive computational complexity, making it possible to run powerful LLMs on resource-constrained embedded computing systems. These are also important milestones in bringing LLMs to real-world embedded computing systems towards ubiquitous embedded intelligence.
(2)
Multi-modal Embedded Applications. Modern embedded applications largely focus on one single modality, either from the perspective of vision or language processing. Nonetheless, recent embedded computing systems typically feature various advanced sensors, which can simultaneously collect rich data from multiple modalities, including, but not limited to, visual, auditory, and tactile information. The most important benefit of multi-modal embedded applications is their ability to provide a comprehensive understanding of real-world dynamic environments using complementary information collected from different modalities. This also has the potential to significantly boost the attainable accuracy on the target task and greatly improve reliability in real-world dynamic environments. For example, visual information can be easily augmented with other modalities, such as radar and lidar, which can be jointly leveraged to deliver better and safer driving experiences in autonomous vehicles [651]. However, despite the promising benefits, the development of multi-modal embedded applications is also challenging. On the one hand, the real-time synchronization of diverse data modalities may require significant computational resources. On the other hand, multi-modal embedded applications introduce additional complexity for data alignment, calibration, and fusion, which may also require more advanced software algorithms to ensure real-time processing.

9 Conclusion

In this survey, we focus on summarizing recent efficient deep learning infrastructures for embedded computing systems towards ubiquitous embedded intelligence, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. To this end, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. We also envision promising future directions and trends to enable more efficient and ubiquitous embedded intelligence. We believe this survey can shed light on future research and allow researchers to quickly and smoothly get started in this emerging field.

Footnotes

1
In this work, we may interchangeably use some technical terms, such as deep learning models, machine learning models, DL models, ML models, deep neural networks (DNNs), and convolutional neural networks (CNNs).
2
We do not include FasterNet [84] in Figure 3 for comparisons since FasterNet does not optimize the number of FLOPs.
3
MetaQNN [158] is another seminal NAS work in parallel to [137], both of which feature reinforcement learning as the search engine to automate the design of top-performing DNNs with competitive accuracy on the target task.
4
We mainly discuss latency prediction since latency is the most dominant performance constraint in hardware-aware NAS [176, 212], which can be generalized to predict other performance constraints, such as energy and memory consumption.
5
Most of the covered zero-cost proxies are available at https://github.com/automl/naslib/tree/zerocost
6
Filter pruning is another name for channel pruning since removing filters is technically equivalent to removing channels [383]. In this work, we use channel pruning by default and may interchangeably use filter pruning and channel pruning.
7
We interchangeably use network distillation and knowledge distillation to refer to the distillation-based training process.

References

[1]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[2]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[3]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[4]
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer, 21–37.
[5]
Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Luka Cehovin, Gustavo Fernandez, Tomas Vojir, Gustav Hager, Georg Nebehay, and Roman Pflugfelder. 2015. The Visual Object Tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1–23.
[6]
Yin Li, Xiaodi Hou, Christof Koch, James M. Rehg, and Alan L. Yuille. 2014. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 280–287.
[7]
Dong Yu and Li Deng. 2016. Automatic Speech Recognition. Vol. 1. Springer.
[8]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[9]
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. arXiv preprint arXiv:1808.07036 (2018).
[10]
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1492–1500.
[11]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[12]
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).
[13]
Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: Running deep neural networks on Android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 1–27.
[14]
Ke Tan, Xueliang Zhang, and DeLiang Wang. 2021. Deep learning based real-time speech enhancement for dual-microphone mobile phones. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 1853–1863.
[15]
Branislav Kisačanin. 2017. Deep learning for autonomous vehicles. In 2017 IEEE 47th International Symposium on Multiple-Valued Logic (ISMVL’17). IEEE, 142–142.
[16]
Jamil Fayyad, Mohammad A. Jaradat, Dominique Gruyer, and Homayoun Najjaran. 2020. Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors 20, 15 (2020), 4220.
[17]
Beau Norgeot, Benjamin S. Glicksberg, and Atul J. Butte. 2019. A call for deep-learning healthcare. Nature Medicine 25, 1 (2019), 14–15.
[18]
Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. 2019. A guide to deep learning in healthcare. Nature Medicine 25, 1 (2019), 24–29.
[19]
Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, et al. 2019. Machine learning at Facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA’19). IEEE, 331–344.
[20]
Di Liu, Hao Kong, Xiangzhong Luo, Weichen Liu, and Ravi Subramaniam. 2022. Bringing AI to edge: From deep learning’s perspective. Neurocomputing 485 (2022), 297–320.
[21]
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. Pruning filters for efficient ConvNets. In International Conference on Learning Representations.
[22]
Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. 2018. Soft filter pruning for accelerating deep convolutional neural networks. In International Joint Conference on Artificial Intelligence.
[23]
Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4340–4349.
[24]
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. Advances in Neural Information Processing Systems 28 (2015).
[25]
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks. Advances in Neural Information Processing Systems 29 (2016).
[26]
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV. Springer, 525–542.
[27]
Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? Advances in Neural Information Processing Systems 27 (2014).
[28]
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
[29]
Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems 28 (2015).
[30]
Artur Jordao, Maiko Lie, and William Robson Schwartz. 2020. Discriminative layer pruning for convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing 14, 4 (2020), 828–837.
[31]
Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7132–7141.
[32]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[33]
Pavlo Molchanov, Jimmy Hall, Hongxu Yin, Jan Kautz, Nicolo Fusi, and Arash Vahdat. 2022. LANA: Latency aware network acceleration. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 137–156.
[34]
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV’18). 116–131.
[35]
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6848–6856.
[36]
Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. 2020. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1580–1589.
[37]
Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Chao Xu, and Yunhe Wang. 2022. GhostNetV2: Enhance cheap operation with long-range attention. arXiv preprint arXiv:2211.12905 (2022).
[38]
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2820–2828.
[39]
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10734–10742.
[40]
Han Cai, Ligeng Zhu, and Song Han. 2019. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations.
[41]
Colin White, Mahmoud Safari, Rhea Sukthanker, Binxin Ru, Thomas Elsken, Arber Zela, Debadeepta Dey, and Frank Hutter. 2023. Neural architecture search: Insights from 1000 papers. arXiv preprint arXiv:2301.08727 (2023).
[42]
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2020. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations.
[43]
Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smail Niar, Martin Wistuba, and Naigang Wang. 2021. A comprehensive survey on hardware-aware neural architecture search. arXiv preprint arXiv:2101.09336 (2021).
[44]
Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. 2020. Tinytl: Reduce memory, not parameters for efficient on-device learning. Advances in Neural Information Processing Systems 33 (2020), 11285–11297.
[45]
Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. 2022. On-device training under 256kb memory. Advances in Neural Information Processing Systems (2022).
[46]
Gido M. Van de Ven and Andreas S. Tolias. 2019. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734 (2019).
[47]
Xinchi Qiu, Javier Fernandez-Marques, Pedro P. B. Gusmao, Yan Gao, Titouan Parcollet, and Nicholas Donald Lane. 2022. ZeroFL: Efficient on-device training for federated learning with local sparsity. International Conference on Learning Representations (2022).
[48]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[49]
OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[50]
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, et al. 2024. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv preprint arXiv:2401.00625 (2024).
[51]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
[52]
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2023).
[53]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626.
[54]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
[55]
Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023).
[56]
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023).
[57]
Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems 36 (2023), 21702–21720.
[58]
Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695 (2023).
[59]
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning. PMLR, 38087–38099.
[60]
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978 (2023).
[61]
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560 (2022).
[62]
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023. MiniLLM: Knowledge distillation of large language models. In 12th International Conference on Learning Representations.
[63]
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning. PMLR, 31094–31116.
[64]
Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. 2022. Petals: Collaborative inference and fine-tuning of large models. arXiv preprint arXiv:2209.01188 (2022).
[65]
Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the 18th European Conference on Computer Systems. 233–248.
[66]
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from https://www.tensorflow.org/.
[67]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).
[68]
Google. [n. d.]. Google Edge TPU. Retrieved from https://cloud.google.com/edge-tpu/.
[70]
Intel. [n. d.]. Intel Movidius Neural Compute Stick. Retrieved from https://movidius.github.io/ncsdk/ncs.html.
[71]
Gaurav Menghani. 2023. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. Comput. Surveys 55, 12 (2023), 1–37.
[72]
Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).
[73]
Tejalal Choudhary, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. 2020. A comprehensive survey on model compression and acceleration. Artificial Intelligence Review 53 (2020), 5113–5155.
[74]
Zhuo Li, Hengyi Li, and Lin Meng. 2023. Model compression for deep neural networks: A survey. Computers 12, 3 (2023), 60.
[75]
Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chunjing Xu, Enhua Wu, and Qi Tian. 2022. GhostNets on heterogeneous devices via cheap operations. International Journal of Computer Vision 130, 4 (2022), 1050–1069.
[76]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
[77]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[78]
Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning. PMLR, 6105–6114.
[79]
Mingxing Tan and Quoc Le. 2021. EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning. PMLR, 10096–10106.
[80]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[81]
Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
[82]
Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, and Song Han. 2022. Enable deep learning on mobile devices: Methods, systems, and applications. ACM Transactions on Design Automation of Electronic Systems (TODAES) 27, 3 (2022), 1–50.
[83]
Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
[84]
Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, and S-H Gary Chan. 2023. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12021–12031.
[85]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[86]
Daquan Zhou, Qibin Hou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. 2020. Rethinking bottleneck structure for efficient mobile network design. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer, 680–697.
[87]
Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q. Weinberger. 2018. CondenseNet: An efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2752–2761.
[88]
Le Yang, Haojun Jiang, Ruojin Cai, Yulin Wang, Shiji Song, Gao Huang, and Qi Tian. 2021. CondenseNet V2: Sparse feature reactivation for deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3569–3578.
[89]
Song Han, Huizi Mao, and William J. Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations.
[90]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[91]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT (2019).
[92]
OpenAI. 2022. ChatGPT: A Variant of GPT by OpenAI. Retrieved from https://openai.com/.
[93]
Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-aware transformers for efficient natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7675–7688.
[94]
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020. 4163–4174.
[95]
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2158–2170.
[96]
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[97]
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020).
[98]
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020).
[99]
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, and Matt J. Kusner. 2024. No train no gain: Revisiting efficient training algorithms for transformer-based language models. Advances in Neural Information Processing Systems 36 (2024).
[100]
Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, and Yoon Kim. 2023. Learning to grow pretrained models for efficient transformer training. arXiv preprint arXiv:2303.00980 (2023).
[101]
Malte Ostendorff and Georg Rehm. 2023. Efficient language model training through cross-lingual and progressive transfer learning. arXiv preprint arXiv:2301.09626 (2023).
[102]
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5 (2023).
[103]
Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M. Dai, Yifeng Lu, et al. 2023. Brainformers: Trading simplicity for efficiency. In International Conference on Machine Learning. PMLR, 42531–42542.
[104]
Zhen-Ru Zhang, Chuanqi Tan, Haiyang Xu, Chengyu Wang, Jun Huang, and Songfang Huang. 2023. Towards adaptive prefix tuning for parameter-efficient language model fine-tuning. arXiv preprint arXiv:2305.15212 (2023).
[105]
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. 2023. LLaMA-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023).
[106]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 213–229.
[107]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[108]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
[109]
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. 2022. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12009–12019.
[110]
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. 2022. Exploring plain vision transformer backbones for object detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX. Springer, 280–296.
[111]
Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, and Wenyu Liu. 2021. You only look at one sequence: Rethinking transformer in vision through object detection. Advances in Neural Information Processing Systems 34 (2021), 26183–26197.
[112]
Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. 2022. Simple open-vocabulary object detection with vision transformers. arXiv preprint arXiv:2205.06230 (2022).
[113]
Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. 2021. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7262–7272.
[114]
Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. 2021. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
[115]
Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, and David Z. Pan. 2022. Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12094–12103.
[116]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. 2023. Segment anything. arXiv preprint arXiv:2304.02643 (2023).
[117]
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3202–3211.
[118]
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6836–6846.
[119]
Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. 2021. Video transformer network. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3163–3172.
[120]
Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. 2021. LeViT: A vision transformer in ConvNet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12259–12269.
[121]
Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. 2022. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5270–5279.
[122]
Sachin Mehta and Mohammad Rastegari. 2022. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In International Conference on Learning Representations.
[123]
Sachin Mehta and Mohammad Rastegari. 2022. Separable self-attention for mobile vision transformers. arXiv preprint arXiv:2206.02680 (2022).
[124]
Shakti N. Wadekar and Abhishek Chaurasia. 2022. MobileViTv3: Mobile-friendly vision transformer with simple and effective fusion of local, global and input features. arXiv preprint arXiv:2209.15159 (2022).
[125]
Han Cai, Chuang Gan, and Song Han. 2022. EfficientViT: Enhanced linear attention for high-resolution low-computation visual recognition. arXiv preprint arXiv:2205.14756 (2022).
[126]
Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, and Brais Martinez. 2022. EdgeViTs: Competing light-weight CNNs on mobile devices with vision transformers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. Springer, 294–311.
[127]
Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman Khan, Syed Waqas Zamir, Rao Muhammad Anwer, and Fahad Shahbaz Khan. 2023. EdgeNeXt: Efficiently amalgamated CNN-transformer architecture for mobile vision applications. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VII. Springer, 3–20.
[128]
Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, and Yingyan Lin. 2023. Castling-ViT: Compressing self-attention via switching towards linear-angular attention during vision transformer inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[129]
Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. 2023. FastViT: A fast hybrid vision transformer using structural reparameterization. arXiv preprint arXiv:2303.14189 (2023).
[130]
Xiangzhong Luo, Di Liu, Hao Kong, and Weichen Liu. 2020. EdgeNAS: Discovering efficient neural architectures for edge systems. In 2020 IEEE 38th International Conference on Computer Design (ICCD’20). IEEE, 288–295.
[131]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. 2022. You only search once: On lightweight differentiable architecture search for resource-constrained embedded platforms. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 475–480.
[132]
Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. 2021. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11936–11945.
[133]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347–10357.
[134]
Dichao Hu. 2020. An introductory survey on attention mechanisms in NLP problems. In Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 2. Springer, 432–448.
[135]
Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R. Martin, Ming-Ming Cheng, and Shi-Min Hu. 2022. Attention mechanisms in computer vision: A survey. Computational Visual Media 8, 3 (2022), 331–368.
[136]
Plamen Angelov and Eduardo Soares. 2020. Towards explainable deep neural networks (xDNN). Neural Networks 130 (2020), 185–194.
[137]
Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
[138]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In International Conference on Learning Representations.
[139]
Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. 2022. Vision GNN: An image is worth graph of nodes. arXiv preprint arXiv:2206.00272 (2022).
[140]
Anubhav Jangra, Sourajit Mukherjee, Adam Jatowt, Sriparna Saha, and Mohammad Hasanuzzaman. 2021. A survey on multi-modal summarization. Comput. Surveys (2021).
[141]
Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. 2021. How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021).
[142]
Yonggan Fu, Shunyao Zhang, Shang Wu, Cheng Wan, and Yingyan Lin. 2022. Patch-Fool: Are vision transformers always robust against adversarial perturbations? In International Conference on Learning Representations.
[143]
Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. 2019. Adversarial robustness vs. model compression, or both? In Proceedings of the IEEE/CVF International Conference on Computer Vision. 111–120.
[144]
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8697–8710.
[145]
Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. 2020. Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations.
[146]
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. 2020. PC-DARTS: Partial channel connections for memory-efficient architecture search. In International Conference on Learning Representations.
[147]
Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. 2019. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1294–1303.
[148]
Hanwen Liang, Shifeng Zhang, Jiacheng Sun, Xingqiu He, Weiran Huang, Kechen Zhuang, and Zhenguo Li. 2019. DARTS+: Improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035 (2019).
[149]
Xiangxiang Chu, Xiaoxing Wang, Bo Zhang, Shun Lu, Xiaolin Wei, and Junchi Yan. 2021. DARTS-: Robustly stepping out of performance collapse without indicators. In International Conference on Learning Representations.
[150]
Xiangxiang Chu, Tianbao Zhou, Bo Zhang, and Jixiang Li. 2020. Fair DARTS: Eliminating unfair advantages in differentiable architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV. Springer, 465–480.
[151]
Peng Ye, Baopu Li, Yikang Li, Tao Chen, Jiayuan Fan, and Wanli Ouyang. 2022. \(\beta\)-DARTS: Beta-decay regularization for differentiable architecture search. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22). IEEE, 10864–10873.
[152]
Xiangzhong Luo, Di Liu, Shuo Huai, and Weichen Liu. 2021. HSCoNAS: Hardware-software co-design of efficient DNNs via neural architecture search. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE’21). IEEE, 418–421.
[153]
Li Lyna Zhang, Yuqing Yang, Yuhang Jiang, Wenwu Zhu, and Yunxin Liu. 2020. Fast hardware-aware neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 692–693.
[154]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. 2022. SurgeNAS: A comprehensive surgery on hardware-aware differentiable neural architecture search. IEEE Transactions on Computers (2022).
[155]
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65–76.
[156]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. 2022. LightNAS: On lightweight and scalable neural architecture search for embedded platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2022).
[157]
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence. 4780–4789.
[158]
Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. 2016. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016).
[159]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement Learning (1992), 5–32.
[160]
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning. PMLR, 4095–4104.
[161]
Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. 2020. Single path one-shot neural architecture search with uniform sampling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16. Springer, 544–560.
[162]
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. 2019. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1314–1324.
[163]
Gabriel Bender, Hanxiao Liu, Bo Chen, Grace Chu, Shuyang Cheng, Pieter-Jan Kindermans, and Quoc V. Le. 2020. Can weight sharing outperform random architecture search? An investigation with TuNAS. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14323–14332.
[164]
Chi-Hung Hsu, Shu-Huan Chang, Jhao-Hong Liang, Hsin-Ping Chou, Chun-Hao Liu, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Da-Cheng Juan. 2018. MONAS: Multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332 (2018).
[165]
Geoffrey F. Miller, Peter M. Todd, and Shailesh U. Hegde. 1989. Designing neural networks using genetic algorithms. In ICGA, Vol. 89. 379–384.
[166]
Peter J. Angeline, Gregory M. Saunders, and Jordan B. Pollack. 1994. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks 5, 1 (1994), 54–65.
[167]
Dario Floreano, Peter Dürr, and Claudio Mattiussi. 2008. Neuroevolution: From architectures to learning. Evolutionary Intelligence 1 (2008), 47–62.
[168]
Kenneth O. Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation 10, 2 (2002), 99–127.
[169]
Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. 2018. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning. PMLR, 550–559.
[170]
Matej Črepinšek, Shih-Hsi Liu, and Marjan Mernik. 2013. Exploration and exploitation in evolutionary algorithms: A survey. ACM Computing Surveys (CSUR) 45, 3 (2013), 1–33.
[171]
Juan José Domínguez-Jiménez, Antonia Estero-Botaro, Antonio García-Domínguez, and Inmaculada Medina-Bulo. 2011. Evolutionary mutation testing. Information and Software Technology 53, 10 (2011), 1108–1123.
[172]
William M. Spears, et al. 1995. Adapting crossover in evolutionary algorithms. In Evolutionary Programming. 367–384.
[173]
Andrew Brock, Theodore Lim, James Millar Ritchie, and Nicholas J. Weston. 2018. SMASH: One-shot model architecture search through hypernetworks. In 6th International Conference on Learning Representations 2018.
[174]
Xiangxiang Chu, Bo Zhang, and Ruijun Xu. 2021. FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12239–12248.
[175]
Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. 2020. BigNAS: Scaling up neural architecture search with big single-stage models. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16. Springer, 702–717.
[176]
Bingqian Lu, Jianyi Yang, Weiwen Jiang, Yiyu Shi, and Shaolei Ren. 2021. One proxy device is enough for hardware-aware neural architecture search. Proceedings of the ACM on Measurement and Analysis of Computing Systems 5, 3 (2021), 1–34.
[177]
Shan You, Tao Huang, Mingmin Yang, Fei Wang, Chen Qian, and Changshui Zhang. 2020. GreedyNAS: Towards fast one-shot NAS with greedy supernet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1999–2008.
[178]
Xiangzhong Luo, Di Liu, Shuo Huai, Hao Kong, Hui Chen, and Weichen Liu. 2021. Designing efficient DNNs via hardware-aware neural architecture search and beyond. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 6 (2021), 1799–1812.
[179]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Efficient multi-objective neural architecture search via Lamarckian evolution. In International Conference on Learning Representations.
[180]
Guohao Li, Guocheng Qian, Itzel C. Delgadillo, Matthias Muller, Ali Thabet, and Bernard Ghanem. 2020. SGAS: Sequential greedy architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1620–1630.
[181]
Yibo Yang, Shan You, Hongyang Li, Fei Wang, Chen Qian, and Zhouchen Lin. 2021. Towards improving the consistency, efficiency, and flexibility of differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6667–6676.
[182]
Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. 2021. Rethinking architecture selection in differentiable NAS. In International Conference on Learning Representations.
[183]
Xiangning Chen, Ruochen Wang, Minhao Cheng, Xiaocheng Tang, and Cho-Jui Hsieh. 2021. DrNAS: Dirichlet neural architecture search. In International Conference on Learning Representations.
[184]
Kaifeng Bi, Lingxi Xie, Xin Chen, Longhui Wei, and Qi Tian. 2020. GOLD-NAS: Gradual, one-level, differentiable. arXiv preprint arXiv:2007.03331 (2020).
[185]
Pengfei Hou, Ying Jin, and Yukang Chen. 2021. Single-DARTS: Towards stable architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 373–382.
[186]
Xuanyi Dong and Yi Yang. 2019. Searching for a robust neural architecture in four GPU hours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1761–1770.
[187]
Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2019. SNAS: Stochastic neural architecture search. In International Conference on Learning Representations.
[188]
Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, and Jianchao Yang. 2020. AtomNAS: Fine-grained end-to-end neural architecture search. In International Conference on Learning Representations.
[189]
Xuanyi Dong, David Jacob Kedziora, Katarzyna Musial, and Bogdan Gabrys. 2021. Automated deep learning: Neural architecture search is not the end. arXiv preprint arXiv:2112.09245 (2021).
[190]
Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.
[191]
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, and Hongkai Xiong. 2020. Latency-aware differentiable neural architecture search. arXiv preprint arXiv:2001.06392 (2020).
[192]
Guohao Li, Mengmeng Xu, Silvio Giancola, Ali Thabet, and Bernard Ghanem. 2020. LC-NAS: Latency constrained neural architecture search for point cloud networks. arXiv preprint arXiv:2008.10309 (2020).
[193]
Mohammad Loni, Hamid Mousavi, Mohammad Riazati, Masoud Daneshtalab, and Mikael Sjödin. 2022. TAS: Ternarized neural architecture search for resource-constrained edge devices. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE’22). IEEE, 1115–1118.
[194]
Sunghoon Kim, Hyunjeong Kwon, Eunji Kwon, Youngchang Choi, Tae-Hyun Oh, and Seokhyeong Kang. 2021. MDARTS: Multi-objective differentiable neural architecture search. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE’21). IEEE, 1344–1349.
[195]
Yibo Hu, Xiang Wu, and Ran He. 2020. TF-NAS: Rethinking three search freedoms of latency-constrained differentiable neural architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. Springer, 123–139.
[196]
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, et al. 2020. FBNetV2: Differentiable neural architecture search for spatial and channel dimensions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12965–12974.
[197]
Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. 2020. Single-path NAS: Designing hardware-efficient ConvNets in less than 4 hours. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part II. Springer, 481–497.
[198]
Jaeseong Lee, Jungsub Rhim, Duseok Kang, and Soonhoi Ha. 2021. SNAS: Fast hardware-aware neural architecture search methodology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 11 (2021), 4826–4836.
[199]
Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. 2020. Densely connected search space for more flexible neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10628–10637.
[200]
Niv Nayman, Yonathan Aflalo, Asaf Noy, and Lihi Zelnik. 2021. HardCoRe-NAS: Hard constrained differentiable neural architecture search. In International Conference on Machine Learning. PMLR, 7979–7990.
[201]
Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. 2013. Block-coordinate Frank-Wolfe optimization for structural SVMs. In International Conference on Machine Learning. PMLR, 53–61.
[202]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, and Weichen Liu. 2024. Double-Win NAS: Towards deep-to-shallow transformable neural architecture search for intelligent embedded systems. In Proceedings of the 61st ACM/IEEE Design Automation Conference. 1–6.
[203]
Qian Jiang, Xiaofan Zhang, Deming Chen, Minh N. Do, and Raymond A. Yeh. 2021. EH-DNAS: End-to-end hardware-aware differentiable neural architecture search. arXiv preprint arXiv:2111.12299 (2021).
[204]
Javier García López, Antonio Agudo, and Francesc Moreno-Noguer. 2021. E-DNAS: Differentiable neural architecture search for embedded systems. In 2020 25th International Conference on Pattern Recognition (ICPR’21). IEEE, 4704–4711.
[205]
Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. 2019. Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142 (2019).
[206]
Yiyang Zhao, Linnan Wang, Yuandong Tian, Rodrigo Fonseca, and Tian Guo. 2021. Few-shot neural architecture search. In International Conference on Machine Learning. PMLR, 12707–12718.
[207]
Shoukang Hu, Ruochen Wang, Lanqing Hong, Zhenguo Li, Cho-Jui Hsieh, and Jiashi Feng. 2022. Generalizing few-shot NAS with gradient matching. arXiv preprint arXiv:2203.15207 (2022).
[208]
Dongkuan D. K. Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed Awadallah, and Jianfeng Gao. 2022. Few-shot task-agnostic neural architecture search for distilling large language models. Advances in Neural Information Processing Systems 35 (2022), 28644–28656.
[209]
Timotée Ly-Manson, Mathieu Léonardon, and Abdeldjalil Aissa El Bey. 2023. Understanding few-shot neural architecture search with zero-cost proxies. https://gretsi.fr/data/colloque/pdf/2023_lymanson1237.pdf (2023).
[210]
Xiu Su, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, and Chang Xu. 2021. K-shot NAS: Learnable weight-sharing for NAS with k-shot supernets. In International Conference on Machine Learning. PMLR, 9880–9890.
[211]
Zixuan Zhou, Xuefei Ning, Yi Cai, Jiashu Han, Yiping Deng, Yuhan Dong, Huazhong Yang, and Yu Wang. 2022. CLOSE: Curriculum learning on the sharing extent towards better one-shot NAS. In European Conference on Computer Vision. Springer, 578–594.
[212]
Kevin Alexander Laube, Maximus Mutschler, and Andreas Zell. 2022. What to expect of hardware metric predictors in NAS. In International Conference on Automated Machine Learning. PMLR, 13–1.
[213]
Lukasz Dudziak, Thomas Chau, Mohamed Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas Lane. 2020. BRP-NAS: Prediction-based NAS using GCN. Advances in Neural Information Processing Systems 33 (2020), 10480–10490.
[214]
Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue Wang, Cong Hao, and Yingyan Lin. 2021. HW-NAS-Bench: Hardware-aware neural architecture search benchmark. In International Conference on Learning Representations.
[215]
Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. 2021. Hardware-adaptive efficient latency prediction for NAS via meta-learning. Advances in Neural Information Processing Systems 34 (2021), 27016–27028.
[216]
Saeejith Nair, Saad Abbasi, Alexander Wong, and Mohammad Javad Shafiee. 2022. MAPLE-Edge: A runtime latency predictor for edge devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3660–3668.
[217]
Shuo Huai, Hao Kong, Shiqing Li, Xiangzhong Luo, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. EvoLP: Self-evolving latency predictor for model compression in real-time edge systems. IEEE Embedded Systems Letters (2023).
[218]
Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. 2020. Neural predictor for neural architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX. Springer, 660–676.
[219]
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2020. Accuracy prediction with non-neural model for neural architecture search. arXiv preprint arXiv:2007.04785 (2020).
[220]
Colin White, Arber Zela, Robin Ru, Yang Liu, and Frank Hutter. 2021. How powerful are performance predictors in neural architecture search? Advances in Neural Information Processing Systems 34 (2021), 28454–28469.
[221]
Bert Moons, Parham Noorzad, Andrii Skliar, Giovanni Mariani, Dushyant Mehta, Chris Lott, and Tijmen Blankevoort. 2021. Distilling optimal neural networks: Rapid search in diverse spaces. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12229–12238.
[222]
Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. 2020. A generic graph-based neural architecture encoding scheme for predictor-based NAS. In European Conference on Computer Vision. Springer, 189–204.
[223]
Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. 2019. NAS-Bench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning. PMLR, 7105–7114.
[224]
Xuanyi Dong and Yi Yang. 2020. NAS-Bench-201: Extending the scope of reproducible neural architecture search. In International Conference on Learning Representations.
[225]
Nikita Klyuchnikov, Ilya Trofimov, Ekaterina Artemova, Mikhail Salnikov, Maxim Fedorov, Alexander Filippov, and Evgeny Burnaev. 2022. NAS-Bench-NLP: Neural architecture search benchmark for natural language processing. IEEE Access 10 (2022), 45736–45747.
[226]
Xuefei Ning, Yin Zheng, Zixuan Zhou, Tianchen Zhao, Huazhong Yang, and Yu Wang. 2022. A generic graph-based neural architecture encoding scheme with multifaceted information. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[227]
Xuefei Ning, Zixuan Zhou, Junbo Zhao, Tianchen Zhao, Yiping Deng, Changcheng Tang, Shuang Liang, Huazhong Yang, and Yu Wang. 2022. TA-GATES: An encoding scheme for neural network architectures. Advances in Neural Information Processing Systems 35 (2022), 32325–32339.
[228]
Huan Xiong, Lei Huang, Mengyang Yu, Li Liu, Fan Zhu, and Ling Shao. 2020. On the number of linear regions of convolutional neural networks. In International Conference on Machine Learning. PMLR, 10514–10523.
[229]
Lechao Xiao, Jeffrey Pennington, and Samuel Schoenholz. 2020. Disentangling trainability and generalization in deep neural networks. In International Conference on Machine Learning. PMLR, 10462–10472.
[230]
Wuyang Chen, Xinyu Gong, and Zhangyang Wang. 2021. Neural architecture search on ImageNet in four GPU hours: A theoretically inspired perspective. In International Conference on Learning Representations.
[231]
Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In 24th International Joint Conference on Artificial Intelligence.
[232]
Robin Ru, Clare Lyle, Lisa Schut, Miroslav Fil, Mark van der Wilk, and Yarin Gal. 2021. Speedy performance estimation for neural architecture search. Advances in Neural Information Processing Systems 34 (2021), 4079–4092.
[233]
Shen Yan, Colin White, Yash Savani, and Frank Hutter. 2021. NAS-Bench-x11 and the power of learning curves. Advances in Neural Information Processing Systems 34 (2021), 22534–22549.
[234]
Dan Zhao, Nathan C. Frey, Vijay Gadepally, and Siddharth Samsi. 2022. Loss curve approximations for fast neural architecture ranking & training elasticity estimation. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’22). IEEE, 715–723.
[235]
Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. 2017. Learning curve prediction with Bayesian neural networks. In International Conference on Learning Representations.
[236]
Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. 2017. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823 (2017).
[237]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. 2022. Work-in-progress: What to expect of early training statistics? An investigation on hardware-aware neural architecture search. In 2022 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS’22). IEEE, 1–2.
[238]
Mohamed S. Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D. Lane. 2021. Zero-cost proxies for lightweight NAS. arXiv preprint arXiv:2101.08134 (2021).
[239]
Arjun Krishnakumar, Colin White, Arber Zela, Renbo Tu, Mahmoud Safari, and Frank Hutter. 2022. NAS-Bench-Suite-Zero: Accelerating research on zero cost proxies. arXiv preprint arXiv:2210.03230 (2022).
[240]
Vasco Lopes, Saeid Alirezazadeh, and Luís A. Alexandre. 2021. EPE-NAS: Efficient performance estimation without training for neural architecture search. In Artificial Neural Networks and Machine Learning–ICANN 2021: 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part V. Springer, 552–563.
[241]
Jack Turner, Elliot J. Crowley, Michael O’Boyle, Amos Storkey, and Gavin Gray. 2019. BlockSwap: Fisher-guided block substitution for network compression on a budget. arXiv preprint arXiv:1906.04113 (2019).
[242]
Chaoqi Wang, Guodong Zhang, and Roger Grosse. 2020. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376 (2020).
[243]
Joe Mellor, Jack Turner, Amos Storkey, and Elliot J. Crowley. 2021. Neural architecture search without training. In International Conference on Machine Learning. PMLR, 7588–7598.
[244]
Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2018. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340 (2018).
[245]
Hidenori Tanaka, Daniel Kunin, Daniel L. Yamins, and Surya Ganguli. 2020. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems 33 (2020), 6377–6389.
[246]
Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, and Rong Jin. 2021. Zen-NAS: A zero-shot NAS for high-performance image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 347–356.
[247]
Yash Akhauri, Juan Munoz, Nilesh Jain, and Ravishankar Iyer. 2022. EZNAS: Evolving zero-cost proxies for neural architecture scoring. Advances in Neural Information Processing Systems 35 (2022), 30459–30470.
[248]
Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. 2021. AutoFormer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12270–12280.
[249]
Jiahui Gao, Hang Xu, Han Shi, Xiaozhe Ren, Philip L. H. Yu, Xiaodan Liang, Xin Jiang, and Zhenguo Li. 2022. AutoBERT-Zero: Evolving BERT backbone from scratch. In Proceedings of the AAAI Conference on Artificial Intelligence. 10663–10671.
[250]
David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. 2021. Primer: Searching for efficient transformers for language modeling. arXiv preprint arXiv:2109.08668 (2021).
[251]
Yichun Yin, Cheng Chen, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2021. AutoTinyBERT: Automatic hyper-parameter optimization for efficient pre-trained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 5146–5157.
[252]
Jin Xu, Xu Tan, Renqian Luo, Kaitao Song, Jian Li, Tao Qin, and Tie-Yan Liu. 2021. NAS-BERT: Task-agnostic and adaptive-size BERT compression with neural architecture search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1933–1943.
[253]
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong Chen, and Tie-Yan Liu. 2021. LightSpeech: Lightweight and fast text to speech with neural architecture search. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’21). IEEE, 5699–5703.
[254]
Jihwan Kim, Jisung Wang, Sangki Kim, and Yeha Lee. 2020. Evolved speech-transformer: Applying neural architecture search to end-to-end automatic speech recognition. In INTERSPEECH. 1788–1792.
[255]
Charles Jin, Phitchaya Mangpo Phothilimthana, and Sudip Roy. 2022. \(\alpha\)NAS: Neural architecture search using property guided synthesis. arXiv preprint arXiv:2205.03960 (2022).
[256]
Boyu Chen, Peixia Li, Chuming Li, Baopu Li, Lei Bai, Chen Lin, Ming Sun, Junjie Yan, and Wanli Ouyang. 2021. GLiT: Neural architecture search for global and local image transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12–21.
[257]
Chengyue Gong, Dilin Wang, Meng Li, Xinlei Chen, Zhicheng Yan, Yuandong Tian, Vikas Chandra, et al. 2021. NASViT: Neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In International Conference on Learning Representations.
[258]
Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Jian Chen, Peilin Zhao, and Junzhou Huang. 2019. NAT: Neural architecture transformer for accurate and compact architectures. Advances in Neural Information Processing Systems 32 (2019).
[259]
Mingyu Ding, Xiaochen Lian, Linjie Yang, Peng Wang, Xiaojie Jin, Zhiwu Lu, and Ping Luo. 2021. HR-NAS: Searching efficient high-resolution neural architectures with lightweight transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2982–2992.
[260]
Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, and Chang Xu. 2022. ViTAS: Vision transformer architecture search. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI. Springer, 139–157.
[261]
Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. 2021. NATS-bench: Benchmarking NAS algorithms for architecture topology and size. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 7 (2021), 3634–3646.
[262]
Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. 2020. NAS-Bench-301 and the case for surrogate benchmarks for neural architecture search. arXiv preprint arXiv:2008.09777 (2020).
[263]
Renbo Tu, Nicholas Roberts, Misha Khodak, Junhong Shen, Frederic Sala, and Ameet Talwalkar. 2022. NAS-Bench-360: Benchmarking neural architecture search on diverse tasks. Advances in Neural Information Processing Systems 35 (2022), 12380–12394.
[264]
Arber Zela, Julien Siems, and Frank Hutter. 2020. NAS-Bench-1shot1: Benchmarking and dissecting one-shot neural architecture search. In International Conference on Learning Representations.
[265]
Abhinav Mehrotra, Alberto Gil C. P. Ramos, Sourav Bhattacharya, Łukasz Dudziak, Ravichander Vipperla, Thomas Chau, Mohamed S. Abdelfattah, Samin Ishtiaq, and Nicholas Donald Lane. 2021. NAS-Bench-ASR: Reproducible neural architecture search for speech recognition. In International Conference on Learning Representations.
[266]
Yijian Qin, Ziwei Zhang, Xin Wang, Zeyang Zhang, and Wenwu Zhu. 2022. NAS-Bench-Graph: Benchmarking Graph Neural Architecture Search. In 36th Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[267]
Yash Mehta, Colin White, Arber Zela, Arjun Krishnakumar, Guri Zabergja, Shakiba Moradian, Mahmoud Safari, Kaicheng Yu, and Frank Hutter. 2022. NAS-Bench-Suite: NAS evaluation is (now) surprisingly easy. In International Conference on Learning Representations.
[268]
Yunyang Xiong, Hanxiao Liu, Suyog Gupta, Berkin Akin, Gabriel Bender, Yongzhe Wang, Pieter-Jan Kindermans, Mingxing Tan, Vikas Singh, and Bo Chen. 2021. MobileDets: Searching for object detection architectures for mobile accelerators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3825–3834.
[269]
Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, Chunhua Shen, and Yanning Zhang. 2020. NAS-FCOS: Fast neural architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11943–11951.
[270]
Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. 2019. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7036–7045.
[271]
Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei. 2019. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 82–92.
[272]
Albert Shaw, Daniel Hunter, Forrest Iandola, and Sammy Sidhu. 2019. SqueezeNAS: Fast neural architecture search for faster semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 1–11.
[273]
Xiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Lei Wang, and Wenqi Ren. 2021. DCNAS: Densely connected neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13956–13967.
[274]
Chenxi Liu, Zhaoqi Leng, Pei Sun, Shuyang Cheng, Charles R. Qi, Yin Zhou, Mingxing Tan, and Dragomir Anguelov. 2022. LidarNAS: Unifying and Searching Neural Architectures for 3D Point Clouds. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI. Springer, 158–175.
[275]
Zhijian Liu, Haotian Tang, Shengyu Zhao, Kevin Shao, and Song Han. 2021. PVNAS: 3D neural architecture search with point-voxel convolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 11 (2021), 8552–8568.
[276]
Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. 2020. Searching efficient 3D architectures with sparse point-voxel convolution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII. Springer, 685–702.
[277]
Shaoli Liu, Chengjian Zheng, Kaidi Lu, Si Gao, Ning Wang, Bofei Wang, Diankai Zhang, Xiaofeng Zhang, and Tianyu Xu. 2021. EVSRNet: Efficient video super-resolution with neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2480–2485.
[278]
Yushu Wu, Yifan Gong, Pu Zhao, Yanyu Li, Zheng Zhan, Wei Niu, Hao Tang, Minghai Qin, Bin Ren, and Yanzhi Wang. 2022. Compiler-aware neural architecture search for on-mobile real-time super-resolution. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIX. Springer, 92–111.
[279]
Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. 2020. NAS evaluation is frustratingly hard. In International Conference on Learning Representations.
[280]
Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. 2019. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 113–123.
[281]
Liam Li and Ameet Talwalkar. 2020. Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence. PMLR, 367–377.
[282]
Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. 2019. Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1284–1293.
[283]
Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10428–10436.
[284]
Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhaohui Yang, Han Wu, Xinghao Chen, and Chang Xu. 2020. Hit-detector: Hierarchical trinity architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11405–11414.
[285]
Ryuichiro Hataya, Jan Zdenek, Kazuki Yoshizoe, and Hideki Nakayama. 2020. Faster AutoAugment: Learning augmentation strategies using backpropagation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16. Springer, 1–16.
[286]
Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2017. Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017).
[287]
Yucong Zhou, Zezhou Zhu, and Zhao Zhong. 2021. Learning specialized activation functions with the Piecewise Linear Unit. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12095–12104.
[288]
Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, et al. 2021. FBNetV3: Joint architecture-recipe search using predictor pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16276–16285.
[289]
Xuanyi Dong, Mingxing Tan, Adams Wei Yu, Daiyi Peng, Bogdan Gabrys, and Quoc V. Le. 2020. AutoHAS: Efficient hyperparameter and architecture search. arXiv preprint arXiv:2006.03656 (2020).
[290]
Bichen Wu, Chaojian Li, Hang Zhang, Xiaoliang Dai, Peizhao Zhang, Matthew Yu, Jialiang Wang, Yingyan Lin, and Peter Vajda. 2021. FBNetV5: Neural architecture search for multiple tasks in one run. arXiv preprint arXiv:2111.10007 (2021).
[291]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13. Springer, 740–755.
[292]
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2017. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 633–641.
[293]
Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. 2018. Slimmable neural networks. arXiv preprint arXiv:1812.08928 (2018).
[294]
Jiahui Yu and Thomas S. Huang. 2019. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1803–1811.
[295]
Changlin Li, Guangrun Wang, Bing Wang, Xiaodan Liang, Zhihui Li, and Xiaojun Chang. 2021. Dynamic slimmable network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8607–8617.
[296]
Changlin Li, Tao Tang, Guangrun Wang, Jiefeng Peng, Bing Wang, Xiaodan Liang, and Xiaojun Chang. 2021. BossNAS: Exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12281–12291.
[297]
Lotfi Abdelkrim Mecharbat, Hadjer Benmeziane, Hamza Ouranoughi, and Smail Niar. 2023. HyT-NAS: Hybrid transformers neural architecture search for edge devices. arXiv preprint arXiv:2303.04440 (2023).
[298]
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning. PMLR, 1126–1135.
[299]
Albert Shaw, Wei Wei, Weiyang Liu, Le Song, and Bo Dai. 2019. Meta architecture search. Advances in Neural Information Processing Systems 32 (2019).
[300]
Jiaxing Wang, Jiaxiang Wu, Haoli Bai, and Jian Cheng. 2020. M-NAS: Meta neural architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence. 6186–6193.
[301]
Hayeon Lee, Eunyoung Hyung, and Sung Ju Hwang. 2021. Rapid neural architecture search by learning to generate graphs from datasets. arXiv preprint arXiv:2107.00860 (2021).
[302]
Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. 2020. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 108, 4 (2020), 485–532.
[303]
Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, and Song Han. 2020. APQ: Joint search for network architecture, pruning and quantization policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2078–2087.
[304]
Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
[305]
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2019. Rethinking the value of network pruning. In International Conference on Learning Representations.
[306]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News 44, 3 (2016), 243–254.
[307]
Yann LeCun, John Denker, and Sara Solla. 1989. Optimal brain damage. Advances in Neural Information Processing Systems 2 (1989).
[308]
Babak Hassibi, David G. Stork, and Gregory J. Wolff. 1993. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks. IEEE, 293–299.
[309]
Suraj Srinivas and R. Venkatesh Babu. 2015. Data-free parameter pruning for deep neural networks. British Machine Vision Conference (2015).
[310]
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning. PMLR, 2498–2507.
[311]
Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through \(L\_0\) regularization. In International Conference on Learning Representations.
[312]
Yi Guo, Huan Yuan, Jianchao Tan, Zhangyang Wang, Sen Yang, and Ji Liu. 2021. GDP: Stabilized neural network pruning via gates with differentiable polarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5239–5250.
[313]
Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019).
[314]
Alex Renda, Jonathan Frankle, and Michael Carbin. 2020. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations.
[315]
Babak Hassibi and David Stork. 1992. Second order derivatives for network pruning: Optimal brain surgeon. Advances in Neural Information Processing Systems 5 (1992).
[316]
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations.
[317]
Andries P. Engelbrecht. 2001. A new pruning heuristic based on variance analysis of sensitivity information. IEEE Transactions on Neural Networks 12, 6 (2001), 1386–1399.
[318]
Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2016. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2016), 127–138.
[319]
Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (2019), 292–308.
[320]
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al. 2017. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 75–84.
[321]
Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
[322]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27–40.
[323]
Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan. 2021. GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21). IEEE, 1110–1123.
[324]
Jie-Fang Zhang, Ching-En Lee, Chester Liu, Yakun Sophia Shao, Stephen W. Keckler, and Zhengya Zhang. 2020. SNAP: An efficient sparse neural acceleration processor for unstructured sparse deep neural network inference. IEEE Journal of Solid-State Circuits 56, 2 (2020), 636–647.
[325]
Sumanth Gudaparthi, Sarabjeet Singh, Surya Narayanan, Rajeev Balasubramonian, and Visvesh Sathe. 2022. CANDLES: Channel-aware novel dataflow-microarchitecture co-design for low energy sparse neural network acceleration. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA’22). IEEE, 876–891.
[326]
Xiaolong Ma, Fu-Ming Guo, Wei Niu, Xue Lin, Jian Tang, Kaisheng Ma, Bin Ren, and Yanzhi Wang. 2020. PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. In Proceedings of the AAAI Conference on Artificial Intelligence. 5117–5124.
[327]
Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. 2020. PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In Proceedings of the Twenty-25 International Conference on Architectural Support for Programming Languages and Operating Systems. 907–922.
[328]
Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations.
[329]
Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu. 2017. Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 138–145.
[330]
Utku Evci, Fabian Pedregosa, Aidan Gomez, and Erich Elsen. 2019. The difficulty of training sparse neural networks. arXiv preprint arXiv:1906.10732 (2019).
[331]
Ajay Kumar Jaiswal, Haoyu Ma, Tianlong Chen, Ying Ding, and Zhangyang Wang. 2022. Training your sparse neural network better with any mask. In International Conference on Machine Learning. PMLR, 9833–9844.
[332]
Yi-Lin Sung, Varun Nair, and Colin A. Raffel. 2021. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems 34 (2021), 24193–24205.
[333]
Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. 2020. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning. PMLR, 6682–6691.
[334]
Zeru Zhang, Jiayin Jin, Zijie Zhang, Yang Zhou, Xin Zhao, Jiaxiang Ren, Ji Liu, Lingfei Wu, Ruoming Jin, and Dejing Dou. 2021. Validating the lottery ticket hypothesis with inertial manifold theory. Advances in Neural Information Processing Systems 34 (2021), 30196–30210.
[335]
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2019. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611 (2019).
[336]
Tianlong Chen, Yongduo Sui, Xuxi Chen, Aston Zhang, and Zhangyang Wang. 2021. A unified lottery ticket hypothesis for graph neural networks. In International Conference on Machine Learning. PMLR, 1695–1706.
[337]
Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, Ruokai Yin, and Priyadarshini Panda. 2022. Exploring lottery ticket hypothesis in spiking neural networks. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 102–120.
[338]
Sanmitra Banerjee, Mahdi Nikdast, Sudeep Pasricha, and Krishnendu Chakrabarty. 2022. Pruning coherent integrated photonic neural networks using the lottery ticket hypothesis. In 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI’22). IEEE, 128–133.
[339]
Yuxin Zhang, Mingbao Lin, Zhihang Lin, Yiting Luo, Ke Li, Fei Chao, Yongjian Wu, and Rongrong Ji. 2022. Learning best combination for efficient N:M sparsity. Advances in Neural Information Processing Systems 35 (2022), 941–953.
[340]
Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. 2021. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378 (2021).
[341]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. \(\lbrace\)TVM\(\rbrace\): An automated \(\lbrace\)End-to-End\(\rbrace\) optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 578–594.
[342]
Connor Holmes, Minjia Zhang, Yuxiong He, and Bo Wu. 2021. NxMTransformer: Semi-structured sparsification for natural language understanding via ADMM. Advances in Neural Information Processing Systems 34 (2021), 1818–1830.
[343]
Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. 2021. Learning N:M fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010 (2021).
[344]
Jeff Pool and Chong Yu. 2021. Channel permutations for N:M sparsity. Advances in Neural Information Processing Systems 34 (2021), 13316–13327.
[345]
Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, and Tushar Krishna. 2024. Progressive gradient flow for Robust N:M sparsity training in transformers. arXiv preprint arXiv:2402.04744 (2024).
[346]
Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, and Zhanhui Kang. 2023. E-Sparse: Boosting the large language model inference through entropy-based N:M sparsity. arXiv preprint arXiv:2310.15929 (2023).
[347]
Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang, et al. 2024. The emergence of essential sparsity in large pre-trained models: The weights that matter. Advances in Neural Information Processing Systems 36 (2024).
[348]
Yuxin Zhang, Yiting Luo, Mingbao Lin, Yunshan Zhong, Jingjing Xie, Fei Chao, and Rongrong Ji. 2023. Bi-directional masks for efficient N:M sparse training. In International Conference on Machine Learning. PMLR, 41488–41497.
[349]
Mike Lasby, Anna Golubeva, Utku Evci, Mihai Nica, and Yani Ioannou. 2023. Dynamic sparse training with structured sparsity. arXiv preprint arXiv:2305.02299 (2023).
[350]
Chao Fang, Aojun Zhou, and Zhongfeng Wang. 2022. An algorithm–hardware co-optimized framework for accelerating N:M sparse transformers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 30, 11 (2022), 1573–1586.
[351]
Chao Fang, Shouliang Guo, Wei Wu, Jun Lin, Zhongfeng Wang, Ming Kai Hsu, and Lingzhi Liu. 2022. An efficient hardware accelerator for sparse transformer neural networks. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS’22). IEEE, 2670–2674.
[352]
Yixuan Luo, Payman Behnam, Kiran Thorat, Zhuo Liu, Hongwu Peng, Shaoyi Huang, Shu Zhou, Omer Khan, Alexey Tumanov, Caiwen Ding, et al. 2022. CoDG-ReRAM: An algorithm-hardware co-design to accelerate semi-structured GNNs on ReRAM. In 2022 IEEE 40th International Conference on Computer Design (ICCD’22). IEEE, 280–289.
[353]
Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. 2021. RED: Looking for redundancies for data-free structured compression of deep neural networks. Advances in Neural Information Processing Systems 34 (2021), 20863–20873.
[354]
Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. 2022. RED++: Data-free pruning of deep neural networks via input splitting and output merging. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3664–3676.
[355]
Wenxiao Wang, Cong Fu, Jishun Guo, Deng Cai, and Xiaofei He. 2019. COP: Customized deep model compression via regularized correlation-based filter-level pruning. In International Joint Conference on Artificial Intelligence.
[356]
Zi Wang, Chengcheng Li, and Xiangyang Wang. 2021. Convolutional neural network pruning with structural redundancy reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14913–14922.
[357]
Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
[358]
Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. 2020. HRank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1529–1538.
[359]
Yang Sui, Miao Yin, Yi Xie, Huy Phan, Saman Aliari Zonouz, and Bo Yuan. 2021. CHIP: CHannel Independence-based Pruning for compact neural networks. Advances in Neural Information Processing Systems 34 (2021), 24604–24616.
[360]
Chong Min John Tan and Mehul Motani. 2020. DropNet: Reducing neural network complexity via iterative pruning. In International Conference on Machine Learning. PMLR, 9356–9366.
[361]
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision. 5058–5066.
[362]
Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan. 2019. Approximated oracle filter pruning for destructive CNN width optimization. In International Conference on Machine Learning. PMLR, 1607–1616.
[363]
Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime neural pruning. Advances in Neural Information Processing Systems 30 (2017).
[364]
Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision. 2736–2744.
[365]
Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. 2019. Gate Decorator: Global filter pruning method for accelerating deep convolutional neural networks. Advances in Neural Information Processing Systems 32 (2019).
[366]
Tao Zhuang, Zhixuan Zhang, Yuheng Huang, Xiaoyi Zeng, Kai Shuang, and Xiang Li. 2020. Neuron-level structured pruning using polarization regularizer. Advances in Neural Information Processing Systems 33 (2020), 9865–9877.
[367]
Jianbo Ye, Xin Lu, Zhe Lin, and James Z. Wang. 2018. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations.
[368]
Minsoo Kang and Bohyung Han. 2020. Operation-aware soft channel pruning using differentiable masks. In International Conference on Machine Learning. PMLR, 5122–5131.
[369]
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV’18). 784–800.
[370]
Sixing Yu, Arya Mazaheri, and Ali Jannesari. 2021. Auto graph encoder-decoder for neural network pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6362–6372.
[371]
Manoj Alwani, Yang Wang, and Vashisht Madhavan. 2022. DECORE: Deep compression with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12349–12359.
[372]
Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. 2019. Metapruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3296–3305.
[373]
Mingbao Lin, Rongrong Ji, Yuxin Zhang, Baochang Zhang, Yongjian Wu, and Yonghong Tian. 2021. Channel pruning via automatic structure search. In Proceedings of the 29th International Conference on International Joint Conferences on Artificial Intelligence. 673–679.
[374]
Xuhua Li, Weize Sun, Lei Huang, and Shaowu Chen. 2022. Sub-network multi-objective evolutionary algorithm for filter pruning. arXiv preprint arXiv:2211.01957 (2022).
[375]
Yawei Li, Shuhang Gu, Kai Zhang, Luc Van Gool, and Radu Timofte. 2020. DHP: Differentiable meta pruning via hypernetworks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16. Springer, 608–624.
[376]
Shaopeng Guo, Yujie Wang, Quanquan Li, and Junjie Yan. 2020. DMCP: Differentiable Markov channel pruning for neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1539–1547.
[377]
Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, and Huazhong Yang. 2020. DSA: More efficient budgeted pruning via differentiable sparsity allocation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III. Springer, 592–607.
[378]
Shi Chen and Qi Zhao. 2018. Shallowing deep networks: Layer-wise pruning based on feature representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 12 (2018), 3048–3056.
[379]
Sara Elkerdawy, Mostafa Elhoushi, Abhineet Singh, Hong Zhang, and Nilanjan Ray. 2020. To filter prune, or to layer prune, that is the question. In Proceedings of the Asian Conference on Computer Vision.
[380]
Hui Tang, Yao Lu, and Qi Xuan. 2023. SR-init: An interpretable layer pruning method. arXiv preprint arXiv:2303.07677 (2023).
[381]
Ke Zhang and Guangzhe Liu. 2022. Layer pruning for obtaining shallower ResNets. IEEE Signal Processing Letters 29 (2022), 1172–1176.
[382]
Artur Jordao, George Correa de Araujo, Helena de Almeida Maia, and Helio Pedrini. 2023. When layers play the lottery, all tickets win at initialization. arXiv preprint arXiv:2301.10835 (2023).
[383]
Yang He and Lingao Xiao. 2023. Structured pruning for deep convolutional neural networks: A survey. arXiv preprint arXiv:2303.00566 (2023).
[384]
Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. 2018. Discrimination-aware channel pruning for deep neural networks. Advances in Neural Information Processing Systems 31 (2018).
[385]
Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning. PMLR, 448–456.
[386]
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[387]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, Shiqing Li, Guochu Xiong, and Weichen Liu. 2024. Pearls hide behind linearity: Simplifying deep convolutional networks for embedded hardware systems via linearity grafting. In 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC’24). IEEE, 250–255.
[388]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Guochu Xiong, and Weichen Liu. 2024. Domino-Pro-Max: Towards efficient network simplification and reparameterization for embedded hardware systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2024).
[389]
Hao Kong, Di Liu, Shuo Huai, Xiangzhong Luo, Weichen Liu, Ravi Subramaniam, Christian Makaya, and Qian Lin. 2022. Smart scissor: Coupling spatial redundancy reduction and CNN compression for embedded hardware. In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design. 1–9.
[390]
Hao Kong, Di Liu, Xiangzhong Luo, Shuo Huai, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. Towards Towards efficient convolutional neural network for embedded hardware via multi-dimensional pruning. In 2023 60th ACM/IEEE Design Automation Conference (DAC’23). IEEE, 1–6.
[391]
Hao Kong, Xiangzhong Luo, Shuo Huai, Di Liu, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. EMNAPE: Efficient multi-dimensional neural architecture pruning for EdgeAI. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE’23). IEEE, 1–2.
[392]
Hao Kong, Di Liu, Shuo Huai, Xiangzhong Luo, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. EdgeCompress: Coupling multidimensional model compression and dynamic inference for EdgeAI. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 12 (2023), 4657–4670.
[393]
Hao Kong, Di Liu, Xiangzhong Luo, Weichen Liu, and Ravi Subramaniam. 2022. HACScale: Hardware-aware compound scaling for resource-efficient DNNs. In 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC’22). IEEE, 708–713.
[394]
Adrian Bulat and Georgios Tzimiropoulos. 2019. XNOR-Net++: Improved binary neural networks. arXiv preprint arXiv:1909.13863 (2019).
[395]
Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. 2018. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV’18). 722–737.
[396]
Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song. 2020. Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2250–2259.
[397]
Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Yan Wang, Yongjian Wu, Feiyue Huang, and Chia-Wen Lin. 2020. Rotated binary neural network. Advances in Neural Information Processing Systems 33 (2020), 7474–7485.
[398]
Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Fei Chao, Chia-Wen Lin, and Ling Shao. 2022. SiMaN: Sign-to-magnitude network binarization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[399]
Sieger Falkena, Hadi Jamali-Rad, and Jan van Gemert. 2023. LAB: Learnable activation binarizer for binary neural networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6425–6434.
[400]
Zhijun Tu, Xinghao Chen, Pengju Ren, and Yunhe Wang. 2022. AdaBin: Improving binary neural networks with adaptive binary sets. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. Springer, 379–395.
[401]
Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, and Junchi Yan. 2016. Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016).
[402]
Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2017. Trained ternary quantization. In International Conference on Learning Representations.
[403]
Hande Alemdar, Vincent Leroy, Adrien Prost-Boucle, and Frédéric Pétrot. 2017. Ternary neural networks for resource-efficient AI applications. In 2017 International Joint Conference on Neural Networks (IJCNN’17). IEEE, 2547–2554.
[404]
Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. 2017. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462 (2017).
[405]
Yue Li, Wenrui Ding, Chunlei Liu, Baochang Zhang, and Guodong Guo. 2021. TRQ: Ternary neural networks with residual quantization. In Proceedings of the AAAI Conference on Artificial Intelligence. 8538–8546.
[406]
Yuhang Li, Xin Dong, Sai Qian Zhang, Haoli Bai, Yuanpeng Chen, and Wei Wang. 2020. RTN: Reparameterized ternary network. In Proceedings of the AAAI Conference on Artificial Intelligence. 4780–4787.
[407]
Peng Chen, Bohan Zhuang, and Chunhua Shen. 2021. FATNN: Fast and accurate ternary neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5219–5228.
[408]
Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang, and Jian Cheng. 2022. Soft threshold ternary networks. arXiv preprint arXiv:2204.01234 (2022).
[409]
Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
[410]
Han Vanholder. 2016. Efficient inference with TensorRT. In GPU Technology Conference.
[411]
Sumin Kim, Gunju Park, and Youngmin Yi. 2021. Performance evaluation of INT8 quantized inference on mobile GPUs. IEEE Access 9 (2021), 164245–164255.
[412]
Li Lyna Zhang, Xudong Wang, Jiahang Xu, Quanlu Zhang, Yujing Wang, Yuqing Yang, Ningxin Zheng, Ting Cao, and Mao Yang. 2023. SpaceEvo: Hardware-friendly search space design for efficient INT8 inference. arXiv preprint arXiv:2303.08308 (2023).
[413]
Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, and Vikram Saletore. 2019. Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532 (2019).
[414]
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2704–2713.
[415]
Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng Tao Shen. 2018. TBN: Convolutional neural network with ternary inputs and binary weights. In Proceedings of the European Conference on Computer Vision (ECCV’18). 315–332.
[416]
Julian Faraone, Nicholas Fraser, Michaela Blott, and Philip H. W. Leong. 2018. SYQ: Learning symmetric quantization for efficient deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4300–4309.
[417]
Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I.-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. 2018. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018).
[418]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
[419]
Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, et al. 2018. Mixed precision training of convolutional neural networks using integer operations. In International Conference on Learning Representations.
[420]
Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. 2018. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205 (2018).
[421]
Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. 2018. OpenSeq2Seq: Extensible toolkit for distributed and mixed precision training of sequence-to-sequence models. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS’18). 41–46.
[422]
Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, and Junjie Yan. 2020. Towards unified INT8 training for convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1969–1979.
[423]
Kang Zhao, Sida Huang, Pan Pan, Yinghan Li, Yingya Zhang, Zhenyu Gu, and Yinghui Xu. 2021. Distribution adaptive INT8 quantization for training CNNs. In Proceedings of the AAAI Conference on Artificial Intelligence. 3483–3491.
[424]
Shyam A. Tailor, Javier Fernandez-Marques, and Nicholas D. Lane. 2021. Degree-Quant: Quantization-aware training for graph neural networks. In International Conference on Learning Representations.
[425]
Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. 2022. Overcoming oscillations in quantization-aware training. In International Conference on Machine Learning. PMLR, 16318–16330.
[426]
Charbel Sakr, Steve Dai, Rangha Venkatesan, Brian Zimmer, William Dally, and Brucek Khailany. 2022. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. In International Conference on Machine Learning. PMLR, 19123–19138.
[427]
Jiseok Youn, Jaehun Song, Hyung-Sin Kim, and Saewoong Bahk. 2022. Bitwidth-adaptive quantization-aware neural network training: A meta-learning approach. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 208–224.
[428]
Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. 2018. Mixed precision quantization of ConvNets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090 (2018).
[429]
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8612–8620.
[430]
Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2019. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 293–302.
[431]
Haibao Yu, Qi Han, Jianbo Li, Jianping Shi, Guangliang Cheng, and Bin Fan. 2020. Search what you want: Barrier panelty NAS for mixed precision quantization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer, 1–16.
[432]
Weihan Chen, Peisong Wang, and Jian Cheng. 2021. Towards mixed-precision quantization of neural networks via constrained optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5350–5359.
[433]
Zhaowei Cai and Nuno Vasconcelos. 2020. Rethinking differentiable search for mixed-precision neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2349–2358.
[434]
Ziwei Wang, Han Xiao, Jiwen Lu, and Jie Zhou. 2021. Generalizable mixed-precision quantization via attribution rank preservation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5291–5300.
[435]
Hai Victor Habi, Roy H. Jennings, and Arnon Netzer. 2020. HMQ: Hardware friendly mixed precision quantization block for CNNs. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16. Springer, 448–463.
[436]
Zhaohui Yang, Yunhe Wang, Kai Han, Chunjing Xu, Chao Xu, Dacheng Tao, and Chang Xu. 2020. Searching for low-bit weights in quantized neural networks. Advances in Neural Information Processing Systems 33 (2020), 4091–4102.
[437]
Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. 2016. YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights. In 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI’16). IEEE, 236–241.
[438]
Peng Guo, Hong Ma, Ruizhi Chen, Pin Li, Shaolin Xie, and Donglin Wang. 2018. FBNA: A fully binarized neural network accelerator. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL’18). IEEE, 51–513.
[439]
Francesco Conti, Pasquale Davide Schiavone, and Luca Benini. 2018. XNOR neural engine: A hardware accelerator IP for 21.6-fJ/op binary neural network inference. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2940–2951.
[440]
Shubham Jain, Sumeet Kumar Gupta, and Anand Raghunathan. 2020. TiM-DNN: Ternary in-memory accelerator for deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 7 (2020), 1567–1577.
[441]
Moritz Scherer, Georg Rutishauser, Lukas Cavigelli, and Luca Benini. 2021. CUTIE: Beyond PetaOp/s/W ternary DNN inference acceleration with better-than-binary energy efficiency. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 4 (2021), 1020–1033.
[442]
Shien Zhu, Luan H. K. Duong, Hui Chen, Di Liu, and Weichen Liu. 2022. FAT: An in-memory accelerator with fast addition for ternary weight neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2022).
[443]
Nahsung Kim, Dongyeob Shin, Wonseok Choi, Geonho Kim, and Jongsun Park. 2020. Exploiting retraining-based mixed-precision quantization for low-cost DNN accelerator design. IEEE Transactions on Neural Networks and Learning Systems 32, 7 (2020), 2925–2938.
[444]
Mengshu Sun, Zhengang Li, Alec Lu, Yanyu Li, Sung-En Chang, Xiaolong Ma, Xue Lin, and Zhenman Fang. 2022. FILM-QNN: Efficient FPGA acceleration of deep neural networks with intra-layer, mixed-precision quantization. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 134–145.
[445]
Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, and Hoi-Jun Yoo. 2019. An energy-efficient sparse deep-neural-network learning accelerator with fine-grained mixed precision of FP8–FP16. IEEE Solid-State Circuits Letters 2, 11 (2019), 232–235.
[446]
Sitao Huang, Aayush Ankit, Plinio Silveira, Rodrigo Antunes, Sai Rahul Chalamalasetti, Izzat El Hajj, Dong Eun Kim, Glaucimar Aguiar, Pedro Bruel, Sergey Serebryakov, et al. 2021. Mixed precision quantization for ReRAM-based DNN inference accelerators. In Proceedings of the 26th Asia and South Pacific Design Automation Conference. 372–377.
[447]
Wolfgang Balzer, Masanobu Takahashi, Jun Ohta, and Kazuo Kyuma. 1991. Weight quantization in Boltzmann machines. Neural Networks 4, 3 (1991), 405–409.
[448]
Emile Fiesler, Amar Choudry, and H. John Caulfield. 1990. Weight discretization paradigm for optical neural networks. In Optical Interconnections and Networks, Vol. 1281. SPIE, 164–173.
[449]
Gunhan Dundar and Kenneth Rose. 1995. The effects of quantization on multilayer neural networks. IEEE Transactions on Neural Networks 6, 6 (1995), 1446–1451.
[450]
Shuo Huai, Di Liu, Xiangzhong Luo, Hui Chen, Weichen Liu, and Ravi Subramaniam. 2023. Crossbar-aligned & integer-only neural network compression for efficient in-memory acceleration. In Proceedings of the 28th Asia and South Pacific Design Automation Conference. 234–239.
[451]
Shuo Huai, Hao Kong, Xiangzhong Luo, Shiqing Li, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. CRIMP: Compact & Reliable DNN Inference on In-Memory Processing via Crossbar-Aligned Compression and Non-ideality Adaptation. ACM Transactions on Embedded Computing Systems 22, 5s (2023), 1–25.
[452]
Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive representation distillation. arXiv preprint arXiv:1910.10699 (2019).
[453]
Srinidhi Hegde, Ranjitha Prasad, Ramya Hebbalaguppe, and Vishwajeet Kumar. 2020. Variational student: Learning compact and sparser networks in knowledge distillation framework. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 3247–3251.
[454]
Tiancheng Wen, Shenqi Lai, and Xueming Qian. 2021. Preparing lessons: Improve knowledge distillation with better supervision. Neurocomputing 454 (2021), 25–33.
[455]
Jang Hyun Cho and Bharath Hariharan. 2019. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4794–4802.
[456]
Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence. 5191–5198.
[457]
Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. 2022. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10925–10934.
[458]
Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. 2017. Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision. 1910–1918.
[459]
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. 2020. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10687–10698.
[460]
Guanzhe Hong, Zhiyuan Mao, Xiaojun Lin, and Stanley H. Chan. 2021. Student-teacher learning from clean inputs to noisy inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12075–12084.
[461]
Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4133–4141.
[462]
Jangho Kim, SeongUk Park, and Nojun Kwak. 2018. Paraphrasing complex network: Network compression via factor transfer. Advances in Neural Information Processing Systems 31 (2018).
[463]
Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D. Lawrence, and Zhenwen Dai. 2019. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9163–9171.
[464]
Frederick Tung and Greg Mori. 2019. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1365–1374.
[465]
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020. BERT-of-Theseus: Compressing BERT by progressive module replacing. arXiv preprint arXiv:2002.02925 (2020).
[466]
Zaida Zhou, Chaoran Zhuge, Xinwei Guan, and Wen Liu. 2020. Channel distillation: Channel-wise attention for knowledge distillation. arXiv preprint arXiv:2006.01683 (2020).
[467]
Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 30 (2017).
[468]
Shan You, Chang Xu, Chao Xu, and Dacheng Tao. 2017. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1285–1294.
[469]
Bharat Bhusan Sau and Vineeth N. Balasubramanian. 2016. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650 (2016).
[470]
Guocong Song and Wei Chai. 2018. Collaborative learning for deep neural networks. Advances in Neural Information Processing Systems 31 (2018).
[471]
Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, and Daxin Jiang. 2020. Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In Proceedings of the 13th International Conference on Web Search and Data Mining. 690–698.
[472]
Xiatian Zhu, Shaogang Gong, et al. 2018. Knowledge distillation by on-the-fly native ensemble. Advances in Neural Information Processing Systems 31 (2018).
[473]
Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. 2017. Efficient knowledge distillation from an ensemble of teachers. In Interspeech. 3697–3701.
[474]
Liuyu Xiang, Guiguang Ding, and Jungong Han. 2020. Learning from multiple experts: Self-paced knowledge distillation for long-tailed classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer, 247–263.
[475]
Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. 2018. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4320–4328.
[476]
Elliot J. Crowley, Gavin Gray, and Amos J. Storkey. 2018. Moonshine: Distilling with cheap convolutions. Advances in Neural Information Processing Systems 31 (2018).
[477]
Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. 2019. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3713–3722.
[478]
Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. 2020. Self-distillation amplifies regularization in Hilbert space. Advances in Neural Information Processing Systems 33 (2020), 3351–3361.
[479]
Sukmin Yun, Jongjin Park, Kimin Lee, and Jinwoo Shin. 2020. Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13876–13885.
[480]
Mingi Ji, Seungjae Shin, Seunghyun Hwang, Gibeom Park, and Il-Chul Moon. 2021. Refine myself by teaching myself: Feature refinement via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10664–10673.
[481]
Yixiao Ge, Xiao Zhang, Ching Lam Choi, Ka Chun Cheung, Peipei Zhao, Feng Zhu, Xiaogang Wang, Rui Zhao, and Hongsheng Li. 2021. Self-distillation with batch knowledge ensembling improves ImageNet classification. arXiv preprint arXiv:2104.13298 (2021).
[482]
Vladimir Vapnik and Rauf Izmailov. 2015. Learning using privileged information: Similarity control and knowledge transfer. Journal of Machine Learning Research 16, 61 (2015), 2023–2049.
[483]
David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. 2016. Unifying distillation and privileged information. In International Conference on Learning Representations.
[484]
Peisen Zhao, Lingxi Xie, Jiajie Wang, Ya Zhang, and Qi Tian. 2022. Progressive privileged knowledge distillation for online action detection. Pattern Recognition 129 (2022), 108741.
[485]
Fengyi Tang, Cao Xiao, Fei Wang, Jiayu Zhou, and Li-wei H. Lehman. 2019. Retaining privileged information for multi-task learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1369–1377.
[486]
Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. 2019. Adversarial distillation for learning with privileged provisions. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 3 (2019), 786–797.
[487]
Chen Xu, Quan Li, Junfeng Ge, Jinyang Gao, Xiaoyong Yang, Changhua Pei, Fei Sun, Jian Wu, Hanxiao Sun, and Wenwu Ou. 2020. Privileged features distillation at Taobao recommendations. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2590–2598.
[488]
Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. 2019. Data-free learning of student networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3514–3522.
[489]
Gongfan Fang, Jie Song, Chengchao Shen, Xinchao Wang, Da Chen, and Mingli Song. 2019. Data-free adversarial distillation. arXiv preprint arXiv:1912.11006 (2019).
[490]
Xiaoyang Qu, Jianzong Wang, and Jing Xiao. 2021. Enhancing data-free adversarial distillation with activation regularization and virtual interpolation. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’21). IEEE, 3340–3344.
[491]
Haoran Zhao, Xin Sun, Junyu Dong, Milos Manic, Huiyu Zhou, and Hui Yu. 2022. Dual discriminator adversarial distillation for data-free model compression. International Journal of Machine Learning and Cybernetics (2022), 1–18.
[492]
Yuanxin Zhuang, Lingjuan Lyu, Chuan Shi, Carl Yang, and Lichao Sun. 2022. Data-free adversarial knowledge distillation for graph neural networks. arXiv preprint arXiv:2205.03811 (2022).
[493]
Yiman Zhang, Hanting Chen, Xinghao Chen, Yiping Deng, Chunjing Xu, and Yunhe Wang. 2021. Data-free knowledge distillation for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7852–7861.
[494]
Gongfan Fang, Jie Song, Xinchao Wang, Chengchao Shen, Xingen Wang, and Mingli Song. 2021. Contrastive model inversion for data-free knowledge distillation. arXiv preprint arXiv:2105.08584 (2021).
[495]
Mandar Kulkarni, Kalpesh Patil, and Shirish Karande. 2017. Knowledge distillation using unlabeled mismatched images. arXiv preprint arXiv:1703.07131 (2017).
[496]
Qing Liu, Lingxi Xie, Huiyu Wang, and Alan L. Yuille. 2019. Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3662–3671.
[497]
Tianhong Li, Jianguo Li, Zhuang Liu, and Changshui Zhang. 2020. Few sample knowledge distillation for efficient network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14639–14647.
[498]
Akisato Kimura, Zoubin Ghahramani, Koh Takeuchi, Tomoharu Iwata, and Naonori Ueda. 2018. Few-shot learning of neural networks from scratch by pseudo example optimization. arXiv preprint arXiv:1802.03039 (2018).
[499]
Haoli Bai, Jiaxiang Wu, Irwin King, and Michael Lyu. 2020. Few shot network compression via cross distillation. In Proceedings of the AAAI Conference on Artificial Intelligence. 3203–3210.
[500]
Huanyu Wang, Junjie Liu, Xin Ma, Yang Yong, Zhenhua Chai, and Jianxin Wu. 2022. Compressing models with few samples: Mimicking then replacing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 701–710.
References 501 through 651 have been omitted.

