
Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision

Published: 10 December 2024

Abstract

Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs in real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate this computational gap and enable ubiquitous embedded intelligence, we focus in this survey on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. We also envision promising future directions and trends, which have the potential to deliver more ubiquitous embedded intelligence. We believe this survey has its merits and can shed light on future research, which can largely help researchers to quickly and smoothly get started in this emerging field.

1 Introduction

With the increasing availability of large-scale datasets and advanced computing paradigms, deep neural networks (DNNs)1 have empowered a wide range of intelligent applications and have demonstrated strong performance [1, 2, 3]. These intelligent applications span from image classification [2] to downstream vision tasks, such as object detection [4], tracking [5], and segmentation [6], to natural language processing (NLP) tasks, such as automatic speech recognition [7], machine translation [8], and question answering [9]. Subsequently, DNNs have been evolving more deeply with increasing numbers of layers in order to maintain state-of-the-art accuracy on target tasks [1, 2, 3]. Novel network structures and advanced training techniques have also emerged, which further push forward the attainable accuracy [10, 11, 12]. These powerful deep learning (DL) networks and advanced training techniques, starting from VGGNet [1] and ResNet [2], mark the emergence of the DL era.
The tremendous breakthroughs of DNNs have attracted a huge amount of attention from both academia and industry to deploy powerful DNNs in real-world embedded computing systems, including mobile phones [13, 14], autonomous vehicles [15, 16], and health care [17, 18], to enable intelligent embedded applications towards embedded intelligence [19]. In practice, this may bring significant benefits. For example, embedded computing systems explicitly allow real-time on-device data processing, which significantly improves processing efficiency and thus delivers enhanced user experience. This also protects data security and privacy since everything can be locally processed without being uploaded to a remote server [19]. Despite these promising benefits, deploying powerful DNNs in real-world embedded computing systems still suffers from several critical limitations. On the one hand, in order to maintain competitive accuracy, recent representative networks have been evolving much deeper with hundreds of layers [2, 3] and, as a result, lead to prohibitive computational complexity [19, 20]. For example, ResNet50 [2], as one of the most representative deep networks, consists of over 4 billion floating-point operations (FLOPs) and 25 million parameters, which requires over 87 MB of on-device storage to deal with one single input image. On the other hand, real-world embedded computing systems, such as mobile phones and autonomous vehicles, typically feature limited available computational resources in order to optimize the on-device power and energy consumption. In light of the above, the evolving network complexity continues to enlarge the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems [20], inevitably making it increasingly challenging to embrace ubiquitous embedded intelligence.
To bridge the aforementioned computational gap towards ubiquitous embedded intelligence, a plethora of model compression techniques have been recently proposed, including network pruning [21, 22, 23], network quantization [24, 25, 26], and network distillation [11, 27, 28], which strive for better accuracy–efficiency trade-offs to accommodate the limited available computational resources in real-world embedded scenarios. For example, network pruning focuses on removing redundant network units, such as weights [29], channels [21], and layers [30], which can boost the efficiency on target hardware with minimal accuracy loss on the target task. In addition to network compression, another parallel alternative is to manually design resource-efficient networks, such as SqueezeNet [31], MobileNets [32, 33], ShuffleNets [34, 35], and GhostNets [36, 37], which have dominated the early progress from the lens of efficient network design. These efficient networks, despite being able to exhibit superior efficiency, highly rely on human expertise to explore novel network structures through trial and error, a process that also involves non-trivial engineering efforts and prohibitive computational resources [38, 39, 40]. To overcome such limitations, recent network design practices have shifted from manual to automated, also referred to as neural architecture search (NAS) or automated machine learning (AutoML), which focuses on automatically exploring novel network structures [41]. The tremendous success of NAS has sparked rich hardware-aware NAS works, such as MnasNet [38], ProxylessNAS [40], FBNet [39], and Once-for-All [42], to automate the design of accurate yet hardware-efficient network solutions. These solutions have shown strong accuracy–efficiency trade-offs and have been widely deployed in real-world embedded computing systems to deliver intelligent services [43].
Apart from the above efficient networks and techniques that typically focus on improving on-device inference efficiency, recent research has also turned its attention to on-device training efficiency [44, 45]. The rationale here is that previous representative networks, despite being able to exhibit superior accuracy, have to be trained for hundreds of epochs, which may require multiple days on powerful graphics processing units (GPUs) [44]. Even worse, the expensive training process on remote GPUs does not allow on-device customization on local hardware, especially in resource-constrained embedded scenarios [45]. Note that local on-device customization has the potential to further improve the attainable accuracy using newly collected data since local sensors continue to collect new data from users over time. To overcome such limitations, several efficient on-device learning techniques have been established recently, such as on-device continual learning [46], on-device transfer learning [44], and on-device federated learning [47], making it possible to train and fine-tune powerful deep networks on local hardware for further performance improvement.
More recently, large language models (LLMs), such as GPT-3 [48] and GPT-4 [49], have demonstrated impressive success across various real-world language processing tasks [50]. However, the strong learning capability of these powerful LLMs also comes at the cost of excessive computational complexity. For example, OpenAI’s GPT-3 [48], one of the most representative LLMs, consists of 175 billion parameters. Furthermore, in order to achieve state-of-the-art performance, recent LLMs continue to evolve to be exponentially larger with ever-increasing model sizes [51, 52]. These make it increasingly challenging to deploy recent powerful LLMs in modern embedded computing systems towards intelligent language processing services. To overcome such limitations, a series of effective techniques have been proposed recently, which focus on alleviating the prohibitive computational complexity of LLMs to explore computation-efficient LLMs, including efficient LLM architecture design [53, 54, 55, 56], efficient LLM compression techniques (i.e., pruning [57, 58], quantization [59, 60], and knowledge distillation [61, 62]), and efficient LLM system design [63, 64, 65].
In parallel to the booming emergence of powerful deep networks and advanced training techniques, a plethora of representative deep learning software frameworks and hardware accelerators have been tailored to facilitate the development of efficient deep learning solutions for embedded computing systems, such as TensorFlow [66], PyTorch [67], Google edge tensor processing units (TPUs) [68], NVIDIA edge GPUs [69], and Intel Neural Compute Stick [70]. These deep learning software programs and hardware units have been extensively adopted in the deep learning era and bring two main benefits. On the one hand, they lift the roadblock for both software and hardware engineers and, thus, allow engineers to quickly develop intelligent embedded applications, such as on-device object detection [4], tracking [5], and segmentation [6], with less domain-specific expertise. On the other hand, they typically feature domain-specific optimization and, thus, can achieve superior accuracy–efficiency trade-offs with minimal engineering efforts. For example, NVIDIA Jetson AGX Xavier, a representative NVIDIA Jetson edge GPU, supports the development of intelligent embedded applications with the precision of INT8 (i.e., 8-bit weights), which can deliver significant efficiency improvement over its full-precision counterpart (32-bit weights) without degrading the accuracy on the target task [69].

1.1 Organization of This Article

In this survey, we summarize recent efficient deep learning infrastructures that may benefit current and future embedded computing systems towards ubiquitous embedded intelligence. In practice, some existing surveys [71, 72, 73, 74] typically focus on efficient deep learning algorithms, which may be out-of-date since recent deep learning infrastructures have been rapidly evolving, especially from the perspective of LLMs. In contrast to [71, 72, 73, 74], we focus on providing a more comprehensive and holistic view of recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks (CNNs) to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss the following recent efficient deep learning infrastructures for embedded computing systems: (1) efficient manual network design, (2) efficient automated network design, (3) efficient network compression, (4) efficient on-device learning, (5) efficient LLMs for embedded computing systems, (6) efficient deep learning software and hardware, and (7) efficient intelligent applications. We believe this survey has its merits and can shed light on future research, helping researchers quickly and smoothly get started in this emerging field. We demonstrate the organization of this survey in Figure 1, which is also summarized as follows:
Fig. 1. The organization of this article. Sections 1 and 9 are not in the figure for the sake of simplicity.
Section 2 extensively discusses recent representative efficient manual networks.
Section 3 extensively discusses recent representative efficient automated networks.
Section 4 extensively discusses recent representative network compression techniques.
Section 5 extensively discusses recent representative on-device learning techniques.
Section 6 extensively discusses recent representative efficient LLMs.
Section 7 extensively discusses recent representative deep learning software and hardware.
Section 8 extensively discusses recent representative intelligent embedded applications.
At the end of each section, we also envision possible future directions in the respective fields, which have the potential to pave the way for future ubiquitous embedded intelligence.

2 Manual Network Design for Embedded Computing Systems

The tremendous success of DNNs highly relies on the prohibitive network complexity, leading to the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems [20]. To bridge this computational gap, one of the most representative solutions is to design computation-efficient DNNs to accommodate the limited computational resources on embedded computing systems. To this end, we systematically discuss recent state-of-the-art efficient manual networks in this section. For better understanding, we divide these efficient networks into two main categories and subsections, including efficient convolutional networks in Section 2.1 and efficient transformers in Section 2.2, since these efficient networks may feature different network structures and target different intelligent embedded applications.

2.1 Manual Convolutional Neural Network Design

Previous state-of-the-art deep convolutional networks, such as AlexNet [76], VGGNet [1], GoogleNet [77], ResNet [2], DenseNet [3], and EfficientNets [78, 79], have pushed forward the attainable accuracy on ImageNet [80] from 57.2% [81] to 87.3% [79], but their network complexity has also increased over time. We note that the convolutional network consists of convolutional layers, pooling layers, and fully connected layers, where most of the network complexity comes from convolutional layers [82]. For example, in ResNet50 [2], more than 99% of FLOPs are from convolutional layers. In light of this, designing efficient convolutional layers is critical to innovating computation-efficient convolutional networks. In practice, there are six typical efficient convolutional layers: pointwise convolution, groupwise convolution, depthwise convolution, dilated convolution, Ghost convolution, and partial convolution.
Pointwise Convolution. Pointwise convolution is a type of convolutional layer with the fixed kernel size of 1 × 1, which performs an element-wise multiplication and addition along the depth dimension. On the one hand, compared with the standard K × K convolutional layer, the pointwise convolutional layer is able to reduce the number of FLOPs and parameters by \(K^2\) times, which significantly improves the efficiency. On the other hand, we note that the output from the pointwise convolutional layer typically has the same spatial dimensions as the input but may have a different number of channels. As such, the pointwise convolutional layer can be used to adjust the intermediate feature maps in terms of the number of channels. Specifically, it can reduce or increase the number of channels, making it a practical technique for compressing or expanding convolutional networks.
Groupwise Convolution. Groupwise convolution is a type of convolutional layer that (1) divides the input feature map into G groups along the depth dimension, (2) performs convolution within each group separately, and (3) concatenates the group outputs along the depth dimension to derive the final output. For example, given an input feature map with the size of \(B \times C \times H \times W\), each kernel in the K × K groupwise convolutional layer has the size of \((C/G) \times K \times K\), which convolves the above G groups of feature maps, respectively. Therefore, compared with the standard K × K convolutional layer, the groupwise convolutional layer is able to reduce the number of FLOPs and parameters by G times.
Depthwise Convolution. Depthwise convolution is a type of convolutional layer that has gained popularity owing to its ability to significantly reduce the number of FLOPs and parameters in convolutional networks. It is a special case of groupwise convolutional layer, in which the number of groups G is equal to the number of input channels. Each input channel is convolved with a unique kernel of \(1 \times K \times K\), after which the outputs from all input channels are concatenated along the depth dimension to derive the final output. In practice, this has the potential to achieve a significant reduction in the number of FLOPs and parameters because the intermediate feature maps may consist of thousands of channels as shown in previous state-of-the-art convolutional networks [2, 3, 78, 79].
Dilated Convolution. Dilated convolution [83], also referred to as atrous convolution, is a type of convolutional layer that is designed to increase the receptive field size. In the dilated convolutional layer, there is an adjustable parameter called the dilation rate, which determines the spacing between kernel elements and can be varied to adjust the size of the receptive field. For example, the 3 × 3 dilated convolutional layer with a dilation rate of 2 covers the same receptive field as the standard 5 × 5 convolutional layer. This allows us to increase the receptive field size to unlock better accuracy without introducing additional computational overheads, such as FLOPs and parameters.
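To make the complexity comparisons above concrete, the following is a minimal PyTorch sketch that counts the parameters of standard, pointwise, groupwise, depthwise, and dilated convolutional layers; the channel and group sizes are illustrative values rather than settings from any particular network, and FLOPs scale with the same ratios for a fixed output resolution.

```python
import torch
import torch.nn as nn

C_in, C_out, K, G = 128, 128, 3, 4  # illustrative sizes, not taken from the survey

def n_params(m):
    return sum(p.numel() for p in m.parameters())

standard  = nn.Conv2d(C_in, C_out, K, padding=1, bias=False)              # C_out * C_in * K * K weights
pointwise = nn.Conv2d(C_in, C_out, 1, bias=False)                         # K^2x fewer weights than standard
groupwise = nn.Conv2d(C_in, C_out, K, padding=1, groups=G, bias=False)    # Gx fewer weights than standard
depthwise = nn.Conv2d(C_in, C_in, K, padding=1, groups=C_in, bias=False)  # one K x K kernel per input channel
dilated   = nn.Conv2d(C_in, C_out, K, padding=2, dilation=2, bias=False)  # 5x5 receptive field at 3x3 cost

for name, m in [("standard", standard), ("pointwise", pointwise),
                ("groupwise", groupwise), ("depthwise", depthwise),
                ("dilated", dilated)]:
    print(f"{name:10s}: {n_params(m):,} parameters")

# For a fixed output resolution, FLOPs follow the same ratios, since every output
# position performs the same per-kernel multiply-accumulate work.
x = torch.randn(1, C_in, 56, 56)
assert standard(x).shape == dilated(x).shape == (1, C_out, 56, 56)
```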
Ghost Convolution. The Ghost convolution [36, 37, 75] is a type of convolutional layer that is designed to generate rich feature maps using cheaper computational resources, as illustrated in Figure 2. The Ghost convolutional layer consists of two sequential parts. The first part corresponds to the standard convolutional layer, in which the number of output channels is rigorously controlled. In the second part, to generate rich feature maps, a series of simple linear operations are applied to the output feature maps from the first part. As a result, the size of the output feature maps still remains the same as the standard convolutional layer, but the total required computational resources, such as the number of FLOPs and parameters, are significantly reduced, as shown in Figure 2.
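The following is a minimal sketch of such a two-part Ghost-style layer in PyTorch; the ratio between the intrinsic and cheaply generated channels and the kernel sizes are illustrative assumptions rather than the exact GhostNet configuration.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost-style two-part convolution: a thin standard conv produces a few
    intrinsic channels, then cheap depthwise ops generate the remaining
    channels. Hyperparameters here are illustrative assumptions."""
    def __init__(self, c_in, c_out, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        c_primary = c_out // ratio    # part 1: strictly limited output channels
        c_cheap = c_out - c_primary   # part 2: generated by cheap linear ops
        self.primary = nn.Conv2d(c_in, c_primary, kernel, padding=kernel // 2, bias=False)
        self.cheap = nn.Conv2d(c_primary, c_cheap, cheap_kernel,
                               padding=cheap_kernel // 2, groups=c_primary, bias=False)

    def forward(self, x):
        y = self.primary(x)
        # Concatenating intrinsic and cheap channels keeps the output size
        # identical to that of a standard convolution with c_out channels.
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 64, 32, 32)
print(GhostConv(64, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```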
Partial Convolution. Partial convolution [84] is designed to reduce computational redundancy and memory access simultaneously. It is built upon the regular convolution, in which only a small number of input channels are convolved to extract representative spatial features, while the remaining input channels are left unchanged. Similar to the Ghost convolution, the resulting output channels are then concatenated along the depth dimension to produce the final output. In practice, partial convolution brings significant computational and memory efficiency since only a small number of input channels are convolved, and it also maintains better on-device resource utilization than the Ghost convolution.
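A minimal sketch of the partial-convolution idea is shown below, assuming a PyTorch setting; the split ratio n_div is an illustrative assumption.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Partial-convolution-style layer: a regular conv is applied to only the
    first 1/n_div of the input channels; the remaining channels pass through
    untouched and are concatenated back. n_div=4 is an assumed split ratio."""
    def __init__(self, channels, n_div=4, kernel=3):
        super().__init__()
        self.c_conv = channels // n_div
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, kernel,
                              padding=kernel // 2, bias=False)

    def forward(self, x):
        x_conv, x_id = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat([self.conv(x_conv), x_id], dim=1)

x = torch.randn(1, 64, 32, 32)
print(PartialConv(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```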
Fig. 2. Comparisons between the standard convolution (left) and the Ghost convolution (right) of GhostNets [36, 37, 75]. Compared with the standard convolutional layer, the Ghost convolutional layer can generate rich features using simple and cheaper linear operations (figure from [36]).
Built on top of the aforementioned efficient convolutional layers and structures, there are several representative families of manually designed efficient convolutional networks, including SqueezeNet [81], MobileNets [32, 85, 86], ShuffleNets [34, 35], CondenseNets [87, 88], GhostNets [36, 37, 75], and FasterNet [84]. We compare the above representative efficient convolutional networks in Figure 3, which are also discussed in the remainder of this section.
Fig. 3. Comparisons of the efficient convolutional networks discussed in Section 2.1, including SqueezeNet [81], MobileNets [32, 85, 86], ShuffleNets [34, 35], CondenseNets [87, 88], and GhostNets [36, 37, 75], in which the accuracy is evaluated on ImageNet [80] and is taken from the respective papers. Note that the convolutional networks in this figure may be trained under different training recipes.
SqueezeNet [81] is stacked using a series of Fire modules, which aims to achieve AlexNet-level accuracy with fewer parameters. Each Fire module consists of two convolutional layers, including one squeeze layer and one expand layer. In the squeeze layer, only pointwise convolutional layers are used to reduce the number of input channels for the subsequent expand layer. Next, the expand layer performs feature expansion using a pair of 1 × 1 and 3 × 3 convolutional layers. SqueezeNet is able to achieve slightly better accuracy on ImageNet than AlexNet (i.e., 57.5% in SqueezeNet vs. 57.2% in AlexNet) using a ×50 smaller model size. SqueezeNet is also more compression friendly than AlexNet. For example, SqueezeNet can be further compressed using the method outlined in [89], which delivers more compact network variants with ×363 to ×510 smaller model sizes and, more importantly, without degrading the accuracy on ImageNet.
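The following is a minimal PyTorch sketch of a Fire-module-style block as described above; the channel counts are illustrative and do not reproduce the full SqueezeNet configuration.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire-module-style block: a 1x1 squeeze layer reduces the channel count,
    then parallel 1x1 and 3x3 expand layers restore it. Channel sizes are
    illustrative, not the exact SqueezeNet configuration."""
    def __init__(self, c_in, c_squeeze, c_expand):
        super().__init__()
        self.squeeze = nn.Conv2d(c_in, c_squeeze, 1)
        self.expand1x1 = nn.Conv2d(c_squeeze, c_expand, 1)
        self.expand3x3 = nn.Conv2d(c_squeeze, c_expand, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.act(self.squeeze(x))
        return torch.cat([self.act(self.expand1x1(s)),
                          self.act(self.expand3x3(s))], dim=1)

x = torch.randn(1, 96, 56, 56)
print(Fire(96, 16, 64)(x).shape)  # torch.Size([1, 128, 56, 56])
```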
MobileNets [32, 85, 86] are a family of lightweight convolutional networks, including MobileNetV1 [32], MobileNetV2 [85], and MobileNeXt [86], which are tailored for mobile devices with limited computational resources. MobileNetV1 is built upon a series of building blocks. Each building block consists of two convolutional layers, including one 3 × 3 depthwise convolutional layer and one 1 × 1 pointwise convolutional layer. With 569 M FLOPs and 4.2 M parameters, MobileNetV1 achieves 70.6% top-1 accuracy on ImageNet. MobileNetV2 is an improved version of MobileNetV1, which aims to unlock higher accuracy with fewer FLOPs and parameters. MobileNetV2 introduces the inverted residual building block that consists of three convolutional layers, including one 1 × 1 pointwise convolutional layer, one 3 × 3 depthwise convolutional layer, and one 1 × 1 pointwise convolutional layer. Here, the inverted residual building block also borrows the residual connection from ResNet [2] to stabilize the training process and improve the accuracy. With 300 M FLOPs and 3.4 M parameters, MobileNetV2 achieves 72.0% top-1 accuracy on ImageNet. MobileNeXt investigates the inverted residual building block in MobileNetV2 and introduces the sandglass block to enhance the accuracy without increasing the network complexity. The sandglass block consists of four convolutional layers, including one 3 × 3 depthwise convolutional layer, one 1 × 1 pointwise convolutional layer, one 1 × 1 pointwise convolutional layer, and one 3 × 3 depthwise convolutional layer. With 300 M FLOPs and 3.4 M parameters, MobileNeXt achieves 74.0% top-1 accuracy on ImageNet.
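The following is a minimal sketch of a MobileNetV2-style inverted residual block (1 × 1 expansion, 3 × 3 depthwise convolution, 1 × 1 linear projection, with an optional residual connection); the expansion ratio and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual block: 1x1 expand -> 3x3 depthwise ->
    1x1 linear project, with a residual connection when shapes allow it.
    expand_ratio=6 follows the commonly reported setting."""
    def __init__(self, c_in, c_out, stride=1, expand_ratio=6):
        super().__init__()
        c_mid = c_in * expand_ratio
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),  # linear projection
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 56, 56])
```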
ShuffleNets [34, 35] are a family of efficient convolutional networks, including ShuffleNetV1 [35] and ShuffleNetV2 [34], which exploit channel shuffling to reduce the network complexity while maintaining competitive accuracy. Specifically, ShuffleNetV1, for the first time, introduces channel shuffling to enhance the information flow across different channels. In practice, the channel shuffling operation is inserted after the 3 × 3 depthwise convolutional layer to shuffle the feature maps from different groups, which is capable of generating richer and more diverse feature maps while not increasing the number of FLOPs and parameters. With 292 M FLOPs and 3.4 M parameters, ShuffleNetV1 achieves 71.5% top-1 accuracy on ImageNet, which is \(+\)0.9% higher than MobileNetV1 under comparable settings of FLOPs. ShuffleNetV2 improves the accuracy and efficiency of ShuffleNetV1 with several architectural modifications. ShuffleNetV2 first leverages channel splitting to divide the input feature maps into two parallel branches, one of which is fed into three convolutional layers, including one 1 × 1 pointwise convolutional layer, one 3 × 3 depthwise convolutional layer, and one 1 × 1 pointwise convolutional layer. After that, the above two branches of feature maps are concatenated along the depth dimension, which are then shuffled using the channel shuffling operation. With 299 M FLOPs and 3.5 M parameters, ShuffleNetV2 is able to achieve 72.6% top-1 accuracy on ImageNet, which is \(+\)1.1% higher than ShuffleNetV1 under comparable settings of FLOPs.
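The channel shuffling operation itself is simple to express; the following is a minimal sketch that interleaves channels across groups via a reshape, transpose, and reshape, adding no FLOPs or parameters beyond the memory movement.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: reshape channels into (groups, C // groups),
    transpose, and flatten, so features from different groups are interleaved."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

x = torch.randn(1, 8, 4, 4)
print(channel_shuffle(x, groups=2).shape)  # torch.Size([1, 8, 4, 4])
```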
CondenseNets [87, 88] are a family of efficient convolutional networks, including CondenseNetV1 [87] and CondenseNetV2 [88], which are built upon another representative convolutional network named DenseNet [3]. CondenseNetV1 enhances the dense connection with a novel module called learned group convolution. Note that the dense connection reuses the features from preceding convolutional layers to enhance the information flow as seen in DenseNet. In contrast, the learned group convolution removes the redundant dense connection between different convolutional layers to reduce network redundancy. With 274 M FLOPs and 2.9 M parameters, CondenseNetV1 achieves 71.0% top-1 accuracy on ImageNet. CondenseNetV2 introduces an alternative named sparse feature reactivation (SFR) to increase feature reuse. Integrated with SFR, each convolutional layer can learn to (1) selectively reuse a set of the most important features from preceding convolutional layers and (2) actively update a set of preceding features to increase their reuse in subsequent convolutional layers. With 146 M FLOPs and 3.6 M parameters, CondenseNetV2 achieves 71.9% top-1 accuracy on ImageNet.
GhostNets [36, 37, 75] are a family of efficient deep convolutional networks, including GhostNetV1 [36, 75] and GhostNetV2 [37], which focus on generating rich feature maps using computationally cheap and simple yet powerful operations. To this end, GhostNetV1 introduces a powerful yet computation-efficient convolution dubbed Ghost convolution as shown in Figure 2, which consists of two sequential parts. The first part corresponds to the standard convolutional layer, in which the number of output channels is rigorously controlled. In the second part, a series of computationally cheap and simple linear operations are applied to the output feature maps from the first part to generate rich feature maps. With only 141 M FLOPs and 5.2 M parameters, GhostNetV1 achieves 73.9% top-1 accuracy on ImageNet. GhostNetV2 introduces a novel hardware-friendly attention mechanism, DFC attention, to enhance the learned feature maps to boost the expressiveness ability, which is seamlessly integrated into GhostNetV1 to push forward the accuracy and efficiency. For example, with 167 M FLOPs and 6.1 M parameters, GhostNetV2 is able to achieve 75.3% top-1 accuracy on ImageNet.
FasterNet [84] is built upon the partial convolution introduced above. In contrast to the above efficient networks that typically optimize the number of FLOPs, FasterNet pioneers the design of efficient networks with optimized FLOPS (i.e., FLOPs per second). The motivation behind FasterNet is that the on-device latency is determined by both FLOPs and FLOPS (i.e., Latency = FLOPs/FLOPS). To this end, FasterNet focuses on increasing the number of FLOPs to maintain competitive accuracy on the target task, while at the same time optimizing FLOPS to maintain competitive efficiency on target hardware. For example, compared with GhostNetV1x1.3, which involves 0.24 G FLOPs and exhibits 75.7% top-1 accuracy on ImageNet, FasterNet-T1 achieves \(+\)0.5% higher top-1 accuracy with many more FLOPs (i.e., 0.85 G) and, more importantly, achieves ×1.7 speedup on ARM processors.
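The following is a small numeric sketch of the latency model behind this argument (Latency = FLOPs/FLOPS); the throughput numbers are hypothetical and chosen only to illustrate how a higher-FLOPs network can still be faster when its operators reach a higher effective throughput on the device.

```python
# Hypothetical illustration of the latency model Latency = FLOPs / FLOPS:
# a network with more FLOPs can still be faster if its operators achieve a
# higher effective throughput. All numbers below are made up for illustration.

def latency_ms(flops, effective_flops_per_s):
    return flops / effective_flops_per_s * 1e3

net_a = {"flops": 0.24e9, "throughput": 10e9}   # fewer FLOPs, memory-bound ops
net_b = {"flops": 0.85e9, "throughput": 60e9}   # more FLOPs, compute-friendly ops

print(f"net_a: {latency_ms(net_a['flops'], net_a['throughput']):.1f} ms")  # 24.0 ms
print(f"net_b: {latency_ms(net_b['flops'], net_b['throughput']):.1f} ms")  # 14.2 ms
```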

2.2 Manual Transformer Design

2.2.1 Transformer for NLP.

In parallel to convolutional networks, the transformer [90] is another well-established branch of DNNs, which exploits multi-head self-attention mechanisms. The transformer was first designed for and applied to NLP tasks, where it has achieved tremendous success. For example, BERT [91], one of the most representative transformers in the field of NLP, achieved state-of-the-art performance across 11 downstream NLP tasks, such as language translation, question answering, and language generation, at the time it was proposed. Furthermore, GPT-3 (Generative Pretrained Transformer 3) [48] pioneers scaling up and pretraining a massive transformer with 175 billion parameters on 45 TB of compressed plaintext data, which unlocks even stronger performance across almost all downstream NLP tasks and, more importantly, without requiring fine-tuning on specific NLP tasks. More recently, GPT-4 [49] has been proposed by OpenAI, which significantly outperforms GPT-3 across a wide range of language processing tasks and has been widely integrated into real-world applications, such as ChatGPT [92], to provide intelligent language processing services. These early transformer-based networks, albeit at the cost of prohibitive computational complexity, have been pushing forward the boundaries of various language processing tasks and dominating recent advances in the field of NLP (see Figure 4).
Fig. 4. Illustration of the key milestones of the transformer, which is originally applied to NLP tasks and has recently gained increasing popularity in the vision community. Here, we mark the vision transformers in red.
Nonetheless, it is quite challenging to deploy powerful transformers in embedded computing systems due to the computational gap between computation-intensive transformers and computation-limited embedded computing systems. For example, as pointed out in [93], to translate a short sentence with only 30 words, a typical transformer model needs to execute 13 G FLOPs, which takes 20 seconds on a Raspberry Pi device. This significantly hinders the user experience in real-world embedded scenarios. To tackle this issue, a series of computation-efficient transformers have emerged, among which TinyBERT [94], MobileBERT [95], DistilBERT [96], Linformer [97], and Reformer [98] are some of the most representative ones. The main intuition behind these efficient transformers is to resolve the memory bottleneck and increase the parallelism, making it possible to deploy NLP workloads on resource-constrained embedded computing systems. Note that, compared with computer vision tasks such as image classification and object detection, running NLP workloads on embedded computing systems is less common due to high inference latency. For example, as demonstrated in [93], running language translation workloads with hardware-tailored transformers on a Raspberry Pi device still takes several seconds, whereas running image classification workloads typically takes milliseconds per image. More recently, inspired by the remarkable success of GPTs [48, 49], transformer-based LLMs have become increasingly popular in the NLP community. To optimize the efficiency of transformer-based LLMs, a plethora of efficient transformer-based LLMs have been proposed, which typically focus on improving the training efficiency [99, 100, 101], inference efficiency [102, 103], and fine-tuning efficiency [104, 105] of transformers in the context of LLMs. For example, to optimize the inference efficiency of transformer-based LLMs, [102] partitions LLMs over different hardware chips in order to fit weights and activation tensors into memory and run computation and memory workloads within the given latency constraint; it also features a simple yet effective strategy to alleviate the communication overheads among different hardware chips for cost-effective and latency-efficient inference.

2.2.2 Transformer for Vision.

Inspired by the tremendous success of the transformer in the field of NLP, researchers have recently applied it to vision tasks, which achieves surprisingly strong performance (see Figure 4). This opens up a new direction and further challenges the dominant role of convolutional networks in vision tasks. Specifically, DETR [106] and Vision Transformer (ViT) [107] are the very early transformers in vision tasks, among which ViT is the most representative. These early pioneers have motivated a myriad of subsequent transformers in various vision tasks, such as image classification [107, 108, 109], object detection [110, 111, 112], semantic segmentation [113, 114, 115, 116], and video analysis [117, 118, 119]. For example, ViT was first proposed in June 2020, which has since gained over 20,000 citations as shown in Google Scholar. The main intuition behind ViT is surprisingly simple and straightforward, which (1) splits the input image into a series of fixed-size patches, (2) linearly embeds each of them, and (3) feeds the resulting sequence of vectors into the standard transformer encoder as illustrated in Figure 5. However, there is no free lunch. The surprisingly strong performance of ViT and its variants comes at the cost of prohibitive computational complexity, which significantly hinders the practical deployments of ViT and its variants in embedded computing systems with limited computational resources.
Fig. 5. Overview of Vision Transformer (ViT) [107], which (1) splits the image into fixed-size patches, (2) linearly embeds each of them, and (3) feeds the sequence of vectors into the encoder (figure from [107]).
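The following is a minimal sketch of the three steps above (patchify, linearly embed, and feed the token sequence into a standard transformer encoder), assuming a PyTorch setting; the patch size, embedding dimension, and depth are illustrative and much smaller than those of the actual ViT models.

```python
import torch
import torch.nn as nn

# Minimal ViT-style pipeline: split the image into fixed-size patches, linearly
# embed them, prepend a class token, add position embeddings, and run a standard
# transformer encoder. All sizes below are illustrative assumptions.
img_size, patch, dim, layers, heads = 224, 16, 192, 4, 3
n_patches = (img_size // patch) ** 2  # 196 patches for a 224x224 image

patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify + linear embed
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
    num_layers=layers)

x = torch.randn(2, 3, img_size, img_size)
tokens = patch_embed(x).flatten(2).transpose(1, 2)            # (2, 196, dim)
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], 1)  # prepend the class token
out = encoder(tokens + pos_embed)                             # (2, 197, dim)
print(out.shape)
```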
To resolve the complexity bottleneck, some recent works have pioneered to design computation-efficient transformers for vision tasks with the aim of reducing the computational complexity while maintaining competitive accuracy. The representative computation-efficient transformers in vision tasks include LeViT [120], MobileFormer [121], MobileViTs [122, 123, 124], EfficientViT [125], EdgeViT [126], EdgeNeXt [127], CastlingViT [128], and FastViT [129]. These computation-efficient vision transformers are summarized and compared in Figure 6.
Fig. 6. Comparisons of efficient vision transformers discussed in Section 2.2, including LeViT [120], MobileFormer [121], MobileViTs [122, 123, 124], EfficientViT [125], EdgeViT [126], EdgeNeXt [127], CastlingViT [128], and FastViT [129], in which the accuracy is evaluated on ImageNet [80] and is taken from the respective papers. Note that the vision transformers here may be trained under different training recipes.
LeViT [120] is a hybrid vision transformer built on top of convolutional networks, which aims to improve the trade-off between accuracy and efficiency. To this end, LeViT introduces several enhancements to shrink down the network size, including (1) a multi-stage transformer architecture that uses attention mechanisms as down-sampling, (2) a computation-efficient patch descriptor that shrinks down the number of features in the early layers, (3) a per-head translation-invariant attention bias that replaces ViT’s positional embeddings, and (4) an efficient MLP-based attention block that improves the network capacity under given computational budgets. With 406 M FLOPs and 9.2 M parameters, LeViT achieves 78.6% top-1 accuracy on ImageNet.
MobileFormer [121] parallelizes MobileNetV2 [85] and transformer [107] with a two-way bridge, which shifts the network design paradigm from series to parallel. The network here is named MobileFormer, where Mobile refers to MobileNetV2 and Former stands for transformer. Mobile takes the image as input and stacks inverted residual blocks that consist of efficient pointwise and depthwise convolutional layers to extract local features. Former takes learnable tokens as input and stacks multi-head attention and feed-forward networks, in which the learnable tokens encode global features of the image. As such, Mobile and Former can communicate through a two-way bridge to fuse local and global features for better expressiveness ability. With 294 M FLOPs and 11.4 M parameters, MobileFormer achieves 77.9% top-1 accuracy on ImageNet.
MobileViTs, including MobileViTv1 [122], MobileViTv2 [123], and MobileViTv3 [124], are a family of efficient hybrid networks that combine the benefits of CNNs (e.g., spatial inductive bias and less sensitivity to data augmentations) and vision transformers (e.g., input-adaptive weighting and global processing). In contrast to mainstream vision transformers, both MobileViTv1 and MobileViTv2 are designed with the aim of low inference latency rather than low FLOPs since the number of FLOPs cannot accurately reflect the inference efficiency on target hardware. To this end, MobileViTv1 introduces a novel block that is able to efficiently and effectively encode both local and global features. In addition, MobileViTv1 replaces local processing in convolutional layers with global processing using transformers, which can lead to better representation capability with fewer parameters and simpler training recipes. Finally, with 5.6 M parameters, MobileViTv1 achieves 78.4% top-1 accuracy on ImageNet. MobileViTv2 introduces a separable self-attention mechanism with linear complexity, which is integrated into MobileViTv1 to boost the accuracy and hardware efficiency. For example, MobileViTv2 achieves 75.6% top-1 accuracy on ImageNet, which is \(+\)0.8% higher than MobileViTv1 while maintaining ×3.2 speedup on iPhone 12. In addition, MobileViTv3 introduces two simple yet effective enhancements: (1) replacing \(3\times 3\) convolutional layers with \(1\times 1\) convolutional layers and (2) scaling up building blocks in terms of the network width. With 927 M FLOPs, MobileViTv3 achieves 76.7% top-1 accuracy on ImageNet, which is \(+\)1.9% higher than MobileViTv1 under similar FLOPs.
EfficientViT [125] investigates high-resolution, low-computation visual recognition tasks using ViT and its variants, and identifies that the complexity bottleneck of ViT and its variants comes from the excessively used softmax attention mechanism. To resolve the complexity bottleneck, EfficientViT challenges the dominant role of softmax attention in vision transformers. It introduces a strong alternative, namely, enhanced linear attention, to replace softmax attention, which demonstrates strong representation capability in local feature extraction while being able to maintain low computational complexity and high hardware efficiency. With 406 M FLOPs and 7.9 M parameters, EfficientViT achieves 78.6% top-1 accuracy on ImageNet.
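The following is a generic sketch of the linear-attention idea that such designs build on: with a non-negative kernel feature map, the attention computation can be reordered so that its cost grows linearly rather than quadratically with the number of tokens. This is a simple ReLU-kernel illustration, not the exact enhanced linear attention module of EfficientViT.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # O(N^2 * d): materializes an N x N attention map.
    attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    return attn @ v

def linear_attention(q, k, v, eps=1e-6):
    # Generic kernel-based linear attention: with a non-negative feature map
    # phi (ReLU here), (phi(q) phi(k)^T) v is reordered as phi(q) (phi(k)^T v),
    # so the cost is O(N * d^2) instead of O(N^2 * d).
    q, k = F.relu(q), F.relu(k)
    kv = k.transpose(-2, -1) @ v                                  # (d, d) summary of keys/values
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # per-token normalizer
    return (q @ kv) / z

q, k, v = (torch.randn(1, 196, 64) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```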
EdgeViT [126] investigates the design of efficient vision transformers from the perspective of on-device deployment, enabling vision transformers to compete with state-of-the-art CNNs in terms of the accuracy–efficiency trade-off. EdgeViT is designed based on an optimal decomposition of self-attention using standard primitive operations, optimizing EdgeViT towards target hardware to achieve superior accuracy–efficiency trade-offs. With 600 M FLOPs and 4.1 M parameters, EdgeViT achieves 74.4% top-1 accuracy on ImageNet, which is \(+\)2.4% higher than MobileNetV2 under comparable latency constraints on the Samsung Galaxy S21 device.
EdgeNeXt [127] is an efficient hybrid network that marries both worlds of convolutional networks and vision transformers. To better encode global information, EdgeNeXt introduces an efficient split depthwise transpose attention (SDTA) encoder to address the issue of limited receptive fields in CNNs without increasing the number of FLOPs and parameters. EdgeNeXt also leverages adaptive kernel sizes to shrink down the network complexity. With 538 M FLOPs and 2.3 M parameters, EdgeNeXt achieves 75.0% top-1 accuracy on ImageNet, which is comparable to MobileViTv1 [122] in terms of both accuracy and on-device latency.
CastlingViT [128] proposes to (1) train ViT and its variants using both linear-angular attention and masked softmax-based quadratic attention and (2) switch to having only linear-angular attention during the inference in order to save computational resources. The linear-angular attention leverages angular kernels to bridge the accuracy gap between linear attention and softmax-based attention. It expands angular kernels where linear terms are kept while complex high-order residuals are approximated. This aligns with the observation in EfficientViT [125] that the complexity bottleneck of ViT and its variants comes from the excessively involved softmax attention mechanism. To address the complexity bottleneck, CastlingViT replaces softmax attention with linear-angular attention to further improve the efficiency of ViT and its variants. With 490 M FLOPs and 10.5 M parameters, CastlingViT achieves 79.6% top-1 accuracy on ImageNet.
FastViT [129] is an efficient hybrid network that combines both CNNs and vision transformer, which aims to marry the best of both and enable state-of-the-art accuracy–efficiency trade-offs. To this end, FastViT introduces a novel token-mixing operator named RepMixer, which is the basic building block of FastViT that leverages structural reparameterization to reduce the memory access cost by removing the less important skip connections. In addition, FastViT also applies training-time over-parameterization and large kernel convolutions to further boost the accuracy with minimal effect on the inference latency. In practice, structural reparameterization enables FastViT to achieve strong accuracy on target tasks during the training process and maintain superior efficiency on target hardware during the on-device inference process. With 700 M FLOPs and 3.6 M parameters, FastViT achieves 75.6% top-1 accuracy on ImageNet.

2.3 Envisioning Future Trends

In this section, we envision the future trends and possible directions of manual network design, including convolutional networks and transformers, summarized as follows.
(1)
Hardware-Aware Optimization. The trend in the field of network design is to reduce the number of FLOPs. However, the number of FLOPs only represents the theoretical complexity and the reduction in the number of FLOPs does not necessarily lead to the inference speedup on target hardware [122, 123, 126, 130, 131]. For example, PiT [132] has ×3 fewer FLOPs than DeiT [133], but both have similar inference latency on iPhone 12 (i.e., DeiT vs. PiT on iPhone 12: 10.99 ms vs. 10.56 ms) [122]. In parallel, the attention mechanisms are powerful plug-in enhancements in various real-world scenarios [134, 135], such as Squeeze-and-Excitation (SE) [31] in vision tasks and self-attention [90] in NLP tasks, which can further boost the attainable accuracy on target tasks while slightly increasing the number of FLOPs. However, DNNs with attention mechanisms, despite being able to push forward the accuracy on target tasks, introduce considerable extra parameters and are difficult to parallelize on target hardware, especially for transformers that are full of self-attention mechanisms. For example, EfficientViT [125] demonstrates that the prohibitive computational complexity of ViT and its variants comes from excessively used softmax attention. In light of the above, we should focus on optimizing more direct efficiency metrics, such as latency and energy, which may directly benefit real-world embedded computing systems.
(2)
Interpretability and Explainability. Recent manually designed DNNs, including efficient convolutional networks and transformers, have been empirically developed through trial and error. The underlying reason is that DNNs suffer from limited interpretability and explainability [136]. Therefore, to find one decent network solution with competitive accuracy, we have to repeat a plethora of training experiments to evaluate the accuracy of possible network configurations [137, 138], thereby necessitating non-trivial computational resources for repeated training workloads [130, 131]. To avoid this, we should focus in the future on addressing the interpretability and explainability of DNNs to facilitate the network design process and minimize the required engineering efforts.
(3)
Hybrid Multi-modal Networks. Compared with vision transformers, convolutional networks are able to maintain superior efficiency on target hardware but may suffer from inferior accuracy on target tasks. In contrast, vision transformers rely heavily on self-attention mechanisms, which are difficult to parallelize on mainstream embedded computing systems [125, 128]. For example, as demonstrated in EdgeViT [126], under similar FLOPs settings, MobileNetV2 [85] is about ×2 faster on the Samsung Galaxy S21 device than MobileViTv1 [122]. This further hinders the practical deployment of vision transformers in real-world embedded scenarios. In parallel, [139] demonstrates that, similar to transformers, graph neural networks, when properly engineered, can also achieve competitive performance in vision tasks. More importantly, these hybrid networks have the potential to handle various modalities (i.e., different types of input), such as text, images, and audio [140]. For example, convolutional networks are particularly effective at handling spatial data, such as images. In contrast, transformers are better suited for sequential data, such as text. Therefore, in order to achieve better accuracy–efficiency trade-offs and allow diverse input modalities, one natural and promising future direction is to continue exploring hybrid multi-modal networks that combine the strengths of existing representative networks, such as convolutional networks, vision transformers, and graph networks.
(4)
Simpler Training Recipes. As demonstrated in [141], the competitive performance of ViT and its variants highly relies on more advanced training recipes, such as pretraining on larger datasets, more training epochs, stronger data augmentations, and stronger regularization strategies. For example, ViT [107] is first pretrained on ImageNet-21k and JFT and then fine-tuned on ImageNet. Note that ImageNet consists of 1,000 categories, whereas ImageNet-21k has 21,000 categories. This makes it more challenging to train vision transformers under regular training settings and significantly increases the total training cost. Therefore, training vision transformers in a more computation-efficient manner and under simpler training recipes is a promising future direction.
(5)
Adversarial Robustness. In addition to efficiency, adversarial robustness is another desirable network property since efficient networks, especially vision transformers [142], are more sensitive to input perturbations and, as a result, are more vulnerable to adversarial attacks than their less efficient counterparts [143]. Adversarial robustness refers to the ability of the network to maintain its accuracy even when encountering adversarial attacks that are intentionally designed to mislead the network. Adversarial robustness is critical in real-world scenarios, especially for environments that are complex and unpredictable, such as autonomous vehicles. Therefore, innovating efficient yet robust DNNs is a promising future direction in the field of network design.

3 Automated Network Design for Embedded Computing Systems

In contrast to manual network design, automated network design, also known as neural architecture search (NAS) [137], has recently flourished, which strives to automate the design of efficient neural networks. In the past decade, NAS has achieved impressive performance in the field of network design, which delivers more advanced networks with both higher accuracy and efficiency than the conventional manual network design (see Section 2). To this end, in this section, we further discuss recent advances in the field of NAS, especially from the perspective of hardware-aware NAS that searches for hardware-efficient network solutions, including modular search space in Section 3.1, search strategy in Section 3.2, and speedup techniques and extensions in Section 3.3.

3.1 Modular Search Space

The search space \(\mathcal {A}\) plays a prominent role in the success of NAS since the search engine of NAS strives to search for top-performing architecture candidates within the predefined search space. This also indicates that the search space determines the upper-performance limit of modern NAS algorithms. However, designing efficient and effective search spaces is quite difficult and challenging since there are a myriad of possible operator candidates (e.g., \(1\times 1\), \(3\times 3\), \(5\times 5\), and \(7\times 7\) convolutional layers) and different network configurations (e.g., the combination strategies of different operator candidates and the network channel layouts) [137, 138, 144]. Therefore, to reduce the search space size and trim down the search complexity, previous state-of-the-art NAS methods [137, 138, 144] often restrict the search space to allow efficient search and leverage modular search spaces, which are coarse-grained in contrast to layer-wise fine-grained search spaces. Previous state-of-the-art NAS methods are based on the following two representative types of modular search spaces, including cell-based search space and block-based search space.
Cell-Based Search Space. The cell-based search space \(\mathcal {A}\) has been dominating the early success in the field of NAS. The cell-based search space was first introduced by NASNet [144] and DARTS [138]. As defined in NASNet, the cell-based search space consists of two types of cell structures, which are denoted as the normal cell and the reduction cell. In practice, both types of cells are encoded into directed acyclic graphs (DAGs) and maintain the same cell structure as illustrated in Figure 7, except that the reduction cell starts with one convolutional layer with the stride of 2 to reduce the input spatial dimension. Once the cell structure is determined at the end of search, it is then repeatedly stacked to derive the final architecture candidate. In addition, DARTS introduces another type of cell-based search space, which has motivated a plethora of subsequent NAS methods that are also built on top of the same cell-based search space, such as RobustDARTS [145], EdgeNAS [130], PC-DARTS [146], P-DARTS [147], DARTS+ [148], DARTS–[149], FairDARTS [150], and \(\beta\)-DARTS [151]. Similar to NASNet, the cell-based search space in DARTS consists of two types of cells, the normal cell and the reduction cell. As shown in Figure 7, each cell has an ordered sequence of nodes, where each node is a latent representation (e.g., a feature map in convolutional networks) and each directed edge has a set of possible operators \(\lbrace o^{(i, j)}\rbrace\) that transform the input \(x^{(i)}\). In contrast to NASNet, the cell in DARTS is assumed to have two different input nodes and one single output node. With this in mind, we are able to mathematically calculate the intermediate node as follows, which is based on all of its predecessors:
\begin{equation} x^{(j)} = \sum _{i \lt j} o^{(i, j)}(x^{(i)}) . \end{equation}
(1)
Finally, the search space of DARTS contains \(6.3 \times 10^{29}\) possible architecture candidates [130].
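The following is a minimal sketch of how a discretized cell evaluates Equation (1): each intermediate node sums the outputs of the operators applied to its predecessor nodes, and the cell output concatenates the intermediate nodes. The operator assignment below is an arbitrary illustrative choice, not a searched cell.

```python
import torch
import torch.nn as nn

# Minimal sketch of Equation (1): each intermediate node sums candidate operators
# applied to all of its predecessor nodes. The operator choices are illustrative
# placeholders rather than a searched DARTS cell.
C = 16
ops = {
    (0, 2): nn.Conv2d(C, C, 3, padding=1),
    (1, 2): nn.Identity(),
    (0, 3): nn.MaxPool2d(3, stride=1, padding=1),
    (2, 3): nn.Conv2d(C, C, 1),
}

def compute_cell(inputs):
    nodes = list(inputs)  # nodes 0 and 1 are the two cell inputs
    for j in range(2, 4):
        nodes.append(sum(op(nodes[i]) for (i, jj), op in ops.items() if jj == j))
    return torch.cat(nodes[2:], dim=1)  # cell output concatenates intermediate nodes

x0 = torch.randn(1, C, 32, 32)
x1 = torch.randn(1, C, 32, 32)
print(compute_cell([x0, x1]).shape)  # torch.Size([1, 32, 32, 32])
```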
Fig. 7. Illustration of the cell-based search space \(\mathcal {A}\) in NASNet [144] and DARTS [138], in which NASNet assigns operator candidates to nodes and DARTS assigns operator candidates to edges (figure from [41]).
Block-Based Search Space. The block-based search space \(\mathcal {A}\) advocates for simple and diverse network topologies as illustrated in Figure 8, in which each architecture candidate consists of multiple sequential operator candidates. As shown in previous NAS practices, the operator candidates in the block-based search space are usually taken from state-of-the-art manual DNNs, such as MobileNets [32, 85] and ShuffleNets [34, 35]. For example, the block-based search space in ProxylessNAS [40] is built on top of MobileNetV2, whereas the block-based search space in HSCoNAS [152] is built on top of ShuffleNetV2. In parallel, HURRICANE [153] demonstrates that different hardware platforms favor different search spaces, based on which HURRICANE introduces a hybrid block-based search space that combines both MobileNetV2 and ShuffleNetV2 to deliver superior architecture solutions. In contrast to the cell-based search space, the block-based search space is hardware friendly, due to which the block-based search space has been widely adopted in previous hardware-aware NAS methods, such as MnasNet [38], ProxylessNAS [40], OFA [42], HSCoNAS [152], SurgeNAS [154], and LightNAS [131]. The intuition behind this is that the architecture candidate in the cell-based search space consists of multiple parallel branches as shown in Figure 7, which introduce additional overheads in terms of the memory access and, as a result, deteriorate the inference efficiency on target hardware according to the roofline analysis [155]. In addition, in contrast to the cell-based search space that repeatedly stacks the same cell structure across the entire network, the block-based search space allows operator diversity within different blocks, encouraging the search for architecture candidates with better accuracy–efficiency trade-offs [156].
Fig. 8. Illustration of the block-based search space \(\mathcal {A}\), which is based on MobileNetV2 (figure from [38]).

3.2 Search Strategy

In this section, we discuss recent state-of-the-art NAS algorithms and divide them into three main categories: reinforcement learning-based search [137], evolutionary algorithm-based search [157], and gradient-based search (also known as differentiable search) [138].
Reinforcement Learning-Based Search. In the field of NAS, to the best of our knowledge, [137] is the first NAS work that opens up the possibility to automate the design of top-performing DNNs, which features reinforcement learning (RL) [159] as the search engine. Specifically, [137] leverages a simple yet effective recurrent neural network (RNN) as the RL controller to generate possible architecture candidates from the search space as shown in Figure 9. The generated architecture candidate is then trained from scratch on the target task to evaluate the accuracy. Next, the accuracy of the generated architecture candidate is fed back to optimize the RNN controller so that it generates better architecture candidates in the next iteration. Once the search process terminates, the well-optimized RNN controller is able to provide DNNs with superior accuracy on the target task. For example, the network generated by the RNN controller achieves 96.35% top-1 accuracy on CIFAR-10, which is comparable to or even better than the family of manually designed DNNs, such as ResNet [2]. The promising performance of [137] marks an important milestone in the field of NAS, pioneering an effective alternative to automate the design of competitive DNNs.
Fig. 9. Illustration of how the recurrent neural network (RNN) controller samples possible convolutional architecture candidates from the search space in reinforcement learning–based NAS (figure from [137]).
Subsequently, based on [137], NASNet [144] introduces the flexible cell-based search space as shown in Figure 7, which further boosts the attainable accuracy on the target task. For example, NASNet achieves 97.6% top-1 accuracy on CIFAR-10, which is \(+\)1.25% higher than [137] while involving fewer parameters (i.e., 37.4 M in [137] vs. 27.6 M in NASNet). Despite the promising performance, [137] and NASNet have to train a large number of possible architecture candidates from scratch, thus inevitably necessitating prohibitive computational resources. For example, to optimize the RNN controller, [137] needs to train 12,800 stand-alone architecture candidates. To overcome such limitations, ENAS [160] proposes an efficient NAS paradigm dubbed parameter sharing, which forces all the architecture candidates to share network weights to eschew training each architecture candidate from scratch. In practice, this leads to significant reduction in terms of search cost, while at the same time still maintaining strong accuracy on the target task. For example, in [137], one single search experiment takes 3\(\sim\)4 days on 450 NVIDIA GTX 1080 Ti GPUs [144]. In contrast, benefiting from the paradigm of parameter sharing, ENAS is able to find one decent network solution with 97.11% top-1 accuracy on CIFAR-10 and, more importantly, in less than 16 hours on one single NVIDIA GTX 1080 Ti GPU. Thanks to the significant search efficiency, the paradigm of parameter sharing has been dominating subsequent breakthroughs in the NAS community [42, 138, 161].
Although early RL-based NAS methods [137, 144, 160] have had tremendous success in automatic network design, they focus on accuracy-only optimization, ignoring other important performance metrics, such as latency and energy. To search for hardware-efficient network solutions, MnasNet [38] formulates the search process as a multi-objective optimization problem that optimizes both accuracy and latency as shown in Figure 10. To achieve this, MnasNet introduces a flexible block-based search space (see Figure 8) and designs an effective multi-objective RL reward function to optimize the RNN controller. Specifically, the goal of MnasNet is to find Pareto-optimal architecture candidates arch in the search space \(\mathcal {A}\) that maximize the predefined multi-objective RL reward, which can be formulated as follows:
\begin{equation} \mathop {\mathrm{maximize}}_{arch \in \mathcal {A}} \,\,\, Accuracy(arch) \times \left[\frac{Latency(arch)}{T}\right]^{w} , \end{equation}
(2)
where \(Accuracy(\cdot)\) and \(Latency(\cdot)\) denote the accuracy on the target task and the latency on target hardware, respectively. In addition, T is the specified latency constraint. It is worth noting that the latency \(Latency(\cdot)\) in MnasNet is directly measured on target hardware, which involves non-trivial engineering efforts given the prohibitively large search space (e.g., \(|\mathcal {A}| \approx 10^{39}\) in MnasNet) [40, 42]. To avoid the tedious on-device latency measurements, we discuss several efficient latency predictors later in this section. Apart from these, w is the coefficient that controls the trade-off magnitude between accuracy and latency, which is defined as follows:
\begin{equation} \begin{aligned}w = {\left\lbrace \begin{array}{ll} \alpha , & \text{if } Latency(arch) \le T\\ \beta , & \text{otherwise} \end{array}\right.} \end{aligned} , \end{equation}
(3)
where \(\alpha\) and \(\beta\) are application-specific hyperparameters to control the trade-off magnitude between accuracy and efficiency. According to the empirical observation that doubling the latency usually brings \(\sim\)5% relative accuracy improvement, MnasNet assigns \(\alpha = \beta = -0.07\). In practice, \(\alpha\) and \(\beta\) are both sensitive and difficult to tune. Even worse, given new hardware devices or new search spaces, \(\alpha\) and \(\beta\) involve additional engineering efforts for hyperparameter tuning. For example, as observed in MobileNetV3 [162], the accuracy changes much more dramatically with latency for small networks. Therefore, to obtain the required architecture candidate that satisfies the specified latency constraint T, we typically need to repeat 7 search experiments to tune \(\alpha\) and \(\beta\) through trial and error [163]. This significantly increases the total search cost by 7×. To eliminate such additional hyperparameter tuning, TuNAS [163] investigates the multi-objective RL reward in Equation (2) and further introduces a similar RL reward function, which can be formulated as follows:
\begin{equation} \mathop {\mathrm{maximize}}_{arch \in \mathcal {A}} \,\,\, Accuracy(arch) + \gamma \times \left|\frac{Latency(arch)}{T} - 1\right| , \end{equation}
(4)
where \(|\cdot |\) is the absolute function and \(\gamma \lt 0\) is a finite negative value, which controls how strongly we enforce the architecture candidate to maintain the latency close to T.
Fig. 10.
Fig. 10. Overview of MnasNet [38] (figure from [38]).
MONAS [164] also introduces a simple yet effective RL reward function that considers optimizing both accuracy and energy, which can be formulated as follows:
\begin{equation} Reward(arch) = \eta \times Accuracy(arch) - (1 - \eta) \times Energy(arch) , \end{equation}
(5)
where \(\eta \in [0, 1]\) is the coefficient to control the trade-off between accuracy and energy. We note that the RL reward function in Equation (5) aims to find the architecture candidate with high accuracy and low energy, which can be generalized to other performance constraints, such as latency.
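All three RL reward functions above are simple scalar functions of the measured (or predicted) accuracy and efficiency of an architecture candidate. The following minimal Python sketch implements Equations (2)–(5); note that MnasNet reports \(\alpha = \beta = -0.07\), whereas the default values used below for \(\gamma\) and \(\eta\) are illustrative placeholders rather than the values used in the cited papers.

```python
def mnasnet_reward(accuracy, latency, T, alpha=-0.07, beta=-0.07):
    """Multi-objective reward of Equations (2)-(3): acc * (lat/T)^w."""
    w = alpha if latency <= T else beta
    return accuracy * (latency / T) ** w

def tunas_reward(accuracy, latency, T, gamma=-0.07):
    """Absolute-value reward of Equation (4): acc + gamma * |lat/T - 1|.
    The default gamma is a placeholder; gamma must be negative."""
    return accuracy + gamma * abs(latency / T - 1.0)

def monas_reward(accuracy, energy, eta=0.5):
    """Weighted reward of Equation (5): eta * acc - (1 - eta) * energy.
    The default eta is a placeholder."""
    return eta * accuracy - (1.0 - eta) * energy

# Example: a candidate with 75.2% accuracy and 30 ms latency under T = 25 ms
# is penalized relative to an equally accurate candidate that meets T.
print(mnasnet_reward(0.752, 30.0, 25.0))   # < 0.752
print(mnasnet_reward(0.752, 25.0, 25.0))   # == 0.752
```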
Evolutionary Algorithm-Based Search. In addition to reinforcement learning–based search, evolutionary algorithm–based search is another popular branch in the NAS literature thanks to its flexibility, conceptual simplicity, and competitive performance [157]. As seen in the very early evolutionary practices [165, 166, 167, 168], the evolutionary algorithm–based search typically consists of four key steps: (1) sampling a set of possible architecture candidates from the search space as the child population; (2) evaluating the architecture candidates in the child population to interpret the performance, such as accuracy and efficiency; (3) reserving the top-k architecture candidates in the latest child population to form the parent population and discarding the architecture candidates with poor performance; and (4) manipulating the architecture candidates in the latest parent population to generate new architecture candidates to form the next-generation child population. These four steps are repeated until the evolutionary process converges.
There are many other aspects in which the evolutionary algorithm may differ, including (1) how to sample the initial population, (2) how to select the parent population, and (3) how to generate the child population from the parent population. Generating the child population from the parent population is of utmost importance in order to produce superior architecture candidates [157]. In practice, to allow efficient exploration and exploitation [170], crossover and mutation are two of the most popular strategies to generate the child population [171, 172]. Specifically, for crossover, two random architecture candidates from the parent population are crossed to produce one new child architecture candidate. For mutation, one randomly selected architecture candidate mutates its operators with a fixed probability. However, the early evolutionary NAS works have to train a large number of stand-alone architecture candidates from scratch to evaluate the accuracy [157] and, as a result, incur non-trivial computational costs [169].
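The following minimal Python sketch instantiates this generic evolutionary loop, including uniform crossover and per-block mutation. Here, evaluate is a stand-in for any scorer (e.g., accuracy queried from a pretrained supernet or an accuracy predictor), sample can be the sample_architecture helper from the earlier search-space sketch, and all hyperparameters are illustrative.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Per-block uniform crossover: each block is copied from either parent."""
    return [rng.choice([a, b]) for a, b in zip(parent_a, parent_b)]

def mutate(arch, block_choices, rng, prob=0.1):
    """Re-sample each block-level choice independently with a fixed probability."""
    mutated = []
    for block in arch:
        block = dict(block)
        for name, values in block_choices.items():
            if rng.random() < prob:
                block[name] = rng.choice(values)
        mutated.append(block)
    return mutated

def evolutionary_search(evaluate, sample, block_choices,
                        generations=20, population=50, topk=10, seed=0):
    """Minimal evolutionary NAS loop: evaluate -> select top-k -> reproduce."""
    rng = random.Random(seed)
    children = [sample(rng) for _ in range(population)]   # initial child population
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scored = sorted(children, key=evaluate, reverse=True)
        parents = scored[:topk]                            # parent population
        top_score = evaluate(parents[0])
        if top_score > best_score:
            best, best_score = parents[0], top_score
        children = []                                      # next-generation children
        while len(children) < population:
            if rng.random() < 0.5:
                children.append(crossover(rng.choice(parents), rng.choice(parents), rng))
            else:
                children.append(mutate(rng.choice(parents), block_choices, rng))
    return best, best_score
```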
To reduce the required computational resources for neural architecture search, [169] introduces the paradigm of one-shot NAS, which has been widely applied in subsequent NAS methods [40, 42, 138] thanks to its significant search efficiency. In parallel to [169], SMASH [173] also proposes a similar one-shot NAS paradigm, but [169] is much more popular in the NAS community. Specifically, [169] designs an effective one-shot supernet as visualized in Figure 11, which consists of all possible architecture candidates in the search space. Therefore, we only need to train the one-shot supernet, after which we can evaluate different architecture candidates in the search space with inherited network weights from the pretrained one-shot supernet as shown in Figure 12. This effectively avoids needing to train a large number of stand-alone architecture candidates from scratch. In practice, the one-shot supernet is simply trained using the standard SGD optimizer with momentum. Once the one-shot supernet is well trained, it is able to quickly and reliably approximate the performance of different architecture candidates using the paradigm of weight sharing [160]. With the well-trained one-shot supernet, it is straightforward and technically easy to leverage the standard evolutionary algorithm to search for top-performing architecture candidates with superior accuracy on the target task [169]. We note that the searched architecture candidates still need to be retrained or fine-tuned on the target task in order to recover the accuracy for further deployment on target hardware.
Fig. 11.
Fig. 11. Overview of the one-shot supernet. Solid lines mean that the operator candidates are enabled whereas dashed lines mean that the operator candidates are part of the search space but disabled. Here, the one-shot supernet contains all possible architecture candidates in the search space (figure from [169]).
Fig. 12.
Fig. 12. Illustration of the architecture candidate evaluation in one-shot NAS [169] (figure from [169]).
SPOS [161] investigates the one-shot NAS [169] and identifies two critical issues. On the one hand, the network weights in the one-shot supernet are deeply coupled during the training process. On the other hand, joint optimization introduces further coupling between architecture candidates and supernet weights. To address these, SPOS proposes the paradigm of single-path one-shot NAS, which uniformly samples one single-path subnetwork from the supernet and trains the sampled single-path subnetwork instead. This brings two main benefits: (1) reducing memory consumption to the single-path level and (2) improving the performance of the final searched architecture candidate. The success of SPOS has motivated a series of follow-up works [42, 152, 153, 174, 175, 176, 177, 178, 179]. Note that all of these follow-up works [42, 152, 153, 174, 175, 176, 177, 178, 179] focus on training an effective and reliable supernet, which then serves as the evaluator to quickly query the performance of different architecture candidates. For example, FairNAS [174] demonstrates that the uniform sampling strategy only ensures soft fairness, and to achieve strict fairness, FairNAS samples multiple single-path subnetworks to enforce that all the operator candidates in the supernet are equally optimized during each training iteration. In parallel, OFA [42] is another representative evolutionary NAS method that aims to train the supernet, after which we are allowed to detach single-path subnetworks from the supernet with inherited network weights for further deployment on target hardware. Note that the detached subnetwork in OFA still needs to be fine-tuned on the target task for several epochs (e.g., 25 epochs) in order to obtain competitive accuracy. To eliminate the fine-tuning process, BigNAS [175] proposes several enhancements to train one single-stage supernet, where the single-path subnetwork detached from the supernet with inherited network weights can achieve superior accuracy without being retrained or fine-tuned on the target task and can be directly deployed on target hardware. This significantly saves the computational resources required for training stand-alone architecture candidates, especially when targeting multiple different deployment scenarios such as multiple different hardware platforms.
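The single-path training scheme can be sketched in a few lines of PyTorch, as shown below: each supernet layer holds several operator candidates, and every training step uniformly samples and executes exactly one candidate per layer. The operator choices, tensor sizes, and placeholder loss are illustrative only, not the exact configuration used in SPOS.

```python
import random
import torch
import torch.nn as nn

class SinglePathLayer(nn.Module):
    """One supernet layer holding several operator candidates (SPOS-style)."""
    def __init__(self, channels):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in (3, 5, 7)        # illustrative operator choices
        ])

    def forward(self, x, choice):
        # Only the sampled operator is executed, so memory stays at the
        # single-path level during supernet training.
        return self.candidates[choice](x)

class SinglePathSupernet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(SinglePathLayer(channels) for _ in range(depth))

    def forward(self, x, choices):
        for layer, choice in zip(self.layers, choices):
            x = layer(x, choice)
        return x

# One supernet training step: uniformly sample a single-path subnetwork.
supernet = SinglePathSupernet()
optimizer = torch.optim.SGD(supernet.parameters(), lr=0.05, momentum=0.9)
x = torch.randn(2, 16, 32, 32)
choices = [random.randrange(3) for _ in supernet.layers]
loss = supernet(x, choices).mean()   # placeholder loss for illustration
optimizer.zero_grad()
loss.backward()
optimizer.step()
```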
Thanks to its search flexibility, evolutionary algorithm-based NAS can be easily extended to search for hardware-efficient architecture candidates, which maximize the accuracy on the target task while satisfying various real-world performance constraints [153], such as latency, energy, memory, and more. Without loss of generality, we consider the following multi-objective optimization:
\begin{equation} \mathop {\mathrm{maximize}}_{arch \in \mathcal {A}} \,\,\, Accuracy(arch) \,\,\, s.t., \,\,\, Constraint_1(arch) \le C_1, \ldots , Constraint_n(arch) \le C_n , \end{equation}
(6)
where \(\lbrace Constraint_i(\cdot)\rbrace _{i=1}^n\) and \(\lbrace C_i\rbrace _{i=1}^n\) are a set of real-world performance constraints.
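In evolutionary search, the constraints in Equation (6) are typically enforced by simply discarding infeasible candidates during sampling, crossover, and mutation. A minimal sketch is given below, where predict_latency_ms and predict_energy_mj in the usage comment are hypothetical efficiency estimators (e.g., lookup tables or learned predictors), not functions defined in the cited works.

```python
def satisfies_constraints(arch, constraints):
    """Check Equation (6): every Constraint_i(arch) <= C_i must hold."""
    return all(measure(arch) <= bound for measure, bound in constraints)

def sample_feasible(sample, constraints, rng, max_tries=1000):
    """Rejection-sample architecture candidates until one meets all constraints."""
    for _ in range(max_tries):
        arch = sample(rng)
        if satisfies_constraints(arch, constraints):
            return arch
    raise RuntimeError("no feasible candidate found; consider relaxing the constraints")

# Example usage with hypothetical latency/energy predictors:
# constraints = [(predict_latency_ms, 25.0), (predict_energy_mj, 40.0)]
# arch = sample_feasible(sample_architecture, constraints, random.Random(0))
```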
Gradient-Based Search. In addition to reinforcement learning–based search and evolutionary algorithm–based search, gradient-based search [138], also known as differentiable search, is another representative branch of NAS, which has since gained increasing popularity in the NAS community and motivated a plethora of subsequent differentiable NAS works [145, 146, 147, 148, 149, 150, 151, 180, 181, 182, 183, 184, 185, 186, 187, 188], thanks to its significant search efficiency [189]. For example, DARTS [138], as the seminal differentiable NAS work, is able to deliver one superior architecture candidate in \(\sim\)1 day on one single NVIDIA GTX 1080 Ti GPU. In contrast to previous non-differentiable NAS practices [137, 144, 160, 169] that highly rely on discrete search spaces, DARTS leverages a list of architecture parameters \(\alpha\) to relax the discrete search space to become continuous. Benefiting from the continuous search space, both the network weights w and the architecture parameters \(\alpha\) can be optimized via alternating gradient descent. Once the differentiable search process terminates, we can interpret the optimal architecture candidate from the architecture parameters \(\alpha\). Specifically, the supernet in DARTS is initialized by stacking multiple over-parameterized cells (see Figure 13 (1)), in which each cell consists of all possible cell structures in the cell-based search space \(\mathcal {A}\). As shown in Figure 13, each cell is represented using the DAG that consists of N nodes \(\lbrace x_i\rbrace _{i=1}^N\). Note that the nodes here correspond to the intermediate feature maps. In addition, the directed edges between \(x_i\) and \(x_j\) correspond to a list of operator candidates \(\lbrace o | o \in \mathcal {O}\rbrace\) in the operator space \(\mathcal {O}\). The directed edges between \(x_i\) and \(x_j\) are also assigned with a list of architecture parameters \(\lbrace \alpha _o^{(i, j)} | o \in \mathcal {O}\rbrace\). Finally, following DARTS, we formulate \(x_j\) as follows:
\begin{equation} x_j = \sum _{o \in \mathcal {O}} \frac{\exp \alpha _o^{(i, j)}}{\sum _{o^{\prime } \in \mathcal {O}} \exp \alpha _{o^{\prime }}^{(i, j)}} o(x_i) . \end{equation}
(7)
Note that the output \(x_j\) is continuous with respect to \(x_i\), \(\alpha\), and w. In light of this, DARTS proposes to optimize \(\alpha\) and w using the following bilevel optimization scheme:
\begin{equation} \mathop {\mathrm{minimize}}_{\alpha } \,\,\, \mathcal {L}_{val}(w^*(\alpha), \alpha) \,\,\, s.t., \,\,\, w^*(\alpha) = \mathop {\mathrm{arg\,min}}_w \mathcal {L}_{train}(w, \alpha) , \end{equation}
(8)
where \(\mathcal {L}_{train}(\cdot)\) and \(\mathcal {L}_{val}(\cdot)\) are the loss functions on the training and validation datasets, respectively. Once the differentiable search process terminates, DARTS determines the optimal architecture candidate by retaining the strongest operator between \(x_i\) and \(x_j\) and removing the others, in which the strength of operator o is defined as \(\exp \alpha _o^{(i, j)} / \sum _{o^{\prime } \in \mathcal {O}} \exp \alpha _{o^{\prime }}^{(i, j)}\). It is worth noting that the searched optimal architecture candidate still needs to be retrained on the target task in order to recover its accuracy for further deployment on target hardware.
Fig. 13.
Fig. 13. Overview of DARTS [138], which consists of four stages, including (1) initializing w and \(\alpha\) in the supernet, (2) optimizing w and \(\alpha\) via alternating gradient descent, (3) discretizing the optimal architecture candidate from the supernet, and (4) re-training the optimal architecture candidate to recover the accuracy.
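The continuous relaxation in Equation (7) and a first-order approximation of the bilevel scheme in Equation (8) can be sketched in PyTorch as follows. The operator set, tensor shapes, learning rates, and placeholder losses are illustrative, and the sketch covers a single edge rather than a full DARTS supernet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of one edge (Equation (7)): a softmax-weighted
    sum over the operator candidates on the edge."""
    def __init__(self, channels):
        super().__init__()
        # Illustrative operator space O; DARTS uses a richer operator set.
        self.ops = nn.ModuleList([
            nn.Identity(),                                   # skip-connect
            nn.Conv2d(channels, channels, 3, padding=1),     # 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),     # 5x5 conv
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def strongest_op(self):
        """Discretization step: keep only the operator with the largest weight."""
        return self.ops[int(self.alpha.argmax())]

# First-order approximation of Equation (8): alternate between a weight step
# on the training split and an architecture step on the validation split.
edge = MixedOp(channels=8)
w_opt = torch.optim.SGD((p for n, p in edge.named_parameters() if n != "alpha"),
                        lr=0.025, momentum=0.9)
a_opt = torch.optim.Adam([edge.alpha], lr=3e-4)

x_train, x_val = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16)
w_opt.zero_grad()
edge(x_train).mean().backward()   # placeholder training loss
w_opt.step()                      # update w
a_opt.zero_grad()
edge(x_val).mean().backward()     # placeholder validation loss
a_opt.step()                      # update alpha
```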
Inspired by the promising performance of DARTS, a plethora of follow-up works [145, 146, 147, 148, 149, 150, 151, 180, 181, 182, 183, 184, 185, 186, 187, 188] have recently emerged that strive to unleash the power of differentiable NAS to deliver superior architecture candidates. For example, in contrast to DARTS, which simultaneously optimizes all operator candidates in the supernet, PC-DARTS [146] introduces partial channel connections to alleviate the excessive memory consumption of DARTS. In addition, DARTS+ [148] investigates the performance collapse issue of DARTS and finds that the performance collapse issue is caused by the over-selection of skip-connect. To tackle this, DARTS+ proposes a simple yet effective early-stopping strategy to terminate the search process upon fulfilling a set of predefined criteria. In parallel, DARTS– [149] also observes that the performance collapse issue of DARTS comes from the over-selection of skip-connect and further leverages an auxiliary skip connection to mitigate the performance collapse issue and stabilize the search process. Apart from these, Single-DARTS [185] and Gold-NAS [184] investigate the bilevel optimization in Equation (8) and point out that the bi-level optimization may end up with suboptimal architecture candidates, based on which Single-DARTS and Gold-NAS turn back to the one-level optimization. To accelerate the search process, GDAS [186] introduces an efficient Gumbel-Softmax [190]–based differentiable sampling approach to reduce the optimization complexity to the single-path level. Similar to GDAS, SNAS [187] also leverages Gumbel-Softmax reparameterization to improve the search process, which can make use of gradient information from generic differentiable loss without sacrificing the completeness of NAS pipelines. PT-DARTS [182] revisits the architecture selection in differentiable NAS and demonstrates that the architecture parameters \(\alpha\) cannot always imply the optimal architecture candidate, based on which PT-DARTS introduces the perturbation-based architecture selection to determine the optimal architecture candidate at the end of search.
The aforementioned differentiable NAS works [145, 146, 147, 148, 149, 150, 151, 180, 181, 182, 183, 184, 185, 186, 187, 188], however, focus on accuracy-only neural architecture search, which indeed demonstrates promising performance in terms of finding the architecture candidate with competitive accuracy but fails to accommodate the limited available computational resources in real-world embedded scenarios. To overcome such limitations, the paradigm of hardware-aware differentiable NAS [130, 191, 192, 193, 194] has recently emerged, which is based on DARTS and focuses on finding top-performing architecture candidates within the cell-based search space that can achieve both high accuracy on target task and high inference efficiency on target hardware. To achieve this goal, one widely adopted approach is to integrate the latency-constrained loss term into the overall loss function to penalize the architecture candidate with high latency, which can be mathematically formulated as follows:
\begin{equation} \mathop {\mathrm{minimize}}_{\alpha } \,\,\, \mathcal {L}_{val}(w^*(\alpha), \alpha) + \lambda \cdot Latency(\alpha) \,\,\, s.t. \,\,\, w^*(\alpha) = \mathop {\mathrm{arg\,min}}_w \mathcal {L}_{train}(w, \alpha) \end{equation}
(9)
where \(\lambda\) is the trade-off coefficient to control the trade-off magnitude between accuracy and latency. As demonstrated in [131, 156], a larger \(\lambda\) ends up with the architecture candidate that maintains low accuracy and low latency, whereas a smaller \(\lambda\) leads to the architecture candidate with high accuracy and high latency. \(Latency(\alpha)\) corresponds to the latency of the architecture candidate encoded by \(\alpha\). We note that the optimization objective in Equation (9) can be easily generalized to jointly optimize other types of hardware performance constraints, such as energy and memory consumption, in which we only need to incorporate \(Energy(\alpha)\) and \(Memory(\alpha)\) into the optimization objective in Equation (9). For example, we can reformulate the optimization objective in Equation (9) as follows to jointly optimize the on-device latency, energy, and memory consumption:
\begin{equation} \mathop {\mathrm{minimize}}_{\alpha } \,\,\, \mathcal {L}_{val}(w^*(\alpha), \alpha) + \lambda _1 \cdot Latency(\alpha) + \lambda _2 \cdot Energy(\alpha) + \lambda _3 \cdot Memory(\alpha) , \end{equation}
(10)
where \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are trade-off coefficients to determine the trade-off magnitudes between accuracy and latency, energy, and memory, respectively.
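In practice, the latency term in Equations (9) and (10) is often instantiated as a differentiable expected latency, that is, the softmax-weighted sum of pre-collected per-operator latencies (e.g., from a lookup table), so that gradients can flow into the architecture parameters. The following sketch illustrates this formulation; the trade-off coefficients and latency numbers are placeholders.

```python
import torch
import torch.nn.functional as F

def expected_latency(alphas, latency_lut):
    """Differentiable latency estimate: for each searchable block, take the
    softmax-weighted sum of the pre-collected per-operator latencies."""
    total = 0.0
    for alpha, op_latencies in zip(alphas, latency_lut):
        total = total + (F.softmax(alpha, dim=0) * op_latencies).sum()
    return total

def hardware_aware_loss(val_loss, alphas, latency_lut,
                        lambda_lat=0.1, lambda_energy=0.0, energy_lut=None):
    """Equations (9)-(10): task loss plus weighted efficiency penalties.
    The coefficients are placeholders and must be tuned per task/hardware."""
    loss = val_loss + lambda_lat * expected_latency(alphas, latency_lut)
    if energy_lut is not None:
        loss = loss + lambda_energy * expected_latency(alphas, energy_lut)
    return loss

# Example: 3 searchable blocks, 3 operator candidates each, latencies in ms.
alphas = [torch.zeros(3, requires_grad=True) for _ in range(3)]
latency_lut = [torch.tensor([1.2, 2.5, 4.1]) for _ in range(3)]
loss = hardware_aware_loss(torch.tensor(0.9), alphas, latency_lut)
loss.backward()   # gradients flow into the architecture parameters alphas
```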
Despite the significant progress to date, the aforementioned hardware-aware differentiable NAS works [130, 191, 192, 193, 194] highly rely on the cell-based search space; these works first determine the optimal cell structure and then repeatedly stack the same cell structure across the entire network [138]. However, as demonstrated in MnasNet [38], such NAS practices suffer from inferior accuracy and efficiency due to the lack of operator diversity. Even worse, the architecture candidates in the cell-based search space have multiple parallel branches as shown in Figure 7, which introduce considerable memory access overheads and, as a result, have difficulty benefitting from the high computational parallelism on mainstream hardware platforms [34, 35]. To overcome such limitations, recent hardware-aware differentiable NAS works [39, 40, 154, 188, 195, 196, 197, 198, 199] have shifted their attention from the cell-based search space (see Figure 7) to the block-based search space (see Figure 8). The most representative include FBNet [39], ProxylessNAS [40], SP-NAS [197], and TF-NAS [195]. Similar to GDAS [186] and SNAS [187], FBNet leverages Gumbel-Softmax reparameterization [190] to relax the discrete search space to be continuous. FBNet collects a simple yet effective latency lookup table to quickly approximate the latency of different architecture candidates. The pre-collected latency lookup table is then integrated into the search process to derive hardware-efficient architecture candidates. However, similar to DARTS [138], FBNet needs to simultaneously optimize all the operator candidates in the supernet during the search process, which is not scalable to large search spaces and suffers from the memory bottleneck [40, 131]. In light of this, ProxylessNAS introduces an effective path-level binarization approach to reduce the memory consumption to the single-path level, which significantly improves search efficiency without compromising search accuracy. In parallel, SP-NAS demonstrates that different operator candidates in the supernet can be viewed as subsets of an over-parameterized superkernel, based on which SP-NAS proposes to encode all the operator candidates into the superkernel. In practice, this explicitly reduces memory consumption to the single-path level, which alleviates the memory bottleneck during the search process. TF-NAS thoroughly investigates the three search freedoms in hardware-aware differentiable NAS: (1) operator-level search, (2) depth-level search, and (3) width-level search as shown in Figure 14, which enables fine-grained architecture search. To obtain hardware-efficient architecture candidates, TF-NAS integrates the pre-collected latency lookup table into the search process. TF-NAS also introduces a simple yet effective bi-sampling search algorithm to accelerate the search process towards enhanced search efficiency.
Fig. 14.
Fig. 14. Overview of TF-NAS [195], which investigates the three search freedoms in conventional hardware-aware differentiable NAS: (1) operator-level, (2) depth-level, and (3) width-level (figure from [195]).
Even so, we should consider not only the explicit search cost, the time required for one single search experiment, but also the implicit search cost, the time required for manual hyperparameter tuning in order to find the desired architecture candidate. This is because, in real-world embedded scenarios such as autonomous vehicles, DNNs must be executed under strict latency constraints (e.g., 24 ms), in which any violation may lead to catastrophic consequences [20, 163]. However, to find the architecture candidate with the latency of 24 ms, the aforementioned hardware-aware differentiable NAS works [39, 40, 130, 154, 188, 191, 192, 193, 194, 195, 196, 197, 198, 199] have to repeat a plethora of search experiments to tune the trade-off coefficient \(\lambda\) (see Equation (9)) through trial and error [131, 156], which significantly increases the total search cost. The intuition behind this is that \(\lambda\), despite being able to trade off between accuracy and latency, is quite sensitive and difficult to control [131, 156]. To overcome such limitations, HardCoRe-NAS [200] leverages an elegant Block Coordinate Stochastic Frank-Wolfe (BCSFW) algorithm [201] to restrict the search direction around the specified latency requirement. In addition, LightNAS [131, 156] introduces a simple yet effective hardware-aware differentiable NAS approach, which investigates the optimization objective in Equation (9) and proposes to optimize the trade-off coefficient \(\lambda\) during the search process in order to satisfy the specified latency requirement. In other words, LightNAS focuses on automatically learning \(\lambda\) that strictly complies with the specified latency requirement, which is able to find the required architecture candidate in one single search (i.e., you only search once) and avoids performing manual hyperparameter tuning over \(\lambda\). The optimization objective of LightNAS is formulated as follows:
\begin{equation} \mathop {\mathrm{minimize}}_{\alpha } \,\,\, \mathcal {L}_{val}(w^*(\alpha), \alpha) + \lambda \cdot \left(\frac{Latency(\alpha)}{T} - 1\right)\,\,\,s.t. \,\,\, w^*(\alpha) = \mathop {\mathrm{arg\,min}}_w \mathcal {L}_{train}(w, \alpha), \end{equation}
(11)
where T is the specified latency requirement. In contrast to previous hardware-aware differentiable NAS works [39, 40, 130, 154, 188, 191, 192, 193, 194, 195, 196, 197, 198, 199], \(\lambda\) in Equation (11) is not a constant but rather a learnable hyperparameter that can be automatically optimized during the search process. For the sake of simplicity, below we use \(\mathcal {L}(w, \alpha , \lambda)\) to denote the optimization objective in Equation (11). Finally, to satisfy the specified latency requirement (i.e., \(Latency(\alpha) = T\)), w and \(\alpha\) are updated using gradient descent [138], whereas \(\lambda\) is updated using gradient ascent as follows:
\begin{equation} {\left\lbrace \begin{array}{ll} w^* = w - lr_w \cdot \frac{\partial \mathcal {L}(w, \alpha , \lambda)}{\partial w}, \,\, \alpha ^* = \alpha - lr_{\alpha } \cdot \frac{\partial \mathcal {L}(w, \alpha , \lambda)}{\partial \alpha } \\ \lambda ^* = \lambda + lr_{\lambda } \cdot \frac{\partial \mathcal {L}(w, \alpha , \lambda)}{\partial \lambda } = \lambda + lr_{\lambda } \cdot \left(\frac{Latency(\alpha)}{T} - 1\right) \end{array}\right.} , \end{equation}
(12)
where \(lr_{w}\), \(lr_{\alpha }\), and \(lr_{\lambda }\) are the learning rates of w, \(\alpha\), and \(\lambda\), respectively. Below we further demonstrate why LightNAS guarantees \(Latency(\alpha) = T\). As shown in LightNAS, a larger \(\lambda\) leads to the architecture candidate with low latency, whereas a smaller \(\lambda\) results in the architecture candidate with high latency. Therefore, if \(Latency(\alpha) \gt T\), the gradient ascent scheme increases \(\lambda\) to reinforce the latency regularization magnitude. As a result, \(Latency(\alpha)\) decreases towards T in the next search iteration. Likewise, if \(Latency(\alpha) \lt T\), the gradient ascent scheme decreases \(\lambda\) to diminish the latency regularization magnitude, after which \(Latency(\alpha)\) increases towards T in the next search iteration. Finally, the search engine ends up with the architecture candidate that strictly satisfies the specified latency requirement (i.e., \(Latency(\alpha) = T\)). More recently, Double-Win NAS [202] proposed deep-to-shallow transformable search to further marry the best of both deep and shallow networks towards an aggressive accuracy–efficiency win–win. Similar to LightNAS [131, 156], the resulting shallow network can also satisfy the specified latency constraint.
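To make the coupled updates in Equations (11) and (12) concrete, the following sketch performs one LightNAS-style search step, assuming that val_loss and latency are the differentiable task loss and the differentiable latency estimate (e.g., the expected latency from the previous sketch), and that w_opt and a_opt are the optimizers over w and \(\alpha\); lr_lambda is a placeholder.

```python
def lightnas_step(val_loss, latency, T, w_opt, a_opt, lam, lr_lambda=0.01):
    """One LightNAS-style update of Equations (11)-(12): gradient descent on the
    network weights w and architecture parameters alpha (through their
    optimizers), plus gradient ascent on the trade-off coefficient lambda."""
    loss = val_loss + lam * (latency / T - 1.0)   # Equation (11)

    w_opt.zero_grad()
    a_opt.zero_grad()
    loss.backward()
    w_opt.step()   # descent step on w
    a_opt.step()   # descent step on alpha

    # Gradient ascent on lambda: d(loss)/d(lambda) = Latency(alpha)/T - 1, so
    # lambda grows when the candidate is slower than T and shrinks otherwise.
    return lam + lr_lambda * (float(latency.detach()) / T - 1.0)
```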
Finally, Table 1 compares previous representative hardware-aware NAS works.

Table 1. Comparisons of Representative Hardware-aware NAS Works

| Method | Search Space | Search Strategy | Search Dataset | Search Cost (GPU Hours) | GPU | Target Hardware | Hardware Modeling | ImageNet FLOPs (M) | ImageNet Top-1 Acc (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MnasNet [38] | Block | Reinforce | ImageNet | 40,000 | V100 | Mobile Phones | N/A | 312 | 75.2 |
| ProxylessNAS [40] | Block | Gradient | ImageNet | 200 | V100 | GPUs, CPUs, and Mobile Phones | LUT | N/A | 75.1 |
| MobileNetV3 [162] | Block | Evolution | ImageNet | N/A | N/A | Mobile Phones | N/A | 219 | 75.2 |
| FBNet [39] | Block | Gradient | ImageNet | 216 | N/A | Mobile Phones | LUT | 375 | 74.9 |
| TuNAS [163] | Block | Reinforce | ImageNet | N/A | N/A | Mobile Phones | LUT | N/A | 75.4 |
| OFA [42] | Block | Evolution | ImageNet | 1,200 | V100 | GPUs, CPUs, Edge GPUs, and Mobile Phones | LUT | 230 | 76.0 |
| SP-NAS [197] | Block | Gradient | ImageNet | 30 | TPU | Mobile Phones | LUT | N/A | 75.0 |
| LA-DARTS [191] | Cell | Gradient | CIFAR-10 | 17 | P100 | GPUs and CPUs | Predictor | 575 | 74.8 |
| MDARTS [194] | Cell | Gradient | CIFAR-10 | \(\sim\)6.5 | Titan XP | Eyeriss | Predictor | N/A | N/A |
| EH-DNAS [203] | Cell | Gradient | CIFAR-10 | 24 | 1080 Ti | Customized Accelerators | Predictor | 840 | 69.6 |
| E-DNAS [204] | Block | Gradient | ImageNet | N/A | V100 | CPUs and DSPs | Predictor | 365 | 76.9 |
| SNAS [198] | Block | Gradient | ImageNet | 30 | N/A | TPUs | Predictor | 1,290 | 79.4 |
| HSCoNAS [152] | Block | Evolution | ImageNet | N/A | N/A | GPUs, CPUs, and Edge GPUs | LUT | N/A | 74.9 |
| DenseNAS [199] | Block | Gradient | ImageNet | 64 | Titan XP | GPUs | LUT | 361 | 75.3 |
| TF-NAS [195] | Block | Gradient | ImageNet | 43 | Titan RTX | GPUs | LUT | 284 | 75.2 |
| HardCoRe-NAS [200] | Block | Gradient | ImageNet | 400 | P100 | GPUs and CPUs | LUT | N/A | 75.7 |
| LightNAS [131] | Block | Gradient | ImageNet | 10 | RTX 3090 | Edge GPUs | Predictor | N/A | 75.2 |
| SurgeNAS [154] | Block | Gradient | ImageNet | 30 | V100 | GPUs, CPUs, and Edge GPUs | Predictor | N/A | 75.5 |
| SPOS [161] | Block | Evolution | ImageNet | 288 | 1080 Ti | GPUs | LUT | 328 | 74.7 |
| HURRICANE [153] | Block | Evolution | ImageNet | N/A | N/A | CPUs, DSPs, and VPUs | LUT | 409 | 75.1 |
| ProxyNAS [176] | Block | Evolution | ImageNet | N/A | N/A | GPUs, CPUs, TPUs, and FPGAs | Predictor | N/A | N/A |

This table roughly compares different hardware-aware NAS works; N/A means that the related data is not reported in the respective paper. Note that the accuracy in this table may be obtained under different training recipes.

3.3 Speedup Techniques and Extensions

In this section, we discuss recent state-of-the-art advances in general speedup techniques and extensions for NAS algorithms, including one-shot NAS enhancements, efficient latency prediction, efficient accuracy prediction, low-cost proxies, zero-cost proxies, efficient transformer search, efficient domain-specific search, and mainstream NAS benchmarks, which have the potential to significantly benefit NAS algorithms and largely facilitate the search process.
Beyond One-Shot NAS. Despite the high search efficiency, one-shot NAS often suffers from poor ranking correlation between one-shot search and stand-alone training. As pointed out in [205], one-shot search results do not necessarily correlate with stand-alone training results across various search experiments. To overcome such limitations, a plethora of one-shot NAS enhancements have been proposed recently [206, 207, 208, 209, 210, 211]. Specifically, [206, 207, 208, 209, 210] resort to few-shot NAS. In contrast to one-shot NAS [160], which only features one supernet, few-shot NAS introduces multiple supernets to explore different regions of the predefined search space, which slightly increases the search cost over one-shot NAS but can deliver much more reliable search results. For example, as shown in [206], with only up to 7 supernets, few-shot NAS can establish new state-of-the-art search results on ImageNet. Among them, [209] demonstrates that zero-cost proxies can be integrated into few-shot NAS, which can further enhance the search process of one-shot NAS and thus produce better search results. More recently, [208] generalizes few-shot NAS to distill LLMs, which focuses on automatically distilling multiple compressed student models under various computational budgets from a large teacher model. In contrast to few-shot NAS, which leverages multiple supernets to improve the ranking correlation performance of one-shot NAS, CLOSE [211] instead features an effective curriculum learning-like schedule to control the parameter-sharing extent within the proposed supernet dubbed CLOSENet, in which the parameter-sharing extent can be flexibly adjusted during the search process and the parameter-sharing scheme is built upon an efficient graph-based encoding scheme.
Efficient Latency Prediction.4 As seen in MnasNet [38], latency is directly measured on target hardware, which is then integrated into the RL reward (see Equation (2)) to penalize the architecture candidate with high latency. The direct on-device latency measurement is indeed accurate. However, it is time-consuming and unscalable to large search spaces [40]. To overcome such limitations, several latency prediction strategies have been proposed recently. For example, ProxylessNAS [40], FBNet [39], and OFA [42] leverage the latency lookup table to approximate the on-device latency, which sums up the latency of all the operator candidates. In addition, HSCoNAS [152, 178] demonstrates that the data movements and communications among different operator candidates introduce additional latency overheads, making the pre-collected latency lookup table inaccurate. To mitigate this issue, HSCoNAS quantifies the latency that corresponds to the intermediate data movements and communications, which is then fed into the pre-collected latency lookup table to achieve more accurate latency prediction performance. However, the latency lookup table is only applicable to the block-based search space, which leads to unreliable latency prediction performance in terms of the cell-based search space [213]. To this end, EdgeNAS [130], LA-DARTS [191], and LC-NAS [192] propose to use learning-based approaches for the purpose of latency prediction. For example, EdgeNAS trains an efficient multi-layer perceptron (MLP) to predict the latency of different architecture candidates in the cell-based search space, which can also be generalized to predict the latency of different architecture candidates in the block-based search space as shown in [131, 156, 176, 212, 214]. BRP-NAS [213] and SurgeNAS [154] introduce graph neural network (GNN)–based latency predictors to achieve more reliable latency prediction performance. The above latency predictors (1) rely on a large number of training samples to achieve decent latency prediction performance (e.g., 100,000 training samples in EdgeNAS) and (2) need to be reconstructed for either new hardware or new search spaces. To avoid these, HELP [215] and MAPLE-Edge [216] focus on building an efficient latency predictor using only a few training samples (e.g., as few as 10 training samples in HELP), which can be generalized to new hardware or new search spaces with only minimal re-engineering efforts. More recently, EvoLP [217] considered an effective self-evolving scheme to construct efficient yet accurate latency predictors, which can adapt to unseen hardware with only minimal re-engineering efforts.
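As a concrete example of the learning-based predictors above, the following PyTorch sketch trains a small MLP that maps a flat architecture encoding to a latency estimate. The encoding dimension, network sizes, and randomly generated training pairs are placeholders for real architecture encodings and on-device measurements.

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Learned latency predictor: a small MLP that maps a flat architecture
    encoding (e.g., one-hot per-block choices) to a latency estimate in ms."""
    def __init__(self, encoding_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(encoding_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, encoding):
        return self.mlp(encoding).squeeze(-1)

# Training on (encoding, measured latency) pairs collected from target hardware;
# the random tensors below stand in for real encodings and measurements.
predictor = LatencyPredictor(encoding_dim=60)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
encodings = torch.rand(256, 60)
latencies = torch.rand(256) * 30.0
for _ in range(10):
    loss = nn.functional.mse_loss(predictor(encodings), latencies)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```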
Efficient Accuracy Prediction. In parallel to latency prediction, accuracy prediction has also received increasing attention from the NAS community [213, 218, 219, 220, 221, 222], which strives to directly predict the accuracy of different architecture candidates in the search space. Specifically, [218] introduces a simple yet effective graph convolutional network (GCN)–based accuracy predictor, which can achieve reliable accuracy prediction performance thanks to GCNs’ strong capability to learn graph-structured data. Similar to [218], BRP-NAS [213] also considers GCNs for reliable accuracy prediction, which introduces transfer learning to further improve the accuracy prediction performance from the pretrained latency predictor. In parallel, [219] leverages the non-neural network (i.e., gradient-boosted decision tree (GBDT)) as the accuracy predictor, which has a stronger capability to learn representations than neural network–based accuracy predictors. In addition, NASLib [220] investigates a wide range of accuracy predictors from learning curve extrapolation, weight-sharing, supervised learning, and zero-cost proxies on three popular NAS benchmarks (i.e., NAS-Bench-101 [223], NAS-Bench-201 [224], and NAS-Bench-NLP [225]). NASLib reveals that different accuracy predictors can be combined to achieve substantially better accuracy prediction performance than any single accuracy predictor. DONNA [221] proposes to build an efficient accuracy predictor, which involves only minimal computational resources and, more importantly, can scale to diverse search spaces. To achieve this, DONNA uses blockwise knowledge distillation to construct an architecture candidate pool in which each architecture candidate only needs to be fine-tuned for several epochs to derive the accuracy rather than being trained from scratch. In contrast to the aforementioned accuracy predictors that feature graph-based encoding schemes, GATES [222, 226] instead models the operations as the transformation of the propagating information, which can effectively mimic the actual data processing of different neural architecture candidates. More importantly, the encoding scheme of GATES can be integrated into the above accuracy predictors to further boost their accuracy prediction performance. Similar to GATES, TA-GATES [227] introduces an effective encoding scheme with analogous modeling of the training process of different neural architecture candidates, which can achieve better accuracy prediction performance than GATES on various representative NAS benchmarks.
Low-Cost Proxies (Learning Curve Extrapolations). Low-cost proxies, also referred to as learning curve extrapolations [231], aim to interpret the accuracy of the given architecture candidate only using its early training statistics, such as the training loss in the first few training epochs, which has motivated a plethora of subsequent works to continue exploring learning curve extrapolation [156, 232, 233, 234, 235, 236, 237]. For example, in contrast to the conventional accuracy predictor that only uses the network configuration as input features, [236] proposes to combine the network configuration and a series of validations of accuracy in the first few training epochs as input features to train a simple regression model, which can be generalized to predict the accuracy of unseen architecture candidates. In addition, [232] introduces Training Speed Estimation (TSE), which simply accumulates the early training statistics to achieve reliable yet computationally inexpensive ranking among different architecture candidates. The work of [156, 237] introduces Batchwise Training Estimation (BTE) and Trained Batchwise Estimation (TBE), both of which consider the fine-grained batchwise training statistics to provide more reliable prediction performance using minimal computational resources. In parallel, [234] introduces Loss Curve Gradient Approximation (LCGA) to rank the accuracy of different architecture candidates with minimal training. The work of [233] introduces NAS-Bench-x11 to unleash the power of learning curve extrapolation by predicting the training trajectories, which can be easily integrated into the aforementioned learning curve extrapolation works to quickly estimate the performance of the given architecture candidate.
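The following sketch illustrates the general recipe behind these low-cost proxies: concatenate the network configuration with early-epoch validation accuracies and fit a lightweight regressor to predict the final accuracy. The simple least-squares model below is a stand-in for the regression models used in the cited works.

```python
import numpy as np

def learning_curve_features(arch_encoding, early_val_acc):
    """Build predictor inputs in the spirit of [236]: concatenate the network
    configuration with validation accuracies from the first few epochs."""
    return np.concatenate([np.asarray(arch_encoding, dtype=np.float32),
                           np.asarray(early_val_acc, dtype=np.float32)])

def fit_linear_extrapolator(features, final_acc):
    """Fit a least-squares regressor mapping early statistics to final accuracy."""
    X = np.stack(features)                        # (num_candidates, feature_dim)
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(X, np.asarray(final_acc, dtype=np.float32), rcond=None)
    return w

def predict_final_acc(w, feature):
    """Predict the final accuracy of an unseen candidate from its early statistics."""
    return float(np.append(feature, 1.0) @ w)
```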
Zero-Cost Proxies.5 In addition to the above low-cost proxies (i.e., learning curve extrapolation), zero-cost proxies have recently flourished [228, 229, 230, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247], which focus on interpreting the performance of the given architecture candidate in training-free manners. Zero-cost proxies, such as EPE [240], Fisher [241], GradNorm [238], Grasp [242], Jacov [243], Snip [244], Synflow [245], ZenScore [246], LRC [228], and NTK [229], can provide reliable performance estimation using only one single mini-batch of data and one single forward/backward propagation pass, which necessitate near-zero computational cost [230, 238, 239]. Thanks to their reliable performance estimation and low cost, these zero-cost proxies have been widely adopted in recent NAS works to accelerate the search process [230, 243, 247]. As demonstrated in [230, 238], combining different zero-cost proxies may lead to more reliable ranking performance estimation than any single zero-cost proxy. For example, as shown in Figure 15, combining LRC and NTK provides more reliable ranking performance estimation than LRC or NTK separately. In light of this, TE-NAS [230] further leverages LRC and NTK to jointly estimate the ranking performance among different architecture candidates in the search space, which quickly ends up with the optimal architecture candidate on ImageNet in less than 4 hours on one single NVIDIA GTX 1080 Ti GPU.
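As a concrete example, the following sketch computes a GradNorm-style zero-cost proxy [238]: one forward/backward pass on a single mini-batch, followed by summing the gradient norms of all weights. The toy candidate network and batch are placeholders; in practice, the proxy is evaluated for every sampled architecture candidate and used to rank them.

```python
import torch
import torch.nn as nn

def gradnorm_proxy(model, inputs, targets, loss_fn=nn.CrossEntropyLoss()):
    """GradNorm-style zero-cost proxy: one forward/backward pass on a single
    mini-batch, then sum the L2 norms of all weight gradients. Higher scores
    are taken to indicate more trainable architecture candidates."""
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    score = 0.0
    for p in model.parameters():
        if p.grad is not None:
            score += p.grad.norm(p=2).item()
    model.zero_grad()
    return score

# Example on a toy candidate network and a single random mini-batch.
candidate = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
print(gradnorm_proxy(candidate, x, y))
```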
Fig. 15.
Fig. 15. Comparisons of different zero-cost proxies on CIFAR-100: (a) LRC [228], (b) NTK [229], and (c) LRC + NTK [230], in which the accuracy is queried from NAS-Bench-201 [224] (figure from [230]).
Efficient Transformer Search. In addition to CNNs, transformers are another important branch of DNNs. Inspired by the tremendous success of NAS in searching for superior CNNs, automated transformer search has gained increasing popularity, which applies NAS techniques to automatically search for superior transformers, including transformers for NLP tasks [93, 249, 250, 251, 252, 253, 254] and vision transformers for vision tasks [248, 255, 256, 257, 258, 259, 260]. Automated transformer search is technically the same as automated convolutional network search, in which both feature the same search pipeline. For example, HAT [93], as one of the state-of-the-art NAS works in the field of NLP, focuses on searching for hardware-efficient transformers for NLP tasks. To achieve this, HAT first initializes an over-parameterized superformer that consists of all possible transformer candidates in the search space, which is technically the same as the supernet in automated convolutional network search. After that, HAT trains the superformer using the standard weight-sharing technique [160], which then serves as the accuracy predictor to quickly interpret the accuracy of different transformer candidates. Next, HAT builds an efficient latency predictor to avoid the tedious on-device latency measurement. Finally, HAT applies the standard evolutionary algorithm to find hardware-efficient transformer candidates with both high accuracy and high efficiency, which is technically the same as OFA [42], which searches for hardware-efficient convolutional networks. Furthermore, due to the tremendous success of vision transformers in vision tasks as discussed in Section 2.2, a plethora of NAS works [248, 255, 256, 257, 258, 259, 260] have been subsequently proposed to automate the design of superior vision transformers. The work of [248], being the first, introduces an evolutionary algorithm-based NAS framework dubbed AutoFormer. Similar to HAT, AutoFormer first constructs an over-parameterized superformer that consists of all possible vision transformer candidates in the search space, which is then trained using the weight entanglement scheme. The difference between weight sharing and weight entanglement is visualized in Figure 16, in which weight entanglement is technically similar to the superkernel in SP-NAS [197]. Finally, AutoFormer applies the standard evolutionary algorithm to explore the optimal vision transformer candidate. These clearly demonstrate that we can easily leverage recent state-of-the-art NAS techniques that focus on searching for competitive CNNs to automate the design of top-performing transformers for both NLP and vision tasks.
Fig. 16.
Fig. 16. Illustration of weight sharing [160] and weight entanglement [248] (figure from [248]).
Efficient Domain-Specific Search. In addition to image classification, NAS can also be applied to a wide range of real-world scenarios, such as object detection [268, 269, 270], semantic segmentation [271, 272, 273], point cloud processing [192, 274, 275, 276], image super-resolution [277, 278], and more. For example, MobileDets [268] are a family of hardware-efficient object detection networks that can deliver promising detection accuracy while maintaining superior detection efficiency on multiple embedded computing systems, including mobile central processing units (CPUs), edge tensor processing units (TPUs), and edge GPUs. MobileDets first construct an enlarged search space that contains a large number of possible object detection networks and then leverage an MnasNet-like reinforcement learning–based search algorithm [38] to find top-performing object detection networks, which also feature the same reward function as TuNAS [163] to trade off between detection accuracy and efficiency. The work of [192] introduces an efficient hardware-aware differentiable NAS framework dubbed LC-NAS, aiming to automate the design of competitive network solutions for point cloud processing. Here, similar to EdgeNAS [130] and LA-DARTS [191], which focus on finding top-performing architecture candidates for image classification, LC-NAS exploits the same cell-based search space and integrates the latency constraint into the optimization objective to penalize the architecture candidate with high latency. These demonstrate that we can easily include domain-specific knowledge (e.g., domain-specific search spaces) into mainstream NAS techniques (e.g., differentiable, evolutionary algorithm–based, and reinforcement learning–based NAS) to search for domain-specific network solutions.
Mainstream NAS Benchmarks. Although NAS has achieved substantial performance improvement across various NLP and vision tasks, fair comparisons between different NAS works are frustratingly hard and still an open issue, as demonstrated in [279]. This is because different NAS works may feature quite different training recipes, such as different training epochs and training enhancements. For example, DARTS+ [148] trains the searched architecture candidate on CIFAR-10 for 2,000 epochs, whereas DARTS [138] only applies 600 training epochs. DARTS+ trains the searched architecture candidate on ImageNet for 800 epochs, with a batch size of 2,048, where AutoAugment [280] is also integrated in order to achieve stronger data augmentations. In contrast, DARTS only applies 250 training epochs with a batch size of 128 by default. We note that, for the same architecture candidate, longer training epochs and stronger data augmentations typically achieve better training accuracy on the target task, as shown in [279]. RandomNAS [281] challenges the effectiveness of early state-of-the-art NAS works and demonstrates that random search, as one strong search baseline to explore random networks, can achieve even better performance on the target task than early state-of-the-art NAS works. In parallel, RandWire [282] shows that randomly wired networks can also exhibit strong accuracy on ImageNet. Therefore, it remains unknown whether the performance improvement of NAS is due to the more advanced training recipe or the search algorithm itself, making it difficult to evaluate and compare the technical contributions of different NAS works [223, 224, 279].
To overcome such limitations, a plethora of tabular and surrogate NAS benchmarks have been proposed: NAS-Bench-101 [223], NAS-Bench-201 [224], NATS-Bench [261], NAS-Bench-301 [262], NAS-Bench-360 [263], NAS-Bench-1Shot1 [264], NAS-Bench-ASR [265], NAS-Bench-Graph [266], NAS-Bench-NLP [225], HW-NAS-Bench [214], NAS-Bench-x11 [233], and NAS-Bench-Suite [267]. We note that NAS benchmarks typically have two important parts, the predefined search space and the related performance metrics for all possible architecture candidates that can be easily queried. In tabular NAS benchmarks [214, 223, 224, 225, 261, 263, 264, 265, 266], all possible architecture candidates are enumerated and trained from scratch on the target task to obtain the performance metrics, such as the training and validation accuracy. In contrast, surrogate NAS benchmarks [214, 233, 262, 267] leverage learning-based methods to predict the performance metrics of different architecture candidates rather than directly enumerating and training all possible architecture candidates on the target task, thus leading to significantly reduced computational resources. In light of this, surrogate NAS benchmarks can be easily extended to deal with larger search spaces than tabular NAS benchmarks (\(10^{18}\) in NAS-Bench-301 [262] vs. 15,625 in NAS-Bench-201 [224]). Finally, we compare and summarize the aforementioned state-of-the-art NAS benchmarks in Table 2.
Table 2. Comparisons of Different NAS Benchmarks

| Benchmark | Search Space Size | Search Space Type | Type (Tabular/Surrogate) | Tasks | Datasets | Metrics |
| --- | --- | --- | --- | --- | --- | --- |
| NAS-Bench-101 [223] | 423k | Cell-Based | Tabular | Image Classification | CIFAR-10 | Training Accuracy, Validation Accuracy, Testing Accuracy, Training Time, and Number of Parameters |
| NAS-Bench-201 [224] | 15.6k | Cell-Based | Tabular | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | Training Accuracy, Validation Accuracy, Testing Accuracy, Training Loss, Validation Loss, Testing Loss, Training Time, Number of FLOPs, and Number of Parameters |
| NATS-Bench [261] | 39.3k | Cell-Based | Tabular | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | Training Accuracy, Validation Accuracy, Testing Accuracy, Training Loss, Validation Loss, Testing Loss, Training Time, Number of FLOPs, and Number of Parameters |
| NAS-Bench-301 [262] | \(10^{18}\) | Cell-Based | Surrogate | Image Classification | CIFAR-10 | Validation Accuracy |
| NAS-Bench-360 [263] | N/A | Cell- and Block-Based | Tabular | 10 Diverse Tasks | 10 Diverse Datasets | N/A |
| NAS-Bench-1Shot1 [264] | 399k | Cell-Based | Tabular | Image Classification | CIFAR-10 | Validation Accuracy |
| NAS-Bench-ASR [265] | 8.2k | Cell-Based | Tabular | Automatic Speech Recognition | TIMIT | CTC Loss, Phoneme Error Rate (PER), On-Device Latency, Number of FLOPs, and Number of Parameters |
| NAS-Bench-Graph [266] | 26.2k | Cell-Based | Tabular | 9 Graph Tasks | 9 Graph Datasets | Training Loss, Validation Loss, Testing Loss, Validation Accuracy, On-Device Latency, and Number of Parameters |
| NAS-Bench-NLP [225] | 14k | Cell-Based | Tabular | Language Understanding | PTB and WikiText-2 | Testing Perplexity, Training Time, and Number of Parameters |
| NAS-Bench-111 [233] | 423k | Cell-Based | Surrogate | Image Classification | CIFAR-10 | Training Accuracy, Validation Accuracy, Testing Accuracy, Training Loss, Validation Loss, and Testing Loss |
| NAS-Bench-311 [233] | \(10^{18}\) | Cell-Based | Surrogate | Image Classification | CIFAR-10 | Same as NAS-Bench-111 |
| NAS-Bench-NLP11 [233] | \(10^{53}\) | Cell-Based | Surrogate | Language Understanding | PTB | Same as NAS-Bench-111 |
| NAS-Bench-Suite [267] | N/A | Cell-Based | Tabular and Surrogate | A suite of 11 tabular and surrogate NAS benchmarks | — | — |
| HW-NAS-Bench [214] | 15.6k | Cell-Based | Tabular | Image Classification | CIFAR-10, CIFAR-100, and ImageNet-16-120 | On-Device Latency |
| HW-NAS-Bench [214] | \(10^{21}\) | Block-Based | Surrogate | Image Classification | CIFAR-100 and ImageNet | On-Device Latency |

Note that ImageNet-16-120 is a subset of ImageNet that consists of 120 object categories, in which the input image resolution is fixed to 16×16 [224].

3.4 Visions for the Future

In this section, we envision several promising future trends and possible directions in the field of automated network design, summarized as follows:
(1)
General Search Spaces. The success of NAS highly relies on the well-engineered search space, such as the cell-based search space [137, 144, 160] and the block-based search space [38, 39, 40]. In the past, researchers manually designed the search spaces using heuristic-based strategies, which are typically based on existing state-of-the-art networks, such as MobileNets [123, 268] and ShuffleNets [34, 35]. This effectively restricts the search space to improve the search efficiency and delivers competitive architecture candidates with promising accuracy and efficiency. However, this may significantly limit the search performance, which may reject more competitive architecture candidates outside the well-engineered search space. To overcome such limitations, the authors of [283, 284] pioneer designing more general search spaces than the cell-based and block-based search spaces, which, unfortunately, are under-explored since the works of [283, 284] still suffer from human biases. Therefore, one promising future direction in the field of NAS is to innovate and explore more general search spaces to unleash the power of automated network design.
(2)
Fully Automated Architecture Search. The early NAS practices either focus on searching for the optimal architecture candidate [137, 144, 160], the optimal data augmentation [280, 285], the optimal activation function [286, 287], or the optimal training recipe [288, 289]. As demonstrated in FBNetV3 [288] and AutoHAS [289], different architecture candidates may prefer different training recipes, in which jointly searching for the optimal architecture candidate and its tailored training recipe has the potential to push forward the attainable accuracy. This observation can be easily generalized. For example, different architecture candidates may prefer different data augmentations. Therefore, one promising future direction in the field of NAS is fully automated search, which jointly searches for the optimal architecture candidate and its tailored data augmentation, activation function, and training recipe in one single search experiment to maximize the attainable accuracy.
(3)
Multi-task Architecture Search. Previous NAS works typically focus on searching for task-specific architecture candidates that can achieve promising performance in the specified task, such as image classification [138, 146], object detection [268, 269, 270], and semantic segmentation [271, 272, 273]. However, this search paradigm significantly increases the total search cost as the number of tasks grows, since we have to conduct separate search experiments for each task. To alleviate this issue, FBNetV5 [290] takes the first step to search for multi-task architecture candidates that can achieve competitive performance across multiple tasks, including image classification on ImageNet [80], object detection on COCO [291], and semantic segmentation on ADE20K [292]. Nonetheless, this is far from enough since we have a large number of tasks in real-world scenarios. Therefore, one promising future direction in the field of NAS is multi-task search, which automates the design of top-performing architecture candidates that can be generalized to multiple different tasks without being re-engineered (i.e., once for all).
(4)
Dynamic Architecture Search. Previous NAS works [137, 138, 144, 160] typically focus on searching for static neural networks that can only run at fixed computational budgets, which cannot adapt to lower or higher computational complexity. Dynamic neural networks, such as slimmable neural networks [293, 294, 295], are another important branch of DNNs, which can be executed to accommodate different computational resources in real-world environments. This is because, even on the same hardware device, the available computational resources may vary with respect to time. For example, mobile phones may be in low-power or power-saving modes to reduce power consumption. To overcome such limitations, deploying multiple static neural networks on the same hardware device seems to be the first-in-mind solution, which, unfortunately, demands high on-device storage requirements. Therefore, one promising future direction in the field of NAS is to search for top-performing dynamic neural networks, which can instantly, adaptively, and efficiently trade off between accuracy and inference efficiency to accommodate the rapidly changing computational budgets in real-world embedded computing scenarios.
(5)
Hybrid Architecture Search. As discussed in Section 2, both convolutional networks and vision transformers have their own technical merits when applied to vision tasks. Convolutional networks demonstrate superior efficiency on target hardware, whereas vision transformers achieve better accuracy on a target task. In light of this, designing hybrid networks on top of both convolutional networks and vision transformers has the potential to push forward accuracy–efficiency trade-offs. Nonetheless, previous NAS works typically focus on searching for either convolutional networks [137, 138, 144, 160] or vision transformers [248, 256, 257, 260]. To alleviate this, the authors of [296, 297] have taken the very first steps to investigate hybrid architecture search, which, however, still remains under-explored. Therefore, one promising future direction in the field of NAS is to search for competitive hybrid networks that combine the technical merits of both convolutional networks and vision transformers to achieve better accuracy–efficiency trade-offs.
(6)
Explainable Architecture Search. Previous representative NAS works [40, 42, 138, 144] highly rely on the weight-sharing paradigm [160], also known as one-shot NAS, which initializes an over-parameterized supernet that consists of all possible architectures in the search space and then searches for the optimal architecture candidate within the supernet through weight-sharing. Despite the promising search efficiency, one-shot NAS has been widely criticized due to its limited explainability, which implies that weight-sharing may lead to suboptimal architectures due to weight interference. Even worse, the intuition behind one-shot NAS still remains unknown in the NAS community. To alleviate this, a plethora of zero-shot NAS works have been proposed recently [228, 229, 230, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247], which leverage zero-cost proxies to quickly interpret the accuracy of different architectures. However, existing zero-cost proxies still cannot achieve reliable performance estimation as shown in Figure 15. Therefore, one promising future direction in the field of NAS is to develop more explainable NAS techniques and innovate more reliable zero-cost proxies.
(7)
Meta Architecture Search. Meta-learning [298], also referred to as learning-to-learn, aims to facilitate and accelerate common learning-based practices such that the learned model can quickly adapt to unseen tasks/environments using minimal engineering efforts. For example, HELP [215] introduces an efficient meta-learning–based latency predictor, which can be generalized to new hardware platforms using as few as 10 latency measurements. The widely used weight-sharing paradigm in the field of NAS can be considered to be a special case of meta-learning, which takes the over-parameterized supernet as the meta-model. As the weight-sharing paradigm has been dominating recent advances in the field of NAS, one promising future direction is to explore meta-learning to accelerate the search process and enhance the few-shot learning capability [299, 300, 301].

4 Network Compression for Embedded Computing Systems

In addition to designing novel networks, another alternative is to compress existing networks at hand, either manually designed networks or automatically searched networks, to reduce network redundancy, which leads to network variants with better accuracy–efficiency trade-offs. As illustrated in previous relevant literature [82, 302], there are three popular branches of network compression techniques, including network pruning, network quantization, and network distillation. Note that these three branches are complementary to each other as shown in [303], which indicates that they can be combined to further enable better accuracy–efficiency trade-offs. Among them, network pruning and network quantization focus on improving the accuracy–efficiency trade-off from the efficiency perspective, whereas network distillation enhances the accuracy–efficiency trade-off from the accuracy perspective. To this end, we systematically discuss recent state-of-the-art network compression techniques in this section. For better understanding, we divide these network compression techniques into three main categories and subsections, since they feature different algorithms to improve the accuracy–efficiency trade-off from different perspectives: network pruning in Section 4.1, network quantization in Section 4.2, and network distillation in Section 4.3. Note that these network compression techniques can typically generalize across different networks (e.g., convolutional networks and transformers). For example, we can leverage knowledge distillation to enhance the training process of both convolutional networks and transformers towards better training accuracy.

4.1 Network Pruning

The rationale behind network pruning is that DNNs are usually over-parameterized and redundant in terms of network weights and channels [304, 305]. As such, eliminating redundant network weights and channels can largely benefit network efficiency at the cost of minimal accuracy loss, thus making it easier to accommodate the limited computational resources and strict storage budgets in real-world embedded scenarios. Following previous well-established pruning conventions, we divide recent state-of-the-art pruning methods into two main categories according to their pruning granularity, in which non-structured pruning (i.e., weight pruning) is fine-grained whereas structured pruning (i.e., channel pruning and layer pruning) is coarse-grained. As illustrated in Figure 17, weight pruning focuses on removing the redundant weight connections, whereas channel pruning and layer pruning focus on removing the redundant channels and layers. In practice, both non-structured pruning and structured pruning can explore simplified network structures with optimized computational efficiency. Nonetheless, non-structured weight pruning highly relies on specialized hardware accelerators [29] and cannot provide realistic runtime speedups on modern embedded computing systems owing to irregular network sparsity [305, 306]. In contrast, structured channel pruning and layer pruning are coarse-grained and do not introduce irregular network sparsity, which can deliver realistic runtime speedups on modern embedded computing systems. For better coverage, below we further discuss recent representative works in the field of both non-structured pruning and structured pruning, which are also summarized in Figure 19.
Fig. 17.
Fig. 17. Illustration of different structured and non-structured pruning strategies. Weight pruning is non-structured, whereas channel pruning and layer pruning are structured.
Fig. 18.
Fig. 18. Distribution of weight gates [312].
Fig. 19.
Fig. 19. Illustration of non-structured and structured pruning works that have been discussed in Section 4.1.

4.1.1 Non-Structured Pruning.

Non-structured pruning, also referred to as weight pruning, removes the less important network weights, which is typically more fine-grained than structured pruning as illustrated in Figure 17. Applying weight pruning for network compression can be traced back to the early 1990s. For example, Optimal Brain Damage [307] and Optimal Brain Surgeon [308], the very early weight pruning approaches, pioneered the investigation of the efficacy of weight pruning on vanilla fully-connected networks, in which the less important network weights are removed based on the Hessian of the loss function. More recently, [29] proposes a simple yet effective weight pruning technique to compress deep convolutional networks, such as AlexNet [76] and VGGNet [1], instead of vanilla fully-connected networks. Specifically, [29] observes that the network weights with smaller magnitudes typically contribute less to network accuracy, based on which [29] removes the less important network weights with smaller magnitudes. Subsequently, this weight pruning technique is further integrated into Deep Compression [89] to obtain highly compressed networks, making it possible to aggressively reduce network size without sacrificing network accuracy. For example, Deep Compression is able to significantly reduce the network size of VGGNet by ×49, from 552 MB to 11.3 MB, while maintaining comparable accuracy on ImageNet. Nonetheless, the reduction in terms of the network size cannot directly translate into the speedup on target hardware since the resulting compressed networks exhibit highly irregular network sparsity. To overcome such limitations, EIE [306] designs an efficient specialized inference engine to maximize the inference efficiency of compressed networks. In parallel, [309] proposes an efficient data-free weight pruning approach to iteratively remove redundant network weights. In addition, [310] and [311] leverage Variational Dropout and \(L_0\)-norm regularization-based stochastic gates, respectively, to remove the less important network weights.
Weight Importance Criteria. The core of weight pruning is to determine the importance of different network weights, based on which we can easily rank different network weights and remove the less important network weights at the cost of minimal accuracy loss. There have been several representative importance criteria to measure the importance of different network weights after the network is trained. The most straightforward criterion is based on the weight magnitude thanks to its conceptual simplicity and surprisingly strong performance, which leverages the absolute weight \(|w|\) to interpret the importance of different network weights (i.e., the larger, the more important) [29, 313, 314]. The rationale behind magnitude-based weight pruning is that smaller network weights typically contribute less to the output of the network. Other importance criteria include second-order derivative-based [307, 315], Taylor expansion-based [316], and output sensitivity–based [317] strategies. More recently, [312] proposes an effective strategy, namely, gates with differentiable polarization (GDP), which introduces learnable gates to interpret the importance of different network weights. GDP encourages a large margin between exact zero gates and non-zero gates as shown in Figure 18, while still allowing gradient optimization. Finally, GDP removes the network weights with exact zero gates and further merges the remaining non-zero gates into the resulting pruned network without hurting the network accuracy once the optimization process terminates. Despite the impressive progress to date, the design of efficient and effective importance criteria remains under-explored and an open challenge in the community.
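To make the magnitude criterion concrete, the following is a minimal sketch of global magnitude-based weight pruning, assuming PyTorch; the helper name and the global-threshold rule are illustrative rather than the exact procedure of any specific work.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, prune_ratio: float = 0.5):
    """Zero out the prune_ratio fraction of weights with the smallest |w|,
    computed globally over all Conv2d/Linear layers."""
    # Gather the absolute values of all prunable weights.
    all_weights = torch.cat([
        m.weight.detach().abs().flatten()
        for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))
    ])
    # Global magnitude threshold: weights below it are treated as less important.
    k = max(1, int(prune_ratio * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values

    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            mask = (m.weight.detach().abs() > threshold).float()
            m.weight.data.mul_(mask)   # apply the binary pruning mask in place
            masks[name] = mask         # keep masks to re-apply during fine-tuning
    return masks

# Usage (illustrative): masks = magnitude_prune(my_model, prune_ratio=0.9)
```

In practice, the returned masks would be re-applied after each fine-tuning step so that pruned weights stay zero.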
Sparse Network Acceleration. Different from structured pruning, which is hardware-friendly, non-structured pruning, despite being able to maintain competitive accuracy under high compression ratios, introduces considerable irregular network sparsity, making it difficult to parallelize the resulting sparse networks on mainstream hardware systems such as GPUs and CPUs [306]. This indicates that non-structured pruning highly relies on specialized hardware to achieve superior on-device speedups. To this end, a plethora of specialized hardware accelerators [306, 318, 319, 320, 321, 322, 323, 324, 325] and compiler-based optimization techniques [326, 327] have been developed recently to accelerate the on-device inference of sparse networks, which typically focus on improving the irregular memory access on target hardware. For example, Cambricon-X [321] features an access-efficient indexing module to select and transfer irregular network weights to different processing elements (PEs) with reduced bandwidth requirements. This indexing module allows each PE to store irregular network weights for local computation in an asynchronous manner and, as a result, significantly reduces the irregular memory access overheads across different PEs.
Sparse Training Techniques. As shown in [29], weight pruning can effectively lead to efficient sparse network variants that are up to 90% smaller than the unpruned network, significantly alleviating the storage requirements and reducing the computational complexity. Despite the promising efficiency improvement, training the resulting sparse networks is quite challenging, as using conventional training strategies may lead to non-negligible accuracy loss, as demonstrated in [328]. To recover the accuracy, a plethora of sparse training techniques have been developed to train the resulting sparse networks [29, 89, 329, 330, 331, 332]. For example, [29] proposes to fine-tune the pruned sparse network with inherited network weights from the unpruned network. Furthermore, [89] generalizes this fine-tuning strategy to become iterative, in which multiple iterations of pruning and fine-tuning are repeated to recover the attainable accuracy. In addition, [330] investigates the performance collapse issue of training sparse networks, which may easily be trapped in suboptimal local minima. To escape the suboptimal local minima, [330] proposes to traverse extra dimensions between dense and sparse subspaces when training sparse networks. More recently, [331] introduces an alternative sparse training strategy that deviates from the default vanilla training protocol by introducing ghost neurons and skip connections at the early training stage and by strategically modifying the initialization as well as the labels. Below, we introduce another representative branch of weight pruning and sparse training techniques, namely, the lottery ticket hypothesis [328], which demonstrates that the pruned sparse networks, when properly initialized, can be trained from scratch to recover accuracy comparable to the unpruned network.
Lottery Ticket Hypothesis. The lottery ticket hypothesis [328] is a special case of non-structured pruning that has since gained increasing popularity in the pruning community [333, 334, 335, 336, 337, 338]. The lottery ticket hypothesis reveals that: A randomly initialized unpruned network contains sparse subnetworks (i.e., winning tickets) that are initialized such that, when trained in isolation, they can match the accuracy of the unpruned network after training for at most the same number of iterations. The winning tickets can be up to 90% smaller than the unpruned network while at the same time maintaining comparable accuracy. To identify the winning ticket, the lottery ticket hypothesis [328] proposes to leverage the following steps:
(1)
Randomly initialize the unpruned network \(f(x;\theta _0)\), in which \(\theta _0 \sim \mathcal {D}_{\theta }\);
(2)
Train the unpruned network for j iterations, arriving at weights \(\theta _j\);
(3)
Prune p% of the weights in \(\theta _j\), creating the sparse mask \(m \in \lbrace 0, 1\rbrace ^{|\theta _0|}\);
(4)
Reset the remaining weights to their values in \(\theta _0\), creating the winning ticket \(f(x;m \odot \theta _0)\).
As demonstrated in [328], directly pruning p% of the network weights may lead to significant accuracy loss, which also makes the training process unstable. To overcome such limitations, the lottery ticket hypothesis proposes to iteratively repeat the above steps, also referred to as iterative pruning, which repeatedly trains, prunes, and resets the network weights over n rounds, where each round prunes \(p^{\frac{1}{n}}\)% of the network weights. The lottery ticket hypothesis opens up the possibility and provides empirical guidelines to train sparse networks from scratch to match the accuracy of the unpruned network. Subsequently, [333, 334] prove the lottery ticket hypothesis from insightful theoretical perspectives. In parallel, [335] investigates the performance collapse issue of the lottery ticket hypothesis, especially when dealing with deeper networks, such as ResNets [2] and DenseNets [3], based on which [335] proposes an effective modified iterative pruning scheme called rewinding iteration to stabilize the lottery ticket hypothesis. The works of [336, 337, 338] generalize the lottery ticket hypothesis to other types of networks beyond convolutional networks, such as graph networks [336], spiking networks [110], and photonic networks [338].
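As a rough illustration of the four steps above combined with iterative pruning, the following is a minimal sketch assuming PyTorch; train_fn and mask_fn are hypothetical placeholders standing in for ordinary training and global magnitude-based mask computation, and the per-round schedule shown is only one way to approximate pruning p% over n rounds.

```python
import copy
import torch

def apply_mask(model, mask):
    # Zero the pruned weights so the rewound network stays sparse.
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in mask:
                p.mul_(mask[name])

def find_winning_ticket(model, train_fn, mask_fn,
                        final_ratio=0.9, rounds=5, iters_per_round=10_000):
    # Step 1: record the random initialization theta_0.
    theta_0 = copy.deepcopy(model.state_dict())
    # Pruning this fraction of the *remaining* weights each round removes
    # roughly final_ratio of all weights after `rounds` rounds.
    per_round = 1.0 - (1.0 - final_ratio) ** (1.0 / rounds)
    mask = None
    for _ in range(rounds):
        train_fn(model, iters_per_round)          # Step 2: train for j iterations
        mask = mask_fn(model, per_round, mask)    # Step 3: grow the pruning mask m
        model.load_state_dict(theta_0)            # Step 4: rewind weights to theta_0
        apply_mask(model, mask)                   # candidate ticket f(x; m * theta_0)
    return model, mask
```

After the final round, the returned sparse, rewound network is the candidate winning ticket that is then trained in isolation.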
Semi-structured Pruning. In contrast to the above mainstream non-structured pruning methods that introduce considerable irregular network sparsity, semi-structured pruning focuses on removing the less important consecutive weight connections [339]. The resulting semi-structured sparse networks can exhibit less irregular network sparsity than non-structured pruning, which are well supported by some existing deep learning libraries (e.g., cuSPARSElt [340] and TVM [341]) and thus can maintain much higher parallelism and speedups on modern embedded computing systems than non-structured pruning. For example, popular BERT models can achieve about 1.3x to 1.6x runtime inference speedups on NVIDIA A100 GPUs using the optimized sparse tensor cores [340, 342]. Thanks to its superior accuracy–efficiency trade-offs over non-structured pruning, semi-structured pruning has been widely employed to optimize the computational complexity of convolutional networks [343, 344], transformers [342, 345], and large language models [346, 347]. For example, [344] introduces an effective channel permutation scheme to optimize the attainable accuracy of the resulting semi-structured sparse convolutional network. The work of [343] introduces sparse-refined straight-through estimator (SR-STE) to explore the optimal semi-structured sparse convolutional network. In addition to convolutional networks, [342] and [345] introduce the alternating direction method of multipliers (ADMM) and progressive gradient flow to explore the optimal semi-structured sparse transformer for real-world language processing tasks. More recently, [346, 347] investigated semi-structured sparsity to enhance the inference efficiency of large language models. In parallel to the above semi-structured pruning methods that optimize inference efficiency, [348, 349] instead focus on optimizing the training efficiency of semi-structured sparse networks. The works of [350, 351, 352] also focus on exploring dedicated hardware accelerators to further enhance the runtime inference efficiency of semi-structured sparse networks.
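To illustrate the pattern that such libraries accelerate, the following is a minimal sketch of enforcing 2:4 semi-structured sparsity on a weight matrix, assuming PyTorch; it demonstrates the sparsity pattern itself, not the API of cuSPARSELt or TVM.

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every contiguous group of 4."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "input dimension must be a multiple of 4"
    groups = weight.abs().reshape(out_features, in_features // 4, 4)
    # Indices of the two smallest |w| entries in each group of four.
    _, drop_idx = groups.topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups).scatter_(-1, drop_idx, 0.0)
    return weight * mask.reshape(out_features, in_features)

# Example: every group of 4 consecutive weights keeps exactly 2 non-zeros.
w = torch.randn(8, 16)
w_sparse = prune_2_of_4(w)
```

Because the non-zero positions are constrained within each group of four, the pattern can be stored compactly and executed with regular memory access, which is what enables the reported speedups on sparse tensor cores.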

4.1.2 Structured Pruning.

In parallel to non-structured pruning, structured pruning, including channel pruning6 and layer pruning, is another popular branch, which removes the less important channels or layers to reduce the network complexity as shown in Figure 17. Layer pruning can be viewed as a special case of channel pruning: channel pruning becomes layer pruning when all channels in the same layer are removed. We emphasize that non-structured pruning, despite being able to achieve significant compression ratios, introduces considerable irregular network sparsity and the resulting pruned network also features irregular computational patterns, which highly relies on specialized hardware accelerators to achieve realistic speedups, as demonstrated in [306]. In contrast, structured pruning can easily achieve realistic speedups on mainstream hardware, such as GPUs and CPUs, thanks to its high on-device parallelism [305]. This unique technical merit has made structured pruning increasingly popular, especially in the context of designing hardware-friendly network solutions [383]. With all this in mind, below we further elaborate on recent representative structured pruning works, which can be roughly divided into the following four categories: weight-based pruning, activation-based pruning, batch normalization statistics–based pruning, and search-based pruning.
Weight-Based Pruning. Weight-based pruning, also referred to as weight-dependent pruning, determines the importance of different channels based on the corresponding weights, which is technically similar to magnitude-based pruning as discussed in Section 4.1.1. There have been two popular weight-based pruning criteria, including weight norm and weight correlation. Without loss of generality, we can easily calculate the \(L_n\)-norm as \(||w||_n\), where w is the corresponding network weight. For example, [21] proposes to remove the less important channels based on their \(L_1\)-norm values, which indicates that the channel with the smaller \(L_1\)-norm is considered less important and contributes less to the network output. The work of [22] observes that the \(L_2\)-norm can achieve better pruning performance than the \(L_1\)-norm. Furthermore, the work of [23] challenges the empirical assumption in [21, 22] and demonstrates that the channels with smaller \(L_1\)- and \(L_2\)-norm magnitudes are not necessarily less important. To avoid this, [23] instead turns to channel correlation and reveals that the channels close to the geometric median are typically redundant since they represent similar feature maps in the same layer. As a result, removing the channels close to the geometric median only leads to minimal accuracy loss. Inspired by the promising performance of [23], the works of [353, 354] propose to first apply scalar hashing to the weights of each layer and then remove redundant channels based on the corresponding weight similarity. This is because similar channels are of high redundancy in terms of the contributions to the network representation capability. In addition, unlike [21, 22, 23, 353, 354], which measure the channel redundancy in the same layer, [355] investigates the channel redundancy across multiple different layers in order to minimize the accuracy loss. Furthermore, [356] prioritizes removing the channels in more redundant layers rather than globally ranking different channels across all network layers.
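As a concrete illustration of the weight-norm criterion, the following is a minimal sketch of \(L_1\)-norm-based channel ranking for a convolutional layer, assuming PyTorch; it follows the smaller-norm-less-important heuristic rather than the full pipeline of any particular work.

```python
import torch
import torch.nn as nn

def rank_channels_by_l1(conv: nn.Conv2d) -> torch.Tensor:
    """Return output-channel indices sorted from least to most important."""
    # Each output filter has shape (in_channels, kH, kW); its L1-norm is the score.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(scores)  # ascending: pruning candidates come first

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
prune_order = rank_channels_by_l1(conv)
least_important = prune_order[: int(0.3 * conv.out_channels)]  # e.g., prune 30% of channels
```

Physically removing the selected filters (and the matching input channels of the next layer) then yields a smaller dense network that runs without any sparse-hardware support.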
Activation-Based Pruning. Activation-based pruning typically leverages the intermediate activation maps to interpret the importance of different channels, in which activation maps, also known as feature maps, correspond to the output features from one specific network layer. As demonstrated in [383], there have been three representative techniques to determine the importance of different channels in the L-th layer: (1) using the activation maps of the L-th layer, (2) using the activation maps of adjacent layers (e.g., \((L+1)\)-th and \((L+2)\)-th layers), and (3) using the activation maps of the last layer (i.e., the network output).
(1)
Current Layer. To determine the importance of different channels in the L-th layer, [357] proposes a simple yet effective two-step scheme, which first removes different channels and then measures the reconstruction error based on the output activation maps of the L-th layer. Next, the channels whose removal leads to a smaller reconstruction error are considered less important and are removed to reduce the network complexity. Similarly, [358] measures the channel importance according to the decomposition error. Subsequent works utilize the channel independence [359] and post-activation maps [360] to measure channel importance.
(2)
Adjacent Layers. Recent state-of-the-art DNNs are naturally coupled and of sequential layer structures, which indicates that there is significant layer dependency between different adjacent layers. In light of this convention, [361, 362] investigate the dependency between the current layer and the subsequent layer in order to measure the channel importance in the current layer. In parallel, [295, 363] demonstrate that the activation maps of previous layers can also reflect the channel importance in the subsequent layers.
(3)
Last Layer. The channel importance can also be evaluated using the activation maps of the last layer, which correspond to the network output. The rationale behind this is that we are allowed to use the network output to interpret the accuracy of the pruned network. For example, we can simply determine the channel importance based on the reconstruction error [244] and the discrimination of the entire network [384].
Statistics-Based Pruning. Statistics-based pruning refers to pruning that exploits batch normalization statistics [385] to interpret channel importance, which has since gained increasing popularity thanks to its conceptual simplicity and surprisingly strong pruning performance. As shown in previous representative networks [2, 3, 32, 34, 35, 123], batch normalization is a widely used plug-and-play technique to accelerate and stabilize the network training process towards better training convergence while also reducing internal covariate shift to benefit network accuracy. Specifically, batch normalization \(BN(\cdot)\) transforms the input \(x \in \mathbb {R}^{B \times C \times W \times H}\) as follows:
\begin{equation} BN(x) = \gamma \cdot \frac{x - \mu }{\sqrt {\delta ^2 + \epsilon }} + \beta , \end{equation}
(13)
where \(\mu\) and \(\delta\) are the mean and standard deviation of the input x, respectively, and \(\epsilon\) is a small constant (e.g., \(1\times 10^{-9}\)) to avoid zero-division. \(\gamma \in \mathbb {R}^{C}\) and \(\beta \in \mathbb {R}^{C}\) are learnable parameters to scale and shift \(\frac{x - \mu }{\sqrt {\delta ^2 + \epsilon }}\), which are optimized during the training process to recover the input x. Note that \(\gamma\) and \(\beta\) have the same dimension as the number of input channels. As seen in [2, 3, 32, 34, 35, 123], it is common practice to insert one batch normalization layer after one convolutional layer. In light of this, [364] pioneers leveraging batch normalization statistics \(\gamma\) to enable and disable different input channels, and the disabled channels are pruned at the end of the training process. To this end, [364] applies \(L_1\)-norm regularization on \(\gamma\) to achieve sparsity. Similar to [364], Gate Decorator [365] introduces gated batch normalization that leverages batch normalization statistics \(\gamma\) as channel gates to enable and disable different input channels from the previous convolutional layer. To achieve sparsity, Gate Decorator also exploits \(L_1\)-norm regularization to penalize \(\gamma\) during the training process. In addition, Gate Decorator introduces an iterative pruning scheme, which progressively prunes redundant channels during the training process and fine-tunes the resulting pruned network to recover the accuracy. However, as demonstrated in [366], \(L_1\)-norm regularization suffers from inferior discrimination between different channels since \(L_1\)-norm regularization pushes all the scaling factors \(\gamma\) towards zero. To tackle this, in contrast to [364, 365], which regularize \(\gamma\) with \(L_1\)-norm penalty, [366] polarizes \(\gamma\) to enforce a large margin between zero and non-zero \(\gamma\). Furthermore, [367] challenges [364, 365, 366], observing that smaller non-zero \(\gamma\) does not imply that the corresponding channel is less important. Based on this observation, [367] introduces a simple yet effective iterative pruning approach, which (1) prunes the less important channels with exact zero \(\gamma\), (2) rescales the magnitude of \(\gamma\), and (3) fine-tunes the resulting pruned network to recover the accuracy. In addition to the scaling factors \(\gamma\), [368] demonstrates that the shifting factors \(\beta\) can also be leveraged to interpret the channel importance and jointly considering \(\gamma\) and \(\beta\) has the potential to achieve more reliable channel pruning. The aforementioned statistics-based channel pruning works [364, 365, 366, 367, 368] explicitly demonstrate that batch normalization statistics (i.e., \(\gamma\) and \(\beta\)), when properly engineered, can reflect the importance of input channels from the previous convolutional layer.
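In the spirit of the statistics-based methods above, the following is a minimal sketch of an \(L_1\) penalty on the batch normalization scaling factors \(\gamma\) followed by thresholding near-zero channels, assuming PyTorch; the regularization strength and pruning threshold are illustrative values.

```python
import torch
import torch.nn as nn

def bn_gamma_l1_penalty(model: nn.Module, lambda_l1: float = 1e-4):
    """Sparsity-inducing regularizer on BN gamma, added to the task loss."""
    penalty = sum(m.weight.abs().sum()
                  for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return lambda_l1 * penalty

def channels_to_prune(model: nn.Module, threshold: float = 1e-2):
    """After training, collect per-layer channel indices with tiny |gamma|."""
    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            plan[name] = (m.weight.detach().abs() < threshold).nonzero().flatten().tolist()
    return plan

# Training-loop usage (sketch):
#   loss = criterion(model(x), y) + bn_gamma_l1_penalty(model)
#   loss.backward(); optimizer.step()
```

Once training terminates, the channels listed in the pruning plan (and the matching convolutional filters) are removed, and the compact network is fine-tuned to recover accuracy.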
Search-Based Pruning. Inspired by the tremendous success of NAS as discussed in Section 3, a plethora of search-based pruning works have recently emerged, which typically leverage search-based techniques, including reinforcement learning–based [369, 370, 371], evolutionary algorithm–based [372, 373, 374], and gradient-based search [375, 376, 377], to automatically search for the optimal pruning policy instead of using manually designed pruning heuristics. The rationale behind this is that different channels can be alternatively viewed as a list of possible operator candidates, making it possible to generalize the novel findings and advances in the field of NAS to address research challenges in the field of pruning. Specifically, previous search-based channel pruning works can be divided into the following three categories:
(1)
Reinforcement Learning–Based Search. Reinforcement learning is a well-established technique for solving search problems as discussed in Section 3.2. Several pruning works [369, 370, 371] have pioneered the exploitation of reinforcement learning to search for the optimal channel pruning policy. For example, AMC [369] proposes to train an efficient deep deterministic policy gradient (DDPG) [386] agent such that the well-trained DDPG agent can output the optimal layer-wise channel pruning policy to maximize the pre-defined reward function as shown in Figure 20. In addition, AGMC [370] demonstrates that AMC may yield suboptimal pruning policies due to the fixed number of environment states. To tackle this, AGMC leverages graph convolutional networks (GCNs) to encode the pruned network and exploits the graph-based encoder–decoder to automatically learn the optimal environment state. DECORE [371] turns back to multi-agent search, which assigns each agent to one specific network layer to learn better pruning policies.
(2)
Evolutionary Algorithm-Based Search. The evolutionary algorithm is another well-established search technique thanks to its conceptual simplicity, flexibility, and surprisingly strong performance. There have been several pruning works [372, 373, 374] that employ an evolutionary algorithm to automatically search for the optimal channel pruning policy. For example, MetaPruning [372] introduces an efficient two-stage pruning pipeline. In the first stage, MetaPruning trains an over-parameterized PruningNet that consists of all possible pruned network configurations. Note that PruningNet here is technically the same as the supernet in the field of NAS as discussed in Section 3. In the second stage, MetaPruning leverages the well-trained PruningNet to quickly evaluate the accuracy of different pruned networks with inherited weights from the well-trained PruningNet [160]. In addition, an evolutionary engine is integrated to explore the optimal pruned network.
(3)
Gradient-Based Search. Unlike the aforementioned reinforcement learning–based and evolutionary algorithm–based pruning works that explore the optimal pruning policy within the discrete space, gradient-based search instead allows learning of the optimal pruning policy within the continuous space [375, 376, 377]. As a result, gradient-based search is able to maintain much better computational efficiency than reinforcement learning–based and evolutionary algorithm–based counterparts. For example, DSA [377] proposes an efficient differentiable sparsity allocation approach to learn optimal layer-wise pruning ratios with gradient-based optimization, in which each pruning experiment only requires about 5 GPU hours. To relax the discrete search space to become continuous, DSA introduces learnable pruning ratios, which are conceptually the same as the architecture parameters in differentiable NAS [138]. During the training process, the above learnable pruning ratios can be jointly optimized together with the network weights using standard gradient descent.
Fig. 20.
Fig. 20. Overview of AMC [369], which formulates channel pruning as reinforcement learning–based search and instead automatically searches for the less important channels to be pruned (figure from [369]).
In general, search-based pruning is similar to NAS: search-based pruning searches for pruned network structures, whereas NAS searches for stand-alone network structures. This indicates that we can generalize more advanced NAS algorithms to search for better pruned networks.
Layer-Based Pruning. Layer pruning is a special case of channel pruning, which aggressively removes all the channels in the same layer, as shown in Figure 17. Under similar compression ratios, layer pruning can achieve better performance in terms of latency reduction than channel pruning, as demonstrated in [379]. However, there is no free lunch, which indicates that layer pruning may suffer from greater accuracy loss than channel pruning. Note that layer pruning is conceptually and technically similar to channel pruning. Specifically, channel pruning aims to remove the less important channels, whereas layer pruning focuses on removing the less important layers as seen in previous representative layer pruning works [30, 378, 379, 380, 381, 382]. In light of this, the aforementioned channel pruning techniques can be easily generalized to prune redundant layers. For example, [379] introduces several importance criteria from the lens of channel pruning, such as weight magnitudes, activation maps, and batch normalization statistics, which are further combined to reliably determine the less important layers. In addition, [382] investigates the lottery ticket hypothesis [328] from the perspective of layer pruning, which confirms that there also exist winning tickets at initialization in terms of layer pruning. More importantly, the winning tickets here are more environmentally friendly, with lower carbon emissions, while at the same time achieving better training efficiency and adversarial robustness [382]. In addition, several recent methods [387, 388] observe that the intermediate non-linear activation layers can also be grafted with negligible accuracy loss. Based on this observation, [387, 388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. Several recent pruning methods [389, 390, 391, 392] focus on multi-dimensional pruning, which strive to actively prune less important channels, layers, and input resolutions to aggressively trim down the model’s complexity towards enhanced inference efficiency on target hardware. These multi-dimensional pruning methods can achieve much better accuracy–efficiency trade-offs than traditional channel-based and layer-based pruning methods. Similarly, HACScale [393] proposes an effective scaling paradigm to re-scale different channels and layers towards more efficient inference on target hardware.

4.2 Network Quantization

In contrast to network pruning, which aims to reduce the network complexity at the structure level, network quantization focuses on representing the network weights and activations with lower bits, which significantly reduces the network complexity at the precision level. Therefore, the resulting quantized network maintains the same network structure (i.e., the same number of layers and channels), but with lower-bit network weights and activations. In practice, network quantization can be traced back to the 1990s, when early quantization works pioneered quantizing the network weights for Boltzmann machines [447], optical networks [448], and multi-layer perceptrons (MLPs) [449]. Note that quantization has the potential to significantly trim down the network size to accommodate the limited storage in real-world embedded scenarios. For example, Deep Compression [89] is able to reduce the network size of VGGNet by ×49, from 552 MB to 11.3 MB, while delivering comparable accuracy on ImageNet [80]. Thanks to its surprisingly strong performance in reducing the computational complexity and alleviating the storage requirements, renewed research interest in network quantization has emerged since the 2010s [409], which demonstrates that, compared with full-precision weights (i.e., 32 bits), 8-bit quantized weights can effectively accelerate the network inference on mainstream CPUs without significant accuracy degradation. To this end, we discuss recent advances in the field of network quantization in this section, including representative quantized networks and popular quantization-related extensions/implementations, which are also summarized in Figure 21.
Fig. 21.
Fig. 21. Illustration of different network quantization techniques discussed in Section 4.2.

4.2.1 Quantized Networks.

Here, we introduce several representative quantized networks: binarized networks, ternarized networks, INT8 networks, and mixed-precision networks.
Binarized Networks. Binarized networks are built upon only 1-bit weights, which are constrained to be either \(+\)1 or \(-\)1 during forward and backward propagations [24, 25]. This effectively eliminates computation-intensive multiply-accumulate operations, allowing them to be replaced with cheap additions and subtractions, and can lead to significant performance improvements in terms of latency and energy consumption, as demonstrated in [24]. In the relevant literature, BinaryConnect [24] and BinaryNet [25] are the very early seminal binarized networks, which pioneered the investigation of the efficacy of 1-bit weights in order to reduce computational complexity and alleviate the storage bottleneck. BinaryConnect [24] introduces the first binarized network, which explores both deterministic binarization,
\begin{equation} \begin{aligned}w_b = \mathrm{sign}(w) = {\left\lbrace \begin{array}{ll} +1, & \text{if} \,\,\, w \ge 0 \\ -1, & \text{otherwise} \end{array}\right.} \end{aligned} , \end{equation}
(14)
and stochastic binarization to stochastically binarize the network weights,
\begin{equation} \begin{aligned}w_b = {\left\lbrace \begin{array}{ll} +1, & \text{with probability} \,\,\, p=\sigma (w) \\ -1, & \text{with probability} \,\,\, p=1-\sigma (w), \end{array}\right.} \end{aligned} \end{equation}
(15)
where \(\sigma (\cdot)\) is the hard sigmoid function and can be mathematically formulated as follows:
\begin{equation} \sigma (x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \mathrm{max}\left(0, \mathrm{min}\left(1, \frac{x+1}{2}\right)\right) . \end{equation}
(16)
Note that BinaryConnect exploits the above hard sigmoid function rather than the soft version because it is far less computationally expensive and can still yield competitive results. As shown in BinaryConnect, the stochastic binarization is more advanced and can achieve much better quantization accuracy than the deterministic counterpart. So far, BinaryConnect only enables weight-level binarization, whereas the network inputs are still required to be full precision. In light of this, BinaryNet [25] extends BinaryConnect to support both binarized weights and binarized inputs in order to maximize the inference efficiency of binarized networks. XNOR-Net [26] demonstrates that [24, 25] cannot be generalized to large-scale datasets such as ImageNet. To address this, XNOR-Net introduces an effective approach to estimate binarized weights to maintain \(w \approx \alpha \cdot w_b\), after which the estimated \(\alpha\) can be detached from the binarized weight to rescale the input. To further enhance binarization accuracy, a plethora of binarized networks [394, 395, 396, 397, 398, 399, 400] have been proposed. For example, [399, 400] propose learnable activation binarizer [399] and adaptive binary sets [400] to explore more accurate binarized networks.
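The following is a minimal sketch of deterministic weight binarization in the spirit of Equation (14), trained with a straight-through estimator, assuming PyTorch; it is an illustrative layer rather than the reference implementation of BinaryConnect or BinaryNet.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)            # w_b in {-1, +1} (sign(0) maps to 0 here)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: pass gradients where |w| <= 1, clip elsewhere.
        return grad_output * (w.abs() <= 1.0).float()

class BinaryLinear(nn.Linear):
    def forward(self, x):
        w_b = BinarizeSTE.apply(self.weight)          # binarize on the fly
        return nn.functional.linear(x, w_b, self.bias)

layer = BinaryLinear(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()   # gradients flow to the latent full-precision weights
```

The full-precision latent weights are kept only for training; at deployment time, only the 1-bit weights need to be stored.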
Ternarized Networks. In addition to binarized networks, ternarized networks [401, 402] are another representative branch of quantized networks and have gained increasing popularity thanks to their superior accuracy. Specifically, ternarized networks quantize the network weights from 32 bits to 2 bits, in which the 2-bit weights are constrained to \(-\)1, 0, and \(+\)1 in contrast to \(\pm\)1 in binarized networks. As such, ternarized networks can achieve much better accuracy than binarized networks at the cost of slightly increased computational complexity. To achieve this, [401], which introduces the first ternarized network, proposes to quantize the full-precision weights as follows:
\begin{equation} \begin{aligned}w_t = {\left\lbrace \begin{array}{ll} +1, & \text{if} \,\,\, w \gt \Delta \\ 0, & \text{if} \,\,\, |w| \le \Delta \\ -1, & \text{if} \,\,\, w \lt -\Delta \end{array}\right.} \end{aligned} , \end{equation}
(17)
where \(\Delta\) is a positive constant to control the ternarization threshold. To derive the optimal ternarization threshold \(\Delta ^*\), [401] turns back to XNOR-Net [26] and borrows the binarization estimating scheme from XNOR-Net, which introduces an adjustable scaling factor to minimize \(||w - \alpha \cdot w_t||_2^2\). Finally, [401] demonstrates an empirical rule of thumb to derive \(\Delta ^*\) as follows:
\begin{equation} \Delta ^* = 0.7 \cdot E(|w|) \approx \frac{0.7}{n} \sum _{i=1}^n |w_i| , \end{equation}
(18)
where n is the number of elements within w. To boost the ternarization accuracy, [402] further introduces two trained scaling coefficients \(w_l^p\) and \(w_l^n\) for the l-th layer, which are then trained using gradient descent during backward propagation. Once the training process terminates, [402] deploys the ternarized networks on target hardware, including the trained ternarized weights and the corresponding scaling coefficients, reducing the network size by at least ×16. Subsequently, several ternarized networks [403, 404, 405, 406, 407, 408] have been proposed to further improve ternarization accuracy. Among them, [408] demonstrates that the hard ternarization threshold \(\Delta\), despite being simple and effective, often leads to suboptimal results. To avoid this, [408] introduces the paradigm of a soft ternarization threshold, which instead enables the network to automatically determine the optimal ternarization intervals to maximize its accuracy.
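The following is a minimal sketch of threshold-based ternarization following Equations (17) and (18), assuming PyTorch; the scaling factor \(\alpha\) is computed as the mean magnitude of the surviving weights, which minimizes \(||w - \alpha \cdot w_t||_2^2\) for a fixed ternary pattern.

```python
import torch

def ternarize(w: torch.Tensor):
    delta = 0.7 * w.abs().mean()        # Equation (18): rule-of-thumb threshold
    w_t = torch.zeros_like(w)           # Equation (17): {-1, 0, +1} assignment
    w_t[w > delta] = 1.0
    w_t[w < -delta] = -1.0
    # Closed-form scaling factor: mean |w| over the non-zero positions.
    mask = w.abs() > delta
    alpha = w.abs()[mask].mean() if mask.any() else w.abs().mean()
    return w_t, alpha

w = torch.randn(256)
w_t, alpha = ternarize(w)
w_approx = alpha * w_t   # low-bit approximation of the full-precision weights
```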
INT8 Quantization. Binarized and ternarized networks have the potential to achieve \(\times 16\sim\)×32 speedups, which, however, suffer from non-negligible accuracy loss and, even worse, require considerable engineering efforts to design specialized hardware for further deployment. The rationale behind this is that mainstream hardware does not support low-bit quantized networks. To overcome such limitations, an effective alternative is INT8 quantization, which trims down the network weights from 32 bits to 8 bits within the range of \([-128, 127]\). As such, INT8 quantization can lead to about ×4 compression in terms of network size and, more importantly, at the cost of negligible accuracy loss [409]. In addition, thanks to well-established software support (e.g., Google’s TensorFlow Lite and NVIDIA’s TensorRT), we can easily deploy INT8 quantized networks on mainstream hardware, such as mobile devices, CPUs, and edge GPUs, with minimal engineering efforts [68, 69]. For example, as shown in [410], TensorRT allows post-training INT8 quantization, which leverages simple weight calibration to convert pre-trained full-precision weights into 8-bit weights (see Figure 23) with only trivial accuracy loss. Several works [411, 412, 413, 414] have been proposed to further investigate INT8 quantization and improve INT8 quantization accuracy. Among them, [414] pioneers quantizing both weights and activations with 8-bit integers to boost inference efficiency. In addition, [411] evaluates the performance of various INT8 quantized networks on mobile GPUs, based on which [411] introduces a unified INT8 quantization framework that integrates various off-the-shelf INT8 quantization techniques, such as symmetric, asymmetric, per-layer, and per-channel INT8 quantization. The work of [412] investigates the efficiency bottleneck of INT8 quantization and introduces hardware-friendly search space design to enable efficient INT8 quantization. More recently, [450, 451] explore INT8 quantization to compress redundant CNNs for efficient in-memory computing infrastructures. In addition to quantizing CNNs, [413] turns back to transformers and leverages INT8 quantization to quantize computation-intensive transformers in order to boost the inference efficiency for general NLP tasks.
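As a concrete illustration, the following is a minimal sketch of symmetric per-tensor post-training INT8 quantization, assuming PyTorch; the max-calibration rule is a common simplification of the calibration step shown in Figure 23, not TensorRT's actual API.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor scale: map the largest |w| to 127.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale            # approximate recovery of the weights

w = torch.randn(64, 64)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())   # per-tensor quantization error
```

Production toolchains additionally calibrate activation ranges on representative data (e.g., minimizing a divergence between the full-precision and quantized distributions), but the weight-side arithmetic follows this scale-and-round pattern.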
Fig. 22.
Fig. 22. Comparisons between post-training quantization and quantization-aware training. In contrast to post-training quantization, quantization-aware training integrates the quantization loss into the training loss, which allows the optimizer to minimize the quantization loss to improve the quantization accuracy.
Fig. 23.
Fig. 23. TensorRT’s INT8 calibration [410].
Mixed-Precision Networks. Mixed-precision quantization is another well-established branch of network quantization. As shown in [415], mixed-precision quantization allows more fine-grained quantization schemes across different weights and activations. As a result, it can usually achieve better accuracy–efficiency trade-offs than conventional fixed-precision quantization, such as binarized (1-bit), ternarized (2-bit), and INT8 (8-bit) quantization. For example, TBN [415], the very first mixed-precision network, proposes to combine layer-wise ternarized inputs and binarized weights, which delivers surprisingly better accuracy–efficiency trade-offs than stand-alone binarized networks and ternarized networks. The success of TBN has motivated several subsequent mixed-precision quantization works [416, 417] to continue improving quantization accuracy. For example, SYQ [416] proposes to quantize the network weights with 1/2 bits and the intermediate activation with 8 bits, whereas PACT [417] allows 2-bit activations and 2/3/4/5-bit weights. These early mixed-precision quantization works have demonstrated promising performance. Later, we will introduce automated mixed-precision quantization, which exploits automated techniques to search for the optimal bit allocation and can achieve more fine-grained quantization. The works in [418, 419, 420, 421] also consider leveraging mixed-precision quantization to improve the training efficiency of full-precision networks, which can significantly accelerate the training process and, more importantly, achieve comparable accuracy to the full-precision training.

4.2.2 Quantization Extensions and Implementations.

Here, we introduce several quantization extensions and implementations, including quantization-aware training, automated mixed-precision quantization, and quantization-aware hardware accelerators.
Quantization-Aware Training. Quantization-aware training refers to the technique that trains quantized networks, which is fundamentally different from post-training quantization as shown in Figure 22. Note that post-training quantization can achieve satisfactory performance on early networks such as AlexNet [76] and VGGNet [1], which, however, suffers from significant accuracy loss when applied to more advanced lightweight networks such as MobileNets [32, 85] and ShuffleNets [34, 35]. In general, quantization-aware training incorporates the quantization loss into the training loss, which then allows the optimizer to minimize the quantization loss during the training process in order to unlock better quantization accuracy than post-training quantization. In practice, the seminal quantization-aware training work [414] proposes quantizing both weights and activations with 8-bit integers. To maximize the accuracy of INT8 quantized networks, [414] also introduces an effective tailored quantization-aware training approach to train the resulting INT8 quantized networks. Similar to [414], the works of [422, 423] unify and improve quantization-aware training of INT8 quantized networks to minimize accuracy degradation. To generalize quantization-aware training to train other types of quantized networks (e.g., 1-bit and 2-bit networks), a plethora of quantization-aware training works [424, 425, 426, 427] have been proposed, which further push forward the attainable quantization accuracy.
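The following is a minimal sketch of quantization-aware training via fake quantization, assuming PyTorch: the forward pass simulates INT8 rounding while the backward pass uses a straight-through estimator so that the optimizer sees the quantization error; it is illustrative rather than any specific framework's implementation.

```python
import torch
import torch.nn as nn

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    # Simulate symmetric INT8 quantize-dequantize in the forward pass.
    scale = w.detach().abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127) * scale
    # Straight-through estimator: forward uses q, backward treats it as identity.
    return w + (q - w).detach()

class QATLinear(nn.Linear):
    def forward(self, x):
        return nn.functional.linear(x, fake_quant_int8(self.weight), self.bias)

layer = QATLinear(16, 4)
loss = layer(torch.randn(8, 16)).pow(2).mean()
loss.backward()   # gradients reach the latent full-precision weights
```

After training, the latent full-precision weights are quantized once with the same scale and exported as true INT8 tensors for deployment.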
Automated Mixed-Precision Quantization. Early quantization works typically quantize all weights and activations with the same level of precision, such as 1 bit for binarized networks and 2 bits for ternarized networks. Despite the promising performance, early uniform quantization practices suffer from suboptimal accuracy–efficiency trade-offs. For example, as shown in TBN [415], mixed-precision quantization that combines layer-wise ternarized inputs and binarized weights can achieve much better accuracy–efficiency trade-offs than stand-alone binarized networks and ternarized networks. However, determining the optimal mixed-precision quantization strategy is difficult owing to the large number of possible quantization combinations across different layers. To overcome such limitations, recent research has shifted to automated mixed-precision quantization [428, 429, 430, 431, 432, 433, 434, 435, 436] thanks to the tremendous success of NAS, as discussed in Section 3. Among them, [428], the first automated mixed-precision quantization work, follows early differentiable NAS practices [39, 138] to search for the optimal layer-wise precision assignment. HAQ [429] leverages reinforcement learning-based search to explore the huge quantization design space with hardware feedback in the loop, which focuses on finding the optimal layer-wise precision assignment to maximize both quantization accuracy and hardware efficiency. Note that we can easily generalize recent advances in the field of NAS to further improve automated mixed-precision quantization.
Quantization-Aware Accelerators. In contrast to INT8 quantized neural networks, binarized, ternarized, and mixed-precision quantized neural networks are not supported by mainstream hardware, such as mobile devices, CPUs, and edge GPUs. This further demands the design of quantization-aware accelerators to efficiently execute low-bit quantized networks at runtime. To this end, a plethora of representative quantization-aware accelerators have been proposed recently [437, 438, 439, 440, 441, 442, 443, 444, 445, 446], including binarized network-based accelerators [437, 438, 439], ternarized network-based accelerators [440, 441, 442], and mixed-precision network-based accelerators [443, 444, 445, 446]. These quantization-aware accelerators have demonstrated significant efficiency improvements in terms of latency, memory, area, and energy consumption in various real-world embedded scenarios.

4.3 Network Distillation

Network distillation, also referred to as knowledge distillation,7 is another well-established paradigm to further push forward the accuracy–efficiency trade-off, which was initially proposed in [501] and subsequently generalized in [11, 28]. Note that knowledge distillation is a plug-and-play training technique, which has been applied to various tasks to achieve better training performance, such as object detection [502] and language understanding [96]. In contrast to network pruning and network quantization, which focus on improving network efficiency without sacrificing network accuracy, as discussed in Sections 4.1 and 4.2, network distillation instead boosts the accuracy–efficiency trade-off from the accuracy perspective, which aims to improve the network accuracy without changing the network structure. In other words, unlike network pruning and network quantization, which lead to simplified network structures, network distillation leaves the network structure unchanged but typically yields higher accuracy. As shown in [11, 28, 501], knowledge distillation refers to the training process that leverages a larger pre-trained teacher network to benefit the training process of a smaller student network (see Figure 25), which transfers the rich and discriminative knowledge from the larger pre-trained teacher network to the smaller student network to further achieve better accuracy on the target task than simply training the student network alone. Next, we first present the preliminaries of knowledge distillation and then introduce recent representative data-dependent and data-efficient knowledge distillation works. These knowledge distillation works can also be found in Figure 24.
Fig. 24.
Fig. 24. Illustration of data-dependent and data-efficient knowledge distillation (KD) works in Section 4.3.
Fig. 25.
Fig. 25. Illustration of the teacher-student knowledge transfer process in seminal knowledge distillation techniques [11, 28].
Knowledge Distillation Basics. In order to better understand knowledge distillation, we first elaborate on the preliminaries of knowledge distillation, which are mainly based on the most representative knowledge distillation work [11]. As shown in previous state-of-the-art networks [32, 34, 35, 85], the network outputs, also referred to as the network logits, are typically fed into the softmax function to calculate the probability distribution over different categories for further prediction purposes. However, given a pre-trained teacher network, the output logits after the softmax function are discriminative but less informative, which are close to either 1 or 0 (e.g., [0.02, 0.95, 0.01, 0.01, 0.01]). This makes it difficult to directly transfer the discriminative knowledge from the pre-trained teacher network to the student network. The rationale behind this is that the student network is of smaller network size than the teacher network and is thus less capable of learning discriminative knowledge [11]. To mitigate this issue, [11] leverages the distillation temperature T to soften the knowledge from the pre-trained teacher network to facilitate the teacher–student knowledge transfer process, which can be mathematically formulated as follows:
\begin{equation} z_i = \frac{\exp (y_i/T)}{\sum _{j=1}^n \exp (y_j/T)} \,\,\, s.t. \,\,\, i = 1, \ldots , n , \end{equation}
(19)
where \(\lbrace y_i\rbrace _{i=1}^n\) denotes the output logits without softmax and T is the temperature to soften the output logits \(\lbrace y_i\rbrace _{i=1}^n\) as \(\lbrace z_i\rbrace _{i=1}^n\). Note that Equation (19) is equivalent to the standard softmax function when \(T=1\). As shown in [11], larger T can produce softer probability distribution over different categories (e.g., \([0.1, 0.6, 0.1, 0.1, 0.1]\) when \(T=5\) and \([0.2, 0.2, 0.2, 0.2, 0.2]\) when \(T=+\infty\)). In addition, fixing the temperature T to 2 can empirically yield the best performance. Furthermore, [11] exploits the softened knowledge from the pre-trained teacher network to guide the training process of the student network, which can be mathematically formulated as follows:
\begin{equation} \mathop {\mathrm{minimize}}_{w} \,\,\, \mathcal {L}_{train}(x, w) = \mathcal {L}(y, y^*) + \alpha \cdot T^2 \cdot \mathcal {L}(y, z) \,\,\, s.t. \,\,\, y = f_w(x) , \end{equation}
(20)
where x is the input data, \(y^*\) is the ground-truth label, z is the softened knowledge from the pre-trained teacher network, \(\alpha\) is the constant to control the teacher–student distillation magnitude, and \(\mathcal {L}(\cdot)\) is the standard cross entropy loss function. Here, \(f_w(\cdot)\) denotes the student network parameterized by the weights w. As demonstrated in [11], it is important to multiply the teacher–student distillation loss term (i.e., \(\mathcal {L}(y, z)\)) by \(T^2\) because softening with the temperature T scales the gradients of this term by \(1/T^2\) during the training process.
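The following is a minimal sketch of the temperature-softened distillation loss in Equations (19) and (20), assuming PyTorch; the soft-label term is implemented as a KL divergence, which matches the cross-entropy formulation up to a constant, and \(\alpha\) and T follow the notation above.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    # Hard-label term: standard cross entropy against the ground truth y*.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: match the teacher's temperature-softened distribution
    # (Equation (19)); scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean")
    return ce + alpha * (T ** 2) * soft

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)   # in practice: frozen teacher outputs
labels = torch.randint(0, 10, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```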
With the above in mind, we introduce recent representative knowledge distillation practices next, which are built upon [11] and can be roughly divided into the following two categories: data-dependent and data-efficient knowledge distillation.

4.3.1 Data-Dependent Knowledge Distillation.

In this section, we introduce several representative data-dependent knowledge distillation techniques, including logit-based knowledge distillation, intermediate layer-based knowledge distillation, multi-teacher knowledge distillation, teacher-free knowledge distillation, and privileged knowledge distillation.
Knowledge from Logits. Knowledge distillation from logits is one representative branch of knowledge distillation and has been widely applied to improve network accuracy thanks to its conceptual simplicity and surprisingly strong performance. The works of [11, 27] pioneered leveraging the logit-based knowledge from the pre-trained teacher network to facilitate the training process of the less-capable student network. As seen in [11], the logit-based knowledge from the pre-trained teacher network, also referred to as soft labels, corresponds to the output of the teacher network after being fed into the softmax function to calculate the probability distribution over different categories, as shown in Equation (19). Subsequently, [452, 453, 454] investigate the efficacy of knowledge distillation and demonstrate that early knowledge distillation practices [11, 27] only yield suboptimal results since \(\alpha\) and T are fixed for different teacher–student networks, as shown in Equation (20). The works of [455, 456, 457] demonstrate that the accuracy of the student network may significantly degrade when there are large accuracy gaps between teacher and student. To overcome such limitations, [456] introduces an intermediate-sized network (i.e., teacher assistant) to facilitate the knowledge transferred from the pre-trained teacher network to the student network, thus effectively bridging the gap between powerful teacher and less-capable student. In addition to soft labels, [458, 459, 460] demonstrate that noisy labels are also helpful to knowledge distillation, which can be leveraged to further improve the accuracy of the student network.
Knowledge from Intermediate Layers. Knowledge distillation from intermediate layers is another representative branch of knowledge distillation, which provides more fine-grained knowledge to better guide the training process of the smaller student network. The rationale behind this is that intermediate features are also discriminative and can be combined with the final network output to further enhance the feature expressiveness as seen in [503]. Specifically, [28] pioneers investigation of knowledge distillation from intermediate layers and introduces hint learning to improve the training process of the student network, in which hints correspond to the intermediate features of the teacher network. Compared with logit-based knowledge, knowledge from intermediate layers is often richer and more fine-grained, as shown in [28]. A plethora of subsequent knowledge distillation works [461, 462, 463, 464, 465, 466] has been proposed to enhance the knowledge transferred from the pre-trained teacher network to the student network, continuing to explore the rich intermediate features to facilitate the training process of the student network. For example, [466] delves into more fine-grained channel-level knowledge distillation, leading to more fine-grained and discriminative knowledge.
Multi-teacher Knowledge Distillation. The standard knowledge distillation paradigm exploits the pre-trained knowledge from one single teacher network to guide the training process of the less-capable student network [11, 28]. The work of [467] demonstrates that the student network may learn richer and more discriminative knowledge from multiple teacher networks, which push the student network to achieve better accuracy since multiple teacher networks can provide more informative and instructive knowledge than one single teacher network. To this end, [467] proposes to average the network weights of multiple teacher networks (i.e., mean teachers) to better guide the training process of the student network. Similar to [467], several follow-up knowledge distillation works [468, 469, 470, 471] propose to average the output logits of multiple pre-trained teacher networks and then exploit the averaged knowledge to enhance the training process of the student network. In addition, [472, 473, 474] demonstrate that directly averaging the output logits of multiple teacher networks ignores the teacher diversity since different teacher networks may maintain different network capabilities. To avoid this, [472, 473, 474] propose to actively enable and disable different teacher networks through gates during the training process to better guide the student network to learn more discriminative knowledge from different teacher networks.
Teacher-Free Knowledge Distillation. Despite the promising accuracy improvement, previous knowledge distillation works [11, 28] highly rely on off-the-shelf pre-trained teacher networks, which necessitate considerable computational resources to train teacher networks. In addition, to maximize accuracy improvement, it is of utmost importance to design proper teacher networks, leading to additional engineering efforts. To overcome such limitations, several knowledge distillation works [475, 476, 477, 478, 479, 480, 481] have been proposed recently to exclude teacher networks, which instead exploit the knowledge from the student network itself to guide the training process of the student network with teacher-free methods. For example, [475] introduces deep mutual learning, which demonstrates that the pre-trained teacher network is not necessary in the context of knowledge distillation. Instead, [475] demonstrates that an ensemble of student networks can collaboratively learn from each other throughout the training process and, more importantly, can achieve surprisingly better training accuracy than standard knowledge distillation practices [11, 28]. This explicitly indicates that the knowledge from the student network itself can also be leveraged to improve the training process of the student network towards better training accuracy.
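As a rough illustration of the teacher-free, mutual-learning idea in [475], the sketch below trains a cohort of two peer students, each supervised by the ground-truth labels and by the other peer's predicted distribution; the loss weighting is illustrative.

```python
import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, labels, kd_weight=1.0):
    """Deep mutual learning for two peer students (no pre-trained teacher).

    Each network is supervised by the ground truth and by the other network's
    predicted distribution (treated as a constant target via detach()).
    """
    ce_a = F.cross_entropy(logits_a, labels)
    ce_b = F.cross_entropy(logits_b, labels)

    p_a = F.softmax(logits_a.detach(), dim=-1)
    p_b = F.softmax(logits_b.detach(), dim=-1)
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=-1), p_b, reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=-1), p_a, reduction="batchmean")

    return ce_a + kd_weight * kl_a, ce_b + kd_weight * kl_b
```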
Privileged Knowledge Distillation. Privileged knowledge distillation is a special type of knowledge distillation in which the student network has access to additional information or features that are not available to the teacher network during the training process [482]. In contrast to the standard knowledge distillation paradigm [11], privileged knowledge distillation allows the student network to learn from both the teacher network and the additional information that is only available to the student network, which has the potential to further improve the attainable accuracy as demonstrated in [482]. The rationale behind privileged knowledge distillation is that the student network can leverage the additional information to improve its ability to mimic the behavior of the pre-trained teacher network. Furthermore, inspired by [482], several privileged knowledge distillation works [483, 484, 485, 486, 487] have been proposed recently to improve the performance of the student network in various tasks. For example, [484] explores progressive privileged knowledge distillation to embrace better online action detection. In addition, [487] introduces privileged feature distillation to improve the product recommendations of Taobao. These privileged knowledge distillation works clearly demonstrate that the student network may benefit from the additional information and knowledge to achieve better training accuracy on the target task.

4.3.2 Data-Efficient Knowledge Distillation.

In this section, we introduce generative adversarial networks (GAN)-based knowledge distillation and few-sample knowledge distillation, which are of high data efficiency and can perform teacher–student distillation using a small amount of training data.
GAN-based Knowledge Distillation. Despite promising accuracy improvement, knowledge distillation is often data driven and highly relies on sufficient training data to transfer the rich pre-trained knowledge from the teacher network to the student network. This inevitably leads to significant engineering efforts for data preparation, such as data collection, cleaning, and labeling. As seen in the relevant literature, GANs have been applied to a wide range of tasks and are considered to be one of the most effective approaches for generating high-quality synthetic data [504]. In light of this, a plethora of GAN-based knowledge distillation works [488, 489, 490, 491, 492, 493, 494] have been proposed recently to leverage GANs to generate sufficient training data and then use the generated data to train the student network. These GAN-based knowledge distillation works have demonstrated significant data efficiency since the well-optimized generator can be used to produce a large amount of high-quality synthetic data while at the same time achieving promising accuracy improvement on the target task.
Few-Sample Knowledge Distillation. In addition to GAN-based knowledge distillation, another promising direction is to perform efficient knowledge distillation that transfers the rich knowledge from the pre-trained teacher network to the student network with only a small amount of training data or even a few data samples, which can also bring significant data efficiency. To achieve this, several few-sample knowledge distillation works [33, 495, 496, 497, 498, 499, 500] have been proposed recently. Among them, [497] proposes a simple yet effective solution for knowledge distillation using label-free few samples to realize both data efficiency and training/processing efficiency. Specifically, [497] first inserts one \(1\times 1\) convolutional layer at the end of each building block of the student network and then optimizes the inserted \(1\times 1\) convolutional layer to minimize the knowledge distillation loss, which can quickly converge using only a few data samples. More recently, [500] introduces an effective mimicking-then-replacing knowledge distillation technique to quickly train the student network with only a few data samples, which maintains significant data efficiency while still achieving superior training accuracy.
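The block-level alignment idea of [497] can be roughly sketched as follows, where only an appended 1×1 convolution is optimized to match the teacher's block output on a handful of unlabeled samples; the function interface and hyperparameters are illustrative, and for brevity both blocks receive the same input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def align_block_with_few_samples(student_block, teacher_block, unlabeled_batches,
                                 channels, lr=1e-3, steps=100):
    """Few-sample, label-free alignment of one student block to its teacher block.

    Only the appended 1x1 convolution is optimized; the student block itself
    stays frozen, so convergence with only a few samples is feasible.
    """
    adapter = nn.Conv2d(channels, channels, kernel_size=1)
    optimizer = torch.optim.Adam(adapter.parameters(), lr=lr)

    for step in range(steps):
        x = unlabeled_batches[step % len(unlabeled_batches)]
        with torch.no_grad():
            # In practice each block receives the features produced by the
            # preceding (already aligned) blocks; the same input is used here
            # purely for brevity.
            target = teacher_block(x)
            student_out = student_block(x)
        loss = F.mse_loss(adapter(student_out), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adapter
```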

4.4 Visions for the Future

In this section, we envision several promising future trends and possible directions in the field of network compression, which are summarized as follows:
(1)
Automated Teacher–Student Search. Knowledge distillation transfers the rich knowledge from the pre-trained teacher network to the student network to facilitate the training process of the student network, which has achieved promising accuracy improvement [11, 28]. In the past, researchers empirically exploited larger networks as teacher networks and smaller networks as student networks. However, such empirical practices may lead to suboptimal accuracy and cannot always achieve accuracy improvement. The rationale behind this is that different student networks may prefer quite different teacher networks, as shown in [505, 506]. This further motivates us to design the optimal teacher–student network pair to maximize the attainable accuracy of the student network. To achieve this, one promising alternative is to leverage recent advances in the field of NAS to automatically search for the optimal teacher–student network pair.
(2)
Joint Network Compression. To embrace better accuracy–efficiency trade-offs, an intuitive and straightforward approach is sequential network compression, which exploits multiple network compression techniques to progressively reduce network complexity. For example, [507] introduces a simple yet effective sequential compression pipeline, which starts with searching for an efficient network with ProxylessNAS [40] and then applies automated channel pruning [369] and mixed-precision quantization [429] to further trim down the network size. However, this sequential compression pipeline has critical drawbacks that lead to suboptimal results. This is because the searched optimal network is not necessarily optimal for subsequent pruning and quantization. To address this, one promising future direction is joint network compression, which jointly optimizes the network structure, pruning, and quantization to yield the best accuracy–efficiency trade-off.
(3)
Federated Network Compression. Federated learning is an emerging decentralized learning approach that allows multiple hardware devices to collaboratively learn the same network without sharing their raw data [508]. Specifically, federated learning allows the network to be trained locally on each hardware device using its own data, in which only the network updates, rather than the raw data, are sent back to the central server for further aggregation. In light of this, one promising future direction is federated network compression, including federated pruning, federated quantization, and federated distillation, which can significantly enhance data privacy and protect data security while still achieving competitive performance in terms of accuracy and training efficiency.
(4)
Domain-Specific Network Compression. In addition to image classification, there are a wide range of popular downstream applications, such as object detection, tracking, and semantic segmentation, in which the involved networks are still quite computation intensive. This makes it difficult to accommodate the limited available computational resources in real-world embedded scenarios. To tackle this issue, some early practices have attempted to leverage general network compression techniques to compress domain-specific networks. For example, [502, 509, 510] propose to leverage pruning [510], quantization [509], and knowledge distillation [502] to improve the accuracy–efficiency trade-off in real-world object detection scenarios, which, however, are still under-explored and cannot simply generalize to other scenarios. Therefore, one promising future direction is domain-specific network compression, which exploits domain-specific knowledge to largely boost network compression performance towards better accuracy–efficiency trade-offs.
(5)
Mixed-Precision Training. Mixed-precision training refers to the technique that trains the network with both full-precision weights and low-bit weights, which has the potential to significantly improve training efficiency without sacrificing accuracy. For example, PyTorch [67] introduces Automatic Mixed Precision (AMP), which combines 32-bit full-precision weights with 16-bit half-precision weights during the training process. As a result, AMP achieves the same level of accuracy as the stand-alone 32-bit full-precision training while at the same time being able to deliver about ×2 training speedups for convolutional networks [511]. In light of this, one promising future direction is to leverage low-bit mixed-precision training techniques to train full-precision networks, which may aggressively push forward training efficiency without degrading accuracy.
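To make the AMP workflow concrete, the following is a minimal PyTorch training-loop sketch using torch.cuda.amp; the model, optimizer, and train_loader objects are assumed to be defined elsewhere.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

for inputs, labels in train_loader:   # model/optimizer/train_loader assumed defined
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # runs eligible ops in FP16, keeps others in FP32
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then takes the optimizer step
    scaler.update()                   # adjusts the scale factor for the next iteration
```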

5 Efficient On-Device Learning for Embedded Computing Systems

On-device learning consists of two branches: on-device inference and training. On-device inference refers to the process of deploying efficient pre-trained networks on local hardware devices, which allows local hardware devices to run various intelligent inference tasks, such as image classification and object detection. There have been several representative techniques [513, 514] to enable efficient on-device inference. They focus on either designing computation-efficient networks with less redundancy or compressing computation-intensive networks to reduce the computational complexity in order to accommodate the limited on-device computational resources. Note that this article has discussed popular techniques for efficient on-device inference, such as efficient manual/automated network design and efficient network compression. The readers may refer to Sections 2 to 4 for more details.
On-device training refers to the capability of local hardware devices to perform training tasks directly on-device without the need for remote servers [44]. Unlike on-device inference, in which the deployed network always remains static, on-device training may further enhance the deployed network over time. This allows the deployed network to adapt to new data collected from local sensors to achieve better accuracy. Thanks to its strong capability of protecting data privacy and ensuring data security, on-device training has become increasingly popular over the past few years in order to achieve secured embedded intelligence [546]. To this end, we systematically discuss recent state-of-the-art on-device learning techniques (especially on-device training) in this section, including general on-device learning in Section 5.1, on-device continual learning in Section 5.2, on-device transfer learning in Section 5.3, and on-device federated learning in Section 5.4, since these techniques feature different learning algorithms to enhance on-device learning performance. For better understanding, we also summarize these methods in Figure 26. Note that these on-device learning techniques can typically generalize across different networks (e.g., convolutional networks and transformers). For example, we can leverage on-device federated learning to optimize both convolutional networks and transformers on multiple local hardware devices.
Fig. 26.
Fig. 26. Comparisons of efficient on-device learning techniques that have been discussed in Section 5.

5.1 General On-Device Learning

In this section, we introduce recent state-of-the-art works about general on-device learning techniques, including efficient on-device inference and efficient on-device training.
Efficient On-Device Inference. To enable efficient on-device inference, one straightforward approach is to design tiny networks with less redundancy in order to accommodate the limited on-device computational resources. To this end, a plethora of representative tiny networks [512, 513, 514, 515] have been proposed recently, including MicroNets [512], MCUNets [513, 514], and EtinyNet [515]. MCUNetV1 [513], one of the early tiny networks, proposes to jointly design the lightweight tiny network using TinyNAS and the lightweight inference engine using TinyEngine, enabling ImageNet-scale inference on microcontrollers. MCUNetV2 [514] introduces an efficient patch-based inference pipeline to trim down on-device memory consumption for better on-device inference since memory consumption is the key bottleneck of on-device inference. However, in contrast to training large networks, training tiny networks poses significant challenges, as demonstrated in [547]. The rationale here is that existing regularization techniques (e.g., data augmentation and dropout), despite being able to benefit the training process of large networks, may degrade the training performance of tiny networks [547]. To tackle this issue, [547] proposes augmentation of the tiny network itself rather than augmenting the input data, which shows promising accuracy improvement over the standard training scheme.
Efficient On-Device Training. The key difference between on-device training and inference is that on-device training requires saving all of the intermediate activations, which are used to optimize parameters using gradient descent during backward propagation. In contrast, on-device inference that only performs forward propagation does not need to save intermediate activations, which can be progressively released to reduce memory consumption. In light of this, on-device training suffers from non-negligible memory consumption since the activation size grows with respect to the training batch size and training typically involves a large batch size to accelerate the training process. As a result, intermediate activations become the major bottleneck of on-device training, as demonstrated in [44]. For example, under the batch size of 16, the activation size of ResNet50 [2] is ×13.9 larger than its parameter size, as shown in Figure 27. To alleviate the excessive memory consumption caused by intermediate activations, there have been several representative strategies, including gradient checkpointing, activation gradient pruning, and low-bit training.
Fig. 27.
Fig. 27. Comparisons between on-device training and inference in terms of memory consumption. This reveals that the activation size, instead of the parameter size, is the major bottleneck of on-device training, motivating future research to reduce the activation size for efficient on-device training (figure from [44]).
(1)
Gradient Checkpointing. Gradient checkpointing is a simple yet effective memory optimization technique that seeks to reduce training memory consumption at the cost of increased training time [516]. To this end, gradient checkpointing reserves a minimal set of intermediate activations during forward propagation, which are then utilized to re-compute the remaining intermediate activations during backward propagation. As shown in [516], gradient checkpointing has the potential to significantly reduce the training memory consumption from \(O(n)\) to \(O(\sqrt {n})\), where n is the number of network layers. More importantly, gradient checkpointing does not degrade training accuracy since the training behaviors remain the same as the standard training scheme. Several subsequent gradient checkpointing works generalize gradient checkpointing to allow arbitrary computation graphs [517] and train GNNs [518].
(2)
Activation Gradient Pruning. Activation gradient pruning removes less important intermediate activation gradients to optimize training memory consumption [519]. This relies on an empirical observation that most of the intermediate activation gradients during backward propagation are very close to zero and thus have minimal impact on gradient descent [519]. Therefore, pruning these very small activation gradients can effectively reduce training memory consumption at the cost of minimal accuracy loss, which also accelerates the training process. Similar to [519], the work of [520] proposes an efficient gradient filtering scheme, which filters similar activation gradients during backward propagation and only reserves those with unique elements to reduce the number of elements in the activation gradient maps. Another popular approach is to build dynamic sparse computation graphs to eliminate intermediate activations in an input-dependent manner, which can also reduce training memory consumption [521].
(3)
Low-Bit Training. Low-bit training refers to training the given network with low-bit weights (e.g., 16, 8, or even fewer bits) rather than full-precision 32-bit weights, which has the potential to substantially reduce training memory consumption (up to ×32 in the extreme 1-bit case) [82]. The rationale here is that low-bit training can reduce the memory consumption for both network weights and intermediate activations. Specifically, [522], an early exploration, proposes to train the given network with 16-bit weights under stochastic rounding, which leads to ×2 less training memory consumption than the standard full-precision 32-bit training counterpart and, more importantly, maintains comparable training accuracy. The work of [523] introduces an efficient INT8 training pipeline, consisting of loss-aware compensation and backward quantization, to enable tiny on-device training thanks to the well-optimized INT8 quantization on mainstream hardware. The work of [45] proposes to optimize real-quantized graphs, in which an effective memory-efficient sparse update scheme and tiny training engine are integrated to achieve on-device training under 256 KB of memory. It is worth noting that low-bit training is similar to network quantization as discussed in Section 4.2 since both leverage quantized weights to trim down network complexity. This allows recent advanced quantization techniques to be generalized to benefit low-bit on-device training.
We note that the aforementioned strategies can also be combined to further reduce the overall memory consumption during training and accelerate the training process. For example, gradient checkpointing and low-bit training can be easily combined so that the \(O(\sqrt {n})\) activation memory of checkpointing, where n is the number of network layers, is further reduced by a constant factor determined by the chosen bit width.
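As a minimal illustration of gradient checkpointing, the sketch below wraps an illustrative sequential backbone with PyTorch's torch.utils.checkpoint utilities; the same wrapper can be combined with the mixed-precision recipe sketched earlier to stack both savings.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative sequential backbone; only segment-boundary activations are kept in
# memory, and the rest are re-computed during the backward pass.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)])
inputs = torch.randn(32, 1024, requires_grad=True)

segments = 4  # roughly sqrt(depth) segments trade extra compute for less memory
outputs = checkpoint_sequential(model, segments, inputs)
loss = outputs.mean()
loss.backward()  # activations inside each segment are re-computed here
```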

5.2 On-Device Continual Learning

On-device continual learning, also known as on-device lifelong learning or incremental learning, is an advanced learning paradigm that allows the deployed network to continuously learn from the newly collected data to further push forward the attainable accuracy [46, 533]. This is particularly favored in real-world embedded scenarios, especially those with rich local sensors, where embedded devices can continue to collect new data through local sensors over time [44, 45]. The newly collected data is then used to train the deployed network to unlock better performance over time. In other words, we are allowed to utilize the newly collected data to train or fine-tune the deployed network on target hardware itself, which typically leads to better accuracy on the target task. On-device continual learning performs local training and does not need to send back the newly collected data to remote servers, which also protects data privacy and ensures data security. However, on-device continual learning, despite being able to deliver significant benefits, suffers from the catastrophic forgetting issue, which is the tendency to forget the previously learned knowledge when adapting to newly collected data [46]. The rationale is that on-device continual learning must adjust the pre-trained network weights in order to adapt to the newly collected data, which deteriorates the previously learned knowledge accordingly.
To alleviate the catastrophic forgetting issue, a plethora of state-of-the-art on-device continual learning works have been established recently [46, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534], which seek to stabilize the on-device continual learning process and further push forward the attainable accuracy on the target task. Among them, [46], an early exploration, investigates three common continual learning scenarios and demonstrates that it is frustratingly hard to evaluate different continual learning approaches, based on which [46] establishes several evaluation protocols to compare them. It is worth noting that [46] does not specifically target on-device continual learning, but we can easily integrate recent advances from the lens of on-device training (see Section 5.1) into [46] to further enable efficient on-device continual learning. Several subsequent on-device continual learning works [524, 525, 526, 527] explore on-device continual learning on resource-constrained embedded computing systems and have demonstrated promising accuracy improvement. In parallel, [528, 529, 530] attempt to generalize on-device continual learning to benefit audio and speech tasks in real-world embedded scenarios, such as environmental sound classification [528] and automatic speech recognition [530]. Inspired by the tremendous success of vision transformers, [107, 531] also investigate the efficacy of on-device continual learning to continuously improve the accuracy of mainstream vision transformers. The works of [532, 533, 534] delve deeper into on-device continual learning, focusing more on the training pipeline and introducing several on-device training enhancements to maximize accuracy improvement, such as selective weight updates [532], weight freezing [533], and deep network ensembles [534].

5.3 On-Device Transfer Learning

As demonstrated in [44], it is often difficult to directly train DNNs from scratch in real-world embedded scenarios, where the number of collected data samples is quite limited. To tackle this issue, an effective alternative is on-device transfer learning, which fine-tunes networks pre-trained on large-scale datasets. The rationale here is that DNNs pre-trained on large-scale datasets (e.g., ImageNet [80]) can serve as powerful feature extractors for further transfer learning, which only fine-tunes several layers (e.g., batch normalization layers and the last layer), whereas other layers are typically frozen. In contrast to previous on-device learning practices as discussed in Sections 5.1 and 5.2, on-device transfer learning does not require storing most of the memory-intensive intermediate activations. As a result, it maintains significant efficiency in terms of training memory consumption, as illustrated in Figure 28 [44]. Despite the promising memory efficiency, on-device transfer learning is quite challenging and may result in poor accuracy, especially on datasets whose data distribution is far from ImageNet [82].
Fig. 28.
Fig. 28. Comparisons of various on-device transfer learning methods, including TinyTL [44], FT-Norm+Last [537], FT-Last [536], and FT-Full [548]. TinyTL freezes the weights and only optimizes the bias modules. FT-Norm+Last fine-tunes the normalization layers and the last linear layer, whereas FT-Last fine-tunes the last linear layer and FT-Full fine-tunes the full network (figure from [44]).
To overcome such limitations, early transfer learning practices [535, 536] propose to fine-tune all the network layers, which indeed achieves better accuracy but leads to considerable memory consumption due to memory-intensive intermediate activations. To avoid this, several subsequent transfer learning works [537, 538, 539, 540] demonstrate that it is often not necessary to fine-tune all the network layers and that fine-tuning batch normalization layers alone can also achieve strong accuracy on the target task. This fine-tuning paradigm has the potential to significantly reduce the number of trainable parameters during the transfer learning process. In light of this, [537, 538, 539, 540] propose to only optimize learnable parameters in batch normalization layers (see \(\gamma\) and \(\beta\) in Equation (13)), whereas other learnable parameters are frozen during the transfer learning process. For example, [537] leverages batch normalization layers as scale-and-bias patches and then trains the patched parameters, optionally also the last layer, whereas the remaining parameters are left unchanged. Furthermore, [538] reveals that, for those networks with sufficient depth, training only \(\gamma\) and \(\beta\) can reach surprisingly strong accuracy, which demonstrates the expressive power of the learnable parameters in batch normalization layers. However, fewer trainable parameters do not directly translate to superior training memory efficiency, as shown in Figure 27: fine-tuning only the batch normalization layers may still involve a large amount of memory consumption (e.g., 326 MB under a training batch size of 8) to store their memory-intensive intermediate activations [44].
To further alleviate the prohibitive training memory consumption, [44] introduces a simple yet effective transfer learning solution, which exhibits significant training memory efficiency. Specifically, [44] relies on an empirical observation that intermediate activations are only required to update network weights, whereas updating network biases does not involve intermediate activations. This observation also reveals that the training memory bottleneck comes from updating network weights rather than biases. In light of this, [44] proposes to freeze network weights and only update network biases. However, freezing network weights and only updating network biases may lead to significant accuracy loss. To compensate for such accuracy loss due to freezing network weights, [44] introduces an effective lite residual learning scheme, which leverages generalized memory-efficient bias modules to refine memory-intensive intermediate activations. In particular, the lite residual learning scheme can improve the attainable accuracy on the target task and, more importantly, at the cost of negligible memory overheads. Finally, [44] reduces the training memory consumption from more than 250 MB to only 16 MB, making it possible to explore in-memory computing infrastructures to perform memory-efficient transfer learning. Note that we can easily integrate recent advances from the lens of on-device training (see Section 5.1) into the aforementioned transfer learning works towards boosted on-device transfer learning performance.
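The lightweight fine-tuning recipes above can be sketched as follows, in the spirit of FT-Norm+Last [537] and the bias-only idea behind TinyTL [44] (its lite residual modules are omitted); the backbone, head size, and optimizer settings are illustrative.

```python
import torch
import torchvision

# Illustrative ImageNet-pre-trained backbone; any network could be used instead.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new head for the target task

# Freeze everything first; weight updates are what require storing the
# memory-intensive input activations, as discussed above.
for param in model.parameters():
    param.requires_grad = False

# Re-enable only the lightweight pieces: BatchNorm affine parameters (gamma, beta),
# biases, and the new classifier head.
for module in model.modules():
    if isinstance(module, torch.nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = True
for name, param in model.named_parameters():
    if name.endswith(".bias") or name.startswith("fc."):
        param.requires_grad = True

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01, momentum=0.9
)
```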

5.4 On-Device Federated Learning

On-device federated learning is an advanced decentralized learning paradigm that enables efficient training on a large corpus of decentralized data residing on local client devices such as mobile phones and allows multiple local client devices to jointly train the given network without explicitly sharing their raw data [541, 549]. In practice, on-device federated learning has the potential to significantly accelerate the training process as the number of client devices grows. In addition, on-device federated learning is one instance of the more general approach of “bringing the neural network to the data” rather than “bringing the data to the neural network.” As a result, it addresses the fundamental problems of data privacy, security, and ownership [508]. This is particularly favored in real-world embedded scenarios, in which embedded devices can continue to collect new data through local sensors. Thanks to these practical benefits, on-device federated learning has garnered increasing attention from both academia and industry [47]. In the past decade, on-device federated learning has been utilized to empower a plethora of real-world intelligent applications, such as mobile keyboard content suggestions [550], medical image analysis [551], and smart health care infrastructures [552]. As demonstrated in [541], standard on-device federated learning practices typically consist of the following five iterative steps (a minimal code sketch of this workflow is provided after the list):
(1)
Initialization. On-device federated learning begins with the randomly initialized network, namely, the global model, which is shared among local client devices. At the early learning stage, the global model is sent to all local client devices from the centralized server, in which each local client device receives the same copy of the global model.
(2)
Local Training. Once local client devices receive the global model, they start to perform local on-device training, which treats the global model as the local model and then trains the local model using the locally collected data. Note that the locally collected data only resides on the local client device and is not shared among other client devices.
(3)
Model Update. After local on-device training terminates, each local client device generates its respective model update scheme, which should essentially reflect what the local client device has learned from the locally collected data. These model update schemes, instead of the locally collected data, are then sent back to the centralized server for further aggregation, which effectively mitigates the risk of data leakage and protects data privacy.
(4)
Aggregation. The centralized server receives model update schemes from all local client devices, after which the centralized server aggregates the received model update schemes to further produce an improved global model.
(5)
Distribution. The centralized server then distributes the improved global model to all local client devices and these steps repeat until convergence.
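The five-step workflow above can be sketched as a federated-averaging round in the spirit of [541]; the client data loaders and the local training routine are assumptions for illustration, all clients participate in the round, and aggregation is weighted by local dataset size.

```python
import copy
import torch

def federated_averaging_round(global_model, client_loaders, local_train_fn):
    """One FedAvg-style round: distribute, train locally, aggregate, redistribute.

    Each client copies the global model, trains it locally, and returns only its
    updated weights; the server averages them weighted by local dataset size.
    """
    client_states, client_sizes = [], []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)       # (1)(2) distribute + local training
        local_train_fn(local_model, loader)             # assumed local SGD routine
        client_states.append(local_model.state_dict())  # (3) model update, not raw data
        client_sizes.append(len(loader.dataset))

    total = float(sum(client_sizes))
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:                               # (4) weighted aggregation
        # In this simplified sketch, buffers (e.g., BN statistics) are averaged too.
        new_state[key] = sum(
            state[key].float() * (size / total)
            for state, size in zip(client_states, client_sizes)
        )
    global_model.load_state_dict(new_state)             # (5) distribute the improved model
    return global_model
```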
Despite being able to deliver superior learning performance across various real-world embedded tasks, on-device federated learning suffers from critical limitations, especially from the perspective of data transmission [508], posing significant challenges to generalizing on-device federated learning to benefit real-world intelligent embedded applications. In contrast to the centralized server that is equipped with high-end network infrastructures, local client devices in real-world embedded scenarios are often low end with less-capable network infrastructures. In such a case, it may be time-consuming to (1) distribute the global model from the centralized server to local client devices and (2) send back model update schemes from local client devices to the centralized server for further aggregation. To overcome such limitations, a plethora of advanced federated learning techniques have been established recently to accommodate the limited data bandwidth of local client devices [47, 414, 541, 542, 543, 544, 545], which primarily focus on reducing the total data bits transferred between the remote centralized server and local client devices, such as federated averaging [541], gradient compression [542, 543], quantization [414], delayed gradient averaging [544], partial variable training [545], and local training sparsity [47]. Note that we can easily integrate recent advances from the lens of general on-device training techniques (see Section 5.1) into the aforementioned federated learning works to further enhance on-device federated learning.

5.5 Visions for the Future

In this section, we envision several promising future trends and possible directions in the field of on-device learning, which are summarized as follows.
(1)
Offline On-Device Federated Learning. As discussed in Section 5.4, on-device federated learning highly relies on the centralized server for updating local models, which requires stable Internet connectivity for data movement between local devices and the remote server. This may lead to inferior on-device learning efficiency due to the communication overheads between local devices and the remote server, especially when Internet connectivity is limited or unavailable. Therefore, one promising future trend is offline on-device federated learning, which excludes the remote centralized server and instead coordinates the learning tasks directly among local devices. Offline on-device federated learning has the potential to significantly boost on-device learning efficiency.
(2)
Personalized On-Device Learning. As discussed in Section 5.1, on-device learning exhibits strong local personalization, which distinguishes it from its global training counterpart. Personalized on-device learning can bring two-fold benefits. On the one hand, it allows local devices to directly learn from local users to provide user-tailored AI solutions, which protects data privacy since the collected data do not need to be transferred to the cloud. On the other hand, it can achieve better learning accuracy since it can continue to collect rich new personalized training data from local users. Therefore, future research should leverage this unique capability to provide highly personalized on-device learning solutions, in which local devices can actively and quickly adapt themselves to users’ diverse needs to deliver user-tailored services, such as personalized voice assistants.
(3)
Robust On-Device Learning. On-device learning, despite being able to achieve promising success, still suffers from critical limitations, such as poor adversarial robustness [553]. This is particularly important in real-world embedded computing systems, such as embedded visual sensing [554, 555], in which the environments may dynamically change over time. This further makes local on-device learning more vulnerable to adversarial attacks, especially those unseen adversarial attacks, which may significantly degrade the on-device learning performance even when encountering simple adversarial attacks [553]. To overcome such limitations, future research should focus on developing robust on-device learning techniques featuring novel adversarial training algorithms, which can achieve competitive on-device learning performance while also maintaining superior adversarial robustness against well-engineered adversarial attacks or even unseen adversarial attacks.
(4)
Efficient On-Device Learning Ecosystems. As discussed in Section 5.1, on-device learning has gained increasing popularity from both academia and industry thanks to its strong capability to ensure data privacy and security. In light of this, future research should also develop efficient on-device learning ecosystems, including software and hardware frameworks, to further support the development, deployment, and management of on-device learning applications, making it easier for developers to create and optimize models for various on-device learning purposes. For example, [45], one of the most representative on-device learning methods, leverages quantization to trim down training memory consumption. However, mainstream embedded computing systems do not natively support low-bit training, making it difficult for such methods to deliver their full benefits on mainstream embedded platforms.

6 Efficient Large Language Models for Embedded Computing Systems

In the past few years, LLMs, such as GPT-3 [48] and GPT-4 [49], have achieved impressive success across various real-world language processing tasks [50]. However, the strong learning capability of LLMs also comes at the cost of excessive computational complexity. For example, OpenAI’s GPT-3 [48], one of the most representative LLMs, consists of 175 billion parameters. More recently, LLMs have continued to evolve with ever-increasing model sizes in order to achieve state-of-the-art performance [51, 52]. This makes it even more challenging to deploy LLMs on modern embedded computing systems. To this end, we first introduce the preliminaries on LLMs and then discuss recent state-of-the-art advances from the perspective of efficient LLMs in this section, including efficient LLM architecture design in Section 6.2, efficient LLM compression techniques in Section 6.3, and efficient LLM system design in Section 6.4. We also summarize the above state-of-the-art advances regarding efficient LLMs in Figure 29. Finally, in Section 6.5, we envision several promising future directions in the field of efficient LLMs.
Fig. 29.
Fig. 29. Overview of efficient LLM architectures, LLM compression techniques, and LLM systems in Section 6.

6.1 Preliminaries on LLMs

LLMs are emerging machine learning models that are dedicated to understanding, generating, and interacting with human language through leveraging extensive textual data. In practice, LLMs are typically built upon the transformer architecture [90] and heavily rely on self-attention mechanisms to measure the significance of different words in the given sentence regardless of their positional relationships. Thanks to their strong capability to interpret rich information, LLMs can exhibit remarkable performance across a wide range of language processing tasks, such as text summarization, translation, question answering, and conversational response generation. As discussed in [50], recent state-of-the-art LLMs can be divided into three main categories according to their inherent architectures: encoder-only architecture, decoder-only architecture, and encoder–decoder architecture, as follows.
(1)
Encoder-Only Language Models. Encoder-only language models typically focus on transforming the given input text into continuous representations, which can capture and reflect the context of the given input text. These encoder-only language models are usually used for real-world language processing tasks that require understanding or embedding of the given input text, such as sentence classification, named entity recognition, and extractive question answering, where the output does not need to be sequential or generated texts. For example, BERT [91] is one of the most representative encoder-only language models that features masked language modeling during training, which enables the model itself to understand the context from both directions (i.e., left and right context).
(2)
Decoder-Only Language Models. Decoder-only language models typically focus on generating texts based on the given input text, which can interpret the context of the given input text. These decoder-only language models are usually used for real-world language processing tasks in which generating texts is required, such as text generation and language modeling. For example, GPT-3 [48] is one of the most representative decoder-only language models that features advanced auto-regressive training, which can learn to accurately predict the next word in sequence from all the previous words.
(3)
Encoder-Decoder Language Models. Encoder-decoder language models, also known as sequence-to-sequence (seq2seq) models, typically consist of two parts: (1) the encoder to process the given input text and encode it into feature representations and (2) the decoder to explore the above feature representations to generate an output sequence. The encoder-decoder architecture is versatile and suitable for real-world language processing tasks that require transformation of the given input text into different formats, such as language translation, summarization, and dialogue systems. For example, T5 [556] is one of the most representative encoder-decoder language models, which formulates the given task as a text-to-text transformation problem and converts the given input text to the target output text. In parallel, BART [557] is particularly effective in generative and comprehensive tasks thanks to its bidirectional encoder and auto-regressive decoder.

6.2 Efficient LLM Architectures

As discussed in Section 6.1, recent LLMs are typically built upon transformers [90] and heavily rely on self-attention mechanisms to interpret the significance of different words in a given sentence, regardless of their positional relationships. However, self-attention mechanisms, despite their strong capability for language processing, also introduce considerable computational complexity. As pointed out in [599], the quadratic time and memory complexity of self-attention mechanisms may significantly slow down the pre-training, inference, and fine-tuning stages of LLMs. To optimize the prohibitive computational complexity of LLMs, recent state-of-the-art efficient LLMs often focus on exploring computation-efficient self-attention mechanisms.
General Efficient Attention. Some recent works [558, 559, 560, 561, 562, 563] focus on exploring computation-efficient attention mechanisms to optimize the quadratic computational complexity of vanilla self-attention [90]. Among them, [558] introduces clustered attention, which groups different queries into clusters and computes the attention just for the centroids rather than computing the attention for every query. To further improve the above approximation, [558] also employs the computed clusters to identify the keys with the highest attention per query and computes the exact key-query dot products. The work in [559] draws insights from the Nyström method and proposes approximating the standard self-attention mechanism in linear complexity, which can enable applications to longer sequences with even thousands of tokens. The work of [560] demonstrates that the complexity bottleneck of self-attention mainly comes from the computation of partition functions in the denominator of the softmax function and the multiplication of the softmax matrix with the matrix of values. To this end, [560] features an efficient kernel density estimation (KDE) solver to resolve the above complexity bottleneck via subsampling-based fast matrix products, which can approximate the attention in subquadratic time with provable spectral norm bounds. The work of [561] introduces an efficient single-head gated attention mechanism with an exponential moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism, which can exhibit linear time and space complexity while causing minimal performance loss. The work of [562] introduces an efficient universal approximation for self-attention, which can exhibit linear time and space complexity and reveal the theoretical insights behind existing efficient transformers (e.g., Linformer [97]) with linear time and space complexity. The work of [563] introduces an efficient attention approximation mechanism featuring fused low-rank kernel approximation, which can provide sizable runtime performance gains and is also high quality.
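As a rough illustration of the linear-complexity approximations discussed above, the sketch below implements kernelized linear attention with an ELU+1 feature map (one common illustrative choice); causal masking and the numerical refinements used by specific methods are omitted.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear-complexity attention: softmax(QK^T)V is approximated by
    phi(Q) (phi(K)^T V), so the cost grows linearly in sequence length.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    phi_q = F.elu(q) + 1.0                                   # positive feature map phi(.)
    phi_k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", phi_k, v)           # phi(K)^T V: linear in n
    normalizer = torch.einsum("bhnd,bhd->bhn", phi_q, phi_k.sum(dim=2))
    out = torch.einsum("bhnd,bhde->bhne", phi_q, kv)
    return out / (normalizer.unsqueeze(-1) + eps)
```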
Hardware-Aware Efficient Attention. In parallel to the above efficient attention approximation works, some recent works [53, 54, 55, 564, 565, 566] focus on exploring efficient hardware-aware attention mechanisms, which can exhibit considerable efficiency improvement on modern hardware systems. Among them, FlashAttention [54] features an efficient IO-aware exact attention algorithm, which explores tiling to reduce the total number of memory reads and writes between GPU high bandwidth memory and GPU on-chip Static Random Access Memory (SRAM). However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, which is mainly due to the suboptimal work partitioning between different thread blocks and warps on GPUs. To tackle this issue, FlashAttention-2 [55] introduces an improved work partitioning scheme, which (1) tweaks the algorithm to reduce the number of non-MatMul FLOPs, (2) parallelizes the attention computation workloads across different thread blocks, and (3) distributes the work between warps to reduce communications through shared memory. FLASHLINEARATTENTION [566] dives deeper into I/O awareness and introduces an effective hardware-efficient algorithm for linear attention, which trades off memory movement against parallelizability and can even be faster than FlashAttention-2. PagedAttention [53] draws insights from the operating system’s solution to memory fragmentation and sharing through virtual memory with paging and further divides the request’s key-value (KV) cache into different blocks, each of which contains the attention keys and values of a fixed number of tokens. A3 [564] demonstrates that implementing attention mechanisms using matrix-vector multiplication is often suboptimal and further proposes to accelerate attention mechanisms with joint algorithmic approximation and hardware specialization. Similar to A3, ELSA [565] features an effective approximation scheme to significantly reduce the amount of computation workloads by efficiently filtering out relationships that are unlikely to affect the final output.
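From a practitioner's viewpoint, fused attention kernels of this kind are already exposed by mainstream frameworks; for example, the PyTorch call below can dispatch to a FlashAttention-style fused kernel when the hardware, data types, and shapes allow it (a minimal usage sketch with illustrative shapes).

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); half precision is typically required for the
# fused FlashAttention-style backend on recent GPUs.
q = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)

# PyTorch selects an available fused backend internally; is_causal=True applies the
# autoregressive mask used by decoder-only LLMs.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```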

6.3 Efficient LLM Compression

In addition to designing LLMs with efficient architectures, another promising direction is to explore efficient LLM compression techniques to optimize the computational complexity of existing computation-intensive LLMs. With this in mind, we discuss recent state-of-the-art compression techniques for LLMs in this section, including efficient LLM pruning in Section 6.3.1, efficient LLM quantization in Section 6.3.2, and efficient LLM distillation in Section 6.3.3.

6.3.1 Efficient LLM Pruning.

Pruning is one of the most effective strategies to optimize the computational efficiency of LLMs, which removes the less important parameters of LLMs while incurring minimal accuracy loss. Recent state-of-the-art LLM pruning methods can be divided into two main categories, non-structured LLM pruning and structured LLM pruning, as follows.
(1)
Non-structured LLM Pruning. Non-structured LLM pruning removes the less important LLM weights/connections, which can yield more aggressive compression ratios than structured pruning while also exhibiting strong accuracy [58, 346, 567, 568, 569, 570, 571, 572]. For example, SparseGPT [567] shows that LLMs can be pruned to at least 50% sparsity in one shot without any retraining and, more importantly, at minimal accuracy loss. In parallel, Wanda [58] proposes to prune the weights with the smallest magnitudes multiplied with the norms of the corresponding input activations, compared on a per-output basis (a minimal sketch of this scoring rule is provided after this list). More importantly, both SparseGPT and Wanda can generalize to semi-structured pruning [346, 568] towards better hardware parallelism, which can deliver realistic on-device inference speedups with the support of some existing deep learning libraries (e.g., cuSPARSElt [340] and TVM [341]). The work in [569] advocates for reinstating ReLU activation in LLMs and explores sparse patterns in ReLU-based LLMs, which shows that ReLU activation can effectively reduce LLM inference computation overheads by up to three times. In practice, non-structured pruning has also been widely employed to enhance the pre-training and fine-tuning processes of LLMs [570, 571, 572].
(2)
Structured LLM Pruning. In contrast to non-structured LLM pruning, structured LLM pruning can achieve realistic inference speedups on target hardware, which, however, also suffers from more aggressive accuracy loss than non-structured pruning. To tackle this dilemma, recent state-of-the-art structured LLM pruning methods [57, 573, 574, 575, 576, 577, 578, 579] typically feature an additional fine-tuning stage to further recover the attainable accuracy of the pruned LLM. For example, LLM-Pruner [57] employs structural pruning to selectively remove non-critical coupled structures according to their gradient information, which can preserve the majority of the LLM’s functionality while optimizing its computational efficiency. LLM-Pruner recovers the performance of the pruned LLM using another state-of-the-art tuning technique (i.e., LoRA [600]), which merely takes 3 hours with 50K data samples. Similar to LLM-Pruner, ZipLM [574] iteratively identifies and removes LLM components with the worst loss–runtime trade-off, which can yield efficient LLMs and generalize across various runtime constraints. In addition, LoRAShear [575] first creates dependency graphs over LoRA modules and then proceeds to progressive structured pruning on LoRA adaptors, enabling inherent knowledge transfer. To further recover the information lost during pruning, LoRAShear also introduces an effective fine-tuning scheme with dynamic data adaptors to narrow the performance gap between the pruned and non-pruned LLMs. More recently, several LLM layer pruning methods [577, 578, 579] demonstrate that LLM layers are also redundant and can thus be removed to largely enhance the inference efficiency of LLMs at minimal accuracy loss. For example, ShortGPT [578] and Shorted-LLaMA [579] propose to remove the less important LLM layers according to their layer importance scores, whereas LLM-Streamline [577] proposes to replace the less important LLM layers with more lightweight ones.
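The Wanda-style scoring rule referenced in item (1) can be sketched as follows; the calibration interface (per-channel activation norms collected offline) and the layer-wise sparsity level are illustrative assumptions.

```python
import torch

def wanda_prune_linear(weight, act_norm, sparsity=0.5):
    """Prune a linear layer's weight matrix with a Wanda-style metric [58].

    weight:   (out_features, in_features) weight matrix.
    act_norm: (in_features,) L2 norm of each input feature, collected offline
              from a small calibration set (interface assumed for illustration).
    """
    score = weight.abs() * act_norm.unsqueeze(0)      # |W_ij| * ||X_j||_2
    num_prune = int(weight.shape[1] * sparsity)
    # Per-output comparison group: zero out the lowest-scoring weights in each row.
    prune_idx = torch.argsort(score, dim=1)[:, :num_prune]
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask
```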

6.3.2 Efficient LLM Quantization.

Recent state-of-the-art LLM quantization techniques [59, 60, 580, 581, 582, 583, 584, 585, 586] focus on reducing the weights of LLMs from higher to lower bits (e.g., from 32 bits to 8 bits or even 1 bit), which can substantially enhance the inference efficiency of LLMs at the cost of slight accuracy loss. Among them, SmoothQuant [59] introduces an efficient training-free post-training quantization solution to enable 8-bit weight and 8-bit activation quantization for LLMs. Given that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights through an equivalent mathematical transformation. Similar to SmoothQuant, AWQ [60] introduces an efficient hardware-friendly quantization approach for low-bit LLM weight-only quantization, which is built upon an interesting observation that weights are not equally important and protecting only 1% of salient weights can greatly reduce the quantization error. In light of this, AWQ proposes to search for the optimal per-channel scaling scheme that protects the salient weights by observing the activations rather than the weights. SpQR [580] introduces a new compressed format for efficient LLM quantization, which can enable near-lossless compression of LLMs across various model scales while maintaining compression levels comparable to previous quantization methods. SpQR first identifies and isolates the outlier weights that may cause particularly large quantization errors, after which SpQR stores them in high precision while compressing all other weights in 3 to 4 bits. OS+ [581] features channel-wise shifting for asymmetry and channel-wise scaling for concentration, since these operations can be seamlessly migrated into the subsequent quantization modules while maintaining strict equivalence. OS+ also introduces a fast and stable scheme to calculate effective shifting and scaling values, which can further achieve a better quantization burden balance towards better quantization performance.
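The smoothing transformation behind SmoothQuant can be sketched roughly as follows, where a per-channel scale computed from calibration statistics migrates quantization difficulty from activations to weights; the migration-strength hyperparameter alpha and the calibration interface are illustrative.

```python
import torch

def smooth_linear(act_abs_max, linear_weight, alpha=0.5, eps=1e-5):
    """Compute per-input-channel smoothing scales s_j and fold them in offline.

    act_abs_max:   (in_features,) per-channel max |activation| from calibration.
    linear_weight: (out_features, in_features) weight of the following linear layer.
    The preceding layer's output must be divided by s (e.g., folded into its
    weights or normalization parameters) while the linear weight is multiplied
    by s, so the overall function is mathematically unchanged.
    """
    weight_abs_max = linear_weight.abs().amax(dim=0)   # per input channel
    scale = (act_abs_max.clamp(min=eps) ** alpha) / (weight_abs_max.clamp(min=eps) ** (1 - alpha))
    smoothed_weight = linear_weight * scale.unsqueeze(0)   # W' = W * diag(s)
    return scale, smoothed_weight                          # activations then use X' = X / s
```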
OWQ [582] introduces an efficient outlier-aware weight quantization strategy that aims to minimize an LLM’s memory footprint through low-bit quantization. OWQ prioritizes a small subset of structured weights that are sensitive to quantization and stores them in higher bits while applying highly tuned quantization to the remaining dense weights. QuIP [583] introduces quantization with incoherence process, which consists of two independent stages: (1) an adaptive rounding stage to minimize the pre-defined quadratic proxy objective and (2) an efficient pre- and post-processing stage to ensure weight and Hessian incoherence via multiplication by random orthogonal matrices. OmniQuant [584] introduces an omnidirectionally calibrated quantization technique for LLMs, which consists of two novel components: learnable weight clipping (LWC) and learnable equivalent transformation (LET). LWC modulates the extreme weight values by optimizing the clipping threshold and LET eliminates the activation outliers by shifting the challenge of quantization from activations to weights. Both LWC and LET can be seamlessly integrated into an effective differentiable optimization framework featuring block-wise error minimization for both weight-only and weight-activation quantization. The work of [585] dives deeper into LLM quantization and analyzes the effect of LLM quantization with comprehensive experiments to evaluate current state-of-the-art LLM quantization techniques, which systematically summarizes the effect of LLM quantization, provides recommendations to apply LLM quantization techniques, and points out future directions of LLM quantization. In contrast to these LLM quantization works that focus on efficient LLM quantization algorithms, OliVe [586] presents an algorithm/architecture co-designed solution to explore efficient quantized LLMs, which features an outlier–victim pair (OVP) quantization scheme and handles outlier values locally with low hardware overheads and high performance gains. This enables an efficient hardware-aligned OVP encoding scheme, which can be integrated into existing hardware accelerators (e.g., systolic arrays and tensor cores) towards more efficient quantized LLMs for generative inference.

6.3.3 Efficient LLM Distillation.

Another promising direction is to leverage the pre-trained knowledge from large LLMs to enhance the training or fine-tuning process of small LLMs, which can allow small LLMs to maintain as strong a performance as large LLMs while exhibiting superior efficiency. As discussed in [50], recent LLM distillation methods can be divided into two main categories, black-box LLM distillation and white-box LLM distillation, as follows:
(1)
Black-Box LLM Distillation. In the context of black-box distillation, the teacher LLM’s parameters are not available for the student LLM and the student LLM can only see the final output from the teacher LLM. Black-box distillation typically features those commercial LLMs (e.g., GPT-3 [48] and GPT-4 [49]) as the teacher and leverages the predictions from the teacher to further enhance the training or fine-tuning process of small student LLMs [61, 587, 588, 589]. For example, Self-Instruct [61] first generates a large number of instructions, input, and output sequences from GPT-3 using its application programming interfaces (APIs), after which Self-Instruct filters the invalid or similar ones before using them to fine-tune the original GPT-3 model. Finally, Self-Instruct can achieve an absolute improvement of 33% over the original GPT-3 model on Super-NaturalInstructions. Similar to Self-Instruct, [587] uses GPT-4 to first generate rich instruction-following data pairs and then uses the generated data pairs to fine-tune small LLaMA models to improve their performance.
(2)
White-Box LLM Distillation. In the context of white-box distillation, the teacher LLM’s parameters are available to the student LLM, which can also access the hidden intermediate outputs of the teacher LLM. More recently, with the emergence of open-source LLMs, white-box distillation has become more popular and more valuable for the LLM community since the student LLM can potentially benefit from the hidden states of the teacher LLM towards better distillation performance [62, 590, 591, 592]. MiniLLM [62] first replaces the forward Kullback-Leibler divergence (KLD) objective with reverse KLD, which can prevent the student LLM from overestimating the low-probability regions of the teacher distribution. Next, MiniLLM introduces an effective optimization approach to learn the above reverse KLD objective, which can enhance the student LLM to generate high-quality responses. TED [590] presents an effective task-aware layer-wise distillation strategy, which features task-aware filters to align the hidden states of teacher and student at each layer. The above filters select the knowledge from the hidden states that is useful for target tasks, which can further reduce the knowledge gap between teacher and student LLMs. GKD [591] proposes to train the student LLM on its self-generated output sequences along with the feedback from the teacher LLM on such self-generated sequences. GKD also offers the flexibility to employ alternative loss functions between teacher and student, which can enhance the distillation performance of the student even when the student lacks the expressivity to mimic the teacher’s distribution. More recently, [592] introduces token-scaled logit distillation for quantization-aware training of LLMs, which can effectively mitigate overfitting and also largely enhance learning from both the teacher predictions and the ground-truth labels.
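To make the white-box setting more concrete, the short PyTorch sketch below computes a reverse-KLD distillation loss between student and teacher logits, in the spirit of MiniLLM; it is a minimal sketch rather than the authors' implementation, and the vocabulary size, temperature, and toy batch are arbitrary placeholders.

import torch
import torch.nn.functional as F

def reverse_kld_loss(student_logits, teacher_logits, temperature=1.0):
    # KL(student || teacher) = sum_v q(v) * (log q(v) - log p(v)),
    # averaged over token positions.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_p = s_logp.exp()
    return (s_p * (s_logp - t_logp)).sum(dim=-1).mean()

# Toy usage: 4 token positions over a 32k-entry vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = reverse_kld_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())

Unlike the forward KLD used in standard distillation, this objective penalizes the student for placing probability mass where the teacher places very little, which is the mode-seeking behavior that MiniLLM exploits.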

6.4 Efficient LLM Systems

In parallel to the rapid development of efficient LLM algorithms, a plethora of efficient LLM systems and infrastructures have recently emerged [63, 64, 65, 593, 594, 595, 596, 597, 598] that further optimize the generative inference efficiency of LLMs from the perspective of efficient system-level implementations. FlexGen [63] features an efficient high-throughput generation engine for running LLMs on one single GPU with limited memory, which can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, FlexGen also searches for efficient patterns to store and access tensors. Tabi [65] features an inference system with an efficient multi-level inference engine, which can serve queries using small models and optional LLMs for demanding applications. Tabi is particularly optimized for discriminative models (not generative LLMs) in a serving framework, which uses the calibrated confidence score to determine whether to directly return the accurate results of small models or further re-route them to LLMs. DeepSpeed [593] presents a comprehensive system solution for efficient transformer inference, which consists of (1) a multi-GPU inference engine to minimize the runtime latency while maximizing the runtime throughput of both dense and sparse transformers when they fit into aggregate GPU memory and (2) a heterogeneous inference engine that leverages CPU and NVMe memory in addition to GPU memory and computation to enable high inference throughput with large models that do not fit into aggregate GPU memory. FastServe [594] presents a distributed inference serving system for efficient LLM inference that (1) exploits the auto-regressive pattern of LLM inference to enable preemption at the granularity of each output token and (2) explores preemptive scheduling to minimize job completion time with a novel skip-join multi-level feedback queue scheduler. Petals [64] features an efficient collaborative system for runtime inference and fine-tuning of LLMs through joining the resources of multiple parties. In contrast to concurrent LLM inference engines, Petals also natively exposes hidden states of the served model, allowing training and sharing of custom model extensions based on efficient fine-tuning schemes.
S³ [595] demonstrates that designing an inference system with prior knowledge of the output sequence can largely increase the runtime inference throughput of LLMs. To this end, S³ proposes to (1) predict the output sequence length, (2) schedule generation queries based on the prediction to increase runtime resource utilization and throughput, and (3) handle mispredictions when they occur. Thanks to this prior knowledge of the output sequence, S³ achieves much better inference throughput than earlier LLM systems. Splitwise [596] proposes to split the two phases of a typical LLM inference workload onto different hardware, which allows each phase to run on well-suited hardware and to be provisioned with independent computational resources, thereby improving runtime resource utilization across different hardware. Splitwise also optimizes the state transfer across different hardware using the fast back-plane interconnects in today’s GPU clusters to further increase runtime LLM inference throughput. DistServe [597] proposes to disaggregate the prefill and decoding computation to enhance the runtime serving performance of LLMs, which assigns the prefill and decoding workloads to different GPUs and, thus, largely eliminates the prefill-decoding interference towards better runtime inference throughput. DistServe also optimizes the two phases according to the serving cluster’s bandwidth to minimize the communication overheads caused by the disaggregation. Liger [598] features an efficient distributed collaborative inference system for LLMs, which can achieve low inference latency at high throughput on multiple GPUs. To achieve high parallelism and throughput, Liger also introduces an efficient scheduling strategy that effectively schedules the computation and communication kernels of different input requests onto multiple streams across multiple GPUs.
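The scheduling intuition behind S³ can be illustrated with a small, framework-agnostic Python sketch that batches generation requests by predicted output length so that short and long generations do not share a batch; the length predictor, batch size, and Request fields below are hypothetical placeholders rather than components of the actual system.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Request:
    prompt: str
    predicted_len: int = 0  # filled in by the length predictor

def schedule_by_predicted_length(requests: List[Request],
                                 predict_len: Callable[[str], int],
                                 max_batch_size: int = 8) -> List[List[Request]]:
    # Batching similar-length generations reduces the idle time caused by
    # short requests waiting for the longest request in their batch.
    for r in requests:
        r.predicted_len = predict_len(r.prompt)
    ordered = sorted(requests, key=lambda r: r.predicted_len)
    return [ordered[i:i + max_batch_size]
            for i in range(0, len(ordered), max_batch_size)]

# Toy usage with a naive proxy predictor (prompt word count as a stand-in).
reqs = [Request(p) for p in ["hi", "summarize this long article please", "translate: bonjour"]]
for batch in schedule_by_predicted_length(reqs, predict_len=lambda p: len(p.split())):
    print([r.predicted_len for r in batch])

A real serving system would also need the misprediction handling that S³ describes, e.g., evicting and re-queuing requests that exceed their predicted length.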

6.5 Visions for the Future

In this section, we envision several promising future trends and possible directions in the field of efficient LLMs, as follows.
(1)
AutoML for Efficient LLMs. Recent state-of-the-art efficient LLMs are typically built upon manual heuristics, which, despite their efficacy, often require considerable human expertise and engineering efforts. In light of this, one promising future direction is to automatically explore efficient LLMs using automated machine learning (AutoML) techniques [137]. For example, given an efficient LLM, we can leverage AutoML techniques to automatically search for its tailored efficient system implementation towards the optimal on-device inference speedup. Similarly, we can also leverage AutoML techniques to automatically search for its tailored pruning or quantization strategy towards the optimal accuracy–efficiency trade-off. This has the potential to largely push forward the frontier of efficient LLM designs.
(2)
Alternative Structures for Efficient LLMs. Recent state-of-the-art LLMs heavily rely on the self-attention mechanism in transformers [90], which, however, suffers from quadratic time and memory complexity and greatly slows down the pre-training, inference, and fine-tuning stages of LLMs [599]. To tackle this dilemma, several alternative structures have recently emerged (e.g., RWKV [601], Mamba [602], and RetNet [603]), which exhibit optimized computational efficiency and allow researchers to perform efficient language modeling tasks without transformers. For example, RetNet [603] introduces the recurrent representation to enable low-cost inference, which improves the decoding throughput, runtime latency, and GPU memory without sacrificing the language modeling performance. In light of this, one promising future direction is to explore more efficient alternative structures for LLMs, which may deliver considerable efficiency gains over existing transformer-based LLMs without sacrificing the language modeling performance.
(3)
Hardware-Aware Benchmarks for Efficient LLMs. Recent state-of-the-art efficient LLMs are typically optimized in terms of the number of parameters or FLOPs. However, these theoretical complexity metrics cannot accurately reflect the runtime performance on target hardware (e.g., latency and energy). This makes it challenging to fairly compare different efficient LLMs in terms of their runtime inference efficiency on target hardware. One promising future direction is to design hardware-aware benchmarks for efficient LLMs, which may include different hardware performance metrics (e.g., latency and energy) across different hardware systems.
(4)
Infrastructures for Efficient LLMs. Recently, there have been a large number of works on efficient LLM compression, including LLM pruning and LLM quantization. However, they often require specialized hardware accelerators and, thus, cannot achieve realistic on-device inference speedups on modern embedded computing systems. For example, non-structured LLM pruning can remove the less important weights to explore highly sparse LLMs with aggressive compression ratios. However, the resulting sparse LLMs cannot achieve realistic on-device inference speedups due to the irregular network sparsity [57]. Another recent work [586] has also explored accelerating quantized LLMs and achieved promising performance. However, these are far from enough for real-world large-scale deployments. One promising future direction is to design specialized software and hardware infrastructures to further optimize LLMs for efficient on-device inference.

7 Deep Learning Frameworks for Embedded Computing Systems

In the past few years, DNNs have been achieving tremendous success in a myriad of real-world intelligent embedded computing scenarios, such as on-device speech recognition [604, 605], object detection and tracking [606, 607], and autonomous vehicles [608, 609]. A series of customized software programs [66, 67, 610, 611, 612, 613, 614, 615] and hardware frameworks [68, 69, 70, 616, 617, 618] have also been developed to facilitate the deployment of DNNs on embedded computing systems. Therefore, we discuss recent popular deep learning software and hardware frameworks here that bring deep learning to embedded computing systems to embrace ubiquitous embedded intelligence.

7.1 Deep Learning Software Frameworks

In this section, we introduce popular deep learning software frameworks that have been widely used to develop deep learning solutions for embedded computing systems, including TensorFlow [66], PyTorch [67], Caffe [610], MXNet [611], Keras [612], CoreML [613], PaddlePaddle [614], and BigDL [615]. These frameworks are summarized in Table 3.
Table 3.
Software | Created by | Year | Programming Languages | Computation Graph
TensorFlow [66] | Google | 2015 | Python, C++, Java, and JavaScript | Static and Dynamic
PyTorch [67] | Facebook (now Meta) | 2016 | Python and C++ | Dynamic
Caffe [610] | Berkeley | 2014 | C++ | Static
MXNet [611] | Amazon | 2015 | Python, C++, R, Java, Julia, JavaScript, Scala, Go, and Perl | Static and Dynamic
Keras [612] | Personal | 2015 | Python | Static and Dynamic
CoreML [613] | Apple | 2017 | Python, Swift, and Objective-C | Static
PaddlePaddle [614] | Baidu | 2016 | Python and C++ | Static and Dynamic
BigDL [615] | Intel | 2017 | Python and Scala | Dynamic
Table 3. Deep Learning Software Frameworks Discussed in Section 7.1
Note that we refer to the framework under active maintenance if there are new releases within the previous 6 months.
TensorFlow [66] is an open-source deep learning software framework developed by Google, which was released in 2015 and has since become one of the most popular deep learning software frameworks for training and deploying DNNs. TensorFlow, especially TensorFlow Lite, allows developers to easily build and deploy DNNs in a wide range of embedded computing systems, including mobile phones, microcontroller units (MCUs), Raspberry Pi, TPUs, and edge GPUs. TensorFlow also supports various real-world applications, ranging from image and speech recognition to NLP and predictive analytics. With its flexible architecture and vast pre-trained models, TensorFlow has been deemed to be one of the most important tools for researchers and developers in the field of deep learning.
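As a small illustration of this deployment path, the sketch below converts a toy Keras model into a TensorFlow Lite flatbuffer with default optimizations; the model architecture and output file name are arbitrary placeholders, not a recommended configuration.

import tensorflow as tf

# A toy Keras model standing in for a real embedded vision network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 96, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Convert to a TensorFlow Lite flatbuffer with default size/latency optimizations.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

The resulting .tflite file can then be executed by the TensorFlow Lite interpreter on mobile phones, MCUs, or edge accelerators.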
PyTorch [67] is an open-source deep learning software framework that is widely used for training and deploying deep neural networks. It was developed by Facebook (now known as Meta) and was released in 2016. One of its key features is the dynamic computation graph, which allows developers to change the computation behavior of DNNs on the fly. This feature distinguishes PyTorch from earlier deep learning software frameworks (e.g., early versions of TensorFlow) that primarily relied on static computation graphs. In addition, PyTorch has a number of high-level features that make it easier to build more complex DNNs. For example, the TorchVision package provides various useful tools and pre-trained models for image and video processing. With its dynamic computation graph, ease of use, and range of high-level features, PyTorch has become an essential software framework for training and deploying DNNs in the deep learning community.
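The dynamic computation graph can be illustrated with a short sketch in which ordinary Python control flow changes the forward pass on a per-input basis; the module below is a toy example, not a realistic workload.

import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    # A toy module whose depth depends on the input at runtime.
    def __init__(self, dim: int = 16):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The graph is rebuilt on every call, so the number of applied
        # layers can depend on the data itself.
        steps = 1 if x.norm() < 1.0 else 3
        for _ in range(steps):
            x = torch.relu(self.layer(x))
        return x

net = DynamicDepthNet()
print(net(torch.zeros(1, 16)).shape)      # takes the 1-step branch
print(net(torch.ones(1, 16) * 5).shape)   # takes the 3-step branch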
Caffe [610] is a popular deep learning software framework developed by Berkeley and released in 2014, which has gained increasing popularity owing to its speed, modularity, and ease of use. One of its key features is its ability to deal with large datasets containing millions of images. Another important feature is its modularity, which allows developers to add or remove components with ease. Caffe also includes a large library comprising hundreds of pre-trained models that can be used to quickly build deep learning applications, such as image classification, object detection, and segmentation. On top of these capabilities, Caffe offers a user-friendly interface that allows developers to train and deploy DNNs without extensive knowledge of deep learning. As a result, Caffe has become an invaluable tool for researchers and developers and has inspired subsequent deep learning software frameworks.
MXNet [611] is an open-source deep learning software framework to train and deploy DNNs, which was developed by Amazon and released in 2015. One of its key technical merits is its distributed training capability, which allows training of DNNs across multiple computation nodes in a computationally efficient manner. MXNet also supports multiple programming languages, such as Python, C++, R, and Julia, which increases its accessibility to researchers and developers with diverse skill levels. MXNet’s integration with big data processing frameworks and tools, such as Apache Spark and Apache Flink, is another important feature that facilitates the integration of deep learning into existing data processing pipelines. Thanks to its scalability, flexibility, and efficiency, MXNet has become a popular option for developers and researchers in the deep learning community.
Keras [612] is an open-source deep learning software framework written in Python, which provides high-level APIs for building and training efficient DNN solutions. It was developed by François Chollet in 2015 and is now maintained by a community of developers. Keras has been integrated into TensorFlow; starting from TensorFlow 2.0, Keras has become the default API to build DNN solutions in TensorFlow. One of the key features of Keras is its modularity, which allows developers and researchers to easily construct and customize DNNs. Keras also allows users to productize DNN solutions on mobile platforms such as iOS and Android, on the web, or on the Java virtual machine. Last but not least, Keras allows training of DNN solutions in an efficient distributed manner on clusters of multiple GPUs and TPUs. These strengths make Keras increasingly popular in both industry and academia.
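As a brief sketch of how little code distributed Keras training requires, the example below wraps model construction in a tf.distribute.MirroredStrategy scope, which replicates the model across the visible GPUs and falls back to the CPU when none are present; the layer sizes here are arbitrary.

import tensorflow as tf
from tensorflow import keras

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Everything created inside the scope is mirrored across devices.
    model = keras.Sequential([
        keras.Input(shape=(28, 28)),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
model.summary()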
CoreML [613] is a deep learning software framework developed by Apple in 2017, which aims to integrate DNNs into Apple commercial products, such as the iPhone, iPad, and Apple Watch. In addition to supporting extensive DNNs with over 30 layer types, CoreML covers standard machine learning models, such as tree ensembles, support vector machines, and generalized linear models. Another key feature of CoreML is its ability to directly run DNNs on the device without the need for cloud-based inference. CoreML also provides a range of optimization techniques, such as quantization and pruning, to reduce the complexity of DNNs. This is particularly important for modern mobile devices, which typically have limited storage and computational resources. Also, CoreML, built on top of advanced technologies such as Metal and Accelerate, seamlessly takes advantage of CPUs and GPUs to provide the maximum inference performance at runtime. These technical features make CoreML the first choice for developing efficient DNN solutions on Apple commercial products.
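As a rough illustration of this workflow, the sketch below converts a traced PyTorch model into a Core ML model with the coremltools package; the exact conversion arguments and output format can vary across coremltools versions, so treat this as an assumed recipe rather than a verified one.

import torch
import torchvision
import coremltools as ct

# Trace a small torchvision model so the converter sees a static graph.
model = torchvision.models.mobilenet_v2().eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# Convert the traced model into an ML Program package for on-device use.
mlmodel = ct.convert(traced,
                     inputs=[ct.TensorType(shape=(1, 3, 224, 224))],
                     convert_to="mlprogram")
mlmodel.save("mobilenet_v2.mlpackage")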
PaddlePaddle [614], also known as Paddle, is an open-source deep learning software framework developed by Baidu, which was released in 2016 to benefit the deep learning community. PaddlePaddle is designed to be an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end tools, and service platforms. PaddlePaddle originated from industrial practice and remains committed to industrial deployment; it has been adopted by a wide range of sectors, including manufacturing, agriculture, and enterprise services. These industrial strengths have motivated an increasing number of developers and researchers to adopt PaddlePaddle to commercialize AI.
BigDL [615] is an open-source deep learning software framework that runs on top of Apache Spark. It was developed by Intel and released in 2017. The goal of BigDL is to provide a high-performance, scalable, and easy-to-use platform, especially for distributed deep learning. To this end, BigDL includes a comprehensive set of features that cover various deep learning applications, including image classification, object detection, and NLP. One of the key features of BigDL is its ability to take full advantage of distributed computing resources, such as CPU, GPU, and FPGA clusters, to accelerate the training of DNNs. BigDL is seamlessly integrated with Apache Spark, which enables users to leverage the distributed computing capability of Spark for data preprocessing and postprocessing. This integration also makes it possible to build end-to-end deep learning pipelines that span from data ingestion to model deployment.

7.2 Deep Learning Hardware Frameworks

In this section, we introduce popular embedded hardware platforms that are designed to run powerful DNNs in embedded scenarios without cloud-based assistance, including NVIDIA Jetson [69], Intel Neural Compute Stick [70], Google Edge TPU [68], Google Coral Dev Board [616], Huawei HiKey 970 [617], and Orange Pi AI Stick Lite [618]. These frameworks are summarized in Table 4.
Table 4.
Hardware | RAM | Storage | Power | Performance | Price | Supported Deep Learning Software
NVIDIA Jetson TX2 [69] | 8 GB LPDDR4 | 32 GB eMMC 5.1 | 7.5–15 W | 1.33 TFLOPS | $399 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson Nano [69] | 4 GB LPDDR4 | 16 GB eMMC 5.1 | 5–10 W | 0.472 TFLOPS | $99 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson AGX Xavier [69] | 32 GB LPDDR4x | 32 GB eMMC 5.1 | 10–30 W | 32 TOPS | $1,099 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson Xavier NX [69] | 8 GB LPDDR4x | 16 GB eMMC 5.1 | 10–20 W | 21 TOPS | $399 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson AGX Orin [69] | 32 GB LPDDR5 | 64 GB eMMC 5.1 | 15–40 W | 275 TOPS | $1,999 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
NVIDIA Jetson Orin NX [69] | 16 GB LPDDR5 | 32 GB eMMC 5.1 | 10–25 W | 100 TOPS | $599 | TensorFlow, PyTorch, Caffe, Keras, and MXNet
Intel Neural Compute Stick [70] | N/A | N/A | 0.5–1.5 W | 0.1 TFLOPS | $79 | TensorFlow, Caffe, and MXNet
Google Edge TPU [68] | N/A | N/A | 2 W | 4 TOPS | $75 | TensorFlow and TensorFlow Lite
Google Coral Dev Board [616] | 1 GB LPDDR4 | 8 GB eMMC 5.1 | 1–6 W | 4 TOPS | $149 | TensorFlow and TensorFlow Lite
Huawei HiKey 970 [617] | 6 GB LPDDR4 | 64 GB UFS 2.1 | 6–12 W | 1.88 TOPS | $299 | TensorFlow, PyTorch, and Caffe
Orange Pi AI Stick Lite [618] | N/A | N/A | 1–2 W | 4 TOPS | $69 | TensorFlow, PyTorch, and Caffe
Table 4. Deep Learning Hardware Frameworks Discussed in Section 7.2
Note that the price here refers to the initial price during the product launch, which is subject to fluctuations over time.
NVIDIA Jetson [69] is a series of embedded systems-on-modules (SoMs) designed by NVIDIA for running advanced deep learning workloads, especially the inference of DNNs. NVIDIA Jetson consists of NVIDIA Jetson TX2, NVIDIA Jetson Nano, NVIDIA Jetson AGX Xavier, NVIDIA Jetson Xavier NX, NVIDIA Jetson AGX Orin, and NVIDIA Jetson Orin NX. To accelerate deep learning workloads, NVIDIA Jetson runs on top of NVIDIA’s CUDA parallel computing architectures and features an integrated system-on-chip (SoC) with a powerful NVIDIA GPU, a multi-core CPU, and various high-speed interfaces, including Ethernet, USB, HDMI, and CSI/DSI. More importantly, NVIDIA Jetson is compatible with various deep learning software frameworks, including TensorFlow, PyTorch, Caffe, Keras, and MXNet. Thanks to its advanced architectural design and powerful interfaces, NVIDIA Jetson is able to support a wide range of embedded deep learning applications to accommodate different resource and performance requirements.
Intel Neural Compute Stick [70] is a small, low-power, and cost-effective embedded hardware designed to run deep learning workloads without cloud-based assistance, which was developed by Movidius (now acquired by Intel). The Neural Compute Stick (NCS) is a small USB device that can be connected to a host computer or embedded computing system. NCS features the Myriad 2 Vision Processing Unit (VPU), which is optimized for the inference of DNNs. In addition, NCS is integrated with various high-speed interfaces, including USB 3.0 and Wi-Fi. Developers can use the Intel Movidius SDK, which provides a set of tools for developing, testing, and deploying DNNs on NCS. Furthermore, NCS supports various deep learning software frameworks, including TensorFlow, Caffe, and MXNet. Thanks to its significant flexibility and cost efficiency, NCS makes it possible to deploy advanced DNNs in a wide range of embedded scenarios.
Google Edge TPU [68] is a custom-built ASIC chip to accelerate deep learning workloads on resource-constrained edge computing systems. The Google Edge TPU is designed to seamlessly work together with TensorFlow Lite, a lightweight version of TensorFlow, and is optimized for the inference of DNNs towards enhanced inference efficiency. Google Edge TPU itself cannot work alone and, similar to the Intel Neural Compute Stick, it must be connected to other embedded computing systems, such as Raspberry Pi 4 and the Google Coral Dev Board, to deliver deep learning solutions. The Google Edge TPU is capable of performing up to two trillion floating-point operations per second (TFLOPS) and four trillion operations per second (TOPS) using only two watts of power. This further allows us to build and deploy powerful deep learning solutions on embedded computing systems with limited computational resources. Thanks to its easy integration with other embedded computing systems, powerful performance, and significant efficiency, the Google Edge TPU has gained increasing popularity in the deep learning community for deploying deep learning solutions on embedded computing systems.
Google Coral Dev Board [616] is a single-board computer designed for building embedded deep learning applications. It features an on-board Google Edge TPU, which is a custom-built chip to run TensorFlow Lite models with high performance and low power consumption. The Coral Dev Board has various built-in interfaces, including Audio, Wi-Fi, Bluetooth, Ethernet, and USB 3.0, which enable it to be connected to other embedded computing systems. The Coral Dev Board is also integrated with 1 GB LPDDR4 RAM, 8 GB eMMC 5.1 flash memory, and a MicroSD slot for additional storage. The Coral Dev Board also comes with pre-installed software tools, including TensorFlow Lite, Google Edge TPU API, and various sample applications. These allow users to easily and quickly start building their deep learning solutions. Thanks to its powerful Google Edge TPU, various useful interfaces, and software tools, the Coral Dev Board has become increasingly popular for developing and deploying embedded deep learning solutions.
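As an illustration of how such a board is typically programmed, the sketch below runs an Edge TPU-compiled TensorFlow Lite model through the tflite_runtime interpreter with the Edge TPU delegate; the model path and the libedgetpu.so.1 delegate name follow Coral's Linux convention and are assumptions that depend on the target platform.

import numpy as np
import tflite_runtime.interpreter as tflite

# Load a model compiled for the Edge TPU and attach the Edge TPU delegate.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)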
Huawei HiKey 970 [617] is a high-performance, single-board embedded computer designed by Huawei. The HiKey 970 features a powerful neural processing unit (NPU) to accelerate various deep learning workloads. The HiKey 970 is also integrated with 6 GB LPDDR4 RAM and 64 GB UFS 2.1 flash memory, while at the same time allowing MicroSD extension for additional storage. In addition, the HiKey 970 supports various high-speed interfaces, including Ethernet, USB 3.0, and PCIe 3.0. Furthermore, the HiKey 970 is compatible with popular deep learning software frameworks, such as TensorFlow, PyTorch, and Caffe, allowing users to easily and quickly build on-board deep learning solutions. Thanks to its powerful NPU, rich memory and storage, and extensive connectivity options, the HiKey 970 is suitable for a wide range of intelligent embedded applications, such as robotics, autonomous vehicles, and smart home devices.
Orange Pi AI Stick Lite [618] is a tiny and cost-effective USB stick designed for small to medium-sized deep learning workloads. It is equipped with a single-core Cortex-A7 processor and an NPU that provides hardware acceleration. Note that, similar to the Intel Neural Compute Stick and Google Edge TPU, the Orange Pi AI Stick Lite cannot work alone; it must be connected to a host device using the on-device USB 3.0 interface. The Orange Pi AI Stick Lite supports various deep learning software frameworks, including TensorFlow, PyTorch, and Caffe. Thanks to its cost efficiency, the Orange Pi AI Stick Lite is suitable for embedded computing systems to deal with small to medium-sized deep learning workloads.

7.3 Visions for the Future

In this section, we envision the future trends and possible directions of deep learning software and hardware infrastructures, summarized as follows.
(1)
Integration with Emerging Technologies. In the future, we should consider developing deep learning software and hardware that can be seamlessly integrated with emerging technologies. For example, quantum computing [619, 620, 621] has the potential to deliver significant speedup and computational capability, which can accelerate the training and inference of DNNs. Therefore, it is of paramount importance to explore the potential of integrating deep learning software and hardware with emerging technologies to unlock new possibilities and new advances in various real-world scenarios.
(2)
Democratization of Deep Learning. The democratization of deep learning [622] has emerged as a prominent trend in the deep learning era, with the explicit goal of making deep learning software and hardware more accessible to a wider range of developers and researchers. Therefore, in order to alleviate the technical barrier to entry for building efficient embedded deep learning solutions, we should continue to develop more user-friendly deep learning software and hardware frameworks, democratizing the benefits and advances of deep learning in real-world embedded scenarios.
(3)
Development of Specialized Hardware. Conventional embedded computing systems typically focus on optimizing and accelerating the training and inference of traditional convolutional networks, neglecting recent advances in the deep learning era. Vision Transformer (ViT) [107] is the most representative example, which has opened up a new direction and has been challenging the dominant role of traditional convolutional networks in various real-world vision applications, such as image classification [107, 108, 109], object detection [110, 111, 112], semantic segmentation [113, 114, 115, 116], and video analysis [117, 118, 119]. In order to unleash the promise of ViT and its variants, we should also develop specialized embedded computing systems to accelerate the ViT family rather than only focusing on accelerating complicated convolutional networks.
(4)
Development of More Powerful Hardware. As seen in recent advanced DNNs [623, 624, 625], network complexity has continued to explode. As a result, it continues to enlarge the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems. In parallel, LLMs such as ChatGPT [49] have been achieving impressive success in various NLP tasks, such as language generation, language translation, and question answering, at the cost of pushing network complexity to another unseen level, significantly enlarging the computational gap. These trends demonstrate the necessity of innovating more powerful yet cost-effective embedded computing systems to further bridge the aforementioned computational gap, especially from the hardware perspective.
(5)
Development of Infrastructures for On-Device Training. The convention in the deep learning community has been to (1) first train DNNs on powerful GPUs or the cloud and (2) then deploy the pre-trained DNNs on local embedded computing systems for inference at runtime. Compared with this convention, the emerging paradigm of on-device training enables pre-trained DNNs to adapt to new data collected from local sensors or by the users [45]. As such, users can benefit from customized DNNs without having to transfer the collected data to the cloud, thereby significantly protecting data privacy and security [45]. Nonetheless, conventional embedded computing systems are typically optimized for inference and do not support efficient on-device training owing to the memory bottleneck during training [44, 45]. This motivates us to further develop efficient infrastructures, including specialized deep learning software and hardware, to effectively accommodate future on-device training demands.

8 Deep Learning Applications for Embedded Computing Systems

In the previous sections, we have extensively discussed recent advances towards ubiquitous embedded intelligence from various perspectives of efficient deep learning networks, algorithms, software, and hardware. In this section, we further elaborate on recent popular intelligent deep learning applications in real-world embedded scenarios, spanning from vision to NLP tasks. Note that these intelligent embedded applications rely heavily on the efficient deep networks and efficient deep learning algorithms that have been extensively discussed in the previous sections.

8.1 Computer Vision Applications

Computer vision focuses on interpreting and understanding visual information from real-world environments, such as images and videos, spanning from image classification [1] to downstream vision tasks, such as object detection [4], tracking [5], and segmentation [6]. Below we discuss recent popular intelligent embedded vision applications.
Image Classification. Image classification, also referred to as image recognition, is the most fundamental vision task, which focuses on recognizing the input image based on its visual information [1]. It underpins various intelligent applications in real-world embedded computing systems, such as mobile phones and IoT sensors, enabling these systems to automatically recognize objects, scenes, or patterns within a given image [627]. For example, face recognition [628], person re-identification [503], and hand gesture recognition [629] have been widely integrated into mainstream embedded computing systems, such as mobile phones, ATMs, and intelligent cameras, for the purpose of identity authentication. In practice, image classification typically features deep convolutional networks, such as VGGNet [1], ResNet [2], and DenseNet [3], thanks to their strong capabilities to capture rich visual information, especially on large-scale datasets such as ImageNet [80]. For example, as shown in Figure 30, AlexNet [76], the first of its kind, demonstrates the possibility of leveraging convolutional layers to learn discriminative features from vision inputs and exhibits significantly better recognition performance on ImageNet than previous well-established non-convolutional approaches, such as MLPs and other learning-based techniques. ResNet [2] investigates the degradation problem of very deep convolutional networks and introduces a simple yet effective deep residual learning paradigm, which allows us to significantly increase the network depth for stronger learning capabilities and also marks the booming development of the deep learning era. As a result, ResNet, for the first time, achieves better recognition performance on ImageNet than humans thanks to its significant network depth, as shown in Figure 30.
Fig. 30. Milestones of early convolutional networks [626].
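To connect this discussion with the efficiency concerns raised throughout this survey, the short PyTorch sketch below times the forward pass of a standard ResNet-50 on whatever device is available, which is the kind of measurement typically performed before deploying such a classifier on an embedded platform; the batch size and repetition counts are arbitrary.

import time
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().to(device).eval()
x = torch.randn(1, 3, 224, 224, device=device)

with torch.no_grad():
    for _ in range(3):           # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / 10

print(f"average forward latency on {device}: {elapsed * 1000:.1f} ms")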
Downstream Vision Applications. Downstream vision applications refer to practical, task-specific uses in which the outputs of fundamental vision tasks, such as image classification, are applied to real-world challenges. Popular downstream vision applications in practice include but are not limited to object detection [4], object tracking [5], object segmentation [6], image super-resolution [277, 278], image restoration [630], pose estimation [631], image captioning [632, 633], augmented reality (AR) and virtual reality (VR) [634], and video-related analysis [635]. For example, the work of [630] features memory-oriented structured pruning to optimize on-device memory consumption during runtime image restoration, which can accommodate the limited memory and storage budgets in real-world embedded scenarios. These downstream vision applications have become ubiquitous in real-world embedded scenarios and serve as important components towards ubiquitous embedded intelligence. For example, object detection and tracking have been widely used in recent autonomous vehicles [636] to detect other vehicles and in surveillance systems [637] to detect suspicious persons or activities. These downstream vision applications have also been widely applied to other real-world embedded scenarios, such as smart cities and intelligent health care [638]. To further facilitate the development of intelligent applications, several powerful tools have been proposed recently. For example, Precog [639] introduces an efficient object detection infrastructure to enable real-time object detection on resource-constrained embedded computing systems, such as Raspberry Pi, which also features YOLOv3 [640] to achieve superior on-device object detection accuracy.
From CNNs to Vision Transformers. More recently, vision transformers (ViTs) [107] and their variants have demonstrated surprisingly strong performance in various vision tasks, including, but not limited to, image classification [107, 108, 109], object detection [110, 111, 112], semantic segmentation [113, 114, 115, 116], and video analysis [117, 118, 119], which continue to push forward the state-of-the-art performance over their convolutional counterparts across various vision tasks. Specifically, [107], featuring the very first vision transformer, proposes to divide the input image into a series of smaller image patches (e.g., of 8×8, 16×16, or 32×32 pixels), each of which is then fed into the transformer-based encoder to learn discriminative features. The learned discriminative features are further aggregated and fed into the classification layer to make predictions, as shown in Figure 5. However, despite their strong performance across various vision tasks, ViTs and their variants often exhibit inferior on-device efficiency [139] since they are typically more difficult to parallelize on resource-constrained embedded computing systems than their convolutional counterparts and, thus, inevitably suffer from considerable resource underutilization, as pointed out in [127]. To overcome such limitations, a plethora of resource-efficient vision transformers have recently flourished. We refer interested readers to Section 2.2 for more details about recent representative resource-efficient vision transformers. We emphasize that significant efforts are still required in order to further alleviate the on-device efficiency bottleneck and unleash the promise of modern vision transformers, which are of paramount importance to bring powerful vision transformers to less capable embedded computing systems towards ubiquitous embedded intelligence.
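The patchification step described above can be written in a few lines of PyTorch: a strided convolution splits the image into non-overlapping patches and linearly embeds each one, which is the standard trick used by most ViT implementations; the patch size and embedding width below are illustrative choices only.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Split an image into non-overlapping patches and embed each patch.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride=patch_size convolution is equivalent to flattening each
        # patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])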

8.2 Natural Language Processing Applications

In parallel to vision tasks, NLP is another representative domain that has been widely deployed in real-world embedded scenarios to process auditory and textual inputs, which has largely revolutionized how embedded computing systems interact with users and their surroundings [641]. Embedded computing systems, ranging from traditional IoT devices to wearable and autonomous systems, are transitioning from simple responsive designs to more proactive and interactive ones, which can comprehend context and also anticipate users’ needs based on their linguistic inputs. To this end, below we introduce several representative NLP applications in real-world embedded scenarios.
(1)
Sentiment Analysis. Recent intelligent embedded computing systems, such as wearable devices and intelligent health care infrastructures, have largely featured sentiment analysis, which can effectively capture users’ psychological status through language interactions [642]. This also allows a more comprehensive understanding of users’ emotional well-being, which paves the way for future holistic health ecosystem solutions [643].
(2)
Automatic Speech Recognition. Automatic speech recognition has gained increasing interest in real-world embedded scenarios, such as autonomous vehicles and smart homes, where it can largely facilitate complicated functions through vocal commands. This reduces manual interactions and enhances safety and user convenience [644, 645].
(3)
Conversational Agents. Conversational agents have been playing an important role in recent intelligent embedded computing systems, such as home automation systems [646] and interactive assistant systems [647]. These intelligent conversational agents can comprehend and interpret users’ commands, preferences, and behavioral patterns towards better intelligent services in subsequent interactions.
(4)
Speech-to-Text/Text-to-Speech Synthesis. The integration of text-to-speech (TTS) [253] and speech-to-text (STT) [648] marks an important milestone in enriching human–computer interactions, especially for wearable systems such as mobile phones and intelligent translation devices. TTS synthesizes digital text into speech to provide auditory feedback to users, and STT does the reverse, both of which are particularly important in hands-free environments.
(5)
Real-Time Translation. The emergence of real-time translation has served as an effective technique to eliminate cross-language barriers. More recently, real-time translation has been widely integrated into real-world embedded scenarios, especially wearable communication devices, which can largely facilitate cross-language interactions [649, 650].
To summarize, the integration of NLP and embedded computing systems is more than a simple technical enhancement. It is an important paradigm shift towards ubiquitous embedded intelligence. It can enable real-world embedded computing systems to understand and interpret not only short commands but also longer contexts and conversations, which can ensure seamless and enriched interfaces between humans and embedded computing systems.

8.3 Visions for the Future

In this section, we envision some future trends and possible directions of intelligent embedded applications, which are summarized as follows:
(1)
LLM-Enabled Embedded Applications. LLMs, starting from GPT-3 [48], have attracted considerable interest from both academia and industry, thanks to their surprisingly strong performance across various language tasks. ChatGPT [92], one of the most representative LLM-enabled applications, has achieved promising performance across diverse domains of knowledge. Nonetheless, modern LLMs, despite their promise, require a huge amount of computational resources for both training and inference, making it challenging to deploy powerful LLMs on resource-constrained embedded computing systems. Therefore, modern LLMs can typically only be deployed on remote GPU servers and provide remote services to local users through network connectivity. This, however, is often less convenient and also involves data security/privacy concerns. To overcome such limitations, a plethora of works have been proposed recently to compress computation-intensive LLMs towards better on-device inference efficiency. For example, SmoothQuant [59] and AWQ [60] pioneered quantizing powerful LLMs from higher bits to lower bits in order to reduce their prohibitive computational complexity, making it possible to run powerful LLMs on resource-constrained embedded computing systems. These are also important milestones in bringing LLMs to real-world embedded computing systems towards ubiquitous embedded intelligence.
(2)
Multi-modal Embedded Applications. Modern embedded applications largely focus on one single modality, either from the perspective of vision or language processing. Nonetheless, recent embedded computing systems typically feature various advanced sensors, which can simultaneously collect rich data from multiple modalities, including, but not limited to, visual, auditory, and tactile information. The most important benefit of multi-modal embedded applications is their ability to provide a comprehensive understanding of real-world dynamic environments using complementary information collected from different modalities. This also has the potential to significantly boost the attainable accuracy on the target task and greatly improve reliability in real-world dynamic environments. For example, visual information can be easily augmented with other modalities, such as radar and lidar, which can be jointly leveraged to deliver better and safer driving experiences in autonomous vehicles [651]. However, despite the promising benefits, the development of multi-modal embedded applications is also challenging. On the one hand, the real-time synchronization of diverse data modalities may require significant computational resources. On the other hand, multi-modal embedded applications introduce additional complexity for data alignment, calibration, and fusion, which may also require more advanced software algorithms to ensure real-time processing.

9 Conclusion

In this survey, we focus on summarizing recent efficient deep learning infrastructures for embedded computing systems towards ubiquitous embedded intelligence, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. To this end, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. We also envision promising future directions and trends to enable more efficient and ubiquitous embedded intelligence. We believe this survey can shed light on future research and allow researchers to quickly and smoothly get started in this emerging field.

Footnotes

1
In this work, we may interchangeably use some technical terms, such as deep learning models, machine learning models, DL models, ML models, deep neural networks (DNNs), and convolutional neural networks (CNNs).
2
We do not include FasterNet [84] in Figure 3 for comparisons since FasterNet does not optimize the number of FLOPs.
3
MetaQNN [158] is another seminal NAS work in parallel to [137], both of which feature reinforcement learning as the search engine to automate the design of top-performing DNNs with competitive accuracy on the target task.
4
We mainly discuss latency prediction since latency is the most dominant performance constraint in hardware-aware NAS [176, 212], which can be generalized to predict other performance constraints, such as energy and memory consumption.
5
Most of the covered zero-cost proxies are available at https://github.com/automl/naslib/tree/zerocost
6
Filter pruning is another name for channel pruning since removing filters is technically equivalent to removing channels [383]. In this work, we use channel pruning by default and may interchangeably use filter pruning and channel pruning.
7
We interchangeably use network distillation and knowledge distillation to refer to the distillation-based training process.

References

[1]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[2]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[3]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[4]
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer, 21–37.
[5]
Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Luka Cehovin, Gustavo Fernandez, Tomas Vojir, Gustav Hager, Georg Nebehay, and Roman Pflugfelder. 2015. The Visual Object Tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1–23.
[6]
Yin Li, Xiaodi Hou, Christof Koch, James M. Rehg, and Alan L. Yuille. 2014. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 280–287.
[7]
Dong Yu and Li Deng. 2016. Automatic Speech Recognition. Vol. 1. Springer.
[8]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[9]
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. arXiv preprint arXiv:1808.07036 (2018).
[10]
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1492–1500.
[11]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[12]
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).
[13]
Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI benchmark: Running deep neural networks on Android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 1–27.
[14]
Ke Tan, Xueliang Zhang, and DeLiang Wang. 2021. Deep learning based real-time speech enhancement for dual-microphone mobile phones. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 1853–1863.
[15]
Branislav Kisačanin. 2017. Deep learning for autonomous vehicles. In 2017 IEEE 47th International Symposium on Multiple-Valued Logic (ISMVL’17). IEEE, 142–142.
[16]
Jamil Fayyad, Mohammad A. Jaradat, Dominique Gruyer, and Homayoun Najjaran. 2020. Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors 20, 15 (2020), 4220.
[17]
Beau Norgeot, Benjamin S. Glicksberg, and Atul J. Butte. 2019. A call for deep-learning healthcare. Nature Medicine 25, 1 (2019), 14–15.
[18]
Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. 2019. A guide to deep learning in healthcare. Nature Medicine 25, 1 (2019), 24–29.
[19]
Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, et al. 2019. Machine learning at Facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA’19). IEEE, 331–344.
[20]
Di Liu, Hao Kong, Xiangzhong Luo, Weichen Liu, and Ravi Subramaniam. 2022. Bringing AI to edge: From deep learning’s perspective. Neurocomputing 485 (2022), 297–320.
[21]
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. Pruning filters for efficient ConvNets. In International Conference on Learning Representations.
[22]
Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. 2018. Soft filter pruning for accelerating deep convolutional neural networks. In International Joint Conference on Artificial Intelligence.
[23]
Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4340–4349.
[24]
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. BinaryConnect: Training deep neural networks with binary weights during propagations. Advances in Neural Information Processing Systems 28 (2015).
[25]
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks. Advances in Neural Information Processing Systems 29 (2016).
[26]
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV. Springer, 525–542.
[27]
Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? Advances in Neural Information Processing Systems 27 (2014).
[28]
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
[29]
Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems 28 (2015).
[30]
Artur Jordao, Maiko Lie, and William Robson Schwartz. 2020. Discriminative layer pruning for convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing 14, 4 (2020), 828–837.
[31]
Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7132–7141.
[32]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[33]
Pavlo Molchanov, Jimmy Hall, Hongxu Yin, Jan Kautz, Nicolo Fusi, and Arash Vahdat. 2022. LANA: Latency aware network acceleration. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 137–156.
[34]
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV’18). 116–131.
[35]
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6848–6856.
[36]
Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. 2020. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1580–1589.
[37]
Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Chao Xu, and Yunhe Wang. 2022. GhostNetV2: Enhance cheap operation with long-range attention. arXiv preprint arXiv:2211.12905 (2022).
[38]
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2820–2828.
[39]
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10734–10742.
[40]
Han Cai, Ligeng Zhu, and Song Han. 2019. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations.
[41]
Colin White, Mahmoud Safari, Rhea Sukthanker, Binxin Ru, Thomas Elsken, Arber Zela, Debadeepta Dey, and Frank Hutter. 2023. Neural architecture search: Insights from 1000 papers. arXiv preprint arXiv:2301.08727 (2023).
[42]
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2020. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations.
[43]
Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smail Niar, Martin Wistuba, and Naigang Wang. 2021. A comprehensive survey on hardware-aware neural architecture search. arXiv preprint arXiv:2101.09336 (2021).
[44]
Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. 2020. Tinytl: Reduce memory, not parameters for efficient on-device learning. Advances in Neural Information Processing Systems 33 (2020), 11285–11297.
[45]
Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. 2022. On-device training under 256kb memory. Advances in Neural Information Processing Systems (2022).
[46]
Gido M. Van de Ven and Andreas S. Tolias. 2019. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734 (2019).
[47]
Xinchi Qiu, Javier Fernandez-Marques, Pedro P. B. Gusmao, Yan Gao, Titouan Parcollet, and Nicholas Donald Lane. 2022. ZeroFL: Efficient on-device training for federated learning with local sparsity. International Conference on Learning Representations (2022).
[48]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[49]
OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[50]
Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, et al. 2024. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv preprint arXiv:2401.00625 (2024).
[51]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
[52]
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2023).
[53]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626.
[54]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
[55]
Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023).
[56]
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023).
[57]
Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems 36 (2023), 21702–21720.
[58]
Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695 (2023).
[59]
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning. PMLR, 38087–38099.
[60]
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978 (2023).
[61]
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560 (2022).
[62]
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023. MiniLLM: Knowledge distillation of large language models. In 12th International Conference on Learning Representations.
[63]
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning. PMLR, 31094–31116.
[64]
Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. 2022. Petals: Collaborative inference and fine-tuning of large models. arXiv preprint arXiv:2209.01188 (2022).
[65]
Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the 18th European Conference on Computer Systems. 233–248.
[66]
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from https://www.tensorflow.org/.
[67]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).
[68]
Google. [n. d.]. Google Edge TPU. Retrieved from https://cloud.google.com/edge-tpu/.
[70]
Intel. [n. d.]. Intel Movidius Neural Compute Stick. Retrieved from https://movidius.github.io/ncsdk/ncs.html.
[71]
Gaurav Menghani. 2023. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. Comput. Surveys 55, 12 (2023), 1–37.
[72]
Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).
[73]
Tejalal Choudhary, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. 2020. A comprehensive survey on model compression and acceleration. Artificial Intelligence Review 53 (2020), 5113–5155.
[74]
Zhuo Li, Hengyi Li, and Lin Meng. 2023. Model compression for deep neural networks: A survey. Computers 12, 3 (2023), 60.
[75]
Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chunjing Xu, Enhua Wu, and Qi Tian. 2022. GhostNets on heterogeneous devices via cheap operations. International Journal of Computer Vision 130, 4 (2022), 1050–1069.
[76]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
[77]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[78]
Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning. PMLR, 6105–6114.
[79]
Mingxing Tan and Quoc Le. 2021. EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning. PMLR, 10096–10106.
[80]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[81]
Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
[82]
Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, and Song Han. 2022. Enable deep learning on mobile devices: Methods, systems, and applications. ACM Transactions on Design Automation of Electronic Systems (TODAES) 27, 3 (2022), 1–50.
[83]
Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
[84]
Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, and S-H Gary Chan. 2023. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12021–12031.
[85]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[86]
Daquan Zhou, Qibin Hou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. 2020. Rethinking bottleneck structure for efficient mobile network design. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer, 680–697.
[87]
Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q. Weinberger. 2018. CondenseNet: An efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2752–2761.
[88]
Le Yang, Haojun Jiang, Ruojin Cai, Yulin Wang, Shiji Song, Gao Huang, and Qi Tian. 2021. CondenseNet V2: Sparse feature reactivation for deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3569–3578.
[89]
Song Han, Huizi Mao, and William J. Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations.
[90]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[91]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT (2019).
[92]
OpenAI. 2022. ChatGPT: A Variant of GPT by OpenAI. Retrieved from https://openai.com/.
[93]
Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-aware transformers for efficient natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7675–7688.
[94]
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020. 4163–4174.
[95]
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2158–2170.
[96]
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[97]
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020).
[98]
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020).
[99]
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, and Matt J. Kusner. 2024. No train no gain: Revisiting efficient training algorithms for transformer-based language models. Advances in Neural Information Processing Systems 36 (2024).
[100]
Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, and Yoon Kim. 2023. Learning to grow pretrained models for efficient transformer training. arXiv preprint arXiv:2303.00980 (2023).
[101]
Malte Ostendorff and Georg Rehm. 2023. Efficient language model training through cross-lingual and progressive transfer learning. arXiv preprint arXiv:2301.09626 (2023).
[102]
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5 (2023).
[103]
Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew M. Dai, Yifeng Lu, et al. 2023. Brainformers: Trading simplicity for efficiency. In International Conference on Machine Learning. PMLR, 42531–42542.
[104]
Zhen-Ru Zhang, Chuanqi Tan, Haiyang Xu, Chengyu Wang, Jun Huang, and Songfang Huang. 2023. Towards adaptive prefix tuning for parameter-efficient language model fine-tuning. arXiv preprint arXiv:2305.15212 (2023).
[105]
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. 2023. LLaMA-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023).
[106]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 213–229.
[107]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[108]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
[109]
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. 2022. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12009–12019.
[110]
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. 2022. Exploring plain vision transformer backbones for object detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX. Springer, 280–296.
[111]
Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, and Wenyu Liu. 2021. You only look at one sequence: Rethinking transformer in vision through object detection. Advances in Neural Information Processing Systems 34 (2021), 26183–26197.
[112]
Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. 2022. Simple open-vocabulary object detection with vision transformers. arXiv preprint arXiv:2205.06230 (2022).
[113]
Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. 2021. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7262–7272.
[114]
Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. 2021. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
[115]
Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, and David Z. Pan. 2022. Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12094–12103.
[116]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. 2023. Segment anything. arXiv preprint arXiv:2304.02643 (2023).
[117]
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3202–3211.
[118]
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6836–6846.
[119]
Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. 2021. Video transformer network. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3163–3172.
[120]
Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. 2021. LeViT: A vision transformer in ConvNet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12259–12269.
[121]
Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. 2022. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5270–5279.
[122]
Sachin Mehta and Mohammad Rastegari. 2022. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In International Conference on Learning Representations.
[123]
Sachin Mehta and Mohammad Rastegari. 2022. Separable self-attention for mobile vision transformers. arXiv preprint arXiv:2206.02680 (2022).
[124]
Shakti N. Wadekar and Abhishek Chaurasia. 2022. MobileViTv3: Mobile-friendly vision transformer with simple and effective fusion of local, global and input features. arXiv preprint arXiv:2209.15159 (2022).
[125]
Han Cai, Chuang Gan, and Song Han. 2022. EfficientViT: Enhanced linear attention for high-resolution low-computation visual recognition. arXiv preprint arXiv:2205.14756 (2022).
[126]
Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, and Brais Martinez. 2022. EdgeViTs: Competing light-weight CNNs on mobile devices with vision transformers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. Springer, 294–311.
[127]
Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman Khan, Syed Waqas Zamir, Rao Muhammad Anwer, and Fahad Shahbaz Khan. 2023. EdgeNeXt: Efficiently amalgamated CNN-transformer architecture for mobile vision applications. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VII. Springer, 3–20.
[128]
Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, and Yingyan Lin. 2023. Castling-ViT: Compressing self-attention via switching towards linear-angular attention during vision transformer inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[129]
Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. 2023. FastViT: A fast hybrid vision transformer using structural reparameterization. arXiv preprint arXiv:2303.14189 (2023).
[130]
Xiangzhong Luo, Di Liu, Hao Kong, and Weichen Liu. 2020. EdgeNAS: Discovering efficient neural architectures for edge systems. In 2020 IEEE 38th International Conference on Computer Design (ICCD’20). IEEE, 288–295.
[131]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. 2022. You only search once: On lightweight differentiable architecture search for resource-constrained embedded platforms. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 475–480.
[132]
Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. 2021. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11936–11945.
[133]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347–10357.
[134]
Dichao Hu. 2020. An introductory survey on attention mechanisms in NLP problems. In Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 2. Springer, 432–448.
[135]
Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R. Martin, Ming-Ming Cheng, and Shi-Min Hu. 2022. Attention mechanisms in computer vision: A survey. Computational Visual Media 8, 3 (2022), 331–368.
[136]
Plamen Angelov and Eduardo Soares. 2020. Towards explainable deep neural networks (xDNN). Neural Networks 130 (2020), 185–194.
[137]
Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
[138]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In International Conference on Learning Representations.
[139]
Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. 2022. Vision GNN: An image is worth graph of nodes. arXiv preprint arXiv:2206.00272 (2022).
[140]
Anubhav Jangra, Sourajit Mukherjee, Adam Jatowt, Sriparna Saha, and Mohammad Hasanuzzaman. 2021. A survey on multi-modal summarization. Comput. Surveys (2021).
[141]
Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. 2021. How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021).
[142]
Yonggan Fu, Shunyao Zhang, Shang Wu, Cheng Wan, and Yingyan Lin. 2022. Patch-Fool: Are vision transformers always robust against adversarial perturbations? In International Conference on Learning Representations.
[143]
Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. 2019. Adversarial robustness vs. model compression, or both? In Proceedings of the IEEE/CVF International Conference on Computer Vision. 111–120.
[144]
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8697–8710.
[145]
Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. 2020. Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations.
[146]
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. 2020. PC-DARTS: Partial channel connections for memory-efficient architecture search. In International Conference on Learning Representations.
[147]
Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. 2019. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1294–1303.
[148]
Hanwen Liang, Shifeng Zhang, Jiacheng Sun, Xingqiu He, Weiran Huang, Kechen Zhuang, and Zhenguo Li. 2019. DARTS+: Improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035 (2019).
[149]
Xiangxiang Chu, Xiaoxing Wang, Bo Zhang, Shun Lu, Xiaolin Wei, and Junchi Yan. 2021. DARTS-: Robustly stepping out of performance collapse without indicators. In International Conference on Learning Representations.
[150]
Xiangxiang Chu, Tianbao Zhou, Bo Zhang, and Jixiang Li. 2020. Fair DARTS: Eliminating unfair advantages in differentiable architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV. Springer, 465–480.
[151]
Peng Ye, Baopu Li, Yikang Li, Tao Chen, Jiayuan Fan, and Wanli Ouyang. 2022. \(\beta\)-DARTS: Beta-decay regularization for differentiable architecture search. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22). IEEE, 10864–10873.
[152]
Xiangzhong Luo, Di Liu, Shuo Huai, and Weichen Liu. 2021. HSCoNAS: Hardware-software co-design of efficient DNNs via neural architecture search. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE’21). IEEE, 418–421.
[153]
Li Lyna Zhang, Yuqing Yang, Yuhang Jiang, Wenwu Zhu, and Yunxin Liu. 2020. Fast hardware-aware neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 692–693.
[154]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. 2022. SurgeNAS: A comprehensive surgery on hardware-aware differentiable neural architecture search. IEEE Transactions on Computers (2022).
[155]
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65–76.
[156]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. 2022. LightNAS: On lightweight and scalable neural architecture search for embedded platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2022).
[157]
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence. 4780–4789.
[158]
Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. 2016. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016).
[159]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement Learning (1992), 5–32.
[160]
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning. PMLR, 4095–4104.
[161]
Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. 2020. Single path one-shot neural architecture search with uniform sampling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16. Springer, 544–560.
[162]
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. 2019. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1314–1324.
[163]
Gabriel Bender, Hanxiao Liu, Bo Chen, Grace Chu, Shuyang Cheng, Pieter-Jan Kindermans, and Quoc V. Le. 2020. Can weight sharing outperform random architecture search? An investigation with TuNAS. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14323–14332.
[164]
Chi-Hung Hsu, Shu-Huan Chang, Jhao-Hong Liang, Hsin-Ping Chou, Chun-Hao Liu, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Da-Cheng Juan. 2018. MONAS: Multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332 (2018).
[165]
Geoffrey F. Miller, Peter M. Todd, and Shailesh U. Hegde. 1989. Designing neural networks using genetic algorithms. In ICGA, Vol. 89. 379–384.
[166]
Peter J. Angeline, Gregory M. Saunders, and Jordan B. Pollack. 1994. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks 5, 1 (1994), 54–65.
[167]
Dario Floreano, Peter Dürr, and Claudio Mattiussi. 2008. Neuroevolution: From architectures to learning. Evolutionary Intelligence 1 (2008), 47–62.
[168]
Kenneth O. Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation 10, 2 (2002), 99–127.
[169]
Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. 2018. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning. PMLR, 550–559.
[170]
Matej Črepinšek, Shih-Hsi Liu, and Marjan Mernik. 2013. Exploration and exploitation in evolutionary algorithms: A survey. ACM Computing Surveys (CSUR) 45, 3 (2013), 1–33.
[171]
Juan José Domínguez-Jiménez, Antonia Estero-Botaro, Antonio García-Domínguez, and Inmaculada Medina-Bulo. 2011. Evolutionary mutation testing. Information and Software Technology 53, 10 (2011), 1108–1123.
[172]
William M. Spears, et al. 1995. Adapting crossover in evolutionary algorithms. In Evolutionary Programming. 367–384.
[173]
Andrew Brock, Theodore Lim, James Millar Ritchie, and Nicholas J. Weston. 2018. SMASH: One-shot model architecture search through hypernetworks. In 6th International Conference on Learning Representations 2018.
[174]
Xiangxiang Chu, Bo Zhang, and Ruijun Xu. 2021. FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12239–12248.
[175]
Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. 2020. BigNAS: Scaling up neural architecture search with big single-stage models. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16. Springer, 702–717.
[176]
Bingqian Lu, Jianyi Yang, Weiwen Jiang, Yiyu Shi, and Shaolei Ren. 2021. One proxy device is enough for hardware-aware neural architecture search. Proceedings of the ACM on Measurement and Analysis of Computing Systems 5, 3 (2021), 1–34.
[177]
Shan You, Tao Huang, Mingmin Yang, Fei Wang, Chen Qian, and Changshui Zhang. 2020. GreedyNAS: Towards fast one-shot NAS with greedy supernet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1999–2008.
[178]
Xiangzhong Luo, Di Liu, Shuo Huai, Hao Kong, Hui Chen, and Weichen Liu. 2021. Designing efficient DNNs via hardware-aware neural architecture search and beyond. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 6 (2021), 1799–1812.
[179]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Efficient multi-objective neural architecture search via Lamarckian evolution. In International Conference on Learning Representations.
[180]
Guohao Li, Guocheng Qian, Itzel C. Delgadillo, Matthias Muller, Ali Thabet, and Bernard Ghanem. 2020. SGAS: Sequential greedy architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1620–1630.
[181]
Yibo Yang, Shan You, Hongyang Li, Fei Wang, Chen Qian, and Zhouchen Lin. 2021. Towards improving the consistency, efficiency, and flexibility of differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6667–6676.
[182]
Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. 2021. Rethinking architecture selection in differentiable NAS. In International Conference on Learning Representations.
[183]
Xiangning Chen, Ruochen Wang, Minhao Cheng, Xiaocheng Tang, and Cho-Jui Hsieh. 2021. DrNAS: Dirichlet neural architecture search. In International Conference on Learning Representations.
[184]
Kaifeng Bi, Lingxi Xie, Xin Chen, Longhui Wei, and Qi Tian. 2020. GOLD-NAS: Gradual, one-level, differentiable. arXiv preprint arXiv:2007.03331 (2020).
[185]
Pengfei Hou, Ying Jin, and Yukang Chen. 2021. Single-DARTS: Towards stable architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 373–382.
[186]
Xuanyi Dong and Yi Yang. 2019. Searching for a robust neural architecture in four GPU hours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1761–1770.
[187]
Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2019. SNAS: Stochastic neural architecture search. In International Conference on Learning Representations.
[188]
Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, and Jianchao Yang. 2020. AtomNAS: Fine-grained end-to-end neural architecture search. In International Conference on Learning Representations.
[189]
Xuanyi Dong, David Jacob Kedziora, Katarzyna Musial, and Bogdan Gabrys. 2021. Automated deep learning: Neural architecture search is not the end. arXiv preprint arXiv:2112.09245 (2021).
[190]
Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.
[191]
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, and Hongkai Xiong. 2020. Latency-aware differentiable neural architecture search. arXiv preprint arXiv:2001.06392 (2020).
[192]
Guohao Li, Mengmeng Xu, Silvio Giancola, Ali Thabet, and Bernard Ghanem. 2020. LC-NAS: Latency constrained neural architecture search for point cloud networks. arXiv preprint arXiv:2008.10309 (2020).
[193]
Mohammad Loni, Hamid Mousavi, Mohammad Riazati, Masoud Daneshtalab, and Mikael Sjödin. 2022. TAS: Ternarized neural architecture search for resource-constrained edge devices. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE’22). IEEE, 1115–1118.
[194]
Sunghoon Kim, Hyunjeong Kwon, Eunji Kwon, Youngchang Choi, Tae-Hyun Oh, and Seokhyeong Kang. 2021. MDARTS: Multi-objective differentiable neural architecture search. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE’21). IEEE, 1344–1349.
[195]
Yibo Hu, Xiang Wu, and Ran He. 2020. TF-NAS: Rethinking three search freedoms of latency-constrained differentiable neural architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. Springer, 123–139.
[196]
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, et al. 2020. FBNetV2: Differentiable neural architecture search for spatial and channel dimensions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12965–12974.
[197]
Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. 2020. Single-path NAS: Designing hardware-efficient ConvNets in less than 4 hours. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part II. Springer, 481–497.
[198]
Jaeseong Lee, Jungsub Rhim, Duseok Kang, and Soonhoi Ha. 2021. SNAS: Fast hardware-aware neural architecture search methodology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 11 (2021), 4826–4836.
[199]
Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. 2020. Densely connected search space for more flexible neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10628–10637.
[200]
Niv Nayman, Yonathan Aflalo, Asaf Noy, and Lihi Zelnik. 2021. HardCoRe-NAS: Hard constrained differentiable neural architecture search. In International Conference on Machine Learning. PMLR, 7979–7990.
[201]
Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. 2013. Block-coordinate Frank-Wolfe optimization for structural SVMs. In International Conference on Machine Learning. PMLR, 53–61.
[202]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, and Weichen Liu. 2024. Double-Win NAS: Towards deep-to-shallow transformable neural architecture search for intelligent embedded systems. In Proceedings of the 61st ACM/IEEE Design Automation Conference. 1–6.
[203]
Qian Jiang, Xiaofan Zhang, Deming Chen, Minh N. Do, and Raymond A. Yeh. 2021. EH-DNAS: End-to-end hardware-aware differentiable neural architecture search. arXiv preprint arXiv:2111.12299 (2021).
[204]
Javier García López, Antonio Agudo, and Francesc Moreno-Noguer. 2021. E-DNAS: Differentiable neural architecture search for embedded systems. In 2020 25th International Conference on Pattern Recognition (ICPR’21). IEEE, 4704–4711.
[205]
Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. 2019. Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142 (2019).
[206]
Yiyang Zhao, Linnan Wang, Yuandong Tian, Rodrigo Fonseca, and Tian Guo. 2021. Few-shot neural architecture search. In International Conference on Machine Learning. PMLR, 12707–12718.
[207]
Shoukang Hu, Ruochen Wang, Lanqing Hong, Zhenguo Li, Cho-Jui Hsieh, and Jiashi Feng. 2022. Generalizing few-shot NAS with gradient matching. arXiv preprint arXiv:2203.15207 (2022).
[208]
Dongkuan D. K. Xu, Subhabrata Mukherjee, Xiaodong Liu, Debadeepta Dey, Wenhui Wang, Xiang Zhang, Ahmed Awadallah, and Jianfeng Gao. 2022. Few-shot task-agnostic neural architecture search for distilling large language models. Advances in Neural Information Processing Systems 35 (2022), 28644–28656.
[209]
Timotée Ly-Manson, Mathieu Léonardon, and Abdeldjalil Aissa El Bey. 2023. Understanding few-shot neural architecture search with zero-cost proxies. https://gretsi.fr/data/colloque/pdf/2023_lymanson1237.pdf (2023).
[210]
Xiu Su, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, and Chang Xu. 2021. K-shot NAS: Learnable weight-sharing for NAS with k-shot supernets. In International Conference on Machine Learning. PMLR, 9880–9890.
[211]
Zixuan Zhou, Xuefei Ning, Yi Cai, Jiashu Han, Yiping Deng, Yuhan Dong, Huazhong Yang, and Yu Wang. 2022. CLOSE: Curriculum learning on the sharing extent towards better one-shot NAS. In European Conference on Computer Vision. Springer, 578–594.
[212]
Kevin Alexander Laube, Maximus Mutschler, and Andreas Zell. 2022. What to expect of hardware metric predictors in NAS. In International Conference on Automated Machine Learning. PMLR, 13–1.
[213]
Lukasz Dudziak, Thomas Chau, Mohamed Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas Lane. 2020. BRP-NAS: Prediction-based NAS using GCN. Advances in Neural Information Processing Systems 33 (2020), 10480–10490.
[214]
Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue Wang, Cong Hao, and Yingyan Lin. 2021. HW-NAS-Bench: Hardware-aware neural architecture search benchmark. In International Conference on Learning Representations.
[215]
Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. 2021. Hardware-adaptive efficient latency prediction for NAS via meta-learning. Advances in Neural Information Processing Systems 34 (2021), 27016–27028.
[216]
Saeejith Nair, Saad Abbasi, Alexander Wong, and Mohammad Javad Shafiee. 2022. MAPLE-Edge: A runtime latency predictor for edge devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3660–3668.
[217]
Shuo Huai, Hao Kong, Shiqing Li, Xiangzhong Luo, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. EvoLP: Self-evolving latency predictor for model compression in real-time edge systems. IEEE Embedded Systems Letters (2023).
[218]
Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. 2020. Neural predictor for neural architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX. Springer, 660–676.
[219]
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2020. Accuracy prediction with non-neural model for neural architecture search. arXiv preprint arXiv:2007.04785 (2020).
[220]
Colin White, Arber Zela, Robin Ru, Yang Liu, and Frank Hutter. 2021. How powerful are performance predictors in neural architecture search? Advances in Neural Information Processing Systems 34 (2021), 28454–28469.
[221]
Bert Moons, Parham Noorzad, Andrii Skliar, Giovanni Mariani, Dushyant Mehta, Chris Lott, and Tijmen Blankevoort. 2021. Distilling optimal neural networks: Rapid search in diverse spaces. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12229–12238.
[222]
Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. 2020. A generic graph-based neural architecture encoding scheme for predictor-based NAS. In European Conference on Computer Vision. Springer, 189–204.
[223]
Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. 2019. NAS-Bench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning. PMLR, 7105–7114.
[224]
Xuanyi Dong and Yi Yang. 2020. NAS-Bench-201: Extending the scope of reproducible neural architecture search. In International Conference on Learning Representations.
[225]
Nikita Klyuchnikov, Ilya Trofimov, Ekaterina Artemova, Mikhail Salnikov, Maxim Fedorov, Alexander Filippov, and Evgeny Burnaev. 2022. NAS-Bench-NLP: Neural architecture search benchmark for natural language processing. IEEE Access 10 (2022), 45736–45747.
[226]
Xuefei Ning, Yin Zheng, Zixuan Zhou, Tianchen Zhao, Huazhong Yang, and Yu Wang. 2022. A generic graph-based neural architecture encoding scheme with multifaceted information. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[227]
Xuefei Ning, Zixuan Zhou, Junbo Zhao, Tianchen Zhao, Yiping Deng, Changcheng Tang, Shuang Liang, Huazhong Yang, and Yu Wang. 2022. TA-GATES: An encoding scheme for neural network architectures. Advances in Neural Information Processing Systems 35 (2022), 32325–32339.
[228]
Huan Xiong, Lei Huang, Mengyang Yu, Li Liu, Fan Zhu, and Ling Shao. 2020. On the number of linear regions of convolutional neural networks. In International Conference on Machine Learning. PMLR, 10514–10523.
[229]
Lechao Xiao, Jeffrey Pennington, and Samuel Schoenholz. 2020. Disentangling trainability and generalization in deep neural networks. In International Conference on Machine Learning. PMLR, 10462–10472.
[230]
Wuyang Chen, Xinyu Gong, and Zhangyang Wang. 2021. Neural architecture search on ImageNet in four GPU hours: A theoretically inspired perspective. In International Conference on Learning Representations.
[231]
Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In 24th International Joint Conference on Artificial Intelligence.
[232]
Robin Ru, Clare Lyle, Lisa Schut, Miroslav Fil, Mark van der Wilk, and Yarin Gal. 2021. Speedy performance estimation for neural architecture search. Advances in Neural Information Processing Systems 34 (2021), 4079–4092.
[233]
Shen Yan, Colin White, Yash Savani, and Frank Hutter. 2021. NAS-Bench-x11 and the power of learning curves. Advances in Neural Information Processing Systems 34 (2021), 22534–22549.
[234]
Dan Zhao, Nathan C. Frey, Vijay Gadepally, and Siddharth Samsi. 2022. Loss curve approximations for fast neural architecture ranking & training elasticity estimation. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’22). IEEE, 715–723.
[235]
Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. 2017. Learning curve prediction with Bayesian neural networks. In International Conference on Learning Representations.
[236]
Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. 2017. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823 (2017).
[237]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, and Weichen Liu. 2022. Work-in-progress: What to expect of early training statistics? An investigation on hardware-aware neural architecture search. In 2022 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS’22). IEEE, 1–2.
[238]
Mohamed S. Abdelfattah, Abhinav Mehrotra, Łukasz Dudziak, and Nicholas D. Lane. 2021. Zero-cost proxies for lightweight NAS. arXiv preprint arXiv:2101.08134 (2021).
[239]
Arjun Krishnakumar, Colin White, Arber Zela, Renbo Tu, Mahmoud Safari, and Frank Hutter. 2022. NAS-Bench-Suite-Zero: Accelerating research on zero cost proxies. arXiv preprint arXiv:2210.03230 (2022).
[240]
Vasco Lopes, Saeid Alirezazadeh, and Luís A. Alexandre. 2021. EPE-NAS: Efficient performance estimation without training for neural architecture search. In Artificial Neural Networks and Machine Learning–ICANN 2021: 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part V. Springer, 552–563.
[241]
Jack Turner, Elliot J. Crowley, Michael O’Boyle, Amos Storkey, and Gavin Gray. 2019. BlockSwap: Fisher-guided block substitution for network compression on a budget. arXiv preprint arXiv:1906.04113 (2019).
[242]
Chaoqi Wang, Guodong Zhang, and Roger Grosse. 2020. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376 (2020).
[243]
Joe Mellor, Jack Turner, Amos Storkey, and Elliot J. Crowley. 2021. Neural architecture search without training. In International Conference on Machine Learning. PMLR, 7588–7598.
[244]
Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2018. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340 (2018).
[245]
Hidenori Tanaka, Daniel Kunin, Daniel L. Yamins, and Surya Ganguli. 2020. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems 33 (2020), 6377–6389.
[246]
Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, and Rong Jin. 2021. Zen-NAS: A zero-shot NAS for high-performance image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 347–356.
[247]
Yash Akhauri, Juan Munoz, Nilesh Jain, and Ravishankar Iyer. 2022. EZNAS: Evolving zero-cost proxies for neural architecture scoring. Advances in Neural Information Processing Systems 35 (2022), 30459–30470.
[248]
Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. 2021. AutoFormer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12270–12280.
[249]
Jiahui Gao, Hang Xu, Han Shi, Xiaozhe Ren, Philip L. H. Yu, Xiaodan Liang, Xin Jiang, and Zhenguo Li. 2022. AutoBERT-Zero: Evolving BERT backbone from scratch. In Proceedings of the AAAI Conference on Artificial Intelligence. 10663–10671.
[250]
David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. 2021. Primer: Searching for efficient transformers for language modeling. arXiv preprint arXiv:2109.08668 (2021).
[251]
Yichun Yin, Cheng Chen, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2021. AutoTinyBERT: Automatic hyper-parameter optimization for efficient pre-trained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 5146–5157.
[252]
Jin Xu, Xu Tan, Renqian Luo, Kaitao Song, Jian Li, Tao Qin, and Tie-Yan Liu. 2021. NAS-BERT: Task-agnostic and adaptive-size BERT compression with neural architecture search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1933–1943.
[253]
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong Chen, and Tie-Yan Liu. 2021. LightSpeech: Lightweight and fast text to speech with neural architecture search. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’21). IEEE, 5699–5703.
[254]
Jihwan Kim, Jisung Wang, Sangki Kim, and Yeha Lee. 2020. Evolved speech-transformer: Applying neural architecture search to end-to-end automatic speech recognition. In INTERSPEECH. 1788–1792.
[255]
Charles Jin, Phitchaya Mangpo Phothilimthana, and Sudip Roy. 2022. \(\alpha\)NAS: Neural architecture search using property guided synthesis. arXiv preprint arXiv:2205.03960 (2022).
[256]
Boyu Chen, Peixia Li, Chuming Li, Baopu Li, Lei Bai, Chen Lin, Ming Sun, Junjie Yan, and Wanli Ouyang. 2021. GLiT: Neural architecture search for global and local image transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12–21.
[257]
Chengyue Gong, Dilin Wang, Meng Li, Xinlei Chen, Zhicheng Yan, Yuandong Tian, Vikas Chandra, et al. 2021. NASViT: Neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In International Conference on Learning Representations.
[258]
Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Jian Chen, Peilin Zhao, and Junzhou Huang. 2019. NAT: Neural architecture transformer for accurate and compact architectures. Advances in Neural Information Processing Systems 32 (2019).
[259]
Mingyu Ding, Xiaochen Lian, Linjie Yang, Peng Wang, Xiaojie Jin, Zhiwu Lu, and Ping Luo. 2021. HR-NAS: Searching efficient high-resolution neural architectures with lightweight transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2982–2992.
[260]
Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, and Chang Xu. 2022. ViTAS: Vision transformer architecture search. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI. Springer, 139–157.
[261]
Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. 2021. NATS-bench: Benchmarking NAS algorithms for architecture topology and size. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 7 (2021), 3634–3646.
[262]
Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. 2020. NAS-Bench-301 and the case for surrogate benchmarks for neural architecture search. arXiv preprint arXiv:2008.09777 (2020).
[263]
Renbo Tu, Nicholas Roberts, Misha Khodak, Junhong Shen, Frederic Sala, and Ameet Talwalkar. 2022. NAS-Bench-360: Benchmarking neural architecture search on diverse tasks. Advances in Neural Information Processing Systems 35 (2022), 12380–12394.
[264]
Arber Zela, Julien Siems, and Frank Hutter. 2020. NAS-Bench-1shot1: Benchmarking and dissecting one-shot neural architecture search. In International Conference on Learning Representations.
[265]
Abhinav Mehrotra, Alberto Gil C. P. Ramos, Sourav Bhattacharya, Łukasz Dudziak, Ravichander Vipperla, Thomas Chau, Mohamed S. Abdelfattah, Samin Ishtiaq, and Nicholas Donald Lane. 2021. NAS-Bench-ASR: Reproducible neural architecture search for speech recognition. In International Conference on Learning Representations.
[266]
Yijian Qin, Ziwei Zhang, Xin Wang, Zeyang Zhang, and Wenwu Zhu. 2022. NAS-Bench-Graph: Benchmarking Graph Neural Architecture Search. In 36th Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[267]
Yash Mehta, Colin White, Arber Zela, Arjun Krishnakumar, Guri Zabergja, Shakiba Moradian, Mahmoud Safari, Kaicheng Yu, and Frank Hutter. 2022. NAS-Bench-Suite: NAS evaluation is (now) surprisingly easy. In International Conference on Learning Representations.
[268]
Yunyang Xiong, Hanxiao Liu, Suyog Gupta, Berkin Akin, Gabriel Bender, Yongzhe Wang, Pieter-Jan Kindermans, Mingxing Tan, Vikas Singh, and Bo Chen. 2021. MobileDets: Searching for object detection architectures for mobile accelerators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3825–3834.
[269]
Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, Chunhua Shen, and Yanning Zhang. 2020. NAS-FCOS: Fast neural architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11943–11951.
[270]
Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. 2019. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7036–7045.
[271]
Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei. 2019. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 82–92.
[272]
Albert Shaw, Daniel Hunter, Forrest Iandola, and Sammy Sidhu. 2019. SqueezeNAS: Fast neural architecture search for faster semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 1–11.
[273]
Xiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Lei Wang, and Wenqi Ren. 2021. DCNAS: Densely connected neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13956–13967.
[274]
Chenxi Liu, Zhaoqi Leng, Pei Sun, Shuyang Cheng, Charles R. Qi, Yin Zhou, Mingxing Tan, and Dragomir Anguelov. 2022. LidarNAS: Unifying and Searching Neural Architectures for 3D Point Clouds. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI. Springer, 158–175.
[275]
Zhijian Liu, Haotian Tang, Shengyu Zhao, Kevin Shao, and Song Han. 2021. PVNAS: 3D neural architecture search with point-voxel convolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 11 (2021), 8552–8568.
[276]
Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. 2020. Searching efficient 3D architectures with sparse point-voxel convolution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII. Springer, 685–702.
[277]
Shaoli Liu, Chengjian Zheng, Kaidi Lu, Si Gao, Ning Wang, Bofei Wang, Diankai Zhang, Xiaofeng Zhang, and Tianyu Xu. 2021. EVSRNet: Efficient video super-resolution with neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2480–2485.
[278]
Yushu Wu, Yifan Gong, Pu Zhao, Yanyu Li, Zheng Zhan, Wei Niu, Hao Tang, Minghai Qin, Bin Ren, and Yanzhi Wang. 2022. Compiler-aware neural architecture search for on-mobile real-time super-resolution. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIX. Springer, 92–111.
[279]
Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. 2020. NAS evaluation is frustratingly hard. In International Conference on Learning Representations.
[280]
Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. 2019. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 113–123.
[281]
Liam Li and Ameet Talwalkar. 2020. Random search and reproducibility for neural architecture search. In Uncertainty in Artificial Intelligence. PMLR, 367–377.
[282]
Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. 2019. Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1284–1293.
[283]
Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10428–10436.
[284]
Jianyuan Guo, Kai Han, Yunhe Wang, Chao Zhang, Zhaohui Yang, Han Wu, Xinghao Chen, and Chang Xu. 2020. Hit-detector: Hierarchical trinity architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11405–11414.
[285]
Ryuichiro Hataya, Jan Zdenek, Kazuki Yoshizoe, and Hideki Nakayama. 2020. Faster AutoAugment: Learning augmentation strategies using backpropagation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16. Springer, 1–16.
[286]
Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2017. Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017).
[287]
Yucong Zhou, Zezhou Zhu, and Zhao Zhong. 2021. Learning specialized activation functions with the Piecewise Linear Unit. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12095–12104.
[288]
Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, et al. 2021. FBNetV3: Joint architecture-recipe search using predictor pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16276–16285.
[289]
Xuanyi Dong, Mingxing Tan, Adams Wei Yu, Daiyi Peng, Bogdan Gabrys, and Quoc V. Le. 2020. AutoHAS: Efficient hyperparameter and architecture search. arXiv preprint arXiv:2006.03656 (2020).
[290]
Bichen Wu, Chaojian Li, Hang Zhang, Xiaoliang Dai, Peizhao Zhang, Matthew Yu, Jialiang Wang, Yingyan Lin, and Peter Vajda. 2021. FBNetV5: Neural architecture search for multiple tasks in one run. arXiv preprint arXiv:2111.10007 (2021).
[291]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13. Springer, 740–755.
[292]
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2017. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 633–641.
[293]
Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. 2018. Slimmable neural networks. arXiv preprint arXiv:1812.08928 (2018).
[294]
Jiahui Yu and Thomas S. Huang. 2019. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1803–1811.
[295]
Changlin Li, Guangrun Wang, Bing Wang, Xiaodan Liang, Zhihui Li, and Xiaojun Chang. 2021. Dynamic slimmable network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8607–8617.
[296]
Changlin Li, Tao Tang, Guangrun Wang, Jiefeng Peng, Bing Wang, Xiaodan Liang, and Xiaojun Chang. 2021. BossNAS: Exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12281–12291.
[297]
Lotfi Abdelkrim Mecharbat, Hadjer Benmeziane, Hamza Ouranoughi, and Smail Niar. 2023. HyT-NAS: Hybrid transformers neural architecture search for edge devices. arXiv preprint arXiv:2303.04440 (2023).
[298]
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning. PMLR, 1126–1135.
[299]
Albert Shaw, Wei Wei, Weiyang Liu, Le Song, and Bo Dai. 2019. Meta architecture search. Advances in Neural Information Processing Systems 32 (2019).
[300]
Jiaxing Wang, Jiaxiang Wu, Haoli Bai, and Jian Cheng. 2020. M-NAS: Meta neural architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence. 6186–6193.
[301]
Hayeon Lee, Eunyoung Hyung, and Sung Ju Hwang. 2021. Rapid neural architecture search by learning to generate graphs from datasets. arXiv preprint arXiv:2107.00860 (2021).
[302]
Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. 2020. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 108, 4 (2020), 485–532.
[303]
Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, and Song Han. 2020. APQ: Joint search for network architecture, pruning and quantization policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2078–2087.
[304]
Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
[305]
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2019. Rethinking the value of network pruning. In International Conference on Learning Representations.
[306]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News 44, 3 (2016), 243–254.
[307]
Yann LeCun, John Denker, and Sara Solla. 1989. Optimal brain damage. Advances in Neural Information Processing Systems 2 (1989).
[308]
Babak Hassibi, David G. Stork, and Gregory J. Wolff. 1993. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks. IEEE, 293–299.
[309]
Suraj Srinivas and R. Venkatesh Babu. 2015. Data-free parameter pruning for deep neural networks. British Machine Vision Conference (2015).
[310]
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning. PMLR, 2498–2507.
[311]
Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through \(L\_0\) regularization. In International Conference on Learning Representations.
[312]
Yi Guo, Huan Yuan, Jianchao Tan, Zhangyang Wang, Sen Yang, and Ji Liu. 2021. GDP: Stabilized neural network pruning via gates with differentiable polarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5239–5250.
[313]
Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019).
[314]
Alex Renda, Jonathan Frankle, and Michael Carbin. 2020. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations.
[315]
Babak Hassibi and David Stork. 1992. Second order derivatives for network pruning: Optimal brain surgeon. Advances in Neural Information Processing Systems 5 (1992).
[316]
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations.
[317]
Andries P. Engelbrecht. 2001. A new pruning heuristic based on variance analysis of sensitivity information. IEEE Transactions on Neural Networks 12, 6 (2001), 1386–1399.
[318]
Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2016. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2016), 127–138.
[319]
Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (2019), 292–308.
[320]
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al. 2017. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 75–84.
[321]
Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
[322]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27–40.
[323]
Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan. 2021. GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21). IEEE, 1110–1123.
[324]
Jie-Fang Zhang, Ching-En Lee, Chester Liu, Yakun Sophia Shao, Stephen W. Keckler, and Zhengya Zhang. 2020. SNAP: An efficient sparse neural acceleration processor for unstructured sparse deep neural network inference. IEEE Journal of Solid-State Circuits 56, 2 (2020), 636–647.
[325]
Sumanth Gudaparthi, Sarabjeet Singh, Surya Narayanan, Rajeev Balasubramonian, and Visvesh Sathe. 2022. CANDLES: Channel-aware novel dataflow-microarchitecture co-design for low energy sparse neural network acceleration. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA’22). IEEE, 876–891.
[326]
Xiaolong Ma, Fu-Ming Guo, Wei Niu, Xue Lin, Jian Tang, Kaisheng Ma, Bin Ren, and Yanzhi Wang. 2020. PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. In Proceedings of the AAAI Conference on Artificial Intelligence. 5117–5124.
[327]
Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. 2020. PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In Proceedings of the Twenty-25 International Conference on Architectural Support for Programming Languages and Operating Systems. 907–922.
[328]
Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations.
[329]
Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu. 2017. Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 138–145.
[330]
Utku Evci, Fabian Pedregosa, Aidan Gomez, and Erich Elsen. 2019. The difficulty of training sparse neural networks. arXiv preprint arXiv:1906.10732 (2019).
[331]
Ajay Kumar Jaiswal, Haoyu Ma, Tianlong Chen, Ying Ding, and Zhangyang Wang. 2022. Training your sparse neural network better with any mask. In International Conference on Machine Learning. PMLR, 9833–9844.
[332]
Yi-Lin Sung, Varun Nair, and Colin A. Raffel. 2021. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems 34 (2021), 24193–24205.
[333]
Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. 2020. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning. PMLR, 6682–6691.
[334]
Zeru Zhang, Jiayin Jin, Zijie Zhang, Yang Zhou, Xin Zhao, Jiaxiang Ren, Ji Liu, Lingfei Wu, Ruoming Jin, and Dejing Dou. 2021. Validating the lottery ticket hypothesis with inertial manifold theory. Advances in Neural Information Processing Systems 34 (2021), 30196–30210.
[335]
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2019. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611 (2019).
[336]
Tianlong Chen, Yongduo Sui, Xuxi Chen, Aston Zhang, and Zhangyang Wang. 2021. A unified lottery ticket hypothesis for graph neural networks. In International Conference on Machine Learning. PMLR, 1695–1706.
[337]
Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, Ruokai Yin, and Priyadarshini Panda. 2022. Exploring lottery ticket hypothesis in spiking neural networks. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 102–120.
[338]
Sanmitra Banerjee, Mahdi Nikdast, Sudeep Pasricha, and Krishnendu Chakrabarty. 2022. Pruning coherent integrated photonic neural networks using the lottery ticket hypothesis. In 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI’22). IEEE, 128–133.
[339]
Yuxin Zhang, Mingbao Lin, Zhihang Lin, Yiting Luo, Ke Li, Fei Chao, Yongjian Wu, and Rongrong Ji. 2022. Learning best combination for efficient N:M sparsity. Advances in Neural Information Processing Systems 35 (2022), 941–953.
[340]
Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. 2021. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378 (2021).
[341]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. \(\lbrace\)TVM\(\rbrace\): An automated \(\lbrace\)End-to-End\(\rbrace\) optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 578–594.
[342]
Connor Holmes, Minjia Zhang, Yuxiong He, and Bo Wu. 2021. NxMTransformer: Semi-structured sparsification for natural language understanding via ADMM. Advances in Neural Information Processing Systems 34 (2021), 1818–1830.
[343]
Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. 2021. Learning N:M fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010 (2021).
[344]
Jeff Pool and Chong Yu. 2021. Channel permutations for N:M sparsity. Advances in Neural Information Processing Systems 34 (2021), 13316–13327.
[345]
Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, and Tushar Krishna. 2024. Progressive gradient flow for Robust N:M sparsity training in transformers. arXiv preprint arXiv:2402.04744 (2024).
[346]
Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, and Zhanhui Kang. 2023. E-Sparse: Boosting the large language model inference through entropy-based N:M sparsity. arXiv preprint arXiv:2310.15929 (2023).
[347]
Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang, et al. 2024. The emergence of essential sparsity in large pre-trained models: The weights that matter. Advances in Neural Information Processing Systems 36 (2024).
[348]
Yuxin Zhang, Yiting Luo, Mingbao Lin, Yunshan Zhong, Jingjing Xie, Fei Chao, and Rongrong Ji. 2023. Bi-directional masks for efficient N:M sparse training. In International Conference on Machine Learning. PMLR, 41488–41497.
[349]
Mike Lasby, Anna Golubeva, Utku Evci, Mihai Nica, and Yani Ioannou. 2023. Dynamic sparse training with structured sparsity. arXiv preprint arXiv:2305.02299 (2023).
[350]
Chao Fang, Aojun Zhou, and Zhongfeng Wang. 2022. An algorithm–hardware co-optimized framework for accelerating N:M sparse transformers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 30, 11 (2022), 1573–1586.
[351]
Chao Fang, Shouliang Guo, Wei Wu, Jun Lin, Zhongfeng Wang, Ming Kai Hsu, and Lingzhi Liu. 2022. An efficient hardware accelerator for sparse transformer neural networks. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS’22). IEEE, 2670–2674.
[352]
Yixuan Luo, Payman Behnam, Kiran Thorat, Zhuo Liu, Hongwu Peng, Shaoyi Huang, Shu Zhou, Omer Khan, Alexey Tumanov, Caiwen Ding, et al. 2022. CoDG-ReRAM: An algorithm-hardware co-design to accelerate semi-structured GNNs on ReRAM. In 2022 IEEE 40th International Conference on Computer Design (ICCD’22). IEEE, 280–289.
[353]
Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. 2021. RED: Looking for redundancies for data-free structured compression of deep neural networks. Advances in Neural Information Processing Systems 34 (2021), 20863–20873.
[354]
Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. 2022. RED++: Data-free pruning of deep neural networks via input splitting and output merging. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3664–3676.
[355]
Wenxiao Wang, Cong Fu, Jishun Guo, Deng Cai, and Xiaofei He. 2019. COP: Customized deep model compression via regularized correlation-based filter-level pruning. In International Joint Conference on Artificial Intelligence.
[356]
Zi Wang, Chengcheng Li, and Xiangyang Wang. 2021. Convolutional neural network pruning with structural redundancy reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14913–14922.
[357]
Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
[358]
Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. 2020. HRank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1529–1538.
[359]
Yang Sui, Miao Yin, Yi Xie, Huy Phan, Saman Aliari Zonouz, and Bo Yuan. 2021. CHIP: CHannel Independence-based Pruning for compact neural networks. Advances in Neural Information Processing Systems 34 (2021), 24604–24616.
[360]
Chong Min John Tan and Mehul Motani. 2020. DropNet: Reducing neural network complexity via iterative pruning. In International Conference on Machine Learning. PMLR, 9356–9366.
[361]
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision. 5058–5066.
[362]
Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan. 2019. Approximated oracle filter pruning for destructive CNN width optimization. In International Conference on Machine Learning. PMLR, 1607–1616.
[363]
Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime neural pruning. Advances in Neural Information Processing Systems 30 (2017).
[364]
Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision. 2736–2744.
[365]
Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. 2019. Gate Decorator: Global filter pruning method for accelerating deep convolutional neural networks. Advances in Neural Information Processing Systems 32 (2019).
[366]
Tao Zhuang, Zhixuan Zhang, Yuheng Huang, Xiaoyi Zeng, Kai Shuang, and Xiang Li. 2020. Neuron-level structured pruning using polarization regularizer. Advances in Neural Information Processing Systems 33 (2020), 9865–9877.
[367]
Jianbo Ye, Xin Lu, Zhe Lin, and James Z. Wang. 2018. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations.
[368]
Minsoo Kang and Bohyung Han. 2020. Operation-aware soft channel pruning using differentiable masks. In International Conference on Machine Learning. PMLR, 5122–5131.
[369]
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV’18). 784–800.
[370]
Sixing Yu, Arya Mazaheri, and Ali Jannesari. 2021. Auto graph encoder-decoder for neural network pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6362–6372.
[371]
Manoj Alwani, Yang Wang, and Vashisht Madhavan. 2022. DECORE: Deep compression with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12349–12359.
[372]
Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. 2019. Metapruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3296–3305.
[373]
Mingbao Lin, Rongrong Ji, Yuxin Zhang, Baochang Zhang, Yongjian Wu, and Yonghong Tian. 2021. Channel pruning via automatic structure search. In Proceedings of the 29th International Conference on International Joint Conferences on Artificial Intelligence. 673–679.
[374]
Xuhua Li, Weize Sun, Lei Huang, and Shaowu Chen. 2022. Sub-network multi-objective evolutionary algorithm for filter pruning. arXiv preprint arXiv:2211.01957 (2022).
[375]
Yawei Li, Shuhang Gu, Kai Zhang, Luc Van Gool, and Radu Timofte. 2020. DHP: Differentiable meta pruning via hypernetworks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16. Springer, 608–624.
[376]
Shaopeng Guo, Yujie Wang, Quanquan Li, and Junjie Yan. 2020. DMCP: Differentiable Markov channel pruning for neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1539–1547.
[377]
Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, and Huazhong Yang. 2020. DSA: More efficient budgeted pruning via differentiable sparsity allocation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III. Springer, 592–607.
[378]
Shi Chen and Qi Zhao. 2018. Shallowing deep networks: Layer-wise pruning based on feature representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 12 (2018), 3048–3056.
[379]
Sara Elkerdawy, Mostafa Elhoushi, Abhineet Singh, Hong Zhang, and Nilanjan Ray. 2020. To filter prune, or to layer prune, that is the question. In Proceedings of the Asian Conference on Computer Vision.
[380]
Hui Tang, Yao Lu, and Qi Xuan. 2023. SR-init: An interpretable layer pruning method. arXiv preprint arXiv:2303.07677 (2023).
[381]
Ke Zhang and Guangzhe Liu. 2022. Layer pruning for obtaining shallower ResNets. IEEE Signal Processing Letters 29 (2022), 1172–1176.
[382]
Artur Jordao, George Correa de Araujo, Helena de Almeida Maia, and Helio Pedrini. 2023. When layers play the lottery, all tickets win at initialization. arXiv preprint arXiv:2301.10835 (2023).
[383]
Yang He and Lingao Xiao. 2023. Structured pruning for deep convolutional neural networks: A survey. arXiv preprint arXiv:2303.00566 (2023).
[384]
Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. 2018. Discrimination-aware channel pruning for deep neural networks. Advances in Neural Information Processing Systems 31 (2018).
[385]
Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning. PMLR, 448–456.
[386]
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[387]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, Shiqing Li, Guochu Xiong, and Weichen Liu. 2024. Pearls hide behind linearity: Simplifying deep convolutional networks for embedded hardware systems via linearity grafting. In 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC’24). IEEE, 250–255.
[388]
Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Guochu Xiong, and Weichen Liu. 2024. Domino-Pro-Max: Towards efficient network simplification and reparameterization for embedded hardware systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2024).
[389]
Hao Kong, Di Liu, Shuo Huai, Xiangzhong Luo, Weichen Liu, Ravi Subramaniam, Christian Makaya, and Qian Lin. 2022. Smart scissor: Coupling spatial redundancy reduction and CNN compression for embedded hardware. In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design. 1–9.
[390]
Hao Kong, Di Liu, Xiangzhong Luo, Shuo Huai, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. Towards Towards efficient convolutional neural network for embedded hardware via multi-dimensional pruning. In 2023 60th ACM/IEEE Design Automation Conference (DAC’23). IEEE, 1–6.
[391]
Hao Kong, Xiangzhong Luo, Shuo Huai, Di Liu, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. EMNAPE: Efficient multi-dimensional neural architecture pruning for EdgeAI. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE’23). IEEE, 1–2.
[392]
Hao Kong, Di Liu, Shuo Huai, Xiangzhong Luo, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. EdgeCompress: Coupling multidimensional model compression and dynamic inference for EdgeAI. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 12 (2023), 4657–4670.
[393]
Hao Kong, Di Liu, Xiangzhong Luo, Weichen Liu, and Ravi Subramaniam. 2022. HACScale: Hardware-aware compound scaling for resource-efficient DNNs. In 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC’22). IEEE, 708–713.
[394]
Adrian Bulat and Georgios Tzimiropoulos. 2019. XNOR-Net++: Improved binary neural networks. arXiv preprint arXiv:1909.13863 (2019).
[395]
Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. 2018. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV’18). 722–737.
[396]
Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song. 2020. Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2250–2259.
[397]
Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Yan Wang, Yongjian Wu, Feiyue Huang, and Chia-Wen Lin. 2020. Rotated binary neural network. Advances in Neural Information Processing Systems 33 (2020), 7474–7485.
[398]
Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Fei Chao, Chia-Wen Lin, and Ling Shao. 2022. SiMaN: Sign-to-magnitude network binarization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[399]
Sieger Falkena, Hadi Jamali-Rad, and Jan van Gemert. 2023. LAB: Learnable activation binarizer for binary neural networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6425–6434.
[400]
Zhijun Tu, Xinghao Chen, Pengju Ren, and Yunhe Wang. 2022. AdaBin: Improving binary neural networks with adaptive binary sets. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. Springer, 379–395.
[401]
Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, and Junchi Yan. 2016. Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016).
[402]
Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2017. Trained ternary quantization. In International Conference on Learning Representations.
[403]
Hande Alemdar, Vincent Leroy, Adrien Prost-Boucle, and Frédéric Pétrot. 2017. Ternary neural networks for resource-efficient AI applications. In 2017 International Joint Conference on Neural Networks (IJCNN’17). IEEE, 2547–2554.
[404]
Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. 2017. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462 (2017).
[405]
Yue Li, Wenrui Ding, Chunlei Liu, Baochang Zhang, and Guodong Guo. 2021. TRQ: Ternary neural networks with residual quantization. In Proceedings of the AAAI Conference on Artificial Intelligence. 8538–8546.
[406]
Yuhang Li, Xin Dong, Sai Qian Zhang, Haoli Bai, Yuanpeng Chen, and Wei Wang. 2020. RTN: Reparameterized ternary network. In Proceedings of the AAAI Conference on Artificial Intelligence. 4780–4787.
[407]
Peng Chen, Bohan Zhuang, and Chunhua Shen. 2021. FATNN: Fast and accurate ternary neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5219–5228.
[408]
Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang, and Jian Cheng. 2022. Soft threshold ternary networks. arXiv preprint arXiv:2204.01234 (2022).
[409]
Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
[410]
Han Vanholder. 2016. Efficient inference with TensorRT. In GPU Technology Conference.
[411]
Sumin Kim, Gunju Park, and Youngmin Yi. 2021. Performance evaluation of INT8 quantized inference on mobile GPUs. IEEE Access 9 (2021), 164245–164255.
[412]
Li Lyna Zhang, Xudong Wang, Jiahang Xu, Quanlu Zhang, Yujing Wang, Yuqing Yang, Ningxin Zheng, Ting Cao, and Mao Yang. 2023. SpaceEvo: Hardware-friendly search space design for efficient INT8 inference. arXiv preprint arXiv:2303.08308 (2023).
[413]
Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, and Vikram Saletore. 2019. Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532 (2019).
[414]
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2704–2713.
[415]
Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng Tao Shen. 2018. TBN: Convolutional neural network with ternary inputs and binary weights. In Proceedings of the European Conference on Computer Vision (ECCV’18). 315–332.
[416]
Julian Faraone, Nicholas Fraser, Michaela Blott, and Philip H. W. Leong. 2018. SYQ: Learning symmetric quantization for efficient deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4300–4309.
[417]
Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I.-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. 2018. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018).
[418]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
[419]
Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, et al. 2018. Mixed precision training of convolutional neural networks using integer operations. In International Conference on Learning Representations.
[420]
Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. 2018. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205 (2018).
[421]
Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. 2018. OpenSeq2Seq: Extensible toolkit for distributed and mixed precision training of sequence-to-sequence models. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS’18). 41–46.
[422]
Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, and Junjie Yan. 2020. Towards unified INT8 training for convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1969–1979.
[423]
Kang Zhao, Sida Huang, Pan Pan, Yinghan Li, Yingya Zhang, Zhenyu Gu, and Yinghui Xu. 2021. Distribution adaptive INT8 quantization for training CNNs. In Proceedings of the AAAI Conference on Artificial Intelligence. 3483–3491.
[424]
Shyam A. Tailor, Javier Fernandez-Marques, and Nicholas D. Lane. 2021. Degree-Quant: Quantization-aware training for graph neural networks. In International Conference on Learning Representations.
[425]
Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. 2022. Overcoming oscillations in quantization-aware training. In International Conference on Machine Learning. PMLR, 16318–16330.
[426]
Charbel Sakr, Steve Dai, Rangha Venkatesan, Brian Zimmer, William Dally, and Brucek Khailany. 2022. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. In International Conference on Machine Learning. PMLR, 19123–19138.
[427]
Jiseok Youn, Jaehun Song, Hyung-Sin Kim, and Saewoong Bahk. 2022. Bitwidth-adaptive quantization-aware neural network training: A meta-learning approach. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 208–224.
[428]
Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. 2018. Mixed precision quantization of ConvNets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090 (2018).
[429]
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8612–8620.
[430]
Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2019. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 293–302.
[431]
Haibao Yu, Qi Han, Jianbo Li, Jianping Shi, Guangliang Cheng, and Bin Fan. 2020. Search what you want: Barrier panelty NAS for mixed precision quantization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer, 1–16.
[432]
Weihan Chen, Peisong Wang, and Jian Cheng. 2021. Towards mixed-precision quantization of neural networks via constrained optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5350–5359.
[433]
Zhaowei Cai and Nuno Vasconcelos. 2020. Rethinking differentiable search for mixed-precision neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2349–2358.
[434]
Ziwei Wang, Han Xiao, Jiwen Lu, and Jie Zhou. 2021. Generalizable mixed-precision quantization via attribution rank preservation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5291–5300.
[435]
Hai Victor Habi, Roy H. Jennings, and Arnon Netzer. 2020. HMQ: Hardware friendly mixed precision quantization block for CNNs. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16. Springer, 448–463.
[436]
Zhaohui Yang, Yunhe Wang, Kai Han, Chunjing Xu, Chao Xu, Dacheng Tao, and Chang Xu. 2020. Searching for low-bit weights in quantized neural networks. Advances in Neural Information Processing Systems 33 (2020), 4091–4102.
[437]
Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. 2016. YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights. In 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI’16). IEEE, 236–241.
[438]
Peng Guo, Hong Ma, Ruizhi Chen, Pin Li, Shaolin Xie, and Donglin Wang. 2018. FBNA: A fully binarized neural network accelerator. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL’18). IEEE, 51–513.
[439]
Francesco Conti, Pasquale Davide Schiavone, and Luca Benini. 2018. XNOR neural engine: A hardware accelerator IP for 21.6-fJ/op binary neural network inference. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2940–2951.
[440]
Shubham Jain, Sumeet Kumar Gupta, and Anand Raghunathan. 2020. TiM-DNN: Ternary in-memory accelerator for deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 7 (2020), 1567–1577.
[441]
Moritz Scherer, Georg Rutishauser, Lukas Cavigelli, and Luca Benini. 2021. CUTIE: Beyond PetaOp/s/W ternary DNN inference acceleration with better-than-binary energy efficiency. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41, 4 (2021), 1020–1033.
[442]
Shien Zhu, Luan H. K. Duong, Hui Chen, Di Liu, and Weichen Liu. 2022. FAT: An in-memory accelerator with fast addition for ternary weight neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2022).
[443]
Nahsung Kim, Dongyeob Shin, Wonseok Choi, Geonho Kim, and Jongsun Park. 2020. Exploiting retraining-based mixed-precision quantization for low-cost DNN accelerator design. IEEE Transactions on Neural Networks and Learning Systems 32, 7 (2020), 2925–2938.
[444]
Mengshu Sun, Zhengang Li, Alec Lu, Yanyu Li, Sung-En Chang, Xiaolong Ma, Xue Lin, and Zhenman Fang. 2022. FILM-QNN: Efficient FPGA acceleration of deep neural networks with intra-layer, mixed-precision quantization. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 134–145.
[445]
Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, and Hoi-Jun Yoo. 2019. An energy-efficient sparse deep-neural-network learning accelerator with fine-grained mixed precision of FP8–FP16. IEEE Solid-State Circuits Letters 2, 11 (2019), 232–235.
[446]
Sitao Huang, Aayush Ankit, Plinio Silveira, Rodrigo Antunes, Sai Rahul Chalamalasetti, Izzat El Hajj, Dong Eun Kim, Glaucimar Aguiar, Pedro Bruel, Sergey Serebryakov, et al. 2021. Mixed precision quantization for ReRAM-based DNN inference accelerators. In Proceedings of the 26th Asia and South Pacific Design Automation Conference. 372–377.
[447]
Wolfgang Balzer, Masanobu Takahashi, Jun Ohta, and Kazuo Kyuma. 1991. Weight quantization in Boltzmann machines. Neural Networks 4, 3 (1991), 405–409.
[448]
Emile Fiesler, Amar Choudry, and H. John Caulfield. 1990. Weight discretization paradigm for optical neural networks. In Optical Interconnections and Networks, Vol. 1281. SPIE, 164–173.
[449]
Gunhan Dundar and Kenneth Rose. 1995. The effects of quantization on multilayer neural networks. IEEE Transactions on Neural Networks 6, 6 (1995), 1446–1451.
[450]
Shuo Huai, Di Liu, Xiangzhong Luo, Hui Chen, Weichen Liu, and Ravi Subramaniam. 2023. Crossbar-aligned & integer-only neural network compression for efficient in-memory acceleration. In Proceedings of the 28th Asia and South Pacific Design Automation Conference. 234–239.
[451]
Shuo Huai, Hao Kong, Xiangzhong Luo, Shiqing Li, Ravi Subramaniam, Christian Makaya, Qian Lin, and Weichen Liu. 2023. CRIMP: Compact & Reliable DNN Inference on In-Memory Processing via Crossbar-Aligned Compression and Non-ideality Adaptation. ACM Transactions on Embedded Computing Systems 22, 5s (2023), 1–25.
[452]
Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive representation distillation. arXiv preprint arXiv:1910.10699 (2019).
[453]
Srinidhi Hegde, Ranjitha Prasad, Ramya Hebbalaguppe, and Vishwajeet Kumar. 2020. Variational student: Learning compact and sparser networks in knowledge distillation framework. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 3247–3251.
[454]
Tiancheng Wen, Shenqi Lai, and Xueming Qian. 2021. Preparing lessons: Improve knowledge distillation with better supervision. Neurocomputing 454 (2021), 25–33.
[455]
Jang Hyun Cho and Bharath Hariharan. 2019. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4794–4802.
[456]
Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence. 5191–5198.
[457]
Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. 2022. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10925–10934.
[458]
Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. 2017. Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision. 1910–1918.
[459]
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. 2020. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10687–10698.
[460]
Guanzhe Hong, Zhiyuan Mao, Xiaojun Lin, and Stanley H. Chan. 2021. Student-teacher learning from clean inputs to noisy inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12075–12084.
[461]
Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4133–4141.
[462]
Jangho Kim, SeongUk Park, and Nojun Kwak. 2018. Paraphrasing complex network: Network compression via factor transfer. Advances in Neural Information Processing Systems 31 (2018).
[463]
Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D. Lawrence, and Zhenwen Dai. 2019. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9163–9171.
[464]
Frederick Tung and Greg Mori. 2019. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1365–1374.
[465]
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020. BERT-of-Theseus: Compressing BERT by progressive module replacing. arXiv preprint arXiv:2002.02925 (2020).
[466]
Zaida Zhou, Chaoran Zhuge, Xinwei Guan, and Wen Liu. 2020. Channel distillation: Channel-wise attention for knowledge distillation. arXiv preprint arXiv:2006.01683 (2020).
[467]
Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 30 (2017).
[468]
Shan You, Chang Xu, Chao Xu, and Dacheng Tao. 2017. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1285–1294.
[469]
Bharat Bhusan Sau and Vineeth N. Balasubramanian. 2016. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650 (2016).
[470]
Guocong Song and Wei Chai. 2018. Collaborative learning for deep neural networks. Advances in Neural Information Processing Systems 31 (2018).
[471]
Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, and Daxin Jiang. 2020. Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In Proceedings of the 13th International Conference on Web Search and Data Mining. 690–698.
[472]
Xiatian Zhu, Shaogang Gong, et al. 2018. Knowledge distillation by on-the-fly native ensemble. Advances in Neural Information Processing Systems 31 (2018).
[473]
Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata, Samuel Thomas, Jia Cui, and Bhuvana Ramabhadran. 2017. Efficient knowledge distillation from an ensemble of teachers. In Interspeech. 3697–3701.
[474]
Liuyu Xiang, Guiguang Ding, and Jungong Han. 2020. Learning from multiple experts: Self-paced knowledge distillation for long-tailed classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer, 247–263.
[475]
Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. 2018. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4320–4328.
[476]
Elliot J. Crowley, Gavin Gray, and Amos J. Storkey. 2018. Moonshine: Distilling with cheap convolutions. Advances in Neural Information Processing Systems 31 (2018).
[477]
Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. 2019. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3713–3722.
[478]
Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. 2020. Self-distillation amplifies regularization in Hilbert space. Advances in Neural Information Processing Systems 33 (2020), 3351–3361.
[479]
Sukmin Yun, Jongjin Park, Kimin Lee, and Jinwoo Shin. 2020. Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13876–13885.
[480]
Mingi Ji, Seungjae Shin, Seunghyun Hwang, Gibeom Park, and Il-Chul Moon. 2021. Refine myself by teaching myself: Feature refinement via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10664–10673.
[481]
Yixiao Ge, Xiao Zhang, Ching Lam Choi, Ka Chun Cheung, Peipei Zhao, Feng Zhu, Xiaogang Wang, Rui Zhao, and Hongsheng Li. 2021. Self-distillation with batch knowledge ensembling improves ImageNet classification. arXiv preprint arXiv:2104.13298 (2021).
[482]
Vladimir Vapnik and Rauf Izmailov. 2015. Learning using privileged information: Similarity control and knowledge transfer. Journal of Machine Learning Research 16, 61 (2015), 2023–2049.
[483]
David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. 2016. Unifying distillation and privileged information. In International Conference on Learning Representations.
[484]
Peisen Zhao, Lingxi Xie, Jiajie Wang, Ya Zhang, and Qi Tian. 2022. Progressive privileged knowledge distillation for online action detection. Pattern Recognition 129 (2022), 108741.
[485]
Fengyi Tang, Cao Xiao, Fei Wang, Jiayu Zhou, and Li-wei H. Lehman. 2019. Retaining privileged information for multi-task learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1369–1377.
[486]
Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. 2019. Adversarial distillation for learning with privileged provisions. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 3 (2019), 786–797.
[487]
Chen Xu, Quan Li, Junfeng Ge, Jinyang Gao, Xiaoyong Yang, Changhua Pei, Fei Sun, Jian Wu, Hanxiao Sun, and Wenwu Ou. 2020. Privileged features distillation at Taobao recommendations. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2590–2598.
[488]
Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. 2019. Data-free learning of student networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3514–3522.
[489]
Gongfan Fang, Jie Song, Chengchao Shen, Xinchao Wang, Da Chen, and Mingli Song. 2019. Data-free adversarial distillation. arXiv preprint arXiv:1912.11006 (2019).
[490]
Xiaoyang Qu, Jianzong Wang, and Jing Xiao. 2021. Enhancing data-free adversarial distillation with activation regularization and virtual interpolation. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’21). IEEE, 3340–3344.
[491]
Haoran Zhao, Xin Sun, Junyu Dong, Milos Manic, Huiyu Zhou, and Hui Yu. 2022. Dual discriminator adversarial distillation for data-free model compression. International Journal of Machine Learning and Cybernetics (2022), 1–18.
[492]
Yuanxin Zhuang, Lingjuan Lyu, Chuan Shi, Carl Yang, and Lichao Sun. 2022. Data-free adversarial knowledge distillation for graph neural networks. arXiv preprint arXiv:2205.03811 (2022).
[493]
Yiman Zhang, Hanting Chen, Xinghao Chen, Yiping Deng, Chunjing Xu, and Yunhe Wang. 2021. Data-free knowledge distillation for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7852–7861.
[494]
Gongfan Fang, Jie Song, Xinchao Wang, Chengchao Shen, Xingen Wang, and Mingli Song. 2021. Contrastive model inversion for data-free knowledge distillation. arXiv preprint arXiv:2105.08584 (2021).
[495]
Mandar Kulkarni, Kalpesh Patil, and Shirish Karande. 2017. Knowledge distillation using unlabeled mismatched images. arXiv preprint arXiv:1703.07131 (2017).
[496]
Qing Liu, Lingxi Xie, Huiyu Wang, and Alan L. Yuille. 2019. Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3662–3671.
[497]
Tianhong Li, Jianguo Li, Zhuang Liu, and Changshui Zhang. 2020. Few sample knowledge distillation for efficient network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14639–14647.
[498]
Akisato Kimura, Zoubin Ghahramani, Koh Takeuchi, Tomoharu Iwata, and Naonori Ueda. 2018. Few-shot learning of neural networks from scratch by pseudo example optimization. arXiv preprint arXiv:1802.03039 (2018).
[499]
Haoli Bai, Jiaxiang Wu, Irwin King, and Michael Lyu. 2020. Few shot network compression via cross distillation. In Proceedings of the AAAI Conference on Artificial Intelligence. 3203–3210.
[500]
Huanyu Wang, Junjie Liu, Xin Ma, Yang Yong, Zhenhua Chai, and Jianxin Wu. 2022. Compressing models with few samples: Mimicking then replacing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 701–710.
References 501 through 651 have been omitted.

