3.2 Search Strategy
In this section, we discuss recent state-of-the-art NAS algorithms and divide them into three main categories: reinforcement learning-based search [
137], evolutionary algorithm-based search [
157], and gradient-based search (also known as differentiable search) [
138].
Reinforcement Learning-Based Search. In the field of NAS, to the best of our knowledge, [
137] is the first NAS work
that opens up the possibility to automate the design of top-performing DNNs, which features
reinforcement learning (RL) [
159] as the search engine. Specifically, [
137] leverages a simple yet effective
recurrent neural network (RNN) as the RL controller to generate possible architecture candidates from the search space as shown in Figure
9. The generated architecture candidate is then trained from scratch on the target task to evaluate the accuracy. Next, the accuracy of the generated architecture candidate is fed back into the aforementioned RNN controller, which optimizes the RNN controller to generate better architecture candidates in the next iteration. Once the search process terminates, the well-optimized RNN controller is able to provide DNNs with superior accuracy on the target task. For example, the network generated by the RNN controller achieves 96.35% top-1 accuracy on CIFAR-10, which is comparable to or even better than the family of manually designed DNNs, such as ResNet [
2]. The promising performance of [
137] marks an important milestone in the field of NAS, pioneering an effective alternative to automate the design of competitive DNNs.
Subsequently, based on [
137], NASNet [
144] introduces the flexible cell-based search space as shown in Figure
7, which further boosts the attainable accuracy on the target task. For example, NASNet achieves 97.6% top-1 accuracy on CIFAR-10, which is
\(+\)1.25% higher than [
137] while involving fewer parameters (i.e., 37.4 M in [
137] vs. 27.6 M in NASNet). Despite the promising performance, [
137] and NASNet have to train a large number of possible architecture candidates from scratch, thus inevitably necessitating prohibitive computational resources. For example, to optimize the RNN controller, [
137] needs to train 12,800 stand-alone architecture candidates. To overcome such limitations, ENAS [
160] proposes an efficient NAS paradigm dubbed
parameter sharing, which forces all the architecture candidates to share network weights to eschew training each architecture candidate from scratch. In practice, this leads to significant reduction in terms of search cost, while at the same time still maintaining strong accuracy on the target task. For example, in [
137], one single search experiment takes 3
\(\sim\)4 days on 450 NVIDIA GTX 1080 Ti GPUs [
144]. In contrast, benefiting from the paradigm of parameter sharing, ENAS is able to find a decent network solution with 97.11% top-1 accuracy on CIFAR-10 and, more importantly, to do so in less than 16 hours on one single NVIDIA GTX 1080 Ti GPU. Thanks to its significant search efficiency, the paradigm of parameter sharing has dominated subsequent breakthroughs in the NAS community [
42,
138,
161].
Although early RL-based NAS methods [
137,
144,
160] have had tremendous success in automatic network design, they focus on accuracy-only optimization, ignoring other important performance metrics, such as latency and energy. To search for hardware-efficient network solutions, MnasNet [
38] formulates the search process as a multi-objective optimization problem that optimizes both accuracy and latency as shown in Figure
10. To achieve this, MnasNet introduces a flexible block-based search space (see Figure
8) and designs an effective multi-objective RL reward function to optimize the RNN controller. Specifically, the goal of MnasNet is to find
Pareto-optimal architecture candidates
arch in the search space
\(\mathcal {A}\) that maximize the predefined multi-objective RL reward, which can be formulated as follows:
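Using the notation defined immediately below, this reward takes the soft-constraint form reported in MnasNet:
\[
\max_{arch \in \mathcal{A}} \; Accuracy(arch) \times \left[ \frac{Latency(arch)}{T} \right]^{w}
\]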
where
\(Accuracy(\cdot)\) and
\(Latency(\cdot)\) denote the accuracy on the target task and the latency on target hardware, respectively. In addition,
T is the specified latency constraint. It is worth noting that the latency
\(Latency(\cdot)\) in MnasNet is directly measured on target hardware, which suffers from non-trivial engineering efforts due to the prohibitive search space (e.g.,
\(|\mathcal {A}| \approx 10^{39}\) in MnasNet) [
40,
42]. To avoid the tedious on-device latency measurements, we discuss several efficient latency predictors later in this section. Apart from these,
w is the trade-off coefficient to control the trade-off magnitude between accuracy and latency, which is defined as follows:
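Specifically, MnasNet defines the exponent piecewise:
\[
w =
\begin{cases}
\alpha, & \text{if } Latency(arch) \le T \\
\beta, & \text{otherwise}
\end{cases}
\]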
where
\(\alpha\) and
\(\beta\) are application-specific hyperparameters to control the trade-off magnitude between accuracy and efficiency. According to the empirical observation that doubling the latency usually brings
\(\sim\)5% relative accuracy improvement, MnasNet assigns
\(\alpha = \beta = -0.07\). In practice,
\(\alpha\) and
\(\beta\) are both sensitive and difficult to tune. Even worse, given new hardware devices or new search spaces,
\(\alpha\) and
\(\beta\) involve additional engineering efforts for hyperparameter tuning. For example, as observed in MobileNetV3 [
162], the accuracy changes much more dramatically with latency for small networks. Therefore, to obtain the required architecture candidate that satisfies the specified latency constraint
T, we typically need to repeat 7 search experiments to tune
\(\alpha\) and
\(\beta\) through trial and error [
163]. This significantly increases the total search cost by a factor of 7. To eliminate such additional hyperparameter tuning, TuNAS [
163] investigates the multi-objective RL reward in Equation (
2) and further introduces a similar RL reward function, which can be formulated as follows:
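Following TuNAS's absolute-value formulation, this reward can be written as:
\[
\max_{arch \in \mathcal{A}} \; Accuracy(arch) + \gamma \times \left| \frac{Latency(arch)}{T} - 1 \right|
\]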
where
\(|\cdot |\) is the absolute function and
\(\gamma \lt 0\) is a finite negative value, which controls how strongly we enforce the architecture candidate to maintain the latency close to
T.
MONAS [
164] also introduces a simple yet effective RL reward function that considers optimizing both accuracy and energy, which can be formulated as follows:
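Consistent with the description below, the MONAS reward can be sketched as a weighted combination of accuracy and (negated) energy:
\[
\max_{arch \in \mathcal{A}} \; \eta \times Accuracy(arch) - (1 - \eta) \times Energy(arch)
\]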
where
\(\eta \in [0, 1]\) is the coefficient to control the trade-off between accuracy and energy. We note that the RL reward function in Equation (
5) aims to find the architecture candidate with high accuracy and low energy, which can be generalized to other performance constraints, such as latency.
Evolutionary Algorithm-Based Search. In addition to reinforcement learning–based search, evolutionary algorithm–based search is another popular branch in the NAS literature thanks to its flexibility, conceptual simplicity, and competitive performance [
157]. As seen in the very early evolutionary practices [
165,
166,
167,
168], the evolutionary algorithm–based search typically consists of four key steps: (1) sampling a set of possible architecture candidates from the search space as the child population; (2) evaluating the architecture candidates in the child population to interpret the performance, such as accuracy and efficiency; (3) reserving the top-k architecture candidates in the latest child population to form the parent population and discarding the architecture candidates with poor performance; and (4) manipulating the architecture candidates in the latest parent population to generate new architecture candidates to form the next-generation child population. These four steps are repeated until the evolutionary process converges.
There are many other aspects in which the evolutionary algorithm may differ, including (1) how to sample the initial population, (2) how to select the parent population, and (3) how to generate the child population from the parent population. Generating the child population from the parent population is of utmost importance in order to produce superior architecture candidates [
157]. In practice, to allow efficient exploration and exploitation [
170], crossover and mutation are two of the most popular strategies to generate the child population [
171,
172]. Specifically, for crossover, two random architecture candidates from the parent population are crossed to produce one new child architecture candidate. For mutation, one randomly selected architecture candidate mutates its operators with a fixed probability. However, the early evolutionary NAS works have to train a large number of stand-alone architecture candidates from scratch to evaluate the accuracy [
157] and, as a result, suffer from non-trivial computational resources [
169].
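The following minimal sketch (plain Python; the list-based architecture encoding, the search_space interface, and all hyperparameters are illustrative assumptions) summarizes the four steps above together with crossover and mutation:

```python
import random

def evolutionary_search(search_space, evaluate, generations=20,
                        population_size=50, top_k=10, mutation_prob=0.1):
    """Minimal evolutionary NAS loop: sample, evaluate, select top-k,
    then generate the next child population via crossover and mutation.
    An architecture is encoded as a list of operator choices, one per layer."""
    # (1) Sample the initial child population from the search space.
    population = [search_space.random_architecture() for _ in range(population_size)]
    for _ in range(generations):
        # (2) Evaluate every candidate (e.g., accuracy queried from a trained supernet).
        ranked = sorted(population, key=evaluate, reverse=True)
        # (3) Reserve the top-k candidates as the parent population.
        parents = ranked[:top_k]
        # (4) Generate the next-generation child population.
        population = []
        while len(population) < population_size:
            p1, p2 = random.sample(parents, 2)
            # Crossover: inherit each operator choice from either parent.
            child = [random.choice(ops) for ops in zip(p1, p2)]
            # Mutation: resample each operator with a fixed probability.
            child = [search_space.random_operator(i) if random.random() < mutation_prob
                     else op for i, op in enumerate(child)]
            population.append(child)
    return max(population, key=evaluate)
```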
To reduce the required computational resources for neural architecture search, [
169] introduces the paradigm of one-shot NAS, which has been widely applied in subsequent NAS methods [
40,
42,
138] thanks to its significant search efficiency. In parallel to [
169], SMASH [
173] also proposes a similar one-shot NAS paradigm, but [
169] is much more popular in the NAS community. Specifically, [
169] designs an effective one-shot supernet as visualized in Figure
11, which consists of all possible architecture candidates in the search space. Therefore, we only need to train the one-shot supernet, after which we can evaluate different architecture candidates in the search space with inherited network weights from the pretrained one-shot supernet as shown in Figure
12. This effectively avoids needing to train a large number of stand-alone architecture candidates from scratch. In practice, the one-shot supernet is simply trained using the standard SGD optimizer with momentum. Once the one-shot supernet is well trained, it is able to quickly and reliably approximate the performance of different architecture candidates using the paradigm of weight sharing [
160]. With the well-trained one-shot supernet, it is straightforward and technically easy to leverage the standard evolutionary algorithm to search for top-performing architecture candidates with superior accuracy on the target task [
169]. We note that the searched architecture candidates still need to be retrained or fine-tuned on the target task in order to recover the accuracy for further deployment on target hardware.
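A rough sketch of this evaluate-with-inherited-weights step is given below (PyTorch-style; the supernet(images, arch) interface and the hashable tuple encoding of arch are assumptions):

```python
import torch

@torch.no_grad()
def rank_candidates(supernet, candidates, val_loader, device="cuda"):
    """Approximate the accuracy of architecture candidates by running them as
    sub-paths of a pretrained one-shot supernet, i.e., with inherited weights
    and without any stand-alone training."""
    supernet.eval().to(device)
    scores = {}
    for arch in candidates:  # arch: a hashable encoding, e.g., a tuple of operator indices
        correct = total = 0
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            # The supernet forwards only the operators selected by `arch`,
            # reusing the shared weights trained once for all candidates.
            logits = supernet(images, arch)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
        scores[arch] = correct / total
    return sorted(candidates, key=scores.get, reverse=True)
```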
SPOS [
161] investigates the one-shot NAS [
169] and identifies two critical issues. On the one hand, the network weights in the one-shot supernet are deeply coupled during the training process. On the other hand, joint optimization introduces further coupling between architecture candidates and supernet weights. To address these, SPOS proposes the paradigm of single-path one-shot NAS, which uniformly samples one single-path subnetwork from the supernet and trains the sampled single-path subnetwork instead. This brings two main benefits: (1) reducing memory consumption to the single-path level and (2) improving the performance of the final searched architecture candidate. The success of SPOS has motivated a series of follow-up works [
42,
152,
153,
174,
175,
176,
177,
178,
179]. Note that all of these follow-up works [
42,
152,
153,
174,
175,
176,
177,
178,
179] focus on training an effective and reliable supernet, which then serves as the evaluator to quickly query the performance of different architecture candidates. For example, FairNAS [
174] demonstrates that the uniform sampling strategy only ensures soft fairness; to enforce strict fairness, FairNAS samples multiple single-path subnetworks so that all the operator candidates in the supernet are equally optimized during each training iteration. In parallel, OFA [
42] is another representative evolutionary NAS method that aims to train the supernet, after which we can detach single-path subnetworks from the supernet with inherited network weights for further deployment on target hardware. Note that the detached subnetwork in OFA still needs to be fine-tuned on the target task for several epochs (e.g., 25 epochs) in order to obtain competitive accuracy. To eliminate the fine-tuning process, BigNAS [
175] proposes several enhancements to train one single-stage supernet, where the single-path subnetwork detached from the supernet with inherited network weights can achieve superior accuracy without being retrained or fine-tuned on the target task and can be directly deployed on target hardware. This significantly saves the computational resources required for training stand-alone architecture candidates, especially when targeting multiple different deployment scenarios such as multiple different hardware platforms.
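The single-path training strategy underlying SPOS and its follow-ups can be sketched as follows (PyTorch-style; the supernet(images, arch) interface, choices_per_layer, and hyperparameters are assumptions):

```python
import random
import torch
import torch.nn.functional as F

def train_single_path_supernet(supernet, train_loader, choices_per_layer,
                               epochs=120, lr=0.1):
    """SPOS-style supernet training: at every step, uniformly sample one
    single-path subnetwork and update only the weights along that path,
    keeping memory consumption at the single-path level."""
    optimizer = torch.optim.SGD(supernet.parameters(), lr=lr, momentum=0.9)
    supernet.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            # Uniformly sample one operator index per searchable layer.
            arch = tuple(random.randrange(n) for n in choices_per_layer)
            optimizer.zero_grad()
            # Only the sampled path participates in the forward/backward pass.
            loss = F.cross_entropy(supernet(images, arch), labels)
            loss.backward()
            optimizer.step()
    return supernet
```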
Thanks to its search flexibility, evolutionary algorithm-based NAS can be easily extended to search for hardware-efficient architecture candidates, which maximize the accuracy on the target task while satisfying various real-world performance constraints [
153], such as latency, energy, memory, and more. Without loss of generality, we consider the following multi-objective optimization:
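A generic form of this constrained search problem is:
\[
\max_{arch \in \mathcal{A}} \; Accuracy(arch) \quad \text{s.t.} \quad Constraint_i(arch) \le C_i, \quad i = 1, \ldots, n
\]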
where
\(\lbrace Constraint_i(\cdot)\rbrace _{i=1}^n\) and
\(\lbrace C_i\rbrace _{i=1}^n\) are a set of real-world performance constraints.
Gradient-Based Search. In addition to reinforcement learning–based search and evolutionary algorithm–based search, gradient-based search [
138], also known as
differentiable search, is another representative branch of NAS, which has since gained increasing popularity in the NAS community and motivated a plethora of subsequent differentiable NAS works [
145,
146,
147,
148,
149,
150,
151,
180,
181,
182,
183,
184,
185,
186,
187,
188], thanks to its significant search efficiency [
189]. For example, DARTS [
138], as the seminal differentiable NAS work, is able to deliver one superior architecture candidate in
\(\sim\)1 day on one single NVIDIA GTX 1080 Ti GPU. In contrast to previous non-differentiable NAS practices [
137,
144,
160,
169] that rely heavily on discrete search spaces, DARTS leverages a list of architecture parameters
\(\alpha\) to relax the discrete search space to become continuous. Benefiting from the continuous search space, both the network weights
w and the architecture parameters
\(\alpha\) can be optimized via alternating gradient descent. Once the differentiable search process terminates, we can interpret the optimal architecture candidate from the architecture parameters
\(\alpha\). Specifically, the supernet in DARTS is initialized by stacking multiple over-parameterized cells (see Figure
13 (1)), in which each cell consists of all possible cell structures in the cell-based search space
\(\mathcal {A}\). As shown in Figure
13, each cell is represented using the DAG that consists of
N nodes
\(\lbrace x_i\rbrace _{i=1}^N\). Note that the nodes here correspond to the intermediate feature maps. In addition, the directed edges between
\(x_i\) and
\(x_j\) correspond to a list of operator candidates
\(\lbrace o | o \in \mathcal {O}\rbrace\) in the operator space
\(\mathcal {O}\). The directed edges between
\(x_i\) and
\(x_j\) are also assigned with a list of architecture parameters
\(\lbrace \alpha _o^{(i, j)} | o \in \mathcal {O}\rbrace\). Finally, following DARTS, we formulate
\(x_j\) as follows:
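That is, each node is computed as a softmax-weighted mixture of all candidate operators applied to its predecessor nodes:
\[
x_j = \sum_{i \lt j} \sum_{o \in \mathcal{O}} \frac{\exp \big(\alpha_o^{(i,j)}\big)}{\sum_{o^{\prime} \in \mathcal{O}} \exp \big(\alpha_{o^{\prime}}^{(i,j)}\big)} \, o(x_i)
\]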
Note that the output
\(x_j\) is continuous with respect to
\(x_i\),
\(\alpha\), and
w. In light of this, DARTS proposes to optimize
\(\alpha\) and
w using the following bilevel optimization scheme:
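Concretely, the architecture parameters are optimized on the validation set, subject to the network weights being optimal on the training set:
\[
\min_{\alpha} \; \mathcal{L}_{val}\big(w^{*}(\alpha), \alpha\big) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w} \; \mathcal{L}_{train}(w, \alpha)
\]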
where
\(\mathcal {L}_{train}(\cdot)\) and
\(\mathcal {L}_{val}(\cdot)\) are the loss functions on the training and validation datasets, respectively. Once the differentiable search process terminates, DARTS determines the optimal architecture candidate by reserving the strongest operator
\(o\) and removing other operators between
\(x_i\) and
\(x_j\), in which the operator strength is defined as
\(\exp (\alpha _o^{(i, j)}) / \sum _{o^{\prime } \in \mathcal {O}} \exp (\alpha _{o^{\prime }}^{(i, j)})\). It is worth noting that the searched optimal architecture candidate still needs to be retrained on the target task in order to recover its accuracy for further deployment on target hardware.
Inspired by the promising performance of DARTS, a plethora of follow-up works [
145,
146,
147,
148,
149,
150,
151,
180,
181,
182,
183,
184,
185,
186,
187,
188] have recently emerged that strive to unleash the power of differentiable NAS to deliver superior architecture candidates. For example, in contrast to DARTS, which simultaneously optimizes all operator candidates in the supernet, PC-DARTS [
146] introduces partial channel connections to alleviate the excessive memory consumption of DARTS. In addition, DARTS+ [
148] investigates the performance collapse issue of DARTS and finds that the performance collapse issue is caused by the over-selection of
skip-connect. To tackle this, DARTS+ proposes a simple yet effective early-stopping strategy to terminate the search process upon fulfilling a set of predefined criteria. In parallel, DARTS– [
149] also observes that the performance collapse issue of DARTS comes from the over-selection of
skip-connect and further leverages an auxiliary skip connection to mitigate the performance collapse issue and stabilize the search process. Apart from these, Single-DARTS [
185] and Gold-NAS [
184] investigate the bilevel optimization in Equation (
8) and point out that the bilevel optimization may end up with suboptimal architecture candidates, based on which Single-DARTS and Gold-NAS revert to one-level optimization. To accelerate the search process, GDAS [
186] introduces an efficient Gumbel-Softmax [
190]–based differentiable sampling approach to reduce the optimization complexity to the single-path level. Similar to GDAS, SNAS [
187] also leverages Gumbel-Softmax reparameterization to improve the search process, which can make use of gradient information from generic differentiable loss without sacrificing the completeness of NAS pipelines. PT-DARTS [
182] revisits the architecture selection in differentiable NAS and demonstrates that the architecture parameters
\(\alpha\) cannot always imply the optimal architecture candidate, based on which PT-DARTS introduces the perturbation-based architecture selection to determine the optimal architecture candidate at the end of search.
The aforementioned differentiable NAS works [
145,
146,
147,
148,
149,
150,
151,
180,
181,
182,
183,
184,
185,
186,
187,
188], however, focus on accuracy-only neural architecture search, which indeed demonstrates promising performance in terms of finding the architecture candidate with competitive accuracy but fails to accommodate the limited available computational resources in real-world embedded scenarios. To overcome such limitations, the paradigm of hardware-aware differentiable NAS [
130,
191,
192,
193,
194] has recently emerged, which is based on DARTS and focuses on finding top-performing architecture candidates within the cell-based search space that can achieve both high accuracy on target task and high inference efficiency on target hardware. To achieve this goal, one widely adopted approach is to integrate the latency-constrained loss term into the overall loss function to penalize the architecture candidate with high latency, which can be mathematically formulated as follows:
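One representative instantiation of such a latency-penalized objective (the exact form varies across works) is:
\[
\min_{w, \alpha} \; \mathcal{L}_{val}(w, \alpha) + \lambda \cdot Latency(\alpha)
\]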
where
\(\lambda\) is the trade-off coefficient to control the trade-off magnitude between accuracy and latency. As demonstrated in [
131,
156], a larger
\(\lambda\) ends up with the architecture candidate that maintains low accuracy and low latency, whereas a smaller
\(\lambda\) leads to the architecture candidate with high accuracy and high latency.
\(Latency(\alpha)\) corresponds to the latency of the architecture candidate encoded by
\(\alpha\). We note that the optimization objective in Equation (
9) can be easily generalized to jointly optimize other types of hardware performance constraints, such as energy and memory consumption, in which we only need to incorporate
\(Energy(\alpha)\) and
\(Memory(\alpha)\) into the optimization objective in Equation (
9). For example, we can reformulate the optimization objective in Equation (
9) as follows to jointly optimize the on-device latency, energy, and memory consumption:
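Under the same penalized form, one such reformulation is:
\[
\min_{w, \alpha} \; \mathcal{L}_{val}(w, \alpha) + \lambda_1 \cdot Latency(\alpha) + \lambda_2 \cdot Energy(\alpha) + \lambda_3 \cdot Memory(\alpha)
\]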
where
\(\lambda _1\),
\(\lambda _2\), and
\(\lambda _3\) are trade-off coefficients to determine the trade-off magnitudes between accuracy and latency, energy, and memory, respectively.
Despite the significant progress to date, the aforementioned hardware-aware differentiable NAS works [
130,
191,
192,
193,
194] rely heavily on the cell-based search space; these works first determine the optimal cell structure and then repeatedly stack the same cell structure across the entire network [
138]. However, as demonstrated in MnasNet [
38], such NAS practices suffer from inferior accuracy and efficiency due to the lack of operator diversity. Even worse, the architecture candidates in the cell-based search space have multiple parallel branches as shown in Figure
7, which introduce considerable memory access overheads and, as a result, have difficulty benefitting from the high computational parallelism on mainstream hardware platforms [
34,
35]. To overcome such limitations, recent hardware-aware differentiable NAS works [
39,
40,
154,
188,
195,
196,
197,
198,
199] have shifted their attention from the cell-based search space (see Figure
7) to the block-based search space (see Figure
8). The most representative include FBNet [
39], ProxylessNAS [
40], SP-NAS [
197], and TF-NAS [
195]. Similar to GDAS [
186] and SNAS [
187], FBNet leverages Gumbel-Softmax reparameterization [
190] to relax the discrete search space to be continuous. FBNet collects a simple yet effective latency lookup table to quickly approximate the latency of different architecture candidates. The pre-collected latency lookup table is then integrated into the search process to derive hardware-efficient architecture candidates. However, similar to DARTS [
138], FBNet needs to simultaneously optimize all the operator candidates in the supernet during the search process, which is not scalable to large search spaces and suffers from the memory bottleneck [
40,
131]. In light of this, ProxylessNAS introduces an effective path-level binarization approach to reduce the memory consumption to the single-path level, which significantly improves search efficiency without compromising search accuracy. In parallel, SP-NAS demonstrates that different operator candidates in the supernet can be viewed as subsets of an over-parameterized superkernel, based on which SP-NAS proposes to encode all the operator candidates into the superkernel. In practice, this explicitly reduces memory consumption to the single-path level, which alleviates the memory bottleneck during the search process. TF-NAS thoroughly investigates the three search freedoms in hardware-aware differentiable NAS: (1)
operator-level search, (2)
depth-level search, and (3)
width-level search as shown in Figure
14, which is able to perform fine-grained architecture search. To obtain hardware-efficient architecture candidates, TF-NAS integrates the pre-collected latency lookup table into the search process. TF-NAS also introduces a simple yet effective bi-sampling search algorithm to accelerate the search process towards enhanced search efficiency.
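As an illustration of the lookup-table approach adopted by FBNet-style methods, the expected latency under the relaxed architecture distribution can be computed as follows (a simplified sketch; the table layout and names are illustrative):

```python
import torch.nn.functional as F

def expected_latency(arch_params, latency_table):
    """Differentiable latency estimate from a pre-collected lookup table.

    arch_params: list of tensors, one per searchable layer, holding the
        architecture parameters (logits) over that layer's operator candidates.
    latency_table: list of tensors with the pre-measured latency (ms) of each
        operator candidate at the corresponding layer.
    """
    total = arch_params[0].new_zeros(())
    for logits, op_latencies in zip(arch_params, latency_table):
        probs = F.softmax(logits, dim=-1)             # relaxed operator choice
        total = total + (probs * op_latencies).sum()  # expected per-layer latency
    return total
```

The returned scalar remains differentiable with respect to the architecture parameters and can be added to the loss as a latency penalty (e.g., \(\lambda \cdot Latency(\alpha)\)).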
Even so, we should consider not only the
explicit search cost (i.e., the time required for one single search experiment) but also the
implicit search cost (i.e., the time required for manual hyperparameter tuning to find the desired architecture candidate). This is because, in real-world embedded scenarios such as autonomous vehicles, DNNs must be executed under strict latency constraints (e.g., 24 ms), in which any violation may lead to catastrophic consequences [
20,
163]. However, to find the architecture candidate with the latency of 24 ms, the aforementioned hardware-aware differentiable NAS works [
39,
40,
130,
154,
188,
191,
192,
193,
194,
195,
196,
197,
198,
199] have to repeat a plethora of search experiments to tune the trade-off coefficient
\(\lambda\) (see Equation (
9)) through trial and error [
131,
156], which significantly increases the total search cost. The intuition behind this is that
\(\lambda\), despite being able to trade off between accuracy and latency, is quite sensitive and difficult to control [
131,
156]. To overcome such limitations, HardCoRe-NAS [
200] leverages an elegant
Block Coordinate Stochastic Frank-Wolfe (BCSFW) algorithm [
201] to restrict the search direction around the specified latency requirement. In addition, LightNAS [
131,
156] introduces a simple yet effective hardware-aware differentiable NAS approach, which investigates the optimization objective in Equation (
9) and proposes to optimize the trade-off coefficient
\(\lambda\) during the search process in order to satisfy the specified latency requirement. In other words, LightNAS focuses on automatically learning
\(\lambda\) that strictly complies with the specified latency requirement, which is able to find the required architecture candidate in one single search (i.e.,
you only search once) and avoids performing manual hyperparameter tuning over
\(\lambda\). The optimization objective of LightNAS is formulated as follows:
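One formulation consistent with the gradient descent/ascent updates described below is the following minimax (Lagrangian-style) objective:
\[
\min_{w, \alpha} \max_{\lambda} \; \mathcal{L}(w, \alpha, \lambda) = \mathcal{L}_{val}(w, \alpha) + \lambda \cdot \left( \frac{Latency(\alpha)}{T} - 1 \right)
\]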
where
T is the specified latency requirement. In contrast to previous hardware-aware differentiable NAS works [
39,
40,
130,
154,
188,
191,
192,
193,
194,
195,
196,
197,
198,
199],
\(\lambda\) in Equation (
11) is not a constant but rather a learnable hyperparameter that can be automatically optimized during the search process. For the sake of simplicity, below we use
\(\mathcal {L}(w, \alpha , \lambda)\) to denote the optimization objective in Equation (
11). Finally, to satisfy the specified latency requirement (i.e.,
\(Latency(\alpha) = T\)),
w and
\(\alpha\) are updated using gradient descent [
138], whereas
\(\lambda\) is updated using gradient ascent as follows:
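Written out, these updates are:
\[
w \leftarrow w - lr_{w} \cdot \frac{\partial \mathcal{L}(w, \alpha, \lambda)}{\partial w}, \quad
\alpha \leftarrow \alpha - lr_{\alpha} \cdot \frac{\partial \mathcal{L}(w, \alpha, \lambda)}{\partial \alpha}, \quad
\lambda \leftarrow \lambda + lr_{\lambda} \cdot \frac{\partial \mathcal{L}(w, \alpha, \lambda)}{\partial \lambda}
\]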
where
\(lr_{w}\),
\(lr_{\alpha }\), and
\(lr_{\lambda }\) are the learning rates of
w,
\(\alpha\), and
\(\lambda\), respectively. Below we further demonstrate why LightNAS guarantees
\(Latency(\alpha) = T\). As shown in LightNAS, a larger
\(\lambda\) leads to the architecture candidate with low latency, whereas a smaller
\(\lambda\) results in the architecture candidate with high latency. Therefore, if
\(Latency(\alpha) \gt T\), the gradient ascent scheme increases
\(\lambda\) to reinforce the latency regularization magnitude. As a result,
\(Latency(\alpha)\) decreases towards
T in the next search iteration. Likewise, if
\(Latency(\alpha) \lt T\), the gradient ascent scheme decreases
\(\lambda\) to diminish the latency regularization magnitude, after which
\(Latency(\alpha)\) increases towards
T in the next search iteration. Finally, the search engine ends up with the architecture candidate that strictly satisfies the specified latency requirement (i.e.,
\(Latency(\alpha) = T\)). More recently, Double-Win NAS [
202] proposes deep-to-shallow transformable search to combine the strengths of deep and shallow networks towards an aggressive accuracy–efficiency win–win. Similar to LightNAS [
131,
156], the resulting shallow network can also satisfy the specified latency constraint. Finally, we compare previous representative hardware-aware NAS works, which are summarized in Table
1.
3.3 Speedup Techniques and Extensions
In this section, we discuss recent state-of-the-art advances in general speedup techniques and extensions for NAS algorithms, including one-shot NAS enhancements, efficient latency prediction, efficient accuracy prediction, low-cost proxies, zero-cost proxies, efficient transformer search, efficient domain-specific search, and mainstream NAS benchmarks, which have the potential to significantly benefit NAS algorithms and largely facilitate the search process.
Beyond One-Shot NAS. Despite the high search efficiency, one-shot NAS often suffers from poor ranking correlation between one-shot search and stand-alone training. As pointed out in [
205], one-shot search results do not necessarily correlate with stand-alone training results across various search experiments. To overcome such limitations, a plethora of one-shot NAS enhancements have been proposed recently [
206,
207,
208,
209,
210,
211]. Specifically, [
206,
207,
208,
209,
210] resort to few-shot NAS. In contrast to one-shot NAS [
160], which only features one supernet, few-shot NAS introduces multiple supernets to explore different regions of the predefined search space, which slightly increases the search cost over one-shot NAS but can deliver much more reliable search results. For example, as shown in [
206], with no more than 7 supernets, few-shot NAS can establish new state-of-the-art search results on ImageNet. Among them, [
209] demonstrates that zero-cost proxies can be integrated into few-shot NAS, which can further enhance the search process of one-shot NAS and thus produce better search results. More recently, [
208] generalizes few-shot NAS to distill LLMs, which focuses on automatically distilling multiple compressed student models under various computational budgets from a large teacher model. In contrast to few-shot NAS, which leverages multiple supernets to improve the ranking correlation performance of one-shot NAS, CLOSE [
211] instead features an effective curriculum learning-like schedule to control the parameter-sharing extent within the proposed supernet dubbed CLOSENet, in which the parameter-sharing extent can be flexibly adjusted during the search process and the parameter-sharing scheme is built upon an efficient graph-based encoding scheme.
Efficient Latency Prediction. As seen in MnasNet [
38], latency is directly measured on target hardware, which is then integrated into the RL reward (see Equation (
2)) to penalize the architecture candidate with high latency. The direct on-device latency measurement is indeed accurate. However, it is time-consuming and does not scale to large search spaces [
40]. To overcome such limitations, several latency prediction strategies have been proposed recently. For example, ProxylessNAS [
40], FBNet [
39], and OFA [
42] leverage the latency lookup table to approximate the on-device latency, which sums up the latency of all the operator candidates. In addition, HSCoNAS [
152,
178] demonstrates that the data movements and communications among different operator candidates introduce additional latency overheads, making the pre-collected latency lookup table inaccurate. To mitigate this issue, HSCoNAS quantifies the latency that corresponds to the intermediate data movements and communications, which is then fed into the pre-collected latency lookup table to achieve more accurate latency prediction performance. However, the latency lookup table is only applicable to the block-based search space, which leads to unreliable latency prediction performance in terms of the cell-based search space [
213]. To this end, EdgeNAS [
130], LA-DARTS [
191], and LC-NAS [
192] propose to use learning-based approaches for the purpose of latency prediction. For example, EdgeNAS trains an efficient
multi-layer perceptron (MLP) to predict the latency of different architecture candidates in the cell-based search space, which can also be generalized to predict the latency of different architecture candidates in the block-based search space as shown in [
131,
156,
176,
212,
214]. BRP-NAS [
213] and SurgeNAS [
154] introduce graph neural network (GNN)–based latency predictors to achieve more reliable latency prediction performance. The above latency predictors, however, (1) rely on a large number of training samples to achieve decent latency prediction performance (e.g., 100,000 training samples in EdgeNAS) and (2) need to be reconstructed for either new hardware or new search spaces. To avoid these issues, HELP [
215] and MAPLE-Edge [
216] focus on building an efficient latency predictor using only a few training samples (e.g., as few as 10 training samples in HELP), which can be generalized to new hardware or new search spaces with only minimal re-engineering efforts. More recently, EvoLP [
217] considers an effective self-evolving scheme to construct efficient yet accurate latency predictors, which can adapt to unseen hardware with only minimal re-engineering efforts.
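A minimal sketch of such a learning-based latency predictor is given below (PyTorch-style; the one-hot encoding, network width, and training loop are illustrative and differ from the exact designs in EdgeNAS and related works):

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """A small MLP that maps an architecture encoding (e.g., concatenated
    one-hot operator choices) to a predicted on-device latency in ms."""
    def __init__(self, encoding_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoding_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_latency_predictor(predictor, encodings, latencies, epochs=200, lr=1e-3):
    """Regress measured on-device latencies against architecture encodings."""
    optimizer = torch.optim.Adam(predictor.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(predictor(encodings), latencies)
        loss.backward()
        optimizer.step()
    return predictor
```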
Efficient Accuracy Prediction. In parallel to latency prediction, accuracy prediction has also received increasing attention from the NAS community [
213,
218,
219,
220,
221,
222], which strives to directly predict the accuracy of different architecture candidates in the search space. Specifically, [
218] introduces a simple yet effective
graph convolutional network (GCN)–based accuracy predictor, which can achieve reliable accuracy prediction performance thanks to GCNs’ strong capability to learn graph-structured data. Similar to [
218], BRP-NAS [
213] also considers GCNs for reliable accuracy prediction, which introduces transfer learning to further improve the accuracy prediction performance from the pretrained latency predictor. In parallel, [
219] leverages a non-neural model (i.e., the gradient-boosted decision tree (GBDT)) as the accuracy predictor, which has a stronger capability to learn representations than neural network–based accuracy predictors. In addition, NASLib [
220] investigates a wide range of accuracy predictors from learning curve extrapolation, weight-sharing, supervised learning, and zero-cost proxies on three popular NAS benchmarks (i.e., NAS-Bench-101 [
223], NAS-Bench-201 [
224], and NAS-Bench-NLP [
225]). NASLib reveals that different accuracy predictors can be combined to achieve substantially better accuracy prediction performance than any single accuracy predictor. DONNA [
221] proposes to build an efficient accuracy predictor, which involves only minimal computational resources and, more importantly, can scale to diverse search spaces. To achieve this, DONNA uses blockwise knowledge distillation to construct an architecture candidate pool in which each architecture candidate only needs to be fine-tuned for several epochs to derive the accuracy rather than being trained from scratch. In contrast to the aforementioned accuracy predictors that feature graph-based encoding schemes, GATES [
222,
226] instead models the operations as the transformation of the propagating information, which can effectively mimic the actual data processing of different neural architecture candidates. More importantly, the encoding scheme of GATES can be integrated into the above accuracy predictors to further boost their accuracy prediction performance. Similar to GATES, TA-GATES [
227] introduces an effective encoding scheme with analogous modeling of the training process of different neural architecture candidates, which can achieve better accuracy prediction performance than GATES on various representative NAS benchmarks.
Low-Cost Proxies (Learning Curve Extrapolations). Low-cost proxies, also referred to as
learning curve extrapolations [
231], aim to interpret the accuracy of the given architecture candidate only using its early training statistics, such as the training loss in the first few training epochs, which has motivated a plethora of subsequent works to continue exploring learning curve extrapolation [
156,
232,
233,
234,
235,
236,
237]. For example, in contrast to the conventional accuracy predictor that only uses the network configuration as input features, [
236] proposes to combine the network configuration and a series of validations of accuracy in the first few training epochs as input features to train a simple regression model, which can be generalized to predict the accuracy of unseen architecture candidates. In addition, [
232] introduces
Training Speed Estimation (TSE), which simply accumulates the early training statistics to achieve reliable yet computationally inexpensive ranking among different architecture candidates. The work of [
156,
237] introduces
Batchwise Training Estimation (BTE) and
Trained Batchwise Estimation (TBE), both of which consider the fine-grained batchwise training statistics to provide more reliable prediction performance using minimal computational resources. In parallel, [
234] introduces
Loss Curve Gradient Approximation (LCGA) to rank the accuracy of different architecture candidates with minimal training. The work of [
233] introduces NAS-Bench-x11 to unleash the power of learning curve extrapolation by predicting the training trajectories, which can be easily integrated into the aforementioned learning curve extrapolation works to quickly estimate the performance of the given architecture candidate.
Zero-Cost Proxies. In addition to the above low-cost proxies (i.e., learning curve extrapolation), zero-cost proxies have recently flourished [
228,
229,
230,
238,
239,
240,
241,
242,
243,
244,
245,
246,
247], which focus on interpreting the performance of the given architecture candidate in training-free manners. Zero-cost proxies, such as EPE [
240], Fisher [
241], GradNorm [
238], Grasp [
242], Jacov [
243], Snip [
244], Synflow [
245], ZenScore [
246], LRC [
228], and NTK [
229], can provide reliable performance estimation using only one single mini-batch of data and one single forward/backward propagation pass, which necessitate near-zero computational cost [
230,
238,
239]. Thanks to their reliable performance estimation and low cost, these zero-cost proxies have been widely adopted in recent NAS works to accelerate the search process [
230,
243,
247]. As demonstrated in [
230,
238], combining different zero-cost proxies may lead to more reliable ranking performance estimation than any single zero-cost proxy. For example, as shown in Figure
15, combining LRC and NTK provides more reliable ranking performance estimation than LRC or NTK separately. In light of this, TE-NAS [
230] further leverages LRC and NTK to jointly estimate the ranking performance among different architecture candidates in the search space, which quickly ends up with the optimal architecture candidate on ImageNet in less than 4 hours on one single NVIDIA GTX 1080 Ti GPU.
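To illustrate how lightweight such proxies are, the following sketch scores an untrained candidate with one mini-batch and a single forward/backward pass using a GradNorm-style criterion (a simplified variant; the published proxies differ in their exact definitions):

```python
import torch.nn.functional as F

def gradnorm_score(model, images, labels):
    """GradNorm-style zero-cost proxy: the summed gradient norm over all
    parameters after one forward/backward pass on a single mini-batch."""
    model.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)
```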
Efficient Transformer Search. In addition to CNNs, transformers are another important branch of DNNs. Inspired by the tremendous success of NAS in searching for superior CNNs, automated transformer search has gained increasing popularity, which applies NAS techniques to automatically search for superior transformers, including transformers for NLP tasks [
93,
249,
250,
251,
252,
253,
254] and vision transformers for vision tasks [
248,
255,
256,
257,
258,
259,
260]. Automated transformer search is technically the same as automated convolutional network search, in which both feature the same search pipeline. For example, HAT [
93], as one of the state-of-the-art NAS works in the field of NLP, focuses on searching for hardware-efficient transformers for NLP tasks. To achieve this, HAT first initializes an over-parameterized
superformer that consists of all possible transformer candidates in the search space, which is technically the same as the supernet in automated convolutional network search. After that, HAT trains the
superformer using the standard weight-sharing technique [
160], which then serves as the accuracy predictor to quickly interpret the accuracy of different transformer candidates. Next, HAT builds an efficient latency predictor to avoid the tedious on-device latency measurement. Finally, HAT applies the standard evolutionary algorithm to find hardware-efficient transformer candidates with both high accuracy and high efficiency, mirroring how OFA [
42] searches for hardware-efficient convolutional networks. Furthermore, due to the tremendous success of vision transformers in vision tasks as discussed in Section
2.2, a plethora of NAS works [
248,
255,
256,
257,
258,
259,
260] have been subsequently proposed to automate the design of superior vision transformers. The work of [
248], being the first, introduces an evolutionary algorithm-based NAS framework dubbed AutoFormer. Similar to HAT, AutoFormer first constructs an over-parameterized
superformer that consists of all possible vision transformer candidates in the search space, which is then trained using the weight entanglement scheme. The difference between weight sharing and weight entanglement is visualized in Figure
16, in which weight entanglement is technically similar to the superkernel in SP-NAS [
197]. Finally, AutoFormer applies the standard evolutionary algorithm to explore the optimal vision transformer candidate. These clearly demonstrate that we can easily leverage recent state-of-the-art NAS techniques that focus on searching for competitive CNNs to automate the design of top-performing transformers for both NLP and vision tasks.
Efficient Domain-Specific Search. In addition to image classification, NAS can also be applied to a wide range of real-world scenarios, such as object detection [
268,
269,
270], semantic segmentation [
271,
272,
273], point cloud processing [
192,
274,
275,
276], image super-resolution [
277,
278], and more. For example, MobileDets [
268] are a family of hardware-efficient object detection networks that can deliver promising detection accuracy while maintaining superior detection efficiency on multiple embedded computing systems, including mobile
central processing units (CPUs), edge
tensor processing units (TPUs), and edge GPUs. MobileDets first construct an enlarged search space that contains a large number of possible object detection networks and then leverage an MnasNet-like reinforcement learning–based search algorithm [
38] to find top-performing object detection networks, which also feature the same reward function as TuNAS [
163] to trade off between detection accuracy and efficiency. The work of [
192] introduces an efficient hardware-aware differentiable NAS framework dubbed LC-NAS, aiming to automate the design of competitive network solutions for point cloud processing. Here, similar to EdgeNAS [
130] and LA-DARTS [
191], which focus on finding top-performing architecture candidates for image classification, LC-NAS exploits the same cell-based search space and integrates the latency constraint into the optimization objective to penalize the architecture candidate with high latency. These demonstrate that we can easily include domain-specific knowledge (e.g., domain-specific search spaces) into mainstream NAS techniques (e.g., differentiable, evolutionary algorithm–based, and reinforcement learning–based NAS) to search for domain-specific network solutions.
Mainstream NAS Benchmarks. Although NAS has achieved substantial performance improvement across various NLP and vision tasks, fair comparisons between different NAS works are frustratingly hard and still an open issue, as demonstrated in [
279]. This is because different NAS works may feature quite different training recipes, such as different training epochs and training enhancements. For example, DARTS+ [
148] trains the searched architecture candidate on CIFAR-10 for 2,000 epochs, whereas DARTS [
138] only applies 600 training epochs. DARTS+ trains the searched architecture candidate on ImageNet for 800 epochs, with a batch size of 2,048, where AutoAugment [
280] is also integrated in order to achieve stronger data augmentation. In contrast, DARTS only applies 250 training epochs with a batch size of 128 by default. We note that, for the same architecture candidate, more training epochs and stronger data augmentation typically yield better accuracy on the target task, as shown in [
279]. RandomNAS [
281] challenges the effectiveness of early state-of-the-art NAS works and demonstrates that random search, as one strong search baseline to explore random networks, can achieve even better performance on the target task than early state-of-the-art NAS works. In parallel, RandWire [
282] shows that randomly wired networks can also exhibit strong accuracy on ImageNet. Therefore, it remains unknown whether the performance improvement of NAS is due to the more advanced training recipe or the search algorithm itself, making it difficult to evaluate and compare the technical contributions of different NAS works [
223,
224,
279].
To overcome such limitations, a plethora of tabular and surrogate NAS benchmarks have been proposed: NAS-Bench-101 [
223], NAS-Bench-201 [
224], NATS-Bench [
261], NAS-Bench-301 [
262], NAS-Bench-360 [
263], NAS-Bench-1Shot1 [
264], NAS-Bench-ASR [
265], NAS-Bench-Graph [
266], NAS-Bench-NLP [
225], HW-NAS-Bench [
214], NAS-Bench-x11 [
233], and NAS-Bench-Suite [
267]. We note that NAS benchmarks typically have two important parts, the predefined search space and the related performance metrics for all possible architecture candidates that can be easily queried. In tabular NAS benchmarks [
214,
223,
224,
225,
261,
263,
264,
265,
266], all possible architecture candidates are enumerated and trained from scratch on the target task to obtain the performance metrics, such as the training and validation accuracy. In contrast, surrogate NAS benchmarks [
214,
233,
262,
267] leverage learning-based methods to predict the performance metrics of different architecture candidates rather than directly enumerating and training all possible architecture candidates on the target task, thus leading to significantly reduced computational resources. In light of this, surrogate NAS benchmarks can be easily extended to deal with larger search spaces than tabular NAS benchmarks (
\(10^{18}\) in NAS-Bench-301 [
262] vs. 15,625 in NAS-Bench-201 [
224]). Finally, we compare and summarize the aforementioned state-of-the-art NAS benchmarks in Table
2.