1 Introduction

Over the past decade, deep learning has made great strides in several fields, ranging from computer vision to natural language processing and Reinforcement Learning (RL) [7, 16]. Much of this success has been driven by the search for better neural network architectures [12, 32], which has in turn significantly increased the complexity of state-of-the-art neural network architectures. This trend has led to the creation of automated methods for finding optimal neural network architectures.

This set of methods is usually referred to as Neural Architecture Search (NAS). NAS has been done using a wide variety of techniques, from evolutionary algorithms [8] to reinforcement learning [27] and continuous relaxation [20].

Most NAS methods used today are single-use methods, where an algorithm is run once for multiple hours or days, and yields a single architecture. If any of the parameters of the search are changed, such as the search space and the target application, the search must be repeated, which can be computationally costly. Typically, computational costs also scale with the search space, with larger search spaces requiring significantly more computational resources. In this publication, we take the first step towards building a NAS system that scales efficiently with the size of the search space, to limit the necessary computational resources.

We aim to achieve this by learning a search behaviour, rather than trying to find an optimal architecture for any given problem. In this paper, we focus on the problem setting and the design of the agent. We examine the agent in the NAS-Bench-101 and NAS-Bench-301 settings.

More precisely, we make the following contributions:

  1. We propose a novel Reinforcement Learning-based NAS methodology based on the incremental improvement of neural network architectures.

  2. We investigate the effectiveness of our NAS methodology on two established benchmarks: NAS-Bench-101 and NAS-Bench-301.

  3. We compare against several known strong baseline algorithms, including random search and local search, as well as several state-of-the-art algorithms.

2 Related work

Over the years, several types of algorithms have been used to tackle the NAS problem. One popular class of NAS algorithms is evolutionary algorithms. They have been used by Real et al. [29] and Elsken et al. [8] to successfully find architectures that can match or exceed the performance of state-of-the-art hand-designed neural network architectures. A more recent example of the evolutionary approach is the work of Hendrickx et al. [13]. They use a modified version of NSGA-Net [21] to optimize convolutional neural networks with the aim of steering a nanodrone towards a flower, so the flower can be pollinated. They modified NSGA-Net to include depthwise separable operations, and started their searches from known, well-performing architectures, such as MobileNetV2. Using this approach, they showed that starting from known well-performing architectures can increase search performance, leading to NAS algorithms finding good solutions more quickly.

Another approach that has proved popular is continuous relaxation, first introduced in DARTS [20]. DARTS transforms the discrete set of operations that can be assigned to the edges of a computational graph into a continuous one. This allows the problem of finding the optimal operation to assign to each edge to be reframed as a bi-level optimization problem, where the first level constitutes the trainable parameters of the underlying neural network, and the second level constitutes the architectural parameters that parameterize the choice of operation for each edge. Using gradient descent, the authors are able to find strong-performing architectures with a reasonable computational budget. DARTS has seen many adaptations in recent years, including BHE-DARTS [4], Fair DARTS [5], etc.

Monte-Carlo Tree Search (MCTS) is another approach that has gained popularity in recent years [35, 44]. Usually, in these approaches, an architecture is broken down into a sequence of decisions. One decision is made at every level of the search tree, until a complete architecture has been sampled. A very recent example is GCNAS [44], which uses MCTS to find a strong architecture for the semantic segmentation of multispectral LiDAR point clouds.

2.1 Reinforcement learning-based NAS

Next, we examine several RL-based NAS algorithms in more detail. In their paper on MetaQNN [3], Baker et al. showcase a tabular Q-learning-based methodology for iteratively designing Convolutional Neural Networks (CNNs) in a chain-structured, macro search space. They consider three different computer vision target domains: CIFAR-10, SVHN and the Modified National Institute of Standards and Technology (MNIST) database. This approach models the design process for a neural network as a sequential decision-making problem. A complete neural network is considered to consist of a sequence of layers. MetaQNN decides the type of each layer, and some of its parameters. Using this approach, they were able to achieve comparable performance to some state-of-the-art networks of the time. The experiments in this publication were reported to take 8–10 days using 10 Graphics Processing Units (GPUs), resulting in a total computational cost of 80–100 GPU-days.

Zoph et al. published their paper on performing Neural Architecture Search with Reinforcement Learning in 2017 [47]. Contrary to MetaQNN, Zoph et al. used a Long Short-Term Memory (LSTM)-based Reinforcement Learning controller to build neural networks. Their agent was capable of operating both in a chain-structured macro search space for designing CNNs for CIFAR-10, and in a cell-based search space for designing recurrent cells for use on Penn Treebank. Compared to Baker et al.’s incremental approach, Zoph et al. designed entire neural networks in a single time step, and used their validation accuracy as a reward signal. Using their algorithm, Zoph et al. were able to achieve state-of-the-art performance or close to it for both CIFAR-10 and Penn Treebank. One of the downsides of their methodology was its large computational cost of 800 GPUs over 28 days [48]. Especially compared to the 80–100 GPU-days of Baker et al. [3], this was a large hurdle for any researcher without access to large computational clusters.

This prompted efforts from various researchers, among whom were Pham et al., to lower the amount of computational resources necessary to conduct NAS research. In their paper “Efficient Neural Architecture Search via Parameter Sharing” [27], they proposed a method to speed up the search process. Similar to Zoph et al., Pham et al. considered the CIFAR-10 and Penn Treebank problems. Pham et al. were able to achieve a speed-up by a factor of roughly 1000 compared to Zoph et al.’s first publication [47], resulting in an overall computational cost of between 8 (CIFAR-10, Macro Search Space) and 10 GPU-hours (Penn Treebank and CIFAR-10 Micro Search Space). Pham et al. were able to achieve this speed-up through the use of parameter sharing, or weight sharing. Realizing that the major bottleneck in [47] was the training of neural architectures from scratch to convergence, Pham et al. attempted, through the use of weight-sharing, to accelerate this training process, and thus limit the amount of time necessary to sufficiently train each sampled architecture.

Although the approach was originally published in 2018, LSTM-based controllers are still often used in applications of NAS. One example of this is Li et al.’s work in [18], where an LSTM-based controller is used for designing Graph Neural Networks (GNNs) capable of transferring between different graph-based tasks. They do not use weight-sharing, but rather use a performance prediction model to further increase computational efficiency, and avoid some pitfalls of weight sharing specific to GNNs [46].

A more recent method for performing NAS using RL is GraphPNAS [17]. In this work, a probabilistic graph generator is trained using the REINFORCE algorithm [41]. The authors show strong performance on various benchmarks, with a computational cost of 16 GPU-days for the Tiny-ImageNet with Oracle Evaluator setting and 12 GPU-hours in the Efficient Neural Architecture Search (ENAS) Macro search space setting on CIFAR-10.

3 Methods

Following the definition originally given in [9], we consider a solution to a NAS problem to consist of three components: a search space, a search strategy and a performance estimation strategy. In this section, we will consider these three aspects for our method. For a more thorough overview of the different types of search spaces, search strategies, performance estimation strategies, etc., we refer interested readers to [40].

Besides these aspects, we would also like to briefly note the algorithm we use to identify isomorphic graphs. We adopt a modified version of the graph hashing algorithm used in [43] to detect isomorphic graphs. This algorithm is used for quick look-ups into the NAS-Bench-101 dataset, to check the uniqueness of architectures when generating neighbours, to identify architectures in our local search and random search algorithms, etc. Concretely, we improved the performance of the algorithm by removing many of the string \(\leftrightarrow\) bytes conversions. We also replaced the MD5 hashing algorithm in the original with the blake2s hashing algorithm with a 32-byte digest, due to its increased speed. We verified that the new hashing algorithm does not cause any collisions by computing the hash of every architecture in the NAS-Bench-101 benchmark. Finally, we also verified that our algorithm produces the same outcomes on the unit tests that were part of the original NAS-Bench-101 codebase. We also note that the new 32-byte digest is twice as long as the 16-byte digest of the original MD5 algorithm, which should in theory reduce the chance of hash collisions.
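To make the hashing procedure concrete, the following is a minimal sketch of an isomorphism-invariant hash for labelled DAGs in the spirit of the NAS-Bench-101 algorithm, with MD5 swapped for blake2s as described above. The serialization format, round count and function names are illustrative assumptions, not our exact implementation.

```python
import hashlib
import numpy as np

def _h(data: bytes) -> str:
    # blake2s with a 32-byte digest replaces the original 16-byte MD5 digest.
    return hashlib.blake2s(data, digest_size=32).hexdigest()

def graph_hash(adjacency, labels, rounds=None):
    """Isomorphism-invariant hash of a labelled DAG (illustrative sketch).

    Every vertex starts from a hash of (in-degree, out-degree, label) and is then
    repeatedly re-hashed together with the sorted hashes of its in- and out-neighbours,
    following the iterative neighbourhood-hashing idea used by NAS-Bench-101.
    """
    adjacency = np.asarray(adjacency)
    n = len(labels)
    rounds = n if rounds is None else rounds
    in_deg = adjacency.sum(axis=0)
    out_deg = adjacency.sum(axis=1)
    hashes = [_h(f"{in_deg[v]}|{out_deg[v]}|{labels[v]}".encode()) for v in range(n)]
    for _ in range(rounds):
        new_hashes = []
        for v in range(n):
            in_h = sorted(hashes[u] for u in range(n) if adjacency[u, v])
            out_h = sorted(hashes[u] for u in range(n) if adjacency[v, u])
            new_hashes.append(_h("|".join(in_h + out_h + [hashes[v]]).encode()))
        hashes = new_hashes
    # The final fingerprint hashes the sorted per-vertex hashes, so vertex order is irrelevant.
    return _h("|".join(sorted(hashes)).encode())
```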

3.1 Search space

Since Neural Architecture Search is a search problem, we will start by defining the search spaces considered in this paper. Specifically, we consider two search spaces: the first is that from NAS-Bench-101 [43], and the second is that from NAS-Bench-301 [31].

The NAS-Bench-101 search space is a cell-based, operations-on-nodes search space. It consists of 423,624 unique directed acyclic graphs, with at most 7 vertices and 9 edges, which represent the computational Directed Acyclic Graph (DAG) of one cell in a neural network. Each vertex can carry one of five labels: “input”, “output”, “conv-1x1”, “conv-3x3” and “max-pool-3x3”. Each graph has 1 vertex labelled as “input” with index 0, and the vertex with the highest index is assigned the “output” label. All vertices in between can be assigned “conv-1x1”, “conv-3x3” or “max-pool-3x3”. Every graph is required to have at least 1 path going from the vertex labelled “input” to the vertex labelled “output”, and all vertices must have an in- and out-degree of at least 1 (With exceptions for vertices labelled “input” or “output”).
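As an illustration, the sketch below checks the constraints listed above for a candidate cell, assuming the cell is given as a NumPy adjacency matrix and a list of vertex labels. The helper name is_valid_cell and the exact encoding are ours, not part of the NAS-Bench-101 API.

```python
import numpy as np

OPS = ("conv-1x1", "conv-3x3", "max-pool-3x3")

def is_valid_cell(adjacency, labels, max_vertices=7, max_edges=9):
    """Check the NAS-Bench-101 cell constraints described above (illustrative sketch)."""
    adjacency = np.asarray(adjacency)
    n = len(labels)
    if n > max_vertices or adjacency.sum() > max_edges:
        return False
    if labels[0] != "input" or labels[-1] != "output":
        return False
    if any(op not in OPS for op in labels[1:-1]):
        return False
    # Every intermediate vertex needs an in-degree and an out-degree of at least 1 ...
    for v in range(1, n - 1):
        if adjacency[:, v].sum() < 1 or adjacency[v, :].sum() < 1:
            return False
    # ... and the output must be reachable from the input.
    reachable, frontier = {0}, [0]
    while frontier:
        u = frontier.pop()
        for v in np.flatnonzero(adjacency[u]):
            if v not in reachable:
                reachable.add(v)
                frontier.append(v)
    return (n - 1) in reachable
```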

Fig. 1
figure 1

An example of an “operations-on-edges” architecture (left) converted to an “operations-on-nodes” representation (right). All edges with associated operations are converted to nodes. Each of these nodes is given an in-edge from the source of the original edge, and an out-edge to the destination of the original edge. The nodes in the original architecture are replaced by reduction operations, a summation in this case. Finally, as was the case before, all reduction operations are given an edge to the output operation

The NAS-Bench-301 search space is a cell-based, operations-on-edges search space that contains roughly \(10^{18}\) architectures. Each cell has 7 vertices, with the first two being labelled as “input” and the last one being labelled as “output”. The cell also contains 4 intermediate vertices, each with 2 incoming edges. Finally, all 4 intermediate vertices also have 1 edge going to the output vertex, where the feature maps from each intermediate vertex are concatenated along the depth dimension. Each of the 8 incoming edges for the 4 intermediate vertices will have an operation assigned to it, picked from “avg_pool_3x3”, “dil_conv_3x3”, “dil_conv_5x5”, “max_pool_3x3”, “sep_conv_3x3”, “sep_conv_5x5” or “skip_connect”. Each of these edges can use as its source any node with an index lower than its destination node (to ensure the computational graph is acyclic). The two edges incoming to one intermediate vertex must also originate from different sources. In total, the described search space contains around \(10^{9}\) architectures. This is raised to \(10^{18}\) by searching for two architectures from this search space, a normal and a reduction cell. Both of these cells are then used to construct the final neural network. When encoding NAS-Bench-301 architectures for our agents, we convert the operations-on-edges architecture to an operations-on-nodes architecture. An example of this conversion is displayed in Fig. 1.

3.2 Search strategy

The second component of a NAS algorithm is the strategy used to traverse the search space. In this section, we will consider this search strategy by first giving a description of the problem our RL agent is tasked with solving, along with a brief description of the agent.

3.2.1 Incremental problem formulation

We will describe our problem formulation in more detail. For this formulation, we will use the Markov Decision Process (MDP) framework, which is commonly used to describe sequential decision-making problems, especially in the field of RL. An MDP is a 6-tuple \(\left( S, A, T, \gamma , \mu , R\right)\) where:

  • S is the state space

  • A is the action space

  • \(T: S \times A \times S \rightarrow \left[ 0, 1\right]\) is a probabilistic transition function

  • \(\gamma \in \left[ 0, 1\right)\) is a discount factor

  • \(\mu : S \rightarrow \left[ 0, 1\right]\) is a probability distribution over initial states

  • \(R: S \times A \times S \rightarrow \mathbb {R}\) is a reward function

Sequential decision-making problems in RL are usually played out in episodes. At the start of an episode, an initial state, \(s_{0}\), is selected from the overall state space S using the distribution \(\mu\). The agent then observes this state \(s_{0}\), and selects an action, \(a_{0}\), based on this observation. Once an action is selected, the next state, \(s_{1}\), is determined using the transition function T, and a reward is computed using R. This sequence of observing the current state, selecting an action, and determining the next state and reward is generally referred to as a time step. The agent plays out one or more time steps, until either a terminal state is reached, where the episode naturally terminates, or until the episode is truncated for training purposes.

In our case, the state space S is equivalent to the architecture search space; each architecture (or pair of architectures, in the case of NAS-Bench-301) represents one state s in the MDP.

In each state s, an agent has a set of actions it can take. An action a, in our case, corresponds to another architecture or tuple of architectures in the search space that the agent can move to. The agent is only presented with actions that are reachable by making one change to the current state (For example: Changing the operation of one node into another). The total number of actions that the agent is presented with is limited to a maximum of N. If there are fewer than N neighbours, the invalid actions will be masked off, and they will not be considered for transition by the MDP. The agent is also presented with one additional action that terminates the episode at the current state. Our transition function T is entirely deterministic: the next state is the state the agent indicates through its action. \(\gamma\) is a discounting factor that determines what behaviour is considered optimal. Generally, \(\gamma\) is somewhere in the \(\left[ 0, 1\right)\) range, with \(0.9, 0.95 \text { and } 0.99\) being common values for the parameter. Higher values of \(\gamma\) bias policies to more strongly consider future rewards over immediate rewards, while policies found under lower values of \(\gamma\) tend to favour immediate and short-term rewards over long-term rewards. \(\mu\) determines how the initial state of an episode is selected. In our case, we select an architecture at random from the search space, using the same sampling logic used for our random search algorithm in Sect. 3.2.3. Finally, R is the reward function. In our case, the reward is defined as the difference between the (validation) accuracy of the current architecture and that of the previous architecture. For the first time step of an episode, since there is no previous architecture, the reward is 0. This difference is computed after applying reward shaping, described in Sect. 3.2.2.
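To illustrate this formulation, the following is a minimal sketch of a single episode of the resulting MDP. The helpers sample_initial_architecture, generate_neighbours, accuracy and shape, as well as the agent interface, are hypothetical placeholders for the components described in this section and in Sects. 3.2.2 and 3.2.3.

```python
N = 50            # maximum number of neighbour actions shown to the agent
TERMINATE = N     # one additional action that ends the episode at the current state

def run_episode(agent, max_steps=16):
    """One episode of the incremental NAS MDP (sketch; all helpers are hypothetical)."""
    state = sample_initial_architecture()                # initial state drawn according to mu
    prev_shaped = shape(accuracy(state))                 # shaped accuracy of the current state
    for _ in range(max_steps):
        neighbours = generate_neighbours(state)[:N]       # at most N neighbour actions
        mask = [True] * len(neighbours) + [False] * (N - len(neighbours))
        action = agent.act(state, neighbours, mask)       # index into neighbours, or TERMINATE
        if action == TERMINATE:
            break                                         # the agent chose to stop here
        state = neighbours[action]                        # the transition function T is deterministic
        curr_shaped = shape(accuracy(state))
        agent.observe(reward=curr_shaped - prev_shaped)   # difference of shaped accuracies
        prev_shaped = curr_shaped
    return state
```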

In essence, we have reframed the NAS problem as a graph search problem, where every node in a graph represents an architecture, and every edge represents a relation between architectures. Then, the task our agents must learn to perform, is to find the graph node with the highest accuracy.

Fig. 2
figure 2

Vertex Removal Process. (1) A graph consisting of 5 vertices, where we want to remove the vertex with index 2. (2) After removing vertex 2, we generate all possible edges connecting the source of in-edges to vertex 2 to the destination of out-edges to vertex 2. (3a) An invalid selection of generated edges, leaving vertex 1 with an out-degree of 0. (3b) A valid selection of edges, leaving none of the other vertices disconnected from the graph

Next, we will describe the process of neighbour generation. Each time step, an agent is presented with the current architecture and N of its neighbours, and has to choose one of these architectures as the next state of the MDP. If the agent chooses the current architecture, the episode is terminated. These neighbours are generated by making small alterations to the current architecture. Below is a list of the different alterations we use to generate neighbours; a short code sketch of one such alteration follows the list.

  • Remove a vertex When removing a vertex, edges are selected such that the sources of in-edges to the removed vertex are connected to the destinations of out-edges to the removed vertex. Edges are selected such that all vertices involved have an in- and out-degree of at least 1, to ensure no invalid graphs are generated; this process is illustrated in Fig. 2. Every valid selection of generated edges is considered to be a distinct neighbour (after accounting for isomorphism).

  • Add a vertex The vertex will be connected to one of the preceding and one of the succeeding vertices with an in- and an out-edge respectively. When generating neighbours, all possible ways of connecting the new vertex are generated.

  • Change a vertex label We change the label of a vertex to a different label. This operation can never change a vertex into an input or an output, or change an input or output into a different operation.

  • Remove an edge Only edges that, when removed, do not break the graph’s connectivity are considered for removal.

  • Add an edge Edges can only be added between existing vertices, they cannot generate cycles, and the total number of edges can never exceed the maximum number of edges of the search space.
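As an example of one of the alterations above, the sketch below generates all neighbours obtained by changing a single vertex label, reusing the graph_hash sketch from Sect. 3.1 to discard isomorphic duplicates. The function itself is illustrative, not our exact implementation.

```python
OPS = ("conv-1x1", "conv-3x3", "max-pool-3x3")

def change_label_neighbours(adjacency, labels):
    """All neighbours obtained by relabelling one intermediate vertex (illustrative sketch)."""
    neighbours, seen = [], set()
    for v, current_op in enumerate(labels):
        if current_op in ("input", "output"):
            continue                                     # inputs and outputs are never relabelled
        for op in OPS:
            if op == current_op:
                continue
            new_labels = list(labels)
            new_labels[v] = op
            key = graph_hash(adjacency, new_labels)      # de-duplicate isomorphic graphs (Sect. 3.1 sketch)
            if key not in seen:
                seen.add(key)
                neighbours.append((adjacency, new_labels))
    return neighbours
```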

In [37], White et al. also consider the NAS problem through a similar lens. There are, however, some differences between their framing and ours. In their work [37], they assume that neighbourhoods are a symmetric relation. That is, if A is in B’s neighbourhood, then B must also be in A’s neighbourhood. Under our formulation, this is not necessarily the case. An example of this is shown in Fig. 2. While the architecture in 3b is considered to be a neighbour of the architecture in 1 (Through removing vertex 2), the architecture in 1 is not a direct neighbour of the architecture in 3b, instead requiring the addition of a vertex, and the addition and removal of several edges. This partly invalidates [37]’s definition of a branching factor, and thus makes it impossible to apply their theoretical results with regard to the number of local minima, etc. to our problem formulation. We also note that our overall search space graph does not have a regular degree in the way [37]’s does. A possible solution to resolve these discrepancies would be the inclusion of a “zeroize” operation in the NAS-Bench-101 search space, akin to the “zeroize” operation in NAS-Bench-301. This would allow NAS-Bench-101 to be formulated as a NAS problem with a fixed topology, where the removal of certain vertices and edges is achieved through the use of “zeroize” operations, similar to NAS-Bench-301.

Some settings, like NAS-Bench-301, require that more than one cell is designed at a time. In this case, all cells (two in the NAS-Bench-301 case) are shown to the agent at the same time. To generate neighbours in this case, we generate the neighbours of each cell individually, and then select up to N tuples, each consisting of 1 neighbour for each cell to form the final selection of neighbours. This leads to a significant increase in the total number of neighbours for benchmarks that search multiple cells at a time. (In the NAS-Bench-301 setting, we noticed that each cell has around 70 neighbours, leading to a total set of around 4900 neighbours, of which 50 are selected for presentation to the agent).

3.2.2 Reward shaping

In order for our agent to converge, we utilize reward shaping. In reinforcement learning, reward shaping changes the reward function that is used to train an agent, to facilitate the training process. In our case, we employ reward shaping due to the inherent distribution of rewards in the benchmark problems we consider. In NAS-Bench-101, the average validation accuracy across all random initializations for all architectures is 90.24%. If we plot a histogram of the validation accuracy of all architectures and all random initializations, we also see that the vast majority of architectures have accuracies in the range of \(\left[ 85\%, 94\%\right]\), as demonstrated in Fig. 3. This means that, even though the reward for our agent is technically normalized between 0 and 1, the majority of architectures will fall into this small range, resulting in very small reward values after the difference between two consecutive architectures is taken. Thus, we apply an exponential function, \(e^{\alpha \cdot a}\), to the validation accuracy \(a\) of each architecture, and use this exponentiated accuracy in the difference calculation between subsequent architectures.

Fig. 3
figure 3

Histogram of the validation accuracy of the architectures included in NAS-Bench-101, across all random initializations. The original accuracy distribution is shown in blue, while the distribution after reward shaping is shown in red. Note the logarithmic Y-axis

Fig. 4
figure 4

The reward shaping that was used in this paper. Experiments on NAS-Bench-101 used \(\alpha =6\), experiments on NAS-Bench-301 used \(\alpha =32\). Other values were used in ablation studies

By selecting an appropriate value for \(\alpha\), we can tune the reward function based on the problem definition, to always ensure a good distribution of rewards. Figure 3 shows the difference between the original accuracy distribution, and the distribution after reward shaping. We demonstrate the reward shaping functions used in this publication in Fig. 4.
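The shaping itself is straightforward; a minimal sketch is given below. The helper names are ours, and the exact scaling applied before the difference is taken may differ slightly in our implementation.

```python
import math

def shaped_accuracy(accuracy, alpha=6.0):
    """Exponential reward shaping: accuracies in [0, 1] are mapped through e^(alpha * a),
    so that small differences near the top of the range produce noticeably larger rewards.
    We use alpha = 6 for NAS-Bench-101 and alpha = 32 for NAS-Bench-301."""
    return math.exp(alpha * accuracy)

def step_reward(previous_accuracy, current_accuracy, alpha=6.0):
    """Reward for one time step: the difference of shaped accuracies of consecutive architectures."""
    return shaped_accuracy(current_accuracy, alpha) - shaped_accuracy(previous_accuracy, alpha)

# Example: an improvement from 90% to 91% validation accuracy.
# With alpha = 6 the shaped reward is e^(6 * 0.91) - e^(6 * 0.90) ≈ 13.7, instead of a raw 0.01.
```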

3.2.3 Agents

In order to get a good sense of the performance of our proposed agent, we have also implemented several baseline algorithms. In this section, we will elaborate on the details of each of the agents we evaluate.

Random search The most basic agent we include is a random search agent. This agent does not follow the incremental problem formulation we intend to use, but rather just selects an architecture at random from the entire search space, similar to the random search agent used in [17]. It serves an indication of how much of a change in difficulty or complexity the use of the incremental problem formulation creates. If the incremental problem is significantly more difficult or easy than just picking one architecture from the search space, we expect to see a performance difference between the random search and random walk agent. Our random sampling procedure is slightly different from that used in BANANAS [38]. We start by sampling the size of each graph, in this case from a uniform distribution. Then, we proceed to sample in a manner similar to BANANAS, where we sample random adjacency matrices and vertex labelings, and simply reject the invalid ones until we have a valid sample, with the added requirement that an adjacency matrix must have the desired number of vertices. Figure 5 shows a comparison between a uniform sample from the NAS-Bench-101 dataset, the BANANAS random sampler ours is based on, and our sampling algorithm, in terms of the number of vertices of the generated architectures. We note that our sampler produces a quasi-uniform distribution for both the number of vertices and the number of edges, which ensures that our RL agent has sufficient experience with both large architectures (many vertices) and smaller architectures (few vertices), thus ensuring consistent performance across the entire search space.

Fig. 5
figure 5

Histogram of the number of vertices and edges in a random sample of 10,000 architectures, compared between a uniform sample from the NAS-Bench-101 dataset, the “Random Cell Adj” sampler from the BANANAS repository, and our sampler. Note the logarithmic Y-axis

Random walk We also include a random agent that follows the incremental problem formulation we defined earlier in Sect. 3.2. This agent randomly selects the next neighbour according to a uniform distribution. It can also opt to terminate the episode instead of selecting a next neighbour.

Local Search Some publications [6, 37] note that local search is a strong NAS algorithm, especially in smaller search spaces like NAS-Bench-101. With this in mind, we also include local search as one of the algorithms we benchmark against. Our local search algorithm does not include any of the enhancements used in [37]. The local search agent is presented with N architectures; it compares the validation accuracy of all of them, and selects the one with the highest validation accuracy. If no architecture has a validation accuracy higher than that of the current architecture, the episode is terminated. We note that, different from [37], our neighbour relation is not symmetric, meaning that A being a neighbour of B does not guarantee that B will also be a neighbour of A, as we explained in Sect. 3.2.

RL-based agent Our RL agent, shown in Fig. 6, uses a transformer encoder as the core of its architecture, similar to Vision Transformer (ViT) [7]. The agent is presented with at most N different architectures. Before the architectures are presented to the agent, they must be prepared. We take the lower-triangular half of the adjacency matrix (To ensure acyclicity), and flatten it by concatenating all of the rows. This flattened adjacency matrix is then concatenated to the vertex labels, encoded in a one-hot fashion, resulting in one long binary vector. In order to be able to encode architectures with varying sizes, the original adjacency matrix and vertex labels are first padded to the size of the largest architecture. If multiple architectures need to be searched (As in the case of NAS-Bench-301), the architectures are concatenated after preparation. Before sending the architectures to be encoded, we prepend the current architecture to the sequence of up to N neighbours.
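The preparation step can be sketched as follows, assuming adjacency matrices are stored such that all edges lie in the strictly lower triangle; the operation-to-index mapping and the helper name are illustrative.

```python
import numpy as np

OP_INDEX = {"input": 0, "conv-1x1": 1, "conv-3x3": 2, "max-pool-3x3": 3, "output": 4}

def prepare_observation(adjacency, labels, max_vertices=7):
    """Flatten one architecture into the binary vector fed to the encoder (illustrative sketch).

    The adjacency matrix and labels are padded to the largest architecture size, the strictly
    lower-triangular half of the padded matrix is flattened row by row, and the one-hot
    encoded vertex labels are appended.
    """
    n = len(labels)
    padded = np.zeros((max_vertices, max_vertices), dtype=np.float32)
    padded[:n, :n] = np.asarray(adjacency)
    lower_rows = [padded[i, :i] for i in range(1, max_vertices)]   # strictly lower-triangular entries
    flat_adjacency = np.concatenate(lower_rows)
    one_hot = np.zeros((max_vertices, len(OP_INDEX)), dtype=np.float32)
    for i, op in enumerate(labels):
        one_hot[i, OP_INDEX[op]] = 1.0
    return np.concatenate([flat_adjacency, one_hot.ravel()])
```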

Fig. 6
figure 6

The architecture of our transformer agent. The top row shows the process of preparing the observations, while the bottom row represents the learnable RL agent

After preparation, an architecture encoding network transforms the prepared observation to a 256-dimensional latent space (\(T = 256\)), using several fully-connected layers with ReLU activations. These latent space vectors then have a positional encoding applied to them, to ensure that the permutation-invariant transformer architecture remains aware of the order of the latent space vectors. While the exact ordering of the architectures is unimportant, the agent must still be aware of their ordering, since each action the agent can perform corresponds to one of the architectures observed. Thus, if the agent loses all awareness of the ordering of all architectures, it will be unable to select the appropriate action.

The current architecture is prepended to all possible neighbours, similar to the “classification token” in [7]. Correspondingly, we use the first output of the transformer encoder as the input for the rest of the agent. At this point, we have a duelling head consisting of two branches [36]. One uses the transformer output to compute advantage values for each architecture that was presented to the transformer encoder, which is used to determine which action the agent will take. The second branch computes a state-value, and is used to stabilize the RL procedure.

The agent is trained using the Ape-X algorithm [14], a variant of Q-Learning designed for a high throughput of experience collection. We enhance Ape-X using Partial Episode Bootstrapping (PEB) [26] and 3-step bootstrapping to be able to train the agent in finite-length episodes, while obtaining a behaviour suitable for episodes of infinite length. Our Q-learning network makes use of a target network [24], duelling head [36] and double Q-learning [11].
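As an illustration of the duelling head described above, the following PyTorch sketch combines the advantage and state-value branches in the usual duelling fashion. The layer sizes, the action masking and the treatment of the terminate action (position 0, the current architecture) are assumptions for illustration, not our exact implementation.

```python
import torch
import torch.nn as nn

class DuellingHead(nn.Module):
    """Duelling Q-head over the transformer outputs (illustrative sketch)."""

    def __init__(self, latent_dim=256):
        super().__init__()
        self.advantage = nn.Linear(latent_dim, 1)   # one advantage value per observed architecture
        self.value = nn.Linear(latent_dim, 1)       # state-value from the "current architecture" slot

    def forward(self, encoder_out, action_mask):
        # encoder_out: (batch, 1 + N, latent_dim); position 0 is the current architecture,
        # whose action terminates the episode. action_mask: (batch, 1 + N) booleans.
        value = self.value(encoder_out[:, 0])                                # (batch, 1)
        advantage = self.advantage(encoder_out).squeeze(-1)                  # (batch, 1 + N)
        masked_adv = advantage * action_mask                                 # zero out invalid actions
        mean_adv = masked_adv.sum(-1, keepdim=True) / action_mask.sum(-1, keepdim=True)
        q_values = value + advantage - mean_adv                              # Q = V + A - mean(A)
        return q_values.masked_fill(~action_mask, float("-inf"))            # invalid actions never selected
```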

3.3 Performance estimation

For this publication, we consider two benchmark problems: NAS-Bench-101 and NAS-Bench-301. We selected NAS-Bench-101, since it is the largest tabular benchmark we are currently aware of. Being a tabular benchmark, it eliminates some possible confounding factors that may arise when using other performance estimation strategies. We also included NAS-Bench-301, since it covers a very large and commonly used search space (\(\approx 10^{18}\) architectures). Contrary to NAS-Bench-101, NAS-Bench-301 uses a machine learning model fitted on a subset of the search space to predict the performance (classification accuracy, in this case) for the entire search space. All agents we trained use the same performance estimator, in order to eliminate the choice of performance estimator as a possible confounding factor.

4 Experiments

In order to evaluate the effectiveness of our RL agent, we will evaluate it on two NAS benchmarks: NAS-Bench-101 [43] and NAS-Bench-301 [31]. We start by evaluating on NAS-Bench-101. In their work, White et al. [37] note that because the search space for NAS-Bench-101 is relatively small, relatively simple algorithms like local search tend to perform fairly well. It is with this in mind that we selected NAS-Bench-301 as a second benchmark. Since it uses the DARTS [20] search space, which many other NAS algorithms also use, it should give us a clearer view of how our agent performs compared to algorithms like local search.

4.1 NAS-Bench-101

The first setting that we evaluate our agent on is the NAS-Bench-101 setting.

4.1.1 Experiment configuration

When tackling NAS-Bench-101, we train our agents for \(10 \times 10^{6}\) time steps of experience. As a reward signal, we use the mean of all three validation accuracies after 108 epochs of training (Corresponding to the reduced noise setting in [39]). We train 5 agents with seeds ranging from 0 to 4 (inclusive), using version 1.13 of the RLLib framework [19]. We set \(\gamma = 0.9\) in accordance with the results of our ablation study in Sect. 5.1, with a learning rate of \(4 \times 10^{-5}\). Learning rates must be kept sufficiently low (On the order of \(10^{-5}\)) in order for the agent to converge; we found that higher learning rates lead to unstable training and hamper agent performance. In their work, Zhang and Sutton [45] discovered that replay buffers that are either too small or too large can result in suboptimal agent performance. They also demonstrate that the optimal size of the replay buffer depends on the problem at hand. Our replay buffer has a capacity of \(25 \times 10^{3}\) entries. During prototyping, we experimented with a wide set of values for the size of the replay buffer, ranging up to \(6.11 \times 10^{5}\), but we found no significant benefits to using replay buffers larger than \(25 \times 10^{3}\) entries. Our replay buffer is a prioritized replay buffer [30] with \(\alpha =0.6\) and \(\beta =0.4\), the default settings for RLLib’s Ape-X implementation. Exploration is done using a per-worker epsilon-greedy strategy. We use the same parameters as Sect. 4.1 of the Ape-X paper [14]: Each individual worker uses an epsilon-greedy exploration strategy, with \(\epsilon _{i} = \epsilon ^{1 + \left( \alpha \cdot \frac{i}{N - 1}\right) }\). We set \(\epsilon = 0.4\) and \(\alpha = 7\), which are the RLLib default settings. We train our agent using episodes of 16 time steps, and evaluate it with episodes consisting of 32 time steps. We refer to Fig. 6 in [43], plotting the Random Walk Autocorrelation (RWA) in the NAS-Bench-101 search space. We note that after a 6–10 time step random walk through the search space, RWA flattens out close to zero, indicating that there should be no more locality effects. This implies that, in order for agents to make significant alterations, the maximum episode length should be greater than 6, supporting our choice of 16. As mentioned in Sect. 3.2.2, we use reward shaping with \(\alpha =6\). Our experimental setup allocates 8 threads for 8 workers to gather experience, 4 threads for 4 shards of the replay buffer, and 2 threads for the driver, as well as 1 GPU for the driver, but not for the workers. Different experiments were carried out using either 1 NVIDIA Tesla V100 GPU, or 1 NVIDIA Quadro RTX4000 GPU. Each worker also uses a vectorized version of our environments, which allows it to operate on 32 environments at a time. For evaluation, we randomly selected 5 sets of \(1 \times 10^{4}\) architectures from the search space; since these sets are selected independently, overlap is possible. These are used as the initial states for each episode of the evaluation process. Each of the 5 trained agents is evaluated on each of the 5 sets of \(1 \times 10^{4}\) initial states, resulting in a total of \(2.5 \times 10^{5}\) episodes of evaluation data per algorithm. For the evaluation of NAS-Bench-101, we use the image classification accuracy on the test set after 108 epochs of training.
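For reference, the per-worker exploration rates follow directly from the formula above; a small sketch with the quoted defaults is shown below (the function name is ours).

```python
def worker_epsilon(i, num_workers=8, epsilon=0.4, alpha=7.0):
    """Per-worker exploration rate from the Ape-X paper: eps_i = eps ** (1 + alpha * i / (N - 1))."""
    return epsilon ** (1 + alpha * i / (num_workers - 1))

# With the RLLib defaults used here, worker 0 explores with eps = 0.4,
# while worker 7 explores with eps = 0.4 ** 8, roughly 6.6e-4.
```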

4.1.2 Results

In the NAS-Bench-101 setting, the mean training time was 92.57 h (\(\sigma =20.89\) h), with a minimum of 78.27 h, and a maximum of 133.97 h. During evaluation, using only Central Processing Units (CPUs) (no GPUs), completing an episode took an average of 0.50s (\(\sigma =0.49\) s), with a minimum of 0.00s and a maximum of 10.14s. Table 1 provides a comparison between our method and DiNAS [1]. To the best of our knowledge, DiNAS [1] is the only complete NAS method published in 2024 at the time of writing and tested on NAS-Bench-101 that reports its set-up and query time in terms of wall-time.

Table 1 A comparison of our algorithm to several baselines in terms of query time and set-up time in the NAS-Bench-101 setting

Consistency We start by evaluating the ability of our agents to improve from a given starting accuracy. We do this by plotting the improvement in accuracy over an entire episode against the accuracy of the initial state. An ideal agent would generate a diagonal line, as close as possible to the top-right corner. How close this diagonal can be to the top-right corner is determined by the global optimum within the search space. An agent that consistently manages to end up at the same architecture (Regardless of its performance) will have all datapoints on a perfect diagonal line that intersects the X-axis at the accuracy of this final architecture. Thus, the spread of an agent’s datapoints from a perfect diagonal line can be used to gauge how consistently an agent manages to find the same architecture. A deviation from a diagonal line indicates an agent that doesn’t always perform consistently, and can optimize some architectures better than others. These results are shown in Fig. 7. From this, we can see that every tested policy displays some degree of inconsistency, as indicated by none of the policies forming a perfect diagonal line. As expected, this is greatest for random policies, while local search is the most consistent. Our reinforcement learning agent is fairly consistent in improving the accuracy of the architectures it is given, but sometimes fails to improve an architecture, or even makes it worse. This can be an issue when using the agent as part of a hyperparameter tuning pipeline, but the agent improves consistently enough that this should be a relatively rare occurrence, as evidenced by some of the other data in this section.

Fig. 7
figure 7

The initial accuracy of the agent vs the improvement in accuracy. The diagonal dashed lines represent a final accuracy (at the end of an episode) of 85%, 90% and 95%, respectively. The diagonal line delimiting the red region in the top-right corner represents an accuracy of 100%

Fig. 8
figure 8

A histogram showing the distribution of the improvement in accuracy over an entire episode. We limited the range of improvements shown between − 12.5 and + 12.5%, which leads to a small amount of outlying samples (Roughly 2000–3000 samples, or 0.8% - 1.2% of all samples) being excluded from the histogram (Not included in the extreme bins.) The solid vertical line indicates the median improvement. Dashed vertical lines indicate the edges of a symmetrical range around the median containing a certain percentage of the total samples

We can also look at this data from a different perspective, by evaluating how often each agent is able to improve a given architecture, and how often an agent returns a worse architecture. We perform this evaluation by creating a histogram showing the distribution of the difference in accuracy between the start of an episode and the end of that episode, where positive values indicate an improvement from the initial to the final architecture. Since the agent decides when to terminate the episode, it should ideally always terminate an episode when the accuracy is higher than that of the initial state. We show these results in Fig. 8.

Fig. 9
figure 9

The local optimum found by local search in the NAS-Bench-101 setting in slightly more than 20% of cases

We also show the median value for each histogram using a solid, vertical black line. Besides this, we also show symmetrical intervals around the median. For instance, the interval marked 50% contains the values between the 25-th and 75-th percentile, i.e. 50% of values concentrated around the median. From this histogram, we can immediately see that, in terms of consistent improvements, local search is undisputedly the best algorithm (Median improvement: \(+3.77\%\)), followed by Ape-X (\(+1.15\%\)), Random Walks (\(+0.26\%\)) and Random Search (\(+0.00\%\)). This is in line with our expectations, since the local search algorithm is formulated in such a way that it should always improve an architecture. We do note that this isn’t always the case, since local search optimizes validation accuracy, but is evaluated on test accuracy. We also note that there are several high peaks in the local search histogram, with one bin even containing around 20% of the episodes. We found that in this bin, the final architecture is the same in the vast majority of episodes, implying that this architecture is a local optimum. We show this architecture in Fig. 9. Quantitatively, we can also see this reflected in the skew value for each distribution. Negative skew values imply that the tail of the distribution is on the left side, while positive skew values indicate the tail is on the right side. Or in other words, negative skew implies that most probability mass is on the right side of the distribution, while positive skew implies the majority of probability mass is on the left side of the distribution. Local search exhibits the lowest skew, with \(-2.091\), followed by Ape-X with \(-1.821\), Random Walks at 0.091 and finally random search with a skew of 0.219. It also shows that our RL agent most commonly improves an architecture, but on rare occasions fails to do so. Finally, one more artefact from this data is the fact that random walks actually outperformed random search, with a median that skews slightly towards the positive side. This becomes even more obvious when we consider the skew by the intervals noted in the histogram. While for random search, most intervals are roughly symmetrical around the median, for random walks, the intervals skew heavily towards the side of positive improvements, with the 95% interval ranging from -5% to +10%. Initially, we hypothesized that there may be a correlation between an architecture’s test accuracy, and the number of neighbours it has. This data is shown in a scatter plot in Fig. 11, but unfortunately reveals only a very weak correlation, with Pearson’s r at − 0.014, and Kendall’s \(\tau\) at − 0.157. We were unable to find another explanation for this performance difference.

Efficiency Finally, we also consider the question of how many queries are required for an agent to be able to find well-performing architectures. We consider a setting where an algorithm can make up to 300 queries, and assess the agent by looking at the highest accuracy architecture that has been found after a given number of queries. We note that, for all algorithms except local search, 1 episode is considered to be 1 query. For local search, we count every neighbour at every time step as a query, since the local search agent needs to know the accuracy of each neighbour, before deciding on which action to take. The results of this are shown in Fig. 10. We also compare against a number of state-of-the-art publications: NASWOT [23], GraphPNAS [17], Random Search (from GraphPNAS), NAO (Reported by SemiNAS) [22], SemiNAS [22] and DiNAS [1]. For both our own experiments, and the state-of-the-art algorithms, we computed a 95% confidence interval, to determine if there is a statistically significant difference in performance between algorithms. The 95% confidence interval for our own experiments was computed using bootstrapping with \(5 \times 10^{3}\) bootstrap samples, while the confidence intervals for other baselines (Black lines in Fig. 10) were computed using a closed-form expression that assumes a normal distribution, since we don’t always have access to the full dataset, but most publications do report a mean and standard deviation. In Fig. 10, we see that our RL agent performs fairly well for low query budgets, but as soon as the total number of queries exceeds 75–100, it is outperformed by a local search agent. Local search is known to perform well in the NAS-Bench-101 setting, due to the relatively limited size of the search space [37]. We computed the diameter of the search space graph to be 12, well within the maximum episode length of 16, which likely also contributes to local search’s strong performance. Interestingly, at lower query budgets (\(<50\) queries), random walks and random search actually outperform local search. We attribute this to the fact that we allow agents to observe at most 50 neighbours. Thus, if local search is presented with an architecture that has 50 or more neighbours, it first must make 50 queries (one for each neighbour) before it is able to realize an improvement. This also explains the jump in performance that occurs at the 50 query mark for local search. Interestingly, we note that, when considering the number of queries, random walks actually outperform random search quite significantly, and even approach the performance of our RL agent as the number of queries increases. This may be another indication that not all forms of random search are created equal, and the way a NAS problem is formulated can have an impact on how well different random algorithms are able to perform. Our first hypothesis for this discrepancy was that larger architectures (More vertices and edges) tend to have both more neighbours and more trainable parameters, and thus a higher accuracy. However, when plotting the number of neighbours an architecture has versus its test accuracy, it becomes clear that the correlation is rather weak, implying there is likely at least one more factor contributing to random walks’ strong performance. Unfortunately, we were unable to find any other parameters that may contribute to random walks’ strong performance (Fig. 11).
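For our own runs, the bootstrapped confidence intervals can be computed as sketched below; the number of bootstrap samples matches the \(5 \times 10^{3}\) quoted above, while the function name and the percentile method are illustrative.

```python
import numpy as np

def bootstrap_ci(best_accuracies, n_bootstrap=5000, confidence=0.95, seed=0):
    """95% bootstrap confidence interval for the mean best-accuracy-so-far at one query budget."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(best_accuracies, dtype=float)
    means = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        resample = rng.choice(samples, size=samples.size, replace=True)   # resample with replacement
        means[b] = resample.mean()
    lower = np.percentile(means, 100 * (1 - confidence) / 2)
    upper = np.percentile(means, 100 * (1 + confidence) / 2)
    return samples.mean(), (lower, upper)
```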

Fig. 10
figure 10

The accuracy of the best architecture found so far plotted against the number of queries in the NAS-Bench-101 setting. We evaluated each algorithm for up to 300 queries. For baselines indicated by black lines, we used the numbers reported in their publications. The shaded area (for our experiments) and the error bars (for other baselines) represents a 95% confidence interval around the mean

Fig. 11
figure 11

The test accuracy of a NAS-Bench-101 architecture vs. the number of neighbours of the architecture

Table 2 shows the numerical results for the best NAS-Bench-101 test accuracy found by our RL agent after a specified number of queries. We selected 50 as the best result for a low query budget, 100 as the cross-over point where local search starts outperforming our algorithm, 150 as the halfway point to 300 queries, and 300 since it is the value most commonly used by state-of-the-art methods.

Table 2 Numerical test accuracy values on NAS-Bench-101 (mean ± SD) and 95% confidence intervals at various query budgets

Policy analysis To complement our numerical results, we also attempt to analyse the behaviours that our RL agents learned. We will perform this analysis based on some of the features that were recently introduced in GRAF [15]. More precisely, we consider all features introduced by GRAF that were found to be important based on the Shapley values from training a random forest to do performance prediction on a limited number of samples from the NAS-Bench-101 benchmark.

In the NAS-Bench-101 case, these features are (Table 10, Left Column in [15]):

  1. The minimum path length over convolutions (kernel size 1x1 or 3x3)

  2. The average in-degree for nodes labelled with 3x3 convolutions

  3. The average out-degree for nodes labelled with 3x3 convolutions

  4. The number of 3x3 convolutions

  5. The in-degree of the output node, considering only edges coming from nodes labelled max-pool-3x3

We perform our analysis by plotting each of these features as a function of progress through an episode in Fig. 12. Episodes are normalized and interpolated to 32 time steps each, such that episodes of different lengths can be compared. Interpolation is done using SciPy’s scipy.interpolate.interp1d function with kind set to “previous”. We compute percentiles at 10% intervals, starting at 10%, and plot them as shaded areas between matching percentiles (Outermost area is 10-th to 90-th, next area is 20-th to 80-th, etc.) We also show the median as a solid blue line.
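A minimal sketch of this normalization step is shown below; the handling of single-step episodes and the helper name are our own assumptions.

```python
import numpy as np
from scipy.interpolate import interp1d

def normalise_episode(feature_values, num_points=32):
    """Resample one per-time-step feature trace onto a fixed grid of normalized episode progress,
    using previous-value interpolation so the feature only changes at observed time steps."""
    feature_values = np.asarray(feature_values, dtype=float)
    if feature_values.size == 1:
        return np.full(num_points, feature_values[0])
    steps = np.linspace(0.0, 1.0, num=feature_values.size)
    grid = np.linspace(0.0, 1.0, num=num_points)
    return interp1d(steps, feature_values, kind="previous")(grid)

# Percentile bands across episodes (10-th/90-th, 20-th/80-th, ..., plus the median) can then be
# computed with np.percentile over the stacked, normalized traces.
```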

Fig. 12
figure 12

Features from GRAF plotted as a function of normalized progress through an episode in the NAS-Bench-101 Setting. Shaded areas represent percentiles at 10% intervals, starting at 90% on the high side, and 10% on the low side, going to 60% and 40% respectively. The median (50-th percentile) is indicated with a solid blue line

From Fig. 12, we can observe a number of things. First of all, both the average in- and out-degree of 3x3 convolution nodes tend to increase as the episode progresses. While the in- and out-degree of these nodes increases, the minimum path length over convolution nodes (regardless of kernel size) tends to decrease. This may imply that as the episode progresses, the agent might be adding shortcut connections as episodes near their end. We can also see an upward trend in the number of 3x3 convolutions, with only the bottom 10th percentile of designed architectures having no 3x3 convolutions at the end of the episode. Finally, we can also see that the in-degree of the output node, counting only edges incident from max-pool-3x3 nodes decreases as the episode progresses. This can either mean that max-pool-3x3 nodes are moved closer to the input, and lose their connections to the output, or that max-pool-3x3 nodes get removed. Computing the number of max-pool-3x3 operations confirms our suspicion that these nodes are being removed, with only the 80-th percentile of architectures in terms of the number of max-pool-3x3 operations still having at least 1 max-pool-3x3 operation at the end of the episode.

Designed architectures Finally, we also showcase the best architecture found by our agent at the beginning (Fig. 13) and at the end of an episode (Fig. 14). The architecture at the beginning of the episode (Fig. 13) actually occurred as the first architecture in an episode 235 times out of 250,000 episodes (\(0.094\%\) of all samples). Despite that, the best final architecture (Fig. 14) only occurred 6 times (\(0.0024\%\) of all samples) as the final architecture in an episode. Since we evaluate our RL policy in a deterministic manner, this could only occur as a consequence of stochasticity in our environment. Since our environment limits the number of neighbours presented to the agent to N (\(N=50\) in our experiments), it is possible the architecture depicted in Fig. 14, or the path to it, was simply never offered to the agent as an option. This variability is in line with the data observed in Fig. 7. The architecture depicted in Fig. 13 has a mean test accuracy across all seeds of \(88.85\%\) according to the NAS-Bench-101 benchmark, while the architecture depicted in Fig. 14 has a mean test accuracy across all seeds of \(94.23\%\); thus, in this particular episode, our agent realized an accuracy improvement of \(5.38\%\).

The architectures in Figs. 13 and 14 also corroborate our policy analysis in Sect. 4.1.2. We can see that the architecture has lost all (2) max-pool-3x3 nodes. The number of 3x3 convolutions has also increased from 0 to 4. While most of the convolution nodes only have an out-degree of 1, some have an in-degree of 2. While there is no path going over 3x3 convolutions in the original architecture, there are both very short paths (one residual connection, as well as a path covering only node 4) and a long path (covering nodes 1, 2 and 4) in the designed architecture.

Fig. 13
figure 13

The architecture at the start of the episode that yielded the best mean test accuracy across all three seeds in the NAS-Bench-101 setting

Fig. 14
figure 14

The architecture at the end of the episode that yielded the best mean test accuracy across all three seeds in the NAS-Bench-101 setting

4.2 NAS-Bench-301

As mentioned at the start of Sect. 4, we will also evaluate our agent on the NAS-Bench-301 setting [31], due to its increased complexity compared to NAS-Bench-101.

4.2.1 Configuration

For NAS-Bench-301, our hyperparameter setup is similar to that of NAS-Bench-101. Due to the increased complexity, we train our agents for longer on NAS-Bench-301, for up to \(15 \times 10^{6}\) time steps. We sampled a new set of initial states for evaluation (Since NAS-Bench-301 uses a different search space) with the same size as the NAS-Bench-101 set of initial states. We set the reward shaping exponent to 32, rather than 6 as was the case with NAS-Bench-101, due to NAS-Bench-301 having a different accuracy distribution. The NAS-Bench-301 agent also uses a 512-dimensional latent space, compared to 256 dimensions for NAS-Bench-101, due to the increased size of the observations, and the fact that it needs to encode two architectures (Normal and Reduction Cell) in one latent space. We use version 1.0 of NAS-Bench-301, with the included ensemble of XGB estimators, without using noisy predictions (i.e. taking the mean of all predictors in the ensemble), following the recommendation from [39] to reduce noise in architecture evaluation pipelines (Denoised setting). We made some minor alterations to the original NAS-Bench-301 code to allow for batched inference of the estimator to speed up the training procedure. All experiments were carried out using 1 NVIDIA Tesla V100 GPU. The vectorization of our environment was disabled to reduce the overall time taken for a single training iteration. Since the NAS-Bench-301 XGB ensemble is only trained to predict validation accuracy, we both train and evaluate our agents using the validation accuracy.

4.2.2 Results

We will be evaluating our agent using the same analysis and metrics that we did for NAS-Bench-101, and the evaluation procedure also remains unchanged.

Consistency The achieved accuracy at the end of an episode is plotted against the accuracy at the beginning of the episode in Fig. 15. One thing that is immediately noticeable in the improvement histograms for NAS-Bench-301 (Fig. 16) is the quasi-normal shape of the improvement distribution, for all algorithms considered. This differs considerably from NAS-Bench-101, where the improvement distributions, especially for random search, were much more uniformly distributed. To test this apparent normality, we conducted a normality test with a significance threshold of \(5 \times 10^{-3}\). For all algorithms, the null hypothesis that the data is normally distributed was rejected, with the test reporting p-values less than \(1 \times 10^{-6}\) for all algorithms. We used SciPy’s scipy.stats.normaltest [34], which combines skew and kurtosis to test the normality of a sample. Following these normality tests, we also conducted a test to see if the data might follow a t-distribution. We used SciPy’s scipy.stats.goodness_of_fit [34] with 1000 Monte-Carlo samples. Once again, we use a p-value of \(5 \times 10^{-3}\) as our threshold for rejecting the null hypothesis that our data follows a t-distribution. For all algorithms, we rejected the null hypothesis that the data follows a t-distribution, with the test returning a p-value of \(9.99 \times 10^{-4}\) for Ape-X, Local Search and Random Walks, and a p-value of \(1.998 \times 10^{-3}\) for Random Search.
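The two tests can be reproduced as sketched below (requires SciPy \(\ge\) 1.10 for goodness_of_fit); the choice of the Anderson–Darling statistic and the function name are assumptions on our part.

```python
import numpy as np
from scipy import stats

def test_improvement_distribution(improvements, threshold=5e-3, seed=0):
    """Normality and t-distribution tests on the per-episode improvements (illustrative sketch)."""
    improvements = np.asarray(improvements, dtype=float)

    # D'Agostino-Pearson test: combines skew and kurtosis into one normality statistic.
    normality = stats.normaltest(improvements)
    normal_rejected = normality.pvalue < threshold

    # Monte-Carlo goodness-of-fit test against a t-distribution (1000 Monte-Carlo samples).
    gof = stats.goodness_of_fit(stats.t, improvements, statistic="ad",
                                n_mc_samples=1000, random_state=seed)
    t_rejected = gof.pvalue < threshold
    return normal_rejected, t_rejected
```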

Fig. 15
figure 15

The initial accuracy of the agent vs the improvement in accuracy

Fig. 16
figure 16

A histogram showing the distribution of the improvement in accuracy over an entire episode

Efficiency The confidence intervals displayed in Fig. 17 were computed in the same way as in our NAS-Bench-101 experiments. Figure 17 shows us a slightly different picture from Fig. 10 in the NAS-Bench-101 setting. With a 300-query budget, our RL agent actually outperforms local search. Another interesting difference is the near-identical performance of random search and random walks. While in the NAS-Bench-101 setting, these two showed a significant performance difference, in the NAS-Bench-301 case, they both achieve nearly identical performance.

Fig. 17
figure 17

The accuracy of the best architecture found so far plotted against the number of queries. We evaluated each algorithm for up to 300 queries

Since some publications use larger query budgets for NAS-Bench-301 (Ranging from 150 to 10,000), we also evaluate our agent with a budget of 1000 queries in Fig. 18. When we extend the query budget up to 1000 queries, it becomes clear that, given a sufficiently large computational budget, local search remains a strong search algorithm, surpassing our RL algorithm around the 400 query mark. We also see that, just as in the 300-query case, performance for random search and random walks remains nearly identical, suggesting that the difference in performance in the NAS-Bench-101 case can likely be attributed to a difference between the benchmarks.

Fig. 18
figure 18

The accuracy of the best architecture found so far plotted against the number of queries. We evaluated each algorithm for up to 1000 queries

We also provide numerical data on the best mean validation accuracy obtained after a specific number of queries in Table 3 and Table 4. The shown query budgets were chosen to coincide with the baselines shown in Figs. 17 and 18.

Table 3 Numerical validation accuracy values on NAS-Bench-301 (mean ± SD) and 95% confidence intervals at various query budgets
Table 4 Numerical validation accuracy values on NAS-Bench-301 (mean ± SD) and 95% confidence intervals at various query budgets

Throughput Finally, we consider the time it takes to train and evaluate our RL agent. The mean training time for NAS-Bench-301 was 486.16 h (\(\sigma =40.12\) h), with a minimum of 446.75 h, and a maximum of 553.48 h. During evaluation, using only Central Processing Units (CPUs) (no GPUs), completing an episode took an average of 16.30s (\(\sigma =9.52\) s), with a minimum of 0.00s and a maximum of 69.83s. We compare this data to the state-of-the-art in Table 5. As far as we are aware, the only complete NAS method published in 2024 at the time of writing and tested on NAS-Bench-301 that reports its set-up and query time in terms of wall-time is DiNAS [1].

Table 5 A comparison of our algorithm to several baselines in terms of query time and set-up time in the NAS-Bench-301 setting
Fig. 19 A comparison of the throughput of our environment for NAS-Bench-101 and NAS-Bench-301, with various degrees of vectorization. Note the logarithmic axes

We note an important difference in the training and evaluation times between the NAS-Bench-101 and NAS-Bench-301 settings. The average training time for NAS-Bench-101 is around 4 days, while that for NAS-Bench-301 is around 21 days (\(5.25\times\)), despite NAS-Bench-301 only requiring \(1.5\times\) as much training experience; in other words, training in the NAS-Bench-301 setting is on average \(3.5\times\) slower per unit of experience. Looking at the evaluation times, we see an even starker contrast of 0.5 s/episode for NAS-Bench-101 versus 16.3 s/episode for NAS-Bench-301 (\(32\times\)). We attribute this difference primarily to the time it takes to evaluate each reward function. Since evaluating the reward function is a very common operation in RL (it occurs at least once per time step), a slow reward function will inevitably slow down the whole training process. Because NAS-Bench-301 requires the evaluation of 10 different gradient boosted trees, its reward function is significantly slower than that of NAS-Bench-101, which consists of a simple table look-up. We can also see this in Fig. 19, which compares the performance of our environment when using the NAS-Bench-101 objective versus the NAS-Bench-301 objective. To generate these numbers, we stripped our environment down to its bare essentials, so that we could focus solely on the effect that these objectives have on throughput. The stripped-down environment no longer uses the incremental formulation we have used so far, and no longer needs to generate neighbours: we simply feed it an architecture as the action, and it returns the reward. We can clearly see that, when using the NAS-Bench-101 objective, our environment is around two orders of magnitude faster than when using NAS-Bench-301, regardless of the degree of vectorization used.
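The sketch below illustrates the kind of micro-benchmark this stripped-down setting amounts to: a dictionary look-up standing in for the NAS-Bench-101 table and an artificially expensive function standing in for the NAS-Bench-301 surrogate ensemble, timed over several batch sizes. All names and workloads are placeholders; only the qualitative gap between a table look-up and a surrogate ensemble is the point.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
lookup_table = {i: rng.uniform(0.85, 0.95) for i in range(10_000)}  # stand-in accuracy table

def tabular_reward(batch):
    """Cheap reward: one dictionary look-up per architecture (NAS-Bench-101-like)."""
    return np.array([lookup_table[int(a) % 10_000] for a in batch])

def surrogate_reward(batch, n_models=10, work=20_000):
    """Expensive reward: emulates querying an ensemble of 10 surrogate models."""
    out = np.zeros(len(batch))
    for _ in range(n_models):
        out += np.sin(rng.uniform(size=(len(batch), work))).mean(axis=1)
    return out / n_models

for batch_size in (1, 8, 64):  # degree of vectorization
    batch = rng.integers(0, 10_000, size=batch_size)
    for name, fn in (("table look-up", tabular_reward), ("surrogate", surrogate_reward)):
        t0 = time.perf_counter()
        fn(batch)
        dt = time.perf_counter() - t0
        print(f"batch={batch_size:3d}  {name:13s}  {batch_size / dt:12.1f} rewards/s")
```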

Policy analysis Following our evaluation protocol for the NAS-Bench-101 setting, we also evaluate our policies in the NAS-Bench-301 setting based on GRAF [15]. We perform our analysis on the NAS-Bench-301 architectures after converting them from operations-on-edges to operations-on-nodes; we note that this means our computations differ slightly from those in [15]. The resulting features are shown in Fig. 20. We start by noting that all relevant features relate only to the normal cell; the following discussion therefore only considers the normal cell. The first interesting observation concerns the out-degree of input node 1 (\(H_{k-1}\) in the architecture diagrams): we see an increase in the number of outgoing skip connections from this node. This is also reflected in Fig. 23, where a skip connection links this input to sum node 4. Interestingly, in Fig. 23, both incoming connections to node 4 are skip connections, resulting in both input feature maps being summed together. Furthermore, we see a decrease in the maximum path length over either pooling operation in the normal cell, implying that the agent is generating shortcut connections with pooling operations. We note that the overall prevalence of pooling operations in the designed architectures decreases, with only the 80th percentile having at least one avg-pool-3x3 or max-pool-3x3 operation. The average intermediate node in-degree, counting only edges coming from separable convolutions, also increases as an episode progresses. We suspect the most likely cause of this is simply a general increase in the number of separable convolutions in each cell. Computing the operation counts for separable convolutions confirms this suspicion, with separable convolutions of both kernel sizes becoming more prevalent as episodes progress. We also see a decrease in the mean intermediate node out-degree for pooling operations and dilated convolutions. Once again, we suspect this is caused by a simple reduction in the number of dilated convolutions and pooling operations, which we confirm by computing operation count features. While dilated convolutions remain more prevalent than pooling operations, they nevertheless show a pronounced decrease in occurrences as episodes progress. This too is demonstrated by the best normal cell we designed, shown in Fig. 23, which contains only a single dilated convolution and no pooling operations, despite both being present at the start of the episode (Fig. 21).
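To make the feature analysis concrete, the following sketch shows how one such GRAF-style feature (the skip-connection out-degree of an input node) and the percentile bands of Fig. 20 could be computed. The cell encoding as an adjacency matrix plus per-node operation labels is an assumption for illustration, and GRAF [15] defines many more features than shown here.

```python
import numpy as np

def skip_out_degree(adjacency: np.ndarray, ops, node: int = 1) -> int:
    """Out-degree of `node`, counting only edges that lead into skip-connect nodes.

    adjacency[i, j] == 1 means there is an edge from node i to node j, and ops[j]
    is the operation carried by node j (operations-on-nodes form).
    """
    return sum(int(adjacency[node, j]) for j, op in enumerate(ops) if op == "skip_connect")

def percentile_bands(values_per_step, qs=(10, 40, 50, 60, 90)):
    """values_per_step[t]: feature values of all evaluated episodes at progress step t."""
    return {q: np.array([np.percentile(v, q) for v in values_per_step]) for q in qs}
```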

Fig. 20 Features from GRAF plotted as a function of normalized progress through an episode in the NAS-Bench-301 setting. Shaded areas represent percentiles at 10% intervals, starting at 90% on the high side and 10% on the low side, going to 60% and 40% respectively. The median (50th percentile) is indicated with a solid blue line

Designed architectures We also consider the best architecture generated during NAS-Bench-301 evaluation, similar to the analysis we performed in the NAS-Bench-101 setting. The normal and reduction architectures at the start of the episode that yielded the best performance are shown in Figs. 21 and 22, respectively. The architectures at the end of the episode are shown in Fig. 23 (normal) and Fig. 24 (reduction). The architecture pair at the start of the episode occurred 5 times as the initial architecture pair (0.002%), while the final architecture pair was unique (0.0004%). The architecture pair at the start of the episode had an accuracy of \(93.23\%\), while the pair at the end of the episode had an accuracy of \(95.02\%\) according to the NAS-Bench-301 benchmark, yielding a performance improvement of \(1.79\%\) in this particular episode.

Figure 23 showcases an interesting pattern, where both inputs are summed together using only skip connections, essentially creating a residual connection that spans an entire cell. One interesting thing to note in Fig. 24 is that the inputs to each summation node are homogeneous: either both inputs carry the “sep-conv-5x5” operation, or both carry the “max-pool-3x3” operation, but the two are never mixed within one summation node. Figures 23 and 24 also have very different shapes: the reduction cell maintains a wide but shallow shape where each sum node only uses the cell’s inputs, while the normal cell is much deeper, with the output of one sum node (4) feeding into other sum nodes (7, 10 and 13).

Fig. 21 The normal architecture at the start of the episode that yielded the best mean validation accuracy in the NAS-Bench-301 setting

Fig. 22 The reduction architecture at the start of the episode that yielded the best mean validation accuracy in the NAS-Bench-301 setting

Fig. 23 The normal architecture at the end of the episode that yielded the best mean validation accuracy in the NAS-Bench-301 setting

Fig. 24 The reduction architecture at the end of the episode that yielded the best mean validation accuracy in the NAS-Bench-301 setting

5 Ablation studies

We also perform several ablation studies to examine the effect of certain parameter choices on our overall system. We use the NAS-Bench-101 setting for these studies, since it is the quickest to train, allowing us to conduct more experiments. We compare different parameter choices by looking at the mean validation accuracy of the architectures produced throughout the training procedure.

5.1 Gamma

First, we consider the \(\gamma\) parameter, which determines in RL how heavily future rewards are discounted relative to current rewards. Lower values of \(\gamma\) discount future rewards more, and thus focus more on immediate rewards and less on future rewards. We consider three values for \(\gamma\): 0.9, 0.95 and 0.99. In Fig. 25 we show the mean accuracy of the generated architectures as a function of the number of time steps the agent has been trained on. The figure shows that, during the training procedure, only the agent trained with \(\gamma = 0.9\) is able to make a noticeable improvement. The agents with \(\gamma = 0.95\) and \(\gamma = 0.99\) both fail to significantly improve the accuracy of the generated architectures, and the agent with \(\gamma = 0.99\) even ends up with slightly worse performance. During training, we noted that, on average, every architecture has around 25 neighbours. This means that, N steps into the future, an agent can be in any of \(25^{N}\) different states. Since the number of states reachable within relatively few steps grows very quickly, accurately estimating the discounted future reward becomes very difficult, which hinders convergence when larger values of \(\gamma\) are used.
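A back-of-the-envelope calculation illustrates this argument: taking \(1/(1-\gamma)\) as a rough effective horizon and 25 as the branching factor, the number of states an agent may have to account for within that horizon grows extremely fast. The snippet below is purely illustrative and uses only the figures mentioned above.

```python
# Branching factor of ~25 neighbours per architecture, as observed during training.
for gamma in (0.9, 0.95, 0.99):
    horizon = round(1.0 / (1.0 - gamma))  # rough effective horizon of the discounting
    reachable = 25 ** horizon             # upper bound on states reachable within that horizon
    print(f"gamma={gamma:.2f}  horizon ~{horizon:3d} steps  <= 25^{horizon} = {reachable:.2e} states")
```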

Fig. 25 Mean validation accuracy over time for \(\gamma = 0.9, 0.95 \text { and } 0.99\) on NAS-Bench-101. A simple moving average filter with N = 10 was used to smooth out the results

5.2 Number of neighbours

Our environment limits the number of neighbours that our agent can choose from. The agent is presented with up to N neighbours, or fewer if the total number of neighbours is smaller than N. We consider three values for this parameter: \(25, 50 \text { and } 100\), and show the obtained mean validation accuracy as a function of the number of training steps in Fig. 26. From this, we can see that agents observing 25 or 50 neighbours converge to roughly the same performance, while observing up to 100 neighbours leads to slightly worse performance. We hypothesize that this is likely a function of model capacity: the same model was used in all three experiments, but the size of the observation, and the number of architectures that need to be considered, quadrupled between \(N=25\) and \(N=100\). These results also show that, in a search space such as NAS-Bench-101, which is fairly densely connected, raising the number of neighbours presented to the agent beyond a certain threshold does not improve performance.
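A minimal sketch of this capping mechanism is shown below, assuming each neighbour is encoded as a fixed-length feature vector. The encoding, the subsampling strategy and the mask are illustrative assumptions rather than the exact observation format of our environment.

```python
import numpy as np

def build_observation(neighbours: np.ndarray, n_max: int, rng: np.random.Generator):
    """neighbours: (n_actual, feat_dim) array of encoded neighbouring architectures."""
    if len(neighbours) > n_max:  # subsample when there are more than n_max neighbours
        idx = rng.choice(len(neighbours), size=n_max, replace=False)
        neighbours = neighbours[idx]
    obs = np.zeros((n_max, neighbours.shape[1]), dtype=np.float32)
    mask = np.zeros(n_max, dtype=bool)  # True where a real neighbour is present
    obs[: len(neighbours)] = neighbours
    mask[: len(neighbours)] = True
    return obs, mask

# Example: 37 neighbours of dimension 16, capped at N = 25.
rng = np.random.default_rng(0)
obs, mask = build_observation(rng.normal(size=(37, 16)), n_max=25, rng=rng)
```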

Fig. 26 Mean validation accuracy over time for \(N = 25, 50 \text { and } 100\) on NAS-Bench-101. A simple moving average filter with N = 10 was used to smooth out the results

5.3 Reward shaping

In the NAS-Bench-101 setting, the majority of architectures have an accuracy of around 90%, with only a small minority having accuracies far below 90%. Because of this, we introduced a reward shaping scheme in Sect. 3.2.2. In this section, we consider the effect that different values of the shaping parameter \(\alpha\) have on the performance of our agent. We consider four settings for this reward shaping mechanism: no reward shaping (no exponential function is applied), and three values for the reward shaping coefficient: \(2, 6 \text { and } 10\). The results of this experiment are shown in Fig. 27. In this figure, we see that the experiments without reward shaping, or with only slight reward shaping (\(\alpha = 2\)), achieve worse performance than the experiments with stronger reward shaping (\(\alpha = 6, 10\)). From this, we conclude that a sufficient level of reward shaping is necessary to achieve good performance. We also note that the experiment with \(\alpha = 10\) is slightly quicker to show a rise in accuracy; however, we do not believe this to be significant, since our ablations for \(\gamma\) used \(\alpha = 6\) and showed results similar to \(\alpha = 10\) when \(\gamma = 0.9\).
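For intuition only, the snippet below shows one common form of exponential reward shaping with a coefficient \(\alpha\), normalized to [0, 1]. This is not necessarily the exact function defined in Sect. 3.2.2, but it illustrates how larger \(\alpha\) values stretch the reward differences between architectures whose accuracies all cluster around 90%.

```python
import numpy as np

def shaped_reward(accuracy, alpha=None):
    """accuracy in [0, 1]; alpha=None disables shaping (identity reward)."""
    accuracy = np.asarray(accuracy, dtype=float)
    if alpha is None:
        return accuracy
    # Exponential stretching, normalized so that 0 maps to 0 and 1 maps to 1.
    return (np.exp(alpha * accuracy) - 1.0) / (np.exp(alpha) - 1.0)

accuracies = np.array([0.89, 0.90, 0.94, 0.95])
for alpha in (None, 2, 6, 10):
    print(alpha, np.round(shaped_reward(accuracies, alpha), 4))
```

With this illustrative shaping, the reward gap between a 90% and a 95% architecture grows from 0.05 without shaping to roughly 0.24 at \(\alpha = 10\), which is the qualitative effect the shaping scheme is meant to achieve.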

Fig. 27 Mean validation accuracy over time for no reward shaping, and \(\alpha = 2, 6 \text { and } 10\) on NAS-Bench-101. A simple moving average filter with N = 10 was used to smooth out the results

6 Conclusions

In this publication, we outline a new, incremental framing of the NAS problem, and introduce a method of tackling it using an RL-based NAS agent. We show that our RL agent is competitive with state-of-the-art NAS algorithms and with baselines known to have strong performance. Our NAS agent outperforms all other algorithms considered when query budgets are low, but starts to be overtaken by other algorithms as query budgets increase. We also perform an ablation study and conclude that our agent is highly sensitive to some hyperparameters, while being rather insensitive to others.

6.1 Training times

When comparing training times between the NAS-Bench-101 and NAS-Bench-301 settings, we note a significant difference between the two (\(\approx 5\times\)), or \(\approx 3.5\times\) after accounting for the additional training experience required for NAS-Bench-301. We attribute this difference primarily to the difference in the time it takes to evaluate each reward function, with NAS-Bench-301 being significantly slower due to the need to evaluate 10 gradient boosted trees. This presents a significant barrier to the adoption of our RL-based method, but it can be overcome by improving the sample-efficiency of the RL agent, or by using surrogate models that are quicker to evaluate than the ensemble of gradient boosted trees used in this paper.

6.2 Random search

In the NAS-Bench-101 setting, we observe a noticeable difference between the random search and random walk algorithms. When comparing both algorithms in the NAS-Bench-301 setting, we note that they achieve near-identical performance. This indicates that the difference in performance between the two algorithms is likely the result of a peculiarity of the NAS-Bench-101 benchmark. This calls into question the validity of comparing the results of algorithms across problem formulations, since, in some cases, formulating the problem in a different way can lead to significantly different performance, even for random algorithms.

6.3 Scalability

Despite an increase in the size of the search space from \(4.23 \times 10^{5}\) to \(10^{18}\), the number of samples required to train our RL agent only increased from \(1.0 \times 10^{7}\) to \(1.5 \times 10^{7}\), while retaining strong performance on both benchmarks. This shows that an RL-based approach can scale well as the search space grows larger.

6.4 Ablation studies

Our ablation studies show mixed results with regard to robustness to hyperparameter changes. Some hyperparameters (such as the number of neighbours the agent is presented with) have relatively little effect on the agent's ability to converge, while others (such as the degree of reward shaping and reward discounting) have an almost binary effect on convergence: sufficiently high or low values must be selected to ensure that the agent converges.

7 Future work

7.1 Re-usability evaluation

An important distinction between our transformer-based controller and earlier RL-based NAS controllers is its re-usability. The traditional RL-based NAS paradigm involves training an RL agent to output a single architecture. Once the training procedure is finished, the optimal architecture has been found, but the RL agent serves no purpose beyond this point; the time and compute cost of training the agent has gone entirely towards finding a single optimal architecture. Through the re-use our agent offers, this cost can be amortized over many searches, since the agent does not learn an optimal solution, but rather how to find one.

7.2 Domain generalization

In this work, we only considered the image classification domain. This unfortunately means that a new agent must be trained every time we wish to tackle a new problem domain. A truly re-usable NAS agent should be usable on a completely new domain with little to no adaptation work. There are several avenues that could enable this. One of these is the use of a training-free, domain-independent performance estimator, such as Neural Tangent Kernel-based metrics like Label-Gradient Alignment (LGA) [25], or the number of linearly separable regions [23]. Using such metrics as performance estimators would create an agent that can, in principle, operate in any search space, regardless of the target domain, assuming that the metrics used correlate strongly with actual downstream performance across domains.
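As a rough illustration of such an estimator, the sketch below counts the distinct ReLU activation patterns a randomly initialized network produces on a small batch, a simple proxy for the number of linearly separable regions [23]. The tiny fully connected layer and the random batch are placeholders; the cited works apply considerably more refined measures to full candidate architectures.

```python
import numpy as np

def activation_pattern_count(batch: np.ndarray, hidden: int = 64, seed: int = 0) -> int:
    """Count distinct ReLU on/off patterns of one random hidden layer over a batch."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(batch.shape[1], hidden))
    bias = rng.normal(size=hidden)
    pre_activations = batch @ weights + bias
    patterns = pre_activations > 0                    # one boolean pattern per sample
    return len({row.tobytes() for row in patterns})   # distinct patterns across the batch

# Placeholder input batch: 256 random samples of dimension 32.
batch = np.random.default_rng(1).normal(size=(256, 32))
print(activation_pattern_count(batch))
```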