1 Introduction

Determining an optimal architecture is key to accurate deep neural networks (DNNs) with good generalisation properties (Szegedy et al., 2017; Huang et al., 2017; He et al., 2016; Han et al., 2017; Conneau et al., 2017; Merity et al., 2018). Neural architecture search (NAS), which has been formulated as a graph search problem, can potentially reduce the need for application-specific expert designers, allowing for wide adoption of sophisticated networks across industries. Zoph and Le (2017) presented the first modern algorithm automating structure design, and showed that the resulting architectures can indeed outperform human-designed state-of-the-art convolutional networks (Ko, 2019; Liu et al., 2019). However, even in current settings where flexibility is limited by expert-designed search spaces, NAS problems are computationally very intensive, with early methods requiring hundreds or thousands of GPU-days to discover state-of-the-art architectures (Zoph and Le, 2017; Real et al., 2017; Liu et al., 2018a, 2018b).

Researchers have used a wealth of techniques ranging from reinforcement learning, where a controller network is trained to sample promising architectures (Zoph and Le, 2017; Zoph et al., 2018; Pham et al., 2018; Bender et al., 2020), to evolutionary algorithms that evolve a population of networks for optimal DNN design (Real et al., 2018; Liu et al., 2018b; Lopes and Alexandre, 2022), to optimization on random graphs (Ru et al., 2020). Alas, these approaches are inefficient and can be extremely computationally and/or memory intensive, as some require all tested architectures to be trained from scratch. Weight-sharing, introduced in ENAS (Pham et al., 2018), can alleviate this problem. Even so, these techniques cannot easily scale to large datasets, e.g., ImageNet, instead relying on human-defined heuristics for architecture transfer. Recently, low-fidelity estimates, performance predictors and guiding mechanisms have also been studied to reduce the search cost as well as the memory and computation required (Mellor et al., 2021; Lopes et al., 2021; White et al., 2021a; Ning et al., 2021; White et al., 2021b; Lopes et al., 2022). Moreover, gradient-based frameworks have enabled efficient solutions by introducing a continuous relaxation of the search space. For example, DARTS (Liu et al., 2019) uses this relaxation to optimise architecture parameters using gradient descent in a bi-level optimisation problem, while SNAS (Xie et al., 2019) updates architecture parameters and network weights under one generic loss. Still, due to memory constraints the search has to be performed on 8 cells, which are then stacked 20 times for the final architecture. This solution is a coarse approximation to the original problem, as shown in Sect. 6 of this work and in Yang et al. (2020), Yu et al. (2019), Li and Talwalkar (2019), and has drawn criticism in Zela et al. (2018), Yang et al. (2020), Wan et al. (2022). In fact, we show that searching directly over 20 cells leads to a reduction in test error (8% relative to Liu et al. (2019)). ProxylessNAS (Cai et al., 2019) is one exception, as it can search for the final models directly; nonetheless, it still requires twice the amount of memory used by our proposed algorithm, while offering no theoretical guarantees.

To enable the possibility of large-scale joint optimisation of deep architectures we contribute MANAS, the first multi-agent learning algorithm for neural architecture search. MANAS’ multi-agent framework is inspired by the multi-armed bandit setting (Auer et al., 1995): each agent is associated with the selection of one layer, and a global network is optimised based on the agents’ decisions. This class of algorithms has been widely explored to solve a range of problems (Bouneffouf et al., 2020), such as autonomous driving (Shalev-Shwartz et al., 2016), recommendation systems (Li et al., 2010), and active learning (Bouneffouf et al., 2014). Here we introduce this approach to neural architecture search.

Our algorithm combines the memory and computational efficiency of multi-agent systems, achieved through action coordination, with the theoretical rigour of online machine learning, allowing us to balance exploration and exploitation optimally. Due to its distributed nature, MANAS enables large-scale optimisation of deeper networks while learning different operations per cell. Theoretically, we demonstrate that MANAS implicitly coordinates learners to recover vanishing regrets, guaranteeing convergence. Empirically, we show that our method achieves state-of-the-art accuracy among methods using the same evaluation protocol, but with significant reductions in memory (1/8th of Liu et al. (2019)) and search time (70% of Liu et al. (2019)).

The multi-agent (MA) framework is inherently scalable and allows us to tackle an optimization problem that would otherwise be extremely challenging to solve efficiently: the search space of a single cell is \(8^{14}\), and there is no fast way of learning the joint distribution required by a single controller. Learning more cells exacerbates the problem; this is why the MA formulation is needed, as the size of each agent's search space remains constant.

In short, our contributions can be summarised as: (1) framing NAS as a multi-agent learning problem (MANAS) where each agent supervises a subset of the network; agents coordinate through a credit assignment technique which infers the quality of each operation in the network, without suffering from the combinatorial explosion of potential solutions. (2) Proposing two lightweight implementations of our framework that are theoretically grounded. The algorithms are computationally and memory efficient, and achieve state-of-the-art results on CIFAR-10 and ImageNet when compared with competing methods. Furthermore, MANAS allows search directly on large datasets (e.g. ImageNet). (3) Presenting 3 new datasets for NAS evaluation to minimise algorithmic overfitting; offering a fair comparison with the often ignored random search (Li and Talwalkar, 2019) and random sampling (Yang et al., 2020; Yu et al., 2019) baselines; and presenting a complexity constraint analysis of MANAS.

2 Related work

MANAS derives its search space from DARTS (Liu et al., 2019) and is therefore most related to other gradient-based NAS methods that use the same search space. SNAS (Xie et al., 2019) appears similar at a high level, but has important differences: 1) it uses gradient descent to learn the architecture parameters. This requires a differentiable objective (which MANAS does not) and leads to 2) having to forward all operations (see their Eqs. 5, 6), thus negating any memory advantages (which MANAS has), effectively requiring repeated cells and preventing search on ImageNet. Subsequent gradient-based proposals improve upon these baselines by introducing regularisation mechanisms that strengthen the final performance of the generated architectures (Zela et al., 2018; Chen and Hsieh, 2020; Chu et al., 2021), whilst still suffering from the aforementioned problems.

ENAS (Pham et al., 2018) is also very different: its use of RL implies dependence on past states (the previous operations in the cell). It explores not only the stochastic reward function but also the relationship between states, which is where most of the complexity lies. Furthermore, RL has to balance exploration and exploitation by relying on sub-optimal heuristics, while MANAS, owing to its theoretically optimal approach from online learning, is more sample efficient. Finally, ENAS uses a single LSTM (which adds complexity and problems such as exploding/vanishing gradients) to control the entire process, and thus follows a monolithic approach. Indeed, at a high level, our multi-agent framework can be seen as a way of decomposing the monolithic controller into a set of simpler, independent sub-policies. This provides a more scalable and memory-efficient approach that leads to higher accuracy, as confirmed by our experiments.

Many proposals try to speed up NAS search. Zero-cost proxy estimators do this by evaluating architectures at initialisation (Lopes et al., 2021; Mellor et al., 2021; Chen et al., 2021; Abdelfattah et al., 2021). When coupled with memory-efficient NAS methods, such as random search, architecture search can be parallelised and easily distributed, allowing several architectures to be scored simultaneously. GraphNAS++ leverages the idea of distributed evaluation by evaluating multiple architectures simultaneously on multiple GPUs to reduce the wall-clock time of a NAS search (Gao et al., 2022). Differently, AutoDistill partitions the search space and trains a supernetwork for each partition, thus alleviating optimisation interference and speeding up the training process (Xu et al., 2022). Similarly, LaMOO (Zhao et al., 2021) and LaNAS (Wang et al., 2021b) progressively partition the search space into regions containing architectures of similar performance. DC-NAS splits the search space based on the similarity of the sub-networks' feature representations by clustering them using k-means (Wang et al., 2020). More closely related to MANAS are Federated NAS approaches (Zhu et al., 2021; Liu et al., 2022), in the sense of distributing the work across multiple agents. Federated NAS employs federated learning to allow individual users to optimise architecture weights and search for new architectures that match their data, as different clients usually have non-i.i.d. data distributions (He et al., 2020; Yuan et al., 2020; Zhu and Jin, 2021; Garg et al., 2020; Hoang and Kingsford, 2021). However, neither distributing the search space by conducting parallel search on multiple GPUs nor independently distributing the search over different users mitigates, by itself, the problem of large-scale optimisation of deep networks. Since the required memory is still considerable, these approaches force NAS methods to search on a smaller number of cells, requiring heuristics to expand the searched cells into architectures. To mitigate these problems, MANAS leverages multiple agents, where each agent is associated with the selection of one layer and a global network is optimised based on the agents' decisions (detailed in Sect. 5). The multi-agent framework is inspired by the multi-armed bandit setting (Auer et al., 1995). Multi-armed bandit algorithms are a class of reinforcement learning algorithms used to balance exploration and exploitation in a decision-making process. This class of algorithms has been widely explored to solve a panoply of problems (Bouneffouf et al., 2020), such as designing recommendation systems (Li et al., 2010). By employing a multi-armed bandit formulation, MANAS allows for a distributed search of the space through multiple agents with a global optimisation goal, and achieves both memory and time efficiency as a result. When compared with other NAS optimization methods that use the same search space, MANAS achieves better performance and requires less memory.

3 Preliminary: neural architecture search

We consider the NAS problem as formalised in DARTS (Liu et al., 2019). At a high level, the architecture is composed of a computation cell that is a building block to be learned and stacked in the network. The cell is represented by a directed acyclic graph with V nodes and N edges; edges connect nodes i and j, directed from i to j, for all \(i<j\). Each vertex \(\varvec{x}^{(i)}\) is a latent representation for \(i\in \{1,\ldots ,V\}\). Each directed edge (i, j) (with \(i<j\)) is associated with an operation \(o^{(i,j)}\) that transforms \(\varvec{x}^{(i)}\). Intermediate node values are computed from all of a node's predecessors as \(\varvec{x}^{(j)} = \sum _{i<j} o^{(i,j)}(\varvec{x}^{(i)})\). For each edge, an architect needs to intelligently select one operation \(o^{(i,j)}\) from a finite set of K operations, \({\mathcal {O}} = \{o_{k}(\cdot )\}^K_{k=1}\), where each operation represents some function to be applied to \(\varvec{x}^{(i)}\) to compute \(\varvec{x}^{(j)}\), e.g., convolutions or pooling layers. To each \(o_{k}^{(i,j)}(\cdot )\) is associated a set of operational weights \(w_{k}^{(i,j)}\) that needs to be learned (e.g. the weights of a convolution filter). Additionally, a parameter \(\alpha _{k}^{(i,j)}\in {\mathbb {R}}\) characterises the importance of operation k within the pool \({\mathcal {O}}\) for edge (i, j). The sets of all the operational weights \(\{w_{k}^{(i,j)}\}\) and architecture parameters (edge weights) \(\{\alpha _{k}^{(i,j)}\}\) are denoted by \(\varvec{w}\) and \(\varvec{\alpha }\), respectively. DARTS defines the operation \(\bar{o}^{(i,j)}(\varvec{x})\) as

$$\begin{aligned} \bar{o}^{(i,j)}(\varvec{x})= \sum _{k=1}^K \frac{e^{\alpha _{k}^{(i,j)}}}{\sum _{k'=1}^K e^{\alpha _{k'}^{(i,j)}}} \cdot o^{(i,j)}_k(\varvec{x}) \end{aligned}$$
(1)

in which \(\varvec{\alpha }\) encodes the network architecture; and the optimal choice of architecture is defined by

$$\begin{aligned} \varvec{\alpha }^\star = \min _{\varvec{\alpha }} {\mathcal {L}}^{(\textrm{val})}( \varvec{\alpha },\varvec{w}^\star (\varvec{\alpha })) \quad \text {s.t.} \quad \varvec{w}^\star (\varvec{\alpha }) = \arg \min _{\varvec{w}} {\mathcal {L}}^{(\textrm{train})}(\varvec{\alpha },\varvec{w} ). \end{aligned}$$
(2)

The final objective is to obtain a sparse architecture \(\mathcal {Z}^\star = \{\mathcal {Z}^{(i,j)}\}, \forall i,j\) where \(\mathcal {Z}^{(i,j)}=[z_{1}^{(i,j)}, \dots ,z_{K}^{(i,j)}]\) with \(z_{k}^{(i,j)}=1\) for k corresponding to the best operation and 0 otherwise. That is, for each pair (i, j) a single operation is selected.
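To make the formulation concrete, the following is a minimal sketch of the mixed operation in Eq. (1) and of the discretisation into \(\mathcal {Z}^{(i,j)}\); the toy operations and inputs are placeholders for illustration only, not the actual DARTS implementation.

```python
import numpy as np

def mixed_op(x, alphas, ops):
    """Continuous relaxation of Eq. (1): softmax-weighted sum of candidate operations."""
    weights = np.exp(alphas - alphas.max())
    weights /= weights.sum()
    return sum(w * op(x) for w, op in zip(weights, ops))

def discretise(alphas):
    """Sparse choice Z^(i,j): one-hot vector keeping only the highest-weighted operation."""
    z = np.zeros_like(alphas)
    z[np.argmax(alphas)] = 1.0
    return z

# Toy example with K = 3 placeholder operations acting on a feature vector.
ops = [lambda x: x,                  # identity / skip connection
       lambda x: np.maximum(x, 0),   # stand-in for a non-linear operation
       lambda x: 0.5 * x]            # stand-in for a pooling-like operation
alphas = np.array([0.2, 1.5, -0.3])
x = np.ones(4)
print(mixed_op(x, alphas, ops), discretise(alphas))
```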

Fig. 1 MANAS with a single cell. Between each pair of nodes, an agent \({\mathcal {A}}_i\) selects action \(a^{(i)}\) according to \(\pi ^{(i)}\). Feedback from the validation loss is used to update the policy

4 Online multi-agent learning for AutoML

NAS suffers from a combinatorial explosion of its search space. A recently proposed approach to tackle this problem is to approximate the discrete optimisation variables (i.e., edges in our case) with continuous counterparts and then use gradient-based optimisation methods. DARTS (Liu et al., 2019) introduced this method for NAS, though it suffers from two important drawbacks. First, the algorithm is memory and computationally intensive (\({\mathcal {O}}(NK)\), with K being the total number of operations between a pair of nodes and N the number of edges) as it requires loading all operation parameters into GPU memory. As a result, DARTS only optimises over a smaller network of 8 repeating cells, which are then stacked together to form a deep network of 20 cells. Naturally, such an approximation is bound to be sub-optimal. Second, evaluating an architecture on the validation set requires the optimal set of network parameters. Learning these, unfortunately, is highly demanding since for an architecture \({\mathcal {Z}}_{t}\), one would like to compute \({\mathcal {L}}^{(\textrm{val})}_{t}\left( \mathcal {Z}_t,\varvec{w}^{\star }_{t}\right)\) where \(\varvec{w}^{\star }_{t} = \arg \min _{\varvec{w}} {\mathcal {L}}_{t}^{(\textrm{train})}(\varvec{w}, {\mathcal {Z}}_{t})\). DARTS uses weight sharing, updating \(\varvec{w}_t\) once per architecture in the hope of tracking \(\varvec{w}^{\star }_t\) over learning rounds. Although this technique leads to a significant speed-up in computation, it is not clear how this approximation affects the validation loss function.

Next, we detail a novel methodology based on a combination of multi-agent and online learning to tackle the above two problems (Fig. 1). Multi-agent learning scales our algorithm, reducing memory consumption by an order of magnitude from \({\mathcal {O}}(NK)\) to \({\mathcal {O}}(N)\); and online learning enables rigorous understanding of the effect of tracking \(\varvec{w}^{\star }_{t}\) over rounds.

4.1 NAS as a multi-agent problem

To address the computational complexity we use the weight-sharing technique from DARTS. However, we handle the effect of approximating \({\mathcal {L}}^{(\textrm{val})}_{t}\left( \mathcal {Z}_t, \varvec{w}^{\star }_{t}\right)\) by \({\mathcal {L}}^{(\textrm{val})}_{t}\left( \mathcal {Z}_t, \varvec{w}_{t}\right)\) in a more theoretically grounded way. Indeed, such an approximation can lead to arbitrarily bad solutions due to the uncontrollable weight component. To analyse the learning problem with no stochastic assumptions on the process generating \(\nu = \{{\mathcal {L}}_{1}, \dots , {\mathcal {L}}_{T}\}\), we adopt an adversarial online learning framework.

Algorithm 1

NAS as Multi-Agent Combinatorial Online Learning. In Sect. 3, we defined a NAS problem where one out of K operations needs to be recommended for each pair of nodes (i, j) in a DAG. In this section, we associate each pair of nodes with an agent in charge of exploring and quantifying the quality of these K operations, to ultimately recommend one. The only feedback for each agent is the loss that is associated with a global architecture \(\mathcal {Z}\), which depends on all agents’ choices.

We introduce N decision makers, \({\mathcal {A}}_{1}, \dots , {\mathcal {A}}_{N}\) (see Fig. 1 and Algorithm 1). At training round t, each agent chooses an operation (e.g., convolution or pooling filter) according to its local action-distribution (or policy) \(\varvec{a}_{t}^{j} \sim \pi _{t}^{j}\), for all \(j \in \{1, \dots , N\}\) with \(\varvec{a}_{t}^{j}\in \{1, \dots , K\}\). These operations have corresponding operational weights \(\varvec{w}_t\) that are learned in parallel. Altogether, these choices \(\varvec{a}_{t}=\varvec{a}_{t}^{1}, \dots , \varvec{a}_{t}^{N}\) define a sparse graph/architecture \(\mathcal {Z}_t\equiv \varvec{a}_{t}\) for which a validation loss \({\mathcal {L}}^{(\textrm{val})}_{t}\left( \mathcal {Z}_t,\varvec{w}_t\right)\) is computed and used by the agents to update their policies. After T rounds, an architecture is recommended by sampling \(\varvec{a}_{T+1}^{j} \sim \pi _{T+1}^{j}\), for all \(j \in \{1, \dots , N\}\). These dynamics resemble bandit algorithms where the actions of an agent \({\mathcal {A}}_{j}\) are viewed as separate arms. Multi-armed bandit algorithms allow balancing exploration and exploitation in a decision-making process. By employing them, MANAS allows for a distributed search of the space through multiple agents with a global optimisation goal, and achieves both memory and time efficiency as a result. This framework leaves open the design of 1) the sampling strategy \(\pi ^{j}\) and 2) how \(\pi ^{j}\) is updated from the observed loss.
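The round-based interaction just described can be sketched as follows, assuming a weight-sharing supernetwork exposing hypothetical `train_step` and `val_loss` methods; the `update` rule is deliberately left abstract, since it is exactly what Sect. 5 instantiates (MANAS-LS and MANAS).

```python
import numpy as np

rng = np.random.default_rng(0)

class EdgeAgent:
    """One agent per edge; maintains a categorical policy over the K candidate operations."""
    def __init__(self, K):
        self.K = K
        self.pi = np.full(K, 1.0 / K)   # uniform initial policy

    def sample(self):
        return rng.choice(self.K, p=self.pi)

    def update(self, action, loss):
        raise NotImplementedError       # credit assignment defined in Sect. 5

def search(agents, supernet, T):
    for t in range(T):
        actions = [agent.sample() for agent in agents]  # joint action = architecture Z_t
        supernet.train_step(actions)                    # update shared weights w_t
        loss = supernet.val_loss(actions)               # validation feedback L_t^(val)
        for agent, a in zip(agents, actions):
            agent.update(a, loss)                       # implicit coordination via the shared loss
    # after T rounds, recommend an architecture by sampling the final policies
    return [agent.sample() for agent in agents]
```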

Minimization of worst-case regret under any loss. The following two notions of regret motivate our proposed NAS method. Given a policy \(\pi\), the cumulative regret \({\mathcal {R}}_{T,\pi }^{\star }\) and the simple regret \(r_{T,\pi }^{\star }\) after T rounds, under the worst possible environment \(\nu\), are:

$$\begin{aligned} {\mathcal {R}}_{T,\pi }^{\star }&= \sup _{\nu } {\mathbb {E}} \sum _{t=1}^{T} {\mathcal {L}}_{t}(\varvec{a}_{t}) - \min _{\varvec{a}} \sum _{t=1}^{T}{\mathcal {L}}_{t}(\varvec{a}), \end{aligned}$$
(3)
$$\begin{aligned} r_{T,\pi }^{\star }&= \sup _{\nu } {\mathbb {E}}\sum _{t=1}^{T} {\mathcal {L}}_{t}(\varvec{a}_{T+1}) - \min _{\varvec{a}}\sum _{t=1}^{T} {\mathcal {L}}_{t}(\varvec{a}) \end{aligned}$$
(4)

where the expectation is taken over both the losses and policy distributions, and \(\varvec{a}=\{\varvec{a}^{({\mathcal {A}}_j)}\}^{N}_{j=1}\) denotes a joint action profile. The simple regret leads to minimising the loss of the recommended architecture \(\varvec{a}_{T+1}\), while minimising the cumulative regret adds the extra requirement of having to sample, at any time t, architectures with close-to-optimal losses. We discuss in Appendix E how this requirement could, in practice, improve the tracking of \(\varvec{w}_t^\star\) by \(\varvec{w}_t\). We let \({\mathcal {L}}_{t}(\varvec{a}_{t})\) be potentially adversarially designed to account for the difference between \(\varvec{w}_t^\star\) and \(\varvec{w}_t\), and make no assumption on its convergence. Our models and solutions in Sect. 5 are designed to be robust to arbitrary \({\mathcal {L}}_{t}(\varvec{a}_{t})\).

Because of the discrete nature of the NAS problem, during search the loss can take on large values or alternate arbitrarily between large and small values. Gradient-descent methods perform best under smooth loss functions, which is not the case in NAS. Worst-case regret minimisation is a theoretically grounded objective which we use to provide guarantees on the convergence of the algorithm when no assumptions are made on the process generating the losses.

5 Adversarial implementations

In the following subsections we describe our proposed approaches for NAS under adversarial losses. We present two algorithms, MANAS-LS and MANAS, that implement two different credit-assignment techniques specifying the update rule in line 7 of Algorithm 1. The first approximates the validation loss as a linear combination of edge weights, while the second handles non-linear losses.

Note that adversarial in this context refers to the adversarial multi-armed bandit framework (Auer et al., 1995): we model the fact that a weight-sharing supernetwork returns noisy rewards as an adversary that explicitly tries to confuse the learner. The adversarial multi-armed bandit is the strongest generalisation of the bandit problem, as it removes all assumptions on the loss distribution. Our MA formulation and algorithm explicitly account for this adversarial nature and provide a principled solution that is provably robust.

5.1 MANAS-LS

5.1.1 Linear decomposition of the loss

A simple credit assignment strategy is to approximate edge-importance (or edge-weight) by a vector \(\varvec{\beta }_{s}\in {\mathbb {R}}^{KN}\) representing the importance of all K operations for each of the N agents. \(\varvec{\beta }_{s}\) is an arbitrary, potentially adversarially-chosen vector and varies with time s to account for the fact that the operational weights \(\varvec{w}_{s}\) are learned online and to avoid any restrictive assumption on their convergence. The relation between the observed loss \({\mathcal {L}}_{s}^{(\textrm{val})}\) and the architecture selected at each sampling stage s is modeled through a linear combination of the architecture’s edges (agents’ actions) as

$$\begin{aligned} {\mathcal {L}}_{s}^{(\textrm{val})} = \varvec{\beta }_{s}^{\textsf{T}} \varvec{Z}_{s} \end{aligned}$$
(5)

where \(\varvec{Z}_{s}\in \{0,1\}^{KN}\) is a vectorised one-hot encoding of the architecture \({\mathcal {Z}}_{s}\) (active edges are 1, otherwise 0). After evaluating S architectures, at round t we estimate \(\varvec{\beta }\) by solving the following via least-squares:

$$\begin{aligned} \text {Credit assignment}:\,\,\, \widetilde{\varvec{B}}_{t} = \text {arg}\min _{\varvec{\beta }} \sum _{s=1}^{S} \left( {\mathcal {L}}_{s}^{(\textrm{val})} - \varvec{\beta }^{\textsf{T}} \varvec{Z}_{s}\right) ^{2}. \end{aligned}$$
(6)

The solution gives an efficient way for agents to update their corresponding action-selection rules and leads to implicit coordination. Indeed, in Appendix C we demonstrate that the worst-case regret \({\mathcal {R}}^\star _{T}\) (3) can actually be decomposed into an agent-specific form \({\mathcal {R}}^{i}_T\left( \varvec{\pi }^{i},\nu ^{i}\right)\) defined in the appendix: \({\mathcal {R}}^\star _{T} =\sup _{\nu }{\mathcal {R}}_{T}(\varvec{\pi }, \nu ) \iff \sup _{\nu ^{i}}{\mathcal {R}}^{i}_T \left( \varvec{\pi }^{i}, \nu ^{i}\right) , \ \ \ i=1,\ldots ,N\). This decomposition allows us to significantly reduce the search space complexity by letting each agent \({\mathcal {A}}_i\) determine the best operation for the corresponding graph edge.
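The following is a minimal sketch of the least-squares credit assignment of Eq. (6), with purely synthetic data standing in for the S sampled architectures and their validation losses.

```python
import numpy as np

def estimate_beta(Z_history, loss_history):
    """Solve Eq. (6) by least squares: Z_history is an S x (K*N) matrix of one-hot
    architecture encodings, loss_history the S observed validation losses."""
    beta, *_ = np.linalg.lstsq(np.asarray(Z_history), np.asarray(loss_history), rcond=None)
    return beta  # estimated importance of every operation for every agent

# Toy usage with N = 2 agents, K = 3 operations and S = 50 sampled architectures.
N, K, S = 2, 3, 50
rng = np.random.default_rng(1)
Z_history = np.zeros((S, K * N))
for s in range(S):
    for i in range(N):
        Z_history[s, i * K + rng.integers(K)] = 1.0      # one active operation per agent
true_beta = rng.normal(size=K * N)
loss_history = Z_history @ true_beta + 0.01 * rng.normal(size=S)
print(np.round(estimate_beta(Z_history, loss_history), 2))
```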

Zipf Sampling for \(r_{T,\pi }^{\star }\). \({\mathcal {A}}_i\) samples an operation k proportionally to the inverse of its estimated rank \(\widetilde{ \langle k \rangle }^{i}_t\), where \(\widetilde{ \langle k \rangle }^{i}_t\) is computed by sorting the operations of agent \({\mathcal {A}}_i\) w.r.t. \(\tilde{\varvec{B}}^{i}_t[k]\), as

$$\begin{aligned} \text {Sampling policy}:\,\,\,\varvec{\pi }^{i}_{t+1}[k]= 1 \Big / \left( \widetilde{ \langle k \rangle }^{i}_t \, \overline{\log } K \right) \quad \text { where } \overline{\log } K = 1+1/2+\ldots +1/K. \end{aligned}$$

Zipf sampling explores efficiently, is anytime and parameter-free, optimally minimises the simple regret in multi-armed bandits when the losses are adversarially designed, and adapts optimally to stationary losses (Abbasi-Yadkori et al., 2018).
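As a concrete illustration, here is a minimal sketch of the Zipf sampling policy over one agent's cumulative loss estimates \(\widetilde{\varvec{B}}^{i}_t\); the example values are arbitrary.

```python
import numpy as np

def zipf_policy(B_i):
    """Zipf sampling: probability inversely proportional to the rank of the estimated
    cumulative loss B_i[k], normalised by the harmonic sum 1 + 1/2 + ... + 1/K."""
    K = len(B_i)
    ranks = np.empty(K)
    ranks[np.argsort(B_i)] = np.arange(1, K + 1)   # rank 1 = smallest estimated loss
    log_bar_K = np.sum(1.0 / np.arange(1, K + 1))
    return 1.0 / (ranks * log_bar_K)

# The operation with the smallest estimated loss receives the largest probability.
print(zipf_policy(np.array([3.0, 1.0, 2.5, 0.7])))     # probabilities sum to 1
```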

We prove for this new algorithm an exponentially decreasing simple regret \(r^\star _{T}= {\mathcal {O}}\left( e^{-T/H}\right)\), where H is a measure of the complexity of discriminating sub-optimal solutions, \(H=N\left( \min _{j\ne k^\star _i,\,1\le i \le N }\varvec{B}^{i}_T[j]-\varvec{B}^{i}_T[k^\star _i] \right)\), with \(k^\star _i= \arg \min _{1\le j \le K }\varvec{B}^{i}_T[j]\) and \(\varvec{B}^{i}_T[j]=\sum _{t=1}^T \varvec{\beta }^{({\mathcal {A}}_i)}_t[j]\). The proof is given in Appendix D.1.

5.2 MANAS

5.2.1 Coordinated descent for non-linear losses

In some cases the linear approximation may be crude. An alternative is to make no assumptions on the loss function and have each agent directly associate the quality of its actions with the loss \({\mathcal {L}}^{(\textrm{val})}_t(\varvec{a}_t)\). This results in all agents performing coordinated descent on the problem. Each agent updates, for operation k, its \(\widetilde{\varvec{B}}^{i}_{t}[k]\) as

$$\begin{aligned} \text {Credit assignment}:\,\, \widetilde{\varvec{B}}^{i}_{t}[k] = \widetilde{\varvec{B}}^{i}_{t-1}[k]+ {\mathcal {L}}^{(\textrm{val})}_t \cdot \mathbbm {1}_{ \varvec{a}_{t}^{i} = k }/\varvec{\pi }^{i}_t[k]. \end{aligned}$$
(7)

Softmax Sampling for \({\mathcal {R}}_{T,\pi }^{\star }\). Following EXP3 (Auer et al., 2002), actions are sampled from a softmax distribution (with temperature \(\eta\)) w.r.t. \(\tilde{ \varvec{B}}^{i}_t[k]\):

$$\begin{aligned} \text {Sampling policy}:\,\, \varvec{\pi }^{i}_{t+1}[k] = \exp \left( \eta \tilde{\varvec{B}}^{i}_t[k] \right) \Big / \sum _{j=1}^{K}\exp \left( \eta \tilde{\varvec{B}}^{i}_t[j] \right) . \end{aligned}$$

Using this sampling strategy, EXP3 (Auer et al., 2002) is run for each agent in parallel. If the regret of each agent is computed by considering the rest of the agents as fixed, then each agent has regret \({\mathcal {O}}\left( \sqrt{TK\log K}\right)\), which sums over agents to \({\mathcal {O}}\left( N\sqrt{TK\log K}\right)\). The proof is given in Appendix D.1.
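Below is a minimal sketch of a single MANAS agent combining the credit assignment of Eq. (7) with softmax (EXP3-style) sampling; accumulating the negated loss is an assumption made here so that operations with lower observed losses receive higher probability, matching the reward convention of EXP3.

```python
import numpy as np

rng = np.random.default_rng(2)

class MANASAgent:
    """One-edge agent: importance-weighted credit assignment (Eq. 7) + softmax sampling."""
    def __init__(self, K, eta=0.1):
        self.B = np.zeros(K)            # cumulative credit estimates B~^i_t[k]
        self.eta = eta
        self.pi = np.full(K, 1.0 / K)

    def sample(self):
        return rng.choice(len(self.B), p=self.pi)

    def update(self, action, loss):
        # Eq. (7): only the played operation is updated, scaled by its selection probability;
        # the loss is negated (assumption) so that low-loss operations accumulate high credit.
        self.B[action] += -loss / self.pi[action]
        logits = self.eta * (self.B - self.B.max())     # shift for numerical stability
        self.pi = np.exp(logits) / np.exp(logits).sum()

# Toy usage: one agent with K = 4 operations receiving random validation losses.
agent = MANASAgent(K=4)
for _ in range(100):
    a = agent.sample()
    agent.update(a, loss=rng.uniform(0.2, 1.0))
print(np.round(agent.pi, 3))
```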

On credit assignment. Our MA formulation provides a gradient-free credit-assignment strategy. Gradient methods are more susceptible to bad initialisation and can get trapped in local minima more easily than our approach, which not only explores the search space more widely, but does so optimally according to multi-armed-bandit regret minimisation. Concretely, MANAS can easily escape from local minima as the reward is scaled by the probability of selecting an action (Eq. 7). Thus, the algorithm has a higher chance of revising its estimate of the quality of a solution based on new evidence. This is important as one-shot methods (such as MANAS and DARTS) change the network, and thus the environment, throughout the search process. Put differently, MANAS’ optimal exploration-exploitation trade-off allows the algorithm to move away from ‘good’ solutions towards ‘very good’ solutions that do not live in the former’s proximity; in contrast, gradient methods will tend to stay in the vicinity of a ‘good’ discovered solution.

6 Experiments

We (1) compare MANAS against existing NAS methods on the well-established CIFAR-10 dataset; (2) evaluate MANAS on ImageNet; (3) compare MANAS, DARTS, Random Sampling and Random Search with WS (Li and Talwalkar, 2019) on 3 new datasets (Sport-8, Caltech-101, MIT-67); and (4) evaluate MANAS with inference time as a complexity constraint. Descriptions of the datasets and details of the search are provided in the Appendix. We report the performance of two algorithms, MANAS and MANAS-LS, as described in Sect. 5. Note that, with the exception of results marked as +AutoAugment, all experiments were run with the same final training protocol as DARTS (Liu et al., 2019), for fair comparison.

Search Spaces. We use the same CNN search space as Liu et al. (2019). Since MANAS is memory efficient, it can search for the final architecture without needing to stack a posteriori repeated cells; thus, all our cells are unique. For fair comparison, we use 20 cells on CIFAR-10 and 14 on ImageNet. Experiments on Sport-8, Caltech-101 and MIT-67 in Sect. 6.3 use both 8 and 14 cell networks.

Search Protocols. For datasets other than ImageNet, we use 500 epochs during the search phase for architectures with 20 cells, 400 epochs for 14 cells, and 50 epochs for 8 cells. All other hyperparameters are as in Liu et al. (2019). For ImageNet, we use 14 cells and 100 epochs during search. In our experiments on the three new datasets we rerun the DARTS code to optimise an 8 cell architecture; for 14 cells we simply stacked the best cells for the appropriate number of times.

Synthetic experiment. To illustrate the theoretical properties of MANAS we apply it to the Gaussian Squeeze task, a problem in which agents must coordinate their actions in order to optimize a global objective function that depends on the actions of every agent (Yang et al., 2018; Colby et al., 2015). Specifically, N homogeneous agents determine their individual actions \(a^{(j)}\) to jointly optimize the objective \(G(x) = x e^{-\frac{(x-\mu )^{2}}{\sigma ^{2}}}\), where \(x = \sum _{j=1}^{N} a^{(j)}\). This synthetic setup has the same characteristics as the multi-agent NAS problem, namely a group of agents implicitly coordinating their actions to achieve a global objective, and is therefore a good testbed for showcasing the theoretical properties of the MANAS algorithm.
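A minimal sketch of the Gaussian Squeeze objective follows, with the parameters from Fig. 2 (100 agents, 10 actions, \(\mu = 1\), \(\sigma = 10\)); the integer action encoding is an assumption made for illustration.

```python
import numpy as np

def gaussian_squeeze(actions, mu=1.0, sigma=10.0):
    """Global objective G(x) = x * exp(-(x - mu)^2 / sigma^2), where x is the sum
    of all agents' individual actions."""
    x = float(np.sum(actions))
    return x * np.exp(-((x - mu) ** 2) / sigma ** 2)

# 100 homogeneous agents, each choosing one of 10 actions.
rng = np.random.default_rng(3)
actions = rng.integers(0, 10, size=100)
print(gaussian_squeeze(actions))
```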

We confirm that (1) MANAS progresses steadily towards zero regret while the Random Search baseline struggles to move beyond its initial starting point; and (2) MANAS stays well within the theoretical cumulative regret bound (Fig. 2).

Fig. 2 Left: regret for the Gaussian Squeeze Domain experiment with 100 agents, 10 actions, \(\mu =1\), \(\sigma =10\). Right: theoretical bound on the MANAS cumulative regret (\(2N\sqrt{TK\log K}\); see Appendix D.2) and the observed counterpart for the same experiment

Table 1 Comparison with state-of-the-art image classifiers on CIFAR-10

6.1 Results on CIFAR-10

6.1.1 Evaluation

To evaluate our NAS algorithm, we follow DARTS’s protocol: we run MANAS 4 times with different random seeds and pick the best architecture based on its validation performance. We then randomly reinitialize the weights and retrain for 600 epochs. During search we use half of the training set as validation. To fairly compare with more recent methods, we also re-train the best searched architecture using AutoAugment and Extended Training (Cubuk et al., 2018).

6.1.2 Results

Both MANAS implementations perform well on this dataset (Table 1). Our algorithm is designed to perform comparably to Liu et al. (2019) but with an order of magnitude less memory. However, MANAS actually achieves higher accuracy. The reason is that DARTS is forced to search for an 8-cell architecture and subsequently stack the same cells 20 times; MANAS, on the other hand, can search directly on the final number of cells, leading to better results. We also report our results when using only 8 cells: even though the network is much smaller, it still performs competitively with 1st-order 20-cell DARTS. This is explored in more depth in Sect. 6.3. In terms of memory usage with a batch size of 1, MANAS with 8 cells required only 1GB of GPU memory, while DARTSv1 used more than 8.5GB and DARTSv2 required 9.6GB, making both versions of DARTS impractical for datasets with larger image sizes.

Cai et al. (2019) is another method designed as an efficient alternative to DARTS; unfortunately the authors decided to a) use a different search space (PyramidNet backbone; Han et al. (2017)) and b) offer no comparison to random sampling in the given search space. For these reasons we feel a numerical comparison to be unfair. Furthermore our algorithm uses half the GPU memory (they sample 2 paths at a time) and does not require the reward to be differentiable. Lastly, we observe similar gains when training the best MANAS/MANAS-LS architectures with an extended protocol (AutoAugment + 1500 Epochs + 50 Channels, in addition to the DARTS protocol).

6.2 Results on ImageNet

Table 2 Comparison with state-of-the-art image classifiers on ImageNet (mobile setting)

6.2.1 Evaluation

To evaluate the results on ImageNet we train the final architecture for 250 epochs. We report the result of the best architecture out of 4, as chosen on the validation set for a fair comparison with competing methods. As search and augmentation are very expensive we use only MANAS and not MANAS-LS, as the former is computationally cheaper and performs slightly better on average.

6.2.2 Results

We provide results for networks searched both on CIFAR-10 and directly on ImageNet, which is made possible by the computational efficiency of MANAS (Table 2). When compared to SNAS, DARTS, GDAS and other methods, using the same search space, MANAS achieves state-of-the-art results both with architectures searched directly on ImageNet and also with architectures transferred from CIFAR-10. We observe similar improvements when training the best MANAS architecture with an extended training protocol (AutoAugment + 600 Epochs + 60 Channels, in addition to the DARTS protocol), resulting in a final test error of 25.26% when directly searching on ImageNet.

6.3 Results on new datasets: Sport-8, Caltech-101, MIT-67

6.3.1 Evaluation

The idea behind NAS is to find the optimal architecture given any set of data and labels. Limiting the evaluation of current methods to CIFAR-10 and ImageNet could potentially lead to algorithmic overfitting. Indeed, recent results suggest that the search space was engineered in a way that makes it very hard to find a bad architecture (Li and Talwalkar, 2019; Yu et al., 2019; Yang et al., 2020; Wan et al., 2022). To mitigate this, we propose testing NAS algorithms on 3 datasets (composed of regular-sized images) that were never before used in this setting, but have been historically used in the CV field: Sport-8, Caltech-101 and MIT-67, described briefly in the Appendix. For this set of experiments we run the algorithm 8 times and report mean and std. We do this for both 8 and 14 cells; we do the same with DARTS (which, due to memory constraints, can only search for 8 cells). As baselines, we consider random search and random sampling. For the latter we simply sample 8 architectures uniformly from the search space. To efficiently implement random search, we follow Li and Talwalkar (2019) and perform experiments on random search with WS. Each proposed architecture is trained from scratch for 600 epochs as in the previous section.

6.3.2 Results

MANAS manages to outperform the random baselines and significantly outperform DARTS, especially on 14 cells (Fig. 3): this clearly shows that the optimal cell architecture for 8 cells is not the optimal one for 14 cells.

Fig. 3 Comparing MANAS, random sampling, random search with WS (Li and Talwalkar, 2019) and DARTS (Liu et al., 2019) on 8 and 14 cells. Average results of 8 runs. Note that DARTS was only optimised for 8 cells due to memory constraints

Table 3 Results of MANAS with complexity constraints using different penalty (\(\lambda\)) values on CIFAR-10

6.4 Results with complexity constraint

6.4.1 Evaluation

To evaluate MANAS in a complexity-constrained setting, we added the inference time of the generated architectures as a complexity constraint during training. For this, we augment the training loss with the inference time that the generated architecture takes to classify an image: \({\mathcal {L}}^{(\textrm{train})}_t(\varvec{a}_t) ={\mathcal {L}}^{(\textrm{train})}_t(\mathcal {Z}_t,\varvec{w}_t )+\lambda L_t(\mathcal {Z}_t,\varvec{w}_t )\), where \(L_t\) is the inference time needed to classify one image and \(\lambda\) controls the importance given to \(L_t\) during search: increasing \(\lambda\) gives the inference-time constraint a higher weight.
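For illustration, a minimal sketch of this penalised training loss is shown below; the loss and latency values are arbitrary placeholders.

```python
def constrained_loss(train_loss, inference_time, lam):
    """Penalised objective L^(train) + lambda * L_t, where L_t is the measured time
    (in seconds) to classify one image with the sampled architecture."""
    return train_loss + lam * inference_time

# The same architecture under increasing penalties: a larger lambda favours faster models.
for lam in (0.0, 0.1, 1.0):
    print(lam, constrained_loss(train_loss=0.85, inference_time=0.02, lam=lam))
```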

6.4.2 Results

We evaluate MANAS with different \(\lambda\) values using a single GPU (Table 3) and observe that, by increasing the importance given to inference time, MANAS consistently generates architectures with lower inference time and similar accuracies. This experiment shows that MANAS can be extended to search with multiple objectives simply by modifying the training loss.

7 Discussion

Random Baselines. Clearly, in specific settings, random sampling performs very competitively. Since the search space is very large (between \(8^{112}\) and \(8^{280}\) architectures exist in the DARTS experiments), finding the global optimum is practically impossible. Why is it, then, that randomly sampled architectures are able to deliver nearly state-of-the-art results? Previous experiments (Yu et al., 2019; Li and Talwalkar, 2019), together with the results presented here, seem to indicate that the available operations and meta-structure have been carefully chosen and, as a consequence, most architectures in this space generate very good results. This suggests that human effort has simply transitioned from finding a good architecture to finding a good search space, a problem that needs careful consideration in future work. Random search with WS (Li and Talwalkar, 2019) has also been shown to perform competitively, but it is clearly sub-optimal compared to our multi-agent framework.

On fair evaluation. It is worth stressing that we performed all comparisons using the same final training protocol. This is extremely relevant as there has been a recent trend to boost results simply by stacking more training tricks on to the evaluation protocol. As such, any improvement in the final accuracy is solely due to how the network was trained rather than the quality of the search method used or the architecture discovered (Yang et al., 2020).

Agent coordination, combinatorial explosion and approximate credit assignment. Our set-up introduces multiple agents in need of coordination. Centralised critics achieve explicit coordination by learning the value of coordinated actions across all agents (Rashid et al., 2018), but the complexity of this approach grows with the number of possible architectures \(\mathcal {Z}\), which equals \(K^N\), i.e., exponentially in the number of agents. We argue instead for an implicit approach where coordination is achieved through a joint loss function depending on the actions of all agents. This approach is scalable as each agent searches its local action space, which is small and finite, for optimal action-selection rules. Both proposed credit-assignment methods learn, for each operation k belonging to an agent \({\mathcal {A}}_{i}\), a quantity \(\widetilde{\varvec{B}}^{i}_{t}[k]\) (similar to \(\alpha\) in Sect. 3) that quantifies the contribution of the operation to the observed losses.

8 Conclusions

We presented MANAS, a theoretically grounded multi-agent online learning framework for NAS. We proposed two extremely lightweight implementations that, within the same search space, outperform the state-of-the-art while reducing memory consumption by an order of magnitude compared to Liu et al. (2019). We provide vanishing-regret proofs for our algorithms. Furthermore, we evaluate MANAS on 3 new datasets, empirically showing its effectiveness in a variety of settings. Finally, we confirm concerns raised in recent works (Yu et al., 2019; Li and Talwalkar, 2019; Yang et al., 2020) claiming that NAS algorithms often achieve only minor gains over random architectures. We demonstrate, however, that MANAS still produces competitive results with limited computational budgets.