
Learning To Learn With Quantum Neural Networks Via Classical Neural Networks

This paper proposes training classical neural networks to help optimize quantum neural networks. Specifically, classical recurrent neural networks are trained to find approximately optimal parameters for quantum variational algorithms like QAOA in a small number of queries. By initializing other optimizers at the suggested parameters, the total number of iterations to reach a given accuracy is significantly reduced. The approach also generalizes across problem sizes, allowing training on small problems to initialize larger problems on quantum devices.

Learning to learn with quantum neural networks via classical neural networks

Guillaume Verdon,1,2,4,∗ Michael Broughton,1,3,∗ Jarrod R. McClean,1 Kevin J. Sung,1,5 Ryan Babbush,1 Zhang Jiang,1 Hartmut Neven,1 and Masoud Mohseni1
1 Google LLC, Venice, CA 90291
2 Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
3 Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
4 Institute for Quantum Computing, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
5 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109
(Dated: July 12, 2019)
arXiv:1907.05415v1 [quant-ph] 11 Jul 2019

Quantum Neural Networks (QNNs) are a promising variational learning paradigm with applications to near-term quantum processors; however, they still face some significant challenges. One such challenge is finding good parameter initialization heuristics that ensure rapid and consistent convergence to local minima of the parameterized quantum circuit landscape. In this work, we train classical neural networks to assist in the quantum learning process, an approach also known as meta-learning, to rapidly find approximate optima in the parameter landscape for several classes of quantum variational algorithms. Specifically, we train classical recurrent neural networks to find approximately optimal parameters within a small number of queries of the cost function for the Quantum Approximate Optimization Algorithm (QAOA) for MaxCut, QAOA for the Sherrington-Kirkpatrick Ising model, and for a Variational Quantum Eigensolver for the Hubbard model. By initializing other optimizers at parameter values suggested by the classical neural network, we demonstrate a significant improvement in the total number of optimization iterations required to reach a given accuracy. We further demonstrate that the optimization strategies learned by the neural network generalize well across a range of problem instance sizes. This opens up the possibility of training on small, classically simulatable problem instances in order to initialize larger, classically intractable problem instances on quantum devices, thereby significantly reducing the number of required quantum-classical optimization iterations.

I. INTRODUCTION

With the advent of noisy intermediate-scale quantum (NISQ) devices [1], there has been a growing body of work [1–24] aiming to develop algorithms which are suitable to be run in this near-term era of quantum computing. A particularly promising category of such algorithms is the so-called quantum-classical variational algorithms [4, 9], which involve the optimization over a family of parameterized quantum circuits using classical optimization techniques (see Fig. 1). These variational algorithms are promising because they have flexible architectures, are adaptive in nature, can be tailored to fit the gate allowances of near-term quantum devices, and are partially robust to systematic noise.

Many quantum-classical variational algorithms consist of optimizing the parameters of a parameterized quantum circuit (PQC) to extremize a cost function (often consisting of the expectation value of a certain observable at the output of the circuit). This optimization of parameterized functions is similar to the methods of classical deep learning with neural networks [25–27]. Furthermore, the training and inference processes for classical deep neural networks have been shown to be embeddable into this quantum-classical PQC optimization framework [5, 28]. Given these connections, it has become common to refer to certain PQC ansatze as Quantum Neural Networks (QNNs) [2, 7, 23, 29].

Optimization of QNNs in the NISQ era is currently faced with two main challenges. The first is local optimization; the stochastic nature of the objective function in combination with readout complexity considerations has made direct translation of classical local optimization algorithms challenging. Proposed gradient-based optimizers either rely on a quantum form of backpropagation of errors [28] that requires additional gate depth and quantum memory, or use finite-difference gradients [2, 23] which typically require numerous quantum circuit evaluations for each gradient descent iteration. Recent works have proposed sampling analytic gradients [30, 31] to reduce this cost. However, these approaches also require many measurement runs and consequently remain expensive, and further advances are needed in this area.

The second major challenge for QNN optimization is parameter initialization. Although there have been some proposals for QNN parameter initialization heuristics [8, 32, 33], we believe there is a need for more efficient and more flexible variants of such heuristics. By initializing parameters in the neighborhood of a local minimum of the cost landscape, one ensures more consistent local optimization convergence in fewer iterations and a better overall answer with respect to the global landscape. Good initialization is thus crucial to promote the convergence of local optimizers to local extrema and to select reasonably good local minima.

In this paper, we tackle the second problem of parameter initialization by exploring methods that leverage classical neural networks trained to optimize control parameters of parametrized quantum circuits. Taking inspiration from the growing body of work on meta-learning, also known as learning to learn [34–38], we use a classical recurrent neural network (RNN) as a black-box controller to optimize the PQC parameters directly, as shown in Figure 2. We train this RNN using random problem instances from specific classes of problems. We explore the performance of this approach for the following problem classes: quantum approximate optimization algorithm (QAOA) for MaxCut [3], QAOA for Sherrington-Kirkpatrick Ising models [32], and a Variational Quantum Eigensolver (VQE) ansatz for the Hubbard model [6, 39, 40].

Figure 1. Unrolling the temporal quantum-classical hybrid computational graph of a general hybrid variational quantum algorithm. At the $t$th optimization iteration, the CPU is fed the previous iteration's parameters $\theta_t$ and the expectation value of the Hamiltonian at the previous step, $y_t = \langle\hat{H}\rangle_{\theta_t}$; it also has access to its own internal memory $m_t$. A classical optimization algorithm then suggests a new set of parameters $\theta_{t+1}$, which is fed to the QPU. The QPU then executes multiple runs to obtain $y_{t+1}$, the expectation value of the cost Hamiltonian at the output of the parameterized quantum circuit evaluated at these given parameters.

Through numerical simulations, we show that a recurrent neural network trained to optimize small quantum neural networks can learn parameter update heuristics that generalize to larger system sizes and problem instances, while still outperforming other initialization strategies at this scale. This opens up the possibility of classically training RNN optimizers for specific problem classes using instances of classically simulatable QNNs with reasonable system sizes as training data. After this training is done, these RNN optimizers could then be used on problem instances with QNNs whose system sizes are beyond the classically simulatable regime.
For reasons explained further in Section III C, we use the RNN as a few-shot global approximate optimizer, which is used to initialize a local optimizer, such as Nelder-Mead [41–43]. In principle, the neural network could initialize any other local optimizer (such as SPSA [44], BOBYQA [45], and many others [46]); however, the focus of this paper is not to benchmark and compare various options for local optimizers. We have found that our approach compares favorably with other standard parameter initialization methods for all local optimizers studied. To the authors' knowledge, this work is the first instance where meta-learning techniques have been successfully applied to enhance quantum machine learning algorithms.

II. QUANTUM-CLASSICAL META-LEARNING

A. Variational Quantum Algorithms

Let us first briefly review the theory of variational quantum algorithms, and how one can view the hybrid quantum-classical optimization process as a hybrid computational graph. Variational quantum algorithms are comprised of an iterative quantum-classical optimization loop between a classical processing unit (CPU) and a quantum processing unit (QPU), pictured in Figure 1.

An iteration begins with the CPU sending the set of candidate parameters $\theta$ to the QPU. The QPU then executes a parameterized circuit $\hat{U}(\theta)$, which outputs a state $|\psi_\theta\rangle$. For the types of QNN of interest in this work, namely QAOA [3] and VQE [4], the function to be optimized is the expectation value of a certain Hamiltonian operator, $f(\theta) \equiv \langle\psi_\theta|\hat{H}|\psi_\theta\rangle$. We will refer to this function as the cost function of the variational quantum algorithm, defined by the cost Hamiltonian. The expectation value of the cost Hamiltonian, $\langle\hat{H}\rangle_\theta \equiv \langle\psi_\theta|\hat{H}|\psi_\theta\rangle$, is estimated using the quantum expectation estimation procedure [47] via many repeated runs of the QPU. Following this, the estimated expectation value is relayed back to the CPU, where the classical optimizer running on the CPU is then tasked with suggesting a new set of parameters for the subsequent iteration [48].

From an optimization perspective, the CPU is given a parametrized black-box function $f : \mathbb{R}^m \to \mathbb{R}$ for which it is tasked to find a set of parameters minimizing this cost function, $\theta^* = \operatorname{argmin}_{\theta \in \mathbb{R}^m} f(\theta)$. In many cases, finding an approximate minimum is sufficient. Typically, one must consider this function to have a stochastic output which serves as a noisy unbiased estimate (under some assumptions) of the true output value of the function we are ultimately trying to optimize. Optimizing this output rapidly and accurately, despite only having access to noisy estimates, poses a significant challenge for variational quantum algorithms.

Even for perfect quantum gates and operations, for a finite number of measurement runs, there is inherent noise in the quantum expectation estimate [47]. Usually when performing quantum expectation estimation, the cost Hamiltonian can be expressed as a linear combination of $k$-local Paulis, $\hat{H} = \sum_{j=1}^{N} \alpha_j \hat{P}_j$, where the $\alpha_j$'s are real-valued coefficients and the $\hat{P}_j$'s are Paulis that are at most $k$-local [49]. The measurement of expectations of $k$-local Pauli observables is fairly straightforward [9], while the linear combination of expectation values is done on the classical device. For a Hamiltonian $\hat{H}$ with such a decomposition, we define its Pauli coefficient norm, denoted $\|\hat{H}\|_*$, as the one-norm of the vector of coefficients in its Pauli decomposition, namely $\|\hat{H}\|_* \equiv \|\alpha\|_1 = \sum_{j=1}^{N} |\alpha_j|$. For such Hamiltonians, the expected number of repetitions is bounded by $\sim O(\|\hat{H}\|_*^2 / \epsilon^2)$ to get an estimate that is accurate within $\epsilon$ from the unbiased value with a desired probability [6, 50].
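As a rough illustration of the estimation procedure described above, the following Python sketch computes the Pauli coefficient norm, the resulting order-of-magnitude repetition count, and a simulated finite-shot estimate of the cost function. The function names, the unit prefactor in the repetition count, and the modeling of each Pauli measurement as an independent ±1 outcome are assumptions made for illustration only; the text gives the scaling, not an implementation.

```python
import numpy as np


def pauli_coefficient_norm(alphas):
    """One-norm of the Pauli coefficients: ||H||_* = sum_j |alpha_j|."""
    return float(np.sum(np.abs(alphas)))


def repetitions_estimate(alphas, epsilon, constant=1.0):
    """Order-of-magnitude repetition count ~ constant * ||H||_*^2 / epsilon^2.

    `constant` is a hypothetical prefactor; only the scaling is stated in the text.
    """
    norm = pauli_coefficient_norm(alphas)
    return int(np.ceil(constant * norm ** 2 / epsilon ** 2))


def noisy_expectation(alphas, pauli_means, shots, rng):
    """Simulate a finite-shot estimate of <H> for H = sum_j alpha_j P_j.

    Assumption: each Pauli term is measured independently `shots` times and
    every outcome is +1 or -1 with mean pauli_means[j]. The per-term sample
    means are then combined linearly on the classical side.
    """
    alphas = np.asarray(alphas, dtype=float)
    p_plus = (1.0 + np.asarray(pauli_means, dtype=float)) / 2.0
    n_plus = rng.binomial(shots, p_plus)          # number of +1 outcomes per term
    term_estimates = 2.0 * n_plus / shots - 1.0   # unbiased estimate of each <P_j>
    return float(alphas @ term_estimates)


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    alphas = [0.5, -1.5, 2.0]
    means = [1.0, -1.0, 0.0]
    print("Pauli coefficient norm:", pauli_coefficient_norm(alphas))
    print("Repetitions for epsilon = 0.5:", repetitions_estimate(alphas, 0.5))
    print("Finite-shot estimate:", noisy_expectation(alphas, means, 1000, rng))
```

Because each call to `noisy_expectation` returns a different value for the same inputs, a classical optimizer driving such a function sees exactly the kind of noisy, unbiased black-box output discussed in this section.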
Finally, note that the quantum-classical optimization loop can be unrolled over time into a single temporal quantum-classical computational graph, as depicted in Figure 1. This hybrid computational graph can be considered a hybrid quantum-classical neural network. We developed methods to propagate gradients through such hybrid computational graphs using reverse-mode auto-differentiation, also known as backpropagation [51, 52]. We achieve this by converting hybrid quantum-classical backpropagation methods from previous work [28], originally formulated for quantum optimizers, to a form suitable for classical optimizers, which are most relevant for the NISQ era and are our focus in this paper. During the writing of this paper, other works have also employed hybrid quantum-classical backpropagation for various quantum machine learning tasks [30, 53].

Figure 2. Unrolled temporal quantum-classical computational graph for the meta-learning optimization of the recurrent neural network (RNN) optimizer and a quantum neural network (QNN). This graph is similar to the general VQA graph in Figure 1, except that the memory of the optimizer is encoded in the hidden state of the RNN, $h$, and we represent the flow of data used to evaluate the meta-learning loss function. This meta loss function $\mathcal{L}$ is a functional of the history of expectation value estimate samples $y = \{y_t\}_{t=1}^{T}$, and is thus indirectly dependent on the RNN parameters $\varphi$. We see that backpropagating from the meta-loss node to the RNN necessitates gradients to pass through the QNN.

B. Meta-learning with Neural Optimizers

Meta-learning, also called learning to learn [34, 35], consists of a set of meta-optimization techniques which aim to learn how to modify the parameters (or hyperparameters) of learning algorithms to further tailor them for a specific purpose. This could be to ensure that the learning generalizes well (minimizes test set error), to better fit the given data in fewer iterations (minimize training set error) [35], or to perform transfer learning (adapt a pre-trained neural network to a new task) [37, 54]. In recent years, there have been many new works in the meta-learning literature [34, 35, 37, 54, 55], and we aim to transfer some of the tools developed in the context of classical deep learning to quantum variational algorithms.

Our aim will be to train a classical optimizer neural network to learn parameter update heuristics for optimizee quantum neural networks. As mentioned previously, for our QNNs of interest, the cost function to be optimized is the expectation value of a certain Hamiltonian operator $\hat{H}$ with respect to a parameterized class of states $|\psi_\theta\rangle$ output by a family of parametrized quantum circuits; $f(\theta) = \langle\hat{H}\rangle_\theta$. To emulate the statistical noise of quantum expectation estimation, we will assume that the optimizer is fed noisy unbiased estimates of the QNN cost function at test time.

In many studies of meta-learning [35], it is assumed that this black box (in our case a QNN) is differentiable and that the learner has oracular access to the gradients of the function with respect to its parameters, $\nabla f(\theta)$. Since precise gradient estimations on NISQ devices are hampered by the large number of runs required and by the noise of the device, we focus on the case where the learner will only have access to black-box function queries at test time. We will, however, use gradients for neural optimizer network training. This access to gradients during training is not strictly necessary, but can speed up training in some cases. Note that gradients of hybrid quantum-classical computational graphs can be obtained by backpropagation (automatic differentiation) when simulating quantum circuits, or by using techniques for backpropagation through black boxes [56], or hybrid quantum-classical backpropagation [28, 30].

To choose an architecture for the optimizer network, we interpret the QNN parameters and cost function evaluations over multiple quantum-classical iterations as a sequence-to-sequence learning problem. A canonical choice of neural network architecture for processing such sequential data is a recurrent neural network (RNN) [57, 58]. Generally speaking, a recurrent neural network is a network where, for each item in a sequence, the network accepts an input vector, produces an output vector, and potentially keeps some data in memory for use with subsequent items. The computational graph of an RNN usually consists of many copies of the network, each sharing the same set of parameters, and each representing a time step. The recurrent connections, which can be interpreted as self-connections representing the data flow over time, can be represented as connections between copies of the network representing subsequent time steps. In this way, the computational graph can then be pictured in an unrolled form, as depicted in Figure 2. A particular type of RNN architecture which has had demonstrable successes over other RNN architectures is the Long Short Term Memory Network (LSTM) [59]. The LSTM owes its successes to its internal tunable mechanisms which, as its name implies, allow it to identify both long-term and short-term dependencies in the data.

The meta-learning neural network architecture used in this paper is depicted in Figure 2. There, an LSTM

recurrent neural network is used to recursively propose updates to the QNN parameters, thereby acting as the classical black-box optimizer for quantum-classical optimization. At a given iteration, the RNN receives as input the previous QNN query's estimated cost function expectation $y_t \sim p(y|\theta_t)$, where $y_t$ is the estimate of $\langle\hat{H}\rangle_t$, as well as the parameters for which the QNN was evaluated, $\theta_t$. The RNN at this time step also receives information stored in its internal hidden state from the previous time step, $h_t$. The RNN itself has trainable parameters $\varphi$, and hence it applies the parameterized mapping

\[ h_{t+1}, \theta_{t+1} = \mathrm{RNN}_\varphi(h_t, \theta_t, y_t) \tag{1} \]

which generates a new suggestion for the QNN parameters as well as a new hidden state. Once this new set of QNN parameters is suggested, the RNN sends it to the QPU for evaluation and the loop continues. Note that this specific meta-learning architecture was adapted from previous work [34], where the task of 'learning to learn without gradient descent by gradient descent' was considered.

The QNN architectures we chose to focus on were part of a class of ansatze known as Quantum Alternating Operator Ansatze [10], which are generalizations of the Quantum Approximate Optimization Algorithm [3]. These ansatze can be interpreted as a method for variationally-optimized bang-bang-controlled quantum-dynamical evolution in an energy landscape [3, 28, 60, 61]. In this case, the QNN's variational parameters are the control parameters of the dynamics, and by appropriately tuning these parameters via quantum-classical optimization, one can cause the wavefunction to effectively descend the energetic landscape towards lower-energy regions. In recent works [28, 61], explicit connections between these quantum-dynamical control parameters and the hyperparameters of gradient descent algorithms were established for the continuum-embedded variants of QAOA. Thus, one can interpret these QAOA parameters as analogous to the hyperparameters of a quantum form of energetic landscape descent. In a recent work in classical meta-learning [35], the authors use an RNN to control the hyperparameters of a neural network gradient-based training algorithm, drastically improving the neural network's training time and quality of fit when compared to stochastic gradient descent. It is thus natural to consider using an RNN to optimize QAOA parameters, given the above intuition about their analogous relation to gradient descent hyperparameters. Another reason for choosing this particular type of QNN is that the number of variational parameters in these ansatze does not depend on the system size, only on the number of alternating operators. This means we can use this approach to train our RNN on smaller QNN instance sizes for certain problems, then test for generalization on larger QNN instance sizes.

Before we dive further into the specific quantum-classical meta-learning experiments, we will provide further details on how one would train an RNN to optimize such QNNs.

1. Meta-Training & Loss functions

The objective of quantum-classical meta-learning is to train our RNN to learn an efficient parameter update scheme for a family of cost functions of interest, i.e., to discover an optimizer which efficiently optimizes a certain distribution of optimizees, on average. We consider an efficient optimizer to be one which finds sufficiently optimal approximate local minima of cost functions in as few function queries as possible. What qualifies as sufficiently optimal will depend on the class of problems at hand and the domain of application of interest.

In the original work by DeepMind [34], the neural optimizer was to be used as a general optimizer; few or no assumptions were made about the optimizee (the network being optimized) apart from the dimension of the parameter space. The optimizer network was to be trained on one 'data set' of optimizees, yet had to be applicable to a wide array of optimizees previously unseen. To learn such a general optimization strategy, the optimizer RNN was trained on random instances of a fairly general distribution of functions, namely, functions sampled from Gaussian processes [62].

Since we are focused on QNN optimization landscapes, which are known to differ from classical Gaussian process optimization landscapes [29], we instead aimed to train specialized neural optimizers that are tailored to specific classes of problems and QNN ansatze. To explore how effective this is, we trained our RNN on random QNN instances within a targeted class of problems, namely QAOA and VQE, and tested the trained network on (larger) previously unseen instances from their respective classes. In Section III, we describe in greater detail the various classes of problems and corresponding ansatze which were considered for training and testing.

Given a distribution of optimizees of interest, we must choose an adequate meta-learning loss function $\mathcal{L}(\varphi)$ with respect to which we will want to optimize the RNN parameters $\varphi$. For a given QNN with cost function $f(\theta) = \langle\hat{H}\rangle_\theta$, we know that the RNN's meta-learning loss function $\mathcal{L}(\varphi)$ will generally be dependent on the estimated cost function (quantum expectation estimate) history $\{\mathbb{E}_{f,y}[f(\theta_t)]\}_{t=1}^{T}$, but there is some flexibility in choosing exactly what this dependence is. Choosing the appropriate meta-learning loss function for the task at hand can be tricky, and depends on the particular application of the QNN. To be most general, we will want to pick a loss function which can learn to rapidly find optima of the parameter landscape yet is still constantly driven to find higher quality optima.

A simple choice of meta-learning loss function would be to use the expected final cost value at the end of the optimization time horizon, averaged over our samples from our distribution of functions $f$, i.e., $\mathcal{L}(\varphi) = \mathbb{E}_{f,y_T}[f(\theta_T)]$. In practice, this is a sparse signal, and would require

backpropagation through a large portion of the computational graph for the loss signal to reach early portions of the RNN graph. A practical option of the same vein is the cumulative regret, which is simply the sum of the cost function history uniformly averaged over the time horizon, $\mathcal{L}(\varphi) = \sum_{t=1}^{T} \mathbb{E}_{f,y}[f(\theta_t)]$. This is a better choice as the loss signal is far less sparse, and the cumulative regret is a proxy for the minimum value achieved over the optimization history. In practice, this loss function may not be optimal as it will prioritize rapidly finding an approximate optimum and staying there. What is needed instead is a loss function that encourages exploration of the landscape in order to find a better optimum. The loss function we chose for our experiments is the observed improvement at each time step, summed over the history of the optimization:

\[ \mathcal{L}(\varphi) = \mathbb{E}_{f,y}\left[ \sum_{t=1}^{T} \min\{ f(\theta_t) - \min_{j<t}[f(\theta_j)],\ 0 \} \right]. \tag{2} \]

The observed improvement at time step $t$ is given by the difference between the proposed value, $f(\theta_t)$, and the best value obtained over the past history of the optimization until that point, $\min_{j<t}[f(\theta_j)]$. If there is no improvement at a given time step then the contribution to the loss is nil. However, a temporary increase of the cost function followed by a significant improvement over the historical best will be rewarded rather than penalized (in contrast to the behavior of the cumulative regret loss).

In order to train the RNN, we need to differentiate the above loss function $\mathcal{L}(\varphi)$. One option to achieve this is via backpropagation of gradients through the unrolled RNN graph (depicted in Fig. 2). This approach is called backpropagation through time, and can be tricky to scale to arbitrarily deep networks due to vanishing/exploding gradient problems [58]. For practical purposes, a small time horizon is preferable, as it limits the complexity of the training of the RNN optimizer and avoids the pathologies of backpropagation through long time horizons. Since our loss function $\mathcal{L}(\varphi)$ is dependent on the QNN evaluated at multiple different parameter values $\{\theta_t\}_{t=1}^{T}$, in order to perform backpropagation through time we need to backpropagate gradients through multiple instances of the QNN.

As our approach was backpropagation-based, to avoid problems of gradient blowup and to minimize the complexity of training, we keep a small time horizon for our numerical experiments featured in Section III C. As such, our RNN optimizer is intended to only run for a fixed number of iterations, and will be used as an initializer for other optimizers that perform local search. In principle, one could let the RNN optimize over more iterations at inference time than it was originally trained for, though the performance for later iterations may suffer. In our case the output of the RNN optimizer after a fixed number of iterations is used to initialize the parameters of the QNNs near a typical optimal set of parameters. It has been observed [32], and in some cases formally proven [63], that QAOA-like ansatze have a concentration of optimal parameters. Thus, the neural optimizer is used to learn a problem-class-specific initialization heuristic, and the fine-tuning is left for other optimizers. Since the neural optimizer would eventually learn a local heuristic for the fine-tuning, the added complexity cost of training for long time horizons is not justified by corresponding improvements in optimization efficiency, and we find that the combination of the neural optimizer as initializer and a greedy heuristic such as Nelder-Mead works quite well in practice, as shown in Figure 3. Now, let us cover which ansatze we applied our RNN to learn to optimize.

III. NUMERICAL EXPERIMENTS

In this section, we provide a brief overview of the quantum neural network ansatze and problem instances considered for the hybrid meta-learning numerical experiments (results presented in Section III C). We trained and tested different 'specialist' RNN optimizers for each of these three problem classes: quantum approximate optimization for MaxCut (MaxCut QAOA), quantum approximate optimization for Sherrington-Kirkpatrick Ising models [32] (Ising QAOA), and a Trotter-based variational quantum eigensolver ansatz for the Hubbard model [39] (Hubbard VQE). We provide a brief introduction to each of these three classes, and describe the distribution of instances from which we sampled to generate training and testing instances.

A. Quantum Approximate Optimization Algorithms

Let us first introduce a general QAOA ansatz before we specialize to applications to MaxCut problems and Ising (Sherrington-Kirkpatrick; SK) Hamiltonians. The goal of the QAOA is to prepare low-energy states of a cost Hamiltonian $\hat{H}_C$, which is usually diagonal in the computational basis. To achieve this, we typically begin in an eigenstate of a mixer Hamiltonian $\hat{H}_M$, which does not commute with the cost Hamiltonian; $[\hat{H}_C, \hat{H}_M] \neq 0$. Applied onto this initial state is a sequence of exponentials of the form

\[ \hat{U}(\theta) = \prod_{j=1}^{P} e^{-i\theta_m^{(j)} \hat{H}_M} e^{-i\theta_c^{(j)} \hat{H}_C}, \tag{3} \]

where $\theta = \{\theta_m, \theta_c\}$ are variational parameters to be optimized. Note that in the above and throughout this paper we will use the operator product notation convention where $\prod_{j=1}^{M} \hat{U}_j = \hat{U}_M \cdots \hat{U}_1$. The objective function for this optimization is simply the expectation of the cost Hamiltonian after applying $\hat{U}(\theta)$ to the initial state. This sequence of exponentials is the quantum alternating operator ansatz [3, 10]. This is an algorithm which is well-suited for the NISQ era as the number of

gates scales linearly with $P$, and the exponentials of $\hat{H}_M$ and $\hat{H}_C$ are usually easy to compile without any need for approximation via Trotter-Suzuki decomposition [64]. The cost Hamiltonian is typically a sum of terms that are diagonal in the computational basis and often simple to compile. Furthermore, there is no need to split the quantum expectation estimation over multiple runs in order to estimate the various terms; each repetition yields an estimate of the cost function directly.

Now that we have introduced the general QAOA approach, we can explore the specialization of the QAOA to two specific domains of application; namely, MaxCut QAOA and QAOA for Sherrington-Kirkpatrick Ising models.

1. MaxCut QAOA

The problem for which the QAOA was first explored was MaxCut [3]. Let us first provide a brief introduction to the MaxCut problem. Suppose we have a graph $G = \{V, E\}$ where $E$ are the edges and $V$ the vertices. Given a partition of these vertices into a subset $P_0$ and its complement $P_1 = V \setminus P_0$, the corresponding cut set $C \subseteq E$ is the subset of edges that have one endpoint in $P_0$ and the other endpoint in $P_1$. The maximum cut (MaxCut) for a given graph $G$ is the choice of $P_0$ and $P_1$ which yields the largest possible cut set. Finding this partition is well known to be an NP-Complete problem in general.

To translate this problem to a quantum Hamiltonian, we can assign a qubit to each vertex $j \in V$. The computational basis states of these qubits can then be used as binary labels to indicate which partition each qubit is in, i.e., if the qubit $j$ is in the state $|l\rangle_j$, $l \in \{0, 1\}$, we assign it to the partition $P_l$. We can evaluate the size of a cut by counting how many edges have endpoints in different partitions. In order to do this counting, we can compute the XOR of the bit values for the endpoints of each edge and add up these clauses. This cut cardinality can thus be encoded into the cost Hamiltonian for the QAOA as follows:

\[ \hat{H}_C = \sum_{\{j,k\} \in E} \tfrac{1}{2} (\hat{I} - \hat{Z}_j \hat{Z}_k). \tag{4} \]

bility of yielding a bitstring corresponding to a partition of large cut size [3].

In order to train and test the RNN optimizer on MaxCut QAOA problems, we generated random problem instances in the following fashion: we first fixed an integer $n$, and then sampled an integer $k$ uniformly from the range $[3, n-1]$. Finally, we drew a random graph from $G_{n,p}$ with $p = k/n$ and constructed the corresponding MaxCut QAOA QNN of the form of (3) for $P = 2$. Note that a random $G_{n,p}$ graph is a graph on $n$ nodes where an edge between any two nodes is added independently with probability $p$. To generate training data, we uniformly sampled $n \in [6, 9]$, yielding QNN system sizes of at most nine qubits. To train the RNN, 10000 sampled instances from this training set were used. To generate our testing data, we fixed $n = 12$, yielding QNN system sizes of 12 qubits, and sampled 50 instances using the procedure described above.

It has been observed that for random 3-regular graphs, at fixed parameter values of the QAOA ansatz, the expected value of the cost function $\langle\hat{H}_C\rangle_\theta$ concentrates [63]. Our results displayed in Figures 3 and III C corroborate this finding while operating on a slightly broader ensemble of random graphs. This is made clear by noting that initially the MaxCut QAOA has a much narrower 95% confidence interval across problem instances, regardless of optimization algorithm, when compared to Ising (SK) QAOA.

2. Ising QAOA

Another domain of application where we tested quantum-classical meta-learning was the QAOA for finding low-energy states of a type of Ising spin glass model known as the Sherrington-Kirkpatrick (SK) model. Many problems in combinatorial optimization can be mapped to these models [65] (for example, training Boltzmann machine neural networks [24, 66]). In general, finding the lowest energy state of such models is known to be NP-Hard. Using the QAOA, we aim to find low-energy states of an SK Ising Hamiltonian on the graph $G = \{V, E\}$, which has the form

\[ \hat{H}_C = \frac{1}{\sqrt{n}} \sum_{\{j,k\} \in E} J_{jk} \hat{Z}_j \hat{Z}_k + \sum_{j \in V} h_j \hat{Z}_j \tag{5} \]

Now, for our choice of mixer, the standard where n = |V| is the number of vertices, and Jjk and
P choice is the
sum of Pauli X̂ on each qubit, ĤM = j∈V X̂j . This is hj are coupling and bias coefficients. For our numeri-
a good choice as each term is non-commuting with the cal experiments we considered only the case of the fully
cost Hamiltonian and it is easy to exponentiate with min- connected model where G is the complete graph. Like
imal gate depth. The standard choice of initial state is the MaxCut QAOA, the choice of mixer Hamiltonian
the uniform superposition over computational bitstrings is the Psum of the transverse field on all the qubits,
⊗|V| ĤM = j X̂j , and the initial state is chosen as a uniform
|+i , which is an eigenstate of the mixer Hamiltonian.
⊗n
We can now construct our ansatz following (3) by choos- superposition over all computational basis states |+i .
ing some value for P and substituting in our MaxCut The parametric ansatz is once again in the form of a regu-
ĤC and ĤM . By applying and variationally optimiz- lar QAOA (as in (3)), now with the SK Ising Hamiltonian
ing the QAOA, one obtains a wavefunction which, when (5) as the cost Hamiltonian. In similar fashion to Max-
measured in the computational basis, has a high proba- Cut, when we optimize the parameters for this QAOA,
we obtain a wavefunction which, when measured in the computational basis, will yield a bit string corresponding to a spin configuration with a relatively low energy.

Let us now outline the methods used to generate the training and testing data for the RNN specializing in the optimization of Ising QAOA ansatz parameters. To generate random instances of Ising QAOA, we sampled random values of $J_{jk}$, $h_j$ and n. For both the training and testing data, after drawing a value for n, the parameters $J_{jk}$ and $h_j$ were drawn from independent Gaussian distributions with zero mean and unit variance. Finally, we constructed the corresponding Ising QAOA QNN ansatze of the form (3) with P = 3 for the sampled Hamiltonian. For the training instances, we sampled the number of qubits uniformly from n ∈ [6, 8], yielding QNN system sizes of at most 8 qubits. The training set consisted of 10000 instances from the above described distribution. For testing, we drew 50 samples uniformly from n ∈ [9, 11]; thus testing was done with strictly larger instances than those contained in the training set.

B. Variational Quantum Eigensolvers

1. Hubbard Model VQE

Here we describe the variational quantum eigensolver (VQE) ansatze that were used to generate the results in Fig. 3. The specific class of VQE problems we chose to consider was the variational preparation of ground states of Hubbard model lattices [39]. The Hubbard model is an idealized model of fermions interacting on a lattice. The 2D Hubbard model has a Hamiltonian of the form $\hat{H} = \hat{T}_h + \hat{T}_v + \hat{V}$, where $\hat{T}_h$ and $\hat{T}_v$ are the horizontal and vertical hopping terms and $\hat{V}$ a spin interaction term; more explicitly,

$$\hat{H} = -t \sum_{\langle i,j\rangle,\sigma} \left(\hat{a}^\dagger_{i,\sigma} \hat{a}_{j,\sigma} + \hat{a}^\dagger_{j,\sigma} \hat{a}_{i,\sigma}\right) + U \sum_i \hat{a}^\dagger_{i,\uparrow} \hat{a}_{i,\uparrow} \hat{a}^\dagger_{i,\downarrow} \hat{a}_{i,\downarrow} \tag{6}$$

where the $\hat{a}_{j,\sigma}$ and $\hat{a}^\dagger_{j,\sigma}$ are annihilation and creation operators on site j with spin σ ∈ {↑, ↓}. The goal of the VQE is to variationally learn a parametrized circuit which prepares the ground state of the Hubbard Hamiltonian from (6), or at least an approximation thereof. Our variational ansatz to prepare these approximate ground states is based on the Trotterization of the time evolution under the Hubbard model Hamiltonian; it is of the form

$$\hat{U}(\theta) = \prod_{j=1}^{P} e^{-i\theta_h^{(j)} \hat{T}_h} e^{-i\theta_v^{(j)} \hat{T}_v} e^{-i\theta_U^{(j)} \hat{V}} \tag{7}$$

where $\theta = \{\theta_h, \theta_v, \theta_U\}$ are the variational parameters for the P Trotter steps. The exponentials at each step are done using a single fermionic swap network [39]. This is similar to the ansatz used in [6] but corresponds to a different order of simulation of the terms.

Let us provide more details as to our choices of parameters used to generate the results from Figure 3. We used an ansatz consisting of P = 5 steps, where each step introduced 3 parameters. For our initial state, we use an eigenstate of the kinetic term with the correct particle number and the same total spin as the ground state, and we study the model at half-filling. We set t = 1.0 for all instances, and this defines our units of energy. Our training data consists of 10000 instances with the lattice system size chosen to be either n = 2 × 2 or n = 3 × 2 with equal probability and with U chosen from a uniform distribution on the domain [0.1, 4.0]. After training, we tested the neural network on instances with system size n = 4 × 2, again strictly larger than our training set.

C. Meta-learning Methods & Results

In this section we present the main results of our quantum-classical meta-learning experiments, displayed in Figure 3, and discuss some additional details of the methods used to produce these results. We trained and tested a set of long short-term memory (LSTM) recurrent neural networks (RNN) to learn to optimize the variety of QNN instances discussed in sections III A and III B, namely, MaxCut QAOA, Ising QAOA and Hubbard VQE. For each of the three problem classes, the RNN was trained using 10000 problem instances. This training of the RNN was executed over a maximum of 1000 epochs, each with a time horizon of 10 iterations. Hence, training required the simulation of inference for at most 1 million quantum neural networks. In most cases the meta-training was stopped well before these 1000 epochs were completed, following standard early-stopping criteria [68].

The quantum circuits used for training and testing the recurrent neural network were executed using the Cirq quantum circuit simulator [69] running on a classical computer. The VQE ansatze were built using OpenFermion-Cirq [67]. Neural network training and inference was done in TensorFlow [70], using code adapted from previous work by DeepMind [34].

For both testing and training, we squashed the readout of the cost function by a quantity which bounds the operator norm of the Hamiltonian. This was done to ensure a normalized loss signal for our RNN across various problem instances. In classical machine learning, normalizing data variance is well known to accelerate and ameliorate training [71]. In the same spirit, we fed the RNN a cost function squashed according to the Pauli coefficient norm, denoted $\|\cdot\|_*$. Recall that for a Hamiltonian $\hat{H}$ decomposed as a linear combination of Paulis, $\hat{H} = \sum_j \alpha_j \hat{P}_j$, this norm is $\|\hat{H}\|_* \equiv \|\alpha\|_1 = \sum_j |\alpha_j|$. The squashed cost function is then simply the regular expectation value of the Hamiltonian, divided by the Pauli coefficient norm, $\bar{f}(\theta) = \langle \hat{H} \rangle_\theta / \|\hat{H}\|_*$. As all Paulis have a spectrum of {±1} we are guaranteed that the squashed cost function $\bar{f}(\theta)$ has its range contained in [−1, 1]. In
[Figure 3: two rows of three panels each, titled MaxCut QAOA, Ising QAOA, and VQE Hubbard Model. x-axis: Objective queries (0 to 300); y-axis: Relative error. Legend: GPR, NM - Rnd. Seed, NM - Heur. Seed, NM - LSTM Seed, LSTM, LSTM Cutoff.]

Figure 3. Displayed above are the average relative errors with respect to the number of objective function queries during the training of 50 random problem instances for the three classes of problems of interest, QAOA for MaxCut (left), QAOA for Ising models (middle), and VQE for the Hubbard model (right), for various choices of optimizers and initialization heuristics. The problem instances were sampled from the testing distribution described in sections III A and III B. The optimizers include a Gaussian Process Regression (GPR) optimizer [62], and Nelder-Mead (NM) [41] with various initialization heuristics. The first of these initialization heuristics was the best of 10 random guesses (Rnd. Seed). Also presented is NM initialized with application-specific heuristic seeds (Heur. Seed), which consisted of the adiabatic heuristic for VQE [67] and the mean optimal parameters of the training set for QAOA [63], and finally NM initialized with the seed from our meta-learned neural optimizer (LSTM). We cut off the LSTM after 10 iterations, as it is used mainly as an initializer for other optimizers. Note that we have not included the overhead of the meta-training in this plot; see the main text for a breakdown of the overhead for the training of the LSTM. The top row is for noiseless readout of the expectation, while the bottom row has Gaussian noise with variance 0.05 added to the expectation value readouts, thus emulating approximate estimates of expectation values. For reference, for the set of testing instances, given a QPU inference repetition rate of 10 kHz, the necessary wall clock time per objective query to achieve this variance [6] in the cost estimate is at most (70 ± 40) seconds for MaxCut QAOA, (2.3 ± 0.6) seconds for Ising QAOA, and (30 ± 20) seconds for Hubbard VQE. Note that the relative error is the difference in the squashed cost function relative to the squashed global optimum found through brute force methods, i.e., $\bar{f}_{\mathrm{rel}}(\theta) = \bar{f}(\theta) - \min_\theta \bar{f}(\theta)$. Error bars represent the 95% confidence interval for the random testing instances from the distribution of problems described in sections III A and III B.
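The quantities that enter Figure 3 (Pauli coefficient norm, squashed cost, relative error, noisy readout, and wall clock time per query) can be illustrated with a short Python sketch. All function names and the toy coefficients below are hypothetical and chosen only for illustration; the expectation value is taken as a given number rather than computed from a circuit.

```python
import math
import random


def pauli_coefficient_norm(coeffs):
    """L1 norm of the Pauli coefficients alpha_j of H = sum_j alpha_j P_j."""
    return sum(abs(a) for a in coeffs)


def squashed_cost(expectation, coeffs):
    """Expectation value divided by the Pauli coefficient norm; lies in [-1, 1]."""
    return expectation / pauli_coefficient_norm(coeffs)


def relative_error(f_bar, f_bar_min):
    """Difference between a squashed cost and the squashed global optimum."""
    return f_bar - f_bar_min


def noisy_readout(expectation, rng, variance=0.05):
    """Emulate finite-sample estimation with Gaussian noise of the given variance."""
    return rng.gauss(expectation, math.sqrt(variance))


def wall_clock_seconds(repetitions, rate_hz=10_000):
    """Time for one objective query at a fixed QPU repetition rate."""
    return repetitions / rate_hz


if __name__ == "__main__":
    coeffs = [0.5, -1.0, 0.25]           # hypothetical Pauli coefficients
    f_bar = squashed_cost(-1.2, coeffs)  # hypothetical expectation value
    print("squashed cost:", f_bar)
    print("relative error:", relative_error(f_bar, -0.8))
    print("noisy sample:", noisy_readout(-1.2, random.Random(0)))
    # 7e5 repetitions at 10 kHz reproduces the 70 s per query quoted for MaxCut QAOA
    print("seconds per query:", wall_clock_seconds(7e5))
```

With the mean repetition counts quoted in the main text, the last helper reproduces the per-query times stated in the caption (for example, 7 × 10^5 repetitions at 10 kHz is 70 seconds).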

Figure 3, we plot the relative error, which is the difference in the squashed cost function relative to the globally optimal squashed cost function value found through brute force methods, $\bar{f}_{\mathrm{rel}}(\theta) = \bar{f}(\theta) - \min_\theta \bar{f}(\theta)$. The brute force optimization methods were basin hopping for the QAOA instances [72], and exact diagonalization for the VQE instances.

For the testing of the trained LSTM, we used randomly sampled instances from the distributions of ansatze described in sections III A and III B. In all cases, the testing instances were for larger-size systems than those used for training, while keeping the number of variational parameters of the ansatze fixed. Note that as all the ansatze considered in this paper were QAOA-like, one can scale the size of the system while keeping the same number of parameters. This is an important feature of this class of ansatze as our LSTM is trained to optimize ansatze of a fixed parameter space dimension.

For all instances, the LSTM was trained on a time horizon of T = 10 quantum-classical iterations, using the observed improvement (2) as the meta-learning loss function. We trained the LSTM on noiseless quantum circuit simulations in Cirq [69]. Note that training of each of the three LSTM networks already required the simulation of 1 million quantum circuit executions with the chosen time horizon of 10 iterations, and that the number of quantum circuit simulations scales linearly with the time horizon. Additionally, gradient-based training required backpropagation through time for the temporal hybrid quantum-classical computational graph, which added further linearly-scaling overhead. Thus, we chose a short time horizon to minimize the complexity of the training. For reference, 10 iterations is a significantly smaller number of quantum-classical optimization iterations than what is typically seen in previous works on QNN optimization [46]. The typical number of iterations required by other optimizers is usually on the order of hundreds to possibly thousands to reach a comparable
optimum of the parameter landscape.

Although the LSTM reaches a good approximate optimum in these 10 iterations, some applications of QNN’s such as VQE require further optimization as a high-precision estimate of the cost function is desired. Thus, instead of simply using the LSTM as an optimizer for an extended time horizon, we used the LSTM as a few-iteration initializer for Nelder-Mead (NM). This was done to minimize the complexity of training the RNN and avoid the instabilities of longer training horizons where the RNN would most likely learn a local method for fine-tuning its own initial guess. A longer time horizon would thus most likely not have provided a significant gain in performance, all the while substantially increasing the cost of training.

We tested the robustness of the RNN optimizer by comparing its performance to other common optimizers, both in the case where Gaussian noise was added to the cost function evaluations, and in an idealized noiseless readout case. This additional Gaussian noise can be interpreted as a means to emulate the natural noise of quantum expectation estimation with a finite number of measurement runs [9]. Figure 3 allows for comparison of noisy and noiseless inference (QNN optimization) for the trained LSTM versus alternative optimization and initialization heuristics. For the noisy tests, the expectation samples obeyed a normal distribution of variance 0.05; thus the cost function estimates were drawn according to $y_t \sim \mathcal{N}(\langle \hat{H} \rangle_{\theta_t}, 0.05)$ for the results presented in Figure 3. For the testing instances used to generate the results presented in Figure 3, following a standard prescription [6] for the number of repetitions required to guarantee an upper bound to the variance of 0.05, the number of repetitions (QNN inference runs) should be (7 ± 4) × 10^5 for MaxCut QAOA, (2.3 ± 0.6) × 10^4 for Ising QAOA, and (3 ± 2) × 10^5 for Hubbard VQE. In terms of wall clock time, assuming that the QPU can execute 10000 repetitions (each consisting of a quantum circuit execution, multi-qubit measurement, and qubit resetting) per second, for the distribution of testing instances, the total time needed for the LSTM to perform its 10 optimization steps is in the range of (700 ± 400) seconds for MaxCut QAOA, (23 ± 6) seconds for Ising QAOA, and (300 ± 200) seconds for Hubbard VQE. Note that the standard deviation here is due to the variations in the Pauli norm of the Hamiltonians for the sampled instances of the test set.

Apart from this added cost function noise, the quantum circuit executions were simulated without any other form of readout or gate execution noise. Plotted in Figure 3 are the 95% confidence intervals for the optimization of the 50 testing instances which were sampled according to the testing distributions described in Sec. III A and Sec. III B. Our results show that the neural optimizer learns initialization heuristics for the QAOA and VQE parameters which generalize across problem sizes. We discuss these results in further detail in the following section.

Let us provide a description of the alternative optimization and initialization heuristics used to generate Figure 3. The first alternative strategy was Bayesian optimization using Gaussian processes [62]; here the initial parameters are set to nil, as was the case for the LSTM optimizer. Next, in order to compare the LSTM to other initialization heuristics, we compared NM initialized at parameter values found from the best of 10 random guesses (Random Seed), NM initialized using some state of the art heuristics for QAOA and VQE (Heuristic Seed), and NM initialized after 10 iterations of the LSTM (LSTM Seed). The application-specific heuristic seeds (Heur. Seed) were the adiabatic heuristic for VQE [67], where the variational parameters are scaled in a similar fashion to an adiabatic interpolation across the 5 steps, while for the QAOA the parameters were initialized at the mean value of the optimal parameters for the training set distribution of problem instances. As was shown in [63], since there is a concentration of the cost function for fixed parameters, one can expect the distribution of optimal parameters of the QAOA to be concentrated around some mean.

In Figure 4, we compare the Euclidean distance in parameter space between the output of the 10 iterations of the LSTM versus other initialization heuristics. We clearly see that the LSTM optimizer initializes the QNN parameters closer to the optimal parameters of each test instance, on average, as compared to other methods. We see that in the case of the QAOA, the constant fit heuristic [63] for the training instances yields a cluster of parameters that is not clustered around the optimal parameters of the larger instance, while the LSTM output parameters are significantly closer to the globally optimal parameters found by brute force. This shows a clear separation between the parameters obtained from a constant fit of the training set versus the LSTM’s adaptive scheme for optimizing parameters in few iterations.

IV. DISCUSSION

As shown in Figure 3, our trained neural optimizer reaches a higher-quality approximate optimum of the parameters in 10 iterations than other optimizers can manage in hundreds, both for noisy and noiseless readout. This is most evident in the case of VQE, where the local optimizers can have severe difficulty optimizing parameters when given noisy evaluations of the cost function. Of all alternatives to the neural optimizer, the probabilistic approach of Bayesian optimization via Gaussian process regression was the best performer.

In all six settings, the LSTM rapidly finds an approximate minimum within its restricted time horizon of 10 iterations. The neural optimizer needs to initialize the parameters in a basin of attraction of the cost function landscape so that a local optimizer can then easily converge to a local optimum in fewer iterations and more consistently. As we can see across all cases, the Nelder-
Mead runs that are initialized by the LSTM rather than other initialization heuristics tend to reach the highest quality optimum with much lower variance in performance. These results show that the LSTM initializes the parameters in a good basin of attraction near the optimum. Although the LSTM initialization helps in all cases, we can see that for the noisy VQE case in Figure 3, the Nelder-Mead approach struggles to improve upon the guess of the LSTM. While not the focus of this work, these results point towards the need for further investigation into better local optimization techniques which are robust to noise [73].

[Figure 4: three histogram panels, titled MaxCut QAOA, Ising QAOA, and VQE Hubbard Model. x-axis: Distance to optimum (0.0 to 1.0); y-axis: Frequency. Legend: LSTM, Heuristic.]

Figure 4. The above histograms represent the parameter space Euclidean distance to the true optimum $d(\theta) = \|\theta - \theta^*\|_2$, immediately after initialization, for both the LSTM (after 10 iterations) and alternative initialization heuristics. For each of the three cases, 50 samples of the test set were used. For the QAOA, this alternative initialization heuristic was identical to [63], whereas for VQE, this was the Adiabatic Heuristic [67]. The histograms were collected from the test set problem instances, which were described in Sections III A and III B, and were used to generate the results presented in Figure 3. We see that the LSTM initializes parameters significantly closer to the optimum on average as compared to alternative heuristics.

When looking at the optimal parameters found by the RNN (see Fig. 4), we observed a certain degree of concentration in their values, similar to what has been observed in previous works [32, 63]. The MaxCut QAOA was the most concentrated, which corroborates recent observations [63]. Similarly, the optimal parameters of the Ising QAOA were also observed to have a degree of concentration, as was also observed in previous work [32]. Finally, the VQE had the least amount of concentration of the three problem classes, but still exhibited some degree of clustering in the optimal values.

The concentration of parameters is not surprising given the connections between QAOA-like ansatze and gradient descent/adiabatic optimization [28, 60]. In a sense, these QAOA-like QNN’s are simply variational methods to descend the energy landscape, and the variational parameters can be seen as energetic landscape descent hyperparameters, akin to gradient descent parameters. Similar to how the classical meta-learning approaches to gradient-based optimization converged onto methods comparable to best-practices for hyperparameter optimization (e.g., comparable to performance of AdaGrad and other machine learning best-practice heuristics), the neural optimizer in our case found a neighborhood of optimal hyperparameters and learned a heuristic to quickly adjust these parameters on a case-by-case basis.

Although it may seem that this meta-learning method is costly due to its added complexity over regular optimization, one must remember that, for the time being, classical computation is still much cheaper than quantum computation. As the optimization scheme generalizes across system sizes, one may imagine training an LSTM to optimize a certain ansatz for small system sizes by quantum simulation on a classical computer, then using the LSTM to rapidly initialize the parameters for a much larger instance of the same class of problem on a quantum computer. This approach may be well worth the added classical computation, as it can reduce the number of required runs to get an accurate answer on the QPU by an order of magnitude or more.

V. CONCLUSION & OUTLOOK

In this paper, we proposed a novel approach to the optimization of quantum neural networks, namely, using meta-learning and a classical neural network optimizer. We tested the performance of this approach on a set of random instances of variational quantum algorithm optimization tasks, which were the Quantum Approximate Optimization Algorithm applied to MaxCut and Sherrington-Kirkpatrick Ising spin glasses, and a set of Variational Quantum Eigensolver ansatze for preparation of ground states of Hubbard models.

The neural network was used to rapidly find a global approximate optimum of the parameters, which then served as an initialization point for other local search heuristics. This combination yielded optima of the quantum neural network landscape which were of a higher quality than alternatives could produce with orders of magnitude more quantum-classical optimization iterations. Furthermore, the neural network exhibited generalization capacity across problem sizes, thus opening up the possibility of classical pre-training of the neural optimizer for inference on larger instances requiring quantum processors.

Two significant challenges of quantum neural network optimization in the NISQ era are finding optimization methods that allow for precise fine-tuning of the parameters to home in on local minima despite the presence of readout noise, and finding good initialization heuristics to allow for more consistent convergence of these local optimizers. Given the results presented in this paper, we
believe that this first challenge has been mitigated by our quantum-classical meta-learning approach, while the second challenge remains open for future work.

In terms of possible extensions of this work, the meta-learning approach could be further improved in several ways. One such way would be to use more recent advances in meta-learning optimizer neural networks [74] which can scale to arbitrary problems and number of parameters. This would extend the capabilities of our current approach to optimizing the parameters of arbitrary QNN’s beyond Trotter-based/QAOA-like ansatze with variable numbers of parameters across instances. Another possible extension of this work would be to meta-learn an optimizer for Quantum Dynamical Descent [28], a quantum generalization of gradient descent which takes the form of a continuous-variable QAOA. As our neural optimizer was tested on various QAOA problems successfully, one could imagine applying it to the optimization of the Quantum Dynamical Descent hyperparameters. The latter could be considered learning to learn with quantum dynamical descent with classical gradient descent. This would also be a way to generalize the applicability of our approach to arbitrary QNN optimization tasks. Finally, as a NISQ-oriented alternative to the latter, one could meta-learn to optimize the hyperparameters for the stochastic quantum circuit gradient descent algorithm recently proposed by Harrow et al. [31]. We leave the above proposed explorations to future work.

VI. ACKNOWLEDGEMENTS

Circuits and neural networks in this paper were implemented using a combination of Cirq [69], OpenFermion-Cirq [67], and TensorFlow [70]. The authors would like to thank Yutian Chen and his colleagues from DeepMind for providing code for the neural optimizer [34] which was adapted for this work, as well as Edward Farhi, Li Li, and Murphy Niu for their insights, observations, and suggestions. MB and GV would like to thank the team at the Google AI Quantum lab for the hospitality and support during their respective internships where this work was completed. GV acknowledges funding from NSERC.


Both authors contributed equally to this work.

[1] J. Preskill, arXiv preprint arXiv:1801.00862 (2018).
[2] E. Farhi and H. Neven, arXiv preprint arXiv:1802.06002 (2018).
[3] E. Farhi, J. Goldstone, and S. Gutmann, arXiv preprint arXiv:1411.4028 (2014).
[4] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien, Nature Communications 5, 4213 (2014).
[5] N. Killoran, T. R. Bromley, J. M. Arrazola, M. Schuld, N. Quesada, and S. Lloyd, arXiv preprint arXiv:1806.06871 (2018).
[6] D. Wecker, M. B. Hastings, and M. Troyer, Phys. Rev. A 92, 042303 (2015).
[7] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Nature 549, 195 (2017).
[8] L. Zhou, S.-T. Wang, S. Choi, H. Pichler, and M. D. Lukin, arXiv preprint arXiv:1812.01041 (2018).
[9] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, New Journal of Physics 18, 023023 (2016).
[10] S. Hadfield, Z. Wang, B. O’Gorman, E. G. Rieffel, D. Venturelli, and R. Biswas, arXiv preprint arXiv:1709.03489 (2017).
[11] E. Grant, M. Benedetti, S. Cao, A. Hallam, J. Lockhart, V. Stojevic, A. G. Green, and S. Severini, npj Quantum Information 4, 65 (2018).
[12] S. Khatri, R. LaRose, A. Poremba, L. Cincio, A. T. Sornborger, and P. J. Coles, Quantum 3, 140 (2019).
[13] M. Schuld and N. Killoran, Physical Review Letters 122, 040504 (2019).
[14] S. McArdle, T. Jones, S. Endo, Y. Li, S. Benjamin, and X. Yuan, arXiv preprint arXiv:1804.03023 (2018).
[15] M. Benedetti, E. Grant, L. Wossnig, and S. Severini, New Journal of Physics 21, 043023 (2019).
[16] B. Nash, V. Gheorghiu, and M. Mosca, arXiv preprint arXiv:1904.01972 (2019).
[17] Z. Jiang, J. McClean, R. Babbush, and H. Neven, arXiv preprint arXiv:1812.08190 (2018).
[18] G. R. Steinbrecher, J. P. Olson, D. Englund, and J. Carolan, arXiv preprint arXiv:1808.10047 (2018).
[19] M. Fingerhuth, T. Babej, et al., arXiv preprint arXiv:1810.13411 (2018).
[20] R. LaRose, A. Tikku, É. O’Neel-Judy, L. Cincio, and P. J. Coles, arXiv preprint arXiv:1810.10506 (2018).
[21] L. Cincio, Y. Subaşı, A. T. Sornborger, and P. J. Coles, New Journal of Physics 20, 113022 (2018).
[22] H. Situ, Z. Huang, X. Zou, and S. Zheng, Quantum Information Processing 18, 230 (2019).
[23] H. Chen, L. Wossnig, S. Severini, H. Neven, and M. Mohseni, arXiv preprint arXiv:1805.08654 (2018).
[24] G. Verdon, M. Broughton, and J. Biamonte, arXiv preprint arXiv:1712.05304 (2017).
[25] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[26] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning, Vol. 1 (MIT press Cambridge, 2016).
[27] J. Schmidhuber, Neural Networks 61, 85 (2015).
[28] G. Verdon, J. Pye, and M. Broughton, arXiv preprint arXiv:1806.09729 (2018).
[29] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, Nature Communications 9 (2018), 10.1038/s41467-018-07090-4.
[30] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran, arXiv preprint arXiv:1811.11184 (2018).
[31] A. Harrow and J. Napp, arXiv preprint arXiv:1901.05374 (2019).
[32] Z.-C. Yang, A. Rahmani, A. Shabani, H. Neven, and C. Chamon, Physical Review X 7, 021027 (2017).
[33] E. Grant, L. Wossnig, M. Ostaszewski, and M. Benedetti, arXiv preprint arXiv:1903.05076 (2019).
[34] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas, arXiv preprint arXiv:1611.03824 (2016).
[35] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas, in Advances in Neural Information Processing Systems (2016) pp. 3981–3989.
[36] B. Zoph and Q. V. Le, arXiv preprint arXiv:1611.01578 (2016).
[37] C. Finn, P. Abbeel, and S. Levine, arXiv preprint arXiv:1703.03400 (2017).
[38] M. Long, Y. Cao, J. Wang, and M. I. Jordan, arXiv preprint arXiv:1502.02791 (2015).
[39] I. D. Kivlichan, J. McClean, N. Wiebe, C. Gidney, A. Aspuru-Guzik, G. K.-L. Chan, and R. Babbush, Phys. Rev. Lett. 120, 110501 (2018).
[40] Z. Jiang, K. J. Sung, K. Kechedzhi, V. N. Smelyanskiy, and S. Boixo, Physical Review Applied 9, 044036 (2018).
[41] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, SIAM Journal on optimization 9, 112 (1998).
[42] G. Nannicini, Physical Review E 99, 013304 (2019).
[43] G. G. Guerreschi and M. Smelyanskiy, arXiv preprint arXiv:1701.01450 (2017).
[44] J. C. Spall, IEEE Transactions on aerospace and electronic systems 34, 817 (1998).
[45] M. J. Powell, Cambridge NA Report NA2009/06, University of Cambridge, Cambridge, 26 (2009).
[46] G. Nannicini, arXiv preprint arXiv:1805.12037 (2018).
[47] J. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, New Journal of Physics 18, 023023 (2016).
[48] In general, one may relay the raw measurement results to the classical processing unit, which can then compute the expectation value of the cost function. For the purposes of this paper, we assumed the classical optimizer only has access to (noisy) estimates of the expectation value of the cost Hamiltonian.
[49] A. Y. Kitaev, A. Shen, M. N. Vyalyi, and M. N. Vyalyi, Classical and quantum computation, 47 (American Mathematical Soc., 2002).
[50] N. C. Rubin, R. Babbush, and J. McClean, New Journal of Physics 20, 053020 (2018).
[51] Y. LeCun, D. Touretzky, G. Hinton, and T. Sejnowski, in Proceedings of the 1988 connectionist models summer school, Vol. 1 (CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988) pp. 21–28.
[52] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature 323, 533 (1986).
[53] J. Romero and A. Aspuru-Guzik, arXiv preprint arXiv:1901.00848 (2019).
[54] A. Nichol and J. Schulman, arXiv preprint arXiv:1803.02999 (2018).
[55] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell, arXiv preprint arXiv:1807.05960 (2018).
[56] C. Audet and M. Kokkolaras, “Blackbox and derivative-free optimization: theory, algorithms and applications,” (2016).
[57] Z. C. Lipton, J. Berkowitz, and C. Elkan, arXiv preprint arXiv:1506.00019 (2015).
[58] R. Pascanu, T. Mikolov, and Y. Bengio, in International Conference on Machine Learning (2013) pp. 1310–1318.
[59] S. Hochreiter and J. Schmidhuber, Neural computation 9, 1735 (1997).
[60] A. Bapat and S. Jordan, arXiv preprint arXiv:1812.02746 (2018).
[61] G. Verdon, J. M. Arrazola, K. Brádler, and N. Killoran, arXiv preprint arXiv:1902.00409 (2019).
[62] C. E. Rasmussen, in Advanced lectures on machine learning (Springer, 2004) pp. 63–71.
[63] F. G. Brandao, M. Broughton, E. Farhi, S. Gutmann, and H. Neven, arXiv preprint arXiv:1812.04170 (2018).
[64] M. Suzuki, Physics Letters A 146, 319 (1990).
[65] T. Kadowaki and H. Nishimori, Physical Review E 58, 5355 (1998).
[66] M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, Physical Review X 8, 021050 (2018).
[67] J. R. McClean, I. D. Kivlichan, D. S. Steiger, Y. Cao, E. S. Fried, C. Gidney, T. Häner, V. Havlíček, Z. Jiang, M. Neeley, et al., arXiv preprint arXiv:1710.07629 (2017).
[68] L. Prechelt, in Neural Networks: Tricks of the trade (Springer, 1998) pp. 55–69.
[69] M. LLC, “Cirq: A python framework for creating, editing, and invoking noisy intermediate scale quantum circuits,” (2018).
[70] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al.
[71] S. Ioffe and C. Szegedy, arXiv preprint arXiv:1502.03167 (2015).
[72] D. J. Wales and J. P. Doye, The Journal of Physical Chemistry A 101, 5111 (1997).
[73] P. O'Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding, et al., Physical Review X 6, 031007 (2016).
[74] O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. Sohl-Dickstein, in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR. org, 2017) pp. 3751–3760.
