Learning To Learn With Quantum Neural Networks Via Classical Neural Networks
We use classical neural networks to assist in the quantum learning process, also known as meta-learning, to rapidly find approximate optima in the parameter landscape for several classes of quantum variational algorithms. Specifically, we train classical recurrent neural networks to find approximately optimal parameters within a small number of queries of the cost function for the Quantum Approximate Optimization Algorithm (QAOA) for MaxCut, QAOA for the Sherrington-Kirkpatrick Ising model, and for a Variational Quantum Eigensolver for the Hubbard model. By initializing other optimizers at parameter values suggested by the classical neural network, we demonstrate a significant improvement in the total number of optimization iterations required to reach a given accuracy. We further demonstrate that the optimization strategies learned by the neural network generalize well across a range of problem instance sizes. This opens up the possibility of training on small, classically simulatable problem instances in order to initialize larger, classically intractable problem instances on quantum devices, thereby significantly reducing the number of required quantum-classical optimization iterations.
backpropagation through a large portion of the computational graph for the loss signal to reach early portions of the RNN graph. A practical option of the same vein is the cumulative regret, which is simply the sum of the cost function history uniformly averaged over the time horizon, L(ϕ) = Σ_{t=1}^{T} E_{f,y}[f(θ_t)]. This is a better choice as the loss signal is far less sparse, and the cumulative regret is a proxy for the minimum value achieved over the optimization history. In practice, this loss function may not be optimal as it will prioritize rapidly finding an approximate optimum and staying there. What is needed instead is a loss function that encourages exploration of the landscape in order to find a better optimum. The loss function we chose for our experiments is the observed improvement at each time step, summed over the history of the optimization:

L(ϕ) = E_{f,y}[ Σ_{t=1}^{T} min{ f(θ_t) − min_{j<t}[f(θ_j)], 0 } ],   (2)

The observed improvement at time step t is given by the difference between the proposed value, f(θ_t), and the best value obtained over the past history of the optimization until that point, min_{j<t}[f(θ_j)]. If there is no improvement at a given time step then the contribution to the loss is nil. However, a temporary increase of the cost function followed by a significant improvement over the historical best will be rewarded rather than penalized (in contrast to the behavior of the cumulative regret loss).
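A minimal sketch of this loss for a single recorded cost history follows (assuming NumPy; the expectation over instances f and readouts y in (2) would simply average this quantity over a batch):

```python
import numpy as np

def observed_improvement_loss(cost_history):
    """Observed-improvement loss of Eq. (2) for a single instance:
    each query is credited only with its improvement over the best
    cost seen so far; non-improving steps contribute zero."""
    costs = np.asarray(cost_history, dtype=float)
    loss = 0.0
    for t in range(1, len(costs)):
        best_so_far = costs[:t].min()             # min_{j<t} f(theta_j)
        loss += min(costs[t] - best_so_far, 0.0)  # only improvements count
    return loss

# A temporary uphill step (0.9) is not penalized, while the later
# improvement from 0.8 down to 0.2 is credited in full:
print(observed_improvement_loss([1.0, 0.8, 0.9, 0.2]))  # -0.8
```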
In order to train the RNN, we need to differentiate the above loss function L(ϕ). One option to achieve this is via backpropagation of gradients through the unrolled RNN graph (depicted in Fig. 2). This approach is called backpropagation through time, and can be tricky to scale to arbitrarily deep networks due to vanishing/exploding gradient problems [58]. For practical purposes, a small time horizon is preferable, as it limits the complexity of the training of the RNN optimizer and avoids the pathologies of backpropagation through long time horizons. Since our loss function L(ϕ) is dependent on the QNN evaluated at multiple different parameter values {θ_t}_{t=1}^{T}, in order to perform backpropagation through time we need to backpropagate gradients through multiple instances of the QNN.
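A minimal sketch of one such unrolled training step is given below, assuming TensorFlow 2 and using a differentiable quadratic toy cost as a stand-in for the simulated QNN expectation; the horizon, cell width, and input encoding are illustrative choices, not the paper's exact architecture:

```python
import tensorflow as tf

T, dim = 10, 2                       # time horizon and parameter count
cell = tf.keras.layers.LSTMCell(32)  # the RNN optimizer
to_theta = tf.keras.layers.Dense(dim)
meta_opt = tf.keras.optimizers.Adam(1e-3)

def toy_cost(theta):  # differentiable stand-in for <H_C>_theta
    return tf.reduce_sum((theta - 1.5) ** 2, axis=-1)

with tf.GradientTape() as tape:
    theta = tf.zeros((1, dim))
    state = cell.get_initial_state(batch_size=1, dtype=tf.float32)
    best = toy_cost(theta)
    loss = 0.0
    for _ in range(T):  # unrolled graph: gradients flow through every step
        rnn_input = tf.concat([theta, toy_cost(theta)[:, None]], axis=-1)
        output, state = cell(rnn_input, state)
        theta = theta + to_theta(output)                  # proposed update
        y = toy_cost(theta)
        loss += tf.reduce_sum(tf.minimum(y - best, 0.0))  # Eq. (2)
        best = tf.minimum(best, y)

variables = cell.trainable_variables + to_theta.trainable_variables
grads = tape.gradient(loss, variables)  # backpropagation through time
meta_opt.apply_gradients(zip(grads, variables))
```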
As our approach was backpropagation-based, to avoid problems of gradient blowup and to minimize the complexity of training, we keep a small time horizon for our numerical experiments featured in Section III C. As such, our RNN optimizer is intended to only run for a fixed number of iterations, and will be used as an initializer for other optimizers that perform local search. In principle, one could let the RNN optimize over more iterations at inference time than it was originally trained for, though the performance for later iterations may suffer. In our case the output of the RNN optimizer after a fixed number of iterations is used to initialize the parameters of the QNN's near a typical optimal set of parameters. It has been observed [32], and in some cases formally proven [63], that QAOA-like ansatze have a concentration of optimal parameters. Thus, the neural optimizer is used to learn a problem-class-specific initialization heuristic, and the fine-tuning is left for other optimizers. Since the neural optimizer would eventually learn a local heuristic for the fine-tuning, the added complexity cost of training for long time horizons is not justified by corresponding improvements in optimization efficiency, and we find that the combination of the neural optimizer as initializer and a greedy heuristic such as Nelder-Mead works quite well in practice, as shown in Figure 3. Now, let us cover which ansatze we applied our RNN to learn to optimize.

III. NUMERICAL EXPERIMENTS

In this section, we provide a brief overview of the quantum neural network ansatze and problem instances considered for the hybrid meta-learning numerical experiments (results presented in Section III C). We trained and tested different 'specialist' RNN optimizers for each of these three problem classes: quantum approximate optimization for MaxCut (MaxCut QAOA), quantum approximate optimization for Sherrington-Kirkpatrick Ising models [32] (Ising QAOA), and a Trotter-based variational quantum eigensolver ansatz for the Hubbard model [39] (Hubbard VQE). We provide a brief introduction to each of these three classes, as well as describe the distribution of instances from these classes from which we sampled to generate training and testing instances.

A. Quantum Approximate Optimization Algorithms

Let us first introduce a general QAOA ansatz before we specialize to applications to MaxCut problems and Ising (Sherrington-Kirkpatrick; SK) Hamiltonians. The goal of the QAOA is to prepare low-energy states of a cost Hamiltonian Ĥ_C, which is usually a Hamiltonian that is diagonal in the computational basis. To achieve this, we typically begin in an eigenstate of a mixer Hamiltonian Ĥ_M, which does not commute with the cost Hamiltonian; [Ĥ_C, Ĥ_M] ≠ 0. Applied onto this initial state is a sequence of exponentials of the form

Û(θ) = ∏_{j=1}^{P} e^{−iθ_m^{(j)} Ĥ_M} e^{−iθ_c^{(j)} Ĥ_C},   (3)

where θ = {θ_m, θ_c} are variational parameters to be optimized. Note that in the above and throughout this paper we will use the operator product notation convention where ∏_{j=1}^{M} Û_j = Û_M · · · Û_1. The objective function for this optimization is simply the expectation of the cost Hamiltonian after applying Û(θ) to the initial state. This sequence of exponentials is the quantum alternating operator ansatz [3, 10]. This is an algorithm which is well-suited for the NISQ era as the number of
gates scales linearly with P, the exponentials of Ĥ_M and Ĥ_C are usually easy to compile without any need for approximation via Trotter-Suzuki decomposition [64]. The cost Hamiltonian is typically a sum of terms that are diagonal in the computational basis and often simple to compile. Furthermore, there is no need to split the quantum expectation estimation over multiple runs in order to estimate the various terms; each repetition yields an estimate of the cost function directly.

Now that we have introduced the general QAOA approach, we can explore the specialization of the QAOA to two specific domains of application; namely, MaxCut QAOA and QAOA for Sherrington-Kirkpatrick Ising models.

1. MaxCut QAOA

The problem for which the QAOA was first explored was MaxCut [3]. Let us first provide a brief introduction to the MaxCut problem. Suppose we have a graph G = {V, E} where E are the edges and V the vertices. Given a partition of these vertices into a subset P_0 and its complement P_1 = V \ P_0, the corresponding cut set C ⊆ E is the subset of edges that have one endpoint in P_0 and the other endpoint in P_1. The maximum cut (MaxCut) for a given graph G is the choice of P_0 and P_1 which yields the largest possible cut set. Finding this partition is well known to be an NP-complete problem in general.

To translate this problem to a quantum Hamiltonian, we can assign a qubit to each vertex j ∈ V. The computational basis states of these qubits can then be used as binary labels to indicate which partition each qubit is in, i.e., if the qubit j is in the state |l⟩_j, l ∈ {0, 1}, we assign it to the partition P_l. We can evaluate the size of a cut by counting how many edges have endpoints in different partitions. In order to do this counting, we can compute the XOR of the bit values for the endpoints of each edge and add up these clauses. This cut cardinality can thus be encoded into the cost Hamiltonian for the QAOA as follows:

Ĥ_C = Σ_{{j,k}∈E} ½ (Î − Ẑ_j Ẑ_k).   (4)

Now, for our choice of mixer, the standard choice is the sum of Pauli X̂ on each qubit, Ĥ_M = Σ_{j∈V} X̂_j. This is a good choice as each term is non-commuting with the cost Hamiltonian and it is easy to exponentiate with minimal gate depth. The standard choice of initial state is the uniform superposition over computational bitstrings, |+⟩^{⊗|V|}, which is an eigenstate of the mixer Hamiltonian. We can now construct our ansatz following (3) by choosing some value for P and substituting in our MaxCut Ĥ_C and Ĥ_M. By applying and variationally optimizing the QAOA, one obtains a wavefunction which, when measured in the computational basis, has a high probability of yielding a bitstring corresponding to a partition of large cut size [3].
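As a concrete illustration, a P = 2 MaxCut QAOA circuit of this form can be written in Cirq as in the sketch below; the exponentiated-gate choices (ZZ and X powers, equivalent to the exponentials of (3) up to global phase and parameter rescaling) are illustrative assumptions rather than the paper's exact construction:

```python
import cirq
import networkx as nx
import sympy

graph = nx.gnp_random_graph(6, 0.5, seed=1)   # one qubit per vertex
qubits = cirq.LineQubit.range(6)
params = sympy.symbols('gamma1 beta1 gamma2 beta2')

circuit = cirq.Circuit(cirq.H.on_each(qubits))  # |+>^n, eigenstate of H_M
for p in range(2):                              # P = 2 alternating layers
    gamma, beta = params[2 * p], params[2 * p + 1]
    # exp(-i gamma H_C): one ZZ interaction per edge of the graph
    circuit += (cirq.ZZ(qubits[i], qubits[j]) ** gamma
                for i, j in graph.edges)
    # exp(-i beta H_M): transverse-field mixer on every qubit
    circuit += (cirq.X(q) ** beta for q in qubits)
```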
In order to train and test the RNN optimizer on MaxCut QAOA problems, we generated random problem instances in the following fashion: we first fixed an integer n, and then randomly sampled an integer uniformly from the range k ∈ [3, n − 1]. Finally, we tossed a random graph from G_{n,p} with p = k/n and constructed the corresponding MaxCut QAOA QNN of the form of (3) for P = 2. Note that a random G_{n,p} graph is a graph on n nodes where an edge between any two nodes is added independently with probability p. To generate training data, we uniformly sampled n ∈ [6, 9], yielding QNN system sizes of at most nine qubits. To train the RNN, 10000 sampled instances from this training set were used. To generate our testing data, we fixed n = 12, yielding QNN system sizes of 12 qubits, and sampled 50 instances using the procedure described above.
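This instance distribution is straightforward to reproduce; a sketch with NumPy and NetworkX follows (the sampling helper below is our own naming, not the paper's code):

```python
import numpy as np
import networkx as nx

def sample_maxcut_instance(n, rng):
    """Fix n, draw k uniformly from {3, ..., n-1}, then toss a
    G_{n,p} graph with edge probability p = k/n."""
    k = int(rng.integers(3, n))
    return nx.gnp_random_graph(n, k / n, seed=int(rng.integers(2**31)))

rng = np.random.default_rng(0)
train_graphs = [sample_maxcut_instance(int(n), rng)
                for n in rng.integers(6, 10, size=10000)]  # n in [6, 9]
test_graphs = [sample_maxcut_instance(12, rng) for _ in range(50)]
```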
It has been observed that for random 3-regular graphs, at fixed parameter values of the QAOA ansatz, the expected value of the cost function ⟨Ĥ_C⟩_θ concentrates [63]. Our results displayed in Figures 3 and 4 corroborate this finding while operating on a slightly broader ensemble of random graphs. This is made clear by noting that initially the MaxCut QAOA has a much narrower 95% confidence interval across problem instances, regardless of optimization algorithm, when compared to Ising (SK) QAOA.

2. Ising QAOA

Another domain of application where we tested quantum-classical meta-learning was with the QAOA for finding low energy states of a type of Ising spin glass model known as the Sherrington-Kirkpatrick (SK) model. Many problems in combinatorial optimization can be mapped to these models [65] (for example, training Boltzmann machine neural networks [24, 66]). In general, finding the lowest energy state of such models is known to be NP-hard. Using the QAOA, we aim to find low-energy states of an SK Ising Hamiltonian on the graph G = {V, E}, which has the form

Ĥ_C = (1/√n) Σ_{{j,k}∈E} J_{jk} Ẑ_j Ẑ_k + Σ_{j∈V} h_j Ẑ_j,   (5)

where n = |V| is the number of vertices, and J_{jk} and h_j are coupling and bias coefficients. For our numerical experiments we considered only the case of the fully connected model where G is the complete graph. Like the MaxCut QAOA, the choice of mixer Hamiltonian is the sum of the transverse field on all the qubits, Ĥ_M = Σ_j X̂_j, and the initial state is chosen as a uniform superposition over all computational basis states, |+⟩^{⊗n}. The parametric ansatz is once again in the form of a regular QAOA (as in (3)), now with the SK Ising Hamiltonian (5) as the cost Hamiltonian. In similar fashion to MaxCut, when we optimize the parameters for this QAOA,
we obtain a wavefunction which, when measured in the computational basis, will yield a bit string corresponding to a spin configuration with a relatively low energy.

Let us now outline the methods used to generate the training and testing data for the RNN specializing in the optimization of Ising QAOA ansatz parameters. To generate random instances of Ising QAOA, we sampled random values of J_{jk}, h_j and n. For both the training and testing data, after drawing a value for n, the parameters J_{jk} and h_j were drawn from independent Gaussian distributions with zero mean and unit variance. Finally, we constructed the corresponding Ising QAOA QNN ansatze of the form (3) with P = 3 for the sampled Hamiltonian. For the training instances, we sampled the number of qubits uniformly from n ∈ [6, 8], yielding QNN system sizes of at most 8 qubits. The size of the training set was 10000 instances from the above described distribution. For testing, we drew 50 samples uniformly from n ∈ [9, 11]; thus testing was done with strictly larger instances than those contained in the training set.
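A sketch of this instance generator, assuming Cirq's PauliSum representation for the Hamiltonian of (5) (our own helper, with the 1/√n factor applied to the coupling term as written above):

```python
import numpy as np
import cirq

def sample_sk_hamiltonian(n, rng):
    """Random SK instance on the complete graph: J_jk, h_j ~ N(0, 1)."""
    qs = cirq.LineQubit.range(n)
    ham = cirq.PauliSum()
    for j in range(n):
        for k in range(j + 1, n):
            ham += (rng.normal() / np.sqrt(n)) * cirq.Z(qs[j]) * cirq.Z(qs[k])
        ham += rng.normal() * cirq.Z(qs[j])
    return ham

rng = np.random.default_rng(0)
train_hams = [sample_sk_hamiltonian(int(n), rng)
              for n in rng.integers(6, 9, size=10000)]  # n in [6, 8]
```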
B. Variational Quantum Eigensolvers

1. Hubbard Model VQE

Here we describe the variational quantum eigensolver (VQE) ansatze that were used to generate the results in Fig. 3. The specific class of VQE problems we chose to consider were for variational preparation of ground states of Hubbard model lattices [39]. The Hubbard model is an idealized model of fermions interacting on a lattice. The 2D Hubbard model has a Hamiltonian of the form Ĥ = T̂_h + T̂_v + V̂, where T̂_h and T̂_v are the horizontal and vertical hopping terms and V̂ a spin interaction term; more explicitly,

Ĥ = −t Σ_{⟨i,j⟩,σ} (â†_{i,σ} â_{j,σ} + â†_{j,σ} â_{i,σ}) + U Σ_i â†_{i,↑} â_{i,↑} â†_{i,↓} â_{i,↓},   (6)

where the â_{j,σ} and â†_{j,σ} are annihilation and creation operators on site j with spin σ ∈ {↑, ↓}. The goal of the VQE is to variationally learn a parametrized circuit which prepares the ground state of the Hubbard Hamiltonian from (6), or at least an approximation thereof. Our variational ansatz to prepare these approximate ground states is based on the Trotterization of the time evolution under the Hubbard model Hamiltonian; it is of the form

Û(θ) = ∏_{j=1}^{P} e^{−iθ_h^{(j)} T̂_h} e^{−iθ_v^{(j)} T̂_v} e^{−iθ_U^{(j)} V̂},   (7)

where θ = {θ_h, θ_v, θ_U} are the variational parameters for the P Trotter steps. The exponentials at each step are done using a single fermionic swap network [39]. This is similar to the ansatz used in [6] but corresponds to a different order of simulation of the terms.
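For reference, the Hamiltonian of (6) can be assembled with OpenFermion, which underlies the OpenFermion-Cirq ansatze used here; the lattice size and U value below are illustrative draws from the distribution described next, not fixed choices from the paper:

```python
import openfermion as of

# A 3 x 2 Hubbard lattice at t = 1.0 with an example interaction strength.
hubbard = of.fermi_hubbard(x_dimension=3, y_dimension=2,
                           tunneling=1.0, coulomb=2.0, periodic=False)
# Map to qubits for simulation, e.g. via the Jordan-Wigner transform.
qubit_hamiltonian = of.jordan_wigner(hubbard)
```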
Let us provide more details as to our choices of parameters used to generate the results from Figure 3. We used an ansatz consisting of P = 5 steps, where each step introduced 3 parameters. For our initial state, we use an eigenstate of the kinetic term with the correct particle number and the same total spin as the ground state, and we study the model at half-filling. We set t = 1.0 for all instances, and this defines our units of energy. Our training data consists of 10000 instances with the lattice system size chosen to be either n = 2 × 2 or n = 3 × 2 with equal probability and with U chosen from a uniform distribution on the domain of [0.1, 4.0]. After training, we tested the neural network on instances with system size n = 4 × 2, again strictly larger than our training set.

C. Meta-learning Methods & Results

In this section we present the main results of our quantum-classical meta-learning experiments, displayed in Figure 3, and discuss some additional details of our methods used to produce these results. We trained and tested a set of long short-term memory (LSTM) recurrent neural networks (RNNs) to learn to optimize the variety of QNN instances discussed in sections III A and III B, namely, MaxCut QAOA, Ising QAOA and Hubbard VQE. For each of the three problem classes, the RNN was trained using 10000 problem instances. This training of the RNN was executed over a maximum of 1000 epochs, each with a time horizon of 10 iterations. Hence, training required the simulation of inference for at most 1 million quantum neural networks. In most cases the meta-training was stopped well before these 1000 epochs were completed, following standard early-stopping criteria [68].

The quantum circuits used for training and testing the recurrent neural network were executed using the Cirq quantum circuit simulator [69] running on a classical computer. The VQE ansatze were built using OpenFermion-Cirq [67]. Neural network training and inference was done in TensorFlow [70], using code adapted from previous work by DeepMind [34].

For both testing and training, we squashed the readout of the cost function by a quantity which bounds the operator norm of the Hamiltonian. This was done to ensure a normalized loss signal for our RNN across various problem instances. In classical machine learning, normalizing data variance is well known to accelerate and ameliorate training [71]. In the same spirit, we fed the RNN a cost function squashed according to the Pauli coefficient norm, denoted ‖·‖_∗. Recall that for a Hamiltonian Ĥ with a decomposition as a linear combination of Paulis of the form Ĥ = Σ_j α_j P̂_j, we have ‖Ĥ‖_∗ ≡ ‖α‖_1 = Σ_j |α_j|. The squashed cost function is then simply the regular expectation value of the Hamiltonian divided by the Pauli coefficient norm, f̄(θ) = ⟨Ĥ⟩_θ / ‖Ĥ‖_∗. As all Paulis have a spectrum of {±1}, we are guaranteed that the squashed cost function f̄(θ) has its range contained in [−1, 1].
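A sketch of this squashing, again assuming a cirq.PauliSum representation of Ĥ (the helper name and the state-vector interface are our assumptions; older Cirq versions name the method expectation_from_wavefunction):

```python
import numpy as np
import cirq

def squashed_cost(hamiltonian: cirq.PauliSum, state_vector: np.ndarray,
                  qubits) -> float:
    """f_bar(theta) = <H>_theta / ||H||_*, with ||H||_* the 1-norm of
    the Pauli coefficients, so the result lies in [-1, 1]."""
    qubit_map = {q: i for i, q in enumerate(qubits)}
    expectation = hamiltonian.expectation_from_state_vector(
        state_vector, qubit_map=qubit_map).real
    pauli_norm = sum(abs(term.coefficient) for term in hamiltonian)
    return expectation / pauli_norm
```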
[Figure 3 plots: relative error versus objective queries, six panels; legend: NM - Rnd. Seed, NM - Heur. Seed, NM - LSTM Seed, LSTM, LSTM Cutoff.]
Figure 3. Displayed above are the average relative errors with respect to the number of objective function queries during the training of 50 random problem instances for the three classes of problems of interest: QAOA for MaxCut (left), QAOA for Ising models (middle), and VQE for the Hubbard model (right), for various choices of optimizers and initialization heuristics. The problem instances were sampled from the testing distribution described in sections III A and III B. These include a Gaussian Process Regression (GPR) optimizer [62], and Nelder-Mead (NM) [41] with various initialization heuristics. The first of these initialization heuristics was the best of 10 random guesses seed (Rnd. Seed). Also presented is NM initialized with application-specific heuristic seeds (Heur. Seed), which consisted of the adiabatic heuristic for VQE [67] and the mean optimal parameters of the training set for QAOA [63], and finally the seed from our meta-learned neural optimizer (LSTM). We cut off the LSTM after 10 iterations, as it is used mainly as an initializer for other optimizers. Note that we have not included the overhead of the meta-training in this plot; see the main text for a breakdown of the overhead for the training of the LSTM. The top row is for noiseless readout of the expectation, while the bottom row has Gaussian noise with variance 0.05 added to the expectation value readouts, thus emulating approximate estimates of expectation values. For reference, for the set of testing instances, given a QPU inference repetition rate of 10 kHz, the necessary wall clock time per objective query to achieve this variance [6] in the cost estimate is at most (70 ± 40) seconds for MaxCut QAOA, (2.3 ± 0.6) seconds for Ising QAOA, and (30 ± 20) seconds for Hubbard VQE. Note that the relative error is the difference in the squashed cost function relative to the squashed global optimum found through brute force methods, i.e., f̄_rel(θ) = f̄(θ) − min f̄. Error bars represent the 95% confidence interval for the random testing instances from the distribution of problems described in sections III A and III B.
In Figure 3, we plot the relative error, which is the difference in the squashed cost function relative to the globally optimal squashed cost function value found through brute force methods, f̄_rel(θ) = f̄(θ) − min_θ f̄(θ). The brute force optimization methods were basin hopping for the QAOA instances [72], and exact diagonalization for the VQE instances.

For the testing of the trained LSTM, we used randomly sampled instances from the distributions of ansatze described in sections III A and III B. In all cases, the testing instances were for larger-size systems than those used for training, while keeping the number of variational parameters of the ansatze fixed. Note that as all the ansatze considered in this paper were QAOA-like, one can thus scale the size of the system while keeping the same number of parameters. This is an important feature of this class of ansatze, as our LSTM is trained to optimize ansatze of a fixed parameter space dimension.

For all instances, the LSTM was trained on a time horizon of T = 10 quantum-classical iterations, using the observed improvement (2) as the meta-learning loss function. We trained the LSTM on noiseless quantum circuit simulations in Cirq [69]. Note that training of each of the three LSTM networks already required the simulation of 1 million quantum circuit executions with the chosen time horizon of 10 iterations, and that the number of quantum circuit simulations scales linearly with the time horizon. Additionally, gradient-based training required backpropagation through time for the temporal hybrid quantum-classical computational graph, which added further linearly-scaling overhead. Thus, we chose a short time horizon to minimize the complexity of the training. For reference, 10 iterations is a significantly smaller number of quantum-classical optimization iterations than what is typically seen in previous works on QNN optimization [46]. The typical number of iterations required by other optimizers is usually on the order of hundreds to possibly thousands to reach a comparable
optimum of the parameter landscape.

Although the LSTM reaches a good approximate optimum in these 10 iterations, some applications of QNNs, such as VQE, require further optimization, as a high-precision estimate of the cost function is desired. Thus, instead of simply using the LSTM as an optimizer for an extended time horizon, we used the LSTM as a few-iteration initializer for Nelder-Mead (NM). This was done to minimize the complexity of training the RNN and avoid the instabilities of longer training horizons, where the RNN would most likely learn a local method for fine-tuning its own initial guess. A longer time horizon would thus most likely not have provided a significant gain in performance, all the while substantially increasing the cost of training.
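This initializer/fine-tuner split is easy to express with SciPy; in the sketch below, lstm_propose is a hypothetical stand-in for 10 inference steps of the trained RNN optimizer:

```python
from scipy.optimize import minimize

def optimize_qnn(cost_fn, theta_init, lstm_propose):
    # Stage 1: a few RNN-proposed steps land near a typical optimum.
    theta_seed = lstm_propose(cost_fn, theta_init, iterations=10)
    # Stage 2: Nelder-Mead performs the greedy local fine-tuning.
    result = minimize(cost_fn, theta_seed, method='Nelder-Mead',
                      options={'maxiter': 300, 'fatol': 1e-4})
    return result.x, result.fun
```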
We tested the robustness of the RNN optimizer by comparing its performance to other common optimizers, both in the cases where Gaussian noise was added to the cost function evaluations, and for a noiseless-readout idealized case. This additional Gaussian noise can be interpreted as a means to emulate the natural noise of quantum expectation estimation with a finite number of measurement runs [9]. Figure 3 allows for comparison of noisy and noiseless inference (QNN optimization) for the trained LSTM versus alternative optimization and initialization heuristics. For the noisy tests, the expectation samples obeyed a normal distribution of variance 0.05; thus the cost function estimates were drawn according to y_t ∼ N(⟨Ĥ⟩_{θ_t}, 0.05) for the results presented in Figure 3. For the testing instances used to generate the results presented in Figure 3, following a standard prescription [6] for the number of repetitions required to guarantee an upper bound to the variance of 0.05, the number of repetitions (QNN inference runs) should be (7 ± 4) × 10^5 repetitions for MaxCut QAOA, (2.3 ± 0.6) × 10^4 repetitions for Ising QAOA, and (3 ± 2) × 10^5 repetitions for Hubbard VQE. In terms of wall clock time, assuming that the QPU can execute 10000 repetitions (consisting of a quantum circuit execution, multi-qubit measurement, and qubit resetting) per second, for the distribution of testing instances, the total time needed for the LSTM to perform its 10 optimization steps is in the range of (700 ± 400) seconds for MaxCut QAOA, (23 ± 6) seconds for Ising QAOA, and (300 ± 200) seconds for Hubbard VQE. Note that the standard deviation here is due to the variations in the Pauli norm of the Hamiltonians for the sampled instances of the test set.
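Emulating this readout noise is a one-liner; a minimal sketch with NumPy (note the standard deviation is the square root of the 0.05 variance):

```python
import numpy as np

def noisy_readout(exact_expectation, variance=0.05,
                  rng=np.random.default_rng()):
    """Draw y_t ~ N(<H>_theta_t, variance) to emulate finite-shot
    estimation of the expectation value."""
    return rng.normal(loc=exact_expectation, scale=np.sqrt(variance))
```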
Apart from this added cost function noise, the quantum circuit executions were simulated without any other form of readout or gate execution noise. Plotted in Figure 3 are the 95% confidence intervals for the optimization of the 50 testing instances, which were sampled according to the testing distributions described in Sec. III A and Sec. III B. Our results show that the neural optimizer learns initialization heuristics for the QAOA and VQE parameters which generalize across problem sizes. We discuss these results in further detail in the following section.

Let us provide a description of the alternative optimization and initialization heuristics used to generate Figure 3. The first alternative strategy was Bayesian optimization using Gaussian processes [62]; here the initial parameters are set to nil, the same as was the case for the LSTM optimizer. Next, in order to compare the LSTM to other initialization heuristics, we compared the initialization of Nelder-Mead (NM) at parameter values found from the best of 10 random guesses (Random Seed), NM initialized using some state-of-the-art heuristics for QAOA and VQE (Heuristic Seed), and NM initialized after 10 iterations of the LSTM (LSTM Seed). The application-specific heuristic seeds (Heur. Seed) were the adiabatic heuristic for VQE [67], where the variational parameters are scaled in a similar fashion to an adiabatic interpolation across the 5 steps, while for the QAOA the parameters were initialized at the mean value of the optimal parameters for the training set distribution of problem instances. As was shown in [63], since there is a concentration of the cost function for fixed parameters, one can thus expect the distribution of optimal parameters of the QAOA to be concentrated around some mean.
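For concreteness, the Gaussian-process baseline can be reproduced with scikit-optimize (an assumption on our part; the paper does not state which implementation was used), starting from the zero vector as described above:

```python
import numpy as np
from skopt import gp_minimize

def gpr_baseline(cost_fn, dim, n_queries=100):
    """Bayesian optimization over the QNN parameters via GP regression,
    initialized at the zero vector like the LSTM optimizer."""
    result = gp_minimize(cost_fn,
                         dimensions=[(-np.pi, np.pi)] * dim,  # illustrative box
                         x0=[0.0] * dim,
                         n_calls=n_queries)
    return np.array(result.x), result.fun
```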
In Figure 4, we compare the Euclidean distance in parameter space between the output of the 10 iterations of the LSTM versus other initialization heuristics. We clearly see that the LSTM optimizer initializes the QNN parameters closer to the optimal parameters of each test instance, on average, as compared to other methods. We see that in the case of the QAOA, the constant-fit heuristic [63] for the training instances yields a cluster of parameters that is not clustered around the optimal parameters of the larger instance, while the LSTM output parameters are significantly closer to the globally optimal parameters found by brute force. This shows a clear separation between the parameters obtained from a constant fit of the training set versus the LSTM's adaptive scheme for optimizing parameters in few iterations.

IV. DISCUSSION

As shown in Figure 3, our trained neural optimizer reaches a higher-quality approximate optimum of the parameters in 10 iterations than other optimizers can manage in hundreds, both for noisy and noiseless readout. This is most evident in the case of VQE, where the local optimizers can have severe difficulty optimizing parameters when given noisy evaluations of the cost function. Of all alternatives to the neural optimizer, the probabilistic approach of Bayesian optimization via Gaussian process regression was the best performer.

In all six settings, the LSTM rapidly finds an approximate minimum within its restricted time horizon of 10 iterations. The neural optimizer needs to initialize the parameters in a basin of attraction of the cost function landscape so that a local optimizer can then easily converge to a local optimum in fewer iterations and more consistently. As we can see across all cases, the Nelder-
[Figure 4: frequency histograms of the distance to the optimum in parameter space for the LSTM initialization versus the heuristic initialization, with panels for the QAOA problem classes and the Hubbard model VQE.]

…hyperparameters, akin to gradient descent parameters. Similar to how the classical meta-learning approaches to gradient-based optimization converged onto methods comparable to best practices for hyperparameter optimization (e.g., comparable to the performance of AdaGrad and other machine learning best-practice heuristics), the neural optimizer in our case found a neighborhood of optimal hyperparameters and learned a heuristic to quickly adjust these parameters on a case-by-case basis.
…believe that this first challenge has been mitigated by our quantum-classical meta-learning approach, while the second challenge remains open for future work.

In terms of possible extensions of this work, the meta-learning approach could be further improved in several ways. One such way would be to use more recent advances in meta-learning optimizer neural networks [74] which can scale to arbitrary problems and numbers of parameters. This would extend the capabilities of our current approach to optimizing the parameters of arbitrary QNNs beyond Trotter-based/QAOA-like ansatze with variable numbers of parameters across instances. Another possible extension of this work would be to meta-learn an optimizer for Quantum Dynamical Descent [28], a quantum generalization of gradient descent which takes the form of a continuous-variable QAOA. As our neural optimizer was tested on various QAOA problems successfully, one could imagine applying it to the optimization of the Quantum Dynamical Descent hyperparameters. The latter could be considered learning to learn with quantum dynamical descent with classical gradient descent. This would also be a way to generalize the applicability of our approach to arbitrary QNN optimization tasks. Finally, as a NISQ-oriented alternative to the latter, one could meta-learn to optimize the hyperparameters for the stochastic quantum circuit gradient descent algorithm recently proposed by Harrow et al. [31]. We leave the above proposed explorations to future work.

VI. ACKNOWLEDGEMENTS

Circuits and neural networks in this paper were implemented using a combination of Cirq [69], OpenFermion-Cirq [67], and TensorFlow [70]. The authors would like to thank Yutian Chen and his colleagues from DeepMind for providing code for the neural optimizer [34] which was adapted for this work, as well as Edward Farhi, Li Li, and Murphy Niu for their insights, observations, and suggestions. MB and GV would like to thank the team at the Google AI Quantum lab for the hospitality and support during their respective internships where this work was completed.
∗ Both authors contributed equally to this work.
[1] J. Preskill, arXiv preprint arXiv:1801.00862 (2018).
[2] E. Farhi and H. Neven, arXiv preprint arXiv:1802.06002 (2018).
[3] E. Farhi, J. Goldstone, and S. Gutmann, arXiv preprint arXiv:1411.4028 (2014).
[4] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O'Brien, Nature Communications 5, 4213 (2014).
[5] N. Killoran, T. R. Bromley, J. M. Arrazola, M. Schuld, N. Quesada, and S. Lloyd, arXiv preprint arXiv:1806.06871 (2018).
[6] D. Wecker, M. B. Hastings, and M. Troyer, Phys. Rev. A 92, 042303 (2015).
[7] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Nature 549, 195 (2017).
[8] L. Zhou, S.-T. Wang, S. Choi, H. Pichler, and M. D. Lukin, arXiv preprint arXiv:1812.01041 (2018).
[9] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, New Journal of Physics 18, 023023 (2016).
[10] S. Hadfield, Z. Wang, B. O'Gorman, E. G. Rieffel, D. Venturelli, and R. Biswas, arXiv preprint arXiv:1709.03489 (2017).
[11] E. Grant, M. Benedetti, S. Cao, A. Hallam, J. Lockhart, V. Stojevic, A. G. Green, and S. Severini, npj Quantum Information 4, 65 (2018).
[12] S. Khatri, R. LaRose, A. Poremba, L. Cincio, A. T. Sornborger, and P. J. Coles, Quantum 3, 140 (2019).
[13] M. Schuld and N. Killoran, Physical Review Letters 122, 040504 (2019).
[14] S. McArdle, T. Jones, S. Endo, Y. Li, S. Benjamin, and X. Yuan, arXiv preprint arXiv:1804.03023 (2018).
[15] M. Benedetti, E. Grant, L. Wossnig, and S. Severini, New Journal of Physics 21, 043023 (2019).
[16] B. Nash, V. Gheorghiu, and M. Mosca, arXiv preprint arXiv:1904.01972 (2019).
[17] Z. Jiang, J. McClean, R. Babbush, and H. Neven, arXiv preprint arXiv:1812.08190 (2018).
[18] G. R. Steinbrecher, J. P. Olson, D. Englund, and J. Carolan, arXiv preprint arXiv:1808.10047 (2018).
[19] M. Fingerhuth, T. Babej, et al., arXiv preprint arXiv:1810.13411 (2018).
[20] R. LaRose, A. Tikku, É. O'Neel-Judy, L. Cincio, and P. J. Coles, arXiv preprint arXiv:1810.10506 (2018).
[21] L. Cincio, Y. Subaşı, A. T. Sornborger, and P. J. Coles, New Journal of Physics 20, 113022 (2018).
[22] H. Situ, Z. Huang, X. Zou, and S. Zheng, Quantum Information Processing 18, 230 (2019).
[23] H. Chen, L. Wossnig, S. Severini, H. Neven, and M. Mohseni, arXiv preprint arXiv:1805.08654 (2018).
[24] G. Verdon, M. Broughton, and J. Biamonte, arXiv preprint arXiv:1712.05304 (2017).
[25] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[26] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, Vol. 1 (MIT Press, Cambridge, 2016).
[27] J. Schmidhuber, Neural Networks 61, 85 (2015).
[28] G. Verdon, J. Pye, and M. Broughton, arXiv preprint arXiv:1806.09729 (2018).
[29] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, Nature Communications 9 (2018), 10.1038/s41467-018-07090-4.
[30] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran, arXiv preprint arXiv:1811.11184 (2018).
[31] A. Harrow and J. Napp, arXiv preprint arXiv:1901.05374 (2019).
[32] Z.-C. Yang, A. Rahmani, A. Shabani, H. Neven, and C. Chamon, Physical Review X 7, 021027 (2017).
[33] E. Grant, L. Wossnig, M. Ostaszewski, and M. Benedetti, arXiv preprint arXiv:1903.05076 (2019).
[34] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas, arXiv preprint arXiv:1611.03824 (2016).
[35] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas, in Advances in Neural Information Processing Systems (2016) pp. 3981–3989.
[36] B. Zoph and Q. V. Le, arXiv preprint arXiv:1611.01578 (2016).
[37] C. Finn, P. Abbeel, and S. Levine, arXiv preprint arXiv:1703.03400 (2017).
[38] M. Long, Y. Cao, J. Wang, and M. I. Jordan, arXiv preprint arXiv:1502.02791 (2015).
[39] I. D. Kivlichan, J. McClean, N. Wiebe, C. Gidney, A. Aspuru-Guzik, G. K.-L. Chan, and R. Babbush, Phys. Rev. Lett. 120, 110501 (2018).
[40] Z. Jiang, K. J. Sung, K. Kechedzhi, V. N. Smelyanskiy, and S. Boixo, Physical Review Applied 9, 044036 (2018).
[41] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, SIAM Journal on Optimization 9, 112 (1998).
[42] G. Nannicini, Physical Review E 99, 013304 (2019).
[43] G. G. Guerreschi and M. Smelyanskiy, arXiv preprint arXiv:1701.01450 (2017).
[44] J. C. Spall, IEEE Transactions on Aerospace and Electronic Systems 34, 817 (1998).
[45] M. J. Powell, Cambridge NA Report NA2009/06, University of Cambridge, Cambridge, 26 (2009).
[46] G. Nannicini, arXiv preprint arXiv:1805.12037 (2018).
[47] J. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, New Journal of Physics 18, 023023 (2016).
[48] In general, one may relay the raw measurement results to the classical processing unit, which can then compute the expectation value of the cost function. For the purposes of this paper, we assumed the classical optimizer only has access to (noisy) estimates of the expectation value of the cost Hamiltonian.
[49] A. Y. Kitaev, A. Shen, and M. N. Vyalyi, Classical and Quantum Computation, 47 (American Mathematical Soc., 2002).
[50] N. C. Rubin, R. Babbush, and J. McClean, New Journal of Physics 20, 053020 (2018).
[51] Y. LeCun, D. Touresky, G. Hinton, and T. Sejnowski, in Proceedings of the 1988 Connectionist Models Summer School, Vol. 1 (CMU, Pittsburgh, PA: Morgan Kaufmann, 1988) pp. 21–28.
[52] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature 323, 533 (1986).
[53] J. Romero and A. Aspuru-Guzik, arXiv preprint arXiv:1901.00848 (2019).
[54] A. Nichol and J. Schulman, arXiv preprint arXiv:1803.02999 (2018).
[55] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell, arXiv preprint arXiv:1807.05960 (2018).
[56] C. Audet and M. Kokkolaras, "Blackbox and derivative-free optimization: theory, algorithms and applications," (2016).
[57] Z. C. Lipton, J. Berkowitz, and C. Elkan, arXiv preprint arXiv:1506.00019 (2015).
[58] R. Pascanu, T. Mikolov, and Y. Bengio, in International Conference on Machine Learning (2013) pp. 1310–1318.
[59] S. Hochreiter and J. Schmidhuber, Neural Computation 9, 1735 (1997).
[60] A. Bapat and S. Jordan, arXiv preprint arXiv:1812.02746 (2018).
[61] G. Verdon, J. M. Arrazola, K. Brádler, and N. Killoran, arXiv preprint arXiv:1902.00409 (2019).
[62] C. E. Rasmussen, in Advanced Lectures on Machine Learning (Springer, 2004) pp. 63–71.
[63] F. G. Brandao, M. Broughton, E. Farhi, S. Gutmann, and H. Neven, arXiv preprint arXiv:1812.04170 (2018).
[64] M. Suzuki, Physics Letters A 146, 319 (1990).
[65] T. Kadowaki and H. Nishimori, Physical Review E 58, 5355 (1998).
[66] M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, Physical Review X 8, 021050 (2018).
[67] J. R. McClean, I. D. Kivlichan, D. S. Steiger, Y. Cao, E. S. Fried, C. Gidney, T. Häner, V. Havlíček, Z. Jiang, M. Neeley, et al., arXiv preprint arXiv:1710.07629 (2017).
[68] L. Prechelt, in Neural Networks: Tricks of the Trade (Springer, 1998) pp. 55–69.
[69] M. LLC, "Cirq: A python framework for creating, editing, and invoking noisy intermediate scale quantum circuits," (2018).
[70] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016).
[71] S. Ioffe and C. Szegedy, arXiv preprint arXiv:1502.03167 (2015).
[72] D. J. Wales and J. P. Doye, The Journal of Physical Chemistry A 101, 5111 (1997).
[73] P. O'Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding, et al., Physical Review X 6, 031007 (2016).
[74] O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. Sohl-Dickstein, in Proceedings of the 34th International Conference on Machine Learning, Vol. 70 (JMLR.org, 2017) pp. 3751–3760.