
Quantum-Classical Multiple Kernel Learning

Ara Ghukasyan,1,2 Jack S. Baker,1 Oktay Goktas,1 Juan Carrasquilla,2,3 Santosh Kumar Radha1,∗
1 Agnostiq Inc., 325 Front St W, Toronto, ON M5V 2Y1
2 University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1
3 Vector Institute, 661 University Ave Suite 710, Toronto, ON M5G 1M1
∗ research@agnostiq.ai
(Dated: May 30, 2023)
arXiv:2305.17707v1 [quant-ph] 28 May 2023

As quantum computers become increasingly practical, so does the prospect of using quantum computation to improve upon traditional algorithms. Kernel methods in machine learning is one area where such improvements could be realized in the near future. Paired with kernel methods like support-vector machines, small and noisy quantum computers can evaluate classically-hard quantum kernels that capture unique notions of similarity in data. Taking inspiration from techniques in classical machine learning, this work investigates simulated quantum kernels in the context of multiple kernel learning (MKL). We consider pairwise combinations of several classical-classical, quantum-quantum, and quantum-classical kernels in an empirical investigation of their classification performance with support-vector machines. We also introduce a novel approach, which we call QCC-net (quantum-classical-convex neural network), for optimizing the weights of base kernels together with any kernel parameters. We show this approach to be effective for enhancing various performance metrics in an MKL setting. Looking at data with an increasing number of features (up to 13 dimensions), we find parameter training to be important for successfully weighting kernels in some combinations. Using the optimal kernel weights as indicators of relative utility, we find growing contributions from trainable quantum kernels in quantum-classical kernel combinations as the number of features increases. We observe the opposite trend for combinations containing simpler, non-parametric quantum kernels.

I. INTRODUCTION

Research towards developing novel computational paradigms aims to overcome the limitations of classical
computers. Quantum computing leverages the unique
properties of quantum mechanics as a promising alter-
native for tackling some classically infeasible problems.
New algorithms for noisy intermediate-scale quantum
(NISQ) devices [1] continue to emerge, spurred on by
recent demonstrations of quantum supremacy [2, 3]. In
the field of machine learning, kernel methods [4] are one
promising application of this technology [5].
Kernel methods, such as support-vector machines
(SVMs) [6], are prominent in classical machine learn-
ing due to theoretical guarantees associated with their
convex loss landscapes. SVMs and related algorithms
can perform non-linear classification to capture complex
patterns in data. One limitation of kernel methods is that they require computing an M × M kernel matrix, where M is the number of training samples. On the other hand, many quantum kernel methods [7] are immediately suitable for NISQ devices.

FIG. 1: An overview of prospective multiple kernel learning methods in the era of quantum computing, emphasizing combinations of quantum (q) and/or classical (c) kernels, as well as quantum or classical optimization approaches.
Quantum kernel methods, like quantum support-vector machines (QSVMs) [8], have shown promise in various applications [9], including supernova classification in cosmology [10], probing phase transitions in quantum many-body physics [11], and detecting fraud in finance [12]. The crux of the QSVM is embedding input data into quantum states, which allows kernels to be estimated from the overlap of these states. Importantly, the quantum feature spaces accessed by quantum kernel methods can be tailored to data [13–15] by training parametric embeddings.

Multiple kernel learning (MKL) aims to enhance performance by combining different kernels into a single, more expressive kernel [16], in order to learn a wider variety of decision functions. Here, tuning combinations in a data-driven way [17] provides an additional means of tailoring kernels. MKL can facilitate feature selection and the incorporation of domain-specific knowledge, which are essential for addressing real-world problems [18].
MKL also encourages sparsity in the combination of kernels, effectively performing model selection in learning the optimal weights for component kernels. This can reduce the risk of overfitting and improve generalization performance [19]. To date, MKL has proven valuable in various applications, including image classification, natural language processing, bioinformatics, and drug discovery [20–22].

As the NISQ era progresses, MKL may prove useful for near-term applications that rely on both quantum and classical computing. These paradigms can be combined in six different ways, as illustrated in Fig. 1; utilizing classical-classical (c.c.), quantum-classical (q.c.), or quantum-quantum (q.q.) kernel combinations. Either classical or quantum techniques can also be applied to the subsequent optimization problem. For example, recent research has focused on using quantum annealers to solve the SVM problem for fully classical kernels [23]. Multiple quantum kernels have also been combined in the fully-quantum (q.q.) setting, using deterministic quantum computing with one qubit (DQC1) [24], and solving the SVM problem on a classical computer.

A foremost goal of this paper is to systematically address c.c., q.q., and especially q.c. kernel combinations, of which the latter is noticeably lacking in existing literature. Using the classical EasyMKL algorithm [17] to optimize kernel weights, we consider pairwise combinations within a set of three quantum and three classical kernels. We also introduce another training step to fine-tune any parametric component kernels in our end-to-end learnable quantum-classical-convex neural network (QCC-net) method (also used for time-series analysis in Ref. [25]). We report performance metrics for the various combinations in a supervised classification setting using SVMs, across synthetic datasets with a range of two to thirteen features.

The remainder of this paper is organized as follows: Sec. II provides a brief overview of kernel methods, introduces the concepts behind quantum embedding kernels, and provides additional motivation for hybrid paradigms in MKL. We then present our experimental methodology in Sec. III, including implementation details, descriptions of the base classical and quantum kernels, as well as details of data preparation and the QCC-net. Empirical results are discussed in Sec. IV in terms of performance metrics and optimal kernel weights. Finally, we draw insights from these results and suggest future directions in Sec. V.

II. KERNEL METHODS

Kernel methods are machine learning algorithms that rely on a kernel function to compute numerical notions of pairwise similarity. Any function k(x, x′) is a kernel if there exists a feature map Φ : R^d → F such that

k(x, x′) = ⟨Φ(x′), Φ(x)⟩    (1)

for all x ∈ X ⊂ R^d.

Kernels provide "shortcuts" into a feature space F via the kernel trick. Consider a linear classifier

ypred(x) = sgn(⟨w, x⟩ + b),    (2)

which predicts binary labels yi ∈ {−1, +1} for points x ∈ R^d, using a separating hyperplane defined by the normal vector w ∈ R^d and a distance to the origin b ∈ R. If the classes in X are not linearly separable, the algorithm can be attempted in an alternate Hilbert space, finding the w̃ ∈ F and b̃ ∈ C such that

ypred(x) = sgn(⟨w̃, Φ(x)⟩ + b̃)    (3)

correctly classifies Φ(x) (and thereby x).

Instead of evaluating Eq. (3) directly, the representer theorem [26] is applied to re-express w̃ as a finite linear combination with real coefficients αm,

w̃ = Σ_m αm Φ(x^(m)),    (4)

which yields a classifier that operates on Φ(x) ∈ F, without directly evaluating any inner products:

ypred(x) = sgn( Σ_m αm k(x, x^(m)) + b ).    (5)

Thus, the kernel trick enables non-linear classification with linear algorithms, using an implicit transformation of X through a feature map Φ.

Kernel methods are more transparent to formal analysis (as compared to neural networks, for example) and often lead to convex problems that benefit from provable performance guarantees [26]. Their connection to supervised quantum machine learning (QML) models [27], which can be re-formulated as kernel methods [28], has stoked additional interest for near-term quantum applications. As a more distant prospect, kernel-based problems are especially well suited for fault-tolerant quantum computers [29].

kernel name   kernel function            parameters
Linear        ⟨x, x′⟩                    none
Polynomial    (θ0⟨x, x′⟩ + θ1)^3         {θ0, θ1}
RBF           exp(−θ2 ||x − x′||^2)      {θ2}

TABLE I: Classical kernels considered in this work. Default parameter values are chosen to be (θ0, θ1) = (1/d, 1) and θ2 = 1.
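To make the kernel trick concrete, the short sketch below (not from the paper) evaluates the classical kernels of Tab. I with scikit-learn and trains an SVM on a precomputed Gram matrix, in the spirit of Eq. (5). The toy dataset, the specific kernel chosen, and the parameter values follow the table's defaults but are otherwise illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data standing in for the synthetic datasets described later (Sec. III B).
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

d = X.shape[1]
# Classical kernels from Tab. I, using the stated defaults (theta0, theta1) = (1/d, 1), theta2 = 1.
kernels = {
    "Linear": lambda A, B: linear_kernel(A, B),
    "Polynomial": lambda A, B: polynomial_kernel(A, B, degree=3, gamma=1.0 / d, coef0=1.0),
    "RBF": lambda A, B: rbf_kernel(A, B, gamma=1.0),
}

# Train an SVM on a precomputed Gram matrix (the kernel trick of Eq. (5)).
K_train = kernels["RBF"](X_train, X_train)
K_test = kernels["RBF"](X_test, X_train)   # rows: test points, columns: training points
clf = SVC(kernel="precomputed").fit(K_train, y_train)
print("test accuracy:", clf.score(K_test, y_test))
```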

kernel name   embedding unitary   Eqs.       parameters     ref.
RX            RX(x)               (18)       -              -
IQP           V(x) H^⊗N           (19)       -              [30]
QAOA          W(θ) RX(x)          (18, 20)   θ ∈ R^2N       [31, 32]
[3-qubit example embedding circuits from the original table are not reproduced in this text rendering.]

TABLE II: Quantum kernels considered in this work. Circuit diagrams provide a 3-qubit example of the embedding circuit (i.e. one half of the kernel circuit). Initial parameters for the QAOA circuit are chosen uniformly at random from [0, 2π]^2N. Equations defining the embedding circuits are provided in Sec. III A 1.
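As a rough illustration of how embedding kernels like those in Tab. II can be simulated, the sketch below builds a fidelity-type kernel from an RX-style angle embedding in PennyLane. The device choice, qubit count, and use of qml.AngleEmbedding are assumptions made for this example; they are not the paper's exact implementation.

```python
import numpy as np
import pennylane as qml

n_qubits = 3  # assumes d = 3 features, one qubit per feature
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def overlap_circuit(x1, x2):
    # Embed x1 with U(x1), then apply the adjoint embedding for x2.
    # The probability of the all-zeros outcome estimates |<Phi(x2)|Phi(x1)>|^2, cf. Eq. (7).
    qml.AngleEmbedding(x1, wires=range(n_qubits), rotation="X")
    qml.adjoint(qml.AngleEmbedding)(x2, wires=range(n_qubits), rotation="X")
    # (PennyLane also offers other embeddings, e.g. qml.IQPEmbedding, for IQP-style encodings.)
    return qml.probs(wires=range(n_qubits))

def quantum_kernel(x1, x2):
    return overlap_circuit(x1, x2)[0]

def gram_matrix(XA, XB):
    return np.array([[quantum_kernel(a, b) for b in XB] for a in XA])

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(6, n_qubits))  # features scaled to [0, 2*pi]
K = gram_matrix(X, X)
print(np.round(K, 3))
```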

A. Quantum Kernels

Quantum embedding kernels (QEKs) [14, 15] implement a similarity measure via fidelity estimates between data-representing quantum states. QEKs are an attractive prospect insofar as classically-hard kernel functions can be realized with parametric operations that encode data points as quantum states [9, 30]. To achieve efficient and expressive QEKs, the quantum feature space should be readily parametrized by only a handful of gate variables (here denoted by θ).

To construct a QEK, one must choose a unitary transformation Uθ to define the embedding,

|Φ(x)⟩ = Uθ(x) |0⟩.    (6)

Here, a corresponding feature map is one that takes x to a density matrix ρ(x) = |Φ(x)⟩⟨Φ(x)|. Note that any possibility for quantum advantage is lost if the embedding is too simple [30], so U must be chosen carefully. The QEK itself is defined by the Frobenius inner product ⟨ρ(x′), ρ(x)⟩, or equivalently

Tr{ρ(x′)ρ(x)} = |⟨Φ(x′)|Φ(x)⟩|^2 = kθ(x, x′).    (7)

Any of several existing methods for fidelity estimation [33–35] can be used to evaluate Eq. (7) in practice, allowing a hybrid classifier to be implemented by calling kθ(x, x′) inside a classical linear algorithm (e.g. Eq. (5)). More generally, the modular nature of kernel methods provides a convenient setting for high-level manipulation of both quantum and/or classical kernels.

While the kernel trick is simple to apply, it remains challenging to determine an effective kernel for a given classification problem. To this end, heuristic [15] or cost-based optimization [30] can be leveraged to improve the performance of parametric kernels. MKL is a complementary approach that considers different kernels in combination, and determines the weights for component kernels.

B. Multiple Kernel Learning

The goal in MKL is to improve a kernel method's performance by introducing a novel notion of similarity derived from multiple distinct kernels. Typical use cases for MKL include affecting feature selection [36, 37], enabling anomaly detection [38], and enhancing expressivity [24]. Many kernel combination strategies have been proposed for MKL [16], although weighted linear combinations are often effective [17, 24].
FIG. 2: Illustration of a hybrid kernel that combines inner products in C1, a generic ("classical") Hilbert space, with inner products from Q1, the state space of a quantum system.

A combination function fγ : R^R → R and a base set of kernels kθ^(r) : R^d × R^d → R define a combined kernel [16]

kθ,γ(x, x′) := fγ({kθ^(r)(x, x′)}_{r=1}^{R})    (8)

with adjustable parameters θ and γ for kθ^(r) and fγ, respectively. For brevity of notation, θ is assumed to contain all kernel parameters and each kernel kθ^(r) is assumed to utilize only a subset "r" of θ.

To be a valid kernel, the resulting kθ,γ must be positive semi-definite, which means the inequality

z^T Kθ,γ z ≥ 0    (9)

must hold for all z ∈ R^M, where M = #{Ẑ} for any (improper) subset Ẑ of the training set, X̂ ⊂ X. Here, Kθ,γ ∈ R^{M×M} is the Gram matrix [26] of all pairwise kernels, kθ,γ(x, x′), for x, x′ ∈ Ẑ. If the Gram matrices of all R base kernels satisfy Eq. (9), then their additive or multiplicative combinations are also guaranteed to be valid kernels [39].

A convex linear combination of base kernels (Fig. 2) is constructed by taking

fγ = ⟨γ, ·⟩    (10)

with the following constraints on γ:

||γ||_1 = 1,  γr ≥ 0.    (11)

As a weighted average, this form is both easy to interpret and computationally convenient. Using Eq. (1), we can infer that the resulting kernel,

kθ,γ(x, x′) := Σ_{r=1}^{R} γr kθ^(r)(x, x′),    (12)

corresponds to a feature map

Φθ,γ(x) = [√γ1 Φθ^(1)(x), ..., √γR Φθ^(R)(x)]^T.    (13)

Here, data-driven optimization of γ effectively decides the importance of each component feature map. Once all the relevant kernels have been evaluated, the MKL technique does not distinguish between kernels of quantum and/or classical nature. Instead, the kernel values are simply weighted to maximize an optimization objective (Sec. III A 2) in view of the training data.

To contrast, the q.q. implementation by Vedaie et al. [24] involves preparing a parametrized mixed state,

ργ = Σ_{r=1}^{R} γr |r⟩^⊗q ⟨r|^⊗q,    (14)

which represents an R-component kernel of the form in Eq. (12); weighted by the classical probabilities of the mixture, and computed via the expectation value of an encoding operator. Here, each individual kernel is evaluated at the same time, using a single quantum circuit.

In our approach, the kernel (or Gram) matrices,

[Kθ^(r)]_{x,x′} = kθ^(r)(x, x′),  ∀x, x′ ∈ X̂    (15)

are computed separately for each quantum or classical kernel. The combination parameters (γr) are then determined classically, using these matrices and the training labels as inputs. This allows us to treat c.c., q.q., and q.c. kernel combinations identically, within a general optimization framework.

1. Quantum-Classical Kernel Combinations

Since computing common classical kernels adds no significant overhead compared to querying quantum computers or simulating quantum circuits, a useful q.c. combination could be an easy way to make the most of available NISQ hardware. Feature spaces associated with quantum kernels are unique insofar as (1) they can be classically-hard to simulate, and (2) are directly modelled by physical quantum states. The first point here is necessary for achieving quantum advantage [9] (though this does not per se guarantee better learning performance), while the second point means that the feature space is directly tunable in an efficient manner, unlike useful feature spaces in the traditional setting. Using an MKL strategy like Eq. (12) creates a weighted concatenation of feature vectors, which leads to an overall kernel that interpolates between component kernels (Fig. 2). Adding a classical kernel to a complicated (and often periodic) quantum kernel also introduces a natural means of trainable regularization (Fig. 3), which can improve model performance beyond the training set. This effect is adjustable and could be eliminated entirely if the classical kernel is indeed superfluous. On the other hand, the optimal kernel weights could reveal that the quantum kernel is actually ineffective for the problem at hand.
FIG. 3: Examples of trained SVM decision functions from the IQP (Tab. II) and Linear kernels (Tab. I). Top panels show each kernel individually and the bottom panel shows their linear combination, with weights γ1 = 0.4 and γ2 = 0.6 for the IQP and Linear kernels, respectively. The pictured training set represents a typical (d = 2) instance of the datasets considered in this study. All horizontal and vertical axes range from 0 to 2π.

2. Kernel Training and MKL Optimization

The process of training the kernel parameters (θ) typically represents a separate consideration with respect to optimizing the kernel combination weights (γ). In this work, kernel parameter training relies on a stochastic, gradient-based method [40], while weights optimization involves solving a convex quadratic problem based on the EasyMKL algorithm [17]. The QCC-net introduced in Sec. III A 2 combines these processes in a fully differentiable way, with the MKL objective substituted as the training loss. As we shall demonstrate in Sec. IV B, parameter training is necessary in some cases for the MKL algorithm to distinguish component kernels.

III. METHODOLOGY

A. Design

To explore c.c., q.c., and q.q. MKL methods side by side, we considered binary classification tasks using all pairwise combinations from a set of three classical (Tab. I) and three quantum (Tab. II) kernels. To quantify classification performance and characterize the kernels, we used the metrics in Tab. III. Here, the base metrics (T/F)P and (T/F)N correspond to the true/false positive and true/false negative counts, respectively. These are used to compute the classification metrics in the lower half of Tab. III according to the corresponding definition. The reported accuracy is the number of correct predictions divided by the total number of predictions. The area under the receiver operator characteristic (AUCROC) is computed across true/false positive rates over decreasing thresholds, t. These thresholds are obtained from ypred(x̄) and ȳ for all x̄ in the testing set X̄ = X \ X̂, using the roc_auc_score function from [41].

The final two metrics in Tab. III, the margin and spectral ratio, are computed on the Gram matrix that represents the "kernelized" training data. These metrics characterize the kernel function and its embedding, unlike the accuracy and AUCROC, which are measures of classification performance. The margin is the smallest distance between points in feature space (F) which belong to different classes. It is the maximization target for the SVM algorithm [39] and has a fixed upper bound for a given dataset X̂ and feature map Φ.

metric          definition
TP              #{x̄ s.t. ypred(x̄) = ȳ = 1}
TN              #{x̄ s.t. ypred(x̄) = ȳ = −1}
FP              #{x̄ s.t. ypred(x̄) = 1, ȳ = −1}
FN              #{x̄ s.t. ypred(x̄) = −1, ȳ = 1}
Accuracy        (TP + TN) / (TP + TN + FP + FN)
AUCROC          area under (TP/(TP + FN), FP/(FP + TN))_t
Margin          min{||Φθ,γ(x) − Φθ,γ(x′)|| s.t. y ≠ y′}
Spectral Ratio  Σ_i [Kθ,γ]_ii / √(Σ_ij [Kθ,γ]_ij^2)

TABLE III: Base metrics (top four), classification metrics (accuracy, AUCROC), and kernel metrics (margin, spectral ratio) used for combined comparisons in Sec. IV. Classification metrics are computed from testing outcomes and kernel metrics are computed from training outcomes.
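As a minimal sketch of the two kernel metrics in Tab. III, the helper functions below compute the margin and spectral ratio directly from a Gram matrix and its training labels. This is an illustrative reimplementation, not the paper's code, and the toy Gram matrix is an assumption for the example.

```python
import numpy as np

def kernel_margin(K, y):
    """Smallest feature-space distance between points of different classes (Tab. III).
    Uses ||Phi(x) - Phi(x')||^2 = K[i, i] + K[j, j] - 2 K[i, j]."""
    y = np.asarray(y)
    diag = np.diag(K)
    sq_dists = diag[:, None] + diag[None, :] - 2.0 * K
    mask = y[:, None] != y[None, :]          # pairs with different labels
    return np.sqrt(max(sq_dists[mask].min(), 0.0))

def spectral_ratio(K):
    """Sum of diagonal elements of K divided by its Frobenius norm (Tab. III)."""
    return np.trace(K) / np.linalg.norm(K, ord="fro")

# Example with a generic positive semi-definite Gram matrix.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 3))
K = Phi @ Phi.T
y = np.array([1, 1, 1, -1, -1, -1])
print(kernel_margin(K, y), spectral_ratio(K))
```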

The spectral ratio is the sum of the diagonal elements of Kθ,γ divided by its Frobenius (or L2,2) norm. For bounded kernels (i.e., RBF and all quantum kernels) the diagonal sum is always equal to #{X̂} (i.e. the size of X̂), because these kernels evaluate to unity on identical input pairs. Since off-diagonal terms correspond to non-identical input pairs, the maximum spectral ratio of 1 is obtained with a kernel

kθ,γ(x, x′) = δ_{x,x′},    (16)

which is a Kronecker delta function on the domain x, x′ ∈ X̂. This limit corresponds to poor classification in general, as only identical points are considered "similar". Conversely, the minimum spectral ratio 1/#{X̂} is obtained with the constant function

kθ,γ(x, x′) = 1,    (17)

which considers every pair of points to be maximally similar. Therefore, a reasonable spectral ratio corresponds to some value between these extremes.

For combinations containing unbounded kernels (Linear, Polynomial), we normalize the component Gram matrix (Eq. (15)) before the sums over Kθ,γ (Tab. III) are computed. Normalization additionally helps to stabilize the kernel weighting algorithm (Sec. III A 2) in general. However, this results in Σ_i [Kθ^(r)]_ii < #{X̂}, which (unless γr is zero) unnaturally lowers the combined kernel's spectral ratio. Direct comparisons of spectral ratios are therefore not reliable between combinations that contain only bounded kernels versus those containing one or more unbounded kernels. Comparisons within these two groups, however, are fully justified.

1. Classical and Quantum Kernel Selection

Among the classical set, the Linear kernel corresponds to a feature map whose feature space is (trivially) the native data space (R^d). Any prediction function learned with the Linear kernel is therefore strictly linear itself. We also consider a cubic Polynomial kernel and the radial basis function (RBF) kernel as more "powerful" examples of common classical kernels. Both of the latter two are parametric and produce non-trivial feature maps. Any feature space of the RBF kernel is in fact infinite-dimensional [42], and the kernel itself is known to be universal [43].

On the quantum side, we construct a QEK from each embedding unitary in Tab. II as per Eqs. (6, 7). The shorthand

RA(z) = exp(−(i/2) Σ_{p=1}^{m} zp Ap)    (18)

is introduced here to represent single-qubit A-rotations, where Ap are Pauli X, Y, or Z gates acting on qubit p of an N-qubit circuit; taking the rotation angle from component p ≤ m ≤ N of the input vector z ∈ R^m.

The simplest quantum kernel that we consider is RX (Tab. II), which encodes each component of x ∈ R^d onto one of N = d qubits using RX(x) as the embedding unitary. The IQP [30] kernel is defined by the ansatz

V(x) = exp(−(i/2) Σ_{p≠q} xp xq Zp Zq) RZ(x),    (19)

which uses RZ rotations followed by two-qubit gates on all pairs of qubits to encode data. Lastly, the variational ansatz,

W(θ) = RY(θ) exp(−(i/2) Σ_{p≠q} θpq Zp Zq),    (20)

defines the QAOA [31, 32] kernel. While similar to IQP, the QAOA kernel features parametric transformations after a single set of encoding gates. Regarding all three embedding circuits, we utilize the minimum, single layer ansatz in every case, as shown in Tab. II.

2. QCC-net Optimization for MKL

Given M training samples from X̂ ⊂ X ⊂ R^d and their labels as a matrix Ŷ = diag(y1, ..., yM), where yi ∈ {−1, 1}, the combination weights are determined to maximize the total distance (in feature space) between positive and negative samples. Following [17], this problem is formulated as

max_{||γ||=1} min_ϕ (1 − λ) γ^T dθ(ϕ) + λ||ϕ||_2^2,    (21)

where the vector of distances, dθ(ϕ) ∈ R^R, has components

dθ^(r)(ϕ) = ϕ^T Ŷ K̂θ^(r) Ŷ ϕ,    (22)

and the variable ϕ ∈ R^M is subject to

Σ_{i|y(x)=1} ϕi = Σ_{i|y(x)=−1} ϕi = 1.    (23)

Considering 0 ≤ λ < 1, a solution γ⋆ parallel to dθ(ϕ) is evident for the outer maximization problem:

γ⋆ = dθ(ϕ) / ||dθ(ϕ)||_2.    (24)

This reduces Eq. (21) to the (convex) minimization problem,

Lϕmin(θ) = min_ϕ (1 − λ)||dθ(ϕ)||_2 + λ||ϕ||_2^2,    (25)

whereby the solution ϕmin leads to the optimal kernel weights via Eq. (24), given fixed values of the kernel parameters θ.
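The weight-recovery step of Eqs. (22)-(24) can be sketched in a few lines of NumPy. In this illustration (not the paper's implementation), ϕ is simply fixed to the uniform per-class choice allowed by Eq. (23) rather than obtained by solving Eq. (25), which the paper does with CVXPY as described next.

```python
import numpy as np

def easymkl_weights(gram_matrices, y, phi=None):
    """Kernel weights gamma* = d / ||d||_2 (Eq. (24)), with d_r = phi^T Y K_r Y phi (Eq. (22)).
    If phi is not given, use the uniform per-class vector satisfying Eq. (23)."""
    y = np.asarray(y, dtype=float)
    Y = np.diag(y)
    if phi is None:
        phi = np.where(y > 0, 1.0 / np.sum(y > 0), 1.0 / np.sum(y < 0))
    d = np.array([phi @ Y @ K @ Y @ phi for K in gram_matrices])
    return d / np.linalg.norm(d)

# Toy example with two base kernels (one could be a simulated quantum kernel).
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(10, 2))
y = np.array([1] * 5 + [-1] * 5)
K1 = np.cos(X) @ np.cos(X).T      # stand-in "quantum" Gram matrix
K2 = X @ X.T                      # linear kernel
print(easymkl_weights([K1, K2], y))   # weights proportional to the per-kernel distances d_r
```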

We implemented the above algorithm, which describes the process of determining the kernel weights (γ⋆), using the convex program solver CVXPY [44, 45] together with the default splitting conic solver (SCS) [46]. In order to integrate optimization of the kernel parameters (θ), we expressed the problem in Eq. (25) as a differentiable neural network layer using the CVXPY extension cvxpylayers [47], which uses diffcp [48] internally. This allowed the gradients

∂Lϕmin(θ)/∂θ := (∂Lϕmin(θ)/∂θ0, ..., ∂Lϕmin(θ)/∂θ_{N−1})    (26)

to be obtained at intermediate steps in the overall computation. Access to these gradients, in turn, allows a gradient-based optimizer to be used for training the kernel parameters to maximize Lϕmin(θ) with respect to θ. (We utilized the Adam optimizer [40] in this work.) As illustrated in FIG. 1 of Ref. [25], the complete QCC-net algorithm proceeds as follows:

(i) Starting with R base kernels, compute K̂θ,γ assuming balanced γ (i.e. all components equal to 1/R), and using random/default initial kernel parameters θl = θ, where l = 0 initially.

(ii) Solve the cone problem in Eq. (25) using CVXPY+SCS to determine ϕmin at fixed θl. If the optimizer's termination criteria are met, advance to (iv). Else, continue to (iii).

(iii) Use the gradients in Eq. (26) to update the kernel parameters θl → θl+1, maximizing the loss function Lϕmin(θ) with respect to θ, then return to step (i) with θl := θl+1.

(iv) Set θ⋆ = θl, obtain γ⋆ from ϕmin, and compute the final kernel matrix.

The QML framework Pennylane [49] was used to implement and simulate quantum circuits with a PyTorch- [50] compatible interface for gradient calculations and GPU acceleration. We also used Covalent (see [51]) to facilitate distributed computations.

B. Dataset Preparation and Preprocessing

We utilized the scikit-learn software package [41] to synthesize instances of generic, single-label datasets for binary classification [52] (based on the method in [53]). Each dataset instance consisted of 100 samples (x ∈ R^d), with d informative features, split 50:50 into training and testing subsets. Samples belonging to each class were distributed evenly between two d-variate Gaussian clusters with randomly correlated features, for d in our experiments ranging from 2 to 13. Our initial investigations determined that all kernel combinations performed very well on linearly separable datasets, as expected. Therefore, to reveal meaningful differences between kernel combinations, we used a "class separation" of 1.0 (see [51]), ensuring with high probability that the two clusters were not linearly separable. All features were scaled (using [54]) into the interval [0, 2π] in preprocessing. We also utilized an SVM implementation from scikit-learn [55].

A total of 120 instances of this classification problem were computed for each of the three result types (Tab. IV), across d = 2, 3, ..., 13 features, with 10 repetitions per value of d. The Supplementary Information [56] contains representative examples for the easily visualized d = 2 case, with one training set also visible in Fig. 3.

result type   θ                γ
(i)           default/random   uniform
(ii)          default/random   optimized
(iii)         trained          optimized

TABLE IV: The three types of results considered in this section. Types are distinguished by the inclusion of kernel parameter (θ) training and/or weights (γ) optimization.

In the section that follows, performance metrics based on the outcomes of these experiments are used to demonstrate the QCC-net optimization approach. We also identify how kernel parameter optimization influences the selection of kernel weights in pairwise combinations of q.c., q.q., and c.c. kernels, and conclude with a brief discussion to reconcile the prior results.

IV. RESULTS AND DISCUSSION

As shown in Tab. IV, the results considered in this section are subdivided into three distinct types: (i) non-optimized, (ii) semi-optimized, and (iii) fully optimized. Note that only type (iii) results include kernel parameter training. The other two types of results feature default (Tab. I) or randomly chosen (Tab. II) parameter values. For type (i) and type (ii) results, default kernel parameters are kept constant throughout, whereas random kernel parameters are uniquely generated for each instance of the problem.

A. Performance Comparisons

With the chosen MKL strategy, a combination of identical kernels is equivalent to that specific kernel individually, due to the symmetry present in Eq. (12). That is, the base kernel is recovered when kθ^(r)(x, x′) ≡ kθ^(1)(x, x′) for all r:

kθ,γ(x, x′) := kθ^(1)(x, x′) Σ_{r=1}^{R} γr = kθ^(1)(x, x′)    (27)

Hence, the diagonal entries in Fig. 4 indicate the given metric for the lone base kernel. Equivalent values are expected for diagonal entries in columns (i) and (iii) if the base kernel is also non-parametric (Linear, RX, and IQP). In all other cases, however, the kernel combination (or a parametric base kernel) stands to benefit from parameter (θ) and/or kernel weights (γ) optimization.
FIG. 4: Median kernel metrics over 120 instances, across datasets with d = 2 to 13 features, for every unique kernel combination. Colour scales are normalized among adjacent pairs in columns (i) and (iii), which are labelled according to the type of results contained therein. The final column contains the difference, (i) subtracted from (iii). Entries in the difference column are true values rounded to 2-digit precision.

To provide an overall measure of optimization efficacy, we list the total differences between non-optimized (i) and fully optimized (iii) results in Tab. V. The table is obtained by adding the differences in the final column of Fig. 4 according to the following scheme: For a given metric, values for kernel combinations A-B contribute to totals for both kernel A and kernel B, whereas A-A combinations contribute only once to the total for kernel A.

1. Accuracy

Looking at accuracy outcomes, the RBF- and RX-containing kernels exhibit the largest and second-largest scores, respectively, both with and without optimization. Meanwhile, the RBF- and QAOA-containing kernels exhibit the largest and second-largest improvements in accuracy, as compared to the other combinations. The greatest overall improvement in accuracy corresponds to the IQP-RBF kernel combination. Prior to optimization, and indeed after, the IQP-containing kernels exhibit the lowest overall accuracy, considering score totals in the same way as above. This improvement can be attributed to regularization via inclusion of the smooth RBF kernel together with IQP. This point was exemplified by Fig. 3, earlier in this text. Apart from RBF and QAOA, the Polynomial kernel (Tab. I) is the only other parametric

kernel. Compared to the related (and non-parametric) entries, all combinations among the RBF, QAOA, RX,
Linear kernel, the Polynomial kernel does not exhibit a and IQP kernels exhibit very high spectral ratios in
significant difference in accuracy scores for the classifica- the “no optimization” column, (i). Without tunable
tion problem at hand, even after optimization. parameters, there is little expectation of any change for
the RX and IQP base kernels, as well as the RX-IQP
combination, because adjusting the kernel weights only
2. AUCROC interpolates between two already-large values. Indeed,
the spectral ratio for RX- and IQP- containing kernels
Trends in the AUCROC values are broadly comparable does not change significantly from column (i) to (iii).
to trends in the accuracy, as evident in the first two For combinations that contain the parametric RBF or
rows of Fig. 4. Here, again, the RBF- and QAOA- QAOA kernels, on the other hand, a strong decrease in
containing kernels show the largest overall improvement. the spectral ratio toward more moderate values (≈ 0.5) is
Individually, the largest improvements are for the QAOA clearly observed. Moreover, the existence and severity of
base kernel and the IQP-RBF kernel combination. A this trend for the lone RBF and QAOA kernels (third and
significant decrease in the AUCROC is seen here for the final diagonal entries, respectively, in the bottom right
IQP-Linear and IQP-Polynomial kernel combinations, grid of Fig. 4) confirms that parameter optimization is
despite a net-zero change in the accuracy, when com- effective at balancing the spectral ratio.
paring (i) and (iii). Noting the low median AUCROC
of the IQP base kernel (0.71), this suggests a stronger
preference for the IQP kernel versus both the Linear 5. Summary
and Polynomial kernels vis a vis the optimization target
Eq. (22), which is not necessarily aligned with the Overall, these results show that kernel combinations,
AUCROC metric. The same line of reasoning applies especially those involving parametric kernels, benefit
for the Linear and Polynomial kernels in combination significantly from the QCC-net optimization procedure.
with QAOA. Conversely, a stronger preference RBF in The RBF- and RX-containing kernels demonstrated the
the IQP-RBF combination (in addition to optimizing highest accuracy and AUCROC scores, with RBF- and
θ2 (Tab. I)) leads to a large increase (+0.08) in the QAOA-containing kernels exhibiting the largest improve-
AUCROC over the balanced IQP-RBF combination. It ments for these metrics. Specifically, the IQP-RBF ker-
is also noteworthy that, despite a lower score overall, nel combination displayed the greatest overall improve-
the base QAOA kernel exhibits a comparatively strong ment in accuracy, which we attribute to the regulariza-
improvement in the median AUCROC with parameter tion effect of the smoother RBF kernel. Conversely, the
optimization. IQP, Linear, and Polynomial kernels showed lower overall
performance. The results also highlight that parameter
optimization can effectively balance the spectral ratio,
3. Margin which is particularly important for parametric kernels
like RBF and QAOA.
Outcomes for the margin metric also show the largest
differences among combinations that pair low- and high-
scoring base kernels. Excluding combinations with the B. Impact of Parameter Training on MKL
two lowest-scoring base kernels (Linear, Polynomial), a
slight overall decrease in the margin is evident among In order to separate the effects of kernel parameter
the remaining kernel combinations. Moreover, negative training and kernel weights optimization, we proceed
changes are associated with the RBF- and QAOA- with a comparison between type (ii) and type (iii) results
containing combinations, and most strongly with the lone (Tab. IV). Recall that the latter uses both trained θ ⋆ and
base kernels in either case. This suggests that parameter optimized γ ⋆ , whereas the former uses random or default
optimization tends to simultaneously reduce the margin, θ and optimized γ ⋆ . We shall outline the comparison
despite generally improving accuracy and AUCROC. here in terms of the optimal kernel weights determined
in either case. As illustrated in Fig. 5, we compare the
distributions of γ ⋆ for data with d = 2, 6, and 13 features
4. Spectral Ratio to distinguish, also, any trends in γ ⋆ that depend on d.

We observe the largest differences overall in comparing


type (i) and type (iii) results for the spectral ratio. 1. Non-Parametric Combinations
Selection against the Linear and Polynomial kernels
during γ optimization leads to a comparative increase of To start, we confirm that type (ii) and type (iii)
the spectral ratios here for related combinations, because results are indeed very similar for kernel combinations
the matching base kernels exhibit very low spectral ratios that contain no parametric base kernels, namely those on
in general. Excluding Linear- and Polynomial-containing rows (a) to (c). We can therefore discuss this subset of

kernel       Accuracy   AUCROC   Margin   Spectral Ratio
Linear       0.18       0.02     0.10     0.75
Polynomial   0.18       0.02     0.10     0.22
RBF          0.55       0.26     0.02     -1.76
RX           0.10       0.06     0.10     -0.22
IQP          0.14       -0.02    0.10     -0.34
QAOA         0.26       0.14     -0.06    -1.92

TABLE V: Total difference by metric, over all unique combinations containing each base kernel. Every entry is a sum of six elements from the difference grid for the given metric in the final column of Fig. 4. Boldface entries indicate the kernel(s) showing the greatest total difference by magnitude, for each metric. Improvements in the accuracy, AUCROC, and margin correspond to positive values (i.e. a total increase). However, a total increase or decrease can both indicate an improvement in the spectral ratio, depending on initial values in the first column of Fig. 4.
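The aggregation scheme behind Tab. V (A-B differences counted toward both A and B, A-A counted once) can be sketched as follows. The 6×6 difference grid used here is a random placeholder, not the values reported in Fig. 4.

```python
import numpy as np

kernels = ["Linear", "Polynomial", "RBF", "RX", "IQP", "QAOA"]
rng = np.random.default_rng(0)
diff = rng.normal(scale=0.1, size=(6, 6))   # placeholder for one metric's (iii) - (i) grid
diff = np.triu(diff)                        # keep only the unique A-B combinations

totals = {}
for a, name in enumerate(kernels):
    total = 0.0
    for b in range(len(kernels)):
        lo, hi = min(a, b), max(a, b)
        total += diff[lo, hi]               # A-B counted for both A and B; A-A counted once
    totals[name] = total                    # each total sums six grid elements
print(totals)
```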

FIG. 5: Density of optimized kernel weights (γ1⋆ + γ2⋆ = 1) with and without additional optimization of kernel parameters (θ),
for d ∈ {2, 6, 13} features. For a given d, distributions skewing right indicate a preference for the kernel on the right-hand
side (γ2⋆ > γ1⋆ ), and vice versa when skewing left (γ1⋆ > γ2⋆ ). Outlined distributions correspond to fully optimized results (type
(iii)) and filled distributions correspond to semi-optimized results (type (ii)). Alphabetic labels are provided for convenience.

outcomes without distinguishing between the two types case, which may explain why the Linear kernel is not
of results. A clear preference for the RX kernel over the eliminated entirely (i.e. γ2 ̸= 0). On row (c), preference
Linear kernel is seen at all three values of d on row (a) for the RX kernel is strong at d = 2 for this non-
of Fig. 5. The same can be said for IQP, regarding the parametric q.q. combination. However, a gradual shift
IQP-Linear combination on row (b). Neither result is toward balanced weights, and a narrowing of the weights
surprising, since the Linear kernel is obviously not suited distribution, is observed with increasing d. For the
to the non-linearly-separable classification problem under largest number of features, d = 13, the RX and IQP
consideration (see, for example, Fig. S1). Nonetheless, kernels appear equally effective.
results from the previous section (Fig. 4) indicate
that the RX-Linear and IQP-Linear combinations do
in fact outperform the lone quantum kernels in either

2. Parametric Kernel Combinations In the fully optimized case, however, a preference for
the RBF kernel is clearly visible. Here, and for the
In contrast to the above, results for the c.c. RX-RBF combination, (k), parameter training enhances
Polynomial-Linear combination, (d), show approxi- the selectivity for weights tuning, especially at higher
mately balanced weights across all d, apart from a small dimensions.
proportion of the fully optimized outcomes (iii) that The final rows of Fig. 5, (n) to (q), contain the
skew very strongly toward the Polynomial kernel. This QAOA kernel in combination with RBF, RX, and IQP,
suggests rare instances in which parameter training is respectively. Regarding the semi-optimized results for
highly successful for the Polynomial kernel. Results these three rows, we note that, while the random-
for the Linear-RBF combination immediately below, on parameter QAOA kernel (ii) is selected against for data
row (e), may further support this: While the RBF with d = 2 features, this trend gradually disappears for
kernel largely dominates the combination, a comparable larger d. Indeed, at d = 13, rows (n) to (q) illustrate
proportion of the outcomes at the d = 13 strongly favour virtually no preference between the random θ QAOA
the fully optimized Polynomial kernel. This is not to kernel and the RBF, RX, or IQP kernels. The opposite
suggest, however, that the trained Polynomial kernel is trend is observed, however, for the fully optimized results
particularly effective for the problem at hand, since its on these combinations. That is, preference for the fully
combination with the RX and IQP kernels, on rows (f ) optimized QAOA kernel increases from d = 2 to 13.
and (g), exhibits minimal difference between result types This result, too, supports the concluding claim in the
(ii) and (iii). previous paragraph. Operating upon data with d = 13
The QAOA-Polynomial combination on row (h) cor- features, the weights tuning algorithm (Sec. III A 2) does
responds to the first unambiguous result as far as not discriminate between component kernels in neither
differentiating trained and random-parameter outcomes. q.q., nor RBF-containing q.c. combinations with the base
Preference for the QAOA kernel over the Linear kernel kernels considered in this work.
is clear with and without parameter training, although
much narrower distributions are observed for the former
(type (iii)), especially at lower d. A similar trend is seen 4. Summary
for the QAOA-Linear combination on row (j). With
the Linear-RBF combination in between the prior two These findings reveal that the RX, IQP, and RBF
results, on row (i), the trend is again similar, except kernels surpass the Linear and Polynomial kernels for
the semi-optimized distributions are more narrow here the studied classification problems. Furthermore, the
to start. Evidently, the initial value of the RBF scaling fully optimized QAOA kernel is more strongly preferred
parameter, θ0 = 1, represents a reasonable choice for as data dimensionality grows, suggesting that parameter
data with features scaled to [0, 2π]. No such choice training serves to enable kernel weights tuning in high-
can be made for the QAOA kernel, however, which is dimensional settings. The results also show limited
parametrized via periodic quantum gates. Thus, for discrimination between quantum and classical kernels in
the type (ii) results on rows (h) and (j), random- q.q. and RBF-containing q.c. combinations, suggesting
parameter outcomes for QAOA-containing combinations that kernel and optimization choices should be tailored
are more broadly distributed compared to the Linear- to the problem and dataset specifics.
RBF combination.

C. Insights on QC Combinations
3. Quantum-RBF and QAOA-Quantum Kernel
Combinations Generalizing upon results from the previous two sec-
tions, the RBF-containing q.c. combinations exhibit the
Rows (k) to (q) correspond to pairs among RBF best performance overall and the greatest improvements
and the three quantum kernels (Tab. II). For the in performance metrics between result types (i) and (iii).
RX-RBF combination, (k), we note that γ ⋆ remains When paired with the quantum kernels from Tab. II,
approximately balanced across d without parameter op- MKL optimization weighs the untrained RBF kernel
timization. For type (iii) results, however, the RBF equally or more heavily than all three quantum coun-
kernel is preferred over RX, and more so with increasing terparts, including the untrained (random parameter)
d. Next, for IQP-RBF kernel combination on row (m), QAOA kernel. Based on the median metrics in Fig. 4,
we note that γ ⋆ skews slightly toward RBF for both we can confirm that training θ2 improves the lone RBF
types of results at the two lower values of d. This is kernel’s performance. The fact that the trained QAOA
more severe for the fully optimized case (iii), in view of kernel is weighted more heavily than the trained RBF
the d = 6 result. At d = 13, the semi-optimized case kernel (Fig. 5, row (n)) then suggests an important and
(ii) does not distinguish at all between IQP and RBF. competitive contribution from QAOA. However, even in
(The distribution in Fig. 5 row (m) is very narrow and considering only d = 13 feature data, median metrics
obscured by the vertical grid line at γ1⋆ = γ2⋆ = 0.5.) for case (iii) closely resemble those of results across all d

FIG. 6: Density contours illustrating distributions of performance metrics for QAOA-containing q.c. combinations for all
three result types, using (i) random kernel parameters and equal kernel weights (θ, γ); (ii) random kernel parameters and
optimized weights (θ, γ ⋆ ); and (iii) trained kernel parameters and optimized kernel weights (θ ⋆ , γ ⋆ ). The dashed black lines
in all subplots indicate x = y and are provided to help orient to eye. Distributions include results for all feature sizes, d = 2
to 13. Note that matching results are expected for cases (i) and (ii) along the top row of either set of subplots, since these
correspond to the lone QAOA base kernel.

in Fig. 4, where the lone RBF kernel consistently out- datasets under consideration, the utility of more complex
performs the QAOA-RBF combination. QEKs may yet be demonstrable on higher-dimensional or
One reason for this could be the shallow depth of specifically structured data, as in Refs. [9, 58].
the minimal ansätze utilized for QAOA and the other In the present context, QAOA remains the best repre-
quantum kernels (Tab. II). In the context of its original sentative of a trainable and practical family of quantum
application [31], multiple repetitions of the QAOA circuit kernels. The distributions of performance metrics for
are known to produce higher quality approximations q.c. combinations that contain this kernel, and those
to solutions for combinatorial problems. It may be of the base QAOA kernel, are illustrated in Fig. 6.
the case that a similar relationship exists for QEKs Here, as also revealed by the median metrics in Fig.
that also utilize multiple, parametric QAOA layers for 4, the overall accuracy and AUCROC are only slightly
classification with kernel methods [57]. The related, improved, and the variance of outcomes only slightly
though non-parametric, IQP kernel could also benefit reduced, in comparing result types (i) and (iii). Still more
from repetitions of the embedding circuit—such repe- subtle is the accuracy and AUCROC difference between
titions are, in fact, a requirement for rigorous classical types (i) and (ii). The latter is unsurprising in view of
“hardness” in this case [30]. Fig. 5, since without parameter training the optimized
The observation of better performance metrics for weights are seen to converge to the balanced (default)
the RX-containing q.c. combinations, in comparison vector, γ ⋆ → γ = [0.5, 0.5]T , as d increases.
to the QAOA- and IQP-containing combinations, is Regarding the margin and spectral ratio outcomes for
not necessarily contradictory to the above hypothesis. this subset of q.c. combinations (Fig. 6 (b)), the type (ii)
Considering that the RX kernel is effectively a quantum results suggest that weights optimization without param-
implementation of a classical cosine kernel [27], and eter training may actually be detrimental for the QAOA-
acknowledging the relative simplicity of the synthetic Linear and QAOA-Polynomial combinations, whereas

the QAOA-RBF distribution is again not significantly kernel weighting step to be indecisive without the former
altered. With the inclusion of parameter training in training step when optimizing q.c. combinations that
case (iii), the distributions for the QAOA kernel and all contain the parametric QAOA kernel. The simpler RX
QAOA-containing q.c. combinations are seen to change kernel and its q.c. combinations exhibited the best
more drastically. Specifically, the margin distribution is performance metrics among the quantum and quantum-
widened while the spectral ratios are reduced for type containing kernels considered. More broadly, classical
(iii) results across every row in Fig. 6 (b). The type RBF kernel and its combinations performed the best
(iii) results here also converge to similar distributions for overall.
all four combinations considered, with the QAOA-RBF We are able to recommend a number of directions
combination producing the best outcome overall. for future work aimed at expanding the scope of this
study and/or identifying an empirical advantage with
quantum kernels. For example, the use of multi -layer
embedding circuits for quantum kernels may prove
V. CONCLUSION more effective on higher-dimensional data, based on
the trends we observed for QAOA-containing kernel
In this work, we used modern software tools and a combinations. Additionally, the EasyMKL algorithm is
novel optimization procedure inspired by classical ma- well suited for combining a far greater number of kernels
chine learning to systematically explore the utility of c.c., and can therefore be used to explore the q.q, c.c., and
q.c., and q.q. kernel combinations in binary classification q.c paradigms beyond pairwise combinations. On the
problems, over datasets with d = 2 to 13 features. other hand, alternate (non-linear) MKL strategies are
Considering three quantum kernels and three canonical also a worthwhile prospect for future work, assuming
classical kernels in a comparative setting, we found that that combination weights can be computed efficiently.
only the most complex and parametric quantum kernel Finally, experiments on datasets with a different, less
(QAOA) attains a higher optimum weight in pairwise generic structure may provide the more promising results
combination with the most performant classical kernel for q.q. and q.c. combinations, especially if the number
(RBF). Conversely, classification performance was not of features is large.
found to differ significantly between q.c. combinations
featuring simpler classical kernels (Linear, Polynomial) VI. ACKNOWLEDGEMENTS
in comparison to the lone quantum kernel. Regarding
use of the QCC-net for training the kernel parameters Partial funding for this work was provided by the
and optimizing the combination weights, we found the Mitacs Accelerate program.

[1] J. Preskill, arXiv preprint arXiv:1801.00862 2, 79 (2018). https://doi.org/10.48550/arXiv.2208.01203.


[2] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, [13] S. K. Radha and C. Jao, arXiv preprint arXiv:2201.02310
R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. (2022), https://doi.org/10.48550/arXiv.2201.02310.
Buell, et al., Nature 574, 505 (2019). [14] M. Schuld and N. Killoran, Phys. Rev. Lett. 122, 040504
[3] L. S. Madsen, F. Laudenbach, M. F. Askarani, F. Rortais, (2019).
T. Vincent, J. F. Bulmer, F. M. Miatto, L. Neuhaus, [15] T. Hubregtsen, D. Wierichs, E. Gil-Fuster, P.-J. H. S.
L. G. Helt, M. J. Collins, et al., Nature 606, 75 (2022). Derks, P. K. Faehrmann, and J. J. Meyer, Phys. Rev. A
[4] J. Shawe-Taylor and N. Cristianini, Kernel Methods for 106, 042431 (2022).
Pattern Analysis (Cambridge University Press, 2004). [16] M. Gönen and E. Alpaydın, J. Mach. Learn. Res. 12,
[5] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, 2211 (2011).
N. Wiebe, and S. Lloyd, Nature 549, 195 (2017). [17] F. Aiolli and M. Donini, Neurocomputing 169, 215
[6] C. Cortes and V. Vapnik, Mach. Learn. 20, 273 (1995). (2015).
[7] R. Mengoni and A. D. Pierro, Quantum Mach. Intell. 1, [18] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grand-
65 (2019). valet, J. Mach. Learn. Res. 9, 2491 (2008).
[8] P. Rebentrost, M. Mohseni, and S. Lloyd, Phys. Rev. [19] T. Suzuki and M. Sugiyama, in Artificial Intelligence and
Lett. 113, 130503 (2014). Statistics (PMLR, 2012) pp. 1152–1183.
[9] Y. Liu, S. Arunachalam, and K. Temme, Nat. Phys. 17, [20] F. R. Bach, G. R. Lanckriet, and M. I. Jordan, in
1013 (2021). Proceedings of the twenty-first international conference
[10] E. Peters, J. Caldeira, A. Ho, S. Leichenauer, on Machine learning (2004) p. 6.
M. Mohseni, H. Neven, P. Spentzouris, D. Strain, and [21] Z. Yang, N. Tang, X. Zhang, H. Lin, Y. Li, and Z. Yang,
G. N. Perdue, Npj Quantum Inf. 7, 161 (2021). Artif. Intell. Med. 51, 163 (2011).
[11] T. Sancho-Lorente, J. Román-Roche, and D. Zueco, [22] F. Bach, R. Jenatton, J. Mairal, G. Obozinski, et al.,
Phys. Rev. A 105, 042432 (2022). Found. Trends Mach. Learn. 4, 1 (2012).
[12] O. Kyriienko and E. B. Magnusson, [23] D. Willsch, M. Willsch, H. De Raedt, and K. Michielsen,
arXiv preprint arXiv:2208.01203 (2022), Comput. Phys. Commun. 248, 107006 (2020).

[24] S. S. Vedaie, M. Noori, J. S. Oberoi, B. C. Sanders, and [48] A. Agrawal, S. Barratt, S. Boyd, E. Busseti, and
E. Zahedinejad, arXiv preprint arXiv:2011.09694 (2020), W. Moursi, Journal of Applied and Numerical Optimiza-
https://doi.org/10.48550/ARXIV.2011.09694. tion 1, 107 (2019).
[25] J. S. Baker, G. Park, K. Yu, A. Ghukasyan, O. Goktas, [49] V. Bergholm, J. Izaac, M. Schuld, C. Gogolin,
and S. K. Radha, arXiv preprint arXiv:2305.05881 S. Ahmed, et al., “Pennylane: Automatic differentiation
(2023), https://doi.org/10.48550/arXiv.2305.05881, of hybrid quantum-classical computations,” (2022),
arXiv:2305.05881 [quant-ph]. arXiv:1811.04968 [quant-ph].
[26] B. Schólkopf, A. J. Smola, F. Bach, et al., Learning [50] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury,
with Kernels: Support Vector Machines, Regularization, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga,
Optimization, and Beyond (MIT Press, 2002). et al., Advances in neural information processing systems
[27] M. Schuld and F. Petruccione, Supervised learning with 32 (2019), https://doi.org/10.48550/arXiv.1912.01703.
quantum computers (Springer, 2018). [51] Covalent: https://www.covalent.xyz.
[28] M. Schuld, arXiv preprint arXiv:2101.11020 (2021), [52] make classification: https://scikit-learn.org/
https://doi.org/10.48550/arXiv.2101.11020. stable/modules/generated/sklearn.datasets.make_
[29] A. W. Harrow, A. Hassidim, and S. Lloyd, Phys. Rev. classification.html.
Lett. 103, 150502 (2009). [53] I. Guyon, in NIPS 2003 workshop on feature extraction
[30] V. Havlı́ček, A. D. Córcoles, K. Temme, A. W. Harrow, and feature selection, Vol. 253 (2003) p. 40.
A. Kandala, J. M. Chow, and J. M. Gambetta, Nature [54] MinMaxScaler: https://scikit-learn.org/stable/
567, 209 (2019). modules/generated/sklearn.preprocessing.
[31] E. Farhi, J. Goldstone, and S. Gutmann, MinMaxScaler.html.
arXiv preprint arXiv:1411.4028 (2014), [55] SVC: https://scikit-learn.org/stable/modules/
https://doi.org/10.48550/arXiv.1411.4028. generated/sklearn.svm.SVC.html.
[32] S. Lloyd, M. Schuld, A. Ijaz, J. Izaac, and [56] See Supplementary Information [publisher url] for exam-
N. Killoran, arXiv preprint arXiv:2001.03622 (2020), ples of two-dimensional datasets.
https://doi.org/10.48550/arXiv.2001.03622. [57] S. Jerbi, L. J. Fiderer, H. Poulsen Nautrup, J. M. Kübler,
[33] L. Cincio, Y. Subaşi, A. T. Sornborger, and P. J. Coles, H. J. Briegel, and V. Dunjko, Nature Communications
New J. Phys. 20, 113022 (2018). 14, 517 (2023).
[34] M. Fanizza, M. Rosati, M. Skotiniotis, J. Calsamiglia, [58] J. R. Glick, T. P. Gujarati, A. D. Corcoles,
and V. Giovannetti, Phys. Rev. Lett. 124, 060503 (2020). Y. Kim, A. Kandala, J. M. Gambetta, and
[35] H.-Y. Huang, R. Keung, and J. Preskill, Nat. Phys. 16, K. Temme, arXiv preprint arXiv:2105.03406 (2021),
1050 (2022). https://doi.org/10.48550/arXiv.2105.03406.
[36] C. Brouard, J. Mariette, R. Flamary, and
N. Vialaneix, NAR Genom. Bioinform. 4 (2022),
https://doi.org/10.1093/nargab/lqac014.
[37] H. Xue, Y. Song, and H.-M. Xu, Knowl. Based Syst.
191, 105272 (2020).
[38] C. Gautam, R. Balaji, S. K., A. Tiwari, and K. Ahuja,
Knowl. Based Syst. 165, 241 (2019).
[39] I. Steinward and A. Christmann, Support Vector Ma-
chines (Springer Science & Business Media, 2008).
[40] D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980
(2014), https://doi.org/10.48550/arXiv.1412.6980.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour-
napeau, M. Brucher, M. Perrot, and E. Duchesnay, J.
Mach. Learn. Res. 12, 2825 (2011).
[42] I. Steinwart, D. Hush, and C. Scovel, IEEE Trans. Inf.
52, 4635 (2006).
[43] C. A. Micchelli, Y. Xu, and H. Zhang, J. Mach. Learn.
Res. 7, 2667 (2006).
[44] S. Diamond and S. Boyd, J. Mach. Learn. Res. 17, 1
(2016).
[45] A. Agrawal, R. Verschueren, S. Diamond,
and S. Boyd, J. Control. Decis. (2019),
https://doi.org/10.48550/arXiv.1709.04494,
arXiv:1709.04494 [math.OC].
[46] B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd,
Journal of Optimization Theory and Applications 169,
1042 (2016).
[47] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond,
and Z. Kolter, in Advances in Neural Information Pro-
cessing Systems (2019).

SUPPLEMENTARY INFORMATION:
QUANTUM-CLASSICAL MULTIPLE KERNEL LEARNING

FIG. S1: Ten examples of d = 2 dimensional classification datasets corresponding to Sec. III B. Each pair of plots shows the
equally-split training and testing subsets (left (a) and right (b), respectively). All horizontal and vertical axes range from 0
to 2π. Square scatter points (red) belong to one class and round scatter points (blue) belong to the other.
