Quantum-Classical Multiple Kernel Learning
Ara Ghukasyan,1,2 Jack S. Baker,1 Oktay Goktas,1 Juan Carrasquilla,2,3 Santosh Kumar Radha1,∗
1 Agnostiq Inc., 325 Front St W, Toronto, ON M5V 2Y1
2 University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1
3 Vector Institute, 661 University Ave Suite 710, Toronto, ON M5G 1M1
(Dated: May 30, 2023)
As quantum computers become increasingly practical, so does the prospect of using quantum computation to improve upon traditional algorithms. Kernel methods in machine learning are one area where such improvements could be realized in the near future. Paired with kernel methods like support-vector machines, small and noisy quantum computers can evaluate classically-hard quantum kernels that capture unique notions of similarity in data. Taking inspiration from techniques in classical machine learning, this work investigates simulated quantum kernels in the context of multiple kernel learning (MKL).
I. INTRODUCTION
kernel name | embedding unitary | Eqs.     | parameters   | ref.
RX          | RX(x)             | (18)     | -            | -
IQP         | V(x) H^{⊗N}       | (19)     | -            | [30]
QAOA        | W(θ) RX(x)        | (18, 20) | θ ∈ R^{2N}   | [31, 32]

TABLE II: Quantum kernels considered in this work. Circuit diagrams (not shown) provide a 3-qubit example of the embedding circuit Uθ(x) (i.e., one half of the kernel circuit), built from H, RX, RY, and RZ rotations together with two-qubit ZZ gates. Initial parameters for the QAOA circuit are chosen uniformly at random from [0, 2π]^{2N}. Equations defining the embedding circuits are provided in Sec. III A 1.
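For readers who want to experiment with circuits of this kind, the following is a minimal 3-qubit sketch written with PennyLane (which the paper cites as [49], although these particular definitions are our own illustration, not the paper's code). The gate ordering, the IsingZZ parameterization, and the parameter indexing are simplified stand-ins for the embedding circuits of Tab. II, and the final kernel circuit shows the standard overlap construction from two copies of the embedding.

```python
# Minimal sketch (assuming PennyLane) of 3-qubit embedding circuits in the spirit of Tab. II.
# Gate ordering and parameter indexing are illustrative, not the paper's exact circuits.
from itertools import combinations
import pennylane as qml

n_qubits = 3
dev = qml.device("default.qubit", wires=n_qubits)

def rx_embedding(x):
    # RX kernel: encode one feature per qubit as an RX rotation angle
    for i in range(n_qubits):
        qml.RX(x[i], wires=i)

def iqp_embedding(x):
    # IQP-style kernel: Hadamards, RZ feature encoding, then ZZ(x_p x_q) on all qubit pairs
    for i in range(n_qubits):
        qml.Hadamard(wires=i)
        qml.RZ(x[i], wires=i)
    for p, q in combinations(range(n_qubits), 2):
        qml.IsingZZ(x[p] * x[q], wires=[p, q])

def qaoa_embedding(x, theta):
    # QAOA-style kernel: RX feature encoding, parametric ZZ entanglers, trainable RY layer;
    # theta has 2N entries for this 3-qubit example
    for i in range(n_qubits):
        qml.RX(x[i], wires=i)
    for k, (p, q) in enumerate(combinations(range(n_qubits), 2)):
        qml.IsingZZ(theta[k], wires=[p, q])
    for i in range(n_qubits):
        qml.RY(theta[n_qubits + i], wires=i)

@qml.qnode(dev)
def kernel_circuit(x1, x2):
    # Full kernel circuit: U(x1) followed by U(x2)^dagger; the probability of |0...0>
    # gives the squared state overlap used as the kernel value
    rx_embedding(x1)
    qml.adjoint(rx_embedding)(x2)
    return qml.probs(wires=range(n_qubits))

def rx_kernel(x1, x2):
    return kernel_circuit(x1, x2)[0]
```

Any of the embedding functions above could be substituted into the kernel circuit in place of the RX example.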
Here, a corresponding feature map is one that takes x to a density matrix ρ(x) = |Φ(x)⟩⟨Φ(x)|. Note that any possibility for quantum advantage is lost if the embedding is too simple [30], so U must be chosen carefully. The QEK itself is defined by the Frobenius inner product ⟨ρ(x′), ρ(x)⟩, or equivalently

Tr{ρ(x′)ρ(x)} = |⟨Φ(x′)|Φ(x)⟩|² = kθ(x, x′).  (7)

Any of several existing methods for fidelity estimation [33–35] can be used to evaluate Eq. (7) in practice,

B. Multiple Kernel Learning

The goal in MKL is to improve a kernel method's performance by introducing a novel notion of similarity derived from multiple distinct kernels. Typical use cases for MKL include affecting feature selection [36, 37], enabling anomaly detection [38], and enhancing expressivity [24]. Many kernel combination strategies have been proposed for MKL [16], although weighted linear combinations are often effective [17, 24].
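As a minimal illustration of such a weighted linear combination (a sketch, not the paper's pipeline), two Gram matrices can be blended with convex weights and passed to a support-vector classifier with a precomputed kernel. Here `K_quantum` is a hypothetical placeholder for a Gram matrix evaluated with a quantum kernel, and the equal weights stand in for values that an MKL procedure would normally determine.

```python
# Sketch (assuming NumPy and scikit-learn): convex combination of a classical RBF Gram
# matrix with a (placeholder) quantum-kernel Gram matrix, used by a precomputed-kernel SVM.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 2 * np.pi, size=(20, 3))   # features scaled to [0, 2*pi]
y_train = rng.choice([-1, 1], size=20)

K_rbf = rbf_kernel(X_train)                            # classical component
K_quantum = np.eye(20)                                 # placeholder for a quantum Gram matrix
gamma = np.array([0.5, 0.5])                           # convex weights: nonnegative, sum to 1

K_combined = gamma[0] * K_rbf + gamma[1] * K_quantum
clf = SVC(kernel="precomputed").fit(K_combined, y_train)
# Prediction requires the rectangular matrix k(x_test, x_train), combined with the same weights.
```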
z^T Kθ,γ z ≥ 0  (9)

must hold for all z ∈ R^M, where M = #{Ẑ} for any (improper) subset Ẑ of the training set, X̂ ⊂ X. Here, Kθ,γ ∈ R^{M×M} is the Gram matrix [26] of all pairwise kernels, kθ,γ(x, x′), for x, x′ ∈ Ẑ. If the Gram matrices of all R base kernels satisfy Eq. (9), then their additive or multiplicative combinations are also guaranteed to be valid kernels [39].

A convex linear combination of base kernels (Fig. 2) is constructed by taking

fγ = ⟨γ, ·⟩  (10)

with the following constraints on γ:

||γ||_1 = 1,  γ_r ≥ 0.  (11)

1. Quantum-Classical Kernel Combinations

Since computing common classical kernels adds no significant overhead compared to querying quantum computers or simulating quantum circuits, a useful q.c. combination could be an easy way to make the most of available NISQ hardware. Feature spaces associated with quantum kernels are unique insofar as (1) they can be classically-hard to simulate, and (2) are directly modelled by physical quantum states. The first point here is necessary for achieving quantum advantage [9] (though this does not per se guarantee better learning performance), while the second point means that the feature space is directly tunable in an efficient manner, unlike useful feature spaces in the traditional setting. Using an MKL strategy like Eq. (12) creates a weighted
The process of training the kernel parameters (θ) typically represents a separate consideration with respect to optimizing the kernel combination weights (γ). In this work, kernel parameter training relies on a stochastic, gradient-based method [40], while weights optimization involves solving a convex quadratic problem based on the EasyMKL algorithm [17]. The QCC-net introduced in Sec. III A 2 combines these processes in a fully differentiable way, with the MKL objective substituted as the training loss. As we shall demonstrate in Sec. IV B, parameter training is necessary in some cases for the MKL algorithm to distinguish component kernels.

AUCROC: area under the curve traced by (TP/(TP + FN), FP/(FP + TN)) as the decision threshold t varies
Margin: min{ ||Φθ,γ(x) − Φθ,γ(x′)|| s.t. y ≠ y′ }
Spectral Ratio: Σ_i [Kθ,γ]_ii / sqrt( Σ_ij [Kθ,γ]²_ij )

TABLE III: Base metrics (top four), classification metrics (accuracy, AUCROC), and kernel metrics (margin, spectral ratio) used for combined comparisons in Sec. IV. Classification metrics are computed from testing outcomes and kernel metrics are computed from training outcomes.
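For concreteness, the kernel metrics of Tab. III can be computed directly from a training Gram matrix. The short NumPy sketch below reflects our reading of those definitions and is not the paper's implementation.

```python
# Sketch (NumPy): kernel metrics from Tab. III, given a Gram matrix K with
# K[i, j] = k(x_i, x_j) on the training set and labels y in {-1, +1}.
import numpy as np

def spectral_ratio(K):
    # sum of diagonal entries divided by the Frobenius norm of K
    return np.trace(K) / np.sqrt(np.sum(K ** 2))

def margin(K, y):
    # smallest feature-space distance between opposite-class points, via the kernel trick:
    # ||Phi(x) - Phi(x')||^2 = k(x, x) + k(x', x') - 2 k(x, x')
    d = np.diag(K)
    dist_sq = d[:, None] + d[None, :] - 2.0 * K
    opposite = y[:, None] != y[None, :]
    return np.sqrt(np.maximum(dist_sq[opposite].min(), 0.0))
```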
bounded kernels (i.e., RBF and all quantum kernels) the diagonal sum is always equal to #{X̂} (i.e., the size of X̂), because these kernels evaluate to unity on identical input pairs. Since off-diagonal terms correspond to non-identical input pairs, the maximum spectral ratio of 1 is obtained with a kernel

kθ,γ(x, x′) = δ_{x,x′},  (16)

which is a Kronecker delta function on the domain x, x′ ∈ X̂. This limit corresponds to poor classification in general, as only identical points are considered “similar”. Conversely, the minimum spectral ratio 1/#{X̂} is obtained with the constant function

kθ,γ(x, x′) = 1,  (17)

which considers every pair of points to be maximally similar. Therefore, a reasonable spectral ratio corresponds to some value between these extremes.

For combinations containing unbounded kernels (Linear, Polynomial), we normalize the component Gram matrix (Eq. (15)) before the sums over Kθ,γ (Tab. III) are computed. Normalization additionally helps to stabilize the kernel weighting algorithm (Sec. III A 2) in general. However, this results in Σ_i [Kθ^(r)]_ii < #{X̂}, which (unless γr is zero) unnaturally lowers the combined kernel's spectral ratio. Direct comparisons of spectral ratios are therefore not reliable between combinations that contain only bounded kernels versus those containing one or more unbounded kernels. Comparisons within these two groups, however, are fully justified.

The simplest quantum kernel that we consider is RX (Tab. II), which encodes each component of x ∈ R^d onto one of N = d qubits using RX(x) as the embedding unitary. The IQP [30] kernel is defined by the ansatz

V(x) = exp{ −(i/2) Σ_{p≠q} x_p x_q Z_p Z_q } RZ(x),  (19)

which uses RZ rotations followed by two-qubit gates on all pairs of qubits to encode data. Lastly, the variational ansatz

W(θ) = RY(θ) exp{ −(i/2) Σ_{p≠q} θ_{pq} Z_p Z_q },  (20)

defines the QAOA [31, 32] kernel. While similar to IQP, the QAOA kernel features parametric transformations after a single set of encoding gates. Regarding all three embedding circuits, we utilize the minimal, single-layer ansatz in every case, as shown in Tab. II.

2. QCC-net Optimization for MKL

Given M training samples from X̂ ⊂ X ⊂ R^d and their labels as a matrix Ŷ = diag(y1, ..., yM), where yi ∈ {−1, 1}, the combination weights are determined to maximize the total distance (in feature space) between positive and negative samples. Following [17], this problem is formulated as

max_{||γ||=1} min_ϕ  (1 − λ) γ^T dθ(ϕ) + λ ||ϕ||²_2,  (21)
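The sketch below illustrates one way an optimization of this kind could be carried out, following the general EasyMKL recipe of Ref. [17] rather than the paper's QCC-net: the inner problem is solved as a convex quadratic program (here with cvxpy, an assumption on our part, although the paper cites cvxpy-related tooling [44–48]), and the combination weights are then taken proportional to each base kernel's contribution to the class separation. The constraint set and the closed-form weight update are our reading of [17], not details confirmed by this excerpt.

```python
# Sketch (assuming cvxpy and NumPy) of an EasyMKL-style weight optimization in the
# spirit of Eq. (21). Constraints and weight update follow our reading of [17].
import cvxpy as cp
import numpy as np

def easymkl_weights(gram_matrices, y, lam=0.2):
    """Convex combination weights gamma (nonnegative, summing to 1) for base Gram matrices."""
    y = np.asarray(y, dtype=float)
    Y = np.diag(y)                                  # Y_hat = diag(y_1, ..., y_M)
    K_sum = sum(gram_matrices)                      # start from the unweighted combination
    P = Y @ K_sum @ Y
    P = 0.5 * (P + P.T)                             # symmetrize for the quadratic form

    M = len(y)
    phi = cp.Variable(M, nonneg=True)
    pos = np.where(y > 0)[0]
    neg = np.where(y < 0)[0]
    objective = cp.Minimize((1 - lam) * cp.quad_form(phi, P) + lam * cp.sum_squares(phi))
    constraints = [cp.sum(phi[pos]) == 1, cp.sum(phi[neg]) == 1]
    cp.Problem(objective, constraints).solve()

    # Weight each base kernel by its contribution to the class separation, then normalize.
    d = np.array([phi.value @ (Y @ K @ Y) @ phi.value for K in gram_matrices])
    return d / d.sum()

# Example: two toy Gram matrices for 4 samples with labels [+1, +1, -1, -1]
X = np.random.default_rng(1).normal(size=(4, 2))
K1 = X @ X.T                                                          # linear kernel
K2 = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))   # RBF kernel
print(easymkl_weights([K1, K2], y=np.array([1, 1, -1, -1])))
```

The returned γ can then be used to form the combined Gram matrix Σ_r γ_r K_r for a precomputed-kernel SVM, as in the earlier combination sketch.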
FIG. 4: Median kernel metrics over 120 instances, across datasets with d = 2 to 13 features, for every unique kernel
combination. Colour scales are normalized among adjacent pairs in columns (i) and (iii), which are labelled according to the
type of results contained therein. The final column contains the difference, (i) subtracted from (iii). Entries in the difference
column are true values rounded to 2-digit precision.
kernel. Compared to the related (and non-parametric) Linear kernel, the Polynomial kernel does not exhibit a significant difference in accuracy scores for the classification problem at hand, even after optimization.

2. AUCROC

Trends in the AUCROC values are broadly comparable to trends in the accuracy, as evident in the first two rows of Fig. 4. Here, again, the RBF- and QAOA-containing kernels show the largest overall improvement. Individually, the largest improvements are for the QAOA base kernel and the IQP-RBF kernel combination. A significant decrease in the AUCROC is seen here for the IQP-Linear and IQP-Polynomial kernel combinations, despite a net-zero change in the accuracy, when comparing (i) and (iii). Noting the low median AUCROC of the IQP base kernel (0.71), this suggests a stronger preference for the IQP kernel versus both the Linear and Polynomial kernels vis-à-vis the optimization target Eq. (22), which is not necessarily aligned with the AUCROC metric. The same line of reasoning applies for the Linear and Polynomial kernels in combination with QAOA. Conversely, a stronger preference for RBF in the IQP-RBF combination (in addition to optimizing θ2 (Tab. I)) leads to a large increase (+0.08) in the AUCROC over the balanced IQP-RBF combination. It is also noteworthy that, despite a lower score overall, the base QAOA kernel exhibits a comparatively strong improvement in the median AUCROC with parameter optimization.

3. Margin

Outcomes for the margin metric also show the largest differences among combinations that pair low- and high-scoring base kernels. Excluding combinations with the two lowest-scoring base kernels (Linear, Polynomial), a slight overall decrease in the margin is evident among the remaining kernel combinations. Moreover, negative changes are associated with the RBF- and QAOA-containing combinations, and most strongly with the lone base kernels in either case. This suggests that parameter optimization tends to simultaneously reduce the margin, despite generally improving accuracy and AUCROC.

4. Spectral Ratio

entries, all combinations among the RBF, QAOA, RX, and IQP kernels exhibit very high spectral ratios in the “no optimization” column, (i). Without tunable parameters, there is little expectation of any change for the RX and IQP base kernels, as well as the RX-IQP combination, because adjusting the kernel weights only interpolates between two already-large values. Indeed, the spectral ratio for RX- and IQP-containing kernels does not change significantly from column (i) to (iii). For combinations that contain the parametric RBF or QAOA kernels, on the other hand, a strong decrease in the spectral ratio toward more moderate values (≈ 0.5) is clearly observed. Moreover, the existence and severity of this trend for the lone RBF and QAOA kernels (third and final diagonal entries, respectively, in the bottom right grid of Fig. 4) confirms that parameter optimization is effective at balancing the spectral ratio.

5. Summary

Overall, these results show that kernel combinations, especially those involving parametric kernels, benefit significantly from the QCC-net optimization procedure. The RBF- and RX-containing kernels demonstrated the highest accuracy and AUCROC scores, with RBF- and QAOA-containing kernels exhibiting the largest improvements for these metrics. Specifically, the IQP-RBF kernel combination displayed the greatest overall improvement in accuracy, which we attribute to the regularization effect of the smoother RBF kernel. Conversely, the IQP, Linear, and Polynomial kernels showed lower overall performance. The results also highlight that parameter optimization can effectively balance the spectral ratio, which is particularly important for parametric kernels like RBF and QAOA.

B. Impact of Parameter Training on MKL

In order to separate the effects of kernel parameter training and kernel weights optimization, we proceed with a comparison between type (ii) and type (iii) results (Tab. IV). Recall that the latter uses both trained θ⋆ and optimized γ⋆, whereas the former uses random or default θ and optimized γ⋆. We shall outline the comparison here in terms of the optimal kernel weights determined in either case. As illustrated in Fig. 5, we compare the distributions of γ⋆ for data with d = 2, 6, and 13 features to distinguish, also, any trends in γ⋆ that depend on d.
TABLE V: Total difference by metric, over all unique combinations containing each base kernel. Every entry is a sum of six elements from the difference grid for the given metric in the final column of Fig. 4. Boldface entries indicate the kernel(s) showing the greatest total difference by magnitude, for each metric. Improvements in the accuracy, AUCROC, and margin correspond to positive values (i.e., a total increase). For the spectral ratio, however, either a total increase or a total decrease can indicate an improvement, depending on the initial values in the first column of Fig. 4.
FIG. 5: Density of optimized kernel weights (γ1⋆ + γ2⋆ = 1) with and without additional optimization of kernel parameters (θ),
for d ∈ {2, 6, 13} features. For a given d, distributions skewing right indicate a preference for the kernel on the right-hand
side (γ2⋆ > γ1⋆ ), and vice versa when skewing left (γ1⋆ > γ2⋆ ). Outlined distributions correspond to fully optimized results (type
(iii)) and filled distributions correspond to semi-optimized results (type (ii)). Alphabetic labels are provided for convenience.
outcomes without distinguishing between the two types of results. A clear preference for the RX kernel over the Linear kernel is seen at all three values of d on row (a) of Fig. 5. The same can be said for IQP, regarding the IQP-Linear combination on row (b). Neither result is surprising, since the Linear kernel is obviously not suited to the non-linearly-separable classification problem under consideration (see, for example, Fig. S1). Nonetheless, results from the previous section (Fig. 4) indicate that the RX-Linear and IQP-Linear combinations do in fact outperform the lone quantum kernels in either case, which may explain why the Linear kernel is not eliminated entirely (i.e., γ2 ≠ 0). On row (c), preference for the RX kernel is strong at d = 2 for this non-parametric q.q. combination. However, a gradual shift toward balanced weights, and a narrowing of the weights distribution, is observed with increasing d. For the largest number of features, d = 13, the RX and IQP kernels appear equally effective.
2. Parametric Kernel Combinations

In contrast to the above, results for the c.c. Polynomial-Linear combination, (d), show approximately balanced weights across all d, apart from a small proportion of the fully optimized outcomes (iii) that skew very strongly toward the Polynomial kernel. This suggests rare instances in which parameter training is highly successful for the Polynomial kernel. Results for the Linear-RBF combination immediately below, on row (e), may further support this: While the RBF kernel largely dominates the combination, a comparable proportion of the outcomes at d = 13 strongly favour the fully optimized Polynomial kernel. This is not to suggest, however, that the trained Polynomial kernel is particularly effective for the problem at hand, since its combination with the RX and IQP kernels, on rows (f) and (g), exhibits minimal difference between result types (ii) and (iii).

The QAOA-Polynomial combination on row (h) corresponds to the first unambiguous result as far as differentiating trained and random-parameter outcomes. Preference for the QAOA kernel over the Linear kernel is clear with and without parameter training, although much narrower distributions are observed for the former (type (iii)), especially at lower d. A similar trend is seen for the QAOA-Linear combination on row (j). With the Linear-RBF combination in between the prior two results, on row (i), the trend is again similar, except the semi-optimized distributions are more narrow here to start. Evidently, the initial value of the RBF scaling parameter, θ0 = 1, represents a reasonable choice for data with features scaled to [0, 2π]. No such choice can be made for the QAOA kernel, however, which is parametrized via periodic quantum gates. Thus, for the type (ii) results on rows (h) and (j), random-parameter outcomes for QAOA-containing combinations are more broadly distributed compared to the Linear-RBF combination.

3. Quantum-RBF and QAOA-Quantum Kernel Combinations

Rows (k) to (q) correspond to pairs among RBF and the three quantum kernels (Tab. II). For the RX-RBF combination, (k), we note that γ⋆ remains approximately balanced across d without parameter optimization. For type (iii) results, however, the RBF kernel is preferred over RX, and more so with increasing d. Next, for the IQP-RBF kernel combination on row (m), we note that γ⋆ skews slightly toward RBF for both types of results at the two lower values of d. This is more severe for the fully optimized case (iii), in view of the d = 6 result. At d = 13, the semi-optimized case (ii) does not distinguish at all between IQP and RBF. (The distribution in Fig. 5 row (m) is very narrow and obscured by the vertical grid line at γ1⋆ = γ2⋆ = 0.5.) In the fully optimized case, however, a preference for the RBF kernel is clearly visible. Here, and for the RX-RBF combination, (k), parameter training enhances the selectivity for weights tuning, especially at higher dimensions.

The final rows of Fig. 5, (n) to (q), contain the QAOA kernel in combination with RBF, RX, and IQP, respectively. Regarding the semi-optimized results for these three rows, we note that, while the random-parameter QAOA kernel (ii) is selected against for data with d = 2 features, this trend gradually disappears for larger d. Indeed, at d = 13, rows (n) to (q) illustrate virtually no preference between the random-θ QAOA kernel and the RBF, RX, or IQP kernels. The opposite trend is observed, however, for the fully optimized results on these combinations. That is, preference for the fully optimized QAOA kernel increases from d = 2 to 13. This result, too, supports the concluding claim in the previous paragraph. Operating upon data with d = 13 features, the weights tuning algorithm (Sec. III A 2) does not discriminate between component kernels in either q.q. or RBF-containing q.c. combinations with the base kernels considered in this work.

4. Summary

These findings reveal that the RX, IQP, and RBF kernels surpass the Linear and Polynomial kernels for the studied classification problems. Furthermore, the fully optimized QAOA kernel is more strongly preferred as data dimensionality grows, suggesting that parameter training serves to enable kernel weights tuning in high-dimensional settings. The results also show limited discrimination between quantum and classical kernels in q.q. and RBF-containing q.c. combinations, suggesting that kernel and optimization choices should be tailored to the problem and dataset specifics.

C. Insights on QC Combinations

Generalizing upon results from the previous two sections, the RBF-containing q.c. combinations exhibit the best performance overall and the greatest improvements in performance metrics between result types (i) and (iii). When paired with the quantum kernels from Tab. II, MKL optimization weighs the untrained RBF kernel equally or more heavily than all three quantum counterparts, including the untrained (random parameter) QAOA kernel. Based on the median metrics in Fig. 4, we can confirm that training θ2 improves the lone RBF kernel's performance. The fact that the trained QAOA kernel is weighted more heavily than the trained RBF kernel (Fig. 5, row (n)) then suggests an important and competitive contribution from QAOA. However, even in considering only d = 13 feature data, median metrics for case (iii) closely resemble those of results across all d
FIG. 6: Density contours illustrating distributions of performance metrics for QAOA-containing q.c. combinations for all three result types, using (i) random kernel parameters and equal kernel weights (θ, γ); (ii) random kernel parameters and optimized weights (θ, γ⋆); and (iii) trained kernel parameters and optimized kernel weights (θ⋆, γ⋆). The dashed black lines in all subplots indicate x = y and are provided to help orient the eye. Distributions include results for all feature sizes, d = 2 to 13. Note that matching results are expected for cases (i) and (ii) along the top row of either set of subplots, since these correspond to the lone QAOA base kernel.
in Fig. 4, where the lone RBF kernel consistently outperforms the QAOA-RBF combination.

One reason for this could be the shallow depth of the minimal ansätze utilized for QAOA and the other quantum kernels (Tab. II). In the context of its original application [31], multiple repetitions of the QAOA circuit are known to produce higher quality approximations to solutions for combinatorial problems. It may be the case that a similar relationship exists for QEKs that also utilize multiple, parametric QAOA layers for classification with kernel methods [57]. The related, though non-parametric, IQP kernel could also benefit from repetitions of the embedding circuit—such repetitions are, in fact, a requirement for rigorous classical “hardness” in this case [30].

The observation of better performance metrics for the RX-containing q.c. combinations, in comparison to the QAOA- and IQP-containing combinations, is not necessarily contradictory to the above hypothesis. Considering that the RX kernel is effectively a quantum implementation of a classical cosine kernel [27], and acknowledging the relative simplicity of the synthetic datasets under consideration, the utility of more complex QEKs may yet be demonstrable on higher-dimensional or specifically structured data, as in Refs. [9, 58].

In the present context, QAOA remains the best representative of a trainable and practical family of quantum kernels. The distributions of performance metrics for q.c. combinations that contain this kernel, and those of the base QAOA kernel, are illustrated in Fig. 6. Here, as also revealed by the median metrics in Fig. 4, the overall accuracy and AUCROC are only slightly improved, and the variance of outcomes only slightly reduced, in comparing result types (i) and (iii). Still more subtle is the accuracy and AUCROC difference between types (i) and (ii). The latter is unsurprising in view of Fig. 5, since without parameter training the optimized weights are seen to converge to the balanced (default) vector, γ⋆ → γ = [0.5, 0.5]^T, as d increases.

Regarding the margin and spectral ratio outcomes for this subset of q.c. combinations (Fig. 6 (b)), the type (ii) results suggest that weights optimization without parameter training may actually be detrimental for the QAOA-Linear and QAOA-Polynomial combinations, whereas
the QAOA-RBF distribution is again not significantly altered. With the inclusion of parameter training in case (iii), the distributions for the QAOA kernel and all QAOA-containing q.c. combinations are seen to change more drastically. Specifically, the margin distribution is widened while the spectral ratios are reduced for type (iii) results across every row in Fig. 6 (b). The type (iii) results here also converge to similar distributions for all four combinations considered, with the QAOA-RBF combination producing the best outcome overall.

V. CONCLUSION

In this work, we used modern software tools and a novel optimization procedure inspired by classical machine learning to systematically explore the utility of c.c., q.c., and q.q. kernel combinations in binary classification problems, over datasets with d = 2 to 13 features. Considering three quantum kernels and three canonical classical kernels in a comparative setting, we found that only the most complex and parametric quantum kernel (QAOA) attains a higher optimum weight in pairwise combination with the most performant classical kernel (RBF). Conversely, classification performance was not found to differ significantly between q.c. combinations featuring simpler classical kernels (Linear, Polynomial) in comparison to the lone quantum kernel. Regarding use of the QCC-net for training the kernel parameters and optimizing the combination weights, we found the kernel weighting step to be indecisive without the former training step when optimizing q.c. combinations that contain the parametric QAOA kernel. The simpler RX kernel and its q.c. combinations exhibited the best performance metrics among the quantum and quantum-containing kernels considered. More broadly, the classical RBF kernel and its combinations performed the best overall.

We are able to recommend a number of directions for future work aimed at expanding the scope of this study and/or identifying an empirical advantage with quantum kernels. For example, the use of multi-layer embedding circuits for quantum kernels may prove more effective on higher-dimensional data, based on the trends we observed for QAOA-containing kernel combinations. Additionally, the EasyMKL algorithm is well suited for combining a far greater number of kernels and can therefore be used to explore the q.q., c.c., and q.c. paradigms beyond pairwise combinations. On the other hand, alternate (non-linear) MKL strategies are also a worthwhile prospect for future work, assuming that combination weights can be computed efficiently. Finally, experiments on datasets with a different, less generic structure may provide more promising results for q.q. and q.c. combinations, especially if the number of features is large.

VI. ACKNOWLEDGEMENTS

Partial funding for this work was provided by the Mitacs Accelerate program.
[24] S. S. Vedaie, M. Noori, J. S. Oberoi, B. C. Sanders, and E. Zahedinejad, arXiv preprint arXiv:2011.09694 (2020), https://doi.org/10.48550/ARXIV.2011.09694.
[25] J. S. Baker, G. Park, K. Yu, A. Ghukasyan, O. Goktas, and S. K. Radha, arXiv preprint arXiv:2305.05881 (2023), https://doi.org/10.48550/arXiv.2305.05881, arXiv:2305.05881 [quant-ph].
[26] B. Schölkopf, A. J. Smola, F. Bach, et al., Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2002).
[27] M. Schuld and F. Petruccione, Supervised Learning with Quantum Computers (Springer, 2018).
[28] M. Schuld, arXiv preprint arXiv:2101.11020 (2021), https://doi.org/10.48550/arXiv.2101.11020.
[29] A. W. Harrow, A. Hassidim, and S. Lloyd, Phys. Rev. Lett. 103, 150502 (2009).
[30] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, Nature 567, 209 (2019).
[31] E. Farhi, J. Goldstone, and S. Gutmann, arXiv preprint arXiv:1411.4028 (2014), https://doi.org/10.48550/arXiv.1411.4028.
[32] S. Lloyd, M. Schuld, A. Ijaz, J. Izaac, and N. Killoran, arXiv preprint arXiv:2001.03622 (2020), https://doi.org/10.48550/arXiv.2001.03622.
[33] L. Cincio, Y. Subaşı, A. T. Sornborger, and P. J. Coles, New J. Phys. 20, 113022 (2018).
[34] M. Fanizza, M. Rosati, M. Skotiniotis, J. Calsamiglia, and V. Giovannetti, Phys. Rev. Lett. 124, 060503 (2020).
[35] H.-Y. Huang, R. Kueng, and J. Preskill, Nat. Phys. 16, 1050 (2022).
[36] C. Brouard, J. Mariette, R. Flamary, and N. Vialaneix, NAR Genom. Bioinform. 4 (2022), https://doi.org/10.1093/nargab/lqac014.
[37] H. Xue, Y. Song, and H.-M. Xu, Knowl. Based Syst. 191, 105272 (2020).
[38] C. Gautam, R. Balaji, S. K., A. Tiwari, and K. Ahuja, Knowl. Based Syst. 165, 241 (2019).
[39] I. Steinwart and A. Christmann, Support Vector Machines (Springer Science & Business Media, 2008).
[40] D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014), https://doi.org/10.48550/arXiv.1412.6980.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, J. Mach. Learn. Res. 12, 2825 (2011).
[42] I. Steinwart, D. Hush, and C. Scovel, IEEE Trans. Inf. Theory 52, 4635 (2006).
[43] C. A. Micchelli, Y. Xu, and H. Zhang, J. Mach. Learn. Res. 7, 2667 (2006).
[44] S. Diamond and S. Boyd, J. Mach. Learn. Res. 17, 1 (2016).
[45] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd, J. Control. Decis. (2019), https://doi.org/10.48550/arXiv.1709.04494, arXiv:1709.04494 [math.OC].
[46] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd, Journal of Optimization Theory and Applications 169, 1042 (2016).
[47] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter, in Advances in Neural Information Processing Systems (2019).
[48] A. Agrawal, S. Barratt, S. Boyd, E. Busseti, and W. Moursi, Journal of Applied and Numerical Optimization 1, 107 (2019).
[49] V. Bergholm, J. Izaac, M. Schuld, C. Gogolin, S. Ahmed, et al., "PennyLane: Automatic differentiation of hybrid quantum-classical computations," (2022), arXiv:1811.04968 [quant-ph].
[50] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Advances in Neural Information Processing Systems 32 (2019), https://doi.org/10.48550/arXiv.1912.01703.
[51] Covalent: https://www.covalent.xyz.
[52] make_classification: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html.
[53] I. Guyon, in NIPS 2003 workshop on feature extraction and feature selection, Vol. 253 (2003) p. 40.
[54] MinMaxScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html.
[55] SVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.
[56] See Supplementary Information [publisher url] for examples of two-dimensional datasets.
[57] S. Jerbi, L. J. Fiderer, H. Poulsen Nautrup, J. M. Kübler, H. J. Briegel, and V. Dunjko, Nature Communications 14, 517 (2023).
[58] J. R. Glick, T. P. Gujarati, A. D. Córcoles, Y. Kim, A. Kandala, J. M. Gambetta, and K. Temme, arXiv preprint arXiv:2105.03406 (2021), https://doi.org/10.48550/arXiv.2105.03406.
SUPPLEMENTARY INFORMATION:
QUANTUM-CLASSICAL MULTIPLE KERNEL LEARNING
FIG. S1: Ten examples of d = 2 dimensional classification datasets corresponding to Sec. III B. Each pair of plots shows the equally-split training and testing subsets (left (a) and right (b), respectively). All horizontal and vertical axes range from 0 to 2π. Square scatter points (red) belong to one class and round scatter points (blue) belong to the other.
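For reference, datasets broadly like these could be generated with the scikit-learn utilities cited in the main text ([52], [54]); the keyword arguments below are illustrative and not the paper's exact configuration.

```python
# Sketch (assuming scikit-learn): a d = 2 synthetic classification dataset in the style of
# Fig. S1: make_classification [52] followed by MinMaxScaler [54] to map features to [0, 2*pi].
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=120, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X = MinMaxScaler(feature_range=(0, 2 * np.pi)).fit_transform(X)
y = 2 * y - 1  # relabel {0, 1} -> {-1, +1}

# equally-split training and testing subsets, as in Fig. S1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
```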