
Exotic and physics-informed support vector machines for high energy physics

A. Ramirez-Morales (andres.ramirez@fisica.uaz.edu.mx), Facultad de Física, Universidad Autónoma de Zacatecas, Apartado Postal C-580, 98060 Zacatecas, México
A. Gutiérrez-Rodríguez (alexgu@fisica.uaz.edu.mx), Facultad de Física, Universidad Autónoma de Zacatecas, Apartado Postal C-580, 98060 Zacatecas, México
T. Cisneros-Pérez (tzihue@gmail.com), Unidad Académica de Ciencias Químicas, Universidad Autónoma de Zacatecas, Apartado Postal C-585, 98060 Zacatecas, México
H. Garcia-Tecocoatzi (hugo.garcia.tecocoatzi@ge.infn.it), INFN, Sezione di Genova, Via Dodecaneso 33, 16146 Genova, Italy
A. Dávila-Rivera (alejandra.davila@fisica.uaz.edu.mx), Facultad de Física, Universidad Autónoma de Zacatecas, Apartado Postal C-580, 98060 Zacatecas, México

Abstract

In this article, we explore machine learning techniques using support vector machines with two novel approaches: exotic and physics-informed support vector machines. Exotic support vector machines employ unconventional techniques such as genetic algorithms and boosting. Physics-informed support vector machines integrate the physics dynamics of a given high-energy physics process in a straightforward manner. The goal is to efficiently distinguish signal and background events in high-energy physics collision data. To test our algorithms, we perform computational experiments with simulated Drell-Yan events in proton-proton collisions. Our results highlight the superiority of the physics-informed support vector machines, emphasizing their potential in high-energy physics and promoting the inclusion of physics information in machine learning algorithms for future research.

I Introduction

Machine learning techniques have proven to be extremely powerful when applied to high energy physics phenomena, in both theoretical and experimental studies [1, 2, 3]. Several algorithms have been applied to distinguish signals in high energy collider data [4, 5]. For instance, the discovery of the Higgs boson was aided by the so-called boosted decision trees algorithm [6]. Other popular machine learning algorithms that have been successful in high energy physics are neural networks [7, 8, 9, 10], linear regressions [8, 11, 12], and deep learning [13, 14, 15, 16].

To continue exploiting the potential of machine learning techniques, the idea that physics insights can help design better machine learning algorithms has recently been used across several fields, yielding excellent results. This field is known as physics-informed machine learning [17]. The majority of physics-informed machine learning studies rely on advanced neural network architectures. Support vector machines (SVMs), which are based on kernel methods, have also benefited from these physics insights: the physics information is introduced into the SVMs via their kernels, which improves their performance [18].

In the realm of high energy physics, physics-informed neural networks and deep learning techniques have been proposed to tackle the most challenging tasks in the analysis of data from high energy physics experiments, ranging from searches for new physics phenomena to jet tagging [19, 20, 21, 22]. SVMs have also been helpful and interesting for the high energy physics community [23, 24, 25, 26, 1, 27]. However, there are no reports of physics-informed support vector machines applied to high energy physics phenomena. This invites the exploration of SVMs in the context of physics-informed machine learning.

This paper is focused on the application and interpretation of the SVMs in experimental high energy physics. The use of support vector machines is motivated by their relatively simple geometric interpretation, especially for binary discrimination of signal events against background events. First, we study what we call exotic support vector machines. These SVMs are exotic in the sense that we utilize unconventional techniques to build them. That is, we use genetic and boosting algorithms to construct more efficient classifiers. Moreover, we use somewhat unconventional kernels. The construction of the exotic SVMs is guided by our previous studies [28]. Second, we study physics-informed support vector machines. To include high energy physics information in our SVMs, we propose kernels that define the SVM and aim to capture the dynamical properties of the underlying theory that intends to describe the observed/expected data in high energy experiments.

We perform a case study: the Drell-Yan $Z$ boson production in high energy proton-proton collisions. In our studies, we simulate data for the process $q\bar{q}\rightarrow Z\rightarrow l^{+}l^{-}$ , where $q$ and $\bar{q}$ are the quarks and antiquarks coming from the colliding protons and $l^{+}l^{-}$ are the oppositely charged final state leptons. Using the kinematic variables of these final state leptons, we construct the kernels that define the SVM in every case. We then perform formal statistical tests to compare the performance of each SVM. These tests help us assess the usefulness of introducing the dynamics into a support vector machine algorithm.

In Sect. II we summarize the formalism of support vector machines, the basic kernel theory, and the definition of the considered kernels. Furthermore, we describe the genetic algorithms and boosting techniques used, and the approach to introducing the physics dynamics of a given high energy physics process into a support vector machine algorithm. In Sect. III we present the computational experiments to train and test our proposed support vector machines. In Sect. IV we present our results and discussion. Finally, in Sect. V we present our conclusions.

II Methodology

We propose that if the theory underlying the dynamics of a physics process to be studied in high-energy experiments is included in the construction of the kernel that defines the support vector machine, then the discrimination capabilities of the support vector machine binary classifier will be significantly enhanced. We then compare the physics-informed SVMs with state-of-the-art SVMs. The following sections describe the ingredients of this proposal.

II.1 Support vector machines

In a binary SVM classifier, an optimal hyperplane separating two classes in the feature space is found [29]. Binary classification is important in experimental high energy physics, as it helps discriminate signals of interest from background. During optimization, the SVM model selects a subset of support vectors (SVs) from the training samples, $\mathbf{x}$ , to establish the location of the decision surface. To simplify the search for SVs, the training samples are mapped into a high-dimensional space using kernel functions, $\kappa(\mathbf{x},\mathbf{z})$ , which are expressed as inner products of the training samples or their mappings. In this feature space, a specific kernel produces a hyperplane that assigns a prediction $\mathbf{y}$ to each element of $\mathbf{x}$ based on which side of the hyperplane it lies. The kernel functions solve the optimization problem without explicitly using the actual mappings, a technique known as the kernel trick. Since data may not be perfectly separable and some points may lie within the margin or be misclassified, SVM implementations allow for a certain degree of misclassification by introducing an adjustable penalty cost $C$ [29, 28]. An SVM classifier is defined by its kernel and the parameters that describe the kernel. Kernel theory in machine learning allows the construction of a broad diversity of kernels employing elementary kernel properties. Let $\kappa_{1}$ and $\kappa_{2}$ be kernels over $\mathbf{x}\otimes\mathbf{z},$ where $\mathbf{x,z}\subseteq\mathbb{R}^{n},$ let $a\in\mathbb{R}^{+}$ , and let $\kappa_{3}$ be a kernel over $\mathbb{R}^{n}\otimes\mathbb{R}^{n}$ . Then the following functions are kernels as well [30]:

$\begin{aligned}\kappa(\mathbf{x},\mathbf{z})&=\kappa_{1}(\mathbf{x},\mathbf{z})+\kappa_{2}(\mathbf{x},\mathbf{z}),\\ \kappa(\mathbf{x},\mathbf{z})&=a\kappa_{1}(\mathbf{x},\mathbf{z}),\\ \kappa(\mathbf{x},\mathbf{z})&=\kappa_{1}(\mathbf{x},\mathbf{z})\kappa_{2}(\mathbf{x},\mathbf{z}),\\ \kappa(\mathbf{x},\mathbf{z})&=f(\mathbf{x})f(\mathbf{z}),\\ \kappa(\mathbf{x},\mathbf{z})&=\kappa_{3}(\phi(\mathbf{x}),\phi(\mathbf{z})),\end{aligned}$ (1)

where $f:\mathbf{x}\rightarrow\mathbb{R}$ is a real-valued function and $\phi:\mathbf{x}\rightarrow\mathbb{R}^{n}$ .
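The closure properties in Eq. (1) can be checked numerically on a small example. The sketch below is only an illustration on random toy data (not collider events): it builds Gram matrices from a linear kernel, combines them with each rule, and verifies that the smallest eigenvalue stays non-negative up to rounding. The helper names `gram` and `min_eigenvalue`, the choice $\phi(x)=x^{2}$ (element-wise), and the scalar-valued $f$ are assumptions made for this example.

```python
import numpy as np

def gram(X):
    """Linear-kernel Gram matrix <x_i, x_j> for the rows of X."""
    return X @ X.T

def min_eigenvalue(G):
    """Smallest eigenvalue of a symmetric matrix (>= 0 for a valid kernel)."""
    return float(np.linalg.eigvalsh((G + G.T) / 2.0).min())

rng = np.random.default_rng(0)      # toy data, fixed seed for reproducibility
X = rng.normal(size=(20, 5))

k1 = gram(X)                        # kappa_1: linear kernel
k2 = gram(X**2)                     # kappa_3(phi(x), phi(z)) with phi(x) = x**2
f = np.tanh(X.sum(axis=1))          # arbitrary real-valued function f(x)

composed = {
    "sum": k1 + k2,
    "scaled": 3.0 * k1,             # a = 3 > 0
    "product": k1 * k2,             # element-wise product of Gram matrices
    "f(x)f(z)": np.outer(f, f),
}
min_eigs = {name: min_eigenvalue(G) for name, G in composed.items()}
```

A negative eigenvalue well below rounding error in `min_eigs` would signal that a combination is not a valid kernel; for the five rules of Eq. (1) this should not happen.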

II.2 Basic kernels

In this context, a kernel is a Hermitian and positive semidefinite Gram matrix $G$ defined as $G=[\langle v_{j},v_{i}\rangle]_{i,j=1}^{n}$ , where the vectors $v_{1},...,v_{n}$ live in a vector space that contains an inner product $\langle\cdot,\cdot\rangle$ [31]. To make the notation more compact, we write $G=\kappa(\textbf{x},\textbf{z})=\langle\textbf{x},\textbf{z}\rangle$ , with $\mathbf{x,z}\subseteq\mathbb{R}^{n}$ . This paper considers the kernels:

•

Linear kernel

$\kappa(\textbf{x},\textbf{z})=\langle\textbf{x},\textbf{z}\rangle,$ (2)

with no hyper-parameters.

•

Radial Basis Function (RBF) kernel

$\kappa(\mathbf{x},\mathbf{z})=\exp(-\gamma\|\textbf{x}-\textbf{z}\|^{2}),$ (3)

with hyper-parameter $\gamma$ .

•

Sigmoid kernel

$\kappa(\textbf{x},\textbf{z})=\tanh(\gamma\langle\textbf{x},\textbf{z}\rangle+r),$ (4)

with hyper-parameters $\gamma$ and $r$ .

•

Polynomial kernel

$\kappa(\textbf{x},\textbf{z})=(\gamma\langle\textbf{x},\textbf{z}\rangle+r)^{d},$ (5)

with hyper-parameters $\gamma$ , $r$ and $d$ .

For the sigmoid kernel we set $r=-1$ . For the polynomial kernel we set $r=+1$ and $d=2$ . Finally, we set a high $\gamma$ value, $\gamma=100$ , so that each training vector has a non-negligible impact. The chosen values of the hyper-parameters $\gamma$ , $r$ , and $d$ ensure good behavior when fitting an SVM [32, 33].

The kernels in Eqs. (2)-(5), together with the properties in Eq. (1), allow us to define composite kernels of almost arbitrary shape and hence help include the properties of a given physics process.
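For concreteness, the four basic kernels of Eqs. (2)-(5) can be written as functions that return Gram matrices. This is a minimal NumPy sketch using the hyper-parameter values quoted above; the function names and the three toy feature vectors are assumptions made for illustration and are not taken from the analysis code.

```python
import numpy as np

GAMMA = 100.0  # gamma value quoted in the text

def linear(X, Z):
    """Eq. (2): <x, z>."""
    return X @ Z.T

def rbf(X, Z, gamma=GAMMA):
    """Eq. (3): exp(-gamma * ||x - z||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def sigmoid(X, Z, gamma=GAMMA, r=-1.0):
    """Eq. (4): tanh(gamma * <x, z> + r)."""
    return np.tanh(gamma * (X @ Z.T) + r)

def polynomial(X, Z, gamma=GAMMA, r=1.0, d=2):
    """Eq. (5): (gamma * <x, z> + r)^d."""
    return (gamma * (X @ Z.T) + r) ** d

# Toy, pre-scaled feature vectors (hypothetical values)
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]])

# A composite kernel built with the sum rule of Eq. (1)
K_sum = rbf(X, X) + polynomial(X, X)
```

Sums and products of these Gram matrices, as in `K_sum`, correspond to the composite kernels discussed in the text.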

II.3 Exotic support vector machines

To construct exotic support vector machines we use and combine three elements:

•

Unconventional kernels. We use the kernels of Eqs. (2)-(5) arbitrarily joined according to Eq. (1). These kernels inherently do not carry any physical information beforehand.

•

Ensembles of classifiers. An ensemble of classifiers is a collection of single weak classifiers that, when combined, provide a strong classifier [34, 35]. In this work, we use the AdaBoost algorithm [36] to construct ensembles. This adaptive method updates the vector weights (in this context, a vector is a point of the data sample) based on the training error of a given binary classifier. These weights are used to train the next classifier to be added to the ensemble. Correctly classified vectors are assigned lower weights, whilst misclassified vectors are given higher weights. Thus, vectors that are harder to classify receive more focus from the algorithm. The AdaBoost algorithm is repeated $T$ times, $t=1,...,T$ . First, for the true data label $y_{i}$ and the base classifier prediction $h_{t}(\textbf{x}_{i})$ , the training error $\epsilon_{t}$ is calculated as

$\epsilon_{t}=\sum_{i=1}^{n}w_{i}^{t};\qquad y_{i}\neq h_{t}(\textbf{x}_{i}),$ (6)

where $w_{i}^{t}$ are the weights of each vector $\textbf{x}_{i}$ utilized to train the classifier. Then, the score $\alpha_{t}$ is defined as

$\alpha_{t}=\frac{1}{2}\ln\frac{1-\epsilon_{t}}{\epsilon_{t}}\;.$ (7)

The weights are updated for the next iteration with

$w_{i}^{t+1}=w_{i}^{t}e^{[-\alpha_{t}y_{i}h_{t}(\textbf{x}_{i})]}\times A_{t},$ (8)

where $A_{t}$ is a normalization factor. The weights in Eq. (8) are applied to train and add a new classifier to the ensemble. When $T$ iterations are completed, the predicted label of the total ensemble is the weighted sum of the predictions of the individual classifiers within the ensemble

$H(\textbf{x})=\sum_{t=1}^{T}\alpha_{t}h_{t}(\textbf{x}).$ (9)

•

Genetic algorithms. The genetic algorithms are optimization techniques inspired by the principles of biological evolution. Selections are performed using simple operators based on genetic recombinations and mutations. In this work, we use genetic algorithms to select a small subset of the training data, which will likely contain the support vectors needed to solve the binary classification problem [37]. To determine if a subgroup of vectors is indeed likely to contain the support vectors, a fitness function is calculated to check if this subgroup is good at classifying data outside this subgroup. This is repeated for several subgroups of vectors and a selection of subgroups is performed using the high-low method [38]. The selected subgroups are recombined and the previous steps are repeated until a given stop criterion is satisfied. For more details, see Ref. [28].
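The AdaBoost update of Eqs. (6)-(9) can be made concrete with a single boosting round on five toy vectors. The sketch below is a plain-Python illustration, not the implementation used in this work; the function names are hypothetical, the normalization factor $A_{t}$ is taken as the inverse of the sum of the updated weights, and the final label is taken as the sign of $H(\textbf{x})$, which is an assumption about how Eq. (9) is turned into a class.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost iteration: returns (epsilon_t, alpha_t, updated weights)."""
    # Eq. (6): sum of the weights of the misclassified vectors
    eps = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Eq. (7): score of the base classifier (requires 0 < eps < 1)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Eq. (8): shrink weights of correct vectors, grow weights of wrong ones
    raw = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    norm = sum(raw)  # A_t = 1 / norm keeps the weights summing to one
    return eps, alpha, [w / norm for w in raw]

def ensemble_predict(alphas, predictions):
    """Eq. (9) for one vector, mapped to a label via the sign of H(x)."""
    score = sum(a * h for a, h in zip(alphas, predictions))
    return 1 if score >= 0 else -1

# Toy example: five vectors with equal initial weights, one misclassified
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]
w0 = [0.2] * 5
eps, alpha, w1 = adaboost_round(w0, y_true, y_pred)
```

In this toy round the misclassified vector ends up with half of the total weight, so the next base classifier is pushed to classify it correctly.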

II.4 The Drell-Yan process

Based on the parton model and the quark-antiquark annihilation mechanism, Sidney D. Drell and Tung-Mow Yan [39] predicted the production of two oppositely charged leptons in hadron-hadron collisions. The neutral dilepton pair was predicted to appear with a large invariant mass. This production is the well-known neutral current Drell-Yan process. For proton-proton collisions, the partons participating in the Drell-Yan production are the quarks and antiquarks that constitute the protons. The tree-level or leading-order partonic cross-section of the $q\bar{q}\rightarrow Z$ process is found to be [40]

$\hat{\sigma}^{q\bar{q}\rightarrow Z}=\frac{\pi}{3}\sqrt{2}G_{F}M_{Z}^{2}(v_{q}^{2}+a_{q}^{2})\delta(\hat{s}-M_{Z}^{2}),$ (10)

where $G_{F}$ is the Fermi weak coupling constant, $M_{Z}$ the invariant mass of the $Z$ boson, $v_{q}(a_{q})$ is the vector (axial vector) coupling of the $Z$ to the quarks, and $\hat{s}$ is the square of the center-of-mass energy of the quark-antiquark.

A quark with charge $Q_{k}$ inside a proton is described by a parton distribution function $q_{k}$ . Considering all the proton parton distribution functions and with the aid of the QCD factorization theorem, it is found that the hadronic (proton-proton) cross-section for the Drell-Yan process is

$\begin{aligned}\frac{d\sigma^{pp\rightarrow Z}}{dM_{Z}^{2}}&=\frac{\hat{\sigma}^{q\bar{q}\rightarrow Z}}{N_{c}}\int_{0}^{1}dx_{1}\,dx_{2}\,\delta(x_{1}x_{2}s-M_{Z}^{2})\\ &\quad\times\Big[\sum_{k}Q_{k}^{2}\,\big(q_{k}(x_{1},M_{Z}^{2})\bar{q}_{k}(x_{2},M_{Z}^{2})+\big[1\leftrightarrow 2\big]\big)\Big],\end{aligned}$ (11)

where $1/N_{c}=1/3$ is the color factor. The momentum fractions $x_{1,2}$ are defined in terms of the four-momentum of each parton,

$p_{1}^{\mu}=\dfrac{\sqrt{s}}{2}(x_{1},0,0,x_{1}),$ (12)

$p_{2}^{\mu}=\dfrac{\sqrt{s}}{2}(x_{2},0,0,-x_{2}).$ (13)

From Eqs. (12)-(13) it is found that $\hat{s}=(p_{1}+p_{2})^{2}=x_{1}x_{2}s$ , where $s$ is the square of the proton-proton center-of-mass energy. For the produced lepton pair, the rapidity is given by $y=\frac{1}{2}\ln(x_{1}/x_{2})$ , and hence

$x_{1}=\frac{M_{Z}}{\sqrt{s}}\;\exp(y)\ ,\qquad x_{2}=\frac{M_{Z}}{\sqrt{s}}\;\exp(-y).$ (14)
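As a short worked step, Eq. (14) follows from combining the on-shell condition enforced by the delta function in Eq. (11), $x_{1}x_{2}s=M_{Z}^{2}$ , with the definition of the rapidity:

```latex
\begin{aligned}
x_{1}x_{2} &= \frac{M_{Z}^{2}}{s}, \qquad \frac{x_{1}}{x_{2}} = e^{2y}, \\
x_{1}^{2} &= \frac{M_{Z}^{2}}{s}\,e^{2y}
  \;\Rightarrow\; x_{1} = \frac{M_{Z}}{\sqrt{s}}\,e^{y}, \\
x_{2} &= x_{1}\,e^{-2y} = \frac{M_{Z}}{\sqrt{s}}\,e^{-y}.
\end{aligned}
```

As an illustrative number, for a $Z$ boson produced at $y=0$ in proton-proton collisions at a center-of-mass energy of 14 TeV, both momentum fractions are $x_{1}=x_{2}=91.1876/14000\approx 6.5\times 10^{-3}$ .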

The cross-section of Eq. (11) is multiplied by the branching ratio for any particular hadronic or leptonic final state of interest, which for this paper is the dielectron final state, namely, $q\bar{q}\rightarrow Z\rightarrow e^{+}e^{-}$ .

The proton-proton cross section in Eq. (11) is a function of the kinematics of the outgoing leptons. The kernel for our support vector machines is therefore constructed in accordance with Eqs. (10)-(14) in the following way: First, we identify the matrix of the proton-proton collision data as the kernel. Then, we perform operations on this kernel according to the relevant kinematic variables of the final state leptons in the cross-section. With this information, the kernel is expected to discriminate Drell-Yan events against backgrounds. Taking into account the kernel properties in Eq. (1), we propose a physics-informed kernel

$\kappa(\mathbf{x},\mathbf{z})=\gamma(\langle\textbf{x},\textbf{z}\rangle^{2}+\langle\textbf{x},\textbf{z}\rangle+\langle\textbf{x},\textbf{z}\rangle\cdot\exp(\langle\textbf{x},\textbf{z}\rangle)).$ (15)

The terms in Eq. (15) are intended to capture the physics in Eqs. (10)-(14) as

$\langle\textbf{x},\textbf{z}\rangle^{2}\sim M_{Z}^{2},$ (16)

$\langle\textbf{x},\textbf{z}\rangle\sim M_{Z},$ (17)

$\langle\textbf{x},\textbf{z}\rangle\cdot\exp(\langle\textbf{x},\textbf{z}\rangle)\sim\frac{M_{Z}}{\sqrt{s}}\cdot\exp(\pm y).$ (18)

In Eqs. (16)-(18), when the $Z$ boson decays to an electron-positron pair, $M_{Z}$ is calculated from the kinematics of this electron-positron pair.
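A direct way to read Eq. (15) is as a function that maps two sets of feature vectors to a Gram matrix. The following NumPy sketch is one possible implementation under stated assumptions: the function name `phys_dy_kernel` is hypothetical, the default $\gamma=100$ is the value quoted in Sec. II.2 (whether it is also the value used for Eq. (15) is an assumption), and the toy features are scaled so that $|\langle\textbf{x},\textbf{z}\rangle|\leq 1$ to keep the exponential well behaved.

```python
import numpy as np

def phys_dy_kernel(X, Z, gamma=100.0):
    """Gram matrix of Eq. (15) for row-wise samples X (n, d) and Z (m, d)."""
    ip = X @ Z.T                                    # inner products <x, z>
    return gamma * (ip**2 + ip + ip * np.exp(ip))   # three terms of Eq. (15)

# Toy features (not simulated events), scaled so that every norm is <= 1
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(30, 5)) / np.sqrt(5)

K = phys_dy_kernel(X, X)
min_eig = float(np.linalg.eigvalsh((K + K.T) / 2.0).min())
```

Each term is built from the linear kernel with the sum and product rules of Eq. (1), together with the fact that the exponential of a kernel is again a kernel, so `min_eig` is expected to be non-negative up to rounding. A callable with this `(X, Z)` signature that returns a Gram matrix is also the form accepted by the `kernel` argument of scikit-learn's `SVC`.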

III Experiments

To test our proposed methodology, we perform computational experiments on a well-known Standard Model process, namely the production of a $Z$ boson decaying to an electron-positron pair (Drell-Yan production). We train, test, and compare several support vector machine binary classifiers to characterize their power to discriminate the Drell-Yan process from backgrounds.

III.1 Data simulation

In this work, we consider simulated Drell-Yan signal and background events. The simulated data are at the generator level, that is, no detector effects are taken into account. The simulation is carried out with PYTHIA8.3 [41]. The event generation uses the PYTHIA configuration for the production of single and double weak bosons in proton-proton collisions at a center-of-mass energy of $\sqrt{s}=14$ TeV. For the signal events, we require that the event contains a particle with the PDGid [42] corresponding to the $Z$ boson. Then, we require that this particle's invariant mass lies within a window of width 40 GeV around the $Z$ boson mass (91.1876 GeV). We also require two oppositely charged leptons in the final state whose mother particle is the selected $Z$ . The kinematics of these charged leptons are the variables used to construct the kernels of the support vector machines. In this study, we consider the backgrounds that are most important for Drell-Yan $Z$ production, as reported by the ATLAS and CMS experiments at the Large Hadron Collider [43, 44]. The considered backgrounds are diboson ( $WW$ , $ZW$ , $ZZ$ ), $t\bar{t}$ , and single top production. These backgrounds are expected because their final states may mimic the charged leptons of the single $Z$ boson production final state. The event selection for the backgrounds is similar to that for the single $Z$ boson. In this work, we do not consider multijet backgrounds, as they are expected to be negligible ( $<0.1\%$ ) [43, 44]. Since the events are simulated with no detector effects, the samples have a high purity and there is no need to consider variables used to handle mismodelling, particle identification, lepton isolation, or acceptance effects. Figure 1 shows the invariant mass of the $Z$ boson calculated from the kinematics of the final state electron-positron pair. Furthermore, we consider the following electron-positron kinematic quantities: energy, momentum, transverse momentum, rapidity, and azimuthal angle. These quantities are used to build the kernels for the SVMs.

Figure 1: Invariant mass of the $Z$ boson calculated from the kinematics of the final state electron-positron pair coming from the simulated Drell-Yan events. The simulation was carried out with the PYTHIA8.3 event generator [41].
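The kinematic quantities listed above can all be derived from the lepton four-momenta. The sketch below shows one way to compute them in plain Python; the tuple layout `(E, px, py, pz)` in GeV, the function names, and the idealized back-to-back massless leptons in the example are assumptions made for illustration, not the conventions of the simulation code.

```python
import math

def invariant_mass(lep1, lep2):
    """Invariant mass of two leptons given as (E, px, py, pz) in GeV."""
    e = lep1[0] + lep2[0]
    px = lep1[1] + lep2[1]
    py = lep1[2] + lep2[2]
    pz = lep1[3] + lep2[3]
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative round-off

def transverse_momentum(lep):
    """Momentum component transverse to the beam (z) axis."""
    return math.hypot(lep[1], lep[2])

def rapidity(lep):
    """y = 0.5 * ln((E + pz) / (E - pz)); undefined when E equals |pz|."""
    e, pz = lep[0], lep[3]
    return 0.5 * math.log((e + pz) / (e - pz))

def azimuthal_angle(lep):
    """Angle in the transverse plane, in the range [-pi, pi]."""
    return math.atan2(lep[2], lep[1])

# Idealized example: a Z boson at rest decaying to back-to-back massless leptons
electron = (45.5938, 45.5938, 0.0, 0.0)
positron = (45.5938, -45.5938, 0.0, 0.0)
m_ee = invariant_mass(electron, positron)
```

For this idealized pair, `m_ee` reproduces the $Z$ boson mass quoted in the text, since the two energies add up and the momenta cancel.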

III.2 Data splitting

In high energy physics, class imbalance in the data sample is a common challenge. Hence, in this study, we consider different levels of imbalance between the signal and background events. Conventionally, in binary classification tasks in high energy physics, the label $+1$ is assigned to a signal event and the label $-1$ to a background event. We consider the case in which the data sample is fully balanced and the cases in which the signal-to-background ratio is 1:3, 1:10, 3:1, and 10:1. This is summarized in Table 1.

Table 1: Drell-Yan data sets for the experiments.

Sample	$+1$ Class	$-1$ Class	Imbalance
half_half	5000	5000	1:1
1quart_3quart	2500	7500	1:3
3quart_1quart	7500	2500	3:1
1dec_10dec	1000	10000	1:10
10dec_1dec	10000	1000	10:1
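The samples of Table 1 can be assembled by drawing the requested number of events from each class and attaching the $\pm 1$ labels. The following plain-Python sketch illustrates the idea; the function name `draw_sample`, the fixed seed, and the use of integer placeholders instead of event records are assumptions for this example.

```python
import random

# (signal count, background count) for each sample, as listed in Table 1
SAMPLES = {
    "half_half": (5000, 5000),
    "1quart_3quart": (2500, 7500),
    "3quart_1quart": (7500, 2500),
    "1dec_10dec": (1000, 10000),
    "10dec_1dec": (10000, 1000),
}

def draw_sample(signal, background, name, seed=0):
    """Return a shuffled list of (event, label) pairs with the Table 1 counts."""
    n_sig, n_bkg = SAMPLES[name]
    rng = random.Random(seed)
    events = [(x, +1) for x in rng.sample(signal, n_sig)]
    events += [(x, -1) for x in rng.sample(background, n_bkg)]
    rng.shuffle(events)  # avoid any ordering by class
    return events

# Placeholder event pools (integers stand in for lepton-kinematics records)
signal_pool = list(range(10000))
background_pool = list(range(10000, 20000))
sample = draw_sample(signal_pool, background_pool, "1quart_3quart")
```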

III.3 Support vector machine models

The support vector machines we study in this paper are summarized in Table 2. The models listed in this table are based on the definitions in Sections II.1-II.4. The phys-DY model employs a kernel that incorporates the Drell-Yan dynamics, as detailed in Eqs. (10)-(14) and summarized in Eq. (15). Models with lin, rbf, pol, or sig in their names utilize the kernels specified in Eq. (2), Eq. (3), Eq. (5), and Eq. (4), respectively. Models featuring adaboost are ensembles constructed using the AdaBoost algorithm described in Sec. II.3, following Eqs. (6)-(9). Models marked with gen use genetic selection as discussed in Sec. II.3. Finally, single and sum indicate that the kernel consists of a single element or the sum of two kernels, respectively. In addition to the physics-informed support vector machine, the classifiers listed in this table are chosen for their outstanding performance in preliminary tests in agreement with our previous study in Ref. [28].

Table 2: SVM models considered in this paper. The first column gives the name of the model, and the second provides a brief description of the elements considered to construct it.

Name	Description
phys-DY	Single with physics-informed kernel
adaboost-gen-rbf	AdaBoost ensemble with genetic selection and RBF kernel
adaboost-gen-pol	AdaBoost ensemble with genetic selection and polynomial kernel
adaboost-gen-sig	AdaBoost ensemble with genetic selection and sigmoid kernel
single-rbf	Single RBF kernel
single-lin	Single linear kernel
single-pol	Single polynomial kernel
single-sig	Single sigmoid kernel
single-sum-rbf-lin	Sum of RBF and linear kernels
single-sum-rbf-pol	Sum of RBF and polynomial kernels
adaboost-rbf	AdaBoost ensemble with RBF kernel
adaboost-pol	AdaBoost ensemble with polynomial kernel
adaboost-lin	AdaBoost ensemble with linear kernel
adaboost-sig	AdaBoost ensemble with sigmoid kernel

Model/Sample	$\mu_{AUC}(\sigma)$	$p$ -val.	R. $H_{0}$	$\mu_{PREC^{+}}(\sigma)$	$p$ -val.	R. $H_{0}$	$\mu_{PREC^{-}}(\sigma)$	$p$ -val.	R. $H_{0}$	$\mu_{ACC}(\sigma)$	$p$ -val.	R. $H_{0}$
half_half
phys-DY	0.98 (0.01)			0.86 (0.02)			0.97 (0.01)			0.91 (0.01)
adaboost-gen-rbf	0.67 (0.01)	0.0	✓✓	0.73 (0.04)	0.0	✓✓	0.64 (0.02)	0.0	✓✓	0.67 (0.01)	0.0	✓✓
single-rbf	0.84 (0.01)	0.0	✓✓	0.83 (0.02)	0.0	✓✓	0.72 (0.03)	0.0	✓✓	0.76 (0.02)	0.0	✓✓
single-sum-rbf-pol	0.86 (0.01)	0.0	✓✓	0.77 (0.03)	0.0	✓✓	0.74 (0.03)	0.0	✓✓	0.75 (0.02)	0.0	✓✓
single-sum-rbf-lin	0.85 (0.01)	0.0	✓✓	0.74 (0.03)	0.0	✓✓	0.74 (0.03)	0.0	✓✓	0.74 (0.02)	0.0	✓✓
1quart_3quart
phys-DY	0.96 (0.01)			0.83 (0.04)			0.91 (0.02)			0.89 (0.01)
adaboost-gen-rbf	0.55 (0.05)	0.0	✓✓	0.75 (0.29)	0.36224	✗	0.88 (0.02)	0.0	✓✓	0.87 (0.01)	0.0	✓✓
single-rbf	0.81 (0.02)	0.0	✓✓	0.72 (0.06)	0.0	✓✓	0.81 (0.02)	0.0	✓✓	0.80 (0.02)	0.0	✓✓
single-sum-rbf-pol	0.85 (0.02)	0.0	✓✓	0.87 (0.05)	2e-05	✓	0.82 (0.02)	0.0	✓✓	0.83 (0.01)	0.0	✓✓
single-sum-rbf-lin	0.87 (0.02)	0.0	✓✓	0.86 (0.05)	0.00025	✓	0.84 (0.02)	0.0	✓✓	0.84 (0.01)	0.0	✓✓
3quart_1quart
phys-DY	0.98 (0.01)			0.92 (0.01)			0.99 (0.01)			0.94 (0.01)
adaboost-gen-rbf	0.57 (0.06)	0.0	✓✓	0.85 (0.03)	0.0	✓✓	0.29 (0.23)	0.0	✓✓	0.77 (0.08)	0.0	✓✓
single-rbf	0.83 (0.02)	0.0	✓✓	0.77 (0.02)	0.0	✓✓	0.64 (0.08)	0.0	✓✓	0.77 (0.02)	0.0	✓✓
single-sum-rbf-pol	0.84 (0.02)	0.0	✓✓	0.82 (0.02)	0.0	✓✓	0.85 (0.04)	0.0	✓✓	0.83 (0.02)	0.0	✓✓
single-sum-rbf-lin	0.83 (0.02)	0.0	✓✓	0.83 (0.02)	0.0	✓✓	0.91 (0.04)	0.0	✓✓	0.84 (0.01)	0.0	✓✓
1dec_10dec
phys-DY	0.96 (0.01)			0.85 (0.07)			0.95 (0.01)			0.94 (0.01)
adaboost-gen-rbf	0.59 (0.06)	0.0	✓✓	0.54 (0.33)	0.0	✓✓	0.96 (0.01)	0.0	✓	0.94 (0.04)	0.59846	✗
single-rbf	0.78 (0.03)	0.0	✓✓	0.58 (0.15)	0.0	✓✓	0.91 (0.01)	0.0	✓✓	0.90 (0.01)	0.0	✓✓
single-sum-rbf-pol	0.86 (0.02)	0.0	✓✓	0.95 (0.06)	0.0	✓	0.93 (0.01)	0.0	✓✓	0.93 (0.01)	0.0	✓✓
single-sum-rbf-lin	0.86 (0.02)	0.0	✓✓	0.96 (0.06)	0.0	✓	0.92 (0.01)	0.0	✓✓	0.92 (0.01)	0.0	✓✓
10dec_1dec
phys-DY	0.96 (0.02)			0.96 (0.01)			0.99 (0.01)			0.96 (0.01)
adaboost-gen-rbf	0.55 (0.05)	0.0	✓✓	0.94 (0.01)	0.0	✓✓	0.27 (0.32)	0.0	✓✓	0.86 (0.09)	0.0	✓✓
single-rbf	0.84 (0.02)	0.0	✓✓	0.90 (0.01)	0.0	✓✓	0.55 (0.26)	0.0	✓✓	0.90 (0.01)	0.0	✓✓
single-sum-rbf-pol	0.61 (0.06)	0.0	✓✓	0.92 (0.01)	0.0	✓✓	1.00 (0.01)	0.17971	✗	0.92 (0.01)	0.0	✓✓
single-sum-rbf-lin	0.61 (0.06)	0.0	✓✓	0.93 (0.01)	0.0	✓✓	0.98 (0.04)	0.01646	✓✓	0.93 (0.01)	0.0	✓✓

Table 3: The first column indicates the data sample and the model used to describe the data. The second group of columns provides information on the AUC: first, the mean value $\mu_{AUC}$ is reported along with its uncertainty $\sigma$ in parentheses; then, the $p$ -value from the Wilcoxon test is presented. This is followed by the result of rejecting the null hypothesis, $H_{0}$ . A double ✓✓ indicates the rejection of $H_{0}$ and that the phys-DY model performs better than the classifier in that row, while a single ✓ indicates the rejection of $H_{0}$ and that the phys-DY model performs worse than the classifier in that row. An ✗ indicates that $H_{0}$ cannot be rejected. The third, fourth, and fifth groups of columns present the same information for $\text{PREC}^{+}$ , $\text{PREC}^{-}$ , and ACC, respectively.

III.4 Support vector machines training and testing

To evaluate the efficiency of the proposed support vector machines, we perform training and testing experiments utilizing the data described in Sec. III.1. In the training phase, a subset of the data is used to fit the model. During the testing phase, the fitted model obtains the predictions for the remaining data, where these predictions are the labels of whether a given data point is signal or background. To ensure reliable performance metrics for each support vector machine, we implement a repeated $k$ -fold cross-validation. We divide the data into $k$ folds, where each fold is used once as the test set while the remaining $k-1$ folds serve as the training set. The entire $k$ -fold cross-validation is then repeated $N_{cv}$ times, with a different random split for each repetition. Overall, this results in $k\times N_{cv}$ training and testing cycles. The reported metrics are the mean values of the resulting distributions, with one standard deviation as the associated uncertainty [45, 28].
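The repeated $k$ -fold procedure can be sketched as a generator of train and test index lists. This is a minimal plain-Python illustration, not the framework used for the experiments; the function name `repeated_kfold`, the seed, and the toy sample size of 50 are assumptions, while $k=10$ and $N_{cv}=10$ are the values used later in Sec. IV.1.

```python
import random

def repeated_kfold(n_samples, k=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k * n_repeats cycles."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                       # new random split per repetition
        folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
        for i in range(k):
            test = folds[i]                        # fold i is held out once
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, test

cycles = list(repeated_kfold(n_samples=50, k=10, n_repeats=10))
```

Fitting a classifier on each `train` list and evaluating it on the matching `test` list would then give one value of each metric per cycle, that is, a distribution of $k\times N_{cv}$ values.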

The classifier metrics are defined in terms of the error matrix elements: $TP$ is the number of true positive values, $TN$ is the number of true negative values, $FP$ is the number of false positive values, $FN$ is the number of false negative values [46]. The metrics considered in this paper are the accuracy ACC,

$\text{ACC}=\frac{TP+TN}{TP+TN+FP+FN},$ (19)

the positive precision $\text{PREC}^{+}$ ,

$\text{PREC}^{+}=\frac{TP}{TP+FP},$ (20)

the negative precision $\text{PREC}^{-}$ ,

$\text{PREC}^{-}=\frac{TN}{TN+FN},$ (21)

and the Area Under the Receiver Operating Characteristic Curve, AUC. The AUC is the area under the curve of the true positive rate plotted against the false positive rate at different thresholds [47]. In SVMs, these thresholds are obtained by varying the offset of the hyperplane from the origin to produce different predictions. The values of these metrics are within the range [0,1], where 0 corresponds to the worst performance and 1 to the best performance.
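The metrics above can be computed directly from the true labels, the predicted labels, and a continuous classifier score. The sketch below is a plain-Python illustration with hypothetical function names and toy values; the AUC is evaluated through its equivalent rank interpretation, the probability that a randomly chosen signal event receives a higher score than a randomly chosen background event, rather than by integrating the curve explicitly.

```python
def confusion(y_true, y_pred):
    """Return (TP, TN, FP, FN) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == -1 and p == -1)
    fp = sum(1 for y, p in pairs if y == -1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == -1)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # Eq. (19)

def precision_pos(tp, fp):
    return tp / (tp + fp)                    # Eq. (20); needs a predicted +1

def precision_neg(tn, fn):
    return tn / (tn + fn)                    # Eq. (21); needs a predicted -1

def auc(y_true, scores):
    """AUC via pairwise comparison of signal and background scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with three signal and three background events
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]   # e.g. signed distances to the hyperplane
tp, tn, fp, fn = confusion(y_true, y_pred)
```

In scikit-learn, for instance, scores of this kind are returned by `SVC.decision_function`.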

III.5 Computing implementation

We implement our calculations in Python. In particular, we use NumPy [48] and the libsvm [49] implementation provided by scikit-learn [50]. The computing framework for our experiments is publicly available on GitHub [51].

IV Results

IV.1 Cross validation and statistical tests

In Fig. 2, we display the ACC, $\text{PREC}^{+}$ , $\text{PREC}^{-}$ , and AUC defined in Sec. III.4. These metrics are calculated for the data samples listed in Table 1 and the support vector machines described in Table 2. Each point in these plots represents the mean value of the metric calculated following the cross-validation procedure described in Sec. III.4. Here we set $k=10$ and $N_{cv}=10$ , that is, we run the training and testing phases 100 times for each sample and support vector machine, and obtain a distribution for each metric. The displayed ACC, $\text{PREC}^{+}$ , $\text{PREC}^{-}$ , and AUC are calculated using the predicted classes for the test samples excluded during the training phase. In these plots, a line of a given color shows the behavior across all the proposed support vector machines for a specific data sample.

We compare the support vector machine containing the Drell-Yan kernel with the four support vector machines that showed the best performance in Fig. 2 (we consider that the rest of the classifiers are evidently outperformed by our proposed physics-informed kernel). These are the single-sum-rbf-pol, single-sum-rbf-lin, single-rbf, and adaboost-gen-rbf support vector machines, whose kernels are described in Table 2. In this work, we use a paired Wilcoxon signed-rank test [52], a non-parametric counterpart of the paired Student's $t$ -test that is suited to distributions with non-Gaussian behavior. This test determines whether the difference between the metrics of the physics-informed support vector machine and those of the others is statistically significant. Let $H_{0}$ be the null hypothesis that the metrics of the classifiers are equal. The purpose is to decide whether to reject $H_{0}$ in light of the distributions of the metrics obtained in the cross-validation procedure. We reject $H_{0}$ at a significance level of $\alpha=0.05$ , meaning we conclude that the ACC, $\text{PREC}^{+}$ , $\text{PREC}^{-}$ , and AUC of two classifiers are indeed not equal if the $p$ -value from the Wilcoxon test is below 0.05. Table 3 summarizes these tests: for each metric, we display the mean value of its distribution along with the associated uncertainty given by the standard deviation of this distribution. Moreover, in the column named R. $H_{0}$ we display the results of the Wilcoxon test: check marks, ✓ or ✓✓, indicate that the test rejects the null hypothesis, and a cross mark, ✗, indicates that $H_{0}$ is not rejected. Table 3 contains the results for the samples described in Table 1.
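To make the test concrete, the sketch below implements a simplified two-sided paired Wilcoxon signed-rank test in plain Python. It uses the normal approximation without tie or continuity corrections, so its $p$ -values can differ slightly from those of a statistics library, which would normally be preferred in practice. The function name and the ten toy AUC values per classifier are hypothetical and are not taken from Table 3.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (T, p_value). Zero differences are dropped and tied absolute
    differences share their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0                        # identical samples: no evidence
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0              # average rank of the tied block
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t_stat = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (t_stat - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided tail probability
    return t_stat, p_value

# Hypothetical per-fold AUC values for two classifiers (toy numbers)
auc_phys = [0.98, 0.97, 0.99, 0.98, 0.96, 0.98, 0.97, 0.99, 0.98, 0.97]
auc_other = [0.85, 0.86, 0.84, 0.87, 0.85, 0.83, 0.86, 0.85, 0.84, 0.86]
t_stat, p_value = wilcoxon_signed_rank(auc_phys, auc_other)
reject_h0 = p_value < 0.05
```

Because the test is paired, both lists must come from the same cross-validation cycles, in the same order.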

IV.2 Discussion

The first feature to note from Fig. 2 is that the values for ACC are stable across the different samples. The reason is that this metric averages over both the signal and background classification results; it is appropriate for describing a balanced data sample. A similar pattern is observed in the values found for the AUC. Conversely, large fluctuations arise in the signal precision, $\text{PREC}^{+}$ , and the background precision, $\text{PREC}^{-}$ . In the plots in Fig. 2, the most noticeable fluctuations appear in the lines corresponding to the samples with imbalance 3:1 and 10:1. This poor behavior of most of the classifiers is expected, since when there are not enough samples of one kind during the training phase, the support vector machine fails to describe both classes. Note that the assessment of a given classifier can be misleading, as a classifier can achieve high AUC, ACC, and $\text{PREC}^{+}$ while its $\text{PREC}^{-}$ is near zero. This suggests that the most important metrics are the positive and negative precisions, and that a good classifier is expected to be robust against imbalance in the data samples, which is typical in high energy physics. Remarkably, our proposed physics-informed classifier phys-DY shows high values for all the metrics presented here. The reason for this could be that we have effectively captured the intrinsic properties of the data samples by incorporating physics information into the kernel of the support vector machine. Other classifiers also exhibit stable metrics across the samples, which can be explained by the fact that their kernels are similar to the one inspired by the Drell-Yan process.

From Table 3, we can quantitatively compare the physics-informed kernel against the best-performing kernels. The first notable feature is that, in most cases, we can reject $H_{0}$, as shown by the ✓ or ✓✓ that appears in almost every case. Upon inspecting the metric values, there are two scenarios when we reject $H_{0}$. In the first, the physics-informed kernel outperforms the exotic kernel, indicated by a double check mark (✓✓). In the second, the exotic kernel outperforms the physics-informed kernel, indicated by a single check mark (✓). In almost all of the metrics presented in this study, our proposed physics-informed kernel in Eq. (15) performs better than the other kernels. Specifically, for PREC⁺, our physics-informed kernel performs very well on the most imbalanced data sample, 1dec_10dec. PREC⁺ is the metric that quantifies how well a classifier finds signal events in the sample. Therefore, the PREC⁺ attained by the physics-informed kernel indicates that this kernel is useful for high energy physics data. Moreover, the physics-informed kernel presents a stable PREC⁻ across all the samples, which indicates the robustness of this kernel against imbalance in data samples. Two other kernels show competitive metrics, namely the single-sum-rbf-pol and single-sum-rbf-lin kernels. These kernels are the sums of the individual kernels defined in Eqs. (3) and (5), and Eqs. (3) and (2), respectively. Comparing them with the physics-informed kernel in Eq. (15), we conclude that both of these kernels can capture the dynamical properties of the Drell-Yan cross-section.
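As an illustration of how such sum kernels can be assembled, the following NumPy sketch builds the rbf+pol and rbf+lin combinations and checks that the resulting Gram matrix is symmetric and positive semi-definite. The functional forms and hyperparameter values used here are the standard textbook ones and are assumptions; they are meant to stand in for Eqs. (2), (3), and (5), whose exact parametrization may differ. The function names are illustrative and the input data are random.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||x - y||^2) for all pairs of rows."""
    sq_dist = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * (X @ Y.T)
    )
    # Clip tiny negative values produced by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def poly_kernel(X, Y, gamma=1.0, coef0=1.0, degree=2):
    """Polynomial kernel (gamma * <x, y> + coef0)^degree."""
    return (gamma * (X @ Y.T) + coef0) ** degree


def linear_kernel(X, Y):
    """Linear kernel <x, y>."""
    return X @ Y.T


def sum_rbf_pol(X, Y):
    """Sum of the Gaussian and polynomial kernels."""
    return rbf_kernel(X, Y) + poly_kernel(X, Y)


def sum_rbf_lin(X, Y):
    """Sum of the Gaussian and linear kernels."""
    return rbf_kernel(X, Y) + linear_kernel(X, Y)


# Random feature matrix (20 events, 3 features), for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))

gram_pol = sum_rbf_pol(X, X)
gram_lin = sum_rbf_lin(X, X)
# The sum of two valid kernels is again a valid kernel, so the smallest
# eigenvalue of each Gram matrix should be non-negative up to rounding.
min_eig_pol = np.linalg.eigvalsh(gram_pol).min()
min_eig_lin = np.linalg.eigvalsh(gram_lin).min()
print(min_eig_pol, min_eig_lin)
```

A function with this `(X, Y)` signature can be passed as a callable `kernel` to scikit-learn's `SVC`, which is one possible way to train a support vector machine with a custom sum kernel. Whether the code of Ref. [51] follows this route is not stated in the text.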

V Conclusions

In this work, we analyzed several types of kernels that define a support vector machine. We proposed a physics-informed kernel to describe simulated data of a simple and well-known Standard Model process. The physics of this process is introduced into the kernel in a straightforward manner: we consider the functional form of the kinematic variables found in the cross-section and then transform the matrix that represents the data according to these functional forms. To test the effectiveness of this method, we constructed unconventional kernels that, a priori, can overcome the typical challenges of high energy physics. We carried out statistical tests to determine whether the physics-informed kernel is competitive with kernels constructed with sophisticated machine-learning algorithms. Remarkably, we found that our proposed physics-informed kernel outperformed these algorithms in almost all of the metrics considered. This finding motivates further investigation into the improvement of machine learning algorithms for more complex high energy physics data using the proposed approach. This simple method of introducing physics insights into kernel methods proved effective in our tests, and since there is a connection between kernel methods and neural networks [53], the techniques studied in this paper could be extended to more modern machine learning algorithms based on kernel methods.

Acknowledgements.

This work was funded by the CONAHCYT project I1200/311/2023. T. C. P. thanks CONAHCYT for a postdoctoral fellowship. A. G. R. thanks SNII (México).

References

[1] S. Whiteson, D. Whiteson, Eng. Appl. Artif. Intell. 22, 8 (2009).
[2] P.T. Komiske, E.M. Metodiev, J. Thaler, JHEP 01, 121 (2019).
[3] K.K. Sharma, MPLA. 36, 02 (2021).
[4] P. Baldi, P. Sadowski, D. Whiteson, Nat. Commun. 05, 4308 (2014).
[5] A. Alves, JINST 12, 05 (2017).
[6] T. Biswas, A. Datta, JHEP 05, 104 (2023).
[7] P.C. Bhat, R. Gilmartin, H.B. Prosper, Phys. Rev. D 62, 074022 (2000).
[8] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, D. Whiteson, Eur. Phys. J. C 76, 235 (2016).
[9] A. Aurisano et al., JINST 11, P09001 (2016).
[10] F. Bishara, A. Paul, J. Dy, Sci. Rep. 14, 5294 (2024).
[11] C.W. Murphy, Phys. Rev. D 97, 015007 (2018).
[12] H.B. Prosper, Phys. Rev. 37, 1153 (1988).
[13] P. Baldi, P. Sadowski, D. Whiteson, Phys. Rev. Lett. 114, 111801 (2015).
[14] G.C. Strong, Mach. Learn.: Sci. Technol. 1, 045006 (2020).
[15] E. Barberio, B. Le, E. Richter-Was, Z. Was, J. Zaremba, D. Zanzi, Phys. Rev. 96, 073002 (2017).
[16] J. Amacker, W. Balunas, L. Beresford, D. Bortoletto, J. Frost, C. Issever, J. Liu, J. McKee, A. Micheli, S.P. Saenz, M. Spannowsky, B. Stanislaus, JHEP 12, 115 (2020).
[17] G.E. Karniadakis, I.G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Nat. Rev. Phys. 03, 422-440 (2021).
[18] K. Mudunuru, S. Karra, Comput. Methods Appl. Mech. Eng. 374, 113560 (2021).
[19] V.S. Ngairangbam, M. Spannowsky, JHEP 05, 004 (2024).
[20] C. Li, H. Qu, S. Qian, Q. Meng, S. Gong, J. Zhang, T.Y. Liu, Q. Li, Phys. Rev. D 109, 056003 (2024).
[21] Z. Hao, R. Kansal, J. Duarte, N. Chernyavskaya, Eur. Phys. J. C 83, 485 (2023).
[22] O. Atkinson, A. Bhardwaj, C. Englert, P. Konar, V.S. Ngairangbam, M. Spannowsky, Front. Artif. Intell. 5, 943135 (2022).
[23] M.Ö. Sahin, D. Krücker, I.A. Melzer-Pellmann, Nucl. Instrum. Methods Phys. Res., Sect. A. 838, 137-146 (2016).
[24] M. Aaboud et al. (ATLAS Collaboration), Phys. Rev. D 108, 032014 (2023).
[25] A. Vaiciulis, Nucl. Instrum. Methods Phys. Res., Sect. A. 502, 2-3 (2003).
[26] F. Sforza, V. Lippi, Nucl. Instrum. Methods Phys. Res., Sect. A. 722, 11-19 (2013).
[27] S.L. Wu, S. Sun, W. Guan, C. Zhou, J. Chan, C.L. Cheng, T. Pham, Y. Qian, A.Z. Wang, R. Zhang, M. Livny, J. Glick, P.Kl. Barkoutsos, S. Woerner, I. Tavernelli, F. Carminati, A.D. Meglio, A.C.Y. Li, J. Lykken, P. Spentzouris, S.Y. Chen, S. Yoo, T. Wei, Phys. Rev. Research 3, 033221 (2021).
[28] A. Ramirez-Morales, J.U. Salmon-Gamboa, J. Li, A.G. Sanchez-Reyna, A. Palli-Valappil, Appl. Intell. 53, 4996–5012 (2023).
[29] C. Cortes, V. Vapnik, Mach. Learn. 20, 273–297 (1995).
[30] J. Shawe-Taylor, N. Cristianini, Cambridge University Press (2004).
[31] R.A. Horn, C.R. Johnson, Matrix Analysis, 2nd ed., Cambridge University Press (2012).
[32] H.T. Lin, C.J. Lin, Neural Comput. 3, 1-32 (2003).
[33] Y.W. Chang, C.J. Hsieh, K.W. Chang, M. Ringgaard, C.J. Lin, JMLR 11, 4 (2010).
[34] C. Zhang, Y. Ma, Springer 144 (2012).
[35] O. Sagi, L. Rokach, WIRES DMKD 8, 1249 (2018).
[36] R.E. Schapire, Y. Singer, COLT 37 (3), 297-336 (1999).
[37] J.H. Holland, Adaptation in Natural and Artificial Systems, Univ. of Michigan Press 2ed (1992).
[38] E.E.E. Ali, E. Elamin, King Saud Univ., Coll. of Comput. and Inf. Sci. In Proceedings of the 1st NITS (2006).
[39] S.D. Drell, T.M. Yan, Phys. Rev. Lett. 25, 316-320 (1970).
[40] J.M. Campbell et al., Rep. Prog. Phys. 89, 70 (2007).
[41] C. Bierlich et al., SciPost Phys. Codebases 8 (2022).
[42] R.L. Workman et al. [Particle Data Group], PTEP 2022, 083C01 (2022).
[43] The ATLAS collaboration, M. Aaboud, G. Aad, et al., J. High Energ. Phys. 12, 59 (2017).
[44] The CMS collaboration, A.M. Sirunyan, A. Tumasyan, et al., J. High Energ. Phys. 12, 59 (2019).
[45] M. Kuhn, K. Johnson, Applied Predictive Modeling, Springer 26 (2013).
[46] D.M.W. Powers, J. Mach. Learn. Technol. 2 (2008).
[47] A.P. Bradley, Pattern Recognition 30(7), 1145-1159 (1997).
[48] C.R. Harris, K.J. Millman, S.J. van der Walt, et al., Nature 585, 357–362 (2020).
[49] C.C. Chang, C.J. Lin, ACM TIST 2, 3 (2011).
[50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, J. Mach. Learn. Res. 12, 2825–2830 (2011).
[51] A. Ramirez-Morales, A. Davila-Rivera, Github: SVM-physics code. https://github.com/andrex-naranjas/SVM-physics.
[52] F. Wilcoxon, Biometrics Bulletin 1 (6), 80–83 (1945).
[53] S. Wang, X. Yu, P. Perdikaris, Preprint at arXiv, https://arxiv.org/abs/2007.14527 (2020).