Parton Labeling without Matching:
Unveiling Emergent Labelling Capabilities in Regression Models

Shikai Qiu, calvin_qiu@berkeley.edu, Department of Physics, University of California, Berkeley, Berkeley, CA 94720, USA; Courant Institute of Mathematical Sciences, New York University, New York, NY 10012

Shuo Han, shuohan@lbl.gov, Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

Xiangyang Ju, xju@lbl.gov, Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

Benjamin Nachman, bpnachman@lbl.gov, Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Berkeley Institute for Data Science, University of California, Berkeley, CA 94720, USA

Haichen Wang, haichenwang@berkeley.edu, Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Department of Physics, University of California, Berkeley, Berkeley, CA 94720, USA

Abstract

Parton labeling methods are widely used when reconstructing collider events with top quarks or other massive particles. State-of-the-art techniques are based on machine learning and require training data with events that have been matched using simulations with truth information. In nature, there is no unique matching between partons and final state objects due to the properties of the strong force and due to acceptance effects. We propose a new approach to parton labeling that circumvents these challenges by recycling regression models. The final state objects that are most relevant for a regression model to predict the properties of a particular top quark are assigned to said parent particle without having any parton-matched training data. This approach is demonstrated using simulated events with top quarks and outperforms the widely-used $\chi^{2}$ method.

I Introduction

A common task in collider event reconstruction is assigning final state objects to a branch of the hypothesized reaction that generated the event. For example, hard-scatter events with outgoing quarks and gluons produce jets that can be associated with their initiating partons. When there are many outgoing particles from the hard-scatter reaction, this is a complex combinatorial challenge. Events with multiple top quarks naturally result in such final states, since nearly all top quarks decay to a $b$-quark and a $W$ boson, which subsequently decays to two quarks or leptons. A key challenge in many measurements and searches involving top quarks is the assignment of reconstructed objects to the top quark decay products. Classically, this assignment has used $\chi^{2}$ or related methods that enumerate all possibilities and pick the one that is most consistent with having two on-shell $W$ bosons and top quarks as intermediaries. The difficulty with these methods is that they do not take into account all available information and are computationally expensive.

A number of modern machine learning (ML) methods have been proposed to address these challenges. These techniques range from Boosted Decision Trees [1, 2, 3] and existing neural networks [4, 5] to custom, permutation invariant deep learning methods [6, 7, 8, 9]. In all cases, object identification can make use of a variety of lepton-, jet- and event-level properties that were inaccessible with $\chi^{2}$ or likelihood methods [10]. This is possible because the ML approaches are trained on simulations, so whatever information is available and well-modeled (within uncertainty) can be used for object labeling.

Figure 1: Simulated all-hadronic $t\bar{t}$ events. In the $N_{\text{colors}}\rightarrow\infty$ limit, hadrons can be uniquely associated as $W$ boson descendants. Top: number of jets with at least 10% of their energy from the $W^{+}$. Bottom: of these jets, the fraction of their energy from the $W^{+}$. Jets are clustered using the anti-$k_{t}$ [11] algorithm with $R=0.4$.

Despite the success of these ML methods, they all share a common fundamental challenge with classical approaches. In particular, they all require matched objects for training. This may be problematic for two reasons (see e.g. Fig. 1). First, there is no unique match between a hard-scatter quark/gluon and a jet. A single quark/gluon can fragment into multiple jets, and a single jet can be composed of hadrons with energy flow originating from multiple quarks/gluons. This is particularly acute for top quarks, which carry color charge and thus must be color-connected to another quark/gluon in the event. The extent of the overlap also depends on the jet clustering algorithm: jets with a larger catchment area [12] are more likely to result from the merger of multiple parton showers. Second, even if a parent object like a top quark could be uniquely associated with a set of decay products, acceptance effects will obscure the association. In particular, the finite geometric and energy acceptance of detectors results in missed final state objects.

Our philosophy is to circumvent the issues caused by object-parton matching by directly regressing onto the target particle properties. In Ref. [13], we designed the Covariant Particle Transformer (CPT), a partially Lorentz covariant point cloud transformer, to learn the four-vectors of top quarks given reconstructed jets, leptons, photons, and missing energy. In this paper, we show how one can reuse such a regression method to perform parton labeling. We explore two possibilities, one based on the attention mechanism within the CPT and one based on the gradient of predicted four-vectors with respect to the inputs. The latter approach is compatible with any regression-based top quark reconstruction method, even if it does not involve neural network attention. While we still advocate for regression in cases where the underlying top quark properties are needed, parton labeling is still widely used for determining these properties and no matter what approach is used, parton labels can be useful for diagnostic purposes.

This paper is organized as follows. Section II briefly reviews the CPT technique and then introduces our two approaches to extracting parton labels from the regression model. Numerical results are presented in Sec. IV using a dataset that is briefly introduced in Sec. III. The paper ends with conclusions and outlook in Sec. V.

II Methods

Our goal is to take final states with $n$ top quarks that decay hadronically and assign jets to one of these quarks. In principle, one could simultaneously predict $n$ and assign jets, but in practice, there is often a particular number of target quarks; if not, one could first run a multi-class classification procedure. We also restrict our approach to assigning three jets to each top quark. Both ML-based approaches described below could be modified to assign fewer or more jets by placing thresholds on the Jacobian values (Sec. II.2) or the attention weights (Sec. II.3), but we leave this to future work.

II.1 Covariant Particle Transformer

The Covariant Particle Transformer (CPT) is a Transformer-based [14] neural network tailored for collider physics applications and has demonstrated superior performance in predicting top quarks’ kinematics compared to classical approaches [13]. CPT takes as inputs the 4-vectors and particle identifications of all observed final state objects (jets, leptons, photons, etc.) and outputs predicted 4-vectors of a pre-specified number of top quarks. Compared to the standard transformer architecture, CPT is designed to respect important symmetries in collider physics: it is permutation invariant under reordering of the inputs and partially Lorentz covariant, meaning that if we apply a longitudinal boost and/or a transverse rotation to all the inputs, CPT’s outputs will be boosted and/or rotated accordingly, respecting Lorentz symmetry.

In each layer of the network, CPT additively updates the feature vector of every object $f_{i}$ (could be an input or output) with $\Delta f_{i}$ defined as a function of all the feature vectors $\{f_{k}\}:$

$$\Delta f_{i}=\sum_{k}\alpha_{ik}\varphi(f_{k}), \tag{1}$$

where $\varphi$ is a learned linear transformation and $\{\alpha_{ik}\}$ are positive attention weights, which are themselves non-linear functions of $\{f_{k}\},$ such that $\sum_{k}\alpha_{ik}=1$ for each $i.$ The output feature vectors are eventually transformed to the predicted 4-vectors of the top quarks. If $i$ is an output index and $k$ is an input index, then intuitively $\alpha_{ik}$ measures the importance of the information in $k$ for predicting the properties of $i.$ The above procedure is named the covariant attention mechanism, which modifies the standard attention mechanism in a transformer to ensure partial Lorentz covariance. To capture complex correlations between the inputs and outputs, CPT uses $L=6$ covariant attention layers and $H=4$ attention heads per layer to decode the top quark 4-vectors, where each attention head performs separate learned updates according to Equation 1 for added flexibility. We refer readers to the original CPT paper for a more comprehensive review of the architecture and implementation.
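As a rough illustration of the additive update in Eq. 1, the sketch below computes normalized positive weights $\alpha_{ik}$ and the resulting $\Delta f_{i}$ for a single object. It assumes plain dot-product scores passed through a softmax, which is the standard attention mechanism and not the covariant variant used in CPT; the names `softmax`, `attention_update`, and the toy map `phi` are hypothetical.

```python
import math

def softmax(scores):
    """Map raw scores to positive weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_update(f_i, features, phi):
    """Return (alpha_i, delta_f_i) for one object i, in the spirit of Eq. 1."""
    # Assumption: plain dot-product scores. CPT uses covariant scores instead.
    scores = [sum(a * b for a, b in zip(f_i, f_k)) for f_k in features]
    alpha = softmax(scores)
    transformed = [phi(f_k) for f_k in features]
    dim = len(transformed[0])
    delta = [sum(alpha[k] * transformed[k][d] for k in range(len(features)))
             for d in range(dim)]
    return alpha, delta

# Toy example: three objects with two-dimensional feature vectors.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phi = lambda f: [2.0 * f[0], f[0] + f[1]]  # stand-in for the learned linear map
f_i = [1.0, 0.5]
alpha, delta = attention_update(f_i, features, phi)
updated = [a + d for a, d in zip(f_i, delta)]  # additive update of f_i
```

In this toy setup the weights are positive and sum to one by construction, matching the constraint $\sum_{k}\alpha_{ik}=1$ stated above.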

II.2 Gradient-based Labeling

The idea of the gradient-based method is to assign a jet to a particular top quark if changes to the jet properties result in significant changes to the top quark properties. If the top quarks were produced independently of each other and of other radiation within the event, then only the jets they produce should be relevant for reconstructing their properties. In reality, this is not the case because top quarks and other objects are correlated through momentum conservation and other physics effects.

Strictly speaking, the term ‘gradient’ applies only to one-dimensional quantities (e.g. top quark $p_{T}$), so for regression methods that predict multiple top quark properties, a more accurate name would be ‘Jacobian-based’. For simplicity, we will henceforth always call this method ‘gradient-based’.

The gradient-based labeling scheme is compatible with any regression model (not just the CPT from Sec. II.1) and is based on the following quantity:

$$\Delta_{ik}=\left\lVert\left(\frac{\partial f_{i,p_{T}}}{\partial j_{k,p_{T}}},\frac{\partial f_{i,y}}{\partial j_{k,y}},\frac{\partial f_{i,\phi}}{\partial j_{k,\phi}}\right)\right\rVert, \tag{2}$$

where $f_{i,x}$ is the predicted $x\in\{p_{T},y,\phi\}$ of top quark $i$ and $j_{k,x}$ is the observed $x$ of jet $k$. Since $f_{i}$ is a neural network, we can compute the derivatives in Eq. 2 using the same automatic differentiation (e.g. backpropagation) that is used when training the network in the first place. We assign jet $k$ to top quark $i$ if $\Delta_{ik}$ is one of the top three values across all $k$. The same jet could be assigned to multiple top quarks. Equation 2 is not the only possible combination of Jacobian elements, and other combinations may be more effective. We found that using the derivatives with respect to $p_{T}$, $y$, and $\phi$ was only slightly better than using $p_{T}$ alone. More complex schemes that weight the different entries separately are also possible.

When $f$ is a CPT, $\Delta_{ik}$ is a partial Lorentz scalar, so the labeling is invariant under longitudinal boosts and rotations in the transverse plane.
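The quantity in Eq. 2 and the top-three assignment rule can be sketched as follows. This is a minimal illustration under stated assumptions: the derivatives are approximated with central finite differences instead of automatic differentiation, and `toy_regressor` with its `WEIGHTS` matrix is a hypothetical linear stand-in for a trained model such as CPT.

```python
import math

def delta_matrix(regressor, jets, eps=1e-4):
    """Approximate Delta[i][k] of Eq. 2 with central finite differences."""
    n_tops = len(regressor(jets))
    delta = [[0.0] * len(jets) for _ in range(n_tops)]
    for k in range(len(jets)):
        sq = [0.0] * n_tops
        for x in range(3):  # components ordered as (p_T, y, phi)
            up = [list(j) for j in jets]
            down = [list(j) for j in jets]
            up[k][x] += eps
            down[k][x] -= eps
            f_up, f_down = regressor(up), regressor(down)
            for i in range(n_tops):
                d = (f_up[i][x] - f_down[i][x]) / (2 * eps)
                sq[i] += d * d
        for i in range(n_tops):
            delta[i][k] = math.sqrt(sq[i])
    return delta

def top_three(row):
    """Indices of the three largest entries, returned in increasing order."""
    ranked = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
    return sorted(ranked[:3])

# Hypothetical linear regressor: each top is a weighted sum of the jets.
WEIGHTS = [[0.9, 0.8, 0.1, 0.7, 0.0, 0.2, 0.05],
           [0.1, 0.0, 0.9, 0.2, 0.8, 0.7, 0.05]]

def toy_regressor(jets):
    return [[sum(w * j[x] for w, j in zip(row, jets)) for x in range(3)]
            for row in WEIGHTS]

# Seven toy jets given as [p_T, y, phi].
jets = [[120.0, 0.3, 0.1], [80.0, -0.5, 1.2], [95.0, 1.1, -2.0],
        [60.0, 0.0, 2.5], [45.0, -1.4, -0.7], [40.0, 0.8, 3.0],
        [30.0, 2.0, -1.5]]
delta = delta_matrix(toy_regressor, jets)
labels = [top_three(row) for row in delta]  # three jets per top quark
```

Because each top quark is ranked independently, nothing in this sketch prevents the same jet from appearing in both label sets, consistent with the remark above.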

II.3 Attention-based Labeling

In each covariant attention layer and attention head in CPT, the attention weight $\alpha_{ik}$ can be interpreted as a measure of the importance of input $k$ for predicting the properties of top $i,$ locally in the network. By averaging $\alpha_{ik}$ over all layers and attention heads, we obtain a measure of the overall importance of input $k$ to top $i$ :

$$\bar{\alpha}_{ik}=\frac{1}{LH}\sum_{\ell,h}\alpha^{\ell h}_{ik}, \tag{3}$$

where $\alpha^{\ell h}_{ik}$ is the attention weight between top $i$ and input $k$ in the $h^{\text{th}}$ attention head in the $\ell^{\text{th}}$ layer. Similar to gradient-based labeling, we assign the jet with index $k$ to top quark $i$ if $\bar{\alpha}_{ik}$ is one of the top three values across all jets.

Due to the design of CPT, all attention weights are partial Lorentz scalars and $\bar{\alpha}_{ik}$ is again a partial Lorentz scalar, implying the labeling is invariant under longitudinal boosts and rotations in the transverse plane.
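The averaging in Eq. 3 and the subsequent assignment can be sketched as below. The nested-list layout `alpha[layer][head][top][input]`, the helper names, and the toy numbers are assumptions for illustration; in practice the weights would be read out of the trained network. The ranking is restricted to jet inputs, since other inputs (e.g. photons) also receive attention.

```python
def mean_attention(alpha):
    """Average alpha[l][h][i][k] over layers and heads (Eq. 3)."""
    n_layers, n_heads = len(alpha), len(alpha[0])
    n_tops, n_inputs = len(alpha[0][0]), len(alpha[0][0][0])
    norm = n_layers * n_heads
    return [[sum(alpha[l][h][i][k]
                 for l in range(n_layers) for h in range(n_heads)) / norm
             for k in range(n_inputs)]
            for i in range(n_tops)]

def assign_top_three(bar_alpha_row, jet_indices):
    """Pick the three jets with the largest averaged attention weight."""
    ranked = sorted(jet_indices, key=lambda k: bar_alpha_row[k], reverse=True)
    return sorted(ranked[:3])

# Toy weights for 2 layers x 2 heads, 2 tops, 5 inputs (input 4 is not a jet).
A = [[0.4, 0.3, 0.1, 0.1, 0.1], [0.1, 0.1, 0.2, 0.2, 0.4]]
B = [[0.2, 0.4, 0.2, 0.1, 0.1], [0.1, 0.2, 0.3, 0.3, 0.1]]
alpha = [[A, B], [B, A]]

bar_alpha = mean_attention(alpha)
jet_indices = [0, 1, 2, 3]
labels = [assign_top_three(row, jet_indices) for row in bar_alpha]
```

Since every $\alpha^{\ell h}_{ik}$ row sums to one over inputs, the averaged rows do as well, so $\bar{\alpha}_{ik}$ can be read as a share of the total attention.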

II.4 $\chi^{2}$ -based Labeling

The baseline parton labeling scheme that we use is a widely applied $\chi^{2}$ method. In particular, in events with at least two jets tagged as originating from bottom quarks ($b$-jets), the assignment of jets to top quarks is based on the combination that minimizes the following $\chi^{2}$:

$$\chi^{2}=\frac{(m_{b_{1}j_{1}j_{2}}-m_{t})^{2}}{\sigma_{m_{bjj}}^{2}}+\frac{(m_{b_{2}j_{3}j_{4}}-m_{t})^{2}}{\sigma_{m_{bjj}}^{2}}+\frac{(m_{j_{1}j_{2}}-m_{W})^{2}}{\sigma_{m_{jj}}^{2}}+\frac{(m_{j_{3}j_{4}}-m_{W})^{2}}{\sigma_{m_{jj}}^{2}}\,, \tag{4}$$

where $m_{t}$ and $m_{W}$ are the top quark and $W$ boson masses, respectively, and $\sigma_{m_{bjj}}$ and $\sigma_{m_{jj}}$ are the mass resolutions of truth-matched top quarks and $W$ bosons, respectively. As in this case, when we need to refer to classical truth labels, we call a top quark ‘truth-matched’ when each of its three quark decay products is within $\Delta R<0.4$ of exactly one jet (about 20% efficient). Events without six jets, two of which are $b$-tagged, are not reconstructable with the $\chi^{2}$ method. It may be possible to recover some of the non-reconstructable cases using other approaches for the $b$-jets (e.g. taking the highest energy jet(s)), so we check that our results hold in cases where events have two $b$-jets.
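The enumeration behind Eq. 4 can be sketched as follows. This is a hedged illustration, not the analysis code: jets are assumed to be given as $(E, p_x, p_y, p_z)$ four-vectors, the mass and resolution constants are placeholder values, $b$-tagged jets are assumed to be used only in the $b_{1}$ and $b_{2}$ slots, and the function names are hypothetical.

```python
import math
from itertools import combinations, permutations

M_TOP, M_W = 172.5, 80.4          # GeV, illustrative values
SIGMA_BJJ, SIGMA_JJ = 20.0, 10.0  # GeV, hypothetical resolutions

def mass(*vecs):
    """Invariant mass of the sum of (E, px, py, pz) four-vectors."""
    e, px, py, pz = (sum(v[c] for v in vecs) for c in range(4))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))

def chi2(b1, j1, j2, b2, j3, j4):
    """The four terms of Eq. 4 for one candidate assignment."""
    return (((mass(b1, j1, j2) - M_TOP) / SIGMA_BJJ) ** 2
            + ((mass(b2, j3, j4) - M_TOP) / SIGMA_BJJ) ** 2
            + ((mass(j1, j2) - M_W) / SIGMA_JJ) ** 2
            + ((mass(j3, j4) - M_W) / SIGMA_JJ) ** 2)

def best_assignment(jets, b_indices):
    """Return (chi2, top1_indices, top2_indices), or None if not reconstructable."""
    light = [k for k in range(len(jets)) if k not in b_indices]
    best = None
    for b1, b2 in permutations(b_indices, 2):
        for pair1 in combinations(light, 2):
            rest = [k for k in light if k not in pair1]
            for pair2 in combinations(rest, 2):
                idx = (b1, *pair1, b2, *pair2)
                value = chi2(*(jets[k] for k in idx))
                if best is None or value < best[0]:
                    best = (value, (b1, *pair1), (b2, *pair2))
    return best

# Toy event with massless jets built so that (0, 2, 3) and (1, 4, 5) are exact.
e_b1 = (M_TOP ** 2 - M_W ** 2) / (2 * M_W)
e_j = M_W / math.sqrt(2)
e_b2 = (M_TOP ** 2 - M_W ** 2) / (6 * e_j)
jets = [(e_b1, 0.0, 0.0, e_b1), (e_b2, 0.0, -e_b2, 0.0),
        (M_W / 2, M_W / 2, 0.0, 0.0), (M_W / 2, -M_W / 2, 0.0, 0.0),
        (e_j, 0.0, e_j, 0.0), (e_j, 0.0, 0.0, e_j)]
result = best_assignment(jets, b_indices=[0, 1])
```

The loop visits every ordered choice of two $b$-jets and two disjoint light-jet pairs, which is why the cost grows quickly with jet multiplicity.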

III Dataset

For numerical studies, we use the same dataset as in Ref. [13], which is briefly summarized below. Top quark pair production in association with a Higgs boson in proton-proton collisions is generated with Madgraph@NLO 2.3.7 [15] at next-to-leading order (NLO) in Quantum Chromodynamics (QCD). (The Higgs boson decays to photons and is largely ignored and irrelevant for jet labeling. We use this sample because it was the main one used in Ref. [13], although it was also shown that the performance is similar in other top quark final states.) The decays of the top quarks are simulated with MadSpin [16] and then the rest of the particle-level generation is performed with Pythia 8.235 [17]. While this dataset does not emulate detector effects, the salient features of the problem are already present at particle level. Jets are clustered using the anti-$k_{t}$ [11] algorithm with $R=0.4$ as implemented in FastJet 3.3.2 [18, 19].

Jets are required to have $|y|\leq 2.5$ and $p_{T}\geq 25$ GeV. Jets that are $\Delta R$ matched to $b$-quarks at the parton level are labeled as $b$-jets, where $\Delta R$ is defined as $\sqrt{\Delta y^{2}+\Delta\phi^{2}}$, $\Delta y$ is the difference of two particles in pseudorapidity, and $\Delta\phi$ is the difference in azimuthal angle. This label is removed randomly for 30% of the $b$-jets to mimic the inefficiency of realistic $b$-tagging [20, 21]. We do not add fake $b$-jets, since the fake rate (one in a few hundred) is sufficiently small that missing a real $b$-jet and falsely tagging a non-$b$-jet is rare enough to not impact the numerical results. We further apply a preselection on the testing set of $N_{\mathrm{bjet}}>0$ and $N_{\mathrm{jet}}\geq 3$ to mimic realistic data analysis requirements.
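The selection described above can be sketched in a few lines. The dictionary layout for jets and quarks, the function names, and the matching radius of 0.4 for the $b$-quark association are assumptions for illustration (the text only states the 0.4 radius for truth matching of top quark decay products).

```python
import math
import random

def delta_r(y1, phi1, y2, phi2):
    """Angular distance, with the azimuthal difference wrapped into [-pi, pi]."""
    dphi = math.remainder(phi1 - phi2, 2 * math.pi)
    return math.hypot(y1 - y2, dphi)

def select_jets(jets):
    """Keep jets with |y| <= 2.5 and p_T >= 25 GeV."""
    return [j for j in jets if abs(j["y"]) <= 2.5 and j["pt"] >= 25.0]

def tag_b_jets(jets, b_quarks, rng, max_dr=0.4, efficiency=0.7):
    """Label jets matched to a b-quark, keeping each label with 70% probability."""
    tags = []
    for j in jets:
        matched = any(delta_r(j["y"], j["phi"], q["y"], q["phi"]) < max_dr
                      for q in b_quarks)
        tags.append(matched and rng.random() < efficiency)
    return tags

def passes_preselection(tags):
    """Require at least one b-tagged jet and at least three jets."""
    return sum(tags) > 0 and len(tags) >= 3

# Toy event: two of the five jets fail the kinematic requirements.
jets = [{"pt": 60.0, "y": 0.5, "phi": 3.1},
        {"pt": 40.0, "y": -1.0, "phi": 0.2},
        {"pt": 20.0, "y": 0.0, "phi": 1.0},
        {"pt": 35.0, "y": 2.0, "phi": -2.0},
        {"pt": 50.0, "y": 3.0, "phi": 0.0}]
b_quarks = [{"y": 0.45, "phi": -3.1}]

selected = select_jets(jets)
tags = tag_b_jets(selected, b_quarks, random.Random(0))
keep_event = passes_preselection(tags)
```

Wrapping $\Delta\phi$ matters for the first jet here: it sits at $\phi=3.1$ while the quark sits at $\phi=-3.1$, so the two are close in angle even though the raw difference is 6.2.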

IV Results

First, we consider standard, non-unique metrics for evaluating performance. In particular, truth-matched top quarks are compared with each reconstruction method to see the fraction of the time that all three jets are the same. As noted earlier, the truth match labels are not unique, but this is a standard metric for quantifying performance. Figure 2 shows the frequency of an exact match for each method and for different jet multiplicities. The matching generally is harder the more jets there are in the event because there are more combinations and the truth label fidelity also degrades (see Fig. 1).
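The exact-match frequency per jet multiplicity can be computed along the following lines. The input layout, a list of `(n_jets, truth_triplet, predicted_triplet)` tuples with jet indices, and the function name are hypothetical; the comparison ignores the order of the three jets.

```python
from collections import defaultdict

def exact_match_rate(events):
    """Fraction of truth-matched tops whose three jets are all recovered,
    grouped by the jet multiplicity of the event."""
    counts = defaultdict(lambda: [0, 0])  # n_jets -> [matches, total]
    for n_jets, truth, pred in events:
        counts[n_jets][1] += 1
        if set(truth) == set(pred):
            counts[n_jets][0] += 1
    return {n: hit / total for n, (hit, total) in counts.items()}

# Toy input: two tops in six-jet events and one top in a seven-jet event.
events = [(6, (0, 1, 2), (2, 1, 0)),
          (6, (0, 1, 2), (0, 1, 3)),
          (7, (1, 4, 5), (1, 4, 5))]
rates = exact_match_rate(events)
```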

Overall, the attention-based approach outperforms the other two methods across all configurations, often by a large margin (10% or more). Inclusively, the gradient-based method outperforms the classical $\chi^{2}$ assignment, but the two approaches are comparable after requiring two $b$ -jets. Across all events and inclusively across jet multiplicities, the $\chi^{2}$ approach has a poor matching frequency (about 10%) in part because it requires two $b$ -jets and at least six distinct jets. In contrast, the attention- and gradient-based methods are still effective when there are fewer jets. The numbers for the attention-based and $\chi^{2}$ -based approaches are similar to the ones found by Spa-Net [6], although there are a number of differences in the setup that prohibit a precise comparison.

Next, we study events in which there is no truth match. Such events are not even part of the training for other ML-based labeling schemes, but our methods are still able to assign parton labels in these cases. One way to see if the assigned jets in such events are sensible is to examine their trijet invariant mass. Figure 3 presents histograms of this mass inclusively and for events without a truth match. There are roughly twice as many entries for the attention- and gradient-based histograms in the top plot of Fig. 3 because of events where there is no truth match. All five histograms in the figure look similar, with a peak near the top quark mass of about 175 GeV [22]. The peak is sharpest for the truth-matched events and is slightly sharper for the attention-based method than for the gradient-based method. This may be expected from Fig. 2, which indicates that the attention-based approach has a higher fidelity of picking the ‘correct’ jets.

Our last investigation is whether the trijet kinematic properties in unmatched events are close to those of the truth top quarks. One reasonable definition of a ‘good match’ would be that the reconstructed top properties are close to the truth properties, which does not require assigning quark identities to the jets. Since our methods are derived from a top quark property regressor, we would expect that the trijet properties align well with the truth top quark properties, but it is important to check. Figure 4 provides confirmation for the top quark $p_{T}$ and $y$.

V Conclusions and Outlook

Parton labeling continues to be an important task in collider event reconstruction even though such labels are not unique. We have proposed a set of tools based on regression methods that are able to assign parton labels without also needing unphysical parton matching for training. Our approaches are competitive even though they are not trained using trijet information and are much more flexible than other approaches, since we are able to accommodate events with fewer jets than expected from the lowest order decay Feynman diagrams. While our techniques are compatible with many regression approaches, the CPT model studied here is particularly useful because it is permutation invariant and partially Lorentz covariant. The corresponding labels inherit some of these properties.

There are a number of possible ways to further improve these approaches, including how to best combine the attention weights or Jacobian elements to assign parton labels. It may also be possible to combine approaches in the future, where a simpler model can be trained using the label information from a regression model.

Software

The code for this project is built on the one from Ref. [13]. Updated software that also produces the gradients and makes the figures in this paper can be found at https://github.com/hep-lbdl/Covariant-Particle-Transformer.

Acknowledgments

BN thanks Chase Shimmin for useful discussions. This work is supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. H.W.’s work is partly supported by the U.S. National Science Foundation under the Award No. 2046280.

References

Aaboud et al. [2018] M. Aaboud et al. (ATLAS), Search for the standard model Higgs boson produced in association with top quarks and decaying into a $b\bar{b}$ pair in $pp$ collisions at $\sqrt{s}$ = 13 TeV with the ATLAS detector, Phys. Rev. D 97, 072016 (2018), arXiv:1712.08895 [hep-ex] .
Sirunyan et al. [2020] A. M. Sirunyan et al. (CMS), Measurement of the $\mathrm{t\bar{t}}\mathrm{b\bar{b}}$ production cross section in the all-jet final state in pp collisions at $\sqrt{s}=$ 13 TeV, Phys. Lett. B 803, 135285 (2020), arXiv:1909.05306 [hep-ex] .
Aad et al. [2020] G. Aad et al. (ATLAS), $CP$ Properties of Higgs Boson Interactions with Top Quarks in the $t\bar{t}H$ and $tH$ Processes Using $H\rightarrow\gamma\gamma$ with the ATLAS Detector, Phys. Rev. Lett. 125, 061802 (2020), arXiv:2004.04545 [hep-ex] .
Erdmann et al. [2019] J. Erdmann, T. Kallage, K. Kröninger, and O. Nackenhorst, From the bottom to the top—reconstruction of $t\bar{t}$ events with deep learning, JINST 14 (11), P11015, arXiv:1907.11181 [hep-ex] .
Badea et al. [2022] A. Badea, W. J. Fawcett, J. Huth, T. J. Khoo, R. Poggi, and L. Lee, Solving Combinatorial Problems at Particle Colliders Using Machine Learning, (2022), arXiv:2201.02205 [hep-ph] .
Fenton et al. [2020] M. J. Fenton, A. Shmakov, T.-W. Ho, S.-C. Hsu, D. Whiteson, and P. Baldi, Permutationless Many-Jet Event Reconstruction with Symmetry Preserving Attention Networks, (2020), arXiv:2010.09206 [hep-ex] .
Lee et al. [2020] J. S. H. Lee, I. Park, I. J. Watson, and S. Yang, Zero-Permutation Jet-Parton Assignment using a Self-Attention Network, (2020), arXiv:2012.03542 [hep-ex] .
Shmakov et al. [2021] A. Shmakov, M. J. Fenton, T.-W. Ho, S.-C. Hsu, D. Whiteson, and P. Baldi, SPANet: Generalized Permutationless Set Assignment for Particle Physics using Symmetry Preserving Attention, (2021), arXiv:2106.03898 [hep-ex] .
Ehrke et al. [2023] L. Ehrke, J. A. Raine, K. Zoch, M. Guth, and T. Golling, Topological Reconstruction of Particle Physics Processes using Graph Neural Networks, (2023), arXiv:2303.13937 [hep-ph] .
Erdmann et al. [2014] J. Erdmann, S. Guindon, K. Kroeninger, B. Lemmer, O. Nackenhorst, A. Quadt, and P. Stolte, A likelihood-based reconstruction algorithm for top-quark pairs and the KLFitter framework, Nucl. Instrum. Meth. A 748, 18 (2014), arXiv:1312.5595 [hep-ex] .
Cacciari et al. [2008a] M. Cacciari, G. P. Salam, and G. Soyez, The anti- $k_{t}$ jet clustering algorithm, JHEP 04, 063, arXiv:0802.1189 [hep-ph] .
Cacciari et al. [2008b] M. Cacciari, G. P. Salam, and G. Soyez, The Catchment Area of Jets, JHEP 04, 005, arXiv:0802.1188 [hep-ph] .
Qiu et al. [2022] S. Qiu, S. Han, X. Ju, B. Nachman, and H. Wang, A Holistic Approach to Predicting Top Quark Kinematic Properties with the Covariant Particle Transformer, (2022), arXiv:2203.05687 [hep-ph] .
Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems (2017) p. 5998, arXiv:1706.03762 [cs.CL] .
Alwall et al. [2014] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli, and M. Zaro, The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations, JHEP 07, 079, arXiv:1405.0301 [hep-ph] .
Artoisenet et al. [2013] P. Artoisenet, R. Frederix, O. Mattelaer, and R. Rietkerk, Automatic spin-entangled decays of heavy resonances in Monte Carlo simulations, JHEP 03, 015, arXiv:1212.3460 [hep-ph] .
Sjöstrand et al. [2015] T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands, An introduction to PYTHIA 8.2, Comput. Phys. Commun. 191, 159 (2015), arXiv:1410.3012 [hep-ph] .
Cacciari et al. [2012] M. Cacciari, G. P. Salam, and G. Soyez, FastJet User Manual, Eur. Phys. J. C 72, 1896 (2012), arXiv:1111.6097 [hep-ph] .
Cacciari and Salam [2006] M. Cacciari and G. P. Salam, Dispelling the $N^{3}$ myth for the $k_{t}$ jet-finder, Phys. Lett. B 641, 57 (2006), arXiv:hep-ph/0512210 .
Aad et al. [2019] G. Aad et al. (ATLAS), ATLAS b-jet identification performance and efficiency measurement with $t{\bar{t}}$ events in pp collisions at $\sqrt{s}=13$ TeV, Eur. Phys. J. C 79, 970 (2019), arXiv:1907.05120 [hep-ex] .
Sirunyan et al. [2018] A. M. Sirunyan et al. (CMS), Identification of heavy-flavour jets with the CMS detector in pp collisions at 13 TeV, JINST 13 (05), P05011, arXiv:1712.07158 [physics.ins-det] .
Particle Data Group [2020] Particle Data Group, Review of Particle Physics, Progress of Theoretical and Experimental Physics 2020, 083C01 (2020).
