
Masked Particle Modeling on Sets:
Towards Self-Supervised High Energy Physics Foundation Models

Tobias Golling University of Geneva    Lukas Heinrich Technical University of Munich    Michael Kagan SLAC National Accelerator Laboratory    Samuel Klein University of Geneva    Matthew Leigh University of Geneva    Margarita Osadchy University of Haifa    John Andrew Raine University of Geneva
Abstract

We propose masked particle modeling (MPM) as a self-supervised method for learning generic, transferable, and reusable representations on unordered sets of inputs for use in high energy physics (HEP) scientific data. This work provides a novel scheme to perform masked modeling based pre-training to learn permutation invariant functions on sets. More generally, this work provides a step towards building large foundation models for HEP that can be generically pre-trained with self-supervised learning and later fine-tuned for a variety of down-stream tasks. In MPM, particles in a set are masked and the training objective is to recover their identity, as defined by a discretized token representation of a pre-trained vector quantized variational autoencoder. We study the efficacy of the method in samples of high energy jets at collider physics experiments, including studies on the impact of discretization, permutation invariance, and ordering. We also study the fine-tuning capability of the model, showing that it can be adapted to tasks such as supervised and weakly supervised jet classification, and that the model can transfer efficiently with small fine-tuning data sets to new classes and new data domains.


I Introduction

While Artificial Intelligence (AI) and Machine Learning (ML) are already playing a major role in the analysis of high energy physics (HEP) data, the HEP community has yet to benefit from the self-supervised learning (SSL) based approaches to building large foundation models (FM) Bommasani et al. (2022) that have been pioneered in natural language processing (NLP) Lewis et al. (2019); Devlin et al. (2019); OpenAI (2023); Brown et al. (2020) and computer vision (CV) Dosovitskiy et al. (2021); Caron et al. (2021); Bao et al. (2022). FMs, as opposed to task specific ML models, are pre-trained in generic ways, such that they are useful for a range of downstream tasks. FMs often use SSL to pre-train models on vast data sets in order to learn generic representations of the data. Such models can then be efficiently fine-tuned with small datasets for a variety of downstream tasks. The self-supervised pre-training of a FM produces a model that is also referred to as the “backbone”, as it can serve as the information extraction component for downstream models. This concept significantly expands the possibilities for learning robust and meaningful data representations. The knowledge encoded in this way can be readily applied in downstream tasks which use this representation as input.

The FM approach with SSL pre-training offers important advantages for HEP. Unlike supervised learning, which typically acquires limited domain representations and focuses on a few key features for high prediction accuracy that must be learned anew for each task, SSL aims to learn generic representations summarizing domain features that prove useful across various downstream tasks. SSL tasks can be formulated on unlabeled data. In the HEP context, this may not only decrease the reliance on labeled simulated data sets but also potentially help mitigate uncertainties related to domain shift when training models on imperfect simulations. However, this approach also poses several major challenges for HEP. SSL strategies are data type specific, so new methods must be developed. These models also represent a scale in both model size and data size that has not been addressed in HEP. In this work, we aim to take the first steps towards building such a HEP foundation model, focusing on developing HEP data specific SSL strategies, understanding how these models perform when fine-tuned for various downstream tasks, and keeping an eye on how well such strategies may scale in the future. We propose a masked particle modeling (MPM) scheme, akin to masked language modeling (MLM) in NLP, for self-supervised learning on unlabeled data consisting of sets of particles in a collider physics environment. In doing so, we propose a novel scheme to apply masked modeling strategies to unordered sets of inputs.

Figure 1: The proposed model and training scheme for a FM for jets. A jet is represented as a set of particles, each a list of features. Some particles are replaced by a learnable vector before the set is passed through a transformer encoder. Training aims to predict the discrete token identity, defined by the encoder of a pre-trained VQ-VAE, of the masked particles. The MPM backbone is a transformer encoder in this work. The positional embedding shown in box A is used only in the prediction head to preserve the permutation invariance in the backbone; it is required in the model to break the degeneracy that arises when $m_1 = m_2$.

This work aims to generalize the language-inspired MLM-type training scheme to HEP scientific data, and to show that in the HEP context this can be an effective strategy for developing a useful FM capable of being fine-tuned for various down-stream tasks. The paradigm involves extracting semantic meaning and understanding of the whole by predicting the missing (masked) pieces, referred to as tokens, thereby considering the collective impact of individual input elements. Generic data sets, such as the ones in HEP, cannot be directly mapped to a sequence of pieces in analogy to words in a sentence as they are continuous and unordered, i.e. not a sequence. The challenges of using continuous elements, i.e. features of particles like momentum, instead of a discrete dictionary of words, and of operating on unordered sets of inputs will be addressed in the development of the MPM scheme. To do so, we examine MPM in collider physics data sets consisting of jets, or collimated streams of particles produced by high-energy quarks and gluons.

We provide a brief overview of related work in Section II. We describe the MPM method, including the token creation scheme, and the fine-tuning procedures in Section III. Experiments on different tokenization schemes, fine-tuning for jet classification including on unseen classes, on new datasets, and in a weakly supervised setting are presented in Section V.

II Related Work

Foundation models, such as Masked Language Models (BART Lewis et al. (2019), BERT Devlin et al. (2019)), Generative Pre-trained Transformer (GPT) OpenAI (2023); Brown et al. (2020), Vision Transformer (ViT) Dosovitskiy et al. (2021), DINO Caron et al. (2021) and their combinations, such as DALLE Ramesh et al. (2021), Flamingo Alayrac et al. (2022) and others have primarily been explored in the domains of language and vision. We refer readers to the recent review Bommasani et al. (2022) for an overview. Most closely related to this work is the BERT model Devlin et al. (2019), which uses the masking and prediction of missing words as a pre-training task, and the BEiT model Bao et al. (2022), which adapts the masked language modeling method to images by masking and predicting patches of input images. On masked modeling schemes for data which consists of unordered sets of inputs, the impact of removing positional information in masked image modeling was examined in Ref. Chen et al. (2021), and using position as a target when processing unordered image patches was explored in Ref. Zhai et al. (2022). The first steps in developing foundation models for science have been taken in e.g. protein biology Rives et al. (2021), molecular chemistry Ross et al. (2022); Pan (2023), and cosmology Lanusse et al. (2023); Walmsley et al. (2022), showing their ability to learn informative representations that are useful in these domains for various downstream tasks.

The first steps in self-supervised learning on jets were explored in Refs. Dillon et al. (2022, 2023); Tombs and Lester (2022), largely focusing on contrastive pre-training using augmentations of jets. Pre-training strategies through masking particle type information have also been explored in Ref. Kishimoto et al. (2023). Transformer models were trained on large jet datasets for classification in a supervised setting in Refs. Qu et al. (2022a); Mikuni and Canelli (2021) and several transformer-based applications have since been developed (for example, see Refs. Käch et al. (2022); Kansal et al. (2023); Fenton et al. (2022); ATLAS Collaboration (2023); Smith et al. (2023); Tomiya and Nagai (2023); Käch and Melzer-Pellmann (2023); Raine et al. (2023a)). Transformers have also been used for auto-regressive density estimation and jet classification Finke et al. (2023); Butter et al. (2023). Notably, Ref. Finke et al. (2023) also explored the discretization of continuous particle features to form jet sequences, which we examine in this work.

In parallel to the present effort on self-supervised foundation models, investigations are ongoing on the potential of supervised FMs in HEP by using physics-motivated pretext tasks followed by fine-tuning in a hierarchical setting Vigl et al. (2024).

III Overview of Methods

The proposed model and training scheme are summarized in Fig. 1. In line with the MLM framework employed by BERT Devlin et al. (2019), the MPM objective described in Section III.1 involves selecting a subset of particles within each jet to form the masked set. A predefined masking strategy is applied to this subset. The goal of MPM is to build a model capable of inferring the attributes of the original particles within the masked set, using information from all other particles present in the jet. As particles form unordered sets, in contrast to the sequential nature of sentences, we develop a masked prediction scheme which is applicable to unordered set-based data. An additional challenge stems from the continuous nature of particle features, in contrast to the discrete dictionary found in language models but similar to the challenges of masking image patches in CV. In Section III.2, we tackle this challenge by employing methods akin to those used in BEiT Bao et al. (2022). We discuss the fine-tuning of the pre-trained model to downstream tasks in Section III.3.

III.1 Masked Particle Modeling

The representation of a jet as an unordered set of particles, each characterized by an ordered collection of continuous features, lends itself to interpretation within the masked training paradigm reminiscent of MLMs. In this analogy, each particle within the jet can be viewed as a representation akin to a token in a block of text, despite the notable difference that particles form unordered sets, while text is sequential.

The MPM objective relies on the selection of a subset of particles within a jet. This involves removing the information associated with these particles and replacing it with a learnable mask. Subsequently, the goal is to predict a certain property for each of the originally masked particles. Modeling masked particles is anticipated to serve as a valuable pre-training task. This is because the resulting model is expected to understand properties of jets by acquiring the ability to correct or infer missing information through the analysis of unmasked particles.

We can consider jets with at most $N$ constituent particles to be described as a set $X=\{x_i\}_{i=1}^{N}\in\mathcal{X}$ where each $x_i$ is some representation of a particle and $\mathcal{X}$ is the set of possible jets. A dataset is then a collection of $K$ jets $\{X^j\}_{j=1}^{K}$. The MPM objective can be phrased as partitioning the set of particles in each jet $j$ into a masked set $\mathcal{M}_x^j=\{x_i^j\}_{i\in\mathcal{M}^j}$ over indices $\mathcal{M}^j$ of masked elements, and an unmasked set $\mathcal{U}_x^j=\{x_i^j\}_{i\in\mathcal{U}^j}$ over indices of non-masked elements. A masking strategy is used for mapping $\mathcal{M}_x^j\rightarrow\mathcal{M}_m^j=\{m_i^j\}_{i\in\mathcal{M}^j}$ where each $m_i^j$ is a learnable vector. For convenience, we drop the jet index $j$ as the procedure is repeated for all jets. The goal is to define a parametric function $f_\theta:\mathcal{X}\to\mathbb{R}^{N\times d}$, which maps a jet to a $d$-dimensional representation for each of the $N$ particles in the jet, and a loss $\mathcal{L}$ such that minimizing $\mathbb{E}_{\mathcal{X}}[\mathcal{L}(\mathcal{M}_x,f_\theta(\mathcal{M}_m,\mathcal{U}_x))]$ results in a function $f_\theta$ that is useful for downstream tasks. As the goal of the pre-training is to recover information about each masked particle, we use a per-particle loss function of the form

$$\mathbb{E}_{\mathcal{X}}\left[\frac{1}{\left|\mathcal{M}_m\right|}\sum_{i\in\mathcal{M}_m}\mathcal{L}\left(x_i,f_{\theta,i}(\mathcal{M}_m,\mathcal{U}_x)\right)\right],$$

where $f_{\theta,i}$ is the model output for the $i$th particle.
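The masking step and the per-particle loss can be illustrated with a minimal sketch in plain Python. Everything named here is an assumption made for illustration: the masking fraction of 0.3, the all-zero stand-in for the learnable mask vector, the four-token vocabulary, and the uniform logits that stand in for the output of $f_\theta$. The loss is taken to be a cross-entropy over token indices, which corresponds to the classification variant discussed later in this section.

```python
import math
import random

def mask_indices(n_particles, mask_fraction, rng):
    """Pick the index set M of particles to mask (at least one)."""
    n_mask = max(1, round(mask_fraction * n_particles))
    return sorted(rng.sample(range(n_particles), n_mask))

def apply_mask(jet, masked, mask_vector):
    """Replace every masked particle x_i by the mask vector m_i."""
    masked = set(masked)
    return [list(mask_vector) if i in masked else list(x)
            for i, x in enumerate(jet)]

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[target]

def mpm_loss(per_particle_logits, target_tokens, masked):
    """Average the per-particle loss over the masked set only."""
    total = sum(cross_entropy(per_particle_logits[i], target_tokens[i])
                for i in masked)
    return total / len(masked)

rng = random.Random(0)
# Toy jet: 6 particles with 3 continuous features each.
jet = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]
masked = mask_indices(len(jet), mask_fraction=0.3, rng=rng)  # hypothetical fraction
masked_jet = apply_mask(jet, masked, mask_vector=[0.0, 0.0, 0.0])

# Stand-in for f_theta: uniform logits over 4 hypothetical tokens per particle.
logits = [[0.0] * 4 for _ in masked_jet]
targets = [rng.randrange(4) for _ in jet]
loss = mpm_loss(logits, targets, masked)
```

With uniform logits the loss equals $\log 4$ regardless of the targets, which is the value an untrained model would be expected to start from in this toy setting.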

Unlike language models, which operate on a finite and discrete vocabulary of words, many of the features that describe particles are continuous, such as momentum, direction, distance of closest approach to the primary collision, etc. This distinction has implications in two aspects of the model. First, concerning the model input, one may choose to use the continuous features directly, following the approach typically employed in BEiT-type models. Alternatively, one could opt to discretize particle features, for example by binning along each feature dimension, and use the feature bin index as an input token for the model, as demonstrated in Ref. Finke et al. (2023). Second, at the model output, where the prediction of missing particle features occurs, one can either directly predict continuous features or, if particle features have been discretized, employ the discrete tokens as prediction targets. This second choice has implications for the pre-training loss function. Predicting continuous features generally involves a regression-type loss, leading to learning the mean of the feature posterior conditioned on the unmasked particle features. On the other hand, predicting discrete indices allows pre-training to be framed as a classification problem, where the model learns a categorical posterior distribution over indices. The ability to model the full posterior and any potential multi-modality in the distribution can be highly beneficial, potentially preventing the model from allocating resources to irrelevant and excessively detailed features. We compare both discrete and continuous inputs to the model, and regression and classification losses.
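The difference between the two output choices can be made concrete with a toy example. The sketch below assumes a single feature whose posterior has two equally likely modes at $-1$ and $+1$, and a hypothetical binning into four equal-width bins on $[-2, 2)$; neither choice is taken from this work. The constant that minimizes the mean squared error is the mean, which lies between the modes, whereas a categorical distribution over bin indices keeps both modes.

```python
from collections import Counter

def bin_index(value, low, high, n_bins):
    """Map a continuous value to a bin index in [0, n_bins - 1], clipping at the edges."""
    width = (high - low) / n_bins
    idx = int((value - low) // width)
    return min(max(idx, 0), n_bins - 1)

# Toy bimodal posterior for one masked feature: half the mass at -1, half at +1.
samples = [-1.0] * 50 + [1.0] * 50

# The constant prediction that minimizes the mean squared error is the sample mean.
mse_optimum = sum(samples) / len(samples)

# A classifier over bin indices can instead represent both modes.
counts = Counter(bin_index(s, low=-2.0, high=2.0, n_bins=4) for s in samples)
probabilities = {k: v / len(samples) for k, v in sorted(counts.items())}
```

Here the regression optimum is 0, a value that falls in a bin to which the categorical distribution assigns no probability.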

The particles that form a jet are permutation invariant, and so it is natural to use a backbone model that is permutation equivariant. However, if the same learnable vector is used to replace all of the particles in the masked set, then the output of the model will be identical at every masked position. In Fig. 1 this would mean that if $m_1 = m_2$ then $h_2 = h_4$, because permutation equivariant functions cannot distinguish between two input particles with the same values. This exposes a redundancy that can only be removed by inducing an ordering, or by adding attributed connections between particles. The redundancy makes it impossible to achieve zero error on any single prediction unless the original value of every particle in the masked set is identical. However, the redundancy does not make MPM fruitless, as learning the density over the masked set of particles is still a useful and difficult task.
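The degeneracy can be reproduced with any permutation equivariant map. The sketch below uses a deliberately simple stand-in for the backbone, $h_i = a\,x_i + b\cdot\mathrm{mean}_k(x_k)$, with arbitrary coefficients; it is not the transformer used in this work, but it shares the relevant property.

```python
def equivariant_layer(particles, a=0.7, b=0.3):
    """Toy permutation-equivariant map: h_i = a * x_i + b * mean_k(x_k)."""
    n = len(particles)
    dim = len(particles[0])
    mean = [sum(p[d] for p in particles) / n for d in range(dim)]
    return [[a * p[d] + b * mean[d] for d in range(dim)] for p in particles]

mask = [0.0, 0.0]  # the same vector is placed at both masked positions
jet = [[1.0, 2.0], mask, [3.0, -1.0], mask]
h = equivariant_layer(jet)

# Permuting the inputs permutes the outputs in the same way.
order = [2, 0, 3, 1]
h_permuted = equivariant_layer([jet[i] for i in order])
```

The outputs at the two masked positions, `h[1]` and `h[3]`, are identical, so no prediction head acting on them independently can assign them different targets.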

Two strategies are examined to address the redundancy that arises from the permutation invariance of MPM: (a) Use the transverse momentum $p_T$ to order each particle in the jet at the input to the backbone. As this breaks the permutation invariance of the backbone for all downstream tasks, it may have adverse effects on predictive performance. (b) Employ the $p_T$ to order the particles in a jet at the input to the masked prediction head in Fig. 1. This head is not used for downstream tasks, ensuring that this approach does not break the permutation invariance of the backbone. Both these approaches are compared with retaining the redundancy (i.e., no ordering).
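Strategy (b) can be sketched as follows, assuming that the ordering information enters as an embedding vector indexed by the $p_T$ rank and added to each backbone output before the prediction head. The embedding table below is a fixed, hypothetical placeholder; in training its entries would be learned.

```python
def pt_rank(pts):
    """Rank of each particle when sorted by decreasing pT (0 = highest pT)."""
    order = sorted(range(len(pts)), key=lambda i: -pts[i])
    ranks = [0] * len(pts)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

def add_positional_embedding(hidden, ranks, table):
    """Add the embedding vector for each particle's rank to its representation."""
    return [[h + e for h, e in zip(vec, table[r])]
            for vec, r in zip(hidden, ranks)]

pts = [12.0, 85.0, 3.5, 40.0]
ranks = pt_rank(pts)

# Backbone outputs; positions 1 and 3 were masked and are therefore identical.
hidden = [[0.2, 0.1], [0.5, 0.5], [0.9, -0.3], [0.5, 0.5]]

# Hypothetical embedding table with one vector per rank.
table = [[0.1 * r, -0.1 * r] for r in range(len(pts))]
head_input = add_positional_embedding(hidden, ranks, table)
```

After the embedding is added, the two masked positions carry different vectors, so the head can make distinct predictions while the backbone itself never sees the ordering.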

III.2 Making Tokens From Particles

To evaluate the impact of employing a discretized set of tokens for model input or property prediction, a suitable tokenization scheme is needed. As noted earlier, the simplest scheme involves binning each feature into a finite set of feature ranges. While binning features can be used to define a set of labels for classification targets, this method falls short in incorporating contextual information, such as information about particle features in relation to the features of other particles within the jet.

A scheme for defining context dependent tokens from continuous inputs was developed in the masked image modeling approach of the BEiT model Bao et al. (2022). In this case, labels of different image patches were defined using a Vector Quantized Variational AutoEncoder (VQ-VAE) Oord et al. (2017). A VQ-VAE uses an encoder to map a set of inputs to latent vectors, which are subsequently projected onto the nearest element within a finite codebook. These codebook vectors are then decoded back to the original inputs. The codebook vectors are trained simultaneously with the encoder and decoder. The use of a transformer encoder and decoder ensures that information from all input elements is used to define the predictions. This process incorporates input-wide context into the codebook prediction for each element within the input set. In MPM, each particle is encoded to a single codebook element, where the encoding is performed conditionally on all other particles in the jet. The index of the codebook element to which each particle is mapped is then used as the target label during pre-training. The VQ-VAE model is only used during pre-training.
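The quantization step, which turns an encoder output into a token index, amounts to a nearest-neighbour lookup in the codebook. The sketch below shows only this lookup, with a four-entry toy codebook and hand-written latent vectors standing in for encoder outputs; the encoder, decoder, and codebook training are omitted.

```python
def nearest_code(latent, codebook):
    """Index of the codebook vector closest to `latent` in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))

# Toy codebook with 4 entries in a 2-dimensional latent space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Stand-ins for per-particle encoder outputs.
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = [nearest_code(z, codebook) for z in latents]   # pre-training target labels
quantized = [codebook[t] for t in tokens]               # vectors passed to the decoder
```

The list `tokens` plays the role of the per-particle target labels, while `quantized` is what a decoder would receive.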

We also explore using the K-means algorithm MacQueen et al. (1967) to define labels for the pre-training. We use the K-means++ Arthur and Vassilvitskii (2007) algorithm as implemented in the scikit-learn library Buitinck et al. (2013) to define the clusters. After training, each cluster is assigned an index, and the index is used as the target label. This approach is explored as a replacement for the VQ-VAE defined labels. Unlike the VQ-VAE labelling, the K-means approach is context independent. Further studies into the differences between the VQ-VAE and K-means would be beneficial.
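For illustration, the way cluster indices become labels can be shown with a dependency-free version of Lloyd's iterations. This is a simplified stand-in: it starts from hand-picked centroids rather than the K-means++ seeding used in this work, and it operates on six toy points rather than particle features.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]

def lloyd_kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations from given initial centroids (no K-means++ seeding)."""
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append([sum(c) / len(members) for c in zip(*members)])
            else:
                updated.append(list(old))  # keep the centroid of an empty cluster
        centroids = updated
    return centroids, assign(points, centroids)

points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, labels = lloyd_kmeans(points, centroids=[[0.0, 0.0], [1.0, 1.0]])
```

Each label depends only on the features of the particle itself, which is the sense in which the K-means tokens are context independent. In practice one would more likely call `sklearn.cluster.KMeans` with `init="k-means++"` than hand-roll the iterations.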

III.3 Fine-Tuning

Evaluating the utility of a pre-trained backbone is a non-trivial task, since a suitable set of downstream tasks for benchmarking the performance of the pre-trained models is required. In this paper we focus on jet classification in different contexts. The task of jet classification is performed by applying weighted pooling to the output of the backbone followed by a linear layer to map to the same number of dimensions as there are jet classes.

To assess the impact of the pre-training and fine-tuning we explore three strategies to train a model for a downstream task. The first is referred to as “fixed backbone”, where the pre-trained backbone is frozen during fine-tuning and only a linear classification head is updated. This tests the power of the representation that is learned during pre-training, specifically the linear separability of different classes in this representation. The second is referred to as “fine-tuned”, where both a linear classification head and the backbone model itself are fit to the downstream task. This tests the utility of the pre-training task for defining an initialization of the function parameterized by our model. The third is referred to as “from scratch”, where the backbone model is reinitialized with new weights and fit directly on the downstream task, i.e. standard supervised learning. This third approach provides a performance benchmark for a given model and training strategy on a given downstream task without any influence from the pre-training.
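The three strategies differ only in how the backbone is initialized and whether it is updated. A compact way to express this, using hypothetical names that are not taken from the released code, is a small configuration table:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingStrategy:
    name: str
    load_pretrained_backbone: bool  # start from the MPM pre-trained weights?
    train_backbone: bool            # update backbone weights on the downstream task?
    train_head: bool = True         # the classification head is trained in all cases

STRATEGIES = {
    "fixed backbone": TrainingStrategy("fixed backbone", True, False),
    "fine-tuned": TrainingStrategy("fine-tuned", True, True),
    "from scratch": TrainingStrategy("from scratch", False, True),
}

def trainable_groups(strategy):
    """Names of the parameter groups an optimizer would update."""
    groups = ["head"] if strategy.train_head else []
    if strategy.train_backbone:
        groups.append("backbone")
    return groups
```

For example, `trainable_groups(STRATEGIES["fixed backbone"])` yields only the head, while the other two strategies update both groups and differ solely in the backbone initialization.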

IV Data Sets

We use the JetClass dataset Qu et al. (2022b) for all pre-training tasks. The dataset contains 100 million training samples from ten different classes with an equal number of samples for every class. Each class in the JetClass dataset represents jets resulting from a specific decay chain involving different particles. For example, the class labelled $H\rightarrow b\overline{b}$ includes jets which result from the decay of a Higgs boson into a $b$ and anti-$b$ quark before they hadronize and decay into the many particles captured by the detector. Other classes include jets initiated by quarks $q$, gluons $g$, top quarks $t$, and the $W$ or $Z$ vector bosons. The different processes are also distinguished by whether the decay chain contains a lepton $\ell$, bottom $b$ quark or charm $c$ quark. In this work, the momentum and direction (azimuth $\phi$ and pseudo-rapidity $\eta$) are used as the feature collection for each particle.

The RODEM dataset comprises additional independent samples of top quark initiated jets and jets arising from gluons and quarks (QCD). It should be noted that in this dataset no distinction is made between quark and gluon initiated QCD jets, unlike in the JetClass dataset. Similar to JetClass, both samples are generated with MadGraph Alwall et al. interfaced to Pythia8 Sjöstrand et al.; in these samples, however, MadSpin Artoisenet et al. is used to perform the decay of top quarks and $W$ bosons. Another difference with respect to the JetClass dataset is in the detector simulation. Here, the Delphes de Favereau et al. (2014) detector simulation is performed with a parameterization similar to the ATLAS detector, rather than the CMS detector parameterization used in JetClass. Furthermore, the anti-$k_t$ jet clustering Cacciari et al. is performed with a radius parameter of 1.0 rather than 0.8, and the selected jets fall in a slightly wider range of transverse momentum, spanning 450 GeV to 1.2 TeV, and an increased pseudorapidity range of $|\eta|<2.5$. Ten million jets of each class are available, with particles represented only by their four-momenta and assumed massless.
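Since the particles here are given as massless four-momenta while the features used elsewhere are transverse momentum, pseudorapidity and azimuth, it may help to recall the standard conversion between the two. The sketch below uses the usual collider conventions, $p_T=\sqrt{p_x^2+p_y^2}$, $\phi=\mathrm{atan2}(p_y,p_x)$, $\eta=\mathrm{asinh}(p_z/p_T)$, and assumes $p_T>0$; the function names are illustrative and not taken from the released code.

```python
import math

def to_pt_eta_phi(px, py, pz):
    """Cartesian momentum components -> (pT, eta, phi); assumes pT > 0."""
    pt = math.hypot(px, py)
    phi = math.atan2(py, px)
    eta = math.asinh(pz / pt)
    return pt, eta, phi

def to_four_momentum(pt, eta, phi):
    """(pT, eta, phi) of a massless particle -> (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = pt * math.cosh(eta)  # massless: E equals the momentum magnitude
    return energy, px, py, pz

pt, eta, phi = to_pt_eta_phi(3.0, 4.0, 0.0)
energy, px, py, pz = to_four_momentum(pt, eta, phi)
```

For a massless particle the energy carries no extra information, so three numbers per particle suffice.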

V Experiments

Experiments are designed both to test and define model design strategies and to explore the performance of the models on downstream tasks after fine-tuning. The “design strategy” experiments aim to understand the impact of tokenization on both the inputs and targets, to examine the quality of tokens learned by the VQ-VAE, and to test the impact of ordering the inputs or intermediate representations in the FM. We use a wide array of downstream tasks to provide multiple indicators of the utility of the pre-trained models. This includes (a) in-context prediction: making predictions when using the same data set and the same classes seen in pre-training, (b) out-of-context prediction: making predictions when using the same data set as in pre-training but on classes not seen in pre-training, and (c) out-of-domain prediction: making predictions using a different dataset to the pre-training, which potentially includes a distribution shift. We examine these fine-tuned performance metrics as a function of the size of the fine-tuning labelled dataset to understand how pre-training may reduce the quantity of labelled data needed for fine-tuning.

The same transformer encoder architecture is used for all backbone models. We use transformer blocks based on the Normformer Shleifer et al. (2021). Eight transformer blocks are used with a model dimension of 1024, the input nodes are embedded linearly into a 512 dimensional space, and the model contains a total of 40 million parameters. The pre-training head is a single transformer block followed by a softmax for classification or a linear layer for regression. The fine-tuning classification head is a weighted average over all output dimensions (i.e. a linear transform applied independently to each particle representation and then averaged) followed by another linear layer and softmax. Models are trained with AdamW Kingma and Ba (2014); Loshchilov and Hutter (2017) with a learning rate of $10^{-4}$ and weight decay of 0.01. All pre-trained models are trained for five epochs on the full JetClass data set. For supervised training, all models are trained for up to 50 epochs with early stopping and a patience of five epochs. The norm of the gradient vectors is clipped to five. The from scratch supervised models use the same architecture as the pre-trained models: eight transformer blocks followed by a weighted average and softmax for classification. The model architecture is similar to that used by ParT Qu et al. (2022a), which is state of the art. The ParT model uses more features than we use in this work, and so achieves higher accuracy. More details can be found in App. A.

All experiments are repeated five times with different random seeds. The uncertainty in results is given by the standard deviation across all runs, with the mean indicating the average behaviour. The uncertainty that comes from pre-training models is not included, i.e. models are pre-trained only once.
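As a small worked example of this summary, the snippet below computes the mean and standard deviation over five runs. The accuracy values are made up purely for illustration and are not results from this work; the use of the sample standard deviation (dividing by $n-1$) is also an assumption, since the text does not specify which estimator is used.

```python
import statistics

def summarize(accuracies):
    """Return (mean, sample standard deviation) across repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Hypothetical accuracies from five random seeds (illustrative values only).
runs = [0.701, 0.699, 0.700, 0.702, 0.698]
mean, std = summarize(runs)
```

Because each backbone is pre-trained only once, a spread computed this way reflects the variation from fine-tuning alone.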

To define target labels and for input quantization, 512 possible token values are used. This corresponds to 512 vectors in the VQ-VAE codebook, or to 512 clusters for the K-means.

All the code used to run these experiments is publicly available at github.com/rodem-hep/mpm.

V.1 Token Creation with a VQ-VAE

One challenge with training VQ-VAEs is the possibility of codebook collapse, where only a few of the codebook vectors are used to encode particle features. In the case of collapse to a single vector, the model would effectively be learning an average representation, which would not aid pre-training. To address this, the prescriptions outlined in Ref. Huh et al. (2023) are applied as described in App. B. After training we find that the codebook has $\sim 80\%$ utilization. The quantized vectors, defined by their index or codebook vector, will be referred to as tokens.
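A simple way to quantify collapse is sketched below, assuming that utilization means the fraction of codebook entries that are assigned to at least one particle in a reference sample. The text does not spell out the exact definition, so this should be read as one plausible choice.

```python
def codebook_utilization(token_indices, codebook_size):
    """Fraction of codebook entries that appear at least once among the tokens."""
    return len(set(token_indices)) / codebook_size

# Toy example: a 4-entry codebook in which entry 2 is never used.
tokens = [0, 3, 3, 1, 0, 3, 1, 1]
utilization = codebook_utilization(tokens, codebook_size=4)
```

A fully collapsed codebook would give a utilization of one over the codebook size, while a value close to one indicates that nearly all entries are in use.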

To ensure that the latent space codes are sufficiently capturing the salient information about the particles in a jet, we examine the quality of the decoded jet properties after encoding and quantization in the VQ-VAE. Successful reconstruction of the input jet would indicate a strong performance of the encoding. In Fig. 2 we show that the distributions of decoded jet transverse momentum and mass are in good agreement with the input distributions. One may note the apparent overlap in goals between the VQ-VAE and the MPM model; however, VQ-VAEs can be seen as primarily learning a data compression Oord et al. (2017), while masked particle models learn to understand a jet by inferring missing information.
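The jet-level quantities compared in this check follow from summing the constituent four-momenta. The sketch below assumes massless constituents described by $(p_T, \eta, \phi)$ and computes the jet $p_T$ and the invariant mass $m=\sqrt{E^2-|\vec{p}|^2}$; applying it to the original and to the decoded constituents would give the two sets of values to compare. The function name is illustrative.

```python
import math

def jet_pt_and_mass(constituents):
    """Jet pT and invariant mass from massless constituents given as (pT, eta, phi)."""
    energy = px = py = pz = 0.0
    for pt, eta, phi in constituents:
        px += pt * math.cos(phi)
        py += pt * math.sin(phi)
        pz += pt * math.sinh(eta)
        energy += pt * math.cosh(eta)  # massless: E equals the momentum magnitude
    jet_pt = math.hypot(px, py)
    mass_sq = energy * energy - px * px - py * py - pz * pz
    # Rounding can push mass_sq slightly below zero for nearly massless jets.
    return jet_pt, math.sqrt(max(mass_sq, 0.0))

# Two unit-pT particles at eta = 0, separated by 90 degrees in azimuth.
jet_pt, jet_mass = jet_pt_and_mass([(1.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2)])
```

In this example both the jet $p_T$ and the jet mass equal $\sqrt{2}$, while a jet with a single massless constituent has zero mass up to rounding.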

Figure 2: The reconstruction of the jet mass and transverse momentum using the decoded output of the VQ-VAE trained to quantize the particles in a jet.

V.2 Discretization and Ordering

In this section we explore different choices that need to be made when pre-training the model. We study the impact of ordering, input quantization, and output quantization / loss function choice. Backbone models are pre-trained under these different settings. The possible ordering settings included are: not ordering anywhere in the model, ordering the input to the backbone, and ordering only at the input to the pre-training prediction head. Ordering is defined by decreasing particle transverse momentum, and the index within the ordered sequence is embedded using learned positional embeddings Gehring et al. (2017); Vaswani et al. (2017). A learned positional embedding assigns a learnable vector to each index value that is then added to the features (and thus has the same dimension as the features per element) and updated with the rest of the model's parameters. Depending on the model under study, a learned positional embedding is added to the backbone input features in box B of Fig. 1 or the prediction head input features as shown in box A of Fig. 1. Continuous inputs are used in all models except one, which tests the impact of quantizing particle feature inputs. For this input-quantized case, the codebook vectors assigned by the VQ-VAE are used to replace the input vectors from the data set in box B of Fig. 1. Finally, the classification loss using VQ-VAE encoded tokens as targets, the classification loss using K-means encoded tokens as targets, and the direct particle feature regression are tested.

The different settings are compared by using the classification accuracy over the ten classes in a hold-out validation JetClass dataset of one million jets. Classification is performed with a linear classification head on top of the backbone output representations. Here the backbone is fixed and accuracy is compared after 20 epochs of training the linear classification head, by which point the loss functions are observed to change only slowly upon further training.

The results of the comparison can be seen in Tab. 1, where five different random seeds were used during the fine-tuning stage, but with only a single backbone trained in each case. The mean result is shown in the table, with a standard deviation of O(0.01%) for each result. A clear benefit to ordering only in the model head is observed over ordering the backbone inputs or not ordering anywhere in the model. In this case, ordering the prediction head enables the symmetry over masked elements to be broken and thus allows more precise per-particle predictions, rather than a joint posterior over all masked particles. Input quantization is observed to hurt model performance, owing to the loss of information when summarizing continuous features with discrete tokens. Using tokens, either from the VQ-VAE or the K-means, as classification targets is found to be substantially better than regressing particle features as a prediction task. This is likely due to the regression giving only a point prediction of average particle properties under the posterior, which provides little predictive power for posteriors with multiple modes. The classification loss is much more flexible and able to make use of the prediction of the full posterior over tokens. The VQ-VAE tokens outperformed the K-means tokens for classification, likely because the VQ-VAE is a more complex model able to capture more subtle details. Nonetheless, the K-means tokens for classification work reasonably well, and may be a useful avenue to explore further in future work due to the relative ease of training the K-means in comparison to the VQ-VAE.
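The argument about point predictions under a multi-modal posterior can be made concrete with a toy example. The values below are invented and the mean squared error is assumed as the regression loss: when a masked feature is equally likely to be -1 or +1, the error-minimizing constant is the mean, which coincides with neither mode, whereas a distribution over two tokens retains both.

```python
from collections import Counter

# Toy posterior for one masked particle feature: two equally likely modes.
samples = [-1.0] * 50 + [1.0] * 50

def mse(prediction, values):
    """Mean squared error of a constant prediction."""
    return sum((v - prediction) ** 2 for v in values) / len(values)

# The constant that minimises the mean squared error is the sample mean.
mean_prediction = sum(samples) / len(samples)

# A classifier over two tokens (one per mode) can output the full distribution.
counts = Counter(samples)
token_probabilities = {value: n / len(samples) for value, n in counts.items()}

print(mean_prediction, mse(mean_prediction, samples))  # 0.0 1.0
print(token_probabilities)                             # {-1.0: 0.5, 1.0: 0.5}
```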

Following these results, all subsequent tests use a backbone with continuous inputs, ordering in the prediction head for pre-training, and a classification loss using VQ-VAE tokens as targets in the pre-training. Unless explicitly stated otherwise, the same backbone is used for all tests; this backbone is trained on the full JetClass training set using the VQ-VAE provided labels. We note that using a head other than a linear one could result in a different model selection. However, using a linear head provides the benefit of focusing our examination only on the learning in the backbone, as is often done in backbone training exploration, and enables us to select model backbones that perform better than training from scratch.

Table 1: Linear probe into the performance of different pre-training strategies. For each pre-training strategy the accuracy on the ten classes of JetClass is reported. The accuracy is calculated over five different runs of the fine-tuning, with errors on the order of 0.01%.

Ordering         Inputs       Loss                     Accuracy
no ordering      continuous   VQ-VAE classification    54.1%
order head       continuous   VQ-VAE classification    56.8% (best)
order backbone   continuous   VQ-VAE classification    53.4%
order head       quantized    VQ-VAE classification    51.1%
order head       quantized    K-means classification   49.3%
order head       continuous   K-means classification   56.2%
order head       continuous   regression               48.9%
order backbone   continuous   regression               46.3%

V.3 Fine-Tuning for Jet Classification

The first test of the utility of the backbone model for downstream tasks is on in-context data, where we examine the accuracy of ten-class classification on a test sample from JetClass. The labelled dataset size, used for fine-tuning the pre-trained models and for training the fully supervised model, is varied to examine how much performance is gained by pre-training. In Fig. 3, we can see that at small labelled data set sizes of fewer than 10k jets, there is a large performance benefit to using pre-training over the from scratch supervised model. This indicates the utility of the representation learned during pre-training for downstream tasks. With a large enough labelled data set size, the supervised model outperforms the fixed backbone model and converges to the performance of the fine-tuned backbone model. This is expected, as supervised learning on a sufficiently large labelled data set should provide enough information for model optimization without pre-training. In essence, the pre-training provides an excellent set of initial weights for fine-tuning on downstream tasks.

Figure 3: Accuracy of different training strategies as a function of the number of labelled training samples. Accuracy is calculated on the ten classes in the JetClass dataset. The average and standard deviation over 5 trainings is shown in solid lines and uncertainty bands, respectively. Models with frozen pretrained backbone weights during fine-tuning are “Fixed”, and those with updated weights are “Fine-tuned”.

V.4 Fine-Tuning for Jet Classification on New Classes

To test whether the pre-trained model learns features that are generically useful for downstream tasks, or only those which are useful for the classes that are seen in the pre-training set, we perform an out-of-context test where we pre-train on a subset of the classes provided in JetClass and test the fine-tuning on the remaining classes. Specifically, the pre-training is performed on the six Higgs and QCD classes, and then fine-tuned on the remaining four classes. The four remaining classes are top quarks decaying with and without a lepton, and W and Z decays, each as a distinct class. The results shown in Fig. 4 indicate that both pre-trained models, with fixed and with fine-tuned backbone, outperform the fully supervised training when only a limited amount of data is available. This indicates that even with small amounts of labelled data, the representations learned during pre-training are generically useful for out-of-context classes. As also expected, with enough labelled training data, the fully supervised model can surpass the fixed backbone model and converge to a similar performance to the fine-tuned model. In essence, the additional labelled data allows both the fine-tuned backbone model and the supervised model to adapt the representations to the specific downstream tasks.

Figure 4: Accuracy of different training strategies as a function of labelled training samples. Accuracy is calculated on four classes, held out from all pre-trained models, in the JetClass dataset. The average and standard deviation over 5 trainings is shown in solid lines and uncertainty bands, respectively. Models with frozen pretrained backbone weights during fine-tuning are “Fixed”, and those with updated weights are “Fine-tuned”.

V.5 Fine-Tuning for Jet Classification on New Data Sets

We also test how well the backbone representation works out-of-domain on a different dataset of jets and how well the model can adapt to this new domain with fine-tuning. In this case, pre-training is performed with one dataset, JetClass, and fine-tuning is performed with a different data set, RODEM. More abstractly, this is a test of how well such a pre-training and fine-tuning strategy with mixed data sets may help address the domain shift challenge in HEP. This challenge comes from the small, but potentially significant, differences between simulated data and real experimental data: models trained on simulated data perform differently on simulated and real data, which causes systematic uncertainties. As such, this test is in analogy with the idea that one may want to pre-train on real data, and only fine-tune on small simulated data sets, and we hope to explore the applicability of such an approach. With the MPM scheme, we can pre-train directly on unlabelled data, i.e. we can pre-train directly on real experimental data for representation learning. Thus we can also explore how well the features learned from data may maintain predictive power even after fine-tuning on simulation, represented here by the second labelled data set used for fine-tuning. In general, training primarily on real data has many potential benefits. It promises a reduced sensitivity to simulation-related domain shifts and calibration effects, leading to an overall simplified calibration procedure. An overall reduced simulation budget may also help alleviate the growing compute and storage limitations of the LHC experiments.

To perform these tests, we examine the QCD background rejection (one divided by the false positive rate) at 50% top-jet signal efficiency (i.e. true positive rate) as a function of the labelled data set size. Fig. 5 shows the rejection for the fine-tuning RODEM data set on the left, and the rejection performance in the JetClass data set on the right. Note that the x-axis in both cases is the size of the RODEM labelled data set used for fine-tuning.
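The rejection metric defined above can be computed as in the following minimal sketch. The classifier scores are made up, the function name `rejection_at_efficiency` is hypothetical, and the threshold convention (keeping jets with a score at or above the cut) is an assumption rather than the procedure used to produce Fig. 5.

```python
def rejection_at_efficiency(signal_scores, background_scores, efficiency=0.5):
    """Background rejection (1 / false positive rate) at a target signal efficiency."""
    ranked = sorted(signal_scores, reverse=True)
    n_keep = max(1, round(efficiency * len(ranked)))
    threshold = ranked[n_keep - 1]  # lowest score among the kept signal jets
    n_false_pos = sum(score >= threshold for score in background_scores)
    if n_false_pos == 0:
        return float("inf")  # no background jet passes the threshold
    return len(background_scores) / n_false_pos

# Made-up classifier outputs for true top jets and true QCD jets.
top_scores = [0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10]
qcd_scores = [0.85, 0.60, 0.50, 0.35, 0.30, 0.25, 0.15, 0.10, 0.05, 0.02]

rejection = rejection_at_efficiency(top_scores, qcd_scores)
print(rejection)  # 10.0
```

With tied scores the realized signal efficiency can exceed the target, so a production implementation would typically interpolate along the ROC curve instead.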

The pre-trained models demonstrate excellent data efficiency and performance when transferred to the RODEM dataset. The fine-tuned model outperforms the supervised model for all labelled data set sizes, as does the fixed backbone up to labelled data sets of O(50k) events. These results further support the hypothesis that a pre-trained model learns generic and useful data representations that are not limited to the specific data set. This result further suggests that models can be pre-trained on real data and fine-tuned in simulation while still maintaining reasonable performance on the original real data. Intriguingly, neither the fine-tuned nor the from-scratch supervised model is able to outperform the fixed backbone model on the JetClass, i.e. “real”, data set after fine-tuning on the RODEM data set. This suggests that the fixed features, pre-trained on JetClass, are the most powerful for downstream tasks on the JetClass data set, while the additional representational information learned on the labelled data set provides some, but limited, additional performance when applying the model on the real data. This offers a potential avenue for mitigating domain shifts in the labelled data by improving the representation learning at the pre-training stage with larger pre-training data set sizes. We leave such tests for future work.

Figure 5: The QCD rejection at 50% top-jet efficiency evaluated on (left) the RODEM test set and (right) the JetClass test set, as a function of the size of the RODEM data set used for fine-tuning. All models are pre-trained on JetClass. The average and standard deviation of rejection over 5 trainings is shown in solid lines and uncertainty bands, respectively. Models with frozen pretrained backbone weights during fine-tuning are “Fixed”, and those with updated weights are “Fine-tuned”.

V.6 Weakly supervised classification

Fine-tuning tasks are typically of a supervised nature and require labels. However, physics knowledge can be exploited to enrich data in certain classes and provide training data for fine-tuning tasks with so-called noisy labels Metodiev et al. (2017). These could replace or complement fine-tuning with simulated data by leveraging pure labels in simulation with noisy labels in data.

This idea can be emulated by taking two samples of one class (QCD jets), each with one million events, and adding N samples of another class, considered signal (top-quark initiated jets, or top jets), to one of the datasets. The task is then to train a supervised classifier to discriminate between these two datasets and to evaluate the resulting classifier's performance on separating pure datasets of each class (QCD vs top jets). In Fig. 6, we show the significance improvement, defined as the ratio of significance before and after applying a threshold of 0.5 on the classifier output. The significance is defined as the number of signal class events divided by the square root of the number of background class events that pass a given threshold. Note that ground truth labels are used in the evaluation metric, but not in the fine-tuning procedure. Any value of this metric below 2 is not considered to be particularly useful. We can see that the pre-trained backbone is highly useful for this task, significantly improving on the performance of the model that is trained from scratch, even when only the linear head is fine-tuned.
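A minimal sketch of the significance improvement follows. It assumes the ratio is taken as the significance after the cut divided by the significance before it (so that larger values are better), and that a jet passes when its score is strictly above the threshold; the scores and the function name `significance_improvement` are invented for illustration.

```python
import math

def significance_improvement(signal_scores, background_scores, threshold=0.5):
    """Ratio of S / sqrt(B) after the cut to S / sqrt(B) before the cut."""
    s_before, b_before = len(signal_scores), len(background_scores)
    s_after = sum(score > threshold for score in signal_scores)
    b_after = sum(score > threshold for score in background_scores)
    if b_after == 0:
        return float("inf")  # no background survives the cut
    before = s_before / math.sqrt(b_before)
    after = s_after / math.sqrt(b_after)
    return after / before

# Made-up classifier outputs, evaluated with ground truth labels.
top_scores = [0.9, 0.8, 0.7, 0.6, 0.4]   # true top jets
qcd_scores = [0.7] + [0.1] * 99          # true QCD jets

improvement = significance_improvement(top_scores, qcd_scores)
print(improvement)  # 8.0
```

Under these assumptions the metric reduces to the signal efficiency divided by the square root of the background efficiency, here 0.8 / sqrt(0.01) = 8.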

The idea of noisy labels is useful in practice for data-driven weakly supervised search strategies Raine et al. (2023b); Hallin et al. (2021); Aad et al. (2020); Andreassen et al. (2020); Golling et al. (2023); Collins et al. (2019); Birman et al. (2022). In particular, it has recently been demonstrated that these data-driven techniques can be extended to constituent-level representations of the jet Buhmann et al. (2023); Sengupta et al. (2023), where the pre-training we propose here will be of significant benefit. Training with noisy labels has also been shown to be successful for isolating muons using data directly Witkowski et al. (2023).

Figure 6: Models trained with weak supervision to classify data sets of different label proportions. A data set of one million QCD jets is compared to a data set with one million QCD jets plus N top jets. The average and standard deviation of the significance improvement over 5 trainings is shown in solid lines and uncertainty bands, respectively. Models with frozen pretrained backbone weights during fine-tuning are “Fixed”, and those with updated weights are “Fine-tuned”.

VI Conclusions

In this paper we propose the masked particle modeling strategy for pre-training models on unordered sets of inputs, and demonstrate that it is useful in the context of high energy physics. Both the continuous nature of particle features, as opposed to the discrete vocabulary typical of natural language, and the unordered nature of the data, as opposed to the sequential nature of text, are addressed to adapt masking strategies from natural language and computer vision to unordered sets of inputs, as are found in high energy physics data. When pre-training with the masked particle modeling strategy, we show that fine-tuned models can achieve high performance on downstream tasks, even using small fine-tuning data sets. These pre-trained models can be fine-tuned to discriminate classes which have not been seen during pre-training, can be adapted to new data sets, and show strong performance in weakly supervised settings. We explore the intriguing possibility of pre-training such models directly on experimental data, whilst only using simulations for fine-tuning. Such an approach may help mitigate uncertainties owing to the small distribution shifts between simulated and real data. Initial studies are promising, indicating that further examination of increasing the scale and size of pre-training data sets and backbone model may help overcome such domain adaptation challenges. More generally, this work suggests that continued exploration of self-supervised learning strategies for high energy physics data, coupled with increased data set and model sizes, is a promising direction for the future development of machine learning in high energy physics.

Acknowledgements

MK is supported by the US Department of Energy (DOE) under grant DE-AC02-76SF00515. LH is supported by the Excellence Cluster ORIGINS, which is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC-2094-390783311. MO is supported by USA-Israel BSF - 2022641. TG, SK, ML, and JR would like to acknowledge funding through the SNSF Sinergia grant CRSII5_193716 called “Robust Deep Density Models for High-Energy Particle Physics and Solar Flare Analysis (RODEM)”, and the SNSF project grant 200020_212127 called “At the two upgrade frontiers: machine learning and the ITk Pixel detector”. ML also acknowledges the funding acquired through the Swiss Government Excellence Scholarships for Foreign Scholars.

References

  • Bommasani et al. (2022) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou,  and Percy Liang, “On the opportunities and risks of foundation models,”  (2022), arXiv:2108.07258 [cs.LG] .
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov,  and Luke Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,”  (2019), arXiv:1910.13461 [cs.CL] .
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee,  and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,”  (2019), arXiv:1810.04805 [cs.CL] .
  • OpenAI (2023) OpenAI, “Gpt-4 technical report,”  (2023), arXiv:2303.08774 [cs.CL] .
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever,  and Dario Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, Vol. 33, edited by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan,  and H. Lin (Curran Associates, Inc., 2020) pp. 1877–1901.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit,  and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,”  (2021), arXiv:2010.11929 [cs.CV] .
  • Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski,  and Armand Joulin, “Emerging properties in self-supervised vision transformers,” in 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021 (IEEE, 2021) pp. 9630–9640.
  • Bao et al. (2022) Hangbo Bao, Li Dong, Songhao Piao,  and Furu Wei, “Beit: Bert pre-training of image transformers,”  (2022), arXiv:2106.08254 [cs.CV] .
  • Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen,  and Ilya Sutskever, “Zero-shot text-to-image generation,”  (2021), arXiv:2102.12092 [cs.CV] .
  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman,  and Karén Simonyan, “Flamingo: a visual language model for few-shot learning,” in NeurIPS (2022).
  • Chen et al. (2021) Xinlei Chen, Saining Xie,  and Kaiming He, “An empirical study of training self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021) pp. 9640–9649.
  • Zhai et al. (2022) Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Y Cheng, Walter Talbott, Chen Huang, Hanlin Goh,  and Joshua M Susskind, “Position prediction as an effective pretraining strategy,” in Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 162, edited by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu,  and Sivan Sabato (PMLR, 2022) pp. 26010–26027.
  • Rives et al. (2021) Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma,  and Rob Fergus, “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences,” Proceedings of the National Academy of Sciences 118, e2016239118 (2021).
  • Ross et al. (2022) J. Ross, B. Belgodere,  and V. Chenthamarakshan, “Large-scale chemical language representations capture molecular structure and properties,” Nature Machine Intelligence 4, 1256–1264 (2022).
  • Pan (2023) J. Pan, “Large language model for molecular chemistry,” Nature Communication Science 3 (2023).
  • Lanusse et al. (2023) Francois Lanusse, Liam Parker, Siavash Golkar, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Geraud Krawezik, Michael McCabe, Ruben Ohana, Mariel Pettee, Bruno Regaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho,  and Shirley Ho, “Astroclip: Cross-modal pre-training for astronomical foundation models,”  (2023), arXiv:2310.03024 [astro-ph.IM] .
  • Walmsley et al. (2022) Mike Walmsley, Inigo Val Slijepcevic, Micah Bowles,  and Anna M. M. Scaife, “Towards galaxy foundation models with hybrid contrastive learning,”  (2022), arXiv:2206.11927 [cs.CV] .
  • Dillon et al. (2022) Barry M. Dillon, Gregor Kasieczka, Hans Olischlager, Tilman Plehn, Peter Sorrenson,  and Lorenz Vogel, “Symmetries, safety, and self-supervision,” SciPost Phys. 12, 188 (2022).
  • Dillon et al. (2023) Barry M. Dillon, Luigi Favaro, Friedrich Feiden, Tanmoy Modak,  and Tilman Plehn, “Anomalies, representations, and self-supervision,”  (2023), arXiv:2301.04660 [hep-ph] .
  • Tombs and Lester (2022) Rupert Tombs and Christopher G. Lester, “A method to challenge symmetries in data with self-supervised learning,” Journal of Instrumentation 17, P08024 (2022).
  • Kishimoto et al. (2023) Tomoe Kishimoto, Masahiro Morinaga, Masahiko Saito,  and Junichi Tanaka, “Pre-training strategy using real particle collision data for event classification in collider physics,”  (2023), arXiv:2312.06909 [hep-ex] .
  • Qu et al. (2022a) Huilin Qu, Congqiao Li,  and Sitian Qian, “Particle transformer for jet tagging,”  (2022a), arXiv:2202.03772 [hep-ph] .
  • Mikuni and Canelli (2021) Vinicius Mikuni and Florencia Canelli, “Point cloud transformers applied to collider physics,” Machine Learning: Science and Technology 2, 035027 (2021).
  • Käch et al. (2022) Benno Käch, Dirk Krücker,  and Isabell Melzer-Pellmann, “Point cloud generation using transformer encoders and normalising flows,”  (2022), arXiv:2211.13623 [hep-ex] .
  • Kansal et al. (2023) Raghav Kansal, Anni Li, Javier Duarte, Nadezda Chernyavskaya, Maurizio Pierini, Breno Orzari,  and Thiago Tomei, “Evaluating generative models in high energy physics,” Physical Review D 107 (2023), 10.1103/physrevd.107.076017.
  • Fenton et al. (2022) Michael James Fenton, Alexander Shmakov, Ta-Wei Ho, Shih-Chieh Hsu, Daniel Whiteson,  and Pierre Baldi, “Permutationless many-jet event reconstruction with symmetry preserving attention networks,” Physical Review D 105 (2022), 10.1103/physrevd.105.112008.
  • ATLAS Collaboration (2023) ATLAS Collaboration (ATLAS), Transformer Neural Networks for Identifying Boosted Higgs Bosons decaying into $b\bar{b}$ and $c\bar{c}$ in ATLAS, Tech. Rep. (CERN, Geneva, 2023).
  • Smith et al. (2023) Rachel E. C. Smith, Inês Ochoa, Rúben Inácio, Jonathan Shoemaker,  and Michael Kagan, “Differentiable vertex fitting for jet flavour tagging,”  (2023), arXiv:2310.12804 [hep-ex] .
  • Tomiya and Nagai (2023) Akio Tomiya and Yuki Nagai, “Equivariant transformer is all you need,”  (2023), arXiv:2310.13222 [hep-lat] .
  • Käch and Melzer-Pellmann (2023) Benno Käch and Isabell Melzer-Pellmann, “Attention to mean-fields for particle cloud generation,”  (2023), arXiv:2305.15254 [hep-ex] .
  • Raine et al. (2023a) John Andrew Raine, Matthew Leigh, Knut Zoch,  and Tobias Golling, “$\nu^{2}$-flows: Fast and improved neutrino reconstruction in multi-neutrino final states with conditional normalizing flows,”  (2023a), arXiv:2307.02405 [hep-ph] .
  • Finke et al. (2023) Thorben Finke, Michael Krämer, Alexander Mück,  and Jan Tönshoff, “Learning the language of qcd jets with transformers,” Journal of High Energy Physics 2023, 184 (2023).
  • Butter et al. (2023) Anja Butter, Nathan Huetsch, Sofia Palacios Schweitzer, Tilman Plehn, Peter Sorrenson,  and Jonas Spinner, “Jet diffusion versus jetgpt – modern networks for the lhc,”  (2023), arXiv:2305.10475 [hep-ph] .
  • Vigl et al. (2024) Matthias Vigl, Nicole Hartman,  and Lukas Heinrich, “Finetuning foundation models for joint analysis optimization,”  (2024), arXiv:2401.13536 [hep-ex] .
  • Oord et al. (2017) Aaron van den Oord, Oriol Vinyals,  and Koray Kavukcuoglu, “Neural discrete representation learning,” arXiv preprint arXiv:1711.00937  (2017).
  • MacQueen et al. (1967) James MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1 (Oakland, CA, USA, 1967) pp. 281–297.
  • Arthur and Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii, “K-means++: The advantages of careful seeding,” in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07 (Society for Industrial and Applied Mathematics, USA, 2007) p. 1027–1035.
  • Buitinck et al. (2013) Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt,  and Gaël Varoquaux, “API design for machine learning software: experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning (2013) pp. 108–122.
  • Qu et al. (2022b) Huilin Qu, Congqiao Li,  and Sitian Qian, “JetClass: A Large-Scale Dataset for Deep Learning in Jet Physics,”  (2022b).
  • (40) Johan Alwall, R Frederix, S Frixione, V Hirschi, Fabio Maltoni, Olivier Mattelaer, H-S Shao, T Stelzer, P Torrielli,  and M Zaro, “The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations,” JHEP 07, 79.
  • (41) Torbjörn Sjöstrand, Stephen Mrenna,  and Peter Skands, “A brief introduction to pythia 8.1,” Comput. Phys. Commun. 178, 852–867.
  • (42) Pierre Artoisenet, Rikkert Frederix, Olivier Mattelaer,  and Robbert Rietkerk, “Automatic spin-entangled decays of heavy resonances in monte carlo simulations,” JHEP 03, 15.
  • de Favereau et al. (2014) J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, M. Selvaggi,  and The DELPHES 3 collaboration, “Delphes 3: a modular framework for fast simulation of a generic collider experiment,” Journal of High Energy Physics 2014, 57 (2014).
  • (44) Matteo Cacciari, Gavin P Salam,  and Gregory Soyez, “The anti-kt jet clustering algorithm,” JHEP 04, 063.
  • Shleifer et al. (2021) Sam Shleifer, Jason Weston,  and Myle Ott, “Normformer: Improved transformer pretraining with extra normalization,” arXiv preprint arXiv:2110.09456  (2021).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980  (2014).
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101  (2017).
  • Huh et al. (2023) Minyoung Huh, Brian Cheung, Pulkit Agrawal,  and Phillip Isola, “Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks,” arXiv preprint arXiv:2305.08842  (2023).
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats,  and Yann N. Dauphin, “Convolutional sequence to sequence learning,”  (2017), arXiv:1705.03122 [cs.CL] .
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser,  and I. Polosukhin, “Attention is all you need,” CoRR abs/1706.03762 (2017).
  • Metodiev et al. (2017) Eric M Metodiev, Benjamin Nachman,  and Jesse Thaler, “Classification without labels: Learning from mixed samples in high energy physics,” Journal of High Energy Physics 2017, 1–18 (2017).
  • Raine et al. (2023b) John Andrew Raine, Samuel Klein, Debajyoti Sengupta,  and Tobias Golling, “Curtains for your sliding window: Constructing unobserved regions by transforming adjacent intervals,” Frontiers in Big Data 6 (2023b), 10.3389/fdata.2023.899345.
  • Hallin et al. (2021) Anna Hallin, Joshua Isaacson, Gregor Kasieczka, Claudius Krause, Benjamin Nachman, Tobias Quadfasel, Matthias Schlaffer, David Shih,  and Manuel Sommerhalder, “Classifying anomalies through outer density estimation (cathode),” arXiv preprint arXiv:2109.00546  (2021).
  • Aad et al. (2020) Georges Aad, Brad Abbott, Dale Charles Abbott, A Abed Abud, Kira Abeling, Deshan Kavishka Abhayasinghe, Syed Haider Abidi, OS AbouZeid, Nadine L Abraham, Halina Abramowicz, et al., “Dijet resonance search with weak supervision using √s = 13 TeV pp collisions in the ATLAS detector,” Physical Review Letters 125, 131801 (2020).
  • Andreassen et al. (2020) Anders Andreassen, Benjamin Nachman,  and David Shih, “Simulation Assisted Likelihood-free Anomaly Detection,” Phys. Rev. D 101, 095004 (2020), arXiv:2001.05001 [hep-ph] .
  • Golling et al. (2023) Tobias Golling, Samuel Klein, Radha Mastandrea,  and Benjamin Nachman, “Flow-enhanced transportation for anomaly detection,” Phys. Rev. D 107, 096025 (2023), arXiv:2212.11285 [hep-ph] .
  • Collins et al. (2019) Jack H. Collins, Kiel Howe,  and Benjamin Nachman, “Extending the search for new resonances with machine learning,” Phys. Rev. D 99, 014038 (2019), arXiv:1902.02634 [hep-ph] .
  • Birman et al. (2022) Mattias Birman, Benjamin Nachman, Raphael Sebbah, Gal Sela, Ophir Turetz,  and Shikma Bressler, “Data-directed search for new physics based on symmetries of the sm,” The European Physical Journal C 82, 508 (2022).
  • Buhmann et al. (2023) Erik Buhmann, Cedric Ewen, Gregor Kasieczka, Vinicius Mikuni, Benjamin Nachman,  and David Shih, “Full phase space resonant anomaly detection,”  (2023), arXiv:2310.06897 [hep-ph] .
  • Sengupta et al. (2023) Debajyoti Sengupta, Matthew Leigh, John Andrew Raine, Samuel Klein,  and Tobias Golling, “Improving new physics searches with diffusion models for event observables and jet constituents,”  (2023), arXiv:2312.10130 [physics.data-an] .
  • Witkowski et al. (2023) Edmund Witkowski, Benjamin Nachman,  and Daniel Whiteson, “Learning to isolate muons in data,” arXiv preprint arXiv:2306.15737  (2023).
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard,  and Aaron Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432  (2013).
  • Huh (2022) Minyoung Huh, “vqtorch: PyTorch package for vector quantization,” https://github.com/minyoungg/vqtorch (2022).
  • van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research 9, 2579–2605 (2008).

Appendix A MPM Encoder Architecture

The Transformer-Encoder (TE) block used in all networks is based on the Normformer Shleifer et al. (2021) encoder block. It is depicted in Fig. 7. The block is composed of a residual attention network followed by a residual dense network. The attention network takes the point cloud as input tokens and performs a multi-headed self-attention pass, with eight heads, surrounded by layer normalizations. The intermediate tokens are then added to the input tokens via a residual connection. The dense network comprises two fully connected linear layers. A sigmoid-linear-unit (SiLU) activation is applied to the output of the hidden layer, layer normalization is used to keep the gradients stable, and dropout of 10% is used for regularization. The output tokens are then added to the intermediate tokens via another residual connection. The final layer of each network inside a residual connection has zero-initialized weights, such that the full Transformer-Encoder block is initialized to the identity. The input and output dimensions of the token features are the same, so several TE blocks can be chained together.
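The identity initialization described above can be illustrated with a minimal pure-Python sketch of the residual dense branch alone. It is not the implementation used in this work: the helper names (`make_dense_branch`, `residual_dense`), the toy dimensions, and the random initialization of the first layer are assumptions for illustration, and layer normalization, dropout, and the attention branch are omitted.

```python
import math
import random


def silu(v):
    """Sigmoid-linear unit: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def linear(weights, bias, x):
    """Apply y = W x + b to one token x (a list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_dense_branch(dim, hidden, rng):
    """Two-layer dense branch whose final layer starts at exactly zero."""
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[0.0] * hidden for _ in range(dim)]  # zero-initialized output layer
    b2 = [0.0] * dim
    return w1, b1, w2, b2


def residual_dense(params, x):
    """Return x + branch(x) for one token (no layer norm or dropout here)."""
    w1, b1, w2, b2 = params
    hidden = [silu(h) for h in linear(w1, b1, x)]
    update = linear(w2, b2, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


# Hypothetical toy sizes: 4 token features, 8 hidden units.
rng = random.Random(0)
params = make_dense_branch(dim=4, hidden=8, rng=rng)
token = [0.5, -1.0, 2.0, 0.0]
output = residual_dense(params, token)  # equals `token` at initialization
```

Because the residual branch contributes exactly zero at initialization, stacking several such blocks leaves the tokens unchanged until training moves the final-layer weights away from zero.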

Figure 7: The Transformer-Encoder block is made of a residual self-attention network followed by a residual dense network. The attention outputs are added to the input tokens before they are passed to the dense network.

Appendix B VQ-VAE Architecture and Training

Training a vector quantized variational autoencoder (VQ-VAE) is known to be difficult Huh et al. (2023), and a useful set of prescriptions for overcoming these challenges has been outlined in Ref. Huh et al. (2023). In the following we provide a brief overview of VQ-VAEs as defined in Huh et al. (2023). A VQ-VAE Oord et al. (2017) is composed of an encoder $F$ and a decoder $G$ neural network with a quantization layer $h$ and codebook $C = \{c_i\}_{i=1}^{m}$ for vectors $c_i \in \mathbb{R}^n$, where $n$ is the dimension of the latent space. The layer $h: \mathbb{R}^n \times C \rightarrow c \in C$ quantizes the latent space by assigning encoded vectors to their closest neighbors in the set $C$, with closeness defined by a distance measure $d$, which we take to be the Euclidean norm. The codebook $C$ will always be omitted from the arguments of the function $h$ in the following. The output of a VQ-VAE is defined as,

$$\hat{x} = G(h(F(x))) = G(h(z_e)) = G(z_q),$$

for a given input $x$. The objective of a VQ-VAE is to minimize the empirical risk,

$$\min_{F,G,h} \mathbb{E}_x\left[\mathcal{L}_{task}(\hat{x}, x)\right], \tag{1}$$

which is not differentiable due to the quantization operation in $h$, so the gradients are estimated using straight-through estimation Bengio et al. (2013). To ensure the accuracy of this estimation, a commitment loss is added to create an attractive force between the encodings $(z_e = F(x))$ and their corresponding codebook vectors $(z_q = h(z_e))$,

$$\mathcal{L}_{cmt} = (1 - \beta)\, d(z_e, \mathrm{sg}(z_q)) + \beta\, d(\mathrm{sg}(z_e), z_q), \tag{2}$$

where $\mathrm{sg}$ is the stop-gradient operator. The resulting fully differentiable proxy objective is

$$\min_{F,G,h} \mathbb{E}_x\left[\mathcal{L}_{task}(\hat{x}, x) + \alpha \mathcal{L}_{cmt}(x)\right], \tag{3}$$

where we take $\alpha = 10$ and $\beta = 0.9$ Oord et al. (2017); Huh et al. (2023). The task loss is also taken to be the Euclidean distance.
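To make the role of the stop-gradient operator in Eq. (2) concrete, the sketch below evaluates the commitment loss and its gradients by hand for one encoding and its codebook vector. It is a minimal illustration in pure Python, not the code of the repository used in this work. The function name `commitment_loss` and the toy vectors are assumptions, and the gradients are written out analytically for the Euclidean distance rather than obtained from an automatic differentiation library.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors given as lists of floats."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def commitment_loss(z_e, z_q, beta=0.9):
    """Value of Eq. (2) and its gradients with respect to z_e and z_q.

    sg(.) is the identity in the forward pass, so the value reduces to
    d(z_e, z_q). In the backward pass the first term only moves the
    encoding z_e and the second term only moves the codebook vector z_q.
    """
    dist = euclidean(z_e, z_q)
    value = (1.0 - beta) * dist + beta * dist
    if dist == 0.0:
        # The distance has no well-defined direction here; use zero gradients.
        return value, [0.0] * len(z_e), [0.0] * len(z_q)
    unit = [(e - q) / dist for e, q in zip(z_e, z_q)]  # d(dist) / d(z_e)
    grad_z_e = [(1.0 - beta) * u for u in unit]  # from the first term
    grad_z_q = [-beta * u for u in unit]         # from the second term
    return value, grad_z_e, grad_z_q


# Toy example: the encoding sits a distance 4 above its codebook vector.
value, grad_z_e, grad_z_q = commitment_loss([3.0, 4.0], [3.0, 0.0], beta=0.9)
```

With $\beta = 0.9$ the gradient acting on the codebook vector is nine times larger in magnitude than the one acting on the encoding, so under this form of the loss it is mainly the codebook that is pulled towards the encoder outputs.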

The training of this model is difficult due to well-known issues with the collapse of the codebook in the latent space Huh et al. (2023). As such, we make use of the repository from Refs. Huh et al. (2023); Huh (2022). In particular, we use a shared parameterization for the codebook elements, update the commitment loss only every four steps, and use a synchronized update rule with $\nu = 2$. Without these additions we found the VQ-VAE to be very difficult to train, although its performance was not very sensitive to the exact values of these hyperparameters.

We use a model with a latent dimension of $n = 16$ and $m = 512$ codebook elements. The codebook elements are initialized using the K-means clustering algorithm on the latent space of a randomly initialized model. The encoder and decoder networks are both transformers of the same type as those described in App. A, but with four layers, a model dimension of 256, and a linear embedding of the nodes into a 256-dimensional space.
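The K-means initialization of the codebook can be sketched with a few iterations of Lloyd's algorithm. This is a generic illustration under stated assumptions, not the routine used in this work: the function name `kmeans_codebook`, the number of iterations, the seeding, and the two-dimensional toy latents (standing in for the outputs of a randomly initialized encoder) are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_codebook(latents, num_codes, num_iters=10, seed=0):
    """Return num_codes centroids of `latents` found with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(latents, num_codes)]
    dim = len(latents[0])
    for _ in range(num_iters):
        # Assignment step: accumulate each latent vector at its nearest centroid.
        sums = [[0.0] * dim for _ in range(num_codes)]
        counts = [0] * num_codes
        for v in latents:
            k = min(range(num_codes),
                    key=lambda j: squared_distance(v, centroids[j]))
            counts[k] += 1
            for i, vi in enumerate(v):
                sums[k][i] += vi
        # Update step: move each centroid to the mean of its assigned vectors.
        for k in range(num_codes):
            if counts[k] > 0:  # an empty cluster keeps its previous centroid
                centroids[k] = [s / counts[k] for s in sums[k]]
    return centroids


# Toy stand-in for encoder outputs: two well-separated blobs in 2 dimensions.
rng = random.Random(1)
latents = ([[rng.gauss(-5.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]
           + [[rng.gauss(5.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)])
codebook = kmeans_codebook(latents, num_codes=2)
```

Starting the codebook at cluster centers of the encoder outputs means every codebook vector initially lies close to some encodings, which is one plausible reason such an initialization helps against unused codebook entries.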

The input nodes are each quantized and decoded separately, with the distance calculated on a per-node basis. Therefore a jet with $N$ constituent particles is assigned to $N$ codebook elements, each of dimension $n$. These $N$ codebook elements are decoded to $N$ vectors of the same dimension as the input nodes. Using a transformer for the encoder and decoder networks ensures that each particle is encoded and decoded conditional on all particles in the jet.
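The per-node quantization can be sketched as a nearest-neighbor lookup applied independently to each encoded particle. The snippet is illustrative only: `tokenize_jet`, the three-entry codebook, and the two-dimensional latents are hypothetical toy choices, whereas the model described above uses 512 codebook elements in a 16-dimensional latent space.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; it has the same minimizer as the norm."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def tokenize_jet(encoded_particles, codebook):
    """Map each encoded particle to the index of its nearest codebook vector."""
    tokens = []
    for z_e in encoded_particles:
        index = min(range(len(codebook)),
                    key=lambda i: squared_distance(z_e, codebook[i]))
        tokens.append(index)
    return tokens


# Toy codebook with 3 entries in a 2-dimensional latent space.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Encoder outputs z_e for a jet with N = 4 constituent particles.
jet_latents = [[0.1, -0.2], [0.9, 1.3], [-0.8, 1.7], [0.2, 0.1]]

tokens = tokenize_jet(jet_latents, codebook)   # one integer token per particle
quantized = [codebook[t] for t in tokens]      # the z_q passed to the decoder
```

The resulting list of integer indices, one per particle, is the kind of discretized token representation that serves as the target when masked particles are to be recovered.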

Appendix C t-SNE embeddings

In Fig. 8 we plot the t-SNE van der Maaten and Hinton (2008) embeddings of the output of the pre-trained backbone, that is, after the self-supervised pre-training. Each input particle has a corresponding output representation, and we embed the average of these output representations over all particles in a given jet.
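The pooling step that precedes the t-SNE embedding can be sketched as a plain average over the per-particle output vectors of each jet. The function name `mean_pool` and the toy numbers are assumptions for illustration, and the subsequent t-SNE step is not shown since the implementation used for it is not specified here.

```python
def mean_pool(particle_embeddings):
    """Average per-particle output vectors into a single vector per jet."""
    num_particles = len(particle_embeddings)
    dim = len(particle_embeddings[0])
    return [sum(p[i] for p in particle_embeddings) / num_particles
            for i in range(dim)]


# Two toy jets with different numbers of particles and 2 output features.
jets = [
    [[1.0, 2.0], [3.0, 4.0]],              # jet with 2 particles
    [[0.0, 0.0], [3.0, 6.0], [6.0, 3.0]],  # jet with 3 particles
]
jet_vectors = [mean_pool(jet) for jet in jets]
```

Averaging gives every jet a fixed-size vector regardless of its number of constituents and is invariant to the order of the particles, so the pooled vectors can be passed directly to a standard t-SNE implementation.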

Figure 8: A t-SNE embedding of the nine signals in the JetClass dataset and the background QCD samples. For a given signal, an embedding is found for that signal and the background only. Each bin shows the fraction of signal in that bin, defined as the number of signal samples divided by the total number of samples in that bin. For each background and signal pair, 100,000 samples are embedded.