License: arXiv.org perpetual non-exclusive license
arXiv:2403.05618v1 [hep-ph] 08 Mar 2024

OmniJet-α: The first cross-task foundation model for particle physics

Joschka Birk joschka.birk@uni-hamburg.de Institute for Experimental Physics, Universität Hamburg
Luruper Chaussee 149, 22761 Hamburg, Germany
Anna Hallin anna.hallin@uni-hamburg.de Institute for Experimental Physics, Universität Hamburg
Luruper Chaussee 149, 22761 Hamburg, Germany
Gregor Kasieczka gregor.kasieczka@uni-hamburg.de Institute for Experimental Physics, Universität Hamburg
Luruper Chaussee 149, 22761 Hamburg, Germany
Abstract

Foundation models are multi-dataset and multi-task machine learning methods that once pre-trained can be fine-tuned for a large variety of downstream applications. The successful development of such general-purpose models for physics data would be a major breakthrough as they could improve the achievable physics performance while at the same time drastically reduce the required amount of training time and data.

We report significant progress on this challenge on several fronts. First, a comprehensive set of evaluation methods is introduced to judge the quality of an encoding from physics data into a representation suitable for the autoregressive generation of particle jets with transformer architectures (the common backbone of foundation models). These measures motivate the choice of a higher-fidelity tokenization compared to previous works. Finally, we demonstrate transfer learning between an unsupervised problem (jet generation) and a classic supervised task (jet tagging) with our new OmniJet-α model. This is the first successful transfer between two different and actively studied classes of tasks and constitutes a major step in the building of foundation models for particle physics.

Figure 1: Schematics of the different steps (tokenization, generation, classification) in the OmniJet-α model.

I Introduction

Foundation models are a new class of machine learning models which are trained on broad datasets and problems and are able to generalize to a variety of downstream tasks and datasets Bommasani et al. (2022). Large language models (LLMs) such as BERT Devlin et al. (2019), BART Lewis et al. (2019), GPT-3 Brown et al. (2020), and LLaMA Touvron et al. (2023) are examples of foundation models for text data, while DALL-E 2 Ramesh et al. (2022) is an example of an image-based model.

The benefits of a foundation model for particle physics data would be significant: While machine learning models developed so far typically outperform classical approaches (often by a large margin) Kasieczka et al. (2019a); Karagiorgi et al. (2022), the statistics available for training these models are a constant issue Macaluso and Shih (2018); Qu et al. (2022a). The issue becomes even more extreme in searches for rare processes, where only a small fraction of simulated events might pass pre-selection criteria. Foundation models, on the other hand, are pre-trained and need fewer examples to be fine-tuned for a specific task Vigl et al. (2024).

Beyond improving the physics performance by e.g. increasing the selection accuracy of classification tasks, foundation models also address the other major issue currently facing particle physics: limited computing resources in the face of an ever-increasing amount of data Albrecht et al. (2019); Boehnlein et al. (2022). This problem has already spawned the development of e.g. increasingly high-fidelity models for the simulation of calorimeters, as well as techniques for speeding up other Monte Carlo simulations Paganini et al. (2018); Buhmann et al. (2021, 2023a); Adelmann et al. (2022); Badger et al. (2023); Hashemi and Krause (2023); Butter et al. (2019); de Oliveira et al. (2017). By allowing the re-use of models across datasets and tasks, foundation models will play an important role in reducing this computational burden. Note that this effect will be further compounded by the potential for optimization in computing operations from one model used across multiple tasks.

Finally, using closely related architectures across different tasks inside one experiment, across experimental collaborations, and in the exchange with the theory community will make results easier to re-interpret ara (2024); Bieringer et al. (2024).

These potential benefits have inspired research into proto-foundation models suitable for particle physics. For example, Dillon et al. (2022a); Favaro et al. (2023); Dillon et al. (2023); Park et al. (2023); Dillon et al. (2022b) investigated how known physical symmetries could be used to learn powerful embeddings of jet data, Benato et al. (2022) showed the versatility of graph-based message passing networks for datasets from different domains of physics, Liu et al. (2023); Salamani et al. (2023) demonstrated conditioning generative models on the geometry of the detector to allow the simultaneous simulation of multiple detectors with one architecture, Dolan and Ore (2022); Beauchesne et al. (2024) used meta-learning for mass-decorrelation and weak-supervision, and Qu et al. (2022a) achieved state-of-the-art performance on the top tagging landscape dataset Kasieczka et al. (2019b) by pre-training on a different dataset Qu et al. (2022b) and transferring the results.

Due to their flexibility demonstrated across language and other domains, transformers Vaswani et al. (2017) are currently the most suitable candidate architecture for building foundation models for applications in particle physics. Taking inspiration from LLMs where sentences are generated autoregressively, recent efforts have demonstrated success with autoregressive generation of particle physics data, for example using the encoder part of the transformer to generate $t \to bqq'$ and $q/g$ jets Finke et al. (2023) and using the decoder part of the transformer to generate $Z$+jets events Butter et al. (2023). Ref. Heinrich et al. (2024) demonstrated how a transformer backbone can be pre-trained using a BERT-like scheme where the model is trained to predict masked-out jet constituents, resulting in an improvement of the performance when fine-tuning the backbone (with a new classification head) for jet tagging, especially at small training dataset sizes. Furthermore, Huang et al. (2024) showed how a tokenized detector representation can be used in a BERT-like model for track reconstruction.

In this work, we will explore whether an autoregressive Generative Pretrained Transformer (GPT) model can be used as a foundation model for jet physics. However, the standard GPT constructions are not built to deal with continuous input data, but rather with tokenized data. As point clouds are the most versatile representation of physics data Komiske et al. (2019); Buhmann et al. (2023b); Kasieczka et al. (2019a); Buhmann et al. (2023c, a) and can incorporate event-level information, jet substructure, and even low-level detector signals, finding a suitable input transformation from point clouds to tokens is the most pressing problem. Various tokenization strategies have been explored, for example using a simple mapping based on binning the input space in Finke et al. (2023), a Gaussian mixture model in Butter et al. (2023), and using an additional conditional embedding network in Heinrich et al. (2024).

Here, we follow the conditional tokenization strategy from van den Oord et al. (2018); Bao et al. (2022); Heinrich et al. (2024), but first take a step back to verify the quality and trade-offs involved in building these tokens. This will allow us to formulate quality measures to choose a suitable tokenization model, leading to an increase in codebook size from 512 tokens in Heinrich et al. (2024) to 8192 tokens.

Using this representation, we will first demonstrate training a generative model for jets as tokens in an unsupervised way for the JetClass Qu et al. (2022b) dataset. Compared to Finke et al. (2023), the core of our architecture is a transformer-decoder, not a transformer-encoder.

Finally, this allows us to test whether the information encoded in a model that was trained to generate jets can also be transferred to the task of classifying them. Observing such a transfer ability across different classes of tasks (as opposed to transfer between different classification or generation problems) would be a crucial ingredient to building foundation models for physics data, and has not yet been achieved. A graphical representation of this approach is provided in Figure 1. As this is the first prototype of a model to tackle all tasks with jets in particle physics, it is named OmniJet-α.

The rest of the paper is organized as follows: Section II introduces the data as well as the tokenization approach, the generative architecture, and the transfer learning strategy. Next, Section III shows the results of the tokenization study, the generative performance, as well as tests of the transfer learning capabilities of the model. Finally, Section IV summarizes the results and provides a brief outlook.

Figure 2: Architecture of the transformer backbone component of OmniJet-α. The data that has been encoded by the VQ-VAE is fed through an embedding layer, before it reaches the main part of the model which is based on the transformer decoder. The output of the transformer decoder blocks is passed to a task specific head, for either generation or classification tasks. Note that during inference of the generative model, the model does not receive complete token sequences, but only the start token. The model will then autoregressively generate the rest of the sequence, updating its input as it progresses, as described in the text.

II Methods and Dataset

II.1 Dataset

All studies are performed using the JetClass dataset Qu et al. (2022b), originally introduced in Qu et al. (2022a). It contains both jet-level and constituent-level features for ten different types of jets initiated by gluons and quarks ($q/g$), top quarks ($t$, subdivided by their decay mode into $t \to bqq'$ and $t \to b\ell\nu$), as well as $W$, $Z$, and $H$ ($H \to b\bar{b}$, $H \to c\bar{c}$, $H \to gg$, $H \to 4q$, and $H \to \ell\nu qq'$) bosons.

Events are simulated using MadGraph5_aMC@NLO Alwall et al. (2014) with parton showering and hadronization done by Pythia Sjöstrand et al. (2015). A simplified detector simulation implemented in Delphes de Favereau et al. (2014) using the CMS detector The CMS Collaboration (2008) card is performed. Constituents are clustered into jets using the anti-$k_\mathrm{T}$ algorithm Cacciari et al. (2008) with a distance parameter of $R = 0.8$.

Jets are selected if they have a transverse momentum of $500\,\mathrm{GeV} < p_\mathrm{T}^{\mathrm{jet}} < 1000\,\mathrm{GeV}$ and a pseudorapidity of $|\eta^{\mathrm{jet}}| < 2$. Additionally, truth-level matching is performed for all classes except $q/g$, and only jets that contain all the decay products of the boson or top quark are included. The resulting dataset contains 100M jets for training, 5M jets for validation, and 20M jets for testing.

In this work, only the kinematic information per particle ($p_\mathrm{T}$, $\phi$, $\eta$) is used, while the particle mass $m$ is approximated as zero. Next, the azimuth angle $\phi$ and the pseudorapidity $\eta$ are pre-processed to be relative to the jet axis (see footnote 1):

$$\eta^{\mathrm{rel}} = \eta^{\mathrm{particle}} - \eta^{\mathrm{jet}} \tag{1}$$
$$\phi^{\mathrm{rel}} = \phi^{\mathrm{particle}} - \phi^{\mathrm{jet}}\,. \tag{2}$$

Footnote 1: The difference in $\phi$ is signed and rectified to $-\pi$ through $\pi$. We handle those calculations using the scikit-hep/vector Schreiner et al. (2023) and scikit-hep/awkward Pivarski et al. (2024) libraries.

Finally, we apply the cuts $|\eta^{\mathrm{rel}}| < 0.8$ and $|\phi^{\mathrm{rel}}| < 0.8$ to remove a very small fraction of low-energy constituents at the periphery, and we use up to 128 particles per jet.
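The pre-processing described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch only, not the authors' implementation (which relies on scikit-hep/vector and scikit-hep/awkward): the function names `wrap_phi` and `preprocess_jet` are hypothetical, and ordering the constituents by $p_\mathrm{T}$ before truncating to 128 particles is an assumption that the text does not state.

```python
import math


def wrap_phi(dphi):
    """Wrap an azimuthal difference into the interval [-pi, pi)."""
    return (dphi + math.pi) % (2.0 * math.pi) - math.pi


def preprocess_jet(constituents, jet_eta, jet_phi, max_abs=0.8, max_particles=128):
    """Return (pt, eta_rel, phi_rel) tuples for one jet.

    constituents: iterable of (pt, eta, phi) tuples in detector coordinates.
    """
    selected = []
    for pt, eta, phi in constituents:
        eta_rel = eta - jet_eta                # Eq. (1)
        phi_rel = wrap_phi(phi - jet_phi)      # Eq. (2), wrapped as in footnote 1
        # Cuts |eta_rel| < 0.8 and |phi_rel| < 0.8
        if abs(eta_rel) < max_abs and abs(phi_rel) < max_abs:
            selected.append((pt, eta_rel, phi_rel))
    # ASSUMPTION: keep the highest-pT constituents when truncating
    selected.sort(key=lambda c: c[0], reverse=True)
    return selected[:max_particles]
```

The wrapping step matters for jets close to $\phi = \pm\pi$, where a naive subtraction would place nearby constituents almost $2\pi$ away from the jet axis.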

Figure 3: Visualization of the reconstructed tokens in physical space (i.e. $p_\mathrm{T}$, $\eta^{\mathrm{rel}}$, $\phi^{\mathrm{rel}}$) for different tokenization approaches and codebook sizes. Each figure label indicates the codebook size and the tokenization approach. The panel labels, repeated for two sets of panels, are: 512, unconditional; 512, conditional; 9261, binning; 8192, unconditional; 8192, conditional. The unconditional tokenization, as well as the binning approach, only have one reconstruction for each token, independent of the other tokens in the jet. To visualize the reconstruction spread of the conditional tokens, each token is reconstructed 500 times, each time conditioned on 50 randomly selected tokens from the codebook. Each colored blob corresponds to the reconstructions obtained for one token.

II.2 Jet constituent token creation

We explore three kinds of tokenization approaches: binned, conditional, and unconditional tokenization. In the binned approach Finke et al. (2023), the space of input features is subdivided using a regular grid in all dimensions (e.g. a 21x21x21 grid in three dimensions) and the cells in this grid are enumerated, resulting in one token per cell.
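As a minimal sketch of the binned approach, the following Python snippet enumerates the cells of a regular grid and maps each cell back to its center. The helper name `make_binned_tokenizer` and the feature ranges used in the usage note are hypothetical choices for illustration; the actual ranges and feature scaling are not specified in this section.

```python
def make_binned_tokenizer(ranges, n_bins=21):
    """Build (encode, decode) functions for a regular grid.

    ranges: list of (low, high) tuples, one per input feature.
    With three features and n_bins=21 there are 21**3 = 9261 tokens.
    """

    def bin_index(value, low, high):
        frac = (value - low) / (high - low)
        # Clamp so that out-of-range values fall into the edge bins
        return min(max(int(frac * n_bins), 0), n_bins - 1)

    def encode(features):
        token = 0
        for value, (low, high) in zip(features, ranges):
            token = token * n_bins + bin_index(value, low, high)
        return token

    def decode(token):
        indices = []
        for _ in ranges:
            indices.append(token % n_bins)
            token //= n_bins
        indices.reverse()
        # Each token is reconstructed as the center of its grid cell
        return tuple(
            low + (i + 0.5) * (high - low) / n_bins
            for i, (low, high) in zip(indices, ranges)
        )

    return encode, decode
```

For example, `encode, decode = make_binned_tokenizer([(-0.8, 0.8), (-0.8, 0.8), (0.0, 7.0)])` gives a tokenizer whose reconstruction `decode(encode(x))` differs from `x` by at most half a bin width per feature, which is the fixed resolution limit of this approach.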

In the unconditional approach, each constituent is tokenized individually using a non-linear mapping, whereas in the conditional approach constituents are encoded and decoded conditioned on each other. We use a Vector Quantized Variational AutoEncoder (VQ-VAE) van den Oord et al. (2018); Bao et al. (2022); Heinrich et al. (2024); Huh et al. (2023) to create a discrete set of jet constituent tokens both for conditional and unconditional tokenization.

The input features for the VQ-VAE are the $\eta^{\mathrm{rel}}$, $\phi^{\mathrm{rel}}$ and $p_\mathrm{T}$ values of the jet constituents. For the conditional tokenization, we use a transformer for both the encoder and the decoder of the VQ-VAE, whereas a simple multi-layer perceptron (MLP) is used for the unconditional tokenization. Details about the different VQ-VAE models used in our studies, as well as details about the preprocessing of the input features, can be found in subsection A.1.
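The step that turns a continuous encoder output into a token in a VQ-VAE is a nearest-neighbor lookup in the codebook. The sketch below shows only this quantization step with a toy, hand-written codebook; the learned encoder and decoder networks (transformer or MLP) are omitted, and the function name `quantize` is hypothetical.

```python
def quantize(latent, codebook):
    """Return the index of the codebook vector closest to `latent`.

    latent: sequence of floats (one encoder output vector).
    codebook: list of vectors with the same dimension as `latent`.
    The returned index is the token assigned to this constituent.
    """
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        # Squared Euclidean distance between latent and codebook entry
        dist = sum((a - b) ** 2 for a, b in zip(latent, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Toy two-dimensional codebook with three entries (illustrative values only)
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
token = quantize((0.9, 0.8), toy_codebook)
```

In the unconditional case the latent vector depends on a single constituent, so each token decodes to one fixed point; in the conditional case the encoder and decoder see the whole jet, so the same token can decode to slightly different values.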

II.3 Transformer backbone

The core of OmniJet-α is a transformer backbone based on the GPT transformer decoder model first introduced in Radford et al. (2018). However, since jet constituents are permutation invariant, we do not employ the positional encoding usually used in LLMs. As input, the transformer backbone receives the generated tokens from the VQ-VAE, complemented with a start and stop token. A jet with $n$ constituents is then represented as

$$\left(\mathtt{start\_token}, x_1, \ldots, x_{n-1}, x_n, \mathtt{stop\_token}\right) \tag{3}$$

where $x_i$ are the tokens.

The transformer backbone itself consists of an embedding layer followed by a series of GPT blocks. Each GPT block contains a multihead attention block, followed by a residual addition, layer norm Ba et al. (2016), two linear layers with a ReLU in between, another residual addition and a final layer norm. Since this is an autoregressive model, a causal mask is passed together with the input data to the multihead attention block to prevent the model from seeing future tokens. The architecture is illustrated in Figure 2.
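The effect of the causal mask can be illustrated with a small, framework-free sketch. The helper names `causal_mask` and `masked_attention_weights` are hypothetical, and the snippet only shows how masked positions receive zero attention weight; it is not the attention implementation used in the model.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True if position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_attention_weights(scores, mask):
    """Row-wise softmax over the allowed positions only.

    scores: square list of lists with raw attention scores.
    mask: output of causal_mask with the same shape.
    """
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        peak = max(allowed)  # subtract the maximum for numerical stability
        exps = [
            math.exp(s - peak) if m else 0.0
            for s, m in zip(score_row, mask_row)
        ]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform scores on a sequence of length 3, the first position attends only to itself, the second splits its weight between the first two positions, and the third attends equally to all three, so no position receives information from later tokens.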

The output from the transformer backbone is passed to a task specific head, either for generation or classification. The generative head is a single linear layer, while the classification head consists of a linear layer followed by ReLU, a sum over the constituent dimension, and a last linear layer with softmax activation function. The model is trained with $n = 8$ heads in the multi-head attention block and $N = 3$ GPT blocks. No dropout is used.

Once the generative model, i.e. the transformer backbone together with the generative head, has been trained on the tokenized data, it can generate new data autoregressively. The model has learned the probability distribution for a token $x_j$, given a sequence of tokens:

$$p\left(x_j \mid x_{j-1}, \ldots, x_1, \mathtt{start\_token}\right). \tag{4}$$

The model is provided with the start token, and then samples this distribution to sequentially generate new tokens. Generation is repeated until either the stop token is generated or the maximum sequence length (which is set to be equal to 128) is reached. The generated token sequences are then fed to the VQ-VAE decoder, which maps them into physical space for further evaluation.
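The generation loop can be sketched as follows. Here `next_token_probs` stands in for the trained backbone plus generative head and is a hypothetical callable; counting the maximum length of 128 in constituent tokens (excluding the start and stop tokens) is an assumption made for this sketch.

```python
import random

START, STOP = "start_token", "stop_token"


def generate_jet(next_token_probs, max_len=128, rng=None):
    """Sample one token sequence autoregressively.

    next_token_probs: callable mapping the current sequence (a list that
        begins with START) to a dict {token: probability}, i.e. Eq. (4).
    Returns the generated constituent tokens without START and STOP.
    """
    rng = rng or random.Random()
    sequence = [START]
    # ASSUMPTION: max_len counts constituent tokens only
    while len(sequence) - 1 < max_len:
        probs = next_token_probs(sequence)
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
        if token == STOP:
            break
        sequence.append(token)
    return sequence[1:]


def toy_model(sequence):
    """Toy stand-in for the network: emit token 5 three times, then stop."""
    return {STOP: 1.0} if len(sequence) > 3 else {5: 1.0}


toy_jet = generate_jet(toy_model)
```

The returned token list would then be passed to the VQ-VAE decoder to obtain constituents in physical space.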

The classification task can be performed either from scratch, using randomly initialized weights for both the transformer backbone and the classification head, or by fine-tuning the generative model. In the fine-tuning case, the initial weights of the transformer backbone are loaded from the generative model.

III Results

III.1 Token quality

Figure 4: (left) Jet mass distribution for all ten jet types combined. (center) Difference between the mass after tokenization and the initial mass for $t \to bqq'$ jets. (right) Difference between the $\tau_{32}$ of the initial jets and the reconstructed jets for $t \to bqq'$ jets.

We first inspect how well the tokens cover the space. An illustration of the conditional and unconditional tokens in physical space (i.e. their corresponding $\eta^{\mathrm{rel}}$, $\phi^{\mathrm{rel}}$ and $p_\mathrm{T}$ values) is shown in Figure 3 for the different tokenization approaches and different codebook sizes. In the unconditional case, as well as in the binning approach, the reconstruction of a token is always the same, independent of the other tokens in the jet, leading to discrete points in physical space. In the conditional case, however, the reconstruction of a token is by construction affected by the other tokens inside this jet. To visualize the spread of each conditional token in physical space, we reconstruct each token 500 times conditioned on 50 randomly chosen tokens. Each of those reconstructions is shown in the scatter plots in Figure 3, where the different reconstructions of the same token are drawn in the same color. We notice that the reconstruction of each token only changes slightly when conditioned on other tokens. Thus, the 500 different reconstructions of a conditional token show up in Figure 3 as a blob in physical space. This already shows that the conditional tokenization covers a significantly larger volume in reconstruction space, while the unconditional tokens can only be reconstructed to distinct points in reconstruction space. We note that our approach of reconstructing each token 500 times conditioned on randomly chosen other tokens does not necessarily represent the reconstructed values of actual jet constituents, as it is possible that those combinations of tokens would not appear for real jets. However, it illustrates how much the reconstruction of a token can change due to the conditioning on the other tokens.

Next, we consider distributions at the jet level to judge the quality of the tokenization. For this and the following studies, jets are mapped into token space, and then mapped back to physical space to quantify the loss in information. Figure 4 (left) shows the jet mass combined for all classes in the dataset, as was done in Heinrich et al. (2024). As observed there, the conditional tokenization with 512 tokens already shows a good agreement between initial and reconstructed mass, with worse performance for the unconditional tokenization.
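For reference, the jet mass used in such comparisons can be computed from massless constituents by summing their four-momenta. The sketch below assumes constituents given as ($p_\mathrm{T}$, $\eta$, $\phi$) tuples with $m = 0$, as stated in the dataset section; the function name `jet_mass` is hypothetical and this is not the authors' code.

```python
import math


def jet_mass(constituents):
    """Invariant mass of a jet built from massless constituents.

    constituents: iterable of (pt, eta, phi) tuples.
    """
    e = px = py = pz = 0.0
    for pt, eta, phi in constituents:
        px += pt * math.cos(phi)
        py += pt * math.sin(phi)
        pz += pt * math.sinh(eta)
        e += pt * math.cosh(eta)  # massless particle: E = |p| = pt * cosh(eta)
    m2 = e * e - (px * px + py * py + pz * pz)
    # Guard against tiny negative values from floating-point rounding
    return math.sqrt(max(m2, 0.0))
```

The per-jet mass difference shown in Figure 4 (center) would then correspond to evaluating such a function once on the original constituents and once on the constituents reconstructed from tokens, and taking the difference.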

Figure 5: Token quality evaluation using a multi-class classifier. The classifier accuracy is shown for different codebook sizes and different classifier architectures (purple and green). The classifiers are also trained on the original constituents, showing an upper limit for the achievable accuracy. The reconstructed constituents are obtained using the conditional tokenization. The numbers correspond to individual trainings; however, we found the results to be very stable when repeating trainings for individual datapoints, with an accuracy deviation of around 0.2 percentage points for both architectures.

However, this inclusive distribution might hide differences at the level of individual classes and jets. We therefore also consider the difference in mass for jets before and after tokenization and reconstruction, shown in Figure 4 (center) for $t \to bqq'$ jets. The unconditional tokenization leads to a systematic shift of approximately 15 GeV, while the conditional tokenization is well centered at zero. Increasing the codebook size from 512 to 8192 tokens substantially improves the resolution. This behavior is even more pronounced when considering the $N$-subjettiness Thaler and Van Tilburg (2011) ratio $\tau_{32}$ in Figure 4 (right). Both conditional and unconditional tokenization with 512 tokens result in shifted distributions, while the larger codebook size of 8192 recovers a peak close to zero. Furthermore, while the mass resolution of the conditional tokens is already centered close to zero when using a codebook size of 512, the width of the distribution improves drastically from $\sigma_{512}^{\mathrm{cond}} = 8.3$ GeV to $\sigma_{8192}^{\mathrm{cond}} = 4.0$ GeV when moving to a codebook size of 8192, where $\sigma$ corresponds to the standard deviation obtained from fitting a normal distribution to the mass resolution histograms. A similar behavior can be observed for other classes, where in some cases, depending on the jet observable and the jet type, the effect is even more extreme. Finally, the binning approach with 21x21x21 linear bins in the input features of our VQ-VAE comes close to the mass resolution of the conditional tokens with a codebook size of 8192, while the resolution of the subjettiness ratio is notably worse. Moreover, while this binning approach with 9261 tokens leads to reasonable resolution of the $t \to bqq'$ jets shown in Figure 4, we found quite drastic mismodeling for $t \to b\ell\nu$ and $H \to \ell\nu qq'$ jets with such small codebook sizes (see footnote 2). The distributions and the corresponding resolutions of the jet mass, jet $p_\mathrm{T}$, as well as the subjettiness ratios $\tau_{32}$ and $\tau_{21}$ are shown for all ten jet types individually in Appendix B. Overall, the highest fidelity is achieved by conditional tokenization with a marked improvement from increasing the codebook size from 512 to 8192.

Footnote 2: As expected, the binning approach reaches good resolution once the codebook size (i.e. the number of bins) is increased to a sufficiently large number. We found that around 64 000 tokens (corresponding to a 40x40x40 grid) offer similar resolution to conditional tokenization with a codebook size of 8192.

Finally, we quantify the information loss that comes with the tokenization by training multi-class classifiers to distinguish between the ten jet types present in the dataset. The classifiers are trained once with the original inputs, and once with the inputs after undergoing tokenization and subsequent reconstruction back into physical space. This procedure allows a direct comparison of how the loss in resolution affects reconstruction performance. We utilize two standard classifier architectures: DeepSets Zaheer et al. (2018); Komiske et al. (2019) (i.e. without particle interactions) and Transformer Vaswani et al. (2017); Shleifer et al. (2021) (i.e. with particle interactions) and perform this study for four different codebook sizes from 512 to 8192 tokens for the conditional tokenization approach. Details about the classifier trainings can be found in subsection A.2. Note that this approach is similar in spirit to the classifier metric proposed in Krause and Shih (2023); Das et al. (2023) but tests a multi-class classifier trained on these samples individually, as opposed to judging how well a classifier might distinguish original and reconstructed samples. This is necessary as e.g. points at fixed positions would be distinguishable from the original with close to perfect accuracy, rendering the test less useful.

The resulting classifier accuracy for the two different architectures is shown in Figure 5. As seen in previous studies of resolution, we observe an increase of token quality as we increase the size of the VQ-VAE codebook. Furthermore, we see that the classifier performance starts to plateau with codebook sizes larger than 4096. However, even at the largest codebook size, a gap to the performance on original particles remains, motivating future work into building more accurate tokenization schemes.

For the remaining studies we will utilize a codebook size of 8192 with conditional tokens as this leads to the overall best performance and fidelity.

Figure 6: Comparison of generated jets from the model trained on both $q/g$ and $t \to bqq'$ jets, to reconstructed JetClass tokens. The top row shows jet level distributions, while the bottom row shows distributions on the constituent level.

III.2 Jet generation

After training the transformer backbone with the generative head, it can be used for autoregressive generation as described in subsection II.3. The model was trained on three separate datasets: $t \to bqq'$ only, $q/g$ only, and $q/g$ and $t \to bqq'$ combined. This section will describe the combined version, since this is the model that is used for transfer learning. For a discussion of single-jet type generative results, including a comparison to the EPiC-FM method of Birk et al. (2023), see Appendix C. 48 000 events were generated from the combined model. These events contain tokens, which are then decoded back to physical space using the VQ-VAE decoder.

A comparison to reconstructed JetClass tokens can be seen in Figure 6. We observe that in general the model is able to match the truth level tokens well. We note that the tail of the $p_{\mathrm{T}}$ spectrum of both the generated constituents and the reconstructed JetClass tokens contains bumps distributed around discrete values, which is consistent with our inspection of the reconstruction space shown in Figure 3.

In order to quantify the performance, a classifier test (see subsection A.5 for details) is performed to distinguish generated events from reconstructed JetClass tokens. The test results in an AUC score of 0.54, where a score of 0.5 would mean that the classifier cannot separate the two samples at all.

III.3 Transfer learning from generation to classification

Figure 7: Performance of pre-trained and non-pre-trained models for the task of $t\to bqq^{\prime}$ vs $q/g$ jet classification. The area under the ROC-curve (AUC) metric is shown on the left, the classification accuracy on the right.

To evaluate the ability of the model to generalize from generating jets to classifying them, we focus on the task of hadronic top quark tagging Kasieczka et al. (2019a), i.e. distinguishing $t\to bqq^{\prime}$ and $q/g$ jets on the JetClass Qu et al. (2022b) dataset. For this test, the Next-token prediction head is replaced by a Classification head while the transformer backbone is retained. We compare three training strategies: training the full architecture with randomly initialized weights (termed From scratch), which does not use transfer learning and serves as the baseline, and two versions of fine-tuning the model obtained in the generative training step. In the Fine-tuning run, both the pre-trained backbone weights and the randomly initialized classification head weights are allowed to float in the training, while in Fine-tuning (backbone fixed) only the classification head is allowed to change.

The results of these training runs are presented in Figure 7 as a function of the number of training examples provided to the model. We observe a significant gain in classification accuracy for both fine-tuning approaches compared to the baseline, with up to 15 percentage points higher accuracy for small numbers of training jets and a lead of a few percentage points at the largest training sample size. The difference between the two fine-tuning strategies is relatively small, with the more open training performing slightly better. Put differently, the generatively pre-trained model reaches an accuracy of around 84% with 100 training examples, a level for which the model trained from scratch requires 10 000 examples.

IV Conclusion

Foundation models for physics data are an enticing promise: trained on large amounts of data and tasks, they are expected to easily generalize to any downstream problem, saving countless hours of human and compute time. In this paper we have taken crucial steps towards the creation of such models.

First, we expect learned representations of data to play a key role as inputs to foundation models. Representations might be continuous and rely on symmetries or learn a mapping to a discrete space as done here with tokenization. Note that while using raw data (i.e. without prior mapping into a representation space) might be possible when only considering a narrow range of similar datasets, it is inherently limiting when data from different sources or with different initial dimensionalities are to be considered.

Whatever representation is used, it will be important to understand and minimize the loss of information inherent in this transformation. This problem is especially important for downstream uses such as classification and regression tasks, as the loss of information can directly limit the achievable accuracy or resolution. This work introduced a set of criteria, both distribution and classifier based, that can be used to assess the quality of any representation.

Using these metrics, we found a marked increase in the resolution of relevant observables like mass and jet substructure by using a codebook size of 8192 with conditional tokenization over binning-based approaches, unconditional tokenization, and conditional tokenization with smaller codebooks. An additional classifier test further confirmed this observation.

Next, we demonstrated the training of an autoregressive generative model for jet constituents, specifically for $q/g$ and $t\to bqq^{\prime}$ jets from the JetClass dataset Qu et al. (2022b). The generated distributions agree well with the ground truth for global jet kinematics, jet substructure, and individual constituent features alike. We note that while our model is the first token-based generator of JetClass-like examples, more extensive studies of its generative fidelity when increasing the feature space, and a detailed comparison to prior non-token-based results on this dataset Birk et al. (2023), are left for future work.

Most importantly, we report the generalization capability of OmniJet-α from learning to generate jets in an unsupervised way to the supervised classification between $t\to bqq^{\prime}$ and $q/g$ jets. Overall, the fine-tuned model outperformed training from scratch for all numbers of training examples, often by a significant margin. For example, for 1000 training jets, the fine-tuned model achieves an accuracy of approximately 90%, compared to around 74% for the freshly initialized model. While the two approaches seem to converge, even at the largest training size of 2 million jets the fine-tuned approach maintains a lead of a few percentage points. Finally, it provides a non-trivial classification accuracy of 84% even when trained on only 100 jets, emphasizing the value of foundation models for problems with few available labeled training examples. While other types of transfer have been demonstrated previously, this is the first time that the unification across the two big classes of tasks (classification and generation) has been achieved.

Of course, this work is only one step in building overarching foundation models. While it is the first model that achieves both classification and generation, it is not the most performant for either of these tasks. However, strategies to increase the performance exist and will be integrated. For example, the representation quality needs to be improved, possible gains from masked pre-training have to be evaluated, architecture and training data need to be scaled up, and more extensive studies of the generalization capabilities, including training and testing on additional tasks, need to be performed. In the medium term, strategies need to be found to align diverse datasets as well as to integrate pre-trained foundation models in community workflows. Nevertheless, the potential benefits in physics performance and compute efficiency glimpsed in this and other works make this a worthy endeavor.

Acknowledgements

We thank David Shih, Michael Krämer, Michael Kagan, Frank Gaede, Sarah Heim, and Judith Katzy for stimulating discussions of foundation models for physics data and Erik Buhmann for valuable comments on the manuscript.

The authors acknowledge support by the Deutsche Forschungsgemeinschaft under Germany's Excellence Strategy – EXC 2121 Quantum Universe – 390833306, and under PUNCH4NFDI – project number 460248186. This research was supported in part through the Maxwell computational resources operated at Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany.

References

  • Bommasani et al. (2022) Rishi Bommasani et al., “On the opportunities and risks of foundation models,” (2022), arXiv:2108.07258 [cs.LG].
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” (2019), arXiv:1810.04805 [cs.CL].
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer, “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension,” (2019), arXiv:1910.13461 [cs.CL].
  • Brown et al. (2020) Tom B. Brown et al., “Language models are few-shot learners,” (2020), arXiv:2005.14165 [cs.CL].
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “LLaMA: Open and Efficient Foundation Language Models,” (2023), arXiv:2302.13971 [cs.CL].
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen, “Hierarchical Text-Conditional Image Generation with CLIP Latents,” (2022), arXiv:2204.06125 [cs.CV].
  • Kasieczka et al. (2019a) Gregor Kasieczka et al., “The machine learning landscape of top taggers,” SciPost Physics 7 (2019a), 10.21468/scipostphys.7.1.014.
  • Karagiorgi et al. (2022) Georgia Karagiorgi, Gregor Kasieczka, Scott Kravitz, Benjamin Nachman, and David Shih, “Machine learning in the search for new fundamental physics,” Nature Rev. Phys. 4, 399–412 (2022).
  • Macaluso and Shih (2018) Sebastian Macaluso and David Shih, “Pulling out all the tops with computer vision and deep learning,” Journal of High Energy Physics 2018 (2018), 10.1007/jhep10(2018)121.
  • Qu et al. (2022a) Huilin Qu, Congqiao Li, and Sitian Qian, “Particle Transformer for Jet Tagging,” in Proceedings of the 39th International Conference on Machine Learning (2022) pp. 18281–18292, arXiv:2202.03772 [hep-ph].
  • Vigl et al. (2024) Matthias Vigl, Nicole Hartman, and Lukas Heinrich, “Finetuning Foundation Models for Joint Analysis Optimization,” (2024), arXiv:2401.13536 [hep-ex].
  • Albrecht et al. (2019) Johannes Albrecht et al. (HEP Software Foundation), “A Roadmap for HEP Software and Computing R&D for the 2020s,” Comput. Softw. Big Sci. 3, 7 (2019), arXiv:1712.06982 [physics.comp-ph].
  • Boehnlein et al. (2022) Amber Boehnlein et al., “HL-LHC Software and Computing Review Panel Report,” (2022).
  • Paganini et al. (2018) Michela Paganini, Luke de Oliveira, and Benjamin Nachman, “Accelerating Science with Generative Adversarial Networks: An Application to 3D Particle Showers in Multilayer Calorimeters,” Phys. Rev. Lett. 120, 042003 (2018), arXiv:1705.02355 [hep-ex].
  • Buhmann et al. (2021) Erik Buhmann, Sascha Diefenbacher, Engin Eren, Frank Gaede, Gregor Kasieczka, Anatolii Korol, and Katja Krüger, “Getting High: High Fidelity Simulation of High Granularity Calorimeters with High Speed,” Comput. Softw. Big Sci. 5, 13 (2021), arXiv:2005.05334.
  • Buhmann et al. (2023a) Erik Buhmann, Frank Gaede, Gregor Kasieczka, Anatolii Korol, William Korcari, Katja Krüger, and Peter McKeown, “CaloClouds II: Ultra-Fast Geometry-Independent Highly-Granular Calorimeter Simulation,” (2023a), arXiv:2309.05704 [physics.ins-det].
  • Adelmann et al. (2022) Andreas Adelmann et al., “New directions for surrogate models and differentiable programming for High Energy Physics detector simulation,” in Snowmass 2021 (2022) arXiv:2203.08806 [hep-ph].
  • Badger et al. (2023) Simon Badger et al., “Machine learning and LHC event generation,” SciPost Phys. 14, 079 (2023), arXiv:2203.07460 [hep-ph].
  • Hashemi and Krause (2023) Hosein Hashemi and Claudius Krause, “Deep Generative Models for Detector Signature Simulation: An Analytical Taxonomy,” (2023), arXiv:2312.09597 [physics.ins-det].
  • Butter et al. (2019) Anja Butter, Tilman Plehn, and Ramon Winterhalder, “How to GAN LHC Events,” SciPost Phys. 7, 075 (2019), arXiv:1907.03764 [hep-ph].
  • de Oliveira et al. (2017) Luke de Oliveira, Michela Paganini, and Benjamin Nachman, “Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis,” Computing and Software for Big Science 1 (2017), 10.1007/s41781-017-0004-6.
  • Araz et al. (2024) Jack Y. Araz, Andy Buckley, Gregor Kasieczka, Jan Kieseler, Sabine Kraml, Anders Kvellestad, Andre Lessa, Tomasz Procter, Are Raklev, Humberto Reyes-Gonzalez, Krzysztof Rolbiecki, Sezen Sekmen, and Gokhan Unel, “Les Houches guide to reusable ML models in LHC analyses,” (2024), arXiv:2312.14575 [hep-ph].
  • Bieringer et al. (2024) Sebastian Bieringer, Gregor Kasieczka, Jan Kieseler, and Mathias Trabs, “Classifier Surrogates: Sharing AI-based Searches with the World,” (2024), arXiv:2402.15558 [hep-ph].
  • Dillon et al. (2022a) Barry M. Dillon, Gregor Kasieczka, Hans Olischlager, Tilman Plehn, Peter Sorrenson, and Lorenz Vogel, “Symmetries, safety, and self-supervision,” SciPost Phys. 12, 188 (2022a), arXiv:2108.04253 [hep-ph].
  • Favaro et al. (2023) Luigi Favaro, Michael Krämer, Tanmoy Modak, Tilman Plehn, and Jan Rüschkamp, “Semi-visible jets, energy-based models, and self-supervision,” (2023), arXiv:2312.03067 [hep-ph].
  • Dillon et al. (2023) Barry M. Dillon, Luigi Favaro, Friedrich Feiden, Tanmoy Modak, and Tilman Plehn, “Anomalies, Representations, and Self-Supervision,” (2023), arXiv:2301.04660 [hep-ph].
  • Park et al. (2023) Sang Eon Park, Philip Harris, and Bryan Ostdiek, “Neural embedding: learning the embedding of the manifold of physics data,” JHEP 07, 108 (2023), arXiv:2208.05484 [hep-ph].
  • Dillon et al. (2022b) Barry M. Dillon, Radha Mastandrea, and Benjamin Nachman, “Self-supervised anomaly detection for new physics,” Phys. Rev. D 106, 056005 (2022b), arXiv:2205.10380 [hep-ph].
  • Benato et al. (2022) Lisa Benato et al., “Shared Data and Algorithms for Deep Learning in Fundamental Physics,” Comput. Softw. Big Sci. 6, 9 (2022), arXiv:2107.00656 [cs.LG].
  • Liu et al. (2023) Junze Liu, Aishik Ghosh, Dylan Smith, Pierre Baldi, and Daniel Whiteson, “Generalizing to new geometries with Geometry-Aware Autoregressive Models (GAAMs) for fast calorimeter simulation,” JINST 18, P11003 (2023), arXiv:2305.11531 [physics.ins-det].
  • Salamani et al. (2023) Dalila Salamani, Anna Zaborowska, and Witold Pokorski, “MetaHEP: Meta learning for fast shower simulation of high energy physics experiments,” Phys. Lett. B 844, 138079 (2023).
  • Dolan and Ore (2022) Matthew J. Dolan and Ayodele Ore, “Metalearning and data augmentation for mass-generalized jet taggers,” Phys. Rev. D 105, 094030 (2022), arXiv:2111.06047 [hep-ph].
  • Beauchesne et al. (2024) Hugues Beauchesne, Zong-En Chen, and Cheng-Wei Chiang, “Improving the performance of weak supervision searches using transfer and meta-learning,” JHEP 02, 138 (2024), arXiv:2312.06152 [hep-ph].
  • Kasieczka et al. (2019b) Gregor Kasieczka, Tilman Plehn, Jennifer Thompson, and Michael Russel, “Top quark tagging reference dataset,” (2019b).
  • Qu et al. (2022b) Huilin Qu, Congqiao Li, and Sitian Qian, “JetClass: A Large-Scale Dataset for Deep Learning in Jet Physics,” (2022b).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention Is All You Need,” in 31st International Conference on Neural Information Processing Systems (2017) arXiv:1706.03762 [cs.CL].
  • Finke et al. (2023) Thorben Finke, Michael Krämer, Alexander Mück, and Jan Tönshoff, “Learning the language of QCD jets with transformers,” JHEP 06, 184 (2023), arXiv:2303.07364 [hep-ph].
  • Butter et al. (2023) Anja Butter, Nathan Huetsch, Sofia Palacios Schweitzer, Tilman Plehn, Peter Sorrenson, and Jonas Spinner, “Jet Diffusion versus JetGPT – Modern Networks for the LHC,” (2023), arXiv:2305.10475 [hep-ph].
  • Heinrich et al. (2024) Lukas Heinrich, Tobias Golling, Michael Kagan, Samuel Klein, Matthew Leigh, Margarita Osadchy, and John Andrew Raine, “Masked particle modeling on sets: Towards self-supervised high energy physics foundation models,” (2024), arXiv:2401.13537 [hep-ph].
  • Huang et al. (2024) Andris Huang, Yash Melkani, Paolo Calafiura, Alina Lazar, Daniel Thomas Murnane, Minh-Tuan Pham, and Xiangyang Ju, “A Language Model for Particle Tracking,” in Connecting The Dots 2023 (2024) arXiv:2402.10239 [hep-ph].
  • Komiske et al. (2019) Patrick T. Komiske, Eric M. Metodiev, and Jesse Thaler, “Energy flow networks: deep sets for particle jets,” Journal of High Energy Physics 2019 (2019), 10.1007/jhep01(2019)121.
  • Buhmann et al. (2023b) Erik Buhmann, Gregor Kasieczka, and Jesse Thaler, “EPiC-GAN: Equivariant Point Cloud Generation for Particle Jets,” (2023b), arXiv:2301.08128 [hep-ph].
  • Buhmann et al. (2023c) Erik Buhmann, Sascha Diefenbacher, Engin Eren, Frank Gaede, Gregor Kasieczka, Anatolii Korol, William Korcari, Katja Krüger, and Peter McKeown, “CaloClouds: fast geometry-independent highly-granular calorimeter simulation,” JINST 18, P11025 (2023c), arXiv:2305.04847 [physics.ins-det].
  • van den Oord et al. (2018) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu, “Neural discrete representation learning,” (2018), arXiv:1711.00937 [cs.LG].
  • Bao et al. (2022) Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei, “BEiT: BERT Pre-Training of Image Transformers,” (2022), arXiv:2106.08254 [cs.CV].
  • Alwall et al. (2014) J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H.-S. Shao, T. Stelzer, P. Torrielli, and M. Zaro, “The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations,” Journal of High Energy Physics 2014 (2014), 10.1007/jhep07(2014)079.
  • Sjöstrand et al. (2015) Torbjörn Sjöstrand, Stefan Ask, Jesper R. Christiansen, Richard Corke, Nishita Desai, Philip Ilten, Stephen Mrenna, Stefan Prestel, Christine O. Rasmussen, and Peter Z. Skands, “An introduction to PYTHIA 8.2,” Computer Physics Communications 191, 159–177 (2015).
  • de Favereau et al. (2014) J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, “DELPHES 3: a modular framework for fast simulation of a generic collider experiment,” Journal of High Energy Physics 2014 (2014), 10.1007/jhep02(2014)057.
  • The CMS Collaboration (2008) The CMS Collaboration, “The CMS experiment at the CERN LHC,” JINST 3, S08004 (2008).
  • Cacciari et al. (2008) Matteo Cacciari, Gavin P Salam, and Gregory Soyez, “The anti-kt jet clustering algorithm,” Journal of High Energy Physics 2008, 063–063 (2008).
  • Schreiner et al. (2023) Henry Schreiner, Jim Pivarski, and Saransh Chopra, “vector,” (2023).
  • Pivarski et al. (2024) Jim Pivarski, Ianna Osborne, Ioana Ifrim, Henry Schreiner, Angus Hollands, Anish Biswas, Pratyush Das, Santam Roy Choudhury, Nicholas Smith, and Manasvi Goyal, “Awkward Array,” (2024).
  • Huh et al. (2023) Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola, “Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks,” (2023), arXiv:2305.08842 [cs.LG].
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, “Improving language understanding by generative pre-training,” (2018).
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, “Layer normalization,” (2016), arXiv:1607.06450 [stat.ML].
  • Thaler and Van Tilburg (2011) Jesse Thaler and Ken Van Tilburg, “Identifying Boosted Objects with N-subjettiness,” JHEP 03, 015 (2011), arXiv:1011.2268 [hep-ph].
  • Zaheer et al. (2018) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola, “Deep Sets,” (2018), arXiv:1703.06114 [cs.LG].
  • Shleifer et al. (2021) Sam Shleifer, Jason Weston, and Myle Ott, “Normformer: Improved transformer pretraining with extra normalization,” (2021), arXiv:2110.09456 [cs.CL].
  • Krause and Shih (2023) Claudius Krause and David Shih, “Fast and accurate simulations of calorimeter showers with normalizing flows,” Phys. Rev. D 107, 113003 (2023), arXiv:2106.05285 [physics.ins-det].
  • Das et al. (2023) Ranit Das, Luigi Favaro, Theo Heimel, Claudius Krause, Tilman Plehn, and David Shih, “How to Understand Limitations of Generative Networks,” (2023), arXiv:2305.16774 [hep-ph].
  • Birk et al. (2023) Joschka Birk, Erik Buhmann, Cedric Ewen, Gregor Kasieczka, and David Shih, “Flow Matching Beyond Kinematics: Generating Jets with Particle-ID and Trajectory Displacement Information,” (2023), arXiv:2312.00123 [hep-ph].
  • Paszke et al. (2019) Adam Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019) pp. 8024–8035.
  • Falcon and team (2024) William Falcon and The PyTorch Lightning team, “Pytorch lightning,” (2024).
  • Huh (2022) Minyoung Huh, “vqtorch: PyTorch package for vector quantization,” https://github.com/minyoungg/vqtorch (2022).
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter, “Decoupled Weight Decay Regularization,” (2019), arXiv:1711.05101 [cs.LG].
  • Smith (2018) Leslie N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay,” (2018), arXiv:1803.09820 [cs.LG].
  • Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” (2017), arXiv:1412.6980 [cs.LG].

Appendix A Model details and hyperparameters

A.1 VQ-VAE for token creation

Both the $\eta^{\text{rel}}$ and the $\phi^{\text{rel}}$ values are scaled down by a factor of 3. The transverse momentum of the jet constituents is first transformed using the natural logarithm and subsequently shifted by -1.8. The tokenization was also done without the log transform of the $p_{\text{T}}$, and was found to perform similarly. However, the logarithm transformation has the advantage that it automatically avoids negative $p_{\text{T}}$ values, which is why we choose to use the log-transformed $p_{\text{T}}$. The conditional and unconditional tokenization only differ in the architecture of the encoder and decoder of the VQ-VAE.
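The feature scaling described above can be sketched as follows. This is a minimal illustration under stated assumptions: "shifted by -1.8" is read as adding -1.8 to $\ln p_{\text{T}}$, and the function and constant names are hypothetical rather than taken from the authors' code. The inverse transform shows why the logarithm guarantees positive $p_{\text{T}}$ after decoding.

```python
import math

ANGLE_SCALE = 3.0     # eta_rel and phi_rel are divided by this factor
LOG_PT_SHIFT = -1.8   # assumed to be added to ln(pT)


def preprocess(pt, eta_rel, phi_rel):
    """Map physical constituent features to the scaled network inputs."""
    return (math.log(pt) + LOG_PT_SHIFT,
            eta_rel / ANGLE_SCALE,
            phi_rel / ANGLE_SCALE)


def postprocess(x_pt, x_eta, x_phi):
    """Invert the scaling; exp() is positive for any real decoder output."""
    return (math.exp(x_pt - LOG_PT_SHIFT),
            x_eta * ANGLE_SCALE,
            x_phi * ANGLE_SCALE)
```

Whatever value the decoder returns for the first feature, `postprocess` maps it to a strictly positive transverse momentum.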

Training for the VQ-VAE is implemented in pytorch Paszke et al. (2019) and pytorch-lightning Falcon and team (2024).

The model architecture of the VQ-VAE encoder and decoder in the conditional tokenization approach is similar to Heinrich et al. (2024) with a different set of hyperparameters. We use 4 NormFormer Shleifer et al. (2021) blocks with an embedding dimension of 128 and 8 heads in the MHA for both the encoder and the decoder. We use the vqtorch library Huh (2022); Huh et al. (2023) to implement the vector quantization layer with the dimension of the codebook vectors set to 4.
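The core of a vector quantization layer is a nearest-neighbour lookup in the codebook, and the index of the selected code is the token. The sketch below shows only this generic lookup for 4-dimensional latent vectors. It is not the vqtorch implementation, which additionally handles training-time details such as the affine reparametrization and dead-code replacement; the function name and the tiny three-entry codebook are illustrative assumptions.

```python
def quantize(latent, codebook):
    """Return (token_id, code_vector) of the codebook entry closest to `latent`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    token = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return token, codebook[token]


# Toy codebook with 3 entries of dimension 4 (the paper uses up to 8192 entries).
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
token, code = quantize([0.9, 0.1, 0.0, 0.0], codebook)
```

Here the encoder output `[0.9, 0.1, 0.0, 0.0]` is closest to the second entry, so the constituent is represented by token 1 and decoded from that code vector.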

The mean squared error (MSE) between the tensor of the initial particle features and the reconstructed features is used as the task loss $\mathcal{L}_{\mathrm{task}}$. The total loss is then set to

$$\mathcal{L}=\mathcal{L}_{\mathrm{task}}+\alpha\cdot\mathcal{L}_{\mathrm{commit}} \qquad (5)$$

with $\alpha=10$. An affine transformation is used for a joint transformation of all codes and dead codes are replaced with a frequency of 10 iterations. The parameter $\beta$ which trades off the importance of updating the embeddings from the encoder $z_{e}$ and the code vectors $z_{q}$ is set to $\beta=0.9$. Lastly, we use a synchronized update rule Huh et al. (2023); Huh (2022) with $\nu=1$.

In the unconditional approach, we use the same hyperparameters as outlined above, with the only difference that the architecture of the encoder and decoder is a simple MLP with 3 hidden layers of dimension 128 and ReLU activation function.

All VQ-VAE models are trained on all 10 classes of the JetClass dataset Qu et al. (2022b).

A.2 Classifiers for token quality evaluation

The DeepSets Zaheer et al. (2018); Komiske et al. (2019) classifier consists of a per-particle MLP $\Phi$ with shared weights across all particles inside the jet with 3 hidden layers of dimension 100, 100 and 256. The output of the network $\Phi$ is then aggregated with a sum and fed into another MLP with 3 hidden layers of dimension 100 followed by a 10-dimensional output layer with softmax activation function.

The Transformer classifier consists of 5 NormFormer Shleifer et al. (2021) blocks, followed by two class-attention blocks with a class token as query, inspired by the ParT Qu et al. (2022a) architecture. The output of the last class-attention block is fed into an MLP with two hidden layers of dimension 128, followed by a softmaxed 10-dimensional output layer.

The classifiers are trained with the AdamW Loshchilov and Hutter (2019) optimizer with a maximum learning rate of 0.005 (0.001) for the DeepSets (Transformer) classifier and weight decay 0.01. The learning rate is first linearly increased from 0.002 (0.0005) to the maximum learning rate during the first 4 training epochs, after which it is linearly decreased to the initial learning rate over 20 epochs and then linearly decreased to a final learning rate of 0.001 (0.0003), following the one-cycle learning rate schedule Smith (2018).
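The piecewise-linear learning rate schedule can be written down explicitly. The sketch below uses the DeepSets values quoted above as defaults. The length of the final decay phase is not stated in the text, so `final_epochs` is an assumption, as are the function and parameter names.

```python
def one_cycle_lr(epoch, lr_start=0.002, lr_max=0.005, lr_final=0.001,
                 warmup_epochs=4, cooldown_epochs=20, final_epochs=1):
    """Learning rate at a (possibly fractional) epoch for a one-cycle schedule."""
    if epoch <= warmup_epochs:
        # Phase 1: linear increase from lr_start to lr_max.
        frac = epoch / warmup_epochs
        return lr_start + frac * (lr_max - lr_start)
    if epoch <= warmup_epochs + cooldown_epochs:
        # Phase 2: linear decrease from lr_max back to lr_start.
        frac = (epoch - warmup_epochs) / cooldown_epochs
        return lr_max + frac * (lr_start - lr_max)
    # Phase 3: linear decrease from lr_start to lr_final, then constant.
    frac = min(1.0, (epoch - warmup_epochs - cooldown_epochs) / final_epochs)
    return lr_start + frac * (lr_final - lr_start)
```

For the Transformer classifier, the same function would be called with `lr_start=0.0005`, `lr_max=0.001` and `lr_final=0.0003`.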

The classifiers for these token quality evaluations are trained on 10M jets from the JetClass dataset Qu et al. (2022b).

A.3 Transformer backbone

When training the transformer backbone, cross entropy is used as a loss function and Adam Kingma and Ba (2017) with a learning rate of 0.001 as optimizer. The model had access to 10M $t\to bqq^{\prime}$ jet events and 10M $q/g$ jet events. Note that this means that the model trained on these two jet types combined had access to twice as much data. All versions were trained for 30 epochs, and the model state with the lowest validation loss was chosen for the further analysis.

A.4 Transfer learning

To perform the transfer learning from the generative task to the classification task, we change the head of the OmniJet-α model to the classification head and load the weights of the backbone trained for the generative task. We explore two variations of fine-tuning the pre-trained backbone to the classification task: training all weights of the model with the same learning rate (referred to as Fine-tuning in subsection III.3) and keeping the weights of the backbone fixed at the state obtained from the generative model (referred to as Fine-tuning (backbone fixed) in subsection III.3). For the From scratch trainings we start the training with randomly initialized weights of the whole model. The training is performed with the AdamW Loshchilov and Hutter (2019) optimizer with a constant learning rate of 0.0001 and weight decay 0.01. Since these trainings, depending on the size of the training dataset, tend to overfit quite quickly, we stop them when the validation loss does not improve for multiple epochs. The threshold of this early stopping is adjusted to the training dataset size, with a patience of 20 epochs for training datasets of 100, 1000 and 10 000 jets, a patience of 10 epochs for training datasets of 100 000 and 1 000 000 jets, and a patience of 5 epochs for trainings with 2 000 000 training jets. For each training dataset size we run 5 trainings with different random seeds and the epoch with the smallest validation loss is chosen for evaluation.
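The early-stopping rule with a dataset-size-dependent patience can be sketched as below. The patience values are those quoted above. Counting patience as the number of consecutive epochs without a new best validation loss is an assumption about the exact criterion, and the names are hypothetical.

```python
# Patience (in epochs) per number of training jets, as quoted in the text.
PATIENCE_BY_SIZE = {
    100: 20, 1_000: 20, 10_000: 20,
    100_000: 10, 1_000_000: 10,
    2_000_000: 5,
}


def run_with_early_stopping(val_losses, patience):
    """Return (best_epoch, last_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs pass without a new
    best validation loss; the best epoch is the one used for evaluation.
    """
    best_epoch, best_loss, stale, last_epoch = None, float("inf"), 0, None
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            best_epoch, best_loss, stale = epoch, loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, last_epoch


# Toy validation curve that overfits after epoch 2.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 0.95, 1.0, 1.1, 1.2, 1.3]
best, last = run_with_early_stopping(losses, PATIENCE_BY_SIZE[2_000_000])
```

With the patience of 5 used for the largest dataset, this toy run stops at epoch 7 and evaluates the model state from epoch 2.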

A.5 Classifier tests

In order to quantify the performance of the generative model, a classifier test using the structure of the DeepSets classifier from subsection A.2 is performed. In this case however, the 3 hidden layers of the particle MLP $\Phi$ all have dimension 10.

A total of 48 000 generated events are combined with equally many reconstructed tokens from the test set of JetClass. The two datasets are combined and shuffled, and a train/val/test split of 0.6/0.2/0.2 is used. The model is trained for 100 epochs with binary cross entropy loss and Adam Kingma and Ba (2017) with learning rate 0.001 as optimizer. The model state with the lowest validation loss is chosen for evaluation. The resulting AUC scores are 0.54 for the model trained on $q/g$ and $t\to bqq^{\prime}$ jets combined and 0.57 for the ones trained on single-type jets.
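To make the numbers concrete, the sketch below computes the sizes of the train/val/test split for the combined sample and evaluates the AUC directly from its definition as a pairwise ranking probability. It is an illustration only: the function names are hypothetical, the classifier itself is omitted, and the quadratic-time AUC is meant for small inputs rather than the full test set.

```python
def split_sizes(n_total, fractions=(0.6, 0.2, 0.2)):
    """Number of events in the train, validation and test subsets."""
    n_train = int(round(n_total * fractions[0]))
    n_val = int(round(n_total * fractions[1]))
    return n_train, n_val, n_total - n_train - n_val


def roc_auc(scores_generated, scores_reference):
    """Probability that a generated jet scores higher than a reference jet.

    Ties count as one half. A value of 0.5 means the classifier output
    carries no information about which sample a jet came from.
    """
    wins = 0.0
    for s_gen in scores_generated:
        for s_ref in scores_reference:
            if s_gen > s_ref:
                wins += 1.0
            elif s_gen == s_ref:
                wins += 0.5
    return wins / (len(scores_generated) * len(scores_reference))


# 48 000 generated plus 48 000 reconstructed events
n_train, n_val, n_test = split_sizes(2 * 48_000)
```

The combined sample of 96 000 events therefore yields 57 600 training, 19 200 validation and 19 200 test events.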

Appendix B Token quality

Additional plots of the jet mass, the jet transverse momentum, as well as the subjettiness ratios are shown in Figure 8, Figure 9, Figure 10 and Figure 11.

Figure 8: Jet mass distribution and resolution for different tokenization approaches and codebook sizes.
Figure 9: Jet $p_{\text{T}}$ distribution and resolution for different tokenization approaches and codebook sizes.
Figure 10: Jet $\tau_{32}$ distribution and resolution for different tokenization approaches and codebook sizes.
Figure 11: Jet $\tau_{32}$ distribution and resolution for different tokenization approaches and codebook sizes.

Appendix C Generative model trained on single-jet data

Figure 12: Comparison of generated jets from the model trained on $q/g$ jets to reconstructed JetClass tokens. The top row shows jet-level distributions, while the bottom row shows distributions on the constituent level.
Figure 13: Comparison of generated jets from the model trained on $t \to bqq'$ jets to reconstructed JetClass tokens. The top row shows jet-level distributions, while the bottom row shows distributions on the constituent level.
Figure 14: Comparison of how well OmniJet-$\alpha$ does on the generative task to the performance of a generative-only model, EPiC-FM. The latter does not use tokenization for the input data, and thus has access to the "real" input values. The main plots show the original JetClass data, the reconstructed JetClass data, as well as the generated events from OmniJet-$\alpha$ and EPiC-FM. The ratio plots show how well the models perform with respect to their respective truths.

To test the generative performance, the generative model was also trained separately on single-jet-type data: 10M jets each of $t \to bqq'$ and $q/g$. For these trainings, no tests of the task transfer to classification were performed.

The result of the $q/g$ jet training is shown in Figure 12, and that of the $t \to bqq'$ jet training in Figure 13. In the $q/g$ case, we see good agreement between the reconstructed tokens and the generated events. However, it seems to be somewhat more difficult for the model to accurately reproduce $\tau_{32}$ for $t \to bqq'$ jets, a difficulty that also appears for this quantity in the combined model (see Figure 6).

It is interesting to compare the result of OmniJet-$\alpha$ with that of a different generative model. EPiC-FM Birk et al. (2023) was the first generative model trained on the JetClass dataset, utilizing flow matching and operating without tokenization. The result of the comparison can be seen in Figure 14. The plots show the JetClass data, the reconstructed JetClass tokens from this work, the EPiC-FM generated events, and the OmniJet-$\alpha$ generated events. We use the more challenging $t \to bqq'$ class for comparison.

The ratio plots under the main plots show the generated events compared to their respective truths: the original JetClass data for EPiC-FM and the reconstructed JetClass tokens for OmniJet-$\alpha$. Hence, the ratios show how well the respective generative models learn to replicate their training data. In general, we see that both models perform well. OmniJet-$\alpha$ has a somewhat higher discrepancy in the tails of all distributions except for the constituent $\eta^{\mathrm{rel}}$ and the number of constituents. The most difficult distribution is the constituent $p_{\mathrm{T}}$, with bumps in the tail, which could also be seen in Figure 13.
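The per-bin ratio behind such a ratio panel can be sketched as below. This is a hedged illustration, not the plotting code used for Figure 14: the helper names are hypothetical, and normalizing each histogram by its sample size before taking the ratio is an assumption. Each model would be compared with its own truth, i.e. EPiC-FM with the original JetClass data and OmniJet-$\alpha$ with the reconstructed tokens.

```python
def histogram(values, edges):
    """Count values in half-open bins [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break  # values outside all bins are ignored
    return counts


def density_ratio(generated, truth, edges):
    """Per-bin ratio of normalized histograms (generated / truth).

    Bins where the truth histogram is empty yield None, since the
    ratio is undefined there. (Hypothetical helper for illustration.)
    """
    if not generated or not truth:
        raise ValueError("both samples must be non-empty")
    gen_counts = histogram(generated, edges)
    truth_counts = histogram(truth, edges)
    ratios = []
    for g, t in zip(gen_counts, truth_counts):
        if t == 0:
            ratios.append(None)
        else:
            ratios.append((g / len(generated)) / (t / len(truth)))
    return ratios
```

A ratio of 1 in a bin means the generated sample reproduces the truth density there; deviations in sparsely populated tail bins are subject to large statistical fluctuations.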