OmniJet- $\alpha$ : The first cross-task foundation model for particle physics

Joschka Birk (joschka.birk@uni-hamburg.de), Anna Hallin (anna.hallin@uni-hamburg.de), Gregor Kasieczka (gregor.kasieczka@uni-hamburg.de)
Institute for Experimental Physics, Universität Hamburg, Luruper Chaussee 149, 22761 Hamburg, Germany

Abstract

Foundation models are multi-dataset and multi-task machine learning methods that once pre-trained can be fine-tuned for a large variety of downstream applications. The successful development of such general-purpose models for physics data would be a major breakthrough as they could improve the achievable physics performance while at the same time drastically reduce the required amount of training time and data.

We report significant progress on this challenge on several fronts. First, a comprehensive set of evaluation methods is introduced to judge the quality of an encoding from physics data into a representation suitable for the autoregressive generation of particle jets with transformer architectures (the common backbone of foundation models). These measures motivate the choice of a higher-fidelity tokenization compared to previous works. Finally, we demonstrate transfer learning between an unsupervised problem (jet generation) and a classic supervised task (jet tagging) with our new OmniJet- $\alpha$ model. This is the first successful transfer between two different and actively studied classes of tasks and constitutes a major step in the building of foundation models for particle physics.

Figure 1: Schematics of the different steps (tokenization, generation, classification) in the OmniJet- $\alpha$ model.

I Introduction

Foundation models are a new class of machine learning models which are trained on broad datasets and problems and are able to generalize to a variety of downstream tasks and datasets Bommasani et al. (2022). Large-language models (LLMs) such as BERT Devlin et al. (2019), BART Lewis et al. (2019), GPT-3 Brown et al. (2020), and LLaMA Touvron et al. (2023) are examples of foundation models for text data while DALL-E 2 Ramesh et al. (2022) is an example of an image-based model.

The benefits of a foundation model for particle physics data would be significant: While machine learning models developed so far typically outperform classical approaches (and often by a large margin) Kasieczka et al. (2019a); Karagiorgi et al. (2022), available statistics for training these models is a constant issue Macaluso and Shih (2018); Qu et al. (2022a). It becomes even more extreme in the case of searches for rare processes, where only a small fraction of simulated events might pass pre-selection criteria. Foundation models on the other hand are pre-trained and need fewer examples to be fine-tuned for a specific task Vigl et al. (2024).

Beyond improving the physics performance by e.g. increasing selection accuracy of classification tasks, foundation models also address the other major issue currently facing particle physics: limited computing resources in the face of an ever increasing amount of data Albrecht et al. (2019); Boehnlein et al. (2022). This problem has already spawned the development of e.g. increasingly high fidelity models for the simulation of calorimeters, as well as techniques for the speeding up of other Monte Carlo simulations Paganini et al. (2018); Buhmann et al. (2021, 2023a); Adelmann et al. (2022); Badger et al. (2023); Hashemi and Krause (2023); Butter et al. (2019); de Oliveira et al. (2017). By allowing the re-use of models across datasets and tasks, foundation models will play an important role in reducing this computational burden. Note that this effect will be further compounded by the potential for optimization in computing operations from one model used across multiple tasks.

Finally, using closely related architectures across different tasks inside one experiment, across experimental collaborations, and in the exchange with the theory community will make results easier to re-interpret ara (2024); Bieringer et al. (2024).

These potential benefits have inspired research into proto-foundation models suitable for particle physics. For example, Dillon et al. (2022a); Favaro et al. (2023); Dillon et al. (2023); Park et al. (2023); Dillon et al. (2022b) investigated how known physical symmetries could be used to learn powerful embeddings of jet data, Benato et al. (2022) showed the versatility of graph-based message passing networks for datasets from different domains of physics, Liu et al. (2023); Salamani et al. (2023) demonstrated conditioning generative models on the geometry of the detector to allow the simultaneous simulation of multiple detectors with one architecture, Dolan and Ore (2022); Beauchesne et al. (2024) used meta-learning for mass-decorrelation and weak-supervision, and Qu et al. (2022a) achieved state-of-the-art performance on the top tagging landscape dataset Kasieczka et al. (2019b) by pre-training on a different dataset Qu et al. (2022b) and transferring the results.

Due to their flexibility demonstrated across language and other domains, transformers Vaswani et al. (2017) are currently the most suitable candidate architecture for building foundation models for applications in particle physics. Taking inspiration from LLMs where sentences are generated autoregressively, recent efforts have demonstrated success with autoregressive generation of particle physics data, for example using the encoder part of the transformer to generate $t\to bqq^{\prime}$ and $q/g$ jets Finke et al. (2023) and using the decoder part of the transformer to generate $Z$ +jets events Butter et al. (2023). Ref. Heinrich et al. (2024) demonstrated how a transformer backbone can be pre-trained using a BERT-like scheme where the model is trained to predict masked out jet constituents, resulting in an improvement of the performance when fine-tuning the backbone (with a new classification head) for jet tagging, especially at small training dataset sizes. Furthermore, Huang et al. (2024) showed how a tokenized detector representation can be used in a BERT-like model for track reconstruction.

In this work, we will explore whether an autoregressive Generative Pretrained Transformer (GPT) model can be used as a foundation model for jet physics. However, the standard GPT constructions are not built to deal with continuous input data, but rather tokenized data. As point clouds are the most versatile representation of physics data Komiske et al. (2019); Buhmann et al. (2023b); Kasieczka et al. (2019a); Buhmann et al. (2023c, a) and can incorporate both event level information, jet substructure, and even low-level detector signals, finding a suitable input transformation for point clouds to tokens is the most pressing problem. Various tokenization strategies have been explored, for example using a simple mapping based on binning the input space in Finke et al. (2023), a Gaussian mixture model in Butter et al. (2023), and using an additional conditional embedding network in Heinrich et al. (2024).

Here, we follow the conditional tokenization strategy from van den Oord et al. (2018); Bao et al. (2022); Heinrich et al. (2024), but first take a step back to verify the quality and trade-offs involved in building these tokens. This will allow us to formulate quality measures to choose a suitable tokenization model, leading to an increase in codebook size from 512 tokens in Heinrich et al. (2024) to 8192 tokens.

Using this representation, we will first demonstrate training a generative model for jets as tokens in an unsupervised way for the JetClass Qu et al. (2022b) dataset. Compared to Finke et al. (2023), the core of our architecture is a transformer-decoder, not a transformer-encoder.

Finally, this allows us to test whether the information encoded in a model that was trained to generate jets can also be transferred to the task of classifying them. Observing such a transfer ability across different classes of tasks — as opposed to transfer between different classification or generation problems — would be a crucial ingredient to building foundation models for physics data, and has not yet been achieved. A graphical representation of this approach is provided in Figure 1. As this is the first prototype of a model to tackle all tasks with jets in particle physics, it is named OmniJet- $\alpha$ .

The rest of the paper is organized as follows: Section II introduces the data as well as the tokenization approach, the generative architecture, and the transfer learning strategy. Next, Section III shows the results of the tokenization study, the generative performance, as well as tests of the transfer learning capabilities of the model. Finally, Section IV summarizes the results and provides a brief outlook.

II Methods and Dataset

II.1 Dataset

All studies are performed using the JetClass dataset Qu et al. (2022b), originally introduced in Qu et al. (2022a). It contains both jet-level and constituent-level features for ten different types of jets initiated by gluons and quarks ( $q/g$ ), top quarks ( $t$ , subdivided by their decay mode into $t\to bqq^{\prime}$ and $t\to b\ell\nu$ ) , as well as $W$ , $Z$ , and $H$ ( $H\to b\bar{b}$ , $H\to c\bar{c}$ , $H\to gg$ , $H\to 4q$ , and $H\to\ell\nu qq^{\prime}$ ) bosons.

Events are simulated using MadGraph5_aMC@NLO Alwall et al. (2014) with parton showering and hadronization done by Pythia Sjöstrand et al. (2015). A simplified detector simulation implemented in Delphes de Favereau et al. (2014) using the CMS detector The CMS Collaboration (2008) card is performed. Constituents are clustered into jets using the anti- $k_{\mathrm{T}}$ algorithm Cacciari et al. (2008) with a distance parameter of $R=0.8$ .

Jets are selected if they have a transverse momentum of $500\,\mathrm{GeV}<p_{\mathrm{T}}^{\mathrm{jet}}<1000\,\mathrm{GeV}$ and a pseudorapidity of $|\eta^{\mathrm{jet}}|<2$ . Additionally, truth-level matching is performed for all classes except $q/g$ and only jets that contain all the decay products of the boson or top quark are included. The resulting dataset contains 100M jets for training, 5M jets for validation, and 20M jets for testing.

In this work, only the kinematic information per particle ( $p_{\text{T}}$ , $\phi$ , $\eta$ ) is used while the particle mass $m$ is approximated as zero. Next, the azimuth angle $\phi$ and the pseudorapidity $\eta$ are pre-processed to be relative to the jet axis (the difference in $\phi$ is signed and rectified to the range $-\pi$ through $\pi$ ; we handle those calculations using the scikit-hep/vector Schreiner et al. (2023) and scikit-hep/awkward Pivarski et al. (2024) libraries):

$\eta^{\mathrm{rel}}=\eta^{\mathrm{particle}}-\eta^{\mathrm{jet}}$ (1)
$\phi^{\mathrm{rel}}=\phi^{\mathrm{particle}}-\phi^{\mathrm{jet}}\,.$ (2)

Finally, we apply the cuts $|\eta^{\mathrm{rel}}|<0.8$ and $|\phi^{\mathrm{rel}}|<0.8$ to remove a very small fraction of low-energy constituents at the periphery and use up to 128 particles per jet.
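The pre-processing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation (which relies on the scikit-hep/vector and scikit-hep/awkward libraries): the function names `wrap_phi` and `preprocess_jet` are hypothetical, and we assume that truncation to 128 particles keeps the constituents in their input order.

```python
import math

def wrap_phi(dphi):
    """Wrap an angle difference into the interval [-pi, pi)."""
    return (dphi + math.pi) % (2.0 * math.pi) - math.pi

def preprocess_jet(constituents, jet_eta, jet_phi, max_particles=128, cut=0.8):
    """Apply Eqs. (1)-(2), the angular cuts, and the particle-count limit.

    constituents: list of (pt, eta, phi) tuples for one jet.
    Returns a list of (pt, eta_rel, phi_rel) tuples.
    """
    selected = []
    for pt, eta, phi in constituents:
        eta_rel = eta - jet_eta
        phi_rel = wrap_phi(phi - jet_phi)  # signed difference, wrapped
        if abs(eta_rel) < cut and abs(phi_rel) < cut:
            selected.append((pt, eta_rel, phi_rel))
    # Assumption: keep the first max_particles constituents in input order.
    return selected[:max_particles]
```

The wrapping matters for jets close to the $\phi=\pm\pi$ boundary, where a naive difference would place nearby constituents almost $2\pi$ away from the jet axis.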

II.2 Jet constituent token creation

We explore three kinds of tokenization approaches: binned, conditional, and unconditional tokenization. In the binned approach Finke et al. (2023), the space of input features is subdivided using a regular grid in all dimensions (e.g. a 21x21x21 grid in three dimensions) and the cells in this grid are enumerated, resulting in one token per cell.
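As a rough illustration of the binned approach, the sketch below maps a three-dimensional feature vector to a single token by enumerating the cells of a regular grid. The feature ranges in `RANGES` and the clipping of out-of-range values are assumptions made for this example and are not taken from the paper.

```python
import math

def bin_index(value, lo, hi, n_bins):
    """Index of the linear bin containing value; out-of-range values are clipped."""
    i = math.floor((value - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)

def binned_token(features, ranges, n_bins=21):
    """Map a feature tuple to one token id by enumerating the grid cells."""
    token = 0
    for value, (lo, hi) in zip(features, ranges):
        token = token * n_bins + bin_index(value, lo, hi, n_bins)
    return token

# Hypothetical feature ranges (eta_rel, phi_rel, pt), chosen only for illustration.
RANGES = [(-0.8, 0.8), (-0.8, 0.8), (0.0, 1000.0)]
```

With 21 bins per dimension the token ids run from 0 to $21^{3}-1$, so every constituent that falls into the same cell receives the same token regardless of the rest of the jet.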

In the unconditional approach, each constituent is tokenized individually using a non-linear mapping, whereas in the conditional approach constituents are encoded and decoded conditioned on each other. We use a Vector Quantized Variational AutoEncoder (VQ-VAE) van den Oord et al. (2018); Bao et al. (2022); Heinrich et al. (2024); Huh et al. (2023) to create a discrete set of jet constituent tokens both for conditional and unconditional tokenization.

The input features for the VQ-VAE are the $\eta^{\text{rel}}$ , $\phi^{\text{rel}}$ and $p_{\text{T}}$ values of the jet constituents. For the conditional tokenization, we use a transformer for both the encoder and the decoder of the VQ-VAE, whereas a simple multi-layer perceptron (MLP) is used for the unconditional tokenization. Details about the different VQ-VAE models used in our studies, as well as details about the preprocessing of the input features can be found in subsection A.1.
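The quantization step at the core of a VQ-VAE can be illustrated as a nearest-neighbour lookup: the encoder output for a constituent is replaced by the closest codebook vector, and the index of that vector serves as the token. The sketch below shows only this lookup, with the encoder and decoder networks omitted; the function names and the toy codebook used in the usage note are hypothetical.

```python
def quantize(latent, codebook):
    """Return the index of the codebook vector closest to latent (squared L2 distance)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(latent, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def tokenize(latents, codebook):
    """Map each latent vector of a jet to its token (codebook index)."""
    return [quantize(z, codebook) for z in latents]
```

For a toy codebook `[(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]`, the latent `(0.9, 1.2)` is assigned token 1. In the conditional case the latent vectors depend on the whole jet because the encoder is a transformer, while the lookup itself is unchanged.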

II.3 Transformer backbone

The core of OmniJet- $\alpha$ is a transformer backbone based on the GPT transformer decoder model first introduced in Radford et al. (2018). However, since jet constituents are permutation invariant, we do not employ the positional encoding usually used in LLMs. As input, the transformer backbone receives the generated tokens from the VQ-VAE, complemented with a start and stop token. A jet with $n$ constituents is then represented as

$\left(\mathtt{start\_token},x_{1},...,x_{n-1},x_{n},\mathtt{stop\_token}\right)$ (3)

where $x_{i}$ are the tokens.

The transformer backbone itself consists of an embedding layer followed by a series of GPT blocks. Each GPT block contains a multihead attention block, followed by a residual addition, layer norm Ba et al. (2016), two linear layers with a ReLU in between, another residual addition and a final layer norm. Since this is an autoregressive model, a causal mask is passed together with the input data to the multihead attention block to prevent the model from seeing future tokens. The architecture is illustrated in Figure 2.
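To make the role of the causal mask concrete, the following sketch builds the mask for a sequence of length $n$ and applies it in a row-wise softmax over attention scores. It is a single-head toy version written in plain Python for readability; the function names are hypothetical and the actual model uses multihead attention.

```python
import math

def causal_mask(n):
    """mask[i][j] is True if position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores; masked-out positions receive weight 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        m = max(s for s, ok in zip(row, allowed) if ok)  # for numerical stability
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform scores, the first position attends only to itself, the second splits its attention equally between the first two positions, and so on, so no token can receive information from tokens that come later in the sequence.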

The output from the transformer backbone is passed to a task specific head, either for generation or classification. The generative head is a single linear layer, while the classification head consists of a linear layer followed by ReLU, a sum over the constituent dimension, and a last linear layer with softmax activation function. The model is trained with $n=8$ heads in the multi-head attention block and $N=3$ GPT blocks. No dropout is used.
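The structure of the classification head described above (linear layer, ReLU, sum over constituents, linear layer, softmax) can be written out as a small sketch. The weights, dimensions, and function names here are placeholders for illustration only, and the code is not the authors' implementation.

```python
import math

def linear(x, weight, bias):
    """y = W x + b for one vector x; weight is given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def classification_head(hidden_states, w1, b1, w2, b2):
    """hidden_states: one backbone output vector per constituent token."""
    # First linear layer followed by ReLU, applied to every constituent.
    activated = [[max(0.0, v) for v in linear(h, w1, b1)] for h in hidden_states]
    # Sum over the constituent dimension.
    pooled = [sum(column) for column in zip(*activated)]
    # Final linear layer followed by a softmax over the classes.
    logits = linear(pooled, w2, b2)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the pooling is a sum, the head itself does not depend on the order in which the per-constituent vectors are supplied.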

Once the generative model, i.e. the transformer backbone together with the generative head, has been trained on the tokenized data, it can generate new data autoregressively. The model has learned the probability distribution for a token $x_{j}$ , given a sequence of tokens:

$p\left(x_{j}|x_{j-1},...,x_{1},\mathtt{start\_token}\right).$ (4)

The model is provided with the start token, and then samples this distribution to sequentially generate new tokens. Generation is repeated until either the stop token is generated or the maximum sequence length (which is set to be equal to 128) is reached. The generated token sequences are then fed to the VQ-VAE decoder, which maps them into physical space for further evaluation.
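The sampling loop described above can be sketched as follows. The callable `next_token_probs` stands in for the trained backbone plus generative head and is hypothetical, as are the other names; we also assume here that the maximum length of 128 counts constituent tokens only, excluding the start and stop tokens.

```python
import random

def generate_jet(next_token_probs, start_token, stop_token, max_length=128, rng=None):
    """Sample tokens until the stop token appears or max_length is reached.

    next_token_probs(sequence) must return a dict {token: probability} for
    the next token given the sequence so far (start token included).
    """
    rng = rng or random.Random()
    sequence = [start_token]
    while len(sequence) - 1 < max_length:
        probs = next_token_probs(sequence)
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
        if token == stop_token:
            break
        sequence.append(token)
    return sequence[1:]  # constituent tokens only, start token removed
```

The returned token list is what would then be handed to the VQ-VAE decoder to obtain constituent features in physical space.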

The classification task can be performed either from scratch, using randomly initialized weights for both the transformer backbone and the classification head, or by fine-tuning the generative model. In the fine-tuning case, the initial weights of the transformer backbone are loaded from the generative model.

III Results

III.1 Token quality

We first inspect how well the tokens cover the space. An illustration of the conditional and unconditional tokens in physical space (i.e. their corresponding $\eta^{\text{rel}}$ , $\phi^{\text{rel}}$ and $p_{\text{T}}$ values) is shown in Figure 3 for the different tokenization approaches and different codebook sizes. In the unconditional case, as well as in the binning approach, the reconstruction of a token is always the same, independent of the other tokens in the jet, leading to discrete points in physical space. In the conditional case, however, the reconstruction of a token is by construction affected by the other tokens inside this jet. To visualize the spread of each conditional token in physical space, we reconstruct each token 500 times conditioned on 50 randomly chosen tokens. Each of those reconstructions is shown in the scatter plots in Figure 3, where the different reconstructions of the same token are drawn in the same color. We notice that the reconstruction of each token only changes slightly when conditioned on other tokens. Thus, the 500 different reconstructions of a conditional token show up in Figure 3 as a blob in physical space. This already shows that the conditional tokenization covers a significantly larger volume in reconstruction space, while the unconditional tokens can only be reconstructed to distinct points in reconstruction space. We note that our approach of reconstructing each token 500 times conditioned on randomly chosen other tokens does not necessarily represent the reconstructed values of actual jet constituents, as it is possible that those combinations of tokens would not appear for real jets. However, it illustrates how much the reconstruction of a token can change due to the conditioning on the other tokens.

Next, we consider distributions at the jet level to judge the quality of the tokenization. For this and the following studies, jets are mapped into token space, and then mapped back to physical space to quantify the loss in information. Figure 4 (left) shows the jet mass combined for all classes in the dataset, as was done in Heinrich et al. (2024). As observed there, the conditional tokenization with 512 tokens already shows a good agreement between initial and reconstructed mass, with worse performance for the unconditional tokenization.

However, this inclusive distribution might hide differences at the level of individual classes and jets. We therefore also consider the difference in mass for jets before and after tokenization and reconstruction, shown in Figure 4 (center) for $t\to bqq^{\prime}$ jets. The unconditional tokenization leads to a systematic shift of approximately 15 GeV, while the conditional tokenization is well centered at zero. Increasing the codebook size from 512 to 8192 tokens substantially improves the resolution. This behavior is even more pronounced when considering the $N$ -subjettiness Thaler and Van Tilburg (2011) ratio $\tau_{32}$ in Figure 4 (right). Both conditional and unconditional tokenization with 512 tokens result in shifted distributions, while the larger codebook size of 8192 recovers a peak close to zero. Furthermore, while the mass resolution of the conditional tokens is already centered close to zero when using a codebook size of 512, the width of the distribution improves drastically from $\sigma_{512}^{\mathrm{cond}}=8.3$ GeV to $\sigma_{8192}^{\mathrm{cond}}=4.0$ GeV when moving from a codebook size of 512 to a codebook size of 8192, where $\sigma$ corresponds to the standard deviation obtained from fitting a normal distribution to the mass resolution histograms. A similar behavior can be observed for other classes, where in some cases, depending on the jet observable and the jet type, the effect is even more extreme. Finally, the binning approach with 21x21x21 linear bins in the input features of our VQ-VAE comes close to the mass resolution of the conditional tokens with a codebook size of 8192, while the resolution of the subjettiness ratio is notably worse. Moreover, while this binning approach with 9261 tokens leads to reasonable resolution of the $t\to bqq^{\prime}$ jets shown in Figure 4, we found quite drastic mismodeling for $t\to b\ell\nu$ and $H\to\ell\nu qq^{\prime}$ jets with such small codebook sizes. (As expected, the binning approach automatically reaches good resolution when the codebook size, i.e. the number of bins, is increased to a sufficiently large number. We found that around $64\,000$ tokens, corresponding to a 40x40x40 grid, offer similar resolution as conditional tokenization with a codebook size of 8192.) The distributions and the corresponding resolutions of the jet mass, jet $p_{\text{T}}$ , as well as the subjettiness ratios $\tau_{32}$ and $\tau_{21}$ are shown for all ten jet types individually in Appendix B. Overall, the highest fidelity is achieved by conditional tokenization with a marked improvement from increasing the codebook size from 512 to 8192.
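The jet-level mass comparison can be illustrated with a short sketch that computes the jet mass from massless constituents and summarizes the difference between reconstructed and original jets. Two simplifications are assumed here and are not taken from the paper: the sign convention (reconstructed minus original), and the use of the sample mean and standard deviation in place of a normal-distribution fit to the histogram. The function names are hypothetical.

```python
import math

def jet_mass(constituents):
    """Invariant mass of a jet built from massless constituents (pt, eta, phi)."""
    e = px = py = pz = 0.0
    for pt, eta, phi in constituents:
        px += pt * math.cos(phi)
        py += pt * math.sin(phi)
        pz += pt * math.sinh(eta)
        e += pt * math.cosh(eta)  # E = |p| for a massless particle
    m2 = e * e - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # guard against small negative rounding errors

def mass_resolution(original_jets, reconstructed_jets):
    """Mean and standard deviation of m_reco - m_orig over a set of jets."""
    diffs = [jet_mass(r) - jet_mass(o) for o, r in zip(original_jets, reconstructed_jets)]
    mean = sum(diffs) / len(diffs)
    variance = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return mean, math.sqrt(variance)
```

In this picture a non-zero mean corresponds to a systematic shift of the reconstructed mass, while the standard deviation plays the role of the resolution $\sigma$.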

Finally, we quantify the information loss that comes with the tokenization by training multi-class classifiers to distinguish between the ten jet types present in the dataset. The classifiers are trained once with the original inputs, and once with the inputs after undergoing tokenization and subsequent reconstruction back into physical space. This procedure allows a direct comparison of how the loss in resolution affects classification performance. We utilize two standard classifier architectures: DeepSets Zaheer et al. (2018); Komiske et al. (2019) (i.e. without particle interactions) and Transformer Vaswani et al. (2017); Shleifer et al. (2021) (i.e. with particle interactions) and perform this study for four different codebook sizes from 512 to 8192 tokens for the conditional tokenization approach. Details about the classifier trainings can be found in subsection A.2. Note that this approach is similar in spirit to the classifier metric proposed in Krause and Shih (2023); Das et al. (2023) but tests a multi-class classifier trained on these samples individually, as opposed to judging how well a classifier might distinguish original and reconstructed samples. This is necessary as e.g. points at fixed positions would be distinguishable from the original with close to perfect accuracy, rendering the test less useful.

The resulting classifier accuracy for the two different architectures is shown in Figure 5. As seen in previous studies of resolution, we observe an increase of token quality as we increase the size of the VQ-VAE codebook. Furthermore, we see that the classifier performance starts to plateau with codebook sizes larger than 4096. However, even at the largest codebook size, a gap to the performance on original particles remains, motivating future work into building more accurate tokenization schemes.

For the remaining studies we will utilize a codebook size of 8192 with conditional tokens as this leads to the overall best performance and fidelity.

III.2 Jet generation

After training the transformer backbone with the generative head, it can be used for autoregressive generation as described in subsection II.3. The model was trained on three separate datasets: $t\to bqq^{\prime}$ only, $q/g$ only, and $q/g$ and $t\to bqq^{\prime}$ combined. This section will describe the combined version, since this is the model that is used for transfer learning. For a discussion of single-jet type generative results, including a comparison to the EPiC-FM method of Birk et al. (2023), see Appendix C. In total, 48 000 events were generated from the combined model. These events contain tokens, which are then decoded back to physical space using the VQ-VAE decoder.

A comparison to reconstructed JetClass tokens can be seen in Figure 6. We observe that in general the model is able to match the truth level tokens well. We note that the tail of the $p_{\mathrm{T}}$ spectrum of both the generated constituents and the reconstructed JetClass tokens contains bumps distributed around discrete values, which is consistent with our inspection of the reconstruction space shown in Figure 3.

In order to quantify the performance, a classifier test (see subsection A.5 for details) is performed to distinguish generated events from reconstructed JetClass tokens. The test results in an AUC score of 0.54.
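For readers unfamiliar with the metric, the AUC of such a classifier test can be computed directly from the classifier scores as the probability that a randomly chosen reference jet receives a higher score than a randomly chosen generated jet. The sketch below uses this rank-based definition; it does not reproduce the classifier setup of subsection A.5, and the function name is hypothetical.

```python
def roc_auc(scores_generated, scores_reference):
    """AUC as the probability that a reference score exceeds a generated one.

    Ties count one half. The cost is quadratic in the sample size, which is
    acceptable for an illustration.
    """
    wins = 0.0
    for r in scores_reference:
        for g in scores_generated:
            if r > g:
                wins += 1.0
            elif r == g:
                wins += 0.5
    return wins / (len(scores_reference) * len(scores_generated))
```

An AUC of 0.5 means the classifier cannot tell the two samples apart, while 1.0 means perfect separation, so a value of 0.54 indicates that generated and reconstructed jets are hard to distinguish.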

III.3 Transfer learning from generation to classification

To evaluate the ability of the model to generalize from generating jets to classifying them, we focus on the task of hadronic top quark tagging Kasieczka et al. (2019a), i.e. distinguishing $t\to bqq^{\prime}$ and $q/g$ jets on the JetClass Qu et al. (2022b) dataset. For this test, the Next-token prediction head is replaced by a Classification head while the transformer backbone is retained. We compare three training strategies: training the full architecture with randomly initialized weights (termed from scratch) which does not use transfer-learning and corresponds to the baseline, and two versions of fine-tuning the model obtained in the generative training step. In the Fine-tuning run, both the pre-trained backbone weights and the randomly initialized classification task head weights are allowed to float in the training, while in Fine-tuning (backbone fixed) only the classification task head is allowed to change.

The results of these training runs are presented in Figure 7 as a function of the number of training examples provided to the model. We observe a significant gain in classification accuracy of both fine-tuning approaches compared to the baseline, leading to up to 15 percentage points higher accuracy for small numbers of training jets, and outperforming the baseline by a few percentage points at the highest training sample size. The difference between the two fine-tuning strategies is relatively small, with the more open training performing slightly better. Put differently, the generative pre-trained model reaches an accuracy of around 84% with only 100 training examples, whereas the model trained from scratch requires 10 000 examples to reach the same accuracy.

IV Conclusion

Foundation models for physics data are an enticing promise: Trained on large amounts of data and tasks, they are expected to easily generalize to any down-stream problem, saving countless hours of human and compute time. In this paper we have taken crucial steps towards the creation of such models.

First, we expect learned representations of data to play a key role as inputs to foundation models. Representations might be continuous and rely on symmetries or learn a mapping to a discrete space as done here with tokenization. Note that while using data raw — i.e. without prior mapping into a representation space — might be possible when only considering a narrow range of similar datasets, it is inherently limiting when data from different sources or with different initial dimensionalities are to be considered.

Whatever representation is used, it will be important to understand and minimize the loss of information inherent in this transformation. This problem is especially important for downstream uses such as classification and regression tasks, as the loss of information can directly limit the achievable accuracy or resolution. This work introduced a set of criteria — both distribution and classifier based — that can be used to assess the quality of any representation.

Using these metrics, we found a marked increase in the resolution of relevant observables like mass and jet substructure by using a codebook size of 8192 with conditional tokenization over binning-based approaches, unconditional tokenization, and conditional tokenization with smaller codebooks. An additional classifier test further confirmed this observation.

Next, we demonstrated the training of an autoregressive generative model for jet constituents, specifically for $q/g$ and $t\to bqq^{\prime}$ jets from the JetClass dataset Qu et al. (2022b). The generated distributions agree well with the ground truth, both for global jet kinematics, jet substructure, and individual constituent features. We note that while our model is the first token-based generator of JetClass-like examples, more extensive studies of its generative fidelity when increasing the feature-space and detailed comparison to prior non-token-based results on this dataset Birk et al. (2023) are left for future work.

Most importantly, we report the generalization capability of OmniJet- $\alpha$ from learning to generate jets in an unsupervised way, to the supervised classification between $t\to bqq^{\prime}$ and $q/g$ jets. Overall, the fine-tuned model outperformed training from scratch for all values of training examples, often by a significant margin. For example, for 1000 training jets, the fine-tuned model achieves an accuracy of approximately 90%, compared to around 74% for the freshly initialized model. While the two approaches seem to converge, even at the highest training size of 2 million jets the fine-tuned approach maintains a lead of a few percentage points. Finally, it provides a non-trivial classification accuracy of 84% even when trained on only 100 jets, emphasizing the value of foundation models for problems with few available labeled training examples. While other types of transfer have been demonstrated previously, this is the first time that the unification across the two big classes of tasks, classification and generation, has been achieved.

Of course, this work is only one step in building overarching foundation models. While it is the first model that achieves both classification and generation, it is not the most performant for either of these tasks. However, strategies to increase the performance exist and will be integrated. For example, the representation quality needs to be improved, possible gains from masked pre-training have to be evaluated, architecture and training data need to be scaled up, and more extensive studies of the generalization capabilities, including training and testing on additional tasks, need to be performed. In the medium term, strategies need to be found to align diverse datasets as well as to integrate pre-trained foundation models in community workflows. Nevertheless, the potential benefits in physics performance and compute efficiency glimpsed in this and other works make this a worthy endeavor.

Acknowledgements

We thank David Shih, Michael Krämer, Michael Kagan, Frank Gaede, Sarah Heim, and Judith Katzy for stimulating discussions of foundation models for physics data and Erik Buhmann for valuable comments on the manuscript.

The authors acknowledge support by the Deutsche Forschungsgemeinschaft under Germany’s Excellence Strategy – EXC 2121 Quantum Universe – 390833306, and under PUNCH4NFDI – project number 460248186. This research was supported in part through the Maxwell computational resources operated at Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany.

References

Bommasani et al. (2022) Rishi Bommasani et al., “On the opportunities and risks of foundation models,” (2022), arXiv:2108.07258 [cs.LG] .
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” (2019), arXiv:1810.04805 [cs.CL] .
Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer, “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension,” (2019), arXiv:1910.13461 [cs.CL] .
Brown et al. (2020) Tom B. Brown et al., “Language models are few-shot learners,” (2020), arXiv:2005.14165 [cs.CL] .
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “LLaMA: Open and Efficient Foundation Language Models,” (2023), arXiv:2302.13971 [cs.CL] .
Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen, “Hierarchical Text-Conditional Image Generation with CLIP Latents,” (2022), arXiv:2204.06125 [cs.CV] .
Kasieczka et al. (2019a) Gregor Kasieczka et al., “The machine learning landscape of top taggers,” SciPost Physics 7 (2019a), 10.21468/scipostphys.7.1.014.
Karagiorgi et al. (2022) Georgia Karagiorgi, Gregor Kasieczka, Scott Kravitz, Benjamin Nachman, and David Shih, “Machine learning in the search for new fundamental physics,” Nature Rev. Phys. 4, 399–412 (2022).
Macaluso and Shih (2018) Sebastian Macaluso and David Shih, “Pulling out all the tops with computer vision and deep learning,” Journal of High Energy Physics 2018 (2018), 10.1007/jhep10(2018)121.
Qu et al. (2022a) Huilin Qu, Congqiao Li, and Sitian Qian, “Particle Transformer for Jet Tagging,” in Proceedings of the 39th International Conference on Machine Learning (2022) pp. 18281–18292, arXiv:2202.03772 [hep-ph] .
Vigl et al. (2024) Matthias Vigl, Nicole Hartman, and Lukas Heinrich, “Finetuning Foundation Models for Joint Analysis Optimization,” (2024), arXiv:2401.13536 [hep-ex] .
Albrecht et al. (2019) Johannes Albrecht et al. (HEP Software Foundation), “A Roadmap for HEP Software and Computing R&D for the 2020s,” Comput. Softw. Big Sci. 3, 7 (2019), arXiv:1712.06982 [physics.comp-ph] .
Boehnlein et al. (2022) Amber Boehnlein et al., “HL-LHC Software and Computing Review Panel Report,” (2022).
Paganini et al. (2018) Michela Paganini, Luke de Oliveira, and Benjamin Nachman, “Accelerating Science with Generative Adversarial Networks: An Application to 3D Particle Showers in Multilayer Calorimeters,” Phys. Rev. Lett. 120, 042003 (2018), arXiv:1705.02355 [hep-ex] .
Buhmann et al. (2021) Erik Buhmann, Sascha Diefenbacher, Engin Eren, Frank Gaede, Gregor Kasieczka, Anatolii Korol, and Katja Krüger, “Getting High: High Fidelity Simulation of High Granularity Calorimeters with High Speed,” Comput. Softw. Big Sci. 5, 13 (2021), arXiv:2005.05334 [physics.ins-det] .
Buhmann et al. (2023a) Erik Buhmann, Frank Gaede, Gregor Kasieczka, Anatolii Korol, William Korcari, Katja Krüger, and Peter McKeown, “CaloClouds II: Ultra-Fast Geometry-Independent Highly-Granular Calorimeter Simulation,” (2023a), arXiv:2309.05704 [physics.ins-det] .
Adelmann et al. (2022) Andreas Adelmann et al., “New directions for surrogate models and differentiable programming for High Energy Physics detector simulation,” in Snowmass 2021 (2022) arXiv:2203.08806 [hep-ph] .
Badger et al. (2023) Simon Badger et al., “Machine learning and LHC event generation,” SciPost Phys. 14, 079 (2023), arXiv:2203.07460 [hep-ph] .
Hashemi and Krause (2023) Hosein Hashemi and Claudius Krause, “Deep Generative Models for Detector Signature Simulation: An Analytical Taxonomy,” (2023), arXiv:2312.09597 [physics.ins-det] .
Butter et al. (2019) Anja Butter, Tilman Plehn, and Ramon Winterhalder, “How to GAN LHC Events,” SciPost Phys. 7, 075 (2019), arXiv:1907.03764 [hep-ph] .
de Oliveira et al. (2017) Luke de Oliveira, Michela Paganini, and Benjamin Nachman, “Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis,” Computing and Software for Big Science 1 (2017), 10.1007/s41781-017-0004-6.
Araz et al. (2024) Jack Y. Araz, Andy Buckley, Gregor Kasieczka, Jan Kieseler, Sabine Kraml, Anders Kvellestad, Andre Lessa, Tomasz Procter, Are Raklev, Humberto Reyes-Gonzalez, Krzysztof Rolbiecki, Sezen Sekmen, and Gokhan Unel, “Les Houches guide to reusable ML models in LHC analyses,” (2024), arXiv:2312.14575 [hep-ph] .
Bieringer et al. (2024) Sebastian Bieringer, Gregor Kasieczka, Jan Kieseler, and Mathias Trabs, “Classifier Surrogates: Sharing AI-based Searches with the World,” (2024), arXiv:2402.15558 [hep-ph] .
Dillon et al. (2022a) Barry M. Dillon, Gregor Kasieczka, Hans Olischlager, Tilman Plehn, Peter Sorrenson, and Lorenz Vogel, “Symmetries, safety, and self-supervision,” SciPost Phys. 12, 188 (2022a), arXiv:2108.04253 [hep-ph] .
Favaro et al. (2023) Luigi Favaro, Michael Krämer, Tanmoy Modak, Tilman Plehn, and Jan Rüschkamp, “Semi-visible jets, energy-based models, and self-supervision,” (2023), arXiv:2312.03067 [hep-ph] .
Dillon et al. (2023) Barry M. Dillon, Luigi Favaro, Friedrich Feiden, Tanmoy Modak, and Tilman Plehn, “Anomalies, Representations, and Self-Supervision,” (2023), arXiv:2301.04660 [hep-ph] .
Park et al. (2023) Sang Eon Park, Philip Harris, and Bryan Ostdiek, “Neural embedding: learning the embedding of the manifold of physics data,” JHEP 07, 108 (2023), arXiv:2208.05484 [hep-ph] .
Dillon et al. (2022b) Barry M. Dillon, Radha Mastandrea, and Benjamin Nachman, “Self-supervised anomaly detection for new physics,” Phys. Rev. D 106, 056005 (2022b), arXiv:2205.10380 [hep-ph] .
Benato et al. (2022) Lisa Benato et al., “Shared Data and Algorithms for Deep Learning in Fundamental Physics,” Comput. Softw. Big Sci. 6, 9 (2022), arXiv:2107.00656 [cs.LG] .
Liu et al. (2023) Junze Liu, Aishik Ghosh, Dylan Smith, Pierre Baldi, and Daniel Whiteson, “Generalizing to new geometries with Geometry-Aware Autoregressive Models (GAAMs) for fast calorimeter simulation,” JINST 18, P11003 (2023), arXiv:2305.11531 [physics.ins-det] .
Salamani et al. (2023) Dalila Salamani, Anna Zaborowska, and Witold Pokorski, “MetaHEP: Meta learning for fast shower simulation of high energy physics experiments,” Phys. Lett. B 844, 138079 (2023).
Dolan and Ore (2022) Matthew J. Dolan and Ayodele Ore, “Metalearning and data augmentation for mass-generalized jet taggers,” Phys. Rev. D 105, 094030 (2022), arXiv:2111.06047 [hep-ph] .
Beauchesne et al. (2024) Hugues Beauchesne, Zong-En Chen, and Cheng-Wei Chiang, “Improving the performance of weak supervision searches using transfer and meta-learning,” JHEP 02, 138 (2024), arXiv:2312.06152 [hep-ph] .
Kasieczka et al. (2019b) Gregor Kasieczka, Tilman Plehn, Jennifer Thompson, and Michael Russel, “Top quark tagging reference dataset,” (2019b).
Qu et al. (2022b) Huilin Qu, Congqiao Li, and Sitian Qian, “JetClass: A Large-Scale Dataset for Deep Learning in Jet Physics,” (2022b).
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention Is All You Need,” in 31st International Conference on Neural Information Processing Systems (2017) arXiv:1706.03762 [cs.CL] .
Finke et al. (2023) Thorben Finke, Michael Krämer, Alexander Mück, and Jan Tönshoff, “Learning the language of QCD jets with transformers,” JHEP 06, 184 (2023), arXiv:2303.07364 [hep-ph] .
Butter et al. (2023) Anja Butter, Nathan Huetsch, Sofia Palacios Schweitzer, Tilman Plehn, Peter Sorrenson, and Jonas Spinner, “Jet Diffusion versus JetGPT – Modern Networks for the LHC,” (2023), arXiv:2305.10475 [hep-ph] .
Heinrich et al. (2024) Lukas Heinrich, Tobias Golling, Michael Kagan, Samuel Klein, Matthew Leigh, Margarita Osadchy, and John Andrew Raine, “Masked particle modeling on sets: Towards self-supervised high energy physics foundation models,” (2024), arXiv:2401.13537 [hep-ph] .
Huang et al. (2024) Andris Huang, Yash Melkani, Paolo Calafiura, Alina Lazar, Daniel Thomas Murnane, Minh-Tuan Pham, and Xiangyang Ju, “A Language Model for Particle Tracking,” in Connecting The Dots 2023 (2024) arXiv:2402.10239 [hep-ph] .
Komiske et al. (2019) Patrick T. Komiske, Eric M. Metodiev, and Jesse Thaler, “Energy flow networks: deep sets for particle jets,” Journal of High Energy Physics 2019 (2019), 10.1007/jhep01(2019)121.
Buhmann et al. (2023b) Erik Buhmann, Gregor Kasieczka, and Jesse Thaler, “EPiC-GAN: Equivariant Point Cloud Generation for Particle Jets,” (2023b), arXiv:2301.08128 [hep-ph] .
Buhmann et al. (2023c) Erik Buhmann, Sascha Diefenbacher, Engin Eren, Frank Gaede, Gregor Kasieczka, Anatolii Korol, William Korcari, Katja Krüger, and Peter McKeown, “CaloClouds: fast geometry-independent highly-granular calorimeter simulation,” JINST 18, P11025 (2023c), arXiv:2305.04847 [physics.ins-det] .
van den Oord et al. (2018) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu, “Neural discrete representation learning,” (2018), arXiv:1711.00937 [cs.LG] .
Bao et al. (2022) Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei, “BEiT: BERT Pre-Training of Image Transformers,” (2022), arXiv:2106.08254 [cs.CV] .
Alwall et al. (2014) J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H.-S. Shao, T. Stelzer, P. Torrielli, and M. Zaro, “The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations,” Journal of High Energy Physics 2014 (2014), 10.1007/jhep07(2014)079.
Sjöstrand et al. (2015) Torbjörn Sjöstrand, Stefan Ask, Jesper R. Christiansen, Richard Corke, Nishita Desai, Philip Ilten, Stephen Mrenna, Stefan Prestel, Christine O. Rasmussen, and Peter Z. Skands, “An introduction to PYTHIA 8.2,” Computer Physics Communications 191, 159–177 (2015).
de Favereau et al. (2014) J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, “DELPHES 3: a modular framework for fast simulation of a generic collider experiment,” Journal of High Energy Physics 2014 (2014), 10.1007/jhep02(2014)057.
The CMS Collaboration (2008) The CMS Collaboration, “The CMS experiment at the CERN LHC,” JINST 3, S08004 (2008).
Cacciari et al. (2008) Matteo Cacciari, Gavin P Salam, and Gregory Soyez, “The anti-kt jet clustering algorithm,” Journal of High Energy Physics 2008, 063–063 (2008).
Schreiner et al. (2023) Henry Schreiner, Jim Pivarski, and Saransh Chopra, “vector,” (2023).
Pivarski et al. (2024) Jim Pivarski, Ianna Osborne, Ioana Ifrim, Henry Schreiner, Angus Hollands, Anish Biswas, Pratyush Das, Santam Roy Choudhury, Nicholas Smith, and Manasvi Goyal, “Awkward Array,” (2024).
Huh et al. (2023) Minyoung Huh, Brian Cheung, Pulkit Agrawal, and Phillip Isola, “Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks,” (2023), arXiv:2305.08842 [cs.LG] .
Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, “Improving language understanding by generative pre-training,” (2018).
Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, “Layer normalization,” (2016), arXiv:1607.06450 [stat.ML] .
Thaler and Van Tilburg (2011) Jesse Thaler and Ken Van Tilburg, “Identifying Boosted Objects with N-subjettiness,” JHEP 03, 015 (2011), arXiv:1011.2268 [hep-ph] .
Zaheer et al. (2018) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola, “Deep Sets,” (2018), arXiv:1703.06114 [cs.LG] .
Shleifer et al. (2021) Sam Shleifer, Jason Weston, and Myle Ott, “Normformer: Improved transformer pretraining with extra normalization,” (2021), arXiv:2110.09456 [cs.CL] .
Krause and Shih (2023) Claudius Krause and David Shih, “Fast and accurate simulations of calorimeter showers with normalizing flows,” Phys. Rev. D 107, 113003 (2023), arXiv:2106.05285 [physics.ins-det] .
Das et al. (2023) Ranit Das, Luigi Favaro, Theo Heimel, Claudius Krause, Tilman Plehn, and David Shih, “How to Understand Limitations of Generative Networks,” (2023), arXiv:2305.16774 [hep-ph] .
Birk et al. (2023) Joschka Birk, Erik Buhmann, Cedric Ewen, Gregor Kasieczka, and David Shih, “Flow Matching Beyond Kinematics: Generating Jets with Particle-ID and Trajectory Displacement Information,” (2023), arXiv:2312.00123 [hep-ph] .
Paszke et al. (2019) Adam Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019) pp. 8024–8035.
Falcon and team (2024) William Falcon and The PyTorch Lightning team, “Pytorch lightning,” (2024).
Huh (2022) Minyoung Huh, “vqtorch: PyTorch package for vector quantization,” https://github.com/minyoungg/vqtorch (2022).
Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter, “Decoupled Weight Decay Regularization,” (2019), arXiv:1711.05101 [cs.LG] .
Smith (2018) Leslie N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay,” (2018), arXiv:1803.09820 [cs.LG] .
Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” (2017), arXiv:1412.6980 [cs.LG] .

Appendix A Model details and hyperparameters

A.1 VQ-VAE for token creation

Both the $\eta^{\text{rel}}$ and the $\phi^{\text{rel}}$ values are scaled down by a factor of 3. The transverse momentum of the jet constituents is first transformed using the natural logarithm and subsequently shifted by -1.8. The tokenization was also done without the log transform of the $p_{\text{T}}$ , and was found to perform similarly. However, the logarithm transformation has the advantage that it automatically avoids negative $p_{\text{T}}$ values, which is why we choose to use the log-transformed $p_{\text{T}}$ . The conditional and unconditional tokenization only differ in the architecture of the encoder and decoder of the VQ-VAE.
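The feature scaling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: "scaled down by a factor of 3" is read as division by 3, "shifted by -1.8" is read as adding -1.8 after the logarithm, and the function names `preprocess` and `postprocess` are hypothetical, not taken from the authors' code.

```python
import math

ETA_PHI_SCALE = 3.0  # assumption: "scaled down by a factor of 3" means division by 3
LOG_PT_SHIFT = -1.8  # assumption: "shifted by -1.8" means adding -1.8 after the log


def preprocess(pt, eta_rel, phi_rel):
    """Map raw constituent features to the scaled tokenizer inputs."""
    return (
        math.log(pt) + LOG_PT_SHIFT,
        eta_rel / ETA_PHI_SCALE,
        phi_rel / ETA_PHI_SCALE,
    )


def postprocess(log_pt_scaled, eta_scaled, phi_scaled):
    """Invert preprocess; the exponential always yields a positive pT."""
    return (
        math.exp(log_pt_scaled - LOG_PT_SHIFT),
        eta_scaled * ETA_PHI_SCALE,
        phi_scaled * ETA_PHI_SCALE,
    )
```

Because the inverse transform ends in an exponential, any value the decoder reconstructs in the scaled space maps back to a strictly positive $p_{\text{T}}$, which is the advantage of the log transform noted in the text.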

Training for the VQ-VAE is implemented in PyTorch Paszke et al. (2019) and PyTorch Lightning Falcon and team (2024).

The model architecture of the VQ-VAE encoder and decoder in the conditional tokenization approach is similar to Heinrich et al. (2024) with a different set of hyperparameters. We use 4 NormFormer Shleifer et al. (2021) blocks with an embedding dimension of 128 and 8 heads in the MHA for both the encoder and the decoder. We use the vqtorch library Huh (2022); Huh et al. (2023) to implement the vector quantization layer with the dimension of the codebook vectors set to 4.

The mean squared error (MSE) between the tensor of the initial particle features and the reconstructed features is used as the task loss $\mathcal{L}_{\mathrm{task}}$ . The total loss is then set to

$$\mathcal{L}=\mathcal{L}_{\mathrm{task}}+\alpha\cdot\mathcal{L}_{\mathrm{commit}}\qquad(5)$$

with $\alpha=10$. An affine transformation is used for a joint transformation of all codes, and dead codes are replaced every 10 iterations. The parameter $\beta$, which trades off the importance of updating the embeddings from the encoder $z_{e}$ and the code vectors $z_{q}$, is set to $\beta=0.9$. Lastly, we use a synchronized update rule Huh et al. (2023); Huh (2022) with $\nu=1$.
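To make the structure of the total loss concrete, the following plain-Python sketch evaluates its forward value for a toy example. It is not the vqtorch implementation: the commitment term is assumed here to be the mean squared distance between the encoder outputs $z_{e}$ and their assigned code vectors $z_{q}$, and the gradient-related settings ($\beta$, the affine transformation, dead-code replacement, the synchronized update) are not modelled. The helper names are hypothetical.

```python
ALPHA = 10.0  # weight of the commitment term, as quoted in the text


def squared_distance(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mse(xs, ys):
    """Mean squared error over all entries of two equally shaped nested lists."""
    n_entries = sum(len(row) for row in xs)
    return sum(squared_distance(x, y) for x, y in zip(xs, ys)) / n_entries


def quantize(z_e, codebook):
    """Assign each encoder output to its nearest code; return (token ids, code vectors)."""
    tokens = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in z_e
    ]
    return tokens, [codebook[t] for t in tokens]


def total_loss(features, reconstructed, z_e, z_q, alpha=ALPHA):
    """Forward value of L = L_task + alpha * L_commit (commit term is an assumption)."""
    task = mse(features, reconstructed)
    commit = mse(z_e, z_q)
    return task + alpha * commit
```

The token ids returned by `quantize` play the role of the discrete tokens that the transformer backbone is later trained on; the 4-dimensional code vectors match the codebook dimension quoted above.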

In the unconditional approach, we use the same hyperparameters as outlined above, with the only difference that the architecture of the encoder and decoder is a simple MLP with 3 hidden layers of dimension 128 and ReLU activation function.

All VQ-VAE models are trained on all 10 classes of the JetClass dataset Qu et al. (2022b).

A.2 Classifiers for token quality evaluation

The DeepSets Zaheer et al. (2018); Komiske et al. (2019) classifier consists of a per-particle MLP $\Phi$ with weights shared across all particles inside the jet and 3 hidden layers of dimension 100, 100 and 256. The output of the network $\Phi$ is then aggregated with a sum and fed into another MLP with 3 hidden layers of dimension 100, followed by a 10-dimensional output layer with softmax activation function.

The Transformer classifier consists of 5 NormFormer Shleifer et al. (2021) blocks, followed by two class-attention blocks with a class token as query, inspired by the ParT Qu et al. (2022a) architecture. The output of the last class-attention block is fed into an MLP with two hidden layers of dimension 128, followed by a softmaxed 10-dimensional output layer.

The classifiers are trained with the AdamW Loshchilov and Hutter (2019) optimizer with a maximum learning rate of 0.005 (0.001) for the DeepSets (Transformer) classifier and a weight decay of 0.01. Following the one-cycle learning rate schedule Smith (2018), the learning rate is first linearly increased from an initial value of 0.002 (0.0005) to the maximum during the first 4 training epochs, then linearly decreased back to the initial learning rate over 20 epochs, and finally linearly decreased to a final learning rate of 0.001 (0.0003).
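The three-phase learning rate schedule can be written as a small piecewise-linear function. The sketch below uses the DeepSets values from the text as defaults. Two details are assumptions, since the text does not state them: the schedule is evaluated once per epoch rather than per optimizer step, and the length of the last phase (`final_epochs`, a hypothetical parameter) is set to 1 epoch purely for illustration.

```python
def lerp(start, end, frac):
    """Linear interpolation between start (frac=0) and end (frac=1)."""
    return start + (end - start) * frac


def one_cycle_lr(
    epoch,
    lr_initial=0.002,
    lr_max=0.005,
    lr_final=0.001,
    warmup_epochs=4,
    decay_epochs=20,
    final_epochs=1,  # assumption: duration of the last phase is not given in the text
):
    """Learning rate at the start of the given epoch (0-based)."""
    if epoch < warmup_epochs:
        # Phase 1: linear increase from the initial to the maximum learning rate.
        return lerp(lr_initial, lr_max, epoch / warmup_epochs)
    if epoch < warmup_epochs + decay_epochs:
        # Phase 2: linear decrease back to the initial learning rate.
        return lerp(lr_max, lr_initial, (epoch - warmup_epochs) / decay_epochs)
    # Phase 3: linear decrease to the final learning rate, then held constant.
    frac = min(1.0, (epoch - warmup_epochs - decay_epochs) / final_epochs)
    return lerp(lr_initial, lr_final, frac)
```

Calling `one_cycle_lr(epoch, lr_initial=0.0005, lr_max=0.001, lr_final=0.0003)` gives the corresponding values for the Transformer classifier.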

The classifiers for these token quality evaluations are trained on 10M jets from the JetClass dataset Qu et al. (2022b).

A.3 Transformer backbone

When training the transformer backbone, cross entropy is used as a loss function and Adam Kingma and Ba (2017) with a learning rate of 0.001 as optimizer. The model had access to 10M $t\to bqq^{\prime}$ jet events and 10M $q/g$ jet events. Note that this means that the model trained on these two jet types combined had access to twice as much data. All versions were trained for 30 epochs, and the model state with the lowest validation loss was chosen for the further analysis.

A.4 Transfer learning

To perform the transfer learning from the generative task to the classification task, we change the head of the OmniJet- $\alpha$ model to the classification head and load the weights of the backbone trained for the generative task. We explore two variations of fine-tuning the pre-trained backbone to the classification task: training all weights of the model with the same learning rate (referred to as Fine-tuning in subsection III.3) and keeping the weights of the backbone fixed at the state obtained from the generative model (referred to as Fine-tuning (backbone fixed) in subsection III.3). For the From scratch trainings, we start the training with randomly initialized weights for the whole model. The training is performed with the AdamW Loshchilov and Hutter (2019) optimizer with a constant learning rate of 0.0001 and a weight decay of 0.01. Since these trainings, depending on the size of the training dataset, tend to overfit quite quickly, we stop them when the validation loss does not improve for multiple epochs. The threshold of this early stopping is adjusted to the training dataset size, with a patience of 20 epochs for training datasets of $100$, $1000$ and $10\,000$ jets, a patience of 10 epochs for training datasets of $100\,000$ and $1\,000\,000$ jets, and a patience of 5 epochs for trainings with $2\,000\,000$ training jets. For each training dataset size we run 5 trainings with different random seeds, and the epoch with the smallest validation loss is chosen for evaluation.
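The size-dependent early stopping can be illustrated with a short, framework-free sketch. The patience values are those quoted above. The stopping convention (stop once the number of consecutive epochs without improvement reaches the patience) and the names `PATIENCE_BY_N_JETS` and `run_early_stopping` are assumptions made for this example, not the authors' training code.

```python
# Patience in epochs for each training dataset size, as quoted in the text.
PATIENCE_BY_N_JETS = {
    100: 20,
    1_000: 20,
    10_000: 20,
    100_000: 10,
    1_000_000: 10,
    2_000_000: 5,
}


def run_early_stopping(val_losses, patience):
    """Scan per-epoch validation losses; return (best_epoch, last_epoch_run).

    Training stops once `patience` consecutive epochs have passed without a
    new minimum of the validation loss (an assumed convention).
    """
    best_epoch, best_loss, last_epoch = None, float("inf"), None
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if best_epoch is None or loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, last_epoch
```

The first returned value corresponds to the epoch with the smallest validation loss, which is the model state used for evaluation.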

A.5 Classifier tests

In order to quantify the performance of the generative model, a classifier test using the structure of the DeepSets classifier from subsection A.2 is performed. In this case however, the 3 hidden layers of the particle MLP $\Phi$ all have dimension 10.

A total of 48 000 generated events are combined with equally many reconstructed tokens from the test set of JetClass. The two datasets are combined and shuffled, and a train/val/test split of 0.6/0.2/0.2 is used. The model is trained for 100 epochs with binary cross entropy loss and Adam Kingma and Ba (2017) with a learning rate of 0.001 as optimizer. The model state with the lowest validation loss is chosen for evaluation. The resulting AUC scores are 0.54 for the model trained on $q/g$ and $t\to bqq^{\prime}$ jets combined and 0.57 for the ones trained on single-type jets.
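The bookkeeping of the classifier test can be sketched as follows: label the two samples, shuffle, split, and score a classifier output with the AUC. An AUC of 0.5 means the classifier cannot separate generated from reconstructed jets. The function names, the label convention (1 for generated, 0 for reconstructed) and the fixed seed are assumptions for this example, and the pairwise AUC below is a simple reference implementation intended for small inputs only.

```python
import random


def make_splits(generated, reconstructed, fractions=(0.6, 0.2, 0.2), seed=0):
    """Label, shuffle and split the two samples into train/val/test lists."""
    labelled = [(x, 1) for x in generated] + [(x, 0) for x in reconstructed]
    random.Random(seed).shuffle(labelled)
    n_train = round(fractions[0] * len(labelled))
    n_val = round(fractions[1] * len(labelled))
    train = labelled[:n_train]
    val = labelled[n_train:n_train + n_val]
    test = labelled[n_train + n_val:]
    return train, val, test


def auc(scores, labels):
    """Probability that a positive is scored above a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With 48 000 events per sample, the 0.6/0.2/0.2 split corresponds to 57 600 training, 19 200 validation and 19 200 test events.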

Appendix B Token quality

Additional plots of the jet mass, the jet transverse momentum, as well as the subjettiness ratios are shown in Figure 8, Figure 9, Figure 11 and Figure 10.

Appendix C Generative model trained on single-jet data

To test the generative performance, the generative model was also trained separately on single-jet type data (10M jets each of $t\to bqq^{\prime}$ and $q/g$). For these trainings, no tests of the task transfer to classification were performed.

The result of the $q/g$ jet training is shown in Figure 12, and that of the $t\to bqq^{\prime}$ jet training in Figure 13. In the $q/g$ case, we see good agreement between the reconstructed tokens and the generated events. However, it seems to be somewhat more difficult for the model to accurately model $\tau_{32}$ for $t\to bqq^{\prime}$ jets, which is also mirrored for this quantity in the combined model (see Figure 6).

It is interesting to compare the result of OmniJet- $\alpha$ with that of a different generative model. EPiC-FM Birk et al. (2023) was the first generative model trained on the JetClass dataset, utilizing flow matching and operating without tokenization. The result of the comparison can be seen in Figure 14. The plots show the JetClass data, the reconstructed JetClass tokens from this work, the EPiC-FM generated events, and the OmniJet- $\alpha$ generated events. We use the more challenging $t\to bqq^{\prime}$ class for comparison.

The ratio plots under the main plots show the generated events compared to their respective truths: direct JetClass for EPiC-FM and Reconstructed JetClass tokens for OmniJet- $\alpha$ . Hence, the ratios show how well the respective generative models learn to replicate their training data. In general, we see that both models are doing well. OmniJet- $\alpha$ has a somewhat higher discrepancy in the tails of all distributions except for constituent $\eta^{\mathrm{rel}}$ and the number of constituents. The most difficult distribution is the constituent $p_{\mathrm{T}}$ , with bumps in the tail, which could also be seen in Figure 13.
