Anomaly-aware summary statistic from data batches
Abstract
Signal-agnostic data exploration based on machine learning could unveil very subtle statistical deviations of collider data from the expected Standard Model of particle physics. The beneficial impact of a large training sample on machine learning solutions motivates the exploration of increasingly large and inclusive samples of acquired data with resource-efficient computational methods. In this work we consider the New Physics Learning Machine (NPLM), a multivariate goodness-of-fit test built on the Neyman–Pearson maximum-likelihood-ratio construction, and we address the problem of testing large-size samples under computational and storage resource constraints. We propose to perform parallel NPLM routines over batches of the data, and to combine them by locally aggregating the data-to-reference density ratios learnt from each batch. The resulting data hypothesis defining the likelihood-ratio test is thus shared across the batches, and complies with the assumption that the expected rate of new physical processes is time invariant. We show that this method outperforms the simple sum of the independent tests run over the batches, and can recover, or even surpass, the sensitivity of the single test run over the full data. Besides the significant advantage for the offline application of NPLM to large-size samples, the proposed approach offers new prospects toward the use of NPLM to construct anomaly-aware summary statistics in quasi-online data streaming scenarios.
1 Introduction
Numerous experimental tests have confirmed the Standard Model as the best description of the particle physics world, reaching levels of precision up to a few parts per thousand. New physical processes, if detectable, should therefore manifest as very subtle statistical deviations of the data distributions with respect to the Standard Model expectations. Increasing the phase space acceptance and the statistics of the analysed datasets could unveil interesting data structures that have been missed so far. Alongside the regular signal-specific statistical analyses, signal-agnostic anomaly detection tools have attracted growing interest in the high energy physics community, and machine learning has proved to be a crucial ingredient to scale their discovery power.
Anomaly detection broadly refers to the problem of identifying data patterns that do not align with their expected behaviours. Events can be interpreted as anomalous for multiple reasons: either because their appearance in the dataset is rare, or because they manifest outside the expected data support. Moreover, anomalies can emerge as a collective behaviour affecting the statistical model describing the data occurrence in the experiment. In this work we focus on collective anomalies, namely phenomena that arise as an unexpected deformation of the data probability distribution. These kinds of anomalies are tackled by means of statistical anomaly detection tools. Statistical anomaly detection refers to the problem of recognising and quantifying distributional deviations of a dataset from its nominal expected behaviour. Besides its use in new physics searches, statistical anomaly detection finds a wide spectrum of applications in experimental particle physics. The most striking examples are data quality monitoring of experimental setups at data taking time, and validation of simulated samples.
When a tractable model of the data nominal behaviour is available, statistical anomaly detection reduces to a goodness-of-fit test: a test statistic is defined according to a notion of similarity between the data distribution and the nominal model, and the level of compatibility is quantified in terms of the test statistic $p$-value with respect to the test distribution in nominal conditions. However, in most applications of statistical anomaly detection in high energy physics, a tractable model for the nominal data distribution is not available and needs to be replaced by a dataset of reference (denoted as $\mathcal{R}$ in this work), simulated according to the nominal model or built in a data-driven fashion. In these cases, the problem of statistical anomaly detection is solved as a two-sample test, introducing a test statistic that measures the distance between two samples.
High energy physics, and in particular collider data, pose extremely challenging problems for statistical anomaly detection due to the level of precision required to test the Standard Model. For instance, the accuracy of the statistical models employed in the process of scientific discovery is a particularly delicate matter that requires powerful testing tools. Similarly, the lack of new interesting tensions with the Standard Model hypothesis after the Higgs boson discovery seems to indicate that, if detectable, new physics should manifest at the LHC as a very rare process, be it in time or in phase space. Anomaly detection algorithms must therefore be able to capture very subtle anomalous behaviours, and they should be run over the most inclusive set of data allowed by the computational and methodological constraints. An extensive review of goodness-of-fit algorithms routinely adopted in the high energy physics community can be found in Ref. [1].
Machine learning (ML) has provided significant improvements to the statistical tools employed in data analysis at the LHC, making it possible to deal with increasingly complex data structures and to exploit more of the physics information produced by the experiments. In the specific context of goodness-of-fit, machine learning based algorithms have opened the path to multivariate statistical analyses, able to capture mismodelling also in the correlations between variables (Ref. [2]).
In the absence of constraining inductive biases, as is the case in searches for unknown anomalous signals, the power of machine learning strategies resides in their ability to learn from training examples coming directly from the experimental data. However, the latter could be either scarce or completely lacking anomalous events due to their rarity. The impact of machine learning is therefore expected to become more and more evident by scaling the training sample size and, as a consequence, the rate of collected anomalies. While beneficial to the training, larger sample sizes introduce computational and storage constraints that need to be addressed. In this work, we focus on the New Physics Learning Machine (NPLM) algorithm, an ML-empowered goodness-of-fit method based on the Neyman-Pearson hypothesis test (Ref. [3, 4, 5, 6, 7]), and we address the problem of how to test large datasets under computational and storage constraints. The approach presented in this paper proposes parallel computing of the test over batches of the original sample and a final aggregation strategy that combines the local approximations of the data-to-reference ratios learnt by each instance of the algorithm. We outline three operational approaches for different resource constraint settings, aiming at maximally exploiting the available experimental data both in offline and quasi-online configurations.
We apply the split-aggregation approach to a one-dimensional toy model and to two realistic LHC datasets of increasing complexity: a dimuon final state characterised by 5 observables, and a 24-dimensional problem defined by the tri-momenta of eight objects selected at the L1 trigger of the CMS experiment.
Our numerical experiments for one-dimensional and five-dimensional problems show that the aggregation strategy can fully recover or, in some cases, even outperform the single-batch implementation run over the full dataset, thanks to the regularisation against statistical noise that the aggregation introduces.
Furthermore, the studies conducted on the 24-dimensional problem demonstrate the discovery potential of the various approaches in high-dimensional, large-statistics scenarios.
The paper is organised as follows. In Section 2 we describe the NPLM algorithm and the proposed batch solution for handling large sample sizes. In Section 3 we study the performance of the split-aggregation approach on the one-dimensional and five-dimensional problems and compare it with the original implementation of NPLM. In Section 4, we apply the split-aggregation strategy to a large-size dataset simulating the CMS L1 trigger, showing the scalability of the algorithm to more complex scenarios. We conclude in Section 6 by discussing our main findings and future research directions.
2 Algorithm
In this Section we briefly summarise the main features of the NPLM algorithm and provide the statistical foundations for the newly proposed split-aggregation solution.
2.1 New Physics Learning Machine (NPLM)
New Physics Learning Machine (NPLM) is an ML-based signal-agnostic strategy to detect and quantify statistically significant deviations of the observed data from a defined reference model (Ref. [3]). The NPLM method is inspired by the classical approach to hypothesis testing based on the log-likelihood-ratio (Ref. [8])
$$ t(\mathcal{D}) = 2\,\max_{w}\,\log\frac{\mathcal{L}(\mathcal{D}\mid H_w)}{\mathcal{L}(\mathcal{D}\mid R)} \qquad (1) $$
where the two compared hypotheses are the reference, $R$, and a composite alternative, $H_w$, whose nature is not known a priori. A model $f_w(x)$ with trainable parameters $w$ is defined on the space of the data to parametrize the density distribution of the data in the family of alternatives
$$ n(x\mid H_w) = n(x\mid R)\, e^{f_w(x)} \qquad (2) $$
Since the total number of events in $\mathcal{D}$ is a Poissonian random variable whose expectation depends on the theory model, the test statistic in Eq. 1 is computed as the extended log-likelihood-ratio, which in terms of $f_w$ is written as
$$ t(\mathcal{D}) = 2\,\max_{w}\left[\, N(R) - N(w) + \sum_{x\in\mathcal{D}} f_w(x) \,\right] \qquad (3) $$
with $N(R)$ and $N(w)$ the expectation values of the total number of events in $\mathcal{D}$ under the reference and the alternative hypotheses respectively. The maximisation problem described by Eq. 3 can be solved by a machine learning task
$$ t(\mathcal{D}) = -2\,\min_{w}\, L[f_w] \qquad (4) $$
where the loss function
$$ L[f] = \sum_{x\in\mathcal{R}} w_{\mathcal{R}} \left( e^{f(x)} - 1 \right) - \sum_{x\in\mathcal{D}} f(x) \qquad (5) $$
takes as input two training datasets: the dataset of interest $\mathcal{D}$, and a sample of reference $\mathcal{R}$ drawn according to the reference hypothesis and weighted by $w_{\mathcal{R}} = N(R)/N_{\mathcal{R}}$ to match the experimental luminosity. Thus, the machine learning task learns the optimal values $\hat{w}$ of the trainable parameters of $f_w$, to approximate the log-density-ratio of the two samples
$$ f_{\hat{w}}(x) \approx \log\frac{n(x\mid \mathcal{D})}{n(x\mid R)} \qquad (6) $$
As for any test statistic, the value of the test obtained for $\mathcal{D}$ has to be compared to the distribution of the test under the reference hypothesis, $p(t\mid R)$, and the $p$-value is finally used as a GoF metric. We build $p(t\mid R)$ empirically, running pseudo-experiments on datasets that are generated according to the $R$ hypothesis. The model $f_w$ could be a neural network as in Ref. [3, 4, 5], or it could be built with kernel methods (Ref. [6, 9]). In this work we employ kernel methods, since they have been shown to be less time demanding than neural networks. Details about the NPLM training strategy based on kernel methods are postponed to Section 2.4.
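The empirical calibration step can be sketched as follows. This is a toy illustration: the chi-square draws stand in for the test statistics of actual pseudo-experiments, and `empirical_p_value` is a hypothetical helper, not part of the NPLM code.

```python
import numpy as np

def empirical_p_value(t_obs, t_toys):
    """Fraction of pseudo-experiments with a test statistic at least as extreme."""
    return float(np.mean(np.asarray(t_toys) >= t_obs))

# Stand-in for the null distribution p(t|R): in a real application these would
# be NPLM test statistics computed on pseudo-experiments drawn from R.
rng = np.random.default_rng(0)
t_ref_toys = rng.chisquare(df=10, size=10_000)
p = empirical_p_value(25.0, t_ref_toys)
```

The resolution of the empirical $p$-value is limited by the number of toys, which is why asymptotic approximations become useful in the deep tail.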
It should be mentioned that the test statistic in Eq. 4 has good statistical properties (i.e. its null distribution approximates a $\chi^2$ in the asymptotic regime) only if the number of events representing the reference hypothesis, $N_{\mathcal{R}}$, is large enough that the statistical fluctuations affecting the reference sample are negligible with respect to those affecting the data. We refer the reader to Ref. [4] and Ref. [6] for a detailed discussion of the impact of the reference sample size on the properties of the NPLM test statistic.
Moreover, in realistic particle physics scenarios uncertainties affect the knowledge of the reference hypothesis, requiring the introduction of nuisance parameters that define families of transformations of the reference distribution which should be seen as not anomalous. A solution extending the NPLM strategy to account for systematic errors has been introduced in Ref. [5]. In our studies, we consider systematic uncertainties as negligible and leave their treatment within the batch approach for future studies. Finally, oftentimes simulations are not accurate enough to provide a good reference model. In those circumstances, data-driven techniques need to be implemented to build $\mathcal{R}$. The problem of building an accurate reference model is a distinct problem and out of the scope of this work; we comment on possible solutions in Section 4. In the next Section we introduce the batching idea to address the problem of large sample size.
2.2 Learning New Physics from batches
Let us assume that the experimental data are acquired in batches
$$ \mathcal{D} = \{ B_i \}_{i=1}^{N_b} \qquad (7) $$
and that the size of each batch is a Poissonian random variable whose expectation is a function of the integrated data acquisition time (more precisely, the expected number of observations for a given physics process, $N_{\rm exp}$, is a function of its physics cross section $\sigma$, the integrated luminosity $L$, and the acceptance $a$ and efficiency $\epsilon$ of the experimental setup: $N_{\rm exp} = \sigma \cdot L \cdot a \cdot \epsilon$). In addition, let us assume that each batch can be stored in a buffer for a long enough time to undergo the NPLM training task.
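For illustration, the batch-size model described above can be sketched as follows; the numerical values are arbitrary placeholders, not taken from any experiment.

```python
import numpy as np

def expected_events(xsec, lumi, acceptance, efficiency):
    """Expected event count: N_exp = sigma * L * a * epsilon."""
    return xsec * lumi * acceptance * efficiency

# Arbitrary placeholder numbers for a single physics process.
n_exp = expected_events(xsec=2.0, lumi=1000.0, acceptance=0.5, efficiency=0.9)
rng = np.random.default_rng(1)
batch_sizes = rng.poisson(lam=n_exp, size=8)  # one Poissonian draw per batch
```

Each batch size is an independent Poissonian draw around the same expectation, consistent with the static-rate assumption used later in the aggregation.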
Since the proton-proton collisions are independent events, the optimal test of the full dataset according to Neyman and Pearson would be the sum of the log-likelihood-ratio tests performed over the single batches, with the alternative hypothesis set to the True model ($T$) underlying the data:
$$ t_T(\mathcal{D}) = \sum_{i=1}^{N_b} t_T(B_i) = 2 \sum_{i=1}^{N_b} \log\frac{\mathcal{L}(B_i\mid T)}{\mathcal{L}(B_i\mid R)} \qquad (8) $$
Clearly, the True hypothesis $T$ is not known a priori, and we want to exploit the NPLM strategy to estimate it. The solution we propose in this work is to run $N_b$ separate instances of the NPLM algorithm over the different batches and recombine the information encoded in the log-density-ratio functions
$$ \{\, f_{\hat{w}_i}(x) \,\}_{i=1}^{N_b} \qquad (9) $$
learnt by the different instances. Recalling the parametrization defined in Eq. 2, for each learnt model $f_{\hat{w}_i}$ we can write an estimate of the density of the data batch in terms of the reference
$$ \hat{n}_i(x) = n(x\mid R)\, e^{f_{\hat{w}_i}(x)} \qquad (10) $$
Each density $\hat{n}_i$ can be interpreted as an approximate view of the same True physical model. If the learning process outputs an accurate enough representation of $T$, then Eq. 8 can be straightforwardly approximated by summing the likelihood-ratio tests output by NPLM for the separate batches
$$ t_{\rm SUM}(\mathcal{D}) = \sum_{i=1}^{N_b} t(B_i) \qquad (11) $$
This approach produces a powerful test if the exact shape and normalization of the signal is captured by all models. However, when the signal is very rare, its impact in a small-size data sample can be overlooked by the model. We can exploit two simple assumptions to provide a better solution. First, we assume that the new physics source is static, namely that its occurrence rate is uniform over time. Second, we assume that biases affecting single-batch views are due to statistical noise whose impact over time is on average null. Under these assumptions, averaging the density-ratio terms learnt in different batches provides a single alternative hypothesis $\bar{A}$ shared among the batches, which helps to enhance systematic discrepancies, interpreted as signals, while suppressing random ones. The density of the data under $\bar{A}$ would then have the following form:
$$ n(x\mid \bar{A}) = \frac{1}{N_b} \sum_{i=1}^{N_b} \hat{n}_i(x) \qquad (12) $$
By expressing the model of the data density distribution in each batch in terms of the set of functions $\{ f_{\hat{w}_i} \}$, we can define an “aggregated” model of the log-ratio of the densities as
$$ \bar{f}(x) = \log\frac{n(x\mid \bar{A})}{n(x\mid R)} = \log\left[ \frac{1}{N_b} \sum_{i=1}^{N_b} e^{f_{\hat{w}_i}(x)} \right] \qquad (13) $$
and use it to compute the log-likelihood-ratio test statistic over the set of batches:
$$ t_{\rm AGG}(\mathcal{D}) = 2\left[\, N(R) - N(\bar{A}) + \sum_{x\in\mathcal{D}} \bar{f}(x) \,\right], \qquad N(\bar{A}) = w_{\mathcal{R}} \sum_{x\in\mathcal{R}} e^{\bar{f}(x)} \qquad (14) $$
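The aggregation step of Eq. 13 can be sketched as follows, with two hypothetical per-batch models standing in for trained NPLM instances; the log of the average of the exponentials is computed in a numerically stable log-mean-exp form.

```python
import numpy as np

def f_aggregated(f_list, x):
    """Aggregated log-density-ratio: log of the average of exp(f_i(x)),
    computed with a numerically stable log-mean-exp."""
    F = np.stack([f(x) for f in f_list])          # shape (N_b, n_points)
    m = F.max(axis=0)
    return m + np.log(np.mean(np.exp(F - m), axis=0))

# Two hypothetical per-batch models that disagree by statistical noise only.
f1 = lambda x: 0.1 * np.ones_like(x)
f2 = lambda x: -0.1 * np.ones_like(x)
x = np.linspace(0.0, 1.0, 5)
fbar = f_aggregated([f1, f2], x)   # log((e^0.1 + e^-0.1)/2) at every point
```

In this symmetric toy case the aggregated model is close to zero, illustrating how opposite noise-driven deviations in the single-batch models partially cancel in the average.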
Batching solutions for computationally intensive calculations with kernel methods have already been proposed in the literature for supervised approaches (see for instance Ref. [10]). In Ref. [10] the authors show theoretical guarantees for recovering the performance of the full-sample training solution, provided that the number of splits is not too large.
2.3 Operational strategies in resource-constrained settings
Despite the advantage provided by splitting the dataset into batches, the computational requirements of the algorithm execution might still pose significant challenges to its application in realistic scenarios. In fact, computing $t_{\rm AGG}$ requires evaluating $\bar{f}$ over each element of the total set of acquisitions $\mathcal{D}$. This means that all data batches have to be stored for the whole duration of the NPLM algorithm execution, and the memory locations at which the models $f_{\hat{w}_i}$ are stored need to be accessible when testing each batch. Moreover, the execution time varies largely depending on several factors concerning the algorithm setup, such as the size of the batches and the choice of input variables, but also on the computational resources available and the way these are dispatched. The possibility of running NPLM tasks in parallel on multiple GPUs provides, for instance, the potential to significantly reduce the computational time.
On the other hand, if the storage resources are limited, it could happen that not all the data can be saved long enough for the NPLM routines to be completed. In this situation, the test in Eq. 14 cannot be evaluated exactly. Two alternatives can then be contemplated. The first one is to run NPLM over all the batches and save the final models to estimate $\bar{f}$, but to run the test only on the subset of batches that could be stored:
$$ t_{\rm AGG}(\mathcal{D}_S) = 2 \sum_{i\in S} \left[\, N_i(R) - N_i(\bar{A}) + \sum_{x\in B_i} \bar{f}(x) \,\right] \qquad (15) $$

with $S$ the set of stored batches.
The second alternative is to consider Eq. 10 as the limit of a smart multidimensional binning approach, where the bin centres are the elements of the reference sample $\mathcal{R}$. In this limit, each bin has a count equal to unity under the reference hypothesis and $e^{\bar{f}(x)}$ under the alternative. A “saturated” binned likelihood-ratio test [11] can then be computed:
$$ t_{\rm SAT} = 2\, w_{\mathcal{R}} \sum_{x\in\mathcal{R}} \left[\, e^{\bar{f}(x)}\, \bar{f}(x) - e^{\bar{f}(x)} + 1 \,\right] \qquad (16) $$
Notice that the test does not depend on the data directly, but only via $\bar{f}$, which has been learnt by exploiting the dataset in batches. Moreover, the test does not compute the likelihood-ratio of the data: it rather evaluates the distance between the density of the reference and that of the alternative hypothesis on the data points belonging to the reference sample $\mathcal{R}$.
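One possible sketch of a saturated statistic of this kind, assuming per-reference-point “bins” with unit expected count under the reference and $e^{\bar{f}}$ under the alternative (the aggregated log-ratio values below are placeholders, not the output of any training):

```python
import numpy as np

def saturated_test(fbar_ref, w_ref):
    """Saturated statistic over per-reference-point 'bins': expected count 1
    under R, exp(fbar) under the alternative; non-negative by construction."""
    nu = np.exp(fbar_ref)
    return 2.0 * w_ref * np.sum(nu * fbar_ref - nu + 1.0)

fbar_ref = np.array([0.0, 0.2, -0.1])  # placeholder aggregated log-ratio values
t_sat = saturated_test(fbar_ref, w_ref=0.1)
```

Since each term $\nu\log\nu - \nu + 1$ is non-negative, the statistic vanishes only when $\bar{f}$ is identically zero on the reference points, i.e. when the aggregated alternative coincides with the reference there.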
Eq. 16 is a powerful metric when the model $\bar{f}$ is able to capture the signal shape well enough. Its performance deteriorates for signal events that lie in rare regions of the input space, or whose shape is too complex compared to the expressivity of the family of universal approximators used to define the models $f_{\hat{w}_i}$.
An alternative approach to online data summarization based on machine learning was proposed in Ref. [12]. There are two main conceptual differences between the approach of Ref. [12] and the one presented in the current paper. The first is that the authors of Ref. [12] propose to learn directly the probability distribution of the data, while via NPLM we propose to learn the ratio of the data density with respect to a reference. The second difference concerns the training strategy. The authors of Ref. [12] suggest training a unique model (specifically, a normalizing flow) by feeding the acquired data batches sequentially. Conversely, in the present work we propose to perform a complete and independent training for each batch, thus preventing catastrophic forgetting.
In the experiments reported in Section 4 we consider three representative resource-constrained scenarios:
1. No resource constraints (NPLM-ALL): all batches are used both for training and for testing. The final test is the one in Eq. 14.
2. Long-term storage constraint (NPLM-ONE): there are enough computational resources to train multiple instances of the algorithm in a quasi-online fashion, but only one batch is available for testing. For this configuration, the final test is the one in Eq. 15, with the stored subset reduced to a single batch.
3. Saturated test (NPLM-SAT): all data are temporarily available for a quasi-online training of the algorithm but they cannot be stored and tested. In this case, we only test the distance between the density of the data and that of the reference distribution by computing the saturated test in Eq. 16.
2.4 Time-efficient NPLM implementation with non-parametric models
To contain the execution time of a single instance of the NPLM algorithm, we consider the kernel-based implementation of NPLM presented in Ref. [6] and applied in Ref. [9], which exploits the Falkon library [13], a modern solver for kernel methods that leverages several algorithmic solutions developed in Refs. [14, 15, 13] to allow for their use in large-scale problems. The model used in Falkon is a Nyström-approximated kernel method, defined as a finite weighted sum of kernel functions
$$ f_w(x) = \sum_{i=1}^{M} w_i\, k_{\sigma}(x, \tilde{x}_i) \qquad (17) $$
The kernel function, $k_{\sigma}$, adopted in this implementation is a Gaussian kernel
$$ k_{\sigma}(x, \tilde{x}) = \exp\left( -\frac{\| x - \tilde{x} \|^2}{2\sigma^2} \right) \qquad (18) $$
where the hyperparameter $\sigma$ denotes the Gaussian width.
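Evaluating such a kernel expansion can be sketched in plain NumPy as follows; this is not the actual Falkon implementation, and the centres and weights below are random placeholders rather than the output of a Nyström fit.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """k_sigma(x, x~) = exp(-||x - x~||^2 / (2 sigma^2)), vectorised."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_model(x, centers, weights, sigma):
    """f_w(x) = sum_i w_i k_sigma(x, x~_i)."""
    return gaussian_kernel(x, centers, sigma) @ weights

rng = np.random.default_rng(3)
centers = rng.normal(size=(10, 2))   # M = 10 centres in two dimensions
weights = rng.normal(size=10)
x = rng.normal(size=(5, 2))
fx = kernel_model(x, centers, weights, sigma=1.0)
```

The trainable parameters are only the $M$ weights, which is what makes the Nyström approximation cheaper than a full kernel expansion over all training points.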
To approximate the log-ratio between the distributions of the $\mathcal{D}$ and $\mathcal{R}$ datasets, the training exploits a weighted binary cross-entropy loss
$$ \ell[f] = w_{\mathcal{R}} \sum_{x\in\mathcal{R}} \log\left( 1 + e^{f(x)} \right) + \sum_{x\in\mathcal{D}} \log\left( 1 + e^{-f(x)} \right) \qquad (19) $$
with an L2 regularisation term. The NPLM method works under the assumption that the reference sample is large enough to be a good representation of the $R$ hypothesis, which typically leads to an imbalanced dataset with $N_{\mathcal{R}} \gg N_{\mathcal{D}}$. The weight $w_{\mathcal{R}}$ is thus introduced to balance the two classes’ contributions.
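A minimal sketch of such a weighted logistic loss, evaluated on hypothetical model outputs (the samples, weights and the helper name are placeholders; the regularisation term is omitted):

```python
import numpy as np

def weighted_bce_loss(f_data, f_ref, w_ref):
    """Weighted logistic loss on model outputs: reference points (class 0)
    are weighted by w_ref, data points (class 1) have unit weight."""
    loss_ref = w_ref * np.sum(np.log1p(np.exp(f_ref)))
    loss_data = np.sum(np.log1p(np.exp(-f_data)))
    return loss_ref + loss_data

# Placeholder model outputs on tiny toy samples.
loss = weighted_bce_loss(f_data=np.array([0.5, -0.2]),
                         f_ref=np.array([0.0, 0.1, -0.1]),
                         w_ref=2.0 / 3.0)
```

With this class weighting, the population minimiser of the loss is the log-ratio of the data density to the reference density, which is what makes the classifier output directly usable in the likelihood-ratio test.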
Hyperparameter choice.
The hyperparameter choice for a given family of universal approximators defines the “modes” of the data that the NPLM algorithm is able to capture. For the kernel methods considered in this work, three hyperparameters need to be defined: the number of centres $M$, the Gaussian width $\sigma$, and the regularisation parameter $\lambda$. In Ref. [6] it has been shown that there exists a “valley” of ($M$, $\sigma$, $\lambda$) configurations that guarantees the convexity of the loss function, arbitrarily small prediction error and reasonable convergence time. However, no further indication can be provided to enhance the sensitivity to statistical anomalies. Different hyperparameter choices could be optimal to capture different types of data deviations from the reference distribution. For instance, small values of the Gaussian kernel width, $\sigma$, would better capture localised deviations, like narrow resonances, while larger values would be better suited to represent effects affecting the distribution over a wider range, like scale mismatches. If the nature of the anomaly is unknown and no good prior is available, then the most inclusive way of testing the data is to combine multiple hyperparameter configurations. In this work we fix $M$ and $\lambda$ following the prescription proposed in Ref. [6]: namely, we choose $M$ such that the training time is reasonable given the available computational resources, and we choose $\lambda$ as small as possible, provided no instabilities are encountered.
For the Gaussian width, $\sigma$, we adopt the approach recently proposed in Ref. [NPLM-aggreg]: rather than limiting ourselves to a single value, we consider five values corresponding to the 5, 25, 50, 75 and 95% quantiles of the pairwise distance distribution between the elements of the reference set. We combine the results obtained with the different values of $\sigma$ by selecting the minimum:
$$ p_{\min}(\mathcal{D}) = \min_{\sigma\in\{\sigma_1,\dots,\sigma_5\}} p_{\sigma}(\mathcal{D}) \qquad (20) $$
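The quantile-based selection of the width candidates can be sketched as follows; `sigma_candidates` is a hypothetical helper, and random pair subsampling is an assumption made here to keep the pairwise-distance computation affordable on large reference samples.

```python
import numpy as np

def sigma_candidates(ref, quantiles=(0.05, 0.25, 0.50, 0.75, 0.95),
                     max_pairs=50_000, seed=0):
    """Gaussian-width candidates from quantiles of the pairwise distance
    distribution of the reference sample; random pairs keep the cost low."""
    rng = np.random.default_rng(seed)
    n = len(ref)
    i = rng.integers(0, n, size=max_pairs)
    j = rng.integers(0, n, size=max_pairs)
    keep = i != j                     # discard self-pairs (zero distance)
    d = np.linalg.norm(ref[i[keep]] - ref[j[keep]], axis=1)
    return np.quantile(d, quantiles)

rng = np.random.default_rng(4)
ref = rng.normal(size=(1000, 2))   # toy reference sample
sigmas = sigma_candidates(ref)
```

Each resulting width would then define one NPLM configuration, whose $p$-values are combined by taking the minimum and recalibrated on pseudo-experiments.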
It is worth noticing that the authors of Ref. [10] comment on the impact of regularisation on the batched solutions, highlighting how the correct level of generalisation can be recovered if the trainings run over the batches are under-regularised. The NPLM algorithm is a signal-agnostic strategy, and therefore an optimal regularisation level cannot be determined prior to testing the data. Nevertheless, we observe the benefit of the aggregation as a regulariser in our studies. Details are reported in Section 3.
3 Numerical experiments
In this Section we study the properties of the batch-based strategy outlined in Section 2.3 in two proof-of-concept applications. We start by considering a one-dimensional problem characterised by a smoothly falling reference distribution (EXPO-1D), for which the optimal test statistic according to Neyman and Pearson can be computed analytically and used as a target. We then move to the multidimensional problem of a dimuon final state observed at collider experiments (DIMUON-5D). A detailed description of the datasets’ properties and signal benchmarks is given in Appendix 7.1. The details about the algorithm implementation, execution time and resource consumption are given in Section 5.
3.1 Improving regularisation and sensitivity by aggregating over batches.
We start by studying the impact of the number of batches $N_b$ on the power of the NPLM-ALL aggregated test statistic (Eq. 14). We run the NPLM algorithm for different values of $N_b$ and we compare the power curves of the aggregated test (Eq. 14) and of the simple sum of tests (Eq. 11) resulting from combining them. We run the experiments on the various signal benchmarks reported in Table 2 for the EXPO-1D dataset and in Table 3 for the DIMUON-5D dataset.
Interestingly, the aggregated test shows similar or higher power as the number of batches increases. Conversely, we observe that the simple sum of tests suffers a significant power degradation as the number of batches increases. We argue that the regularisation benefit introduced by averaging over the batches has a higher impact than the sensitivity loss due to data splitting. One possible explanation for this result can be found in the in-sample nature of the NPLM algorithm. Each trained model is by construction induced to (over)fit fluctuations in the data batch distribution, even when those are not statistically significant in the single-batch test. Combining the outcomes of the batch trainings at the level of the functional forms of the models therefore allows us to restore the full information carried by each model, and to exploit it to recover sensitivity to very rare signals. An illustrative example of the dynamics mentioned above is given in Figure 3. The panels represent the data distribution (black histograms) and the reference distribution (light blue histograms) for the one-dimensional EXPO-1D dataset, when the data are split into four batches. A rare Gaussian signal is injected in the tail of the exponentially falling distribution. The light green solid lines represent the models learnt in each training. On the right-hand side of the Figure, we report four panels showing the individual batches and the corresponding trained models. The models follow the statistical fluctuations of the training sample, producing an imprecise view of the true model of the data. On the left-hand side, the four models are combined to produce the aggregated version, shown in dark green. It is easy to notice the improved stability of the aggregated model with respect to the ones trained on single batches.
The observed behaviour is consistent across all the studied signal benchmarks for both datasets. We report in Figure 4 the power curves for one signal benchmark of the EXPO-1D dataset, and in Figure 5 the power curves for one signal benchmark of the DIMUON-5D dataset. Additional power plots for the aggregated and simple-sum tests for different values of $N_b$ on different datasets and signal benchmarks are reported in Appendix 7.2.
3.2 Improving anomaly detection on a single batch exploiting quasi-online learning.
The results presented in Section 3.1 suggest that the aggregation improves the quality of the density-ratio modelling. This motivates the use of NPLM in a quasi-online setup to extract powerful information from the data even when only a fraction of the latter can eventually be stored (storage-constrained scenario). This is always the case at collider experiments, where trigger algorithms are designed to filter the data. We study the impact of aggregating the modelling over multiple batches on the power of the NPLM test for a single data batch (NPLM-ONE scenario). For the EXPO-1D dataset, the data are split into eight batches, all of them are analysed via NPLM but only one is saved for testing. For the DIMUON-5D dataset, instead, the data are split into four batches. Figure 6 shows the results of our study for the narrow (left panel) and broad (right panel) signal benchmarks in the EXPO-1D dataset, whereas Figure 7 shows the results for two different injections of a Z’ resonant signal with invariant mass of 300 GeV, considered in the DIMUON-5D dataset. While not competitive with the ideal scenario of unlimited storage availability in which the full dataset can be tested (green line), the power curve of our experiment (light blue line) shows a significant improvement over the simple use of NPLM on a single batch (black line). Similar behaviours are found for the other signal benchmarks in both the univariate and multivariate datasets. Additional plots are reported in Appendix 7.3.
3.3 Anomaly-preserving data compression via likelihood-ratio
Finally, we consider the extreme scenario in which data cannot be permanently stored but are only available quasi-online for a limited time window. In this case, the proposed solution is to store the NPLM models as compressed representations of the data batches, and to build the saturated likelihood-ratio test described in Eq. 16. The saturated test is not properly a likelihood-ratio test, because the likelihood is not directly evaluated on the data points in $\mathcal{D}$. This could in principle lead to inefficiencies. A trivial failure example, for instance, is that of a signal localised in a low-density region of the reference data distribution. Since the likelihood-ratio reconstructed by $\bar{f}$ is only evaluated at the data points of the reference sample $\mathcal{R}$, if the shape of the signal is underestimated then the test will not be sensitive to it. This intuition can be verified by running numerical experiments of different nature. Figure 8 shows two representative experiments run on the EXPO-1D dataset. In the Figure we compare the power curve of the saturated test (black line) with that of the aggregated test evaluated over the full dataset (light blue line), or on one batch only (green line). For the signal located in the bulk of the reference distribution (left panel), the saturated test has comparable power to the aggregated test evaluated over the full dataset. Instead, for the Gaussian signal in the tail of the reference distribution (right panel), the power of the saturated test deteriorates, as fewer reference data points are available in the signal region to properly quantify the discrepancy. Similar results are found for the DIMUON-5D dataset (see Figure 9). Additional Figures for the remaining signal benchmarks are reported in Appendix 7.4.
4 Scaling up to big data LHC scenarios
In this Section we apply the batch-aggregation approach to a larger-scale dataset, emulating a typical data stream acquired at the CMS experiment with a one-light-lepton (electron or muon) requirement at the first level of data filtering (L1 trigger). The dataset, initially introduced in Refs. [16, 17] and later adapted for anomaly detection [18], is publicly available on Zenodo [19, 20, 21, 22, 23]. It consists of simulated events for the Standard Model (SM) and four beyond-the-Standard-Model signatures, reported below (additional information about the dataset can be found in Ref. [18]):
• a neutral scalar boson ($A$) decaying to two off-shell Z bosons, each forced to decay to two leptons;
• a charged scalar boson ($h^{\pm}$) decaying to a lepton and a neutrino;
• a scalar boson ($h^{0}$) decaying to two leptons;
• a leptoquark (LQ) decaying to a quark and a lepton.
The dataset contains a list of 19 physics objects (4 muons, 4 electrons, 10 jets and the event missing transverse energy) for each collision event. Each object is characterised by three kinematic variables and a class label, for a total of 76 features. We run the batched version of NPLM over the 24-dimensional problem defined by the kinematic features of eight out of the nineteen objects: the missing transverse energy, the two most energetic muons, the two most energetic electrons, and the three most energetic jets. We consider a large data sample of simulated SM events, on top of which signal events at the level of a few parts per mille are injected to test the algorithm sensitivity. The marginal distributions of the input features for the Standard Model background and the four signal benchmarks are reported in Figures 12 and 13 in Appendix 7.1.
The reference sample is taken to be ten times larger than the full data sample, and the data sample is split into 5 batches ($N_b = 5$). As for the kernel method hyperparameters, we fix $M$ and $\lambda$ following the prescription of Section 2.4. For the sake of time, we study the performance of the model for a single value of the kernel width $\sigma$, corresponding to the 90% quantile of the distribution of pairwise distances between SM data points. We run the three versions of the algorithm presented in Section 3: train and test over all batches (NPLM-ALL), train over all batches and test only one (NPLM-ONE), and the saturated test on the model obtained averaging over all batches (NPLM-SAT).
Figure 10 shows the median Z-score reached by the three approaches for a 0.2% and 0.4% signal injection of the four different signal benchmarks. For completeness, we also show the sensitivity of the algorithm when only a fifth of the luminosity, namely one batch, is exploited by NPLM. The results of our tests confirm the main findings already outlined in Section 3. First, the NPLM-ONE approach exhibits increased performance with respect to the application of NPLM to a single batch, showing that the aggregation via averaging proposed in this work unveils sensitive information about the signal signature even when the full set of data is not available at testing time. Moreover, when all the data are available at testing time (NPLM-ALL), the NPLM algorithm can take advantage of the full sample statistics to make discovery possible. This is the case for two of the signal benchmarks, where a 2 per mille signal injection (corresponding to a 2 sigma global normalization effect in our studies) can be raised to evidence level for one and discovery level for the other. (The Z-score values reported in Figure 10 are computed empirically up to 3. For larger values of the default NPLM, NPLM-ALL and NPLM-ONE statistics, we rely on the asymptotic distribution of the test statistic under the null hypothesis; the compatibility of the NPLM test with a χ² distribution for kernel methods was previously studied in Refs. [6, 9]. For NPLM-SAT we do not observe the emergence of an asymptotic χ²; however, the null distribution agrees well with a normal distribution, whose asymptotic form we use to extrapolate estimates of Z-scores above 3. Points above 3 in the Figure should therefore be taken as a qualitative trend rather than an accurate estimate.) For these signal benchmarks, we observe impressive reaches for the NPLM-SAT approach as well.
This result aligns with what was observed in the one-dimensional and five-dimensional experiments, namely that NPLM-SAT is as successful as NPLM-ALL when the signal support lies mainly in the bulk of the reference sample support, as is the case for the signal benchmarks considered here (see marginal distributions reported in Figures 12 and 13). Additionally, this result points out the potential of the NPLM split-aggregation method as a tool to compute anomaly-preserving summary statistics, making it possible to save information about data that are only temporarily available and eventually discarded by the data acquisition system. We conclude this Section by commenting on the performance of NPLM on the remaining signals, which is found to be poorer. We believe that the reason could reside in the choice of features adopted in these studies: only a subset of the original nineteen physics objects is retained for this analysis. Moreover, the algorithm performance is sensitive to the size of the reference sample; our studies suggest that the larger the sample, the more accurate the result becomes, though at a higher computational cost. Finally, one peculiar feature of this dataset is the so-called "zero padding", namely filling with zeros the entries corresponding to objects that are not observed in the event. The zero padding gives rise to sharp features that project a large part of the data onto lower-dimensional surfaces of the input manifold. The problem of variable-size events could be addressed by choosing a different data representation or model architecture, or by mapping the data to a lower-dimensional embedding. These interesting directions are out of the scope of this paper and are left for future work.
5 Computational resources and training time
In this Section, we report the details of the algorithm implementation, highlighting the typical execution time and computational resources needed for the numerical experiments studied in this work.
Execution time.
The execution time of a single instance of the NPLM algorithm based on kernel methods depends on several factors: the total number of data points, the number of kernels, the kernel width, the smoothness parameter, and the hardware resources employed. Our implementation relies on the Falkon library [13], which speeds up the fitting by optimally exploiting GPUs [6].
dataset | number of kernels | kernel widths | N(R) (all) | N(D) (all) | Time (single batch)
---|---|---|---|---|---
EXPO-1D | 1k | [0.1, 0.3, 0.7, 1.4, 3.0] | 200k | 16k | 0.5–1.5 s
DIMUON-5D | 10k | [0.8, 1.9, 3.0, 4.4, 6.6] | 40k | 8k | 3–12 s
L1CMS-24D | 10k | [10.5] | 10M | 1M | 40–70 s
We report in Table 1 the typical time range for a single batch execution with the hyperparameter choices adopted in our experiments. The variance within experiments on the same dataset is mainly due to the kernel width: smaller values require a higher number of iterations of the Falkon algorithm to reach convergence. The main difference in execution time between the univariate and multivariate cases is due to the model size, the number of kernels used in the five-dimensional problem being a factor of 10 larger than in the one-dimensional case. The training time is not significantly affected by the batch size, as the reference sample size dominates the overall size of the training sample. For large samples and complex models the execution time increases more than linearly with the dataset size (see studies reported in Ref. [6]), and the benefit of batching the dataset in terms of execution time becomes more striking. This has been observed, for instance, when studying the L1CMS-24D dataset.
Storage resources.
Aggregating over batches requires both temporary and permanent computational and storage resources. The temporary resources are used to train each single instance of the algorithm and should be sufficient to store the data batch input to the model, and the model itself. Importantly, the model location should be accessible by all data batches to allow estimating the aggregated density ratio according to Eq. 13. The reference sample and the values that the aggregated model takes in the test points should instead be stored in a permanent location. The reference sample is a static input to the NPLM algorithm and is common to all training instances. (As previously stated, this work does not cover the case of an imperfect reference, namely the case in which the reference model is affected by systematic uncertainties and its optimal configuration given the experimental conditions is not constant. We leave this case for future work.) Together with the test data, it is used to assess the value of the test in all three versions described in Section 3. The amount of memory to be allocated for the aggregated model depends on the number of points at which the test requires its evaluation. For the test in Eq. 14, the aggregated function is evaluated over all data points and all reference points, requiring a vector whose length is the sum of the data and reference sample sizes. For the test in Eq. 15, only a subset of the batches is used for evaluation, and the vector length reduces accordingly. Finally, for the saturated test in Eq. 16, the aggregated function is evaluated over the reference set only, requiring a vector of the reference sample length. In the first two cases the storage resources needed scale linearly with the data-taking time, while in the last case they are constant. The implications of information loss on the sensitivity of the different summary statistics are discussed in Sections 3.2 and 3.3.
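The scaling just described can be made concrete with a back-of-the-envelope sketch. The function below counts stored f-values (not bytes) for the three tests; the mode names and signature are ours, chosen for illustration.

```python
# Hypothetical sketch of the permanent-storage footprint of the three tests:
# how many evaluations of the aggregated model must be kept.
def stored_values(n_ref, n_batch, n_batches, mode):
    if mode == "ALL":   # Eq. 14: f on all data points and all reference points
        return n_ref + n_batches * n_batch
    if mode == "ONE":   # Eq. 15: only one batch kept for testing
        return n_ref + n_batch
    if mode == "SAT":   # Eq. 16: saturated test, reference points only
        return n_ref
    raise ValueError(mode)
```

Only the "ALL" count grows with the number of batches, i.e. with data-taking time; the saturated test is constant by construction.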
6 Conclusions
In this work we explore the possibility of scaling the use of NPLM to test large size data samples in the presence of variable computational and storage resource constraints.
We propose to split the dataset in batches and run parallel instances of the NPLM algorithm on each of them. The final results are combined by averaging the density-ratio functions learnt by each instance of the NPLM algorithm.
We apply this strategy to both univariate and multivariate benchmarks representative of typical data distributions in particle physics experiments.
Our experiments show that the split-aggregation strategy preserves, or even surpasses, the sensitivity of the NPLM test applied to the full sample (original implementation). The reason for this success is mainly related to the uniqueness of the NPLM training strategy, which aims at finding the maximum-likelihood fit of the model to the input data: a representation of the true distribution underlying the data that deliberately over-fits the specific sample used for training.
While over-fitting is commonly avoided in machine learning algorithms, here it assumes a crucial role in preserving data structures that do not generalise, either because they resemble statistical fluctuations or because they are especially rare traits of the data.
Preserving rare structures is what makes the NPLM batching successful in recombining subtle hints of anomalous events into a statistically significant signal, also suggesting that the density-ratio functions learnt by the NPLM algorithm could be good candidates to build anomaly-preserving summary statistics.
Moreover, the new NPLM split-aggregation strategy offers a solution for offline signal-agnostic analyses of collider data, often characterised by large sample sizes. It is beneficial both in terms of training time and sensitivity, thanks to its regularizing effect against statistical fluctuations, which in some cases even increases the significance of the observed discrepancy.
It should be mentioned that rigorously treating systematic uncertainties affecting the reference sample is crucial in this context. In this work we do not address the problem of systematic uncertainties. A way to address them in the original NPLM implementation was proposed in Ref. [5] for neural-network-based models. In the split-aggregation approach, additional challenges arise from the fact that each batch has access only to "local" information about the uncertainties. An accurate global assessment of the systematic effects via nuisance parameter estimation therefore requires the design of a successful aggregation strategy over the batches (see for instance Ref. [24]). The extension of the approach proposed in Ref. [5] to handle systematic uncertainties in the split-aggregation NPLM strategy is the subject of ongoing study and is left for future work.
The potential application of the new batch-based NPLM strategy extends beyond offline analysis, to quasi-online signal-agnostic analyses of streamed data that are only temporarily available. Online approaches to data analysis make it possible to inspect collider data prior to the filtering stage applied by the triggers, opening the way to the exploration of phase space regions never analysed before, at significantly high rates. Efforts in this direction are ongoing at the LHC experiments. The CMS experiment is investigating new approaches to analyse the experimental data at the collision rate of 40 MHz, with focus on the muon detector chambers and the electromagnetic calorimeters [25, 26]. Moreover, the LHCb experiment has recently upgraded to a triggerless data readout at a rate of roughly 30 MHz, with partial event reconstruction that reduces the rate down to 1 MHz prior to the final trigger stage [27]. In this work we proposed two possible solutions to run the NPLM algorithm over a continuous stream of data under storage and computational resource constraints. In case of limited storage, we propose to train NPLM over multiple batches and evaluate the test only on the one batch available for long-term storage (NPLM-ONE). In complete absence of permanent storage, we propose to train NPLM over multiple batches and perform a goodness-of-fit test of the aggregated density model following the saturated likelihood-ratio approach (NPLM-SAT). Experiments run under constrained storage scenarios (NPLM-ONE) show that performing the aggregation over batches improves the modelling of the signal, increasing the power of the NPLM test applied to a single batch. As for the saturated test, we observe sensitivity comparable to that of the NPLM algorithm applied to the full statistics when the signal lies in high density regions, while the power is only partially recovered for signals localised in the tails.
In both cases, the sensitivity is expected to grow as the statistics of the processed data increases. Online analysis is a promising avenue for anomaly detection. The smart use of machine learning algorithms can leverage large statistics to efficiently navigate unexplored data and recognise novel rare processes of unexpected nature or location. The studies presented in this work are a first step in the direction of investigating online or quasi-online solutions for statistical anomaly detection with the NPLM algorithm. Detailed studies on scalability and uncertainty quantification in typical LHC data-streaming scenarios are left for future work.
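The constant-memory flavour of the streaming scenario (NPLM-SAT) can be sketched as a running mean: the average of the per-batch models is maintained as a running mean of their values on a fixed set of reference points, so storage does not grow with data-taking time. The class and attribute names below are illustrative, not part of the NPLM codebase.

```python
# Hypothetical sketch: constant-memory quasi-online aggregation of per-batch
# log density-ratio models over a fixed reference grid.
class RunningAggregate:
    def __init__(self, ref_points):
        self.ref = list(ref_points)          # fixed evaluation points
        self.mean_f = [0.0] * len(self.ref)  # running mean of f over batches
        self.n_batches = 0

    def update(self, f_batch):
        """Fold in the model learnt on the latest batch (a callable x -> f(x))."""
        self.n_batches += 1
        for i, x in enumerate(self.ref):
            # incremental mean: m_B = m_{B-1} + (f_B - m_{B-1}) / B
            self.mean_f[i] += (f_batch(x) - self.mean_f[i]) / self.n_batches
```

After each batch, `mean_f` holds the aggregated model on the reference set, which is all the saturated test needs.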
Acknowledgments
The author would like to thank Siddharth Mishra-Sharma, Phil Harris, Marco Zanetti and Marco Letizia for the useful conversations and constructive feedback on the manuscript.
References
- [1] M. Williams, “How good are your fits? Unbinned multivariate goodness-of-fit tests in high energy physics,” JINST, vol. 5, p. P09004, 2010.
- [2] C. Weisser and M. Williams, “Machine learning and multivariate goodness of fit,” 12 2016.
- [3] R. T. D’Agnolo and A. Wulzer, “Learning New Physics from a Machine,” Phys. Rev. D, vol. 99, no. 1, p. 015014, 2019.
- [4] R. T. D’Agnolo, G. Grosso, M. Pierini, A. Wulzer, and M. Zanetti, “Learning multivariate new physics,” Eur. Phys. J. C, vol. 81, no. 1, p. 89, 2021.
- [5] R. T. d’Agnolo, G. Grosso, M. Pierini, A. Wulzer, and M. Zanetti, “Learning new physics from an imperfect machine,” Eur. Phys. J. C, vol. 82, no. 3, p. 275, 2022.
- [6] M. Letizia, G. Losapio, M. Rando, G. Grosso, A. Wulzer, M. Pierini, M. Zanetti, and L. Rosasco, “Learning new physics efficiently with nonparametric methods,” Eur. Phys. J. C, vol. 82, no. 10, p. 879, 2022.
- [7] G. Grosso, M. Letizia, M. Pierini, and A. Wulzer, “Goodness of fit by Neyman-Pearson testing,” 5 2023.
- [8] J. Neyman and E. S. Pearson, “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Phil. Trans. Roy. Soc. Lond. A, vol. 231, no. 694-706, pp. 289–337, 1933.
- [9] G. Grosso, N. Lai, M. Letizia, J. Pazzini, M. Rando, L. Rosasco, A. Wulzer, and M. Zanetti, “Fast kernel methods for data quality monitoring as a goodness-of-fit test,” Mach. Learn. Sci. Tech., vol. 4, no. 3, p. 035029, 2023.
- [10] Y. Zhang, J. Duchi, and M. Wainwright, “Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates,” Journal of Machine Learning Research, vol. 16, no. 102, pp. 3299–3340, 2015.
- [11] R. D. Cousins, “Lectures on Statistics in Theory: Prelude to Statistics in Practice,” 7 2018.
- [12] A. Butter, S. Diefenbacher, G. Kasieczka, B. Nachman, T. Plehn, D. Shih, and R. Winterhalder, “Ephemeral Learning - Augmenting Triggers with Online-Trained Normalizing Flows,” SciPost Phys., vol. 13, no. 4, p. 087, 2022.
- [13] G. Meanti, L. Carratino, L. Rosasco, and A. Rudi, “Kernel methods through the roof: handling billions of points efficiently,” Advances in Neural Information Processing Systems, vol. 33, pp. 14410–14422, 2020. arXiv:2006.10350 [cs.LG].
- [14] A. Rudi, L. Carratino, and L. Rosasco, “Falkon: An optimal large scale kernel method,” Advances in neural information processing systems, vol. 30, 2017.
- [15] U. Marteau-Ferey, F. Bach, and A. Rudi, “Globally convergent newton methods for ill-conditioned generalized self-concordant losses,” Advances in Neural Information Processing Systems, vol. 32, 2019. arXiv:1907.01771 [math.OC].
- [16] O. Cerri, T. Q. Nguyen, M. Pierini, M. Spiropulu, and J.-R. Vlimant, “Variational Autoencoders for New Physics Mining at the Large Hadron Collider,” JHEP, vol. 05, p. 036, 2019.
- [17] O. Knapp, O. Cerri, G. Dissertori, T. Q. Nguyen, M. Pierini, and J.-R. Vlimant, “Adversarially Learned Anomaly Detection on CMS Open Data: re-discovering the top quark,” Eur. Phys. J. Plus, vol. 136, no. 2, p. 236, 2021.
- [18] E. Govorkova, E. Puljak, T. Aarrestad, M. Pierini, K. A. Woźniak, and J. Ngadiuba, “LHC physics dataset for unsupervised New Physics detection at 40 MHz,” Sci. Data, vol. 9, p. 118, 2022.
- [19] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: Training Dataset,” Zenodo, June 2021.
- [20] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: A → 4 leptons Signal Benchmark Dataset,” Zenodo, Oct. 2022.
- [21] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: LQ → b tau Signal Benchmark Dataset,” Zenodo, Oct. 2022.
- [22] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: h0 → tau tau Signal Benchmark Dataset,” Zenodo, Oct. 2022.
- [23] T. Aarrestad, E. Govorkova, J. Ngadiuba, E. Puljak, M. Pierini, and K. A. Wozniak, “Unsupervised New Physics detection at 40 MHz: h+ → tau nu Signal Benchmark Dataset,” Zenodo, Oct. 2022.
- [24] L. Heinrich, S. Mishra-Sharma, C. Pollard, and P. Windischhofer, “Hierarchical Neural Simulation-Based Inference Over Event Ensembles,” 6 2023.
- [25] R. Ardino et al., “A 40 MHz Level-1 trigger scouting system for the CMS Phase-2 upgrade,” Nucl. Instrum. Meth. A, vol. 1047, p. 167805, 2023.
- [26] M. Migliorini, J. Pazzini, A. Triossi, and M. Zanetti, “40 MHz triggerless readout of the CMS Drift Tube muon detector,” JINST, vol. 19, no. 02, p. C02050, 2024.
- [27] R. Aaij et al., “Allen: A high level trigger on GPUs for LHCb,” Comput. Softw. Big Sci., vol. 4, no. 1, p. 7, 2020.
- [28] G. Grosso, R. T. D’Agnolo, M. Pierini, A. Wulzer, and M. Zanetti, “Nplm: Learning multivariate new physics,” Jan. 2021.
7 Supplementary Materials
7.1 Datasets details
EXPO 1D.
This simple univariate setup, introduced in Ref. [3] and further studied in Refs. [5, 7], represents an energy or transverse-momentum spectrum that falls exponentially. Such distributions are fairly common in collider physics experiments, and studying GoF techniques in this setup is thus illustrative of some of the challenges associated with the search for new physics at these experiments. The density distribution under the reference model is defined as
n(x|R) = N(R) e^{−x},  x ∈ [0, ∞),    (21)
where N(R) denotes the number of expected events in the dataset. In our studies we fix the expected number of events for the full dataset, and we consider splittings into 2, 4 or 8 batches of equal size. To test the model's ability to detect features of various narrowness and in different locations, we consider four Gaussian signals
n(x|S) = N(S) (1 / (√(2π) σ̄)) exp(−(x − x̄)² / (2σ̄²)),    (22)
and an excess in the tail defined by
n(x|S) ∝ x² e^{−x}.    (23)
The data distribution under the alternative hypothesis therefore has the following form
n(x|H_S) = n(x|R) + n(x|S),    (24)
with total normalization N(R) + N(S). For this toy model, the ideal test statistic according to Neyman–Pearson can be computed analytically as
t_id(D) = −2 N(S) + 2 Σ_{x∈D} log(1 + n(x|S)/n(x|R)).    (25)
The fraction of signal injection over reference events, N(S)/N(R), the Z-score for the ideal test, as well as the location and scale of the Gaussian signals, are reported in Table 2. (The Z-score is defined as the quantile of a standard normal distribution whose survival function matches the p-value p: Z = Φ⁻¹(1 − p).)
signal | x̄ | σ̄ | N(S)/N(R) | Z (full batch)
---|---|---|---|---
Gaussian peak, bulk | 1.6 | 0.16 | | 4.9
Gaussian peak, broad | 4 | 0.64 | | 4.2
Gaussian peak, narrow | 4 | 0.01 | | 4.8
Gaussian peak, tail | 6.4 | 0.16 | | 4.0
Tail excess | | | | 4.5
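The Z-score definition given in the footnote can be sketched without a SciPy dependency: the standard normal survival function sf(z) = erfc(z/√2)/2 is monotonically decreasing, so it can be inverted by bisection. The function name and bracketing interval are our own choices for illustration.

```python
import math

# Sketch: Z = Phi^{-1}(1 - p), i.e. the z with normal survival function sf(z) = p.
def z_score(p, lo=-10.0, hi=10.0, tol=1e-10):
    sf = lambda z: 0.5 * math.erfc(z / math.sqrt(2.0))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sf(mid) > p:   # sf is decreasing: value still too large, move right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, p = 0.5 maps to Z = 0, and smaller p-values map to larger Z-scores.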
Running the NPLM algorithm requires a reference sample to compare the data with. Our sample is composed of 200k events (cf. Table 1), sampled according to the reference hypothesis. It should be noticed that this study is conducted under the assumption that we can sample synthetic data from the reference hypothesis, which is perfectly known, at will.
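For the Gaussian-signal case, the ideal test statistic of Eq. 25 can be evaluated directly. The sketch below assumes the unit-rate exponential reference and Gaussian signal shapes discussed above; the function name and argument names are hypothetical.

```python
import math

# Ideal Neyman-Pearson test statistic for the exponential toy with a Gaussian
# signal of yield n_s, location xbar and width sbar, on a reference of yield n_r:
# t_id(D) = -2 n_s + 2 sum_{x in D} log(1 + n_S(x) / n_R(x)).
def t_ideal(data, n_r, n_s, xbar, sbar):
    def ratio(x):  # n_S(x) / n_R(x), yields times unit-normalised shapes
        gauss = math.exp(-0.5 * ((x - xbar) / sbar) ** 2) / (sbar * math.sqrt(2 * math.pi))
        expo = math.exp(-x)
        return (n_s * gauss) / (n_r * expo)
    return -2.0 * n_s + 2.0 * sum(math.log1p(ratio(x)) for x in data)
```

As a sanity check, a vanishing signal yield gives t_id = 0 for any dataset, and events falling on the Gaussian peak contribute more than events far from it.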
DIMUON-5D.
The second benchmark considered in this work consists of a set of Monte Carlo simulated proton-proton collisions at 13 TeV at the LHC, with two oppositely charged muons in the final state [28]. In this dataset, each event is represented by five variables describing the kinematics of the dimuon system: the transverse momentum of each muon, their pseudorapidities, and the relative azimuthal angle between the two objects. We focus on final states with transverse momenta greater than 20 GeV, pseudorapidities lower than 2.1 in absolute value, and with an invariant mass of the dimuon system larger than 60 GeV. The dominant background process in this configuration is Drell–Yan production, which for the sake of simplicity we consider as the only source of background. In our experiments, the total number of expected events in the SM background hypothesis, after the acceptance selections are applied, is reported in Table 1. We then study the impact of splitting the data in four batches and combining them via the batch-based strategy proposed in Section 3. As signal benchmarks, we considered a set of Z′ bosons with variable mass and width, and a set of EFT scenarios described by dimension-six four-fermion operators with variable Wilson coefficients. Details on the signal benchmarks are given in Table 3.
signal | Mass (GeV) | Width (GeV) | N(S)/N(R) | Z (full batch)
---|---|---|---|---
Z′ | 600 | 16 | | 3
Z′ | 600 | 16 | | 7.1
Z′ | 300 | 8 | | 5
Z′ | 300 | 8 | | 8.8
Z′ | 200 | 5 | | 5.1
Z′ | 200 | 5 | | 9.1
EFT | | | N(D)/N(R) = 1.002 | 3.7

For the EFT scenario, the Wilson coefficient is set to 1.0.
L1CMS-24D.
As supplementary material we report the marginal distributions of the SM dataset and of the four signal benchmarks over the 24 input variables considered in this work (Figures 12 and 13). The narrow peak at zero present in all plots is the effect of "zero padding" (see main text for more comments).
7.2 Improving regularization and sensitivity by aggregating over batches
In this Appendix we provide additional plots showing the power curves of the aggregation method NPLM-ALL proposed in this work, both for the univariate EXPO-1D dataset (Figure 14) and for the multivariate DIMUON-5D one (Figures 15 and 16). The power of the method is compared with that of a simple sum of tests (details in Sections 3 and 3.1).
7.3 Impact of aggregation on a single batch test
We provide in Figure 17 additional plots showing the power curves of the aggregation method NPLM-ONE proposed in this work. The power of the method is compared with a simple sum of tests (details in Sections 3 and 3.2).
7.4 Saturated test statistic
In this Appendix we provide additional plots showing the power curves of the aggregation method NPLM-SAT proposed in this work (see Figures 18 and 19). The power of the method is compared with a simple sum of tests (details in Sections 3 and 3.3).