Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

]\orgdivDepartment of Physics, \orgnameCavendish Laboratory, \orgaddress\streetJJ Thomson Avenue, \cityCambridge, \postcodeCB3 0HE, \stateCambridgeshire, \countryUnited Kingdom

Statistical divergences in high-dimensional hypothesis testing and a modern technique for estimating them

\fnmJeremy J.H. \surWilkinson jjhw3@cam.ac.uk    \fnmChristopher G. \surLester cgl20@cam.ac.uk [
Abstract

Hypothesis testing in high dimensional data is a notoriously difficult problem without direct access to competing models’ likelihood functions. This paper argues that statistical divergences can be used to quantify the difference between the population distributions of observed data and competing models, justifying their use as the basis of a hypothesis test. We go on to point out how modern techniques for functional optimization let us estimate many divergences, without the need for population likelihood functions, using samples from two distributions alone. We use a physics-based example to show how the proposed two-sample test can be implemented in practice, and discuss the necessary steps required to mature the ideas presented into an experimental framework. The code used has been made available for others to use.

Introduction

Between the axioms of Bayesian probability theory and the Neyman–Pearson lemma, log-likelihood ratio based tests are generally accepted as the logical way of deciding between competing hypotheses. However, typical likelihood-ratio based tests rely on two assumptions which may not be fulfilled in practice:

  1. 1.

    Given arbitrary data, {xi}i=1Nsuperscriptsubscriptsubscript𝑥𝑖𝑖1𝑁\{x_{i}\}_{i=1}^{N}{ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, we have access to the likelihood, p(x|H)𝑝conditional𝑥𝐻p(x|H)italic_p ( italic_x | italic_H ), for each data point and competing hypothesis H𝐻Hitalic_H.

  2. 2.

    The true underlying process explaining our data is in fact within the set of models we consider. In our own words, we must be working in a complete model space.

In the absence of the first condition, it is clear that a parametric test simply cannot be performed. This situation readily arises in experiments with a moderately complex measurement apparatus, since the model of one’s detector response, stacked on top of the physical model of interest, typically won’t admit a computable likelihood function. A number of work arounds are used to bypass the need for direct access to the likelihood functions. A common approach relies on Monte-Carlo simulations to produce simulated data that may be used to fit functional forms approximating the model’s likelihood function or used to perform a binned likelihood test [1, 2]. These approaches have seen phenomenal success, but require significant data to provide a reasonable approximation of each competing model’s likelihood function. Since effective sampling density falls exponentially as a function of the number of dimensions in the data, such techniques are difficult to apply to high dimensional data. One is typically forced to marginalise over a significant number of the measured dimensions in order to obtain sufficiently high data-density to apply the aforementioned techniques. Unfortunately, marginalizing data comes with the risk of significantly reducing the sensitivity of the test, and it is not always clear which dimensions should be marginalized over. Performing a marginalised analysis of one’s data, finding nothing of interest, and performing the analysis again but marginalising over a different set of variables is precisely what the look-elsewhere effect warns of and is bound to produce unreliable conclusions. We believe that as scientific endeavours increase in complexity over time, we will inevitably have to embrace forms of hypothesis testing which do not rely on direct access to likelihood functions while retaining as much sensitivity as possible.

The philosophy of this paper revolves around the principal that one should favour models which predict a distribution of data most similar to the distributions we observe. There are many ways of measuring the similarity of two distributions, p𝑝pitalic_p and q𝑞qitalic_q, but we restrict our discussions to statistical divergences. That is, functions D𝐷Ditalic_D satisfying

  1. 1.

    D(p||q)0D\left(p||q\right)\geq 0italic_D ( italic_p | | italic_q ) ≥ 0

  2. 2.

    D(p||q)=0p=q.D\left(p||q\right)=0\iff p=q.italic_D ( italic_p | | italic_q ) = 0 ⇔ italic_p = italic_q .

A number of divergences have become very well-known. These include,

  1. 1.

    the Kullback-Leibler (KL) divergence, DKL(pq):=𝔼q[p(x)q(x)lnp(x)q(x)]D_{\mathrm{KL}}\left(p\middle\|q\right):=\mathbb{E}_{q}\left[\frac{p(x)}{q(x)}% \ln{\frac{p(x)}{q(x)}}\right]italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) := blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_q ( italic_x ) end_ARG roman_ln divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_q ( italic_x ) end_ARG ],

  2. 2.

    the chi-squared, χ2(p||q):=𝔼q[1q(x)(p(x)q(x))2]\chi^{2}\left(p||q\right):=\mathbb{E}_{q}\left[\frac{1}{q(x)}\left(p(x)-q(x)% \right)^{2}\right]italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_p | | italic_q ) := blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ divide start_ARG 1 end_ARG start_ARG italic_q ( italic_x ) end_ARG ( italic_p ( italic_x ) - italic_q ( italic_x ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ],

  3. 3.

    the Total Variational distance DTV(p||q)=12𝔼q[|p(x)q(x)1|]D_{\mathrm{TV}}\left(p||q\right)=\frac{1}{2}\mathbb{E}_{q}\left[\left|\frac{p(% x)}{q(x)}-1\right|\right]italic_D start_POSTSUBSCRIPT roman_TV end_POSTSUBSCRIPT ( italic_p | | italic_q ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ | divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_q ( italic_x ) end_ARG - 1 | ],

  4. 4.

    and the Jensen-Shannon divergence,

    DJS(p,q)=12(DKL(p12(p+q))+DKL(q12(p+q))).D_{\mathrm{JS}}\left(p,q\right)=\frac{1}{2}\left(\vphantom{\int}D_{\mathrm{KL}% }\left(\,p\,\middle\|\ \tfrac{1}{2}(p+q)\,\right)+D_{\mathrm{KL}}\left(\,q\,% \middle\|\ \tfrac{1}{2}(p+q)\,\right)\right).italic_D start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_p + italic_q ) ) + italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_q ∥ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_p + italic_q ) ) ) .

Since the proliferation of back-propagation as a tool for functional optimization, a number of authors have pointed out the potential for machine learning as a tool for estimating the divergence between two distributions using nothing more than a set of samples from each. No direct access to the likelihood function required. We wonder whether the significance of this result has been overlooked by the scientific community, and whether we may turn these results into a tool for inference which is maximally sensitive to differences between models and data while remaining effective and practical in arbitrarily high dimensions.

Before diving into technical details, we believe it is important to confront one more philosophical point. The Bayesian school of thought is founded upon a unique set of logically consistent rules for combining degrees of belief in a set of statements. Similarly, from the frequentest perspective, the Neyman-Pearson lemma proves that the log-likelihood ratio is the uniquely most powerful test statistic (the UMP) for deciding between two hypothesis H0subscript𝐻0H_{0}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and H1subscript𝐻1H_{1}italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. Does this immediately doom a divergence-based hypothesis test to be inherently sub-optimal? The short answer is no. The reason for this is a subtle consequence of the inevitable violation of the second condition described above. Even if you’re not convinced by the argument that ‘all models are approximate’, one can never know one’s detector response to infinite precision, nor can one control for all variations in an experiment’s initial setup and external influences. Assumptions made about these factors are implicitly built into each and every hypothesis and have tangible effects on their likelihood functions. Therefore we are always necessarily working in an incomplete model space. The proof of the Neyman-Pearson lemma relies on the assumption that either H0subscript𝐻0H_{0}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT or H1subscript𝐻1H_{1}italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT provides the true explanation of our data and therefore predicts the observed data distribution exactly. Likewise, if one is to be strictly honest within the Bayesian framework, how can one assign a non-zero degree of belief to any hypothesis in an incomplete model space? Ignoring this fact and simply cranking the mathematical handle requires the concession that the quantities we are calculating are not degrees of belief, but rather something else, to which the rules of Bayesian inference need not apply. Then again no amount of pessimism about the foundations of these techniques can refute their incontrovertible record of success in practice. A reasonable response might point out that clearly these tests cannot pick the true underlying hypothesis out of an incomplete model space (by definition), but evidence suggests that they must be selecting the model which is, by at least some definition, most similar to the data. Indeed, this point of view forms the core of this paper, and it is surprisingly easy to prove that these tests implicitly use the KL divergence as their measure of ‘similarity’. To understand why, consider a standard log-likelihood ratio test which rejects the null hypothesis if the test statistic,

t^=i=1Nlog(L(xi|H1)L(xi|H0)),^𝑡superscriptsubscript𝑖1𝑁𝐿conditionalsubscript𝑥𝑖subscript𝐻1𝐿conditionalsubscript𝑥𝑖subscript𝐻0\hat{t}=\sum_{i=1}^{N}\log\left(\frac{L(x_{i}|H_{1})}{L(x_{i}|H_{0})}\right),over^ start_ARG italic_t end_ARG = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log ( divide start_ARG italic_L ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_ARG start_ARG italic_L ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG ) ,

is greater than some decision threshold value α𝛼\alphaitalic_α. In a complete model space, the Neyman-Pearson lemma assures us that t^^𝑡\hat{t}over^ start_ARG italic_t end_ARG is a UMP, but how should we understand this procedure in an incomplete model space in which the underlying data is in fact explained by an unknown alternative model HTsubscript𝐻𝑇H_{T}italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT? In this case it is instructive to re-arrange t^^𝑡\hat{t}over^ start_ARG italic_t end_ARG,

t^=i=1Nlog(L(xi|H1))i=1Nlog(L(xi|H0))^𝑡superscriptsubscript𝑖1𝑁𝐿conditionalsubscript𝑥𝑖subscript𝐻1superscriptsubscript𝑖1𝑁𝐿conditionalsubscript𝑥𝑖subscript𝐻0\hat{t}=\sum_{i=1}^{N}\log\left(L(x_{i}|H_{1})\right)-\sum_{i=1}^{N}\log\left(% L(x_{i}|H_{0})\right)over^ start_ARG italic_t end_ARG = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log ( italic_L ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log ( italic_L ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) )
=N(1Ni=1Nlog(L(xi|HT)L(xi|H0))1Ni=1Nlog(L(xi|HT)L(xi|H1))).absent𝑁1𝑁superscriptsubscript𝑖1𝑁𝐿conditionalsubscript𝑥𝑖subscript𝐻𝑇𝐿conditionalsubscript𝑥𝑖subscript𝐻01𝑁superscriptsubscript𝑖1𝑁𝐿conditionalsubscript𝑥𝑖subscript𝐻𝑇𝐿conditionalsubscript𝑥𝑖subscript𝐻1=N\left(\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{L(x_{i}|H_{T})}{L(x_{i}|H_{0}% )}\right)-\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{L(x_{i}|H_{T})}{L(x_{i}|H_{% 1})}\right)\right).= italic_N ( divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log ( divide start_ARG italic_L ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) end_ARG start_ARG italic_L ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG ) - divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log ( divide start_ARG italic_L ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) end_ARG start_ARG italic_L ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_ARG ) ) .

Since the observed data points are themselves drawn from the underlying distribution L(x|HT)𝐿conditional𝑥subscript𝐻𝑇L(x|H_{T})italic_L ( italic_x | italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ), t^^𝑡\hat{t}over^ start_ARG italic_t end_ARG is in fact an unbiased estimator for the difference between KL divergence of the competing hypotheses and the true data distribution,

𝔼[t^]=N(𝔼HT[log(L(x|HT)L(x|H0))]𝔼HT[log(L(x|HT)L(x|H1))])𝔼delimited-[]^𝑡𝑁subscript𝔼subscript𝐻𝑇delimited-[]𝐿conditional𝑥subscript𝐻𝑇𝐿conditional𝑥subscript𝐻0subscript𝔼subscript𝐻𝑇delimited-[]𝐿conditional𝑥subscript𝐻𝑇𝐿conditional𝑥subscript𝐻1\mathbb{E}\left[\hat{t}\right]=N\left(\mathbb{E}_{H_{T}}\left[\log\left(\frac{% L(x|H_{T})}{L(x|H_{0})}\right)\right]-\mathbb{E}_{H_{T}}\left[\log\left(\frac{% L(x|H_{T})}{L(x|H_{1})}\right)\right]\right)blackboard_E [ over^ start_ARG italic_t end_ARG ] = italic_N ( blackboard_E start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log ( divide start_ARG italic_L ( italic_x | italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) end_ARG start_ARG italic_L ( italic_x | italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG ) ] - blackboard_E start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log ( divide start_ARG italic_L ( italic_x | italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) end_ARG start_ARG italic_L ( italic_x | italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_ARG ) ] )
=N(DKL(L(|HT)L(|H0))DKL(L(|HT)L(|H1))).=N\left(D_{\mathrm{KL}}\left(L(\cdot|H_{T})\middle\|L(\cdot|H_{0})\right)-D_{% \mathrm{KL}}\left(L(\cdot|H_{T})\middle\|L(\cdot|H_{1})\right)\right).= italic_N ( italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_L ( ⋅ | italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) ∥ italic_L ( ⋅ | italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) - italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_L ( ⋅ | italic_H start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) ∥ italic_L ( ⋅ | italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) ) .

As a result, regardless of the value of α𝛼\alphaitalic_α, as N𝑁N\rightarrow\inftyitalic_N → ∞ the law of large numbers assures us that the standard log-likelihood ratio test picks out the hypothesis with the smallest KL-divergence between the predicted data distribution and the true underlying data distribution. Remarkably, not only are the concerns that divergence-based hypothesis tests might be provably sub-optimal unfounded, we have in fact been performing divergence-based hypothesis testing all along. A very similar argument may be applied to the Bayesian framework in an incomplete model space to reach the same conclusion.

Once you have let go of the idea that the KL divergence is somehow the uniquely correct way of deciding between competing hypotheses in an incomplete model space, wherein the theorems of optimality do not apply, one may consider many interesting alternative divergence-based techniques.111Indeed, recurring recognition of the need to have good ways of testing whether samples are (or are not) drawn from the same population has motivated related approaches from others in physics. This is particularly so in cases where samples live in high-dimensional spaces and/or where underlying likelihoods are not computable. See for example Ref. [3].

1 A simple example using binary classification

We introduce the relationship between divergences and machine learning using the familiar problem of binary classification and a trick first written about, as far as we know, in the literature on Generative Adversarial Networks [4]. A binary classifier, ϕitalic-ϕ\phiitalic_ϕ, used to discriminate samples from two categories, p𝑝pitalic_p and q𝑞qitalic_q, is typically trained to minimize a Binary Cross Entropy (BCE) loss function,

^[ϕ]:=1Ni=1N(lilog(ϕ(xi))+(1li)log(1ϕ(xi))),assign^delimited-[]italic-ϕ1𝑁superscriptsubscript𝑖1𝑁subscript𝑙𝑖italic-ϕsubscript𝑥𝑖1subscript𝑙𝑖1italic-ϕsubscript𝑥𝑖\hat{\mathcal{L}}[\phi]:=-\frac{1}{N}\sum_{i=1}^{N}\left(\vphantom{\int}l_{i}% \log(\phi(x_{i}))+(1-l_{i})\log(1-\phi(x_{i}))\right),over^ start_ARG caligraphic_L end_ARG [ italic_ϕ ] := - divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log ( italic_ϕ ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) + ( 1 - italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_log ( 1 - italic_ϕ ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) ) , (1)

where each data point xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is paired with a corresponding label

li={0if xi was drawn from p,1if xi was drawn from q.subscript𝑙𝑖cases0if subscript𝑥𝑖 was drawn from 𝑝1if subscript𝑥𝑖 was drawn from 𝑞l_{i}=\begin{cases}0&\text{if }x_{i}\text{ was drawn from }p,\\ 1&\text{if }x_{i}\text{ was drawn from }q.\end{cases}italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { start_ROW start_CELL 0 end_CELL start_CELL if italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT was drawn from italic_p , end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL if italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT was drawn from italic_q . end_CELL end_ROW

Assuming an equal number of samples from both categories, the expected value of the loss for fixed ϕitalic-ϕ\phiitalic_ϕ is given by

¯[ϕ]:=𝔼[^[ϕ]]=𝑑x(12p(x)log(ϕ(x))+12q(x)log(1ϕ(x))).assign¯delimited-[]italic-ϕ𝔼delimited-[]^delimited-[]italic-ϕdifferential-d𝑥12𝑝𝑥italic-ϕ𝑥12𝑞𝑥1italic-ϕ𝑥\bar{\mathcal{L}}[\phi]:=\mathbb{E}\left[\hat{\mathcal{L}}[\phi]\right]=-\int dx% \left(\vphantom{\int}\frac{1}{2}p(x)\log(\phi(x))+\ \frac{1}{2}q(x)\log(1-\phi% (x))\right).over¯ start_ARG caligraphic_L end_ARG [ italic_ϕ ] := blackboard_E [ over^ start_ARG caligraphic_L end_ARG [ italic_ϕ ] ] = - ∫ italic_d italic_x ( divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_p ( italic_x ) roman_log ( italic_ϕ ( italic_x ) ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_q ( italic_x ) roman_log ( 1 - italic_ϕ ( italic_x ) ) ) .

A functional variation with respect to ϕitalic-ϕ\phiitalic_ϕ readily shows the well known result that the expected loss is minimized at ϕmin(x)=L(p|x)=p(x)p(x)+q(x)subscriptitalic-ϕmin𝑥𝐿conditional𝑝𝑥𝑝𝑥𝑝𝑥𝑞𝑥\phi_{\mathrm{min}}(x)=L(p|x)=\frac{p(x)}{p(x)+q(x)}italic_ϕ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT ( italic_x ) = italic_L ( italic_p | italic_x ) = divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_p ( italic_x ) + italic_q ( italic_x ) end_ARG, the likelihood that the given sample x𝑥xitalic_x was drawn from category p𝑝pitalic_p. We refer to this critical function as the optimal classifier for p𝑝pitalic_p and q𝑞qitalic_q222Since for any x𝑥xitalic_x, ϕmin(x)[0,1]subscriptitalic-ϕmin𝑥01\phi_{\mathrm{min}}(x)\in\left[0,1\right]italic_ϕ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT ( italic_x ) ∈ [ 0 , 1 ], the classifier’s output may be constrained to fall in [0,1]01\left[0,1\right][ 0 , 1 ] without any loss of generality.. Substituting ϕminsubscriptitalic-ϕmin\phi_{\mathrm{min}}italic_ϕ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT into ¯¯\bar{\mathcal{L}}over¯ start_ARG caligraphic_L end_ARG demonstrates that the minimum expected loss, attained by the optimal classifier, is related to the Jensen-Shannon divergence between the category distributions,

¯[ϕmin]¯delimited-[]subscriptitalic-ϕmin\displaystyle\bar{\mathcal{L}}[\phi_{\mathrm{min}}]over¯ start_ARG caligraphic_L end_ARG [ italic_ϕ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT ] =12(𝔼p[log(p(x)p(x)+q(x))]+𝔼q[log(q(x)p(x)+q(x))])absent12subscript𝔼𝑝delimited-[]𝑝𝑥𝑝𝑥𝑞𝑥subscript𝔼𝑞delimited-[]𝑞𝑥𝑝𝑥𝑞𝑥\displaystyle=-\frac{1}{2}\left(\mathbb{E}_{p}\left[\log\left(\frac{p(x)}{p(x)% +q(x)}\right)\right]+\mathbb{E}_{q}\left[\log\left(\frac{q(x)}{p(x)+q(x)}% \right)\right]\right)= - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( blackboard_E start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT [ roman_log ( divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_p ( italic_x ) + italic_q ( italic_x ) end_ARG ) ] + blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ roman_log ( divide start_ARG italic_q ( italic_x ) end_ARG start_ARG italic_p ( italic_x ) + italic_q ( italic_x ) end_ARG ) ] )
=log(2)12(DKL(p12(p+q))+DKL(q12(p+q)))\displaystyle=\log(2)-\frac{1}{2}\left(D_{\mathrm{KL}}\left(p\middle\|\tfrac{1% }{2}(p+q)\right)+D_{\mathrm{KL}}\left(q\middle\|\tfrac{1}{2}(p+q)\right)\right)= roman_log ( 2 ) - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_p + italic_q ) ) + italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_q ∥ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_p + italic_q ) ) )
=log(2)DJS(p,q).absent2subscript𝐷JS𝑝𝑞\displaystyle=\log(2)-D_{\mathrm{JS}}\left(p,q\right).= roman_log ( 2 ) - italic_D start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) .

It is typically not possible for a machine learning algorithm to match the optimal classifier exactly. However, ¯¯\bar{\mathcal{L}}over¯ start_ARG caligraphic_L end_ARG is minimized at ϕminsubscriptitalic-ϕmin\phi_{\mathrm{min}}italic_ϕ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT, so for any function ϕitalic-ϕ\phiitalic_ϕ,

¯[ϕ]=log(2)DJS(p,q)+E,¯delimited-[]italic-ϕ2subscript𝐷𝐽𝑆𝑝𝑞𝐸\bar{\mathcal{L}}[\phi]=\log(2)-D_{JS}(p,q)+E,over¯ start_ARG caligraphic_L end_ARG [ italic_ϕ ] = roman_log ( 2 ) - italic_D start_POSTSUBSCRIPT italic_J italic_S end_POSTSUBSCRIPT ( italic_p , italic_q ) + italic_E ,

where the training error, E𝐸Eitalic_E, is strictly non-negative. So regardless of what function the classifier actually converges onto,

DJS(p,q)log(2)¯[ϕ],subscript𝐷JS𝑝𝑞2¯delimited-[]italic-ϕD_{\mathrm{JS}}\left(p,q\right)\geq\log(2)-\bar{\mathcal{L}}[\phi],italic_D start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) ≥ roman_log ( 2 ) - over¯ start_ARG caligraphic_L end_ARG [ italic_ϕ ] ,

and equality is achieved at ϕminsubscriptitalic-ϕmin\phi_{\mathrm{min}}italic_ϕ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT. Using the sample estimator for ¯¯\bar{\mathcal{L}}over¯ start_ARG caligraphic_L end_ARG in Equation (1), we appear to have attained our first technique for estimating the divergence between two distributions using only samples from p𝑝pitalic_p and q𝑞qitalic_q,

D^JS(p,q)=log(2)+1Ni=1N(lilog(ϕ(xi))+(1li)log(1ϕ(xi)))subscript^𝐷JS𝑝𝑞21𝑁superscriptsubscript𝑖1𝑁subscript𝑙𝑖italic-ϕsubscript𝑥𝑖1subscript𝑙𝑖1italic-ϕsubscript𝑥𝑖\displaystyle\hat{D}_{\mathrm{JS}}\left(p,q\right)=\log(2)+\frac{1}{N}\sum_{i=% 1}^{N}\left(l_{i}\log(\phi(x_{i}))+(1-l_{i})\log(1-\phi(x_{i}))\right)over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) = roman_log ( 2 ) + divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log ( italic_ϕ ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) + ( 1 - italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_log ( 1 - italic_ϕ ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) ) (2)

where

𝔼[D^JS(p,q)]DJS(p,q).𝔼delimited-[]subscript^𝐷JS𝑝𝑞subscript𝐷JS𝑝𝑞\displaystyle\mathbb{E}\left[\hat{D}_{\mathrm{JS}}\left(p,q\right)\right]\leq D% _{\mathrm{JS}}\left(p,q\right).blackboard_E [ over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) ] ≤ italic_D start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) . (3)

However there is one important caveat. It is important to notice that although ¯¯\bar{\mathcal{L}}over¯ start_ARG caligraphic_L end_ARG is minimized by the optimal classifier of p𝑝pitalic_p and q𝑞qitalic_q, the sample estimator ^^\hat{\mathcal{L}}over^ start_ARG caligraphic_L end_ARG in Equation (1) is not. The sample loss is in fact minimized by the function ϕ(x)=Np(x)Np(x)+Nq(x)italic-ϕ𝑥subscript𝑁𝑝𝑥subscript𝑁𝑝𝑥subscript𝑁𝑞𝑥\phi(x)=\frac{N_{p}(x)}{N_{p}(x)+N_{q}(x)}italic_ϕ ( italic_x ) = divide start_ARG italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_x ) end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_x ) + italic_N start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( italic_x ) end_ARG where Np(x)subscript𝑁𝑝𝑥N_{p}(x)italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_x ) and Nq(x)subscript𝑁𝑞𝑥N_{q}(x)italic_N start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( italic_x ) give the number of times the value x𝑥xitalic_x is encountered in the training dataset associated with categories p𝑝pitalic_p and q𝑞qitalic_q respectively. In other words, the sample loss is minimized by the optimal classifier of the two sample-distributions, not the optimal classier of the population distributions. Similarly, ^[ϕ]^delimited-[]italic-ϕ\hat{\mathcal{L}}[\phi]over^ start_ARG caligraphic_L end_ARG [ italic_ϕ ] is not an unbiased estimator for ¯[ϕ]¯delimited-[]italic-ϕ\bar{\mathcal{L}}[\phi]over¯ start_ARG caligraphic_L end_ARG [ italic_ϕ ] when evaluated on the dataset used to train the classifier. So, just like any other machine learning problem, over-fitting is a concern. Fortunately, when evaluated on an independent set of validation data-points, the relation

𝔼[^val[ϕ]]=¯[ϕ]𝔼delimited-[]subscript^valdelimited-[]italic-ϕ¯delimited-[]italic-ϕ\mathbb{E}\left[\hat{\mathcal{L}}_{\mathrm{val}}[\phi]\right]=\bar{\mathcal{L}% }[\phi]blackboard_E [ over^ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT roman_val end_POSTSUBSCRIPT [ italic_ϕ ] ] = over¯ start_ARG caligraphic_L end_ARG [ italic_ϕ ]

holds, inequality (3) is satisfied, and the expected value of the estimator D^JS(p,q)subscript^𝐷JS𝑝𝑞\hat{D}_{\mathrm{JS}}\left(p,q\right)over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) evaluated on the validation set provides a reliable lower bound of the true value of DJS(p,q)subscript𝐷JS𝑝𝑞D_{\mathrm{JS}}\left(p,q\right)italic_D start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ). As with any estimator, D^JS(p,q)subscript^𝐷JS𝑝𝑞\hat{D}_{\mathrm{JS}}\left(p,q\right)over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) has an associated uncertainty which may be estimated from the data. If one assumes that a fixed number of samples from each category, Nvalsubscript𝑁valN_{\mathrm{val}}italic_N start_POSTSUBSCRIPT roman_val end_POSTSUBSCRIPT, are used for validation, then Equation (2) reduces to

D^JS(p,q)=log(2)+12(1Nvalx from plog(ϕ(x))+1Nvalx from qlog(1ϕ(x))),subscript^𝐷JS𝑝𝑞2121subscript𝑁valsubscript𝑥 from 𝑝italic-ϕ𝑥1subscript𝑁valsubscript𝑥 from 𝑞1italic-ϕ𝑥\displaystyle\hat{D}_{\mathrm{JS}}\left(p,q\right)=\log(2)+\frac{1}{2}\left(% \frac{1}{N_{\mathrm{val}}}\sum_{x\text{ from }p}\log\left(\phi\left(x\right)% \right)+\frac{1}{N_{\mathrm{val}}}\sum_{x\text{ from }q}\log\left(1-\phi\left(% x\right)\right)\right),over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) = roman_log ( 2 ) + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_val end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_x from italic_p end_POSTSUBSCRIPT roman_log ( italic_ϕ ( italic_x ) ) + divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT roman_val end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_x from italic_q end_POSTSUBSCRIPT roman_log ( 1 - italic_ϕ ( italic_x ) ) ) , (4)

and the standard sample mean uncertainties of the two sums may be combined using standard rules to give an uncertainty estimate for D^JS(p,q)subscript^𝐷JS𝑝𝑞\hat{D}_{\mathrm{JS}}\left(p,q\right)over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ). If the number of samples in each category is allowed to vary, simple generalisations may be derived in each case.

Refer to caption
Figure 1: Left: An example of two normal distributions on the real numbers, p𝑝pitalic_p and q𝑞qitalic_q, with unit standard deviation and separation Δμ:=μqμp=1.0assignΔ𝜇subscript𝜇𝑞subscript𝜇𝑝1.0\Delta\mu:=\mu_{q}-\mu_{p}=1.0roman_Δ italic_μ := italic_μ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = 1.0 overlaid onto two histograms of 1000100010001000 samples from each distribution. We use p𝑝pitalic_p and q𝑞qitalic_q as both the category label and the density function, p(x)𝑝𝑥p(x)italic_p ( italic_x ) and q(x)𝑞𝑥q(x)italic_q ( italic_x ), where x𝑥xitalic_x is used to denote an arbitrary real number. Right: Each dashed line shows the numerically integrated Jensen-Shannon divergence between p𝑝pitalic_p and q𝑞qitalic_q as the separation ΔμΔ𝜇\Delta\muroman_Δ italic_μ is varied. For each separation, the solid line shows the convergence of D^JS(p,q)=log(2)^valsubscript^𝐷JS𝑝𝑞2subscript^val\hat{D}_{\mathrm{JS}}\left(p,q\right)=\log(2)-\hat{\mathcal{L}}_{\text{val}}over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) = roman_log ( 2 ) - over^ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT val end_POSTSUBSCRIPT onto DJS(p,q)subscript𝐷JS𝑝𝑞D_{\mathrm{JS}}\left(p,q\right)italic_D start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) for a binary classifier trained and validated on separate datasets of 1000 samples per category.

Figure 1 shows the results of the technique applied to pairs of 1D Gaussian distributed categories. A simple neural network consisting of a sequence of linear layers (1×64×64×64×1164646411\times 64\times 64\times 64\times 11 × 64 × 64 × 64 × 1) separated by LeakyReLU activation functions and ending in a sigmoid normalisation was trained on 1000100010001000 examples from each category to optimise a binary cross entropy loss function. The value of D^JS(p,q)=log(2)^valsubscript^𝐷JS𝑝𝑞2subscript^val\hat{D}_{\mathrm{JS}}\left(p,q\right)=\log(2)-\hat{\mathcal{L}}_{\text{val}}over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) = roman_log ( 2 ) - over^ start_ARG caligraphic_L end_ARG start_POSTSUBSCRIPT val end_POSTSUBSCRIPT was evaluated on an independent set of 1000100010001000 samples from each category at the end of each epoch. The convergence of D^JS(p,q)subscript^𝐷JS𝑝𝑞\hat{D}_{\mathrm{JS}}\left(p,q\right)over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) onto DJS(p,q)subscript𝐷JS𝑝𝑞D_{\mathrm{JS}}\left(p,q\right)italic_D start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) is shown in the right hand panel consistent with a training error of E0𝐸0E\approx 0italic_E ≈ 0 and fluctuations consistent with the error-bars derived as previously described.

To map this example into one of hypothesis testing, consider a set of measured data points and a number of hypotheses which each predict Gaussian distributed measurements with varying means. The results in Figure 1 demonstrate how we might easily find the hypothesis which, according to our best estimate, has the smallest divergence to the data distribution using samples from each hypothesis. Although the example is extremely simple, the beauty of this approach is that the complexity of the algorithm effectively doesn’t scale at all with the complexity of the category distributions, nor the dimension of the categories. The same cannot be said for the binned and likelihood approximation approaches mentioned in the introduction.

2 Dual representations, f𝑓fitalic_f-divergences and functional optimization

Section 1 provides the most familiar example of a divergence calculation through machine learning that we know of, but as it so happens, a wide range of theorems exist relating all sorts of divergences to a corresponding variational problem referred to as a dual representation. Some authors even explore the possibility of calculating these divergences using machine learning [4, 5, 6, 7] but we believe these miss the application for these tools in the experimental sciences. In our work so far, we have focused on the class of f𝑓fitalic_f-divergences which include many of the common divergences encountered in other contexts. These include the Jensen-Shannon divergence, KL divergence, the total variational distance, and many others. Every f𝑓fitalic_f-divergence is identified by a function f:[0,)(,]:𝑓0f:[0,\infty)\rightarrow(-\infty,\infty]italic_f : [ 0 , ∞ ) → ( - ∞ , ∞ ] which must:

  1. 1.

    be convex;

  2. 2.

    satisfy f(1)=0𝑓10f(1)=0italic_f ( 1 ) = 0;

  3. 3.

    be finite everywhere except possibly at f(0)𝑓0f(0)italic_f ( 0 ); and

  4. 4.

    be right continuous at 00, that is limt0+f(t)=f(0)subscript𝑡superscript0𝑓𝑡𝑓0\lim_{t\rightarrow 0^{+}}f(t)=f(0)roman_lim start_POSTSUBSCRIPT italic_t → 0 start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_f ( italic_t ) = italic_f ( 0 ), although f(0)𝑓0f(0)italic_f ( 0 ) may be infinite.

Given such a generating function, the f𝑓fitalic_f-divergence between a distribution p𝑝pitalic_p absolutely continuous with respect to a distribution q𝑞qitalic_q is defined by

Df(pq):=dxq(x)f(p(x)q(x))=𝔼q[f(p(x)q(x))].\displaystyle D_{f}\left(p\middle\|q\right):=\int\mathrm{d}x\,q(x)\,f\left(% \frac{p(x)}{q(x)}\right)=\mathbb{E}_{q}\left[f\left(\frac{p(x)}{q(x)}\right)% \right].italic_D start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) := ∫ roman_d italic_x italic_q ( italic_x ) italic_f ( divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_q ( italic_x ) end_ARG ) = blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ italic_f ( divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_q ( italic_x ) end_ARG ) ] . (5)

A dual representation of a given f𝑓fitalic_f-divergence may be constructed using the Legendre transformation of f𝑓fitalic_f, denoted as fsuperscript𝑓f^{*}italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, and the trick [8]

Df(pq)=supϕ{𝔼p[ϕ(x)]𝔼q[f(ϕ(x))]}.\displaystyle D_{f}\left(p\middle\|q\right)=\sup_{\phi\in\mathcal{F}}\left\{% \mathbb{E}_{p}\left[\phi(x)\right]-\mathbb{E}_{q}\left[f^{*}(\phi(x))\right]% \right\}.italic_D start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) = roman_sup start_POSTSUBSCRIPT italic_ϕ ∈ caligraphic_F end_POSTSUBSCRIPT { blackboard_E start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT [ italic_ϕ ( italic_x ) ] - blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_ϕ ( italic_x ) ) ] } . (6)

The supremum is taken over the set of all functions, \mathcal{F}caligraphic_F, from the sample space ΩΩ\Omegaroman_Ω to the domain of fsuperscript𝑓f^{*}italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. Using the well-known properties of the Legendre transform and functional variation of the supremund in Equation (6), one may derive a convenient alternative form for any generating function differentiable on (0,)0(0,\infty)( 0 , ∞ ),

Df(pq)=supϕ:Ω{𝔼p[f(eϕ(x))]𝔼q[f(f(eϕ(x)))]}.\displaystyle D_{f}\left(p\middle\|q\right)=\sup_{\phi:\Omega\rightarrow% \mathbb{R}}\left\{\mathbb{E}_{p}\left[f^{\prime}\left(e^{\phi(x)}\right)\right% ]-\mathbb{E}_{q}\left[f^{*}\left(f^{\prime}\left(e^{\phi(x)}\right)\right)% \right]\right\}.italic_D start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) = roman_sup start_POSTSUBSCRIPT italic_ϕ : roman_Ω → blackboard_R end_POSTSUBSCRIPT { blackboard_E start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT [ italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_e start_POSTSUPERSCRIPT italic_ϕ ( italic_x ) end_POSTSUPERSCRIPT ) ] - blackboard_E start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT [ italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_e start_POSTSUPERSCRIPT italic_ϕ ( italic_x ) end_POSTSUPERSCRIPT ) ) ] } . (7)

In this case we are free to take the supremum over all functions from ΩΩ\Omegaroman_Ω to the whole real number line, and the supremum is attained by the log-likelihood ratio ϕ(x)=log(p(x)q(x))italic-ϕ𝑥𝑝𝑥𝑞𝑥\phi(x)=\log\left(\frac{p(x)}{q(x)}\right)italic_ϕ ( italic_x ) = roman_log ( divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_q ( italic_x ) end_ARG ). Fortunately the majority of generating functions of interest fit into this category. Table 1 provides some examples and the functions needed to implement Equation 6 or 7.

Name Generating function, f𝑓fitalic_f fsuperscript𝑓f^{*}italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT dom(f)domsuperscript𝑓\operatorname{dom}(f^{*})roman_dom ( italic_f start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) fsuperscript𝑓f^{\prime}italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT
KL tlnt𝑡𝑡t\ln titalic_t roman_ln italic_t et1superscript𝑒𝑡1e^{t-1}italic_e start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT \mathbb{R}blackboard_R 1+log(t)1𝑡1+\log(t)1 + roman_log ( italic_t )
Jensen-Shannon 12(tlnt(t+1)ln(t+12))12𝑡𝑡𝑡1𝑡12\frac{1}{2}\left(t\ln t-(t+1)\ln\left(\frac{t+1}{2}\right)\right)divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_t roman_ln italic_t - ( italic_t + 1 ) roman_ln ( divide start_ARG italic_t + 1 end_ARG start_ARG 2 end_ARG ) ) 12ln(2e2t)122superscript𝑒2𝑡-\frac{1}{2}\ln(2-e^{2t})- divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_ln ( 2 - italic_e start_POSTSUPERSCRIPT 2 italic_t end_POSTSUPERSCRIPT ) (,12ln(2))122\left(-\infty,\tfrac{1}{2}\ln(2)\right)( - ∞ , divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_ln ( 2 ) ) 12log(2tt+1)122𝑡𝑡1\frac{1}{2}\log\left(\frac{2t}{t+1}\right)divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_log ( divide start_ARG 2 italic_t end_ARG start_ARG italic_t + 1 end_ARG )
Total variational 12|t1|12𝑡1\frac{1}{2}|t-1|divide start_ARG 1 end_ARG start_ARG 2 end_ARG | italic_t - 1 | t𝑡titalic_t [12,12]1212\left[-\frac{1}{2},\frac{1}{2}\right][ - divide start_ARG 1 end_ARG start_ARG 2 end_ARG , divide start_ARG 1 end_ARG start_ARG 2 end_ARG ] N/A
χ2superscript𝜒2\chi^{2}italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT 12(t1)212superscript𝑡12\frac{1}{2}(t-1)^{2}divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_t - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT t(12t+1)𝑡12𝑡1t\left(\tfrac{1}{2}t+1\right)italic_t ( divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_t + 1 ) \mathbb{R}blackboard_R t1𝑡1t-1italic_t - 1
Table 1: A few examples of common f𝑓fitalic_f-divergences, their generating functions, corresponding Legendre transforms, Legendre transform domains, and generating function first derivatives.

Using Equation (7) it is simple to construct an estimator for a lower bound of Df(pq)D_{f}\left(p\middle\|q\right)italic_D start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) in almost complete analogy with the estimator D^JS(p,q)subscript^𝐷JS𝑝𝑞\hat{D}_{\mathrm{JS}}\left(p,q\right)over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_JS end_POSTSUBSCRIPT ( italic_p , italic_q ) in the previous section, since the expectation values over p𝑝pitalic_p and q𝑞qitalic_q may be estimated using samples from p𝑝pitalic_p and q𝑞qitalic_q. Picking the KL-divergence as an example, given two datasets sampled from distributions p𝑝pitalic_p and q𝑞qitalic_q, we may lower bound DKL(pq)D_{\mathrm{KL}}\left(p\middle\|q\right)italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) by following these steps:

  1. 1.

    Partition each of the two datasets into a training and validation set.

  2. 2.

    Use batched gradient descent to maximise the functional

    D^KL(pq)=1Npx from p(1+ϕ(x))1Nqx from qeϕ(x)\hat{D}_{\mathrm{KL}}\left(p\middle\|q\right)=\frac{1}{N_{p}}\sum_{x\text{ % from }p}\left(1+\phi(x)\right)-\frac{1}{N_{q}}\sum_{x\text{ from }q}e^{\phi(x)}over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) = divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_x from italic_p end_POSTSUBSCRIPT ( 1 + italic_ϕ ( italic_x ) ) - divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_x from italic_q end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT italic_ϕ ( italic_x ) end_POSTSUPERSCRIPT

    on the training dataset for some a machine learning model ϕ:Ω:italic-ϕΩ\phi:\Omega\rightarrow\mathbb{R}italic_ϕ : roman_Ω → blackboard_R. Npsubscript𝑁𝑝N_{p}italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT and Nqsubscript𝑁𝑞N_{q}italic_N start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT denote the number of samples drawn from distributions p𝑝pitalic_p and q𝑞qitalic_q in the given batch.

  3. 3.

    At the end of each epoch evaluate D^KL(pq)\hat{D}_{\mathrm{KL}}\left(p\middle\|q\right)over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) on the whole validation dataset, unbatched, to obtain an estimate, along with error-bars, for a lower bound of DKL(pq)D_{\mathrm{KL}}\left(p\middle\|q\right)italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ italic_q ).

  4. 4.

    Repeat steps 2 & 3 until successive lower bounds stop increasing.

Having trained ϕ(x)italic-ϕ𝑥\phi(x)italic_ϕ ( italic_x ) to provide an approximation of the log-likelihood ratio, one may use its output to lower bound the value of any other f𝑓fitalic_f-divergence generated by a second differentiable function f2subscript𝑓2f_{2}italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT by evaluating the sample estimator of Equation 7 on the validation dataset using f2superscriptsubscript𝑓2f_{2}^{*}italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and f2superscriptsubscript𝑓2f_{2}^{\prime}italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. This interesting property enables the possibility of reporting a different f𝑓fitalic_f-divergence to the one used to train the network. It is worth pointing out that the dual representation presented in Equation (6) is not unique, and is sometimes referred to as the naive variational representation of an f𝑓fitalic_f-divergence. A number of improved dual representations of f𝑓fitalic_f-divergences have been proposed [8], but we have not yet fully explored these representations, and are considered out of scope for this paper. Future work aims to investigate whether the dual representations of certain f𝑓fitalic_f-divergences have better convergence properties than others, and whether these may be used to train ϕitalic-ϕ\phiitalic_ϕ and then construct better estimates of the f𝑓fitalic_f-divergences we choose to report.

Although divergences which are not f𝑓fitalic_f-divergences crop up in the machine learning literature, Wasserstein GANs for example [5], f𝑓fitalic_f-divergences have a number of properties which naturally lend themselves to scientific disciplines. These advantages include co-ordinate independence, and the data processing inequality, two prerequisites for any reasonable measure between two distributions of physical quantities.

3 An example calculating the KL-divergence between two high dimensional distributions

Having established the machinery to estimate a number of divergences, we move on to a non-trivial toy problem that demonstrates some of the advantages and quirks of this approach in practice. The aim of this section is not to solve the actual problem presented, but to make certain points about the techniques described, so the actual machine learning models used will remain simple and intentionally unoptimized.

Refer to caption
Figure 2: On the diagonal: A 1D histogram of each of the nine components which define the three 3D vectors - x,y,z𝑥𝑦𝑧\vec{x},\vec{y},\vec{z}over→ start_ARG italic_x end_ARG , over→ start_ARG italic_y end_ARG , over→ start_ARG italic_z end_ARG - in the dataset. The unit normal distributions superimposed in red demonstrate that each component’s marginal distribution is unit-normal distributed. On the off-diagonal: 2D histograms of every pair of components in the dataset. These demonstrate that once marginalized over the remaining 7 components, the distributions of the remaining 2 coordinates are independent of one another.
Refer to caption
Figure 3: Left: A histogram of the magnitude of the first vector, |x|𝑥|\vec{x}|| over→ start_ARG italic_x end_ARG |, in the original data and the straw model. These histograms suggest there is no difference in the distribution of the first vector’s magnitude between the two. Right: Each point indicates a lower bound on the KL divergence between the model and data distributions conditioned on a given |x|𝑥|\vec{x}|| over→ start_ARG italic_x end_ARG | bin. The machine learning model is evidently better at separating the two datasets at larger values of |x|𝑥|\vec{x}|| over→ start_ARG italic_x end_ARG | which provides a clue on the nature of the difference between the datasets. These lower bounds were obtained from the machine learning model used to lower bound the unconditioned KL divergence, without re-training the model.
Refer to caption
Figure 4: Left: The distribution of the parity variable, P:=x(y×z)assign𝑃𝑥𝑦𝑧P:=\vec{x}\cdot\left(\vec{y}\times\vec{z}\right)italic_P := over→ start_ARG italic_x end_ARG ⋅ ( over→ start_ARG italic_y end_ARG × over→ start_ARG italic_z end_ARG ), of the original data and the straw model. Right: The same the parity variable histograms, however, the straw model distribution has been re-weighted with the likelihood ratio learnt by the machine learning model in the process of calculating the KL divergence between the distributions. Although the re-weighted distribution is not in perfect agreement with the original data, this plot demonstrates that the model has learnt almost all the differences between the distributions contained in this variable.

Consider a set of data points corresponding to the measurement of triplets of 3D vectors drawn from some unknown distribution p𝑝pitalic_p. For now, the source of these vectors is not important, but one can imagine they represent the momentum vectors of three particles produced in an atomic decays, or the arrangement of triplets of galaxies relative to the earth. Suffice to say we are talking about a list of 100000 samples of three 3D vectors per sample. Figure 2 shows various 1D and 2D component-wise histograms of the data which suggest that the distribution of every component is an independent standard normal distribution; a reasonable first guess. This hypothesis is extremely easy to simulate and is guaranteed to reproduce every one of the histograms in Figure 2, up to statistical fluctuations. But how can we be sure that there aren’t features hiding in the data which simply don’t appear in the histograms we’ve thought to check so far? Using our machine learning approach to lower bound a divergence between the distributions provides such a global check which is maximally sensitive, provided the network is given the data and conditions needed to converge.

Putting Equation (7) into practice, a dense network (9×128×64×64×64×64×191286464646419\times 128\times 64\times 64\times 64\times 64\times 19 × 128 × 64 × 64 × 64 × 64 × 1, with LeakyReLU activation functions) was trained as described in Section 2 with a 50-50 train/validation split to estimate the KL divergence between the data and a competing ‘straw model’ which assumes independent unit normal distributed components for each vector, as guessed above. We use the letter q𝑞qitalic_q to denote the underlying distribution of the straw model. Once trained, this simple setup demonstrated that the KL divergence between the underlying data distribution and our straw model is at least 0.077±0.003plus-or-minus0.0770.0030.077\pm 0.0030.077 ± 0.003 in just a couple minutes. This constitutes overwhelming evidence that there are features in the data - not visible in the histograms we bothered to check - that are missing in the straw model. If this were a real problem, this would be the point where the fun would really start, knowing that interesting features exist in the data and trying to understand what the machine learning algorithm has found.

A useful tool in such a search is to study the behaviour of the divergence as a function of some variable in the data, the magnitude of the first vector, |x|𝑥|\vec{x}|| over→ start_ARG italic_x end_ARG |, for example, by constructing a lower bound on the divergence between the two underlying distributions conditioned on a particular value of |x|𝑥|\vec{x}|| over→ start_ARG italic_x end_ARG |, DKL(p(|x|)q(|x|))D_{\mathrm{KL}}\left(p(\cdot\mid|\vec{x}|)\middle\|q(\cdot\mid|\vec{x}|)\right)italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ( ⋅ ∣ | over→ start_ARG italic_x end_ARG | ) ∥ italic_q ( ⋅ ∣ | over→ start_ARG italic_x end_ARG | ) ). This can be achieved by binning the data based on the variable of interest, and evaluating D^KL(pq)\hat{D}_{\mathrm{KL}}\left(p\middle\|q\right)over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) on the samples within each bin. However one should not use the network trained on the full dataset directly as for a given sample, w𝑤witalic_w, the network’s output provides an approximation of log(p(w)q(w))𝑝𝑤𝑞𝑤\log\left(\frac{p(w)}{q(w)}\right)roman_log ( divide start_ARG italic_p ( italic_w ) end_ARG start_ARG italic_q ( italic_w ) end_ARG ) whereas D^KLsubscript^𝐷KL\hat{D}_{\text{KL}}over^ start_ARG italic_D end_ARG start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT evaluated on the samples in a particular bin of |x|𝑥|\vec{x}|| over→ start_ARG italic_x end_ARG | is maximised by log(p(w|x|)q(w|x|))\log\left(\frac{p(w|\mid\vec{x}|)}{q(w\mid|\vec{x}|)}\right)roman_log ( divide start_ARG italic_p ( italic_w | ∣ over→ start_ARG italic_x end_ARG | ) end_ARG start_ARG italic_q ( italic_w ∣ | over→ start_ARG italic_x end_ARG | ) end_ARG ). One can adjust the output of the network by looping through the training dataset a final time and constructing a histogram of the two datasets, as shown in the left panel of Figure 3. Then using the relationship p(w|x|)q(w|x|)=p(w)q(w)q(|x|)p(|x|)\frac{p(w|\mid\vec{x}|)}{q(w\mid|\vec{x}|)}=\frac{p(w)}{q(w)}\frac{q(|\vec{x}|% )}{p(|\vec{x}|)}divide start_ARG italic_p ( italic_w | ∣ over→ start_ARG italic_x end_ARG | ) end_ARG start_ARG italic_q ( italic_w ∣ | over→ start_ARG italic_x end_ARG | ) end_ARG = divide start_ARG italic_p ( italic_w ) end_ARG start_ARG italic_q ( italic_w ) end_ARG divide start_ARG italic_q ( | over→ start_ARG italic_x end_ARG | ) end_ARG start_ARG italic_p ( | over→ start_ARG italic_x end_ARG | ) end_ARG, create an estimator for p(w|x|)q(w|x|)\frac{p(w|\mid\vec{x}|)}{q(w\mid|\vec{x}|)}divide start_ARG italic_p ( italic_w | ∣ over→ start_ARG italic_x end_ARG | ) end_ARG start_ARG italic_q ( italic_w ∣ | over→ start_ARG italic_x end_ARG | ) end_ARG by re-weighting the networks output by the ratio of the two histograms within each bin. In our particular case, this reweighting has almost no effect as the distributions are almost identical in |x|𝑥|\vec{x}|| over→ start_ARG italic_x end_ARG |, but this is not the case in general. The right panel of Figure 3 shows what this looks like in practice for our example and suggests that the difference between the original data and the straw model increases as |x|𝑥|\vec{x}|| over→ start_ARG italic_x end_ARG | increases. Although we have some more ideas about how searches can be done in practice, this is out of scope for this paper and lots of work exists in the literature on machine learning interpretability which may be used. Since we know exactly how this dataset was produced, we use the opportunity to instead make some other points.

The original data differs from the straw model in its distribution of the so-called parity defined by,

P:=x(y×z).assign𝑃𝑥𝑦𝑧P:=\vec{x}\cdot\left(\vec{y}\times\vec{z}\right).italic_P := over→ start_ARG italic_x end_ARG ⋅ ( over→ start_ARG italic_y end_ARG × over→ start_ARG italic_z end_ARG ) .

The parity distribution for the original data and the straw model is shown in the left panel of Figure 4. The straw model predicts a symmetric parity distribution, whereas the actual data is skewed towards negative parity. Since ϕ(x)italic-ϕ𝑥\phi(x)italic_ϕ ( italic_x ) provides an estimate of the log-likelihood ratio, we can check that the network has learnt to exploit this difference in P𝑃Pitalic_P by comparing the histogram of P𝑃Pitalic_P in the original dataset to the histogram of P𝑃Pitalic_P in the independent model with each sample, x𝑥xitalic_x, weighted by the network’s estimate of the likelihood ratio eϕ(x)=p(x)q(x)superscript𝑒italic-ϕ𝑥𝑝𝑥𝑞𝑥e^{\phi(x)}=\frac{p(x)}{q(x)}italic_e start_POSTSUPERSCRIPT italic_ϕ ( italic_x ) end_POSTSUPERSCRIPT = divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_q ( italic_x ) end_ARG. If ϕitalic-ϕ\phiitalic_ϕ is well trained these histograms will coincide. The right panel of Figure 4 shows that our example network has learnt the differences in the parity distributions quite well even without any attempts to fine-tune the machine learning.

Refer to caption
Figure 5: Left: The lower bound of the KL-divergence obtained as a function of the asymmetry parameter used to generate the model data. Right: Same but zoomed into the range (0.6,0.9)0.60.9(0.6,0.9)( 0.6 , 0.9 ). The value α=0.75𝛼0.75\alpha=0.75italic_α = 0.75 was used to produce the original data.

Appendix A explains how the data was produced, and that the degree of asymmetry in the parity of the data is controlled by an additional parameter α[0,1]𝛼01\alpha\in[0,1]italic_α ∈ [ 0 , 1 ]. The value α=0𝛼0\alpha=0italic_α = 0 is equivalent to the straw model, and α=1𝛼1\alpha=1italic_α = 1 corresponds to a large degree of asymmetry. The training procedure was repeated for a range of values of α𝛼\alphaitalic_α and the results are shown in Figure 5. These results can be summarised as follows:

\say

We observe that the KL-divergence lower bound attained by the procedure decreases until it is consistent with 00 in an interval around 0.6250.8750.6250.8750.625-0.8750.625 - 0.875. This is consistent with the value of α=0.75𝛼0.75\alpha=0.75italic_α = 0.75 used to generate the original data. However, with the amount of data given, this particular machine learning model was not able to exclude the values of α=0.625𝛼0.625\alpha=0.625italic_α = 0.625 nor 0.8750.8750.8750.875.

3.1 Discussion

The phrasing of the results in the preceeding paragraph were carefully chosen since strictly speaking this technique can only ever lower-bound the true divergence. In simple terms, obtaining a lower bound for the KL-divergence consistent with 0 is compatible with there being no observable differences between two datasets, however this does not constitute proof that none exists; even differences which are in principal noticeable within the given datasets. This issue is reminiscent of the argument made in the introduction that just because two datasets agree on a particular histogram you decided to check, does not mean that the two datasets are indistinguishable. The difference here is that our approach is guided by gradient descent and remains globally sensitive to all possible differences in the data, as opposed to the established techniques which are restricted to our inspired guesses of what variables to check.

It is worth highlighting that it is difficult to dream up a relatively simple toy example which is easy to simulate, understandable by the typical reader, and yet complex enough to fully justify the use of the techniques described. Since toy models are simulated, we always know how the original data was produced, and therefore it is very artificial to constrain oneself to an incomplete model space. As a result, the series of models considered in Figure 5 did in fact contain the ‘true’ hypothesis, however none of the analysis relied on this fact and we would be free to make conclusions about which model performs the best had we chosen an incomplete model space in our example. Interestingly, if one does end up working in an incomplete model space, the methods described are capable of alerting us to this fact if the divergence lower bound obtained for all models is greater than 00 with statistical-significance. This is something which a typical Bayesian or binary hypothesis test cannot establish.

We also hope it is clear that even though the class of models described in Appendix A admit a simple to evaluate likelihood function, this was neither needed nor used by any part of the statistical analysis. Furthermore, it does not take much additional complexity for the full likelihoods of simple models to become uncomputable. Accounting for the effects of a flawed measurement apparatus, for example, could introduce an angle-dependent measurement resolution of each vector’s components, the possibility for vectors close to one another to be mistakenly merged into a single vector, and many other such effects which are easy to simulate as a Markov chain, but result in tough to compute likelihood functions. In this case the advantages of a machine learning approach are even more pertinent.

4 Conclusion and Outlook

This paper was written to point out how modern techniques for estimating statistical divergences provide a globally sensitive approach to data analysis uniquely suitable for experiments producing data with ever increasing dimension and complexity. The authors’ own field of particle physics is notoriously cursed with scouring a high dimensional space for any deviations away from the stubbornly successful standard model of particle physics. We are in the early stages of applying these techniques in a search for evidence of new physics, building on top of the work described here [9, 10, 11, 12]. In addition to searching for new physics, we suggest divergences could be used to benchmark and quantifying the performance of various Monte-Carlo generators, which are extremely difficult to compare as the data they produce is high dimensional and small changes can have effects in many places. No doubt these applications will raise practical issues which will need to be addressed. A number have already become apparent, for example:

  1. 1.

    A machine learning model is free to pick up on any differences between data and whatever models we propose, including uninteresting effects due to detector mismodelling. How can one learn about how the machine learning model is differentiating the datasets and update one’s detector model if required. Significant effort has been invested towards this point and we intend to write up our results in the coming months.

  2. 2.

    What is the best way to split up train and validation data? More training data results in a higher expected lower-bound on the divergence of choice, but less validation data results in a lower confidence in the value of the lower bound obtained as reflected by larger error bars. This is particularly important when you believe the differences between two underlying distributions is extremely small. Empirical studies on the scaling of network performance as a function of the amount of training data are interesting, but it is unclear how these trends might generalise to arbitrary machine learning problems [13].

  3. 3.

    What techniques can one use to validate the convergence of the machine learning model and therefore the quality of the divergence lower-bound obtained. If one obtains a lower bound of the divergence between data and two models A&B, under what conditions - if any - can we reliably compare the two lower bounds to conclude which model is performing better? The aforementioned empirical studies are once again relevant to understanding the degree of training error, but once again do not generalise [13].

  4. 4.

    In the context of f𝑓fitalic_f-divergences, what is the best way to train the machine learning model and which f𝑓fitalic_f-divergence should one report? In a complete model space, all divergences must agree on which model is performing best333This follows from D(p||q)=0p=qD\left(p||q\right)=0\iff p=qitalic_D ( italic_p | | italic_q ) = 0 ⇔ italic_p = italic_q., but in an incomplete model space various divergences may disagree on which model is ‘closest’ to the data. A natural choice is to report the KL-divergence between the data and each model, since the conclusions obtained should align with those obtained via traditional log-likelihood ratio techniques in the large data limit. However, the KL divergence is unbounded and DKL(pq)D_{\mathrm{KL}}\left(p\middle\|q\right)italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) is only defined if q(x)=0p(x)=0𝑞𝑥0𝑝𝑥0q(x)=0\implies p(x)=0italic_q ( italic_x ) = 0 ⟹ italic_p ( italic_x ) = 0. In contrast the definition of some f𝑓fitalic_f-divergences, like the Jensen-Shannon divergence, are bounded and can be extended to compare any two distributions on the same space.

We hope to inspire a number of readers to try apply these techniques within their own fields and to help settle some of these unanswered questions.

Appendix A How the parity data was produced

The asymmetric data used and described in Section 3 was produced by first sampling each component from a unit normal distribution, just like in the independent-component ‘straw’ model. Then, each sample was either flipped, or not, under the transformation

xx𝑥𝑥\displaystyle\vec{x}\rightarrow-\vec{x}\phantom{,}over→ start_ARG italic_x end_ARG → - over→ start_ARG italic_x end_ARG
yy𝑦𝑦\displaystyle\vec{y}\rightarrow-\vec{y}\phantom{,}over→ start_ARG italic_y end_ARG → - over→ start_ARG italic_y end_ARG
zz,𝑧𝑧\displaystyle\vec{z}\rightarrow-\vec{z},over→ start_ARG italic_z end_ARG → - over→ start_ARG italic_z end_ARG ,

which has the effect of sending PP𝑃𝑃P\rightarrow-Pitalic_P → - italic_P. The probability, g𝑔gitalic_g, of flipping a given sample was a function of the sample’s original parity, P𝑃Pitalic_P. Specifically g𝑔gitalic_g was chosen to be αS(10P)𝛼𝑆10𝑃\alpha S(10P)italic_α italic_S ( 10 italic_P ), where S𝑆Sitalic_S is the sigmoid function, and α𝛼\alphaitalic_α is a parameter which sets the total degree of asymmetry. Thus, samples with P0much-greater-than𝑃0P\gg 0italic_P ≫ 0 were flipped at a rate of α𝛼\alphaitalic_α, and samples with P0much-less-than𝑃0P\ll 0italic_P ≪ 0 were almost never flipped. Hence the heavier tails towards negative P𝑃Pitalic_P. Augmentation of the vectors with this procedure leaves the marginal distributions over single coordinates and pairs of coordinates unchanged, which is why the flipping is not evident in Figure 2. The data discussed in Section 3 was produced with α=0.75𝛼0.75\alpha=0.75italic_α = 0.75.

Appendix B Code which supports this work

A python package named “iwpc”444The letters in “ipwc” stand for “I will prove convergence”. [14] has been made available on the Python Package Index (PyPI, [15]) to permit others to evaluate divergences on samples from their own distributions using the methods described in this paper. The underlying software packaged within “iwpc” may be found in a separate development repository [16]. Those who make use of [14] or [16] in their own work are asked to consider citing the present paper.

References

  • \bibcommenthead
  • Cowan et al. [2011] Cowan, G., Cranmer, K., Gross, E., Vitells, O.: Asymptotic formulae for likelihood-based tests of new physics. The European Physical Journal C 71(2) (2011) https://doi.org/10.1140/epjc/s10052-011-1554-0
  • Aaij et al. [2022] Aaij, R., et al.: Measurement of the Bs0μ+μsubscriptsuperscript𝐵0𝑠superscript𝜇superscript𝜇B^{0}_{s}\to\mu^{+}\mu^{-}italic_B start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT → italic_μ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT italic_μ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT decay properties and search for the B0μ+μsuperscript𝐵0superscript𝜇superscript𝜇B^{0}\to\mu^{+}\mu^{-}italic_B start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT → italic_μ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT italic_μ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT and Bs0μ+μγsubscriptsuperscript𝐵0𝑠superscript𝜇superscript𝜇𝛾B^{0}_{s}\to\mu^{+}\mu^{-}\gammaitalic_B start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT → italic_μ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT italic_μ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT italic_γ decays. Phys. Rev. D 105(1), 012010 (2022) https://doi.org/10.1103/PhysRevD.105.012010 arXiv:2108.09283 [hep-ex]
  • Barter et al. [2018] Barter, W., Burr, C., Parkes, C.: Calculating p𝑝pitalic_p-values and their significances with the Energy Test for large datasets. JINST 13(04), 04011 (2018) https://doi.org/10.1088/1748-0221/13/04/P04011 arXiv:1801.05222 [physics.data-an]
  • Goodfellow et al. [2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, p. 1. Curran Associates, Inc., San Diego, CA (2014). https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
  • Arjovsky et al. [2017] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN (2017)
  • Sriperumbudur et al. [2010] Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G.R.G.: Non-parametric estimation of integral probability metrics. In: 2010 IEEE International Symposium on Information Theory, pp. 1428–1432 (2010). https://doi.org/10.1109/ISIT.2010.5513626
  • Birrell et al. [2021] Birrell, J., Dupuis, P., Katsoulakis, M.A., Rey-Bellet, L., Wang, J.: Variational representations and neural network estimation of rényi divergences. SIAM Journal on Mathematics of Data Science 3(4), 1093–1116 (2021) https://doi.org/10.1137/20M1368926 https://doi.org/10.1137/20M1368926
  • Polyanskiy and Wu [2022] Polyanskiy, Y., Wu, Y.: Information Theory: From Coding to Learning. Cambridge University Press, Cambridge (2022)
  • Lester et al. [2022] Lester, C.G., Mastandrea, R., Noel, D., Tombs, R.: Hunting for vampires and other unlikely forms of parity violation at the large hadron collider. Journal of High Energy Physics 2022(8) (2022) https://doi.org/10.1007/jhep08(2022)231
  • Lester [2022] Lester, C.G.: Using unsupervised learning to detect broken symmetries, with relevance to searches for parity violation in nature. Transactions on Machine Learning Research (2022)
  • Tombs and Lester [2022] Tombs, R., Lester, C.G.: A method to challenge symmetries in data with self-supervised learning. Journal of Instrumentation 17(08), 08024 (2022) https://doi.org/10.1088/1748-0221/17/08/P08024
  • Birman et al. [2022] Birman, M., Nachman, B., Sebbah, R., Sela, G., Turetz, O., Bressler, S.: Data-directed search for new physics based on symmetries of the SM. Eur. Phys. J. C 82(6), 508 (2022) https://doi.org/10.1140/epjc/s10052-022-10454-2 arXiv:2203.07529 [hep-ph]
  • Hestness et al. [2017] Hestness, J., Narang, S., Ardalani, N., Diamos, G.F., Jun, H., Kianinejad, H., Patwary, M.M.A., Yang, Y., Zhou, Y.: Deep learning scaling is predictable, empirically. ArXiv abs/1712.00409 (2017)
  • [14] https://pypi.org/project/iwpc
  • [15] Python Package Index - PyPI. https://pypi.org/ Accessed 2021-03-28
  • [16] https://bitbucket.org/jjhw3/divergences