Statistical Methods in HEP
International School of
Theory & Analysis
in Particle Physics
Istanbul, Turkey
31st January - 11th February 2011
Jörg Stelzer
Michigan State University, East Lansing, USA
Outline
From probabilities to data samples
Probability, Bayes theorem
Properties of data samples
Probability densities, multi-dimensional
Catalogue of distributions in HEP, central limit theorem
Data Simulation, random numbers, transformations
Statistical analysis in particle physics
What is Probability
S = {E1, E2, …} is the set of possible results (events) of an experiment.
E.g. experiment: throwing a die.
E1 = throw a 1, E2 = throw a 2, E3 = throw an odd number, E4 = throw a number > 3, …
Ex and Ey are mutually exclusive if they can't occur at the same time.
E1 and E2 are mutually exclusive, E3 and E4 are not.
Kolmogorov axioms (A.N. Kolmogorov, 1903-1987):
I:   P(E) ≥ 0
II:  P(E1 or E2) = P(E1) + P(E2) if E1 and E2 are mutually exclusive
III: Σi P(Ei) = 1, where the sum is over all mutually exclusive events
Conditional probability: P(A|B) = P(A ∩ B) / P(B)
E.g. if your friend hints it was a rainy day: P(Tuesday | rainy day) = 1/7
Independent events A and B: P(A|B) = P(A ∩ B)/P(B) = P(A)·P(B)/P(B) = P(A)
The axioms can be used to build a complicated theory, but the numbers so far are
entirely free of meaning. There are different interpretations of probability.
Probability as frequency limit
Perform a repeatable experiment N times with outcomes X1, X2, … (the ensemble).
Count the number of times that outcome X occurs: NX. The fraction NX/N tends
toward a limit, defined as the probability of outcome X:
P(X) = lim (N→∞) NX / N
Richard von Mises (1883-1953)
Objective probability: propensity
Examples: throwing a coin, rolling a die, or drawing colored pearls out
of a bag, playing roulette.
Hence propensities are now often defined by the theoretical role they play
in science, e.g. based on an underlying physical law.
Bayes Theorem
From conditional probability:
P(A|B) P(B) = P(A ∩ B) = P(B|A) P(A)
⇒  P(A|B) = P(B|A) P(A) / P(B)
Reverend Thomas Bayes (1702-1761)
An uncontroversial consequence of Kolmogorov's axioms!
Subjective probability
A, B, … are hypotheses (statements that are either true or false). Define
the probability of hypothesis A as the degree of belief that A is true.
If B is made up of disjoint pieces B ∩ Ai (the Ai being mutually exclusive):
B = ∪i (B ∩ Ai),   P(B ∩ Ai) = P(B|Ai) P(Ai)
Bayes theorem becomes
P(A|B) = P(B|A) P(A) / Σi P(B|Ai) P(Ai)
Example of Bayes theorem
Meson beam:  A1 = π,  A2 = K
Consists of 90% pions, 10% kaons
Cherenkov counter gives a signal on pions:  B = signal
95% efficient for pions, 6% fake rate (accidental signal) for kaons
Q1: if we see a signal in the counter, how likely did it come from a pion?
p(π | signal) = p(signal | π) p(π) / [ p(signal | π) p(π) + p(signal | K) p(K) ]
             = 0.95 × 0.90 / (0.95 × 0.90 + 0.06 × 0.10) = 99.3%
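As an illustration (not part of the original slides), a minimal Python sketch of this posterior calculation, using the beam composition and counter efficiencies quoted above:

```python
# Hedged sketch: P(pi | signal) via Bayes theorem for the meson-beam example.
priors = {"pi": 0.90, "K": 0.10}        # beam composition
p_signal = {"pi": 0.95, "K": 0.06}      # P(signal | particle type)

norm = sum(p_signal[h] * priors[h] for h in priors)   # P(signal)
posterior_pi = p_signal["pi"] * priors["pi"] / norm
print(f"P(pi | signal) = {posterior_pi:.3f}")          # ~0.993
```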
Be aware that the naming conventions are not always clear (in particular
"objective" and "subjective"); the best bet is to use "Frequentist"
and "Bayesian".
Describing data
Data sample properties
A data sample of a single variable, x = {x1, x2, …, xN}, can be characterized by a few numbers:
[Figure: the center of Kleinmaischeid, Germany, also the center of Europe (2005)]
Arithmetic mean:  x̄ = (1/N) Σ(i=1..N) xi ,  or for binned data  x̄ = (1/N) Σj nj xj
Variance:  V(x) = (1/N) Σ(i=1..N) (xi - x̄)² = ⟨x²⟩ - ⟨x⟩²
Standard deviation:  s = √V(x) = √(⟨x²⟩ - ⟨x⟩²)
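A minimal numerical sketch of these sample properties (not from the slides; the data values are invented):

```python
import numpy as np

# Hypothetical data sample; any 1-d array of measurements would do.
x = np.array([4.2, 5.1, 3.8, 4.9, 5.3, 4.4])

mean = x.mean()                       # arithmetic mean
variance = np.mean(x**2) - mean**2    # V(x) = <x^2> - <x>^2  (same as x.var())
std = np.sqrt(variance)               # standard deviation s
print(mean, variance, std)
```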
More than one variable
Set of data of two variables x = {( x1 , y1 ), ( x2 , y2 ),..., ( xN , y N )}
0: (-1.34361,0.93106) 7: (0.517314,-0.512618) 14: (0.901526,-0.397986)
1: (0.370898,-0.337328) 8: (0.990128,-0.597206) 15: (0.761904,-0.462093)
2: (0.215065,0.437488) 9: (0.404006,-0.511216) 16: (-2.17269,2.31899)
3: (0.869935,-0.469104) 10: (0.789204,-0.657488) 17: (-0.653227,0.829676)
4: (0.452493,-0.687919) 11: (0.359607,-0.979264) 18: (-0.543407,0.560198)
5: (0.484871,-0.51858) 12: (-0.00844855,-0.0874483) 19: (-0.701186,1.03088)
6: (0.650495,-0.608453) 13: (0.264035,-0.559026)
Covariance:  cov(x, y) = ⟨xy⟩ - ⟨x⟩⟨y⟩
Correlation:  r = cov(x, y) / (sx sy)   (dimensionless, between -1 and 1)
[Figure: scatter plots with r = 0, r = 0.5, r = 0.9, r = -0.9]
Example: group of adults
r(height, weight) > 0, r(weight, stamina) < 0, r(height, IQ) = 0, but r(weight, IQ) < 0
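As a sketch (not from the slides), the correlation coefficient defined above can be computed directly; the paired values below are invented, but the (x, y) list above could be used instead:

```python
import numpy as np

# Hypothetical paired sample (x_i, y_i).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov = np.mean(x * y) - x.mean() * y.mean()      # cov(x, y) = <xy> - <x><y>
r = cov / (x.std() * y.std())                   # correlation, dimensionless, in [-1, 1]
print(r)                                        # equivalent to np.corrcoef(x, y)[0, 1]
```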
Probability density function
Suppose the outcome of an experiment is a value vx for a continuous variable x. The probability of finding the outcome in an interval of width dx around x is described by the probability density function (pdf) f(x): P(A) = f(x) dx.
Dimensions:
P(A) is dimensionless (between 0 and 1)
f(x) has the dimensionality of (1 / dimension of x)
Expectation value:  ⟨x⟩ = ∫ x f(x) dx
and the variance:  V(x) = ⟨x²⟩ - ⟨x⟩²
Expectation values can also be defined for functions of x, e.g. h(x):  ⟨h⟩ = ∫ h(x) f(x) dx  (integrals from -∞ to +∞)
⟨g + h⟩ = ⟨g⟩ + ⟨h⟩,  but  ⟨g·h⟩ ≠ ⟨g⟩·⟨h⟩  unless g and h are independent
Drawing pdf from data sample
1. Histogram with B bins of width Δx
2. Fill with N outcomes of the experiment:  x1, …, xN  →  H = [n1, …, nB]
3. Normalize the integral to unit area:  ñi = ni / N,  Σ(i=1..B) ñi = 1
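A short sketch of this histogram-based pdf estimate (not from the slides; the sample is generated for illustration). Dividing by N·Δx makes the integral exactly 1, which is what numpy's density=True option does:

```python
import numpy as np

# Hypothetical sample drawn from some unknown pdf.
x = np.random.default_rng(1).normal(loc=0.0, scale=1.0, size=10000)

B = 50
counts, edges = np.histogram(x, bins=B)          # n_1, ..., n_B
widths = np.diff(edges)                          # bin widths
pdf_estimate = counts / (counts.sum() * widths)  # normalize so the integral is 1
print(np.sum(pdf_estimate * widths))             # ~1.0
```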
Multidimensional pdfs
Outcome of experiment (event) characterized by n variables:
x = (x(1), x(2), …, x(n))
Joint pdf f(x):
P(∩(i=1..n) A(i)) = f(x) dx(1) … dx(n)
where A(i) is the hypothesis that variable i of the event
lies in the interval [x(i), x(i) + dx(i)]
Normalization:  ∫…∫ f(x(1), x(2), …, x(n)) dx(1) dx(2) … dx(n) = 1
Marginal pdfs, independent variables
The PDF of one (or some) of the variables is obtained by integrating over all the others →
marginal PDF:  fXi(x(i)) = ∫ f(x) ∏(j≠i) dx(j)
Variables x(1), x(2), …, x(n) are independent from each other if and only
if they factorize:
f(x) = ∏i fXi(x(i))
Conditional pdfs
Sometimes we want to consider some variables of a joint pdf as constant.
Let's look at two dimensions, starting from the conditional probability:
P(B|A) = P(A ∩ B) / P(A) = f(x, y) dx dy / (fx(x) dx) = h(y|x) dy
Conditional pdf, the distribution of y for fixed x = x1:
h(y | x = x1) = f(x = x1, y) / fx(x = x1)
In the joint pdf treat some variables as constant and evaluate at a fixed point (e.g. x = x1),
then divide the joint pdf by the marginal pdf of the variables being held constant,
evaluated at that point (e.g. fx(x = x1)).
h(y|x1) is a slice of f(x, y) at x = x1 and has the correct normalization:  ∫ h(y | x = x1) dy = 1
Some Distributions in HEP
Binomial: branching ratio
Multinomial: histogram with fixed N
Poisson: number of events found in a data sample
Uniform: Monte Carlo method
Exponential: decay time
Gaussian: measurement error
Chi-square: goodness-of-fit
Cauchy (Breit-Wigner): mass of a resonance
Landau: ionization energy loss
Binomial distribution
Outcome of experiment is 0 or 1 with p=P(1) (Bernoulli
trials). r : number of 1s occurring in n independent
trials.
Probability mass function:
P(r; p, n) = n! / (r!(n-r)!) · p^r (1-p)^(n-r)
(p^r: r times a 1;  (1-p)^(n-r): n-r times a 0;  n!/(r!(n-r)!): combinatoric term)
Properties:  ⟨r⟩ = np,  V(r) = σ² = np(1-p)
Example: a spark chamber is 95% efficient to detect the passing of a charged particle. How efficient
is a stack of four spark chambers if you require at least three hits to reconstruct a track?
P(3; 0.95, 4) + P(4; 0.95, 4) = 4 · 0.95³ · 0.05 + 0.95⁴ · 1 = 0.171 + 0.815 = 98.6%
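A quick numerical check of the spark-chamber example (not from the slides), assuming scipy is available:

```python
from scipy.stats import binom

# Each chamber is 95% efficient; a track needs at least 3 hits out of 4 chambers.
p, n = 0.95, 4
eff = binom.pmf(3, n, p) + binom.pmf(4, n, p)   # P(3; 0.95, 4) + P(4; 0.95, 4)
print(eff)                                       # ~0.986
```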
Poisson distribution (law of small numbers)
Discrete like binomial distribution, but no notion of trials. Rather l, the mean
number of (rare) events occurring in a continuum of fixed size, is known.
Derivation from binomial distribution:
Divide the continuum into n intervals and in each interval assume p = the probability that an event
occurs in that interval. Note that λ = np is known and constant.
Take the binomial distribution in the limit of large n (and small p) for fixed r:
P(r; p, n) = n!/(r!(n-r)!) · p^r (1-p)^(n-r)  ≈  (n^r / r!) · p^r (1 - λ/n)^(n-r)  →  λ^r e^(-λ) / r!
Probability mass function:  P(r; λ) = λ^r e^(-λ) / r!
Properties:  ⟨r⟩ = λ,  V(r) = σ² = λ
Famous example: Ladislaus Bortkiewicz (1868-1931) counted the number of soldiers killed by
horse-kicks each year in each corps of the Prussian cavalry (mean λ = 0.61):
Deaths | Prediction | Cases
0      | 108.7      | 109
3      | 4.1        | 3
4      | 0.6        | 1
Probability of no deaths in a corps in a year: P(0; 0.61) = 0.5434
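A sketch reproducing the predictions in the table above (not from the slides); the sample size of 200 corps-years is implied by 108.7 / 0.5434:

```python
from scipy.stats import poisson

lam = 0.61              # mean deaths per corps per year
n_corps_years = 200     # implied by prediction 108.7 = 200 * P(0; 0.61)

for r in (0, 3, 4):
    print(r, n_corps_years * poisson.pmf(r, lam))   # ~108.7, 4.1, 0.6 as in the table
```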
Gaussian (normal) distribution
f(x; μ, σ) = 1/(√(2π) σ) · exp( -(x-μ)² / (2σ²) )
Properties:  ⟨x⟩ = μ,  V(x) = σ²
Note that μ and σ also denote the mean and standard deviation of any distribution, not just
the Gaussian. The fact that they appear as parameters in this pdf justifies their naming.
Standard Gaussian (transform x → x' = (x-μ)/σ):  φ(x) = 1/√(2π) · e^(-x²/2)
Cumulative distribution:  Φ(x) = ∫(-∞..x) φ(x') dx'  cannot be calculated analytically.
Uniform distribution:
f(x; α, β) = 1/(β-α)  for α ≤ x ≤ β,  0 otherwise
⟨x⟩ = (α+β)/2,  V(x) = (β-α)²/12
Central limit theorem
The sum Y = Σ(i=1..N) Xi of N independent random variables Xi, each with mean μi and
variance σi², will for large N be approximately Gaussian.
Expectation value:  ⟨Y⟩ = Σ(i=1..N) ⟨Xi⟩ = Σ(i=1..N) μi
Variance:  V(Y) = Σ(i=1..N) V(Xi) = Σ(i=1..N) σi²
Y becomes Gaussian as N → ∞.
Examples
E.g. human height is Gaussian, since it is the sum of many genetic factors.
Weight is not Gaussian, since it is dominated by the single factor food.
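A small numerical illustration of the central limit theorem (not from the slides): sums of N uniform random variables quickly approach a Gaussian with the mean and variance predicted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 12
y = rng.uniform(0.0, 1.0, size=(100000, N)).sum(axis=1)

# CLT prediction for uniform [0,1] summands: <Y> = N/2, V(Y) = N/12.
print(y.mean(), y.var())        # ~6.0, ~1.0
```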
Half-time summary
Part I
Introduced probability
Frequency, subjective. Bayes theorem.
Properties of data samples
Mean, variance, correlation
Probability densities: the underlying distribution from which data samples
are drawn
Properties, multidimensional, marginal, conditional pdfs
Examples of pdfs in physics, CLT
Part II
HEP experiment: repeatedly drawing random events from underlying
distribution (the laws of physics that we want to understand). From the
drawn sample we want to estimate parameters of those laws
Purification of data sample: statistical testing of events
Estimation of parameters: maximum likelihood and chi-square fits
Error propagation
Intermezzo: Monte Carlo simulation
Looking at data, we want to infer something about the (probabilistic)
processes that produced the data.
Preparation:
tuning signal / background separation to achieve most significant signal
check quality of estimators (later) to find possible biases
test statistical methods for getting the final result
all of this requires data based on distributions with known parameters → Monte Carlo
simulation, built on sequences of pseudo-random numbers. Random number generators in ROOT:
TRandom
  Same as the BSD rand() function. Internal state 32 bit, short period ~10^9.
TRandom1
  Based on the mathematically proven RANLUX. Internal state 24 × 32 bit, period ~10^171.
  4 luxury levels. Slow. RANLUX is the default in the ATLAS simulation.
TRandom2
  Based on a maximally equidistributed combined Tausworthe generator. Internal state
  3 × 32 bit, period ~10^26. Fast. Use if only a small number of random numbers is needed.
TRandom3
  Based on the Mersenne Twister algorithm. Large state 624 × 32 bit. Very long period
  ~10^6000. Fast. Default in ROOT.
Seed: seed 0 uses a random seed, anything else gives you a reproducible sequence.
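The ROOT generators above are C++ classes; as a rough Python analogue of the seeding rule (fixed seed is reproducible, no seed is not), assuming numpy:

```python
import numpy as np

# Fixed seed: the sequence is reproducible (analogue of a nonzero TRandom seed).
a = np.random.default_rng(42).random(3)
b = np.random.default_rng(42).random(3)
print(np.array_equal(a, b))     # True

# No seed given: entropy from the OS, a different sequence each run
# (analogue of seed 0 in ROOT, which picks a "random" seed).
c = np.random.default_rng().random(3)
```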
Transformation method (analytic)
Given r1, r2, …, rn uniform in [0, 1], find x1, x2, …, xn that follow f(x) by
finding a suitable transformation x(r).
Require P(r' ≤ r) = P(x' ≤ x(r)), which means
∫(-∞..r) u(r') dr' = r = ∫(-∞..x(r)) f(x') dx' = F(x(r))
so set F(x) = r and solve for x, i.e. x(r) = F⁻¹(r).
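A sketch of the transformation method for an exponential pdf (not from the slides; the exponential is the decay-time example from the distribution catalogue):

```python
import numpy as np

# f(x) = (1/tau) exp(-x/tau)  =>  F(x) = 1 - exp(-x/tau)  =>  x(r) = -tau * ln(1 - r)
rng = np.random.default_rng(0)
tau = 2.0
r = rng.uniform(size=100000)        # uniform in [0, 1]
x = -tau * np.log(1.0 - r)          # follows the exponential pdf

print(x.mean())                     # ~tau, as expected for a decay-time distribution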
Accept reject method
Enclose the pdf in a box:  [xmin, xmax] × [0, fmax]
Generate a candidate x uniformly in [xmin, xmax] and a height u uniformly in [0, fmax];
accept x if u < f(x), otherwise reject it and repeat.
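A minimal sketch of accept-reject sampling (not from the slides; the pdf shape below is a hypothetical bump):

```python
import numpy as np

def f(x):
    # Hypothetical target shape; an unnormalized pdf is fine for accept-reject.
    return np.exp(-0.5 * ((x - 1.0) / 0.3) ** 2)

rng = np.random.default_rng(0)
x_min, x_max, f_max = 0.0, 2.0, 1.0

samples = []
while len(samples) < 10000:
    x = rng.uniform(x_min, x_max)     # candidate
    u = rng.uniform(0.0, f_max)       # height
    if u < f(x):                      # accept with probability f(x)/f_max
        samples.append(x)
samples = np.array(samples)
```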
Improving accept reject method
In regions where f(x) is small compared to fmax a lot of the sampled
points are rejected.
Serious waste of computing power, simulation in HEP consists of billions of random
numbers, so this does add up!
Split [xmin, xmax] into regions (i), each with its own fmax(i), and simulate the pdf
separately in each region. Proper normalization of the number of events per region:
N(i) ∝ A(i) = (xmax(i) - xmin(i)) · fmax(i)
More general: find an enveloping function around f(x) for which you can
generate random numbers. Use this to generate x.
MC simulation in HEP
Event generation: PYTHIA, Herwig, ISAJET, …
(general-purpose generators)
[Diagram: Nature / Theory → Data (simulated or real) → Statistical inference]
Given these data, what can we say about the correctness, parameters, etc. of the
distribution functions?
Typical HEP analysis
1. Purification of the data sample (event classification)
2. Parameter estimation
   Mass, CP, size of signal
Event Classification
Suppose data sample with two types of events: Signal S, Background B
Suppose we have found discriminating input variables x1, x2, …
What decision boundary should we use to select signal events (type S)?
[Figure: three possible decision boundaries separating S from B in the (x1, x2) plane]
A classifier y: Rⁿ → R maps each event x to a scalar output y(x).
[Figure: distributions g(y|B) and g(y|S) of the classifier output, with a cut value ycut
separating the "reject S" and "accept S" regions]
The decision boundary can now be defined by a single cut ycut on the classifier
output y(x), which divides the input space into a rejection (critical) and an
acceptance region. This defines a test: if the event falls into the critical
region we reject the S hypothesis.
Convention
In the literature one often sees
Background B = H0
Signal S = H1
Definition of a test
Goal is to make some statement based on the observed data
x as to the validity of the possible hypotheses, e.g. signal hypothesis S.
Probability to accept signal events as signal (signal efficiency):
εS = ∫(ycut..∞) g(y|S) dy = 1 - β
Background efficiency:
εB = ∫(ycut..∞) g(y|B) dy = α
[Figure: g(y|B) and g(y|S) versus y, with the cut ycut; β is the area of g(y|S) below ycut,
α the area of g(y|B) above ycut]
Neyman Pearson test
Design a test in the n-dimensional input space by defining a critical region WS.
Selecting an event in WS as signal has errors α and β:
α = ∫(WS) fB(x) dx = εB    and    β = 1 - ∫(WS) fS(x) dx = 1 - εS
Neyman-Pearson Lemma
Likelihood ratio:  yr(x) = P(x|S) / P(x|B) = fS(x) / fB(x)
The likelihood-ratio test as selection criterion gives, for each selection efficiency εS,
the largest background rejection 1 - εB (i.e. the smallest α).
[Figure: background rejection 1 - εBackgr. = 1 - α versus signal efficiency εS (ROC curve),
running from "accept nothing" to "accept everything"]
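A sketch of a likelihood-ratio test for two hypothetical one-dimensional Gaussian hypotheses (not from the slides); cutting on yr(x) defines one working point on the ROC curve:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-d example: signal and background pdfs are known Gaussians.
rng = np.random.default_rng(0)
sig = rng.normal(+1.0, 1.0, 10000)
bkg = rng.normal(-1.0, 1.0, 10000)

def y_r(x):                                   # likelihood ratio f_S(x) / f_B(x)
    return norm.pdf(x, +1.0, 1.0) / norm.pdf(x, -1.0, 1.0)

y_cut = 1.0                                   # one possible working point
eff_S = np.mean(y_r(sig) > y_cut)             # signal efficiency  (1 - beta)
eff_B = np.mean(y_r(bkg) > y_cut)             # background efficiency  (alpha)
print(eff_S, eff_B)
```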
PDE methods
Construct non-parametric estimators f̂ of the pdfs f(x|S) and f(x|B)
and use these to construct the likelihood ratio:
yr(x) = f̂(x|S) / f̂(x|B)
The methods are based on turning the training sample into PDEs for signal
and background and then providing a fast lookup for yr(x).
Projective Likelihood Estimator
Probability density estimators for each input variable (marginal PDF)
combined in overall likelihood estimator, much liked in HEP.
PDE for each variable k (marginal pdf), combined into a likelihood ratio for event i:
y(xi) = ∏(k ∈ {variables}) fSignal^k(xi^k) / fU(xi),
where fU(xi) = Σ(U = Signal, Background) ∏(k ∈ {variables}) fU^k(xi^k)
Advantages:
estimating each variable's distribution independently alleviates the problems from the
curse of dimensionality
simple and robust, especially in low-dimensional problems
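A sketch of such a projective (naive-Bayes-like) likelihood classifier using histogram PDEs (not from the slides; the training samples and binning are hypothetical):

```python
import numpy as np

def make_marginal_pdfs(sample, bins=30, range_=(-5, 5)):
    # One normalized histogram per input variable.
    pdfs = []
    for k in range(sample.shape[1]):
        counts, edges = np.histogram(sample[:, k], bins=bins, range=range_, density=True)
        pdfs.append((counts, edges))
    return pdfs

def eval_likelihood(pdfs, x):
    # Product of the per-variable pdf values at the event's coordinates.
    L = 1.0
    for (counts, edges), xk in zip(pdfs, x):
        i = np.clip(np.searchsorted(edges, xk) - 1, 0, len(counts) - 1)
        L *= max(counts[i], 1e-12)            # avoid exact zeros from empty bins
    return L

rng = np.random.default_rng(0)
sig = rng.normal(+1.0, 1.0, size=(5000, 3))   # toy training samples
bkg = rng.normal(-1.0, 1.0, size=(5000, 3))
pdf_S, pdf_B = make_marginal_pdfs(sig), make_marginal_pdfs(bkg)

x = np.array([0.5, 0.2, 1.1])                 # one test event
L_S, L_B = eval_likelihood(pdf_S, x), eval_likelihood(pdf_B, x)
y = L_S / (L_S + L_B)                         # projective likelihood output in [0, 1]
print(y)
```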
Estimating the input PDFs from the sample
Technical challenge, three ways:
k-Nearest Neighbor
  Better: count adjacent reference events until a statistically significant number is
  reached (the method is intrinsically adaptive)
PDE-Foam
  Parcel the input space into cells of varying sizes; each cell contains representative information
  (the average of the reference events in its neighborhood)
  Advantage: limited number of cells, independent of the number of training events
Kernel estimators
  [Figure: pdf estimate with no kernel weighting vs with a Gaussian kernel]
Fast search: a binary search tree that sorts objects in space by their coordinates
Curse of Dimensionality
Problems caused by the exponential increase in volume associated
with adding extra dimensions to a mathematical space:
Distance functions lose their usefulness in high dimensionality:
lim(D→∞) (dmax - dmin) / dmin = 0
Boosted Decision Tree
1) Decision tree (DT)
A series of cuts that split the sample set into ever smaller subsets.
Growing: each split tries to maximize the gain in separation ΔG = N·G - N1·G1 - N2·G2
Gini (inequality) index:  Gnode = Snode·Bnode / (Snode + Bnode)²
Leaves are assigned either S or B.
Event classification: follow the splits using the test event's variables until a leaf
is reached: S or B.
Pruning: remove statistically insignificant nodes bottom-up, to protect from overtraining.
2) Boosting method (AdaBoost)
Build a forest of DTs:
1. Emphasize classification errors in DTk: increase (boost) the weight of incorrectly
   classified events
2. Train a new tree DTk+1
The final classifier linearly combines all trees; DTs with small misclassification get a
large coefficient.
A DT is dimensionally robust and easy to understand, but alone not powerful! Boosting gives
good performance and stability with little tuning needed. Popular in HEP (MiniBooNE,
single top at D0).
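A sketch of a boosted decision tree on toy data using scikit-learn's AdaBoost (not from the slides; the lecture uses TMVA, this is only an analogue with hypothetical data):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_sig = rng.normal(+0.5, 1.0, size=(2000, 2))   # toy signal events
X_bkg = rng.normal(-0.5, 1.0, size=(2000, 2))   # toy background events
X = np.vstack([X_sig, X_bkg])
y = np.array([1] * 2000 + [0] * 2000)           # 1 = signal, 0 = background

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=200)
bdt.fit(X, y)
print(bdt.score(X, y))                          # training accuracy (use a separate test set in practice)
```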
Multivariate summary
Multivariate analysis packages:
StatPatternRecognition: I.Narsky, arXiv: physics/0507143
http://www.hep.caltech.edu/~narsky/spr.html
TMVA: Hoecker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039
http://tmva.sf.net or every ROOT distribution
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
Huge data analysis library available in R: http://www.r-project.org/
Support training, evaluation, comparison of many state-of-the-art classifiers
Parameter estimation
Estimation of variable properties
Estimator:
A procedure applicable to a data sample S which gives the numerical value for
a) a property of the parent population from which S was selected, or
b) a property or parameter of the parent distribution function that generated S
Estimators are denoted with a hat over the parameter or property
Consistent:  lim(N→∞) â = a
Unbiased:  ⟨â⟩ = a
For large N any consistent estimator becomes unbiased!
Efficient:  V(â) is small
More efficient estimators are more likely to be close to the true value. There is a theoretical
lower limit on the variance, the minimum variance bound (MVB). The efficiency of an estimator
is MVB / V(â).
A mean estimator example
Estimators for the mean of a distribution (are they consistent, unbiased, efficient?):
1) Sum up all x and divide by N
2) Sum up all x and divide by N-1
3) Sum up every second x and divide by int(N/2)
4) Throw away the data and return 42

1) is consistent (law of large numbers) and unbiased:
   μ̂ = (x1 + x2 + … + xN)/N,   ⟨μ̂⟩ = (⟨x⟩ + ⟨x⟩ + … + ⟨x⟩)/N = μ
3) is also consistent and unbiased, but less efficient than 1) since it uses only half the data.
2) is consistent but biased:
   μ̂ = (x1 + x2 + … + xN)/(N-1),   ⟨μ̂⟩ = N⟨x⟩/(N-1) = N μ/(N-1) ≠ μ
   (the bias factor N/(N-1) → 1 as N → ∞, so 2) is still consistent)
4) is (in general) neither consistent nor unbiased.
Efficiency depends on the data sample S. Note that some estimators are always consistent
or unbiased; most often the properties of the estimator depend on the data sample.
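A toy comparison of the four estimators above over many repeated experiments (not from the slides; the true mean and sample size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, n_experiments = 5.0, 100, 2000

est = {"1) sum/N": [], "2) sum/(N-1)": [], "3) every 2nd": [], "4) always 42": []}
for _ in range(n_experiments):
    x = rng.normal(mu, 2.0, N)
    est["1) sum/N"].append(x.sum() / N)
    est["2) sum/(N-1)"].append(x.sum() / (N - 1))
    est["3) every 2nd"].append(x[::2].sum() / (N // 2))
    est["4) always 42"].append(42.0)

for name, vals in est.items():
    vals = np.array(vals)
    print(name, "bias:", vals.mean() - mu, "variance:", vals.var())
```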
Examples of basic estimators
Estimating the mean:  μ̂ = x̄
Consistent, unbiased, maybe efficient:  V(μ̂) = σ²/N  (from the central limit theorem)
Estimating the variance:
a) when knowing the true mean μ:  V̂(x) = (1/N) Σi (xi - μ)²
b) when not knowing the true mean:  V̂(x) = s² = (1/(N-1)) Σi (xi - x̄)²
Note the correction factor N/(N-1) compared to the naive expectation. Since x̄ is closer to
the data points of the sample S than the true mean μ, the naive 1/N result would underestimate
the variance and introduce a bias!
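A toy check of the N/(N-1) correction (not from the slides; the true variance and sample size are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, N = 4.0, 10

naive, corrected = [], []
for _ in range(20000):
    x = rng.normal(0.0, np.sqrt(true_var), N)
    naive.append(np.mean((x - x.mean())**2))            # 1/N around the sample mean
    corrected.append(np.sum((x - x.mean())**2) / (N - 1))

print(np.mean(naive), np.mean(corrected))               # ~3.6 (biased low) vs ~4.0
```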
Properties of the ML estimator
For a sample of independent measurements x1, …, xN the likelihood is
L(x1, …, xN; a) = ∏i P(xi; a), and the ML estimate â maximizes it.
Usually consistent.
Invariant under a change of parameter f = f(a): the peak in the likelihood function satisfies
d ln L/da |(a=â) = (d ln L/df · df/da) |(f=f̂=f(â)) = 0,  so f̂ = f(â).
Error on an ML estimator for large N
Expand ln L around its maximum â. We have seen that  d ln L(x1,…,xN; a)/da |(a=â) = 0.
The second derivative  d² ln L/da²  is important for estimating the error.
One can show for any unbiased and efficient ML estimator (e.g. large N):
d ln L(x1,…,xN; a)/da = A(a)·(â(x1,…,xN) - a),   with proportionality factor  A(a) = -d² ln L/da²
The CLT tells us that the probability distribution of â is Gaussian. For this to be (close to)
true, A must be (relatively) constant around a = â:
L(x1, x2, …, xN; a) ∝ exp( -A·[a - â(x1, x2, …, xN)]² / 2 )
ML does not give you the most likely value for a; it gives the value of a for which the
observed data are the most likely!
d ln L(S; a)/da |(a=â) = 0  usually can't be solved analytically; use numerical methods,
such as MINUIT. You need to program your P(x; a).
No quality check: the value of ln L(S; â) tells you nothing about how good your
assumption for P(x; a) was.
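A sketch of a numerical ML fit (not from the slides; in practice MINUIT would be used). It estimates the lifetime of an exponential pdf from toy data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(2.0, size=1000)             # toy data with true tau = 2

def neg_log_L(tau):
    # -ln L for P(x; tau) = (1/tau) exp(-x/tau)
    return np.sum(np.log(tau) + x / tau)

res = minimize_scalar(neg_log_L, bounds=(0.1, 10.0), method="bounded")
tau_hat = res.x
# For this pdf, -d^2 ln L / dtau^2 at the maximum equals N / tau_hat^2,
# so the large-N error estimate is tau_hat / sqrt(N).
sigma_tau = tau_hat / np.sqrt(len(x))
print(tau_hat, "+/-", sigma_tau)
```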
Least square estimation
A particular MLE with Gaussian distributions: each of the sample points yi
has its own expectation f(xi; a) and resolution σi:
P(yi; a) = 1/(√(2π) σi) · exp( -[yi - f(xi; a)]² / (2σi²) )
To maximize the likelihood, minimize
χ² = Σi ( [yi - f(xi; a)] / σi )²
Fitting binned data (nj = content of bin j, fj = expectation for bin j):
Proper χ²:  χ² = Σj (nj - fj)² / fj    (the bin content follows Poisson statistics)
Simple χ² (simpler to calculate):  χ² = Σj (nj - fj)² / nj    (nj used as the squared error)
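A sketch of a least-squares (chi-square) fit of a straight line to points with resolutions σi (not from the slides; all values are hypothetical):

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
sigma = np.full_like(y, 0.3)                    # per-point resolutions

def f(x, a0, a1):
    return a0 + a1 * x                          # model f(x; a)

popt, pcov = curve_fit(f, x, y, sigma=sigma, absolute_sigma=True)
chi2 = np.sum(((y - f(x, *popt)) / sigma) ** 2)
ndof = len(x) - len(popt)                       # N_points - N_parameters
print(popt, chi2, ndof)
```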
Advantages of least squares
Method provides goodness-of-fit
The value of the c2 at its minimum is a measure of the level of agreement between the
data and fitted curve.
The χ² statistic follows the chi-square distribution f(χ²; n).
Each data point contributes χ² ≈ 1; minimizing reduces χ² by about 1 per free fit variable.
Number of degrees of freedom:  n = Nbin - Nvar
Error Analysis
Statistical errors:
How much would the result fluctuate upon repetition of the experiment?
Literature
Statistical Data Analysis
G. Cowan, Clarendon Press, Oxford, 1998; see also www.pp.rhul.ac.uk/~cowan/sda