Statistical Methods in HEP
International School of
Theory & Analysis
in Particle Physics
Istanbul, Turkey
31st January - 11th February 2011
Jörg Stelzer
Michigan State University, East Lansing, USA
Outline
From probabilities to data samples
Probability, Bayes theorem
Properties of data samples
Probability densities, multi-dimensional
Catalogue of distributions in HEP, central limit theorem
Data Simulation, random numbers, transformations
Statistical analysis in particle physics
What is Probability
S = {E1, E2, …} is the set of possible results (events) of an experiment.
E.g. experiment: throwing a die.
E1 = throw a 1, E2 = throw a 2, E3 = throw an odd number, E4 = throw a number > 3, …
Ex and Ey are mutually exclusive if they can't occur at the same time.
E1 and E2 are mutually exclusive, E3 and E4 are not.
Kolmogorov axioms (A.N. Kolmogorov, 1903-1987):
I:   P(E) ≥ 0
II:  P(E1 or E2) = P(E1) + P(E2) if E1 and E2 are mutually exclusive
III: Σi P(Ei) = 1, where the sum is over all mutually exclusive events
Conditional probability: P(A|B) = P(A ∩ B) / P(B)
E.g. if your friend hints it was a rainy day: P(Tuesday | rainy day) = 1/7
Independent events A and B: P(A|B) = P(A ∩ B)/P(B) = P(A)·P(B)/P(B) = P(A)
The axioms can be used to build a complicated theory, but the numbers so far are
entirely free of meaning. There are different interpretations of probability.
Probability as frequency limit
Perform a repeatable experiment N times with outcomes X1, X2, … (the ensemble).
Count the number of times that outcome X occurs: NX. The fraction NX/N tends
toward a limit, defined as the probability of outcome X:
P(X) = lim (N→∞) NX / N
Richard von Mises (1883-1953)
Objective probability: propensity
Examples: throwing a coin, rolling a die, or drawing colored pearls out
of a bag, playing roulette.
Hence propensities are now often defined by the theoretical role they play
in science, e.g. based on an underlying physical law.
Bayes Theorem
From conditional probability:
P(A|B) P(B) = P(A ∩ B) = P(B|A) P(A)
⇒  P(A|B) = P(B|A) P(A) / P(B)
Reverend Thomas Bayes (1702-1761)
An uncontroversial consequence of Kolmogorov's axioms!
Subjective probability
A, B, … are hypotheses (statements that are either true or false). Define
the probability of hypothesis A as the degree of belief that A is true.
If B is made up of disjoint pieces B ∩ Ai (the Ai being mutually exclusive):
B = ∪i (B ∩ Ai),   P(B ∩ Ai) = P(B|Ai) P(Ai)
Bayes theorem becomes
P(A|B) = P(B|A) P(A) / Σi P(B|Ai) P(Ai)
Example of Bayes theorem
Meson beam:  A1 = π,  A2 = K
Consists of 90% pions, 10% kaons
Cherenkov counter gives a signal on pions:  B = signal
95% efficient for pions, 6% fake rate (accidental signal) for kaons
Q1: if we see a signal in the counter, how likely did it come from a pion?
p(π | signal) = p(signal | π) p(π) / [ p(signal | π) p(π) + p(signal | K) p(K) ]
             = 0.95 × 0.90 / (0.95 × 0.90 + 0.06 × 0.10) = 99.3%
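As an illustration (not part of the original slides), a minimal Python sketch of this posterior calculation, using the beam composition and counter efficiencies quoted above:

```python
# Hedged sketch: P(pi | signal) via Bayes theorem for the meson-beam example.
priors = {"pi": 0.90, "K": 0.10}        # beam composition
p_signal = {"pi": 0.95, "K": 0.06}      # P(signal | particle type)

norm = sum(p_signal[h] * priors[h] for h in priors)   # P(signal)
posterior_pi = p_signal["pi"] * priors["pi"] / norm
print(f"P(pi | signal) = {posterior_pi:.3f}")          # ~0.993
```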
Be aware that the naming conventions are not always clear (in particular
"objective" and "subjective"); the best bet is to use "Frequentist"
and "Bayesian".
Describing data
Data sample properties
A data sample of a single variable, x = {x1, x2, …, xN}, can be characterized by a few numbers:
[Figure: the center of Kleinmaischeid, Germany, also the center of Europe (2005)]
Arithmetic mean:  x̄ = (1/N) Σ(i=1..N) xi ,  or for binned data  x̄ = (1/N) Σj nj xj
Variance:  V(x) = (1/N) Σ(i=1..N) (xi - x̄)² = ⟨x²⟩ - ⟨x⟩²
Standard deviation:  s = √V(x) = √(⟨x²⟩ - ⟨x⟩²)
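A minimal numerical sketch of these sample properties (not from the slides; the data values are invented):

```python
import numpy as np

# Hypothetical data sample; any 1-d array of measurements would do.
x = np.array([4.2, 5.1, 3.8, 4.9, 5.3, 4.4])

mean = x.mean()                       # arithmetic mean
variance = np.mean(x**2) - mean**2    # V(x) = <x^2> - <x>^2  (same as x.var())
std = np.sqrt(variance)               # standard deviation s
print(mean, variance, std)
```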
More than one variable
Set of data of two variables x = {( x1 , y1 ), ( x2 , y2 ),..., ( xN , y N )}
0: (-1.34361,0.93106) 7: (0.517314,-0.512618) 14: (0.901526,-0.397986)
1: (0.370898,-0.337328) 8: (0.990128,-0.597206) 15: (0.761904,-0.462093)
2: (0.215065,0.437488) 9: (0.404006,-0.511216) 16: (-2.17269,2.31899)
3: (0.869935,-0.469104) 10: (0.789204,-0.657488) 17: (-0.653227,0.829676)
4: (0.452493,-0.687919) 11: (0.359607,-0.979264) 18: (-0.543407,0.560198)
5: (0.484871,-0.51858) 12: (-0.00844855,-0.0874483) 19: (-0.701186,1.03088)
6: (0.650495,-0.608453) 13: (0.264035,-0.559026)
Covariance:  cov(x, y) = ⟨xy⟩ - ⟨x⟩⟨y⟩
Correlation:  r = cov(x, y) / (sx sy)   (dimensionless, between -1 and 1)
[Figure: scatter plots with r = 0, r = 0.5, r = 0.9, r = -0.9]
Example: group of adults
r(height, weight) > 0, r(weight, stamina) < 0, r(height, IQ) = 0, but r(weight, IQ) < 0
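As a sketch (not from the slides), the correlation coefficient defined above can be computed directly; the paired values below are invented, but the (x, y) list above could be used instead:

```python
import numpy as np

# Hypothetical paired sample (x_i, y_i).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov = np.mean(x * y) - x.mean() * y.mean()      # cov(x, y) = <xy> - <x><y>
r = cov / (x.std() * y.std())                   # correlation, dimensionless, in [-1, 1]
print(r)                                        # equivalent to np.corrcoef(x, y)[0, 1]
```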
Probability density function
Suppose the outcome of an experiment is a value vx for a continuous variable x. The probability of finding the outcome in an interval of width dx around x is described by the probability density function (pdf) f(x): P(A) = f(x) dx.
Dimensions:
P(A) is dimensionless (between 0 and 1)
f(x) has the dimensionality of (1 / dimension of x)
Expectation value:  ⟨x⟩ = ∫ x f(x) dx
and the variance:  V(x) = ⟨x²⟩ - ⟨x⟩²
Expectation values can also be defined for functions of x, e.g. h(x):  ⟨h⟩ = ∫ h(x) f(x) dx  (integrals from -∞ to +∞)
⟨g + h⟩ = ⟨g⟩ + ⟨h⟩,  but  ⟨g·h⟩ ≠ ⟨g⟩·⟨h⟩  unless g and h are independent
Drawing pdf from data sample
1. Histogram with B bins of width Δx
2. Fill with N outcomes of the experiment:  x1, …, xN  →  H = [n1, …, nB]
3. Normalize the integral to unit area:  ñi = ni / N,  Σ(i=1..B) ñi = 1
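A short sketch of this histogram-based pdf estimate (not from the slides; the sample is generated for illustration). Dividing by N·Δx makes the integral exactly 1, which is what numpy's density=True option does:

```python
import numpy as np

# Hypothetical sample drawn from some unknown pdf.
x = np.random.default_rng(1).normal(loc=0.0, scale=1.0, size=10000)

B = 50
counts, edges = np.histogram(x, bins=B)          # n_1, ..., n_B
widths = np.diff(edges)                          # bin widths
pdf_estimate = counts / (counts.sum() * widths)  # normalize so the integral is 1
print(np.sum(pdf_estimate * widths))             # ~1.0
```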
Multidimensional pdfs
Outcome of experiment (event) characterized by n variables:
x = (x(1), x(2), …, x(n))
Joint pdf f(x):
P(∩(i=1..n) A(i)) = f(x) dx(1) … dx(n)
where A(i) is the hypothesis that variable i of the event
lies in the interval [x(i), x(i) + dx(i)]
Normalization:  ∫…∫ f(x(1), x(2), …, x(n)) dx(1) dx(2) … dx(n) = 1
Marginal pdfs, independent variables
The PDF of one (or some) of the variables is obtained by integrating over all the others →
marginal PDF:  fXi(x(i)) = ∫ f(x) ∏(j≠i) dx(j)
Variables x(1), x(2), …, x(n) are independent from each other if and only
if they factorize:
f(x) = ∏i fXi(x(i))
Conditional pdfs
Sometimes we want to consider some variables of a joint pdf as constant.
Let's look at two dimensions, starting from the conditional probability:
P(B|A) = P(A ∩ B) / P(A) = f(x, y) dx dy / (fx(x) dx) = h(y|x) dy
Conditional pdf, the distribution of y for fixed x = x1:
h(y | x = x1) = f(x = x1, y) / fx(x = x1)
In the joint pdf treat some variables as constant and evaluate at a fixed point (e.g. x = x1),
then divide the joint pdf by the marginal pdf of the variables being held constant,
evaluated at that point (e.g. fx(x = x1)).
h(y|x1) is a slice of f(x, y) at x = x1 and has the correct normalization:  ∫ h(y | x = x1) dy = 1
Some Distributions in HEP
Binomial: branching ratio
Multinomial: histogram with fixed N
Poisson: number of events found in a data sample
Uniform: Monte Carlo method
Exponential: decay time
Gaussian: measurement error
Chi-square: goodness-of-fit
Cauchy (Breit-Wigner): mass of a resonance
Landau: ionization energy loss
Binomial distribution
Outcome of experiment is 0 or 1 with p=P(1) (Bernoulli
trials). r : number of 1s occurring in n independent
trials.
Probability mass function:
P(r; p, n) = n! / (r!(n-r)!) · p^r (1-p)^(n-r)
(p^r: r times a 1;  (1-p)^(n-r): n-r times a 0;  n!/(r!(n-r)!): combinatoric term)
Properties:  ⟨r⟩ = np,  V(r) = σ² = np(1-p)
Example: a spark chamber is 95% efficient to detect the passing of a charged particle. How efficient
is a stack of four spark chambers if you require at least three hits to reconstruct a track?
P(3; 0.95, 4) + P(4; 0.95, 4) = 4 · 0.95³ · 0.05 + 0.95⁴ · 1 = 0.171 + 0.815 = 98.6%
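A quick numerical check of the spark-chamber example (not from the slides), assuming scipy is available:

```python
from scipy.stats import binom

# Each chamber is 95% efficient; a track needs at least 3 hits out of 4 chambers.
p, n = 0.95, 4
eff = binom.pmf(3, n, p) + binom.pmf(4, n, p)   # P(3; 0.95, 4) + P(4; 0.95, 4)
print(eff)                                       # ~0.986
```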
Poisson distribution (law of small numbers)
Discrete like binomial distribution, but no notion of trials. Rather l, the mean
number of (rare) events occurring in a continuum of fixed size, is known.
Derivation from binomial distribution:
Divide the continuum into n intervals and in each interval assume p = the probability that an event
occurs in that interval. Note that λ = np is known and constant.
Take the binomial distribution in the limit of large n (and small p) for fixed r:
P(r; p, n) = n!/(r!(n-r)!) · p^r (1-p)^(n-r)  ≈  (n^r / r!) · p^r (1 - λ/n)^(n-r)  →  λ^r e^(-λ) / r!
Probability mass function:  P(r; λ) = λ^r e^(-λ) / r!
Properties:  ⟨r⟩ = λ,  V(r) = σ² = λ
Famous example: Ladislaus Bortkiewicz (1868-1931) counted the number of soldiers killed by
horse-kicks each year in each corps of the Prussian cavalry (mean λ = 0.61):
Deaths | Prediction | Cases
0      | 108.7      | 109
3      | 4.1        | 3
4      | 0.6        | 1
Probability of no deaths in a corps in a year: P(0; 0.61) = 0.5434
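A sketch reproducing the predictions in the table above (not from the slides); the sample size of 200 corps-years is implied by 108.7 / 0.5434:

```python
from scipy.stats import poisson

lam = 0.61              # mean deaths per corps per year
n_corps_years = 200     # implied by prediction 108.7 = 200 * P(0; 0.61)

for r in (0, 3, 4):
    print(r, n_corps_years * poisson.pmf(r, lam))   # ~108.7, 4.1, 0.6 as in the table
```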
Gaussian (normal) distribution
f(x; μ, σ) = 1/(√(2π) σ) · exp( -(x-μ)² / (2σ²) )
Properties:  ⟨x⟩ = μ,  V(x) = σ²
Note that μ and σ also denote the mean and standard deviation of any distribution, not just
the Gaussian. The fact that they appear as parameters in this pdf justifies their naming.
Standard Gaussian (transform x → x' = (x-μ)/σ):  φ(x) = 1/√(2π) · e^(-x²/2)
Cumulative distribution:  Φ(x) = ∫(-∞..x) φ(x') dx'  cannot be calculated analytically.
Uniform distribution:
f(x; α, β) = 1/(β-α)  for α ≤ x ≤ β,  0 otherwise
⟨x⟩ = (α+β)/2,  V(x) = (β-α)²/12
Central limit theorem
The sum Y = Σ(i=1..N) Xi of N independent random variables Xi, each with mean μi and
variance σi², will for large N be approximately Gaussian.
Expectation value:  ⟨Y⟩ = Σ(i=1..N) ⟨Xi⟩ = Σ(i=1..N) μi
Variance:  V(Y) = Σ(i=1..N) V(Xi) = Σ(i=1..N) σi²
Y becomes Gaussian as N → ∞.
Examples
E.g. human height is Gaussian, since it is the sum of many genetic factors.
Weight is not Gaussian, since it is dominated by the single factor food.
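A small numerical illustration of the central limit theorem (not from the slides): sums of N uniform random variables quickly approach a Gaussian with the mean and variance predicted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 12
y = rng.uniform(0.0, 1.0, size=(100000, N)).sum(axis=1)

# CLT prediction for uniform [0,1] summands: <Y> = N/2, V(Y) = N/12.
print(y.mean(), y.var())        # ~6.0, ~1.0
```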
Half-time summary
Part I
Introduced probability
Frequency, subjective. Bayes theorem.
Properties of data samples
Mean, variance, correlation
Probability densities: the underlying distribution from which data samples
are drawn
Properties, multidimensional, marginal, conditional pdfs
Examples of pdfs in physics, CLT
Part II
HEP experiment: repeatedly drawing random events from underlying
distribution (the laws of physics that we want to understand). From the
drawn sample we want to estimate parameters of those laws
Purification of data sample: statistical testing of events
Estimation of parameters: maximum likelihood and chi-square fits
Error propagation
Intermezzo: Monte Carlo simulation
Looking at data, we want to infer something about the (probabilistic)
processes that produced the data.
Preparation:
tuning signal / background separation to achieve most significant signal
check quality of estimators (later) to find possible biases
test statistical methods for getting the final result
all of this requires data based on distributions with known parameters → Monte Carlo
simulation, built on sequences of pseudo-random numbers. Random number generators in ROOT:
TRandom
  Same as the BSD rand() function. Internal state 32 bit, short period ~10^9.
TRandom1
  Based on the mathematically proven RANLUX. Internal state 24 × 32 bit, period ~10^171.
  4 luxury levels. Slow. RANLUX is the default in the ATLAS simulation.
TRandom2
  Based on a maximally equidistributed combined Tausworthe generator. Internal state
  3 × 32 bit, period ~10^26. Fast. Use if only a small number of random numbers is needed.
TRandom3
  Based on the Mersenne Twister algorithm. Large state 624 × 32 bit. Very long period
  ~10^6000. Fast. Default in ROOT.
Seed: seed 0 uses a random seed, anything else gives you a reproducible sequence.
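The ROOT generators above are C++ classes; as a rough Python analogue of the seeding rule (fixed seed is reproducible, no seed is not), assuming numpy:

```python
import numpy as np

# Fixed seed: the sequence is reproducible (analogue of a nonzero TRandom seed).
a = np.random.default_rng(42).random(3)
b = np.random.default_rng(42).random(3)
print(np.array_equal(a, b))     # True

# No seed given: entropy from the OS, a different sequence each run
# (analogue of seed 0 in ROOT, which picks a "random" seed).
c = np.random.default_rng().random(3)
```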
Transformation method (analytic)
Given r1, r2, …, rn uniform in [0, 1], find x1, x2, …, xn that follow f(x) by
finding a suitable transformation x(r).
Require P(r' ≤ r) = P(x' ≤ x(r)), which means
∫(-∞..r) u(r') dr' = r = ∫(-∞..x(r)) f(x') dx' = F(x(r))
so set F(x) = r and solve for x, i.e. x(r) = F⁻¹(r).
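A sketch of the transformation method for an exponential pdf (not from the slides; the exponential is the decay-time example from the distribution catalogue):

```python
import numpy as np

# f(x) = (1/tau) exp(-x/tau)  =>  F(x) = 1 - exp(-x/tau)  =>  x(r) = -tau * ln(1 - r)
rng = np.random.default_rng(0)
tau = 2.0
r = rng.uniform(size=100000)        # uniform in [0, 1]
x = -tau * np.log(1.0 - r)          # follows the exponential pdf

print(x.mean())                     # ~tau, as expected for a decay-time distribution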
Accept reject method
Enclose the pdf in a box:  [xmin, xmax] × [0, fmax]
Generate a candidate x uniformly in [xmin, xmax] and a height u uniformly in [0, fmax];
accept x if u < f(x), otherwise reject it and repeat.
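A minimal sketch of accept-reject sampling (not from the slides; the pdf shape below is a hypothetical bump):

```python
import numpy as np

def f(x):
    # Hypothetical target shape; an unnormalized pdf is fine for accept-reject.
    return np.exp(-0.5 * ((x - 1.0) / 0.3) ** 2)

rng = np.random.default_rng(0)
x_min, x_max, f_max = 0.0, 2.0, 1.0

samples = []
while len(samples) < 10000:
    x = rng.uniform(x_min, x_max)     # candidate
    u = rng.uniform(0.0, f_max)       # height
    if u < f(x):                      # accept with probability f(x)/f_max
        samples.append(x)
samples = np.array(samples)
```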
Improving accept reject method
In regions where f(x) is small compared to fmax a lot of the sampled
points are rejected.
Serious waste of computing power, simulation in HEP consists of billions of random
numbers, so this does add up!
Split [xmin, xmax] into regions (i), each with its own fmax(i), and simulate the pdf
separately in each region. Proper normalization of the number of events per region:
N(i) ∝ A(i) = (xmax(i) - xmin(i)) · fmax(i)
More general: find an enveloping function around f(x) for which you can
generate random numbers. Use this to generate x.
MC simulation in HEP
Event generation: PYTHIA, Herwig, ISAJET, …
(general-purpose generators)
[Diagram: Nature / Theory → Data (simulated or real) → Statistical inference]
Given these data, what can we say about the correctness, parameters, etc. of the
distribution functions?
Typical HEP analysis
1. Purification of the data sample (event classification)
2. Parameter estimation
   Mass, CP, size of signal
Event Classification
Suppose data sample with two types of events: Signal S, Background B
Suppose we have found discriminating input variables x1, x2, …
What decision boundary should we use to select signal events (type S)?
[Figure: three possible decision boundaries separating S from B in the (x1, x2) plane]
A classifier y: Rⁿ → R maps each event x to a scalar output y(x).
[Figure: distributions g(y|B) and g(y|S) of the classifier output, with a cut value ycut
separating the "reject S" and "accept S" regions]
The decision boundary can now be defined by a single cut ycut on the classifier
output y(x), which divides the input space into a rejection (critical) and an
acceptance region. This defines a test: if the event falls into the critical
region we reject the S hypothesis.
Convention
In the literature one often sees
Background B = H0
Signal S = H1
Definition of a test
Goal is to make some statement based on the observed data
x as to the validity of the possible hypotheses, e.g. signal hypothesis S.
Probability to accept signal events as signal (signal efficiency):
εS = ∫(ycut..∞) g(y|S) dy = 1 - β
Background efficiency:
εB = ∫(ycut..∞) g(y|B) dy = α
[Figure: g(y|B) and g(y|S) versus y, with the cut ycut; β is the area of g(y|S) below ycut,
α the area of g(y|B) above ycut]
Neyman Pearson test
Design a test in the n-dimensional input space by defining a critical region WS.
Selecting an event in WS as signal has errors α and β:
α = ∫(WS) fB(x) dx = εB    and    β = 1 - ∫(WS) fS(x) dx = 1 - εS
Neyman-Pearson Lemma
Likelihood ratio:  yr(x) = P(x|S) / P(x|B) = fS(x) / fB(x)
The likelihood-ratio test as selection criterion gives, for each selection efficiency εS,
the largest background rejection 1 - εB (i.e. the smallest α).
[Figure: background rejection 1 - εBackgr. = 1 - α versus signal efficiency εS (ROC curve),
running from "accept nothing" to "accept everything"]
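A sketch of a likelihood-ratio test for two hypothetical one-dimensional Gaussian hypotheses (not from the slides); cutting on yr(x) defines one working point on the ROC curve:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-d example: signal and background pdfs are known Gaussians.
rng = np.random.default_rng(0)
sig = rng.normal(+1.0, 1.0, 10000)
bkg = rng.normal(-1.0, 1.0, 10000)

def y_r(x):                                   # likelihood ratio f_S(x) / f_B(x)
    return norm.pdf(x, +1.0, 1.0) / norm.pdf(x, -1.0, 1.0)

y_cut = 1.0                                   # one possible working point
eff_S = np.mean(y_r(sig) > y_cut)             # signal efficiency  (1 - beta)
eff_B = np.mean(y_r(bkg) > y_cut)             # background efficiency  (alpha)
print(eff_S, eff_B)
```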
PDE methods
Construct non-parametric estimators f̂ of the pdfs f(x|S) and f(x|B)
and use these to construct the likelihood ratio:
yr(x) = f̂(x|S) / f̂(x|B)
The methods are based on turning the training sample into PDEs for signal
and background and then providing a fast lookup for yr(x).
Projective Likelihood Estimator
Probability density estimators for each input variable (marginal PDF)
combined in overall likelihood estimator, much liked in HEP.
PDE for each variable k (marginal pdf), combined into a likelihood ratio for event i:
y(xi) = ∏(k ∈ {variables}) fSignal^k(xi^k) / fU(xi),
where fU(xi) = Σ(U = Signal, Background) ∏(k ∈ {variables}) fU^k(xi^k)
Advantages:
estimating each variable's distribution independently alleviates the problems from the
curse of dimensionality
simple and robust, especially in low-dimensional problems
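A sketch of such a projective (naive-Bayes-like) likelihood classifier using histogram PDEs (not from the slides; the training samples and binning are hypothetical):

```python
import numpy as np

def make_marginal_pdfs(sample, bins=30, range_=(-5, 5)):
    # One normalized histogram per input variable.
    pdfs = []
    for k in range(sample.shape[1]):
        counts, edges = np.histogram(sample[:, k], bins=bins, range=range_, density=True)
        pdfs.append((counts, edges))
    return pdfs

def eval_likelihood(pdfs, x):
    # Product of the per-variable pdf values at the event's coordinates.
    L = 1.0
    for (counts, edges), xk in zip(pdfs, x):
        i = np.clip(np.searchsorted(edges, xk) - 1, 0, len(counts) - 1)
        L *= max(counts[i], 1e-12)            # avoid exact zeros from empty bins
    return L

rng = np.random.default_rng(0)
sig = rng.normal(+1.0, 1.0, size=(5000, 3))   # toy training samples
bkg = rng.normal(-1.0, 1.0, size=(5000, 3))
pdf_S, pdf_B = make_marginal_pdfs(sig), make_marginal_pdfs(bkg)

x = np.array([0.5, 0.2, 1.1])                 # one test event
L_S, L_B = eval_likelihood(pdf_S, x), eval_likelihood(pdf_B, x)
y = L_S / (L_S + L_B)                         # projective likelihood output in [0, 1]
print(y)
```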
Estimating the input PDFs from the sample
Technical challenge, three ways:
k-Nearest Neighbor
  Better: count adjacent reference events until a statistically significant number is
  reached (the method is intrinsically adaptive)
PDE-Foam
  Parcel the input space into cells of varying sizes; each cell contains representative information
  (the average of the reference events in its neighborhood)
  Advantage: limited number of cells, independent of the number of training events
Kernel estimators
  [Figure: pdf estimate with no kernel weighting vs with a Gaussian kernel]
Fast search: a binary search tree that sorts objects in space by their coordinates
Curse of Dimensionality
Problems caused by the exponential increase in volume associated
with adding extra dimensions to a mathematical space:
Distance functions lose their usefulness in high dimensionality:
lim(D→∞) (dmax - dmin) / dmin = 0
Boosted Decision Tree
1) Decision tree (DT)
A series of cuts that split the sample set into ever smaller subsets.
Growing: each split tries to maximize the gain in separation ΔG = N·G - N1·G1 - N2·G2
Gini (inequality) index:  Gnode = Snode·Bnode / (Snode + Bnode)²
Leaves are assigned either S or B.
Event classification: follow the splits using the test event's variables until a leaf
is reached: S or B.
Pruning: remove statistically insignificant nodes bottom-up, to protect from overtraining.
2) Boosting method (AdaBoost)
Build a forest of DTs:
1. Emphasize classification errors in DTk: increase (boost) the weight of incorrectly
   classified events
2. Train a new tree DTk+1
The final classifier linearly combines all trees; DTs with small misclassification get a
large coefficient.
A DT is dimensionally robust and easy to understand, but alone not powerful! Boosting gives
good performance and stability with little tuning needed. Popular in HEP (MiniBooNE,
single top at D0).
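A sketch of a boosted decision tree on toy data using scikit-learn's AdaBoost (not from the slides; the lecture uses TMVA, this is only an analogue with hypothetical data):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_sig = rng.normal(+0.5, 1.0, size=(2000, 2))   # toy signal events
X_bkg = rng.normal(-0.5, 1.0, size=(2000, 2))   # toy background events
X = np.vstack([X_sig, X_bkg])
y = np.array([1] * 2000 + [0] * 2000)           # 1 = signal, 0 = background

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=200)
bdt.fit(X, y)
print(bdt.score(X, y))                          # training accuracy (use a separate test set in practice)
```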
Multivariate summary
Multivariate analysis packages:
StatPatternRecognition: I.Narsky, arXiv: physics/0507143
http://www.hep.caltech.edu/~narsky/spr.html
TMVA: Hoecker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039
http://tmva.sf.net or every ROOT distribution
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
Huge data analysis library available in R: http://www.r-project.org/
Support training, evaluation, comparison of many state-of-the-art classifiers
Parameter estimation
Estimation of variable properties
Estimator:
A procedure applicable to a data sample S which gives the numerical value for
a) a property of the parent population from which S was selected, or
b) a property or parameter of the parent distribution function that generated S
Estimators are denoted with a hat over the parameter or property
Consistent:  lim(N→∞) â = a
Unbiased:  ⟨â⟩ = a
For large N any consistent estimator becomes unbiased!
Efficient:  V(â) is small
More efficient estimators are more likely to be close to the true value. There is a theoretical
lower limit on the variance, the minimum variance bound (MVB). The efficiency of an estimator
is MVB / V(â).
A mean estimator example
Estimators for the mean of a distribution (are they consistent, unbiased, efficient?):
1) Sum up all x and divide by N
2) Sum up all x and divide by N-1
3) Sum up every second x and divide by int(N/2)
4) Throw away the data and return 42

1) is consistent (law of large numbers) and unbiased:
   μ̂ = (x1 + x2 + … + xN)/N,   ⟨μ̂⟩ = (⟨x⟩ + ⟨x⟩ + … + ⟨x⟩)/N = μ
3) is also consistent and unbiased, but less efficient than 1) since it uses only half the data.
2) is consistent but biased:
   μ̂ = (x1 + x2 + … + xN)/(N-1),   ⟨μ̂⟩ = N⟨x⟩/(N-1) = N μ/(N-1) ≠ μ
   (the bias factor N/(N-1) → 1 as N → ∞, so 2) is still consistent)
4) is (in general) neither consistent nor unbiased.
Efficiency depends on the data sample S. Note that some estimators are always consistent
or unbiased; most often the properties of the estimator depend on the data sample.
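A toy comparison of the four estimators above over many repeated experiments (not from the slides; the true mean and sample size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, n_experiments = 5.0, 100, 2000

est = {"1) sum/N": [], "2) sum/(N-1)": [], "3) every 2nd": [], "4) always 42": []}
for _ in range(n_experiments):
    x = rng.normal(mu, 2.0, N)
    est["1) sum/N"].append(x.sum() / N)
    est["2) sum/(N-1)"].append(x.sum() / (N - 1))
    est["3) every 2nd"].append(x[::2].sum() / (N // 2))
    est["4) always 42"].append(42.0)

for name, vals in est.items():
    vals = np.array(vals)
    print(name, "bias:", vals.mean() - mu, "variance:", vals.var())
```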
Examples of basic estimators
Estimating the mean:  μ̂ = x̄
Consistent, unbiased, maybe efficient:  V(μ̂) = σ²/N  (from the central limit theorem)
Estimating the variance:
a) when knowing the true mean μ:  V̂(x) = (1/N) Σi (xi - μ)²
b) when not knowing the true mean:  V̂(x) = s² = (1/(N-1)) Σi (xi - x̄)²
Note the correction factor N/(N-1) compared to the naive expectation. Since x̄ is closer to
the data points of the sample S than the true mean μ, the naive 1/N result would underestimate
the variance and introduce a bias!
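A toy check of the N/(N-1) correction (not from the slides; the true variance and sample size are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, N = 4.0, 10

naive, corrected = [], []
for _ in range(20000):
    x = rng.normal(0.0, np.sqrt(true_var), N)
    naive.append(np.mean((x - x.mean())**2))            # 1/N around the sample mean
    corrected.append(np.sum((x - x.mean())**2) / (N - 1))

print(np.mean(naive), np.mean(corrected))               # ~3.6 (biased low) vs ~4.0
```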
Properties of the ML estimator
For a sample of independent measurements x1, …, xN the likelihood is
L(x1, …, xN; a) = ∏i P(xi; a), and the ML estimate â maximizes it.
Usually consistent.
Invariant under a change of parameter f = f(a): the peak in the likelihood function satisfies
d ln L/da |(a=â) = (d ln L/df · df/da) |(f=f̂=f(â)) = 0,  so f̂ = f(â).
Error on an ML estimator for large N
Expand ln L around its maximum â. We have seen that  d ln L(x1,…,xN; a)/da |(a=â) = 0.
The second derivative  d² ln L/da²  is important for estimating the error.
One can show for any unbiased and efficient ML estimator (e.g. large N):
d ln L(x1,…,xN; a)/da = A(a)·(â(x1,…,xN) - a),   with proportionality factor  A(a) = -d² ln L/da²
The CLT tells us that the probability distribution of â is Gaussian. For this to be (close to)
true, A must be (relatively) constant around a = â:
L(x1, x2, …, xN; a) ∝ exp( -A·[a - â(x1, x2, …, xN)]² / 2 )
ML does not give you the most likely value for a; it gives the value of a for which the
observed data are the most likely!
d ln L(S; a)/da |(a=â) = 0  usually can't be solved analytically; use numerical methods,
such as MINUIT. You need to program your P(x; a).
No quality check: the value of ln L(S; â) tells you nothing about how good your
assumption for P(x; a) was.
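A sketch of a numerical ML fit (not from the slides; in practice MINUIT would be used). It estimates the lifetime of an exponential pdf from toy data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(2.0, size=1000)             # toy data with true tau = 2

def neg_log_L(tau):
    # -ln L for P(x; tau) = (1/tau) exp(-x/tau)
    return np.sum(np.log(tau) + x / tau)

res = minimize_scalar(neg_log_L, bounds=(0.1, 10.0), method="bounded")
tau_hat = res.x
# For this pdf, -d^2 ln L / dtau^2 at the maximum equals N / tau_hat^2,
# so the large-N error estimate is tau_hat / sqrt(N).
sigma_tau = tau_hat / np.sqrt(len(x))
print(tau_hat, "+/-", sigma_tau)
```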
Least square estimation
A particular MLE with Gaussian distributions: each of the sample points yi
has its own expectation f(xi; a) and resolution σi:
P(yi; a) = 1/(√(2π) σi) · exp( -[yi - f(xi; a)]² / (2σi²) )
To maximize the likelihood, minimize
χ² = Σi ( [yi - f(xi; a)] / σi )²
Fitting binned data (nj = content of bin j, fj = expectation for bin j):
Proper χ²:  χ² = Σj (nj - fj)² / fj    (the bin content follows Poisson statistics)
Simple χ² (simpler to calculate):  χ² = Σj (nj - fj)² / nj    (nj used as the squared error)
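A sketch of a least-squares (chi-square) fit of a straight line to points with resolutions σi (not from the slides; all values are hypothetical):

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
sigma = np.full_like(y, 0.3)                    # per-point resolutions

def f(x, a0, a1):
    return a0 + a1 * x                          # model f(x; a)

popt, pcov = curve_fit(f, x, y, sigma=sigma, absolute_sigma=True)
chi2 = np.sum(((y - f(x, *popt)) / sigma) ** 2)
ndof = len(x) - len(popt)                       # N_points - N_parameters
print(popt, chi2, ndof)
```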
Advantages of least squares
Method provides goodness-of-fit
The value of the c2 at its minimum is a measure of the level of agreement between the
data and fitted curve.
The χ² statistic follows the chi-square distribution f(χ²; n).
Each data point contributes χ² ≈ 1; minimizing reduces χ² by about 1 per free fit variable.
Number of degrees of freedom:  n = Nbin - Nvar
Error Analysis
Statistical errors:
How much would the result fluctuate upon repetition of the experiment?
Literature
Statistical Data Analysis
G. Cowan, Clarendon Press, Oxford, 1998; see also www.pp.rhul.ac.uk/~cowan/sda