Lec3 IntroToProbabilityAndStatistics
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras)
Contents
The binomial and Bernoulli distributions
Student’s T
Laplace distribution
Gamma distribution
Beta distribution
Pareto distribution
References
• Following closely Chris Bishop's PRML book, Chapter 2
Binary Variables
Consider a coin-flipping experiment with heads = 1 and tails = 0. With $\mu \in [0,1]$:

$p(x=1\mid\mu) = \mu, \qquad p(x=0\mid\mu) = 1-\mu$

$\mathrm{Bern}(x\mid\mu) = \mu^{x}(1-\mu)^{1-x} = \mu^{\mathbb{I}(x=1)}(1-\mu)^{\mathbb{I}(x=0)}$
Bernoulli Distribution
Recall that in general

$\mathbb{E}[f] = \sum_x p(x)f(x), \qquad \mathbb{E}[f] = \int p(x)f(x)\,dx$

$\mathrm{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$

For the Bernoulli distribution $\mathrm{Bern}(x\mid\mu) = \mu^{x}(1-\mu)^{1-x}$, we can easily show from the definitions:

$\mathbb{E}[x] = \mu, \qquad \mathrm{var}[x] = \mu(1-\mu)$

$\mathbb{H}[x] = -\sum_{x\in\{0,1\}} p(x\mid\mu)\ln p(x\mid\mu) = -\mu\ln\mu - (1-\mu)\ln(1-\mu)$

For a data set $\mathcal{D} = \{x_1, x_2, \dots, x_N\}$ in which we have $m$ heads ($x=1$) and $N-m$ tails ($x=0$), the likelihood is

$p(\mathcal{D}\mid\mu) = \prod_{n=1}^{N} p(x_n\mid\mu) = \prod_{n=1}^{N}\mu^{x_n}(1-\mu)^{1-x_n} = \mu^{m}(1-\mu)^{N-m}$
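As a quick numerical sketch (in Python rather than the MatLab used in these slides), the Bernoulli pmf and likelihood above can be checked directly; the coin flips below are hypothetical:

```python
import math

def bern_pmf(x, mu):
    # Bern(x|mu) = mu^x * (1 - mu)^(1 - x), for x in {0, 1}
    return mu**x * (1 - mu)**(1 - x)

def bern_log_lik(data, mu):
    # log p(D|mu) = m*log(mu) + (N - m)*log(1 - mu), m = number of heads
    m = sum(data)
    return m * math.log(mu) + (len(data) - m) * math.log(1 - mu)

flips = [1, 0, 1, 1, 0]           # hypothetical flips: m = 3 heads, N = 5
mu_mle = sum(flips) / len(flips)  # MLE is the sample fraction of heads
```

The log-likelihood is maximized at the sample fraction of heads, anticipating the MLE result derived for the multinoulli case later in these slides.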
Binomial Distribution
Consider the discrete random variable $X\in\{0,1,2,\dots,N\}$. The binomial distribution gives the probability of observing $m$ heads in $N$ trials:

$\mathrm{Bin}(m\mid N,\mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m}$

[Figure: histograms of $\mathrm{Bin}(m\mid N=10,\mu)$ for two values of $\mu$, $m = 0,\dots,10$.]
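A minimal Python sketch of the binomial pmf (using `math.comb` for the binomial coefficient); the values of $N$ and $\mu$ below are illustrative:

```python
import math

def binom_pmf(m, n, mu):
    # Bin(m|N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)
    return math.comb(n, m) * mu**m * (1 - mu)**(n - m)

# pmf values for N = 10, mu = 0.25; these sum to one over m = 0..N
probs = [binom_pmf(m, 10, 0.25) for m in range(11)]
```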
Mean,Variance of the Binomial Distribution
Since for independent events the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances, we obtain

$\mathbb{E}[m] = \sum_{m=0}^{N} m\,\mathrm{Bin}(m\mid N,\mu) = N\mu, \qquad \mathrm{var}[m] = \sum_{m=0}^{N}\left(m - \mathbb{E}[m]\right)^2\mathrm{Bin}(m\mid N,\mu) = N\mu(1-\mu)$
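These moments can be confirmed numerically by summing over the pmf directly; a brief Python check with illustrative $N$ and $\mu$:

```python
import math

def binom_pmf(m, n, mu):
    # Bin(m|N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)
    return math.comb(n, m) * mu**m * (1 - mu)**(n - m)

n, mu = 10, 0.25
mean = sum(m * binom_pmf(m, n, mu) for m in range(n + 1))              # should equal N*mu
var = sum((m - mean)**2 * binom_pmf(m, n, mu) for m in range(n + 1))   # should equal N*mu*(1-mu)
```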
Binomial Distribution: Normalization
To show that the binomial is correctly normalized, we use the following identity, which can be shown with direct substitution:

$\binom{N}{n} + \binom{N}{n-1} = \binom{N+1}{n} \qquad (*)$

With it one establishes

$\sum_{m=0}^{N}\binom{N}{m}\mu^{m}(1-\mu)^{N-m} = 1$
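Both the identity $(*)$ and the normalization can be spot-checked numerically; a small Python sketch over illustrative ranges:

```python
import math

# Pascal's identity (*): C(N, n) + C(N, n-1) = C(N+1, n)
identity_holds = all(
    math.comb(N, n) + math.comb(N, n - 1) == math.comb(N + 1, n)
    for N in range(1, 12) for n in range(1, N + 1)
)

# normalization: sum_m C(N, m) mu^m (1-mu)^(N-m) = 1
mu, N = 0.3, 12
total = sum(math.comb(N, m) * mu**m * (1 - mu)**(N - m) for m in range(N + 1))
```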
Multinoulli/Categorical Distribution

Consider a discrete variable with $K$ states in one-of-$K$ (one-hot) encoding, $x = (x_1,\dots,x_K)^T$ with $x_k\in\{0,1\}$ and $\sum_k x_k = 1$:

$p(x\mid\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{x_k}$

where $\boldsymbol{\mu} = (\mu_1,\dots,\mu_K)^T$ with $\mu_k\ge 0$ and $\sum_{k=1}^{K}\mu_k = 1$. The distribution is already normalized:

$\sum_x p(x\mid\boldsymbol{\mu}) = \sum_{k=1}^{K}\mu_k = 1$

For a data set of $N$ observations, the likelihood is $p(\mathcal{D}\mid\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{m_k}$, where $m_k = \sum_{n=1}^{N} x_{nk}$ is the # of observations of $x_k = 1$.
MLE Estimate: Multinoulli Distribution
To compute the maximum likelihood (MLE) estimate of $\boldsymbol{\mu}$, we maximize an augmented log-likelihood, with a Lagrange multiplier $\lambda$ enforcing the constraint $\sum_k\mu_k = 1$:

$\ln p(\mathcal{D}\mid\boldsymbol{\mu}) + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right) = \sum_{k=1}^{K} m_k\ln\mu_k + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right)$

Setting the derivative wrt $\mu_k$ equal to zero:

$\frac{m_k}{\mu_k} + \lambda = 0 \quad\Rightarrow\quad \mu_k = -\frac{m_k}{\lambda}$

Substitution into the constraint gives

$-\sum_{k=1}^{K}\frac{m_k}{\lambda} = 1 \quad\Rightarrow\quad \lambda = -\sum_{k=1}^{K} m_k = -N$

so that

$\mu_k^{ML} = \frac{m_k}{N}$

As expected, this is the fraction in the $N$ observations of $x_k = 1$.
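The closed-form result above is trivial to compute from counts; a Python sketch with hypothetical counts over $K = 3$ states:

```python
counts = [2, 5, 3]                 # hypothetical m_k over K = 3 states, N = 10
N = sum(counts)
mu_mle = [m / N for m in counts]   # mu_k^ML = m_k / N, the fraction of each state
```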
Multinomial Distribution
We can also consider the joint distribution of 𝑚1, … , 𝑚𝐾 in 𝑁
observations conditioned on the parameters 𝝁 =
(𝜇1, … , 𝜇𝐾).
$p(m_1, m_2, \dots, m_K\mid N, \mu_1, \mu_2, \dots, \mu_K) = \frac{N!}{m_1!\,m_2!\cdots m_K!}\,\mu_1^{m_1}\mu_2^{m_2}\cdots\mu_K^{m_K}, \qquad \text{where } \sum_{k=1}^{K} m_k = N$
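A small Python sketch of the multinomial pmf; for $K = 2$ it should reduce to the binomial, which makes a convenient sanity check:

```python
import math

def multinomial_pmf(ms, mus):
    # N! / (m_1! ... m_K!) * prod_k mu_k^{m_k}, with N = sum_k m_k
    n = sum(ms)
    coef = math.factorial(n)
    for m in ms:
        coef //= math.factorial(m)   # stays an exact integer at every step
    p = float(coef)
    for m, mu in zip(ms, mus):
        p *= mu**m
    return p
```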
Example: Biosequence Analysis
Consider a set of DNA sequences, e.g.

cgatacggggtcgaa
caatccgagatcgca
Summary of Discrete Distributions
The multinomial and related discrete distributions are summarized below in a table from Kevin Murphy's textbook.
The Poisson Distribution
We say that $X\in\{0,1,2,3,\dots\}$ has a Poisson distribution with parameter $\lambda > 0$, if its pmf is

$X\sim\mathrm{Poi}(\lambda): \qquad \mathrm{Poi}(x\mid\lambda) = e^{-\lambda}\,\frac{\lambda^{x}}{x!}$
[Figure: Poisson pmfs for two values of $\lambda$, plotted for $x = 0,\dots,30$.]
Use MatLab function poissonPlotDemo from Kevin Murphy's PMTK
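Alternatively, here is a minimal Python sketch of the Poisson pmf; summing over a long truncated range recovers the normalization and the mean $\mathbb{E}[x] = \lambda$ (the value of $\lambda$ is illustrative):

```python
import math

def poisson_pmf(x, lam):
    # Poi(x|lambda) = exp(-lambda) * lambda^x / x!
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 4.0
total = sum(poisson_pmf(x, lam) for x in range(100))     # ~ 1 (tail beyond 99 is negligible)
mean = sum(x * poisson_pmf(x, lam) for x in range(100))  # ~ lambda
```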
The Empirical Distribution
Given data, 𝒟 = {𝑥1, … , 𝑥𝑁}, we define the empirical
distribution as:
$p_{\rm emp}(A) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(A), \qquad \text{Dirac measure: } \delta_{x_i}(A) = \begin{cases} 1 & \text{if } x_i\in A\\ 0 & \text{if } x_i\notin A\end{cases}$

We can also associate weights with each sample:

Generalize $\; p_{\rm emp}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(x)\;$ to $\; p_{\rm emp}(x) = \sum_{i=1}^{N} w_i\,\delta_{x_i}(x), \qquad 0\le w_i\le 1, \quad \sum_{i=1}^{N} w_i = 1$
This corresponds to a histogram with spikes at each
sample point with height equal to the corresponding
weight. This distribution assigns zero weight to any point
not in the dataset.
Note that the “sample mean of 𝑓(𝑥)” is the expectation of
𝑓(𝑥) under the empirical distribution:
$\mathbb{E}[f(x)] = \int f(x)\,p_{\rm emp}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N} f(x_i)$
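This expectation under the empirical distribution (optionally with weights) is one line of Python; the data and weights below are hypothetical:

```python
def emp_expectation(f, xs, ws=None):
    # E_emp[f] = sum_i w_i f(x_i); uniform weights w_i = 1/N by default
    if ws is None:
        ws = [1.0 / len(xs)] * len(xs)
    return sum(w * f(x) for w, x in zip(ws, xs))

xs = [1.0, 2.0, 4.0]
mean = emp_expectation(lambda x: x, xs)                        # sample mean
wmean = emp_expectation(lambda x: x, xs, ws=[0.5, 0.25, 0.25]) # weighted version
```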
Student’s T Distribution
$p(x\mid\mu,\lambda,\nu) = \mathcal{T}(x\mid\mu,\lambda,\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}$
For 𝝊 → ∞, 𝓣(𝒙|𝝁, 𝝀, 𝝊) Becomes a Gaussian
$\mathcal{T}(x\mid\mu,\lambda,\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}$

We first write the distribution as follows:

$\mathcal{T}(x\mid\mu,\lambda,\nu) \propto \left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}} = \exp\left[-\frac{\nu+1}{2}\ln\left(1 + \frac{\lambda(x-\mu)^2}{\nu}\right)\right]$

Using $\ln(1+z) = z + O(z^2)$ for large $\nu$, the exponent tends to $-\frac{\lambda(x-\mu)^2}{2}$, so the distribution approaches $\mathcal{N}(x\mid\mu,\lambda^{-1})$ up to normalization.
Student’s T Distribution
$\mathcal{T}(x\mid\mu,\lambda,\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}$

Mean: $\mu$ (for $\nu > 1$); Mode: $\mu$; Var: $\frac{\nu}{\lambda(\nu-2)}$ (for $\nu > 2$). For $\nu\to\infty$, we obtain $\mathcal{N}(\mu,\lambda^{-1})$.

[Figure: Student's t densities for $\mu = 0$, $\lambda = 1$ and $\nu = 0.1, 1.0, 10$, $x\in[-5,5]$.]

MatLab Code
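A Python sketch of this density (using `math.lgamma` to keep the ratio of Gamma functions stable for large $\nu$); for $\nu = 1$ it reduces to the Cauchy density, and for very large $\nu$ it is numerically indistinguishable from the Gaussian:

```python
import math

def student_t_pdf(x, mu, lam, nu):
    # T(x|mu,lambda,nu) = Gamma((nu+1)/2)/Gamma(nu/2) * (lam/(pi*nu))^{1/2}
    #                     * [1 + lam*(x-mu)^2/nu]^{-(nu+1)/2}
    log_c = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
    c = math.exp(log_c) * math.sqrt(lam / (math.pi * nu))
    return c * (1 + lam * (x - mu)**2 / nu) ** (-(nu + 1) / 2)

def gauss_pdf(x, mu, lam):
    # N(x|mu, lambda^{-1}) in precision form
    return math.sqrt(lam / (2 * math.pi)) * math.exp(-0.5 * lam * (x - mu)**2)
```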
Student’s T Vs the Gaussian
We plot the pdfs of the Gaussian, Student's t, and Laplace distributions.

[Figure: probability density functions (linear and log scale) of the Gaussian, Student's t, and Laplace distributions, $x\in[-4,4]$.]

Run MatLab function studentLaplacePdfPlot from Kevin Murphy's PMTK. It is recommended to use $\nu = 4$.
Student’s T Distribution
$p(x\mid\mu,a,b) = \int_0^{\infty}\mathcal{N}\left(x\mid\mu,\tau^{-1}\right)\mathrm{Gamma}(\tau\mid a,b)\,d\tau$

$= \int_0^{\infty}\left(\frac{\tau}{2\pi}\right)^{1/2}\exp\left[-\frac{\tau(x-\mu)^2}{2}\right]\frac{b^{a}}{\Gamma(a)}\,\tau^{a-1}e^{-b\tau}\,d\tau$

Substituting $z = \tau\Delta$ with $\Delta = b + \frac{(x-\mu)^2}{2}$:

$p(x\mid\mu,a,b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\Delta^{-a-\frac{1}{2}}\int_0^{\infty} e^{-z}\,z^{a+\frac{1}{2}-1}\,dz$
Appendix: Student’s T as a Mixture of Gaussians
$p(x\mid\mu,a,b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x-\mu)^2}{2}\right]^{-a-\frac{1}{2}}\int_0^{\infty} e^{-z}\,z^{a+\frac{1}{2}-1}\,dz$

Recalling the definition of the Gamma function, $\Gamma(a) = \int_0^{\infty} e^{-z}\,z^{a-1}\,dz$:

$p(x\mid\mu,a,b) = \frac{b^{a}}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x-\mu)^2}{2}\right]^{-a-\frac{1}{2}}\Gamma\left(a+\frac{1}{2}\right)$

It is common to redefine the parameters in this distribution as $\nu = 2a$, $\lambda = a/b$, which gives

$p(x\mid\mu,\lambda,\nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}$
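The scale-mixture identity can be verified numerically: integrating $\mathcal{N}(x\mid\mu,\tau^{-1})\,\mathrm{Gamma}(\tau\mid a,b)$ over $\tau$ should reproduce the Student's t density with $\nu = 2a$, $\lambda = a/b$. A Python sketch using a simple midpoint rule (the grid size and truncation point are pragmatic choices, not part of the derivation):

```python
import math

def student_t_pdf(x, mu, lam, nu):
    # standard parametrization of the Student's t density
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) * math.sqrt(lam / (math.pi * nu))
    return c * (1 + lam * (x - mu)**2 / nu) ** (-(nu + 1) / 2)

def mixture_pdf(x, mu, a, b, n=100000, upper=60.0):
    # midpoint-rule integration of N(x|mu, 1/tau) * Gamma(tau|a, b) over tau in (0, upper]
    h = upper / n
    total = 0.0
    for i in range(n):
        tau = (i + 0.5) * h
        norm = math.sqrt(tau / (2 * math.pi)) * math.exp(-0.5 * tau * (x - mu)**2)
        gam = b**a / math.gamma(a) * tau**(a - 1) * math.exp(-b * tau)
        total += norm * gam * h
    return total
```

With $a = b = 1$ the mixture should match $\mathcal{T}(x\mid\mu, \lambda = 1, \nu = 2)$ to within the quadrature error.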
Robustness of Student’s T Distribution
The robustness of the t-distribution is illustrated here by comparing the maximum likelihood fits of a Gaussian and a t-distribution (30 data points from the Gaussian are used).
MatLab Code
Robustness of Student’s T Distribution
The earlier simulation is repeated here with the PMTK toolbox.
[Figure: maximum likelihood fits of Gaussian, Student's t, and Laplace densities to the data.]
The Laplace Distribution
Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. It has the following pdf:

$\mathrm{Lap}(x\mid\mu,b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$

Here $\mu$ is a location parameter and $b > 0$ is a scale parameter.
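The pdf is straightforward to code; a brief Python sketch (note the symmetry about $\mu$ and the value $\frac{1}{2b}$ at the peak):

```python
import math

def laplace_pdf(x, mu, b):
    # Lap(x|mu,b) = exp(-|x - mu| / b) / (2b)
    return math.exp(-abs(x - mu) / b) / (2 * b)
```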
Beta Distribution
The Beta($\alpha,\beta$) distribution with $x\in[0,1]$ and $\alpha,\beta > 0$ is defined as follows:

$\mathrm{Beta}(x\mid\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \;\text{(normalizing factor)}$

$\mathrm{mode}[x] = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \mathbb{E}[x] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{var}[x] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

[Figure: Beta densities for $(\alpha,\beta) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0), (8.0, 4.0)$.]
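A Python sketch of the Beta density via `math.gamma`; as checks, $\alpha=\beta=1$ gives the uniform density, and for $\alpha=2,\beta=3$ the density peaks at the mode $(\alpha-1)/(\alpha+\beta-2) = 1/3$:

```python
import math

def beta_pdf(x, a, b):
    # Beta(x|a,b) = Gamma(a+b) / (Gamma(a) * Gamma(b)) * x^(a-1) * (1-x)^(b-1)
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * x**(a - 1) * (1 - x)**(b - 1)
```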
Beta Distribution
If 𝛼 = 𝛽 = 1, we obtain a uniform distribution.
Run betaPlotDemo from PMTK.

[Figure: Beta densities for $\alpha,\beta < 1$ and for $\alpha = \beta = 1$ (uniform), $x\in[0,1]$.]
Beta Distribution
[Figure: Beta(0.1,0.1), Beta(1,1), Beta(2,3), and Beta(8,4) densities as functions of $x\in[0,1]$.]
Gamma Function
The Gamma function, $\Gamma(\alpha) = \int_0^{\infty} u^{\alpha-1}e^{-u}\,du$, provides the normalizing factor of the Beta density:

$\mathrm{Beta}(x\mid\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
Beta Distribution: Normalization
Showing that the Beta($\alpha,\beta$) distribution is normalized correctly is a bit tricky. We need to prove that:

$\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$

Follow the steps: (a) change the variable $y$ below via $t = x + y$; (b) change the order of integration in the shaded triangular region $\{0\le x\le t\}$; and (c) change $x$ to $\mu$ via $x = t\mu$:

$\Gamma(\alpha)\Gamma(\beta) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx\int_0^{\infty} y^{\beta-1}e^{-y}\,dy = \int_0^{\infty} x^{\alpha-1}\left[\int_x^{\infty}(t-x)^{\beta-1}e^{-t}\,dt\right]dx$

$= \int_0^{\infty}\left[\int_0^{t} x^{\alpha-1}(t-x)^{\beta-1}\,dx\right]e^{-t}\,dt = \int_0^{\infty} t^{\alpha+\beta-1}e^{-t}\,dt\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$

$= \Gamma(\alpha+\beta)\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$
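The normalization can also be confirmed numerically: a midpoint-rule estimate of $\int_0^1 \mu^{\alpha-1}(1-\mu)^{\beta-1}\,d\mu$ should match $\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$. A Python sketch with illustrative $\alpha, \beta$:

```python
import math

a, b = 2.0, 3.0
n = 100000
# midpoint rule for int_0^1 mu^(a-1) * (1-mu)^(b-1) d mu
integral = sum(((i + 0.5) / n)**(a - 1) * (1 - (i + 0.5) / n)**(b - 1) for i in range(n)) / n
beta_ab = math.gamma(a) * math.gamma(b) / math.gamma(a + b)   # = B(a, b)
```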
Gamma Distribution- Rate Parametrization
The Gamma distribution, in its rate parametrization, is frequently used as a model for waiting times:

$\mathrm{Gamma}(x\mid a,b) = \frac{b^{a}}{\Gamma(a)}\,x^{a-1}e^{-bx}, \qquad x > 0,$

where $a > 0$ is the shape and $b > 0$ the rate parameter.
Gamma Distribution
Plots of $\mathrm{Gamma}(x\mid a,b) = \frac{b^{a}}{\Gamma(a)}\,x^{a-1}e^{-bx}$ for $b = 1$ and several values of $a$.

[Figure: Gamma densities for $b = 1$, $x\in[0,7]$.]

Run gammaPlotDemo from PMTK.
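A Python sketch of the rate-parametrized Gamma density; setting $a = 1$ should recover the exponential density $\lambda e^{-\lambda x}$ introduced on a later slide:

```python
import math

def gamma_pdf(x, a, b):
    # Gamma(x|a,b) = b^a / Gamma(a) * x^(a-1) * exp(-b*x), with rate b
    return b**a / math.gamma(a) * x**(a - 1) * math.exp(-b * x)
```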
Gamma Distribution
An empirical PDF of rainfall data fitted with a Gamma
distribution.
[Figure: empirical rainfall pdf with Gamma fits by the method of moments (MoM, left) and maximum likelihood (MLE, right).]
Exponential Distribution
This is defined as
$\mathrm{Expon}(x\mid\lambda) = \mathrm{Gamma}(x\mid 1,\lambda) = \lambda e^{-\lambda x}, \qquad x\ge 0$
Here 𝜆 is the rate parameter.
Chi-Squared Distribution
This is defined as
$\chi_{\nu}^{2}(x) = \mathrm{Gamma}\left(x\,\Big|\,\frac{\nu}{2},\frac{1}{2}\right) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\,x^{\frac{\nu}{2}-1}e^{-x/2}, \qquad x\ge 0$

More precisely, let $Z_i\sim\mathcal{N}(0,1)$ and $S = \sum_{i=1}^{\nu} Z_i^2$; then $S\sim\chi_{\nu}^{2}$.
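Since the chi-squared density is just a Gamma density with $a = \nu/2$, $b = 1/2$, it can be sketched by reusing the Gamma pdf; for $\nu = 2$ it reduces to $\mathrm{Expon}(x\mid 1/2)$, which makes a simple check:

```python
import math

def gamma_pdf(x, a, b):
    # rate-parametrized Gamma density
    return b**a / math.gamma(a) * x**(a - 1) * math.exp(-b * x)

def chi2_pdf(x, nu):
    # chi^2_nu(x) = Gamma(x | nu/2, 1/2)
    return gamma_pdf(x, nu / 2, 0.5)
```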
Inverse Gamma Distribution
This is defined as follows:
If $X\sim\mathrm{Gamma}(a,b)$, then $\frac{1}{X}\sim\mathrm{InvGamma}(a,b)$, where:

$\mathrm{InvGamma}(x\mid a,b) = \frac{b^{a}}{\Gamma(a)}\,x^{-(a+1)}e^{-b/x}, \qquad x > 0$

$a$ is the shape and $b$ the scale parameter.
The Pareto Distribution
Used to model the distribution of quantities that exhibit
long tails (heavy tails)
$\mathrm{Pareto}(x\mid k,m) = k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x\ge m)$

This density asserts that $x$ must be greater than some constant $m$, but not too much greater; $k$ controls what is "too much".
Modeling the frequency of words vs. their rank (e.g. “the”,
“of”, etc.) or the wealth of people.*
As 𝑘 → ∞, the distribution approaches 𝛿(𝑥 − 𝑚).
On a log-log scale, the pdf forms a straight line of the form
log 𝑝(𝑥) = 𝑎 log 𝑥 + 𝑐 for some constants 𝑎 and 𝑐 (power
law, Zipf’s law).
* Basis of the distribution: a high proportion of a population has low income and only few have very high incomes.
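A Python sketch of the Pareto density; the second part verifies the power-law property numerically, since on a log-log scale the slope of $\log p(x)$ against $\log x$ equals $-(k+1)$ (the values of $k$ and $m$ are illustrative):

```python
import math

def pareto_pdf(x, k, m):
    # Pareto(x|k,m) = k * m^k * x^{-(k+1)} * I(x >= m)
    return k * m**k * x**(-(k + 1)) if x >= m else 0.0

# log-log linearity: log p(x) = -(k+1) * log x + log(k * m^k)
k, m = 2.0, 1.0
slope = (math.log(pareto_pdf(4.0, k, m)) - math.log(pareto_pdf(2.0, k, m))) \
        / (math.log(4.0) - math.log(2.0))
```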
The Pareto Distribution
Applications: Modeling the frequency of words vs their
rank, distribution of wealth (𝑘 =Pareto Index), etc.
$\mathrm{Pareto}(x\mid k,m) = k\,m^{k}\,x^{-(k+1)}\,\mathbb{I}(x\ge m)$

Mean: $\frac{km}{k-1}$ (if $k > 1$); Mode: $m$; Var: $\frac{m^{2}k}{(k-1)^{2}(k-2)}$ (if $k > 2$)

[Figure: Pareto densities for $(m,k) = (0.01, 0.10), (0.00, 0.50), (1.00, 1.00)$.]
Covariance
Consider two random variables $X, Y:\Omega\to\mathbb{R}$.

$P(X\in A, Y\in B) = P\left(X^{-1}(A)\cap Y^{-1}(B)\right) = \int_A\int_B p(x,y)\,dx\,dy$

$X$ and $Y$ are independent if and only if $p(x,y) = p(x)\,p(y)$.

$\mathrm{cov}(X,Y) = \mathbb{E}\left[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\right]$
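The sample version of this covariance (dividing by $N$) is a one-liner; in the hypothetical data below, $y = 2x$, so $\mathrm{cov}(X,Y) = 2\,\mathrm{var}(X)$:

```python
def cov(xs, ys):
    # cov(X,Y) = E[(X - E[X]) * (Y - E[Y])], sample version (divide by N)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # y = 2x, perfectly correlated with x
```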
Correlation, Center Normalized Random Variables
Consider two random variables $X, Y:\Omega\to\mathbb{R}$. Define the centered, normalized variables

$\tilde X = \frac{X - \mathbb{E}[X]}{\sqrt{\mathrm{var}[X]}}, \qquad \tilde Y = \frac{Y - \mathbb{E}[Y]}{\sqrt{\mathrm{var}[Y]}},$

so that $\mathbb{E}[\tilde X] = \mathbb{E}[\tilde Y] = 0$ and $\mathrm{var}[\tilde X] = \mathrm{var}[\tilde Y] = 1$.