
Chapter 2

A Quick Review of Basic Probability and Statistics

This course presumes knowledge of Chapters 1 to 3 of Introduction to Probability Models by Sheldon M. Ross. This material is also largely covered in the course text by P. Brémaud.

2.1 Probability: The Basics

Ω : sample space
ω : sample outcome
A ⊆ Ω : event
X : Ω → S : S-valued random variable
P : a probability (distribution / measure) on Ω

A probability has the following properties:


1. 0 ≤ P{A} ≤ 1 for each event A.

2. P{Ω} = 1.

3. For each sequence A_1, A_2, ... of mutually disjoint events,

   P{ ∪_{i=1}^∞ A_i } = Σ_{i=1}^∞ P{A_i}.

2.2 Conditional Probability

The conditional probability of A, given B, written as P{A|B}, is defined to be

   P{A|B} = P{A ∩ B} / P{B}.

It is a probability on the new sample space B; P{A|B} is interpreted as the likelihood / probability that A occurs given knowledge that B has occurred.

Conditional probability is fundamental to stochastic modeling. In particular, in modeling causality in a stochastic setting, a causal connection between B and A means:

   P{A|B} ≠ P{A}.

2.3 Independence

Two events A and B are independent of one another if

   P{A|B} = P{A},

i.e. P{A ∩ B} = P{A}P{B}. Knowledge of B's occurrence has no effect on the likelihood that A will occur.

2.4 Discrete Random Variables

Given a discrete random variable (rv) X which takes on values in S = {x_1, x_2, ...}, its probability mass function is defined by:

   P_X(x_i) = P{X = x_i},   i ≥ 1.

Given a collection X_1, X_2, ..., X_n of S-valued rvs, its joint probability mass function (pmf) is defined as

   P_{(X_1,X_2,...,X_n)}(x_1, x_2, ..., x_n) = P{X_1 = x_1, X_2 = x_2, ..., X_n = x_n}.

The conditional pmf of X given Y = y is then given by

   P_{X|Y}(x|y) = P_{(X,Y)}(x, y) / P_Y(y).

The collection of rvs X_1, X_2, ..., X_n are independent if

   P_{(X_1,X_2,...,X_n)}(x_1, x_2, ..., x_n) = P_{X_1}(x_1) P_{X_2}(x_2) ⋯ P_{X_n}(x_n)

for all (x_1, ..., x_n) ∈ S^n.

2.5 Continuous Random Variables

Given a continuous rv X taking values in ℝ, its probability density function f_X(·) is the function satisfying:

   P{X ≤ x} = ∫_{−∞}^{x} f_X(t) dt.

We interpret f_X(x) as the likelihood that X takes on a value x. However, we need to exercise care in that interpretation. Note that

   P{X = x} = ∫_{x}^{x} f_X(t) dt = 0,

so the probability that X takes on precisely the value x (to infinite precision) is zero. The likelihood interpretation comes from the fact that

   P{X ∈ [a − ε, a + ε]} / P{X ∈ [b − ε, b + ε]} = ∫_{a−ε}^{a+ε} f_X(t) dt / ∫_{b−ε}^{b+ε} f_X(t) dt → f_X(a) / f_X(b)   as ε ↓ 0,

so that f_X(a) does indeed measure the relative likelihood that X takes on a value a (as opposed, say, to b).

Given a collection X_1, X_2, ..., X_n of real-valued continuous rvs, its joint probability density function (pdf) is defined as the function f_{(X_1,X_2,...,X_n)}(·) satisfying

   P{X_1 ≤ x_1, ..., X_n ≤ x_n} = ∫_{−∞}^{x_1} ⋯ ∫_{−∞}^{x_n} f_{(X_1,X_2,...,X_n)}(t_1, t_2, ..., t_n) dt_1 ⋯ dt_n.

Again, f_{(X_1,...,X_n)}(x_1, ..., x_n) can be given a likelihood interpretation. The collection X_1, X_2, ..., X_n is independent if

   f_{(X_1,X_2,...,X_n)}(x_1, x_2, ..., x_n) = f_{X_1}(x_1) ⋯ f_{X_n}(x_n)

for all (x_1, ..., x_n) ∈ ℝ^n.

Finally, the conditional pdf of X given Y = y is given by

   f_{X|Y}(x|y) = f_{(X,Y)}(x, y) / f_Y(y).

2.6 Sums of Random Variables

Many applications will require that we compute the distribution of a sum S_n = X_1 + X_2 + ... + X_n, where the X_i's are jointly distributed real-valued rvs. If the X_i's are continuous, then

   f_{S_n}(z) = ∫_{−∞}^{∞} f_{X_n|S_{n−1}}(z − y | y) f_{S_{n−1}}(y) dy.

If the X_i's are independent rvs,

   f_{S_n}(z) = ∫_{−∞}^{∞} f_{X_n}(z − y) f_{S_{n−1}}(y) dy.

This type of integral is known, in applied mathematics, as a convolution integral. So, f_{S_n}(·) can be computed recursively (in the independent setting) via n − 1 convolution integrals.

A corresponding result holds in the discrete setting (with integrals replaced by sums).
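As an illustration of the discrete analogue, the recursion can be carried out numerically by repeatedly convolving pmf vectors. The sketch below is minimal and assumes NumPy is available; the three-point pmf is purely hypothetical.

    import numpy as np

    # Hypothetical pmf of each X_i on the values 0, 1, 2.
    p = np.array([0.2, 0.5, 0.3])

    pmf_S = p.copy()
    for _ in range(2):                 # two further convolutions give S_3 = X_1 + X_2 + X_3
        pmf_S = np.convolve(pmf_S, p)  # discrete convolution of pmf vectors

    print(pmf_S)                       # P{S_3 = k} for k = 0, 1, ..., 6
    print(pmf_S.sum())                 # sanity check: sums to 1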

2.7 Expectations

If X is a discrete real-valued rv, its expectation is defined as

   E[X] = Σ_x x P_X(x)

(assuming the sum exists); if X is a continuous rv, its expectation is just

   E[X] = ∫_{−∞}^{∞} x f_X(x) dx

(assuming the integral exists).


Suppose that we wish to compute the expectation of Y = g(X_1, ..., X_n), where (X_1, ..., X_n) is a jointly distributed collection of continuous rvs. The above definition requires that we first compute the pdf of Y and then calculate E[Y] via the integral

   E[Y] = ∫_{−∞}^{∞} y f_Y(y) dy.

Fortunately, there is an alternative approach to computing E[Y] that is often easier to implement.

Result 2.1: In the above setting, E[Y] can be computed as:

   E[Y] = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} g(x_1, ..., x_n) f_{(X_1,...,X_n)}(x_1, ..., x_n) dx_1 ⋯ dx_n.

Similarly, in the discrete setting, if Y = g(X_1, ..., X_n), E[Y] can be computed as

   E[Y] = Σ_{x_1 ∈ S} ⋯ Σ_{x_n ∈ S} g(x_1, ..., x_n) P_{(X_1,...,X_n)}(x_1, ..., x_n).

Remark 2.1: In older editions of his book, Sheldon Ross referred to Result 2.1 as the Law of the Unconscious Statistician!
Example 2.1: Suppose X is a uniformly distributed rv on [0, 1], so that

   f_X(x) = 1 for 0 ≤ x ≤ 1, and 0 otherwise.

Let Y = X².

Approach 1 to computing E[Y]: Note that P{Y ≤ y} = P{X² ≤ y} = P{X ≤ √y} = √y for 0 ≤ y ≤ 1. So,

   f_Y(y) = (d/dy) y^{1/2} = (1/2) y^{−1/2}.

Hence,

   E[Y] = ∫_0^1 y f_Y(y) dy = (1/2) ∫_0^1 y^{1/2} dy = (1/2)(2/3) y^{3/2} |_0^1 = 1/3.

Approach 2 to computing E[Y]:

   E[Y] = ∫_0^1 g(x) f_X(x) dx = ∫_0^1 x² dx = 1/3.
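Either calculation is easy to confirm by simulation. A minimal Monte Carlo sketch (assuming NumPy is available):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=1_000_000)  # X ~ Unif(0, 1)
    print((x ** 2).mean())                     # close to E[X^2] = 1/3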

The expectation of a rv is interpreted as a measure of the rv's central tendency. It is one of several summary statistics that are widely used in communicating the essential features of a probability distribution.

2.8 Commonly Used Summary Statistics

Given a rv X, the following are the most commonly used summary statistics.

1. Mean of X: The mean of X is just its expectation E[X]. We will see later, in our discussion of the law of large numbers, why E[X] is a key characteristic of X's distribution.
2. Variance of X:

   var(X) = E[(X − E[X])²]

   This is a measure of X's variability.

3. Standard Deviation of X:

   σ(X) = √var(X)

   This is a measure of variability that scales appropriately under a change in the units used to measure X (e.g. if X is a length, changing units from feet to inches multiplies the variance by 144, but the standard deviation by 12).

4. Squared Coefficient of Variation:

   c²(X) = var(X) / E[X]²

   This is a dimensionless measure of variability that is widely used when characterizing the variation that is present in a non-negative rv X (e.g. task durations, component lifetimes, etc).

5. Median of X: This is the value m having the property that

   P{X ≤ m} = 1/2 = P{X ≥ m}

   (and is uniquely defined when P{X ≤ ·} is continuous and strictly increasing). It is a measure of the central tendency of X that complements the mean. Its advantage, relative to the mean, is that it is less sensitive to outliers (i.e. observations that are in the tails of X that have a big influence on the mean, but very little influence on the median).
6. pth quantile of X: The pth quantile of X is that value q having the property that

   P{X ≤ q} = F_X(q) = p,

   i.e. q = F_X^{−1}(p).

7. Inter-quartile range: This is the quantity

   F_X^{−1}(3/4) − F_X^{−1}(1/4);

   it is a measure of variability that, like the median, is (much) less sensitive to outliers than is the standard deviation.
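All of the above summary statistics are straightforward to estimate from data. The following sketch assumes NumPy is available; the simulated exponential sample is purely illustrative of a non-negative rv.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=100_000)   # illustrative non-negative data

    mean = x.mean()
    var = x.var()
    sd = x.std()                                   # standard deviation
    c2 = var / mean ** 2                           # squared coefficient of variation
    median = np.median(x)
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1                                  # inter-quartile range

    print(mean, var, sd, c2, median, iqr)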

2.9 Conditional Expectation

The conditional expectation of X, given Y = y, is just the quantity

   E[X|Y = y] = Σ_x x p_{X|Y}(x|y)

when X is discrete, and

   E[X|Y = y] = ∫_{−∞}^{∞} x f_{X|Y}(x|y) dx

when X is continuous. We can similarly define E[X|Y_1 = y_1, ..., Y_n = y_n] = E[X | Y⃗ = y⃗] (where Y⃗ = (Y_1, ..., Y_n)^T and y⃗ = (y_1, ..., y_n)^T). We sometimes denote E[X|Y = y] as E_y[X].

Note that expectations can be computed by conditioning:

   E[X] = ∫_{−∞}^{∞} E[X|Y = y] f_Y(y) dy

(if Y is continuous), and

   E[X] = Σ_y E[X|Y = y] p_Y(y)

(if Y is discrete). These equations can be rewritten more compactly as

   E[X] = E[E[X|Y]].

This identity can be generalized. If Y⃗_m = (Y_1, ..., Y_m)^T and Y⃗_n = (Y_1, ..., Y_n)^T, then

   E[X | Y⃗_m] = E[ E[X | Y⃗_n] | Y⃗_m ]

if n ≥ m. This is often referred to as the tower property of conditional expectation.

2.10 Important Discrete Random Variables

1. Bernoulli(p) rv: X ~ Ber(p) if X ∈ {0, 1}, and

   P{X = 1} = p = 1 − P{X = 0}.

   Applications: Coin tosses, defective / non-defective items, etc.

   Statistics:   E[X] = p,   var(X) = p(1 − p)

2. Binomial(n, p) rv: X ~ Bin(n, p) if X ∈ {0, 1, ..., n} and

   P{X = k} = C(n, k) p^k (1 − p)^{n−k},

   where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient.

   Applications: Number of heads in n coin tosses; number of defectives in a product shipment of size n.

   Statistics:   E[X] = np,   var(X) = np(1 − p)

3. Geometric(p) rv: X ~ Geom(p) if X ∈ {0, 1, ...} and

   P{X = k} = p(1 − p)^k,   k ≥ 0.

   Applications: Number of coin tosses before the first head, etc.

   Statistics:   E[X] = (1 − p)/p,   var(X) = (1 − p)/p²

   A closely related variant, also called a geometric rv, arises when X ∈ {1, 2, ...} and

   P{X = k} = p(1 − p)^{k−1},   k ≥ 1.

   Here the statistics are:   E[X] = 1/p,   var(X) = (1 − p)/p²

   This time, it is the number of tosses required to observe the first head.

4. Poisson(λ) rv: X ~ Poisson(λ) if X ∈ {0, 1, 2, ...} and

   P{X = k} = e^{−λ} λ^k / k!,   k ≥ 0.

   Applications: Number of defective pixels on a high-definition TV screen, etc.

   Statistics:   E[X] = λ,   var(X) = λ

   The Poisson rv arises as an approximation to a binomial rv when n is large and p is small. For example, if there are n pixels on a screen and the probability a given pixel is defective is p, then the total number of defectives on the screen is Bin(n, p). In this setting, n is large and p is small. The binomial probabilities are cumbersome to work with when n is large because of the binomial coefficients that appear. As a result, we seek a suitable approximation. We propose the approximation

   Bin(n, p) ≈_D Poisson(np)

   when n is large and p is small (where ≈_D denotes "has approximately the same distribution as"). This approximation is supported by the following theorem.

Theorem 2.1. P{Bin(n, p) = k} → P{Poisson(λ) = k} as n → ∞, provided np → λ as n → ∞.

Outline of proof: We will prove this for k = 0; the general case is similar. Note that

   P{Bin(n, p) = 0} = (1 − p)^n = (1 − λ/n + o(1/n))^n → e^{−λ}

as n → ∞ (where o(a_n) represents a sequence having the property that o(a_n)/a_n → 0 as n → ∞).
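The quality of the approximation is easy to check numerically. The sketch below assumes SciPy is available; the values of n and p are hypothetical numbers in the "many pixels, rarely defective" regime.

    from scipy.stats import binom, poisson

    n, p = 10_000, 0.0003        # hypothetical: many pixels, each rarely defective
    lam = n * p

    for k in range(6):
        # the two columns agree closely when n is large and p is small
        print(k, binom.pmf(k, n, p), poisson.pmf(k, lam))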

2.11 Important Continuous Random Variables

1. Uniform(a, b) rv: X ~ Unif(a, b), a < b, if

   f_X(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.

   Applications: Arises in random number generation, etc.

   Statistics:   E[X] = (a + b)/2,   var(X) = (b − a)²/12

2. Beta(α, β) rv: X ~ Beta(α, β), α, β > 0, if

   f_X(x) = x^{α−1} (1 − x)^{β−1} / B(α, β) for 0 ≤ x ≤ 1, and 0 otherwise,

   where B(α, β) is the normalization factor chosen to ensure that f_X(·) integrates to one, i.e.

   B(α, β) = ∫_0^1 y^{α−1} (1 − y)^{β−1} dy.

   Applications: The Beta distribution is a commonly used prior on the Bernoulli parameter p.

   Exercise 2.1: Compute the mean and variance of a Beta(α, β) rv in terms of the function B(·, ·).
3. Exponential(λ) rv: X ~ Exp(λ), λ > 0, if

   f_X(x) = λ e^{−λx} for x ≥ 0, and 0 otherwise.

   Applications: Component lifetime, task duration, etc.

   Statistics:   E[X] = 1/λ,   var(X) = 1/λ²

4. Gamma(α, λ) rv: X ~ Gamma(α, λ), α, λ > 0, if

   f_X(x) = λ (λx)^{α−1} e^{−λx} / Γ(α) for x ≥ 0, and 0 otherwise,

   where

   Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy

   is the gamma function.

   Applications: Component lifetime, task duration, etc.

   Statistics:   E[X] = α/λ,   var(X) = α/λ²


5. Gaussian / Normal rv: X ~ N(μ, σ²), μ ∈ ℝ, σ² > 0, if

   f_X(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}.

   Applications: Arises all over probability and statistics (as a result of the central limit theorem).

   Statistics:   E[X] = μ,   var(X) = σ²

   Note that N(μ, σ²) =_D μ + σ N(0, 1), where =_D denotes equality in distribution. (In other words, if one takes a N(0, 1) rv, scales it by σ and adds μ on to it, we end up with a N(μ, σ²) rv.)
6. Weibull(α, λ) rv: X ~ Weibull(α, λ), α, λ > 0, if

   P{X > x} = e^{−(λx)^α}

   for x ≥ 0. Hence:

   f_X(x) = α λ^α x^{α−1} e^{−(λx)^α} for x ≥ 0, and 0 otherwise.

   Applications: Component lifetime, task duration, etc.

   Statistics:   E[X] = Γ(1 + 1/α)/λ,   var(X) = [Γ(1 + 2/α) − Γ(1 + 1/α)²]/λ²
7. Pareto(α, λ) rv: X ~ Pareto(α, λ), α, λ > 0, if

   f_X(x) = α λ / (1 + λx)^{α+1} for x ≥ 0, and 0 otherwise.

   The Pareto distribution has a tail that decays to zero as a power of x (rather than exponentially rapidly (or faster) in x). As a result, a Pareto rv is said to be a heavy-tailed rv.

   Applications: Component lifetime, task duration, etc.


2.12 Some Quick Illustrations of Basic Probability

Example 2.2: (Capture / Recapture Sampling)

We wish to estimate the number N of fish that are present in a lake. We start by visiting the lake, catching k fish, tagging each of the k fish and releasing them. (This is the "capture" phase.) A month later, we revisit the lake, catch n fish (the "recapture" phase), and count the number X of tagged fish that are present in the sample. How do we estimate the number N of fish in the lake?

Solution: Note that X ~ Bin(n, p), where p is the probability that a fish is tagged. After a month, we assume that the tagged fish are well mixed with the total population of size N, so p = k/N. Given that E[X] = np, this suggests equating X to nk/N. In other words,

   N ≈ nk / X.

Example 2.3: (Poker)

What is the probability that a five-card hand contains k hearts? (k = 0, 1, ..., 5)

Solution: Note:

- There are C(13, k) ways to choose k hearts from the 13 hearts present in the deck.

- There are C(39, 5 − k) ways to choose the remaining 5 − k cards from the 39 non-hearts present in the deck.

- There are C(52, 5) ways to choose 5 cards from a deck of 52 cards.

So,

   P{k hearts in a hand of 5} = C(13, k) C(39, 5 − k) / C(52, 5).
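A short numerical check of these probabilities, as a sketch using only the Python standard library:

    from math import comb

    def prob_k_hearts(k: int) -> float:
        # C(13, k) * C(39, 5 - k) / C(52, 5)
        return comb(13, k) * comb(39, 5 - k) / comb(52, 5)

    for k in range(6):
        print(k, prob_k_hearts(k))

    print(sum(prob_k_hearts(k) for k in range(6)))  # sanity check: equals 1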

Example 2.4: (Let's Make a Deal)

A prize is behind one of three doors. Goats are behind the other 2 doors. We choose Door 1. If we choose correctly, we get the prize. The host opens one of Door 2 or Door 3 to expose a goat. Should we change our choice of door from Door 1 to the remaining unopened door (selected from either Door 2 or Door 3)?

Solution: Let Y be the door that the host exposes. Assume that the host knows what is behind each door, and never exposes the door behind which is the prize. Let P be the rv corresponding to the door the prize is behind, so

   P{the prize is behind door k} = P{P = k}.

Then

   P{P = 1 | Y = 2} = P{P = 1, Y = 2} / P{Y = 2}.

But,

   P{P = 1, Y = 2} = P{P = 1} P{Y = 2 | P = 1} = (1/3)(1/2),

where the 1/2 presumes that if the prize is behind the door we initially select, then the host chooses one of the two doors with goats behind them to open at random. On the other hand,

   P{Y = 2} = P{P = 1} P{Y = 2 | P = 1} + P{P = 2} P{Y = 2 | P = 2} + P{P = 3} P{Y = 2 | P = 3}
            = (1/3)(1/2) + (1/3)(0) + (1/3)(1)
            = 1/2.

So,

   P{P = 1 | Y = 2} = 1/3.

Similarly,

   P{P = 2 | Y = 2} = 0,   P{P = 3 | Y = 2} = 2/3,

so we should indeed change our choice of door in response to the information the host reveals.

Remark 2.2: For a more detailed discussion of the Monty Hall / Let's Make a Deal problem, see Appendix B.
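The 1/3 versus 2/3 answer can also be checked by simulation. The following sketch uses plain Python (the 100,000-trial count is arbitrary) and plays the game under both strategies with a host who always opens a goat door.

    import random

    def play(switch: bool, trials: int = 100_000) -> float:
        wins = 0
        for _ in range(trials):
            prize = random.randint(1, 3)
            choice = 1                                  # we always pick Door 1 first
            if prize == 1:
                opened = random.choice([2, 3])          # host picks a goat door at random
            else:
                opened = next(d for d in (2, 3) if d != prize)
            if switch:
                choice = next(d for d in (1, 2, 3) if d not in (choice, opened))
            wins += (choice == prize)
        return wins / trials

    print(play(switch=False))  # approximately 1/3
    print(play(switch=True))   # approximately 2/3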
Example 2.5: How should we model the idea that as a component ages, it becomes less reliable?

Solution: Let T be a continuous rv corresponding to the component lifetime. For h > 0 fixed, consider

   P{T ∈ [t, t + h] | T > t} = P{T ∈ [t, t + h], T > t} / P{T > t} = P{T ∈ [t, t + h]} / P{T > t}.

This conditional probability is the likelihood the component fails in the next h time units given that it has survived to time t. Reduction in reliability as the component ages amounts to asserting that P{T ∈ [t, t + h] | T > t} should be an increasing function of t. Note that when h is small,

   P{T ∈ [t, t + h] | T > t} ≈ h f(t) / F̄(t),

where f is the density of T and F̄(t) = 1 − F(t) = P{T > t}. Accordingly, r(t) = f(t)/F̄(t) is called the failure rate (at time t) of T. Modeling reduction in reliability as the component ages amounts to requiring that r(t) should be increasing in t. In other words, T has an increasing failure rate function.

New components often exhibit a "burn-in" phase where they are subject to immediate (or rapid) failure because of the presence of manufacturing defects. Once a component survives through the burn-in phase, its reliability improves. Such components have a decreasing failure rate function (at least through the end of the burn-in phase).

Most manufactured components have a failure rate that is "bathtub shaped"; see Figure 2.1. Over the operational interval [t_1, t_2], the failure rate is essentially constant. This makes identifying the constant failure rate distribution interesting (since the failure rate of a component is often constant over the great majority of its design lifetime).
[Figure 2.1: Sample Failure Rate Function — the failure rate r(t) plotted against t, with the operational interval [t_1, t_2] marked.]

Suppose

   r(t) = λ.


Then,

   f(t) / F̄(t) = λ,

so that

   −(d/dt) F̄(t) / F̄(t) = λ.

We conclude that

   (d/dt) log F̄(t) = −λ,

so that

   log F̄(t) = log F̄(0) − λt.

Since T is positive, F̄(0) = 1 and hence F̄(t) = e^{−λt}. In other words, T ~ Exp(λ). So, exponential rvs are the unique rvs having a constant failure rate.

If T ~ Weibull(α, λ), then log F̄(t) = −(λt)^α. So,

   r(t) = (d/dt) (λt)^α = α λ^α t^{α−1};

so if α < 1, T has a decreasing failure rate, while if α > 1, T has an increasing failure rate. When α = 1, T has a constant failure rate and is exponentially distributed.

2.13 Statistical Parameter Estimation: The Method of Maximum Likelihood

In building stochastic models, it is often the case that observational data exists that can be used to help guide the construction of an appropriate model. In particular, the existing data can be used to help estimate the model parameters. Statisticians call the process of fitting the parameters of a model to data the parameter estimation problem (estimation, for short).

To provide a concrete example, consider the problem of building a stochastic model to represent the number of defective pixels on a high-definition television screen. We argued earlier, in Section 2.10 of this chapter, that a good model for the number X of defective pixels on such a screen is to assume that it follows a Poisson distribution with parameter λ. We now wish to estimate λ.

We select five such screens and count the number of defective pixels on each of the five screens, leading to counts of 0, 3, 4, 2 and 5, respectively. We view the five observations as a random sample from the distribution of X, by which we mean that the five observations are the realized values of five iid (independent and identically distributed) rvs X_1, X_2, ..., X_5 having a common Poisson(λ*) distribution.

Maximum likelihood is generally viewed as the "gold standard" method for estimating statistical parameters. We will discuss later the theoretical basis for why maximum likelihood is a preferred approach to estimating parameters. The method of maximum likelihood asserts that one should:

   Estimate the parameter λ* as that value λ̂ that maximizes the likelihood of observing the given sample.

In this case, the likelihood of observing 0, 3, 4, 2 and 5 under a Poisson(λ) model is:

   L(λ) = (e^{−λ} λ^0 / 0!) (e^{−λ} λ^3 / 3!) (e^{−λ} λ^4 / 4!) (e^{−λ} λ^2 / 2!) (e^{−λ} λ^5 / 5!) = e^{−5λ} λ^{14} / (0! 3! 4! 2! 5!).

The maximizer λ̂ of L(λ) is equal to the maximizer of the log-likelihood ℓ(λ) = log L(λ), namely

   ℓ(λ) = −5λ + 14 log λ − log(0! 3! 4! 2! 5!).

The maximizer λ̂ satisfies:

   (d/dλ) ℓ(λ)|_{λ=λ̂} = −5 + 14/λ̂ = 0,

i.e. λ̂ = 14/5.
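The calculation can be verified numerically. The sketch below (assuming NumPy and SciPy are available) maximizes the Poisson log-likelihood for the five observed counts directly and recovers the sample mean 14/5 = 2.8.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import gammaln

    counts = np.array([0, 3, 4, 2, 5])
    print(counts.mean())                   # analytic MLE: the sample mean, 2.8

    def neg_log_lik(lam):
        # negative Poisson log-likelihood; gammaln(k + 1) = log k!
        return -(np.sum(counts * np.log(lam) - lam) - np.sum(gammaln(counts + 1)))

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20.0), method="bounded")
    print(res.x)                           # numerical maximizer, also about 2.8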

2.13.1 MLE for a Poisson(λ) Random Variable

More generally, if X_1, X_2, ..., X_n is a random sample from a Poisson(λ*) distribution, then the likelihood is:

   L_n(λ) = ∏_{i=1}^n e^{−λ} λ^{x_i} / x_i!,

having maximizer

   λ̂_n = (x_1 + x_2 + ... + x_n) / n.

In other words, λ̂_n is just the (arithmetic) mean of the sample. This so-called sample mean is usually denoted as X̄_n. We next work out the maximum likelihood estimator (MLE) for normally distributed and gamma distributed rvs.

2.13.2 MLE for a N(μ, σ²) Random Variable

Here we work out the MLE for normally distributed (Gaussian) rvs. Suppose that we observe a random sample X_1, X_2, ..., X_n of iid observations from a N(μ*, σ*²) distribution. The corresponding likelihood is:

   L(μ, σ²) = ∏_{i=1}^n (2πσ²)^{−1/2} e^{−(x_i − μ)²/(2σ²)} = (2πσ²)^{−n/2} e^{−Σ_{i=1}^n (x_i − μ)²/(2σ²)}

and the log-likelihood is

   ℓ_n(μ, σ²) = −(n/2) log σ² − Σ_{i=1}^n (x_i − μ)²/(2σ²) − (n/2) log(2π).

The MLE for (μ*, σ*²) is the value (μ̂_n, σ̂_n²) satisfying

   (∂/∂μ) ℓ_n(μ̂_n, σ̂_n²) = Σ_{i=1}^n (x_i − μ̂_n)/σ̂_n² = 0

and

   (∂/∂σ²) ℓ_n(μ̂_n, σ̂_n²) = −n/(2σ̂_n²) + Σ_{i=1}^n (x_i − μ̂_n)²/(2σ̂_n⁴) = 0.

This yields:

   μ̂_n = (1/n) Σ_{i=1}^n x_i = X̄_n

and

   σ̂_n² = (1/n) Σ_{i=1}^n (x_i − μ̂_n)².

Remark: It turns out that the estimators that are most frequently used by statisticians to estimate the parameters (μ*, σ*²) for Gaussian models are the following. Estimate μ* via X̄_n and estimate σ*² via

   s_n² = (1/(n − 1)) Σ_{i=1}^n (x_i − μ̂_n)² = (n/(n − 1)) σ̂_n².

The estimator s_n² is what statisticians call the sample variance. Note that when n is reasonably large, s_n² and σ̂_n² are almost identical. But for small samples, s_n² and σ̂_n² differ. Statisticians generally prefer s_n² to σ̂_n² because s_n² is undefined when n = 1 (as it should be) and s_n² is unbiased as an estimator of σ*², by which we mean that

   E[s_n²] = σ*²

for n ≥ 2.

Exercise 2.2: Prove that s_n² is an unbiased estimator for σ*² when n ≥ 2.

2.13.3 MLE for a Gamma(α, λ) Random Variable

Suppose that we observe a random sample X_1, X_2, ..., X_n of iid observations from a Gamma(α*, λ*) population. The corresponding likelihood is

   L_n(α, λ) = ∏_{i=1}^n λ (λx_i)^{α−1} e^{−λx_i} / Γ(α)

and the log-likelihood is

   ℓ_n(α, λ) = nα log λ + (α − 1) Σ_{i=1}^n log x_i − λ Σ_{i=1}^n x_i − n log Γ(α).

For this example, there is no closed form for the maximizer (α̂_n, λ̂_n) of ℓ_n(·); the MLE (α̂_n, λ̂_n) must be computed numerically. This example illustrates a key point about MLEs. While they are the statistical "gold standard", they are often notoriously difficult to compute (even in the presence of powerful computers).
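"Computed numerically" in practice means handing the negative log-likelihood to an optimizer. The sketch below assumes NumPy and SciPy are available; the simulated data and the starting point are hypothetical, and the log-likelihood is the one displayed above.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln

    rng = np.random.default_rng(2)
    x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=5_000)   # simulated: alpha* = 3, lambda* = 2

    def neg_log_lik(theta):
        alpha, lam = theta
        if alpha <= 0 or lam <= 0:
            return np.inf
        # minus the gamma log-likelihood displayed above
        return -(len(x) * (alpha * np.log(lam) - gammaln(alpha))
                 + (alpha - 1) * np.log(x).sum() - lam * x.sum())

    res = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
    print(res.x)   # close to (3, 2)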

Exercise 2.3: Compute the MLE for a random sample from a Weibull(α*, λ*) population.
Exercise 2.4: Compute the MLE for a random sample from a Unif(a*, b*) population.
Exercise 2.5: Compute the MLE for a random sample from a Bin(n, p*) population (where n is known).
Exercise 2.6: Compute the MLE for a random sample from a Beta(α*, β*) population.

2.13.4 MLE as a Gold Standard

Let us now return to the question of why maximum likelihood is the gold standard estimation method. Suppose that we have a random sample from a normally distributed population in which μ is unknown, but the variance σ² is known to equal one. Recall that for a normal distribution, μ characterizes both the mean and the median. This suggests estimating μ via either the estimator X̄_n (the sample mean) or m_n, the sample median. (The sample median is the (k + 1)th largest observation when n = 2k + 1, and the median is defined as the arithmetic average of the kth and (k + 1)th largest observations when n = 2k.) Since the sample is random, the estimators X̄_n and m_n are themselves random variables. The hope is that when the sample size n is large, X̄_n and m_n will be close to μ. The preferred estimator is clearly the one that has a tendency to be closer to μ.

One way to mathematically characterize this preference is to study the rate of convergence of the estimator to μ. We will see in Chapter 2 that both X̄_n and m_n obey central limit theorems that assert that X̄_n and m_n are, for large n, asymptotically normally distributed with common mean μ and variances σ_1²/n and σ_2²/n, respectively. Our preference should obviously be for the estimator with the smaller value of σ_i². The estimator X̄_n is the MLE in this Gaussian setting. As the gold standard estimator, it will come as no surprise that σ_1² is always less than or equal to σ_2². So X̄_n is to be preferred to m_n as an estimator of the parameter μ in a N(μ, 1) population. It is the fact that the MLE has the fastest possible rate of convergence among all possible estimators of an unknown statistical parameter that has led to its adoption as the gold standard estimator. (For those of you familiar with statistics, the MLE achieves (in great generality) the Cramér-Rao lower bound that describes a theoretical lower bound on the variance of an (unbiased) estimator of an unknown statistical parameter.) As a consequence, it is typical that, in approaching parameter estimation problems, the first order of business is to study the associated MLE. If computation of the MLE is analytically or numerically tractable, then one would generally adopt the MLE as one's preferred estimator.

2.14 The Method of Moments

An alternative approach to estimating model parameters is the method of moments. Let us illustrate this idea in the setting of a gamma distributed random sample.

Given a random sample X_1, X_2, ..., X_n of iid observations from a Gamma(α*, λ*) population, recall that:

   E[X] = α/λ   and   var(X) = α/λ².

One expects that if the sample size n is large, then the sample mean X̄_n and sample variance s_n² will be close to E[X] and var(X), respectively. (This will follow from the Law of Large Numbers, to be discussed in Chapter 2.) This suggests that

   X̄_n ≈ α/λ   and   s_n² ≈ α/λ².

The method of moments estimators α̂ and λ̂ for α* and λ* are obtained by replacing the approximations with the equalities:

   X̄_n = α̂/λ̂   and   s_n² = α̂/λ̂².

This leads to the estimators

   α̂ = X̄_n² / s_n²   and   λ̂ = X̄_n / s_n².

Note that the method of moments estimators α̂ and λ̂ can be computed in analytical closed form, unlike the MLE (which in this gamma setting must be computed numerically). Hence, the method of moments estimators are (at least in this example) more tractable. On the other hand, they are inefficient statistically, because they do not achieve the Cramér-Rao lower bound. So, method of moments estimators do not typically fully exploit all the statistical information that is present in a sample (unlike MLEs). The advantage of the method of moments is that it can offer a computationally tractable alternative to parameter estimation in settings where maximum likelihood is too difficult to implement numerically.
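For the gamma example, the method of moments is a two-line computation. A minimal sketch (assuming NumPy; the simulated sample is illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.gamma(shape=3.0, scale=0.5, size=5_000)  # simulated: alpha* = 3, lambda* = 2

    xbar = x.mean()
    s2 = x.var(ddof=1)            # sample variance

    alpha_hat = xbar ** 2 / s2    # method of moments estimators derived above
    lambda_hat = xbar / s2

    print(alpha_hat, lambda_hat)  # close to (3, 2)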
In general, if the rv X has a distribution that depends on d unknown statistical parameters θ_1, θ_2, ..., θ_d, one writes down expressions for the first d moments of the rv X (i.e. E[X^k] for k = 1, ..., d) in terms of the d parameters, leading to the equations

   E[X^k] = f_k(θ_1, θ_2, ..., θ_d),   k = 1, ..., d.

The method of moments estimators θ̂_1, ..., θ̂_d are obtained by equating the population moments to the sample moments, namely as the solution to the simultaneous equations

   (1/n) Σ_{i=1}^n X_i^k = f_k(θ̂_1, ..., θ̂_d),   k = 1, ..., d.


Exercise 2.7: Compute the method of moments estimators for a N(μ*, σ*²) population.
Exercise 2.8: Compute the method of moments estimators for a Unif(a*, b*) population.
Exercise 2.9: Compute the method of moments estimators for a Beta(α*, β*) population.
Exercise 2.10: Compute the method of moments estimators for a Weibull(α*, λ*) population.

2.15 Bayesian Statistics

Consider a case where we are attempting to estimate a Bernoulli parameter p* corresponding to the probability that a given manufactured item is defective. With a good manufacturing process in place, p* should be small.

In this case, if we test n items, it is likely that all n items are non-defective. In other words, the random sample X_1, X_2, ..., X_n from such a Bernoulli population is likely to be one in which X_i = 0 for 1 ≤ i ≤ n. The maximum likelihood estimator (and method of moments estimator) p̂_n for p* is given by

   p̂_n = X̄_n = 0.

Given the experimental data observed, this is perhaps a reasonable estimate for p*. But it is unlikely that a company would base any of its operational decisions on such an estimate of p*. Nobody truly believes that they have a flawless manufacturing process. One has a prior belief that p* is positive. Bayesian statistical methods offer a means of taking advantage of such prior information.

In a Bayesian approach to statistical analysis, one would view p* as itself being a random variable. The distribution of p* (the so-called prior distribution on p*) reflects the statistician's beliefs about the likely values of p* in the absence of any experimental data. In our Bernoulli example, one possible prior would be the uniform distribution on [0, 1]. Having postulated a prior, we now observe a random sample X_1, X_2, ..., X_n. Conditional on p* = p, the likelihood of the sample is just

   L_n(p) = ∏_{i=1}^n p^{X_i} (1 − p)^{1−X_i}.

We now wish to compute a new distribution on p* that reflects the influence of the observed sample on the prior. We do this by taking advantage of the basic ideas of conditional probability. In this statistical setting, this application of conditional probability is often called Bayes' rule. In particular, the posterior distribution is just the distribution of p*, given X_1, ..., X_n. This translates into

   f(p | X_1, ..., X_n) = p^{S_n} (1 − p)^{n−S_n} / ∫_0^1 r^{S_n} (1 − r)^{n−S_n} dr,

where S_n = X_1 + ... + X_n. In particular, if X_1 = ⋯ = X_n = 0, we find that

   f(p | X_1 = 0, ..., X_n = 0) = (n + 1)(1 − p)^n.

The mean of the posterior distribution is

   ∫_0^1 p f(p | X_1 = 0, ..., X_n = 0) dp = 1/(n + 2).

(Note that when n = 0, the mean is 1/2, which coincides with the mean of the uniform prior.) Thus, the Bayesian approach here leads to an analysis that seems more consistent with usage of statistics in an operational decision-making environment.
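The posterior computation above can be reproduced on a grid. The sketch below assumes NumPy; n = 10 is a hypothetical sample size with all items non-defective. It normalizes likelihood times the uniform prior numerically and checks that the posterior mean is close to 1/(n + 2).

    import numpy as np

    n = 10                                        # hypothetical: all n items non-defective
    p_grid = np.linspace(0.0, 1.0, 10_001)
    dp = p_grid[1] - p_grid[0]

    posterior = (1.0 - p_grid) ** n               # likelihood times the uniform prior
    posterior /= posterior.sum() * dp             # numerical normalization: (n + 1)(1 - p)^n

    post_mean = (p_grid * posterior).sum() * dp
    print(post_mean, 1.0 / (n + 2))               # the two numbers agree closely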
Such a Bayesian approach to statistical analysis can be applied in any setting in which the underlying data is assumed to follow a parametric distribution. In particular, suppose that the random sample X_1, ..., X_n is a collection of observations from a population having a density function f(·; θ), where θ* is the true value of the unknown parameter θ. Suppose p(·) is a density corresponding to a prior distribution on θ. Bayes' rule dictates that the posterior distribution on θ equals

   f(θ | X_1, ..., X_n) = p(θ) ∏_{i=1}^n f(X_i; θ) / ∫_Θ p(θ′) ∏_{i=1}^n f(X_i; θ′) dθ′,

where Θ is the set of all possible values for the parameter θ.



Exercise 2.11: Suppose that we observe a random sample from a N(μ*, σ²) population.

1. Compute the posterior distribution on μ* when μ* has the prior that is N(r, 1) distributed.
2. Repeat 1 for a general prior.

Exercise 2.12: Suppose that we observe a random sample from a Ber(p*) population.

1. Compute the posterior distribution on p* when p* has a prior that is Beta(α, β) distributed.
2. Repeat 1 for a general prior.

The priors that are postulated in part 1 of the above problems are called conjugate priors for the normal and Bernoulli distributions, respectively. Note that use of a conjugate prior simplifies computation of the posterior.

2.16 The Law of Large Numbers

One of the two most important results in probability is the law of large numbers (LLN).

Theorem 2.2. Suppose that (X_n : n ≥ 1) is a sequence of iid rvs. Then,

   (1/n)(X_1 + ⋯ + X_n) →_P E(X_1)

as n → ∞. (Here →_P denotes convergence in probability; for a definition, see Appendix A.)

This result is easy to prove when the X_i's have finite variance. The key is the following inequality, called Markov's inequality.

Proposition 2.1: Suppose that W is a non-negative rv. Then,

   P(W > w) ≤ (1/w) E(W).

Proof. Note that if W is a continuous rv,

   P(W > w) = ∫_w^∞ f(x) dx
            ≤ ∫_w^∞ (x/w) f(x) dx   (since x/w ≥ 1 when x ≥ w)
            ≤ ∫_0^∞ (x/w) f(x) dx = (1/w) E(W).

The proof is similar for discrete rvs.

An important special case is called Chebyshev's inequality.

Proposition 2.2: Suppose that the X_i's are iid with common (finite) variance σ². If S_n = X_1 + ⋯ + X_n, then

   P( |S_n/n − E[X_1]| > ε ) ≤ σ²/(nε²).

Proof. Put W = (S_n − nE(X_1))² and w = n²ε². Note that E(W) = var(S_n) = nσ², so

   P( |S_n/n − E(X_1)| > ε ) = P(W > w) ≤ σ²/(nε²).

Theorem 2.2 is an immediate consequence of Proposition 2.2. Let's now apply the LLN.

The LLN guarantees that even though the sample average (1/n)(X_1 + ⋯ + X_n) is a rv, it settles down to something deterministic and predictable when n is large, namely E(X_1). Hence, even though the individual X_i's are unpredictable, their average (or mean) is predictable. The fact that the average (1/n)(X_1 + ⋯ + X_n) settles down to the expectation E(X_1) is a principal reason for why the expectation of a rv is the most widely used measure of central tendency (as opposed, for example, to the median of the distribution).
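The LLN is easy to see in simulation. The sketch below assumes NumPy; the exponential population is illustrative. The running sample average settles down to E(X_1) = 2.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(scale=2.0, size=100_000)          # E(X_1) = 2

    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
    for n in (10, 100, 10_000, 100_000):
        print(n, running_mean[n - 1])                     # approaches 2 as n grows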


2.17 Central Limit Theorem

The second key limit result in probability is the central limit theorem (CLT). (It is so important that it is the "central" theorem of probability!)

Note that the approximation suggested by the LLN (Theorem 2.2) is rather crude:

   P(X_1 + ⋯ + X_n ≤ x) ≈ 0 for x < nμ,   and   ≈ 1 for x ≥ nμ,

where μ = E(X_1). Typically, we'd prefer an approximation that tells us how close P(X_1 + ⋯ + X_n ≤ x) is to 0 when x < nμ and how close to 1 when x ≥ nμ. The CLT provides exactly this additional information.

Theorem 2.3. Suppose that the X_i's are iid rvs with common (finite) variance σ². Then, if S_n = X_1 + ⋯ + X_n,

   (S_n − nE(X_1)) / (σ√n) ⇒ N(0, 1)     (2.1)

as n → ∞.

The CLT (2.1) supports the use of the approximation

   S_n ≈_D nE(X_1) + σ√n N(0, 1)     (2.2)

when n is large. The approximation (2.2) is valuable in many different problem settings. We now illustrate its use with an example.

An outline of the proof of the CLT is given later in the notes.
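The approximation (2.2) can also be checked by simulation. The sketch below assumes NumPy; the exponential population and the choices of n and the number of replications are arbitrary. It standardizes S_n and compares empirical probabilities with standard normal ones.

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps = 1_000, 50_000
    x = rng.exponential(scale=1.0, size=(reps, n))        # E(X_1) = 1, sigma = 1

    z = (x.sum(axis=1) - n * 1.0) / (1.0 * np.sqrt(n))    # (S_n - n E(X_1)) / (sigma sqrt(n))

    for c in (-1.0, 0.0, 1.0):
        print(c, np.mean(z <= c))   # roughly Phi(c): 0.159, 0.500, 0.841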

2.18 Moment Generating Functions

A key idea in applied mathematics is that of the Laplace transform. The Laplace transform is also a useful tool in probability. In the probability context, the Laplace transform is usually called the moment generating function (of the rv).

Definition 2.1: The moment generating function of a rv X is the function φ_X(·) defined by

   φ_X(θ) = E(exp(θX)).

This function can be computed in closed form for many of the distributions encountered most frequently in practice:

   Bernoulli(p) rv: φ_X(θ) = (1 − p) + p e^θ
   Binomial(n, p) rv: φ_X(θ) = ((1 − p) + p e^θ)^n
   Geometric(p) rv: φ_X(θ) = p / (1 − (1 − p) e^θ)
   Poisson(λ) rv: φ_X(θ) = exp(λ(e^θ − 1))
   Uniform(a, b) rv: φ_X(θ) = (e^{θb} − e^{θa}) / (θ(b − a))
   Exponential(λ) rv: φ_X(θ) = λ/(λ − θ)   (θ < λ)
   Gamma(α, λ) rv: φ_X(θ) = (λ/(λ − θ))^α   (θ < λ)
   Normal(μ, σ²) rv: φ_X(θ) = exp(μθ + σ²θ²/2)


The moment generating function (mgf) of a rv X gets its name from the fact that the moments (i.e. E(X^k) for k = 1, 2, ...) of the rv X can easily be computed from knowledge of φ_X(·). To see this, note that if X is continuous, then

   (d^k/dθ^k) φ_X(θ) = (d^k/dθ^k) E(exp(θX))
                     = (d^k/dθ^k) ∫_{−∞}^{∞} e^{θx} f(x) dx
                     = ∫_{−∞}^{∞} (d^k/dθ^k) e^{θx} f(x) dx
                     = ∫_{−∞}^{∞} x^k e^{θx} f(x) dx
                     = E(X^k exp(θX)).

In particular,

   (d^k/dθ^k) φ_X(0) = E(X^k).
Example 2.6: Suppose that X is exponentially distributed with parameter λ. Note that, for |θ| < λ,

   φ_X(θ) = λ(λ − θ)^{−1} = 1/(1 − θ/λ) = Σ_{k=0}^∞ θ^k / λ^k.     (2.3)

On the other hand, φ_X(θ) has the power series representation

   φ_X(θ) = Σ_{k=0}^∞ (1/k!) (d^k/dθ^k) φ_X(0) θ^k.     (2.4)

Equating coefficients in (2.3) and (2.4), we find that

   (d^k/dθ^k) φ_X(0) = k!/λ^k,

so that

   E(X^k) = k!/λ^k.

Note that we were able to compute all the moments of an exponential rv without having to repeatedly compute integrals.
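The same differentiate-and-evaluate-at-zero computation can be done symbolically. A minimal sketch (assuming SymPy is available):

    import sympy as sp

    theta, lam = sp.symbols("theta lambda", positive=True)
    phi = lam / (lam - theta)                    # mgf of an Exp(lambda) rv (theta < lambda)

    for k in range(1, 4):
        moment = sp.diff(phi, theta, k).subs(theta, 0)
        print(k, sp.simplify(moment))            # E(X^k) = k! / lambda**k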

Another key property of mgfs is the fact that the mgf uniquely characterizes the distribution of the rv. In particular, if X and Y are such that φ_X(θ) = φ_Y(θ) for all values of θ, then

   P(X ≤ x) = P(Y ≤ x)

for all x. This property turns out to be very useful when combined with the following proposition.

Proposition 2.3: Let the X_i's be independent rvs, and put S_n = X_1 + ⋯ + X_n. Then,

   φ_{S_n}(θ) = ∏_{i=1}^n φ_{X_i}(θ).

Proof. Note that

   φ_{S_n}(θ) = E(exp(θ(X_1 + ⋯ + X_n)))
             = E( ∏_{i=1}^n exp(θX_i) )
             = ∏_{i=1}^n E(exp(θX_i))   (due to independence)
             = ∏_{i=1}^n φ_{X_i}(θ).

In other words, the mgf of a sum of independent rvs is trivial to compute in terms of the mgfs of the summands. So, one way to compute the exact distribution of a sum of n independent rvs X_1, ..., X_n is:

1. Compute φ_{X_i}(θ) for 1 ≤ i ≤ n.

2. Compute

   φ_{S_n}(θ) = ∏_{i=1}^n φ_{X_i}(θ).

3. Find a distribution/rv Y such that

   φ_{S_n}(θ) = φ_Y(θ)

   for all θ. Then,

   P(S_n ≤ x) = P(Y ≤ x).
Example 2.7: Suppose the X_i's are iid Bernoulli rvs with parameter p. Then,

   φ_{S_n}(θ) = (1 − p + p e^θ)^n.

But (1 − p + p e^θ)^n is the mgf of a binomial rv with parameters n and p. So,

   P(S_n = k) = P(Bin(n, p) = k) = C(n, k) p^k (1 − p)^{n−k}.

Theorem 2.4. Let (X_n : 1 ≤ n ≤ ∞) be a sequence of rvs with mgfs (φ_{X_n}(θ) : 1 ≤ n ≤ ∞). If, for each θ,

   φ_{X_n}(θ) → φ_{X_∞}(θ)

as n → ∞, then

   X_n ⇒ X_∞

as n → ∞.
