Chap2 PDF
2.1

For disjoint events A_1, A_2, . . .,

P{ ∪_{i=1}^∞ A_i } = Σ_{i=1}^∞ P{A_i}.

2.2 Conditional Probability

P{A|B} = P{A ∩ B} / P{B}.
It is a probability on the new sample space B ; P {A|B} is interpreted as the likelihood / probability
that A occurs given knowledge that B has occurred.
2.3
Independence
2.4
Given a discrete random variable (rv) X which takes on values in S = {x1 , x2 , . . .}, its probability mass
function is defined by:
P_X(x_i) = P{X = x_i},  i ≥ 1.
Given a collection X1 , X2 , . . . , Xn of S-valued rvs, its joint probability mass function (pmf) is defined as
P_{(X_1,X_2,...,X_n)}(x_1, x_2, . . . , x_n) = P{X_1 = x_1, X_2 = x_2, . . . , X_n = x_n}.
The conditional pmf of X given Y = y is then given by
P_{X|Y}(x|y) = P_{(X,Y)}(x, y) / P_Y(y).
2.5
Given a continuous rv X taking values in ℝ, its probability density function f_X(·) is the function satisfying:

P{X ≤ x} = ∫_{−∞}^x f_X(t) dt.
We interpret fX (x) as the likelihood that X takes on a value x. However, we need to exercise care in that
interpretation. Note that
P{X = x} = ∫_x^x f_X(t) dt = 0,
so the probability that X takes on precisely the value x (to infinite precision) is zero. The likelihood
interpretation comes from the fact that
P{X ∈ [a − ε, a + ε]} / P{X ∈ [b − ε, b + ε]} = ∫_{a−ε}^{a+ε} f_X(t) dt / ∫_{b−ε}^{b+ε} f_X(t) dt ≈ f_X(a) / f_X(b) as ε ↓ 0,
so that fX (a) does indeed measure the relative likelihood that X takes on a value a (as opposed, say, to b).
Given a collection X_1, X_2, . . . , X_n of real-valued continuous rvs, its joint probability density function (pdf) is defined as the function f_{(X_1,X_2,...,X_n)}(·) satisfying

P{X_1 ≤ x_1, . . . , X_n ≤ x_n} = ∫_{−∞}^{x_1} · · · ∫_{−∞}^{x_n} f_{(X_1,...,X_n)}(t_1, . . . , t_n) dt_n · · · dt_1.
Again, f(X1 ,...,Xn ) (x1 , . . . , xn ) can be given a likelihood interpretation. The collection X1 , X2 , . . . is independent if
f_{(X_1,X_2,...,X_n)}(x_1, x_2, . . . , x_n) = f_{X_1}(x_1) · · · f_{X_n}(x_n)

for all (x_1, . . . , x_n) ∈ ℝ^n.
Finally, the conditional pdf of X given Y = y is given by
f_{X|Y}(x|y) = f_{(X,Y)}(x, y) / f_Y(y).

2.6
Many applications will require we compute the distribution of a sum Sn = X1 + X2 + . . . + Xn where the
Xi s are jointly distributed real-valued rvs. If the Xi s are continuous then
f_{S_n}(z) = ∫ f_{X_n|S_{n−1}}(z − y | y) f_{S_{n−1}}(y) dy,

which, when the X_i's are independent, reduces to

f_{S_n}(z) = ∫ f_{X_n}(z − y) f_{S_{n−1}}(y) dy.

This type of integral is known, in applied mathematics, as a convolution integral. So, f_{S_n}(·) can be computed recursively (in the independent setting) via n − 1 convolution integrals.
A corresponding result holds in the discrete setting (with integrals replaced by sums).
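In the discrete setting the convolution is a sum, and the recursion is easy to carry out directly. A minimal sketch (the dice pmf is an illustrative choice, not from the text):

```python
# Sketch: pmf of S_2 = X_1 + X_2 for independent discrete rvs,
# computed by the (discrete) convolution sum described above.

def convolve_pmf(p, q):
    """Convolution of two pmfs given as dicts {value: probability}."""
    r = {}
    for x, px in p.items():
        for y, qy in q.items():
            r[x + y] = r.get(x + y, 0.0) + px * qy
    return r

# pmf of a fair die; the total of two dice is the 2-fold convolution
die = {k: 1 / 6 for k in range(1, 7)}
two_dice = convolve_pmf(die, die)

print(two_dice[7])  # 6/36: six of the 36 equally likely pairs sum to 7
```

Applying `convolve_pmf` repeatedly (n − 1 times) gives the pmf of S_n, mirroring the recursive computation in the continuous case.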
2.7
Expectations
Fortunately, there is an alternative approach to computing E[Y] that is often easier to implement.

Result 2.1: If Y = g(X_1, . . . , X_n) for S-valued rvs X_1, . . . , X_n, then

E[Y] = Σ_{x_1 ∈ S} · · · Σ_{x_n ∈ S} g(x_1, . . . , x_n) P_{(X_1,...,X_n)}(x_1, . . . , x_n).
Remark 2.1: In older editions of his book, Sheldon Ross referred to Result 2.1 as the Law of the Unconscious Statistician!
Example 2.1: Suppose X is a uniformly distributed rv on [0, 1], so that

f_X(x) = 1 for 0 ≤ x ≤ 1, and 0 o.w.

Let Y = X². For 0 ≤ y ≤ 1, P{Y ≤ y} = P{X ≤ y^{1/2}} = y^{1/2}, so that

f_Y(y) = (d/dy) y^{1/2} = (1/2) y^{−1/2}.
Hence,

E[Y] = ∫_0^1 y f_Y(y) dy = ∫_0^1 (1/2) y^{1/2} dy = (1/2)(2/3) y^{3/2} |_0^1 = 1/3.
Alternatively, using Result 2.1 with g(x) = x²,

E[Y] = ∫ g(x) f_X(x) dx = ∫_0^1 x² dx = 1/3.
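A quick numerical check of Example 2.1 (both routes give E[X²] = 1/3 for X ~ Unif[0, 1]); the sample size and seed are arbitrary choices:

```python
# Two numerical checks that E[X^2] = 1/3 for X ~ Unif[0,1]:
# a Riemann-sum version of the integral, and a Monte Carlo average.
import random

n = 100_000

# midpoint Riemann sum approximating the integral of x^2 over [0, 1]
integral = sum(((i + 0.5) / n) ** 2 for i in range(n)) / n

# Monte Carlo: average g(X) = X^2 over many draws of X
random.seed(0)
mc = sum(random.random() ** 2 for _ in range(n)) / n

print(integral, mc)  # both close to 1/3
```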
The expectation of a rv is interpreted as a measure of a rv's central tendency. It is one of several summary
statistics that are widely used in communicating the essential features of a probability distribution.
2.8
Given a rv X, the following are the most commonly used summary statistics.
1. Mean of X: The mean of X is just its expectation E[X]. We will see later, in our discussion of the
law of large numbers, why E[X] is a key characteristic of X's distribution.
2. Variance of X:
var(X) = E[(X − E[X])²].

This is a measure of X's variability.
3. Standard Deviation of X:
σ(X) = √var(X).
This is a measure of variability that scales appropriately under a change in the units used to measure
X (e.g. if X is a length, changing units from feet to inches multiplies the variance by 144, but the
standard deviation by 12).
4. (Squared) Coefficient of Variation of X:

c²(X) = var(X) / (E[X])².
This is a dimensionless measure of variability that is widely used when characterizing the variation
that is present in a non-negative rv X (e.g. task durations, component lifetimes, etc).
5. Median of X: this is the value m having the property that

P{X ≤ m} = 1/2 = P{X ≥ m}

(and is uniquely defined when P{X ≤ x} is continuous and strictly increasing in x). It is a measure of the
central tendency of X that complements the mean. Its advantage, relative to the mean, is that it is
less sensitive to outliers (i.e. observations that are in the tails of X that have a big influence on
the mean, but very little influence on the median).
6. pth quantile of X: The pth quantile of X is that value q having the property that
P{X ≤ q} = F_X(q) = p,

i.e. q = F_X^{−1}(p).
7. Interquartile Range of X: this is the quantity

F_X^{−1}(3/4) − F_X^{−1}(1/4);
it is a measure of variability that, like the median, is (much) less sensitive to outliers than is the
standard deviation.
2.9
Conditional Expectation
When X is continuous,

E[X|Y = y] = ∫ x f_{X|Y}(x|y) dx.

We can similarly define E[X|Y_1 = y_1, . . . , Y_n = y_n] = E[X|Y = y] where Y = (Y_1, . . . , Y_n)^T and y = (y_1, . . . , y_n)^T. We sometimes denote E[X|Y = y] as E_y[X].
Note that expectations can be computed by conditioning:
E[X] = ∫ E[X|Y = y] f_Y(y) dy

when Y is continuous, and E[X] = Σ_y E[X|Y = y] p_Y(y) when Y is discrete.
2.10
Statistics for the most commonly used discrete rvs:

Bernoulli(p) rv: E[X] = p, var(X) = p(1 − p).

Binomial(n, p) rv: E[X] = np, var(X) = np(1 − p).

Geometric(p) rv: X ∈ {0, 1, 2, . . .} with

P{X = k} = p(1 − p)^k,  k ≥ 0.

Statistics:

E[X] = (1 − p)/p,  var(X) = (1 − p)/p².

A closely related variant, also called a geometric rv, arises when X ∈ {1, 2, . . .}, and

P{X = k} = p(1 − p)^{k−1},  k ≥ 1.

This time, it is the number of tosses required to observe the first head. Statistics:

E[X] = 1/p,  var(X) = (1 − p)/p².

Poisson(λ) rv: X ∈ {0, 1, 2, . . .} with

P{X = k} = e^{−λ} λ^k / k!,  k ≥ 0.

Statistics:

E[X] = λ,  var(X) = λ.
The Poisson rv arises as an approximation to a binomial rv when n is large and p is small. For example,
if there are n pixels on a screen and the probability a given pixel is defective is p, then the total number
of defectives on the screen is Bin(n, p). In this setting, n is large and p is small. The binomial probabilities are cumbersome to work with when n is large because of the binomial coefficients that appear. As a result, we seek a suitable approximation. We propose the approximation

Bin(n, p) ≈_D Poisson(np)

when n is large and p is small (where ≈_D denotes "has approximately the same distribution as"). This
approximation is supported by the following theorem.
Theorem 2.1. If p = λ/n for fixed λ > 0, then

P{Bin(n, p) = 0} = (1 − p)^n = (1 − λ/n)^n = e^{−λ}(1 + o(1))

as n → ∞ (where o(a_n) represents a sequence having the property that o(a_n)/a_n → 0 as n → ∞).
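The quality of the Poisson approximation can be checked numerically; a small sketch with illustrative values of n and p (λ = np = 3):

```python
# Numerical support for the approximation Bin(n, p) ~ Poisson(np)
# when n is large and p is small: compare the two pmfs pointwise.
from math import comb, exp, factorial

n, p = 10_000, 0.0003        # large n, small p
lam = n * p                  # lam = np = 3

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

max_gap = max(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(20))
print(max_gap)  # tiny: the two pmfs nearly coincide
```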
2.11
1. Uniform(a, b) rv: X ∼ Unif(a, b) if

f_X(x) = 1/(b − a) for a ≤ x ≤ b, and 0 o.w.

Statistics:

E[X] = (a + b)/2,  var(X) = (b − a)²/12.

2. Beta(α, β) rv: X ∼ Beta(α, β), α, β > 0, if

f_X(x) = x^{α−1}(1 − x)^{β−1} / B(α, β) for 0 ≤ x ≤ 1, and 0 o.w.,

where B(α, β) is the normalization factor chosen to ensure that f_X(·) integrates to one, i.e.

B(α, β) = ∫_0^1 y^{α−1}(1 − y)^{β−1} dy.
Applications: The Beta distribution is a commonly used prior on the Bernoulli parameter p.
Exercise 2.1: Compute the mean and variance of a Beta(α, β) rv in terms of the function B(α, β).
3. Exponential(λ) rv: X ∼ Exp(λ), λ > 0, if

f_X(x) = λe^{−λx} for x ≥ 0, and 0 o.w.

Statistics:

E[X] = 1/λ,  var(X) = 1/λ².

4. Gamma(α, λ) rv: X ∼ Gamma(α, λ), α, λ > 0, if

f_X(x) = λ(λx)^{α−1} e^{−λx} / Γ(α) for x ≥ 0, and 0 o.w.,

where

Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy.

Statistics:

E[X] = α/λ,  var(X) = α/λ².
5. Gaussian / Normal rv: X ∼ N(μ, σ²), μ ∈ ℝ, σ² > 0, if

f_X(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)},  x ∈ ℝ.

Applications: Arises all over probability and statistics (as a result of the central limit theorem).

Statistics:

E[X] = μ,  var(X) = σ².

Note that N(μ, σ²) =_D μ + σN(0, 1), where =_D denotes equality in distribution. (In other words, if one takes a N(0, 1) rv, scales it by σ and adds μ on to it, we end up with a N(μ, σ²) rv.)
6. Weibull(α, λ) rv: X ∼ Weibull(α, λ), α, λ > 0, if

P{X > x} = e^{−(λx)^α}

for x ≥ 0. Hence:

f_X(x) = αλ(λx)^{α−1} e^{−(λx)^α} for x ≥ 0, and 0 o.w.

Statistics:

E[X] = Γ(1 + 1/α)/λ,  var(X) = [Γ(1 + 2/α) − Γ(1 + 1/α)²]/λ².

7. Pareto(α) rv: X ∼ Pareto(α), α > 0, if

f_X(x) = α(1 + x)^{−(α+1)} for x ≥ 0, and 0 o.w.
The Pareto distribution has a tail that decays to zero as a power of x (rather than exponentially rapidly (or faster) in x). As a result, a Pareto rv is said to be a heavy-tailed rv.
Applications: Component lifetime, task duration, etc.
10
2.12

There are C(13, k) ways to choose the k hearts and C(39, 5 − k) ways to choose the remaining 5 − k cards from the 39 non-hearts present in the deck. There are C(52, 5) ways to choose 5 cards from a deck of 52 cards. So,

P{k hearts in a hand of 5} = C(13, k) C(39, 5 − k) / C(52, 5).
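The counting argument above can be evaluated directly with binomial coefficients:

```python
# P{k hearts in a 5-card hand} = C(13,k) C(39,5-k) / C(52,5),
# computed for k = 0, ..., 5.
from math import comb

def p_hearts(k):
    return comb(13, k) * comb(39, 5 - k) / comb(52, 5)

probs = [p_hearts(k) for k in range(6)]
print(probs)
print(sum(probs))  # the six cases are exhaustive, so the total is 1
```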
In the Monty Hall problem, let P denote the door hiding the prize and Y the door the host opens; suppose we initially select door 1 and the host opens door 2. Then

P{P = 1 | Y = 2} = P{P = 1, Y = 2} / P{Y = 2}.

But,

P{P = 1, Y = 2} = P{P = 1} P{Y = 2 | P = 1} = (1/3)(1/2),
where the 1/2 presumes that if the prize is behind the door we initially select, then the host chooses to open one of the two doors with goats behind them at random. On the other hand,
P{Y = 2} = P{P = 1} P{Y = 2|P = 1} + P{P = 2} P{Y = 2|P = 2} + P{P = 3} P{Y = 2|P = 3}
         = (1/3)(1/2) + (1/3)(0) + (1/3)(1)
         = 1/2.
So,

P{P = 1|Y = 2} = (1/3)(1/2) / (1/2) = 1/3.

Similarly,

P{P = 2|Y = 2} = 0,  P{P = 3|Y = 2} = 2/3,

so we should indeed change our choice of door in response to the information the host reveals.
Remark 2.2: See Appendix B for a more detailed discussion of the Monty Hall / "Let's Make a Deal" problem.
Example 2.5: How should we model the idea that as a component ages, it becomes less reliable?
Solution: Let T be a continuous rv corresponding to the component lifetime. For h > 0 and fixed, consider
P{T ∈ [t, t + h] | T > t} = P{T ∈ [t, t + h], T > t} / P{T > t} = P{T ∈ [t, t + h]} / P{T > t}.

This conditional probability is the likelihood the component fails in the next h time units given that it has survived to time t. Reduction in reliability as the component ages amounts to asserting that P{T ∈ [t, t + h] | T > t} should be an increasing function of t, i.e.

P{T ∈ [t, t + h] | T > t} ↗

in t. Note that when h is small,
P{T ∈ [t, t + h] | T > t} ≈ h f(t)/F̄(t),

where f is the density of T and F̄(t) = 1 − F(t) = P{T > t}. Accordingly, r(t) = f(t)/F̄(t) is called the
failure rate (at time t) of T . Modeling reduction in reliability as the component ages amounts to requiring
that r(t) should be increasing in t. In other words, T has an increasing failure rate function.
New components often exhibit a burn-in phase where they are subject to immediate (or rapid) failure because of the presence of manufacturing defects. Once a component survives through the burn-in phase, its reliability improves. Such components have a decreasing failure rate function (at least through the end of the burn-in phase).
Most manufactured components have a failure rate that is bathtub shaped; see Figure 2.1.
Over the operational interval [t1 , t2 ], the failure rate is essentially constant. This makes identifying the
constant failure rate distribution interesting (since the failure rate of a component is often constant over
the great majority of its design lifetime).
Suppose

r(t) = λ.

[Figure 2.1: a bathtub-shaped failure rate r(t), essentially constant over the operational interval [t1, t2].]

Since f(t) = −(d/dt)F̄(t), the equation r(t) = λ reads

−(d/dt)F̄(t) / F̄(t) = λ.
We conclude that

(d/dt) log F̄(t) = −λ,

so that

log F̄(t) = log F̄(0) − λt.

Since T is positive, F̄(0) = 1 and hence F̄(t) = e^{−λt}. In other words, T ∼ Exp(λ). So, exponential rvs are the unique rvs having a constant failure rate.
If T ∼ Weibull(α, λ), then log F̄(t) = −(λt)^α. So,

r(t) = (d/dt)(λt)^α = αλ(λt)^{α−1};

so if α < 1, T has a decreasing failure rate, while if α > 1, T has an increasing failure rate. When α = 1, T has a constant failure rate and is exponentially distributed.
2.13
In building stochastic models, it is often the case that observational data exists that can be used to help guide
the construction of an appropriate model. In particular, the existing data can be used to help estimate the
model parameters. Statisticians call the process of fitting the parameters of a model to data the parameter
estimation problem (estimation for short).
To provide a concrete example, consider the problem of building a stochastic model to represent the number
of defective pixels on a high-definition television screen. We argued earlier, in Section 2.10 of this chapter, that a good model for the number X of defective pixels on such a screen is to assume that it follows a Poisson distribution with parameter λ. We now wish to estimate λ.
We select five such screens and count the number of defective pixels on each of the five screens, leading to
counts of 0, 3, 4, 2 and 5, respectively. We view the five observations as a random sample from the distribution of X, by which we mean that the five observations are the realized values of five iid (independent and
identically distributed) rvs X_1, X_2, . . . , X_5 having a common Poisson(λ) distribution.
Maximum likelihood is generally viewed as the gold standard method for estimating statistical parameters.
We will discuss later the theoretical basis for why maximum likelihood is a preferred approach to estimating
parameters. The method of maximum likelihood asserts that one should:
Estimate the parameter as that value λ̂ that maximizes the likelihood of observing the given sample.
In this case, the likelihood of observing 0,3,4,2 and 5 under a Poisson () model is:
L(λ) = (e^{−λ}λ⁰/0!)(e^{−λ}λ³/3!)(e^{−λ}λ⁴/4!)(e^{−λ}λ²/2!)(e^{−λ}λ⁵/5!) = e^{−5λ} λ^{14} / (0! 3! 4! 2! 5!).

Maximizing over λ (e.g. by setting the derivative of log L(λ) to zero) yields

λ̂ = 14/5.
2.13.1
More generally, if X_1, X_2, . . . , X_n is a random sample from a Poisson(λ) distribution, then the likelihood is:

L_n(λ) = Π_{i=1}^n e^{−λ} λ^{x_i} / x_i!,

having maximizer

λ̂_n = (x_1 + x_2 + . . . + x_n)/n.
In other words, λ̂_n is just the (arithmetic) mean of the sample. This so-called sample mean is usually denoted as X̄_n. We next work out the maximum likelihood estimator (MLE) for normally distributed and gamma distributed rvs.
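The Poisson MLE can be checked numerically on the screen-defect counts 0, 3, 4, 2, 5; a minimal sketch (the grid bounds and spacing are arbitrary choices):

```python
# Maximize the Poisson log-likelihood over a grid of lambda values and
# confirm that the maximizer is the sample mean 14/5 = 2.8.
from math import lgamma, log

data = [0, 3, 4, 2, 5]

def log_lik(lam):
    # log of prod_i e^{-lam} lam^{x_i} / x_i!   (lgamma(x+1) = log x!)
    return sum(-lam + x * log(lam) - lgamma(x + 1) for x in data)

grid = [0.01 * j for j in range(1, 1001)]   # lambda in (0, 10]
lam_hat = max(grid, key=log_lik)

print(lam_hat)  # 2.8, the sample mean
```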
2.13.2
The likelihood for a random sample x_1, . . . , x_n from a N(μ, σ²) population is

L_n(μ, σ²) = Π_{i=1}^n (2πσ²)^{−1/2} e^{−(x_i−μ)²/(2σ²)} = (2πσ²)^{−n/2} e^{−Σ_{i=1}^n (x_i−μ)²/(2σ²)}.

Setting the partial derivatives of log L_n to zero at (μ̂_n, σ̂_n²) gives

(∂/∂μ) log L_n(μ̂_n, σ̂_n²) = (1/σ̂_n²) Σ_{i=1}^n (x_i − μ̂_n) = 0

and

(∂/∂σ²) log L_n(μ̂_n, σ̂_n²) = −n/(2σ̂_n²) + (1/(2σ̂_n⁴)) Σ_{i=1}^n (x_i − μ̂_n)² = 0.

This yields:

μ̂_n = (1/n) Σ_{i=1}^n x_i = X̄_n

and

σ̂_n² = (1/n) Σ_{i=1}^n (x_i − μ̂_n)².
Remark: It turns out that the estimators that are most frequently used by statisticians to estimate the parameters (μ, σ²) for Gaussian models are the following. Estimate μ via X̄_n and estimate σ² via

s_n² = (1/(n − 1)) Σ_{i=1}^n (x_i − X̄_n)² = (n/(n − 1)) σ̂_n².
The estimator s_n² is what statisticians call the sample variance. Note that when n is reasonably large, s_n² and σ̂_n² are almost identical. But for small samples, s_n² and σ̂_n² differ. Statisticians generally prefer s_n² to σ̂_n² because s_n² is undefined when n = 1 (as it should be) and s_n² is unbiased as an estimator of σ², by which we mean that

E[s_n²] = σ²

for n ≥ 2.
Exercise 2.2: Prove that s2n is an unbiased estimator for 2 when n 2.
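The bias of the divide-by-n estimator (and the unbiasedness of s_n²) is easy to see in simulation; a minimal sketch with n = 5 and σ² = 1 (sample sizes and seed are arbitrary):

```python
# Compare E[sigma_hat_n^2] (divide by n) with E[s_n^2] (divide by n-1)
# for small normal samples: only s_n^2 averages out to sigma^2 = 1.
import random

random.seed(2)
n, reps = 5, 200_000
mle_avg = 0.0   # running average of the divide-by-n estimator
s2_avg = 0.0    # running average of the sample variance s_n^2
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    mle_avg += ss / n
    s2_avg += ss / (n - 1)
mle_avg /= reps
s2_avg /= reps

print(mle_avg, s2_avg)  # roughly (n-1)/n = 0.8 and 1.0
```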
2.13.3
Suppose that we observe a random sample X_1, X_2, . . . , X_n of iid observations from a Gamma(α, λ) population. The corresponding likelihood is

L_n(α, λ) = Π_{i=1}^n λ(λx_i)^{α−1} e^{−λx_i} / Γ(α),

so that

log L_n(α, λ) = nα log λ + (α − 1) Σ_{i=1}^n log x_i − λ Σ_{i=1}^n x_i − n log Γ(α).
For this example, there is no closed form for the maximizer (α̂_n, λ̂_n) of L_n(·); the MLE (α̂_n, λ̂_n) must be computed numerically. This example illustrates a key point about MLEs. While they are the statistical gold standard, they are often notoriously difficult to compute (even in the presence of powerful computers).
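One simple numerical approach (a sketch of one possibility, not the only method) profiles λ out of the problem: for fixed α, the likelihood is maximized at λ = α/x̄, after which the profiled log-likelihood is searched over a grid in α. The data below are illustrative, not from the text:

```python
# Numerical gamma MLE via the profiled log-likelihood:
# for fixed alpha, the maximizing lambda is alpha / xbar.
from math import lgamma, log

data = [0.8, 1.9, 1.1, 3.2, 0.4, 2.6, 1.5, 0.9]   # hypothetical sample
n = len(data)
xbar = sum(data) / n
sum_log = sum(log(x) for x in data)

def profile_loglik(alpha):
    lam = alpha / xbar
    return (n * alpha * log(lam) + (alpha - 1) * sum_log
            - lam * sum(data) - n * lgamma(alpha))

grid = [0.01 * j for j in range(1, 2001)]   # alpha in (0, 20]
alpha_hat = max(grid, key=profile_loglik)
lam_hat = alpha_hat / xbar

print(alpha_hat, lam_hat)
```

In practice one would use a proper optimizer rather than a grid, but the sketch shows the structure of the computation.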
Exercise 2.3: Compute the MLE for a random sample from a Weibull(α, λ) population.

Exercise 2.4: Compute the MLE for a random sample from a Unif(a, b) population.

Exercise 2.5: Compute the MLE for a random sample from a Bin(n, p) population (where n is known).

Exercise 2.6: Compute the MLE for a random sample from a Beta(α, β) population.
2.13.4
Let us now return to the question of why maximum likelihood is the gold standard estimation method. Suppose that we have a random sample from a normally distributed population in which μ is unknown, but the variance σ² is known to equal one. Recall that for a normal distribution, μ characterizes both the mean and the median. This suggests estimating μ via either the estimator X̄_n (the sample mean) or m_n, the sample median. (The sample median is the (k + 1)th largest observation when n = 2k + 1, and the median is defined as the arithmetic average of the kth and (k + 1)th largest observations when n = 2k.) Since the sample is random, the estimators X̄_n and m_n are themselves random variables. The hope is that when the sample size n is large, X̄_n and m_n will be close to μ. The preferred estimator is clearly the one that has a tendency to be closer to μ.
One way to mathematically characterize this preference is to study the rate of convergence of the estimator to μ. We will see in Chapter 2 that both X̄_n and m_n obey central limit theorems that assert that X̄_n and m_n are, for large n, asymptotically normally distributed with common mean μ and variances σ₁²/n and σ₂²/n, respectively. Our preference should obviously be for the estimator with the smaller value σᵢ². The estimator X̄_n is the MLE in this Gaussian setting. As the gold standard estimator, it will come as no surprise that σ₁² is always less than or equal to σ₂². So X̄_n is to be preferred to m_n as an estimator of the parameter μ in a N(μ, 1) population. It is the fact that the MLE has the fastest possible rate of convergence among all possible estimators of an unknown statistical parameter that has led to its adoption as the gold standard estimator. (For those of you familiar with statistics, the MLE achieves (in great generality) the Cramér–Rao lower bound that describes a theoretical lower bound on the variance of an (unbiased) estimator of an unknown statistical parameter.) As a consequence, it is typical that in approaching parameter estimation problems, the first order of business is to study the associated MLE. If computation of the MLE is analytically or numerically tractable, then one would generally adopt the MLE as one's preferred estimator.
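The mean-versus-median comparison can be seen directly in simulation: for N(μ, 1) data, the sample mean has variance 1/n while the sample median has asymptotic variance (π/2)/n. A minimal sketch (sample size, replication count, and seed are arbitrary):

```python
# Estimate the sampling variances of the sample mean and sample median
# for N(0, 1) data and compare with the asymptotic values 1/n and (pi/2)/n.
import random
import statistics

random.seed(3)
n, reps = 101, 20_000
means, medians = [], []
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]
    means.append(sum(xs) / n)
    medians.append(statistics.median(xs))

var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)

print(var_mean * n, var_median * n)  # roughly 1.0 and pi/2 = 1.57
```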
2.14
An alternative approach to estimating model parameters is the method of moments. Let us illustrate this
idea in the setting of a gamma distributed random sample.
Given a random sample X_1, X_2, . . . , X_n of iid observations from a Gamma(α, λ) population, recall that:

E[X] = α/λ  and  var(X) = α/λ².

The method of moments chooses (α̂, λ̂) to match these model moments to the sample mean and sample variance, i.e. to solve

X̄_n = α̂/λ̂  and  s_n² = α̂/λ̂²,

which yields

λ̂ = X̄_n / s_n²  and  α̂ = X̄_n² / s_n².

More generally, for a model with d unknown parameters θ_1, . . . , θ_d, the method of moments chooses the estimators to match the first d sample moments to their model counterparts:

(1/n) Σ_{i=1}^n X_i^k = f_k(θ_1, . . . , θ_d),  k = 1, . . . , d,

where f_k(θ_1, . . . , θ_d) = E[X^k].
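The gamma moment-matching equations are simple enough to evaluate directly; a minimal sketch on hypothetical data:

```python
# Method-of-moments estimates for a gamma sample: matching
# xbar = alpha/lambda and s_n^2 = alpha/lambda^2 gives
# lambda = xbar / s_n^2 and alpha = xbar^2 / s_n^2.
import statistics

data = [0.8, 1.9, 1.1, 3.2, 0.4, 2.6, 1.5, 0.9]   # hypothetical sample
xbar = statistics.mean(data)
s2 = statistics.variance(data)   # sample variance (divides by n-1)

lam_mom = xbar / s2
alpha_mom = xbar ** 2 / s2

print(alpha_mom, lam_mom)
```

By construction, the fitted gamma distribution reproduces the sample mean and sample variance exactly.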
Exercise 2.7: Compute the method of moments estimators for a N(μ, σ²) population.

Exercise 2.8: Compute the method of moments estimators for a Unif(a, b) population.

Exercise 2.9: Compute the method of moments estimators for a Beta(α, β) population.

Exercise 2.10: Compute the method of moments estimators for a Weibull(α, λ) population.
2.15
Bayesian Statistics
Consider a case where we are attempting to estimate a Bernoulli parameter p corresponding to the probability that a given manufactured item is defective. With a good manufacturing process in place, p should be small.

In this case, if we test n items, it is likely that all n items are non-defective. In other words, the random sample X_1, X_2, . . . , X_n from such a Bernoulli population is likely to be one in which X_i = 0 for 1 ≤ i ≤ n.
The maximum likelihood estimator (and method of moments estimator) p̂_n for p is given by

p̂_n = X̄_n = 0.
Given the experimental data observed, this is perhaps a reasonable estimate for p .
But it is unlikely that a company would base any of its operational decisions on such an estimate of p .
Nobody truly believes that they have a flawless manufacturing process. One has a prior belief that p is
17
positive. Bayesian statistical methods offer a means of taking advantage of such prior information.
In a Bayesian approach to statistical analysis, one would view p as itself being a random variable. The
distribution of p (the so-called prior distribution on p) reflects the statistician's beliefs about the likely
values of p in the absence of any experimental data. In our Bernoulli example, one possible prior would be the
uniform distribution on [0, 1]. Having postulated a prior, we now observe a random sample X1 , X2 , . . . , Xn .
Conditional on p = p, the likelihood of the sample is just
L_n(p) = Π_{i=1}^n p^{X_i} (1 − p)^{1−X_i}.
We now wish to compute a posterior distribution on p that reflects the influence of the observed sample on the prior. We do this by taking advantage of the basic ideas of conditional probability. In this statistical setting, this application of conditional probability is often called Bayes' rule. In particular, the posterior distribution is just the distribution of p, given X_1, . . . , X_n. This translates into

f(p | X_1, . . . , X_n) = p^{S_n}(1 − p)^{n−S_n} / ∫_0^1 r^{S_n}(1 − r)^{n−S_n} dr,

where S_n = X_1 + · · · + X_n. This is a Beta(S_n + 1, n − S_n + 1) density, with mean

(S_n + 1)/(n + 2).
(Note that when n = 0, the mean is 1/2, which coincides with the mean of the uniform prior.) Thus, the Bayesian approach here leads to an analysis that seems more consistent with usage of statistics in an operational decision-making environment.
Such a Bayesian approach to statistical analysis can be applied in any setting in which the underlying data is assumed to follow a parametric distribution. In particular, suppose that the random sample X_1, . . . , X_n is a collection of observations from a population having a density function f(·; θ), where θ is the true value of the unknown parameter. Suppose p(·) is a density corresponding to a prior distribution on θ. Bayes' rule dictates that the posterior distribution on θ equals

f(θ | X_1, . . . , X_n) = p(θ) Π_{i=1}^n f(X_i; θ) / ∫ p(θ′) Π_{i=1}^n f(X_i; θ′) dθ′.
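For the Bernoulli example with a uniform prior, the posterior mean is available in closed form, and a tiny sketch shows how it behaves as defect-free data accumulates:

```python
# Posterior mean of p under a Uniform[0,1] prior, after s_n defects are
# observed in n tested items: (s_n + 1) / (n + 2).

def posterior_mean(n, s_n):
    return (s_n + 1) / (n + 2)

print(posterior_mean(0, 0))    # 0.5: no data yet, just the prior mean
print(posterior_mean(50, 0))   # about 0.019: 50 clean items pull the
                               # estimate toward 0 without reaching it
```

Unlike the MLE p̂_n = 0, the Bayesian estimate remains strictly positive, matching the operational intuition that no process is flawless.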
2.16
One of the two most important results in probability is the law of large numbers (LLN).
Theorem 2.2. Suppose that (X_n : n ≥ 1) is a sequence of iid rvs. Then,

(1/n)(X_1 + · · · + X_n) →_P E(X_1)

as n → ∞.

This result is easy to prove when the X_i's have finite variance. The key is the following inequality, called Markov's inequality.
Proposition 2.1: Suppose that W is a non-negative rv. Then,

P(W > w) ≤ (1/w) E(W).

Proof. Suppose W is continuous with density f. For w > 0,

P(W > w) = ∫_w^∞ f(x) dx ≤ ∫_w^∞ (x/w) f(x) dx ≤ (1/w) ∫_0^∞ x f(x) dx = (1/w) E(W).
The proof is similar for discrete rvs.
An important special case is called Chebyshevs inequality.
Proposition 2.2: Suppose that the X_i's are iid with common (finite) variance σ². If S_n = X_1 + · · · + X_n, then

P( |S_n/n − E(X_1)| > ε ) ≤ σ²/(nε²).

Proof. Put W = (S_n − nE(X_1))² and w = n²ε². Note that E(W) = var(S_n) = nσ², so

P( |S_n/n − E(X_1)| > ε ) = P(W > w) ≤ σ²/(nε²).
Theorem 2.2 is an immediate consequence of Proposition 2.2. Let's now apply the LLN.

The LLN guarantees that even though the sample average (1/n)(X_1 + · · · + X_n) is a rv, it settles down to something deterministic and predictable when n is large, namely E(X_1). Hence, even though the individual X_i's are unpredictable, their average (or mean) is predictable. The fact that the average (1/n)(X_1 + · · · + X_n) settles down to the expectation E(X_1) is a principal reason for why the expectation of a rv is the most widely used measure of central tendency (as opposed, for example, to the median of the distribution).
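The settling-down behavior is easy to observe numerically; a minimal sketch using Exp(1) draws (the distribution, sample size, and seed are arbitrary choices):

```python
# LLN illustration: the sample average of iid Exp(1) draws settles down
# to the expectation E(X_1) = 1.
import random

random.seed(4)
n = 200_000
total = 0.0
for _ in range(n):
    total += random.expovariate(1.0)

avg = total / n
print(avg)  # close to 1
```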
2.17
The second key limit result in probability is the central limit theorem (CLT). (It is so important that it is
the central theorem of probability!)
Note that the LLN approximation is rather crude:

P(X_1 + · · · + X_n ≤ x) ≈ 0 for x < nE(X_1), and ≈ 1 for x ≥ nE(X_1).

Typically, we'd prefer an approximation that tells us how close P(X_1 + · · · + X_n ≤ x) is to 0 when x < nE(X_1) and how close to 1 when x ≥ nE(X_1). The CLT provides exactly this additional information.
Theorem 2.3. Suppose that the X_i's are iid rvs with common (finite) variance σ². Then, if S_n = X_1 + · · · + X_n,

(S_n − nE(X_1)) / (σ√n) ⇒ N(0, 1)    (2.1)

as n → ∞.
The CLT (2.1) supports the use of the approximation

S_n ≈_D nE(X_1) + σ√n N(0, 1)    (2.2)
when n is large. The approximation (2.2) is valuable in many different problem settings. We now illustrate
its use with an example.
An outline of the proof of the CLT is given later in the notes.
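The quality of the approximation (2.2) can be checked numerically; a minimal sketch for a sum of Uniform[0, 1] rvs (the sample size, query point, and seed are arbitrary choices):

```python
# CLT check: for S_n a sum of n iid Uniform[0,1] rvs, compare a simulated
# P(S_n <= x) with the normal approximation using E(S_n) = n/2 and
# var(S_n) = n/12.
import math
import random

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

random.seed(5)
n, reps = 30, 100_000
x = 16.0
hits = sum(sum(random.random() for _ in range(n)) <= x for _ in range(reps))
simulated = hits / reps

mu, var = n / 2, n / 12
approx = normal_cdf((x - mu) / math.sqrt(var))

print(simulated, approx)  # the two values nearly agree
```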
2.18
A key idea in applied mathematics is that of the Laplace transform. The Laplace transform also is a
useful tool in probability. In the probability context, the Laplace transform is usually called the moment
generating function (of the rv).
Definition 2.1: The moment generating function of a rv X is the function φ_X(·) defined by

φ_X(θ) = E(exp(θX)).
This function can be computed in closed form for many of the distributions encountered most frequently
in practice:
Bernoulli(p) rv: φ_X(θ) = (1 − p) + pe^θ

Binomial(n, p) rv: φ_X(θ) = ((1 − p) + pe^θ)^n

Geometric(p) rv: φ_X(θ) = p/(1 − (1 − p)e^θ)

Poisson(λ) rv: φ_X(θ) = exp(λ(e^θ − 1))

Uniform(a, b) rv: φ_X(θ) = (e^{θb} − e^{θa})/(θ(b − a))

Exponential(λ) rv: φ_X(θ) = λ/(λ − θ)

Gamma(α, λ) rv: φ_X(θ) = (λ/(λ − θ))^α

Normal(μ, σ²) rv: φ_X(θ) = exp(μθ + σ²θ²/2)
The moment generating function (mgf) of a rv X gets its name from the fact that the moments (i.e. E(X^k) for k = 1, 2, . . .) of the rv X can easily be computed from knowledge of φ_X(·). To see this, note that if X is continuous, then

(d^k/dθ^k) φ_X(θ) = (d^k/dθ^k) E(exp(θX))
                  = (d^k/dθ^k) ∫ e^{θx} f(x) dx
                  = ∫ (d^k/dθ^k) e^{θx} f(x) dx
                  = ∫ x^k e^{θx} f(x) dx
                  = E(X^k exp(θX)).

In particular,

(d^k/dθ^k) φ_X(0) = E(X^k).
Example 2.6: Suppose that X is exponentially distributed with parameter λ. Note that for θ < λ,

φ_X(θ) = λ/(λ − θ) = 1/(1 − θ/λ) = Σ_{k=0}^∞ (1/λ^k) θ^k.    (2.3)

On the other hand, Taylor expansion of φ_X about 0 gives

φ_X(θ) = Σ_{k=0}^∞ (1/k!) (d^k/dθ^k)φ_X(0) θ^k.    (2.4)

Matching the coefficients of θ^k in (2.3) and (2.4), we find that

E(X^k) = k!/λ^k.

Note that we were able to compute all the moments of an exponential rv without having to repeatedly compute integrals.
Another key property of mgfs is the fact that the mgf uniquely characterizes the distribution of the rv. In particular,
if X and Y are such that X () = Y () for all values of , then
P (X x) = P (Y x)
for all x.
This property turns out to be very useful when combined with the following proposition.
Proposition 2.3: Let the X_i's be independent rvs, and put S_n = X_1 + · · · + X_n. Then,

φ_{S_n}(θ) = Π_{i=1}^n φ_{X_i}(θ).
Proof.

φ_{S_n}(θ) = E(exp(θ(X_1 + · · · + X_n)))
           = E( Π_{i=1}^n exp(θX_i) )
           = Π_{i=1}^n E(exp(θX_i))    (due to independence)
           = Π_{i=1}^n φ_{X_i}(θ).
In other words, the mgf of a sum of independent rvs is trivial to compute in terms of the mgfs of the summands. So, one way to compute the exact distribution of a sum of n independent rvs X_1, . . . , X_n is:

1. Compute φ_{X_i}(θ) for 1 ≤ i ≤ n.

2. Compute

φ_{S_n}(θ) = Π_{i=1}^n φ_{X_i}(θ).
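The product rule can be checked numerically: for n iid Exp(λ) rvs, the product of the mgfs is (λ/(λ − θ))^n, the Gamma(n, λ) mgf. A minimal sketch comparing this against a Monte Carlo estimate of E[e^{θS_n}] (parameter values and seed are arbitrary choices, with θ < λ):

```python
# Check that the mgf of S_n = X_1 + ... + X_n (X_i iid Exp(lam)) equals
# the product of the individual mgfs, (lam/(lam - theta))^n.
import math
import random

random.seed(6)
lam, n, theta, reps = 2.0, 3, 0.5, 200_000

mgf_product = (lam / (lam - theta)) ** n   # product of n identical mgfs

mc = 0.0
for _ in range(reps):
    s = sum(random.expovariate(lam) for _ in range(n))
    mc += math.exp(theta * s)
mc /= reps

print(mgf_product, mc)  # the two values nearly agree
```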