
CHAPTER 5.

Convergence of Random Variables


5.1. Introduction

One of the most important parts of probability theory concerns the behavior
of sequences of random variables. This part of probability is often
called large sample theory or limit theory or asymptotic theory. This
material is extremely important for statistical inference. The basic question
is this: what can we say about the limiting behavior of a sequence of random
variables X_1, X_2, X_3, . . .? Since statistics is all about gathering data, we will
naturally be interested in what happens as we gather more and more data,
hence our interest in this question.
Recall that in calculus, we say that a sequence of real numbers x_n converges
to a limit x if, for every ε > 0, |x_n − x| < ε for all large n. In
probability, convergence is more subtle. Going back to calculus for a moment,
suppose that x_n = x for all n. Then, trivially, lim_n x_n = x. Consider a
probabilistic version of this example. Suppose that X_1, X_2, . . . are a sequence
of random variables which are independent and suppose each has a N(0, 1)
distribution. Since these all have the same distribution, we are tempted to
say that X_n converges to Z ∼ N(0, 1). But this can't quite be right since
P(X_n = Z) = 0 for all n.
Here is another example. Consider X_1, X_2, . . . where X_n ∼ N(0, 1/n).
Intuitively, X_n is very concentrated around 0 for large n. But P(X_n = 0) =
0 for all n. The next section develops appropriate methods of discussing
convergence of random variables.

5.2. Types of Convergence

Let us start by giving some definitions of different types of convergence.


It is easy to get overwhelmed. Just hang on and remember this: the two key
ideas in what follows are convergence in probability and convergence in
distribution.

Suppose that X_1, X_2, . . . have finite second moments. X_n converges to X
in quadratic mean (also called convergence in L2), written X_n →qm X, if

E(X_n − X)² → 0

as n → ∞.
X_n converges to X in probability, written X_n →p X, if, for every ε > 0,

P(|X_n − X| > ε) → 0

as n → ∞.
Let F_n denote the cdf of X_n and let F denote the cdf of X. X_n converges
to X in distribution, written X_n →d X, if

lim_{n→∞} F_n(t) = F(t)

at all t for which F is continuous.


Here is a summary:

    Quadratic mean:   E(X_n − X)² → 0
    In probability:   P(|X_n − X| > ε) → 0 for all ε > 0
    In distribution:  F_n(t) → F(t) at continuity points t
Recall that X is a point mass at c if P(X = c) = 1. The distribution
function for X is F(x) = 0 if x < c and F(x) = 1 if x ≥ c. In this case, we
write the convergence of X_n to X as X_n →qm c, X_n →p c, or X_n →d c, depending
on the type of convergence. Notice that X_n →d c means that F_n(t) → 0 for
t < c and F_n(t) → 1 for t > c. We do not require that F_n(c) converge to 1,
since c is not a point of continuity in the limiting distribution function.

EXAMPLE 5.2.1. Let X_n ∼ N(0, 1/n). Intuitively, X_n is concentrating
at 0 so we would like to say that X_n →d 0. Let's see if this is true. Let F be the
distribution function for a point mass at 0. Note that √n X_n ∼ N(0, 1). Let
Z denote a standard normal random variable. For t < 0, F_n(t) = P(X_n <
t) = P(√n X_n < √n t) = P(Z < √n t) → 0 since √n t → −∞. For t > 0,
F_n(t) = P(X_n < t) = P(√n X_n < √n t) = P(Z < √n t) → 1 since √n t → ∞.
Hence, F_n(t) → F(t) for all t ≠ 0 and so X_n →d 0. But notice that
F_n(0) = 1/2 ≠ F(0) = 1, so convergence fails at t = 0. But that doesn't
matter because t = 0 is not a continuity point of F and the definition of
convergence in distribution only requires convergence at continuity points.
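For readers who like to check such claims numerically, here is a minimal sketch
(assuming NumPy and SciPy are available): since X_n ∼ N(0, 1/n), we have
F_n(t) = Φ(√n t), so we can simply tabulate F_n(t) for growing n.

    import numpy as np
    from scipy.stats import norm

    # X_n ~ N(0, 1/n), so F_n(t) = P(X_n <= t) = Phi(sqrt(n) * t).
    for n in [1, 10, 100, 10000]:
        row = [round(norm.cdf(np.sqrt(n) * t), 4) for t in [-0.5, -0.1, 0.0, 0.1, 0.5]]
        print(n, row)
    # F_n(t) -> 0 for t < 0 and F_n(t) -> 1 for t > 0, but F_n(0) = 0.5 for every n.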

The following diagram summarizes the relationships between the types of
convergence.

    Quadratic Mean  ⟹  Probability  ⟹  Distribution
                        Probability  ⟸  Distribution   (when the limit is a point mass)

Here is the theorem that corresponds to the diagram.

THEOREM 5.2.1. The following relationships hold:

(a) X_n →qm X implies that X_n →p X.
(b) X_n →p X implies that X_n →d X.
(c) If X_n →d X and if P(X = c) = 1 for some real number c, then X_n →p X.

In general, none of the reverse implications hold except the special case
in (c).
PROOF. We start by proving (a). Suppose that X_n →qm X. Fix ε > 0.
Then, using Chebyshev's inequality,

P(|X_n − X| > ε) = P(|X_n − X|² > ε²) ≤ E|X_n − X|²/ε² → 0.
Proof of (b). This proof is a little more complicated. You may skip it if
you wish. Fix ε > 0. Then

F_n(x) = P(X_n ≤ x) = P(X_n ≤ x, X ≤ x + ε) + P(X_n ≤ x, X > x + ε)
       ≤ P(X ≤ x + ε) + P(|X_n − X| > ε)
       = F(x + ε) + P(|X_n − X| > ε).

Also,

F(x − ε) = P(X ≤ x − ε) = P(X ≤ x − ε, X_n ≤ x) + P(X ≤ x − ε, X_n > x)
         ≤ F_n(x) + P(|X_n − X| > ε).

Hence,

F(x − ε) − P(|X_n − X| > ε) ≤ F_n(x) ≤ F(x + ε) + P(|X_n − X| > ε).

Take the limit as n → ∞ to conclude that

F(x − ε) ≤ liminf_n F_n(x) ≤ limsup_n F_n(x) ≤ F(x + ε).

This holds for all ε > 0. Take the limit as ε → 0 and use the fact that F is
continuous at x to conclude that lim_n F_n(x) = F(x).
Proof of (c). Fix ε > 0. Then,

P(|X_n − c| > ε) = P(X_n < c − ε) + P(X_n > c + ε)
               ≤ P(X_n ≤ c − ε) + P(X_n > c + ε)
               = F_n(c − ε) + 1 − F_n(c + ε)
               → F(c − ε) + 1 − F(c + ε)
               = 0 + 1 − 1 = 0.

Let us now show that the reverse implications do not hold.

Convergence in probability does not imply convergence in
quadratic mean. Let U ∼ Unif(0, 1) and let X_n = √n I_(0,1/n)(U). Then
P(|X_n| > ε) = P(√n I_(0,1/n)(U) > ε) = P(0 ≤ U < 1/n) = 1/n → 0. Hence,
X_n →p 0. But E(X_n²) = n ∫₀^{1/n} du = 1 for all n, so X_n does not converge
in quadratic mean.
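A small Monte Carlo sketch of this counterexample (assuming NumPy; the sample
size, the seed, and ε = 0.1 are arbitrary choices) estimates P(|X_n| > ε) and
E(X_n²) for a few values of n:

    import numpy as np

    rng = np.random.default_rng(0)
    eps = 0.1
    for n in [10, 100, 1000]:
        u = rng.uniform(size=1_000_000)            # U ~ Unif(0, 1)
        x = np.sqrt(n) * (u < 1.0 / n)             # X_n = sqrt(n) * I_(0,1/n)(U)
        print(n, np.mean(np.abs(x) > eps), np.mean(x ** 2))
    # P(|X_n| > eps) is about 1/n -> 0, yet E(X_n^2) stays near 1 for every n.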

Convergence in distribution does not imply convergence in
probability. Let X ∼ N(0, 1). Let X_n = −X for n = 1, 2, 3, . . .; hence
X_n ∼ N(0, 1). X_n has the same distribution function as X for all n so,
trivially, lim_n F_n(x) = F(x) for all x. Therefore, X_n →d X. But P(|X_n − X| >
ε) = P(|2X| > ε) = P(|X| > ε/2) ≠ 0. So X_n does not tend to X in
probability.
p
Warning! One might conjecture that if X_n →p b then E(X_n) → b. This
is not true. Let X_n be a random variable defined by P(X_n = n²) = 1/n and
P(X_n = 0) = 1 − (1/n). Now, P(|X_n| < ε) = P(X_n = 0) = 1 − (1/n) → 1.
Hence, X_n →p 0. However, E(X_n) = [n² × (1/n)] + [0 × (1 − (1/n))] = n. Thus,
E(X_n) → ∞.
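Both quantities in this warning can be written down exactly; the short sketch
below (plain Python, using only the distribution given in the example) simply
tabulates them:

    # X_n equals n^2 with probability 1/n and 0 with probability 1 - 1/n.
    eps = 0.01
    for n in [10, 100, 1000, 10000]:
        p_far = 1.0 / n               # P(|X_n| > eps) = P(X_n = n^2) = 1/n
        mean = n ** 2 * (1.0 / n)     # E(X_n) = n^2 * (1/n) + 0 * (1 - 1/n) = n
        print(n, p_far, mean)
    # The probability tends to 0 (so X_n ->p 0) while E(X_n) = n -> infinity.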

Summary. Stare at the diagram.

5.3. The Law of Large Numbers

Now we come to a crowning achievement in probability, the law of large
numbers. This theorem says that, in some sense, the mean of a large sample
is close to the mean of the distribution. For example, the proportion of heads
of a large number of tosses is expected to be close to 1/2. We now make this
more precise.
Let X_1, X_2, . . . be an iid sample and let μ = E(X_1) and σ² = Var(X_1).¹
The sample mean is defined as X̄_n = (1/n) Σ_{i=1}^n X_i. Recall these two important
facts: E(X̄_n) = μ and Var(X̄_n) = σ²/n.

THEOREM 5.3.1. (The Weak Law of Large Numbers.) If X_1, . . . , X_n
are iid, then X̄_n →p μ.

PROOF. Assume that σ < ∞. This is not necessary but it simplifies the
proof. Using Chebyshev's inequality,

P(|X̄_n − μ| > ε) ≤ Var(X̄_n)/ε² = σ²/(nε²),

which tends to 0 as n → ∞.

There is a stronger theorem in the appendix called the strong law of large
numbers.

EXAMPLE 5.3.2. Consider flipping a coin for which the probability of
heads is p. Let X_i denote the outcome of a single toss (0 or 1). Hence, p =
P(X_i = 1) = E(X_i). The fraction of heads after n tosses is X̄_n. According
to the law of large numbers, X̄_n converges to p in probability. This does not
mean that X̄_n will numerically equal p. It means that, when n is large, the
distribution of X̄_n is tightly concentrated around p. Let us try to quantify
this more. Suppose the coin is fair, i.e. p = 1/2. How large should n be
so that P(.4 ≤ X̄_n ≤ .6) ≥ .7? First, E(X̄_n) = p = 1/2 and Var(X̄_n) =
σ²/n = p(1 − p)/n = 1/(4n). Now we use Chebyshev's inequality:

P(.4 ≤ X̄_n ≤ .6) = P(|X̄_n − μ| ≤ .1)
                 = 1 − P(|X̄_n − μ| > .1)
                 ≥ 1 − Var(X̄_n)/(.1)²
                 = 1 − 1/(4n(.1)²) = 1 − 25/n.

The last expression will be larger than .7 if n ≥ 84. Later we shall see that
this calculation is unnecessarily conservative.
¹ Note that μ = E(X_i) is the same for all i so we can define μ in terms of X_1 or any
other X_i.
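As a numerical sketch of this calculation (assuming NumPy; the 100,000
replications and the seed are arbitrary), we can compare the Chebyshev bound
1 − 25/n with a simulated value of P(.4 ≤ X̄_n ≤ .6) at n = 84:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 84, 100_000
    xbar = rng.binomial(n, 0.5, size=reps) / n        # sample means of n fair-coin tosses
    print("Chebyshev bound:", 1 - 25 / n)             # about 0.702
    print("simulated prob :", np.mean((xbar >= 0.4) & (xbar <= 0.6)))
    # The simulated probability is well above 0.9, so the bound is very conservative.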

5.4. The Central Limit Theorem

In this section we shall show that the sum (or average) of random variables
has a distribution which is approximately Normal. Suppose that X_1, . . . , X_n
are iid with mean μ and variance σ². The central limit theorem (CLT) says
that X̄_n = (1/n) Σ_i X_i has a distribution which is approximately Normal with
mean μ and variance σ²/n. This is remarkable since nothing is assumed
about the distribution of X_i, except the existence of the mean and variance.

THEOREM 5.4.1. (Central Limit Theorem). Let X_1, . . . , X_n be iid with
mean μ and variance σ². Let X̄_n = (1/n) Σ_{i=1}^n X_i. Then

Z_n ≡ √n (X̄_n − μ)/σ →d Z

where Z ∼ N(0, 1). In other words,

lim_{n→∞} P(Z_n ≤ z) = Φ(z)

where

Φ(z) = ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx

is the cdf of a standard normal.

The proof is in the appendix. The central limit theorem says that the
distribution of Z_n can be approximated by a N(0, 1) distribution. In other
words:

    probability statements about Z_n can be approximated using a
    Normal distribution. It's the probability statements that we are
    approximating, not the random variable itself.

There are several ways to denote the fact that the distribution of Z_n can
be approximated by a normal. They all mean the same thing. Here they are:

Z_n ≈ N(0, 1)

X̄_n ≈ N(μ, σ²/n)

X̄_n − μ ≈ N(0, σ²/n)

√n (X̄_n − μ) ≈ N(0, σ²)

√n (X̄_n − μ)/σ ≈ N(0, 1).

EXAMPLE 5.4.2. Suppose that the number of errors per computer
program has a Poisson distribution with mean 5. We get 125 programs.
Let X_1, . . . , X_125 be the number of errors in the programs. Let X̄_n be the
average number of errors. We want to approximate P(X̄_n < 5.5). Now
μ = E(X_1) = λ = 5 and σ² = Var(X_1) = λ = 5. So

Z_n = √n (X̄_n − μ)/σ = √125 (X̄_n − 5)/√5 = 5(X̄_n − 5) ≈ N(0, 1).

Hence,

P(X̄_n < 5.5) = P(5(X̄_n − 5) < 2.5) ≈ P(Z < 2.5) = .9938.
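A quick numerical check of this approximation (a sketch, assuming SciPy is
available): the total number of errors S = X_1 + · · · + X_125 is Poisson(625),
so P(X̄_n < 5.5) = P(S ≤ 687) can be computed exactly and compared with Φ(2.5).

    from scipy.stats import norm, poisson

    n, lam = 125, 5.0
    # The total S = X_1 + ... + X_125 is Poisson(n * lam) = Poisson(625), and
    # {X_bar < 5.5} = {S < 687.5} = {S <= 687} since S is integer valued.
    exact = poisson.cdf(687, n * lam)
    approx = norm.cdf(2.5)                     # the CLT approximation from the example
    print("exact:", round(exact, 4), "CLT approximation:", round(approx, 4))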

EXAMPLE 5.4.3. We will compare Chebyshev to the CLT. Suppose that
n = 25 and suppose we wish to bound

P(|X̄_n − μ|/σ > 1/4).

First, using Chebyshev,

P(|X̄_n − μ|/σ > 1/4) = P(|X̄_n − μ| > σ/4)
                     ≤ Var(X̄_n)/(σ/4)²
                     = (σ²/25)/(σ²/16) = 16/25 = .64.

Using the CLT,

P(|X̄_n − μ|/σ > 1/4) = P(√n |X̄_n − μ|/σ > √n/4) = P(5|X̄_n − μ|/σ > 5/4)
                     ≈ P(|Z| > 5/4) = .21.

The CLT gives a much sharper answer, albeit with some approximation error.
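The two numbers in this comparison can be reproduced directly; here is a
minimal sketch assuming SciPy is available:

    from math import sqrt
    from scipy.stats import norm

    n = 25
    chebyshev = 16 / n                          # (sigma^2/n) / (sigma/4)^2 = 16/n = 0.64
    clt = 2 * (1 - norm.cdf(sqrt(n) / 4))       # P(|Z| > 5/4)
    print("Chebyshev bound  :", chebyshev)
    print("CLT approximation:", round(clt, 3))  # about 0.211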


The central limit theorem tells us that Z_n = √n (X̄_n − μ)/σ is approximately
N(0, 1). This is interesting but there is a practical problem: we don't
always know σ. We can estimate σ² from X_1, . . . , X_n by

S_n² = (1/n) Σ_{i=1}^n (X_i − X̄_n)².

This raises the following question: if we replace σ with S_n, is the central limit
theorem still true? The answer is yes.

THEOREM 5.4.4. Assume the same conditions as the CLT. Then,

√n (X̄_n − μ)/S_n →d Z

where Z ∼ N(0, 1). Hence we may apply the central limit theorem with S_n
in place of σ.
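Here is a small simulation sketch of this theorem (assuming NumPy; the
Exponential(1) data, n = 200, and the seed are arbitrary choices). It checks
that √n (X̄_n − μ)/S_n behaves like a standard normal even though σ has been
replaced by the estimate S_n:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 200, 50_000
    mu = 1.0                                     # mean of an Exponential(1) variable
    x = rng.exponential(scale=1.0, size=(reps, n))
    xbar = x.mean(axis=1)
    s_n = x.std(axis=1)                          # S_n with the 1/n convention used above
    t_stat = np.sqrt(n) * (xbar - mu) / s_n
    print(np.mean(np.abs(t_stat) > 1.96))        # roughly P(|Z| > 1.96) = 0.05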
You might wonder, how accurate is the normal approximation? The answer
is given in the Berry-Esseen theorem, which we state next. You may
skip this theorem if you are not interested.

THEOREM 5.4.5. Suppose that E|X_1|³ < ∞. Then

sup_z |P(Z_n ≤ z) − Φ(z)| ≤ (33/4) E|X_1 − μ|³/(√n σ³).
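To get a feel for what the bound says, here is a sketch (assuming SciPy; the
Bernoulli(1/2) data and n = 10,000 are arbitrary choices) that computes the
right-hand side and compares it with the exact error of the approximation at a
few points, using the fact that the sum of n Bernoulli(p) variables is
Binomial(n, p):

    import numpy as np
    from scipy.stats import binom, norm

    n, p = 10_000, 0.5
    mu, sigma = p, np.sqrt(p * (1 - p))
    third = p * (1 - p) ** 3 + (1 - p) * p ** 3        # E|X_1 - mu|^3 for Bernoulli(p)
    bound = 33 / 4 * third / (np.sqrt(n) * sigma ** 3)

    # Exact error of the normal approximation at a few z values:
    # P(Z_n <= z) = P(S <= n*mu + z*sigma*sqrt(n)) with S ~ Binomial(n, p).
    for z in [-1.0, 0.0, 1.0, 2.0]:
        exact = binom.cdf(np.floor(n * mu + z * sigma * np.sqrt(n)), n, p)
        print(z, abs(exact - norm.cdf(z)))
    print("Berry-Esseen bound:", round(bound, 4))      # valid, but quite conservative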

5.5. The Effect of Transformations

Often, but not always, convergence properties are preserved under
transformations.

THEOREM 5.5.1. Let X_n, X, Y_n, Y be random variables.

(a) If X_n →p X and Y_n →p Y, then X_n + Y_n →p X + Y.
(b) If X_n →qm X and Y_n →qm Y, then X_n + Y_n →qm X + Y.

Generally, it is not the case that X_n →d X and Y_n →d Y implies that
X_n + Y_n →d X + Y. However, it does hold if one of the limits is constant.

THEOREM 5.5.2 (Slutsky's Theorem.) If X_n →d X and Y_n →d c, then
X_n + Y_n →d X + c.

Products also preserve some forms of convergence.

THEOREM 5.5.3.

(a) If X_n →p X and Y_n →p Y, then X_n Y_n →p XY.
(b) If X_n →d X and Y_n →p c, then X_n Y_n →d cX.

Finally, convergence is also preserved under continuous mappings.

THEOREM 5.5.4. Let g be a continuous mapping.

(a) If X_n →p X then g(X_n) →p g(X).
(b) If X_n →d X then g(X_n) →d g(X).
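Here is a short simulation sketch of the continuous-mapping result (assuming
NumPy; the Exponential(1) data and g(x) = eˣ are arbitrary choices): since
X̄_n →p μ = 1, g(X̄_n) should concentrate around g(μ) = e.

    import numpy as np

    rng = np.random.default_rng(0)
    mu = 1.0                                     # mean of an Exponential(1) variable
    reps = 100_000
    for n in [10, 100, 10_000]:
        # The sum of n Exponential(1) variables is Gamma(n, 1), so X_bar_n = Gamma(n, 1)/n.
        xbar = rng.gamma(n, size=reps) / n
        g = np.exp(xbar)                         # a continuous mapping g(x) = exp(x)
        print(n, np.mean(np.abs(g - np.exp(mu)) > 0.1))
    # The fraction of replications with |g(X_bar_n) - g(mu)| > 0.1 shrinks toward 0.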

Appendix A5.1. L1 Convergence and Almost Sure Convergence

We say that X_n converges almost surely to X, written X_n →a.s. X, if

P({s : X_n(s) → X(s)}) = 1.

When P(X = c) = 1 we can write this as

P(lim_{n→∞} X_n = c) = 1.

We say that X_n converges in L1 to X, written X_n →L1 X, if

E|X_n − X| → 0

as n → ∞.
The following relationships hold in addition to those in Theorem 5.2.1.

THEOREM A5.1.1. The following relationships hold:

(a) X_n →a.s. X implies that X_n →p X.
(b) X_n →qm X implies that X_n →L1 X.
(c) X_n →L1 X implies that X_n →p X.

Appendix A5.2. The Strong Law of Large Numbers
The weak law of large numbers says that X̄_n converges to E(X_1) in probability.
The strong law asserts that this is also true almost surely.

THEOREM A5.2.1. (The strong law of large numbers.) Let X_1, X_2, . . .
be iid. If E|X_1| < ∞ then X̄_n →a.s. μ, where μ = E(X_1).
Appendix A5.3. Proof of the Central Limit Theorem
If X is a random variable, define its moment generating function (mgf)
by ψ_X(t) = E e^{tX}. Assume in what follows that the mgf is finite in a
neighborhood around t = 0.
LEMMA A5.3.1. (Convergence using mgfs.) Let Z_1, Z_2, . . . be a sequence
of random variables. Let ψ_n be the mgf of Z_n. Let Z be another random variable
and denote its mgf by ψ. If ψ_n(t) → ψ(t) for all t in some open interval
around 0, then Z_n →d Z.

PROOF OF THE CENTRAL LIMIT THEOREM. Let Y_i = (X_i − μ)/σ.
Then, Z_n = n^{−1/2} Σ_i Y_i. Let ψ(t) be the mgf of Y_i. The mgf of Σ_i Y_i is
(ψ(t))^n and the mgf of Z_n is [ψ(t/√n)]^n ≡ ξ_n(t). Now ψ'(0) = E(Y_1) = 0 and
ψ''(0) = E(Y_1²) = Var(Y_1) = 1. So,

ψ(t) = ψ(0) + t ψ'(0) + (t²/2!) ψ''(0) + (t³/3!) ψ'''(0) + · · ·
     = 1 + 0 + t²/2 + (t³/3!) ψ'''(0) + · · ·
     = 1 + t²/2 + (t³/3!) ψ'''(0) + · · ·

Now,

ξ_n(t) = [ψ(t/√n)]^n
       = [1 + t²/(2n) + (t³/(3! n^{3/2})) ψ'''(0) + · · ·]^n
       = [1 + (t²/2 + (t³/(3! n^{1/2})) ψ'''(0) + · · ·)/n]^n
       → e^{t²/2}

which is the mgf of a N(0, 1). The result follows from the previous lemma.
In the last step we used the following fact from calculus:

FACT: If a_n → a then

(1 + a_n/n)^n → e^a.
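A quick numerical sketch of this fact (plain Python): take a_n = t²/2 + c/√n,
mimicking the expansion above with an arbitrary constant c standing in for the
higher-order terms, and watch (1 + a_n/n)^n approach e^{t²/2}.

    import math

    t, c = 1.0, 0.7            # c stands in for the higher-order terms; its value is arbitrary
    target = math.exp(t ** 2 / 2)
    for n in [10, 100, 10_000, 1_000_000]:
        a_n = t ** 2 / 2 + c / math.sqrt(n)      # a_n -> t^2/2
        print(n, (1 + a_n / n) ** n, "->", target)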
