When we first looked at Bernoulli trials in Example 2.1.2 we asked the question “On average how
many successes will there be after n trials?” In order to answer this question, a specific definition
of “average” must be developed.
To begin, consider how to extend the basic notion of the average of a list of numbers to the
situation of equally likely outcomes. For instance, if we want to know what the average roll of a die
will be, it makes sense to declare it to be 3.5, the average value of 1, 2, 3, 4, 5, and 6. A motivation
for a more general definition of average comes from a rewriting of this calculation.
$$\frac{1+2+3+4+5+6}{6} = 1\left(\frac{1}{6}\right) + 2\left(\frac{1}{6}\right) + 3\left(\frac{1}{6}\right) + 4\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 6\left(\frac{1}{6}\right).$$
From the perspective of the right hand side of the equation, the results of all outcomes are added together after being weighted, each according to its probability. In the case of a die, all six outcomes have probability $\frac{1}{6}$.
Definition 4.1.1. Let X : S → T be a discrete random variable (so T is countable). Then the
expected value (or average) of X is written as E [X ] and is given by
$$E[X] = \sum_{t \in T} t \cdot P(X = t)$$
provided that the sum converges absolutely. In this case we say that X has “finite expectation”. If
the sum diverges to ±∞ we say the random variable has infinite expectation. If the sum diverges,
but not to infinity, we say the expected value is undefined.
Example 4.1.2. In the previous chapter, Example 3.1.4 described a lottery for which a ticket could
be worth nothing, or it could be worth either $20 or $200. What is the average value of such a
ticket?
We calculated the distribution of ticket values as $P(X = 200) = \frac{1}{1000}$, $P(X = 20) = \frac{27}{1000}$, and $P(X = 0) = \frac{972}{1000}$. Applying the definition of expected value results in
$$E[X] = 200\left(\tfrac{1}{1000}\right) + 20\left(\tfrac{27}{1000}\right) + 0\left(\tfrac{972}{1000}\right) = 0.74,$$
so a ticket has an expected value of 74 cents.
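The definition translates directly into a short computation. Below is a minimal sketch in Python (the code and its variable names are our own illustration, not part of the text) that evaluates E[X] for the lottery ticket by summing value times probability.

```python
# Expected value of a discrete random variable: sum of value * probability.
distribution = {200: 1/1000, 20: 27/1000, 0: 972/1000}

expected_value = sum(t * p for t, p in distribution.items())
print(expected_value)  # 0.74, i.e. 74 cents
```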
It is possible to think of a constant as a random variable. If c ∈ R then we could define a
random variable X with a distribution such that P (X = c) = 1. It is a slight abuse of notation,
but in this case we will simply write c for both the real number as well as the constant random
variable. Such random variables have the obvious expected value: E[c] = c.
This infinite sum diverges (not to ±∞), so the expected value of this random variable is undefined.
The examples above were specifically constructed to produce series which clearly diverged, but
in general it can be complicated to check whether an infinite sum is absolutely convergent or not.
The next technical lemma provides a condition that is often simpler to check. The convenience of
this lemma is that, since |X| is always positive, the terms of the series for E [|X|] may be freely
rearranged without changing the value of (or the convergence of) the sum.
Lemma 4.1.6. E [X ] is a real number if and only if E [|X|] < ∞.
Proof - Let T be the range of X, so U = {|t| : t ∈ T} is the range of |X|. By definition
$$E[|X|] = \sum_{u \in U} u \cdot P(|X| = u), \quad \text{while} \quad E[X] = \sum_{t \in T} t \cdot P(X = t).$$
To more easily relate these two sums, define $\hat{T} = \{t : |t| \in U\}$. Since every $u \in U$ came from some $t \in T$, the new set $\hat{T}$ contains every element of T. Every $t \in \hat{T}$ with $t \notin T$ lies outside the range of X, and so $P(X = t) = 0$ for such elements. Because of this, E[X] may be written as
$$E[X] = \sum_{t \in \hat{T}} t \cdot P(X = t).$$
Therefore the series describing E [X ] is absolutely convergent exactly when E [|X|] < ∞.
We will eventually wish to calculate the expected values of functions of multiple random variables.
Of particular interest to statistics is an understanding of expected values of sums and averages of
i.i.d. sequences. That understanding will be made easier by first learning something about how
expected values behave for simple combinations of variables.
Theorem 4.1.7. Suppose that X and Y are discrete random variables, both with finite expected
value and both defined on the same sample space S. If a and b are real numbers then
(1) $E[aX] = aE[X]$;
(2) $E[X + Y] = E[X] + E[Y]$;
(3) $E[aX + bY] = aE[X] + bE[Y]$; and
(4) if $X \geq 0$, then $E[X] \geq 0$.
Proof of (1) - If $a = 0$ then both sides of the equation are zero, so assume $a \neq 0$. We know
that X is a function from S to some range U . So aX is also a random variable and its range is
T = {au : u ∈ U }.
By definition $E[aX] = \sum_{t \in T} t \cdot P(aX = t)$, but because of how T is defined, adding values indexed by $t \in T$ is equivalent to adding values indexed by $u \in U$ where $t = au$. In other words
$$E[aX] = \sum_{t \in T} t \cdot P(aX = t) = \sum_{u \in U} au \cdot P(aX = au) = a \sum_{u \in U} u \cdot P(X = u) = aE[X].$$
Proof of (2) - We are assuming that X and Y have the same domain, but they typically have
different ranges. Suppose X : S → U and Y : S → V . Then the random variable X + Y is also
defined on S and takes values in T = {u + v : u ∈ U , v ∈ V }. Therefore, adding values indexed by
t ∈ T is equivalent to adding values indexed by u and v as they range over U and V respectively.
So,
$$E[X + Y] = \sum_{t \in T} t \cdot P(X + Y = t) = \sum_{u \in U,\, v \in V} (u + v) \cdot P(X = u, Y = v)$$
$$= \sum_{u \in U} \sum_{v \in V} u \cdot P(X = u, Y = v) + \sum_{v \in V} \sum_{u \in U} v \cdot P(X = u, Y = v),$$
where the rearrangement of summation is legitimate since the series converges absolutely. Notice that as u ranges over all of U the sets $(X = u, Y = v)$ partition the set $(Y = v)$ into disjoint pieces based on the value of X. Likewise the event $(X = u)$ is partitioned by $(X = u, Y = v)$ as v ranges over all values of $v \in V$. Therefore, as a disjoint union,
$$(Y = v) = \bigcup_{u \in U} (X = u, Y = v) \quad \text{and} \quad (X = u) = \bigcup_{v \in V} (X = u, Y = v),$$
and so
$$P(Y = v) = \sum_{u \in U} P(X = u, Y = v) \quad \text{and} \quad P(X = u) = \sum_{v \in V} P(X = u, Y = v).$$
Substituting these into the double sums above gives
$$E[X + Y] = \sum_{u \in U} u \cdot P(X = u) + \sum_{v \in V} v \cdot P(Y = v) = E[X] + E[Y].$$
Proof of (3) - This is an easy consequence of (1) and (2). From (2) the expected value E [aX + bY ]
may be rewritten as E [aX ] + E [bY ]. From there, applying (1) shows this is also equal to aE [X ] +
bE [Y ]. (Using induction this theorem may be extended to any finite linear combination of random
variables, a fact which we leave as an exercise below).
Proof of (4) - We know that X is a function from S to T where $t \in T$ implies that $t \geq 0$. Since
$$E[X] = \sum_{t \in T} t \cdot P(X = t),$$
every term of the sum is non-negative, and therefore $E[X] \geq 0$.
Example 4.1.8. What is the average value of the sum of two dice?
To answer this question by appealing to the definition of expected value would require summing
over the eleven possible outcomes {2, 3, . . . , 12} and computing the probabilities of each of those
outcomes. Theorem 4.1.7 makes things much simpler. We began this section by noting that a
single die roll has an expected value of 3.5. The sum of two dice is X + Y where each of X and
Y represents the outcome of a single die. So the average value of the sum of a pair of dice is
E [X + Y ] = E [X ] + E [Y ] = 3.5 + 3.5 = 7.
Example 4.1.9. Consider a game in which a player might either gain or lose money based on the
result. A game is considered “fair” if it is described by a random variable with an expected value
of zero. Such a game is fair in the sense that, on average, the player will have no net change in
money after playing.
Suppose a particular game is played with one player (the roller) throwing a die. If the die comes
up an even number, the roller wins that dollar amount from his opponent. If the die is odd, the
roller wins nothing. Obviously the game as stated is not “fair” since the roller cannot lose money
and may win something. How much should the roller pay his opponent to play this game in order
to make it a fair game?
Let X be the amount of money the rolling player gains by the result on the die. The set of
possible outcomes is T = {0, 2, 4, 6} and it should be routine at this point to verify that E [X ] = 2.
Let c be the amount of money the roller should pay to play in order to make the game fair. Since
X is the amount of money gained by the roll, the net change of money for the roller is X − c after
accounting for how much was paid to play. A fair game requires
0 = E [X − c] = E [X ] − E [c] = 2 − c.
So the roller should pay his opponent $2 to make the game fair.
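As a sanity check on the fair-game reasoning, one might simulate the game. The sketch below is hypothetical code of our own (names and the fee parameter are our choices); it estimates the roller's average net gain after paying $2 per game, which should hover near zero.

```python
import random

def net_gain(entry_fee=2):
    """One play: roll a die, win the face value if even, then pay the fee."""
    roll = random.randint(1, 6)
    winnings = roll if roll % 2 == 0 else 0
    return winnings - entry_fee

plays = 100_000
average = sum(net_gain() for _ in range(plays)) / plays
print(average)  # close to 0 for a fair game
```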
Theorem 4.1.10. Suppose that X and Y are discrete random variables, both with finite expected
value and both defined on the same sample space S. If X and Y are independent, then E [XY ] =
E [X ]E [Y ].
A quick glance at the definition of expected value shows that it only depends on the distribution
of the random variable. Therefore one can compute the expected values for the various common
distributions we defined in the previous chapter.
where the last equality is a shift of variables. But now, by the binomial theorem, the sum $\sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1-p)^{(n-1)-k}$ is equal to 1 and therefore $E[Y] = np$.
Alternatively, recall that the binomial distribution first came about as the total number of suc-
cesses in n independent Bernoulli trials. Therefore a Binomial(n, p) distribution results from adding
together n independent Bernoulli(p) random variables. Let X1 , X2 , . . . , Xn be i.i.d. Bernoulli(p)
and let Y = X1 + X2 + · · · + Xn . Then Y ∼ Binomial(n, p) and
$$E[Y] = E[X_1 + X_2 + \cdots + X_n] = E[X_1] + E[X_2] + \cdots + E[X_n] = p + p + \cdots + p = np.$$
This also provides the answer to part (d) of Example 2.1.2. The expected number of successes in
a series of n independent Bernoulli(p) trials is np.
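The additivity argument is easy to check empirically. A minimal sketch (our own illustration, not the text's) builds a Binomial(n, p) draw as a sum of n Bernoulli(p) draws and compares the sample mean with np.

```python
import random

def binomial_draw(n, p):
    # Sum of n independent Bernoulli(p) trials.
    return sum(1 for _ in range(n) if random.random() < p)

n, p, trials = 10, 0.3, 50_000
sample_mean = sum(binomial_draw(n, p) for _ in range(trials)) / trials
print(sample_mean, n * p)  # sample mean should be close to np = 3.0
```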
In the next example we will calculate the expected value of a geometric random variable.
The computation illustrates a common technique from calculus for simplifying power series by
differentiating the sum term-by-term in order to rewrite a complicated series in a simpler way.
Example 4.1.15. (Expected Value of a Geometric(p))
If X ∼ Geometric(p) and 0 < p < 1, then
$$E[X] = \sum_{k=1}^{\infty} k \cdot p(1-p)^{k-1}.$$
To evaluate this series we will work with its partial sums. For any $n \geq 1$, let
$$T_n = \sum_{k=1}^{n} kp(1-p)^{k-1} = \sum_{k=1}^{n} k(1 - (1-p))(1-p)^{k-1}$$
$$= \sum_{k=1}^{n} k(1-p)^{k-1} - \sum_{k=1}^{n} k(1-p)^{k}$$
$$= \sum_{k=1}^{n} (1-p)^{k-1} - n(1-p)^{n} = \frac{1 - (1-p)^{n}}{p} - n(1-p)^{n}.$$
Using standard results from analysis we know that for $0 < p < 1$,
$$\lim_{n \to \infty} (1-p)^{n} = 0 \quad \text{and} \quad \lim_{n \to \infty} n(1-p)^{n} = 0.$$
Therefore $T_n \to \frac{1}{p}$ as $n \to \infty$. Hence
$$E[X] = \frac{1}{p}.$$
For instance, suppose we wanted to know on average how many rolls of a die it would take before we observed a 5. Each roll is a Bernoulli trial with a probability $\frac{1}{6}$ of success. The time it takes to observe the first success is distributed as a Geometric($\frac{1}{6}$) and so has expected value $\frac{1}{1/6} = 6$. On average it should take six rolls before observing this outcome.
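The six-roll answer can also be checked by simulation. The sketch below is an illustration we add (function names are our own): it counts the rolls needed to see a 5 and averages over many repetitions.

```python
import random

def rolls_until_five():
    # Count rolls of a fair die until a 5 appears (Geometric(1/6)).
    count = 0
    while True:
        count += 1
        if random.randint(1, 6) == 5:
            return count

trials = 100_000
print(sum(rolls_until_five() for _ in range(trials)) / trials)  # about 6
```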
Example 4.1.16. (Expected Value of a Poisson(λ))
We can make a reasonable guess at the expected value of a Poisson(λ) random variable by
recalling that such a distribution was created to approximate a binomial when n was large and p
was small. The parameter λ = np remained fixed as we took a limit. Since we showed above that a
Binomial (n, p) has an expected value of np, it seems plausible that a P oisson(λ) should have an
expected value of λ. This is indeed true and it is possible to prove the fact by using the idea that
the Poisson random variable is the limit of a sequence of binomial random variables. However, this
proof requires an understanding of how limits and expected values interact, a concept that has not
yet been introduced in the text. Instead we leave a proof based on a direct algebraic computation
as Exercise 4.1.12.
Taking the result as a given, we will illustrate how this expected value might be used for
an applied problem. Suppose an insurance company wants to model catastrophic floods using a
Poisson(λ) random variable. Since floods are rare in any given year, and since the company is
considering what might occur over a long span of years, this may be a reasonable assumption.
As its name implies a “50-year flood” is a flood so substantial that it should occur, on average,
only once every fifty years. However, this is just an average; it may be possible to have two “50-year
floods” in consecutive years, though such an event would be quite rare. Suppose the insurance
company wants to know how likely it is that there will be two or more “50-year floods” in the next
decade, how should this be calculated?
There is an average of one such flood every fifty years, so by proportional reasoning, in the next
ten years there should be an average of 0.2 floods. In other words, the number of floods in the
next ten years should be a random variable $X \sim$ Poisson(0.2), and we wish to calculate $P(X \geq 2)$:
$$P(X \geq 2) = 1 - P(X = 0) - P(X = 1) = 1 - e^{-0.2} - e^{-0.2}(0.2) \approx 0.0175.$$
So assuming the Poisson random variable is an accurate model, there is only about a 1.75% chance that two or more such disastrous floods would occur in the next decade.
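The flood probability is a one-line computation from the Poisson mass function. Here is a minimal sketch (our own, using only the standard library) that reproduces it.

```python
import math

lam = 0.2  # expected number of 50-year floods in a decade
p_two_or_more = 1 - math.exp(-lam) - math.exp(-lam) * lam
print(p_two_or_more)  # about 0.0175
```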
For a hypergeometric random variable, we will demonstrate another proof technique common
to probability. An expected value may involve a complicated (or infinite) sum which must be
computed. However, this sum includes within it the probabilities of each outcome of the random
variable, and those probabilities must therefore add to 1. It is sometimes possible to simplify the
sum describing the expected value using the fact that a related sum is already known.
Example 4.1.17. (Expected Value of a HyperGeo(N, r, m)) Let m and r be positive integers and let N be an integer for which $N > \max\{m, r\}$. Let X be a random variable with $X \sim$ HyperGeo(N, r, m). To calculate the expected value of X, we begin with two facts. The first is
an identity involving combinations. If $n \geq k > 0$ then
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n}{k} \cdot \frac{(n-1)!}{(k-1)!((n-1)-(k-1))!} = \frac{n}{k}\binom{n-1}{k-1}.$$
The second comes from the consideration of the probabilities associated with a HyperGeo(N −
1, r − 1, m − 1) distribution. Specifically, as k ranges over all possible values of such a distribution,
we have
$$\sum_{k} \frac{\binom{r-1}{k}\binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}} = 1,$$
since this is the sum over all outcomes of the random variable.
To calculate E [X ], let j range over the possible values of X. Recall that the minimum value of j
is max{0, m − (N − r )} and the maximum value of j is min{r, m}. Now let k = j − 1. This means
that the maximum value for k is min{r − 1, m − 1}. If the minimum value for j was m − (N − r )
then the minimum value for k is m − (N − r ) − 1 = ((m − 1) − ((N − 1) − (r − 1))). If the
minimum value for j was 0 then the minimum value for k is −1.
The key to the computation is to note that as j ranges over all of the values of X, the values
of k cover all possible values of a HyperGeo(N − 1, r − 1, m − 1) distribution. In fact, the only
possible value k may assume that is not in the range of such a distribution is if k = −1 as a
minimum value. Now,
$$E[X] = \sum_{j} j \cdot \frac{\binom{r}{j}\binom{N-r}{m-j}}{\binom{N}{m}},$$
and if $j = 0$ is in the range of X, then that term of the sum is zero and it may be deleted without affecting the value. That is equivalent to deleting the $k = -1$ term, so the remaining values of k cover exactly the outcomes of a HyperGeo(N − 1, r − 1, m − 1) distribution. Applying the combinatorial identity above to $\binom{r}{j}$ and $\binom{N}{m}$ and substituting $k = j - 1$ then gives
$$E[X] = \frac{rm}{N} \sum_{k} \frac{\binom{r-1}{k}\binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}} = \frac{rm}{N},$$
since the remaining sum equals 1.
As we have seen previously, if X is a random variable and if f is a function defined on the possible
outputs of X, then f (X ) is a random variable in its own right. The expected value of this new
random variable may be computed in the usual way from the distribution of f (X ), but it is an
extremely useful fact that it may also be computed from the distribution of X itself. The next
example and theorems illustrate this fact.
Example 4.1.18. Returning to a setting first seen in Example 3.3.1 we will let X ∼ Uniform({−2, −1, 0, 1, 2}),
and let f (x) = x2 . How may E [f (X )] be calculated?
We will demonstrate this in two ways – first by appealing directly to the definition, and then
using the distribution of X instead of the distribution of f (X ). To use the definition of expected
value, recall that $f(X) = X^2$ takes values in $\{0, 1, 4\}$ with the following probabilities: $P(f(X) = 0) = \frac{1}{5}$ while $P(f(X) = 1) = P(f(X) = 4) = \frac{2}{5}$. Therefore,
$$E[f(X)] = 0\left(\tfrac{1}{5}\right) + 1\left(\tfrac{2}{5}\right) + 4\left(\tfrac{2}{5}\right) = 2.$$
However, the values of $f(X)$ are completely determined by the values of X. For instance, the event $(f(X) = 4)$ had a probability of $\frac{2}{5}$ because it was the disjoint union of two other events, $(X = 2) \cup (X = -2)$, each of which had probability $\frac{1}{5}$. So the term $4(\frac{2}{5})$ in the computation above could equally well have been thought of in two pieces:
$$4 \cdot P(f(X) = 4) = 4 \cdot P((X = 2) \cup (X = -2)) = 4 \cdot P(X = 2) + 4 \cdot P(X = -2) = 2^2 \cdot P(X = 2) + (-2)^2 \cdot P(X = -2),$$
where the final expression emphasizes that the outcome of 4 resulted either from $2^2$ or $(-2)^2$ depending on the value of X. Following a similar plan for the other values of $f(X)$ allows $E[f(X)]$ to be calculated directly from the probabilities of X as
$$E[f(X)] = \sum_{u \in U} u \cdot P(f(X) = u) = \sum_{u \in U} u \sum_{t \in f^{-1}(u)} P(X = t) = \sum_{u \in U} \sum_{t \in f^{-1}(u)} f(t) \cdot P(X = t) = \sum_{t \in T} f(t) \cdot P(X = t),$$
where the final step is simply the fact that T = f −1 (U ) and so summing over the values of t ∈ T
is equivalent to grouping them together in the sets f −1 (u) and summing over all values in U that
may be achieved by f (X ).
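Both routes to E[f(X)] are easy to compare numerically. The sketch below (illustrative code we add, not the book's) computes the expected value from the distribution of f(X) and directly from the distribution of X; the two results agree.

```python
from collections import defaultdict

pmf_x = {-2: 1/5, -1: 1/5, 0: 1/5, 1: 1/5, 2: 1/5}
f = lambda x: x ** 2

# Route 1: build the distribution of f(X), then apply the definition.
pmf_fx = defaultdict(float)
for t, p in pmf_x.items():
    pmf_fx[f(t)] += p
route1 = sum(u * p for u, p in pmf_fx.items())

# Route 2: sum f(t) * P(X = t) over the distribution of X itself.
route2 = sum(f(t) * p for t, p in pmf_x.items())

print(route1, route2)  # both equal 2.0
```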
The proof is nearly the same as for the one-variable case. The only difference is that $f^{-1}(u)$ is now
a set of vectors of values (t1 , . . . , tn ), so that the event (f (X ) = u) decomposes into events of the
form (X1 = t1 , . . . , Xn = tn ). However, this change does not interfere with the logic of the proof.
We leave the details to the reader.
exercises
Ex. 4.1.1. Let X, Y be discrete random variables with X ≤ Y. Show that E[X] ≤ E[Y].
Ex. 4.1.2. A lottery is held every day, and on any given day there is a 30% chance that someone
will win, with each day independent of every other. Let X denote the random variable describing
the number of times in the next five days that the lottery will be won.
(b) On average (expected value), how many times in the next five days will the lottery be won?
(c) When the lottery occurs for each of the next five days, what is the most likely number (mode)
of days there will be a winner?
(d) How likely is it the lottery will be won in either one or two of the next five days?
Ex. 4.1.3. A game show contestant is asked a series of questions. She has a probability of 0.88 of
knowing the answer to any given question, independently of every other. Let Y denote the random
variable describing the number of questions asked until the contestant does not know the correct
answer.
(b) On average (expected value), how many questions will be asked until the first question for
which the contestant does not know the answer?
(c) What is the most likely number of questions (mode) that will be asked until the contestant
does not know a correct answer?
(d) If the contestant is able to answer twelve questions in a row, she will win the grand prize.
How likely is it that she will know the answers to all twelve questions?
Ex. 4.1.4. Sonia sends out invitations to eleven of her friends to join her on a hike she’s planning.
She knows that each of her friends has a 59% chance of deciding to join her independently of each
other. Let Z denote the number of friends who join her on the hike.
(b) What is the average (expected value) number of her friends that will join her on the hike?
(c) What is the most likely number (mode) of her friends that will join her on the hike?
(d) How do your answers to (b) and (c) change if each friend has only a 41% chance of joining
her?
Ex. 4.1.5. A player rolls three dice and earns $1 for each die that shows a 6. How much should
the player pay to make this a fair game?
Ex. 4.1.6. ("The St. Petersburg Paradox") Suppose a game is played whereby a player begins flipping a fair coin and continues flipping it until it comes up heads. At that time the player wins $2^n$ dollars, where n is the total number of times he flipped the coin. Show that there is no amount of money the player could pay to make this a fair game. (Hint: See Example 4.1.4).
Ex. 4.1.7. Two different investment strategies have the following probabilities of return on $10,000.
Strategy A has a 20% chance of returning $14,000, a 35% chance of returning $12,000, a 20%
chance of returning $10,000, a 15% chance of returning $8,000, and a 10% chance of returning only
$6,000.
Strategy B has a 25% chance of returning $12,000, a 35% chance of returning $11,000, a 25%
chance of returning $10,000, and a 15% chance of returning $9,000.
(c) Is one strategy clearly preferable to the other? Explain your reasoning.
Ex. 4.1.8. Calculate the expected value of a Uniform({1, 2, . . . , n}) random variable by following
the steps below.
(a) Prove the numerical fact that $\sum_{j=1}^{n} j = \frac{n^2+n}{2}$. (Hint: There are many methods to do this. One uses induction).
(b) Use (a) to show that if $X \sim$ Uniform($\{1, 2, \ldots, n\}$), then $E[X] = \frac{n+1}{2}$.
Ex. 4.1.9. Use induction to extend the result of Theorem 4.1.7 by proving the following:
If X1 , X2 , . . . , Xn are random variables with finite expectation all defined on the same sample
space S and if $a_1, a_2, \ldots, a_n$ are real numbers, then
$$E[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1 E[X_1] + a_2 E[X_2] + \cdots + a_n E[X_n].$$
Ex. 4.1.10. Suppose X and Y are random variables for which X has finite expected value and Y
has infinite expected value. Prove that X + Y has infinite expected value.
Ex. 4.1.11. Suppose X and Y are random variables. Suppose E [X ] = ∞ and E [Y ] = −∞.
(a) Provide an example to show that E [X + Y ] = ∞ is possible.
(c) Provide an example to show that X + Y may have finite expected value.
Ex. 4.1.12. Let X ∼ P oisson(λ).
(a) Write an expression for E [X ] as an infinite sum.
(b) Every non-zero term in your answer to (a) should have a λ in it. Factor this λ out and
explain why the remaining sum equals 1. (Hint: One way to do this is through the use of
infinite series. Another way is to use the idea from Example 4.1.17).
Ex. 4.1.13. A daily lottery is an event that many people play, but for which the likelihood of any
given person winning is very small, making a Poisson approximation appropriate. Suppose a daily
lottery has, on average, two winners every five weeks. Estimate the probability that next week
there will be more than one winner.
As a single number, the average of a random variable may or may not be a good approximation
of the values that variable is likely to produce. For example, let X be defined such that $P(X = 10) = 1$, let Y be defined so that $P(Y = 9) = P(Y = 11) = \frac{1}{2}$, and let Z be defined such that $P(Z = 0) = P(Z = 20) = \frac{1}{2}$. It is easy to check that all three of these random variables have
an expected value of 10. However the number 10 exactly describes X, is always off from Y by an
absolute value of 1 and is always off from Z by an absolute value of 10.
It is useful to be able to quantify how far away a random variable typically is from its average.
Put another way, if we think of the expected value as somehow measuring the “center” of the
random variable, we would like to find a way to measure the size of the “spread” of the variable
about its center. Quantities useful for this are the variance and standard deviation.
Definition 4.2.1. Let X be a random variable with finite expected value. Then the variance of
the random variable is written as V ar [X ] and is defined as
$$Var[X] = E[(X - E[X])^2].$$
The standard deviation of X is then defined as $SD[X] = \sqrt{Var[X]}$.
Notice that V ar [X ] is the average of the square distance of X from its expected value. So if
X has a high probability of being far away from E [X ] the variance will tend to be large, while if
X is very near E [X ] with high probability the variance will tend to be small. In either case the
variance is the expected value of a squared quantity, and as such is always non-negative. Therefore
SD [X ] is defined whenever V ar [X ] is defined.
If we were to associate units with the random variable X (say meters), then the units of V ar [X ]
would be meters2 and the units of SD [X ] would be meters. We will see that the standard deviation
is more meaningful as a measure of the “spread” of a random variable while the variance tends to
be a more useful quantity to consider when carrying out complex computations.
Informally we will view the standard deviation as a typical distance from average. So if X is a
random variable and we calculate that $E[X] = 12$ and $SD[X] = 3$, we might say, "The variable X will typically take on values that are in or near the range 9 to 15, one standard deviation either side of the average". A goal of this section is to make that language more precise, but at this point this informal view will help build intuition.
The variance and standard deviation are described in terms of the expected value. Therefore
V ar [X ] and SD [X ] can only be defined if E [X ] exists as a real number. However, it is possible
that V ar [X ] and SD [X ] could be infinite even if E [X ] is finite (see Exercises). In practical terms,
if X has a finite expected value and infinite standard deviation, it means that the random variable
has a clear average, but is so spread out that any finite number underestimates the typical distance
of the random variable from its average.
Example 4.2.2. As above, let X be a constant variable with $P(X = 10) = 1$. Let Y be such that $P(Y = 9) = P(Y = 11) = \frac{1}{2}$ and let Z be such that $P(Z = 0) = P(Z = 20) = \frac{1}{2}$.
Since X always equals E [X ], the quantity (X − E [X ])2 is always zero and we can conclude
that V ar [X ] = 0 and SD [X ] = 0. This makes sense given the view of SD [X ] as an estimate of
how spread out the variable is. Since X is constant it is not at all spread out and so SD [X ] = 0.
To calculate V ar [Y ] we note that (Y − E [Y ])2 is always equal to 1. Therefore V ar [Y ] = 1 and
SD [Y ] = 1. Again this reaffirms the informal description of the standard deviation; the typical
distance between Y and its average is 1.
Likewise (Z − E [Z ])2 is always equal to 100. Therefore V ar [Z ] = 100 and SD [Z ] = 10. The
typical distance between Z and its average is 10.
Example 4.2.3. What are the variance and standard deviation of a die roll?
Before we carry out the calculation, let us use the informal idea of standard deviation to
estimate an answer and help build intuition. We know the average of a die roll is 3.5. The closest
a die could possibly be to this average is 0.5 (if it were to roll a 3 or a 4) and the furthest it could
possibly be is 2.5 (if it were to roll a 1 or a 6). Therefore the standard deviation, a typical distance
from average, should be somewhere between 0.5 and 2.5.
To calculate the quantity exactly, let X represent the roll of a die. By definition, V ar [X ] =
E [(X − 3.5)2 ], and the values that (X − 3.5)2 may assume are determined by the six values X
may take on.
$$Var[X] = E[(X - 3.5)^2] = (2.5)^2\tfrac{1}{6} + (1.5)^2\tfrac{1}{6} + (0.5)^2\tfrac{1}{6} + (-0.5)^2\tfrac{1}{6} + (-1.5)^2\tfrac{1}{6} + (-2.5)^2\tfrac{1}{6} = \frac{35}{12}.$$
So, $SD[X] = \sqrt{\frac{35}{12}} \approx 1.71$, which is near the midpoint of the range of our estimate above.
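The same computation is mechanical enough to script. A minimal sketch (our own illustration) evaluates Var[X] = E[(X − µ)²] and the standard deviation for a fair die.

```python
import math

faces = range(1, 7)
mu = sum(faces) / 6                                # 3.5
variance = sum((x - mu) ** 2 for x in faces) / 6   # 35/12
print(variance, math.sqrt(variance))               # 2.9166..., about 1.71
```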
Theorem 4.2.4. Let a ∈ R and let X be a random variable with finite variance (and thus, with
finite expected value as well). Then,
(a) V ar [aX ] = a2 · V ar [X ];
(b) SD [aX ] = |a| · SD [X ];
(c) V ar [X + a] = V ar [X ]; and
(d) SD [X + a] = SD [X ].
Proof of (a) and (b) - $Var[aX] = E[(aX - E[aX])^2]$. Using known properties of expected value this may be rewritten as
$$Var[aX] = E[(aX - aE[X])^2] = E[a^2(X - E[X])^2] = a^2 E[(X - E[X])^2] = a^2 Var[X].$$
That concludes the proof of (a). The result from (b) follows by taking square roots of both sides of this equation.
Proof of (c) and (d) - (See Exercises)
The variance may also be computed using a different (but equivalent) formula if E [X ] and
E [X 2 ] are known.
Theorem 4.2.5. Let X be a random variable for which E [X ] and E [X 2 ] are both finite. Then
V ar [X ] = E [X 2 ] − (E [X ])2 .
Proof -
$$Var[X] = E[(X - E[X])^2] = E[X^2 - 2XE[X] + (E[X])^2] = E[X^2] - 2E[XE[X]] + E[(E[X])^2].$$
But E[X] is a constant, so
$$Var[X] = E[X^2] - 2E[X]E[X] + (E[X])^2 = E[X^2] - (E[X])^2.$$
In statistics we frequently want to consider the sum or average of many random variables.
As such it is useful to know how the variance of a sum relates to the variances of each variable
separately. Toward that goal we have
Theorem 4.2.6. If X and Y are independent random variables, both with finite expectation and
finite variance, then
(a) $Var[X + Y] = Var[X] + Var[Y]$; and
(b) $SD[X + Y] = \sqrt{(SD[X])^2 + (SD[Y])^2}$.
Proof - Using Theorem 4.2.5,
$$Var[X + Y] = E[(X + Y)^2] - (E[X + Y])^2 = E[X^2 + 2XY + Y^2] - \left((E[X])^2 + 2E[X]E[Y] + (E[Y])^2\right).$$
Since X and Y are independent, Theorem 4.1.10 gives $E[XY] = E[X]E[Y]$, so the middle terms cancel and
$$Var[X + Y] = E[X^2] - (E[X])^2 + E[Y^2] - (E[Y])^2 = Var[X] + Var[Y],$$
which proves (a). Part (b) follows by taking square roots.
As with expected value, the variances of the common discrete random variables can be calculated
from their corresponding distributions.
Example 4.2.8. (Variance of a Bernoulli(p))
Let X ∼ Bernoulli(p). We have already calculated that E [X ] = p. Since X only takes on the
values 0 or 1 it is always true that X 2 = X. Therefore E [X 2 ] = E [X ] = p.
So, V ar [X ] = E [X 2 ] − (E [X ])2 = p − p2 = p(1 − p).
$$Var[Y] = Var[X_1 + X_2 + \cdots + X_n] = Var[X_1] + Var[X_2] + \cdots + Var[X_n] = np(1-p).$$
For an application of this computation we return to the idea of sampling from a population
where some members of the population have a certain characteristic and others do not. The goal
is to provide an estimate of the number of people in the sample that have the characteristic. For
this example, suppose we were to randomly select 100 people from a large city in which 20% of
the population works in a service industry. How many of the 100 people from our sample should
we expect to be service industry workers?
If the sampling is done without replacement (so we cannot pick the same person twice), then
strictly speaking the desired number would be described by a hypergeometric random variable.
However, we have also seen that there is little difference between the binomial and hypergeometric
distributions when the size of the sample is small relative to the size of the population. So since
the sample is only 100 people from a “large city”, we will assume this situation is modeled by a
binomial random variable. Specifically, since 20% of the population consists of service workers, we will assume $X \sim$ Binomial(100, 0.2).
The simplest way to answer the question of how many service industry workers to expect
within the sample is to compute the expected value of X. In this case E [X ] = 100(0.2) = 20,
so we should expect around 20 of the 100 people in the sample to be service workers. However,
this is an incomplete answer to the question since it only provides an average value; the actual
number of service workers in the sample is probably not going to be exactly 20, it’s only likely
to be around 20 on average. A more complete answer to the question would give an estimate as
to how far away from 20 the actual value is likely to be. But this is precisely what the standard
deviation describes – an estimate of the likely difference between the actual result of the random
variable and its expected value.
In this case $Var[X] = 100(0.2)(0.8) = 16$ and so $SD[X] = \sqrt{16} = 4$. This means that the actual number of service industry workers in the sample will typically be about 4 or so away from
the expected value of 20, so a more complete answer to the question would be “The sample is
likely to have around 16 to 24 service workers in it". That is not to say that the actual number of service workers is guaranteed to fall in that range, but the range provides a sort of likely error associated with the estimate of 20. Results in the 16 to 24 range should be considered fairly common. Results far outside that range, while possible, should be considered fairly unusual.
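Simulation makes the "16 to 24" claim concrete. The sketch below (hypothetical code, with parameter names of our choosing) draws many samples of 100 people and reports how often the count of service workers lands within one standard deviation of 20.

```python
import random

def sample_count(n=100, p=0.2):
    # Number of service workers in a sample of n, modeled as Binomial(n, p).
    return sum(1 for _ in range(n) if random.random() < p)

trials = 20_000
counts = [sample_count() for _ in range(trials)]
within = sum(1 for c in counts if 16 <= c <= 24) / trials
print(within)  # a clear majority of samples fall in the 16-24 range
```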
Recall in Example 4.1.17 we calculated E [X ] using a technique in which the sum describing
E [X ] was computed based on another sum which only involved the distribution of X directly.
This second sum equalled 1 since it simply added up the probabilities that X assumed each of its
possible values. In a similar fashion, it is sometimes possible to calculate a sum describing E [X 2 ]
in terms of a sum for E [X ] which is already known. From that point, Theorem 4.2.5 may be used
to calculate the variance and standard deviation of X. This technique will be illustrated in the
next example in which we calculate the spread associated with a geometric random variable.
Example 4.2.10. (Variance of a Geometric(p))
Let $0 < p < 1$ and let $X \sim$ Geometric(p), for which we know $E[X] = \frac{1}{p}$. Then
$$E[X^2] = \sum_{k=1}^{\infty} k^2 p(1-p)^{k-1}.$$
To evaluate this series we will again work with its partial sums. For any $n \geq 1$, let
$$S_n = \sum_{k=1}^{n} k^2 p(1-p)^{k-1} = \sum_{k=1}^{n} k^2(1 - (1-p))(1-p)^{k-1}$$
$$= \sum_{k=1}^{n} k^2(1-p)^{k-1} - \sum_{k=1}^{n} k^2(1-p)^{k}$$
$$= 1 + \sum_{k=2}^{n} (2k-1)(1-p)^{k-1} - n^2(1-p)^{n}$$
$$= 1 - \sum_{k=2}^{n} (1-p)^{k-1} + 2\sum_{k=2}^{n} k(1-p)^{k-1} - n^2(1-p)^{n}$$
$$= 2 - \sum_{k=1}^{n} (1-p)^{k-1} + 2\left(-1 + \sum_{k=1}^{n} k(1-p)^{k-1}\right) - n^2(1-p)^{n}$$
$$= -\frac{1 - (1-p)^{n}}{p} + \frac{2}{p}\sum_{k=1}^{n} kp(1-p)^{k-1} - n^2(1-p)^{n}.$$
Using standard results from analysis and the result from Example 4.1.15 we know that for $0 < p < 1$,
$$\lim_{n \to \infty} \sum_{k=1}^{n} kp(1-p)^{k-1} = \frac{1}{p}, \quad \lim_{n \to \infty} (1-p)^{n} = 0, \quad \text{and} \quad \lim_{n \to \infty} n^2(1-p)^{n} = 0.$$
Therefore $S_n \to -\frac{1}{p} + \frac{2}{p^2}$ as $n \to \infty$. Hence
$$E[X^2] = -\frac{1}{p} + \frac{2}{p^2}.$$
Using Theorem 4.2.5 the variance may then be calculated as
$$Var[X] = E[X^2] - (E[X])^2 = \frac{2}{p^2} - \frac{1}{p} - \left(\frac{1}{p}\right)^2 = \frac{1}{p^2} - \frac{1}{p}.$$
A similar technique may be used for calculating the variance of a Poisson random variable, a
fact which is left as an exercise. We finish this subsection with a computation of the variance
of a hypergeometric distribution using an idea similar to how we calculated its expected value in
Example 4.1.17.
Example 4.2.11. Let m and r be positive integers and let N be an integer with N > max{m, r}
and let X ∼ HyperGeo(N , r, m). To calculate E [X 2 ], as j ranges over the values of X,
$$E[X^2] = \sum_{j} j^2 \cdot \frac{\binom{r}{j}\binom{N-r}{m-j}}{\binom{N}{m}} = \sum_{j} j^2 \cdot \frac{\frac{r}{j}\binom{r-1}{j-1}\binom{(N-1)-(r-1)}{(m-1)-(j-1)}}{\frac{N}{m}\binom{N-1}{m-1}}$$
$$= \left(\frac{rm}{N}\right)\sum_{j} j \cdot \frac{\binom{r-1}{j-1}\binom{(N-1)-(r-1)}{(m-1)-(j-1)}}{\binom{N-1}{m-1}} = \left(\frac{rm}{N}\right)\sum_{k} (k+1) \cdot \frac{\binom{r-1}{k}\binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}},$$
where $k = j - 1$ as before. The last sum splits into the expected value of a HyperGeo(N − 1, r − 1, m − 1) random variable plus the sum of its probabilities, so it equals $\frac{(r-1)(m-1)}{N-1} + 1$. Therefore
$$Var[X] = E[X^2] - (E[X])^2 = \left(\frac{rm}{N}\right)\left(\frac{(r-1)(m-1)}{N-1} + 1\right) - \left(\frac{rm}{N}\right)^2 = \frac{N^2rm - Nrm^2 - Nr^2m + r^2m^2}{N^2(N-1)}.$$
As with the computation of expected value, the cases of m = 0 and r = 0 must be handled
separately, but yield the same result.
Many random variables may be rescaled into a standard format by shifting them so that they have
an average of zero and then rescaling them so that they have a variance (and standard deviation)
of one. We introduce this idea now, though its chief importance will not be realized until later.
We say that a random variable X is standardized if $E[X] = 0$ and $Var[X] = 1$.
Theorem 4.2.13. Let X be a discrete random variable with finite expected value and finite, non-zero variance. Then $Z = \frac{X - E[X]}{SD[X]}$ is a standardized random variable.
Proof - First,
$$E[Z] = E\left[\frac{X - E[X]}{SD[X]}\right] = \frac{E[X - E[X]]}{SD[X]} = \frac{E[X] - E[X]}{SD[X]} = 0.$$
Then, by Theorem 4.2.4, $Var[Z] = \frac{Var[X]}{(SD[X])^2} = 1$.
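Standardizing is a two-step rescaling, and the theorem is easy to verify numerically. A minimal sketch (our own, using an arbitrary illustrative distribution) standardizes a small random variable and checks that the result has mean 0 and variance 1.

```python
import math

pmf = {0: 0.2, 1: 0.5, 4: 0.3}  # an arbitrary illustrative distribution

mu = sum(t * p for t, p in pmf.items())
var = sum((t - mu) ** 2 * p for t, p in pmf.items())
sd = math.sqrt(var)

# Distribution of Z = (X - mu) / sd.
pmf_z = {(t - mu) / sd: p for t, p in pmf.items()}
mu_z = sum(z * p for z, p in pmf_z.items())
var_z = sum((z - mu_z) ** 2 * p for z, p in pmf_z.items())
print(round(mu_z, 12), round(var_z, 12))  # 0.0 and 1.0
```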
For easy reference we finish off this section by providing a chart of values associated with
common discrete distributions.
exercises
Calculate the expected value and standard deviation of this random variable. What is the probabil-
ity this random variable will produce a result more than one standard deviation from its expected
value?
Ex. 4.2.2. Answer the following questions about flips of a fair coin.
(a) Calculate the standard deviation of the number of heads that show up in 100 flips of a fair
coin.
(b) Show that if the number of coins is quadrupled (to 400) the standard deviation only doubles.
Ex. 4.2.3. Suppose we begin rolling a die, and let X be the number of rolls needed before we see
the first 3.
(b) Calculate SD [X ].
(c) Viewing SD [X ] as a typical distance of X from its expected value, would it seem unusual to
roll the die more than nine times before seeing a 3?
(e) Calculate the probability X produces a result within one standard deviation of its expected
value.
Ex. 4.2.4. A key issue in statistical sampling is the determination of how much a sample is likely
to differ from the population it came from. This exercise explores some of these ideas.
(a) Suppose a large city is exactly 50% women and 50% men and suppose we randomly select
60 people from this city as part of a sample. Let X be the number of women in the sample.
What are the expected value and standard deviation of X? Given these values, would it seem
unusual if fewer than 45% of the individuals in the sample were women?
(b) Repeat part (a), but now assume that the sample consists of 600 people.
Ex. 4.2.5. Calculate the variance and standard deviation of the value of the lottery ticket from
Example 3.1.4.
Ex. 4.2.6. Prove parts (c) and (d) of Theorem 4.2.4.
Ex. 4.2.7. Let $X \sim$ Binomial(n, p). Show that for $0 < p < 1$, this random variable has the largest standard deviation when $p = \frac{1}{2}$.
Ex. 4.2.8. Follow the steps below to calculate the variance of a random variable with a Uniform({1, 2, . . . , n})
distribution.
(a) Prove that $\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$. (Induction is one way to do this).
Ex. 4.2.9. This exercise provides an example of a random variable with finite expected value, but infinite variance. Let X be a random variable for which $P\left(X = \frac{2^n}{n(n+1)}\right) = \frac{1}{2^n}$ for all integers $n \geq 1$.
(a) Prove that X is a well-defined variable by showing $\sum_{n=1}^{\infty} P\left(X = \frac{2^n}{n(n+1)}\right) = 1$.
When there is no confusion about what random variable is being discussed, it is usual to use the
Greek letter µ in place of E [X ] and σ in place of SD [X ]. When more than one variable is involved
the same letters can be used with subscripts (µX and σX ) to indicate which variable is being
described.
In statistics one frequently measures results in terms of “standard units” – the number of
standard deviations a result is from its expected value. For instance if µ = 12 and σ = 5, then a
result of X = 20 would be 1.6 standard units because 20 = µ + 1.6σ. That is, 20 is 1.6 standard
deviations above expected value. Similarly a result of X = 10 would be −0.4 standard units
because 10 = µ − 0.4σ.
Since the standard deviation measures a typical distance from average, results that are within
one standard deviation from average (between −1 and +1 standard units) will tend to be fairly
common, while results that are more than two standard deviations from average (less than −2 or
greater than +2 in standard units) will usually be relatively rare. The likelihoods of some such
events will be calculated in the next two examples. Notice that the event (|X − µ| ≤ kσ ) describes
those outcomes of X that are within k standard deviations from average.
Example 4.3.1. Let Y represent the sum of two dice. How likely is it that Y will be within
one standard deviation of its average? How likely is it that Y will be more than two standard
deviations from its average?
We can use our previous calculations that $\mu = 7$ and $\sigma = \sqrt{\frac{35}{6}} \approx 2.42$. The achievable values that are within one standard deviation of average are 5, 6, 7, 8, and 9. So the probability that the sum of two dice will be within one standard deviation of average is
$$P(|Y - \mu| \leq \sigma) = P(Y \in \{5, 6, 7, 8, 9\}) = \frac{24}{36} \approx 0.667.$$
There is about a 66.7% chance that a pair of dice will fall within one standard deviation of their expected value.
Two standard deviations is $2\sqrt{\frac{35}{6}} \approx 4.83$. Only the results 2 and 12 are further than this distance from the expected value, so the probability that Y will be more than two standard deviations from average is
$$P(|Y - \mu| > 2\sigma) = P(Y \in \{2, 12\}) = \frac{2}{36} \approx 0.056.$$
There is only about a 5.6% chance that a pair of dice will be more than two standard deviations from expected value.
Example 4.3.2. If $X \sim$ Uniform($\{1, 2, \ldots, 100\}$), what is the probability that X will be within one standard deviation of expected value? What is the probability it will be more than two standard deviations from expected value?
Again, based on earlier calculations we know that $\mu = \frac{101}{2} = 50.5$ and that $\sigma = \sqrt{\frac{9999}{12}} \approx 28.9$.
Of the possible values that X can achieve, only the numbers 22, 23, . . . , 79 fall within one standard
deviation of average. So the desired probability is
$$P(|X - \mu| \leq \sigma) = P(X \in \{22, 23, \ldots, 79\}) = \frac{58}{100}.$$
There is a 58% chance that this random variable will be within one standard deviation of expected
value.
Similarly we can calculate that two standard deviations is $2\sqrt{\frac{9999}{12}} \approx 57.7$. Since $\mu = 50.5$ and since the minimal and maximal values of X are 1 and 100 respectively, results that are more than two standard deviations from average cannot happen at all for this random variable. In other words, $P(|X - \mu| > 2\sigma) = 0$.
The examples of the previous section show that the exact probabilities a random variable will fall
within a certain number of standard deviations of its expected value depend on the distribution of
the random variable. However, there are some general results that apply to all random variables.
To prove these results we will need to investigate some inequalities.
Theorem 4.3.3. (Markov’s Inequality) Let X be a discrete random variable which takes on
only non-negative values and suppose that X has a finite expected value. Then for any c > 0,
$$P(X \geq c) \leq \frac{\mu}{c}.$$
Proof - Let T be the range of X, so T is a countable subset of the non-negative real numbers. By
dividing T into those numbers smaller than c and those numbers that are at least as large as c we
have
$$\mu = \sum_{t \in T} t \cdot P(X = t) = \sum_{t \in T,\, t < c} t \cdot P(X = t) + \sum_{t \in T,\, t \geq c} t \cdot P(X = t).$$
The first sum must be non-negative, since we assumed that T consisted of only non-negative
numbers, so we only make the quantity smaller by deleting it. Likewise, for each term in the
second sum, t ≥ c so we only make the quantity smaller by replacing t by c. This gives us
$$\mu = \sum_{t \in T,\, t < c} t \cdot P(X = t) + \sum_{t \in T,\, t \geq c} t \cdot P(X = t) \geq \sum_{t \in T,\, t \geq c} c \cdot P(X = t) = c \sum_{t \in T,\, t \geq c} P(X = t).$$
The events (X = t) indexed over all values t ∈ T for which t ≥ c are a countable collection of
disjoint sets whose union is (X ≥ c). So,
$$\mu \geq c \sum_{t \in T,\, t \geq c} P(X = t) = c \cdot P(X \geq c).$$
Dividing both sides by c completes the proof.
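Markov's inequality is easy to test against an exact distribution. The sketch below (our illustration, with parameters chosen arbitrarily) compares the exact tail P(X ≥ c) with the bound µ/c for a Binomial(10, 0.3) variable at several cutoffs.

```python
from math import comb

n, p = 10, 0.3
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
mu = n * p  # expected value of a Binomial(n, p)

for c in (3, 5, 8):
    tail = sum(prob for k, prob in pmf.items() if k >= c)
    print(c, tail, mu / c)  # the tail never exceeds the Markov bound mu/c
```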
exercises
(c) Calculate P (|X − µ| > 2σ ), the probability that X will be more than two standard deviations
from average. Approximate your answer to the nearest tenth of a percent.
Ex. 4.3.3. Let X ∼ P oisson(3).
(a) Calculate µ and σ.
(b) Calculate P (|X − µ| ≤ σ ), the probability that X will be within one standard deviation of
average. Approximate your answer to the nearest tenth of a percent.
(c) Calculate P (|X − µ| > 2σ ), the probability that X will be more than two standard deviations
from average. Approximate your answer to the nearest tenth of a percent.
Ex. 4.3.4. Let $X \sim$ Binomial($n$, $\frac{1}{2}$). Determine the smallest value of n for which $P(|X - \mu| > 4\sigma) > 0$. That is, what is the smallest n for which there is a positive probability that X will be more than four standard deviations from average?
Ex. 4.3.5. For k ≥ 1 there are distributions for which Chebychev’s inequality is an equality.
(a) Let X be a random variable with probability mass function $P(X = 1) = P(X = -1) = \frac{1}{2}$. Prove that Chebychev's inequality is an equality for this random variable when $k = 1$.
(e) Use parts (b) and (d) to derive a contradiction. Note that this proves that the assumption
that was made in part (d), namely that P (|X − µ| > σ ) = 1, cannot be true for any discrete
random variable where µ and σ are finite quantities. In other words, no random variable can
produce only values that are more than one standard deviation from average.
Ex. 4.3.7. Let X be a discrete random variable with finite expected value and finite variance.
(a) Prove P (|X − µ| ≥ σ ) = 1 ⇐⇒ P (|X − µ| = σ ) = 1. (A random variable that assumes
only values one or more standard deviations from average must only produce values that are
exactly one standard deviation from average).
(b) Prove that if $P(|X - \mu| > \sigma) > 0$ then $P(|X - \mu| < \sigma) > 0$. (If a random variable is able to produce values more than one standard deviation from average, it must also be able to produce values that are less than one standard deviation from average).
In previous chapters we saw that information that a particular event had occurred could substan-
tially change the probability associated with another event. That realization led us to the notion of
conditional probability. It is also reasonable to ask how such information might affect the expected
value or variance of a random variable.
Example 4.4.2. A die is rolled. What are the expected value and variance of the result given that
the roll was even?
Let X be the die roll. Then X ∼ Uniform({1, 2, 3, 4, 5, 6}), but conditioned on the event A
that the roll was even, this changes so that
$$P(X = 2|A) = P(X = 4|A) = P(X = 6|A) = \frac{1}{3}.$$
Therefore,
$$E[X|A] = 2\left(\tfrac{1}{3}\right) + 4\left(\tfrac{1}{3}\right) + 6\left(\tfrac{1}{3}\right) = 4.$$
Note that the (unconditioned) expected value of a die roll is E [X ] = 3.5, so the knowledge of event
A slightly increases the expected value of the die roll.
The conditional variance is
$$Var[X|A] = (2-4)^2\left(\tfrac{1}{3}\right) + (4-4)^2\left(\tfrac{1}{3}\right) + (6-4)^2\left(\tfrac{1}{3}\right) = \frac{8}{3}.$$
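Conditioning on A just renormalizes the probabilities, which the following sketch (our own addition) makes explicit for the die roll.

```python
outcomes = range(1, 7)
even = [x for x in outcomes if x % 2 == 0]

# Conditional distribution given A = "roll is even": uniform on {2, 4, 6}.
cond_pmf = {x: 1 / len(even) for x in even}

e_given_a = sum(x * p for x, p in cond_pmf.items())
var_given_a = sum((x - e_given_a) ** 2 * p for x, p in cond_pmf.items())
print(e_given_a, var_given_a)  # 4.0 and 8/3
```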
respectively. We are interested in, say, $E[X|Y = 3]$. When $Y = 3$ an ace was seen on draw 3, but not on draws 1 or 2. Hence
$$P(\text{king on draw } n \mid Y = 3) = \begin{cases} \frac{4}{48} & \text{if } n = 1 \text{ or } 2 \\ 0 & \text{if } n = 3 \\ \frac{4}{52} & \text{if } n > 3 \end{cases}$$
so that
$$P(X = n \mid Y = 3) = \begin{cases} \left(\frac{44}{48}\right)^{n-1}\frac{4}{48} & \text{if } n = 1 \text{ or } 2 \\ 0 & \text{if } n = 3 \\ \left(\frac{44}{48}\right)^{2}\left(\frac{48}{52}\right)^{n-4}\frac{4}{52} & \text{if } n > 3 \end{cases}$$
For example, when $n > 3$, in order to have $X = n$ a non-king must have been seen on draws 1 and 2 (each with probability $\frac{44}{48}$), a non-king must have resulted on draw 3 (which is automatic, since an ace was drawn), a non-king must have been seen on each of draws 4 through $n - 1$ (each with probability $\frac{48}{52}$), and finally a king was produced on draw n (with probability $\frac{4}{52}$). Hence,
$$E[X|Y=3] = \sum_{n=1}^{2} n\left(\frac{44}{48}\right)^{n-1}\frac{4}{48} + \sum_{n=4}^{\infty} n\left(\frac{44}{48}\right)^{2}\left(\frac{48}{52}\right)^{n-4}\frac{4}{52}$$
$$= \sum_{n=1}^{2} n\left(\frac{44}{48}\right)^{n-1}\frac{4}{48} + \sum_{m=0}^{\infty} (m+4)\left(\frac{44}{48}\right)^{2}\left(\frac{48}{52}\right)^{m}\frac{4}{52}.$$
But
$$\sum_{m=0}^{\infty} (m+4)r^{m} = \sum_{m=0}^{\infty}\left(3r^{m} + \frac{d}{dr}r^{m+1}\right) = \frac{3}{1-r} + \frac{d}{dr}\left(\frac{r}{1-r}\right) = \frac{3}{1-r} + \frac{1}{(1-r)^2},$$
so
$$E[X|Y=3] = \frac{4}{48} + 2\left(\frac{44}{48}\right)\frac{4}{48} + \left(\frac{44}{48}\right)^{2}\frac{4}{52}\left(\frac{3}{1-(48/52)} + \frac{1}{(1-(48/52))^2}\right)$$
$$= \frac{4}{48} + 2\left(\frac{44}{48}\right)\frac{4}{48} + \left(\frac{44}{48}\right)^{2}\frac{4}{52}\left(\frac{3 \times 52}{4} + \frac{52^2}{4^2}\right)$$
$$= \frac{1}{12} + 2\left(\frac{11}{12}\right)\frac{1}{12} + 3\left(\frac{11}{12}\right)^{2} + \frac{52}{4}\left(\frac{11}{12}\right)^{2} = \frac{985}{72} \approx 13.68.$$
Given that the first ace appeared on draw 3, it takes an average of between 13 and 14 draws until the first king appears. Compare this to the unconditional E[X]. Since $X \sim$ Geometric($\frac{4}{52}$) we know $E[X] = \frac{52}{4} = 13$. In other words, on average it takes 13 draws to observe the first king. But given that the first ace appeared on draw three, we should expect to need about 0.68 draws more (on average) to see the first king.
Recall how Theorem 1.3.2 described a way in which a non-conditional probability could be
calculated in terms of conditional probabilities. There is an analogous theorem for expected value.
Theorem 4.4.4. Let $X : S \to T$ be a discrete random variable and let $\{B_i : i \geq 1\}$ be a disjoint collection of events for which $P(B_i) > 0$ for all i and such that $\bigcup_{i=1}^{\infty} B_i = S$. Suppose $P(B_i)$ and $E[X|B_i]$ are known. Then E[X] may be computed as
$$E[X] = \sum_{i=1}^{\infty} E[X|B_i]P(B_i).$$
Example 4.4.5. A venture capitalist estimates that regardless of whether the economy strengthens,
weakens, or remains the same in the next fiscal quarter, a particular investment could either gain
or lose money. However, he figures that if the economy strengthens, the investment should, on
average, earn 3 million dollars. If the economy remains the same, he figures the expected gain
on the investment will be 1 million dollars, while if the economy weakens, the investment will, on
average, lose 1 million dollars. He also trusts economic forecasts which predict a 50% chance of a weaker economy, a 40% chance of a stagnant economy, and a 10% chance of a stronger economy. What should he calculate as the expected return on the investment?
Let X be the return on investment and let A, B, and C represent the events that the economy
will be stronger, the same, and weaker in the next quarter, respectively. Then the estimates on
return give the following information in millions: $E[X|A] = 3$, $E[X|B] = 1$, and $E[X|C] = -1$, with $P(A) = 0.1$, $P(B) = 0.4$, and $P(C) = 0.5$. Therefore,
$$E[X] = E[X|A]P(A) + E[X|B]P(B) + E[X|C]P(C) = 3(0.1) + 1(0.4) + (-1)(0.5) = 0.2,$$
so the expected return on the investment is 0.2 million dollars.
Theorem 4.4.6. Let X and Y be two discrete random variables on a sample space S with Y : S →
T . Let g : T → R be defined as g (y ) = E [X|Y = y ]. Then
E [g (Y )] = E [X ].
It is common to use $E[X|Y]$ to denote $g(Y)$, after which the theorem may be expressed as $E[E[X|Y]] = E[X]$. This can be slightly confusing notation, but one must keep in mind that the exterior expected value in the expression $E[E[X|Y]]$ refers to the average of $E[X|Y]$ viewed as a function of Y.
Proof - As y ranges over T , the events (Y = y ) are disjoint and cover all of S. Therefore, by
Theorem 4.4.4,
$$E[g(Y)] = \sum_{y \in T} g(y)P(Y = y) = \sum_{y \in T} E[X|Y=y]P(Y = y) = E[X].$$
Example 4.4.7. Let Y ∼ Uniform({1, 2, . . . , n}) and let X be the number of heads on Y flips of
a coin. What is the expected value of X?
Without Theorem 4.4.6 this problem would require computing many complicated probabilities.
However, it is made much simpler by noting that the distribution of X is given conditionally by
$(X|Y = j) \sim$ Binomial($j$, $\frac{1}{2}$). Therefore we know $E[X|Y = j] = \frac{j}{2}$. Using the notation above, this may be written as $E[X|Y] = \frac{Y}{2}$, after which
$$E[X] = E[E[X|Y]] = E\left[\frac{Y}{2}\right] = \frac{1}{2} \cdot \frac{n+1}{2} = \frac{n+1}{4}.$$
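The answer (n + 1)/4 can be verified by brute-force conditioning, as in the sketch below (an illustration we add): average the conditional means j/2 over the uniform choice of j.

```python
n = 10

# E[X] = sum over j of E[X | Y = j] * P(Y = j), with E[X | Y = j] = j / 2.
expected = sum((j / 2) * (1 / n) for j in range(1, n + 1))
print(expected, (n + 1) / 4)  # both 2.75 for n = 10
```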
Though it requires a somewhat more complicated formula, the variance of a random variable
can be computed from conditional information.
Theorem 4.4.8. Let $X : S \to T$ be a discrete random variable and let $\{B_i : i \geq 1\}$ be a disjoint collection of events for which $P(B_i) > 0$ for all i and such that $\bigcup_{i=1}^{\infty} B_i = S$. Suppose $E[X|B_i]$ and $Var[X|B_i]$ are known. Then Var[X] may be computed as
$$Var[X] = \sum_{i=1}^{\infty} \left(Var[X|B_i] + (E[X|B_i])^2\right)P(B_i) - (E[X])^2.$$
Therefore,
$$\sum_{i=1}^{\infty} \left(Var[X|B_i] + (E[X|B_i])^2\right)P(B_i) = \sum_{i=1}^{\infty} E[X^2|B_i]P(B_i) = E[X^2],$$
where the last equality follows from Theorem 4.4.4 applied to $X^2$, and the result now follows from Theorem 4.2.5.
Theorem 4.4.9. Let X and Y : S → T be two discrete random variables on a sample space S. As
in Theorem 4.4.6 let g (y ) = E [X|Y = y ]. Let h(y ) = V ar [X|Y = y ]. Denoting g (Y ) by E [X|Y ]
and denoting h(Y ) by V ar [X|Y ], then
(3) V ar [E [X|Y ]] = E [(E [X|Y ])2 ] − (E [E [X|Y ]])2 = E [(E [X|Y ])2 ] − (E [X ])2 .
Example 4.4.10. The number of eggs N found in nests of a certain species of turtles has a Poisson
distribution with mean λ. Each egg has probability p of being viable and this event is independent
from egg to egg. Find the mean and variance of the number of viable eggs per nest.
Let N be the total number of eggs in a nest and X the number of viable ones. Then if N = n,
X has a binomial distribution with number of trials n and probability p of success for each trial.
Thus, if $N = n$, X has mean np and variance $np(1-p)$. That is, $E[X|N = n] = pn$ and $Var[X|N = n] = p(1-p)n$, or
$$E[X|N] = pN; \quad Var[X|N] = p(1-p)N.$$
Hence
$$E[X] = E[E[X|N]] = E[pN] = pE[N] = p\lambda$$
and
$$Var[X] = E[Var[X|N]] + Var[E[X|N]] = E[p(1-p)N] + Var[pN] = p(1-p)E[N] + p^2 Var[N] = p(1-p)\lambda + p^2\lambda = p\lambda.$$
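A short simulation confirms the pλ answer. The sketch below is hypothetical code of our own (the parameter values are arbitrary): it generates nest sizes as Poisson(λ) using Knuth's sampling method and keeps each egg with probability p.

```python
import math
import random

def poisson_draw(lam):
    # Knuth's method for sampling a Poisson(lam) variate.
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

lam, p, trials = 4.0, 0.6, 50_000
viable = [sum(1 for _ in range(poisson_draw(lam)) if random.random() < p)
          for _ in range(trials)]
print(sum(viable) / trials, p * lam)  # sample mean close to p * lambda = 2.4
```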
exercises
Ex. 4.4.1. Let X ∼ Geometric(p) and let A be event (X ≤ 3). Calculate E [X|A] and V ar [X|A].
Ex. 4.4.2. Calculate the variance of the quantity X from Example 4.4.7.
Ex. 4.4.3. Return to Example 4.4.5. Suppose that, in addition to the estimates on average return,
the investor had estimates on the standard deviations. If the economy strengthens or weakens, the
estimated standard deviation is 3 million dollars, but if the economy stays the same, the estimated
standard deviation is 2 million dollars. So, in millions of dollars,
Use this information, together with the conditional expectations from Example 4.4.5 to calculate
V ar [X ].
Ex. 4.4.4. A standard light bulb has an average lifetime of four years with a standard deviation of
one year. A Super D-Lux lightbulb has an average lifetime of eight years with a standard deviation
of three years. A box contains many bulbs – 90% of which are standard bulbs and 10% of which
are Super D-Lux bulbs. A bulb is selected at random from the box. What are the average and
standard deviation of the lifetime of the selected bulb?
Ex. 4.4.5. Let X and Y be described by the joint distribution
X = −1 X=0 X=1
Y = −1 1/15 2/15 2/15
Y =0 2/15 1/15 2/15
Y =1 2/15 2/15 1/15
Ex. 4.4.6. Let X and Y be discrete random variables. Let x be in the range of X and let y be in
the range of Y .
(a) Suppose X and Y are independent. Show that E [X|Y = y ] = E [X ] (and so E [X|Y ] =
E [X ]).
(b) Show that E [X|X = x] = x (and so E [X|X ] = X). (From results in this section we know
E [X|Y ] is always a random variable with expected value equal to E [X ]. The results above
in some sense show two extremes. When X and Y are independent, E [X|Y ] is a constant
random variable E [X ]. When X and Y are equal, E [X|X ] is just X itself).
Ex. 4.4.7. Let X ∼ Uniform {1, 2, . . . , n} be independent of Y ∼ Uniform {1, 2, . . . , n}. Let
Z = max(X, Y ) and W = min(X, Y ).
(b) Find E[Z | W].
When faced with two different random variables, we are frequently interested in how the two
different quantities relate to each other. Often the purpose of this is to predict something about
one variable knowing information about the other. For instance, if rainfall amounts in July affect
the quantity of corn harvested in August, then a farmer, or anyone else keenly interested in the
supply and demand of the agriculture industry, would like to be able to use the July information
to help make predictions about August costs.
4.5.1 Covariance
Just as we developed the concepts of expected value and standard deviation to summarize a single
random variable, we would like to develop a number that describes something about how two
different random variables X and Y relate to each other.
Definition 4.5.1. (Covariance of X and Y ) Let X and Y be two discrete random variables on
a sample space S. Then the "covariance of X and Y" is defined as
$$Cov[X, Y] = E[(X - E[X])(Y - E[Y])].$$
Since it is defined in terms of an expected value, there is the possibility that the covariance may be infinite or not defined at all because the sum describing the expectation is divergent.
Notice that if X is larger than its average at the same time that Y is larger than its average
(or if X is smaller than its average at the same time Y is smaller than its average) then (X −
E [X ])(Y − E [Y ]) will contribute a positive result to the expected value describing the covariance.
Conversely, if X is smaller than E[X] while Y is larger than E[Y] or vice versa, a negative result
will be contributed toward the covariance. This means that when two variables tend to be both
above average or both below average simultaneously, the covariance will typically be positive (and
the variables are said to be positively correlated ), but when one variable tends to be above average
when the other is below average, the covariance will typically be negative (and the variables are
said to be negatively correlated ). When Cov [X, Y ] = 0 the variables X and Y are said to be
“uncorrelated”.
For example, suppose X and Y are the height and weight, respectively, of an individual ran-
domly selected from a large population. We might expect that Cov [X, Y ] > 0 since people who are
taller than average also tend to be heavier than average and people who are shorter than average
tend to be lighter. Conversely suppose X and Y represent elevation and air density at a randomly
selected point on Earth. We might expect Cov [X, Y ] < 0 since locations at a higher elevation tend
to have thinner air.
Example 4.5.2. Consider a pair of random variables X and Y with joint distribution
X = −1 X=0 X=1
Y = −1 1/15 2/15 2/15
Y =0 2/15 1/15 2/15
Y =1 2/15 2/15 1/15
By a routine calculation of the marginal distributions it can be shown that X, Y ∼ Uniform({−1, 0, 1})
and therefore that E [X ] = E [Y ] = 0. However, it is clear from the joint distribution that when
X = −1, then Y is more likely to be above average than below, while when X = 1, then Y is
more likely to be below average than above. This suggests the two random variables should have
a negative correlation. In fact, we can calculate
$$E[XY] = (-1)\left(\tfrac{4}{15}\right) + 0\left(\tfrac{9}{15}\right) + 1\left(\tfrac{2}{15}\right) = -\frac{2}{15},$$
and therefore $Cov[X, Y] = E[XY] - E[X]E[Y] = -\frac{2}{15}$.
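The covariance computation from a joint table is purely mechanical, as the following sketch (our own addition) shows for the distribution above.

```python
# Joint pmf from the table: joint[(x, y)] = P(X = x, Y = y).
joint = {(-1, -1): 1/15, (0, -1): 2/15, (1, -1): 2/15,
         (-1,  0): 2/15, (0,  0): 1/15, (1,  0): 2/15,
         (-1,  1): 2/15, (0,  1): 2/15, (1,  1): 1/15}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
print(e_xy - e_x * e_y)  # -2/15, approximately -0.1333
```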
As its name suggests, the covariance is closely related to the variance; indeed, $Cov[X, X] = Var[X]$.
Theorem 4.5.4. Let X and Y be discrete random variables with finite mean for which E [XY ] is
also finite. Then
Cov [X, Y ] = E [XY ] − E [X ]E [Y ].
As with the expected value, the covariance is a linear quantity. It is also related to the concept
of independence.
Theorem 4.5.5. Let X, Y , and Z be discrete random variables, and let a, b ∈ R. Then,
(d) If X and Y are independent with a finite covariance, then Cov [X, Y ] = 0.
Therefore, reversing the roles of X and Y does not change the correlation.
Proof of (2) - This follows from linearity properties of expected value. Using Theorem 4.5.4
Proof of (3) - This proof is essentially the same as that of (2) and is left as an exercise.
Proof of (4) - We have previously seen that if X and Y are independent, then E[XY] = E[X]E[Y].
Using Theorem 4.5.4 it follows that
Though independence of X and Y guarantees that they are uncorrelated, the converse is not
true. It is possible that Cov [X, Y ] = 0 and yet that X and Y are dependent, as the next example
shows.
Example 4.5.6. Let X, Y be two discrete random variables taking values {−1, 1}. Suppose their
joint distribution P (X = x, Y = y ) is given by the table
x=-1 x=1
4.5.2 Correlation
The possible size of Cov [X, Y ] has upper and lower bounds based on the standard deviations of
the two variables.
Theorem 4.5.7. Let X and Y be two discrete random variables both with finite variance. Then

−σX σY ≤ Cov [X, Y ] ≤ σX σY .
Proof - Standardize both variables and consider the expected value of their sum squared. Since
this is the expected value of a non-negative quantity,

0 ≤ E [((X − µX )/σX + (Y − µY )/σY )²]
  = E [(X − µX )²/σX² + 2(X − µX )(Y − µY )/(σX σY ) + (Y − µY )²/σY²]
  = E [(X − µX )²]/σX² + 2E [(X − µX )(Y − µY )]/(σX σY ) + E [(Y − µY )²]/σY²
  = 1 + 2 Cov [X, Y ]/(σX σY ) + 1.
Solving the inequality for the covariance yields

−σX σY ≤ Cov [X, Y ].
A similar computation (see Exercises) for the expected value of the squared difference of the
standardized variables shows
Cov [X, Y ] ≤ σX σY .
Putting both inequalities together proves the theorem.
Definition 4.5.8. The quantity Cov [X, Y ]/(σX σY ) from Theorem 4.5.7 is known as the “correlation” of
X and Y and is often denoted as ρ[X, Y ]. Thinking in terms of dimensional analysis, both the
numerator and denominator include the units of X and the units of Y . The correlation, therefore,
has no units associated with it. It is thus a dimensionless rescaling of the covariance and is
frequently used as an absolute measure of trends between the two variables. By Theorem 4.5.7, the
correlation always satisfies −1 ≤ ρ[X, Y ] ≤ 1.
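To illustrate, the correlation of the variables in Example 4.5.2 can be computed from the joint table; the moment helper below is our own shorthand, not the text's notation. By Theorem 4.5.7 the result must land in [−1, 1].

from fractions import Fraction as F

joint = {(-1, -1): F(1, 15), (0, -1): F(2, 15), (1, -1): F(2, 15),
         (-1,  0): F(2, 15), (0,  0): F(1, 15), (1,  0): F(2, 15),
         (-1,  1): F(2, 15), (0,  1): F(2, 15), (1,  1): F(1, 15)}

def moment(f):
    """E[f(X, Y)] over the joint pmf above."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

cov   = moment(lambda x, y: x * y) - moment(lambda x, y: x) * moment(lambda x, y: y)
var_x = moment(lambda x, y: x * x) - moment(lambda x, y: x) ** 2
var_y = moment(lambda x, y: y * y) - moment(lambda x, y: y) ** 2

rho = float(cov) / float(var_x * var_y) ** 0.5
print(rho)    # -0.2 (up to rounding), safely inside [-1, 1]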
Exercises
Ex. 4.5.1. Consider the experiment of flipping two coins. Let X be the number of heads among
the coins and let Y be the number of tails among the coins.
(a) Should you expect X and Y to be positively correlated, negatively correlated, or uncorrelated?
Why?
Ex. 4.5.2. Let X ∼ Uniform({0, 1, 2}) and let Y be the number of heads in X flips of a coin.
(a) Should you expect X and Y to be positively correlated, negatively correlated, or uncorrelated?
Why?
Ex. 4.5.3. Let X and Y be discrete random variables with finite variances and finite covariance.
(a) Show that

V ar [X + Y ] = V ar [X ] + V ar [Y ] + 2 Cov [X, Y ].

(A numerical check of this identity is sketched after part (c).)
(b) Use (a) to conclude that when X and Y are positively correlated, then V ar [X + Y ] >
V ar [X ] + V ar [Y ], while when X and Y are negatively correlated, V ar [X + Y ] < V ar [X ] +
V ar [Y ].
(c) Suppose Xi , 1 ≤ i ≤ n, are discrete random variables with finite variances and covariances.
Use induction and (a) to conclude that

V ar [∑_{i=1}^{n} Xi ] = ∑_{i=1}^{n} V ar [Xi ] + 2 ∑_{1≤i<j≤n} Cov [Xi , Xj ].
We conclude this section with a discussion of exchangeable random variables. In brief, we say
that a collection of random variables X1 , X2 , . . . , Xn is exchangeable if the joint probability mass
function f of (X1 , X2 , . . . , Xn ) is a symmetric function, meaning that for every permutation σ of
{1, 2, . . . , n},

f (x1 , x2 , . . . , xn ) = f (xσ(1) , xσ(2) , . . . , xσ(n) ).

In other words, the distribution of (X1 , X2 , . . . , Xn ) does not depend on the order in which the
Xi ’s appear. In particular, if X1 , X2 , . . . , Xn are exchangeable, then for any of the possible n!
permutations σ of {1, 2, . . . , n}, (X1 , X2 , . . . , Xn ) and (Xσ(1) , Xσ(2) , . . . , Xσ(n) ) have the same
distribution. Note also that any collection of independent and identically distributed random
variables is exchangeable.
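The defining symmetry can be tested mechanically for a finite joint pmf. Below is a minimal sketch (the function name is_exchangeable is ours) that checks every permutation of the coordinates; the example pmf, which depends only on the coordinate sum, is exchangeable but not independent.

from itertools import permutations

def is_exchangeable(joint):
    """Check symmetry of a finite joint pmf {(x1, ..., xn): p}."""
    n = len(next(iter(joint)))
    return all(joint.get(tuple(outcome[i] for i in sigma), 0) == p
               for outcome, p in joint.items()
               for sigma in permutations(range(n)))

# Two dependent coins whose pmf depends only on the sum of the coordinates:
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(is_exchangeable(joint))    # True, though X1 and X2 are not independent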
Example 4.6.3. Suppose we have an urn of m distinct objects labelled {1, 2, . . . , m}. Objects are
drawn at random from the urn without replacement until the urn is empty. Let Xi be the label
of the i-th object that is drawn. Then (X1 , X2 , . . . , Xm ) is a particular ordering of the objects in
the urn. Since each ordering is equally likely and there are m! possible orderings, the joint
probability mass function is

f (x1 , x2 , . . . , xm ) = P (X1 = x1 , X2 = x2 , . . . , Xm = xm ) = 1/m!,

whenever (x1 , x2 , . . . , xm ) is an ordering of {1, 2, . . . , m} (and 0 otherwise). Since this function
is symmetric, the random variables X1 , X2 , . . . , Xm are exchangeable.
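A short simulation supports the example: shuffling the urn and tallying the draw orders shows every ordering appearing with frequency near 1/m!. This Monte Carlo sketch (the constants are our own choices) is merely illustrative.

import random
from collections import Counter

m = 3                       # three labelled objects in the urn
trials = 60_000
counts = Counter()

for _ in range(trials):
    urn = list(range(1, m + 1))
    random.shuffle(urn)     # one complete draw order, without replacement
    counts[tuple(urn)] += 1

for order in sorted(counts):
    print(order, counts[order] / trials)   # each frequency close to 1/m! = 1/6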
Theorem 4.6.4. Let X1 , X2 , . . . , Xn be exchangeable random variables. Then they all have the
same distribution.
Proof - The random variables (X1 , X2 , . . . , Xn ) are exchangeable, so for any permutation σ and
any xi ∈ Range(Xi ),

P (X1 = x1 , . . . , Xn = xn ) = P (X1 = xσ(1) , . . . , Xn = xσ(n) ).

As this is true for all permutations σ, all the random variables must have the same range; otherwise,
if the ranges of any two of them differed, we could get a contradiction by choosing an appropriate
permutation. Let T denote the common range. Let i ∈ {2, . . . , n} and a, b ∈ T , and let

A = {xj ∈ T : 1 ≤ j ≤ n, j ≠ 1, i}.

Using the exchangeable property with the permutation σ given by σ(i) = 1, σ(1) = i, and σ(j) = j
for all j ≠ 1, i, we have that for any x2 , . . . , xi−1 , xi+1 , . . . , xn ∈ A,

P (X1 = a, X2 = x2 , . . . , Xi = b, . . . , Xn = xn ) = P (X1 = b, X2 = x2 , . . . , Xi = a, . . . , Xn = xn ).
Therefore,

P (X1 = a) = P ( ⋃_{b∈T} {X1 = a, Xi = b} )
           = ∑_{b∈T} P (X1 = a, Xi = b)
           = ∑_{b∈T} P ( ⋃_{xj∈A} {X1 = a, X2 = x2 , . . . , Xi−1 = xi−1 , Xi = b, Xi+1 = xi+1 , . . . , Xn = xn } )
           = ∑_{b∈T} ∑_{xj∈A} P (X1 = a, X2 = x2 , . . . , Xi = b, . . . , Xn = xn )
           = ∑_{b∈T} ∑_{xj∈A} P (X1 = b, X2 = x2 , . . . , Xi = a, . . . , Xn = xn )
           = ∑_{b∈T} P ( ⋃_{xj∈A} {X1 = b, X2 = x2 , . . . , Xi−1 = xi−1 , Xi = a, Xi+1 = xi+1 , . . . , Xn = xn } )
           = ∑_{b∈T} P (X1 = b, Xi = a)
           = P ( ⋃_{b∈T} {X1 = b, Xi = a} )
           = P (Xi = a).
So the distribution of Xi is the same as the distribution of X1 and hence all of them have the same
distribution.
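The conclusion of Theorem 4.6.4 can be verified exhaustively for the urn of Example 4.6.3: enumerating all m! equally likely orders shows that every draw position has the same (uniform) marginal distribution. The sketch below, with m = 4, is our own check rather than part of the text.

from itertools import permutations
from fractions import Fraction as F

m = 4
orders = list(permutations(range(1, m + 1)))   # all m! equally likely orders

for i in range(m):
    # Marginal distribution of the label seen on the i-th draw.
    marginal = {a: F(sum(1 for o in orders if o[i] == a), len(orders))
                for a in range(1, m + 1)}
    print(i + 1, marginal)    # every position: probability 1/4 for each label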
Example 4.6.5. (Sampling without Replacement) An urn contains b black balls and r red
balls. A ball is drawn at random and its colour noted. This procedure is repeated n times. Assume
that n ≤ b + r and let max(0, n − r) ≤ k ≤ min(n, b). In this example we examine the random
variables Xi given by

Xi = 1 if the i-th ball drawn is black,
     0 otherwise.
We have already seen (see Theorem 2.3.2 and Example 2.3.1) that

P (k black balls are drawn in n draws) = (n choose k) · ∏_{i=0}^{k−1} (b − i) ∏_{i=0}^{n−k−1} (r − i) / ∏_{i=0}^{n−1} (r + b − i).
Using the same proof we see that the joint probability mass function of (X1 , X2 , . . . , Xn ) is given
by

f (x1 , x2 , . . . , xn ) = ∏_{i=0}^{k−1} (b − i) ∏_{i=0}^{n−k−1} (r − i) / ∏_{i=0}^{n−1} (r + b − i),

where xi ∈ {0, 1} and k = ∑_{i=1}^{n} xi . It is clear from the right hand side of the above that the
function f depends only on ∑_{i=1}^{n} xi . Hence any permutation of the xi ’s will not change the
value of f . So f is a symmetric function and the random variables are exchangeable. Therefore,
by Theorem 4.6.4 we know that for any 1 ≤ i ≤ n,

P (Xi = 1) = P (X1 = 1) = b/(b + r).
So we can conclude that they are all identically distributed as Bernoulli(b/(b + r)) and the probability
of choosing a black ball in the i-th draw is b/(b + r) (see Exercise 4.6.4 for a similar result). Further,
for any i ≠ j, exchangeability implies that the pair (Xi , Xj ) has the same joint distribution as
(X1 , X2 ).
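For small urns the claim P (Xi = 1) = b/(b + r) can be verified by brute force: enumerating all equally likely orderings of the balls (listing repeated colours with multiplicity) gives the exact marginal at every position. The parameters b = 3, r = 2 below are our own choice.

from itertools import permutations
from fractions import Fraction as F

b, r = 3, 2
balls = [1] * b + [0] * r            # 1 = black, 0 = red

# All (b + r)! orderings of the listed balls are equally likely.
orders = list(permutations(balls))
for i in range(b + r):
    p_black = F(sum(o[i] for o in orders), len(orders))
    print("draw", i + 1, ":", p_black)    # 3/5 = b/(b + r) at every position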
Exercises
Ex. 4.6.1. Suppose X1 , X2 , . . . , Xn are exchangeable random variables. For any 2 ≤ m < n, show
that X1 , X2 , . . . , Xm are also a collection of exchangeable random variables.
Ex. 4.6.2. Suppose X1 , X2 , . . . , Xn are exchangeable random variables. Let T denote their common
range. Suppose b : T → R. Show that b(X1 ), b(X2 ), . . . , b(Xn ) is also a collection of exchangeable
random variables.
Ex. 4.6.3. Suppose n cards are drawn from a standard pack of 52 cards without replacement (so
we will assume n ≤ 52). For 1 ≤ i ≤ n, let Xi be random variables given by
Xi = 1 if the i-th card drawn is black in colour,
     0 otherwise.
(a) Suppose n = 52. Using Example 4.6.3 and the Exercise 4.6.2 show that (X1 , X2 , X3 , . . . Xn )
are exchangeable.
(b) Show that (X1 , X2 , X3 , . . . , Xn ) are exchangeable for any 2 ≤ n ≤ 52. Hint: If n < 52, extend
the sample to exhaust the deck of cards. Use (a) and Exercise 4.6.1.
(c) Find the probability that the second and fourth card drawn have the same colour.
Ex. 4.6.4. (Polya Urn Scheme) An urn contains b black balls and r red balls. A ball is drawn
at random and its colour noted. Then it is replaced along with c ≥ 0 balls of the same colour. This
procedure is repeated n times.
(c) Let 1 ≤ m ≤ n. Let Bm be the event that the m-th ball drawn is black. Show that

P (Bm ) = b/(b + r).