When we first looked at Bernoulli trials in Example 2.1.2 we asked the question “On average how
many successes will there be after n trials?” In order to answer this question, a specific definition
of “average” must be developed.
To begin, consider how to extend the basic notion of the average of a list of numbers to the
situation of equally likely outcomes. For instance, if we want to know what the average roll of a die
will be, it makes sense to declare it to be 3.5, the average value of 1, 2, 3, 4, 5, and 6. A motivation
for a more general definition of average comes from a rewriting of this calculation.
$$\frac{1+2+3+4+5+6}{6} = 1\left(\frac{1}{6}\right) + 2\left(\frac{1}{6}\right) + 3\left(\frac{1}{6}\right) + 4\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 6\left(\frac{1}{6}\right).$$
From the perspective of the right hand side of the equation, the results of all outcomes are added together after being weighted, each according to its probability. In the case of a die, all six outcomes have probability $\frac{1}{6}$.
Definition 4.1.1. Let X : S → T be a discrete random variable (so T is countable). Then the
expected value (or average) of X is written as E [X ] and is given by
$$E[X] = \sum_{t \in T} t \cdot P(X = t)$$
provided that the sum converges absolutely. In this case we say that X has “finite expectation”. If
the sum diverges to ±∞ we say the random variable has infinite expectation. If the sum diverges,
but not to infinity, we say the expected value is undefined.
Example 4.1.2. In the previous chapter, Example 3.1.4 described a lottery for which a ticket could
be worth nothing, or it could be worth either $20 or $200. What is the average value of such a
ticket?
We calculated the distribution of ticket values as $P(X = 200) = \frac{1}{1000}$, $P(X = 20) = \frac{27}{1000}$, and $P(X = 0) = \frac{972}{1000}$. Applying the definition of expected value results in
$$E[X] = 200\left(\tfrac{1}{1000}\right) + 20\left(\tfrac{27}{1000}\right) + 0\left(\tfrac{972}{1000}\right) = 0.74,$$
so a ticket has an expected value of 74 cents.
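The definition translates directly into a short computation. Below is a minimal sketch in Python (the code and its variable names are our own illustration, not part of the text) that evaluates E[X] for the lottery ticket by summing value times probability.

```python
# Expected value of a discrete random variable: sum of value * probability.
distribution = {200: 1/1000, 20: 27/1000, 0: 972/1000}

expected_value = sum(t * p for t, p in distribution.items())
print(expected_value)  # 0.74, i.e. 74 cents
```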
It is possible to think of a constant as a random variable. If c ∈ R then we could define a
random variable X with a distribution such that P (X = c) = 1. It is a slight abuse of notation,
but in this case we will simply write c for both the real number as well as the constant random
variable. Such random variables have the obvious expected value: E[c] = c.
This infinite sum diverges (not to ±∞), so the expected value of this random variable is undefined.
The examples above were specifically constructed to produce series which clearly diverged, but
in general it can be complicated to check whether an infinite sum is absolutely convergent or not.
The next technical lemma provides a condition that is often simpler to check. The convenience of
this lemma is that, since |X| is always positive, the terms of the series for E [|X|] may be freely
rearranged without changing the value of (or the convergence of) the sum.
Lemma 4.1.6. E [X ] is a real number if and only if E [|X|] < ∞.
Proof - Let T be the range of X, so U = {|t| : t ∈ T} is the range of |X|. By definition
$$E[|X|] = \sum_{u \in U} u \cdot P(|X| = u), \quad \text{while} \quad E[X] = \sum_{t \in T} t \cdot P(X = t).$$
To more easily relate these two sums, define $\hat{T} = \{t : |t| \in U\}$. Since every $u \in U$ came from some $t \in T$, the new set $\hat{T}$ contains every element of T. Every $t \in \hat{T}$ with $t \notin T$ lies outside the range of X, and so $P(X = t) = 0$ for such elements. Because of this, E[X] may be written as
$$E[X] = \sum_{t \in \hat{T}} t \cdot P(X = t).$$
Therefore the series describing E [X ] is absolutely convergent exactly when E [|X|] < ∞.
We will eventually wish to calculate the expected values of functions of multiple random variables.
Of particular interest to statistics is an understanding of expected values of sums and averages of
i.i.d. sequences. That understanding will be made easier by first learning something about how
expected values behave for simple combinations of variables.
Theorem 4.1.7. Suppose that X and Y are discrete random variables, both with finite expected
value and both defined on the same sample space S. If a and b are real numbers then
(1) $E[aX] = aE[X]$;
(2) $E[X + Y] = E[X] + E[Y]$;
(3) $E[aX + bY] = aE[X] + bE[Y]$; and
(4) if $X \geq 0$, then $E[X] \geq 0$.
Proof of (1) - If $a = 0$ then both sides of the equation are zero, so assume $a \neq 0$. We know
that X is a function from S to some range U . So aX is also a random variable and its range is
T = {au : u ∈ U }.
By definition $E[aX] = \sum_{t \in T} t \cdot P(aX = t)$, but because of how T is defined, adding values indexed by $t \in T$ is equivalent to adding values indexed by $u \in U$ where $t = au$. In other words
$$E[aX] = \sum_{t \in T} t \cdot P(aX = t) = \sum_{u \in U} au \cdot P(aX = au) = a \sum_{u \in U} u \cdot P(X = u) = aE[X].$$
Proof of (2) - We are assuming that X and Y have the same domain, but they typically have
different ranges. Suppose X : S → U and Y : S → V . Then the random variable X + Y is also
defined on S and takes values in T = {u + v : u ∈ U , v ∈ V }. Therefore, adding values indexed by
t ∈ T is equivalent to adding values indexed by u and v as they range over U and V respectively.
So,
$$E[X + Y] = \sum_{t \in T} t \cdot P(X + Y = t) = \sum_{u \in U,\, v \in V} (u + v) \cdot P(X = u, Y = v)$$
$$= \sum_{u \in U} \sum_{v \in V} u \cdot P(X = u, Y = v) + \sum_{v \in V} \sum_{u \in U} v \cdot P(X = u, Y = v),$$
where the rearrangement of summation is legitimate since the series converges absolutely. Notice that as u ranges over all of U the sets $(X = u, Y = v)$ partition the set $(Y = v)$ into disjoint pieces based on the value of X. Likewise the event $(X = u)$ is partitioned by $(X = u, Y = v)$ as v ranges over all values of $v \in V$. Therefore, as a disjoint union,
$$(Y = v) = \bigcup_{u \in U} (X = u, Y = v) \quad \text{and} \quad (X = u) = \bigcup_{v \in V} (X = u, Y = v),$$
and so
$$P(Y = v) = \sum_{u \in U} P(X = u, Y = v) \quad \text{and} \quad P(X = u) = \sum_{v \in V} P(X = u, Y = v).$$
Substituting these into the double sums above gives
$$E[X + Y] = \sum_{u \in U} u \cdot P(X = u) + \sum_{v \in V} v \cdot P(Y = v) = E[X] + E[Y].$$
Proof of (3) - This is an easy consequence of (1) and (2). From (2) the expected value E [aX + bY ]
may be rewritten as E [aX ] + E [bY ]. From there, applying (1) shows this is also equal to aE [X ] +
bE [Y ]. (Using induction this theorem may be extended to any finite linear combination of random
variables, a fact which we leave as an exercise below).
Proof of (4) - We know that X is a function from S to T where $t \in T$ implies that $t \geq 0$. Since
$$E[X] = \sum_{t \in T} t \cdot P(X = t),$$
every term of the sum is non-negative, and therefore $E[X] \geq 0$.
Example 4.1.8. What is the average value of the sum of two dice?
To answer this question by appealing to the definition of expected value would require summing
over the eleven possible outcomes {2, 3, . . . , 12} and computing the probabilities of each of those
outcomes. Theorem 4.1.7 makes things much simpler. We began this section by noting that a
single die roll has an expected value of 3.5. The sum of two dice is X + Y where each of X and
Y represents the outcome of a single die. So the average value of the sum of a pair of dice is
E [X + Y ] = E [X ] + E [Y ] = 3.5 + 3.5 = 7.
Example 4.1.9. Consider a game in which a player might either gain or lose money based on the
result. A game is considered “fair” if it is described by a random variable with an expected value
of zero. Such a game is fair in the sense that, on average, the player will have no net change in
money after playing.
Suppose a particular game is played with one player (the roller) throwing a die. If the die comes
up an even number, the roller wins that dollar amount from his opponent. If the die is odd, the
roller wins nothing. Obviously the game as stated is not “fair” since the roller cannot lose money
and may win something. How much should the roller pay his opponent to play this game in order
to make it a fair game?
Let X be the amount of money the rolling player gains by the result on the die. The set of
possible outcomes is T = {0, 2, 4, 6} and it should be routine at this point to verify that E [X ] = 2.
Let c be the amount of money the roller should pay to play in order to make the game fair. Since
X is the amount of money gained by the roll, the net change of money for the roller is X − c after
accounting for how much was paid to play. A fair game requires
0 = E [X − c] = E [X ] − E [c] = 2 − c.
So the roller should pay his opponent $2 to make the game fair.
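As a sanity check on the fair-game reasoning, one might simulate the game. The sketch below is hypothetical code of our own (names and the fee parameter are our choices); it estimates the roller's average net gain after paying $2 per game, which should hover near zero.

```python
import random

def net_gain(entry_fee=2):
    """One play: roll a die, win the face value if even, then pay the fee."""
    roll = random.randint(1, 6)
    winnings = roll if roll % 2 == 0 else 0
    return winnings - entry_fee

plays = 100_000
average = sum(net_gain() for _ in range(plays)) / plays
print(average)  # close to 0 for a fair game
```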
Theorem 4.1.10. Suppose that X and Y are discrete random variables, both with finite expected
value and both defined on the same sample space S. If X and Y are independent, then E [XY ] =
E [X ]E [Y ].
A quick glance at the definition of expected value shows that it only depends on the distribution
of the random variable. Therefore one can compute the expected values for the various common
distributions we defined in the previous chapter.
where the last equality is a shift of variables. But now, by the binomial theorem, the sum $\sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1-p)^{(n-1)-k}$ is equal to 1 and therefore $E[Y] = np$.
Alternatively, recall that the binomial distribution first came about as the total number of suc-
cesses in n independent Bernoulli trials. Therefore a Binomial(n, p) distribution results from adding
together n independent Bernoulli(p) random variables. Let X1 , X2 , . . . , Xn be i.i.d. Bernoulli(p)
and let Y = X1 + X2 + · · · + Xn . Then Y ∼ Binomial(n, p) and
$$E[Y] = E[X_1 + X_2 + \cdots + X_n] = E[X_1] + E[X_2] + \cdots + E[X_n] = p + p + \cdots + p = np.$$
This also provides the answer to part (d) of Example 2.1.2. The expected number of successes in
a series of n independent Bernoulli(p) trials is np.
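The additivity argument is easy to check empirically. A minimal sketch (our own illustration, not the text's) builds a Binomial(n, p) draw as a sum of n Bernoulli(p) draws and compares the sample mean with np.

```python
import random

def binomial_draw(n, p):
    # Sum of n independent Bernoulli(p) trials.
    return sum(1 for _ in range(n) if random.random() < p)

n, p, trials = 10, 0.3, 50_000
sample_mean = sum(binomial_draw(n, p) for _ in range(trials)) / trials
print(sample_mean, n * p)  # sample mean should be close to np = 3.0
```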
In the next example we will calculate the expected value of a geometric random variable.
The computation illustrates a common technique from calculus for simplifying power series by
differentiating the sum term-by-term in order to rewrite a complicated series in a simpler way.
Example 4.1.15. (Expected Value of a Geometric(p))
If X ∼ Geometric(p) and 0 < p < 1, then
$$E[X] = \sum_{k=1}^{\infty} k \cdot p(1-p)^{k-1}.$$
To evaluate this series we will work with its partial sums. For any $n \geq 1$, let
$$T_n = \sum_{k=1}^{n} kp(1-p)^{k-1} = \sum_{k=1}^{n} k(1 - (1-p))(1-p)^{k-1}$$
$$= \sum_{k=1}^{n} k(1-p)^{k-1} - \sum_{k=1}^{n} k(1-p)^{k}$$
$$= \sum_{k=1}^{n} (1-p)^{k-1} - n(1-p)^{n} = \frac{1 - (1-p)^{n}}{p} - n(1-p)^{n}.$$
Using standard results from analysis we know that for $0 < p < 1$,
$$\lim_{n \to \infty} (1-p)^{n} = 0 \quad \text{and} \quad \lim_{n \to \infty} n(1-p)^{n} = 0.$$
Therefore $T_n \to \frac{1}{p}$ as $n \to \infty$. Hence
$$E[X] = \frac{1}{p}.$$
For instance, suppose we wanted to know on average how many rolls of a die it would take before we observed a 5. Each roll is a Bernoulli trial with a probability $\frac{1}{6}$ of success. The time it takes to observe the first success is distributed as a Geometric($\frac{1}{6}$) and so has expected value $\frac{1}{1/6} = 6$. On average it should take six rolls before observing this outcome.
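The six-roll answer can also be checked by simulation. The sketch below is an illustration we add (function names are our own): it counts the rolls needed to see a 5 and averages over many repetitions.

```python
import random

def rolls_until_five():
    # Count rolls of a fair die until a 5 appears (Geometric(1/6)).
    count = 0
    while True:
        count += 1
        if random.randint(1, 6) == 5:
            return count

trials = 100_000
print(sum(rolls_until_five() for _ in range(trials)) / trials)  # about 6
```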
Example 4.1.16. (Expected Value of a Poisson(λ))
We can make a reasonable guess at the expected value of a Poisson(λ) random variable by
recalling that such a distribution was created to approximate a binomial when n was large and p
was small. The parameter λ = np remained fixed as we took a limit. Since we showed above that a
Binomial (n, p) has an expected value of np, it seems plausible that a P oisson(λ) should have an
expected value of λ. This is indeed true and it is possible to prove the fact by using the idea that
the Poisson random variable is the limit of a sequence of binomial random variables. However, this
proof requires an understanding of how limits and expected values interact, a concept that has not
yet been introduced in the text. Instead we leave a proof based on a direct algebraic computation
as Exercise 4.1.12.
Taking the result as a given, we will illustrate how this expected value might be used for
an applied problem. Suppose an insurance company wants to model catastrophic floods using a
Poisson(λ) random variable. Since floods are rare in any given year, and since the company is
considering what might occur over a long span of years, this may be a reasonable assumption.
As its name implies a “50-year flood” is a flood so substantial that it should occur, on average,
only once every fifty years. However, this is just an average; it may be possible to have two “50-year
floods” in consecutive years, though such an event would be quite rare. Suppose the insurance
company wants to know how likely it is that there will be two or more “50-year floods” in the next
decade, how should this be calculated?
There is an average of one such flood every fifty years, so by proportional reasoning, in the next
ten years there should be an average of 0.2 floods. In other words, the number of floods in the
next ten years should be a random variable $X \sim$ Poisson(0.2), and we wish to calculate $P(X \geq 2)$:
$$P(X \geq 2) = 1 - P(X = 0) - P(X = 1) = 1 - e^{-0.2} - e^{-0.2}(0.2) \approx 0.0175.$$
So assuming the Poisson random variable is an accurate model, there is only about a 1.75% chance that two or more such disastrous floods would occur in the next decade.
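The flood probability is a one-line computation from the Poisson mass function. Here is a minimal sketch (our own, using only the standard library) that reproduces it.

```python
import math

lam = 0.2  # expected number of 50-year floods in a decade
p_two_or_more = 1 - math.exp(-lam) - math.exp(-lam) * lam
print(p_two_or_more)  # about 0.0175
```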
For a hypergeometric random variable, we will demonstrate another proof technique common
to probability. An expected value may involve a complicated (or infinite) sum which must be
computed. However, this sum includes within it the probabilities of each outcome of the random
variable, and those probabilities must therefore add to 1. It is sometimes possible to simplify the
sum describing the expected value using the fact that a related sum is already known.
Example 4.1.17. (Expected Value of a HyperGeo(N, r, m)) Let m and r be positive integers and let N be an integer for which $N > \max\{m, r\}$. Let X be a random variable with $X \sim$ HyperGeo(N, r, m). To calculate the expected value of X, we begin with two facts. The first is
an identity involving combinations. If $n \geq k > 0$ then
$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n}{k} \cdot \frac{(n-1)!}{(k-1)!((n-1)-(k-1))!} = \frac{n}{k}\binom{n-1}{k-1}.$$
The second comes from the consideration of the probabilities associated with a HyperGeo(N −
1, r − 1, m − 1) distribution. Specifically, as k ranges over all possible values of such a distribution,
we have
$$\sum_{k} \frac{\binom{r-1}{k}\binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}} = 1,$$
since this is the sum over all outcomes of the random variable.
To calculate E [X ], let j range over the possible values of X. Recall that the minimum value of j
is max{0, m − (N − r )} and the maximum value of j is min{r, m}. Now let k = j − 1. This means
that the maximum value for k is min{r − 1, m − 1}. If the minimum value for j was m − (N − r )
then the minimum value for k is m − (N − r ) − 1 = ((m − 1) − ((N − 1) − (r − 1))). If the
minimum value for j was 0 then the minimum value for k is −1.
The key to the computation is to note that as j ranges over all of the values of X, the values
of k cover all possible values of a HyperGeo(N − 1, r − 1, m − 1) distribution. In fact, the only
possible value k may assume that is not in the range of such a distribution is if k = −1 as a
minimum value. Now,
$$E[X] = \sum_{j} j \cdot \frac{\binom{r}{j}\binom{N-r}{m-j}}{\binom{N}{m}},$$
and if $j = 0$ is in the range of X, then that term of the sum is zero and it may be deleted without affecting the value. That is equivalent to deleting the $k = -1$ term, so the remaining values of k cover exactly the outcomes of a HyperGeo(N − 1, r − 1, m − 1) distribution. Applying the combinatorial identity above to $\binom{r}{j}$ and $\binom{N}{m}$ and substituting $k = j - 1$ then gives
$$E[X] = \frac{rm}{N} \sum_{k} \frac{\binom{r-1}{k}\binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}} = \frac{rm}{N},$$
since the remaining sum equals 1.
As we have seen previously, if X is a random variable and if f is a function defined on the possible
outputs of X, then f (X ) is a random variable in its own right. The expected value of this new
random variable may be computed in the usual way from the distribution of f (X ), but it is an
extremely useful fact that it may also be computed from the distribution of X itself. The next
example and theorems illustrate this fact.
Example 4.1.18. Returning to a setting first seen in Example 3.3.1 we will let X ∼ Uniform({−2, −1, 0, 1, 2}),
and let f (x) = x2 . How may E [f (X )] be calculated?
We will demonstrate this in two ways – first by appealing directly to the definition, and then
using the distribution of X instead of the distribution of f (X ). To use the definition of expected
value, recall that $f(X) = X^2$ takes values in $\{0, 1, 4\}$ with the following probabilities: $P(f(X) = 0) = \frac{1}{5}$ while $P(f(X) = 1) = P(f(X) = 4) = \frac{2}{5}$. Therefore,
$$E[f(X)] = 0\left(\tfrac{1}{5}\right) + 1\left(\tfrac{2}{5}\right) + 4\left(\tfrac{2}{5}\right) = 2.$$
However, the values of $f(X)$ are completely determined by the values of X. For instance, the event $(f(X) = 4)$ had a probability of $\frac{2}{5}$ because it was the disjoint union of two other events, $(X = 2) \cup (X = -2)$, each of which had probability $\frac{1}{5}$. So the term $4(\frac{2}{5})$ in the computation above could equally well have been thought of in two pieces:
$$4 \cdot P(f(X) = 4) = 4 \cdot P((X = 2) \cup (X = -2)) = 4 \cdot P(X = 2) + 4 \cdot P(X = -2) = 2^2 \cdot P(X = 2) + (-2)^2 \cdot P(X = -2),$$
where the final expression emphasizes that the outcome of 4 resulted either from $2^2$ or $(-2)^2$ depending on the value of X. Following a similar plan for the other values of $f(X)$ allows $E[f(X)]$ to be calculated directly from the probabilities of X as
$$E[f(X)] = \sum_{u \in U} u \cdot P(f(X) = u) = \sum_{u \in U} u \sum_{t \in f^{-1}(u)} P(X = t) = \sum_{u \in U} \sum_{t \in f^{-1}(u)} f(t) \cdot P(X = t) = \sum_{t \in T} f(t) \cdot P(X = t),$$
where the final step is simply the fact that T = f −1 (U ) and so summing over the values of t ∈ T
is equivalent to grouping them together in the sets f −1 (u) and summing over all values in U that
may be achieved by f (X ).
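Both routes to E[f(X)] are easy to compare numerically. The sketch below (illustrative code we add, not the book's) computes the expected value from the distribution of f(X) and directly from the distribution of X; the two results agree.

```python
from collections import defaultdict

pmf_x = {-2: 1/5, -1: 1/5, 0: 1/5, 1: 1/5, 2: 1/5}
f = lambda x: x ** 2

# Route 1: build the distribution of f(X), then apply the definition.
pmf_fx = defaultdict(float)
for t, p in pmf_x.items():
    pmf_fx[f(t)] += p
route1 = sum(u * p for u, p in pmf_fx.items())

# Route 2: sum f(t) * P(X = t) over the distribution of X itself.
route2 = sum(f(t) * p for t, p in pmf_x.items())

print(route1, route2)  # both equal 2.0
```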
The proof is nearly the same as for the one-variable case. The only difference is that $f^{-1}(u)$ is now
a set of vectors of values (t1 , . . . , tn ), so that the event (f (X ) = u) decomposes into events of the
form (X1 = t1 , . . . , Xn = tn ). However, this change does not interfere with the logic of the proof.
We leave the details to the reader.
exercises
Ex. 4.1.1. Let X, Y be discrete random variables with X ≤ Y. Show that E[X] ≤ E[Y].
Ex. 4.1.2. A lottery is held every day, and on any given day there is a 30% chance that someone
will win, with each day independent of every other. Let X denote the random variable describing
the number of times in the next five days that the lottery will be won.
(b) On average (expected value), how many times in the next five days will the lottery be won?
(c) When the lottery occurs for each of the next five days, what is the most likely number (mode)
of days there will be a winner?
(d) How likely is it the lottery will be won in either one or two of the next five days?
Ex. 4.1.3. A game show contestant is asked a series of questions. She has a probability of 0.88 of
knowing the answer to any given question, independently of every other. Let Y denote the random
variable describing the number of questions asked until the contestant does not know the correct
answer.
(b) On average (expected value), how many questions will be asked until the first question for
which the contestant does not know the answer?
(c) What is the most likely number of questions (mode) that will be asked until the contestant
does not know a correct answer?
(d) If the contestant is able to answer twelve questions in a row, she will win the grand prize.
How likely is it that she will know the answers to all twelve questions?
Ex. 4.1.4. Sonia sends out invitations to eleven of her friends to join her on a hike she’s planning.
She knows that each of her friends has a 59% chance of deciding to join her independently of each
other. Let Z denote the number of friends who join her on the hike.
(b) What is the average (expected value) number of her friends that will join her on the hike?
(c) What is the most likely number (mode) of her friends that will join her on the hike?
(d) How do your answers to (b) and (c) change if each friend has only a 41% chance of joining
her?
Ex. 4.1.5. A player rolls three dice and earns $1 for each die that shows a 6. How much should
the player pay to make this a fair game?
Ex. 4.1.6. ("The St. Petersburg Paradox") Suppose a game is played whereby a player begins flipping a fair coin and continues flipping it until it comes up heads. At that time the player wins $2^n$ dollars, where n is the total number of times he flipped the coin. Show that there is no amount of money the player could pay to make this a fair game. (Hint: See Example 4.1.4).
Ex. 4.1.7. Two different investment strategies have the following probabilities of return on $10,000.
Strategy A has a 20% chance of returning $14,000, a 35% chance of returning $12,000, a 20%
chance of returning $10,000, a 15% chance of returning $8,000, and a 10% chance of returning only
$6,000.
Strategy B has a 25% chance of returning $12,000, a 35% chance of returning $11,000, a 25%
chance of returning $10,000, and a 15% chance of returning $9,000.
(c) Is one strategy clearly preferable to the other? Explain your reasoning.
Ex. 4.1.8. Calculate the expected value of a Uniform({1, 2, . . . , n}) random variable by following
the steps below.
(a) Prove the numerical fact that $\sum_{j=1}^{n} j = \frac{n^2+n}{2}$. (Hint: There are many methods to do this. One uses induction).
(b) Use (a) to show that if $X \sim$ Uniform($\{1, 2, \ldots, n\}$), then $E[X] = \frac{n+1}{2}$.
Ex. 4.1.9. Use induction to extend the result of Theorem 4.1.7 by proving the following:
If X1 , X2 , . . . , Xn are random variables with finite expectation all defined on the same sample
space S and if $a_1, a_2, \ldots, a_n$ are real numbers, then
$$E[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1 E[X_1] + a_2 E[X_2] + \cdots + a_n E[X_n].$$
Ex. 4.1.10. Suppose X and Y are random variables for which X has finite expected value and Y
has infinite expected value. Prove that X + Y has infinite expected value.
Ex. 4.1.11. Suppose X and Y are random variables. Suppose E [X ] = ∞ and E [Y ] = −∞.
(a) Provide an example to show that E [X + Y ] = ∞ is possible.
(c) Provide an example to show that X + Y may have finite expected value.
Ex. 4.1.12. Let X ∼ P oisson(λ).
(a) Write an expression for E [X ] as an infinite sum.
(b) Every non-zero term in your answer to (a) should have a λ in it. Factor this λ out and
explain why the remaining sum equals 1. (Hint: One way to do this is through the use of
infinite series. Another way is to use the idea from Example 4.1.17).
Ex. 4.1.13. A daily lottery is an event that many people play, but for which the likelihood of any
given person winning is very small, making a Poisson approximation appropriate. Suppose a daily
lottery has, on average, two winners every five weeks. Estimate the probability that next week
there will be more than one winner.
As a single number, the average of a random variable may or may not be a good approximation
of the values that variable is likely to produce. For example, let X be defined such that $P(X = 10) = 1$, let Y be defined so that $P(Y = 9) = P(Y = 11) = \frac{1}{2}$, and let Z be defined such that $P(Z = 0) = P(Z = 20) = \frac{1}{2}$. It is easy to check that all three of these random variables have
an expected value of 10. However the number 10 exactly describes X, is always off from Y by an
absolute value of 1 and is always off from Z by an absolute value of 10.
It is useful to be able to quantify how far away a random variable typically is from its average.
Put another way, if we think of the expected value as somehow measuring the “center” of the
random variable, we would like to find a way to measure the size of the “spread” of the variable
about its center. Quantities useful for this are the variance and standard deviation.
Definition 4.2.1. Let X be a random variable with finite expected value. Then the variance of
the random variable is written as V ar [X ] and is defined as
$$Var[X] = E[(X - E[X])^2].$$
The standard deviation of X is then defined as $SD[X] = \sqrt{Var[X]}$.
Notice that V ar [X ] is the average of the square distance of X from its expected value. So if
X has a high probability of being far away from E [X ] the variance will tend to be large, while if
X is very near E [X ] with high probability the variance will tend to be small. In either case the
variance is the expected value of a squared quantity, and as such is always non-negative. Therefore
SD [X ] is defined whenever V ar [X ] is defined.
If we were to associate units with the random variable X (say meters), then the units of V ar [X ]
would be meters2 and the units of SD [X ] would be meters. We will see that the standard deviation
is more meaningful as a measure of the “spread” of a random variable while the variance tends to
be a more useful quantity to consider when carrying out complex computations.
Informally we will view the standard deviation as a typical distance from average. So if X is a
random variable and we calculate that $E[X] = 12$ and $SD[X] = 3$, we might say, "The variable X will typically take on values that are in or near the range 9 to 15, one standard deviation either side of the average". A goal of this section is to make that language more precise, but at this point this informal view will help build intuition.
The variance and standard deviation are described in terms of the expected value. Therefore
V ar [X ] and SD [X ] can only be defined if E [X ] exists as a real number. However, it is possible
that V ar [X ] and SD [X ] could be infinite even if E [X ] is finite (see Exercises). In practical terms,
if X has a finite expected value and infinite standard deviation, it means that the random variable
has a clear average, but is so spread out that any finite number underestimates the typical distance
of the random variable from its average.
Example 4.2.2. As above, let X be a constant variable with $P(X = 10) = 1$. Let Y be such that $P(Y = 9) = P(Y = 11) = \frac{1}{2}$ and let Z be such that $P(Z = 0) = P(Z = 20) = \frac{1}{2}$.
Since X always equals E [X ], the quantity (X − E [X ])2 is always zero and we can conclude
that V ar [X ] = 0 and SD [X ] = 0. This makes sense given the view of SD [X ] as an estimate of
how spread out the variable is. Since X is constant it is not at all spread out and so SD [X ] = 0.
To calculate V ar [Y ] we note that (Y − E [Y ])2 is always equal to 1. Therefore V ar [Y ] = 1 and
SD [Y ] = 1. Again this reaffirms the informal description of the standard deviation; the typical
distance between Y and its average is 1.
Likewise (Z − E [Z ])2 is always equal to 100. Therefore V ar [Z ] = 100 and SD [Z ] = 10. The
typical distance between Z and its average is 10.
Example 4.2.3. What are the variance and standard deviation of a die roll?
Before we carry out the calculation, let us use the informal idea of standard deviation to
estimate an answer and help build intuition. We know the average of a die roll is 3.5. The closest
a die could possibly be to this average is 0.5 (if it were to roll a 3 or a 4) and the furthest it could
possibly be is 2.5 (if it were to roll a 1 or a 6). Therefore the standard deviation, a typical distance
from average, should be somewhere between 0.5 and 2.5.
To calculate the quantity exactly, let X represent the roll of a die. By definition, V ar [X ] =
E [(X − 3.5)2 ], and the values that (X − 3.5)2 may assume are determined by the six values X
may take on.
$$Var[X] = E[(X - 3.5)^2] = (2.5)^2\tfrac{1}{6} + (1.5)^2\tfrac{1}{6} + (0.5)^2\tfrac{1}{6} + (-0.5)^2\tfrac{1}{6} + (-1.5)^2\tfrac{1}{6} + (-2.5)^2\tfrac{1}{6} = \frac{35}{12}.$$
So, $SD[X] = \sqrt{\frac{35}{12}} \approx 1.71$, which is near the midpoint of the range of our estimate above.
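The same computation is mechanical enough to script. A minimal sketch (our own illustration) evaluates Var[X] = E[(X − µ)²] and the standard deviation for a fair die.

```python
import math

faces = range(1, 7)
mu = sum(faces) / 6                                # 3.5
variance = sum((x - mu) ** 2 for x in faces) / 6   # 35/12
print(variance, math.sqrt(variance))               # 2.9166..., about 1.71
```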
Theorem 4.2.4. Let a ∈ R and let X be a random variable with finite variance (and thus, with
finite expected value as well). Then,
(a) V ar [aX ] = a2 · V ar [X ];
(b) SD [aX ] = |a| · SD [X ];
(c) V ar [X + a] = V ar [X ]; and
(d) SD [X + a] = SD [X ].
Proof of (a) and (b) - $Var[aX] = E[(aX - E[aX])^2]$. Using known properties of expected value this may be rewritten as
$$Var[aX] = E[(aX - aE[X])^2] = E[a^2(X - E[X])^2] = a^2 E[(X - E[X])^2] = a^2 Var[X].$$
That concludes the proof of (a). The result from (b) follows by taking square roots of both sides of this equation.
Proof of (c) and (d) - (See Exercises)
The variance may also be computed using a different (but equivalent) formula if E [X ] and
E [X 2 ] are known.
Theorem 4.2.5. Let X be a random variable for which E [X ] and E [X 2 ] are both finite. Then
V ar [X ] = E [X 2 ] − (E [X ])2 .
Proof -
$$Var[X] = E[(X - E[X])^2] = E[X^2 - 2XE[X] + (E[X])^2] = E[X^2] - 2E[XE[X]] + E[(E[X])^2].$$
But E[X] is a constant, so
$$Var[X] = E[X^2] - 2E[X]E[X] + (E[X])^2 = E[X^2] - (E[X])^2.$$
In statistics we frequently want to consider the sum or average of many random variables.
As such it is useful to know how the variance of a sum relates to the variances of each variable
separately. Toward that goal we have
Theorem 4.2.6. If X and Y are independent random variables, both with finite expectation and
finite variance, then
(a) $Var[X + Y] = Var[X] + Var[Y]$; and
(b) $SD[X + Y] = \sqrt{(SD[X])^2 + (SD[Y])^2}$.
Proof - Using Theorem 4.2.5,
$$Var[X + Y] = E[(X + Y)^2] - (E[X + Y])^2 = E[X^2 + 2XY + Y^2] - \left((E[X])^2 + 2E[X]E[Y] + (E[Y])^2\right).$$
Since X and Y are independent, Theorem 4.1.10 gives $E[XY] = E[X]E[Y]$, so the middle terms cancel and
$$Var[X + Y] = E[X^2] - (E[X])^2 + E[Y^2] - (E[Y])^2 = Var[X] + Var[Y],$$
which proves (a). Part (b) follows by taking square roots.
As with expected value, the variances of the common discrete random variables can be calculated
from their corresponding distributions.
Example 4.2.8. (Variance of a Bernoulli(p))
Let X ∼ Bernoulli(p). We have already calculated that E [X ] = p. Since X only takes on the
values 0 or 1 it is always true that X 2 = X. Therefore E [X 2 ] = E [X ] = p.
So, V ar [X ] = E [X 2 ] − (E [X ])2 = p − p2 = p(1 − p).
$$Var[Y] = Var[X_1 + X_2 + \cdots + X_n] = Var[X_1] + Var[X_2] + \cdots + Var[X_n] = np(1-p).$$
For an application of this computation we return to the idea of sampling from a population
where some members of the population have a certain characteristic and others do not. The goal
is to provide an estimate of the number of people in the sample that have the characteristic. For
this example, suppose we were to randomly select 100 people from a large city in which 20% of
the population works in a service industry. How many of the 100 people from our sample should
we expect to be service industry workers?
If the sampling is done without replacement (so we cannot pick the same person twice), then
strictly speaking the desired number would be described by a hypergeometric random variable.
However, we have also seen that there is little difference between the binomial and hypergeometric
distributions when the size of the sample is small relative to the size of the population. So since
the sample is only 100 people from a “large city”, we will assume this situation is modeled by a
binomial random variable. Specifically, since 20% of the population consists of service workers, we will assume $X \sim$ Binomial(100, 0.2).
The simplest way to answer the question of how many service industry workers to expect
within the sample is to compute the expected value of X. In this case E [X ] = 100(0.2) = 20,
so we should expect around 20 of the 100 people in the sample to be service workers. However,
this is an incomplete answer to the question since it only provides an average value; the actual
number of service workers in the sample is probably not going to be exactly 20, it’s only likely
to be around 20 on average. A more complete answer to the question would give an estimate as
to how far away from 20 the actual value is likely to be. But this is precisely what the standard
deviation describes – an estimate of the likely difference between the actual result of the random
variable and its expected value.
In this case $Var[X] = 100(0.2)(0.8) = 16$ and so $SD[X] = \sqrt{16} = 4$. This means that the actual number of service industry workers in the sample will typically be about 4 or so away from
the expected value of 20, so a more complete answer to the question would be “The sample is
likely to have around 16 to 24 service workers in it". That is not to say that the actual number of service workers is guaranteed to fall in that range, but the range provides a sort of likely error associated with the estimate of 20. Results in the 16 to 24 range should be considered fairly common. Results far outside that range, while possible, should be considered fairly unusual.
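Simulation makes the "16 to 24" claim concrete. The sketch below (hypothetical code, with parameter names of our choosing) draws many samples of 100 people and reports how often the count of service workers lands within one standard deviation of 20.

```python
import random

def sample_count(n=100, p=0.2):
    # Number of service workers in a sample of n, modeled as Binomial(n, p).
    return sum(1 for _ in range(n) if random.random() < p)

trials = 20_000
counts = [sample_count() for _ in range(trials)]
within = sum(1 for c in counts if 16 <= c <= 24) / trials
print(within)  # a clear majority of samples fall in the 16-24 range
```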
Recall in Example 4.1.17 we calculated E [X ] using a technique in which the sum describing
E [X ] was computed based on another sum which only involved the distribution of X directly.
This second sum equalled 1 since it simply added up the probabilities that X assumed each of its
possible values. In a similar fashion, it is sometimes possible to calculate a sum describing E [X 2 ]
in terms of a sum for E [X ] which is already known. From that point, Theorem 4.2.5 may be used
to calculate the variance and standard deviation of X. This technique will be illustrated in the
next example in which we calculate the spread associated with a geometric random variable.
Example 4.2.10. (Variance of a Geometric(p))
Let $0 < p < 1$ and let $X \sim$ Geometric(p), for which we know $E[X] = \frac{1}{p}$. Then
$$E[X^2] = \sum_{k=1}^{\infty} k^2 p(1-p)^{k-1}.$$
To evaluate this series we will again work with its partial sums. For any $n \geq 1$, let
$$S_n = \sum_{k=1}^{n} k^2 p(1-p)^{k-1} = \sum_{k=1}^{n} k^2(1 - (1-p))(1-p)^{k-1}$$
$$= \sum_{k=1}^{n} k^2(1-p)^{k-1} - \sum_{k=1}^{n} k^2(1-p)^{k}$$
$$= 1 + \sum_{k=2}^{n} (2k-1)(1-p)^{k-1} - n^2(1-p)^{n}$$
$$= 1 - \sum_{k=2}^{n} (1-p)^{k-1} + 2\sum_{k=2}^{n} k(1-p)^{k-1} - n^2(1-p)^{n}$$
$$= 2 - \sum_{k=1}^{n} (1-p)^{k-1} + 2\left(-1 + \sum_{k=1}^{n} k(1-p)^{k-1}\right) - n^2(1-p)^{n}$$
$$= -\frac{1 - (1-p)^{n}}{p} + \frac{2}{p}\sum_{k=1}^{n} kp(1-p)^{k-1} - n^2(1-p)^{n}.$$
Using standard results from analysis and the result from Example 4.1.15 we know that for $0 < p < 1$,
$$\lim_{n \to \infty} \sum_{k=1}^{n} kp(1-p)^{k-1} = \frac{1}{p}, \quad \lim_{n \to \infty} (1-p)^{n} = 0, \quad \text{and} \quad \lim_{n \to \infty} n^2(1-p)^{n} = 0.$$
Therefore $S_n \to -\frac{1}{p} + \frac{2}{p^2}$ as $n \to \infty$. Hence
$$E[X^2] = -\frac{1}{p} + \frac{2}{p^2}.$$
Using Theorem 4.2.5 the variance may then be calculated as
$$Var[X] = E[X^2] - (E[X])^2 = \frac{2}{p^2} - \frac{1}{p} - \left(\frac{1}{p}\right)^2 = \frac{1}{p^2} - \frac{1}{p}.$$
A similar technique may be used for calculating the variance of a Poisson random variable, a
fact which is left as an exercise. We finish this subsection with a computation of the variance
of a hypergeometric distribution using an idea similar to how we calculated its expected value in
Example 4.1.17.
Example 4.2.11. Let m and r be positive integers and let N be an integer with N > max{m, r}
and let X ∼ HyperGeo(N , r, m). To calculate E [X 2 ], as j ranges over the values of X,
$$E[X^2] = \sum_{j} j^2 \cdot \frac{\binom{r}{j}\binom{N-r}{m-j}}{\binom{N}{m}} = \sum_{j} j^2 \cdot \frac{\frac{r}{j}\binom{r-1}{j-1}\binom{(N-1)-(r-1)}{(m-1)-(j-1)}}{\frac{N}{m}\binom{N-1}{m-1}}$$
$$= \left(\frac{rm}{N}\right)\sum_{j} j \cdot \frac{\binom{r-1}{j-1}\binom{(N-1)-(r-1)}{(m-1)-(j-1)}}{\binom{N-1}{m-1}} = \left(\frac{rm}{N}\right)\sum_{k} (k+1) \cdot \frac{\binom{r-1}{k}\binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}},$$
where $k = j - 1$ as before. The last sum splits into the expected value of a HyperGeo(N − 1, r − 1, m − 1) random variable plus the sum of its probabilities, so it equals $\frac{(r-1)(m-1)}{N-1} + 1$. Therefore
$$Var[X] = E[X^2] - (E[X])^2 = \left(\frac{rm}{N}\right)\left(\frac{(r-1)(m-1)}{N-1} + 1\right) - \left(\frac{rm}{N}\right)^2 = \frac{N^2rm - Nrm^2 - Nr^2m + r^2m^2}{N^2(N-1)}.$$
As with the computation of expected value, the cases of m = 0 and r = 0 must be handled
separately, but yield the same result.
Many random variables may be rescaled into a standard format by shifting them so that they have
an average of zero and then rescaling them so that they have a variance (and standard deviation)
of one. We introduce this idea now, though its chief importance will not be realized until later.
We say that a random variable X is standardized if $E[X] = 0$ and $Var[X] = 1$.
Theorem 4.2.13. Let X be a discrete random variable with finite expected value and finite, non-zero variance. Then $Z = \frac{X - E[X]}{SD[X]}$ is a standardized random variable.
Proof - First,
$$E[Z] = E\left[\frac{X - E[X]}{SD[X]}\right] = \frac{E[X - E[X]]}{SD[X]} = \frac{E[X] - E[X]}{SD[X]} = 0.$$
Then, by Theorem 4.2.4, $Var[Z] = \frac{Var[X]}{(SD[X])^2} = 1$.
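Standardizing is a two-step rescaling, and the theorem is easy to verify numerically. A minimal sketch (our own, using an arbitrary illustrative distribution) standardizes a small random variable and checks that the result has mean 0 and variance 1.

```python
import math

pmf = {0: 0.2, 1: 0.5, 4: 0.3}  # an arbitrary illustrative distribution

mu = sum(t * p for t, p in pmf.items())
var = sum((t - mu) ** 2 * p for t, p in pmf.items())
sd = math.sqrt(var)

# Distribution of Z = (X - mu) / sd.
pmf_z = {(t - mu) / sd: p for t, p in pmf.items()}
mu_z = sum(z * p for z, p in pmf_z.items())
var_z = sum((z - mu_z) ** 2 * p for z, p in pmf_z.items())
print(round(mu_z, 12), round(var_z, 12))  # 0.0 and 1.0
```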
For easy reference we finish off this section by providing a chart of values associated with
common discrete distributions.
exercises
Calculate the expected value and standard deviation of this random variable. What is the probabil-
ity this random variable will produce a result more than one standard deviation from its expected
value?
Ex. 4.2.2. Answer the following questions about flips of a fair coin.
(a) Calculate the standard deviation of the number of heads that show up in 100 flips of a fair
coin.
(b) Show that if the number of coins is quadrupled (to 400) the standard deviation only doubles.
Ex. 4.2.3. Suppose we begin rolling a die, and let X be the number of rolls needed before we see
the first 3.
(b) Calculate SD [X ].
(c) Viewing SD [X ] as a typical distance of X from its expected value, would it seem unusual to
roll the die more than nine times before seeing a 3?
(e) Calculate the probability X produces a result within one standard deviation of its expected
value.
Ex. 4.2.4. A key issue in statistical sampling is the determination of how much a sample is likely
to differ from the population it came from. This exercise explores some of these ideas.
(a) Suppose a large city is exactly 50% women and 50% men and suppose we randomly select
60 people from this city as part of a sample. Let X be the number of women in the sample.
What are the expected value and standard deviation of X? Given these values, would it seem
unusual if fewer than 45% of the individuals in the sample were women?
(b) Repeat part (a), but now assume that the sample consists of 600 people.
Ex. 4.2.5. Calculate the variance and standard deviation of the value of the lottery ticket from
Example 3.1.4.
Ex. 4.2.6. Prove parts (c) and (d) of Theorem 4.2.4.
Ex. 4.2.7. Let $X \sim$ Binomial(n, p). Show that for $0 < p < 1$, this random variable has the largest standard deviation when $p = \frac{1}{2}$.
Ex. 4.2.8. Follow the steps below to calculate the variance of a random variable with a Uniform({1, 2, . . . , n})
distribution.
(a) Prove that $\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$. (Induction is one way to do this).
Ex. 4.2.9. This exercise provides an example of a random variable with finite expected value, but infinite variance. Let X be a random variable for which $P\left(X = \frac{2^n}{n(n+1)}\right) = \frac{1}{2^n}$ for all integers $n \geq 1$.
(a) Prove that X is a well-defined variable by showing $\sum_{n=1}^{\infty} P\left(X = \frac{2^n}{n(n+1)}\right) = 1$.
When there is no confusion about what random variable is being discussed, it is usual to use the
Greek letter µ in place of E [X ] and σ in place of SD [X ]. When more than one variable is involved
the same letters can be used with subscripts (µX and σX ) to indicate which variable is being
described.
In statistics one frequently measures results in terms of “standard units” – the number of
standard deviations a result is from its expected value. For instance if µ = 12 and σ = 5, then a
result of X = 20 would be 1.6 standard units because 20 = µ + 1.6σ. That is, 20 is 1.6 standard
deviations above expected value. Similarly a result of X = 10 would be −0.4 standard units
because 10 = µ − 0.4σ.
Since the standard deviation measures a typical distance from average, results that are within
one standard deviation from average (between −1 and +1 standard units) will tend to be fairly
common, while results that are more than two standard deviations from average (less than −2 or
greater than +2 in standard units) will usually be relatively rare. The likelihoods of some such
events will be calculated in the next two examples. Notice that the event (|X − µ| ≤ kσ ) describes
those outcomes of X that are within k standard deviations from average.
Example 4.3.1. Let Y represent the sum of two dice. How likely is it that Y will be within
one standard deviation of its average? How likely is it that Y will be more than two standard
deviations from its average?
We can use our previous calculations that $\mu = 7$ and $\sigma = \sqrt{\frac{35}{6}} \approx 2.42$. The achievable values that are within one standard deviation of average are 5, 6, 7, 8, and 9. So the probability that the sum of two dice will be within one standard deviation of average is
$$P(|Y - \mu| \leq \sigma) = P(Y \in \{5, 6, 7, 8, 9\}) = \frac{24}{36} \approx 0.667.$$
There is about a 66.7% chance that a pair of dice will fall within one standard deviation of their expected value.
Two standard deviations is $2\sqrt{\frac{35}{6}} \approx 4.83$. Only the results 2 and 12 are further than this distance from the expected value, so the probability that Y will be more than two standard deviations from average is
$$P(|Y - \mu| > 2\sigma) = P(Y \in \{2, 12\}) = \frac{2}{36} \approx 0.056.$$
There is only about a 5.6% chance that a pair of dice will be more than two standard deviations from expected value.
Example 4.3.2. If $X \sim$ Uniform($\{1, 2, \ldots, 100\}$), what is the probability that X will be within one standard deviation of expected value? What is the probability it will be more than two standard deviations from expected value?
Again, based on earlier calculations we know that $\mu = \frac{101}{2} = 50.5$ and that $\sigma = \sqrt{\frac{9999}{12}} \approx 28.9$.
Of the possible values that X can achieve, only the numbers 22, 23, . . . , 79 fall within one standard
deviation of average. So the desired probability is
$$P(|X - \mu| \leq \sigma) = P(X \in \{22, 23, \ldots, 79\}) = \frac{58}{100}.$$
There is a 58% chance that this random variable will be within one standard deviation of expected
value.
Similarly we can calculate that two standard deviations is $2\sqrt{\frac{9999}{12}} \approx 57.7$. Since $\mu = 50.5$ and since the minimal and maximal values of X are 1 and 100 respectively, results that are more than two standard deviations from average cannot happen at all for this random variable. In other words, $P(|X - \mu| > 2\sigma) = 0$.
The examples of the previous section show that the exact probabilities a random variable will fall
within a certain number of standard deviations of its expected value depend on the distribution of
the random variable. However, there are some general results that apply to all random variables.
To prove these results we will need to investigate some inequalities.
Theorem 4.3.3. (Markov’s Inequality) Let X be a discrete random variable which takes on
only non-negative values and suppose that X has a finite expected value. Then for any c > 0,
$$P(X \geq c) \leq \frac{\mu}{c}.$$
Proof - Let T be the range of X, so T is a countable subset of the non-negative real numbers. By
dividing T into those numbers smaller than c and those numbers that are at least as large as c we
have
$$\mu = \sum_{t \in T} t \cdot P(X = t) = \sum_{t \in T,\, t < c} t \cdot P(X = t) + \sum_{t \in T,\, t \geq c} t \cdot P(X = t).$$
The first sum must be non-negative, since we assumed that T consisted of only non-negative
numbers, so we only make the quantity smaller by deleting it. Likewise, for each term in the
second sum, t ≥ c so we only make the quantity smaller by replacing t by c. This gives us
$$\mu = \sum_{t \in T,\, t < c} t \cdot P(X = t) + \sum_{t \in T,\, t \geq c} t \cdot P(X = t) \geq \sum_{t \in T,\, t \geq c} c \cdot P(X = t) = c \sum_{t \in T,\, t \geq c} P(X = t).$$
The events (X = t) indexed over all values t ∈ T for which t ≥ c are a countable collection of
disjoint sets whose union is (X ≥ c). So,
$$\mu \geq c \sum_{t \in T,\, t \geq c} P(X = t) = c \cdot P(X \geq c).$$
Dividing both sides by c completes the proof.
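Markov's inequality is easy to test against an exact distribution. The sketch below (our illustration, with parameters chosen arbitrarily) compares the exact tail P(X ≥ c) with the bound µ/c for a Binomial(10, 0.3) variable at several cutoffs.

```python
from math import comb

n, p = 10, 0.3
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
mu = n * p  # expected value of a Binomial(n, p)

for c in (3, 5, 8):
    tail = sum(prob for k, prob in pmf.items() if k >= c)
    print(c, tail, mu / c)  # the tail never exceeds the Markov bound mu/c
```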
exercises
(c) Calculate P (|X − µ| > 2σ ), the probability that X will be more than two standard deviations
from average. Approximate your answer to the nearest tenth of a percent.
Ex. 4.3.3. Let X ∼ P oisson(3).
(a) Calculate µ and σ.
(b) Calculate P (|X − µ| ≤ σ ), the probability that X will be within one standard deviation of
average. Approximate your answer to the nearest tenth of a percent.
(c) Calculate P (|X − µ| > 2σ ), the probability that X will be more than two standard deviations
from average. Approximate your answer to the nearest tenth of a percent.
Ex. 4.3.4. Let $X \sim$ Binomial($n$, $\frac{1}{2}$). Determine the smallest value of n for which $P(|X - \mu| > 4\sigma) > 0$. That is, what is the smallest n for which there is a positive probability that X will be more than four standard deviations from average?
Ex. 4.3.5. For k ≥ 1 there are distributions for which Chebychev’s inequality is an equality.
(a) Let X be a random variable with probability mass function $P(X = 1) = P(X = -1) = \frac{1}{2}$. Prove that Chebychev's inequality is an equality for this random variable when $k = 1$.
(e) Use parts (b) and (d) to derive a contradiction. Note that this proves that the assumption
that was made in part (d), namely that P (|X − µ| > σ ) = 1, cannot be true for any discrete
random variable where µ and σ are finite quantities. In other words, no random variable can
produce only values that are more than one standard deviation from average.
Ex. 4.3.7. Let X be a discrete random variable with finite expected value and finite variance.
(a) Prove P (|X − µ| ≥ σ ) = 1 ⇐⇒ P (|X − µ| = σ ) = 1. (A random variable that assumes
only values one or more standard deviations from average must only produce values that are
exactly one standard deviation from average).
(b) Prove that if $P(|X - \mu| > \sigma) > 0$ then $P(|X - \mu| < \sigma) > 0$. (If a random variable is able to produce values more than one standard deviation from average, it must also be able to produce values that are less than one standard deviation from average).
In previous chapters we saw that information that a particular event had occurred could substan-
tially change the probability associated with another event. That realization led us to the notion of
conditional probability. It is also reasonable to ask how such information might affect the expected
value or variance of a random variable.
Example 4.4.2. A die is rolled. What are the expected value and variance of the result given that
the roll was even?
Let X be the die roll. Then X ∼ Uniform({1, 2, 3, 4, 5, 6}), but conditioned on the event A
that the roll was even, this changes so that
$$P(X = 2|A) = P(X = 4|A) = P(X = 6|A) = \frac{1}{3}.$$
Therefore,
$$E[X|A] = 2\left(\tfrac{1}{3}\right) + 4\left(\tfrac{1}{3}\right) + 6\left(\tfrac{1}{3}\right) = 4.$$
Note that the (unconditioned) expected value of a die roll is E [X ] = 3.5, so the knowledge of event
A slightly increases the expected value of the die roll.
The conditional variance is
$$Var[X|A] = (2-4)^2\left(\tfrac{1}{3}\right) + (4-4)^2\left(\tfrac{1}{3}\right) + (6-4)^2\left(\tfrac{1}{3}\right) = \frac{8}{3}.$$
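Conditioning on A just renormalizes the probabilities, which the following sketch (our own addition) makes explicit for the die roll.

```python
outcomes = range(1, 7)
even = [x for x in outcomes if x % 2 == 0]

# Conditional distribution given A = "roll is even": uniform on {2, 4, 6}.
cond_pmf = {x: 1 / len(even) for x in even}

e_given_a = sum(x * p for x, p in cond_pmf.items())
var_given_a = sum((x - e_given_a) ** 2 * p for x, p in cond_pmf.items())
print(e_given_a, var_given_a)  # 4.0 and 8/3
```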
respectively. We are interested in, say, $E[X|Y = 3]$. When $Y = 3$ an ace was seen on draw 3, but not on draws 1 or 2. Hence
$$P(\text{king on draw } n \mid Y = 3) = \begin{cases} \frac{4}{48} & \text{if } n = 1 \text{ or } 2 \\ 0 & \text{if } n = 3 \\ \frac{4}{52} & \text{if } n > 3 \end{cases}$$
so that
$$P(X = n \mid Y = 3) = \begin{cases} \left(\frac{44}{48}\right)^{n-1}\frac{4}{48} & \text{if } n = 1 \text{ or } 2 \\ 0 & \text{if } n = 3 \\ \left(\frac{44}{48}\right)^{2}\left(\frac{48}{52}\right)^{n-4}\frac{4}{52} & \text{if } n > 3 \end{cases}$$
For example, when $n > 3$, in order to have $X = n$ a non-king must have been seen on draws 1 and 2 (each with probability $\frac{44}{48}$), a non-king must have resulted on draw 3 (which is automatic, since an ace was drawn), a non-king must have been seen on each of draws 4 through $n - 1$ (each with probability $\frac{48}{52}$), and finally a king was produced on draw n (with probability $\frac{4}{52}$). Hence,
$$E[X|Y=3] = \sum_{n=1}^{2} n\left(\frac{44}{48}\right)^{n-1}\frac{4}{48} + \sum_{n=4}^{\infty} n\left(\frac{44}{48}\right)^{2}\left(\frac{48}{52}\right)^{n-4}\frac{4}{52}$$
$$= \sum_{n=1}^{2} n\left(\frac{44}{48}\right)^{n-1}\frac{4}{48} + \sum_{m=0}^{\infty} (m+4)\left(\frac{44}{48}\right)^{2}\left(\frac{48}{52}\right)^{m}\frac{4}{52}.$$
But
$$\sum_{m=0}^{\infty} (m+4)r^{m} = \sum_{m=0}^{\infty}\left(3r^{m} + \frac{d}{dr}r^{m+1}\right) = \frac{3}{1-r} + \frac{d}{dr}\left(\frac{r}{1-r}\right) = \frac{3}{1-r} + \frac{1}{(1-r)^2},$$
so
$$E[X|Y=3] = \frac{4}{48} + 2\left(\frac{44}{48}\right)\frac{4}{48} + \left(\frac{44}{48}\right)^{2}\frac{4}{52}\left(\frac{3}{1-(48/52)} + \frac{1}{(1-(48/52))^2}\right)$$
$$= \frac{4}{48} + 2\left(\frac{44}{48}\right)\frac{4}{48} + \left(\frac{44}{48}\right)^{2}\frac{4}{52}\left(\frac{3 \times 52}{4} + \frac{52^2}{4^2}\right)$$
$$= \frac{1}{12} + 2\left(\frac{11}{12}\right)\frac{1}{12} + 3\left(\frac{11}{12}\right)^{2} + \frac{52}{4}\left(\frac{11}{12}\right)^{2} = \frac{985}{72} \approx 13.68.$$
Given that the first ace appeared on draw 3, it takes an average of between 13 and 14 draws until the first king appears. Compare this to the unconditional E[X]. Since $X \sim$ Geometric($\frac{4}{52}$) we know $E[X] = \frac{52}{4} = 13$. In other words, on average it takes 13 draws to observe the first king. But given that the first ace appeared on draw three, we should expect to need about 0.68 draws more (on average) to see the first king.
Recall how Theorem 1.3.2 described a way in which a non-conditional probability could be
calculated in terms of conditional probabilities. There is an analogous theorem for expected value.
Theorem 4.4.4. Let $X : S \to T$ be a discrete random variable and let $\{B_i : i \geq 1\}$ be a disjoint collection of events for which $P(B_i) > 0$ for all i and such that $\bigcup_{i=1}^{\infty} B_i = S$. Suppose $P(B_i)$ and $E[X|B_i]$ are known. Then E[X] may be computed as
$$E[X] = \sum_{i=1}^{\infty} E[X|B_i]P(B_i).$$
Example 4.4.5. A venture capitalist estimates that regardless of whether the economy strengthens,
weakens, or remains the same in the next fiscal quarter, a particular investment could either gain
or lose money. However, he figures that if the economy strengthens, the investment should, on
average, earn 3 million dollars. If the economy remains the same, he figures the expected gain
on the investment will be 1 million dollars, while if the economy weakens, the investment will, on
average, lose 1 million dollars. He also trusts economic forecasts which predict a 50% chance of a weaker economy, a 40% chance of a stagnant economy, and a 10% chance of a stronger economy. What should he calculate as the expected return on the investment?
Let X be the return on investment and let A, B, and C represent the events that the economy
will be stronger, the same, and weaker in the next quarter, respectively. Then the estimates on
return give the following information in millions: $E[X|A] = 3$, $E[X|B] = 1$, and $E[X|C] = -1$, with $P(A) = 0.1$, $P(B) = 0.4$, and $P(C) = 0.5$. Therefore,
$$E[X] = E[X|A]P(A) + E[X|B]P(B) + E[X|C]P(C) = 3(0.1) + 1(0.4) + (-1)(0.5) = 0.2,$$
so the expected return on the investment is 0.2 million dollars.
Theorem 4.4.6. Let X and Y be two discrete random variables on a sample space S with Y : S →
T . Let g : T → R be defined as g (y ) = E [X|Y = y ]. Then
E [g (Y )] = E [X ].
It is common to use $E[X|Y]$ to denote $g(Y)$, after which the theorem may be expressed as $E[E[X|Y]] = E[X]$. This can be slightly confusing notation, but one must keep in mind that the exterior expected value in the expression $E[E[X|Y]]$ refers to the average of $E[X|Y]$ viewed as a function of Y.
Proof - As y ranges over T , the events (Y = y ) are disjoint and cover all of S. Therefore, by
Theorem 4.4.4,
$$E[g(Y)] = \sum_{y \in T} g(y)P(Y = y) = \sum_{y \in T} E[X|Y=y]P(Y = y) = E[X].$$
Example 4.4.7. Let Y ∼ Uniform({1, 2, . . . , n}) and let X be the number of heads on Y flips of
a coin. What is the expected value of X?
Without Theorem 4.4.6 this problem would require computing many complicated probabilities.
However, it is made much simpler by noting that the distribution of X is given conditionally by
$(X|Y = j) \sim$ Binomial($j$, $\frac{1}{2}$). Therefore we know $E[X|Y = j] = \frac{j}{2}$. Using the notation above, this may be written as $E[X|Y] = \frac{Y}{2}$, after which
$$E[X] = E[E[X|Y]] = E\left[\frac{Y}{2}\right] = \frac{1}{2} \cdot \frac{n+1}{2} = \frac{n+1}{4}.$$
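The answer (n + 1)/4 can be verified by brute-force conditioning, as in the sketch below (an illustration we add): average the conditional means j/2 over the uniform choice of j.

```python
n = 10

# E[X] = sum over j of E[X | Y = j] * P(Y = j), with E[X | Y = j] = j / 2.
expected = sum((j / 2) * (1 / n) for j in range(1, n + 1))
print(expected, (n + 1) / 4)  # both 2.75 for n = 10
```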
Though it requires a somewhat more complicated formula, the variance of a random variable
can be computed from conditional information.
Theorem 4.4.8. Let $X : S \to T$ be a discrete random variable and let $\{B_i : i \geq 1\}$ be a disjoint collection of events for which $P(B_i) > 0$ for all i and such that $\bigcup_{i=1}^{\infty} B_i = S$. Suppose $E[X|B_i]$ and $Var[X|B_i]$ are known. Then Var[X] may be computed as
$$Var[X] = \sum_{i=1}^{\infty} \left(Var[X|B_i] + (E[X|B_i])^2\right)P(B_i) - (E[X])^2.$$
Therefore,
$$\sum_{i=1}^{\infty} \left(Var[X|B_i] + (E[X|B_i])^2\right)P(B_i) = \sum_{i=1}^{\infty} E[X^2|B_i]P(B_i) = E[X^2],$$
where the last equality follows from Theorem 4.4.4 applied to $X^2$, and the result now follows from Theorem 4.2.5.
Theorem 4.4.9. Let X and Y : S → T be two discrete random variables on a sample space S. As
in Theorem 4.4.6 let g (y ) = E [X|Y = y ]. Let h(y ) = V ar [X|Y = y ]. Denoting g (Y ) by E [X|Y ]
and denoting h(Y ) by V ar [X|Y ], then
(3) V ar [E [X|Y ]] = E [(E [X|Y ])2 ] − (E [E [X|Y ]])2 = E [(E [X|Y ])2 ] − (E [X ])2 .
Example 4.4.10. The number of eggs N found in nests of a certain species of turtles has a Poisson
distribution with mean λ. Each egg has probability p of being viable and this event is independent
from egg to egg. Find the mean and variance of the number of viable eggs per nest.
Let N be the total number of eggs in a nest and X the number of viable ones. Then if N = n,
X has a binomial distribution with number of trials n and probability p of success for each trial.
Thus, if $N = n$, X has mean np and variance $np(1-p)$. That is, $E[X|N = n] = pn$ and $Var[X|N = n] = p(1-p)n$, or
$$E[X|N] = pN; \quad Var[X|N] = p(1-p)N.$$
Hence
$$E[X] = E[E[X|N]] = E[pN] = pE[N] = p\lambda$$
and
$$Var[X] = E[Var[X|N]] + Var[E[X|N]] = E[p(1-p)N] + Var[pN] = p(1-p)E[N] + p^2 Var[N] = p(1-p)\lambda + p^2\lambda = p\lambda.$$
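A short simulation confirms the pλ answer. The sketch below is hypothetical code of our own (the parameter values are arbitrary): it generates nest sizes as Poisson(λ) using Knuth's sampling method and keeps each egg with probability p.

```python
import math
import random

def poisson_draw(lam):
    # Knuth's method for sampling a Poisson(lam) variate.
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

lam, p, trials = 4.0, 0.6, 50_000
viable = [sum(1 for _ in range(poisson_draw(lam)) if random.random() < p)
          for _ in range(trials)]
print(sum(viable) / trials, p * lam)  # sample mean close to p * lambda = 2.4
```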
exercises
Ex. 4.4.1. Let X ∼ Geometric(p) and let A be event (X ≤ 3). Calculate E [X|A] and V ar [X|A].
Ex. 4.4.2. Calculate the variance of the quantity X from Example 4.4.7.
Ex. 4.4.3. Return to Example 4.4.5. Suppose that, in addition to the estimates on average return,
the investor had estimates on the standard deviations. If the economy strengthens or weakens, the
estimated standard deviation is 3 million dollars, but if the economy stays the same, the estimated
standard deviation is 2 million dollars. So, in millions of dollars,
Use this information, together with the conditional expectations from Example 4.4.5 to calculate
V ar [X ].
Ex. 4.4.4. A standard light bulb has an average lifetime of four years with a standard deviation of
one year. A Super D-Lux lightbulb has an average lifetime of eight years with a standard deviation
of three years. A box contains many bulbs – 90% of which are standard bulbs and 10% of which
are Super D-Lux bulbs. A bulb is selected at random from the box. What are the average and
standard deviation of the lifetime of the selected bulb?
Ex. 4.4.5. Let X and Y be described by the joint distribution
X = −1 X=0 X=1
Y = −1 1/15 2/15 2/15
Y =0 2/15 1/15 2/15
Y =1 2/15 2/15 1/15
Ex. 4.4.6. Let X and Y be discrete random variables. Let x be in the range of X and let y be in
the range of Y .
(a) Suppose X and Y are independent. Show that E [X|Y = y ] = E [X ] (and so E [X|Y ] =
E [X ]).
(b) Show that E [X|X = x] = x (and so E [X|X ] = X). (From results in this section we know
E [X|Y ] is always a random variable with expected value equal to E [X ]. The results above
in some sense show two extremes. When X and Y are independent, E [X|Y ] is a constant
random variable E [X ]. When X and Y are equal, E [X|X ] is just X itself).
Ex. 4.4.7. Let X ∼ Uniform {1, 2, . . . , n} be independent of Y ∼ Uniform {1, 2, . . . , n}. Let
Z = max(X, Y ) and W = min(X, Y ).
(b) Find E[Z | W].
When faced with two different random variables, we are frequently interested in how the two
different quantities relate to each other. Often the purpose of this is to predict something about
one variable knowing information about the other. For instance, if rainfall amounts in July affect
the quantity of corn harvested in August, then a farmer, or anyone else keenly interested in the
supply and demand of the agriculture industry, would like to be able to use the July information
to help make predictions about August costs.
4.5.1 Covariance
Just as we developed the concepts of expected value and standard deviation to summarize a single
random variable, we would like to develop a number that describes something about how two
different random variables X and Y relate to each other.
Definition 4.5.1. (Covariance of X and Y ) Let X and Y be two discrete random variables on
a sample space S. Then the "covariance of X and Y" is defined as
$$Cov[X, Y] = E[(X - E[X])(Y - E[Y])].$$
Since it is defined in terms of an expected value, there is the possibility that the covariance may be infinite or not defined at all because the sum describing the expectation is divergent.
Notice that if X is larger than its average at the same time that Y is larger than its average
(or if X is smaller than its average at the same time Y is smaller than its average) then (X −
E [X ])(Y − E [Y ]) will contribute a positive result to the expected value describing the covariance.
Conversely, if X is smaller than E[X] while Y is larger than E[Y] or vice versa, a negative result
will be contributed toward the covariance. This means that when two variables tend to be both
above average or both below average simultaneously, the covariance will typically be positive (and
the variables are said to be positively correlated ), but when one variable tends to be above average
when the other is below average, the covariance will typically be negative (and the variables are
said to be negatively correlated ). When Cov [X, Y ] = 0 the variables X and Y are said to be
“uncorrelated”.
For example, suppose X and Y are the height and weight, respectively, of an individual ran-
domly selected from a large population. We might expect that Cov [X, Y ] > 0 since people who are
taller than average also tend to be heavier than average and people who are shorter than average
tend to be lighter. Conversely suppose X and Y represent elevation and air density at a randomly
selected point on Earth. We might expect Cov [X, Y ] < 0 since locations at a higher elevation tend
to have thinner air.
Example 4.5.2. Consider a pair of random variables X and Y with joint distribution
X = −1 X=0 X=1
Y = −1 1/15 2/15 2/15
Y =0 2/15 1/15 2/15
Y =1 2/15 2/15 1/15
By a routine calculation of the marginal distributions it can be shown that X, Y ∼ Uniform({−1, 0, 1})
and therefore that E [X ] = E [Y ] = 0. However, it is clear from the joint distribution that when
X = −1, then Y is more likely to be above average than below, while when X = 1, then Y is
more likely to be below average than above. This suggests the two random variables should have
a negative correlation. In fact, we can calculate
$$E[XY] = (-1)\left(\tfrac{4}{15}\right) + 0\left(\tfrac{9}{15}\right) + 1\left(\tfrac{2}{15}\right) = -\frac{2}{15},$$
and therefore $Cov[X, Y] = E[XY] - E[X]E[Y] = -\frac{2}{15}$.
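The covariance computation from a joint table is purely mechanical, as the following sketch (our own addition) shows for the distribution above.

```python
# Joint pmf from the table: joint[(x, y)] = P(X = x, Y = y).
joint = {(-1, -1): 1/15, (0, -1): 2/15, (1, -1): 2/15,
         (-1,  0): 2/15, (0,  0): 1/15, (1,  0): 2/15,
         (-1,  1): 2/15, (0,  1): 2/15, (1,  1): 1/15}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
print(e_xy - e_x * e_y)  # -2/15, approximately -0.1333
```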
As its name suggests, the covariance is closely related to the variance; indeed, $Cov[X, X] = Var[X]$.
Theorem 4.5.4. Let X and Y be discrete random variables with finite mean for which E [XY ] is
also finite. Then
Cov [X, Y ] = E [XY ] − E [X ]E [Y ].
As with the expected value, the covariance is a linear quantity. It is also related to the concept
of independence.
Theorem 4.5.5. Let X, Y , and Z be discrete random variables, and let a, b ∈ R. Then,
(d) If X and Y are independent with a finite covariance, then Cov [X, Y ] = 0.
Therefore, reversing the roles of X and Y does not change the correlation.
Proof of (2) - This follows from linearity properties of expected value. Using Theorem 4.5.4
Proof of (3) - This proof is essentially the same as that of (2) and is left as an exercise.
Proof of (4) - We have previously seen that if X and Y are independent, then E[XY] = E[X]E[Y].
Using Theorem 4.5.4 it follows that
Though independence of X and Y guarantees that they are uncorrelated, the converse is not
true. It is possible that Cov [X, Y ] = 0 and yet that X and Y are dependent, as the next example
shows.
Example 4.5.6. Let X, Y be two discrete random variables taking values {−1, 1}. Suppose their
joint distribution P (X = x, Y = y ) is given by the table
x=-1 x=1
4.5.2 Correlation
The possible size of Cov [X, Y ] has upper and lower bounds based on the standard deviations of
the two variables.
Theorem 4.5.7. Let X and Y be two discrete random variables both with finite variance. Then

−σX σY ≤ Cov [X, Y ] ≤ σX σY .
Proof - Standardize both variables and consider the expected value of their sum squared. Since
this is the expected value of a non-negative quantity,

0 ≤ E [((X − µX )/σX + (Y − µY )/σY )²]
  = E [(X − µX )²/σX² + 2(X − µX )(Y − µY )/(σX σY ) + (Y − µY )²/σY²]
  = E [(X − µX )²]/σX² + 2E [(X − µX )(Y − µY )]/(σX σY ) + E [(Y − µY )²]/σY²
  = 1 + 2 Cov [X, Y ]/(σX σY ) + 1.
Solving the inequality for the covariance yields

−σX σY ≤ Cov [X, Y ].
A similar computation (see Exercises) for the expected value of the squared difference of the
standardized variables shows
Cov [X, Y ] ≤ σX σY .
Putting both inequalities together proves the theorem.
Definition 4.5.8. The quantity Cov [X, Y ]/(σX σY ) from Theorem 4.5.7 is known as the “correlation” of
X and Y and is often denoted as ρ[X, Y ]. Thinking in terms of dimensional analysis, both the
numerator and denominator include the units of X and the units of Y . The correlation, therefore,
has no units associated with it. It is thus a dimensionless rescaling of the covariance and is
frequently used as an absolute measure of trends between the two variables. By Theorem 4.5.7, the
correlation always satisfies −1 ≤ ρ[X, Y ] ≤ 1.
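To illustrate, the correlation of the variables in Example 4.5.2 can be computed from the joint table; the moment helper below is our own shorthand, not the text's notation. By Theorem 4.5.7 the result must land in [−1, 1].

from fractions import Fraction as F

joint = {(-1, -1): F(1, 15), (0, -1): F(2, 15), (1, -1): F(2, 15),
         (-1,  0): F(2, 15), (0,  0): F(1, 15), (1,  0): F(2, 15),
         (-1,  1): F(2, 15), (0,  1): F(2, 15), (1,  1): F(1, 15)}

def moment(f):
    """E[f(X, Y)] over the joint pmf above."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

cov   = moment(lambda x, y: x * y) - moment(lambda x, y: x) * moment(lambda x, y: y)
var_x = moment(lambda x, y: x * x) - moment(lambda x, y: x) ** 2
var_y = moment(lambda x, y: y * y) - moment(lambda x, y: y) ** 2

rho = float(cov) / float(var_x * var_y) ** 0.5
print(rho)    # -0.2 (up to rounding), safely inside [-1, 1]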
Exercises
Ex. 4.5.1. Consider the experiment of flipping two coins. Let X be the number of heads among
the coins and let Y be the number of tails among the coins.
(a) Should you expect X and Y to be positively correlated, negatively correlated, or uncorrelated?
Why?
Ex. 4.5.2. Let X ∼ Uniform({0, 1, 2}) and let Y be the number of heads in X flips of a coin.
(a) Should you expect X and Y to be positively correlated, negatively correlated, or uncorrelated?
Why?
Ex. 4.5.3. Let X and Y be discrete random variables with finite variances and finite covariance.
(a) Show that

V ar [X + Y ] = V ar [X ] + V ar [Y ] + 2 Cov [X, Y ].

(A numerical check of this identity is sketched after part (c).)
(b) Use (a) to conclude that when X and Y are positively correlated, then V ar [X + Y ] >
V ar [X ] + V ar [Y ], while when X and Y are negatively correlated, V ar [X + Y ] < V ar [X ] +
V ar [Y ].
(c) Suppose Xi , 1 ≤ i ≤ n, are discrete random variables with finite variances and covariances.
Use induction and (a) to conclude that

V ar [∑_{i=1}^{n} Xi ] = ∑_{i=1}^{n} V ar [Xi ] + 2 ∑_{1≤i<j≤n} Cov [Xi , Xj ].
We conclude this section with a discussion of exchangeable random variables. In brief, we say
that a collection of random variables X1 , X2 , . . . , Xn is exchangeable if the joint probability mass
function f of (X1 , X2 , . . . , Xn ) is a symmetric function, meaning that for every permutation σ of
{1, 2, . . . , n},

f (x1 , x2 , . . . , xn ) = f (xσ(1) , xσ(2) , . . . , xσ(n) ).

In other words, the distribution of (X1 , X2 , . . . , Xn ) does not depend on the order in which the
Xi ’s appear. In particular, if X1 , X2 , . . . , Xn are exchangeable, then for any of the possible n!
permutations σ of {1, 2, . . . , n}, (X1 , X2 , . . . , Xn ) and (Xσ(1) , Xσ(2) , . . . , Xσ(n) ) have the same
distribution. Note also that any collection of independent and identically distributed random
variables is exchangeable.
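The defining symmetry can be tested mechanically for a finite joint pmf. Below is a minimal sketch (the function name is_exchangeable is ours) that checks every permutation of the coordinates; the example pmf, which depends only on the coordinate sum, is exchangeable but not independent.

from itertools import permutations

def is_exchangeable(joint):
    """Check symmetry of a finite joint pmf {(x1, ..., xn): p}."""
    n = len(next(iter(joint)))
    return all(joint.get(tuple(outcome[i] for i in sigma), 0) == p
               for outcome, p in joint.items()
               for sigma in permutations(range(n)))

# Two dependent coins whose pmf depends only on the sum of the coordinates:
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(is_exchangeable(joint))    # True, though X1 and X2 are not independent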
Example 4.6.3. Suppose we have an urn of m distinct objects labelled {1, 2, . . . , m}. Objects are
drawn at random from the urn without replacement until the urn is empty. Let Xi be the label
of the i-th object that is drawn. Then (X1 , X2 , . . . , Xm ) is a particular ordering of the objects in
the urn. Since each ordering is equally likely and there are m! possible orderings, the joint
probability mass function is

f (x1 , x2 , . . . , xm ) = P (X1 = x1 , X2 = x2 , . . . , Xm = xm ) = 1/m!,

whenever (x1 , x2 , . . . , xm ) is an ordering of {1, 2, . . . , m} (and 0 otherwise). Since this function
is symmetric, the random variables X1 , X2 , . . . , Xm are exchangeable.
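A short simulation supports the example: shuffling the urn and tallying the draw orders shows every ordering appearing with frequency near 1/m!. This Monte Carlo sketch (the constants are our own choices) is merely illustrative.

import random
from collections import Counter

m = 3                       # three labelled objects in the urn
trials = 60_000
counts = Counter()

for _ in range(trials):
    urn = list(range(1, m + 1))
    random.shuffle(urn)     # one complete draw order, without replacement
    counts[tuple(urn)] += 1

for order in sorted(counts):
    print(order, counts[order] / trials)   # each frequency close to 1/m! = 1/6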
Theorem 4.6.4. Let X1 , X2 , . . . , Xn be exchangeable random variables. Then they all have the
same distribution.
Proof - The random variables (X1 , X2 , . . . , Xn ) are exchangeable, so for any permutation σ and
any xi ∈ Range(Xi ),

P (X1 = x1 , . . . , Xn = xn ) = P (X1 = xσ(1) , . . . , Xn = xσ(n) ).

As this is true for all permutations σ, all the random variables must have the same range; otherwise,
if the ranges of any two of them differed, we could get a contradiction by choosing an appropriate
permutation. Let T denote the common range. Let i ∈ {2, . . . , n} and a, b ∈ T , and let

A = {xj ∈ T : 1 ≤ j ≤ n, j ≠ 1, i}.

Using the exchangeable property with the permutation σ given by σ(i) = 1, σ(1) = i, and σ(j) = j
for all j ≠ 1, i, we have that for any x2 , . . . , xi−1 , xi+1 , . . . , xn ∈ A,

P (X1 = a, X2 = x2 , . . . , Xi = b, . . . , Xn = xn ) = P (X1 = b, X2 = x2 , . . . , Xi = a, . . . , Xn = xn ).
Therefore,

P (X1 = a) = P ( ⋃_{b∈T} {X1 = a, Xi = b} )
           = ∑_{b∈T} P (X1 = a, Xi = b)
           = ∑_{b∈T} P ( ⋃_{xj∈A} {X1 = a, X2 = x2 , . . . , Xi−1 = xi−1 , Xi = b, Xi+1 = xi+1 , . . . , Xn = xn } )
           = ∑_{b∈T} ∑_{xj∈A} P (X1 = a, X2 = x2 , . . . , Xi = b, . . . , Xn = xn )
           = ∑_{b∈T} ∑_{xj∈A} P (X1 = b, X2 = x2 , . . . , Xi = a, . . . , Xn = xn )
           = ∑_{b∈T} P ( ⋃_{xj∈A} {X1 = b, X2 = x2 , . . . , Xi−1 = xi−1 , Xi = a, Xi+1 = xi+1 , . . . , Xn = xn } )
           = ∑_{b∈T} P (X1 = b, Xi = a)
           = P ( ⋃_{b∈T} {X1 = b, Xi = a} )
           = P (Xi = a).
So the distribution of Xi is the same as the distribution of X1 and hence all of them have the same
distribution.
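The conclusion of Theorem 4.6.4 can be verified exhaustively for the urn of Example 4.6.3: enumerating all m! equally likely orders shows that every draw position has the same (uniform) marginal distribution. The sketch below, with m = 4, is our own check rather than part of the text.

from itertools import permutations
from fractions import Fraction as F

m = 4
orders = list(permutations(range(1, m + 1)))   # all m! equally likely orders

for i in range(m):
    # Marginal distribution of the label seen on the i-th draw.
    marginal = {a: F(sum(1 for o in orders if o[i] == a), len(orders))
                for a in range(1, m + 1)}
    print(i + 1, marginal)    # every position: probability 1/4 for each label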
Example 4.6.5. (Sampling without Replacement) An urn contains b black balls and r red
balls. A ball is drawn at random and its colour noted. This procedure is repeated n times. Assume
that n ≤ b + r and let max(0, n − r) ≤ k ≤ min(n, b). In this example we examine the random
variables Xi given by

Xi = 1 if the i-th ball drawn is black,
     0 otherwise.
We have already seen (see Theorem 2.3.2 and Example 2.3.1) that

P (k black balls are drawn in n draws) = (n choose k) · ∏_{i=0}^{k−1} (b − i) ∏_{i=0}^{n−k−1} (r − i) / ∏_{i=0}^{n−1} (r + b − i).
Using the same proof we see that the joint probability mass function of (X1 , X2 , . . . , Xn ) is given
by

f (x1 , x2 , . . . , xn ) = ∏_{i=0}^{k−1} (b − i) ∏_{i=0}^{n−k−1} (r − i) / ∏_{i=0}^{n−1} (r + b − i),

where xi ∈ {0, 1} and k = ∑_{i=1}^{n} xi . It is clear from the right hand side of the above that the
function f depends only on ∑_{i=1}^{n} xi . Hence any permutation of the xi ’s will not change the
value of f . So f is a symmetric function and the random variables are exchangeable. Therefore,
by Theorem 4.6.4 we know that for any 1 ≤ i ≤ n,

P (Xi = 1) = P (X1 = 1) = b/(b + r).
So we can conclude that they are all identically distributed as Bernoulli(b/(b + r)) and the probability
of choosing a black ball in the i-th draw is b/(b + r) (see Exercise 4.6.4 for a similar result). Further,
for any i ≠ j, exchangeability implies that the pair (Xi , Xj ) has the same joint distribution as
(X1 , X2 ).
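For small urns the claim P (Xi = 1) = b/(b + r) can be verified by brute force: enumerating all equally likely orderings of the balls (listing repeated colours with multiplicity) gives the exact marginal at every position. The parameters b = 3, r = 2 below are our own choice.

from itertools import permutations
from fractions import Fraction as F

b, r = 3, 2
balls = [1] * b + [0] * r            # 1 = black, 0 = red

# All (b + r)! orderings of the listed balls are equally likely.
orders = list(permutations(balls))
for i in range(b + r):
    p_black = F(sum(o[i] for o in orders), len(orders))
    print("draw", i + 1, ":", p_black)    # 3/5 = b/(b + r) at every position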
Exercises
Ex. 4.6.1. Suppose X1 , X2 , . . . , Xn are exchangeable random variables. For any 2 ≤ m < n, show
that X1 , X2 , . . . , Xm are also a collection of exchangeable random variables.
Ex. 4.6.2. Suppose X1 , X2 , . . . , Xn are exchangeable random variables. Let T denote their common
range. Suppose b : T → R. Show that b(X1 ), b(X2 ), . . . , b(Xn ) is also a collection of exchangeable
random variables.
Ex. 4.6.3. Suppose n cards are drawn from a standard pack of 52 cards without replacement (so
we will assume n ≤ 52). For 1 ≤ i ≤ n, let Xi be random variables given by
Xi = 1 if the i-th card drawn is black in colour,
     0 otherwise.
(a) Suppose n = 52. Using Example 4.6.3 and the Exercise 4.6.2 show that (X1 , X2 , X3 , . . . Xn )
are exchangeable.
(b) Show that (X1 , X2 , X3 , . . . , Xn ) are exchangeable for any 2 ≤ n ≤ 52. Hint: If n < 52, extend
the sample to exhaust the deck of cards. Use (a) and Exercise 4.6.1.
(c) Find the probability that the second and fourth card drawn have the same colour.
Ex. 4.6.4. (Polya Urn Scheme) An urn contains b black balls and r red balls. A ball is drawn
at random and its colour noted. Then it is replaced along with c ≥ 0 balls of the same colour. This
procedure is repeated n times.
(c) Let 1 ≤ m ≤ n. Let Bm be the event that the m-th ball drawn is black. Show that

P (Bm ) = b/(b + r).