ECE2191 Lecture Notes
PROBABILITY MODELS IN ENGINEERING
COURSE NOTES
ECE2191
Dr Faezeh Marzbanrad
Lecturers:
Dr Faezeh Marzbanrad (Clayton)
Dr Wynita Griggs (Clayton)
Dr Mohamed Hisham (Malaysia)
2020
Contents
1 Preliminary Concepts
1.1 Probability Models in Engineering
1.2 Review of Set Theory
1.3 Operations on sets
1.4 Other Notations
1.5 Random Experiments
1.5.1 Tree Diagrams
1.5.2 Coordinate System
2 Probability Theory
2.1 Definition of Probability
2.1.1 Relative Frequency Definition
2.1.2 Axiomatic Definition
2.2 Joint Probabilities
2.3 Conditional Probabilities
2.3.1 Bayes's Theorem
2.4 Independence
2.5 Basic Combinatorics
2.5.1 Sequence of Experiments
2.5.2 Sampling with Replacement and with Ordering
2.5.3 Sampling without Replacement and with Ordering
2.5.4 Sampling without Replacement and without Ordering
2.5.5 Sampling with Replacement and without Ordering
3 Random Variables
3.1 The Notion of a Random Variable
3.2 Discrete Random Variables
3.2.1 Probability Mass Function
3.2.2 The Cumulative Distribution Function
3.2.3 Expected Value and Moments
3.2.4 Conditional Probability Mass Function and Expectation
3.2.5 Common Discrete Random Variables
3.3 Continuous Random Variables
3.3.1 The Probability Density Function
3.3.2 Conditional CDF and PDF
3.3.3 The Expected Value and Moments
3.3.4 Important Continuous Random Variables
3.4 The Markov and Chebyshev Inequalities
1 Preliminary Concepts
Definition 1.6. Intersection: The intersection of sets 𝐴 and 𝐵, denoted 𝐴 ∩ 𝐵, is the set of objects common to both 𝐴 and 𝐵; i.e., 𝐴 ∩ 𝐵 = {𝜁 : 𝜁 ∈ 𝐴 and 𝜁 ∈ 𝐵}.
Note that if 𝐴 ⊂ 𝐵, then 𝐴 ∩ 𝐵 = 𝐴. In particular, we always have 𝐴 ∩ 𝑆 = 𝐴.
Definition 1.7. Complement: The complement of a set 𝐴, denoted 𝐴𝑐, is the collection of all objects in 𝑆 not included in 𝐴; i.e., 𝐴𝑐 = {𝜁 ∈ 𝑆 : 𝜁 ∉ 𝐴}.
Definition 1.8. Difference: The relative complement or difference of sets 𝐴 and 𝐵 is the set of elements in 𝐴 that are not in 𝐵; i.e., 𝐴 − 𝐵 = {𝜁 : 𝜁 ∈ 𝐴 and 𝜁 ∉ 𝐵}.
Note that 𝐴 − 𝐵 = 𝐴 ∩ 𝐵𝑐.
These definitions and relationships among sets are illustrated in Figure 1.1. These diagrams are
called Venn diagrams, which represent sets by simple plane areas within the universal set, pictured
as a rectangle. Venn diagrams are important visual aids to understand relationships among sets.
Figure 1.1: Venn diagrams: (a) universal set 𝑆; (b) set 𝐴; (c) set 𝐵; (d) set 𝐴𝑐; (e) set 𝐴 ∪ 𝐵; (f) set 𝐴 ∩ 𝐵; (g) 𝐴 ⊂ 𝐵; (h) disjoint sets 𝐴 and 𝐵; (i) set 𝐴 − 𝐵.
Theorem 1.1
If 𝐴 ⊂ 𝐵 and 𝐵 ⊂ 𝐴, then 𝐴 = 𝐵.
Proof. Since the empty set is a subset of any set, if 𝐴 = ∅ then 𝐵 ⊂ 𝐴 implies that 𝐵 = ∅.
Similarly, if 𝐵 = ∅ then 𝐴 ⊂ 𝐵 implies that 𝐴 = ∅. The theorem is obviously true if 𝐴 and 𝐵
are both empty. If 𝐴 and 𝐵 are nonempty, since 𝐴 ⊂ 𝐵, if 𝜁 ∈ 𝐴 then 𝜁 ∈ 𝐵. Since 𝐵 ⊂ 𝐴, if
𝜁 ∈ 𝐵 then 𝜁 ∈ 𝐴. We therefore conclude that 𝐴 = 𝐵.
Example 1.1
Solution. We first sketch the boundaries of the given sets 𝐴, 𝐵, 𝐶, and 𝐷. Note that if the
boundary of the region is included in the set, it is indicated with a solid line, and if not, it is
indicated with a dotted line. We have
𝐸 = 𝐴 ∩ 𝐵 = {(𝑥, 𝑦) : 𝑥 − 1 ≤ 𝑦 ≤ 𝑥 }
and
𝐹 = 𝐶 ∩ 𝐷 = {(𝑥, 𝑦) : 0 ≤ 𝑦 < 1}.
The set 𝐺 is the set of all ordered pairs (𝑥, 𝑦) satisfying both 𝑥 − 1 ≤ 𝑦 ≤ 𝑥 and 0 ≤ 𝑦 < 1.
Using 1− to denote a value just less than 1, the second inequality may be expressed as
0 ≤ 𝑦 ≤ 1− . We may then express the set 𝐺 as
𝐺 = {(𝑥, 𝑦) : 𝑚𝑎𝑥 {0, 𝑥 − 1} ≤ 𝑦 ≤ 𝑚𝑖𝑛{𝑥, 1− }}.
The set 𝐻 is obtained from 𝐺 by folding about the y-axis and translating down one unit.
This can be seen from the definitions of G and H by noting that (𝑥, 𝑦) ∈ 𝐻 if (−𝑥, 𝑦 + 1) ∈ 𝐺;
hence, we replace 𝑥 with −𝑥 and 𝑦 with 𝑦 + 1 in the above result for 𝐺 to obtain
𝐻 = {(𝑥, 𝑦) : 𝑚𝑎𝑥 {0, −𝑥 − 1} ≤ 𝑦 + 1 ≤ 𝑚𝑖𝑛{−𝑥, 1− }},
or
𝐻 = {(𝑥, 𝑦) : 𝑚𝑎𝑥 {−1, −𝑥 − 2} ≤ 𝑦 ≤ 𝑚𝑖𝑛{−1 − 𝑥, 0− }}.
The sets are illustrated in Figure 1.2.
Figure 1.2: The sets defined in Example 1.1.
Commutative Properties:
𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴 (1.1)
𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴 (1.2)
Associative Properties:
𝐴 ∪ (𝐵 ∪ 𝐶) = (𝐴 ∪ 𝐵) ∪ 𝐶 (1.3)
𝐴 ∩ (𝐵 ∩ 𝐶) = (𝐴 ∩ 𝐵) ∩ 𝐶 (1.4)
Distributive Properties:
𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶) (1.5)
𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶) (1.6)
De Morgan’s Laws:
(𝐴 ∪ 𝐵)𝑐 = 𝐴𝑐 ∩ 𝐵𝑐 (1.7)
(𝐴 ∩ 𝐵)𝑐 = 𝐴𝑐 ∪ 𝐵𝑐 (1.8)
𝐴 ∪ ∅ = 𝐴 (1.9)
𝐴 ∩ 𝑆 = 𝐴 (1.10)
𝐴 ∩ ∅ = ∅ (1.11)
𝐴 ∪ 𝑆 = 𝑆 (1.12)
𝐴 ∩ 𝐴𝑐 = ∅ (1.13)
𝐴 ∪ 𝐴𝑐 = 𝑆 (1.14)
(𝐴𝑐)𝑐 = 𝐴 (1.15)
Example 1.2
Additional insight to operations on sets is provided by the correspondence between the algebra
of set inclusion and Boolean algebra. An element either belongs to a set or it does not. Thus,
interpreting sets as Boolean (logical) variables having values of 0 or 1, the ∪ operation as the
logical "OR", the ∩ as the logical "AND" operation, and the 𝑐 as the logical complement "NOT",
any expression involving set operations can be treated as a Boolean expression.
Theorem 1.3
𝐴 ∪ (𝐴𝑐 ∩ 𝐵) = 𝐴 ∪ 𝐵. (1.16)
Proof. Using the distributive property and (1.14):
𝐴 ∪ (𝐴𝑐 ∩ 𝐵) = (𝐴 ∪ 𝐴𝑐 ) ∩ (𝐴 ∪ 𝐵)
= 𝑆 ∩ (𝐴 ∪ 𝐵)
= 𝐴 ∪ 𝐵.
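The identities above, and the result of Theorem 1.3, can be checked numerically. The following is a minimal sketch using Python's built-in set type; the universal set 𝑆 and the sets 𝐴 and 𝐵 are arbitrary choices made for illustration only.

# Quick numerical check of the set identities above using Python's built-in sets.
# The universal set S and the sets A, B chosen here are arbitrary examples.
S = set(range(10))          # universal set
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def complement(X, S=S):
    """Complement of X relative to the universal set S."""
    return S - X

# De Morgan's laws (1.7) and (1.8)
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

# Theorem 1.3: A union (A^c intersect B) = A union B
assert A | (complement(A) & B) == A | B
print("All set identities hold for this example.")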
Theorem 1.4
Principle of Duality: Any set identity remains true if the symbols ∪,∩, S, and ∅, are replaced
with the symbols ∩,∪,∅, and S, respectively.
Proof. The proof follows by applying De Morgan’s Laws and renaming sets 𝐴𝑐 , 𝐵𝑐 , etc. as
𝐴, 𝐵, etc.
Properties of set operations are easily extended to deal with any finite number of sets. To do this,
we need notation for the union and intersection of a collection of sets.
Definition 1.9. Union: We define the union of a collection of sets (or “set of sets”)
{𝐴𝑖 : 𝑖 ∈ 𝐼 } (1.17)
by:
⋃_{𝑖∈𝐼} 𝐴𝑖 = {𝜁 ∈ 𝑆 : 𝜁 ∈ 𝐴𝑖 for some 𝑖 ∈ 𝐼 } (1.18)
Definition 1.10. Intersection: Similarly, we define the intersection of a collection of sets
{𝐴𝑖 : 𝑖 ∈ 𝐼 } (1.19)
by:
⋂_{𝑖∈𝐼} 𝐴𝑖 = {𝜁 ∈ 𝑆 : 𝜁 ∈ 𝐴𝑖 for every 𝑖 ∈ 𝐼 } (1.20)
𝐵 ∩ ⋃^{𝑛}_{𝑖=1} 𝐴𝑖 = ⋃^{𝑛}_{𝑖=1} (𝐵 ∩ 𝐴𝑖 ) (1.23)
𝐵 ∪ ⋂^{𝑛}_{𝑖=1} 𝐴𝑖 = ⋂^{𝑛}_{𝑖=1} (𝐵 ∪ 𝐴𝑖 ) (1.24)
De Morgan’s Laws:
(⋂^{𝑛}_{𝑖=1} 𝐴𝑖 )𝑐 = ⋃^{𝑛}_{𝑖=1} 𝐴𝑐𝑖 (1.25)
(⋃^{𝑛}_{𝑖=1} 𝐴𝑖 )𝑐 = ⋂^{𝑛}_{𝑖=1} 𝐴𝑐𝑖 (1.26)
Throughout much of probability, it is useful to decompose a set into a union of simpler, non-
overlapping sets. This is an application of the “divide and conquer” approach to problem solving.
Necessary terminology is established in the following definitions.
Definition 1.11. Mutually Exclusive: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 are mutually exclusive (or disjoint)
if 𝐴𝑖 ∩ 𝐴 𝑗 = ∅ for all 𝑖 and 𝑗 with 𝑖 ≠ 𝑗 .
Definition 1.12. Partition: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 form a partition of the set 𝐵 if they are mutually exclusive and 𝐵 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 = ⋃^{𝑛}_{𝑖=1} 𝐴𝑖 .
Definition 1.13. Collectively Exhaustive: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 are collectively exhaustive if 𝑆 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 = ⋃^{𝑛}_{𝑖=1} 𝐴𝑖 .
Example 1.3
Let 𝑆 = {(𝑥, 𝑦) : 𝑥 ≥ 0, 𝑦 ≥ 0}, 𝐴 = {(𝑥, 𝑦) : 𝑥 + 𝑦 < 1}, 𝐵 = {(𝑥, 𝑦) : 𝑥 < 𝑦}, and
𝐶 = {(𝑥, 𝑦) : 𝑥𝑦 > 1/4}. Are the sets 𝐴, 𝐵, and 𝐶 mutually exclusive, collectively exhaustive,
and/or a partition of 𝑆?
Solution. Since 𝐴 ∩ 𝐶 = ∅, the sets 𝐴 and 𝐶 are mutually exclusive; however, 𝐴 ∩ 𝐵 ≠ ∅ and
𝐵 ∩ 𝐶 ≠ ∅, so 𝐴 and 𝐵, and 𝐵 and 𝐶 are not mutually exclusive. Since 𝐴 ∪ 𝐵 ∪ 𝐶 ≠ 𝑆, the
events are not collectively exhaustive. The events 𝐴, 𝐵, and 𝐶 are not a partition of S since
they are not mutually exclusive and collectively exhaustive.
Definition 1.14. Cartesian Product: The Cartesian product of sets 𝐴 and 𝐵 is a set of ordered
pairs of elements of 𝐴 and 𝐵:
𝐴 × 𝐵 = {𝜁 = (𝜁 1, 𝜁 2 ) : 𝜁 1 ∈ 𝐴, 𝜁 2 ∈ 𝐵}. (1.27)
The Cartesian product of sets 𝐴1, 𝐴2, ..., 𝐴𝑛 is a set of n-tuples (an ordered list of 𝑛 elements) of elements of 𝐴1, 𝐴2, ..., 𝐴𝑛 :
𝐴1 × 𝐴2 × ... × 𝐴𝑛 = {𝜁 = (𝜁1, 𝜁2, ..., 𝜁𝑛 ) : 𝜁1 ∈ 𝐴1, 𝜁2 ∈ 𝐴2, ..., 𝜁𝑛 ∈ 𝐴𝑛 }. (1.28)
An important example of a Cartesian product is the usual n-dimensional real Euclidean space:
𝑅^{𝑛} = 𝑅 × 𝑅 × ... × 𝑅 (𝑛 terms). (1.29)
Note that if 𝑎 > 𝑏, then (𝑎, 𝑏) = (𝑎, 𝑏] = [𝑎, 𝑏) = [𝑎, 𝑏] = ∅. If 𝑎 = 𝑏, then (𝑎, 𝑏) = (𝑎, 𝑏] = [𝑎, 𝑏) = ∅ and [𝑎, 𝑏] = {𝑎}. The notation (𝑎, 𝑏) is also used to denote an ordered pair; we depend on the
context to determine whether (𝑎, 𝑏) represents an open interval of real numbers or an ordered
pair.
Example 1.4
Consider the experiment of flipping a fair coin once, where fair means that the coin is not
biased in weight to a particular side. There are two possible outcomes: a head (𝜁1 = 𝐻 ) or a tail (𝜁2 = 𝑇 ). Thus, the sample space 𝑆 consists of two outcomes, 𝜁1 = 𝐻 and 𝜁2 = 𝑇 .
Example 1.5
Now consider flipping the coin until a tail occurs, at which point the experiment is terminated. The sample space consists of a collection of sequences of coin tosses. The outcomes are 𝜁𝑛 , 𝑛 = 1, 2, 3, .... The final toss in any particular sequence is a tail and terminates the sequence. All tosses prior to the occurrence of the tail must be heads. The
possible outcomes that may occur are: 𝜁 1 = (𝑇 ), 𝜁 2 = (𝐻,𝑇 ), 𝜁 3 = (𝐻, 𝐻,𝑇 ), ...
Note that in this case, n can extend to infinity. This is a combined sample space resulting
from conducting independent but identical experiments. In this example, the sample space
is countably infinite.
Example 1.6
A cubical die with numbered faces is rolled and the result observed. The sample space
consists of six possible outcomes, 𝜁 1 = 1, 𝜁 2 = 2, ..., 𝜁 6 = 6, indicating the possible observed
faces of the cubical die.
Example 1.7
Now consider the experiment of rolling two dice and observing the results. The sample space
consists of 36 outcomes: 𝜁1 = (1, 1), 𝜁2 = (1, 2), ..., 𝜁6 = (1, 6), 𝜁7 = (2, 1), 𝜁8 = (2, 2), ..., 𝜁36 = (6, 6), where the first component in the ordered pair indicates the result of the toss of the first die,
and the second component indicates the result of the toss of the second die. Alternatively
we can consider this experiment as two distinct experiments, each consisting of rolling
a single die. The sample spaces (𝑆 1 and 𝑆 2 ) for each of the two experiments are identical,
namely, the same as Example 1.6. We may now consider the sample space of the original
experiment 𝑆, to be the combination of the sample spaces 𝑆 1 and 𝑆 2 , which consists of
all possible combinations of the elements of both 𝑆 1 and 𝑆 2 . This is another example of a
combined sample space. Several interesting events can be also defined from this experiment,
such as:
𝐴 = {the sum of the outcomes of the two rolls = 4},
𝐵 = {the outcomes of the two rolls are identical},
𝐶 = {the first roll was bigger than the second}.
The choice of a particular sample space depends upon the questions that are to be answered
concerning the experiment. Suppose that in Example 1.7, we were asked to record after each roll
the sum of the numbers shown on the two faces. Then, the sample space could be represented
by eleven outcomes, 𝜁1 = 2, 𝜁2 = 3, ..., 𝜁11 = 12. However, the original sample space is in some sense more fundamental, because although the sum of the die faces can be determined from the numbers on the die faces, the sum is not sufficient to specify the sequence of numbers that occurred.
Example 1.8
The coin in Example 1.4 is tossed twice. Illustrate the sample space with a tree diagram.
Let 𝐻𝑖 and 𝑇𝑖 denote the outcome of a head or a tail on the 𝑖th toss, respectively. The
sample space is: 𝑆 = {𝐻 1𝐻 2, 𝐻 1𝑇2,𝑇1𝐻 2,𝑇1𝑇2 } The tree diagram illustrating the sample space
for this sequence of two coin tosses is shown in Figure 1.3.
Each node represents an outcome of one coin toss and the branches of the tree connect
the nodes. The number of branches to the right of each node corresponds to the number
of outcomes for the next coin toss (or experiment). A sequence of samples connected by
branches in a left to right path from the origin to a terminal node represents a sample point
for the combined experiment. There is a one-to-one correspondence between the paths in
the tree diagram and the sample points in the sample space for the combined experiment.
For the two-dice experiment of Example 1.7, the corresponding combined sample space contains 36 sample points. Additionally, we distinguish between sample points with regard to order; e.g., (1,2) is different from (2,1).
Further Reading
1. John D. Enderle, David C. Farden, Daniel J. Krause, Basic Probability Theory for Biomedical
Engineers, Morgan & Claypool, 2006: sections 1.1 and 1.2
2. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, 2nd ed., Elsevier 2012: section 2.1
3. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: sections 1.3 and 2.1
4. Charles W. Therrien, Probability for electrical and computer engineers, CRC Press, 2004:
chapter 1
2 Probability Theory
For example, if a six-sided die is rolled a large number of times and the numbers on the face of the
die come up in approximately equal proportions, then we could say that the probability of each
number on the upturned face of the die is 1/6. The difficulty with this definition is determining
when 𝑁 is sufficiently large and indeed if the limit actually exists. We will certainly use this
definition in practice, relating deduced probabilities to the physical world, but we will not develop
probability theory from it.
The following theorem, which is useful for solving probability problems, is a direct consequence of the axioms of probability.
Theorem 2.1
Assuming that all events indicated are in the event space 𝐹 , we have:
(i) 𝑃 (𝐴𝑐 ) = 1 − 𝑃 (𝐴),
(ii) 𝑃 (∅) = 0,
(iii) 0 ≤ 𝑃 (𝐴) ≤ 1,
(iv) 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵)
(v) 𝑃 (𝐵) ≤ 𝑃 (𝐴) if 𝐵 ⊂ 𝐴.
Proof.
(i) Since 𝑆 = 𝐴 ∪ 𝐴𝑐 and 𝐴 ∩ 𝐴𝑐 = ∅, we apply the second and third axioms of probability
to obtain 𝑃 (𝑆) = 1 = 𝑃 (𝐴) + 𝑃 (𝐴𝑐 ), from which (i) follows.
(ii) Applying (i) with 𝐴 = 𝑆 we have 𝐴𝑐 = ∅ so that 𝑃 (∅) = 1 − 𝑃 (𝑆) = 0.
(iii) From (i) we have 𝑃 (𝐴) = 1 − 𝑃 (𝐴𝑐 ), from the first axiom we have 𝑃 (𝐴) ≥ 0 and
𝑃 (𝐴𝑐 ) ≥ 0; consequently, 0 ≤ 𝑃 (𝐴) ≤ 1.
(iv) Let 𝐶 = 𝐵 ∩ 𝐴𝑐 . Then 𝐴 ∪ 𝐶 = 𝐴 ∪ (𝐵 ∩ 𝐴𝑐 ) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐴𝑐 ) = 𝐴 ∪ 𝐵, and
𝐴 ∩ 𝐶 = 𝐴 ∩ 𝐵 ∩ 𝐴𝑐 = ∅, so that 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴 ∪ 𝐶) = 𝑃 (𝐴) + 𝑃 (𝐶). Now we find
𝑃 (𝐶). Since 𝐵 = 𝐵 ∩ 𝑆 = 𝐵 ∩ (𝐴 ∪ 𝐴𝑐 ) = (𝐵 ∩ 𝐴) ∪ (𝐵 ∩ 𝐴𝑐 ) and (𝐵 ∩ 𝐴) ∩ (𝐵 ∩ 𝐴𝑐 ) = ∅,
𝑃 (𝐵) = 𝑃 (𝐵 ∩ 𝐴𝑐 ) + 𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐶) + 𝑃 (𝐴 ∩ 𝐵), so 𝑃 (𝐶) = 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵). Substituting gives 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵).
(v) We have 𝐴 = 𝐴 ∩ (𝐵 ∪ 𝐵𝑐 ) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐵𝑐 ), and if 𝐵 ⊂ 𝐴, then 𝐴 = 𝐵 ∪ (𝐴 ∩ 𝐵𝑐 ).
Since 𝐵 ∩ (𝐴 ∩ 𝐵𝑐 ) = ∅, consequently, 𝑃 (𝐴) = 𝑃 (𝐵) + 𝑃 (𝐴 ∩ 𝐵𝑐 ) ≥ 𝑃 (𝐵).
Example 2.1
Note that since probabilities are non-negative (Theorem 2.1 (iii)), Theorem 2.1 (iv) implies that
the probability of the union of two events is no greater than the sum of the individual event
probabilities:
𝑃 (𝐴 ∪ 𝐵) ≤ 𝑃 (𝐴) + 𝑃 (𝐵) (2.2)
This can be extended to Boole’s Inequality, described as follows.
Theorem 2.2
𝑃 (𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 ) ≤ Σ^{𝑛}_{𝑖=1} 𝑃 (𝐴𝑖 )
The proof decomposes the union into the mutually exclusive sets 𝐴𝑘 ∩ 𝐵^{𝑐}_{𝑘}, where
𝐵𝑘 = ⋃^{𝑘−1}_{𝑖=1} 𝐴𝑖
Example 2.2
Let 𝑆 = [0, 1] (the set of real numbers 𝑥 : 0 ≤ 𝑥 ≤ 1). Let 𝐴1 = [0, 0.5], 𝐴2 = (0.45, 0.7),
𝐴3 = [0.6, 0.8), and assume 𝑃 (𝜁 ∈ 𝐼 ) = length of the interval 𝐼 ∩ 𝑆, so that 𝑃 (𝐴1 ) = 0.5,
𝑃 (𝐴2 ) = 0.25, and 𝑃 (𝐴3 ) = 0.2. Find 𝑃 (𝐴1 ∪ 𝐴2 ∪ 𝐴3 ).
Solution. Let 𝐶 1 = 𝐴1, 𝐶 2 = 𝐴2 ∩ 𝐴𝑐1 = (0.5, 0.7), and 𝐶 3 = 𝐴3 ∩ 𝐴𝑐1 ∩ 𝐴𝑐2 = [0.7, 0.8). Then
𝐶 1, 𝐶 2, and 𝐶 3 are mutually exclusive and 𝐴1 ∪𝐴2 ∪𝐴3 = 𝐶 1 ∪𝐶 2 ∪𝐶 3 ; hence 𝑃 (𝐴1 ∪𝐴2 ∪𝐴3 ) =
𝑃 (𝐶 1 ∪ 𝐶 2 ∪ 𝐶 3 ) = 0.5 + 0.2 + 0.1 = 0.8. Note that for this example, Boole’s inequality yields
𝑃 (𝐴1 ∪ 𝐴2 ∪ 𝐴3 ) ≤ 0.5 + 0.25 + 0.2 = 0.95.
From the relative frequency definition, in practice we may let 𝑛𝐴,𝐵 be the number of times that 𝐴
and 𝐵 simultaneously occur in 𝑛 trials. Then,
𝑃 (𝐴, 𝐵) = lim_{𝑛→∞} 𝑛𝐴,𝐵 /𝑛 (2.3)
Example 2.3
A standard deck of playing cards has 52 cards that can be divided in several manners. There
are four suits (spades, hearts,diamonds, and clubs), each of which has 13 cards (ace, 2, 3, 4,
... , 10, jack, queen, king). There are two red suits (hearts and diamonds) and two black suits
(spades and clubs). Also, the jacks, queens, and kings are referred to as face cards, while
the others are number cards. Suppose the cards are sufficiently shuffled (randomized) and
one card is drawn from the deck. The experiment has 52 outcomes corresponding to the 52
individual cards that could have been selected. Hence, each outcome has a probability of
1/52. Define the events:
A = {red card selected},
B = {number card selected},
C = {heart selected}.
Since the event A consists of 26 outcomes (there are 26 red cards), then 𝑃 (𝐴) = 26/52 = 1/2.
Likewise, 𝑃 (𝐵) = 40/52 = 10/13 and 𝑃 (𝐶) = 13/52 = 1/4. Events A and B have 20
outcomes in common, hence 𝑃 (𝐴, 𝐵) = 20/52 = 5/13. Likewise, 𝑃 (𝐵, 𝐶) = 10/52 = 5/26
and 𝑃 (𝐴, 𝐶) = 13/52 = 1/4. It is interesting to note that in this example, 𝑃 (𝐴, 𝐶) = 𝑃 (𝐶),
because 𝐶 ⊂ 𝐴 and as a result 𝐴 ∩ 𝐶 = 𝐶.
𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑀 ) = 𝑃 (𝐴𝑀 |𝐴1, 𝐴2, ..., 𝐴𝑀−1 )𝑃 (𝐴𝑀−1 |𝐴1, 𝐴2, ..., 𝐴𝑀−2 )... × 𝑃 (𝐴2 |𝐴1 )𝑃 (𝐴1 ) (2.8)
Example 2.4
Return to the experiment of drawing cards from a deck as described in Example 2.3. Suppose
now that we select two cards at random from the deck. When we select the second card,
we do not return the first card to the deck. In this case, we say that we are selecting cards
without replacement. As a result, the probabilities associated with selecting the second card
are slightly different if we have knowledge of which card was drawn on the first selection.
To illustrate this, let:
A = {first card was a spade} and
B = {second card was a spade}.
The probability of the event A can be calculated as in the previous example to be 𝑃 (𝐴) =
13/52 = 1/4. Likewise, if we have no knowledge of what was drawn on the first selection, the
probability of the event B is the same, 𝑃 (𝐵) = 1/4. To calculate the joint probability of A and
B, we have to do some counting. To begin, when we select the first card there are 52 possible
outcomes. Since this card is not returned to the deck, there are only 51 possible outcomes
for the second card. Hence, this experiment of selecting two cards from the deck has 52 ∗ 51
possible outcomes each of which is equally likely. Similarly, there are 13 ∗ 12 outcomes that
belong to the joint event 𝐴 ∩ 𝐵. Therefore, the joint probability for A and B is 𝑃 (𝐴, 𝐵) =
(13 ∗ 12)/(52 ∗ 51) = 1/17. The conditional probability of the second card being a spade
given that the first card is a spade is then 𝑃 (𝐵|𝐴) = 𝑃 (𝐴, 𝐵)/𝑃 (𝐴) = (1/17)/(1/4) = 4/17.
However, calculating this conditional probability directly is probably easier than calculating
the joint probability. Given that we know the first card selected was a spade, there are now
51 cards left in the deck, 12 of which are spades, thus 𝑃 (𝐵|𝐴) = 12/51 = 4/17.
Theorem 2.3
𝑃 (𝐴|𝐵) = 𝑃 (𝐵|𝐴)𝑃 (𝐴)/𝑃 (𝐵) (2.9)
Theorem 2.3 is useful for calculating certain conditional probabilities since, in many problems, it
may be quite difficult to compute 𝑃 (𝐴|𝐵) directly, whereas calculating 𝑃 (𝐵|𝐴) may be straightfor-
ward.
Theorem 2.4: Theorem of Total Probability
Let 𝐵1, 𝐵2, ..., 𝐵𝑛 be a set of mutually exclusive and collectively exhaustive events. That is,
𝐵𝑖 ∩ 𝐵𝑗 = ∅ for all 𝑖 ≠ 𝑗 and
⋃^{𝑛}_{𝑖=1} 𝐵𝑖 = 𝑆 ⇒ Σ^{𝑛}_{𝑖=1} 𝑃 (𝐵𝑖 ) = 1 (2.11)
then
𝑃 (𝐴) = Σ^{𝑛}_{𝑖=1} 𝑃 (𝐴|𝐵𝑖 )𝑃 (𝐵𝑖 ) (2.12)
Proof. From the Venn diagram in Figure 2.1, it can be seen that the event 𝐴 can be written as:
𝐴 = (𝐴 ∩ 𝐵1 ) ∪ (𝐴 ∩ 𝐵2 ) ∪ ... ∪ (𝐴 ∩ 𝐵𝑛 )
Also, since the 𝐵𝑖 are all mutually exclusive, then the {𝐴 ∩ 𝐵𝑖 } are also mutually exclusive, so that
𝑃 (𝐴) = Σ^{𝑛}_{𝑖=1} 𝑃 (𝐴, 𝐵𝑖 ) = Σ^{𝑛}_{𝑖=1} 𝑃 (𝐴|𝐵𝑖 )𝑃 (𝐵𝑖 ) (by Theorem 2.3). (2.14)
Figure 2.1: Venn diagram used to help prove the theorem of total probability
By combining the results of Theorems 2.3 and 2.4, we get what has come to be known as Bayes’s
theorem.
Theorem 2.5: Bayes’s Theorem
Let 𝐵 1, 𝐵 2, ..., 𝐵𝑛 be a set of mutually exclusive and collectively exhaustive events. Then:
𝑃 (𝐵𝑖 |𝐴) = 𝑃 (𝐴|𝐵𝑖 )𝑃 (𝐵𝑖 ) / Σ^{𝑛}_{𝑘=1} 𝑃 (𝐴|𝐵𝑘 )𝑃 (𝐵𝑘 ) (2.15)
𝑃 (𝐵𝑖 ) is often referred to as the a priori probability of event 𝐵𝑖 , while 𝑃 (𝐵𝑖 |𝐴) is known as the a
posteriori probability of event 𝐵𝑖 given 𝐴.
Example 2.5
A certain auditorium has 30 rows of seats. Row 1 has 11 seats, while Row 2 has 12 seats, Row
3 has 13 seats, and so on to the back of the auditorium where Row 30 has 40 seats. A door
prize is to be given away by randomly selecting a row (with equal probability of selecting
any of the 30 rows) and then randomly selecting a seat within that row (with each seat in
the row equally likely to be selected). Find the probability that Seat 15 was selected given
that Row 20 was selected and also find the probability that Row 20 was selected given that
Seat 15 was selected.
Solution. The first task is straightforward. Given that Row 20 was selected, there are 30
possible seats in Row 20 that are equally likely to be selected. Hence, 𝑃 (𝑆𝑒𝑎𝑡15|𝑅𝑜𝑤20) =
1/30. Without the help of Bayes’s theorem, finding the probability that Row 20 was selected
given that we know Seat 15 was selected would seem to be a formidable problem. Using
Bayes’s theorem,
𝑃 (𝑅𝑜𝑤20|𝑆𝑒𝑎𝑡15) = 𝑃 (𝑆𝑒𝑎𝑡15|𝑅𝑜𝑤20)𝑃 (𝑅𝑜𝑤20)/𝑃 (𝑆𝑒𝑎𝑡15).
The two terms in the numerator on the right-hand side are both equal to 1/30. The term in
the denominator is calculated using the help of the theorem of total probability.
𝑃 (𝑆𝑒𝑎𝑡15) = Σ^{30}_{𝑘=5} (1/(𝑘 + 10)) (1/30) = 0.0342
With this calculation completed, the a posteriori probability of Row 20 being selected given
seat 15 was selected is given by:
𝑃 (𝑅𝑜𝑤20|𝑆𝑒𝑎𝑡15) = (1/30 × 1/30)/0.0342 = 0.0325
Note that the a priori probability that Row 20 was selected is 1/30 = 0.0333. Therefore, the
additional information that Seat 15 was selected makes the event that Row 20 was selected
slightly less likely. In some sense, this may be counterintuitive, since we know that if Seat
15 was selected, there are certain rows that could not have been selected (i.e., Rows 1–4
have fewer than 15 seats) and, therefore, we might expect Row 20 to have a slightly higher
probability of being selected compared to when we have no information about which seat
was selected. To see why the probability actually goes down, try computing the probability
that Row 5 was selected given that Seat 15 was selected. The event that Seat 15 was selected
makes some rows much more probable, while it makes others less probable and a few rows
now impossible.
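The numbers in this example are easily reproduced by brute force. The short sketch below assumes the auditorium layout described above (Row 𝑘 has 𝑘 + 10 seats, each of the 30 rows equally likely) and applies the theorem of total probability and Bayes's theorem directly.

# Numerical check of Example 2.5.
rows = range(1, 31)
p_row = 1 / 30                          # each row equally likely

# Theorem of total probability: sum over rows that actually have a Seat 15
p_seat15 = sum(p_row * (1 / (k + 10)) for k in rows if k + 10 >= 15)

# Bayes' theorem: P(Row 20 | Seat 15) = P(Seat 15 | Row 20) P(Row 20) / P(Seat 15)
p_row20_given_seat15 = (1 / 30) * p_row / p_seat15

print(f"P(Seat 15)          = {p_seat15:.4f}")            # approximately 0.0342
print(f"P(Row 20 | Seat 15) = {p_row20_given_seat15:.4f}")  # approximately 0.0325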
2.4 Independence
In Example 2.5, it was seen that observing one event can change the probability of the occurrence
of another event. In that particular case, the knowledge that Seat 15 was selected lowered the probability that Row 20 was selected. We say that the event 𝐴 = {Row 20 was
selected} is statistically dependent on the event 𝐵 = {Seat 15 was selected}. If the description of
the auditorium were changed so that each row had an equal number of seats (e.g., say all 30 rows
had 20 seats each), then observing the event 𝐵 = {Seat 15 was selected} would not give us any new
information about the likelihood of the event 𝐴 = {Row 20 was selected}. In that case, we say
that the events 𝐴 and 𝐵 are statistically independent.
Mathematically, two events 𝐴 and 𝐵 are independent if 𝑃 (𝐴|𝐵) = 𝑃 (𝐴). That is, the a priori
probability of event 𝐴 is identical to the a posteriori probability of 𝐴 given 𝐵. Note that if
𝑃 (𝐴|𝐵) = 𝑃 (𝐴), then the following conditions also hold: 𝑃 (𝐵|𝐴) = 𝑃 (𝐵) and 𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵).
Furthermore, if 𝑃 (𝐴|𝐵) ≠ 𝑃 (𝐴), then the other two conditions also do not hold. We can thereby
conclude that any of these three conditions can be used as a test for independence and the other
two forms must follow. We use the last form as a definition of independence since it is symmetric
relative to the events A and B.
Definition 2.3. Independence: Two events are statistically independent if and only if:
𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵) (2.16)
Example 2.6
Consider the experiment of tossing two numbered dice and observing the numbers that
appear on the two upper faces. For convenience, let the dice be distinguished by color, with
the first die tossed being red and the second being white. Let:
A = {number on the red die is less than or equal to 2},
B = {number on the white die is greater than or equal to 4},
C = {the sum of the numbers on the two dice is 3}.
As mentioned in the preceding text, there are several ways to establish independence (or
lack thereof) of a pair of events. One possible way is to compare 𝑃 (𝐴, 𝐵) with 𝑃 (𝐴)𝑃 (𝐵).
Note that for the events defined here, 𝑃 (𝐴) = 1/3, 𝑃 (𝐵) = 1/2, 𝑃 (𝐶) = 1/18. Also, of the 36
possible outcomes of the experiment, six belong to the event 𝐴 ∩ 𝐵 and hence 𝑃 (𝐴, 𝐵) = 1/6.
Since 𝑃 (𝐴)𝑃 (𝐵) = 1/6 as well, we conclude that the events 𝐴 and 𝐵 are independent. This
agrees with intuition since we would not expect the outcome of the roll of one die to affect
the outcome of the other. What about the events 𝐴 and 𝐶? Of the 36 possible outcomes of the
experiment, two belong to the event 𝐴∩𝐶 and hence 𝑃 (𝐴, 𝐶) = 1/18. Since 𝑃 (𝐴)𝑃 (𝐶) = 1/54,
the events 𝐴 and 𝐶 are not independent. Again, this is intuitive since whenever the event 𝐶
occurs, the event 𝐴 must also occur and so the two must be dependent. Finally, we look at
the pair of events 𝐵 and 𝐶. Clearly, 𝐵 and 𝐶 are mutually exclusive. If the white die shows a
number greater than or equal to 4, there is no way the sum can be 3. Hence, 𝑃 (𝐵, 𝐶) = 0 and
since 𝑃 (𝐵)𝑃 (𝐶) = 1/36, these two events are also dependent.
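Because the sample space here is small, independence can also be checked by exhaustive enumeration. The sketch below uses exact rational arithmetic; the event definitions mirror 𝐴, 𝐵, and 𝐶 above.

# Brute-force check of Example 2.6 over the 36 equally likely outcomes (red, white).
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # (red, white)

def prob(event):
    """Exact probability of an event under equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] <= 2          # red die less than or equal to 2
B = lambda o: o[1] >= 4          # white die greater than or equal to 4
C = lambda o: sum(o) == 3        # sum of the two dice is 3

print(prob(lambda o: A(o) and B(o)) == prob(A) * prob(B))   # True:  A, B independent
print(prob(lambda o: A(o) and C(o)) == prob(A) * prob(C))   # False: A, C dependent
print(prob(lambda o: B(o) and C(o)) == prob(B) * prob(C))   # False: B, C dependent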
Note that mutually exclusive events are not the same as independent events. For two events 𝐴
and 𝐵 for which 𝑃 (𝐴) ≠ 0 and 𝑃 (𝐵) ≠ 0, 𝐴 and 𝐵 can never be both independent and mutually
exclusive. Thus, mutually exclusive events are necessarily statistically dependent.
Generalizing the definition of independence to three events: 𝐴, 𝐵, and 𝐶 are mutually independent if each pair of events is independent,
𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵) (2.17)
𝑃 (𝐴, 𝐶) = 𝑃 (𝐴)𝑃 (𝐶) (2.18)
𝑃 (𝐵, 𝐶) = 𝑃 (𝐵)𝑃 (𝐶) (2.19)
and, in addition,
𝑃 (𝐴, 𝐵, 𝐶) = 𝑃 (𝐴)𝑃 (𝐵)𝑃 (𝐶) (2.20)
Definition 2.4. The events 𝐴1, 𝐴2, ..., 𝐴𝑛 are independent if any subset of 𝑘 < 𝑛 of these events are
independent, and in addition
𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑛 ) = 𝑃 (𝐴1 )𝑃 (𝐴2 )...𝑃 (𝐴𝑛 ) (2.21)
There are basically two ways in which we can use the idea of independence. We can compute
joint or conditional probabilities and apply one of the definitions as a test for independence.
Alternatively, we can assume independence and use the definitions to compute joint or conditional
probabilities that otherwise may be difficult to find. The latter approach is used extensively in
engineering applications. For example, certain types of noise signals can be modeled in this
way. Suppose we have some time waveform 𝑋 (𝑡) which represents a noisy signal that we wish
to sample at various points in time, 𝑡 1, 𝑡 2, ..., 𝑡𝑛 . Perhaps we are interested in the probabilities
that these samples might exceed some threshold, so we define the events 𝐴𝑖 = {𝑋 (𝑡𝑖 ) > 𝑇 },
𝑖 = 1, 2, ..., 𝑛. In some cases, we can assume that the value of the noise at one point in time does
not affect the value of the noise at another point in time. Hence, we assume that these events are
independent and therefore 𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑛 ) = 𝑃 (𝐴1 )𝑃 (𝐴2 )...𝑃 (𝐴𝑛 ).
possible outcomes. This result allows us to quickly calculate the number of sample points in a
sequence of experiments.
Example 2.7
How many odd two digit numbers can be formed from the digits 2, 7, 8, and 9, if each digit
can be used only once?
Solution. As the first experiment, there are two ways of selecting a number for the unit’s
place (either 7 or 9). In each case of the first experiment, there are three ways of selecting a
number for the ten’s place in the second experiment, excluding the digit used for the unit’s
place. The number of outcomes in the combined experiment is therefore 2 × 3 = 6.
Example 2.8
Solution. Since each bit (or binary digit) in a computer word is either a one or a zero, and
there are 8 bits, then the total number of computer words is 𝑛 = 2⁸ = 256. To determine
the maximum sampling error, first compute the range of voltage assigned to each computer
word which equals 10 V/256 words = 0.0390625 V/word and then divide by two (i.e. round
off to the nearest level), which yields a maximum error of 0.0195312 V/word.
Solution. There are 2ᵏ different binary numbers. Note that the digits are "ordered", and repeated 0 and 1 digits are possible.
Example 2.10
An urn contains five balls numbered 1 to 5. Suppose we select two balls from the urn with
replacement. How many distinct ordered pairs are possible? What is the probability that
the two draws yield the same number?
Solution. The number of ordered pairs is 5² = 25. Figure 2.2 shows the 25 possible pairs.
Five of the 25 outcomes have the two draws with the same number; if we suppose that all
pairs are equiprobable, then the probability that the two draws yield the same number is
5/25 = 0.2.
Figure 2.2: Possible outcomes in sampling with replacement and with ordering of two balls
from an urn containing five distinct balls
Example 2.11
An urn contains five balls numbered 1 to 5. Suppose we select two balls in succession
without replacement. How many distinct ordered pairs are possible? What is the probability
that the first ball has a number larger than that of the second ball?
Solution. Equation 2.24 states that the number of ordered pairs is 5 × 4 = 20, as shown in
figure 2.3. Ten ordered pairs (in the dashed triangle) have the first number larger than the
second number ; thus the probability of this event is 10/20 = 0.5.
Figure 2.3: Possible outcomes in sampling without replacement and with ordering.
Example 2.12
An urn contains five balls numbered 1 to 5. Suppose we draw three balls with replacement.
What is the probability that all three balls are different?
Solution. From Equation 2.23 there are 5³ = 125 possible outcomes, which we will suppose are equiprobable. The number of these outcomes for which the three draws are different is given by Equation 2.24, 5 × 4 × 3 = 60. Thus the probability that all three balls are different is 60/125 = 0.48.
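The same counting can be reproduced by enumerating all outcomes. The sketch below is one way to do this with the Python standard library.

# Enumerate the 5**3 = 125 equally likely outcomes of Example 2.12 directly.
from itertools import product

triples = list(product(range(1, 6), repeat=3))        # sampling with replacement, ordered
all_different = [t for t in triples if len(set(t)) == 3]

print(len(triples))                        # 125
print(len(all_different))                  # 60
print(len(all_different) / len(triples))   # 0.48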
In many problems of interest, we seek to find the number of different ways that we can rearrange
or order several items. The number of permutations can easily be determined from equation 2.24
and is given as follows. Consider drawing 𝑛 objects from an urn containing 𝑛 distinct objects until
the urn is empty, i.e. sampling without replacement with 𝑘 = 𝑛. Thus, the number of possible
orderings, i.e. permutations of 𝑛 distinct objects is:
𝑛(𝑛 − 1) · · · (2)(1) = 𝑛! (2.25)
Now consider choosing 𝑘 objects from a set of 𝑛 distinct objects without replacement and without regard to order. The number of such combinations of size 𝑘 is obtained by dividing the number of ordered samples by the 𝑘! orderings of each subset:
𝐶^{𝑛}_{𝑘} = 𝑛(𝑛 − 1) · · · (𝑛 − 𝑘 + 1)/𝑘! = 𝑛!/((𝑛 − 𝑘)!𝑘!) (2.26)
The expression 𝐶^{𝑛}_{𝑘} is also called a binomial coefficient and is read “n choose k.” Note that choosing
𝑘 objects out of a set of 𝑛 is equivalent to choosing the objects that are to be left out, since
𝐶^{𝑛}_{𝑘} = 𝐶^{𝑛}_{𝑛−𝑘} (2.27)
Note that from Equation 2.25, there are 𝑘! possible orders in which the 𝑘 selected objects could
have been selected. Thus in the case of 𝑘-permutations 𝑃^{𝑛}_{𝑘}, the total number of distinct ordered samples of 𝑘 objects is:
𝑃^{𝑛}_{𝑘} = 𝐶^{𝑛}_{𝑘} 𝑘! (2.28)
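These counting formulas are available directly in the Python standard library (Python 3.8 or later), which gives a quick way to check Equations 2.26 to 2.28 for particular values of 𝑛 and 𝑘.

# Sanity check of Equations 2.26-2.28 with the standard library (Python 3.8+).
import math

n, k = 5, 2
assert math.comb(n, k) == math.factorial(n) // (math.factorial(n - k) * math.factorial(k))  # (2.26)
assert math.comb(n, k) == math.comb(n, n - k)                                               # (2.27)
assert math.perm(n, k) == math.comb(n, k) * math.factorial(k)                               # (2.28)
print(math.comb(5, 2), math.perm(5, 2))   # 10 20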
Example 2.13
Find the number of ways of selecting two balls from five balls numbered 1 to 5, without
replacement and without regard to order.
Solution. The number of ways is 𝐶^{5}_{2} = 5!/(3! 2!) = 10; the possible pairs are shown in Figure 2.4.
Figure 2.4: Possible outcomes in sampling without replacement and without ordering.
Example 2.14
Find the number of distinct permutations of 2 white balls and 3 black balls.
Solution. This problem is equivalent to the sampling problem: Assume 5 possible positions
for the balls, then pick a combination of 2 positions out of 5 and arrange the 2 white balls
accordingly. Each combination leads to a distinct arrangement (permutation) of 2 white
balls and 3 black balls. Thus the number of distinct permutations of 2 white balls and 3 black
balls is 𝐶^{5}_{2} = 10. The 10 distinct permutations with 2 whites (zeros) and 3 blacks (ones) are:
00111 01011 01101 01110 10011 10101 10110 11001 11010 11100. Note that the position of
whites (zeros) can be represented by the pair of numbers on the two selected balls in figure
2.4.
Example 2.14 shows that sampling without replacement and without ordering is equivalent to
partitioning the set of 𝑛 distinct objects into two sets: 𝐵, containing the 𝑘 items that are picked
from the urn, and 𝐵𝑐 containing the 𝑛 − 𝑘 left behind. Suppose we partition a set of 𝑛 distinct
objects into 𝐽 subsets 𝐵1, 𝐵2, ..., 𝐵𝐽 , where subset 𝐵𝑗 contains 𝑘𝑗 elements and 𝑘1 + 𝑘2 + ... + 𝑘𝐽 = 𝑛. The number of distinct partitions is:
𝑛!/(𝑘1! 𝑘2! ... 𝑘𝐽 !) (2.29)
which is called the multinomial coefficient. The binomial coefficient is a special case of the
multinomial coefficient where 𝐽 = 2.
Note that this form can be summarized by the sequence ×× | | × | ×× where the "|" s indicate the
lines between columns, and where nothing appears between consecutive |s if the corresponding
object was not selected. Each different arrangement of 5 ×s and 3 |s leads to a distinct form. If
we identify ×s with “white balls” and |s with “black balls,” then this problem becomes similar to Example 2.14, and the number of different arrangements is given by 𝐶^{8}_{3}. In the general case
the form will involve 𝑘 ×s and (𝑛 − 1) |s. Thus the number of different ways of picking 𝑘 objects
from a set of 𝑛 distinct objects with replacement and without ordering is given by:
𝐶^{𝑛−1+𝑘}_{𝑘} = 𝐶^{𝑛−1+𝑘}_{𝑛−1} (2.30)
Example 2.15
Find the number of ways of selecting two balls from five balls numbered 1 to 5, with replace-
ment but without regard to order.
Solution. The number of ways is 𝐶^{5−1+2}_{2} = 𝐶^{6}_{2} = 6!/(2! 4!) = 15.
Figure 2.5 shows the 15 pairs. Note that because of the replacement after each selection, the
same ball can be selected twice for each pair.
Figure 2.5: Possible outcomes in sampling with replacement and without ordering.
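The count in Example 2.15 can be checked by direct enumeration. The sketch below uses the Python standard library; the values 𝑛 = 5 and 𝑘 = 2 are those of the example.

# Counting check for Example 2.15: 2 draws from 5 balls, with replacement, order ignored.
import math
from itertools import combinations_with_replacement

n, k = 5, 2
pairs = list(combinations_with_replacement(range(1, n + 1), k))
print(len(pairs))                    # 15
print(math.comb(n - 1 + k, k))       # 15, consistent with Equation 2.30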
Further Reading
1. John D. Enderle, David C. Farden, Daniel J. Krause, Basic Probability Theory for Biomedical
Engineers, Morgan & Claypool, 2006: sections 1.2.3 to 1.9
2. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, 2nd ed., Elsevier 2012: section 2.2 to 2.7
3. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: sections 2.2 to 2.6
3 Random Variables
In most random experiments, we are interested in a numerical attribute of the outcome of the
experiment. A random variable is defined as a function that assigns a numerical value to the
outcome of the experiment.
Figure 3.1: A random variable assigns a number 𝑋 (𝜁 ) to each outcome 𝜁 in the sample space 𝑆 of
a random experiment.
Since 𝑋 (𝜁 ) is a random variable whose numerical value depends on the outcome of an experiment,
we cannot describe the random variable by stating its value; rather, we describe the probabilities
that the variable takes on a specific value or values (e.g. 𝑃 (𝑋 = 3) or 𝑃 (𝑋 > 8)).
Example 3.1
A coin is tossed three times and the sequence of heads and tails is noted. The sample space
for this experiment is 𝑆 ={ HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. (a) Let 𝑋 be the
number of heads in the three tosses. Find the random variable 𝑋 (𝜁 ) for each outcome 𝜁 . (b)
Now find the probability of the event {𝑋 = 2}.
Solution. (a) 𝑋 assigns each outcome 𝜁 in 𝑆 a number from the set 𝑆𝑥 = {0, 1, 2, 3}:
𝜁 : HHH HHT HTH HTT THH THT TTH TTT
𝑋 (𝜁 ) : 3 2 2 1 2 1 1 0
(b) The event {𝑋 = 2} corresponds to the outcomes {HHT, HTH, THH}, so 𝑃 (𝑋 = 2) = 3/8.
Example 3.1 shows a general technique for finding the probabilities of events involving the random
variable 𝑋 . Let the underlying random experiment have sample space 𝑆. To find the probability
of a subset 𝐵 of 𝑅, e.g., 𝐵 = {𝑥𝑘 }, we need to find the outcomes in 𝑆 that are mapped to 𝐵, i.e.:
𝐴 = {𝜁 : 𝑋 (𝜁 ) ∈ 𝐵} (3.1)
As shown in figure 3.2. If event 𝐴 occurs then 𝑋 (𝜁 ) ∈ 𝐵, so event 𝐵 occurs. Conversely, if event 𝐵
occurs, then the value 𝑋 (𝜁 ) implies that 𝜁 is in 𝐴, so event 𝐴 occurs. Thus the probability that 𝑋
is in 𝐵 is given by:
𝑃 (𝑋 ∈ 𝐵) = 𝑃 (𝐴) = 𝑃 ({𝜁 : 𝑋 (𝜁 ) ∈ 𝐵}) (3.2)
We refer to 𝐴 and 𝐵 as equivalent events. In some random experiments the outcome 𝜁 is already
the numerical value we are interested in. In such cases we simply let 𝑋 (𝜁 ) = 𝜁 that is, the identity
function, to obtain a random variable.
Note that we use the convention that upper case variables represent random variables while lower
case variables represent fixed values that the random variable can assume. The PMF satisfies the
following properties that provide all the information required to calculate probabilities for events
involving the discrete random variable 𝑋 :
(i) 𝑃𝑋 (𝑥) ≥ 0 for all 𝑥
(ii) Σ_{𝑥∈𝑆𝑥} 𝑃𝑋 (𝑥) = Σ_{𝑘} 𝑃𝑋 (𝑥𝑘 ) = Σ_{𝑘} 𝑃 (𝐴𝑘 ) = 1
Example 3.2
Let 𝑋 be the number of heads in three independent tosses of a fair coin. Find the PMF of 𝑋 .
Solution. The eight outcomes in 𝑆 are equally likely, so 𝑃𝑋 (0) = 1/8, 𝑃𝑋 (1) = 3/8, 𝑃𝑋 (2) = 3/8, and 𝑃𝑋 (3) = 1/8.
Figure 3.2 shows the graph of 𝑃𝑋 (𝑥) versus 𝑥 for the random variable in this example.
Generally the graph of the PMF of a discrete random variable has vertical arrows of height 𝑃𝑋 (𝑥𝑘 )
at the values 𝑥𝑘 in 𝑆𝑥 . The relative values of PMF at different points give an indication of the
relative likelihoods of occurrence.
Finally, let’s consider the relationship between relative frequencies and the PMF. Suppose we
perform 𝑛 independent repetitions to obtain 𝑛 observations of the discrete random variable 𝑋 .
Let 𝑁𝑘 (𝑛) be the number of times the event 𝑋 = 𝑥𝑘 occurs and let 𝑓𝑘 (𝑛) = 𝑁𝑘 (𝑛)/𝑛 be the
corresponding relative frequency. As 𝑛 becomes large we expect that 𝑓𝑘 (𝑛) → 𝑃𝑋 (𝑥𝑘 ). Therefore
the graph of relative frequencies should approach the graph of the PMF. For the experiment in
Example 3.2, 1000 repetitions of an experiment of tossing a coin may generate a graph of relative
frequencies shown in Figure 3.3.
Figure 3.3: Relative frequencies and corresponding PMF for the experiment in Example 3.2
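A simulation along these lines is easy to set up. The sketch below is one possible implementation for the experiment of Example 3.2 (three tosses of a fair coin, repeated 1000 times); the random seed is an arbitrary choice made only for repeatability.

# Relative frequencies of X (number of heads in 3 fair coin tosses) over 1000 repetitions,
# compared with the exact PMF, in the spirit of Figure 3.3.
import random
from collections import Counter
from math import comb

random.seed(0)                       # arbitrary seed, for repeatability
n_reps = 1000
counts = Counter(sum(random.randint(0, 1) for _ in range(3)) for _ in range(n_reps))

for k in range(4):
    rel_freq = counts[k] / n_reps
    pmf = comb(3, k) * 0.5 ** 3      # binomial PMF with n = 3, p = 1/2
    print(f"x = {k}: relative frequency = {rel_freq:.3f}, PMF = {pmf:.3f}")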
In other words, the CDF is the probability that the random variable 𝑋 takes on a value in the
set (−∞, 𝑥]. In terms of the underlying sample space, the CDF is the probability of the event
{𝜁 : 𝑋 (𝜁 ) ≤ 𝑥 }.
In other words, the value of 𝐹𝑋 (𝑥) is constructed by simply adding together the probabilities 𝑃𝑋 (𝑦) for the values 𝑦 that are no larger than 𝑥. Note that:
𝑃 (𝑎 < 𝑋 ≤ 𝑏) = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎) (3.7)
The CDF is an increasing step function with steps at the values taken by the random variable.
The heights of the steps are the probabilities of taking these values. Mathematically, the PMF can
be obtained from the CDF through the relationship:
𝑃𝑋 (𝑥) = 𝐹𝑋 (𝑥) − 𝐹𝑋 (𝑥 − ) (3.8)
where 𝐹𝑋 (𝑥 − ) is the limiting value from below of the cumulative distribution function. If there is
no step in the cumulative distribution function at a point 𝑥, then 𝐹𝑋 (𝑥) = 𝐹𝑋 (𝑥 − ) and 𝑃𝑋 (𝑥) = 0.
If there is a step at a point 𝑥, then 𝐹𝑋 (𝑥) is the value of the CDF at the top of the step, and 𝐹𝑋 (𝑥 − )
is the value of the CDF at the bottom of the step, so that 𝑃𝑋 (𝑥) is the height of the step. These
relationships are illustrated in the following example.
Example 3.3
Similar to Example 3.2, let 𝑋 be the number of heads in three tosses of a fair coin. Find the
CDF of X.
Solution. From Example 3.2, we know that 𝑋 takes on only the values 0, 1, 2, and 3 with
probabilities 1/8, 3/8, 3/8, and 1/8, respectively, so 𝐹𝑋 (𝑥) is simply the sum of the probabili-
ties of the outcomes from {0, 1, 2, 3} that are less than or equal to 𝑥. The resulting CDF is a
non-decreasing staircase function that grows from 0 to 1. It has jumps at the points 0, 1, 2, 3
of magnitudes 1/8, 3/8, 3/8, and 1/8, respectively.
Let us take a closer look at one of these discontinuities, say, in the vicinity of 𝑥 = 1. For a
small positive number 𝛿, we have:
𝐹𝑋 (1 − 𝛿) = 𝑃 (𝑋 ≤ 1 − 𝛿) = 𝑃 (no heads) = 1/8
so the limit of the CDF as 𝑥 approaches 1 from the left is 1/8. However,
𝐹𝑋 (1) = 𝑃 (𝑋 ≤ 1) = 𝑃 (zero or one heads) = 1/2
Thus the CDF is continuous from the right and equal to 1/2 at the point 𝑥 = 1. Indeed, we
note the magnitude of the step at the point 𝑥 = 1 is 𝑃 (𝑋 = 1) = 1/2 − 1/8 = 3/8. The CDF
can be written compactly in terms of the unit step function:
𝐹𝑋 (𝑥) = (1/8)𝑢 (𝑥) + (3/8)𝑢 (𝑥 − 1) + (3/8)𝑢 (𝑥 − 2) + (1/8)𝑢 (𝑥 − 3)
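The compact unit-step form of the CDF can be evaluated directly. The following short sketch implements it and shows the jump of height 3/8 at 𝑥 = 1.

# Direct evaluation of the staircase CDF in Example 3.3.
def u(x):
    """Unit step function."""
    return 1.0 if x >= 0 else 0.0

def F_X(x):
    return (1/8) * u(x) + (3/8) * u(x - 1) + (3/8) * u(x - 2) + (1/8) * u(x - 3)

print(F_X(0.999), F_X(1.0))      # 0.125 just below the step, 0.5 at the step
print(F_X(1.0) - F_X(0.999))     # height of the step at x = 1 is P(X = 1) = 0.375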
Figure 3.4: The graphs show 150 repetitions of the experiments yielding 𝑋 and 𝑌 . It is clear that
𝑋 is centered about the value 5 while 𝑌 is centered about 0. It is also clear that 𝑋 is
more spread out than 𝑌 (Taken from Alberto Leon-Garcia, Probability, statistics, and
random processes for electrical engineering,3rd ed. Pearson, 2007).
Definition 3.5. Expected value: The expected value or expectation or mean of a discrete random
variable 𝑋 , with a probability mass function 𝑃𝑋 (𝑥) is defined by:
𝑚𝑋 = 𝐸 [𝑋 ] = Σ_{𝑘} 𝑥𝑘 𝑃𝑋 (𝑥𝑘 ) (3.9)
𝐸 [𝑋 ] provides a summary measure of the average value taken by the random variable and is also
known as the mean of the random variable. The expected value 𝐸 [𝑋 ] is defined if the above sum
converges absolutely, that is:
𝐸 [|𝑋 |] = Σ_{𝑘} |𝑥𝑘 | 𝑃𝑋 (𝑥𝑘 ) < ∞ (3.10)
otherwise the expected value does not exist.
Random variables with unbounded expected value are not uncommon and appear in models
where outcomes that have extremely large values are not that rare. Examples include the sizes
of files in Web transfers, frequencies of words in large bodies of text, and various financial and
economic problems.
If we view 𝑃𝑋 (𝑥) as the distribution of mass on the points 𝑥 1, 𝑥 2, ... on the real line, then 𝐸 [𝑋 ]
represents the center of mass of this distribution.
Example 3.4
Revisiting Example 3.1, let 𝑋 be the number of heads in three tosses of a fair coin. Find
𝐸 [𝑋 ].
Solution. 𝐸 [𝑋 ] = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5
The use of the term “expected value” does not mean that we expect to observe 𝐸 [𝑋 ] when we
perform the experiment that generates 𝑋 . For example, the expected value of the number of heads
in Example 3.4 is 1.5, but its outcomes can only be 0, 1, 2 or 3.
𝐸 [𝑋 ] can be explained as an average of 𝑋 in a large number of observations of 𝑋 . Suppose we
perform 𝑛 independent repetitions of the experiment that generates 𝑋 , and we record the observed
values as 𝑥 (1), 𝑥 (2), ..., 𝑥 (𝑛), where 𝑥 ( 𝑗) is the observation in the 𝑗 𝑡 ℎ experiment. Let 𝑁𝑘 (𝑛) be
the number of times 𝑥𝑘 is observed (𝑘 = 1, 2, ..., 𝐾), and let 𝑓𝑘 (𝑛) = 𝑁𝑘 (𝑛)/𝑛 be the corresponding
relative frequency. The arithmetic average, or sample mean of the observations, is:
⟨𝑋 ⟩𝑛 = (𝑥 (1) + 𝑥 (2) + ... + 𝑥 (𝑛))/𝑛 = (𝑥1 𝑁1 (𝑛) + 𝑥2 𝑁2 (𝑛) + ... + 𝑥𝐾 𝑁𝐾 (𝑛))/𝑛 (3.11)
= 𝑥1 𝑓1 (𝑛) + 𝑥2 𝑓2 (𝑛) + ... + 𝑥𝐾 𝑓𝐾 (𝑛) (3.12)
= Σ_{𝑘} 𝑥𝑘 𝑓𝑘 (𝑛) (3.13)
The first numerator adds the observations in the order in which they occur, and the second
numerator counts how many times each 𝑥𝑘 occurs and then computes the total. As 𝑛 becomes
large, we expect relative frequencies to approach the probabilities 𝑃𝑋 (𝑥𝑘 ):
⟨𝑋 ⟩𝑛 → Σ_{𝑘} 𝑥𝑘 𝑃𝑋 (𝑥𝑘 ) = 𝐸 [𝑋 ]
Definition 3.6. Variance: The variance of the random variable 𝑋 is defined as:
𝜎^{2}_{𝑋} = 𝑉𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝑚𝑋 )²]
The variance is a positive quantity that measures the spread of the distribution of the random
variable about its mean value. Larger values of the variance indicate that the distribution is more
spread out. For example in Figure 3.4, 𝑋 has a larger variance than 𝑌 .
Definition 3.7. Standard deviation: The standard deviation of the random variable 𝑋 is defined
by:
𝜎𝑋 = 𝑆𝑇𝐷 (𝑋 ) = 𝑉𝐴𝑅 [𝑋 ]^{1/2} (3.23)
By taking the square root of the variance, we obtain a quantity with the same units as 𝑋 .
An alternative expression for the variance can be obtained as follows:
𝑉𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝑚𝑋 )²] = 𝐸 [𝑋 ² − 2𝑚𝑋 𝑋 + 𝑚^{2}_{𝑋}] (3.24)
= 𝐸 [𝑋 ²] − 2𝑚𝑋 𝐸 [𝑋 ] + 𝑚^{2}_{𝑋} (3.25)
= 𝐸 [𝑋 ²] − 𝑚^{2}_{𝑋} (3.26)
𝐸 [𝑋 2 ] is called the second moment of 𝑋 .
Example 3.5
Revisiting Example 3.1, let 𝑋 be the number of heads in three tosses of a fair coin. Find
𝑉 𝐴𝑅 [𝑋 ].
Solution.
𝐸 [𝑋 ²] = Σ^{3}_{𝑘=0} 𝑘² 𝑃𝑋 (𝑘) = 0²(1/8) + 1²(3/8) + 2²(3/8) + 3²(1/8) = 3
𝑉 𝐴𝑅 [𝑋 ] = 𝐸 [𝑋 2 ] − (𝐸 [𝑋 ]) 2 = 3 − (1.5) 2 = 0.75
Let 𝑌 = 𝑋 + 𝑐, then:
𝑉𝐴𝑅 [𝑋 + 𝑐] = 𝐸 [(𝑋 + 𝑐 − (𝐸 [𝑋 ] + 𝑐))²] (3.27)
= 𝐸 [(𝑋 − 𝐸 [𝑋 ])²] = 𝑉𝐴𝑅 [𝑋 ] (3.28)
Adding a constant to a random variable does not affect the variance.
Let 𝑍 = 𝑐𝑋 , then:
𝑉𝐴𝑅 [𝑐𝑋 ] = 𝐸 [(𝑐𝑋 − 𝑐𝐸 [𝑋 ])²] (3.29)
= 𝐸 [𝑐²(𝑋 − 𝐸 [𝑋 ])²] (3.30)
= 𝑐² 𝑉𝐴𝑅 [𝑋 ] (3.31)
Scaling a random variable by 𝑐 scales the variance by 𝑐 2 and the standard deviation by |𝑐 |.
Note that a random variable that is equal to a constant 𝑋 = 𝑐 with probability 1 has zero variance: 𝑉𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝑐)²] = 𝐸 [0] = 0.
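These properties are easy to verify numerically. The sketch below uses the PMF of the number of heads in three fair coin tosses and an arbitrary constant 𝑐 = 4.

# Numerical check of the variance properties above.
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def mean(p):
    return sum(x * px for x, px in p.items())

def var(p):
    m = mean(p)
    return sum((x - m) ** 2 * px for x, px in p.items())

c = 4.0
shifted = {x + c: px for x, px in pmf.items()}   # Y = X + c
scaled = {c * x: px for x, px in pmf.items()}    # Z = cX

print(var(pmf))                        # 0.75
print(var(shifted))                    # 0.75       -> VAR[X + c] = VAR[X]
print(var(scaled), c**2 * var(pmf))    # 12.0, 12.0 -> VAR[cX] = c^2 VAR[X]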
Finally, Variance is a special case of central moments, for 𝑛 = 2, where we define 𝑛𝑡ℎ central
moment as follows.
Definition 3.9. Central Moments: The 𝑛𝑡ℎ central moment of a random variable is defined as:
𝐸 [(𝑋 − 𝑚𝑋 )𝑛 ].
As illustrated in Figure 3.5, the above expression has a nice intuitive interpretation: The conditional
probability of the event {𝑋 = 𝑥𝑘 } is given by the probabilities of outcomes 𝜁 for which both
𝑋 (𝜁 ) = 𝑥𝑘 and 𝜁 are in 𝐶, normalized by 𝑃 (𝐶).
The conditional PMF has the same properties as PMF. If 𝑆 is partitioned by 𝐴𝑘 = {𝑋 = 𝑥𝑘 }, then:
𝐶 = ⋃_{𝑘} (𝐴𝑘 ∩ 𝐶) and
Σ_{𝑥𝑘∈𝑆𝑋} 𝑃𝑋 |𝐶 (𝑥𝑘 ) = Σ_{𝑘} 𝑃𝑋 |𝐶 (𝑥𝑘 ) = Σ_{𝑘} 𝑃 ({𝑋 = 𝑥𝑘 } ∩ 𝐶)/𝑃 (𝐶)
= (1/𝑃 (𝐶)) Σ_{𝑘} 𝑃 (𝐴𝑘 ∩ 𝐶) = 𝑃 (𝐶)/𝑃 (𝐶) = 1
Most of the time the event 𝐶 is defined in terms of 𝑋 , for example 𝐶 = {𝑎 ≤ 𝑋 ≤ 𝑏}. For 𝑥𝑘 ∈ 𝑆𝑋 ,
we have the following result:
𝑃𝑋 |𝐶 (𝑥𝑘 ) = 𝑃𝑋 (𝑥𝑘 )/𝑃 (𝐶) if 𝑥𝑘 ∈ 𝐶, and 𝑃𝑋 |𝐶 (𝑥𝑘 ) = 0 if 𝑥𝑘 ∉ 𝐶 (3.34)
Example 3.6
Let 𝑋 be the number of heads in three tosses of a fair coin. Find the conditional PMF of 𝑋
given that we know the observed number was less than 2.
Solution. The conditioning event is 𝐶 = {𝑋 < 2}, which contains the values 0 and 1, so 𝑃 (𝐶) = 𝑃𝑋 (0) + 𝑃𝑋 (1) = 1/2. Therefore:
𝑃𝑋 |𝐶 (0) = 𝑃𝑋 (0)/𝑃 (𝐶) = (1/8)/(1/2) = 1/4.
𝑃𝑋 |𝐶 (1) = 𝑃𝑋 (1)/𝑃 (𝐶) = (3/8)/(1/2) = 3/4.
and 𝑃𝑋 |𝐶 (𝑥𝑘 ) is zero otherwise. Note that 𝑃𝑋 |𝐶 (0) + 𝑃𝑋 |𝐶 (1) = 1.
Many random experiments have natural ways of partitioning the sample space 𝑆 into the union
of disjoint events 𝐵 1, 𝐵 2, ..., 𝐵𝑛 . Let 𝑃𝑋 |𝐵𝑖 (𝑥) be the conditional PMF of 𝑋 given event 𝐵𝑖 . The
theorem on total probability allows us to find the PMF of 𝑋 in terms of the conditional PMFs:
𝑃𝑋 (𝑥) = Σ^{𝑛}_{𝑖=1} 𝑃𝑋 |𝐵𝑖 (𝑥)𝑃 (𝐵𝑖 ) (3.35)
Definition 3.11. Conditional Expected Value: Let 𝑋 be a discrete random variable, and suppose
that we know that event 𝐵 has occurred. The conditional expected value of 𝑋 given 𝐵 is defined as:
𝑚𝑋 |𝐵 = 𝐸 [𝑋 |𝐵] = Σ_{𝑥∈𝑆𝑥} 𝑥 𝑃𝑋 |𝐵 (𝑥) = Σ_{𝑘} 𝑥𝑘 𝑃𝑋 |𝐵 (𝑥𝑘 ) (3.36)
Let 𝐵1, 𝐵2, ..., 𝐵𝑛 be a partition of 𝑆. The expected value of 𝑋 can then be obtained from the conditional expected values:
𝐸 [𝑋 ] = Σ^{𝑛}_{𝑖=1} 𝐸 [𝑋 |𝐵𝑖 ]𝑃 (𝐵𝑖 )
where we first express 𝑃𝑋 (𝑥𝑘 ) in terms of the conditional PMFs, and we then change the order of summation. Using the same approach we can also show:
𝐸 [𝑔(𝑋 )] = Σ^{𝑛}_{𝑖=1} 𝐸 [𝑔(𝑋 )|𝐵𝑖 ]𝑃 (𝐵𝑖 ) (3.42)
Example 3.7
Let 𝑋 be the number of heads in three tosses of a fair coin. Find the expected value and
variance of 𝑋 , if we know that at least one head was observed.
The variance is quadratic in 𝑝, with value zero at 𝑝 = 0 and 𝑝 = 1 and maximum at 𝑝 = 1/2.
This agrees with intuition since values of 𝑝 close to 0 or to 1 imply a preponderance of
successes or failures and hence less variability in the observed values. The maximum
variability occurs when 𝑝 = 1/2, which corresponds to the case that is most difficult to predict. Every
Bernoulli trial, regardless of the event 𝐴, is equivalent to the tossing of a biased coin with
probability of heads 𝑝.
In fact, the order of the 1s and 0s in the sequence is irrelevant. Any outcome with exactly 𝑘 1s
and 𝑛 − 𝑘 0s would have the same probability. The number of outcomes in the event of exactly 𝑘
successes, is just the number of combinations of 𝑛 trials taken 𝑘 successes at a time.
Let 𝑘 be the number of successes in 𝑛 independent Bernoulli trials, then the probabilities of
𝑘 are given by the binomial probability law:
𝑃𝑛 (𝑘) = 𝐶^{𝑛}_{𝑘} 𝑝ᵏ (1 − 𝑝)^{𝑛−𝑘} for 𝑘 = 0, ..., 𝑛 (3.48)
where 𝐶^{𝑛}_{𝑘} is the binomial coefficient (see Equation 2.26).
Now let the random variable 𝑋 represent the number of successes occurred in the sequence of 𝑛
trials.
Definition 3.15. Binomial random variable: let 𝑋 be the number of times a certain event 𝐴
occurs in 𝑛 independent Bernoulli trials. 𝑋 is called the Binomial random variable.
For example, 𝑋 could be the number of heads in 𝑛 tosses of a coin (as seen in Examples 3.2 to 3.5,
where 𝑛 = 3 and 𝑝 = 1/2).
The mean of the binomial random variable can be written as:
𝐸 [𝑋 ] = Σ^{𝑛}_{𝑘=0} 𝑘 𝐶^{𝑛}_{𝑘} 𝑝ᵏ (1 − 𝑝)^{𝑛−𝑘} = 𝑛𝑝 Σ^{𝑛−1}_{𝑗=0} ((𝑛 − 1)!/(𝑗!(𝑛 − 1 − 𝑗)!)) 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗} (3.52)
= 𝑛𝑝 (3.53)
Note that the summation Σ^{𝑛−1}_{𝑗=0} ((𝑛 − 1)!/(𝑗!(𝑛 − 1 − 𝑗)!)) 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗} is equal to one, since it adds all the terms of a binomial PMF with parameters 𝑛 − 1 and 𝑝. A similar manipulation of the second moment gives:
𝐸 [𝑋 ²] = 𝑛𝑝 Σ^{𝑛−1}_{𝑗=0} (𝑗 + 1) 𝐶^{𝑛−1}_{𝑗} 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗} (3.56)
= 𝑛𝑝 (Σ^{𝑛−1}_{𝑗=0} 𝑗 𝐶^{𝑛−1}_{𝑗} 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗} + Σ^{𝑛−1}_{𝑗=0} 𝐶^{𝑛−1}_{𝑗} 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗}) (3.57)
In Equation 3.57, the first sum is the mean of a binomial random variable with parameters 𝑛 − 1
and 𝑝, and hence equal to (𝑛 − 1)𝑝. The second sum is the sum of the binomial probabilities and
hence equal to 1. Therefore,
𝐸 [𝑋 ²] = 𝑛𝑝 (𝑛𝑝 + 1 − 𝑝) (3.58)
𝑉𝐴𝑅 [𝑋 ] = 𝐸 [𝑋 ²] − 𝐸 [𝑋 ]² = 𝑛𝑝 (𝑛𝑝 + 1 − 𝑝) − (𝑛𝑝)² = 𝑛𝑝 (1 − 𝑝) = 𝑛𝑝𝑞 (3.59)
We see that the variance of the binomial is 𝑛 times the variance of a Bernoulli random variable.
We observe that values of p close to 0 or to 1 imply smaller variance, and that the maximum
variability is when 𝑝 = 1/2.
The binomial random variable arises in applications where there are two types of objects (i.e.,
heads/tails, correct/erroneous bits, good/defective items, active/silent speakers), and we are
interested in the number of type 1 objects in a randomly selected batch of size 𝑛, where the type
of each object is independent of the types of the other objects in the batch.
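The formulas 𝐸 [𝑋 ] = 𝑛𝑝 and 𝑉𝐴𝑅 [𝑋 ] = 𝑛𝑝 (1 − 𝑝) can be checked numerically from the PMF. The sketch below uses one arbitrary choice of 𝑛 and 𝑝.

# Check of E[X] = np and VAR[X] = np(1-p) for one illustrative choice of n and p.
from math import comb

n, p = 10, 0.3                      # arbitrary example values
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
second_moment = sum(k**2 * pk for k, pk in enumerate(pmf))
variance = second_moment - mean**2

print(mean, n * p)                   # both approximately 3.0
print(variance, n * p * (1 - p))     # both approximately 2.1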
Example 3.8
Solution. 𝑋 is a binomial random variable, and the probability of 𝑘 errors in 𝑛 bit transmissions
is given by the PMF in Equation 3.60:
𝑃 (𝑋 ≤ 1) = 𝐶^{𝑛}_{0} 𝑝⁰ (1 − 𝑝)ⁿ + 𝐶^{𝑛}_{1} 𝑝¹ (1 − 𝑝)^{𝑛−1} = (1 − 𝑝)ⁿ + 𝑛𝑝 (1 − 𝑝)^{𝑛−1}
Note that the PMF decays geometrically with 𝑘, with ratio 𝑞 = 1 − 𝑝. As 𝑝 increases, the PMF decays more rapidly.
𝑃 (𝑋 ≤ 𝑘) = Σ^{𝑘}_{𝑗=1} 𝑞^{𝑗−1} 𝑝 = 𝑝 Σ^{𝑘−1}_{𝑗=0} 𝑞ʲ = 𝑝 (1 − 𝑞ᵏ)/(1 − 𝑞) = 1 − 𝑞ᵏ (3.61)
The mean is obtained by differentiating the geometric series Σ^{∞}_{𝑘=0} 𝑥ᵏ = 1/(1 − 𝑥) term by term to obtain:
1/(1 − 𝑥)² = Σ^{∞}_{𝑘=0} 𝑘𝑥^{𝑘−1} (3.64)
Letting 𝑥 = 𝑞:
𝐸 [𝑋 ] = 𝑝/(1 − 𝑞)² = 1/𝑝 (3.65)
which is finite as long as 𝑝 > 0.
We see that the mean and variance increase as 𝑝, the success probability, decreases.
Sometimes we are interested in 𝑀, the number of failures before a success occurs, also referred to as the modified geometric random variable. Its PMF is:
𝑃 (𝑀 = 𝑘) = (1 − 𝑝)ᵏ 𝑝, 𝑘 = 0, 1, 2, ... (3.69)
The geometric random variable is the only discrete random variable that satisfies the memoryless
property:
𝑃 (𝑋 ≥ 𝑘 + 𝑗 |𝑋 > 𝑗) = 𝑃 (𝑋 ≥ 𝑘) (3.70)
The above expression states that if a success has not occurred in the first 𝑗 trials, then the
probability of having to perform at least 𝑘 more trials is the same as the probability of initially
having to perform at least 𝑘 trials. Thus, each time a failure occurs, the system “forgets” and
begins anew as if it were performing the first trial.
The geometric random variable arises in applications where one is interested in the time (i.e.,
number of trials) that elapses between the occurrence of events in a sequence of independent
experiments. Examples where the modified geometric random variable arises are: number of
customers awaiting service in a queuing system; number of white dots between successive black
dots in a scan of a black-and-white document.
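The memoryless property is straightforward to verify numerically, since 𝑃 (𝑋 ≥ 𝑘) = 𝑞^{𝑘−1} for the geometric random variable. The sketch below uses arbitrary values of 𝑝, 𝑗, and 𝑘.

# Illustration of the memoryless property (3.70) for the geometric random variable
# (number of trials up to and including the first success).
p = 0.2
q = 1 - p

def P_geq(k):
    """P(X >= k) = q**(k-1), k = 1, 2, ..."""
    return q ** (k - 1)

j, k = 3, 5
lhs = P_geq(k + j) / P_geq(j + 1)     # P(X >= k + j | X > j), since {X > j} = {X >= j + 1}
rhs = P_geq(k)
print(lhs, rhs)                        # both approximately 0.8**4 = 0.4096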
Example 3.9
A production line yields two types of devices. Type 1 devices occur with probability 𝛼
and work for a relatively short time that is geometrically distributed with parameter 𝑟 .
Type 2 devices work much longer, occur with probability 1 − 𝛼 and have a lifetime that is
geometrically distributed with parameter 𝑠. Let 𝑋 be the lifetime of an arbitrary device. Find
the PMF, mean and variance of 𝑋 .
Solution. The random experiment that generates 𝑋 involves selecting a device type and then
observing its lifetime. We can partition the sets of outcomes in this experiment into event
𝐵 1 consisting of those outcomes in which the device is type 1, and 𝐵 2 consisting of those
outcomes in which the device is type 2. From the theorem of total probability:
𝑃𝑋 (𝑘) = 𝑃𝑋 |𝐵1 (𝑘)𝑃 (𝐵1 ) + 𝑃𝑋 |𝐵2 (𝑘)𝑃 (𝐵2 ) = 𝛼 𝑟 (1 − 𝑟 )^{𝑘−1} + (1 − 𝛼) 𝑠 (1 − 𝑠)^{𝑘−1}, 𝑘 = 1, 2, ...
The conditional mean and second moment of each device type is that of a geometric random
variable with the corresponding parameter:
𝐸 [𝑋 |𝐵 1 ] = 1/𝑟
𝐸 [𝑋 |𝐵 2 ] = 1/𝑠
𝐸 [𝑋 ²|𝐵1 ] = (1 + 1 − 𝑟 )/𝑟 ²
𝐸 [𝑋 ²|𝐵2 ] = (1 + 1 − 𝑠)/𝑠²
Combining these with the theorem of total probability (as in Equation 3.42) gives:
𝐸 [𝑋 ] = 𝛼/𝑟 + (1 − 𝛼)/𝑠
𝐸 [𝑋 ²] = 𝛼 (1 + 1 − 𝑟 )/𝑟 ² + (1 − 𝛼)(1 + 1 − 𝑠)/𝑠²
and 𝑉𝐴𝑅 [𝑋 ] = 𝐸 [𝑋 ²] − 𝐸 [𝑋 ]².
Note that we do not use the conditional variances to find 𝑉 𝐴𝑅 [𝑋 ], since the Equation 3.42
does not similarly apply to the conditional variances.
The Poisson random variable counts the number of occurrences of an event in a given time interval or region of space. Its PMF is:
𝑃𝑋 (𝑘) = (𝛼ᵏ/𝑘!) 𝑒^{−𝛼}, 𝑘 = 0, 1, 2, ... (3.71)
where 𝛼 is the average number of event occurrences in a specified time interval or region
in space. The PMF sums to one, as required, since:
Σ^{∞}_{𝑘=0} (𝛼ᵏ/𝑘!) 𝑒^{−𝛼} = 𝑒^{−𝛼} Σ^{∞}_{𝑘=0} 𝛼ᵏ/𝑘! = 𝑒^{−𝛼} 𝑒^{𝛼} = 1
where we used the fact that the second summation is the infinite series expansion for 𝑒 𝛼 .
The mean and variance of the Poisson random variable are:
𝐸 [𝑋 ] = 𝛼 (3.72)
𝑉 𝐴𝑅 [𝑋 ] = 𝛼 (3.73)
One of the applications of the Poisson probabilities is to approximate the binomial probabilities
when the number of repeated trials, 𝑛 , is very large and the probability of success in each
individual trial,𝑝 , is very small. Then the binomial random variable can be well approximated by
a Poisson random variable. That is, the Poisson random variable is a limiting case of the binomial
random variable. Let 𝑛 approach infinity and 𝑝 approach 0 in such a way that lim𝑛→∞ 𝑛𝑝 = 𝛼,
then the binomial PMF converges to the PMF of Poisson random variable:
𝐶^{𝑛}_{𝑘} 𝑝ᵏ (1 − 𝑝)^{𝑛−𝑘} → (𝛼ᵏ/𝑘!) 𝑒^{−𝛼}, for 𝑘 = 0, 1, 2, ... (3.74)
The Poisson random variable appears in numerous physical situations because many models are
very large in scale and involve very rare events. For example, the Poisson PMF gives an accurate
prediction for the relative frequencies of the number of particles emitted by a radioactive mass
during a fixed time period.
The Poisson random variable also comes up in situations where we can imagine a sequence of
Bernoulli trials taking place in time or space. Suppose we count the number of event occurrences
in a T-second interval. Divide the time interval into a very large number, 𝑛, of sub-intervals. A
pulse in a sub-interval indicates the occurrence of an event. Each sub-interval can be viewed as
one in a sequence of independent Bernoulli trials if the following conditions hold: (1) At most one
event can occur in a sub-interval, that is, the probability of more than one event occurrence is
negligible; (2) the outcomes in different sub-intervals are independent; and (3) the probability of
an event occurrence in a sub-interval is 𝑝 = 𝛼/𝑛 where 𝛼 is the average number of events observed
in a 1-second interval. The number 𝑁 of events in 1 second is a binomial random variable with
parameters 𝑛 and 𝑝 = 𝛼/𝑛. Thus as 𝑛 → ∞ 𝑁 becomes a Poisson random variable with parameter
𝛼.
Example 3.10
A communication system transmits 𝑛 = 10⁹ bits in one second, and each bit is received in error with probability 𝑝 = 10⁻⁹, independently of the other bits. Find the probability of five or more errors occurring in one second.
Solution. Each bit transmission corresponds to a Bernoulli trial with a “success” correspond-
ing to a bit error in transmission. The probability of 𝑘 errors in 𝑛 = 10⁹ transmissions (1 second) is then given by the binomial probability with 𝑛 = 10⁹ and 𝑝 = 10⁻⁹.
The Poisson approximation uses 𝛼 = 𝑛𝑝 = 10⁹ × 10⁻⁹ = 1. Thus:
𝑃 (𝑋 ≥ 5) = 1 − 𝑃 (𝑋 < 5) = 1 − Σ^{4}_{𝑘=0} (𝛼ᵏ/𝑘!) 𝑒^{−𝛼}
= 1 − 𝑒^{−1} (1 + 1/1! + 1/2! + 1/3! + 1/4!) = 0.00366
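The quality of the Poisson approximation can be illustrated numerically. Computing the exact binomial tail for 𝑛 = 10⁹ is impractical, so the sketch below compares the Poisson tail with a binomial tail that has a more modest 𝑛 but the same 𝛼 = 𝑛𝑝 = 1.

# Poisson approximation versus a binomial with the same alpha = n*p = 1.
from math import comb, exp, factorial

alpha = 1.0
poisson_tail = 1 - sum(exp(-alpha) * alpha**k / factorial(k) for k in range(5))

n = 1000                       # smaller n chosen only to make the exact sum practical
p = alpha / n
binom_tail = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5))

print(poisson_tail)   # approximately 0.00366
print(binom_tail)     # close to the Poisson value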
The discrete uniform random variable takes each value in 𝑆𝑋 = {1, 2, ..., 𝐿} with probability 1/𝐿. Its mean and variance are:
𝐸 [𝑋 ] = (𝐿 + 1)/2 (3.75)
𝑉𝐴𝑅 [𝑋 ] = (𝐿² − 1)/12 (3.76)
This random variable occurs whenever outcomes are equally likely, e.g., toss of a fair coin or a
fair die, spinning of an arrow in a wheel divided into equal segments, selection of numbers from
an urn.
Example 3.11
Let 𝑋 be the time required to transmit a message, where 𝑋 is a uniform random variable
with S_X = {1, ..., L}. Suppose that a message has already been transmitting for m time units;
find the probability that the remaining transmission time is j time units, and the expected
value of the remaining transmission time.
P_{X|C}(m + j) = P(X = m + j) / P(X > m) = (1/L) / ((L − m)/L) = 1/(L − m),   for m + 1 ≤ m + j ≤ L

E[X|C] = Σ_{j=m+1}^{L} j · (1/(L − m)) = (L + m + 1)/2
The expectation can also be obtained directly from Equation 3.75: given C, the remaining
transmission time is uniform on {1, ..., L − m}, so its expected value is (L − m + 1)/2, and adding
the m time units already elapsed gives (L + m + 1)/2.
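As a quick sanity check, the short sketch below tabulates the conditional PMF and computes E[X|C] directly; the values L = 10 and m = 3 are arbitrary choices for the illustration, not part of the example.

```python
# Numerical check of Example 3.11: conditional PMF and conditional mean.
from fractions import Fraction

L, m = 10, 3
# Conditional PMF of X given C = {X > m}: uniform over {m+1, ..., L}
pmf_given_C = {x: Fraction(1, L - m) for x in range(m + 1, L + 1)}

E_X_given_C = sum(x * p for x, p in pmf_given_C.items())
print(E_X_given_C)                # 7, computed from the conditional PMF
print(Fraction(L + m + 1, 2))     # formula from the example: (L + m + 1)/2
```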
Definition 3.18. Continuous random variable: A random variable whose CDF 𝐹𝑋 (𝑥) is contin-
uous everywhere, and which, in addition, is sufficiently smooth that it can be written as an integral
of some non-negative function 𝑓 (𝑥):
F_X(x) = ∫_{−∞}^{x} f(t) dt   (3.77)
A random variable of mixed type has a CDF of the form F_X(x) = p F_1(x) + (1 − p) F_2(x),
where 0 < p < 1 and F_1(x) is the CDF of a discrete random variable and F_2(x) is the CDF of a
continuous random variable. Random variables of mixed type can be viewed as being produced
by a two-step process: A coin is tossed; if the outcome of the toss is heads, a discrete random
variable is generated according to 𝐹 1 (𝑥) otherwise, a continuous random variable is generated
according to 𝐹 2 (𝑥).
f_X(x) = dF_X(x)/dx   (3.78)
The PDF represents the “density” of probability at the point 𝑥 in the following sense: The
probability that 𝑋 is in a small interval in the vicinity of 𝑥, i.e. 𝑥 < 𝑋 ≤ 𝑥 + ℎ, is:
P(x < X ≤ x + h) = F_X(x + h) − F_X(x) = [(F_X(x + h) − F_X(x)) / h] · h   (3.79)
If the CDF has a derivative at x, then as h becomes very small,
P(x < X ≤ x + h) ≈ f_X(x) h   (3.80)
Thus f_X(x) represents the “density” of probability at the point x in the sense that the probability that
X is in a small interval in the vicinity of x is approximately f_X(x)h. The derivative of the CDF,
when it exists, is positive since the CDF is a non-decreasing function of 𝑥, thus:
𝑓𝑋 (𝑥) ≥ 0 (3.81)
Note that the PDF specifies the probabilities of events of the form “𝑋 falls in a small interval of
width 𝑑𝑥 about the point 𝑥”. Therefore probabilities of events involving 𝑋 in a certain range can
be expressed in terms of the PDF by adding the probabilities of intervals of width 𝑑𝑥. As the
widths of the intervals approach zero, we obtain an integral in terms of the PDF:
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx   (3.82)
The probability of an interval is therefore the area under 𝑓𝑋 (𝑥) in that interval.
Figure 3.6: (a) The probability density function specifies the probability of intervals of infinitesimal
width. (b) The probability of an interval [𝑎, 𝑏] is the area under the PDF in that interval.
(Taken from Alberto Leon-Garcia, Probability, statistics, and random processes for
electrical engineering,3rd ed. Pearson, 2007)
The probability of any event that consists of the union of disjoint intervals can thus be found by
adding the integrals of the PDF over each of the intervals.
The CDF of 𝑋 can be obtained by integrating the PDF:
F_X(x) = ∫_{−∞}^{x} f_X(t) dt   (3.83)
Since the probabilities of all events involving 𝑋 can be written in terms of the CDF, it then follows
that these probabilities can be written in terms of the PDF. Thus the PDF completely specifies the
behavior of continuous random variables.
By letting 𝑥 tend to infinity in Equation 3.83, we obtain:
1 = ∫_{−∞}^{∞} f_X(t) dt   (3.84)
A valid PDF can be formed by normalising any non-negative, piecewise continuous function 𝑔(𝑥)
that has a finite integral over all real values of 𝑥.
Example 3.12
Let f_X(x) = βx^2 for −1 ≤ x ≤ 2 and zero otherwise. Find β so that f_X(x) is a valid PDF.
Solution. We require:
1 = ∫_{−∞}^{∞} f_X(t) dt = ∫_{−1}^{2} β x^2 dx = (β/3)(8 + 1) = 3β
so that β = 1/3.
Recall that the delta function 𝛿 (𝑥) is zero everywhere except at 𝑥 = 0, where it is unbounded. To
maintain the right continuity of the step function at 0, we use the convention:
u(0) = 1 = ∫_{−∞}^{0} δ(t) dt   (3.87)
f_X(x) = dF_X(x)/dx = Σ_k P_X(x_k) δ(x − x_k)   (3.88)
Thus the generalized definition of PDF places a delta function of weight 𝑃 (𝑋 = 𝑥𝑘 ) at the points
𝑥𝑘 where the CDF is discontinuous.
Example 3.13
F_{X|C}(x) = P({X ≤ x} ∩ C) / P(C)   (3.89)
and satisfies all the properties of a CDF.
The conditional PDF of 𝑋 given 𝐶 is then defined by:
f_{X|C}(x) = d/dx F_{X|C}(x)   (3.90)
Example 3.14
The lifetime 𝑋 of a machine has a continuous CDF 𝐹𝑋 (𝑥). Find the conditional CDF and
PDF given the event 𝐶 = {𝑋 > 𝑡 } (i.e., “machine is still working at time 𝑡”).
Solution. For x > t,
F_{X|C}(x) = P({X ≤ x} ∩ {X > t}) / P(X > t) = (F_X(x) − F_X(t)) / (1 − F_X(t))
and differentiating with respect to x gives
f_{X|C}(x) = f_X(x) / (1 − F_X(t)),   x > t
Now suppose that we have a partition of the sample space 𝑆 into the union of disjoint events
𝐵 1, 𝐵 2, ..., 𝐵𝑛 . Let 𝐹𝑋 |𝐵𝑖 (𝑥) be the conditional CDF of 𝑋 given event 𝐵𝑖 . The theorem on total
probability allows us to find the CDF of 𝑋 in terms of the conditional CDFs:
F_X(x) = P(X ≤ x) = Σ_{i=1}^{n} P(X ≤ x | B_i) P(B_i) = Σ_{i=1}^{n} F_{X|B_i}(x) P(B_i)   (3.91)
The expected value of a continuous random variable X is defined as
E[X] = ∫_{−∞}^{+∞} t f_X(t) dt   (3.93)
The expected value E[X] is defined if the above integral converges absolutely, that is,
E[|X|] = ∫_{−∞}^{+∞} |t| f_X(t) dt < ∞
We already discussed 𝐸 [𝑋 ] for discrete random variables in detail, but the definition in Equation
3.93 is applicable if we express the PDF of a discrete random variable using delta (𝛿) functions:
E[X] = ∫_{−∞}^{+∞} t Σ_k P_X(x_k) δ(t − x_k) dt
= Σ_k P_X(x_k) ∫_{−∞}^{+∞} t δ(t − x_k) dt
= Σ_k P_X(x_k) x_k
Example 3.15
The PDF of the uniform random variable is a constant value over a certain range [a, b] and zero
elsewhere. Find its expected value.
Solution.
E[X] = ∫_a^b t · (1/(b − a)) dt = (a + b)/2
which is the midpoint of the interval [𝑎, 𝑏].
The result in Example 3.15 could have been found immediately by noting that 𝐸 [𝑋 ] = 𝑚 when
the PDF is symmetric about a point 𝑚, i.e. 𝑓𝑋 (𝑚 − 𝑥) = 𝑓𝑋 (𝑚 + 𝑥) for all 𝑥, then assuming that
the mean exists,
0 = ∫_{−∞}^{+∞} (m − t) f_X(t) dt = m − ∫_{−∞}^{+∞} t f_X(t) dt
The first equality above follows from the symmetry of 𝑓𝑋 (𝑡) about 𝑡 = 𝑚 and the odd symmetry
of (𝑚 − 𝑡) about the same point. We then have that 𝐸 [𝑋 ] = 𝑚.
The following expressions are useful when 𝑋 is a nonnegative random variable:
E[X] = ∫_0^∞ (1 − F_X(t)) dt   if X is continuous and nonnegative   (3.94)

E[X] = Σ_{k=0}^{∞} P(X > k)   if X is nonnegative and integer-valued   (3.95)
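Equation 3.95 is easy to verify numerically for a specific case. The sketch below uses a geometric random variable with P(X = k) = p(1 − p)^{k−1}, k = 1, 2, ... (one convenient nonnegative, integer-valued example; p = 0.3 is an arbitrary choice) and compares the direct mean with the sum of tail probabilities.

```python
# Numerical illustration of E[X] = sum_k P(X > k) for a geometric random variable.
p = 0.3
mean_direct = sum(k * p * (1 - p)**(k - 1) for k in range(1, 2000))   # E[X] = 1/p
mean_via_tail = sum((1 - p)**k for k in range(0, 2000))               # sum of P(X > k) = (1-p)^k

print(mean_direct, mean_via_tail, 1 / p)   # all three approximately 3.3333
```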
Example 3.16
Let Y = aX + b, where a and b are constants. Find E[Y].
Solution.
E[Y] = E[aX + b] = ∫_{−∞}^{+∞} (ax + b) f_X(x) dx = a ∫_{−∞}^{+∞} x f_X(x) dx + b = aE[X] + b
In general, expectation is a linear operation, and the expectation operator can be interchanged
with any other linear operation. For any linear combination of functions:
E[Σ_k a_k g_k(X)] = ∫_{−∞}^{∞} (Σ_k a_k g_k(x)) f_X(x) dx = Σ_k a_k ∫_{−∞}^{∞} g_k(x) f_X(x) dx = Σ_k a_k E[g_k(X)]   (3.97)
Moments
Definition 3.22. Moment: The 𝑛𝑡ℎ moment of a continuous random variable 𝑋 is defined as:
E[X^n] = ∫_{−∞}^{+∞} x^n f_X(x) dx   (3.98)
The zeroth moment is simply the area under the PDF and must be one for any random variable.
The most commonly used moments are the first and second moments. The first moment is the
expected value. For some random variables, the second moment might be a more meaningful
characterization than the first. For example, suppose 𝑋 is a sample of a noise waveform. We might
expect that the distribution of the noise is symmetric about zero and hence the first moment will
be zero. It only shows that the noise does not have a bias. However, the second moment of the
random noise is in some sense a measure of the strength of the noise, which can give us some
useful physical insight into the power of the noise.
Under certain conditions, a PDF is completely specified if the expected values of all the moments
of 𝑋 are known.
Variance
Similar to the definition of variance for discrete random variables, for continuous random variables
𝑋 , the variance is defined as:
𝑉 𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝐸 [𝑋 ]) 2 ] = 𝐸 [𝑋 2 ] − 𝐸 [𝑋 ] 2 (3.99)
Example 3.17
Find the variance of the continuous uniform random variable in Example 3.15.
Solution.
VAR[X] = ∫_a^b (x − (a + b)/2)^2 · (1/(b − a)) dx
Let y = x − (a + b)/2, then
VAR[X] = (1/(b − a)) ∫_{−(b−a)/2}^{(b−a)/2} y^2 dy = (b − a)^2 / 12
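The two results for the uniform random variable can be checked with a simple Riemann-sum integration; the endpoints a = 2 and b = 5 below are arbitrary values chosen for the check.

```python
# Numerical check of Examples 3.15 and 3.17 for a uniform random variable on [a, b].
a, b = 2.0, 5.0
N = 200_000
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]
f = 1.0 / (b - a)                      # uniform PDF on [a, b]

mean = sum(x * f * dx for x in xs)
second_moment = sum(x**2 * f * dx for x in xs)
var = second_moment - mean**2

print(mean, (a + b) / 2)               # ~3.5 vs (a+b)/2
print(var, (b - a)**2 / 12)            # ~0.75 vs (b-a)^2/12
```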
The properties derived in section 3.2.3 can be similarly derived for the variance of continuous
random variables:
𝑉 𝐴𝑅 [𝑐] = 0 (3.100)
𝑉 𝐴𝑅 [𝑋 + 𝑐] = 𝑉 𝐴𝑅 [𝑋 ] (3.101)
𝑉 𝐴𝑅 [𝑐𝑋 ] = 𝑐 2𝑉 𝐴𝑅 [𝑋 ] (3.102)
where 𝑐 is a constant.
The mean and variance are the two most important parameters used in summarizing the PDF
of a random variable. Other parameters and moments are occasionally used. For example, the
skewness defined by 𝐸 [(𝑋 − 𝐸 [𝑋 ]) 3 ]/𝑆𝑇 𝐷 [𝑋 ] 3 measures the degree of asymmetry about the
mean. It is easy to show that if a PDF is symmetric about its mean, then its skewness is zero.
The point to note with these parameters of the PDF is that each involves the expected value of a
higher power of 𝑋 .
• and CDF:
F_U(x) = 0 for x < a,   F_U(x) = (x − a)/(b − a) for a ≤ x ≤ b,   F_U(x) = 1 for x > b   (3.104)
• and CDF:
F_X(x) = 1 − e^{−λx} for x ≥ 0,   and F_X(x) = 0 for x < 0   (3.108)
The parameter λ is the rate at which events occur, so F_X(x), the probability of an event
occurring by time x, increases as the rate λ increases.
E[X] = ∫_0^∞ t λ e^{−λt} dt = [−t e^{−λt}]_0^∞ + ∫_0^∞ e^{−λt} dt
= lim_{t→∞} (−t e^{−λt}) − 0 + [−e^{−λt}/λ]_0^∞
= lim_{t→∞} (−e^{−λt}/λ) + 1/λ = 1/λ   (3.110)
where we have used the fact that 𝑒 −𝜆𝑡 and 𝑡𝑒 −𝜆𝑡 go to zero as 𝑡 approaches infinity.
In event inter-arrival situations, 𝜆 is in units of events/second and 1/𝜆 is in units of seconds per
event inter-arrival.
The exponential random variable satisfies the memoryless property:
P(X > t + h | X > t) = P(X > h)
The expression on the left side is the probability of having to wait at least h additional seconds
given that one has already been waiting t seconds. The expression on the right side is the probability
of waiting at least ℎ seconds when one first begins to wait. Thus the probability of waiting at
least an additional ℎ seconds is the same regardless of how long one has already been waiting!
This property can be proved as follows:
P(X > t + h | X > t) = P({X > t + h} ∩ {X > t}) / P(X > t)   for h > 0
= P(X > t + h) / P(X > t) = e^{−λ(t+h)} / e^{−λt}
= e^{−λh} = P(X > h)
The memoryless property of the exponential random variable makes it the cornerstone for the
theory of Markov chains, which is used extensively in evaluating the performance of computer
systems and communications networks. It can be shown that the exponential random variable is
the only continuous random variable that satisfies the memoryless property.
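A short simulation makes the memoryless property tangible. The sketch below (λ, t and h are arbitrary illustration values) estimates P(X > t + h | X > t) and P(X > h) from exponential samples and compares both with e^{−λh}.

```python
# Simulation sketch of the memoryless property of the exponential random variable.
import math
import random

lam, t, h = 2.0, 0.5, 0.3
samples = [random.expovariate(lam) for _ in range(1_000_000)]

survived_t = [x for x in samples if x > t]
p_conditional = sum(x > t + h for x in survived_t) / len(survived_t)  # P(X > t+h | X > t)
p_fresh = sum(x > h for x in samples) / len(samples)                  # P(X > h)

print(p_conditional, p_fresh, math.exp(-lam * h))   # all approximately e^{-lambda*h}
```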
F_X(x) = (1/√(2π)) ∫_{−∞}^{(x−m)/σ} e^{−t^2/2} dt = Φ((x − m)/σ)   (3.114)
where Φ(x) is the CDF of a Gaussian random variable with m = 0 and σ = 1:
Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t^2/2} dt   (3.115)
Therefore any probability involving an arbitrary Gaussian random variable can be expressed
in terms of Φ(𝑥).
• Note that the PDF of a Gaussian random variable is symmetric about the point 𝑚. Therefore
the mean is 𝐸 [𝑋 ] = 𝑚 (as also defined above).
Q(x) = 1 − Φ(x) = (1/√(2π)) ∫_x^∞ e^{−t^2/2} dt   (3.116)
Q(x) is simply the probability of the “tail” of the PDF. The symmetry of the PDF implies that:
Q(0) = 1/2  and  Q(−x) = 1 − Q(x)   (3.117)
From Equation 3.114, which corresponds to P(X ≤ x), the following can be derived:
P(X > x) = Q((x − m)/σ)   (3.118)
Figure 3.8: Standardized integrals related to the Gaussian CDF and the Φ and 𝑄 functions.
Figure 3.8 shows the standardized integrals related to the Gaussian CDF and the Φ and 𝑄 functions.
It can be shown that it is impossible to express the CDF integral in closed form. However, as with
other important integrals that cannot be expressed in closed form (e.g., Bessel functions), one can
always look up values of the required CDF in tables, or use numerical approximations of the
desired integral to any desired accuracy. The following expression has been found to give
good accuracy for Q(x) over the entire range 0 < x < ∞:
Q(x) ≈ [1 / ((1 − a)x + a√(x^2 + b))] · (1/√(2π)) e^{−x^2/2}   (3.119)
where a = 1/π and b = 2π.
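The accuracy of Equation 3.119 can be checked against the exact Q function, which is available through the complementary error function as Q(x) = erfc(x/√2)/2. A minimal sketch:

```python
# Comparing the approximation in Equation 3.119 with the exact Q function.
import math

def Q_exact(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

def Q_approx(x, a=1/math.pi, b=2*math.pi):
    return (1.0 / ((1 - a) * x + a * math.sqrt(x**2 + b))) \
           * math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

for x in [0.5, 1.0, 2.0, 3.0, 4.0]:
    print(x, Q_exact(x), Q_approx(x))   # the two columns agree closely
```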
In some problems, we are interested in finding the value of 𝑥 for which 𝑄 (𝑥) = 10−𝑘 . Table 3.1
gives these values for 𝑘 = 1, ..., 10.
The Gaussian random variable plays a very important role in communication systems, where
transmission signals are corrupted by noise voltages resulting from the thermal motion of electrons.
It can be shown from physical principles that these voltages will have a Gaussian PDF.
Example 3.18
𝐸 [𝑋 ] = 𝛼/𝜆 (3.124)
𝑉 𝐴𝑅 [𝑋 ] = 𝛼/𝜆 2 (3.125)
The versatility of the gamma random variable is due to the richness of the gamma function Γ(𝛼).
The PDF of the gamma random variable can assume a variety of shapes as shown in Figure 3.9. By
varying the parameters 𝜆 and 𝛼 it is possible to fit the gamma PDF to many types of experimental
data. The exponential random variable is obtained by letting 𝛼 = 1. By letting 𝜆 = 1/2 and 𝛼 = 𝑘/2,
where 𝑘 is a positive integer, we obtain the Chi-square random variable, which appears in
certain statistical problems and wireless communications applications. The m-Erlang random
variable is obtained when α = m, a positive integer. The m-Erlang random variable is used in
system reliability models and in queueing system models, and plays a fundamental role in the
study of wireline telecommunication networks.
In general, the CDF of the gamma random variable does not have a closed-form expression.
However, the special case of the m-Erlang random variable does have a closed-form expression.
Example 3.19
The mean height of children in a kindergarten class is 70 cm. Find the bound on the proba-
bility that a kid in the class is taller than 140 cm.
Solution. By the Markov inequality, P(H ≥ 140) ≤ E[H]/140 = 70/140 = 0.5.
The bound in the above example appears to be ridiculous. However, a bound, by its very nature,
must take the worst case into consideration. One can easily construct a random variable for which
the bound given by the Markov inequality is exact. The reason we know that the bound in the
above example is ridiculous is that we have knowledge about the variability of the children’s
height about their mean.
Definition 3.24. Chebyshev inequality: Suppose that the mean 𝐸 [𝑋 ] = 𝑚 and the variance
𝑉 𝐴𝑅 [𝑋 ] = 𝜎 2 of a random variable are known, and that we are interested in bounding 𝑃 (|𝑋 −𝑚| ≥ 𝑎).
The Chebyshev inequality states that:
P(|X − m| ≥ a) ≤ σ^2 / a^2   (3.127)
The inequality follows by applying the Markov inequality to the nonnegative random variable
D^2 = (X − m)^2, and noting that {D^2 ≥ a^2} and {|X − m| ≥ a} are equivalent events. Suppose that a random variable
X has zero variance; then the Chebyshev inequality implies that P(X = m) = 1, i.e. the random
variable is equal to its mean with probability one, and hence is constant in almost all experiments.
Example 3.20
If X is a Gaussian random variable with mean m and variance σ^2, find the upper bound for
P(|X − m| ≥ kσ) according to the Chebyshev inequality.
Solution. Setting a = kσ in Equation 3.127 gives P(|X − m| ≥ kσ) ≤ σ^2/(kσ)^2 = 1/k^2.
We see that for certain random variables, the Chebyshev inequality can give rather loose bounds.
Nevertheless, the inequality is useful in situations in which we have no knowledge about the
distribution of a given random variable other than its mean and variance. We will later use the
Chebyshev inequality to prove that the arithmetic average of independent measurements of the
same random variable is highly likely to be close to the expected value of the random variable
when the number of measurements is large.
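To see how loose the bound of Example 3.20 can be, the sketch below compares 1/k^2 with the exact Gaussian tail probability P(|X − m| ≥ kσ) = 2Q(k) = erfc(k/√2).

```python
# Chebyshev bound vs exact Gaussian tail probability.
import math

for k in [1, 2, 3, 4]:
    chebyshev_bound = 1 / k**2                      # P(|X - m| >= k*sigma) <= 1/k^2
    gaussian_exact = math.erfc(k / math.sqrt(2))    # = 2*Q(k) for a Gaussian X
    print(k, chebyshev_bound, gaussian_exact)
```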
If more information is available than just the mean and variance, then it is possible to obtain
bounds that are tighter than the Markov and Chebyshev inequalities. Consider the Markov
inequality again. The region of interest is 𝐴 = {𝑡 ≥ 𝑎}, so let 𝐼𝐴 (𝑡) be the indicator function, i.e.
𝐼𝐴 (𝑡) = 1 if 𝑡 ∈ 𝐴 and 𝐼𝐴 (𝑡) = 0 otherwise. The key step in the derivation is to note that 𝑡/𝑎 ≥ 1
in the region of interest. In effect we bounded 𝐼𝐴 (𝑡) by 𝑡/𝑎 and then have:
P(X ≥ a) = ∫_0^∞ I_A(t) f_X(t) dt ≤ ∫_0^∞ (t/a) f_X(t) dt = E[X]/a
By changing the upper bound on 𝐼𝐴 (𝑡), we can obtain different bounds on 𝑃 (𝑋 ≥ 𝑎). Consider
the bound 𝐼𝐴 (𝑡) ≤ 𝑒 𝑠 (𝑡 −𝑎) , also shown in Figure 3.10, where 𝑠 > 0 then the following bound can
be obtained.
Definition 3.25. Chernoff bound: Suppose 𝑋 is a random variable, then:
P(X ≥ a) ≤ ∫_0^∞ e^{s(t−a)} f_X(t) dt = e^{−sa} E[e^{sX}]   (3.128)
This bound is called the Chernoff bound, which can be seen to depend on the expected value of an
exponential function of 𝑋 . This function is called the moment generating function.
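As an illustration (this case is not worked in the notes), the sketch below evaluates the Chernoff bound for a zero-mean, unit-variance Gaussian random variable, for which E[e^{sX}] = e^{s^2/2} and the integration extends over the whole real line; the bound is then tightened by a simple numerical search over s and compared with the exact tail Q(a).

```python
# Chernoff bound for a standard Gaussian, optimized numerically over s.
import math

def chernoff_bound(a, s):
    return math.exp(-s * a) * math.exp(s**2 / 2)    # e^{-sa} E[e^{sX}] with E[e^{sX}] = e^{s^2/2}

a = 3.0
s_values = [i * 0.01 for i in range(1, 1000)]
best = min(chernoff_bound(a, s) for s in s_values)  # optimum at s = a, giving e^{-a^2/2}

exact = 0.5 * math.erfc(a / math.sqrt(2))           # exact tail Q(a)
print(best, math.exp(-a**2 / 2), exact)             # ~0.0111, 0.0111, 0.00135
```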
Further Reading
1. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: chapters 3 and 4
2. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, 2nd ed., Elsevier 2012: section 2.8 and 2.9, and
chapters 3 and 4.
3. Anthony Hayter, Probability and Statistics for Engineers and Scientists, 4th ed., Brooks/Cole,
Cengage Learning 2012: chapters 2 to 5.
4 Two or More Random Variables
Many random experiments involve several random variables. In some experiments a number of
different quantities are measured. For example, the voltage signals at several points in a circuit at
some specific time may be of interest. Other experiments involve the repeated measurement of
a certain quantity such as the repeated measurement (“sampling”) of the amplitude of an audio
or video signal that varies with time. In this chapter, we extend the random variable concepts
already introduced to two or more random variables. In a sense we have already covered all the
fundamental concepts of probability and random variables, and we are “simply” elaborating on the
case of two or more random variables. Nevertheless, there are significant analytical techniques
that need to be learned.
𝐹𝑋 ,𝑌 (−∞, −∞) = 0
𝐹𝑋 ,𝑌 (−∞, 𝑦) = 0
𝐹𝑋 ,𝑌 (𝑥, −∞) = 0
𝐹𝑋 ,𝑌 (∞, ∞) = 1
• Consider using a joint CDF to evaluate the probability that the pair of random variables
(X, Y) falls into a rectangular region bounded by the points (x_1, y_1), (x_2, y_1), (x_1, y_2) and
(x_2, y_2) (the white rectangle in Figure 4.1). Evaluating F_{X,Y}(x_2, y_2) gives the probability that the
random variable falls anywhere below or to the left of the point (𝑥 2, 𝑦2 ); this includes all
of the area in the desired rectangle, plus everything below and to the left of the desired
rectangle. The probability of the random variable falling to the left of the rectangle can be
subtracted off using 𝐹𝑋 ,𝑌 (𝑥 1, 𝑦2 ). Similarly, the region below the rectangle can be subtracted
off using 𝐹𝑋 ,𝑌 (𝑥 2, 𝑦1 ) (two shaded regions). In subtracting off these two quantities, we have
subtracted twice the probability of the pair falling both below and to the left of the desired
rectangle (dark-shaded region). Hence we must add back this probability using 𝐹𝑋 ,𝑌 (𝑥 1, 𝑦1 ).
That is:
𝑃 (𝑥 1 < 𝑋 ≤ 𝑥 2, 𝑦1 < 𝑌 ≤ 𝑦2 ) = 𝐹𝑋 ,𝑌 (𝑥 2, 𝑦2 ) − 𝐹𝑋 ,𝑌 (𝑥 1, 𝑦2 ) − 𝐹𝑋 ,𝑌 (𝑥 2, 𝑦1 ) + 𝐹𝑋 ,𝑌 (𝑥 1, 𝑦1 ) ≥ 0.
(4.1)
Figure 4.1: Illustrating the evaluation of the probability of a pair of random variables falling in a
rectangular region.
Equation 4.1 tells us how to calculate the probability of the pair of random variables falling in
a rectangular region. Often, we are also interested in calculating the probability of the pair of
random variables falling in a non-rectangular region (e.g., a circle or triangle). This can be done by
forming the required region from many infinitesimal rectangles and then repeatedly applying
Equation 4.1.
Example 4.1
Consider a pair of random variables which are uniformly distributed over the unit square
(i.e., 0 < 𝑥 < 1, 0 < 𝑦 < 1). Find the joint CDF.
F_X(x) = F_{X,Y}(x, ∞) = 0 for x < 0,   x for 0 ≤ x ≤ 1,   1 for x > 1
Hence, the marginal CDF of 𝑋 is a uniform distribution. The same statement holds for 𝑌 as
well.
f_{X,Y}(x, y) = lim_{ε_x→0, ε_y→0} P(x ≤ X < x + ε_x, y ≤ Y < y + ε_y) / (ε_x ε_y)   (4.2)
Similar to the one-dimensional case, the joint PDF is the probability that the pair of random variables
(𝑋, 𝑌 ) lies in an infinitesimal region defined by the point (𝑥, 𝑦) normalised by the area of the region.
For a single random variable, the PDF was the derivative of the CDF. By applying Equation 4.1 to
the definition of the joint PDF, a similar relationship is obtained.
Theorem 4.1
The joint PDF 𝑓𝑋 ,𝑌 (𝑥, 𝑦) can be obtained from the joint CDF 𝐹𝑋 ,𝑌 (𝑥, 𝑦) by taking a partial
derivative with respect to each variable. That is,
f_{X,Y}(x, y) = ∂^2/∂x∂y F_{X,Y}(x, y)   (4.3)
Proof. From Equation 4.1,
P(x ≤ X < x + ε_x, y ≤ Y < y + ε_y)
= F_{X,Y}(x + ε_x, y + ε_y) − F_{X,Y}(x, y + ε_y) − F_{X,Y}(x + ε_x, y) + F_{X,Y}(x, y)
= [F_{X,Y}(x + ε_x, y + ε_y) − F_{X,Y}(x, y + ε_y)] − [F_{X,Y}(x + ε_x, y) − F_{X,Y}(x, y)]
Dividing by ε_x and letting ε_x → 0,
lim_{ε_x→0} P(x ≤ X < x + ε_x, y ≤ Y < y + ε_y)/ε_x
= lim_{ε_x→0} [F_{X,Y}(x + ε_x, y + ε_y) − F_{X,Y}(x, y + ε_y)]/ε_x − lim_{ε_x→0} [F_{X,Y}(x + ε_x, y) − F_{X,Y}(x, y)]/ε_x
= ∂/∂x F_{X,Y}(x, y + ε_y) − ∂/∂x F_{X,Y}(x, y)
Dividing by ε_y and taking the limit as ε_y → 0 gives the desired result:
f_{X,Y}(x, y) = lim_{ε_x→0, ε_y→0} P(x ≤ X < x + ε_x, y ≤ Y < y + ε_y)/(ε_x ε_y)
= lim_{ε_y→0} [∂/∂x F_{X,Y}(x, y + ε_y) − ∂/∂x F_{X,Y}(x, y)]/ε_y = ∂^2/∂x∂y F_{X,Y}(x, y)
This theorem shows that we can obtain a joint PDF from a joint CDF by differentiating with
respect to each variable. The converse of this statement would be that we could obtain a joint
CDF from a joint PDF by integrating with respect to each variable. Specifically:
F_{X,Y}(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f_{X,Y}(u, v) du dv   (4.4)
Example 4.2
Consider the pair of random variables with uniform distribution in Example 4.1. Find the
joint PDF.
Solution. By differentiating the joint CDF with respect to both 𝑥 and 𝑦, the joint PDF is
f_{X,Y}(x, y) = 1 for 0 < x < 1 and 0 < y < 1, and 0 otherwise.
From the definition of the joint PDF and its relationship with the joint CDF, several properties of
joint PDFs can be inferred:
(i) f_{X,Y}(x, y) ≥ 0
(ii) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1
(iii) f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy   and   f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx
(iv) P(x_1 < X ≤ x_2, y_1 < Y ≤ y_2) = ∫_{y_1}^{y_2} ∫_{x_1}^{x_2} f_{X,Y}(x, y) dx dy
Property (i) follows directly from the definition of the joint PDF since both the numerator and
denominator there are nonnegative. Property (ii) results from the relationship in Equation 4.4
together with the fact that 𝐹𝑋 ,𝑌 (∞, ∞) = 1. This is the normalization integral for joint PDFs. These
first two properties form a set of sufficient conditions for a function of two variables to be a valid
joint PDF. Property (iii) is obtained by noting that the marginal CDF of X is F_X(x) = F_{X,Y}(x, ∞).
Using Equation 4.4 then results in F_X(x) = ∫_{−∞}^{∞} ∫_{−∞}^{x} f_{X,Y}(u, y) du dy. Differentiating this expression
with respect to 𝑥 produces the expression in property (iii) for the marginal PDF of 𝑥. A similar
derivation produces the marginal PDF of 𝑦. Hence, the marginal PDFs are obtained by integrating
out the unwanted variable in the joint PDF. The last property is obtained by combining Equations
4.1 and 4.4.
Property (iv) of joint PDFs specifies how to compute the probability that a pair of random variables
takes on a value in a rectangular region. Often, we are interested in computing the probability
that the pair of random variables falls in a region which is not rectangularly shaped. In general,
suppose we wish to compute 𝑃 ((𝑋, 𝑌 ) ∈ 𝐴), where 𝐴 is the region illustrated in Figure 4.2. This
general region can be approximated as a union of many nonoverlapping rectangular regions as
shown in the figure. In fact, as we make the rectangles ever smaller, the approximation improves
to the point where the representation becomes exact in the limit as the rectangles get infinitely
small. That is, any region can be represented as an infinite number of infinitesimal rectangular
regions so that A = ∪_i R_i, where R_i represents the ith rectangular region. The probability that the
random pair falls in A is then computed as:
P((X, Y) ∈ A) = Σ_i P((X, Y) ∈ R_i) = Σ_i ∬_{R_i} f_{X,Y}(x, y) dx dy   (4.5)
The sum of the integrals over the rectangular regions can be replaced by an integral over the
original region A:
P((X, Y) ∈ A) = ∬_A f_{X,Y}(x, y) dx dy   (4.6)
This important result shows that the probability of a pair of random variables falling in some
two-dimensional region 𝐴 is found by integrating the joint PDF of the two random variables over
the region 𝐴.
Example 4.3
Suppose that a pair of random variables has the joint PDF given by:
Find (a) the constant value 𝑐 and (b) the probability of the event {𝑋 > 𝑌 }.
(b) This probability can be viewed as the probability of the pair (𝑋, 𝑌 ) falling in the region
𝐴 that is now defined as 𝐴 = {(𝑥, 𝑦) : 𝑥 > 𝑦}. This probability is calculated as:
P(X > Y) = ∬_{x>y} f_{X,Y}(x, y) dx dy = ∫_0^∞ ∫_y^∞ (1/2) e^{−x} e^{−y/2} dx dy = ∫_0^∞ (1/2) e^{−3y/2} dy = 1/3
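A Monte Carlo check of part (b). The statement of the joint PDF is not reproduced above, so the sketch assumes f_{X,Y}(x, y) = (1/2)e^{−x}e^{−y/2} for x, y ≥ 0 (consistent with the integrand in the solution), i.e. X and Y are independent exponential random variables with rates 1 and 1/2.

```python
# Monte Carlo estimate of P(X > Y) for independent exponentials with rates 1 and 1/2.
import random

N = 1_000_000
count = 0
for _ in range(N):
    x = random.expovariate(1.0)    # rate 1
    y = random.expovariate(0.5)    # rate 1/2
    if x > y:
        count += 1

print(count / N)    # approximately 1/3
```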
(ii) Σ_{m=1}^{M} Σ_{n=1}^{N} P_{X,Y}(x_m, y_n) = 1   (4.8)
(iii) Σ_{n=1}^{N} P_{X,Y}(x_m, y_n) = P_X(x_m),   Σ_{m=1}^{M} P_{X,Y}(x_m, y_n) = P_Y(y_n)   (4.9)
(iv) P((X, Y) ∈ A) = Σ_{(x,y)∈A} P_{X,Y}(x, y)   (4.10)
Furthermore, the joint PDF or the joint CDF of a pair of discrete random variables can be related
to the joint PMF through the use of delta functions or step functions by:
f_{X,Y}(x, y) = Σ_{m=1}^{M} Σ_{n=1}^{N} P_{X,Y}(x_m, y_n) δ(x − x_m) δ(y − y_n)   (4.11)

F_{X,Y}(x, y) = Σ_{m=1}^{M} Σ_{n=1}^{N} P_{X,Y}(x_m, y_n) u(x − x_m) u(y − y_n)   (4.12)
Usually, it is most convenient to work with PMFs when the random variables are discrete. However,
if the random variables are mixed (i.e., one is discrete and one is continuous), then it becomes
necessary to work with PDFs or CDFs since the PMF will not be meaningful for the continuous
random variable.
Example 4.4
Two discrete random variables 𝑁 and 𝑀 have a joint PMF given by:
P_{N,M}(n, m) = [(n + m)! / (n! m!)] · a^n b^m / (a + b + 1)^{n+m+1},   m = 0, 1, 2, ...,  n = 0, 1, 2, ...
Find the marginal PMFs 𝑃 𝑁 (𝑛) and 𝑃𝑀 (𝑚).
Solution. The marginal PMF of N can be found by summing over m in the joint PMF:
P_N(n) = Σ_{m=0}^{∞} P_{N,M}(n, m) = Σ_{m=0}^{∞} [(n + m)!/(n! m!)] · a^n b^m / (a + b + 1)^{n+m+1}
Using the negative-binomial series Σ_{m=0}^{∞} \binom{n+m}{m} x^m = (1 − x)^{−(n+1)} with x = b/(a + b + 1), this sum evaluates to
P_N(n) = a^n / (1 + a)^{n+1}
and, by the same argument with the roles of n and m interchanged,
P_M(m) = b^m / (1 + b)^{m+1}
Hence, the random variables M and N both follow a geometric distribution.
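The closed-form marginal can be confirmed by summing the joint PMF numerically; a = 0.8 and b = 1.5 in the sketch below are arbitrary positive values chosen for the check.

```python
# Numerical verification of the marginal PMF found in Example 4.4.
import math

a, b = 0.8, 1.5

def joint_pmf(n, m):
    return math.comb(n + m, m) * a**n * b**m / (a + b + 1)**(n + m + 1)

for n in range(5):
    marginal = sum(joint_pmf(n, m) for m in range(400))     # sum over m (truncated series)
    closed_form = a**n / (1 + a)**(n + 1)
    print(n, marginal, closed_form)                          # the two columns agree
```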
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = P_{X,Y}(x, y) / P_Y(y)   (4.13)

P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y)   (4.14)
Example 4.5
Using the joint PMF given in Example 4.4 along with the marginal PMF found in that exam-
ple, find the conditional PMF: 𝑃 𝑁 |𝑀 (𝑛|𝑚)
Solution.
P_{N|M}(n|m) = P_{M,N}(m, n) / P_M(m)
= [(n + m)!/(n! m!)] · [a^n b^m / (a + b + 1)^{n+m+1}] · [(1 + b)^{m+1} / b^m]
= [(n + m)!/(n! m!)] · a^n (1 + b)^{m+1} / (a + b + 1)^{n+m+1}
Note that the conditional PMF of 𝑁 given 𝑀 is quite different than the marginal PMF of 𝑁 .
That is, knowing 𝑀 changes the distribution of 𝑁 .
The simple result developed in Equation 4.13 can be extended to the case of continuous random
variables and PDFs.
Definition 4.4. Conditional probability density function: The conditional PDF of a random
variable 𝑋 given that 𝑌 = 𝑦 is:
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)   (4.15)
Integrating both sides of this equation with respect to x produces the conditional CDFs:
Definition 4.5. Conditional cumulative distribution function: The conditional CDF of a
random variable 𝑋 given that 𝑌 = 𝑦 is:
F_{X|Y}(x|y) = [∫_{−∞}^{x} f_{X,Y}(x′, y) dx′] / f_Y(y)   (4.16)
Usually, the conditional PDF is much easier to work with, so the conditional CDF will not be
discussed further.
Example 4.6
Suppose that X and Y have the joint PDF
f_{X,Y}(x, y) = 2abc / (ax + by + c)^3 · u(x) u(y)
for some positive constants a, b, and c. Find the conditional PDF of X given Y and of Y given X.
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y) = 2a(by + c)^2 / (ax + by + c)^3 · u(x)
f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = 2b(ax + c)^2 / (ax + by + c)^3 · u(y)
Example 4.7
Suppose that X and Y have the joint PDF
f_{X,Y}(x, y) = (1/(π√3)) exp(−(2/3)(x^2 − xy + y^2))
Find the marginal PDF of X and the conditional PDF of X given Y = y.
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = (1/(π√3)) exp(−(2/3)x^2) ∫_{−∞}^{∞} exp(−(2/3)(y^2 − xy)) dy
= (1/(π√3)) exp(−x^2/2) ∫_{−∞}^{∞} exp(−(2/3)(y^2 − xy + x^2/4)) dy
= (1/(π√3)) exp(−x^2/2) ∫_{−∞}^{∞} exp(−(2/3)(y − x/2)^2) dy
= (1/√(2π)) exp(−x^2/2)
and we see that 𝑋 is a zero-mean, unit-variance, Gaussian (i.e., standard normal) random
variable. By symmetry, the marginal PDF of 𝑌 must also be of the same form.
The conditional PDF of 𝑋 given 𝑌 is
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y) = [(1/(π√3)) exp(−(2/3)(x^2 − xy + y^2))] / [(1/√(2π)) exp(−y^2/2)]
= √(2/(3π)) exp(−(2/3)(x − y/2)^2)
So, the conditional PDF of 𝑋 given 𝑌 is also Gaussian. But, given that it is known that 𝑌 = 𝑦,
the mean of 𝑋 is now 𝑦/2 (instead of zero), and the variance of 𝑋 is 3/4 (instead of one). In
this example, knowledge of 𝑌 has shifted the mean and reduced the variance of 𝑋 .
For a conditioning event of the form A = {y_1 < Y ≤ y_2}, the conditional PDF and CDF of X given A are:
f_{X|A}(x) = [∫_{y_1}^{y_2} f_{X,Y}(x, y) dy] / [∫_{y_1}^{y_2} f_Y(y) dy]   (4.18)

F_{X|A}(x) = [F_{X,Y}(x, y_2) − F_{X,Y}(x, y_1)] / [F_Y(y_2) − F_Y(y_1)]   (4.19)
Example 4.8
Using the joint PDF of Example 4.7, determine the conditional PDF of 𝑋 given that 𝑌 > 𝑦0 .
Solution.
∫_{y_0}^{∞} f_{X,Y}(x, y) dy = (1/(π√3)) ∫_{y_0}^{∞} exp(−(2/3)(x^2 − xy + y^2)) dy
= (1/√(2π)) exp(−x^2/2) ∫_{y_0}^{∞} √(2/(3π)) exp(−(2/3)(y − x/2)^2) dy
= (1/√(2π)) exp(−x^2/2) Q((2y_0 − x)/√3)
Since the marginal PDF of Y is a zero-mean, unit-variance Gaussian PDF,
∫_{y_0}^{∞} f_Y(y) dy = ∫_{y_0}^{∞} (1/√(2π)) exp(−y^2/2) dy = Q(y_0)
so that, by Equation 4.18,
f_{X|{Y>y_0}}(x) = (1/√(2π)) exp(−x^2/2) · Q((2y_0 − x)/√3) / Q(y_0)
Note that when the conditioning event was a point condition on 𝑌 , the conditional PDF of 𝑋
was Gaussian; yet, when the conditioning event is an interval condition on 𝑌 , the resulting
conditional PDF of 𝑋 is not Gaussian at all.
For discrete random variables, the equivalent expression in terms of the joint PMF is:
E[g(X, Y)] = Σ_m Σ_n g(x_m, y_n) P_{X,Y}(x_m, y_n)   (4.21)
If the function 𝑔(𝑥, 𝑦) is actually a function of only a single variable, say 𝑥, then this definition
reduces to the definition of expected values for functions of a single random variable:
E[g(X)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) f_{X,Y}(x, y) dx dy = ∫_{−∞}^{∞} g(x) (∫_{−∞}^{∞} f_{X,Y}(x, y) dy) dx = ∫_{−∞}^{∞} g(x) f_X(x) dx   (4.22)
To start with, consider an arbitrary linear function of the two variables 𝑔(𝑥, 𝑦) = 𝑎𝑥 + 𝑏𝑦, where 𝑎
and 𝑏 are constants. Then:
E[aX + bY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (ax + by) f_{X,Y}(x, y) dx dy
= a ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f_{X,Y}(x, y) dx dy + b ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f_{X,Y}(x, y) dx dy
= a E[X] + b E[Y]
The correlation of two random variables is defined as R_{X,Y} = E[XY]. Two random variables
which have a correlation of zero are said to be orthogonal.
One instance in which the correlation appears is in calculating the second moment of a sum of
two random variables. That is, consider finding the expected value of g(X, Y) = (X + Y)^2:
E[(X + Y)^2] = E[X^2 + 2XY + Y^2] = E[X^2] + E[Y^2] + 2E[XY]
Hence the second moment of the sum is the sum of the second moments plus twice the correlation.
Definition 4.8. Covariance: The covariance between two random variables is:
COV(X, Y) = E[(X − E[X])(Y − E[Y])] = ∬ (x − E[X])(y − E[Y]) f_{X,Y}(x, y) dx dy   (4.25)
If two random variables have a covariance of zero, they are said to be uncorrelated.
Theorem 4.2
The correlation and covariance are strongly related to one another as follows:
COV(X, Y) = E[XY] − E[X]E[Y]
Proof.
COV(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y] − E[Y]E[X] + E[X]E[Y] = E[XY] − E[X]E[Y]
As a result, if either X or Y (or both) has a mean of zero, correlation and covariance are equivalent.
The covariance appears when calculating the variance of a sum of two random variables:
VAR[X + Y] = VAR[X] + VAR[Y] + 2COV(X, Y)
This result can be obtained from Equation 4.24 by replacing 𝑋 with 𝑋 − 𝐸 [𝑋 ] and 𝑌 with 𝑌 − 𝐸 [𝑌 ].
Another statistical parameter related to a pair of random variables is the correlation coefficient,
which is nothing more than a normalized version of the covariance.
Definition 4.9. Correlation coefficient The correlation coefficient of two random variables 𝑋 and
𝑌 , 𝜌𝑋𝑌 , is defined as
ρ_XY = COV(X, Y) / √(VAR(X) VAR(Y)) = E[(X − E[X])(Y − E[Y])] / (σ_X σ_Y)   (4.28)
The correlation coefficient always satisfies |ρ_XY| ≤ 1. To see this, consider the nonnegative
quantity E[(X + aY)^2] = E[X^2] + 2aE[XY] + a^2 E[Y^2] ≥ 0. Since this is true for any a, we can
tighten the bound by choosing the value of a that minimizes the left-hand side. This value of a turns out to be
a = −E[XY] / E[Y^2]
Plugging in this value gives
E[X^2] + E[XY]^2/E[Y^2] − 2E[XY]^2/E[Y^2] ≥ 0   ⇒   E[XY]^2 ≤ E[X^2]E[Y^2]
If we replace X with X − E[X] and Y with Y − E[Y], the result is
(COV(X, Y))^2 ≤ VAR[X] VAR[Y]
so that
|ρ_XY| = |COV(X, Y)| / √(VAR[X] VAR[Y]) ≤ 1
Note that we can also infer from the proof that equality holds if 𝑌 is a constant times 𝑋 . That is, a
correlation coefficient of 1 (or −1) implies that 𝑋 and 𝑌 are completely correlated (knowing 𝑌
determines 𝑋 ). Furthermore, uncorrelated random variables will have a correlation coefficient
of zero. Therefore, as its name implies, the correlation coefficient is a quantitative measure of
the correlation between two random variables. It should be emphasized at this point that zero
correlation is not to be confused with independence. These two concepts are not the same.
Example 4.9
Consider once again the joint PDF of Example 4.7. Find 𝑅𝑋 ,𝑌 , 𝐶𝑂𝑉 (𝑋, 𝑌 ) and 𝜌𝑋 ,𝑌 .
Solution. The correlation is R_{X,Y} = E[XY] = ∬ xy f_{X,Y}(x, y) dx dy. In order to evaluate this
integral, the joint PDF is rewritten as f_{X,Y}(x, y) = f_{Y|X}(y|x) f_X(x) and then those terms
involving only x are pulled outside the inner integral over y.
E[XY] = ∫_{−∞}^{∞} x (1/√(2π)) exp(−x^2/2) [∫_{−∞}^{∞} y √(2/(3π)) exp(−(2/3)(y − x/2)^2) dy] dx
The inner integral (in square brackets) is the expected value of a Gaussian random variable
with a mean of 𝑥/2 and variance of 3/4 which thus evaluates to 𝑥/2. Hence,
E[XY] = (1/2) ∫_{−∞}^{∞} x^2 (1/√(2π)) exp(−x^2/2) dx
The remaining integral is the second moment of a Gaussian random variable with zero mean
and unit variance which integrates to 1. The correlation of these two random variables is
therefore 𝐸 [𝑋𝑌 ] = 1/2. Since both 𝑋 and 𝑌 have zero means, 𝐶𝑂𝑉 (𝑋, 𝑌 ) is also equal to
1/2. Finally, the correlation coefficient is also 𝜌𝑋𝑌 = 1/2 due to the fact that both 𝑋 and 𝑌
have unit variance.
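A Monte Carlo sketch of the same calculation: it draws the pair using the conditional structure found in Example 4.7 (X standard Gaussian and, given X = x, Y Gaussian with mean x/2 and variance 3/4) and estimates E[XY].

```python
# Monte Carlo check of Example 4.9: E[XY] should be about 1/2.
import math
import random

N = 1_000_000
xy_sum = 0.0
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y = random.gauss(x / 2, math.sqrt(3) / 2)   # conditional std dev = sqrt(3/4)
    xy_sum += x * y

print(xy_sum / N)    # approximately 1/2 = E[XY] = COV(X,Y) = rho_XY
```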
The concepts of correlation and covariance can be generalized to higher-order moments as given
in the following definition.
Definition 4.10. Joint moment: The (m, n)th joint moment of two random variables X and Y is:
E[X^m Y^n] = ∬ x^m y^n f_{X,Y}(x, y) dx dy   (4.29)
Definition 4.11. Joint central moment: The (m, n)th joint central moment of two random variables X and Y is:
E[(X − E[X])^m (Y − E[Y])^n] = ∬ (x − E[X])^m (y − E[Y])^n f_{X,Y}(x, y) dx dy   (4.30)
These higher-order joint moments are not frequently used. As with single random variables,
a conditional expected value can also be defined for which the expectation is carried out with
respect to the appropriate conditional density function.
Definition 4.12. Conditional expected value: The conditional expected value of a function g(X) of a random variable X given that Y = y is:
E[g(X)|Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x|y) dx   (4.31)
Conditional expected values can be particularly useful in calculating expected values of functions
of two random variables that can be factored into the product of two one-dimensional functions.
That is, consider a function of the form 𝑔(𝑥, 𝑦) = 𝑔1 (𝑥)𝑔2 (𝑦). Then:
E[g_1(X) g_2(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g_1(x) g_2(y) f_{X,Y}(x, y) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} g_1(x) g_2(y) f_X(x) f_{Y|X}(y|x) dx dy
= ∫_{−∞}^{∞} g_1(x) f_X(x) (∫_{−∞}^{∞} g_2(y) f_{Y|X}(y|x) dy) dx
= ∫_{−∞}^{∞} g_1(x) f_X(x) E_Y[g_2(Y)|X = x] dx
= E_X[g_1(X) E_Y[g_2(Y)|X]]
Here, the subscripts on the expectation operator have been included for clarity to emphasize
that the outer expectation is with respect to the random variable X, while the inner expectation
is with respect to the random variable Y (conditioned on X). This result allows us to break a
two-dimensional expectation into two one-dimensional expectations. This technique was used in
Example 4.9, where the correlation between two variables was essentially written as:
𝑅𝑋 ,𝑌 = 𝐸𝑋 [𝑋 𝐸𝑌 [𝑌 |𝑋 ]] (4.32)
In that example, the conditional PDF of Y given X was Gaussian, thus finding the conditional
mean was accomplished by inspection. The outer expectation then required finding the second
moment of a Gaussian random variable, which is also straightforward.
Hence, two random variables are statistically independent if their joint CDF factors into a product
of the marginal CDFs. Differentiating both sides of this equation with respect to both x and y
reveals that the same statement applies to the PDF as well. That is, for statistically independent
random variables, the joint PDF factors into a product of the marginal PDFs:
f_{X,Y}(x, y) = f_X(x) f_Y(y)   (4.34)
It is not difficult to show that the same statement applies to PMFs as well. The preceding condition
can also be restated in terms of conditional PDFs. Dividing both sides of Equation 4.34 by 𝑓𝑋 (𝑥)
results in
𝑓𝑌 |𝑋 (𝑦|𝑥) = 𝑓𝑌 (𝑦) (4.35)
A similar result involving the conditional PDF of X given Y could have been obtained by dividing
both sides by the PDF of Y. In other words, if X and Y are independent, knowing the value of the
random variable X should not change the distribution of Y and vice versa.
Example 4.10
Are the two random variables in Example 4.7 independent?
Solution. In Example 4.7 we found
f_X(x) = (1/√(2π)) exp(−x^2/2)
and
f_{X|Y}(x|y) = √(2/(3π)) exp(−(2/3)(x − y/2)^2)
Since these are not equal, the two random variables are not independent.
Example 4.11
Suppose the random variables X and Y are uniformly distributed on the square defined by
0 ≤ x, y ≤ 1. Are these two random variables independent?
Solution. From Example 4.2, f_{X,Y}(x, y) = 1 on the unit square, and each marginal PDF equals 1 on (0, 1);
hence f_{X,Y}(x, y) = f_X(x) f_Y(y) and the two random variables are independent.
Theorem 4.4
Let 𝑋 and 𝑌 be two independent random variables and consider forming two new random
variables 𝑈 = 𝑔1 (𝑋 ) and 𝑉 = 𝑔2 (𝑌 ). These new random variables 𝑈 and 𝑉 are also
independent
Another important result deals with the correlation, covariance, and correlation coefficients of
independent random variables.
Theorem 4.5
If X and Y are independent random variables, then E[XY] = E[X]E[Y], COV(X, Y) = 0, and ρ_XY = 0.
Proof.
E[XY] = ∬ xy f_{X,Y}(x, y) dx dy = ∬ xy f_X(x) f_Y(y) dx dy = ∫ x f_X(x) dx ∫ y f_Y(y) dy = E[X]E[Y]
The conditions involving covariance and correlation coefficient follow directly from this
result.
Therefore, independent random variables are necessarily uncorrelated, but the converse is not
always true. Uncorrelated random variables do not have to be independent as demonstrated by
the next example.
Example 4.12
Consider a pair of random variables X and Y that are uniformly distributed over the unit
circle so that:
f_{X,Y}(x, y) = 1/π for x^2 + y^2 ≤ 1, and 0 otherwise
The marginal PDF of 𝑋 can be found as follows:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = ∫_{−√(1−x^2)}^{√(1−x^2)} (1/π) dy = (2/π)√(1 − x^2),   −1 ≤ x ≤ 1
By symmetry, the marginal PDF of 𝑌 must take on the same functional form. Hence, the
product of the marginal PDFs is
f_X(x) f_Y(y) = (4/π^2) √((1 − x^2)(1 − y^2)),   −1 ≤ x, y ≤ 1
Clearly, this is not equal to the joint PDF, and therefore, the two random variables are
dependent. This conclusion could have been determined in a simpler manner. Note that
if we are told that 𝑋 = 1, then necessarily 𝑌 = 0, whereas if we know that 𝑋 = 0, then 𝑌
can range anywhere from -1 to 1. Therefore, conditioning on different values of 𝑋 leads to
different distributions for 𝑌 . Next, the correlation between 𝑋 and 𝑌 is calculated.
R_{X,Y} = E[XY] = ∬_{x^2+y^2≤1} (xy/π) dx dy = (1/π) ∫_{−1}^{1} x (∫_{−√(1−x^2)}^{√(1−x^2)} y dy) dx
Since the inner integrand is an odd function (of 𝑦) and the limits of integration are symmetric
about zero, the integral is zero. Hence, 𝑅𝑋 ,𝑌 = 0. Note from the marginal PDFs just found
that both 𝑋 and 𝑌 are zero-mean. So, it is seen for this example that while the two random
variables are uncorrelated, they are not independent.
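A short simulation of this example: points drawn uniformly from the unit disk have a sample correlation near zero, yet the spread of Y clearly depends on the value of X.

```python
# Uniform on the unit disk: uncorrelated (E[XY] ~ 0) but clearly dependent.
import random

pts = []
while len(pts) < 200_000:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:            # accept-reject sampling of the unit disk
        pts.append((x, y))

n = len(pts)
corr = sum(x * y for x, y in pts) / n          # estimate of E[XY] (means are zero)
spread_small_x = max(abs(y) for x, y in pts if abs(x) < 0.1)
spread_large_x = max(abs(y) for x, y in pts if abs(x) > 0.9)

print(corr)                            # close to 0: uncorrelated
print(spread_small_x, spread_large_x)  # ~1 vs ~0.44: the range of Y depends on X
```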
Two random variables X and Y are said to be jointly Gaussian if their joint PDF has the form
f_{X,Y}(x, y) = [1/(2πσ_X σ_Y √(1 − ρ_XY^2))] exp(− [((x − m_X)/σ_X)^2 − 2ρ_XY ((x − m_X)/σ_X)((y − m_Y)/σ_Y) + ((y − m_Y)/σ_Y)^2] / (2(1 − ρ_XY^2)))   (4.36)
where 𝑚𝑋 and 𝑚𝑌 are the means of 𝑋 and 𝑌 , respectively; 𝜎𝑋 and 𝜎𝑌 are the standard deviations of
𝑋 and 𝑌 , respectively; and 𝜌𝑋𝑌 is the correlation coefficient of 𝑋 and 𝑌 .
It can be shown that this joint PDF results in Gaussian marginal PDFs:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = (1/(√(2π)σ_X)) exp(−(x − m_X)^2/(2σ_X^2))   (4.37)

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = (1/(√(2π)σ_Y)) exp(−(y − m_Y)^2/(2σ_Y^2))   (4.38)
Furthermore, if X and Y are jointly Gaussian, then the conditional PDF of X given Y = y is also
Gaussian, with a mean of m_X + ρ_XY (σ_X/σ_Y)(y − m_Y) and a variance of σ_X^2 (1 − ρ_XY^2).
Figure 4.3 shows the joint Gaussian PDF for three different values of the correlation coefficient.
In Figure 4.3(a), the correlation coefficient is 𝜌𝑋𝑌 = 0 and thus the two random variables are
uncorrelated. Figure 4.3(b) shows the joint PDF when the correlation coefficient is large and
positive, 𝜌𝑋𝑌 = 0.9. Note how the surface has become taller and thinner and largely lies above
the line 𝑦 = 𝑥. In Figure 4.3(c), the correlation is now large and negative, 𝜌𝑋𝑌 = −0.9. Note that
this is the same picture as in Figure 4.3(b), except that it has been rotated by 90°. Now the surface
lies largely above the line 𝑦 = −𝑥. In all three figures, the means of both 𝑋 and 𝑌 are zero and
the variances of both 𝑋 and 𝑌 are 1. Changing the means would simply translate the surface but
would not change the shape. Changing the variances would expand or contract the surface along
either the 𝑋 − or 𝑌 −axis depending on which variance was changed.
Example 4.13
The joint Gaussian PDF is given by Equation 4.36. Consider the set of points (x, y) for which the quadratic form in the exponent takes on a constant value:
((x − m_X)/σ_X)^2 − 2ρ_XY ((x − m_X)/σ_X)((y − m_Y)/σ_Y) + ((y − m_Y)/σ_Y)^2 = c^2
This is the equation for an ellipse. Plotting these ellipses for different values of 𝑐 results in
what is known as a contour plot. Figure 4.4 shows such plots for the two-dimensional joint
Gaussian PDF.
Theorem 4.6
If two random variables X and Y are jointly Gaussian and uncorrelated, then they are independent.
Proof. Uncorrelated Gaussian random variables have a correlation coefficient of zero. Plug-
ging 𝜌𝑋𝑌 = 0 into the general joint Gaussian PDF results in
f_{X,Y}(x, y) = (1/(2πσ_X σ_Y)) exp(− [((x − m_X)/σ_X)^2 + ((y − m_Y)/σ_Y)^2] / 2)
This clearly factors into the product of the marginal Gaussian PDFs.
f_{X,Y}(x, y) = (1/(√(2π)σ_X)) exp(−(x − m_X)^2/(2σ_X^2)) · (1/(√(2π)σ_Y)) exp(−(y − m_Y)^2/(2σ_Y^2)) = f_X(x) f_Y(y)
Example 4.12 demonstrated that this property does not hold for all random variables; however,
it is true for Gaussian random variables. This allows us to give a stronger interpretation
to the correlation coefficient when dealing with Gaussian random variables. Previously, it was
stated that the correlation coefficient is a quantitative measure of the amount of correlation
between two variables. While this is true, it is a rather vague statement. We see that in the case
of Gaussian random variables, we can make the connection between correlation and statistical
dependence. Hence, for jointly Gaussian random variables, the correlation coefficient can indeed
be viewed as a quantitative measure of statistical dependence.
Let the outcome of a random experiment be an audio signal 𝑋 (𝑡). Let the random variable
𝑋𝑘 = 𝑋 (𝑘𝑇 ) be the sample of the signal taken at time 𝑘𝑇 . An MP3 codec processes the audio
in blocks of 𝑛 samples X= [𝑋 1, 𝑋 2, ..., 𝑋𝑛 ]𝑇 . X is a vector random variable.
Marginal CDFs can be found for a subset of the variables by evaluating the joint CDF at infinity
for the unwanted variables. For example, if we are only interested in a subset {𝑋 1, 𝑋 2, ..., 𝑋𝑀 } of
X= [𝑋 1, 𝑋 2, ..., 𝑋 𝑁 ]𝑇 , where 𝑁 ≥ 𝑀:
Marginal PDFs are found from the joint PDF by integrating out the unwanted variables. Similarly,
marginal PMFs are obtained from the joint PMF by summing out the unwanted variables.
f_{X_1,...,X_M}(x_1, ..., x_M) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f_{X_1,...,X_N}(x_1, ..., x_N) dx_{M+1} dx_{M+2} ... dx_N   (4.49)

P_{X_1,...,X_M}(x_1, ..., x_M) = Σ_{x_{M+1}} Σ_{x_{M+2}} ... Σ_{x_N} P_{X_1,...,X_N}(x_1, ..., x_N)   (4.50)
Similar to that done for pairs of random variables, we can also establish conditional PMFs and
PDFs.
Definition 4.16. For a set of 𝑁 random variables 𝑋 1, 𝑋 2, ..., 𝑋 𝑁 , the conditional PMF and PDF of
𝑋 1, 𝑋 2, ..., 𝑋𝑀 conditioned on 𝑋𝑀+1, 𝑋𝑀+2, ..., 𝑋 𝑁 are given by
P_{X_1,...,X_M | X_{M+1},...,X_N}(x_1, ..., x_M | x_{M+1}, ..., x_N) = P(X_1 = x_1, ..., X_N = x_N) / P(X_{M+1} = x_{M+1}, ..., X_N = x_N)   (4.51)

f_{X_1,...,X_M | X_{M+1},...,X_N}(x_1, ..., x_M | x_{M+1}, ..., x_N) = f_{X_1,...,X_N}(x_1, ..., x_N) / f_{X_{M+1},...,X_N}(x_{M+1}, ..., x_N)   (4.52)
Using conditional PDFs, many interesting factorization results can be established for joint PDFs
involving multiple random variables. For example, consider four random variables, 𝑋 1, 𝑋 2, 𝑋 3, 𝑋 4 .
As an example, consider three random variables, 𝑋 , 𝑌 , 𝑍 . For these three random variables to be
independent, we must have each pair independent. This implies that:
In addition, the joint PDF of all three must also factor into a product of the marginals,
Note that all three conditions in Equation 4.53 follow directly from the single condition in
Equation 4.54. Hence, Equation 4.54 is a necessary and sufficient condition for three variables to
be statistically independent. Naturally, this result can be extended to any number of variables.
That is, the elements of a random vector X= [𝑋 1, 𝑋 2, ..., 𝑋 𝑁 ]𝑇 are independent if
f_X(x) = Π_{n=1}^{N} f_{X_n}(x_n)   (4.55)
Theorem 4.7
Correlation matrices and covariance matrices are symmetric and positive definite.
Proof. Recall that a square matrix R_XX is symmetric if R_XX = R_XX^T. Equivalently, the
(i, j)th element must be the same as the (j, i)th element. This is clearly the case here since
E[X_i X_j] = E[X_j X_i]. Recall that the matrix is positive definite if z^T R_XX z > 0 for any vector
z such that ||z|| > 0. Writing out this quadratic form,
z^T R_XX z = z^T E[X X^T] z = E[z^T X X^T z] = E[(z^T X)^2]
Note that z^T X is a scalar random variable (a linear combination of the components of X).
Since the second moment of any random variable is positive (except for the pathological
case of a random variable which is identically equal to zero), the correlation matrix is
positive definite. As an aside, this also implies that the eigenvalues of the correlation matrix
are all positive. Identical steps can be followed to prove the same properties hold for the
covariance matrix.
Next, consider a linear transformation of a vector random variable. That is, create a new set of 𝑀
random variables, Y = [𝑌1, 𝑌2, ..., 𝑌𝑀 ]𝑇 , according to:
The number of new variables, M, does not have to be the same as the number of original variables,
N. To write this type of linear transformation in a compact fashion, define a matrix A whose
(𝑖, 𝑗)𝑡ℎ element is the coefficient 𝑎𝑖,𝑗 and a column vector, b= [𝑏 1, 𝑏 2, ..., 𝑏 𝑀 ]𝑇 . Then the linear
transformation of Equation 4.57 is written in vector/matrix form as Y = AX + b. The next theorem
describes the relationship between the means of X and Y and the correlation matrices of X and Y.
Theorem 4.8
For a linear transformation of vector random variables of the form Y = AX + b, the means
of X and Y are related by
m_Y = A m_X + b   (4.58)
Also, the correlation matrices of X and Y are related by:
R_YY = A R_XX A^T + A m_X b^T + b m_X^T A^T + b b^T
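The mean relation of Theorem 4.8 (and the covariance counterpart C_YY = A C_XX A^T, which follows from it by centring) can be checked numerically. The sketch below assumes NumPy is available; the particular m_X, C_XX, A and b are arbitrary illustration values.

```python
# Numerical sketch of Theorem 4.8: mean and covariance of Y = AX + b.
import numpy as np

rng = np.random.default_rng(0)
m_X = np.array([1.0, -2.0])
C_XX = np.array([[2.0, 0.6],
                 [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0],
              [1.0, -1.0]])          # M = 3 new variables from N = 2 originals
b = np.array([0.5, 0.0, -1.0])

X = rng.multivariate_normal(m_X, C_XX, size=200_000)   # rows are samples of X
Y = X @ A.T + b                                        # Y = AX + b, sample by sample

print(Y.mean(axis=0), A @ m_X + b)                     # sample mean of Y vs A m_X + b
print(np.cov(Y, rowvar=False))                         # sample covariance of Y
print(A @ C_XX @ A.T)                                  # theoretical A C_XX A^T
```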
Example 4.15
To demonstrate the use of this matrix notation, suppose X is a two-element vector and the
mean vector and covariance matrix are given by their general forms:
m_X = [m_1, m_2]^T
and
C_XX = [[σ_1^2, ρσ_1σ_2], [ρσ_1σ_2, σ_2^2]]
The determinant of the covariance matrix is det(C_XX) = σ_1^2 σ_2^2 (1 − ρ^2), and its inverse is
C_XX^{-1} = (1/(1 − ρ^2)) [[σ_1^{-2}, −ρσ_1^{-1}σ_2^{-1}], [−ρσ_1^{-1}σ_2^{-1}, σ_2^{-2}]]
The quadratic form in the exponent is then
(X − m_X)^T C_XX^{-1} (X − m_X) = [((x_1 − m_1)/σ_1)^2 − 2ρ((x_1 − m_1)/σ_1)((x_2 − m_2)/σ_2) + ((x_2 − m_2)/σ_2)^2] / (1 − ρ^2)
Plugging all these results into the general form for the joint Gaussian PDF gives
f_{X_1,X_2}(x_1, x_2) = [1/√((2π)^2 σ_1^2 σ_2^2 (1 − ρ^2))] exp(− [((x_1 − m_1)/σ_1)^2 − 2ρ((x_1 − m_1)/σ_1)((x_2 − m_2)/σ_2) + ((x_2 − m_2)/σ_2)^2] / (2(1 − ρ^2)))   (4.66)
This is exactly the form of the two-dimensional joint Gaussian PDF defined in Equation 4.36.
Example 4.16
Suppose 𝑋 1, 𝑋 2, ..., 𝑋𝑛 are jointly Gaussian random variables with 𝐶𝑂𝑉 (𝑋𝑖 , 𝑋 𝑗 ) = 0 for 𝑖 ≠ 𝑗.
Show that 𝑋 1, 𝑋 2, ..., 𝑋𝑛 are independent random variables.
Solution. Since COV(X_i, X_j) = 0 for all i ≠ j, all of the off-diagonal elements of the
covariance matrix of X are zero. In other words, C_XX is a diagonal matrix of the general
form:
C_XX = diag(σ_1^2, σ_2^2, ..., σ_N^2)
The determinant of a diagonal matrix is the product of the diagonal entries, so that in this
case det(C_XX) = σ_1^2 σ_2^2 ... σ_N^2. The inverse is also trivial to compute and takes on the form
C_XX^{-1} = diag(σ_1^{-2}, σ_2^{-2}, ..., σ_N^{-2})
The quadratic form that appears in the exponent of the Gaussian PDF becomes
(X − m_X)^T C_XX^{-1} (X − m_X) = Σ_{n=1}^{N} ((x_n − m_n)/σ_n)^2
The joint Gaussian PDF for a vector of uncorrelated random variables is then
f_X(x) = [1/√((2π)^N σ_1^2 σ_2^2 ... σ_N^2)] exp(−(1/2) Σ_{n=1}^{N} ((x_n − m_n)/σ_n)^2) = Π_{n=1}^{N} (1/(√(2π)σ_n)) exp(−(x_n − m_n)^2/(2σ_n^2))
This shows that for any number of uncorrelated Gaussian random variables, the joint
PDF factors into the product of marginal PDFs and hence uncorrelated Gaussian random
variables are independent. This is a generalization of the same result for two Gaussian
random variables.
Further Reading
1. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, Elsevier 2012: sections 5.1 to 5.7 and 6.1 to 6.3
2. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: sections 5.1 to 5.9 and 6.1 to 6.4
3. Charles W. Therrien, Probability for electrical and computer engineers, CRC Press, 2004:
chapter 5
5 Random Sums and Sequences
Many problems involve the counting of the number of occurrences of events, the measurement
of cumulative effects, or the computation of arithmetic averages in a series of measurements.
Usually these problems can be reduced to the problem of finding, exactly or approximately, the
distribution of a random variable that consists of the sum of 𝑛 independent, identically distributed
random variables. In this chapter, we investigate sums of random variables and their properties
as 𝑛 becomes large.
For continuous random variables, the CDFs can be replaced with PDFs in Equations 5.1 and 5.2, while
for discrete random variables, the CDFs can be replaced by PMFs.
Suppose, for example, we wish to measure the voltage produced by a certain sensor. The sensor
might be measuring the relative humidity outside. Our sensor converts the humidity to a voltage
level which we can then easily measure. However, as with any measuring equipment, the voltage
we measure is random due to noise generated in the sensor as well as in the measuring equipment.
Suppose the voltage we measure is represented by a random variable 𝑋 given by 𝑋 = 𝑣 (ℎ) + 𝑁 ,
where 𝑣 (ℎ) is the true voltage that should be presented by the sensor when the humidity is ℎ,
and 𝑁 is the noise in the measurement. Assuming that the noise is zero-mean, then 𝐸 [𝑋 ] = 𝑣 (ℎ).
That is, on the average, the measurement will be equal to the true voltage 𝑣 (ℎ). Furthermore,
if the variance of the noise is sufficiently small, then the measurement will tend to be close to
the true value we are trying to measure. But what if the variance is not small? Then the noise
will tend to distort our measurement, making our system unreliable. In such a case, we might be
able to improve our measurement system by taking several measurements. This will allow us to
“average out” the effects of the noise.
Suppose we have the ability to make several measurements and observe a sequence of mea-
surements 𝑋 1, 𝑋 2, ..., 𝑋𝑛 . It might be reasonable to expect that the noise that corrupts a given
measurement has the same distribution each time (and hence the 𝑋𝑖 are identically distributed)
and is independent of the noise in any other measurement (so that the 𝑋𝑖 are independent). Then
the 𝑛 measurements form a sequence of IID random variables. A fundamental question is then:
How do we process an IID sequence to extract the desired information from it? In the preceding
case, the parameter of interest, 𝑣 (ℎ), happens to be the mean of the distribution of the 𝑋𝑖 . This
turns out to be a fairly common problem and so we address that in the following sections.
Consider the sum of n random variables
S_n = X_1 + X_2 + ... + X_n
It was shown in section 3.2.3 that, regardless of the statistical dependence of the X_i, the expected value
of a sum of n random variables is equal to the sum of the expected values:
E[S_n] = E[X_1 + X_2 + ... + X_n] = E[X_1] + E[X_2] + ... + E[X_n]
Thus knowledge of the means of the 𝑋𝑖 s suffices to find the mean of 𝑆𝑛 . The following example
shows that in order to compute the variance of a sum of random variables, we need to know the
variances and covariances of the 𝑋𝑖 s.
Example 5.1
Find the variance of Z = X + Y.
Solution.
VAR[Z] = E[(Z − E[Z])^2] = E[(X + Y − E[X] − E[Y])^2]
= 𝐸 [((𝑋 − 𝐸 [𝑋 ]) + (𝑌 − 𝐸 [𝑌 ])) 2 ]
= 𝐸 [(𝑋 − 𝐸 [𝑋 ]) 2 + (𝑌 − 𝐸 [𝑌 ]) 2 + (𝑋 − 𝐸 [𝑋 ])(𝑌 − 𝐸 [𝑌 ]) + (𝑌 − 𝐸 [𝑌 ])(𝑋 − 𝐸 [𝑋 ])]
= 𝑉 𝐴𝑅 [𝑋 ] + 𝑉 𝐴𝑅 [𝑌 ] + 𝐶𝑂𝑉 (𝑋, 𝑌 ) + 𝐶𝑂𝑉 (𝑌 , 𝑋 )
= 𝑉 𝐴𝑅 [𝑋 ] + 𝑉 𝐴𝑅 [𝑌 ] + 2𝐶𝑂𝑉 (𝑋, 𝑌 )
In general, the covariance 𝐶𝑂𝑉 (𝑋, 𝑌 ) is not equal to zero, so the variance of a sum is not
necessarily equal to the sum of the individual variances.
The result in Example 5.1 can be generalized to the case of 𝑛 random variables:
VAR[X_1 + X_2 + ... + X_n] = E[Σ_{j=1}^{n} (X_j − E[X_j]) Σ_{k=1}^{n} (X_k − E[X_k])]
= Σ_{j=1}^{n} Σ_{k=1}^{n} E[(X_j − E[X_j])(X_k − E[X_k])]
= Σ_{k=1}^{n} VAR[X_k] + Σ_{j=1}^{n} Σ_{k=1, k≠j}^{n} COV(X_j, X_k)   (5.3)
Thus in general, the variance of a sum of random variables is not equal to the sum of the individual
variances.
An important special case is when the 𝑋 𝑗 s are independent random variables. If 𝑋 1, 𝑋 2, ..., 𝑋𝑛 are
independent random variables, then 𝐶𝑂𝑉 (𝑋 𝑗 , 𝑋𝑘 ) = 0 for 𝑗 ≠ 𝑘 and:
VAR[X_1 + X_2 + ... + X_n] = Σ_{k=1}^{n} VAR[X_k]   (5.4)
Now suppose X_1, X_2, ..., X_n are n IID random variables, each with mean m and variance σ^2.
Then the sum S_n has mean E[S_n] = nm and, by Equation 5.4, variance
VAR[S_n] = nσ^2   (5.6)
The sample mean of the sequence is defined as M_n = S_n/n = (1/n) Σ_{j=1}^{n} X_j, and the sample variance as
σ̂_n^2 = (1/n) Σ_{j=1}^{n} (X_j − M_n)^2   (5.8)
The sample mean is itself a random variable, so it will exhibit random variation. Our aim is to
verify if 𝑀𝑛 can be a good estimator of 𝐸 [𝑋 ] = 𝑚. A good estimator is expected to have the
following two properties:
1. On the average, it should give the correct expected value (with no bias): 𝐸 [𝑀𝑛 ] = 𝑚
2. It should not vary too much about the correct value of this parameter, that is, 𝐸 [(𝑀𝑛 − 𝑚) 2 ]
(variance) is small.
The expected value of the sample mean is given by:
E[M_n] = E[(1/n) Σ_{j=1}^{n} X_j] = (1/n) Σ_{j=1}^{n} E[X_j] = m   (5.9)
since 𝐸 [𝑋 𝑗 ] = 𝐸 [𝑋 ] = 𝑚 for all 𝑗. Thus the sample mean is equal to 𝐸 [𝑋 ] = 𝑚 on the average.
For this reason, we say that the sample mean is an unbiased estimator for 𝑚.
The mean square error of the sample mean about m is equal to the variance of M_n, that is,
E[(M_n − m)^2] = E[(M_n − E[M_n])^2] = VAR[M_n]
Note that 𝑀𝑛 = 𝑆𝑛 /𝑛 where 𝑆𝑛 = 𝑋 1 + 𝑋 2 + ... + 𝑋𝑛 . From Equation 5.6, 𝑉 𝐴𝑅 [𝑆𝑛 ] = 𝑛𝜎 2 , since the
𝑋 𝑗 s are IID random variables. Thus
VAR[M_n] = (1/n^2) VAR[S_n] = nσ^2/n^2 = σ^2/n   (5.11)
Therefore the variance of the sample mean approaches zero as the number of samples is increased.
This implies that the probability that the sample mean is close to the true mean approaches one as
𝑛 becomes very large. We can formalize this statement by using the Chebyshev inequality from
Equation 3.127:
P(|M_n − E[M_n]| ≥ ε) ≤ VAR[M_n]/ε^2   (5.12)
Substituting for E[M_n] and VAR[M_n], we obtain
P(|M_n − m| ≥ ε) ≤ σ^2/(nε^2)   (5.13)
If we consider the complement, we obtain
P(|M_n − m| < ε) ≥ 1 − σ^2/(nε^2)   (5.14)
Thus for any choice of error 𝜀 and probability 1 − 𝛿, we can select the number of samples 𝑛 so
that 𝑀𝑛 is within 𝜀 of the true mean with probability 1 − 𝛿 or greater. The following example
illustrates this.
Example 5.2
A voltage of constant but unknown value v is to be measured. Each measurement X_j is the sum of v
and a zero-mean noise voltage with variance 1 μV^2. How many measurements are required so that the
sample mean M_n is within ε = 1 μV of the true mean with probability at least 0.99?
Solution. Each measurement X_j has mean v and variance 1, so from Equation 5.14 we require
that n satisfy:
1 − σ^2/(nε^2) = 1 − 1/n = 0.99
This implies that 𝑛 = 100. Thus if we were to repeat the measurement 100 times and compute
the sample mean, on the average, at least 99 times out of 100, the resulting sample mean
will be within 1𝜇𝑉 of the true mean.
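A simulation sketch of this example. The Chebyshev argument uses only the mean and variance, so the noise distribution below (zero-mean Gaussian with variance 1) is merely an assumption made for the simulation; the true voltage v = 5 is likewise arbitrary.

```python
# Simulation of Example 5.2: fraction of trials where the 100-sample mean is within 1 of v.
import random
from statistics import mean

v = 5.0          # the true (unknown) voltage
n = 100          # number of measurements, as found in the example
trials = 20_000

within = 0
for _ in range(trials):
    M_n = mean(random.gauss(v, 1.0) for _ in range(n))
    if abs(M_n - v) < 1.0:
        within += 1

print(within / trials)   # essentially 1.0, comfortably above the guaranteed 0.99
```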
Equation 5.14 requires that the X_j have finite variance. It can be shown, however, that the weak law
of large numbers below holds even if the variance of the X_j does not exist.
Theorem 5.1: Weak Law of Large Numbers
Let X_1, X_2, ... be a sequence of IID random variables with finite mean E[X] = m. Then for ε > 0,
lim_{n→∞} P(|M_n − m| < ε) = 1   (5.16)
The weak law of large numbers states that for a large enough fixed value of 𝑛, the sample mean
using 𝑛 samples will be close to the true mean with high probability. The weak law of large
numbers does not address the question about what happens to the sample mean as a function
of 𝑛 as we make additional measurements. This question is taken up by the strong law of large
numbers.
Suppose we make a series of independent measurements of the same random variable. Let
𝑋 1, 𝑋 2, ... be the resulting sequence of IID random variables with mean 𝑚. Now consider the
sequence of sample means that results from the above measurements: 𝑀1, 𝑀2, ... where 𝑀 𝑗 is
the sample mean computed using 𝑋 1 through 𝑋 𝑗 . We expect that with high probability, each
particular sequence of sample means approaches 𝑚 and stays there:
𝑃 ( lim 𝑀𝑛 = 𝑚) = 1 (5.17)
𝑛→∞
that is, with virtual certainty, every sequence of sample mean calculations converges to the true
mean of the quantity (The proof of this result is beyond the level of this unit).
Theorem 5.2: Strong Law of Large Numbers
Let X_1, X_2, ... be a sequence of IID random variables with finite mean E[X] = m and finite
variance. Then
P\Big( \lim_{n\to\infty} M_n = m \Big) = 1        (5.18)
Equation 5.18 appears similar to Equation 5.16, but in fact it makes a dramatically different
statement. It states that with probability 1, every sequence of sample mean calculations will
eventually approach and stay close to 𝐸 [𝑋 ] = 𝑚. This is the type of convergence we expect in
physical situations where statistical regularity holds.
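As an informal illustration of this mode of convergence, the sketch below (not from the notes) prints a few points of three independent running sample-mean trajectories for exponential samples with mean m = 2; each trajectory is seen to settle near m and stay there. The distribution, seed, and sample sizes are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    m = 2.0             # true mean of the IID sequence (arbitrary)
    n_max = 100_000

    for seq in range(3):                                  # three independent trajectories
        x = rng.exponential(scale=m, size=n_max)          # IID samples with E[X] = m
        running_mean = np.cumsum(x) / np.arange(1, n_max + 1)
        snapshots = {k: running_mean[k - 1] for k in (10, 1_000, 100_000)}
        print(f"trajectory {seq}:",
              ", ".join(f"M_{k} = {val:.3f}" for k, val in snapshots.items()))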
Although under certain conditions, the theory predicts the convergence of sample means to
expected values, there are still gaps between the mathematical theory and the real world (i.e., we
can never actually carry out an infinite number of measurements and compute an infinite number
of sample means). Nevertheless, the strong law of large numbers demonstrates the remarkable
consistency between the theory and the observed physical behavior.
Note that the relative frequencies discussed in previous chapters are special cases of sample averages.
If we apply the weak law of large numbers to the relative frequency of an event 𝐴, 𝑓𝐴 (𝑛), in a
sequence of independent repetitions of a random experiment, we obtain
\lim_{n\to\infty} P(|f_A(n) - P(A)| < \varepsilon) = 1        (5.19)
Example 5.3
Suppose the probability p = P(A) of an event A is to be estimated by its relative frequency f_A(n) in n independent repetitions of a random experiment. How many repetitions n are needed so that f_A(n) is within ε = 0.01 of p with probability 0.95 or greater?
Solution. Let X = I_A be the indicator function of A. From Equations 3.45 and 3.46 we have
that the mean of I_A is m = p and the variance is σ² = p(1 − p). Since p is unknown, σ² is also
unknown. However, it is easy to show that p(1 − p) is at most 1/4 for 0 ≤ p ≤ 1. Therefore,
by Equation 5.13,
P(|f_A(n) - p| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2} \le \frac{1}{4n\varepsilon^2}
The desired accuracy is ε = 0.01 and the desired probability is 0.95, so we require

1 - 0.95 = \frac{1}{4n\varepsilon^2}
We then solve this for n and obtain n = 50,000. It has already been pointed out that the
Chebyshev inequality gives very loose bounds, so we expect that this value for 𝑛 is probably
overly conservative. In the next section, we present a better estimate for the required value
of 𝑛.
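A quick simulation (again, a sketch rather than part of the notes) suggests how conservative this value is: with n = 50,000 and an arbitrary true probability p = 0.3, the relative frequency falls within 0.01 of p in essentially every trial, not just 95% of them.

    import numpy as np

    rng = np.random.default_rng(2)
    p = 0.3          # true probability of A (unknown in practice; arbitrary here)
    n = 50_000       # sample size suggested by the Chebyshev bound
    trials = 2_000

    # Relative frequency of A in each trial of n independent repetitions.
    f_A = rng.binomial(n, p, size=trials) / n
    coverage = np.mean(np.abs(f_A - p) < 0.01)
    print(f"P(|f_A(n) - p| < 0.01) ~= {coverage:.4f}   (bound only guarantees >= 0.95)")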
The central limit theorem concerns the sum of a large number of IID random variables X_1, X_2, ..., each with mean m and variance σ². Define the zero-mean, unit-variance random variable

Z = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \frac{X_j - m}{\sigma}        (5.21)
Note that 𝑍 has been constructed such that 𝐸 [𝑍 ] = 0 and 𝑉 𝐴𝑅 [𝑍 ] = 1. In the limit as 𝑛 approaches
infinity, the random variable 𝑍 converges in distribution to a standard Gaussian random variable.
Several remarks about this theorem are in order at this point. First, no restrictions are placed on
the distribution of the X_j s: the theorem applies to the sum of IID random variables with any
distribution (provided the mean and variance are finite).
From a practical standpoint, the central limit theorem implies that for the sum of a sufficiently
large (but finite) number of random variables, the sum is approximately Gaussian distributed. Of
course, the goodness of this approximation depends on how many terms are in the sum and also
the distribution of the individual terms in the sum.
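The comparison can be reproduced numerically; the sketch below (an illustration, not the source of the figures) compares the empirical CDF of a sum of five Uniform(0,1) random variables with the Gaussian CDF of the same mean and variance at a few points, assuming numpy and scipy are available.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n_terms = 5                          # number of IID terms in the sum
    trials = 200_000

    # Sum of n IID Uniform(0,1) variables; its mean is n/2 and its variance n/12.
    s = rng.uniform(0.0, 1.0, size=(trials, n_terms)).sum(axis=1)
    mean, std = n_terms / 2, np.sqrt(n_terms / 12)

    for x in (1.5, 2.5, 3.5):
        empirical = np.mean(s <= x)                  # empirical CDF of the sum
        gaussian = norm.cdf(x, loc=mean, scale=std)  # Gaussian approximation
        print(f"x = {x}:  F_S(x) ~= {empirical:.4f},  Gaussian approx = {gaussian:.4f}")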
Figures 5.1 to 5.3 compare the exact CDF and the Gaussian approximation for the sums of Bernoulli,
uniform, and exponential random variables, respectively. In all three cases, it can be seen that the
approximation improves as the number of terms in the sum increases.
Figure 5.1: (a) The CDF of the sum of five independent Bernoulli random variables with 𝑝 = 1/2
and the CDF of a Gaussian random variable of the same mean and variance. (b) The
CDF of the sum of 25 independent Bernoulli random variables with 𝑝 = 1/2 and the
CDF of a Gaussian random variable of the same mean and variance.
Figure 5.2: The CDF of the sum of five independent discrete, uniform random variables from the
set {0, 1, 2, ..., 9} and the CDF of a Gaussian random variable of the same mean and
variance.
Figure 5.3: (a) The CDF of the sum of five independent exponential random variables of mean 1
and the CDF of a Gaussian random variable of the same mean and variance. (b) The
CDF of the sum of 50 independent exponential random variables of mean 1 and the
CDF of a Gaussian random variable of the same mean and variance.

The central limit theorem guarantees that the sum converges in "distribution" to Gaussian, but
this does not necessarily imply convergence in "density". As a counter-example, suppose that the
X_j s are discrete random variables; then the sum must also be a discrete random variable. Strictly
speaking, the density of 𝑍 would then not exist, and it would not be meaningful to say that the
density of 𝑍 is Gaussian. From a practical standpoint, the probability density of 𝑍 would be a
series of impulses. While the envelope of these impulses would have a Gaussian shape to it, the
density is clearly not Gaussian. If the 𝑋 𝑗 s are continuous random variables, the convergence in
density generally occurs as well.
The IID assumption is not needed in many cases; the central limit theorem also applies to
independent random variables that are not necessarily identically distributed. Loosely speaking,
all that is required is that no single term (or small group of terms) dominates the sum; the sum
of such independent random variables then approaches a Gaussian distribution as the number
of terms grows. The central limit theorem
also applies to some cases of dependent random variables, but we will not consider such cases here.
Example 5.4
The times between events in a certain random experiment are IID exponential random variables
with mean m seconds. Find the probability that the 1000th event occurs in the time interval
(1000 ± 50)m.
Solution. Let X_j be the time between events and let S_n be the time of the nth event, so that
S_n = X_1 + X_2 + ... + X_n. The mean and variance of the exponential random variable X_j are
E[X_j] = m and VAR[X_j] = m². The mean and variance of S_n are then E[S_n] = nE[X_j] = nm
and VAR[S_n] = nVAR[X_j] = nm². The central limit theorem
then gives
P(950m \le S_{1000} \le 1050m) = P\Big( \frac{950m - 1000m}{m\sqrt{1000}} \le Z_n \le \frac{1050m - 1000m}{m\sqrt{1000}} \Big)
\approx Q(-1.58) - Q(1.58) = 1 - 2Q(1.58) = 0.8866
Thus as 𝑛 becomes large, 𝑆𝑛 is very likely to be close to its mean 𝑛𝑚. We can therefore
conjecture that the long-term average rate at which events occur is
\frac{n \text{ events}}{S_n \text{ seconds}} = \frac{n}{nm} = \frac{1}{m} \text{ events/second}
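The numbers in this example can be checked with a brief simulation; the sketch below takes m = 1 second and 10,000 trials (both arbitrary choices) and compares the empirical probability with the central-limit-theorem approximation 1 − 2Q(50/√1000), using norm.sf for the Q-function.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    m, n = 1.0, 1000                 # mean inter-event time; index of the event of interest
    trials = 10_000

    # Time of the 1000th event in each trial: the sum of 1000 IID exponentials.
    s_n = rng.exponential(scale=m, size=(trials, n)).sum(axis=1)
    empirical = np.mean((950 * m <= s_n) & (s_n <= 1050 * m))

    z = 50 / np.sqrt(1000)           # the normalized half-width 50m / (m * sqrt(1000))
    approx = 1 - 2 * norm.sf(z)      # Q(x) = norm.sf(x)
    print(f"empirical = {empirical:.4f},   CLT approximation = {approx:.4f}")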
Example 5.5
Let ζ be selected at random from the interval S = [0, 1], where we assume that the probability
that ζ is in a sub-interval of S is equal to the length of the sub-interval. For n = 1, 2, ... we
define the sequence of random variables:

V_n(\zeta) = \zeta\Big(1 - \frac{1}{n}\Big)
The two ways of looking at sequences of random variables are evident here. First, we can
view V_n(ζ) as a sequence of functions of ζ, as shown in Figure 5.4(a). Alternatively, we can
imagine that we first perform the random experiment that yields ζ and then observe
the corresponding sequence of real numbers V_n(ζ), as shown in Figure 5.4(b).
Figure 5.4: Two ways of looking at sequences of random variables: (a) Sequence of random
variables as a sequence of functions of 𝜁 , (b) Sequence of random variables as a
sequence of real numbers determined by 𝜁
The standard methods from calculus can be used to determine the convergence of the sample
sequence for each point 𝜁 . Intuitively, we say that the sequence of real numbers 𝑥𝑛 converges
to the real number 𝑥 if the difference |𝑥𝑛 − 𝑥 | approaches zero as 𝑛 approaches infinity. More
formally, we say that:
The sequence 𝑥𝑛 converges to 𝑥 if, given any 𝜀 > 0, we can specify an integer 𝑁 such that for all
values of 𝑛 beyond 𝑁 we can guarantee that |𝑥𝑛 − 𝑥 | < 𝜀
Thus if a sequence converges, then for any 𝜀 we can find an 𝑁 so that the sequence remains inside
a 2𝜀 corridor about 𝑥, as shown in Figure 5.5.
If we make 𝜀 smaller, 𝑁 becomes larger. Hence we arrive at our intuitive view that 𝑥𝑛 becomes
closer and closer to x. If the limiting value x is not known, we can still determine whether a
sequence converges by using the Cauchy criterion:
The sequence x_n converges if and only if, given any ε > 0, we can specify an integer N′ such that
for all m and n greater than N′, |x_n − x_m| < ε.
The Cauchy criterion states that the maximum variation in the sequence for points beyond N′ is
less than ε.
Example 5.6
Let 𝑉𝑛 (𝜁 ) be the sequence of random variables from Example 5.5. Does the sequence of real
numbers corresponding to a fixed 𝜁 converge?
Solution. From Figure 5.4(a), we expect that for a fixed value 𝜁 , 𝑉𝑛 (𝜁 ) will converge to the
limit 𝜁 . Therefore, we consider the difference between the 𝑛th number in the sequence and
the limit:
|V_n(\zeta) - \zeta| = \Big| \zeta\Big(1 - \frac{1}{n}\Big) - \zeta \Big| = \Big| \frac{\zeta}{n} \Big| < \frac{1}{n}
where the last inequality follows from the fact that 𝜁 is always less than one. In order to
keep the above difference less than 𝜀 we choose 𝑛 so that
|V_n(\zeta) - \zeta| < \frac{1}{n} < \varepsilon

that is, we select n > N = 1/ε. Thus the sequence of real numbers V_n(ζ) converges to ζ.
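A two-line numerical check of this argument, with an arbitrary ζ = 0.7 and ε = 10⁻³ (a sketch only):

    # Once n >= N = 1/eps, the difference |V_n(zeta) - zeta| = zeta/n stays below eps.
    zeta, eps = 0.7, 1e-3
    N = int(1 / eps)
    for n in (N, 2 * N, 10 * N):
        v_n = zeta * (1 - 1 / n)
        print(f"n = {n}: |V_n - zeta| = {abs(v_n - zeta):.2e} < {eps}")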
When we talk about the convergence of sequences of random variables, we are concerned with
questions such as: Do all (or almost all) sample sequences converge, and if so, do they all converge
to the same values or to different values? The first two definitions of convergence address these
questions.
Example 5.7
Let 𝑋 be a random variable uniformly distributed over [0, 1). Then define the random
sequence
X_n = \frac{X}{1 + n^2}, \qquad n = 1, 2, 3, ...
In this case, for any realization 𝑋 = 𝑥, a sequence is produced of the form:
x_n = \frac{x}{1 + n^2}
which converges to lim𝑛→∞ 𝑥𝑛 = 0. We say that the sequence converges surely to lim𝑛→∞ 𝑋𝑛 =
0.
Sure convergence requires that the sample sequence corresponding to every 𝜁 converges. Note
that it does not require that all the sample sequences converge to the same values; that is, the
sample sequences for different points 𝜁 and 𝜁 0 can converge to different values.
Example 5.8
Let 𝑋 be a random variable uniformly distributed over [0, 1). Then define the random
sequence
X_n = \frac{n^2 X}{1 + n^2}, \qquad n = 1, 2, 3, ...

In this case, for any realization X = x, a sequence is produced of the form:

x_n = \frac{n^2 x}{1 + n^2}
which converges to lim𝑛→∞ 𝑥𝑛 = 𝑥. We say that the sequence converges surely to a random
variable lim𝑛→∞ 𝑋𝑛 = 𝑋 . In this case, the value that the sequence converges to depends on
the particular realization of the random variable 𝑋 .
Almost-sure convergence: the sequence of random variables X_n(ζ) is said to converge almost surely to the random variable X(ζ) if

P(\zeta : X_n(\zeta) \to X(\zeta) \text{ as } n \to \infty) = 1
In Figure 5.6 we illustrate almost-sure convergence for the case where sample sequences converge
to the same value 𝑥; we see that almost all sequences must eventually enter and remain inside a
2𝜀 corridor. In almost-sure convergence some of the sample sequences may not converge, but
these must all belong to 𝜁 s that are in a set that has probability zero.
The strong law of large numbers is an example of almost-sure convergence. Note that sure
convergence implies almost-sure convergence.
Example 5.9
As an example of a sequence that converges almost surely, consider the random sequence
X_n = \frac{\sin(n\pi X)}{n\pi X}

where X is a random variable uniformly distributed over [0, 1). For almost every realization
X = x, the sequence:

x_n = \frac{\sin(n\pi x)}{n\pi x}
converges to lim_{n→∞} x_n = 0. The one exception is the realization X = 0, in which case the
sequence becomes x_n = 1, which converges, but not to the same value. Therefore, we say
that the sequence X_n converges almost surely to lim_{n→∞} X_n = 0, since the one exception to
this convergence occurs with zero probability; that is, P(X = 0) = 0.
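A small sketch of this behaviour (illustrative only; the seed is arbitrary): a realization X = x drawn from [0, 1) is nonzero with probability one, and the corresponding sequence visibly decays to zero.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(0.0, 1.0)            # P(X = 0) = 0, so x > 0 almost surely
    for n in (1, 10, 100, 10_000):
        x_n = np.sin(n * np.pi * x) / (n * np.pi * x)
        print(f"n = {n}: X_n = {x_n:.6f}")    # tends to 0 as n grows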
Convergence in probability: the sequence X_n(ζ) is said to converge in probability to the random variable X(ζ) if, for any ε > 0,

P(|X_n(\zeta) - X(\zeta)| > \varepsilon) \to 0 \text{ as } n \to \infty
In Figure 5.7 we illustrate convergence in probability for the case where the limiting random
variable is a constant 𝑥; we see that at the specified time 𝑛 0 most sample sequences must be within
𝜀 of 𝑥. However, the sequences are not required to remain inside a 2𝜀 corridor. The weak law of
large numbers is an example of convergence in probability. Thus we see that the fundamental
difference between almost-sure convergence and convergence in probability is the same as that
between the strong law and the weak law of large numbers.
Example 5.10
Let 𝑋𝑘 , 𝑘 = 1, 2, 3, ... be a sequence of IID Gaussian random variables with mean 𝑚 and
variance σ². Suppose we form the sequence of sample means

M_n = \frac{1}{n}\sum_{k=1}^{n} X_k, \qquad n = 1, 2, 3, ...

Since the M_n are linear combinations of Gaussian random variables, they are also Gaussian,
with E[M_n] = m and VAR[M_n] = σ²/n. Therefore, the probability that the sample mean differs
from m by more than ε is P(|M_n − m| > ε) = 2Q(ε√n/σ), which goes to zero as n → ∞; hence the
sequence of sample means M_n converges in probability to m.
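Because M_n is exactly Gaussian here, the tail probability can be evaluated in closed form; the sketch below does so for m = 0, σ = 1 and ε = 0.1 (arbitrary illustrative values), showing how quickly it vanishes with n.

    import numpy as np
    from scipy.stats import norm

    m, sigma, eps = 0.0, 1.0, 0.1
    for n in (10, 100, 1_000, 10_000):
        # M_n is Gaussian with mean m and variance sigma^2 / n, so
        # P(|M_n - m| > eps) = 2 Q(eps * sqrt(n) / sigma), with Q(x) = norm.sf(x).
        tail = 2 * norm.sf(eps * np.sqrt(n) / sigma)
        print(f"n = {n}: P(|M_n - m| > {eps}) = {tail:.3e}")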
Mean-square (MS) convergence: the sequence X_n(ζ) is said to converge to X(ζ) in the mean-square sense if

E[(X_n(\zeta) - X(\zeta))^2] \to 0 \text{ as } n \to \infty
Example 5.11
Consider the sequence of sample means of IID Gaussian random variables described in
Example 5.10. This sequence also converges in the MS sense since:
E[(M_n - m)^2] = VAR[M_n] = \frac{\sigma^2}{n}

This variance converges to 0 as n → ∞, thus producing convergence of the random sequence
in the MS sense.
Convergence in distribution: the sequence X_n with CDFs F_n(x) is said to converge in distribution to the random variable X with CDF F(x) if

F_n(x) \to F(x) \text{ as } n \to \infty

at every point x where F(x) is continuous.
Example 5.12
Consider once again the sequence of sample means of IID Gaussian random variables
described in Example 5.10. Since 𝑀𝑛 is Gaussian with mean 𝑚 and variance 𝜎 2 /𝑛, its CDF
takes the form
F_{M_n}(x) = 1 - Q\Big( \frac{x - m}{\sigma/\sqrt{n}} \Big)
For any x > m, lim_{n→∞} F_{M_n}(x) = 1, while for any x < m, lim_{n→∞} F_{M_n}(x) = 0. Thus, the
sequence converges in distribution to the limiting CDF

F_M(x) = u(x - m)

where u(x) is the unit step function. Note that the point x = m is not a point of continuity
of F_M(x).
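The step-like limit can be seen numerically; the sketch below evaluates F_{M_n}(x) for one point below m and one above it (m = 1, σ = 2 are arbitrary choices), using norm.sf for the Q-function.

    import numpy as np
    from scipy.stats import norm

    m, sigma = 1.0, 2.0
    for n in (10, 1_000, 100_000):
        for x in (0.5, 1.5):                           # one point below m, one above
            F = 1 - norm.sf((x - m) / (sigma / np.sqrt(n)))
            print(f"n = {n:>6}, x = {x}: F_Mn(x) = {F:.4f}")
    # As n grows the CDF tends to 0 for x < m and to 1 for x > m, i.e. the unit step u(x - m).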
It should be noted, as seen in the preceding sequence of examples, that a random sequence
may converge in several of the different senses at once. In fact, one form of convergence may
imply convergence in several other forms. Table 5.1 illustrates these relationships. For example,
convergence in distribution is the weakest form of convergence and does not necessarily imply
any of the other forms of convergence. Conversely, if a sequence converges in any of the other
modes presented, it will also converge in distribution.
Table 5.1: Relationships between convergence modes, showing whether the convergence mode in
each row implies the convergence mode in each column
This ↓ implies this →   Sure   Almost Sure   Probability   Mean Square   Distribution
Sure                     X      Yes           Yes           No            Yes
Almost Sure              No     X             Yes           No            Yes
Probability              No     No            X             No            Yes
Mean Square              No     No            Yes           X             Yes
Distribution             No     No            No            No            X
By the central limit theorem, the sample mean M_n of n IID samples with standard deviation σ satisfies

P(|M_n - m| < \varepsilon) \approx 1 - 2Q\Big( \frac{\varepsilon\sqrt{n}}{\sigma} \Big)

Stated another way, let ε_a be the value of ε such that the right-hand side of the above equation is
1 − a; that is,

\varepsilon_a = \frac{\sigma}{\sqrt{n}} Q^{-1}(a/2)        (5.27)
where 𝑄 −1 is the inverse of the Q-function. Then, given 𝑛 samples which lead to a sample mean
M_n, the true mean will fall in the interval (M_n − ε_a, M_n + ε_a) with probability 1 − a. The interval
(M_n − ε_a, M_n + ε_a) is referred to as the confidence interval, the probability 1 − a is the confidence
level, and a is the level of significance. The confidence level and level of significance
are usually expressed as percentages. The corresponding values of the quantity c_a = Q^{-1}(a/2)
are provided in Table 5.2 for several typical values of a.
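The quantities c_a and ε_a are easy to compute directly; the sketch below does so for a few common significance levels with σ = 2 and n = 100 (arbitrary values), using scipy's norm.isf as Q⁻¹.

    import numpy as np
    from scipy.stats import norm

    sigma, n = 2.0, 100
    for a in (0.10, 0.05, 0.01):
        c_a = norm.isf(a / 2)                  # Q^{-1}(a/2)
        eps_a = sigma / np.sqrt(n) * c_a
        print(f"a = {a}: c_a = {c_a:.2f},  eps_a = {eps_a:.3f}")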
Example 5.13
Suppose the IID random variables each have a variance of 𝜎 2 = 4. A sample of 𝑛 = 100 values
is taken and the sample mean is found to be 𝑀𝑛 = 10.2. (a) Determine the 95% confidence
interval for the true mean 𝑚. (b) Suppose we want to be 99% confident that the true mean
falls within a factor of ±0.5 of the sample mean. How many samples need to be taken in
forming the sample mean?
Solution. (a) In this case σ/√n = 2/√100 = 0.2, and the appropriate value of c_a is c_{0.05} = 1.96 from
Table 5.2. The 95% confidence interval is then:

\Big( M_n - \frac{\sigma}{\sqrt{n}}c_{0.05},\; M_n + \frac{\sigma}{\sqrt{n}}c_{0.05} \Big) = (9.808, 10.592)
(b) To be 99% confident that the true mean falls within ±0.5 of M_n, we require (σ/√n) c_{0.01} ≤ 0.5, and therefore

n = \Big( \frac{c_{0.01}\,\sigma}{0.5} \Big)^2 = \Big( \frac{2.58 \times 2}{0.5} \Big)^2 = 106.5
Since 𝑛 must be an integer, it is concluded that at least 107 samples must be taken.
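Both parts of this example can be verified with a few lines of Python (a sketch; norm.isf plays the role of Q⁻¹, so the constants differ slightly from the rounded table values):

    import numpy as np
    from scipy.stats import norm

    # (a) 95% confidence interval for the true mean.
    sigma, n, M_n = 2.0, 100, 10.2
    c_05 = norm.isf(0.05 / 2)                          # ~1.96
    half_width = sigma / np.sqrt(n) * c_05
    print("95% interval:", (round(M_n - half_width, 3), round(M_n + half_width, 3)))

    # (b) Number of samples so that the 99% half-width is at most 0.5.
    c_01 = norm.isf(0.01 / 2)                          # ~2.58
    n_req = int(np.ceil((c_01 * sigma / 0.5) ** 2))
    print("required n:", n_req)                        # 107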
In summary, to achieve a level of significance specified by a, we note that, by virtue of the central
limit theorem, the normalized sample mean

\hat{Z}_n = \frac{M_n - m}{\sigma/\sqrt{n}}        (5.28)
approximately follows a standard normal distribution. We can then easily specify a symmetric
interval about zero in which a standard Gaussian random variable will fall with probability 1 − 𝑎.
As long as 𝑛 is sufficiently large, the original distribution of the IID random variables does not
matter.
Note that in order to form the confidence interval as specified, the standard deviation of the 𝑋 𝑗
must be known. While in some cases, this may be a reasonable assumption, in many applications,
the standard deviation is also unknown. The most obvious thing to do in that case would be to
replace the true standard deviation with the sample standard deviation.
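A minimal sketch of this substitution, assuming Gaussian data with a nominal mean of 3 and standard deviation of 2 (both treated as unknown by the estimator); the resulting interval is only approximate, since the sample standard deviation is itself a random quantity.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    x = rng.normal(loc=3.0, scale=2.0, size=100)   # data whose mean and std are "unknown"

    M_n = x.mean()
    s = x.std(ddof=1)                              # sample standard deviation replaces sigma
    eps_a = s / np.sqrt(len(x)) * norm.isf(0.05 / 2)
    print(f"approximate 95% interval: ({M_n - eps_a:.3f}, {M_n + eps_a:.3f})")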
Further Reading
1. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, Elsevier 2012: chapter 7.
2. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: chapter 7.