ECE2191 Lecture Notes
PROBABILITY MODELS IN ENGINEERING
COURSE NOTES
ECE2191
Dr Faezeh Marzbanrad
Lecturers:
Dr Faezeh Marzbanrad (Clayton)
Dr Wynita Griggs (Clayton)
Dr Mohamed Hisham (Malaysia)
2020
Contents
1 Preliminary Concepts
1.1 Probability Models in Engineering
1.2 Review of Set Theory
1.3 Operations on sets
1.4 Other Notations
1.5 Random Experiments
1.5.1 Tree Diagrams
1.5.2 Coordinate System
2 Probability Theory
2.1 Definition of Probability
2.1.1 Relative Frequency Definition
2.1.2 Axiomatic Definition
2.2 Joint Probabilities
2.3 Conditional Probabilities
2.3.1 Bayes's Theorem
2.4 Independence
2.5 Basic Combinatorics
2.5.1 Sequence of Experiments
2.5.2 Sampling with Replacement and with Ordering
2.5.3 Sampling without Replacement and with Ordering
2.5.4 Sampling without Replacement and without Ordering
2.5.5 Sampling with Replacement and without Ordering
3 Random Variables
3.1 The Notion of a Random Variable
3.2 Discrete Random Variables
3.2.1 Probability Mass Function
3.2.2 The Cumulative Distribution Function
3.2.3 Expected Value and Moments
3.2.4 Conditional Probability Mass Function and Expectation
3.2.5 Common Discrete Random Variables
3.3 Continuous Random Variables
3.3.1 The Probability Density Function
3.3.2 Conditional CDF and PDF
3.3.3 The Expected Value and Moments
3.3.4 Important Continuous Random Variables
3.4 The Markov and Chebyshev Inequalities
1 Preliminary Concepts
Definition 1.6. Intersection: The intersection of sets 𝐴 and 𝐵, denoted 𝐴 ∩ 𝐵, is the set of objects common to both 𝐴 and 𝐵; i.e., 𝐴 ∩ 𝐵 = {𝜁 : 𝜁 ∈ 𝐴 and 𝜁 ∈ 𝐵}.
Note that if 𝐴 ⊂ 𝐵, then 𝐴 ∩ 𝐵 = 𝐴. In particular, we always have 𝐴 ∩ 𝑆 = 𝐴.
Definition 1.7. Complement: The complement of a set 𝐴, denoted 𝐴𝑐, is the collection of all objects in 𝑆 not included in 𝐴; i.e., 𝐴𝑐 = {𝜁 ∈ 𝑆 : 𝜁 ∉ 𝐴}.
Definition 1.8. Difference: The relative complement or difference of sets 𝐴 and 𝐵 is the set of elements in 𝐴 that are not in 𝐵; i.e., 𝐴 − 𝐵 = {𝜁 : 𝜁 ∈ 𝐴 and 𝜁 ∉ 𝐵}.
Note that 𝐴 − 𝐵 = 𝐴 ∩ 𝐵𝑐.
These definitions and relationships among sets are illustrated in Figure 1.1. These diagrams are
called Venn diagrams, which represent sets by simple plane areas within the universal set, pictured
as a rectangle. Venn diagrams are important visual aids to understand relationships among sets.
Figure 1.1: Venn diagrams: (a) universal set 𝑆; (b) set 𝐴; (c) set 𝐵; (d) set 𝐴𝑐; (e) set 𝐴 ∪ 𝐵; (f) set 𝐴 ∩ 𝐵; (g) 𝐴 ⊂ 𝐵; (h) disjoint sets 𝐴 and 𝐵; (i) set 𝐴 − 𝐵.
Theorem 1.1
If 𝐴 ⊂ 𝐵 and 𝐵 ⊂ 𝐴, then 𝐴 = 𝐵.
Proof. Since the empty set is a subset of any set, if 𝐴 = ∅ then 𝐵 ⊂ 𝐴 implies that 𝐵 = ∅.
Similarly, if 𝐵 = ∅ then 𝐴 ⊂ 𝐵 implies that 𝐴 = ∅. The theorem is obviously true if 𝐴 and 𝐵
are both empty. If 𝐴 and 𝐵 are nonempty, since 𝐴 ⊂ 𝐵, if 𝜁 ∈ 𝐴 then 𝜁 ∈ 𝐵. Since 𝐵 ⊂ 𝐴, if
𝜁 ∈ 𝐵 then 𝜁 ∈ 𝐴. We therefore conclude that 𝐴 = 𝐵.
Example 1.1
Solution. We first sketch the boundaries of the given sets 𝐴, 𝐵, 𝐶, and 𝐷. Note that if the
boundary of the region is included in the set, it is indicated with a solid line, and if not, it is
indicated with a dotted line. We have
𝐸 = 𝐴 ∩ 𝐵 = {(𝑥, 𝑦) : 𝑥 − 1 ≤ 𝑦 ≤ 𝑥 }
and
𝐹 = 𝐶 ∩ 𝐷 = {(𝑥, 𝑦) : 0 ≤ 𝑦 < 1}.
The set 𝐺 is the set of all ordered pairs (𝑥, 𝑦) satisfying both 𝑥 − 1 ≤ 𝑦 ≤ 𝑥 and 0 ≤ 𝑦 < 1.
Using 1− to denote a value just less than 1, the second inequality may be expressed as
0 ≤ 𝑦 ≤ 1− . We may then express the set 𝐺 as
𝐺 = {(𝑥, 𝑦) : 𝑚𝑎𝑥 {0, 𝑥 − 1} ≤ 𝑦 ≤ 𝑚𝑖𝑛{𝑥, 1− }}.
The set 𝐻 is obtained from 𝐺 by folding about the y-axis and translating down one unit.
This can be seen from the definitions of G and H by noting that (𝑥, 𝑦) ∈ 𝐻 if (−𝑥, 𝑦 + 1) ∈ 𝐺;
hence, we replace 𝑥 with −𝑥 and 𝑦 with 𝑦 + 1 in the above result for 𝐺 to obtain
𝐻 = {(𝑥, 𝑦) : 𝑚𝑎𝑥 {0, −𝑥 − 1} ≤ 𝑦 + 1 ≤ 𝑚𝑖𝑛{−𝑥, 1− }},
or
𝐻 = {(𝑥, 𝑦) : 𝑚𝑎𝑥 {−1, −𝑥 − 2} ≤ 𝑦 ≤ 𝑚𝑖𝑛{−1 − 𝑥, 0− }}.
The sets are illustrated in Figure 1.2.
Figure 1.2: The sets defined in Example 1.1.
Commutative Properties:
𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴 (1.1)
𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴 (1.2)
Associative Properties:
𝐴 ∪ (𝐵 ∪ 𝐶) = (𝐴 ∪ 𝐵) ∪ 𝐶 (1.3)
𝐴 ∩ (𝐵 ∩ 𝐶) = (𝐴 ∩ 𝐵) ∩ 𝐶 (1.4)
Distributive Properties:
𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶) (1.5)
𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶) (1.6)
De Morgan’s Laws:
(𝐴 ∪ 𝐵)𝑐 = 𝐴𝑐 ∩ 𝐵𝑐 (1.7)
(𝐴 ∩ 𝐵)𝑐 = 𝐴𝑐 ∪ 𝐵𝑐 (1.8)
𝐴 ∪ ∅ = 𝐴 (1.9)
𝐴 ∩ 𝑆 = 𝐴 (1.10)
𝐴 ∩ ∅ = ∅ (1.11)
𝐴 ∪ 𝑆 = 𝑆 (1.12)
𝐴 ∩ 𝐴𝑐 = ∅ (1.13)
𝐴 ∪ 𝐴𝑐 = 𝑆 (1.14)
(𝐴𝑐)𝑐 = 𝐴 (1.15)
Example 1.2
Additional insight to operations on sets is provided by the correspondence between the algebra
of set inclusion and Boolean algebra. An element either belongs to a set or it does not. Thus,
interpreting sets as Boolean (logical) variables having values of 0 or 1, the ∪ operation as the
logical "OR", the ∩ as the logical "AND" operation, and the 𝑐 as the logical complement "NOT",
any expression involving set operations can be treated as a Boolean expression.
Theorem 1.3
𝐴 ∪ (𝐴𝑐 ∩ 𝐵) = 𝐴 ∪ 𝐵. (1.16)
Proof. Using the distributive property and (1.14):
𝐴 ∪ (𝐴𝑐 ∩ 𝐵) = (𝐴 ∪ 𝐴𝑐 ) ∩ (𝐴 ∪ 𝐵)
= 𝑆 ∩ (𝐴 ∪ 𝐵)
= 𝐴 ∪ 𝐵.
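The identities above, and the result of Theorem 1.3, can be checked numerically. The following is a minimal sketch using Python's built-in set type; the universal set 𝑆 and the sets 𝐴 and 𝐵 are arbitrary choices made for illustration only.

# Quick numerical check of the set identities above using Python's built-in sets.
# The universal set S and the sets A, B chosen here are arbitrary examples.
S = set(range(10))          # universal set
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def complement(X, S=S):
    """Complement of X relative to the universal set S."""
    return S - X

# De Morgan's laws (1.7) and (1.8)
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

# Theorem 1.3: A union (A^c intersect B) = A union B
assert A | (complement(A) & B) == A | B
print("All set identities hold for this example.")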
Theorem 1.4
Principle of Duality: Any set identity remains true if the symbols ∪,∩, S, and ∅, are replaced
with the symbols ∩,∪,∅, and S, respectively.
Proof. The proof follows by applying De Morgan’s Laws and renaming sets 𝐴𝑐 , 𝐵𝑐 , etc. as
𝐴, 𝐵, etc.
Properties of set operations are easily extended to deal with any finite number of sets. To do this,
we need notation for the union and intersection of a collection of sets.
Definition 1.9. Union: We define the union of a collection of sets (or “set of sets”)
{𝐴𝑖 : 𝑖 ∈ 𝐼 } (1.17)
by:
⋃_{𝑖∈𝐼} 𝐴𝑖 = {𝜁 ∈ 𝑆 : 𝜁 ∈ 𝐴𝑖 for some 𝑖 ∈ 𝐼 } (1.18)
Definition 1.10. Intersection: Similarly, we define the intersection of a collection of sets
{𝐴𝑖 : 𝑖 ∈ 𝐼 } (1.19)
by:
⋂_{𝑖∈𝐼} 𝐴𝑖 = {𝜁 ∈ 𝑆 : 𝜁 ∈ 𝐴𝑖 for every 𝑖 ∈ 𝐼 } (1.20)
𝐵 ∩ ⋃^{𝑛}_{𝑖=1} 𝐴𝑖 = ⋃^{𝑛}_{𝑖=1} (𝐵 ∩ 𝐴𝑖 ) (1.23)
𝐵 ∪ ⋂^{𝑛}_{𝑖=1} 𝐴𝑖 = ⋂^{𝑛}_{𝑖=1} (𝐵 ∪ 𝐴𝑖 ) (1.24)
De Morgan’s Laws:
(⋂^{𝑛}_{𝑖=1} 𝐴𝑖 )𝑐 = ⋃^{𝑛}_{𝑖=1} 𝐴𝑐𝑖 (1.25)
(⋃^{𝑛}_{𝑖=1} 𝐴𝑖 )𝑐 = ⋂^{𝑛}_{𝑖=1} 𝐴𝑐𝑖 (1.26)
Throughout much of probability, it is useful to decompose a set into a union of simpler, non-
overlapping sets. This is an application of the “divide and conquer” approach to problem solving.
Necessary terminology is established in the following definitions.
Definition 1.11. Mutually Exclusive: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 are mutually exclusive (or disjoint)
if 𝐴𝑖 ∩ 𝐴 𝑗 = ∅ for all 𝑖 and 𝑗 with 𝑖 ≠ 𝑗 .
Definition 1.12. Partition: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 form a partition of the set 𝐵 if they are mutually exclusive and 𝐵 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 = ⋃^{𝑛}_{𝑖=1} 𝐴𝑖 .
Definition 1.13. Collectively Exhaustive: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 are collectively exhaustive if 𝑆 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 = ⋃^{𝑛}_{𝑖=1} 𝐴𝑖 .
Example 1.3
Let 𝑆 = {(𝑥, 𝑦) : 𝑥 ≥ 0, 𝑦 ≥ 0}, 𝐴 = {(𝑥, 𝑦) : 𝑥 + 𝑦 < 1}, 𝐵 = {(𝑥, 𝑦) : 𝑥 < 𝑦}, and
𝐶 = {(𝑥, 𝑦) : 𝑥𝑦 > 1/4}. Are the sets 𝐴, 𝐵, and 𝐶 mutually exclusive, collectively exhaustive,
and/or a partition of 𝑆?
Solution. Since 𝐴 ∩ 𝐶 = ∅, the sets 𝐴 and 𝐶 are mutually exclusive; however, 𝐴 ∩ 𝐵 ≠ ∅ and
𝐵 ∩ 𝐶 ≠ ∅, so 𝐴 and 𝐵, and 𝐵 and 𝐶 are not mutually exclusive. Since 𝐴 ∪ 𝐵 ∪ 𝐶 ≠ 𝑆, the
events are not collectively exhaustive. The events 𝐴, 𝐵, and 𝐶 are not a partition of S since
they are not mutually exclusive and collectively exhaustive.
Definition 1.14. Cartesian Product: The Cartesian product of sets 𝐴 and 𝐵 is a set of ordered
pairs of elements of 𝐴 and 𝐵:
𝐴 × 𝐵 = {𝜁 = (𝜁 1, 𝜁 2 ) : 𝜁 1 ∈ 𝐴, 𝜁 2 ∈ 𝐵}. (1.27)
The Cartesian product of sets 𝐴1, 𝐴2, ..., 𝐴𝑛 is a set of n-tuples (an ordered list of 𝑛 elements) of elements of 𝐴1, 𝐴2, ..., 𝐴𝑛 :
𝐴1 × 𝐴2 × ... × 𝐴𝑛 = {𝜁 = (𝜁1, 𝜁2, ..., 𝜁𝑛 ) : 𝜁1 ∈ 𝐴1, 𝜁2 ∈ 𝐴2, ..., 𝜁𝑛 ∈ 𝐴𝑛 }. (1.28)
An important example of a Cartesian product is the usual n-dimensional real Euclidean space:
𝑅^{𝑛} = 𝑅 × 𝑅 × ... × 𝑅 (𝑛 terms). (1.29)
Note that if 𝑎 > 𝑏, then (𝑎, 𝑏) = (𝑎, 𝑏] = [𝑎, 𝑏) = [𝑎, 𝑏] = ∅. If 𝑎 = 𝑏, then (𝑎, 𝑏) = (𝑎, 𝑏] = [𝑎, 𝑏) = ∅ and [𝑎, 𝑏] = {𝑎}. The notation (𝑎, 𝑏) is also used to denote an ordered pair; we depend on the
context to determine whether (𝑎, 𝑏) represents an open interval of real numbers or an ordered
pair.
Example 1.4
Consider the experiment of flipping a fair coin once, where fair means that the coin is not
biased in weight to a particular side. There are two possible outcomes: a head (𝜁1 = 𝐻 ) or a tail (𝜁2 = 𝑇 ). Thus, the sample space 𝑆 consists of two outcomes, 𝜁1 = 𝐻 and 𝜁2 = 𝑇 .
Example 1.5
Now consider flipping the coin until a tail occurs, at which point the experiment is terminated. The sample space consists of a collection of sequences of coin tosses. The outcomes are 𝜁𝑛 , 𝑛 = 1, 2, 3, .... The final toss in any particular sequence is a tail and terminates the sequence. All tosses prior to the occurrence of the tail must be heads. The
possible outcomes that may occur are: 𝜁 1 = (𝑇 ), 𝜁 2 = (𝐻,𝑇 ), 𝜁 3 = (𝐻, 𝐻,𝑇 ), ...
Note that in this case, n can extend to infinity. This is a combined sample space resulting
from conducting independent but identical experiments. In this example, the sample space
is countably infinite.
Example 1.6
A cubical die with numbered faces is rolled and the result observed. The sample space
consists of six possible outcomes, 𝜁 1 = 1, 𝜁 2 = 2, ..., 𝜁 6 = 6, indicating the possible observed
faces of the cubical die.
Example 1.7
Now consider the experiment of rolling two dice and observing the results. The sample space
consists of 36 outcomes: 𝜁1 = (1, 1), 𝜁2 = (1, 2), ..., 𝜁6 = (1, 6), 𝜁7 = (2, 1), 𝜁8 = (2, 2), ..., 𝜁36 = (6, 6), where the first component in the ordered pair indicates the result of the toss of the first die,
and the second component indicates the result of the toss of the second die. Alternatively
we can consider this experiment as two distinct experiments, each consisting of rolling
a single die. The sample spaces (𝑆 1 and 𝑆 2 ) for each of the two experiments are identical,
namely, the same as Example 1.6. We may now consider the sample space of the original
experiment 𝑆, to be the combination of the sample spaces 𝑆 1 and 𝑆 2 , which consists of
all possible combinations of the elements of both 𝑆 1 and 𝑆 2 . This is another example of a
combined sample space. Several interesting events can be also defined from this experiment,
such as:
𝐴 = {the sum of the outcomes of the two rolls = 4},
𝐵 = {the outcomes of the two rolls are identical},
𝐶 = {the first roll was bigger than the second}.
The choice of a particular sample space depends upon the questions that are to be answered
concerning the experiment. Suppose that in Example 1.7, we were asked to record after each roll
the sum of the numbers shown on the two faces. Then, the sample space could be represented
by eleven outcomes, 𝜁1 = 2, 𝜁2 = 3, ..., 𝜁11 = 12. However, the original sample space is in some sense more fundamental, because although the sum of the die faces can be determined from the numbers on the die faces, the sum is not sufficient to specify the sequence of numbers that occurred.
Example 1.8
The coin in Example 1.4 is tossed twice. Illustrate the sample space with a tree diagram.
Let 𝐻𝑖 and 𝑇𝑖 denote the outcome of a head or a tail on the 𝑖th toss, respectively. The
sample space is: 𝑆 = {𝐻 1𝐻 2, 𝐻 1𝑇2,𝑇1𝐻 2,𝑇1𝑇2 } The tree diagram illustrating the sample space
for this sequence of two coin tosses is shown in Figure 1.3.
Each node represents an outcome of one coin toss and the branches of the tree connect
the nodes. The number of branches to the right of each node corresponds to the number
of outcomes for the next coin toss (or experiment). A sequence of samples connected by
branches in a left to right path from the origin to a terminal node represents a sample point
for the combined experiment. There is a one-to-one correspondence between the paths in
the tree diagram and the sample points in the sample space for the combined experiment.
For the two-dice experiment of Example 1.7, the corresponding combined sample space contains 36 sample points. Additionally, we distinguish between sample points with regard to order; e.g., (1,2) is different from (2,1).
Further Reading
1. John D. Enderle, David C. Farden, Daniel J. Krause, Basic Probability Theory for Biomedical
Engineers, Morgan & Claypool, 2006: sections 1.1 and 1.2
2. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, 2nd ed., Elsevier 2012: section 2.1
3. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: sections 1.3 and 2.1
4. Charles W. Therrien, Probability for electrical and computer engineers, CRC Press, 2004:
chapter 1
2 Probability Theory
For example, if a six-sided die is rolled a large number of times and the numbers on the face of the
die come up in approximately equal proportions, then we could say that the probability of each
number on the upturned face of the die is 1/6. The difficulty with this definition is determining
when 𝑁 is sufficiently large and indeed if the limit actually exists. We will certainly use this
definition in practice, relating deduced probabilities to the physical world, but we will not develop
probability theory from it.
The following theorem, which is useful for solving probability problems, is a direct consequence of the axioms of probability.
Theorem 2.1
Assuming that all events indicated are in the event space 𝐹 , we have:
(i) 𝑃 (𝐴𝑐 ) = 1 − 𝑃 (𝐴),
(ii) 𝑃 (∅) = 0,
(iii) 0 ≤ 𝑃 (𝐴) ≤ 1,
(iv) 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵)
(v) 𝑃 (𝐵) ≤ 𝑃 (𝐴) if 𝐵 ⊂ 𝐴.
Proof.
(i) Since 𝑆 = 𝐴 ∪ 𝐴𝑐 and 𝐴 ∩ 𝐴𝑐 = ∅, we apply the second and third axioms of probability
to obtain 𝑃 (𝑆) = 1 = 𝑃 (𝐴) + 𝑃 (𝐴𝑐 ), from which (i) follows.
(ii) Applying (i) with 𝐴 = 𝑆 we have 𝐴𝑐 = ∅ so that 𝑃 (∅) = 1 − 𝑃 (𝑆) = 0.
(iii) From (i) we have 𝑃 (𝐴) = 1 − 𝑃 (𝐴𝑐 ), from the first axiom we have 𝑃 (𝐴) ≥ 0 and
𝑃 (𝐴𝑐 ) ≥ 0; consequently, 0 ≤ 𝑃 (𝐴) ≤ 1.
(iv) Let 𝐶 = 𝐵 ∩ 𝐴𝑐 . Then 𝐴 ∪ 𝐶 = 𝐴 ∪ (𝐵 ∩ 𝐴𝑐 ) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐴𝑐 ) = 𝐴 ∪ 𝐵, and
𝐴 ∩ 𝐶 = 𝐴 ∩ 𝐵 ∩ 𝐴𝑐 = ∅, so that 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴 ∪ 𝐶) = 𝑃 (𝐴) + 𝑃 (𝐶). Now we find
𝑃 (𝐶). Since 𝐵 = 𝐵 ∩ 𝑆 = 𝐵 ∩ (𝐴 ∪ 𝐴𝑐 ) = (𝐵 ∩ 𝐴) ∪ (𝐵 ∩ 𝐴𝑐 ) and (𝐵 ∩ 𝐴) ∩ (𝐵 ∩ 𝐴𝑐 ) = ∅,
𝑃 (𝐵) = 𝑃 (𝐵 ∩ 𝐴𝑐 ) + 𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐶) + 𝑃 (𝐴 ∩ 𝐵), so 𝑃 (𝐶) = 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵). Substituting gives 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵).
(v) We have 𝐴 = 𝐴 ∩ (𝐵 ∪ 𝐵𝑐 ) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐵𝑐 ), and if 𝐵 ⊂ 𝐴, then 𝐴 = 𝐵 ∪ (𝐴 ∩ 𝐵𝑐 ).
Since 𝐵 ∩ (𝐴 ∩ 𝐵𝑐 ) = ∅, consequently, 𝑃 (𝐴) = 𝑃 (𝐵) + 𝑃 (𝐴 ∩ 𝐵𝑐 ) ≥ 𝑃 (𝐵).
Example 2.1
Note that since probabilities are non-negative (Theorem 2.1 (iii)), Theorem 2.1 (iv) implies that
the probability of the union of two events is no greater than the sum of the individual event
probabilities:
𝑃 (𝐴 ∪ 𝐵) ≤ 𝑃 (𝐴) + 𝑃 (𝐵) (2.2)
This can be extended to Boole’s Inequality, described as follows.
Theorem 2.2
𝑃 (𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 ) ≤ Σ^{𝑛}_{𝑖=1} 𝑃 (𝐴𝑖 )
The proof decomposes the union into the mutually exclusive sets 𝐴𝑘 ∩ 𝐵^{𝑐}_{𝑘}, where
𝐵𝑘 = ⋃^{𝑘−1}_{𝑖=1} 𝐴𝑖
Example 2.2
Let 𝑆 = [0, 1] (the set of real numbers 𝑥 : 0 ≤ 𝑥 ≤ 1). Let 𝐴1 = [0, 0.5], 𝐴2 = (0.45, 0.7),
𝐴3 = [0.6, 0.8), and assume 𝑃 (𝜁 ∈ 𝐼 ) = length of the interval 𝐼 ∩ 𝑆, so that 𝑃 (𝐴1 ) = 0.5,
𝑃 (𝐴2 ) = 0.25, and 𝑃 (𝐴3 ) = 0.2. Find 𝑃 (𝐴1 ∪ 𝐴2 ∪ 𝐴3 ).
Solution. Let 𝐶 1 = 𝐴1, 𝐶 2 = 𝐴2 ∩ 𝐴𝑐1 = (0.5, 0.7), and 𝐶 3 = 𝐴3 ∩ 𝐴𝑐1 ∩ 𝐴𝑐2 = [0.7, 0.8). Then
𝐶 1, 𝐶 2, and 𝐶 3 are mutually exclusive and 𝐴1 ∪𝐴2 ∪𝐴3 = 𝐶 1 ∪𝐶 2 ∪𝐶 3 ; hence 𝑃 (𝐴1 ∪𝐴2 ∪𝐴3 ) =
𝑃 (𝐶 1 ∪ 𝐶 2 ∪ 𝐶 3 ) = 0.5 + 0.2 + 0.1 = 0.8. Note that for this example, Boole’s inequality yields
𝑃 (𝐴1 ∪ 𝐴2 ∪ 𝐴3 ) ≤ 0.5 + 0.25 + 0.2 = 0.95.
From the relative frequency definition, in practice we may let 𝑛𝐴,𝐵 be the number of times that 𝐴
and 𝐵 simultaneously occur in 𝑛 trials. Then,
𝑃 (𝐴, 𝐵) = lim_{𝑛→∞} 𝑛𝐴,𝐵 /𝑛 (2.3)
Example 2.3
A standard deck of playing cards has 52 cards that can be divided in several manners. There
are four suits (spades, hearts,diamonds, and clubs), each of which has 13 cards (ace, 2, 3, 4,
... , 10, jack, queen, king). There are two red suits (hearts and diamonds) and two black suits
(spades and clubs). Also, the jacks, queens, and kings are referred to as face cards, while
the others are number cards. Suppose the cards are sufficiently shuffled (randomized) and
one card is drawn from the deck. The experiment has 52 outcomes corresponding to the 52
individual cards that could have been selected. Hence, each outcome has a probability of
1/52. Define the events:
A = {red card selected},
B = {number card selected},
C = {heart selected}.
Since the event A consists of 26 outcomes (there are 26 red cards), then 𝑃 (𝐴) = 26/52 = 1/2.
Likewise, 𝑃 (𝐵) = 40/52 = 10/13 and 𝑃 (𝐶) = 13/52 = 1/4. Events A and B have 20
outcomes in common, hence 𝑃 (𝐴, 𝐵) = 20/52 = 5/13. Likewise, 𝑃 (𝐵, 𝐶) = 10/52 = 5/26
and 𝑃 (𝐴, 𝐶) = 13/52 = 1/4. It is interesting to note that in this example, 𝑃 (𝐴, 𝐶) = 𝑃 (𝐶),
because 𝐶 ⊂ 𝐴 and as a result 𝐴 ∩ 𝐶 = 𝐶.
𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑀 ) = 𝑃 (𝐴𝑀 |𝐴1, 𝐴2, ..., 𝐴𝑀−1 )𝑃 (𝐴𝑀−1 |𝐴1, 𝐴2, ..., 𝐴𝑀−2 )... × 𝑃 (𝐴2 |𝐴1 )𝑃 (𝐴1 ) (2.8)
Example 2.4
Return to the experiment of drawing cards from a deck as described in Example 2.3. Suppose
now that we select two cards at random from the deck. When we select the second card,
we do not return the first card to the deck. In this case, we say that we are selecting cards
without replacement. As a result, the probabilities associated with selecting the second card
are slightly different if we have knowledge of which card was drawn on the first selection.
To illustrate this, let:
A = {first card was a spade} and
B = {second card was a spade}.
The probability of the event A can be calculated as in the previous example to be 𝑃 (𝐴) =
13/52 = 1/4. Likewise, if we have no knowledge of what was drawn on the first selection, the
probability of the event B is the same, 𝑃 (𝐵) = 1/4. To calculate the joint probability of A and
B, we have to do some counting. To begin, when we select the first card there are 52 possible
outcomes. Since this card is not returned to the deck, there are only 51 possible outcomes
for the second card. Hence, this experiment of selecting two cards from the deck has 52 ∗ 51
possible outcomes each of which is equally likely. Similarly, there are 13 ∗ 12 outcomes that
belong to the joint event 𝐴 ∩ 𝐵. Therefore, the joint probability for A and B is 𝑃 (𝐴, 𝐵) =
(13 ∗ 12)/(52 ∗ 51) = 1/17. The conditional probability of the second card being a spade
given that the first card is a spade is then 𝑃 (𝐵|𝐴) = 𝑃 (𝐴, 𝐵)/𝑃 (𝐴) = (1/17)/(1/4) = 4/17.
However, calculating this conditional probability directly is probably easier than calculating
the joint probability. Given that we know the first card selected was a spade, there are now
51 cards left in the deck, 12 of which are spades, thus 𝑃 (𝐵|𝐴) = 12/51 = 4/17.
Theorem 2.3
𝑃 (𝐴|𝐵) = 𝑃 (𝐵|𝐴)𝑃 (𝐴)/𝑃 (𝐵) (2.9)
Theorem 2.3 is useful for calculating certain conditional probabilities since, in many problems, it
may be quite difficult to compute 𝑃 (𝐴|𝐵) directly, whereas calculating 𝑃 (𝐵|𝐴) may be straightfor-
ward.
Theorem 2.4: Theorem of Total Probability
Let 𝐵1, 𝐵2, ..., 𝐵𝑛 be a set of mutually exclusive and collectively exhaustive events. That is,
𝐵𝑖 ∩ 𝐵𝑗 = ∅ for all 𝑖 ≠ 𝑗 and
⋃^{𝑛}_{𝑖=1} 𝐵𝑖 = 𝑆 ⇒ Σ^{𝑛}_{𝑖=1} 𝑃 (𝐵𝑖 ) = 1 (2.11)
then
𝑃 (𝐴) = Σ^{𝑛}_{𝑖=1} 𝑃 (𝐴|𝐵𝑖 )𝑃 (𝐵𝑖 ) (2.12)
Proof. From the Venn diagram in Figure 2.1, it can be seen that the event 𝐴 can be written as:
𝐴 = (𝐴 ∩ 𝐵1 ) ∪ (𝐴 ∩ 𝐵2 ) ∪ ... ∪ (𝐴 ∩ 𝐵𝑛 )
Also, since the 𝐵𝑖 are all mutually exclusive, then the {𝐴 ∩ 𝐵𝑖 } are also mutually exclusive, so that
𝑃 (𝐴) = Σ^{𝑛}_{𝑖=1} 𝑃 (𝐴, 𝐵𝑖 ) = Σ^{𝑛}_{𝑖=1} 𝑃 (𝐴|𝐵𝑖 )𝑃 (𝐵𝑖 ) (by Theorem 2.3). (2.14)
Figure 2.1: Venn diagram used to help prove the theorem of total probability
By combining the results of Theorems 2.3 and 2.4, we get what has come to be known as Bayes’s
theorem.
Theorem 2.5: Bayes’s Theorem
Let 𝐵 1, 𝐵 2, ..., 𝐵𝑛 be a set of mutually exclusive and collectively exhaustive events. Then:
𝑃 (𝐵𝑖 |𝐴) = 𝑃 (𝐴|𝐵𝑖 )𝑃 (𝐵𝑖 ) / Σ^{𝑛}_{𝑘=1} 𝑃 (𝐴|𝐵𝑘 )𝑃 (𝐵𝑘 ) (2.15)
𝑃 (𝐵𝑖 ) is often referred to as the a priori probability of event 𝐵𝑖 , while 𝑃 (𝐵𝑖 |𝐴) is known as the a
posteriori probability of event 𝐵𝑖 given 𝐴.
Example 2.5
A certain auditorium has 30 rows of seats. Row 1 has 11 seats, while Row 2 has 12 seats, Row
3 has 13 seats, and so on to the back of the auditorium where Row 30 has 40 seats. A door
prize is to be given away by randomly selecting a row (with equal probability of selecting
any of the 30 rows) and then randomly selecting a seat within that row (with each seat in
the row equally likely to be selected). Find the probability that Seat 15 was selected given
that Row 20 was selected and also find the probability that Row 20 was selected given that
Seat 15 was selected.
Solution. The first task is straightforward. Given that Row 20 was selected, there are 30
possible seats in Row 20 that are equally likely to be selected. Hence, 𝑃 (𝑆𝑒𝑎𝑡15|𝑅𝑜𝑤20) =
1/30. Without the help of Bayes’s theorem, finding the probability that Row 20 was selected
given that we know Seat 15 was selected would seem to be a formidable problem. Using
Bayes’s theorem,
𝑃 (𝑅𝑜𝑤20|𝑆𝑒𝑎𝑡15) = 𝑃 (𝑆𝑒𝑎𝑡15|𝑅𝑜𝑤20)𝑃 (𝑅𝑜𝑤20)/𝑃 (𝑆𝑒𝑎𝑡15).
The two terms in the numerator on the right-hand side are both equal to 1/30. The term in
the denominator is calculated using the help of the theorem of total probability.
𝑃 (𝑆𝑒𝑎𝑡15) = Σ^{30}_{𝑘=5} (1/(𝑘 + 10)) (1/30) = 0.0342
With this calculation completed, the a posteriori probability of Row 20 being selected given
seat 15 was selected is given by:
𝑃 (𝑅𝑜𝑤20|𝑆𝑒𝑎𝑡15) = (1/30 × 1/30)/0.0342 = 0.0325
Note that the a priori probability that Row 20 was selected is 1/30 = 0.0333. Therefore, the
additional information that Seat 15 was selected makes the event that Row 20 was selected
slightly less likely. In some sense, this may be counterintuitive, since we know that if Seat
15 was selected, there are certain rows that could not have been selected (i.e., Rows 1–4
have fewer than 15 seats) and, therefore, we might expect Row 20 to have a slightly higher
probability of being selected compared to when we have no information about which seat
was selected. To see why the probability actually goes down, try computing the probability
that Row 5 was selected given that Seat 15 was selected. The event that Seat 15 was selected
makes some rows much more probable, while it makes others less probable and a few rows
now impossible.
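The numbers in this example are easily reproduced by brute force. The short sketch below assumes the auditorium layout described above (Row 𝑘 has 𝑘 + 10 seats, each of the 30 rows equally likely) and applies the theorem of total probability and Bayes's theorem directly.

# Numerical check of Example 2.5.
rows = range(1, 31)
p_row = 1 / 30                          # each row equally likely

# Theorem of total probability: sum over rows that actually have a Seat 15
p_seat15 = sum(p_row * (1 / (k + 10)) for k in rows if k + 10 >= 15)

# Bayes' theorem: P(Row 20 | Seat 15) = P(Seat 15 | Row 20) P(Row 20) / P(Seat 15)
p_row20_given_seat15 = (1 / 30) * p_row / p_seat15

print(f"P(Seat 15)          = {p_seat15:.4f}")            # approximately 0.0342
print(f"P(Row 20 | Seat 15) = {p_row20_given_seat15:.4f}")  # approximately 0.0325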
2.4 Independence
In Example 2.5, it was seen that observing one event can change the probability of the occurrence
of another event. In that particular case, the knowledge that Seat 15 was selected lowered the probability that Row 20 was selected. We say that the event 𝐴 = {Row 20 was
selected} is statistically dependent on the event 𝐵 = {Seat 15 was selected}. If the description of
the auditorium were changed so that each row had an equal number of seats (e.g., say all 30 rows
had 20 seats each), then observing the event 𝐵 = {Seat 15 was selected} would not give us any new
information about the likelihood of the event 𝐴 = {Row 20 was selected}. In that case, we say
that the events 𝐴 and 𝐵 are statistically independent.
Mathematically, two events 𝐴 and 𝐵 are independent if 𝑃 (𝐴|𝐵) = 𝑃 (𝐴). That is, the a priori
probability of event 𝐴 is identical to the a posteriori probability of 𝐴 given 𝐵. Note that if
𝑃 (𝐴|𝐵) = 𝑃 (𝐴), then the following conditions also hold: 𝑃 (𝐵|𝐴) = 𝑃 (𝐵) and 𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵).
Furthermore, if 𝑃 (𝐴|𝐵) ≠ 𝑃 (𝐴), then the other two conditions also do not hold. We can thereby
conclude that any of these three conditions can be used as a test for independence and the other
two forms must follow. We use the last form as a definition of independence since it is symmetric
relative to the events A and B.
Definition 2.3. Independence: Two events are statistically independent if and only if:
𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵) (2.16)
Example 2.6
Consider the experiment of tossing two numbered dice and observing the numbers that
appear on the two upper faces. For convenience, let the dice be distinguished by color, with
the first die tossed being red and the second being white. Let:
A = {number on the red die is less than or equal to 2},
B = {number on the white die is greater than or equal to 4},
C = {the sum of the numbers on the two dice is 3}.
As mentioned in the preceding text, there are several ways to establish independence (or
lack thereof) of a pair of events. One possible way is to compare 𝑃 (𝐴, 𝐵) with 𝑃 (𝐴)𝑃 (𝐵).
Note that for the events defined here, 𝑃 (𝐴) = 1/3, 𝑃 (𝐵) = 1/2, 𝑃 (𝐶) = 1/18. Also, of the 36
possible outcomes of the experiment, six belong to the event 𝐴 ∩ 𝐵 and hence 𝑃 (𝐴, 𝐵) = 1/6.
Since 𝑃 (𝐴)𝑃 (𝐵) = 1/6 as well, we conclude that the events 𝐴 and 𝐵 are independent. This
agrees with intuition since we would not expect the outcome of the roll of one die to affect
the outcome of the other. What about the events 𝐴 and 𝐶? Of the 36 possible outcomes of the
experiment, two belong to the event 𝐴∩𝐶 and hence 𝑃 (𝐴, 𝐶) = 1/18. Since 𝑃 (𝐴)𝑃 (𝐶) = 1/54,
the events 𝐴 and 𝐶 are not independent. Again, this is intuitive since whenever the event 𝐶
occurs, the event 𝐴 must also occur and so the two must be dependent. Finally, we look at
the pair of events 𝐵 and 𝐶. Clearly, 𝐵 and 𝐶 are mutually exclusive. If the white die shows a
number greater than or equal to 4, there is no way the sum can be 3. Hence, 𝑃 (𝐵, 𝐶) = 0 and
since 𝑃 (𝐵)𝑃 (𝐶) = 1/36, these two events are also dependent.
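Because the sample space here is small, independence can also be checked by exhaustive enumeration. The sketch below uses exact rational arithmetic; the event definitions mirror 𝐴, 𝐵, and 𝐶 above.

# Brute-force check of Example 2.6 over the 36 equally likely outcomes (red, white).
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # (red, white)

def prob(event):
    """Exact probability of an event under equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] <= 2          # red die less than or equal to 2
B = lambda o: o[1] >= 4          # white die greater than or equal to 4
C = lambda o: sum(o) == 3        # sum of the two dice is 3

print(prob(lambda o: A(o) and B(o)) == prob(A) * prob(B))   # True:  A, B independent
print(prob(lambda o: A(o) and C(o)) == prob(A) * prob(C))   # False: A, C dependent
print(prob(lambda o: B(o) and C(o)) == prob(B) * prob(C))   # False: B, C dependent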
Note that mutually exclusive events are not the same as independent events. For two events 𝐴
and 𝐵 for which 𝑃 (𝐴) ≠ 0 and 𝑃 (𝐵) ≠ 0, 𝐴 and 𝐵 can never be both independent and mutually
exclusive. Thus, mutually exclusive events are necessarily statistically dependent.
Generalizing the definition of independence to three events: 𝐴, 𝐵, and 𝐶 are mutually independent if each pair of events is independent,
𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵) (2.17)
𝑃 (𝐴, 𝐶) = 𝑃 (𝐴)𝑃 (𝐶) (2.18)
𝑃 (𝐵, 𝐶) = 𝑃 (𝐵)𝑃 (𝐶) (2.19)
and, in addition,
𝑃 (𝐴, 𝐵, 𝐶) = 𝑃 (𝐴)𝑃 (𝐵)𝑃 (𝐶) (2.20)
Definition 2.4. The events 𝐴1, 𝐴2, ..., 𝐴𝑛 are independent if any subset of 𝑘 < 𝑛 of these events are
independent, and in addition
𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑛 ) = 𝑃 (𝐴1 )𝑃 (𝐴2 )...𝑃 (𝐴𝑛 ) (2.21)
There are basically two ways in which we can use the idea of independence. We can compute
joint or conditional probabilities and apply one of the definitions as a test for independence.
Alternatively, we can assume independence and use the definitions to compute joint or conditional
probabilities that otherwise may be difficult to find. The latter approach is used extensively in
engineering applications. For example, certain types of noise signals can be modeled in this
way. Suppose we have some time waveform 𝑋 (𝑡) which represents a noisy signal that we wish
to sample at various points in time, 𝑡 1, 𝑡 2, ..., 𝑡𝑛 . Perhaps we are interested in the probabilities
that these samples might exceed some threshold, so we define the events 𝐴𝑖 = {𝑋 (𝑡𝑖 ) > 𝑇 },
𝑖 = 1, 2, ..., 𝑛. In some cases, we can assume that the value of the noise at one point in time does
not affect the value of the noise at another point in time. Hence, we assume that these events are
independent and therefore 𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑛 ) = 𝑃 (𝐴1 )𝑃 (𝐴2 )...𝑃 (𝐴𝑛 ).
possible outcomes. This result allows us to quickly calculate the number of sample points in a
sequence of experiments.
Example 2.7
How many odd two digit numbers can be formed from the digits 2, 7, 8, and 9, if each digit
can be used only once?
Solution. As the first experiment, there are two ways of selecting a number for the unit’s
place (either 7 or 9). In each case of the first experiment, there are three ways of selecting a
number for the ten’s place in the second experiment, excluding the digit used for the unit’s
place. The number of outcomes in the combined experiment is therefore 2 × 3 = 6.
Example 2.8
Solution. Since each bit (or binary digit) in a computer word is either a one or a zero, and
there are 8 bits, then the total number of computer words is 𝑛 = 2⁸ = 256. To determine
the maximum sampling error, first compute the range of voltage assigned to each computer
word which equals 10 V/256 words = 0.0390625 V/word and then divide by two (i.e. round
off to the nearest level), which yields a maximum error of 0.0195312 V/word.
Solution. There are 2ᵏ different binary numbers. Note that the digits are "ordered", and repeated 0 and 1 digits are possible.
Example 2.10
An urn contains five balls numbered 1 to 5. Suppose we select two balls from the urn with
replacement. How many distinct ordered pairs are possible? What is the probability that
the two draws yield the same number?
Solution. The number of ordered pairs is 5² = 25. Figure 2.2 shows the 25 possible pairs.
Five of the 25 outcomes have the two draws with the same number; if we suppose that all
pairs are equiprobable, then the probability that the two draws yield the same number is
5/25 = 0.2.
Figure 2.2: Possible outcomes in sampling with replacement and with ordering of two balls
from an urn containing five distinct balls
Example 2.11
An urn contains five balls numbered 1 to 5. Suppose we select two balls in succession
without replacement. How many distinct ordered pairs are possible? What is the probability
that the first ball has a number larger than that of the second ball?
Solution. Equation 2.24 states that the number of ordered pairs is 5 × 4 = 20, as shown in
figure 2.3. Ten ordered pairs (in the dashed triangle) have the first number larger than the
second number ; thus the probability of this event is 10/20 = 0.5.
Figure 2.3: Possible outcomes in sampling without replacement and with ordering.
Example 2.12
An urn contains five balls numbered 1 to 5. Suppose we draw three balls with replacement.
What is the probability that all three balls are different?
Solution. From Equation 2.23 there are 5³ = 125 possible outcomes, which we will suppose are equiprobable. The number of these outcomes for which the three draws are different is given by Equation 2.24, 5 × 4 × 3 = 60. Thus the probability that all three balls are different is 60/125 = 0.48.
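The same counting can be reproduced by enumerating all outcomes. The sketch below is one way to do this with the Python standard library.

# Enumerate the 5**3 = 125 equally likely outcomes of Example 2.12 directly.
from itertools import product

triples = list(product(range(1, 6), repeat=3))        # sampling with replacement, ordered
all_different = [t for t in triples if len(set(t)) == 3]

print(len(triples))                        # 125
print(len(all_different))                  # 60
print(len(all_different) / len(triples))   # 0.48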
In many problems of interest, we seek to find the number of different ways that we can rearrange
or order several items. The number of permutations can easily be determined from equation 2.24
and is given as follows. Consider drawing 𝑛 objects from an urn containing 𝑛 distinct objects until
the urn is empty, i.e. sampling without replacement with 𝑘 = 𝑛. Thus, the number of possible
orderings, i.e. permutations of 𝑛 distinct objects is:
𝑛(𝑛 − 1) · · · (2)(1) = 𝑛! (2.25)
Now consider choosing 𝑘 objects from a set of 𝑛 distinct objects without replacement and without regard to order. The number of such combinations of size 𝑘 is obtained by dividing the number of ordered samples by the 𝑘! orderings of each subset:
𝐶^{𝑛}_{𝑘} = 𝑛(𝑛 − 1) · · · (𝑛 − 𝑘 + 1)/𝑘! = 𝑛!/((𝑛 − 𝑘)!𝑘!) (2.26)
The expression 𝐶^{𝑛}_{𝑘} is also called a binomial coefficient and is read “n choose k.” Note that choosing
𝑘 objects out of a set of 𝑛 is equivalent to choosing the objects that are to be left out, since
𝐶^{𝑛}_{𝑘} = 𝐶^{𝑛}_{𝑛−𝑘} (2.27)
Note that from Equation 2.25, there are 𝑘! possible orders in which the 𝑘 selected objects could
have been selected. Thus in the case of 𝑘-permutations 𝑃^{𝑛}_{𝑘}, the total number of distinct ordered samples of 𝑘 objects is:
𝑃^{𝑛}_{𝑘} = 𝐶^{𝑛}_{𝑘} 𝑘! (2.28)
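These counting formulas are available directly in the Python standard library (Python 3.8 or later), which gives a quick way to check Equations 2.26 to 2.28 for particular values of 𝑛 and 𝑘.

# Sanity check of Equations 2.26-2.28 with the standard library (Python 3.8+).
import math

n, k = 5, 2
assert math.comb(n, k) == math.factorial(n) // (math.factorial(n - k) * math.factorial(k))  # (2.26)
assert math.comb(n, k) == math.comb(n, n - k)                                               # (2.27)
assert math.perm(n, k) == math.comb(n, k) * math.factorial(k)                               # (2.28)
print(math.comb(5, 2), math.perm(5, 2))   # 10 20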
Example 2.13
Find the number of ways of selecting two balls from five balls numbered 1 to 5, without
replacement and without regard to order.
Solution. The number of ways is 𝐶^{5}_{2} = 5!/(3! 2!) = 10; the possible pairs are shown in Figure 2.4.
Figure 2.4: Possible outcomes in sampling without replacement and without ordering.
Example 2.14
Find the number of distinct permutations of 2 white balls and 3 black balls.
Solution. This problem is equivalent to the sampling problem: Assume 5 possible positions
for the balls, then pick a combination of 2 positions out of 5 and arrange the 2 white balls
accordingly. Each combination leads to a distinct arrangement (permutation) of 2 white
balls and 3 black balls. Thus the number of distinct permutations of 2 white balls and 3 black
balls is 𝐶^{5}_{2} = 10. The 10 distinct permutations with 2 whites (zeros) and 3 blacks (ones) are:
00111 01011 01101 01110 10011 10101 10110 11001 11010 11100. Note that the position of
whites (zeros) can be represented by the pair of numbers on the two selected balls in figure
2.4.
Example 2.14 shows that sampling without replacement and without ordering is equivalent to
partitioning the set of 𝑛 distinct objects into two sets: 𝐵, containing the 𝑘 items that are picked
from the urn, and 𝐵𝑐 containing the 𝑛 − 𝑘 left behind. Suppose we partition a set of 𝑛 distinct
objects into 𝐽 subsets 𝐵1, 𝐵2, ..., 𝐵𝐽 , where subset 𝐵𝑗 contains 𝑘𝑗 elements and 𝑘1 + 𝑘2 + ... + 𝑘𝐽 = 𝑛. The number of distinct partitions is:
𝑛!/(𝑘1! 𝑘2! ... 𝑘𝐽 !) (2.29)
which is called the multinomial coefficient. The binomial coefficient is a special case of the
multinomial coefficient where 𝐽 = 2.
Note that this form can be summarized by the sequence ×× | | × | ×× where the "|" s indicate the
lines between columns, and where nothing appears between consecutive |s if the corresponding
object was not selected. Each different arrangement of 5 ×s and 3 |s leads to a distinct form. If
we identify ×s with “white balls” and |s with “black balls,” then this problem becomes similar to Example 2.14, and the number of different arrangements is given by 𝐶^{8}_{3}. In the general case
the form will involve 𝑘 ×s and (𝑛 − 1) |s. Thus the number of different ways of picking 𝑘 objects
from a set of 𝑛 distinct objects with replacement and without ordering is given by:
𝐶^{𝑛−1+𝑘}_{𝑘} = 𝐶^{𝑛−1+𝑘}_{𝑛−1} (2.30)
Example 2.15
Find the number of ways of selecting two balls from five balls numbered 1 to 5, with replace-
ment but without regard to order.
Solution. The number of ways is 𝐶^{5−1+2}_{2} = 𝐶^{6}_{2} = 6!/(2! 4!) = 15.
Figure 2.5 shows the 15 pairs. Note that because of the replacement after each selection, the
same ball can be selected twice for each pair.
Figure 2.5: Possible outcomes in sampling with replacement and without ordering.
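The count in Example 2.15 can be checked by direct enumeration. The sketch below uses the Python standard library; the values 𝑛 = 5 and 𝑘 = 2 are those of the example.

# Counting check for Example 2.15: 2 draws from 5 balls, with replacement, order ignored.
import math
from itertools import combinations_with_replacement

n, k = 5, 2
pairs = list(combinations_with_replacement(range(1, n + 1), k))
print(len(pairs))                    # 15
print(math.comb(n - 1 + k, k))       # 15, consistent with Equation 2.30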
Further Reading
1. John D. Enderle, David C. Farden, Daniel J. Krause, Basic Probability Theory for Biomedical
Engineers, Morgan & Claypool, 2006: sections 1.2.3 to 1.9
2. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, 2nd ed., Elsevier 2012: section 2.2 to 2.7
3. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: sections 2.2 to 2.6
3 Random Variables
In most random experiments, we are interested in a numerical attribute of the outcome of the
experiment. A random variable is defined as a function that assigns a numerical value to the
outcome of the experiment.
Figure 3.1: A random variable assigns a number 𝑋 (𝜁 ) to each outcome 𝜁 in the sample space 𝑆 of
a random experiment.
Since 𝑋 (𝜁 ) is a random variable whose numerical value depends on the outcome of an experiment,
we cannot describe the random variable by stating its value; rather, we describe the probabilities
that the variable takes on a specific value or values (e.g. 𝑃 (𝑋 = 3) or 𝑃 (𝑋 > 8)).
Example 3.1
A coin is tossed three times and the sequence of heads and tails is noted. The sample space
for this experiment is 𝑆 ={ HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. (a) Let 𝑋 be the
number of heads in the three tosses. Find the random variable 𝑋 (𝜁 ) for each outcome 𝜁 . (b)
Now find the probability of the event {𝑋 = 2}.
Solution. (a) 𝑋 assigns each outcome 𝜁 in 𝑆 a number from the set 𝑆𝑥 = {0, 1, 2, 3}:
𝜁 : HHH HHT HTH HTT THH THT TTH TTT
𝑋 (𝜁 ) : 3 2 2 1 2 1 1 0
(b) The event {𝑋 = 2} corresponds to the outcomes {HHT, HTH, THH}, so 𝑃 (𝑋 = 2) = 3/8.
Example 3.1 shows a general technique for finding the probabilities of events involving the random
variable 𝑋 . Let the underlying random experiment have sample space 𝑆. To find the probability
of a subset 𝐵 of 𝑅, e.g., 𝐵 = {𝑥𝑘 }, we need to find the outcomes in 𝑆 that are mapped to 𝐵, i.e.:
𝐴 = {𝜁 : 𝑋 (𝜁 ) ∈ 𝐵} (3.1)
As shown in figure 3.2. If event 𝐴 occurs then 𝑋 (𝜁 ) ∈ 𝐵, so event 𝐵 occurs. Conversely, if event 𝐵
occurs, then the value 𝑋 (𝜁 ) implies that 𝜁 is in 𝐴, so event 𝐴 occurs. Thus the probability that 𝑋
is in 𝐵 is given by:
𝑃 (𝑋 ∈ 𝐵) = 𝑃 (𝐴) = 𝑃 ({𝜁 : 𝑋 (𝜁 ) ∈ 𝐵}) (3.2)
We refer to 𝐴 and 𝐵 as equivalent events. In some random experiments the outcome 𝜁 is already
the numerical value we are interested in. In such cases we simply let 𝑋 (𝜁 ) = 𝜁 that is, the identity
function, to obtain a random variable.
Note that we use the convention that upper case variables represent random variables while lower
case variables represent fixed values that the random variable can assume. The PMF satisfies the
following properties that provide all the information required to calculate probabilities for events
involving the discrete random variable 𝑋 :
(i) 𝑃𝑋 (𝑥) ≥ 0 for all 𝑥
(ii) Σ_{𝑥∈𝑆𝑥} 𝑃𝑋 (𝑥) = Σ_{𝑘} 𝑃𝑋 (𝑥𝑘 ) = Σ_{𝑘} 𝑃 (𝐴𝑘 ) = 1
Example 3.2
Let 𝑋 be the number of heads in three independent tosses of a fair coin. Find the PMF of 𝑋 .
Solution. The eight outcomes in 𝑆 are equally likely, so 𝑃𝑋 (0) = 1/8, 𝑃𝑋 (1) = 3/8, 𝑃𝑋 (2) = 3/8, and 𝑃𝑋 (3) = 1/8.
Figure 3.2 shows the graph of 𝑃𝑋 (𝑥) versus 𝑥 for the random variable in this example.
Generally the graph of the PMF of a discrete random variable has vertical arrows of height 𝑃𝑋 (𝑥𝑘 )
at the values 𝑥𝑘 in 𝑆𝑥 . The relative values of PMF at different points give an indication of the
relative likelihoods of occurrence.
Finally, let’s consider the relationship between relative frequencies and the PMF. Suppose we
perform 𝑛 independent repetitions to obtain 𝑛 observations of the discrete random variable 𝑋 .
Let 𝑁𝑘 (𝑛) be the number of times the event 𝑋 = 𝑥𝑘 occurs and let 𝑓𝑘 (𝑛) = 𝑁𝑘 (𝑛)/𝑛 be the
corresponding relative frequency. As 𝑛 becomes large we expect that 𝑓𝑘 (𝑛) → 𝑃𝑋 (𝑥𝑘 ). Therefore
the graph of relative frequencies should approach the graph of the PMF. For the experiment in
Example 3.2, 1000 repetitions of an experiment of tossing a coin may generate a graph of relative
frequencies shown in Figure 3.3.
Figure 3.3: Relative frequencies and corresponding PMF for the experiment in Example 3.2
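A simulation along these lines is easy to set up. The sketch below is one possible implementation for the experiment of Example 3.2 (three tosses of a fair coin, repeated 1000 times); the random seed is an arbitrary choice made only for repeatability.

# Relative frequencies of X (number of heads in 3 fair coin tosses) over 1000 repetitions,
# compared with the exact PMF, in the spirit of Figure 3.3.
import random
from collections import Counter
from math import comb

random.seed(0)                       # arbitrary seed, for repeatability
n_reps = 1000
counts = Counter(sum(random.randint(0, 1) for _ in range(3)) for _ in range(n_reps))

for k in range(4):
    rel_freq = counts[k] / n_reps
    pmf = comb(3, k) * 0.5 ** 3      # binomial PMF with n = 3, p = 1/2
    print(f"x = {k}: relative frequency = {rel_freq:.3f}, PMF = {pmf:.3f}")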
In other words, the CDF is the probability that the random variable 𝑋 takes on a value in the
set (−∞, 𝑥]. In terms of the underlying sample space, the CDF is the probability of the event
{𝜁 : 𝑋 (𝜁 ) ≤ 𝑥 }.
In other words, the value of 𝐹𝑋 (𝑥) is constructed by simply adding together the probabilities 𝑃𝑋 (𝑦) for the values 𝑦 that are no larger than 𝑥. Note that:
𝑃 (𝑎 < 𝑋 ≤ 𝑏) = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎) (3.7)
The CDF is an increasing step function with steps at the values taken by the random variable.
The heights of the steps are the probabilities of taking these values. Mathematically, the PMF can
be obtained from the CDF through the relationship:
𝑃𝑋 (𝑥) = 𝐹𝑋 (𝑥) − 𝐹𝑋 (𝑥 − ) (3.8)
where 𝐹𝑋 (𝑥 − ) is the limiting value from below of the cumulative distribution function. If there is
no step in the cumulative distribution function at a point 𝑥, then 𝐹𝑋 (𝑥) = 𝐹𝑋 (𝑥 − ) and 𝑃𝑋 (𝑥) = 0.
If there is a step at a point 𝑥, then 𝐹𝑋 (𝑥) is the value of the CDF at the top of the step, and 𝐹𝑋 (𝑥 − )
is the value of the CDF at the bottom of the step, so that 𝑃𝑋 (𝑥) is the height of the step. These
relationships are illustrated in the following example.
Example 3.3
Similar to Example 3.2, let 𝑋 be the number of heads in three tosses of a fair coin. Find the
CDF of X.
Solution. From Example 3.2, we know that 𝑋 takes on only the values 0, 1, 2, and 3 with
probabilities 1/8, 3/8, 3/8, and 1/8, respectively, so 𝐹𝑋 (𝑥) is simply the sum of the probabili-
ties of the outcomes from {0, 1, 2, 3} that are less than or equal to 𝑥. The resulting CDF is a
non-decreasing staircase function that grows from 0 to 1. It has jumps at the points 0, 1, 2, 3
of magnitudes 1/8, 3/8, 3/8, and 1/8, respectively.
Let us take a closer look at one of these discontinuities, say, in the vicinity of 𝑥 = 1. For a
small positive number 𝛿, we have:
𝐹𝑋 (1 − 𝛿) = 𝑃 (𝑋 ≤ 1 − 𝛿) = 𝑃 (no heads) = 1/8
so the limit of the CDF as 𝑥 approaches 1 from the left is 1/8. However,
𝐹𝑋 (1) = 𝑃 (𝑋 ≤ 1) = 𝑃 (zero or one heads) = 1/2
Thus the CDF is continuous from the right and equal to 1/2 at the point 𝑥 = 1. Indeed, we
note the magnitude of the step at the point 𝑥 = 1 is 𝑃 (𝑋 = 1) = 1/2 − 1/8 = 3/8. The CDF
can be written compactly in terms of the unit step function:
𝐹𝑋 (𝑥) = (1/8)𝑢 (𝑥) + (3/8)𝑢 (𝑥 − 1) + (3/8)𝑢 (𝑥 − 2) + (1/8)𝑢 (𝑥 − 3)
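The compact unit-step form of the CDF can be evaluated directly. The following short sketch implements it and shows the jump of height 3/8 at 𝑥 = 1.

# Direct evaluation of the staircase CDF in Example 3.3.
def u(x):
    """Unit step function."""
    return 1.0 if x >= 0 else 0.0

def F_X(x):
    return (1/8) * u(x) + (3/8) * u(x - 1) + (3/8) * u(x - 2) + (1/8) * u(x - 3)

print(F_X(0.999), F_X(1.0))      # 0.125 just below the step, 0.5 at the step
print(F_X(1.0) - F_X(0.999))     # height of the step at x = 1 is P(X = 1) = 0.375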
Figure 3.4: The graphs show 150 repetitions of the experiments yielding 𝑋 and 𝑌 . It is clear that
𝑋 is centered about the value 5 while 𝑌 is centered about 0. It is also clear that 𝑋 is
more spread out than 𝑌 (Taken from Alberto Leon-Garcia, Probability, statistics, and
random processes for electrical engineering,3rd ed. Pearson, 2007).
Definition 3.5. Expected value: The expected value or expectation or mean of a discrete random
variable 𝑋 , with a probability mass function 𝑃𝑋 (𝑥) is defined by:
𝑚𝑋 = 𝐸 [𝑋 ] = Σ_{𝑘} 𝑥𝑘 𝑃𝑋 (𝑥𝑘 ) (3.9)
𝐸 [𝑋 ] provides a summary measure of the average value taken by the random variable and is also
known as the mean of the random variable. The expected value 𝐸 [𝑋 ] is defined if the above sum
converges absolutely, that is:
𝐸 [|𝑋 |] = Σ_{𝑘} |𝑥𝑘 | 𝑃𝑋 (𝑥𝑘 ) < ∞ (3.10)
otherwise the expected value does not exist.
Random variables with unbounded expected value are not uncommon and appear in models
where outcomes that have extremely large values are not that rare. Examples include the sizes
of files in Web transfers, frequencies of words in large bodies of text, and various financial and
economic problems.
If we view 𝑃𝑋 (𝑥) as the distribution of mass on the points 𝑥 1, 𝑥 2, ... on the real line, then 𝐸 [𝑋 ]
represents the center of mass of this distribution.
Example 3.4
Revisiting Example 3.1, let 𝑋 be the number of heads in three tosses of a fair coin. Find
𝐸 [𝑋 ].
Solution. 𝐸 [𝑋 ] = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5
The use of the term “expected value” does not mean that we expect to observe 𝐸 [𝑋 ] when we
perform the experiment that generates 𝑋 . For example, the expected value of the number of heads
in Example 3.4 is 1.5, but its outcomes can only be 0, 1, 2 or 3.
𝐸 [𝑋 ] can be explained as an average of 𝑋 in a large number of observations of 𝑋 . Suppose we
perform 𝑛 independent repetitions of the experiment that generates 𝑋 , and we record the observed
values as 𝑥 (1), 𝑥 (2), ..., 𝑥 (𝑛), where 𝑥 ( 𝑗) is the observation in the 𝑗 𝑡 ℎ experiment. Let 𝑁𝑘 (𝑛) be
the number of times 𝑥𝑘 is observed (𝑘 = 1, 2, ..., 𝐾), and let 𝑓𝑘 (𝑛) = 𝑁𝑘 (𝑛)/𝑛 be the corresponding
relative frequency. The arithmetic average, or sample mean of the observations, is:
⟨𝑋 ⟩𝑛 = (𝑥 (1) + 𝑥 (2) + ... + 𝑥 (𝑛))/𝑛 = (𝑥1 𝑁1 (𝑛) + 𝑥2 𝑁2 (𝑛) + ... + 𝑥𝐾 𝑁𝐾 (𝑛))/𝑛 (3.11)
= 𝑥1 𝑓1 (𝑛) + 𝑥2 𝑓2 (𝑛) + ... + 𝑥𝐾 𝑓𝐾 (𝑛) (3.12)
= Σ_{𝑘} 𝑥𝑘 𝑓𝑘 (𝑛) (3.13)
The first numerator adds the observations in the order in which they occur, and the second
numerator counts how many times each 𝑥𝑘 occurs and then computes the total. As 𝑛 becomes
large, we expect relative frequencies to approach the probabilities 𝑃𝑋 (𝑥𝑘 ):
⟨𝑋 ⟩𝑛 → Σ_{𝑘} 𝑥𝑘 𝑃𝑋 (𝑥𝑘 ) = 𝐸 [𝑋 ]
Definition 3.6. Variance: The variance of the random variable 𝑋 is defined as:
𝜎^{2}_{𝑋} = 𝑉𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝑚𝑋 )²]
The variance is a positive quantity that measures the spread of the distribution of the random
variable about its mean value. Larger values of the variance indicate that the distribution is more
spread out. For example in Figure 3.4, 𝑋 has a larger variance than 𝑌 .
Definition 3.7. Standard deviation: The standard deviation of the random variable 𝑋 is defined
by:
𝜎𝑋 = 𝑆𝑇𝐷 (𝑋 ) = 𝑉𝐴𝑅 [𝑋 ]^{1/2} (3.23)
By taking the square root of the variance, we obtain a quantity with the same units as 𝑋 .
An alternative expression for the variance can be obtained as follows:
𝑉𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝑚𝑋 )²] = 𝐸 [𝑋 ² − 2𝑚𝑋 𝑋 + 𝑚^{2}_{𝑋}] (3.24)
= 𝐸 [𝑋 ²] − 2𝑚𝑋 𝐸 [𝑋 ] + 𝑚^{2}_{𝑋} (3.25)
= 𝐸 [𝑋 ²] − 𝑚^{2}_{𝑋} (3.26)
𝐸 [𝑋 2 ] is called the second moment of 𝑋 .
Example 3.5
Revisiting Example 3.1, let 𝑋 be the number of heads in three tosses of a fair coin. Find
𝑉 𝐴𝑅 [𝑋 ].
Solution.
𝐸 [𝑋 ²] = Σ^{3}_{𝑘=0} 𝑘² 𝑃𝑋 (𝑘) = 0²(1/8) + 1²(3/8) + 2²(3/8) + 3²(1/8) = 3
𝑉 𝐴𝑅 [𝑋 ] = 𝐸 [𝑋 2 ] − (𝐸 [𝑋 ]) 2 = 3 − (1.5) 2 = 0.75
Let 𝑌 = 𝑋 + 𝑐, then:
𝑉𝐴𝑅 [𝑋 + 𝑐] = 𝐸 [(𝑋 + 𝑐 − (𝐸 [𝑋 ] + 𝑐))²] (3.27)
= 𝐸 [(𝑋 − 𝐸 [𝑋 ])²] = 𝑉𝐴𝑅 [𝑋 ] (3.28)
Adding a constant to a random variable does not affect the variance.
Let 𝑍 = 𝑐𝑋 , then:
𝑉𝐴𝑅 [𝑐𝑋 ] = 𝐸 [(𝑐𝑋 − 𝑐𝐸 [𝑋 ])²] (3.29)
= 𝐸 [𝑐²(𝑋 − 𝐸 [𝑋 ])²] (3.30)
= 𝑐² 𝑉𝐴𝑅 [𝑋 ] (3.31)
Scaling a random variable by 𝑐 scales the variance by 𝑐 2 and the standard deviation by |𝑐 |.
Note that a random variable that is equal to a constant 𝑋 = 𝑐 with probability 1 has zero variance: 𝑉𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝑐)²] = 𝐸 [0] = 0.
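These properties are easy to verify numerically. The sketch below uses the PMF of the number of heads in three fair coin tosses and an arbitrary constant 𝑐 = 4.

# Numerical check of the variance properties above.
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def mean(p):
    return sum(x * px for x, px in p.items())

def var(p):
    m = mean(p)
    return sum((x - m) ** 2 * px for x, px in p.items())

c = 4.0
shifted = {x + c: px for x, px in pmf.items()}   # Y = X + c
scaled = {c * x: px for x, px in pmf.items()}    # Z = cX

print(var(pmf))                        # 0.75
print(var(shifted))                    # 0.75       -> VAR[X + c] = VAR[X]
print(var(scaled), c**2 * var(pmf))    # 12.0, 12.0 -> VAR[cX] = c^2 VAR[X]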
Finally, Variance is a special case of central moments, for 𝑛 = 2, where we define 𝑛𝑡ℎ central
moment as follows.
Definition 3.9. Central Moments: The 𝑛𝑡ℎ central moment of a random variable is defined as:
𝐸 [(𝑋 − 𝑚𝑋 )𝑛 ].
As illustrated in Figure 3.5, the above expression has a nice intuitive interpretation: The conditional
probability of the event {𝑋 = 𝑥𝑘 } is given by the probabilities of outcomes 𝜁 for which both
𝑋 (𝜁 ) = 𝑥𝑘 and 𝜁 are in 𝐶, normalized by 𝑃 (𝐶).
The conditional PMF has the same properties as PMF. If 𝑆 is partitioned by 𝐴𝑘 = {𝑋 = 𝑥𝑘 }, then:
𝐶 = ⋃_{𝑘} (𝐴𝑘 ∩ 𝐶) and
Σ_{𝑥𝑘∈𝑆𝑋} 𝑃𝑋 |𝐶 (𝑥𝑘 ) = Σ_{𝑘} 𝑃𝑋 |𝐶 (𝑥𝑘 ) = Σ_{𝑘} 𝑃 ({𝑋 = 𝑥𝑘 } ∩ 𝐶)/𝑃 (𝐶)
= (1/𝑃 (𝐶)) Σ_{𝑘} 𝑃 (𝐴𝑘 ∩ 𝐶) = 𝑃 (𝐶)/𝑃 (𝐶) = 1
Most of the time the event 𝐶 is defined in terms of 𝑋 , for example 𝐶 = {𝑎 ≤ 𝑋 ≤ 𝑏}. For 𝑥𝑘 ∈ 𝑆𝑋 ,
we have the following result:
𝑃𝑋 |𝐶 (𝑥𝑘 ) = 𝑃𝑋 (𝑥𝑘 )/𝑃 (𝐶) if 𝑥𝑘 ∈ 𝐶, and 𝑃𝑋 |𝐶 (𝑥𝑘 ) = 0 if 𝑥𝑘 ∉ 𝐶 (3.34)
Example 3.6
Let 𝑋 be the number of heads in three tosses of a fair coin. Find the conditional PMF of 𝑋
given that we know the observed number was less than 2.
Solution. The conditioning event is 𝐶 = {𝑋 < 2}, which contains the values 0 and 1, so 𝑃 (𝐶) = 𝑃𝑋 (0) + 𝑃𝑋 (1) = 1/2. Therefore:
𝑃𝑋 |𝐶 (0) = 𝑃𝑋 (0)/𝑃 (𝐶) = (1/8)/(1/2) = 1/4.
𝑃𝑋 |𝐶 (1) = 𝑃𝑋 (1)/𝑃 (𝐶) = (3/8)/(1/2) = 3/4.
and 𝑃𝑋 |𝐶 (𝑥𝑘 ) is zero otherwise. Note that 𝑃𝑋 |𝐶 (0) + 𝑃𝑋 |𝐶 (1) = 1.
Many random experiments have natural ways of partitioning the sample space 𝑆 into the union
of disjoint events 𝐵 1, 𝐵 2, ..., 𝐵𝑛 . Let 𝑃𝑋 |𝐵𝑖 (𝑥) be the conditional PMF of 𝑋 given event 𝐵𝑖 . The
theorem on total probability allows us to find the PMF of 𝑋 in terms of the conditional PMFs:
𝑃𝑋 (𝑥) = Σ^{𝑛}_{𝑖=1} 𝑃𝑋 |𝐵𝑖 (𝑥)𝑃 (𝐵𝑖 ) (3.35)
Definition 3.11. Conditional Expected Value: Let 𝑋 be a discrete random variable, and suppose
that we know that event 𝐵 has occurred. The conditional expected value of 𝑋 given 𝐵 is defined as:
𝑚𝑋 |𝐵 = 𝐸 [𝑋 |𝐵] = Σ_{𝑥∈𝑆𝑥} 𝑥 𝑃𝑋 |𝐵 (𝑥) = Σ_{𝑘} 𝑥𝑘 𝑃𝑋 |𝐵 (𝑥𝑘 ) (3.36)
Let 𝐵1, 𝐵2, ..., 𝐵𝑛 be a partition of 𝑆. The expected value of 𝑋 can then be obtained from the conditional expected values:
𝐸 [𝑋 ] = Σ^{𝑛}_{𝑖=1} 𝐸 [𝑋 |𝐵𝑖 ]𝑃 (𝐵𝑖 )
where we first express 𝑃𝑋 (𝑥𝑘 ) in terms of the conditional PMFs, and we then change the order of summation. Using the same approach we can also show:
𝐸 [𝑔(𝑋 )] = Σ^{𝑛}_{𝑖=1} 𝐸 [𝑔(𝑋 )|𝐵𝑖 ]𝑃 (𝐵𝑖 ) (3.42)
Example 3.7
Let 𝑋 be the number of heads in three tosses of a fair coin. Find the expected value and
variance of 𝑋 , if we know that at least one head was observed.
The variance is quadratic in 𝑝, with value zero at 𝑝 = 0 and 𝑝 = 1 and maximum at 𝑝 = 1/2.
This agrees with intuition since values of 𝑝 close to 0 or to 1 imply a preponderance of
successes or failures and hence less variability in the observed values. The maximum
variability occurs when 𝑝 = 1/2, which corresponds to the case that is most difficult to predict. Every
Bernoulli trial, regardless of the event 𝐴, is equivalent to the tossing of a biased coin with
probability of heads 𝑝.
In fact, the order of the 1s and 0s in the sequence is irrelevant. Any outcome with exactly 𝑘 1s
and 𝑛 − 𝑘 0s would have the same probability. The number of outcomes in the event of exactly 𝑘
successes, is just the number of combinations of 𝑛 trials taken 𝑘 successes at a time.
Let 𝑘 be the number of successes in 𝑛 independent Bernoulli trials, then the probabilities of
𝑘 are given by the binomial probability law:
𝑃𝑛 (𝑘) = 𝐶^{𝑛}_{𝑘} 𝑝ᵏ (1 − 𝑝)^{𝑛−𝑘} for 𝑘 = 0, ..., 𝑛 (3.48)
where 𝐶^{𝑛}_{𝑘} is the binomial coefficient (see Equation 2.26).
Now let the random variable 𝑋 represent the number of successes occurred in the sequence of 𝑛
trials.
Definition 3.15. Binomial random variable: let 𝑋 be the number of times a certain event 𝐴
occurs in 𝑛 independent Bernoulli trials. 𝑋 is called the Binomial random variable.
For example, 𝑋 could be the number of heads in 𝑛 tosses of a coin (as seen in Examples 3.2 to 3.5,
where 𝑛 = 3 and 𝑝 = 1/2).
The mean of the binomial random variable can be written as:
𝐸 [𝑋 ] = Σ^{𝑛}_{𝑘=0} 𝑘 𝐶^{𝑛}_{𝑘} 𝑝ᵏ (1 − 𝑝)^{𝑛−𝑘} = 𝑛𝑝 Σ^{𝑛−1}_{𝑗=0} ((𝑛 − 1)!/(𝑗!(𝑛 − 1 − 𝑗)!)) 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗} (3.52)
= 𝑛𝑝 (3.53)
Note that the summation Σ^{𝑛−1}_{𝑗=0} ((𝑛 − 1)!/(𝑗!(𝑛 − 1 − 𝑗)!)) 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗} is equal to one, since it adds all the terms of a binomial PMF with parameters 𝑛 − 1 and 𝑝. A similar manipulation of the second moment gives:
𝐸 [𝑋 ²] = 𝑛𝑝 Σ^{𝑛−1}_{𝑗=0} (𝑗 + 1) 𝐶^{𝑛−1}_{𝑗} 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗} (3.56)
= 𝑛𝑝 (Σ^{𝑛−1}_{𝑗=0} 𝑗 𝐶^{𝑛−1}_{𝑗} 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗} + Σ^{𝑛−1}_{𝑗=0} 𝐶^{𝑛−1}_{𝑗} 𝑝ʲ (1 − 𝑝)^{𝑛−1−𝑗}) (3.57)
In Equation 3.57, the first sum is the mean of a binomial random variable with parameters 𝑛 − 1
and 𝑝, and hence equal to (𝑛 − 1)𝑝. The second sum is the sum of the binomial probabilities and
hence equal to 1. Therefore,
𝐸 [𝑋 ²] = 𝑛𝑝 (𝑛𝑝 + 1 − 𝑝) (3.58)
𝑉𝐴𝑅 [𝑋 ] = 𝐸 [𝑋 ²] − 𝐸 [𝑋 ]² = 𝑛𝑝 (𝑛𝑝 + 1 − 𝑝) − (𝑛𝑝)² = 𝑛𝑝 (1 − 𝑝) = 𝑛𝑝𝑞 (3.59)
We see that the variance of the binomial is 𝑛 times the variance of a Bernoulli random variable.
We observe that values of p close to 0 or to 1 imply smaller variance, and that the maximum
variability is when 𝑝 = 1/2.
The binomial random variable arises in applications where there are two types of objects (i.e.,
heads/tails, correct/erroneous bits, good/defective items, active/silent speakers), and we are
interested in the number of type 1 objects in a randomly selected batch of size 𝑛, where the type
of each object is independent of the types of the other objects in the batch.
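The formulas 𝐸 [𝑋 ] = 𝑛𝑝 and 𝑉𝐴𝑅 [𝑋 ] = 𝑛𝑝 (1 − 𝑝) can be checked numerically from the PMF. The sketch below uses one arbitrary choice of 𝑛 and 𝑝.

# Check of E[X] = np and VAR[X] = np(1-p) for one illustrative choice of n and p.
from math import comb

n, p = 10, 0.3                      # arbitrary example values
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
second_moment = sum(k**2 * pk for k, pk in enumerate(pmf))
variance = second_moment - mean**2

print(mean, n * p)                   # both approximately 3.0
print(variance, n * p * (1 - p))     # both approximately 2.1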
Example 3.8
Solution. 𝑋 is a binomial random variable, and the probability of 𝑘 errors in 𝑛 bit transmissions
is given by the PMF in Equation 3.60:
𝑃 (𝑋 ≤ 1) = 𝐶^{𝑛}_{0} 𝑝⁰ (1 − 𝑝)ⁿ + 𝐶^{𝑛}_{1} 𝑝¹ (1 − 𝑝)^{𝑛−1} = (1 − 𝑝)ⁿ + 𝑛𝑝 (1 − 𝑝)^{𝑛−1}
Note that the PMF decays geometrically with 𝑘, with ratio 𝑞 = 1 − 𝑝. As 𝑝 increases, the PMF decays more rapidly.
𝑃 (𝑋 ≤ 𝑘) = Σ^{𝑘}_{𝑗=1} 𝑞^{𝑗−1} 𝑝 = 𝑝 Σ^{𝑘−1}_{𝑗=0} 𝑞ʲ = 𝑝 (1 − 𝑞ᵏ)/(1 − 𝑞) = 1 − 𝑞ᵏ (3.61)
The mean is obtained by differentiating the geometric series Σ^{∞}_{𝑘=0} 𝑥ᵏ = 1/(1 − 𝑥) term by term to obtain:
1/(1 − 𝑥)² = Σ^{∞}_{𝑘=0} 𝑘𝑥^{𝑘−1} (3.64)
Letting 𝑥 = 𝑞:
𝐸 [𝑋 ] = 𝑝/(1 − 𝑞)² = 1/𝑝 (3.65)
which is finite as long as 𝑝 > 0.
We see that the mean and variance increase as 𝑝, the success probability, decreases.
Sometimes we are interested in 𝑀, the number of failures before a success occurs, also referred to as the modified geometric random variable. Its PMF is:
𝑃 (𝑀 = 𝑘) = (1 − 𝑝)ᵏ 𝑝, 𝑘 = 0, 1, 2, ... (3.69)
The geometric random variable is the only discrete random variable that satisfies the memoryless
property:
𝑃 (𝑋 ≥ 𝑘 + 𝑗 |𝑋 > 𝑗) = 𝑃 (𝑋 ≥ 𝑘) (3.70)
The above expression states that if a success has not occurred in the first 𝑗 trials, then the
probability of having to perform at least 𝑘 more trials is the same as the probability of initially
having to perform at least 𝑘 trials. Thus, each time a failure occurs, the system “forgets” and
begins anew as if it were performing the first trial.
The geometric random variable arises in applications where one is interested in the time (i.e.,
number of trials) that elapses between the occurrence of events in a sequence of independent
experiments. Examples where the modified geometric random variable arises are: number of
customers awaiting service in a queuing system; number of white dots between successive black
dots in a scan of a black-and-white document.
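The memoryless property is straightforward to verify numerically, since 𝑃 (𝑋 ≥ 𝑘) = 𝑞^{𝑘−1} for the geometric random variable. The sketch below uses arbitrary values of 𝑝, 𝑗, and 𝑘.

# Illustration of the memoryless property (3.70) for the geometric random variable
# (number of trials up to and including the first success).
p = 0.2
q = 1 - p

def P_geq(k):
    """P(X >= k) = q**(k-1), k = 1, 2, ..."""
    return q ** (k - 1)

j, k = 3, 5
lhs = P_geq(k + j) / P_geq(j + 1)     # P(X >= k + j | X > j), since {X > j} = {X >= j + 1}
rhs = P_geq(k)
print(lhs, rhs)                        # both approximately 0.8**4 = 0.4096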
Example 3.9
A production line yields two types of devices. Type 1 devices occur with probability 𝛼
and work for a relatively short time that is geometrically distributed with parameter 𝑟 .
Type 2 devices work much longer, occur with probability 1 − 𝛼 and have a lifetime that is
geometrically distributed with parameter 𝑠. Let 𝑋 be the lifetime of an arbitrary device. Find
the PMF, mean and variance of 𝑋 .
Solution. The random experiment that generates 𝑋 involves selecting a device type and then
observing its lifetime. We can partition the sets of outcomes in this experiment into event
𝐵 1 consisting of those outcomes in which the device is type 1, and 𝐵 2 consisting of those
outcomes in which the device is type 2. From the theorem of total probability:
𝑃𝑋 (𝑘) = 𝑃𝑋 |𝐵1 (𝑘)𝑃 (𝐵1 ) + 𝑃𝑋 |𝐵2 (𝑘)𝑃 (𝐵2 ) = 𝛼 𝑟 (1 − 𝑟 )^{𝑘−1} + (1 − 𝛼) 𝑠 (1 − 𝑠)^{𝑘−1}, 𝑘 = 1, 2, ...
The conditional mean and second moment of each device type is that of a geometric random
variable with the corresponding parameter:
𝐸 [𝑋 |𝐵 1 ] = 1/𝑟
𝐸 [𝑋 |𝐵 2 ] = 1/𝑠
𝐸 [𝑋 ²|𝐵1 ] = (1 + 1 − 𝑟 )/𝑟 ²
𝐸 [𝑋 ²|𝐵2 ] = (1 + 1 − 𝑠)/𝑠²
Combining these with the theorem of total probability (as in Equation 3.42) gives:
𝐸 [𝑋 ] = 𝛼/𝑟 + (1 − 𝛼)/𝑠
𝐸 [𝑋 ²] = 𝛼 (1 + 1 − 𝑟 )/𝑟 ² + (1 − 𝛼)(1 + 1 − 𝑠)/𝑠²
and 𝑉𝐴𝑅 [𝑋 ] = 𝐸 [𝑋 ²] − 𝐸 [𝑋 ]².
Note that we do not use the conditional variances to find 𝑉 𝐴𝑅 [𝑋 ], since the Equation 3.42
does not similarly apply to the conditional variances.
The Poisson random variable counts the number of occurrences of an event in a given time interval or region of space. Its PMF is:
𝑃𝑋 (𝑘) = (𝛼ᵏ/𝑘!) 𝑒^{−𝛼}, 𝑘 = 0, 1, 2, ... (3.71)
where 𝛼 is the average number of event occurrences in a specified time interval or region
in space. The PMF sums to one, as required, since:
Σ^{∞}_{𝑘=0} (𝛼ᵏ/𝑘!) 𝑒^{−𝛼} = 𝑒^{−𝛼} Σ^{∞}_{𝑘=0} 𝛼ᵏ/𝑘! = 𝑒^{−𝛼} 𝑒^{𝛼} = 1
where we used the fact that the second summation is the infinite series expansion for 𝑒 𝛼 .
The mean and variance of the Poisson random variable are:
𝐸 [𝑋 ] = 𝛼 (3.72)
𝑉 𝐴𝑅 [𝑋 ] = 𝛼 (3.73)
One of the applications of the Poisson probabilities is to approximate the binomial probabilities
when the number of repeated trials, 𝑛 , is very large and the probability of success in each
individual trial,𝑝 , is very small. Then the binomial random variable can be well approximated by
a Poisson random variable. That is, the Poisson random variable is a limiting case of the binomial
random variable. Let 𝑛 approach infinity and 𝑝 approach 0 in such a way that lim𝑛→∞ 𝑛𝑝 = 𝛼,
then the binomial PMF converges to the PMF of Poisson random variable:
𝐶^{𝑛}_{𝑘} 𝑝ᵏ (1 − 𝑝)^{𝑛−𝑘} → (𝛼ᵏ/𝑘!) 𝑒^{−𝛼}, for 𝑘 = 0, 1, 2, ... (3.74)
The Poisson random variable appears in numerous physical situations because many models are
very large in scale and involve very rare events. For example, the Poisson PMF gives an accurate
prediction for the relative frequencies of the number of particles emitted by a radioactive mass
during a fixed time period.
The Poisson random variable also comes up in situations where we can imagine a sequence of
Bernoulli trials taking place in time or space. Suppose we count the number of event occurrences
in a T-second interval. Divide the time interval into a very large number, 𝑛, of sub-intervals. A
pulse in a sub-interval indicates the occurrence of an event. Each sub-interval can be viewed as
one in a sequence of independent Bernoulli trials if the following conditions hold: (1) At most one
event can occur in a sub-interval, that is, the probability of more than one event occurrence is
negligible; (2) the outcomes in different sub-intervals are independent; and (3) the probability of
an event occurrence in a sub-interval is 𝑝 = 𝛼/𝑛 where 𝛼 is the average number of events observed
in a 1-second interval. The number 𝑁 of events in 1 second is a binomial random variable with
parameters 𝑛 and 𝑝 = 𝛼/𝑛. Thus as 𝑛 → ∞ 𝑁 becomes a Poisson random variable with parameter
𝛼.
Example 3.10
A communication system transmits 𝑛 = 10⁹ bits in one second, and each bit is received in error with probability 𝑝 = 10⁻⁹, independently of the other bits. Find the probability of five or more errors occurring in one second.
Solution. Each bit transmission corresponds to a Bernoulli trial with a “success” correspond-
ing to a bit error in transmission. The probability of 𝑘 errors in 𝑛 = 10⁹ transmissions (1 second) is then given by the binomial probability with 𝑛 = 10⁹ and 𝑝 = 10⁻⁹.
The Poisson approximation uses 𝛼 = 𝑛𝑝 = 10⁹ × 10⁻⁹ = 1. Thus:
𝑃 (𝑋 ≥ 5) = 1 − 𝑃 (𝑋 < 5) = 1 − Σ^{4}_{𝑘=0} (𝛼ᵏ/𝑘!) 𝑒^{−𝛼}
= 1 − 𝑒^{−1} (1 + 1/1! + 1/2! + 1/3! + 1/4!) = 0.00366
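The quality of the Poisson approximation can be illustrated numerically. Computing the exact binomial tail for 𝑛 = 10⁹ is impractical, so the sketch below compares the Poisson tail with a binomial tail that has a more modest 𝑛 but the same 𝛼 = 𝑛𝑝 = 1.

# Poisson approximation versus a binomial with the same alpha = n*p = 1.
from math import comb, exp, factorial

alpha = 1.0
poisson_tail = 1 - sum(exp(-alpha) * alpha**k / factorial(k) for k in range(5))

n = 1000                       # smaller n chosen only to make the exact sum practical
p = alpha / n
binom_tail = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5))

print(poisson_tail)   # approximately 0.00366
print(binom_tail)     # close to the Poisson value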
The discrete uniform random variable takes each value in 𝑆𝑋 = {1, 2, ..., 𝐿} with probability 1/𝐿. Its mean and variance are:
𝐸 [𝑋 ] = (𝐿 + 1)/2 (3.75)
𝑉𝐴𝑅 [𝑋 ] = (𝐿² − 1)/12 (3.76)
This random variable occurs whenever outcomes are equally likely, e.g., toss of a fair coin or a
fair die, spinning of an arrow in a wheel divided into equal segments, selection of numbers from
an urn.
Example 3.11
Let 𝑋 be the time required to transmit a message, where 𝑋 is a uniform random variable
with S_X = {1, ..., L}. Suppose that a message has already been transmitting for m time units;
find the probability that the remaining transmission time is j time units, and the expected
value of the remaining transmission time.
P_{X|C}(m + j) = P(X = m + j) / P(X > m) = (1/L) / ((L − m)/L) = 1/(L − m),   for m + 1 ≤ m + j ≤ L

E[X|C] = Σ_{j=m+1}^{L} j · (1/(L − m)) = (L + m + 1)/2
The expectation can also be obtained directly from Equation 3.75: given C, the remaining
transmission time is uniform on {1, ..., L − m}, so its expected value is (L − m + 1)/2, and adding
the m time units already elapsed gives (L + m + 1)/2.
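As a quick sanity check, the short sketch below tabulates the conditional PMF and computes E[X|C] directly; the values L = 10 and m = 3 are arbitrary choices for the illustration, not part of the example.

```python
# Numerical check of Example 3.11: conditional PMF and conditional mean.
from fractions import Fraction

L, m = 10, 3
# Conditional PMF of X given C = {X > m}: uniform over {m+1, ..., L}
pmf_given_C = {x: Fraction(1, L - m) for x in range(m + 1, L + 1)}

E_X_given_C = sum(x * p for x, p in pmf_given_C.items())
print(E_X_given_C)                # 7, computed from the conditional PMF
print(Fraction(L + m + 1, 2))     # formula from the example: (L + m + 1)/2
```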
Definition 3.18. Continuous random variable: A random variable whose CDF 𝐹𝑋 (𝑥) is contin-
uous everywhere, and which, in addition, is sufficiently smooth that it can be written as an integral
of some non-negative function 𝑓 (𝑥):
F_X(x) = ∫_{−∞}^{x} f(t) dt   (3.77)
A random variable of mixed type has a CDF of the form F_X(x) = p F_1(x) + (1 − p) F_2(x),
where 0 < p < 1 and F_1(x) is the CDF of a discrete random variable and F_2(x) is the CDF of a
continuous random variable. Random variables of mixed type can be viewed as being produced
by a two-step process: A coin is tossed; if the outcome of the toss is heads, a discrete random
variable is generated according to 𝐹 1 (𝑥) otherwise, a continuous random variable is generated
according to 𝐹 2 (𝑥).
f_X(x) = dF_X(x)/dx   (3.78)
The PDF represents the “density” of probability at the point 𝑥 in the following sense: The
probability that 𝑋 is in a small interval in the vicinity of 𝑥, i.e. 𝑥 < 𝑋 ≤ 𝑥 + ℎ, is:
P(x < X ≤ x + h) = F_X(x + h) − F_X(x) = [(F_X(x + h) − F_X(x)) / h] · h   (3.79)
If the CDF has a derivative at x, then as h becomes very small,
P(x < X ≤ x + h) ≈ f_X(x) h   (3.80)
Thus f_X(x) represents the “density” of probability at the point x in the sense that the probability that
X is in a small interval in the vicinity of x is approximately f_X(x)h. The derivative of the CDF,
when it exists, is positive since the CDF is a non-decreasing function of 𝑥, thus:
𝑓𝑋 (𝑥) ≥ 0 (3.81)
Note that the PDF specifies the probabilities of events of the form “𝑋 falls in a small interval of
width 𝑑𝑥 about the point 𝑥”. Therefore probabilities of events involving 𝑋 in a certain range can
be expressed in terms of the PDF by adding the probabilities of intervals of width 𝑑𝑥. As the
widths of the intervals approach zero, we obtain an integral in terms of the PDF:
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx   (3.82)
The probability of an interval is therefore the area under 𝑓𝑋 (𝑥) in that interval.
Figure 3.6: (a) The probability density function specifies the probability of intervals of infinitesimal
width. (b) The probability of an interval [𝑎, 𝑏] is the area under the PDF in that interval.
(Taken from Alberto Leon-Garcia, Probability, statistics, and random processes for
electrical engineering,3rd ed. Pearson, 2007)
The probability of any event that consists of the union of disjoint intervals can thus be found by
adding the integrals of the PDF over each of the intervals.
The CDF of 𝑋 can be obtained by integrating the PDF:
F_X(x) = ∫_{−∞}^{x} f_X(t) dt   (3.83)
Since the probabilities of all events involving 𝑋 can be written in terms of the CDF, it then follows
that these probabilities can be written in terms of the PDF. Thus the PDF completely specifies the
behavior of continuous random variables.
By letting 𝑥 tend to infinity in Equation 3.83, we obtain:
1 = ∫_{−∞}^{∞} f_X(t) dt   (3.84)
A valid PDF can be formed by normalising any non-negative, piecewise continuous function 𝑔(𝑥)
that has a finite integral over all real values of 𝑥.
Example 3.12
Let f_X(x) = βx^2 for −1 ≤ x ≤ 2 and zero otherwise. Find β so that f_X(x) is a valid PDF.
Solution. We require:
1 = ∫_{−∞}^{∞} f_X(t) dt = ∫_{−1}^{2} β x^2 dx = (β/3)(8 + 1) = 3β
so that β = 1/3.
Recall that the delta function 𝛿 (𝑥) is zero everywhere except at 𝑥 = 0, where it is unbounded. To
maintain the right continuity of the step function at 0, we use the convention:
u(0) = 1 = ∫_{−∞}^{0} δ(t) dt   (3.87)
f_X(x) = dF_X(x)/dx = Σ_k P_X(x_k) δ(x − x_k)   (3.88)
Thus the generalized definition of PDF places a delta function of weight 𝑃 (𝑋 = 𝑥𝑘 ) at the points
𝑥𝑘 where the CDF is discontinuous.
Example 3.13
F_{X|C}(x) = P({X ≤ x} ∩ C) / P(C)   (3.89)
and satisfies all the properties of a CDF.
The conditional PDF of 𝑋 given 𝐶 is then defined by:
f_{X|C}(x) = d/dx F_{X|C}(x)   (3.90)
Example 3.14
The lifetime 𝑋 of a machine has a continuous CDF 𝐹𝑋 (𝑥). Find the conditional CDF and
PDF given the event 𝐶 = {𝑋 > 𝑡 } (i.e., “machine is still working at time 𝑡”).
Solution. For x > t,
F_{X|C}(x) = P({X ≤ x} ∩ {X > t}) / P(X > t) = (F_X(x) − F_X(t)) / (1 − F_X(t))
and differentiating with respect to x gives
f_{X|C}(x) = f_X(x) / (1 − F_X(t)),   x > t
Now suppose that we have a partition of the sample space 𝑆 into the union of disjoint events
𝐵 1, 𝐵 2, ..., 𝐵𝑛 . Let 𝐹𝑋 |𝐵𝑖 (𝑥) be the conditional CDF of 𝑋 given event 𝐵𝑖 . The theorem on total
probability allows us to find the CDF of 𝑋 in terms of the conditional CDFs:
F_X(x) = P(X ≤ x) = Σ_{i=1}^{n} P(X ≤ x | B_i) P(B_i) = Σ_{i=1}^{n} F_{X|B_i}(x) P(B_i)   (3.91)
The expected value of a continuous random variable X is defined as
E[X] = ∫_{−∞}^{+∞} t f_X(t) dt   (3.93)
The expected value E[X] is defined if the above integral converges absolutely, that is,
E[|X|] = ∫_{−∞}^{+∞} |t| f_X(t) dt < ∞
We already discussed 𝐸 [𝑋 ] for discrete random variables in detail, but the definition in Equation
3.93 is applicable if we express the PDF of a discrete random variable using delta (𝛿) functions:
E[X] = ∫_{−∞}^{+∞} t Σ_k P_X(x_k) δ(t − x_k) dt
= Σ_k P_X(x_k) ∫_{−∞}^{+∞} t δ(t − x_k) dt
= Σ_k P_X(x_k) x_k
Example 3.15
The PDF of the uniform random variable is a constant value over a certain range [a, b] and zero
elsewhere. Find its expected value.
Solution.
E[X] = ∫_a^b t · (1/(b − a)) dt = (a + b)/2
which is the midpoint of the interval [𝑎, 𝑏].
The result in Example 3.15 could have been found immediately by noting that 𝐸 [𝑋 ] = 𝑚 when
the PDF is symmetric about a point 𝑚, i.e. 𝑓𝑋 (𝑚 − 𝑥) = 𝑓𝑋 (𝑚 + 𝑥) for all 𝑥, then assuming that
the mean exists,
0 = ∫_{−∞}^{+∞} (m − t) f_X(t) dt = m − ∫_{−∞}^{+∞} t f_X(t) dt
The first equality above follows from the symmetry of 𝑓𝑋 (𝑡) about 𝑡 = 𝑚 and the odd symmetry
of (𝑚 − 𝑡) about the same point. We then have that 𝐸 [𝑋 ] = 𝑚.
The following expressions are useful when 𝑋 is a nonnegative random variable:
E[X] = ∫_0^∞ (1 − F_X(t)) dt   if X is continuous and nonnegative   (3.94)

E[X] = Σ_{k=0}^{∞} P(X > k)   if X is nonnegative and integer-valued   (3.95)
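Equation 3.95 is easy to verify numerically for a specific case. The sketch below uses a geometric random variable with P(X = k) = p(1 − p)^{k−1}, k = 1, 2, ... (one convenient nonnegative, integer-valued example; p = 0.3 is an arbitrary choice) and compares the direct mean with the sum of tail probabilities.

```python
# Numerical illustration of E[X] = sum_k P(X > k) for a geometric random variable.
p = 0.3
mean_direct = sum(k * p * (1 - p)**(k - 1) for k in range(1, 2000))   # E[X] = 1/p
mean_via_tail = sum((1 - p)**k for k in range(0, 2000))               # sum of P(X > k) = (1-p)^k

print(mean_direct, mean_via_tail, 1 / p)   # all three approximately 3.3333
```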
Example 3.16
Let Y = aX + b, where a and b are constants. Find E[Y].
Solution.
E[Y] = E[aX + b] = ∫_{−∞}^{+∞} (ax + b) f_X(x) dx = a ∫_{−∞}^{+∞} x f_X(x) dx + b = aE[X] + b
In general, expectation is a linear operation, and the expectation operator can be interchanged
with any other linear operation. For any linear combination of functions:
E[Σ_k a_k g_k(X)] = ∫_{−∞}^{∞} (Σ_k a_k g_k(x)) f_X(x) dx = Σ_k a_k ∫_{−∞}^{∞} g_k(x) f_X(x) dx = Σ_k a_k E[g_k(X)]   (3.97)
Moments
Definition 3.22. Moment: The 𝑛𝑡ℎ moment of a continuous random variable 𝑋 is defined as:
E[X^n] = ∫_{−∞}^{+∞} x^n f_X(x) dx   (3.98)
The zeroth moment is simply the area under the PDF and must be one for any random variable.
The most commonly used moments are the first and second moments. The first moment is the
expected value. For some random variables, the second moment might be a more meaningful
characterization than the first. For example, suppose 𝑋 is a sample of a noise waveform. We might
expect that the distribution of the noise is symmetric about zero and hence the first moment will
be zero. It only shows that the noise does not have a bias. However, the second moment of the
random noise is in some sense a measure of the strength of the noise, which can give us some
useful physical insight into the power of the noise.
Under certain conditions, a PDF is completely specified if the expected values of all the moments
of 𝑋 are known.
Variance
Similar to the definition of variance for discrete random variables, for continuous random variables
𝑋 , the variance is defined as:
𝑉 𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝐸 [𝑋 ]) 2 ] = 𝐸 [𝑋 2 ] − 𝐸 [𝑋 ] 2 (3.99)
Example 3.17
Find the variance of the continuous uniform random variable in Example 3.15.
Solution.
VAR[X] = ∫_a^b (x − (a + b)/2)^2 · (1/(b − a)) dx
Let y = x − (a + b)/2, then
VAR[X] = (1/(b − a)) ∫_{−(b−a)/2}^{(b−a)/2} y^2 dy = (b − a)^2 / 12
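The two results for the uniform random variable can be checked with a simple Riemann-sum integration; the endpoints a = 2 and b = 5 below are arbitrary values chosen for the check.

```python
# Numerical check of Examples 3.15 and 3.17 for a uniform random variable on [a, b].
a, b = 2.0, 5.0
N = 200_000
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]
f = 1.0 / (b - a)                      # uniform PDF on [a, b]

mean = sum(x * f * dx for x in xs)
second_moment = sum(x**2 * f * dx for x in xs)
var = second_moment - mean**2

print(mean, (a + b) / 2)               # ~3.5 vs (a+b)/2
print(var, (b - a)**2 / 12)            # ~0.75 vs (b-a)^2/12
```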
The properties derived in section 3.2.3 can be similarly derived for the variance of continuous
random variables:
𝑉 𝐴𝑅 [𝑐] = 0 (3.100)
𝑉 𝐴𝑅 [𝑋 + 𝑐] = 𝑉 𝐴𝑅 [𝑋 ] (3.101)
𝑉 𝐴𝑅 [𝑐𝑋 ] = 𝑐 2𝑉 𝐴𝑅 [𝑋 ] (3.102)
where 𝑐 is a constant.
The mean and variance are the two most important parameters used in summarizing the PDF
of a random variable. Other parameters and moments are occasionally used. For example, the
skewness defined by 𝐸 [(𝑋 − 𝐸 [𝑋 ]) 3 ]/𝑆𝑇 𝐷 [𝑋 ] 3 measures the degree of asymmetry about the
mean. It is easy to show that if a PDF is symmetric about its mean, then its skewness is zero.
The point to note with these parameters of the PDF is that each involves the expected value of a
higher power of 𝑋 .
• and CDF:
F_U(x) = 0 for x < a,   F_U(x) = (x − a)/(b − a) for a ≤ x ≤ b,   F_U(x) = 1 for x > b   (3.104)
• and CDF:
F_X(x) = 1 − e^{−λx} for x ≥ 0,   and F_X(x) = 0 for x < 0   (3.108)
The parameter λ is the rate at which events occur, so F_X(x), the probability of an event
occurring by time x, increases as the rate λ increases.
E[X] = ∫_0^∞ t λ e^{−λt} dt = [−t e^{−λt}]_0^∞ + ∫_0^∞ e^{−λt} dt
= lim_{t→∞} (−t e^{−λt}) − 0 + [−e^{−λt}/λ]_0^∞
= lim_{t→∞} (−e^{−λt}/λ) + 1/λ = 1/λ   (3.110)
where we have used the fact that 𝑒 −𝜆𝑡 and 𝑡𝑒 −𝜆𝑡 go to zero as 𝑡 approaches infinity.
In event inter-arrival situations, 𝜆 is in units of events/second and 1/𝜆 is in units of seconds per
event inter-arrival.
The exponential random variable satisfies the memoryless property:
P(X > t + h | X > t) = P(X > h)
The expression on the left side is the probability of having to wait at least h additional seconds
given that one has already been waiting t seconds. The expression on the right side is the probability
of waiting at least ℎ seconds when one first begins to wait. Thus the probability of waiting at
least an additional ℎ seconds is the same regardless of how long one has already been waiting!
This property can be proved as follows:
P(X > t + h | X > t) = P({X > t + h} ∩ {X > t}) / P(X > t)   for h > 0
= P(X > t + h) / P(X > t) = e^{−λ(t+h)} / e^{−λt}
= e^{−λh} = P(X > h)
The memoryless property of the exponential random variable makes it the cornerstone for the
theory of Markov chains, which is used extensively in evaluating the performance of computer
systems and communications networks. It can be shown that the exponential random variable is
the only continuous random variable that satisfies the memoryless property.
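A short simulation makes the memoryless property tangible. The sketch below (λ, t and h are arbitrary illustration values) estimates P(X > t + h | X > t) and P(X > h) from exponential samples and compares both with e^{−λh}.

```python
# Simulation sketch of the memoryless property of the exponential random variable.
import math
import random

lam, t, h = 2.0, 0.5, 0.3
samples = [random.expovariate(lam) for _ in range(1_000_000)]

survived_t = [x for x in samples if x > t]
p_conditional = sum(x > t + h for x in survived_t) / len(survived_t)  # P(X > t+h | X > t)
p_fresh = sum(x > h for x in samples) / len(samples)                  # P(X > h)

print(p_conditional, p_fresh, math.exp(-lam * h))   # all approximately e^{-lambda*h}
```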
F_X(x) = (1/√(2π)) ∫_{−∞}^{(x−m)/σ} e^{−t^2/2} dt = Φ((x − m)/σ)   (3.114)
where Φ(x) is the CDF of a Gaussian random variable with m = 0 and σ = 1:
Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t^2/2} dt   (3.115)
Therefore any probability involving an arbitrary Gaussian random variable can be expressed
in terms of Φ(𝑥).
• Note that the PDF of a Gaussian random variable is symmetric about the point 𝑚. Therefore
the mean is 𝐸 [𝑋 ] = 𝑚 (as also defined above).
Q(x) = 1 − Φ(x) = (1/√(2π)) ∫_x^∞ e^{−t^2/2} dt   (3.116)
Q(x) is simply the probability of the “tail” of the PDF. The symmetry of the PDF implies that:
Q(0) = 1/2  and  Q(−x) = 1 − Q(x)   (3.117)
From Equation 3.114, which corresponds to P(X ≤ x), the following can be derived:
P(X > x) = Q((x − m)/σ)   (3.118)
Figure 3.8: Standardized integrals related to the Gaussian CDF and the Φ and 𝑄 functions.
Figure 3.8 shows the standardized integrals related to the Gaussian CDF and the Φ and 𝑄 functions.
It can be shown that it is impossible to express the CDF integral in closed form. However, as with
other important integrals that cannot be expressed in closed form (e.g., Bessel functions), one can
always look up values of the required CDF in tables, or use numerical approximations of the
desired integral to any desired accuracy. The following expression has been found to give
good accuracy for Q(x) over the entire range 0 < x < ∞:
Q(x) ≈ [1 / ((1 − a)x + a√(x^2 + b))] · (1/√(2π)) e^{−x^2/2}   (3.119)
where a = 1/π and b = 2π.
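The accuracy of Equation 3.119 can be checked against the exact Q function, which is available through the complementary error function as Q(x) = erfc(x/√2)/2. A minimal sketch:

```python
# Comparing the approximation in Equation 3.119 with the exact Q function.
import math

def Q_exact(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

def Q_approx(x, a=1/math.pi, b=2*math.pi):
    return (1.0 / ((1 - a) * x + a * math.sqrt(x**2 + b))) \
           * math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

for x in [0.5, 1.0, 2.0, 3.0, 4.0]:
    print(x, Q_exact(x), Q_approx(x))   # the two columns agree closely
```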
In some problems, we are interested in finding the value of 𝑥 for which 𝑄 (𝑥) = 10−𝑘 . Table 3.1
gives these values for 𝑘 = 1, ..., 10.
The Gaussian random variable plays a very important role in communication systems, where
transmission signals are corrupted by noise voltages resulting from the thermal motion of electrons.
It can be shown from physical principles that these voltages will have a Gaussian PDF.
Example 3.18
𝐸 [𝑋 ] = 𝛼/𝜆 (3.124)
𝑉 𝐴𝑅 [𝑋 ] = 𝛼/𝜆 2 (3.125)
The versatility of the gamma random variable is due to the richness of the gamma function Γ(𝛼).
The PDF of the gamma random variable can assume a variety of shapes as shown in Figure 3.9. By
varying the parameters 𝜆 and 𝛼 it is possible to fit the gamma PDF to many types of experimental
data. The exponential random variable is obtained by letting 𝛼 = 1. By letting 𝜆 = 1/2 and 𝛼 = 𝑘/2,
where 𝑘 is a positive integer, we obtain the Chi-square random variable, which appears in
certain statistical problems and wireless communications applications. The m-Erlang random
variable is obtained when α = m, a positive integer. The m-Erlang random variable is used in
system reliability models and in queueing system models, and plays a fundamental role in the
study of wireline telecommunication networks.
In general, the CDF of the gamma random variable does not have a closed-form expression.
However, the special case of the m-Erlang random variable does have a closed-form expression.
Example 3.19
The mean height of children in a kindergarten class is 70 cm. Find the bound on the proba-
bility that a kid in the class is taller than 140 cm.
Solution. By the Markov inequality, P(H ≥ 140) ≤ E[H]/140 = 70/140 = 0.5.
The bound in the above example appears to be ridiculous. However, a bound, by its very nature,
must take the worst case into consideration. One can easily construct a random variable for which
the bound given by the Markov inequality is exact. The reason we know that the bound in the
above example is ridiculous is that we have knowledge about the variability of the children’s
height about their mean.
Definition 3.24. Chebyshev inequality: Suppose that the mean 𝐸 [𝑋 ] = 𝑚 and the variance
𝑉 𝐴𝑅 [𝑋 ] = 𝜎 2 of a random variable are known, and that we are interested in bounding 𝑃 (|𝑋 −𝑚| ≥ 𝑎).
The Chebyshev inequality states that:
P(|X − m| ≥ a) ≤ σ^2 / a^2   (3.127)
The inequality follows by applying the Markov inequality to the nonnegative random variable
D^2 = (X − m)^2, and noting that {D^2 ≥ a^2} and {|X − m| ≥ a} are equivalent events. Suppose that a random variable
X has zero variance; then the Chebyshev inequality implies that P(X = m) = 1, i.e. the random
variable is equal to its mean with probability one, and hence is constant in almost all experiments.
Example 3.20
If X is a Gaussian random variable with mean m and variance σ^2, find the upper bound for
P(|X − m| ≥ kσ) according to the Chebyshev inequality.
Solution. Setting a = kσ in Equation 3.127 gives P(|X − m| ≥ kσ) ≤ σ^2/(kσ)^2 = 1/k^2.
We see that for certain random variables, the Chebyshev inequality can give rather loose bounds.
Nevertheless, the inequality is useful in situations in which we have no knowledge about the
distribution of a given random variable other than its mean and variance. We will later use the
Chebyshev inequality to prove that the arithmetic average of independent measurements of the
same random variable is highly likely to be close to the expected value of the random variable
when the number of measurements is large.
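To see how loose the bound of Example 3.20 can be, the sketch below compares 1/k^2 with the exact Gaussian tail probability P(|X − m| ≥ kσ) = 2Q(k) = erfc(k/√2).

```python
# Chebyshev bound vs exact Gaussian tail probability.
import math

for k in [1, 2, 3, 4]:
    chebyshev_bound = 1 / k**2                      # P(|X - m| >= k*sigma) <= 1/k^2
    gaussian_exact = math.erfc(k / math.sqrt(2))    # = 2*Q(k) for a Gaussian X
    print(k, chebyshev_bound, gaussian_exact)
```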
If more information is available than just the mean and variance, then it is possible to obtain
bounds that are tighter than the Markov and Chebyshev inequalities. Consider the Markov
inequality again. The region of interest is 𝐴 = {𝑡 ≥ 𝑎}, so let 𝐼𝐴 (𝑡) be the indicator function, i.e.
𝐼𝐴 (𝑡) = 1 if 𝑡 ∈ 𝐴 and 𝐼𝐴 (𝑡) = 0 otherwise. The key step in the derivation is to note that 𝑡/𝑎 ≥ 1
in the region of interest. In effect we bounded 𝐼𝐴 (𝑡) by 𝑡/𝑎 and then have:
P(X ≥ a) = ∫_0^∞ I_A(t) f_X(t) dt ≤ ∫_0^∞ (t/a) f_X(t) dt = E[X]/a
By changing the upper bound on 𝐼𝐴 (𝑡), we can obtain different bounds on 𝑃 (𝑋 ≥ 𝑎). Consider
the bound 𝐼𝐴 (𝑡) ≤ 𝑒 𝑠 (𝑡 −𝑎) , also shown in Figure 3.10, where 𝑠 > 0 then the following bound can
be obtained.
Definition 3.25. Chernoff bound: Suppose 𝑋 is a random variable, then:
P(X ≥ a) ≤ ∫_0^∞ e^{s(t−a)} f_X(t) dt = e^{−sa} E[e^{sX}]   (3.128)
This bound is called the Chernoff bound, which can be seen to depend on the expected value of an
exponential function of 𝑋 . This function is called the moment generating function.
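As an illustration (this case is not worked in the notes), the sketch below evaluates the Chernoff bound for a zero-mean, unit-variance Gaussian random variable, for which E[e^{sX}] = e^{s^2/2} and the integration extends over the whole real line; the bound is then tightened by a simple numerical search over s and compared with the exact tail Q(a).

```python
# Chernoff bound for a standard Gaussian, optimized numerically over s.
import math

def chernoff_bound(a, s):
    return math.exp(-s * a) * math.exp(s**2 / 2)    # e^{-sa} E[e^{sX}] with E[e^{sX}] = e^{s^2/2}

a = 3.0
s_values = [i * 0.01 for i in range(1, 1000)]
best = min(chernoff_bound(a, s) for s in s_values)  # optimum at s = a, giving e^{-a^2/2}

exact = 0.5 * math.erfc(a / math.sqrt(2))           # exact tail Q(a)
print(best, math.exp(-a**2 / 2), exact)             # ~0.0111, 0.0111, 0.00135
```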
Further Reading
1. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: chapters 3 and 4
2. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, 2nd ed., Elsevier 2012: section 2.8 and 2.9, and
chapters 3 and 4.
3. Anthony Hayter, Probability and Statistics for Engineers and Scientists, 4th ed., Brooks/Cole,
Cengage Learning 2012: chapters 2 to 5.
4 Two or More Random Variables
Many random experiments involve several random variables. In some experiments a number of
different quantities are measured. For example, the voltage signals at several points in a circuit at
some specific time may be of interest. Other experiments involve the repeated measurement of
a certain quantity such as the repeated measurement (“sampling”) of the amplitude of an audio
or video signal that varies with time. In this chapter, we extend the random variable concepts
already introduced to two or more random variables. In a sense we have already covered all the
fundamental concepts of probability and random variables, and we are “simply” elaborating on the
case of two or more random variables. Nevertheless, there are significant analytical techniques
that need to be learned.
𝐹𝑋 ,𝑌 (−∞, −∞) = 0
𝐹𝑋 ,𝑌 (−∞, 𝑦) = 0
𝐹𝑋 ,𝑌 (𝑥, −∞) = 0
𝐹𝑋 ,𝑌 (∞, ∞) = 1
• Consider using a joint CDF to evaluate the probability that the pair of random variables
(X, Y) falls into a rectangular region bounded by the points (x_1, y_1), (x_2, y_1), (x_1, y_2) and
(x_2, y_2) (the white rectangle in Figure 4.1). Evaluating F_{X,Y}(x_2, y_2) gives the probability that the
random variable falls anywhere below or to the left of the point (𝑥 2, 𝑦2 ); this includes all
of the area in the desired rectangle, plus everything below and to the left of the desired
rectangle. The probability of the random variable falling to the left of the rectangle can be
subtracted off using 𝐹𝑋 ,𝑌 (𝑥 1, 𝑦2 ). Similarly, the region below the rectangle can be subtracted
off using 𝐹𝑋 ,𝑌 (𝑥 2, 𝑦1 ) (two shaded regions). In subtracting off these two quantities, we have
subtracted twice the probability of the pair falling both below and to the left of the desired
rectangle (dark-shaded region). Hence we must add back this probability using 𝐹𝑋 ,𝑌 (𝑥 1, 𝑦1 ).
That is:
𝑃 (𝑥 1 < 𝑋 ≤ 𝑥 2, 𝑦1 < 𝑌 ≤ 𝑦2 ) = 𝐹𝑋 ,𝑌 (𝑥 2, 𝑦2 ) − 𝐹𝑋 ,𝑌 (𝑥 1, 𝑦2 ) − 𝐹𝑋 ,𝑌 (𝑥 2, 𝑦1 ) + 𝐹𝑋 ,𝑌 (𝑥 1, 𝑦1 ) ≥ 0.
(4.1)
Figure 4.1: Illustrating the evaluation of the probability of a pair of random variables falling in a
rectangular region.
Equation 4.1 tells us how to calculate the probability of the pair of random variables falling in
a rectangular region. Often, we are also interested in calculating the probability of the pair of
random variables falling in a non-rectangular region (e.g., a circle or triangle). This can be done by
forming the required region from many infinitesimal rectangles and then repeatedly applying
Equation 4.1.
Example 4.1
Consider a pair of random variables which are uniformly distributed over the unit square
(i.e., 0 < 𝑥 < 1, 0 < 𝑦 < 1). Find the joint CDF.
F_X(x) = F_{X,Y}(x, ∞) = 0 for x < 0,   x for 0 ≤ x ≤ 1,   1 for x > 1
Hence, the marginal CDF of 𝑋 is a uniform distribution. The same statement holds for 𝑌 as
well.
f_{X,Y}(x, y) = lim_{ε_x→0, ε_y→0} P(x ≤ X < x + ε_x, y ≤ Y < y + ε_y) / (ε_x ε_y)   (4.2)
Similar to the one-dimensional case, the joint PDF is the probability that the pair of random variables
(𝑋, 𝑌 ) lies in an infinitesimal region defined by the point (𝑥, 𝑦) normalised by the area of the region.
For a single random variable, the PDF was the derivative of the CDF. By applying Equation 4.1 to
the definition of the joint PDF, a similar relationship is obtained.
Theorem 4.1
The joint PDF 𝑓𝑋 ,𝑌 (𝑥, 𝑦) can be obtained from the joint CDF 𝐹𝑋 ,𝑌 (𝑥, 𝑦) by taking a partial
derivative with respect to each variable. That is,
f_{X,Y}(x, y) = ∂^2/∂x∂y F_{X,Y}(x, y)   (4.3)
Proof. From Equation 4.1,
P(x ≤ X < x + ε_x, y ≤ Y < y + ε_y)
= F_{X,Y}(x + ε_x, y + ε_y) − F_{X,Y}(x, y + ε_y) − F_{X,Y}(x + ε_x, y) + F_{X,Y}(x, y)
= [F_{X,Y}(x + ε_x, y + ε_y) − F_{X,Y}(x, y + ε_y)] − [F_{X,Y}(x + ε_x, y) − F_{X,Y}(x, y)]
Dividing by ε_x and letting ε_x → 0,
lim_{ε_x→0} P(x ≤ X < x + ε_x, y ≤ Y < y + ε_y)/ε_x
= lim_{ε_x→0} [F_{X,Y}(x + ε_x, y + ε_y) − F_{X,Y}(x, y + ε_y)]/ε_x − lim_{ε_x→0} [F_{X,Y}(x + ε_x, y) − F_{X,Y}(x, y)]/ε_x
= ∂/∂x F_{X,Y}(x, y + ε_y) − ∂/∂x F_{X,Y}(x, y)
Dividing by ε_y and taking the limit as ε_y → 0 gives the desired result:
f_{X,Y}(x, y) = lim_{ε_x→0, ε_y→0} P(x ≤ X < x + ε_x, y ≤ Y < y + ε_y)/(ε_x ε_y)
= lim_{ε_y→0} [∂/∂x F_{X,Y}(x, y + ε_y) − ∂/∂x F_{X,Y}(x, y)]/ε_y = ∂^2/∂x∂y F_{X,Y}(x, y)
This theorem shows that we can obtain a joint PDF from a joint CDF by differentiating with
respect to each variable. The converse of this statement would be that we could obtain a joint
CDF from a joint PDF by integrating with respect to each variable. Specifically:
F_{X,Y}(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f_{X,Y}(u, v) du dv   (4.4)
Example 4.2
Consider the pair of random variables with uniform distribution in Example 4.1. Find the
joint PDF.
Solution. By differentiating the joint CDF with respect to both 𝑥 and 𝑦, the joint PDF is
f_{X,Y}(x, y) = 1 for 0 < x < 1 and 0 < y < 1, and 0 otherwise.
From the definition of the joint PDF and its relationship with the joint CDF, several properties of
joint PDFs can be inferred:
(i) f_{X,Y}(x, y) ≥ 0
(ii) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1
(iii) f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy   and   f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx
(iv) P(x_1 < X ≤ x_2, y_1 < Y ≤ y_2) = ∫_{y_1}^{y_2} ∫_{x_1}^{x_2} f_{X,Y}(x, y) dx dy
Property (i) follows directly from the definition of the joint PDF since both the numerator and
denominator there are nonnegative. Property (ii) results from the relationship in Equation 4.4
together with the fact that 𝐹𝑋 ,𝑌 (∞, ∞) = 1. This is the normalization integral for joint PDFs. These
first two properties form a set of sufficient conditions for a function of two variables to be a valid
joint PDF. Property (iii) is obtained by noting that the marginal CDF of X is F_X(x) = F_{X,Y}(x, ∞).
Using Equation 4.4 then results in F_X(x) = ∫_{−∞}^{∞} ∫_{−∞}^{x} f_{X,Y}(u, y) du dy. Differentiating this expression
with respect to 𝑥 produces the expression in property (iii) for the marginal PDF of 𝑥. A similar
derivation produces the marginal PDF of 𝑦. Hence, the marginal PDFs are obtained by integrating
out the unwanted variable in the joint PDF. The last property is obtained by combining Equations
4.1 and 4.4.
Property (iv) of joint PDFs specifies how to compute the probability that a pair of random variables
takes on a value in a rectangular region. Often, we are interested in computing the probability
that the pair of random variables falls in a region which is not rectangularly shaped. In general,
suppose we wish to compute 𝑃 ((𝑋, 𝑌 ) ∈ 𝐴), where 𝐴 is the region illustrated in Figure 4.2. This
general region can be approximated as a union of many nonoverlapping rectangular regions as
shown in the figure. In fact, as we make the rectangles ever smaller, the approximation improves
to the point where the representation becomes exact in the limit as the rectangles get infinitely
small. That is, any region can be represented as an infinite number of infinitesimal rectangular
regions so that A = ∪_i R_i, where R_i represents the ith rectangular region. The probability that the
random pair falls in A is then computed as:
P((X, Y) ∈ A) = Σ_i P((X, Y) ∈ R_i) = Σ_i ∬_{R_i} f_{X,Y}(x, y) dx dy   (4.5)
The sum of the integrals over the rectangular regions can be replaced by an integral over the
original region A:
P((X, Y) ∈ A) = ∬_A f_{X,Y}(x, y) dx dy   (4.6)
This important result shows that the probability of a pair of random variables falling in some
two-dimensional region 𝐴 is found by integrating the joint PDF of the two random variables over
the region 𝐴.
Example 4.3
Suppose that a pair of random variables has the joint PDF given by:
Find (a) the constant value 𝑐 and (b) the probability of the event {𝑋 > 𝑌 }.
(b) This probability can be viewed as the probability of the pair (𝑋, 𝑌 ) falling in the region
𝐴 that is now defined as 𝐴 = {(𝑥, 𝑦) : 𝑥 > 𝑦}. This probability is calculated as:
P(X > Y) = ∬_{x>y} f_{X,Y}(x, y) dx dy = ∫_0^∞ ∫_y^∞ (1/2) e^{−x} e^{−y/2} dx dy = ∫_0^∞ (1/2) e^{−3y/2} dy = 1/3
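A Monte Carlo check of part (b). The statement of the joint PDF is not reproduced above, so the sketch assumes f_{X,Y}(x, y) = (1/2)e^{−x}e^{−y/2} for x, y ≥ 0 (consistent with the integrand in the solution), i.e. X and Y are independent exponential random variables with rates 1 and 1/2.

```python
# Monte Carlo estimate of P(X > Y) for independent exponentials with rates 1 and 1/2.
import random

N = 1_000_000
count = 0
for _ in range(N):
    x = random.expovariate(1.0)    # rate 1
    y = random.expovariate(0.5)    # rate 1/2
    if x > y:
        count += 1

print(count / N)    # approximately 1/3
```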
(ii) Σ_{m=1}^{M} Σ_{n=1}^{N} P_{X,Y}(x_m, y_n) = 1   (4.8)
(iii) Σ_{n=1}^{N} P_{X,Y}(x_m, y_n) = P_X(x_m),   Σ_{m=1}^{M} P_{X,Y}(x_m, y_n) = P_Y(y_n)   (4.9)
(iv) P((X, Y) ∈ A) = Σ_{(x,y)∈A} P_{X,Y}(x, y)   (4.10)
Furthermore, the joint PDF or the joint CDF of a pair of discrete random variables can be related
to the joint PMF through the use of delta functions or step functions by:
f_{X,Y}(x, y) = Σ_{m=1}^{M} Σ_{n=1}^{N} P_{X,Y}(x_m, y_n) δ(x − x_m) δ(y − y_n)   (4.11)

F_{X,Y}(x, y) = Σ_{m=1}^{M} Σ_{n=1}^{N} P_{X,Y}(x_m, y_n) u(x − x_m) u(y − y_n)   (4.12)
Usually, it is most convenient to work with PMFs when the random variables are discrete. However,
if the random variables are mixed (i.e., one is discrete and one is continuous), then it becomes
necessary to work with PDFs or CDFs since the PMF will not be meaningful for the continuous
random variable.
Example 4.4
Two discrete random variables 𝑁 and 𝑀 have a joint PMF given by:
P_{N,M}(n, m) = [(n + m)! / (n! m!)] · a^n b^m / (a + b + 1)^{n+m+1},   m = 0, 1, 2, ...,  n = 0, 1, 2, ...
Find the marginal PMFs 𝑃 𝑁 (𝑛) and 𝑃𝑀 (𝑚).
Solution. The marginal PMF of N can be found by summing over m in the joint PMF:
P_N(n) = Σ_{m=0}^{∞} P_{N,M}(n, m) = Σ_{m=0}^{∞} [(n + m)!/(n! m!)] · a^n b^m / (a + b + 1)^{n+m+1}
Using the negative-binomial series Σ_{m=0}^{∞} \binom{n+m}{m} x^m = (1 − x)^{−(n+1)} with x = b/(a + b + 1), this sum evaluates to
P_N(n) = a^n / (1 + a)^{n+1}
and, by the same argument with the roles of n and m interchanged,
P_M(m) = b^m / (1 + b)^{m+1}
Hence, the random variables M and N both follow a geometric distribution.
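The closed-form marginal can be confirmed by summing the joint PMF numerically; a = 0.8 and b = 1.5 in the sketch below are arbitrary positive values chosen for the check.

```python
# Numerical verification of the marginal PMF found in Example 4.4.
import math

a, b = 0.8, 1.5

def joint_pmf(n, m):
    return math.comb(n + m, m) * a**n * b**m / (a + b + 1)**(n + m + 1)

for n in range(5):
    marginal = sum(joint_pmf(n, m) for m in range(400))     # sum over m (truncated series)
    closed_form = a**n / (1 + a)**(n + 1)
    print(n, marginal, closed_form)                          # the two columns agree
```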
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = P_{X,Y}(x, y) / P_Y(y)   (4.13)

P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y)   (4.14)
Example 4.5
Using the joint PMF given in Example 4.4 along with the marginal PMF found in that exam-
ple, find the conditional PMF: 𝑃 𝑁 |𝑀 (𝑛|𝑚)
Solution.
P_{N|M}(n|m) = P_{M,N}(m, n) / P_M(m)
= [(n + m)!/(n! m!)] · [a^n b^m / (a + b + 1)^{n+m+1}] · [(1 + b)^{m+1} / b^m]
= [(n + m)!/(n! m!)] · a^n (1 + b)^{m+1} / (a + b + 1)^{n+m+1}
Note that the conditional PMF of 𝑁 given 𝑀 is quite different than the marginal PMF of 𝑁 .
That is, knowing 𝑀 changes the distribution of 𝑁 .
The simple result developed in Equation 4.13 can be extended to the case of continuous random
variables and PDFs.
Definition 4.4. Conditional probability density function: The conditional PDF of a random
variable 𝑋 given that 𝑌 = 𝑦 is:
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)   (4.15)
Integrating both sides of this equation with respect to x produces the conditional CDFs:
Definition 4.5. Conditional cumulative distribution function: The conditional CDF of a
random variable 𝑋 given that 𝑌 = 𝑦 is:
F_{X|Y}(x|y) = [∫_{−∞}^{x} f_{X,Y}(x′, y) dx′] / f_Y(y)   (4.16)
Usually, the conditional PDF is much easier to work with, so the conditional CDF will not be
discussed further.
Example 4.6
Suppose that X and Y have the joint PDF
f_{X,Y}(x, y) = 2abc / (ax + by + c)^3 · u(x) u(y)
for some positive constants a, b, and c. Find the conditional PDF of X given Y and of Y given X.
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y) = 2a(by + c)^2 / (ax + by + c)^3 · u(x)
f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = 2b(ax + c)^2 / (ax + by + c)^3 · u(y)
Example 4.7
Suppose that X and Y have the joint PDF
f_{X,Y}(x, y) = (1/(π√3)) exp(−(2/3)(x^2 − xy + y^2))
Find the marginal PDF of X and the conditional PDF of X given Y = y.
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = (1/(π√3)) exp(−(2/3)x^2) ∫_{−∞}^{∞} exp(−(2/3)(y^2 − xy)) dy
= (1/(π√3)) exp(−x^2/2) ∫_{−∞}^{∞} exp(−(2/3)(y^2 − xy + x^2/4)) dy
= (1/(π√3)) exp(−x^2/2) ∫_{−∞}^{∞} exp(−(2/3)(y − x/2)^2) dy
= (1/√(2π)) exp(−x^2/2)
and we see that 𝑋 is a zero-mean, unit-variance, Gaussian (i.e., standard normal) random
variable. By symmetry, the marginal PDF of 𝑌 must also be of the same form.
The conditional PDF of 𝑋 given 𝑌 is
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y) = [(1/(π√3)) exp(−(2/3)(x^2 − xy + y^2))] / [(1/√(2π)) exp(−y^2/2)]
= √(2/(3π)) exp(−(2/3)(x − y/2)^2)
So, the conditional PDF of 𝑋 given 𝑌 is also Gaussian. But, given that it is known that 𝑌 = 𝑦,
the mean of 𝑋 is now 𝑦/2 (instead of zero), and the variance of 𝑋 is 3/4 (instead of one). In
this example, knowledge of 𝑌 has shifted the mean and reduced the variance of 𝑋 .
For a conditioning event of the form A = {y_1 < Y ≤ y_2}, the conditional PDF and CDF of X given A are:
f_{X|A}(x) = [∫_{y_1}^{y_2} f_{X,Y}(x, y) dy] / [∫_{y_1}^{y_2} f_Y(y) dy]   (4.18)

F_{X|A}(x) = [F_{X,Y}(x, y_2) − F_{X,Y}(x, y_1)] / [F_Y(y_2) − F_Y(y_1)]   (4.19)
Example 4.8
Using the joint PDF of Example 4.7, determine the conditional PDF of 𝑋 given that 𝑌 > 𝑦0 .
Solution.
∫_{y_0}^{∞} f_{X,Y}(x, y) dy = (1/(π√3)) ∫_{y_0}^{∞} exp(−(2/3)(x^2 − xy + y^2)) dy
= (1/√(2π)) exp(−x^2/2) ∫_{y_0}^{∞} √(2/(3π)) exp(−(2/3)(y − x/2)^2) dy
= (1/√(2π)) exp(−x^2/2) Q((2y_0 − x)/√3)
Since the marginal PDF of Y is a zero-mean, unit-variance Gaussian PDF,
∫_{y_0}^{∞} f_Y(y) dy = ∫_{y_0}^{∞} (1/√(2π)) exp(−y^2/2) dy = Q(y_0)
so that, by Equation 4.18,
f_{X|{Y>y_0}}(x) = (1/√(2π)) exp(−x^2/2) · Q((2y_0 − x)/√3) / Q(y_0)
Note that when the conditioning event was a point condition on 𝑌 , the conditional PDF of 𝑋
was Gaussian; yet, when the conditioning event is an interval condition on 𝑌 , the resulting
conditional PDF of 𝑋 is not Gaussian at all.
For discrete random variables, the equivalent expression in terms of the joint PMF is:
E[g(X, Y)] = Σ_m Σ_n g(x_m, y_n) P_{X,Y}(x_m, y_n)   (4.21)
If the function 𝑔(𝑥, 𝑦) is actually a function of only a single variable, say 𝑥, then this definition
reduces to the definition of expected values for functions of a single random variable:
E[g(X)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) f_{X,Y}(x, y) dx dy = ∫_{−∞}^{∞} g(x) (∫_{−∞}^{∞} f_{X,Y}(x, y) dy) dx = ∫_{−∞}^{∞} g(x) f_X(x) dx   (4.22)
To start with, consider an arbitrary linear function of the two variables 𝑔(𝑥, 𝑦) = 𝑎𝑥 + 𝑏𝑦, where 𝑎
and 𝑏 are constants. Then:
E[aX + bY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (ax + by) f_{X,Y}(x, y) dx dy
= a ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f_{X,Y}(x, y) dx dy + b ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f_{X,Y}(x, y) dx dy
= a E[X] + b E[Y]
The correlation of two random variables is defined as R_{X,Y} = E[XY]. Two random variables
which have a correlation of zero are said to be orthogonal.
One instance in which the correlation appears is in calculating the second moment of a sum of
two random variables. That is, consider finding the expected value of g(X, Y) = (X + Y)^2:
E[(X + Y)^2] = E[X^2 + 2XY + Y^2] = E[X^2] + E[Y^2] + 2E[XY]
Hence the second moment of the sum is the sum of the second moments plus twice the correlation.
Definition 4.8. Covariance: The covariance between two random variables is:
COV(X, Y) = E[(X − E[X])(Y − E[Y])] = ∬ (x − E[X])(y − E[Y]) f_{X,Y}(x, y) dx dy   (4.25)
If two random variables have a covariance of zero, they are said to be uncorrelated.
Theorem 4.2
The correlation and covariance are strongly related to one another as follows:
COV(X, Y) = E[XY] − E[X]E[Y]
Proof.
COV(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y] − E[Y]E[X] + E[X]E[Y] = E[XY] − E[X]E[Y]
As a result, if either X or Y (or both) has a mean of zero, correlation and covariance are equivalent.
The covariance appears when calculating the variance of a sum of two random variables:
VAR[X + Y] = VAR[X] + VAR[Y] + 2COV(X, Y)
This result can be obtained from Equation 4.24 by replacing 𝑋 with 𝑋 − 𝐸 [𝑋 ] and 𝑌 with 𝑌 − 𝐸 [𝑌 ].
Another statistical parameter related to a pair of random variables is the correlation coefficient,
which is nothing more than a normalized version of the covariance.
Definition 4.9. Correlation coefficient The correlation coefficient of two random variables 𝑋 and
𝑌 , 𝜌𝑋𝑌 , is defined as
ρ_XY = COV(X, Y) / √(VAR(X) VAR(Y)) = E[(X − E[X])(Y − E[Y])] / (σ_X σ_Y)   (4.28)
The correlation coefficient always satisfies |ρ_XY| ≤ 1. To see this, consider the nonnegative
quantity E[(X + aY)^2] = E[X^2] + 2aE[XY] + a^2 E[Y^2] ≥ 0. Since this is true for any a, we can
tighten the bound by choosing the value of a that minimizes the left-hand side. This value of a turns out to be
a = −E[XY] / E[Y^2]
Plugging in this value gives
E[X^2] + E[XY]^2/E[Y^2] − 2E[XY]^2/E[Y^2] ≥ 0   ⇒   E[XY]^2 ≤ E[X^2]E[Y^2]
If we replace X with X − E[X] and Y with Y − E[Y], the result is
(COV(X, Y))^2 ≤ VAR[X] VAR[Y]
so that
|ρ_XY| = |COV(X, Y)| / √(VAR[X] VAR[Y]) ≤ 1
Note that we can also infer from the proof that equality holds if 𝑌 is a constant times 𝑋 . That is, a
correlation coefficient of 1 (or −1) implies that 𝑋 and 𝑌 are completely correlated (knowing 𝑌
determines 𝑋 ). Furthermore, uncorrelated random variables will have a correlation coefficient
of zero. Therefore, as its name implies, the correlation coefficient is a quantitative measure of
the correlation between two random variables. It should be emphasized at this point that zero
correlation is not to be confused with independence. These two concepts are not the same.
Example 4.9
Consider once again the joint PDF of Example 4.7. Find 𝑅𝑋 ,𝑌 , 𝐶𝑂𝑉 (𝑋, 𝑌 ) and 𝜌𝑋 ,𝑌 .
Solution. The correlation is R_{X,Y} = E[XY] = ∬ xy f_{X,Y}(x, y) dx dy. In order to evaluate this
integral, the joint PDF is rewritten as f_{X,Y}(x, y) = f_{Y|X}(y|x) f_X(x) and then those terms
involving only x are pulled outside the inner integral over y.
E[XY] = ∫_{−∞}^{∞} x (1/√(2π)) exp(−x^2/2) [∫_{−∞}^{∞} y √(2/(3π)) exp(−(2/3)(y − x/2)^2) dy] dx
The inner integral (in square brackets) is the expected value of a Gaussian random variable
with a mean of 𝑥/2 and variance of 3/4 which thus evaluates to 𝑥/2. Hence,
E[XY] = (1/2) ∫_{−∞}^{∞} x^2 (1/√(2π)) exp(−x^2/2) dx
The remaining integral is the second moment of a Gaussian random variable with zero mean
and unit variance which integrates to 1. The correlation of these two random variables is
therefore 𝐸 [𝑋𝑌 ] = 1/2. Since both 𝑋 and 𝑌 have zero means, 𝐶𝑂𝑉 (𝑋, 𝑌 ) is also equal to
1/2. Finally, the correlation coefficient is also 𝜌𝑋𝑌 = 1/2 due to the fact that both 𝑋 and 𝑌
have unit variance.
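A Monte Carlo sketch of the same calculation: it draws the pair using the conditional structure found in Example 4.7 (X standard Gaussian and, given X = x, Y Gaussian with mean x/2 and variance 3/4) and estimates E[XY].

```python
# Monte Carlo check of Example 4.9: E[XY] should be about 1/2.
import math
import random

N = 1_000_000
xy_sum = 0.0
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y = random.gauss(x / 2, math.sqrt(3) / 2)   # conditional std dev = sqrt(3/4)
    xy_sum += x * y

print(xy_sum / N)    # approximately 1/2 = E[XY] = COV(X,Y) = rho_XY
```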
The concepts of correlation and covariance can be generalized to higher-order moments as given
in the following definition.
Definition 4.10. Joint moment: The (m, n)th joint moment of two random variables X and Y is:
E[X^m Y^n] = ∬ x^m y^n f_{X,Y}(x, y) dx dy   (4.29)
Definition 4.11. Joint central moment: The (m, n)th joint central moment of two random variables X and Y is:
E[(X − E[X])^m (Y − E[Y])^n] = ∬ (x − E[X])^m (y − E[Y])^n f_{X,Y}(x, y) dx dy   (4.30)
These higher-order joint moments are not frequently used. As with single random variables,
a conditional expected value can also be defined for which the expectation is carried out with
respect to the appropriate conditional density function.
Definition 4.12. Conditional expected value: The conditional expected value of a function g(X) of a random variable X given that Y = y is:
E[g(X)|Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x|y) dx   (4.31)
Conditional expected values can be particularly useful in calculating expected values of functions
of two random variables that can be factored into the product of two one-dimensional functions.
That is, consider a function of the form 𝑔(𝑥, 𝑦) = 𝑔1 (𝑥)𝑔2 (𝑦). Then:
E[g_1(X) g_2(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g_1(x) g_2(y) f_{X,Y}(x, y) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} g_1(x) g_2(y) f_X(x) f_{Y|X}(y|x) dx dy
= ∫_{−∞}^{∞} g_1(x) f_X(x) (∫_{−∞}^{∞} g_2(y) f_{Y|X}(y|x) dy) dx
= ∫_{−∞}^{∞} g_1(x) f_X(x) E_Y[g_2(Y)|X = x] dx
= E_X[g_1(X) E_Y[g_2(Y)|X]]
Here, the subscripts on the expectation operator have been included for clarity to emphasize
that the outer expectation is with respect to the random variable X, while the inner expectation
is with respect to the random variable Y (conditioned on X). This result allows us to break a
two-dimensional expectation into two one-dimensional expectations. This technique was used in
Example 4.9, where the correlation between two variables was essentially written as:
𝑅𝑋 ,𝑌 = 𝐸𝑋 [𝑋 𝐸𝑌 [𝑌 |𝑋 ]] (4.32)
In that example, the conditional PDF of Y given X was Gaussian, thus finding the conditional
mean was accomplished by inspection. The outer expectation then required finding the second
moment of a Gaussian random variable, which is also straightforward.
Hence, two random variables are statistically independent if their joint CDF factors into a product
of the marginal CDFs. Differentiating both sides of this equation with respect to both x and y
reveals that the same statement applies to the PDF as well. That is, for statistically independent
random variables, the joint PDF factors into a product of the marginal PDFs:
f_{X,Y}(x, y) = f_X(x) f_Y(y)   (4.34)
It is not difficult to show that the same statement applies to PMFs as well. The preceding condition
can also be restated in terms of conditional PDFs. Dividing both sides of Equation 4.34 by 𝑓𝑋 (𝑥)
results in
𝑓𝑌 |𝑋 (𝑦|𝑥) = 𝑓𝑌 (𝑦) (4.35)
A similar result involving the conditional PDF of X given Y could have been obtained by dividing
both sides by the PDF of Y. In other words, if X and Y are independent, knowing the value of the
random variable X should not change the distribution of Y and vice versa.
Example 4.10
Are the two random variables in Example 4.7 independent?
Solution. In Example 4.7 we found
f_X(x) = (1/√(2π)) exp(−x^2/2)
and
f_{X|Y}(x|y) = √(2/(3π)) exp(−(2/3)(x − y/2)^2)
Since these are not equal, the two random variables are not independent.
Example 4.11
Suppose the random variables X and Y are uniformly distributed on the square defined by
0 ≤ x, y ≤ 1. Are these two random variables independent?
Solution. From Example 4.2, f_{X,Y}(x, y) = 1 on the unit square, and each marginal PDF equals 1 on (0, 1);
hence f_{X,Y}(x, y) = f_X(x) f_Y(y) and the two random variables are independent.
Theorem 4.4
Let 𝑋 and 𝑌 be two independent random variables and consider forming two new random
variables 𝑈 = 𝑔1 (𝑋 ) and 𝑉 = 𝑔2 (𝑌 ). These new random variables 𝑈 and 𝑉 are also
independent
Another important result deals with the correlation, covariance, and correlation coefficients of
independent random variables.
Theorem 4.5
If X and Y are independent random variables, then E[XY] = E[X]E[Y], COV(X, Y) = 0, and ρ_XY = 0.
Proof.
E[XY] = ∬ xy f_{X,Y}(x, y) dx dy = ∬ xy f_X(x) f_Y(y) dx dy = ∫ x f_X(x) dx ∫ y f_Y(y) dy = E[X]E[Y]
The conditions involving covariance and correlation coefficient follow directly from this
result.
Therefore, independent random variables are necessarily uncorrelated, but the converse is not
always true. Uncorrelated random variables do not have to be independent as demonstrated by
the next example.
Example 4.12
Consider a pair of random variables X and Y that are uniformly distributed over the unit
circle so that:
f_{X,Y}(x, y) = 1/π for x^2 + y^2 ≤ 1, and 0 otherwise
The marginal PDF of 𝑋 can be found as follows:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = ∫_{−√(1−x^2)}^{√(1−x^2)} (1/π) dy = (2/π)√(1 − x^2),   −1 ≤ x ≤ 1
By symmetry, the marginal PDF of 𝑌 must take on the same functional form. Hence, the
product of the marginal PDFs is
f_X(x) f_Y(y) = (4/π^2) √((1 − x^2)(1 − y^2)),   −1 ≤ x, y ≤ 1
Clearly, this is not equal to the joint PDF, and therefore, the two random variables are
dependent. This conclusion could have been determined in a simpler manner. Note that
if we are told that 𝑋 = 1, then necessarily 𝑌 = 0, whereas if we know that 𝑋 = 0, then 𝑌
can range anywhere from -1 to 1. Therefore, conditioning on different values of 𝑋 leads to
different distributions for 𝑌 . Next, the correlation between 𝑋 and 𝑌 is calculated.
R_{X,Y} = E[XY] = ∬_{x^2+y^2≤1} (xy/π) dx dy = (1/π) ∫_{−1}^{1} x (∫_{−√(1−x^2)}^{√(1−x^2)} y dy) dx
Since the inner integrand is an odd function (of 𝑦) and the limits of integration are symmetric
about zero, the integral is zero. Hence, 𝑅𝑋 ,𝑌 = 0. Note from the marginal PDFs just found
that both 𝑋 and 𝑌 are zero-mean. So, it is seen for this example that while the two random
variables are uncorrelated, they are not independent.
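A short simulation of this example: points drawn uniformly from the unit disk have a sample correlation near zero, yet the spread of Y clearly depends on the value of X.

```python
# Uniform on the unit disk: uncorrelated (E[XY] ~ 0) but clearly dependent.
import random

pts = []
while len(pts) < 200_000:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:            # accept-reject sampling of the unit disk
        pts.append((x, y))

n = len(pts)
corr = sum(x * y for x, y in pts) / n          # estimate of E[XY] (means are zero)
spread_small_x = max(abs(y) for x, y in pts if abs(x) < 0.1)
spread_large_x = max(abs(y) for x, y in pts if abs(x) > 0.9)

print(corr)                            # close to 0: uncorrelated
print(spread_small_x, spread_large_x)  # ~1 vs ~0.44: the range of Y depends on X
```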
Two random variables X and Y are said to be jointly Gaussian if their joint PDF has the form
f_{X,Y}(x, y) = [1/(2πσ_X σ_Y √(1 − ρ_XY^2))] exp(− [((x − m_X)/σ_X)^2 − 2ρ_XY ((x − m_X)/σ_X)((y − m_Y)/σ_Y) + ((y − m_Y)/σ_Y)^2] / (2(1 − ρ_XY^2)))   (4.36)
where 𝑚𝑋 and 𝑚𝑌 are the means of 𝑋 and 𝑌 , respectively; 𝜎𝑋 and 𝜎𝑌 are the standard deviations of
𝑋 and 𝑌 , respectively; and 𝜌𝑋𝑌 is the correlation coefficient of 𝑋 and 𝑌 .
It can be shown that this joint PDF results in Gaussian marginal PDFs:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy = (1/(√(2π)σ_X)) exp(−(x − m_X)^2/(2σ_X^2))   (4.37)

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx = (1/(√(2π)σ_Y)) exp(−(y − m_Y)^2/(2σ_Y^2))   (4.38)
Furthermore, if X and Y are jointly Gaussian, then the conditional PDF of X given Y = y is also
Gaussian, with a mean of m_X + ρ_XY (σ_X/σ_Y)(y − m_Y) and a variance of σ_X^2 (1 − ρ_XY^2).
Figure 4.3 shows the joint Gaussian PDF for three different values of the correlation coefficient.
In Figure 4.3(a), the correlation coefficient is 𝜌𝑋𝑌 = 0 and thus the two random variables are
uncorrelated. Figure 4.3(b) shows the joint PDF when the correlation coefficient is large and
positive, 𝜌𝑋𝑌 = 0.9. Note how the surface has become taller and thinner and largely lies above
the line 𝑦 = 𝑥. In Figure 4.3(c), the correlation is now large and negative, 𝜌𝑋𝑌 = −0.9. Note that
this is the same picture as in Figure 4.3(b), except that it has been rotated by 90°. Now the surface
lies largely above the line 𝑦 = −𝑥. In all three figures, the means of both 𝑋 and 𝑌 are zero and
the variances of both 𝑋 and 𝑌 are 1. Changing the means would simply translate the surface but
would not change the shape. Changing the variances would expand or contract the surface along
either the 𝑋 − or 𝑌 −axis depending on which variance was changed.
Example 4.13
The joint Gaussian PDF is given by Equation 4.36. Consider the set of points (x, y) for which the quadratic form in the exponent takes on a constant value:
((x − m_X)/σ_X)^2 − 2ρ_XY ((x − m_X)/σ_X)((y − m_Y)/σ_Y) + ((y − m_Y)/σ_Y)^2 = c^2
This is the equation for an ellipse. Plotting these ellipses for different values of 𝑐 results in
what is known as a contour plot. Figure 4.4 shows such plots for the two-dimensional joint
Gaussian PDF.
Theorem 4.6
If two random variables X and Y are jointly Gaussian and uncorrelated, then they are independent.
Proof. Uncorrelated Gaussian random variables have a correlation coefficient of zero. Plug-
ging 𝜌𝑋𝑌 = 0 into the general joint Gaussian PDF results in
f_{X,Y}(x, y) = (1/(2πσ_X σ_Y)) exp(− [((x − m_X)/σ_X)^2 + ((y − m_Y)/σ_Y)^2] / 2)
This clearly factors into the product of the marginal Gaussian PDFs.
f_{X,Y}(x, y) = (1/(√(2π)σ_X)) exp(−(x − m_X)^2/(2σ_X^2)) · (1/(√(2π)σ_Y)) exp(−(y − m_Y)^2/(2σ_Y^2)) = f_X(x) f_Y(y)
Example 4.12 demonstrated that this property does not hold for all random variables; however,
it is true for Gaussian random variables. This allows us to give a stronger interpretation
to the correlation coefficient when dealing with Gaussian random variables. Previously, it was
stated that the correlation coefficient is a quantitative measure of the amount of correlation
between two variables. While this is true, it is a rather vague statement. We see that in the case
of Gaussian random variables, we can make the connection between correlation and statistical
dependence. Hence, for jointly Gaussian random variables, the correlation coefficient can indeed
be viewed as a quantitative measure of statistical dependence.
Let the outcome of a random experiment be an audio signal 𝑋 (𝑡). Let the random variable
𝑋𝑘 = 𝑋 (𝑘𝑇 ) be the sample of the signal taken at time 𝑘𝑇 . An MP3 codec processes the audio
in blocks of 𝑛 samples X= [𝑋 1, 𝑋 2, ..., 𝑋𝑛 ]𝑇 . X is a vector random variable.
Marginal CDFs can be found for a subset of the variables by evaluating the joint CDF at infinity
for the unwanted variables. For example, if we are only interested in a subset {𝑋 1, 𝑋 2, ..., 𝑋𝑀 } of
X= [𝑋 1, 𝑋 2, ..., 𝑋 𝑁 ]𝑇 , where 𝑁 ≥ 𝑀:
Marginal PDFs are found from the joint PDF by integrating out the unwanted variables. Similarly,
marginal PMFs are obtained from the joint PMF by summing out the unwanted variables.
f_{X_1,...,X_M}(x_1, ..., x_M) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f_{X_1,...,X_N}(x_1, ..., x_N) dx_{M+1} dx_{M+2} ... dx_N   (4.49)

P_{X_1,...,X_M}(x_1, ..., x_M) = Σ_{x_{M+1}} Σ_{x_{M+2}} ... Σ_{x_N} P_{X_1,...,X_N}(x_1, ..., x_N)   (4.50)
Similar to that done for pairs of random variables, we can also establish conditional PMFs and
PDFs.
Definition 4.16. For a set of 𝑁 random variables 𝑋 1, 𝑋 2, ..., 𝑋 𝑁 , the conditional PMF and PDF of
𝑋 1, 𝑋 2, ..., 𝑋𝑀 conditioned on 𝑋𝑀+1, 𝑋𝑀+2, ..., 𝑋 𝑁 are given by
P_{X_1,...,X_M | X_{M+1},...,X_N}(x_1, ..., x_M | x_{M+1}, ..., x_N) = P(X_1 = x_1, ..., X_N = x_N) / P(X_{M+1} = x_{M+1}, ..., X_N = x_N)   (4.51)

f_{X_1,...,X_M | X_{M+1},...,X_N}(x_1, ..., x_M | x_{M+1}, ..., x_N) = f_{X_1,...,X_N}(x_1, ..., x_N) / f_{X_{M+1},...,X_N}(x_{M+1}, ..., x_N)   (4.52)
Using conditional PDFs, many interesting factorization results can be established for joint PDFs
involving multiple random variables. For example, consider four random variables, 𝑋 1, 𝑋 2, 𝑋 3, 𝑋 4 .
As an example, consider three random variables, 𝑋 , 𝑌 , 𝑍 . For these three random variables to be
independent, we must have each pair independent. This implies that:
In addition, the joint PDF of all three must also factor into a product of the marginals,
Note that all three conditions in Equation 4.53 follow directly from the single condition in
Equation 4.54. Hence, Equation 4.54 is a necessary and sufficient condition for three variables to
be statistically independent. Naturally, this result can be extended to any number of variables.
That is, the elements of a random vector X= [𝑋 1, 𝑋 2, ..., 𝑋 𝑁 ]𝑇 are independent if
f_X(x) = Π_{n=1}^{N} f_{X_n}(x_n)   (4.55)
Theorem 4.7
Correlation matrices and covariance matrices are symmetric and positive definite.
Proof. Recall that a square matrix R_XX is symmetric if R_XX = R_XX^T. Equivalently, the
(i, j)th element must be the same as the (j, i)th element. This is clearly the case here since
E[X_i X_j] = E[X_j X_i]. Recall that the matrix is positive definite if z^T R_XX z > 0 for any vector
z such that ||z|| > 0. Writing out this quadratic form,
z^T R_XX z = z^T E[X X^T] z = E[z^T X X^T z] = E[(z^T X)^2]
Note that z^T X is a scalar random variable (a linear combination of the components of X).
Since the second moment of any random variable is positive (except for the pathological
case of a random variable which is identically equal to zero), the correlation matrix is
positive definite. As an aside, this also implies that the eigenvalues of the correlation matrix
are all positive. Identical steps can be followed to prove the same properties hold for the
covariance matrix.
Next, consider a linear transformation of a vector random variable. That is, create a new set of 𝑀
random variables, Y = [𝑌1, 𝑌2, ..., 𝑌𝑀 ]𝑇 , according to:
The number of new variables, M, does not have to be the same as the number of original variables,
N. To write this type of linear transformation in a compact fashion, define a matrix A whose
(𝑖, 𝑗)𝑡ℎ element is the coefficient 𝑎𝑖,𝑗 and a column vector, b= [𝑏 1, 𝑏 2, ..., 𝑏 𝑀 ]𝑇 . Then the linear
transformation of Equation 4.57 is written in vector/matrix form as Y = AX + b. The next theorem
describes the relationship between the means of X and Y and the correlation matrices of X and Y.
Theorem 4.8
For a linear transformation of vector random variables of the form Y = AX + b, the means
of X and Y are related by
m_Y = A m_X + b   (4.58)
Also, the correlation matrices of X and Y are related by:
R_YY = A R_XX A^T + A m_X b^T + b m_X^T A^T + b b^T
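The mean relation of Theorem 4.8 (and the covariance counterpart C_YY = A C_XX A^T, which follows from it by centring) can be checked numerically. The sketch below assumes NumPy is available; the particular m_X, C_XX, A and b are arbitrary illustration values.

```python
# Numerical sketch of Theorem 4.8: mean and covariance of Y = AX + b.
import numpy as np

rng = np.random.default_rng(0)
m_X = np.array([1.0, -2.0])
C_XX = np.array([[2.0, 0.6],
                 [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0],
              [1.0, -1.0]])          # M = 3 new variables from N = 2 originals
b = np.array([0.5, 0.0, -1.0])

X = rng.multivariate_normal(m_X, C_XX, size=200_000)   # rows are samples of X
Y = X @ A.T + b                                        # Y = AX + b, sample by sample

print(Y.mean(axis=0), A @ m_X + b)                     # sample mean of Y vs A m_X + b
print(np.cov(Y, rowvar=False))                         # sample covariance of Y
print(A @ C_XX @ A.T)                                  # theoretical A C_XX A^T
```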
Example 4.15
To demonstrate the use of this matrix notation, suppose X is a two-element vector and the
mean vector and covariance matrix are given by their general forms:
m_X = [m_1, m_2]^T
and
C_XX = [[σ_1^2, ρσ_1σ_2], [ρσ_1σ_2, σ_2^2]]
The determinant of the covariance matrix is det(C_XX) = σ_1^2 σ_2^2 (1 − ρ^2), and its inverse is
C_XX^{-1} = (1/(1 − ρ^2)) [[σ_1^{-2}, −ρσ_1^{-1}σ_2^{-1}], [−ρσ_1^{-1}σ_2^{-1}, σ_2^{-2}]]
The quadratic form in the exponent is then
(X − m_X)^T C_XX^{-1} (X − m_X) = [((x_1 − m_1)/σ_1)^2 − 2ρ((x_1 − m_1)/σ_1)((x_2 − m_2)/σ_2) + ((x_2 − m_2)/σ_2)^2] / (1 − ρ^2)
Plugging all these results into the general form for the joint Gaussian PDF gives
f_{X_1,X_2}(x_1, x_2) = [1/√((2π)^2 σ_1^2 σ_2^2 (1 − ρ^2))] exp(− [((x_1 − m_1)/σ_1)^2 − 2ρ((x_1 − m_1)/σ_1)((x_2 − m_2)/σ_2) + ((x_2 − m_2)/σ_2)^2] / (2(1 − ρ^2)))   (4.66)
This is exactly the form of the two-dimensional joint Gaussian PDF defined in Equation 4.36.
Example 4.16
Suppose 𝑋 1, 𝑋 2, ..., 𝑋𝑛 are jointly Gaussian random variables with 𝐶𝑂𝑉 (𝑋𝑖 , 𝑋 𝑗 ) = 0 for 𝑖 ≠ 𝑗.
Show that 𝑋 1, 𝑋 2, ..., 𝑋𝑛 are independent random variables.
Solution. Since COV(X_i, X_j) = 0 for all i ≠ j, all of the off-diagonal elements of the
covariance matrix of X are zero. In other words, C_XX is a diagonal matrix of the general
form:
C_XX = diag(σ_1^2, σ_2^2, ..., σ_N^2)
The determinant of a diagonal matrix is the product of the diagonal entries, so that in this
case det(C_XX) = σ_1^2 σ_2^2 ... σ_N^2. The inverse is also trivial to compute and takes on the form
C_XX^{-1} = diag(σ_1^{-2}, σ_2^{-2}, ..., σ_N^{-2})
The quadratic form that appears in the exponent of the Gaussian PDF becomes
(X − m_X)^T C_XX^{-1} (X − m_X) = Σ_{n=1}^{N} ((x_n − m_n)/σ_n)^2
The joint Gaussian PDF for a vector of uncorrelated random variables is then
f_X(x) = [1/√((2π)^N σ_1^2 σ_2^2 ... σ_N^2)] exp(−(1/2) Σ_{n=1}^{N} ((x_n − m_n)/σ_n)^2) = Π_{n=1}^{N} (1/(√(2π)σ_n)) exp(−(x_n − m_n)^2/(2σ_n^2))
This shows that for any number of uncorrelated Gaussian random variables, the joint
PDF factors into the product of marginal PDFs and hence uncorrelated Gaussian random
variables are independent. This is a generalization of the same result for two Gaussian
random variables.
Further Reading
1. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, Elsevier 2012: sections 5.1 to 5.7 and 6.1 to 6.3
2. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: sections 5.1 to 5.9 and 6.1 to 6.4
3. Charles W. Therrien, Probability for electrical and computer engineers, CRC Press, 2004:
chapter 5
5 Random Sums and Sequences
Many problems involve the counting of the number of occurrences of events, the measurement
of cumulative effects, or the computation of arithmetic averages in a series of measurements.
Usually these problems can be reduced to the problem of finding, exactly or approximately, the
distribution of a random variable that consists of the sum of 𝑛 independent, identically distributed
random variables. In this chapter, we investigate sums of random variables and their properties
as 𝑛 becomes large.
For continuous random variables, the CDFs can be replaced with PDFs in Equations 5.1 and 5.2, while
for discrete random variables, the CDFs can be replaced by PMFs.
Suppose, for example, we wish to measure the voltage produced by a certain sensor. The sensor
might be measuring the relative humidity outside. Our sensor converts the humidity to a voltage
level which we can then easily measure. However, as with any measuring equipment, the voltage
we measure is random due to noise generated in the sensor as well as in the measuring equipment.
Suppose the voltage we measure is represented by a random variable 𝑋 given by 𝑋 = 𝑣 (ℎ) + 𝑁 ,
where 𝑣 (ℎ) is the true voltage that should be presented by the sensor when the humidity is ℎ,
and 𝑁 is the noise in the measurement. Assuming that the noise is zero-mean, then 𝐸 [𝑋 ] = 𝑣 (ℎ).
That is, on the average, the measurement will be equal to the true voltage 𝑣 (ℎ). Furthermore,
if the variance of the noise is sufficiently small, then the measurement will tend to be close to
the true value we are trying to measure. But what if the variance is not small? Then the noise
will tend to distort our measurement, making our system unreliable. In such a case, we might be
able to improve our measurement system by taking several measurements. This will allow us to
“average out” the effects of the noise.
Suppose we have the ability to make several measurements and observe a sequence of mea-
surements 𝑋 1, 𝑋 2, ..., 𝑋𝑛 . It might be reasonable to expect that the noise that corrupts a given
measurement has the same distribution each time (and hence the 𝑋𝑖 are identically distributed)
and is independent of the noise in any other measurement (so that the 𝑋𝑖 are independent). Then
the 𝑛 measurements form a sequence of IID random variables. A fundamental question is then:
How do we process an IID sequence to extract the desired information from it? In the preceding
case, the parameter of interest, 𝑣 (ℎ), happens to be the mean of the distribution of the 𝑋𝑖 . This
turns out to be a fairly common problem and so we address that in the following sections.
Consider the sum of n random variables
S_n = X_1 + X_2 + ... + X_n
It was shown in section 3.2.3 that, regardless of the statistical dependence of the X_i, the expected value
of a sum of n random variables is equal to the sum of the expected values:
E[S_n] = E[X_1 + X_2 + ... + X_n] = E[X_1] + E[X_2] + ... + E[X_n]
Thus knowledge of the means of the 𝑋𝑖 s suffices to find the mean of 𝑆𝑛 . The following example
shows that in order to compute the variance of a sum of random variables, we need to know the
variances and covariances of the 𝑋𝑖 s.
Example 5.1
Find the variance of Z = X + Y.
Solution.
VAR[Z] = E[(Z − E[Z])^2] = E[(X + Y − E[X] − E[Y])^2]
= 𝐸 [((𝑋 − 𝐸 [𝑋 ]) + (𝑌 − 𝐸 [𝑌 ])) 2 ]
= 𝐸 [(𝑋 − 𝐸 [𝑋 ]) 2 + (𝑌 − 𝐸 [𝑌 ]) 2 + (𝑋 − 𝐸 [𝑋 ])(𝑌 − 𝐸 [𝑌 ]) + (𝑌 − 𝐸 [𝑌 ])(𝑋 − 𝐸 [𝑋 ])]
= 𝑉 𝐴𝑅 [𝑋 ] + 𝑉 𝐴𝑅 [𝑌 ] + 𝐶𝑂𝑉 (𝑋, 𝑌 ) + 𝐶𝑂𝑉 (𝑌 , 𝑋 )
= 𝑉 𝐴𝑅 [𝑋 ] + 𝑉 𝐴𝑅 [𝑌 ] + 2𝐶𝑂𝑉 (𝑋, 𝑌 )
In general, the covariance 𝐶𝑂𝑉 (𝑋, 𝑌 ) is not equal to zero, so the variance of a sum is not
necessarily equal to the sum of the individual variances.
The result in Example 5.1 can be generalized to the case of 𝑛 random variables:
VAR[X_1 + X_2 + ... + X_n] = E[Σ_{j=1}^{n} (X_j − E[X_j]) Σ_{k=1}^{n} (X_k − E[X_k])]
= Σ_{j=1}^{n} Σ_{k=1}^{n} E[(X_j − E[X_j])(X_k − E[X_k])]
= Σ_{k=1}^{n} VAR[X_k] + Σ_{j=1}^{n} Σ_{k=1, k≠j}^{n} COV(X_j, X_k)   (5.3)
Thus in general, the variance of a sum of random variables is not equal to the sum of the individual
variances.
An important special case is when the 𝑋 𝑗 s are independent random variables. If 𝑋 1, 𝑋 2, ..., 𝑋𝑛 are
independent random variables, then 𝐶𝑂𝑉 (𝑋 𝑗 , 𝑋𝑘 ) = 0 for 𝑗 ≠ 𝑘 and:
VAR[X_1 + X_2 + ... + X_n] = Σ_{k=1}^{n} VAR[X_k]   (5.4)
Now suppose X_1, X_2, ..., X_n are n IID random variables, each with mean m and variance σ^2.
Then the sum S_n has mean E[S_n] = nm and, by Equation 5.4, variance
VAR[S_n] = nσ^2   (5.6)
The sample mean of the sequence is defined as M_n = S_n/n = (1/n) Σ_{j=1}^{n} X_j, and the sample variance as
σ̂_n^2 = (1/n) Σ_{j=1}^{n} (X_j − M_n)^2   (5.8)
The sample mean is itself a random variable, so it will exhibit random variation. Our aim is to
verify if 𝑀𝑛 can be a good estimator of 𝐸 [𝑋 ] = 𝑚. A good estimator is expected to have the
following two properties:
1. On the average, it should give the correct expected value (with no bias): 𝐸 [𝑀𝑛 ] = 𝑚
2. It should not vary too much about the correct value of this parameter, that is, 𝐸 [(𝑀𝑛 − 𝑚) 2 ]
(variance) is small.
The expected value of the sample mean is given by:
E[M_n] = E[(1/n) Σ_{j=1}^{n} X_j] = (1/n) Σ_{j=1}^{n} E[X_j] = m   (5.9)
since 𝐸 [𝑋 𝑗 ] = 𝐸 [𝑋 ] = 𝑚 for all 𝑗. Thus the sample mean is equal to 𝐸 [𝑋 ] = 𝑚 on the average.
For this reason, we say that the sample mean is an unbiased estimator for 𝑚.
The mean square error of the sample mean about m is equal to the variance of M_n, that is,
E[(M_n − m)^2] = E[(M_n − E[M_n])^2] = VAR[M_n]
Note that 𝑀𝑛 = 𝑆𝑛 /𝑛 where 𝑆𝑛 = 𝑋 1 + 𝑋 2 + ... + 𝑋𝑛 . From Equation 5.6, 𝑉 𝐴𝑅 [𝑆𝑛 ] = 𝑛𝜎 2 , since the
𝑋 𝑗 s are IID random variables. Thus
VAR[M_n] = (1/n^2) VAR[S_n] = nσ^2/n^2 = σ^2/n   (5.11)
Therefore the variance of the sample mean approaches zero as the number of samples is increased.
This implies that the probability that the sample mean is close to the true mean approaches one as
𝑛 becomes very large. We can formalize this statement by using the Chebyshev inequality from
Equation 3.127:
P(|M_n − E[M_n]| ≥ ε) ≤ VAR[M_n]/ε^2   (5.12)
Substituting for E[M_n] and VAR[M_n], we obtain
P(|M_n − m| ≥ ε) ≤ σ^2/(nε^2)   (5.13)
If we consider the complement, we obtain
P(|M_n − m| < ε) ≥ 1 − σ^2/(nε^2)   (5.14)
Thus for any choice of error 𝜀 and probability 1 − 𝛿, we can select the number of samples 𝑛 so
that 𝑀𝑛 is within 𝜀 of the true mean with probability 1 − 𝛿 or greater. The following example
illustrates this.
Example 5.2
A voltage of constant but unknown value v is to be measured. Each measurement X_j is the sum of v
and a zero-mean noise voltage with variance 1 μV^2. How many measurements are required so that the
sample mean M_n is within ε = 1 μV of the true mean with probability at least 0.99?
Solution. Each measurement X_j has mean v and variance 1, so from Equation 5.14 we require
that n satisfy:
1 − σ^2/(nε^2) = 1 − 1/n = 0.99
This implies that 𝑛 = 100. Thus if we were to repeat the measurement 100 times and compute
the sample mean, on the average, at least 99 times out of 100, the resulting sample mean
will be within 1𝜇𝑉 of the true mean.
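A simulation sketch of this example. The Chebyshev argument uses only the mean and variance, so the noise distribution below (zero-mean Gaussian with variance 1) is merely an assumption made for the simulation; the true voltage v = 5 is likewise arbitrary.

```python
# Simulation of Example 5.2: fraction of trials where the 100-sample mean is within 1 of v.
import random
from statistics import mean

v = 5.0          # the true (unknown) voltage
n = 100          # number of measurements, as found in the example
trials = 20_000

within = 0
for _ in range(trials):
    M_n = mean(random.gauss(v, 1.0) for _ in range(n))
    if abs(M_n - v) < 1.0:
        within += 1

print(within / trials)   # essentially 1.0, comfortably above the guaranteed 0.99
```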
Equation 5.14 requires that the X_j have finite variance. It can be shown, however, that the weak law
of large numbers below holds even if the variance of the X_j does not exist.
Theorem 5.1: Weak Law of Large Numbers
Let X_1, X_2, ... be a sequence of IID random variables with finite mean E[X] = m. Then for ε > 0,
lim_{n→∞} P(|M_n − m| < ε) = 1   (5.16)
The weak law of large numbers states that for a large enough fixed value of 𝑛, the sample mean
using 𝑛 samples will be close to the true mean with high probability. The weak law of large
numbers does not address the question about what happens to the sample mean as a function
of 𝑛 as we make additional measurements. This question is taken up by the strong law of large
numbers.
Suppose we make a series of independent measurements of the same random variable. Let
𝑋 1, 𝑋 2, ... be the resulting sequence of IID random variables with mean 𝑚. Now consider the
sequence of sample means that results from the above measurements: 𝑀1, 𝑀2, ... where 𝑀 𝑗 is
the sample mean computed using 𝑋 1 through 𝑋 𝑗 . We expect that with high probability, each
particular sequence of sample means approaches 𝑚 and stays there:
𝑃 ( lim 𝑀𝑛 = 𝑚) = 1 (5.17)
𝑛→∞
that is, with virtual certainty, every sequence of sample mean calculations converges to the true
mean of the quantity (The proof of this result is beyond the level of this unit).
Theorem 5.2: Strong Law of Large Numbers
Let X_1, X_2, ... be a sequence of IID random variables with finite mean E[X] = m and finite
variance. Then
P\Big( \lim_{n\to\infty} M_n = m \Big) = 1        (5.18)
Equation 5.18 appears similar to Equation 5.16, but in fact it makes a dramatically different
statement. It states that with probability 1, every sequence of sample mean calculations will
eventually approach and stay close to 𝐸 [𝑋 ] = 𝑚. This is the type of convergence we expect in
physical situations where statistical regularity holds.
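As an informal illustration of this mode of convergence, the sketch below (not from the notes) prints a few points of three independent running sample-mean trajectories for exponential samples with mean m = 2; each trajectory is seen to settle near m and stay there. The distribution, seed, and sample sizes are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    m = 2.0             # true mean of the IID sequence (arbitrary)
    n_max = 100_000

    for seq in range(3):                                  # three independent trajectories
        x = rng.exponential(scale=m, size=n_max)          # IID samples with E[X] = m
        running_mean = np.cumsum(x) / np.arange(1, n_max + 1)
        snapshots = {k: running_mean[k - 1] for k in (10, 1_000, 100_000)}
        print(f"trajectory {seq}:",
              ", ".join(f"M_{k} = {val:.3f}" for k, val in snapshots.items()))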
Although under certain conditions, the theory predicts the convergence of sample means to
expected values, there are still gaps between the mathematical theory and the real world (i.e., we
can never actually carry out an infinite number of measurements and compute an infinite number
of sample means). Nevertheless, the strong law of large numbers demonstrates the remarkable
consistency between the theory and the observed physical behavior.
Note that the relative frequencies discussed in previous chapters are special cases of sample averages.
If we apply the weak law of large numbers to the relative frequency of an event 𝐴, 𝑓𝐴 (𝑛), in a
sequence of independent repetitions of a random experiment, we obtain
\lim_{n\to\infty} P(|f_A(n) - P(A)| < \varepsilon) = 1        (5.19)
Example 5.3
Suppose the probability p = P(A) of an event A is to be estimated by its relative frequency f_A(n) in n independent repetitions of a random experiment. How many repetitions n are needed so that f_A(n) is within ε = 0.01 of p with probability 0.95 or greater?
Solution. Let X = I_A be the indicator function of A. From Equations 3.45 and 3.46 we have
that the mean of I_A is m = p and the variance is σ² = p(1 − p). Since p is unknown, σ² is also
unknown. However, it is easy to show that p(1 − p) is at most 1/4 for 0 ≤ p ≤ 1. Therefore,
by Equation 5.13,
P(|f_A(n) - p| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2} \le \frac{1}{4n\varepsilon^2}
The desired accuracy is ε = 0.01 and the desired probability is 0.95, so we require

1 - 0.95 = \frac{1}{4n\varepsilon^2}
We then solve this for n and obtain n = 50,000. It has already been pointed out that the
Chebyshev inequality gives very loose bounds, so we expect that this value for 𝑛 is probably
overly conservative. In the next section, we present a better estimate for the required value
of 𝑛.
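A quick simulation (again, a sketch rather than part of the notes) suggests how conservative this value is: with n = 50,000 and an arbitrary true probability p = 0.3, the relative frequency falls within 0.01 of p in essentially every trial, not just 95% of them.

    import numpy as np

    rng = np.random.default_rng(2)
    p = 0.3          # true probability of A (unknown in practice; arbitrary here)
    n = 50_000       # sample size suggested by the Chebyshev bound
    trials = 2_000

    # Relative frequency of A in each trial of n independent repetitions.
    f_A = rng.binomial(n, p, size=trials) / n
    coverage = np.mean(np.abs(f_A - p) < 0.01)
    print(f"P(|f_A(n) - p| < 0.01) ~= {coverage:.4f}   (bound only guarantees >= 0.95)")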
The central limit theorem concerns the sum of a large number of IID random variables X_1, X_2, ..., each with mean m and variance σ². Define the zero-mean, unit-variance random variable

Z = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \frac{X_j - m}{\sigma}        (5.21)
Note that 𝑍 has been constructed such that 𝐸 [𝑍 ] = 0 and 𝑉 𝐴𝑅 [𝑍 ] = 1. In the limit as 𝑛 approaches
infinity, the random variable 𝑍 converges in distribution to a standard Gaussian random variable.
Several remarks about this theorem are in order at this point. First, no restrictions are placed on
the distribution of the X_j s: the theorem applies to the sum of IID random variables with any
distribution (provided the mean and variance are finite).
From a practical standpoint, the central limit theorem implies that for the sum of a sufficiently
large (but finite) number of random variables, the sum is approximately Gaussian distributed. Of
course, the goodness of this approximation depends on how many terms are in the sum and also
the distribution of the individual terms in the sum.
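The comparison can be reproduced numerically; the sketch below (an illustration, not the source of the figures) compares the empirical CDF of a sum of five Uniform(0,1) random variables with the Gaussian CDF of the same mean and variance at a few points, assuming numpy and scipy are available.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n_terms = 5                          # number of IID terms in the sum
    trials = 200_000

    # Sum of n IID Uniform(0,1) variables; its mean is n/2 and its variance n/12.
    s = rng.uniform(0.0, 1.0, size=(trials, n_terms)).sum(axis=1)
    mean, std = n_terms / 2, np.sqrt(n_terms / 12)

    for x in (1.5, 2.5, 3.5):
        empirical = np.mean(s <= x)                  # empirical CDF of the sum
        gaussian = norm.cdf(x, loc=mean, scale=std)  # Gaussian approximation
        print(f"x = {x}:  F_S(x) ~= {empirical:.4f},  Gaussian approx = {gaussian:.4f}")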
Figures 5.1 to 5.3 compare the exact CDF and the Gaussian approximation for the sums of Bernoulli,
uniform, and exponential random variables, respectively. In all three cases, it can be seen that the
approximation improves as the number of terms in the sum increases.
Figure 5.1: (a) The CDF of the sum of five independent Bernoulli random variables with 𝑝 = 1/2
and the CDF of a Gaussian random variable of the same mean and variance. (b) The
CDF of the sum of 25 independent Bernoulli random variables with 𝑝 = 1/2 and the
CDF of a Gaussian random variable of the same mean and variance.
Figure 5.2: The CDF of the sum of five independent discrete, uniform random variables from the
set {0, 1, 2, ..., 9} and the CDF of a Gaussian random variable of the same mean and
variance.
Figure 5.3: (a) The CDF of the sum of five independent exponential random variables of mean 1
and the CDF of a Gaussian random variable of the same mean and variance. (b) The
CDF of the sum of 50 independent exponential random variables of mean 1 and the
CDF of a Gaussian random variable of the same mean and variance.

The central limit theorem guarantees that the sum converges in "distribution" to Gaussian, but
this does not necessarily imply convergence in "density". As a counter-example, suppose that the
X_j s are discrete random variables; then the sum must also be a discrete random variable. Strictly
speaking, the density of 𝑍 would then not exist, and it would not be meaningful to say that the
density of 𝑍 is Gaussian. From a practical standpoint, the probability density of 𝑍 would be a
series of impulses. While the envelope of these impulses would have a Gaussian shape to it, the
density is clearly not Gaussian. If the 𝑋 𝑗 s are continuous random variables, the convergence in
density generally occurs as well.
The IID assumption is not needed in many cases; the central limit theorem also applies to
independent random variables that are not necessarily identically distributed. Loosely speaking,
all that is required is that no single term (or small group of terms) dominates the sum; the sum
of such independent random variables then approaches a Gaussian distribution as the number
of terms grows. The central limit theorem
also applies to some cases of dependent random variables, but we will not consider such cases here.
Example 5.4
The times between events in a certain random experiment are IID exponential random variables
with mean m seconds. Find the probability that the 1000th event occurs in the time interval
(1000 ± 50)m.
Solution. Let X_j be the time between events and let S_n be the time of the nth event, so that
S_n = X_1 + X_2 + ... + X_n. The mean and variance of the exponential random variable X_j are
E[X_j] = m and VAR[X_j] = m². The mean and variance of S_n are then E[S_n] = nE[X_j] = nm
and VAR[S_n] = nVAR[X_j] = nm². The central limit theorem
then gives
P(950m \le S_{1000} \le 1050m) = P\Big( \frac{950m - 1000m}{m\sqrt{1000}} \le Z_n \le \frac{1050m - 1000m}{m\sqrt{1000}} \Big)
\approx Q(-1.58) - Q(1.58) = 1 - 2Q(1.58) = 0.8866
Thus as 𝑛 becomes large, 𝑆𝑛 is very likely to be close to its mean 𝑛𝑚. We can therefore
conjecture that the long-term average rate at which events occur is
\frac{n \text{ events}}{S_n \text{ seconds}} = \frac{n}{nm} = \frac{1}{m} \text{ events/second}
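The numbers in this example can be checked with a brief simulation; the sketch below takes m = 1 second and 10,000 trials (both arbitrary choices) and compares the empirical probability with the central-limit-theorem approximation 1 − 2Q(50/√1000), using norm.sf for the Q-function.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    m, n = 1.0, 1000                 # mean inter-event time; index of the event of interest
    trials = 10_000

    # Time of the 1000th event in each trial: the sum of 1000 IID exponentials.
    s_n = rng.exponential(scale=m, size=(trials, n)).sum(axis=1)
    empirical = np.mean((950 * m <= s_n) & (s_n <= 1050 * m))

    z = 50 / np.sqrt(1000)           # the normalized half-width 50m / (m * sqrt(1000))
    approx = 1 - 2 * norm.sf(z)      # Q(x) = norm.sf(x)
    print(f"empirical = {empirical:.4f},   CLT approximation = {approx:.4f}")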
Example 5.5
Let ζ be selected at random from the interval S = [0, 1], where we assume that the probability
that ζ is in a sub-interval of S is equal to the length of the sub-interval. For n = 1, 2, ... we
define the sequence of random variables:

V_n(\zeta) = \zeta\Big(1 - \frac{1}{n}\Big)
The two ways of looking at sequences of random variables are evident here. First, we can
view V_n(ζ) as a sequence of functions of ζ, as shown in Figure 5.4(a). Alternatively, we can
imagine that we first perform the random experiment that yields ζ and then observe
the corresponding sequence of real numbers V_n(ζ), as shown in Figure 5.4(b).
Figure 5.4: Two ways of looking at sequences of random variables: (a) Sequence of random
variables as a sequence of functions of 𝜁 , (b) Sequence of random variables as a
sequence of real numbers determined by 𝜁
The standard methods from calculus can be used to determine the convergence of the sample
sequence for each point 𝜁 . Intuitively, we say that the sequence of real numbers 𝑥𝑛 converges
to the real number 𝑥 if the difference |𝑥𝑛 − 𝑥 | approaches zero as 𝑛 approaches infinity. More
formally, we say that:
The sequence 𝑥𝑛 converges to 𝑥 if, given any 𝜀 > 0, we can specify an integer 𝑁 such that for all
values of 𝑛 beyond 𝑁 we can guarantee that |𝑥𝑛 − 𝑥 | < 𝜀
Thus if a sequence converges, then for any 𝜀 we can find an 𝑁 so that the sequence remains inside
a 2𝜀 corridor about 𝑥, as shown in Figure 5.5.
If we make 𝜀 smaller, 𝑁 becomes larger. Hence we arrive at our intuitive view that 𝑥𝑛 becomes
closer and closer to x. If the limiting value x is not known, we can still determine whether a
sequence converges by using the Cauchy criterion:
The sequence x_n converges if and only if, given any ε > 0, we can specify an integer N′ such that
for all m and n greater than N′, |x_n − x_m| < ε.
The Cauchy criterion states that the maximum variation in the sequence for points beyond N′ is
less than ε.
Example 5.6
Let 𝑉𝑛 (𝜁 ) be the sequence of random variables from Example 5.5. Does the sequence of real
numbers corresponding to a fixed 𝜁 converge?
Solution. From Figure 5.4(a), we expect that for a fixed value 𝜁 , 𝑉𝑛 (𝜁 ) will converge to the
limit 𝜁 . Therefore, we consider the difference between the 𝑛th number in the sequence and
the limit:
|V_n(\zeta) - \zeta| = \Big| \zeta\Big(1 - \frac{1}{n}\Big) - \zeta \Big| = \Big| \frac{\zeta}{n} \Big| < \frac{1}{n}
where the last inequality follows from the fact that 𝜁 is always less than one. In order to
keep the above difference less than 𝜀 we choose 𝑛 so that
|V_n(\zeta) - \zeta| < \frac{1}{n} < \varepsilon

that is, we select n > N = 1/ε. Thus the sequence of real numbers V_n(ζ) converges to ζ.
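A two-line numerical check of this argument, with an arbitrary ζ = 0.7 and ε = 10⁻³ (a sketch only):

    # Once n >= N = 1/eps, the difference |V_n(zeta) - zeta| = zeta/n stays below eps.
    zeta, eps = 0.7, 1e-3
    N = int(1 / eps)
    for n in (N, 2 * N, 10 * N):
        v_n = zeta * (1 - 1 / n)
        print(f"n = {n}: |V_n - zeta| = {abs(v_n - zeta):.2e} < {eps}")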
When we talk about the convergence of sequences of random variables, we are concerned with
questions such as: Do all (or almost all) sample sequences converge, and if so, do they all converge
to the same values or to different values? The first two definitions of convergence address these
questions.
Example 5.7
Let 𝑋 be a random variable uniformly distributed over [0, 1). Then define the random
sequence
X_n = \frac{X}{1 + n^2}, \qquad n = 1, 2, 3, ...
In this case, for any realization 𝑋 = 𝑥, a sequence is produced of the form:
x_n = \frac{x}{1 + n^2}
which converges to lim𝑛→∞ 𝑥𝑛 = 0. We say that the sequence converges surely to lim𝑛→∞ 𝑋𝑛 =
0.
Sure convergence requires that the sample sequence corresponding to every 𝜁 converges. Note
that it does not require that all the sample sequences converge to the same values; that is, the
sample sequences for different points 𝜁 and 𝜁 0 can converge to different values.
Example 5.8
Let 𝑋 be a random variable uniformly distributed over [0, 1). Then define the random
sequence
X_n = \frac{n^2 X}{1 + n^2}, \qquad n = 1, 2, 3, ...

In this case, for any realization X = x, a sequence is produced of the form:

x_n = \frac{n^2 x}{1 + n^2}
which converges to lim𝑛→∞ 𝑥𝑛 = 𝑥. We say that the sequence converges surely to a random
variable lim𝑛→∞ 𝑋𝑛 = 𝑋 . In this case, the value that the sequence converges to depends on
the particular realization of the random variable 𝑋 .
Almost-sure convergence: the sequence of random variables X_n(ζ) is said to converge almost surely to the random variable X(ζ) if

P(\zeta : X_n(\zeta) \to X(\zeta) \text{ as } n \to \infty) = 1
In Figure 5.6 we illustrate almost-sure convergence for the case where sample sequences converge
to the same value 𝑥; we see that almost all sequences must eventually enter and remain inside a
2𝜀 corridor. In almost-sure convergence some of the sample sequences may not converge, but
these must all belong to 𝜁 s that are in a set that has probability zero.
The strong law of large numbers is an example of almost-sure convergence. Note that sure
convergence implies almost-sure convergence.
Example 5.9
As an example of a sequence that converges almost surely, consider the random sequence
X_n = \frac{\sin(n\pi X)}{n\pi X}

where X is a random variable uniformly distributed over [0, 1). For almost every realization
X = x, the sequence:

x_n = \frac{\sin(n\pi x)}{n\pi x}
converges to lim_{n→∞} x_n = 0. The one exception is the realization X = 0, in which case the
sequence becomes x_n = 1, which converges, but not to the same value. Therefore, we say
that the sequence X_n converges almost surely to lim_{n→∞} X_n = 0, since the one exception to
this convergence occurs with zero probability; that is, P(X = 0) = 0.
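A small sketch of this behaviour (illustrative only; the seed is arbitrary): a realization X = x drawn from [0, 1) is nonzero with probability one, and the corresponding sequence visibly decays to zero.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(0.0, 1.0)            # P(X = 0) = 0, so x > 0 almost surely
    for n in (1, 10, 100, 10_000):
        x_n = np.sin(n * np.pi * x) / (n * np.pi * x)
        print(f"n = {n}: X_n = {x_n:.6f}")    # tends to 0 as n grows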
Convergence in probability: the sequence X_n(ζ) is said to converge in probability to the random variable X(ζ) if, for any ε > 0,

P(|X_n(\zeta) - X(\zeta)| > \varepsilon) \to 0 \text{ as } n \to \infty
In Figure 5.7 we illustrate convergence in probability for the case where the limiting random
variable is a constant 𝑥; we see that at the specified time 𝑛 0 most sample sequences must be within
𝜀 of 𝑥. However, the sequences are not required to remain inside a 2𝜀 corridor. The weak law of
large numbers is an example of convergence in probability. Thus we see that the fundamental
difference between almost-sure convergence and convergence in probability is the same as that
between the strong law and the weak law of large numbers.
Example 5.10
Let 𝑋𝑘 , 𝑘 = 1, 2, 3, ... be a sequence of IID Gaussian random variables with mean 𝑚 and
variance σ². Suppose we form the sequence of sample means

M_n = \frac{1}{n}\sum_{k=1}^{n} X_k, \qquad n = 1, 2, 3, ...

Since the M_n are linear combinations of Gaussian random variables, they are also Gaussian,
with E[M_n] = m and VAR[M_n] = σ²/n. Therefore, the probability that the sample mean differs
from m by more than ε is P(|M_n − m| > ε) = 2Q(ε√n/σ), which goes to zero as n → ∞; hence the
sequence of sample means M_n converges in probability to m.
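Because M_n is exactly Gaussian here, the tail probability can be evaluated in closed form; the sketch below does so for m = 0, σ = 1 and ε = 0.1 (arbitrary illustrative values), showing how quickly it vanishes with n.

    import numpy as np
    from scipy.stats import norm

    m, sigma, eps = 0.0, 1.0, 0.1
    for n in (10, 100, 1_000, 10_000):
        # M_n is Gaussian with mean m and variance sigma^2 / n, so
        # P(|M_n - m| > eps) = 2 Q(eps * sqrt(n) / sigma), with Q(x) = norm.sf(x).
        tail = 2 * norm.sf(eps * np.sqrt(n) / sigma)
        print(f"n = {n}: P(|M_n - m| > {eps}) = {tail:.3e}")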
Mean-square (MS) convergence: the sequence X_n(ζ) is said to converge to X(ζ) in the mean-square sense if

E[(X_n(\zeta) - X(\zeta))^2] \to 0 \text{ as } n \to \infty
Example 5.11
Consider the sequence of sample means of IID Gaussian random variables described in
Example 5.10. This sequence also converges in the MS sense since:
E[(M_n - m)^2] = VAR[M_n] = \frac{\sigma^2}{n}

This variance converges to 0 as n → ∞, thus producing convergence of the random sequence
in the MS sense.
Convergence in distribution: the sequence X_n with CDFs F_n(x) is said to converge in distribution to the random variable X with CDF F(x) if

F_n(x) \to F(x) \text{ as } n \to \infty

at every point x where F(x) is continuous.
Example 5.12
Consider once again the sequence of sample means of IID Gaussian random variables
described in Example 5.10. Since 𝑀𝑛 is Gaussian with mean 𝑚 and variance 𝜎 2 /𝑛, its CDF
takes the form
F_{M_n}(x) = 1 - Q\Big( \frac{x - m}{\sigma/\sqrt{n}} \Big)
For any x > m, lim_{n→∞} F_{M_n}(x) = 1, while for any x < m, lim_{n→∞} F_{M_n}(x) = 0. Thus, the
sequence converges in distribution to the limiting CDF

F_M(x) = u(x - m)

where u(x) is the unit step function. Note that the point x = m is not a point of continuity
of F_M(x).
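The step-like limit can be seen numerically; the sketch below evaluates F_{M_n}(x) for one point below m and one above it (m = 1, σ = 2 are arbitrary choices), using norm.sf for the Q-function.

    import numpy as np
    from scipy.stats import norm

    m, sigma = 1.0, 2.0
    for n in (10, 1_000, 100_000):
        for x in (0.5, 1.5):                           # one point below m, one above
            F = 1 - norm.sf((x - m) / (sigma / np.sqrt(n)))
            print(f"n = {n:>6}, x = {x}: F_Mn(x) = {F:.4f}")
    # As n grows the CDF tends to 0 for x < m and to 1 for x > m, i.e. the unit step u(x - m).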
It should be noted, as seen in the preceding sequence of examples, that a random sequence
may converge in several of the different senses at once. In fact, one form of convergence may
imply convergence in several other forms. Table 5.1 illustrates these relationships. For example,
convergence in distribution is the weakest form of convergence and does not necessarily imply
any of the other forms of convergence. Conversely, if a sequence converges in any of the other
modes presented, it will also converge in distribution.
Table 5.1: Relationships between convergence modes, showing whether the convergence mode in
each row implies the convergence mode in each column
This ↓ implies this →   Sure   Almost Sure   Probability   Mean Square   Distribution
Sure                     X      Yes           Yes           No            Yes
Almost Sure              No     X             Yes           No            Yes
Probability              No     No            X             No            Yes
Mean Square              No     No            Yes           X             Yes
Distribution             No     No            No            No            X
By the central limit theorem, the sample mean M_n of n IID samples with standard deviation σ satisfies

P(|M_n - m| < \varepsilon) \approx 1 - 2Q\Big( \frac{\varepsilon\sqrt{n}}{\sigma} \Big)

Stated another way, let ε_a be the value of ε such that the right-hand side of the above equation is
1 − a; that is,

\varepsilon_a = \frac{\sigma}{\sqrt{n}} Q^{-1}(a/2)        (5.27)
where 𝑄 −1 is the inverse of the Q-function. Then, given 𝑛 samples which lead to a sample mean
M_n, the true mean will fall in the interval (M_n − ε_a, M_n + ε_a) with probability 1 − a. The interval
(M_n − ε_a, M_n + ε_a) is referred to as the confidence interval, the probability 1 − a is the confidence
level, and a is the level of significance. The confidence level and level of significance
are usually expressed as percentages. The corresponding values of the quantity c_a = Q^{-1}(a/2)
are provided in Table 5.2 for several typical values of a.
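The quantities c_a and ε_a are easy to compute directly; the sketch below does so for a few common significance levels with σ = 2 and n = 100 (arbitrary values), using scipy's norm.isf as Q⁻¹.

    import numpy as np
    from scipy.stats import norm

    sigma, n = 2.0, 100
    for a in (0.10, 0.05, 0.01):
        c_a = norm.isf(a / 2)                  # Q^{-1}(a/2)
        eps_a = sigma / np.sqrt(n) * c_a
        print(f"a = {a}: c_a = {c_a:.2f},  eps_a = {eps_a:.3f}")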
Example 5.13
Suppose the IID random variables each have a variance of 𝜎 2 = 4. A sample of 𝑛 = 100 values
is taken and the sample mean is found to be 𝑀𝑛 = 10.2. (a) Determine the 95% confidence
interval for the true mean 𝑚. (b) Suppose we want to be 99% confident that the true mean
falls within a factor of ±0.5 of the sample mean. How many samples need to be taken in
forming the sample mean?
Solution. (a) In this case σ/√n = 2/√100 = 0.2, and the appropriate value of c_a is c_{0.05} = 1.96 from
Table 5.2. The 95% confidence interval is then:

\Big( M_n - \frac{\sigma}{\sqrt{n}}c_{0.05},\; M_n + \frac{\sigma}{\sqrt{n}}c_{0.05} \Big) = (9.808, 10.592)
(b) To be 99% confident that the true mean falls within ±0.5 of M_n, we require (σ/√n) c_{0.01} ≤ 0.5, and therefore

n = \Big( \frac{c_{0.01}\,\sigma}{0.5} \Big)^2 = \Big( \frac{2.58 \times 2}{0.5} \Big)^2 = 106.5
Since 𝑛 must be an integer, it is concluded that at least 107 samples must be taken.
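Both parts of this example can be verified with a few lines of Python (a sketch; norm.isf plays the role of Q⁻¹, so the constants differ slightly from the rounded table values):

    import numpy as np
    from scipy.stats import norm

    # (a) 95% confidence interval for the true mean.
    sigma, n, M_n = 2.0, 100, 10.2
    c_05 = norm.isf(0.05 / 2)                          # ~1.96
    half_width = sigma / np.sqrt(n) * c_05
    print("95% interval:", (round(M_n - half_width, 3), round(M_n + half_width, 3)))

    # (b) Number of samples so that the 99% half-width is at most 0.5.
    c_01 = norm.isf(0.01 / 2)                          # ~2.58
    n_req = int(np.ceil((c_01 * sigma / 0.5) ** 2))
    print("required n:", n_req)                        # 107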
In summary, to achieve a level of significance specified by a, we note that, by virtue of the central
limit theorem, the normalized sample mean

\hat{Z}_n = \frac{M_n - m}{\sigma/\sqrt{n}}        (5.28)
approximately follows a standard normal distribution. We can then easily specify a symmetric
interval about zero in which a standard Gaussian random variable will fall with probability 1 − 𝑎.
As long as 𝑛 is sufficiently large, the original distribution of the IID random variables does not
matter.
Note that in order to form the confidence interval as specified, the standard deviation of the 𝑋 𝑗
must be known. While in some cases, this may be a reasonable assumption, in many applications,
the standard deviation is also unknown. The most obvious thing to do in that case would be to
replace the true standard deviation with the sample standard deviation.
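A minimal sketch of this substitution, assuming Gaussian data with a nominal mean of 3 and standard deviation of 2 (both treated as unknown by the estimator); the resulting interval is only approximate, since the sample standard deviation is itself a random quantity.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    x = rng.normal(loc=3.0, scale=2.0, size=100)   # data whose mean and std are "unknown"

    M_n = x.mean()
    s = x.std(ddof=1)                              # sample standard deviation replaces sigma
    eps_a = s / np.sqrt(len(x)) * norm.isf(0.05 / 2)
    print(f"approximate 95% interval: ({M_n - eps_a:.3f}, {M_n + eps_a:.3f})")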
Further Reading
1. Scott L. Miller, Donald Childers, Probability and random processes: with applications to
signal processing and communications, Elsevier 2012: chapter 7.
2. Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,
3rd ed. Pearson, 2007: chapter 7.