Notes Part1
May 1, 2019
Here I shall present, without using Analysis, the principles and general results of the
Theory, applying them to the most important questions of life, which are indeed, for
the most part, only problems in probability. One may even say, strictly speaking, that
almost all our knowledge is only probable; and in the small number of things that we
are able to know with certainty, in the mathematical sciences themselves, the principal
means of arriving at the truth - induction and analogy - are based on probabilities, so
that the whole system of human knowledge is tied up with the theory set out in this essay.
P.-S. Laplace
In this work I try to give definitions that are as general and precise as possible to some
numbers that we consider in Analysis: the definite integral, the length of a curve, the
area of a surface.
A. Kolmogorov
The theory of probability, like all other mathematical theories, cannot resolve a priori
concrete questions which belong to the domain of experience. Its only role - beautiful
in itself - is to guide experience and observations by the interpretation it furnishes for
their results.
É. Borel
The statistician who supposes that [their] main contribution to the planning of an ex-
periment will involve statistical theory, finds repeatedly that [they] make [their] most
valuable contribution simply by persuading the investigator to explain why [they wish]
to do the experiment.
G. M. Cox
The combination of some data and an aching desire for an answer does not ensure that
a reasonable answer can be extracted from a given body of data.
J. Tukey
On two occasions I have been asked [by members of Parliament], “Pray, Mr. Babbage,
if you put into the machine wrong figures, will the right answers come out?” I am
not able rightly to apprehend the kind of confusion of ideas that could provoke such a
question.
C. Babbage
While you are young, and are not burdened by the need to take care of the earnings
for keeping the family, try to penetrate deeper into the essence of mathematics and to
appreciate its beauty.
A. Skorokhod
Arithmetic is where the answer is right and everything is nice and you look out the
window and see the blue sky – or the answer is wrong and you have to start over and
try again and see how it comes out this time.
C. Sandburg
Contents
Preface ix
2 Preliminaries 5
2.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Cardinality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Sequences of sets . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Extended real number system . . . . . . . . . . . . . . . . . . . . 18
2.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Worked Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 20
II The Foothills 59
5.4 Outer measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Hausdorff measure . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Premeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.7 Approximation of measures . . . . . . . . . . . . . . . . . . . . . 99
5.8 Zoology of measure creatures . . . . . . . . . . . . . . . . . . . . 101
5.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.10 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.5 Regularity of Lebesgue-Stieltjes measure . . . . . . . . . . . . . . 128
6.6 Properties of Lebesgue measure . . . . . . . . . . . . . . . . . . . 130
6.7 Approximation of Lebesgue-Stieltjes measure . . . . . . . . . . . . 132
6.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.9 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7 Probability 137
7.1 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2 Examples of probability models . . . . . . . . . . . . . . . . . . . 145
7.3 “Infinitely often”, First Borel-Cantelli Lemma . . . . . . . . . . . 158
7.4 Conditional probability, independence . . . . . . . . . . . . . . . 168
7.5 Second Borel-Cantelli Lemma . . . . . . . . . . . . . . . . . . . 173
7.6 Tail σ-algebras . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.8 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.3 Approximation by simple functions . . . . . . . . . . . . . . . . . 187
8.4 Relation to continuity . . . . . . . . . . . . . . . . . . . . . . 194
8.5 Measure induced by a measurable function . . . . . . . . . . . . . 195
8.6 Probability measures of random variables . . . . . . . . . . . . . . 198
8.7 Random vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
8.8 Properties of the Hausdorff measure in Rn . . . . . . . . . . . 210
8.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.10 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.9 Integration of functions that depend on parameters . . . . . . . . . 259
9.10 Riemann and Lebesgue integration . . . . . . . . . . . . . . . 261
9.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
9.12 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.9 Countable collections of independent random variables . . . . . . 298
10.10 Application to a stochastic process model . . . . . . . . . . . . 303
10.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
10.12 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 306
11.2 Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . 312
11.3 Empirical distribution functions, medians, quantile functions . 316
11.4 Convergence of random series . . . . . . . . . . . . . . . . . . . . 325
11.5 Kolmogorov 0 − 1 Law for random variables . . . . . . . . . . 332
11.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
11.7 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 334
15.4 Proof of the Lindeberg-Feller Central Limit Theorem . . . . . . . 397
15.5 Law of the Iterated Logarithm . . . . . . . . . . . . . . . . . . 397
15.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
15.7 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 397
16 Lp Spaces 399
16.1 Lp spaces for 1 ≤ p < ∞ . . . . . . . . . . . . . . . . . . . . . . 400
16.2 L∞ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
16.3 Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
16.4 Approximation by simple functions and separability . . . . . . . . 410
16.5 Comparison of Lp spaces . . . . . . . . . . . . . . . . . . . . . . 411
16.6 Convergence in measure . . . . . . . . . . . . . . . . . . . . . . . 411
16.7 Hilbert Space structure of L2 . . . . . . . . . . . . . . . . . . . . 412
16.8 Conditional probability and orthogonal projection . . . . . . . 420
16.9 Duality and the Riesz Representation Theorem . . . . . . . . . 421
16.10 Weierstrass Approximation Theorem . . . . . . . . . . . . . . 423
16.11 Constructing smooth approximations using convolution . . . . 423
16.12 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
16.13 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 423
B To Do List 437
Bibliography 441
Index 445
Preface
We have a habit in writing articles published in scientific journals to make the work
as finished as possible, to cover up all the tracks, to not worry about the blind alleys
or describe how you had the wrong idea first, and so on. So there isn’t any place to
publish, in a dignified manner, what you actually did in order to get to do the work.
R. Feynman
We often hear that mathematics consists mainly of proving theorems. Is a writer’s job
mainly that of writing sentences?
G.-C. Rota
A professor is one who can speak on any subject – for precisely fifty minutes.
N. Wiener
Life is like riding a bicycle. To keep your balance, you must keep moving.
A. Einstein
This textbook is aimed at providing a mathematically capable reader with the means
to learn measure theory and probability theory together. One consequence of the de-
velopment of abstract measure theory and probability theory on parallel tracks is that
textbooks tend to focus heavily on one or the other. This is unfortunate, because abstract
measure theory and probability theory have a strongly positive symbiotic relationship
in terms of gaining understanding. Probability provides a practical and accessible mo-
tivation for the complicated details required in measure theory, while measure theory
lays bare the essential foundation of numerical approximation. So, this book attempts
to weave both measure theory and probability together.
Even so, there are many fine books on both measure theory and probability theory,
e.g., see [AG96, ADD00, Bil12, Bog00, Bre92, Chu01, Doo94, Dud92, Dur10, Fel68,
Fel66, Fol99, FG97, Gut13, HJ94, JP04, McK14, Par00, Pol02, Res05, Roy88, Ros00,
Swa94], and it is reasonable to ask if another book is necessary. In addition to melding
measure and probability theories, this book is designed to address several considerations.
• Measure theory can be presented from a number of viewpoints using a multitude
of approaches including proofs for the major results. Each particular approach is
better suited for some aspects of the subject but less well suited for other aspects.
Over two decades of teaching measure theory and probability, we have come to
rely on a literal library of books, using different books as the basis for different
topics. But, it is not a trivial task for a reader to move between books, because
of differences in notation and definition, order of assumptions and results, and
details in statements of theorems. In some sense, a major part of the work for
this book involved abstracting the approaches from the books we like for various
topics and presenting them systematically with a unified development. We have
borrowed freely from many great textbooks and present detailed descriptions of
the books used as sources at the end of each chapter. In using this approach, we
seem to be following a well worn path among writers of textbooks for measure
theory and probability!
• It may be a consequence of the fact that the development of measure theory since
Lebesgue’s thesis has been focused on abstraction and generalization, but in any
case, graduate textbooks in measure theory and probability theory tend to use
very efficient proofs and have very few examples. By efficient, we mean that an
extremely high value is placed on brevity and abstraction. In our experience, these
are not qualities that are optimal for learning any mathematics. This book uses
“elementary” proofs that explain why theorems hold, when we know them, and
presents a number of examples illustrating definitions and results. This has the
side benefit of making this textbook a course in "How to do analysis".
• At its heart, measure theory is built on numerical approximation of “length, area,
and volume”. But this fact is certainly downplayed, if not hidden outright, in
graduate textbooks in both measure theory and probability theory. This book
emphasizes the numerical approximation aspects of measure theory in its devel-
opment and includes significant material on the application of measure theory to
numerical approximation and practical numerical measure theory that is difficult
to find in textbooks that use rigorous measure theory.
• Mathematics students do not often learn about probability theory, even though
most take a measure theory course. Statistics students tend to have far too little
training in analysis. Both groups learn far too little about numerical approxima-
tion. Students in the applied mathematical sciences, engineering, and science often
find it difficult to gain immediate practical sense from the abstract approach
of most textbooks in measure theory and probability even as rigorous probability
becomes increasingly important in science and engineering. This book aims to
provide a way through measure theory and probability that addresses these peda-
gogical issues.
On the other hand, this book does not present the most general setting for probability and
measure theory nor is the coverage of some advanced probability and measure theory
as comprehensive as that found in other textbooks. All of these topics can be pursued after
learning the material in this book by readers who need specialized training.
What is covered in the book. This book presents the core material from measure the-
ory and measure-theoretic probability theory, i.e., the main concepts and theorems along
with numerous examples, albeit in an unusual arrangement. In addition, it covers the
foundation of stochastic computation and some important results in mathematical statis-
tics because both provide good applications of the core material. Those subjects are
not commonly covered in standard courses in probability. The unusual topics are well
marked and are not required to follow the development of the core material.
The book does not cover a few topics commonly found in other books. The main
omission is Fourier analysis and applications to probability including the common proof
of the Central Limit Theorem (we present another proof). Fourier analysis is an im-
portant subject that deserves a careful theoretical and computational treatment and the
authors find that the cursory treatment provided in probability textbooks is less than
useful. This book also omits stochastic processes. That is another beautiful, important
application of probability that deserves its own treatment. It is easy to find good text-
books in Fourier analysis, see [AH01, Bra86, Che01, Fol92, Fol99, Kat04, SW71], and
stochastic processes, see [Fel68, GS04, KT75, KT81, Res92, Ros96], for readers that
want to learn those topics. Also, this book does not develop very abstract measure the-
ory and probability theory, though it gives a taste with a discussion of measure theory
in abstract metric spaces. Again, it is easy to find good books at that level for readers
who need to study more than the core material.
This book also avoids developing machinery for “slick” proofs that is common
among probability books, e.g. the π − λ theorem. In the authors’ opinion, such proofs
do not do a good job of explaining why theorems are true and readers who learn prob-
ability that way tend to see probability as a collection of results rather than a coherent
method of reasoning, estimation and computation. Instead, this book uses what used to
be called "elementary proofs" based on approximation, convergence, and estimation. At
first glance, that makes some proofs longer. But there are patterns in such an approach
that have the effect that proofs become easier to understand as the material progresses.
A guide to using the book. Every chapter has the same outline. They begin with a
section called “A panorama” that provides a picture of what will be covered. After the
main material, there is a section on references and another on worked problems. There
are many good books on measure theory and probability theory, and we link the material
presented in this book to the material in other textbooks. Where we have frequently
borrowed ideas about presentation and proofs, we note that in the reference section. We
encourage readers to look into other books to learn about different approaches. The
worked problems section presents a few complicated problems for which solutions are
provided in the back of the book.
In order to refrain from laboriously repeating assumptions in theorems, we set global
assumptions for a section or a chapter displayed prominently in a box with red borders
and purple background. It is very important to keep track of such global assumptions.
Any section that can be safely skipped is indicated with a bicycle.
We have divided the chapters into four Parts.
Part I presents material on how probability arises as a model and how considering
apparently simple questions in probability quickly escalates into complicated mathe-
matics that can be addressed by measure theory. It also gives a picture of what is in-
volved in developing a general theory for measuring “length, area, and volume”, which
will help the reader when dealing with the complexities of measure theory. Only Chapter 2 in the preliminary material is required for the following material; however, the authors
strongly recommend reading through all of Part I before embarking on the rest of the
book.
We present the foundations of measure theory and measure-theoretic probability in
Part II. We begin with the construction of general measure theory in Chapter 5, then
focus on the application to construction of measures in Euclidean space. Following the
general measure theory, we develop the application to probability theory, and justify the
application by stating and proving several fundamental results. We conclude the part
by investigating integration and its probability counterpart, expectation.
In Part III, we develop more advanced topics in measure theory and probability.
Part I
Chapter 1
With caution judge of probability. Things deemed unlikely, e’en impossible, experience
oft’
Shakespeare
A very small cause which escapes our notice determines a considerable effect that we
cannot fail to see, and then we say that the effect is due to chance.
H. Poincaré
When things get too complicated, it sometimes makes sense to stop and wonder: ‘Have
I asked the right question?’
E. Bombieri
The origins of measure theory lie in the 1800s, with the start of a decades-long
effort to create a theory for integration that improves and generalizes the Riemann integral. Early contributors E. Borel, C. Jordan, J. Hadamard, and G. Peano recognize that
a general theory of integration would also involve generalizing the familiar notions of
length, area, and volume for simple geometric figures to more complicated sets. Build-
ing on Borel’s work in particular, H. Lebesgue establishes a systematic theoretical basis
for both measure and integration in his thesis published in 1902.
But that was only a launching point for measure theory, as it turned out that Lebesgue’s
theory could be generalized, abstracted, and applied in a number of ways. Early exten-
sions are made by C. Carathéodory, M. Fréchet, O. Nikodym, J. Radon, W. Sierpiński,
and T.-J. Stieltjes. The formative development of modern measure theory continues for
roughly five more decades, with contributions by many well-known mathematicians.
One of the highlights occurs in 1933, when A. Kolmogorov publishes a book in
which he proposes measure theory as a foundation for rigorous probability, based on
related work of E. Borel, F. Bernstein, F.P. Cantelli, M. Fréchet, P. Lévy, A. Markov,
H. Rademacher, E. Slutsky, H. Steinhaus, S. Ulam and R. von Mises. This lifts prob-
ability out of a disreputable state in the mathematical sciences and initiates a parallel
development in the theory of probability. This also initiates a blossoming of mathemat-
ical statistics.
Today, it would be impossible to overstate the central importance of measure theory
in the mathematical sciences. For example, measure theory provides the foundation
for a comprehensive theory of integration, probability theory, ergodic theory, stochastic
computation, mathematical statistics, theory of partial differential equations, spaces of
functions, functional analysis, and Fourier analysis. It is a theory that is both beautiful
and useful.
Chapter 2
Preliminaries
You know, many years ago – back in the 1930’s – I thought I was interested in the
foundations of mathematics. I even wrote a paper or two in the area. We had a brilliant
young logician on our faculty who was making a name for himself. One term I happened
to lecture in the same classroom, the hour immediately after he did. Each day, before
erasing the blackboard, I would take a look to see what he was up to. Well, in September
he started out with Theorem 1. Shortly before Christmas he was up to Theorem 747.
You know what it was? ’If x=y and y=z, then x=z’! At that point, something within
me snapped. I said to myself, ’There are some things in mathematics that you just have
to take for granted!’ And I never again had anything to do with the foundations of
mathematics.
G. Birkhoff
Later generations will regard Mengenlehre (set theory) as a disease from which one has
recovered.
H. Poincaré
There exists a Dedekind complete, chain ordered field, called the real numbers. It is
unique up to isomorphism...
E. Schechter
Panorama
Measure theory and probability are constructed in order to deal with complex sets that
arise when describing very practical situations. Directly or indirectly, what makes the
sets complicated has to do with limiting behavior of some kind of infinite process. In
this chapter, we describe some of the important concepts and operations for sets.
The key ideas are methods for combining sets in order to get new sets, how to define
functions and their inverses on sets, and how to count the number of elements in sets.
Some of the examples are aimed at helping the reader to think about complicated sets,
including the central idea of a set whose elements are themselves sets, i.e. a “set of
sets”. That idea leads to another central idea of an infinite sequence of sets. We also
explain the importance of distinguishing two “kinds” of infinity.
While this chapter is admittedly very dry, it is very important to everything that
follows. At least, it is relatively short!
2.1 Sets
Sets are, of course, familiar.
We use capital letters A, B to denote sets, and when we want to distinguish one
particular set, e.g., as the largest set that contains all other elements and sets of a certain
class of objects, we use the “blackboard” font X in measure theory and Ω in probability.
Some important examples with their notation:
Example 2.1.1
A reader can properly complain that there is something circular about defining a
“set” as a “collection” and Definition 2.1.1 is only a statement of the labels we use
to denote the concept rather than a workable definition. In fact, the concept of set is
understood by the operations involving sets that can be performed. Perhaps, the most
fundamental of these is “belonging to”:
Definition 2.1.2
Recall the notation {indices : specified condition is satisfied} that is used to describe
subsets.
Example 2.1.2
One of the attractive qualities of measure theory is that it can be applied to familiar
sets in Euclidean space and sets in much more complicated spaces.
Example 2.1.3
We always allow ∅ ⊂ A for any set A; that is, the empty set is a subset of every set.
Note also that a ∉ ∅ for any element a of any set A.
At an elementary level, we think of a set in terms of its points. But, we can also
think of a set as defining the collection of its subsets, which is fundamental to measure
theory.
The power set PX for a set X is the set consisting of all subsets of X.
The power set of a set includes the set itself and the empty set.
Example 2.1.4
If X = {a, b, c}, then PX = {∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}.
Measure theory is built on sets and set operations. The main operations are:
Definition 2.1.5
A ∪ B = {a : a ∈ A or a ∈ B} (Union),
A ∩ B = {a : a ∈ A and a ∈ B} (Intersection),
A \ B = A − B = {a : a ∈ A and a ∉ B} (Difference).
Another less familiar operation turns out to be important for measure theory:
If the symmetric difference of two sets is “small”, the sets nearly coincide.
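For finite sets, these operations are easy to experiment with directly. The following minimal Python sketch, using hypothetical sets A and B, illustrates the operations of Definition 2.1.5 together with the symmetric difference:

    # A sketch of the operations in Definition 2.1.5 using Python's built-in
    # set type.  The sets A and B are hypothetical examples, not from the text.
    A = {1, 2, 3, 4}
    B = {3, 4, 5, 6}

    print(A | B)   # union A ∪ B: {1, 2, 3, 4, 5, 6}
    print(A & B)   # intersection A ∩ B: {3, 4}
    print(A - B)   # difference A \ B: {1, 2}

    # The symmetric difference collects the points in exactly one of the two
    # sets, (A \ B) ∪ (B \ A).  Here it is {1, 2, 5, 6}; when it is "small",
    # A and B nearly coincide.
    print(A ^ B)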
We frequently deal with operations and sums of collections of objects indexed by a
set. It is usually important to distinguish the case of the index set being a subset of the
natural numbers from other cases. We use roman letters, e.g. i, j, k, l, m, n,
for indices that take their values in a subset of the natural numbers, and Greek letters for
indices in all other cases. Note that we drop the subscript index set A in the statements
when it is clear which index set is being considered.
Example 2.1.5
⋃_{i=0}^{∞} (i, i + 1] = R_+,    ⋃_{α ∈ [0,1]} {α} = [0, 1].
We collect the basic facts about these operations in the theorem below.
Theorem 2.1.1
Example 2.1.6
A × B = {(a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2)}.
Theorem 2.1.2
(A ∩ B) × (C ∩ D) = (A × C) ∩ (B × D).
However, it does not behave nicely with respect to other set operations, e.g.
(A ∪ B) × (C ∪ D) ≠ (A × C) ∪ (B × D) in general.
[Figure: intervals A, B with A ∩ B on one axis and C, D with C ∩ D on the other, illustrating (A ∩ B) × (C ∩ D) = (A × C) ∩ (B × D).]
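Both the identity and its failure for unions are easy to check numerically for small finite sets; a minimal Python sketch, with hypothetical choices of the sets:

    from itertools import product

    def prod(X, Y):
        # Cartesian product X × Y as a set of ordered pairs
        return set(product(X, Y))

    A, B = {1, 2}, {2, 3}
    C, D = {'x'}, {'x', 'y'}

    # Products commute with intersections (Theorem 2.1.2):
    assert prod(A & B, C & D) == prod(A, C) & prod(B, D)

    # ...but not with unions, in general:
    lhs = prod(A | B, C | D)
    rhs = prod(A, C) | prod(B, D)
    print(lhs == rhs)   # False
    print(lhs - rhs)    # {(1, 'y')}: in (A ∪ B) × (C ∪ D) but not in the union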
2.2 Functions
Along with sets, measure theory and probability are also built on functions.
b = f(a), a ∈ X
Typically, operator is reserved for maps on infinite dimensional spaces, while map and
transformation usually indicate a function valued in a vector space of finite dimension
larger than 1. A function usually indicates a scalar, e.g. real or complex, valued map.
Example 2.2.1
Example 2.2.2
Definite integration on an interval [a, b], i.e. f(x) → ∫_a^b f(x) dx, is an operator
that maps C([a, b]) into R.
Example 2.2.3
for A ∈ PX . The double use of | | for absolute value and cardinality ends up not
causing much trouble.
Example 2.2.4
using the familiar Riemann integral. One outcome of measure theory is to extend
the domain of this set function to a much larger class of sets.
A set function may be defined on sets consisting of single points. Vice versa, we
can think of a function defined on points of a set as a set function,
Definition 2.2.3
f (A) = {f (a) : a ∈ A} ⊂ Y.
Definition 2.2.4
The domain of a function is the set of permissible inputs. The range of a function
is the set of all outputs of a function for a given domain.
In practice, there is some ambiguity in the definitions of domain and range. The
“natural” domain is the set of all inputs for which the function is defined, but we often
restrict this to some subset. Likewise, range is often used to refer to a set that contains
the actual range of a function, e.g. R and R+ both might be called the range of x2 for
x ∈ R. It is important to be very precise about the domain and range in measure theory
and probability. With this in mind, we define:
Definition 2.2.5
Example 2.2.5
Example 2.2.6
The operator f (x) → 2f (x) is an onto and 1 − 1 map from C([a, b]) to C([a, b]).
Example 2.2.7
The concept of the inverse map to a function is centrally important to measure theory. It is extremely important to pay attention to the domain and range in this situation.
Definition 2.2.6: Inverse map
Example 2.2.8
√ √
Consider f (x) = x2 mapping R to R+ . Then f −1 (y) = − y, + y for y ≥ 0.
Example 2.2.9
Consider f(x, y) = x^2 + y^2 mapping R^2 to R_+. Then f^{−1}(z) is a circle in R^2 of
radius √z for z ≥ 0.
Example 2.2.10
Consider f(x) = sin(x) mapping R to [−1, 1]. Then f^{−1}(y) is an infinite set of
points for each −1 ≤ y ≤ 1.
Example 2.2.11
Example 2.2.12
Consider the average over [a, b] of a continuous function, i.e. the map from C([a, b])
to R given by f(x) → (1/(b − a)) ∫_a^b f(x) dx. Then f^{−1}(c) consists of the set of
continuous functions in C([a, b]) that have average value c.
The natural domain of the inverse map to a function f : X → Y is the range Y. The
range of the inverse map is a subset of the power set PX of the domain of the map. That
is, the inverse map to a function f : X → Y maps Y to a space whose points are sets
in X in general. We can interpret this another way:
Definition 2.2.7
Example 2.2.13
Remark 2.2.1
Definition 2.2.8
Theorem 2.2.1
The reason that inverse maps play such an important role in measure theory is that
the inverse of a map “commutes” with unions, intersections and complements.
Theorem 2.2.2
Proof. This is another good exercise. We prove the first result to give a start.
The importance of Theorem 2.2.2 is revealed below. Note that functions and set
operations do not commute the same way, e.g. f(B_1 ∩ B_2) ≠ f(B_1) ∩ f(B_2) in
general.
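The contrast is easy to see numerically. A minimal Python sketch, using the hypothetical choice f(x) = x^2 on a finite grid, checks that inverse images commute with unions and intersections while forward images do not:

    # A numerical check of Theorem 2.2.2 for the hypothetical choice
    # f(x) = x^2 on a finite grid.
    X = range(-5, 6)
    f = lambda x: x * x

    def preimage(B):
        # f^{-1}(B) = {x in X : f(x) in B}
        return {x for x in X if f(x) in B}

    def image(A):
        return {f(x) for x in A}

    B1, B2 = {0, 1, 4}, {4, 9, 16}

    # Inverse images commute with unions and intersections:
    assert preimage(B1 | B2) == preimage(B1) | preimage(B2)
    assert preimage(B1 & B2) == preimage(B1) & preimage(B2)

    # Forward images do not commute with intersections in general:
    A1, A2 = {-1}, {1}
    print(image(A1 & A2))           # set(): the image of the empty set
    print(image(A1) & image(A2))    # {1}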
2.3 Cardinality
We mentioned above that specifying the size, or cardinality, of an index set is important
in certain places. Formalizing that notion,
Definition 2.3.1
Two sets X and Y are equivalent or have the same cardinality, written X ∼ Y,
if there is a 1 − 1 and onto map f : X → Y.
• If X = ∅ or X ∼ {1, 2, . . . , n} for some n ∈ N, we say that X is finite.
• If X is finite or X ∼ N, we say that X is countable.
• If X is not countable, we say that X is uncountable.
Note that there are different cardinalities among the uncountable sets but that is not
important for the material below.
Countable cardinality has some interesting properties. For example
Example 2.3.1
Example 2.3.2
Example 2.3.3
Example 2.3.2 can be used to show that Q is countable, after recognizing that
numbers in Q can be written as the ratio of integers.
In fact, it follows from the definition that all countable sets are equivalent and, in-
deed, can be represented in the same way.
Figure 2.2. Showing that N ∼ Z2 .
Theorem 2.3.1
Theorem 2.3.2
Proof. Since we are dealing with a countable number of sets, we can write them as
{A_i}_{i=0}^∞. Moreover, each A_i is countable, so we can write A_i = {a_{i,0}, a_{i,1}, a_{i,2}, · · · }.
Then we can use the numbering scheme shown in Fig. 2.2 to label ⋃_{i=0}^∞ A_i = {a_{i,j} : i ∈
N, j ∈ N}.
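The numbering scheme of Fig. 2.2 is easy to implement; a minimal Python sketch that reproduces the enumeration:

    # A sketch of the diagonal numbering scheme of Fig. 2.2, which assigns a
    # single natural-number label to every pair (i, j) of natural numbers.
    def diagonal_enumeration(n_terms):
        pairs, d = [], 2            # d indexes the anti-diagonal i + j = d
        while len(pairs) < n_terms:
            for i in range(1, d):   # walk along the anti-diagonal
                pairs.append((i, d - i))
            d += 1
        return pairs[:n_terms]

    for label, pair in enumerate(diagonal_enumeration(6)):
        print(label, pair)
    # 0 (1, 1)
    # 1 (1, 2)
    # 2 (2, 1)
    # 3 (1, 3)
    # 4 (2, 2)
    # 5 (3, 1)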
Theorem 2.3.3
Example 2.3.4
Example 2.3.5
C([a, b]) is uncountable since it contains the set of constant functions with values
in (0, 1].
Theorem 2.3.4
For product sets A × B, the cardinality of A × B is |A| |B| when A and B are sets
with finite cardinality. If A and B are countable sets, then A × B is countable. If
A and B are nonempty sets and either A or B or both are uncountable, then A × B
is uncountable.
Definition 2.4.1
Let {A_i}_{i=1}^∞ be a collection of subsets of a set X.
If A_1 ⊂ A_2 ⊂ A_3 ⊂ · · · and ⋃_{i=1}^∞ A_i = A, then we say that {A_i}_{i=1}^∞ is a
(monotone) increasing sequence of sets and that A_i converges to A. We denote this
by A_i ↗ A and A_i ↑ A.
If A_1 ⊃ A_2 ⊃ A_3 ⊃ · · · and ⋂_{i=1}^∞ A_i = A, then we say that {A_i}_{i=1}^∞ is a
(monotone) decreasing sequence of sets and that A_i converges to A. We denote this
by A_i ↘ A and A_i ↓ A.
Some books use “sequence” when referring to both a collection and a monotone se-
quence of sets.
Example 2.4.1
Let A_i = (0, 1 − 1/i] for i ≥ 2. We have A_i ↑ (0, 1).
Example 2.4.2
Let A_i = (0, 1 + 1/i) for i ≥ 1. We have A_i ↓ (0, 1].
Theorem 2.4.1
Let {A_i}_{i=1}^∞ be a sequence of subsets of X.
1. If A_i ↗ A, then A_i = ⋃_{j=1}^{i} A_j for each i.
2. If A_i ↗ A, then A_i^c ↘ A^c.
Definition 2.4.2
Example 2.4.3
Some authors use a special notation to indicate the case of a disjoint union. In this
book, we always note that situation in words. When reading any book, it is important to
note how disjoint unions are indicated.
The next set of ideas is based on the observation that given two subsets A, B ⊂ X,
we can write the union as a disjoint union:
A ∪ B = (A) ∪ (B ∩ Ac ) .
Theorem 2.1.1 implies the following result, which turns out to be extremely useful in
measure theory proofs.
Theorem 2.4.2
Let {A_i}_{i=1}^∞ be a collection of subsets of X. Then,
1. Set A = ⋃_{i=1}^∞ A_i. Define the sequence of sets {B_j}_{j=1}^∞ via B_1 = A_1 and
B_j = ⋃_{i=1}^{j} A_i for j ≥ 2. Then B_j ↗ A.
2. Define the collection of sets {B_j}_{j=1}^∞ via B_1 = A_1 and B_j = A_j \ (⋃_{i=1}^{j−1} A_i)
for j ≥ 2. Then {B_j}_{j=1}^∞ is a disjoint collection of sets with ⋃_{i=1}^∞ A_i =
⋃_{i=1}^∞ B_i.
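Part 2 of the theorem is often called disjointification; a minimal Python sketch for finite sets:

    # A sketch of the disjointification in part 2 of Theorem 2.4.2 for finite
    # sets: B_1 = A_1 and B_j = A_j \ (A_1 ∪ · · · ∪ A_{j−1}).
    def disjointify(sets):
        seen, result = set(), []
        for A in sets:
            result.append(A - seen)   # B_j: the points of A_j not seen before
            seen |= A
        return result

    A = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
    B = disjointify(A)
    print(B)   # [{1, 2, 3}, {4}, {5}] -- pairwise disjoint

    # The B_j have the same union as the A_i:
    assert set().union(*A) == set().union(*B)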
Definition 2.5.1
The extended real numbers are R̂ = {x : −∞ ≤ x ≤ ∞}, obtained by adjoining ±∞
to R = {x : −∞ < x < ∞}, with the conventions
x ± ∞ = ±∞ for x ∈ R, ∞ + ∞ = ∞, −∞ − ∞ = −∞,
x · (±∞) = ±∞ for x > 0, x · (±∞) = ∓∞ for x < 0, 0 · (±∞) = 0.
Before the reader starts wondering why Calculus makes a lot of fuss about ∞, note that
these definitions are consistent with how we deal with sequences that may have infinite
limits (and may have been said to diverge). For that reason, we do not assign ∞ − ∞ a
value. Also, the convention that 0 · (±∞) = 0 is only permissible because in measure
theory that is usually the correct value that would be assigned with a careful analysis.
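As an aside for readers who compute: IEEE floating point arithmetic follows most, but not all, of these conventions; in particular it leaves 0 · (±∞) undefined, so the measure-theoretic convention must be imposed by hand. A minimal Python sketch:

    import math

    inf = math.inf

    print(5 + inf, -5 - inf)      # inf -inf   (x ± ∞ = ±∞)
    print(2 * inf, -2 * inf)      # inf -inf   (x · (±∞) for x > 0, x < 0)
    print(inf - inf)              # nan: ∞ − ∞ is left undefined, as above

    # IEEE arithmetic also leaves 0 · ∞ undefined (nan), so code following
    # the measure-theoretic convention 0 · (±∞) = 0 must impose it by hand:
    def extended_mul(x, y):
        if x == 0 or y == 0:
            return 0.0
        return x * y

    print(0 * inf)                # nan
    print(extended_mul(0, inf))   # 0.0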
With these conventions, other structures associated with real numbers are extended
in the obvious way. In particular,
Definition 2.5.2
R̂_+ = [0, ∞] = {x ∈ R̂ : 0 ≤ x ≤ ∞},
R̂^n = {(x_1, · · · , x_n) : x_i ∈ R̂, 1 ≤ i ≤ n}.
Definition 2.5.3
Remark 2.5.1
It is important to keep in mind that the extended reals are not a field.
Definition 2.5.4
If A ⊂ R̂ is nonempty and not bounded from below, we define inf A = −∞. If
A ⊂ R̂ is nonempty and not bounded from above, we define sup A = ∞.
It follows,
Theorem 2.5.1
Definition 2.5.5
It follows,
Theorem 2.5.2
• If {a_i}_{i=1}^∞ is an increasing sequence in R̂, then {a_i}_{i=1}^∞ converges to lim sup a_i.
• If {a_i}_{i=1}^∞ is a decreasing sequence in R̂, then {a_i}_{i=1}^∞ converges to lim inf a_i.
• A sequence {a_i}_{i=1}^∞ in R̂ converges in R̂ if and only if lim sup a_i = lim inf a_i,
in which case lim a_i = lim sup a_i = lim inf a_i.
Proof. The proof is a good exercise. It requires treating different cases depending on
whether or not ±∞ is involved.
We frequently deal with functions in relation to the extended numbers. Recall how
infinity is incorporated into limits of functions:
Definition 2.5.6
Using these definitions, we can extend the domain and/or range of some functions from
the real numbers to the extended real numbers.
Example 2.5.1
Definition 2.5.7
2.6 References
2.7 Worked Problems
Chapter 3
The conception of chance enters into the very first steps of scientific activity in virtue of
the fact that no observation is absolutely correct. I think chance is a more fundamental
conception than causality; for whether in a concrete case, a cause-effect relationship
holds or not can only be judged by applying the laws of chance to the observation.
M. Born
It is mainly the practice of gambling which sets certain stubborn minds against [the]
notion of the independence of successive events. They observe that heads and tails
appear about equally often in a long series of tosses, and they conclude that if there has
been a long run of tails, it should be followed by a head; they look at this as a debt owed
to them by the game. A little reflection will suffice, however, to convince anyone of the
childishness of this anthropomorphism.
É. Borel
The results concerning fluctuations in coin tossing show that the widely held beliefs about
the law of large numbers are fallacious. They were so amazing and so at variance
with common intuition that even sophisticated colleagues doubted that coins actually
misbehave as theory predicts.
W. Feller
Panorama
The main goal of this chapter is to begin exploring the deep connection between prob-
ability and analysis. That connection is fully described by measure theory. But it is not
at all obvious, which may be the reason measure theory was not applied to probability
Definition 3.1.1
Example 3.1.1
The experiment is to draw a card from a standard deck. We can classify the possi-
ble outcome in a number of ways, e.g.,
Sample space 1 A point in the space of 52 outcomes.
Note that sample spaces 2 and 3 are sets whose points are sets.
There is a special case of a sequence of trials that greatly simplifies analysis, but is
also still important.
Definition 3.1.2
Example 3.1.2
Example 3.1.3
Definition 3.1.3
Example 3.1.4
Definition 3.1.4
Probability is associated with “randomness” and “uncertainty” and the rules governing
probability reflect properties of the experiment. But, it is important to note that there
is nothing uncertain or random about the rules governing probability. The connection
to uncertainty or randomness comes through the interpretation of the probability values
placed on the outcomes and how those values are assigned.
Example 3.1.5
Consider the experiment of flipping a two-sided coin with a head side (H) and a
tail side (T ). The sample space is {H, T }. Given the complexity of modeling the
physics of the motion through the flip to the catch, we might assign probability
by assuming that each outcome is equally likely, i.e. P (H) = P (T ) = 1/2.
The randomness or uncertainty in the experiment is that, short of carrying out
a complex predictive physics simulation, we cannot predict which outcome will
occur before the toss is made.
Example 3.1.6
Example 3.1.7
We consider a sample space with six outcomes {a, b, c, d, e, f}. PX = {∅, {a},
{b}, · · · , {f}, {a, b}, {a, c}, · · · , {a, b, c, d, e, f}}. Setting P(a) = .1, P(b) = .2,
P(c) = .1, P(d) = .1, P(e) = .2, and P(f) = .3, then if B = {d, e, f},
P(B) = .6.
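The computation in this example amounts to summing outcome probabilities; a minimal Python sketch:

    # A sketch of the computation in Example 3.1.7: the probability of an
    # event is the sum of the probabilities of its outcomes.
    P = {'a': 0.1, 'b': 0.2, 'c': 0.1, 'd': 0.1, 'e': 0.2, 'f': 0.3}
    assert abs(sum(P.values()) - 1.0) < 1e-12   # the outcome probabilities sum to one

    def prob(event):
        return sum(P[outcome] for outcome in event)

    B = {'d', 'e', 'f'}
    print(prob(B))   # 0.6 (up to floating point roundoff)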
Definition 3.1.5
In discrete probability, a sample space together with its power set and a set of
probabilities is called a probability space. If Ω is the sample space and P the
probability, then we write (Ω, PΩ , P ) to emphasize the three ingredients.
Definition 3.1.6
A sure event must occur in an experiment, so it contains the entire sample space.
An almost sure event is an event with probability one. An event with probability
zero happens almost never. An impossible event never occurs in an experiment,
so it is the event with no outcomes.
Definition 3.1.7
Probabilities must satisfy certain properties with respect to taking unions and inter-
sections of events. For example,
Theorem 3.1.1
Definition 3.1.8
Two events in a probability space are (mutually) exclusive if they have no out-
comes in common.
Theorem 3.1.2
The Law of Large Numbers is a probabilistic statement about the frequency of oc-
currence of an outcome in a sequence of a large number of repeated independent trials
of an experiment. It is a result that we discuss several more times throughout the book.
In this section, we give an elementary proof of a simple version that does not require
measure theory.
Suppose that we do not know x. How might we determine it? If we conduct a single
trial, O might result or it might not. In either case, it gives little information about x.
However, if we conduct a large number m ≫ 1 of trials, intuition suggests O should
occur “approximately” xm times, at least most of the time. Another way of stating this
intuition is,
(number of times O occurs) / (total number of trials) ≈ probability of O.
But, we have to be wary about intuitive feelings:
Example 3.2.1
If we conduct a set of trials in which we flip a fair coin many times (m), we
expect to see around 50% (m/2) heads most of the time. However, it turns out the
probability of getting exactly m/2 heads in m flips is approximately √(2/(πm)),
which tends to zero as m increases.
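This asymptotic claim is easy to check numerically; a minimal Python sketch comparing the exact binomial probability with the approximation:

    import math

    # Compare the exact probability of exactly m/2 heads in m fair flips,
    # C(m, m/2) / 2^m, with the asymptotic value sqrt(2 / (pi m)).
    for m in (100, 1000, 10000):
        exact = math.comb(m, m // 2) / 2**m
        asymptotic = math.sqrt(2 / (math.pi * m))
        print(m, exact, asymptotic)
    # Both decay like 1/sqrt(m): "around 50% heads" almost never
    # means "exactly 50% heads".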
It is certainly possible that we might have a run of either “good” or “bad” luck in the
sequence of experiments, which would undermine the intuition. So, we have to be
careful about how we phrase the intuition mathematically.
It is important to spend some time reading the conclusion of Theorem 3.2.1 and
understanding its meaning. The theorem does not say O will occur exactly xm times
in m trials nor that O must occur approximately xm times in m trials. The role of δ
is that it quantifies the way in which k/m approximates x, thus avoiding the issue in
Ex. 3.2.1. The role of ε is that it allows the possibility, however small in probability, that
the sequence of trials can produce a result that is not expected. For example, we might
have the (mis)fortune to obtain all heads in the sequence of trials. By making δ small,
we obtain a better approximation to x. By making ε small, we obtain the expected result
with higher probability. The cost in each case is having to conduct a large number
m of trials.
As well as being interesting in its own right, the Law of Large Numbers (LLN) is
centrally important to various aspects of probability theory. For example, it is important
in consideration of the role of probability theory as a mathematical model of reality. An
important component of mathematical modeling is “validation”, which roughly is the
process of verifying the usefulness of the model in terms of describing the system being
modeled. One aspect of validation is quantifying the accuracy of predictions made by
the model.
For example, we could try to describe the result of a particular coin flip for a fair coin
deterministically by using physics involving the initial position on the thumb, equations
describing the effects of force and direction of the flip, the effect of airflow, and so on.
This is a complex and computationally expensive undertaking. In the absence of such
a detailed computation for a particular flip, it is reasonable to believe that the outcome
for a fair coin is equally likely to be head or tails. The LLN describes how we could
validate the assignment of a probability to O through repeated trials.
The LLN can be proved using an elementary argument based on the binomial expan-
sion. Of course, the binomial coefficients and binomial expansions are very important
in probability. We briefly review the basic ideas, see [Est02] for a more detailed presen-
tation. This theorem was first proved by Jakob Bernoulli.
Definition 3.2.1
Theorem 3.2.2
Definition 3.2.2
Theorem 3.2.3
Theorem 3.2.4
∑_{k=0}^{m} k p_{m,k}(x) = mx, for all 0 ≤ x ≤ 1. (3.5c)
∑_{k=0}^{m} k^2 p_{m,k}(x) = (m^2 − m)x^2 + mx, for all 0 ≤ x ≤ 1. (3.5d)
Proof. We prove Theorem 3.2.1 by proving that, given ε > 0 and δ > 0,
∑_{0 ≤ k ≤ m, |k/m − x| < δ} p_{m,k}(x) > 1 − ε
for all sufficiently large m. The complementary sum is estimated by
∑_{0 ≤ k ≤ m, |k/m − x| ≥ δ} p_{m,k}(x) ≤ (1/δ^2) ∑_{0 ≤ k ≤ m, |k/m − x| ≥ δ} (k/m − x)^2 p_{m,k}(x) ≤ (1/(m^2 δ^2)) T_m(x),
where
T_m(x) = ∑_{k=0}^{m} (k − mx)^2 p_{m,k}(x).
Using (3.5c) and (3.5d), we find T_m(x) = mx(1 − x), and so T_m(x) ≤ m/4 for all
0 ≤ x ≤ 1. Therefore,
∑_{0 ≤ k ≤ m, |k/m − x| ≥ δ} p_{m,k}(x) ≤ 1/(4mδ^2)  and  ∑_{0 ≤ k ≤ m, |k/m − x| < δ} p_{m,k}(x) ≥ 1 − 1/(4mδ^2),  0 ≤ x ≤ 1. (3.6)
Remark 3.2.1
It is interesting to consider how the final line implies the result. Given δ and ε, we
require
m ≥ 1/(4δ^2 ε).
This can be achieved uniformly with respect to the value of x. However, increasing
the accuracy by decreasing δ requires a very substantial increase in the number of
trials m. This adverse scaling occurs again, unfortunately.
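A minimal Python sketch of this scaling, with a hypothetical choice ε = 0.05:

    import math

    # The number of trials needed in Remark 3.2.1 grows like 1/(4 δ² ε);
    # the value eps = 0.05 is a hypothetical choice for illustration.
    def trials_needed(delta, eps):
        return math.ceil(1 / (4 * delta**2 * eps))

    for delta in (0.1, 0.01, 0.001):
        print(delta, trials_needed(delta, eps=0.05))
    # 0.1   500
    # 0.01  50000
    # 0.001 5000000  -- shrinking delta by 10x inflates m by 100x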
The function 1/(1 + x2 ) is a standard example used to show problems that can
arise for polynomial approximation. In Figure 3.2, we plot the Taylor polynomial
of degree 8 centered at x = 0 and an interpolating polynomial of degree 8. The
local nature of the approximation provided by a Taylor polynomial and the issue
of oscillation that can arise when interpolating at equally spaced points (known as
Runge’s phenomenon) are clearly visible.
Figure 3.2. Left: The Taylor polynomial of degree 8 for 1/(1 + x2 ) centered at x = 0.
Right: The polynomial of degree 8 that interpolates 1/(1 + x2 ) at equally spaced points.
Recall the metric space C([a, b]) of real-valued continuous functions on [a, b] with
the max/sup metric. We introduce a convenient way to quantify the continuous behavior
of a function. Recall that a function f that is continuous on [a, b] is actually uniformly
continuous. If f is uniformly continuous on [a, b], then for any δ > 0, the set of numbers
Definition 3.3.1
We observe that the set of polynomials with rational coefficients is dense in the
space of all polynomials on [a, b] with respect to the sup metric. The set of polynomials
with rational coefficients is a countable set. We conclude,
Theorem 3.3.2
Definition 3.3.2
[Figure: f(x) and the Bernstein polynomial b_3(f, x) on [0, 1].]
|f(x) − b_m(f, x)| ≤ (9/4) κ(f, m^{−1/2}), x ∈ [0, 1]. (3.7)
Example 3.3.2
Example 3.3.3
b_8(f, x) = (x − 1)^8/5 − (32x(x − 1)^7)/13 + 14x^2(x − 1)^6 − (224x^3(x − 1)^5)/5 + 70x^4(x − 1)^4
− (224x^5(x − 1)^3)/5 + 14x^6(x − 1)^2 − (32x^7(x − 1))/13 + x^8/5.
Example 3.3.4
[Figure: f(x) and the Bernstein polynomials b_8(f, x), b_16(f, x), b_32(f, x).]
gained by examining the error f (x) − bm (f, x). Using (3.5b), we write the error as
f(x) − b_m(f, x) = ∑_{k=0}^{m} f(x) p_{m,k}(x) − ∑_{k=0}^{m} f(x_k) p_{m,k}(x).
The motivation is to try to get quantities of the form f (x) − f (xk ), which should be
small when x is close to xk . Collecting terms, we get
f(x) − b_m(f, x) = ∑_{k=0}^{m} (f(x) − f(x_k)) p_{m,k}(x). (3.8)
Now we have to consider two cases: I in which x is close to an xk and the continuity of
f is relevant and II in which x is not close to any xk . So, we split the sum on the right
of (3.8) into two parts. For δ > 0,
f(x) − b_m(f, x) = ∑_{0 ≤ k ≤ m, |k/m − x| < δ} (f(x) − f(x_k)) p_{m,k}(x) + ∑_{0 ≤ k ≤ m, |k/m − x| ≥ δ} (f(x) − f(x_k)) p_{m,k}(x) = I + II,
where I denotes the first sum and II the second.
What about II, where the x are not close to any xk ? As m increases, the xk “fill
in” [0, 1], so the set of x that are not close to any xk should shrink. We prove that II is
small using the same argument used to prove the Law of Large Numbers. We note that
there is a C such that |f (x)| ≤ C, for 0 ≤ x ≤ 1. Hence, (3.6) in the proof of the LLN
implies
|II| ≤ 2C ∑_{0 ≤ k ≤ m, |k/m − x| ≥ δ} p_{m,k}(x) ≤ C/(2mδ^2).
So, we can make II as small as desired by taking m large. It is a good exercise to show
that in fact,
|II| ≤ κ(f, δ) (1 + 1/(4mδ^2)),
and so,
|f(x) − b_m(f, x)| ≤ κ(f, δ) (2 + 1/(4mδ^2)).
Setting δ = m^{−1/2} proves the result.
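The Bernstein polynomial is straightforward to evaluate numerically. A minimal Python sketch, using the function 1/(1 + t^2) rescaled from [−2, 2] to [0, 1] (which appears to be the function expanded in Example 3.3.3), reports the observed uniform error:

    import math

    # b_m(f, x) = sum_k f(k/m) p_{m,k}(x) with
    # p_{m,k}(x) = C(m, k) x^k (1 - x)^(m - k).
    def bernstein(f, m, x):
        return sum(f(k / m) * math.comb(m, k) * x**k * (1 - x)**(m - k)
                   for k in range(m + 1))

    # 1/(1 + t^2) rescaled from [-2, 2] to [0, 1]
    f = lambda x: 1 / (1 + (4 * x - 2)**2)

    xs = [i / 400 for i in range(401)]
    for m in (8, 32, 128):
        err = max(abs(f(x) - bernstein(f, m, x)) for x in xs)
        print(m, err)
    # The observed uniform error decreases slowly with m, consistent with
    # the kappa(f, m^(-1/2)) bound in (3.7).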
3.4 References
3.5 Worked problems
Chapter 4
The simplest laws of natural science are those that state the conditions under which
some event of interest to us will either certainly occur or certainly not occur ... An event
A, which under complex conditions S sometimes occurs and sometimes does not occur,
is called random with respect to the complex of conditions.
A. Kolmogorov
... the main question is that of constructing a mathematical model which presents a
sufficiently close agreement with reality. With this main question resolved (or at least
partially resolved, since there are often several solutions), the role of the mathemati-
cian is limited to the study of the properties of the model, and this is a problem of pure
mathematics. The comparison of the results thus obtained with experience, and the de-
velopment of theories which make such comparisons possible, lie outside of the domain
of mathematics. Thus, the role of mathematics is limited, though very important, and
in one’s theoretical research, one must never lose sight of reality and one must always
check new ideas against observations and experience.
É. Borel
The sciences do not try to explain, they hardly even try to interpret, they mainly make
models. By a model is meant a mathematical construct which, with the addition of
certain verbal interpretations, describes observed phenomena. The justification of such
a mathematical construct is solely and precisely that it is expected to work.
J. von Neumann
Panorama
In this chapter, we develop a probability model for the experiment of tossing a coin
randomly an infinite number of times. The discrete probability model does not apply
to this experiment, so we have to develop a new one. In this new model, it turns out,
computing the probability of events, e.g., the event in which the sequence heads-heads-
tails-heads occurs infinitely many times, requires computing the size, or measure, of
complicated sets of real numbers. This raises the need to develop a theory for measuring
the size of complicated sets and for determining what is needed in the theory so that it
is useful for probability.
Ways in which a model is judged include fidelity of description, usefulness for gain-
ing understanding, and computational accessibility. The first has to do with how well
the model describes what is being modeled. The second has to do with how the model
helps us understand the system being modeled. The third has to do with whether or not
the model is useful in practical terms. We partially address the first two issues for the
probability model for coin flips, by using the model to ask and answer some interesting
questions in probability. In particular, we state and prove two different versions of the
Law of Large Numbers. We also begin exploring the somewhat maddening role of sets
of “measure zero” in the theory. Maddening because some of the key elements of mea-
sure theory are necessary to deal with such sets, yet in many practical situations we can
ignore them completely.
The proofs we give in this chapter are rigorous, provided we develop a rigorous
theory for measure with the required properties. But, we do not do that until later
chapters. This may cause some uneasiness in a reader accustomed to the usual style of
polished mathematical presentation which usually requires rigor at every stage starting
with the foundation. However, this presentation is much closer to how new mathematics
is discovered, which often involves creating structure out of conjecture, experiment, and
creativity, then returning to fill in the mathematical details, which may in turn require
changing the original creation. After developing rigorous measure theory, we revisit the
material in this chapter to verify that everything discussed is indeed rigorously justified.
Definition 4.1.1
Suppose an experiment has two possible outcomes and the probabilities of these
outcomes are fixed. A finite number of independent trials of the experiment is called
a sequence of Bernoulli trials.
Example 4.1.1
Let the experiment be the toss of a two-sided coin, with a head denoted (H) and
tails denoted (T ). We display some leading terms of 10 sequences generated ran-
domly from a fair coin with equal probability of heads or tails. The frequency of
occurrence of “runs”, or subsequences of repeated heads or tails, may be surpris-
ing.
HHTHHTTHHHTHHTHTTHHHHTHHHHHTHTHTTTTHHTHTTTHHTTTHHHTHHTTTHTHTHTHHHHHTTTHTHTHTT· · ·
THHHHTHHTHTTHHHTHTTTTHTHTHTHHHTTTHTHHHTTTHTHHHTTTHTHTTTTHHHTHHTHTTTTTTTTHHTTT· · ·
HTTTTTHHHTTTHTTTHHHTHTHTHTTHHTHHTTTTHHHHHTHHTHHHHHTTTTHTTTTTTHTTHHTTTTHTHHTTT· · ·
TTHTHHTHTTHHHTTHHTTHTHHHTTTHTHTHTHHHHTHTTHHTHHHHHHTTHTTTHHHTTHTTHTHHHTHHHHHTT· · ·
TTHHTTHTHHTTTHTTTTTTHHTHTTHHTHTHHHHHTTHTTHHHTTTHTHHTTTTTHTTHTTHTHHHTHTHHHTHTT· · ·
TTHTHHHTTTHHTHHHHTTHHTTTTTTHTHHHHHTHTHHTTTHHTHHTHTHTTTTHTHTTTHTTTHTHTHTHTTTHH· · ·
HHHHTHTTHTTHHTHTHHTHHHHTTTHHTHHHHHHHTTTTTTTHHHHHTTHTHHHTTHHHTTHTTHTTHTHTHHTTT· · ·
HTHTHTHHTTHHTHTHHTHHHHHTTTHHHTHTTTHTTTTTHHHHHHHHTHHHHTHTTHHHHHTHTHHHHHHHHHTTH· · ·
HHTHTTTTTTTTTHTHHTHTTTHHTTHHTTHTTTTHHTHHHTHHHHHHTTTTHTTTTHTHHHTTTTHTHTHHHHTTT· · ·
TTTHTTTHHTHTHHHTHTTTHTTTTTHTTTTTHTTTTTHHHTTHHHTHTTHTTHHHHHHTHHTTHTHHTTHTHHHTT· · ·
For the sake of comparison, here are the results of an unfair coin with proba-
bility 3/4 of heads and 1/4 of tails,
HHHTHHHTTHTHHHHHTHHHTHTHTTHHHHHTHHHHHHHHHTHTHHTHTHHTHTHHHHHHTTTTTHHHHHHHHHHHT· · ·
THHHHHHHHHHTHHHHHHHHTTHHHHTHHHTHHHTHHHHTTHHHTTTHTHTTHHHTHTHHHTHTHHHHHHTHHHHHH· · ·
HHTHHHHHHHHHHHHHHHHHHHHHHTHHHTHHHHTHHTHHHHTHTHHHHHHHHHTHTHHHHHHTTHHTHTHHHTHTT· · ·
HTHHTHHHHHHTHTTHHHHTHHHHHHHHHHHHHHHTHHHHHTTHHHHHHHHHHHHHTHHHHHHHHHHHHTHTHHHHT· · ·
HHHHHHHHTHHHHHTHHTHHHHHHTHHHHHHHHHHHHHHHHHTHHHTHHTHHHHHHHHHHHHTTTTHHTHTTHHHHH· · ·
HHHHHHTHTHHHHHHHTHHHHHHHTHHHHHHHHHTHHHHHHHHHHHHHHHTHHHTTHTHTHHTHTHHTTTTHHHHHT· · ·
THTHTHTHHHHHHTHHHTHHTHHHHHHHHHHHTHHHTHHHHHHHTHTHHHHTHHTHTHTTTTTHTTHHHTHHHHTHH· · ·
TTHHHHTHHHHTHTHHHHHHHHHHHHHTHHTHTHHHTHHTTTHHHHTHTTHHHHHHHHHHHHTHHTTTHHHHTHHHH· · ·
HHTHHHTHHHHHHHHTTHHHHTTHHHHTHHTHTHTHHTHHTHHTHHHHHHTHHHTTHHHHHTHHHHTHHTHHHHHTT· · ·
HHHTTHHHHHHHHHHTHHHTHHHTHHHHTHTHHTTHHHTTHHHHHHHHHHHTHHHHHHTHTHHHHHHHTHHTHHTHH· · ·
We collect all of the sequences resulting from such an experiment to form a space
(master set).
Definition 4.1.2
For simplicity, we mostly treat the case where the outcomes have equal probability of
occurring, i.e. corresponding to a “fair coin”. The development can be extended to coins
where the two sides have different probabilities of occurring.
We show that B can almost be represented by the real numbers in Ω = (0, 1], which
implies that B is uncountable.
Theorem 4.1.1
If we delete a countable subset of B, we can index the remaining points using the
numbers in Ω.
Recall that by index, we mean there is a 1 − 1 and onto correspondence between the
two sets.
We give a name to the representation of the space of Bernoulli sequences:
Definition 4.1.3
The theorem implies that the model is not “perfect” in the sense that we cannot represent
all Bernoulli sequences in Ω. For consistency, we have to devise a method for assigning
probabilities that is not affected by the fact that some outcomes in the experimental
space B are not included.
Proof. We construct a map from Ω to B that fails to be onto by a countable subset. Any
point ω ∈ Ω can be written as an expansion in base 2, or fractional binary expansion,
ω = ∑_{i=1}^{∞} a_i / 2^i,    a_i = 0 or 1.
Each such expansion corresponds to a Bernoulli sequence. (We use ω for a point in
Ω instead of x to be consistent with common notation in probability.) To see this,
define the ith term of the Bernoulli sequence to be H when ai = 1 and T when ai =
0.
Example 4.1.2
0.10111001001 · · · → HT HHHT T HT T H · · · .
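The correspondence is easy to compute for leading terms; a minimal Python sketch (floating point supplies only finitely many binary digits, so this is an approximation of the idealized map):

    # Map omega in (0, 1] to the first n outcomes of its Bernoulli sequence:
    # binary digit a_i = 1 gives H and a_i = 0 gives T.  Floating point
    # carries finitely many binary digits, so terminating expansions are not
    # converted to the non-terminating convention adopted below.
    def bernoulli_prefix(omega, n):
        outcomes = []
        for _ in range(n):
            omega *= 2
            digit = int(omega)   # next binary digit a_i
            outcomes.append('H' if digit == 1 else 'T')
            omega -= digit
        return ''.join(outcomes)

    print(bernoulli_prefix(0.71875, 8))   # binary 0.10111000... -> HTHHHTTT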
A problem with using real numbers as an index set is the fact that some numbers do
not have a unique binary expansion (recall the proof of Theorem 2.3.3), but we consider
two Bernoulli sequences with different members to be distinct.
Example 4.1.3
1/2 = 0.1000 · · · = 0.0111 · · · , but HTTTT · · · ≠ THHH · · · .
Thus, this method to generate a Bernoulli sequence from a fractional binary expan-
sion does not define a function from Ω into B. To avoid this trouble, we adopt the
convention that if the real number ω has terminating and non-terminating binary expan-
sions, we use the non-terminating expansion. This is the reason for using Ω instead of
[0, 1].
With this convention, the method above defines a 1 − 1 map from Ω into B that is
not onto because it does not produce Bernoulli sequences ending in all T ’s. We claim
that the set BT of such Bernoulli sequences is countable. Let BTk be the finite set of
Bernoulli sequences that have only T ’s after the k th term. We have,
B_T = ⋃_{k=1}^{∞} B_T^k. (4.1)
Remark 4.1.1
Note that the largest number not in IAH is 0.100000 . . . while the largest number
in IAH is 0.11111 . . . . We do not include 1/2 because we use non-terminating
expansions. Likewise, if AT is the event where T occurs as the first outcome, then
IAT = (0, .5].
Since B is uncountable, discrete probability does not apply. However, the
tosses in a sequence are independent of each other, and assigning the probabilities
P (AH ) = .5 and P (AT ) = .5 seems reasonable. Interestingly, the lengths of
IAH and IAT are both .5. This suggests that the length or “measure” of IA might
correspond to the probability for A ⊂ B.
One issue with this approach is that we have to ignore the countable set BT in
order to use Ω as a model for B. To be consistent with assigning probability, BT
should be assigned probability 0. This means that the corresponding set IBT , which is
undoubtedly complicated, should be assigned a “length” or measure of 0. Likewise, it
turns out any finite or countable subset of Ω needs to be assigned a length or measure
of 0. This has a number of important ramifications and it motivates devising a way to
measure the size of sets of real numbers that applies to complex sets.
Lebesgue developed an approach to measure the sizes of complex sets of real num-
bers that is the basis for measure theory. Measure theory can be developed in a very
abstract way that applies to spaces of many different kinds of objects, though we focus
on spaces consisting of real numbers in this book. In that context, it is initially rea-
sonable to think of measure as a generalization of length in one dimension, and area
and volume in higher dimensions. But, we also caution that measures can have other
interpretations. For example, we use measure to quantify probability later on.
To fit common conceptions of measuring the sizes of sets, at a minimum, a measure
µ should satisfy some properties.
Thus, a measure is a non-negative finitely additive set function, just like a probability
function. There should be a connection here.
We pay particular attention to the case of real numbers:
If the space X is an interval of real numbers and the measurable sets include intervals
for which µ_L((a, b]) = b − a, we call µ_L the Lebesgue measure.
Note that this implies that the measure of a set of a single point is zero, i.e.,
µL ({a}) = 0.
It also implies that the Lebesgue measure is determined by assigning values on intervals.
the events in which the first m outcomes are specified, we obtain 2m disjoint intervals
of equal length, and assign equal probability 2−m to each interval and thus each event.
In this way, we obtain a sequence of “binary” partitions T_m of Ω into 2^m disjoint
subintervals I_{m,j} of equal length such that Ω = ⋃_{j=1}^{2^m} I_{m,j}, with I_{m,j} = ((j − 1)2^{−m}, j2^{−m}],
j = 1, · · · , 2^m, see Figure 4.1. Note that we simplify notation by removing mention of
the event in B corresponding to each interval.
[Figure 4.1: the binary partition $T_4$ of Ω, with an interval $(a, b]$ marked.]
Example 4.1.7
For an event A in which the first m outcomes are specified,
$P(A) = \mu_L(I_A) = \frac{1}{2^m}.$
But, what about a general event, e.g., associated with an arbitrary interval (a, b] ⊂ Ω? Such intervals correspond to events in B that are difficult to describe. Figure 4.1 suggests an interesting conjecture. It appears that any interval (a, b] ⊂ Ω can be "approximated" arbitrarily well by $\bigcup_{I_{m,j} \subset (a,b]} I_{m,j}$ in the sense that the intervals of points not in the approximation, $(a, b] \setminus \bigcup_{I_{m,j} \subset (a,b]} I_{m,j}$, shrink in size as m increases, see Fig. 4.1. Since the lengths of the approximations $\bigcup_{I_{m,j} \subset (a,b]} I_{m,j}$ converge to $\mu_L((a, b])$, we assign probability $\mu_L((a, b])$ to the event in B corresponding to (a, b].
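This approximation is easy to examine numerically. A small sketch (our own; the function name and the endpoints a = 0.3, b = 0.8 are arbitrary choices) that sums the lengths of the $I_{m,j}$ contained in $(a, b]$:

```python
def dyadic_approx_length(a, b, m):
    """Total length of the intervals I_{m,j} = ((j-1)/2^m, j/2^m]
    contained in (a, b]."""
    n = 2 ** m
    count = sum(1 for j in range(1, n + 1)
                if a <= (j - 1) / n and j / n <= b)
    return count / n

for m in (2, 4, 8, 12):
    print(m, dyadic_approx_length(0.3, 0.8, m))
# the totals increase toward b - a = 0.5 as m grows
```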
In view of the Wish List 4.1.4 and the fact that µL (I) = 1, we extend this as a
general modeling approach.
All of this discussion is terribly vague, since we have not defined µL , described the
collection of measurable sets, or quantified the sense of approximation of sets observed
above! But, we show that these ideas lead to some interesting theorems in the next
couple of sections.
In practice, assigning probability is often based on a finite set of observations. Above, we defined $P(A) = \mu_L(I_A)$ for any event A based on what we assigned for some particularly simple events.
We conclude by noting that the model derived in this section can be applied to a
variety of situations.
Example 4.1.8
Ω can index the points in the space corresponding to the random throw of a dart
into a circular target of radius 1 and measuring the distance from the dart’s position
to the origin.
Definition 4.2.1
Definition 4.2.2
$S_m$ gives the number of heads in the first m outcomes of the Bernoulli sequence corresponding to ω.
Following this example, a random variable can also be viewed as a function on the
outcomes of an experiment.
Given δ > 0, we define
$I_{\delta,m} = \left\{ \omega \in \Omega : \left| \frac{S_m(\omega)}{m} - \frac{1}{2} \right| > \delta \right\}.$ (4.2)
Roughly speaking, this is the event consisting of outcomes for which there are not
approximately the same number of H and T after m trials, where δ quantifies the dis-
crepancy.
We prove that
$\mu_L(I_{\delta,m}) < \varepsilon$ (4.3)
for all sufficiently large m. Identifying $\mu_L$ with P, we see that (4.3) extends the earlier Law of Large Numbers (3.4) to B.
Remark 4.2.1
The idea of measuring the size of the set where a function takes a specified range
of values is central to measure theory. However, such a set is not a finite collection
of disjoint intervals in general.
Example 4.2.1
We compute 5000 sequences of “numerical” coin flips for varying m, where each
sequence corresponds to a sample number 0 < ω̂i ≤ 1, 1 ≤ i ≤ 5000. We then
evaluate
$\hat{D}_i = \left| \frac{S_m(\hat{\omega}_i)}{m} - \frac{1}{2} \right|.$ (4.4)
Finally, we compute a normalized histogram of $\{\hat{D}_i\}_{i=1}^{5000}$. We show plots of the
histograms in Figures 4.2 and 4.3. The results support the conclusion of the Law
of Large Numbers as increasing m has the consequence that more of the sequences
have small values of D̂i .
Figure 4.2. Normalized histograms of the results of 5000 different sequences of coin tosses of lengths m = 100, m = 1000, m = 10000. The horizontal axis is $\hat{D}_i$ (4.4); the vertical axis is the proportion of sequences in the indicated interval. As the number of tosses increases, the proportion of sequences with values of $\hat{D}_i$ near 0 increases.
Figure 4.3. Normalized histograms of the results of 5000 different sequences of coin tosses of lengths m = 1000 and m = 10000. The horizontal axis is $\hat{D}_i$ (4.4); the vertical axis is the proportion of sequences in the indicated interval. We change the horizontal scales so that the "shape" of the distribution of values of $\hat{D}_i$ for larger and larger numbers of tosses is evident. Comparing the left-hand plot of Figure 4.2 and these two plots suggests that the distributions in all three cases have a similar profile even as the horizontal scale changes.
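The experiment is easy to reproduce. The following sketch is our own illustration, using pseudorandom coin flips in place of the sample numbers $\hat{\omega}_i$; the seed and the bin count are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                 # arbitrary seed
n_seq, m = 5000, 1000

flips = rng.integers(0, 2, size=(n_seq, m))    # each row: m tosses, 1 = H
S = flips.sum(axis=1)                          # S_m for each sequence
D = np.abs(S / m - 0.5)                        # the discrepancies (4.4)

plt.hist(D, bins=30, density=True)             # normalized histogram
plt.xlabel("discrepancy D_i")
plt.ylabel("proportion of sequences")
plt.show()
```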
Definition 4.2.3
Equivalently,
$R_i(\omega) = \begin{cases} 1, & a_i = 1, \\ -1, & a_i = 0. \end{cases}$
We plot some of these functions in Fig. 4.4. Ri has a useful interpretation. Suppose we
bet on a sequence of coin tosses such that at each toss, we win $1 if it is heads and lose
$1 if it is tails. Then Ri (ω) is the amount won or lost at the ith toss in the sequence of
tosses represented by ω.
Following this interpretation, we define another random variable.
Definition 4.2.4
The total amount won or lost after the m-th toss in the betting game is given by
$W_m(\omega) = \sum_{i=1}^{m} R_i(\omega).$
[Figure 4.4: graphs of $R_1$, $R_2$, $R_3$ on Ω, each alternating between the values 1 and −1.]
By the definition of $R_i$, $W_m(\omega) = 2 S_m(\omega) - m$. Now,
$\left| \frac{S_m(\omega)}{m} - \frac{1}{2} \right| > \delta \iff |2 S_m(\omega) - m| > 2 m \delta,$
so $I_{\delta,m} \subset A_m = \{\omega \in \Omega : |W_m(\omega)| > \delta m\}$, and it suffices to prove
$\mu_L(A_m) \to 0 \text{ as } m \to \infty.$ (4.6)
Theorem 4.2.2
If f is a nonnegative, piecewise constant function on I and α > 0, then
$\mu_L \{\omega \in I : f(\omega) > \alpha\} \le \frac{1}{\alpha} \int_0^1 f(\omega) \, d\omega,$
where the integral is the standard Riemann integral, which is well defined for piecewise constant, nonnegative functions.
Proof. [Theorem 4.2.2] Since f is piecewise constant, there is a mesh $0 = \omega_1 < \omega_2 < \cdots < \omega_m = 1$ such that $f(\omega) = c_i$ for $\omega_i < \omega \le \omega_{i+1}$, $1 \le i \le m - 1$. Since f is nonnegative,
$\int_0^1 f(\omega) \, d\omega = \sum_{i=1}^{m-1} c_i (\omega_{i+1} - \omega_i) \ge \sum_{\substack{1 \le i \le m-1 \\ c_i > \alpha}} c_i (\omega_{i+1} - \omega_i) > \alpha \sum_{\substack{1 \le i \le m-1 \\ c_i > \alpha}} (\omega_{i+1} - \omega_i) = \alpha \, \mu_L \{\omega \in I : f(\omega) > \alpha\}.$
Applying Theorem 4.2.2 with $f = W_m^2$ and $\alpha = m^2 \delta^2$ gives
$\mu_L(A_m) < \frac{1}{m^2 \delta^2} \int_0^1 W_m^2(\omega) \, d\omega.$
We compute,
$\int_0^1 W_m^2(\omega) \, d\omega = \int_0^1 \left( \sum_{i=1}^{m} R_i(\omega) \right)^2 d\omega = \sum_{i=1}^{m} \int_0^1 R_i^2(\omega) \, d\omega + \sum_{\substack{i,j=1 \\ i \ne j}}^{m} \int_0^1 R_i(\omega) R_j(\omega) \, d\omega.$
The first sum on the right simplifies since $R_i^2(\omega) = 1$ for all ω, so
$\sum_{i=1}^{m} \int_0^1 R_i^2(\omega) \, d\omega = m.$
We consider $\int_0^1 R_i(\omega) R_j(\omega) \, d\omega$ when $i \ne j$. Without loss of generality, we assume $i < j$. Set J to be an interval of the form
$J = \left( \frac{\ell}{2^i}, \frac{\ell + 1}{2^i} \right], \quad 0 \le \ell < 2^i.$
On J, $R_i$ is constant, while $R_j$ (with $j > i$) takes each of the values 1 and −1 on subsets of J of equal length, so $\int_J R_i(\omega) R_j(\omega) \, d\omega = 0$. Summing over the $2^i$ intervals J that cover Ω, we conclude
$\int_0^1 R_i(\omega) R_j(\omega) \, d\omega = 0, \quad i \ne j.$
Thus, $\int_0^1 W_m^2(\omega) \, d\omega = m$, and
$\mu_L(I_{\delta,m}) \le \frac{1}{m^2 \delta^2} \cdot m = \frac{1}{m \delta^2} \implies \mu_L(I_{\delta,m}) \to 0 \text{ as } m \to \infty.$
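The identity $\int_0^1 W_m^2(\omega) \, d\omega = m$ is easy to test by simulation. A minimal Monte Carlo sketch (our own; the seed and sample sizes are arbitrary), modeling the $R_i$ as independent ±1 values:

```python
import numpy as np

rng = np.random.default_rng(1)        # arbitrary seed
m, n = 50, 200_000                    # n samples of W_m

R = rng.choice([-1, 1], size=(n, m))  # each row is (R_1, ..., R_m)
W = R.sum(axis=1)                     # W_m for each sample
print((W ** 2).mean())                # sample mean of W_m^2, close to m = 50
```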
The random variables introduced for this proof can be used to quantify other inter-
esting questions.
Example 4.2.2
The set IAm , determined by where a function has prescribed values, is generally
complicated. The event A of losing all the money, given by
$I_A = \bigcup_{m=1}^{\infty} I_{A_m}$
is even more complicated. The probability of A is µL (IA ), once we figure out how
that is computed.
Definition 4.3.1
Another way to state the intuition behind the Law of Large Numbers is that obtaining a sequence corresponding to a non-normal number should be highly improbable. In fact, ideally, there should be 0 probability of obtaining such a sequence.
Definition 4.3.2
Note, however, that there are non-normal sequences, so having probability zero does
not mean it is impossible to obtain a non-normal sequence.
In this section, we characterize sets with Lebesgue measure zero. We noted above
that the Lebesgue measure of a single point is zero. It follows immediately that finite
collections of points also have Lebesgue measure zero. Infinite collections are appar-
ently more complicated. For example, I is the uncountable union of single points and
does not have Lebesgue measure zero. Working from the assumptions about measure
we have made so far, we develop a general method for characterizing sets with Lebesgue
measure zero. In doing so, we actually motivate several key aspects of measure theory.
The characterization is based on a fundamentally important concept for metric spaces.
Definition 4.3.3
Example 4.3.1
Definition 4.3.4
A set A ⊂ R has Lebesgue measure zero if for every ε > 0, there is a countable cover $\{A_i\}_{i=1}^{\infty}$ of A, where each $A_i$ consists of a finite union of open intervals, such that
$\sum_{i=1}^{\infty} \mu_L(A_i) < \varepsilon.$
Note that because each $A_i$ in the countable cover consists of a finite union of open intervals, its Lebesgue measure is computable. In this way, we sidestep the issue of computing $\mu_L(A)$ directly.
Example 4.3.2
We show that $\mathbb{N}$ has Lebesgue measure 0. Given ε > 0, we have the open cover:
$\mathbb{N} \subset \bigcup_{i=0}^{\infty} \left( i - \frac{\varepsilon}{2^{i+2}},\ i + \frac{\varepsilon}{2^{i+2}} \right).$
We compute
$\sum_{i=0}^{\infty} \mu_L\left( \left( i - \frac{\varepsilon}{2^{i+2}},\ i + \frac{\varepsilon}{2^{i+2}} \right) \right) = \sum_{i=0}^{\infty} \frac{\varepsilon}{2^{i+1}} = \varepsilon.$
Definition 4.3.5
If (c, d) ⊆ (a, b), then µL ((c, d)) ≤ µL ((a, b)). We say that Lebesgue measure is
monotone.
We could use half open or closed intervals in the definition instead of open intervals,
but open intervals turn out to be convenient for “compactness” arguments.
Example 4.3.3
We show that a closed interval [a, b] with a ≠ b cannot have measure zero. If [a, b] is covered by countably many open intervals, we can extract a finite number that cover [a, b] (a finite subcover) because [a, b] is compact. The sum of the lengths of these intervals must be at least b − a.
Theorem 4.3.1
1. Any measurable subset of a set of measure zero has measure zero.
2. A countable union of sets of measure zero has measure zero.
3. Any finite or countable set of numbers has measure zero.
This states that a countable union of sets of measure zero is a set of measure zero. In
contrast, uncountable unions of sets of measure zero can have nonzero measure. The
assumption that the subset of the set of measure zero in 1. is measurable is an important
point that we address in later chapters.
Proof.
Result 1. This follows from the definition since any countable cover of the larger set
is also a cover of the smaller set.
Result 2. We choose ε > 0. Since $A_i$ has measure zero, there is a countable collection of open intervals $\{B_{i,1}, B_{i,2}, \ldots\}$ covering $A_i$ with
$\sum_{j=1}^{\infty} \mu_L(B_{i,j}) \le \frac{\varepsilon}{2^i}.$
By Theorem 2.3.2, the collection $\{B_{i,j}\}_{i,j=1}^{\infty}$ is countable and covers $\bigcup_{i=1}^{\infty} A_i$. Moreover,
$\sum_{j,i=1}^{\infty} \mu_L(B_{i,j}) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \mu_L(B_{i,j}) \le \sum_{i=1}^{\infty} \frac{\varepsilon}{2^i} = \varepsilon.$
Note that we use non-negativity to switch the order of summation in this argument.
Result 3. This follows from 2. and the observation that a single point has measure zero.
Remark 4.3.1
An interesting question is whether or not there are any interesting sets of measure
zero. We next show that there are uncountable sets of measure zero. In particular, we
describe the construction of a special example that is used frequently in measure theory.
The set is constructed by an iterative process.
Step 1 Beginning with the unit interval $F_0 = [0, 1]$, divide $F_0$ into 3 equal parts and remove the middle third open interval $\left( \frac{1}{3}, \frac{2}{3} \right)$ to get
$F_1 = \left[ 0, \frac{1}{3} \right] \cup \left[ \frac{2}{3}, 1 \right].$
Step 2 Working on $F_1$ next, divide each of its two pieces into equal thirds and remove the middle open intervals from the divisions to get $F_2$:
$F_2 = \left[ 0, \frac{1}{9} \right] \cup \left[ \frac{2}{9}, \frac{1}{3} \right] \cup \left[ \frac{2}{3}, \frac{7}{9} \right] \cup \left[ \frac{8}{9}, 1 \right].$
Figure 4.6. The first step in the construction of the Cantor set.
Figure 4.7. The second step in the construction of the Cantor set.
Theorem 4.3.2
Proof.
Result 1 Exercise.
Result 2 Exercise.
Result 3 C is contained in $F_i$ for every i. Since $F_i$ is a union of disjoint intervals whose lengths sum to $(2/3)^i$ and, for any ε > 0, $(2/3)^i < \varepsilon$ for all sufficiently large i, C has measure zero.
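The construction is easy to carry out exactly. The following sketch (our own illustration; the helper name cantor_step is ours) generates the intervals of $F_i$ with exact rational arithmetic and checks that their lengths sum to $(2/3)^i$:

```python
from fractions import Fraction

def cantor_step(intervals):
    """Remove the open middle third of each closed interval [a, b]."""
    out = []
    for a, b in intervals:
        third = (b - a) / 3
        out.append((a, a + third))       # left closed third
        out.append((b - third, b))       # right closed third
    return out

F = [(Fraction(0), Fraction(1))]         # F_0 = [0, 1]
for i in range(1, 6):
    F = cantor_step(F)
    total = sum(b - a for a, b in F)
    assert total == Fraction(2, 3) ** i  # lengths sum to (2/3)^i
    print(i, len(F), float(total))
```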
This is a contradiction, and so every number in C has a unique base 3 expansion.
Now let $\{G_{i,j}\}_{j=1}^{2^{i-1}}$ be the open "middle third" intervals removed to obtain $F_i$. Then, a number given by the base 3 expansion $0.b_1 b_2 b_3 \ldots$, $b_i \in \{0, 1, 2\}$, is in $G_{i,j}$ for some j if and only if:
• $b_j = 0$ or 2 for each j < i, because it is in $F_{i-1}$;
• $b_i = 1$, because it is in one of the discarded open intervals at this stage;
• the $b_j$ for j > i are neither all 0 nor all 2, because the endpoints of the discarded interval are excluded.
It is a good exercise to use a variation of the Cantor diagonal argument to show that C
is uncountable.
To give some idea of the importance of the concept of sets of measure zero, we quote
a beautiful result of Lebesgue that states “if and only if” conditions for a function to be
Riemann integrable. Recall that two aspects of Riemann integration provided significant
impetus to the development of measure theory. First, there was a long search for minimal equivalent conditions on a function that would guarantee the function is Riemann integrable. Second, the Riemann integral has some annoying "flaws". The resolution of the search is described in detail in Section 9.10. Here, we simply quote the main result.
To explain the idea, we begin with a canonical example. First,
Definition 4.3.7
A property that holds except on a set of measure zero is said to hold almost everywhere (a.e.). We say that almost all points in a set have a property if all the points except those in a set of measure zero have the property.
Definition 4.3.8
Theorem 4.3.3
We emphasize that a function may be equal to a continuous function a.e. but not be continuous a.e.!
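A standard illustration (our addition): the Dirichlet function $f = 1_{\mathbb{Q} \cap [0,1]}$ equals the continuous function 0 a.e., since the rationals form a countable set of measure zero, yet f is discontinuous at every point of [0, 1].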
The next result is part of Theorem 9.10.1 proved later.
This theorem is a statement that is naturally expressed in terms of measure theory. This version of the Law of Large Numbers is called strong because Theorem 4.4.1 implies Theorem 4.2.1. This is a consequence of a general result on different kinds of convergence that we discuss in Chapter 12.
Proof. We first show that $N^c$ is uncountable and contains a "Cantor-like" set. Consider the map f : Ω → Ω,
We cover the complicated set $N^c$ using a countable cover of much simpler sets. Recall the set $A_m = \{\omega \in \Omega : |W_m(\omega)| > \delta m\}$ used in the proof of the Weak Law of Large Numbers. We use an equivalent definition,
$A_m = \left\{ \omega \in \Omega : W_m^4(\omega) > \delta^4 m^4 \right\}.$
By Theorem 4.2.2,
$\mu_L(A_m) \le \frac{1}{\delta^4 m^4} \int_0^1 W_m^4 \, d\omega = \frac{1}{\delta^4 m^4} \int_0^1 \left( \sum_{i=1}^{m} R_i \right)^4 d\omega.$
Expanding the fourth power, the integrand is a sum of terms of five kinds:
1. $R_i^4$ for $i = 1, \cdots, m$.
2. $R_i^2 R_j^2$ for $i \ne j$.
3. $R_i^2 R_j R_k$ for distinct i, j, k.
4. $R_i^3 R_j$ for $i \ne j$.
5. $R_i R_j R_k R_l$ for distinct i, j, k, l.
Each term of the first two kinds integrates to 1, since $R_i^4 = R_i^2 R_j^2 = 1$.
We show the other terms integrate to zero because of cancellation. Two cases follow from the proof of the Weak Law of Large Numbers:
$\int_0^1 R_i^2 R_j R_k \, d\omega = \int_0^1 R_j R_k \, d\omega = 0$, for distinct i, j, k,
$\int_0^1 R_i^3 R_j \, d\omega = \int_0^1 R_i R_j \, d\omega = 0, \quad i \ne j.$
Finally, assume $i < j < k < l$, and consider an interval of the form
$J = \left( \frac{r}{2^k}, \frac{r+1}{2^k} \right], \quad 0 \le r < 2^k.$
On J, the functions $R_i$, $R_j$, $R_k$ are constant, while $R_l$ integrates to zero over J because $l > k$. Hence $\int_0^1 R_i R_j R_k R_l \, d\omega = 0$ as well.
There are m terms of the first kind of integrand and 3m(m − 1) terms involving the second kind of integrand, so
$\int_0^1 W_m^4(\omega) \, d\omega = m + 3m(m-1) = 3m^2 - 2m \le 3m^2,$
and
$\mu_L(A_m) \le \frac{3}{m^2 \delta^4}.$
We cover $N^c$ using a collection of sets of the form $A_m$ for increasing m and decreasing δ chosen in such a way that the cover has arbitrarily small measure. For a constant C, set $\delta_m^4 = C m^{-1/2}$, so
$\sum_{m=1}^{\infty} \frac{3}{\delta_m^4 m^2} = \frac{3}{C} \sum_{m=1}^{\infty} \frac{1}{m^{3/2}},$
where the series on the right converges, so the sum can be made smaller than any ε > 0 by choosing C sufficiently large. Hence, given ε > 0, there is a sequence $\{\delta_m\}$ such that
$\sum_{m=1}^{\infty} \frac{3}{\delta_m^4 m^2} \le \varepsilon.$
Let $\tilde{A}_m$ denote $A_m$ with δ replaced by $\delta_m$. Then
$\mu_L(\tilde{A}_m) \le \frac{3}{\delta_m^4 m^2},$
and
$\sum_{m=1}^{\infty} \mu_L(\tilde{A}_m) \le \varepsilon.$
If we show that $N^c \subset \bigcup_{m=1}^{\infty} \tilde{A}_m$, then we are done. This holds if $N \supset \bigcap_{m=1}^{\infty} \tilde{A}_m^c$. If $\omega \in \bigcap_{m=1}^{\infty} \tilde{A}_m^c$, then for each m, $|W_m(\omega)| \le \delta_m m$, or $\frac{|W_m(\omega)|}{m} \le \delta_m$. Since $\delta_m \to 0$, $\frac{|W_m(\omega)|}{m} \to 0$, so $\omega \in N$.
The proof of Theorem 4.4.1 can be used to draw stronger conclusions. For example, a normal number has the property that no finite sequence of digits occurs more frequently than any other finite sequence of digits of the same length.
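To watch the Strong Law numerically, here is a short sketch (ours; the seed is an arbitrary choice) that tracks $|W_m(\omega)|/m$ along one simulated sequence of tosses; the ratio drifts toward 0 as m grows:

```python
import numpy as np

rng = np.random.default_rng(2)                       # arbitrary seed
m = 100_000
R = rng.choice([-1, 1], size=m)                      # one sequence of R_i values
ratio = np.abs(np.cumsum(R)) / np.arange(1, m + 1)   # |W_m| / m for m = 1, ..., 100000
for k in (100, 1_000, 10_000, 100_000):
    print(k, ratio[k - 1])                           # typically decreases toward 0
```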
1. $\mu_L$ should be a non-negative set function from sets in $\mathbb{R}^n$ into the extended reals $\bar{\mathbb{R}}$. $\mu_L(\{x\}) = 0$ for a single point. $\mu_L(A) = \infty$ should be possible for unbounded sets.
2. In $\mathbb{R}$, we should have $\mu_L([a, b]) = b - a$. In $\mathbb{R}^n$, we should have $\mu_L(Q) = \prod_{i=1}^{n} (b_i - a_i)$ for the box
$Q = \{x \in \mathbb{R}^n : a_i \le x_i \le b_i,\ 1 \le i \le n\}.$
3. If $\{A_i\}_{i=1}^{m}$ are disjoint sets, then
$\mu_L(A_1 \cup A_2 \cup \cdots \cup A_m) = \sum_{i=1}^{m} \mu_L(A_i).$
So, uncountable collections of sets are a problem and we avoid them. What about countable collections? The union of a countable disjoint collection of sets of measure zero should have measure zero. Also,
$(0, 1] = \left( \frac{1}{2}, 1 \right] \cup \left( \frac{1}{3}, \frac{1}{2} \right] \cup \left( \frac{1}{4}, \frac{1}{3} \right] \cup \cdots,$
and,
$1 = \mu_L((0, 1]) = \left( 1 - \frac{1}{2} \right) + \left( \frac{1}{2} - \frac{1}{3} \right) + \left( \frac{1}{3} - \frac{1}{4} \right) + \cdots = \mu_L\left( \left( \frac{1}{2}, 1 \right] \right) + \mu_L\left( \left( \frac{1}{3}, \frac{1}{2} \right] \right) + \mu_L\left( \left( \frac{1}{4}, \frac{1}{3} \right] \right) + \cdots.$
It turns out that we cannot construct a desirable measure that satisfies all of these properties. We have to give up something, so we do not require that the measure be defined on all subsets of $\mathbb{R}^n$. We settle for a measure defined on a class of subsets. On the other hand, this class of subsets is extremely rich, and it is quite difficult to construct a set that is not in it.
4.6 References
4.7 Worked problems