Math For Data Science
descent are simplifications of the proofs in [36], and the connection between
properness and trainability seems to be new to the literature.
The ideas presented in the text are made concrete by interpreting them
in Python code. The standard Python data science packages are used, and a
Python index lists the functions used in the text.
Because Python is used to highlight concepts, the code examples are purposely simple to follow. This should be helpful to the reader new to Python.
Contents

Preface

1 Datasets
1.1 Introduction
1.2 The MNIST Dataset
1.3 Averages and Vector Spaces
1.4 Two Dimensions
1.5 Mean and Variance
1.6 High Dimensions

2 Linear Geometry
2.1 Vectors and Matrices
2.2 Products
2.3 Matrix Inverse
2.4 Span and Linear Independence
2.5 Zero Variance Directions
2.6 Pseudo-Inverse
2.7 Projections
2.8 Basis
2.9 Rank

4 Calculus
4.1 Single-Variable Calculus
4.2 Entropy and Information
4.3 Multi-Variable Calculus
4.4 Back Propagation

5 Probability
5.1 Binomial Probability
5.2 Probability
5.3 Random Variables
5.4 Normal Distribution
5.5 Chi-squared Distribution
5.6 Multinomial Probability

6 Statistics
6.1 Estimation
6.2 Z-test
6.3 T-test
6.4 Chi-Squared Tests

Appendices
A.1 Permutations and Combinations
A.2 The Binomial Theorem
A.3 The Exponential Function
A.4 Complex Numbers
A.5 Integration
A.6 Asymptotics and Convergence
A.7 Existence of Minimizers
A.8 SQL

References

Index
1.1 Introduction
from sklearn import datasets

iris = datasets.load_iris()
iris["feature_names"]
This returns
['sepal length','sepal width','petal length','petal width'].
To return the data and the classes, the code is
dataset = iris["data"]
labels = iris["target"]
dataset, labels
This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points
x1 , x2 , . . . , xN
If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus), and
Python (numpy, pandas, scipy, sympy, matplotlib). It may help to read the
code examples and the important math principles first, then dive into details as needed.
To illustrate concepts and make them concrete as they are introduced, we use Python code throughout. We run Python code in a jupyter notebook. jupyter is an IDE, an integrated development environment. jupyter supports many languages, including Python, Sage, Julia, and R. A useful jupyter feature is the ability to measure the execution time of a jupyter cell by including at the start of the cell
%%time
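For example, a cell along these lines (the loop is just a placeholder computation) reports its CPU and wall-clock time when run:

%%time
total = sum(i*i for i in range(10**6))
total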
It’s simplest to first install Python, then jupyter. If your laptop is not a
recent model, to minimize overhead, it’s best to install Python directly and
avoid extra packages or frameworks. If Python is installed from
https://www.python.org/downloads/,
then the Python package installer pip is also installed.
From within a shell, check the latest version of pip is installed using the
command
pip --version,
The versions of Python and pip used in this edition of the text are 3.12.*
and 24.*. The first step is to ensure updated versions of Python and pip are
on your laptop.
After this, from within a shell, use pip to install your first package:
pip install jupyter
After installing jupyter, all other packages are installed from within
jupyter. For example, for this text, from within a jupyter cell, we ran
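The cell itself is not shown here; it was presumably a pip command run through jupyter, along the lines of the following sketch (the package list is an assumption, based on the packages mentioned in the text):

# hypothetical package list
%pip install numpy pandas scipy sympy matplotlib scikit-learn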
Exercises
def uniq(a):
    return [x for i, x in enumerate(a) if x not in a[:i]]
The MNIST1 dataset consists of 60,000 training images. Since this dataset is
for demonstration purposes, these images are coarse.
Each image consists of 28 × 28 = 784 pixels, and each pixel shading is a
byte, an integer between 0 and 255 inclusive. Therefore each image is a point
x in Rd = R784 . Attached to each image is its label, a digit 0, 1, . . . , 9.
We assume the dataset has been downloaded to your laptop as a CSV file
mnist.csv. Then each row in the file consists of the pixels for a single image.
Since the image’s label is also included in the row, each row consists of 785
integers. There are many sources and formats online for this dataset.
The code

from pandas import read_csv

mnist = read_csv("mnist.csv").to_numpy()
mnist.shape,dataset.shape,labels.shape
returns
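Here dataset and labels are the pixel array and the label array; their definitions are not shown above. A minimal sketch of the split, assuming the label is stored in the first column of each row:

# assumption: column 0 is the label, the remaining 784 columns are the pixels
labels = mnist[:,0]
dataset = mnist[:,1:]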
Fig. 1.4 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
1 The National Institute of Standards and Technology (NIST) is a physical sciences laboratory and non-regulatory agency of the United States Department of Commerce.
Here is an exercise. The top left image in Figure 1.4 is given by a 784-
dimensional point which is imported as an array pixels.
pixels = dataset[1].reshape((28,28))
grid()
scatter(2,3,s = 50)
show()
2. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.
pixels = dataset[1]
grid()
for i in range(28):
    for j in range(28):
        scatter(i,j, s = pixels[i,j])
show()
imshow(pixels, cmap="gray_r")
np.float64(5.843333333333335)
5.843333333333335
set_printoptions(legacy="1.25")
We end the section by discussing the Python import command. The last
code snippet can be rewritten
plt.imshow(pixels, cmap="gray_r")
or as
imshow(pixels, cmap="gray_r")
In the third version, only the command imshow is imported. Which import
style is used depends on the situation.
In this text, we usually use the first style, as it is visually lightest. To help
with online searches, in the Python index, Python commands are listed under
their full package path.
Exercises
Exercise 1.2.1 Run the code in this section on your laptop (all code is run
within jupyter).
Exercise 1.2.2 The first image in the MNIST dataset is an image of the
digit 5. What is the 43,120th image?
Exercise 1.2.3 Figure 1.6 is not oriented the same way as the top-left image
in Figure 1.4. Modify the code returning Figure 1.6 to match the top-left
image in Figure 1.4.
L = [x_1,x_2,...,x_N].
The total population is called the population or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population
Or, the sample space consists of all integers and we take N = 5 samples from
this population
Or, the sample space consists of all rational numbers and we take N = 5
samples from this population
Or, the sample space consists of all Python strings and we take N = 5 samples
from this population
L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']
Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population
from random import choice

def hexcolor():
    chars = '0123456789abcdef'
    return "#" + ''.join([choice(chars) for _ in range(6)])
v = x − µ = (c − a, d − b).
Then µ is the tail of v, and x is the head of v. For example, the vector joining
µ = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean µ of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − µ, k = 1, 2, . . . , N .
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,
[Figure: the points x1, . . . , x5 around their mean µ, and the corresponding vectors v1, . . . , v5 based at the origin 0.]
Let us go back to vector spaces. When we work with vector spaces, numbers
are referred to as scalars, because 2v, 3v, −v, . . . are scaled versions of v.
When we multiply a vector v by a scalar r to get the scaled vector rv, we call
this vector scaling. This is to distinguish this multiplication from the inner
and outer products we see below.
For example, the samples in the list L1 form a vector space, the set of all
real numbers R. Even though one can add integers, the set Z of all integers
does not form a vector space because multiplying an integer by 1/2 does
not result in an integer. The set Q of all rational numbers (fractions) is a
vector space, so L3 is a sampling from a vector space. The set of strings is
not a vector space because even though one can add strings, addition is not
commutative: for example, the code

'data' + 'science' == 'science' + 'data'

returns False.
the average is
$$\mu = \frac{1.23 + 4.29 - 3.3 + 555}{4} = 139.305.$$
In Python, averages are computed using numpy.mean. For a scalar dataset,
the code
dataset = array([1.23,4.29,-3.3,555])
mu = mean(dataset)
mu
For example, for the four points (1, 2), (3, 4), (−2, 11), (0, 66) in the plane, the average is µ = (0.5, 20.75). In Python, this dataset is entered as
dataset = array([[1,3,-2,0],[2,4,11,66]])
Here the x-components of the four points are the first row, and the y-
components are the second row. With this, the code
mu = mean(dataset, axis=1)
mu
mean(dataset, axis=0)
N = 20
def row(N): return array([random() for _ in range(N) ])
# 2xN array
dataset = array([ row(N), row(N) ])
mu = mean(dataset,axis=1)
grid()
scatter(*mu)
scatter(*dataset)
show()
H, H, T, T, T, H, T, . . .
If we add the vectorized samples f (x) using vector addition in the plane
(§1.4), the first component of the mean (1.3.2) is an average of ones and
according to which class x belongs to. Then the mean (1.3.2) is a triple
p̂ = (p̂1 , p̂2 , p̂3 ) of proportions of each class in the sampling. Of course, p̂1 +
p̂2 + p̂3 = 1, so p̂ is a probability vector (§5.6).
[Figure: the vectorization f maps the sample space into a vector space.]
When there are only two possibilities, two classes, it’s simpler to encode
the classes as follows,
$$f(x) = \begin{cases} 1, & \text{if } x \text{ is heads}, \\ 0, & \text{if } x \text{ is tails}. \end{cases}$$
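For instance, vectorizing a short run of coin flips this way and averaging recovers the proportion of heads (the flips below are made up for illustration):

from numpy import array, mean

flips = ['H','H','T','T','T','H','T']
f = array([1 if x == 'H' else 0 for x in flips])
mean(f)   # 3/7, the proportion of heads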
Even when the samples are already scalars or vectors, we may still want
to vectorize them. For example, suppose x1 , x2 , . . . , xN are the prices of a
sample of printers from across the country. Then the average price (1.3.1) is
well-defined. Nevertheless, we may set
$$f(x) = \begin{cases} 1, & \text{if } x \text{ is greater than } \$100, \\ 0, & \text{if } x \text{ is} \le \$100. \end{cases}$$
Then the mean (1.3.2) is the sample proportion p̂ of printers that cost more than $100.
In §6.4, we use vectorization to derive the chi-squared tests.
Exercises
Exercise 1.3.2 What is the average petal length in the Iris dataset?
Exercise 1.3.3 What is the average shading of the pixels in the first image
in the MNIST dataset?
x = arange(0,1,.2)
plot(x,f(x))
scatter(x,f(x))
We start with the geometry of vectors in two dimensions. This is the cartesian
plane R2 , also called 2-dimensional real space. The plane R2 is a vector space,
in the sense described in the previous section.
In the cartesian plane, a vector is an arrow v joining the origin to a point
(Figure 1.12). In this way, points and vectors are almost interchangeable, as a
point x in Rd corresponds to the vector v starting at the origin 0 and ending
at x.
In the cartesian plane, each vector v has a shadow. This is the triangle
constructed by dropping the perpendicular from the tip of v to the x-axis, as
in Figure 1.13.
[Figure 1.12: a vector v in the plane with coordinates (3, 2).]
This cannot be done unless one first draws a horizontal line (the x-axis),
then a vertical line (the y-axis). In this manner, each vector v has cartesian
coordinates v = (x, y). In Figure 1.12, the coordinates of v are (3, 2). In
particular, the vector 0 = (0, 0), the zero vector, corresponds to the origin.
[Figure: the vectors v1 and v2 based at the origin, and their shadows.]
Addition of vectors
v1 + v2 = (x1 + x2 , y1 + y2 ). (1.4.1)
Because points and vectors are interchangeable, the same formula is used
for addition P + P ′ of points P and P ′ .
This addition is the same as combining their shadows as in Figure 1.14.
In Python, lists and tuples do not add this way. Lists and tuples have to first
be converted into numpy arrays.
v1 = (1,2)
v2 = (3,4)
v1 + v2 == (1+3,2+4) # returns False
v1 = [1,2]
v2 = [3,4]
v1 + v2 == [1+3,2+4] # returns False
v1 = array([1,2])
v2 = array([3,4])
v1 + v2 == array([1+3,2+4]) # returns array([True, True])
Scaling of vectors
v = array([1,2])
3*v == array([3,6]) # returns array([True, True])
[Figure 1.17: the scalings tv of a vector v form a line through the origin 0.]
Given a vector v, the scalings tv of v form a line passing through the origin
0 (Figure 1.17). This line is the span of v (more on this in §2.4). Scalings tv
of v are also called multiples of v.
If t and s are real numbers, it is easy to check t(sv) = (ts)v. Thus scaling v by s, and then scaling the result by t, has the same effect as scaling v by ts, in a single step. Because points and vectors are interchangeable, the same formula tP is used for scaling points P by t.
We set −v = (−1)v, and define subtraction of vectors by
v1 − v2 = v1 + (−v2 ).
v1 = array([1,2])
v2 = array([3,4])
v1 - v2 == array([1-3,2-4]) # returns array([True, True])
Subtraction of vectors
v1 − v2 = (x1 − x2 , y1 − y2 ) (1.4.2)
Distance Formula

The length of a vector v = (x, y), the distance of the point (x, y) to the origin, is $|v| = \sqrt{x^2 + y^2}$. In Python,

from numpy.linalg import norm

v = array([1,2])
norm(v) == sqrt(5) # returns True
[Figure 1.16: polar coordinates; the vector (x, y) is at distance r from the origin 0 and makes angle θ with the horizontal axis.]
The unit circle consists of the vectors which are distance 1 from the origin
0. When v is on the unit circle, the magnitude of v is 1, and we say v is a
unit vector. In this case, the line formed by the scalings of v intersects the
unit circle at ±v (Figure 1.17).
When v is a unit vector, r = 1, and (Figure 1.16) v = (cos θ, sin θ).
The unit circle intersects the horizontal axis at (1, 0), and (−1, 0), and
intersects the vertical axis at (0, 1), and (0, −1). These four points are equally
spaced on the unit circle (Figure 1.17).
By the distance formula, a vector v = (x, y) is a unit vector when
x2 + y 2 = 1.
More generally, any circle with center Q = (a, b) and radius r consists of
points (x, y) satisfying
(x − a)2 + (y − b)2 = r2 .
Let R be a point on the unit circle, and let t > 0. From this, we see the scaled
point tR is on the circle with center (0, 0) and radius t. Moreover, it follows
a point P is on the circle of center Q and radius r iff P = Q + rR for some
R on the unit circle.
Given this, it is easy to check
$$\left|\frac{1}{r}v\right| = \frac{1}{r}|v| = \frac{1}{r}\,r = 1,$$

so (1/r)v is a unit vector.
Now we discuss the dot product in two dimensions. We have two vectors
v1 and v2 in the plane R2 , with v1 = (x1 , y1 ) and v2 = (x2 , y2 ). The dot
product of v1 and v2 is given algebraically as
v1 · v2 = x1 x2 + y1 y2 ,
or geometrically as
v1 · v2 = |v1 | |v2 | cos θ,
where θ is the angle between v1 and v2 . To show that these are the same,
below we derive the
[Figure 1.18: the vectors v1, v2, and v2 − v1.]
v1 = array([1,2])
v2 = array([3,4])
dot(v1,v2) == 1*3 + 2*4 # returns True
As a consequence of the dot product identity, we have code for the angle
between two vectors (there is also a built-in numpy.angle).
def angle(u,v):
    a = dot(u,v)
    b = dot(u,u)
    c = dot(v,v)
    theta = arccos(a / sqrt(b*c))
    return degrees(theta)
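For example, for perpendicular vectors the function returns 90 degrees (up to floating-point rounding):

angle(array([1,0]), array([0,1])) # approximately 90.0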
Cauchy-Schwarz Inequality
To derive the dot product identity, we first derive Pythagoras’ theorem for
general triangles (Figure 1.19)
[Figure 1.19: a triangle with sides a, b, c; the altitude f splits the base b into segments d and e.]
a2 = d2 + f 2 and c2 = e2 + f 2 .
Also b = e + d, so e = b − d, so
e2 = (b − d)2 = b2 − 2bd + d2 .
c2 = e2 + f 2 = (b − d)2 + f 2
= f 2 + d2 + b2 − 2db
= a2 + b2 − 2ab cos θ,
so we get (1.4.8).
Next, connect Figures 1.18 and 1.19 by noting a = |v2 | and b = |v1 | and
c = |v2 − v1 |. By (1.4.6),
thus
c2 = a2 + b2 − 2(x1 x2 + y1 y2 ). (1.4.9)
Comparing the terms in (1.4.8) and (1.4.9), we arrive at (1.4.5). This com-
pletes the proof of the dot product identity (1.4.5).
[Figures 1.20 and 1.21: the perpendicular vector v⊥, the points P, P⊥, −P⊥, P + P⊥ on the unit circle, and a right triangle with sides a, b, c.]
v · v ⊥ = (x, y) · (−y, x) = 0.
From Figure 1.21, we see points P and P ′ on the unit circle satisfy P ·P ′ = 0
iff P ′ = ±P ⊥ .
ax + by = 0, cx + dy = 0. (1.4.10)
Homogeneous System
ax + by = e, cx + dy = f, (1.4.13)
(x, y) = (e/a, 0), (x, y) = (0, e/b), (x, y) = (f /c, 0), (x, y) = (0, f /d)
de − bf af − ce
x= , y= . (1.4.15)
ad − bc ad − bc
Putting all this together, we conclude
Inhomogeneous System
In §2.9, we will understand the three cases in terms of the rank of A equal
to 2, 1, or 0.
In this case, we call u and v the rows of A. On the other hand, A may be
written as
$$A = \begin{pmatrix} a & c \\ b & d \end{pmatrix} = \begin{pmatrix} u & v \end{pmatrix}, \qquad u = (a, b),\quad v = (c, d).$$
In this case, we call u and v the columns of A. This shows there are at least
three ways to think about a matrix: as rows, or as columns, or as a single
block.
The simplest operations on matrices are addition and scaling. Addition is
as follows,
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad A' = \begin{pmatrix} a' & b' \\ c' & d' \end{pmatrix} \implies A + A' = \begin{pmatrix} a + a' & b + b' \\ c + c' & d + d' \end{pmatrix},$$
$$AA' = \begin{pmatrix} u \cdot u' & u \cdot v' \\ v \cdot u' & v \cdot v' \end{pmatrix}.$$
$$U(\theta)U(\theta') = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}\begin{pmatrix} \cos\theta' & -\sin\theta' \\ \sin\theta' & \cos\theta' \end{pmatrix} = \begin{pmatrix} \cos(\theta+\theta') & -\sin(\theta+\theta') \\ \sin(\theta+\theta') & \cos(\theta+\theta') \end{pmatrix} = U(\theta + \theta').$$
$$A^{-1} = \frac{1}{\det(A)}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix} = \frac{1}{ad-bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix} = \begin{pmatrix} \dfrac{d}{ad-bc} & \dfrac{-b}{ad-bc} \\ \dfrac{-c}{ad-bc} & \dfrac{a}{ad-bc} \end{pmatrix}$$
is the inverse of A. The inverse matrix satisfies
AA−1 = A−1 A = I.
(AB)−1 = B −1 A−1 .
(AB)t = B t At .
Ax = b,
where
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad x = \begin{pmatrix} x \\ y \end{pmatrix}, \quad b = \begin{pmatrix} e \\ f \end{pmatrix}.$$
Then the solution (1.4.15) can be rewritten
x = A−1 b,
where A−1 is the inverse matrix. We study inverse matrices in depth in §2.3.
The matrix (1.4.11) is symmetric if b = c. A symmetric matrix looks like
$$Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix}.$$
Qt = Q.
Orthogonal Matrices
Here we wrote u ⊗ v as a single block, and also in terms of rows and columns.
If we do this the other way, we get
$$v \otimes u = \begin{pmatrix} ca & cb \\ da & db \end{pmatrix},$$
so
(u ⊗ v)t = v ⊗ u.
When u = v, u ⊗ v = v ⊗ v is a symmetric matrix.
Here is code for tensor.
There is no need to use this, since the numpy built-in outer does the same
job,
A = outer(u,v)
det(u ⊗ v) = 0.
This is true no matter what the vectors u and v are. Check this yourself.
By definition of u ⊗ v,
so
v · Qv = (x, y) · (ax + by, bx + cy) = ax2 + 2bxy + cy 2 .
This is the quadratic form associated to the matrix Q.
Quadratic Form
If
$$Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix} \quad\text{and}\quad v = (x, y),$$
then
$$v \cdot Qv = ax^2 + 2bxy + cy^2.$$
$$Q = I \implies v \cdot Qv = x^2 + y^2.$$
When Q is diagonal,
$$Q = \begin{pmatrix} a & 0 \\ 0 & c \end{pmatrix} \implies v \cdot Qv = ax^2 + cy^2.$$
If Q = u ⊗ u, then
Exercises
ax + by = c, −bx + ay = d.
Exercise 1.4.2 Let u = (1, a), v = (b, 2), and w = (3, 4). Solve
u + 2v + 3w = 0
for a and b.
Exercise 1.4.3 Let u = (1, 2), v = (3, 4), and w = (5, 6). Find a and b such
that
au + bv = w.
Exercise 1.4.4 Let P be a nonzero point in the plane. What is (P⊥)⊥?
Exercise 1.4.5 Let $A = \begin{pmatrix} 8 & -8 \\ -7 & -3 \end{pmatrix}$ and $B = \begin{pmatrix} 3 & -2 \\ 2 & -2 \end{pmatrix}$. Compute AB and BA.
Exercise 1.4.6 Let $A = \begin{pmatrix} 9 & 2 \\ -36 & -8 \end{pmatrix}$. Find a nonzero 2 × 2 matrix B satisfying AB = 0.
Exercise 1.4.7 Solve for X
$$\begin{pmatrix} -7 & 4 \\ 4 & -3 \end{pmatrix} - 4X = \begin{pmatrix} -9 & 5 \\ 6 & -9 \end{pmatrix}.$$
Exercise 1.4.8 If u = (a, b) and v = (c, d) and A = u ⊗ v, use (1.4.17) to
compute A2 .
u ∧ v = u ⊗ v − v ⊗ u.
Exercise 1.4.13 Calculate the areas of the triangles and the squares in Fig-
ure 1.21. From that, deduce Pythagoras’s theorem c2 = a2 + b2 .
Above |x| stands for the length of the vector x, or the distance of the point
x to the origin. When d = 2 and we are in two dimensions, this was defined
in §1.4. For general d, this is defined in §2.1. In this section we continue to
focus on two dimensions d = 2.
The mean or sample mean is
$$\mu = \frac{1}{N}\sum_{k=1}^N x_k = \frac{x_1 + x_2 + \cdots + x_N}{N}. \tag{1.5.1}$$
Point of Best-fit
The mean is the point of best-fit: The mean minimizes the mean-
square distance to the dataset (Figure 1.22).
Fig. 1.22 MSD for the mean (green) versus MSD for a random point (red).
Using (1.4.6),
|a + b|2 = |a|2 + 2a · b + |b|2
for vectors a and b, it is easy to derive the above result. Insert a = xk − µ
and b = µ − x to get
$$MSD(x) = MSD(\mu) + \frac{2}{N}\sum_{k=1}^N (x_k - \mu)\cdot(\mu - x) + |\mu - x|^2.$$
Since the vectors x_k − µ sum to zero by the definition of µ, the middle term vanishes, so we have
$$MSD(x) = MSD(\mu) + |x - \mu|^2,$$
which is clearly ≥ MSD(µ), deriving the above result.
Here is the code for Figure 1.22.
N, d = 20, 2
# d x N array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
mu = mean(dataset,axis=1)
p = array([random(),random()])
for v in dataset.T:
    plot([mu[0],v[0]],[mu[1],v[1]],c='green')
    plot([p[0],v[0]],[p[1],v[1]],c='red')
scatter(*mu)
scatter(*dataset)
grid()
show()
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ.
Then the variance is the matrix (see §1.4 for tensor product)
$$Q = \frac{v_1 \otimes v_1 + v_2 \otimes v_2 + \cdots + v_N \otimes v_N}{N}. \tag{1.5.3}$$
x1 = (1, 2), x2 = (3, 4), x3 = (5, 6), x4 = (7, 8), x5 = (9, 10). (1.5.4)
Since
$$(\pm 4, \pm 4) \otimes (\pm 4, \pm 4) = \begin{pmatrix} 16 & 16 \\ 16 & 16 \end{pmatrix}, \quad (\pm 2, \pm 2) \otimes (\pm 2, \pm 2) = \begin{pmatrix} 4 & 4 \\ 4 & 4 \end{pmatrix}, \quad (0, 0) \otimes (0, 0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},$$
Notice
Q = 8(1, 1) ⊗ (1, 1),
which, as we see below (§2.5), reflects the fact that the points of this dataset lie on a line. Here the line is y = x + 1. Here is code from scratch for the variance (matrix) of a dataset.
def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])
N, d = 20, 2
# N x d array
dataset = array([ [random(),random()] for _ in range(N) ])
mu = mean(dataset,axis=0)
# center dataset
vectors = dataset - mu
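The final step, assembling Q from the centered vectors, does not appear above; a minimal sketch using the tensor function defined earlier (sum here is numpy's sum, so axis=0 sums the N matrices):

# average of the tensor products of the centered vectors
Q = sum([ tensor(v,v) for v in vectors ], axis=0) / N
Q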
N, d = 20, 2
# d x N array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
Q = cov(dataset,bias=True)
Q
This returns the same result as the previous code for Q. Notice here there is
no need to compute the mean, this is taken care of automatically. The option
bias=True indicates division by N , returning the biased variance. To return
the unbiased variance and divide by N −1, change the option to bias=False,
or remove it, since bias=False is the default.
From (1.4.16), if Q is the variance matrix (1.5.3),
$$\operatorname{trace}(Q) = \frac{1}{N}\sum_{k=1}^N |x_k - \mu|^2. \tag{1.5.5}$$
# dataset is d x N array
Q = cov(dataset,bias=True)
Q.trace()
P b = (b · u)u.
These vectors are all multiples of u, as they should be. The projected dataset
is two-dimensional.
Alternately, discarding u and retaining the scalar coefficients, we have the
one-dimensional dataset
v1 · u, v2 · u, . . . , vN · u.
Because the reduced dataset and projected dataset are essentially the
same, we also refer to q as the variance of the projected dataset. Thus we
conclude (see §1.4 for v · Qv)
# dataset is d x N array
Q = cov(dataset,bias=True)
This shows that the dataset lies on the line passing through µ and perpendicular to (1, −1).
(v − µ) · Q(v − µ) = k
(v − µ) · Q−1 (v − µ) = k
Fig. 1.24 Unit variance ellipses (blue) and unit inverse variance ellipses (red) with µ = 0.
If we write v = (x, y) for a vector in the plane, the variance ellipse equation
centered at µ = 0 is
v · Qv = ax2 + 2bxy + cy 2 = k.
def ellipse(Q,mu,padding=.5,levels=[1],render="var"):
    grid()
    scatter(*mu,c="red",s=5)
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d,e = mu
    delta = .01
    x = arange(d-padding,d+padding,delta)
    y = arange(e-padding,e+padding,delta)
    x, y = meshgrid(x, y)
    if render == "var" or render == "both":
        # matrix_text(Q,mu,padding,'blue')
        eq = a*(x-d)**2 + 2*b*(x-d)*(y-e) + c*(y-e)**2
        contour(x,y,eq,levels=levels,colors="blue",linewidths=.5)
    if render == "inv" or render == "both":
        draw_major_minor_axes(Q,mu)
        Q = inv(Q)
        # matrix_text(Q,mu,padding,'red')
        A, B, C = Q[0,0],Q[0,1],Q[1,1]
        eq = A*(x-d)**2 + 2*B*(x-d)*(y-e) + C*(y-e)**2
        contour(x,y,eq,levels=levels,colors="red",linewidths=.5)
With this code, ellipse(Q,mu) returns the unit variance ellipse in the unit
square centered at µ. The codes for the functions draw_major_minor_axes
and matrix_text are below.
The code for draw_major_minor_axes uses the formulas for the best-fit
and worst-fit vectors (1.5.7).
Depending on whether render is var, inv, or both, the code renders the
variance ellipse (blue), the inverse variance ellipse (red), or both. The code
renders several ellipses, one for each level in the list levels. The default is
levels = [1], so the unit ellipse is returned. Also padding can be adjusted
to enlarge the plot.
The code for Figure 1.24 is
mu = array([0,0])
Q = array([[9,0],[0,4]])
ellipse(Q,mu,padding=4,render="both")
show()
Q = array([[9,2],[2,4]])
ellipse(Q,mu,padding=4,render="both")
show()
To use TEX to display the matrices in Figure 1.24, uncomment the two matrix_text calls in the ellipse code, and insert the lines
rcParams['text.usetex'] = True
rcParams['text.latex.preamble'] = r'\usepackage{amsmath}'
def matrix_text(Q,mu,padding,color):
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d,e = mu
    valign = e + 3*padding/4
    if color == 'blue': halign = d - padding/2; tex = "$Q="
    else: halign = d; tex = "$Q^{-1}="
    # r"..." means raw string
    tex += r"\begin{pmatrix}" + str(round(a,2)) + "&" + str(round(b,2))
    tex += r"\\" + str(round(b,2)) + "&" + str(round(c,2))
    tex += r"\end{pmatrix}$"
    return text(halign,valign,tex,fontsize=15,color=color)
Fig. 1.25 Variance ellipses (blue) and inverse variance ellipses (red) for a dataset.
N = 50
# N x d array
dataset = array([ [random(),random()] for _ in range(N) ])
Q = cov(dataset.T,bias=True)
mu = mean(dataset,axis=0)
scatter(*dataset.T,s=5)
ellipse(Q,mu,render="var",padding=.5,levels=[.005,.01,.02])
show()
scatter(*dataset.T,s=5)
ellipse(Q,mu,render="inv",padding=.5,levels=[.5,1,2])
show()
x1 , x2 , . . . , xN , and y1 , y2 , . . . , yN .
Suppose the mean of this dataset is µ = (µx , µy ). Then, by the formula for
tensor product, the variance matrix is
$$Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},$$
where
$$a = \frac{1}{N}\sum_{k=1}^N (x_k - \mu_x)^2, \quad b = \frac{1}{N}\sum_{k=1}^N (x_k - \mu_x)(y_k - \mu_y), \quad c = \frac{1}{N}\sum_{k=1}^N (y_k - \mu_y)^2.$$
From this, we see a is the variance of the x-features, and c is the variance
of y-features. We also see b is a measure of the correlation between the x and
y features.
Standardizing the dataset means to center the dataset and to place the x
and y features on the same scale. For example, the x-features may be close
to their mean µx , resulting in a small x variance a, while the y-features may
be spread far from their mean µy , resulting in a large y variance c.
When this happens, the different scales of the x's and y's distort the relation between them, and b may not accurately reflect the correlation. To correct for this, we center and re-scale
$$x_1, x_2, \ldots, x_N \ \to\ x'_1 = \frac{x_1 - \mu_x}{\sqrt{a}},\ x'_2 = \frac{x_2 - \mu_x}{\sqrt{a}},\ \ldots,\ x'_N = \frac{x_N - \mu_x}{\sqrt{a}},$$
and
$$y_1, y_2, \ldots, y_N \ \to\ y'_1 = \frac{y_1 - \mu_y}{\sqrt{c}},\ y'_2 = \frac{y_2 - \mu_y}{\sqrt{c}},\ \ldots,\ y'_N = \frac{y_N - \mu_y}{\sqrt{c}}.$$
where
$$\rho = \frac{1}{N}\sum_{k=1}^N x'_k y'_k = \frac{b}{\sqrt{ac}}$$
# dataset is d x N array
corrcoef(dataset)
Fig. 1.26 Unit variance ellipse and unit inverse variance ellipse with standardized Q.
$$u \cdot Qu = \max_{|v|=1} v \cdot Qv.$$
Since the sine function varies between +1 and −1, we conclude the pro-
jected variance varies between
1 − ρ ≤ v · Qv ≤ 1 + ρ,
and
$$\theta = \frac{\pi}{4},\quad v_+ = \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right) \implies v_+ \cdot Qv_+ = 1 + \rho,$$
$$\theta = \frac{3\pi}{4},\quad v_- = \left(\frac{-1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right) \implies v_- \cdot Qv_- = 1 - \rho.$$
Thus the best-aligned vector v+ is at 45◦ , and the worst-aligned vector is at
135◦ (Figure 1.26).
Actually, the above is correct only if ρ > 0. When ρ < 0, it’s the other
way. The correct answer is
1 − |ρ| ≤ v · Qv ≤ 1 + |ρ|,
Fig. 1.27 Positively and negatively correlated datasets (unit inverse ellipses).
Here are two randomly generated datasets. The dataset on the left in
Figure 1.27 is positively correlated. Its mean and variance are
$$\mu = (0.53626891,\ 0.54147513), \qquad Q = \begin{pmatrix} 0.08016526 & 0.01359483 \\ 0.01359483 & 0.08589097 \end{pmatrix}.$$
The dataset on the right in Figure 1.27 is negatively correlated. Its mean and variance are
$$\mu = (0.46979642,\ 0.48347168), \qquad Q = \begin{pmatrix} 0.08684941 & -0.00972569 \\ -0.00972569 & 0.09409118 \end{pmatrix}.$$
λ− ≤ v · Qv ≤ λ+ , |v| = 1.
If the inverse variance ellipse is not a circle, then Q is not a multiple of the
identity, and either v+ or w+ is nonzero. If v+ ̸= 0, v+ is the best-aligned
vector. If v+ = 0, w+ is the best-aligned vector.
If the inverse variance ellipse is not a circle, then Q is not a multiple of the
identity, and either v− or w− is nonzero. If v− ̸= 0, v− is the worst-aligned
vector. If v− = 0, w− is the worst-aligned vector.
If Q is a multiple of the identity, then any vector is best-aligned and worst-
aligned.
All this follows from solutions of homogeneous 2 × 2 systems (1.4.10). The
general d×d case is in §3.2. For the 2×2 case discussed here, see the exercises
at the end of §3.2.
The code for rendering the major and minor axes of the inverse variance
ellipse uses (1.5.6) and (1.5.7),
def draw_major_minor_axes(Q,mu):
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d, e = mu
    label = { 1:"major", -1:"minor" }
    for pm in [1,-1]:
Exercises
d = 10
# 100 x 2 array
dataset = array([ array([i+j,j]) for i in range(d) for j in range(d) ])
Compute the mean and variance, and plot the dataset and the mean.
Exercise 1.5.2 Let the dataset be the petal lengths against the petal widths
in the Iris dataset. Compute the mean and variance, and plot the dataset and
the mean.
Exercise 1.5.3 Project the dataset in Exercise 1.5.1 onto the line through
the vector (1, 2). What is the projected dataset? What is the reduced dataset?
Exercise 1.5.4 Project the dataset in Exercise 1.5.2 onto the line through
the vector (1, 2). What is the projected dataset? What is the reduced dataset?
Exercise 1.5.5 Plot the variance ellipse and inverse variance ellipses of the
dataset in Exercise 1.5.1.
Exercise 1.5.6 Plot the variance ellipse and inverse variance ellipses of the
dataset in Exercise 1.5.2.
Exercise 1.5.7 Plot the dataset in Exercise 1.5.1 together with its mean
and the line through the vector of best fit.
Exercise 1.5.8 Plot the dataset in Exercise 1.5.2 together with its mean
and the line through the vector of best fit.
Exercise 1.5.9 Standardize the dataset in Exercise 1.5.1. Plot the stan-
dardized dataset. What is the correlation matrix?
Exercise 1.5.10 Standardize the dataset in Exercise 1.5.2. Plot the stan-
dardized dataset. What is the correlation matrix?
Exercise 1.5.11 Let $Q = \begin{pmatrix} a & b \\ b & a \end{pmatrix}$. Show Q is nonnegative when a ≥ |b|. (Compute v · Qv with v = (cos θ, sin θ) as in the text.)
Although not used in later material, this section is here to boost intuition
about high dimensions. Draw four disks inside a square, and a fifth disk in
the center.
In Figure 1.29, the edge-length of the square is 4, and the radius of each
blue disk is 1. Draw the diagonal of the square. Then the diagonal passes
through two blue disks.
Since the length of the diagonal of the square is 4√2, and the diameters of the two blue disks add up to 4, the portions of the diagonal outside the blue disks add up to 4√2 − 4. Hence the radius of the red disk is
$$\frac{1}{4}(4\sqrt{2} - 4) = \sqrt{2} - 1.$$
In three dimensions, draw eight balls inside a cube, as in Figure 1.30, and one ball in the center. Since the edge-length of the cube is 4, the radius of each blue ball is 1. Since the length of the diagonal of the cube is 4√3, the radius of the red ball is
$$\frac{1}{4}(4\sqrt{3} - 4) = \sqrt{3} - 1.$$
Now we repeat in d dimensions. Here the edge-length of the cube remains 4, the radius of each blue ball remains 1, and there are 2^d blue balls. Since the length of the diagonal of the cube is 4√d, the same calculation results in the radius of the red ball equal to r = √d − 1.
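As a quick check of the formula r = √d − 1, here is the red-ball radius for a few dimensions:

from numpy import sqrt

for d in [2, 3, 4, 9, 100]:
    print(d, sqrt(d) - 1)   # the radius grows without bound with d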
# initialize figure
ax = axes()
# red disk
circle = Circle((2, 2), radius=sqrt(2)-1, color='red')
ax.add_patch(circle)
ax.set_axis_off()
ax.axis('equal')
show()
%matplotlib ipympl
from matplotlib.pyplot import *
from numpy import *
from itertools import product
# initialize figure
ax = axes(projection="3d")
# render ball
def ball(a,b,c,r,color):
return ax.plot_surface(a + r*x,b + r*y, c + r*z,color=color)
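# assumption: the sphere mesh (x, y, z) and the ball-center grids are not
# shown in this excerpt; a minimal sketch of what they might look like:
u, v = meshgrid(linspace(0, 2*pi, 30), linspace(0, pi, 30))
x, y, z = cos(u)*sin(v), sin(u)*sin(v), cos(v)
# centers of the eight unit balls inside the cube of edge-length 4
xcent = ycent = zcent = [1, 3]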
# blue balls
for center in product(xcent,ycent,zcent): ball(*center,1,"blue")
# red ball
ball(2,2,2,sqrt(3)-1,"red")
# cube grid
cube = ones((4,4,4),dtype=bool)
ax.voxels(cube, edgecolors='black',lw=.5,alpha=0)
ax.set_aspect("equal")
ax.set_axis_off()
show()
Ĝ = {(tx, 1 − t) : 0 ≤ t ≤ 1, x in G}.
Thus
$$\operatorname{Vol}(\hat{G}) = \frac{\operatorname{Vol}(G)}{d+1}.$$
Exercises
Exercise 1.6.1 Why is the diagonal length of the square 4√2?

Exercise 1.6.2 Why is the diagonal length of the cube 4√3?
Exercise 1.6.3 Why does dividing by 4 yield the red disk radius and the red
ball radius?
Exercise 1.6.4 Suspend the unit circle G : x2 +y 2 = 1 from its center. What
is the suspension Ĝ? Conclude area(unit disk) = length(unit circle)/2.
v = (t1 , t2 , . . . , td ).
The scalars are the components or the features of v. If there are d features,
we say the dimension of v is d. We call v a d-dimensional vector.
A point x is also a list of scalars, x = (t1 , t2 , . . . , td ). The relation between
points x and vectors v is discussed in §1.3. The set of all d-dimensional vectors
or points is d-dimensional space Rd .
In Python, we use numpy or sympy for vectors and matrices. In Python,
if L is a list, then numpy.array(L) or sympy.Matrix(L) return a vector or
matrix.
v = array([1,2,3])
v.shape
v = Matrix([1,2,3])
v.shape
The first v.shape returns (3,), and the second v.shape returns (3,1). In
either case, v is a 3-dimensional vector.
Vectors are added and scaled component by component: With
we have
together are the standard basis. Similarly, in Rd , we have the standard basis
e1 , e2 , . . . , ed .
# numpy vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])
A.shape
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.shape
B = array([u,v,w])
The transpose interchanges rows and columns: the rows of At are the columns
of A. In both numpy or sympy, the transpose of A is A.T.
A vector v may be written as a 1 × N matrix (a row vector)
$$v = \begin{pmatrix} t_1 & t_2 & \ldots & t_N \end{pmatrix},$$
or as an N × 1 matrix (a column vector).
# 5x3 matrix
A = Matrix.hstack(u,v,w)
# column vector
b = Matrix([1,1,1,1,1])
# 5x4 matrix
M = Matrix.hstack(A,b)
In general, for any sympy matrix A, column vectors can be hstacked and
row vectors can be vstacked. For any matrix A, the code
returns True. Note we use the unpacking operator * to unpack the list, before
applying hstack.
In numpy, there is column_stack and row_stack, so the code
both return True. Here col refers to rows of At , hence refers to the columns
of A.
The number of rows is len(A), and the number of columns is len(A.T).
To access row i, use A[i]. To access column j, access row j of the transpose,
A.T[j]. To access the j-th entry in row i, use A[i,j].
In sympy, the number of rows in a matrix A is A.rows, and the number of
columns is A.cols, so
A.shape == (A.rows,A.cols)
A = zeros(2,3)
B = ones(2,2)
C = Matrix([[1,2],[3,4]])
D = B + C
E = 5 * C
F = eye(4)
A, B, C, D, E, F
returns
$$\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad \begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix}, \quad \begin{pmatrix} 5 & 10 \\ 15 & 20 \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
A = diag(1,2,3,4)
B = diag(-1, ones(2, 2), Matrix([5, 7, 5]))
A, B
returns
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}, \qquad \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 5 \\ 0 & 0 & 0 & 7 \\ 0 & 0 & 0 & 5 \end{pmatrix}.$$
It is straightforward to convert back and forth between numpy and sympy.
In the code
A = diag(1,2,3,4)
B = array(A)
C = Matrix(B)
Exercises
Exercise 2.1.1 A vector is one-hot encoded if all features are zero, except for
one feature which is one. For example, in R3 there are three one-hot encoded
vectors
(1, 0, 0), (0, 1, 0), (0, 0, 1).
A matrix is a permutation matrix if it is square and all rows and all columns
are one-hot encoded. How many 3 × 3 permutation matrices are there? What
about d × d?
2.2 Products
u · v = s1 t1 + s2 t2 + · · · + sd td . (2.2.1)
As in §1.4, we always have rows on the left, and columns on the right.
In Python,
u = array([1,2,3])
v = array([4, 5, 6])
u = Matrix([1,2,3])
v = Matrix([4, 5, 6])
sqrt(dot(v,v))
sqrt(v.T * v)
As in §1.4,
Dot Product
In two dimensions, this was equation (1.4.5) in §1.4. Since any two vectors
lie in a two-dimensional plane, this remains true in any dimension. More
precisely, (2.2.2) is taken as the definition of cos θ.
Based on this, we can compute the angle θ,
$$\cos\theta = \frac{u \cdot v}{|u|\,|v|} = \frac{u \cdot v}{\sqrt{(u \cdot u)(v \cdot v)}}.$$
def angle(u,v):
    a = dot(u,v)
    b = dot(u,u)
    c = dot(v,v)
    theta = arccos(a / sqrt(b*c))
    return degrees(theta)
Cauchy-Schwarz Inequality
The dot product of two vectors is absolutely less or equal to the prod-
uct of their lengths,
|a + b| = (a + b) · v ≤ |a| + |b|.
Let A and B be two matrices. If the row dimension of A equals the column
dimension of B, the matrix-matrix product AB is defined. When this condition
holds, the entries in the matrix AB are the dot products of the rows of A with
the columns of B. In Python,
the code
A,B,dot(A,B)
A,B,A*B
returns
$$AB = \begin{pmatrix} 70 & 80 & 90 \\ 158 & 184 & 210 \end{pmatrix}.$$
Let A and B be matrices, and suppose the row dimension of A and the column dimension of B both equal d. Then the matrix-matrix product AB is defined. If A = (aij) and B = (bij), then we may write AB in summation notation as
$$(AB)_{ij} = \sum_{k=1}^d a_{ik} b_{kj}. \tag{2.2.5}$$
$$\operatorname{trace}(AB) = \sum_{i=1}^d (AB)_{ii} = \sum_{i=1}^d \sum_{k=1}^d a_{ik} b_{ki}.$$
dot(A,B).T == dot(B.T,A.T)
In terms of row vectors and column vectors, this is automatic. For example,
In Python,
dot(dot(A,u),v) == dot(u,dot(A.T,v))
dot(dot(A.T,u),v) == dot(u,dot(A,v))
As a consequence,1
(u ⊗ v)ij = ui vj .
Then the identities (1.4.17) and (1.4.18) hold in general. Using the tensor
product, we have
Tensor Identity
AAt = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN . (2.2.9)
To derive this, let Q and Q′ be the symmetric matrices on the left and
right sides of (2.2.9). By Exercise 2.2.7, to establish (2.2.9), it is enough to
show x · Qx = x · Q′ x for every vector x. By (2.2.7),
At x = (v1 · x, v2 · x, . . . , vN · x).
Since |At x|2 is the sum of the squares of its components, this establishes
x · Qx = x · Q′ x, hence the result.
valid for any matrix A and vectors u, v with compatible shapes. The deriva-
tion of this identity is a simple calculation with components that we skip.
and
$$\|A\|^2 = \operatorname{trace}(A^t A) = \operatorname{trace}(AA^t). \tag{2.2.13}$$
By replacing A by At , the same results hold for rows.
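As a quick numerical check of (2.2.13), with an arbitrary matrix A:

from numpy import array, dot, trace
from numpy.linalg import norm

A = array([[1.,2,3],[4,5,6]])
# all three quantities agree (here 91, up to rounding)
norm(A)**2, trace(dot(A.T,A)), trace(dot(A,A.T))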
Q = dot(vectors,vectors.T)/N
Q = cov(dataset,bias=True)
After downloading the Iris dataset as in §2.1, the mean, variance, and total variance are
$$\mu = (5.84,\ 3.05,\ 3.76,\ 1.2), \qquad Q = \begin{pmatrix} 0.68 & -0.04 & 1.27 & 0.51 \\ -0.04 & 0.19 & -0.32 & -0.12 \\ 1.27 & -0.32 & 3.09 & 1.29 \\ 0.51 & -0.12 & 1.29 & 0.58 \end{pmatrix}, \qquad 4.54.$$
x1 · ej , x2 · ej , . . . , xN · ej ,
consisting of the j-th feature of the samples. If qjj is the variance of this
scalar dataset, then q11 , q22 , . . . , qdd are the diagonal entries of the variance
matrix.
To standardize the dataset, we center it, and rescale the features to have
variance one, as follows. Let µ = (µ1 , µ2 , . . . , µd ) be the dataset mean. For
each sample point x = (t1 , t2 , . . . , td ), the standardized vector is
$$v = \left( \frac{t_1 - \mu_1}{\sqrt{q_{11}}},\ \frac{t_2 - \mu_2}{\sqrt{q_{22}}},\ \ldots,\ \frac{t_d - \mu_d}{\sqrt{q_{dd}}} \right).$$
$$q'_{ij} = \frac{q_{ij}}{\sqrt{q_{ii}\, q_{jj}}}, \qquad i, j = 1, 2, \ldots, d.$$
In Python,
N, d = 10, 2
# Nxd array
dataset = array([ [random() for _ in range(d)] for _ in range(N) ])
# standardize dataset
from sklearn.preprocessing import StandardScaler
standardized = StandardScaler().fit_transform(dataset)
Qcorr = corrcoef(dataset.T)
Qcov = cov(standardized.T,bias=True)
allclose(Qcov,Qcorr)
returns True.
Exercises
v = (1, 2, 3, . . . , n).
Let |v| = √(v · v) be the length of v. Then, for example, when n = 1, |v| = 1 and, when n = 2, |v| = √5. There is one other n for which |v| is a whole number. Use Python to find it.
AB = u1 ⊗ v1 + u2 ⊗ v2 + · · · + ud ⊗ vd .
Let A be any matrix and b a vector. The goal is to solve the linear system
Ax = b. (2.3.1)
In this section, we use the inverse A−1 and the pseudo-inverse A+ to solve
(2.3.1).
Of course, the system (2.3.1) doesn’t even make sense unless
In what follows, we assume this equality is true and dimensions are appro-
priately compatible.
Even then, it’s very easy to construct matrices A and vectors b for which
the linear system (2.3.1) has no solutions at all! For example, take A the zero
matrix and b any non-zero vector. Because of this, we must take some care
when solving (2.3.1).
AB = I = BA. (2.3.2)
we have
(AB)−1 = B −1 A−1 .
Ax = b =⇒ x = A−1 b. (2.3.3)
Ax = A(A−1 b) = (AA−1 )b = Ib = b.
# solving Ax=b
x = A.inv() * b
# solving Ax=b
x = dot(inv(A) , b)
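For instance, with a small invertible matrix (the numbers are just for illustration):

from numpy import array, dot
from numpy.linalg import inv

A = array([[2., 1], [1, 3]])
b = array([3., 5])
x = dot(inv(A), b)
x, dot(A, x)   # x = (0.8, 1.4), and dot(A, x) recovers b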
is in §2.6. The upshot is: every (square or non-square) matrix A has a pseudo-
inverse A+ . Here is the general result.
x+ = A+ b =⇒ Ax+ = b.
Let
# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
# arrange as columns
A = column_stack([u,v,w])
pinv(A)
returns
$$A^+ = \frac{1}{150}\begin{pmatrix} -37 & -20 & -3 & 14 & 31 \\ -10 & -5 & 0 & 5 & 10 \\ 17 & 10 & 3 & -4 & -11 \end{pmatrix}.$$
Alternatively, in sympy,
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.pinv()
For
b3 = (−9, −3, 3, 9, 10),
we have
$$x^+ = A^+ b_3 = \frac{1}{15}(82,\ 25,\ -32).$$
However, for this x+ , we have
We solve
Bx = u, Bx = v, Bx = w
by constructing the candidates
B + u, B + v, B + w,
Let
$$C = A^t = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}$$
and let f = (0, −5, −10). By Exercise 2.6.8, C+ = (A+)t, so
$$C^+ = (A^+)^t = \frac{1}{150}\begin{pmatrix} -37 & -10 & 17 \\ -20 & -5 & 10 \\ -3 & 0 & 3 \\ 14 & 5 & -4 \\ 31 & 10 & -11 \end{pmatrix}$$
and
$$x^+ = C^+ f = \frac{1}{50}(32,\ 35,\ 38,\ 41,\ 44).$$
Once we confirm equality of Cx+ and f , which is the case, we obtain a
solution x+ of Cx = f .
We solve
Dx = a, Dx = b, Dx = c, Dx = d, Dx = e,
D+ a, D+ b, D+ c, D+ d, D+ e,
x+ = (1, 0), x+ = (2, 1), x+ = (3, 2), x+ = (4, 3), x+ = (5, 4).
Exercises
Exercise 2.3.2 With R(d) as in Exercise 2.2.9, find the formula for the
inverse and pseudo-inverse of R(d), whichever exists. Here d = 1, 2, 3, . . . .
t 1 v1 + t 2 v2 + · · · + t d vd . (2.4.1)
and let A be the matrix with columns u, v, w, as in (2.3.4). Let x be the vector
(r, s, t) = (1, 2, 3). Then an explicit calculation shows (do this calculation!)
the matrix-vector product Ax equals ru + sv + tw,
Ax = ru + sv + tw.
The code
returns
x = (t1 , t2 , . . . , td ).
Then
Ax = t1 v1 + t2 v2 + · · · + td vd , (2.4.2)
In other words,
t1 v1 + t2 v2 + · · · + td vd
of the vectors. For example, span(b) of a single vector b is the line through
b, and span(u, v, w) is the set of all linear combinations ru + sv + tw.
Span Definition I
S = span(v1 , v2 , . . . , vd ).
Span Definition II
span(v1 , v2 , . . . , vd ) = span(w1 , w2 , . . . , wN ).
Thus there are many choices of spanning vectors for a given span.
For example, let u, v, w be the columns of A in (2.3.4). Let ⊂ mean “is
contained in”. Then
since adding a third vector can only increase the linear combination possibil-
ities. On the other hand, since w = 2v − u, we also have
It follows that
span(u, v, w) = span(u, v).
Let A be a matrix. The column space of A is the span of its columns. For
A as in (2.3.4), the column space of A is span(u, v, w). The code
# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
A.columnspace()

returns a minimal list of vectors spanning the column space of A. The column
rank of A is the length of the list, i.e. the number of vectors returned.
For example, for A as in (2.3.4), this code returns the list
$$[u, v] = \left[ \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{pmatrix}, \begin{pmatrix} 6 \\ 7 \\ 8 \\ 9 \\ 10 \end{pmatrix} \right].$$
Ax = t1 v1 + t2 v2 + · · · + td vd .
By (2.4.3),
The column space of a matrix A consists of all vectors of the form Ax.
A vector b is in the column space of A when Ax = b has a solution.
from scipy.linalg import orth

# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])
orth(A)
For example, let b3 = (−9, −3, 3, 9, 10) and let Ā = (A, b3 ). Using Python,
check the column rank of Ā is 3. Since the column rank of A is 2, we conclude
b3 is not in the column space of A, so b3 is not a linear combination of u, v,
w.
When (2.4.6) holds, b is a linear combination of the columns of A. However,
(2.4.6) does not tell us which linear combination. According to (2.4.3), finding
the specific linear combination is equivalent to solving Ax = b.
then
(r, s, t) = re1 + se2 + te3 .
This shows the vectors e1 , e2 , e3 span R3 , or
R3 = span(e1 , e2 , e3 ).
e1 = (1, 0, 0, . . . , 0, 0)
e2 = (0, 1, 0, . . . , 0, 0)
e3 = (0, 0, 1, . . . , 0, 0) (2.4.7)
... = ...
ed = (0, 0, 0, . . . , 0, 1)
Then e1 , e2 , . . . , ed span Rd , so
Rd is a span.
span(a, b, c, d, e) = span(a, f ).
For any matrix, the row rank equals the column rank.
Because of this, we refer to this common number as the rank of the matrix.
t1 v1 + t2 v2 + · · · + td vd = 0.
ru + sv + tw = 1u − 2v + 1w = 0 (2.4.8)
u = −(s/r)v − (t/r)w.
v = −(r/s)u − (t/s)w.
If t ̸= 0, then
w = −(r/t)u − (s/t)v.
Hence linear dependence of u, v, w means one of the three vectors is a multiple
of the other two vectors.
In general, a vanishing non-trivial linear combination of v1 , v2 , . . . , vd , or
linear dependence of v1 , v2 , . . . , vd , is the same as saying one of the vectors
is a linear combination of the remaining vectors.
In terms of matrices,
A.nullspace()
This says the null space of A consists of all multiples of (1, −2, 1). Since the
code
[r,s,t] = A.nullspace()[0]
null_space(A)
A Versus At A
Let A be any matrix. The null space of A equals the null space of
At A.
|Ax|2 = Ax · Ax = x · At Ax = 0,
t1 v1 + t2 v2 + · · · + td vd = 0.
Take the dot product of both sides with v1 . Since the dot products of any
two vectors is zero, and each vector has length one, we obtain
t1 = t1 v1 · v1 = t1 v1 · v1 + t2 v2 · v1 + · · · + td vd · v1 = 0.
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
A = Matrix.hstack(u,v,w)
C = A.T
C.nullspace()
u⊥ = {v : u · v = 0} . (2.4.9)
Ax = (v1 · x, v2 · x, . . . , vN · x).
v1 · x = 0, v2 · x = 0, . . . , vN · x = 0,
Every vector in the row space is orthogonal to every vector in the null
space,
Actually, the above paragraph only established the first identity. For the
second identity, we need to use (2.7.9), as follows
rowspace = (rowspace⊥)⊥ = nullspace⊥.
Since the row space is the orthogonal complement of the null space, and
the null space of A equals the null space of At A, we conclude
A Versus At A
Let A be any matrix. Then the row space of A equals the row space
of At A.
Now replace A by At in this last result. Since the row space of At equals
the column space of A, and AAt is symmetric, we also have
A Versus AAt
Let A be any matrix. Then the column space of A equals the column
space of AAt .
A(x1 − x2 ) = b − b = 0, (2.4.11)
From this point of view, the source space of A is Rd , and the target space of
A is RN .
Let A be any matrix. The null space of A and the row space of A are
in the source space of A, and the column space of A is in the target
space of A.
This shows the null space of an invertible matrix is zero, hence the nullity is
zero.
Since the row space is the orthogonal complement of the null space, we
conclude the row space is all of Rd .
In §2.9, we see that the column rank and the row rank are equal. From
this, we see also the column space is all of Rd . In summary,
Let A be a d×d invertible matrix. Then the null space is zero, and the
row space and column space are both Rd . In particular, the nullity is
0, and the row rank rank and column rank are both d.
Exercises
Exercise 2.4.1 For what condition on a, b, c do the vectors (1, a), (2, b),
(3, c) lie on a line?
Exercise 2.4.2 Let
$$C = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}, \qquad x = \begin{pmatrix} 16 \\ 17 \\ 18 \\ 19 \\ 20 \end{pmatrix}.$$
Compute Cx in two ways, first by row times column, then as a linear combi-
nation of the columns of C.
Exercise 2.4.3 Check that the array in Figure 2.1 matches with b1 , b2 as
explained in the text, and the vectors b1 and b2 are orthogonal.
Exercise 2.4.4 Let A = (u, v, w) be as in (2.3.4) and let b = (16, 17, 18, 19, 20).
Is b in the column space of A? If yes, solve b = ru + sv + tw.
What are A(5, 3) and A(3, 5)? What are the source and target spaces for
A(N, d)?
Exercise 2.4.8 Calculate the column rank of the matrix A(N, d) for all N ≥
2 and all d ≥ 2. (Column rank is the length of the list columnspace returns.)
Exercise 2.4.9 What is the nullity of the matrix A(N, d) for all N ≥ 2 and
all d ≥ 2?
v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,
v1 · u, v2 · u, . . . , vN · u.
v · Qv = 0.
v · x + b = 0.
v · µ + b = 0.
v · (x − µ) = 0.
v · (x − µ) = 0.
a(x − x0 ) + b(y − y0 ) = 0, or ax + by = c,
u · Qu = 0 and v · Qv = 0.
From this we see the dataset corresponding to this Q lies in two planes: the plane orthogonal to u, and the plane orthogonal to v. But the intersection of two planes is a line, so this dataset lies in a line, which means it is one-dimensional.
Which line does this dataset lie in? Well, the line has to pass through the
mean, and is orthogonal to u and v. If we find a vector b satisfying b · u = 0
and b · v = 0, then the line will pass through the mean and will be parallel to
b. But we know how to find such a vector. Let A be the matrix with rows u, v.
Then b in the nullspace of A fulfills the requirements. We obtain b = (1, 1, 1).
Let Q be a variance matrix. Then the null space of Q equals the zero
variance directions of Q.
To see this, we use the quadratic equation from high school. If Q is sym-
metric, then u · Qv = v · Qu. For t scalar and u, v vectors, since Q ≥ 0, the
function
(v + tu) · Q(v + tu)
is nonnegative for all t scalar. Expanding this function into powers of t, we
see
t2 u · Qu + 2tu · Qv + v · Qv = at2 + 2bt + c
is nonnegative for all t scalar. Thus the parabola at² + 2bt + c intersects the horizontal axis in at most one root. This implies the discriminant b² − ac is not positive, b² − ac ≤ 0, which yields
$$(u \cdot Qv)^2 \le (u \cdot Qu)(v \cdot Qv). \tag{2.5.4}$$
Now we can derive the result. If v is a zero variance direction, then v ·Qv =
0. By (2.5.4), this implies u · Qv = 0 for all u, so Qv = 0, so v is in the null
space of Q. This derivation is valid for any nonnegative matrix Q, not just
variance matrices. Later (§3.2) we will see every nonnegative matrix is the
variance matrix of a dataset.
Based on the above result, here is code that returns zero variance direc-
tions.
N, d = 20, 2
# dxN array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])
def zero_variance(dataset):
    Q = cov(dataset)
    return null_space(Q)
zero_variance(dataset)
(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15), (16, 17, 18, 19, 20).
Thus this dataset is orthogonal to three directions, hence lies in the intersection of three hyperplanes. Each hyperplane is one condition, so each hyperplane cuts the dimension down by one, so the dimension of this dataset is
5 − 3 = 2. Dimension of a dataset is discussed further in §2.9.
2.6 Pseudo-Inverse
What is the pseudo-inverse? In §2.3, we used both the inverse and the pseudo-
inverse to solve Ax = b, but we didn’t explain the framework behind them.
It turns out the framework is best understood geometrically.
Think of b and Ax as points, and measure the distance between them, and
think of x and the origin 0 as points, and measure the distance between them
(Figure 2.2).
[Figure 2.2: the point x is mapped by A to the point Ax; the distance from Ax to b and the distance from x to the origin 0 are measured. Figure 2.3 below shows the row space, null space, and column space, with the points x, Ax, x∗, Ax∗, and x+.]
Fig. 2.3 The points x, Ax, the points x∗ , Ax∗ , and the point x+ .
The results in this section are as follows. Let A be any matrix. There is a
unique matrix A+ — the pseudo-inverse of A — with the following properties.
• the linear system Ax = b is solvable, when b = AA+ b.
• x+ = A+ b is a solution of
1. the linear system Ax = b, if Ax = b is solvable.
2. the regression equation At Ax = At b, always.
• In either case,
At Ax = At b. (2.6.2)
Zero Residual
x + 6y + 11z = −9
2x + 7y + 12z = −3
3x + 8y + 13z = 3 (2.6.3)
4x + 9y + 14z = 9
5x + 10y + 15z = 10
Let b be any vector, not necessarily in the column space of A. To see how
close we can get to solving (2.3.1), we minimize the residual (2.6.1). We say
x∗ is a residual minimizer if
Regression Equation
To see this, let v be any vector, and t a scalar. Insert x = x∗ + tv into the
residual and expand in powers of t to obtain
At (Ax∗ − b) · v = (Ax∗ − b) · Av = 0.
At (Ax∗ − b) = 0,
Multiple Solutions
Since we know from above there is a residual minimizer in the row space
of A, we always have a minimum norm residual minimizer.
Let v be in the null space of A, and write
x∗ · v ≥ 0.
Since both ±v are in the null space of A, this implies ±x∗ · v ≥ 0, hence
x∗ · v = 0. Since the row space is the orthogonal complement of the null
space, the result follows.
Uniqueness
If x1+ and x2+ are minimum norm residual minimizers, then v = x1+ − x2+ is both in the row space and in the null space of A, so x1+ − x2+ = 0. Hence x1+ = x2+.
Putting the above all together, each vector b leads to a unique x+. Defining A+ by setting
x+ = A+ b,
we obtain A+, the pseudo-inverse of A.
Notice if A is, for example, 5 × 4, then Ax = b implies x is a 4-vector and
b is a 5-vector. Then from x = A+ b, it follows A+ is 4 × 5. Thus the shape of
A+ equals the shape of At .
Summarizing what we have so far,
We know any two solutions of the linear system (2.3.1) differ by a vector in
the null space of A (2.4.11), and any two solutions of the regression equation
(2.6.2) differ by a vector in the null space of A (above).
Ax+ = A(x + v) = Ax + Av = b + 0 = b.
This shows x+ is a solution of the linear system. Since all other solutions
differ by a vector v in the null space of A, this establishes the result.
Now we can state when Ax = b is solvable,
Solvability of Ax = b
Properties of Pseudo-Inverse

A. AA+ A = A
B. A+ AA+ = A+
C. AA+ is symmetric
D. A+ A is symmetric
(2.6.8)
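These four properties can be checked numerically with numpy's pinv; a quick sketch on a random matrix:

from numpy import allclose, dot
from numpy.random import random
from numpy.linalg import pinv

A = random((5, 3))
Aplus = pinv(A)
print(allclose(dot(A, dot(Aplus, A)), A))            # A
print(allclose(dot(Aplus, dot(A, Aplus)), Aplus))    # B
print(allclose(dot(A, Aplus), dot(A, Aplus).T))      # C
print(allclose(dot(Aplus, A), dot(Aplus, A).T))      # D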
u = A+ Au + v. (2.6.9)
Au = AA+ Au.
A+ w = A+ AA+ w + v
for some v in the null space of A. But both A+ w and A+ AA+ w are in the
row space of A, hence so is v. Since v is in both the null space and the row
space, v is orthogonal to itself, so v = 0. This implies A+ AA+ w = A+ w.
Since w was any vector, we obtain B.
Since A+ b solves the regression equation, At AA+ b = At b for any vector b.
Hence At AA+ = At . With P = AA+ ,
(x − A+ Ax) · A+ Ay = 0.
x · P y = P x · P y = x · P tP y
Also we have
Exercises
Exercise 2.6.2 Let A(N, d) be as in Exercise 2.4.7, and let A = A(6, 4).
Let b = (1, 1, 1, 1, 1, 1). Write out Ax = b as a linear system. How many
equations, how many unknowns?
Exercise 2.6.4 Continuing with the same A and b, write out the correspond-
ing regression equation. How many equations, how many unknowns?
(At )+ = (A+ )t .
QQ+ = Q+ Q.
Exercise 2.6.11 Let A be any matrix. Then the null space of A equals the
null space of A+ A. Use (2.6.8).
Exercise 2.6.12 Let A be any matrix. Then the row space of A equals the
row space of A+ A.
Exercise 2.6.13 Let A be any matrix. Then the column space of A equals
the column space of AA+ .
2.7 Projections
Let u be a unit vector, and let b be any vector. Let span(u) be the line
through u (Figure 2.4). The projection of b onto span(u) is the vector v in
span(u) that is closest to b.
It turns out this closest vector v equals P b for some matrix P , the projec-
tion matrix. Since span(u) is a line, the projected vector P b is a multiple tu
of u.
From Figure 2.4, b − P b is orthogonal to u, so

0 = (b − P b) · u = b · u − P b · u = b · u − t u · u = b · u − t.

Hence t = b · u and P b = (b · u)u. In particular, P b = b when b
is already on the line. If U is the matrix with the single column u, we obtain
P = U U t.
To summarize, the projected vector is the vector (b · u)u, and the reduced
vector is the scalar b · u. If U is the matrix with the single column u, then the
reduced vector is U t b and the projected vector is U U t b.
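Here is a small numerical sketch of this; the vectors u and b are arbitrary examples chosen only for illustration:

from numpy import array, dot, outer
from numpy.linalg import norm

u = array([3.,4,0]); u = u/norm(u)    # a unit vector
b = array([1.,2,3])

P = outer(u,u)                        # P = U U^t for the single column u
print(dot(u,b))                       # reduced vector, the scalar b.u
print(dot(P,b), dot(u,b)*u)           # projected vector, computed two ways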
(b − P b) · u = 0 and (b − P b) · v = 0.
r = b · u, s = b · v.
Characterization of Projections
To prove this, suppose P is the projection matrix onto some span S. For
any v, by 1., P v is in S. By 2., P (P v) = P v. Hence P 2 = P . Also, for any u
and v, P v is in S, and u − P u is orthogonal to S. Hence
(u − P u) · P v = 0
which implies
u · P v = (P u) · (P v).
Switching u and v,
v · P u = (P v) · (P u),
Hence
u · (P v) = (P u) · v,
t
which implies P = P .
For the other direction, suppose P is a projection matrix, and let S be the
column space of P . Then a vector x is in S iff x is of the form x = P v. This
establishes 1. above. Since
P x = P (P v) = P²v = P v = x,

which establishes 2.
Let A be any matrix. Then the projection matrix onto the column
space of A is
P = AA+ . (2.7.2)
P b = t1 v1 + t2 v2 + · · · + td vd .
from numpy import dot
from numpy.linalg import pinv

def project(A,b):
    Aplus = pinv(A)
    x = dot(Aplus,b)     # reduced
    return dot(A,x)      # projected
Let A be a matrix and b a vector, and project onto the column space
of A. Then the projected vector is P b = AA+ b and the reduced vector
is x = A+ b.
For A as in (2.3.4) and b = (−9, −3, 3, 9, 10), the reduced vector onto the
column space of A is

x = A+ b = (82, 25, −32)/15,

and the projected vector onto the column space of A is

P b = AA+ b = Ax = (−8, −3, 2, 7, 12).
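A sketch reproducing this computation, again assuming A is the coefficient matrix of (2.6.3):

from numpy import array, dot
from numpy.linalg import pinv

A = array([[1.,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
b = array([-9.,-3,3,9,10])

x = dot(pinv(A),b)     # reduced vector A+ b
print(x*15)            # compare with (82, 25, -32)
print(dot(A,x))        # projected vector A A+ b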
Similarly, the projection matrix onto the row space of A is

P = A+ A.    (2.7.3)
from numpy import dot

def project_to_ortho(U,b):
    x = dot(U.T,b)      # reduced
    return dot(U,x)     # projected
dataset:    vk in Rd,      k = 1, 2, . . . , N
reduced:    U t vk in Rn,  k = 1, 2, . . . , N
projected:  U U t vk in Rd, k = 1, 2, . . . , N
# projection of dataset
# onto column space of A
If S is a span in Rd , then
Rd = S ⊕ S ⊥ . (2.7.5)
v = P v + (v − P v),
An important example of (2.7.5) is the relation between the row space and
the null space of a matrix. In §2.4, we saw that, for any matrix A, the row
space and the null space are orthogonal complements.
Taking S to be the null space of A in (2.7.5), we have the important decomposition: if A is an N × d matrix, then

Rd = nullspace(A) ⊕ rowspace(A),    (2.7.6)

and the null space and row space are orthogonal to each other.
From this, the projection matrix onto the null space of A is

P = I − A+ A.    (2.7.7)
For any matrix, the row rank plus the nullity equals the dimension of
the source space. If the matrix is N × d, r is the rank, and n is the
nullity, then
r + n = d.
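This is easy to check numerically; the sketch below assumes A is the 5 × 3 matrix of (2.3.4), for which r = 2 and n = 1:

from numpy import array
from numpy.linalg import matrix_rank
from scipy.linalg import null_space

A = array([[1.,6,11],[2,7,12],[3,8,13],[4,9,14],[5,10,15]])
N, d = A.shape
r = matrix_rank(A)              # rank = 2
n = null_space(A).shape[1]      # nullity = 1
print(r, n, r + n == d)         # 2 1 True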
But this was already done in §2.3, since P b = AA+ b = Ax+ where x+ = A+ b
is a residual minimizer.
Exercises
Exercise 2.7.4 Let P be the projection matrix onto the column space of a
matrix A. Use Exercise 2.7.3 to show trace(P ) equals the rank of A.
Exercise 2.7.6 Let A be the dataset matrix of the centered MNIST dataset,
so the shape of A is 60000 × 784. Using Exercise 2.7.4, show the rank of A
is 712.
Exercise 2.7.9 Let S be a span, and let P be the projection matrix onto S.
Use P to show

(S ⊥)⊥ = S.    (2.7.9)

(S ⊂ (S ⊥)⊥ is easy. For S ⊃ (S ⊥)⊥, show |v − P v|² = 0 when v is in (S ⊥)⊥.)
Exercise 2.7.10 Let S be a span and suppose P and Q are both projection
matrices onto S. Show
(P − Q)2 = 0.
Conclude P = Q. Use Exercise 2.2.4.
2.8 Basis
To clarify this definition, suppose someone asks “Who is the shortest per-
son in the room?” There may be several shortest people in the room, but, no
matter how many shortest people there are, there is only one shortest height.
In other words, a span may have several bases, but a span’s dimension is
uniquely determined.
When a basis v1 , v2 , . . . , vN consists of orthogonal vectors, we say v1 , v2 ,
. . . , vN is an orthogonal basis. When v1 , v2 , . . . , vN are also unit vectors, we
say v1 , v2 , . . . , vN is an orthonormal basis.
Here are two immediate consequences of this terminology.
Span of N Vectors
The dimension of Rd is d.
mu = mean(dataset,axis=0)
vectors = dataset - mu
matrix_rank(vectors)
In particular, since 712 < 784, approximately 10% of pixels are never
touched by any image. For example, a likely pixel to remain untouched is
at the top left corner (0, 0). For this dataset, there are 784 − 712 = 72 zero
variance directions.
We pose the following question: What is the least n for which the first n
images are linearly dependent? Since the dimension of the feature space is
784, we must have n ≤ 784. To answer the question, we compute the rank
of the first n vectors for n = 1, 2, 3, . . . , and continue until we have linear
dependence of v1 , v2 , . . . , vn .
If we load MNIST as dataset, as in §1.2, and run the code below, we
obtain n = 560 (Figure 2.8). matrix_rank is discussed in §2.9.
from numpy.linalg import matrix_rank

def find_first_defect(dataset):
    d = len(dataset[0])
    previous = 0
    for n in range(len(dataset)):
        r = matrix_rank(dataset[:n+1,:])
        print((r,n+1),end=",")
        if r == previous: break
        if r == d: break
        previous = r
This we call the dimension staircase. For example, Figure 2.9 is the di-
mension staircase for
v1 = (1, 0, 0), v2 = (0, 1, 0), v3 = (1, 1, 0), v4 = (3, 4, 0), v5 = (0, 0, 1).
from numpy.linalg import matrix_rank
from matplotlib.pyplot import title, stairs, grid, show

def dimension_staircase(dataset):
    N = len(dataset)
    rmax = matrix_rank(dataset)
    dimensions = [ ]
    for n in range(N):
        r = matrix_rank(dataset[:n+1,:])
        if len(dimensions) and dimensions[-1] < r: print(r," ",end="")
        dimensions.append(r)
        if r == rmax: break
    title("number of vectors = " + str(n+1) + ", rank = " + str(rmax))
    stairs(dimensions, range(1,n+3), linewidth=2, color='red')
    grid()
    show()
span(v1 , v2 , . . . , vN ) = span(v2 , v3 , . . . , vN ).
span(v1 , v2 , . . . , vN ) = span(b1 , b2 , . . . , bd ),
v1 is a linear combination of b1 , b2 , . . . , bd ,
v1 = t1 b1 + t2 b2 + · · · + td bd .
span(v1 , v2 , . . . , vN ) = span(v1 , b2 , b3 , . . . , bd ).
v2 = s1 v1 + t2 b2 + t3 b3 + · · · + td bd .
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , b3 , b4 , . . . , bd ).
This shows
span(v1 , v2 , . . . , vN ) = span(v1 , v2 , v3 , b4 , b5 , . . . , bd ).
span(v1 , v2 , . . . , vN ) = · · · = span(v1 , v2 , . . . , vd ).
2.9 Rank
By (2.4.2), the column space is in the target space, and the row space is
in the source space. Thus we always have
For A as in (2.3.4), the column rank is 2, the row rank is 2, and the nullity
is 1. Thus the column space is a 2-d plane in R5 , the row space is a 2-d plane
in R3 , and the null space is a 1-d line in R3 .
Rank Theorem
Let A be any matrix. Then the row rank of A equals the column rank of A.

In Python, either of

A.rank()
matrix_rank(A)

returns the rank of a matrix (the first is sympy, the second is numpy). The main result implies rank(A) = rank(At),
so
For any N × d matrix, the rank is never greater than min(N, d).
C = CI = CAB = IB = B,
so B = C is the inverse of A.
The first two assertions are in §2.2. For the last assertion, from §2.4, orthonormality of the rows implies linear independence of the rows, so U is full-rank. If U is also a square matrix, then U is invertible. Multiply U U t = I by U −1,

U −1 = U −1 I = U −1 U U t = U t.
A square matrix U satisfying

U U t = I = U t U    (2.9.2)

is an orthogonal matrix.
Equivalently, we can say
Orthogonal Matrix
A matrix U is orthogonal iff its rows are an orthonormal basis iff its
columns are an orthonormal basis.
Since
U u · U v = u · U t U v = u · v,
U preserves dot products. Since lengths are dot products, U also preserves
lengths. Since angles are computed from dot products, U also preserves an-
gles. Summarizing,
As a consequence, if u1, u2, . . . , ud is an orthonormal basis, then

I = u1 ⊗ u1 + u2 ⊗ u2 + · · · + ud ⊗ ud,    (2.9.3)

hence every vector u satisfies

u = (u · u1)u1 + (u · u2)u2 + · · · + (u · ud)ud    (2.9.4)

and

|u|² = (u · u1)² + (u · u2)² + · · · + (u · ud)².    (2.9.5)
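A quick numerical sketch of these expansions, using the columns of a rotation matrix as the orthonormal basis (the vectors here are arbitrary examples):

from numpy import array, cos, sin, pi, dot, allclose

t = pi/6
u1 = array([cos(t), sin(t)])
u2 = array([-sin(t), cos(t)])     # u1, u2: an orthonormal basis of R^2

u = array([3., -2])
expansion = dot(u,u1)*u1 + dot(u,u2)*u2
print(allclose(expansion, u))                            # (2.9.4)
print(allclose(dot(u,u1)**2 + dot(u,u2)**2, dot(u,u)))   # (2.9.5)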
Full-Rank Dataset
A dataset x1 , x2 , . . . , xN is full-rank iff x1 , x2 , . . . , xN spans the
feature space.
To derive the rank theorem, first we recall (2.7.6). Assume A has N rows
and d columns. By (2.7.6), every vector x in the source space Rd can be
written as a sum x = u + v with u in the null space, and v in the row space.
In other words, each vector x may be written as a sum x = u + v with Au = 0
and v in the row space.
From this, we have
Ax = A(u + v) = Au + Av = Av.
This shows the column space consists of vectors of the form Av with v in the
row space.
Let v1 , v2 , . . . , vr be a basis for the row space. From the previous para-
graph, it follows Av1 , Av2 , . . . , Avr spans the column space of A. We claim
Av1, Av2, . . . , Avr are linearly independent. To check this, we write

t1 Av1 + t2 Av2 + · · · + tr Avr = 0.

By linearity, this says A(t1 v1 + t2 v2 + · · · + tr vr) = 0.
If v is the vector t1 v1 + t2 v2 + · · · + tr vr, this shows v is in the null space. But v
is a linear combination of basis vectors of the row space, so v is also in the row
space. Since the row space is the orthogonal complement of the null space, we
must have v orthogonal to itself. Thus v = 0, or t1 v1 + t2 v2 + · · · + tr vr = 0.
But v1 , v2 , . . . , vr is a basis. By linear independence of v1 , v2 , . . . , vr , we
conclude t1 = 0, . . . , tr = 0. This establishes the claim, hence Av1 , Av2 , . . . ,
Avr is a basis for the column space. This shows r is the dimension of the
column space, which is by definition the column rank. Since by construction,
r is also the row rank, this establishes the rank theorem.
Exercises
A = u1 ⊗ v1 + u2 ⊗ v2 + · · · + ur ⊗ vr
How does this compare with the distance between Av1 and Av2 , or |Av1 −
Av2 |?
If we let

u = (v1 − v2) / |v1 − v2|,

then u is a unit vector, |u| = 1, and by linearity

|Au| = |Av1 − Av2| / |v1 − v2|.
Let

σ1 = max |Au|,   σ2 = min |Au|.

Here the maximum and minimum are taken over all unit vectors u.
Then σ1 is the distance of the furthest image from the origin, and σ2 is
the distance of the nearest image to the origin. It turns out σ1 and σ2 are
the top and bottom singular values of A.
To keep things simple, assume both the source space and the target space
are R2 ; then A is 2 × 2.
The unit circle (in red in Figure 3.1) is the set of vectors u satisfying
|u| = 1. The image of the unit circle (also in red in Figure 3.1) is the set of
vectors of the form
{Au : |u| = 1}.
The annulus is the set (the region between the dashed circles in Figure 3.1)
of vectors b satisfying
{b : σ2 < |b| < σ1 }.
It turns out the image is an ellipse, and this ellipse lies in the annulus.
Thus the numbers σ1 and σ2 constrain how far the image of the unit circle
is from the origin, and how near the image is to the origin.
To relate σ1 and σ2 to what we’ve seen before, let Q = At A. Then,
Now let Q = AAt , and let b be in the image. Then b = Au for some unit
vector u, and
This shows the image of the unit circle is the inverse variance ellipse (§1.5)
corresponding to the variance Q, with major axis length 2σ1 and minor axis
length 2σ2 .
These reflect vectors across the horizontal axis, and across the vertical axis.
The SVD decomposition (§3.4) states that every matrix A can be written
as a product

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} = U S V.

Here S is a diagonal matrix as above, and U, V are orthogonal (rotation)
matrices as above.
In more detail, apart from a possible reflection, there are scalings σ1 and
σ2 and angles α and β, so that A transforms vectors by first rotating by α,
then scaling by (σ1 , σ2 ), then by rotating by β (Figure 3.2).
3.2 Eigenvalue Decomposition

In §1.5 and §2.5, we saw every variance matrix is nonnegative. In this section,
we see that every nonnegative matrix Q is the variance matrix of a specific
dataset. This dataset is called the principal components of Q.
Let A be a matrix. An eigenvector for A is a nonzero vector v such that
Av is aligned with v. This means
Av = λv (3.2.1)
A = array([[2,1],[1,2]])
lamda, U = eig(A)
lamda
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda
v · Qv = v · λv = λv · v = λ.
µu · v = u · (µv) = u · Qv = v · Qu = v · (λu) = λu · v.
This implies
(µ − λ)u · v = 0.
If λ ̸= µ, we must have u · v = 0. We conclude:
QU = U E. (3.2.3)
allclose(dot(Q,v), lamda*v)
returns True.
λ1 ≥ λ2 ≥ · · · ≥ λd .
Diagonalization (EVD)
Q = U EV = U EU t . (3.2.4)
When this happens, the rows of V are the eigenvectors of Q, and the
diagonal entries of E are the eigenvalues of Q.
In other words, with the correct choice of orthonormal basis, the matrix
Q becomes a diagonal matrix E.
The orthonormal basis eigenvectors v1 , v2 , . . . , vd are the principal compo-
nents of the matrix Q. The eigenvalues and eigenvectors of Q, taken together,
are the eigendata of Q. The code
Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda, U
returns the eigenvalues [1, 3] and the matrix U = [u, v] with columns

u = (1/√2, −1/√2),   v = (1/√2, 1/√2).
from numpy import array, diag, dot, allclose
from numpy.linalg import eigh

Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
V = U.T
E = diag(lamda)
allclose(Q, dot(U,dot(E,V)))
returns True.
init_printing()
# eigenvalues
Q.eigenvals()
# eigenvectors
Q.eigenvects()
U, E = Q.diagonalize()
rank(Q) = rank(E) = r.
# dataset is Nxd
N, d = dataset.shape
Q = dot(dataset.T,dataset)/N
lamda = eigh(Q)[0]
# nullity estimated by counting near-zero eigenvalues
# (the tolerance built into isclose is a choice)
approx_nullity = sum(isclose(lamda,0))
approx_rank = d - approx_nullity
approx_rank, approx_nullity
This code returns 712 for the MNIST dataset, agreeing with the code in
§2.8.
Q = Matrix([[2,1],[1,2]])
U, E = Q.diagonalize()
display(U,E)
returns

U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix},   E = \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}.
Also,
Q = Matrix([[a,b ],[b,c]])
U, E = Q.diagonalize()
display(Q,U,E)
returns

Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},   U = \begin{pmatrix} \dfrac{a-c-\sqrt{D}}{2b} & \dfrac{a-c+\sqrt{D}}{2b} \\ 1 & 1 \end{pmatrix},

and

E = \frac12 \begin{pmatrix} a+c-\sqrt{D} & 0 \\ 0 & a+c+\sqrt{D} \end{pmatrix},   D = (a − c)² + 4b².
Let E+ be the diagonal matrix with diagonal entries 1/λk when λk ≠ 0, and 0 when λk = 0. Then

Q = U EV  =⇒  Q+ = U E+ V.    (3.2.5)
Qx = b

has a solution x for every vector b iff all eigenvalues are nonzero, in
which case

x = (1/λ1)(b · v1)v1 + (1/λ2)(b · v2)v2 + · · · + (1/λd)(b · vd)vd.    (3.2.6)
trace(Q) = λ1 + λ2 + · · · + λd . (3.2.7)
Q² is symmetric with eigenvalues λ1², λ2², . . . , λd². Applying the last result to
Q², we have
Q = λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd . (3.2.8)
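The decomposition (3.2.8) is easy to check with outer products; a minimal sketch:

from numpy import array, outer, zeros, allclose
from numpy.linalg import eigh

Q = array([[2.,1],[1,2]])
lamda, U = eigh(Q)

S = zeros(Q.shape)
for k in range(len(lamda)):
    v = U[:,k]
    S = S + lamda[k]*outer(v,v)    # lamda_k v_k (x) v_k

print(allclose(S, Q))    # True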
The dataset consisting of the 2d points

±√(dλ1) v1,  ±√(dλ2) v2,  . . . ,  ±√(dλd) vd

is centered and has variance Q.
Let

λ1 = max v · Qv,    (3.2.9)

where the maximum is over all unit vectors v. We say a unit vector b is best-fit
for Q or best-aligned with Q if the maximum is achieved at v = b: λ1 = b · Qb.
When Q is a variance matrix, this means the unit vector b is chosen so that
the variance b · Qb of the dataset projected onto b is maximized.
An eigenvalue λ1 of Q is the top eigenvalue if λ1 ≥ λ for any other eigen-
value. An eigenvalue λ1 of Q is the bottom eigenvalue if λ1 ≤ λ for any other
eigenvalue. We establish the following results.
A Calculation
Suppose λ, a, b, c, d are real numbers and suppose we know
(λ + at + bt²) / (1 + ct + dt²) ≤ λ,   for all t real.
Then a = λc.
λ1 ≥ v · Qv = v · (λv) = λv · v = λ.
λ1 = v1 · Qv1 ≥ v · Qv (3.2.10)
for all unit vectors v. Let u be any vector. Then for any real t,
v = (v1 + tu) / |v1 + tu|
u · Qv1 = λ1 u · v1
u · (Qv1 − λ1 v1 ) = 0
Just as the maximum variance (3.2.9) is the top eigenvalue λ1, the minimum variance

λd = min_{|v|=1} v · Qv    (3.2.11)

is the bottom eigenvalue.
Having found λ1 and its unit eigenvector v1, let T = span(v1)⊥, and maximize
v · Qv over all unit v in T, i.e. over all unit v orthogonal to v1. This leads to
another eigenvalue λ2 with corresponding eigenvector v2 orthogonal to v1 .
Since λ1 is the maximum of v · Qv over all vectors in Rd , and λ2 is the
maximum of v · Qv over the restricted space T of vectors orthogonal to v1 ,
we must have λ1 ≥ λ2 .
Having found the top two eigenvalues λ1 ≥ λ2 and their orthonormal
eigenvectors v1 , v2 , we let S = span(v1 , v2 ) and T = S ⊥ be the orthogonal
complement of S. Then dim(T ) = d − 2, and we can repeat the process to
obtain λ3 and v3 in T . Continuing in this manner, we obtain eigenvalues
λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λd

and corresponding orthonormal eigenvectors

v1, v2, v3, . . . , vd.
Given λ, we call

Sλ = {v : Qv = λv}

the eigenspace corresponding to λ. For example, suppose the top three eigenvalues are equal: λ1 = λ2 = λ3, with b1, b2, b3 the corresponding eigenvectors.
Calling this common value λ, the eigenspace is Sλ = span(b1, b2, b3). Since
b1, b2, b3 are orthonormal, dim(Sλ) = 3. In Python, the eigenspaces Sλ are
obtained from the matrix U above: The columns of U are an orthonormal basis
for the entire space, so selecting the columns corresponding to a specific λ
yields an orthonormal basis for Sλ.
Let (lamda,U) be the list of eigenvalues and matrix U whose columns are
the eigenvectors. Then the eigenvectors are the rows of U t . Here is code for
selecting just the eigenvectors corresponding to eigenvalue s.
lamda, U = eigh(Q)
V = U.T
V[isclose(lamda,s)]
The function isclose(a,b) returns True when a and b are numerically close.
Using this boolean, we extract only those rows of V whose corresponding
eigenvalue is close to s.
The subspace Sλ is defined for any λ. However, dim(Sλ ) = 0 unless λ is
an eigenvalue, in which case dim(Sλ ) = m, where m is the multiplicity of λ.
The proof of the eigenvalue decomposition is a systematic procedure for
finding eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd . Now we show there are no other
eigenvalues.
All this can be readily computed in Python. For the Iris dataset, we have
the variance matrix in (2.2.15). The eigenvalues are
4.54 = trace(Q) = λ1 + λ2 + λ3 + λ4 .
For the Iris dataset, the top eigenvalue is λ1 = 4.2, it has multiplicity 1, and
its corresponding list of eigenvectors contains only one eigenvector,
The top two eigenvalues account for 97.8% of the total variance.
The third eigenvalue is λ3 = 0.08 with eigenvector
The top three eigenvalues account for 99.5% of the total variance.
The fourth eigenvalue is λ4 = 0.02 with eigenvector
The top four eigenvalues account for 100% of the total variance. Here each
eigenvalue has multiplicity 1, since there are four distinct eigenvalues.
def row(i,d):
    v = [0]*d
    v[i] = 2
    if i > 0: v[i-1] = -1
    if i < d-1: v[i+1] = -1
    if i == 0: v[d-1] += -1
    if i == d-1: v[0] += -1
    return v
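Assembling Q(d) from these rows and checking its eigenvalues against the formula derived below; the helper Q(d) defined here is a convenience (also used by the plotting code further down) and is an assumption about how the book builds the matrix:

from numpy import array, arange, cos, pi, sort, allclose
from numpy.linalg import eigh

def Q(d):
    # circulant matrix assembled from row(i,d)
    return array([ row(i,d) for i in range(d) ], dtype=float)

d = 6
lamda = eigh(Q(d))[0]
k = arange(d)
formula = sort(2 - 2*cos(2*pi*k/d))
print(allclose(lamda, formula))    # True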
# using sympy
from sympy import Matrix
# using numpy
from numpy import *
To explain where these matrices come from, look at the mass-spring sys-
tems in Figures 3.6 and 3.7. Here we have springs attached to masses and
walls on either side. At rest, the springs are the same length. When per-
turbed, some springs are compressed and some stretched. In Figure 3.6, let
x1 and x2 denote the displacement of each mass from its rest position.
When extended by x, each spring fights back by exerting a force kx pro-
portional to the displacement x. Here k is the spring constant. For example,
look at the mass m1 . The spring to its left is extended by x1 , so exerts a force
of −kx1 . Here the minus indicates pulling to the left. On the other hand, the
spring to its right is extended by x2 − x1 , so it exerts a force +k(x2 − x1 ).
Here the plus indicates pulling to the right. Adding the forces from either
side, the total force on m1 is −k(2x1 − x2 ). For m2 , the spring to its left
exerts a force −k(x2 − x1 ), and the spring to its right exerts a force −kx2 ,
so the total force on m2 is −k(2x2 − x1 ). We obtain the force vector
−k \begin{pmatrix} 2x1 − x2 \\ −x1 + 2x2 \end{pmatrix} = −k \begin{pmatrix} 2 & −1 \\ −1 & 2 \end{pmatrix} \begin{pmatrix} x1 \\ x2 \end{pmatrix}.
However, as you can see, the matrix here is not exactly Q(2).
But, again, the matrix here is not Q(5). Notice, if we place one mass and two
springs in Figure 3.6, we obtain the 1 × 1 matrix 2.
To obtain Q(2) and Q(5), we place the springs along a circle, as in Figures
3.8 and 3.9. Now we have as many springs as masses. Repeating the same
logic, this time we obtain Q(2) and Q(5). Notice if we place one mass and
one spring in Figure 3.8, d = 1, we obtain the 1 × 1 matrix Q(1) = 0: There
is no force if we move a single mass around the circle, because the spring is
not being stretched.
Let ω be a d-th root of unity, let

p(t) = 2 − t − t^{d−1},

and let

v1 = (1, ω, ω², ω³, . . . , ω^{d−1}).

Then

Qv1 = (2 − ω − ω^{d−1}, −1 + 2ω − ω², −ω + 2ω² − ω³, . . . , −ω^{d−2} + 2ω^{d−1} − 1) = p(ω) v1.
Then
v0 = 1 = (1, 1, . . . , 1),
and, by the same calculation, we have
By (A.4.9),
Eigenvalues of Q(d)
Q(2) = (4, 0)
Q(3) = (3, 3, 0)
Q(4) = (4, 2, 2, 0)
Q(5) = (5/2 + √5/2, 5/2 + √5/2, 5/2 − √5/2, 5/2 − √5/2, 0)
Q(6) = (4, 3, 3, 1, 1, 0)
Q(8) = (4, 2 + √2, 2 + √2, 2, 2, 2 − √2, 2 − √2, 0)
Q(10) = (4, 5/2 + √5/2, 5/2 + √5/2, 3/2 + √5/2, 3/2 + √5/2, 5/2 − √5/2, 5/2 − √5/2, 3/2 − √5/2, 3/2 − √5/2, 0)
Q(12) = (4, 2 + √3, 2 + √3, 3, 3, 2, 2, 1, 1, 2 − √3, 2 − √3, 0).
The matrices Q(d) are circulant matrices. Each row in Q(d) is obtained
from the row above it in Q(d) by shifting the entries to the right. The trick of
using the roots of unity to compute the eigenvalues and eigenvectors works
for any circulant matrix.
Our last topic is the distribution of the eigenvalues for large d. How are
the eigenvalues scattered? Figure 3.10 plots the eigenvalues for Q(50) using
the code below.
d = 50
lamda = eigh(Q(d))[0]
stairs(lamda,range(d+1),label="numpy")
k = arange(d)
lamda = 2 - 2*cos(2*pi*k/d)
sorted = sort(lamda)
scatter(k,lamda,s=5,label="unordered")
scatter(k,sorted,c="red",s=5,label="increasing order")
grid()
legend()
show()
Figure 3.10 shows the eigenvalues tend to cluster near the top λ1 ≈ 4 and
the bottom λd = 0; they are sparser near the middle. Using the double-angle
formula,

λk = 4 sin²(πk/d),   k = 0, 1, 2, . . . , d − 1.
Solving for k/d in terms of λ, and multiplying by two to account for the
double multiplicity, we obtain the proportion of eigenvalues below threshold
λ,
#{k : λk ≤ λ} / d ≈ (2/π) arcsin(√λ/2),   0 ≤ λ ≤ 4.    (3.2.15)
Here ≈ means asymptotic equality, see §A.6.
Equivalently, the derivative (4.1.23) of the arcsine law (3.2.15) exhibits the
eigenvalue clustering near the ends (Figure 3.11).
lamda = arange(0.1,3.9,.01)
density = 1/(pi*sqrt(lamda*(4-lamda)))
plot(lamda,density)
# r"..." means raw string
tex = r"$\displaystyle\frac1{\pi\sqrt{\lambda(4-\lambda)}}$"
text(.5,.45,tex,usetex=True,fontsize="x-large")
grid()
show()
This clustering of the eigenvalues near the top and bottom is valid for a wide class of matrices, not just Q(d), as the matrix size
d grows without bound, d → ∞.
Exercises
λ2 − tλ + d = 0,
d       4 · trace(Q(d)+)
4       4 + 1
16      (4 + 1)(16 + 1)
256     (4 + 1)(16 + 1)(256 + 1)
3.3 Graphs
in the previous sections. Since graph theory is the start of neural networks,
we study it here.
A graph consists of nodes and edges. For example, the graphs in Figure
3.13 each have four nodes and three edges. The left graph is directed, in that
a direction is specified for each edge. The graph on the right is undirected, no
direction is specified.
Let wij be the weight on the edge (i, j) in a weighted directed graph. The
weight matrix of a weighted directed graph is the matrix W = (wij).
If the graph is unweighted, then we set A = (aij), where

aij = 1 if i and j are adjacent, and aij = 0 if not.
In this case, A consists of ones and zeros, and is called the adjacency matrix.
If the graph is also undirected, then the adjacency matrix is symmetric,
aij = aji .
Sometimes graphs may have multiple edges between nodes, or loops, which
are edges starting and ending at the same node. A graph is simple if it has
no loops and no multiple edges. In this section, we deal only with simple
undirected unweighted graphs.
To summarize, a simple undirected graph G = (V, E) is a collection V
of nodes, and a collection of edges E, each edge corresponding to a pair of
nodes.
The number of nodes is the order n of the graph, and the number of edges
is the size m of the graph. In a (simple undirected) graph of order n, the
number of pairs of nodes is n-choose-2, so the number of edges satisfies
0 ≤ m ≤ \binom{n}{2} = n(n − 1)/2.
How many graphs of order n are there? Since graphs are built out of
edges, the answer depends on how many subsets of edges you can grab from
a maximum of n(n − 1)/2 edges. The number of subsets of a set with m
elements is 2m , so the number Gn of graphs with n nodes is
Gn = 2^{\binom{n}{2}} = 2^{n(n−1)/2}.
When m = 0, there are no edges, and we say the graph is empty. When
m = n(n − 1)/2, there are the maximum number of edges, and we say the
graph is complete. The complete graph with n nodes is written Kn (Figure
3.16).
The cycle graph Cn with n nodes is as in Figure 3.16. The graph Cn has
n edges. The cycle graph C3 is a triangle.
d1 ≥ d2 ≥ d3 ≥ · · · ≥ dn
(d1 , d2 , d3 , . . . , dn )
Handshaking Lemma

In any graph, the sum of the node degrees equals twice the number of edges,

d1 + d2 + · · · + dn = 2m.

Here is another basic fact: In any graph, there are at least two nodes with the same degree.
To see this, we consider two cases. First case, assume there are no isolated
nodes. Then the degree sequence is
n − 1 ≥ d1 ≥ d2 ≥ · · · ≥ dn ≥ 1.
n − 2 ≥ d1 ≥ d2 ≥ . . . dn−1 ≥ 1.
A graph is regular if all the node degrees are equal. If the node degrees are
all equal to k, we say the graph is k-regular. From the handshaking lemma,
for a k-regular graph, we have kn = 2m, so
m = (1/2) k n.
For example, because 2m is even, there are no 3-regular graphs with 11 nodes.
Both Kn and Cn are regular, with Kn being (n − 1)-regular, and Cn being
2-regular.
A walk on a graph is a sequence of nodes v1 , v2 , v3 , . . . where each
consecutive pair vi , vi+1 of nodes are adjacent. For example, if v1 , v2 , v3 ,
v4 , v5 , v6 are the nodes (in any order) of the complete graph K6 , then
v1 → v2 → v3 → v4 → v2 is a walk. A path is a walk with no backtracking: A
path visits each node at most once. A closed walk is a walk that ends where
it starts. A cycle is a closed walk with no backtracking.
Two nodes a and b are connected if there is a walk starting at a and ending
at b. If a and b are connected, then there is a path starting at a and ending
at b, since we can cut out the cycles of the walk. A graph is connected if every
two nodes are connected. A graph is disconnected if it is not connected. For
example, Figure 3.16 may be viewed as two connected graphs K6 and C6 , or
a single disconnected graph K6 ∪ C6 .
For example, the empty graph has adjacency matrix given by the zero matrix.
Since our graphs are undirected, the adjacency matrix is symmetric.
Let 1 be the vector 1 = (1, 1, 1, . . . , 1). The adjacency matrix of the com-
plete graph Kn is the n × n matrix A with all ones except on the diagonal.
If I is the n × n identity matrix, then this adjacency matrix is
A=1⊗1−I
For the cycle graph Cn, the adjacency matrix has ones on the sub-diagonal, ones on the super-diagonal,
and ones in the upper-right and lower-left corners.
For any adjacency matrix A, the sum of each row is equal to the degree of
the node corresponding to that row. This is the same as saying
A1 = (d1, d2, . . . , dn).
A1 = k1,
v1 = 1 ≥ |vj |, j = 2, 3, . . . , n.
Since the sum a11 + a12 + · · · + a1n equals the degree d1 of node 1, this implies
Top Eigenvalue
A1 = (1 · 1)1 − 1 = n1 − 1 = (n − 1)1,
For the cycle graph Cn, the eigenvalues of the adjacency matrix turn out to be 2 cos(2πk/n), k = 0, 1, 2, . . . , n − 1.
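A sketch checking this for C6 (the adjacency matrix is built directly here):

from numpy import zeros, arange, cos, pi, sort, allclose
from numpy.linalg import eigvalsh

n = 6
A = zeros((n,n))
for i in range(n):
    A[i,(i+1)%n] = A[i,(i-1)%n] = 1    # cycle graph C_n

k = arange(n)
print(allclose(sort(eigvalsh(A)), sort(2*cos(2*pi*k/n))))    # True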
The complement graph Ḡ has adjacency matrix Ā = A(Ḡ) = 1 ⊗ 1 − I − A(G).
Now aik akj is either 0 or 1, and equals 1 exactly when there is a 2-step walk from
i to k to j. Hence (A²)ij, the sum of aik akj over k, is the number of 2-step walks between i and j.
Notice a 2-step walk between distinct i and j is the same as a 2-step path between i
and j.
When i = j, (A²)ii is the number of 2-step walks connecting i to i, which
is the number of edges at i, the degree di. Summing the degrees counts each edge twice, so we have

(1/2) trace(A²) = m = number of edges.
Similarly, (A3 )ij is the number of 3-step walks connecting i and j. Since
a 3-step walk from i to i is the same as a triangle, (A3 )ii is the number
of triangles in the graph passing through i. Since the trace is the sum of
the diagonal elements, trace(A3 ) counts the number of triangles. But this
overcounts by a factor of 3! = 6, since three labels may be rearranged in six
ways. Hence
(1/6) trace(A³) = number of triangles.
Loops, Edges, Triangles
This is correct because for a complete graph, n(n − 1)/2 is the number of
edges.
Continuing,
Connected Graph
Hence P is orthogonal,
P P t = I, P −1 = P t .
Using permutation matrices, we can say two graphs are isomorphic if their
adjacency matrices A, A′ satisfy

A′ = P AP −1 = P AP t

for some permutation matrix P.
A graph is bipartite if the nodes can be divided into two groups, with
adjacency only between nodes across groups. If we call the two groups even
and odd, then odd nodes are never adjacent to odd nodes, and even nodes
are never adjacent to even nodes.
The complete bipartite graph is the bipartite graph with the maximum number of edges: Every odd node is adjacent to every even node. The complete
bipartite graph with n odd nodes and m even nodes is written Knm. The
order of Knm is n + m.
Recall we have
(a ⊗ b)v = (b · v)a.
From this, we see the column space of A = a⊗b+b⊗a is span(a, b). Thus the
rank of A is 2, and the nullspace of A consists of the orthogonal complement
span(a, b)⊥ of span(a, b). Using this, we compute the eigenvalues of A.
Since the nullspace is span(a, b)⊥ , any vector orthogonal to a and to b is an
eigenvector for λ = 0. Hence the eigenvalue λ = 0 has multiplicity n + m − 2.
Since trace(A) = 0, the sum of the eigenvalues is zero, and the remaining two
eigenvalues are ±λ ̸= 0.
Let v be an eigenvector for λ ̸= 0. Because eigenvectors corresponding
to distinct eigenvalues of a symmetric matrix are orthogonal (see §3.2), v is
orthogonal to the nullspace of A, so v must be a linear combination of a and
b, v = ra + sb. Since a · b = 0,
Aa = nb, Ab = ma.
Hence
λv = Av = A(ra + sb) = rnb + sma.
Applying A again,
For example, for the graph in Figure 3.19, the nonzero eigenvalues are λ = ±√(3 × 5) = ±√15.
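A sketch confirming this for the complete bipartite graph K35, building its adjacency matrix directly in block form:

from numpy import zeros, ones, sqrt
from numpy.linalg import eigvalsh

n, m = 3, 5
A = zeros((n+m, n+m))
A[:n, n:] = ones((n,m))     # every odd node adjacent to every even node
A[n:, :n] = ones((m,n))

lamda = eigvalsh(A)
print(lamda.round(3))       # nonzero eigenvalues are +/- sqrt(15)
print(sqrt(n*m))            # 3.872...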
L = B t B.
Both the laplacian matrix and the adjacency matrix are n × n. What is the
connection between them?
Laplacian

The laplacian matrix is the degree matrix minus the adjacency matrix, L = D − A.

For example, for the cycle graph C6, the degree matrix is 2I, and the
laplacian is the matrix we saw in §3.2,
L = Q(6) = \begin{pmatrix}
 2 & −1 &  0 &  0 &  0 & −1 \\
−1 &  2 & −1 &  0 &  0 &  0 \\
 0 & −1 &  2 & −1 &  0 &  0 \\
 0 &  0 & −1 &  2 & −1 &  0 \\
 0 &  0 &  0 & −1 &  2 & −1 \\
−1 &  0 &  0 &  0 & −1 &  2
\end{pmatrix}.
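A sketch checking L = D − A for C6 against this matrix (the adjacency matrix is built directly, as above):

from numpy import zeros, diag, allclose

n = 6
A = zeros((n,n))
for i in range(n):
    A[i,(i+1)%n] = A[i,(i-1)%n] = 1     # adjacency of C_6

D = diag(A.sum(axis=1))                  # degree matrix, here 2I
L = D - A                                # laplacian
Q6 = [[ 2,-1, 0, 0, 0,-1],
      [-1, 2,-1, 0, 0, 0],
      [ 0,-1, 2,-1, 0, 0],
      [ 0, 0,-1, 2,-1, 0],
      [ 0, 0, 0,-1, 2,-1],
      [-1, 0, 0, 0,-1, 2]]
print(allclose(L, Q6))    # True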
3.4 Singular Value Decomposition

Let A be any matrix. A number σ > 0 is a singular value of A if there are nonzero vectors u and v satisfying

Av = σu,   At u = σv.    (3.4.1)

When this happens, v is a right singular vector and u is a left singular vector
associated to σ.
When (3.4.1) holds, so does
The singular values of A and the singular values of At are the same.
Then Av = λv implies λ = 1 and v = (1, 0). Thus A has only one eigenvalue
equal to 1, and only one eigenvector. Set
Q = At A = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix},

0 = det(Q − λI) = λ² − 3λ + 1.
Qv = At Av = At (σu) = σ 2 v. (3.4.2)
Thus v1 , u1 are right and left singular vectors corresponding to the singular
value σ1 of A. Similarly, if we set u2 = Av2 /σ2 , then v2 , u2 are right and left
singular vectors corresponding to the singular value σ2 of A.
We show v1 , v2 are orthonormal, and u1 , u2 are orthonormal. We already
know v1 , v2 are orthonormal, because they are orthonormal eigenvectors of
the symmetric matrix Q. Also
0 = λ1 v1 ·v2 = Qv1 ·v2 = (At Av1 )·v2 = (Av1 )·(Av2 ) = σ1 u1 ·σ2 u2 = σ1 σ2 u1 ·u2 .
A Versus Q = At A
Since the rank equals the dimension of the row space, the first part follows
from §2.4. If Av = σu and At u = σv, then

Qv = At Av = At (σu) = σAt u = σ²v,

so λ = σ² is an eigenvalue of Q.
Conversely, if Qv = λv, then λ ≥ 0, so there are two cases. If λ > 0, set
σ = √λ and u = Av/σ. Then
Let A be any matrix, and let r be the rank of A. Then there are
r positive singular values σk , an orthonormal basis uk of the target
space, and an orthonormal basis vk of the source space, such that
Avk = σk uk , At uk = σk vk , k ≤ r, (3.4.3)
and
Avk = 0, At uk = 0 for k > r. (3.4.4)
Taken together, (3.4.3) and (3.4.4) say the number of positive singular
values is exactly r. Assume A is N × d, and let p = min(N, d) be the lesser
of N and d.
Since (3.4.4) holds as long as there are vectors uk and vk , there are p − r
zero singular values. Hence there are p = min(N, d) singular values altogether.
The proof of the result is very simple once we remember the rank of Q
equals the number of positive eigenvalues of Q. By the eigenvalue decom-
position, there is an orthonormal basis vk of the source space and positive
eigenvalues λk such that Qvk = λk vk, k ≤ r, and Qvk = 0, k > r.
Setting σk = √λk and uk = Avk /σk, k ≤ r, as in our first example, we
have (3.4.3), and, again as in our first example, uk , k ≤ r, are orthonormal.
By construction, vk , k > r, is an orthonormal basis for the null space of
A, and uk , k ≤ r, is an orthonormal basis for the column space of A.
Choose uk , k > r, any orthonormal basis for the nullspace of At . Since
the column space of A is the row space of At , the column space of A is the
orthogonal complement of the nullspace of At (2.7.6). Hence uk , k ≤ r, and
uk , k > r, are orthogonal. From this, uk , k ≤ r, together with uk , k > r,
form an orthonormal basis for the target space.
For our second example, let a and b be nonzero vectors, possibly of different
sizes, and let A be the matrix
A = a ⊗ b, At = b ⊗ a.
Thus there is only one positive singular value of A, equal to |a| |b|. All other
singular values are zero. This is not surprising since the rank of A is one.
Now think of the vector b as a single-row matrix B. Then, in a similar
manner, one sees the only positive singular value of B is σ = |b|.
Our third example is

A = \begin{pmatrix} 0&0&0&0 \\ 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \end{pmatrix}.    (3.4.5)

Then

At = \begin{pmatrix} 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 0&0&0&0 \end{pmatrix},   Q = At A = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&0 \end{pmatrix}.

Since Q is diagonal and symmetric, its rank is 3, its eigenvalues are λ1 = 1,
λ2 = 1, λ3 = 1, λ4 = 0, and its eigenvectors are

v1 = (1, 0, 0, 0),  v2 = (0, 1, 0, 0),  v3 = (0, 0, 1, 0),  v4 = (0, 0, 0, 1).
AV t = U S.
Diagonalization (SVD)
Let A be any matrix. Then there is a diagonal matrix S, with the singular values of A on the diagonal, and orthogonal matrices U and V, satisfying

A = U SV.
The rows of V are an orthonormal basis of right singular vectors, and
the columns of U are an orthonormal basis of left singular vectors.
from numpy import zeros, diag
from numpy.linalg import svd

U, sigma, V = svd(A)
# sigma is a vector of singular values
p = min(A.shape)
S = zeros(A.shape)
S[:p,:p] = diag(sigma)
print(U.shape, S.shape, V.shape)
print(U, S, V)
Given the relation between the singular values of A and the eigenvalues of
Q = At A, we also can conclude
For example, if dataset is the Iris dataset (ignoring the labels), the code
# center dataset
m = mean(dataset,axis=0)
A = dataset - m
# rows of V are right
# singular vectors of A
V = svd(A)[2]
# Q = A^t A has the same eigenvectors as the variance
Q = dot(A.T,A)
# columns of U are
# eigenvectors of Q
U = eigh(Q)[1]
# compare columns of U
# and rows of V
U, V
returns
0.36 −0.66 −0.58 0.32 0.36 −0.08 0.86 0.36
−0.08 −0.73 0.6 −0.32 −0.66 −0.73 0.18 0.07
0.86 0.18 0.07 −0.48 , V = 0.58 −0.6 −0.07 −0.55
U =
This shows the columns of U are identical to the rows of V , except for the
third column of U , which is the negative of the third row of V .
Exercises
Exercise 3.4.1 Let b be a vector and let B be the matrix with the single
row b. Show σ = |b| is the only positive singular value.
3.5 Principal Component Analysis

Let Q be the variance matrix of a dataset, with eigenvalues

λ1 ≥ λ2 ≥ · · · ≥ λd

and orthonormal eigenvectors v1, v2, . . . , vd,

Qvk = λk vk,   k = 1, . . . , d.

In PCA one takes the most significant components, those components whose
eigenvalues are near the top eigenvalue. For example, one can take the top two
eigenvalues λ1 ≥ λ2 and their eigenvectors v1 , v2 , and project the dataset onto
the plane span(v1 , v2 ). The projected dataset can then be visualized as points
in the plane. Similarly, one can take the top three eigenvalues λ1 ≥ λ2 ≥
λ3 and their eigenvectors v1 , v2 , v3 and project the dataset onto the space
span(v1 , v2 , v3 ). This can then be visualized as points in three dimensions.
Recall the MNIST dataset consists of N = 60000 points in d = 784 di-
mensions. After we download the dataset,
mnist = read_csv("mnist.csv").to_numpy()
dataset = mnist[:,1:]
labels = mnist[:,0]
The left column in Figure 3.20 lists the top twenty eigenvalues as a per-
centage of their sum. For example, the top eigenvalue λ1 is around 10% of the
total variance. The right column lists the cumulative sums of the eigenvalues,
so the third entry in the right column is the sum of the top three eigenvalues,
λ1 + λ2 + λ3 = 22.97%.
This results in Figures 3.20 and 3.21. Here we sort the array eig in
decreasing order, then we cumsum the array to obtain the cumulative sums.
Because the rank of the MNIST dataset is 712 (§2.9), the bottom 72 =
784 − 712 eigenvalues are exactly zero. A full listing shows that many more
eigenvalues are near zero, and the second column in Figure 3.20 shows the
top ten eigenvalues alone sum to almost 50% of the total variance.
Q = cov(dataset.T)
totvar = Q.trace()
# eigenvalues in decreasing order, as a percent of the total variance
lamda = eigh(Q)[0][::-1]
percent = 100 * lamda / totvar
# cumulative sums
sums = cumsum(percent)
data = array([percent,sums])
print(data.T[:20].round(decimals=3))
d = len(lamda)
from matplotlib.pyplot import stairs
stairs(percent,range(d+1))
def pca(dataset,n):
    Q = cov(dataset.T)
    # columns of U are
    # eigenvectors of Q
    lamda, U = eigh(Q)
    # decreasing eigenvalue sort
    order = lamda.argsort()[::-1]
    # sorted top n columns of U
    # are the columns of V
    V = U[:,order[:n]]
    P = dot(V,V.T)
    return P
In the code, lamda is sorted in decreasing order, and the sorting order is
saved as order. To obtain the top n eigenvectors, we sort the first n columns
U[:,order[:n]] in the same order, resulting in the d×n matrix V . The code
then returns the projection matrix P = V V t (2.7.4).
Instead of working with the variance Q, as discussed at the start of the
section, we can work directly with the dataset, using svd, to obtain the
eigenvectors.
def pca_with_svd(dataset,n):
    # center dataset
    m = mean(dataset,axis=0)
    vectors = dataset - m
    # rows of V are
    # right singular vectors
    V = svd(vectors)[2]
    # no need to sort, already in decreasing order
    U = V[:n].T    # top n rows as columns
    P = dot(U,U.T)
    return P
Let v = dataset[1] be the second image in the MNIST dataset, and let
Q be the variance of the dataset. Then the code below returns the image
compressed down to n = 784, 600, 350, 150, 50, 10, 1 dimensions, returning
Figure 1.4.
figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4
If you run out of memory trying this code, cut down the dataset from
60,000 points to 10,000 points or fewer. The code works with pca or with
pca_with_svd.
We now show how to project a vector v in the dataset using sklearn. The
following code sets up the PCA engine using sklearn.
from sklearn.decomposition import PCA

N = len(dataset)
n = 10
engine = PCA(n_components = n)
reduced = engine.fit_transform(dataset)
reduced.shape
and returns (N, n) = (60000, 10). The following code computes the projected
dataset
projected = engine.inverse_transform(reduced)
projected.shape
figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4
Fig. 3.22 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.
Now we project all vectors of the MNIST dataset onto two and three
dimensions, those corresponding to the top two or three eigenvalues. To start,
we compute reduced as above with n = 3, the top three components.
In the two-dimensional plotting code below, reduced is an array of shape
(60000,3), but we use only the top two components 0 and 1. When the
rows are plotted as a scatterplot, we obtain Figure 3.23. Note the rows are
plotted grouped by color, to match the legend, and each plot point’s color is
determined by the value of its label.
grid()
legend(loc='upper right')
show()
grid()
legend(loc='upper right')
show()
%matplotlib ipympl
from matplotlib.pyplot import *
from mpl_toolkits import mplot3d
ax = axes(projection='3d')
ax.set_axis_off()
legend(loc='upper right')
show()
The three dimensional plot of the complete MNIST dataset is Figure 1.5
in §1.2. The command %matplotlib notebook allows the figure to be rotated
and scaled.
3.6 Cluster Analysis

The sklearn package contains clustering routines, but here we write the
code from scratch to illustrate the ideas. Here is an animated gif illustrating
the convergence of the algorithm.
Assume the means are given as a list of length k,
such that
from numpy import mean
from numpy.linalg import norm

def nearest_index(x,means):
    i = 0
    for j,m in enumerate(means):
        n = means[i]
        if norm(x - m) < norm(x - n): i = j
    return i

def assign_clusters(dataset,means):
    clusters = [ [ ] for m in means ]
    for x in dataset:
        i = nearest_index(x,means)
        clusters[i].append(x)
    return [ c for c in clusters if len(c) > 0 ]

def update_means(clusters):
    return [ mean(c,axis=0) for c in clusters ]
from numpy import array
from numpy.random import random

d = 2
k, N = 7, 100

def random_vector(d):
    return array([ random() for _ in range(d) ])

close_enough = False
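The iteration itself did not survive extraction here; below is a minimal sketch of the loop, under the assumptions that the means start at k random vectors, the dataset is N random vectors, and close_enough tests whether the means stopped moving (the tolerance is a choice, not taken from the text).

from numpy.linalg import norm

means = [ random_vector(d) for _ in range(k) ]
dataset = [ random_vector(d) for _ in range(N) ]

while not close_enough:
    clusters = assign_clusters(dataset, means)
    new_means = update_means(clusters)
    # stop when no mean moves more than a small tolerance
    close_enough = ( len(new_means) == len(means)
                     and all(norm(m - n) < 1e-6
                             for m, n in zip(means, new_means)) )
    means = new_means
    print([ len(c) for c in clusters ])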
This code prints the sizes of the clusters after each iteration. Here is code
that plots a cluster.
from matplotlib.pyplot import scatter

def plot_cluster(mean,cluster,color,marker):
    for v in cluster:
        scatter(v[0],v[1], s=50, c=color, marker=marker)
    scatter(mean[0], mean[1], s=100, c=color, marker='*')
d = 2
k,N = 7,100
def random_vector(d):
return array([ random() for _ in range(d) ])
close_enough = False
figure(figsize=(4,4))
grid()
The material in this chapter lays the groundwork for Chapter 7. It assumes
the reader has some prior exposure, and the first section quickly reviews
basic material essential for our purposes. Nevertheless, the overarching role
of convexity is emphasized repeatedly, both in the single-variable and multi-
variable case.
The chain rule is treated extensively, in both interpretations, combinato-
rial (back-propagation) and geometric (time-derivatives). Both are crucial for
neural network training in Chapter 7.
Because it is used infrequently in the text, integration is treated separately
in an appendix (§A.5).
Even though parts of §4.5 are heavy-going, the material is necessary for
Chapter 7. Nevertheless, for a first pass, the reader should feel free to skim
this material and come back to it after the need is made clear.
4.1 Single-Variable Calculus

Definition of Derivative
The derivative of f (x) at the point a is the slope of the line tangent
to the graph of f (x) at a.
Since a constant function f (x) = c is a line with slope zero, the derivative
of a constant is zero. Since f (x) = mx+b is a line with slope m, its derivative
is m.
Since the tangent line at a passes through the point (a, f (a)), and its slope
is f ′ (a), the equation of the tangent line at a is
Using these properties, we determine the formula for f ′ (a). Suppose the
derivative is bounded between two extremes m and L at every point x in an
interval [a, b], say
m ≤ f ′ (x) ≤ L, a ≤ x ≤ b.
Then by A, the derivative of h(x) = f (x)−mx at x equals h′ (x) = f ′ (x)−m.
By assumption, h′ (x) ≥ 0 on [a, b], so, by B, h(b) ≥ h(a). Since h(a) =
f (a) − ma and h(b) = f (b) − mb, this leads to
(f(b) − f(a)) / (b − a) ≥ m.
Repeating this same argument with f (x) − Lx, and using C, leads to
(f(b) − f(a)) / (b − a) ≤ L.
We have shown

m ≤ (f(b) − f(a)) / (b − a) ≤ L.    (4.1.1)
Derivative Formula
f ′(a) = lim_{x→a} (f(x) − f(a)) / (x − a).    (4.1.3)
dy/dx = (dy/du) · (du/dx).
To visualize the chain rule, suppose

u = f(x) = sin x,   y = g(u) = u².

Suppose x = π/4. Then u = sin(π/4) = 1/√2, and y = u² = 1/2. Since

dy/du = 2u = 2/√2,   du/dx = cos x = 1/√2,

by the chain rule,

dy/dx = (dy/du) · (du/dx) = (2/√2) · (1/√2) = 1.
Since the chain rule is important for machine learning, it is discussed in detail
in §4.4.
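A sketch checking this computation with sympy:

from sympy import symbols, sin, diff, pi

x = symbols('x')
y = sin(x)**2                  # y = g(f(x)) with u = sin x, y = u^2
print(diff(y,x))               # 2*sin(x)*cos(x)
print(diff(y,x).subs(x,pi/4))  # 1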
By the product rule,
Using the chain rule, the power rule can be derived for any rational number n,
positive or negative. For example, since (√x)² = x, we can write x = f(g(x))
with f(x) = x² and g(x) = √x. By the chain rule,

1 = (x)′ = f ′(g(x)) g ′(x) = 2g(x) g ′(x) = 2√x (√x)′.

Solving for (√x)′ yields

(√x)′ = 1/(2√x),

which is (4.1.4) with n = 1/2. In this generality, the variable x is restricted
to positive values only.
x, a = symbols('x, a')
f = x**a
returns

a·x^a/x,  a·x^a/x,  a·x^{a−1},  a·x^{a−1}.
The power rule can be combined with the chain rule. For example, if

u = 1 − p + cp,   f(p) = u^n,   g(u) = u^{n+1} / ((c − 1)(n + 1)),

then

F(p) = (1 − p + cp)^{n+1} / ((c − 1)(n + 1)),

and

F ′(p) = g ′(u) u′ = u^n,

hence

F(p) = (1 − p + cp)^{n+1} / ((c − 1)(n + 1))  =⇒  F ′(p) = f(p).    (4.1.5)
For example,

(x^n)′′ = (nx^{n−1})′ = n(n − 1)x^{n−2} = (n!/(n − 2)!) x^{n−2} = P(n, 2)x^{n−2},

and in general

(x^n)^{(k)} = n(n − 1)(n − 2) · · · (n − k + 1)x^{n−k} = (n!/(n − k)!) x^{n−k} = P(n, k)x^{n−k}.
When k = 0, f (0) (x) = f (x), and, when k = 1, f (1) (x) = f ′ (x). The code
x, n = symbols('x, n')
diff(x**n,x,3)
from sympy import symbols
from scipy.special import factorial

def sym_legendre(n):
    # symbolic variable
    x = symbols('x')
    # symbolic function
    p = (x**2 - 1)**n
    nfact = factorial(n,exact=True)
    # symbolic nth derivative
    return p.diff(x,n)/(nfact * 2**n)
For example,
from sympy import symbols, lambdify

def num_legendre(n):
    x = symbols('x')
    f = sym_legendre(n)
    return lambdify(x,f, 'numpy')
We use the above to derive the Taylor series. Suppose f (x) is given by a
finite or infinite sum
f (x) = c0 + c1 x + c2 x2 + c3 x3 + . . . (4.1.6)
Then f (0) = c0 . Taking derivatives, by the sum, product, and power rules,
Inserting x = 0, we obtain f ′ (0) = c1 , f ′′ (0) = 2c2 , f ′′′ (0) = 3 · 2c3 , f (4) (0) =
4 · 3 · 2c4 . This can be encapsulated by f (n) (0) = n!cn , n = 0, 1, 2, 3, 4, . . . ,
which is best written
f^{(n)}(0) / n! = cn,   n ≥ 0.
Going back to (4.1.6), we derived
Taylor Series

f(x) = f(0) + f ′(0)x + (f ′′(0)/2!)x² + (f ′′′(0)/3!)x³ + · · ·
y = log x ⇐⇒ x = ey .
log(ey ) = y, elog x = x.
From here, we see the logarithm is defined only for x > 0 and is strictly
increasing (Figure 4.3).
Since e0 = 1,
log 1 = 0.
Since e∞ = ∞ (Figure A.3),
log ∞ = ∞.
log 0 = −∞.
We also see log x is negative when 0 < x < 1, and positive when x > 1.
ab = eb log a .
Then, by definition,
log(ab ) = b log a,
and

(a^b)^c = (e^{b log a})^c = e^{bc log a} = a^{bc}.
x = e^y  =⇒  1 = x′ = (e^y)′ = e^y y′ = xy′,

so

y = log x  =⇒  y′ = 1/x.

Derivative of the Logarithm

y = log x  =⇒  y′ = 1/x.    (4.1.9)
Since the derivative of log(1 + x) is 1/(1 + x), the chain rule implies
d^n/dx^n log(1 + x) = (−1)^{n−1} (n − 1)! / (1 + x)^n,   n ≥ 1.

log(1 + x) = x − x²/2 + x³/3 − x⁴/4 + · · · .    (4.1.10)
For the parabola in Figure 4.4, y = x², so, by the power rule, y′ = 2x.
Since y′ > 0 when x > 0 and y′ < 0 when x < 0, this agrees with the
max_{x ∈ {x∗, a, b}} f(x).
In other words, to find the maximum of f (x), find the critical points x∗ ,
plug them and the endpoints a, b into f (x), and select whichever yields the
maximum value.
For example, since (x²)′′ = 2 > 0 and (e^x)′′ = e^x > 0, x² and e^x are
strictly convex everywhere, and x⁴ − 2x² is strictly convex for |x| > 1/√3.
Convexity of e^x was also derived in (A.3.14). Since

(e^x)^{(n)} = e^x,   n ≥ 0,
f ′(a) ≤ (f(x) − f(a)) / (x − a) ≤ f ′(x),   a ≤ x ≤ b.
Since the tangent line at a is y = f ′ (a)(x − a) + f (a), rearranging this last
inequality, we obtain
For example, the function in Figure 4.6 is convex near x = a, and the
graph lies above its tangent line at a.
pL(x) = f(a) + f ′(a)(x − a) + (L/2)(x − a)².    (4.1.13)
Then p′′L (x) = L. Moreover the graph of pL (x) is tangent to the graph of f (x)
at x = a, in the sense f (a) = pL (a) and f ′ (a) = p′L (a). Because of this, we
call pL (x) the upper tangent parabola.
When y is convex, we saw above the graph of y lies above its tangent line.
When m ≤ y ′′ ≤ L, we can specify the size of the difference between the
graph and the tangent line. In fact, the graph is constrained to lie above or
below the lower or upper tangent parabolas.
If m ≤ f ′′ (x) ≤ L on [a, b], the graph lies between the lower and upper
tangent parabolas pm (x) and pL (x),
(m/2)(x − a)² ≤ f(x) − f(a) − f ′(a)(x − a) ≤ (L/2)(x − a)²,   a ≤ x ≤ b.    (4.1.14)
so g(x) is convex, so g(x) lies above its tangent line at x = a. Since g(a) = 0
and g ′ (a) = 0, the tangent line is 0, and we conclude g(x) ≥ 0, which is the
Fig. 4.6 Tangent parabolas pm (x) (green), pL (x) (red), L > m > 0.
t = (f ′(b) − f ′(a)) / (b − a)  =⇒  L ≥ t ≥ m,

which implies

t² − (m + L)t + mL = (t − m)(t − L) ≤ 0.
This yields
For gradient descent, we need the relation between a convex function and
its dual. If f(x) is convex, its convex dual is

g(p) = max_x (px − f(x)).    (4.1.16)

Below we see g(p) is also convex. This may not always exist, but we will work
with cases where no problems arise.
To evaluate g(p), following (4.1.11), we compute the maximizer x∗ by
setting the derivative of (px − f (x)) equal to zero and solving for x.
Let a > 0. The simplest example is f(x) = ax²/2. In this case, the maximum of px − f(x) occurs where (px − f(x))′ = 0, which leads to

0 = (px − ax²/2)′ = p − ax,

so x = p/a and g(p) = p(p/a) − a(p/a)²/2 = p²/(2a).
Going back to (4.1.16), for each p, the point x where px − f (x) equals the
maximum g(p) — the maximizer — depends on p. If we denote the maximizer
by x = x(p), then
g(p) = px(p) − f (x(p)).
Since the maximum occurs when the derivative is zero, we have
Hence
g(p) = px − f (x) ⇐⇒ p = f ′ (x).
Also, by the chain rule, differentiating with respect to p,
Thus f ′ (x) is the inverse function of g ′ (p). Since g(p) = px − f (x) is the same
as f (x) = px − g(p), we have
If g(p) is the convex dual of a convex f (x), then f (x) is the convex
dual of g(p).
f ′ (g ′ (p)) = p.
f ′′ (g ′ (p))g ′′ (p) = 1.
We derived
Let f (x) be a strictly convex function, and let g(p) be the convex dual
of f (x). Then g(p) is strictly convex and
g ′′(p) = 1 / f ′′(x).    (4.1.18)
This makes sense because the binomial coefficient \binom{n}{k} is defined for any
real number n (A.2.12), (A.2.13).
In summation notation,

(a + x)^n = \sum_{k=0}^{\infty} \binom{n}{k} a^{n-k} x^k.    (4.1.19)
The only difference between (A.2.7) and (4.1.19) is the upper limit of the
summation, which is set to infinity. When n is a whole number, by (A.2.10),
we have
\binom{n}{k} = 0,   for k > n,
so
f^{(k)}(0) / k! = n(n − 1)(n − 2) · · · (n − k + 1) a^{n-k} / k! = \binom{n}{k} a^{n-k}.

Writing out the Taylor series,

(a + x)^n = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!} x^k = \sum_{k=0}^{\infty} \binom{n}{k} a^{n-k} x^k,
a, b = 0, 3*pi
theta = arange(a,b,.01)
ax = axes()
ax.grid(True)
ax.axhline(0, color='black', lw=1)
plot(theta,sin(theta))
show()
It is often convenient to set the horizontal axis tick marks at the multiples
of π/2. For this, we use
def label(k):
    if k == 0: return '$0$'
    elif k == 1: return r'$\pi/2$'
    elif k == -1: return r'$-\pi/2$'
    elif k == 2: return r'$\pi$'
    elif k == -2: return r'$-\pi$'
    # general multiple of pi/2 (assumed completion of the listing)
    else: return r'$' + str(k) + r'\pi/2$'
from numpy import pi, floor, ceil, arange
from matplotlib.pyplot import xticks

def set_pi_ticks(a,b):
    base = pi/2
    m = floor(b/base)
    n = ceil(a/base)
    k = arange(n,m+1,dtype=int)
    # multiples of base
    return xticks(k*base, map(label,k) )
We review the derivative of sine and cosine. Recall the angle θ in radians
is the length of the subtended arc (in red) in Figure 4.9. Following the figure,
with P = (x, y), we have x = cos θ, y = sin θ.
The key idea here is Archimedes’ axiom [13], which states:
Suppose two convex curves share common initial and terminal points. If one is inside
the other, then the inside curve is the shorter.
By the figure, there are three convex curves joining P and I: The line
segment P I, the red arc, and the polygonal curve P QI. Since the length of
the line segment is greater than y, Archimedes’ axiom implies
y < θ < 1 − x + y,
or
sin θ < θ < 1 − cos θ + sin θ.
1 − (1 − cos θ)/θ < (sin θ)/θ < 1.    (4.1.21)
We use this to show (the definition of limit is in §A.6)
lim_{θ→0} (sin θ)/θ = 1.    (4.1.22)
Since sin θ is odd, it is enough to verify (4.1.22) for θ > 0.
To this end, since sin2 θ = 1 − cos2 θ, from (4.1.21),
which implies
lim_{θ→0} (1 − cos θ)/θ = 0.
Taking the limit θ → 0 in (4.1.21), we obtain (4.1.22) for θ > 0.
From (A.4.6),
sin(θ + t) = sin θ cos t + cos θ sin t,
so
lim_{t→0} (sin(θ + t) − sin θ)/t = lim_{t→0} ( sin θ · (cos t − 1)/t + cos θ · (sin t)/t ) = cos θ.
Thus the derivative of sine is cosine,

(sin θ)′ = cos θ.

Similarly,

(cos θ)′ = − sin θ.
Using the chain rule, we compute the derivative of the inverse arcsin x of
sin θ. Since
θ = arcsin x ⇐⇒ x = sin θ,
we have

1 = x′ = (sin θ)′ = θ′ · cos θ = θ′ · √(1 − x²),

or

(arcsin x)′ = θ′ = 1/√(1 − x²).
We use this to compute the derivative of the arcsine law (3.2.15). With
x = √λ/2, by the chain rule,

( (2/π) arcsin(√λ/2) )′ = (2/π) · (1/√(1 − x²)) · x′
                        = (2/π) · (1/√(1 − λ/4)) · (1/(4√λ)) = 1/(π√(λ(4 − λ))).    (4.1.23)
This shows the derivative of the arcsine law is the density in Figure 3.11.
Exercises
Exercise 4.1.2 With exp x = ex , what are the first derivatives of exp(exp x)
and exp(exp(exp x))?
Exercise 4.1.3 With a > 0, let f(x) = (1/2)ax² − e^x. Where is f(x) convex,
and where is it concave?
Exercise 4.1.5 Compute the Taylor series for sin θ and cos θ.
Exercise 4.1.7 If the convex dual of f (x) is g(p), and t is a constant, what
is the convex dual of f (x) + t?
Exercise 4.1.8 If the convex dual of f (x) is g(p), and t is a constant, what
is the convex dual of f (x + t)?
4.2 Entropy and Information

The entropy of 0 ≤ p ≤ 1 is

H(p) = −p log p − (1 − p) log(1 − p).

This is also called absolute entropy, to contrast with the relative entropy which
we see below.
To graph H(p), we compute its first and second derivatives. Here the
independent variable is p. By the product rule,
H ′(p) = (−p log p − (1 − p) log(1 − p))′ = −log p + log(1 − p) = log((1 − p)/p).
Thus H ′ (p) = 0 when p = 1/2, H ′ (p) > 0 on p < 1/2, and H ′ (p) < 0 on
p > 1/2. Since this implies H(p) is increasing on p < 1/2, and decreasing on
p > 1/2, p = 1/2 is a global maximizer of the graph.
Notice as p increases, 1 − p decreases, so (1 − p)/p decreases. Since log is
increasing, as p increases, H ′ (p) decreases. Thus H(p) is concave.
Taking the second derivative, by the chain rule and the quotient rule,
H ′′(p) = ( log((1 − p)/p) )′ = −1/(p(1 − p)),
To explain the meaning of the entropy function H(p), suppose a coin has
heads-bias or heads-probability p. If p is near 1, then we have confidence the
outcome of tossing the coin is heads, and, if p is near 0, we have confidence the
outcome of tossing the coin is tails. If p = 1/2, then we have least information.
Thus we can view the entropy as measuring a lack of information.
To formalize this, we define the information or absolute information
Then we have
p = σ(x) = e^x/(1 + e^x) = 1/(1 + e^{−x}),   −∞ < x < ∞.    (4.2.3)

By the quotient and chain rules, its derivative is

p′ = −(−e^{−x})/(1 + e^{−x})² = σ(x)(1 − σ(x)) = p(1 − p).    (4.2.4)
The logistic function, also called the expit function and the sigmoid function,
is studied further in §5.1, where it is used in coin-tossing and Bayes theorem.
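A sketch checking (4.2.4) symbolically:

from sympy import symbols, exp, diff, simplify

x = symbols('x')
sigma = 1/(1 + exp(-x))
print(simplify(diff(sigma,x) - sigma*(1-sigma)))   # 0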
The inverse of the logistic function is the logit function. The logit function
is found by solving p = σ(x) for x, obtaining
x = σ^{−1}(p) = log(p/(1 − p)).    (4.2.5)
The logit function is also called the log-odds function. Its derivative is
x′ = ((1 − p)/p) · (p/(1 − p))′ = ((1 − p)/p) · (1/(1 − p)²) = 1/(p(1 − p)).
Let
Z(x) = log (1 + ex ) . (4.2.6)
Then Z ′ (x) = σ(x) and Z ′′ (x) = σ ′ (x) = σ(1 − σ) > 0. This shows Z(x) is
strictly convex. We call Z(x) the cumulant-generating function, to be consis-
tent with random variable terminology (§5.3).
max(px − Z(x))
x
0 ≤ p ≤ 1, and 0 ≤ q ≤ 1.
Then
I(q, q) = 0,
which agrees with our design goal that I(p, q) measures the divergence be-
tween the information in p and the information in q. Because I(p, q) is not
symmetric in p, q, we think of q as a base or reference probability, against
which we compare p.
d²I(p, q)/dp² = I ′′(p) = 1/(p(1 − p)),

d²I(p, q)/dq² = p/q² + (1 − p)/(1 − q)²,
Figure 4.13 clearly exhibits the trough p = q where I(p, q) = 0, and the
edges q = 0, 1 where I(p, q) = ∞. In scipy, I(p, q) is incorrectly called
entropy. For more on this terminology confusion, see the end of §5.6. The
code is as follows.
%matplotlib ipympl
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import entropy

# relative information I(p,q); scipy's entropy(pk,qk) computes it
# (this helper is an assumed reconstruction consistent with the text)
def I(p,q): return entropy([p,1-p],[q,1-q])

ax = axes(projection='3d')
ax.set_axis_off()
p = arange(0,1,.01)
q = arange(0,1,.01)
p,q = meshgrid(p,q)
# surface
ax.plot_surface(p,q,I(p,q), cmap='cool')
# square
ax.plot([0,1,1,0,0],[0,0,1,1,0],linewidth=.5,c="k")
show()
Exercises
Exercise 4.2.5 The relative information I(p, q) has minimum zero when p =
q. Use the lower tangent parabola (4.1.12) of I(x, q) at q and Exercise 4.2.2
to show
I(p, q) ≥ 2(p − q)2 .
For q = 0.7, plot both I(p, q) and 2(p − q)2 as functions of 0 < p < 1.
4.3 Multi-Variable Calculus

Let
f (x) = f (x1 , x2 , . . . , xd )
be a scalar function of a point x = (x1 , x2 , . . . , xd ) in Rd , and suppose v is
a unit vector in Rd . Then, along the line x(t) = x + tv, g(t) = f (x + tv)
is a function of the single variable t. Hence its derivative g ′ (0) at t = 0 is
well-defined. Since g ′ (0) depends on the point x and on the direction v, this
rate of change is the directional derivative of f (x) at x in the direction v.
More explicitly, the directional derivative of f (x) at x in the direction v is
D_v f(x) = d/dt f(x + tv) |_{t=0}. (4.3.1)
∂f/∂x_k (x) = d/dt f(x + t e_k) |_{t=0}.
The partial derivative in the k-th direction is just the one-dimensional deriva-
tive considering xk as the independent variable, with all other xj ’s constants.
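As a quick numerical illustration, the directional derivative (4.3.1) can be approximated by a finite difference along x + tv; the function f below is a made-up example, not one from the text.

from numpy import array, sqrt

# sample scalar function on R^2 (an illustration)
def f(x): return x[0]**2 + 3*x[0]*x[1]

x = array([1.0, 2.0])
v = array([1.0, 1.0]) / sqrt(2)    # unit direction
t = 1e-6

# finite-difference approximation of D_v f(x)
print((f(x + t*v) - f(x)) / t)     # close to grad f(x) . v = 11/sqrt(2)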
Below we exhibit the multi-variable chain rule in two ways. The first in-
terpretation is geometric, and involves motion in time and directional deriva-
tives. This interpretation is relevant to gradient descent, §7.3.
The second interpretation is combinatorial, and involves repeated compo-
sitions of functions. This interpretation is relevant to computing gradients in
networks, specifically backpropagation §4.4, §7.2.
These two interpretations work together when training neural networks,
§7.4.
For the first interpretation of the chain rule, suppose the components x1 ,
x2 , . . . , xd are functions of a single variable t (usually time), so we have
The Rd -valued function x(t) = (x1 (t), x2 (t), . . . , xd (t)) represents a curve
or path in Rd , and the vector
d/dt f(x + tv) |_{t=0} = ∇f(x) · v. (4.3.3)
d/ds f(W + sV) |_{s=0} = trace(V^t G),   for all V. (4.3.4)
dy/dr = dy/du · du/dr = −0.90 · 1 = −0.90,
and similarly,
dy/ds = dy/dt = −0.90.
By the chain rule,
dy/dx = dy/dr · dr/dx + dy/ds · ds/dx + dy/dt · dt/dx.
By (4.2.4), s′ = s(1 − s) = 0.22, so
dr/dx = cos x = 0.71,   ds/dx = s(1 − s) = 0.22,   dt/dx = 2x = 1.57.
We obtain
dy/dx = −0.90 · 0.71 − 0.90 · 0.22 − 0.90 · 1.57 = −2.25.
The chain rule is discussed in further detail in §4.4.
∇f (x∗ ) = 0.
if
d²/dt² f(x + tv) |_{t=0} (4.3.6)
is nonnegative for every point x and every direction v. For this, see also
(4.5.18).
For example, when f (x) is given by (4.3.5),
g(t) = f(x + tv)
     = (1/2)(x + tv) · Q(x + tv) − b · (x + tv)
     = (1/2) x · Qx − b · x + t v · (Qx − b) + (1/2) t² v · Qv        (4.3.7)
     = f(x) + t v · (Qx − b) + (1/2) t² v · Qv.
From this follows
g′(t) = v · (Qx − b) + t v · Qv,    g′′(t) = v · Qv.
This shows
Quadratic Convexity
By (2.2.2),
Dv f (x) = ∇f (x) · v = |∇f (x)| |v| cos θ,
where θ is the angle between the vector v and the gradient vector ∇f (x).
Since −1 ≤ cos θ ≤ 1, we conclude
Exercises
Exercise 4.3.1 Let I(p, q) be the relative information (4.2.9), and let Ipp ,
Ipq , Iqp , Iqq be the second partial derivatives. If Q is the second derivative
matrix
Q = ( Ipp  Ipq
      Iqp  Iqq ),
show
det(Q) = (p − q)² / (p(1 − p) q²(1 − q)²).
Exercise 4.3.2 Let I(p, q) be the relative information (4.2.9). With x =
(p, q) and v = (ap(1 − p), bq(1 − q)), show
d²/dt² I(x + tv) |_{t=0} = p(1 − p)(a − b)² + b²(p − q)².
Conclude that I(p, q) is a convex function of (p, q). Where is it not strictly
convex?
Exercise 4.3.3 Let J(x) = J(x1 , x2 , . . . , xd ) equal
J(x) = (1/2)(x1 − x2)² + (1/2)(x2 − x3)² + · · · + (1/2)(x_{d−1} − x_d)² + (1/2)(x_d − x1)².
Compute Q = D2 J.
three versions of forward and back propagation. In all cases, back propagation
depends on the chain rule.
The chain rule (§4.1) states
r = f(x),  y = g(r)   =⇒   dy/dx = dy/dr · dr/dx.
In this section, we work out the implications of the chain rule on repeated
compositions of functions.
Suppose
r = f(x) = sin x,
s = g(r) = 1/(1 + e^{−r}),
y = h(s) = s².
Fig. 4.15 The chain with node outputs x, r, s, y and functions f, g, h along the edges.
The chain in Figure 4.15 has four nodes and four edges. The outputs at
the nodes are x, r, s, y. Start with output x = π/4. Evaluating the functions
in order,
Notice these values are evaluated in the forward direction: x then r then s
then y. This is forward propagation.
Now we evaluate the derivatives of the output y with respect to x, r, s,
dy/dx,   dy/dr,   dy/ds.
With the above values for x, r, s, we have
dy/ds = 2s = 2 · 0.670 = 1.340.
Since g is the logistic function, by (4.2.4),
From this,
dy/dr = dy/ds · ds/dr = 1.340 · g′(r) = 1.340 · 0.221 = 0.296.
Repeating one more time,
dy/dx = dy/dr · dr/dx = 0.296 · cos x = 0.296 · 0.707 = 0.209.
Thus the derivatives are
dy/dx = 0.209,   dy/dr = 0.296,   dy/ds = 1.340.
Notice the derivatives are evaluated in the backward direction: First dy/dy =
1, then dy/ds, then dy/dr, then dy/dx. This is back propagation.
r = x2 ,
s = r 2 = x4 ,
y = s2 = x8 .
This is the same function h(x) = x2 composed with itself three times. With
x = 5, we have
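To make the lists below runnable, the three links of the chain and their derivatives need to be defined first. One possible spelling (ours, using scipy's expit for the logistic function) is:

from numpy import sin, cos
from scipy.special import expit

def f(x): return sin(x)
def g(r): return expit(r)
def h(s): return s**2

def df(x): return cos(x)
def dg(r): return expit(r)*(1 - expit(r))
def dh(s): return 2*s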
func_chain = [f,g,h]
der_chain = [df,dg,dh]
Then we evaluate the output vector x = (x, r, s, y), leading to the first
version of forward propagation,
def forward_prop(x_in,func_chain):
    x = [x_in]
    while func_chain:
        f = func_chain.pop(0)   # first func
        x_out = f(x_in)
        x.append(x_out)         # insert at end
        x_in = x_out
    return x
# dy/dy = 1
delta_out = 1
def backward_prop(delta_out,x,der_chain):
    delta = [delta_out]
    while der_chain:
        # discard last output
        x.pop(-1)
        df = der_chain.pop(-1)  # last der
        der = df(x[-1])
        # chain rule -- multiply by previous der
        # (last two lines reconstructed: prepend the new derivative, then return)
        delta.insert(0, delta[0]*der)
    return delta

delta = backward_prop(delta_out,x,der_chain)
d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1
x = forward_prop(x_in,func_chain)
delta = backward_prop(delta_out,x,der_chain)
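With h(x) = x² composed three times and x = 5, the node outputs are 5, 25, 625, 390625, and the derivatives dy/ds, dy/dr, dy/dx are 2 · 625 = 1250, 1250 · 50 = 62500, 62500 · 10 = 625000. With the reconstruction of backward_prop above, the two lists can be inspected directly:

print(x)       # [5, 25, 625, 390625]
print(delta)   # [625000, 62500, 1250, 1], ending with dy/dy = 1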
Fig. 4.16 Network with input nodes x, y, z, hidden nodes +, max, ∗, and output J.
Now we work with the network in Figure 4.16, using the multi-variable
chain rule (§4.3). The functions are
a = f (x, y) = x + y,
b = g(y, z) = max(y, z),
J = h(a, b) = ab.
J = (x + y) max(y, z),
Here there are three input nodes x, y, z, and three hidden nodes +, max,
∗. Starting with inputs (x, y, z) = (1, 2, 0), and plugging in, we obtain node
outputs
(x, y, z, a, b, J) = (1, 2, 0, 3, 2, 6)
(Figure 4.18). This is forward propagation.
y < z:  max(y, z) = z,  ∂g/∂y = 0,  ∂g/∂z = 1,
y > z:  max(y, z) = y,  ∂g/∂y = 1,  ∂g/∂z = 0
(at y = z, the derivative of max(y, z) is not defined).
The outputs (blue) and the derivatives (red) are displayed in Figure 4.18.
Summarizing, by the chain rule,
• derivatives are computed backward,
• derivatives along successive edges are multiplied,
• derivatives along several outgoing edges are added.
Fig. 4.18 Node outputs (blue) and derivatives (red) for the network in Figure 4.16.
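These values can be double-checked by finite differences; the snippet below is a quick check, not part of the text's code.

# finite-difference check of the derivatives of J = (x + y)*max(y, z) at (1, 2, 0)
def J(x, y, z): return (x + y) * max(y, z)

eps = 1e-6
base = J(1, 2, 0)
print((J(1 + eps, 2, 0) - base) / eps)   # approximately 2
print((J(1, 2 + eps, 0) - base) / eps)   # approximately 5
print((J(1, 2, eps) - base) / eps)       # approximately 0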
Suppose a directed graph has d nodes, and, for each node i, let xi be the
outgoing signal. Then x = (x1 , x2 , . . . , xd ) is the outgoing vector. In the case
of Figure 4.16, d = 6 and
d = 6
w = [ [None]*d for _ in range(d) ]
w[0][3] = w[1][3] = w[1][4] = w[2][4] = w[3][5] = w[4][5] = 1
More generally, in a weighed directed graph (§3.3), the weights wij are nu-
meric scalars.
Once we have the outgoing vector x, for each node j, let
x⁻_j = (w_{1j} x_1, w_{2j} x_2, . . . , w_{dj} x_d). (4.4.1)
Then x⁻_j is the list of node signals, each weighted accordingly. If (i, j) is
not an edge, then w_{ij} = 0, so x_i does not appear in x⁻_j: in other words, x⁻_j
lists only the signals incoming to node j. Then
x_j = f_j(x⁻_j) = f_j(w_{1j} x_1, w_{2j} x_2, . . . , w_{dj} x_d). (4.4.2)
For example, if (1, 5), (7, 5), (2, 5) are the edges pointing to node 5 and we
ignore zeros in (4.4.1), then x⁻_5 = (w_{15} x_1, w_{75} x_7, w_{25} x_2), so
x_5 = f_5(x⁻_5) = f_5(w_{15} x_1, w_{75} x_7, w_{25} x_2).
x⁻ = (x⁻_1, x⁻_2, . . . , x⁻_d).
x⁻ = (x⁻_1, x⁻_2, x⁻_3, x⁻_4, x⁻_5, x⁻_6) = ((), (), (), (x, y), (y, z), (a, b)),
and
f4 (x, y) = x + y, f5 (y, z) = max(y, z), J(a, b) = ab.
Note there is nothing incoming at the input nodes, so there is no point defin-
ing f1 , f2 , f3 .
activate = [None]*d

def incoming(x,w,j):
    return [ outgoing(x,w,i) * w[i][j] if w[i][j] else 0 for i in range(d) ]

def outgoing(x,w,j):
    if x[j] != None: return x[j]
    else: return activate[j](*incoming(x,w,j))
Let xin be the outgoing vector over the input nodes. If there are m input
nodes, and d nodes in total, then the length of xin is m, and the length of x
is d. In the example above, xin = (x, y, z).
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
    d = len(w)
    x = [None]*d
    m = len(x_in)
    x[:m] = x_in
    for j in range(m,d): x[j] = outgoing(x,w,j)
    return x
For this code to work, we assume there are no cycles in the graph: All back-
ward paths end at inputs.
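As a concrete check of forward propagation, here is one possible wiring of the network in Figure 4.16 using the code above. The activate definitions are an illustration (each is written to accept the full length-d weighted signal list produced by incoming), not the text's listing.

d = 6
w = [ [None]*d for _ in range(d) ]
w[0][3] = w[1][3] = w[1][4] = w[2][4] = w[3][5] = w[4][5] = 1

activate = [None]*d
activate[3] = lambda *s: s[0] + s[1]       # a = x + y
activate[4] = lambda *s: max(s[1], s[2])   # b = max(y, z)
activate[5] = lambda *s: s[3] * s[4]       # J = a * b

x = forward_prop([1,2,0], w)
print(x)   # [1, 2, 0, 3, 2, 6]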
Let xout be the output nodes. For Figure 4.16, this means xout = (J).
Then by forward propagation, J is also a function of all node outputs. For
Figure 4.16, this means J is a function of x, y, z, a, b.
Therefore, at each node i, we have the derivatives
δ_i = ∂J/∂x_i,   i = 1, 2, . . . , d.
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
∂J/∂x_i = Σ_{i→j} ∂J/∂x_j · ∂x_j/∂x_i = Σ_{i→j} ∂J/∂x_j · ∂f_j/∂x_i · w_{ij},
so
δ_i = Σ_{i→j} δ_j · g_{ij} · w_{ij}.
The code is
def derivative(x,delta,g,i):
    if delta[i] != None: return delta[i]
    else:
        # evaluate each derivative g[i][j] at node j's incoming signals
        return sum([ derivative(x,delta,g,j) * g[i][j](*incoming(x,w,j)) * w[i][j]
                     if g[i][j] != None else 0 for j in range(d) ])
def backward_prop(x,delta_out,g):
    d = len(g)
    delta = [None]*d
    m = len(delta_out)
    delta[d-m:] = delta_out
    for i in range(d-m): delta[i] = derivative(x,delta,g,i)
    return delta
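Continuing the sketch above, the partial-derivative functions g[i][j] = ∂f_j/∂x_i for the same network can be written out (again an illustration, with each g[i][j] taking the same weighted signal list as f_j), and backward propagation recovers the derivatives at every node:

g = [ [None]*d for _ in range(d) ]
g[0][3] = lambda *s: 1                           # da/dx
g[1][3] = lambda *s: 1                           # da/dy
g[1][4] = lambda *s: 1 if s[1] > s[2] else 0     # db/dy
g[2][4] = lambda *s: 1 if s[2] > s[1] else 0     # db/dz
g[3][5] = lambda *s: s[4]                        # dJ/da = b
g[4][5] = lambda *s: s[3]                        # dJ/db = a

delta = backward_prop(x, [1], g)
print(delta)   # [2, 5, 0, 2, 3, 1]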
f(x) = f(x1, x2) = max(|x1|, |x2|),    f(x) = f(x1, x2) = x1²/4 + x2²
are scalar functions of points in R2 . More generally, if Q is a d × d matrix,
f (x) = x · Qx is such a function. Here, to obtain x · Qx, we think of the point
x as a vector, then use row-times-column multiplication to obtain Qx, then
take the dot product with x. We begin with functions in general.
A level set of f (x) is the set
E: f (x) = 1.
Here we write the level set of level 1. One can have level sets corresponding
to any level ℓ, f (x) = ℓ. In two dimensions, level sets are also called contour
lines.
max(|x1|, |x2|) = 1,    x1²/4 + x2² = 1.
The contour lines of
f(x) = f(x1, x2) = x1²/16 + x2²/4
are in Figure 4.20.
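A plot like Figure 4.20 can be produced with matplotlib's contour; the levels chosen below are an illustration.

from numpy import *
from matplotlib.pyplot import *

x1 = arange(-5, 5, .01)
x2 = arange(-3, 3, .01)
x1, x2 = meshgrid(x1, x2)
f = x1**2/16 + x2**2/4

contour(x1, x2, f, levels=[.25, .5, 1, 2])
gca().set_aspect("equal")
grid()
show()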
E: f (x) ≤ 1.
Here we write the sublevel set of level 1. One can have sublevel sets corre-
sponding to any level c, f (x) ≤ c. For example, in Figure 4.19, the (blue)
interior of the square, together with the square itself, is a sublevel set. Sim-
ilarly, the interior of the ellipse, together with the ellipse itself, is a sublevel
set. The interiors of the ellipsoids, together with the ellipsoids themselves, in
Figure 4.25 are sublevel sets. Note we always consider the level set to be part
of the sublevel set.
The level set f (x) = 1 is the boundary of the sublevel set f (x) ≤ 1. Thus
the square and the ellipse in Figure 4.19 are boundaries of their respective
sublevel sets, and the variance ellipsoid x · Qx = 1 is the boundary of the
sublevel set x · Qx ≤ 1.
A scalar function f(x) is convex if for any two points x0 and x1 in Rd,
f((1 − t)x0 + tx1) ≤ (1 − t)f(x0) + tf(x1),   0 ≤ t ≤ 1. (4.5.1)
This says the line segment joining any two points (x0 , f (x0 )) and (x1 , f (x1 ))
on the graph of f (x) lies above the graph of f (x). For example, in two di-
mensions, the function f (x) = f (x1 , x2 ) = x21 + x22 /4 is convex because its
graph is the paraboloid in Figure 4.22.
If the inequality is strict for 0 < t < 1, then f (x) is strictly convex,
f ((1 − t)x0 + tx1 ) < (1 − t)f (x0 ) + tf (x1 ), for 0 < t < 1.
t1 x1 + t2 x2 + · · · + tN xN
t1 + t2 + · · · + tN = 1.
Fig. 4.22 Convex: The line segment lies above the graph.
Quadratic is Convex
This was derived in the previous section, but here we present a more
geometric proof.
To derive this result, let x0 and x1 be any points, and let v = x1 − x0 .
Then x0 + tv = (1 − t)x0 + tx1 and x1 = x0 + v. Let g0 = Qx0 − b. By (4.3.7),
f(x0 + tv) = f(x0) + t v · (Qx0 − b) + (1/2) t² v · Qv = f(x0) + t v · g0 + (1/2) t² v · Qv. (4.5.3)
Inserting t = 1 in (4.5.3), we have f (x1 ) = f (x0 ) + v · g0 + v · Qv/2. Since
t2 ≤ t for 0 ≤ t ≤ 1 and v · Qv ≥ 0, by (4.5.3),
Here are some basic properties and definitions of sets that will be used
in this section and in the exercises. Let a be a point in Rd and let r be a
positive scalar. A closed ball of radius r and center a is the set of points x
satisfying |x − a|2 ≤ r2 . An open ball of radius r and center a is the set of
points x satisfying |x − a|2 < r2 .
Let E be any set in Rd . The complement of E is the set E c of points that
are not in E. If E and F are sets, the intersection E ∩ F is the set of points
that lie in both sets.
A point a is in the interior of E if there is a ball B centered at a contained
in E; this is usually written B ⊂ E. Here the ball may be either open or
closed, the interior is the same.
A point a is in the boundary of E if every ball centered at a contains points
of E and points of E c . From the definitions, it is clear that there are no points
that lie in both the interior of E and the boundary of E.
Let E be a set. If E equals its interior, then E is an open set. If E contains
its boundary, then E is a closed set . When a set is closed, we have
x = t_1 x_1 + t_2 x_2 + · · · + t_N x_N
from numpy.random import default_rng
from scipy.spatial import ConvexHull
from matplotlib.pyplot import plot, grid, show

rng = default_rng()
# sample points in the plane (one possible choice, supplied for completeness)
points = rng.random((30,2))

hull = ConvexHull(points)
facet = hull.simplices[0]

plot(points[facet, 0], points[facet, 1], 'r--')
grid()
show()
If f (x) is a function, its graph is the set of points (x, y) in Rd+1 satisfying
y = f (x), and its epigraph is the set of points (x, y) satisfying y ≥ f (x).
If f (x) is defined on Rd , its sublevel sets are in Rd , and its epigraph is in
Rd+1 . Then f (x) is a convex function exactly when its epigraph is a convex
set (Figure 4.22). From convex functions, there are other ways to get convex
sets:
E: f (x) ≤ 1
is a convex set.
H: n · (x − x0 ) = 0. (4.5.4)
H: m · x + b = 0, (4.5.5)
with a nonzero vector m and scalar b. In this section, we use (4.5.4); in §7.6,
we use (4.5.5).
n · (x − x0) < 0,    n · (x − x0) = 0,    n · (x − x0) > 0.
The vector n is the normal vector to the hyperplane. Note replacing n by any
nonzero multiple of n leaves the hyperplane unchanged.
Separating Hyperplane I
Expanding, we have
0 ≤ 2(x0 − x∗ ) · v + t|v|2 , 0 ≤ t ≤ 1.
Since this is true for small positive t, sending t → 0 results in v · (x0 − x∗) ≥ 0.
Setting n = x∗ − x0 , we obtain
x in E =⇒ (x − x0 ) · n ≤ 0. (4.5.7)
y ≥ 0 if p = 1,   y ≤ 0 if p = 0,   for every sample x. (4.5.8)
m · x_k + b = 0,   k = 1, 2, . . . , N. (4.5.9)
Separating Hyperplane II
To derive this result, from Exercise 4.5.7 both K0 and K1 have interiors.
Suppose there is a separating hyperplane m · x + b = 0. If x0 is in K0 ∩ K1 ,
then we have m · x0 + b ≤ 0 and m · x0 + b ≥ 0, so m · x0 + b = 0. This shows
the separating hyperplane passes through x0 . Since K0 lies on one side of the
hyperplane, x0 cannot be in the interior of K0 . Similarly for K1 . Hence x0
cannot be in the interior of K0 ∩ K1 . This implies K0 ∩ K1 has no interior.
Conversely, suppose K0 ∩ K1 has no interior. There are two cases, whether
K0 ∩ K1 is empty or not. If K0 ∩ K1 is empty, then the minimum of |x1 − x0 |2
over all x1 in K1 and x0 in K0 is positive. If we let
then x∗0 ̸= x∗1 , x∗1 is on the boundary of K1 , and x∗0 is on the boundary of K0 .
In the first case, since K0 and K1 don’t intersect, x∗1 is not in K0 , and x∗0
is not in K1 . Let m = x∗1 − x∗0 . By separating hyperplane I, the hyperplane
H0 : m · (x − x∗0 ) = 0 separates K0 from x∗1 . Similarly, the hyperplane H1 :
m · (x − x∗1 ) = 0 separates K1 from x∗0 . Thus (Figure 4.28) both hyperplanes
separate K0 from K1 .
In the second case, when K0 and K1 intersect, then x∗0 = x∗1 = x∗ . Let
0 < t < 1, and let tK0 be K0 scaled towards its mean. Similarly, let tK1
be K1 scaled towards its mean. By Exercise 4.5.8, both tK0 and tK1 lie in
the interiors of K0 and K1 respectively, so tK0 and tK1 do not intersect. By
applying the first case to tK0 and tK1 , and choosing t close to 1, t → 1, we
obtain a hyperplane H separating K0 and K1 . We skip the details.
In Figure 4.19, at the corner of the square, there are multiple supporting
hyperplanes. However, at every other point a on the boundary of the square,
there is a unique (up to scalar multiple) supporting hyperplane. For the ellipse
or ellipsoid, at every point of the boundary, there is a unique supporting
hyperplane.
Now we derive the analogous concepts for convex functions.
Let f (x) be a function and let a be a point at which there is a gradient
∇f (a). The tangent hyperplane for f (x) at a is
where the minimum is taken over all vectors x. A minimizer is the location of
the bottom of the graph of the function. For example, the parabola (Figure
4.4) and the relative information (Figure 4.12) both have global minimizers.
We say a function f (x) is strictly convex if g(t) = f (a + tv) is strictly
convex for every point a and direction v. This is the same as saying the
inequality (4.5.1) is strict for 0 < t < 1.
We say a function f (x) is proper if the sublevel set f (x) ≤ c is bounded
for every level c. Before we state this precisely, we contrast a level versus a
bound.
Let f (x) be a function. A level is a scalar c determining a sublevel set
f (x) ≤ c. A bound is a scalar C determining a bounded set |x| ≤ C.
We say f (x) is proper if for every level c, there is a bound C so that
To see this, pick any point a. Then, by properness, the sublevel set S given
by f (x) ≤ f (a) is bounded. By continuity of f (x), there is a minimizer x∗
(see §A.7). Since for all x outside the sublevel set, we have f (x) > f (a), x∗
is a global minimizer.
When f (x) is also strictly convex, the minimizer is unique.
To see this, suppose f (x) is not proper. In this case, by (4.5.12), there
would be a level c and a sequence x1 , x2 , . . . in the row space of A satisfying
|xn | → ∞ and f (xn ) ≤ c for n ≥ 1.
Let x′_n = x_n/|x_n|. Then the x′_n are unit vectors in the row space of A, hence
x′_n is a bounded sequence. From §A.7, this implies x′_n subconverges to some x′.
Moreover,
|Ax′_n| = (1/|x_n|) |Ax_n| ≤ (1/|x_n|) (|Ax_n − b| + |b|) ≤ (1/|x_n|) (√c + |b|).
Properness of Residual
is proper on Rd .
As a consequence,
∇f (x∗ ) = 0. (4.5.17)
Let a be any point, and v any direction, and let g(t) = f (a + tv). Then
g ′ (0) = ∇f (a) · v.
∂²f/(∂x_i ∂x_j),   1 ≤ i, j ≤ d,
d/dt f(x + tv) = ∇f(x + tv) · v.
Differentiating and using the chain rule again,
d²/dt² f(x + tv) |_{t=0} = v · Qv. (4.5.18)
This implies
d²/dt² f(x + tv) |_{t=0} = 0 only when v = 0. (4.5.19)
If m ≤ D²f(x) ≤ L, then
(m/2)|x − a|² ≤ f(x) − f(a) − ∇f(a) · (x − a) ≤ (L/2)|x − a|². (4.5.20)
(m/2)|x − x∗|² ≤ f(x) − f(x∗) ≤ (L/2)|x − x∗|². (4.5.21)
Here the maximum is over all vectors x, and p = (p1 , p2 , . . . , pd ), the dual
variable, also has d features. We will work in situations where a maximizer
exists in (4.5.22).
Let Q > 0 be a positive matrix. The simplest example is
f(x) = (1/2) x · Qx   =⇒   g(p) = (1/2) p · Q⁻¹p.
This is established by the identity
(1/2)(p − Qx) · Q⁻¹(p − Qx) = (1/2) p · Q⁻¹p − p · x + (1/2) x · Qx. (4.5.23)
To see this, since the left side of (4.5.23) is greater or equal to zero, we have
(1/2) p · Q⁻¹p − p · x + (1/2) x · Qx ≥ 0.
Since (4.5.23) equals zero iff p = Qx, we are led to (4.5.22).
Moreover, switching p · Q−1 p with x · Qx, we also have
Thus the convex dual of the convex dual of f (x) is f (x). In §5.6, we compute
the convex dual of the cumulant-generating function.
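As a numerical sanity check of the dual pair f(x) = x·Qx/2, g(p) = p·Q⁻¹p/2, one can maximize p·x − f(x) by brute force over a grid and compare with g(p); the matrix Q and vector p below are an illustration.

from numpy import array, dot, linspace
from numpy.linalg import inv

Q = array([[2.0, 0.5], [0.5, 1.0]])   # positive matrix
p = array([1.0, -1.0])

def f(x): return dot(x, Q @ x) / 2
def g(p): return dot(p, inv(Q) @ p) / 2

best = max(dot(p, array([a, b])) - f(array([a, b]))
           for a in linspace(-3, 3, 301) for b in linspace(-3, 3, 301))
print(best, g(p))   # nearly equal; the maximizer is x = Q^{-1} p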
If x is a maximizer in (4.5.22), then the derivative is zero,
0 = ∇x (p · x − f (x)) =⇒ p = ∇x f (x).
p = ∇x f (x) ⇐⇒ x = ∇p g(p).
This yields
Using this, and writing out (4.5.20) for g(p) instead of f (x) (we skip the
details) yields
(p − q) · (x − a) ≥ (mL/(m + L)) |x − a|² + (1/(m + L)) |p − q|². (4.5.26)
This is derived using (4.5.25); the details are in [4]. This result is used
in gradient descent.
For the exercises below, we refer to the properties of sets defined earlier:
interior and boundary.
Exercises
Exercise 4.5.4 Let B be a ball in Rd (either open or closed). Then the span
of B is Rd .
Exercise 4.5.6 Let K be the convex hull of a dataset, and suppose the
dataset does not lie in a hyperplane. Then the mean of the dataset does
not lie in any supporting hyperplane of K.
Exercise 4.5.7 Let K be the convex hull of a dataset. Then the dataset does
not lie in a hyperplane iff K has interior. (Show the mean of the dataset is
in the interior of K: Argue by contradiction - assume the mean is on the
boundary of K.)
Exercise 4.5.8 Let K be a convex set, let x0 lie on the boundary of K, and
let m be in the interior of K. Then, apart from x0 , the line segment joining
m and x0 lies in the interior of K.
Exercise 4.5.9 If a two-class dataset does not lie in a hyperplane, then the
means of the two classes are distinct.
Chapter 5
Probability
Suppose a coin is tossed repeatedly, landing heads or tails each time. After
tossing the coin 100 times, we obtain 53 heads. What can we say about this
coin? Can we claim the coin is fair? Can we claim the probability of obtaining
heads is .53?
Whatever claims we make about the coin, they should be reliable, in that
they should more or less hold up to repeated verification.
To obtain reliable claims, we therefore repeat the above experiment 20
times, obtaining for example the following count of heads
[57, 49, 55, 44, 55, 50, 49, 50, 53, 49, 53, 50, 51, 53, 53, 54, 48, 51, 50, 53].
On the other hand, suppose someone else repeats the same experiment 20
times with a different coin, and obtains
[69, 70, 79, 74, 63, 70, 68, 71, 71, 73, 65, 63, 68, 71, 71, 64, 73, 70, 78, 67].
In this case, one suspects the two coins are statistically distinct, and have
different probabilities of obtaining heads.
In this section, we study how the probabilities of coin-tossing behave, with
the goal of answering the question: Is a given coin fair?
A ∩ Ac = A and Ac
A ∪ Ac = A or Ac
Prob(A ∪ Ac) = Prob(A or Ac) = 1.
Prob(Ac) = 1 − Prob(A).
Prob(A) + Prob(Ac) = 1.
More generally, let A and B be any two events. If A and B are mutually
exclusive, then no outcome satisfies A and B simultaneously. In this case, we
expect
are mutually exhaustive if they are mutually exclusive, and their union is
everything, meaning at least one of them must happen. In this case we must
have
P rob(A) + P rob(B) + P rob(C) + · · · = 1.
As we saw above, A and Ac are mutually exhaustive. Continuing along the
same lines, the general result for additivity is
Addition of Probabilities
If A1 , A2 , . . . , Ad are mutually exhaustive events, then
Prob(B) = Σ_{i=1}^{d} Prob(B and A_i). (5.1.2)
p + q = 1.
p2 + pq + qp + q 2 = (p + q)2 = 12 = 1.
To see why these are the correct probabilities, we use the conditional
probability definition,
Prob(A | B) = Prob(A and B) / Prob(B). (5.1.5)
Prob(X1 = 1 and X2 = 1) = p²,
Prob(X1 = 1 and X2 = 0) = pq,
Prob(X1 = 0 and X2 = 1) = qp,
Prob(X1 = 0 and X2 = 0) = q².
Prob(X2 = 1 | X1 = 0) = Prob(X1 = 0 and X2 = 1) / Prob(X1 = 0) = qp/q = p = Prob(X2 = 1),
so
P rob(X2 = 1 | X1 = 0) = P rob(X2 = 1).
Thus X1 = 0 has no effect on the probability that X2 = 1, and similarly for
the other possibilities. This is often referred to as the independence of the
coin tosses. We conclude
Here a1 , a2 , . . . are 0 or 1.
P rob(Xn = 1) = p, P rob(Xn = 0) = q = 1 − p, n ≥ 1.
as it should be.
Assume we know p = Prob(Xn = 1). Since the number of ways of choosing
k heads from n tosses is the binomial coefficient C(n, k) (see §A.2), and the
n, p, N = 5, .5, 10
k,n,p = 5, 10, .5
B = binom(n,p)
# probability of k heads
B.pmf(k)
k,n,p = 5, 10, .5
allclose(pmf1,pmf2)
returns True.
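The two arrays being compared are not shown above; one way they could have been computed (a reconstruction, not the text's listing) is the explicit binomial formula versus scipy's binom:

from numpy import allclose
from scipy.special import comb
from scipy.stats import binom

k, n, p = 5, 10, .5
pmf1 = comb(n,k) * p**k * (1-p)**(n-k)   # explicit formula
pmf2 = binom(n,p).pmf(k)                 # scipy

print(allclose(pmf1, pmf2))              # True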
Be careful to distinguish between
numpy.random.binomial and scipy.stats.binom.
The former returns samples from a binomial distribution, while the latter
returns a binomial random variable. Samples are just numbers; random vari-
ables have cdf’s, pmf’s or pdf’s, etc.
1 This result exhibits the entropy as the log of the number of combinations, or configura-
tions, or possibilities, which is the original definition of the physicist Boltzmann (1875).
Toss a coin n times, and let #n (p) be the number of outcomes where
the heads-proportion is p. Then
In more detail, using Stirling’s approximation (A.1.6), one can derive the
asymptotic equality
#_n(p) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p)},   for n large. (5.1.11)
Figure 5.1 is returned by the code below, which compares both sides of
the asymptotic equality (5.1.11) for n = 10.
from numpy import *
from matplotlib.pyplot import *
from scipy.special import comb

# absolute entropy H(p), as in the text
def H(p): return -p*log(p) - (1-p)*log(1-p)

n = 10
p = arange(0,1,.01)

def approx(n,p):
    return exp(n*H(p))/sqrt(2*n*pi*p*(1-p))

grid()
plot(p, comb(n,n*p), label="binomial coefficient")
plot(p, approx(n,p), label="entropy approximation")
title("number of tosses " + "$n=" + str(n) +"$", usetex=True)
legend()
show()
Assume a coin’s bias is q. Toss the coin n times, and let Pn (p, q) be
the probability of obtaining tosses where the heads-proportion is p.
Then
In more detail, using Stirling’s approximation (A.1.6), one can derive the
asymptotic equality
P_n(p, q) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p,q)},   for n large. (5.1.13)
The law of large numbers (§5.2) states that the heads-proportion equals
approximately q for large n. Therefore, when p ≠ q, we expect the probabilities
that the heads-proportion equals p to become successively smaller as n
gets larger, and in fact vanish when n = ∞. Since H(p, q) < 0 when p ≠ q,
(5.1.13) implies this is so. Thus (5.1.13) may be viewed as a quantitative
strengthening of the law of large numbers, in the setting of coin-tossing.
If we set
f(p) = (1 − p + cp)^n,    F(p) = (1 − p + cp)^{n+1} / ((c − 1)(n + 1)),
Notice the difference: In (5.1.10), we know the coin’s bias p, and obtain the
binomial distribution, while in (5.1.17), since we don’t know p, and there are
n + 1 possibilities 0 ≤ k ≤ n, we obtain the uniform distribution 1/(n + 1).
We now turn things around: Suppose we toss the coin n times, and obtain
k heads. How can we use this data to estimate the coin’s bias p?
To this end, we introduce the fundamental
Bayes Theorem I
Prob(A | B) = Prob(B | A) · Prob(A) / Prob(B). (5.1.18)
Prob(A | B) = Prob(A and B) / Prob(B)
            = (Prob(A and B) / Prob(A)) · (Prob(A) / Prob(B))
            = Prob(B | A) · Prob(A) / Prob(B).
Prob(p | Sn = k) = Prob(Sn = k | p) · Prob(p) / Prob(Sn = k). (5.1.19)
Because of the extra factor (n+1), this is not equal to (5.1.14). In (5.1.14),
p is fixed, and k is the variable. In (5.1.20), k is fixed, and p is the variable.
This a posteriori distribution for (n, k) = (10, 7) is plotted in Figure 5.2.
Notice this distribution is concentrated about k/n = 7/10 = .7.
The code generating Figure 5.2 is
from numpy import *
from matplotlib.pyplot import *
from scipy.special import comb

n = 10
k = 7

# a posteriori density (5.1.20): (n+1) times the binomial probability of k heads
def f(p): return (n+1) * comb(n,k) * p**k * (1-p)**(n-k)

grid()
p = arange(0,1,.01)
plot(p,f(p),color="blue",linewidth=.5)
show()
Because Bayes Theorem is so useful, here are two alternate forms. Suppose
A1 , A2 , . . . , Ad are several mutually exhaustive events, so they are mutually
exclusive and
Then by the law of total probability (5.1.8) and the first version (5.1.18), we
have the second version
Bayes Theorem II
Prob(A_i | B) = Prob(B | A_i) Prob(A_i) / Σ_{j=1}^{d} Prob(B | A_j) Prob(A_j),   i = 1, 2, . . . , d. (5.1.21)
In particular,
Prob(A | B) = Prob(B | A) Prob(A) / (Prob(B | A) Prob(A) + Prob(B | Ac) Prob(Ac)).
As an example, suppose 20% of the population are smokers, and the preva-
lence of lung cancer among smokers is 90%. Suppose also 80% of non-smokers
are cancer-free. Then what is the probability that someone who has cancer
is actually a smoker?
To use the second version, set A = smoker and B = cancer. This means
A is the event that a randomly sampled person is a smoker, and B is the
event that a randomly sampled person has cancer. Then
and
Prob(A | B) = Prob(B | A) Prob(A) / (Prob(B | A) Prob(A) + Prob(B | Ac) Prob(Ac))
            = (.9 × .2) / (.9 × .2 + .2 × .8) = .52941.
Thus the probability that a person with lung cancer is indeed a smoker is
53%.
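The same computation in Python, transcribing the numbers above:

pBA, pA = .9, .2     # Prob(B | A), Prob(A)
pBAc, pAc = .2, .8   # Prob(B | Ac), Prob(Ac)

print(pBA*pA / (pBA*pA + pBAc*pAc))   # 0.5294...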
To describe the third version of Bayes theorem, bring in the logistic func-
tion. Let
p = σ(y) = 1/(1 + e^{−y}). (5.1.22)
This is the logistic function or sigmoid function (Figure 5.3). The logistic
function takes as inputs real numbers y, and returns as outputs probabilities
p (Figure 5.4).
p = expit(y)
x∗ = −w0/w = −(1/2)(−m_H² + m_T²)/(m_H − m_T) = (m_H + m_T)/2,
which is the midpoint of the line segment joining mH and mT .
More generally, if the points x are in Rd , then the same question may be
asked, using the normal distribution with variance I in Rd (§5.5). In this
case, w is a nonzero vector, and w0 is still a scalar,
w = m_H − m_T,    w0 = −(1/2)|m_H|² + (1/2)|m_T|².
Then the cut-off or decision boundary between the two groups is the hyper-
plane
w · x + w0 = 0,
which is the hyperplane halfway between mH and mT , and orthogonal to the
vector joining mH and mT . Written this way, the probability
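Here is a small sketch of this decision boundary in Python; the two means below are made up for illustration. The point of the check is that w · x + w0 vanishes at the midpoint of mH and mT.

from numpy import array, dot

mH = array([2.0, 1.0])
mT = array([0.0, -1.0])

w = mH - mT
w0 = -dot(mH,mH)/2 + dot(mT,mT)/2

midpoint = (mH + mT)/2
print(dot(w, midpoint) + w0)   # 0.0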
Exercises
Exercise 5.1.2 A coin with bias p is tossed. What is the probability of ob-
taining 5 heads in 8 tosses?
Exercise 5.1.3 A coin with bias p is tossed 8 times and 5 heads are obtained.
What is the most likely value for p?
Exercise 5.1.4 A coin with unknown bias p is tossed 8 times and 5 heads
are obtained. Assuming a uniform prior for p, what is the probability that
p lies between 0.5 and 0.7? Use scipy.integrate.quad (§A.5) to integrate
(5.1.20) over 0.5 ≤ p ≤ 0.7.)
Exercise 5.1.5 A fair coin is tossed n times. Sometimes you get more heads
than tails, sometimes the reverse. If you’re really lucky, the number of heads
may equal exactly the number of tails. What is the least n for which the
probability of this happening is less than 10%?
5.2 Probability
or
000, 001, 010, 011, 100, 101, 110, 111.
• The sample space is the set S of all possible outcomes. If #(S) is the num-
ber of outcomes in S, then for the four experiments above, we have #(S)
equals 2, 6, 36, and 8. The sample space S is also called the population.
• An event is a specific subset E of S. For example, when rolling two dice,
E can be the outcomes where the sum of the dice equals 7. In this case,
the outcomes in E are
(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1),
Prob(E) = #(E) / #(S).
For example,
1. A coin is fair if the outcomes are equally likely. For one toss of a fair
coin, P rob(heads) = 1/2.
2. More generally, tossing a coin results in outcomes
P rob(head) = p, P rob(tail) = 1 − p,
Roll two six-sided dice. Let A be the event that at least one dice is an even
number, and let B be the event that the sum is 6. Then
A = {(2, ∗), (4, ∗), (6, ∗), (∗, 2), (∗, 4), (∗, 6)} .
B = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} .
The intersection of A and B is the event of outcomes in both events:
A or B = {(2, ∗), (4, ∗), (6, ∗), (∗, 2), (∗, 4), (∗, 6), (1, 5), (3, 3), (5, 1)} .
not A = {(1, 1), (1, 3), (1, 5), (3, 1), (3, 3), (3, 5), (5, 1), (5, 3), (5, 5)} .
Clearly #(B) = 5.
The difference of A minus B is the event of outcomes in A but not in B:
A − B = A and not B
= {(2, ∗ except 4), (4, ∗ except 2), (6, ∗), (∗ except 4, 2), (∗ except 2, 4), (∗, 6)} .
Similarly,
B − A = {(1, 5), (3, 3), (5, 1)} .
Then A − B is part of A and B − A is part of B, A ∩ B is part of both, and
all are part of A ∪ B.
Hence
Prob(A) = 27/36 = 3/4,    Prob(B) = 5/36.
Events A and B are independent if
Prob(A | B) = Prob(A and B) / Prob(B).
Prob(A | B) = Prob(A ∩ B) / Prob(B) = (2/36)/(5/36) = 2/5
and
Prob(B = 0 and G = 1) = Prob(G = 1 | 1 child) Prob(1 child) = (1/2) · 0.20 = 0.1,
and
Prob(B = 1 and G = 2) = Prob(G = 2 | 3 children) Prob(3 children) = (3/8) · 0.30 = 0.1125.
Continuing in this manner, the complete table is
p = .5
n = 10
N = 20
v = binomial(n,p,N)
print(v)
returns
[9 6 7 4 4 4 3 3 7 5 6 4 6 9 4 5 4 7 6 7]
p = .5
for n in [5,50,500]: print(binomial(n,p,1))
This returns the count of heads after 5 tosses, 50 tosses, and 500 tosses,
3, 28, 266
The proportions are the count divided by the total number of tosses in the
experiment. For the above three experiments, the proportions after 5 tosses,
50 tosses, and 500 tosses, are
Fig. 5.8 100,000 sessions, with 5, 15, 50, and 500 tosses per session.
Now we repeat each experiment 100,000 times and we plot the results in
a histogram.
N = 100000
p = .5

for n in [5,50,500]:
    data = binomial(n,p,N)
    hist(data,bins=n,edgecolor='Black')

grid()
show()
The takeaway from these graphs are the two fundamental results of prob-
ability:
For large sample size, the shape of the graph of the proportions or
counts is approximately normal. The normal distribution is studied in
§5.4. Another way of saying this is: For large sample size, the shape
of the sample mean histogram is approximately normal.
The law of large numbers is qualitative and the central limit theorem is
quantitative. While the law of large numbers says one thing is close to another,
it does not say how close. The central limit theorem provides a numerical
measure of closeness, using the normal distribution.
One may think that the LLN and the CLT above depend on some aspect
of the binomial distribution. After all, the binomial is a specific formula and
something about this formula may lead to the LLN and the CLT. To show
that this is not at all the case, to show that the LLN and the CLT are
universal, we bring in the petal lengths of the Iris dataset. This time the
experiment is not something we invent, it is a result of something arising in
nature, Iris petal lengths.
iris = datasets.load_iris()
dataset = iris["data"]
iris["feature_names"]
This code shows the petal lengths are the third feature in the dataset, and
we compute the mean of the petal lengths using
petal_lengths = dataset[:,2]
mean(petal_lengths)
This returns the petal length population mean µ = 3.758. If we plot the
petal lengths in a histogram with 50 bins using the code
grid()
hist(petal_lengths,bins=50)
show()
# n = batch_size
def random_batch_mean(n):
    rng.shuffle(petal_lengths)
    return mean(petal_lengths[:n])

random_batch_mean(5)
This code shuffles the dataset, then selects the first n petal lengths, then
returns their mean.
To sample a single petal length randomly 100,000 times, we run the code
N = 100000
n = 1
Since we are sampling single petal lengths, here we take n = 1. This code
returns the histogram in Figure 5.10.
In Figure 5.9, the bin heights add up to 150. In Figure 5.10, the bin
heights add up to 100,000. Moreover, while the shapes of the histograms are
almost identical, a careful examination shows the histograms are not identical.
Nevertheless, there is no essential difference between the two figures.
Fig. 5.11 Iris petal lengths batch means sampled 100,000 times, batch sizes 3, 5, 20.
Now repeat the same experiment, but with batches of various sizes, and
plot the resulting histograms. If we do this with batches of size n = 3, n = 5,
n = 20 using
figure(figsize=(8,4))
# three subplots
rows, cols = 1, 3
N = 100000
show()
Exercises
Exercise 5.2.3 [30] Approximately 80,000 marriages took place in New York
last year. Assuming any day is equally likely, what is the probability that for
at least one of these couples, both partners were born on January 1? Both
partners celebrate their birthdays on the same day of the year?
def sums(dataset,k):
    if k == 1: return dataset
    else:
        s = sums(dataset,k-1)
        return array([ a+b for a in dataset for b in s ])

for k in range(1,5):
    s = sums(dataset,k)
    grid()
    hist(s,bins=50,edgecolor="k")
    show()
for k = 1, 2, 3, 4, . . . . What does this code do? What does it return? What
pattern do you see? What if dataset were changed? What if the samples in
the dataset were vectors?
Exercise 5.2.5 Let A and B be any events, not necessarily exclusive. Let
B − A be the event of A occurring and B not occurring. Show
Exercise 5.2.6 Let A and B be any events, not necessarily exclusive. Extend
(5.1.1) to show
Exercise 5.2.7 [30] There is a 60% chance an event A will occur. If A does
not occur, there is a 10% chance B occurs. What is the chance A or B occurs?
(Start with two events, then go from two to three events.) With a = P rob(Ac ),
b = P rob(B c ), c = P rob(C c ), this exercise is the same as Exercise A.3.4.
for this quantity, then we are asking to compute P rob(a < X < b). If we
don’t know anything about X, then we can’t figure out the probability, and
there is nothing we can say. Knowing something about X means knowing
the distribution of X: Where X is more likely to be and where X is less
likely to be. In effect, a random variable is a quantity X whose probabilities
P rob(a < X < b) can be computed.
Then E(X) is the mean of the random variable X associated to the dataset.
Similarly,
E(X²) = (1/N) Σ_{k=1}^{N} x_k²
P (X = a) = p, P (X = b) = q, P (X = c) = r.
E(X) = ap + bq + cr.
V ar(X) = E(X 2 ) − µ2 .
E(X) = x1 p1 + x2 p2 + . . . . (5.3.3)
E(1) = p1 + p2 + · · · = 1.
p_j = Prob(X = x_j)
    = Prob(X = x_j and Y = y_1) + Prob(X = x_j and Y = y_2) + . . .
    = r_{j1} + r_{j2} + · · · = Σ_k r_{jk}.
Similarly,
q_k = r_{1k} + r_{2k} + · · · = Σ_j r_{jk}.
We conclude
E(X + Y ) = E(X) + E(Y ).
Since we already know E(aX) = aE(X), this derives linearity.
The variance measures the spread of X about its mean. Since the mean of
aX is aµ, the variance of aX is the mean of (aX − aµ)2 = a2 (X − µ)2 . Thus
V ar(aX) = a2 V ar(X).
However, the variance of a sum X + Y is not simply the sum of the variances
of X and Y : This only happens if X and Y are independent, see (5.3.19).
Using (5.3.2), we can view a dataset as the samples of a random variable
X. In this case, the mean and variance of X are the same as the mean and
variance of the dataset, as defined by (1.5.1) and (1.5.2).
When X is a constant, then X = µ, so V ar(X) = 0. Conversely, if
V ar(X) = 0, then by definition
This displays the variance in terms of the first moment E(X) and the second
moment E(X 2 ). Equivalently,
P rob(X = 1) = p, P rob(X = 0) = 1 − p.
E(X²) = 1² · Prob(X = 1) + 0² · Prob(X = 0) = p.
From this,
M′(t) = E(X e^{tX}).
When t = 0,
M′(0) = E(X) = µ.
Similarly, since the derivative of log x is 1/x, for the cumulant-generating
function,
Z′(0) = M′(0)/M(0) = E(X) = µ.
The second derivative of M(t) is
M′′(t) = E(X² e^{tX}),
Definition of Uncorrelated
Random variables X and Y are uncorrelated if
We investigate when X and Y are uncorrelated. Here a > 0, b > 0, and c > 0.
First, because the total probability equals 1,
a + 2b + c = 1. (5.3.13)
Also we have
and
E(X) = a − c, E(Y ) = a + b.
Now X and Y are uncorrelated if
from sympy import symbols, solve

a,b,c = symbols('a,b,c')
eq1 = a + 2*b + c - 1
eq2 = a - b - (a-c)*(a+b)

solutions = solve([eq1,eq2],a,b)
print(solutions)
Definition of Independence
for all positive powers n and m. When X and Y are discrete, this is
equivalent to the events X = x and Y = y being independent, for
every value x of X and every value y of Y .
a − b = (a − b)(a + b).
Thus U and Y are uncorrelated when (5.3.17) holds, for any choice of c.
However, since X 2 = 1 and Y 2 = Y , U 2 = Y , so U 2 and Y are always
correlated, unless Y is constant. Hence U and Y are never independent, unless
Y is constant. Note Y is a constant when a = 1 or c = 1.
Expanding the exponentials into their series, and using (5.3.16), one can show
M_X(t) = (1/6)(e^{7t} − e^t)/(e^t − 1).
By Exercise 5.3.1 again,
M_{X+Y}(t) = (1/12)(e^{13t} − e^t)/(e^t − 1),
It follows, by (5.3.18),
(1/12)(e^{13t} − e^t)/(e^t − 1) = (1/6)(e^{7t} − e^t)/(e^t − 1) · M_Y(t).
Factoring
we obtain
M_Y(t) = (1/2)(e^{6t} + 1).
This says
Prob(Y = 0) = 1/2,    Prob(Y = 6) = 1/2,
and all other probabilities are zero.
Sn = X1 + X2 + · · · + Xn .
Then
The next simplest discrete random variable is the binomial random vari-
able Sn ,
Sn = X1 + X2 + · · · + Xn
obtained from n independent Bernoulli random variables.
Then Sn has values 0, 1, 2, . . . , n, and the probability mass function
p(x) = C(n, x) p^x (1 − p)^{n−x},  if x = 0, 1, 2, . . . , n,
p(x) = 0,  otherwise.
Since the cdf F (x) is the sum of the pmf p(k) for k ≤ x, the code
n, p = 8, .5
B = binom(n,p)

# print k, pmf, cdf for k = 0,...,n (loop reconstructed from the output below)
for k in range(n+1): print(k, B.pmf(k), B.cdf(k))

returns
0 0.003906250000000007 0.00390625
1 0.031249999999999983 0.03515625
2 0.10937500000000004 0.14453125
3 0.21874999999999992 0.36328125
4 0.27343749999999994 0.63671875
5 0.2187499999999999 0.85546875
6 0.10937500000000004 0.96484375
7 0.031249999999999983 0.99609375
8 0.00390625 1.0
Since
E(p̂_n) = p,    Var(p̂_n) = p(1 − p)/n. (5.3.21)
By the binomial theorem, the moment-generating function is
E(e^{tS_n}) = Σ_{k=0}^{n} e^{tk} C(n, k) p^k (1 − p)^{n−k} = (p e^t + 1 − p)^n.
Here the integration is over the entire range of the random variable: If X
takes values in the interval [a, b], the integral is from a to b. For a normal
random variable, the range is (−∞, ∞). For a chi-squared random variable,
the range is (0, ∞). Below, when we do not specify the limits of integration,
the integral is taken over the whole range of X.
More generally, let f(x) be a function. The mean of f(X) or expectation
of f(X) is
E(f(X)) = ∫ f(x) p(x) dx. (5.3.23)
This only holds when the integral is over the complete range of X. When this
is not so,
Prob(a < X < b) = ∫_a^b p(x) dx
is the green area in Figure 5.15. Thus
Since
F(x) = (1/2)x²   =⇒   F′(x) = x,
by the fundamental theorem of calculus (A.5.2),
E(X) = ∫_0^1 x dx = F(1) − F(0) = 1/2.
In particular, if [a, b] = [−1, 1], then the mean is zero, the variance is 1/3,
and
E(f(X)) = (1/2) ∫_{−1}^{1} f(x) dx.
When X is discrete,
F(x) = Σ_{x_k ≤ x} p_k.
When X is continuous,
F(x) = ∫_{−∞}^{x} p(z) dz.
Then each green area in Figure 5.15 is the difference between two areas,
F (b) − F (a).
             discrete                                       continuous
density      pmf                                            pdf
distribution cdf                                            cdf
sum          cdf(x) = sum([ pmf(k) for k in range(x+1) ])   cdf(x) = integrate(pdf,x)
difference   pmf(k) = cdf(k) - cdf(k-1)                     pdf(x) = derivative(cdf,x)
Table 5.17 summarizes the situation. For the distribution on the left in
Figure 5.15, the cumulative distribution function is in Figure 5.18.
Let X and Y be independent uniform random variables on [0, 1], and let
Z = max(X, Y ). We compute the pdf p(x), the cdf F (x), and the mean of
Z. By definition of max(X, Y ),
Prob(X ≤ x) Prob(Y ≤ x) = x².
Hence
F(x) = Prob(max(X, Y) ≤ x) = 0 if x < 0,   x² if 0 ≤ x ≤ 1,   1 if x > 1.
From this,
p(x) = F′(x) = 0 if x < 0,   2x if 0 ≤ x ≤ 1,   0 if x > 1.
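A quick simulation confirms these formulas: the sample mean of Z should be near E(Z) = ∫₀¹ x · 2x dx = 2/3.

from numpy import maximum, mean
from numpy.random import default_rng

rng = default_rng()
X = rng.random(100000)
Y = rng.random(100000)
Z = maximum(X, Y)

print(mean(Z))   # approximately 2/3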
Prob(X = k) = e^{−λ} · λ^k/k!,   k = 0, 1, 2, . . . . (5.3.29)
Here λ > 0. From the exponential series (A.3.12),
Σ_{k=0}^{∞} Prob(X = k) = e^{−λ} Σ_{k=0}^{∞} λ^k/k! = 1,
so the total probability is one. The Python code for a Poisson random variable
is
lamda = 1
P = poisson(lamda)
The mean and variance of a Poisson with parameter λ are both λ (Exer-
cise 5.3.13).
E(X n ) = E(Y n ), n ≥ 1.
for every interval [a, b], and equivalent to having the same moment-
generating functions,
MX (t) = MY (t)
for every t.
On the other hand, let X be any random variable, and let Y = X. Then
X and Y are identically distributed, but are certainly correlated. So identical
distribution does not imply independence, nor vice-versa.
Let X be a random variable. A simple random sample of size n is a sequence
of random variables X1 , X2 , . . . , Xn that are independent and identically
distributed. We also say the sequence X1 , X2 , . . . , Xn is an i.i.d. sequence
(independent identically distributed).
For example, going back to the smartphone example, suppose we select n
students at random, where we are allowed to select the same student twice.
We obtain numbers x1 , x2 , . . . , xn . So the result of a single selection experi-
ment is a sequence of numbers x1 , x2 , . . . , xn . To make statistical statements
about the results, we repeat this experiment many times, and we obtain a
sequence of numbers x1 , x2 , . . . , xn each time.
This process can be thought of n machines producing x1 , x2 , . . . , xn each
time, or n random variables X1 , X2 , . . . , Xn (Figure 5.19). By making each
of the n selections independently, we end up with an i.i.d. sequence, or a
simple random sample.
Then
E(X̄_n) = (1/n)(E(X_1) + E(X_2) + · · · + E(X_n)) = (1/n) · nµ = µ.
We conclude the mean of the sample mean equals the population mean.
Now let σ 2 be the common variance of X1 , X2 , . . . , Xn . By (5.3.19), the
variance of Sn is nσ 2 , hence the variance of X̄n is σ 2 /n. Summarizing,
E(X̄_n) = µ,    Var(X̄_n) = σ²/n, (5.3.30)
and
√n (X̄_n − µ)/σ (5.3.31)
is standard.
Exercises
M_X(t) = (1/(b − a)) · (e^{tb} − e^{ta})/(e^t − 1).
Exercise 5.3.2 Let A and B be events and let X and Y be the Bernoulli
random variables corresponding to A and B (5.3.8). Show that A and B are
independent (5.2.1) if and only if X and Y are independent (5.3.16).
Exercise 5.3.3 [30] Let X be a binomial random variable with mean 7 and
variance 3.5. What are P rob(X = 4) and P rob(X > 14)?
Exercise 5.3.4 The proportion of adults who own a cell phone in a certain
Canadian city is believed to be 90%. Thirty adults are selected at random
from the city. Let X be the number of people in the sample who own a cell
phone. What is the distribution of the random variable X?
Exercise 5.3.5 If two random samples of sizes n1 and n2 are selected inde-
pendently from two populations with means µ1 and µ2 , show the mean of the
Exercise 5.3.7 [30] You arrive at the bus stop at 10:00am, knowing the bus
will arrive at some time uniformly distributed during the next 30 minutes.
What is the probability you have to wait longer than 10 minutes? Given that
the bus hasn’t arrived by 10:15am, what is the probability that you’ll have
to wait at least an additional 10 minutes?
Exercise 5.3.9 Let B and G be the number of boys and the number of girls
in a randomly selected family with probabilities as in Table 5.7. Are B and
G independent? Are they identically distributed?
Exercise 5.3.13 Let X be Poisson with parameter λ. Show both E(X) and
V ar(X) equal λ (Use (5.3.10).)
Sn = X1 + X2 + · · · + Xn
E(relu(S_n − n)) = e^{−n} · n^{n+1}/n!.
(Use Exercise A.1.2.)
Exercise 5.3.17 Suppose X is a logistic random variable (5.3.28). Show the
probability density function of X is σ(x)(1 − σ(x)).
Exercise 5.3.18 Suppose X is a logistic random variable (5.3.28). Show the
mean of X is zero.
Exercise 5.3.19 Suppose X is a logistic random variable (5.3.28). Use
(A.3.16) with a = −e−x to show the variance of X is
4 Σ_{n=1}^{∞} (−1)^{n−1}/n² = 4 (1 − 1/4 + 1/9 − 1/16 + . . . ).
Under this interpretation, this probability corresponds to the area under the
graph (Figure 5.20) between the vertical lines at a and at b, and the total
area under the graph corresponds to a = −∞ and b = ∞.
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import norm

# mean and standard deviation (values here are an illustration)
mu, sdev = 0, 1
Z = norm(mu,sdev)

grid()
z = arange(mu-3*sdev,mu+3*sdev,.01)
p = Z.pdf(z)
plot(z,p)
show()
The curious constant √(2π) in (5.4.1) is inserted to make the total area
under the graph equal to one. That this is so arises from the fact that 2π is
the circumference of the unit circle. Using Python, we see √(2π) is the correct
constant, since the code
allclose(I, sqrt(2*pi))
returns True.
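The quantity I compared above is presumably the total area under e^{−z²/2} (the earlier snippet computing it is not shown here); one way to compute it is with scipy's quad:

from numpy import exp, sqrt, pi, inf, allclose
from scipy.integrate import quad

I, err = quad(lambda z: exp(-z**2/2), -inf, inf)
print(allclose(I, sqrt(2*pi)))   # True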
The mean of Z is
E(Z) = ∫ z p(z) dz,
with the integral computed using the fundamental theorem of calculus (A.5.2)
or Python.
E(Z) = 0,    Var(Z) = 1
From this, the odd moments of Z are zero, and the even moments are
E(Z^{2n}) = (2n)!/(2^n n!),   n = 0, 1, 2, . . .
By separating the even and the odd factors, this simplifies to
For example,
lim_{n→∞} X̄_n = µ.
The CLT says for large sample size, the sample mean is approximately
normal with mean µ and variance σ 2 /n. More exactly,
Let
Z̄_n = √n (X̄_n − µ)/σ
be the standardized sample mean, and let Z be a standard normal
random variable. Then
lim_{n→∞} Prob(a < Z̄_n < b) = Prob(a < Z < b)
for every t.
Toss a coin n times, assume the coin’s bias is p, and let Sn be the number
of heads. Then, by (5.3.20), Sn is binomial with mean µ = np and standard
deviation σ = √(np(1 − p)). By the CLT, Sn is approximately normal with
the same mean and sdev, so the cumulative distribution function of Sn ap-
proximately equals the cumulative distribution function of a normal random
variable with the same mean and sdev.
Fig. 5.21 The binomial cdf and its CLT normal approximation.
The code
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import binom, norm

n, p = 100, pi/4
mu = n*p
sigma = sqrt(n*p*(1-p))

B = binom(n,p)
Z = norm(mu,sigma)

# plot the two cdf's (plotting step reconstructed)
k = arange(0,n+1)
plot(k, B.cdf(k), label="binomial cdf")
plot(k, Z.cdf(k), label="normal cdf")

grid()
legend()
show()
If the samples of the dataset are equally likely, then sampling the dataset
results in a random variable X, with expectations given by (5.3.2). It follows
that X is standard, and the moment-generating function of X is
E(e^{tX}) = (1/N) Σ_{k=1}^{N} e^{t x_k}.
Since the mean and variance of X are zero and 1, taking expectations of both
sides,
E(e^{tX/√n}) = 1 + t²/(2n) + . . . .
From this,
M_n(t) = (1 + t²/(2n) + . . . )^n.
By the compound-interest formula (A.3.8) (the missing terms . . . don't affect
the result)
lim_{n→∞} M_n(t) = e^{t²/2},
we expect the chance that Z < 0 should equal 1/2. In other words, because
of the symmetry of the curve, we expect to be 50% confident that Z < 0, or
0 is at the 50-th percentile level. So
p = Z.cdf(z)
z = Z.ppf(p)
ppf is the percentile point function, and cdf is the cumulative distribution
function.
In Figure 5.23, the red areas are the lower tail p-value P rob(Z < z), the
two-tail p-value P rob(|Z| > z), and the upper tail p-value P rob(Z > z).
By symmetry of the graph, upper-tail and two-tail p-values can be com-
puted from lower tail p-values.
and
P rob(|Z| < z) = P rob(−z < Z < z) = P rob(Z < z) − P rob(Z < −z),
and
P rob(Z > z) = 1 − P rob(Z < z).
To go backward, suppose we are given P rob(|Z| < z) = p and we want
to compute the cutoff z. Then P rob(|Z| > z) = 1 − p, so P rob(Z > z) =
(1 − p)/2. This implies
In Python,
# p = P(|Z| < z)
z = Z.ppf((1+p)/2)
p = Z.cdf(z) - Z.cdf(-z)
Now let’s zoom in closer to the graph and mark off z-scores 1, 2, 3 on the
horizontal axis to obtain specific colored areas as in Figure 5.24. These areas
are governed by the 68-95-99 rule (Table 5.25). Our confidence that |Z| < 1
equals the blue area 0.685, our confidence that |Z| < 2 equals the sum of the
blue plus green areas 0.955, and our confidence that |Z| < 3 equals the sum
of the blue plus green plus red areas 0.997. This is summarized in Table 5.25.
Fig. 5.24 68%, 95%, 99% confidence cutoffs for standard normal.
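These areas can be computed directly from the standard normal cdf:

from scipy.stats import norm

Z = norm(0,1)
for k in [1, 2, 3]:
    print(k, Z.cdf(k) - Z.cdf(-k))   # approximately 0.68, 0.95, 0.997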
The possibility |Z| > 1 is called a 1-sigma event, |Z| > 2 a 2-sigma event,
and so on. So a 2-sigma event is 95.5% unlikely, or 4.5% likely. An event is
considered statistically significant if it’s a 2-sigma event or more. In other
words, something is significant if it’s unlikely. A six-sigma event |Z| > 6 is
two in a billion. You want a plane crash to be six-sigma.
These terms are defined for two-tail p-values. The same terms may be used
for upper-tail or lower tail p-values.
Figure 5.24 is not to scale, because a 1-sigma event should be where the
curve inflects from convex to concave (in the figure this happens closer to
2.7). Moreover, according to Table 5.25, the left-over white area should be
.03% (3 parts in 10,000), which is not what the figure suggests.
In general, the normal distribution is not centered at the origin, but else-
where. We say X is normal with mean µ and standard deviation σ if
Z = (X − µ)/σ
is distributed according to a standard normal. We write N (µ, σ) for the nor-
mal with mean µ and standard deviation σ. As its name suggests, it is easily
checked that such a random variable X has mean µ and standard deviation
σ. For the normal distribution with mean µ and standard deviation σ, the
cutoffs are as in Figure 5.27. In Python, norm(mu,sigma) returns the normal
with mean mu and standard deviation sigma.
P rob(Z < (7 − µ)/σ) = .15, and P rob(Z < (19 − µ)/σ) = .9.
a = Z.ppf(.15)
b = Z.ppf(.9)
σ = (19 − 7)/(b − a),    µ = 7 − aσ.
Z = √n · (X̄ − µ)/σ,
then compute standard normal probabilities.
then compute standard normal probabilities.
Here are two examples. In the first example, suppose student grades are
normally distributed with mean µ = 80 and variance σ 2 = 16. This says the
average of all grades is 80, and the standard deviation is σ = 4. If a grade is
g, the standardized grade is
z = (g − µ)/σ = (g − 80)/4.
A student is picked and their grade was g = 84. Is this significant? Is it highly
significant? In effect, we are asking, how unlikely is it to obtain such a grade?
Remember,
significant = unlikely
Since the standard deviation is 4, the student’s z-score is
z = (g − 80)/4 = (84 − 80)/4 = 1.
or .13%. Since the upper-tail p-value is less than 1%, yes, this sample average
grade is both significant and highly significant.
The same grade, g = 84, is not significant for a single student, but is
significant for nine students. This is a reflection of the law of large numbers,
which says the sample mean approaches the population mean as the sample
size grows.
Suppose student grades are normally distributed with mean 80 and vari-
ance 16. How many students should be sampled so that the chance that at
least one student’s grade lies below 70 is at least 50%?
To solve this, if p is the chance that a single student has a grade below 70,
then 1 − p is the chance that the student has a grade above 70. If n is the
sample size, (1 − p)n is the chance that all sample students have grades above
70. Thus the requested chance is 1 − (1 − p)n . The following code shows the
answer is n = 112.
z = 70
mean, sdev = 80, 4
p = Z(mean,sdev).cdf(z)
for n in range(2,200):
    q = 1 - (1-p)**n
    print(n, q)
Here is the code for computing tail probabilities for the sample mean X̄
drawn from a normally distributed population with mean µ and standard
deviation σ. When n = 1, this applies to a single normal random variable.
########################
# P-values
########################
def pvalue(mean,sdev,n,xbar,type):
    Xbar = Z(mean,sdev/sqrt(n))
    if type == "lower-tail": p = Xbar.cdf(xbar)
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    elif type == "two-tail": p = 2*(1 - Xbar.cdf(abs(xbar)))
    else:
        print("What's the tail type (lower-tail, upper-tail, two-tail)?")
        return
    print("sample size: ",n)
    print("mean,sdev,xbar: ",mean,sdev,xbar)
    print("mean,sdev,n,xbar: ",mean,sdev,n,xbar)
    print("p-value: ",p)
    z = sqrt(n) * (xbar - mean) / sdev
    print("z-score: ",z)
type = "upper-tail"
mean = 80
sdev = 4
n = 1
xbar = 90
pvalue(mean,sdev,n,xbar,type)
Exercises
Exercise 5.4.1 Let X be a normal random variable and suppose P rob(X <
1) = 0.3, and Prob(X < 2) = 0.4. What are the mean and variance of X?
Exercise 5.4.2 [27] Consider a normal distribution curve where the middle
90% of the area under the curve lies above the interval (4, 18). Use this
information to find the mean and the standard deviation of the distribution.
Exercise 5.4.3 Let Z be a normal random variable with mean 30.4 and
standard deviation of 0.7. What is P rob(29 < Z < 31.1)?
Exercise 5.4.4 [27] Consider a normal distribution where the 70th percentile
is at 11 and the 25th percentile is at 2. Find the mean and the standard
deviation of the distribution.
Exercise 5.4.6 Suppose the scores of students are normally distributed with
a mean of 80 and a standard deviation of 4. A sample of size n is selected,
and the sample mean is 84. What is the least n for which this is significant?
What is the least n for which this is highly significant?
Exercise 5.4.7 [27] A manufacturer says their laser printers’ printing speeds
are normally distributed with mean 17.63 ppm and standard deviation 4.75
ppm. An i.i.d. sample of n = 11 printers is selected, with speeds X1 , X2 , . . . ,
Xn . What is the probability the sample mean speed X̄ is greater than 18.53
ppm?
Exercise 5.4.8 [27] Continuing Exercise 5.4.7, let Yk be the Bernoulli ran-
dom variable corresponding to the event Xk > 18 (5.3.8),
Y_k = 1 if X_k > 18, and Y_k = 0 otherwise.
We count the proportion of printers in the sample having speeds greater than
18 by setting
p̂ = (Y_1 + Y_2 + · · · + Y_n)/n.
Compute E(p̂) and V ar(p̂). Use the CLT to compute the probability that
more than 50.9% of the printers have speeds greater than 18.
Exercise 5.4.9 [27] The level of nitrogen oxides in the exhaust of a particular
car model varies with mean 0.9 grams per mile and standard deviation 0.19
grams per mile . What sample size is needed so that the standard deviation
of the sampling distribution is 0.01 grams per mile?
Exercise 5.4.10 [27] The scores of students had a normal distribution with
mean µ = 559.7 and standard deviation σ = 28.2. What is the probability
that a single randomly chosen student scores 565 or higher? Now suppose
n = 30 students are sampled, assume i.i.d. What are the mean and standard
deviation of the sample mean score? What z-score corresponds to the mean
score of 565? What is the probability that the mean score is 565 or higher?
Exercise 5.4.11 Complete the square in the moment-generating function of
the standard normal pdf and use (5.4.3) to derive (5.4.4).
Exercise 5.4.12 Let Z be a standard normal random variable, and let
relu(x) be as in Exercise 5.3.16. Show
E(relu(Z)) = 1/√(2π).
(Use the fundamental theorem of calculus (A.5.2).)
Exercise 5.4.13 [7] Let X1 , X2 , . . . , Xn be i.i.d. Poisson random variables
(5.3.29) with parameter 1, let Sn = X1 + X2 + · · · + Xn , and let X̄n = Sn /n
be the sample mean. Then the mean of X̄n is 1, and the variance of X̄n is
1/n, so
Z̄n = √n (X̄n − 1) = (Sn − n)/√n
is standard (5.3.31). By the CLT, Z̄n is approximately standard normal for
large n. Use this to derive Stirling’s approximation (A.1.6). (Insert f (x) =
relu(x) in (5.4.7), then use Exercises 5.3.16 and 5.4.12.)
Prob(X² + Y² ≤ 1)?

Fig. 5.28 (X, Y) inside the square and inside the disk.
Since

1/√(1 − 2u) = E(e^{uU}) = Σ_{n≥0} (uⁿ/n!) E(Uⁿ),
But this equals the right side of (5.4.5). Thus the left sides of (5.4.5) and
(5.5.1) are equal. This shows
Going back to the question posed at the beginning of the section, we have X and Y independent standard normal, and we want

Prob(X² + Y² ≤ 1).
d = 2
u = 1
U(d).cdf(u)
returns 0.39.
u = arange(0,15,.01)
for d in range(1,7):
    p = U(d).pdf(u)
    plot(u,p,label="d = "+str(d))

ylim(0,.6)
grid()
legend()
show()

3 Geometrically, the p-value Prob(U > 1) is the probability that a normally distributed point in d-dimensional space is outside the unit sphere.
and

Var(U) = Σ_{k=1}^d Var(Zk²) = Σ_{k=1}^d 2 = 2d.
We conclude
Because

1/(1 − 2t)^{d/2} · 1/(1 − 2t)^{d′/2} = 1/(1 − 2t)^{(d+d′)/2},

we obtain
X = (X1 , X2 , . . . , Xn )
in Rn .
Random vectors have means, variances, moment-generating functions,
and cumulant-generating functions, just like scalar-valued random variables.
Moreover we can have simple random samples of random vectors X1 , X2 ,
. . . , Xn .
If X is a random vector in R^d, its mean is the vector µ = E(X), and its variance is the d × d matrix

Q = E((X − µ) ⊗ (X − µ)).
By (1.4.18),

w · Qw = E(((X − µ) · w)²).  (5.5.3)
Thus the variance of a random vector is a nonnegative matrix.
A random vector is standard if µ = 0 and Q = I. If X is standard, then
In §2.2, we defined the mean and variance of a dataset (2.2.14). The mean and variance defined there are the same as the mean and variance defined here, that of a random variable.
To see this, we must build a random variable X corresponding to a dataset
x1 , x2 , . . . , xN . But this was done in (5.3.2). The moral is: every dataset may
be interpreted as a random variable.
Uncorrelated Chi-squared
p(z) = (2π)^{−d/2} e^{−|z|²/2}.  (5.5.6)
Using this, we can plot the probability density function of a normal random
vector in R2 ,
%matplotlib ipympl
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import multivariate_normal as Z
# standard normal
mu = array([0,0])
Q = array([[1,0],[0,1]])
x = arange(-3,3,.01)
y = arange(-3,3,.01)
xy = cartesian_product(x,y)
# last axis of xy is fed into pdf
z = Z(mu,Q).pdf(xy)
ax = axes(projection='3d')
ax.set_axis_off()
x,y = meshgrid(x,y)
ax.plot_surface(x,y,z, cmap='cool')
show()
Then

M_{X,Y}(w) = E(e^{w·(X,Y)}) = e^{w·Qw/2} = M_X(u) M_Y(v) e^{(u·Bv + v·Bᵗu)/2}.
From this, X and Y are independent when B = 0. Thus, for normal random
vectors, independence and uncorrelatedness are the same.
Correlated Chi-squared
E = Uᵗ QU,  Q⁺ = U E⁺ Uᵗ,

and

E = diag(λ₁, λ₂, . . . , λᵣ, 0, . . . , 0),  E⁺ = diag(1/λ₁, 1/λ₂, . . . , 1/λᵣ, 0, . . . , 0).
so X · µ = 0.
By Exercise 2.6.7, Q⁺ = Q. Since X · µ = 0,

X · Q⁺X = X · QX = X · (X − (X · µ)µ) = |X|².

We conclude
Singular Chi-squared
We use the above to derive the distribution of the sample variance. Let
X1 , X2 , . . . , Xn be a random sample, and let X̄ be the sample mean,
X̄ = (X1 + X2 + · · · + Xn)/n.
Let S² be the sample variance,

S² = (1/(n − 1)) Σ_{k=1}^n (Xk − X̄)².

Let

1 = (1, 1, . . . , 1)

be in Rⁿ, and let µ = 1/√n. Then µ is a unit vector and
Z · µ = (1/√n) Σ_{k=1}^n Zk = √n Z̄.
Since Z1, Z2, . . . , Zn are i.i.d. standard, Z · µ = √n Z̄ is standard.
Now let U = I − µ ⊗ µ and X = UᵗZ.
Then the mean of X is zero. Since Z has variance I, by Exercises 2.2.2 and 5.5.5,

Var(X) = UᵗIU = U² = U = I − µ ⊗ µ.
By singular chi-squared above,

(n − 1)S² = |X|²
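A quick simulation of the sample-variance statistic (a minimal sketch, assuming a standard normal sample so that σ = 1, in which case (n − 1)S² behaves like a chi-squared variable with n − 1 degrees of freedom):

from numpy import var
from numpy.random import randn
from scipy.stats import chi2

n, N = 10, 100000
samples = randn(N, n)                          # N samples of size n
s2 = var(samples, axis=1, ddof=1)              # sample variances S^2
u = (n - 1) * s2                               # (n-1)S^2
# compare empirical mean and variance with chi-squared(n-1)
print(u.mean(), u.var())                       # approximately 9 and 18
print(chi2(n - 1).mean(), chi2(n - 1).var())   # exactly 9 and 18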
Exercises
Exercise 5.5.3 Continuing the previous problem with n = 20, use the CLT
to estimate the probability that fewer than 50% of the points lie in the unit
disk. Is this a 1-sigma event, a 2-sigma event, or a 3-sigma event?
Exercise 5.5.4 Let X be a random vector with mean zero and variance Q. Show v is a zero variance direction (§2.5) for Q iff X · v = 0.
Exercise 5.5.5 Let µ and Q be the mean and variance of a random d-vector
X, and let A be any N × d matrix. Then AX is a random vector with mean
Aµ and variance AQAt .
Y1²/λ1 + Y2²/λ2 + · · · + Yr²/λr

is chi-squared with degree r.
Exercise 5.5.7 If X is a random vector with mean zero and variance Q, then E((X · u)(X · v)) = u · Qv. (Insert w = u + v in (5.5.3).)
Exercise 5.5.8 Assume the classes of the Iris dataset are normally dis-
tributed with their means and variances (Exercise 2.2.8), and assume the
classes are equally likely. Using Bayes theorem (5.1.21), write a Python
function that returns the probabilities (p1 , p2 , p3 ) that a given iris x =
(t1 , t2 , t3 , t4 ) lies in each of the three classes. Feed your function the 150
samples of the Iris dataset. How many samples are correctly classified?
p1 + p2 + · · · + pd = 1.
This is called one-hot encoding since all slots in Y are zero except for one
“hot” slot.
For example, suppose X has three values 1, 2, 3, say X is the class of a
random sample from the Iris dataset. Then Y is R3 -valued, and we have
Y = (1, 0, 0), if X = 1,
Y = (0, 1, 0), if X = 2,
Y = (0, 0, 1), if X = 3.
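A minimal way to produce such encodings in Python (a sketch; the class labels 1, 2, 3 are the ones above):

from numpy import eye, array

labels = array([1, 2, 3, 1, 2])          # example class labels
Y = eye(3)[labels - 1]                   # one-hot encode: each row has a single "hot" slot
print(Y)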
More generally, let X have d values. Then with one-hot encoding, the moment-generating function is

M(y) = E(e^{y·Y}) = p1 e^{y1} + p2 e^{y2} + · · · + pd e^{yd}.
In particular, for a fair dice with d sides, the values are equally likely, so the one-hot encoded cumulant-generating function is

log((e^{y1} + e^{y2} + · · · + e^{yd})/d).
The softmax function σ sends (y1, y2, . . . , yd) to (p1, p2, . . . , pd). When d = 2,

q1 = e^{y1}/(e^{y1} + e^{y2}) = 1/(1 + e^{−(y1 − y2)}) = σ(y1 − y2),

q2 = e^{y2}/(e^{y1} + e^{y2}) = 1/(1 + e^{−(y2 − y1)}) = σ(y2 − y1).
Because of this, the softmax function is the multinomial analog of the logistic
function, and we use the same symbol σ to denote both functions.
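Here is a quick numerical check of the d = 2 case (a sketch using scipy; the values of y1, y2 are made up):

from numpy import array
from scipy.special import softmax, expit

y1, y2 = 1.3, -0.4                        # made-up outputs
q = softmax(array([y1, y2]))
print(q)                                  # q1, q2
print(expit(y1 - y2), expit(y2 - y1))     # sigma(y1-y2), sigma(y2-y1): the same values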
y = array([y1,y2,y3])
q = softmax(y)
or
σ(y) = σ(y + a1).
We say a vector y is centered if y is orthogonal to 1,
y · 1 = y1 + y2 + · · · + yd = 0.
This establishes
y = Z1 + log p. (5.6.5)
The function

I(p) = p · log p = Σ_{k=1}^d pk log pk  (5.6.6)
This implies

p · y = Σ_{k=1}^d pk yk = Σ_{k=1}^d pk log(e^{yk})
      ≤ log(Σ_{k=1}^d pk e^{yk}) = log(Σ_{k=1}^d e^{yk + log pk}) = Z(y + log p).
For all y,

Z(y) = max_p (p · y − I(p)).
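A numerical illustration of this duality (a sketch; logsumexp plays the role of Z, and p = σ(y) attains the maximum):

from numpy import array, log
from scipy.special import softmax, logsumexp

y = array([0.2, -1.0, 0.7])               # made-up y
p = softmax(y)                            # the maximizing p
I = p @ log(p)                            # I(p) = p . log p
print(p @ y - I, logsumexp(y))            # equal: p.y - I(p) = Z(y)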
Since

D²I(p) = diag(1/p1, 1/p2, . . . , 1/pd),

we see I(p) is strictly convex, and H(p) is strictly concave.
In Python, the entropy is

from scipy.stats import entropy

p = array([p1,p2,p3])
entropy(p)
Roll a d-faced dice n times, and let #n (p) be the number of outcomes
where the face-proportions are p = (p1 , p2 , . . . , pd ). Then
Now

∂²Z/(∂yj ∂yk) = ∂σj/∂yk =  σj − σjσk,  if j = k,
                           −σjσk,       if j ≠ k.
Hence we have

yj ≤ c,  j = 1, 2, . . . , d,

which implies

|y|² = Σ_{k=1}^d yk² ≤ d(d − 1)²c².

Setting C = √d (d − 1)c, we conclude
Let

log q = (log q1, log q2, . . . , log qd).

Then

p · log q = Σ_{k=1}^d pk log qk,

and

I(p, q) = I(p) − p · log q.  (5.6.13)
Similarly, the relative entropy is H(p, q) = −I(p, q). In Python, the code

from scipy.stats import entropy

p = array([p1,p2,p3])
q = array([q1,q2,q3])
entropy(p,q)

returns the relative information, not the relative entropy. Always check your Python code's conventions and assumptions. See below for more on this terminology confusion.
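A quick check of this convention (a sketch with made-up probability vectors):

from numpy import array, log
from scipy.stats import entropy

p = array([.2, .3, .5])
q = array([.1, .6, .3])
print(entropy(p, q))              # scipy returns the relative information I(p,q)
print(p @ log(p) - p @ log(q))    # I(p,q) = I(p) - p . log q, the same value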
Assume a d-faced dice’s bias is q. Roll the dice n times, and let Pn (p, q)
be the probability of obtaining outcomes where the proportion of faces
is p. Then
= max_{y′} (p · (y′ − log q) − Z(y′))
= I(p) − p · log q
= I(p, q).
This identity is the direct analog of (4.5.23). The identity (4.5.23) is used
in linear regression. Similarly, (5.6.15) is used in logistic regression.
The cross-information is

Icross(p, q) = −Σ_{k=1}^d pk log qk,

and the cross-entropy is

Hcross(p, q) = −Icross(p, q) = Σ_{k=1}^d pk log qk.
Since I(p, σ(y)) and Icross (p, σ(y)) differ by the constant I(p), we also have
This is easily checked using the definitions of I(p, q) and σ(y, q).
H = −I       Information              Entropy
Absolute     I(p)                     H(p)
Cross        Icross(p, q)             Hcross(p, q)
Relative     I(p, q)                  H(p, q)
Curvature    Convex                   Concave
Error        I(p, q) with q = σ(z)

Table 5.33 The third row is the sum of the first and second rows, and the H column is the negative of the I column.
Exercises
6.1 Estimation
(Hypothesis testing flow: a sample yields a p-value for the hypothesis H; if p > α, do not reject H; if p < α, reject H.)
d = 784
for _ in range(20):
    u = randn(d)
    v = randn(d)
    print(angle(u,v))
86.27806537791886
87.91436653824776
93.00098725550777
92.73766421951748
90.005139015804
87.99643434444482
89.77813370637857
96.09801014394806
90.07032573539982
89.37679070400239
91.3405728939376
86.49851399221568
87.12755619082597
88.87980905998855
89.80377324818076
91.3006921339982
91.43977096117017
88.52516224405458
86.89606919838387
90.49100744167357
d = 784
n = 1
for _ in range(20):
    u = binomial(n,.5,d)
    v = binomial(n,.5,d)
    print(angle(u,v))
59.43464627897324
59.14345748418916
60.31453922165891
60.38024365702492
59.24709660805488
59.27165957992343
61.21424657806321
60.55756381536082
61.59468919876665
61.33296028237481
60.03925473033243
60.25732069941224
61.77018692842784
60.672901794058326
59.628519516164666
59.41272458020638
58.43172340007064
59.863796136907744
59.45156367988921
59.95835532791699
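A minimal version of the angle helper used above, returning the angle in degrees between two vectors, might look like this (a sketch, not necessarily the text's own definition):

from numpy import arccos, degrees, dot
from numpy.linalg import norm

def angle(u, v):
    # angle between u and v in degrees
    return degrees(arccos(dot(u, v) / (norm(u) * norm(v))))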
The difference between the two scenarios is the distribution. In the first
scenario, we have randn(d): the components are distributed according to
a standard normal. In the second scenario, we have binomial(1,.5,d) or
binomial(3,.5,d): the components are distributed according to one or three
fair coin tosses. To see how the distribution affects things, we bring in the
law of large numbers, which is discussed in §5.3.
Let X1 , X2 , . . . , Xd be a simple random sample from some population,
and let µ be the population mean. Recall this means X1 , X2 , . . . , Xd are
i.i.d. random variables, with µ = E(X). The sample mean is
X̄ = (X1 + X2 + · · · + Xd)/d.
For large sample size d, the sample mean X̄ approximately equals the
population mean µ, X̄ ≈ µ.
We use the law of large numbers to explain the closeness of the vector
angles to specific values.
Assume u = (x1 , x2 , . . . , xd ), and v = (y1 , y2 , . . . , yd ) where all components
are selected independently of each other, and each is selected according to
the same distribution.
(X1Y1 + X2Y2 + · · · + XdYd)/d ≈ E(X1Y1),

so

U · V = X1Y1 + X2Y2 + · · · + XdYd ≈ d E(X1Y1).

Similarly, U · U ≈ d E(X1²) and V · V ≈ d E(Y1²). Hence (check that the d's cancel)

cos(U, V) = (U · V)/√((U · U)(V · V)) ≈ E(X1Y1)/√(E(X1²)E(Y1²)).
Since X1 and Y1 are independent with mean µ and variance σ²,

cos(θ) = (U · V)/√((U · U)(V · V)) ≈ µ²/(µ² + σ²).

In the coin-toss scenario, µ = p and σ² = p(1 − p), so

µ²/(µ² + σ²) = p²/(p² + p(1 − p)) = p.
cos(θ) is approximately µ²/(µ² + σ²).
1 ≈ means the ratio of the two sides approaches 1 for large n, see §A.6.
6.2 Z-test
p = .7
n = 25
N = 1000
v = binomial(n,p,N)/n
hist(v,edgecolor ='Black')
show()
A confidence level of zero indicates that we have no faith at all that se-
lecting another sample will give similar results, while a confidence level of 1
indicates that we have no doubt at all that selecting another sample will give
similar results.
When we say p is within X̄ ± ϵ, or

|p − X̄| < ϵ,

the interval

(L, U) = (X̄ − ϵ, X̄ + ϵ)

is a confidence interval.
With the above setup, we have the population proportion p, and the four sample characteristics

• sample size n,
• sample proportion X̄,
• margin of error ϵ,
• confidence level α.
Suppose we do not know p, but we know n and X̄. We say the margin of
error is ϵ, at confidence level α, if
Z = √n · (X̄ − p)/√(p(1 − p)),

(L, U) = (X̄ − ϵ, X̄ + ϵ),

Prob(|Z| > z*) = α.
Let σ/√n be the standard error. By the central limit theorem,
α ≈ Prob( |X̄ − p|/√(p(1 − p)) > z*/√n ).

|X̄ − p|/√(p(1 − p)) = z*/√n  (6.2.1)
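Solving (6.2.1) for the margin of error, with p replaced by the sample proportion X̄ (a common approximation; a minimal sketch in which alpha is the significance level, 1 minus the confidence level):

from numpy import sqrt
from scipy.stats import norm

alpha, n, xbar = .05, 20, .7
zstar = norm.ppf(1 - alpha/2)                   # two-tail critical value
eps = zstar * sqrt(xbar*(1 - xbar)) / sqrt(n)   # margin of error
print(eps)                                      # about .2 -- compare answer 1 below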
##########################
# Confidence Interval - Z
##########################
def confidence_interval(xbar,sdev,n,alpha,type):
    Xbar = Z(xbar,sdev/sqrt(n))
    if type == "two-tail":
        U = Xbar.ppf(1-alpha/2)
        L = Xbar.ppf(alpha/2)
    elif type == "upper-tail":
        U = Xbar.ppf(1-alpha)
        L = xbar
    elif type == "lower-tail":
        L = Xbar.ppf(alpha)
        U = xbar
    else: print("what's the test type?"); return
    return L, U
type = "two-tail"
alpha = .02
sdev = 228
n = 35
xbar = 95
L, U = confidence_interval(xbar,sdev,n,alpha,type)
Now we can answer the questions posed at the start of the section. Here
are the answers.
1. When n = 20, α = .95, and X̄ = .7, we have [L, U ] = [.5, .9], so ϵ = .2.
2. When X̄ = .7, α = .95, and ϵ = .15, we run confidence_interval for
15 ≤ n ≤ 40, and select the least n for which ϵ < .15. We obtain n = 36.
3. When X̄ = .7, α = .99, and ϵ = .15, we run confidence_interval for
1 ≤ n ≤ 100, and select the least n for which ϵ < .15. We obtain n = 62.
4. When X̄ = .7, n = 20, and ϵ = .1, we have

z* = ϵ√n/σ = .976.
• Ha : µ ̸= 0.
Here the significance level is α = .02 and µ0 = 0. To decide whether to
reject H0 or not, compute the standardized test statistic
z = √n · (x̄ − µ0)/σ = 2.465.
Since z is a sample from an approximately normal distribution Z, the p-value
Hypothesis Testing
µ < µ0 , µ > µ0 , µ ̸= µ0 .
In the Python code below, instead of working with the standardized statistic Z, we work directly with X̄, which is normally distributed with mean µ0 and standard deviation σ/√n.
###################
# Hypothesis Z-test
###################
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
sdev = 2
alpha = .01
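A minimal Z-test consistent with this setup (a sketch along the lines of the pvalue code in §5.4; the function name, signature, and printout here are illustrative):

from numpy import sqrt
from scipy.stats import norm

def ztest(mu0, sdev, n, xbar, type, alpha):
    # under H0, X-bar is normal with mean mu0 and standard deviation sdev/sqrt(n)
    Xbar = norm(mu0, sdev/sqrt(n))
    if type == "lower-tail":   p = Xbar.cdf(xbar)
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    else:                      p = 2*(1 - Xbar.cdf(mu0 + abs(xbar - mu0)))
    print("p-value: ", p)
    print("reject H0" if p < alpha else "do not reject H0")

ztest(mu0, sdev, n, xbar, type, alpha)   # using the values above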
There are two types of possible errors we can make. A Type I error is when H0 is true but we reject it, and a Type II error is when H0 is not true but we fail to reject it.
                    H0 is true           H0 is false
do not reject H0    1 − α                Type II error: β
reject H0           Type I error: α      Power: 1 − β
µ0 − z*σ/√n < x̄ < µ0 + z*σ/√n.
This calculation was for a two-tail test. When the test is upper-tail or
lower-tail, a similar calculation leads to the code
############################
# Type1 and Type2 errors - Z
############################
def type2_error(type,mu0,mu1,sdev,n,alpha):
    print("significance,mu0,mu1,sdev,n: ", alpha,mu0,mu1,sdev,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        zstar = Z.ppf(alpha)
        type2 = 1 - Z.cdf(delta + zstar)
    elif type == "upper-tail":
        zstar = Z.ppf(1-alpha)
        type2 = Z.cdf(delta + zstar)
    elif type == "two-tail":
        zstar = Z.ppf(1 - alpha/2)
        type2 = Z.cdf(delta + zstar) - Z.cdf(delta - zstar)
    else: print("what's the test type?"); return
    print("test type: ",type)
    print("zstar: ", zstar)
    print("delta: ", delta)
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)
mu0 = 120
mu1 = 122
sdev = 2
n = 10
alpha = .01
type = "upper-tail"
type2_error(type,mu0,mu1,sdev,n,alpha)
A type II error is when we do not reject the null hypothesis and yet it’s
false. The power of a test is the probability of rejecting the null hypothesis
when it’s false (Figure 6.3). If the probability of a type II error is β, then the
power is 1 − β.
Going back to the driving speed example, what is the chance that someone
driving at µ1 = 122 is not caught? This is a type II error; using the above
code, the probability is
6.3 T -test
Here C is a constant to make the total area under the graph equal to one
(Figure 6.4).
Then the t-distribution is continuous and the probability that T lies in a
small interval [a, b] is
by integration (§A.5),
Z b
P rob(a < T < b) = p(t) dt. (6.3.2)
a
Under this interpretation, this probability corresponds to the area under the
graph between the vertical lines at a and at b, and the total area under the
graph corresponds to a = −∞ and b = ∞.
More generally, means of f (T ) are computed by integration,
Z ∞
E(f (T )) = f (t)p(t) dt,
−∞
with the integral computed via the fundamental theorem of calculus (A.5.2)
or Python.
t = arange(-3,3,.01)
for d in [3,4,7]:
    plot(t,T(d).pdf(t),label="d = "+str(d))
plot(t,Z.pdf(t),"--",label=r"d = $\infty$")
grid()
legend()
show()
√n · (X̄ − µ)/S = √n · (X̄ − µ)/√( (1/(n−1)) Σ_{k=1}^n (Xk − X̄)² ).

With Xk = µ + σZk,

√n · (X̄ − µ)/S = √n · Z̄/√( (1/(n−1)) Σ_{k=1}^n (Zk − Z̄)² ) = √n · Z̄/√( U/(n − 1) ).
Using the last result with d = n − 1, we arrive at the main result in this
section.
##########################
# Confidence Interval - T
##########################
def confidence_interval(xbar,s,n,alpha,type):
    d = n-1
    if type == "two-tail":
        tstar = T(d).ppf(1-alpha/2)
        L = xbar - tstar * s / sqrt(n)
        U = xbar + tstar * s / sqrt(n)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        L = xbar
        U = xbar + tstar * s / sqrt(n)
    elif type == "lower-tail":
        tstar = T(d).ppf(alpha)
        L = xbar + tstar * s / sqrt(n)
        U = xbar
    else: print("what's the test type?"); return
    print("type: ",type)
    return L, U

2 Geometrically, the p-value Prob(T > 1) is the probability that a normally distributed point in (d + 1)-dimensional spacetime is inside the light cone.
n = 10
xbar = 120
s = 2
alpha = .01
type = "upper-tail"
print("significance, s, n, xbar: ", alpha,s,n,xbar)
L,U = confidence_interval(xbar,s,n,alpha,type)
print("lower, upper: ", L,U)
Going back to the driving speed example from §6.2, instead of assuming
the population standard deviation is σ = 2, we compute the sample standard
deviation and find it’s S = 2. Recomputing with T (9), instead of Z, we
see (L, U ) = (120, 121.78), so the cutoff now is µ∗ = 121.78, as opposed to
µ∗ = 121.47 there.
###################
# Hypothesis T-test
###################
xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
s = 2
alpha = .01
ttest(mu0, s, n, xbar,type)
########################
# Type1 and Type2 errors
########################
def type2_error(type,mu0,mu1,n,alpha):
    d = n-1
    print("significance,mu0,mu1,n: ", alpha,mu0,mu1,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        tstar = T(d).ppf(alpha)
        type2 = 1 - T(d).cdf(delta + tstar)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        type2 = T(d).cdf(delta + tstar)
    elif type == "two-tail":
        tstar = T(d).ppf(1 - alpha/2)
        type2 = T(d).cdf(delta + tstar) - T(d).cdf(delta - tstar)
    else: print("what's the test type?"); return
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)

type2_error(type,mu0,mu1,n,alpha)
Exercises
i = 1, 2, . . . , d.
is approximately standard normal for large enough sample size, and consequently U = Z² is approximately chi-squared with degree one. The chi-squared test generalizes this from d = 2 categories to d > 2 categories.
Given a category i, let #i denote the number of times Xk = i, 1 ≤ k ≤ n, in a sample of size n. Then p̂i = #i/n is the observed frequency, and pi = Prob(Xk = i) is the expected frequency.
Goodness-Of-Fit Test
def goodness_of_fit(observed,expected):
    # assume len(observed) == len(expected)
    d = len(observed)
    u = sum([ (observed[i] - expected[i])**2/expected[i] for i in range(d) ])
    pvalue = 1 - U(d-1).cdf(u)
    return pvalue
Suppose a dice is rolled n = 120 times, and the observed counts are
Notice
O1 + O2 + O3 + O4 + O5 + O6 = 120.
If the dice is fair, the expected counts are 20 for each face, leading to

u = 12.7.
The dice is fair if u is not large and the dice is unfair if u is large. At
significance level α, the large/not-large cutoff u∗ is
alpha = .05
d = 6
ustar = U(d-1).ppf(1-alpha)
Since this returns u∗ = 11.07 and u > u∗ , we can conclude the dice is not
fair.
Vi = vectp(X)i =  1/√pi,  if X = i,
                  0,       if X ≠ i.      (6.4.3)
Then

E(Vi) = (1/√pi) Prob(X = i) = pi/√pi = √pi,
and

E(ViVj) = 1 if i = j, and 0 if i ≠ j,
for i, j = 1, 2, . . . , d. If

µ = (√p1, √p2, . . . , √pd),

we conclude

E(V) = µ,  E(V ⊗ V) = I.

From this,

E(V) = µ,  Var(V) = I − µ ⊗ µ.  (6.4.4)
Now define

Vk = vectp(Xk),  k = 1, 2, . . . , n.

Since X1, X2, . . . , Xn are i.i.d., V1, V2, . . . , Vn are i.i.d. By (5.5.5), we conclude the random vector

Z = √n ((1/n) Σ_{k=1}^n Vk − µ)

obtaining (6.4.2).
rij = pi qj ,
or r = p ⊗ q.
For example, suppose 300 people are polled and the results are collected
in a contingency table (Figure 6.5).
Is a person’s gender correlated with their party affiliation, or are the two
variables independent? To answer this, let p̂ and q̂ be the observed frequencies
p̂i = #{k : Xk = i}/n,   q̂j = #{k : Yk = j}/n,

and let r̂ be the joint observed frequencies

r̂ij = #{k : Xk = i and Yk = j}/n.
Then r̂ is also a d × N matrix.
When the effects are independent, r = p ⊗ q, so, by the law of large
numbers, we should have
r̂ ≈ p̂ ⊗ q̂
for large sample size. The chi-squared independence test quantifies the dif-
ference of the two matrices r and r̂.
n Σ_{i,j=1}^{d,N} (r̂ij − p̂i q̂j)²/(p̂i q̂j)  (6.4.6)

= −n + n Σ_{i,j=1}^{d,N} (observed)²/expected.
The code
def chi2_independence(table):
    n = sum(table) # total sample size
    d = len(table)
    N = len(table.T)
    rowsum = array([ sum(table[i,:]) for i in range(d) ])
    colsum = array([ sum(table[:,j]) for j in range(N) ])
    expected = outer(rowsum,colsum) # tensor product
    u = -n + n*sum([[ table[i,j]**2/expected[i,j] for j in range(N) ] for i in range(d) ])
    deg = (d-1)*(N-1)
    pvalue = 1 - U(deg).cdf(u)
    return pvalue
table = array([[68,56,32],[52,72,20]])
chi2_independence(table)
returns a p-value of 0.0401, so, at the 5% significance level, the effects are
not independent.
equals (6.4.6).
Let u1, u2, . . . , ud and v1, v2, . . . , vN be orthonormal bases for R^d and R^N respectively. By (2.9.8),

‖Z‖² = trace(ZᵗZ) = Σ_{i,j=1}^{d,N} (ui · Zvj)².  (6.4.8)
We will show ‖Z‖² is asymptotically chi-squared of degree (d − 1)(N − 1). To achieve this, we show Z is asymptotically normal.
Let X and Y be discrete random variables with probability vectors
p = (p1 , p2 , . . . , pd ) and q = (q1 , q2 , . . . , qN ), and assume X and Y are in-
dependent.
Let

µ = (√p1, √p2, . . . , √pd),   ν = (√q1, √q2, . . . , √qN),

and
W ≈ 0,

(q̂j − qj)/√(p̂i q̂j) ≈ 0,

Z ≈ Z^CLT.
We conclude
• the mean and variance of Z are asymptotically the same as those of M ,
• u · Zν ≈ 0, µ · Zv ≈ 0 for any u and v, and,
• Z ≈ normal.
In particular, since u·Zv and u′ ·Zv ′ are asymptotically uncorrelated when
u ⊥ u′ and v ⊥ v ′ , and Z is asymptotically normal, we conclude u · Zv and
u′ · Zv ′ are asymptotically independent when u ⊥ u′ and v ⊥ v ′ .
Now choose the orthonormal bases with u1 and v1 equal to µ and ν re-
spectively. Then
ui · Zvj , i = 1, 2, 3, . . . , d, j = 1, 2, 3, . . . , N
are independent normal random variables with mean zero, asymptotically for
large n, and variances according to the listing
Exercises
Exercise 6.4.4 Verify the goodness-of-fit test statistic (6.4.2) is the square
of (6.4.1) when d = 2.
Exercise 6.4.5 [30] Among 100 vacuum tubes tested, 41 had lifetimes of less
than 30 hours, 31 had lifetimes between 30 and 60 hours, 13 had lifetimes
between 60 and 90 hours, and 15 had lifetimes of greater than 90 hours.
Are these data consistent with the hypothesis that a vacuum tube’s lifetime
is exponentially distributed (Exercise 5.3.22) with a mean of 50 hours? At
what significance? Here p = (p1 , p2 , p3 , p4 ).
7.1 Overview
Sometimes J(W ) is normalized by dividing by N , but this does not change the
results. With the dataset given, the mean error is a function of the weights.
A weight matrix W ∗ is optimal if it is a minimizer of the mean error,
In §4.4, we saw two versions of forward and back propagation. In this section
we see a third version. We begin by reviewing the definition of graph and
network as given in §3.3 and §4.4.
A graph consists of nodes and edges. Nodes are also called vertices, and an
edge is an ordered pair (i, j) of nodes. Because the ordered pair (i, j) is not
the same as the ordered pair (j, i), our graphs are directed.
The edge (i, j) is incoming at node j and outgoing at node i. If a node j
has no outgoing edges, then j is an output node. If a node i has no incoming
edges, then i is an input node. If a node is neither an input nor an output, it
is a hidden node.
We assume our graphs have no cycles: every path terminates at an output
node in a finite number of steps.
A graph is weighted if a scalar weight wij is attached to each edge (i, j). If (i, j) is not an edge, we set wij = 0. If a network has d nodes, the edges are completely specified by the d × d weight matrix W = (wij).
A node with an attached activation function (4.4.2) is a neuron. A network is a directed weighted graph where the nodes are neurons. In the next paragraph, we define a special kind of network, a neural network.
Because wij = 0 if (i, j) is not an edge, the nonzero entries in the incoming
list at node j correspond to the edges incoming at node j.
A neural network is a network where every activation function is restricted
to be a function of the sum of the entries of the incoming list.
For example, all the networks in this section are neural networks, but the
network in Figure 4.16 is not a neural network.
Let

x⁻j = Σ_{i→j} wij xi  (7.2.1)

be the sum of the incoming list at node j. Then, in a neural network, the outgoing signal at node j is

xj = fj(x⁻j) = fj(Σ_{i→j} wij xi).  (7.2.2)
x = (x1, x2, . . . , xd),   x⁻ = (x⁻1, x⁻2, . . . , x⁻d).

In a network, in §4.4, x⁻j was a list or vector; in a neural network, x⁻j is a scalar.
Let W be the weight matrix. If the network has d nodes, the activation
vector is
f = (f1 , f2 , . . . , fd ).
Then a neural network may be written in vector-matrix form
x = f (W t x).
However, this representation is more useful when the network has structure,
for example in a dense shallow layer (7.2.12).
y = f (w1 x1 + w2 x2 + · · · + wd xd ) = f (w · x)
Neural Network
Every neural network is a combination of perceptrons.
(Perceptron: inputs x1, x2, x3 with weights w1, w2, w3 feed into the activation f, producing the output y.)
y = f (w1 x1 + w2 x2 + · · · + wd xd + w0 ) = f (w · x + w0 ).
The role of the bias is to shift the thresholds in the activation functions.
If x1 , x2 , . . . , xN is a dataset, then (x1 , 1), (x2 , 1), . . . , (xN , 1) is the aug-
mented dataset. If the original dataset is in Rd , then the augmented dataset
is in Rd+1 . In this regard, Exercise 7.2.1 is relevant.
By passing to the augmented dataset, a neural network with bias and d
input features can be thought of as a neural network without bias and d + 1
input features.
In §5.1, Bayes theorem is used to express a conditional probability in terms of a perceptron,

Prob(H | x) = σ(w · x + w0).
This is a basic example of how a perceptron computes probabilities.
Perceptrons gained wide exposure after Minsky and Papert’s famous 1969
book [22], from which Figure 7.2 is taken.
tanh(z) = (e^z − e^{−z})/(e^z + e^{−z})
# activation functions
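A minimal set of activation functions and their derivatives, consistent with how activate and der_dict are used further below (a sketch; the text's own definitions may differ):

from numpy import exp, tanh

def id(x): return x
def relu(x): return max(x, 0)
def sigmoid(x): return 1/(1 + exp(-x))

# derivatives, keyed by the activation function itself
der_dict = {
    id:      lambda x: 1,
    relu:    lambda x: 1 if x > 0 else 0,
    sigmoid: lambda x: sigmoid(x)*(1 - sigmoid(x)),
    tanh:    lambda x: 1 - tanh(x)**2,
}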
(Figure 7.3: a network with input nodes x1, x2, hidden neurons f3, f4, output neurons f5, f6, and weights w13, w14, w23, w24, w35, w36, w45, w46.)
Let xin and xout be the outgoing vectors corresponding to the input and output nodes. Then the network in Figure 7.3 has outgoing vectors

xin = (x1, x2),   xout = (x5, x6).

Here are the incoming and outgoing signals at each of the four neurons f3, f4, f5, f6.
f3:  x⁻3 = w13 x1 + w23 x2,   x3 = f3(w13 x1 + w23 x2)
f4:  x⁻4 = w14 x1 + w24 x2,   x4 = f4(w14 x1 + w24 x2)
f5:  x⁻5 = w35 x3 + w45 x4,   x5 = f5(w35 x3 + w45 x4)
f6:  x⁻6 = w36 x3 + w46 x4,   x6 = f6(w36 x3 + w46 x4)
(Two neurons fi, fj joined by an edge with weight wij; xi is the outgoing signal at node i and xj at node j.)
def incoming(x,w,j):
    return sum([ outgoing(x,w,i)*w[i][j] if w[i][j] != None else 0 for i in range(d) ])

def outgoing(x,w,j):
    if x[j] != None: return x[j]
    else: return activate[j](incoming(x,w,j))
We assume the nodes are ordered so that the initial portion of x equals
xin ,
m = len(x_in)
x[:m] = x_in
def forward_prop(x_in,w):
    d = len(w)
    x = [None]*d
    m = len(x_in)
    x[:m] = x_in
    for j in range(m,d): x[j] = outgoing(x,w,j)
    return x
activate = [None]*d
activate[2] = relu
activate[3] = id
activate[4] = sigmoid
activate[5] = tanh
x_in = [1.5,2.5]
x = forward_prop(x_in,w)
Let
y1 = 0.427, y2 = −0.288, y = (y1 , y2 )
be targets, and let J(xout , y) be a function of the outputs xout of the output
nodes, and the targets y. For example, for Figure 7.3, xout = (x5 , x6 ) and we
may take J to be the mean square error function or mean square loss
J(xout, y) = (x5 − y1)²/2 + (x6 − y2)²/2,  (7.2.6)
The code for this J is
def J(x_out,y):
    m = len(y)
    return sum([ (x_out[i] - y[i])**2/2 for i in range(m) ])
y0 = [0.132,-0.954]
y = [0.427, -0.288]
J(x_out,y0), J(x_out,y)
These are the downstream derivative, local derivative, and upstream derivative
at node j. (The terminology reflects the fact that derivatives are computed
backward.)
(Node fi with downstream derivative ∂J/∂x⁻i, local derivative fi′, and upstream derivative ∂J/∂xi.)
From (7.2.2),

∂xj/∂x⁻j = fj′(x⁻j).  (7.2.8)
By the chain rule and (7.2.8), the key relation between these derivatives is

∂J/∂x⁻i = (∂J/∂xi) · fi′(x⁻i),  (7.2.9)

or

downstream = upstream × local.
def local(x,w,i):
    return der_dict[activate[i]](incoming(x,w,i))
Let

δi = ∂J/∂x⁻i,  i = 1, 2, . . . , d.
d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out
∂J/∂x5 = (x5 − y1) = −0.294.
σ′(x⁻5) = σ(x⁻5)(1 − σ(x⁻5)) = x5(1 − x5) = 0.114.
Similarly,
δ6 = −0.059.
We conclude
δout = (−0.0337, −0.059).
def delta_out(x_out,y,w):
    d = len(w)
    m = len(y)
    return [ (x_out[i] - y[i]) * local(x,w,d-m+i) for i in range(m) ]
delta_out(x_out,y_star,w)
∂J/∂x⁻i = Σ_{i→j} (∂J/∂x⁻j) · (∂x⁻j/∂xi) · (∂xi/∂x⁻i)
        = Σ_{i→j} (∂J/∂x⁻j) · wij · fi′(x⁻i).
The code is
def downstream(x,delta,w,i):
    if delta[i] != None: return delta[i]
    else:
        upstream = sum([ downstream(x,delta,w,j) * w[i][j] if w[i][j] != None else 0 for j in range(d) ])
        return upstream * local(x,w,i)
def backward_prop(x,y,w):
    d = len(w)
    delta = [None]*d
    m = len(y)
    x_out = x[d-m:]
    delta[d-m:] = delta_out(x_out,y,w)
    for i in range(d-m): delta[i] = downstream(x,delta,w,i)
    return delta
delta = backward_prop(x,y,w)
returns
∂x⁻j/∂wij = xi,
We have shown

∂J/∂wij = xi · δj.  (7.2.11)
A shallow network is dense if all input nodes point to all output nodes:
wij is defined for all i, j. A shallow network can always be assumed dense by
inserting zero weights at missing edges.
Neural networks can also be assembled in series, with each component a
layer (Figure 7.8). Usually each layer is a dense shallow network. For example,
Figure 7.3 consists of two dense shallow networks in layers. We say a network
is deep if there are multiple layers.
The weight matrix W (7.2.3) is 6 × 6, while the weight matrices W1 , W2
of each of the two dense shallow network layers in Figure 7.3 are 2 × 2.
In a single shallow layer with n input nodes and m output nodes (Figure 7.7), let x and z be the layer's input node vector and output node vector. Then x and z are n and m dimensional respectively, and W is n × m.
(Figure 7.7: a dense shallow layer with input nodes x1, x2, x3, x4 and output nodes z1, z2, z3.)
f(z⁻) = f(z⁻1, z⁻2, . . . , z⁻m) = (f(z⁻1), f(z⁻2), . . . , f(z⁻m)).
Our convention is to let wij denote the weight on the edge (i, j). With this
convention, the formulas (7.2.1), (7.2.2) reduce to the matrix multiplication
formulas
z⁻ = Wᵗx,   z = f(Wᵗx).  (7.2.12)
Thus a dense shallow network can be thought of as a vector-valued percep-
tron. This allows for parallelized forward and back propagation.
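As an illustration of (7.2.12), here is a vectorized forward pass through one dense layer (a sketch with made-up sizes and weights; expit plays the role of the activation f):

from numpy import dot
from numpy.random import randn
from scipy.special import expit

n, m = 4, 3                 # input and output sizes
W = randn(n, m)             # weight matrix, w[i][j] on edge (i,j)
x = randn(n)                # input node vector
z_minus = dot(W.T, x)       # incoming sums, z- = W^t x
z = expit(z_minus)          # outgoing signals, z = f(W^t x)
print(z)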
Exercises
This goal is so general that any concrete insight one provides towards it is widely useful in many settings. The setting we have in mind is f = J, where J is the mean error from §7.1.
Usually f (w) is a measure of cost or lack of compatibility. Because of this,
f (w) is called the loss function or cost function.
A neural network is a black box with inputs x and outputs y, depending on
unknown weights w. To train the network is to select weights w in response
to training data (x, y). The optimal weights w∗ are selected as minimizers
of a loss function f (w) measuring the error between predicted outputs and
actual outputs, corresponding to given training inputs.
404 CHAPTER 7. MACHINE LEARNING
From §4.5, if the loss function f (w) is continuous and proper, there is
a global minimizer w∗ . If f (w) is in addition strictly convex, w∗ is unique.
Moreover, if the gradient of the loss function is g = ∇f (w), then w∗ is a
critical point, g ∗ = ∇f (w∗ ) = 0.
(g(b) − g(a))/(b − a) ≈ g′(a).

Inserting a = w and b = w⁺, and solving for w⁺,

w⁺ ≈ w − g(w)/g′(w).

Since the global minimizer w* satisfies f′(w*) = 0, we insert g(w) = f′(w) in the above approximation,

w⁺ ≈ w − f′(w)/f″(w).
w_{n+1} = wn − f′(wn)/f″(wn),  n = 1, 2, . . .
def newton(loss,grad,curv,w,num_iter):
    g = grad(w)
    c = curv(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= g/c
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        c = curv(w)
        if allclose(g,0): break
    return trajectory
u0 = -2.72204813
w0 = 2.45269774
num_iter = 20
trajectory = newton(loss,grad,curv,w0,num_iter)
def plot_descent(a,b,loss,curv,delta,trajectory):
    w = arange(a,b,delta)
    plot(w,loss(w),color='red',linewidth=1)
    plot(w,curv(w),"--",color='blue',linewidth=1)
    plot(*trajectory,color='green',linewidth=1)
    scatter(*trajectory,s=10)
    title("num_iter= " + str(len(trajectory.T)))
    grid()
    show()
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
f (w) = f (w1 , w2 , . . . ).
In other words,
In practice, the learning rate is selected by trial and error. Which learning
rate does the theory recommend?
Given an initial point w0 , the sublevel set at w0 (see §4.5) consists of all
points w where f (w) ≤ f (w0 ). Only the part of the sublevel set that is
connected to w0 counts.
In Figure 7.10, the sublevel set at w0 is the interval [u0 , w0 ]. The sublevel
set at w1 is the interval [b, w1 ]. Notice we do not include any points to the
left of b in the sublevel set at w1 , because points to the left of b are separated
from w1 by the gap at the point b.
Fig. 7.10 Double well cost function and sublevel sets at w0 and at w1.
To see this, fix w and let S be the sublevel set {w′ : f(w′) ≤ f(w)}. Since the gradient pushes f down, for t > 0 small, w⁺ stays in S. Insert x = w⁺ and a = w into the right half of (4.5.20) and simplify. This leads to

f(w⁺) ≤ f(w) − t|∇f(w)|² + (t²L/2)|∇f(w)|².

Since tL ≤ 1 when 0 ≤ t ≤ 1/L, we have t²L ≤ t. This derives (7.3.3).
The curvature of the loss function and the learning rate are inversely pro-
portional. Where the curvature of the graph of f (w) is large, the learning
rate 1/L is small, and gradient descent proceeds in small time steps.
For example, let f(w) = w⁴ − 6w² + 2w (Figures 7.9, 7.10, 7.11). Then

f′(w) = 4w³ − 12w + 2,   f″(w) = 12w² − 12.

Thus the inflection points (where f″(w) = 0) are ±1 and, in Figure 7.10, the critical points are a, b, c.
Let u0 and w0 be the points satisfying f(w) = 5 as in Figure 7.11. Then u0 = −2.72204813 and w0 = 2.45269774, so f″(u0) = 76.914552 and f″(w0) = 60.188. Thus we may choose L = 76.914552. With this L, the short-step gradient descent starting at w0 is guaranteed to converge to one of the three critical points. In fact, the sequence converges to the right-most critical point c (Figure 7.10).
This exposes a flaw in basic gradient descent. Gradient descent may con-
verge to a local minimizer, and miss the global minimizer. In §7.9, modified
gradient descent will address some of these shortcomings.
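The descent runs in this section call loss, grad, and curv; for the example f(w) = w⁴ − 6w² + 2w, a minimal version consistent with those calls is:

def loss(w): return w**4 - 6*w**2 + 2*w      # f(w)
def grad(w): return 4*w**3 - 12*w + 2        # f'(w)
def curv(w): return 12*w**2 - 12             # f''(w)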
def gd(loss,grad,w,learning_rate,num_iter):
    g = grad(w)
    trajectory = array([[w],[loss(w)]])
    for _ in range(num_iter):
        w -= learning_rate * g
        trajectory = column_stack([trajectory,[w,loss(w)]])
        g = grad(w)
        if allclose(g,0): break
    return trajectory
u0 = -2.72204813
w0 = 2.45269774
L = 76.914552
learning_rate = 1/L
num_iter = 100
trajectory = gd(loss,grad,w0,learning_rate,num_iter)
ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)
xin → xout .
Given inputs xin and target outputs y, we seek to modify the weight matrix
W so that the input-output map is
xin → y.
This is training.
Let (§7.2)
x⁻ = (x⁻1, x⁻2, . . . , x⁻d),   x = (x1, x2, . . . , xd),   δ = (δ1, δ2, . . . , δd)
Let wij be the weight along an edge (i, j), let xi be the outgoing
signal from the i-th node, and let δj be the downstream derivative
of the output J with respect to the j-th node. Then the derivative
∂J/∂wij equals xi δj . In this partial sense,
∇W J = x ⊗ δ. (7.4.2)
def update_weights(x,delta,w,learning_rate):
    d = len(w)
    for i in range(d):
        for j in range(d):
            if w[i][j]:
                w[i][j] = w[i][j] - learning_rate*x[i]*delta[j]
def train_nn(x_in,y,w0,learning_rate,n_iter):
    trajectory = []
    cost = 1
    # build a local copy
    w = [ row[:] for row in w0 ]
    d = len(w0)
    for _ in range(n_iter):
        x = forward_prop(x_in,w)
        delta = backward_prop(x,y,w)
        update_weights(x,delta,w,learning_rate)
        m = len(y)
        x_out = x[d-m:]
        cost = J(x_out,y)
        trajectory.append(cost)
        if allclose(0,cost): break
    return w, trajectory
Here n_iter is the maximum number of iterations allowed, and the iterations
stop if the cost J is close to zero.
The cost or error function J enters the code only through the function
delta_out, which is part of the function backward_prop.
7.4. NETWORK TRAINING 413
x_in = [1.5,2.5]
learning_rate = .01
y0 = 0.4265356063
y1 = -0.2876478137
y = [y0,y1]
n_iter = 10000
w, trajectory = train_nn(x_in,y,w0,learning_rate,n_iter)
returns the cost trajectory, which can be plotted using the code
for lr in [.01,.02,.03,.035]:
    w, trajectory = train_nn(x_in,y,w0,lr,n_iter)
    n = len(trajectory)
    label = str(n) + ", " + str(lr)
    plot(range(n),trajectory,label=label)

grid()
legend()
show()
Fig. 7.12 Cost trajectory and number of iterations as learning rate varies.
⋆ under construction ⋆
is imprecise and not clearly defined. In fact, for deep networks, it is not at
all clear how to turn this vague idea into an actionable definition.
In the case of a single-layer perceptron, the situation is straightforward
enough to be able to both make the question precise, and to provide action-
able criteria that guarantee trainability. This we do in the two cases
• linear regression, and
• logistic regression.
With any loss function J, the goal is to minimize J. With this in mind,
from §4.5, we recall
J(W ∗ ) ≤ J(W ),
For linear regression without bias, the loss function is (7.5.1) with
J(x, y, W) = |y − z|²/2,   z = Wᵗx.  (7.5.3)
Then (7.5.1) is the mean square error or mean square loss, and the problem
of minimizing (7.5.1) is linear regression (Figure 7.13).
= trace(Vᵗ(x ⊗ (z + sv − y))).

(d²/ds²)|ₛ₌₀ J(x, y, W + sV) = |v|² = |Vᵗx|².  (7.5.6)
(Figure 7.13: linear regression network — inputs x1, x2, x3, x4 feed the outputs z = Wᵗx, which are compared with the target y via J = |z − y|²/2.)
Since J(W ) is the sum of J(x, y, W ) over all samples, J(W ) is convex. To
check strict convexity of J(W ), suppose
(d²/ds²)|ₛ₌₀ J(W + sV) = 0.

By (7.5.6), this means

Vᵗxk = 0,  k = 1, 2, . . . , N.  (7.5.7)
Recall the feature space is the vector space of all inputs x, and (§2.9) a
dataset is full-rank if the span of the dataset is the entire feature space. When
this happens, (7.5.7) implies V = 0. By (4.5.19), J(W ) is strictly convex.
To check properness of J(W), by definition (4.5.12), we show there is a bound C with

J(W) ≤ c  ⟹  ‖W‖ ≤ C√d.  (7.5.8)
Here ∥W ∥ is the norm of the matrix W (2.2.12). The exact formula for the
bound C, which is not important for our purposes, depends on the level c
and the dataset.
If J(W) ≤ c, by (7.5.1), (7.5.3), and the triangle inequality,

|Wᵗxk| ≤ √(2c) + |yk|,  k = 1, 2, . . . , N.
|W t x| ≤ C(x). (7.5.9)
For linear regression with bias, the loss function is (7.5.2) with
J(x, y, W, b) = |y − z|²/2,   z = Wᵗx + b.  (7.5.10)
Here W is the weight matrix and b is a bias vector.
If we augment the dataset x1 , x2 , . . . , xN to (x1 , 1), (x2 , 1), . . . , (xN , 1),
then this corresponds to the augmented weight matrix
⎡ W ⎤
⎣ bᵗ ⎦.
Applying the last result to the augmented dataset and appealing to Exer-
cise 7.2.1, we obtain
These are simple, clear geometric criteria for convergence of gradient de-
scent to the global minimum of J, valid for linear regression with or without
bias inputs.
Exercises
max p = max(p1 , p2 , . . . , pd ).
We start with logistic regression without bias inputs. For logistic regression, the loss function is

J(W) = Σ_{k=1}^N J(xk, pk, W),  (7.6.1)

with

J(x, p, W) = I(p, q),   q = σ(y),   y = Wᵗx.
Here I(p, q) is the relative information, measuring the information error be-
tween the desired target p and the computed target q, and q = σ(y) is the
softmax function, squashing the network’s output y = W t x into the proba-
bility q.
When p is one-hot encoded, by (5.6.16), I(p, q) = Icross(p, q).
the cross-entropy loss.
J(W ) is logistic loss or logistic error, and the problem of minimizing (7.6.1)
is logistic regression (Figure 7.14).
Since we will be considering both strict and one-hot encoded probabilities,
we work with I(p, q) rather than Icross (p, q). Table 5.33 is a useful summary
of the various information and entropy concepts.
(Figure 7.14: logistic regression network — inputs x1, . . . , x4 feed y = Wᵗx, the softmax gives q = σ(y), and the loss is J = I(p, q) against the target p.)
W1 = 0,  (7.6.2)

or

Σ_{j=1}^d wij = 0,  i = 1, 2, . . . , d.
and, by (5.6.10),

(d/ds)|ₛ₌₀ J(x, y, W + sV) = (d/ds)|ₛ₌₀ I(p, σ(y + sv))
    = v · (q − p) = (Vᵗx) · (q − p)
    = trace((Vᵗx) ⊗ (q − p))
    = trace(Vᵗ(x ⊗ (q − p))).
As before, this result is a special case of (7.4.2). Since q and p are probability
vectors, p · 1 = 1 = q · 1, hence the gradient G is centered.
Recall (§5.6) we have strict convexity of Z(y) along centered vectors y,
those vectors satisfying y · 1 = 0. Since y = W t x, y · 1 = x · W 1. Hence, to
force y · 1 = 0, it is natural to assume W is centered.
If we initiate gradient descent with a centered weight matrix W , since the
gradient G is also centered, all successive weight matrices will be centered.
To see this, given a vector v and probability vector q, set v̄ = Σ_{j=1}^d vj qj. Then

Σ_{j=1}^d vj² qj − (Σ_{j=1}^d vj qj)² = Σ_{j=1}^d (vj − v̄)² qj.
(W + sV)ᵗx = y + sv.

If y = Wᵗx, it follows the second derivative of J(x, p, W) in the direction of V is

(d²/ds²)|ₛ₌₀ J(x, p, W + sV) = Σ_{j=1}^d (vj − v̄)² qj,   v = Vᵗx.  (7.6.6)
vanishes, then, since the summands are nonnegative, (7.6.6) vanishes, for
every sample x = xk , p = pk , hence
V t xk = 0, k = 1, 2, . . . , N.
The convex hull is discussed in §4.5, see Figures 4.23 and 4.24. If Ki were
just the samples x whose corresponding targets p satisfy pi > 0 (with no
convex hull), then the intersection Ki ∩ Kj may be empty.
For example, if p were one-hot encoded, then x belongs to at most one Ki .
Thus taking the convex hull in the definition of Ki is crucial. This is clearly
seen in Figure 7.26: The samples never intersect, but the convex hulls may
do so.
To establish properness of J(W ), by definition (4.5.12), we show
for some C. The exact formula for the bound C, which is not important for
our purposes, depends on the level c and the dataset.
Suppose J(W ) ≤ c, with W 1 = 0 and let q = σ(y). Then I(p, q) =
J(x, p, W ) ≤ c for every sample x and corresponding target p.
Let x be a sample, let y = W t x, and suppose the corresponding target p
satisfies pi ≥ ϵ, for some class i, and some ϵ > 0. If j ̸= i, then
ϵ(yj − yi) ≤ ϵ(Z(y) − yi) ≤ pi(Z(y) − yi) ≤ Σ_{k=1}^d pk(Z(y) − yk) = Z(y) − p · y.
By (5.6.15),
Z(y) − p · y = I(p, σ(y)) − I(p) ≤ c + log d.
Combining the last two inequalities,
ϵ(yj − yi ) ≤ c + log d.
Let x be any vector in feature space, and let y = W t x. Since span(Ki ∩Kj )
is full-rank, x is a linear combination of vectors in Ki ∩ Kj , for every i and j.
This implies, by (7.6.8), there is a bound C(x), depending on x but not on
W , such that
X
d|yi | = |(d − 1)yi + yi | = (yi − yj ) ≤ (d − 1)C(x).
j̸=i
|wji| = |ej · Wei| ≤ C,  i, j = 1, 2, . . . , d.

By (2.2.12),

‖W‖² = Σ_{i,j} |wij|² ≤ d²C².
with
J(x, p, W, b) = I(p, q), q = σ(y), y = W t x + b.
Here W is the weight matrix and b is the bias vector. In keeping with our
prior convention, we call the weight (W, b) centered if W is centered and b is
centered. Then y is centered.
If the columns of W are (w1 , w2 , . . . , wd ), and b = (b1 , b2 , . . . , bd ), then
y = W t x + b is equivalent to levels corresponding to d hyperplanes
y1 = w1 · x + b1,
y2 = w2 · x + b2,
. . .                      (7.6.11)
yd = wd · x + bd.
yi ≥ 0 for x in class i, and yi ≤ 0 for x in class j, for every i = 1, 2, . . . , d and every j ≠ i.  (7.6.12)

yi ≥ 0 for x in class i, and yi ≤ 0 for x in class j, for some i = 1, 2, . . . , d and some j ≠ i.  (7.6.13)
As special cases, there are corresponding results for strict targets and one-
hot encoded targets.
To begin the proof, suppose (W, b) satisfies (7.6.12). Then (Exercise 7.6.4)

yi ≥ 0 for x in Ki, and yj ≤ 0 for x in Ki and every j ≠ i, for every i = 1, 2, . . . , d.  (7.6.14)
From this, one obtains I(p, σ(y)) ≤ log d for every sample x and q = σ(y)
(Exercise 7.6.5). Since this implies J(W, b) ≤ N log d, the loss function is not
proper, hence not trainable.
By Exercise 7.6.6, for trainability, it is enough to check properness. To
establish properness of the loss function, suppose none of the classes lie in
a hyperplane and the dataset is not weakly separable. Then Ki ∩ Kj has
interior for all i and all j ̸= i. Let x∗ij be the centers of balls in Ki ∩ Kj for
each i ̸= j. By making the balls small enough, we may assume the radii of
the balls equal the same r > 0.
Let ϵi > 0 be the minimum of pi over all probability vectors p correspond-
ing to samples x in class i. Let ϵ be the least of ϵ1 , ϵ2 , . . . , ϵd . Then ϵ is
positive.
Suppose J(W, b) ≤ c for some level c, with W = (w1, w2, . . . , wd), b = (b1, b2, . . . , bd) centered. We establish properness of the loss function by showing

|wi| + |bi| ≤ ((c + log d)/(rϵ)) · (1 + r + (1/(d − 1)) Σ_{j≠i} |x*ij|),  i = 1, 2, . . . , d.  (7.6.15)
The exact form of the right side of (7.6.15) doesn’t matter. What matters is
the right side is a constant depending only on the dataset, the targets, the
number of categories d, and the level c.
If J(W, b) ≤ c, then I(p, q) ≤ c for each sample x. As before, this leads to
(7.6.8).
rϵ|wi − wj| ≤ c + log d.

Let

yi = wi · x*ij + bi,   yj = wj · x*ij + bj.

Since x*ij is in Ki ∩ Kj, by (7.6.8),

Hence

Since W is centered,

dwi = (d − 1)wi + wi = (d − 1)wi − Σ_{j≠i} wj = Σ_{j≠i} (wi − wj).

Hence

|wi| + |bi| ≤ (1/d) Σ_{j≠i} (|wi − wj| + |bi − bj|).
A very special case is a two-class dataset. In this case, the result is com-
pelling:
We end the section by comparing the three regressions: linear, strict logis-
tic, and one-hot encoded logistic.
In classification problems, it is one-hot encoded logistic regression that is
relevant. Because of this, in the literature, logistic regression often defaults
to the one-hot encoded case.
In linear regression, not only do J(W ) and J(W, b) have minima, but so
does J(z, y). Properness ultimately depends on properness of a quadratic |z|2 .
In strict logistic regression, by (7.6.3), the critical point equation
∇y J(y, p) = 0
can always be solved, so there is at least one minimum for each J(y, p). Here
properness ultimately depends on properness of Z(y).
In one-hot encoded regression, J(y, p) = I(p, σ(y)) and ∇y J(y, p) = 0 can
never be solved, because q = σ(y) is always strict and p is one-hot encoded,
see (7.6.5). Nevertheless, trainability of J(W ) and J(W, b) is achievable if
there is sufficient overlap between the sample categories.
In linear regression, the minimizer is expressible in terms of the regression
equation, and thus can be solved in principle using the pseudo-inverse. In
practice, when the dimensions are high, gradient descent may be the only
option for linear regression.
In logistic regression, the minimizer cannot be found in closed form, so we
have no choice but to apply gradient descent, even for low dimensions.
Exercises
0 = w0 + w · x = w0 + w1 x1 + w2 x2 + · · · + wd xd .
Linear Regression
We work out the regression equation in the plane, when both features x and y are scalar. In this case, w = (m, b), X is the N × 2 matrix whose k-th row is (xk, 1), and Y is the column vector (y1, y2, . . . , yN).
Then (x̄, ȳ) is the mean of the dataset. Also, let x and y denote the vectors (x1, x2, . . . , xN) and (y1, y2, . . . , yN), and let, as in §1.5,

cov(x, y) = (1/N) Σ_{k=1}^N (xk − x̄)(yk − ȳ) = (1/N) x · y − x̄ȳ.
(x · x)m + x̄b = x · y,
mx̄ + b = ȳ.
The second equation says the regression line passes through the mean (x̄, ȳ).
Multiplying the second equation by x̄ and subtracting the result from the
first equation cancels the b and leads to
This derives

The regression line in two dimensions passes through the mean (x̄, ȳ) and has slope

m = cov(x, y)/cov(x, x).
from pandas import read_csv

df = read_csv("longley.csv")
X = df["Population"].to_numpy()
Y = df["Employed"].to_numpy()
X = X - mean(X)
Y = Y - mean(Y)
varx = sum(X**2)/len(X)
vary = sum(Y**2)/len(Y)
X = X/sqrt(varx)
Y = Y/sqrt(vary)
After this, we compute the optimal weight w∗ and construct the polyno-
mial. The regression equation is solved using the pseudo-inverse (§2.3).
def poly(x,d):
    A = column_stack([ X**i for i in range(d) ]) # Nxd
    Aplus = pinv(A)
    b = Y # Nx1
    wstar = dot(Aplus,b)
    return sum([ x**i*wstar[i] for i in range(d) ],axis=0)
figure(figsize=(12,12))
# six subplots
rows, cols = 3,2
# x interval
x = arange(xmin,xmax,.01)

for i in range(6):
    d = 3 + 2*i # degree = d-1
    subplot(rows, cols,i+1)
    plot(X,Y,"o",markersize=2)
    plot([0],[0],marker="o",color="red",markersize=4)
    plot(x,poly(x,d),color="blue",linewidth=.5)
    xlabel("degree = %s" % str(d-1))
    grid()

show()
Running this code with degree 1 returns Figure 7.15. Taking too high a
power can lead to overfitting, for example for degree 12.
x p x p x p x p x p
0.5 0 .75 0 1.0 0 1.25 0 1.5 0
1.75 0 1.75 1 2.0 0 2.25 1 2.5 0
2.75 1 3.0 0 3.25 1 3.5 0 4.0 1
4.25 1 4.5 1 4.75 1 5.0 1 5.5 1
More generally, we may only know the amount of study time x, and the
probability p that the student passed, where now 0 ≤ p ≤ 1.
For example, the data may be as in Figure 7.18, where pk equals 1 or 0
according to whether they passed or not.
As stated, the samples of this dataset are scalars, and the dataset is one-
dimensional (Figure 7.19).
Plotting the dataset on the (x, p) plane, the goal is to fit a curve
p = σ(m∗ x + b∗ ) (7.7.4)
as in Figure 7.20.
Since this is logistic regression with bias, we can apply the two-class result
from the previous section: The dataset is one-dimensional, so a hyperplane is
just a point, a threshold. Neither class lies in a hyperplane, and the dataset is
not separable (Figure 7.19). Hence logistic regression with bias is trainable,
and gradient descent is guaranteed to converge to an optimal weight (m∗ , b∗ ).
(Figure 7.19: the one-dimensional dataset plotted on the x-axis.)
X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75,
     3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
def gradient(m,b):
    return sum([ (expit(m*x+b) - p) * array([x,1]) for x,p in zip(X,P) ],axis=0)
# gradient descent
w = array([0,0]) # starting m,b
g = gradient(*w)
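The descent loop below is a minimal sketch continuing the setup above (the learning rate, iteration cap, and stopping rule are assumed); iterating until the gradient is negligible drives (m, b) toward the optimum reported next:

lr = .01                                  # assumed learning rate
for _ in range(200000):
    w = w - lr*g
    g = gradient(*w)
    if allclose(g, 0, atol=1e-8): break
print(w)                                  # m and b, close to the optimum reported below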
m∗ = 1.49991537, b∗ = −4.06373862.
Even though we are done, we take the long way and apply logistic regres-
sion without bias by incorporating the bias, to better understand how things
work.
To this end, we incorporate the bias and write the augmented dataset
resulting in Figure 7.21. Since these vectors are not parallel, the dataset is
full-rank in R2 , hence J(m, b) is strictly convex. In Figure 7.21, the shaded
area is bounded by the vectors corresponding to the overlap between passing
and failing students’ hours.
(Figure 7.21: the augmented samples (x, 1) in the plane, with the shaded region spanned by the overlapping samples.)
Let σ(z) be the sigmoid function (5.1.22). Then, as in the previous section,
the goal is to minimize the loss function
J(m, b) = Σ_{k=1}^N I(pk, qk),   qk = σ(mxk + b),  (7.7.5)
Once we have the minimizer (m∗ , b∗ ), we have the best-fit curve (7.7.4).
If the targets p are one-hot encoded, the dataset is as follows.
x p x p x p x p x p
0.5 (1,0) .75 (1,0) 1.0 (1,0) 1.25 (1,0) 1.5 (1,0)
1.75 (1,0) 1.75 (0,1) 2.0 (1,0) 2.25 (0,1) 2.5 (1,0)
2.75 (0,1) 3.0 (1,0) 3.25 (0,1) 3.5 (1,0) 4.0 (0,1)
4.25 (0,1) 4.5 (0,1) 4.75 (0,1) 5.0 (0,1) 5.5 (0,1)
(Figure 7.23: the two-class logistic network with inputs x, 1, weights ±m, ±b, outputs y, −y, softmax probabilities q, 1 − q, and loss I.)
Since here d = 2, the networks in Figures 7.23 and 7.24 are equivalent.
In Figure 7.23, σ is the softmax function, I is given by (5.6.6), and p, q are
probability vectors. In Figure 7.24, σ is the sigmoid function, I is given by
(4.2.2), and p, q are probability scalars.
(Figure 7.24: the equivalent network with a single output y = mx + b, sigmoid q = σ(y), and loss I.)
Figure 7.20 is a plot of x against p. However, the dataset, with the bias
input included, has two inputs x, 1 and one output p, and should be plotted
in three dimensions (x, 1, p). Then (Figure 7.25) samples lie on the line (x, 1)
in the horizontal plane, and p is on the vertical axis.
The horizontal plane in Figure 7.25, which is the plane in Figure 7.21, is
feature space. The convex hulls K0 and K1 are in feature space, so the convex
hull K0 of the samples corresponding to p = 0 is the line segment joining
(.5, 1, 0) and (3.5, 1, 0), and the convex hull K1 of the samples corresponding
to p = 1 is the line segment joining (1.75, 1, 0) and (5.5, 1, 0). In Figure 7.25,
K0 is the line segment joining the green points, and K1 is the projection onto
feature space of the line segment joining the red points. Since K0 ∩ K1 is the
line segment joining (1.75, 1, 0) and (3.5, 1, 0), the span of K0 ∩ K1 is all of
feature space. By the results of the previous section, J(w) is proper.
The Iris dataset consists of 150 samples divided into three groups, leading to three convex hulls K0, K1, K2 in R⁴. If the dataset is projected onto the
top two principal components, then the projections of these three hulls do
not pair-intersect (Figure 7.26). It follows we have no guarantee the logistic
loss is proper.
On the other hand, the MNIST dataset consists of 60,000 samples divided
into ten groups. If the MNIST dataset is projected onto the top two principal
components, the projections of the ten convex hulls K0 , K1 , . . . , K9 onto R2 ,
do intersect (Figure 7.27).
This does not guarantee that the ten convex hulls K0 , K1 , . . . , K9 in R784
intersect, but at least this is so for the 2d projection of the MNIST dataset.
Therefore the logistic loss of the 2d projection of the MNIST dataset is proper.
In this section, we work with loss functions that are smooth and strictly
convex. While this is not always the case, this assumption is a base case
against which we can test different optimization or training models.
By smooth and strictly convex, we mean there are positive constants m
and L satisfying
Recall this means the eigenvalues of the symmetric matrix D²f(w) are between m and L. In this situation, the condition number¹ r = m/L is between zero and one: 0 < r ≤ 1.
In the previous section, we saw that basic gradient descent converges to a critical point. If f(w) is strictly convex, there is exactly one critical point, the global minimum. From this we have

(m/2)|w − w*|² ≤ f(w) − f(w*) ≤ (L/2)|w − w*|².  (7.8.3)
1 In the literature, the condition number is often defined as L/m.
How far we are from our goal w∗ can be measured by the error E(w) =
|w − w∗ |2 . Another measure of error is E(w) = f (w) − f (w∗ ). The goal is to
drive the error between w and w∗ to zero.
When f (w) is smooth and strictly convex in the sense of (7.8.1), the es-
timate (7.8.3) shows these two error measures are equivalent. We use both
measures below.
Gradient Descent I
Let r = m/L and set E(w) = f(w) − f(w*). Then the descent sequence w0, w1, w2, . . . given by (7.3.1) with learning rate

t = 1/L

converges to w* at the rate

E(wn) ≤ (1 − r)ⁿ E(w0),  n = 1, 2, . . . .  (7.8.5)
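A quick check of this rate on a strictly convex quadratic (a sketch; m and L are the smallest and largest eigenvalues of the quadratic's matrix Q):

from numpy import array, dot
from numpy.linalg import eigvalsh, solve

Q = array([[3., 1.], [1., 2.]])
b = array([1., -1.])
f = lambda w: dot(w, dot(Q, w))/2 - dot(b, w)
grad = lambda w: dot(Q, w) - b

m, L = eigvalsh(Q)                 # eigenvalues in increasing order
r = m/L
wstar = solve(Q, b)                # global minimizer
w = array([5., 5.])
E0 = f(w) - f(wstar)
for n in range(1, 21):
    w = w - grad(w)/L              # learning rate t = 1/L
    print(n, f(w) - f(wstar) <= (1 - r)**n * E0)   # True at every step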
g · (w − w*) ≥ (mL/(m + L)) |w − w*|² + (1/(m + L)) |g|².

Using this and (7.3.1) and t = 2/(m + L),

This implies
Gradient Descent II
GD-II improves GD-I in two ways: Since m < L, the learning rate is larger,

2/(m + L) > 1/L,

and the convergence rate is smaller,

((1 − r)/(1 + r))² < 1 − r,
Let g be the gradient of the loss function at a point w. Then the line
passing through w in the direction of g is w − tg. When the loss function is
w◦ = w + s(w − w− ). (7.9.1)
Here s is the decay rate. The momentum term reflects the direction induced by
the previous step. Because this mimics the behavior of a ball rolling downhill,
gradient descent with momentum is also called heavy ball descent.
Then the descent sequence w0 , w1 , w2 , . . . is generated by
Here we have two hyperparameters, the learning rate and the decay rate.
wn = w* + ρⁿv,   Qv = λv.  (7.9.5)

Inserting this into (7.9.3) and using Qw* = b leads to the quadratic equation

ρ² = (1 − tλ + s)ρ − s.
4s − (1 − λt + s)² ≥ ((L − λ)(λ − m)/(mL)) (1 − s)².  (7.9.8)
When (7.9.6) holds, the roots are conjugate complex numbers ρ, ρ̄, where

ρ = x + iy = ((1 − λt + s) + i√(4s − (1 − λt + s)²))/2.  (7.9.9)
It follows the absolute value of ρ equals

|ρ| = √(x² + y²) = √s.
To obtain the fastest convergence, we choose s and t to minimize $|\rho| = \sqrt{s}$,
while still satisfying (7.9.7). This forces (7.9.7) to be an equality,
$$\frac{(1 - \sqrt{s})^2}{m} = t = \frac{(1 + \sqrt{s})^2}{L}.$$
These are two equations in the two unknowns s, t. Solving, we obtain
$$\sqrt{s} = \frac{1 - \sqrt{r}}{1 + \sqrt{r}}, \qquad t = \frac{1}{L}\cdot\frac{4}{(1 + \sqrt{r})^2}.$$
Let w̃n = wn −w∗ . Since Qwn −b = Qw̃n , (7.9.3) is a 2-step linear recursion
in the variables w̃n . Therefore the general solution depends on two constants
A, B.
Let λ1 , λ2 , . . . , λd be the eigenvalues of Q and let v1 , v2 , . . . , vd be the
corresponding orthonormal basis of eigenvectors.
Since (7.9.3) is a 2-step vector linear recursion, A and B are vectors, and
the general solution depends on 2d constants Ak , Bk , k = 1, 2, . . . , d.
If ρk , k = 1, 2, . . . , d, are the corresponding roots (7.9.9), then (7.9.5) is
a solution of (7.9.3) for each of 2d roots ρ = ρk , ρ = ρ̄k , k = 1, 2, . . . , d.
Therefore the linear combination
$$w_n = w^* + \sum_{k=1}^{d}\left(A_k \rho_k^n + B_k \bar\rho_k^n\right)v_k, \qquad n = 0, 1, 2, \dots \tag{7.9.10}$$
is the general solution of (7.9.3). The constants are determined by the initial conditions
$$A_k + B_k = (w_0 - w^*)\cdot v_k, \qquad A_k\rho_k + B_k\bar\rho_k = (w_1 - w^*)\cdot v_k = (1 - t\lambda_k)(w_0 - w^*)\cdot v_k,$$
Let
$$C = \max_\lambda \frac{(L - m)^2}{(L - \lambda)(\lambda - m)}. \tag{7.9.11}$$
Using (7.9.8), one verifies the estimate
Suppose the loss function f(w) is quadratic (7.8.2), let r = m/L, and
set $E(w) = |w - w^*|^2$. Let C be given by (7.9.11). Then the descent
sequence $w_0, w_1, w_2, \dots$ given by (7.9.2) with learning rate and decay
rate
$$t = \frac{1}{L}\cdot\frac{4}{(1 + \sqrt{r})^2}, \qquad s = \left(\frac{1 - \sqrt{r}}{1 + \sqrt{r}}\right)^2,$$
converges to $w^*$ at the rate
$$E(w_n) \le 4C\left(\frac{1 - \sqrt{r}}{1 + \sqrt{r}}\right)^{2n} E(w_0), \qquad n = 1, 2, \dots \tag{7.9.12}$$
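The update (7.9.2) is not displayed above; presumably it is the heavy ball step $w^+ = w^\circ - t\nabla f(w) = w - t\nabla f(w) + s(w - w^-)$, with $w^\circ$ as in (7.9.1). Under that assumption, here is an illustrative sketch on a small quadratic (the matrix and starting point below are arbitrary, not from the text):

from numpy import array, sqrt
from numpy.linalg import eigvalsh, solve

Q = array([[10.0, 0.0], [0.0, 1.0]])     # quadratic loss with m = 1, L = 10
b = array([1.0, 1.0])
wstar = solve(Q, b)
grad = lambda w: Q @ w - b

eigs = eigvalsh(Q)
m, L = eigs[0], eigs[-1]
r = m / L
t = (1 / L) * 4 / (1 + sqrt(r)) ** 2
s = ((1 - sqrt(r)) / (1 + sqrt(r))) ** 2

w_prev = w = array([3.0, -2.0])
for n in range(100):
    # heavy ball step: gradient at w, plus the momentum term s*(w - w_prev)
    w, w_prev = w - t * grad(w) + s * (w - w_prev), w
print(w, wstar)                           # w should be close to wstar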
$$w^\circ = w + s(w - w^-), \qquad w^+ = w^\circ - t\nabla f(w^\circ). \tag{7.9.13}$$
we will show
$$V(w^+) \le \rho V(w). \tag{7.9.15}$$
In fact, we see below (7.9.22), (7.9.23) that V is reduced by an additional
quantity proportional to the momentum term.
The choice t = 1/L is natural, coming from basic gradient descent (7.3.3).
The derivation of (7.9.15) below forces the choices for s and ρ.
Given a point w, while $w^+$ is well-defined by (7.9.13), it is not clear what
$w^-$ means. There are two ways to insert meaning here. Either evaluate V(w)
along a sequence $w_0, w_1, w_2, \dots$ and set, as before, $w_n^- = w_{n-1}$, or work
with the function $W(w) = V(w^+)$ instead of V(w). If we assume $(w^+)^- = w$,
then W(w) is well-defined. With this understood, we nevertheless stick with
V(w) as in (7.9.14) to simplify the calculations.
We first show how (7.9.15) implies the result. Using $(w_0)^- = w_0$ and
(7.8.3),
$$V(w_0) = f(w_0) + \frac{L}{2}|w_0 - \rho w_0|^2 = f(w_0) + \frac{m}{2}|w_0|^2 \le 2f(w_0).$$
This derives

Let r = m/L and set $E(w) = f(w) - f(w^*)$. Then the sequence $w_0$,
$w_1, w_2, \dots$ given by (7.9.13) with learning rate and decay rate
$$t = \frac{1}{L}, \qquad s = \frac{1 - \sqrt{r}}{1 + \sqrt{r}}$$
converges to $w^*$ at the rate
$$E(w_n) \le 2(1 - \sqrt{r})^n E(w_0), \qquad n = 1, 2, \dots$$
While the convergence rate for accelerated descent is slightly worse than that of
heavy ball descent, the value of accelerated descent is its validity for all convex
functions satisfying (7.8.1), and the fact, also due to Nesterov [23], that this
convergence rate is best possible among all such functions.
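As an illustration (again not the text's code, with an arbitrary quadratic and starting point), the accelerated step (7.9.13) is easy to run:

from numpy import array, sqrt
from numpy.linalg import eigvalsh, solve

Q = array([[10.0, 0.0], [0.0, 1.0]])
b = array([1.0, 1.0])
wstar = solve(Q, b)
f = lambda w: w @ Q @ w / 2 - b @ w
grad = lambda w: Q @ w - b

eigs = eigvalsh(Q)
m, L = eigs[0], eigs[-1]
r = m / L
t, s = 1 / L, (1 - sqrt(r)) / (1 + sqrt(r))

w_prev = w = array([3.0, -2.0])
for n in range(100):
    w_circ = w + s * (w - w_prev)          # look-ahead point, as in (7.9.13)
    w_prev, w = w, w_circ - t * grad(w_circ)
print(f(w) - f(wstar))                      # decays roughly like (1 - sqrt(r))**n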
Now we derive (7.9.15). Assume (w+ )− = w and w∗ = 0, f (w∗ ) = 0. We
know w◦ = (1 + s)w − sw− and w+ = w◦ − tg ◦ , where g ◦ = ∇f (w◦ ).
By the basic descent step (7.3.1) with w◦ replacing w, (7.3.3) implies
$$f(w^+) \le f(w^\circ) - \frac{t}{2}|g^\circ|^2. \tag{7.9.17}$$
Here we used t = 1/L.
By (4.5.20) with x = w and a = w◦,
$$f(w^\circ) \le f(w) - g^\circ \cdot (w - w^\circ) - \frac{m}{2}|w - w^\circ|^2. \tag{7.9.18}$$
By (4.5.20) with x = w∗ = 0 and a = w◦,
$$f(w^\circ) \le g^\circ \cdot w^\circ - \frac{m}{2}|w^\circ|^2. \tag{7.9.19}$$
Multiply (7.9.18) by ρ and (7.9.19) by 1 − ρ and add, then insert the sum
into (7.9.17). After some simplification, this yields
$$f(w^+) \le \rho f(w) + g^\circ \cdot (w^\circ - \rho w) - \frac{r}{2t}\left(\rho|w - w^\circ|^2 + (1 - \rho)|w^\circ|^2\right) - \frac{t}{2}|g^\circ|^2. \tag{7.9.20}$$
Since
$$(w^\circ - \rho w) - tg^\circ = w^+ - \rho w,$$
we have
$$\frac{1}{2t}|w^+ - \rho w|^2 = \frac{1}{2t}|w^\circ - \rho w|^2 - g^\circ \cdot (w^\circ - \rho w) + \frac{t}{2}|g^\circ|^2.$$
Adding this to (7.9.20) leads to
$$V(w^+) \le \rho f(w) - \frac{r}{2t}\left(\rho|w - w^\circ|^2 + (1 - \rho)|w^\circ|^2\right) + \frac{1}{2t}|w^\circ - \rho w|^2. \tag{7.9.21}$$
Let
$$R(a, b) = r\rho s^2|b|^2 + (1 - \rho)|a + sb|^2 - |(1 - \rho)a + sb|^2 + \rho|(1 - \rho)a + \rho b|^2,$$
which is positive.
Chapter A
Appendices
Some of the material here is first seen in high school. Because repeating the
exposure leads to a deeper understanding, we review it in a manner useful to
us here.
We start with basic counting, and show how the factorial function leads
directly to the exponential. Given its convexity and its importance for entropy
(§5.1), the exponential is treated carefully (§A.3).
The other use of counting is in graph theory (§3.3), which lays the ground-
work for neural networks (§7.2).
A.1 Permutations and Combinations

Suppose we have three balls in a bag, colored red, green, and blue. Suppose
they are pulled out of the bag and arranged in a line. We then obtain six
possibilities, listed in Figure A.1.
Why are there six possibilities? Because there are three ways of choosing
the first ball, then two ways of choosing the second ball, then one way of
choosing the third ball, so the total number of ways is
6 = 3 × 2 × 1.
More generally, the number of ways of arranging n distinct balls in a line is the factorial
$$n! = n \times (n - 1) \times (n - 2) \times \cdots \times 2 \times 1.$$
Notice also
(n + 1)! = (n + 1) × n × (n − 1) × · · · × 2 × 1 = (n + 1) × n!,
Permutations of n Objects

The number of permutations (ordered arrangements) of n distinct objects is n!.
We also have
1! = 1, 0! = 1.
It’s clear that 1! = 1. It’s less clear that 0! = 1, but it’s reasonable if you
think about it: The number of ways of selecting from zero balls results in
only one possibility — no balls. The code for n! is
from scipy.special import factorial

factorial(n,exact=True)
More generally, we can consider the selection of k balls from a bag containing
n distinct balls. There are two varieties of selections that can be made:
ordered selections and unordered selections. An ordered selection is a permutation.
In particular, when k = n, the number of ordered selections of n objects from n
objects is n!, which is the number of ways of permuting n objects.
def perm_tuples(a,b,k):
    if k==1: return [ (i,) for i in range(a,b+1) ]
    else:
        list1 = [ (i,*p) for i in range(a,b) for p in perm_tuples(i+1,b,k-1) ]
        list2 = [ (*p,i) for i in range(a,b) for p in perm_tuples(i+1,b,k-1) ]
        return list1 + list2

perm_tuples(1,5,2)

[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5), (2, 1), (3, 1), (4, 1), (5, 1), (3, 2), (4, 2), (5, 2), (4, 3), (5, 3), (5, 4)]
from scipy.special import perm

n, k = 5, 2
perm(n, k)
perm(n,k,exact=True) == len(perm_tuples(1,n,k))
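As a cross-check, which is not from the text, the standard library produces the same ordered pairs, up to the order in which they are listed:

from itertools import permutations

sorted(permutations(range(1, 6), 2)) == sorted(perm_tuples(1, 5, 2))   # True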
Notice P (x, k) is defined for any real number x by the same formula,
def comb_tuples(a,b,k):
    if k==1: return [ (i,) for i in range(a,b+1) ]
    else: return [ (i, *p) for i in range(a,b) for p in comb_tuples(i+1,b,k-1) ]

comb_tuples(1,5,2)

[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
from scipy.special import comb

n, k = 5, 2
comb(n, k)
comb(n,k,exact=True) == len(comb_tuples(1,n,k))
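Similarly, and again not from the text:

from itertools import combinations

sorted(combinations(range(1, 6), 2)) == sorted(comb_tuples(1, 5, 2))   # True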
$$C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{(n - k)!\,k!}.$$
Since n! is the product of the n factors
$$1, 2, 3, \dots, n - 1, n,$$
each of which is at most n, we have
$$n! < n^n.$$
However, because half of the factors are less than n/2, we expect an approximation
smaller than $n^n$, maybe something like $(n/2)^n$ or $(n/3)^n$.
To be systematic about it, assume
$$n! \text{ is approximately equal to } e\left(\frac{n}{e}\right)^n \text{ for } n \text{ large}, \tag{A.1.1}$$
for some constant e. We seek the best constant e that fits here. In this ap-
proximation, we multiply by e so that (A.1.1) is an equality when n = 1.
Using the binomial theorem, in §A.3 we show
$$3\left(\frac{n}{3}\right)^n \le n! \le 2\left(\frac{n}{2}\right)^n, \qquad n \ge 1. \tag{A.1.2}$$
Based on this, a constant e satisfying (A.1.1) must lie between 2 and 3,
2 ≤ e ≤ 3.
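A quick numerical check of (A.1.2), for illustration only:

from scipy.special import factorial

for n in [1, 5, 10, 20]:
    lower = 3 * (n / 3) ** n
    upper = 2 * (n / 2) ** n
    print(n, lower <= factorial(n, exact=True) <= upper)   # True for each n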
To figure out the best constant e to pick, we see how much both sides
of (A.1.1) increase when we replace n by n + 1. Write (A.1.1) with n + 1
replacing n, obtaining
$$(n + 1)! \text{ is approximately equal to } e\left(\frac{n+1}{e}\right)^{n+1} \text{ for } n \text{ large}. \tag{A.1.3}$$
Dividing the left sides of (A.1.1), (A.1.3) yields
$$\frac{(n + 1)!}{n!} = n + 1.$$
Dividing the right sides yields
$$\frac{e\left((n+1)/e\right)^{n+1}}{e\left(n/e\right)^n} = (n + 1)\cdot\frac{1}{e}\cdot\left(1 + \frac{1}{n}\right)^n. \tag{A.1.4}$$
Exercises
(First break the sum into two sums, then write out the first few terms of each
sum separately, and notice all terms but one cancel.)
Similarly,
Thus
$$\begin{aligned}
(a + x)^2 &= a^2 + 2ax + x^2 \\
(a + x)^3 &= a^3 + 3a^2x + 3ax^2 + x^3 \\
(a + x)^4 &= a^4 + 4a^3x + 6a^2x^2 + 4ax^3 + x^4 \\
(a + x)^5 &= \star a^5 + \star a^4x + \star a^3x^2 + \star a^2x^3 + \star ax^4 + \star x^5.
\end{aligned} \tag{A.2.4}$$
and
$$\binom{3}{0} = 1, \quad \binom{3}{1} = 3, \quad \binom{3}{2} = 3, \quad \binom{3}{3} = 1$$
and
$$\binom{4}{0} = 1, \quad \binom{4}{1} = 4, \quad \binom{4}{2} = 6, \quad \binom{4}{3} = 4, \quad \binom{4}{4} = 1$$
and
$$\binom{5}{0} = \star, \quad \binom{5}{1} = \star, \quad \binom{5}{2} = \star, \quad \binom{5}{3} = \star, \quad \binom{5}{4} = \star, \quad \binom{5}{5} = \star.$$
is the coefficient of $a^{n-k}x^k$ when you multiply out $(a + x)^n$. This is the binomial
coefficient. Here n is the degree of the binomial, and k, which specifies
the term in the resulting sum, varies from 0 to n (not 1 to n).
It is important to remember that, in this notation, the binomial $(a + x)^2$
expands into the sum of three terms $a^2$, $2ax$, $x^2$. These are term 0, term 1,
and term 2. Alternatively, one says these are the zeroth term, the first term,
and the second term. Thus the second term in the expansion of the binomial
$(a+x)^4$ is $6a^2x^2$, and the binomial coefficient $\binom{4}{2} = 6$. In general, the binomial
$(a + x)^n$ of degree n expands into a sum of n + 1 terms.
Since the binomial coefficient $\binom{n}{k}$ is the coefficient of $a^{n-k}x^k$ when you
multiply out $(a + x)^n$, we have the binomial theorem.
Binomial Theorem

$$(a + x)^n = \sum_{k=0}^{n}\binom{n}{k}a^{n-k}x^k.$$

For example, the term $\binom{4}{2}a^2x^2$ corresponds to choosing two a's, and two x's,
n = 0: 1
n = 1: 1 1
n = 2: 1 2 1
n = 3: 1 3 3 1
n = 4: 1 4 6 4 1
n = 5: 1 5 10 10 5 1
n = 6: ⋆ 6 15 20 15 6 ⋆
n = 7: 1 ⋆ 21 35 35 21 ⋆ 1
n = 8: 1 8 ⋆ 56 70 56 ⋆ 8 1
n = 9: 1 9 36 ⋆ 126 126 ⋆ 36 9 1
n = 10: 1 10 45 120 ⋆ 252 ⋆ 120 45 10 1
from numpy import zeros

N = 10
Comb = zeros((N,N),dtype=int)
Comb[0,0] = 1
for n in range(1,N):
    Comb[n,0] = Comb[n,n] = 1
    for k in range(1,n): Comb[n,k] = Comb[n-1,k] + Comb[n-1,k-1]

Comb
In Pascal's triangle, the very top row has one number in it: This is the
zeroth row corresponding to n = 0 and the binomial expansion of $(a+x)^0 = 1$.
The first row corresponds to n = 1; it contains the numbers (1, 1), which
correspond to the binomial expansion of $(a + x)^1 = 1a + 1x$. We say the
zeroth entry (k = 0) in the first row (n = 1) is 1 and the first entry (k = 1)
in the first row is 1. Similarly, the zeroth entry (k = 0) in the second row
(n = 2) is 1, and the second entry (k = 2) in the second row (n = 2) is 1.
The second entry (k = 2) in the fourth row (n = 4) is 6. For every row, the
entries are counted starting from k = 0, and end with k = n, so there are
n + 1 entries in row n. With this understood, the k-th entry in the n-th row
is the binomial coefficient $\binom{n}{k}$.
We can learn a lot about the binomial coefficients from this triangle. First,
we have 1’s all along the left edge. Next, we have 1’s all along the right edge.
Similarly, one step in from the left or right edge, we have the row number.
Thus we have
$$\binom{n}{0} = 1 = \binom{n}{n}, \qquad \binom{n}{1} = n = \binom{n}{n-1}, \qquad n \ge 1.$$
Note also Pascal's triangle has a left-to-right symmetry: If you read off
the coefficients in a particular row, you can't tell if you're reading them from
left to right, or from right to left. It's the same either way: The fifth row is
(1, 5, 10, 10, 5, 1). In terms of our notation, this is written
$$\binom{n}{k} = \binom{n}{n-k}, \qquad 0 \le k \le n;$$
$$\begin{aligned}
\binom{4}{0}a^4 &+ \binom{4}{1}a^3x + \binom{4}{2}a^2x^2 + \binom{4}{3}ax^3 + \binom{4}{4}x^4 \\
&= \binom{3}{0}a^4 + \binom{3}{1}a^3x + \binom{3}{2}a^2x^2 + \binom{3}{3}ax^3 \\
&\quad + \binom{3}{0}a^3x + \binom{3}{1}a^2x^2 + \binom{3}{2}ax^3 + \binom{3}{3}x^4.
\end{aligned}$$
We conclude the sum of the binomial coefficients along the n-th row of Pascal's
triangle is $2^n$ (remember n starts from 0).
Now insert x = 1 and a = −1. You get
$$0 = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \cdots \pm \binom{n}{n-1} \pm \binom{n}{n}.$$
Hence: the alternating sum of the binomial coefficients along the n-th row of
Pascal's triangle is zero.
We now show
Binomial Coefficient
Let
$$C(n, k) = \frac{n\cdot(n-1)\cdots(n-k+1)}{1\cdot 2\cdots k} = \frac{n!}{k!(n-k)!}.$$
Then
$$\binom{n}{k} = C(n, k), \qquad 0 \le k \le n. \tag{A.2.10}$$
$$\begin{aligned}
C(n, k) + C(n, k-1) &= \frac{n!}{k!(n-k)!} + \frac{n!}{(k-1)!(n-k+1)!} \\
&= \frac{n!}{(k-1)!(n-k)!}\left(\frac{1}{k} + \frac{1}{n-k+1}\right) \\
&= \frac{n!(n+1)}{(k-1)!(n-k)!\,k(n-k+1)} \\
&= \frac{(n+1)!}{k!(n+1-k)!} = C(n+1, k).
\end{aligned}$$
The formula (A.2.10) is easy to remember: There are k terms in the numerator
as well as the denominator, the factors in the denominator increase starting
from 1, and the factors in the numerator decrease starting from n.
In Python, the code
from scipy.special import comb

comb(n,k)
comb(n,k,exact=True)
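As a quick illustration, not from the text, Pascal's rule $C(n, k) + C(n, k-1) = C(n+1, k)$ agrees with comb:

from scipy.special import comb

n, k = 10, 4
comb(n, k, exact=True) + comb(n, k - 1, exact=True) == comb(n + 1, k, exact=True)   # True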
The binomial coefficient $\binom{n}{k}$ makes sense even for fractional n. This can
Rewriting this by pulling out the first two terms k = 0 and k = 1 leads to
$$\left(1 + \frac{1}{n}\right)^n = 1 + 1 + \sum_{k=2}^{n}\frac{1}{k!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right). \tag{A.3.1}$$
From (A.3.1), we can tell a lot. First, since all terms are positive, we see
$$\left(1 + \frac{1}{n}\right)^n \ge 2, \qquad n \ge 1.$$
Second, since $k! \ge 2^{k-1}$ and each factor $1 - j/n$ is at most 1, the sum in (A.3.1) is at most
$$\sum_{k=2}^{n}\frac{1}{2^{k-1}} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots + \frac{1}{2^{n-1}} \le 1,$$
as follows.
A geometric sum is a sum of the form
$$s_n = 1 + a + a^2 + \cdots + a^{n-1} = \sum_{k=0}^{n-1}a^k.$$
Multiplying by a,
$$as_n = a + a^2 + a^3 + \cdots + a^{n-1} + a^n = s_n + a^n - 1,$$
yielding
$$(a - 1)s_n = a^n - 1.$$
When $a \ne 1$, we may divide by a − 1, obtaining
$$s_n = \sum_{k=0}^{n-1}a^k = 1 + a + a^2 + \cdots + a^{n-1} = \frac{a^n - 1}{a - 1}. \tag{A.3.4}$$
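A one-line numerical check of (A.3.4), for illustration:

a, n = 0.5, 10
abs(sum(a ** k for k in range(n)) - (a ** n - 1) / (a - 1)) < 1e-12   # True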
By (A.3.3), we arrive at
$$2 \le \left(1 + \frac{1}{n}\right)^n \le 3, \qquad n \ge 1. \tag{A.3.5}$$
Since a bounded increasing sequence has a limit (§A.7), this establishes the
following strengthening of (A.1.5).
Euler’s Constant
The limit
$$e = \lim_{n\to\infty}\left(1 + \frac{1}{n}\right)^n \tag{A.3.6}$$
exists and satisfies $2 \le e \le 3$.
The technical details are in §A.6; see Exercises A.6.9 and A.6.10. Nevertheless, the intuition is
clear: (A.3.6) is saying there is a specific positive number e with
$$\left(1 + \frac{1}{n}\right)^n \approx e$$
for n large.
Since we’ve shown bn increases faster than an , and cn increases faster than
bn , we have derived (A.1.2).
$$e = 1 + 1 + \sum_{k=2}^{\infty}\frac{1}{k!}\left(1 - \frac{1}{\infty}\right)\left(1 - \frac{2}{\infty}\right)\cdots\left(1 - \frac{k-1}{\infty}\right) = \sum_{k=0}^{\infty}\frac{1}{k!}.$$
To summarize,
Euler’s Constant
Euler’s constant satisfies
$$e = \sum_{k=0}^{\infty}\frac{1}{k!} = 1 + 1 + \frac{1}{2} + \frac{1}{6} + \frac{1}{24} + \frac{1}{120} + \frac{1}{720} + \dots \tag{A.3.7}$$
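Both expressions for e are easy to check numerically; the comparison below is an illustration and not from the text:

from math import e, factorial

n = 10 ** 6
print((1 + 1 / n) ** n)                           # the limit (A.3.6), roughly 2.71828
print(sum(1 / factorial(k) for k in range(20)))   # partial sum of the series (A.3.7)
print(e)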
Depositing one dollar in a bank offering 100% interest returns two dollars
after one year. Depositing one dollar in a bank offering the same annual
interest compounded at mid-year returns
$$\left(1 + \frac{1}{2}\right)^2 = 2.25$$
Exponential Function
$$\begin{aligned}
(1 - x) &= 1 - x \\
(1 - x)^2 &= 1 - 2x + x^2 \ge 1 - 2x \\
(1 - x)^3 &= (1 - x)(1 - x)^2 \ge (1 - x)(1 - 2x) = 1 - 3x + 2x^2 \ge 1 - 3x \\
(1 - x)^4 &= (1 - x)(1 - x)^3 \ge (1 - x)(1 - 3x) = 1 - 4x + 3x^2 \ge 1 - 4x \\
&\ \,\vdots
\end{aligned}$$
from numpy import exp, linspace
from matplotlib.pyplot import grid, plot, show
x = linspace(-2, 2, 100)   # plotting range assumed for illustration
grid()
plot(x,exp(x))
show()
$$\left(1 + \frac{x}{n}\right)^n = 1 + x + \sum_{k=2}^{n}\frac{x^k}{k!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right). \tag{A.3.11}$$
Exponential Series
The exponential function is always positive and satisfies, for every real
number x,
$$\exp x = \sum_{k=0}^{\infty}\frac{x^k}{k!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \frac{x^6}{720} + \dots \tag{A.3.12}$$
Law of Exponents
$$(a_0 + a_1 + a_2 + a_3 + \dots)(b_0 + b_1 + b_2 + b_3 + \dots)$$
Thus
$$\left(\sum_{k=0}^{\infty}a_k\right)\left(\sum_{m=0}^{\infty}b_m\right) = \sum_{n=0}^{\infty}\left(\sum_{k=0}^{n}a_k b_{n-k}\right).$$
Now insert
$$a_k = \frac{x^k}{k!}, \qquad b_{n-k} = \frac{y^{n-k}}{(n-k)!}.$$
Then the n-th term in the resulting sum equals, by the binomial theorem,
$$\sum_{k=0}^{n}a_k b_{n-k} = \sum_{k=0}^{n}\frac{x^k}{k!}\,\frac{y^{n-k}}{(n-k)!} = \frac{1}{n!}\sum_{k=0}^{n}\binom{n}{k}x^k y^{n-k} = \frac{1}{n!}(x + y)^n.$$
Thus
$$\exp x\cdot\exp y = \left(\sum_{k=0}^{\infty}\frac{x^k}{k!}\right)\left(\sum_{m=0}^{\infty}\frac{y^m}{m!}\right) = \sum_{n=0}^{\infty}\frac{(x + y)^n}{n!} = \exp(x + y).$$
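A one-line numerical check of the law of exponents, for illustration only:

from numpy import exp, isclose

x, y = 0.7, -1.3
isclose(exp(x) * exp(y), exp(x + y))   # True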
Exponential Notation
Graphically, the convexity of the exponential function is the fact that the
line segment joining two points on the graph lies above the graph (Figure
A.4).
Exercises
Exercise A.3.1 Assume a bank gives 50% annual interest on deposits. After
one year, what does $1 become? Do this when the money is compounded once,
twice, and at every instant during the year.
Exercise A.3.2 Assume a bank gives -50% annual interest on deposits. After
one year, what does $1 become? Do this when the money is compounded once,
twice, and at every instant during the year.
valid for a, b, c in the interval [0, 1]. This remains valid for any number of
factors.
Exercise A.3.5 Use the previous exercise, (A.3.1), (A.3.3), and the identity
$$1 + 2 + 3 + \cdots + (k-2) + (k-1) = \frac{k(k-1)}{2}$$
A.4 Complex Numbers
In §1.4, we studied points in two dimensions, and we saw how points can be
added and subtracted. In §2.1, we studied points in any number of dimensions,
and there we also added and subtracted points.
$$P'' = PP' = (xx' - yy',\, x'y + xy'), \qquad P'' = P/P' = (xx' + yy',\, x'y - xy'). \tag{A.4.1}$$
so (A.4.1) is equivalent to
$$P'' = x'P \pm y'P^\perp. \tag{A.4.2}$$
$$PP' = (xx' - yy',\, x'y + xy'), \tag{A.4.3}$$
Because of this, we can write z = x instead of z = (x, 0), this only for points
in the plane, and we call the horizontal axis the real axis.
Similarly, let i = (0, 1). Then the point i is on the vertical axis, and, using
(A.4.1), one can check
$$i^2 = i\cdot i = (0, 1)(0, 1) = (-1, 0) = -1.$$
Thus the vertical axis consists of all points of the form ix. These are called
imaginary numbers, and the vertical axis is the imaginary axis.
Using i, any point P = (x, y) may be written
P = x + iy,
since
x + iy = (x, 0) + (y, 0)(0, 1) = (x, 0) + (0, y) = (x, y).
This leads to Figure A.6. In this way, real numbers x are considered complex
numbers with zero imaginary part, x = x + 0i.
[Figure A.6: the point 3 + 2i in the complex plane.]
Square Root of −1

The complex number i = (0, 1) satisfies $i^2 = -1$.
and
$$\frac{z}{z'} = \frac{x + iy}{x' + iy'} = \frac{(xx' + yy') + i(x'y - xy')}{x'^2 + y'^2}.$$
In particular, one can always “move” the i from the denominator to the
numerator by the formula
$$\frac{1}{z} = \frac{1}{x + iy} = \frac{x - iy}{x^2 + y^2} = \frac{\bar z}{|z|^2}.$$
From this and (A.4.1), using $(x, y) = (\cos\theta, \sin\theta)$, $(x', y') = (\cos\theta', \sin\theta')$,
we have the addition formulas
$$\sin(\theta + \theta') = \sin\theta\cos\theta' + \cos\theta\sin\theta', \qquad \cos(\theta + \theta') = \cos\theta\cos\theta' - \sin\theta\sin\theta'. \tag{A.4.6}$$
$$z = x + yi \implies \sqrt{z} = \frac{r + x}{\sqrt{2r + 2x}} + \frac{yi}{\sqrt{2r + 2x}}. \tag{A.4.7}$$
Here $r = \sqrt{x^2 + y^2}$ and this formula is valid as long as z is not a negative
number or zero. When z is a negative number or zero, $z = -x$ with $x \ge 0$,
we have $\sqrt{z} = i\sqrt{x}$. We conclude every complex number has square roots.
When z is on the unit circle, r = 1, so the formula reduces to
$$\sqrt{z} = \frac{1 + x}{\sqrt{2 + 2x}} + \frac{yi}{\sqrt{2 + 2x}}.$$
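A quick check of (A.4.7) using Python's built-in complex numbers, for illustration:

from math import sqrt

x, y = 3.0, 4.0
r = sqrt(x * x + y * y)
w = complex(r + x, y) / sqrt(2 * r + 2 * x)   # the square root given by (A.4.7)
print(w * w)                                  # (3+4j), up to rounding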
We will need the roots of unity in §3.2. This generalizes square roots, cube
roots, etc.
A complex number ω is a root of unity if ω d = 1 for some power d. If d is
the power, we say ω is a d-th root of unity.
For example, the square roots of unity are ±1, since (±1)2 = 1. Here we
have
1 = cos 0 + i sin 0, −1 = cos π + i sin π.
The fourth roots of unity are ±1 and ±i, since (±1)4 = 1 and (±i)4 = 1.
Here we have
1 = cos 0 + i sin 0,
i = cos(π/2) + i sin(π/2),
−1 = cos π + i sin π,
−i = cos(3π/2) + i sin(3π/2).
[Figure: the square, cube, and fourth roots of unity, $\omega^2 = 1$, $\omega^3 = 1$, $\omega^4 = 1$, on the unit circle.]
If $\omega^d = 1$, then
$$(\omega^k)^d = (\omega^d)^k = 1^k = 1.$$
With ω given by (A.4.8), this implies
$$1, \omega, \omega^2, \dots, \omega^{d-1}$$
are d-th roots of unity. For example, when d = 3,
$$1^3 = 1, \qquad \omega^3 = 1, \qquad (\omega^2)^3 = 1.$$
[Figure: the fifth, sixth, and fifteenth roots of unity, $\omega^5 = 1$, $\omega^6 = 1$, $\omega^{15} = 1$, on the unit circle.]
Summarizing,
Roots of Unity
Let $\omega = \cos(2\pi/d) + i\sin(2\pi/d)$. Then the d-th roots of unity are
$$1, \omega, \omega^2, \dots, \omega^{d-1},$$
that is,
$$\omega^k = \cos(2\pi k/d) + i\sin(2\pi k/d), \qquad k = 0, 1, 2, \dots, d-1.$$
$$\frac{z^d - 1}{z - 1} = \prod_{k=1}^{d-1}(z - \omega^k). \tag{A.4.10}$$
from sympy import symbols, solve

z = symbols('z')
d = 5
solve(z**d - 1)
The numpy function roots returns the roots of a polynomial given its list of coefficients; for example, roots([a,b,c]) returns the roots of $az^2 + bz + c$. Since the cube roots of unity are the roots of $p(z) = z^3 - 1$, the code

roots([1,0,0,-1])

returns the cube roots of unity.
Exercises
Exercise A.4.1 Let P = (1, 2) and Q = (3, 4) and R = (5, 6). Calculate P Q,
P/Q, P R, P/R, QR, Q/R.
Exercise A.4.2 Let a = 1 + 2i and b = 3 + 4i and c = 5 + 6i. Calculate ab,
a/b, ac, a/c, bc, b/c.
Exercise A.4.3 We say z′ is the reciprocal of z if zz′ = 1. Show the reciprocal
of z = x + yi is
$$z' = \frac{x - yi}{x^2 + y^2}.$$
Exercise A.4.4 Show $\sqrt{z}$ given by (A.4.7) satisfies $(\sqrt{z})^2 = z$.
Exercise A.4.7 Let 1, ω, . . . , ω d−1 be the d-th roots of unity. Using the
code below, compute the product
z = symbols('z')
roots = solve(z**d - 1)
A.5 Integration
[Figure: the graph of f(x) over [a, b], with a thin strip between x and x + dx under the graph.]
To derive this, let A(x) denote the area under the graph between the y-
axis and the vertical line at x. Then A(x) is the sum of the gray area and
the red area, A(a) is the gray area, and A(b) is the sum of four areas: gray,
red, green, and blue. It follows the integral (A.5.1) equals A(b) − A(a).
Since A(x + dx) is the sum of three areas, gray, red, green, it follows
A(x + dx) − A(x) is the green area. But the green area is approximately a
rectangle of width dx and height f (x). Hence the green area is approximately
f (x) × dx, or
A(x + dx) − A(x) ≈ f (x) dx.
As a consequence of this analysis, $A'(x) = f(x)$.
Now let F (x) be any function satisfying F ′ (x) = f (x). Then A(x) and
F (x) have the same derivative, so A(x)−F (x) has derivative zero. By (4.1.2),
A(x) − F (x) is a constant C, or A(x) = F (x) + C. This implies
$$\int_a^b f(x)\,dx = A(b) - A(a) = (F(b) + C) - (F(a) + C) = F(b) - F(a).$$
When d = 2, a = −1, b = 1, this is 2/3, which is the area under the parabola
in Figure A.10.
When a = 0, b = 1,
$$\int_0^1 t^d\,dt = \frac{1}{d+1}. \tag{A.5.3}$$
When F (x) can’t be found, we can’t use the FTC. Instead we use Python
to evaluate the integral (A.5.1) as follows.
d = 2
a, b = -1, 1
f = lambda x: x**d
I = quad(f, a, b)
This not only returns the computed integral I but also an estimate of the
error between the computed integral and the theoretical value,
(0.6666666666666666, 7.401486830834376e-15).
quad refers to quadrature, which is another term for integration.
Another example is the area under one hump of the sine curve in Figure A.11,
$$\int_0^{\pi}\sin x\,dx = -\cos\pi - (-\cos 0) = -(-1) + 1 = 2.$$
Here f(x) = sin x, F(x) = − cos x, F′(x) = f(x). The Python code quad
returns (2.0, 2.220446049250313e-14).
It is important to realize the integral (A.5.1) is the signed area under the
graph: Portions of areas that are below the x-axis are counted negatively. For
example,
$$\int_0^{2\pi}\sin x\,dx = -\cos(2\pi) - (-\cos 0) = -1 + 1 = 0.$$
Explicitly,
$$\int_0^{2\pi}\sin x\,dx = \int_0^{\pi}\sin x\,dx + \int_{\pi}^{2\pi}\sin x\,dx = 2 - 2 = 0,$$
so the areas under the first two humps in Figure A.11 cancel.
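These values are also what quad reports; the following check is an illustration:

from numpy import sin, pi
from scipy.integrate import quad

print(quad(sin, 0, pi))       # approximately (2.0, ...)
print(quad(sin, 0, 2 * pi))   # approximately (0.0, ...)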
from numpy import linspace
from matplotlib.pyplot import axes, plot, title, show
from scipy.integrate import quad

def plot_and_integrate(f,a,b,pi_ticks=False):
    # initialize figure
    ax = axes()
    ax.grid(True)
    # draw x-axis and y-axis
    ax.axhline(0, color='black', lw=1)
    ax.axvline(0, color='black', lw=1)
    # set x-axis ticks as multiples of pi/2
    if pi_ticks: set_pi_ticks(a,b)
    x = linspace(a,b,100)
    plot(x,f(x))
    positive = f(x)>=0
    negative = f(x)<0
    ax.fill_between(x,f(x), 0, color='green', where=positive, alpha=.5)
    ax.fill_between(x,f(x), 0, color='red', where=negative, alpha=.5)
    I = quad(f,a,b,limit=1000)[0]
    title("integral equals " + str(I),fontsize = 10)
    show()
plot_and_integrate(f,a,b,pi_ticks=True)
Above, the Python function set_pi_ticks(a,b) sets the x-axis tick mark
labels at the multiples of π/2. The code for set_pi_ticks is in §4.1.
The exercises are meant to be done using the code in this section. For the
infinite limits below, use numpy.inf.
Exercises
Exercise A.5.1 Plot and integrate f (x) = x2 + A sin(5x) over the interval
[−10, 10], for amplitudes A = 0, 1, 2, 4, 15. Note the integral doesn’t depend
on A. Why?
Exercise A.5.3 Plot and integrate f (x) = exp(−x) over [a, b] with a = 0,
b = 1, 10, 100, 1000, 10000.
Exercise A.5.4 Plot and integrate $f(x) = \sqrt{1 - x^2}$ over [−1, 1].
Exercise A.5.5 Plot and integrate $f(x) = 1/\sqrt{1 - x^2}$ over [−1, 1].
Exercise A.5.6 Plot and integrate $f(x) = (-\log x)^n$ over [0, 1] for n =
2, 3, 4. What is the answer for general n?
Exercise A.5.7 With k = 7, n = 10, plot and integrate using Python
$$\int_0^1 x^k(1 - x)^{n-k}\,dx.$$
$$\frac{2}{\pi}\int_0^{\infty}\frac{\sin x}{x}\,dx.$$
Exercise A.5.10 Use numpy.inf to plot the normal pdf and compute its
integral
$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx.$$
Exercise A.5.11 Let $\sigma(x) = 1/(1+e^{-x})$. Plot and integrate $f(x) = \sigma(x)(1 - \sigma(x))$ over [−10, 10]. What is the answer for (−∞, ∞)?
Exercise A.5.12 Let $P_n(x)$ be the Legendre polynomial of degree n (§4.1).
Use num_legendre (§4.1) to compute the integral
$$\int_{-1}^{1}P_n(x)^2\,dx$$
for n = 1, 2, 3, 4. What is the integral for general n? Hint – take the reciprocal
of the answer.
Asymptotic Vanishing

$$a_n \approx 0 \implies ca_n \approx 0.$$

Convergence of Reciprocals

If $a_n \approx 1$, then $1/a_n \approx 1$.

Asymptotic Equality

$$a_n \approx b_n \iff \frac{a_n}{b_n} \approx 1 \iff \frac{a_n}{b_n} - 1 \approx 0.$$
from math import sqrt, pi, e

def factorial(n):
    if n == 1: return 1
    else: return n * factorial(n-1)

# Stirling's approximation sqrt(2*pi*n)*(n/e)**n
def stirling(n): return sqrt(2*pi*n) * (n/e)**n

a = factorial(100)
b = stirling(100)
a/b, a-b

returns
$$(1.000833677872004,\ 7.773919124995513 \times 10^{154}).$$
The first entry is close to one, but the second entry is far from zero.
If, however, $b_n \approx b$ for some nonzero constant b, then (Exercise A.6.6)
ratios and differences are the same,
$$a_n \approx b_n \iff a_n - b_n \approx 0. \tag{A.6.2}$$
$$a = \lim_{n\to\infty}a_n. \tag{A.6.3}$$
As we saw above, limits and asymptotic equality are the same, as long as the
limit is not zero. When a is the limit of an , we also say an converges to a, or
an approaches a and we write an → a.
Limits can be taken for sequences of points in Rd as well. Let an be a
sequence of points in Rd . We say an converges to a if an · v converges to a · v
for every vector v. Here we also write an → a and we write (A.6.3).
Exercises
$$a_n + b_n \to a + b, \qquad a_n b_n \to ab.$$
Several times in the text, we deal with minimizing functions, most notably for
the pseudo-inverse of a matrix (§2.3), for proper continuous functions (§4.5),
and for gradient descent (§7.3).
Previously, the technical foundations underlying the existence of minimiz-
ers were ignored. In this section, we review the foundational material sup-
porting the existence of minimizers.
For example, since $y = e^x$ is an increasing function, the minimum
$$\min_{0 \le x \le 1} e^x = \min\{e^x \mid 0 \le x \le 1\}$$
Completeness Property
$$\lim_{n\to\infty}x_n.$$
Here it is important that the indices $n_1 < n_2 < n_3 < \dots$ be strictly increasing.
If a sequence x1 , x2 , . . . has a subsequence x′1 , x′2 , . . . converging to x∗ ,
then we say the sequence x1 , x2 , . . . subconverges to x∗ . For example, the
sequence 1, −1, 1, −1, 1, −1, . . . subconverges to 1 and also subconverges
to −1, as can be seen by considering the odd-indexed terms and the even-
indexed terms separately.
$$I_0 \supset I_1 \supset I_2 \supset \dots,$$
$$x^* = \lim_{n\to\infty}x_n^*$$
As we saw above, a minimizer may or may not exist, and, when the minimizer
does exist, there may be several minimizers.
A function y = f (x) is continuous if f (xn ) approaches f (x∗ ) whenever xn
approaches x∗ ,
xn → x∗ =⇒ f (xn ) → f (x∗ ),
Existence of Minimizers
Let
$$c = \frac{f(x_1) + m_1}{2}$$
be the midpoint between $m_1$ and $f(x_1)$.
There are two possibilities. Either c is a lower bound or not. In the first
case, define m2 = c and x2 = x1 . In the second case, there is a point x2 in
S satisfying f (x2 ) < c, and we define m2 = m1 . As a consequence, in either
case, we have f (x2 ) ≥ m2 , m1 ≤ m2 , and
$$f(x_2) - m_2 \le \frac{1}{2}\left(f(x_1) - m_1\right).$$
Let
$$c = \frac{f(x_2) + m_2}{2}$$
be the midpoint between $m_2$ and $f(x_2)$.
There are two possibilities. Either c is a lower bound or not. In the first
case, define $m_3 = c$ and $x_3 = x_2$. In the second case, there is a point $x_3$ in
S satisfying $f(x_3) < c$, and we define $m_3 = m_2$. As a consequence, in either
case, we have $f(x_3) \ge m_3$, $m_2 \le m_3$, and
$$f(x_3) - m_3 \le \frac{1}{2^2}\left(f(x_1) - m_1\right).$$
Continuing in this manner, we have a sequence $x_1, x_2, \dots$ in S, and an
increasing sequence $m_1 \le m_2 \le \dots$ of lower bounds, with
$$f(x_n) - m_n \le \frac{2}{2^n}\left(f(x_1) - m_1\right).$$
Since S is bounded, xn subconverges to some x∗ . Since f (x) is continuous,
f (xn ) subconverges to f (x∗ ). Since f (xn ) ≈ mn and mn is a lower bound for
all n, f (x∗ ) is a lower bound, hence x∗ is a minimizer.
A.8 SQL
sense that proprietary extensions are avoided, and the software is compatible
with the widest range of commercial variations.
Because database tables can contain millions of records, it is best to ac-
cess a database server programmatically, using an application programming
interface, rather than a graphical user interface. The basic API for inter-
acting with database servers is SQL (structured query language). SQL is a
programming language for creating and modifying databases.
Any application on your laptop that is used to access a database is called an
SQL client. The database server being accessed may be local, running on the
same computer you are logged into, or remote, running on another computer
on the internet. In our examples, the code assumes a local or remote database
server is being accessed.
Because SQL commands are case-insensitive, by default we write them
in lowercase. Depending on the SQL client, commands may terminate with
semicolons or not. As mentioned above, data may be numbers or strings.
The basic SQL commands are
select from
limit
select distinct
where/where not <column>
where <column> = <data> and/or <column> = <data>
order by <column1>,<column2>
insert into table (<column1>,<column2>,...) \
values (<data1>, <data2>, ...)
is null
update <table> set <column> = <data> where ...
like <regex> (%, _, [abc], [a-f], [!abc])
delete from <table> where ...
select min(<column>) from <table> (also max, count, avg)
where <column> in/not in (<data array>)
between/not between <data1> and <data2>
as
join (left, right, inner, full)
create database <database>
drop database <database>
create table <table>
truncate <table>
alter table <table> add <column> <datatype>
alter table <table> drop column <column>
insert into <table> select
This is an unordered listing of key-value pairs. Here the keys are the strings
dish, price, and quantity. Keys need not be strings; they may be integers or
any immutable Python objects. Since a Python list is mutable, a key cannot
be a list. Values may be any Python objects, so a value may be a list. In
a dict, values are accessed through their keys. For example, item1["dish"]
returns 'Hummus'.
A list-of-dicts is simply a Python list whose elements are Python dicts, for
example,
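The items in the restaurant example come from Figure A.15; as a standalone illustration only (the dish names other than Hummus, the prices, and the quantities below are made up, and the text's own two-item list has a JSON string of length 99):

item1 = {"dish": "Hummus", "price": 800, "quantity": 2}    # price in cents (assumed)
item2 = {"dish": "Falafel", "price": 700, "quantity": 1}
L = [item1, item2]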
len(L), L[0]["dish"]
returns
(2,'Hummus')
returns True.
A list-of-dicts L can be converted into a string using the json module, as
follows:
from json import dumps
s = dumps(L)
Now print L and print s. Even though L and s “look” the same, L is a list,
and s is a string. To emphasize this point, note
• len(L) == 2 and len(s) == 99,
• L[0:2] == L and s[0:2] == '[{'
• L[8] returns an error and s[8] == ':'
To convert back the other way, use
L1 = loads(s)
Then L == L1 returns True. Strings having this form are called JSON strings,
and are easy to store in a database as VARCHARs (see Figure A.16).
The basic object in the Python package pandas is the dataframe (Figures
A.13, A.14, A.16, A.17). pandas can convert a dataframe df to many, many
other formats
df = DataFrame(L)
df
L1 = df.to_dict('records')
L == L1
returns True. Here the option 'records' returns a list-of-dicts; other options
return a dict-of-dicts or other combinations.
To convert a CSV file into a dataframe, use the code
menu_df = read_csv("menu.csv")
menu_df
To go the other way, to convert the dataframe df to the CSV file menu1.csv, use the code

df.to_csv("menu1.csv")

To omit the dataframe index column, use

df.to_csv("menu2.csv",index=False)
To connect using sqlalchemy, we first collect the connection data into one
URI string,
protocol = "mysql+pymysql://"
credentials = "username:password"
server = "@servername"
port = ":3306"
uri = protocol + credentials + server + port
This string contains your database username, your database password, the
database server name, the server port, and the protocol. If the database is
"rawa", the URI is
database = "/rawa"
uri = protocol + credentials + server + port + database
import sqlalchemy

engine = sqlalchemy.create_engine(uri)
df.to_sql('Menu',engine,if_exists='replace')
df.to_sql('Menu',engine)
One benefit of this syntax is the automatic closure of the connection upon
completion. This completes the discussion of how to convert between dataframes
and SQL tables, and, more generally, between any of the objects in (A.8.2).
As an example of how all this goes together, here is a task:

Given two CSV files menu.csv and orders.csv downloaded from a restaurant website
(Figure A.15), create three SQL tables Menu, OrdersIn, OrdersOut.
/* Menu */
dish varchar
price integer
/* ordersin */
orderid integer
created datetime
customerid integer
items json
/* ordersout */
orderid integer
subtotal integer
tip integer
tax integer
total integer
To achieve this task, we download the CSV files menu.csv and orders.csv, then we carry out these steps. (price and tip in menu.csv and
orders.csv are in cents so they are INTs.)
1. Read the CSV files into dataframes menu_df and orders_df.
2. Convert the dataframes into list-of-dicts menu and orders.
3. Create a list-of-dicts OrdersIn with keys orderId, created, customerId
whose values are obtained from list-of-dicts orders.
4. Create a list-of-dicts OrdersOut with keys orderId, tip whose values are
obtained from list-of-dicts orders (tips are in cents so they are INTs).
5. Add a key items to OrdersIn whose values are JSON strings specifying
the items ordered in orders, using the prices in menu (these are in cents so
they are INTs). The JSON string is of a list-of-dicts in the form discussed
above L = [item1, item2] (see row 0 in Figure A.16).
Do this by looping over each order in the list-of-dicts orders, then loop-
ing over each item in the list-of-dicts menu, and extracting the quantity
ordered of the item item in the order order.
6. Add a key subtotal to OrdersOut whose values (in cents) are computed
from the above values.
Add a key tax to OrdersOut whose values (in cents) are computed using
the Connecticut tax rate 7.35%. Tax is applied to the sum of subtotal
and tip.
Add a key total to OrdersOut whose values (in cents) are computed
from the above values (subtotal, tax, tip).
7. Convert the list-of-dicts OrdersIn, OrdersOut to dataframes OrdersIn_df, OrdersOut_df.
8. Upload menu_df, OrdersIn_df, OrdersOut_df to tables Menu, OrdersIn,
OrdersOut.
The resulting dataframes ordersin_df and ordersout_df, and SQL ta-
bles OrdersIn and OrdersOut, are in Figures A.16 and A.17.
# step 1
from pandas import *

protocol = "https://"
server = "omar-hijab.org"
path = "/teaching/csv_files/restaurant/"
url = protocol + server + path

menu_df = read_csv(url + "menu.csv")
orders_df = read_csv(url + "orders.csv")
# step 2
menu = menu_df.to_dict('records')
orders = orders_df.to_dict('records')
# step 3
OrdersIn = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["created"] = r["created"]
    d["customerId"] = r["customerId"]
    OrdersIn.append(d)
# step 4
OrdersOut = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["tip"] = r["tip"]
    OrdersOut.append(d)
# step 5
from json import *
# step 6
for i,r in enumerate(OrdersOut):
    items = loads(OrdersIn[i]["items"])
    subtotal = sum([ item["price"]*item["quantity"] for item in items ])
    r["subtotal"] = subtotal
    tip = OrdersOut[i]["tip"]
    tax = int(.0735*(tip + subtotal))
    total = subtotal + tip + tax
    r["tax"] = tax
    r["total"] = total
# step 7
ordersin_df = DataFrame(OrdersIn)
ordersout_df = DataFrame(OrdersOut)

# step 8
import sqlalchemy
from sqlalchemy import create_engine, text

engine = create_engine(uri)

dtype2 = {
    "orderId":sqlalchemy.Integer,
    "created":sqlalchemy.String(30),
    "customerId":sqlalchemy.Integer,
    "items":sqlalchemy.String(1000)
}
dtype3 = {
    "orderId":sqlalchemy.Integer,
    "tip":sqlalchemy.Integer,
    "subtotal":sqlalchemy.Integer,
    "tax":sqlalchemy.Integer,
    "total":sqlalchemy.Integer
}
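The uploads themselves (step 8) can then be done with to_sql; the following is a sketch, and the if_exists and index arguments are choices of this illustration rather than taken from the text:

menu_df.to_sql('Menu', engine, if_exists='replace', index=False)
ordersin_df.to_sql('OrdersIn', engine, if_exists='replace', index=False, dtype=dtype2)
ordersout_df.to_sql('OrdersOut', engine, if_exists='replace', index=False, dtype=dtype3)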
In this section, all work was done in Python on a laptop; no SQL was used on
the database, other than creating a table or downloading a table. Generally,
this is an effective workflow:
• Use SQL to do big manipulations on the database (joining and filtering).
• Use Python to do detailed computations on your laptop (analysis).
Now we consider the following simple problem. The total number of orders
is 3970. What is the total number of plates? To answer this, we loop through
all the orders, summing the number of plates in each order. The answer is
14,949 plates.
protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database
engine = sqlalchemy.create_engine(uri)
connection = engine.connect()
num = 0
# df as read from the OrdersIn table (e.g. with pandas.read_sql)
for item in df["items"]:
    num += sum([ dish["quantity"] for dish in loads(item) ])
print(num)
def num_plates(item):
    dishes = loads(item)
    return sum( [ dish["quantity"] for dish in dishes ] )

Then we use map to apply this function to every element in the series
df["items"], resulting in another series. Then we sum the resulting series.

num = df["items"].map(num_plates).sum()
print(num)
Since the total number of plates is 14,949, and the total number of orders
is 3970, the average number of plates per order is 3.76.
References
[19] J. W. Longley. “An Appraisal of Least Squares Programs for the Elec-
tronic Computer from the Point of View of the User”. In: Journal of
the American Statistical Association 62.319 (1967), pp. 819–841.
[20] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming.
Springer, 2008.
[21] M. P. Deisenroth, A. A. Faisal, and C. S. Ong. Mathematics for Machine
Learning. Cambridge University Press, 2020.
[22] M. Minsky and S. Papert. Perceptrons, An Introduction to Computa-
tional Geometry. MIT Press, 1988.
[23] Y. Nesterov. Lectures on Convex Optimization. Springer, 2018.
[24] K. Pearson. “On the criterion that a given system of deviations from
the probable in the case of a correlated system of variables is such that
it can be reasonably supposed to have arisen from random sampling”.
In: Philosophical Magazine Series 5 50:302 (1900), pp. 157–175.
[25] R. Penrose. “A generalized inverse for matrices”. In: Proceedings of the
Cambridge Philosophical Society 51 (1955), pp. 406–413.
[26] B. T. Polyak. “Some methods of speeding up the convergence of itera-
tion methods”. In: USSR Computational Mathematics and Mathemat-
ical Physics 4(5) (1964), pp. 1–17.
[27] The WeBWorK Project. url: https://openwebwork.org/.
[28] S. Raschka. PCA in three simple steps. 2015. url: https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html.
[29] H. Robbins and S. Monro. “A Stochastic Approximation Method”. In:
The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407.
[30] S. M. Ross. Probability and Statistics for Engineers and Scientists, Sixth
Edition. Academic Press, 2021.
[31] M. J. Schervish. Theory of Statistics. Springer, 1995.
[32] G. Strang. Linear Algebra and its Applications. Brooks/Cole, 1988.
[33] Stanford University. CS224N: Natural Language Processing with Deep
Learning. url: https://web.stanford.edu/class/cs224n.
[34] I. Waldspurger. Gradient Descent With Momentum. 2022. url: https://www.ceremade.dauphine.fr/~waldspurger/tds/22_23_s1/advanced_gradient_descent.pdf.
[35] Wikipedia. Logistic Regression. url: https://en.wikipedia.org/wiki/Logistic_regression.
[36] S. J. Wright and B. Recht. Optimization for Data Analysis. Cambridge
University Press, 2022.
Python Index

*, 9, 16
all, 193
append, 193
def.angle, 25, 68
def.assign_clusters, 193
def.backward_prop, 235, 242, 400
def.ball, 55
def.cartesian_product, 337
def.chi2_independence, 383
def.comb_tuples, 454
def.confidence_interval, 365, 375
def.delta_out, 400
def.derivative, 242
def.dimension_staircase, 126
def.downstream, 400
def.draw_major_minor_axes, 50
def.ellipse, 44
def.find_first_defect, 125
def.forward_prop, 235, 241, 396
def.gd, 410
def.goodness_of_fit, 379
def.H, 270
def.hexcolor, 11
def.incoming, 240, 395
def.J, 397
def.local, 398
def.matrix_text, 45
def.nearest_index, 193
def.newton, 405
def.num_legendre, 203
def.num_plates, 508
def.outgoing, 240, 395
def.pca, 187
def.pca_with_svd, 187
def.perm_tuples, 453
def.plot_and_integrate, 486
def.plot_cluster, 194
def.plot_descent, 405
def.poly, 433
def.project, 116
def.project_to_ortho, 118
def.pvalue, 328
def.random_batch_mean, 287
def.random_vector, 193
def.set_pi_ticks, 215
def.sym_legendre, 202
def.tensor, 33
def.train_nn, 412
def.ttest, 376
def.type2_error, 371, 377
def.uniq, 5
def.update_means, 193
def.update_weights, 412
def.zero_variance, 104
diag, 181
numpy.sqrt, 25
pandas.DataFrame, 499
pandas.DataFrame.to_csv, 500
pandas.DataFrame.to_dict, 499
pandas.DataFrame.to_sql, 501
pandas.read_csv, 433, 500
pandas.read_sql, 502
random.choice, 11
random.random, 15
scipy.integrate.quad, 485
scipy.linalg.null_space, 93
scipy.linalg.orth, 87
scipy.linalg.pinv, 81
scipy.optimize.newton, 225
scipy.spatial.ConvexHull, 248
    simplices, 249
scipy.special.comb, 454
scipy.special.expit, 277
scipy.special.factorial, 452
scipy.special.perm, 453
scipy.special.softmax, 345
scipy.stats.binom, 268
scipy.stats.chi2, 333
scipy.stats.entropy, 224, 270
scipy.stats.multivariate_normal, 337
scipy.stats.norm, 315
scipy.stats.poisson, 310
scipy.stats.t, 373, 375
sklearn.datasets.load_iris, 2
sklearn.decomposition.PCA, 188
sklearn.preprocessing.StandardScaler, 76
sort, 186
sqlalchemy.create_engine, 501
sqlalchemy.text, 501
sympy.*, 66
sympy.diag, 65
sympy.diagonalize, 144
sympy.diff, 202
sympy.eigenvects, 144
sympy.init_printing, 144
sympy.lambdify, 203
sympy.Matrix, 59
sympy.Matrix.col, 63
sympy.Matrix.cols, 63
sympy.Matrix.columnspace, 86
sympy.Matrix.eye, 64
sympy.Matrix.hstack, 62, 81, 94
sympy.Matrix.inv, 79
sympy.Matrix.nullspace, 92
sympy.Matrix.ones, 64
sympy.Matrix.rank, 130
sympy.Matrix.row, 63
sympy.Matrix.rows, 63
sympy.Matrix.rowspace, 90
sympy.Matrix.zeros, 64
sympy.prod, 482
sympy.shape, 59
sympy.simplify, 202
sympy.solve, 299, 481
sympy.symbols, 202
tuple, 19
zip, 191
Omar Hijab obtained his doctorate from the University of Cal-
ifornia at Berkeley, and is faculty at Temple University in
Philadelphia, Pennsylvania. Currently he is affiliated with the
University of New Haven in West Haven, Connecticut.