
Omar Hijab

Math for Data Science


Copyright © 2022–2024 Omar Hijab. All Rights Reserved.
Compiled 2024-10-24 10:51:19-04:00
Preface

This text is a presentation of the mathematics underlying Data Science, and assumes the math background typical of an Electrical Engineering undergraduate. In particular, Chapter 4, Calculus, assumes the reader has prior calculus exposure.
By contrast, because we outsource computations to Python, and focus
on conceptual understanding, Chapter 2, Linear Geometry, is developed in
depth.
Depending on the emphasis and supplementary material, the text is ap-
propriate for a course in the following programs
• Applied Mathematics,
• Business Analytics,
• Computer Science,
• Data Science,
• Engineering.
The level and pace of the text varies from gentle, at the start, to advanced,
at the end. Depending on the depth of coverage, the text is appropriate for
a one-semester course, or a two-semester course.
Chapters 1-3, together with some of the appendices, form the basis for
a leisurely one-semester course, and Chapters 4-7 form the basis for an ad-
vanced one-semester course. The chapter ordering is chosen to allow for the
two semesters being taught simultaneously. The text was written while being
repeatedly taught as a two-semester course with the two semesters taught
simultaneously.
The culmination of the text is Chapter 7, Machine Learning. Much of
the mathematics developed in prior chapters is used here. While only an
introduction to the subject, the material in Chapter 7 is carefully and logically
built up from first principles.
As a consequence, the presentation and some results are new: The proofs of heavy ball convergence and Nesterov convergence for accelerated gradient descent are simplifications of the proofs in [36], and the connection between properness and trainability seems to be new to the literature.

Important principles or results are displayed in these boxes.

The ideas presented in the text are made concrete by interpreting them
in Python code. The standard Python data science packages are used, and a
Python index lists the functions used in the text.
Because Python is used to highlight concepts, the code examples are pur-
posely simple to follow. This should be helpful to the reader new to Python.

Python code is displayed in these boxes.

Because SQL is usually part of a data scientist’s toolkit, an introduction to using SQL, from within Python, is included in an appendix. Also, in case
the instructor wishes to de-emphasize it, integration is presented separately
in an appendix. Other appendices cover combinations and permutations, the
binomial theorem, the exponential function, complex numbers, asymptotics,
and minimizers, to be used according to the instructor’s emphasis and pref-
erences.
The bibliography at the end is a listing of the references accessed while writing the text. Throughout, we use iff to mean if and only if, and we use ≈ for asymptotic equality (§A.6). Apart from a few exceptions, all figures in the text were created by the author using Python or TikZ. The text is typeset using TeX. The boxes above are created using tcolorbox.
To help navigate the text, in each section, to indicate a break, a new idea, or a change in direction, we use a ship’s wheel symbol.
Sections and figures are numbered sequentially within each chapter, and equations and exercises are numbered sequentially within each section, so §3.4 is the fourth section in the third chapter, Figure 4.14 is the fourteenth figure in the fourth chapter, (3.2.1) is the first equation in the second section of the third chapter, and Exercise 1.2.3 is the third exercise in the second section of the first chapter. Also, [1] cites the first entry in the references.
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The MNIST Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Averages and Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Two Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.6 High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

2 Linear Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.1 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.2 Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.3 Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.4 Span and Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.5 Zero Variance Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.6 Pseudo-Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.7 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.8 Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.9 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

3 Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135


3.1 Geometry of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.2 Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.3 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
3.4 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.5 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
3.6 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

4 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.1 Single-Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.2 Entropy and Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.3 Multi-Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
4.4 Back Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232


4.5 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

5 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
5.1 Binomial Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
5.2 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
5.3 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.4 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
5.5 Chi-squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
5.6 Multinomial Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343

6 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
6.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
6.2 Z-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
6.3 T -test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
6.4 Chi-Squared Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378

7 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389


7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
7.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
7.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
7.4 Network Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
7.5 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
7.6 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
7.7 Regression Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
7.8 Strict Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
7.9 Accelerated Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444

Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
A.1 Permutations and Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . 451
A.2 The Binomial Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
A.3 The Exponential Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
A.4 Complex Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
A.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
A.6 Asymptotics and Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
A.7 Existence of Minimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
A.8 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509

Python Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
List of Figures

1.1 Iris dataset [28]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2


1.2 Images in the MNIST dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 A portion of the MNIST dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Original and projections: n = 784, 600, 350, 150, 50, 10, 1. . . . . 6
1.5 The MNIST dataset (3d projection). . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 A crude copy of the image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 HTML colors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 The vector v joining the points m and x. . . . . . . . . . . . . . . . . . . . 12
1.9 Datasets of points versus datasets of vectors. . . . . . . . . . . . . . . . . 13
1.10 A dataset with its mean. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.11 Vectorization of samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.12 A vector v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.13 Vectors v1 and v2 and their shadows in the plane. . . . . . . . . . . . 19
1.14 Adding v1 and v2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.15 Scaling with t = 2 and t = −2/3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.16 The polar representation of v = (x, y). . . . . . . . . . . . . . . . . . . . . . 22
1.17 v and its antipode −v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.18 Two vectors v1 and v2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.19 Pythagoras for general triangles. . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.20 Proof of Pythagoras for general triangles. . . . . . . . . . . . . . . . . . . . 26
1.21 P and P ⊥ and v and v ⊥ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.22 MSD for the mean (green) versus MSD for a random point
(red). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.23 Projecting a vector b onto the line through u. . . . . . . . . . . . . . . . 41
1.24 Unit variance ellipses (blue) and unit inverse variance ellipses
(red) with µ = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.25 Variance ellipses (blue) and inverse variance ellipses (red) for
a dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.26 Unit variance ellipse and unit inverse variance ellipse with
standardized Q. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


1.27 Positively and negatively correlated datasets (unit inverse


ellipses). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.28 Ellipsoid and axes in 3d. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.29 Disks inside the square. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.30 Balls inside the cube. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.31 Suspensions of interval [a, b] and disk D. . . . . . . . . . . . . . . . . . . . 57

2.1 Numpy column space array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88


2.2 The points 0, x, Ax, and b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.3 The points x, Ax, the points x∗ , Ax∗ , and the point x+ . . . . . . 105
2.4 Projecting onto a line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.5 Projecting onto a plane, P b = ru + sv. . . . . . . . . . . . . . . . . . . . . . 115
2.6 Dataset, reduced dataset, and projected dataset, n < d. . . . . . . 119
2.7 Relations between vector classes. . . . . . . . . . . . . . . . . . . . . . . . . . . 123
2.8 First defect for MNIST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
2.9 The dimension staircase with defects. . . . . . . . . . . . . . . . . . . . . . . 126
2.10 The dimension staircase for the MNIST dataset. . . . . . . . . . . . . . 127
2.11 A 5 × 3 matrix A is a linear transformation from R3 to R5 . . . 129

3.1 Image of unit circle with σ1 = 1.5 and σ2 = .75. . . . . . . . . . . . . . 137


3.2 SVD decomposition A = U SV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.3 Relations between matrix classes. . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.4 Inverse variance ellipse and centered dataset. . . . . . . . . . . . . . . . . 149
3.5 S = span(v1 ) and T = S ⊥ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.6 Three springs at rest and perturbed. . . . . . . . . . . . . . . . . . . . . . . . 156
3.7 Six springs at rest and perturbed. . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.8 Two springs along a circle leading to Q(2). . . . . . . . . . . . . . . . . . 158
3.9 Five springs along a circle leading to Q(5). . . . . . . . . . . . . . . . . . 158
3.10 Plot of eigenvalues of Q(50). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
3.11 Density of eigenvalues of Q(d) for d large. . . . . . . . . . . . . . . . . . . 162
3.12 Trace of pseudo-inverse (§2.3) of Q(d). . . . . . . . . . . . . . . . . . . . . . 163
3.13 Directed and undirected graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.14 A weighted directed graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.15 A double edge and a loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
3.16 The complete graph K6 and the cycle graph C6 . . . . . . . . . . . . . . 166
3.17 The triangle K3 = C3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
3.18 Non-isomorphic graphs with degree sequence (3, 2, 2, 1, 1, 1). . . 174
3.19 Complete bipartite graph K53 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
3.20 MNIST eigenvalues as a percentage of the total variance. . . . . . 185
3.21 MNIST eigenvalue percentage plot. . . . . . . . . . . . . . . . . . . . . . . . . 186
3.22 Original and projections: n = 784, 600, 350, 150, 50, 10, 1. . . . . 190
3.23 The full MNIST dataset (2d projection). . . . . . . . . . . . . . . . . . . . 191
3.24 The Iris dataset (2d projection). . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

4.1 f ′ (a) is the slope of the tangent line at a. . . . . . . . . . . . . . . . . . . . 198



4.2 Composition of two functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200


4.3 The logarithm function log x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.4 Increasing or decreasing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.5 Increasing or decreasing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
4.6 Tangent parabolas pm (x) (green), pL (x) (red), L > m > 0. . . . . 211
4.7 The sine function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
4.8 The sine function with π/2 tick marks. . . . . . . . . . . . . . . . . . . . . . 215
4.9 Angle θ in the plane, P = (x, y). . . . . . . . . . . . . . . . . . . . . . . . . . . 216
4.10 The absolute entropy function H(p). . . . . . . . . . . . . . . . . . . . . . . . 219
4.11 The absolute information I(p). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.12 The relative information I(p, q) with q = .7. . . . . . . . . . . . . . . . . 223
4.13 Surface plot of I(p, q) over the square 0 ≤ p ≤ 1, 0 ≤ q ≤ 1. . . . 224
4.14 Composition of multiple functions. . . . . . . . . . . . . . . . . . . . . . . . . . 228
4.15 Composition of three functions in a chain. . . . . . . . . . . . . . . . . . . 233
4.16 A network composition [33]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
4.17 The function g = max(y, z). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.18 Forward and backward propagation [33]. . . . . . . . . . . . . . . . . . . . 238
4.19 Level sets and sublevel sets in two dimensions. . . . . . . . . . . . . . . 243
4.20 Contour lines in two dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . 244
4.21 Line segment [x0 , x1 ]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
4.22 Convex: The line segment lies above the graph. . . . . . . . . . . . . . 246
4.23 Convex hull of x1 , x2 , x3 , x4 , x5 , x6 , x7 . . . . . . . . . . . . . . . . . . . . 248
4.24 A convex hull with one facet highlighted. . . . . . . . . . . . . . . . . . . . 248
4.25 Convex set in three dimensions with supporting hyperplane. . . 250
4.26 Hyperplanes in two and three dimensions. . . . . . . . . . . . . . . . . . . 251
4.27 Separating hyperplane I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.28 Separating hyperplane II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

5.1 Asymptotics of binomial coefficients. . . . . . . . . . . . . . . . . . . . . . . . 270


5.2 The distribution of p given 7 heads in 10 tosses. . . . . . . . . . . . . . 274
5.3 The logistic function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
5.4 The logistic function takes real numbers to probabilities. . . . . . 277
5.5 Decision boundary (1d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
5.6 Decision boundary (3d). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
5.7 Joint distribution of boys and girls [30]. . . . . . . . . . . . . . . . . . . . . 283
5.8 100,000 sessions, with 5, 15, 50, and 500 tosses per session. . . . 284
5.9 The histogram of Iris petal lengths. . . . . . . . . . . . . . . . . . . . . . . . . 286
5.10 Iris petal lengths sampled 100,000 times. . . . . . . . . . . . . . . . . . . . 287
5.11 Iris petal lengths batch means sampled 100,000 times, batch
sizes 3, 5, 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.12 When we sample X, we get x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
5.13 Probability mass function p(x) of a Bernoulli random variable. 296
5.14 Cumulative distribution function F (x) of a Bernoulli random
variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
5.15 Confidence that X lies in interval [a, b]. . . . . . . . . . . . . . . . . . . . . 305

5.16 Uniform probability density function (pdf). . . . . . . . . . . . . . . . . . 306


5.17 Densities versus distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
5.18 Continuous cumulative distribution function. . . . . . . . . . . . . . . . . 308
5.19 When we sample X1 , X2 , . . . , Xn , we get x1 , x2 , . . . , xn . . . . . 311
5.20 The pdf of the standard normal distribution. . . . . . . . . . . . . . . . 315
5.21 The binomial cdf and its CLT normal approximation. . . . . . . . 319
5.22 z = Z.ppf(p) and p = Z.cdf(z). . . . . . . . . . . . . . . . . . . . . . . . . . 322
5.23 Confidence (green) or significance (red) (lower-tail, two-tail,
upper-tail). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
5.24 68%, 95%, 99% confidence cutoffs for standard normal. . . . . . . . 323
5.25 Cutoffs, confidence levels, p-values. . . . . . . . . . . . . . . . . . . . . . . . . 324
5.26 p-values at 5% and at 1%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
5.27 68%, 95%, 99% cutoffs for non-standard normal. . . . . . . . . . . . . 325
5.28 (X, Y ) inside the square and inside the disk. . . . . . . . . . . . . . . . . 331
5.29 Chi-squared distribution with different degrees. . . . . . . . . . . . . . 333
5.30 With degree d ≥ 2, the chi-squared density peaks at d − 2. . . . . 334
5.31 Normal probability density on R2 . . . . . . . . . . . . . . . . . . . . . . . . . . 338
5.32 The softmax function takes vectors to probability vectors. . . . 345
5.33 The third row is the sum of the first and second rows, and
the H column is the negative of the I column. . . . . . . . . . . . . . . 354

6.1 Statistics flowchart: p-value p and significance α. . . . . . . . . . . . . 358


6.2 Histogram of sampling n = 25 students, repeated N = 1000
times. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
6.3 The error matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
6.4 T -distribution, against normal (dashed). . . . . . . . . . . . . . . . . . . . . 373
6.5 2 × 3 = d × N contingency table [30]. . . . . . . . . . . . . . . . . . . . . . . 382
6.6 Earthquake counts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

7.1 A perceptron with activation function f . . . . . . . . . . . . . . . . . . . . 392


7.2 Perceptrons in parallel (R in the figure is the retina) [22]. . . . . 393
7.3 Network of neurons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
7.4 Incoming and Outgoing signals. . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
7.5 Forward and back propagation between two neurons. . . . . . . . . . 395
7.6 Downstream, local, and upstream derivatives at node i. . . . . . . 398
7.7 A shallow dense layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
7.8 Layered neural network [10]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
7.9 Double well newton descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
7.10 Double well cost function and sublevel sets at w0 and at w1 . . . 408
7.11 Double well gradient descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
7.12 Cost trajectory and number of iterations as learning rate varies. . . 413
7.13 Linear regression neural network with no bias inputs. . . . . . . . . 416
7.14 Logistic regression neural network without bias inputs. . . . . . . . 420
7.15 Population versus employed: Linear Regression. . . . . . . . . . . . . . 430
7.16 Longley Economic Data [19]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431

7.17 Polynomial regression: Degrees 2, 4, 6, 8, 10, 12. . . . . . . . . . . . . . 434


7.18 Hours studied and outcomes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
7.19 Exam dataset: x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
7.20 Exam dataset: (x, p) [35]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
7.21 Exam dataset: (x, x0 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
7.22 Hours studied and one-hot encoded outcomes. . . . . . . . . . . . . . . . 438
7.23 Neural network for student exam outcomes. . . . . . . . . . . . . . . . . . 438
7.24 Equivalent neural network for student exam outcomes. . . . . . . . 439
7.25 Exam dataset: (x, x0 , p). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
7.26 Convex hulls of Iris classes in R2 . . . . . . . . . . . . . . . . . . . . . . . . . . 440
7.27 Convex hulls of MNIST classes in R2 . . . . . . . . . . . . . . . . . . . . . . . 440

A.1 6 = 3! permutations of 3 balls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452


A.2 Pascal’s triangle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
A.3 The exponential function exp x. . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
A.4 Convexity of the exponential function. . . . . . . . . . . . . . . . . . . . . . 472
A.5 Multiplying and dividing points on the unit circle. . . . . . . . . . . . 473
A.6 Complex numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
A.7 The second, third, and fourth roots of unity . . . . . . . . . . . . . . . . 478
A.8 The fifth, sixth, and fifteenth roots of unity . . . . . . . . . . . . . . . . . 479
A.9 Areas under the graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
A.10 Area under the parabola. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
A.11 The graph and area under sin x. . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
A.12 Integral of sin x/x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
A.13 Dataframe from list-of-dicts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
A.14 Menu dataframe and SQL table. . . . . . . . . . . . . . . . . . . . . . . . . . . 500
A.15 Rawa restaurant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
A.16 OrdersIn dataframe and SQL table. . . . . . . . . . . . . . . . . . . . . . . . . 504
A.17 OrdersOut dataframe and SQL table. . . . . . . . . . . . . . . . . . . . . . . 505
Chapter 1
Datasets

In this chapter we explore examples of datasets and some simple Python code. We also review the geometry of vectors in the plane and properties of
2 × 2 matrices, introduce the mean and variance of a dataset, then present a
first taste of what higher dimensions might look like.

1.1 Introduction

Geometrically, a dataset is a sample of N points x1 , x2 , . . . , xN in d-dimensional space Rd . When manipulating datasets as vectors, they are usu-
ally arranged into d × N arrays. When displaying datasets, as in spreadsheets
or SQL tables, they are usually arranged into N × d arrays.
Practically speaking, as we shall see, the following are all representations
of datasets
matrix = CSV file = spreadsheet = SQL table = array = dataframe
Each point x = (t1 , t2 , . . . , td ) in the dataset is a sample or an example,
and the components t1 , t2 , . . . , td of a sample point x are its features or
attributes. As such, d-dimensional space Rd is feature space.
Sometimes one of the features is separated out as the label or target. In
this case, the dataset is a labeled dataset.

The Iris dataset contains N = 150 examples of d = 4 features of Iris flowers, and there are three classes of Irises, Setosa, Versicolor, and Virginica,
with 50 samples from each class. For each example, the class is the label
corresponding to that example, so the Iris dataset is labeled.
The four features are sepal length and width, and petal length and width.
In Figure 1.1, the dataset is displayed as an N × d array.


Fig. 1.1 Iris dataset [28].

The Iris dataset is downloaded using the code

from sklearn import datasets

iris = datasets.load_iris()
iris["feature_names"]

This returns
['sepal length','sepal width','petal length','petal width'].
To return the data and the classes, the code is

dataset = iris["data"]
labels = iris["target"]

dataset, labels

The above code returns dataset as an N × d array. To return a d × N array, take the transpose dataset = iris["data"].T.

The MNIST dataset consists of images of hand-written digits (Figure 1.2).


There are 10 classes of images, corresponding to each digit 0, 1, . . . , 9. We

seek to compress the images while preserving as much as possible of the images’ characteristics.
Each image is a grayscale 28 × 28 pixel image. Since 28² = 784, each image
is a point in d = 784 dimensions. Here there are N = 60000 samples and
d = 784 features.

Fig. 1.2 Images in the MNIST dataset.

This subsection is included just to give a flavor. All unfamiliar words are
explained in detail in Chapter 2. If preferred, just skip to the next subsection.
Suppose we have a dataset of N points

x1 , x2 , . . . , xN

in d-dimensional feature space. We seek to find a lower-dimensional feature space U ⊂ Rd so that the projections of these points onto U retain as much
information as possible about the data.
In other words, we are looking for an n-dimensional subspace U for some
n < d. Among all n-dimensional subspaces, which one should we pick? The
answer is to select U among all n-dimensional subspaces to maximize vari-
ability in the data.
Another issue is the choice of n, which is an integer satisfying 0 ≤ n ≤ d.
On the one hand, we want n to be as small as possible, to maximize data
compression. On the other hand, we want n to be big enough to capture most
of the features of the data. At one extreme, if we pick n = d, then we have
no compression and complete information. At the other extreme, if we pick
n = 0, then we have full compression and no information.
Projecting the data from Rd to a lower-dimensional space U is dimensional
reduction. The best alignment, the best-fit, or the best choice of U is principal
component analysis. These issues will be taken up in §3.5.

If this is your first exposure to data science, there will be a learning curve,
because here there are three kinds of thinking: Data science (datasets, PCA,
descent, networks), math (linear algebra, probability, statistics, calculus), and
Python (numpy, pandas, scipy, sympy, matplotlib). It may help to read the

code examples and the important math principles first, then dive into details as needed.
To illustrate and make concrete concepts as they are introduced, we use
Python code throughout. We run Python code in a jupyter notebook.
jupyter is an IDE, an integrated development environment. jupyter
supports many languages, including Python, Sage, Julia, and R. A useful
jupyter feature is the ability to measure the amount of execution time of a
jupyter cell by including at the start of the cell

%%time

It’s simplest to first install Python, then jupyter. If your laptop is not a
recent model, to minimize overhead, it’s best to install Python directly and
avoid extra packages or frameworks. If Python is installed from
https://www.python.org/downloads/,
then the Python package installer pip is also installed.
From within a shell, check the latest version of pip is installed using the
command
pip --version.
The versions of Python and pip used in this edition of the text are 3.12.*
and 24.*. The first step is to ensure updated versions of Python and pip are
on your laptop.
After this, from within a shell, use pip to install your first package:
pip install jupyter
After installing jupyter, all other packages are installed from within
jupyter. For example, for this text, from within a jupyter cell, we ran

pip install numpy


pip install sympy
pip install scipy
pip install scikit-learn
pip install pandas
pip install matplotlib
pip install ipympl
pip install sqlalchemy
pip install pymysql

After installing these packages, restart jupyter to activate the packages.


The above is a complete listing of the packages used in this text.
Because one often has to repeatedly install different versions of Python,
it’s best to isolate your installations from whatever Python your laptop’s
OS uses. This is achieved by carrying out the above steps within a venv, a

virtual environment. Then several venvs may be set up side-by-side, and, at


any time, any venv may be deleted without impacting any others, or the OS.

Exercises

Exercise 1.1.1 What is dataset.shape and labels.shape?

Exercise 1.1.2 What does sum(dataset[0]) return and why?

Exercise 1.1.3 What does sum(dataset) return and why?

Exercise 1.1.4 Let a be a list. What does list(enumerate(a)) return?


What does the code below return?

def uniq(a):
    return [x for i, x in enumerate(a) if x not in a[:i]]

1.2 The MNIST Dataset

Fig. 1.3 A portion of the MNIST dataset.



The MNIST¹ dataset consists of 60,000 training images. Since this dataset is
for demonstration purposes, these images are coarse.
Each image consists of 28 × 28 = 784 pixels, and each pixel shading is a
byte, an integer between 0 and 255 inclusive. Therefore each image is a point
x in Rd = R784 . Attached to each image is its label, a digit 0, 1, . . . , 9.
We assume the dataset has been downloaded to your laptop as a CSV file
mnist.csv. Then each row in the file consists of the pixels for a single image.
Since the image’s label is also included in the row, each row consists of 785
integers. There are many sources and formats online for this dataset.
The code

from pandas import *


from numpy import *

mnist = read_csv("mnist.csv").to_numpy()

# separate rows into data and labels


# first column is the labels
labels = mnist[:,0]
# all other columns are the pixels
dataset = mnist[:,1:]

mnist.shape,dataset.shape,labels.shape

returns

(60000, 785), (60000, 784), (60000,)

Here the dataset is arranged into an N × d array.

Fig. 1.4 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.

¹ The National Institute of Standards and Technology (NIST) is a physical sciences laboratory and non-regulatory agency of the United States Department of Commerce.

To compress the image means to reduce the number of dimensions in the point x while keeping maximum information. We can think of a single image
as a dataset itself, and compress the image, or we can design a compression
algorithm based on a collection of images. It is then reasonable to expect that
the procedure applies well to any image that is similar to the images in the
collection.
For the second image in Figure 1.2, reducing dimension from d = 784 to
n equal 600, 350, 150, 50, 10, and 1, we have the images in Figure 1.4.
Compressing each image to a point in n = 3 dimensions and plotting all
N = 60000 points yields Figure 1.5. All this is discussed in §3.5.

Fig. 1.5 The MNIST dataset (3d projection).

Here is an exercise. The top left image in Figure 1.4 is given by a 784-
dimensional point which is imported as an array pixels.

pixels = dataset[1].reshape((28,28))

Then pixels is an array of shape (28,28).

1. In Jupyter, return a two-dimensional plot of the point (2, 3) at size 50 using the code

from matplotlib.pyplot import *

grid()
scatter(2,3,s = 50)
show()

2. Do for loops over i and j in range(28) and use scatter to plot points
at location (i,j) with size given by pixels[i,j], then show.

Fig. 1.6 A crude copy of the image.

Here is one possible code, returning Figure 1.6.

from matplotlib.pyplot import *


from numpy import *

pixels = dataset[1].reshape((28,28))

grid()
for i in range(28):
    for j in range(28):
        scatter(i, j, s=pixels[i,j])

show()

The top left image in Figure 1.4 is returned by the code



from matplotlib.pyplot import *

imshow(pixels, cmap="gray_r")

In recent versions of numpy, floats are displayed as follows

np.float64(5.843333333333335)

To display floats without their type, as follows,

5.843333333333335

insert this code

from numpy import *

set_printoptions(legacy="1.25")

at the top of your jupyter notebook or in your jupyter configuration file.

We end the section by discussing the Python import command. The last
code snippet can be rewritten

import matplotlib.pyplot as plt

plt.imshow(pixels, cmap="gray_r")

or as

from matplotlib.pyplot import imshow

imshow(pixels, cmap="gray_r")

So we have three versions of this code snippet.


In the second version, it is explicit that imshow is imported from the module pyplot of the package matplotlib. Moreover, the module matplotlib.pyplot is referenced by a short nickname plt.
In the first version, using import *, many commands, maybe not all, are imported from the module matplotlib.pyplot.

In the third version, only the command imshow is imported. Which import
style is used depends on the situation.
In this text, we usually use the first style, as it is visually lightest. To help
with online searches, in the Python index, Python commands are listed under
their full package path.

Exercises

Exercise 1.2.1 Run the code in this section on your laptop (all code is run
within jupyter).
Exercise 1.2.2 The first image in the MNIST dataset is an image of the
digit 5. What is the 43,120th image?
Exercise 1.2.3 Figure 1.6 is not oriented the same way as the top-left image
in Figure 1.4. Modify the code returning Figure 1.6 to match the top-left
image in Figure 1.4.

1.3 Averages and Vector Spaces

Suppose we have a population of things (people, tables, numbers, vectors, images, etc.) and we have a sample of size N from this population:

L = [x_1,x_2,...,x_N].

The total population is the population or the sample space. For example, the
sample space consists of all real numbers and we take N = 5 samples from
this population

L_1 = [3.95, 3.20, 3.10, 5.55, 6.93].

Or, the sample space consists of all integers and we take N = 5 samples from
this population

L_2 = [35, -32, -8, 45, -8].

Or, the sample space consists of all rational numbers and we take N = 5
samples from this population

L_3 = [13/31, 8/9, 7/8, 41/22, 32/27].



Or, the sample space consists of all Python strings and we take N = 5 samples
from this population

L_4 = ['a2e?','#%T','7y5,','kkk>><</','[[)*+']

Or, the sample space consists of all HTML colors and we take N = 5 samples
from this population

Fig. 1.7 HTML colors.

Here’s the code generating the colors

# HTML color codes are #rrggbb (6 hexes)


from matplotlib.pyplot import *
from random import choice

def hexcolor():
    chars = '0123456789abcdef'
    return "#" + ''.join([choice(chars) for _ in range(6)])

for i in range(5): scatter(i,0, c=hexcolor())


show()

Let L be a list as above. The goal is to compute the sample average or mean of the list, which is

µ = (x1 + x2 + · · · + xN )/N. (1.3.1)
In the first example, for real numbers, the average is

(3.95 + 3.20 + 3.10 + 5.55 + 6.93)/5 = 4.546.

In the second case, for integers, the average is 32/5. In the third case, the average is 385373/368280. In the fourth case, while we can add strings, we can’t divide them by 5, so the average is undefined. Similarly for colors: the average is undefined.
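As a quick check, here is a minimal sketch using Python’s fractions module (part of the standard library) to compute the exact average of L_3:

from fractions import Fraction

# the samples in L_3, as exact fractions
L_3 = [Fraction(13,31), Fraction(8,9), Fraction(7,8), Fraction(41,22), Fraction(32,27)]

# exact average: returns Fraction(385373, 368280)
sum(L_3) / 5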
This leads to an important definition. A sample space or population V is
called a vector space if, roughly speaking, one can compute means or averages

in V . In this case, we call the members of the population "vectors", even though the members may be anything, as long as they satisfy the basic rules of a vector space.
In a vector space V , the rules are:
1. vectors can be added, with the sum v + w back in V ,
2. vector addition is commutative v + w = w + v,
3. vector addition is associative u + (v + w) = (u + v) + w,
4. there is a zero vector 0,
5. vectors v have negatives −v,
6. vectors v can be scaled to rv by real numbers r, with rv back in V ,
7. scaling is distributive over addition (r + s)v = rv + sv and r(u + v) =
ru + rv
8. 1v = v and 0v = 0
9. r(sv) = (rs)v.
As mentioned before, real numbers are called scalars because they often
serve to scale vectors.

Let x1 , x2 , . . . , xN be a dataset. Is the dataset a collection of points, or is the dataset a collection of vectors? In other words, what geometric picture
of datasets should we have in our heads? Here’s how it works.
A vector is an arrow joining two points (Figure 1.8). Given two points
µ = (a, b) and x = (c, d), the vector joining them is

v = x − µ = (c − a, d − b).

Then µ is the tail of v, and x is the head of v. For example, the vector joining
µ = (1, 2) to x = (3, 4) is v = (2, 2).
Given a point x, we would like to associate to it a vector v in a uniform
manner. However, this cannot be done without a second point, a reference
point. Given a dataset of points x1 , x2 , . . . , xN , the most convenient choice
for the reference point is the mean µ of the dataset. This results in a dataset
of vectors v1 , v2 , . . . , vN , where vk = xk − µ, k = 1, 2, . . . , N .

Fig. 1.8 The vector v joining the points µ and x.



The dataset v1 , v2 , . . . , vN is centered, its mean is zero:

(v1 + v2 + · · · + vN )/N = 0.
So datasets can be points x1 , x2 , . . . , xN with mean µ, or vectors v1 , v2 , . . . ,
vN with mean zero (Figure 1.9). This distinction makes a difference when
measuring the dimension of a dataset (§2.8).

Centered Versus Non-Centered


If x1 , x2 , . . . , xN is a dataset of points with mean µ and

v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,

then v1 , v2 , . . . , vN is a centered dataset of vectors.

Fig. 1.9 Datasets of points versus datasets of vectors.
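As a minimal sketch of centering in Python, here we subtract the mean from a small 2 × 4 dataset of points and check that the resulting dataset of vectors has mean zero:

from numpy import *

# four points in the plane, arranged as a 2 x 4 array
dataset = array([[1,3,-2,0],[2,4,11,66]])
mu = mean(dataset, axis=1)

# subtract the mean from each point: a dataset of vectors
centered = dataset - mu.reshape(2,1)

# returns (0., 0.): the centered dataset has mean zero
mean(centered, axis=1)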

Let us go back to vector spaces. When we work with vector spaces, numbers
are referred to as scalars, because 2v, 3v, −v, . . . are scaled versions of v.
When we multiply a vector v by a scalar r to get the scaled vector rv, we call
this vector scaling. This is to distinguish this multiplication from the inner
and outer products we see below.
For example, the samples in the list L1 form a vector space, the set of all
real numbers R. Even though one can add integers, the set Z of all integers
does not form a vector space because multiplying an integer by 1/2 does
not result in an integer. The set Q of all rational numbers (fractions) is a
vector space, so L3 is a sampling from a vector space. The set of strings is
not a vector space because even though one can add strings, addition is not
commutative:

'alpha' + 'romeo' == 'romeo' + 'alpha'

returns False.

For the scalar dataset

x1 = 1.23, x2 = 4.29, x3 = −3.3, x4 = 555,

the average is

µ = (1.23 + 4.29 − 3.3 + 555)/4 = 139.305.
In Python, averages are computed using numpy.mean. For a scalar dataset,
the code

from numpy import *

dataset = array([1.23,4.29,-3.3,555])
mu = mean(dataset)
mu

returns the average.


For the two-dimensional dataset

x1 = (1, 2), x2 = (3, 4), x3 = (−2, 11), x4 = (0, 66),

the average is

µ = ((1, 2) + (3, 4) + (−2, 11) + (0, 66))/4 = (0.5, 20.75).
Note the x-components are summed, and the y-components are summed,
leading to a two-dimensional mean. (This is vector addition, taken up in
§1.4.)
In Python, a dataset of four points in R2 is assembled as a 2 × 4 array

from numpy import *

dataset = array([[1,3,-2,0],[2,4,11,66]])

Here the x-components of the four points are the first row, and the y-
components are the second row. With this, the code

mu = mean(dataset, axis=1)
mu

returns the mean (0.5, 20.75).


To explain what axis=1 does, we use matrix terminology. After arranging
dataset into an array of two rows and four columns, to compute the mean,
we sum over the column index.
This means summing the entries of the first row, then summing the entries
of the second row, resulting in a mean with two components.
In Python, the default is to consider the row index i as index zero, and to
consider the column index j as index one.
Summing over index=0 is equivalent to thinking of the dataset as two
points in R4 , so

mean(dataset, axis=0)

returns (1.5, 3.5, 4.5, 33).

Fig. 1.10 A dataset with its mean.

Here is a more involved example of a dataset of random points and their mean:

from numpy import *


from numpy.random import random
from matplotlib.pyplot import scatter, grid, show

N = 20
def row(N): return array([random() for _ in range(N) ])

# 2xN array
dataset = array([ row(N), row(N) ])
mu = mean(dataset,axis=1)

grid()
scatter(*mu)
scatter(*dataset)
show()

This returns Figure 1.10.


In this code, scatter expects two positional arguments, the x and the y
components, arranged as two scalars (for a single point), or two arrays of x
and y components separately (for several points). The unpacking operator *
unpacks mu from one pair into its separate x and y components *mu. So mu
is one Python object, and *mu is two Python objects. Similarly, plot(x,y)
expects the x and y components as two separate arrays.

Sometimes, a population is not a vector space, so we can’t take sample means from it. Instead, we take the sample mean of a scalar or vector com-
puted from the samples. This computed quantity is a statistic associated to
the population.
A statistic is an assignment of a scalar or vector f (x) to each sample x
from the population, and the sample mean is then

(f (x1 ) + f (x2 ) + · · · + f (xN ))/N. (1.3.2)
Since scalars and vectors do form vector spaces, this mean is well-defined. For
example, a population of cats is not a vector space (they can’t be added),
but their heights form a vector space (heights can be added). This process is
vectorization of the samples.
Vectorization is frequently used to count proportions: Samples are drawn
from finitely many categories, and we wish to count the proportion of samples
belonging to a particular category.
If we toss a coin N times, we obtain a list of heads and tails,

H, H, T, T, T, H, T, . . .

To count the proportion of heads, we define

f (x) = (1, 0) if x is heads, and f (x) = (0, 1) if x is tails.

If we add the vectorized samples f (x) using vector addition in the plane
(§1.4), the first component of the mean (1.3.2) is an average of ones and

zeroes, with ones matching heads, resulting in the proportion p̂ of heads.


Similarly, the second component is the proportion of tails. Hence (1.3.2) is
the pair (p̂, 1 − p̂), where p̂ is the proportion of heads in N tosses.
More generally, if the label of a sample falls into d categories, we may let
f (x) be a vector with d components consisting of zeros and ones, according
to the category of the sample. This is one-hot encoding (see §2.4 and §7.6).
For example, suppose we take a sampling of size N from the Iris dataset,
and we look at the classes of the resulting samples. Since there are three
classes, in this case, we can define f (x) to equal

(1, 0, 0), (0, 1, 0), (0, 0, 1),

according to which class x belongs to. Then the mean (1.3.2) is a triple
p̂ = (p̂1 , p̂2 , p̂3 ) of proportions of each class in the sampling. Of course, p̂1 +
p̂2 + p̂3 = 1, so p̂ is a probability vector (§5.6).
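As a sketch of this computation, taking the sampling to be the full Iris dataset, one-hot encoding the class labels and averaging returns the class proportions:

from numpy import *
from sklearn import datasets

# class of each sample: 0, 1, or 2
labels = datasets.load_iris()["target"]

# one-hot encode each label as a vector with three components
onehot = array([ (lab == 0, lab == 1, lab == 2) for lab in labels ], dtype=float)

# the mean (1.3.2): returns (1/3, 1/3, 1/3), the proportions of each class
mean(onehot, axis=0)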

Fig. 1.11 Vectorization of samples.

When there are only two possibilities, two classes, it’s simpler to encode
the classes as follows: f (x) = 1 if x is heads, and f (x) = 0 if x is tails.

Then the mean (1.3.2) is the proportion p̂ of heads.

Even when the samples are already scalars or vectors, we may still want
to vectorize them. For example, suppose x1 , x2 , . . . , xN are the prices of a
sample of printers from across the country. Then the average price (1.3.1) is
well-defined. Nevertheless, we may set

f (x) = 1 if x is greater than $100, and f (x) = 0 if x is ≤ $100.

Then the mean (1.3.2) is the sample proportion p̂ of printers that cost more than $100.
In §6.4, we use vectorization to derive the chi-squared tests.

Exercises

Exercise 1.3.1 For the dataset = array([[1,3,-2,0],[2,4,11,66]]), the


commands mean(dataset,axis=1) and mean(dataset,axis=0) return means
in R2 and in R4 . What does mean(dataset) return and why?

Exercise 1.3.2 What is the average petal length in the Iris dataset?

Exercise 1.3.3 What is the average shading of the pixels in the first image
in the MNIST dataset?

Exercise 1.3.4 What’s the difference between plot and scatter in

from numpy import *


from matplotlib.pyplot import scatter, plot

def f(x): return x**2

x = arange(0,1,.2)

plot(x,f(x))
scatter(x,f(x))

1.4 Two Dimensions

We start with the geometry of vectors in two dimensions. This is the cartesian
plane R2 , also called 2-dimensional real space. The plane R2 is a vector space,
in the sense described in the previous section.
In the cartesian plane, a vector is an arrow v joining the origin to a point
(Figure 1.12). In this way, points and vectors are almost interchangeable, as a
point x in Rd corresponds to the vector v starting at the origin 0 and ending
at x.
In the cartesian plane, each vector v has a shadow. This is the triangle
constructed by dropping the perpendicular from the tip of v to the x-axis, as
in Figure 1.13.

Fig. 1.12 A vector v.

This cannot be done unless one first draws a horizontal line (the x-axis),
then a vertical line (the y-axis). In this manner, each vector v has cartesian
coordinates v = (x, y). In Figure 1.12, the coordinates of v are (3, 2). In
particular, the vector 0 = (0, 0), the zero vector, corresponds to the origin.

Fig. 1.13 Vectors v1 and v2 and their shadows in the plane.

In the cartesian plane, vectors v1 = (x1 , y1 ) and v2 = (x2 , y2 ) are added by adding their coordinates,

Addition of vectors

If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then

v1 + v2 = (x1 + x2 , y1 + y2 ). (1.4.1)

Because points and vectors are interchangeable, the same formula is used
for addition P + P ′ of points P and P ′ .
This addition is the same as combining their shadows as in Figure 1.14.
In Python, lists and tuples do not add this way. Lists and tuples have to first
be converted into numpy arrays.

v1 = (1,2)
v2 = (3,4)
v1 + v2 == (1+3,2+4) # returns False

v1 = [1,2]

v2 = [3,4]
v1 + v2 == [1+3,2+4] # returns False

from numpy import *

v1 = array([1,2])
v2 = array([3,4])
v1 + v2 == array([1+3,2+4]) # returns True

For example, v1 = (−3, 1) and v2 = (2, −2) returns

v1 + v2 = (−3, 1) + (2, −2) = (−3 + 2, 1 − 2) = (−1, −1).

Fig. 1.14 Adding v1 and v2

In Python, == is exact equality of values. When the entries are integers,


this is not a problem. However, when the entries a and b are floats, a == b
may return False even though the two floats agree to within the underlying
precision of the Python code.
To remedy this, it’s best to use isclose(a,b) or allclose(a,b). This
returns True when a and b agree to within the underlying precision. In this
chapter, we ignore this point, but we are more careful starting in Chapter 2.

Scaling of vectors

If v = (x, y), then


tv = (tx, ty).

A vector v = (x, y) in the plane may be scaled by scaling the shadow as in


Figure 1.15. This is vector scaling by t. Note when t is negative, the shadow
is also flipped. Because of this frequent use, numbers t are also called scalars.
In Python, we write

from numpy import *

v = array([1,2])
3*v == array([3,6]) # returns True

Fig. 1.15 Scaling with t = 2 and t = −2/3.

Given a vector v, the scalings tv of v form a line passing through the origin
0 (Figure 1.17). This line is the span of v (more on this in §2.4). Scalings tv
of v are also called multiples of v.
If t and s are real numbers, it is easy to check

t(v1 + v2 ) = tv1 + tv2 and t(sv) = (ts)v.

Thus scaling v by s, and then scaling the result by t, has the same effect as
scaling v by ts, in a single step. Because points and vectors are interchange-
able, the same formula tP is used for scaling points P by t.
We set −v = (−1)v, and define subtraction of vectors by

v1 − v2 = v1 + (−v2 ).

from numpy import *

v1 = array([1,2])
v2 = array([3,4])
v1 - v2 == array([1-3,2-4]) # returns True

Subtraction of vectors

If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then

v1 − v2 = (x1 − x2 , y1 − y2 ) (1.4.2)

Distance Formula

If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then the distance between v1 and v2 is

|v1 − v2 | = √((x1 − x2 )² + (y1 − y2 )²).

The distance of v = (x, y) to the origin 0 = (0, 0) is its magnitude or norm or length

r = |v| = |v − 0| = √(x² + y²).
In Python,

from numpy import *


from numpy.linalg import norm

v = array([1,2])
norm(v) == sqrt(5)# returns True

Fig. 1.16 The polar representation of v = (x, y).

In terms of r and θ (Figure 1.16), the polar representation of (x, y) is

x = r cos θ, y = r sin θ. (1.4.3)
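As a quick check of the polar representation in Python (the choices r = 2 and θ = π/3 are arbitrary),

from numpy import *
from numpy.linalg import norm

r, theta = 2.0, pi/3
# the vector with polar coordinates (r, theta)
v = array([r*cos(theta), r*sin(theta)])

isclose(norm(v), r)                  # returns True: the magnitude of v is r
isclose(arctan2(v[1], v[0]), theta)  # returns True: arctan2 recovers theta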

The unit circle consists of the vectors which are distance 1 from the origin
0. When v is on the unit circle, the magnitude of v is 1, and we say v is a

unit vector. In this case, the line formed by the scalings of v intersects the
unit circle at ±v (Figure 1.17).
When v is a unit vector, r = 1, and (Figure 1.16),

v = (x, y) = (cos θ, sin θ). (1.4.4)

Fig. 1.17 v and its antipode −v.

The unit circle intersects the horizontal axis at (1, 0), and (−1, 0), and
intersects the vertical axis at (0, 1), and (0, −1). These four points are equally
spaced on the unit circle (Figure 1.17).
By the distance formula, a vector v = (x, y) is a unit vector when

x² + y² = 1.

More generally, any circle with center Q = (a, b) and radius r consists of
points (x, y) satisfying

(x − a)² + (y − b)² = r².

Let R be a point on the unit circle, and let t > 0. From this, we see the scaled
point tR is on the circle with center (0, 0) and radius t. Moreover, it follows
a point P is on the circle of center Q and radius r iff P = Q + rR for some
R on the unit circle.
Given this, it is easy to check

|tv| = |t| |v|

for any real number t and vector v.


From this, if a vector v is unit and r > 0, then rv has magnitude r. If v is
any vector not equal to the zero vector, then r = |v| is positive, and

|(1/r) v| = (1/r) |v| = (1/r) r = 1,

so v/r is a unit vector.
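For example, a minimal sketch of this normalization in Python:

from numpy import *
from numpy.linalg import norm

v = array([3.0, 4.0])
u = v / norm(v)      # scale v by 1/|v|

norm(u)              # returns 1.0: u is a unit vector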

Now we discuss the dot product in two dimensions. We have two vectors
v1 and v2 in the plane R2 , with v1 = (x1 , y1 ) and v2 = (x2 , y2 ). The dot
product of v1 and v2 is given algebraically as

v1 · v2 = x1 x2 + y1 y2 ,

or geometrically as
v1 · v2 = |v1 | |v2 | cos θ,
where θ is the angle between v1 and v2 . To show that these are the same,
below we derive the

Dot Product Identity

x1 x2 + y1 y2 = v1 · v2 = |v1 | |v2 | cos θ. (1.4.5)

Fig. 1.18 Two vectors v1 and v2 .

From the algebraic definition of dot product, we have v · v = x² + y² = |v|².
If v1 = (x1 , y1 ) and v2 = (x2 , y2 ), then v1 + v2 = (x1 + x2 , y1 + y2 ). From this we have

|v1 + v2 |² = (v1 + v2 ) · (v1 + v2 ) = (x1 + x2 )² + (y1 + y2 )².

Expanding the squares, we obtain

|v1 + v2 |² = |v1 |² + 2 v1 · v2 + |v2 |². (1.4.6)

In Python, the dot product is given by numpy.dot,

from numpy import *

v1 = array([1,2])
v2 = array([3,4])
dot(v1,v2) == 1*3 + 2*4 # returns True

As a consequence of the dot product identity, we have code for the angle
between two vectors (there is also a built-in numpy.angle).

from numpy import *

def angle(u,v):
    a = dot(u,v)
    b = dot(u,u)
    c = dot(v,v)
    theta = arccos(a / sqrt(b*c))
    return degrees(theta)

Recall that −1 ≤ cos θ ≤ 1. Using the dot product identity (1.4.5), we


obtain the important

Cauchy-Schwarz Inequality

If u and v are any two vectors, then

−|u| |v| ≤ u · v ≤ |u| |v|. (1.4.7)

Using the geometric definition of the dot product,


cos θ = (u · v) / (|u| |v|).

Vectors u and v are orthogonal or perpendicular if the angle between them is


a right angle (90 degrees or π/2 radians). From this formula, we see vectors
are orthogonal when their dot product is zero.
When u and v are orthogonal and also unit vectors, we say u and v are
orthonormal. Python code for the angle is as above.
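As a small illustration (the vectors are chosen arbitrarily), here is a check that a zero dot product corresponds to a 90 degree angle, together with a construction of an orthonormal pair.

from numpy import *
from numpy.linalg import norm

u = array([1, 2])
v = array([-2, 1])

dot(u, v) == 0                                    # returns True: u and v are orthogonal
degrees(arccos(dot(u, v)/(norm(u)*norm(v))))      # returns 90.0

# dividing by the lengths gives an orthonormal pair
u1, v1 = u/norm(u), v/norm(v)
isclose(dot(u1, u1), 1), isclose(dot(v1, v1), 1), isclose(dot(u1, v1), 0)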

To derive the dot product identity, we first derive Pythagoras’ theorem for
general triangles (Figure 1.19)

c2 = a2 + b2 − 2ab cos θ. (1.4.8)

Fig. 1.19 Pythagoras for general triangles.

To derive (1.4.8), we drop a perpendicular to the base b, obtaining two


right triangles, as in Figure 1.20.

Fig. 1.20 Proof of Pythagoras for general triangles.

By Pythagoras applied to each triangle,

a2 = d2 + f 2 and c2 = e2 + f 2 .

Also b = e + d, so e = b − d, so

e2 = (b − d)2 = b2 − 2bd + d2 .

By the definition of cos θ, d = a cos θ. Putting this all together,

c2 = e2 + f 2 = (b − d)2 + f 2
= f 2 + d2 + b2 − 2db
= a2 + b2 − 2ab cos θ,

so we get (1.4.8).
Next, connect Figures 1.18 and 1.19 by noting a = |v2 | and b = |v1 | and
c = |v2 − v1 |. By (1.4.6),

c2 = |v2 − v1 |2 = |v2 |2 − 2v1 · v2 + |v1 |2 = a2 + b2 − 2(x1 x2 + y1 y2 ),

thus
c2 = a2 + b2 − 2(x1 x2 + y1 y2 ). (1.4.9)
Comparing the terms in (1.4.8) and (1.4.9), we arrive at (1.4.5). This com-
pletes the proof of the dot product identity (1.4.5).
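Here is a minimal numerical sanity check of (1.4.5), with two arbitrarily chosen vectors: the algebraic and geometric sides agree.

from numpy import *
from numpy.linalg import norm

v1 = array([1, 2])
v2 = array([3, 4])

# theta from the geometric definition
theta = arccos(dot(v1, v2)/(norm(v1)*norm(v2)))

# x1*x2 + y1*y2 versus |v1||v2|cos(theta)
isclose(1*3 + 2*4, norm(v1)*norm(v2)*cos(theta))    # returns True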

Fig. 1.21 P and P⊥ and v and v⊥.

If P = (x, y), let P⊥ = (−y, x), and let v = OP and v⊥ = OP⊥ be the
vectors emanating from the origin O and ending at P and P⊥. Then

v · v ⊥ = (x, y) · (−y, x) = 0.

This shows v and v ⊥ are perpendicular (Figure 1.21).



From Figure 1.21, we see points P and P ′ on the unit circle satisfy P ·P ′ = 0
iff P ′ = ±P ⊥ .

We now solve two linear equations in two unknowns x, y. We start with


the homogeneous case

ax + by = 0, cx + dy = 0. (1.4.10)

Let A be the 2 × 2 matrix

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}.    (1.4.11)
Assume (a, b) ̸= (0, 0). Then it is easy to exhibit a nonzero solution of
the first equation in (1.4.10): choose (x, y) = (−b, a). If we want this to be a
solution of the second equation as well, we must have cx + dy = ad − bc = 0.
On the other hand, if (c, d) ̸= (0, 0), (x, y) = (−d, c) is a nonzero solution
of the first equation in (1.4.10). If we want this to be a solution of the second
equation as well, we must have ax + by = bc − ad = 0.
Based on this, we make the following definition. The determinant of A is
 
det(A) = det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad − bc.    (1.4.12)

Above we found solutions of (1.4.10) when det(A) = 0. Now we show when


det(A) ̸= 0, the only solution is (x, y) = (0, 0).
Multiply the first equation in (1.4.10) by d and the second by b and sub-
tract, obtaining

(ad − bc)x = d(ax + by) − b(cx + dy) = 0.

Since ad − bc ̸= 0, this leads to x = 0. Similarly, in (1.4.10), multiply the first


equation by c and the second by a and subtract, obtaining

(bc − ad)y = c(ax + by) − a(cx + dy) = 0.

Since ad − bc ̸= 0, this leads to y = 0.


Summarizing, we conclude

Homogeneous System

Let A be a nonzero matrix. When det(A) ̸= 0, the only solution of


(1.4.10) is (x, y) = (0, 0). When det(A) = 0, every solution of (1.4.10)
is a scalar multiple of (x, y) = (−b, a), or of (x, y) = (−d, c), depending
on whether (a, b) ̸= (0, 0) or (c, d) ̸= (0, 0).

This covers the homogeneous case. For the inhomogeneous case

ax + by = e, cx + dy = f, (1.4.13)

there are three mutually exclusive possibilities: A = 0, A ̸= 0 and det(A) = 0,


and det(A) ̸= 0.
• When A = 0, the case is trivial: the system (1.4.13) has a solution
only if (e, f ) = (0, 0), in which case any (x, y) is a solution.
• When A ̸= 0 and det(A) = 0, multiplying and subtracting as above, we
obtain
(ad − bc)x = d(ax + by) − b(cx + dy) = de − bf,
(1.4.14)
(ad − bc)y = a(cx + dy) − c(ax + by) = af − ce.

This implies ce = af and de = bf . Conversely, when ce = af and de = bf ,

(x, y) = (e/a, 0), (x, y) = (0, e/b), (x, y) = (f /c, 0), (x, y) = (0, f /d)

are solutions, when a ̸= 0, b ̸= 0, c ̸= 0, or d ̸= 0 respectively.


• When det(A) ̸= 0, dividing (1.4.14) by ad − bc leads to

x = (de − bf)/(ad − bc),    y = (af − ce)/(ad − bc).    (1.4.15)
Putting all this together, we conclude

Inhomogeneous System

When det(A) ̸= 0, (1.4.13) has the unique solution (1.4.15). When


A ̸= 0 and det(A) = 0, (1.4.13) has a solution iff ce = af and de = bf .
In this case, there are four possible solutions, listed above, depending
on which of a, b, c, d is nonzero. All other solutions differ from these
solutions by a solution of (1.4.10).

In §2.9, we will understand the three cases in terms of the rank of A equal
to 2, 1, or 0.
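Here is a minimal sketch of the case det(A) ≠ 0, checking the explicit solution (1.4.15) against numpy's linear solver; the coefficients below are arbitrary.

from numpy import *
from numpy.linalg import solve

# ax + by = e, cx + dy = f, with ad - bc nonzero
a, b, c, d, e, f = 2, 1, 1, 3, 5, 10

x = (d*e - b*f)/(a*d - b*c)
y = (a*f - c*e)/(a*d - b*c)

A = array([[a, b], [c, d]])
allclose(solve(A, array([e, f])), array([x, y]))    # returns True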

We now go over the basic properties of 2 × 2 matrices. This we use in the


next section. A 2 × 2 matrix A is a block of four numbers as in (1.4.11).
The matrix (1.4.11) can be written in terms of the two vectors u = (a, b)
and v = (c, d), as follows
   
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} u \\ v \end{pmatrix},    u = (a, b),  v = (c, d).

In this case, we call u and v the rows of A. On the other hand, A may be
written as
 
A = \begin{pmatrix} a & c \\ b & d \end{pmatrix} = \begin{pmatrix} u & v \end{pmatrix},    u = (a, b),  v = (c, d).

In this case, we call u and v the columns of A. This shows there are at least
three ways to think about a matrix: as rows, or as columns, or as a single
block.
The simplest operations on matrices are addition and scaling. Addition is
as follows,

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},  A′ = \begin{pmatrix} a′ & b′ \\ c′ & d′ \end{pmatrix}  =⇒  A + A′ = \begin{pmatrix} a + a′ & b + b′ \\ c + c′ & d + d′ \end{pmatrix},

and scaling is as follows,

tA = \begin{pmatrix} ta & tb \\ tc & td \end{pmatrix}.
The transpose At of the matrix A is
   
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}  =⇒  At = \begin{pmatrix} a & c \\ b & d \end{pmatrix}.

Then the rows of At are the columns of A.


Let w = (x, y) be a vector. We now explain how to multiply the matrix
A by the vector w. The result is then another vector Aw. This is called
matrix-vector multiplication.
To do this, we write A as rows, A = \begin{pmatrix} u \\ v \end{pmatrix}, then use the dot product,

Aw = (u · w, v · w) = (ax + by, cx + dy).

Notice Aw is a vector. When multiplying this way, one often writes


    
Aw = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} ax + by \\ cx + dy \end{pmatrix},

and we call w and Aw column vectors.


This terminology is introduced to keep things consistent: It’s always row-
times-column with row on the left and column on the right. Nevertheless, a
vector, a row vector, and a column vector are all the same thing, just a vector.
Just like we can multiply matrices and vectors, we can also multiply two
matrices A and A′ and obtain a product AA′. This is matrix-matrix multiplication.
Following the row-column rule above, we write A = \begin{pmatrix} u \\ v \end{pmatrix} as rows
and A′ = \begin{pmatrix} u′ & v′ \end{pmatrix} as columns to obtain

AA′ = \begin{pmatrix} u · u′ & u · v′ \\ v · u′ & v · v′ \end{pmatrix}.

If we do this the other way, we obtain

A′A = \begin{pmatrix} u′ · u & u′ · v \\ v′ · u & v′ · v \end{pmatrix},
so
AA′ ̸= A′ A.
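A quick numerical illustration of non-commutativity, with two arbitrarily chosen matrices:

from numpy import *

A = array([[1, 2], [3, 4]])
B = array([[0, 1], [1, 0]])

dot(A, B)                         # array([[2, 1], [4, 3]])
dot(B, A)                         # array([[3, 4], [1, 2]])
(dot(A, B) == dot(B, A)).all()    # returns False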

A rotation in the plane is the matrix


 
U = U(θ) = \begin{pmatrix} cos θ & −sin θ \\ sin θ & cos θ \end{pmatrix}.

Here θ is the angle of rotation. By the trigonometric addition formulas


(A.4.6),

U(θ)U(θ′) = \begin{pmatrix} cos θ & −sin θ \\ sin θ & cos θ \end{pmatrix} \begin{pmatrix} cos θ′ & −sin θ′ \\ sin θ′ & cos θ′ \end{pmatrix} = \begin{pmatrix} cos(θ + θ′) & −sin(θ + θ′) \\ sin(θ + θ′) & cos(θ + θ′) \end{pmatrix} = U(θ + θ′).

This says rotating by θ′ followed by rotating by θ is the same as rotating by


θ + θ′ .
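Here is a short check of the composition rule U(θ)U(θ′) = U(θ + θ′) for one arbitrary pair of angles.

from numpy import *

def U(theta):
    return array([[cos(theta), -sin(theta)],
                  [sin(theta),  cos(theta)]])

theta1, theta2 = pi/6, pi/4
allclose(dot(U(theta1), U(theta2)), U(theta1 + theta2))    # returns True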

There is a special matrix I, the identity matrix,


 
I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.

The matrix I satisfies


AI = IA = A
for any matrix A.
Also, for each matrix A with det(A) ̸= 0, the matrix

A⁻¹ = (1/det(A)) \begin{pmatrix} d & −b \\ −c & a \end{pmatrix} = (1/(ad − bc)) \begin{pmatrix} d & −b \\ −c & a \end{pmatrix} = \begin{pmatrix} d/(ad − bc) & −b/(ad − bc) \\ −c/(ad − bc) & a/(ad − bc) \end{pmatrix}
is the inverse of A. The inverse matrix satisfies

AA−1 = A−1 A = I.

The inverse reverses the order of the product,

(AB)−1 = B −1 A−1 .

The transpose also reverses the order of a product,

(AB)t = B t At .

Using matrix-vector multiplication, we can rewrite (1.4.13) as

Ax = b,

where

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},  x = \begin{pmatrix} x \\ y \end{pmatrix},  b = \begin{pmatrix} e \\ f \end{pmatrix}.
Then the solution (1.4.15) can be rewritten

x = A−1 b,

where A−1 is the inverse matrix. We study inverse matrices in depth in §2.3.
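In numpy, the inverse is numpy.linalg.inv. A brief sketch, with an arbitrary invertible matrix, comparing x = A⁻¹b with a direct solve:

from numpy import *
from numpy.linalg import inv, det, solve

A = array([[2, 1], [1, 3]])
b = array([5, 10])

det(A)                                   # 5.0 up to roundoff, nonzero, so A is invertible
allclose(dot(inv(A), A), eye(2))         # returns True
allclose(dot(inv(A), b), solve(A, b))    # returns True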
The matrix (1.4.11) is symmetric if b = c. A symmetric matrix looks like
 
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix}.

A general matrix A consists of four numbers a, b, c, d, and a symmetric


matrix Q consists of three numbers a, b, c. A matrix Q is symmetric when

Qt = Q.

Let A = \begin{pmatrix} u & v \end{pmatrix} be a 2 × 2 matrix with columns u, v. Then u, v are the
rows of At = \begin{pmatrix} u \\ v \end{pmatrix}. Since matrix multiplication is row × column,

At A = \begin{pmatrix} u \\ v \end{pmatrix} \begin{pmatrix} u & v \end{pmatrix} = \begin{pmatrix} u · u & u · v \\ v · u & v · v \end{pmatrix}.

Now suppose At A = I. Then u · u = 1 = v · v and u · v = 0, so u and v are


orthogonal unit vectors. Such vectors are called orthonormal. We have shown

Orthogonal Matrices

Let A be a matrix. Then At A = I iff the columns of A are orthonor-


mal, and AAt = I iff the rows of A are orthonormal.

The second statement follows by applying the first to At instead of A. A


matrix U is orthogonal if
U tU = I = U U t.
Thus a matrix is orthogonal iff its rows are orthonormal, and its columns are
orthonormal.

Now we introduce the tensor product. If u = (a, b) and v = (c, d) are


vectors, their tensor product is the matrix
   
u ⊗ v = \begin{pmatrix} ac & ad \\ bc & bd \end{pmatrix} = \begin{pmatrix} cu & du \end{pmatrix} = \begin{pmatrix} av \\ bv \end{pmatrix}.

Here we wrote u ⊗ v as a single block, and also in terms of rows and columns.
If we do this the other way, we get
 
v ⊗ u = \begin{pmatrix} ca & cb \\ da & db \end{pmatrix},
so
(u ⊗ v)t = v ⊗ u.
When u = v, u ⊗ v = v ⊗ v is a symmetric matrix.
Here is code for tensor.

from numpy import *

def tensor(u,v): return array([ [ a*b for b in v] for a in u ])

There is no need to use this, since the numpy built-in outer does the same
job,

from numpy import *

A = outer(u,v)

The trace of a matrix A is the sum of the diagonal entries,


 
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}  =⇒  trace(A) = a + d.

The determinant of u ⊗ v is zero,

det(u ⊗ v) = 0.

This is true no matter what the vectors u and v are. Check this yourself.
By definition of u ⊗ v,

trace(u ⊗ v) = u · v, and trace(v ⊗ v) = |v|2 . (1.4.16)

The basic property of tensor product is

Tensor Product Identities


If A is a matrix and B = u ⊗ v, then

Bw = (u ⊗ v)w = (v · w)u, AB = A(u ⊗ v) = (Au) ⊗ v. (1.4.17)

These can be checked by writing out both sides in detail.
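They can also be spot-checked numerically using the numpy built-in outer; the vectors and matrix below are arbitrary.

from numpy import *

u = array([1, 2])
v = array([3, 4])
w = array([5, 6])
A = array([[1, 0], [2, 1]])

B = outer(u, v)                               # u tensor v

allclose(dot(B, w), dot(v, w)*u)              # returns True
allclose(dot(A, B), outer(dot(A, u), v))      # returns True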


Now let

Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix}
be a symmetric matrix and let v = (x, y). Then

Qv = (ax + by, bx + cy),

so
v · Qv = (x, y) · (ax + by, bx + cy) = ax2 + 2bxy + cy 2 .
This is the quadratic form associated to the matrix Q.

Quadratic Form

If Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix} and v = (x, y),
then
v · Qv = ax2 + 2bxy + cy 2 .

When Q is the identity


 
Q = I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},

then the quadratic function is x2 + y 2 :

Q=I =⇒ v · Qv = x2 + y 2 .

When Q is diagonal,
 
Q = \begin{pmatrix} a & 0 \\ 0 & c \end{pmatrix}  =⇒  v · Qv = ax² + cy².

An important case is when Q = u ⊗ u. In this case, by (1.4.17),

Quadratic Forms of Tensors

If Q = u ⊗ u, then

v · Qv = v · (u ⊗ u)v = (u · v)2 . (1.4.18)

Exercises

Exercise 1.4.1 Solve the linear system

ax + by = c, −bx + ay = d.

Exercise 1.4.2 Let u = (1, a), v = (b, 2), and w = (3, 4). Solve

u + 2v + 3w = 0

for a and b.
Exercise 1.4.3 Let u = (1, 2), v = (3, 4), and w = (5, 6). Find a and b such
that
au + bv = w.
Exercise 1.4.4 Let P be a nonzero point in the plane. What is (P⊥)⊥?
   
Exercise 1.4.5 Let A = \begin{pmatrix} 8 & −8 \\ −7 & −3 \end{pmatrix} and B = \begin{pmatrix} 3 & −2 \\ 2 & −2 \end{pmatrix}. Compute AB and BA.
 
Exercise 1.4.6 Let A = \begin{pmatrix} 9 & 2 \\ −36 & −8 \end{pmatrix}. Find a nonzero 2 × 2 matrix B satisfying AB = 0.
Exercise 1.4.7 Solve for X
   
\begin{pmatrix} −7 & 4 \\ 4 & −3 \end{pmatrix} − 4X = \begin{pmatrix} −9 & 5 \\ 6 & −9 \end{pmatrix}.

Exercise 1.4.8 If u = (a, b) and v = (c, d) and A = u ⊗ v, use (1.4.17) to
compute A2 .

Exercise 1.4.9 Find a nonzero 2 × 2 matrix A satisfying A2 = 0.


 
Exercise 1.4.10 What is the trace of A = \begin{pmatrix} 9 & 2 \\ −36 & −8 \end{pmatrix}?
Exercise 1.4.11 The wedge product of vectors u and v is the matrix

u ∧ v = u ⊗ v − v ⊗ u.

If u = (a, b) and v = (c, d), what is u ∧ v?

Exercise 1.4.12 Let u be a unit vector, and let A = u ⊗ u. Compute A100 .

Exercise 1.4.13 Calculate the areas of the triangles and the squares in Fig-
ure 1.21. From that, deduce Pythagoras’s theorem c2 = a2 + b2 .

Exercise 1.4.14 Let u and v be unit vectors, and let A = u ⊗ v. If A2 = 0,


what is the angle between u and v?
 
Exercise 1.4.15 Let W = \begin{pmatrix} 0 & 1 \\ −1 & 0 \end{pmatrix}. What is W^{211}?

1.5 Mean and Variance

Let x1 , x2 , . . . , xN be a dataset in Rd , and let x be any point in Rd . The


mean-square distance of x to the dataset is

MSD(x) = (1/N) ∑_{k=1}^N |x_k − x|².

Above |x| stands for the length of the vector x, or the distance of the point
x to the origin. When d = 2 and we are in two dimensions, this was defined
in §1.4. For general d, this is defined in §2.1. In this section we continue to
focus on two dimensions d = 2.
The mean or sample mean is
µ = (1/N) ∑_{k=1}^N x_k = (x1 + x2 + · · · + xN)/N.    (1.5.1)

The mean µ is a point in feature space. The first result is

Point of Best-fit
The mean is the point of best-fit: The mean minimizes the mean-
square distance to the dataset (Figure 1.22).

Fig. 1.22 MSD for the mean (green) versus MSD for a random point (red).

Using (1.4.6),
|a + b|2 = |a|2 + 2a · b + |b|2
for vectors a and b, it is easy to derive the above result. Insert a = xk − µ
and b = µ − x to get
MSD(x) = MSD(µ) + (2/N) ∑_{k=1}^N (x_k − µ) · (µ − x) + |µ − x|².

Now the middle term vanishes


(2/N) ∑_{k=1}^N (x_k − µ) · (µ − x) = (2/N) ( ∑_{k=1}^N x_k − Nµ ) · (µ − x) = 2(µ − µ) · (µ − x) = 0,

so we have
M SD(x) = M SD(µ) + |x − µ|2 ,
which is clearly ≥ M SD(µ), deriving the above result.
Here is the code for Figure 1.22.

from matplotlib.pyplot import *


from numpy import *
from numpy.random import random

N, d = 20, 2
# d x N array

dataset = array([ [random() for _ in range(N)] for _ in range(d) ])

mu = mean(dataset,axis=1)
p = array([random(),random()])

for v in dataset.T:
    plot([mu[0],v[0]],[mu[1],v[1]],c='green')
    plot([p[0],v[0]],[p[1],v[1]],c='red')

scatter(*mu)
scatter(*dataset)

grid()
show()

The variance of a dataset is defined in any dimension d. When d = 1, the


dataset consists of scalars x1 , x2 , . . . , xN , and the mean µ is a scalar. In this
case, the variance q is also a scalar,
q = (1/N) ∑_{k=1}^N (x_k − µ)².    (1.5.2)

The square root of the variance is the standard deviation σ = √q.
If a scalar dataset has mean zero and variance one, it is standard. Every
dataset x1 , x2 , . . . , xN may be standardized by first centering the dataset

v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,

then dividing the dataset by its standard deviation,


x′1 = (x1 − µ)/σ,  x′2 = (x2 − µ)/σ,  . . . ,  x′N = (xN − µ)/σ.
The resulting dataset x′1 , x′2 , . . . , x′N is then standardized.
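Here is a minimal numpy sketch of standardizing a scalar dataset; the numbers are made up.

from numpy import *

x = array([1.0, 4.0, 5.0, 6.0, 9.0])

mu = mean(x)
sigma = sqrt(mean((x - mu)**2))    # standard deviation, dividing by N (same as std(x))

x_std = (x - mu)/sigma

isclose(mean(x_std), 0), isclose(mean(x_std**2), 1)    # both return True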

In general, a dataset consists of points x1 , x2 , . . . , xN in some feature


space Rd . If the dataset has mean µ, we can center the dataset,

v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ.

Then the variance is the matrix (see §1.4 for tensor product)
Q = (v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN)/N.    (1.5.3)

Since v ⊗ v is a symmetric matrix, the variance of a dataset is a symmetric


matrix. Below we see the variance is also nonnegative, in the sense v · Qv ≥ 0
for all vectors v. Later we see how to standardize vector datasets.
When i ̸= j, the entries Q = (qij ) of the variance matrix are called covari-
ances: qij is the covariance between the i-th feature and the j-th feature.
For example, suppose N = 5 and

x1 = (1, 2), x2 = (3, 4), x3 = (5, 6), x4 = (7, 8), x5 = (9, 10). (1.5.4)

Then µ = (5, 6) and

v1 = x1 − m = (−4, −4), v2 = x2 − m = (−2, −2), v3 = x3 − m = (0, 0),


v4 = x4 − m = (2, 2), v5 = x5 − m = (4, 4).

Since
 
(±4, ±4) ⊗ (±4, ±4) = \begin{pmatrix} 16 & 16 \\ 16 & 16 \end{pmatrix},  (±2, ±2) ⊗ (±2, ±2) = \begin{pmatrix} 4 & 4 \\ 4 & 4 \end{pmatrix},  (0, 0) ⊗ (0, 0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},

summing and dividing by N leads to the variance


 
Q = \begin{pmatrix} 8 & 8 \\ 8 & 8 \end{pmatrix}.

Notice
Q = 8(1, 1) ⊗ (1, 1),
which, as we see below (§2.5), reflects the fact that the points of this dataset
lie on a line. Here the line is y = x + 1. Here is code from scratch for the
variance (matrix) of a dataset.

from numpy import *


from numpy.random import random

def tensor(u,v):
    return array([ [ a*b for b in v] for a in u ])

N, d = 20, 2
# N x d array
dataset = array([ [random(),random()] for _ in range(N) ])
mu = mean(dataset,axis=0)

# center dataset

vectors = dataset - mu

Q = mean([ tensor(v,v) for v in vectors ],axis=0)

The variance matrix as written in (1.5.3) is the biased variance matrix. If


instead the denominator is N − 1, the matrix is the unbiased variance matrix.
For datasets with large N , it doesn’t matter, since N and N −1 are almost
equal. For simplicity, here we divide by N , and we only consider the biased
variance matrix.
In practice, datasets are standardized before computing their variance. The
variance of standardized datasets — the correlation matrix — is the same
whether one starts with bias or not (§2.2).
In numpy, the Python variance constructor is

from numpy import *


from numpy.random import random

N, d = 20, 2
# d x N array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])

Q = cov(dataset,bias=True)
Q

This returns the same result as the previous code for Q. Notice here there is
no need to compute the mean, this is taken care of automatically. The option
bias=True indicates division by N , returning the biased variance. To return
the unbiased variance and divide by N −1, change the option to bias=False,
or remove it, since bias=False is the default.
From (1.4.16), if Q is the variance matrix (1.5.3),
trace(Q) = (1/N) ∑_{k=1}^N |x_k − m|².    (1.5.5)

We call trace(Q) the total variance or explained variance of the dataset.


Thus the total variance equals MSD(m).
In Python,

from numpy import *

# dataset is d x N array

Q = cov(dataset,bias=True)
Q.trace()

We now project a 2d dataset onto a line. Let u be a unit vector (a vector


of length one, |u| = 1), and let v1 , v2 , . . . , vN be a 2d dataset, assumed
for simplicity to be centered. We wish to project this dataset onto the line
through u. This will result in a 1d dataset.
According to Figure 1.23, when a vector b is projected onto the line through
u, the length of the projected vector P b equals |b| cos θ, where θ is the angle
between the vectors b and u. Since |u| = 1, this length equals the dot product
b · u. Hence the projected vector is

P b = (b · u)u.

Fig. 1.23 Projecting a vector b onto the line through u.

Applying this logic to each vector v1 , v2 , . . . , vN , we conclude: the projected


dataset onto the line through u is the two-dimensional dataset

(v1 · u)u, (v2 · u)u, . . . , (vN · u)u.

These vectors are all multiples of u, as they should be. The projected dataset
is two-dimensional.
Alternately, discarding u and retaining the scalar coefficients, we have the
one-dimensional dataset

v1 · u, v2 · u, . . . , vN · u.

This is the reduced dataset. The reduced dataset is one-dimensional.


Since the vector u is fixed, the reduced dataset and the projected dataset
contain the same information. Warning: The formulas for the projected and
reduced datasets are correct only when u is a unit vector. If u is not unit,
remember to replace u by u/|u|.
The reduced dataset is centered, since
(v1 · u + v2 · u + · · · + vN · u)/N = ((v1 + v2 + · · · + vN)/N) · u = 0 · u = 0,

and the mean of the projected dataset is also 0.
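A short sketch computing the projected and reduced datasets for a random centered 2d dataset and an arbitrary direction:

from numpy import *
from numpy.random import random
from numpy.linalg import norm

N = 20
# N x 2 dataset, then centered
dataset = array([ [random(), random()] for _ in range(N) ])
vectors = dataset - mean(dataset, axis=0)

u = array([1.0, 2.0])
u = u/norm(u)               # u must be a unit vector

reduced = array([ dot(v, u) for v in vectors ])        # 1d dataset
projected = array([ dot(v, u)*u for v in vectors ])    # multiples of u

isclose(mean(reduced), 0)   # returns True: the reduced dataset is centered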


The variance of the reduced dataset
q = (1/N) ∑_{k=1}^N (v_k · u)²

is a scalar that is positive or zero. According to (1.4.18), this equals


q = (1/N) ∑_{k=1}^N u · (v_k ⊗ v_k)u = u · Qu.

Because the reduced dataset and projected dataset are essentially the
same, we also refer to q as the variance of the projected dataset. Thus we
conclude (see §1.4 for v · Qv)

Variance of Reduced Dataset


Let Q be the variance matrix of a dataset and let u be a unit vector.
Then the variance of the dataset reduced onto the line through u
equals u · Qu.

A matrix Q is positive if it is symmetric and u · Qu > 0 for any nonzero


vector u. A matrix Q is nonnegative if it is symmetric and u · Qu ≥ 0 for any
vector u.
As a consequence, since q = u · Qu is the variance of the reduced dataset,
we have

Variance Matrix is Nonnegative

Every variance matrix is nonnegative.

Here is code for computing the variance of the projected dataset.

from numpy import *

# dataset is d x N array

Q = cov(dataset,bias=True)

# project along unit vector u


q = dot(u,dot(Q,u))

Going back to the dataset (1.5.4), xk − m, k = 1, 2, 3, 4, 5, are all multiples


of (1, 1). If we select u = (1, −1), then (xk − m) · u = 0, so the variance Q
satisfies u · Qu = 0. This can also be seen by

Qu = 8((1, 1) ⊗ (1, 1))u = 8((1, 1) · u)(1, 1) = 0.

This shows that the dataset lies on the line passing through m and perpen-
dicular to (1, −1).

We describe the variance ellipses associated to a given dataset. Let Q be


a variance matrix and µ a point in R2 . The contour of all points v satisfying

(v − µ) · Q(v − µ) = k

is the variance ellipse corresponding to level k. When k = 1, the ellipse is the


unit variance ellipse.
The contour of all points v satisfying

(v − µ) · Q−1 (v − µ) = k

is the inverse variance ellipse corresponding to level k. When k = 1, the


ellipse is the unit inverse variance ellipse.

Fig. 1.24 Unit variance ellipses (blue) and unit inverse variance ellipses (red) with µ = 0.

In two dimensions, a variance matrix has the form


 
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix}.

If we write v = (x, y) for a vector in the plane, the variance ellipse equation
centered at µ = 0 is

v · Qv = ax2 + 2bxy + cy 2 = k.

The inverse variance ellipse centered at µ = 0 is given by the same equation


with Q replaced by Q−1 . The code for rendering ellipses is

from matplotlib.pyplot import *


from numpy import *
from numpy.linalg import inv

def ellipse(Q,mu,padding=.5,levels=[1],render="var"):
grid()
scatter(*mu,c="red",s=5)
a, b, c = Q[0,0],Q[0,1],Q[1,1]
d,e = mu
delta = .01
x = arange(d-padding,d+padding,delta)
y = arange(e-padding,e+padding,delta)
x, y = meshgrid(x, y)
if render == "var" or render == "both":
# matrix_text(Q,mu,padding,'blue')
eq = a*(x-d)**2 + 2*b*(x-d)*(y-e) + c*(y-e)**2
contour(x,y,eq,levels=levels,colors="blue",linewidths=.5)
if render == "inv" or render == "both":
draw_major_minor_axes(Q,mu)
Q = inv(Q)
# matrix_text(Q,mu,padding,'red')
A, B, C = Q[0,0],Q[0,1],Q[1,1]
eq = A*(x-d)**2 + 2*B*(x-d)*(y-e) + C*(y-e)**2
contour(x,y,eq,levels=levels,colors="red",linewidths=.5)

With this code, ellipse(Q,mu) returns the unit variance ellipse in the unit
square centered at µ. The codes for the functions draw_major_minor_axes
and matrix_text are below.
The code for draw_major_minor_axes uses the formulas for the best-fit
and worst-fit vectors (1.5.7).
Depending on whether render is var, inv, or both, the code renders the
variance ellipse (blue), the inverse variance ellipse (red), or both. The code
renders several ellipses, one for each level in the list levels. The default is
levels = [1], so the unit ellipse is returned. Also padding can be adjusted
to enlarge the plot.
The code for Figure 1.24 is

mu = array([0,0])

Q = array([[9,0],[0,4]])
ellipse(Q,mu,padding=4,render="both")
show()

Q = array([[9,2],[2,4]])
ellipse(Q,mu,padding=4,render="both")
show()

To use TEX to display the matrices in Figure 1.24, insert the function

rcParams['text.usetex'] = True
rcParams['text.latex.preamble'] = r'\usepackage{amsmath}'

def matrix_text(Q,mu,padding,color):
a, b, c = Q[0,0],Q[0,1],Q[1,1]
d,e = mu
valign = e + 3*padding/4
if color == 'blue': halign = d - padding/2; tex = "$Q="
else: halign = d; tex = "$Q^{-1}="
# r"..." means raw string
tex += r"\begin{pmatrix}" + str(round(a,2)) + "&" + str(round(b,2))
tex += r"\\" + str(round(b,2)) + "&" + str(round(c,2))
tex += r"\end{pmatrix}$"
return text(halign,valign,tex,fontsize=15,color=color)

A minimal TEX installation is included in matplotlib.pyplot. To dis-


play matrices, the code needs to access your laptop’s TEX installation. The
rcParams lines enable this access. If TEX is installed on your laptop, uncom-
ment matrix_text in ellipse.

Fig. 1.25 Variance ellipses (blue) and inverse variance ellipses (red) for a dataset.

Figure 1.25 shows variance ellipses with levels = [.005,.01,.02], and


inverse variance ellipses with levels = [.5,1,2], corresponding to a ran-
dom dataset. The code for this is

from numpy.random import random

N = 50
# N x d array
dataset = array([ [random(),random()] for _ in range(N) ])
Q = cov(dataset.T,bias=True)
mu = mean(dataset,axis=0)

scatter(*dataset.T,s=5)
ellipse(Q,mu,render="var",padding=.5,levels=[.005,.01,.02])
show()

scatter(*dataset.T,s=5)
ellipse(Q,mu,render="inv",padding=.5,levels=[.5,1,2])
show()

When Q is diagonal, the lengths of the major and minor axes of the unit
inverse variance ellipse equal 2√a and 2√c, and the lengths of the major and
minor axes of the unit variance ellipse equal 2/√a and 2/√c.

We describe how to standardize datasets in R2 . For datasets in Rd , this


is described in §2.2.
Remember, a dataset is a sequence of N points in a d-dimensional feature
space. Restricting to the case d = 2, a dataset is a sequence of x-coordinates
and y-coordinates

x1 , x2 , . . . , xN , and y1 , y2 , . . . , yN .

Suppose the mean of this dataset is µ = (µx , µy ). Then, by the formula for
tensor product, the variance matrix is
 
Q = \begin{pmatrix} a & b \\ b & c \end{pmatrix},

where
a = (1/N) ∑_{k=1}^N (x_k − µx)²,  b = (1/N) ∑_{k=1}^N (x_k − µx)(y_k − µy),  c = (1/N) ∑_{k=1}^N (y_k − µy)².

From this, we see a is the variance of the x-features, and c is the variance
of y-features. We also see b is a measure of the correlation between the x and
y features.
Standardizing the dataset means to center the dataset and to place the x
and y features on the same scale. For example, the x-features may be close
to their mean µx , resulting in a small x variance a, while the y-features may
be spread far from their mean µy , resulting in a large y variance c.

When this happens, the different scales of the x's and y's distort the relation
between them, and b may not accurately reflect the correlation. To correct
for this, we center and re-scale
x1, x2, . . . , xN → x′1 = (x1 − µx)/√a,  x′2 = (x2 − µx)/√a,  . . . ,  x′N = (xN − µx)/√a,

and

y1, y2, . . . , yN → y′1 = (y1 − µy)/√c,  y′2 = (y2 − µy)/√c,  . . . ,  y′N = (yN − µy)/√c.

This results in a new dataset v1 = (x′1, y′1), v2 = (x′2, y′2), . . . , vN = (x′N, y′N)
that is centered,

(v1 + v2 + · · · + vN)/N = 0,

with each feature standardized to have unit variance,

(1/N) ∑_{k=1}^N x′_k² = 1,    (1/N) ∑_{k=1}^N y′_k² = 1.

This is the standardized dataset.


Because of this, the variance matrix of the standardized dataset is
 

Q′ = \begin{pmatrix} 1 & ρ \\ ρ & 1 \end{pmatrix},

where
ρ = (1/N) ∑_{k=1}^N x′_k y′_k = b/√(ac)

is the correlation coefficient of the dataset. The matrix Q′ is the correlation


matrix, or the standardized variance matrix.
For example,
   
Q = \begin{pmatrix} 9 & 2 \\ 2 & 4 \end{pmatrix}  =⇒  ρ = b/√(ac) = 1/3  =⇒  Q′ = \begin{pmatrix} 1 & 1/3 \\ 1/3 & 1 \end{pmatrix}.

The correlation coefficient ρ (“rho”) is always between −1 and 1 (this


follows from the Cauchy-Schwarz inequality (1.4.7)).
When ρ > 0, we say the x and y features are positively correlated. More
loosely, we say the dataset is positively correlated (this only for 2d). When
ρ < 0, we say the x and y features are negatively correlated. More loosely, we
say the dataset is negatively correlated (this only for 2d).
When ρ = ±1, the dataset samples are perfectly correlated and lie on
a line passing through the mean. When ρ = 1, the line has slope 1, and
when ρ = −1, the line has slope −1. When ρ = 0, the dataset samples are

completely uncorrelated and are considered two independent one-dimensional


datasets.
In numpy, the correlation matrix Q′ is returned by

from numpy import *

# dataset is d x N array

corrcoef(dataset)

Fig. 1.26 Unit variance ellipse and unit inverse variance ellipse with standardized Q.

We say a unit vector u is best aligned or best-fit with the dataset if u


maximizes the variance v · Qv over all unit vectors v,

u · Qu = max_{|v|=1} v · Qv.

We calculate the best-aligned unit vector. When a dataset is standardized,


the variance of the dataset projected onto a vector v = (x, y) equals

v · Qv = ax2 + 2bxy + cy 2 = x2 + 2ρxy + y 2 .

Since v = (x, y) is a unit vector, we have x2 + y 2 = 1, so we can write


(x, y) = (cos θ, sin θ). Using the double-angle formula, we obtain

v · Qv = x2 + 2ρxy + y 2 = 1 + 2ρ sin θ cos θ = 1 + ρ sin(2θ).

Since the sine function varies between +1 and −1, we conclude the pro-
jected variance varies between

1 − ρ ≤ v · Qv ≤ 1 + ρ,

and

θ = π/4,  v+ = (1/√2, 1/√2)  =⇒  v+ · Qv+ = 1 + ρ,

θ = 3π/4,  v− = (−1/√2, 1/√2)  =⇒  v− · Qv− = 1 − ρ.
Thus the best-aligned vector v+ is at 45◦ , and the worst-aligned vector is at
135◦ (Figure 1.26).
Actually, the above is correct only if ρ > 0. When ρ < 0, it’s the other
way. The correct answer is

1 − |ρ| ≤ v · Qv ≤ 1 + |ρ|,

and v± must be switched when ρ < 0. We study best-aligned vectors in Rd


in §3.2.

Fig. 1.27 Positively and negatively correlated datasets (unit inverse ellipses).

Here are two randomly generated datasets. The dataset on the left in
Figure 1.27 is positively correlated. Its mean and variance are
 
(0.53626891, 0.54147513),    \begin{pmatrix} 0.08016526 & 0.01359483 \\ 0.01359483 & 0.08589097 \end{pmatrix}.

The dataset on the right in Figure 1.27 is negatively correlated. Its mean
and variance are

(0.46979642, 0.48347168),    \begin{pmatrix} 0.08684941 & −0.00972569 \\ −0.00972569 & 0.09409118 \end{pmatrix}.

In general, for non-standardized datasets, the projected variance v · Qv


varies between two extremes λ± ,

λ− ≤ v · Qv ≤ λ+ , |v| = 1.

where λ± are given by


λ± = (a + c)/2 ± √( ((a − c)/2)² + b² ).    (1.5.6)

When the dataset is standardized, as we saw above, λ± = 1 ± |ρ|.


The major axis of the inverse variance ellipse v · Q⁻¹v = 1 has length
2√λ+, and the minor axis has length 2√λ−. These are the principal axes of
the dataset.
When the dataset is not standardized, let

v± = (−b, a − λ± ), and w± = (λ± − c, b), (1.5.7)

If the inverse variance ellipse is not a circle, then Q is not a multiple of the
identity, and either v+ or w+ is nonzero. If v+ ̸= 0, v+ is the best-aligned
vector. If v+ = 0, w+ is the best-aligned vector.
If the inverse variance ellipse is not a circle, then Q is not a multiple of the
identity, and either v− or w− is nonzero. If v− ̸= 0, v− is the worst-aligned
vector. If v− = 0, w− is the worst-aligned vector.
If Q is a multiple of the identity, then any vector is best-aligned and worst-
aligned.
All this follows from solutions of homogeneous 2 × 2 systems (1.4.10). The
general d×d case is in §3.2. For the 2×2 case discussed here, see the exercises
at the end of §3.2.
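Here is a hedged numerical check of (1.5.6): sweeping unit vectors v = (cos θ, sin θ), the largest and smallest projected variances v · Qv should match λ±. The matrix Q below is arbitrary.

from numpy import *

a, b, c = 9.0, 2.0, 4.0
Q = array([[a, b], [b, c]])

lam_plus  = (a + c)/2 + sqrt(b**2 + ((a - c)/2)**2)
lam_minus = (a + c)/2 - sqrt(b**2 + ((a - c)/2)**2)

thetas = linspace(0, pi, 1000)
variances = [ dot(v, dot(Q, v)) for v in [array([cos(t), sin(t)]) for t in thetas] ]

isclose(max(variances), lam_plus, atol=1e-3)     # returns True
isclose(min(variances), lam_minus, atol=1e-3)    # returns True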
The code for rendering the major and minor axes of the inverse variance
ellipse uses (1.5.6) and (1.5.7),

def draw_major_minor_axes(Q,mu):
    a, b, c = Q[0,0],Q[0,1],Q[1,1]
    d, e = mu
    label = { 1:"major", -1:"minor" }
    for pm in [1,-1]:
        lamda = (a+c)/2 + pm * sqrt(b**2 + (a-c)**2/4)
        sigma = sqrt(lamda)
        lenv = sqrt(b**2 +(a-lamda)**2)
        lenw = sqrt(b**2 +(c-lamda)**2)
        if lenv: deltaX, deltaY = b/lenv, (a-lamda)/lenv
        elif lenw: deltaX, deltaY = (lamda-c)/lenw, b/lenw
        elif pm == 1: deltaX, deltaY = 1, 0
        else: deltaX, deltaY = 0, 1
        axesX = [d+sigma*deltaX,d-sigma*deltaX]
        axesY = [e-sigma*deltaY,e+sigma*deltaY]
        plot(axesX,axesY,linewidth=.5,label=label[pm])
    legend()

In three dimensions, when d = 3, the ellipses are replaced by ellipsoids


(Figure 1.28).

Fig. 1.28 Ellipsoid and axes in 3d.

Exercises

Exercise 1.5.1 The dataset is

from numpy import *

d = 10
# 100 x 2 array
dataset = array([ array([i+j,j]) for i in range(d) for j in range(d) ])

Compute the mean and variance, and plot the dataset and the mean.

Exercise 1.5.2 Let the dataset be the petal lengths against the petal widths
in the Iris dataset. Compute the mean and variance, and plot the dataset and
the mean.

Exercise 1.5.3 Project the dataset in Exercise 1.5.1 onto the line through
the vector (1, 2). What is the projected dataset? What is the reduced dataset?

Exercise 1.5.4 Project the dataset in Exercise 1.5.2 onto the line through
the vector (1, 2). What is the projected dataset? What is the reduced dataset?

Exercise 1.5.5 Plot the variance ellipse and inverse variance ellipses of the
dataset in Exercise 1.5.1.

Exercise 1.5.6 Plot the variance ellipse and inverse variance ellipses of the
dataset in Exercise 1.5.2.

Exercise 1.5.7 Plot the dataset in Exercise 1.5.1 together with its mean
and the line through the vector of best fit.

Exercise 1.5.8 Plot the dataset in Exercise 1.5.2 together with its mean
and the line through the vector of best fit.

Exercise 1.5.9 Standardize the dataset in Exercise 1.5.1. Plot the stan-
dardized dataset. What is the correlation matrix?

Exercise 1.5.10 Standardize the dataset in Exercise 1.5.2. Plot the stan-
dardized dataset. What is the correlation matrix?
 
Exercise 1.5.11 Let Q = \begin{pmatrix} a & b \\ b & a \end{pmatrix}. Show Q is nonnegative when a ≥ |b|.
(Compute v · Qv with v = (cos θ, sin θ) as in the text.)

1.6 High Dimensions

Although not used in later material, this section is here to boost intuition
about high dimensions. Draw four disks inside a square, and a fifth disk in
the center.
In Figure 1.29, the edge-length of the square is 4, and the radius of each
blue disk is 1. Draw the diagonal of the square. Then the diagonal passes
through two blue disks.

Fig. 1.29 Disks inside the square.


Since the length of the diagonal of the square is 4√2, and the diameters
of the two blue disks add up to 4, the portions of the diagonal outside the blue
disks add up to 4√2 − 4. Hence the radius of the red disk is

(1/4)(4√2 − 4) = √2 − 1.

Fig. 1.30 Balls inside the cube.



In three dimensions, draw eight balls inside a cube, as in Figure 1.30, and
one ball in the center. Since the edge-length of the cube is 4, the radius of
each blue ball is 1. Since the length of the diagonal of the cube is 4√3, the
radius of the red ball is

(1/4)(4√3 − 4) = √3 − 1.
Now we repeat in d dimensions. Here the edge-length of the cube remains
4, the radius of each blue ball remains 1, and there are 2^d blue balls. Since
the length of the diagonal of the cube is 4√d, the same calculation results in
the radius of the red ball equal to r = √d − 1.
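Before drawing conclusions, here is a two-line computation of the red-ball radius r = √d − 1 in a few (arbitrarily chosen) dimensions d; it is used in the discussion that follows.

from numpy import *

# radius of the central red ball as the dimension grows
for d in [2, 3, 4, 9, 10, 100]:
    print(d, sqrt(d) - 1)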

In two dimensions, when a region is scaled by a factor t, its area increases
by the factor t². In three dimensions, when a region is scaled by a factor t,
its volume increases by the factor t³. The general result is

Scaling Principle: Dependence on Dimension

In d dimensions, when a region is scaled by a factor t, its volume scales
by the factor t^d.

The radius of the red ball is r = √d − 1. By the scaling principle, in d
dimensions, the volume of the red ball equals r^d times the volume of the blue
ball. We conclude the following:

• Since r = √d − 1 = 1 exactly when d = 4, we have: In four dimensions,
the red ball and the blue balls are the same size.
• Since there are 2^d blue balls, the ratio of the volume of the red ball over
the total volume of all the blue balls is r^d/2^d.
• Since r^d = 2^d exactly when r = 2, and since r = √d − 1 = 2 exactly when
d = 9, we have: In nine dimensions, the volume of the red ball equals the
sum total of the volumes of all blue balls.
• Since r = √d − 1 > 2 exactly when d > 9, we have: In ten or more
dimensions, the red ball sticks out of the cube.
• Since the length of the semi-diagonal is 2√d, for any dimension d, the
radius of the red ball r = √d − 1 is less than half the length of the semi-
diagonal. As the dimension grows without bound, the proportion of the
diagonal covered by the red ball converges to 1/2.

The following code returns Figure 1.29.



from matplotlib.pyplot import *


from matplotlib.patches import Circle, Rectangle
from numpy import *
from itertools import product

# initialize figure
ax = axes()

square = Rectangle((0,0), 4, 4,color='lightblue')


ax.add_patch(square)

xcent = ycent = [1,3]


# blue disks
for center in product(xcent,ycent):
    circle = Circle(center, radius=1, color='blue')
    ax.add_patch(circle)

# red disk
circle = Circle((2, 2), radius=sqrt(2)-1, color='red')
ax.add_patch(circle)

ax.set_axis_off()
ax.axis('equal')
show()

The code for Figure 1.30 is as follows.

%matplotlib ipympl
from matplotlib.pyplot import *
from numpy import *
from itertools import product

# build sphere mesh


N = 40
theta = linspace(0,2*pi,N)
phi = linspace(0,pi,N)
theta,phi = meshgrid(theta,phi)

# spherical coordinates theta, phi


x = cos(theta)*sin(phi)
y = sin(theta)*sin(phi)
z = cos(phi)

# initialize figure
ax = axes(projection="3d")

# render ball
def ball(a,b,c,r,color):
    return ax.plot_surface(a + r*x,b + r*y, c + r*z,color=color)

xcent = ycent = zcent = [1,3]



# blue balls
for center in product(xcent,ycent,zcent): ball(*center,1,"blue")

# red ball
ball(2,2,2,sqrt(3)-1,"red")

# cube grid
cube = ones((4,4,4),dtype=bool)
ax.voxels(cube, edgecolors='black',lw=.5,alpha=0)

ax.set_aspect("equal")
ax.set_axis_off()
show()

If theta and phi have shapes (m,) and (n,), then

theta,phi = meshgrid(theta,phi)

returns arrays theta and phi of shape (n,m), with theta[i,j] equal to the j-th
entry of the original theta, and phi[i,j] equal to the i-th entry of the original phi.
Here this is used to build a 2d mesh of 3d points
(x[i,j],y[i,j],z[i,j])
lying on a sphere. The cube grid is rendered using a voxel grid. Voxels are
the 3d counterparts of 2d pixels.
In jupyter, a magic command starts with a %. A magic command is sent to
jupyter, not to Python. The magic command %matplotlib ipympl allows
for rotating the figure.

Another phenomenon that happens in high dimensions, discussed in §6.1,


is that the angle between two randomly chosen vectors in a high-dimensional
space is not arbitrary; it is pre-determined. This is a consequence of the law
of large numbers.

Scaling and dimensionality work together in suspensions. (Figure 1.31).


Let [a, b] be an interval and let V be a point not in the interval. To suspend
the interval from V , draw line segments between V and all points in the
interval. You end up with a triangle with vertex V . Therefore the suspension
of an interval is a triangle. Here the dimension of the interval is one, and the
dimension of the triangle is two.
Let D be a disk and let V be a point not in the disk. To suspend the disk
from V , draw line segments between V and all points in the disk. You end

up with a cone with vertex V . Therefore the suspension of a disk is a cone.


Here the dimension of the disk is two, and the dimension of the cone is three.

Fig. 1.31 Suspensions of interval [a, b] and disk D.

In general, the suspension Ĝ of G is obtained by drawing line segments
from a point V not in G to every point x in G,

Ĝ = {(tx, 1 − t) : 0 ≤ t ≤ 1, x in G}.

When t = 1, the suspension’s base is the original region G, and when


t = 0, we have the vertex at the top. For each 0 < t < 1, the cross-section at
level t of the suspension is tG, which is G scaled by the factor t.
We assume the point V is in a dimension orthogonal to the dimensions of
G. Then the dimension of Ĝ is one more than the dimension of G: If G is d-
dimensional, then Ĝ is (d + 1)-dimensional, and the cross-sections at distinct
levels do not intersect.
The (d + 1)-dimensional volume of Ĝ is obtained by integrating over cross-
sections,

Vol(Ĝ) = ∫₀¹ Vol(tG) dt.

By the scaling principle and (A.5.3),


Vol(Ĝ) = ∫₀¹ t^d Vol(G) dt = Vol(G) [t^{d+1}/(d + 1)]₀¹ = Vol(G)/(d + 1).

Thus

Vol(Ĝ) = Vol(G)/(d + 1).

Exercises

Exercise 1.6.1 Why is the diagonal length of the square 4√2?

Exercise 1.6.2 Why is the diagonal length of the cube 4√3?

Exercise 1.6.3 Why does dividing by 4 yield the red disk radius and the red
ball radius?

Exercise 1.6.4 Suspend the unit circle G : x2 +y 2 = 1 from its center. What
is the suspension Ĝ? Conclude area(unit disk) = length(unit circle)/2.

Exercise 1.6.5 Suspend the unit sphere G : x2 + y 2 + z 2 = 1 from its center.


What is the suspension Ĝ? Conclude volume(unit ball) = area(unit sphere)/3.
Chapter 2
Linear Geometry

In §1.4, we reviewed the geometry of vectors in the plane. Now we study


linear geometry in any dimension d.
This chapter is a systematic thorough treatment of vectors and matrices,
but covering only the parts relevant to data science. The material in this
chapter is usually referred to as Linear Algebra. We prefer the term Lin-
ear Geometry, to emphasize that the material is, like much of data science,
geometric.
Even though parts of this chapter are heavy-going, all included material
is necessary for later chapters. In particular, the derivations of chi-squared
distribution (§5.5) and the chi-squared tests (§6.4) are clarified by the appro-
priate use of vectors and matrices.

2.1 Vectors and Matrices

A vector is a list of scalars

v = (t1 , t2 , . . . , td ).

The scalars are the components or the features of v. If there are d features,
we say the dimension of v is d. We call v a d-dimensional vector.
A point x is also a list of scalars, x = (t1 , t2 , . . . , td ). The relation between
points x and vectors v is discussed in §1.3. The set of all d-dimensional vectors
or points is d-dimensional space Rd .
In Python, we use numpy or sympy for vectors and matrices. In Python,
if L is a list, then numpy.array(L) or sympy.Matrix(L) return a vector or
matrix.


from numpy import *

v = array([1,2,3])
v.shape

from sympy import *

v = Matrix([1,2,3])
v.shape

The first v.shape returns (3,), and the second v.shape returns (3,1). In
either case, v is a 3-dimensional vector.
Vectors are added and scaled component by component: With

v = (t1 , t2 , . . . ) and v′ = (t′1 , t′2 , . . . ),

we have

v + v ′ = (t1 + t′1 , t2 + t′2 , . . . ), and sv = (st1 , st2 , . . . ).

Addition v + v ′ only works when v and v ′ have the same shape.


The zero vector is the vector 0 = (0, 0, 0, . . . ). The zero vector is the only
vector satisfying 0 + v = v = v + 0 for every vector v. Even though the zero
scalar and the zero vector are distinct objects, we use 0 to denote both. A
vector v is nonzero if v is not the zero vector.
In R4 , the vectors

e1 = (1, 0, 0, 0), e2 = (0, 1, 0, 0), e3 = (0, 0, 1, 0), e4 = (0, 0, 0, 1)

together are the standard basis. Similarly, in Rd , we have the standard basis
e1 , e2 , . . . , ed .

A matrix is a listing arranged in a rectangle of rows and columns. Specifi-


cally, a d × N matrix A has d rows and N columns,

A = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1N} \\ a_{21} & a_{22} & \dots & a_{2N} \\ \dots & \dots & \dots & \dots \\ a_{d1} & a_{d2} & \dots & a_{dN} \end{pmatrix}.

In Python, if L is a list of lists, then both array(L) and Matrix(L) return


a matrix. The code

from numpy import *

# numpy vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])

A = column_stack([u,v,w])
A.shape

from sympy import *

# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])

A = Matrix.hstack(u,v,w)
A.shape

returns (5,3), so A is a 5 × 3 matrix,


 
A = \begin{pmatrix} 1 & 6 & 11 \\ 2 & 7 & 12 \\ 3 & 8 & 13 \\ 4 & 9 & 14 \\ 5 & 10 & 15 \end{pmatrix}.

The transpose of a matrix A is the matrix B = At resulting from turning


A on its side, so  
B = At = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}.
The default for numpy is to arrange vectors as rows, so the code is shorter.

B = array([u,v,w])

The transpose interchanges rows and columns: the rows of At are the columns
of A. In both numpy or sympy, the transpose of A is A.T.
A vector v may be written as a 1 × N matrix

v = t1 t2 . . . tN .

In this case, we call v a row vector.


A vector v may be written as a d × 1 matrix
v = \begin{pmatrix} t_1 \\ t_2 \\ \dots \\ t_d \end{pmatrix}.
td

In this case, we call v a column vector.

We will be considering matrices with different properties, and we use the


following notation
• A, B: any matrix
• U , V : orthonormal rows or orthonormal columns
• Q: symmetric matrix
• P : projection or permutation matrix

Vectors v1 , v2 , . . . , vN with the same dimension may be horizontally


stacked as columns of a d × N matrix,

A = v1 v2 . . . v N .

Similarly, vectors v1 , v2 , . . . , vN with the same dimension may be vertically


stacked as rows of an N × d matrix,
 
A = \begin{pmatrix} v_1 \\ v_2 \\ \dots \\ v_N \end{pmatrix}.

By default, sympy creates column vectors. Because of this, it is easiest to


build matrices as columns,

from sympy import *

# 5x3 matrix
A = Matrix.hstack(u,v,w)

# column vector
b = Matrix([1,1,1,1,1])

# 5x4 matrix
M = Matrix.hstack(A,b)

In general, for any sympy matrix A, column vectors can be hstacked and
row vectors can be vstacked. For any matrix A, the code

from sympy import *

A == Matrix.hstack(*[A.col(j) for j in range(A.cols)])

returns True. Note we use the unpacking operator * to unpack the list, before
applying hstack.
In numpy, there is column_stack and row_stack, so the code

from numpy import *

A == row_stack([ row for row in A ])

A == column_stack([ col for col in A.T ])

returns True. In numpy, the input is a list, there is no unpacking.

In numpy, a matrix A is a list of rows, so

A == array([ row for row in A ])


A.T == array([ col for col in A.T ])

both return True. Here col refers to rows of At , hence refers to the columns
of A.
The number of rows is len(A), and the number of columns is len(A.T).
To access row i, use A[i]. To access column j, access row j of the transpose,
A.T[j]. To access the j-th entry in row i, use A[i,j].
In sympy, the number of rows in a matrix A is A.rows, and the number of
columns is A.cols, so

A.shape == (A.rows,A.cols)

returns True. To access row i, use A.row(i). Similarly, to access column j,


use A.col(j). So,

A == Matrix([ A.row(i) for i in range(A.rows) ])


A.T == Matrix([ A.col(j) for j in range(A.cols) ])

both return True.


A matrix is square if the number of rows equals the number of columns,
N = d. A matrix is diagonal if it looks like one of these
\begin{pmatrix} a & 0 & 0 & 0 \\ 0 & b & 0 & 0 \\ 0 & 0 & c & 0 \\ 0 & 0 & 0 & d \end{pmatrix},  \begin{pmatrix} a & 0 & 0 & 0 \\ 0 & b & 0 & 0 \\ 0 & 0 & c & 0 \end{pmatrix},  or  \begin{pmatrix} a & 0 & 0 \\ 0 & b & 0 \\ 0 & 0 & c \\ 0 & 0 & 0 \end{pmatrix},

where some of the numbers on the diagonal a, b, c, d may be zero.

A dataset is a collection of points x1 , x2 , . . . , xN in Rd . After centering


the mean to the origin (§1.3), a dataset is a collection of vectors v1 , v2 , . . . ,
vN . Usually the vectors are presented as the columns of a d × N matrix A.
Corresponding to this, datasets are often provided as a CSV file.
The matrix A is the dataset matrix. In excel, this is called a spreadsheet. In
SQL, this is called a table. In numpy, it’s an array. In pandas, it’s a dataframe.
So, effectively,
matrix = dataset = CSV file = spreadsheet = table = array = dataframe

Matrices are added and scaled as follows. With

A = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1N} \\ a_{21} & a_{22} & \dots & a_{2N} \\ \dots & \dots & \dots & \dots \\ a_{d1} & a_{d2} & \dots & a_{dN} \end{pmatrix}  and  A′ = \begin{pmatrix} a′_{11} & a′_{12} & \dots & a′_{1N} \\ a′_{21} & a′_{22} & \dots & a′_{2N} \\ \dots & \dots & \dots & \dots \\ a′_{d1} & a′_{d2} & \dots & a′_{dN} \end{pmatrix},

we have matrix addition


A + A′ = \begin{pmatrix} a_{11} + a′_{11} & a_{12} + a′_{12} & \dots & a_{1N} + a′_{1N} \\ a_{21} + a′_{21} & a_{22} + a′_{22} & \dots & a_{2N} + a′_{2N} \\ \dots & \dots & \dots & \dots \\ a_{d1} + a′_{d1} & a_{d2} + a′_{d2} & \dots & a_{dN} + a′_{dN} \end{pmatrix}

and matrix scaling

tA = \begin{pmatrix} ta_{11} & ta_{12} & \dots & ta_{1N} \\ ta_{21} & ta_{22} & \dots & ta_{2N} \\ \dots & \dots & \dots & \dots \\ ta_{d1} & ta_{d2} & \dots & ta_{dN} \end{pmatrix}.
Matrices may be added only if they have the same shape.
In Python, matrix scaling and matrix addition are a*A and A + B. The
code

from sympy import *

A = zeros(2,3)
B = ones(2,2)
C = Matrix([[1,2],[3,4]])
D = B + C
E = 5 * C
F = eye(4)
A, B, C, D, E, F

returns
 
\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},  \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},  \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},  \begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix},  \begin{pmatrix} 5 & 10 \\ 15 & 20 \end{pmatrix},  \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.

Diagonal matrices are constructed using diag. The code

from sympy import *

A = diag(1,2,3,4)
B = diag(-1, ones(2, 2), Matrix([5, 7, 5]))
A, B

returns  
  −1 000
1 0 0 0 0 1 1 0

0 2 0 0
, 0 1 1 0

 .
0 0 3 0 
0 0 0 5

0 0 0 4 0 0 0 7
0 005
It is straightforward to convert back and forth between numpy and sympy.
In the code

from sympy import *

A = diag(1,2,3,4)

from numpy import *

B = array(A)

C = Matrix(B)

A and C are sympy.Matrix, and B is numpy.array. numpy is for numerical


computations, and sympy is for algebraic/symbolic computations.

Exercises

Exercise 2.1.1 A vector is one-hot encoded if all features are zero, except for
one feature which is one. For example, in R3 there are three one-hot encoded
vectors
(1, 0, 0), (0, 1, 0), (0, 0, 1).
A matrix is a permutation matrix if it is square and all rows and all columns
are one-hot encoded. How many 3 × 3 permutation matrices are there? What
about d × d?

2.2 Products

Let t be a scalar, u, v, w be vectors, and let A, B be matrices. We already


know how to compute tu, tv, and tA, tB. In this section, we compute the dot
product u · v, the matrix-vector product Av, and the matrix-matrix product
AB.
These products are not defined unless the dimensions “match”. In numpy,
these products are written dot; in sympy, these products are written *.
In §1.4, we defined the dot product in two dimensions. We now generalize
to any dimension d. Suppose u, v are vectors in Rd . Then their dot product
u · v is the scalar obtained by multiplying corresponding features and then
summing the products. This only works if the dimensions of u and v agree.
In other words, if u = (s1 , s2 , . . . , sd ) and v = (t1 , t2 , . . . , td ), then

u · v = s1 t1 + s2 t2 + · · · + sd td . (2.2.1)

It’s best to think of this as “row-times-column” multiplication,


 
u · v = \begin{pmatrix} s_1 & s_2 & s_3 \end{pmatrix} \begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} = s_1 t_1 + s_2 t_2 + s_3 t_3.

As in §1.4, we always have rows on the left, and columns on the right.
In Python,

from numpy import *

u = array([1,2,3])
v = array([4, 5, 6])

dot(u,v) == 1*4 + 2*5 + 3*6

from sympy import *



u = Matrix([1,2,3])
v = Matrix([4, 5, 6])

u.T * v == 1*4 + 2*5 + 3*6

both return True.


For clarity, sometimes we write (u.T)*v; the parentheses don’t change
anything. Note in sympy, we take the transpose when multiplying, since vec-
tors are by default column vectors, and it’s always row × column.

As in two dimensions, the length or norm or magnitude of a vector v =


(t1 , t2 , . . . , td ) is the square root of the dot product v · v,
|v| = √(v · v) = √(t₁² + t₂² + · · · + t_d²).

In Python, the length of a vector v is

from numpy import *

sqrt(dot(v,v))

from sympy import *

sqrt(v.T * v)

In numpy, this returns a scalar; in sympy, a 1 × 1 matrix.


A vector is a unit vector if its length equals 1. When |v| = 0, all the features
of v equal zero. It follows the zero vector is the only vector with zero length.
All other vectors have positive length.
Let v be any nonzero vector. By dividing v by its length |v|, we obtain a
unit vector u = v/|v|.

As in §1.4,

Dot Product

The dot product u · v (2.2.1) satisfies

u · v = |u| |v| cos θ, (2.2.2)

where θ is the angle between u and v.



In two dimensions, this was equation (1.4.5) in §1.4. Since any two vectors
lie in a two-dimensional plane, this remains true in any dimension. More
precisely, (2.2.2) is taken as the definition of cos θ.
Based on this, we can compute the angle θ,
cos θ = (u · v)/(|u| |v|) = (u · v)/√((u · u)(v · v)).

Here is code for the angle θ (there is also a built-in numpy.angle).

from numpy import *

def angle(u,v):
    a = dot(u,v)
    b = dot(u,u)
    c = dot(v,v)
    theta = arccos(a / sqrt(b*c))
    return degrees(theta)

Since | cos θ| ≤ 1, we have the

Cauchy-Schwarz Inequality

The dot product of two vectors is absolutely less or equal to the prod-
uct of their lengths,

|u · v| ≤ |u| |v| or |u · v|2 ≤ (u · u)(v · v). (2.2.3)

Vectors u and v are said to be perpendicular or orthogonal if u · v = 0.


In this case we often write u ⊥ v. A collection of vectors is orthogonal if
any pair of vectors in the collection are orthogonal. With this understood,
the zero vector is orthogonal to every vector. The converse is true as well: If
u · v = 0 for every v, then in particular, u · u = 0, which implies u = 0.
Vectors v1 , . . . , vN are said to be orthonormal if they are both unit vectors
and orthogonal. Orthogonal nonzero vectors can be made orthonormal by
dividing each vector by its length.

An important application of the Cauchy-Schwarz inequality is the triangle


inequality
|a + b| ≤ |a| + |b|. (2.2.4)
To see this, let v be any unit vector. Then

(a + b) · v = a · v + b · v ≤ |a||v| + |b||v| = |a| + |b|.

From this, selecting v = (a + b)/|a + b|,



|a + b| = (a + b) · v ≤ |a| + |b|.
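A quick numerical check of the triangle inequality with two arbitrary vectors in R4:

from numpy import *
from numpy.linalg import norm

a = array([1, 2, 3, 4])
b = array([4, -3, 2, -1])

norm(a + b) <= norm(a) + norm(b)    # returns True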

Suppose v is a vector and A is a matrix. If the rows of A have the same


dimension as that of v, we can take the dot product of each row of A with v,
obtaining the matrix-vector product Av: Av is the vector whose features are
the dot products of the rows of A with v.
In other words,

dot(A,v) == array([ dot(row,v) for row in A ])

A*v == Matrix([ A.row(i) * v for i in range(A.rows) ])

both return True.


If u and v are vectors, we can think of u as a row vector, or a matrix
consisting of a single row. With this interpretation, the matrix-vector product
uv equals the dot product u · v.
If u and v are vectors, we can think of u as a column vector, or a matrix
consisting of a single column. With this interpretation, ut is a single row, and
the matrix-vector product ut v equals the dot product u · v.

Let A and B be two matrices. If the row dimension of A equals the column
dimension of B, the matrix-matrix product AB is defined. When this condition
holds, the entries in the matrix AB are the dot products of the rows of A with
the columns of B. In Python,

from numpy import *

C = array([ [ dot(row,col) for col in B.T ] for row in A ])


dot(A,B) == C

from sympy import *

C = Matrix([[ A.row(i)*B.col(j) for j in range(B.cols)] for i in range(A.rows) ])
A*B == C

both return True, and, with


 
A = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{pmatrix},  B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{pmatrix},

the code

A,B,dot(A,B)

A,B,A*B

returns  
AB = \begin{pmatrix} 70 & 80 & 90 \\ 158 & 184 & 210 \end{pmatrix}.

Let A and B be matrices, and suppose the row dimension of A and the
column dimension of B both equal d. Then the matrix-matrix product AB
is defined. If A = (a_ij) and B = (b_ij), then we may write AB in
summation notation as

(AB)_ij = ∑_{k=1}^d a_ik b_kj.    (2.2.5)

The trace of a square matrix


 
A = \begin{pmatrix} a & b & c \\ b & d & e \\ c & e & f \end{pmatrix}

is the sum of its diagonal elements,


 
trace(A) = trace \begin{pmatrix} a & b & c \\ b & d & e \\ c & e & f \end{pmatrix} = a + d + f.

In general, the trace of a d × d matrix is


trace(A) = ∑_{i=1}^d a_ii.

Even though in general AB ̸= BA, it is always true that

trace(AB) = trace(BA), (2.2.6)

This can be verified by switching the i and the k in the sums


trace(AB) = ∑_{i=1}^d (AB)_ii = ∑_{i=1}^d ∑_{k=1}^d a_ik b_ki.
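Numerically, with two arbitrary rectangular matrices of compatible shapes:

from numpy import *

A = array([[1, 2, 3], [4, 5, 6]])       # 2 x 3
B = array([[1, 0], [0, 1], [1, 1]])     # 3 x 2

isclose(trace(dot(A, B)), trace(dot(B, A)))    # returns True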

A matrix Q is symmetric if Q = Qt . For any matrix A, Q = AAt and


Q = At A are symmetric.
A symmetric matrix Q satisfying v · Qv ≥ 0 for every vector v is non-
negative. When Q is nonnegative, we write Q ≥ 0. A symmetric matrix Q
satisfying v · Qv > 0 for every nonzero vector v is positive. When Q is pos-
itive, we write Q > 0. Since any vector may be rescaled into a unit vector,
the vectors v in these definitions may be assumed to be unit vectors.
For any d × N matrix A, At A is a symmetric N × N matrix, and AAt is
a symmetric d × d matrix.
As we saw in §1.5, the variance matrix of a dataset is nonnegative. In fact,
when a dataset in Rd fills up all d dimensions, the variance matrix is positive
(see §2.5).

Let A and B be matrices. Since transpose interchanges rows and columns,


we always have
(AB)t = B t At .
As a special case, if we think of v as a column vector, i.e. as a matrix with a
single column, then the matrix-vector product Av is the same as the matrix-
matrix product Av, so
(Av)t = v t At .
Here we are thinking of v as a matrix with one column, and v t as a matrix
with one row.
In Python,

dot(A,B).T == dot(B.T,A.T)

(A * B).T == B.T * A.T

both return True.


We also have

Dot Product Transpose Identity

For any vectors u, v, and matrices A, we have



(Au) · v = u · (At v) and (At u) · v = u · Av, (2.2.7)

whenever the shapes of u, v, A match.

In terms of row vectors and column vectors, this is automatic. For example,

(Au) · v = (Au)t v = (ut At )v = ut (At v) = u · (At v).

In Python,

dot(dot(A,u),v) == dot(u,dot(A.T,v))
dot(dot(A.T,u),v) == dot(u,dot(A,v))

(A*u).T * v == u.T * (A.T*v)


(A.T*u).T * v == u.T * (A*v)

all return True.

Let A be a matrix. We compute useful expressions for AAt and At A.


Assume the columns of A are v1 , v2 , . . . , vN , so A is d × N . Since the
transpose interchanges rows and columns, v1 , v2 , . . . , vN are the rows of At .
Since matrix-matrix multiplication is row × column, we have
 
v1 · v1 v1 · v2 . . . v 1 · vN
 v2 · v1 v2 · v2 . . . v 2 · vN 
At A = 
 ...
. (2.2.8)
... ... ... 
vN · v1 vN · v2 . . . v N · v N

As a consequence (recall iff is short for if and only if),

Orthonormal Rows and Columns


Let U be a matrix.
• U has orthonormal columns iff U t U = I.
• U has orthonormal rows iff U U t = I.

The second statement follows from the first by substituting U t for U .

To compute AAt , we bring in the tensor product. If u and v are vectors,


the tensor product u ⊗ v is the matrix

(u ⊗ v)ij = ui vj .

If u is d-dimensional and v is N -dimensional, then u ⊗ v is a d × N matrix.


If we think of u and v as 1 × d and 1 × N matrices, this is the matrix-matrix
product ut v.
For example, if u = (a, b, c), v = (A, B), then
$$u \otimes v = \begin{pmatrix} a \\ b \\ c \end{pmatrix} \begin{pmatrix} A & B \end{pmatrix} = \begin{pmatrix} aA & aB \\ bA & bB \\ cA & cB \end{pmatrix}.$$
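In sympy, the tensor product can be formed as a column times a row, mirroring the example above (a sketch; in numpy the same matrix is returned by outer(u,v)).

from sympy import *

a, b, c, A, B = symbols('a b c A B')

u = Matrix([a,b,c])   # column vector
v = Matrix([A,B])     # column vector

# u tensor v as a column times a row
u * v.T

This returns the 3 × 2 matrix of products displayed above.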

Then the identities (1.4.17) and (1.4.18) hold in general. Using the tensor
product, we have

Tensor Identity

Let A be a matrix with columns v1 , v2 , . . . , vN . Then

AAt = v1 ⊗ v1 + v2 ⊗ v2 + · · · + vN ⊗ vN . (2.2.9)

To derive this, let Q and Q′ be the symmetric matrices on the left and
right sides of (2.2.9). By Exercise 2.2.7, to establish (2.2.9), it is enough to
show x · Qx = x · Q′ x for every vector x. By (2.2.7),

$$x \cdot Qx = x \cdot AA^t x = (A^t x) \cdot (A^t x) = |A^t x|^2.$$

On the other hand, multiplying the right side of (2.2.9) by x, we obtain

Q′ x = (v1 ⊗ v1 )x + (v2 ⊗ v2 )x + · · · + (vN ⊗ vN )x.

By (1.4.17), this implies

Q′ x = (v1 · x)v1 + (v2 · x)v2 + · · · + (vN · x)vN .

Taking the dot product of both sides with x,

x · Q′ x = (v1 · x)2 + (v2 · x)2 + · · · + (vN · x)2 . (2.2.10)

But by matrix-vector multiplication,

At x = (v1 · x, v2 · x, . . . , vN · x).

Since |At x|2 is the sum of the squares of its components, this establishes
x · Qx = x · Q′ x, hence the result.
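Here is a minimal numerical check of the tensor identity (a sketch with a randomly chosen matrix; the shape is an arbitrary choice).

from numpy import *
from numpy.random import random

d, N = 3, 4
A = random((d,N))

# sum of the tensor products of the columns of A
Q = sum([ outer(col,col) for col in A.T ], axis=0)

allclose(Q, dot(A,A.T))

This returns True.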

Trace and tensor product combine in the identity



u · Av = trace((u ⊗ v)t A), (2.2.11)

valid for any matrix A and vectors u, v with compatible shapes. The deriva-
tion of this identity is a simple calculation with components that we skip.

If $A = (a_{ij})$ is any matrix, then the norm squared of A is
$$\|A\|^2 = \sum_{i,j} a_{ij}^2.$$

This equals trace(At A) which equals trace(AAt ). By taking the trace in


(2.2.9),

Norm Squared of Matrix

Let A be a matrix with columns $v_1, v_2, \dots, v_N$. Then
$$\|A\|^2 = |v_1|^2 + |v_2|^2 + \cdots + |v_N|^2, \qquad (2.2.12)$$
and
$$\|A\|^2 = \mathrm{trace}(A^t A) = \mathrm{trace}(AA^t). \qquad (2.2.13)$$
By replacing A by At , the same results hold for rows.
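A quick check of (2.2.12) and (2.2.13) in numpy, again with an arbitrarily chosen random matrix:

from numpy import *
from numpy.random import random

A = random((3,4))

# norm squared as the sum of squares of all entries
n2 = sum(A**2)

allclose(n2, trace(dot(A.T,A)))
allclose(n2, trace(dot(A,A.T)))
allclose(n2, sum([ dot(col,col) for col in A.T ]))

All three comparisons return True.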

If $x_1, x_2, \dots, x_N$ is a dataset of points, and $v_1, v_2, \dots, v_N$ is the corresponding centered dataset, then the variance matrix Q is the average of tensor products (§1.5),
$$Q = \frac{v_1 \otimes v_1 + v_2 \otimes v_2 + \cdots + v_N \otimes v_N}{N}.$$
Let A be the matrix with columns $v_1, v_2, \dots, v_N$. By (2.2.9), the last equation is the same as
$$Q = \frac{1}{N} AA^t. \qquad (2.2.14)$$
Let dataset be the Iris dataset, as a d × N array. If vectors is the cor-
responding centered dataset, as in §2.1, code for the variance is

from numpy import *

# vectors is dxN array



N = vectors.shape[1]  # number of samples
Q = dot(vectors,vectors.T)/N

Of course, it is simpler to avoid centering and just do directly

Q = cov(dataset,bias=True)

After downloading the Iris dataset as in §2.1, the mean, variance, and total
variance are
 
$$\mu = (5.84, 3.05, 3.76, 1.2), \qquad Q = \begin{pmatrix} 0.68 & -0.04 & 1.27 & 0.51 \\ -0.04 & 0.19 & -0.32 & -0.12 \\ 1.27 & -0.32 & 3.09 & 1.29 \\ 0.51 & -0.12 & 1.29 & 0.58 \end{pmatrix}, \qquad 4.54. \qquad (2.2.15)$$

In §1.5, we discussed standardizing datasets in R2 . This can be done in


general.
Let x1 , x2 , . . . , xN be a dataset in Rd . Each sample point x has d features
(t1 , t2 , . . . , td ). We compute the variance of each feature separately.
Let e1 , e2 , . . . , ed be the standard basis in Rd , and, for each j = 1, 2 . . . , d,
project the dataset onto ej , obtaining the scalar dataset

x1 · ej , x2 · ej , . . . , xN · ej ,

consisting of the j-th feature of the samples. If qjj is the variance of this
scalar dataset, then q11 , q22 , . . . , qdd are the diagonal entries of the variance
matrix.
To standardize the dataset, we center it, and rescale the features to have
variance one, as follows. Let µ = (µ1 , µ2 , . . . , µd ) be the dataset mean. For
each sample point x = (t1 , t2 , . . . , td ), the standardized vector is
 
$$v = \left( \frac{t_1 - \mu_1}{\sqrt{q_{11}}}, \frac{t_2 - \mu_2}{\sqrt{q_{22}}}, \dots, \frac{t_d - \mu_d}{\sqrt{q_{dd}}} \right).$$

Then the standardized dataset is v1 , v2 , . . . , vN .


If $Q = (q_{ij})$ is the variance matrix, then the correlation matrix is the d × d matrix $Q' = (q'_{ij})$ with entries
$$q'_{ij} = \frac{q_{ij}}{\sqrt{q_{ii}\, q_{jj}}}, \qquad i, j = 1, 2, \dots, d.$$

Then a straightforward calculation shows



Standardized Variance Equals Correlation

The variance matrix of the standardized dataset equals the correlation


matrix of the original dataset.

In Python,

from numpy import *
from numpy.random import random
from sklearn.preprocessing import StandardScaler

N, d = 10, 2
# Nxd array of random samples
dataset = array([ [random() for _ in range(d)] for _ in range(N) ])

# standardize dataset
standardized = StandardScaler().fit_transform(dataset)

Qcorr = corrcoef(dataset.T)
Qcov = cov(standardized.T,bias=True)

allclose(Qcov,Qcorr)

returns True.

Exercises

Exercise 2.2.1 For n = 1, 2, 3, . . . , let v be the vector

v = (1, 2, 3, . . . , n).

Let $|v| = \sqrt{v \cdot v}$ be the length of v. Then, for example, when n = 1, |v| = 1 and, when n = 2, $|v| = \sqrt{5}$. There is one other n for which |v| is a whole number. Use Python to find it.

Exercise 2.2.2 If µ is a unit vector and Q = I − µ ⊗ µ, then $Q^2 = Q$.

Exercise 2.2.3 Give an example of a 3 × 3 matrix A satisfying $A^2 = 0$ but


A ̸= 0.

Exercise 2.2.4 If $Q^2 = 0$ and $Q^t = Q$, then Q = 0.

Exercise 2.2.5 Matrices A and B commute if AB = BA. For what condition


on a and b do these matrices commute?
   
$$A = \begin{pmatrix} 1 & 0 & 0 \\ a & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & b & 1 \end{pmatrix}.$$

Exercise 2.2.6 Verify (2.2.11).


Exercise 2.2.7 Let Q and Q′ be symmetric d×d matrices. Show that Q = Q′
iff
x · Qx = x · Q′ x, for all x.
(Replace x by u + v and expand, then insert u and v standard basis vectors.)
Exercise 2.2.8 Compute the means and variances µ1 , µ2 , µ3 and Q1 , Q2 ,
Q3 of the classes of the Iris dataset.
Exercise 2.2.9 With

from sympy import *

def row(i,d): return [ (-1)**(i+j) for j in range(d) ]


def R(d): return Matrix([ row(i,d) for i in range(d) ])

print R(d) for d = 1, 2, 3, . . . .


Exercise 2.2.10 With R(d) as in Exercise 2.2.9,

$$R(d)^3 = c(d) \times R(d)$$

for some scalar c(d). Use Python to find c(d). Here d = 1, 2, 3, . . . .


Exercise 2.2.11 Suppose A and B are matrices with rows and columns
 
$$A = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_N \end{pmatrix} \qquad \text{and} \qquad B = (v_1, v_2, \dots, v_d),$$
all with the same dimension. Show that
$$AB = \begin{pmatrix} u_1 \cdot v_1 & u_1 \cdot v_2 & \dots & u_1 \cdot v_d \\ u_2 \cdot v_1 & u_2 \cdot v_2 & \dots & u_2 \cdot v_d \\ \dots & \dots & \dots & \dots \\ u_N \cdot v_1 & u_N \cdot v_2 & \dots & u_N \cdot v_d \end{pmatrix}.$$

This generalizes (2.2.8).


Exercise 2.2.12 Suppose A and B are matrices with columns and rows
 
$$A = (u_1, u_2, \dots, u_d) \qquad \text{and} \qquad B = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{pmatrix}.$$
Use (2.2.5) to show
$$AB = u_1 \otimes v_1 + u_2 \otimes v_2 + \cdots + u_d \otimes v_d.$$

This generalizes (2.2.9).

Exercise 2.2.13 Let P and Q be d×d permutation matrices (Exercise 2.1.1).


Show that P Q is a permutation matrix.

Exercise 2.2.14 Let P be a 3×3 permutation matrix (Exercise 2.1.1). Show


that $P^6 = I$. Check this for every 3 × 3 permutation matrix. What about
d × d?

2.3 Matrix Inverse

Let A be any matrix and b a vector. The goal is to solve the linear system

Ax = b. (2.3.1)

In this section, we use the inverse A−1 and the pseudo-inverse A+ to solve
(2.3.1).
Of course, the system (2.3.1) doesn't even make sense unless the shapes are compatible: A must be N × d, x must be a d-vector, and b must be an N-vector. In Python terms,

A.shape == (len(b), len(x))

In what follows, we assume the dimensions are appropriately compatible.
Even then, it’s very easy to construct matrices A and vectors b for which
the linear system (2.3.1) has no solutions at all! For example, take A the zero
matrix and b any non-zero vector. Because of this, we must take some care
when solving (2.3.1).

Given a square matrix A, the inverse matrix is the matrix B satisfying

AB = I = BA. (2.3.2)

Here I is the identity matrix. Since I is a square matrix, A must also be a


square matrix.
Only square matrices may have inverses. Moreover, not every square ma-
trix has an inverse. For example, the zero matrix does not have an inverse.
When A has an inverse, we say A is invertible.
If a matrix is d × d, then the inverse is also d × d. We write B = A−1 for
the inverse matrix of A. For example, it is easy to check
   
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \implies A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$

Since we can’t divide by zero, a 2 × 2 matrix is invertible only if ad − bc ̸= 0.
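Here is a symbolic check of this formula (a sketch using sympy):

from sympy import *

a, b, c, d = symbols('a b c d')
A = Matrix([[a,b],[c,d]])

# difference between sympy's inverse and the formula above
simplify(A.inv() - Matrix([[d,-b],[-c,a]])/(a*d - b*c))

This returns the 2 × 2 zero matrix.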


Since

(AB)(B −1 A−1 ) = A(BB −1 )A−1 = AIA−1 = AA−1 = I,

we have
(AB)−1 = B −1 A−1 .

When A is invertible, the inverse A−1 provides a conceptual framework for


solving the linear system Ax = b. Of course, a framework is not the same as
a computational procedure. Many issues arise in the numerical construction
of the inverse. These we sweep under the rug and ignore by accessing the
inverse code inv in numpy and sympy.

Solution of Ax = b when A invertible


If A is invertible, then

Ax = b =⇒ x = A−1 b. (2.3.3)

This is easy to check, since

Ax = A(A−1 b) = (AA−1 )b = Ib = b.

from sympy import *

# solving Ax=b
x = A.inv() * b

from numpy import *


from numpy.linalg import inv

# solving Ax=b
x = dot(inv(A) , b)

In general, a matrix A is not invertible, and Ax = b is solved using the


pseudo-inverse x = A+ b. The definition and framework of the pseudo-inverse

is in §2.6. The upshot is: every (square or non-square) matrix A has a pseudo-
inverse A+ . Here is the general result.

Solution of Ax = b for General A


If Ax = b is solvable, then

$$x^+ = A^+ b \implies Ax^+ = b.$$

If Ax = b is not solvable, then x+ minimizes the residual |Ax − b|2 .

This says if Ax = b has some solution, then x+ = A+ b is also a solution.


On the other hand, Ax = b may have no solution, in which case the error
|Ax − b|2 is minimized. From this point of view, it’s best to think of x+ as a
candidate for a solution. It’s a solution only after confirming equality of Ax+
and b. All this is worked out in §2.6.
To put this in context, there are three possibilities for a linear system
(2.3.1). A linear system Ax = b can have
• no solutions, or
• exactly one solution, or
• infinitely many solutions.
As examples of these three possibilities, we have
• A = 0 and b ̸= 0,
• A is invertible,
• A = 0 and b = 0.
The pseudo-inverse provides a conceptual framework for deciding among
these three possibilities. Of course, a framework is not the same as a com-
putational procedure. Many issues arise in the numerical construction of the
pseudo-inverse. These we sweep under the rug and ignore by accessing the
pseudo-inverse code pinv in numpy and sympy.
In this section, we focus on using Python to solve Ax = b, and in §2.6, we
explore the pseudo-inverse framework.
How do we use the above result? Given A and b, using Python, we compute
x = A+ b. Then we check, by multiplying in Python, equality of Ax and b.
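Here is a sketch of this solve-and-check procedure in numpy (the helper name solve_or_report is ours, not a library function):

from numpy import *
from numpy.linalg import pinv

def solve_or_report(A, b):
    # candidate from the pseudo-inverse
    x = dot(pinv(A), b)
    # x is a solution only if A x reproduces b
    if allclose(dot(A,x), b):
        return x
    return None   # Ax = b is not solvable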
The rest of the section consists of examples of solving linear systems. The
reader is encouraged to work out the examples below in Python. However,
because some linear systems have more than one solution, and the implemen-
tations of Python on your laptop and on my laptop may differ, our solutions
may differ.
It can be shown that if the entries of A are integers, then the entries of A+
are fractions. This fact is reflected in sympy, but not in numpy, as the default
in numpy is to work with floats.

Let

u = (1, 2, 3, 4, 5), v = (6, 7, 8, 9, 10), w = (11, 12, 13, 14, 15),

and let A be the matrix with columns u, v, w, and rows a, b, c, d, e,


   
$$A = \begin{pmatrix} u & v & w \end{pmatrix} = \begin{pmatrix} 1 & 6 & 11 \\ 2 & 7 & 12 \\ 3 & 8 & 13 \\ 4 & 9 & 14 \\ 5 & 10 & 15 \end{pmatrix} = \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix}. \qquad (2.3.4)$$

from numpy import *

# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])

# arrange as columns
A = column_stack([u,v,w])

For this A, the code

from scipy.linalg import pinv

pinv(A)

returns
$$A^+ = \frac{1}{150} \begin{pmatrix} -37 & -20 & -3 & 14 & 31 \\ -10 & -5 & 0 & 5 & 10 \\ 17 & 10 & 3 & -4 & -11 \end{pmatrix}.$$
Alternatively, in sympy,

from sympy import *

# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])

A = Matrix.hstack(u,v,w)

A.pinv()

returns the same result.



Let A be as in (2.3.4) and let

b1 = (8, 9, 10, 11, 12), b2 = (11, 6, 1, −4, −9).

We solve Ax = b1 and Ax = b2 by computing the candidates


$$x^+ = A^+ b_1 = \frac{1}{15}(2, 5, 8),$$
and
$$x^+ = A^+ b_2 = \frac{1}{30}(-173, -50, 73).$$
Then we check that the candidates are actually solutions, which they are, by
comparing Ax+ and b1 , in the first case, and Ax+ and b2 , in the second case.

For
b3 = (−9, −3, 3, 9, 10),
we have
$$x^+ = A^+ b_3 = \frac{1}{15}(82, 25, -32).$$
However, for this x+ , we have

Ax+ = (−8, −3, 2, 7, 12),

which is not equal to b3 . From this, not only do we conclude x+ is not a


solution of Ax = b3 , but also, by the general result above, the system Ax = b3
is not solvable at all.

Let B be the matrix with columns b1 and b2 ,


 
$$B = (b_1, b_2) = \begin{pmatrix} 8 & 11 \\ 9 & 6 \\ 10 & 1 \\ 11 & -4 \\ 12 & -9 \end{pmatrix}.$$

We solve
$$Bx = u, \qquad Bx = v, \qquad Bx = w$$
by constructing the candidates
$$B^+ u, \qquad B^+ v, \qquad B^+ w,$$
obtaining the solutions
$$x^+ = \frac{1}{51}(16, -7), \qquad x^+ = \frac{1}{51}(41, -2), \qquad x^+ = \frac{1}{51}(66, 3).$$

Let
$$C = A^t = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}$$
and let f = (0, −5, −10). By Exercise 2.6.8, $C^+ = (A^+)^t$, so
$$C^+ = (A^+)^t = \frac{1}{150} \begin{pmatrix} -37 & -10 & 17 \\ -20 & -5 & 10 \\ -3 & 0 & 3 \\ 14 & 5 & -4 \\ 31 & 10 & -11 \end{pmatrix}$$

and
$$x^+ = C^+ f = \frac{1}{10}(-8, -5, -2, 1, 4).$$
Once we confirm equality of $Cx^+$ and f, which is the case, we obtain a solution $x^+$ of Cx = f.

Let D be the matrix with columns a and f ,


 
$$D = (a, f) = \begin{pmatrix} 1 & 0 \\ 6 & -5 \\ 11 & -10 \end{pmatrix},$$
where a, b, c, d, e are the rows of A, or, equivalently, the columns of C. Then
$$D^+ = \frac{1}{30} \begin{pmatrix} 25 & 10 & -5 \\ 28 & 10 & -8 \end{pmatrix}.$$

We solve
$$Dx = a, \quad Dx = b, \quad Dx = c, \quad Dx = d, \quad Dx = e,$$
by constructing the candidates
$$D^+ a, \quad D^+ b, \quad D^+ c, \quad D^+ d, \quad D^+ e,$$
obtaining the solutions
$$x^+ = (1, 0), \quad x^+ = (2, 1), \quad x^+ = (3, 2), \quad x^+ = (4, 3), \quad x^+ = (5, 4).$$

Exercises

Exercise 2.3.1 Verify the computations in this section using Python.

Exercise 2.3.2 With R(d) as in Exercise 2.2.9, find the formula for the
inverse and pseudo-inverse of R(d), whichever exists. Here d = 1, 2, 3, . . . .

Exercise 2.3.3 The sum matrix and difference matrix are


   
$$S = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad D = \begin{pmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

Compute SD and DS. What do you conclude?

Exercise 2.3.4 Let D = D(d) be the d × d difference matrix as in Exer-


cise 2.3.3. Compute DDt and Dt D, and SS t and S t S.

Exercise 2.3.5 Let u and v be vectors in Rd and let A = I + u ⊗ v. Show


that
$$A^{-1} = I - \frac{u \otimes v}{1 + u \cdot v}.$$
Exercise 2.3.6 Let P be a d × d permutation matrix (Exercise 2.1.1). Show
that P t is the inverse of P .

2.4 Span and Linear Independence

Let u, v, w be three vectors. Then


$$3u - v + 9w, \qquad 5u + 0v - \frac{1}{6}w, \qquad 0u + 0v + 0w$$
are linear combinations of u, v, w.
In general, a linear combination of vectors v1 , v2 , . . . , vd is

t 1 v1 + t 2 v2 + · · · + t d vd . (2.4.1)

Here the coefficients t1 , t2 , . . . , td are scalars. In short, a linear combination


is a sum of scaled vectors.
In terms of matrices, let

u = (1, 2, 3, 4, 5), v = (6, 7, 8, 9, 10), w = (11, 12, 13, 14, 15),

and let A be the matrix with columns u, v, w, as in (2.3.4). Let x be the vector
(r, s, t) = (1, 2, 3). Then an explicit calculation shows (do this calculation!)
the matrix-vector product Ax equals ru + sv + tw,

Ax = ru + sv + tw.

The code

dot(A,x) == r*u + s*v + t*w

returns

array([ True, True, True, True, True])

To repeat, the linear combination ru + sv + tw is the same as the matrix-


vector product Ax. This is a general fact on which everything depends:

Column Linear Combination Equals Matrix-Vector Product

Let A be a matrix with columns v1 , v2 , . . . , vd , and let

x = (t1 , t2 , . . . , td ).

Then
Ax = t1 v1 + t2 v2 + · · · + td vd , (2.4.2)
In other words,

Ax = b is the same as b = t1 v1 + t2 v2 + · · · + td vd . (2.4.3)

The span of vectors v1 , v2 , . . . , vd consists of all linear combinations

t1 v1 + t2 v2 + · · · + td vd

of the vectors. For example, span(b) of a single vector b is the line through
b, and span(u, v, w) is the set of all linear combinations ru + sv + tw.

Span Definition I

The span of v1 , v2 , . . . , vd is the set S of all linear combinations of


v1 , v2 , . . . , vd , and we write

S = span(v1 , v2 , . . . , vd ).

When we don’t want to specify the vectors v1 , v2 , v3 , . . . , vd , we simply


say S is a span.
From (2.4.2), we have

Span Definition II

Let A be the matrix with columns v1 , v2 , v3 , . . . , vd . Then


span(v1 , v2 , . . . , vd ) is the set S of all vectors of the form Ax.

If each vector vk is a linear combination of vectors w1 , w2 , . . . , wN , then


every vector v in span(v1 , v2 , . . . , vd ) is a linear combination of w1 , w2 , . . . ,
wN , so span(v1 , v2 , . . . , vd ) is contained in span(w1 , w2 , . . . , wN ).
If also each vector wk is a linear combination of vectors v1 , v2 , . . . , vd ,
then every vector w in span(w1 , w2 , . . . , wN ) is a linear combination of v1 ,
v2 , . . . , vd , so span(w1 , w2 , . . . , wN ) is contained in span(v1 , v2 , . . . , vd ).
When both conditions hold, it follows

span(v1 , v2 , . . . , vd ) = span(w1 , w2 , . . . , wN ).

Thus there are many choices of spanning vectors for a given span.
For example, let u, v, w be the columns of A in (2.3.4). Let ⊂ mean “is
contained in”. Then

span(u, v) ⊂ span(u, v, w),

since adding a third vector can only increase the linear combination possibil-
ities. On the other hand, since w = 2v − u, we also have

span(u, v, w) ⊂ span(u, v).

It follows that
span(u, v, w) = span(u, v).

Let A be a matrix. The column space of A is the span of its columns. For
A as in (2.3.4), the column space of A is span(u, v, w). The code

from sympy import *

# column vectors
u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])

A = Matrix.hstack(u,v,w)

# returns minimal spanning set for column space of A


A.columnspace()

returns a minimal list of vectors spanning the column space of A. The column
rank of A is the length of the list, i.e. the number of vectors returned.
For example, for A as in (2.3.4), this code returns the list
   
$$[u, v] = \left[ \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{pmatrix}, \begin{pmatrix} 6 \\ 7 \\ 8 \\ 9 \\ 10 \end{pmatrix} \right].$$

Why is this? Because w = 2v − u, so

span(u, v, w) = span(u, v).

We conclude the column rank of A equals 2.

If the columns of A are v1 , v2 , . . . , vd , and x = (t1 , t2 , . . . , td ) is a vector,


then by definition of matrix-vector multiplication,

Ax = t1 v1 + t2 v2 + · · · + td vd .

By (2.4.3),

Column Space and Ax = b

The column space of a matrix A consists of all vectors of the form Ax.
A vector b is in the column space of A when Ax = b has a solution.

The corresponding code in numpy is



from numpy import *


from scipy.linalg import orth

# vectors
u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])

A = column_stack([u,v,w])

# returns minimal orthonormal spanning set


# for column space of A

orth(A)

This code returns the array in Figure 2.1.

Fig. 2.1 Numpy column space array.

To explain this, let

b1 = (8, 9, 10, 11, 12), b2 = (11, 6, 1, −4, −9).


Then $b_1 \cdot b_2 = 0$, $|b_1| = \sqrt{510}$, $|b_2| = \sqrt{255}$, and the columns of the array
in Figure 2.1 are the two orthonormal vectors −b1 /|b1 | and b2 /|b2 |. (Why
−b1 /|b1 | instead of b1 /|b1 |? Because numpy has to make an arbitrary choice
among the unit vectors ±b1 /|b1 |.)
We conclude the column space of A can be described in at least three ways,

span(b1 , b2 ) = span(u, v, w) = span(u, v).

Explicitly, b1 and b2 are linear combinations of u, v, w,

15b1 = 2u + 5v + 8w, 30b2 = −173u − 50v + 73w, (2.4.4)

and u, v, w are linear combinations of b1 and b2 ,

51u = 16b1 − 7b2 , 51v = 41b1 − 2b2 , w = 2v − u. (2.4.5)

By (2.4.3), to derive (2.4.4), we solve Ax = b1 and Ax = b2 for x. But this


was done in §2.3.

Similarly, let B be the matrix with columns b1 and b2 , and solve Bx = u,


Bx = v, Bx = w, obtaining (2.4.5). This was also done in §2.3.
As a general rule, sympy.columnspace returns lists of spanning vectors,
and scipy.linalg.orth returns arrays of orthonormal spanning vectors.

Let A be a matrix, and let b be a vector. How can we tell if b is in the


column space of A? Given the above tools, here is an easy way to tell.
Write the augmented matrix Ā = (A, b); Ā obtained by adding b as an
extra column next to the columns of A. If A is d × N , then Ā is d × (N + 1).
Given A and Ā = (A, b), compute their column ranks. Let v1 , v2 , . . . , vN
be the columns of A. If these ranks are equal, then

span(v1 , v2 , . . . , vN ) = span(v1 , v2 , . . . , vN , b),

so b is a linear combination of the columns, or b is in the column space of A.

Column Space of Augmented Matrix

Let Ā be the matrix A augmented by a vector b. Then Ax = b is


solvable iff b is in the column space of A iff

column rank(A) = column rank(Ā). (2.4.6)

For example, let b3 = (−9, −3, 3, 9, 10) and let Ā = (A, b3 ). Using Python,
check the column rank of Ā is 3. Since the column rank of A is 2, we conclude
b3 is not in the column space of A, so b3 is not a linear combination of u, v,
w.
When (2.4.6) holds, b is a linear combination of the columns of A. However,
(2.4.6) does not tell us which linear combination. According to (2.4.3), finding
the specific linear combination is equivalent to solving Ax = b.
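Here is one way to carry out this check in sympy, for A as in (2.3.4) and b3 as above (a sketch; we use columnspace to measure the column rank):

from sympy import *

u = Matrix([1,2,3,4,5])
v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])
b3 = Matrix([-9,-3,3,9,10])

A = Matrix.hstack(u,v,w)
Abar = Matrix.hstack(A,b3)   # augmented matrix (A, b3)

# column ranks of A and of the augmented matrix
len(A.columnspace()), len(Abar.columnspace())

This returns (2, 3), so b3 is not in the column space of A.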

R3 consists of all vectors (r, s, t) in three dimensions. If

e1 = (1, 0, 0), e2 = (0, 1, 0), e3 = (0, 0, 1),

then
(r, s, t) = re1 + se2 + te3 .
This shows the vectors e1 , e2 , e3 span R3 , or

R3 = span(e1 , e2 , e3 ).

As a consequence, R3 is a span. Similarly, in dimension d, we can write



e1 = (1, 0, 0, . . . , 0, 0)
e2 = (0, 1, 0, . . . , 0, 0)
e3 = (0, 0, 1, . . . , 0, 0) (2.4.7)
... = ...
ed = (0, 0, 0, . . . , 0, 1)

Then e1 , e2 , . . . , ed span Rd , so

Standard Basis Spans

Rd is a span.

Following machine-learning terminology, a vector v = (v1 , v2 , . . . , vd ) is


one-hot encoded at slot j if all components of v are zero except the j-th
component. For example, when d = 3, the vectors

(a, 0, 0), (0, a, 0), (0, 0, a)

are one-hot encoded.


Sometimes one-hot encoded also means the nonzero slot must be a one.
With this interpretation, when d = 3, the only one-hot encoded vectors are

(1, 0, 0), (0, 1, 0), (0, 0, 1).

We use both interpretations.


The vectors e1 , e2 , . . . , ed are one-hot encoded. These vectors are the
standard basis for Rd , or the one-hot encoded basis for Rd .
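In numpy, the standard basis vectors are just the rows of the identity matrix; here is a minimal sketch (the variable names are ours).

from numpy import *

d = 3
E = eye(d)

# the rows of the identity are the one-hot encoded basis vectors
E[0], E[1], E[2]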

The row space of a matrix is the span of its rows.

from sympy import *

# returns minimal spanning set for row space of A


A.rowspace()

The row rank of a matrix is the number of vectors returned by rowspace().


This is the minimal number of vectors spanning the row space of A.
For example, call the rows of A in (2.3.4) a, b, c, d, e. Let

f = (0, −5, −10).

Then sympy.rowspace returns the vectors a and f , so

span(a, b, c, d, e) = span(a, f ).

Explicitly, the linear combination
$$10f = -8a - 5b - 2c + d + 4e$$
is derived using $C = A^t$ and solving Cx = f. The linear combinations
$$a = a + 0f, \quad b = 2a + f, \quad c = 3a + 2f, \quad d = 4a + 3f, \quad e = 5a + 4f$$
are derived using D = (a, f) and solving Dx = a, Dx = b, Dx = c, Dx = d, Dx = e. Again, these linear systems were solved in §2.3.
Since the transpose interchanges rows and columns, the row space of A
equals the column space of At . Using this, we compute the row space in numpy
by

from numpy import *


from scipy.linalg import orth

# returns minimal spanning set for row space of A


orth(A.T)

Numpy returns orthonormal vectors.


Clearly, when Q is symmetric, the row space of Q equals the column space
of Q.
It turns out the column rank equals the row rank, for any matrix. Even
though we won’t establish this till (2.9.1), we state this result here, because
it helps ground the concepts.

Column Rank Equals Row Rank

For any matrix, the row rank equals the column rank.

Because of this, we refer to this common number as the rank of the matrix.

A linear combination t1 v1 + t2 v2 + · · · + td vd is trivial if all the coefficients


are zero, t1 = t2 = · · · = td = 0. Otherwise it is non-trivial, if at least one
coefficient is not zero. A linear combination t1 v1 + t2 v2 + · · · + td vd vanishes
if it equals the zero vector,

t1 v1 + t2 v2 + · · · + td vd = 0.

For example, with u, v, w as above, we have w = 2v − u, so

ru + sv + tw = 1u − 2v + 1w = 0 (2.4.8)

is a vanishing non-trivial linear combination of u, v, w.



We say v1 , v2 , . . . , vd are linearly dependent if there is a vanishing non-


trivial linear combination of v1 , v2 , . . . , vd . Otherwise, if there is no non-trivial
vanishing linear combination, we say v1 , v2 , . . . , vd are linearly independent.
For example, u, v, w above are linearly dependent.
Suppose u, v, w are any three vectors, and suppose u, v, w are linearly
dependent. Then we have ru + sv + tw = 0 for some scalars r, s, t, where at
least one is not zero. If r ̸= 0, then we may solve for u, obtaining

u = −(s/r)v − (t/r)w.

If s ̸= 0, then we may solve for v, obtaining

v = −(r/s)u − (t/s)w.

If t ̸= 0, then
w = −(r/t)u − (s/t)v.
Hence linear dependence of u, v, w means one of the three vectors is a multiple
of the other two vectors.
In general, a vanishing non-trivial linear combination of v1 , v2 , . . . , vd , or
linear dependence of v1 , v2 , . . . , vd , is the same as saying one of the vectors
is a linear combination of the remaining vectors.
In terms of matrices,

Homogeneous Linear Systems

Let A be the matrix with columns v1 , v2 , . . . , vd . Then


• v1 , v2 , . . . , vd are linearly dependent when Ax = 0 has a nonzero
solution x, and
• v1 , v2 , . . . , vd are linearly independent when Ax = 0 has only the
zero solution x = 0.

The set of vectors x satisfying Ax = 0, or the set of solutions x of Ax = 0,


is the null space of the matrix A.
With this terminology, v1 , v2 , . . . , vd are linearly dependent when there is
a nonzero null space for the matrix A.
For example, with A as in (2.3.4), the sympy code

from sympy import *

A.nullspace()

returns a list with a single vector,


   
$$\begin{pmatrix} r \\ s \\ t \end{pmatrix} = \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}.$$

This says the null space of A consists of all multiples of (1, −2, 1). Since the
code

[r,s,t] = A.nullspace()[0]

r*u + s*v + t*w

returns the zero column vector
$$\begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},$$
we have Ax = 0, in agreement with (2.4.8).

The corresponding numpy code is

from scipy.linalg import null_space

null_space(A)

This code returns the unit vector


 
$$\frac{-1}{\sqrt{6}} \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix},$$

which is a multiple of (1, −2, 1). scipy.linalg.null_space always returns


orthonormal vectors.

Here is a simple result that is used frequently.

A Versus At A
Let A be any matrix. The null space of A equals the null space of
At A.

If x is in the null space of A, then Ax = 0. Multiplying by At leads to


$A^t Ax = 0$, so x is in the null space of $A^t A$.

Conversely, if x is in the null space of At A, then At Ax = 0. By the dot-


product-transpose identity (2.2.7),

|Ax|2 = Ax · Ax = x · At Ax = 0,

so Ax = 0, which means x is in the null space of A.
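Here is a numerical illustration (a sketch with a random matrix; null_space returns an orthonormal basis, so we compare dimensions and check that A kills the null vectors of $A^t A$).

from numpy import *
from numpy.random import random
from scipy.linalg import null_space

# random 3x5 matrix, so the null space is nontrivial
A = random((3,5))

NA = null_space(A)
NQ = null_space(dot(A.T,A))

# same dimension, and each null vector of A^t A is a null vector of A
NA.shape[1] == NQ.shape[1], allclose(dot(A,NQ), 0)

This returns (True, True).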

An important example of linearly independent vectors are orthonormal


vectors.

Orthonormal Implies Linearly Independent

If v1 , v2 , . . . , vd are orthonormal, they are linearly independent.

To see this, suppose we have a vanishing linear combination

t1 v1 + t2 v2 + · · · + td vd = 0.

Take the dot product of both sides with v1 . Since the dot products of any
two vectors is zero, and each vector has length one, we obtain

t1 = t1 v1 · v1 = t1 v1 · v1 + t2 v2 · v1 + · · · + td vd · v1 = 0.

Similarly, all other coefficients tk are zero. This shows v1 , v2 , . . . , vd are


linearly independent.

In general, nullspace() returns a minimal set of vectors spanning the


null space of A. The nullity of A is the number of vectors returned by the
method nullspace().
For example, to compute the nullspace of the matrix
 
$$C = A^t = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix},$$

we solve Cx = 0. Since the code

from sympy import *

u = Matrix([1,2,3,4,5])

v = Matrix([6,7,8,9,10])
w = Matrix([11,12,13,14,15])

A = Matrix.hstack(u,v,w)
C = A.T

C.nullspace()

returns the list of three vectors


     
$$\begin{pmatrix} 1 \\ -2 \\ 1 \\ 0 \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} 2 \\ -3 \\ 0 \\ 1 \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} 3 \\ -4 \\ 0 \\ 0 \\ 1 \end{pmatrix},$$

here we can make three conclusions: (1) the nullspace of C is spanned by


three vectors, (2) this is the least number of vectors that spans the nullspace
of C, and (3) the nullity of C is 3.

Let u be a nonzero vector, and let

u⊥ = {v : u · v = 0} . (2.4.9)

Then u⊥ (pronounced “u-perp”), the orthogonal complement of u, is a span


and consists of all vectors orthogonal to u.
More generally, suppose S is any collection of vectors, not necessarily a
span, and let
S ⊥ = {v : u · v = 0 for all u in S} .
Then S ⊥ (pronounced “S-perp”), the orthogonal complement of S, is a span
(even if S isn’t) and consists of all vectors orthogonal to all vectors in S.
Suppose S consists of five vectors a, b, c, d, e. How do we compute S ⊥ ?
The answer is by using nullspace: Let A be the matrix with rows a, b, c, d,
e. By matrix-vector multiplication,
   
$$0 = Ax = \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix} x = \begin{pmatrix} a \cdot x \\ b \cdot x \\ c \cdot x \\ d \cdot x \\ e \cdot x \end{pmatrix}.$$

This shows x is orthogonal to a, b, c, d, e exactly when x is in the null space


of A. Thus S ⊥ equals the null space of A.

In general, if S = span(v1 , v2 , . . . , vN ), let A be the matrix with rows v1 ,


v2 , . . . , vN . Then S ⊥ equals the null space of A.

An important example of orthogonality is the relation between row space


and the null space. Suppose A has rows v1 , v2 , . . . , vN , and x is a vector, all
of the same dimension. Then, by definition, the matrix-vector product is

Ax = (v1 · x, v2 · x, . . . , vN · x).

If x is in the null space, Ax = 0, then

v1 · x = 0, v2 · x = 0, . . . , vN · x = 0,

so x is orthogonal to the rows of A. Conversely, if x is orthogonal to the rows


of A, then Ax = 0.
This shows the null space of A and the row space of A are orthogonal
complements. Summarizing, we write

Row Space and Null Space are Orthogonal

Every vector in the row space is orthogonal to every vector in the null
space,

rowspace⊥ = nullspace and nullspace⊥ = rowspace. (2.4.10)

Actually, the above paragraph only established the first identity. For the
second identity, we need to use (2.7.9), as follows
$$\mathrm{rowspace} = \left( \mathrm{rowspace}^\perp \right)^\perp = \mathrm{nullspace}^\perp.$$
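Here is a numerical check for A as in (2.3.4) (a sketch; orth and null_space return orthonormal bases of the row space and null space):

from numpy import *
from scipy.linalg import orth, null_space

u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])

R = orth(A.T)        # orthonormal basis of the row space
K = null_space(A)    # orthonormal basis of the null space

# row space vectors are orthogonal to null space vectors,
# and together the two bases fill out the source space R^3
allclose(dot(R.T,K), 0), R.shape[1] + K.shape[1] == A.shape[1]

This returns (True, True).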

Since the row space is the orthogonal complement of the null space, and
the null space of A equals the null space of At A, we conclude

A Versus At A
Let A be any matrix. Then the row space of A equals the row space
of At A.

Now replace A by At in this last result. Since the row space of At equals
the column space of A, and AAt is symmetric, we also have

A Versus AAt
Let A be any matrix. Then the column space of A equals the column
space of AAt .

Let A be a matrix and b a vector. So far we’ve met four spaces,


• the null space: all x’s satisfying Ax = 0,
• the row space: the span of the rows of A,
• the column space: the span of the columns of A,
• the solution space: the solutions x of Ax = b.
A set S of vectors is a subspace if x1 + x2 is in S whenever x1 and x2 are in
S, and tx is in S whenever x is in S. When this happens, we say S is closed
under addition and scalar multiplication: A subspace is a set of vectors closed
under addition and scalar multiplication.
Since a linear combination of linear combinations is a linear combination,
every span is a subspace. In particular, Rd is a subspace.
It’s important to realize the first three are subspaces, but the fourth is
not.
• If x1 and x2 are in the null space, and r1 and r2 are scalars, then so is
r1 x1 + r2 x2 , because

A(r1 x1 + r2 x2 ) = r1 Ax1 + r2 Ax2 = r1 0 + r2 0 = 0.

This shows the null space is a subspace. In particular, S ⊥ is a subspace


for any S.
• The row space is a span, so is a subspace.
• The column space is a span, so is a subspace.
• The solution space S of Ax = b is not a subspace, nor a span (assuming b ̸= 0): If x is in S, then Ax = b, so A(5x) = 5Ax = 5b ̸= b, so 5x is not in S.
If x1 and x2 are solutions of Ax = b, then A(x1 + x2 ) = 2b, so the solution
space is not a subspace. However

A(x1 − x2 ) = b − b = 0, (2.4.11)

so the difference x1 − x2 of any two solutions x1 and x2 is in the null space


of A, which is a span.

Let A be an N × d matrix. Then matrix multiplication by A transforms a


vector x to the vector b = Ax. Since A is N × d, x is in Rd , and Ax is in RN .

From this point of view, the source space of A is Rd , and the target space of
A is RN .

Locations of Column, Row, and Null Spaces

Let A be any matrix. The null space of A and the row space of A are
in the source space of A, and the column space of A is in the target
space of A.

Let A be a d × d invertible matrix. Then the source space is Rd and the


target space is Rd . If Ax = 0, then

x = (A−1 A)x = A−1 (Ax) = A−1 0 = 0.

This shows the null space of an invertible matrix is zero, hence the nullity is
zero.
Since the row space is the orthogonal complement of the null space, we
conclude the row space is all of Rd .
In §2.9, we see that the column rank and the row rank are equal. From
this, we see also the column space is all of Rd . In summary,

Null Space of Invertible Matrix

Let A be a d×d invertible matrix. Then the null space is zero, and the
row space and column space are both Rd . In particular, the nullity is
0, and the row rank and column rank are both d.

Exercises

Exercise 2.4.1 For what condition on a, b, c do the vectors (1, a), (2, b),
(3, c) lie on a line?
Exercise 2.4.2 Let
 
  16
1 2 3 4 5 17
 
C =  6 7 8 9 10 , 18 .
x= 
11 12 13 14 15 19
20

Compute Cx in two ways, first by row times column, then as a linear combi-
nation of the columns of C.

Exercise 2.4.3 Check that the array in Figure 2.1 matches with b1 , b2 as
explained in the text, and the vectors b1 and b2 are orthogonal.

Exercise 2.4.4 Let A = (u, v, w) be as in (2.3.4) and let b = (16, 17, 18, 19, 20).
Is b in the column space of A? If yes, solve b = ru + sv + tw.

Exercise 2.4.5 Let A = (u, v, w) be as in (2.3.4) and let Q = At A. What


are the source and target spaces for A and Q? Calculate column spaces, row
spaces, and null spaces of A and Q. How are they related?

Exercise 2.4.6 Let A = (u, v, w) be as in (2.3.4) and let Q = AAt . What


are the source and target spaces for A and Q? Calculate column spaces, row
spaces, and null spaces of A and Q. How are they related?

Exercise 2.4.7 Let A(N, d) be the matrix returned by the code

from sympy import *

def col(N,j): return Matrix([ 1+i+j*N for i in range(N) ])


def A(N,d): return Matrix.hstack(*[ col(N,j) for j in range(d) ])

What are A(5, 3) and A(3, 5)? What are the source and target spaces for
A(N, d)?

Exercise 2.4.8 Calculate the column rank of the matrix A(N, d) for all N ≥
2 and all d ≥ 2. (Column rank is the length of the list columnspace returns.)

Exercise 2.4.9 What is the nullity of the matrix A(N, d) for all N ≥ 2 and
all d ≥ 2?

Exercise 2.4.10 Show directly from the definition the vectors

u = (2, ∗, ∗, ∗, ∗, ∗), v = (0, 7, ∗, ∗, ∗, ∗), w = (0, 0, 0, 1, ∗, ∗), x = (0, 0, 0, 0, 0, 3)

are linearly independent.

Exercise 2.4.11 Let a, b, c, d be the rows of the matrix


 
$$E = \begin{pmatrix} 2 & 1 & 0 & 1 & 3 & 7 \\ 0 & 7 & 7 & 2 & 0 & 5 \\ 0 & 0 & 0 & 1 & 3 & 1 \\ 0 & 0 & 0 & 0 & 0 & 3 \end{pmatrix}$$

Show directly from the definition a, b, c, d are linearly independent. A


matrix with this staircase pattern is in echelon form.

Exercise 2.4.12 Let E be the matrix in Exercise 2.4.11. Solve Ex = 0 to


obtain the nullspace of E and the nullity of E.

Exercise 2.4.13 [27] Let x, y, z be three nonzero vectors, and w = 2y −


2x + z. If z = x − y, find r and s with w = rx + sy. Which of the following
must be true?
1. span(x, y, z) = span(w, y, z),
2. span(w, z) = span(y, z),
3. span(x, z) = span(x, z, w),
4. span(x, z) = span(w, x),
5. span(w, x, y) = span(w, x, z).
Exercise 2.4.14 [27] Let a be a linear combination of u, v, w. Select the best
statement.
1. span(u, v, w) is contained or equal to span(u, v, w, a),
2. span(u, v, w) is equal to span(u, v, w, a),
3. There is no obvious relationship between span(u, v, w) and span(u, v, w, a),
4. span(u, v, w) is not equal to span(u, v, w, a).

2.5 Zero Variance Directions

Let x1 , x2 , . . . , xN be a dataset in Rd . Then x1 , x2 , . . . , xN are N points in


Rd , and each x has d features, x = (t1 , t2 , . . . , td ). From §1.5, the mean is
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}.$$
Center the dataset (see §1.3)

v1 = x1 − µ, v2 = x2 − µ, . . . , vN = xN − µ,

and let A be the dataset matrix with rows v1 , v2 , . . . , vN . By (2.2.9), the


variance is
$$Q = \frac{v_1 \otimes v_1 + v_2 \otimes v_2 + \cdots + v_N \otimes v_N}{N} = \frac{1}{N} A^t A. \qquad (2.5.1)$$
If u is a unit vector, the projection of the centered dataset onto the line
through u results in the reduced dataset

v1 · u, v2 · u, . . . , vN · u.

This reduced dataset is centered, and, by (2.2.9), its variance is

$$q = \frac{(v_1 \cdot u)^2 + (v_2 \cdot u)^2 + \cdots + (v_N \cdot u)^2}{N} = \frac{1}{N}\, u^t A^t A u = u \cdot Qu. \qquad (2.5.2)$$
This establishes the following result, first stated in §1.5.

Variance of Reduced Dataset


Let Q be the variance matrix of a dataset and let u be a unit vector.
Then the variance of the reduced dataset onto the line through the
vector u equals the quadratic function u · Qu.
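Here is a minimal numerical check of this result with random data (the unit vector u is an arbitrary choice):

from numpy import *
from numpy.random import random

d, N = 2, 100
# dxN array, columns are the sample points
dataset = random((d,N))
Q = cov(dataset, bias=True)

# project onto the line through the unit vector u
u = array([3,4]) / 5
reduced = dot(u, dataset)

allclose(var(reduced), dot(u, dot(Q,u)))

This returns True.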

A vector v is a zero variance direction if the reduced variance is zero,

v · Qv = 0.

We investigate zero variance directions, but first we need a definition.


Let b be a scalar and v a nonzero vector in Rd . A hyperplane orthogonal
to v is the set of points x satisfying the equation

v · x + b = 0.

In R3 , a hyperplane is a plane, in R2 , a hyperplane is a line, and in R, a


hyperplane is a point, a threshold. In general, in Rd , a hyperplane is (d−1)-
dimensional, always one less than the ambient dimension. When b = 0, the
hyperplane orthogonal to v equals v ⊥ (2.4.9).
The hyperplane passes through a point µ if

v · µ + b = 0.

By subtracting the last two equations, the equation of a hyperplane orthog-


onal to v and passing through µ may be written

v · (x − µ) = 0.

Zero Variance Directions

Let µ and Q be the mean and variance of a dataset in Rd . Then


v · Qv = 0 is the same as saying every point in the dataset lies in the
hyperplane passing through µ and orthogonal to v,

v · (x − µ) = 0.

This is easy to see. Let the dataset be x1 , x2 , . . . , xN , and center it to v1 ,


v2 , . . . , vN . If v · Qv = 0, then, by (2.5.2), vk · v = 0 for k = 1, 2, . . . , N . This
shows v · (xk − µ) = 0, k = 1, 2, . . . , N , which means the points x1 , x2 , . . . ,
xN lie on the hyperplane v · (x − µ) = 0. Here are some examples.
In two dimensions R2 , a line is determined by a point on the line and a
vector orthogonal to the line. If v = (a, b) is the vector orthogonal to the line
and (x0 , y0 ), (x, y) are points on the line, then (x, y) − (x0 , y0 ) is orthogonal
to v, or
(a, b) · ((x, y) − (x0 , y0 )) = 0.

Writing this out, the equation of the line is

a(x − x0 ) + b(y − y0 ) = 0, or ax + by = c,

where c = ax0 + by0 .


If the mean and variance of a dataset are µ = (2, 3) and
 
$$Q = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix},$$

and v = (1, 1), then Qv = 0, so v · Qv = 0. Since the line x + y = 5 passes


through the mean, the dataset lies on this line. We conclude this dataset is
one-dimensional.
If
$$Q = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}$$
and v = (x, y), then
$$v \cdot Qv = 3x^2 + y^2,$$
so v · Qv is never zero unless v = 0. In this case, we conclude the dataset is
two-dimensional, because it does not lie on a line.
In three dimensions R3 , a plane is determined by a point (x0 , y0 , z0 ) and
a vector v = (a, b, c). The point is in the plane, and the vector is orthogonal
to the plane. If (x, y, z) is any point in the plane, then (x, y, z) − (x0 , y0 , z0 )
is orthogonal to v, so the equation of the plane is

(a, b, c) · ((x, y, z) − (x0 , y0 , z0 )) = 0, or ax + by + cz = d,

where d = ax0 + by0 + cz0 .


Suppose we have a dataset in R3 with mean µ = (3, 2, 1), and variance
 
$$Q = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}. \qquad (2.5.3)$$

Let v = (2, −1, −1). Then Qv = 0, so v · Qv = 0. We conclude the dataset


lies in the plane

$$(2, -1, -1) \cdot ((x, y, z) - (3, 2, 1)) = 0, \quad \text{or} \quad 2x - y - z = 3.$$

In this case, the dataset is two-dimensional, as it lies in a plane.


If a dataset has variance the 3 × 3 identity matrix I, then v · Iv is never
zero unless v = 0. Such a dataset is three-dimensional, it does not lie in a
plane.
Sometimes there may be several zero variance directions. For example, for
the variance (2.5.3) and u = (2, −1, −1), v = (0, 1, −1), we have both

u · Qu = 0 and v · Qv = 0.

From this we see the dataset corresponding to this Q lies in two planes: The
plane orthogonal to u, and the plane orthogonal v. But the intersection of
two planes is a line, so this dataset lies in a line, which means it is one-
dimensional.
Which line does this dataset lie in? Well, the line has to pass through the
mean, and is orthogonal to u and v. If we find a vector b satisfying b · u = 0
and b · v = 0, then the line will pass through the mean and will be parallel to
b. But we know how to find such a vector. Let A be the matrix with rows u, v.
Then b in the nullspace of A fulfills the requirements. We obtain b = (1, 1, 1).

Let v1 , v2 , . . . , vN be a centered dataset of vectors in Rd , and let Q be


the variance matrix of the dataset. If v is in the nullspace of Q, then Qv = 0,
so v · Qv = 0. This shows every vector in the nullspace is a zero variance
direction. What is less clear is that this works in the other direction.

Zero Variance Directions and Nullspace I

Let Q be a variance matrix. Then the null space of Q equals the zero
variance directions of Q.

To see this, we use the quadratic equation from high school. If Q is sym-
metric, then u · Qv = v · Qu. For t scalar and u, v vectors, since Q ≥ 0, the
function
(v + tu) · Q(v + tu)
is nonnegative for all t scalar. Expanding this function into powers of t, we
see
$$t^2\, u \cdot Qu + 2t\, u \cdot Qv + v \cdot Qv = at^2 + 2bt + c$$
is nonnegative for all t scalar. Thus the parabola at2 + 2bt + c intersects the
horizontal axis in at most one root. This implies the discriminant b2 − ac is
not positive, b2 − ac ≤ 0, which yields

(u · Qv)2 ≤ (u · Qu) (v · Qv). (2.5.4)

Now we can derive the result. If v is a zero variance direction, then v ·Qv =
0. By (2.5.4), this implies u · Qv = 0 for all u, so Qv = 0, so v is in the null
space of Q. This derivation is valid for any nonnegative matrix Q, not just
variance matrices. Later (§3.2) we will see every nonnegative matrix is the
variance matrix of a dataset.

Based on the above result, here is code that returns zero variance direc-
tions.

from numpy import *


from scipy.linalg import null_space
from numpy.random import random

N, d = 20, 2
# dxN array
dataset = array([ [random() for _ in range(N)] for _ in range(d) ])

def zero_variance(dataset):
    Q = cov(dataset)
    return null_space(Q)

zero_variance(dataset)

Let A be an N ×d dataset matrix, and let Q be the variance of the dataset.


By (2.2.14), Q = At A/N if the dataset is centered. Then the null space of Q
equals the null space of At A, which equals the null space of A. We conclude

Zero Variance Directions and Nullspace II

Let Q be a variance matrix of a centered dataset A. Then the null


space of A equals the zero variance directions of Q.

Suppose the dataset is

(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15), (16, 17, 18, 19, 20).

This is four vectors in R5. Since it is only four vectors, it is at most a four-dimensional dataset. Here the centered vectors are all multiples of (1, 1, 1, 1, 1), so the code zero_variance returns four orthonormal vectors, each orthogonal to (1, 1, 1, 1, 1). Thus this dataset is orthogonal to four directions, hence lies in the intersection of four hyperplanes. Each hyperplane is one condition, so each hyperplane cuts the dimension down by one, so the dimension of this dataset is 5 − 4 = 1: indeed, the four points lie on a single line, since consecutive points differ by (5, 5, 5, 5, 5). Dimension of a dataset is discussed further in §2.9.

2.6 Pseudo-Inverse

What is the pseudo-inverse? In §2.3, we used both the inverse and the pseudo-
inverse to solve Ax = b, but we didn’t explain the framework behind them.
It turns out the framework is best understood geometrically.

Think of b and Ax as points, and measure the distance between them, and
think of x and the origin 0 as points, and measure the distance between them
(Figure 2.2).


Fig. 2.2 The points 0, x, Ax, and b.

If Ax = b is solvable, then, among all solutions x∗ , select the solution x+


closest to 0.
More generally, if Ax = b is not solvable, select the points x∗ so that Ax∗
is closest to b, then, among all such x∗ , select the point x+ closest to the
origin (this is “closest twice”).
Even though the point x+ may not solve Ax = b, this procedure results
in a uniquely determined x+ : While there may be several points x∗ , there is
only one x+ . Figure 2.3 summarizes the situation for a 2 × 2 matrix A with
rank(A) = 1.


Fig. 2.3 The points x, Ax, the points x∗ , Ax∗ , and the point x+ .

The results in this section are as follows. Let A be any matrix. There is a
unique matrix A+ — the pseudo-inverse of A — with the following properties.
• the linear system Ax = b is solvable exactly when b = AA+ b.
• x+ = A+ b is a solution of
1. the linear system Ax = b, if Ax = b is solvable.
2. the regression equation At Ax = At b, always.
• In either case,

1. there is exactly one solution x∗ with minimum norm.


2. Among all solutions, x+ has minimum norm.
3. Every other solution is x∗ = x+ + v for some v in the null space of A.

Key concepts in this section are the residual

|Ax − b|2 (2.6.1)

and the regression equation

At Ax = At b. (2.6.2)

The following is clear.

Zero Residual

x is a solution of (2.3.1) iff the residual is zero.

For A as in (2.3.4) and b = (−9, −3, 3, 9, 10), the linear system Ax = b is

x + 6y + 11z = −9
2x + 7y + 12z = −3
3x + 8y + 13z = 3 (2.6.3)
4x + 9y + 14z = 9
5x + 10y + 15z = 10

and the regression equation At Ax = At b is

11x + 26y + 41z = 16


13x + 33y + 53z = 13 (2.6.4)
41x + 106y + 171z = 36.

Let b be any vector, not necessarily in the column space of A. To see how
close we can get to solving (2.3.1), we minimize the residual (2.6.1). We say
x∗ is a residual minimizer if

$$|Ax^* - b|^2 = \min_x |Ax - b|^2. \qquad (2.6.5)$$

A residual minimizer always exists.

Existence of Residual Minimizer


There is a residual minimizer x∗ in the row space of A.

The derivation of this technical result is in §4.5, see (4.5.15), (4.5.16).

Regression Equation

x∗ is a residual minimizer iff x∗ solves the regression equation.

To see this, let v be any vector, and t a scalar. Insert x = x∗ + tv into the
residual and expand in powers of t to obtain

|Ax − b|2 = |Ax∗ − b|2 + 2t(Ax∗ − b) · Av + t2 |Av|2 = f (t).

If x∗ is a residual minimizer, then f (t) is minimized when t = 0. But a


parabola
f (t) = a + 2bt + ct2
is minimized at t = 0 only when b = 0. Thus the linear coefficient b vanishes,
(Ax∗ − b) · Av = 0. This implies

At (Ax∗ − b) · v = (Ax∗ − b) · Av = 0.

Since v is any vector, this implies

At (Ax∗ − b) = 0,

which is the regression equation. Conversely, if the regression equation holds,


then the linear coefficient in the parabola f (t) vanishes, so t = 0 is a mini-
mum, establishing that x∗ is a residual minimizer.

If x1 and x2 are solutions of the regression equation, then

At A(x1 − x2 ) = At Ax1 − At Ax2 = At b − At b = 0,

so x1 − x2 is in the null space of At A. But from §2.4, the nullspace of At A


equals the nullspace of A. We conclude x1 − x2 is in the null space of A. This
establishes

Multiple Solutions

Any two residual minimizers differ by a vector in the nullspace of A.

We say x+ is a minimum norm residual minimizer if x+ is a residual


minimizer and
|x+ |2 ≤ |x∗ |2
for any residual minimizer x∗ .
Since any two residual minimizers differ by a vector in the null space of A,
x+ is a minimum norm residual minimizer if x+ is a residual minimizer and

|x+ |2 ≤ |x+ + v|2

for any v in the null space of A.

Minimum Norm Residual Minimizer


Let x∗ be a residual minimizer. Then x∗ is a minimum norm residual
minimizer iff x∗ is in the row space of A.

Since we know from above there is a residual minimizer in the row space
of A, we always have a minimum norm residual minimizer.
Let v be in the null space of A, and write

|x∗ + v|2 = |x∗ |2 + 2x∗ · v + |v|2 .

This shows x∗ is a minimum norm solution of the regression equation iff

2x∗ · v + |v|2 ≥ 0. (2.6.6)

If x∗ is in the row space of A, then x∗ · v = 0, so (2.6.6) is valid.


Conversely, if (2.6.6) is valid for every v in the null space of A, replacing
v by tv yields
2tx∗ · v + t2 |v|2 ≥ 0.
Dividing by t and inserting t = 0 yields

x∗ · v ≥ 0.

Since both ±v are in the null space of A, this implies ±x∗ · v ≥ 0, hence
x∗ · v = 0. Since the row space is the orthogonal complement of the null
space, the result follows.

Now we use this to show

Uniqueness

There is exactly one minimum norm residual minimizer x+ .

If $x_1^+$ and $x_2^+$ are minimum norm residual minimizers, then $v = x_1^+ - x_2^+$ is both in the row space and in the null space of A, so $x_1^+ - x_2^+ = 0$. Hence $x_1^+ = x_2^+$.
Putting the above all together, each vector b leads to a unique $x^+$. Defining $A^+$ by setting
$$x^+ = A^+ b,$$
we obtain $A^+$, the pseudo-inverse of A.
Notice if A is, for example, 5 × 4, then Ax = b implies x is a 4-vector and
b is a 5-vector. Then from x = A+ b, it follows A+ is 4 × 5. Thus the shape of
A+ equals the shape of At .
Summarizing what we have so far,

Regression Equation is Always Solvable

The regression equation (2.6.2) is always solvable. The solution of


minimum norm is x+ = A+ b. Any other solution differs by a vector
in the null space of A.

For A as in (2.3.4) and b = (−9, −3, 3, 9, 10),


 
$$x^+ = A^+ b = \frac{1}{15} \begin{pmatrix} 82 \\ 25 \\ -32 \end{pmatrix}$$

is the minimum norm solution of the regression equation (2.6.4).
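In numpy, the same minimum norm solution can be computed directly (a sketch; lstsq computes a least squares solution, and for this A it should agree with the pseudo-inverse solution):

from numpy import *
from numpy.linalg import pinv, lstsq

u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])
b = array([-9,-3,3,9,10])

xplus = dot(pinv(A), b)
xlsq = lstsq(A, b, rcond=None)[0]

xplus, allclose(xplus, xlsq)

Here xplus equals (82, 25, −32)/15, matching the display above, and the comparison returns True.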

Returning to the linear system (2.3.1), we show

Linear System Versus Regression Equation

If the linear system is solvable, then every solution of the regression


equation is a solution of the linear system, and vice-versa.

We know any two solutions of the linear system (2.3.1) differ by a vector in
the null space of A (2.4.11), and any two solutions of the regression equation
(2.6.2) differ by a vector in the null space of A (above).

If x is a solution of (2.3.1), then, by multiplying by At , x is a solution of


the regression equation (2.6.2). Since x+ = A+ b is a solution of the regression
equation, x+ = x + v for some v in the null space of A, so

Ax+ = A(x + v) = Ax + Av = b + 0 = b.

This shows x+ is a solution of the linear system. Since all other solutions
differ by a vector v in the null space of A, this establishes the result.
Now we can state when Ax = b is solvable,

Solvability of Ax = b

The linear system Ax = b is solvable iff b = AA+ b. When this happens,


x+ = A+ b is the solution of minimum norm.

If (2.3.1) is solvable, then from above, x+ is a solution, so

AA+ b = A(A+ b) = Ax+ = b.

Conversely, if AA+ b = b, then clearly x+ = A+ b is a solution of (2.3.1).


When (2.3.1) is solvable, (2.3.1) and (2.6.2) have the same solutions, so
x+ is the minimum norm solution of (2.3.1).
For example, let b = (−9, −3, 3, 9, 10), and let A be as in (2.3.4). Since
 
$$AA^+ b = \begin{pmatrix} -8 \\ -3 \\ 2 \\ 7 \\ 12 \end{pmatrix} \qquad (2.6.7)$$

is not equal to b, the linear system (2.6.3) is not solvable.

Suppose A is invertible. Then (2.3.1) has only the solution x = A−1 b, so


$A^{-1}b$ is the minimum norm residual minimizer. We conclude

Inverse Equals Pseudo-Inverse

If A is invertible, then A+ = A−1 .

The key properties [25] of A+ are



Properties of Pseudo-Inverse

The pseudo-inverse of A is the unique matrix A+ satisfying

A. AA+ A = A
B. A+ AA+ = A+
(2.6.8)
C. AA+ is symmetric
D. A+ A is symmetric

The verification of these properties is very enlightening, so we do it care-


fully. Let u be a vector and set b = Au. Then the residual

|Ax − b|2 = |Ax − Au|2

is minimized at x = u. Since A+ b = A+ Au is the minimum norm residual


minimizer, u and A+ Au differ by a vector v in the null space of A,

u = A+ Au + v. (2.6.9)

Since Av = 0, multiplying by A leads to

Au = AA+ Au.

Since u was any vector, this yields A.


Now let w be a vector and set u = A+ w. Inserting into (2.6.9) yields

A+ w = A+ AA+ w + v

for some v in the null space of A. But both A+ w and A+ AA+ w are in the
row space of A, hence so is v. Since v is in both the null space and the row
space, v is orthogonal to itself, so v = 0. This implies A+ AA+ w = A+ w.
Since w was any vector, we obtain B.
Since A+ b solves the regression equation, At AA+ b = At b for any vector b.
Hence At AA+ = At . With P = AA+ ,

P t P = (AA+ )t (AA+ ) = (A+ )t At AA+ = (A+ )t At = P t .

Since the left side is symmetric, so is P t . Hence P is symmetric, obtaining


C.
For any vector x,

A(x − A+ Ax) = Ax − AA+ Ax = 0,

so x − A+ Ax is in the null space of A. For any y, A+ Ay is in the row space


of A. Since the row space and the null space are orthogonal,

(x − A+ Ax) · A+ Ay = 0.

Let P = A+ A. This implies

x · P y = P x · P y = x · P tP y

Since this is true for any vectors x and y, P = P t P . This shows P = A+ A is


symmetric, obtaining D.
Having arrived at A, B, C, D, the reasoning is reversible: It can be shown
any matrix A+ satisfying A, B, C, D must equal the pseudo-inverse.
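As a numerical sanity check of A, B, C, D (a sketch with a random matrix of arbitrary shape):

from numpy import *
from numpy.random import random
from numpy.linalg import pinv

A = random((5,3))
P = pinv(A)

allclose(dot(A, dot(P,A)), A)     # property A
allclose(dot(P, dot(A,P)), P)     # property B
allclose(dot(A,P), dot(A,P).T)    # property C
allclose(dot(P,A), dot(P,A).T)    # property D

All four return True.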

Also we have

Pseudo-Inverse and Transpose

If U has orthonormal columns or orthonormal rows, then U + = U t .

From (2.2.8), such a matrix U satisfies U U t = I or U t U = I. In either


case, A, B, C, D are immediate consequences.

Exercises

Exercise 2.6.1 Let A be the 1 × 3 matrix (1, 2, 3). What is A+ ?

Exercise 2.6.2 Let A(N, d) be as in Exercise 2.4.7, and let A = A(6, 4).
Let b = (1, 1, 1, 1, 1, 1). Write out Ax = b as a linear system. How many
equations, how many unknowns?

Exercise 2.6.3 With A and b as in Exercise 2.6.2, is Ax = b solvable? If so,


provide a solution.

Exercise 2.6.4 Continuing with the same A and b, write out the correspond-
ing regression equation. How many equations, how many unknowns?

Exercise 2.6.5 With A and b as in Exercise 2.6.2, is the regression equation


solvable? If so, provide a solution.

Exercise 2.6.6 With A and b as in Exercise 2.6.2, what is the minimum


norm residual minimizer x+ ?

Exercise 2.6.7 Let µ be a unit vector, and let Q = I − µ ⊗ µ. Use (2.6.8)


and Exercise 2.2.2 to show Q+ = Q.

Exercise 2.6.8 Use (2.6.8) to show the transpose of the pseudo-inverse is


the pseudo-inverse of the transpose,

(At )+ = (A+ )t .

Exercise 2.6.9 Let Q be symmetric, Qt = Q. Show Q+ is symmetric.


Exercise 2.6.10 Let Q be symmetric, Qt = Q. Show Q and Q+ commute,

QQ+ = Q+ Q.

Exercise 2.6.11 Let A be any matrix. Then the null space of A equals the
null space of A+ A. Use (2.6.8).
Exercise 2.6.12 Let A be any matrix. Then the row space of A equals the
row space of A+ A.
Exercise 2.6.13 Let A be any matrix. Then the column space of A equals
the column space of AA+ .

2.7 Projections

In this section, we study projection matrices P, and we show (a numerical check of the first two claims is sketched below)

• P = AA+ is the projection matrix onto the column space of A,
• P = A+ A is the projection matrix onto the row space of A,
• P = I − A+ A is the projection matrix onto the null space of A.
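Here is that check, for A as in (2.3.4) (a sketch; the projection properties $P^2 = P$ and $P^t = P$ used in the comments are established later in this section):

from numpy import *
from numpy.linalg import pinv

u = array([1,2,3,4,5])
v = array([6,7,8,9,10])
w = array([11,12,13,14,15])
A = column_stack([u,v,w])
Ap = pinv(A)

P = dot(A,Ap)     # claimed projection onto the column space
R = dot(Ap,A)     # claimed projection onto the row space

# P and R are symmetric, square to themselves, and P fixes the columns of A
allclose(dot(P,P), P), allclose(P, P.T), allclose(dot(P,A), A)
allclose(dot(R,R), R), allclose(R, R.T)

All of these return True.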

Let u be a unit vector, and let b be any vector. Let span(u) be the line
through u (Figure 2.4). The projection of b onto span(u) is the vector v in
span(u) that is closest to b.
It turns out this closest vector v equals P b for some matrix P , the projec-
tion matrix. Since span(u) is a line, the projected vector P b is a multiple tu
of u.
From Figure 2.4, b − P b is orthogonal to u, so

0 = (b − P b) · u = b · u − P b · u = b · u − t u · u = b · u − t.

Solving for t, this implies t = b · u. Thus

P b = (b · u)u = (u ⊗ u)b. (2.7.1)

Notice P b = b when b is already on the line through u. In other words,


the projection of a vector onto a line equals the vector itself when the vector

is already on the line. If U is the matrix with the single column u, we obtain
P = U U t.
To summarize, the projected vector is the vector (b · u)u, and the reduced
vector is the scalar b · u. If U is the matrix with the single column u, then the
reduced vector is U t b and the projected vector is U U t b.

Fig. 2.4 Projecting onto a line.
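Here is a minimal sketch of this computation in code, using the outer product u ⊗ u; the vectors u and b below are illustrative choices.

from numpy import *

# projecting b onto the line through the unit vector u
# u and b are illustrative choices
u = array([1,2,2]) / 3             # unit vector
b = array([3.,0,6])

P = outer(u,u)                     # projection matrix u (outer) u
t = dot(b,u)                       # reduced vector (a scalar)
print( allclose(dot(P,b), t*u) )   # projected vector P b = (b.u) u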

Now we project onto a plane. Let u, v be an orthonormal pair of vectors,


so u · v = 0, u · u = 1 = v · v. We project a vector b onto span(u, v). As before,
there is a matrix P , the projection matrix, such that the projection of b onto
the plane equals P b. Then b − P b is orthogonal to the plane (Figure 2.5),
which means b − P b satisfies

(b − P b) · u = 0 and (b − P b) · v = 0.

Since P b lies in the plane, P b = ru + sv is a linear combination of u and v.


Inserting P b = ru + sv, we obtain

r = b · u, s = b · v.

If U is the matrix with columns u, v, by (2.2.9), this yields,

P b = (b · u)u + (b · v)v = (u ⊗ u + v ⊗ v)b = U U t b,

As before, here also the projection matrix is P = U U t .


Notice P b = b when b is already in the plane. In other words, the projection
of a vector onto a plane equals the vector itself when the vector is already in
the plane.
To summarize, here the projected vector is the vector U U t b = (b · u)u +
(b · v)v, and the reduced vector is the vector U t b = (b · u, b · v). The projected
vector has the same dimension as the original vector, and the reduced vector
has only two components.

Fig. 2.5 Projecting onto a plane, P b = ru + sv.

We define projection matrices in general. Let S be a span. A matrix P is


the projection matrix onto S if
1. P v is in S for any vector v,
2. P v = v if v is in S,
3. v − P v is orthogonal to S for any vector v.
We say the projection matrix onto S because there is only one such matrix
corresponding to a given S, see Exercise 2.7.10.
Here is a characterization without mentioning S. A matrix P is a projection
matrix if
1. P 2 = P ,
2. P t = P .
What is the relation between these two versions? We show they are the same.

Characterization of Projections

If P is the projection matrix onto a span S, then P is a projection


matrix. Conversely, if P is a projection matrix, then P is the projection
matrix onto the column space S of P .

To prove this, suppose P is the projection matrix onto some span S. For
any v, by 1., P v is in S. By 2., P (P v) = P v. Hence P 2 = P . Also, for any u
and v, P v is in S, and u − P u is orthogonal to S. Hence

(u − P u) · P v = 0

which implies
u · P v = (P u) · (P v).
Switching u and v,
v · P u = (P v) · (P u),
Hence
u · (P v) = (P u) · v,
which implies P t = P .
For the other direction, suppose P is a projection matrix, and let S be the
column space of P . Then a vector x is in S iff x is of the form x = P v. This
establishes 1. above. Since

P x = P (P v) = P 2 v = P v = x,

this establishes 2. above. Similarly, P t = P implies 3. above.
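This characterization is easy to test in code. The sketch below checks the two identities for a candidate matrix; the matrices used are illustrative choices.

from numpy import *

# check the two defining identities of a projection matrix
def is_projection(P):
    return allclose(dot(P,P), P) and allclose(P.T, P)

u = array([1.,0,0])
print( is_projection(outer(u,u)) )             # True: projection onto a line
print( is_projection(array([[1.,1],[0,1]])) )  # False: not symmetric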

Projection Onto Column Space

Let A be any matrix. Then the projection matrix onto the column
space of A is
P = AA+ . (2.7.2)

To see this, let P = AA+ . By (2.6.8),

P 2 = AA+ AA+ = (AA+ A)A+ = AA+ = P,

and P is symmetric. Hence P is a projection matrix. By the previous result,


P is the projection matrix onto the column space of P = AA+ . But by
Exercise 2.6.13, the column spaces of A and of P agree. Thus P is the
projection matrix onto the column space of A.

Now let x = A+ b. Then Ax = AA+ b = P b is the projection of b onto


the column space of A. If the columns of A are v1 , v2 , . . . , vd , and x =
(t1 , t2 , . . . , td ), then by matrix-vector multiplication,

P b = t1 v1 + t2 v2 + · · · + td vd .

Since the reduced vector x consists of the coefficients when writing P b as a


linear combination of the columns of A, this shows A+ b is the reduced vector.

from numpy import *
from numpy.linalg import pinv

# projection of column vector b
# onto column space of A
# assume len(b) == len(A.T)

def project(A,b):
    Aplus = pinv(A)
    x = dot(Aplus,b)     # reduced
    return dot(A,x)      # projected

Projected and Reduced Vectors

Let A be a matrix and b a vector, and project onto the column space
of A. Then the projected vector is P b = AA+ b and the reduced vector
is x = A+ b.

For A as in (2.3.4) and b = (−9, −3, 3, 9, 10) the reduced vector onto the
column space of A is
x = A+ b = (1/15)(82, 25, −32),
and the projected vector onto the column space of A is

P b = Ax = AA+ b = (−8, −3, 2, 7, 12).

The projection matrix onto the column space of A is

P = AA+ = (1/10) ×
[  6  4  2  0 −2 ]
[  4  3  2  1  0 ]
[  2  2  2  2  2 ]
[  0  1  2  3  4 ]
[ −2  0  2  4  6 ]

In the same way, one can show

Projection Onto Row Space

The projection matrix onto the row space of A is

P = A+ A. (2.7.3)

For A as in (2.3.4), the projection matrix onto the row space is

P = A+ A = (1/6) ×
[  5  2 −1 ]
[  2  2  2 ]
[ −1  2  5 ]

When the columns of a matrix U are orthonormal, in the previous section


we saw U + = U t , so we have

Projection onto Orthonormal Vectors

If the columns of U are orthonormal, the projection matrix onto the


column space of U is
P = UUt (2.7.4)

Here the projected vector is U U t b, and the reduced vector is U t b. The


code here is

from numpy import *

# projection of column vector b
# onto column space of U
# with orthonormal columns
# assume len(b) == len(U)

def project_to_ortho(U,b):
    x = dot(U.T,b)       # reduced
    return dot(U,x)      # projected

Let v1 , v2 , . . . , vN be a dataset in Rd , and let U be a d × n matrix with


orthonormal columns. Then the projection matrix onto the column space of
U is P = U U t , and P is the projection onto an orthonormal span.
In this case, the dataset U t v1 , U t v2 , . . . , U t vN is the reduced dataset, and
U U t v1 , U U t v2 , . . . , U U t vN is the projected dataset.
The projected dataset is in Rd , and the reduced dataset is in Rn . Table
2.6 summarizes the relationships.

dataset       vk         in Rd ,   k = 1, 2, . . . , N
reduced       U t vk     in Rn ,   k = 1, 2, . . . , N
projected     U U t vk   in Rd ,   k = 1, 2, . . . , N

Table 2.6 Dataset, reduced dataset, and projected dataset, n < d.

from numpy import *
from numpy.linalg import pinv

# projection of dataset
# onto column space of A

# Aplus = A.T       # orthonormal columns
Aplus = pinv(A)     # any matrix

reduced = array([ dot(Aplus,v) for v in dataset ])
projected = array([ dot(A,x) for x in reduced ])

Let S and T be spans. Let S + T consist of all sums of vectors u + v with


u in S and v in T . Then a moment’s thought shows S + T is itself a span.
When the intersection of S and T is the zero vector, we write S ⊕ T , and we
say S ⊕ T is the direct sum of S and T .
Let S be a span and let S ⊥ consist of all vectors orthogonal to S. We call S ⊥ the orthogonal complement of S. This is pronounced “S-perp”. If v is in both S and in S ⊥ , then v is orthogonal to itself, hence v = 0. From this, we see S + S ⊥ is a direct sum S ⊕ S ⊥ .

Direct Sum and Orthogonal Complement I

If S is a span in Rd , then

Rd = S ⊕ S ⊥ . (2.7.5)

This is an immediate consequence of what we already know. Let P be the


projection matrix onto S. Since any vector v in Rd may be written

v = P v + (v − P v),

we see any vector is a sum of a vector in S and a vector in S ⊥ .
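As a sketch, the decomposition v = P v + (v − P v) can be checked numerically with P = AA+ ; the matrix A and the vector v below are illustrative choices.

from numpy import *
from numpy.linalg import pinv

# split v into a piece in S (column space of A) and a piece in S-perp
A = array([[1.,0],[0,1],[0,0]])
v = array([1.,2,3])

P = dot(A, pinv(A))
s = dot(P,v)                     # component in S
t = v - s                        # component in S-perp
print( allclose(dot(s,t), 0) )   # the two pieces are orthogonal
print( allclose(s + t, v) )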


Let S be the span of a dataset x1 , x2 , . . . , xN . If S does not equal Rd ,
then there is a nonzero vector in S ⊥ . This shows

Direct Sum and Orthogonal Complement II

If a dataset spans the feature space and v is orthogonal to the dataset,


then v = 0. If v is not zero and is orthogonal to the dataset, then the
dataset does not span the feature space.

Another way of saying the same thing: A vector v is orthogonal to the


whole space iff v is zero.

An important example of (2.7.5) is the relation between the row space and
the null space of a matrix. In §2.4, we saw that, for any matrix A, the row
space and the null space are orthogonal complements.
Taking S = nullspace in (2.7.5), we have the important

Null space plus Row Space Equals Source Space

If A is an N × d matrix,

nullspace ⊕ rowspace = Rd , (2.7.6)

and the null space and row space are orthogonal to each other.

From this,

Projection Onto Null Space

The projection matrix onto the null space of A is

P = I − A+ A. (2.7.7)

For A as in (2.3.4), the projection matrix onto the null space is

P = I − A+ A = (1/6) ×
[  1 −2  1 ]
[ −2  4 −2 ]
[  1 −2  1 ]
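As a sketch, for any matrix A one can check in code that P = I − A+ A sends every vector into the null space of A; the matrix A and the vector x below are illustrative choices.

from numpy import *
from numpy.linalg import pinv

# P = I - A+ A projects onto the null space of A, so A(Px) = 0
A = array([[1.,2,3],[4,5,6]])
P = eye(3) - dot(pinv(A), A)

x = array([1.,1,1])
print( allclose(dot(A, dot(P,x)), 0) )   # True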

The result (2.7.6) can be written as

Row Rank plus Nullity equals Source Space Dimension

For any matrix, the row rank plus the nullity equals the dimension of
the source space. If the matrix is N × d, r is the rank, and n is the
nullity, then
r + n = d.

Let S be the column space of a matrix A, and let P be the projection


matrix onto S. We end the section by establishing the claim made at the
start of the section, that P b is the point in S that is closest to b.
Since every point in S is of the form Ax, we need to check

|P b − b|² = min_x |Ax − b|².

But this was already done in §2.3, since P b = AA+ b = Ax+ where x+ = A+ b
is a residual minimizer.

Projection is the Nearest Point in the Span

Let P b = AA+ b be the projection of b onto the column space of A,


and let x+ = A+ b be the reduced vector. Then

|Ax+ − b|² = min_x |Ax − b|². (2.7.8)

Exercises

Exercise 2.7.1 Let A be a 7 × 12 matrix. What is the greatest the rank of


A can be? What is the least the rank of A can be? What if A is 12 × 7?

Exercise 2.7.2 Let A be a 7 × 12 matrix. What is the greatest the nullity


of A can be? What is the least the nullity of A can be? What if A is 12 × 7?

Exercise 2.7.3 Let A be a matrix and let u1 , u2 , . . . , ur be an orthonormal


basis for the column space of A. Show that the projection onto the column
space of A is
P = u1 ⊗ u1 + u2 ⊗ u2 + · · · + ur ⊗ ur .

Exercise 2.7.4 Let P be the projection matrix onto the column space of a
matrix A. Use Exercise 2.7.3 to show trace(P ) equals the rank of A.

Exercise 2.7.5 Let A be a 10 × 7 matrix and let Q = At A. Then Q is 7 × 7.


If the row rank of A is 5, what is the row rank of Q?

Exercise 2.7.6 Let A be the dataset matrix of the centered MNIST dataset,
so the shape of A is 60000 × 784. Using Exercise 2.7.4, show the rank of A
is 712.

Exercise 2.7.7 If µ is a unit vector, then P = I − µ ⊗ µ is a projection.



Exercise 2.7.8 If µ and ν are orthogonal unit vectors, then P = I − µ ⊗ µ −


ν ⊗ ν is a projection.

Exercise 2.7.9 Let S be a span, and let P be the projection matrix onto S.
Use P to show
(S ⊥ )⊥ = S. (2.7.9)
(S ⊂ (S ⊥ )⊥ is easy. For S ⊃ (S ⊥ )⊥ , show |v − P v|² = 0 when v is in (S ⊥ )⊥ .)

Exercise 2.7.10 Let S be a span and suppose P and Q are both projection
matrices onto S. Show
(P − Q)2 = 0.
Conclude P = Q. Use Exercise 2.2.4.

2.8 Basis

Let S be the span of vectors v1 , v2 , . . . , vN . Then there are many other


choices of spanning vectors for S. For example, v1 + v2 , v2 , v3 , . . . , vN also
spans S.
If S cannot be spanned by fewer than N vectors, then we say v1 , v2 , . . . , vN is a basis for S, and we call N the dimension of S.
In other words, when N is the smallest number of spanning vectors, we
say N is the dimension dim S of S, and v1 , v2 , . . . , vN is a minimal spanning
set for S. This definition is important enough to repeat,

Basis and Dimension Definition


A basis for a span S is a minimal spanning set of vectors. The dimen-
sion of S is the number of vectors in any basis for S.

To clarify this definition, suppose someone asks “Who is the shortest per-
son in the room?” There may be several shortest people in the room, but, no
matter how many shortest people there are, there is only one shortest height.
In other words, a span may have several bases, but a span’s dimension is
uniquely determined.
When a basis v1 , v2 , . . . , vN consists of orthogonal vectors, we say v1 , v2 ,
. . . , vN is an orthogonal basis. When v1 , v2 , . . . , vN are also unit vectors, we
say v1 , v2 , . . . , vN is an orthonormal basis.
Here are two immediate consequences of this terminology.

Span of N Vectors

If S = span(v1 , v2 , . . . , vN ), then dim S ≤ N .



Larger Span has Larger Dimension

If a span S1 is contained in a span S2 , then dim S1 ≤ dim S2 .

Fig. 2.7 Relations between vector classes.

With this terminology,


• rowspace() returns a basis of the row space,
• columnspace() returns a basis of the column space,
• nullspace() returns a basis for the null space,
• row rank equals the dimension of the row space,
• column rank equals the dimension of the column space,
• nullity equals the dimension of the null space.

Let S be the span of vectors v1 , v2 , . . . , vN . How can we check if these


vectors constitute a basis for S? The answer is the main result of the section.

Spanning Plus Linearly Independent Equals Basis

Let S be the span of vectors v1 , v2 , . . . , vN . Then the vectors are a


basis for S iff they are linearly independent.

Remember, to check for linear independence of given vectors, assemble the


vectors as columns of a matrix A, and check whether A.nullspace() equals
zero. If that is the case, the vectors are a basis for their span. If not, the
vectors are not a basis for their span. The proof of the main result is at the
end of the section.

Here is an example. The columns of the 3 × 3 identity matrix I are e1 =


(1, 0, 0), e2 = (0, 1, 0), e3 = (0, 0, 1). Since the nullspace of I is zero, e1 , e2 , e3
are linearly independent. Hence the standard basis e1 , e2 , e3 is indeed a basis
for R3 , i.e. a minimal spanning set of vectors for R3 . From this, we conclude
dim R3 = 3.
The statement dim R3 = 3 may at first seem trivial or obvious. But, if
we flesh this out following our terminology above, the statement is saying
that any minimal spanning set of vectors in R3 must have exactly 3 vectors.
Stated in this manner, the statement has content.
Since we can do the same calculation with the standard basis
e1 = (1, 0, . . . , 0),
e2 = (0, 1, 0, . . . , 0),
... = ...
ed = (0, 0, . . . , 0, 1),

in Rd , we conclude e1 , e2 , . . . , ed are linearly independent, so

Dimension of Euclidean Space

The dimension of Rd is d.

The MNIST dataset consists of vectors v1 , v2 , . . . , vN in Rd , where N =


60000 and d = 784. For the MNIST dataset, the dimension is 712, as returned
by the code

from numpy import *
from numpy.linalg import matrix_rank

# dataset is Nxd array

mu = mean(dataset,axis=0)
vectors = dataset - mu

matrix_rank(vectors)

In particular, since 712 < 784, approximately 10% of pixels are never
touched by any image. For example, a likely pixel to remain untouched is
at the top left corner (0, 0). For this dataset, there are 784 − 712 = 72 zero
variance directions.
We pose the following question: What is the least n for which the first n
images are linearly dependent? Since the dimension of the feature space is
784, we must have n ≤ 784. To answer the question, we compute the rank
of the first n vectors for n = 1, 2, 3, . . . , and continue until we have linear
dependence of v1 , v2 , . . . , vn .
If we load MNIST as dataset, as in §1.2, and run the code below, we
obtain n = 560 (Figure 2.8). matrix_rank is discussed in §2.9.

from numpy import *
from numpy.linalg import matrix_rank

# dataset is Nxd array

def find_first_defect(dataset):
    d = len(dataset[0])
    previous = 0
    for n in range(len(dataset)):
        r = matrix_rank(dataset[:n+1,:])
        print((r,n+1),end=",")
        if r == previous: break
        if r == d: break
        previous = r

Fig. 2.8 First defect for MNIST.

Let v1 , v2 , . . . , vN be a dataset. We want to compute the dimensions of


the first n vectors, n = 1, 2, 3, . . . ,

d1 = dim(v1 ), d2 = dim(v1 , v2 ), d3 = dim(v1 , v2 , v3 ), and so on

This we call the dimension staircase. For example, Figure 2.9 is the di-
mension staircase for

v1 = (1, 0, 0), v2 = (0, 1, 0), v3 = (1, 1, 0), v4 = (3, 4, 0), v5 = (0, 0, 1).

In Figure 2.9, we call the points (3, 2) and (4, 2) defects.

Fig. 2.9 The dimension staircase with defects.

In the code, the staircase is drawn by stairs(X,Y), where the horizontal


points X and the vertical values Y satisfy len(X) == len(Y)+1. In Figure 2.9,
X = [1,2,3,4,5,6], and Y = [1,2,2,2,3].
With the MNIST dataset loaded as vectors, here is code returning Fig-
ures 2.9 and 2.10. This code is not efficient, but it works.
Ideally the code should be run in sympy using exact arithmetic. However,
this takes too long, so we use numpy.linalg.matrix_rank. Because datasets
consist of floats in numpy, the matrix_rank and dimensions are approximate
not exact. For more on this, see approximate rank in §3.2.

from numpy import *
from matplotlib.pyplot import *
from numpy.linalg import matrix_rank

# dataset is Nxd array

def dimension_staircase(dataset):
    N = len(dataset)
    rmax = matrix_rank(dataset)
    dimensions = [ ]
    for n in range(N):
        r = matrix_rank(dataset[:n+1,:])
        if len(dimensions) and dimensions[-1] < r: print(r," ",end="")
        dimensions.append(r)
        if r == rmax: break
    title("number of vectors = " + str(n+1) + ", rank = " + str(rmax))
    stairs(dimensions, range(1,n+3),linewidth=2,color='red')
    grid()
    show()

Fig. 2.10 The dimension staircase for the MNIST dataset.

Proof of main result. Here we derive: Let S be the span of v1 , v2 , . . . ,


vN . Then v1 , v2 , . . . , vN is a basis for S iff v1 , v2 , . . . , vN are linearly
independent.
Suppose v1 , v2 , . . . , vN are not linearly independent. Then v1 , v2 , . . . , vN
are linearly dependent, which means one of the vectors, say v1 , is a linear
combination of the other vectors v2 , v3 , . . . , vN . Then any linear combination
of v1 , v2 , . . . , vN is necessarily a linear combination of v2 , v3 , . . . , vN , thus

span(v1 , v2 , . . . , vN ) = span(v2 , v3 , . . . , vN ).

This shows v1 , v2 , . . . , vN is not a minimal spanning set, and completes the


derivation in one direction.

In the other direction, suppose v1 , v2 , . . . , vN are linearly independent,


and suppose b1 , b2 , . . . , bd is a minimal spanning set. Since b1 , b2 , . . . , bd is
minimal, we must have d ≤ N . Once we establish d = N , it follows v1 , v2 ,
. . . , vN is minimal, and the proof will be complete.
Since by assumption,

span(v1 , v2 , . . . , vN ) = span(b1 , b2 , . . . , bd ),

v1 is a linear combination of b1 , b2 , . . . , bd ,

v1 = t1 b1 + t2 b2 + · · · + td bd .

Since v1 ̸= 0, at least one of the t coefficients is not zero. By rearranging the


vectors, assume t1 ̸= 0. Then we can solve for b1 ,
b1 = (1/t1 )(v1 − t2 b2 − t3 b3 − · · · − td bd ).
This shows

span(v1 , v2 , . . . , vN ) = span(v1 , b2 , b3 , . . . , bd ).

Repeating the same logic, v2 is a linear combination of v1 , b2 , b3 , . . . , bd ,

v2 = s1 v1 + t2 b2 + t3 b3 + · · · + td bd .

If all the coefficients of b2 , b3 , . . . , bd are zero, then v2 is a multiple of v1 ,


contradicting linear independence of v1 , v2 , . . . , vN . Thus at least one of the
t coefficients is not zero. By rearranging the vectors, assume t2 ̸= 0. Then we
can solve for b2 , obtaining
b2 = (1/t2 )(v2 − s1 v1 − t3 b3 − · · · − td bd ).
This shows

span(v1 , v2 , . . . , vN ) = span(v1 , v2 , b3 , b4 , . . . , bd ).

Repeating the same logic, v3 is a linear combination of v1 , v2 , b3 , b3 , . . . ,


bd ,
v3 = s1 v1 + s2 v2 + t3 b3 + t4 b4 + · · · + td bd .
If all the coefficients of b3 , b4 , . . . , bd are zero, then v3 is a linear combination
of v1 , v2 , contradicting linear independence of v1 , v2 , . . . , vN . Thus at least
one of the t coefficients is not zero. By rearranging the vectors, assume t3 ̸= 0.
Then we can solve for b3 , obtaining
b3 = (1/t3 )(v3 − s1 v1 − s2 v2 − t4 b4 − · · · − td bd ).

This shows

span(v1 , v2 , . . . , vN ) = span(v1 , v2 , v3 , b4 , b5 , . . . , bd ).

Continuing in this manner, we eventually arrive at

span(v1 , v2 , . . . , vN ) = · · · = span(v1 , v2 , . . . , vd ).

This shows vN is a linear combination of v1 , v2 , . . . , vd . This shows N =


d, because N > d contradicts linear independence. Since d is the minimal
spanning number, this shows v1 , v2 , . . . , vN is a minimal spanning set for S.

2.9 Rank

If A is an N × d matrix, then (Figure 2.11) x 7→ Ax is a linear transformation


that sends a vector x in Rd (the source space) to the vector Ax in RN (the
target space). The transpose At goes in the reverse direction: The linear
transformation b 7→ At b sends a vector b in RN (the target space) to the
vector At b in Rd (the source space).

Fig. 2.11 A 5 × 3 matrix A is a linear transformation from R3 to R5 .

It follows that for an N × d matrix, the dimension of the source space is


d, and the dimension of the target space is N ,

dim(source space) = d, dim(target space) = N.

from sympy import *

d = A.cols # source space dimension


N = A.rows # target space dimension

By (2.4.2), the column space is in the target space, and the row space is
in the source space. Thus we always have

0 ≤ row rank ≤ d and 0 ≤ column rank ≤ N.

For A as in (2.3.4), the column rank is 2, the row rank is 2, and the nullity
is 1. Thus the column space is a 2-d plane in R5 , the row space is a 2-d plane
in R3 , and the null space is a 1-d line in R3 .

The main result in this section is

Rank Theorem
Let A be any matrix. Then

row rank(A) = column rank(A). (2.9.1)

This is established at the end of the section.


Because the row rank and the column rank are equal, below we just say
rank of a matrix, and we write rank(A). In Python,

from sympy import *

A.rank()

from numpy.linalg import matrix_rank

matrix_rank(A)

returns the rank of a matrix. The main result implies rank(A) = rank(At ),
so

Upper bound for Rank

For any N × d matrix, the rank is never greater than min(N, d).

An N ×d matrix A is full-rank if its rank is the highest it can be, rank(A) =


min(N, d). Here are some consequences of the main result.
• When N ≥ d, full-rank is the same as rank(A) = d, which is the same as
saying the columns are linearly independent and the rows span Rd .
• When N ≤ d, full-rank is the same as rank(A) = N , which is the same
as saying the rows are linearly independent and the columns span RN .
• When N = d, full-rank is the same as saying the rows are a basis of Rd ,
and the columns are a basis of RN .

When A is a square matrix, we can say more:

Full Rank Square Equals Invertible

Let A be a square matrix. Then A is full-rank iff A is invertible.

Suppose A is d×d. If A is invertible and B is its inverse, then AB = I. Since


ABx = A(Bx) = Ay with y = Bx, the column space of AB is contained in
the column space of A. Since the column space of AB = I is Rd , we conclude
the column space of A is Rd , thus rank(A) = d.
Conversely, suppose A is full-rank. This means the columns of A span Rd .
By (2.4.3), this implies
Ax = b
is solvable for any b. Let e1 , e2 , . . . , ed be the standard basis. If we set
successively b = e1 , b = e2 , . . . , b = ed , we then get solutions x1 , x2 , . . . , xd .
If B is the matrix with columns x1 , x2 , . . . , xd , then

AB = A(x1 , x2 , . . . , xd ) = (Ax1 , Ax2 , . . . , Axd ) = (e1 , e2 , . . . , ed ) = I.

Thus we found a matrix B satisfying AB = I.


Repeating the same argument with rows instead of columns, we find a
matrix C satisfying CA = I. Then

C = CI = CAB = IB = B,

so B = C is the inverse of A.

Orthonormal Rows and Columns


Let U be a matrix.
• U has orthonormal rows iff U U t = I.
• U has orthonormal columns iff U t U = I.
If U is square and either holds, then they both hold.

The first two assertions are in §2.2. For the last assertion, assume U is a
square matrix. From §2.4, orthonormality of the rows implies linear indepen-
dence of the rows, so U is full-rank. If U also is a square matrix, then U is
invertible. Multiply by U −1 ,

U −1 = U −1 I = U −1 U U t = U t .

Since we have U −1 U = I, we also have U t U = I.



A square matrix U satisfying

U U t = I = U tU (2.9.2)

is an orthogonal matrix.
Equivalently, we can say

Orthogonal Matrix

A matrix U is orthogonal iff its rows are an orthonormal basis iff its
columns are an orthonormal basis.

Since
U u · U v = u · U t U v = u · v,
U preserves dot products. Since lengths are dot products, U also preserves
lengths. Since angles are computed from dot products, U also preserves an-
gles. Summarizing,

Angles, Lengths, and Dot Products

Orthogonal Matrices Preserve Angles, Lengths, and Dot Products:

As a consequence,

Orthogonal Matrix sends ON Vectors to ON Vectors

Let U be an orthogonal matrix. If v1 , v2 , . . . , vd are orthonormal,


then U v1 , U v2 , . . . , U vd are orthonormal.
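Here is a minimal numerical sketch with a 2 × 2 rotation; the angle and the vectors are illustrative choices.

from numpy import *
from numpy.linalg import norm

# a rotation U preserves dot products and lengths
theta = 0.7
U = array([[cos(theta),-sin(theta)],[sin(theta),cos(theta)]])

u, v = array([1.,2]), array([3.,-1])
print( allclose(dot(dot(U,u), dot(U,v)), dot(u,v)) )   # dot products preserved
print( allclose(norm(dot(U,u)), norm(u)) )             # lengths preserved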

In two dimensions, d = 2, an orthogonal matrix must have two orthonormal columns, so must be of the form

U = [[cos θ, − sin θ], [sin θ, cos θ]]    or    U = [[cos θ, sin θ], [sin θ, − cos θ]].

In the first case, U is a rotation, while in the second, U is a rotation followed


by a reflection.

If u1 , u2 , . . . , ud is an orthonormal basis of Rd , and U has columns u1 ,


u2 , . . . , ud , then U is square and U U t = I = U t U . By (2.2.9), we have

I = u1 ⊗ u1 + u2 ⊗ u2 + · · · + ud ⊗ ud . (2.9.3)

Multiplying both sides by u, by (1.4.17), we obtain

Orthonormal Basis Expansion

If u1 , u2 , . . . , ud is an orthonormal basis, and u is any vector, then

u = (u · u1 )u1 + (u · u2 )u2 + · · · + (u · ud )ud (2.9.4)

and
|u|2 = (u · u1 )2 + (u · u2 )2 + · · · + (u · ud )2 . (2.9.5)
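As a sketch, the expansion (2.9.4) and the identity (2.9.5) can be checked with an orthonormal basis obtained, for example, from a QR factorization of a random matrix; the dimension d is an illustrative choice.

from numpy import *
from numpy.linalg import qr

# columns of U form an orthonormal basis of Rd
d = 4
U = qr(random.rand(d,d))[0]
u = random.rand(d)

coeffs = dot(U.T, u)                          # the dot products u . uk
print( allclose(dot(U, coeffs), u) )          # expansion (2.9.4)
print( allclose(sum(coeffs**2), dot(u,u)) )   # identity (2.9.5)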

Let x1 , x2 , . . . , xN be a dataset, and let A be the dataset matrix with rows


x1 , x2 , . . . , xN . The dataset is full-rank if A is full-rank. Since A is full-rank
iff its rows span (we assume N >> d, which means there are more samples
than features) we have

Full-Rank Dataset
A dataset x1 , x2 , . . . , xN is full-rank iff x1 , x2 , . . . , xN spans the
feature space.

The dimension or rank of the dataset is the rank of its N × d dataset


matrix A. Hence the dimension of the dataset equals the rank of At A. Since
scaling a matrix has no effect on the rank, we conclude the dimension or rank
of a dataset equals the rank of its variance Q = At A/N , (see (2.5.1)).

To derive the rank theorem, first we recall (2.7.6). Assume A has N rows
and d columns. By (2.7.6), every vector x in the source space Rd can be
written as a sum x = u + v with u in the null space, and v in the row space.
In other words, each vector x may be written as a sum x = u + v with Au = 0
and v in the row space.
From this, we have

Ax = A(u + v) = Au + Av = Av.

This shows the column space consists of vectors of the form Av with v in the
row space.
Let v1 , v2 , . . . , vr be a basis for the row space. From the previous para-
graph, it follows Av1 , Av2 , . . . , Avr spans the column space of A. We claim
Av1 , Av2 , . . . , Avr are linearly independent. To check this, we write

0 = t1 Av1 + t2 Av2 + · · · + tr Avr = A(t1 v1 + t2 v2 + · · · + tr vr ).

If v is the vector t1 v1 +t2 v2 +· · ·+tr vr , this shows v is in the null space. But v
is a linear combination of basis vectors of the row space, so v is also in the row
space. Since the row space is the orthogonal complement of the null space, we
must have v orthogonal to itself. Thus v = 0, or t1 v1 + t2 v2 + · · · + tr vr = 0.
But v1 , v2 , . . . , vr is a basis. By linear independence of v1 , v2 , . . . , vr , we
conclude t1 = 0, . . . , tr = 0. This establishes the claim, hence Av1 , Av2 , . . . ,
Avr is a basis for the column space. This shows r is the dimension of the
column space, which is by definition the column rank. Since by construction,
r is also the row rank, this establishes the rank theorem.

Exercises

Exercise 2.9.1 Let u and v be nonzero vectors. Then the rank of A = u ⊗ v


is one.
Exercise 2.9.2 Let µ be a unit vector in Rd . Then the rank of I − µ ⊗ µ is
d − 1.
Exercise 2.9.3 Use (2.9.4) to derive (2.9.5).
Exercise 2.9.4 Let v1 , v2 , . . . , vN be an orthonormal basis in RN , and let
Q be an N × N matrix. Use (2.9.3) and Exercise 2.2.6 to show

trace(Q) = v1 · Qv1 + v2 · Qv2 + · · · + vN · QvN . (2.9.6)

Exercise 2.9.5 Let v1 , v2 , . . . , vN be an orthonormal basis in RN , and let


A be an d × N matrix. Use Exercise 2.9.4 and Q = At A and (2.2.13) to show
∥A∥² = |Av1 |² + |Av2 |² + · · · + |AvN |². (2.9.7)

Exercise 2.9.6 Let v1 , v2 , . . . , vN be an orthonormal basis in RN , let u1 ,


u2 , . . . , ud be an orthonormal basis in Rd , and let A be an d × N matrix.
Use Exercise 2.9.5 and (2.9.5) to show
∥A∥² = Σ_{j=1}^{d} Σ_{k=1}^{N} (uj · Avk )². (2.9.8)

Exercise 2.9.7 Let u1 , u2 , . . . , ur be linearly independent, and v1 , v2 , . . . ,


vr be linearly independent. Then the rank of

A = u1 ⊗ v1 + u2 ⊗ v2 + · · · + ur ⊗ vr

is r. (One way to do this is by writing out Ax = 0.)


Chapter 3
Principal Components

In this chapter, we look at the two fundamental methods of breaking or


decomposing a matrix into elementary components, the eigenvalue decompo-
sition and the singular value decomposition, then we apply this to principal
component analysis.
Principal component analysis rests on an important phenomenon, that the
eigenvalues of a large matrix cluster near the top and bottom: For a wide
class of d × d variance matrices Q, when d is large, the eigenvalues of Q
cluster near the top eigenvalue, or near the bottom eigenvalue.
Because the bottom eigenvalue is usually zero, the eigenvalues near the
bottom don’t add up to anything substantial. On the other hand, because
of this clustering, the eigenvalues of Q near the top provide the largest con-
tribution to the explained variance trace(Q). We illustrate this for a specific
class of matrices arising from mass-spring systems (§3.2).
We begin by looking at the geometry of a matrix as a linear transformation.

3.1 Geometry of Matrices

Matrix multiplication by an N × d matrix A sends a point x in the source


space Rd to a point b = Ax in the target space RN (Figure 2.11).
Equivalently, since points in Rd are essentially the same as vectors in Rd
(see §1.3), an N × d matrix A sends a vector v in Rd to a vector Av in RN .
Looked at this way, a matrix A induces a linear transformation: Matrix
multiplication by A satisfies

A(v1 + v2 ) = Av1 + Av2 , A(tv) = tAv.

One way to understand what the transformation does is to see how it


distorts distances between vectors. If v1 and v2 are in Rd , then the distance
between them is d = |v1 − v2 | (recall |v| denotes the euclidean length of v).


How does this compare with the distance between Av1 and Av2 , or |Av1 −
Av2 |?
If we let

u = (v1 − v2 ) / |v1 − v2 |,
then u is a unit vector, |u| = 1, and by linearity

|Au| = |Av1 − Av2 | / |v1 − v2 |.

This ratio is a scaling factor of the linear transformation. Of course this


scaling factor depends on the given vectors v1 , v2 .
From this, to understand the scaling distortions, it is enough to understand
what multiplication by A does to unit vectors u.
The first step in understanding this is to compute

σ1 = max |Au| and σ2 = min |Au|.

Here the maximum and minimum are taken over all unit vectors u.
Then σ1 is the distance of the furthest image from the origin, and σ2 is
the distance of the nearest image to the origin. It turns out σ1 and σ2 are
the top and bottom singular values of A.

To keep things simple, assume both the source space and the target space
are R2 ; then A is 2 × 2.
The unit circle (in red in Figure 3.1) is the set of vectors u satisfying
|u| = 1. The image of the unit circle (also in red in Figure 3.1) is the set of
vectors of the form
{Au : |u| = 1}.
The annulus is the set (the region between the dashed circles in Figure 3.1)
of vectors b satisfying
{b : σ2 < |b| < σ1 }.
It turns out the image is an ellipse, and this ellipse lies in the annulus.
Thus the numbers σ1 and σ2 constrain how far the image of the unit circle
is from the origin, and how near the image is to the origin.
To relate σ1 and σ2 to what we’ve seen before, let Q = At A. Then,

σ1² = max |Au|² = max (Au) · (Au) = max u · At Au = max u · Qu.

Thus σ1² is the maximum projected variance corresponding to the variance Q. Similarly, σ2² is the minimum projected variance corresponding to the
variance Q.

Now let Q = AAt , and let b be in the image. Then b = Au for some unit
vector u, and

b · Q−1 b = (Au) · Q−1 Au = u · At (AAt )−1 Au = u · Iu = |u|2 = 1.

This shows the image of the unit circle is the inverse variance ellipse (§1.5)
corresponding to the variance Q, with major axis length 2σ1 and minor axis
length 2σ2 .

Fig. 3.1 Image of unit circle with σ1 = 1.5 and σ2 = .75.
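The annulus picture can be reproduced numerically: sample the unit circle, map it through A, and compare the largest and smallest image lengths with the singular values returned by numpy. This is only a sketch; the matrix A below is an illustrative choice.

from numpy import *
from numpy.linalg import svd, norm

# image of the unit circle lies between the circles of radii sigma2, sigma1
A = array([[1.2, 0.9],[0.2, 0.7]])

theta = linspace(0, 2*pi, 1000)
circle = array([cos(theta), sin(theta)])       # unit vectors as columns
lengths = norm(dot(A, circle), axis=0)

print( lengths.max(), lengths.min() )
print( svd(A)[1] )       # sigma1, sigma2: approximately the same values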

Let us look at some special cases.


The first example is

V = [[cos θ, − sin θ], [sin θ, cos θ]]. (3.1.1)
If e1 = (1, 0), e2 = (0, 1) is the standard basis in R2 . then the columns of V
are
V e1 = (cos θ, sin θ), and V e2 = (− sin θ, cos θ).
Since V t V = I, the columns of V are orthonormal. Thus V transforms the
orthonormal basis e1 , e2 into the orthonormal basis V e1 , V e2 (see §2.9). By
(1.4.4), V is a rotation by the angle θ.
The second example is

S = [[σ1 , 0], [0, σ2 ]].
Then S scales the horizontal direction by the factor σ1 , and S scales the
vertical direction by σ2 .
The third example is the pair of reflections

R = [[−1, 0], [0, 1]],    R = [[1, 0], [0, −1]].

These reflect vectors across the horizontal axis, and across the vertical axis.

Recall an orthogonal matrix is a matrix U satisfying U t U = I = U U t


(2.9.2). Every orthogonal matrix U is a rotation V or a rotation times a
reflection V R.

The SVD decomposition (§3.4) states that every matrix A can be written as a product

A = [[a, b], [c, d]] = U SV.
Here S is a diagonal matrix as above, and U , V are orthogonal and rotation
matrices as above.
In more detail, apart from a possible reflection, there are scalings σ1 and
σ2 and angles α and β, so that A transforms vectors by first rotating by α,
then scaling by (σ1 , σ2 ), then by rotating by β (Figure 3.2).

Fig. 3.2 SVD decomposition A = U SV .

In other words, each 2 × 2 matrix A, consisting of four numbers a, b, c,


d, may be described by four other numbers. These other numbers present a
much clearer picture of the geometry of A: two angles α, β, and two scalings
σ1 , σ2 .
Everything in this section generalizes to any N × d matrix, as we see in
the coming sections.

3.2 Eigenvalue Decomposition

In §1.5 and §2.5, we saw every variance matrix is nonnegative. In this section,
we see that every nonnegative matrix Q is the variance matrix of a specific
dataset. This dataset is called the principal components of Q.
Let A be a matrix. An eigenvector for A is a nonzero vector v such that
Av is aligned with v. This means

Av = λv (3.2.1)

for some scalar λ, the corresponding eigenvalue.



Because the solution v = 0 of (3.2.1) is not useful, we insist eigenvectors be


nonzero. If v is an eigenvector, then the dimension of v equals the dimension
of Av, which can only happen when A is a square matrix.

Fig. 3.3 Relations between matrix classes.



If v is an eigenvector corresponding to eigenvalue λ, then any scalar mul-


tiple u = tv is also an eigenvector corresponding to eigenvalue λ, since

Av = λv =⇒ Au = A(tv) = t(Av) = t(λv) = λ(tv) = λu.

Because of this, we usually take eigenvectors to be unit vectors, by normal-


izing them.
Even then, this does not determine v uniquely, since both ±v are unit
eigenvectors. This ± ambiguity is real, because different software packages
make different sign choices. Because of this, when plotting or computing with datasets, sign conventions must be checked carefully.
Let

Q = [[2, 1], [1, 2]].

Then Q has eigenvalues 3 and 1, with corresponding eigenvectors (1, 1) and (1, −1). These are not unit vectors, but the corresponding unit eigenvectors are (1/√2, 1/√2) and (1/√2, −1/√2).
The code

from numpy import *


from numpy.linalg import eig

# lambda is a keyword in Python


# so we use lamda instead

A = array([[2,1],[1,2]])
lamda, U = eig(A)
lamda

returns the eigenvalues [3,1] as an array, and returns the eigenvectors v1 ,


v2 of Q, as the columns of the matrix U . The matrix U is discussed further
below.
Since lambda is a keyword in Python, we deliberately misspell it and write
lamda in the code. When pretty-printed, Python knows to display lamda as
λ.
The method eig(A) works on any square matrix A, but may return com-
plex eigenvalues. When eig(A) returns real eigenvalues, they are not neces-
sarily ordered in any predetermined fashion.
If the matrix Q is known to be symmetric, then the eigenvalues are guar-
anteed real. In this case, eigh(Q) returns the eigenvalues in increasing order.
If eigh is used on a non-symmetric matrix, it will return erroneous data.

from numpy import *


from numpy.linalg import eigh

Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda

returns the array [1,3].

Let A be a square d × d matrix. The ideal situation is when there is a


basis v1 , v2 , . . . , vd in Rd of eigenvectors of A. However, this is not always
the case. For example, if

A = [[1, 1], [0, 1]] (3.2.2)
and Av = λv, then v = (x, y) satisfies x + y = λx, y = λy. This system has
only the nonzero solution (x, y) = (1, 0) (or its multiples) and λ = 1. Thus
A has only one eigenvector e1 = (1, 0), and the corresponding eigenvalue is
λ = 1.
Let A be any square matrix.

Eigenvalues of A Versus Eigenvalues of A Transpose

The eigenvalues of A and the eigenvalues of At are the same.

This result is a consequence of the rank theorem in §2.9. To see why,


suppose λ is an eigenvalue of A with corresponding eigenvector v. Then Av =
λv, which implies
(A − λI)v = Av − λv = 0.
As a consequence, if we let B = A − λI, then v is an eigenvector of A
corresponding to λ iff v is in the nullspace of B. It follows λ is an eigenvalue
for A iff B has a nonzero nullspace. Now B t = At − λI. If we show B t has a
nonzero null space, by the same logic, we will conclude λ is an eigenvalue of
At . Now B has a nonzero null space iff B is not full-rank. Since B is square,
by the rank theorem, this happens iff B t is not full-rank, which happens iff
B t has a nonzero null space. Thus λ is an eigenvalue of A iff λ is an eigenvalue
of At .

Let v be a unit vector. From §2.5, when Q is the variance matrix of a


dataset, v · Qv is the variance of the dataset projected onto the line through
v. When v is an eigenvector, Qv = λv, the variance equals

v · Qv = v · λv = λv · v = λ.

More generally, this holds for any symmetric matrix Q. We conclude

Projected Variance along Eigenvector Direction

If v is a unit eigenvector of a symmetric matrix Q, then v·Qv equals the


corresponding eigenvalue. In particular, the eigenvalues of a variance
matrix are nonnegative.

In general, when Q is symmetric but not a variance matrix, some eigen-


values of Q may be negative.

Suppose λ and µ are eigenvalues of a symmetric matrix Q with correspond-


ing eigenvectors u, v. Since Q is symmetric, u · Qv = v · Qu. Using Qu = λu,
Qv = µv, we compute u · Qv in two ways:

µu · v = u · (µv) = u · Qv = v · Qu = v · (λu) = λu · v.

This implies
(µ − λ)u · v = 0.
If λ ̸= µ, we must have u · v = 0. We conclude:

Distinct Eigenvalues Have Orthogonal Eigenvectors

For a symmetric matrix Q, eigenvectors corresponding to distinct


eigenvalues are orthogonal.

Suppose there is a basis v1 , v2 , . . . , vd of eigenvectors of Q, with corre-


sponding eigenvalues λ1 , λ2 , . . . , λd . Let E be the diagonal matrix with λ1 ,
λ2 , . . . , λd on the diagonal,

E = diag(λ1 , λ2 , . . . , λd ).

Let U be the matrix with columns v1 , v2 , . . . , vd . By matrix multiplication


and Qvj = λj vj , j = 1, 2, . . . , d, we obtain

QU = U E. (3.2.3)

When this happens, we say Q is diagonalizable. Thus A in (3.2.2) is not


diagonalizable. On the other hand, we will show every symmetric matrix
Q is diagonalizable. In fact, Q symmetric leads to an orthonormal basis of
eigenvectors.
In Python, given Q, we compute the third eigenvector v and third eigen-
value λ, and verify Qv = λv. The code

from numpy import *


from numpy.linalg import eigh

# Q is any symmetric matrix


lamda, U = eigh(Q)
lamda = lamda[2]
v = U[:,2]

allclose(dot(Q,v), lamda*v)

returns True.

The main result in this section is

Eigenvalue Decomposition (EVD)

Let Q be a symmetric d × d matrix. There is an orthonormal basis v1 ,


v2 , . . . , vd in Rd of eigenvectors of Q, with corresponding eigenvalues

λ1 ≥ λ2 ≥ · · · ≥ λd .

Here are some consequences of the eigenvalue decomposition.


If V is the matrix with rows v1 , v2 , . . . , vd then U = V t is the matrix with
columns v1 , v2 , . . . , vd . Since v1 , v2 , . . . , vd are orthonormal, U is orthogonal
(see (2.9.2)), so U t U = I = U U t . By (3.2.3), QU = U E. Multiplying on the
right by V = U t ,
Q = QU V = U EV = U EU t .
Thus the eigenvalue decomposition states

Diagonalization (EVD)

There is an orthogonal matrix U and a diagonal matrix E such that


with V = U t , we have

Q = U EV = U EU t . (3.2.4)

When this happens, the rows of V are the eigenvectors of Q, and the
diagonal entries of E are the eigenvalues of Q.

In other words, with the correct choice of orthonormal basis, the matrix
Q becomes a diagonal matrix E.
The orthonormal basis eigenvectors v1 , v2 , . . . , vd are the principal compo-
nents of the matrix Q. The eigenvalues and eigenvectors of Q, taken together,
are the eigendata of Q. The code

from numpy import *


from numpy.linalg import eigh

Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
lamda, U

returns the eigenvalues [1, 3] and the matrix U = [u, v] with columns

u = (1/√2, −1/√2),    v = (1/√2, 1/√2).

These columns are the orthonormal eigenvectors Qv = 3v, Qu = 1u. By


(3.2.3), QU = U E, where E is the diagonal matrix with the eigenvalues on
the diagonal,

from numpy import *


from numpy.linalg import eigh

Q = array([[2,1],[1,2]])
lamda, U = eigh(Q)
V = U.T
E = diag(lamda)

allclose(Q,dot(U,dot(E,V)))

returns True.

In sympy, the corresponding commands are

from sympy import *


from sympy import init_printing

init_printing()

# eigenvalues
Q.eigenvals()

# eigenvectors
Q.eigenvects()

U, E = Q.diagonalize()

The command init_printing pretty-prints the output.

Let λ1 , λ2 , . . . , λr be the nonzero eigenvalues of Q. Then the diagonal


matrix E has r nonzero entries on the diagonal, so rank(E) = r. Since U
and V = U t are invertible, rank(E) = rank(U EV ). Since Q = U EV ,

rank(Q) = rank(E) = r.

Rank Equals Number of Nonzero Eigenvalues

The rank of a diagonal matrix equals the number of nonzero en-


tries. The rank of a square symmetric matrix Q equals the number of
nonzero eigenvalues of Q.

Because real-life datasets are composed of floats, a more useful measure of


the rank or dimension of a dataset matrix is the approximate dimension. The
approximate dimension or approximate rank of A is the number of eigenvalues
of the variance Q = At A/N (see (2.5.1)) that are not almost zero, measured
by numpy.

from numpy import *


from numpy.linalg import eigh

# dataset is Nxd
N, d = dataset.shape
Q = dot(dataset.T,dataset)/N

lamda = eigh(Q)[0]

for i,eval in enumerate(lamda):
    if not allclose(eval,0):
        approx_nullity = i
        break

approx_rank = d - approx_nullity

approx_rank, approx_nullity

This code returns 712 for the MNIST dataset, agreeing with the code in
§2.8.

Let's go back to diagonalization. Using sympy,

from sympy import *

Q = Matrix([[2,1],[1,2]])
U, E = Q.diagonalize()
display(U,E)

returns

U = [[1, 1], [−1, 1]],    E = [[1, 0], [0, 3]].
Also,

from sympy import *

a,b,c = symbols("a b c")

Q = Matrix([[a,b ],[b,c]])
U, E = Q.diagonalize()
display(Q,U,E)

returns

Q = [[a, b], [b, c]],    U = (1/(2b)) [[a − c − √D, a − c + √D], [2b, 2b]]

and

E = (1/2) [[a + c − √D, 0], [0, a + c + √D]],    where D = (a − c)² + 4b².

(display is used to pretty-print the output.)

When all the eigenvalues are nonzero, we can write


E⁻¹ = diag(1/λ1 , 1/λ2 , . . . , 1/λd ).

Then a straightforward calculation using (3.2.4) shows

Nonzero Eigenvalues Equals Invertible

Let Q = U EV be the EVD of a symmetric matrix Q. Then Q is


invertible iff all its eigenvalues are nonzero. When this happens, we
have
Q−1 = U E −1 V

More generally, using (2.6.8), one can check

Pseudo-Inverse and EVD

If λ1 ≥ λ2 ≥ · · · ≥ λr are the nonzero eigenvalues of Q, then 1/λ1 ≤


1/λ2 ≤ · · · ≤ 1/λr are the nonzero eigenvalues of Q+ . Moreover, if U
is an orthogonal matrix, and V = U t , then

Q = U EV =⇒ Q+ = U E + V. (3.2.5)

Similarly, eigendata may be used to solve linear systems.

Nonzero Eigenvalues Equals Solvable

Let v1 , v2 , . . . , vd be the orthonormal basis of eigenvectors of Q cor-


responding to eigenvalues λ1 , λ2 , . . . , λd . Then the linear system

Qx = b

has a solution x for every vector b iff all eigenvalues are nonzero, in
which case
x = (1/λ1 )(b · v1 )v1 + (1/λ2 )(b · v2 )v2 + · · · + (1/λd )(b · vd )vd . (3.2.6)

The proof is straightforward using (2.9.4): multiply x by Q to verify,


Qx = Q( (1/λ1 )(b · v1 )v1 + (1/λ2 )(b · v2 )v2 + · · · )
   = (1/λ1 )(b · v1 )Qv1 + (1/λ2 )(b · v2 )Qv2 + · · ·
   = (b · v1 )v1 + (b · v2 )v2 + · · · = b.
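Here is a sketch of (3.2.6) in code, comparing the eigendata solution against numpy's solve; Q and b are illustrative choices.

from numpy import *
from numpy.linalg import eigh, solve

# solve Qx = b using the eigendata of Q, as in (3.2.6)
Q = array([[2.,1],[1,2]])
b = array([1.,5])

lamda, U = eigh(Q)
coeffs = dot(U.T, b) / lamda     # (b . vk) / lambda_k
x = dot(U, coeffs)               # sum of the coefficients times vk

print( allclose(dot(Q,x), b) )     # True
print( allclose(x, solve(Q,b)) )   # True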

Another consequence of the eigenvalue decomposition is

Trace is the Sum of Eigenvalues

Let Q be a symmetric matrix with eigenvalues λ1 , λ2 , . . . , λd . Then

trace(Q) = λ1 + λ2 + · · · + λd . (3.2.7)

To derive this, use (3.2.3): Since U is orthogonal, U V = U U t = I. By


(2.2.6), trace(AB) = trace(BA), so

trace(Q) = trace(QU V ) = trace(V QU ) = trace(V U EV U ) = trace(E).

Since E = diag(λ1 , λ2 , . . . , λd ), trace(E) = λ1 + λ2 + · · · + λd , and the result


follows.
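As a quick sketch, (3.2.7) can be checked numerically on a randomly generated symmetric matrix.

from numpy import *
from numpy.linalg import eigvalsh

# trace equals the sum of the eigenvalues, (3.2.7)
B = random.rand(5,5)
Q = B + B.T          # symmetric

print( allclose(trace(Q), sum(eigvalsh(Q))) )   # True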
Let Q be symmetric with eigenvalues λ1 , λ2 , . . . , λd . Since

Qv = λv =⇒ Q2 v = QQv = Q(λv) = λQv = λ2 v,

Q² is symmetric with eigenvalues λ1², λ2², . . . , λd². Applying the last result to Q², we have

trace(QQt ) = trace(Q²) = λ1² + λ2² + · · · + λd².

It turns out every nonnegative matrix Q is the variance of a simple dataset


(Figure 3.4).

Sum of Tensor Products


Let Q be a symmetric d × d matrix with eigenvalues λ1 , λ2 , . . . , λd
and orthonormal eigenvectors v1 , v2 , . . . , vd . Then

Q = λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd . (3.2.8)

In particular, when Q is nonnegative, the dataset consisting of the 2d


points

±√(dλ1 ) v1 , ±√(dλ2 ) v2 , . . . , ±√(dλd ) vd

is centered and has variance Q.

Fig. 3.4 Inverse variance ellipse and centered dataset.

The vectors in this dataset are the principal components of Q.


Since v1 , v2 , . . . , vd is an orthonormal basis, by (2.9.4), every vector v can
be written
v = (v · v1 ) v1 + (v · v2 ) v2 + · · · + (v · vd ) vd .
Multiply by Q. Since Qvk = λk vk ,

Qv = (v · v1 ) Qv1 + (v · v2 ) Qv2 + · · · + (v · vd ) Qvd


= λ1 (v · v1 ) v1 + λ2 (v · v2 ) v2 + · · · + λd (v · vd ) vd
= (λ1 v1 ⊗ v1 + λ2 v2 ⊗ v2 + · · · + λd vd ⊗ vd ) v

This proves the first part. For the second part, let bk = λk vk . Then the
mean of the 2d vectors ±b1 , ±b2 , . . . , ±bd is clearly zero, and by (3.2.8), the
variance matrix
2
(b1 ⊗ b1 + b2 ⊗ b2 + · · · + bd ⊗ bd )
2d
equals Q/d.
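A sketch verifying (3.2.8) in code; the matrix Q here is an illustrative choice.

from numpy import *
from numpy.linalg import eigh

# rebuild Q as the sum of lambda_k vk (outer) vk, as in (3.2.8)
Q = array([[2.,1],[1,2]])
lamda, U = eigh(Q)

total = zeros_like(Q)
for k in range(len(lamda)):
    total += lamda[k] * outer(U[:,k], U[:,k])

print( allclose(total, Q) )   # True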

Now we approach the eigenvalues of Q from a different angle. In §2.5, we


studied zero variance directions. Since the eigenvalues of a variance matrix
are nonnegative, for a variance matrix, they may also be called minimum
variance directions. Now we study maximum variance directions.
Let
λ1 = max_{|v|=1} v · Qv,

where the maximum is over all unit vectors v. We say a unit vector b is best-fit
for Q or best-aligned with Q if the maximum is achieved at v = b: λ1 = b · Qb.

When Q is a variance matrix, this means the unit vector b is chosen so that
the variance b · Qb of the dataset projected onto b is maximized.
An eigenvalue λ1 of Q is the top eigenvalue if λ1 ≥ λ for any other eigen-
value. An eigenvalue λ1 of Q is the bottom eigenvalue if λ1 ≤ λ for any other
eigenvalue. We establish the following results.

Maximum Projected Variance is an Eigenvalue

Let Q be a symmetric matrix. Then

λ1 = max_{|v|=1} v · Qv (3.2.9)

is the top eigenvalue of Q.

Best-aligned vector is an eigenvector

Let Q be a symmetric matrix. Then a best-aligned vector b is an


eigenvector of Q corresponding to the top eigenvalue λ1 .
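A rough numerical sketch of the first result: maximizing v · Qv over many random unit vectors never exceeds, and comes close to, the top eigenvalue; Q is an illustrative choice.

from numpy import *
from numpy.linalg import eigh, norm

# compare max of v . Qv over random unit vectors with the top eigenvalue
Q = array([[2.,1],[1,2]])

best = -inf
for _ in range(10000):
    v = random.randn(2)
    v = v / norm(v)
    if dot(v, dot(Q,v)) > best: best = dot(v, dot(Q,v))

print( best )            # close to, and never above,
print( eigh(Q)[0][-1] )  # the top eigenvalue 3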

To prove these results, we begin with a simple calculation, whose derivation


we skip.

A Calculation
Suppose λ, a, b, c, d are real numbers and suppose we know

(λ + at + bt²) / (1 + ct + dt²) ≤ λ, for all t real.
Then a = λc.

Let λ be any eigenvalue of Q, with eigenvector v: Qv = λv. Dividing v by


its length, we may assume |v| = 1. Then

λ1 ≥ v · Qv = v · (λv) = λv · v = λ.

This shows λ1 ≥ λ for any eigenvalue λ.


Now we show λ1 itself is an eigenvalue. Let v1 be a unit vector maximizing
v · Qv, so v1 is best-fit for Q. Then

λ1 = v1 · Qv1 ≥ v · Qv (3.2.10)

for all unit vectors v. Let u be any vector. Then for any real t,

v = (v1 + tu) / |v1 + tu|

is a unit vector. Insert this v into (3.2.10) to obtain

λ1 ≥ (v1 + tu) · Q(v1 + tu) / |v1 + tu|².

Since Q is symmetric, u · Qv1 = v1 · Qu. Expanding with |v1 |2 = 1, we obtain

λ1 ≥ (λ1 + 2t u · Qv1 + t² u · Qu) / (1 + 2t u · v1 + t² |u|²) = (λ1 + at + bt²) / (1 + ct + dt²).

Applying the calculation with λ = λ1 , a = 2u · Qv1 , b = u · Qu, c = 2u · v1 ,


and d = |u|2 , we conclude

u · Qv1 = λ1 u · v1

for all vectors u. But this implies

u · (Qv1 − λ1 v1 ) = 0

for all u. Thus Qv1 − λ1 v1 is orthogonal to all vectors, hence orthogonal to


itself. Since this can only happen if Qv1 − λ1 v1 = 0, we conclude Qv1 = λ1 v1 .
Hence λ1 is itself an eigenvalue. This completes the proof of the two results.

Just as the maximum variance (3.2.9) is the top eigenvalue λ1 , the mini-
mum variance
λd = min_{|v|=1} v · Qv (3.2.11)

is the bottom eigenvalue, and the corresponding eigenvector vd is the worst-


aligned vector.
By the eigenvalue decomposition, the eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd of
a symmetric matrix Q may be arranged in decreasing order, and may be
positive, zero, or negative scalars. When Q is a variance, the eigenvalues are
nonnegative, and the bottom eigenvalue is at least zero. When the bottom
eigenvalue is zero, the corresponding eigenvectors are zero variance directions.

Now we can complete the proof the eigenvalue decomposition. Having


found the top eigenvalue λ1 with its corresponding unit eigenvector v1 , we
let S = span(v1 ) and T = S ⊥ be the orthogonal complement of v1 (Figure
3.5). Then dim(T ) = d − 1, and we can repeat the process and maximize

v · Qv over all unit v in T , i.e. over all unit v orthogonal to v1 . This leads to
another eigenvalue λ2 with corresponding eigenvector v2 orthogonal to v1 .
Since λ1 is the maximum of v · Qv over all vectors in Rd , and λ2 is the
maximum of v · Qv over the restricted space T of vectors orthogonal to v1 ,
we must have λ1 ≥ λ2 .
Having found the top two eigenvalues λ1 ≥ λ2 and their orthonormal
eigenvectors v1 , v2 , we let S = span(v1 , v2 ) and T = S ⊥ be the orthogonal
complement of S. Then dim(T ) = d − 2, and we can repeat the process to
obtain λ3 and v3 in T . Continuing in this manner, we obtain eigenvalues

λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λd .

with corresponding orthonormal eigenvectors

v1 , v2 , v3 , . . . , vd .

This proves the eigenvalue decomposition.

Fig. 3.5 S = span(v1 ) and T = S ⊥ .

Let Q be a positive variance matrix and let b · Q−1 b = 1 be the inverse


variance ellipsoid. If v is a unit eigenvector corresponding to an eigenvalue λ, then λ ≥ 0, and the vector b = √λ v has length √λ. Moreover b satisfies

b · Q⁻¹ b = (√λ v) · Q⁻¹ (√λ v) = λ v · Q⁻¹ v = λ v · (λ⁻¹ v) = v · v = 1.

Hence the line segment joining the vectors ±√λ v is an axis of the inverse variance ellipsoid, with length 2√λ (Figure 3.4).
When λ = λ1 is the top eigenvalue, the axis is the principal axis of the
inverse variance ellipsoid. When λ = λ2 is the next highest eigenvalue, the
axis is orthogonal to the principal axis, and is the second principal axis.
Continuing in this manner, we obtain all the principal axes of the inverse
variance ellipsoid.

Principal Axes of Inverse Variance Ellipsoid

Let v be a unit eigenvector of a variance matrix Q with eigenvalue λ. Then the line segment joining −√λ v and +√λ v is a principal axis of the inverse variance ellipsoid, with length 2√λ.

Together with Figure ??, this result provides a geometric interpretation of


eigenvalues: They control the variances of a dataset’s points, in the principal
directions.
Sometimes, several eigenvalues are equal, leading to several eigenvectors,
say m of them, corresponding to a given eigenvalue λ. In this case, we say
the eigenvalue λ has multiplicity m, and we call the span

Sλ = {v : Qv = λv}

the eigenspace corresponding to λ. For example, suppose the top three eigen-
values are equal: λ1 = λ2 = λ3 , with b1 , b2 , b3 the corresponding eigenvectors.
Calling this common value λ, the eigenspace is Sλ = span(b1 , b2 , b3 ). Since
b1 , b2 , b3 are orthonormal, dim(Sλ ) = 3. In Python, the eigenspaces Sλ are
obtained by the matrix U above: The columns of U are an orthonormal basis
for the entire space, so selecting the columns corresponding to a specific λ
yields an orthonormal basis for Sλ .

Let (lamda,U) be the list of eigenvalues and matrix U whose columns are
the eigenvectors. Then the eigenvectors are the rows of U t . Here is code for
selecting just the eigenvectors corresponding to eigenvalue s.

from numpy import *


from numpy.linalg import eigh

lamda, U = eigh(Q)
V = U.T
V[isclose(lamda,s)]

The function isclose(a,b) returns True when a and b are numerically close.
Using this boolean, we extract only those rows of V whose corresponding
eigenvalue is close to s.
The subspace Sλ is defined for any λ. However, dim(Sλ ) = 0 unless λ is
an eigenvalue, in which case dim(Sλ ) = m, where m is the multiplicity of λ.
The proof of the eigenvalue decomposition is a systematic procedure for
finding eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd . Now we show there are no other
eigenvalues.

The Eigenvalue Decomposition is Complete

If λ is an eigenvalue for Q, Qv = λv, then λ equals one of the eigen-


values in the eigenvalue decomposition of Q.

To see this, suppose Qv = λv with λ ̸= λj for j = 1, . . . , d. Since λ ̸= λj


for j = 1, . . . , d, the vector v must be orthogonal to every vj , j = 1, . . . , d.
Since span(v1 , . . . , vd ) = Rd , it follows v is orthogonal to every vector, hence
v is orthogonal to itself, hence v = 0. We conclude λ cannot be an eigenvalue.

All this can be readily computed in Python. For the Iris dataset, we have
the variance matrix in (2.2.15). The eigenvalues are

4.2 > 0.24 > 0.08 > 0.02,

and the orthonormal eigenvectors are the columns of the matrix

U =
[  0.36  −0.66  −0.58   0.32 ]
[ −0.08  −0.73   0.60  −0.32 ]
[  0.86   0.18   0.07  −0.48 ]
[  0.36   0.07   0.55   0.75 ]

Since the eigenvalues are distinct, the multiplicity of each eigenvalue is 1.


From (2.2.15), the total variance of the Iris dataset is

4.54 = trace(Q) = λ1 + λ2 + λ3 + λ4 .

For the Iris dataset, the top eigenvalue is λ1 = 4.2, it has multiplicity 1, and
its corresponding list of eigenvectors contains only one eigenvector,

v1 = (0.36, −0.08, 0.86, 0.36).

The top eigenvalue accounts for 92.5% of the total variance.


The second eigenvalue is λ2 = 0.24 with eigenvector

v2 = (−0.66, −0.73, 0.18, 0.07).



The top two eigenvalues account for 97.8% of the total variance.
The third eigenvalue is λ3 = 0.08 with eigenvector

v3 = (−0.58, 0.60, 0.07, 0.55).

The top three eigenvalues account for 99.5% of the total variance.
The fourth eigenvalue is λ4 = 0.02 with eigenvector

v4 = (0.32, −0.32, −0.48, 0.75).

The top four eigenvalues account for 100% of the total variance. Here each
eigenvalue has multiplicity 1, since there are four distinct eigenvalues.

An important class of symmetric matrices is of the form

Q(2) = [  2  -2
         -2   2 ]

Q(3) = [  2  -1  -1
         -1   2  -1
         -1  -1   2 ]

Q(4) = [  2  -1   0  -1
         -1   2  -1   0
          0  -1   2  -1
         -1   0  -1   2 ]

Q(5) = [  2  -1   0   0  -1
         -1   2  -1   0   0
          0  -1   2  -1   0
          0   0  -1   2  -1
         -1   0   0  -1   2 ]

Q(6) = [  2  -1   0   0   0  -1
         -1   2  -1   0   0   0
          0  -1   2  -1   0   0
          0   0  -1   2  -1   0
          0   0   0  -1   2  -1
         -1   0   0   0  -1   2 ]

Q(7) = [  2  -1   0   0   0   0  -1
         -1   2  -1   0   0   0   0
          0  -1   2  -1   0   0   0
          0   0  -1   2  -1   0   0
          0   0   0  -1   2  -1   0
          0   0   0   0  -1   2  -1
         -1   0   0   0   0  -1   2 ]
We denote these matrices Q(2), Q(3), Q(4), Q(5), Q(6), Q(7). The following
code generates these symmetric d × d matrices Q(d),

def row(i,d):
    v = [0]*d
    v[i] = 2
    if i > 0: v[i-1] = -1
    if i < d-1: v[i+1] = -1
    if i == 0: v[d-1] += -1
    if i == d-1: v[0] += -1
    return v

# using sympy
from sympy import Matrix

def Q(d): return Matrix([ row(i,d) for i in range(d) ])

# using numpy
from numpy import *

def Q(d): return array([ row(i,d) for i in range(d) ])

The eigenvalues of these symmetric matrices follow interesting patterns


that are best explored using Python.
Below we will see that the eigenvalues of Q(d) are between 4 and 0, and each
eigenvalue other than 4 and 0 has multiplicity 2.


Fig. 3.6 Three springs at rest and perturbed.

To explain where these matrices come from, look at the mass-spring sys-
tems in Figures 3.6 and 3.7. Here we have springs attached to masses and
walls on either side. At rest, the springs are the same length. When per-
turbed, some springs are compressed and some stretched. In Figure 3.6, let
x1 and x2 denote the displacement of each mass from its rest position.
When extended by x, each spring fights back by exerting a force kx pro-
portional to the displacement x. Here k is the spring constant. For example,
look at the mass m1 . The spring to its left is extended by x1 , so exerts a force
of −kx1 . Here the minus indicates pulling to the left. On the other hand, the
spring to its right is extended by x2 − x1 , so it exerts a force +k(x2 − x1 ).
Here the plus indicates pulling to the right. Adding the forces from either
side, the total force on m1 is −k(2x1 − x2 ). For m2 , the spring to its left
exerts a force −k(x2 − x1 ), and the spring to its right exerts a force −kx2 ,
so the total force on m2 is −k(2x2 − x1 ). We obtain the force vector
    -k [ 2x1 - x2  ]  =  -k [  2  -1 ] [ x1 ]
       [ -x1 + 2x2 ]        [ -1   2 ] [ x2 ] .

However, as you can see, the matrix here is not exactly Q(2).


Fig. 3.7 Six springs at rest and perturbed.

For five masses, let x1 , x2 , x3 , x4 , x5 denote the displacement of each mass


from its rest position. In Figure 3.7, x1 , x2 , x5 are positive, and x3 , x4 are
negative.
As before, the total force on m1 is −k(2x1 − x2 ), and the total force on m5
is −k(2x5 − x4 ). For m2 , the spring to its left exerts a force −k(x2 − x1 ), and
the spring to its right exerts a force +k(x3 − x2 ). Hence, the total force on
m2 is −k(−x1 + 2x2 − x3 ). Similarly for m3 , m4 . We obtain the force vector
    
    [ 2x1 - x2       ]        [  2  -1   0   0   0 ] [ x1 ]
    [ -x1 + 2x2 - x3 ]        [ -1   2  -1   0   0 ] [ x2 ]
 -k [ -x2 + 2x3 - x4 ]  =  -k [  0  -1   2  -1   0 ] [ x3 ] .
    [ -x3 + 2x4 - x5 ]        [  0   0  -1   2  -1 ] [ x4 ]
    [ -x4 + 2x5      ]        [  0   0   0  -1   2 ] [ x5 ]

But, again, the matrix here is not Q(5). Notice, if we place one mass and two
springs in Figure 3.6, we obtain the 1 × 1 matrix 2.

To obtain Q(2) and Q(5), we place the springs along a circle, as in Figures
3.8 and 3.9. Now we have as many springs as masses. Repeating the same
logic, this time we obtain Q(2) and Q(5). Notice if we place one mass and
one spring in Figure 3.8, d = 1, we obtain the 1 × 1 matrix Q(1) = 0: There
is no force if we move a single mass around the circle, because the spring is
not being stretched.


Fig. 3.8 Two springs along a circle leading to Q(2).

Thus the matrices Q(d) arise from mass-spring systems arranged on a


circle. From Newton's law (force equals mass times acceleration), one shows
the frequencies of the vibrating springs equal √(λk/m), where k is the spring
constant, m is the mass of each of the masses, and λ is an eigenvalue of Q(d).
This is the physical meaning of the eigenvalues of Q(d).


Fig. 3.9 Five springs along a circle leading to Q(5).

Let v have features (x1 , x2 , . . . , xd ), and let Q = Q(d). By elementary


algebra, check that

v · Qv = (x1 − x2 )2 + (x2 − x3 )2 + · · · + (xd−1 − xd )2 + (xd − x1 )2 . (3.2.12)

As a consequence of (3.2.12), show also the following.



• For any vector v, 0 ≤ v · Qv ≤ 4|v|^2. Conclude every eigenvalue λ satisfies 0 ≤ λ ≤ 4.
• λ = 0 is an eigenvalue, with multiplicity 1.
• When d is even, λ = 4 is an eigenvalue with multiplicity 1.
• When d is odd, λ = 4 is not an eigenvalue.

To compute the eigenvalues, we use complex numbers, specifically the d-th


root of unity ω (§A.4). Let

p(t) = 2 − t − t^(d−1),

and let

v1 = (1, ω, ω^2, ω^3, . . . , ω^(d−1)).

Then Qv1 is

Qv1 = ( 2 − ω − ω^(d−1),
        −1 + 2ω − ω^2,
        −ω + 2ω^2 − ω^3,
        . . . ,
        −ω^(d−2) + 2ω^(d−1) − 1 )
    = p(ω) (1, ω, ω^2, . . . , ω^(d−1)) = p(ω) v1.

Thus v1 is an eigenvector corresponding to eigenvalue p(ω).


For each k = 0, 1, 2, . . . , d − 1, define
 
vk = (1, ω^k, ω^(2k), ω^(3k), . . . , ω^((d−1)k)).        (3.2.13)

Then
v0 = 1 = (1, 1, . . . , 1),
and, by the same calculation, we have

Qvk = p(ω k )vk , k = 0, 1, 2, . . . , d − 1.

By (A.4.9),

p(ω^k) = 2 − ω^k − ω^((d−1)k) = 2 − ω^k − ω^(−k) = 2 − 2 cos(2πk/d).



Eigenvalues of Q(d)

The (unsorted) eigenvalues of Q(d) are


 
λk = p(ω^k) = 2 − 2 cos(2πk/d),        (3.2.14)

with corresponding eigenvectors vk given by (3.2.13), k = 0, 1, 2, . . . ,


d − 1.

Corresponding to each eigenvalue λk , there is the complex eigenvector vk .


Separating vk into its real and imaginary parts yields two real eigenvectors
        
ℜ(vk) = ( 1, cos(2πk/d), cos(4πk/d), cos(6πk/d), . . . , cos(2(d−1)πk/d) ),
ℑ(vk) = ( 0, sin(2πk/d), sin(4πk/d), sin(6πk/d), . . . , sin(2(d−1)πk/d) ).

When k = 0, or when k = d/2 with d even, we have ℑ(vk) = 0, so only one real eigenvector results; for all other k we obtain two real eigenvectors. This explains the double multiplicity in Figure 3.10, except at k = 0 and k = d/2, d even.

Applying this formula, we obtain eigenvalues

Q(2) = (4, 0)
Q(3) = (3, 3, 0)
Q(4) = (4, 2, 2, 0)
Q(5) = (5/2 + √5/2, 5/2 + √5/2, 5/2 − √5/2, 5/2 − √5/2, 0)
Q(6) = (4, 3, 3, 1, 1, 0)
Q(8) = (4, 2 + √2, 2 + √2, 2, 2, 2 − √2, 2 − √2, 0)
Q(10) = (4, 5/2 + √5/2, 5/2 + √5/2, 3/2 + √5/2, 3/2 + √5/2,
            5/2 − √5/2, 5/2 − √5/2, 3/2 − √5/2, 3/2 − √5/2, 0)
Q(12) = (4, 2 + √3, 2 + √3, 3, 3, 2, 2, 1, 1, 2 − √3, 2 − √3, 0).

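As a quick sanity check of (3.2.14), here is a short comparison of the numerically computed eigenvalues of Q(d) with the formula; it assumes the numpy version of Q(d) defined above.

from numpy import *
from numpy.linalg import eigvalsh

d = 6
k = arange(d)
# eigenvalues from formula (3.2.14)
formula = 2 - 2*cos(2*pi*k/d)
# eigenvalues computed numerically (ascending order)
computed = eigvalsh(Q(d))
# both give (0, 1, 1, 3, 3, 4) after sorting
print(allclose(sort(formula), computed))
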
The matrices Q(d) are circulant matrices. Each row in Q(d) is obtained
from the row above it in Q(d) by shifting the entries to the right. The trick of

using the roots of unity to compute the eigenvalues and eigenvectors works
for any circulant matrix.

Fig. 3.10 Plot of eigenvalues of Q(50).

Our last topic is the distribution of the eigenvalues for large d. How are
the eigenvalues scattered? Figure 3.10 plots the eigenvalues for Q(50) using
the code below.

from numpy import *
from numpy.linalg import eigh
from matplotlib.pyplot import stairs,show,scatter,legend,grid

d = 50
lamda = eigh(Q(d))[0]
stairs(lamda,range(d+1),label="numpy")

k = arange(d)
lamda = 2 - 2*cos(2*pi*k/d)
sorted = sort(lamda)

scatter(k,lamda,s=5,label="unordered")
scatter(k,sorted,c="red",s=5,label="increasing order")

grid()
legend()
show()

Figure 3.10 shows the eigenvalues tend to cluster near the top λ1 ≈ 4 and the bottom λd = 0; they are sparser near the middle. Using the double-angle formula,

λk = 4 sin^2(πk/d),    k = 0, 1, 2, . . . , d − 1.
Solving for k/d in terms of λ, and multiplying by two to account for the
double multiplicity, we obtain the proportion of eigenvalues below threshold
λ,
#{k : λk ≤ λ} / d  ≈  (2/π) arcsin(√λ / 2),    0 ≤ λ ≤ 4.        (3.2.15)
Here ≈ means asymptotic equality, see §A.6.
Equivalently, the derivative (4.1.23) of the arcsine law (3.2.15) exhibits the
eigenvalue clustering near the ends (Figure 3.11).

Fig. 3.11 Density of eigenvalues of Q(d) for d large.

from numpy import *


from matplotlib.pyplot import *

lamda = arange(0.1,3.9,.01)
density = 1/(pi*sqrt(lamda*(4-lamda)))
plot(lamda,density)
# r"..." means raw string
tex = r"$\displaystyle\frac1{\pi\sqrt{\lambda(4-\lambda)}}$"
text(.5,.45,tex,usetex=True,fontsize="x-large")

grid()
show()

The matrices Q(d) are prototypes of matrices that are fundamental in


many areas of physics and engineering, including time series analysis and
information theory, see [11]. This clustering of eigenvalues near the top and

bottom is valid for a wide class of matrices, not just Q(d), as the matrix size
d grows without bound, d → ∞.

Exercises

Exercise 3.2.1 Let A be a 2 × 2 matrix. Show λ is an eigenvalue of A when


det(A − λI) = 0. (See homogeneous systems in §1.4.)
Exercise 3.2.2 Let A be a 2 × 2 matrix. Show λ is an eigenvalue of A when

λ2 − tλ + d = 0,

where t = trace(A) and d = det(A).


 
Exercise 3.2.3 Let

    Q = [ a  b
          b  c ]

be a 2 × 2 symmetric matrix. Show the eigenvalues λ± of Q are given by (1.5.6).
Exercise 3.2.4 Let Q be a 2 × 2 symmetric matrix. Show Q ≥ 0 when
det(Q) ≥ 0 and trace(Q) ≥ 0.
Exercise 3.2.5 With R(d) as in Exercise 2.2.9, find the eigenvalues and
eigenvectors of R(d).
Exercise 3.2.6 Use Python to verify the entries in Table 3.12.

d      4 · trace(Q(d)+)
4      4 + 1
16     (4 + 1)(16 + 1)
256    (4 + 1)(16 + 1)(256 + 1)

Table 3.12 Trace of pseudo-inverse (§2.3) of Q(d).

Exercise 3.2.7 Verify (3.2.12). Conclude Q(d) is nonnegative, hence a variance matrix.
Exercise 3.2.8 Let P be a projection matrix (§2.7). Show the eigenvalues of
P are 0 and 1. Which vectors are eigenvectors for 1, and which for 0?

3.3 Graphs

Graph theory is a kind of linear geometry, and depends on the material


already covered. As such, the study of graphs is an application of the material

in the previous sections. Since graph theory is the start of neural networks,
we study it here.
A graph consists of nodes and edges. For example, the graphs in Figure
3.13 each have four nodes and three edges. The left graph is directed, in that
a direction is specified for each edge. The graph on the right is undirected, no
direction is specified.

Fig. 3.13 Directed and undirected graphs.

In a directed graph, if there is an edge pointing from node i to node j, we


say (i, j) is an edge. For undirected graphs, we say i and j are adjacent.


Fig. 3.14 A weighed directed graph.

An edge (i, j) is weighed if a scalar wij is attached to it. If every edge in a


graph is weighed, then the graph is a weighed graph. Any two nodes may be
considered adjacent by assigning the weight zero to the edge between them.
In §4.4, back propagation on weighed directed graphs is used to calculate
derivatives.

Let wij be the weight on the edge (i, j) in a weighed directed graph. The
weight matrix of a weighed directed graph is the matrix W = (wij ).
If the graph is unweighed, then we set A = (aij), where

aij = { 1, if i and j are adjacent,
        0, if not.

In this case, A consists of ones and zeros, and is called the adjacency matrix.
If the graph is also undirected, then the adjacency matrix is symmetric,

aij = aji .

Fig. 3.15 A double edge and a loop.

Sometimes graphs may have multiple edges between nodes, or loops, which
are edges starting and ending at the same node. A graph is simple if it has
no loops and no multiple edges. In this section, we deal only with simple
undirected unweighed graphs.
To summarize, a simple undirected graph G = (V, E) is a collection V
of nodes, and a collection of edges E, each edge corresponding to a pair of
nodes.
The number of nodes is the order n of the graph, and the number of edges
is the size m of the graph. In a (simple undirected) graph of order n, the
number of pairs of nodes is n-choose-2, so the number of edges satisfies
 
0 ≤ m ≤ n-choose-2 = n(n − 1)/2.

How many graphs of order n are there? Since graphs are built out of
edges, the answer depends on how many subsets of edges you can grab from
a maximum of n(n − 1)/2 edges. The number of subsets of a set with m
elements is 2m , so the number Gn of graphs with n nodes is
Gn = 2^(n-choose-2) = 2^(n(n−1)/2).

For example, the number of graphs with n = 5 is 2^(5(5−1)/2) = 2^10 = 1,024, and the number of graphs with n = 10 is

n = 10 =⇒ Gn = 2^45 = 35,184,372,088,832.

When m = 0, there are no edges, and we say the graph is empty. When
m = n(n − 1)/2, there are the maximum number of edges, and we say the
graph is complete. The complete graph with n nodes is written Kn (Figure
3.16).

Fig. 3.16 The complete graph K6 and the cycle graph C6 .

The cycle graph Cn with n nodes is as in Figure 3.16. The graph Cn has
n edges. The cycle graph C3 is a triangle.

A graph G′ is a subgraph of a graph G if every node of G′ is a node of G,


and every edge of G′ is an edge of G. For example, a triangle in G is a
triangle that is a subgraph of G. Below we see the graph K6 in Figure 3.16
contains twenty triangles.

Fig. 3.17 The triangle K3 = C3 .

Let v be a node in a (simple, undirected) graph G. The degree of v is the


number dv of edges containing v. If the nodes are labeled 1, 2, . . . , n, with
the degrees in decreasing order, then

d1 ≥ d2 ≥ d3 ≥ · · · ≥ dn

is the degree sequence of the graph. We write



(d1 , d2 , d3 , . . . , dn )

for the degree sequence.


If we add the degrees over all nodes, we obtain the number of edges counted
twice, because each edge contains two nodes. Thus we have

Handshaking Lemma

If the order is n, the size is m, and the degrees are d1 , d2 , . . . , dn ,


then

d1 + d2 + · · · + dn = 2m.

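As a quick illustration, the degree of node i is the sum of row i of the adjacency matrix A, so the handshaking lemma is one line of numpy. The adjacency matrix below is that of the cycle graph C4, a made-up example.

from numpy import *

# adjacency matrix of the cycle graph C4 (example)
A = array([[0,1,0,1],
           [1,0,1,0],
           [0,1,0,1],
           [1,0,1,0]])

degrees = A.sum(axis=1)       # degree sequence
m = A.sum()/2                 # number of edges
print(sum(degrees) == 2*m)    # True: the handshaking lemma
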
A node is isolated if its degree is zero. A node is dominating if it has the


highest degree. Notice the highest degree is ≤ n − 1, because there are no
loops. We show

Nodes with Equal Degree

In any graph, there are at least two nodes with the same degree.

To see this, we consider two cases. First case, assume there are no isolated
nodes. Then the degree sequence is

n − 1 ≥ d1 ≥ d2 ≥ · · · ≥ dn ≥ 1.

So we have n integers spread between 1 and n − 1. This can’t happen unless


at least two of these integers are equal. This completes the first case. In the
second case, we have at least one isolated node, so dn = 0. If dn−1 = 0
also, then we have found two nodes with the same degree. If not, then the
maximum degree is n − 2 (because node n is isolated), and

n − 2 ≥ d1 ≥ d2 ≥ · · · ≥ dn−1 ≥ 1.

So now we have n − 1 integers spread between 1 and n − 2. This can’t happen


unless at least two of these integers are equal. This completes the second
case.

A graph is regular if all the node degrees are equal. If the node degrees are
all equal to k, we say the graph is k-regular. From the handshaking lemma,
for a k-regular graph, we have kn = 2m, so
m = kn/2.

For example, because 2m is even, there are no 3-regular graphs with 11 nodes.
Both Kn and Cn are regular, with Kn being (n − 1)-regular, and Cn being
2-regular.
A walk on a graph is a sequence of nodes v1 , v2 , v3 , . . . where each
consecutive pair vi , vi+1 of nodes are adjacent. For example, if v1 , v2 , v3 ,
v4 , v5 , v6 are the nodes (in any order) of the complete graph K6 , then
v1 → v2 → v3 → v4 → v2 is a walk. A path is a walk with no backtracking: A
path visits each node at most once. A closed walk is a walk that ends where
it starts. A cycle is a closed walk with no backtracking.
Two nodes a and b are connected if there is a walk starting at a and ending
at b. If a and b are connected, then there is a path starting at a and ending
at b, since we can cut out the cycles of the walk. A graph is connected if every
two nodes are connected. A graph is disconnected if it is not connected. For
example, Figure 3.16 may be viewed as two connected graphs K6 and C6 , or
a single disconnected graph K6 ∪ C6 .

Consider a graph with order n. The adjacency matrix is the n × n matrix


A = (aij) given by

aij = { 1, if i and j are adjacent,
        0, if not.

For example, the empty graph has adjacency matrix given by the zero matrix.
Since our graphs are undirected, the adjacency matrix is symmetric.
Let 1 be the vector 1 = (1, 1, 1, . . . , 1). The adjacency matrix of the com-
plete graph Kn is the n × n matrix A with all ones except on the diagonal.
If I is the n × n identity matrix, then this adjacency matrix is

A=1⊗1−I

For example, for the triangle K3 ,


     
A = 1 ⊗ 1 − I = [ 1  1  1 ]   [ 1  0  0 ]   [ 0  1  1 ]
                [ 1  1  1 ] − [ 0  1  0 ] = [ 1  0  1 ] .
                [ 1  1  1 ]   [ 0  0  1 ]   [ 1  1  0 ]

If we label the nodes of the cycle graph Cn consecutively, then node i
shares an edge with i − 1 and i + 1, except when i = 1 and i = n. Node 1
shares an edge with 2 and n, and node n shares an edge with n − 1 and 1.
So for C6 the adjacency matrix is

A = [ 0  1  0  0  0  1
      1  0  1  0  0  0
      0  1  0  1  0  0
      0  0  1  0  1  0
      0  0  0  1  0  1
      1  0  0  0  1  0 ] .

Notice there are ones on the sub-diagonal, and ones on the super-diagonal,
and ones in the upper-right and lower-left corners.

For any adjacency matrix A, the sum of each row is equal to the degree of
the node corresponding to that row. This is the same as saying
 
A1 = (d1, d2, . . . , dn).

In particular, for a k-regular graph, we have

A1 = k1,

so for a k-regular graph, k is an eigenvalue of A.


What is the connection between degrees and eigenvalues in general? To
explain this, let λ be an eigenvalue of A with eigenvector v = (v1 , v2 , . . . , vn ),
so Av = λv. Since a multiple tv of v is also an eigenvector, we may assume
the biggest component of v equals 1. Suppose the nodes are labeled so that
v = (1, v2 , v3 , . . . , vn ), with

v1 = 1 ≥ |vj |, j = 2, 3, . . . , n.

Taking the first component of Av = λv, we have

(Av)1 = a11 v1 + a12 v2 + a13 v3 + · · · + a1n vn .

Since the sum a11 + a12 + · · · + a1n equals the degree d1 of node 1, this implies

d1 = a11 +a12 +· · ·+a1n ≥ a11 v1 +a12 v2 +a13 v3 +· · ·+a1n vn = (Av)1 = λv1 = λ.

Since d1 is one of the degrees, d1 is no greater than the maximum degree.


This explains

Maximum Degree of Graph

If λ is any eigenvalue of the adjacency matrix A, then λ is less or equal


to the maximum degree.

In particular, for a k-regular graph, the maximum degree equals k, and we


already saw k is an eigenvalue, so

Top Eigenvalue

For a k-regular graph, k is the top eigenvalue of the adjacency matrix


A.

Let A = 1 ⊗ 1 − I be the adjacency matrix of complete graph Kn . Then


for any vector v orthogonal to 1,

Av = (1 ⊗ 1 − I)v = (1 · v)1 − v = 0 − v = −v,

so λ = −1 is an eigenvalue with multiplicity n − 1. Since

A1 = (1 · 1)1 − 1 = n1 − 1 = (n − 1)1,

n − 1 is an eigenvalue. Hence the eigenvalues of A are n − 1 with multiplicity


1 and −1 with multiplicity n − 1.
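Here is a minimal numerical check (for n = 5, so the eigenvalues should be 4 once and −1 four times):

from numpy import *
from numpy.linalg import eigvalsh

n = 5
A = ones((n,n)) - identity(n)    # adjacency matrix of K_n
print(eigvalsh(A).round(2))      # [-1. -1. -1. -1.  4.]
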

Let A be the adjacency matrix of the cycle graph Cn . Since Cn is 2-regular,


the top eigenvalue of A is 2. Since A is a circulant matrix, the method used
to find the eigenvalues of Q(d) in §3.2 works here. However, it is immediate
that
A = 2I − Q(n).
From this and by (3.2.14), the eigenvalues of A are

2 cos(2πk/n), k = 0, 1, 2, . . . , n − 1,

and the eigenvectors of A are the eigenvectors of Q(n).
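For example, with the numpy version of Q(d) from §3.2, the eigenvalues of the adjacency matrix of C6 come out as 2, 1, 1, −1, −1, −2; a quick check:

from numpy import *
from numpy.linalg import eigvalsh

n = 6
A = 2*identity(n) - Q(n)             # adjacency matrix of C_6
print(eigvalsh(A).round(2))          # [-2. -1. -1.  1.  1.  2.]
print(2*cos(2*pi*arange(n)/n))       # same eigenvalues, unsorted
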

The complement of graph G is the graph Ḡ obtained by switching 1’s and


0’s, so the adjacency matrix Ā of Ḡ is

Ā = A(Ḡ) = 1 ⊗ 1 − I − A(G).

Let G be a k-regular graph, and suppose k = λ1 ≥ λ2 ≥ · · · ≥ λn are the


eigenvalues of A = A(G). Since A is symmetric, we have an orthogonal basis
of eigenvectors v1 , v2 , . . . , vn , with v1 = 1. Then Ḡ is an (n − 1 − k)-regular
graph, so the top eigenvalue of Ā = A(Ḡ) is n−1−k, with eigenvector v1 = 1.
If vk is any eigenvector of A other than 1, then vk is orthogonal to 1, hence

Āvk = (1 ⊗ 1 − I − A)vk = −vk − λk vk = (−1 − λk )vk .

Hence the eigenvalues of Ā are n − 1 − k and −1 − λk , k = 2, . . . , n, with


the same eigenbasis.

Now we look at powers of the adjacency matrix A. By definition of matrix-


matrix multiplication,
(A^2)ij = i-th row × j-th column = ai1 a1j + ai2 a2j + · · · + ain anj .

Now aik akj is either 0 or 1, and equals 1 exactly if there is a 2-step path from
i to j. Hence

(A2 )ij = number of 2-step walks connecting i and j.

Notice a 2-step walk between i and j is the same as a 2-step path between i
and j.
When i = j, (A2 )ii is the number of 2-step paths connecting i and i, which
means number of edges. Since this counts edges twice, we have
(1/2) trace(A^2) = m = number of edges.
Similarly, (A3 )ij is the number of 3-step walks connecting i and j. Since
a 3-step walk from i to i is the same as a triangle, (A3 )ii is the number
of triangles in the graph passing through i. Since the trace is the sum of
the diagonal elements, trace(A3 ) counts the number of triangles. But this
overcounts by a factor of 3! = 6, since three labels may be rearranged in six
ways. Hence

(1/6) trace(A^3) = number of triangles.
Loops, Edges, Triangles

Let A be the adjacency matrix. Then


• trace(A) = number of loops = 0,
• trace(A2 ) = 2 × number of edges,

• trace(A3 ) = 6 × number of triangles.
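These counts are easy to check numerically. Here is a minimal sketch for the complete graph K4, which has six edges and four triangles.

from numpy import *

n = 4
A = ones((n,n)) - identity(n)   # adjacency matrix of K_4
A2 = dot(A,A)
A3 = dot(A2,A)
print(trace(A))        # 0.0, no loops
print(trace(A2)/2)     # 6.0, the number of edges
print(trace(A3)/6)     # 4.0, the number of triangles
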

Let us compute these for the complete graph Kn . Since

(u ⊗ v)2 = (u ⊗ v)(u ⊗ v) = (u · v)(u ⊗ v),

and 1 · 1 = n, we have (1 ⊗ 1)2 = n1 ⊗ 1. So

A2 = (1 ⊗ 1 − I)2 = (1 ⊗ 1)2 − 21 ⊗ 1 + I = (n − 2)1 ⊗ 1 + I.

Since trace(u ⊗ v) = u · v, we have trace(1 ⊗ 1) = n. Hence

trace(A2 ) = trace((n − 2)1 ⊗ 1 + I) = n(n − 2) + n = n(n − 1).

This is correct because for a complete graph, n(n − 1)/2 is the number of
edges.
Continuing,

A3 = A2 A = ((n − 2)1 ⊗ 1 + I)(1 ⊗ 1 − I)


= n(n − 2)1 ⊗ 1 − (n − 2)1 ⊗ 1 + 1 ⊗ 1 − I
= (n2 − 3n + 3)1 ⊗ 1 − I.

From this, we get

trace(A3 ) = n(n2 − 3n + 3) − n = n(n2 − 3n + 2) = n(n − 1)(n − 2).

This is correct because for a complete graph, we have a triangle whenever


we have a triple of nodes, and there are n-choose-3 triples, which equals
n(n − 1)(n − 2)/6.
Remember, a graph is connected if there is a walk connecting any two
nodes. Since there is a 4-step walk between i and j exactly when there are r,
s, and t satisfying
air ars ast atj = 1,
we see there is a 4-step walk connecting i and j if (A4 )ij > 0. Hence

Connected Graph

Let A be the adjacency matrix. Then the graph is connected if for


every i ̸= j, there is a k with (Ak )ij > 0.

Two graphs are isomorphic if a re-labeling of the nodes in one makes it


identical to the other. To explain this, we need permutations.

A permutation on n letters is a re-arrangement of 1, 2, 3,. . . , n. Here are


two permutations of (1, 2, 3, 4),
   
( 1 2 3 4 )      ( 1 2 3 4 )
( 4 3 2 1 ) ,    ( 4 3 1 2 ) .

There are n! permutations of (1, 2, . . . , n). If a permutation sends i to j, we


write i → j. Since a permutation is just a re-labeling, if i → k and j → k,
then we must have i = j.
Each permutation leads to a permutation matrix. A permutation matrix
is a matrix of zeros and ones, with only one 1 in any column or row. For
example, the above permutations correspond to the 4 × 4 matrices
   
P = [ 0  0  0  1        P = [ 0  0  0  1
      0  0  1  0              0  0  1  0
      0  1  0  0              1  0  0  0
      1  0  0  0 ] ,          0  1  0  0 ] .

In general, the permutation matrix P has Pij = 1 if i → j, and Pij = 0


if not. If P is any permutation matrix, then Pik Pjk equals 1 if both i → k
and j → k. In other words, Pik Pjk = 1 if i = j and i → k, and Pik Pjk = 0
otherwise. Since i → k for exactly one k,
(P P^t)ij = Pi1 Pj1 + Pi2 Pj2 + · · · + Pin Pjn = { 1, if i = j,
                                                    0, if i ≠ j.

Hence P is orthogonal,

P P t = I, P −1 = P t .

Using permutation matrices, we can say two graphs are isomorphic if their
adjacency matrices A, A′ satisfy

A′ = P AP −1 = P AP t

for some permutation matrix P .
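Given a candidate permutation matrix P, the condition A′ = P AP^t is one line of numpy. Here is a sketch with a small made-up example: two labelings of the path graph on three nodes.

from numpy import *

# path graph on 3 nodes, center labeled 2
A = array([[0,1,0],
           [1,0,1],
           [0,1,0]])
# same path graph, center labeled 1
A1 = array([[0,1,1],
            [1,0,0],
            [1,0,0]])
# permutation swapping labels 1 and 2
P = array([[0,1,0],
           [1,0,0],
           [0,0,1]])

print(allclose(A1, dot(P, dot(A, P.T))))   # True: the graphs are isomorphic
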


If two graphs are isomorphic, then it is easy to check their degree sequences
are equal. However, the converse is not true. Figure 3.18 displays two non-
isomorphic graphs with degree sequences (3, 2, 2, 1, 1, 1). These graphs are
non-isomorphic because in one graph, there are two degree-one nodes adjacent
to a degree-three node, while in the other graph, there is only one degree-one
node adjacent to a degree-three node.

A graph is bipartite if the nodes can be divided into two groups, with
adjacency only between nodes across groups. If we call the two groups even

Fig. 3.18 Non-isomorphic graphs with degree sequence (3, 2, 2, 1, 1, 1).

and odd, then odd nodes are never adjacent to odd nodes, and even nodes
are never adjacent to even nodes.
The complete bipartite graph is the bipartite graph with maximum num-
ber of edges: Every odd node is adjacent to every even node. The complete
bipartite graph with n odd nodes and m even nodes is written Knm . Then
the order of Knm is n + m.

Fig. 3.19 Complete bipartite graph K53 .

Let a = (1, 1, . . . , 1, 0, 0, . . . , 0) be the vector with n ones and m zeros, and


let b = 1 − a. Then b has n zeros and m ones, and the adjacency matrix of
Knm is
A = A(Knm ) = a ⊗ b + b ⊗ a.
For example, the adjacency matrix of K53 is A = A(Knm), which equals

A = [ 0  0  0  1  1  1  1  1
      0  0  0  1  1  1  1  1
      0  0  0  1  1  1  1  1
      1  1  1  0  0  0  0  0
      1  1  1  0  0  0  0  0
      1  1  1  0  0  0  0  0
      1  1  1  0  0  0  0  0
      1  1  1  0  0  0  0  0 ] .

Recall we have
(a ⊗ b)v = (b · v)a.

From this, we see the column space of A = a⊗b+b⊗a is span(a, b). Thus the
rank of A is 2, and the nullspace of A consists of the orthogonal complement
span(a, b)⊥ of span(a, b). Using this, we compute the eigenvalues of A.
Since the nullspace is span(a, b)⊥ , any vector orthogonal to a and to b is an
eigenvector for λ = 0. Hence the eigenvalue λ = 0 has multiplicity n + m − 2.
Since trace(A) = 0, the sum of the eigenvalues is zero, and the remaining two
eigenvalues are ±λ ̸= 0.
Let v be an eigenvector for λ ̸= 0. Because eigenvectors corresponding
to distinct eigenvalues of a symmetric matrix are orthogonal (see §3.2), v is
orthogonal to the nullspace of A, so v must be a linear combination of a and
b, v = ra + sb. Since a · b = 0,

Aa = nb, Ab = ma.

Hence
λv = Av = A(ra + sb) = rnb + sma.
Applying A again,

λ2 v = A2 v = A(rnb + sma) = rnma + smnb = nm(ra + sb) = nmv.



Hence λ² = nm, so λ = ±√(nm). We conclude the eigenvalues of Knm are

√(nm), 0, 0, . . . , 0, −√(nm),    (with 0 repeated n + m − 2 times).

For example, for the graph in Figure 3.19, the nonzero eigenvalues are λ = ±√(3 × 5) = ±√15.

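This is easily confirmed numerically for K53: the nonzero eigenvalues below come out as ±3.873 ≈ ±√15.

from numpy import *
from numpy.linalg import eigvalsh

n, m = 5, 3
a = concatenate([ones(n), zeros(m)])
b = 1 - a
A = outer(a,b) + outer(b,a)       # adjacency matrix of K53
print(eigvalsh(A).round(3))       # -3.873, six zeros, 3.873
print(sqrt(n*m))                  # 3.8729...
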
Let G be a graph with n nodes and m edges. The incidence matrix of G


is a matrix whose rows are indexed by the edges, and whose columns are
indexed by the nodes. Therefore, the incidence matrix has shape m × n.
By placing arrows along the edges, we can make G into a directed graph.
In a directed graph, each edge has a tail node and a head node. Then the
incidence matrix is given by

1,
 if node j is the head of edge i,
Bij = −1, if node j is the tail of edge i,

0, if node j is not on edge i.

The laplacian of a graph G is the symmetric n × n matrix

L = B t B.

Both the laplacian matrix and the adjacency matrix are n × n. What is the
connection between them?

Laplacian

The laplacian satisfies


L = D − A,
where D = diag(d1 , d2 , . . . , dn ) is the diagonal degree matrix.

For example, for the cycle graph C6 , the degree matrix is 2I, and the
laplacian is the matrix we saw in §3.2,
 
L = Q(6) = [  2  -1   0   0   0  -1
             -1   2  -1   0   0   0
              0  -1   2  -1   0   0
              0   0  -1   2  -1   0
              0   0   0  -1   2  -1
             -1   0   0   0  -1   2 ] .

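Here is a minimal numeric check for the cycle graph C6: build an incidence matrix B by directing edge i from node i to node i + 1 mod 6, and verify B^tB = D − A = Q(6). It assumes the numpy version of Q(d) from §3.2.

from numpy import *

n = 6
# incidence matrix: edge i has tail i and head (i+1) mod n
B = zeros((n,n))
for i in range(n):
    B[i,i] = -1
    B[i,(i+1) % n] = 1

A = 2*identity(n) - Q(n)   # adjacency matrix of C_6
D = 2*identity(n)          # degree matrix: C_6 is 2-regular
L = dot(B.T, B)

print(allclose(L, D - A))  # True
print(allclose(L, Q(n)))   # True
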
3.4 Singular Value Decomposition

In this section, we discuss the singular value decomposition (U, S, V ) of a


matrix A.
Let A be a matrix. We say a real number σ is a singular value of A if there
are nonzero vectors v and u satisfying

Av = σu and At u = σv. (3.4.1)

When this happens, v is a right singular vector and u is a left singular vector
associated to σ.
When (3.4.1) holds, so does

Av = (−σ)(−u), At (−u) = (−σ)v.

Because of this, to eliminate ambiguity, it is standard to assume σ ≥ 0;


henceforth we shall insist singular values are positive or zero.
Contrast singular values with eigenvalues: While eigenvalues may be pos-
itive, negative, or zero, singular values are positive or zero, never negative.
The definition immediately implies

Singular Values of A Versus A Transpose

The singular values of A and the singular values of At are the same.

Contrast this with the analogous result for eigenvalues in §3.2.


We work out our first example. Let
A = [ 1  1
      0  1 ] .

Then Av = λv implies λ = 1 and v = (1, 0). Thus A has only one eigenvalue
equal to 1, and only one eigenvector. Set
 
Q = A^t A = [ 1  1
              1  2 ] .

Since Q is symmetric, Q has two eigenvalues λ1 , λ2 and corresponding eigen-


vectors v1 , v2 . Moreover, as we saw in §3.2, v1 , v2 may be chosen orthonormal.
The eigenvalues of Q are given by

0 = det(Q − λI) = λ2 − 3λ + 1.

By the quadratic formula,


√ √
3 5 3 5
λ1 = + = 2.62, λ2 = − = 0.38.
2 2 2 2
Now we turn to singular values. If v and u and σ satisfy (3.4.1), then

Qv = At Av = At (σu) = σ 2 v. (3.4.2)

Hence σ 2 = λ, and we obtain


σ1 = √(3/2 + √5/2) = 1.62,    σ2 = √(3/2 − √5/2) = 0.62.

To make (3.4.1) work, we set u1 = Av1 /σ1 . Then Av1 = σ1 u1 , and

At u1 = At Av1 /σ1 = Qv1 /σ1 = λ1 v1 /σ1 = σ1 v1 .

Thus v1 , u1 are right and left singular vectors corresponding to the singular
value σ1 of A. Similarly, if we set u2 = Av2 /σ2 , then v2 , u2 are right and left
singular vectors corresponding to the singular value σ2 of A.
We show v1 , v2 are orthonormal, and u1 , u2 are orthonormal. We already
know v1 , v2 are orthonormal, because they are orthonormal eigenvectors of
the symmetric matrix Q. Also

0 = λ1 v1 ·v2 = Qv1 ·v2 = (At Av1 )·v2 = (Av1 )·(Av2 ) = σ1 u1 ·σ2 u2 = σ1 σ2 u1 ·u2 .

Since σ1 ̸= 0, σ2 ̸= 0, it follows u1 , u2 are orthogonal. Also

λ1 = λ1 v1 · v1 = Qv1 · v1 = (At Av1 ) · v1 = (Av1 ) · (Av1 ) = σ12 u1 · u1 .

Since λ1 = σ12 , u1 · u1 = 1. Similarly, u2 · u2 = 1. This shows u1 , u2 are


orthonormal, and completes the first example.
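The singular values computed by hand above can be double-checked with numpy; svd returns them in decreasing order.

from numpy import *
from numpy.linalg import svd

A = array([[1,1],[0,1]])
sigma = svd(A, compute_uv=False)
print(sigma.round(2))                  # [1.62 0.62]
print(sqrt((3 + sqrt(5))/2).round(2))  # 1.62
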

Let A be an N × d matrix, and suppose σ1 ≥ σ2 ≥ · · · ≥ σr are positive


singular values with corresponding left singular vectors u1 , u2 , . . . , ur and
right singular vectors v1 , v2 , . . . , vr . Then, since uk = Avk /σk , the vectors
u1 , u2 , . . . , ur are in the column space of A.
If u1 , u2 , . . . , ur are linearly independent, it follows r is no larger than
rank(A), hence r is no larger than min(N, d). We seek the largest value of r.
Below we show the largest r is min(N, d), the lesser of d and N .

The close connection between singular values σ of A and eigenvalues λ of


Q = At A carries over in the general case.

A Versus Q = At A

Let A be any matrix. Then


• the rank of A equals the rank of Q,
• σ is a singular value of A iff λ = σ 2 is an eigenvalue of Q.

Since the rank equals the dimension of the row space, the first part follows
from §2.4. If Av = σu and At u = σv, then

Qv = At Av = At (σu) = σAt u = σ 2 v,

so λ = σ 2 is an eigenvalue of Q.
Conversely, if Qv = λv, then λ ≥ 0, so there are two cases. If λ > 0, set σ = √λ and u = Av/σ. Then

Av = σu, At u = At Av/σ = Qv/σ = λv/σ = σv

This shows σ is a singular value of A with singular vectors u and v.


If λ = 0, then we take σ = 0, and the correct interpretation of the second
part is the null space of Q equals the null space of A, which we already know.
From §3.2, the number of positive eigenvalues (possibly repeated) of Q
equals the rank of Q. By the above, we conclude the rank of A equals the
number of positive singular values (possibly repeated) of A.

More explicitly, the above result may be phrased as



Singular Value Decomposition (SVD)

Let A be any matrix, and let r be the rank of A. Then there are
r positive singular values σk , an orthonormal basis uk of the target
space, and an orthonormal basis vk of the source space, such that

Avk = σk uk , At uk = σk vk , k ≤ r, (3.4.3)

and
Avk = 0, At uk = 0 for k > r. (3.4.4)

Taken together, (3.4.3) and (3.4.4) say the number of positive singular
values is exactly r. Assume A is N × d, and let p = min(N, d) be the lesser
of N and d.
Since (3.4.4) holds as long as there are vectors uk and vk , there are p − r
zero singular values. Hence there are p = min(N, d) singular values altogether.
The proof of the result is very simple once we remember the rank of Q
equals the number of positive eigenvalues of Q. By the eigenvalue decom-
position, there is an orthonormal basis vk of the source space and positive
eigenvalues λk such that Qvk = λk vk , k ≤ r, and Qvk = 0, k > r.
Setting σk = √λk and uk = Avk /σk , k ≤ r, as in our first example, we
have (3.4.3), and, again as in our first example, uk , k ≤ r, are orthonormal.
By construction, vk , k > r, is an orthonormal basis for the null space of
A, and uk , k ≤ r, is an orthonormal basis for the column space of A.
Choose uk , k > r, any orthonormal basis for the nullspace of At . Since
the column space of A is the row space of At , the column space of A is the
orthogonal complement of the nullspace of At (2.7.6). Hence uk , k ≤ r, and
uk , k > r, are orthogonal. From this, uk , k ≤ r, together with uk , k > r,
form an orthonormal basis for the target space.

For our second example, let a and b be nonzero vectors, possibly of different
sizes, and let A be the matrix

A = a ⊗ b, At = b ⊗ a.

Let v and u be right and left singular vectors corresponding to a positive


singular value σ. Then, by (1.4.17),

Av = (v · b)a = σu and At u = (u · a)b = σv.

Since σ > 0, it follows v is a multiple of b and u is a multiple of a. If we


write v = tb and u = sa and plug in, we get

v = |a| b, u = |b| a, σ = |a| |b|.



Thus there is only one positive singular value of A, equal to |a| |b|. All other
singular values are zero. This is not surprising since the rank of A is one.
Now think of the vector b as a single-row matrix B. Then, in a similar
manner, one sees the only positive singular value of B is σ = |b|.
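Numerically, for a made-up rank-one example A = a ⊗ b, the only positive singular value returned by svd is indeed |a| |b|.

from numpy import *
from numpy.linalg import svd, norm

a = array([1., 2., 2.])       # |a| = 3
b = array([3., 4.])           # |b| = 5
A = outer(a,b)                # A = a tensor b

print(svd(A, compute_uv=False))   # [15.  0.]
print(norm(a)*norm(b))            # 15.0
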
Our third example is

A = [ 0  0  0  0
      1  0  0  0
      0  1  0  0
      0  0  1  0 ] .        (3.4.5)

Then

A^t = [ 0  1  0  0
        0  0  1  0
        0  0  0  1
        0  0  0  0 ] ,

Q = A^t A = [ 1  0  0  0
              0  1  0  0
              0  0  1  0
              0  0  0  0 ] .
Since Q is diagonal symmetric, its rank is 3 and its eigenvalues are λ1 = 1,
λ2 = 1, λ3 = 1, λ4 = 0, and its eigenvectors are
       
v1 = (1, 0, 0, 0),    v2 = (0, 1, 0, 0),    v3 = (0, 0, 1, 0),    v4 = (0, 0, 0, 1).
Clearly v1 , v2 , v3 , v4 are orthonormal. By (3.4.2), σ1 = 1, σ2 = 1, σ3 = 1,


σ4 = 0.
Since we must have Av = σu, we can check that

u1 = Av1 = v2 , u2 = Av2 = v3 , u3 = Av3 = v4 , u4 = v1

satisfies (3.4.1). This completes our third example.

Let A be N × d, let U be the matrix with columns u1 , u2 , . . . , uN , and


let V be the matrix with rows v1 , v2 , . . . , vd . Then V t has columns v1 , v2 ,
. . . , vd .
Then U and V are orthogonal N × N and d × d matrices. By (3.4.3),

AV t = U S.

Right-multiplying by V and using V t V = I implies the following result.

Diagonalization (SVD)

If A is any matrix, there is a diagonal matrix S with nonnegative


diagonal entries, with the same shape as A, and orthogonal matrices

U and V , satisfying
A = U SV.
The rows of V are an orthonormal basis of right singular vectors, and
the columns of U are an orthonormal basis of left singular vectors.

In more detail, suppose A is 4 × 6. Then we have an orthonormal basis v1 ,


v2 , v3 , v4 , v5 , v6 of R6 , and an orthonormal basis u1 , u2 , u3 , u4 satisfying
(3.4.3) with r = 4. If we set
 
S = [ σ1   0   0   0   0   0
       0  σ2   0   0   0   0
       0   0  σ3   0   0   0
       0   0   0  σ4   0   0 ] ,

then U is 4 × 4 and V is 6 × 6, and we can verify directly that A = U SV .


If A is 6 × 4, and  
S = [ σ1   0   0   0
       0  σ2   0   0
       0   0  σ3   0
       0   0   0  σ4
       0   0   0   0
       0   0   0   0 ] ,
then U is 6 × 6 and V is 4 × 4, and we can verify directly that A = U SV . In
either case, S has the same shape as A.
When A = Q is a variance matrix, Q ≥ 0, then the eigenvalues are non-
negative, and, from (3.2.4), we have U EU t = Q. If we choose V = U t , we see
EVD is a special case of SVD.
In general, however, if Q has negative eigenvalues, V is not equal to U t ;
instead V is obtained from U t by re-sorting the rows of U .

In numpy, svd returns the orthogonal matrices U and V and a 1d array


sigma of singular values. The singular values are arranged in decreasing order.
To recover the diagonal matrix S, we use diag.

from numpy import *


from numpy.linalg import svd

U, sigma, V = svd(A)
# sigma is a vector

# build diag matrix S


p = min(A.shape)

S = zeros(A.shape)
S[:p,:p] = diag(sigma)

print(U.shape,S.shape,V.shape)
print(U,S,V)

allclose(A, dot(U, dot(S, V)))

This code returns True.

Given the relation between the singular values of A and the eigenvalues of
Q = At A, we also can conclude

Right Singular Vectors Are the Same as Eigenvectors

Let A be any matrix and let Q = At A.

v is an eigenvector of Q ⇐⇒ v is a right singular vector of A.


(3.4.6)

For example, if dataset is the Iris dataset (ignoring the labels), the code

from numpy import *


from numpy.linalg import svd,eigh

# center dataset
m = mean(dataset,axis=0)
A = dataset - m
# rows of V are right
# singular vectors of A
V = svd(A)[2]

# any of these will work


# because the eigenvectors are the same
Q = dot(A.T,A)
Q = cov(dataset.T,bias=False)
Q = cov(dataset.T,bias=True)

# columns of U are
# eigenvectors of Q
U = eigh(Q)[1]

# compare columns of U
# and rows of V

U, V

returns

U = [  0.36  -0.66  -0.58   0.32        V = [  0.36  -0.08   0.86   0.36
      -0.08  -0.73   0.60  -0.32              -0.66  -0.73   0.18   0.07
       0.86   0.18   0.07  -0.48               0.58  -0.60  -0.07  -0.55
       0.36   0.07   0.55   0.75 ] ,           0.32  -0.32  -0.48   0.75 ] .

This shows the columns of U are identical to the rows of V , except for the
third column of U , which is the negative of the third row of V .

Now we turn to the pseudo-inverse.

To get the Pseudo-Inverse, Invert the Positive Singular Val-


ues

The pseudo-inverse A+ is obtained by replacing positive singular val-


ues of A by their reciprocals, and taking the transpose.

More explicitly, we can write

Inverse Singular Values, and Flipped Singular Vectors

Let A have rank r, and let σk , vk , uk be the singular data as above.


Then
1 1
A+ uk = vk , (A+ )t vk = uk , k = 1, 2, . . . , r,
σk σk
and
A+ uk = 0, (A+ )t vk = 0 for k > r.

We illustrate these results in the case of a diagonal matrix


   
S = [ a  0  0  0  0
      0  b  0  0  0
      0  0  c  0  0
      0  0  0  0  0 ] ,

which in block form is S = [ Q  0 ; 0  0 ] with Q = diag(a, b, c).

Since S is 4 × 5 and SS + S = S, S + must be 5 × 4. Writing S + as blocks


and applying the four properties of the pseudo-inverse S + , leads to
S+ = [ Q−1  0
        0   0 ]
   = [ 1/a   0    0   0
        0   1/b   0   0
        0    0   1/c  0
        0    0    0   0
        0    0    0   0 ] .

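Here is a sketch in numpy: build A+ from the SVD by inverting the positive singular values, and compare with pinv. The matrix A below is a made-up example.

from numpy import *
from numpy.linalg import svd, pinv

A = array([[1.,1.,0.],
           [0.,1.,1.]])

U, sigma, V = svd(A)
# invert only the positive singular values
sigma_plus = array([ 1/s if s > 1e-10 else 0 for s in sigma ])
# S has the shape of A, so S+ has the shape of A transpose
S_plus = zeros(A.T.shape)
S_plus[:len(sigma),:len(sigma)] = diag(sigma_plus)
A_plus = dot(V.T, dot(S_plus, U.T))

print(allclose(A_plus, pinv(A)))   # True
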
Exercises

Exercise 3.4.1 Let b be a vector and let B be the matrix with the single
row b. Show σ = |b| is the only positive singular value.

3.5 Principal Component Analysis

Let Q be the variance matrix of a dataset in Rd . Then Q is a d × d symmetric


matrix, and the eigenvalue decomposition guarantees an orthonormal basis
v1 , v2 , . . . , vd in Rd consisting of eigenvectors of Q,

Qvk = λk vk , k = 1, . . . , d.

These eigenvectors are the principal components of the dataset. Principal


Component Analysis (PCA) consists of projecting the dataset onto lower-
dimensional subspaces spanned by some of the eigenvectors.
Let Q be a symmetric matrix with eigenvalue λ and corresponding eigen-
vector v, Qv = λv. If t is a scalar, then the matrix tQ has eigenvalue tλ and
corresponding eigenvector v, since

(tQ)v = tQv = tλv = (tλ)v.

Hence multiplying Q by a scalar does not change the eigenvectors.


Let A be the dataset matrix of a given dataset with N samples, and d
features. If the samples are the rows of A, then A is N × d. If we assume the
dataset is centered, then, by (2.2.14), the variance is Q = At A/N . From the
previous paragraph, the eigenvectors of the variance Q equal the eigenvectors
of At A. From (3.4.6), these are the same as the right singular vectors of A.
Thus the principal components of a dataset are the right singular vectors
of the centered dataset matrix. This shows there are two approaches to the
principal components of a dataset: Either through EVD and eigenvectors
of the variance matrix, or through SVD and right singular vectors of the
centered dataset matrix. We shall do both.
Assuming the eigenvalues are ordered top to bottom,

λ1 ≥ λ2 ≥ · · · ≥ λd ,

in PCA one takes the most significant components, those components who
eigenvalues are near the top eigenvalue. For example, one can take the top two
eigenvalues λ1 ≥ λ2 and their eigenvectors v1 , v2 , and project the dataset onto
the plane span(v1 , v2 ). The projected dataset can then be visualized as points
in the plane. Similarly, one can take the top three eigenvalues λ1 ≥ λ2 ≥
λ3 and their eigenvectors v1 , v2 , v3 and project the dataset onto the space
span(v1 , v2 , v3 ). This can then be visualized as points in three dimensions.
Recall the MNIST dataset consists of N = 60000 points in d = 784 di-
mensions. After we download the dataset,

from pandas import *


from numpy import *

mnist = read_csv("mnist.csv").to_numpy()

dataset = mnist[:,1:]
labels = mnist[:,0]

we compute Q, the total variance, and the eigenvalues, as percentages of the


total variance. We also name the targets as labels for later use.

Fig. 3.20 MNIST eigenvalues as a percentage of the total variance.

The left column in Figure 3.20 lists the top twenty eigenvalues as a per-
centage of their sum. For example, the top eigenvalue λ1 is around 10% of the
total variance. The right column lists the cumulative sums of the eigenvalues,

so the third entry in the right column is the sum of the top three eigenvalues,
λ1 + λ2 + λ3 = 22.97%.

Fig. 3.21 MNIST eigenvalue percentage plot.

This results in Figures 3.20 and 3.21. Here we sort the array eig in
decreasing order, then we cumsum the array to obtain the cumulative sums.
Because the rank of the MNIST dataset is 712 (§2.9), the bottom 72 =
784 − 712 eigenvalues are exactly zero. A full listing shows that many more
eigenvalues are near zero, and the second column in Figure 3.20 shows the
top ten eigenvalues alone sum to almost 50% of the total variance.

Q = cov(dataset.T)
totvar = Q.trace()

from numpy.linalg import eigh

# use eigh for symmetric matrices


lamda, U = eigh(Q)

# sort in ascending order then reverse


sorted = sort(lamda)[::-1]
percent = sorted*100/totvar

# cumulative sums
sums = cumsum(percent)

data = array([percent,sums])
print(data.T[:20].round(decimals=3))

d = len(lamda)
from matplotlib.pyplot import stairs

stairs(percent,range(d+1))

A MNIST image is a point in R784 . Now we turn to projecting the image


from 784 dimensions down to n dimensions, where n is 784, 600, 350, 150,
50, 10, 1. Let Q be any d × d variance matrix, and let v be in Rd . Let v1 ,
v2 , . . . , vd be the orthonormal basis of eigenvectors corresponding to the
eigenvalues of Q, arranged in decreasing order. Here is code that returns the
projection matrix P (§2.7) onto the span of the eigenvectors v1 , v2 , . . . , vn
corresponding to the top n eigenvalues of Q.

from numpy import *


from numpy.linalg import eigh

# projection matrix onto top n


# eigenvectors of variance
# of dataset

def pca(dataset,n):
    Q = cov(dataset.T)
    # columns of U are
    # eigenvectors of Q
    lamda, U = eigh(Q)
    # decreasing eigenvalue sort
    order = lamda.argsort()[::-1]
    # top n columns of U, sorted,
    # are the columns of V
    V = U[:,order[:n]]
    P = dot(V,V.T)
    return P

In the code, lamda is sorted in decreasing order, and the sorting order is
saved as order. To obtain the top n eigenvectors, we sort the first n columns
U[:,order[:n]] in the same order, resulting in the d×n matrix V . The code
then returns the projection matrix P = V V t (2.7.4).
Instead of working with the variance Q, as discussed at the start of the
section, we can work directly with the dataset, using svd, to obtain the
eigenvectors.

from numpy import *


from numpy.linalg import svd

# projection matrix onto top n


# eigenvectors of variance
# of dataset

def pca_with_svd(dataset,n):
    # center dataset
    m = mean(dataset,axis=0)
    vectors = dataset - m
    # rows of V are
    # right singular vectors
    V = svd(vectors)[2]
    # no need to sort, already decreasing order
    U = V[:n].T # top n rows as columns
    P = dot(U,U.T)
    return P

Let v = dataset[1] be the second image in the MNIST dataset, and let
Q be the variance of the dataset. Then the code below returns the image
compressed down to n = 784, 600, 350, 150, 50, 10, 1 dimensions, returning
Figure 1.4.

from matplotlib.pyplot import *

figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4

v = dataset[1] # second image


display_image(v,rows,cols,1)

for i,n in enumerate([784,600,350,150,50,10,1],start=2):
    # either will work
    P = pca_with_svd(dataset,n)
    P = pca(dataset,n)
    projv = dot(P,v)
    A = reshape(projv,(28,28))
    subplot(rows, cols,i)
    imshow(A,cmap="gray_r")

If you run out of memory trying this code, cut down the dataset from
60,000 points to 10,000 points or fewer. The code works with pca or with
pca_with_svd.

We now show how to project a vector v in the dataset using sklearn. The
following code sets up the PCA engine using sklearn.

from sklearn.decomposition import PCA

N = len(dataset)
n = 10
engine = PCA(n_components = n)

The following code computes the reduced dataset (§2.7)

reduced = engine.fit_transform(dataset)
reduced.shape

and returns (N, n) = (60000, 10). The following code computes the projected
dataset

projected = engine.inverse_transform(reduced)
projected.shape

and returns (N, d) = (60000, 784).


Let U be the d × n matrix with columns the top n eigenvectors. Then the
projection matrix onto the column space of U (project_to_ortho in §2.7)
is P = U U t . In the above code, reduced equals U t v for each image v, and
projected is U U t v for each image v.
Then the code

from matplotlib.pyplot import *

figure(figsize=(10,5))
# eight subplots
rows, cols = 2, 4

v = dataset[1] # second image


display_image(v,rows,cols,1)

for i,n in enumerate([784,600,350,150,50,10,1],start=2):
    engine = PCA(n_components = n)
    reduced = engine.fit_transform(dataset)
    projected = engine.inverse_transform(reduced)
    projv = projected[1] # second image
    A = reshape(projv,(28,28))
    subplot(rows, cols,i)
    imshow(A,cmap="gray_r")

returns Figure 3.22.



Fig. 3.22 Original and projections: n = 784, 600, 350, 150, 50, 10, 1.

Now we project all vectors of the MNIST dataset onto two and three
dimensions, those corresponding to the top two or three eigenvalues. To start,
we compute reduced as above with n = 3, the top three components.
In the two-dimensional plotting code below, reduced is an array of shape
(60000,3), but we use only the top two components 0 and 1. When the
rows are plotted as a scatterplot, we obtain Figure 3.23. Note the rows are
plotted grouped by color, to match the legend, and each plot point’s color is
determined by the value of its label.

from matplotlib.pyplot import *

Colors = ('blue', 'red', 'green', 'orange', 'gray', 'cyan',
          'turquoise', 'black', 'orchid', 'brown')

for i,color in enumerate(Colors):
    scatter(reduced[labels==i,0], reduced[labels==i,1],
            label=i, c=color, edgecolor='black')

grid()
legend(loc='upper right')
show()

Fig. 3.23 The full MNIST dataset (2d projection).

Fig. 3.24 The Iris dataset (2d projection).

Code for the 2d plot (Figure 3.24) of the Iris dataset is

from matplotlib.pyplot import *

Colors = ['blue', 'red', 'green']


Classes = ["Iris-setosa", "Iris-virginica", "Iris-versicolor"]

for a,b in zip(Classes,Colors):
    scatter(reduced[labels==a,0], reduced[labels==a,1],
            label=a, c=b, edgecolor='black')

grid()
legend(loc='upper right')
show()

Now we turn to three dimensional plotting. Here is the code

%matplotlib ipympl
from matplotlib.pyplot import *
from mpl_toolkits import mplot3d

ax = axes(projection='3d')
ax.set_axis_off()

Colors = ('blue', 'red', 'green', 'orange', 'gray', 'cyan',
          'turquoise', 'black', 'orchid', 'brown')

for i,color in enumerate(Colors):
    ax.scatter(reduced[labels==i,0], reduced[labels==i,1],
               reduced[labels==i,2], label=i, c=color, edgecolor='black')

legend(loc='upper right')
show()

The three dimensional plot of the complete MNIST dataset is Figure 1.5
in §1.2. The command %matplotlib ipympl allows the figure to be rotated
and scaled.

3.6 Cluster Analysis

Cluster analysis seeks to partition a dataset into groups or clusters based on


selected criteria, such as proximity in distance.
Let x1 , x2 , . . . , xN be a dataset in Rd . The simplest algorithm is k-means
clustering. The algorithm is iterative: We start with k means m1 , m2 , . . . , mk ,
not necessarily part of the dataset, and we divide the dataset into k clusters,
where the i-th cluster consists of the points x in the dataset for which mean
mi is nearest to x.
The algorithm is in two parts, the assignment step and the update step.
Initially the means m1 , m2 , . . . , mk are chosen at random, or by an edu-
cated guess, then clusters C1 , C2 , . . . , Ck are assigned, then each mean is
recomputed as the mean of each cluster.

The sklearn package contains clustering routines, but here we write the
code from scratch to illustrate the ideas. Here is an animated gif illustrating
the convergence of the algorithm.
Assume the means are given as a list of length k,

means = [ means[0], means[1], ... ]

and each cluster is a list of points (so clusters is a list of lists)

clusters = [ clusters[0], clusters[1], ... ]

such that

N == sum([ len(cluster) for cluster in clusters] )

Given a point x, we first select the mean closest to x:

from numpy import *


from numpy.linalg import norm

def nearest_index(x,means):
    i = 0
    for j,m in enumerate(means):
        n = means[i]
        if norm(x - m) < norm(x - n): i = j
    return i

Starting with empty clusters (k is the number of clusters), we iterate the


assign/update steps until the means no longer change. If any clusters remain
empty, we discard them. Here is the assignment step.

def assign_clusters(dataset,means):
    clusters = [ [ ] for m in means ]
    for x in dataset:
        i = nearest_index(x,means)
        clusters[i].append(x)
    return [ c for c in clusters if len(c) > 0 ]

Here is the update step.

def update_means(clusters):
    return [ mean(c,axis=0) for c in clusters ]

Here is the iteration.



from numpy.random import random

d = 2
k,N = 7,100

def random_vector(d):
    return array([ random() for _ in range(d) ])

dataset = [ random_vector(d) for _ in range(N) ]


means = [ random_vector(d) for _ in range(k) ]

close_enough = False

while not close_enough:
    clusters = assign_clusters(dataset,means)
    print([len(c) for c in clusters])
    newmeans = update_means(clusters)
    # only check closeness if number of means unchanged
    if len(newmeans) == len(means):
        close_enough = all([ allclose(m,n)
            for m,n in zip(means,newmeans) ])
    means = newmeans

This code prints the sizes of the clusters after each iteration. Here is code
that plots a cluster.

def plot_cluster(mean,cluster,color,marker):
    for v in cluster:
        scatter(v[0],v[1], s=50, c=color, marker=marker)
    scatter(mean[0], mean[1], s=100, c=color, marker='*')

Here is code for the entire iteration. hexcolor is in §1.3.

from matplotlib.pyplot import *

d = 2
k,N = 7,100

def random_vector(d):
    return array([ random() for _ in range(d) ])

dataset = [ random_vector(d) for _ in range(N) ]


means = [ random_vector(d) for _ in range(k) ]
colors = [ hexcolor() for _ in range(k) ]

close_enough = False

figure(figsize=(4,4))
grid()

for v in dataset: scatter(v[0],v[1],s=20,c='black')


show()

while not close_enough:
    clusters = assign_clusters(dataset,means)
    newmeans = update_means(clusters)
    # only check closeness if number of means unchanged
    if len(newmeans) == len(means):
        close_enough = all([ allclose(m,n)
            for m,n in zip(means,newmeans) ])
    figure(figsize=(4,4))
    grid()
    for i,c in enumerate(clusters):
        plot_cluster(newmeans[i], c, colors[i],
            '$' + str(i) + '$')
    show()
    means = newmeans
Chapter 4
Calculus

The material in this chapter lays the groundwork for Chapter 7. It assumes
the reader has some prior exposure, and the first section quickly reviews
basic material essential for our purposes. Nevertheless, the overarching role
of convexity is emphasized repeatedly, both in the single-variable and multi-
variable case.
The chain rule is treated extensively, in both interpretations, combinato-
rial (back-propagation) and geometric (time-derivatives). Both are crucial for
neural network training in Chapter 7.
Because it is used infrequently in the text, integration is treated separately
in an appendix (§A.5).
Even though parts of §4.5 are heavy-going, the material is necessary for
Chapter 7. Nevertheless, for a first pass, the reader should feel free to skim
this material and come back to it after the need is made clear.

4.1 Single-Variable Calculus

In this section, we focus on single-variable calculus, and in §4.3, we review


multi-variable calculus. Recall the slope of a line y = mx + b equals m.
Let y = f (x) be a function as in Figure 4.1, and let a be a fixed point. The
derivative of f (x) at the point a is the slope of the line tangent to the graph
of f (x) at a. Then the derivative at a point a is a number f ′ (a) possibly
depending on a.

Definition of Derivative

The derivative of f (x) at the point a is the slope of the line tangent
to the graph of f (x) at a.


Since a constant function f (x) = c is a line with slope zero, the derivative
of a constant is zero. Since f (x) = mx+b is a line with slope m, its derivative
is m.
Since the tangent line at a passes through the point (a, f (a)), and its slope
is f ′ (a), the equation of the tangent line at a is

y = f (a) + f ′ (a)(x − a).

Based on the definition, natural properties of the derivative are


A. The derivative of f (x) + g(x) is f ′ (x) + g ′ (x), and (−f (x))′ is −f ′ (x).
B. If f ′ (x) ≥ 0 on an interval [a, b], then f (b) ≥ f (a).
C. If f ′ (x) ≤ 0 on an interval [a, b], then f (b) ≤ f (a).


Fig. 4.1 f ′ (a) is the slope of the tangent line at a.

Using these properties, we determine the formula for f ′ (a). Suppose the
derivative is bounded between two extremes m and L at every point x in an
interval [a, b], say
m ≤ f ′ (x) ≤ L, a ≤ x ≤ b.
Then by A, the derivative of h(x) = f (x)−mx at x equals h′ (x) = f ′ (x)−m.
By assumption, h′ (x) ≥ 0 on [a, b], so, by B, h(b) ≥ h(a). Since h(a) =
f (a) − ma and h(b) = f (b) − mb, this leads to

(f(b) − f(a)) / (b − a) ≥ m.

Repeating this same argument with f(x) − Lx, and using C, leads to

(f(b) − f(a)) / (b − a) ≤ L.

We have shown

First Derivative Bounds

If m ≤ f ′ (x) ≤ L for a ≤ x ≤ b, then

m ≤ (f(b) − f(a)) / (b − a) ≤ L.        (4.1.1)

As an immediate consequence, when m = L = 0, applying this to any


subinterval [a′ , b′ ] in [a, b],

Zero Derivative Implies Constant

For any f (x),

f ′ (x) = 0 =⇒ f (x) is constant. (4.1.2)

When b is close to a, we expect both extremes m and L to be close to


f ′ (a). From (4.1.1), we arrive at the formula for the derivative,

Derivative Formula

f ′ (a) = lim_{x→a} (f (x) − f (a))/(x − a).    (4.1.3)

From (4.1.3), the derivative of a line f (x) = mx + b equals f ′ (a) = m,


agreeing with what we already know. We usually deal with limits as in (4.1.3)
in an intuitive manner. When needed, please refer to §A.6 for basic properties
of limits.
Below we also write
y ′ = f ′ (x) = dy/dx,

or

f ′ (a) = dy/dx evaluated at x = a.
When the particular point a is understood from the context, we write y ′ .

From (4.1.3), the basic properties of the derivative are


• Sum rule. h = f + g implies h′ = f ′ + g ′ ,
• Product rule. h = f g implies h′ = f ′ g + f g ′ ,
• Quotient rule. h = f /g implies h′ = (f ′ g − f g ′ )/g 2 .

• Chain rule. u = f (x) and y = g(u) implies

dy/dx = dy/du · du/dx.
To visualize the chain rule, suppose

u = f (x) = sin x,
y = g(u) = u2 .

These are two functions f , g in composition, as in Figure 4.2.

Fig. 4.2 Composition of two functions.


Suppose x = π/4. Then u = sin(π/4) = 1/√2, and y = u² = 1/2. Since

dy/du = 2u = 2/√2,   du/dx = cos x = 1/√2,

by the chain rule,

dy/dx = dy/du · du/dx = (2/√2) · (1/√2) = 1.
Since the chain rule is important for machine learning, it is discussed in detail
in §4.4.
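As a quick sanity check (a sketch, not part of the text's code), sympy confirms this value of dy/dx:

from sympy import symbols, sin, diff, pi

x = symbols('x')
y = sin(x)**2                    # y = g(f(x)) with u = sin x, y = u**2

print(diff(y,x).subs(x,pi/4))    # prints 1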
By the product rule,

(x2 )′ = x′ x + xx′ = 1x + x1 = 2x.

Similarly one obtains the power rule

(xn )′ = nxn−1 . (4.1.4)

Using the chain rule, the power rule can be derived for any rational number n, positive or negative. For example, since (√x)² = x, we can write x = f (g(x)) with f (x) = x² and g(x) = √x. By the chain rule,

1 = (x)′ = f ′ (g(x))g ′ (x) = 2g(x)g ′ (x) = 2√x (√x)′ .

Solving for (√x)′ yields

(√x)′ = 1/(2√x),
which is (4.1.4) with n = 1/2. In this generality, the variable x is restricted
to positive values only.

For example, the code

from sympy import *

x, a = symbols('x, a')
f = x**a

f.diff(x), diff(f,x), f.diff(x).simplify(), simplify(diff(f,x))

returns
ax^a/x,   ax^a/x,   ax^(a−1),   ax^(a−1).

The power rule can be combined with the chain rule. For example, if

u = 1 − p + cp,   f (p) = u^n,   g(u) = u^(n+1)/((c − 1)(n + 1)),

and F (p) = g(u), then

F (p) = (1 − p + cp)^(n+1)/((c − 1)(n + 1)),

and

F ′ (p) = g ′ (u)u′ = u^n,

hence

F (p) = (1 − p + cp)^(n+1)/((c − 1)(n + 1))   =⇒   F ′ (p) = f (p).    (4.1.5)

The second derivative f ′′ (x) of f (x) is the derivative of the derivative,



f ′′ (x) = (f ′ (x))′ .

For example,

(x^n)′′ = (nx^(n−1))′ = n(n − 1)x^(n−2) = n!/(n − 2)! · x^(n−2) = P (n, 2)x^(n−2)

(for n! and P (n, k) see §A.1).


More generally, the k-th derivative f (k) (x) is the derivative taken k times, so

(x^n)^(k) = n(n − 1)(n − 2) . . . (n − k + 1)x^(n−k) = n!/(n − k)! · x^(n−k) = P (n, k)x^(n−k).

When k = 0, f (0) (x) = f (x), and, when k = 1, f (1) (x) = f ′ (x). The code

from sympy import *


init_printing()

x, n = symbols('x, n')

diff(x**n,x,3)

returns the third derivative n(n − 1)(n − 2)xn−3 .

Here is an example using derivatives from sympy. Given a power n, let


pn (x) = (x2 − 1)n . Then pn (x) is a polynomial of degree 2n.
The Legendre polynomial Pn (x) is the n-th derivative of pn (x) divided by
n!2n . Then Pn (x) is a polynomial of degree n.
For example, when n = 1, p1 (x) = x² − 1, so

P1 (x) = (1/(1!·2¹)) (x² − 1)′ = (1/2) · 2x = x.

When n = 2,

P2 (x) = (1/(2!·2²)) ((x² − 1)²)′′ = (1/2)(3x² − 1).
2!2 2
The Python code for Pn (x) uses symbolic functions and symbolic deriva-
tives.

from sympy import diff, symbols


from scipy.special import factorial

def sym_legendre(n):
# symbolic variable
x = symbols('x')
# symbolic function
p = (x**2 - 1)**n
nfact = factorial(n,exact=True)
# symbolic nth derivative
return p.diff(x,n)/(nfact * 2**n)

For example,

from sympy import init_printing, simplify


init_printing()

[ simplify(sym_legendre(n)) for n in range(6) ]

returns the first six Legendre Polynomials, starting from n = 0:

1,   x,   3x²/2 − 1/2,   x(5x² − 3)/2,   35x⁴/8 − 15x²/4 + 3/8,   x(63x⁴ − 70x² + 15)/8.

To compute values such as P4 (5), we have to modify sym_legendre to a numpy function as follows,

from sympy import lambdify

def num_legendre(n):
x = symbols('x')
f = sym_legendre(n)
return lambdify(x,f, 'numpy')

The function num_legendre(n) can be evaluated, plotted, integrated, etc.
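For example, assuming the definitions above are in scope, a quick evaluation (not from the text) is

P4 = num_legendre(4)
print(P4(5))    # 2641.0, since P4(x) = (35x**4 - 30x**2 + 3)/8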

We use the above to derive the Taylor series. Suppose f (x) is given by a
finite or infinite sum

f (x) = c0 + c1 x + c2 x2 + c3 x3 + . . . (4.1.6)

Then f (0) = c0 . Taking derivatives, by the sum, product, and power rules,

f ′ (x) = c1 + 2c2 x + 3c3 x² + 4c4 x³ + . . .
f ′′ (x) = 2c2 + 3 · 2c3 x + 4 · 3c4 x² + . . .
f ′′′ (x) = 3 · 2c3 + 4 · 3 · 2c4 x + . . .                    (4.1.7)
f (4) (x) = 4 · 3 · 2c4 + . . .

Inserting x = 0, we obtain f ′ (0) = c1 , f ′′ (0) = 2c2 , f ′′′ (0) = 3 · 2c3 , f (4) (0) =
4 · 3 · 2c4 . This can be encapsulated by f (n) (0) = n!cn , n = 0, 1, 2, 3, 4, . . . ,
which is best written

f (n) (0)/n! = cn ,   n ≥ 0.
Going back to (4.1.6), we derived

Taylor Series

For almost every function f (x),

f (x) = Σ_{n=0}^∞ f (n) (0)/n! · x^n
      = f (0) + f ′ (0)x + f ′′ (0)x²/2 + f ′′′ (0)x³/6 + f (4) (0)x⁴/24 + . . .    (4.1.8)
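As an illustration (a sketch, not code from the text), sympy's series function produces the first few Taylor terms directly; here for the exponential function of §A.3:

from sympy import symbols, series, exp

x = symbols('x')
print(series(exp(x), x, 0, 5))   # 1 + x + x**2/2 + x**3/6 + x**4/24 + O(x**5)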

We now compute the derivatives of the exponential function (§A.3). By the compound-interest formula (A.3.8),

e^x = lim_{n→∞} (1 + x/n)^n .

By the power rule and chain rule,

((1 + x/n)^n)′ = n(1 + x/n)^(n−1) · (1/n) = (1 + x/n)^(n−1) .

From this follows

(e^x)′ = lim_{n→∞} (1 + x/n)^(n−1) = lim_{n→∞} (1 + x/n)^n · 1/(1 + x/n) = e^x · 1 = e^x .

Since the first derivative is ex , so is the second derivative. This derives

Derivative of the Exponential Function

The exponential function satisfies

(ex )′ = ex , (ex )′′ = ex .

The logarithm function is the inverse of the exponential function,

y = log x ⇐⇒ x = ey .

This is the same as saying

log(ey ) = y, elog x = x.

From here, we see the logarithm is defined only for x > 0 and is strictly
increasing (Figure 4.3).
Since e0 = 1,
log 1 = 0.
Since e∞ = ∞ (Figure A.3),

log ∞ = ∞.

Since e−∞ = 1/e∞ = 1/∞ = 0,

log 0 = −∞.

We also see log x is negative when 0 < x < 1, and positive when x > 1.

Fig. 4.3 The logarithm function log x.

Moreover, by the law of exponents,

log(ab) = log a + log b.

For a > 0 and b real, define

a^b = e^(b log a) .

Then, by definition,

log(a^b) = b log a,

and

(a^b)^c = (e^(b log a))^c = e^(bc log a) = a^(bc) .

By definition of the logarithm, y = log x is shorthand for x = ey . Use the


chain rule to find y ′ :

x = ey =⇒ 1 = x′ = (ey )′ = ey y ′ = xy ′ ,

so

y = log x   =⇒   y ′ = 1/x.

Derivative of the Logarithm

y = log x   =⇒   y ′ = 1/x.    (4.1.9)

Since the derivative of log(1 + x) is 1/(1 + x), the chain rule implies

d^n/dx^n log(1 + x) = (−1)^(n−1) (n − 1)!/(1 + x)^n ,   n ≥ 1.

From this, the Taylor series of log(1 + x) is

log(1 + x) = x − x²/2 + x³/3 − x⁴/4 + . . . .    (4.1.10)

Fig. 4.4 Increasing or decreasing?

For the parabola in Figure 4.4, y = x2 so, by the power rule, y ′ = 2x.
Since y ′ > 0 when x > 0 and y ′ < 0 when x < 0, this agrees with the

increase/decrease of the graph. In particular, the minimum of the parabola


occurs when y ′ = 0.
For the curve y = x4 − 2x2 in Figure 4.5,

y ′ = 4x3 − 4x = 4x(x2 − 1) = 4x(x − 1)(x + 1),

so y ′ is a product of the three factors 4x, x − 1, x + 1. Since the zeros of these


factors are 0, 1, and −1, and y ′ > 0 when all factors are positive, or two of
them are negative, this agrees with the increase/decrease in the figure.
Here y ′ = 0 occurs at the two minima x = ±1 and at the local maximum
0. Notice 0 is not a global maximum as there is no highest value for y.


Fig. 4.5 Increasing or decreasing? (c = 1/√3)

Let y = f (x) be a function. A critical point is a point x∗ where the deriva-


tive equals zero, f ′ (x∗ ) = 0. Above we saw local or global maximizers or
minimizers are critical points. In general, however, this need not be so. A
critical point may be neither. For example, for f (x) = x3 , x∗ = 0 is a critical
point, but is neither a maximizer nor a minimizer. Here, for y = x3 , x∗ = 0
is a saddle point. Nevertheless, we can say

Searching for Maximizers

Let y = f (x) be defined on an interval [a, b]. If x∗ is a maximizer in


the interior (a, b), then x∗ is a critical point. Thus the maximum

max_{a≤x≤b} f (x)    (4.1.11)

equals the maximum over critical points and endpoints,

max_{x∗ ,a,b} f (x).

The same result holds for minimizers.

In other words, to find the maximum of f (x), find the critical points x∗ ,
plug them and the endpoints a, b into f (x), and select whichever yields the
maximum value.

Now we look at the increase/decrease in y ′ , rather than in y. Applying the


above logic to y ′ instead to y, we see y ′ is increasing when y ′′ ≥ 0, and y ′ is
decreasing when y ′′ ≤ 0. In the first case, we say f (x) is convex, while in the
second case, we say f (x) is concave. Clearly a function y = f (x) is concave
if −f (x) is convex.

If we look at Figure 4.4, the slope at x equals y ′ = 2x. Thus as x increases, y ′ increases. Even though the parabola height y decreases when x < 0 and
increases when x > 0, its slope y ′ is always increasing: When x < 0, as x
increases, y ′ = 2x is less and less negative, while, when x > 0, as x increases,
y ′ is more and more positive.
Since y ′ increases when its derivative is positive, the parabola’s behavior
is encapsulated in
y ′′ = (y ′ )′ = (2x)′ = 2 > 0.
In general,

Second Derivative Test for Convexity

y = f (x) is convex iff y ′′ ≥ 0, and concave if y ′′ ≤ 0.

A point where y ′′ = 0 is an inflection point. For example, the parabola in


Figure 4.4 is convex everywhere. Analytically, for the parabola, y ′′ = 2 > 0
everywhere.
For the graph in Figure 4.5 it is clear the graph is convex away from 0,
and concave near 0. Analytically,

y ′′ = (x4 − 2x2 )′′ = (4x3 − 4x)′ = 12x2 − 4 = 4(3x2 − 1),



so the inflection points are x = ±1/√3. Hence the graph is convex when |x| > 1/√3, and the graph is concave when |x| < 1/√3. Since 1/√3 < 1, the graph is convex near x = ±1.

A function f (x) is strictly convex if y ′′ > 0. Geometrically, f (x) is strictly


convex if each chord joining any two points on the graph lies strictly above
the graph. Similarly, one defines strictly concave to mean y ′′ < 0.

Second Derivative Test for Strict Convexity

Suppose y = f (x) has a second derivative y ′′ . Then y is strictly convex


if y ′′ > 0, and strictly concave if y ′′ < 0.

For example, since (x²)′′ = 2 > 0 and (e^x)′′ = e^x > 0, x² and e^x are strictly convex everywhere, and x⁴ − 2x² is strictly convex for |x| > 1/√3. Convexity of e^x was also derived in (A.3.14). Since

(e^x)^(n) = e^x ,   n ≥ 0,

writing the Taylor series of e^x yields the exponential series (A.3.12).

Let y = − log x. By the power rule,

y ′′ = (− log x)′′ = (−1/x)′ = −(x^(−1))′ = 1/x² .

Since y ′′ > 0, − log x is strictly convex. This shows

Concavity of the Logarithm Function

The logarithm function log x is strictly concave on x > 0.

Suppose y = f (x) is convex, so y ′ is increasing. Then a ≤ t ≤ x ≤ b implies f ′ (a) ≤ f ′ (t) ≤ f ′ (x) ≤ f ′ (b). Taking m = f ′ (a) and L = f ′ (x) in (4.1.1),

f ′ (a) ≤ (f (x) − f (a))/(x − a) ≤ f ′ (x),   a ≤ x ≤ b.
Since the tangent line at a is y = f ′ (a)(x − a) + f (a), rearranging this last
inequality, we obtain

Convex Function Graph Lies Above the Tangent Line

If f (x) is convex on [a, b], then

f (x) ≥ f (a) + f ′ (a)(x − a), a ≤ x ≤ b.

For example, the function in Figure 4.6 is convex near x = a, and the
graph lies above its tangent line at a.

Let pm (x) be the parabola

pm (x) = f (a) + f ′ (a)(x − a) + (m/2)(x − a)² .    (4.1.12)
Then p′′m (x) = m. Moreover the graph of pm (x) is tangent to the graph of
f (x) at x = a, in the sense f (a) = pm (a) and f ′ (a) = p′m (a). Because of this,
we call pm (x) the lower tangent parabola.
Similarly, let pL (x) be the parabola

pL (x) = f (a) + f ′ (a)(x − a) + (L/2)(x − a)² .    (4.1.13)
Then p′′L (x) = L. Moreover the graph of pL (x) is tangent to the graph of f (x)
at x = a, in the sense f (a) = pL (a) and f ′ (a) = p′L (a). Because of this, we
call pL (x) the upper tangent parabola.
When y is convex, we saw above the graph of y lies above its tangent line.
When m ≤ y ′′ ≤ L, we can specify the size of the difference between the
graph and the tangent line. In fact, the graph is constrained to lie above or
below the lower or upper tangent parabolas.

Second Derivative Bounds

If m ≤ f ′′ (x) ≤ L on [a, b], the graph lies between the lower and upper
tangent parabolas pm (x) and pL (x),

(m/2)(x − a)² ≤ f (x) − f (a) − f ′ (a)(x − a) ≤ (L/2)(x − a)² ,   a ≤ x ≤ b.    (4.1.14)

To see this, suppose f ′′ (x) ≥ m. Then g(x) = f (x) − pm (x) satisfies

g ′′ (x) = f ′′ (x) − p′′m (x) = f ′′ (x) − m ≥ 0,

so g(x) is convex, so g(x) lies above its tangent line at x = a. Since g(a) = 0
and g ′ (a) = 0, the tangent line is 0, and we conclude g(x) ≥ 0, which is the

left half of (4.1.14). Similarly, if f ′′ (x) ≤ L, then pL (x) − f (x) is convex,


leading to the right half of (4.1.14).

Fig. 4.6 Tangent parabolas pm (x) (green), pL (x) (red), L > m > 0.

Now suppose f (x) is strongly convex in the sense L ≥ f ′′ (x) ≥ m on an


interval [a, b], for some positive constants m and L. By (4.1.1),

t = (f ′ (b) − f ′ (a))/(b − a)   =⇒   L ≥ t ≥ m,
which implies

t2 − (m + L)t + mL = (t − m)(t − L) ≤ 0.

This yields

Coercivity for Strongly Convex Functions

If m ≤ f ′′ (x) ≤ L for a ≤ x ≤ b, then

(f ′ (b) − f ′ (a))/(b − a) ≥ mL/(m + L) + (1/(m + L)) ((f ′ (b) − f ′ (a))/(b − a))² .    (4.1.15)

For gradient descent, we need the relation between a convex function and
its dual. If f (x) is convex, its convex dual is

g(p) = max_x (px − f (x)).    (4.1.16)

Below we see g(p) is also convex. This may not always exist, but we will work
with cases where no problems arise.
To evaluate g(p), following (4.1.11), we compute the maximizer x∗ by
setting the derivative of (px − f (x)) equal to zero and solving for x.
Let a > 0. The simplest example is f (x) = ax²/2. In this case, the maximum of px − f (x) occurs where (px − f (x))′ = 0, which leads to

0 = (px − (1/2)ax²)′ = p − ax,

or x∗ = p/a. Plugging this maximizer x∗ back into (4.1.16) yields g(p) = p²/2a. In the exercises and in §4.2, we see other examples.
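This example can be checked symbolically (a sketch, not the text's code):

from sympy import symbols, diff, solve, simplify

x, p, a = symbols('x, p, a', positive=True)
f = a*x**2/2

xstar = solve(diff(p*x - f, x), x)[0]              # maximizer x* = p/a
print(xstar, simplify((p*x - f).subs(x, xstar)))   # p/a, p**2/(2*a)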

Going back to (4.1.16), for each p, the point x where px − f (x) equals the
maximum g(p) — the maximizer — depends on p. If we denote the maximizer
by x = x(p), then
g(p) = px(p) − f (x(p)).
Since the maximum occurs when the derivative is zero, we have

0 = (px − f (x))′ = p − f ′ (x) ⇐⇒ x = x(p).

Hence
g(p) = px − f (x) ⇐⇒ p = f ′ (x).
Also, by the chain rule, differentiating with respect to p,

g ′ (p) = (px − f (x))′ = x + px′ − f ′ (x)x′ = x.

From this, we conclude

p = f ′ (x) ⇐⇒ x = g ′ (p). (4.1.17)

Thus f ′ (x) is the inverse function of g ′ (p). Since g(p) = px − f (x) is the same
as f (x) = px − g(p), we have

Dual of the Dual

If g(p) is the convex dual of a convex f (x), then f (x) is the convex
dual of g(p).

Since f ′ (x) is the inverse function of g ′ (p), we have

f ′ (g ′ (p)) = p.

Differentiating with respect to p again yields

f ′′ (g ′ (p))g ′′ (p) = 1.

We derived

Second Derivatives of Dual Functions

Let f (x) be a strictly convex function, and let g(p) be the convex dual
of f (x). Then g(p) is strictly convex and
g ′′ (p) = 1/f ′′ (x),    (4.1.18)

where x = g ′ (p), p = f ′ (x).

Since f ′′ (x) > 0, also g ′′ (p) > 0, so g(p) is strictly convex.

For the chi-squared distribution (§5.5), we used Newton’s generalization


of the binomial theorem to general exponents. This we now derive.

Newton’s Binomial Theorem


Let n be any real number. For a > 0 and −a < x < a,

(a + x)^n = a^n + na^(n−1)x + C(n, 2)a^(n−2)x² + C(n, 3)a^(n−3)x³ + . . . .

This makes sense because the binomial coefficient C(n, k) is defined for any real number n (A.2.12), (A.2.13).
In summation notation,

(a + x)^n = Σ_{k=0}^∞ C(n, k) a^(n−k) x^k .    (4.1.19)

The only difference between (A.2.7) and (4.1.19) is the upper limit of the
summation, which is set to infinity. When n is a whole number, by (A.2.10),
we have

C(n, k) = 0,   for k > n,

so (4.1.19) is a sum of n + 1 terms, and equals (A.2.7) exactly. When n is not


a whole number, the sum (4.1.19) is an infinite sum.
Actually, in §5.5, we need the special case a = 1, which we write in slightly different notation,

(1 + x)^p = Σ_{n=0}^∞ C(p, n) x^n .    (4.1.20)

Newton’s binomial theorem (4.1.19) is a special case of the Taylor series


(4.1.8). To see this, set
f (x) = (a + x)^n .

Then, by the power rule,

f (k) (x) = n(n − 1)(n − 2) . . . (n − k + 1)(a + x)^(n−k) ,

so

f (k) (0)/k! = n(n − 1)(n − 2) . . . (n − k + 1)/k! · a^(n−k) = C(n, k) a^(n−k) .

Writing out the Taylor series,

(a + x)^n = Σ_{k=0}^∞ f (k) (0)/k! · x^k = Σ_{k=0}^∞ C(n, k) a^(n−k) x^k ,

which is Newton’s binomial theorem.
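A short sympy check (a sketch, not the text's code) recovers the first terms of (4.1.20) for a symbolic exponent p:

from sympy import symbols, series

x, p = symbols('x, p')
# first terms: 1 + p*x + p*(p - 1)/2 * x**2 + ... (possibly printed in an equivalent form)
print(series((1 + x)**p, x, 0, 3))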

Fig. 4.7 The sine function.



The trigonometric functions sine and cosine were defined in (1.4.3). To


plot them, use

from matplotlib.pyplot import *


from numpy import *

a, b = 0, 3*pi
theta = arange(a,b,.01)

ax = axes()
ax.grid(True)
ax.axhline(0, color='black', lw=1)

plot(theta,sin(theta))
show()

This returns Figure 4.7.

Fig. 4.8 The sine function with π/2 tick marks.

It is often convenient to set the horizontal axis tick marks at the multiples
of π/2. For this, we use

from numpy import *


from matplotlib.pyplot import *

def label(k):
if k == 0: return '$0$'
elif k == 1: return r'$\pi/2$'
elif k == -1: return r'$-\pi/2$'
elif k == 2: return r'$\pi$'
elif k == -2: return r'$-\pi$'

elif k%2 == 0: return '$' + str(k//2) + r'\pi$'


else: return '$' + str(k) + r'\pi/2$'

def set_pi_ticks(a,b):
base = pi/2
m = floor(b/base)
n = ceil(a/base)
k = arange(n,m+1,dtype=int)
# multiples of base
return xticks(k*base, map(label,k) )

Then inserting set_pi_ticks(a,b) in the plot code returns Figure 4.8.

We review the derivative of sine and cosine. Recall the angle θ in radians
is the length of the subtended arc (in red) in Figure 4.9. Following the figure,
with P = (x, y), we have x = cos θ, y = sin θ.
The key idea here is Archimedes’ axiom [13], which states:
Suppose two convex curves share common initial and terminal points. If one is inside
the other, then the inside curve is the shorter.

Fig. 4.9 Angle θ in the plane, P = (x, y).

By the figure, there are three convex curves joining P and I: The line
segment P I, the red arc, and the polygonal curve P QI. Since the length of
the line segment is greater than y, Archimedes’ axiom implies

y < θ < 1 − x + y,

or
sin θ < θ < 1 − cos θ + sin θ.

Dividing by θ (here we assume 0 < θ < π/2),

1 − (1 − cos θ)/θ < (sin θ)/θ < 1.    (4.1.21)
We use this to show (the definition of limit is in §A.6)

lim_{θ→0} (sin θ)/θ = 1.    (4.1.22)
Since sin θ is odd, it is enough to verify (4.1.22) for θ > 0.
To this end, since sin²θ = 1 − cos²θ, from (4.1.21),

0 ≤ (1 − cos θ)/θ = (1 − cos²θ)/(θ(1 + cos θ)) = (sin θ)/θ · (sin θ)/(1 + cos θ) ≤ sin θ ≤ θ,

which implies

lim_{θ→0} (1 − cos θ)/θ = 0.
Taking the limit θ → 0 in (4.1.21), we obtain (4.1.22) for θ > 0.
From (A.4.6),

sin(θ + t) = sin θ cos t + cos θ sin t,

so

lim_{t→0} (sin(θ + t) − sin θ)/t = lim_{t→0} (sin θ · (cos t − 1)/t + cos θ · (sin t)/t) = cos θ.

Thus the derivative of sine is cosine,

(sin θ)′ = cos θ.

Similarly,

(cos θ)′ = − sin θ.
Using the chain rule, we compute the derivative of the inverse arcsin x of sin θ. Since

θ = arcsin x ⇐⇒ x = sin θ,

we have

1 = x′ = (sin θ)′ = θ′ · cos θ = θ′ · √(1 − x²),

or

(arcsin x)′ = θ′ = 1/√(1 − x²).
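A one-line sympy check (a sketch, not from the text):

from sympy import symbols, diff, asin

x = symbols('x')
print(diff(asin(x), x))   # 1/sqrt(1 - x**2)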
We use this to compute the derivative of the arcsine law (3.2.15). With x = √λ/2, by the chain rule,

((2/π) arcsin(√λ/2))′ = (2/π) · 1/√(1 − x²) · x′
                      = (2/π) · 1/√(1 − λ/4) · 1/(4√λ) = 1/(π√(λ(4 − λ))).    (4.1.23)

This shows the derivative of the arcsine law is the density in Figure 3.11.

Exercises

Exercise 4.1.1 What is the y-intercept of the line tangent to f (x) = x2 at


the point (1, 1)?

Exercise 4.1.2 With exp x = ex , what are the first derivatives of exp(exp x)
and exp(exp(exp x))?
Exercise 4.1.3 With a > 0, let f (x) = (1/2)ax² − e^x . Where is f (x) convex, and where is it concave?

Exercise 4.1.4 With Pn (x) the Legendre polynomial, use num_legendre to


find the general formula for Pn (0), Pn (1), Pn (−1), for n = 1, 2, 3, . . . .

Exercise 4.1.5 Compute the Taylor series for sin θ and cos θ.

Exercise 4.1.6 Using Newton’s binomial theorem, show


1/√(1 − 2u) = 1 + u + (1·3/2!) u² + (1·3·5/3!) u³ + (1·3·5·7/4!) u⁴ + . . .

Exercise 4.1.7 If the convex dual of f (x) is g(p), and t is a constant, what
is the convex dual of f (x) + t?

Exercise 4.1.8 If the convex dual of f (x) is g(p), and t is a constant, what
is the convex dual of f (x + t)?

Exercise 4.1.9 If the convex dual of f (x) is g(p), and t ̸= 0 is a constant,


what is the convex dual of f (tx)?

Exercise 4.1.10 If the convex dual of f (x) is g(p), and t ̸= 0 is a constant,


what is the convex dual of tf (x)?

Exercise 4.1.11 If a > 0 and


f (x) = (1/2)ax² + bx + c,
what is the convex dual?

Exercise 4.1.12 Show f (x) convex implies ef (x) convex.



4.2 Entropy and Information

Let p be a probability, i.e. a number between 0 and 1. The entropy of p is

H(p) = −p log p − (1 − p) log(1 − p), 0 ≤ p ≤ 1. (4.2.1)

This is also called absolute entropy to contrast with relative entropy which
we see below.
To graph H(p), we compute its first and second derivatives. Here the
independent variable is p. By the product rule,
 
H ′ (p) = (−p log p − (1 − p) log(1 − p))′ = − log p + log(1 − p) = log((1 − p)/p).

Thus H ′ (p) = 0 when p = 1/2, H ′ (p) > 0 on p < 1/2, and H ′ (p) < 0 on
p > 1/2. Since this implies H(p) is increasing on p < 1/2, and decreasing on
p > 1/2, p = 1/2 is a global maximizer of the graph.
Notice as p increases, 1 − p decreases, so (1 − p)/p decreases. Since log is
increasing, as p increases, H ′ (p) decreases. Thus H(p) is concave.

Fig. 4.10 The absolute entropy function H(p).

Taking the second derivative, by the chain rule and the quotient rule,

H ′′ (p) = (log((1 − p)/p))′ = −1/(p(1 − p)),

which is negative, leading to the strict concavity of H(p).
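These derivatives are easy to confirm with sympy (a sketch, not the text's code):

from sympy import symbols, log, diff, simplify

p = symbols('p')
H = -p*log(p) - (1-p)*log(1-p)

print(simplify(diff(H,p)))     # log(1 - p) - log(p), i.e. log((1 - p)/p)
print(simplify(diff(H,p,2)))   # -1/(p*(1 - p)), possibly in an equivalent form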


A crucial aspect of Figure 4.10 is its limiting values at the edges p = 0 and
p = 1,

H(0) = lim_{p→0} H(p)   and   H(1) = lim_{p→1} H(p).

Inserting p = 0 into p log p yields 0 × (−∞), so it is not at all clear what


H(0) should be. On the other hand, Figure 4.10 suggests H(0) = 0.
For the first limit, since H(p) is increasing near p = 0, it is clear there
is a definite value H(0). The entropy is the sum of two terms, −p log p, and
−(1 − p) log(1 − p). When p → 0, the second term approaches − log 1 = 0, so
H(0) is the limit of the first term,

H(0) = − lim_{p→0} p log p.

When p → 0, also 2p → 0. Replacing p by 2p,

H(0) = − lim_{p→0} p log p = − lim_{p→0} 2p log(2p) = lim_{p→0} (−2p log 2) + 2H(0) = 2H(0).

Thus H(0) = 0. Since H(p) is symmetric, H(1 − p) = H(p), we also have


H(1) = 0.

To explain the meaning of the entropy function H(p), suppose a coin has
heads-bias or heads-probability p. If p is near 1, then we have confidence the
outcome of tossing the coin is heads, and, if p is near 0, we have confidence the
outcome of tossing the coin is tails. If p = 1/2, then we have least information.
Thus we can view the entropy as measuring a lack of information.
To formalize this, we define the information or absolute information

I(p) = p log p + (1 − p) log(1 − p), 0 ≤ p ≤ 1. (4.2.2)

Then we have

Entropy and Information

Entropy equals negative information.

The clearest explanation of H(p) is in terms of coin-tossing, where it is


shown H(p) is the log of the number of outcomes with heads-proportion p.
This is explained in §5.1.

The logistic function is



p = σ(x) = e^x/(1 + e^x) = 1/(1 + e^(−x)),   −∞ < x < ∞.    (4.2.3)

By the quotient and chain rules, its derivative is

p′ = −(−e^(−x))/(1 + e^(−x))² = σ(x)(1 − σ(x)) = p(1 − p).    (4.2.4)

The logistic function, also called the expit function and the sigmoid function,
is studied further in §5.1, where it used in coin-tossing and Bayes theorem.
The inverse of the logistic function is the logit function. The logit function
is found by solving p = σ(x) for x, obtaining

x = σ^(−1)(p) = log(p/(1 − p)).    (4.2.5)

The logit function is also called the log-odds function. Its derivative is

x′ = (1 − p)/p · (p/(1 − p))′ = (1 − p)/p · 1/(1 − p)² = 1/(p(1 − p)).

Notice the derivative p′ of σ and the derivative x′ of its inverse σ −1 are


reciprocals. This result, the inverse function theorem, holds in general.

Let
Z(x) = log (1 + ex ) . (4.2.6)
Then Z ′ (x) = σ(x) and Z ′′ (x) = σ ′ (x) = σ(1 − σ) > 0. This shows Z(x) is
strictly convex. We call Z(x) the cumulant-generating function, to be consis-
tent with random variable terminology (§5.3).
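A quick sympy check of these two identities (a sketch, not the text's code):

from sympy import symbols, log, exp, diff, simplify

x = symbols('x')
Z = log(1 + exp(x))
sigma = exp(x)/(1 + exp(x))

print(simplify(diff(Z,x) - sigma))                  # 0
print(simplify(diff(Z,x,2) - sigma*(1 - sigma)))    # 0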

Fig. 4.11 The absolute information I(p).



We compute the convex dual (§4.1) of Z(x). By (4.1.11), the maximum

max_x (px − Z(x))

is attained when (px − Z(x))′ = 0, which happens when p = Z ′ (x) = σ(x).


Inserting the log-odds function x = σ −1 (p), we obtain
    
max_x (px − Z(x)) = p log(p/(1 − p)) − Z(log(p/(1 − p))),    (4.2.7)

which simplifies to I(p) (4.2.2).

Dual of Cumulant-Generating Function is Information

The convex dual of the cumulant-generating function is the informa-


tion.

The derivative of I(p) is

I ′ (p) = log(p/(1 − p)).    (4.2.8)

Then I ′ (p) is the inverse of Z ′ (x) = σ(x), as it should be (4.1.17).


From (4.2.8),

I ′′ (p) = 1/(p(1 − p)).
The multinomial extension of I(p) is in §5.6.

Let p and q be two probabilities,

0 ≤ p ≤ 1, and 0 ≤ q ≤ 1.

When do we consider p and q close to each other? If p and q were just


numbers, p and q are considered close if the distance |p − q| is small or the
distance squared |p − q|2 is small. But here p and q are probabilities, so it
makes sense to consider them close if their information content is close.
To this end, we define the relative information I(p, q) by

I(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)).    (4.2.9)

Then
I(q, q) = 0,

which agrees with our design goal that I(p, q) measures the divergence be-
tween the information in p and the information in q. Because I(p, q) is not
symmetric in p, q, we think of q as a base or reference probability, against
which we compare p.

Fig. 4.12 The relative information I(p, q) with q = .7.

Equivalently, instead of measuring relative information, we can measure


the relative entropy,
H(p, q) = −I(p, q).
Since − log(x) is strictly convex,

I(p, q) = −p log(q/p) − (1 − p) log((1 − q)/(1 − p))
        > − log(p · q/p + (1 − p) · (1 − q)/(1 − p))
        = − log 1 = 0.

This shows I(p, q) is positive and H(p, q) is negative, when p ̸= q.


Since
I(p, q) = I(p) − p log(q) − (1 − p) log(1 − q),

the second derivatives of I(p) and I(p, q) agree, and since I(0) = 0 = I(1), I(p, q) is well-defined for p = 0 and p = 1,

I(1, q) = − log q, I(0, q) = − log(1 − q).

Taking derivatives (with independent variable p),

d²/dp² I(p, q) = I ′′ (p) = 1/(p(1 − p)),

hence I is strictly convex in p. Thus q is a global minimizer of the graph of


I(p, q) (Figure 4.12). Also

d²/dq² I(p, q) = p/q² + (1 − p)/(1 − q)²,

so I(p, q) is strictly convex in q as well. In Exercise 4.3.2, it is shown I(p, q)


is convex in all directions in the (p, q)-plane (Figure 4.13).
The clearest explanation of H(p, q) is in terms of coin-tossing, where it is
shown H(p, q) is the log of the probability of a coin with heads-bias q having
outcomes with heads-proportion p. This also is explained in §5.1.

Fig. 4.13 Surface plot of I(p, q) over the square 0 ≤ p ≤ 1, 0 ≤ q ≤ 1.

Figure 4.13 clearly exhibits the trough p = q where I(p, q) = 0, and the
edges q = 0, 1 where I(p, q) = ∞. In scipy, I(p, q) is incorrectly called
entropy. For more on this terminology confusion, see the end of §5.6. The
code is as follows.

%matplotlib ipympl
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import entropy

def I(p,q): return entropy([p,1-p],[q,1-q])

ax = axes(projection='3d')
ax.set_axis_off()

p = arange(0,1,.01)
q = arange(0,1,.01)
p,q = meshgrid(p,q)

# surface
ax.plot_surface(p,q,I(p,q), cmap='cool')

# square
ax.plot([0,1,1,0,0],[0,0,1,1,0],linewidth=.5,c="k")

show()

Exercises

Exercise 4.2.1 Check (4.2.7) simplifies to the information (4.2.2).

Exercise 4.2.2 Compute


min_{0≤p≤1} I ′′ (p).

Exercise 4.2.3 Let 0 ≤ q ≤ 1 be a constant. What is the convex dual of

Z(x, q) = log (qex + 1 − q)?


Exercise 4.2.4 Use Python to plot the entropy H(p) and √(p(1 − p)). Use scipy.optimize.newton to find where they are equal.

Exercise 4.2.5 The relative information I(p, q) has minimum zero when p =
q. Use the lower tangent parabola (4.1.12) of I(x, q) at q and Exercise 4.2.2
to show
I(p, q) ≥ 2(p − q)2 .
For q = 0.7, plot both I(p, q) and 2(p − q)2 as functions of 0 < p < 1.

4.3 Multi-Variable Calculus

Let
f (x) = f (x1 , x2 , . . . , xd )
be a scalar function of a point x = (x1 , x2 , . . . , xd ) in Rd , and suppose v is
a unit vector in Rd . Then, along the line x(t) = x + tv, g(t) = f (x + tv)
is a function of the single variable t. Hence its derivative g ′ (0) at t = 0 is
well-defined. Since g ′ (0) depends on the point x and on the direction v, this
rate of change is the directional derivative of f (x) at x in the direction v.
More explicitly, the directional derivative of f (x) at x in the direction v is

Dv f (x) = (d/dt)|_{t=0} f (x + tv).    (4.3.1)

In multiple dimensions, there are many directions v emanating from a point x, so we may ask: How does the direction v affect the rate of change of f? Thinking of f as a temperature: In which direction v does the temperature f increase? In which direction v does the temperature decrease? In which direction does the temperature have the greatest increase? In which direction does the temperature have the greatest decrease? In one dimension, there are only two directions, so the directional derivative is either f ′ (x) or −f ′ (x).

When we select specific directions, the directional derivatives have specific names. Let e1 , e2 , . . . , ed be the standard basis in Rd . The partial derivative in the k-th direction, k = 1, . . . , d, is

∂f/∂xk (x) = (d/dt)|_{t=0} f (x + tek ).

The partial derivative in the k-th direction is just the one-dimensional deriva-
tive considering xk as the independent variable, with all other xj ’s constants.

Below we exhibit the multi-variable chain rule in two ways. The first in-
terpretation is geometric, and involves motion in time and directional deriva-
tives. This interpretation is relevant to gradient descent, §7.3.
The second interpretation is combinatorial, and involves repeated compo-
sitions of functions. This interpretation is relevant to computing gradients in
networks, specifically backpropagation §4.4, §7.2.
These two interpretations work together when training neural networks,
§7.4.

For the first interpretation of the chain rule, suppose the components x1 ,
x2 , . . . , xd are functions of a single variable t (usually time), so we have

x1 = x1 (t), x2 = x2 (t), ..., xd = xd (t).

Inserting these into f (x1 , x2 , . . . , xd ), we obtain a function

f (t) = f (x1 (t), x2 (t), . . . , xd (t))

of a single variable t. Then we have

Multi-Variable Chain Rule

With f (t) = f (x1 (t), x2 (t), . . . , xd (t)),

df/dt = ∂f/∂x1 · dx1/dt + ∂f/∂x2 · dx2/dt + · · · + ∂f/∂xd · dxd/dt .

The gradient of f (x) is the vector

∇f = (∂f/∂x1 , ∂f/∂x2 , . . . , ∂f/∂xd).    (4.3.2)

The Rd -valued function x(t) = (x1 (t), x2 (t), . . . , xd (t)) represents a curve
or path in Rd , and the vector

x′ (t) = (x′1 (t), x′2 (t), . . . , x′d (t))

represents its velocity at time t.


With this notation, the chain rule may be written

df/dt = ∇f (x(t)) · x′ (t).
Let v = (v1 , v2 , . . . , vd ). The simplest application of the multi-variable
chain rule is to select x(t) = x + tv. Then the chain rule becomes

Directional Derivative Formula

The directional derivative of f (x) in the direction v is the dot product of the gradient ∇f (x) and v,

(d/dt)|_{t=0} f (x + tv) = ∇f (x) · v.    (4.3.3)
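As a small numeric illustration (a sketch, not the text's code), a finite-difference quotient approximates (4.3.3) for f (x1 , x2 ) = x1² + x2²/4, whose gradient (2x1 , x2 /2) is computed by hand:

from numpy import *

def f(x): return x[0]**2 + x[1]**2/4

x0 = array([1.,2.])
grad = array([2*x0[0], x0[1]/2])   # gradient by hand
v = array([1.,1.])/sqrt(2)         # a unit direction

t = 1e-6
print((f(x0 + t*v) - f(x0))/t)     # approximately 2.1213
print(grad @ v)                    # 3/sqrt(2) = 2.1213...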

In §7.7, we will need to compute the gradient of a function f (W ) of ma-


trices W . Towards this, recall the collection of matrices with a fixed shape
may be added and scaled. It follows if W and V are matrices with the same
shape, then W + sV also has the same shape, for any scalar s.
If G and V are two matrices with the same shape, we think of trace(V t G)
as a dot product between G and V . This is consistent with the definition of
norm squared (2.2.13). By analogy with (4.3.3), we say

Directional Derivative Matrix Formula

A matrix G is the gradient of f (W ) at W if

(d/ds)|_{s=0} f (W + sV ) = trace(V^t G),   for all V.    (4.3.4)

Then the gradient G has the same shape as W .

We use this result in Chapter 7.

Fig. 4.14 Composition of multiple functions.

Here is an example of the second interpretation of the chain rule. Suppose

r = f (x) = sin x,   s = g(x) = 1/(1 + e^(−x)),   t = h(x) = x²,   u = r + s + t,   y = k(u) = cos u.

These are multiple functions in composition, as in Figure 4.14.


The input variable is x and the output variable is y. The intermediate
variables are r, s, t, u. Suppose x = π/4. Then

x, r, s, t, u, y = 0.79, 0.71, 0.69, 0.62, 2.01, −0.43.

To compute derivatives, start with

dy/du = k ′ (u) = − sin u = −0.90.

Next, to compute dy/dr, the chain rule says

dy/dr = dy/du · du/dr = −0.90 ∗ 1 = −0.90,

and similarly,

dy/ds = dy/dt = −0.90.

By the chain rule,

dy/dx = dy/dr · dr/dx + dy/ds · ds/dx + dy/dt · dt/dx.

By (4.2.4), s′ = s(1 − s) = 0.22, so

dr/dx = cos x = 0.71,   ds/dx = s(1 − s) = 0.22,   dt/dx = 2x = 1.57.

We obtain

dy/dx = −0.90 ∗ 0.71 − 0.90 ∗ 0.22 − 0.90 ∗ 1.57 = −2.25.
The chain rule is discussed in further detail in §4.4.
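A sympy check of this composite derivative (a sketch, not the text's code):

from sympy import symbols, sin, cos, exp, diff, pi

x = symbols('x')
y = cos(sin(x) + 1/(1 + exp(-x)) + x**2)

print(diff(y,x).subs(x,pi/4).evalf())   # approximately -2.25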

Let y = f (x) be a function. A critical point is a point x∗ satisfying

∇f (x∗ ) = 0.

Let x∗ be a local or global minimizer of y = f (x). Then for any vector v


and scalar t near zero, f (x∗ ) ≤ f (x∗ + tv). Hence

∇f (x∗ ) · v = (d/dt)|_{t=0} f (x∗ + tv) = lim_{t→0} (f (x∗ + tv) − f (x∗ ))/t ≥ 0.

This is so for any direction v. Replacing v by −v, we conclude ∇f (x∗ ) · v = 0.


Since v is any direction, ∇f (x∗ ) = 0, so x∗ is a critical point. Thus a minimizer
is a critical point. Similarly, a maximizer is a critical point.
As in the single-variable case, a critical point may be neither a minimizer
nor a maximizer, for example x∗ = (0, 0) and y = x21 − x22 . Such a point is a
saddle point.
If x∗ is a critical point and D2 f (x∗ ) > 0, then x∗ is a local or global
minimizer. This is the same as saying all eigenvalues of the symmetric matrix
D2 f (x∗ ) are positive. When D2 f (x∗ ) < 0, x∗ is a local or global maximum.
If D2 f (x∗ ) has both positive and negative eigenvalues, x∗ is a saddle point.

Let Q be a d × d symmetric matrix, let b be a vector, and let


f (x) = (1/2) x · Qx − b · x = (1/2) Σ_{i,j=1}^d qij xi xj − Σ_{j=1}^d bj xj .    (4.3.5)

When Q is a variance matrix and b = 0, f (x) is the projected variance onto


direction x.
In this case,

∂f/∂xi = (1/2) Σ_{j=1}^d qij xj + (1/2) Σ_{j=1}^d qji xj − bi = (Qx − b)i .

Here we used Q = Qt . Thus ∇f (x) = Qx − b, and

Dv f (x) = v · (Qx − b).

Let x be a point and v a direction, both in Rd . Then x + tv is the equation


of the line passing through x and parallel to v. A multi-variable function f (x)
is convex if its restriction to any line x + tv is convex. Explicitly,

Definition of Multi-Variable Convex Function

Let f (x) be a function of x in Rd . Then f (x) is convex if, for every fixed point x and fixed direction v, the single-variable function g(t) = f (x + tv) is a convex function of t.

This means f (x), x = (x1 , x2 , . . . , xd ), is convex not just in each of the


directions e1 = (1, 0, . . . , 0), e2 = (0, 1, 0, . . . , 0), . . . , ed = (0, 0, . . . , 1), but is
also convex in any direction v. In terms of second derivatives, f (x) is convex

if

(d²/dt²)|_{t=0} f (x + tv)    (4.3.6)
is nonnegative for every point x and every direction v. For this, see also
(4.5.18).
For example, when f (x) is given by (4.3.5),

g(t) = f (x + tv)
1
= (x + tv) · Q(x + tv) − b · (x + tv)
2
1 1 (4.3.7)
= x · Qx − b · x + tv · (Qx − b) + t2 v · Qv
2 2
1 2
= f (x) + tv · (Qx − b) + t v · Qv.
2
From this follows
1
g ′ (t) = v · (Qx − b) + tv · Qv, g ′′ (t) = v · Qv.
2
This shows

Quadratic Convexity

Let Q be a symmetric matrix and b a vector. The quadratic function

f (x) = (1/2) x · Qx − b · x
has gradient
∇f (x) = Qx − b. (4.3.8)
Moreover f (x) is convex everywhere when Q is a variance matrix.
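A small numeric check of (4.3.8) by finite differences (a sketch, not the text's code):

from numpy import *

Q = array([[2.,1.],[1.,3.]])       # symmetric
b = array([1.,-1.])

def f(x): return x @ Q @ x / 2 - b @ x

x0, t = array([.5,2.]), 1e-6
grad = array([ (f(x0 + t*eye(2)[i]) - f(x0))/t for i in range(2) ])

print(grad)           # approximately [2. , 7.5]
print(Q @ x0 - b)     # [2. , 7.5]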

By (2.2.2),
Dv f (x) = ∇f (x) · v = |∇f (x)| |v| cos θ,
where θ is the angle between the vector v and the gradient vector ∇f (x).
Since −1 ≤ cos θ ≤ 1, we conclude

Gradient is Direction of Greatest Increase

Let v be a unit vector and let x0 be a point in Rd . As the direction v


varies, the directional derivative varies between the two extremes

−|∇f (x0 )| ≤ Dv f (x0 ) ≤ |∇f (x0 )|.

The directional derivative achieves its greatest value when v points in


the direction of ∇f (x0 ), and achieves its least value in the opposite
direction, when v points in the direction of −∇f (x0 ).

Exercises

Exercise 4.3.1 Let I(p, q) be the relative information (4.2.9), and let Ipp ,
Ipq , Iqp , Iqq be the second partial derivatives. If Q is the second derivative
matrix

Q = ( Ipp  Ipq
      Iqp  Iqq ),

show

det(Q) = (p − q)²/(p(1 − p)q²(1 − q)²).
Exercise 4.3.2 Let I(p, q) be the relative information (4.2.9). With x =
(p, q) and v = (ap(1 − p), bq(1 − q)), show

(d²/dt²)|_{t=0} I(x + tv) = p(1 − p)(a − b)² + b²(p − q)².

Conclude that I(p, q) is a convex function of (p, q). Where is it not strictly
convex?
Exercise 4.3.3 Let J(x) = J(x1 , x2 , . . . , xd ) equal

J(x) = (1/2)(x1 − x2 )² + (1/2)(x2 − x3 )² + · · · + (1/2)(xd−1 − xd )² + (1/2)(xd − x1 )².

Compute Q = D²J.

4.4 Back Propagation

In this section, we compute outputs and derivatives on a graph. We consider


two cases, when the graph is a chain, or the graph is a network of neurons. The
derivatives are taken with respect to the outputs at each node of the graph.
In §7.2, we consider a third case, and compute outputs and derivatives on a
neural network.
To compute node outputs, we do forward propagation. To compute deriva-
tives, we do back propagation. Corresponding to the three cases, we will code

three versions of forward and back propagation. In all cases, back propagation
depends on the chain rule.
The chain rule (§4.1) states

r = f (x), y = g(r)   =⇒   dy/dx = dy/dr · dr/dx .
In this section, we work out the implications of the chain rule on repeated
compositions of functions.
Suppose

r = f (x) = sin x,
s = g(r) = 1/(1 + e^(−r)),
y = h(s) = s².

These are three functions f , g, h composed in a chain (Figure 4.15).

Fig. 4.15 Composition of three functions in a chain.

The chain in Figure 4.15 has four nodes and four edges. The outputs at
the nodes are x, r, s, y. Start with output x = π/4. Evaluating the functions
in order,

x = 0.785, r = 0.707, s = 0.670, y = 0.448.

Notice these values are evaluated in the forward direction: x then r then s
then y. This is forward propagation.
Now we evaluate the derivatives of the output y with respect to x, r, s,

dy/dx,   dy/dr,   dy/ds.

With the above values for x, r, s, we have

dy/ds = 2s = 2 ∗ 0.670 = 1.340.
Since g is the logistic function, by (4.2.4),

g ′ (r) = g(r)(1 − g(r)) = s(1 − s) = 0.670 ∗ (1 − 0.670) = 0.221.

From this,

dy/dr = dy/ds · ds/dr = 1.340 ∗ g ′ (r) = 1.340 ∗ 0.221 = 0.296.

Repeating one more time,

dy/dx = dy/dr · dr/dx = 0.296 ∗ cos x = 0.296 ∗ 0.707 = 0.209.

Thus the derivatives are

dy/dx = 0.209,   dy/dr = 0.296,   dy/ds = 1.340.
Notice the derivatives are evaluated in the backward direction: First dy/dy =
1, then dy/ds, then dy/dr, then dy/dx. This is back propagation.

Here is another example. Let

r = x2 ,
s = r 2 = x4 ,
y = s2 = x8 .

This is the same function h(x) = x2 composed with itself three times. With
x = 5, we have

x = 5, r = 25, s = 625, y = 390625.

Applying the chain rule as above, check that


dy dy dy
= 625000, = 62500, = 1250.
dx dr ds

To evaluate x, r, s, y in Figure 4.15, first we built the list of functions and


the list of derivatives

from numpy import *

def f(x): return sin(x)


def g(r): return 1/(1+ exp(-r))
def h(s): return s**2
# this for next example
def k(t): return cos(t)

func_chain = [f,g,h]

def df(x): return cos(x)


def dg(r): return g(r)*(1-g(r))
def dh(s): return 2*s
# this for next example
def dk(t): return -sin(t)

der_chain = [df,dg,dh]

Then we evaluate the output vector x = (x, r, s, y), leading to the first
version of forward propagation,

# first version: chains

def forward_prop(x_in,func_chain):
x = [x_in]
while func_chain:
f = func_chain.pop(0) # first func
x_out = f(x_in)
x.append(x_out) # insert at end
x_in = x_out
return x

from numpy import *


x_in = pi/4
x = forward_prop(x_in,func_chain)

Now we evaluate the gradient vector δ = (dy/dx, dy/dr, dy/ds, dy/dy).


Since dy/dy = 1, we set

# dy/dy = 1
delta_out = 1

The code for the first version of back propagation is

# first version: chains

def backward_prop(delta_out,x,der_chain):
delta = [delta_out]
while der_chain:
# discard last output
x.pop(-1)
df = der_chain.pop(-1) # last der
der = df(x[-1])
# chain rule -- multiply by previous der

der = der * delta[0]


delta.insert(0,der) # insert at start
return delta

delta = backward_prop(delta_out,x,der_chain)

Note forward propagation must be run prior to back propagation.

To apply this code to the second example, use

d = 3
func_chain, der_chain = [h]*d, [dh]*d
x_in, delta_out = 5, 1

x = forward_prop(x_in,func_chain)
delta = backward_prop(delta_out,x,der_chain)

Fig. 4.16 A network composition [33].

Now we work with the network in Figure 4.16, using the multi-variable
chain rule (§4.3). The functions are

a = f (x, y) = x + y,
b = g(y, z) = max(y, z),
J = h(a, b) = ab.

The composite function is

J = (x + y) max(y, z).

Here there are three input nodes x, y, z, and three hidden nodes +, max,
∗. Starting with inputs (x, y, z) = (1, 2, 0), and plugging in, we obtain node
outputs
(x, y, z, a, b, J) = (1, 2, 0, 3, 2, 6)
(Figure 4.18). This is forward propagation.

Now we compute the derivatives

∂J/∂x,   ∂J/∂y,   ∂J/∂z,   ∂J/∂a,   ∂J/∂b.

This we do in reverse order. First we compute

∂J/∂a = b = 2,   ∂J/∂b = a = 3.

Then

∂a/∂x = 1,   ∂a/∂y = 1.

Let

1(y > z) = 1 if y > z,   0 if y < z.

Fig. 4.17 The function g = max(y, z): for y > z, g = y, with ∂g/∂y = 1, ∂g/∂z = 0; for y < z, g = z, with ∂g/∂y = 0, ∂g/∂z = 1.



By Figure 4.17, since y = 2 and z = 0,

∂b/∂y = 1(y > z) = 1,   ∂b/∂z = 1(z > y) = 0.
By the chain rule,

∂J/∂x = ∂J/∂a · ∂a/∂x = 2 ∗ 1 = 2,
∂J/∂y = ∂J/∂a · ∂a/∂y + ∂J/∂b · ∂b/∂y = 2 ∗ 1 + 3 ∗ 1 = 5,
∂J/∂z = ∂J/∂b · ∂b/∂z = 3 ∗ 0 = 0.

Hence we have

(∂J/∂x, ∂J/∂y, ∂J/∂z, ∂J/∂a, ∂J/∂b, ∂J/∂J) = (2, 5, 0, 2, 3, 1).

The outputs (blue) and the derivatives (red) are displayed in Figure 4.18.
Summarizing, by the chain rule,
• derivatives are computed backward,
• derivatives along successive edges are multiplied,
• derivatives along several outgoing edges are added.

Fig. 4.18 Forward and backward propagation [33].

To do this in general, recall a directed graph (§3.3) as in Figure 4.16 has


an adjacency matrix W = (wij ) with wij equal to one or zero depending on
whether (i, j) is an edge or not.

Suppose a directed graph has d nodes, and, for each node i, let xi be the
outgoing signal. Then x = (x1 , x2 , . . . , xd ) is the outgoing vector. In the case
of Figure 4.16, d = 6 and

x = (x1 , x2 , x3 , x4 , x5 , x6 ) = (x, y, z, a, b, J).

With this order, the adjacency matrix is

W = ( 0 0 0 1 0 0
      0 0 0 1 1 0
      0 0 0 0 1 0
      0 0 0 0 0 1
      0 0 0 0 0 1
      0 0 0 0 0 0 ).

This we code as a list of lists,

d = 6
w = [ [None]*d for _ in range(d) ]
w[0][3] = w[1][3] = w[1][4] = w[2][4] = w[3][5] = w[4][5] = 1

More generally, in a weighed directed graph (§3.3), the weights wij are nu-
meric scalars.
Once we have the outgoing vector x, for each node j, let

x−j = (w1j x1 , w2j x2 , . . . , wdj xd ).    (4.4.1)

Then x−j is the list of node signals, each weighed accordingly. If (i, j) is not an edge, then wij = 0, so xi does not appear in x−j : In other words, x−j is the weighed list of incoming signals at node j.


An activation function at node j is a function fj of the incoming signals
x−j . Then the outgoing signal at node j is

xj = fj (x−j ) = fj (w1j x1 , w2j x2 , . . . , wdj xd ).    (4.4.2)

By the chain rule,

∂xj /∂xi = ∂fj /∂xi · wij if (i, j) is an edge, and ∂xj /∂xi = 0 if (i, j) is not an edge.    (4.4.3)

For example, if (1, 5), (7, 5), (2, 5) are the edges pointing to node 5 and we ignore zeros in (4.4.1), then x−5 = (w15 x1 , w75 x7 , w25 x2 ), so

x5 = f5 (x−5 ) = f5 (w15 x1 , w75 x7 , w25 x2 ).

The incoming vector is

x− = (x−1 , x−2 , . . . , x−d ).

Then x− is a list of lists. In the case of Figure 4.16, if we ignore zeros,

x− = (x−1 , x−2 , x−3 , x−4 , x−5 , x−6 ) = ((), (), (), (x, y), (y, z), (a, b)),

and

f4 (x, y) = x + y,   f5 (y, z) = max(y, z),   J(a, b) = ab.
Note there is nothing incoming at the input nodes, so there is no point defin-
ing f1 , f2 , f3 .

activate = [None]*d

activate[3] = lambda x,y: x+y


activate[4] = lambda y,z: max(y,z)
activate[5] = lambda a,b: a*b

Assume activate[j] is the function at node j. To compute the outgoing


signal xj at node j, we collect the incoming signals x−
j following (4.4.1)

def incoming(x,w,j):
    # weighed incoming signals at node j, ignoring non-edges,
    # so the list matches the arguments of activate[j]
    return [ outgoing(x,w,i) * w[i][j] for i in range(d) if w[i][j] ]

then plug them into the activation function,

def outgoing(x,w,j):
if x[j] != None: return x[j]
else: return activate[j](*incoming(x,w,j))

Here * is the unpacking operator.


Summarizing, at each node j, we have the outgoing signal xj , and a list x−j of incoming signals.

A node with an attached activation function is a neuron. A network is


a directed weighed graph where the nodes are neurons. The code in this
section works for any network without cycles. In §7.2, we specialize to neural
networks. Neural networks are networks with a restricted class of activation
functions.

Let xin be the outgoing vector over the input nodes. If there are m input
nodes, and d nodes in total, then the length of xin is m, and the length of x
is d. In the example above, xin = (x, y, z).
We assume the nodes are ordered so that the initial portion of x equals
xin ,

m = len(x_in)
x[:m] = x_in

Here is the second version of forward propagation.

# second version: networks

def forward_prop(x_in,w):
d = len(w)
x = [None]*d
m = len(x_in)
x[:m] = x_in
for j in range(m,d): x[j] = outgoing(x,w,j)
return x

For this code to work, we assume there are no cycles in the graph: All back-
ward paths end at inputs.

Let xout be the output nodes. For Figure 4.16, this means xout = (J).
Then by forward propagation, J is also a function of all node outputs. For
Figure 4.16, this means J is a function of x, y, z, a, b.
Therefore, at each node i, we have the derivatives

δi = ∂J/∂xi (xi ),   i = 1, 2, . . . , d.

Then δ = (δ1 , δ2 , . . . , δd ) is the gradient vector. We first compute the deriva-


tives of J with respect to the output nodes xout , and we assume these deriva-
tives are assembled into a vector δout .
In Figure 4.16, there is one output node J, and
∂J
δJ = = 1.
∂J
Hence δout = (1).
We assume the nodes are ordered so that the terminal portion of x equals
xout and the terminal portion of δ equals δout ,

d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out

For each i, j, let

gij = ∂fj /∂xi .
Then we have a d × d gradient matrix g = (gij ). When (i, j) is not an edge,
gij = 0.
These are the local derivatives, not the derivatives obtained by the chain
rule. For example, even though we saw above ∂J/∂y = 1, here the local
derivative is zero, since J does not depend directly on y.
For the example above, with (x1 , x2 , x3 , x4 , x5 , x6 ) = (x, y, z, a, b, J),

g = [ [None]*d for _ in range(d) ]

# note g[i][i] remains undefined

g[0][3] = lambda x,y: 1


g[1][3] = lambda x,y: 1
g[1][4] = lambda y,z: 1 if y>z else 0
g[2][4] = lambda y,z: 1 if z>y else 0
g[3][5] = lambda a,b: b
g[4][5] = lambda a,b: a

By the chain rule and (4.4.3),

∂J/∂xi = Σ_{i→j} ∂J/∂xj · ∂xj /∂xi = Σ_{i→j} ∂J/∂xj · ∂fj /∂xi · wij ,

so

δi = Σ_{i→j} δj · gij · wij .

The code is

def derivative(x,delta,g,i):
    if delta[i] != None: return delta[i]
    # chain rule: sum delta_j * g_ij * w_ij over edges (i,j),
    # passing the incoming signals of node j to the local derivative g[i][j]
    return sum([ derivative(x,delta,g,j) * g[i][j](*incoming(x,w,j)) * w[i][j]
                 for j in range(d) if g[i][j] != None ])

This leads to our second version of back propagation,



# second version: networks

def backward_prop(x,delta_out,g):
d = len(g)
delta = [None]*d
m = len(delta_out)
delta[d-m:] = delta_out
for i in range(d-m): delta[i] = derivative(x,delta,g,i)
return delta
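For example, with the definitions above in scope (the weight matrix w, the activation functions activate, the local derivatives g, and the propagation functions), the two passes should reproduce the outputs and derivatives of Figure 4.18,

x_in = [1, 2, 0]
x = forward_prop(x_in,w)
print(x)                          # [1, 2, 0, 3, 2, 6]

delta_out = [1]                   # dJ/dJ = 1
delta = backward_prop(x,delta_out,g)
print(delta)                      # [2, 5, 0, 2, 3, 1]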

4.5 Convex Functions

Let f (x) be a scalar function of points x = (x1 , . . . , xd ) in Rd . For example,


in two dimensions,

f (x) = f (x1 , x2 ) = max(|x1 |, |x2 |),   f (x) = f (x1 , x2 ) = x1²/4 + x2²
are scalar functions of points in R2 . More generally, if Q is a d × d matrix,
f (x) = x · Qx is such a function. Here, to obtain x · Qx, we think of the point
x as a vector, then use row-times-column multiplication to obtain Qx, then
take the dot product with x. We begin with functions in general.
A level set of f (x) is the set

E: f (x) = 1.

Here we write the level set of level 1. One can have level sets corresponding
to any level ℓ, f (x) = ℓ. In two dimensions, level sets are also called contour
lines.

Fig. 4.19 Level sets and sublevel sets in two dimensions.



For example, the variance ellipsoid x · Qx = 1 is a level set. In two dimen-


sions, the square and ellipse in Figure 4.19 are level sets

max(|x1 |, |x2 |) = 1,   x1²/4 + x2² = 1.

The contour lines of

f (x) = f (x1 , x2 ) = x1²/16 + x2²/4
are in Figure 4.20.

Fig. 4.20 Contour lines in two dimensions.
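Contour lines like those in Figure 4.20 can be drawn with matplotlib's contour; the following is a sketch along the lines of the text's plotting code, not the code that produced the figure, and the plotting range is an assumption:

from numpy import *
from matplotlib.pyplot import *

x1, x2 = meshgrid(arange(-5,5,.05), arange(-5,5,.05))
f = x1**2/16 + x2**2/4

contour(x1, x2, f, levels=10)
gca().set_aspect('equal')
grid()
show()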

A sublevel set of f (x) is the set

E: f (x) ≤ 1.

Here we write the sublevel set of level 1. One can have sublevel sets corre-
sponding to any level c, f (x) ≤ c. For example, in Figure 4.19, the (blue)
interior of the square, together with the square itself, is a sublevel set. Sim-
ilarly, the interior of the ellipse, together with the ellipse itself, is a sublevel
set. The interiors of the ellipsoids, together with the ellipsoids themselves, in
Figure 4.25 are sublevel sets. Note we always consider the level set to be part
of the sublevel set.
The level set f (x) = 1 is the boundary of the sublevel set f (x) ≤ 1. Thus
the square and the ellipse in Figure 4.19 are boundaries of their respective
sublevel sets, and the variance ellipsoid x · Qx = 1 is the boundary of the
sublevel set x · Qx ≤ 1.

Given points x0 and x1 in Rd , the line segment joining them is

[x0 , x1 ] = {(1 − t)x0 + tx1 : 0 ≤ t ≤ 1}.

Fig. 4.21 Line segment [x0 , x1 ].

A scalar function f (x) is convex if¹ for any two points x0 and x1 in Rd ,

f ((1 − t)x0 + tx1 ) ≤ (1 − t)f (x0 ) + tf (x1 ), for 0 ≤ t ≤ 1. (4.5.1)

This says the line segment joining any two points (x0 , f (x0 )) and (x1 , f (x1 ))
on the graph of f (x) lies above the graph of f (x). For example, in two di-
mensions, the function f (x) = f (x1 , x2 ) = x21 + x22 /4 is convex because its
graph is the paraboloid in Figure 4.22.
If the inequality is strict for 0 < t < 1, then f (x) is strictly convex,

f ((1 − t)x0 + tx1 ) < (1 − t)f (x0 ) + tf (x1 ), for 0 < t < 1.

More generally, given points x1 , x2 , . . . , xN , a linear combination

t1 x1 + t2 x2 + · · · + tN xN

is a convex combination if t1 , t2 , . . . , tN are nonnegative, and

t1 + t2 + · · · + tN = 1.

For example, if 0 ≤ t ≤ 1, (1 − t)x0 + tx1 is a convex combination of x0 and


x1 . Then a convex function also satisfies

f (t1 x1 + · · · + tN xN ) ≤ t1 f (x1 ) + · · · + tN f (xN ), (4.5.2)

for any convex combination.

¹ We only consider convex functions that are continuous.



Fig. 4.22 Convex: The line segment lies above the graph.

Recall (§2.2) a nonnegative matrix is a symmetric matrix Q satisfying


x · Qx ≥ 0 for all x, and every such matrix is the variance matrix of some
dataset. This is equivalent to the nonnegativity of the eigenvalues of Q. When
the eigenvalues of Q are positive, Q is invertible.

Quadratic is Convex

If Q is a nonnegative matrix and b is a vector, then

f (x) = (1/2) x · Qx − b · x
is a convex function. When Q is invertible, f (x) is strictly convex.

This was derived in the previous section, but here we present a more
geometric proof.
To derive this result, let x0 and x1 be any points, and let v = x1 − x0 .
Then x0 + tv = (1 − t)x0 + tx1 and x1 = x0 + v. Let g0 = Qx0 − b. By (4.3.7),
f (x0 + tv) = f (x0 ) + tv · (Qx0 − b) + (1/2) t² v · Qv = f (x0 ) + tv · g0 + (1/2) t² v · Qv.    (4.5.3)
Inserting t = 1 in (4.5.3), we have f (x1 ) = f (x0 ) + v · g0 + v · Qv/2. Since
t2 ≤ t for 0 ≤ t ≤ 1 and v · Qv ≥ 0, by (4.5.3),

f ((1 − t)x0 + tx1 ) = f (x0 + tv)
   ≤ f (x0 ) + tv · g0 + (1/2) tv · Qv
   = (1 − t)f (x0 ) + tf (x0 ) + tv · g0 + (1/2) tv · Qv
   = (1 − t)f (x0 ) + tf (x1 ).

When Q is invertible, then v · Qv > 0, and we have strict convexity.

Here are some basic properties and definitions of sets that will be used
in this section and in the exercises. Let a be a point in Rd and let r be a
positive scalar. A closed ball of radius r and center a is the set of points x
satisfying |x − a|2 ≤ r2 . An open ball of radius r and center a is the set of
points x satisfying |x − a|2 < r2 .
Let E be any set in Rd . The complement of E is the set E c of points that
are not in E. If E and F are sets, the intersection E ∩ F is the set of points
that lie in both sets.
A point a is in the interior of E if there is a ball B centered at a contained
in E; this is usually written B ⊂ E. Here the ball may be either open or
closed, the interior is the same.
A point a is in the boundary of E if every ball centered at a contains points
of E and points of E c . From the definitions, it is clear that there are no points
that lie in both the interior of E and the boundary of E.
Let E be a set. If E equals its interior, then E is an open set. If E contains
its boundary, then E is a closed set . When a set is closed, we have

set = interior + boundary.

Most sets are neither open nor closed.

A convex set is a closed subset E in Rd that contains the line segment


joining any two points in it: If x0 and x1 are in E, then the line segment
[x0 , x1 ] is in E. To be consistent with sublevel sets, we only consider convex
sets that contain their boundaries. In other words, we only consider convex
sets that are closed.
More generally, given points x1 , x2 , . . . , xN in E, the convex combination

x = t 1 x1 + t 2 x2 + · · · + t N xN

is also in E. The set of all convex combinations of x1 , x2 , . . . , xN is the


convex hull of x1 , x2 , . . . , xN (Figure 4.23).
Fig. 4.23 Convex hull of x1 , x2 , x3 , x4 , x5 , x6 , x7 .

The convex hull of a dataset is a (closed) convex set. Conversely, if E is


convex and contains a dataset x1 , x2 , . . . , xN , then E contains the convex
hull of the dataset.
The interiors of the square and the ellipse in Figure 4.19, together with
their boundaries, are convex sets. The interior of the ellipsoid in Figure 4.25,
together with the ellipsoid, is a convex set.

Fig. 4.24 A convex hull with one facet highlighted.

The following code generates convex hulls,

from scipy.spatial import ConvexHull


from numpy import *
from numpy.random import *

rng = default_rng()

# 30 random points in 2-D


points = rng.random((30, 2))

hull = ConvexHull(points)

and this plots the facets of the convex hull

from matplotlib.pyplot import *

plot(points[:,0], points[:,1], 'o')


for facet in hull.simplices:
plot(points[facet,0], points[facet,1], 'k-')

facet = hull.simplices[0]
plot(points[facet, 0], points[facet, 1], 'r--')
grid()
show()

resulting in Figure 4.24.

If f (x) is a function, its graph is the set of points (x, y) in Rd+1 satisfying
y = f (x), and its epigraph is the set of points (x, y) satisfying y ≥ f (x).
If f (x) is defined on Rd , its sublevel sets are in Rd , and its epigraph is in
Rd+1 . Then f (x) is a convex function exactly when its epigraph is a convex
set (Figure 4.22). From convex functions, there are other ways to get convex
sets:

Sublevel of Convex is Convex

If f (x) is a convex function, then the sublevel set

E: f (x) ≤ 1

is a convex set.

This is an immediate consequence of the definition: f (x0 ) ≤ 1 and f (x1 ) ≤


1 implies

f ((1 − t)x0 + tx1 ) ≤ (1 − t)f (x0 ) + tf (x1 ) ≤ (1 − t) + t = 1.

From these results, we have



Ellipsoids are Boundaries of Convex Sets

If Q is a variance matrix, then x · Qx ≤ 1 is a convex set.

Fig. 4.25 Convex set in three dimensions with supporting hyperplane.

Let n be a nonzero vector in Rd . In two dimensions, the vectors orthogonal


to n form a line (Figure 4.26). In three dimensions, the vectors orthogonal
to n form a plane (Figure 4.26). In d dimensions, these vectors form the
orthogonal complement n⊥ (2.7.5), which is a (d − 1)-dimensional subspace.
This subspace is a hyperplane passing through the origin.
In general, given a point x0 and a nonzero vector n, the hyperplane through
x0 with normal n consists of all solutions x of

H: n · (x − x0 ) = 0. (4.5.4)

The hyperplane equation may be written

H: m · x + b = 0, (4.5.5)

with a nonzero vector m and scalar b. In this section, we use (4.5.4); in §7.6,
we use (4.5.5).
Fig. 4.26 Hyperplanes in two and three dimensions.

A hyperplane separates the whole space Rd into two half-spaces,

n · (x − x0) < 0   and   n · (x − x0) > 0,

together with the hyperplane n · (x − x0) = 0 itself.

The vector n is the normal vector to the hyperplane. Note replacing n by any
nonzero multiple of n leaves the hyperplane unchanged.
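As a quick illustration, the following sketch classifies a few points in the plane by the sign of n · (x − x0); the particular normal n, point x0, and sample points are arbitrary values chosen for this example.

from numpy import *

# hyperplane through x0 with normal n (illustrative values)
n = array([1.0, 2.0])
x0 = array([0.0, 1.0])

points = array([[3.0, 0.0], [0.0, 1.0], [-2.0, -1.0]])

for x in points:
    s = dot(n, x - x0)
    if s > 0: print(x, "positive side")
    elif s < 0: print(x, "negative side")
    else: print(x, "on the hyperplane")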

Separating Hyperplane I

Let E be a convex set and let x∗ be a point not in E. Then there is


a hyperplane separating x∗ and E: For some x0 in E and nonzero n,

n · (x − x0 ) ≤ 0 and n · (x∗ − x0 ) > 0. (4.5.6)

Fig. 4.27 Separating hyperplane I.

A diagram of the proof is Figure 4.27. Let x0 be the point in E closest
to x∗. This means x0 minimizes |x − x∗|^2 over x in E. If x is in E, then by
convexity, the line segment [x0, x] is in E, hence x0 + tv, v = x − x0, is in E
for 0 ≤ t ≤ 1. Since x0 is the point of E closest to x∗,

|x0 − x∗|^2 ≤ |x0 + tv − x∗|^2 for 0 ≤ t ≤ 1.

Expanding, we have

|x0 − x∗|^2 ≤ |x0 − x∗|^2 + 2t(x0 − x∗) · v + t^2 |v|^2,   0 ≤ t ≤ 1.

Canceling |x0 − x∗|^2 then, for t > 0, canceling t, we obtain

0 ≤ 2(x0 − x∗) · v + t|v|^2,   0 ≤ t ≤ 1.

Since this is true for small positive t, sending t → 0 results in v · (x0 − x∗) ≥ 0.
Setting n = x∗ − x0, this says n · (x − x0) ≤ 0. Moreover n · (x∗ − x0) = |x∗ − x0|^2 > 0,
since x∗ is not in E. Thus we obtain

n · (x − x0) ≤ 0 and n · (x∗ − x0) > 0.

Now suppose x0 is a point in the boundary of a convex set E. Since x0 is


in E, we cannot find a separating hyperplane for x∗ = x0 . In this case, the
best we can hope for is a hyperplane passing through x0 , with E to one side
of the hyperplane:

x in E =⇒ (x − x0 ) · n ≤ 0. (4.5.7)

Such a hyperplane is a supporting hyperplane for E at x0 . Figures 4.19 and


4.25 display examples of supporting hyperplanes. Here is the basic result
relating convex sets and supporting hyperplanes.

Supporting Hyperplane for Convex Set

Let E be a convex set and let x0 be a point on the boundary of E.


Then there is a supporting hyperplane at x0 .

If x0 is in the boundary of E, there are points x′ not in E approximating


x0 (Figure 4.27). Applying the separating hyperplane theorem to x′ , and
taking the limit x′ → x0 , results in a supporting hyperplane at x0 . We skip
the details here.
Supporting hyperplanes characterize convex sets in the following sense: If
through every point x0 in the boundary of E, there is a supporting hyper-
plane, then E is convex.

Recall a bit is either zero or one. A dataset x1, x2, . . . , xN is a two-class
dataset if each sample xk is labeled by a bit pk, so there are bits p1, p2, . . . , pN.
Then the two classes correspond to the samples with p = 1 and with p = 0 respectively.

Let m · x + b = 0 be a hyperplane. The level of a sample x relative to the
hyperplane is y = m · x + b. A hyperplane is separating if, for every sample x,

y ≥ 0 if p = 1,   and   y ≤ 0 if p = 0. (4.5.8)

When there is a separating hyperplane, we say the dataset is separable. Often


(see §7.6), hyperplanes are decision boundaries.
The dataset x1 , x2 , . . . , xN lies in the hyperplane m · x + b = 0 if

m · xk + b = 0, k = 1, 2, . . . , N. (4.5.9)

When a two-class dataset lies in a hyperplane, the hyperplane is separating,


so the question of separability is only interesting when the dataset does not
lie in a hyperplane.
If a two-class dataset does not lie in a hyperplane, then the means of the
two classes are distinct (Exercise 4.5.9).

Separating Hyperplane II

Let x1 , x2 , . . . , xN be a two-class dataset and assume neither class


lies in a hyperplane. Let K0 and K1 be the convex hulls of the two
classes. Then the dataset is separable iff the intersection K0 ∩ K1 has
no interior.

To derive this result, from Exercise 4.5.7 both K0 and K1 have interiors.
Suppose there is a separating hyperplane m · x + b = 0. If x0 is in K0 ∩ K1 ,
then we have m · x0 + b ≤ 0 and m · x0 + b ≥ 0, so m · x0 + b = 0. This shows
the separating hyperplane passes through x0 . Since K0 lies on one side of the
hyperplane, x0 cannot be in the interior of K0 . Similarly for K1 . Hence x0
cannot be in the interior of K0 ∩ K1 . This implies K0 ∩ K1 has no interior.
Conversely, suppose K0 ∩ K1 has no interior. There are two cases, whether
K0 ∩ K1 is empty or not. If K0 ∩ K1 is empty, then the minimum of |x1 − x0|^2
over all x1 in K1 and x0 in K0 is positive. If we let

|x∗1 − x∗0|^2 = min { |x1 − x0|^2 : x0 in K0, x1 in K1 },

then x∗0 ≠ x∗1, x∗1 is on the boundary of K1, and x∗0 is on the boundary of K0.

Fig. 4.28 Separating hyperplane II.

In the first case, since K0 and K1 don’t intersect, x∗1 is not in K0 , and x∗0
is not in K1 . Let m = x∗1 − x∗0 . By separating hyperplane I, the hyperplane
H0 : m · (x − x∗0 ) = 0 separates K0 from x∗1 . Similarly, the hyperplane H1 :
m · (x − x∗1 ) = 0 separates K1 from x∗0 . Thus (Figure 4.28) both hyperplanes
separate K0 from K1 .
In the second case, when K0 and K1 intersect, then x∗0 = x∗1 = x∗ . Let
0 < t < 1, and let tK0 be K0 scaled towards its mean. Similarly, let tK1
be K1 scaled towards its mean. By Exercise 4.5.8, both tK0 and tK1 lie in
the interiors of K0 and K1 respectively, so tK0 and tK1 do not intersect. By
applying the first case to tK0 and tK1 , and choosing t close to 1, t → 1, we
obtain a hyperplane H separating K0 and K1 . We skip the details.
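One concrete way to test a small two-class dataset for a strictly separating hyperplane is to pose (4.5.8), with a margin, as a linear feasibility problem. The sketch below uses scipy.optimize.linprog; the two point clouds are made-up illustrative data, and the margin-1 normalization is an extra assumption, so this checks strict separability rather than the boundary case.

from numpy import *
from scipy.optimize import linprog

# two made-up classes in the plane
class1 = array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0]])    # labels p = 1
class0 = array([[0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])  # labels p = 0

# look for (m1, m2, b) with m.x + b >= 1 on class1 and m.x + b <= -1 on class0
A_ub = vstack([ -hstack([class1, ones((len(class1),1))]),
                 hstack([class0, ones((len(class0),1))]) ])
b_ub = -ones(len(class1) + len(class0))

res = linprog(zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)]*3)
print("separating hyperplane found:", res.success)
if res.success: print("m, b =", res.x)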

In Figure 4.19, at the corner of the square, there are multiple supporting
hyperplanes. However, at every other point a on the boundary of the square,
there is a unique (up to scalar multiple) supporting hyperplane. For the ellipse
or ellipsoid, at every point of the boundary, there is a unique supporting
hyperplane.
Now we derive the analogous concepts for convex functions.
Let f (x) be a function and let a be a point at which there is a gradient
∇f (a). The tangent hyperplane for f (x) at a is

y = f (a) + ∇f (a) · (x − a). (4.5.10)

Convex Function Graph Lies Above the Tangent Hyperplane

If f (x) is convex and has a gradient ∇f (a), then

f (x) ≥ f (a) + ∇f (a) · (x − a). (4.5.11)



This vector result is obtained by applying the corresponding scalar result


in §4.1 to the function f (a + tv), where v = x − a. As in the scalar case, there
is a similar result (4.5.20) with tangent paraboloids.
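For a numerical sanity check of (4.5.11), the following sketch evaluates both sides for the convex quadratic f(x) = x · Qx/2 − b · x, whose gradient is Qx − b; the matrix Q, the vector b, and the test points are random illustrative choices.

from numpy import *
from numpy.random import default_rng

rng = default_rng()

# a convex quadratic f(x) = x.Qx/2 - b.x, with Q = A^T A positive semidefinite
A = rng.standard_normal((3, 3))
Q = A.T @ A
b = rng.standard_normal(3)

def f(x): return x @ Q @ x / 2 - b @ x
def grad(x): return Q @ x - b

a = rng.standard_normal(3)
for _ in range(5):
    x = rng.standard_normal(3)
    tangent = f(a) + grad(a) @ (x - a)
    print(f(x) >= tangent - 1e-12)   # graph lies above the tangent hyperplane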

We now address the existence of a global minimizer of a convex function.


A (global) minimizer for f (x) is a vector x∗ satisfying

f(x∗) = min_x f(x),

where the minimum is taken over all vectors x. A minimizer is the location of
the bottom of the graph of the function. For example, the parabola (Figure
4.4) and the relative information (Figure 4.12) both have global minimizers.
We say a function f (x) is strictly convex if g(t) = f (a + tv) is strictly
convex for every point a and direction v. This is the same as saying the
inequality (4.5.1) is strict for 0 < t < 1.
We say a function f (x) is proper if the sublevel set f (x) ≤ c is bounded
for every level c. Before we state this precisely, we contrast a level versus a
bound.
Let f (x) be a function. A level is a scalar c determining a sublevel set
f (x) ≤ c. A bound is a scalar C determining a bounded set |x| ≤ C.
We say f (x) is proper if for every level c, there is a bound C so that

f (x) ≤ c =⇒ |x| ≤ C. (4.5.12)

This is same as saying f (x) rises to +∞ as |x| → ∞. The exact formula


for the bound C, which depends on the level c and the function f (x), is not
important for our purposes. What matters is the existence of some bound C
for each level c.
More vividly, suppose x is scalar, and think of the graph of y = f (x) as the
cross-section of a river. Then f (x) is proper if the river never floods its banks,
no matter how much it rains. So y = sin x is not proper, but y = x^2 + sin x
is proper.
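The following sketch makes this concrete on a finite grid (so it is only suggestive): the sublevel set of sin x at level c = 1 runs to the edge of whatever grid we choose, while that of x^2 + sin x stays bounded.

from numpy import *

# compare sublevel sets {f <= c} of sin(x) and x**2 + sin(x) on a wide grid
x = arange(-100, 100, .01)
c = 1

for name, y in [("sin(x)", sin(x)), ("x**2 + sin(x)", x**2 + sin(x))]:
    inside = x[y <= c]
    print(name, ": sublevel set reaches out to |x| =", abs(inside).max())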
What does it mean for f (x) to not be proper? Unpacking the definition,
f (x) is not proper if there is some level c with no corresponding bound C.
This means there is some level c and a sequence x1 , x2 , . . . with f (xn ) ≤ c
and |xn | → ∞.
For example, the functions in Figure 4.4 are proper and strictly convex,
while the function in Figure 4.5 is proper but neither convex nor strictly
convex.
Intuitively, if f (x) goes up to +∞ when x is far away, then its graph must
have a minimizer at some point x∗ . Continuous functions are defined in §A.7.

Existence of Global Minimizer

Suppose f (x) is a continuous proper function. Then f (x) has a global


minimizer x∗ ,
f (x∗ ) ≤ f (x). (4.5.13)

To see this, pick any point a. Then, by properness, the sublevel set S given
by f (x) ≤ f (a) is bounded. By continuity of f (x), there is a minimizer x∗
(see §A.7). Since for all x outside the sublevel set, we have f (x) > f (a), x∗
is a global minimizer.
When f (x) is also strictly convex, the minimizer is unique.

Existence and Uniqueness of Global Minimizer

Suppose f (x) is a continuous strictly convex proper function. Then


f (x) has a unique global minimizer x∗ .

Let x1 be another global minimizer. Then f (x1 ) = f (x∗ ). Let x2 = (x∗ +


x1 )/2 be their midpoint. By strict convexity,
f(x2) < (1/2)(f(x∗) + f(x1)) = f(x∗),
contradicting the fact that x∗ is a global minimizer. Thus there cannot be
another global minimizer.

The following result describes when the residual (2.6.1) is proper.

Properness of Residual on Row Space

Let A be a matrix, and b a vector with dimensions so that the residual

f(x) = |Ax − b|^2

is defined. Then f (x) is proper on the row space of A.

To see this, suppose f (x) is not proper. In this case, by (4.5.12), there
would be a level c and a sequence x1 , x2 , . . . in the row space of A satisfying
|xn | → ∞ and f (xn ) ≤ c for n ≥ 1.
Let x′n = xn/|xn|. Then the x′n are unit vectors in the row space of A, hence
x′n is a bounded sequence. From §A.7, this implies x′n subconverges to some

x∗ , necessarily a unit vector in the row space of A.


By the triangle inequality (2.2.4),

|Ax′n| = (1/|xn|) |Axn| ≤ (1/|xn|) (|Axn − b| + |b|) ≤ (1/|xn|) (√c + |b|).

Moreover Ax′n subconverges to Ax∗. Since |xn| → ∞, taking the limit n → ∞,

|Ax∗| = lim_{n→∞} |Ax′n| ≤ lim_{n→∞} (1/|xn|) (√c + |b|) = 0.
Thus x∗ is both in the row space of A and in the null space of A. Since the
row space and the null space are orthogonal, this implies x∗ = 0. But we can’t
have 1 = |x∗ | = |0| = 0. This contradiction shows there is no such sequence
xn , and we conclude f (x) is proper.
When the row space is the source space,

Properness of Residual

When the N × d matrix A has rank d,

f(x) = |Ax − b|^2 (4.5.14)

is proper on Rd .

As a consequence,

Existence of Residual Minimizer


Let A be a matrix and b a vector so that the residual

f(x) = |Ax − b|^2 (4.5.15)

is well-defined. Then there is a residual minimizer x∗ in the row space


of A,
f (x∗ ) ≤ f (x). (4.5.16)

The global minimizer x∗ is located by the first derivative test.

First Derivative Test for Global Minimizer

Let f (x) be a strictly convex proper function having a gradient ∇f (x)


at every point. Then the global minimizer x∗ is the unique point sat-
isfying

∇f (x∗ ) = 0. (4.5.17)

Let a be any point, and v any direction, and let g(t) = f (a + tv). Then

g ′ (0) = ∇f (a) · v.

If a is a minimizer, then t = 0 is a minimum of g(t), so g ′ (0) = 0. Since v is


any direction, this shows ∇f (a) = 0.
If there were another point b satisfying ∇f (b) = 0, let v = b − a. Then
b = a + v and g(t) is strictly convex in t, and also g ′ (1) = ∇f (b) · v = 0. By
convexity, g ′ (t) is increasing in t. If g ′ (0) = 0 and g ′ (1) = 0, then g ′ (t) = 0
for 0 < t < 1. This implies g(t) is linear on 0 < t < 1, contradicting strict
convexity. This establishes the first derivative test.

Suppose the second partials

∂^2 f / ∂xi ∂xj,   1 ≤ i, j ≤ d,

exist. Then the second derivative of f(x) is the symmetric matrix

D^2 f(x) =
[ ∂^2 f/∂x1∂x1   ∂^2 f/∂x1∂x2   ...   ∂^2 f/∂x1∂xd ]
[ ∂^2 f/∂x2∂x1   ∂^2 f/∂x2∂x2   ...   ∂^2 f/∂x2∂xd ]
[      ...             ...       ...        ...     ]
[ ∂^2 f/∂xd∂x1   ∂^2 f/∂xd∂x2   ...   ∂^2 f/∂xd∂xd ]

Replacing x by x + tv in (4.3.3), we have

(d/dt) f(x + tv) = ∇f(x + tv) · v.
Differentiating and using the chain rule again,

Second Directional Derivative and Convexity

The second derivative Q = D^2 f(x) satisfies

(d^2/dt^2) f(x + tv) |_{t=0} = v · Qv. (4.5.18)

Then f(x) is convex if the second directional derivative is nonnegative
for all x and v.

This implies

Second Directional Derivative and Strict Convexity

f (x) is strictly convex if f (x) is convex and

(d^2/dt^2) f(x + tv) |_{t=0} = 0 only when v = 0. (4.5.19)

An important example of a strictly convex proper function is f (x) = x ·


Qx/2 − b · x when Q > 0 (§4.3). Also (§7.5, §7.6) loss functions in linear
regression and logistic regression are strictly convex and proper under the
right assumptions.
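To make this example concrete, the sketch below builds a positive matrix Q (the construction A^T A + I is just one convenient illustrative choice), confirms its eigenvalues are positive, and checks that a numerical minimizer of f(x) = x · Qx/2 − b · x agrees with the solution of Qx = b given by the first derivative test.

from numpy import *
from numpy.random import default_rng
from numpy.linalg import eigvalsh, solve
from scipy.optimize import minimize

rng = default_rng()

# f(x) = x.Qx/2 - b.x with Q > 0
A = rng.standard_normal((4, 4))
Q = A.T @ A + eye(4)
b = rng.standard_normal(4)

print("eigenvalues of Q:", eigvalsh(Q))   # all positive: strictly convex and proper

def f(x): return x @ Q @ x / 2 - b @ x

res = minimize(f, zeros(4))
print("numerical minimizer:", res.x)
print("solution of Q x = b:", solve(Q, b))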

Recall m ≤ Q ≤ L means the eigenvalues of the symmetric matrix Q are


between m and L. The following is the multi-variable version of (4.1.14). The
proof is the same as in the scalar case.

Second Derivative Bounds

If m ≤ D2 f (x) ≤ L, then

(m/2) |x − a|^2 ≤ f(x) − f(a) − ∇f(a) · (x − a) ≤ (L/2) |x − a|^2. (4.5.20)

If we choose a = x∗ , where x∗ is the global minimizer, then by (4.5.17),


we see the graph of f (x) lies between two quadratics globally.

Upper and Lower Paraboloids

If m ≤ D^2 f(x) ≤ L and x∗ is the global minimizer, then

(m/2) |x − x∗|^2 ≤ f(x) − f(x∗) ≤ (L/2) |x − x∗|^2. (4.5.21)

We describe the convex dual in the multi-variable setting (the single-


variable case was done in (4.1.16)). If f (x) is a scalar convex function of
x, and x = (x1 , x2 , . . . , xd ) has d features, the convex dual is
g(p) = max_x ( p · x − f(x) ). (4.5.22)

Here the maximum is over all vectors x, and p = (p1 , p2 , . . . , pd ), the dual
variable, also has d features. We will work in situations where a maximizer
exists in (4.5.22).
Let Q > 0 be a positive matrix. The simplest example is

f(x) = (1/2) x · Qx   =⇒   g(p) = (1/2) p · Q^{-1} p.

This is established by the identity

(1/2) (p − Qx) · Q^{-1} (p − Qx) = (1/2) p · Q^{-1} p − p · x + (1/2) x · Qx. (4.5.23)

To see this, since the left side of (4.5.23) is greater or equal to zero, we have

(1/2) p · Q^{-1} p − p · x + (1/2) x · Qx ≥ 0.

Since (4.5.23) equals zero iff p = Qx, we are led to (4.5.22).
Moreover, switching p · Q^{-1} p with x · Qx, we also have

f(x) = max_p ( p · x − g(p) ). (4.5.24)

Thus the convex dual of the convex dual of f (x) is f (x). In §5.6, we compute
the convex dual of the cumulant-generating function.
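A quick numerical check of this example: for one particular Q > 0 and p (both arbitrary illustrative choices), maximize p · x − f(x) by applying scipy.optimize.minimize to its negative, and compare with p · Q^{-1} p / 2.

from numpy import *
from numpy.linalg import inv
from scipy.optimize import minimize

# Q > 0, f(x) = x.Qx/2; its dual should be g(p) = p.Q^{-1}p/2
Q = array([[2.0, 0.5], [0.5, 1.0]])
p = array([0.7, -1.3])

def neg_objective(x): return -(p @ x - x @ Q @ x / 2)

res = minimize(neg_objective, zeros(2))
g_numeric = -res.fun
g_formula = p @ inv(Q) @ p / 2
print(g_numeric, g_formula)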
If x is a maximizer in (4.5.22), then the derivative is zero,

0 = ∇x (p · x − f (x)) =⇒ p = ∇x f (x).

Here ∇ is with respect to x. The maximizer x = x(p) depends on p, so by


the chain rule
∇p g(p) = ∇(p · x(p) − f (x(p)))
= x + p∇x(p) − ∇f (x)∇x(p) = x + (p − ∇f (x))∇x(p) = x.

Here ∇ is with respect to p and, since x = x(p) is vector-valued, ∇x(p) is a


d × d matrix. We conclude

p = ∇x f (x) ⇐⇒ x = ∇p g(p).

Thus the vector-valued function ∇f (x) is the inverse of the vector-valued


function ∇g(p), or
∇g(∇f (x)) = x.
Differentiating, we obtain

D2 g(∇f (x))D2 f (x) = I.



This yields

Second Derivatives of Dual Functions

Let f (x) be a strictly convex function with second derivative D2 f (x),


and let g(p) be its convex dual. Then

D^2 g(p) = (D^2 f(x))^{-1},   p = ∇f(x).

Moreover, if m ≤ D2 f (x) ≤ L, then


1/L ≤ D^2 g(p) ≤ 1/m.

Using this, and writing out (4.5.20) for g(p) instead of f (x) (we skip the
details) yields

Dual Second Derivative Bounds

Let p = ∇f (x) and q = ∇f (a). If f (x) is convex and m ≤ D2 f (x) ≤ L,


then
(1/2m) |p − q|^2 ≥ f(x) − f(a) − q · (x − a) ≥ (1/2L) |p − q|^2. (4.5.25)

This is used in gradient descent.

Now let f (x) be strongly convex in the sense m ≤ D2 f (x) ≤ L. Then we


have the vector version of (4.1.15).

Coercivity of the Gradient

Let p = ∇f (x) and q = ∇f (a). If m ≤ D2 f (x) ≤ L, then

(p − q) · (x − a) ≥ (mL/(m + L)) |x − a|^2 + (1/(m + L)) |p − q|^2. (4.5.26)

This is derived by using (4.5.25); the details are in [4]. This result is used
in gradient descent.
For the exercises below, we refer to the properties of sets defined earlier:
interior and boundary.

Exercises

Exercise 4.5.1 Let e0 = 0 and let e1 , e2 , . . . , ed be the one-hot encoded


basis in Rd . The d-simplex Σd is the convex hull of e0 , e1 , e2 , . . . , ed . Draw
pictures of Σ1 , Σ2 , and Σ3 . Show Σd is the suspension (§1.6) of Σd−1 from
ed . Conclude
Vol(Σd) = 1/d!,   d = 0, 1, 2, 3, . . .
(Since Σ0 is one point, we start with Vol(Σ0 ) = 1.)

Exercise 4.5.2 Let a be a point in Rd and r a positive scalar. Then the


open ball {x : |x − a| < r} is an open set.

Exercise 4.5.3 A hyperplane in Rd is a closed set.

Exercise 4.5.4 Let B be a ball in Rd (either open or closed). Then the span
of B is Rd .

Exercise 4.5.5 A hyperplane in Rd has no interior.

Exercise 4.5.6 Let K be the convex hull of a dataset, and suppose the
dataset does not lie in a hyperplane. Then the mean of the dataset does
not lie in any supporting hyperplane of K.

Exercise 4.5.7 Let K be the convex hull of a dataset. Then the dataset does
not lie in a hyperplane iff K has interior. (Show the mean of the dataset is
in the interior of K: Argue by contradiction - assume the mean is on the
boundary of K.)

Exercise 4.5.8 Let K be a convex set, let x0 lie on the boundary of K, and
let m be in the interior of K. Then, apart from x0 , the line segment joining
m and x0 lies in the interior of K.

Exercise 4.5.9 If a two-class dataset does not lie in a hyperplane, then the
means of the two classes are distinct.
Chapter 5
Probability

Many basic concepts of probability are already present in a coin-tossing con-


text. Because of this, we start with a section on binomial probability. Here
we show how, even for coin-tossing, entropy is an inescapable feature, a basic
measure of randomness.
We also show how Bayes theorem allows us to flip things and gain inference.
For this, we need the fundamental theorem of calculus, which is included in
§A.5. Bayesian techniques are further explored in the exercises.
After this, random variables and the normal and chi-squared distributions
are covered. The presentation is layered so that a reader with only minimal
prior exposure will come away with an appreciation of the basic nature of
probabilistic reasoning.

5.1 Binomial Probability

Suppose a coin is tossed repeatedly, landing heads or tails each time. After
tossing the coin 100 times, we obtain 53 heads. What can we say about this
coin? Can we claim the coin is fair? Can we claim the probability of obtaining
heads is .53?
Whatever claims we make about the coin, they should be reliable, in that
they should more or less hold up to repeated verification.
To obtain reliable claims, we therefore repeat the above experiment 20
times, obtaining for example the following count of heads

[57, 49, 55, 44, 55, 50, 49, 50, 53, 49, 53, 50, 51, 53, 53, 54, 48, 51, 50, 53].

On the other hand, suppose someone else repeats the same experiment 20
times with a different coin, and obtains

[69, 70, 79, 74, 63, 70, 68, 71, 71, 73, 65, 63, 68, 71, 71, 64, 73, 70, 78, 67].


In this case, one suspects the two coins are statistically distinct, and have
different probabilities of obtaining heads.
In this section, we study how the probabilities of coin-tossing behave, with
the goal of answering the question: Is a given coin fair?

As we see below, coin-tossing outcomes rely heavily on how probabilities


multiply. Before we discuss this, we address the simpler issue of how proba-
bilities add.
Let A be an event, say obtaining a specific pattern of heads and tails.
Then the complement Ac of A is the event that we do not obtain the specific
pattern. Then A and Ac are mutually exclusive, in the sense there is no
outcome common to A and Ac . In other words, the intersection

A ∩ Ac = A and Ac

is no outcome, which implies

P rob(A ∩ Ac ) = P rob(A and Ac ) = 0.

Moreover, A and Ac are mutually exhaustive, in the sense obtaining one


or the other is a certainty. In other words, the union

A ∪ Ac = A or Ac

is all outcomes, which implies

P rob(A ∪ Ac ) = P rob(A or Ac ) = 1.

Since Ac is the event complementary to A, we expect

P rob(Ac ) = 1 − P rob(A).

This may be rewritten as

P rob(A) + P rob(Ac ) = 1.

More generally, let A and B be any two events. If A and B are mutually
exclusive, then no outcome satisfies A and B simultaneously. In this case, we
expect

P rob(A ∪ B) = P rob(A or B) = P rob(A) + P rob(B). (5.1.1)

To repeat, (5.1.1) is only correct when A and B are mutually exclusive.


In general, several events A, B, C, . . . are mutually exclusive if there
is no overlap between them, no two of them happen simultaneously. They

are mutually exhaustive if they are mutually exclusive, and their union is
everything, meaning at least one of them must happen. In this case we must
have
P rob(A) + P rob(B) + P rob(C) + · · · = 1.
As we saw above, A and Ac are mutually exhaustive. Continuing along the
same lines, the general result for additivity is

Addition of Probabilities
If A1 , A2 , . . . , Ad are mutually exhaustive events, then
P rob(B) = Σ_{i=1}^{d} P rob(B and Ai). (5.1.2)

In particular, if A and B are events,

P rob(B) = P rob(B and A) + P rob(B and Ac ). (5.1.3)

Assume we are tossing a coin. If we let p = P rob(H) and q = P rob(T ) be


the probabilities of obtaining heads and tails after a single toss, then

p + q = 1.

The proportion p is the coin’s bias towards heads. In particular, we see q =


1 − p, and the bias p may be any real number between 0 and 1, depending on
the particular coin being tossed. When p = 1/2, P rob(H) = P rob(T ), and
we say the coin is fair.
If we toss the coin twice, we obtain one of four possibilities, HH, HT ,
T H, or T T . If we make the natural assumption that the coin has no memory,
that the result of the first toss has no bearing on the result of the second
toss, then the probabilities are

P rob(HH) = p^2,  P rob(HT) = pq,  P rob(TH) = qp,  P rob(TT) = q^2. (5.1.4)

These are valid probabilities since their sum equals 1,

p^2 + pq + qp + q^2 = (p + q)^2 = 1^2 = 1.

To see why these are the correct probabilities, we use the conditional
probability definition,

P rob(A | B) = P rob(A and B) / P rob(B). (5.1.5)

We use this formula to compute the probability that we obtain heads on


the second toss given that we obtain tails on the first toss. The conditional
probability definition (5.1.5) is equivalent to the chain rule

P rob(A and B) = P rob(A | B) P rob(B). (5.1.6)

To compute this, we introduce the convenient notation


Xn = 1, if the n-th toss is heads;   Xn = 0, if the n-th toss is tails.

Then Xn is a random variable (§5.3) and represents a numerical reward


function of the outcome (heads or tails) at the n-th toss.
With this notation, (5.1.4) may be rewritten

P rob(X1 = 1 and X2 = 1) = p^2,
P rob(X1 = 1 and X2 = 0) = pq,
P rob(X1 = 0 and X2 = 1) = qp,
P rob(X1 = 0 and X2 = 0) = q^2.

In particular, by (5.1.3), this implies (remember q = 1 − p)

P rob(X1 = 1) = P rob(X1 = 1 and X2 = 0) + P rob(X1 = 1 and X2 = 1)


= pq + p^2 = p(p + q) = p.

Similarly, P rob(X2 = 1) = p. Computing,

P rob(X2 = 1 | X1 = 0) = P rob(X1 = 0 and X2 = 1) / P rob(X1 = 0) = qp/q = p = P rob(X2 = 1),
so
P rob(X2 = 1 | X1 = 0) = P rob(X2 = 1).
Thus X1 = 0 has no effect on the probability that X2 = 1, and similarly for
the other possibilities. This is often referred to as the independence of the
coin tosses. We conclude

Multiplication of Probabilities: Independent Coin-Tossing

With the conditional probability definition (5.1.5), a coin has no mem-


ory between successive tosses iff the probabilities at distinct tosses
multiply,

P rob(X1 = a1 , X2 = a2 , . . . ) = P rob(X1 = a1 ) P rob(X2 = a2 ) . . .


(5.1.7)

Here a1 , a2 , . . . are 0 or 1.

Since we are tossing the same coin, we can set

P rob(Xn = 1) = p, P rob(Xn = 0) = q = 1 − p, n ≥ 1.

Thus all probabilities in (5.1.7) are determined by the parameter p, which


may be any number between 0 and 1.
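The multiplication rule (5.1.7) is easy to see empirically: the sketch below simulates many pairs of independent tosses of a coin with an illustrative bias p = .3 and compares the sample frequencies with p^2 and p.

from numpy import *
from numpy.random import default_rng

rng = default_rng()

p, N = 0.3, 100000

# simulate two tosses of a coin with bias p, N times
X1 = rng.random(N) < p
X2 = rng.random(N) < p

print("estimated Prob(X1=1 and X2=1):", mean(X1 & X2), " p**2:", p**2)
print("estimated Prob(X2=1 | X1=0):  ", mean(X2[~X1]), " p:   ", p)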

Addition of probabilities and the chain rule can be combined into

Law of Total Probability

If A1 , A2 , . . . , Ad are mutually exhaustive events, then


P rob(B) = Σ_{i=1}^{d} P rob(B | Ai) P rob(Ai). (5.1.8)

In particular, if A and B are events,

P rob(B) = P rob(B | A) P rob(A) + P rob(B | Ac ) P rob(Ac ). (5.1.9)

It is natural to ask for the probability of obtaining k heads in n tosses,


P rob(Sn = k). Here k varies between 0 and n, corresponding to all tails or
all heads respectively.
There are n + 1 possibilities Sn = 0, Sn = 1, Sn = 2, . . . , Sn = n for the
number of heads in n tosses. If we have no data to think otherwise, then all
possibilities are equally likely, so one expects
P rob(Sn = k) = 1/(n + 1),   0 ≤ k ≤ n.
Notice the total probability is 1,
Σ_{k=0}^{n} P rob(Sn = k) = Σ_{k=0}^{n} 1/(n + 1) = 1,

as it should be.
Assume we know p = P rob(Xn = 1). Since the number of ways of choosing
k heads from n tosses is the binomial coefficient (n choose k) (see §A.2), and the

probabilities of distinct tosses multiply, the probability of k heads in n tosses


is as follows.

Coin-Tossing With Known Bias

If a coin has bias p, the probability of obtaining k heads in n tosses


is the binomial distribution
 
P rob(Sn = k) = (n choose k) p^k (1 − p)^{n−k}. (5.1.10)

Moreover the mean and variance of the binomial distribution are np
and np(1 − p).

Why is this? Because the probabilities multiply, so the probability of a
specific pattern of k heads in n tosses is p^k (1 − p)^{n−k}. By (5.1.2), probabilities
of exclusive events add, and there are (n choose k) exclusive events here, because (n choose k)
is the number of ways of choosing k heads from n tosses.
By the binomial theorem,
Σ_{k=0}^{n} P rob(Sn = k) = Σ_{k=0}^{n} (n choose k) p^k (1 − p)^{n−k} = (p + 1 − p)^n = 1,

again as it should be.


The binomial distribution with n = 1 corresponds to a single coin toss,
and is called the Bernoulli distribution. The corresponding random variable
X,
P rob(X = 1) = p, P rob(X = 0) = 1 − p,
is a Bernoulli random variable.
The code for counting heads from n tosses repeated N times is

from numpy.random import binomial

n, p, N = 5, .5, 10

# counting heads from n tosses sampled N times


binomial(n,p,N)

This returns array([2, 2, 2, 0, 4, 3, 4, 2, 4, 2]).


The code for the probability of k heads in n tosses of a coin with bias p is

from scipy.stats import binom

k,n,p = 5, 10, .5

B = binom(n,p)

# probability of k heads
B.pmf(k)

This returns 0.24609375000000003.


More generally,

from numpy import *
from scipy.stats import binom
from scipy.special import comb

# code to verify binomial pmf

def f(n,k,p): return binom(n,p).pmf(k)


def g(n,k,p): return comb(n,k,exact=True) * p**k * (1-p)**(n-k)

k,n,p = 5, 10, .5

pmf1 = array([ f(n,k,p) for k in range(n+1) ])


pmf2 = array([ g(n,k,p) for k in range(n+1) ])

allclose(pmf1,pmf2)

returns True.
Be careful to distinguish between
numpy.random.binomial and scipy.stats.binom.
The former returns samples from a binomial distribution, while the latter
returns a binomial random variable. Samples are just numbers; random vari-
ables have cdf’s, pmf’s or pdf’s, etc.

We explain the connection


between entropy (§4.2) and coin-tossing. Recall
the binomial coefficient (n choose k) is the number of ways of selecting k objects from
n objects (A.2.10).
Toss the coin n times, and let #n = #n (p) be the number of outcomes
where the proportion k/n of heads is p. Then the number of heads is k = np,
so,

#n(p) = (n choose np).
When p is an irrational, np is replaced by the floor ⌊np⌋, but we ignore
this point. Using (A.1.1), a straightforward calculation yields the following
result.1

1 This result exhibits the entropy as the log of the number of combinations, or configura-
tions, or possibilities, which is the original definition of the physicist Boltzmann (1875).

Entropy and Coin-Tossing

Toss a coin n times, and let #n (p) be the number of outcomes where
the heads-proportion is p. Then

#n(p) is approximately equal to e^{nH(p)} for n large.

In more detail, using Stirling’s approximation (A.1.6), one can derive the
asymptotic equality
#n(p) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p)},   for n large. (5.1.11)

Asymptotic equality means the ratio of the two sides approaches 1 as n → ∞


(see §A.6).

Fig. 5.1 Asymptotics of binomial coefficients.

Figure 5.1 is returned by the code below, which compares both sides of
the asymptotic equality (5.1.11) for n = 10.

from numpy import *


from scipy.special import comb
from scipy.stats import entropy as H
from matplotlib.pyplot import *

n = 10
p = arange(0,1,.01)

def approx(n,p):
    # elementwise binary entropy of p: entropy([p,1-p]) is computed column by column
    return exp(n*H([p,1-p]))/sqrt(2*n*pi*p*(1-p))

grid()
plot(p, comb(n,n*p), label="binomial coefficient")
plot(p, approx(n,p), label="entropy approximation")
title("number of tosses " + "$n=" + str(n) +"$", usetex=True)
legend()
show()

Assume the probability of heads in a single toss of a coin is q. We call q


the coin’s bias. Then we expect the long-term proportion of heads in n tosses
to equal roughly q. Now let p be another probability, 0 ≤ p ≤ 1.
Toss a coin n times, and let Pn (p, q) be the probability of obtaining out-
comes with heads-proportion p, given that the coin’s bias is q.
If p = q, one’s first guess is Pn (p, p) ≈ 1 for n large. However, this is
not correct, because Pn (p, p) is specifying a specific proportion p, predicting
specific behavior from the coin tosses. Because this is too specific, it turns
out Pn (p, p) ≈ 0, see Exercise 5.1.9.
On the other hand, if p ̸= q, we definitely expect the proportion of heads
to not equal p. In other words, we expect Pn (p, q) to be small for large n. In
fact, when p ̸= q, it turns out Pn (p, q) → 0 exponentially, as n → ∞.
We derive a formula for the speed of this decay. With k = np in the
binomial distribution (5.1.10),
 
Pn(p, q) = (n choose np) q^{np} (1 − q)^{n−np}.

Let H(p, q) be the relative entropy (§4.2). Using (A.1.1), a straightforward


calculation results in

Relative Entropy and Coin-Tossing

Assume a coin’s bias is q. Toss the coin n times, and let Pn (p, q) be
the probability of obtaining tosses where the heads-proportion is p.
Then

Pn(p, q) is approximately equal to e^{nH(p,q)} for n large. (5.1.12)

In more detail, using Stirling’s approximation (A.1.6), one can derive the
asymptotic equality
Pn(p, q) ≈ (1/√(2πn)) · (1/√(p(1 − p))) · e^{nH(p,q)},   for n large. (5.1.13)

The law of large numbers (§5.2) states that the heads-proportion equals
approximately q for large n. Therefore, when p ≠ q, we expect the probabilities
that the heads-proportion equals p to become successively smaller as n
gets larger, and in fact vanish when n = ∞. Since H(p, q) < 0 when p ≠ q,
(5.1.13) implies this is so. Thus (5.1.13) may be viewed as a quantitative
strengthening of the law of large numbers, in the setting of coin-tossing.
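One way to check (5.1.13) numerically is sketched below. Since we do not re-derive §4.2 here, the relative entropy is written out explicitly in the code as H(p, q) = p log(q/p) + (1 − p) log((1 − q)/(1 − p)), which is ≤ 0; the values of p, q, and n are illustrative.

from numpy import *
from scipy.special import comb

def relative_entropy(p, q):
    # H(p,q) = p*log(q/p) + (1-p)*log((1-q)/(1-p)), written out explicitly
    return p * log(q / p) + (1 - p) * log((1 - q) / (1 - p))

def P(n, p, q):
    k = int(n * p)
    return comb(n, k, exact=True) * q**k * (1 - q)**(n - k)

def approx(n, p, q):
    return exp(n * relative_entropy(p, q)) / sqrt(2 * pi * n * p * (1 - p))

p, q = 0.3, 0.5
for n in [10, 100, 1000]:
    print(n, P(n, p, q), approx(n, p, q), P(n, p, q) / approx(n, p, q))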

Now we assume the coin parameter p is unknown, and we interpret (5.1.10)


as the conditional probability that Sn = k given knowledge of p, which we
rewrite as
 
P rob(Sn = k | p) = (n choose k) p^k (1 − p)^{n−k},   0 ≤ k ≤ n. (5.1.14)

By addition of probabilities, P rob(Sn = k) is the sum of the probabilities


P rob(Sn = k and p) over 0 ≤ p ≤ 1.
By the conditional probability chain rule (5.1.6),

P rob(Sn = k and p) = P rob(Sn = k | p) P rob(p).

Thus P rob(Sn = k) is the sum of P rob(Sn = k | p) P rob(p) over 0 ≤ p ≤ 1.


Since p varies continuously over 0 ≤ p ≤ 1, the sum is replaced by the integral,
and

P rob(Sn = k) = ∫_0^1 P rob(Sn = k | p) P rob(p) dp.
Integrals are reviewed in §A.5.
Since we don’t know anything about p, it’s simplest to assume a uniform
a priori probability P rob(p) = 1. Based on this, we obtain
P rob(Sn = k) = ∫_0^1 (n choose k) p^k (1 − p)^{n−k} dp. (5.1.15)

Usually, this integral is evaluated using integration by parts. However, it


is easier to evaluate this for all 0 ≤ k ≤ n at once, by writing
I(c) = Σ_{k=0}^{n} c^k P rob(Sn = k).

Using (5.1.15) and the binomial theorem (A.2.7), I(c) equals

∫_0^1 ( Σ_{k=0}^{n} (n choose k) c^k p^k (1 − p)^{n−k} ) dp = ∫_0^1 (1 − p + cp)^n dp.

If we set

f(p) = (1 − p + cp)^n,    F(p) = (1 − p + cp)^{n+1} / ((c − 1)(n + 1)),

then F ′ (p) = f (p) (see (4.1.5)). By the fundamental theorem of calculus


(A.5.2),
I(c) = F(1) − F(0) = (1/(n + 1)) · (c^{n+1} − 1)/(c − 1). (5.1.16)
But by (A.3.4) with n replaced by n + 1,

(1/(n + 1)) · (c^{n+1} − 1)/(c − 1) = Σ_{k=0}^{n} c^k · 1/(n + 1).

Matching coefficients of powers of c here and in I(c), we conclude

Coin-Tossing With Unknown Bias

If a coin has unknown bias p, distributed uniformly on 0 ≤ p ≤ 1,


then the probability of obtaining k heads in n tosses is
P rob(Sn = k) = 1/(n + 1),   k = 0, 1, 2, . . . , n. (5.1.17)

Notice the difference: In (5.1.10), we know the coin’s bias p, and obtain the
binomial distribution, while in (5.1.17), since we don’t know p, and there are
n + 1 possibilities 0 ≤ k ≤ n, we obtain the uniform distribution 1/(n + 1).
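As a sanity check of (5.1.17), one can integrate (5.1.15) numerically with scipy.integrate.quad (in the spirit of Exercise 5.1.4); the value n = 10 below is an arbitrary choice.

from scipy.integrate import quad
from scipy.special import comb

n = 10

def integrand(p, k):
    return comb(n, k, exact=True) * p**k * (1 - p)**(n - k)

for k in range(n + 1):
    value, _ = quad(integrand, 0, 1, args=(k,))
    print(k, value, 1 / (n + 1))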

We now turn things around: Suppose we toss the coin n times, and obtain
k heads. How can we use this data to estimate the coin’s bias p?
To this end, we introduce the fundamental

Bayes Theorem I

P rob(A | B) = P rob(B | A) · P rob(A) / P rob(B). (5.1.18)

The proof of Bayes Theorem is straightforward:

P rob(A | B) = P rob(A and B) / P rob(B)
             = ( P rob(A and B) / P rob(A) ) · ( P rob(A) / P rob(B) )
             = P rob(B | A) · P rob(A) / P rob(B).

The depth of the result lies in its widespread usefulness.


We now write Bayes Theorem to compute

P rob(p | Sn = k) = P rob(Sn = k | p) · P rob(p) / P rob(Sn = k). (5.1.19)

But P rob(Sn = k | p) is as in (5.1.14), P rob(Sn = k) is as in (5.1.17).


Since p is uniformly distributed, P rob(p) = 1. Inserting these quantities into
(5.1.19) leads to

A Posteriori Probability Given k Heads in n Tosses

Assume the unknown bias p of a coin is uniformly distributed on


0 ≤ p ≤ 1. Then the probability that the bias is p given k heads in n
tosses equals
 
P rob(p | Sn = k) = (n + 1) · (n choose k) p^k (1 − p)^{n−k}. (5.1.20)

Fig. 5.2 The distribution of p given 7 heads in 10 tosses.

Because of the extra factor (n+1), this is not equal to (5.1.14). In (5.1.14),
p is fixed, and k is the variable. In (5.1.20), k is fixed, and p is the variable.
This a posteriori distribution for (n, k) = (10, 7) is plotted in Figure 5.2.
Notice this distribution is concentrated about k/n = 7/10 = .7.
The code generating Figure 5.2 is

from matplotlib.pyplot import *


from numpy import arange
from scipy.stats import binom

n = 10
k = 7

def f(p): return (n+1) * binom(n,p).pmf(k)

grid()
p = arange(0,1,.01)
plot(p,f(p),color="blue",linewidth=.5)
show()

Because Bayes Theorem is so useful, here are two alternate forms. Suppose
A1 , A2 , . . . , Ad are several mutually exhaustive events, so they are mutually
exclusive and

P rob(A1 ) + P rob(A2 ) + · · · + P rob(Ad ) = 1.

Then by the law of total probability (5.1.8) and the first version (5.1.18), we
have the second version

Bayes Theorem II

If A1 , A2 , . . . , Ad are several mutually exhaustive events,

P rob(Ai | B) = P rob(B | Ai) P rob(Ai) / Σ_{j=1}^{d} P rob(B | Aj) P rob(Aj),   i = 1, 2, . . . , d. (5.1.21)

In particular,

P rob(A | B) = P rob(B | A) P rob(A) / ( P rob(B | A) P rob(A) + P rob(B | Ac) P rob(Ac) ).

As an example, suppose 20% of the population are smokers, and the preva-
lence of lung cancer among smokers is 90%. Suppose also 80% of non-smokers
are cancer-free. Then what is the probability that someone who has cancer
is actually a smoker?
To use the second version, set A = smoker and B = cancer. This means
A is the event that a randomly sampled person is a smoker, and B is the
event that a randomly sampled person has cancer. Then

P rob(A) = .2, P rob(B | A) = .9, P rob(B c | Ac ) = .8.



From this, we have

P rob(B | Ac) = 1 − P rob(Bc | Ac) = 1 − .8 = .2,

and

P rob(A | B) = P rob(B | A) P rob(A) / ( P rob(B | A) P rob(A) + P rob(B | Ac) P rob(Ac) )
             = (.9 × .2) / (.9 × .2 + .2 × .8) = .52941.
Thus the probability that a person with lung cancer is indeed a smoker is
53%.
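Carrying out this computation in Python takes a few lines; the sketch below just mirrors the arithmetic above, with the three given probabilities entered directly.

P_A = 0.2            # smoker
P_B_given_A = 0.9    # cancer given smoker
P_Bc_given_Ac = 0.8  # cancer-free given non-smoker

P_B_given_Ac = 1 - P_Bc_given_Ac
P_A_given_B = (P_B_given_A * P_A) / (P_B_given_A * P_A + P_B_given_Ac * (1 - P_A))
print(P_A_given_B)   # 0.5294...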

Fig. 5.3 The logistic function.

To describe the third version of Bayes theorem, bring in the logistic func-
tion. Let
p = σ(y) = 1 / (1 + e^{−y}). (5.1.22)
This is the logistic function or sigmoid function (Figure 5.3). The logistic
function takes as inputs real numbers y, and returns as outputs probabilities
p (Figure 5.4).

Fig. 5.4 The logistic function takes real numbers to probabilities.

We think of the input y as an activation energy, the output p as the


probability of activation, and y = 0 as the activation threshold.
In Python, σ is the expit function.

from scipy.special import expit

p = expit(y)

The multinomial or vector-valued version of σ(y) is the softmax function (§5.6).
Dividing the numerator and denominator of (5.1.21) by the numerator, we
obtain Bayes Theorem in terms of log-probability,

Bayes Theorem III


  
P rob(B | A) P rob(A)
P rob(A | B) = σ log . (5.1.23)
P rob(B | Ac ) P rob(Ac )

When there are several mutually exclusive events A1 , A2 , . . . , Ad , the


same result holds with σ the softmax function (§5.6).

Here is an application of the third version. Suppose we have two groups


of scalars, selected as follows. A fair coin is tossed. Depending on the result,
select a scalar x at random with normal probability (§5.4) with probability
density
1 2
P rob(x | H) = √ · e−(x−mH ) /2 ,

(5.1.24)
1 2
P rob(x | T ) = √ · e−(x−mT ) /2 .

This says the the two groups of scalars are centered around the means mH
and mT respectively, according to whether the coin toss results in heads or
tails.

Given a scalar x, what is the probability x is in the heads group? In other


words, what is
P rob(H | x)?
This question is begging for Bayes theorem.
Assume the two groups are distinct, by assuming mH ̸= mT , and let
1 1
w = mH − mT , w0 = − m2H + m2T .
2 2
Then w ≠ 0. Since P rob(H) = P rob(T), here we have P rob(A) = P rob(Ac).
Inserting the formulas for P rob(x | H) and P rob(x | T) leads to the log-probability

log( P rob(x | H) P rob(H) / ( P rob(x | T) P rob(T) ) ) = wx + w0. (5.1.25)
By (5.1.21),
P rob(H | x) = σ(wx + w0 ).
This shows the group membership of x is determined by the activation thresh-
old wx + w0 = 0, or by the cut-off x∗ = −w0 /w. Simplifying, the cut-off is

x∗ = −w0/w = −(1/2) · (−mH^2 + mT^2)/(mH − mT) = (mH + mT)/2,
which is the midpoint of the line segment joining mH and mT .


Fig. 5.5 Decision boundary (1d).
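For a numerical check of the scalar case, the sketch below compares σ(wx + w0) with the probability computed directly from Bayes theorem using scipy.stats.norm; the means mH, mT and the sample x are illustrative values, and the coin is fair so P rob(H) = P rob(T) cancels.

from numpy import *
from scipy.stats import norm
from scipy.special import expit

# the two-group example with illustrative means
mH, mT = 1.0, -2.0
w, w0 = mH - mT, -mH**2 / 2 + mT**2 / 2

x = 0.3
# direct Bayes computation, with Prob(H) = Prob(T) cancelling
direct = norm(mH, 1).pdf(x) / (norm(mH, 1).pdf(x) + norm(mT, 1).pdf(x))
print(direct, expit(w * x + w0))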

More generally, if the points x are in Rd , then the same question may be
asked, using the normal distribution with variance I in Rd (§5.5). In this
case, w is a nonzero vector, and w0 is still a scalar,
w = mH − mT,   w0 = −(1/2) |mH|^2 + (1/2) |mT|^2.
2 2
Then the cut-off or decision boundary between the two groups is the hyper-
plane
w · x + w0 = 0,
which is the hyperplane halfway between mH and mT , and orthogonal to the
vector joining mH and mT . Written this way, the probability

P rob(H | x) = σ(w · x + w0 ) (5.1.26)

is a single-layer perceptron (§7.2). We study hyperplanes in §4.5.


Fig. 5.6 Decision boundary (3d).

Exercises

Exercise 5.1.1 A fair coin is tossed. What is the probability of obtaining 5


heads in 8 tosses?

Exercise 5.1.2 A coin with bias p is tossed. What is the probability of ob-
taining 5 heads in 8 tosses?

Exercise 5.1.3 A coin with bias p is tossed 8 times and 5 heads are obtained.
What is the most likely value for p?

Exercise 5.1.4 A coin with unknown bias p is tossed 8 times and 5 heads
are obtained. Assuming a uniform prior for p, what is the probability that
p lies between 0.5 and 0.7? (Use scipy.integrate.quad (§A.5) to integrate
(5.1.20) over 0.5 ≤ p ≤ 0.7.)

Exercise 5.1.5 A fair coin is tossed n times. Sometimes you get more heads
than tails, sometimes the reverse. If you’re really lucky, the number of heads
may equal exactly the number of tails. What is the least n for which the
probability of this happening is less than 10%?

Exercise 5.1.6 A fair coin is tossed 10 times. What is the probability of


obtaining 7 heads given that we obtain 4 heads in the first 5 tosses?

Exercise 5.1.7 A coin is tossed. Depending on the result, select a scalar x


at random with normal probability densities as in (5.1.24). If the coin bias is
p, compute the decision boundary.

Exercise 5.1.8 [30] At least one-half of an airplane’s engines are required


to function in order for it to operate. If each engine functions independently
with probability p, for what value of 0 < p < 1 is a 4-engine plane as likely
to operate as a 2-engine plane? (Write the binomial probability as a function
of p and use numpy.roots.)

Exercise 5.1.9 If a fair coin is tossed 2n times, show the probability of
obtaining n heads and n tails is approximately 1/√(πn) for n large. (Use
(5.1.13).)

5.2 Probability

A probability is often described as


the extent to which an event is likely to occur, measured by the ratio of the favorable
outcomes to the whole number of outcomes possible.

We explain what this means by describing the basic terminology:


• An experiment is a procedure that yields an outcome, out of a set of
possible outcomes. For example, tossing a coin is an experiment that
yields one of two outcomes, heads or tails, which we also write as 1 or
0. Rolling a six-sided die yields outcomes 1, 2, 3, 4, 5, 6. Rolling two
six-sided dice yields 36 outcomes (1, 1), (1, 2),. . . . Flipping a coin three
times yields 2^3 = 8 outcomes

T T T, T T H, T HT, T HH, HT T, HT H, HHT, HHH,

or
000, 001, 010, 011, 100, 101, 110, 111.
• The sample space is the set S of all possible outcomes. If #(S) is the num-
ber of outcomes in S, then for the four experiments above, we have #(S)
equals 2, 6, 36, and 8. The sample space S is also called the population.
• An event is a specific subset E of S. For example, when rolling two dice,
E can be the outcomes where the sum of the dice equals 7. In this case,
the outcomes in E are

(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1),

so here #(S) = 36 and #(E) = 6. Another example is obtaining three


heads when tossing a coin seven times. Here #(S) = 2^7 = 128 and
#(E) = 35, which is the number of ways you can choose three things out
of seven things:

#(E) = 7-choose-3 = (7 · 6 · 5)/(1 · 2 · 3) = 35.

• The probability of an outcome s is a number P rob(s) with the properties


1. 0 ≤ P rob(s) ≤ 1,
2. The sum of the probabilities of all outcomes equals one.
• The probability P rob(E) of an event E is the sum of the probabilities of
the outcomes in E.
• Outcomes are equally likely when they have the same probability. When
this is so, we must have

P rob(E) = #(E)/#(S).

For example,
1. A coin is fair if the outcomes are equally likely. For one toss of a fair
coin, P rob(heads) = 1/2.
2. More generally, tossing a coin results in outcomes

P rob(head) = p, P rob(tail) = 1 − p,

with 0 < p < 1.


3. A die is fair if the outcomes are equally likely. Roll a fair die and
let E be the event that the outcome is less than 3. Then P rob(E) =
2/6 = 1/3.
4. The probability of obtaining a sum of 7 when rolling two fair dice is
P rob(E) = 6/36 = 1/6.
5. The probability of obtaining three heads when tossing a fair coin
seven times is P rob(E) = 35/128.
6. The probability of selecting an iris with petal length between 1 and
3 from the Iris dataset.

Roll two six-sided dice. Let A be the event that at least one dice is an even
number, and let B be the event that the sum is 6. Then

A = {(2, ∗), (4, ∗), (6, ∗), (∗, 2), (∗, 4), (∗, 6)} .

B = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} .
The intersection of A and B is the event of outcomes in both events:

A and B = {(2, 4), (4, 2)} .

The union of A and B is the event of outcomes in either event:

A or B = {(2, ∗), (4, ∗), (6, ∗), (∗, 2), (∗, 4), (∗, 6), (1, 5), (3, 3), (5, 1)} .

The complement of A is the event of outcomes not in A. So

not A = {(1, 1), (1, 3), (1, 5), (3, 1), (3, 3), (3, 5), (5, 1), (5, 3), (5, 5)} .

Since #(not A) = 9, and #(S) = 36,

#(A) = #(S) − #(not A) = 36 − 9 = 27.

Clearly #(B) = 5.
The difference of A minus B is the event of outcomes in A but not in B:

A − B = A and not B
= {(2, ∗ except 4), (4, ∗ except 2), (6, ∗), (∗ except 4, 2), (∗ except 2, 4), (∗, 6)} .

Similarly,
B − A = {(1, 5), (3, 3), (5, 1)} .
Then A − B is part of A and B − A is part of B, A ∩ B is part of both, and
all are part of A ∪ B.
Hence

P rob(A) = 27/36 = 3/4,   P rob(B) = 5/36.
Events A and B are independent if

P rob(A and B) = P rob(A) × P rob(B). (5.2.1)

The conditional probability of A given B is

P rob(A | B) = P rob(A and B) / P rob(B).

When A and B are independent,

P rob(A | B) = P rob(A and B) / P rob(B) = ( P rob(A) × P rob(B) ) / P rob(B) = P rob(A),

so the conditional probability equals the unconditional probability.


Are A and B above independent? Since

P rob(A | B) = P rob(A ∩ B) / P rob(B) = (2/36) / (5/36) = 2/5,

which is not equal to P rob(A), they are not independent.
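These numbers are small enough to verify by brute force; the following sketch enumerates the 36 outcomes of the two dice.

# enumerate the 36 outcomes of two dice
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

A = [o for o in outcomes if o[0] % 2 == 0 or o[1] % 2 == 0]  # at least one even
B = [o for o in outcomes if o[0] + o[1] == 6]                # sum equals 6
AB = [o for o in A if o in B]

print(len(A) / 36, len(B) / 36, len(AB) / 36)
print("Prob(A|B) =", (len(AB) / 36) / (len(B) / 36))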

Suppose in a certain community 15% of families have no children, 20%


have one child, 35% have two children, and 30% have three children. Suppose
also each child is equally likely to be a boy or a girl. Let B and G be the
number of boys and girls in a randomly selected family. Then

P rob(B = 0 and G = 0) = P rob(no children) = 0.15,

and
P rob(B = 0 and G = 1) = P rob(G = 1 | 1 child) P rob(1 child) = (1/2) · 0.20 = 0.1,
and

P rob(B = 1 and G = 2) = P rob(G = 2 | 3 children) P rob(3 children) = (3/8) · 0.30 = .1125.
Continuing in this manner, the complete table is

P rob((B, G) = (i, j))   G = 0    G = 1    G = 2    G = 3
B = 0                    0.15     0.10     0.0875   0.0375
B = 1                    0.10     0.175    0.1125   0
B = 2                    0.0875   0.1125   0        0
B = 3                    0.0375   0        0        0

Table 5.7 Joint distribution of boys and girls [30].
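The entries of Table 5.7 can be generated by combining the family-size probabilities with the binomial probabilities for the number of boys among n children (this is essentially Exercise 5.2.2); the sketch below is one way to do it.

from numpy import zeros
from scipy.special import comb

# joint distribution of (B, G): Prob(n children) times the chance of b boys out of n
prob_children = [0.15, 0.20, 0.35, 0.30]

table = zeros((4, 4))
for n, pn in enumerate(prob_children):
    for b in range(n + 1):
        table[b, n - b] += pn * comb(n, b, exact=True) * 0.5**n
print(table)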

Now suppose we conduct an experiment by tossing a coin (always assumed


fair unless otherwise mentioned) 10 times. Because the coin is fair, we expect
to obtain heads around 5 times. Will we obtain heads exactly 5 times? Let’s
run the experiment with Python. In fact, we will run the experiment 20 times.
If we count the number of heads after each run of the experiment, we obtain
a number between 0 and 10 inclusive.
To simulate this, we use binomial(n,p,N). When N = 1, this returns the
number of heads obtained after a single experiment, consisting of tossing a
coin n times, where the probability of obtaining heads in each toss is p.
More generally, binomial(n,p,N) runs this experiment N times, returning
a vector v with N components. For example, the code

from numpy.random import *

p = .5
n = 10
N = 20

v = binomial(n,p,N)
print(v)

returns

[9 6 7 4 4 4 3 3 7 5 6 4 6 9 4 5 4 7 6 7]

The sample space S corresponding to (p, n, N ) consists of all vectors v =


(v1, v2, . . . , vN) with N components, with each component equal to 0, 1,
. . . , n. So here #(S) = (n + 1)^N.

Now we conduct three experiments: tossing a coin 5 times, then 50 times,


then 500 times. The code

p = .5
for n in [5,50,500]: print(binomial(n,p,1))

This returns the count of heads after 5 tosses, 50 tosses, and 500 tosses,

3, 28, 266

The proportions are the count divided by the total number of tosses in the
experiment. For the above three experiments, the proportions after 5 tosses,
50 tosses, and 500 tosses, are

3/5=.600, 28/50=.560, 266/500=.532

Fig. 5.8 100,000 sessions, with 5, 15, 50, and 500 tosses per session.

Now we repeat each experiment 100,000 times and we plot the results in
a histogram.

from matplotlib.pyplot import *


from numpy.random import *

N = 100000
p = .5

for n in [5,50,500]:
data = binomial(n,p,N)
hist(data,bins=n,edgecolor ='Black')
grid()
show()

This results in Figure 5.8.

The takeaway from these graphs are the two fundamental results of prob-
ability:

Law of Large Numbers (LLN)

The proportion in a repeated experiment is the sample proportion.


The sample proportion tends to be near the underlying probability
p. The underlying probability is the population proportion. The larger
the sample size in the experiment, the closer the proportion is to p.
Another way of saying this is: For large sample size, the sample mean
is approximately equal to the population mean.

Central Limit Theorem (CLT)

For large sample size, the shape of the graph of the proportions or
counts is approximately normal. The normal distribution is studied in
§5.4. Another way of saying this is: For large sample size, the shape
of the sample mean histogram is approximately normal.

The law of large numbers is qualitative and the central limit theorem is
quantitative. While the law of large numbers says one thing is close to another,
it does not say how close. The central limit theorem provides a numerical
measure of closeness, using the normal distribution.

One may think that the LLN and the CLT above depends on some aspect
of the binomial distribution. After all, the binomial is a specific formula and
something about this formula may lead to the LLN and the CLT. To show
that this is not at all the case, to show that the LLN and the CLT are
universal, we bring in the petal lengths of the Iris dataset. This time the
experiment is not something we invent, it is a result of something arising in
nature, Iris petal lengths.

from sklearn import datasets

iris = datasets.load_iris()

dataset = iris["data"]
iris["feature_names"]

This code shows the petal lengths are the third feature in the dataset, and
we compute the mean of the petal lengths using

petal_lengths = dataset[:,2]
mean(petal_lengths)

Fig. 5.9 The histogram of Iris petal lengths.

This returns the petal length population mean µ = 3.758. If we plot the
petal lengths in a histogram with 50 bins using the code

from matplotlib.pyplot import *

grid()
hist(petal_lengths,bins=50)
show()

we obtain Figure 5.9.


Now we sample the Iris dataset randomly. More generally, we take a ran-
dom batch of samples of size n and take the mean of the samples in the batch.
For example, the following code grabs a batch of n = 5 petal lengths X1,

X2 , X3 , X4 , X5 at random and takes their mean,


(X1 + X2 + X3 + X4 + X5)/5.

from numpy import *


from numpy.random import *

# rng = random number generator


rng = default_rng()

# n = batch_size

def random_batch_mean(n):
rng.shuffle(petal_lengths)
return mean(petal_lengths[:n])

random_batch_mean(5)

This code shuffles the dataset, then selects the first n petal lengths, then
returns their mean.

Fig. 5.10 Iris petal lengths sampled 100,000 times.

To sample a single petal length randomly 100,000 times, we run the code

N = 100000
n = 1

Xbar = [ random_batch_mean(n) for _ in range(N)]


hist(Xbar,bins=50)
grid()
show()

Since we are sampling single petal lengths, here we take n = 1. This code
returns the histogram in Figure 5.10.
In Figure 5.9, the bin heights add up to 150. In Figure 5.10, the bin
heights add up to 100,000. Moreover, while the shapes of the histograms are
almost identical, a careful examination shows the histograms are not identical.
Nevertheless, there is no essential difference between the two figures.

Fig. 5.11 Iris petal lengths batch means sampled 100,000 times, batch sizes 3, 5, 20.

Now repeat the same experiment, but with batches of various sizes, and
plot the resulting histograms. If we do this with batches of size n = 3, n = 5,
n = 20 using

from matplotlib.pyplot import *

figure(figsize=(8,4))
# three subplots
rows, cols = 1, 3

N = 100000

for i,n in enumerate([3,5,20],start=1):


Xbar = [ random_batch_mean(n) for _ in range(N)]
subplot(rows,cols,i)
grid()
hist(Xbar,bins=50)

show()

we obtain Figure 5.11.


This shows the CLT is universal, since here it arises from sampling the
petal lengths of Irises, whose dataset has the histogram in Figure 5.9. Of
course, we also have the LLN, which says the peak of each of the bell-shaped
curves is near µ = 3.758.

Exercises

Exercise 5.2.1 [30] A communications channel transmits bits 0 and 1. Be-


cause of noise, the probability of transmitting a bit incorrectly is 0.2. To
reduce error probabilities, each bit is repeated five times: 1 is sent as 11111
and 0 is sent as 00000. If the recipient uses majority decoding, what is the
probability of mis-reading a message consisting of one bit? Majority decoding
means five consecutive bits will be read as 1 if at least three of the bits are
1, and similarly for 0.

Exercise 5.2.2 Check the values in Table 5.7.

Exercise 5.2.3 [30] Approximately 80,000 marriages took place in New York
last year. Assuming any day is equally likely, what is the probability that for
at least one of these couples, both partners were born on January 1? Both
partners celebrate their birthdays on the same day of the year?

Exercise 5.2.4 This problem has nothing to do with calculus or probability


or data science, and just uses addition of numbers, so can be presented to
grade school students. Let dataset be any five numbers, for example
[-11.2, sqrt(2), 1.4, 11, 23.4], and run the code

from matplotlib.pyplot import *


from numpy import *

def sums(dataset,k):
if k == 1: return dataset
else:
s = sums(dataset,k-1)
return array([ a+b for a in dataset for b in s ])

for k in range(1,5):
s = sums(dataset,k)
grid()
hist(s,bins=50,edgecolor="k")
show()

for k = 1, 2, 3, 4, . . . . What does this code do? What does it return? What
pattern do you see? What if dataset were changed? What if the samples in
the dataset were vectors?

Exercise 5.2.5 Let A and B be any events, not necessarily exclusive. Let
B − A be the event of B occurring and A not occurring. Show

P rob(A or B) = P rob(A) + P rob(B − A).

Exercise 5.2.6 Let A and B be any events, not necessarily exclusive. Extend
(5.1.1) to show

P rob(A or B) = P rob(A) + P rob(B) − P rob(A and B). (5.2.2)

Break A ∪ B into three exclusive events — draw a Venn diagram.

Exercise 5.2.7 [30] There is a 60% chance an event A will occur. If A does
not occur, there is a 10% chance B occurs. What is the chance A or B occurs?

Exercise 5.2.8 Let A, B, C be any events, not necessarily exclusive. Use


the previous exercise to show

P rob(A or B or C) ≤ P rob(A) + P rob(B) + P rob(C).

(Start with two events, then go from two to three events.) With a = P rob(Ac ),
b = P rob(B c ), c = P rob(C c ), this exercise is the same as Exercise A.3.4.

5.3 Random Variables

Suppose a real number x is selected at random. Even if we don’t know any-


thing about x, we know x is a number, so our confidence that −∞ < x < ∞
equals 100%, the chance that x satisfies −∞ < x < ∞ equals 1, and the
probability that x satisfies −∞ < x < ∞ equals 1.
When we say x is “selected at random”, we think of a machine X that
is the source of the numbers x (Figure 5.12). Such a source of numbers is
best called a random number, short for random number generator, just like
a source for apples should be called a random apple, short for random apple
generator.
Unfortunately,2 the standard 100-year-old terminology for such an X is
random variable, and this is what we’ll use.
2 The standard terminology is misleading, because the variability of the samples x is
implied by the term random; the term variable is superfluous and may suggest something
like double-randomness (whatever that is), which is not the case.


Fig. 5.12 When we sample X, we get x.

For example, suppose we want to estimate the proportion of American


college students who have a smart phone. Instead of asking every student,
we take a sample and make an estimate based on the sample.
Let p be the actual proportion of students that in fact have a smartphone.
If there are N students in total, and m of them have a smartphone, then
p = m/N . For each student, let
X = 1 if the student has a smartphone, and X = 0 if not.

Then X is a random variable: X is a machine that returns 0 or 1 depending


on the chosen student.
A random variable taking on only two values is a Bernoulli random vari-
able. Since X takes on the two values 0 and 1, X is a Bernoulli random
variable.
Throughout we adopt the convention that random variables are written in
uppercase, X, while the numbers resulting when sampled are written lower-
case, x. In other words, when we sample X, we obtain x.
We will have occasion to meet many different random variables X, Y , Z,
. . . . The letter Z is reserved for a standard random variable, one having mean
zero and variance one. Samples from Z are written as z.

Let X be a random variable and let x be a sample of X. What is the


chance, what is our confidence, what is the probability, of selecting x from
an interval [a, b]? If we write

P rob(a < X < b)

for this quantity, then we are asking to compute P rob(a < X < b). If we
don’t know anything about X, then we can’t figure out the probability, and
there is nothing we can say. Knowing something about X means knowing
the distribution of X: Where X is more likely to be and where X is less
likely to be. In effect, a random variable is a quantity X whose probabilities
P rob(a < X < b) can be computed.

From this point of view, every dataset x1 , x2 , . . . , xN may be viewed as


the samples of a random variable X, as follows.
Define

E(X) = (1/N) Σ_{k=1}^N x_k.    (5.3.1)

Then E(X) is the mean of the random variable X associated to the dataset.
Similarly,

E(X²) = (1/N) Σ_{k=1}^N x_k²

is the second moment of the random variable X associated to the dataset.


More generally, given any function f (x), we have the mean of f (x1 ), f (x2 ),
. . . , f (xN ),
E(f(X)) = (1/N) Σ_{k=1}^N f(x_k).    (5.3.2)

Given any interval (a, b), we may set


f(x) = 1 if a < x < b, and f(x) = 0 otherwise.

Then f (xk ) is only counted when a < xk < b, so


E(f(X)) = (1/N) Σ_{k=1}^N f(x_k) = #{x_k : a < x_k < b}/N = P rob(a < X < b)

is the probability that a randomly selected sample lies in (a, b).


This shows probabilities are special cases of means. Since we can compute
means by (5.3.2), we can compute probabilities for X. This is what is meant
by “selecting a random sample from the dataset”.
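As a quick illustration, here is a minimal Python sketch of (5.3.2); the dataset and the interval (a, b) are made up for illustration.

from numpy import *

# any dataset may be viewed as samples of a random variable X
dataset = array([2.1, 3.5, 3.7, 4.0, 5.2, 3.3, 4.8])
N = len(dataset)

# E(f(X)) as in (5.3.2): average f over the dataset
def E(f): return sum([ f(x) for x in dataset ]) / N

# probability as a special case of a mean: f is the indicator of (a,b)
a, b = 3, 4.5
def indicator(x): return 1 if a < x < b else 0

print(E(lambda x: x))    # mean of the dataset
print(E(indicator))      # P rob(a < X < b)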

Suppose X is a random variable taking on three values a, b, c with prob-


abilities p, q, r,

P (X = a) = p, P (X = b) = q, P (X = c) = r.

Then the mean or average or expected value of X is

E(X) = ap + bq + cr.

Since p + q + r = 1, the expected value of X lies between the greatest of a,


b, c, and the least,

min(a, b, c) ≤ E(X) ≤ max(a, b, c).

Let µ = E(X) be the mean. The variance of X is a measure of how far X


deviates from its mean,

V ar(X) = E((X − µ)2 ).

For the random variable X above,

V ar(X) = (a − µ)2 · p + (b − µ)2 · q + (c − µ)2 · r.

By expanding the squares, one has the identity

V ar(X) = E(X 2 ) − µ2 .

This is valid for any random variable X.

Now we repeat this in general. Random variables are either discrete or


continuous. Even though the proofs and identities below are carried out in the
context of discrete random variables, the results remain valid in the context
of continuous random variables.
A random variable X is discrete if X takes on discrete values x1 , x2 , . . . ,
with probabilities p1 , p2 , . . . . Here the values may be scalars or vectors, and
there may be finitely many or infinitely many values. If all the values are
equal to a scalar µ, then we say X is a constant.
For a discrete random variable, the probability mass function (pmf in
Python) is
p(x) = P rob(X = x),
and the cumulative distribution function (cdf in Python) is

F (x) = P rob(X ≤ x).

Then pk = p(xk ). By addition of probabilities (5.1.2), F (x) is the sum of


p(xk ) over all xk ≤ x.

Definition of Expectation: Discrete Case

Let X take on values x1 , x2 , . . . , with probabilities p1 , p2 , . . . . The


expectation of X is

E(X) = x1 p1 + x2 p2 + . . . . (5.3.3)

E(X) is also called the mean or average or first moment of X, and is


usually denoted µ.

When there are N values, and we take p1 = p2 = · · · = 1/N , we say the


values are equally likely or X is uniform. In this case, the mean reduces to
(1.3.1).
More generally, let f (x) be a function. The mean or expectation of f (X)
is
E(f (X)) = f (x1 )p1 + f (x2 )p2 + . . . (5.3.4)
Since the total probability is one, when f (x) = 1,

E(1) = p1 + p2 + · · · = 1.

If a is a constant then the values of aX are ax1 , ax2 , . . . , with probabilities


p1 , p2 ,. . . , so

E(aX) = ax1 p1 + ax2 p2 + · · · = a(x1 p1 + x2 p2 + . . . ) = aE(X).

When f (x) = x2 , the mean of f (X) is the second moment

E(X 2 ) = x21 p1 + x22 p2 + . . .

When f (x) = etx , the mean of f (X) is the moment-generating function

M (t) = E etX = etx1 p1 + etx2 p2 + . . .




The log of the moment-generating function is the cumulant-generating


function
Z(t) = log M (t) = log E etX .


A basic property of E is the

Linearity of the Expectation

For any random variables X and Y and constants a and b,

E(aX + bY ) = aE(X) + bE(Y ).

Linearity is used routinely whenever we compute expectations, and is de-


ceptively simple to state. Because the derivation of linearity uses addition of
probabilities (5.1.2), it is instructive to go over this carefully.
Let X have values x1 , x2 , . . . , and probabilities p1 , p2 , . . . , and let Y have
values y1 , y2 , . . . , and probabilities q1 , q2 , . . . .
If
rjk = P rob(X = xj and Y = yk ), j, k = 1, 2, . . . ,
then, by addition of probabilities (5.1.2),

pj = P rob(X = xj )
= P rob(X = xj and Y = y1 ) + P rob(X = xj and Y = y2 ) + . . .
= r_{j1} + r_{j2} + · · · = Σ_k r_{jk}.

Similarly,

q_k = r_{1k} + r_{2k} + · · · = Σ_j r_{jk}.

Since the values of X + Y are xj + yk , with probabilities rjk , j, k = 1, 2, . . . ,


by definition of expectation,
E(X + Y) = Σ_j Σ_k (x_j + y_k) r_{jk}.

But this double sum may be written in two parts as

Σ_j Σ_k x_j r_{jk} + Σ_j Σ_k y_k r_{jk} = Σ_j x_j p_j + Σ_k y_k q_k = E(X) + E(Y).

We conclude
E(X + Y ) = E(X) + E(Y ).
Since we already know E(aX) = aE(X), this derives linearity.

Let µ be the mean of a random variable X. The variance of X is

V ar(X) = E((X − µ)2 ). (5.3.5)

The variance measures the spread of X about its mean. Since the mean of
aX is aµ, the variance of aX is the mean of (aX − aµ)2 = a2 (X − µ)2 . Thus

V ar(aX) = a2 V ar(X).

However, the variance of a sum X + Y is not simply the sum of the variances
of X and Y : This only happens if X and Y are independent, see (5.3.19).
Using (5.3.2), we can view a dataset as the samples of a random variable
X. In this case, the mean and variance of X are the same as the mean and
variance of the dataset, as defined by (1.5.1) and (1.5.2).
When X is a constant, then X = µ, so V ar(X) = 0. Conversely, if
V ar(X) = 0, then by definition

0 = (x1 − µ)2 p1 + (x2 − µ)2 p2 + . . . ,

so all values are equal to µ, hence X is a constant.



The square root of the variance is the standard deviation. If we write


V ar(X) = σ 2 , then the standard deviation is σ.
Expanding the square in (5.3.5),

V ar(X) = E(X 2 ) − 2µE(X) + µ2 .

Since µ = E(X), we obtain the alternate formula for the variance

V ar(X) = E(X 2 ) − E(X)2 . (5.3.6)

This displays the variance in terms of the first moment E(X) and the second
moment E(X 2 ). Equivalently,

E(X 2 ) = µ2 + σ 2 = (E(X))2 + V ar(X). (5.3.7)
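Here is a short numerical check of this identity; the three values and their probabilities are made up for illustration.

from numpy import *

# X takes values a, b, c with probabilities p, q, r (illustrative numbers)
vals = array([1.0, 2.0, 5.0])
probs = array([0.2, 0.5, 0.3])

mu = sum(vals * probs)                 # E(X)
second = sum(vals**2 * probs)          # E(X^2)
var1 = sum((vals - mu)**2 * probs)     # E((X - mu)^2)
var2 = second - mu**2                  # E(X^2) - E(X)^2

print(allclose(var1, var2))            # True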

The simplest discrete random variable is the Bernoulli random variable X


resulting from a coin toss, with X = 1 corresponding to heads, and X = 0
corresponding to tails,

P rob(X = 1) = p, P rob(X = 0) = 1 − p.

We say X is Bernoulli with bias p. The probability mass function is



p(x) = 1 − p if x = 0, p(x) = p if x = 1, and p(x) = 0 otherwise.

This is presented graphically in Figure 5.13.


Fig. 5.13 Probability mass function p(x) of a Bernoulli random variable.

The mean of the Bernoulli random variable is

E(X) = 1 · P rob(X = 1) + 0 · P rob(X = 0) = 1 · p + 0 · (1 − p) = p.

The second moment is

E(X²) = 1² · P rob(X = 1) + 0² · P rob(X = 0) = p.


From this,

V ar(X) = E(X 2 ) − E(X)2 = p − p2 = p(1 − p).

When p = 0 or p = 1, the variance is zero, there is no randomness. When


p = 1/2, the randomness is maximized and the maximum variance equals
1/4.


Fig. 5.14 Cumulative distribution function F (x) of a Bernoulli random variable.

The moment-generating function is

M (t) = E etX = et1 p + et0 (1 − p) = pet + (1 − p).




The cumulant-generating function is

Z(t) = log M (t) = log(pet + 1 − p).

The cumulative distribution function F (x) is in Figure 5.14. Because the


Bernoulli random variable takes on only the values x = 0, 1, these are the
values where F (x) jumps.
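scipy.stats provides a built-in Bernoulli random variable; here is a brief sketch checking the formulas above, with an arbitrary bias p = .3.

from scipy.stats import bernoulli

p = .3
X = bernoulli(p)

print(X.pmf(0), X.pmf(1))    # 1-p, p
print(X.cdf(0), X.cdf(1))    # 1-p, 1
print(X.mean(), X.var())     # p, p*(1-p)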

More generally, let A be any event, and define


B = 1 if the outcome is in A, and B = 0 if the outcome is in A^c.    (5.3.8)

Then B has values 1 and 0 with probabilities

p = P rob(B = 1) = P rob(A), 1 − p = P rob(B = 0) = P rob(Ac ),

hence B is Bernoulli with bias p. We say B is the Bernoulli random variable


corresponding to event A.
By definition of B,

E(B) = p, V ar(B) = p(1 − p).

The relation between A and B is discussed further in Exercise 5.3.2.



Bernoulli random variables are used to count sample proportions. Let X


be a random variable, and fix a threshold a for X. Let X1 , X2 , . . . , Xn be
a repeated sampling of X, and let B1 , B2 , . . . , Bn be the Bernoulli random
variables corresponding to the events X1 > a, X2 > a, . . . , Xn > a. Then
p̂ = (B_1 + B_2 + · · · + B_n)/n    (5.3.9)
is the proportion of samples greater than threshold a. This is a special case
of vectorization (§1.3).
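Here is a minimal sketch of (5.3.9) in Python; the samples are drawn from a normal distribution and the threshold is arbitrary, chosen only for illustration.

from numpy import *
from numpy.random import normal

# samples of X (here normal samples, an arbitrary choice) and a threshold a
a, n = 1, 10000
x = normal(0, 1, n)

# Bernoulli samples corresponding to the events x > a
b = where(x > a, 1, 0)

# proportion of samples greater than the threshold, as in (5.3.9)
phat = sum(b) / n
print(phat)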

Let X be any random variable. Since the total probability is one,

M (0) = E(e0X ) = E(1) = 1.

The derivative of the moment-generating function is

M ′ (t) = E XetX .


When t = 0,
M ′ (0) = E(X) = µ.
Similarly, since the derivative of log x is 1/x, for the cumulant-generating
function,
Z′(0) = M′(0)/M(0) = E(X) = µ.
The second derivative of M (t) is

M ′′ (t) = E X 2 etX ,


so M ′′ (0) is the second moment E(X 2 ).


By the quotient rule, the second derivative of Z(t) is
Z″(t) = (M′(t)/M(t))′ = (M″(t)M(t) − M′(t)²)/M(t)².

Inserting t = 0, and recalling (5.3.6), we have

Cumulant-Generating Function and Variance

Let Z(t) be the cumulant-generating function of a random variable


X. Then

Z ′ (0) = E(X) and Z ′′ (0) = V ar(X). (5.3.10)
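As a check of (5.3.10), here is a short sympy sketch applied to the Bernoulli cumulant-generating function Z(t) = log(pe^t + 1 − p) derived above.

from sympy import symbols, log, exp, diff, simplify

t, p = symbols('t p')

# Bernoulli moment-generating and cumulant-generating functions
M = p*exp(t) + 1 - p
Z = log(M)

print(simplify(diff(Z,t).subs(t,0)))      # p, the mean
print(simplify(diff(Z,t,2).subs(t,0)))    # p*(1 - p), the variance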



In §5.2, we discussed independence of events. Now we do the same for


random variables.

Definition of Uncorrelated
Random variables X and Y are uncorrelated if

E(XY ) = E(X) E(Y ). (5.3.11)

Otherwise, we say X and Y are correlated.

By (5.3.6), a random variable X is always correlated to itself, unless it is


a constant.

Suppose X and Y take on the values X = ±1 and Y = 0, 1 with the


probabilities

(X, Y) = (1, 1) with probability a, (1, 0) with probability b, (−1, 1) with probability b, (−1, 0) with probability c.    (5.3.12)

We investigate when X and Y are uncorrelated. Here a > 0, b > 0, and c > 0.
First, because the total probability equals 1,

a + 2b + c = 1. (5.3.13)

Also we have

P rob(X = 1) = a+b = P rob(Y = 1), P rob(X = −1) = b+c = P rob(Y = 0),

and
E(X) = a − c, E(Y ) = a + b.
Now X and Y are uncorrelated if

a − b = E(XY ) = E(X)E(Y ) = (a − c)(a + b). (5.3.14)

Solving (5.3.13), (5.3.14) using Python,

from sympy import *



a,b,c = symbols('a,b,c')
eq1 = a + 2*b + c - 1
eq2 = a - b - (a-c)*(a+b)
solutions = solve([eq1,eq2],a,b)
print(solutions)

we see X and Y are uncorrelated if


b = √c − c, a = c − 2√c + 1.    (5.3.15)

For example, X and Y are uncorrelated when c = 1/4, which leads to a =


b = 1/4. Also, X and Y are uncorrelated if c = .01, which leads to a = .81
and b = .09.
Let X and Y be random variables. We say X and Y are independent if all
powers of X are uncorrelated with all powers of Y .

Definition of Independence

Random variables X and Y are independent if

E(X n Y m ) = E(X n ) E(Y m ) (5.3.16)

for all positive powers n and m. When X and Y are discrete, this is
equivalent to the events X = x and Y = y being independent, for
every value x of X and every value y of Y .

Clearly, if X and Y are independent, then, by taking n = 1 and m = 1, X


and Y are uncorrelated.
Suppose X and Y satisfy (5.3.12) and (5.3.15). Since X = ±1, X n = 1 for
n even and X n = X for X odd. Since Y = 0, 1, Y n = Y for all n. This is
enough to show that, in this case, X and Y uncorrelated is equivalent to X
and Y independent. However, this is certainly not true in general.
Here is an example of uncorrelated random variables that are not inde-
pendent. Let X, Y be as above and set U = XY . We check when U and
Y are uncorrelated versus when they are independent. As before, check that
E(U Y ) = E(U )E(Y ) is equivalent to

a − b = (a − b)(a + b).

This happens in one of two cases. Either a − b ≠ 0, or a − b = 0. If a − b ≠ 0,


then canceling a − b leads to a + b = 1. By (5.3.13), this leads to b + c = 0,
which can’t happen, since both b and c are positive or zero. Hence we must
have the other case, a − b = 0. By (5.3.13), this leads to
a = 1/3 − c/3, b = 1/3 − c/3.    (5.3.17)

Thus U and Y are uncorrelated when (5.3.17) holds, for any choice of c.
However, since X 2 = 1 and Y 2 = Y , U 2 = Y , so U 2 and Y are always
correlated, unless Y is constant. Hence U and Y are never independent, unless
Y is constant. Note Y is a constant when a = 1 or c = 1.

Let X and Y be random variables. The joint moment-generating function


of the pair (X, Y ) is
MX,Y (s, t) = E esX+tY .


Expanding the exponentials into their series, and using (5.3.16), one can show

Independence and Moment-Generating Functions

Let X and Y be random variables. Then X and Y are independent if


their moment-generating functions multiply,

MX,Y (s, t) = MX (s) MY (t).

As a special case, choosing s = t, we see

Independent Sums and Moment-Generating Functions

Let X and Y be independent random variables. Then the moment-


generating function of X + Y is

MX+Y (t) = MX (t)MY (t). (5.3.18)

As an illustration, consider an ordinary die with X = 1, X = 2, . . . ,


X = 6 equally probable. Then P rob(X = k) = 1/6, k = 1, 2, . . . , 6. Now
suppose we have a random variable Y with values Y = 0, Y = 1, . . . ,Y = 6,
and assume X and Y are independent.
If we are told the sum X + Y is uniform over 1 ≤ X + Y ≤ 12, how should
we choose the probabilities for Y = 0, Y = 1, . . . ,Y = 6?
To answer this, we use (5.3.18). By Exercise 5.3.1,

M_X(t) = (1/6) · (e^{7t} − e^t)/(e^t − 1).

By Exercise 5.3.1 again,

M_{X+Y}(t) = (1/12) · (e^{13t} − e^t)/(e^t − 1).

It follows, by (5.3.18),

(1/12) · (e^{13t} − e^t)/(e^t − 1) = (1/6) · (e^{7t} − e^t)/(e^t − 1) · M_Y(t).

Factoring

e^{13t} − e^t = e^t (e^{6t} − 1)(e^{6t} + 1), e^{7t} − e^t = e^t (e^{6t} − 1),

we obtain

M_Y(t) = (1/2)(e^{6t} + 1).

This says

P rob(Y = 0) = 1/2, P rob(Y = 6) = 1/2,

and all other probabilities are zero.
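As a quick check of this conclusion, the following sketch enumerates the distribution of the independent sum X + Y directly.

from fractions import Fraction

# X is a fair die, Y is 0 or 6 with probability 1/2 each
pX = { k: Fraction(1,6) for k in range(1,7) }
pY = { 0: Fraction(1,2), 6: Fraction(1,2) }

# distribution of the independent sum X + Y
pS = {}
for x, px in pX.items():
    for y, py in pY.items():
        pS[x+y] = pS.get(x+y, 0) + px*py

print(pS)    # every value 1, 2, ..., 12 has probability 1/12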

Taking the log in (5.3.18), independence is related to cumulant-generating


functions as follows.

Independent Sums and Cumulant-Generating Functions

Let X and Y be independent random variables. Then the cumulant-


generating function of X + Y is

ZX+Y (t) = ZX (t) + ZY (t).

Taking the second derivative, plugging in t = 0, and using (5.3.10), we


obtain
V ar(X + Y ) = V ar(X) + V ar(Y ).
This holds when X and Y are independent. In general, the result is

Independent Sums and Variances

Let X1 , X2 , . . . , Xn be independent random variables, and let

Sn = X1 + X2 + · · · + Xn .

Then

V ar(Sn ) = V ar(X1 ) + V ar(X2 ) + · · · + V ar(Xn ). (5.3.19)



The next simplest discrete random variable is the binomial random vari-
able Sn ,
Sn = X1 + X2 + · · · + Xn
obtained from n independent Bernoulli random variables.
Then Sn has values 0, 1, 2, . . . , n, and the probability mass function is

p(x) = (n choose x) p^x (1 − p)^{n−x} for x = 0, 1, 2, . . . , n, and p(x) = 0 otherwise.

Since the cdf F (x) is the sum of the pmf p(k) for k ≤ x, the code

from scipy.stats import binom

n, p = 8, .5
B = binom(n,p)

for k in range(n+1): print(k, B.pmf(k), B.cdf(k))

returns

0 0.003906250000000007 0.00390625
1 0.031249999999999983 0.03515625
2 0.10937500000000004 0.14453125
3 0.21874999999999992 0.36328125
4 0.27343749999999994 0.63671875
5 0.2187499999999999 0.85546875
6 0.10937500000000004 0.96484375
7 0.031249999999999983 0.99609375
8 0.00390625 1.0

Since

E(Sn ) = E(X1 ) + E(X2 ) + · · · + E(Xn ) = p + p + · · · + p = np,

the mean of Sn is np.


Since X1 , X2 . . . , Xn are independent, by (5.3.19), V ar(Sn ) = np(1 − p).
Summarizing,
E(Sn ) = np, V ar(Sn ) = np(1 − p). (5.3.20)
If p̂n is the proportion of heads, then p̂n = Sn /n, so

E(p̂n) = p, V ar(p̂n) = p(1 − p)/n.    (5.3.21)
By the binomial theorem, the moment-generating function is
E(e^{tSn}) = Σ_{k=0}^n e^{tk} (n choose k) p^k (1 − p)^{n−k} = (p e^t + 1 − p)^n.

Then the cumulant-generating function is

Z(t) = n log(p e^t + 1 − p).




A continuous random variable X takes on continuous values x with prob-


ability density function p(x) (pdf in Python). Here means are computed by
integrals using the fundamental theorem of calculus (A.5.2).

Definition of Expectation: Continuous Case

Let X have probability density function p(x). The expectation of X is


E(X) = ∫ x p(x) dx.    (5.3.22)

E(X) is also called the mean or average or first moment of X, and is


usually denoted µ.

Here the integration is over the entire range of the random variable: If X
takes values in the interval [a, b], the integral is from a to b. For a normal
random variable, the range is (−∞, ∞). For a chi-squared random variable,
the range is (0, ∞). Below, when we do not specify the limits of integration,
the integral is taken over the whole range of X.
More generally, let f (x) be a function. The mean of f (X) or expectation
of f(X) is

E(f(X)) = ∫ f(x) p(x) dx.    (5.3.23)

Since the total probability is one,


E(1) = ∫ p(x) dx = 1.

This only holds when the integral is over the complete range of X. When this is not so,

P rob(a < X < b) = ∫_a^b p(x) dx
is the green area in Figure 5.15. Thus

chance = confidence = probability = area.



When f (x) = x2 , the mean of f (X) is the second moment


E(X²) = ∫ x² p(x) dx.

When f(x) = e^{tx}, the mean of f(X) is the moment-generating function

M(t) = E(e^{tX}) = ∫ e^{tx} p(x) dx.


As before, the log of the moment-generating function is the cumulant-


generating function

Z(t) = log M (t) = log E etX .





Fig. 5.15 Confidence that X lies in interval [a, b].

The simplest continuous distribution is the uniform distribution. A random


variable X is distributed uniformly over the interval [0, 1] if
P rob(a < X < b) = b − a = ∫_a^b 1 dx, 0 < a < b < 1.

Here the probability density function is

p(x) = 1 if a < x < b, and p(x) = 0 otherwise,

for any interval (a, b) inside (0, 1).




Fig. 5.16 Uniform probability density function (pdf).

The mean of a uniform random variable is


E(X) = ∫_0^1 x dx.

Since

F(x) = x²/2 =⇒ F′(x) = x,

by the fundamental theorem of calculus (A.5.2),

E(X) = ∫_0^1 x dx = F(1) − F(0) = 1/2.

Since F(x) = x³/3 implies F′(x) = x², the second moment is

E(X²) = ∫_0^1 x² dx = F(1) − F(0) = 1/3.

Hence the variance is

V ar(X) = E(X²) − E(X)² = 1/3 − 1/4 = 1/12.

The moment-generating function is

M(t) = E(e^{tX}) = ∫_0^1 e^{tx} dx = (e^t − 1)/t.

The cumulative distribution function is

F(x) = 0 if x < 0, F(x) = x if 0 ≤ x ≤ 1, and F(x) = 1 if x > 1.    (5.3.24)

More generally, fix an interval [a, b]. A random variable X is uniform on


[a, b] if the probability density function of X is
p(x) = 1/(b − a) if a < x < b, and p(x) = 0 otherwise.

For such an X, the mean is


µ = E(X) = (1/(b − a)) ∫_a^b x dx = (1/(b − a)) · (b² − a²)/2 = (a + b)/2,    (5.3.25)

and the variance is

V ar(X) = (1/(b − a)) ∫_a^b (x − µ)² dx = (b − a)²/12.    (5.3.26)

In particular, if [a, b] = [−1, 1], then the mean is zero, the variance is 1/3, and

E(f(X)) = (1/2) ∫_{−1}^1 f(x) dx.
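Here is a brief check of (5.3.25) and (5.3.26) using scipy.stats.uniform; the interval [2, 5] is an arbitrary choice.

from scipy.stats import uniform

# uniform on [a,b]: in scipy, uniform(loc,scale) is uniform on [loc, loc+scale]
a, b = 2, 5
X = uniform(a, b-a)

print(X.mean())    # (a+b)/2 = 3.5
print(X.var())     # (b-a)**2/12 = 0.75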

We summarize the differences between discrete and continuous random


variables. In both cases, the cumulative distribution function is

F (x) = P rob(X ≤ x).

When X is discrete,

F(x) = Σ_{x_k ≤ x} p_k.

When X is continuous,

F(x) = ∫_{−∞}^x p(z) dz.

Then each green area in Figure 5.15 is the difference between two areas,

F (b) − F (a).

When X is discrete, the probability mass function is

p(x) = P rob(X = x).

When X is continuous, the probability density function p(x) satisfies


P rob(a < X < b) = ∫_a^b p(x) dx.

For a continuous random variable the probability density function is the


derivative of the cumulative distribution function,

p(x) = F ′ (x). (5.3.27)

discrete continuous
density pmf pdf
distribution cdf cdf
sum cdf(x) = sum([ pmf(k)for k in range(x+1)]) cdf(x) = integrate(pdf,x)
difference pmf(k) = cdf(k)-cdf(k-1) pdf(x) = derivative(cdf,x)

Table 5.17 Densities versus distributions.

Table 5.17 summarizes the situation. For the distribution on the left in
Figure 5.15, the cumulative distribution function is in Figure 5.18.

Fig. 5.18 Continuous cumulative distribution function.

A logistic random variable is a random variable X with cumulative distri-


bution function σ(x) (5.1.22). For a logistic random variable, the probability
density function is

p(x) = σ ′ (x) = σ(x)(1 − σ(x)), (5.3.28)

the mean is zero, and the variance is π 2 /3 (see the exercises).

Let X and Y be independent uniform random variables on [0, 1], and let
Z = max(X, Y ). We compute the pdf p(x), the cdf F (x), and the mean of
Z. By definition of max(X, Y ),

F(x) = P rob(Z ≤ x) = P rob(max(X, Y) ≤ x) = P rob(X ≤ x and Y ≤ x).

By independence, for 0 ≤ x ≤ 1, this equals

P rob(X ≤ x) P rob(Y ≤ x) = x².

Hence

F(x) = P rob(max(X, Y) ≤ x) = 0 if x < 0, x² if 0 ≤ x ≤ 1, and 1 if x > 1.

From this,

p(x) = F′(x) = 2x if 0 ≤ x ≤ 1, and p(x) = 0 otherwise.

From this, by the FTC (§A.5),

E(max(X, Y)) = ∫ x p(x) dx = ∫_0^1 x(2x) dx = [2x³/3]_{x=0}^{x=1} = 2/3.
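Here is a short simulation checking E(max(X, Y)) = 2/3; the sample size is arbitrary.

from numpy import *
from numpy.random import uniform

# simulate Z = max(X,Y) for independent uniforms on [0,1]
n = 100000
x = uniform(0, 1, n)
y = uniform(0, 1, n)
z = maximum(x, y)

print(mean(z))    # close to 2/3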

Let X have mean µ and variance σ 2 , and write


Z = (X − µ)/σ.

Then

E(Z) = (1/σ) E(X − µ) = (E(X) − µ)/σ = (µ − µ)/σ = 0,

and

E(Z²) = (1/σ²) E((X − µ)²) = σ²/σ² = 1.
We conclude Z has mean zero and variance one.
A random variable is standard if its mean is zero and its variance is one.
The variable Z is the standardization of X. For example, the standardization
of a Bernoulli random variable is
Z = (X − p)/√(p(1 − p)),

and the standardization of a uniform random variable on [0, 1] is

Z = √12 (X − 1/2).

A random variable X is Poisson with parameter λ if X is discrete and


takes on the nonnegative integer values k = 0, 1, 2, . . . with probabilities

P rob(X = k) = e^{−λ} · λ^k/k!, k = 0, 1, 2, . . . .    (5.3.29)

Here λ > 0. From the exponential series (A.3.12),

Σ_{k=0}^∞ P rob(X = k) = e^{−λ} Σ_{k=0}^∞ λ^k/k! = 1,

so the total probability is one. The Python code for a Poisson random variable
is

from scipy.stats import poisson

lamda = 1
P = poisson(lamda)

for k in range(10): print(k, P.pmf(k), P.cdf(k))

The mean and variance of a Poisson with parameter λ are both λ (Exer-
cise 5.3.13).

Definition of Identically Distributed

Random variables X and Y are identically distributed if

E(X n ) = E(Y n ), n ≥ 1.

This is equivalent to X and Y having equal probabilities,

P rob(a < X < b) = P rob(a < Y < b),

for every interval [a, b], and equivalent to having the same moment-
generating functions,
MX (t) = MY (t)
for every t.

For example, if X and Y satisfy (5.3.12), then X and 2Y − 1 are identi-


cally distributed. However, X and 2Y − 1 are independent iff X and Y are
independent, which, as we saw above, happens only when (5.3.15) holds.

On the other hand, Let X be any random variable, and let Y = X. Then
X and Y are identically distributed, but are certainly correlated. So identical
distributions does not imply independence, nor vice-versa.
Let X be a random variable. A simple random sample of size n is a sequence
of random variables X1 , X2 , . . . , Xn that are independent and identically
distributed. We also say the sequence X1 , X2 , . . . , Xn is an i.i.d. sequence
(independent identically distributed).
For example, going back to the smartphone example, suppose we select n
students at random, where we are allowed to select the same student twice.
We obtain numbers x1 , x2 , . . . , xn . So the result of a single selection experi-
ment is a sequence of numbers x1 , x2 , . . . , xn . To make statistical statements
about the results, we repeat this experiment many times, and we obtain a
sequence of numbers x1 , x2 , . . . , xn each time.
This process can be thought of n machines producing x1 , x2 , . . . , xn each
time, or n random variables X1 , X2 , . . . , Xn (Figure 5.19). By making each
of the n selections independently, we end up with an i.i.d. sequence, or a
simple random sample.


Fig. 5.19 When we sample X1 , X2 , . . . , Xn , we get x1 , x2 , . . . , xn .

Let X1 , X2 , . . . , Xn be independent and identically distributed, and let µ


be their common mean E(X). The sample mean is
X̄n = Sn/n = (X1 + X2 + · · · + Xn)/n = (1/n) Σ_{k=1}^n Xk.

Then

E(X̄n) = (1/n)(E(X1) + E(X2) + · · · + E(Xn)) = (1/n) · nµ = µ.
We conclude the mean of the sample mean equals the population mean.
Now let σ 2 be the common variance of X1 , X2 , . . . , Xn . By (5.3.19), the
variance of Sn is nσ 2 , hence the variance of X̄n is σ 2 /n. Summarizing,

Mean and Variance of Sample Mean

If X1 , X2 , . . . , Xn are independent and identically distributed, each


with mean µ and variance σ 2 , then

E(X̄n) = µ, V ar(X̄n) = σ²/n,    (5.3.30)

and

√n (X̄n − µ)/σ    (5.3.31)
is standard.

For example, when X1 , X2 , . . . , Xn are independent and identically dis-


tributed according to a random variable X, the proportion p̂ of samples
(5.3.9) greater than a threshold a has mean p = P rob(X > a), and variance
p(1 − p)/n. It follows that
Z = √n · (p̂ − p)/√(p(1 − p))

is standard.
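Here is a short simulation sketch of (5.3.30); the values of µ, σ, and n are made up for illustration.

from numpy import *
from numpy.random import normal

# many sample means, each from n i.i.d. samples with mean mu and sdev sigma
mu, sigma, n, trials = 3, 2, 25, 100000
samples = normal(mu, sigma, (trials, n))
xbar = mean(samples, axis=1)

print(mean(xbar))    # close to mu
print(var(xbar))     # close to sigma**2/n = 0.16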

Exercises

Exercise 5.3.1 Let a and b be integers and let X have values a, a + 1, a + 2,


. . . , b − 1. Assume the values are equally likely. Use (A.3.4) to show

M_X(t) = (1/(b − a)) · (e^{tb} − e^{ta})/(e^t − 1).

Exercise 5.3.2 Let A and B be events and let X and Y be the Bernoulli
random variables corresponding to A and B (5.3.8). Show that A and B are
independent (5.2.1) if and only if X and Y are independent (5.3.16).
Exercise 5.3.3 [30] Let X be a binomial random variable with mean 7 and
variance 3.5. What are P rob(X = 4) and P rob(X > 14)?
Exercise 5.3.4 The proportion of adults who own a cell phone in a certain
Canadian city is believed to be 90%. Thirty adults are selected at random
from the city. Let X be the number of people in the sample who own a cell
phone. What is the distribution of the random variable X?
Exercise 5.3.5 If two random samples of sizes n1 and n2 are selected inde-
pendently from two populations with means µ1 and µ2 , show the mean of the

sample mean difference X̄1 − X̄2 equals µ1 − µ2 . If σ1 and σ2 are standard


deviations of the two populations, then the standard deviation of X̄1 − X̄2
equals

√(σ1²/n1 + σ2²/n2).

Exercise 5.3.6 Check (5.3.25) and (5.3.26).

Exercise 5.3.7 [30] You arrive at the bus stop at 10:00am, knowing the bus
will arrive at some time uniformly distributed during the next 30 minutes.
What is the probability you have to wait longer than 10 minutes? Given that
the bus hasn’t arrived by 10:15am, what is the probability that you’ll have
to wait at least an additional 10 minutes?

Exercise 5.3.8 If X and Y satisfy (5.3.12), show X and 2Y −1 are identically


distributed for any a, b, c.

Exercise 5.3.9 Let B and G be the number of boys and the number of girls
in a randomly selected family with probabilities as in Table 5.7. Are B and
G independent? Are they identically distributed?

Exercise 5.3.10 If X and Y satisfy (5.3.12), use Python to verify (5.3.15)


and (5.3.17).

Exercise 5.3.11 If X and Y satisfy (5.3.12), compute V ar(X) and V ar(Y )


in terms of a, b, c. What condition on a, b, c maximizes V ar(X)? What
condition on a, b, c maximizes V ar(Y )?

Exercise 5.3.12 Let X be Poisson with parameter λ. Show the cumulant-


generating function is
Z(t) = λ(et − 1).
(Use the exponential series (A.3.12).)

Exercise 5.3.13 Let X be Poisson with parameter λ. Show both E(X) and
V ar(X) equal λ (Use (5.3.10).)

Exercise 5.3.14 Let X and Y be independent Poisson with parameter λ


and µ respectively. Show X + Y is Poisson with parameter λ + µ.

Exercise 5.3.15 If X1 , X2 , . . . , Xn are i.i.d. Poisson with parameter λ, show

Sn = X1 + X2 + · · · + Xn

is Poisson with parameter nλ.



Exercise 5.3.16 The relu(x) function is a common activation function in


neural networks (§7.2),
relu(x) = x if x ≥ 0, and relu(x) = 0 if x < 0.

If Sn is Poisson with parameter n, then

E(relu(Sn − n)) = e^{−n} · n^{n+1}/n!.
(Use Exercise A.1.2.)
Exercise 5.3.17 Suppose X is a logistic random variable (5.3.28). Show the
probability density function of X is σ(x)(1 − σ(x)).
Exercise 5.3.18 Suppose X is a logistic random variable (5.3.28). Show the
mean of X is zero.
Exercise 5.3.19 Suppose X is a logistic random variable (5.3.28). Use
(A.3.16) with a = −e−x to show the variance of X is

4 Σ_{n=1}^∞ (−1)^{n−1}/n² = 4 (1 − 1/4 + 1/9 − 1/16 + . . . ).

(This requires knowledge of integration substitution.) Using other tools, it


can be shown separately this sum equals π 2 /3 [14].
Exercise 5.3.20 Let X1 , X2 , . . . , Xn be i.i.d. each uniformly distributed on
[0, 1]. Let
Xmax = max(X1 , X2 , . . . , Xn ).
Compute F (x) = P rob(Xmax ≤ x). From that, compute the pdf p(x) of
Xmax , then the mean E(Xmax ). (To evaluate the integral in E(Xmax ), use
the FTC.)
Exercise 5.3.21 Let X1 , X2 , . . . , Xn be i.i.d. each uniformly distributed on
[0, 1]. Let
Xmin = min(X1 , X2 , . . . , Xn ).
Compute 1 − F (x) = P rob(Xmin > x). From that, compute the pdf p(x)
of Xmin , then the mean E(Xmin ). (To evaluate the integral in E(Xmin ), use
(5.1.15) and (5.1.17).)
Exercise 5.3.22 A random variable X is exponential with parameter a > 0
if P rob(X > x) = e−x/a . Then P rob(X ≤ 0) = 0, so the values of X are
positive. Show that the mean and standard deviation of X are both a.
Exercise 5.3.23 A random variable is arcsine if its pdf is given by Fig-
ure 3.11. Compute the mean and variance of an arcsine random variable.
(Substitute x = (2/π) arcsin(√λ/2) in the integrals.)

5.4 Normal Distribution

A random variable Z has a standard normal distribution or Z distribution or


gaussian distribution if its probability density function is given by the famous
formula
p(z) = (1/√(2π)) · e^{−z²/2}.    (5.4.1)
This means the normal distribution is continuous and the probability that
Z lies in a small interval [a, b] is

P rob(a < Z < b)/(b − a) ≈ p(z), a < z < b.
When the interval [a, b] is not small, this is not correct. The exact formula
for P rob(a < Z < b) is the area under the graph (Figure 5.20). This is
obtained by integration (§A.5),
P rob(a < Z < b) = ∫_a^b p(x) dx.    (5.4.2)

Under this interpretation, this probability corresponds to the area under the
graph (Figure 5.20) between the vertical lines at a and at b, and the total
area under the graph corresponds to a = −∞ and b = ∞.


Fig. 5.20 The pdf of the standard normal distribution.

The normal probability density function is plotted by

from scipy.stats import norm as Z


from numpy import *
from matplotlib.pyplot import *

# Z defaults to standard normal
# for non-standard, use Z(mu,sdev)
mu, sdev = 0, 1

grid()
z = arange(mu-3*sdev,mu+3*sdev,.01)
p = Z.pdf(z)
plot(z,p)

show()


The curious constant √(2π) in (5.4.1) is inserted to make the total area under the graph equal to one. That this is so arises from the fact that 2π is the circumference of the unit circle. Using Python, we see √(2π) is the correct constant, since the code

from numpy import *


from scipy.integrate import quad

def p(z): return exp(-z**2/2)

a,b = -inf, inf


I = quad(p,a,b)[0] # integral from a to b

allclose(I, sqrt(2*pi))

returns True.

The mean of Z is

E(Z) = ∫ z p(z) dz.

More generally, means of f(Z) are

E(f(Z)) = ∫ f(z) p(z) dz,

with the integral computed using the fundamental theorem of calculus (A.5.2)
or Python.

Let p(z) be the probability density function of Z. If we shift the graph of


p(z) horizontally by t, we obtain p(z − t). Since shifting a graph doesn't
change the total area under the graph,
∫ p(z − t) dz = 1.    (5.4.3)

By definition, the moment-generating function of Z is


M(t) = E(e^{tZ}) = ∫ e^{tz} p(z) dz.


Using (5.4.3), one can show (Exercise 5.4.11)


M(t) = e^{t²/2} = exp(t²/2).    (5.4.4)

From this, the cumulant-generating function is t2 /2. Using (5.3.10), it follows


Z is indeed a standard random variable,

E(Z) = 0, V ar(Z) = 1

Expand both sides of the definition of MZ (t) in exponential series. This


results in
1 + t E(Z) + (t²/2!) E(Z²) + (t³/3!) E(Z³) + (t⁴/4!) E(Z⁴) + . . .
= 1 + t²/2 + (1/2!)(t²/2)² + (1/3!)(t²/2)³ + . . . .

From this, the odd moments of Z are zero, and the even moments are

E(Z^{2n}) = (2n)!/(2^n n!), n = 0, 1, 2, . . .

By separating the even and the odd factors, this simplifies to

E(Z^{2n}) = (1 · 3 · 5 · · · · · (2n − 1))(2 · 4 · · · · · 2n)/(2^n n!)
= (1 · 3 · 5 · · · · · (2n − 1)) 2^n n!/(2^n n!)
= 1 · 3 · 5 · · · · · (2n − 1), n ≥ 1.    (5.4.5)

For example,

E(Z) = 0, E(Z 2 ) = 1, E(Z 3 ) = 0, E(Z 4 ) = 3, E(Z 5 ) = 0, E(Z 6 ) = 15.
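As a numerical check of these moments, the following sketch integrates z^n p(z) directly.

from numpy import *
from scipy.integrate import quad

# standard normal pdf
def p(z): return exp(-z**2/2) / sqrt(2*pi)

# E(Z^n) by numerical integration; even moments match 1*3*5*...*(2n-1)
for n in range(1, 7):
    moment = quad(lambda z: z**n * p(z), -inf, inf)[0]
    print(n, round(moment, 4))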

More generally, we say X has a normal distribution with parameters µ and


σ 2 , if its moment-generating function is
M_X(t) = E(e^{tX}) = exp(µt + σ²t²/2).    (5.4.6)

Then its cumulant-generating function is

Z_X(t) = µt + σ²t²/2,

hence its mean and variance are

Z_X′(0) = µ, Z_X″(0) = σ².

From this, if X is normal with parameters µ and σ 2 , then its standardization


Z = (X − µ)/σ is standard normal.

We restate the two fundamental results of probability in the language of


this section and in terms of limits. We usually deal with limits in an intuitive
manner. For additional information on limits, see §A.6.
Given a sample from a random variable X, the population mean is µ =
E(X), and the population variance is σ 2 = V ar(X). The LLN says for large
sample size, the sample mean X̄ approximately equals µ. More exactly,

Law of Large Numbers (LLN)

Let X1 , X2 , . . . , Xn be independent identically distributed random


variables, each with mean µ and variance σ 2 , and let
X̄n = (X1 + X2 + · · · + Xn)/n

be the sample mean. Then

lim_{n→∞} X̄n = µ.

The CLT says for large sample size, the sample mean is approximately
normal with mean µ and variance σ 2 /n. More exactly,

Central Limit Theorem (CLT)

Let

Z̄n = √n (X̄n − µ)/σ
be the standardized sample mean, and let Z be a standard normal
random variable. Then
lim_{n→∞} P rob(a < Z̄n < b) = P rob(a < Z < b)

for every interval [a, b].

An equivalent form of the CLT is



lim_{n→∞} E(f(Z̄n)) = E(f(Z))    (5.4.7)

for every function f (x).


Let Mn (t) be the moment-generating function of Z̄n . Another equivalent
form of the CLT is convergence of the moment-generating functions,
lim_{n→∞} Mn(t) = e^{t²/2},    (5.4.8)

for every t.

Toss a coin n times, assume the coin’s bias is p, and let Sn be the number
of heads. Then, by (5.3.20), Sn is binomial with mean µ = np and standard
deviation σ = √(np(1 − p)). By the CLT, Sn is approximately normal with
the same mean and sdev, so the cumulative distribution function of Sn ap-
proximately equals the cumulative distribution function of a normal random
variable with the same mean and sdev.

Fig. 5.21 The binomial cdf and its CLT normal approximation.

The code

from numpy import *


from scipy.stats import binom, norm
from matplotlib.pyplot import *

n, p = 100, pi/4
mu = n*p
sigma = sqrt(n*p*(1-p))

B = binom(n,p)
Z = norm(mu,sigma)

x = arange(mu - 2*sigma, mu + 2*sigma, .01)


plot(x, Z.cdf(x), label="normal approx")
plot(x, B.cdf(x), label="binomial")

grid()
legend()
show()

returns Figure 5.21.

Using the compound-interest formula (A.3.8), it is simple to derive the


CLT. We derive the third version (5.4.8) of the CLT. Let x1 , x2 , . . . , xN be
a scalar dataset, and assume the dataset is standardized. Then its mean and
variance are zero and one,
Σ_{k=1}^N x_k = 0, (1/N) Σ_{k=1}^N x_k² = 1.

If the samples of the dataset are equally likely, then sampling the dataset
results in a random variable X, with expectations given by (5.3.2). It follows
that X is standard, and the moment-generating function of X is
E(e^{tX}) = (1/N) Σ_{k=1}^N e^{tx_k}.

If X1 , X2 , . . . , Xn are obtained by repeated sampling of the dataset, then


they are i.i.d. following X.
If X̄n is the sample mean
X̄n = (X1 + X2 + · · · + Xn)/n,

then, by (5.3.31), Z̄n = √n X̄n is standard.
By independence, the moment-generating function Mn (t) of Z̄n is the
product
Mn(t) = E(e^{t√n X̄n}) = E(e^{t(X1 + X2 + · · · + Xn)/√n}) = (E(e^{tX/√n}))^n.

By the exponential series,


e^{tX/√n} = 1 + (t/√n) X + (t²/2n) X² + . . .

Since the mean and variance of X are zero and 1, taking expectations of both sides,

E(e^{tX/√n}) = 1 + t²/2n + . . . .

From this,

Mn(t) = (1 + t²/2n + . . . )^n.
By the compound-interest formula (A.3.8) (the missing terms . . . don’t affect
the result)
lim_{n→∞} Mn(t) = e^{t²/2},

which is the moment-generating function of the standard normal distribution.


Even though we couched this derivation in terms of a standardized dataset,
it is valid in general. This completes the derivation of the CLT.

The standard normal distribution is symmetric about zero, and has a


specific width. Because of the symmetry, a random number Z following this
distribution is equally likely to satisfy Z < 0 and Z > 0, so P rob(Z < 0) =
P rob(Z > 0). Since the total area equals 1,

P rob(Z < 0) + P rob(Z > 0) = 1,

we expect the chance that Z < 0 should equal 1/2. In other words, because
of the symmetry of the curve, we expect to be 50% confident that Z < 0, or
0 is at the 50-th percentile level. So

chance = confidence = percentile = area

To summarize, we expect P rob(Z < 0) = 1/2.


When
P rob(Z < z) = p,
we say z is the z-score z corresponding to the p-value p. Equivalently, we say
our confidence that Z < z is p, or the percentile of z equals 100p. In Python,
the relation between z and p (Figure 5.22) is specified by

from scipy.stats import norm as Z

p = Z.cdf(z)
z = Z.ppf(p)

ppf is the percentile point function, and cdf is the cumulative distribution
function.


Fig. 5.22 z = Z.ppf(p) and p = Z.cdf(z).

In Figure 5.23, the red areas are the lower tail p-value P rob(Z < z), the
two-tail p-value P rob(|Z| > z), and the upper tail p-value P rob(Z > z).
By symmetry of the graph, upper-tail and two-tail p-values can be com-
puted from lower tail p-values.

P rob(a < Z < b) = P rob(Z < b) − P rob(Z < a),

and

P rob(|Z| < z) = P rob(−z < Z < z) = P rob(Z < z) − P rob(Z < −z),

and
P rob(Z > z) = 1 − P rob(Z < z).
To go backward, suppose we are given P rob(|Z| < z) = p and we want
to compute the cutoff z. Then P rob(|Z| > z) = 1 − p, so P rob(Z > z) =
(1 − p)/2. This implies

P rob(Z < z) = 1 − (1 − p)/2 = (1 + p)/2.

In Python,

from scipy.stats import norm as Z

# p = P(|Z| < z)

z = Z.ppf((1+p)/2)
p = Z.cdf(z) - Z.cdf(-z)


Fig. 5.23 Confidence (green) or significance (red) (lower-tail, two-tail, upper-tail).

Now let’s zoom in closer to the graph and mark off z-scores 1, 2, 3 on the
horizontal axis to obtain specific colored areas as in Figure 5.24. These areas
are governed by the 68-95-99 rule (Table 5.25). Our confidence that |Z| < 1
equals the blue area 0.685, our confidence that |Z| < 2 equals the sum of the
blue plus green areas 0.955, and our confidence that |Z| < 3 equals the sum
of the blue plus green plus red areas 0.997. This is summarized in Table 5.25.


Fig. 5.24 68%, 95%, 99% confidence cutoffs for standard normal.

The possibility |Z| > 1 is called a 1-sigma event, |Z| > 2 a 2-sigma event,
and so on. So a 2-sigma event is 95.5% unlikely, or 4.5% likely. An event is
considered statistically significant if it’s a 2-sigma event or more. In other
words, something is significant if it’s unlikely. A six-sigma event |Z| > 6 is
two in a billion. You want a plane crash to be six-sigma.
These terms are defined for two-tail p-values. The same terms may be used
for upper-tail or lower tail p-values.

cutoff abs confidence two-tail p-value


z 1−p p
1 .685 .315
2 .955 .045
3 .997 .003

Table 5.25 Cutoffs, confidence levels, p-values.

Figure 5.24 is not to scale, because a 1-sigma event should be where the
curve inflects from convex to concave (in the figure this happens closer to
2.7). Moreover, according to Table 5.25, the left-over white area should be
.3% (3 parts in 1,000), which is not what the figure suggests.

Significant and Highly Significant

An event A is significant if P rob(A) < 0.05. An event A is highly


significant if P rob(A) < 0.01.

An event is statistically significant if its p-value is 5% or less (Table 5.26).


For example, Z > z is statistically significant if P rob(Z > z) is .05 or
less, which means z is greater than 1.64, Z < z is statistically significant
if P rob(Z < z) is .05 or less, which means z is less than −1.64, and |Z| > z
is statistically significant if P rob(|Z| > z) is .05 or less, which means |z| is
greater than 1.96.

event type p-value z-score


Z > z upper tail .05 1.64
Z < z lower tail .05 -1.64
|Z| > z two-tail .05 1.96
Z > z upper tail .01 2.33
Z < z lower tail .01 -2.33
|Z| > z two-tail .01 2.56

Table 5.26 p-values at 5% and at 1%.

An event is highly significant if its p-value is 1% or less (Table 5.26). For


example, Z > z is highly significant if P rob(Z > z) is .01 or less, which
means z is greater than 2.33, Z < z is highly significant if P rob(Z < z) is .01
or less, which means z is less than −2.33, and |Z| > z is highly significant if
P rob(|Z| > z) is .01 or less, which means |z| is greater than 2.56.
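The z-scores in Table 5.26 can be recovered from the percentile point function; here is a brief sketch.

from scipy.stats import norm as Z

# cutoffs as in Table 5.26, from the percentile point function
for p in (.05, .01):
    print("upper tail", p, Z.ppf(1 - p))      # 1.64, 2.33
    print("lower tail", p, Z.ppf(p))          # -1.64, -2.33
    print("two-tail  ", p, Z.ppf(1 - p/2))    # compare with Table 5.26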

In general, the normal distribution is not centered at the origin, but else-
where. We say X is normal with mean µ and standard deviation σ if
Z = (X − µ)/σ
is distributed according to a standard normal. We write N (µ, σ) for the nor-
mal with mean µ and standard deviation σ. As its name suggests, it is easily
checked that such a random variable X has mean µ and standard deviation
σ. For the normal distribution with mean µ and standard deviation σ, the
cutoffs are as in Figure 5.27. In Python, norm(mu,sigma) returns the normal
with mean mu and standard deviation sigma.


Fig. 5.27 68%, 95%, 99% cutoffs for non-standard normal.

Here is a sample computation. Let X be a normal random variable with


mean µ and standard deviation σ, and suppose P rob(X < 7) = .15, and
P rob(X < 19) = .9. Given this data, we find µ and σ as follows.
With Z as above, we have

P rob(Z < (7 − µ)/σ) = .15, and P rob(Z < (19 − µ)/σ) = .9.

Also, since Z is standard, we compute

a = Z.ppf(.15)
b = Z.ppf(.9)

By definition of ppf (see above), we then have


a = (7 − µ)/σ, b = (19 − µ)/σ.
These are two equations in two unknowns. Multiplying both equations by σ
then subtracting, we obtain µ and σ,
σ = (19 − 7)/(b − a), µ = 7 − aσ.
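Putting this together in Python, a short sketch of the computation just described is

from scipy.stats import norm as Z

a = Z.ppf(.15)
b = Z.ppf(.9)

sigma = (19 - 7) / (b - a)
mu = 7 - a*sigma

print(mu, sigma)   # roughly mu = 12.4, sigma = 5.2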

Let X̄ be the sample mean


X̄ = (X1 + X2 + · · · + Xn)/n,

drawn from a normally distributed population with mean µ and standard
deviation σ. By (5.3.30), the standard deviation of X̄ is σ/√n.

Standard Deviation of Sample Mean is Standard Error

The standard deviation of the sample mean is called the standard


error. If the samples have standard deviation σ, the standard error is σ/√n.

To compute probabilities for X̄ when X has mean µ and standard deviation


σ, standardize X̄ by writing

Z = √n · (X̄ − µ)/σ,
then compute standard normal probabilities.

Here are two examples. In the first example, suppose student grades are
normally distributed with mean µ = 80 and variance σ 2 = 16. This says the
average of all grades is 80, and the standard deviation is σ = 4. If a grade is
g, the standardized grade is
z = (g − µ)/σ = (g − 80)/4.
A student is picked and their grade was g = 84. Is this significant? Is it highly
significant? In effect, we are asking, how unlikely is it to obtain such a grade?
Remember,
significant = unlikely
Since the standard deviation is 4, the student’s z-score is
z = (g − 80)/4 = (84 − 80)/4 = 1.

What’s the upper-tail p-value corresponding to this z? It’s


P rob(Z > z) = P rob(Z > 1) = (1/2) P rob(|Z| > 1) = .16,
or 16%. Since the upper-tail p-value is more than 5%, this student’s grade is
not significant.
For the second example, suppose a sample of n = 9 students are selected
and their sample average grade is ḡ = 84. Is this significant? Is it highly
significant? This time we take
z = √n · (ḡ − 80)/4 = 3 · (84 − 80)/4 = 3.
What’s the upper-tail p-value corresponding to this z? It’s

P rob(Z > z) = P rob(Z > 3) = 0.0013,

or .13%. Since the upper-tail p-value is less than 1%, yes, this sample average
grade is both significant and highly significant.
The same grade, g = 84, is not significant for a single student, but is
significant for nine students. This is a reflection of the law of large numbers,
which says the sample mean approaches the population mean as the sample
size grows.

To extract samples from a normal distribution, use numpy.random.normal.


For example.

from numpy.random import normal

mean, sdev, n = 80, 4, 20


normal(mean,sdev,n)

returns 20 normally distributed numbers, with specified mean and standard


deviation.
Be careful to distinguish between
numpy.random.normal and scipy.stats.norm.
The former returns samples from a normal distribution, while the latter re-
turns a normal random variable. Samples are just numbers; random variables
have cdf’s, pmf’s or pdf’s, etc.

Suppose student grades are normally distributed with mean 80 and vari-
ance 16. How many students should be sampled so that the chance that at
least one student’s grade lies below 70 is at least 50%?
To solve this, if p is the chance that a single student has a grade below 70,
then 1 − p is the chance that the student has a grade above 70. If n is the
sample size, (1 − p)n is the chance that all sample students have grades above
70. Thus the requested chance is 1 − (1 − p)n . The following code shows the
answer is n = 112.

from scipy.stats import norm as Z

z = 70
mean, sdev = 80, 4
p = Z(mean,sdev).cdf(z)

for n in range(2,200):
q = 1 - (1-p)**n
print(n, q)

Here is the code for computing tail probabilities for the sample mean X̄
drawn from a normally distributed population with mean µ and standard
deviation σ. When n = 1, this applies to a single normal random variable.

########################
# P-values
########################

from numpy import *


from scipy.stats import norm as Z

def pvalue(mean,sdev,n,xbar,type):
    Xbar = Z(mean,sdev/sqrt(n))
    if type == "lower-tail": p = Xbar.cdf(xbar)
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    elif type == "two-tail": p = 2 * (1 - Xbar.cdf(abs(xbar)))
    else:
        print("What's the tail type (lower-tail, upper-tail, two-tail)?")
        return
    print("sample size: ",n)
    print("mean,sdev,xbar: ",mean,sdev,xbar)
    print("mean,sdev,n,xbar: ",mean,sdev,n,xbar)
    print("p-value: ",p)
    z = sqrt(n) * (xbar - mean) / sdev
    print("z-score: ",z)

type = "upper-tail"
mean = 80
sdev = 4
n = 1
xbar = 90

pvalue(mean,sdev,n,xbar,type)

Exercises

Exercise 5.4.1 Let X be a normal random variable and suppose P rob(X <
1) = 0.3, and P rob(X < 2) = 0.4 What are the mean and variance of X?

Exercise 5.4.2 [27] Consider a normal distribution curve where the middle
90% of the area under the curve lies above the interval (4, 18). Use this
information to find the mean and the standard deviation of the distribution.

Exercise 5.4.3 Let Z be a normal random variable with mean 30.4 and
standard deviation of 0.7. What is P rob(29 < Z < 31.1)?

Exercise 5.4.4 [27] Consider a normal distribution where the 70th percentile
is at 11 and the 25th percentile is at 2. Find the mean and the standard
deviation of the distribution.

Exercise 5.4.5 [27] Let X1 , X2 , . . . , Xn be an i.i.d. sample each with mean


300 and standard deviation of 21. What is the mean and standard deviation
of the sample mean X̄?

Exercise 5.4.6 Suppose the scores of students are normally distributed with
a mean of 80 and a standard deviation of 4. A sample of size n is selected,
and the sample mean is 84. What is the least n for which this is significant?
What is the least n for which this is highly significant?

Exercise 5.4.7 [27] A manufacturer says their laser printers’ printing speeds
are normally distributed with mean 17.63 ppm and standard deviation 4.75
ppm. An i.i.d. sample of n = 11 printers is selected, with speeds X1 , X2 , . . . ,
Xn . What is the probability the sample mean speed X̄ is greater than 18.53
ppm?

Exercise 5.4.8 [27] Continuing Exercise 5.4.7, let Yk be the Bernoulli ran-
dom variable corresponding to the event Xk > 18 (5.3.8),
Yk = 1 if Xk > 18, and Yk = 0 otherwise.

We count the proportion of printers in the sample having speeds greater than
18 by setting
p̂ = (Y1 + Y2 + · · · + Yn)/n.
Compute E(p̂) and V ar(p̂). Use the CLT to compute the probability that
more than 50.9% of the printers have speeds greater than 18.
Exercise 5.4.9 [27] The level of nitrogen oxides in the exhaust of a particular
car model varies with mean 0.9 grams per mile and standard deviation 0.19
grams per mile . What sample size is needed so that the standard deviation
of the sampling distribution is 0.01 grams per mile?
Exercise 5.4.10 [27] The scores of students had a normal distribution with
mean µ = 559.7 and standard deviation σ = 28.2. What is the probability
that a single randomly chosen student scores 565 or higher? Now suppose
n = 30 students are sampled, assume i.i.d. What are the mean and standard
deviation of the sample mean score? What z-score corresponds to the mean
score of 565? What is the probability that the mean score is 565 or higher?
Exercise 5.4.11 Complete the square in the moment-generating function of
the standard normal pdf and use (5.4.3) to derive (5.4.4).
Exercise 5.4.12 Let Z be a standard normal random variable, and let
relu(x) be as in Exercise 5.3.16. Show
E(relu(Z)) = 1/√(2π).
(Use the fundamental theorem of calculus (A.5.2).)
Exercise 5.4.13 [7] Let X1 , X2 , . . . , Xn be i.i.d. Poisson random variables
(5.3.29) with parameter 1, let Sn = X1 + X2 + · · · + Xn , and let X̄n = Sn /n
be the sample mean. Then the mean of X̄n is 1, and the variance of X̄n is
1/n, so
Z̄n = √n (X̄n − 1) = (Sn − n)/√n
is standard (5.3.31). By the CLT, Z̄n is approximately standard normal for
large n. Use this to derive Stirling’s approximation (A.1.6). (Insert f (x) =
relu(x) in (5.4.7), then use Exercises 5.3.16 and 5.4.12.)

5.5 Chi-squared Distribution

Let X and Y be independent standard normal random variables. Then (X, Y )


is a random point in the plane. What is the probability that the point (X, Y )
lies inside a square (Figure 5.28)? Specifically, assume the square is |X| ≤ 1

and |Y| ≤ 1. Since X and Y are independent, the probability (X, Y) lies in the


square is

P rob(|X| ≤ 1 and |Y | ≤ 1) = P rob(|X| ≤ 1) P rob(|Y | ≤ 1)


= P rob(|X| ≤ 1)² = .685² = .469.

What is the probability (X, Y ) lies inside the unit disk,

P rob(X 2 + Y 2 ≤ 1)?

Here the answer is not as straightforward, and leads us to introduce the


chi-squared distribution.

Fig. 5.28 (X, Y ) inside the square and inside the disk.

A random variable U has a chi-squared distribution with degree 1 if


M_U(u) = E(e^{uU}) = 1/√(1 − 2u).

To compute the moments of U , we use the binomial theorem (4.1.20)


(1 + x)^p = Σ_{n=0}^∞ (p choose n) x^n = 1 + px + (p choose 2) x² + (p choose 3) x³ + . . .

to write out MU (u). Taking p = −1/2 and x = −2u,


1/√(1 − 2u) = (1 − 2u)^{−1/2} = Σ_{n=0}^∞ (−1/2 choose n) (−2u)^n.

Since

1/√(1 − 2u) = E(e^{uU}) = Σ_{n=0}^∞ (u^n/n!) E(U^n),

comparing coefficients of u^n/n! shows

E(U^n) = (−2)^n n! (−1/2 choose n), n = 0, 1, 2, . . .    (5.5.1)

Using the definition

(p choose n) = p · (p − 1) · · · · · (p − n + 1)/n!,

the binomial coefficient (p choose n) makes sense for fractional p (see (A.2.12)). With this, we have

E(U^n) = (−2)^n n! · (−1/2) · (−1/2 − 1) · · · · · (−1/2 − n + 1)/n!
= 1 · 3 · 5 · 7 · · · · · (2n − 1).

But this equals the right side of (5.4.5). Thus the left sides of (5.4.5) and
(5.5.1) are equal. This shows

Chi-squared is the Square of Normal

If Z is standard normal, then U = Z 2 is chi-squared with degree 1,


and E(U ) = 1, V ar(U ) = 2.
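Here is a quick numerical check that squares of standard normal samples behave like a chi-squared with degree 1; the sample size is arbitrary.

from numpy import *
from numpy.random import normal
from scipy.stats import chi2

# samples of U = Z**2 for standard normal Z
n = 100000
u = normal(0, 1, n)**2

print(mean(u), var(u))                 # close to 1 and 2
print(mean(u <= 1), chi2(1).cdf(1))    # empirical vs exact P rob(U <= 1)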

More generally, we say U is chi-squared with degree d if

U = U1 + U2 + · · · + Ud = Z12 + Z22 + · · · + Zd2 , (5.5.2)

with independent standard normal Z1 , Z2 , . . . , Zd .


By independence, the moment-generating functions multiply (§5.3), so the
moment-generating function for chi-squared with degree d is
1
MU (t) = E(etU ) = .
(1 − 2t)d/2

Going back to the question posed at the beginning of the section, we have
X and Y independent standard normal and we want

P rob(X 2 + Y 2 ≤ 1).

If we set U = X 2 + Y 2 , we want3 P rob(U ≤ 1). Since U is chi-squared with


degree d = 2, we use chi2.cdf(u,d). Then the code

from scipy.stats import chi2 as U

d = 2
u = 1

U(d).cdf(u)

returns 0.39.

Fig. 5.29 Chi-squared distribution with different degrees.

Figure 5.29 is returned by the code

from scipy.stats import chi2 as U


from matplotlib.pyplot import *
from numpy import *

u = arange(0,15,.01)

for d in range(1,7):
p = U(d).pdf(u)

3 Geometrically, the p-value P rob(U > 1) is the probability that a normally distributed
point in d-dimensional space is outside the unit sphere.

plot(u,p,label="d: " + str(d))

ylim(ymin=0,ymax=.6)
grid()
legend()
show()

Fig. 5.30 With degree d ≥ 2, the chi-squared density peaks at d − 2.

Let us compute the mean and variance of a chi-squared U with degree d.


When d = 1, we already know E(U ) = 1 and V ar(U ) = 2. In general, by
(5.5.2) and (5.3.19),
E(U) = Σ_{k=1}^d E(Zk²) = Σ_{k=1}^d 1 = d,

and

V ar(U) = Σ_{k=1}^d V ar(Zk²) = Σ_{k=1}^d 2 = 2d.

We conclude

Mean and Variance of Chi-squared

If U is chi-squared with degree d, the mean and variance of U are

E(U ) = d, and V ar(U ) = 2d.

The peak (maximum likelihood point) in the chi-squared density of degree


d is not at the mean d. Using polar coordinates, one can show the peak is at
d − 2 (Figure 5.30).

Because
1/(1 − 2t)^{d/2} · 1/(1 − 2t)^{d′/2} = 1/(1 − 2t)^{(d+d′)/2},
we obtain

Independent Chi-squared Variables

If U and U ′ are independent chi-squared with degrees d and d′ , then


U + U ′ is chi-squared with degree d + d′ .

To compute distributions for sample variances (below) and chi-squared


tests (§6.4), we need to derive chi-squared for correlated normal samples.
This is best approached using vector-valued random variables.
A vector-valued random variable is a vector X = (X1 , X2 , . . . , Xd ) in Rd
whose components are random variables. A vector-valued random variable is
also called a random vector. For example, a simple random sample X1 , X2 ,
. . . , Xn may be collected into a single random vector

X = (X1 , X2 , . . . , Xn )

in Rn .
Random vectors have means, variances, moment-generating functions,
and cumulant-generating functions, just like scalar-valued random variables.
Moreover we can have simple random samples of random vectors X1 , X2 ,
. . . , Xn .
If X is a random vector in Rd , its mean is the vector

µ = E(X) = (E(X1 ), E(X2 ), . . . , E(Xd )) = (µ1 , µ2 , . . . , µd ).

The variance of X is the d × d matrix Q whose (i, j)-th entry is



Qij = E((Xi − µi )(Xj − µj )), 1 ≤ i, j ≤ d.

In the notation of §2.2,

Q = E((X − µ) ⊗ (X − µ)).

By (1.4.18),

w · ((X − µ) ⊗ (X − µ))w = ((X − µ) · w)2 ,

hence
$$w\cdot Qw = E\left(((X-\mu)\cdot w)^2\right). \qquad (5.5.3)$$
Thus the variance of a random vector is a nonnegative matrix.
A random vector is standard if µ = 0 and Q = I. If X is standard, then

E(X · w) = 0, V ar(X · w) = |w|2 . (5.5.4)

In §2.2, we defined the mean and variance of a dataset (2.2.14). Then the
mean and variance there is the same as the mean and variance defined here,
that of a random variable.
To see this, we must build a random variable X corresponding to a dataset
x1 , x2 , . . . , xN . But this was done in (5.3.2). The moral is: every dataset may
be interpreted as a random variable.
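For example, the mean vector and variance matrix of a dataset of points in
R^d can be computed directly with numpy; the small dataset below is made up
purely for illustration, and bias=True divides by N (the empirical-distribution
convention assumed here).

from numpy import *

# made-up dataset: four points in R^2, one row per sample point
X = array([[1., 2.], [2., 1.], [3., 4.], [0., 1.]])

mu = mean(X, axis=0)                  # mean vector
Q = cov(X, rowvar=False, bias=True)   # variance matrix E((X - mu) ⊗ (X - mu))
print(mu)
print(Q)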

In §5.3, we considered i.i.d. sequences of scalar random variables. We can


also do the same with random vectors. If X1 , X2 , . . . , Xn is an i.i.d. se-
quence of random vectors, each with mean µ and variance Q, then the same
calculation as in §5.3 shows
$$\sqrt{n}\left(\frac{1}{n}\sum_{k=1}^n X_k - \mu\right) \qquad (5.5.5)$$

has mean zero and variance Q.

A random vector X is normal with mean µ and variance Q if for every


vector w, the scalar random variable X · w is normal with mean µ · w and
variance w · Qw. When µ = 0 and Q = I, then X is standard normal.
From §5.3, we see

Standard Normal Random Vectors


Z1 , Z2 , . . . , Zd is a simple random sample of standard normal random
variables iff
Z = (Z1 , Z2 , . . . , Zd )
is a standard normal random vector in Rd .

The central limit theorem remains valid for random vectors: If X1 , X2 ,


. . . , Xn is an i.i.d. sequence of random vectors with mean µ and variance
Q, then (5.5.5) is approximately normal, with mean zero and variance Q, for
large n.
From (5.5.2),

Uncorrelated Chi-squared

If Z = (Z1, Z2, . . . , Zd) is a standard normal random vector in R^d, then
$$|Z|^2 \qquad (5.5.6)$$

is chi-squared with degree d.

If X is a normal random vector with mean zero and variance Q, then, by


definition, X · w is normal with mean zero and V ar(X · w) = w · Qw. Using
(5.4.6) with t = 1, µ = 0, and σ 2 = w · Qw, the moment-generating function
of the random vector X is

$$M_X(w) = E\left(e^{w\cdot X}\right) = e^{w\cdot Qw/2}. \qquad (5.5.7)$$

In Python, the probability density function of a normal random vector


with mean µ and variance Q is

from numpy import *


from scipy.stats import multivariate_normal as Z

# mu is mean vector array


# Q is variance matrix array

# here x.shape == mu.shape


Z.pdf(x, mean=mu, cov=Q)

If x and y are arrays, then cartesian_product(x,y) is defined by

from numpy import *

def cartesian_product(x,y): return dstack(meshgrid(x,y))



If x and y have shapes (m,) and (n,) then xy = cartesian_product(x,y)


has shape (m,n,2), with xy[i,j,:] = array([x[i],y[j]]).

Fig. 5.31 Normal probability density on R2 .

Using this, we can plot the probability density function of a normal random
vector in R2 ,

%matplotlib ipympl
from numpy import *
from matplotlib.pyplot import *
from scipy.stats import multivariate_normal as Z

# standard normal
mu = array([0,0])
Q = array([[1,0],[0,1]])

x = arange(-3,3,.01)
y = arange(-3,3,.01)

xy = cartesian_product(x,y)
# last axis of xy is fed into pdf
z = Z(mu,Q).pdf(xy)

ax = axes(projection='3d')
ax.set_axis_off()
x,y = meshgrid(x,y)
ax.plot_surface(x,y,z, cmap='cool')
show()

resulting in Figure 5.31.

In §5.3 we studied correlation and independence. We saw how indepen-


dence implies uncorrelatedness, but not conversely. Now we show that, for
normal random vectors, they are in fact the same.

Independence and Correlation

If (X, Y ) is a normal random vector, then X and Y are uncorrelated


iff X and Y are independent.

Saying (X, Y ) is normal is more than just saying X is normal and Y is


normal. This is joint normality of X and Y. By subtracting their means, we
may assume the means of X and Y are zero.
To derive the result, we write down

E(X ⊗ X) = A, E(X ⊗ Y ) = B, E(Y ⊗ Y ) = C.

Then the variance of (X, Y) is
$$Q = \begin{pmatrix} E(X\otimes X) & E(X\otimes Y)\\ E(Y\otimes X) & E(Y\otimes Y)\end{pmatrix} = \begin{pmatrix} A & B\\ B^t & C\end{pmatrix}.$$
From this, we see X and Y are uncorrelated when B = 0.


With w = (u, v), we write
$$w\cdot Qw = \begin{pmatrix} u\\ v\end{pmatrix}\cdot\begin{pmatrix} A & B\\ B^t & C\end{pmatrix}\begin{pmatrix} u\\ v\end{pmatrix} = u\cdot Au + u\cdot Bv + v\cdot B^tu + v\cdot Cv.$$

Then
$$M_{X,Y}(w) = E\left(e^{w\cdot(X,Y)}\right) = e^{w\cdot Qw/2} = M_X(u)\,M_Y(v)\,e^{(u\cdot Bv + v\cdot B^tu)/2}.$$

From this, X and Y are independent when B = 0. Thus, for normal random
vectors, independence and uncorrelatedness are the same.

If Z is a standard normal random vector in R^d, then we saw in (5.5.6) that |Z|² is
chi-squared with degree d. Now we generalize this result to correlated normal
random vectors.

Correlated Chi-squared

Let X be a normal random vector with mean zero and variance Q.


Let r be the rank of Q, and let Q+ be the pseudo-inverse (§2.3) of Q.
Then
X · Q+ X (5.5.8)
is chi-squared with degree r.

To derive this, we use the eigenvalue decomposition (3.2.5) of Q: There is


a square diagonal matrix E and a matrix U satisfying

E = U t QU, Q+ = U E + U t ,

and
$$E = \begin{pmatrix}
\lambda_1 & 0 & 0 & \dots & 0\\
0 & \lambda_2 & 0 & \dots & 0\\
\dots & \dots & \dots & \dots & \dots\\
0 & \dots & 0 & \lambda_r & 0\\
0 & 0 & 0 & 0 & 0
\end{pmatrix},
\qquad
E^+ = \begin{pmatrix}
1/\lambda_1 & 0 & 0 & \dots & 0\\
0 & 1/\lambda_2 & 0 & \dots & 0\\
\dots & \dots & \dots & \dots & \dots\\
0 & \dots & 0 & 1/\lambda_r & 0\\
0 & 0 & 0 & 0 & 0
\end{pmatrix}.$$

Here r, the number of nonzero eigenvalues of Q, is the rank of Q.


Then, with Y = U^t X,
$$X\cdot Q^+X = X\cdot(UE^+U^t)X = (U^tX)\cdot E^+(U^tX) = Y\cdot E^+Y = \sum_{i=1}^r\frac{Y_i^2}{\lambda_i}.$$

Since Y has variance U t QU (Exercise 5.5.5), and U t QU = E, X · Q+ X is


chi-squared with degree r (Exercise 5.5.6).
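Here is a minimal simulation sketch of this result. The rank-deficient Q below
is an illustrative choice; the samples are generated as AZ with Z standard
normal, so their variance is AAᵗ = Q, and the sample mean and variance of
X·Q⁺X should be close to r and 2r.

from numpy import *
from numpy.linalg import pinv, matrix_rank
from numpy.random import default_rng

rng = default_rng(0)

A = array([[1., 0.], [1., 1.], [0., 1.]])   # illustrative 3x2 matrix
Q = A @ A.T                                 # singular variance matrix, rank 2
r = matrix_rank(Q)

N = 10**5
Z = rng.standard_normal((N, 2))
X = Z @ A.T                                 # rows are normal with variance Q

u = einsum('ij,jk,ik->i', X, pinv(Q), X)    # X · Q+ X for each row
print(r, mean(u), var(u))                   # compare with r and 2r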

Let µ be a unit vector in Rd , and let Q = I − µ ⊗ µ. Then Q has rank d − 1


(Exercise 2.9.2). Suppose X is a normal random vector with mean zero and
variance Q. Then

$$E((X\cdot\mu)^2) = \mu\cdot Q\mu = \mu\cdot(I-\mu\otimes\mu)\mu = \mu\cdot(\mu - (\mu\cdot\mu)\mu) = \mu\cdot(\mu-\mu) = 0,$$

so X · µ = 0.
By Exercise 2.6.7, Q+ = Q. Since X · µ = 0,

X · Q+ X = X · QX = X · (X − (X · µ)µ) = |X|2 .

We conclude

Singular Chi-squared

Let µ be a unit vector, and let X be a normal random vector with


mean zero and variance I −µ⊗µ. Then |X|2 is chi-squared with degree
d − 1.

We use the above to derive the distribution of the sample variance. Let
X1, X2, . . . , Xn be a random sample, and let X̄ be the sample mean,
$$\bar X = \frac{X_1 + X_2 + \dots + X_n}{n}.$$
Let S² be the sample variance,
$$S^2 = \frac{(X_1-\bar X)^2 + (X_2-\bar X)^2 + \dots + (X_n-\bar X)^2}{n-1}. \qquad (5.5.9)$$
Since (n − 1)S 2 is a sum-of-squares similar to (5.5.2), we expect (n − 1)S 2
to be chi-squared. In fact this is so, but the degree is n − 1, not n. We will
show

Independence of Sample Mean and Sample Variance

Let Z1 , Z2 , . . . , Zn be independent standard normal random vari-


ables, let Z̄ be the sample mean, and let S 2 be the sample variance.
Then (n − 1)S 2 is chi-squared with degree n − 1, and Z̄ and S 2 are
independent.

To see this, we work with the random vector Z = (Z1 , Z2 , . . . , Zn ) with


mean zero and variance I. Let u and v be vectors in Rn , let

1 = (1, 1, . . . , 1)

be in R^n, and let µ = 1/√n. Then µ is a unit vector and
$$Z\cdot\mu = \frac{1}{\sqrt{n}}\sum_{k=1}^n Z_k = \sqrt{n}\,\bar Z.$$

Since Z1, Z2, . . . , Zn are i.i.d. standard, Z · µ = √n Z̄ is standard.
Now let U = I − µ ⊗ µ and

X = UZ = Z − (Z · µ)µ = (Z1 − Z̄, Z2 − Z̄, . . . , Zn − Z̄).



Then the mean of X is zero. Since Z has variance I, by Exercises 2.2.2 and
5.5.5,
V ar(X) = U t IU = U 2 = U = I − µ ⊗ µ.
By singular chi-squared above,

(n − 1)S 2 = |X|2

is chi-squared with degree n − 1. Since Z · µ is standard,

E(X(Z · µ)) = E(Z(Z · µ)) − E((Z · µ)2 )µ = µ − µ = 0,

so X and Z · µ are uncorrelated. Since X and Z · µ are normal, X and Z · µ
are independent. Since (n − 1)S² = |X|² and √n Z̄ = Z · µ, S² and Z̄ are
independent.
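A short simulation sketch of this result (the sample size n = 5, the number
of repetitions, and the seed are illustrative choices): (n − 1)S² should have
mean n − 1 and variance 2(n − 1), and Z̄ and S² should be uncorrelated.

from numpy import *
from numpy.random import default_rng

rng = default_rng(0)
n, N = 5, 10**5
Z = rng.standard_normal((N, n))     # N samples, each of size n

zbar = Z.mean(axis=1)               # sample means
S2 = Z.var(axis=1, ddof=1)          # sample variances (n-1 in the denominator)
U = (n-1)*S2

print(mean(U), var(U))              # compare with n-1 = 4 and 2(n-1) = 8
print(corrcoef(zbar, S2)[0,1])      # near zero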

Exercises

Exercise 5.5.1 Let X and Y be independent uniform random variables with


values in the interval [−1, 1]. Then (X, Y ) is a point in the square {|x| ≤
1, |y| ≤ 1}. Let
$$B = \begin{cases} 1 & \text{if } X^2 + Y^2 \le 1,\\ 0 & \text{otherwise}\end{cases}$$
be the Bernoulli variable corresponding to (X, Y) being in the unit disk. Then
(5.3.8) the mean p = E(B) equals Prob(X² + Y² ≤ 1). Show that p = π/4.
(This uses polar coordinates r dr dθ replacing dx dy.)

Exercise 5.5.2 Let X1 , X2 , . . . , Xn , and Y1 , Y2 , . . . , Yn be independent i.i.d.


samples of uniform random variables with values in the interval [−1, 1]. Then
(X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) are points in the square {|x| ≤ 1, |y| ≤ 1}
(Figure 5.28). Let p̂n be the proportion of points (5.3.9) lying in the unit disk
{x2 + y 2 ≤ 1}. Use the LLN to estimate p̂n for large n.

Exercise 5.5.3 Continuing the previous problem with n = 20, use the CLT
to estimate the probability that fewer than 50% of the points lie in the unit
disk. Is this a 1-sigma event, a 2-sigma event, or a 3-sigma event?

Exercise 5.5.4 Let X be a random vector with mean zero and variance Q.
Show v is a zero variance direction (§2.5) for Q iff X · v = 0.

Exercise 5.5.5 Let µ and Q be the mean and variance of a random d-vector
X, and let A be any N × d matrix. Then AX is a random vector with mean
Aµ and variance AQAt .

Exercise 5.5.6 Let Y1 , Y2 , . . . , Yr be independent normal random variables


with mean zero and variances λ1 , λ2 , . . . , λr . Then

$$\frac{Y_1^2}{\lambda_1} + \frac{Y_2^2}{\lambda_2} + \dots + \frac{Y_r^2}{\lambda_r}$$
is chi-squared with degree r.

Exercise 5.5.7 If X is a random vector with mean zero and variance Q, then

E((X · u)(X · v)) = u · Qv.

(Insert w = u + v in (5.5.3).)

Exercise 5.5.8 Assume the classes of the Iris dataset are normally dis-
tributed with their means and variances (Exercise 2.2.8), and assume the
classes are equally likely. Using Bayes theorem (5.1.21), write a Python
function that returns the probabilities (p1 , p2 , p3 ) that a given iris x =
(t1 , t2 , t3 , t4 ) lies in each of the three classes. Feed your function the 150
samples of the Iris dataset. How many samples are correctly classified?

5.6 Multinomial Probability

Let X be a discrete random variable, with values i = 1, 2, . . . , d, and proba-


bilities p = (p1 , p2 , . . . , pd ). Then each pi ≥ 0 and

p1 + p2 + · · · + pd = 1.

Such a vector p is a probability vector.


Since the values of X are 1, 2, . . . , d, the moment-generating function of
X is
M (t) = E(etX ) = et p1 + e2t p2 + · · · + edt pd .
A vector v = (v1 , v2 , . . . , vd ) is one-hot encoded at slot j if all components
of v are zero except the j-th component. For example, when d = 3, the vectors

(a, 0, 0), (0, a, 0), (0, 0, a)

are one-hot encoded.


A useful alternative to M (t) above is to use one-hot encoding and to define
a vector-valued random variable Y = (Y1, Y2, . . . , Yd) by
$$Y_i = \begin{cases} 1, & \text{if } X = i,\\ 0, & \text{otherwise,}\end{cases} \qquad i = 1, 2, \dots, d.$$

This is called one-hot encoding since all slots in Y are zero except for one
“hot” slot.
For example, suppose X has three values 1, 2, 3, say X is the class of a
random sample from the Iris dataset. Then Y is R³-valued, and we have
$$Y = \begin{cases} (1,0,0), & \text{if } X = 1,\\ (0,1,0), & \text{if } X = 2,\\ (0,0,1), & \text{if } X = 3.\end{cases}$$

With this understood, set t = (t1 , t2 , t3 ). Then the moment-generating func-


tion of Y is
M (t) = E et·Y = et1 p1 + et2 p2 + et3 p3 .


More generally, let X have d values. Then with one-hot encoding, the
moment-generating function is

M (t) = et1 p1 + et2 p2 + · · · + etd pd ,

and the cumulant-generating function is

Z(t) = log et1 p1 + et2 p2 + · · · + etd pd .




In particular, for a fair dice with d sides, the values are equally likely, so
the one-hot encoded cumulant-generating function is
$$Z(t) = \log\left(e^{t_1} + e^{t_2} + \dots + e^{t_d}\right) - \log d. \qquad (5.6.1)$$

In this section, we define

Z(y) = log (ey1 + ey2 + · · · + eyd ) , (5.6.2)

so we ignore the constant log d in (5.6.1). Then Z is a function of d variables


y = (y1 , y2 , . . . , yd ). If we insert y = 0, we obtain Z(0) = log d.
Let
1 = (1, 1, . . . , 1).
Then
$$p\cdot\mathbf{1} = \sum_{k=1}^d p_k = 1.$$

Because

Z(y + a1) = Z(y1 + a, y2 + a, . . . , yd + a) = Z(y) + a,

Z is not bounded below and does not have a minimum.



The softmax function is the vector-valued function q = σ(y) with components
$$q_k = \sigma_k(y) = \frac{e^{y_k}}{e^{y_1} + e^{y_2} + \dots + e^{y_d}} = \frac{e^{y_k}}{e^{Z(y)}}, \qquad k = 1, 2, \dots, d.$$
Thus
$$q = \sigma(y) = e^{-Z(y)}\left(e^{y_1}, e^{y_2}, \dots, e^{y_d}\right).$$

Fig. 5.32 The softmax function takes vectors to probability vectors.

By the chain rule, the gradient of the cumulant-generating function is the


softmax function,
∇Z(y) = σ(y). (5.6.3)
When d = 2, the vector softmax function reduces to the scalar logistic
function (5.1.22), since

$$q_1 = \frac{e^{y_1}}{e^{y_1}+e^{y_2}} = \frac{1}{1+e^{-(y_1-y_2)}} = \sigma(y_1-y_2),$$
$$q_2 = \frac{e^{y_2}}{e^{y_1}+e^{y_2}} = \frac{1}{1+e^{-(y_2-y_1)}} = \sigma(y_2-y_1).$$

Because of this, the softmax function is the multinomial analog of the logistic
function, and we use the same symbol σ to denote both functions.

from scipy.special import softmax

y = array([y1,y2,y3])
q = softmax(y)

In §4.5, we studied convex functions and the existence and uniqueness of


the global minimum. As we saw above, Z does not have a global minimum
over unrestricted y.

Since σ(y) = ∇Z(y), a critical point y ∗ of Z must satisfy σ(y ∗ ) = 0. For


Z, a critical point cannot be unique, because

σ(y1 , y2 , . . . , yd ) = σ(y1 + a, y2 + a, . . . , yd + a),

or
σ(y) = σ(y + a1).
We say a vector y is centered if y is orthogonal to 1,

y · 1 = y1 + y2 + · · · + yd = 0.

To guarantee uniqueness of a global minimum of Z, we have to restrict at-


tention to centered vectors y.
Suppose y is centered. Since the exponential function is convex,
$$\frac{e^{Z}}{d} = \frac{1}{d}\sum_{k=1}^d e^{y_k} \ge \exp\left(\frac{1}{d}\sum_{k=1}^d y_k\right) = e^0 = 1.$$

This establishes

Restricted Global Minimum of the Cumulant-generating


Function

If y is centered, then Z(y) ≥ Z(0) = log d.

The inverse of the softmax function is obtained by solving p = σ(y) for y,


obtaining
yk = Z + log pk , k = 1, 2, . . . , d. (5.6.4)
Define
log p = (log p1 , log p2 , . . . , log pd ).
Then the inverse of p = σ(y) is

y = Z1 + log p. (5.6.5)

The function
$$I(p) = p\cdot\log p = \sum_{k=1}^d p_k\log p_k \qquad (5.6.6)$$

is the absolute information. Since 0 ≤ p ≤ 1, log p ≤ 0, hence I(p) ≤ 0.


Since log is concave,
$$\sum_{k=1}^d p_k\log(e^{y_k}) \le \log\left(\sum_{k=1}^d p_k e^{y_k}\right).$$

This implies
$$p\cdot y = \sum_{k=1}^d p_ky_k = \sum_{k=1}^d p_k\log(e^{y_k}) \le \log\left(\sum_{k=1}^d p_ke^{y_k}\right) = \log\left(\sum_{k=1}^d e^{y_k+\log p_k}\right) = Z(y+\log p).$$

Replacing y by y − log p, this establishes

I(p) ≥ p · y − Z(y). (5.6.7)

By (5.6.5), (5.6.7) is an equality when p = σ(y). We conclude

Information and Cumulant-Generating Function are Convex


Duals
For all p,
$$I(p) = \max_y\,(p\cdot y - Z(y)).$$
For all y,
$$Z(y) = \max_p\,(p\cdot y - I(p)).$$

The second equality follows by switching Z and I in (5.6.7), and repeating


the same logic used to derive the first equality.
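The first duality can be checked numerically: by (5.6.5), the maximum is
attained at any y with σ(y) = p, for example y = log p. The probability vector
below is an illustrative choice.

from numpy import *
from scipy.special import logsumexp

p = array([.2, .3, .5])
Z = lambda y: logsumexp(y)     # Z(y) = log(e^y1 + ... + e^yd)
I = p @ log(p)                 # absolute information I(p)

y = log(p)                     # a maximizer, since sigma(y) = p
print(I, p @ y - Z(y))         # the two values agree

yy = y + array([.3, -.1, .2])  # any other (non-shifted) y gives less
print(p @ yy - Z(yy))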

Inserting y = 0 in (5.6.7), we have

Absolute Information is Bounded

For all p = (p1 , p2 , . . . , pd ),

0 ≥ I(p) ≥ − log(d). (5.6.8)



The absolute entropy, the analog of (4.2.1), is then
$$H(p) = -I(p) = -\sum_{k=1}^d p_k\log(p_k). \qquad (5.6.9)$$

Since
$$D^2I(p) = \mathrm{diag}\left(\frac{1}{p_1},\frac{1}{p_2},\dots,\frac{1}{p_d}\right),$$
we see I(p) is strictly convex, and H(p) is strictly concave.
In Python, the entropy is

from scipy.stats import entropy

p = array([p1,p2,p3])
entropy(p)

Here is the multinomial analog of the relation between entropy and


coin-tossing (5.1.11). Suppose a dice has d faces, and suppose the prob-
ability of rolling the i-th face in a single roll is pi , i = 1, 2, . . . , d. Then
p = (p1 , p2 , . . . , pd ) is a probability vector. We call p the dice’s bias.

Entropy and Dice-Rolling

Roll a d-faced dice n times, and let #n (p) be the number of outcomes
where the face-proportions are p = (p1 , p2 , . . . , pd ). Then

#n (p) is approximately equal to enH(p) for n large.

In more detail, using (A.1.6), here is the asymptotic equality,
$$\#_n(p) \approx \frac{1}{(2\pi n)^{(d-1)/2}}\cdot\frac{1}{\sqrt{p_1p_2\cdots p_d}}\cdot e^{nH(p)}, \qquad \text{for } n \text{ large}.$$

Asymptotic equality means the ratio of the two sides approaches 1 as n → ∞


(A.6).
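Here is a small numerical comparison of the exact count with the asymptotic
formula; the choice d = 3 faces, n = 30 rolls, and uniform proportions p is
just for illustration.

from numpy import *
from scipy.special import factorial
from scipy.stats import entropy

p = array([1/3, 1/3, 1/3])
n = 30
counts = (n*p).astype(int)
d = len(p)

exact = factorial(n) / prod(factorial(counts))     # multinomial coefficient
H = entropy(p)                                     # absolute entropy H(p)
crude = exp(n*H)
refined = crude / ((2*pi*n)**((d-1)/2) * sqrt(prod(p)))

print(exact, crude, refined)    # refined is within a few percent of exact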

Now
$$\frac{\partial^2 Z}{\partial y_j\,\partial y_k} = \frac{\partial\sigma_j}{\partial y_k} = \begin{cases} \sigma_j - \sigma_j\sigma_k, & \text{if } j = k,\\ -\sigma_j\sigma_k, & \text{if } j \ne k.\end{cases}$$

Hence we have
$$D^2Z(y) = \nabla\sigma(y) = \mathrm{diag}(q) - q\otimes q, \qquad q = \sigma(y). \qquad (5.6.10)$$

Let $\bar v = v\cdot q = \sum_k q_kv_k$. Since $Q = D^2Z(y)$ satisfies
$$v\cdot Qv = \sum_{k=1}^d q_kv_k^2 - (v\cdot q)^2 = \sum_{k=1}^d q_k(v_k-\bar v)^2,$$
which is nonnegative, Q is a variance matrix, and Z is convex.
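The formula (5.6.10) can be checked against a finite-difference Jacobian of the
softmax function; the point y below is an illustrative choice.

from numpy import *
from scipy.special import softmax

y = array([.5, -1., 2.])
q = softmax(y)
H = diag(q) - outer(q, q)            # the formula (5.6.10)

# finite-difference Jacobian of sigma = grad Z
eps = 1e-6
Hfd = column_stack([(softmax(y + eps*e) - softmax(y - eps*e))/(2*eps)
                    for e in eye(3)])
print(abs(H - Hfd).max())            # close to zero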


In fact Z is strictly convex along centered directions v, the directions
satisfying v · 1 = 0. If v · Qv = 0, then, since qk > 0 for all k, v = v̄1. By
Exercise 5.6.1, if v is centered, this forces v = 0. This shows Z is strictly
convex along centered directions.
Moreover, Z is proper (4.5.12) on centered vectors. To see this, suppose
y · 1 = 0 and Z(y) ≤ c. Since yj ≤ Z(y), this implies

yj ≤ c, j = 1, 2, . . . , d.

Given 1 ≤ j ≤ d, add the inequalities $y_k \le c$ over all indices k ≠ j. Since
y · 1 = 0, $-y_j = \sum_{k\ne j}y_k$. Hence
$$-y_j = \sum_{k\ne j} y_k \le (d-1)c, \qquad j = 1, 2, \dots, d.$$

Combining the last two inequalities,

|yj | = max(yj , −yj ) ≤ (d − 1)c, j = 1, 2, . . . , d,

which implies
$$|y|^2 = \sum_{k=1}^d y_k^2 \le d(d-1)^2c^2.$$

Setting C = d(d − 1)c, we conclude

Z(y) ≤ c and y·1=0 =⇒ |y| ≤ C. (5.6.11)

By (4.5.12), we have shown

The Cumulant-generating Function is Proper and Strictly


Convex

On centered vectors, Z(y) is proper and strictly convex.



Let p = (p1 , p2 , . . . , pd ) and q = (q1 , q2 , . . . , qd ) be probability vectors. The


relative information is
$$I(p,q) = \sum_{k=1}^d p_k\log(p_k/q_k). \qquad (5.6.12)$$

Let
log q = (log q1 , log q2 , . . . , log qd ).
Then
$$p\cdot\log q = \sum_{k=1}^d p_k\log q_k,$$

and
I(p, q) = I(p) − p · log q. (5.6.13)
Similarly, the relative entropy is

H(p, q) = −I(p, q). (5.6.14)

In Python, as of this writing, the code

from scipy.stats import entropy

p = array([p1,p2,p3])
q = array([q1,q2,q3])
entropy(p,q)

returns the relative information, not the relative entropy. Always check your
Python code’s conventions and assumptions. See below for more on this ter-
minology confusion.

Here is the multinomial analog of the relation between relative entropy


and coin-tossing (5.1.12). Suppose a dice has d faces, and suppose the prob-
ability of rolling the i-th face in a single roll is qi , i = 1, 2, . . . , d. Then
q = (q1 , q2 , . . . , qd ) is a probability vector, and we expect the long-term pro-
portion of faces in n rolls to equal roughly q.
Let p = (p1 , p2 , . . . , pd ) be another probability vector. Roll a d-faced dice
n times, and let Pn (p, q) be the probability that the face-proportions are
p = (p1 , p2 , . . . , pd ), given that the dice’s bias is q.
If p = q, one’s first guess is Pn (p, p) ≈ 1 for n large. However, this is
not correct, because Pn (p, p) is specifying a specific proportion p, predicting
specific behavior from the coin tosses. Because this is too specific, it turns
out Pn (p, p) ≈ 0, see Exercise 5.1.9.

On the other hand, if p ̸= q, we definitely expect the proportion of faces


to not equal p. In other words, we expect Pn (p, q) to be small for large n. In
fact, when p ̸= q, it turns out Pn (p, q) → 0 exponentially, as n → ∞. Using
(A.1.1), a straightforward calculation results in

Relative Entropy and Dice-Rolling

Assume a d-faced dice’s bias is q. Roll the dice n times, and let Pn (p, q)
be the probability of obtaining outcomes where the proportion of faces
is p. Then

Pn (p, q) is approximately equal to enH(p,q) for n large.

More exactly, using (A.1.6), here is the asymptotic equality,
$$P_n(p,q) \approx \frac{1}{(2\pi n)^{(d-1)/2}}\cdot\frac{1}{\sqrt{p_1p_2\cdots p_d}}\cdot e^{nH(p,q)}, \qquad \text{for } n \text{ large}.$$

The relative cumulant-generating function is
$$Z(y,q) = \log\left(\sum_{k=1}^d e^{y_k}q_k\right).$$

As we saw above, this is the one-hot encoded cumulant-generating function


of a d-sided dice with side-probabilities q = (q1 , q2 , . . . , qd ).
If we insert qk = exp(log(qk )) in the definition of Z(y, q), one obtains

Z(y, q) = Z(y + log q).

From this, using the change of variable y′ = y + log q,
$$\begin{aligned}
\max_y\,(p\cdot y - Z(y,q)) &= \max_y\,(p\cdot y - Z(y+\log q))\\
&= \max_{y'}\,(p\cdot(y'-\log q) - Z(y'))\\
&= \max_y\,(p\cdot y - Z(y)) - p\cdot\log q\\
&= I(p) - p\cdot\log q\\
&= I(p,q).
\end{aligned}$$

As before, this shows



Relative Information and Relative Cumulant-generating


Function are Convex Duals
For all p and q,
$$I(p,q) = \max_y\,(p\cdot y - Z(y,q)).$$
For all y and q,
$$Z(y,q) = \max_p\,(p\cdot y - I(p,q)).$$

In logistic regression (§7.6), the output is y, the computed target is q =


σ(y), the desired target is p, and the information error function is I(p, q). To
compute the information error, by (5.6.5),

q = σ(y) =⇒ log q = y − Z(y)1.

By (5.6.13), this yields

Information Error Identity

For all p and all y, if q = σ(y), then

I(p, q) = I(p) − p · y + Z(y). (5.6.15)

This identity is the direct analog of (4.5.23). The identity (4.5.23) is used
in linear regression. Similarly, (5.6.15) is used in logistic regression.
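The identity (5.6.15) is easy to verify numerically, with entropy(p,q)
computing I(p, q) as noted above; the vectors p and y below are illustrative.

from numpy import *
from scipy.special import softmax, logsumexp
from scipy.stats import entropy

p = array([.2, .3, .5])
y = array([1., -2., .5])
q = softmax(y)

lhs = entropy(p, q)                       # I(p, q)
rhs = p @ log(p) - p @ y + logsumexp(y)   # I(p) - p.y + Z(y)
print(lhs, rhs)                           # the two agree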

Let max y = maxj yj . Then, by definition of Z(y),

yj ≤ Z(y) ≤ max y + log d, j = 1, 2, . . . , d.

The cross-information is
$$I_{\mathrm{cross}}(p,q) = -\sum_{k=1}^d p_k\log q_k,$$

and the cross-entropy is
$$H_{\mathrm{cross}}(p,q) = -I_{\mathrm{cross}}(p,q) = \sum_{k=1}^d p_k\log q_k.$$

In the literature, the terminology is backward: the cross-information is usually


erroneously called “cross-entropy,” see the discussion at the end of the section.
Cross-information and relative information are related by

I(p, q) = I(p) + Icross (p, q).

A probability vector p = (p1 , p2 , . . . , pd ) is one-hot encoded at slot j if


pj = 1. When p is one-hot encoded at slot j, then pk = 0 for k ̸= j.
When p is one-hot encoded, then I(p) = 0, so

I(p, q) = Icross (p, q), (5.6.16)

and, from (5.6.15),

Icross (p, σ(y)) = −p · y + Z(y).

From (5.6.3) and (5.6.15),

∇y I(p, σ(y)) = q − p, q = σ(y). (5.6.17)

Since I(p, σ(y)) and Icross (p, σ(y)) differ by the constant I(p), we also have

∇y Icross (p, σ(y)) = q − p, q = σ(y),

so it doesn’t matter whether I(p, q) or Icross (p, q) is used in gradient descent


(§7.3). Nevertheless, we stick with I(p, q), because I(p, q) arises naturally as
the convex dual of Z(y, q).
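The gradient formula (5.6.17) can also be checked by finite differences; p and
y below are illustrative choices.

from numpy import *
from scipy.special import softmax
from scipy.stats import entropy

p = array([.2, .3, .5])
y = array([1., -2., .5])

err = lambda y: entropy(p, softmax(y))     # information error I(p, sigma(y))

eps = 1e-6
grad = array([(err(y + eps*e) - err(y - eps*e))/(2*eps) for e in eye(3)])
print(grad)
print(softmax(y) - p)                      # matches q - p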

Let q = (q1 , q2 , . . . , qd ) be a probability vector. The relative softmax func-


tion is
σ(y, q) = e−Z(y,q) (ey1 q1 , ey2 q2 , . . . , eyd qd ) .
Then the relative version of (5.6.15) is

I(p, σ(y, q)) = I(p, q) − p · y + Z(y, q).

This is easily checked using the definitions of I(p, q) and σ(y, q).

In the literature, in the industry, in Wikipedia, and in Python, the termi-


nology4 is confused: The relative information I(p, q) is almost always called
“relative entropy.”
Since the entropy H is the negative of the information I, this is looking at
things upside-down. In other settings, I(p, q) is called the “Kullback–Leibler
divergence,” which is not exactly intuitive terminology.
Also, in machine learning, Icross (p, q) is called the “cross-entropy,” not
cross-information, continuing the confusion.
Rubbing salt into the wound, in Python, entropy(p) is H(p), which is
correct, but entropy(p,q) is I(p, q), which is incorrect, or at the very least,
inconsistent, even within Python.
How does one keep things straight? By remembering that it’s convex func-
tions that we like to minimize, not concave functions. In more vivid terms,
would you rather ski down a convex slope, or a concave slope?
In machine learning, loss functions are built to be minimized, and infor-
mation, in any form, is convex, while entropy, in any form, is concave. Table
5.33 summarizes the situation.

H = −I       Information               Entropy
Absolute     I(p)                      H(p)
Cross        Icross(p, q)              Hcross(p, q)
Relative     I(p, q)                   H(p, q)
Curvature    Convex                    Concave
Error        I(p, q) with q = σ(y)

Table 5.33 The third row is the sum of the first and second rows, and the H column is
the negative of the I column.

Exercises

Exercise 5.6.1 Let v be a centered vector, and suppose v is a multiple of 1.


Show v = 0.

Exercise 5.6.2 Let p be a probability vector, and v be a vector. Then p + tv


is a probability vector for all scalar t iff v is centered.

Exercise 5.6.3 Let p = (p1 , p2 , . . . , pd ) be a probability vector, and let a =


(a1 , a2 , . . . , ad ) be a vector satisfying

ea1 p1 + ea2 p2 + · · · + ead pd = 1.


4 The quantities used here are identical to those in the literature, it’s only the naming that
is confused.
5.6. MULTINOMIAL PROBABILITY 355

Show a · p ≤ 0. (Use convexity of ex .)

Exercise 5.6.4 Continuing Exercise 5.6.3, assume furthermore pi > 0, i =


1, 2, . . . , d. Then a · p = 0 implies a = 0.
Chapter 6
Statistics

6.1 Estimation

In statistics, like any science, we start with a guess or an assumption or hy-


pothesis, then we take a measurement, then we accept or modify our guess/as-
sumption based on the result of the measurement. This is common sense, and
applies to everything in life, not just statistics.
For example, suppose you see a sign on campus saying
There is a lecture in room B120.
How can you tell if this is true/correct or not? One approach is to go to room
B120 and look. Either there is a lecture or there isn’t. Problem solved.
But then someone might object, saying, wait, what if there is a lecture
in room B120 tomorrow? To address this, you go every day to room B120
and check, for 100 days. You find out that in 85 of the 100 days, there is a
lecture, and in 15 days, there is none. Based on this, you can say you are
85% confident there is a lecture there. Of course, you can never be sure, it
depends on which day you checked, you can only provide a confidence level.
Nevertheless, this kind of thinking allows us to quantify the probability that
our hypothesis is correct.
In general, the measurement is significant if it is unlikely. When we obtain
a significant measurement, then we are likely to reject our guess/assumption.
So
significance = 1 − confidence.
In practice, our guess/assumption allows us to calculate a p-value, which is
the probability that the measurement is not consistent with our assumption.
In the above scenario, the p-value is .15, determined by repeatedly sampling
the room.
This is what statistics is about, summarized in Figure 6.1. The details may
be more or less complicated depending on the problem situation or setup, but
this is the central idea.


Fig. 6.1 Statistics flowchart: p-value p and significance α.

Here is a geometric example. Grab two vectors at random in three di-


mensions and measure the angle between them. Is there any pattern to the
answer? Doing so twenty times, we see the answer is no, the resulting angle
can be any angle. Now grab two vectors at random from 784 dimensions.
Then, as we shall see, there is a pattern.
The null hypothesis and the alternate hypothesis are
• H0 : The angle between two randomly selected vectors in 784 dimensions
is approximately 90◦
• Ha : The angle between two randomly selected vectors in 784 dimensions
is approximately 60◦ .
In §2.2, there is code (2.2) returning the angle Angle(u,v) between two
vectors. To test these hypotheses, we run the code

from numpy import *


from numpy.random import randn

# randn(d) is standard normal sample of size d

d = 784

for _ in range(20):
    u = randn(d)
    v = randn(d)
    print(angle(u,v))

to randomly select u, v twenty times. Here randn(d) returns a vector in R^d


whose components are selected independently and randomly according to a
standard normal distribution.
This code returns (since the selection is random, your numbers will differ)

86.27806537791886
87.91436653824776
93.00098725550777
92.73766421951748
90.005139015804
87.99643434444482
89.77813370637857
96.09801014394806
90.07032573539982
89.37679070400239
91.3405728939376
86.49851399221568
87.12755619082597
88.87980905998855
89.80377324818076
91.3006921339982
91.43977096117017
88.52516224405458
86.89606919838387
90.49100744167357

and we see strong evidence supporting H0 .


On the other hand, run the code

from numpy import *


from numpy.random import binomial

d = 784

n = 1 # one coin toss


#n = 3 # three coin tosses

# binomial(n,p,d) is n coin-tosses with bias p and sample of size d

for _ in range(20):
    u = binomial(n,.5,d)
    v = binomial(n,.5,d)
    print(angle(u,v))

to randomly select u, v twenty times. Here binomial(n,.5,d) returns a


vector in Rd whose components are selected independently and randomly
according to the number of heads in n tosses of a fair coin. This code returns

59.43464627897324
59.14345748418916
60.31453922165891
60.38024365702492
59.24709660805488
59.27165957992343
61.21424657806321
60.55756381536082
61.59468919876665
61.33296028237481
60.03925473033243
60.25732069941224
61.77018692842784
60.672901794058326
59.628519516164666
59.41272458020638
58.43172340007064
59.863796136907744
59.45156367988921
59.95835532791699

and we see strong evidence supporting Ha.

The difference between the two scenarios is the distribution. In the first
scenario, we have randn(d): the components are distributed according to
a standard normal. In the second scenario, we have binomial(1,.5,d) or
binomial(3,.5,d): the components are distributed according to one or three
fair coin tosses. To see how the distribution affects things, we bring in the
law of large numbers, which is discussed in §5.3.
Let X1 , X2 , . . . , Xd be a simple random sample from some population,
and let µ be the population mean. Recall this means X1 , X2 , . . . , Xd are
i.i.d. random variables, with µ = E(X). The sample mean is

$$\bar X = \frac{X_1 + X_2 + \dots + X_d}{d}.$$

Law of Large Numbers

For large sample size d, the sample mean X̄ approximately equals the
population mean µ, X̄ ≈ µ.

We use the law of large numbers to explain the closeness of the vector
angles to specific values.
Assume u = (x1 , x2 , . . . , xd ), and v = (y1 , y2 , . . . , yd ) where all components
are selected independently of each other, and each is selected according to
the same distribution.

Let U = (X1 , X2 , . . . , Xd ), V = (Y1 , Y2 , . . . , Yd ), be the corresponding


random variables. Then X1 , X2 , . . . , Xd and Y1 , Y2 , . . . , Yd are independent
and identically distributed (i.i.d.), with population mean E(X1 ) = E(Y1 ).
From this, X1 Y1 , X2 Y2 , . . . , Xd Yd are i.i.d. random variables with popu-
lation mean E(X1 Y1 ). By the law of large numbers,1

$$\frac{X_1Y_1 + X_2Y_2 + \dots + X_dY_d}{d} \approx E(X_1Y_1),$$
so
$$U\cdot V = X_1Y_1 + X_2Y_2 + \dots + X_dY_d \approx d\,E(X_1Y_1).$$
Similarly, $U\cdot U \approx d\,E(X_1^2)$ and $V\cdot V \approx d\,E(Y_1^2)$. Hence (check that the d's cancel)
$$\cos(U,V) = \frac{U\cdot V}{\sqrt{(U\cdot U)(V\cdot V)}} \approx \frac{E(X_1Y_1)}{\sqrt{E(X_1^2)E(Y_1^2)}}.$$
Since X1 and Y1 are independent with mean µ and variance σ 2 ,

E(X1 Y1 ) = E(X1 )E(Y1 ) = µ2 , E(X12 ) = µ2 + σ 2 , E(Y12 ) = µ2 + σ 2 .

If θ is the angle between U and V, we conclude
$$\cos(\theta) = \frac{U\cdot V}{\sqrt{(U\cdot U)(V\cdot V)}} \approx \frac{\mu^2}{\mu^2+\sigma^2}.$$

When the distribution is standard normal, µ = 0, so the angle is approxi-


mately 90◦ . When the distribution is Bernoulli with parameter p,

$$\frac{\mu^2}{\mu^2+\sigma^2} = \frac{p^2}{p^2+p(1-p)} = p.$$

For p = .5, this results in an angle of 60◦ .


The general result is

Random Vectors in High Dimensions

Let U and V be two vectors selected randomly. Assume the compo-


nents of U and V are independent and identically distributed with
mean µ and variance σ 2 . Let θ be the angle between them. When the
vector dimension is high,

$$\cos(\theta)\ \text{is approximately}\ \frac{\mu^2}{\mu^2+\sigma^2}.$$

1 ≈ means the ratio of the two sides approaches 1 for large n, see §A.6.
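As a quick check of the box formula against the experiments above, here are
the predicted angles for the three component distributions used in the code
(a minimal sketch; the third prediction is computed from the formula, not
reported in the runs above).

from numpy import *

# mu and sigma^2 for each component distribution
scenarios = {"standard normal": (0., 1.),
             "one fair coin toss": (.5, .25),
             "three fair coin tosses": (1.5, .75)}

for name, (mu, var) in scenarios.items():
    c = mu**2 / (mu**2 + var)
    print(name, degrees(arccos(c)))   # 90, 60, about 41 degrees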

6.2 Z-test

Suppose we want to estimate the proportion of American college students


who have a smart phone. Instead of asking every student, we take a sample
and make an estimate based on the sample.
The population proportion p is the actual proportion of students that in
fact have a smart phone. Then 0 < p < 1. Pick a student, and let
$$X = \begin{cases} 1, & \text{if the student has a smartphone,}\\ 0, & \text{if not.}\end{cases}$$

Then X is a Bernoulli random variable with mean p.


For example, suppose the population proportion of students that have a
smartphone is p = .7, and we sample n = 25 students, obtaining a sample
proportion X̄. If we repeat the sampling N = 1000 times, we will obtain 1000
values for X̄. Figure 6.2 displays the resulting histogram of X̄ values. Here
is the code

from numpy import *


from matplotlib.pyplot import *
from numpy.random import binomial

p = .7
n = 25
N = 1000
v = binomial(n,p,N)/n

hist(v,edgecolor ='Black')
show()

Fig. 6.2 Histogram of sampling n = 25 students, repeated N = 1000 times.



Let X1 , X2 , . . . , Xn be a simple random sample of size n. This means


n students were selected randomly and independently and whether or not
they had smartphones was recorded in the variables X1 , X2 , . . . , Xn . Each
of these variables equals one or zero with probability p or 1 − p.
The sample mean (§5.3) is
$$\bar X = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{k=1}^n X_k.$$

Because each Xk is 0 or 1, this is the sample proportion of the students in


the sample that have smartphones. Like p, X̄ is also between zero and one.
Because the samples vary, it is impossible to make absolute statements
about the population. Instead, as we see below, the best we can do is make
statements that come with a confidence level. Confidence levels are expressed
as percentages, such as a 95% confidence level, or as a proportion, such as a
.95 confidence level.
Often levels are expressed as significance levels. The significance level is
the corresponding tail probability, so

significance level = 1 − confidence level.

A confidence level of zero indicates that we have no faith at all that se-
lecting another sample will give similar results, while a confidence level of 1
indicates that we have no doubt at all that selecting another sample will give
similar results.
When we say p is within X̄ ± ϵ, or

|p − X̄| < ϵ,

we call ϵ the margin of error. The interval

(L, U ) = (X̄ − ϵ, X̄ + ϵ)

is a confidence interval.
With the above setup, we have the population proportion p, and the four
sample characteristics
• sample size n
• sample proportion X̄,
• margin of error ϵ,
• confidence level α.
Suppose we do not know p, but we know n and X̄. We say the margin of
error is ϵ, at confidence level α, if

P rob(|p − X̄| < ϵ) = α.

Here are some natural questions:



1. Given a sample of size n = 20 and sample proportion X̄ = .7, what can


we say about the margin of error ϵ with confidence α = .95?
2. Given a sample proportion X̄ = .7, what sample size n should we take
to obtain a margin of error ϵ = .15 with confidence α = .95?
3. Given a sample proportion X̄ = .7, what sample size n should we take
to obtain a margin of error ϵ = .15 with confidence α = .99?
4. Given a sample of size n = 20 and sample proportion X̄ = .7, with what
confidence level α is the margin of error ϵ = .1?
The answers are at the end of the section.

Suppose each Xk in the sample X1 , X2 , . . . , Xn has mean µ and standard


deviation σ. From §5.3, we know the mean and standard deviation of X̄ are
µ and σ/√n. In particular, when X1, X2, . . . , Xn is a Bernoulli sample, the
mean and variance of the sample proportion X̄ are p and p(1 − p)/n.
Therefore, the mean and variance of the standardized random variable
$$Z = \sqrt{n}\,\frac{\bar X - p}{\sqrt{p(1-p)}}$$

are zero and one.


Returning to our smartphone question, how close is the sample mean X̄
to the population mean E(X) = p? Remember, both X̄ and p are between 0
and 1. More specifically, given a margin of error ϵ, we want to compute the
confidence level
$$Prob\left(|\bar X - p| < \epsilon\right).$$
This corresponds to the confidence interval
$$(L, U) = (\bar X - \epsilon,\ \bar X + \epsilon).$$

The key result is the central limit theorem (§5.3): Z is approximately


normal. How large should the sample size n be in order to apply the central
limit theorem? When we have the success-failure condition

np ≥ 10, n(1 − p) ≥ 10.

For example, p = .7 and n = 50 satisfies the success-failure condition.


Let α be the two-tail significance level, say α = .05. Assuming Z is exactly
normal, let z ∗ be the z-score corresponding to significance α,

P rob(|Z| > z ∗ ) = α.

Let σ/√n be the standard error. By the central limit theorem,
$$\alpha \approx Prob\left(\frac{|\bar X - p|}{\sqrt{p(1-p)}} > \frac{z^*}{\sqrt{n}}\right).$$

To compute the confidence interval (L, U), we solve
$$\frac{|\bar X - p|}{\sqrt{p(1-p)}} = \frac{z^*}{\sqrt{n}} \qquad (6.2.1)$$

for p. But (6.2.1) may be rewritten as a quadratic equation in p, leading to


the approximate solution
$$L, U = \bar X \pm \epsilon = \bar X \pm \frac{z^*}{\sqrt{n}}\cdot\sqrt{\bar X(1-\bar X)}.$$

From here we obtain the margin of error
$$\epsilon = \frac{z^*}{\sqrt{n}}\cdot\sqrt{\bar X(1-\bar X)}.$$

More generally, let z ∗ be the z-score corresponding to significance level α,


so

zstar = Z.ppf(alpha) # lower-tail, zstar < 0


zstar = Z.ppf(1-alpha) # upper-tail, zstar > 0
zstar = Z.ppf(1-alpha/2) # two-tail, zstar > 0

Given a population with known standard deviation σ, sample size n, and


sample mean X̄, the margin of error is
$$\epsilon = z^*\cdot\frac{\sigma}{\sqrt{n}},$$

and the intervals
$$(L, U) = \begin{cases} (\bar X - \epsilon,\ \bar X), & \text{lower-tail},\\ (\bar X,\ \bar X + \epsilon), & \text{upper-tail},\\ (\bar X - \epsilon,\ \bar X + \epsilon), & \text{two-tail},\end{cases}$$

are the confidence intervals at significance level α. When not specified, a


confidence interval is usually taken to be two-tail.
In the Python code below, instead of working with the standardized statis-
tic Z, we work directly with the X̄. When σ is not known, we have to replace
the normal distribution by the t distribution (§6.3).

##########################
# Confidence Interval - Z
##########################

from numpy import *


from scipy.stats import norm as Z

# significance level alpha

def confidence_interval(xbar,sdev,n,alpha,type):
    Xbar = Z(xbar,sdev/sqrt(n))
    if type == "two-tail":
        U = Xbar.ppf(1-alpha/2)
        L = Xbar.ppf(alpha/2)
    elif type == "upper-tail":
        U = Xbar.ppf(1-alpha)
        L = xbar
    elif type == "lower-tail":
        L = Xbar.ppf(alpha)
        U = xbar
    else: print("what's the test type?"); return
    return L, U

# when X is not Bernoulli 0,1,


# Z-test assumes sdev is known!!!
# when X is Bernoulli, sdev = sqrt(xbar*(1-xbar))

type = "two-tail"
alpha = .02
sdev = 228
n = 35
xbar = 95

L, U = confidence_interval(xbar,sdev,n,alpha,type)

print("type: ", type)


print("significance, sdev, n, xbar: ", alpha,sdev,n,xbar)
print("lower, upper: ",L, U)

Now we can answer the questions posed at the start of the section. Here
are the answers.
1. When n = 20, α = .95, and X̄ = .7, we have [L, U ] = [.5, .9], so ϵ = .2.
2. When X̄ = .7, α = .95, and ϵ = .15, we run confidence_interval for
15 ≤ n ≤ 40, and select the least n for which ϵ < .15. We obtain n = 36.
3. When X̄ = .7, α = .99, and ϵ = .15, we run confidence_interval for
1 ≤ n ≤ 100, and select the least n for which ϵ < .15. We obtain n = 62.
4. When X̄ = .7, n = 20, and ϵ = .1, we have

$$z^* = \frac{\epsilon\sqrt{n}}{\sigma} = .976.$$

Since P rob(Z > z ∗ ) = .165, the confidence level is 1 − 2 ∗ .165 = .68 or


68%.

The speed limit on a highway is µ0 = 120. Ten automatic speed cam-


eras are installed along a stretch of the highway to measure passing vehicles
speeds. Because the cameras aren’t perfect, the average speed X̄ measured
by the cameras may not equal a vehicle’s true speed µ. As a consequence,
some drivers who were driving at the speed limit may be fined. These drivers
are false positives.
Suppose the distribution of a vehicle’s measured speed is normal with
standard deviation 2. What measured speed cutoff µ∗ should the authorities
use to keep false positives below 1%? Here we are asked for the upper-tail
confidence interval (L, U ) = (µ0 , µ∗ ) at significance level .01. A driver will be
fined if their average measured speed X̄ is higher than µ∗ .
Using the above code, the cutoff µ∗ equals 121.47.

One use of confidence intervals is hypothesis testing. Here we have two


hypotheses, a null hypothesis and an alternate hypothesis. In the above set-
ting where we are estimating a population parameter µ, the null hypothesis
is that µ equals a certain value µ0 , and the alternate hypothesis is that µ
is not equal to µ0 . Hypothesis testing is of three types, depending on the
alternate hypothesis: µ ̸= µ0 , µ > µ) , µ < µ0 . These are two-tail, lower-tail,
and upper-tail hypotheses.
• H0 : µ = µ0
• Ha : µ ̸= µ0 or µ < µ0 or µ > µ0 .
For example, going back to our smartphone p setup, if we sample n = 20
students, obtaining a mean x̄ = .7, then σ = x̄(1 − x̄) = .46, and the two-
tail 5% confidence interval is then [.5, .9]. If µ0 lies outside the confidence
interval, we reject H0 and accept Ha , at the 5% level. Otherwise, if µ0 lies
within the interval, we do not reject H0 .
Suppose 35 people are randomly selected and the accuracy of their wrist-
watches is checked, with positive errors representing watches that are ahead
of the correct time and negative errors representing watches that are behind
the correct time. The sample has a mean of 95 seconds and a population
standard deviation of 228 seconds. At the 2% significance, can we claim the
population mean is µ0 = 0?
Here
• H0 : µ = 0

• Ha : µ ̸= 0.
Here the significance level is α = .02 and µ0 = 0. To decide whether to
reject H0 or not, compute the standardized test statistic
$$z = \sqrt{n}\cdot\frac{\bar x - \mu_0}{\sigma} = 2.465.$$
Since z is a sample from an approximately normal distribution Z, the p-value

p = P rob(|Z| > z) = .0137.

On the other hand, the z-score corresponding to the requested significance


level is z ∗ = 2.326, since

P rob(|Z| > 2.326) = .02.

Since p is less than α, or equivalently, since |z| > z ∗ , we reject H0 . In other


words, when the p-value is smaller than the significance level, it is more
significant, and we reject H0 .
Equivalently, the 98% confidence interval is

(x̄ − ϵ, x̄ + ϵ) = (5.3, 184.6) .

Since µ0 = 0 is outside this interval, we reject H0 .
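The numbers in this example are quickly verified in Python (a minimal sketch
of the computation, using the same values as above):

from numpy import *
from scipy.stats import norm as Z

n, xbar, sdev, mu0, alpha = 35, 95, 228, 0, .02

z = sqrt(n) * (xbar - mu0) / sdev
print(z)                          # 2.465
print(2 * (1 - Z.cdf(abs(z))))    # p-value .0137 < alpha, so reject H0

zstar = Z.ppf(1 - alpha/2)        # 2.326
eps = zstar * sdev / sqrt(n)
print(xbar - eps, xbar + eps)     # (5.3, 184.6), excludes mu0 = 0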

Hypothesis Testing

There are three types of alternative hypotheses Ha :

µ < µ0 , µ > µ0 , µ ̸= µ0 .

These are lower-tail, upper-tail, and two-tail tests. In every case, we


have a sample of size n, a statistic x̄, a standard deviation σ, a stan-
dardized statistic
$$z = \sqrt{n}\cdot\frac{\bar x - \mu_0}{\sigma},$$
a significance level α, the p-value

p = P rob(Z < z), p = P rob(Z > z), p = P rob(|Z| > z),

and the critical cutoff z ∗ ,

P rob(Z < z ∗ ) = α, P rob(Z > z ∗ ) = α, P rob(|Z| > z ∗ ) = α.



Then we reject H0 whenever z is more significant than z ∗ , which is


the same as saying whenever the p-value p is less than the significance
level α.

In the Python code below, instead of working with the standardized statistic
Z, we work directly with X̄, which is normally distributed with mean µ0
and standard deviation σ/√n.

###################
# Hypothesis Z-test
###################

from numpy import *


from scipy.stats import norm as Z

# significance level alpha

def ztest(mu0, sdev, n, xbar,type):
    Xbar = Z(mu0,sdev/sqrt(n))
    print("mu0, sdev, n, xbar: ", mu0,sdev,n,xbar)
    if type == "lower-tail": p = Xbar.cdf(xbar)
    elif type == "upper-tail": p = 1 - Xbar.cdf(xbar)
    elif type == "two-tail": p = 2 * (1 - Xbar.cdf(abs(xbar)))
    print("type: ",type)
    print("pvalue: ",p)
    if p < alpha: print("reject H0")
    else: print("do not reject H0")

xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
sdev = 2
alpha = .01

ztest(mu0, sdev, n, xbar,type)

Going back to the driving speed example, the hypothesis test is


• H0 : µ = µ0
• Ha : µ > µ0
If a driver’s measured average speed is X̄ = 122, the above code rejects H0 .
This is consistent with the confidence interval cutoff we found above.

There are two types of possible errors we can make. A Type I error is when
H0 is true, but we reject it, and a Type II error is when H0 is not true but
we fail to reject it.

                     H0 is true           H0 is false
do not reject H0     1 − α                Type II error: β
reject H0            Type I error: α      Power: 1 − β

Table 6.3 The error matrix.

We reject H0 when the p-value of Z is less than the significance level α,


which happens when z < z ∗ or z > z ∗ or |z| > z ∗ . In all cases, the chance of
this happening is by definition α. In other words,

P rob(Type I error) = P rob(p-value < α | H0 ) = α.

Thus the probability of a type I error is the significance level α.


We make a Type II error when we do not reject H0 , but H0 is false. To
compute the probability of a Type II error, suppose the true value of µ is µ1 .

Then we do not reject H0 if |z| < z*, which is when µ0 lies in the confidence
interval x̄ ± z*σ/√n, or when x̄ lies in the interval
$$\mu_0 - \frac{z^*\sigma}{\sqrt{n}} < \bar x < \mu_0 + \frac{z^*\sigma}{\sqrt{n}}.$$

But when µ = µ1 , X̄ is N (µ1 , σ), so the probability of this event can be


computed.
Standardize X̄ by subtracting µ1 and dividing by the standard error. Then
we have a Type II error when
$$\sqrt{n}\,\frac{\mu_0-\mu_1}{\sigma} - z^* < Z < \sqrt{n}\,\frac{\mu_0-\mu_1}{\sigma} + z^*.$$
If we set δ to equal the standardized difference in the means,
$$\delta = \sqrt{n}\,\frac{\mu_0-\mu_1}{\sigma},$$
then we have a Type II error when

δ − z∗ < Z < δ + z∗,

or when |Z − δ| < z ∗ . Hence

P rob(Type II error) = P rob (|Z − δ| < z ∗ ) .



This calculation was for a two-tail test. When the test is upper-tail or
lower-tail, a similar calculation leads to the code

############################
# Type1 and Type2 errors - Z
############################

from numpy import *


from scipy.stats import norm as Z

def type2_error(type,mu0,mu1,sdev,n,alpha):
    print("significance,mu0,mu1, sdev, n: ", alpha,mu0,mu1,sdev,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / sdev
    if type == "lower-tail":
        zstar = Z.ppf(alpha)
        type2 = 1 - Z.cdf(delta + zstar)
    elif type == "upper-tail":
        zstar = Z.ppf(1-alpha)
        type2 = Z.cdf(delta + zstar)
    elif type == "two-tail":
        zstar = Z.ppf(1 - alpha/2)
        type2 = Z.cdf(delta + zstar) - Z.cdf(delta - zstar)
    else: print("what's the test type?"); return
    print("test type: ",type)
    print("zstar: ", zstar)
    print("delta: ", delta)
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)

mu0 = 120
mu1 = 122
sdev = 2
n = 10
alpha = .01
type = "upper-tail"

type2_error(type,mu0,mu1,sdev,n,alpha)

A type II error is when we do not reject the null hypothesis and yet it’s
false. The power of a test is the probability of rejecting the null hypothesis
when it’s false (Figure 6.3). If the probability of a type II error is β, then the
power is 1 − β.
Going back to the driving speed example, what is the chance that someone
driving at µ1 = 122 is not caught? This is a type II error; using the above
code, the probability is

β = Prob(do not reject µ0 = 120 | µ = 122) = 20%.



Therefore this test has power 80% to detect such a driver.

6.3 T -test

Let X1 , X2 , . . . , Xn be a simple random sample from a population. We repeat


the previous section when we know neither the population mean µ, nor the
population variance σ 2 . We only know the sample mean
$$\bar X = \frac{X_1 + X_2 + \dots + X_n}{n}$$
and the sample variance
$$S^2 = \frac{1}{n-1}\sum_{k=1}^n (X_k - \bar X)^2.$$

For example, assume X1 , X2 , . . . , Xn are Bernoulli random variables with


values 0, 1. Then as we've seen before,
$$(n-1)S^2 = \sum_{k=1}^n (X_k - \bar X)^2 = n\bar X(1-\bar X).$$

From §5.5, when the population is normal,


• (n − 1)S 2 is chi-squared of degree n − 1, and
• X̄ and S 2 are independent.

A random variable T has a t-distribution with degree d if its probability


density function is
$$p(t) = C\cdot\left(1 + \frac{t^2}{d}\right)^{-(d+1)/2}, \qquad -\infty < t < \infty. \qquad (6.3.1)$$

Here C is a constant to make the total area under the graph equal to one
(Figure 6.4).
Then the t-distribution is continuous and the probability that T lies in a
small interval [a, b] is
$$\frac{Prob(a < T < b)}{b - a} \approx p(t), \qquad a < t < b.$$
When the interval [a, b] is not small, this is not correct. The exact formula
for Prob(a < T < b) is the area under the graph (Figure 6.4). This is obtained
by integration (§A.5),
$$Prob(a < T < b) = \int_a^b p(t)\,dt. \qquad (6.3.2)$$

Under this interpretation, this probability corresponds to the area under the
graph between the vertical lines at a and at b, and the total area under the
graph corresponds to a = −∞ and b = ∞.
More generally, means of f(T) are computed by integration,
$$E(f(T)) = \int_{-\infty}^{\infty} f(t)p(t)\,dt,$$

with the integral computed via the fundamental theorem of calculus (A.5.2)
or Python.

Fig. 6.4 T -distribution, against normal (dashed).

The t-distribution (6.3.1) approaches the standard normal distribution


(5.4.1) as d → ∞ (Exercise 6.3.1).

from numpy import *


from scipy.stats import t as T, norm as Z
from matplotlib.pyplot import *

for d in [3,4,7]:
    t = arange(-3,3,.01)
    plot(t,T(d).pdf(t),label="d = "+str(d))

plot(t,Z.pdf(t),"--",label=r"d = $\infty$")
grid()
legend()
show()

Using calculus, one can derive

Relation Between Z, U , and T

Suppose Z and U are independent, where Z is standard normal, and


U is chi-squared with d degrees of freedom. Then
$$T = \frac{Z}{\sqrt{U/d}}$$

is a t-distribution with degree d.
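This relation can be checked by simulation, comparing samples of Z/√(U/d)
against the t-distribution; the degree d = 5, sample size, and seed are
illustrative choices.

from numpy import *
from numpy.random import default_rng
from scipy.stats import t as T, kstest

rng = default_rng(0)
d, N = 5, 10**5

Z = rng.standard_normal(N)        # standard normal samples
U = rng.chisquare(d, N)           # independent chi-squared samples, degree d
samples = Z / sqrt(U/d)

print(kstest(samples, T(d).cdf))  # small KS statistic: consistent with T(d)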

In the previous section, we normalized a sample mean by subtracting the
mean µ and dividing by the standard error σ/√n. Since now we don't know
σ, it is reasonable to divide by the sample standard error, obtaining
$$\sqrt{n}\cdot\frac{\bar X - \mu}{S} = \sqrt{n}\cdot\frac{\bar X - \mu}{\sqrt{\dfrac{1}{n-1}\sum_{k=1}^n (X_k-\bar X)^2}}.$$

If we standardize each variable by

Xk = µ + σZk ,

then we can verify
$$\bar X = \mu + \sigma\bar Z, \qquad \bar Z = \frac{Z_1 + Z_2 + \dots + Z_n}{n},$$
and
$$S^2 = \sigma^2\,\frac{1}{n-1}\sum_{k=1}^n (Z_k - \bar Z)^2.$$

From this, we have



$$\sqrt{n}\cdot\frac{\bar X - \mu}{S} = \sqrt{n}\cdot\frac{\bar Z}{\sqrt{\dfrac{1}{n-1}\sum_{k=1}^n (Z_k-\bar Z)^2}} = \sqrt{n}\cdot\frac{\bar Z}{\sqrt{U/(n-1)}}.$$

Using the last result with d = n − 1, we arrive at the main result in this
section.

Samples and T Distributions

Let X1 , X2 , . . . , Xn be independent normal random variables with


mean µ. Let X̄ be the sample mean, let S 2 be the sample variance,
and let
$$T = \sqrt{n}\cdot\frac{\bar X - \mu}{S}.$$
Then T is distributed according to a t-distribution with degree (n−1).

The takeaway here is we do not need to know the standard deviations σ


of X1 , X2 , . . . , Xn to compute T .

Geometrically, the p-value Prob(T > 1) is the probability that a normally
distributed point in (d + 1)-dimensional spacetime is inside the light cone.

The t-score t∗ corresponding to significance α is

tstar = T(d).ppf(alpha) # lower-tail, tstar < 0


tstar = T(d).ppf(1-alpha) # upper-tail, tstar > 0
tstar = T(d).ppf(1-alpha/2) # two-tail, tstar > 0

Here d is the degree of T . Then we have

##########################
# Confidence Interval - T
##########################

from numpy import *


from scipy.stats import t as T

def confidence_interval(xbar,s,n,alpha,type):
    d = n-1
    if type == "two-tail":
        tstar = T(d).ppf(1-alpha/2)
        L = xbar - tstar * s / sqrt(n)
        U = xbar + tstar * s / sqrt(n)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        L = xbar
        U = xbar + tstar* s / sqrt(n)
    elif type == "lower-tail":
        tstar = T(d).ppf(alpha)
        L = xbar + tstar* s / sqrt(n)
        U = xbar
    else: print("what's the test type?"); return
    print("type: ",type)
    return L, U

n = 10
xbar = 120
s = 2
alpha = .01
type = "upper-tail"
print("significance, s, n, xbar: ", alpha,s,n,xbar)

L,U = confidence_interval(xbar,s,n,alpha,type)
print("lower, upper: ", L,U)

Going back to the driving speed example from §6.2, instead of assuming
the population standard deviation is σ = 2, we compute the sample standard
deviation and find it’s S = 2. Recomputing with T (9), instead of Z, we
see (L, U ) = (120, 121.78), so the cutoff now is µ∗ = 121.78, as opposed to
µ∗ = 121.47 there.

We turn now to hypothesis testing. As before, we have two hypotheses, a


null hypothesis and an alternate hypothesis. In the above setting where we
are estimating a population parameter, the null hypothesis is that µ equals a
certain value µ0 , and the alternate hypothesis is that µ is not equal to µ0 .
• H0 : µ = µ0
• Ha : µ ̸= µ0 .
Here is the code:

###################
# Hypothesis T-test
###################

from numpy import *


from scipy.stats import t as T

def ttest(mu0, s, n, xbar,type):
    d = n-1
    print("mu0, s, n, xbar: ", mu0,s,n,xbar)
    t = sqrt(n) * (xbar - mu0) / s
    print("t: ",t)
    if type == "lower-tail": p = T(d).cdf(t)
    elif type == "upper-tail": p = 1 - T(d).cdf(t)
    elif type == "two-tail": p = 2 * (1 - T(d).cdf(abs(t)))
    print("pvalue: ",p)
    if p < alpha: print("reject H0")
    else: print("do not reject H0")

xbar = 122
n = 10
type = "upper-tail"
mu0 = 120
s = 2
alpha = .01

ttest(mu0, s, n, xbar,type)

Going back to the driving speed example, the hypothesis test is


• H0 : µ = µ0
• Ha : µ > µ0
If a driver’s measured average speed is X̄ = 122, the above code rejects
H0 . This is consistent with the confidence interval cutoff we found above.
However, the p-value obtained here is greater than the corresponding p-value
in §6.2.

For Type I and Type II errors, the code is

########################
# Type1 and Type2 errors
########################

from numpy import *


from scipy.stats import t as T

def type2_error(type,mu0,mu1,s,n,alpha):
    # s is the sample standard deviation
    d = n-1
    print("significance,mu0,mu1,n: ", alpha,mu0,mu1,n)
    print("prob of type1 error: ", alpha)
    delta = sqrt(n) * (mu0 - mu1) / s
    if type == "lower-tail":
        tstar = T(d).ppf(alpha)
        type2 = 1 - T(d).cdf(delta + tstar)
    elif type == "upper-tail":
        tstar = T(d).ppf(1-alpha)
        type2 = T(d).cdf(delta + tstar)
    elif type == "two-tail":
        tstar = T(d).ppf(1 - alpha/2)
        type2 = T(d).cdf(delta + tstar) - T(d).cdf(delta - tstar)
    else: print("what's the test type?"); return
    print("test type: ",type)
    print("tstar: ", tstar)
    print("delta: ", delta)
    print("prob of type2 error: ", type2)
    print("power: ", 1 - type2)

# mu0, s, n, alpha, type as above
mu1 = 122
type2_error(type,mu0,mu1,s,n,alpha)

Going back to the driving speed example, if a driver’s measured average


speed is X̄ = 122, what is the chance they will not be fined? From the code,
the probability of this Type II error is 37%, and the power to detect such a
driver is 63%.

Exercises

Exercise 6.3.1 Use the compound-interest formula (A.3.8) to show the T -


distribution pdf equals the standard normal pdf when d = ∞. Since the
formula for the normalizing constant C is not given, ignore C in your calcu-
lation.

6.4 Chi-Squared Tests

Let X1 , X2 , . . . , Xn be i.i.d. random variables, where each Xk is categorical.


This means each Xk is a discrete random variable (§5.3), taking values in one
of d categories. For simplicity, assume the categories are

i = 1, 2, . . . , d.

When d = 2, this reduces to the Bernoulli case.


When d = 2 and Xk = 0, 1, the sample mean X̄ is a proportion p̂, the
population mean is p = Prob(Xk = 1), the population standard deviation is
√(p(1 − p)), and the sample standard deviation is √(X̄(1 − X̄)) = √(p̂(1 − p̂)).
By the central limit theorem, the test statistic
$$Z = \sqrt{n}\cdot\frac{\hat p - p}{\sqrt{p(1-p)}} \qquad (6.4.1)$$

is approximately standard normal for large enough sample size, and con-
sequently U = Z 2 is approximately chi-squared with degree one. The chi-
squared test generalizes this from d = 2 categories to d > 2 categories.
Given a category i, let #i denote the number of times Xk = i, 1 ≤ k ≤ n,
in a sample of size n. Then #i is the count that Xk = i, and p̂i = #i /n is the
observed frequency, in a sample of size n. Let pi be the expected frequency,

pi = P rob(Xk = i).

Then p = (p1 , p2 , . . . , pd ) is the probability vector associated to X. Since Xk


are identically distributed, this does not depend on k.
By the central limit theorem,
$$\sqrt{n}\,(\hat p_i - p_i) = \sqrt{n}\left(\frac{\#_i}{n} - p_i\right),$$

are approximately normal for large n. Based on this, we have the

Goodness-Of-Fit Test

Let p̂ = (p̂1 , p̂2 , . . . , p̂d ) be the observed frequencies corresponding to


samples X1 , X2 , . . . , Xn , and let p = (p1 , p2 , . . . , pd ) be the expected
frequencies. Then, for large sample size n, the statistic
$$n\sum_{i=1}^d\frac{(\hat p_i - p_i)^2}{p_i} \qquad (6.4.2)$$

is approximately chi-squared with degree d − 1.

By clearing denominators, (6.4.2) may be rewritten in terms of counts as


follows,
$$\sum_{i=1}^d\frac{(\#_i - np_i)^2}{np_i} = \sum_{i=1}^d\frac{(\mathrm{observed}_i - \mathrm{expected}_i)^2}{\mathrm{expected}_i}.$$

When d = 2, this statistic reduces to Z 2 , where Z is given by (6.4.1). Here


is the code.

from numpy import *


from scipy.stats import chi2 as U

def goodness_of_fit(observed,expected):
    # assume len(observed) == len(expected)
    d = len(observed)
    u = sum([ (observed[i] - expected[i])**2/expected[i] for i in range(d) ])
    pvalue = 1 - U(d-1).cdf(u)
    return pvalue

Suppose a dice is rolled n = 120 times, and the observed counts are

O1 = 17, O2 = 12, O3 = 14, O4 = 20, O5 = 29, O6 = 28.

Notice
O1 + O2 + O3 + O4 + O5 + O6 = 120.
If the dice is fair, the expected counts are

E1 = 20, E2 = 20, E3 = 20, E4 = 20, E5 = 20, E6 = 20.

Based on the observed counts, at 5% significance, what can we conclude about


the dice?
Here there are d = 6 categories, α = .05, and the statistic (6.4.2) equals

u = 12.7.

The dice is fair if u is not large and the dice is unfair if u is large. At
significance level α, the large/not-large cutoff u∗ is

from scipy.stats import chi2 as U

alpha = .05
d = 6
ustar = U(d-1).ppf(1-alpha)

Since this returns u∗ = 11.07 and u > u∗ , we can conclude the dice is not
fair.
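Alternatively, scipy provides the same goodness-of-fit test directly as
chisquare, which returns the statistic u together with its p-value:

from scipy.stats import chisquare

observed = [17, 12, 14, 20, 29, 28]
expected = [20, 20, 20, 20, 20, 20]

print(chisquare(observed, expected))   # statistic 12.7, p-value below .05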

To derive the goodness-of-fit test, let X be a discrete random variable,


taking values in 1, 2, . . . , d, with distribution p = (p1 , p2 , . . . , pd ). We vec-
torize (§1.3) X by defining the one-hot encoded (§2.4) vector-valued random
variable V = vectp (X) = (V1 , V2 , . . . , Vd ) as follows,

 √1

if X = i,
Vi = vectp (X)i = pi (6.4.3)
0 if X ̸= i.

Then

E(Vi) = (1/√pi) Prob(X = i) = pi/√pi = √pi,

and

E(Vi Vj) = 1 if i = j, and E(Vi Vj) = 0 if i ≠ j,

for i, j = 1, 2, . . . , d. If

µ = (√p1, √p2, . . . , √pd),

we conclude
E(V ) = µ, E(V ⊗ V ) = I.
From this,
E(V) = µ,    Var(V) = I − µ ⊗ µ.    (6.4.4)
Now define
Vk = vectp (Xk ) , k = 1, 2, . . . , n.
Since X1, X2, . . . , Xn are i.i.d., V1, V2, . . . , Vn are i.i.d. By (5.5.5), we conclude
the random vector

Z = √n ((1/n) Σ_{k=1}^n Vk − µ)

has mean zero and variance I − µ ⊗ µ.


Since V1 , V2 , . . . , Vn are i.i.d, by the central limit theorem, we also conclude
Z is approximately normal for large n.
Since

|µ|² = (√p1)² + (√p2)² + · · · + (√pd)² = p1 + p2 + · · · + pd = 1,

µ is a unit vector. By the singular chi-squared result in §5.5, |Z|² is approxi-
mately chi-squared with degree d − 1. Since

Zi = √n (p̂i/√pi − √pi),    (6.4.5)

we write |Z|² out,

|Z|² = Σ_{i=1}^d Zi² = n Σ_{i=1}^d (p̂i/√pi − √pi)² = n Σ_{i=1}^d (p̂i − pi)²/pi,

obtaining (6.4.2).

Suppose X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Yn are samples measuring two


possibly related effects. Suppose the X variables take on d categories, i =
1, 2, . . . , d, and the Y variables take on N categories, j = 1, 2, . . . , N . Let

pi = P rob(Xk = i), qj = P rob(Yk = j),

and set p = (p1, p2, . . . , pd), q = (q1, q2, . . . , qN). The goal is to test whether the


two effects are independent or not.
Let
rij = P rob(Xk = i and Yk = j).
Then r is a d × N matrix. The effects are independent when

rij = pi qj ,

or r = p ⊗ q.
For example, suppose 300 people are polled and the results are collected
in a contingency table (Table 6.5).

Democrat Republican Independent Total


Women 68 56 32 156
Men 52 72 20 144
Total 120 128 52 300

Table 6.5 2 × 3 = d × N contingency table [30].

Is a person’s gender correlated with their party affiliation, or are the two
variables independent? To answer this, let p̂ and q̂ be the observed frequencies

p̂i = #{k : Xk = i}/n,    q̂j = #{k : Yk = j}/n,

and let r̂ be the joint observed frequencies

r̂ij = #{k : Xk = i and Yk = j}/n.
Then r̂ is also a d × N matrix.
When the effects are independent, r = p ⊗ q, so, by the law of large
numbers, we should have
r̂ ≈ p̂ ⊗ q̂
for large sample size. The chi-squared independence test quantifies the
difference between the two matrices r̂ and p̂ ⊗ q̂.
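As a quick illustration with the data in Table 6.5 (a minimal numpy sketch;
the variable names are illustrative),

from numpy import array, outer

table = array([[68,56,32],[52,72,20]])
n = table.sum()                      # 300
p_hat = table.sum(axis=1)/n          # observed gender frequencies
q_hat = table.sum(axis=0)/n          # observed party frequencies
r_hat = table/n                      # observed joint frequencies

r_independent = outer(p_hat,q_hat)   # compare with r_hat

Under independence, r_hat and outer(p_hat,q_hat) should be close.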

Chi-squared Independence Test

If X1, X2, . . . , Xn and Y1, Y2, . . . , Yn are independent, then, for large
sample size n, the statistic

n Σ_{i,j=1}^{d,N} (r̂ij − p̂i q̂j)²/(p̂i q̂j)    (6.4.6)

is approximately chi-squared with degree (d − 1)(N − 1).

Only sample data is used to compute the statistic (6.4.6); knowledge of p
and q is not needed. Conversely, the test says nothing about p and q, and
only queries independence.
By clearing denominators, (6.4.6) may be rewritten in terms of counts as
follows,

Σ_{i,j=1}^{d,N} (n #XY_ij − #X_i #Y_j)² / (n #X_i #Y_j)
    = −n + n Σ_{i,j=1}^{d,N} (#XY_ij)² / (#X_i #Y_j)
    = −n + n Σ_{i,j=1}^{d,N} (observed)² / expected,

where #XY_ij = n r̂ij is the joint count, and #X_i = n p̂i, #Y_j = n q̂j are the
marginal counts.

The code

from numpy import *


from scipy.stats import chi2 as U

# table is dxN numpy array

def chi2_independence(table):
n = sum(table) # total sample size
d = len(table)
N = len(table.T)
rowsum = array([ sum(table[i,:]) for i in range(d) ])
colsum = array([ sum(table[:,j]) for j in range(N) ])
expected = outer(rowsum,colsum) # tensor product
u = -n + n*sum([[ table[i,j]**2/expected[i,j] for j in range(N) ]
,→ for i in range(d) ])
deg = (d-1)*(N-1)
pvalue = 1 - U(deg).cdf(u)
return pvalue

table = array([[68,56,32],[52,72,20]])
chi2_independence(table)

returns a p-value of 0.0401, so, at the 5% significance level, the effects are
not independent.
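As a cross-check (this is not part of the derivation here, and assumes scipy
is available), the same numbers can be obtained from scipy's built-in
contingency-table test.

from scipy.stats import chi2_contingency

u, pvalue, deg, expected = chi2_contingency(table)

Here pvalue is again approximately 0.0401, and deg = (d − 1)(N − 1) = 2.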

The derivation of the independence test is similar to the goodness-of-fit


test. There are two differences. First, because there are two indices Xk = i,
Yk = j, we work with matrices, not vectors. Second, we appeal to the law of
large numbers to replace pi by p̂i and qj by q̂j for large n.
Let Z be the d × N matrix

Zij = √n (r̂ij − p̂i q̂j)/(√p̂i √q̂j).    (6.4.7)

Then (see (2.2.12))

∥Z∥² = trace(Z t Z) = Σ_{i,j=1}^{d,N} Zij²

equals (6.4.6).
Let u1, u2, . . . , ud and v1, v2, . . . , vN be orthonormal bases for Rd and
RN respectively. By (2.9.8),

∥Z∥² = trace(Z t Z) = Σ_{i,j=1}^{d,N} (ui · Zvj)².    (6.4.8)

We will show ∥Z∥² is asymptotically chi-squared of degree (d − 1)(N − 1).
To achieve this, we show Z is asymptotically normal.
Let X and Y be discrete random variables with probability vectors
p = (p1 , p2 , . . . , pd ) and q = (q1 , q2 , . . . , qN ), and assume X and Y are in-
dependent.
Let

µ = (√p1, √p2, . . . , √pd),    ν = (√q1, √q2, . . . , √qN).

Then µ and ν are unit vectors. Following (6.4.3), define

M = (vectp (X) − µ) ⊗ (vectq (Y ) − ν). (6.4.9)

Then M is a d × N matrix-valued random variable, and

u · M v = (vectp (X) · u − µ · u)(vectq (Y ) · v − ν · v).

If u and v are unit vectors in Rd and RN respectively, by (6.4.4),

E(vectp(X) · u) = µ · u,    Var(vectp(X) · u) = 1 − (µ · u)²,

and

E(vectq(Y) · v) = ν · v,    Var(vectq(Y) · v) = 1 − (ν · v)².

By independence of X and Y, the mean of u · M v is zero, and

Var(u · M v) = (1 − (µ · u)²)(1 − (ν · v)²).
In particular, when the unit vectors u or v are orthogonal to µ and ν respec-


tively, u · M v is a standard random variable, i.e. has mean zero and variance
one. This also shows u · M ν = 0, µ · M v = 0 for any u and v.
More generally (Exercise 6.4.3) u · M v and u′ · M v ′ are uncorrelated when
u ⊥ u′ and v ⊥ v ′ .
Our goal is to show for large n, Z has the same mean and variance as that
of M . If we also show u · Zv and u′ · Zv ′ are independent for large n when
u ⊥ u′ and v ⊥ v ′ , then (6.4.8) leads to the result. Now to the details.
Let W = Wn be a random variable that depends on n. We write

W ≈0

if all probabilities of W converge to zero for n large. In this case, we say W


is asymptotically zero (see §A.6 for more information).
Let W ′ be another random variable depending on n. We write W ≈ W ′ ,
and we say W is asymptotically equal to W ′ , if all probabilities of W and W ′
agree asymptotically for n large. In particular, when W ≈ W ′ ,

E(W ) ≈ E(W ′ ), V ar(W ) ≈ V ar(W ′ ).

If W ≈ W ′ and W ′ is a normal random variable not depending on n, we


write W ≈ normal, and we say W is asymptotically normal. If W ≈ W ′ and
W ′ ≈ normal, then W ≈ normal.
Let Mk correspond to Xk and Yk, k = 1, 2, . . . , n, and let

Z CLT = √n ((1/n) Σ_{k=1}^n Mk − E(Mk)) = (1/√n) Σ_{k=1}^n Mk.

Then, by independence, and the central limit theorem,


• the mean and variance of Z CLT are the same as those of M ,
• u · Z CLT ν = 0, µ · Z CLT v = 0 for any u and v, and,
• Z CLT ≈ normal.
Although Z and Z CLT are not equal, we will show Z ≈ Z CLT . To this
end, multiplying out the expression (6.4.9) for each M = Mk , and summing
over k = 1, 2, . . . , n, we see
(Z CLT)ij = √n (r̂ij/(√pi √qj) − p̂i √qj/√pi − q̂j √pi/√qj + √pi √qj).    (6.4.10)

By the law of large numbers, p̂i ≈ pi and q̂j ≈ qj so

(q̂j − qj)/(√p̂i √q̂j) ≈ 0.

As we saw before (6.4.5), by the central limit theorem,

√n (p̂i − pi) ≈ normal.

Hence³ the product

√n · (pi − p̂i)(q̂j − qj)/(√p̂i √q̂j) ≈ 0.    (6.4.11)
Similarly, p̂i ≈ pi and q̂j ≈ qj, so

(Z CLT)ij (√pi √qj/(√p̂i √q̂j) − 1) ≈ 0.    (6.4.12)

Adding (6.4.10), (6.4.11), and (6.4.12), we obtain (6.4.7), hence

Z ≈ Z CLT .

We conclude
• the mean and variance of Z are asymptotically the same as those of M ,
• u · Zν ≈ 0, µ · Zv ≈ 0 for any u and v, and,
• Z ≈ normal.
In particular, since u·Zv and u′ ·Zv ′ are asymptotically uncorrelated when
u ⊥ u′ and v ⊥ v ′ , and Z is asymptotically normal, we conclude u · Zv and
u′ · Zv ′ are asymptotically independent when u ⊥ u′ and v ⊥ v ′ .
Now choose the orthonormal bases with u1 and v1 equal to µ and ν re-
spectively. Then

ui · Zvj , i = 1, 2, 3, . . . , d, j = 1, 2, 3, . . . , N

are independent normal random variables with mean zero, asymptotically for
large n, and variances according to the listing

µ · Zν     µ · Zv2    . . .  µ · ZvN          0  0  . . .  0
u2 · Zν    u2 · Zv2   . . .  u2 · ZvN         0  1  . . .  1
  . . .      . . .    . . .    . . .      ≈   .  .  . . .  .
ud · Zν    ud · Zv2   . . .  ud · ZvN         0  1  . . .  1
From this, only (d − 1)(N − 1) terms are nonzero in (6.4.8), hence ∥Z∥² is
chi-squared with degree (d − 1)(N − 1), completing the proof.
³ The theoretical basis for this intuitively obvious result is Slutsky's theorem [8].

Exercises

Exercise 6.4.1 Let V be the vectorization (6.4.3) of the discrete random


variable X, and let µ be the mean of V . Then V · µ = 1.

Exercise 6.4.2 Verify (6.4.10).

Exercise 6.4.3 Let M be as in (6.4.9). Then u · M v and u′ · M v ′ are uncor-


related when u ⊥ u′ and v ⊥ v ′ .

Exercise 6.4.4 Verify the goodness-of-fit test statistic (6.4.2) is the square
of (6.4.1) when d = 2.

Exercise 6.4.5 [30] Among 100 vacuum tubes tested, 41 had lifetimes of less
than 30 hours, 31 had lifetimes between 30 and 60 hours, 13 had lifetimes
between 60 and 90 hours, and 15 had lifetimes of greater than 90 hours.
Are these data consistent with the hypothesis that a vacuum tube’s lifetime
is exponentially distributed (Exercise 5.3.22) with a mean of 50 hours? At
what significance? Here p = (p1 , p2 , p3 , p4 ).

Exercise 6.4.6 [30] A study was instigated to see if southern California


earthquakes of at least moderate size are more likely to occur on certain
days of the week than on others. The catalogs yielded the data in Table 6.6.
Test, at the 5 percent level, the hypothesis that an earthquake is equally
likely to occur on any of the 7 days of the week.

Day Sun Mon Tues Wed Thurs Fri Sat Total


Number of Earthquakes 156 144 170 158 172 148 152 1100

Table 6.6 Earthquake counts.


Chapter 7
Machine Learning

7.1 Overview

This first section is an overview of the chapter. Here is a summary of the


structure of neural networks.
• A graph consists of nodes and edges (§3.3).
• If each edge has a direction, the graph is directed.
• If each edge has a weight, the graph is weighed.
• In a directed graph, there are input nodes, output nodes, and hidden
nodes.
• A node with an activation function is a neuron (§4.4).
• Each neuron has incoming signals and an outgoing signal.
• The outgoing signal is the activation function applied to the incoming
signals.
• A network is a weighed directed graph (§3.3) where the nodes are neurons
(§4.4).
• A neural network is a network where each activation function is a function
of the sum of the incoming signals (§7.2).
The goal is to train a neural network. To train a neural network means to
find weights W so that the input-output behavior of the network is as close
as possible to a given dataset of sample pairs (xk, yk), k = 1, 2, . . . , N. Here
is a summary of how neural networks are trained (§7.4).
1. Start with a sample pair (xk , yk ) and a weight matrix W .
2. Using xk as incoming signals at the input nodes, compute the network’s
outgoing signals at all nodes (forward propagation).
3. Compute the error J = J(xk , yk , W ) between the outgoing signals at the
output nodes and yk .
4. Compute the derivatives δout of J at the output nodes.
5. Compute the derivatives δ of J at all nodes (back propagation).
6. Then the weight gradient is given by ∇W J = x ⊗ δ.


7. Update W using gradient descent (§7.3), W + = W − t∇W J (§7.4).


8. Repeat steps 1-7 over all sample pairs (xk , yk ), k = 1, 2, . . . , N (§7.4).
9. Repeat step 8 until convergence.
Steps 1-7 constitute an iteration, and step 8 is an epoch. An iteration uses a single
sample (more generally a batch of samples), and an epoch uses the entire
dataset. The mean error function over the dataset is

J(W) = Σ_{k=1}^N J(xk, yk, W).

Sometimes J(W ) is normalized by dividing by N , but this does not change the
results. With the dataset given, the mean error is a function of the weights.
A weight matrix W ∗ is optimal if it is a minimizer of the mean error,

J(W ∗ ) ≤ J(W ), for all W.

Convergence means W is close to W ∗ . Now we turn to the details.

7.2 Neural Networks

In §4.4, we saw two versions of forward and back propagation. In this section
we see a third version. We begin by reviewing the definition of graph and
network as given in §3.3 and §4.4.
A graph consists of nodes and edges. Nodes are also called vertices, and an
edge is an ordered pair (i, j) of nodes. Because the ordered pair (i, j) is not
the same as the ordered pair (j, i), our graphs are directed.
The edge (i, j) is incoming at node j and outgoing at node i. If a node j
has no outgoing edges, then j is an output node. If a node i has no incoming
edges, then i is an input node. If a node is neither an input nor an output, it
is a hidden node.
We assume our graphs have no cycles: every path terminates at an output
node in a finite number of steps.
A graph is weighed if a scalar weight wij is attached to each edge (i, j). If
(i, j) is not an edge, we set wij = 0. If a network has d nodes, the edges are
completely specified by the d × d weight matrix W = (wij ).
A node with an attached activation function (4.4.2) is a neuron. A net-
work is a directed weighed graph where the nodes are neurons. In the next
paragraph, we define a special kind of network, a neural network.

In a network, in §4.4, the activation function fj at node j was allowed to


be any function of the incoming list (4.4.1) at node j

(w1j x1 , w2j x2 , . . . , wdj xd ).

Because wij = 0 if (i, j) is not an edge, the nonzero entries in the incoming
list at node j correspond to the edges incoming at node j.
A neural network is a network where every activation function is restricted
to be a function of the sum of the entries of the incoming list.
For example, all the networks in this section are neural networks, but the
network in Figure 4.16 is not a neural network.
Let

x−j = Σ_{i→j} wij xi    (7.2.1)

be the sum of the incoming list at node j. Then, in a neural network, the
outgoing signal at node j is

xj = fj(x−j) = fj(Σ_{i→j} wij xi).    (7.2.2)

If the network has d nodes, the outgoing vector is

x = (x1 , x2 , . . . , xd ),

and the incoming vector is

x− = (x−1, x−2, . . . , x−d).

In a network, in §4.4, x−j was a list or vector; in a neural network, x−j is a
scalar.
Let W be the weight matrix. If the network has d nodes, the activation
vector is
f = (f1 , f2 , . . . , fd ).
Then a neural network may be written in vector-matrix form

x = f (W t x).

However, this representation is more useful when the network has structure,
for example in a dense shallow layer (7.2.12).

A perceptron is a network of the form

y = f (w1 x1 + w2 x2 + · · · + wd xd ) = f (w · x)

(Figure 7.1). This is the simplest neural network.



Thus a perceptron is a linear function followed by an activation function.


By our definition of neural network,

Neural Network
Every neural network is a combination of perceptrons.

Fig. 7.1 A perceptron with activation function f.

When an input x0 is fixed to equal 1, x0 = 1, the corresponding weight


w0 is called a bias,

y = f (w1 x1 + w2 x2 + · · · + wd xd + w0 ) = f (w · x + w0 ).

The role of the bias is to shift the thresholds in the activation functions.
If x1 , x2 , . . . , xN is a dataset, then (x1 , 1), (x2 , 1), . . . , (xN , 1) is the aug-
mented dataset. If the original dataset is in Rd , then the augmented dataset
is in Rd+1 . In this regard, Exercise 7.2.1 is relevant.
By passing to the augmented dataset, a neural network with bias and d
input features can be thought of as a neural network without bias and d + 1
input features.
In §5.1, Bayes theorem is used to express a conditional probability in terms
of a perceptron,
P rob(H | x) = σ(w · x + w0 ).
This is a basic example of how a perceptron computes probabilities.
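Here is a minimal numpy sketch of such a perceptron; the weights w, w0 and
the input x are illustrative values, not taken from any dataset.

from numpy import array, dot, exp

def sigmoid(z): return 1/(1 + exp(-z))

w, w0 = array([0.5, -1.0]), 0.25
x = array([2.0, 1.0])

prob = sigmoid(dot(w,x) + w0)   # P(H | x) = sigmoid(w.x + w0)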

Perceptrons gained wide exposure after Minsky and Papert’s famous 1969
book [22], from which Figure 7.2 is taken.

Fig. 7.2 Perceptrons in parallel (R in the figure is the retina) [22].

Here is a listing of common activation functions.


• The identity function,

  id(z) = z

  and its derivative id′ = 1.

• The binary output,

  bin(z) = 1 if z > 0, and bin(z) = 0 if z < 0,

  and its derivative bin′ = 0 for z ≠ 0, with bin′(0) undefined.

• The logistic or sigmoid function (Figure 5.3)

  σ(z) = 1/(1 + e−z)

  and its derivative σ′ = σ(1 − σ).

• The hyperbolic tangent function

  tanh(z) = (ez − e−z)/(ez + e−z)

  and its derivative tanh′ = 1 − tanh².

• The rectified linear unit relu,

  relu(z) = z if z ≥ 0, and relu(z) = 0 if z < 0,

  and its derivative relu′ = bin.


Here is the code

# activation functions

def relu(z): return 0 if z < 0 else z


def bin(z): return 0 if z < 0 else 1
def sigmoid(z): return 1/(1+exp(-z))
def id(z): return z
# tanh already part of numpy
def one(z): return 1
def zero(z): return 0

# derivative of relu is bin


# derivative of bin is zero
# derivative of s=sigmoid is s*(1-s)
# derivative of id is one
# derivative of tanh is 1-tanh**2

def D_relu(z): return bin(z)


def D_bin(z): return 0
def D_sigmoid(z): return sigmoid(z)*(1-sigmoid(z))
def D_id(z): return 1
def D_relu(z): return bin(z)
def D_tanh(z): return 1 - tanh(z)**2

der_dict = { relu:D_relu, id:D_id, bin:D_bin, sigmoid:D_sigmoid,


,→ tanh: D_tanh}

The neural network in Figure 7.3 has weight matrix

        | 0  0  w13  w14  0    0   |
        | 0  0  w23  w24  0    0   |
    W = | 0  0  0    0    w35  w36 |    (7.2.3)
        | 0  0  0    0    w45  w46 |
        | 0  0  0    0    0    0   |
        | 0  0  0    0    0    0   |

and activation functions f3 , f4 , f5 , f6 . Here 1 and 2 are plain nodes, and 3,


4, 5, 6 are neurons.
Fig. 7.3 Network of neurons.

Let xin and xout be the outgoing vectors corresponding to the input and
output nodes. Then the network in Figure 7.3 has outgoing vectors

x = (x1 , x2 , x3 , x4 , x5 , x6 ), xin = (x1 , x2 ), xout = (x5 , x6 ).

Here are the incoming and outgoing signals at each of the four neurons f3 ,
f4 , f5 , f6 .

Neuron   Incoming                      Outgoing
f3       x−3 = w13 x1 + w23 x2         x3 = f3(w13 x1 + w23 x2)
f4       x−4 = w14 x1 + w24 x2         x4 = f4(w14 x1 + w24 x2)
f5       x−5 = w35 x3 + w45 x4         x5 = f5(w35 x3 + w45 x4)
f6       x−6 = w36 x3 + w46 x4         x6 = f6(w36 x3 + w46 x4)

Table 7.4 Incoming and Outgoing signals.

Now we specialize the forward propagation code in §4.4 to neural networks.


The key diagram is Figure 7.5.

xi xj
fi fj
wij

Fig. 7.5 Forward and back propagation between two neurons.

Assume the activation function at node j is activate[j]. By (7.2.1) and


(7.2.2), the code is

def incoming(x,w,j):
return sum([ outgoing(x,w,i)*w[i][j] if w[i][j] != None else 0 for
,→ i in range(d) ])

def outgoing(x,w,j):
if x[j] != None: return x[j]
else: return activate[j](incoming(x,w,j))

We assume the nodes are ordered so that the initial portion of x equals
xin ,

m = len(x_in)
x[:m] = x_in

Here is the third version of forward propagation.

# third version: neural networks

def forward_prop(x_in,w):
d = len(w)
x = [None]*d
m = len(x_in)
x[:m] = x_in
for j in range(m,d): x[j] = outgoing(x,w,j)
return x

For Figure 7.3, we define a weight matrix as follows,

        | 0  0  0.1  −2.0  0     0    |
        | 0  0  0.1  −2.0  0     0    |
    W = | 0  0  0    0     −0.3  −0.3 |    (7.2.4)
        | 0  0  0    0     0.22  0.22 |
        | 0  0  0    0     0     0    |
        | 0  0  0    0     0     0    |

and activation functions

d = 6
activate = [None]*d

activate[2] = relu
activate[3] = id
activate[4] = sigmoid
activate[5] = tanh

The code for W is

w = [ [None]*d for _ in range(d) ]

# remember in Python, index starts from 0

w[0][2] = w[1][2] = 0.1


w[0][3] = w[1][3] = -2.0
w[2][4] = w[2][5] = -0.3
w[3][4] = w[3][5] = 0.22

Then the code

x_in = [1.5,2.5]
x = forward_prop(x_in,w)

returns the outgoing vector

x = (1.5, 2.5, 0.4, −8.0, 0.132, −0.954). (7.2.5)

From this, the incoming vector is

x− = (0, 0, 0.4, −8.0, −1.88, −1.88),

and the outputs are

xout = (0.132, −0.954).

Let
y1 = 0.427, y2 = −0.288, y = (y1 , y2 )
be targets, and let J(xout , y) be a function of the outputs xout of the output
nodes, and the targets y. For example, for Figure 7.3, xout = (x5 , x6 ) and we
may take J to be the mean square error function or mean square loss

J(xout, y) = (1/2)(x5 − y1)² + (1/2)(x6 − y2)².    (7.2.6)
The code for this J is

def J(x_out,y):
m = len(y)
return sum([ (x_out[i] - y[i])**2/2 for i in range(m) ])

and the code



x_out = x[-2:]
y0 = [0.132,-0.954]
y = [0.427, -0.288]

J(x_out,y0), J(x_out,y)

returns 0 and 0.266.

By forward propagation, J is also a function of all nodes. Then, at each
node j, we have the derivatives

∂J/∂x−j,    fj′(x−j),    ∂J/∂xj.    (7.2.7)

These are the downstream derivative, local derivative, and upstream derivative
at node j. (The terminology reflects the fact that derivatives are computed
backward.)

Fig. 7.6 Downstream, local, and upstream derivatives at node i.

From (7.2.2),

∂xj/∂x−j = fj′(x−j).    (7.2.8)

By the chain rule and (7.2.8), the key relation between these derivatives is

∂J/∂x−i = (∂J/∂xi) · fi′(x−i),    (7.2.9)

or
downstream = upstream × local.

def local(x,w,i):
return der_dict[activate[i]](incoming(x,w,i))

Let

δi = ∂J/∂x−i,    i = 1, 2, . . . , d.

Then we have the outgoing vector x = (x1 , x2 , . . . , xd ) and the downstream


gradient vector δ = (δ1 , δ2 , . . . , δd ). Strictly speaking, we should write δi−
for the downstream derivatives. However, in §7.4, we don’t need upstream
derivatives. Because of this, we will write δi .
Let xout be the output nodes, and let δout be the downstream derivatives
of J corresponding to xout . Then δout is a function of xout , y, w. We assume
the nodes are ordered so that the terminal portions of x and δ equal xout and
δout respectively,

d = len(x)
m = len(x_out)
x[d-m:] = x_out
delta[d-m:] = delta_out

Once we have the incoming vector x− and outgoing vector x, we can


differentiate J and compute the downstream derivatives δout with respect to
each node in xout . For example, in Figure 7.3, there are two output nodes x5 ,
x6, and we compute

δ5 = ∂J/∂x−5,    δ6 = ∂J/∂x−6,    δout = (δ5, δ6)

as follows. Using (7.2.5) and (7.2.6), the upstream derivative is

∂J/∂x5 = (x5 − y1) = −0.294.

At node 5, the activation function is f5 = σ. Since σ′ = σ(1 − σ), the local
derivative at node 5 is

σ′(x−5) = σ(x−5)(1 − σ(x−5)) = x5(1 − x5) = 0.114.

Hence the downstream derivative at node 5 is

δ5 = upstream × local = −0.294 ∗ 0.114 = −0.0337.

Similarly,
δ6 = −0.059.
We conclude
δout = (−0.0337, −0.059).

The code for this is

# delta_out for mean square error

def delta_out(x_out,y,w):
d = len(w)
m = len(y)
return [ (x_out[i] - y[i]) * local(x,w,d-m+i) for i in range(m) ]

delta_out(x_out,y,w)

We compute δ recursively via back propagation as in §4.4. From Figure


7.5 and (7.2.1) and (7.2.8),

∂J/∂x−i = Σ_{i→j} (∂J/∂x−j) · (∂x−j/∂xi) · (∂xi/∂x−i)
        = (Σ_{i→j} (∂J/∂x−j) · wij) · fi′(x−i).

This yields the downstream derivative at node i,

δi = (Σ_{i→j} δj · wij) · fi′(x−i).    (7.2.10)

The code is

def downstream(x,delta,w,i):
if delta[i] != None: return delta[i]
else:
upstream = sum([ downstream(x,delta,w,j) * w[i][j] if w[i][j]
,→ != None else 0 for j in range(d) ])
return upstream * local(x,w,i)

Using this, we have the third version of back propagation,

# third version: neural networks

def backward_prop(x,y,w):
d = len(w)
delta = [None]*d
m = len(y)
x_out = x[d-m:]
delta[d-m:] = delta_out(x_out,y,w)
for i in range(d-m): delta[i] = downstream(x,delta,w,i)
return delta

With W , x, and targets y as above, the code

delta = backward_prop(x,y,w)

returns

δ = (0.0437, 0.0437, 0.0279, −0.0204, −0.0337, −0.059).

Above we computed the upstream, downstream, and local derivatives of
J at a given node (7.2.7). Since the incoming signals x−j depend also on the
weights wij, J also depends on wij. By (7.2.1),

∂x−j/∂wij = xi,

see also Table 7.4. From this,

∂J/∂wij = (∂J/∂x−j) · (∂x−j/∂wij) = δj · xi.

We have shown

Weight Gradient of Output

If (i, j) is an edge, then

∂J/∂wij = xi · δj.    (7.2.11)

This result is key for neural network training (§7.4).
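As a sketch, the full matrix x ⊗ δ can be formed in one line with numpy,
using the vectors x and δ computed above; only the entries corresponding to
actual edges are used.

from numpy import array, outer

x = array([1.5, 2.5, 0.4, -8.0, 0.132, -0.954])
delta = array([0.0437, 0.0437, 0.0279, -0.0204, -0.0337, -0.059])

grad = outer(x,delta)   # grad[i][j] = x[i]*delta[j], as in (7.2.11)
# for example, grad[2][4] is dJ/dw35 = x3*delta5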

Perceptrons can be assembled in parallel (Figure 7.2). If a network has


no hidden nodes, the network is shallow. In a shallow network, all nodes are
either input nodes or output nodes (Figure 7.7).

A shallow network is dense if all input nodes point to all output nodes:
wij is defined for all i, j. A shallow network can always be assumed dense by
inserting zero weights at missing edges.
Neural networks can also be assembled in series, with each component a
layer (Figure 7.8). Usually each layer is a dense shallow network. For example,
Figure 7.3 consists of two dense shallow networks in layers. We say a network
is deep if there are multiple layers.
The weight matrix W (7.2.3) is 6 × 6, while the weight matrices W1 , W2
of each of the two dense shallow network layers in Figure 7.3 are 2 × 2.
In a single shallow layer with n input nodes and m output nodes (Figure
7.7), let x and z be the layer’s input node vector and output node vector.
Then x and z are n and m dimensional respectively, and W is n × m.

Fig. 7.7 A shallow dense layer.

If we have the same activation function f at every output node, then we
may apply it componentwise,

f(z−) = f(z−1, z−2, . . . , z−m) = (f(z−1), f(z−2), . . . , f(z−m)).

Our convention is to let wij denote the weight on the edge (i, j). With this
convention, the formulas (7.2.1), (7.2.2) reduce to the matrix multiplication
formulas
z − = W t x, z = f (W t x). (7.2.12)
Thus a dense shallow network can be thought of as a vector-valued percep-
tron. This allows for parallelized forward and back propagation.
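Here is a minimal numpy sketch of (7.2.12) for a single dense shallow layer;
the weights and the input are illustrative, and f is taken to be tanh.

from numpy import array, tanh

W = array([[0.1, -0.3, 0.5],
           [0.2, 0.4, -0.6]])   # n x m = 2 x 3, with w[i][j] on edge (i,j)
x = array([1.5, 2.5])           # input node vector, n-dimensional

z_minus = W.T @ x               # incoming vector z- = W^t x
z = tanh(z_minus)               # outgoing vector z = f(W^t x)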

Fig. 7.8 Layered neural network [10].

Exercises

Exercise 7.2.1 Show that a dataset x1 , x2 , . . . , xN lies in a hyperplane


(4.5.9) in Rd iff the augmented dataset (x1 , 1), (x2 , 1), . . . , (xN , 1) does not
span Rd+1 (see §2.7).

7.3 Gradient Descent

Let f (w) be a scalar function of a vector w = (w1 , w2 , . . . , wd ) in Rd . A basic


problem is to minimize f (w), that is, to find or compute a minimizer w∗ ,

f (w) ≥ f (w∗ ), for every w.

This goal is so general that any concrete insight one provides to-
wards it is widely useful in many settings. The setting we have in mind
is f = J, where J is the mean error from §7.1.
Usually f (w) is a measure of cost or lack of compatibility. Because of this,
f (w) is called the loss function or cost function.
A neural network is a black box with inputs x and outputs y, depending on
unknown weights w. To train the network is to select weights w in response
to training data (x, y). The optimal weights w∗ are selected as minimizers
of a loss function f (w) measuring the error between predicted outputs and
actual outputs, corresponding to given training inputs.

From §4.5, if the loss function f (w) is continuous and proper, there is
a global minimizer w∗ . If f (w) is in addition strictly convex, w∗ is unique.
Moreover, if the gradient of the loss function is g = ∇f (w), then w∗ is a
critical point, g ∗ = ∇f (w∗ ) = 0.

Let g(w) be any function of a scalar variable w. From the definition of


derivative (4.1.3), if b is close to a, we have the approximation

(g(b) − g(a))/(b − a) ≈ g′(a).
Inserting a = w and b = w+ ,

g(w+ ) ≈ g(w) + g ′ (w)(w+ − w).

Assume w∗ is a root of g(w) = 0, so g(w∗ ) = 0. If w+ is close to w∗ , then


g(w+ ) is close to zero, so

0 ≈ g(w) + g ′ (w)(w+ − w).

Solving for w+,

w+ ≈ w − g(w)/g′(w).
Since the global minimizer w∗ satisfies f ′ (w∗ ) = 0, we insert g(w) = f ′ (w)
in the above approximation,

w+ ≈ w − f′(w)/f′′(w).

This leads to Newton’s method of computing approximations w0 , w1 , w2 , . . .


of w∗ using the recursion

wn+1 = wn − f′(wn)/f′′(wn),    n = 1, 2, . . .

Because calculating f ′′ (w) is computationally expensive, first-order de-


scent methods replace the second derivative terms f ′′ (wn ) by constants,
known as learning rates.
In the multi-variable case, Newton’s method becomes

wn+1 = wn − D2 f (wn )−1 ∇f (wn ), n = 1, 2, . . . ,

and the second-derivative term is even more expensive to compute.



These first-order methods, collectively known as gradient descent, are the


subject of this chapter. In presenting §7.3 and §7.9, we follow [4], [23], [34],
[36].

Here is code for Newton’s method.

from numpy import *

def newton(loss,grad,curv,w,num_iter):
g = grad(w)
c = curv(w)
trajectory = array([[w],[loss(w)]])
for _ in range(num_iter):
w -= g/c
trajectory = column_stack([trajectory,[w,loss(w)]])
g = grad(w)
c = curv(w)
if allclose(g,0): break
return trajectory

When applied to the function

f(w) = w⁴ − 6w² + 2w,

the code returns the trajectory

def loss(w): return w**4 - 6*w**2 + 2*w # f(w)


def grad(w): return 4*w**3 - 12*w + 2 # f'(w)
def curv(w): return 12*w**2 - 12 # f''(w)

u0 = -2.72204813
w0 = 2.45269774
num_iter = 20
trajectory = newton(loss,grad,curv,w0,num_iter)

which can be plotted

from matplotlib.pyplot import *

def plot_descent(a,b,loss,curv,delta,trajectory):
w = arange(a,b,delta)
plot(w,loss(w),color='red',linewidth=1)
plot(w,curv(w),"--",color='blue',linewidth=1)
plot(*trajectory,color='green',linewidth=1)
scatter(*trajectory,s=10)
title("num_iter= " + str(len(trajectory.T)))

grid()
show()

with the code

ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)

returning Figure 7.9.

Fig. 7.9 Double well newton descent.

A descent sequence is a sequence w0 , w1 , w2 , . . . where the loss function


decreases
f (w0 ) ≥ f (w1 ) ≥ f (w2 ) ≥ . . . .
In a descent sequence, the point after the current point w = wn is the succes-
sive point w+ = wn+1 , and the point before the current point is the previous
point w− = wn−1 . Then (w− )+ = w = (w+ )− .
Recall (§4.3) the gradient ∇f (w) at a given point w is the direction of
greatest increase of the function, starting from w. Because of this, it is natural
to construct a descent sequence by moving, at any given w, in the direction
−∇f (w) opposite to the gradient.

A gradient descent is a descent sequence w0 , w1 , w2 , . . . where each suc-


cessive point w+ is obtained from the previous point w by moving in the
direction opposite to the gradient g = ∇f (w) at w,

Basic Gradient Descent Step

w+ = w − t∇f (w). (7.3.1)

The step-size t, which determines how far to go in the direction opposite


to g, is the learning rate.

Let us unpack (7.3.1), so we understand how it applies to weights in net-


works (§4.4). In a neural network, weights w1 , w2 , . . . are attached to edges,
and the final outputs are combined into a loss function. As a result, the loss
function is a function of the weights,

f (w) = f (w1 , w2 , . . . ).

In (7.3.1), w = (w1, w2, . . . ) is the weight vector, consisting of all the weights
combined into a single vector. By the gradient formula (4.3.2), (7.3.1) is
equivalent to

w1+ = w1 − t ∂f/∂w1,
w2+ = w2 − t ∂f/∂w2,
 . . .

In other words,

Each Weight is Computed Separately

To update a weight in a specific edge using gradient descent, one needs


only the derivative of the loss function relative to that specific weight.
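A minimal sketch of one such update, weight by weight; here grad_f is assumed
to return the vector of partial derivatives of f.

def gd_step(w, grad_f, t):
    g = grad_f(w)
    return [ w[i] - t*g[i] for i in range(len(w)) ]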

Of course, the derivative relative to a specific weight may depend on other


derivatives and other weights, when one applies backpropagation (§4.4). This
principle also holds for modified gradient descent (§7.9).

In practice, the learning rate is selected by trial and error. Which learning
rate does the theory recommend?

Given an initial point w0 , the sublevel set at w0 (see §4.5) consists of all
points w where f (w) ≤ f (w0 ). Only the part of the sublevel set that is
connected to w0 counts.
In Figure 7.10, the sublevel set at w0 is the interval [u0 , w0 ]. The sublevel
set at w1 is the interval [b, w1 ]. Notice we do not include any points to the
left of b in the sublevel set at w1 , because points to the left of b are separated
from w1 by the gap at the point b.

Fig. 7.10 Double well cost function and sublevel sets at w0 and at w1.

Suppose the second derivative D2 f (w) is never greater than a constant L


on the sublevel set. This means

D2 f (w) ≤ L, on f (w) ≤ f (w0 ), (7.3.2)

in the sense the eigenvalues of D2 f (w) are never greater than L.


Because the second derivative is the derivative of the first derivative,
D2 f (w) measures how fast the gradient ∇f (w) changes from point to point.
From this point of view, D2 f (w) is a measure of the curvature of the function
f (w), and (7.3.2) says the rate of change of the gradient is never greater than
L.
Given such a bound L on the curvature, if the learning rate t is no larger
than 1/L, we say we are doing short step gradient descent. Then we have

Short Step Gradient Descent

Let L be as above and w+ as in (7.3.1). If t ≤ 1/L, then

f(w+) ≤ f(w) − (t/2) |∇f(w)|².    (7.3.3)

To see this, fix w and let S be the sublevel set {w′ : f(w′) ≤ f(w)}. Since
the gradient pushes f down, for t > 0 small, w+ stays in S. Insert x = w+
and a = w into the right half of (4.5.20) and simplify. This leads to

f(w+) ≤ f(w) − t |∇f(w)|² + (t²L/2) |∇f(w)|².

Since tL ≤ 1 when 0 ≤ t ≤ 1/L, we have t²L ≤ t. This derives (7.3.3).
The curvature of the loss function and the learning rate are inversely pro-
portional. Where the curvature of the graph of f (w) is large, the learning
rate 1/L is small, and gradient descent proceeds in small time steps.

When the sublevel set is bounded, there is a bound L satisfying (7.3.2).


From §4.5, the sublevel set is bounded when f (w) is proper: Large |w| implies
high cost f (w). The graphs in Figures 4.4, 4.5, 7.10, are proper.
In practice, when the loss function is not proper, it is modified by an extra
term that forces properness. This is called regularization. If the extra term
is proportional to |w|2 , it is ridge regularization, and if the extra term is
proportional to |w|, it is LASSO regularization.

Now let w0, w1, w2, . . . be a short-step gradient descent sequence, t ≤ 1/L.
By (7.3.3), wn remains in the sublevel set f(w) ≤ f(w0). If this sublevel set is
bounded, wn subconverges to a limit w∗ (Appendix A.7). Inserting w = wn,
w+ = wn+1 in (7.3.3),

f(wn+1) ≤ f(wn) − (t/2) |∇f(wn)|².

Since f(wn) and f(wn+1) both converge to f(w∗), and ∇f(wn) converges to
∇f(w∗), we conclude

f(w∗) ≤ f(w∗) − (t/2) |∇f(w∗)|².

Since this implies ∇f(w∗) = 0, we have derived the following.

Gradient Descent Converges to a Critical Point

Fix an initial weight w0 and let L be as above. If the short-step gra-


dient descent sequence starting from w0 converges to some point w∗ ,
then w∗ is a critical point.

For example, let f(w) = w⁴ − 6w² + 2w (Figures 7.9, 7.10, 7.11). Then

f′(w) = 4w³ − 12w + 2,    f′′(w) = 12w² − 12.



Thus the inflection points (where f ′′ (w) = 0) are ±1 and, in Figure 7.10, the
critical points are a, b, c.
Let u0 and w0 be the points satisfying f (w) = 5 as in Figure 7.11.
Then u0 = −2.72204813 and w0 = 2.45269774, so f ′′ (u0 ) = 76.914552 and
f ′′ (w0 ) = 60.188. Thus we may choose L = 76.914552. With this L, the
short-step gradient descent starting at w0 is guaranteed to converge to one
of the three critical points. In fact, the sequence converges to the right-most
critical point c (Figure 7.10).
This exposes a flaw in basic gradient descent. Gradient descent may con-
verge to a local minimizer, and miss the global minimizer. In §7.9, modified
gradient descent will address some of these shortcomings.

Fig. 7.11 Double well gradient descent.

The code for gradient descent is

from numpy import *


from matplotlib.pyplot import *

def gd(loss,grad,w,learning_rate,num_iter):
g = grad(w)
trajectory = array([[w],[loss(w)]])
for _ in range(num_iter):
w -= learning_rate * g
trajectory = column_stack([trajectory,[w,loss(w)]])
g = grad(w)
if allclose(g,0): break
return trajectory

When applied to the double well function f (w),

u0 = -2.72204813
w0 = 2.45269774
L = 76.914552
learning_rate = 1/L
num_iter = 100
trajectory = gd(loss,grad,w0,learning_rate,num_iter)

ylim(-15,10)
delta = .01
plot_descent(u0,w0,loss,curv,delta,trajectory)

the code returns Figure 7.11.

7.4 Network Training

A neural network with weight matrix W defines an input-output map

xin → xout .

Given inputs xin and target outputs y, we seek to modify the weight matrix
W so that the input-output map is

xin → y.

This is training.
Let (§7.2)

x− = (x−1, x−2, . . . , x−d),    x = (x1, x2, . . . , xd)

be the network’s incoming vector and outgoing vector, and let

δ = (δ1 , δ2 , . . . , δd )

be the downstream gradient vector, relative to some mean error function J.


From (7.2.1),

∂J/∂wij = (∂J/∂x−j) · (∂x−j/∂wij) = (∂J/∂x−j) · xi = xi δj.    (7.4.1)

This we derived as (7.2.11), but here it is again:



The Weight Gradient of J is a Tensor Product

Let wij be the weight along an edge (i, j), let xi be the outgoing
signal from the i-th node, and let δj be the downstream derivative
of the output J with respect to the j-th node. Then the derivative
∂J/∂wij equals xi δj . In this partial sense,

∇W J = x ⊗ δ. (7.4.2)

When W is the weight matrix between successive layers in a layered neural


network (Figure 7.8), (7.4.2) is not partial, it is exactly correct.
Using (7.4.1), we update the weight wij using gradient descent

def update_weights(x,delta,w,learning_rate):
d = len(w)
for i in range(d):
for j in range(d):
if w[i][j]:
w[i][j] = w[i][j] - learning_rate*x[i]*delta[j]

The learning rate is discussed in §7.3. The triple

forward propagation → backward propagation → update weights

is an iteration. Starting with a given W0 , we repeat this iteration until we


obtain the target outputs y. Here is the code.

def train_nn(x_in,y,w0,learning_rate,n_iter):
trajectory = []
cost = 1
# build a local copy
w = [ row[:] for row in w0 ]
d = len(w0)
for _ in range(n_iter):
x = forward_prop(x_in,w)
delta = backward_prop(x,y,w)
update_weights(x,delta,w,learning_rate)
m = len(y)
x_out = x[d-m:]
cost = J(x_out,y)
trajectory.append(cost)
if allclose(0,cost): break
return w, trajectory

Here n_iter is the maximum number of iterations allowed, and the iterations
stop if the cost J is close to zero.
The cost or error function J enters the code only through the function
delta_out, which is part of the function backward_prop.

Let W0 be the weight matrix (7.2.4). Then

x_in = [1.5,2.5]
learning_rate = .01
y0 = 0.4265356063
y1 = -0.2876478137
y = [y0,y1]
n_iter = 10000

w, trajectory = train_nn(x_in,y,w0,learning_rate,n_iter)

returns the cost trajectory, which can be plotted using the code

from matplotlib.pyplot import *

for lr in [.01,.02,.03,.035]:
w, trajectory = train_nn(x_in,y,w0,lr,n_iter)
n = len(trajectory)
label = str(n) + ", " + str(lr)
plot(range(n),trajectory,label=label)

grid()
legend()
show()

resulting in Figure 7.12.

Fig. 7.12 Cost trajectory and number of iterations as learning rate varies.

The convergence here is surprisingly easy to attain. However, the conver-


gence here is a mirage. It is a reflection of overfitting, in the sense that we
trained the weights to obtain the input-output map corresponding to a single
sample: There is no reason the trained weights reproduce the input-output
map for other samples.
Only after we train the weights repeatedly against all samples in a training
dataset, can we hope to achieve training with some predictive power.

Stochastic Gradient Descent

⋆ under construction ⋆

7.5 Linear Regression

Let x1 , x2 , . . . , xN be a dataset, with corresponding labels or targets y1 , y2 ,


. . . , yN. As in §7.1, the loss function is

J(W) = Σ_{k=1}^N J(xk, yk, W).    (7.5.1)

In this section, we focus on a single-layer perceptron (Figure 7.13),

J(x, y, W ) = J(z, y), z = W t x.

Here x is the input, W is the weight matrix, z is the network computed


output, and y is the desired output or target.
The loss function (7.5.1) has no bias inputs. When there are bias inputs
b, the loss function is

J(W, b) = Σ_{k=1}^N J(xk, yk, W, b),    (7.5.2)

and we focus on a single-layer perceptron

J(x, y, W, b) = J(z, y), z = W t x + b.

A basic attribute of a neural network is its trainability. Can a given network


be trained to achieve desired input-output behavior? As stated, this question

is imprecise and not clearly defined. In fact, for deep networks, it is not at
all clear how to turn this vague idea into an actionable definition.
In the case of a single-layer perceptron, the situation is straightforward
enough to be able to both make the question precise, and to provide action-
able criteria that guarantee trainability. This we do in the two cases
• linear regression, and
• logistic regression.
With any loss function J, the goal is to minimize J. With this in mind,
from §4.5, we recall

Ideal Loss Function

If a loss function J(W ) is strictly convex and proper, then J has a


unique optimal weight W ∗ ,

J(W ∗ ) ≤ J(W ),

characterized as the unique weight W ∗ satisfying ∇W J(W ∗ ) = 0.

Often, in machine learning, J is neither convex nor proper. Nevertheless,


this result is an important benchmark to start with. Lack of properness is
often addressed by regularization, which is the modification of J by a proper
forcing term. Lack of convexity is addressed by using some type of accelerated
gradient descent.
It is natural to say a loss function (7.5.1) is trainable if it is proper (§4.5),
because this guarantees the existence of optimal weights. In the case of a
single-layer perceptron, strict convexity is easy to pin down, leading to the
uniqueness of optimal weights.
Because of this, for a single-layer perceptron, we say the regression is
trainable if the loss function (7.5.1) (without bias) or the loss function (7.5.2)
(with bias) is proper and strictly convex. In this and the next section, we
determine conditions on the dataset that guarantee trainability in the above
two cases. We do this when there are no bias inputs, and when there are bias
inputs, so there are four cases in all.

For linear regression without bias, the loss function is (7.5.1) with

J(x, y, W) = (1/2)|y − z|²,    z = W t x.    (7.5.3)
Then (7.5.1) is the mean square error or mean square loss, and the problem
of minimizing (7.5.1) is linear regression (Figure 7.13).

We use the identities (1.4.16) and (1.4.17) to compute the gradient of


J(x, y, W ). As mentioned in §2.2, these remain valid for any matrices and
vectors with matching shapes.
Let V be a weight matrix, and let v = V t x, z = W t x. Then (W + sV)t x =
z + sv, and the directional derivative is

(d/ds) J(x, y, W + sV) = (d/ds) (1/2)|z + sv − y|²
                       = v · (z + sv − y) = (V t x) · (z + sv − y)    (7.5.4)
                       = trace((V t x) ⊗ (z + sv − y))
                       = trace(V t (x ⊗ (z + sv − y))).


By (4.3.4), inserting s = 0, (7.5.4) implies the weight gradient for mean


square loss is

G = ∇W J(x, y, W ) = x ⊗ (z − y), z = W t x. (7.5.5)

Note this result is a special case of (7.4.2).
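As a sketch, the full gradient of J(W) is the sum of (7.5.5) over the dataset;
here X and Y are assumed to be arrays holding one sample and one target per
row, and are not taken from any particular dataset.

from numpy import zeros, outer

def grad_J(X, Y, W):
    G = zeros(W.shape)
    for x, y in zip(X, Y):
        z = W.T @ x
        G += outer(x, z - y)   # (7.5.5): x tensor (z - y)
    return G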


Differentiating (7.5.4) with respect to s and inserting s = 0,

d2
J(x, y, W + sV ) = |v|2 = |V t x|2 . (7.5.6)
ds2 s=0

Since this is nonnegative, by (4.5.18), J(x, y, W ) is a convex function of W .

Fig. 7.13 Linear regression neural network with no bias inputs: z = W t x, J = |z − y|²/2.



Since J(W ) is the sum of J(x, y, W ) over all samples, J(W ) is convex. To
check strict convexity of J(W), suppose

(d²/ds²)|_{s=0} J(W + sV) = 0.

Then (7.5.6) vanishes for all samples x = xk , y = yk , which implies

V t xk = 0, k = 1, 2, . . . , N. (7.5.7)

Recall the feature space is the vector space of all inputs x, and (§2.9) a
dataset is full-rank if the span of the dataset is the entire feature space. When
this happens, (7.5.7) implies V = 0. By (4.5.19), J(W ) is strictly convex.
To check properness of J(W), by definition (4.5.12), we show there is a
bound C with

J(W) ≤ c =⇒ ∥W∥ ≤ C√d.    (7.5.8)
Here ∥W ∥ is the norm of the matrix W (2.2.12). The exact formula for the
bound C, which is not important for our purposes, depends on the level c
and the dataset.
If J(W ) ≤ c, by (7.5.1), (7.5.3), and the triangle inequality,

|W t xk| ≤ √(2c) + |yk|,    k = 1, 2, . . . , N.

If x is in the span of the dataset, then x is a linear combination of samples


xk . Hence there is a bound C(x), depending on x but not on W , such that

|W t x| ≤ C(x). (7.5.9)

Let e1 , e2 , . . . be the standard basis in feature space, and assume the


dataset is full-rank. Let C be the largest of C(e1 ), C(e2 ), . . . . Then e1 , e2 ,
. . . are in the span of the dataset. By (2.2.12) and (7.5.9), inserting x = ej,

∥W∥² = Σ_j |W t ej|² ≤ dC².

Since this establishes (7.5.8), we have shown

Trainability: Linear Regression Without Bias

Suppose the dataset x1 , x2 , . . . , xN is full-rank. Then linear regression


without bias is trainable on weights W .

This is a simple, clear geometric criterion for convergence of gradient de-


scent to the global minimum of J, valid for linear regression with no bias
inputs.

For linear regression with bias, the loss function is (7.5.2) with

J(x, y, W, b) = (1/2)|y − z|²,    z = W t x + b.    (7.5.10)
Here W is the weight matrix and b is a bias vector.
If we augment the dataset x1 , x2 , . . . , xN to (x1 , 1), (x2 , 1), . . . , (xN , 1),
then this corresponds to the augmented weight matrix

    | W  |
    | bt |.

Applying the last result to the augmented dataset and appealing to Exer-
cise 7.2.1, we obtain

Trainability: Linear Regression With Bias

Suppose the dataset x1 , x2 , . . . , xN does not lie in a hyperplane. Then


linear regression with bias is trainable on weights (W, b).

These are simple, clear geometric criteria for convergence of gradient de-
scent to the global minimum of J, valid for linear regression with or without
bias inputs.
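A short sketch of the augmentation used here: appending a constant 1 feature
to each sample turns the bias b into one extra row of the weight matrix (the
array X below is illustrative).

from numpy import array, column_stack, ones

X = array([[1.0, 2.0],
           [3.0, 4.0],
           [5.0, 0.0]])                    # N x d dataset
X_aug = column_stack([X, ones(len(X))])    # samples (x_k, 1) in d+1 dimensions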

Exercises

7.6 Logistic Regression

Recall (§5.3) a vector p = (p1 , p2 , . . . , pd ) is a probability vector if each com-


ponent p1 , p2 , . . . , pd is nonnegative (positive or zero), and the components
sum to one, p1 + p2 + · · · + pd = 1.
A probability vector p is strict if the components are all positive (none are
zero). A probability vector p is one-hot encoded if one of the components is
one. When this is the i-th component, we say p is one-hot encoded at slot i.
When this happens, all other components are zero.
Let x1 , x2 , . . . , xN be a dataset. In logistic regression, a dataset is com-
posed of classes and the targets are probability vectors. Here we can assign
targets to classes, or we can assign classes to targets.
A dataset is a two-class dataset if it composed of two disjoint classes. More
generally, a dataset is a multi-class dataset if it is composed of d ≥ 2 disjoint
classes.

In a multi-class dataset, if a sample xk lies in class i, the target pk assigned


to xk is the probability vector that is one-hot encoded (§2.4) at slot i. Here
we start with a multi-class dataset, and we assign targets to classes.
For example, if there are three classes, as in the Iris dataset, the probability
vector pk is one of

(1, 0, 0), (0, 1, 0), (0, 0, 1),

according to the class of the sample xk .


On the other hand, given a probability vector p = (p1 , p2 , . . . , pd ), let

max p = max(p1 , p2 , . . . , pd ).

Let x1 , x2 , . . . , xN be a dataset with associated target probability vectors p1 ,


p2 , . . . , pN . Then the i-th class may be defined as the samples with targets
satisfying pi = max p. Alternatively, the i-th class may be defined as the
samples with targets satisfying pi > 0. Here we start with target probability
vectors, and we assign classes reflecting the targets.
When classes are assigned to targets, they need not be disjoint. Because of
this, they are called soft classes. Summarizing, a soft-class dataset is a dataset
x1 , x2 , . . . , xN with targets p1 , p2 , . . . , pN consisting of probability vectors.
Below we show how logistic regression works for all soft-class datasets.

We start with logistic regression without bias inputs. For logistic regres-
sion, the loss function is

J(W) = Σ_{k=1}^N J(xk, pk, W),    (7.6.1)

with (see §5.6)

J(x, p, W ) = I(p, q), q = σ(y), y = W t x.

Here I(p, q) is the relative information, measuring the information error be-
tween the desired target p and the computed target q, and q = σ(y) is the
softmax function, squashing the network’s output y = W t x into the proba-
bility q.
When p is one-hot encoded, by (5.6.16),

J(x, p, W ) = Icross (p, σ(W t x)).

Because of this, in the literature, in the one-hot encoded case, (7.6.1) is called
the cross-entropy loss.

J(W ) is logistic loss or logistic error, and the problem of minimizing (7.6.1)
is logistic regression (Figure 7.14).
Since we will be considering both strict and one-hot encoded probabilities,
we work with I(p, q) rather than Icross (p, q). Table 5.33 is a useful summary
of the various information and entropy concepts.

Fig. 7.14 Logistic regression neural network without bias inputs: y = W t x, q = σ(y), J = I(p, q).

In §5.6, we defined 1 = (1, 1, . . . , 1), and a vector v was centered if v ·1 = 0.


Here we define a matrix W as centered if

W 1 = 0, (7.6.2)

or

Σ_{j=1}^d wij = 0,    i = 1, 2, . . . , d.

With this understood, if W is centered, and y = W t x, then y is centered.
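A quick numerical check of (7.6.2); the matrix W below is an illustrative
centered matrix, with each row summing to zero.

from numpy import array, ones, allclose

W = array([[ 0.2, -0.5,  0.3],
           [-0.1,  0.4, -0.3],
           [ 0.6, -0.2, -0.4]])

allclose(W @ ones(3), 0)   # True: W 1 = 0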


We compute the gradient ∇W J(x, p, W ). By (5.6.3) and (5.6.15),

∇y I(p, σ(y)) = ∇y Z(y) − p = q − p, q = σ(y), (7.6.3)

and, by (5.6.10),

Dy2 I(p, σ(y)) = D2 Z(y) = diag(q) − q ⊗ q, q = σ(y). (7.6.4)



Let V be a centered weight matrix, and let v = V t x, y = W t x. Then


(W + sV)t x = y + sv, and, by (7.6.3), the directional derivative is

(d/ds)|_{s=0} J(x, p, W + sV) = (d/ds)|_{s=0} I(p, σ(y + sv))
                              = v · (q − p) = (V t x) · (q − p)
                              = trace((V t x) ⊗ (q − p))
                              = trace(V t (x ⊗ (q − p))).


By (4.3.4), this shows the gradient for log loss is

G = ∇W J(x, p, W ) = x ⊗ (q − p), q = σ(W t x). (7.6.5)

As before, this result is a special case of (7.4.2). Since q and p are probability
vectors, p · 1 = 1 = q · 1, hence the gradient G is centered.
Recall (§5.6) we have strict convexity of Z(y) along centered vectors y,
those vectors satisfying y · 1 = 0. Since y = W t x, y · 1 = x · W 1. Hence, to
force y · 1 = 0, it is natural to assume W is centered.
If we initiate gradient descent with a centered weight matrix W , since the
gradient G is also centered, all successive weight matrices will be centered.
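Here is a minimal sketch of the gradient (7.6.5) for a single sample; the
sample x, the target p, and the (centered) weight matrix W are illustrative
values.

from numpy import array, exp, outer

def softmax(y):
    e = exp(y - y.max())   # subtracting the max does not change the result
    return e/e.sum()

x = array([1.0, 2.0])
p = array([1.0, 0.0, 0.0])          # one-hot target
W = array([[ 0.1, -0.2,  0.1],
           [ 0.0,  0.3, -0.3]])     # rows sum to zero, so W is centered

q = softmax(W.T @ x)
G = outer(x, q - p)                 # weight gradient x tensor (q - p)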

Turning to convexity, we establish

Strict Convexity: Logistic Regression Without Bias

Suppose the dataset x1 , x2 , . . . , xN is full-rank. Then the logistic loss


J(W ) without bias is strictly convex on centered weights W .

To see this, given a vector v and probability vector q, set v̄ = Σ_{j=1}^d vj qj.
Then

Σ_{j=1}^d vj² qj − (Σ_{j=1}^d vj qj)² = Σ_{j=1}^d (vj − v̄)² qj.

If either side is zero, and q is strict, then v = v̄1, so v is a multiple of 1.


From this identity, and by (4.5.18) and (7.6.4), the second derivative of
I(p, σ(y)) in the direction of a vector v is
(d²/ds²)|_{s=0} I(p, σ(y + sv)) = Σ_{j=1}^d (vj − v̄)² qj,    q = σ(y).

Let V be a centered weight matrix and let v = V t x. Then v·1 = x·V 1 = 0,


so v is centered, and

(W + sV )t x = y + sv.
If y = W t x, it follows the second derivative of J(x, p, W ) in the direction of
V is
(d²/ds²)|_{s=0} J(x, p, W + sV) = Σ_{j=1}^d (vj − v̄)² qj,    v = V t x.    (7.6.6)

This shows the second derivative of J(x, p, W ) is nonnegative, establishing


the convexity of J(x, p, W ). Since J(W ) is the sum of J(x, p, W ) over all
samples, we conclude J(W ) is convex.
Moreover, if (7.6.6) vanishes, then, by the previous paragraph, since q =
σ(y) is strict, v is a multiple of 1. Since v is centered, it follows v = 0
(Exercise 5.6.1). Since v = V t x, the vanishing of (7.6.6) implies V t x = 0.
If

(d²/ds²)|_{s=0} J(W + sV) = Σ_{k=1}^N (d²/ds²)|_{s=0} J(xk, pk, W + sV)

vanishes, then, since the summands are nonnegative, (7.6.6) vanishes, for
every sample x = xk , p = pk , hence

V t xk = 0, k = 1, 2, . . . , N.

When the dataset is full-rank, this implies V = 0. This establishes strict


convexity of J(W ) on centered weights.

Now we turn to properness of J(W ).

Properness: Logistic Regression Without Bias

Let x1 , x2 , . . . , xN be a dataset with corresponding targets p1 , p2 , . . . ,


pN . For each class i, let Ki be the convex hull of the samples x whose
corresponding targets p = (p1 , p2 , . . . , pd ) satisfy pi > 0. If the span of
the intersection Ki ∩ Kj is full-rank for every class i and class j, then
the logistic loss J(W ) without bias is proper on centered weights W .

The convex hull is discussed in §4.5, see Figures 4.23 and 4.24. If Ki were
just the samples x whose corresponding targets p satisfy pi > 0 (with no
convex hull), then the intersection Ki ∩ Kj may be empty.
For example, if p were one-hot encoded, then x belongs to at most one Ki .
Thus taking the convex hull in the definition of Ki is crucial. This is clearly
seen in Figure 7.26: The samples never intersect, but the convex hulls may
do so.
To establish properness of J(W), by definition (4.5.12), we show

W 1 = 0 and J(W) ≤ c =⇒ ∥W∥ ≤ dC    (7.6.7)

for some C. The exact formula for the bound C, which is not important for
our purposes, depends on the level c and the dataset.
Suppose J(W ) ≤ c, with W 1 = 0 and let q = σ(y). Then I(p, q) =
J(x, p, W ) ≤ c for every sample x and corresponding target p.
Let x be a sample, let y = W t x, and suppose the corresponding target p
satisfies pi ≥ ϵ, for some class i, and some ϵ > 0. If j ̸= i, then
ϵ(yj − yi) ≤ ϵ(Z(y) − yi) ≤ pi(Z(y) − yi) ≤ Σ_{k=1}^d pk(Z(y) − yk) = Z(y) − p · y.

By (5.6.15),
Z(y) − p · y = I(p, σ(y)) − I(p) ≤ c + log d.
Combining the last two inequalities,

ϵ(yj − yi ) ≤ c + log d.

By definition of Ki , pi > 0 for all targets p corresponding to samples x


in Ki . Therefore there is a positive ϵi such that pi ≥ ϵi for all targets p
corresponding to samples x in Ki . Let ϵ be the least of ϵ1 , ϵ2 , . . . , ϵd . Then

ϵ(yj − yi ) ≤ c + log d, j ̸= i, for samples x in Ki .

By taking convex combinations of samples x in Ki , the last inequality


remains valid for all x in Ki , so

ϵ(yj − yi ) ≤ c + log d, j ̸= i, for all x in Ki .

Repeating the same argument for x in Kj ,

ϵ(yi − yj ) ≤ c + log d, j ̸= i, for all x in Kj .

Combining the last two inequalities,

ϵ|yi − yj | ≤ c + log d, j ̸= i, for all x in Ki ∩ Kj . (7.6.8)

Let x be any vector in feature space, and let y = W t x. Since span(Ki ∩Kj )
is full-rank, x is a linear combination of vectors in Ki ∩ Kj , for every i and j.
This implies, by (7.6.8), there is a bound C(x), depending on x but not on
W , such that

|yi − yj | ≤ C(x), for every vector x and i and j. (7.6.9)


Since y · 1 = 0, yi = −Σ_{j≠i} yj. Summing (7.6.9) over j ≠ i,

d|yi| = |(d − 1)yi + yi| = |Σ_{j≠i} (yi − yj)| ≤ (d − 1)C(x).

Let e1 , e2 , . . . be the standard basis in feature space, and let C be the


largest of C(e1 ), C(e2 ), . . . . Since y = W t x, yi = (W t x) · ei = x · (W ei ).
Inserting x = ej ,

|wji | = |ej · W ei | ≤ C, i, j = 1, 2, . . . , d.

By (2.2.12),

∥W∥² = Σ_{i,j} |wij|² ≤ d²C².

Thus dC is a bound, depending only on level c and the dataset, satisfying


(7.6.7).

If the span of Ki ∩ Kj is full-rank, then the span of the dataset itself is


full-rank. Putting the last two results together, we conclude

Trainability: Logistic Regression Without Bias

Let x1 , x2 , . . . , xN be a dataset with corresponding targets p1 , p2 , . . . ,


pN . For each class i, let Ki be the convex hull of the samples x whose
corresponding targets p = (p1 , p2 , . . . , pd ) satisfy pi > 0. If the span of
the intersection Ki ∩ Kj is full-rank for every class i and class j, then
logistic regression without bias is trainable on centered weights W .

By the definition of Ki here, the union of Ki over classes i = 1, 2, . . . , d


contains the whole dataset. This is not necessarily the case in the results
below.
As a special case, let K be the samples whose corresponding targets are
strict. Then K ⊂ Ki for all classes i. If the span of K is full-rank, then
span(Ki ∩ Kj ) is full-rank. This derives the first consequence,

Trainability: Strict Logistic Regression Without Bias

Let x1 , x2 , . . . , xN be a dataset, with corresponding targets p1 , p2 ,


. . . , pN . Let K be the convex hull of the samples whose corresponding
targets are strict. If the span of K is full-rank, then logistic regression
without bias is trainable on centered weights W .

If a target p is one-hot encoded at slot i, then pi = 1 > 0. This derives the


second consequence,

Trainability: One-hot Encoded Logistic Regression Without


Bias
Let x1 , x2 , . . . , xN be a dataset with corresponding targets p1 , p2 ,
. . . , pN . For each class i, let Ki be the convex hull of the samples
whose corresponding targets are one-hot encoded at slot i. If the span
of the intersection Ki ∩ Kj is full-rank for every i and j, then logistic
regression without bias is trainable on centered weights W .

In this case, each sample x belongs in at most one Ki , so taking convex


hulls is crucial, see the examples in the next section. Here not all samples
need be one-hot encoded: The requirement is that there is sufficient overlap
between the targets that are one-hot encoded.

For logistic regression with bias, the loss function is


J(W, b) = Σ_{k=1}^N J(xk, pk, W, b),   (7.6.10)

with
J(x, p, W, b) = I(p, q), q = σ(y), y = W t x + b.
Here W is the weight matrix and b is the bias vector. In keeping with our
prior convention, we call the weight (W, b) centered if W is centered and b is
centered. Then y is centered.
If the columns of W are (w1 , w2 , . . . , wd ), and b = (b1 , b2 , . . . , bd ), then
y = W t x + b is equivalent to levels corresponding to d hyperplanes

y1 = w1 · x + b1,
y2 = w2 · x + b2,
. . .
yd = wd · x + bd.          (7.6.11)

The scalars y1 , y2 , . . . , yd are the outputs corresponding to the sample x and


weight (W, b).
Let x1 , x2 , . . . , xN be a dataset, and suppose (W, b) is a weight with
vanishing outputs yk = 0, k = 1, 2, . . . , N . If W ̸= 0, then at least one of the
columns wj is nonzero, hence the dataset lies in the hyperplane wj ·x+bj = 0.
On the other hand, if the dataset lies in a hyperplane, then there is a weight
(W, b) with W ̸= 0 such that the outputs vanish (Exercise 7.6.1). Because of
this, we call a weight (W, b) satisfying W ̸= 0 a hyperplane weight.
Let x1 , x2 , . . . , xN be a soft-class dataset with associated target probability
vectors p1 , p2 , . . . , pN . Suppose there are d possibly overlapping classes, and

suppose for each sample x in class i, the corresponding target p satisfies


pi > 0. This assumption covers the two cases, strict and one-hot encoded,
discussed above.
We leave strict convexity of the loss function to Exercise 7.6.6, and we
focus on properness of the loss function.
In §4.5, we defined separating hyperplanes and separable two-class datasets.
There are at least two generalizations of separability to soft-class datasets.
They are strong separability (“all-against-all”), and weak separability (“some-
against-some”). Let y1 , y2 , . . . , yd be the outputs (7.6.11).
A dataset is strongly separable if there is a hyperplane separating class i
from the rest of the dataset, for every i = 1, 2, . . . , d. By Exercise 7.6.3, this
is the same as saying there is a weight (W, b) such that

yi ≥ 0, for x in class i,
yi ≤ 0, for x in class j,
for every i = 1, 2, . . . , d and every j ≠ i.   (7.6.12)

Here again the hyperplanes are decision boundaries.


On the other hand, a dataset is weakly separable if there is a hyperplane
separating some class i and some class j ̸= i. By Exercise 7.6.2, this is the
same as saying there is a weight (W, b) such that

yi ≥ 0, for x in class i,
yi ≤ 0, for x in class j,
for some i = 1, 2, . . . , d and some j ≠ i.   (7.6.13)

Clearly strong separability implies weak separability. In a two-class dataset,


strong separability equals weak separability and both equal separability as
defined in (4.5.8).
If a dataset lies in a hyperplane (4.5.9), the dataset is separable, in both
strong and weak senses. Thus the question of separability is only interesting
when the dataset does not lie in a hyperplane.

Recall (§4.5) a set K has interior if there is a ball B in K. For each


i = 1, 2, . . . , d, let Ki be the convex hull of the samples in class i. Then Ki
has interior iff class i does not lie in a hyperplane (Exercise 4.5.7).
By hyperplane separation II (§4.5), we have

Weak Separability and Interiors

Assume none of the classes lie in a hyperplane. Then the soft-class


dataset is weakly separable iff Ki ∩ Kj has no interior for some i and

some j ̸= i. Equivalently, the soft-class dataset is not weakly separable


iff Ki ∩ Kj has interior for every i and every j ̸= i.

We use this to derive the main result

Trainability: Logistic Regression With Bias

If the soft-class dataset is strongly separable, logistic regression with


bias is not trainable. If none of the classes lie in a hyperplane and the
soft-class dataset is not weakly separable, logistic regression with bias
is trainable on centered weights (W, b).

As special cases, there are corresponding results for strict targets and one-
hot encoded targets.
To begin the proof, suppose (W, b) satisfies (7.6.12). Then (Exercise 7.6.4)

yi ≥ 0, for x in Ki,
yj ≤ 0, for x in Ki and every j ≠ i,
for every i = 1, 2, . . . , d.   (7.6.14)

From this, one obtains I(p, σ(y)) ≤ log d for every sample x and q = σ(y)
(Exercise 7.6.5). Since this implies J(W, b) ≤ N log d, the loss function is not
proper, hence not trainable.
By Exercise 7.6.6, for trainability, it is enough to check properness. To
establish properness of the loss function, suppose none of the classes lie in
a hyperplane and the dataset is not weakly separable. Then Ki ∩ Kj has
interior for all i and all j ̸= i. Let x∗ij be the centers of balls in Ki ∩ Kj for
each i ̸= j. By making the balls small enough, we may assume the radii of
the balls equal the same r > 0.
Let ϵi > 0 be the minimum of pi over all probability vectors p correspond-
ing to samples x in class i. Let ϵ be the least of ϵ1 , ϵ2 , . . . , ϵd . Then ϵ is
positive.
Suppose J(W, b) ≤ c for some level c, with W = (w1 , w2 , . . . , wd ),
b = (b1 , b2 , . . . , bd ) centered. We establish properness of the loss function
by showing
 
|wi| + |bi| ≤ ((c + log d)/(rϵ)) (1 + r + (1/(d − 1)) Σ_{j≠i} |x∗ij|),   i = 1, 2, . . . , d.   (7.6.15)
The exact form of the right side of (7.6.15) doesn’t matter. What matters is
the right side is a constant depending only on the dataset, the targets, the
number of categories d, and the level c.
If J(W, b) ≤ c, then I(p, q) ≤ c for each sample x. As before, this leads to
(7.6.8).

Let v be a unit vector, and let

x± = x∗ij ± rv, yi± = wi · x± + bi , yj± = wj · x± + bj .

Since x± are in Ki ∩ Kj , by (7.6.8),

2rϵ|(wi − wj ) · v| = ϵ|(yi+ − yj+ ) − (yi− − yj− )| ≤ 2(c + log d).

Optimizing over all v, or choosing v = (wi − wj )/|wi − wj |, we obtain

rϵ|wi − wj | ≤ c + log d.

Let
yi = wi · x∗ij + bi , yj = wj · x∗ij + bj .
Since x∗ij is in Ki ∩ Kj , by (7.6.8),

rϵ|bi − bj| ≤ rϵ|yi − yj| + rϵ|(wi − wj) · x∗ij|
≤ r(c + log d) + rϵ|wi − wj| |x∗ij| ≤ (c + log d)(r + |x∗ij|).

Hence

rϵ|wi − wj | + rϵ|bi − bj | ≤ (c + log d) · (1 + r + |x∗ij |). (7.6.16)

Since W is centered,

dwi = (d − 1)wi + wi = (d − 1)wi − Σ_{j≠i} wj = Σ_{j≠i} (wi − wj).

Similarly, since b is centered,

dbi = Σ_{j≠i} (bi − bj).

Hence

|wi| + |bi| ≤ (1/d) Σ_{j≠i} (|wi − wj| + |bi − bj|).

Combining this with (7.6.16) results in (7.6.15), and establishes properness


of the loss function. This completes the proof of the main result.

A very special case is a two-class dataset. In this case, the result is com-
pelling:

Trainability: Two-Class Logistic Regression With Bias

Assume neither class lies in a hyperplane. Then logistic regression


with bias is trainable iff the two-class dataset is not separable.

To highlight this result, a two-class dataset is either separable or it is not.


If it is separable, then a support vector machine [16] computes an optimal
decision boundary. If it is not separable, then (assuming neither class lies
in a hyperplane) logistic regression with bias computes an optimal decision
boundary.

We end the section by comparing the three regressions: linear, strict logis-
tic, and one-hot encoded logistic.
In classification problems, it is one-hot encoded logistic regression that is
relevant. Because of this, in the literature, logistic regression often defaults
to the one-hot encoded case.
In linear regression, not only do J(W ) and J(W, b) have minima, but so
does J(z, y). Properness ultimately depends on properness of a quadratic |z|2 .
In strict logistic regression, by (7.6.3), the critical point equation

∇y J(y, p) = 0

can always be solved, so there is at least one minimum for each J(y, p). Here
properness ultimately depends on properness of Z(y).
In one-hot encoded regression, J(y, p) = I(p, σ(y)) and ∇y J(y, p) = 0 can
never be solved, because q = σ(y) is always strict and p is one-hot encoded,
see (7.6.5). Nevertheless, trainability of J(W ) and J(W, b) is achievable if
there is sufficient overlap between the sample categories.
In linear regression, the minimizer is expressible in terms of the regression
equation, and thus can be solved in principle using the pseudo-inverse. In
practice, when the dimensions are high, gradient descent may be the only
option for linear regression.
In logistic regression, the minimizer cannot be found in closed form, so we
have no choice but to apply gradient descent, even for low dimensions.

Exercises

Exercise 7.6.1 Show a dataset x1 , x2 , . . . , xN lies in a hyperplane iff there


is a weight (W, b) with W ̸= 0 such that the outputs y1 , y2 , . . . , yN are all
zero.

Exercise 7.6.2 Show a dataset x1 , x2 , . . . , xN is weakly separable iff (7.6.13)


holds.
Exercise 7.6.3 Show a dataset x1 , x2 , . . . , xN is strongly separable iff
(7.6.12) holds.
Exercise 7.6.4 Show a dataset x1 , x2 , . . . , xN is strongly separable iff
(7.6.14) holds.
Exercise 7.6.5 Let (W, b) be strongly separating, and let y = W t x+b. Using
(5.6.15) and (7.6.14), show I(p, σ(y)) ≤ log d for every sample x and q = σ(y).
Exercise 7.6.6 Let J(W, b) be the logistic loss function with bias inputs.
Then J(W, b) is convex. If the dataset does not lie in a hyperplane, then
J(W, b) is strictly convex.
Exercise 7.6.7 Suppose the multi-class dataset does not lie in a hyperplane.
Then the means of the classes agree iff there is an optimal weight (W, b) with
W = 0. (Do two-class first.)

7.7 Regression Examples

Let (xk , yk ), k = 1, 2, . . . , N , be a dataset in the plane. The simplest regres-


sion problem is to determine the line y = mx + b minimizing the residual
J(m, b) = Σ_{k=1}^N (yk − mxk − b)².   (7.7.1)

Then the line is the regression line.

Fig. 7.15 Population versus employed: Linear Regression.



More generally, given a dataset x1 , x2 , . . . , xN in Rd , and scalar targets


y1 , y2 , . . . , yN , we want to minimize
J(w, w0) = Σ_{k=1}^N (yk − w · xk − w0)²

over all weight vectors w in Rd and scalars w0 .

GNP.deflator GNP Unemployed Armed Forces Population Year Employed


83 234.289 235.6 159 107.608 1947 60.323
88.5 259.426 232.5 145.6 108.632 1948 61.122
88.2 258.054 368.2 161.6 109.773 1949 60.171
89.5 284.599 335.1 165 110.929 1950 61.187
96.2 328.975 209.9 309.9 112.075 1951 63.221
98.1 346.999 193.2 359.4 113.27 1952 63.639
99 365.385 187 354.7 115.094 1953 64.989
100 363.112 357.8 335 116.219 1954 63.761
101.2 397.469 290.4 304.8 117.388 1955 66.019
104.6 419.18 282.2 285.7 118.734 1956 67.857
108.4 442.769 293.6 279.8 120.445 1957 68.169
110.8 444.546 468.1 263.7 121.95 1958 66.513
112.6 482.704 381.3 255.2 123.366 1959 68.655
114.2 502.601 393.1 251.4 125.368 1960 69.564
115.7 518.173 480.6 257.2 127.852 1961 69.331
116.9 554.894 400.7 282.7 130.081 1962 70.551

Table 7.16 Longley Economic Data [19].

Here we are fitting a regression hyperplane

0 = w0 + w · x = w0 + w1 x1 + w2 x2 + · · · + wd xd .

This corresponds to (7.5.10), where W is the d × 1 matrix W = w, and b is


the scalar w0 .
For example, Table 7.16 is a dataset and Figure 7.15 is a plot of population
versus employed, with the mean and the regression line shown.

Let X be the N × d matrix with rows x1 , x2 , . . . , xN , and let Y be the


vector (y1 , y2 , . . . , yN ). Then we can rewrite the residual as

J(w) = |Xw − Y |2 . (7.7.2)



From §2.3, any weight w∗ minimizing (7.7.2) is a solution of the regression


equation
X t Xw∗ = X t Y. (7.7.3)
Since the pseudo-inverse provides a solution of the regression equation, we
have

Linear Regression

The weight w∗ = X + Y minimizes the residual (7.7.2) and solves the


regression equation (7.7.3).
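As an illustration, here is a minimal sketch using numpy.linalg.pinv on a small synthetic dataset (the data and noise level below are made up for illustration):

from numpy import *
from numpy.linalg import pinv

# synthetic dataset: N samples in R^2, targets generated by a known weight
random.seed(0)
N = 100
X = random.randn(N,2)                      # N x d feature matrix
w_true = array([2.0,-1.0])
Y = dot(X,w_true) + 0.1*random.randn(N)    # targets, plus a little noise

# w* = X^+ Y solves the regression equation X^t X w = X^t Y
wstar = dot(pinv(X),Y)
print(wstar)                               # close to (2,-1)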

We work out the regression equation in the plane, when both features x and y are scalar. In this case, w = (m, b), X is the N × 2 matrix with rows (x1, 1), (x2, 1), . . . , (xN, 1), and Y = (y1, y2, . . . , yN).

In the scalar case, the regression equation (7.7.3) is 2 × 2. To simplify the


computation of X t X, let
x̄ = (1/N) Σ_{k=1}^N xk,   ȳ = (1/N) Σ_{k=1}^N yk.

Then (x̄, ȳ) is the mean of the dataset. Also, let x and y denote the vectors
(x1 , x2 , . . . , xN ) and (y1 , y1 , . . . , yN ), and let, as in §1.5,
cov(x, y) = (1/N) Σ_{k=1}^N (xk − x̄)(yk − ȳ) = (1/N) x · y − x̄ȳ.

Then cov(x, y) is the covariance between x and y,


   
X^tX = N ( x·x   x̄ ; x̄   1 ),     X^tY = N ( x·y ; ȳ ).

With w = (m, b), the regression equation reduces to

(x · x)m + x̄b = x · y,
mx̄ + b = ȳ.

The second equation says the regression line passes through the mean (x̄, ȳ).
Multiplying the second equation by x̄ and subtracting the result from the
first equation cancels the b and leads to

cov(x, x)m = (x · x − x̄2 )m = (x · y − x̄ȳ) = cov(x, y).

This derives

Linear Regression in the Plane

The regression line in two dimensions passes through the mean (x̄, ȳ) and has slope

m = cov(x, y) / cov(x, x).
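As a quick check, a short sketch computing the slope and intercept this way (the toy data below is made up):

from numpy import *

# toy planar dataset
x = array([1.0, 2.0, 3.0, 4.0, 5.0])
y = array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = mean(x), mean(y)
covxy = mean((x - xbar)*(y - ybar))
covxx = mean((x - xbar)**2)

m = covxy/covxx          # slope of the regression line
b = ybar - m*xbar        # the line passes through the mean (xbar, ybar)
print(m, b)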

Now we use linear regression to do polynomial regression. Return to the


dataset (xk, yk) in R² (Figure 7.15). We can expand or “lift” the dataset from R² to R⁶ by working with the vectors (1, xk, xk², xk³, xk⁴, yk) instead of (xk, yk).
Assuming the data is given by Table 7.16, we build the code for Figures
7.15 and 7.17. We begin by assuming the data is given as arrays,

from numpy import *


from pandas import read_csv

df - read_csv("longley.csv")

X = df["Population"].to_numpy()
Y = df["Employed"].to_numpy()

Then we standardize the data

X = X - mean(X)
Y = Y - mean(Y)

varx = sum(X**2)/len(X)
vary = sum(Y**2)/len(Y)

X = X/sqrt(varx)
Y = Y/sqrt(vary)

After this, we compute the optimal weight w∗ and construct the polyno-
mial. The regression equation is solved using the pseudo-inverse (§2.3).

from numpy.linalg import pinv

# polynomial function - degree d-1



def poly(x,d):
    A = column_stack([ X**i for i in range(d) ]) # Nxd
    Aplus = pinv(A)
    b = Y # Nx1
    wstar = dot(Aplus,b)
    return sum([ x**i*wstar[i] for i in range(d) ],axis=0)

Fig. 7.17 Polynomial regression: Degrees 2, 4, 6, 8, 10, 12.

Then we plot the data and the polynomial in six subplots.



from matplotlib.pyplot import *

xmin,ymin = amin(X), amin(Y)


xmax, ymax = amax(X), amax(Y)

figure(figsize=(12,12))
# six subplots
rows, cols = 3,2

# x interval
x = arange(xmin,xmax,.01)

for i in range(6):
    d = 3 + 2*i # degree = d-1
    subplot(rows,cols,i+1)
    plot(X,Y,"o",markersize=2)
    plot([0],[0],marker="o",color="red",markersize=4)
    plot(x,poly(x,d),color="blue",linewidth=.5)
    xlabel("degree = %s" % str(d-1))
    grid()

show()

Running this code with degree 1 returns Figure 7.15. Taking too high a
power can lead to overfitting, for example for degree 12.

Here is an example of a simple logistic regression problem. A group of


students takes an exam. For each student, we know the amount of time x
they studied, and the outcome p, whether or not they passed the exam.

x p x p x p x p x p
0.5 0 .75 0 1.0 0 1.25 0 1.5 0
1.75 0 1.75 1 2.0 0 2.25 1 2.5 0
2.75 1 3.0 0 3.25 1 3.5 0 4.0 1
4.25 1 4.5 1 4.75 1 5.0 1 5.5 1

Table 7.18 Hours studied and outcomes.

More generally, we may only know the amount of study time x, and the
probability p that the student passed, where now 0 ≤ p ≤ 1.
For example, the data may be as in Table 7.18, where pk equals 1 or 0
according to whether they passed or not.
As stated, the samples of this dataset are scalars, and the dataset is one-
dimensional (Figure 7.19).

Fig. 7.19 Exam dataset: x.

Plotting the dataset on the (x, p) plane, the goal is to fit a curve

p = σ(m∗ x + b∗ ) (7.7.4)

as in Figure 7.20.
Since this is logistic regression with bias, we can apply the two-class result
from the previous section: The dataset is one-dimensional, so a hyperplane is
just a point, a threshold. Neither class lies in a hyperplane, and the dataset is
not separable (Figure 7.19). Hence logistic regression with bias is trainable,
and gradient descent is guaranteed to converge to an optimal weight (m∗ , b∗ ).

Fig. 7.20 Exam dataset: (x, p) [35].

Here is the descent code.

from numpy import *


from scipy.special import expit

X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75,
     3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
P = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

def gradient(m,b):
    return sum([ (expit(m*x+b) - p) * array([x,1]) for x,p in zip(X,P) ],axis=0)

# gradient descent
w = array([0,0]) # starting m,b
g = gradient(*w)

t = .01 # learning rate

while not allclose(g,0):
    wplus = w - t * g
    if allclose(w,wplus): break
    else: w = wplus
    g = gradient(*w)

print("descent result: ",w)


print("gradient: ",gradient(*w))

This code returns

m∗ = 1.49991537, b∗ = −4.06373862.

These values are used to graph the sigmoid in Figure 7.20.
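For example, the fitted curve (7.7.4) can be evaluated at any number of study hours; the 3-hour query below is just an illustration:

from scipy.special import expit

mstar, bstar = 1.49991537, -4.06373862

# predicted probability of passing after 3 hours of study
print(expit(mstar*3 + bstar))    # roughly 0.61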

Even though we are done, we take the long way and apply logistic regres-
sion without bias by incorporating the bias, to better understand how things
work.
To this end, we incorporate the bias and write the augmented dataset

(x1 , 1), (x2 , 1), . . . , (xN , 1), N = 20,

resulting in Figure 7.21. Since these vectors are not parallel, the dataset is
full-rank in R2 , hence J(m, b) is strictly convex. In Figure 7.21, the shaded
area is bounded by the vectors corresponding to the overlap between passing
and failing students’ hours.

Fig. 7.21 Exam dataset: (x, x0 ).

Let σ(z) be the sigmoid function (5.1.22). Then, as in the previous section,
the goal is to minimize the loss function

J(m, b) = Σ_{k=1}^N I(pk, qk),   qk = σ(mxk + b).   (7.7.5)

Once we have the minimizer (m∗ , b∗ ), we have the best-fit curve (7.7.4).
If the targets p are one-hot encoded, the dataset is as follows.

x p x p x p x p x p
0.5 (1,0) .75 (1,0) 1.0 (1,0) 1.25 (1,0) 1.5 (1,0)
1.75 (1,0) 1.75 (0,1) 2.0 (1,0) 2.25 (0,1) 2.5 (1,0)
2.75 (0,1) 3.0 (1,0) 3.25 (0,1) 3.5 (1,0) 4.0 (0,1)
4.25 (0,1) 4.5 (0,1) 4.75 (0,1) 5.0 (0,1) 5.5 (0,1)

Table 7.22 Hours studied and one-hot encoded outcomes.

Each sample (x, 1) in the dataset is in R2 , and each target is one-hot


encoded as (p, 1 − p). Since the weight matrix must satisfy (7.6.2), W1 = 0, we have

W = ( b  −b ; m  −m ).
Since z = W t x, the outputs must satisfy z1 = z and z2 = −z. This leads to
a neural network with two inputs and two outputs (Figure 7.23).


Fig. 7.23 Neural network for student exam outcomes.

Since here d = 2, the networks in Figures 7.23 and 7.24 are equivalent.
In Figure 7.23, σ is the softmax function, I is given by (5.6.6), and p, q are
probability vectors. In Figure 7.24, σ is the sigmoid function, I is given by
(4.2.2), and p, q are probability scalars.


Fig. 7.24 Equivalent neural network for student exam outcomes.

Figure 7.20 is a plot of x against p. However, the dataset, with the bias
input included, has two inputs x, 1 and one output p, and should be plotted
in three dimensions (x, 1, p). Then (Figure 7.25) samples lie on the line (x, 1)
in the horizontal plane, and p is on the vertical axis.
The horizontal plane in Figure 7.25, which is the plane in Figure 7.21, is
feature space. The convex hulls K0 and K1 are in feature space, so the convex
hull K0 of the samples corresponding to p = 0 is the line segment joining
(.5, 1, 0) and (3.5, 1, 0), and the convex hull K1 of the samples corresponding
to p = 1 is the line segment joining (1.75, 1, 0) and (5.5, 1, 0). In Figure 7.25,
K0 is the line segment joining the green points, and K1 is the projection onto
feature space of the line segment joining the red points. Since K0 ∩ K1 is the
line segment joining (1.75, 1, 0) and (3.5, 1, 0), the span of K0 ∩ K1 is all of
feature space. By the results of the previous section, J(w) is proper.

Fig. 7.25 Exam dataset: (x, x0 , p).

The Iris dataset consists of 150 samples divided into three groups, leading
to three convex hulls K0 , K1 , K2 in R4 . If the dataset is projected onto the

top two principal components, then the projections of these three hulls do
not pair-intersect (Figure 7.26). It follows we have no guarantee the logistic
loss is proper.

Fig. 7.26 Convex hulls of Iris classes in R2 .

Fig. 7.27 Convex hulls of MNIST classes in R2 .

On the other hand, the MNIST dataset consists of 60,000 samples divided
into ten groups. If the MNIST dataset is projected onto the top two principal
components, the projections of the ten convex hulls K0 , K1 , . . . , K9 onto R2 ,
do intersect (Figure 7.27).

This does not guarantee that the ten convex hulls K0 , K1 , . . . , K9 in R784
intersect, but at least this is so for the 2d projection of the MNIST dataset.
Therefore the logistic loss of the 2d projection of the MNIST dataset is proper.

7.8 Strict Convexity

In this section, we work with loss functions that are smooth and strictly
convex. While this is not always the case, this assumption is a base case
against which we can test different optimization or training models.
By smooth and strictly convex, we mean there are positive constants m
and L satisfying

m ≤ D2 f (w) ≤ L, for every w. (7.8.1)

Recall this means the eigenvalues of the symmetric matrix D2 f (w) are be-
tween L and m. In this situation, the condition number1 r = m/L is between
zero and one: 0 < r ≤ 1.
In the previous section, we saw that basic gradient descent converged to
a critical point. If f (x) is strictly convex, there is exactly one critical point,
the global minimum. From this we have

Gradient Descent on a Strictly Convex Function

If the short-step gradient descent sequence starting from w0 converges


to w∗ , then w∗ is the global minimum.

The simplest example of a convex loss function is the quadratic case


f(w) = (1/2) w · Qw − b · w,   (7.8.2)
where Q is a variance matrix. Then D2 f (w) = Q. If the eigenvalues of Q
are between positive constants m and L, then f (w) is smooth and strictly
convex.
By (4.3.8), the gradient for this example is g = Qw−b. Hence the minimizer
is the unique solution w∗ = Q−1 b of the linear system Qw = b. Thus gradient
descent is a natural tool for solving linear systems and computing inverses,
at least for variance matrices Q.
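As an illustration, a minimal sketch of basic gradient descent solving Qw = b for a small variance matrix (the matrix Q and vector b below are made up):

from numpy import *
from numpy.linalg import eigvalsh

Q = array([[3.0,1.0],[1.0,2.0]])    # variance matrix (symmetric positive definite)
b = array([1.0,0.0])

L = max(eigvalsh(Q))                # largest eigenvalue
t = 1/L                             # learning rate

w = zeros(2)
for _ in range(200):
    g = dot(Q,w) - b                # gradient of (1/2) w.Qw - b.w
    w = w - t*g

print(w)                            # close to the solution of Qw = b, here (0.4,-0.2)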
By (4.5.21), f (w) lies between two quadratics,

(m/2)|w − w∗|² ≤ f(w) − f(w∗) ≤ (L/2)|w − w∗|².   (7.8.3)
1 In the literature, the condition number is often defined as L/m.

How far we are from our goal w∗ can be measured by the error E(w) =
|w − w∗ |2 . Another measure of error is E(w) = f (w) − f (w∗ ). The goal is to
drive the error between w and w∗ to zero.
When f (w) is smooth and strictly convex in the sense of (7.8.1), the es-
timate (7.8.3) shows these two error measures are equivalent. We use both
measures below.

Let t = 1/L. Inserting x = w and a = w∗ in the left half of (4.5.25) and


using ∇f (w∗ ) = 0 implies
f(w) ≤ f(w∗) + (1/(2m))|∇f(w)|².
Let E(w) = f (w) − f (w∗ ). Combining this inequality with (7.3.3), and re-
calling r = m/L = mt, we arrive at

E(w+ ) ≤ (1 − r)E(w). (7.8.4)

Iterating this implies

E(w2 ) ≤ (1 − r)E(w1 ) ≤ (1 − r)(1 − r)E(w0 ) = (1 − r)2 E(w0 ).

In general, this leads to

Gradient Descent I

Let r = m/L and set E(w) = f (w)−f (w∗ ). Then the descent sequence
w0 , w1 , w2 , . . . given by (7.3.1) with learning rate
t = 1/L

converges to w∗ at the rate

E(wn) ≤ (1 − r)^n E(w0),   n = 1, 2, . . . .   (7.8.5)

This is the basic gradient descent result GD-I.

Using coercivity of the gradient (4.5.26), we can obtain an improved result


GD-II.
Let E(w) = |w − w∗|² and set the learning rate at t = 2/(m + L). Inserting x = w and a = w∗ in (4.5.26) and using ∇f(w∗) = 0 implies

g · (w − w∗) ≥ (mL/(m + L)) |w − w∗|² + (1/(m + L)) |g|².

Using this and (7.3.1) and t = 2/(m + L),

E(w+) = E(w) − 2t g · (w − w∗) + t²|g|²
      ≤ (1 − 2t mL/(m + L)) E(w) + (t² − 2t/(m + L)) |g|²
      = ((L − m)/(L + m))² E(w).

This implies

Gradient Descent II

Let r = m/L and set E(w) = |w − w∗|². Then the descent sequence w0, w1, w2, . . . given by (7.3.1) with learning rate

t = 2/(m + L)

converges to w∗ at the rate

E(wn) ≤ ((1 − r)/(1 + r))^{2n} E(w0),   n = 1, 2, . . . .   (7.8.6)

GD-II improves GD-I in two ways: Since m < L, the learning rate is larger,
2 1
> ,
m+L L
and the convergence rate is smaller,
 2
1−r
< (1 − r),
1+r

implying faster convergence.


For example, if L = 6 and m = 2, then r = 1/3, the learning rates are 1/6
versus 1/4, and the convergence rates are 2/3 versus 1/4. Even though GD-II
improves GD-I, the improvement is not substantial. In the next section, we
use momentum to derive better convergence rates.
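To make the comparison concrete, a quick computation of how many steps each bound needs to shrink the error by a factor of one million, for the example L = 6, m = 2 above:

from numpy import log

L, m = 6, 2
r = m/L

rate1 = 1 - r                     # GD-I factor per step
rate2 = ((1 - r)/(1 + r))**2      # GD-II factor per step

for name, rate in [("GD-I",rate1), ("GD-II",rate2)]:
    steps = log(1e-6)/log(rate)   # steps to shrink the error bound by 1e6
    print(name, rate, int(steps) + 1)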

Let g be the gradient of the loss function at a point w. Then the line passing through w in the direction of g is w − tg. When the loss function is quadratic (4.3.8), f(w − tg) is a quadratic function of the scalar variable t. In this case, the minimizer t along the line w − tg is explicitly computable as

t = (g · g)/(g · Qg).

This leads to gradient descent with varying time steps t0, t1, t2, . . . . As a consequence, one can show the error is lowered as follows,

E(w+) = (1 − 1/((u · Qu)(u · Q⁻¹u))) E(w),   u = g/|g|.

Using a well-known inequality, Kantorovich’s inequality [20], one can show


that here the convergence rate is also (7.8.6). Thus, after all this work, there is no advantage here; it is simpler to stick with GD-II!
Nevertheless, the idea here, the line-search for a minimizer, is a sound
one, and is useful in some situations.

7.9 Accelerated Gradient Descent

In this section, we modify the gradient descent method by adding a term


incorporating previous gradients, leading to gradient descent with momentum.
After this, we consider other variations, leading to the most frequently used
descent methods.
Recall in a descent sequence, the current point is w, the next point is w+ ,
and the previous point is w− .
In gradient descent with momentum, we add a momentum term to the
current point w, obtaining the lookahead point

w◦ = w + s(w − w− ). (7.9.1)

Here s is the decay rate. The momentum term reflects the direction induced by
the previous step. Because this mimics the behavior of a ball rolling downhill,
gradient descent with momentum is also called heavy ball descent.
Then the descent sequence w0 , w1 , w2 , . . . is generated by

Momentum Gradient Descent Step

w+ = w − t∇f (w) + s(w − w− ). (7.9.2)

Here we have two hyperparameters, the learning rate and the decay rate.

We study convergence for the simplest case of a quadratic (7.8.2). In this


case, ∇f (w) = Qw − b, and the sequence satisfies the recursion

wn+1 = wn − t(Qwn − b) + s(wn − wn−1 ), n = 0, 1, 2, . . . . (7.9.3)

To initialize the recursion, we set w−1 = w0− = w0 . This implies w1 =


w0 − t(Qw0 − b).
We measure the convergence using the error E(w) = |w − w∗ |2 , and we
assume m < Q < L strictly, in the sense every eigenvalue λ of Q satisfies

m < λ < L. (7.9.4)

As before, we set r = m/L.


Let v be an eigenvector of Q with eigenvalue λ. To solve (7.9.3), we assume
a solution of the form

wn = w∗ + ρn v, Qv = λv. (7.9.5)

Inserting this into (7.9.3) and using Qw∗ = b leads to the quadratic equation

ρ2 = (1 − tλ + s)ρ − s.

By the quadratic formula,


ρ = ρ± = ((1 − λt + s) ± √((1 − λt + s)² − 4s)) / 2.
Assume the discriminant (1 − λt + s)² − 4s is negative. This happens exactly when

(1 − √s)²/λ < t < (1 + √s)²/λ.   (7.9.6)

If we assume

(1 − √s)²/m ≤ t ≤ (1 + √s)²/L,   (7.9.7)
then (7.9.6) holds for every eigenvalue λ of Q.
Multiplying (7.9.7) by λ and factoring the discriminant as a difference of
two squares leads to

4s − (1 − λt + s)² ≥ (1 − s)² (L − λ)(λ − m)/(mL).   (7.9.8)
When (7.9.6) holds, the roots are conjugate complex numbers ρ, ρ̄, where

ρ = x + iy = ((1 − λt + s) + i√(4s − (1 − λt + s)²)) / 2.   (7.9.9)

It follows the absolute value of ρ equals

|ρ| = √(x² + y²) = √s.

To obtain the fastest convergence, we choose s and t to minimize |ρ| = √s, while still satisfying (7.9.7). This forces (7.9.7) to be an equality,

(1 − √s)²/m = t = (1 + √s)²/L.

These are two equations in two unknowns s, t. Solving, we obtain

√s = (1 − √r)/(1 + √r),   t = (1/L) · 4/(1 + √r)².

Let w̃n = wn −w∗ . Since Qwn −b = Qw̃n , (7.9.3) is a 2-step linear recursion
in the variables w̃n . Therefore the general solution depends on two constants
A, B.
Let λ1 , λ2 , . . . , λd be the eigenvalues of Q and let v1 , v2 , . . . , vd be the
corresponding orthonormal basis of eigenvectors.
Since (7.9.3) is a 2-step vector linear recursion, A and B are vectors, and
the general solution depends on 2d constants Ak , Bk , k = 1, 2, . . . , d.
If ρk , k = 1, 2, . . . , d, are the corresponding roots (7.9.9), then (7.9.5) is
a solution of (7.9.3) for each of 2d roots ρ = ρk , ρ = ρ̄k , k = 1, 2, . . . , d.
Therefore the linear combination
wn = w∗ + Σ_{k=1}^d (Ak ρk^n + Bk ρ̄k^n) vk,   n = 0, 1, 2, . . .   (7.9.10)

is the general solution of (7.9.3). Inserting n = 0 and n = 1 into (7.9.10), then


taking the dot product of the result with vk , we obtain two linear equations
for two unknowns Ak , Bk ,

Ak + Bk = (w0 − w∗ ) · vk ,
Ak ρk + Bk ρ̄k = (w1 − w∗ ) · vk = (1 − tλk )(w0 − w∗ ) · vk ,

for each k = 1, 2, . . . , d. Solving for Ak , Bk yields


 
Ak = ((1 − tλk − ρ̄k)/(ρk − ρ̄k)) (w0 − w∗) · vk,   Bk = Āk.

Let

C = max_λ (L − m)²/((L − λ)(λ − m)).   (7.9.11)
Using (7.9.8), one verifies the estimate

|Ak |2 = |Bk |2 ≤ C |(w0 − w∗ ) · vk |2 .



Now use (2.9.5) twice, first with v = wn − w∗ , then with v = w0 − w∗ . By


(7.9.10) and the triangle inequality,
|wn − w∗|² = Σ_{k=1}^d |(wn − w∗) · vk|²
           = Σ_{k=1}^d |Ak ρk^n + Bk ρ̄k^n|²
           ≤ Σ_{k=1}^d (|Ak| + |Bk|)² |ρk|^{2n}
           ≤ 4C s^n Σ_{k=1}^d |(w0 − w∗) · vk|²
           = 4C s^n |w0 − w∗|².

This derives the following result.

Momentum Gradient Descent - Heavy Ball

Suppose the loss function f (w) is quadratic (7.8.2), let r = m/L, and
set E(w) = |w − w∗ |2 . Let C be given by (7.9.11). Then the descent
sequence w0 , w1 , w2 , . . . given by (7.9.2) with learning rate and decay
rate

t = (1/L) · 4/(1 + √r)²,   s = ((1 − √r)/(1 + √r))²,

converges to w∗ at the rate

E(wn) ≤ 4C ((1 − √r)/(1 + √r))^{2n} E(w0),   n = 1, 2, . . .   (7.9.12)

This heavy ball descent, due to Polyak [26], is an improvement over GD-II (7.8.6), because √r is substantially larger than r when r is small. The down-
(7.8.6), because r is substantially larger than r when r is small. The down-
side of this momentum method is that the convergence (7.9.12) is only guar-
anteed for f (w) quadratic (7.8.2). In fact, there are examples of non-quadratic
f (w) where heavy ball descent does not converge to w∗ . Nevertheless, this
method is widely used.
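Here is a minimal sketch of the heavy ball step (7.9.2) on the quadratic (7.8.2), with the learning and decay rates above (the matrix Q and vector b below are made up for illustration):

from numpy import *
from numpy.linalg import eigvalsh, solve

Q = array([[10.0,0.0],[0.0,1.0]])      # variance matrix with m = 1, L = 10
b = array([1.0,1.0])
wstar = solve(Q,b)

lam = eigvalsh(Q)
m, L = min(lam), max(lam)
r = m/L
t = (1/L) * 4/(1 + sqrt(r))**2         # learning rate
s = ((1 - sqrt(r))/(1 + sqrt(r)))**2   # decay rate

w, wminus = zeros(2), zeros(2)         # initialize w_{-1} = w_0
for _ in range(100):
    wplus = w - t*(dot(Q,w) - b) + s*(w - wminus)
    wminus, w = w, wplus

print(w, wstar)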

The momentum method can be modified by evaluating the gradient at the


lookahead point w◦ (7.9.1),

Momentum Descent Step With Lookahead Gradient

w◦ = w + s(w − w−),
w+ = w◦ − t∇f(w◦).   (7.9.13)

This leads to accelerated gradient descent, or momentum descent with


lookahead gradient. This result, due to Nesterov [23], is valid for any con-
vex function satisfying (7.8.1), not just quadratics.
The iteration (7.9.13) is in two steps, a momentum step followed by a basic
gradient descent step. The momentum step takes us from the current point
w to the lookahead point w◦ , and the gradient descent step takes us from w◦
to the successive point w+ .
Starting from w0 , and setting w−1 = w0 , here it turns out the loss se-
quence f (w0 ), f (w1 ), f (w2 ), . . . is not always decreasing. Because of this, we
seek another function V (w) where the corresponding sequence V (w0 ), V (w1 ),
V (w2 ), . . . is decreasing.
To explain this, it’s best to assume w∗ = 0 and f (w∗ ) = 0. This can always
be arranged by translating the coordinate system. Then it turns out
V(w) = f(w) + (L/2)|w − ρw−|²,   (7.9.14)

with a suitable choice of ρ, does the job. With the choices

t = 1/L,   s = (1 − √r)/(1 + √r),   ρ = 1 − √r,

we will show
V (w+ ) ≤ ρV (w). (7.9.15)
In fact, we see below (7.9.22), (7.9.23) that V is reduced by an additional
quantity proportional to the momentum term.
The choice t = 1/L is a natural choice from basic gradient descent (7.3.3).
The derivation of (7.9.15) below forces the choices for s and ρ.
Given a point w, while w+ is well-defined by (7.9.13), it is not clear what

w means. There are two ways to insert meaning here. Either evaluate V (w)
along a sequence w0 , w1 , w2 , . . . and set, as before, wn− = wn−1 , or work
with the function W (w) = V (w+ ) instead of V (w). If we assume (w+ )− = w,
then W (w) is well-defined. With this understood, we nevertheless stick with
V (w) as in (7.9.14) to simplify the calculations.
We first show how (7.9.15) implies the result. Using (w0 )− = w0 and
(7.8.3),

V(w0) = f(w0) + (L/2)|w0 − ρw0|² = f(w0) + (m/2)|w0|² ≤ 2f(w0).

Moreover f (w) ≤ V (w). Iterating (7.9.15), we obtain

f (wn ) ≤ V (wn ) ≤ ρn V (w0 ) ≤ 2ρn f (w0 ).

This derives

Momentum Descent - Lookahead Gradient

Let r = m/L and set E(w) = f (w) − f (w∗ ). Then the sequence w0 ,
w1 , w2 , . . . given by (7.9.13) with learning rate and decay rate

t = 1/L,   s = (1 − √r)/(1 + √r)

converges to w∗ at the rate

E(wn) ≤ 2(1 − √r)^n E(w0),   n = 1, 2, . . . .   (7.9.16)

While the convergence rate for accelerated descent is slightly worse than
heavy ball descent, the value of accelerated descent is its validity for all convex
functions satisfying (7.8.1), and the fact, also due to Nesterov [23], that this
convergence rate is best-possible among all such functions.
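Here is a minimal sketch of the two-step iteration (7.9.13); for concreteness it is run on a quadratic, though the result above applies to any function satisfying (7.8.1) (Q and b below are made up):

from numpy import *
from numpy.linalg import eigvalsh, solve

Q = array([[10.0,0.0],[0.0,1.0]])
b = array([1.0,1.0])

def grad(w): return dot(Q,w) - b       # gradient of f(w) = (1/2) w.Qw - b.w

lam = eigvalsh(Q)
m, L = min(lam), max(lam)
r = m/L
t = 1/L                                # learning rate
s = (1 - sqrt(r))/(1 + sqrt(r))        # decay rate

w, wminus = zeros(2), zeros(2)
for _ in range(100):
    wcirc = w + s*(w - wminus)         # lookahead point
    wplus = wcirc - t*grad(wcirc)
    wminus, w = w, wplus

print(w, solve(Q,b))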
Now we derive (7.9.15). Assume (w+ )− = w and w∗ = 0, f (w∗ ) = 0. We
know w◦ = (1 + s)w − sw− and w+ = w◦ − tg ◦ , where g ◦ = ∇f (w◦ ).
By the basic descent step (7.3.1) with w◦ replacing w, (7.3.3) implies
f(w+) ≤ f(w◦) − (t/2)|g◦|².   (7.9.17)
Here we used t = 1/L.
By (4.5.20) with x = w and a = w◦,

f(w◦) ≤ f(w) − g◦ · (w − w◦) − (m/2)|w − w◦|².   (7.9.18)

By (4.5.20) with x = w∗ = 0 and a = w◦,

f(w◦) ≤ g◦ · w◦ − (m/2)|w◦|².   (7.9.19)
Multiply (7.9.18) by ρ and (7.9.19) by 1 − ρ and add, then insert the sum
into (7.9.17). After some simplification, this yields
f(w+) ≤ ρf(w) + g◦ · (w◦ − ρw) − (r/(2t))(ρ|w − w◦|² + (1 − ρ)|w◦|²) − (t/2)|g◦|².   (7.9.20)
Since

(w◦ − ρw) − tg◦ = w+ − ρw,

we have

(1/(2t))|w+ − ρw|² = (1/(2t))|w◦ − ρw|² − g◦ · (w◦ − ρw) + (t/2)|g◦|².
Adding this to (7.9.20) leads to
V(w+) ≤ ρf(w) − (r/(2t))(ρ|w − w◦|² + (1 − ρ)|w◦|²) + (1/(2t))|w◦ − ρw|².   (7.9.21)
Let

R(a, b) = r(ρs²|b|² + (1 − ρ)|a + sb|²) − |(1 − ρ)a + sb|² + ρ|(1 − ρ)a + ρb|².


Solving for f (w) in (7.9.14) and inserting into (7.9.21) leads to


V(w+) ≤ ρV(w) − (1/(2t)) R(w, w − w−).   (7.9.22)
If we can choose s and ρ so that R(a, b) is a positive scalar multiple of |b|2 ,
then, by (7.9.22), (7.9.15) follows, completing the proof.
Based on this, we choose s, ρ to make R(a, b) independent of a, which is
equivalent to ∇a R = 0. But
 
∇a R = 2(1 − ρ)((r − (1 − ρ)²)a + (ρ² − s(1 − r))b),

so ∇a R = 0 is two equations in two unknowns s, ρ. This leads to the choices


for s and ρ made above. Once these choices are made, s(1 − r) = ρ2 and
ρ > s. From this,

R(a, b) = R(0, b) = (rs² − s² + ρ³)|b|² = ρ²(ρ − s)|b|²,   (7.9.23)

which is positive.
Chapter A
Appendices

Some of the material here is first seen in high school. Because repeating the
exposure leads to a deeper understanding, we review it in a manner useful to
us here.
We start with basic counting, and show how the factorial function leads
directly to the exponential. Given its convexity and its importance for entropy
(§5.1), the exponential is treated carefully (§A.3).
The other use of counting is in graph theory (§3.3), which lays the ground-
work for neural networks (§7.2).

A.1 Permutations and Combinations

Suppose we have three balls in a bag, colored red, green, and blue. Suppose
they are pulled out of the bag and arranged in a line. We then obtain six
possibilities, listed in Figure A.1.
Why are there six possibilities? Because they are three ways of choosing
the first ball, then two ways of choosing the second ball, then one way of
choosing the third ball, so the total number of ways is

6 = 3 × 2 × 1.

In particular, we see that the number of ways multiply, 6 = 3 × 2 × 1.


Similarly, there are 5 × 4 × 3 × 2 × 1 = 120 ways of selecting five distinct
balls. Since this pattern appears frequently, it has a name.
If n is a positive integer, then n-factorial is

n! = n × (n − 1) × (n − 2) × · · · × 2 × 1.

The factorial function grows large rapidly, for example,

10! = 10 × 9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1 = 3, 628, 800.


Notice also

(n + 1)! = (n + 1) × n × (n − 1) × · · · × 2 × 1 = (n + 1) × n!,

and (n + 2)! = (n + 2) × (n + 1)!, and so on.

Fig. A.1 6 = 3! permutations of 3 balls.

Permutations of n Objects

The number of ways of selecting n objects from a collection of n


distinct objects is n!.

We also have
1! = 1, 0! = 1.
It’s clear that 1! = 1. It’s less clear that 0! = 1, but it’s reasonable if you
think about it: The number of ways of selecting from zero balls results in
only one possibility — no balls. The code for n! is

from scipy.special import factorial

factorial(n,exact=True)

More generally, we can consider the selection of k balls from a bag contain-
ing n distinct balls. There are two varieties of selections that can be made:
Ordered selections and unordered selections. An ordered selection is a permu-
tation. In particular, when k = n, the number of ordered selections of n objects from n objects is n!, which is the number of ways of permuting n objects.

The function perm_tuples(a,b,k) returns all permutations of k integers


between the integers a and b inclusive. Thus perm_tuples(a,b,2) returns all ordered pairs of integers between a and b inclusive, and perm_tuples(a,b,3) returns all ordered triples of integers between a and b inclusive. The code

def perm_tuples(a,b,k):
    if k==1: return [ (i,) for i in range(a,b+1) ]
    else:
        list1 = [ (i,*p) for i in range(a,b) for p in perm_tuples(i+1,b,k-1) ]
        list2 = [ (*p,i) for i in range(a,b) for p in perm_tuples(i+1,b,k-1) ]
        return list1 + list2

perm_tuples(1,5,2)

returns the list

[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5),
 (2, 1), (3, 1), (4, 1), (5, 1), (3, 2), (4, 2), (5, 2), (4, 3), (5, 3), (5, 4)]

The number of permutations of k objects from n objects is written as P (n, k).


In Python, P (n, k) is

from scipy.special import perm

n, k = 5, 2

perm(n, k)

It follows the code

perm(n,k,exact=True) == len(perm_tuples(1,n,k))

returns True. For example, perm(5,2) equals 20.


For ordered selections, there are n choices for the first ball, n − 1 choices
for the second ball, and so on, until we have n − k + 1 choices for the k-th
ball. Thus
P (n, k) = n × (n − 1) × · · · × (n − k + 1).
For example, there are 5 × 4 = 20 ordered selections of two balls from five
distinct balls. Because ordering is taken into account, selecting ball #2 then
ball #3 is considered distinct from selecting ball #3 then ball #2.

Permutation of k Objects from n Objects

The number of permutations of k objects from n objects is

P(n, k) = n(n − 1)(n − 2) · · · (n − k + 1) = n!/(n − k)!.

The last formula follows by canceling,

n!/(n − k)! = n(n − 1) · · · (n − k + 1)(n − k)!/(n − k)! = n(n − 1) · · · (n − k + 1).

Notice P (x, k) is defined for any real number x by the same formula,

P (x, k) = x(x − 1)(x − 2) . . . (x − k + 1).

An unordered selection is a combination. When a selection of k objects


is made, and the k objects are permuted, we obtain the same unordered
selection, but a different ordered selection. Since the number of permutations
of k objects is k!, the number of permutations of k objects from n objects is
k! times the number of combinations of k objects from n objects.
The function comb_tuples(a,b,k) returns all combinations of k integers between the integers a and b inclusive. Thus comb_tuples(a,b,2) returns all unordered pairs of integers between a and b inclusive, and comb_tuples(a,b,3) returns all unordered triples of integers between a and b inclusive.
The code

def comb_tuples(a,b,k):
    if k==1: return [ (i,) for i in range(a,b+1) ]
    else: return [ (i, *p) for i in range(a,b) for p in comb_tuples(i+1,b,k-1) ]

comb_tuples(1,5,2)

returns the list

[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

The number of combinations of k objects from n objects is written as C(n, k).


In Python, C(n, k) is

from scipy.special import comb

n, k = 5, 2

comb(n, k)

It follows the code

comb(n,k,exact=True) == len(comb_tuples(1,n,k))

returns True. For example, comb(5,2) equals 10.


The number C(n, k) is also called n-choose-k. Because it appears in the
binomial theorem, C(n, k) is also called the binomial coefficient (§A.2).

Combinations of k Objects from n Objects

The number of combinations of k objects from n objects is

C(n, k) = P(n, k)/k! = n!/((n − k)! k!).

Since P(x, k) is defined for any real number x, so is C(x, k):

C(x, k) = P(x, k)/k! = x(x − 1)(x − 2) · · · (x − k + 1)/(1 · 2 · 3 · · · · · k).

An important question is the rate of growth of the factorial function n!.


Attempting to answer this question leads to the exponential (§A.3) and to
the entropy (§4.2). Here is how this happens.
Since n! is a product of the n factors

1, 2, 3, . . . , n − 1, n,

each no larger than n, it is clear that

n! < nn .

However, because half of the factors are less than n/2, we expect an approximation smaller than n^n, maybe something like (n/2)^n or (n/3)^n.
To be systematic about it, assume

n! is approximately equal to e(n/e)^n for n large,   (A.1.1)

for some constant e. We seek the best constant e that fits here. In this ap-
proximation, we multiply by e so that (A.1.1) is an equality when n = 1.
Using the binomial theorem, in §A.3 we show
3(n/3)^n ≤ n! ≤ 2(n/2)^n,   n ≥ 1.   (A.1.2)
Based on this, a constant e satisfying (A.1.1) must lie between 2 and 3,

2 ≤ e ≤ 3.

To figure out the best constant e to pick, we see how much both sides
of (A.1.1) increase when we replace n by n + 1. Write (A.1.1) with n + 1
replacing n, obtaining
(n + 1)! is approximately equal to e((n + 1)/e)^(n+1) for n large.   (A.1.3)

Dividing the left sides of (A.1.1), (A.1.3) yields

(n + 1)!/n! = n + 1.

Dividing the right sides yields

e((n + 1)/e)^(n+1) / (e(n/e)^n) = (n + 1) · (1/e) · (1 + 1/n)^n.   (A.1.4)

To make these quotients match as closely as possible, we should choose

e ≈ (1 + 1/n)^n,   for n large.   (A.1.5)

Choosing n = 1, 2, 3, . . . , 100, . . . results in

e ≈ 2, 2.25, 2.37, . . . , 2.705, . . . .

As n → ∞, we obtain Euler’s constant e (§A.3).


Equation (A.1.1) can be improved to Stirling’s approximation

n! ≈ √(2πn) (n/e)^n,   for n large.   (A.1.6)
This is an asymptotic equality. This means the ratio of the two sides ap-
proaches one for large n (see §A.6). Stirling’s approximation is a consequence
of the central limit theorem (Exercise 5.4.13).
The central limit theorem guarantees the accuracy of Stirling’s approxi-
mation for n large. In fact, as soon as n = 1, Stirling’s approximation is 90%
accurate, and, as soon as n = 9, Stirling’s approximation is 99% accurate.
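A quick check of these accuracy claims, computing the ratio of Stirling’s formula to n!:

from numpy import sqrt, pi, e
from scipy.special import factorial

for n in [1, 5, 9, 20]:
    stirling = sqrt(2*pi*n)*(n/e)**n
    print(n, stirling/factorial(n,exact=True))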

Exercises

Exercise A.1.1 The n-th Hermite number is

Hn = (2n)!/(2^n n!),   n = 0, 1, 2, 3, . . .
Use scipy.special.factorial to find the least n for which Hn is greater
than a billion.

Exercise A.1.2 (Summation notation exercise) Let n = 1, 2, . . . . Show



Σ_{k=n}^∞ (k − n) · n^k/k! = n^(n+1)/n!.   (A.1.7)

(First break the sum into two sums, then write out the first few terms of each
sum separately, and notice all terms but one cancel.)

A.2 The Binomial Theorem

Let x and a be two variables. A binomial is an expression of the form

(a + x)2 , (a + x)3 , (a + x)4 , ...

The degree of each of these binomials is 2, 3, and 4.


When binomials are expanded by multiplying out, one obtains a sum of
terms. The binomial theorem specifies the exact pattern or form of the re-
sulting sum.
Recall that

(a + b)(c + d) = a(c + d) + b(c + d) = ac + ad + bc + bd.

Similarly,

(a + b)(c + d + e) = a(c + d + e) + b(c + d + e) = ac + ad + ae + bc + bd + be.

Using this algebra, we can expand each binomial.


Expanding (a + x)2 yields

(a + x)2 = (a + x)(a + x) = a2 + xa + ax + x2 = a2 + 2ax + x2 . (A.2.1)

Similarly, for (a + x)3 , we have



(a + x)3 = (a + x)(a + x)2 = (a + x)(a2 + 2ax + x2 )


= a3 + 2a2 x + ax2 + xa2 + 2xax + x3 (A.2.2)
3 2 2 3
= a + 3a x + 3ax + x .

For (a + x)4 , we have

(a + x)4 = (a + x)(a + x)3 = (a + x)(a3 + 3a2 x + 3ax2 + x3 )


= a4 + 3a3 x + 3a2 x2 + ax3 + a3 x + 3a2 x2 + 3ax3 + x4 (A.2.3)
4 3 2 2 3 4
= a + 4a x + 6a x + 4ax + x .

Thus
(a + x)2 = a2 + 2ax + x2
(a + x)3 = a3 + 3a2 x + 3ax2 + x3
(A.2.4)
(a + x)4 = a4 + 4a3 x + 6a2 x2 + 4ax3 + x4
(a + x)5 = ⋆a5 + ⋆a4 x + ⋆a3 x2 + ⋆a2 x3 + ⋆ax4 + ⋆x5 .

Here ⋆ means we haven’t found the coefficient yet.

There is a pattern in (A.2.4). In the first line, the powers of a are in


decreasing order, 2, 1, 0, while the powers of x are in increasing order, 0, 1,
2. In the second line, the powers of a decrease from 3 to 0, while the powers
of x increase from 0 to 3. In the third line, the powers of a decrease from 4
to 0, while the powers of x increase from 0 to 4.
This pattern of powers is simple and clear. Now we want to find the pattern
for the coefficients in front of each term. In (A.2.4), these coefficients are
(1, 2, 1), (1, 3, 3, 1), (1, 4, 6, 4, 1), and (⋆, ⋆, ⋆, ⋆, ⋆, ⋆). These coefficients are the
binomial coefficients.
Before we determine the pattern, we introduce a useful notation for these
coefficients by writing
     
(2 choose 0) = 1,   (2 choose 1) = 2,   (2 choose 2) = 1

and

(3 choose 0) = 1,   (3 choose 1) = 3,   (3 choose 2) = 3,   (3 choose 3) = 1

and

(4 choose 0) = 1,   (4 choose 1) = 4,   (4 choose 2) = 6,   (4 choose 3) = 4,   (4 choose 4) = 1

and

(5 choose 0) = ⋆,   (5 choose 1) = ⋆,   (5 choose 2) = ⋆,   (5 choose 3) = ⋆,   (5 choose 4) = ⋆,   (5 choose 5) = ⋆.

With this notation, the number


 
(n choose k)   (A.2.5)

is the coefficient of an−k xk when you multiply out (a + x)n . This is the bino-
mial coefficient. Here n is the degree of the binomial, and k, which specifies
the term in the resulting sum, varies from 0 to n (not 1 to n).
It is important to remember that, in this notation, the binomial (a + x)2
expands into the sum of three terms a2 , 2ax, x2 . These are term 0, term 1,
and term 2. Alternatively, one says these are the zeroth term, the first term,
and the second term. Thus the second term in the expansion of the binomial (a + x)^4 is 6a²x², and the binomial coefficient (4 choose 2) = 6. In general, the binomial (a + x)^n of degree n expands into a sum of n + 1 terms.
Since the binomial coefficient (n choose k) is the coefficient of a^(n−k) x^k when you multiply out (a + x)^n, we have the binomial theorem.

Binomial Theorem

The binomial (a + x)n equals


         
(n choose 0) a^n + (n choose 1) a^(n−1) x + (n choose 2) a^(n−2) x² + · · · + (n choose n−1) a x^(n−1) + (n choose n) x^n.   (A.2.6)

Using summation notation, the binomial theorem states


(a + x)^n = Σ_{k=0}^n (n choose k) a^(n−k) x^k.   (A.2.7)
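For instance, a quick numerical check of (A.2.7), at the (arbitrarily chosen) values a = 2, x = 3, n = 5:

from scipy.special import comb

a, x, n = 2, 3, 5

lhs = (a + x)**n
rhs = sum( comb(n,k,exact=True) * a**(n-k) * x**k for k in range(n+1) )
print(lhs, rhs)    # both equal 3125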

The binomial coefficient (n choose k) is called “n-choose-k”, because it is the coefficient of the term corresponding to choosing k x’s when multiplying the n factors in the product

(a + x)^n = (a + x)(a + x)(a + x) . . . (a + x).

For example, the term (4 choose 2) a²x² corresponds to choosing two a’s, and two x’s, when multiplying the four factors in the product

(a + x)^4 = (a + x)(a + x)(a + x)(a + x).



n = 0: 1
n = 1: 1 1

n = 2: 1 2 1
n = 3: 1 3 3 1
n = 4: 1 4 6 4 1
n = 5: 1 5 10 10 5 1
n = 6: ⋆ 6 15 20 15 6 ⋆

n = 7: 1 ⋆ 21 35 35 21 ⋆ 1
n = 8: 1 8 ⋆ 56 70 56 ⋆ 8 1
n = 9: 1 9 36 ⋆ 126 126 ⋆ 36 9 1
n = 10: 1 10 45 120 ⋆ 252 ⋆ 120 45 10 1

Fig. A.2 Pascal’s triangle.

The binomial coefficients may be arranged in a triangle, Pascal’s triangle


(Figure A.2). Can you figure out the numbers ⋆ in this triangle before peeking
ahead? Based on (A.2.11), here is code generating Pascal’s triangle.

from numpy import *

N = 10
Comb = zeros((N,N),dtype=int)
Comb[0,0] = 1

for n in range(1,N):
    Comb[n,0] = Comb[n,n] = 1
    for k in range(1,n): Comb[n,k] = Comb[n-1,k] + Comb[n-1,k-1]

Comb

In Pascal’s triangle, the very top row has one number in it: This is the
zeroth row corresponding to n = 0 and the binomial expansion of (a+x)0 = 1.
The first row corresponds to n = 1; it contains the numbers (1, 1), which
correspond to the binomial expansion of (a + x)1 = 1a + 1x. We say the
zeroth entry (k = 0) in the first row (n = 1) is 1 and the first entry (k = 1)
in the first row is 1. Similarly, the zeroth entry (k = 0) in the second row
(n = 2) is 1, and the second entry (k = 2) in the second row (n = 2) is 1.
The second entry (k = 2) in the fourth row (n = 4) is 6. For every row, the
entries are counted starting from k = 0, and end with k = n, so there are
n + 1 entries in row n. With this understood, the k-th entry in the n-th row

is the binomial coefficient n-choose-k. So 10-choose-2 is


 
(10 choose 2) = 45.

We can learn a lot about the binomial coefficients from this triangle. First,
we have 1’s all along the left edge. Next, we have 1’s all along the right edge.
Similarly, one step in from the left or right edge, we have the row number.
Thus we have

(n choose 0) = 1 = (n choose n),   (n choose 1) = n = (n choose n−1),   n ≥ 1.

Note also Pascal’s triangle has a left-to-right symmetry: If you read off
the coefficients in a particular row, you can’t tell if you’re reading them from
left to right, or from right to left. It’s the same either way: The fifth row is
(1, 5, 10, 10, 5, 1). In terms of our notation, this is written
   
(n choose k) = (n choose n−k),   0 ≤ k ≤ n;

the binomial coefficients remain unchanged when k is replaced by n − k.


The key step in finding a formula for n-choose-k is to notice

(a + x)n+1 = (a + x)(a + x)n .

Let’s multiply this out when n = 3. From (A.2.4), we get

         
(4 choose 0) a^4 + (4 choose 1) a³x + (4 choose 2) a²x² + (4 choose 3) ax³ + (4 choose 4) x^4
= (3 choose 0) a^4 + (3 choose 1) a³x + (3 choose 2) a²x² + (3 choose 3) ax³
+ (3 choose 0) a³x + (3 choose 1) a²x² + (3 choose 2) ax³ + (3 choose 3) x^4.

Combining terms, this equals

(3 choose 0) a^4 + ((3 choose 1) + (3 choose 0)) a³x + ((3 choose 2) + (3 choose 1)) a²x² + ((3 choose 3) + (3 choose 2)) ax³ + (3 choose 3) x^4.

Equating corresponding coefficients of x, we get

(4 choose 1) = (3 choose 1) + (3 choose 0),   (4 choose 2) = (3 choose 2) + (3 choose 1),   (4 choose 3) = (3 choose 3) + (3 choose 2).

In general, a similar calculation establishes


     
(n+1 choose k) = (n choose k) + (n choose k−1),   1 ≤ k ≤ n.   (A.2.8)
This allows us to build Pascal’s triangle (Figure A.2), where, apart from
the ones on either end, each term (“the child”) in a given row is the sum of
the two terms (“the parents”) located directly above in the previous row.

Insert x = 1 and a = 1 in the binomial theorem to get


         
2^n = (n choose 0) + (n choose 1) + (n choose 2) + · · · + (n choose n−1) + (n choose n).   (A.2.9)

We conclude the sum of the binomial coefficients along the n-th row of Pascal’s triangle is 2^n (remember n starts from 0).
Now insert x = 1 and a = −1. You get

0 = (n choose 0) − (n choose 1) + (n choose 2) − · · · ± (n choose n−1) ± (n choose n).

Hence: the alternating1 sum of the binomial coefficients along the n-th row of
Pascal’s triangle is zero.

We now show

Binomial Coefficient
Let
C(n, k) = n(n − 1) · · · · · (n − k + 1) / (1 · 2 · · · · · k) = n!/(k!(n − k)!).

Then

(n choose k) = C(n, k),   0 ≤ k ≤ n.   (A.2.10)

To establish (A.2.10), because

C(0, 0) = 1 = (0 choose 0),

it is enough to show C(n, k) also satisfies (A.2.8),


1 Alternating means the plus-minus pattern + − + − + − . . . .

C(n + 1, k) = C(n, k) + C(n, k − 1), 1 ≤ k ≤ n. (A.2.11)

To establish (A.2.11), we simplify

C(n, k) + C(n, k − 1) = n!/(k!(n − k)!) + n!/((k − 1)!(n − k + 1)!)
= (n!/((k − 1)!(n − k)!)) (1/k + 1/(n − k + 1))
= n!(n + 1)/((k − 1)!(n − k)! k(n − k + 1))
= (n + 1)!/(k!(n + 1 − k)!) = C(n + 1, k).

This establishes (A.2.11), and, consequently, (A.2.10).


For example,

(7 choose 3) = (7 · 6 · 5)/(1 · 2 · 3) = 35 = (7 choose 4)   and   (10 choose 2) = (10 · 9)/(1 · 2) = 45 = (10 choose 8).

The formula (A.2.10) is easy to remember: There are k terms in the numerator
as well as the denominator, the factors in the denominator increase starting
from 1, and the factors in the numerator decrease starting from n.
In Python, the code

from scipy.special import comb

comb(n,k)
comb(n,k,exact=True)

returns the binomial coefficient.

The binomial coefficient (n choose k) makes sense even for fractional n. This can be seen from (A.2.10). For example, for n = 1/2 and k = 3,

(1/2 choose 3) = (1/2)(1/2 − 1)(1/2 − 2)/(1 · 2 · 3) = (1/2)(−1/2)(−3/2)/6 = 3/48.   (A.2.12)

This works also for n negative,

(−1/2 choose 3) = (−1/2)(−1/2 − 1)(−1/2 − 2)/(1 · 2 · 3) = (−1/2)(−3/2)(−5/2)/6 = −15/48.   (A.2.13)
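These fractional values can be checked with scipy.special.binom, which accepts real arguments:

from scipy.special import binom

print(binom(1/2,3))     # 3/48 = 0.0625
print(binom(-1/2,3))    # -15/48 = -0.3125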

In fact, in (A.2.10), n may be any real number, for example n = 2.

A.3 The Exponential Function

In this section, our first goal is to derive (A.1.2), as promised in §A.1.


To begin, use the binomial theorem (A.2.7) with a = 1 and x = 1/n,
obtaining
(1 + 1/n)^n = Σ_{k=0}^n (n choose k) 1^(n−k) (1/n)^k = Σ_{k=0}^n (1/k!) · n(n − 1)(n − 2) · · · (n − k + 1)/(n · n · n · · · · · n).

Rewriting this by pulling out the first two terms k = 0 and k = 1 leads to

(1 + 1/n)^n = 1 + 1 + Σ_{k=2}^n (1/k!) (1 − 1/n)(1 − 2/n) · · · (1 − (k−1)/n).   (A.3.1)

From (A.3.1), we can tell a lot. First, since all terms are positive, we see

(1 + 1/n)^n ≥ 2,   n ≥ 1.

Second, each factor in (A.3.1) is of the form

1 − j/n,   1 ≤ j ≤ k − 1.   (A.3.2)

Since n is in the denominator, each such factor increases with n. Moreover,


as n increases, the number of terms in (A.3.1) increases, hence so does the
sum. We conclude

(1 + 1/n)^n increases as n increases.

Third, when k ≥ 2, we know

k! = k(k − 1)(k − 2) · · · 3 · 2 ≥ 2^(k−1).

Since each factor in (A.3.2) is no greater than 1, by (A.3.1),

(1 + 1/n)^n ≤ 1 + 1 + Σ_{k=2}^n 1/k! ≤ 2 + Σ_{k=2}^n 1/2^(k−1).   (A.3.3)

But we can show

Σ_{k=2}^n 1/2^(k−1) = 1/2 + 1/4 + 1/8 + · · · + 1/2^(n−1) ≤ 1,

as follows.
A geometric sum is a sum of the form
sn = 1 + a + a² + · · · + a^(n−1) = Σ_{k=0}^{n−1} a^k.

Multiplying sn by a results in almost the same sum,

asn = a + a2 + a3 + · · · + an−1 + an = sn + an − 1,

yielding
(a − 1)sn = an − 1.
When a ̸= 1, we may divide by a − 1, obtaining
sn = Σ_{k=0}^{n−1} a^k = 1 + a + a² + · · · + a^(n−1) = (a^n − 1)/(a − 1).   (A.3.4)

Inserting a = 1/2, we conclude

Σ_{k=2}^n 1/2^(k−1) = Σ_{k=1}^{n−1} 1/2^k = sn − 1 = 2(1 − 2^(−n)) − 1 ≤ 1,   n ≥ 2.

By (A.3.3), we arrive at
2 ≤ (1 + 1/n)^n ≤ 3,   n ≥ 1.   (A.3.5)

Since a bounded increasing sequence has a limit (§A.7), this establishes the
following strengthening of (A.1.5).

Euler’s Constant
The limit

e = lim_{n→∞} (1 + 1/n)^n   (A.3.6)

exists and satisfies 2 ≤ e ≤ 3.

Standard properties of limits, such as

lim_{n→∞} (an + bn) = lim_{n→∞} an + lim_{n→∞} bn,   lim_{n→∞} an bn = (lim_{n→∞} an)(lim_{n→∞} bn),

are in §A.6, see Exercises A.6.9 and A.6.10. Nevertheless, the intuition is
clear: (A.3.6) is saying there is a specific positive number e with
(1 + 1/n)^n ≈ e

for n large.
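A few terms of this limit, computed directly:

for n in [1, 2, 3, 100, 10000, 1000000]:
    print(n, (1 + 1/n)**n)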

We use (A.3.5) to establish (A.1.2). Write (A.1.2) as an ≤ bn ≤ cn . When


n = 1,
a1 = b1 = c1 .
Moreover, as n increases, an , bn , cn all increase. Therefore, to establish
(A.1.2), it is enough to show bn increases faster than an , and cn increases
faster than bn , both as n increases.
To measure how an , bn , cn increase with n, divide the (n + 1)-st term by
the n-th term: It is enough to show
$$\frac{a_{n+1}}{a_n} \le \frac{b_{n+1}}{b_n} \le \frac{c_{n+1}}{c_n}.$$
But we already know
$$\frac{b_{n+1}}{b_n} = n + 1,$$
and, from (A.3.5),
$$\frac{a_{n+1}}{a_n} = \frac{3((n+1)/3)^{n+1}}{3(n/3)^n} = (n+1)\cdot\frac13\cdot\left(1+\frac1n\right)^n \le n+1 = \frac{b_{n+1}}{b_n},$$

and, from (A.3.5) again,


$$\frac{b_{n+1}}{b_n} = n+1 \le (n+1)\cdot\frac12\cdot\left(1+\frac1n\right)^n = \frac{2((n+1)/2)^{n+1}}{2(n/2)^n} = \frac{c_{n+1}}{c_n}.$$

Since we’ve shown bn increases faster than an , and cn increases faster than
bn , we have derived (A.1.2).

By definition, Euler’s constant e satisfies (A.3.6). To obtain a second for-


mula for e, insert n = ∞ in (A.3.1), which means let n grow to infinity
without bound in (A.3.1). Using 1/∞ = 0, since the k-th term approaches
1/k!, and since the number of terms increases with n, we obtain the second
formula

$$e = 1 + 1 + \sum_{k=2}^{\infty}\frac{1}{k!}\left(1-\frac{1}{\infty}\right)\left(1-\frac{2}{\infty}\right)\dots\left(1-\frac{k-1}{\infty}\right) = \sum_{k=0}^{\infty}\frac{1}{k!}.$$

To summarize,

Euler’s Constant
Euler’s constant satisfies

$$e = \sum_{k=0}^{\infty}\frac{1}{k!} = 1 + 1 + \frac{1}{2} + \frac{1}{6} + \frac{1}{24} + \frac{1}{120} + \frac{1}{720} + \dots \tag{A.3.7}$$

Depositing one dollar in a bank offering 100% interest returns two dollars
after one year. Depositing one dollar in a bank offering the same annual
interest compounded at mid-year returns
$$\left(1+\frac12\right)^2 = 2.25$$

dollars after one year.


Depositing one dollar in a bank offering the same annual interest com-
pounded at n intermediate time points returns (1 + 1/n)n dollars after one
year.
Passing to the limit, depositing one dollar in a bank and continuously
compounding at an annual interest rate of 100% returns e dollars after one
year. Because of this, (A.3.6) is often called the compound-interest formula.

Now we derive the result of continuously compounding at any specified


annual interest rate x. Note here x is a proportion, not a percent. An interest
rate of 30% corresponds to x = .3 in the exponential function.

Exponential Function

For any real number x, the limit


$$\exp x = \lim_{n\to\infty}\left(1+\frac{x}{n}\right)^n \tag{A.3.8}$$
exists. In particular, exp 0 = 1 and exp 1 = e.

Fig. A.3 The exponential function exp x.

Note, in the compound-interest interpretation, when x > 0, the bank is


giving you interest, while, if x < 0, the bank is taking away interest, leading
to a continual loss.
To derive this, assume first x > 0 is a positive real number. Then, exactly
as before, using the binomial theorem,
$$\left(1+\frac{x}{n}\right)^n, \qquad n \ge 1,$$
is increasing with n, so the limit in (A.3.8) is well-defined.
To establish the existence of the limit when x < 0, we first show

$$(1-x)^n \ge 1 - nx, \qquad 0 < x < 1,\ n \ge 1. \tag{A.3.9}$$

This follows inductively: Each of the following inequalities is implied by the


preceding one,

$$
\begin{aligned}
(1-x) &= 1-x\\
(1-x)^2 &= 1-2x+x^2 \ge 1-2x\\
(1-x)^3 &= (1-x)(1-x)^2 \ge (1-x)(1-2x) = 1-3x+2x^2 \ge 1-3x\\
(1-x)^4 &= (1-x)(1-x)^3 \ge (1-x)(1-3x) = 1-4x+3x^2 \ge 1-4x\\
&\;\;\vdots
\end{aligned}
$$

This establishes (A.3.9) for all n ≥ 1.


Now let x be any real number. Then, for n large enough, x²/n² lies between
0 and 1. Replacing x by x²/n² in (A.3.9), we obtain
$$1 \ge \left(1-\frac{x^2}{n^2}\right)^n \ge 1 - \frac{x^2}{n}.$$

As n → ∞, both sides of this last equation approach 1, so


$$\lim_{n\to\infty}\left(1-\frac{x^2}{n^2}\right)^n = 1. \tag{A.3.10}$$

Now let n grow without bound in


$$\left(1+\frac{x}{n}\right)^n\left(1-\frac{x}{n}\right)^n = \left(1-\frac{x^2}{n^2}\right)^n.$$

Since the limit exp x is well-defined when x > 0, by (A.3.10), we obtain


$$\exp x \cdot \lim_{n\to\infty}\left(1-\frac{x}{n}\right)^n = 1, \qquad x > 0.$$
This shows the limit exp x in (A.3.8) is well-defined when x < 0, and
$$\exp(-x) = \frac{1}{\exp x}, \qquad \text{for all } x.$$
The code

from numpy import *
from matplotlib.pyplot import *

# x-grid for the plot (the range here is chosen for illustration)
x = arange(-3,3,.01)

grid()
plot(x,exp(x))
show()

returns Figure A.3.

Repeating the logic yielding (A.3.1), we have

$$\left(1+\frac{x}{n}\right)^n = 1 + x + \sum_{k=2}^{n}\frac{x^k}{k!}\left(1-\frac1n\right)\left(1-\frac2n\right)\dots\left(1-\frac{k-1}{n}\right). \tag{A.3.11}$$

Letting n → ∞ in (A.3.11) as before, results in the following.

Exponential Series

The exponential function is always positive and satisfies, for every real
number x,

$$\exp x = \sum_{k=0}^{\infty}\frac{x^k}{k!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \frac{x^6}{720} + \dots \tag{A.3.12}$$

The graph of exp x is in Figure A.3.
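To illustrate (A.3.12) numerically, here is a small check (not from the text) comparing a partial sum of the exponential series with numpy's exp; the choice x = 2 and 20 terms is arbitrary.

from numpy import exp
from math import factorial

x, N = 2.0, 20
partial = sum(x**k / factorial(k) for k in range(N))
print(partial, exp(x))   # the two values agree to many decimal places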

We use the binomial theorem one more time to show

Law of Exponents

For real numbers x and y,

exp(x + y) = exp x · exp y.

To see this, multiply out the sums

(a0 + a1 + a2 + a3 + . . . )(b0 + b1 + b2 + b3 + . . . )

in a “symmetric” manner, obtaining

a0 b0 + (a0 b1 + a1 b0 ) + (a0 b2 + a1 b1 + a2 b0 ) + (a0 b3 + a1 b2 + a2 b1 + a3 b0 ) + . . .

Using summation notation, the n-th term in this last sum is


$$\sum_{k=0}^{n} a_k b_{n-k} = a_0 b_n + a_1 b_{n-1} + \dots + a_{n-1}b_1 + a_n b_0.$$

Thus
$$\left(\sum_{k=0}^{\infty} a_k\right)\left(\sum_{m=0}^{\infty} b_m\right) = \sum_{n=0}^{\infty}\left(\sum_{k=0}^{n} a_k b_{n-k}\right).$$

Now insert
$$a_k = \frac{x^k}{k!}, \qquad b_{n-k} = \frac{y^{n-k}}{(n-k)!}.$$
Then the n-th term in the resulting sum equals, by the binomial theorem,
$$\sum_{k=0}^{n} a_k b_{n-k} = \sum_{k=0}^{n}\frac{x^k}{k!}\,\frac{y^{n-k}}{(n-k)!} = \frac{1}{n!}\sum_{k=0}^{n}\binom{n}{k}x^k y^{n-k} = \frac{1}{n!}(x+y)^n.$$

Thus
$$\exp x\cdot\exp y = \left(\sum_{k=0}^{\infty}\frac{x^k}{k!}\right)\left(\sum_{m=0}^{\infty}\frac{y^m}{m!}\right) = \sum_{n=0}^{\infty}\frac{(x+y)^n}{n!} = \exp(x+y).$$

This derives the law of exponents.
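A two-line numerical check (not from the text) of the law of exponents; the values of x and y are arbitrary.

from numpy import exp

x, y = 1.3, -0.7
print(exp(x + y), exp(x) * exp(y))   # equal up to rounding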


Repeating the law of exponents n times implies
$$\exp(nx) = \exp(x+x+\dots+x) = \exp x\cdot\exp x\cdots\exp x = (\exp x)^n.$$



If we write $\sqrt[n]{x} = x^{1/n}$, replacing x by x/n yields
$$\exp(x/n) = (\exp x)^{1/n}.$$

Combining the last two equations yields
$$\exp(nx/m) = \left((\exp x)^n\right)^{1/m} = (\exp x)^{n/m}.$$

Inserting x = 1 in this last equation, it follows, for any rational number x = n/m,
$$\exp x = \exp(1\cdot x) = (\exp 1)^x = e^x.$$
Because of this, as a matter of convenience, we write the exponential function
either way, exp x or $e^x$, even when x is not rational.

Exponential Notation

For any real number x,


$$e^x = \exp x.$$

Suppose 0 < r < 1. Then r² < r, r³ < r, and so on. Replacing x by rx in
the exponential series (A.3.12),
$$
\begin{aligned}
e^{rx} &= 1 + rx + \frac{1}{2!}r^2x^2 + \frac{1}{3!}r^3x^3 + \dots\\
&< 1 + rx + \frac{1}{2!}rx^2 + \frac{1}{3!}rx^3 + \dots\\
&= 1 - r + re^x.
\end{aligned} \tag{A.3.13}
$$

From this we can show

Convexity of the Exponential Function

For 0 < r < 1,

exp((1 − r)x + ry) < (1 − r) exp x + r exp y. (A.3.14)

To derive (A.3.14), replace x by y − x in (A.3.13), obtaining
$$e^{r(y-x)} < 1 - r + re^{y-x}.$$
Now multiply both sides by $e^x$, obtaining (A.3.14).



Graphically, the convexity of the exponential function is the fact that the
line segment joining two points on the graph lies above the graph (Figure
A.4).

Fig. A.4 Convexity of the exponential function.

Convexity is discussed further in §4.5.
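The convexity inequality (A.3.14) can also be checked numerically; this snippet is an illustration (not from the text), and the points x, y and the weight r are arbitrary choices.

from numpy import exp

x, y, r = -1.0, 2.0, 0.3
left = exp((1 - r)*x + r*y)
right = (1 - r)*exp(x) + r*exp(y)
print(left, right, left < right)   # prints True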

Exercises

Exercise A.3.1 Assume a bank gives 50% annual interest on deposits. After
one year, what does $1 become? Do this when the money is compounded once,
twice, and at every instant during the year.

Exercise A.3.2 Assume a bank gives -50% annual interest on deposits. After
one year, what does $1 become? Do this when the money is compounded once,
twice, and at every instant during the year.

Exercise A.3.3 Which of the following is correct? For n large,


$$n+1 \approx n, \qquad e^{n+1} \approx e^n, \qquad e^{e^{n+1}} \approx e^{e^n}.$$

≈ is asymptotic equality (see §A.6).

Exercise A.3.4 Extend (A.3.9) by showing

(1 − a)(1 − b)(1 − c) ≥ 1 − (a + b + c),

valid for a, b, c in the interval [0, 1]. This remains valid for any number of
factors.

Exercise A.3.5 Use the previous exercise, (A.3.1), (A.3.3), and the identity

$$1 + 2 + 3 + \dots + (k-2) + (k-1) = \frac{k(k-1)}{2}$$

to derive the error estimate


$$0 \le \sum_{k=0}^{n}\frac{1}{k!} - \left(1+\frac1n\right)^n \le \frac{3}{2n}, \qquad n \ge 2.$$

This is a complete derivation of (A.3.7).

Exercise A.3.6 Use (A.3.4) to derive the geometric series



$$\frac{1}{1-a} = \sum_{n=0}^{\infty} a^n = 1 + a + a^2 + a^3 + \dots, \qquad -1 < a < 1. \tag{A.3.15}$$

Exercise A.3.7 Take the derivative of (A.3.15) to obtain



$$\frac{1}{(1-a)^2} = \sum_{n=1}^{\infty} na^{n-1} = 1 + 2a + 3a^2 + 4a^3 + \dots, \qquad -1 < a < 1. \tag{A.3.16}$$

A.4 Complex Numbers

In §1.4, we studied points in two dimensions, and we saw how points can be
added and subtracted. In §2.1, we studied points in any number of dimensions,
and there we also added and subtracted points.

Fig. A.5 Multiplying and dividing points on the unit circle.



In two dimensions, each point has a shadow (Figure 1.13). By stacking


shadows, points in the plane can be multiplied and divided (Figure A.5). In
this sense, points in the plane behave like numbers, because they follow the
usual rules of arithmetic.
This ability of points in the plane to follow the usual rules of arithmetic
is unique to two dimensions (considering one dimension as part of two di-
mensions), and not present in any other dimension. When thought of in this
manner, points in the plane are called complex numbers, and the plane is the
complex plane.

To define multiplication of points, let P = (x, y) and P ′ = (x′ , y ′ ) be


points on the unit circle. Stack the shadow of P ′ on top of the shadow of P ,
as in Figure A.5. Because angle stacking is at the basis of angle measurement
[12], we must do this without knowledge of angle measure.
Here is how one does this without any angle measurement: Mark Q = x′ P
at distance x′ along the vector OP joining O and P , and draw the circle
with radius y ′ and center Q. Then this circle intersects the unit circle at two
points, both called P ′′ .
We think of the first point P ′′ as the result of multiplying P and P ′ , and
we write P ′′ = P P ′ , and we think of the second point P ′′ as the result of
dividing P by P ′ , and we write P ′′ = P/P ′ . Then we have

Multiplication and Division of Points

For P = (x, y) and P' = (x', y') on the unit circle, when x'y' ≠ 0,
$$P'' = PP' = (xx' - yy',\ x'y + xy'), \qquad P'' = P/P' = (xx' + yy',\ x'y - xy'). \tag{A.4.1}$$

To derive (A.4.1), let P ⊥ = (−y, x) (“P -perp”). Then

x′ P + y ′ P ⊥ = (x′ x, x′ y) + (−y ′ y, y ′ x) = (xx′ − yy ′ , x′ y + xy ′ ),


x′ P − y ′ P ⊥ = (x′ x, x′ y) − (−y ′ y, y ′ x) = (xx′ + yy ′ , x′ y − xy ′ ),

so (A.4.1) is equivalent to

P ′′ = x′ P ± y ′ P ⊥ . (A.4.2)

To establish (A.4.2), since P ′′ is on the circle of center Q and radius y ′ ,


we may write P ′′ = Q + y ′ R, for some point R on the unit circle (see §1.4).
Interpreting points as vectors, and using (1.4.6), P ′′ = x′ P + y ′ R is on the
unit circle iff

$$
\begin{aligned}
1 = |x'P + y'R|^2 &= |x'P|^2 + 2x'P\cdot y'R + |y'R|^2\\
&= x'^2|P|^2 + 2x'y'\,P\cdot R + y'^2|R|^2\\
&= x'^2 + y'^2 + 2x'y'\,P\cdot R\\
&= 1 + 2x'y'\,P\cdot R.
\end{aligned}
$$

But this happens iff P · R = 0, which happens iff R = ±P ⊥ (Figure 1.21).


This establishes (A.4.2).

More generally, if r = |P | and r′ = |P ′ |, let R be any point satisfying


|R| = r. Then
P ′′ = Q + y ′ R = x′ P + y ′ R
satisfies |P ′′ | = rr′ exactly when R = ±P ⊥ , leading to the two points in
(A.4.1).
Let P̄ be the conjugate (x, −y) of P = (x, y). The first P ′′ is the product

P P ′ = (xx′ − yy ′ , x′ y + xy ′ ), (A.4.3)

but the second P ′′ is not division, it is the hermitian product P P̄ ′ of P and


P̄ ′ .
The correct formula for division is given by
$$P/P' = \frac{1}{r'^2}\,P\bar{P}' = \frac{1}{x'^2+y'^2}\,(xx'+yy',\ x'y-xy'). \tag{A.4.4}$$

When r′ = 1, (A.4.4) reduces to the formula in (A.4.1).


With this understood, it is easily checked that division undoes multiplica-
tion,
(P/P ′ )P ′ = P.
In fact, one can check that multiplication and division as defined by (A.4.3)
and (A.4.4) follow the usual rules of arithmetic.
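Since Python has built-in complex numbers, formulas (A.4.3) and (A.4.4) can be checked against Python's own complex arithmetic. This snippet is an illustration, not code from the text; the sample points are arbitrary.

# P = (x, y) and P' = (x', y') as complex numbers
x, y = 1.0, 2.0
xp, yp = 3.0, 4.0

z, zp = complex(x, y), complex(xp, yp)

# product formula (A.4.3)
prod = (x*xp - y*yp, xp*y + x*yp)
print(prod, z*zp)            # (-5.0, 10.0) and (-5+10j)

# division formula (A.4.4)
r2 = xp**2 + yp**2
quot = ((x*xp + y*yp)/r2, (xp*y - x*yp)/r2)
print(quot, z/zp)            # (0.44, 0.08) and (0.44+0.08j)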

It is natural to identify points on the horizontal axis with real numbers,


because, using (A.4.1), z = (x, 0) and z ′ = (x′ , 0) implies

$$z + z' = (x,0) + (x',0) = (x+x',\,0), \qquad zz' = (xx' - 0\cdot 0,\ x'\cdot 0 + x\cdot 0) = (xx',\,0).$$

Because of this, we can write z = x instead of z = (x, 0), this only for points
in the plane, and we call the horizontal axis the real axis.
Similarly, let i = (0, 1). Then the point i is on the vertical axis, and, using
(A.4.1), one can check

ix = (0, 1)(x, 0) = (−0, x) = x⊥ .

Thus the vertical axis consists of all points of the form ix. These are called
imaginary numbers, and the vertical axis is the imaginary axis.
Using i, any point P = (x, y) may be written

P = x + iy,

since
x + iy = (x, 0) + (y, 0)(0, 1) = (x, 0) + (0, y) = (x, y).
This leads to Figure A.6. In this way, real numbers x are considered complex
numbers with zero imaginary part, x = x + 0i.

Fig. A.6 Complex numbers

Since by (A.4.1), i² = (0, 1)² = (−1, 0) = −1, we have

Square Root of −1

The complex number i satisfies i² = −1.

When thinking of points in the plane as complex numbers, it is traditional


to denote them by z instead of P . By (A.4.1), we have

z = x + iy, z ′ = x′ + iy ′ =⇒ zz ′ = (xx′ − yy ′ ) + i(x′ y + xy ′ ),

and
$$\frac{z}{z'} = \frac{x+iy}{x'+iy'} = \frac{(xx'+yy') + i(x'y-xy')}{x'^2+y'^2}.$$
In particular, one can always "move" the i from the denominator to the
numerator by the formula
$$\frac{1}{z} = \frac{1}{x+iy} = \frac{x-iy}{x^2+y^2} = \frac{\bar z}{|z|^2}.$$

Here x² + y² = r² = |z|² is the absolute value squared of z, and z̄ is the
conjugate of z.

Let r, r', r'' and θ, θ', θ'' be the polar coordinates (Figure 1.16) of z, z',
z'' = zz'. Then Figure A.5 says θ'' = θ + θ'. Using angle stacking together with
his bisection method, Archimedes [13] defined angle measure θ numerically
and derived θ'' = θ + θ'.
By elementary algebra,
$$(x^2+y^2)(x'^2+y'^2) = (xx'-yy')^2 + (x'y+xy')^2. \tag{A.4.5}$$
Since this says $r^2 r'^2 = r''^2$, we conclude

Polar Coordinates of Complex Numbers

If (r, θ) and (r′ , θ′ ) are the polar coordinates of complex numbers z


and z ′ , and (r′′ , θ′′ ) are the polar coordinates of the product z ′′ = zz ′ ,
then
r′′ = rr′ and θ′′ = θ + θ′ .

From this and (A.4.1), using (x, y) = (cos θ, sin θ), (x', y') = (cos θ', sin θ'),
we have the addition formulas
$$\sin(\theta+\theta') = \sin\theta\cos\theta' + \cos\theta\sin\theta', \qquad \cos(\theta+\theta') = \cos\theta\cos\theta' - \sin\theta\sin\theta'. \tag{A.4.6}$$

For example, if ω = cos θ + i sin θ, then the polar coordinates of ω are


r = 1 and θ. It follows the polar coordinates of ω 2 are r = 1 and 2θ, so
ω 2 = cos(2θ) + i sin(2θ).
By the same logic, for any power k, the polar coordinates of ω k are r = 1
and kθ, so ω k = cos(kθ) + i sin(kθ).
When P = (x, y) is thought of as a complex number z = x + iy, r is called
the absolute value, and we write r = |z|. Then $r = |z| = \sqrt{x^2+y^2}$ and
$$z = x + iy = r\cos\theta + ir\sin\theta = r(\cos\theta + i\sin\theta).$$

We can reverse the logic in the previous paragraph to compute square
roots. We define the square root of a complex number z to be a complex
number w satisfying w² = z. In this case, we write w = √z. If w is a square
root, so is −w, so there are two square roots ±w = ±√z.
The formula for w = √z is

$$z = x + yi \implies \sqrt{z} = \frac{r+x}{\sqrt{2r+2x}} + \frac{yi}{\sqrt{2r+2x}}. \tag{A.4.7}$$
Here $r = \sqrt{x^2+y^2}$, and this formula is valid as long as z is not a negative
number or zero. When z is a negative number or zero, z = −x with x ≥ 0,
we have $\sqrt{z} = i\sqrt{x}$. We conclude every complex number has square roots.
When z is on the unit circle, r = 1, so the formula reduces to
$$\sqrt{z} = \frac{1+x}{\sqrt{2+2x}} + \frac{yi}{\sqrt{2+2x}}.$$
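Formula (A.4.7) can be checked against numpy's complex square root; this is an illustration with an arbitrary test value, not code from the text.

from numpy import sqrt

z = 3 + 4j
x, y = z.real, z.imag
r = abs(z)

w = (r + x)/sqrt(2*r + 2*x) + 1j*y/sqrt(2*r + 2*x)
print(w, w**2)        # (2+1j) and (3+4j)
print(sqrt(z))        # numpy agrees: (2+1j)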

We will need the roots of unity in §3.2. This generalizes square roots, cube
roots, etc.
A complex number ω is a root of unity if ω d = 1 for some power d. If d is
the power, we say ω is a d-th root of unity.
For example, the square roots of unity are ±1, since (±1)2 = 1. Here we
have
1 = cos 0 + i sin 0, −1 = cos π + i sin π.
The fourth roots of unity are ±1 and ±i, since (±1)4 = 1 and (±i)4 = 1.
Here we have
1 = cos 0 + i sin 0,
i = cos(π/2) + i sin(π/2),
−1 = cos π + i sin π,
−i = cos(3π/2) + i sin(3π/2).

Fig. A.7 The second, third, and fourth roots of unity



In general, the roots of unity are denoted by powers of ω, so the square


roots of unity are 1 and ω = −1, and the fourth roots of unity are 1, ω = i,
ω 2 = −1, ω 3 = −i.
Let ω = cos θ + i sin θ. Since 1 = cos(2π) + i sin(2π) and ω k = cos(kθ) +
i sin(kθ), a d-th root of unity ω satisfies

ω = cos(2π/d) + i sin(2π/d). (A.4.8)

If ω^d = 1, then
$$\left(\omega^k\right)^d = \left(\omega^d\right)^k = 1^k = 1.$$
With ω given by (A.4.8), this implies

1, ω, ω 2 , . . . , ω d−1

are the d-th roots of unity.


If we set
$$\omega = -\frac12 + i\,\frac{\sqrt3}{2} = \cos(2\pi/3) + i\sin(2\pi/3),$$
then a calculation shows
$$1, \qquad \omega, \qquad \omega^2 = -\frac12 - i\,\frac{\sqrt3}{2}$$
are the cube roots of unity,
$$1^3 = 1, \qquad \omega^3 = 1, \qquad (\omega^2)^3 = 1.$$

Similarly, the fifth roots of unity are 1, ω, ω², ω³, ω⁴, where
$$\omega = -\frac14 + \frac{\sqrt5}{4} + i\sqrt{\frac58 + \frac{\sqrt5}{8}} = \cos(2\pi/5) + i\sin(2\pi/5).$$

Fig. A.8 The fifth, sixth, and fifteenth roots of unity



Summarizing,

Roots of Unity

Let d ≥ 1 and let

ω = cos(2π/d) + i sin(2π/d),

Then the d-th roots of unity are

1, ω, ω 2 , . . . , ω d−1 .

The roots satisfy

ω k = cos(2πk/d) + i sin(2πk/d), k = 0, 1, 2, . . . , d − 1.

Here even though ω depends on the degree d, in the notation, we do not


indicate the dependence of ω on d.
Since ω d = 1, one has, from Figures A.7 and A.8,

ω k + ω −k = ω k + ω d−k = 2 cos(2πk/d), k = 0, 1, 2, . . . , d − 1. (A.4.9)

This we need in §3.2.
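As a numerical illustration (not from the text), the d-th roots of unity and the identity (A.4.9) can be generated directly from the cos/sin formula; the degree d below is arbitrary.

from numpy import cos, sin, pi, arange, allclose

d = 5
k = arange(d)
omega = cos(2*pi/d) + 1j*sin(2*pi/d)      # primitive d-th root of unity

roots = omega**k                          # 1, omega, ..., omega^(d-1)
print(allclose(roots**d, 1))                                  # True: each root satisfies z^d = 1
print(allclose(omega**k + omega**(d-k), 2*cos(2*pi*k/d)))     # True: identity (A.4.9)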

A polynomial is an expression of the form

p(z) = z d + c1 z d−1 + c2 z d−2 + · · · + cd .

For example, p(z) = z 3 − 5z + 2 or p(z) = z 2 − 2z + 2. Here z is the variable,


and the constants c1 , c2 , . . . , cd are the coefficients.
A root of a polynomial p(z) is a complex number a satisfying p(a) = 0.
For example, the roots of z 2 − 2z + 2 are 1 ± i, and the roots of z 5 − 1 are
the fifth roots of unity 1, ω, ω 2 , ω 3 , ω 4 . In general, the roots of z d − 1 are
the d-th roots of unity 1, ω, ω 2 , . . . , ω d−1 .
The fundamental theorem of algebra states that every polynomial has as
many roots as its degree: If the degree of p(z) is d, there are d (not necessarily
distinct) roots a1 , a2 , . . . , ad of p(z), and p(z) may be factored into a product
$$p(z) = \prod_{k=1}^{d}(z-a_k) = (z-a_1)(z-a_2)\dots(z-a_d).$$

In particular, when p(z) = z d − 1, we have



$$\frac{z^d - 1}{z - 1} = \prod_{k=1}^{d-1}(z-\omega^k). \tag{A.4.10}$$

Here is sympy code for the roots of unity.

from sympy import solve, symbols, init_printing


init_printing()

z = symbols('z')

d = 5
solve(z**d - 1)

In numpy, the roots of p(z) = az 2 + bz + c are returned by

from numpy import roots

roots([a,b,c])

Since the cube roots of unity are the roots of p(z) = z 3 − 1, the code

from numpy import roots

roots([1,0,0,-1])

returns the cube roots

array([-0.5+0.8660254j, -0.5-0.8660254j, 1. +0.j ])

Exercises

Exercise A.4.1 Let P = (1, 2) and Q = (3, 4) and R = (5, 6). Calculate P Q,
P/Q, P R, P/R, QR, Q/R.
Exercise A.4.2 Let a = 1 + 2i and b = 3 + 4i and c = 5 + 6i. Calculate ab,
a/b, ac, a/c, bc, b/c.
Exercise A.4.3 We say z ′ is the reciprocal of z if zz ′ = 1. Show the reciprocal
of z = x + yi is

$$z' = \frac{x - yi}{x^2 + y^2}.$$

Exercise A.4.4 Show $\sqrt{z}$ given by (A.4.7) satisfies $(\sqrt{z})^2 = z$.

Exercise A.4.5 Check (A.4.5) is correct.

Exercise A.4.6 Let a, b, c be complex numbers, with a ≠ 0. Show the roots
of p(z) = az² + bz + c are given by the Babylonian quadratic formula
$$z = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

Exercise A.4.7 Let 1, ω, . . . , ω d−1 be the d-th roots of unity. Using the
code below, compute the product

(1 − ω)(1 − ω 2 )(1 − ω 3 ) . . . (1 − ω d−1 ).

What is the answer? Try different degrees d.

from sympy import prod, solve, symbols, simplify

z = symbols('z')
roots = solve(z**d - 1)

prod([ 1-a if a != 1 else 1 for a in roots ]).simplify()

The answer can be derived algebraically by using (A.4.10).

A.5 Integration

This section is a review of integration, using the fundamental theorem of


calculus and Python.
Let y = f (x) be a function, and let its graph be as in Figure A.9. The
integral
$$I = \int_a^b f(x)\,dx \tag{A.5.1}$$
is the area under the graph between the vertical lines at a and b.
To repeat, the integral is a number, the area of a specific region under the
graph y = f (x). In Figure A.9, the integral (A.5.1) is the sum of three areas:
red, green, blue.

Fig. A.9 Areas under the graph.

We use Figure A.9 to derive the

Fundamental Theorem of Calculus (FTC)

If F′(x) = f(x), then
$$\int_a^b f(x)\,dx = F(b) - F(a). \tag{A.5.2}$$

To derive this, let A(x) denote the area under the graph between the y-
axis and the vertical line at x. Then A(x) is the sum of the gray area and
the red area, A(a) is the gray area, and A(b) is the sum of four areas: gray,
red, green, and blue. It follows the integral (A.5.1) equals A(b) − A(a).
Since A(x + dx) is the sum of three areas, gray, red, green, it follows
A(x + dx) − A(x) is the green area. But the green area is approximately a
rectangle of width dx and height f (x). Hence the green area is approximately
f (x) × dx, or
A(x + dx) − A(x) ≈ f (x) dx.
As a consequence of this analysis,

$$\frac{A(x+dx) - A(x)}{dx} \approx f(x).$$
The smaller dx is, the closer the green area is to a rectangle. Taking the limit
dx → 0, the green rectangle becomes infinitely thin, and we obtain

$$A'(x) = \lim_{dx\to 0}\frac{A(x+dx) - A(x)}{dx} = f(x).$$

Now let F (x) be any function satisfying F ′ (x) = f (x). Then A(x) and
F (x) have the same derivative, so A(x)−F (x) has derivative zero. By (4.1.2),
A(x) − F (x) is a constant C, or A(x) = F (x) + C. This implies
$$\int_a^b f(x)\,dx = A(b) - A(a) = (F(b)+C) - (F(a)+C) = F(b) - F(a).$$

This completes the derivation of the fundamental theorem of calculus.


Often one writes $F(x)\big|_a^b$ for F(b) − F(a). The FTC then reads
$$\int_a^b f(x)\,dx = F(x)\Big|_{x=a}^{x=b}.$$

When F ′ (x) = f (x), F (x) is called an anti-derivative or indefinite integral


of f (x). This should not be confused with the integral (A.5.1), which is a
number, an area.
Since the total area between a and b may be sliced into many thin green
rectangles, interpreting the symbol $\int$ as "sum" explains the notation (A.5.1).
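To make the "sum of thin rectangles" picture concrete, here is a small illustration (not from the text) comparing a Riemann sum of thin rectangles with quad; the function, interval, and width dx are arbitrary.

from numpy import sin, arange
from scipy.integrate import quad

def f(x): return sin(x)

a, b, dx = 0, 3.14159265, .001
x = arange(a, b, dx)

riemann = sum(f(x) * dx)      # areas of thin rectangles of width dx
exact = quad(f, a, b)[0]
print(riemann, exact)         # both close to 2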

Important consequences of the FTC are integral additivity,


$$\int_a^b (f(x)+g(x))\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx,$$
and integral scaling,
$$\int_a^b cf(x)\,dx = c\int_a^b f(x)\,dx.$$

For example, if f(x) = x^d, then, by (4.1.4), F(x) = x^{d+1}/(d+1) satisfies
F′(x) = f(x), so, by the FTC,
$$\int_a^b x^d\,dx = F(b) - F(a) = \frac{b^{d+1}}{d+1} - \frac{a^{d+1}}{d+1}.$$
When d = 2, a = −1, b = 1, this is 2/3, which is the area under the parabola
in Figure A.10.
When a = 0, b = 1,
$$\int_0^1 t^d\,dt = \frac{1}{d+1}. \tag{A.5.3}$$

When F (x) can’t be found, we can’t use the FTC. Instead we use Python
to evaluate the integral (A.5.1) as follows.

from scipy.integrate import quad

d = 2

def f(x): return x**d

a,b = -1, 1

# integral of f(x) over the interval [a,b]


I = quad(f,a,b)

This not only returns the computed integral I but also an estimate of the
error between the computed integral and the theoretical value,
(0.6666666666666666, 7.401486830834376e-15).
quad refers to quadrature, which is another term for integration.

Fig. A.10 Area under the parabola.

Another example is the area under one hump of the sine curve in Figure A.11,
$$\int_0^{\pi}\sin x\,dx = -\cos\pi - (-\cos 0) = -(-1) + 1 = 2.$$

Here f (x) = sin x, F (x) = − cos x, F ′ (x) = f (x). The Python code quad
returns (2.0, 2.220446049250313e-14).
It is important to realize the integral (A.5.1) is the signed area under the
graph: Portions of areas that are below the x-axis are counted negatively. For
example,
$$\int_0^{2\pi}\sin x\,dx = -\cos(2\pi) - (-\cos 0) = -1 + 1 = 0.$$

Explicitly,
$$\int_0^{2\pi}\sin x\,dx = \int_0^{\pi}\sin x\,dx + \int_{\pi}^{2\pi}\sin x\,dx = 2 - 2 = 0,$$

so the areas under the first two humps in Figure A.11 cancel.

Fig. A.11 The graph and area under sin x.

Here is code for Figures A.10, A.11, A.12.

from numpy import *


from matplotlib.pyplot import *
from scipy.integrate import quad

def plot_and_integrate(f,a,b,pi_ticks=False):
    # initialize figure
    ax = axes()
    ax.grid(True)
    # draw x-axis and y-axis
    ax.axhline(0, color='black', lw=1)
    ax.axvline(0, color='black', lw=1)
    # set x-axis ticks as multiples of pi/2
    if pi_ticks: set_pi_ticks(a,b)
    x = linspace(a,b,100)
    plot(x,f(x))
    positive = f(x)>=0
    negative = f(x)<0
    ax.fill_between(x,f(x), 0, color='green', where=positive, alpha=.5)
    ax.fill_between(x,f(x), 0, color='red', where=negative, alpha=.5)
    I = quad(f,a,b,limit=1000)[0]
    title("integral equals " + str(I),fontsize = 10)
    show()

def f(x): return sin(x)/x


a, b = 0, 3*pi

plot_and_integrate(f,a,b,pi_ticks=True)

Above, the Python function set_pi_ticks(a,b) sets the x-axis tick mark
labels at the multiples of π/2. The code for set_pi_ticks is in §4.1.

Fig. A.12 Integral of sin x/x.

The exercises are meant to be done using the code in this section. For the
infinite limits below, use numpy.inf.

Exercises

Exercise A.5.1 Plot and integrate f (x) = x2 + A sin(5x) over the interval
[−10, 10], for amplitudes A = 0, 1, 2, 4, 15. Note the integral doesn’t depend
on A. Why?

Exercise A.5.2 Plot and integrate (Figure A.12)


$$\int_0^{3\pi}\frac{\sin x}{x}\,dx.$$

Exercise A.5.3 Plot and integrate f (x) = exp(−x) over [a, b] with a = 0,
b = 1, 10, 100, 1000, 10000.

Exercise A.5.4 Plot and integrate f(x) = 1 − x² over [−1, 1].

Exercise A.5.5 Plot and integrate f(x) = 1/√(1 − x²) over [−1, 1].
Exercise A.5.6 Plot and integrate f (x) = (− log x)n over [0, 1] for n =
2, 3, 4. What is the answer for general n?
Exercise A.5.7 With k = 7, n = 10, plot and integrate using Python
$$\int_0^1 x^k(1-x)^{n-k}\,dx.$$

From (5.1.17), what is the exact integral?


Exercise A.5.8 Plot and integrate f (x) = sin(nx)/x over [0, π] with n =
1, 2, 3, 4, . . . . What’s the limit of the integral as n → ∞?
Exercise A.5.9 Use numpy.inf to compute

$$\frac{2}{\pi}\int_0^{\infty}\frac{\sin x}{x}\,dx.$$

Exercise A.5.10 Use numpy.inf to plot the normal pdf and compute its
integral
$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx.$$

Exercise A.5.11 Let σ(x) = 1/(1+e−x ). Plot and integrate f (x) = σ(x)(1−
σ(x)) over [−10, 10]. What is the answer for (−∞, ∞)?
Exercise A.5.12 Let Pn (x) be the Legendre polynomial of degree n (§4.1).
Use num_legendre (§4.1) to compute the integral
$$\int_{-1}^{1} P_n(x)^2\,dx$$

for n = 1, 2, 3, 4. What is the integral for general n? Hint – take the reciprocal
of the answer.

A.6 Asymptotics and Convergence

Let a1 , a2 , . . . be a sequence of scalars. What does it mean to say an is


asymptotically zero? To make this notion precise, we introduce some termi-
nology.

We say an is bounded if all terms lie in some bounded interval. If b > 0,


we say an is bounded by b if |an | ≤ b. For example an = sin(n) is bounded by
b = 1. The constant b is a bound.
We say an is eventually bounded by a positive constant b, if, after ignoring
finitely many terms, the remaining terms are bounded by b. For example,
an = 1/n is eventually bounded by b = .01, since, after ignoring the first
ninety-nine terms, the sequence is bounded by b. The sequence b1 = 1, b2 = 1,
b3 = 1, . . . , is eventually bounded by 1, but not eventually bounded by 0.5.
Please note: If an is bounded by 5, then an is eventually bounded by 5.
On the other hand, if an is eventually bounded by 5, then an is bounded, but
not necessarily by 5, since “eventually” means we are ignoring finitely many
terms.
Typically we use the greek letter epsilon ϵ to denote small positive num-
bers.

Asymptotic Vanishing

If for any positive constant ϵ, no matter how small, an is eventually


bounded by ϵ, we say an is asymptotically zero or asymptotically van-
ishing, and we write an ≈ 0.

For example, bn = 1, 1, 1, . . . is not asymptotically zero, since bn is not


eventually bounded by ϵ = 0.5.
On the other hand, an = 1/n is asymptotically zero: To bound the se-
quence by ϵ = 1/10, we ignore the first nine terms. To bound the sequence by
ϵ = 1/100, we ignore the first ninety-nine terms. To bound the sequence by
ϵ = 1/10000, we ignore the first 9999 terms. Notice the smaller the desired
bound, the more terms need to be ignored.
Immediate consequences of the asymptotic vanishing definition are the
following properties (we skip the proofs).
1. |an | ≤ en and en ≈ 0 imply an ≈ 0.
2. an ≈ 0 and bn ≈ 0 implies an + bn ≈ 0,
3. an ≈ 0 and bn eventually bounded implies an bn ≈ 0.
These are intuitively clear. As a special case, for any constant c,

an ≈ 0 =⇒ can ≈ 0.

A sequence an is asymptotically positive if, apart from finitely many terms,


an is positive. More generally, an is asymptotically nonzero if, apart from
finitely many terms, an is not zero.
We say an is asymptotically one, if the difference an − 1 is asymptotically
zero. In this case, we write
an ≈ 1.
As a consequence of the above properties, we show

Convergence of Reciprocals

If an ≈ 1, then 1/an ≈ 1.

To derive this, assume an ≈ 1. Then an − 1 ≈ 0, so an − 1 is eventu-


ally bounded by any positive constant. In particular, an − 1 is eventually
bounded by ϵ = 1/2, which means an is eventually in the interval (1/2, 3/2),
so eventually an ≥ 1/2, or 1/an is eventually bounded. By property 3,
$$\frac{1}{a_n} - 1 = \frac{1}{a_n}(1 - a_n) \approx 0,$$
yielding the result.
The exercises exhibit many other consequences of the above properties.
The point of the exercises is that they depend only on these properties, or
their consequences; do not use any other properties you may have learned
elsewhere.
Let b1 , b2 , . . . be another sequence. We say an is asymptotically equal to
bn , and we write
an ≈ bn , (A.6.1)
if bn is asymptotically nonzero, and the ratio an /bn is asymptotically one, or
an /bn ≈ 1. As part of this, we are assuming bn is asymptotically nonzero, to
ensure we aren’t dividing by zero.
From the definition, it is not clear that an ≈ bn is equivalent to bn ≈ an .
Nevertheless, this is correct (Exercise A.6.1). Summarizing,

Asymptotic Equality
$$a_n \approx b_n \iff \frac{a_n}{b_n} \approx 1 \iff \frac{a_n}{b_n} - 1 \approx 0.$$

For example, let $a_n = n$, $b_n = n+1$, $c_n = n^2$. When n = 1000,
$$a_n = 1000, \qquad b_n = 1001, \qquad \frac{a_n}{b_n} = .9990000,$$
so here we do have $a_n \approx b_n$ for n large. On the other hand, here $c_n$ is a million,
and $a_n$ is a thousand, so we don't have $a_n \approx c_n$.
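The same comparison can be made for several values of n at once; this small check is an illustration, not code from the text.

from numpy import array

n = array([10**3, 10**6, 10**9])
a, b, c = n, n + 1, n**2

print(a/b)    # approaches 1: a_n is asymptotically equal to b_n
print(a/c)    # approaches 0: a_n is not asymptotically equal to c_n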
This is exactly what is meant in (A.1.6). While both sides in (A.1.6) in-
crease without bound, their ratio is close to one, for large n.
In general, an ≈ bn is not the same as an − bn ≈ 0: ratios and differences
behave differently. For example, based on (A.1.6), the following code

from numpy import *



def factorial(n):
    if n == 1: return 1
    else: return n * factorial(n-1)

def stirling(n): return sqrt(2*pi*n) * (n/e)**n

a = factorial(100)
b = stirling(100)

a/b, a-b

returns
(1.000833677872004, 7.773919124995513 × 10^154).
The first entry is close to one, but the second entry is far from zero.
If, however, bn ≈ b for some nonzero constant b, then (Exercise A.6.6)
ratios and differences are the same,

an ≈ bn ⇐⇒ an − bn ≈ 0. (A.6.2)

In particular, for a ̸= 0, an ≈ a is the same as an − a ≈ 0.

When we have an − a ≈ 0, we say a is the limit of an , and we write

$$a = \lim_{n\to\infty} a_n. \tag{A.6.3}$$

As we saw above, limits and asymptotic equality are the same, as long as the
limit is not zero. When a is the limit of an , we also say an converges to a, or
an approaches a and we write an → a.
Limits can be taken for sequences of points in Rd as well. Let an be a
sequence of points in Rd . We say an converges to a if an · v converges to a · v
for every vector v. Here we also write an → a and we write (A.6.3).

In Chapter 6, ≈ is used for random variables. We say random variables


Xn are asymptotically equal to random variables Yn , and we write Xn ≈ Yn ,
if their corresponding probabilities are asymptotically equal,

Prob(a < Xn < b) ≈ Prob(a < Yn < b),

for any interval (a, b).


In particular, when Y does not depend on n, the asymptotic equality
Xn ≈ Y is short for

Prob(a < Xn < b) ≈ Prob(a < Y < b). (A.6.4)

When Y is normal or standard normal or chi-squared, we also write Xn ≈


normal or Xn ≈ N (0, 1) or Xn ≈ χ2d .
Since probabilities are positive, here both interpretations in (A.6.2) hold,
hence we also have

$$\lim_{n\to\infty}\mathrm{Prob}(a < X_n < b) = \mathrm{Prob}(a < Y < b).$$

Also, Xn ≈ Y in the sense of (A.6.4) is equivalent to approximations of


the means
E(f (Xn )) ≈ E(f (Y )),
and equivalent to approximations of the moment-generating functions

MXn (t) ≈ MY (t).

Exercises

Exercise A.6.1 If an ≈ bn , then bn ≈ an .


Exercise A.6.2 If an ≈ 1 and bn ≈ 1, then an bn ≈ 1.
Exercise A.6.3 If an ≈ bn and bn ≈ cn , then an ≈ cn .
Exercise A.6.4 If an ≈ a′n and bn ≈ b′n , then an bn ≈ a′n b′n .
Exercise A.6.5 Let a ̸= 0. If an ≈ a, then an − a ≈ 0, and conversely.
Exercise A.6.6 If bn ≈ b and b ̸= 0, then (A.6.2) holds.
Exercise A.6.7 If an − bn ≈ 0 and bn → b, then an → b.
Exercise A.6.8 Let µ be a constant and let X̄1 , X̄2 , . . . be a sequence of
random variables. For example, in the LLN, X̄n is the sample mean. Show
X̄n ≈ µ is equivalent to

Prob(a < X̄n < b) ≈ 0,

for any interval (a, b) not containing µ.


Exercise A.6.9 If an → a and bn → b, then

an + bn → a + b, an bn → ab.

Exercise A.6.10 If an ≤ bn ≤ cn and an → L and cn → L, then bn → L


(sandwich lemma).

A.7 Existence of Minimizers

Several times in the text, we deal with minimizing functions, most notably for
the pseudo-inverse of a matrix (§2.3), for proper continuous functions (§4.5),
and for gradient descent (§7.3).
Previously, the technical foundations underlying the existence of minimiz-
ers were ignored. In this section, we review the foundational material sup-
porting the existence of minimizers.
For example, since y = ex is an increasing function, the minimum

$$\min_{0\le x\le 1} e^x = \min\{\,e^x \mid 0 \le x \le 1\,\}$$

is y ∗ = e0 = 1, and the minimizer, the location at which the minimum occurs,


is x∗ = 0. Here we have one minimizer.
For the function y = x4 −2x2 in Figure 4.5, the minimum over −2 ≤ x ≤ 2
is y ∗ = −1, which occurs at the minimizers x∗ = ±1. Here we have two
minimizers.
On the other hand, there is no minimizer for y = ex on the entire real line
−∞ < x < ∞, because as x approaches −∞, ex approaches zero, but never
reaches it. Our goal in this section is to establish conditions which guarantee
the existence of minimizers.

A sequence xn is increasing if x1 ≤ x2 ≤ x3 ≤ . . . . A sequence xn is


decreasing if x1 ≥ x2 ≥ x3 ≥ . . . . If xn is increasing, then −xn is decreasing,
and vice-versa.
In §A.6, we had bounded sequences and limits. A foundational axiom for
real numbers, the completeness property, is the following.

Completeness Property

Let xn be a bounded increasing sequence. Then xn has a limit

$$\lim_{n\to\infty} x_n.$$

By multiplying a sequence by a minus, we also see every bounded decreas-


ing sequence has a limit. In general, a bounded sequence need not converge.
However below we see it subconverges.

Let x1 , x2 , . . . be a sequence. A subsequence is a selection of terms

xn1 , xn2 , xn3 , . . . , n1 < n2 < n3 < . . . .



Here it is important that the indices n1 < n2 < n3 < . . . be strictly increas-
ing.
If a sequence x1 , x2 , . . . has a subsequence x′1 , x′2 , . . . converging to x∗ ,
then we say the sequence x1 , x2 , . . . subconverges to x∗ . For example, the
sequence 1, −1, 1, −1, 1, −1, . . . subconverges to 1 and also subconverges
to −1, as can be seen by considering the odd-indexed terms and the even-
indexed terms separately.

Bounded Sequences Must Subconverge

Let x1 , x2 , . . . be a bounded sequence of vectors. Then there is a


subsequence x′1 , x′2 , . . . converging to some x∗ .

To see this, assume first x1 , x2 , . . . are scalars, and let x1 , x2 , . . . be a


bounded sequence of numbers, say a ≤ xn ≤ b for n ≥ 1. Bisect the interval
I0 = [a, b] into two equal subintervals. Then at least one of the subintervals,
call it I1 , has infinitely many terms of the sequence. Select x′1 in I1 and let
x∗1 be the left endpoint of I1 .
Now bisect I1 into two equal subintervals. Then at least one of the subin-
tervals, call it I2 , has infinitely many terms of the sequence. Select x′2 in I2
and let x∗2 be the left endpoint of I2 . Continuing in this manner, we obtain a
subsequence x′1 , x′2 , . . . with x′n in In , and a sequence x∗1 , x∗2 , . . . .
Since the intervals are nested

I0 ⊃ I1 ⊃ I2 ⊃ . . . ,

the sequence x∗1 , x∗2 , . . . is increasing. By the completeness property,

$$x^* = \lim_{n\to\infty} x_n^*$$

exists. By definition of limit, this says en = x∗n − x∗ ≈ 0.


Since the length of In equals (b − a)/2n , and 2−n → 0,

0 ≤ x′n − x∗n ≤ (b − a)2−n ,

hence x′n − x∗n ≈ 0. By Exercise A.6.7, we conclude x′n → x∗ .


Now let x1 , x2 , . . . be a sequence of vectors in Rd , and let v be a vector;
then x1 · v, x2 · v, . . . are scalars, so, from the previous paragraph, there is a
subsequence x′n · v (depending on v) converging to some x∗v .
Let e1 , e2 , . . . , ed be the standard basis in Rd . By choosing v = e1 , there
is a subsequence x′1 , x′2 , . . . such that the first features of x′n converge. By
choosing v = e2 , and focusing on the subsequence x′1 , x′2 , . . . , there is a
sub-subsequence x′′1 , x′′2 , . . . such that the first and second features of x′′n

converge. Continuing in this manner, we obtain a subsequence x∗1 , x∗2 , . . .


such that the k-th feature of the subsequence converges to the k-th feature
of a single x∗ , for every 1 ≤ k ≤ d. From this, it follows that x∗n converges to
x∗ .

Let S be a set of vectors and let y = f (x) be a scalar-valued function


bounded below on S, f (x) ≥ b for some number b, for all x in S. Then b is a
lower bound for f (x) over S.
A minimizer is a vector x∗ satisfying

f (x∗ ) ≤ f (x), for every x in S.

As we saw above, a minimizer may or may not exist, and, when the minimizer
does exist, there may be several minimizers.
A function y = f (x) is continuous if f (xn ) approaches f (x∗ ) whenever xn
approaches x∗ ,

xn → x∗ =⇒ f (xn ) → f (x∗ ),

for every x∗ and every xn → x∗ . Here an → a means an − a ≈ 0, see §A.6.


Now we can establish

Existence of Minimizers

If f (x) is continuous on Rd and S is a bounded set in Rd , then there


is a minimizer x∗ ,
$$f(x^*) = \min_{x\ \text{in}\ S} f(x). \tag{A.7.1}$$

In general, the minimizer x∗ may lie outside the set S. To guarantee x∗


belongs to S, typically one assumes an additional requirement, the closedness
of S. In our applications of this result, we seek a minimizer somewhere in Rd ,
so this point is of no concern.
To establish the result, let m1 be a lower bound for f (x) over S, and let
x1 be any point in S. Then f (x1 ) ≥ m1 . Let

$$c = \frac{f(x_1) + m_1}{2}$$
be the midpoint between m1 and f (x1 ).
There are two possibilities. Either c is a lower bound or not. In the first
case, define m2 = c and x2 = x1 . In the second case, there is a point x2 in
S satisfying f (x2 ) < c, and we define m2 = m1 . As a consequence, in either
case, we have f (x2 ) ≥ m2 , m1 ≤ m2 , and

$$f(x_2) - m_2 \le \frac{1}{2}\left(f(x_1) - m_1\right).$$
Let
$$c = \frac{f(x_2) + m_2}{2}$$
be the midpoint between m2 and f (x2 ).
There are two possibilities. Either c is a lower bound or not. In the first
case, define m3 = c and x3 = x2 . In the second case, there is a point x3 in
S satisfying f (x3 ) < c, and we define m3 = m2 . As a consequence, in either
case, we have f (x3 ) ≥ m3 , m2 ≤ m3 , and
$$f(x_3) - m_3 \le \frac{1}{2^2}\left(f(x_1) - m_1\right).$$
Continuing in this manner, we have a sequence x1 , x2 , . . . in S, and an
increasing sequence m1 ≤ m2 ≤ . . . of lower bounds, with
$$f(x_n) - m_n \le \frac{2}{2^n}\left(f(x_1) - m_1\right).$$
Since S is bounded, xn subconverges to some x∗ . Since f (x) is continuous,
f (xn ) subconverges to f (x∗ ). Since f (xn ) ≈ mn and mn is a lower bound for
all n, f (x∗ ) is a lower bound, hence x∗ is a minimizer.
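The proof is constructive, so it can be illustrated in code. Below is a minimal sketch (not from the text) of the bisection-of-lower-bounds procedure for a continuous function of one variable on a bounded interval; the function, the set S, the initial lower bound, and the use of random search to look for points below the midpoint c are all arbitrary choices made for illustration.

from numpy import random

def f(x): return x**4 - 2*x**2          # minimum value -1 on S = [-2, 2]

a, b = -2, 2                            # the bounded set S
m = -10.0                               # m1: any lower bound for f on S
x = a                                   # x1: any point of S

for n in range(60):
    c = (f(x) + m)/2                    # midpoint between m_n and f(x_n)
    # look for a point of S with f below c; if none is found,
    # treat c as a new lower bound (this is the either/or step of the proof)
    candidates = random.uniform(a, b, 1000)
    better = candidates[f(candidates) < c]
    if len(better) > 0: x = better[0]   # second case: keep m, move x
    else: m = c                         # first case: raise the lower bound

print(x, f(x))                          # x near +1 or -1, f(x) near the minimum -1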

A.8 SQL

Recall matrices (§2.1), datasets, CSV files, spreadsheets, arrays, dataframes


are basically the same objects.
Databases are collections of tables, where a table is another object similar
to the above. Hence

matrix = dataset = CSV file = spreadsheet = table = array = dataframe     (A.8.1)
One difference is that each entry in a table may be a string, or code, or an
image, not just a number. Nevertheless, every table has rows and columns;
rows are usually called records, and columns are columns.
A database is a collection of several tables that may or may not be linked
by columns with common data. Software that serves databases is a database
server. Often the computer running this software is also called a database
server, or a server for short. Databases created by a database server (software)
are stored as files on the database server.
There are many varieties of database server software. Here we use Mari-
aDB, a widely-used open-source database server. By using open-source soft-
ware, one is assured to be using the “purest” form of the software, in the

sense that proprietary extensions are avoided, and the software is compatible
with the widest range of commercial variations.
Because database tables can contain millions of records, it is best to ac-
cess a database server programmatically, using an application programming
interface, rather than a graphical user interface. The basic API for inter-
acting with database servers is SQL (structured query language). SQL is a
programming language for creating and modifying databases.
Any application on your laptop that is used to access a database is called an
SQL client. The database server being accessed may be local, running on the
same computer you are logged into, or remote, running on another computer
on the internet. In our examples, the code assumes a local or remote database
server is being accessed.
Because SQL commands are case-insensitive, by default we write them
in lowercase. Depending on the SQL client, commands may terminate with
semicolons or not. As mentioned above, data may be numbers or strings.
The basic SQL commands are

select from
limit
select distinct
where/where not <column>
where <column> = <data> and/or <column> = <data>
order by <column1>,<column2>
insert into table (<column1>,<column2>,...) \
values (<data1>, <data2>, ...)
is null
update <table> set <column> = <data> where ...
like <regex> (%, _, [abc], [a-f], [!abc])
delete from <table> where ...
select min(<column>) from <table> (also max, count, avg)
where <column> in/not in (<data array>)
between/not between <data1> and <data2>
as
join (left, right, inner, full)
create database <database>
drop database <database>
create table <table>
truncate <table>
alter table <table> add <column> <datatype>
alter table <table> drop column <column>
insert into <table> select
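For concreteness, here are a few of these commands combined into queries against the Menu and OrdersIn tables built later in this section; the snippet assumes the engine connection created below and is only an illustration.

from sqlalchemy import text

q1 = text("select dish, price from Menu order by price desc limit 5;")
q2 = text("select distinct customerId from OrdersIn;")
q3 = text("select count(*) from OrdersIn where items like '%Hummus%';")

with engine.connect() as connection:
    for q in [q1, q2, q3]:
        print(connection.execute(q).fetchall())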

All the objects in (A.8.1) are also equivalent to a Python list-of-dicts. In


this section we explain how to convert between the objects

list-of-dicts ⇐⇒ JSON string ⇐⇒ dataframe ⇐⇒ CSV file ⇐⇒ SQL table     (A.8.2)
For all conversions, we use pandas. We begin describing a Python list-of-dicts,
because this does not require any additional Python packages.
A Python dictionary or dict is a Python object of the form (prices are in
cents)

item1 = {"dish": "Hummus", "price": 800, "quantity": 5}

This is an unordered listing of key-value pairs. Here the keys are the strings
dish, price, and quantity. Keys need not be strings; they may be integers or
any unmutable Python objects. Since a Python list is mutable, a key cannot
be a list. Values may be any Python objects, so a value may be a list. In
a dict, values are accessed through their keys. For example, item1["dish"]
returns 'Hummus'.
A list-of-dicts is simply a Python list whose elements are Python dicts, for
example,

item2 = {"dish": "Avocado", "price": 900, "quantity": 2}


L = [item1,item2]

Here L is a list and

len(L), L[0]["dish"]

returns

(2,'Hummus')

In other words, L is a list-of-dicts,

L == [{"dish": "Hummus", "price": 800, "quantity": 5}, {"dish": ... }]

returns True.
A list-of-dicts L can be converted into a string using the json module, as
follows:

from json import *



s = dumps(L)

Now print L and print s. Even though L and s “look” the same, L is a list,
and s is a string. To emphasize this point, note
• len(L) == 2 and len(s) == 99,
• L[0:2] == L and s[0:2] == '[{'
• L[8] returns an error and s[8] == ':'
To convert back the other way, use

from json import *

L1 = loads(s)

Then L == L1 returns True. Strings having this form are called JSON strings,
and are easy to store in a database as VARCHARs (see Figure A.16).
The basic object in the Python package pandas is the dataframe (Figures
A.13, A.14, A.16, A.17). pandas can convert a dataframe df to many, many
other formats

df.to_dict(), df.to_csv(), df.to_excel(), df.to_sql(), df.to_json(), ...

To convert a list-of-dicts to a dataframe is easy. The code

from pandas import *

df = DataFrame(L)
df

returns the dataframe in Figure A.13 (prices are in cents).

Fig. A.13 Dataframe from list-of-dicts.

To go the other way is equally easy. The code

L1 = df.to_dict('records')
L == L1

returns True. Here the option 'records' returns a list-of-dicts; other options
return a dict-of-dicts or other combinations.
To convert a CSV file into a dataframe, use the code

menu_df = read_csv("menu.csv")
menu_df

This returns Figure A.14 (prices are in cents).

Fig. A.14 Menu dataframe and SQL table.

To go the other way, to convert the dataframe df to the CSV file menu1.csv, use the code

df.to_csv("menu1.csv")
df.to_csv("menu2.csv",index=False)

The option index=False suppresses the index column, so menu2.csv has


two columns, while menu1.csv has three columns. Also useful is the method
.to_excel, which returns an excel file.
Now we explain how to convert between a dataframe and an SQL table.
What we have seen so far uses only pandas. To convert to SQL, we need two
more packages, sqlalchemy and pymysql.
The package sqlalchemy allows us to connect to a database server from
within Python, and the package pymysql is the code necessary to complete

the connection to our version of database server. For example, if we are


connecting to an Oracle database server, we would use the package cx-Oracle
instead of pymysql.
In Python, the standard package installation method is to use pip. To
install sqlalchemy and pymysql, type within jupyter:

pip install sqlalchemy


pip install pymysql

To connect using sqlalchemy, we first collect the connection data into one
URI string,

protocol = "mysql+pymysql://"
credentials = "username:password"
server = "@servername"
port = ":3306"
uri = protocol + credentials + server + port

This string contains your database username, your database password, the
database server name, the server port, and the protocol. If the database is
”rawa”, the URI is

database = "/rawa"
uri = protocol + credentials + server + port + database

Using this uri, the connection is made as follows

from sqlalchemy import create_engine

engine = create_engine(uri)

(In sqlalchemy, a connection is called an “engine”.) After this, to store the


dataframe df into a table Menu, use the code

df.to_sql('Menu',engine,if_exists='replace')

The if_exists = 'replace' option replaces the table Menu if it existed
prior to this command. Other options are if_exists='fail' and
if_exists='append'. The default is if_exists='fail', so

df.to_sql('Menu',engine)

returns an error if Menu exists.



Fig. A.15 Rawa restaurant.

To read a table into a dataframe, use for example the code

from sqlalchemy import text

query1 = text("select * from rawa.OrdersIn")
query2 = text("select * from rawa.OrdersIn where items like '%Hummus%';")

connection = engine.connect()
df1 = read_sql(query1, connection)
df2 = read_sql(query2, connection)

Better Python coding technique is to place read_sql and to_sql in a


with block, as follows

with engine.connect() as connection:
    df = read_sql(query, connection)
    df.to_sql('Menu', engine)

One benefit of this syntax is the automatic closure of the connection upon
completion. This completes the discussion of how to convert between dataframes
and SQL tables, and completes the discussion of conversions between any of
the objects in (A.8.2).
As an example how all this goes together, here is a task:

Given two CSV files menu.csv and orders.csv downloaded from a restaurant website
(Figure A.15), create three SQL tables Menu, OrdersIn, OrdersOut.

The two CSV files are (click)


orders.csv and menu.csv.
The three SQL table columns are as follows (price, tip, tax, subtotal, total
are in cents)

/* Menu */
dish varchar
price integer

/* ordersin */
orderid integer
created datetime
customerid integer
items json

/* ordersout */
orderid integer
subtotal integer
tip integer
tax integer
total integer

To achieve this task, we download the CSV files menu.csv and orders.csv,
then we carry out these steps. (price and tip in menu.csv and orders.csv
are in cents so they are INTs.)
1. Read the CSV files into dataframes menu_df and orders_df.
2. Convert the dataframes into list-of-dicts menu and orders.
3. Create a list-of-dicts OrdersIn with keys orderId, created, customerId
whose values are obtained from list-of-dicts orders.
4. Create a list-of-dicts OrdersOut with keys orderId, tip whose values are
obtained from list-of-dicts orders (tips are in cents so they are INTs).
5. Add a key items to OrdersIn whose values are JSON strings specifying
the items ordered in orders, using the prices in menu (these are in cents so
they are INTs). The JSON string is of a list-of-dicts in the form discussed
above L = [item1, item2] (see row 0 in Figure A.16).
Do this by looping over each order in the list-of-dicts orders, then loop-
ing over each item in the list-of-dicts menu, and extracting the quantity
ordered of the item item in the order order.
6. Add a key subtotal to OrdersOut whose values (in cents) are computed
from the above values.

Fig. A.16 OrdersIn dataframe and SQL table.

Add a key tax to OrdersOut whose values (in cents) are computed using
the Connecticut tax rate 7.35%. Tax is applied to the sum of subtotal
and tip.
Add a key total to OrdersOut whose values (in cents) are computed
from the above values (subtotal, tax, tip).
7. Convert the list-of-dicts OrdersIn, OrdersOut to dataframes OrdersIn_df, OrdersOut_df.
8. Upload menu_df, OrdersIn_df, OrdersOut_df to tables Menu, OrdersIn,
OrdersOut.
The resulting dataframes ordersin_df and ordersout_df, and SQL ta-
bles OrdersIn and OrdersOut, are in Figures A.16 and A.17.

Complete Code for the Task

# step 1
from pandas import *

protocol = "https://"
server = "omar-hijab.org"
path = "/teaching/csv_files/restaurant/"
url = protocol + server + path

Fig. A.17 OrdersOut dataframe and SQL table.

menu_df = read_csv(url + "menu.csv")


orders_df = read_csv(url + "orders.csv")

# step 2
menu = menu_df.to_dict('records')
orders = orders_df.to_dict('records')

# step 3
OrdersIn = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["created"] = r["created"]
    d["customerId"] = r["customerId"]
    OrdersIn.append(d)

# step 4
OrdersOut = []
for r in orders:
    d = {}
    d["orderId"] = r["orderId"]
    d["tip"] = r["tip"]
    OrdersOut.append(d)

# step 5
from json import *

for i,r in enumerate(OrdersIn):
    itemsOrdered = []
    for item in menu:
        dish = item["dish"]
        price = item["price"]
        if dish in orders[i]:
            quantity = orders[i][dish]
            if quantity > 0:
                d = {"dish": dish, "price": price, "quantity": quantity}
                itemsOrdered.append(d)
    r["items"] = dumps(itemsOrdered)

# step 6
for i,r in enumerate(OrdersOut):
    items = loads(OrdersIn[i]["items"])
    subtotal = sum([ item["price"]*item["quantity"] for item in items ])
    r["subtotal"] = subtotal
    tip = OrdersOut[i]["tip"]
    tax = int(.0735*(tip + subtotal))
    total = subtotal + tip + tax
    r["tax"] = tax
    r["total"] = total

# step 7
ordersin_df = DataFrame(OrdersIn)
ordersout_df = DataFrame(OrdersOut)

# step 8
import sqlalchemy
from sqlalchemy import create_engine, text

# connect to the database


protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database

engine = create_engine(uri)

dtype1 = { "dish":sqlalchemy.String(60), "price":sqlalchemy.Integer }

dtype2 = {
"orderId":sqlalchemy.Integer,
"created":sqlalchemy.String(30),
"customerId":sqlalchemy.Integer,
"items":sqlalchemy.String(1000)
}

dtype3 = {
"orderId":sqlalchemy.Integer,
"tip":sqlalchemy.Integer,
"subtotal":sqlalchemy.Integer,
"tax":sqlalchemy.Integer,
"total":sqlalchemy.Integer
}

with engine.connect() as connection:
    menu_df.to_sql('Menu', engine,
        if_exists = 'replace', index = False, dtype = dtype1)
    ordersin_df.to_sql("OrdersIn", engine,
        index = False, if_exists = 'replace', dtype = dtype2)
    ordersout_df.to_sql("OrdersOut", engine,
        index = False, if_exists = 'replace', dtype = dtype3)

Moral of this section

In this section, all work was done in Python on a laptop, no SQL was used on
the database, other than creating a table or downloading a table. Generally,
this is an effective workflow:
• Use SQL to do big manipulations on the database (joining and filtering).
• Use Python to do detailed computations on your laptop (analysis).
Now we consider the following simple problem. The total number of orders
is 3970. What is the total number of plates? To answer this, we loop through
all the orders, summing the number of plates in each order. The answer is
14,949 plates.

from json import *


from pandas import *
from sqlalchemy import create_engine, text

protocol = "mysql+pymysql://"
credentials = "username:password@"
server = "servername"
port = ":3306"
database = "/rawa"
uri = protocol + credentials + server + port + database

engine = create_engine(uri)

connection = engine.connect()

query = text("select * from OrdersIn")


df = read_sql(query, connection)

num = 0

for item in df["items"]:
    plates = loads(item)
    num += sum( [ plate["quantity"] for plate in plates ])

print(num)

A more streamlined approach is to use map. First we define a function


whose input is a JSON string in the format of df["items"], and whose
output is the number of plates.

from json import *

def num_plates(item):
    dishes = loads(item)
    return sum( [ dish["quantity"] for dish in dishes ])

Then we use map to apply to this function to every element in the series
df["items"], resulting in another series. Then we sum the resulting series.

num = df["items"].map(num_plates).sum()
print(num)

Since the total number of plates is 14,949, and the total number of orders
is 3970, the average number of plates per order is 3.76.

References

[1] J. Akey. Genome 560: Introduction to Statistical Genomics. 2008. url:


https://www.gs.washington.edu/academics/courses/akey/56008
/lecture/lecture1.pdf.
[2] V. Behzadan and O. Hijab. “Binary Classifiers and Logistic Regres-
sion”. Preprint.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Information
Science and Statistics. Springer, 2006.
[4] S. Bubeck. Convex Optimization: Algorithms and Complexity. Vol. 8.
Foundations and Trends in Machine Learning. Now Publishers, 2015.
[5] H. Cramér. Mathematical Methods of Statistics. Princeton University
Press, 1946.
[6] J. L. Doob. “Probability and Statistics”. In: Transactions of the Amer-
ican Mathematical Society 36 (1934), pp. 759–775.
[7] Math Stack Exchange. url: https://math.stackexchange.com/que
stions/4195547/derivation-of-stirling-approximation-from-c
lt.
[8] T. S. Ferguson. A Course in Large Sample Theory. Springer, 1996.
[9] R. A. Fisher. “The conditions under which χ2 measures the discrep-
ancy between observation and hypothesis”. In: Journal of the Royal
Statistical Society 87 (1924), pp. 442–450.
[10] Google. Machine Learning. url: https://developers.google.com/m
achine-learning.
[11] R. M. Gray. “Toeplitz and Circulant Matrices: A Review”. In: Foun-
dations and Trends in Communications and Information Theory 2.3
(2006), pp. 155–239. issn: 1567-2190. url: http://dx.doi.org/10.1
561/0100000006.
[12] E. L. Grinberg and O. Hijab. “The fundamental theorem of trigonom-
etry”. Preprint.
[13] T. L. Heath. The Works of Archimedes. Cambridge University Press,
1897.
[14] O. Hijab. Introduction to Calculus and Classical Analysis, Fourth Edi-
tion. Springer, 2016.
[15] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. MIT Press,
2016. url: http://www.deeplearningbook.org.
[16] I. Steinwart and A. Christmann. Support Vector Machines. Springer,
2008.
[17] N. Janakiev. Classifying the Iris Data Set with Keras. 2018. url: htt
ps://janakiev.com/blog/keras-iris.
[18] L. Jiang. A Visual Explanation of Gradient Descent Methods. 2020.
url: https://towardsdatascience.com/a-visual-explanation-o
f-gradient-descent-methods-momentum-adagrad-rmsprop-adam-
f898b102325c.

[19] J. W. Longley. “An Appraisal of Least Squares Programs for the Elec-
tronic Computer from the Point of View of the User”. In: Journal of
the American Statistical Association 62.319 (1967), pp. 819–841.
[20] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming.
Springer, 2008.
[21] A. A. Faisal M. P. Deisenroth and C. S. Ong. Mathematics for Machine
Learning. Cambridge University Press, 2020.
[22] M. Minsky and S. Papert. Perceptrons, An Introduction to Computa-
tional Geometry. MIT Press, 1988.
[23] Y. Nesterov. Lectures on Convex Optimization. Springer, 2018.
[24] K. Pearson. “On the criterion that a given system of deviations from
the probable in the case of a correlated system of variables is such that
it can be reasonably supposed to have arisen from random sampling”.
In: Philosophical Magazine Series 5 50:302 (1900), pp. 157–175.
[25] R. Penrose. “A generalized inverse for matrices”. In: Proceedings of the
Cambridge Philosophical Society 51 (1955), pp. 406–413.
[26] B. T. Polyak. “Some methods of speeding up the convergence of itera-
tion methods”. In: USSR Computational Mathematics and Mathemat-
ical Physics 4(5) (1964), pp. 1–17.
[27] The WeBWorK Project. url: https://openwebwork.org/.
[28] S. Raschka. PCA in three simple steps. 2015. url: https://sebastia
nraschka.com/Articles/2015_pca_in_3_steps.html.
[29] H. Robbins and S. Monro. “A Stochastic Approximation Method”. In:
The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407.
[30] S. M. Ross. Probability and Statistics for Engineers and Scientists, Sixth
Edition. Academic Press, 2021.
[31] M. J. Schervish. Theory of Statistics. Springer, 1995.
[32] G. Strang. Linear Algebra and its Applications. Brooks/Cole, 1988.
[33] Stanford University. CS224N: Natural Language Processing with Deep
Learning. url: https://web.stanford.edu/class/cs224n.
[34] I. Waldspurger. Gradient Descent With Momentum. 2022. url: https
://www.ceremade.dauphine.fr/~waldspurger/tds/22_23_s1/adva
nced_gradient_descent.pdf.
[35] Wikipedia. Logistic Regression. url: https://en.wikipedia.org/wi
ki/Logistic_regression.
[36] S. J. Wright and B. Recht. Optimization for Data Analysis. Cambridge
University Press, 2022.
Python Index

*, 9, 16 def.matrix_text, 45
def.nearest_index, 193
all, 193 def.newton, 405
append, 193 def.num_legendre, 203
def.num_plates, 508
def.angle, 25, 68 def.outgoing, 240, 395
def.assign_clusters, 193 def.pca, 187
def.backward_prop, 235, 242, def.pca_with_svd, 187
400 def.perm_tuples, 453
def.ball, 55 def.plot_and_integrate, 486
def.cartesian_product, 337 def.plot_cluster, 194
def.chi2_independence, 383 def.plot_descent, 405
def.comb_tuples, 454 def.poly, 433
def.confidence_interval, 365, def.project, 116
375 def.project_to_ortho, 118
def.delta_out, 400 def.pvalue, 328
def.derivative, 242 def.random_batch_mean, 287
def.dimension_staircase, 126 def.random_vector, 193
def.downstream, 400 def.set_pi_ticks, 215
def.draw_major_minor_axes, 50 def.sym_legendre, 202
def.ellipse, 44 def.tensor, 33
def.find_first_defect, 125 def.train_nn, 412
def.forward_prop, 235, 241, 396 def.ttest, 376
def.gd, 410 def.type2_error, 371, 377
def.goodness_of_fit, 379 def.uniq, 5
def.H, 270 def.update_means, 193
def.hexcolor, 11 def.update_weights, 412
def.incoming, 240, 395 def.zero_variance, 104
def.J, 397 def.ztest, 369
def.local, 398 diag, 181


dict, 498 numpy.amax, 434


display, 146 numpy.amin, 434
numpy.arange, 18, 44
enumerate, 188 numpy.arccos, 25, 68
numpy.argsort, 187
import, 9 numpy.array, 8, 59
itertools.product, 55 numpy.ceil, 215
numpy.column_stack, 81
join, 11 numpy.corrcoef, 48
json.dumps, 498 numpy.cov, 40
json.loads, 499
numpy.cumsum, 186
numpy.degrees, 25
lambda, 240
numpy.dot, 66
lamda, 140
numpy.dstack, 337
list, 7
numpy.exp, 270
map, 215 numpy.floor, 215
matplotlib.patches numpy.inf, 488
Circle, 54 numpy.isclose, 20, 153
Rectangle, 54 numpy.linalg.eig, 140
matplotlib.pyplot.axes, 54 numpy.linalg.eigh, 140, 186
add_patch, 54 numpy.linalg.inv, 79
axis, 54 numpy.linalg.matrix_rank,
set_axis_off, 54 124, 125
matplotlib.pyplot.contour, 44 numpy.linalg.norm, 22, 193
matplotlib.pyplot.figure, 188 numpy.linalg.pinv, 116
matplotlib.pyplot.grid, 7 numpy.linalg.svd, 181
matplotlib.pyplot.hist, 284 numpy.linspace, 55
matplotlib.pyplot.imshow, 8, 9 numpy.log, 270
matplotlib.pyplot.legend, 44 numpy.mean, 14
matplotlib.pyplot.meshgrid, numpy.meshgrid, 55, 337
44 numpy.outer, 33, 383
matplotlib.pyplot.plot, 18, 37 numpy.pi, 270
matplotlib.pyplot.scatter, 7, numpy.random.binomial, 268,
18 283
matplotlib.pyplot.show, 7 numpy.random.default_rng, 287
matplotlib.pyplot.stairs, 126 numpy.random.default_rng.
matplotlib.pyplot.subplot, ,→ shuffle, 287
188 numpy.random.normal, 327
matplotlib.pyplot.text, 45 numpy.random.randn, 358
matplotlib.pyplot.title, 270 numpy.random.random, 37
matplotlib.pyplot.xlabel, 434 numpy.reshape, 185
matplotlib.pyplot.xticks, 215 numpy.roots, 481
matplotlib.pyplot.ylim, 333 numpy.row_stack, 63
numpy.set_printoptions, 9
numpy.allclose, 20, 76 numpy.shape, 59

numpy.sqrt, 25 sklearn.preprocessing
.StandardScaler, 76
pandas.DataFrame, 499 sort, 186
pandas.DataFrame.to_csv, 500 sqlalchemy.create_engine, 501
pandas.DataFrame.to_dict, 499 sqlalchemy.text, 501
pandas.DataFrame.to_sql, 501 sympy.*, 66
pandas.read_csv, 433, 500 sympy.diag, 65
pandas.read_sql, 502 sympy.diagonalize, 144
sympy.diff, 202
random.choice, 11
sympy.eigenvects, 144
random.random, 15
sympy.init_printing, 144
scipy.integrate.quad, 485 sympy.lambdify, 203
scipy.linalg.null_space, 93 sympy.Matrix, 59
scipy.linalg.orth, 87 sympy.Matrix.col, 63
scipy.linalg.pinv, 81 sympy.Matrix.cols, 63
scipy.optimize.newton, 225 sympy.Matrix.columnspace, 86
scipy.spatial.ConvexHull, 248 sympy.Matrix.eye, 64
simplices, 249 sympy.Matrix.hstack, 62, 81, 94
scipy.special.comb, 454 sympy.Matrix.inv, 79
scipy.special.expit, 277 sympy.Matrix.nullspace, 92
scipy.special.factorial, 452 sympy.Matrix.ones, 64
scipy.special.perm, 453 sympy.Matrix.rank, 130
scipy.special.softmax, 345 sympy.Matrix.row, 63
scipy.stats.binom, 268 sympy.Matrix.rows, 63
scipy.stats.chi2, 333 sympy.Matrix.rowspace, 90
scipy.stats.entropy, 224, 270 sympy.Matrix.zeros, 64
scipy.stats. sympy.prod, 482
,→ multivariate_normal, sympy.shape, 59
337 sympy.simplify, 202
scipy.stats.norm, 315 sympy.solve, 299, 481
scipy.stats.poisson, 310 sympy.symbols, 202
scipy.stats.t, 373, 375
sklearn.datasets.load_iris, 2 tuple, 19
sklearn.decomposition
.PCA, 188 zip, 191
Index

≈, 488 cartesian plane, 18


1, 159, 168, 341, 344, 420 Cauchy-Schwarz inequality, 25, 68
central limit theorem, 285, 318
angle, 68, 132 and Stirling’s approximation,
Archimedes 456
angle measure, 477 chi-squared
axiom, 216 correlated, 337, 340
arcsine law, 217 circle, 23
asymptotically unit, 22
equal, 385, 490 coin-tossing, 264
nonzero, 489 bias, 271
normal, 385 entropy, 270
one, 489 relative, 271
positive, 489 column space, 86
zero, 385, 489 columns, 62
average, 11 orthonormal, 72
combination, 454
basis, 122 convex, 245
of eigenvectors, 143 linear, 84
of singular vectors, 179 complex
one-hot encoded, 90 conjugate, 475
orthonormal, 122, 133, 143 division, 474, 475
standard, 60, 90 hermitian product, 475
Bayes theorem, 273, 275, 277 multiplication, 474, 475
perceptron, 278 numbers, 474
binomial, 457 plane, 474
coefficient, 455, 458, 459 polar representation, 477
theorem, 457, 459 roots of unity, 478
Newton’s, 213 concave function, 208
bound, 255 condition number, 441


confidence, 323 vectors or points, 13


interval, 364 decision boundary, 253, 278, 426
level, 363 degree
contingency table, 382 binomial, 457
converges, 491 chi-squared, 331
convex, 258 graph node, 166
combination, 245 derivative, 197
dual, 212, 259, 347, 352 directional, 226
function, 208, 245 formula, 199
hull, 247, 424 logarithm, 206
set, 247 maximizers, 207
strictly, 209, 245, 259 partial, 226
correlation second, 201
coefficient, 47 convexity, 208
matrix, 47, 75 descent
CSV file, 64 gradient, 405, 442, 443
cumulant-generating function, heavy ball, 447
221, 294, 344 sequence, 406
and variance, 298 with lookahead gradient, 449
with momentum, 447
dataset, 1 diagonalizable, 143
attributes, 1 diagonalization
augmented, 392 eigen, 143
centered, 13 singular, 180
dimension, 133 dice-rolling
example, 1 bias, 348
features, 1 entropy, 348
full-rank, 133 relative, 351
Iris, 1 dimension, 122
label, 1 staircase, 126
mean, 36 direct sum, 119
MNIST, 6 distance formula, 22
multi-class, 418 distribution
projected, 41, 118, 189 Bernoulli, 268
reduced, 41, 100, 118, 189 binomial, 268
sample, 1 chi-squared, 331, 332
separable, 253 normal, 315, 317
strongly, 426 Poisson, 310
weakly, 426 T -, 372
soft-class, 419 uniform, 305
standard, 38 Z-, 315, 317
standardized, 47, 75 dot product, 24, 66
target, 1
two-class, 252, 418 eigenspace, 153
variance, 38 eigenvalue, 138

bottom, 151 information, 352


clustering, 161 mean, 390
decomposition, 143 mean square, 397
minimum variance, 151 level, 255
projected variance, 150 logistic, 220, 276
top, 150 logit, 221
transpose, 141 loss, 403, 415
eigenvectors, 138 moment-generating
best-aligned vector, 150 chi-squared, 332
is right singular vector, 182 independence, 301
orthogonal, 142 normal, 317
entropy, 219, 348 standard normal, 317
absolute, 219, 348 probability density, 304, 315
cross-, 352 probability mass, 293
relative, 223, 350 proper, 255
epigraph, 249 and trainability, 417
epoch, 390 relu, 314, 394
error central limit theorem, 330
logistic, 420 Stirling’s approximation, 330
mean square, 415 sigmoid, 221, 276
Euler’s constant, 465 softmax, 345
events, 280 relative, 353
highly significant, 324 fundamental theorem
independent, 282 of algebra, 480
significant, 324 of calculus, 483
experiment, 280
exponential geometric
function, 467 series, 473
series, 469 sum, 465
gradient, 227
factorial, 451 weight, 401, 412
full-rank graph, 164
dataset, 133 bipartite, 173
matrix, 130 complement, 170
function complete, 165
concave, 208 connected, 168
convex, 208 cycle, 166, 168
strictly, 209 directed, 164
cumulant-generating, 221, 294, edge, 164, 390
344 incoming, 390
independence, 302 outgoing, 390
relative, 351 isomorphism, 172
cumulative distribution, 293, laplacian, 175
307 nodes, 164, 390
error adjacent, 164

connected, 168 Legendre polynomial, 202


degree, 166 level, 253
dominating, 167 limit, 491
hidden, 390 line-search, 444
input, 390 linear
isolated, 167 combination, 84
output, 390 dependence, 92
order, 165 independence, 92
path, 168 system, 79, 147
regular, 167 homogeneous, 28, 92
simple, 165 inhomogeneous, 29
size, 165 transformation, 129, 135
sub-, 166 log-odds, 221
undirected, 164 logistic function, 220, 276
walk, 168 logit function, 221
weighed, 164 loss, 403, 415
weight cross-entropy, 419
matrix, 390 logistic, 420
mean, 390
hyperplane, 101, 250 mean square, 397, 415
separating, 251, 253
suporting, 252 machine learning, 389
tangent, 254 margin of error, 363
hypothesis mass-spring system, 156
alternate, 367 matrices
null, 367 projection, 115
testing, 367 matrix, 60
2 × 2, 29
iff, 72, 141
addition, 64
incoming edge, 390
adjacency, 164, 168
information, 220, 347
augmented, 89
absolute, 220, 347
centered, 420
cross-, 352
relative, 222, 350 circulant, 160, 170
integral, 482 eigenvalues, 161
additivity, 484 columns, 30
scaling, 484 dataset, 64
inverse, 78 diagonal, 63
pseudo-, 80, 105, 109 identity, 78
Iris dataset, 1 incidence, 175
iteration, 390 inverse, 31, 78
nonnegative, 42, 71
Jupyter, 4 orthogonal, 132
permutation, 66
law of large numbers, 272, 285, positive, 42, 71
318, 360 projection, 113, 115

rank perceptron, 278, 391


approximate, 145 Bayes theorem, 392
rows, 30 parallel, 401
scaling, 64 permutation, 452
square, 63 perp, 95
symmetric, 32, 71 point, 59
trace, 33 critical, 207, 229, 409
transpose, 30, 61 inflection, 208, 410
variance, 38 saddle, 207, 230
weight, 164, 390 point of best-fit, 36
maximizer, 212 population, 10
mean, 11, 36, 293, 304, 335 power of a test, 371
sample, 311 principal axes, 50
minimizer, 493, 495 principal components, 144, 149,
existence, 256 184
global, 255 probability, 280
properness, 256 addition of, 265
residual, 257 binomial, 263
uniqueness, 256 coin-tossing, 266
conditional, 265, 282
network, 240, 390 multiplication of, 266
deep, 402 one-hot encoded, 418
iteration, 412 strict, 418
neural, 391 product
layered, 402
dot, 24, 66, 132
training, 411
matrix-matrix, 69
neuron, 240, 390
matrix-vector, 69
perceptron, 391
tensor, 33, 72
shallow, 401
projection, 113
dense, 402
matrix, 115
trainability, 415
onto column space, 116
Newton’s method, 404
onto null space, 120
norm, 22, 74
onto row space, 117
null space, 92
propagation
1, 344 back, 234, 235
one-hot encoding, 90, 343, 353 chain, 235
orthogonal, 68 network, 242
complement, 95, 119 neural network, 400
orthonormal, 68 forward, 233, 235
outgoing edge, 390 chain, 235
network, 241
parabola neural network, 396
lower tangent, 210 proper function, 255
upper tangent, 210 minimizer, 256
Pascal’s triangle, 460 pseudo-inverse, 109

Pythagoras theorem, 26 central limit theorem, 330


Python, 4 Stirling’s approximation, 330
residual, 106
quadratic form, 34 vanishing, 106
residual minimizer, 106
random variables, 290, 291 and properness, 256
Bernoulli, 268, 291 minimum norm, 108
chi-squared, 331 pseudo-inverse, 109
continuous, 293 regression equation, 107
correlation, 299 row space, 90
discrete, 293 rows, 62
expectation, 293, 304 orthonormal, 72
gaussian, 315
identically distributed, 310 scalars, 12, 20
independence, 300 scaling
logistic, 308 factor, 136
normal, 315 integral, 484
Poisson, 310 matrix, 64
standard, 309 principle, 54
vector-valued, 335 vector, 20, 60
rank, 130 sequence, 488
and eigenvalues, 145 convergent, 491
and singular values, 178 sub-, 493
approximate, 145 subconvergent, 494
column, 87 series
full-, 130 alternating, 462
nonzero eigenvalues, 145 exponential, 469
row, 90 Taylor, 204
regression set
linear, 415, 430, 432 ball, 247
convexity, 416 boundary, 244, 247
neural network, 415 closed, 247
properness, 417 complement, 247
trainability, 417, 418 convex, 247
with bias, 418 interior, 247
without bias, 417 level, 244
logistic, 420 open, 247
convexity, 421 sublevel, 244
neural network, 420 sigmoid function, 221, 276
one-hot encoded, 425 singular
properness, 422 value, 176
strict, 424 decomposition, 179, 180
trainability, 424, 425 of pseudo-inverse, 183
regularization, 409 versus eigenvalue, 178
relu function, 314, 394 vectors, 176

left, 176 Z, 369


right, 176 trainability, 415
versus eigenvectors, 182 and properness, 417
singular values linear regression, 417, 418
transpose, 176 logistic regression, 424
slope, 197 one-hot encoded, 425
softmax function, 345 strict, 424
space transpose, 61
column, 86 triangle inequality, 68
eigen-, 153
feature, 1, 59, 90 unit circle, 22
null, 92
row, 90 variance, 38, 74, 100
sample, 10, 280 and correlation, 76
source, 129 biased, 40
sub-, 97 ellipse, 43
target, 129 explained, 40
vector, 12 inverse ellipse, 43
span, 86 inverse ellipsoid, 153
spherical coordinates, 55 matrix, 38
standard projected, 42, 101, 136, 142
deviation, 296 reduced, 42, 101
error, 326 sample, 341
statistic, 16 total, 40
Stirling’s approximation, 456 unbiased, 40
central limit theorem, 330 zero direction, 101
relu function, 330 vector, 12, 18, 59
sum addition, 19, 60
direct, 119 best aligned, 48
geometric, 465 bias, 418
of spans, 119 cartesian, 19
of vectors, 60 centered, 346
suspensions, 56 dimension, 59
system dot product, 24
linear gradient, 227
homogeneous, 28 downstream, 399
inhomogeneous, 29 incoming, 391
length, 22, 67
tangent magnitude, 67
line, 198 norm, 22, 67
test one-hot encoded, 90, 343, 353,
chi-squared, 379, 382 418
goodness of fit, 379 orthogonal, 25, 68
independence, 382 orthonormal, 25, 68, 94
T , 376 outgoing, 239, 391

perp, 95 span, 85
perpendicular, 25 standardized, 75
polar, 22 subtraction, 21
probability, 343, 379 unit, 23, 67
strict, 418 zero, 19, 60
projected, 114, 117, 118 vectorization, 16, 380
random, 335, 358 weight, 390
standard, 336 gradient, 401, 412
reduced, 114, 117, 118 hyperplane, 425
scaling, 20, 60 matrix, 164
shadow, 18 centered, 420
Omar Hijab obtained his doctorate from the University of California at Berkeley, and is on the faculty at Temple University in Philadelphia, Pennsylvania. Currently he is affiliated with the University of New Haven in West Haven, Connecticut.
