
arXiv:2412.15633v1 [stat.ME] 20 Dec 2024

Lecture Notes on High Dimensional Linear Regression

Alberto Quaini1,2

December 23, 2024 (first version: October 17, 2024)

1 Department of Econometrics, Erasmus University of Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands.
2 Please let me know if you find typos or mistakes at quaini@ese.eur.nl.
Introduction
These lecture notes were developed for a Master’s course in advanced ma-
chine learning at Erasmus University of Rotterdam. The course is designed
for graduate students in mathematics, statistics and econometrics. The con-
tent follows a proposition-proof structure, making it suitable for students
seeking a formal and rigorous understanding of the statistical theory under-
lying machine learning methods.
At present, the notes focus on linear regression, with an in-depth ex-
ploration of the existence, uniqueness, relations, computation, and non-
asymptotic properties of the most prominent estimators in this setting: least
squares, ridgeless, ridge, and lasso.

Background
It is assumed that readers have a solid background in calculus, linear alge-
bra, convex analysis, and probability theory. Some definitions and results
from these fields, relevant for the course, are provided in the Appendix for
reference.

Book-length references
The content of these lecture notes is inspired by a wide range of existing liter-
ature, but the presentation of topics follows my own interpretation and logical
structure. Although most of the content can be traced back to established
sources, certain sections reflect my perspective, and some material is origi-
nal to this course. For those interested in more comprehensive, book-length
discussions of related topics, the following key references are recommended:
Hastie et al. [2009], Bühlmann and Van De Geer [2011], Hastie et al. [2015],
and Wainwright [2019].

Disclaimer
Please note that despite my efforts, these lecture notes may contain errors. I
welcome any feedback, corrections, or suggestions you may have. If you spot
any mistakes or have ideas for improvement, feel free to contact me via email
at quaini@ese.eur.nl.
Contents

Notation
1 Linear Regression
    1.0.1 Estimators and their properties
  1.1 Least Squares and Penalized Least Squares
    1.1.1 Existence and uniqueness
    1.1.2 Equivalent expressions and relations
    1.1.3 Geometric interpretation
    1.1.4 Computation of lasso
    1.1.5 Finite-sample properties of ridgeless and ridge
    1.1.6 Finite sample properties of lasso
2 Appendix
  2.1 Linear algebra
    2.1.1 Moore-Penrose inverse
    2.1.2 Eigenvalue and Singular value decomposition
  2.2 Convex analysis
  2.3 Probability theory
Alphabetical Index
Notation
• All random variables are defined on a complete probability space (Ω, F, P)
and take values in a real Euclidean space.

• For a random variable (vector) [matrix] x (x) [X], the notation x ∈ R


(x ∈ Rn ) [X ∈ Rn×p ] means that x (x) [X] takes values in R (Rn )
[Rn×p ].
• The symbols →_P and →_d denote convergence in probability and in distribution, respectively.

• Given a random variable x, its expectation is denoted E[x] and its


variance Var[x].

• For a vector x ∈ Rn , the i−th element is denoted xi for i = 1, . . . , n.

• For a matrix A ∈ Rn×p , the i, j−th element is denoted Ai,j , the j−th
column is denoted Aj and the i−th row is denoted A(i) , for i = 1, . . . , n
and j = 1, . . . , p.

• The transpose of a matrix A ∈ Rn×p is denoted A′ .

• The Moore-Penrose inverse of a matrix A ∈ Rn×p is denoted A+ .


• The lp −norm ∥·∥p on Rn is defined for all v ∈ Rn as ∥v∥p := ( Σ_{i=1}^{n} |vi |^p )^{1/p} when p ∈ [1, +∞), and ∥v∥p := max_{i=1,...,n} |vi | when p = +∞.

• Given a vector θ ∈ Rp , the l0 −norm (which is not a norm!) ∥θ∥0 counts the number of nonzero elements of θ.

• argminx∈X f (x) denotes the set of minimizers of f over set X.

• diag(A) denotes the diagonal elements of a square matrix A ∈ Rn×n

• diag(a1 , . . . , an ) denotes a square matrix in Rn×n that has diagonal


elements given by a1 , . . . , an ∈ R and that has zero elsewhere.

• Given a matrix A ∈ Rn×p , its rank is Rank(A), its range is Range(A),


its kernel is Ker(A), its trace is denoted Trace(A).

• We denote PS the orthogonal projection onto set S ⊂ Rn×p .

• Given a vector space V , the sum of two subsets A, B ⊂ V is defined as A + B := {a + b : a ∈ A, b ∈ B}. The sum of a set A ⊂ V and a vector b ∈ V is defined as A + b := {a + b : a ∈ A}.

• The symbol ∂ indicates the subdifferential.

Chapter 1

Linear Regression

Linear regression is a supervised learning technique aimed at predicting a


target random variable y using a linear combination

x′ θ = x1 θ1 + . . . + xp θp

of explanatory variables x = [x1 , . . . , xp ]′ , where θ ∈ Rp and p ∈ N.1 The


target variable y is also referred to as dependent or output variable, while the
explanatory variables x are also known as independent variables, predictors
or input variables.
In typical applications, we observe only a sample of size n ∈ N of these
random variables, represented by the pairs (xi , yi )ni=1 , where xi ∈ Rp and
yi ∈ R for each i. Given a regression coefficient θ0 ∈ Rp , a statistical linear
model, or simply linear model, is expressed as

yi = x′i θ0 + ε0i , i = 1, . . . , n, (1.1)

where ε0i are real-valued residual random variables. Figure 1.1 depicts a
linear model for i = 1, . . . , n observations yi , with two predictors x̃i = [1, xi ]′
consisting of a unit constant and a variable xi , a coefficient θ0 ∈ R2 and the
associated error terms ε0i . The Data Generating Process (DGP), i.e., the
joint distribution of the predictors x and the real-valued residual random
variables ε0 = [ε0i ]ni=1 , is subject to certain restrictions. Depending on the
type of restrictions imposed on the DGP, different types of linear models
are obtained. The two general forms of linear models are fixed and random
design models, which are defined as follows.
Definition 1 (Fixed design model). In a fixed design model, the sequence
(xi )ni=1 is fixed. The residuals ε0i are independent and identically distributed.
1 An intercept in f (x; θ) can be introduced by adding a constant term to the predictors.
Definition 2 (Random design model). In a random design model, the pair
(xi , yi )ni=1 is a sequence of independent and identically distributed random
variables.
The fixed design model is particularly suitable when the predictors are
controlled by the analyst, such as the dose of medication administered to
patients in the treatment group in a clinical trial. Conversely, the random
design model is appropriate when the explanatory variables are stochastic,
such as the wind speed observed at a specific time and location.

[Scatter plot of observations (xi , yi ) with the fitted line lm(xi , θ0 ) = 0.5 + 0.8xi and an error term ε0i indicated.]

Figure 1.1: Statistical linear model yi = x̃′i θ0 + ε0i where x̃i = [1, xi ]′ and
θ0 = [0.5, 0.8]′ .

We organize the observed values of the target variable in the vector y =


[y1 , . . . , yn ]′ ∈ Rn , and the observations on the predictors in the design matrix
 
X := [ x11 · · · x1p ; ⋮ ⋱ ⋮ ; xn1 · · · xnp ] ∈ Rn×p .
With this notation, the linear model (1.1) can be expressed as:
y = Xθ0 + ε0 ,
where ε0 = [ε01 , . . . , ε0n ]′ ∈ Rn .
Example 1. A classic example of linear regression is found in the work of
Nerlove et al. [1961], which examines returns to scale in the U.S. electricity
power supply industry. In this study, the total cost yi for firm i is predicted
using a linear model based on the firm’s output production xi1 , the wage rate
xi2 , the price of fuel xi3 , and the rental price of capital xi4 , with data from
n = 145 electric utility companies.

The rest of the chapter is organized as follows. First, we study the most
basic linear regression approach, the method of least squares projection, and
some of its recent machine learning extensions. Our study focuses on their
existence, uniqueness, connections, geometric interpretation, and computa-
tion. Then, we cover both their finite- or small-sample properties, which are valid for any given sample size, and their asymptotic properties, which are useful approximations when the sample size is large enough.

1.0.1 Estimators and their properties


Definition 3 (Estimand). An estimand is a feature, or parameter, of interest
of the population.
Definition 4 (estimator and estimate). An estimator is a function taking as
input the data, and possibly other auxiliary variables, and outputting an
estimate, which is a specific value assigned to the estimand.
For instance, in the context of the linear model (1.1), the coefficient θ0 ∈ Rp represents the estimand. An estimator is a function θ̂n : Rn × Rn×p → Rp

that takes as inputs the data (y, X) ∈ Rn × Rn×p and produces an estimate
θ̂n (y, X) ∈ Rp . For simplicity, we use the same notation, θ̂n , to refer to both
the estimator and the resulting estimate, although formally the estimate
should be written as θ̂n (y, X). We may also write θ̂n ∈ Rp to indicate that
an estimator outputs values in Rp .
Definition 5 (Linear prediction). The quantity lm(X, θ) := Xθ denotes
the linear prediction associated to coefficient vector θ ∈ Rp and predictors
X ∈ Rn×p .
Let M + denote the Moore-Penrose inverse of a generic real-valued matrix
M . We make extensive use of the following useful projections.2
Definition 6 (Useful projections). Given a fixed matrix X ∈ Rn×p :

• PRange(X ′ ) := X + X is the orthogonal projector onto Range(X ′ );

• PKer(X) := I − X + X is the orthogonal projector onto Ker(X);

• PRange(X) := XX + is the orthogonal projector onto Range(X);

• PKer(X ′ ) := I − XX + is the orthogonal projector onto Ker(X ′ ).


2 We use the notation Range (Ker) for the range (kernel) of a matrix. Details on these sets, the Moore-Penrose inverse and orthogonal projections are given in Appendix Section 2.1.
The next proposition demonstrates that, if we fix the design matrix X,
we can focus on regression coefficients in Range(X ′ ). Indeed, coefficients in
this set span all possible linear predictions that can be achieved through X.

Proposition 1.0.1. Given a matrix X ∈ Rn×p , for any θ ∈ Rp :

lm(X, θ) = lm(X, PRange(X ′ ) θ).

Proof. Using the identity I = PRange(X ′ ) + PKer(X) , we have for any θ ∈ Rp :

Xθ = X(PRange(X ′ ) + PKer(X) )θ = X PRange(X ′ ) θ.
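
As a quick numerical check of Definition 6 and Proposition 1.0.1, the following minimal NumPy sketch (illustrative only; the dimensions and random seed are arbitrary choices) builds the four projectors from the Moore-Penrose inverse and verifies that the linear prediction depends on θ only through PRange(X ′ ) θ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 8, 5, 3                      # arbitrary sizes; rank r < p makes Ker(X) non-trivial
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))   # Rank(X) = r
Xp = np.linalg.pinv(X)                 # Moore-Penrose inverse X^+

P_range_Xt = Xp @ X                    # projector onto Range(X')
P_ker_X = np.eye(p) - Xp @ X           # projector onto Ker(X)
P_range_X = X @ Xp                     # projector onto Range(X)
P_ker_Xt = np.eye(n) - X @ Xp          # projector onto Ker(X')

theta = rng.standard_normal(p)
# Proposition 1.0.1: the linear prediction only depends on the part of theta in Range(X')
assert np.allclose(X @ theta, X @ (P_range_Xt @ theta))
# all four projectors are idempotent and symmetric
for P in (P_range_Xt, P_ker_X, P_range_X, P_ker_Xt):
    assert np.allclose(P, P @ P) and np.allclose(P, P.T)
print("projection identities verified")
```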

Finite-sample properties
Since an estimator is derived from data, it is a random variable. Intuitively,
when comparing two estimators of the same estimand, we prefer the one
whose probability distribution is ”more concentrated” around the true value
of the estimand. Formally, estimators are compared using several key prop-
erties.3
Definition 7 (Bias). The bias of an estimator θ̂n of θ0 is the difference between
the expected value of the estimator and the estimand:

Bias(θ̂n , θ0 ) := E[θ̂n ] − θ0 .

Definition 8 (Estimation risk). The estimation risk of an estimator θ̂n of θ0


measures the difference between the estimate and the estimand as:

ER(θ̂n , θ0 ) := ∥θ̂n − θ0 ∥22 .

Definition 9 (MSE). The Mean Squared Error (MSE) of an estimator θ̂n of


θ0 is the expected estimation risk of the estimator:

MSE(θ̂n , θ0 ) := E[∥θ̂n − θ0 ∥22 ].

Definition 10 (Predictive risk). The predictive risk of an estimator θ̂n of θ0


measures the difference between the linear predictions of θ̂n and those of θ0 :

PR(θ̂n , θ0 ) := ∥lm(X, θ̂n ) − lm(X, θ0 )∥22 /n.


3 Some of these properties are defined by means of the l2 −norm. Note that this choice is typical, but arbitrary.
Definition 11 (Mean predictive risk). The Mean Predictive Risk (MPR) of
an estimator θ̂n of θ0 is the expected predictive risk of the estimator:

MPR(θ̂n , θ0 ) := E[∥lm(X, θ̂n ) − lm(X, θ0 )∥22 /n].

As a corollary to Proposition 1.0.1, the predictive risk of an estimator


is unchanged if both the estimator and the estimand are projected onto
Range(X ′ ).
Corollary 1.0.1. Given a matrix X ∈ Rn×p , for any estimator θ̂n of θ0 ∈
Rp :
PR(θ̂n , θ0 ) = PR(PRange(X ′ ) θ̂n , PRange(X ′ ) θ0 ).
Proof. The result follows from Proposition 1.0.1.
The next proposition justifies the definition of mean predictive risk given
in Definition 11.
Proposition 1.0.2. Assume that the linear model (1.1) holds with E[xε0 ] =
0. Then, for any θ ∈ Rp :

E[∥y − Xθ∥22 /n] = MPR(θ, θ0 ) + E[∥ε0 ∥22 /n].

Proof. Since y = Xθ0 + ε0 , we have

E[∥y − Xθ∥22 /n] = E[∥X(θ0 − θ)∥22 /n] + E[∥ε0 ∥22 /n] + 2(θ0 − θ)′ E[X ′ ε0 ]/n.

Then, the result follows since E[X ′ ε0 ] = Σ_{i=1}^{n} E[Xi ε0i ] = 0, where Xi denotes the i−th row of X.
If our primary goal is to accurately predict the target variable, we seek
estimators θ̂ with a low mean prediction risk E[∥y − X θ̂∥22 /n]. Since we
cannot control the error term ε0 , Proposition 1.0.2 suggests that we should
focus on estimators with a low mean predictive risk.
On the other hand, if our interest lies in understanding which predictors
influence the target variable and how they do so, the true coefficient θ0
becomes our focus. In this case, we might prefer unbiased estimators – those
with zero bias – over biased ones. However, estimators with lower mean
squared error (MSE) are generally favored, even if they feature some bias.
The following proposition demonstrates that the MSE can be decomposed
into a bias and a variance term.
Proposition 1.0.3 (Bias-variance decomposition of MSE). Given an esti-
mator θ̂n ∈ Rp for θ0 ∈ Rp , the MSE can be decomposed as follows:

MSE(θ̂n , θ0 ) = ∥Bias(θ̂n , θ0 )∥22 + Trace(Var[θ̂n ]).

Proof. The result follows from

MSE(θ̂n , θ0 ) = E[(θ̂n − θ0 )′ (θ̂n − θ0 )]


= E[Trace{(θ̂n − θ0 )(θ̂n − θ0 )′ }]
= Trace(E[(θ̂n − θ0 )(θ̂n − θ0 )′ ])
= Trace(E[(θ̂n − E[θ̂n ] + Bias(θ̂n , θ0 ))(θ̂n − E[θ̂n ] + Bias(θ̂n , θ0 ))′ ])
= Trace(Var[θ̂n ] + Bias(θ̂n , θ0 ) Bias(θ̂n , θ0 )′ +
E[θ̂n − E[θ̂n ]] Bias(θ̂n , θ0 )′ + Bias(θ̂n , θ0 )E[θ̂n − E[θ̂n ]]′ )
= Trace(Var[θ̂n ] + Bias(θ̂n , θ0 ) Bias(θ̂n , θ0 )′ )
= ∥Bias(θ̂n , θ0 )∥22 + Trace(Var[θ̂n ]).
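
To make the bias-variance decomposition concrete, here is a small Monte Carlo sketch (illustrative only; the design, the shrinkage factor and the number of replications are arbitrary choices) that approximates the MSE of a deliberately biased estimator and compares it with ∥Bias∥22 + Trace(Var).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 1.0
theta0 = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, p))        # fixed design, kept constant across replications
Xpinv = np.linalg.pinv(X)
shrink = 0.8                           # a deliberately biased estimator: shrink the LSE toward zero

reps = 20000
estimates = np.empty((reps, p))
for b in range(reps):
    y = X @ theta0 + sigma * rng.standard_normal(n)
    estimates[b] = shrink * (Xpinv @ y)   # biased estimator

mse = np.mean(np.sum((estimates - theta0) ** 2, axis=1))
bias = estimates.mean(axis=0) - theta0
var_trace = np.trace(np.cov(estimates, rowvar=False))
# Proposition 1.0.3: MSE = ||Bias||_2^2 + Trace(Var)
print(mse, np.sum(bias ** 2) + var_trace)   # the two numbers should nearly coincide
```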

Loosely speaking, the bias and the variance of an estimator are linked
to the estimator’s ”complexity”. Estimators with higher complexity often
fit the data better, resulting in lower bias, but they are more sensitive to
data variations, leading to higher variance. Conversely, estimators with lower
complexity tend to have lower variance but higher bias, a phenomenon known
as the bias-variance tradeoff.
Apart from simple cases, computing the finite-sample properties of esti-
mators, such as their MSE or predictive risk, is infeasible or overly compli-
cated. This is because they require computations under the DGP of complex
transformations of the data. When direct computation is not possible, we
can rely on concentration inequalities or asymptotic approximations.
Concentration inequalities are inequalities that bound the probability
that a random variable deviates from a particular value, typically its ex-
pectation. In this chapter, we focus on inequalities that control the MSE or
predictive risk of an estimator, such as:

P[d(θ̂n , θ0 ) ≤ h(y, X, n, p)] ≥ 1 − δ,

or
P[d(lm(X, θ̂n ), lm(X, θ0 )) ≤ h(y, X, n, p)] ≥ 1 − δ,
where δ ∈ (0, 1) sets the confidence level 1 − δ, d : Rp ×Rp → [0, +∞) is a distance,
and h is a real-valued function of the data, the sample size, and the number
of predictors.

Large-sample properties
Large-sample or asymptotic theory provides an alternative approach to study
and analyse estimators. Classically, this framework develops approximations
of the finite-sample properties of estimators, such as their distribution, MSE
or predictive risk, by letting the sample size n → ∞. Consequently, these
approximations work well when the sample size n is much larger than the
number of predictors p. More recently, asymptotic approximations are also
developed by letting p → ∞, or having both n, p → ∞ at some rate. Note
that, given a sample of size n and number of variables p, there is no general
indication on how to choose the appropriate asymptotic regime for n and p,
as the goodness of fit of the corresponding asymptotic approximations should
be assessed on a case by case basis. In this chapter, we work with two notions
from large-sample theory: consistency and asymptotic distribution.
Definition 12 (Consistency). Estimator θ̂n of θ0 is consistent, written θ̂n →_P θ0 as n → ∞, if for all ε > 0,

lim_{n→∞} P[|θ̂n − θ0 | > ε] = 0.

Definition 13 (Asymptotic distribution). Given a deterministic real-valued


sequence rn,p → ∞, let Fn,p be the probability distribution of rn,p (θ̂n − θ0 )
and F a non-degenerate probability distribution. Estimator θ̂n of θ0 has
asymptotic distribution F with rate of convergence rn,p if Fn,p (z) → F (z)
as rn,p → ∞ for all z at which F (z) is continuous. Equivalent short-hand
notations are rn,p (θ̂n − θ0 ) →_d F and rn,p (θ̂n − θ0 ) →_d η ∼ F , as rn,p → ∞.

1.1 Least Squares and Penalized Least Squares


In this chapter, we study the most widely used methods in linear regression
analysis: the method of least squares and some of its penalized variants. The
method of least squares was first introduced by Legendre [1805] and Gauss
[1809], and it consists in minimizing the squared l2 −distance between the
target values y and the linear prediction lm(X, θ) = Xθ in the coefficient
vector θ ∈ Rp .

Figure 1.2: Adrien-Marie Legendre (1752–1833) and Johann Carl Friedrich
Gauss (1777–1855).

Definition 14 (Least squares estimator). The Least Squares Estimator (LSE)


is defined as:
θ̂nls ∈ argmin_{θ∈Rp} (1/2) Σ_{i=1}^{n} (yi − θ ′ xi )² = argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 .    (1.2)

In addition to the LSE, we consider the following variants.


Definition 15 (Ridgeless estimator). The ridgeless estimator is defined as:
 
θ̂nrl = argmin_{θ̂∈Rp} { ∥θ̂∥22 : θ̂ ∈ argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 }.    (1.3)

Definition 16 (Ridge estimator). The ridge estimator is defined for λ > 0 as:
θ̂nr (λ) = argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 + (λ/2) ∥θ∥22 .    (1.4)

Definition 17 (Lasso estimator). The lasso estimator is defined for λ > 0 as:
θ̂nl (λ) ∈ argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 + λ ∥θ∥1 .    (1.5)

Here is a brief overview of the results that are discussed in detail in the
rest of this chapter. A solution to the least squares problem (1.2) always
exists. However, when the predictors (i.e., the columns of X) are linearly
dependent, there are infinitely many solutions.4 In such cases, the LSE typ-
ically considered is the ridgeless estimator, which is always unique.
The ridge and lasso estimators are penalized or regularized versions of
the LSE, with penalty term λ ∥θ∥22 and λ ∥θ∥1 , respectively. The penalty
parameter λ > 0 controls the strength of the penalty. The ridge estimator,
4 This situation always arises when p > n, and it may arise even when p ≤ n.
introduced by Hoerl and Kennard [1970], was developed to address certain
shortcomings of the LSE, particularly in scenarios involving collinear or mul-
ticollinear designs – where the predictors in X are linearly dependent or
nearly-linearly dependent. The ridge estimator is uniquely defined and often
exhibits better statistical properties compared to those of the LSE in set-
tings with multicollinear or many predictors. On the other hand, the lasso
estimator, popularized by Tibshirani [1996], offers an approximation of the
l0 estimator, which is defined for some R > 0:
 
θ̂nl0 ∈ argmin_{θ∈Rp} { (1/2) ∥y − Xθ∥22 : ∥θ∥0 ≤ R },    (1.6)
where ∥θ∥0 is the number of nonzero elements in θ. A key feature of this es-
timator is its ability to produce sparse solutions, i.e., to set some coefficients
exactly to zero. Consequently, the l0 estimator can be used to perform pa-
rameter estimation and variable selection simultaneously. However, it is the
solution of a non-convex problem, and, in general, computing it can be an
”NP-hard” problem. The lasso instead shares the ability to produce sparse
solutions and it can be easily computed even for large datasets.
Remark 1 (Data standardization). For computational stability, it is recom-
mended to compute linear regression estimators with a least squares loss
after having standardized the predictors X so that x̄ := X ′ 1/n = 0 and
Xj′ Xj = 1 for each j = 1, . . . , p. Without standardization, the solutions
would depend on the units used to measure the predictors. Moreover, we
may also center the target variable y, meaning ȳ := y ′ 1/n = 0. These
centering conditions are convenient, since they mean that we can omit the
intercept term. Given an optimal solution θ̂ on the centered data, we can
recover the optimal solutions for the uncentered data: θ̂ is the same and the
intercept is given by ȳ − x̄′ θ̂.
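
Remark 1 in code: a minimal NumPy sketch (the helper names standardize and destandardize are not from the notes, just illustrative) that centers y, centers and scales the columns of X so that Xj′ Xj = 1, and maps a coefficient vector fitted on the standardized data back to the original units, including the intercept ȳ − x̄′ θ̂.

```python
import numpy as np

def standardize(X, y):
    """Center y, and center/scale the columns of X so that Xj'Xj = 1."""
    x_bar = X.mean(axis=0)
    y_bar = y.mean()
    Xc = X - x_bar
    scale = np.sqrt((Xc ** 2).sum(axis=0))   # column norms after centering
    return Xc / scale, y - y_bar, x_bar, y_bar, scale

def destandardize(theta_std, x_bar, y_bar, scale):
    """Map a coefficient vector fitted on standardized data back to the original data."""
    theta = theta_std / scale                # undo the column scaling
    intercept = y_bar - x_bar @ theta        # intercept for the uncentered data
    return theta, intercept

# usage example with arbitrary simulated data
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4)) * np.array([1.0, 10.0, 0.1, 5.0]) + 3.0
y = X @ np.array([1.0, 0.2, -5.0, 0.0]) + 2.0 + 0.1 * rng.standard_normal(30)

Xs, yc, x_bar, y_bar, scale = standardize(X, y)
theta_std = np.linalg.lstsq(Xs, yc, rcond=None)[0]   # any estimator fitted on standardized data
theta, intercept = destandardize(theta_std, x_bar, y_bar, scale)
print(theta, intercept)   # approximately [1.0, 0.2, -5.0, 0.0] and 2.0
```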

1.1.1 Existence and uniqueness


From here on, we make extensive use of the spectral decomposition of X.
Definition 18 (Spectral decomposition of X). The spectral decomposition of
X is
X = U SV ′ ,
where U ∈ Rn×n and V ∈ Rp×p are orthogonal matrices and, for r :=
Rank(X) ≤ min{n, p},
 
S = [ diag(s1 , . . . , sr )   0 ; 0   0 ] ∈ Rn×p .

We establish the following key results: the existence of the LSE, the
ridgeless, the ridge and the lasso estimators; the closed-form expression of
the LSE, ridgeless, and ridge; the uniqueness of the ridgeless and ridge; the
uniqueness of the LSE when Rank(X) = p, i.e., when the predictors in X are
linearly independent. Notice that this rank condition cannot hold if n < p.

Theorem 1.1.1 (Existence and uniqueness of LSE, ridgeless, ridge and


lasso). The following statements hold:
(i) The set of solutions to the least squares problem (1.2) is non-empty and
given by
argmin ∥y − Xθ∥22 /2 = X + y + Ker(X).
θ∈Rp

(ii) The ridgeless estimator exists, is an element of Range(X ′ ), and is


uniquely given in closed form by

θ̂nrl = X + y. (1.7)

(iii) If Rank(X) = p, then the LSE and the ridgeless estimator are uniquely
given in closed form by:

θ̂nls = θ̂nrl = (X ′ X)−1 X ′ y. (1.8)

(iv) The ridge estimator with penalty parameter λ > 0 exists, is an element
of Range(X ′ ), and is uniquely given in closed form by:

θ̂nr (λ) = (X ′ X + λI)−1 X ′ y. (1.9)

(v) The lasso estimator exists and, in general, it is not unique.


Proof. (i) The least squares problem (1.2) is an unconstrained optimiza-
tion problem with the objective function

f : Rp → [0, +∞); θ 7→ ∥y − Xθ∥22 /2.

By Theorem 2.2.1, the set of least squares minimizers is given by

S := {θ̂ ∈ Rp : X ′ X θ̂ = X ′ y}.

Since X ′ y ∈ Range(X ′ ) and Range(X ′ ) = Range(X ′ X), set S is


not empty. Consider a vector θ̂ ∈ X + y + Ker(X). Using that X =
PRange(X) X which implies X ′ = X ′ PRange(X) , we obtain

X ′ X θ̂ = X ′ XX + y = X ′ PRange(X) y = X ′ y.

Therefore, X + y + Ker(X) ⊂ S. Now consider a vector v ∈ Rp not in
set X + y + Ker(X). That is, v = θ̂ + u with θ̂ ∈ X + y + Ker(X) and a nonzero u ∈ Range(X ′ ). Since X ′ Xu ̸= 0,

X ′ Xv = X ′ y + X ′ Xu ̸= X ′ y.

We conclude that X + y + Ker(X) = S.

(ii) The minimum norm least squares problem in (1.3) has a strictly convex
and coercive objective function

f : Rp → R; θ 7→ ∥θ∥22 ,

and a closed convex feasible set X + y + Ker(X) ⊂ Rp . It follows that a


solution exists and it is unique; see Propositions 2.2.3 and 2.2.4. Since,
for any v ∈ Ker(X),

∥X + y∥22 ≤ ∥X + y∥22 + ∥v∥22 = ∥X + y + v∥22 ,

the ridgeless estimator can be expressed in closed form as θ̂nrl = X + y,


which is an element of Range(X ′ ) since X + = PRange(X ′ ) X + .

(iii) If Rank(X) = p, then Ker(X) = {0}. Moreover, X ′ X is invertible


and we can use the identity X + = (X ′ X)−1 X ′ to conclude that the
LSE and the ridgeless estimator are uniquely given by (1.8).

(iv) The ridge problem in (1.4) is an unconstrained optimization problem


with the strictly convex, coercive and continuously differentiable objec-
tive function
f : Rp → [0, +∞); θ 7→ ∥y − Xθ∥22 /2 + (λ/2) ∥θ∥22 .

It follows that a solution θ̂ r ∈ Rp exists and it is unique; see, again,


Propositions 2.2.3 and 2.2.4. Theorem 2.2.1 implies

(X ′ X + λI)θ̂ r (λ) = X ′ y.

Consider the spectral decomposition X = U SV ′ in Definition 18.


Then,
 
X ′ X + λI = V S ′ SV ′ + λV V ′ = V [ diag(s1² + λ, . . . , sr² + λ)   0 ; 0   λI ] V ′ ,
which is positive definite, and thus θ̂nr (λ) = (X ′ X + λI)−1 X ′ y is the
solution to the FOCs. Finally, to prove that θ̂nr (λ) ∈ Range(X ′ ), notice
that PRange(X ′ ) = V S + SV ′ , where
 
S + S = [ I   0 ; 0   0 ] .

Thus,

PRange(X ′ ) θ̂nr (λ) =V S + SV ′ V (S ′ S + λI)−1 V V ′ S ′ U ′ y


=V (S ′ S + λI)−1 S ′ U ′ y = θ̂nr (λ).

We conclude that PRange(X ′ ) θ̂nr (λ) = V (S ′ S+λI)−1 S ′ U ′ y, i.e., θ̂nr (λ) ∈


Range(X ′ ).

(v) The lasso problem in (1.5) is an unconstrained optimization problem


with the convex and coercive objective function
f : Rp → [0, +∞); θ 7→ ∥y − Xθ∥22 /2 + λ ∥θ∥1 .

It follows that a solution θ̂ l (λ) ∈ Rp exists; see Proposition 2.2.3. How-


ever, we demonstrate by counterexample that this solution is, in gen-
eral, not unique. Consider a sample (y, X) ∈ Rn × Rn×2 where the
two predictors are identical, i.e., x1 = x2 ∈ Rn , and assume that there exists a corresponding lasso solution θ̂nl (λ) = [θ̂n1l (λ), θ̂n2l (λ)]′ ∈ R2 that is non-zero. At an optimum the two components cannot have strictly opposite signs, since merging them into a single coordinate would lower the l1 −norm without changing the fit. Then,

θ̂1 = [θ̂n1l (λ) + θ̂n2l (λ), 0]′    and    θ̂2 = [0, θ̂n1l (λ) + θ̂n2l (λ)]′

are two distinct coefficient vectors that produce the same fit, X θ̂1 = X θ̂2 = X θ̂nl (λ), and have the same l1 −norm as θ̂nl (λ). Consequently, in this example there exist at least two distinct lasso solutions.

Remark 2 (Computation of ridgeless and ridge). The closed form expressions of the LSE, ridgeless and ridge estimators are useful analytical results. However, for numerical stability, it is recommended to compute these estimators by solving their corresponding normal equations, which are X ′ X θ̂n = X ′ y for the LSE or ridgeless, and (X ′ X + λI)θ̂nr = X ′ y for the ridge.
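
A minimal NumPy illustration of Remark 2 (illustrative only; sizes and seed are arbitrary): the ridge estimator is obtained by solving its normal equations with np.linalg.solve, while np.linalg.lstsq returns the minimum l2 −norm least squares solution X + y directly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 40, 10, 0.5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# ridge: solve the normal equations (X'X + lambda I) theta = X'y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# ridgeless / LSE: lstsq returns the minimum l2-norm least squares solution,
# i.e. X^+ y, without explicitly forming an inverse
theta_ridgeless = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.allclose(theta_ridgeless, np.linalg.pinv(X) @ y)
assert np.allclose((X.T @ X + lam * np.eye(p)) @ theta_ridge, X.T @ y)
print(theta_ridge[:3], theta_ridgeless[:3])
```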

Remark 3 (Collinearity). Using the notation in Definition 18, the minimum
nonzero eigenvalue of X ′ X is s2r . If r < p, then X ′ X has p − r zero eigen-
values and the predictors are said to be collinear, that is, they are linearly
dependent. In this case Ker(X) is not trivial (it contains nonzero elements),
hence the LSE is not unique. Moreover, if sr ≈ 0, then the computation of
 
X + = V [ diag(1/s1 , . . . , 1/sr )   0 ; 0   0 ] U ′ ,

and hence of the ridgeless estimator, is unstable. The ridge estimator in-
stead may not display these computational hurdles, provided that the penalty
parameter λ is large enough. That is because the minimum eigenvalue of
(X ′ X + λI) is s2r + λ. In Section 1.1.5 we show that, if sr ≈ 0, the ridge-
less (ridge) estimator’s MSE and MPR satisfy loose (sharp) concentration
inequalities.
Remark 4 (Uniqueness of the lasso solution). Tibshirani [2013] shows that,
under some conditions, the lasso estimator is unique. For instance, if the
predictors in X are in general position, then the lasso solution is unique.
Specifically, a set (aj )pj=1 where aj ∈ Rn for all j is in general position if any
affine subspace of Rn of dimension k < n contains at most k + 1 elements
of the set {±a1 , ±a2 , . . . , ±ap }, excluding antipodal pairs of points (that is,
points differing only by a sign flip). If the predictors are (non redundant)
continuous random variables, they are almost surely in general position, and
hence the lasso solution is unique. As a result, non-uniqueness of the lasso
solution typically occurs with discrete-valued data, such as those comprising
dummy or categorical variables.
Since the LSE, ridgeless, ridge and lasso estimators exist, their linear
predictions exist too. Moreover, the linear predictions of the uniquely defined
estimators, like ridgeless and ridge, are trivially unique. Remarkably, some
estimators that may not be unique entail unique linear predictions. The next
lemma implies that the LSE and lasso are among these estimators.

Lemma 1.1.2. Let h : Rp → (−∞, +∞] be a proper convex function. Then


Xθ1 = Xθ2 and h(θ1 ) = h(θ2 ) for any two minimizers θ1 , θ2 ∈ Rp of

f : Rp → (−∞, +∞]; θ 7→ (1/2) ∥y − Xθ∥22 + h(θ).
Proof. Assume that Xθ1 ̸= Xθ2 , and let δ := inf θ∈Rp f (θ). By Proposition

2.2.2, the set of minimizers of f is convex. Thus, for any α ∈ (0, 1):

δ = f (αθ1 + (1 − α)θ2 )
= (1/2) ∥y − X[αθ1 + (1 − α)θ2 ]∥22 + h(αθ1 + (1 − α)θ2 )
< (α/2) ∥y − Xθ1 ∥22 + ((1 − α)/2) ∥y − Xθ2 ∥22 + h(αθ1 + (1 − α)θ2 )
≤ (α/2) ∥y − Xθ1 ∥22 + ((1 − α)/2) ∥y − Xθ2 ∥22 + αh(θ1 ) + (1 − α)h(θ2 )
= αf (θ1 ) + (1 − α)f (θ2 ) = δ,

where the strict inequality follows from the strict convexity of g : v 7→


∥y − v∥22 . Since the conclusion δ < δ is absurd, we must have Xθ1 = Xθ2 ,
and since f (θ1 ) = f (θ2 ), it follows that h(θ1 ) = h(θ2 ) as well.
While we make use of this lemma for proving uniqueness of the predic-
tions of lasso, we can use a more direct approach for the other estimators’
predictions, which directly provides their closed form expressions. We further
show that the LSE’s prediction has the geometric interpretation of being the
unique vector in the range of X that is closest to y in l2 distance, and that
the residual vector is orthogonal to the range of X.

Theorem 1.1.3 (Uniqueness of linear predictions). The following statements


hold:

(i) The linear predictions of the LSE and the ridgeless estimator are uniquely
given by:
lm(X, θ̂nls ) = lm(X, θ̂nrl ) = PRange(X) y, (1.10)
which is the unique vector v ∈ Range(X) such that

∥y − v∥2 = inf{∥y − z∥2 : z ∈ Range(X)}.

Moreover, the residual vector y − lm(X, θ̂nls ) = y − lm(X, θ̂nrl ) is or-


thogonal to Range(X).

(ii) The linear prediction of the ridge estimator is uniquely given, for λ > 0,
by
lm(X, θ̂nr (λ)) = X(X ′ X + λI)−1 X ′ y.

(iii) The linear prediction of the lasso estimator is unique.

Proof. (i) The linear predictions lm(X, θ̂nls ) and lm(X, θ̂nrl ) are uniquely
given by (1.10) because all solutions to the least squares problem θ̂ ∈
X + y + Ker(X) yield the same prediction

lm(X, θ̂) = XX + y = PRange(X) y.

By the definition of θ̂nls and the fact that Range(X) is a closed vector
subspace of Rn , the remaining claims follow as a direct application of
the Hilbert projection theorem (Theorem 2.2.2).
(ii) This result follows directly from the closed form expression (1.9) of the
ridge estimator.
(iii) Since the l1 −norm is convex, the result follows by Lemma 1.1.2.

1.1.2 Equivalent expressions and relations


The ridgeless and the ridge, together with their corresponding linear predic-
tions, admit the following simple expressions.
Proposition 1.1.1 (Spectral expression of ridgeless and ridge). Given the
spectral decomposition X = U SV ′ in Definition 18:
(i) The ridgeless estimator is given by
θ̂nrl = ( Σ_{j=1}^{r} (1/sj ) vj uj′ ) y.

The corresponding linear prediction is

lm(X, θ̂nrl ) = ( Σ_{j=1}^{r} uj uj′ ) y.    (1.11)

(ii) The ridge estimator with λ > 0 is given by

θ̂nr (λ) = ( Σ_{j=1}^{r} sj /(sj² + λ) vj uj′ ) y.

The corresponding linear prediction is

lm(X, θ̂nr (λ)) = ( Σ_{j=1}^{r} sj² /(sj² + λ) uj uj′ ) y.    (1.12)

Proof. (i) From the closed-form expression of the ridgeless estimator,
θ̂nrl = X + y = V S + U ′ y = ( Σ_{j=1}^{r} (1/sj ) vj uj′ ) y.

Therefore,

X θ̂nrl = U SS + U ′ y = ( Σ_{j=1}^{r} uj uj′ ) y.

(ii) From the closed-form expression of the ridge estimator,

θ̂nr (λ) = (X ′ X + λI)−1 X ′ y = V (S ′ S + λI)−1 S ′ U ′ y = ( Σ_{j=1}^{r} sj /(sj² + λ) vj uj′ ) y.

Therefore,

X θ̂nr (λ) = U S(S ′ S + λI)−1 S ′ U ′ y = ( Σ_{j=1}^{r} sj² /(sj² + λ) uj uj′ ) y.

Using Definition 18, matrix PRange(X) = XX + = Σ_{j=1}^{r} uj uj′ , where {u1 , . . . , ur } is an orthonormal basis of Range(X). From expression (1.11),
it follows that the prediction of the ridgeless estimator is the orthogonal
projection of y onto the range of X. Expression (1.12) instead shows that
the ridge estimator shrinks this projection, shrinking less the directions uj
associated to high variance (high sj ), and more the directions uj associated
to low variance (low sj ); see Figure 1.3. Indeed, for fixed λ > 0, the weight
s2j /(s2j + λ) → 0 as sj → 0, and s2j /(s2j + λ) → 1 as sj → ∞.
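
The spectral expressions of Proposition 1.1.1 translate directly into code. The sketch below (illustrative only; dimensions and seed are arbitrary) computes the ridgeless and ridge estimators from the SVD of X, applies the shrinkage weights sj²/(sj² + λ) to the projection of y, and checks the results against the closed-form expressions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 30, 6, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt
r = np.sum(s > 1e-12)                              # numerical rank

# Proposition 1.1.1: ridgeless and ridge as weighted sums over singular directions
theta_rl = Vt.T[:, :r] @ ((U.T[:r] @ y) / s[:r])
theta_r  = Vt.T[:, :r] @ ((s[:r] / (s[:r] ** 2 + lam)) * (U.T[:r] @ y))

# shrinkage weights s_j^2 / (s_j^2 + lambda) applied to the projection of y onto Range(X)
fit_rl = U[:, :r] @ (U.T[:r] @ y)
fit_r  = U[:, :r] @ ((s[:r] ** 2 / (s[:r] ** 2 + lam)) * (U.T[:r] @ y))

assert np.allclose(theta_rl, np.linalg.pinv(X) @ y)
assert np.allclose(theta_r, np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))
assert np.allclose(fit_rl, X @ theta_rl) and np.allclose(fit_r, X @ theta_r)
print("spectral expressions verified")
```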

[Plot of the shrinkage factor f (s) = s²/(s² + λ) as a function of s.]
Figure 1.3: Shrinkage of principal components in the linear prediction of ridge when λ = 1/2.

The ridgeless estimator can also be expressed as a penalized LSE.


Proposition 1.1.2 (Penalized expression of ridgeless). The ridgeless es-
timator is the only solution to the least squares problem (1.2) that is in
Range(X ′ ), and it can be expressed as:
θ̂nrl = argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 + (1/2) θ ′ PKer(X) θ.    (1.13)
Proof. From Theorem 1.1.1, the solution set of the least squares problem
(1.2) is θ̂nrl + Ker(X), where θ̂nrl = X + y is in Range(X ′ ). Since Ker(X) ⊥
Range(X ′ ), θ̂nrl is the only solution in Range(X ′ ). Moreover, penalty h :
θ 7→ θ ′ PKer(X) θ is zero in θ̂nrl , and strictly positive at any other least squares
solution. We conclude that θ̂nrl minimizes (1.13).
The following linear transformations relate the ridgeless and the ridge esti-
mators.
Proposition 1.1.3 (Links between ridgeless and ridge). The following relations
between the ridge and the minimum norm least squares estimators hold:

θ̂nr (λ) = (X ′ X + λI)−1 X ′ X θ̂nrl , (1.14)


θ̂nrl = (X ′ X)+ (X ′ X + λI)θ̂nr (λ), (1.15)
lim θ̂nr (λ) = θ̂nrl . (1.16)
λ→0

Proof. Using X = PRange(X) X which implies X ′ = X ′ PRange(X) , we have

(X ′ X + λI)−1 X ′ y = (X ′ X + λI)−1 X ′ XX + y,

and thus
θ̂nr (λ) = (X ′ X + λI)−1 X ′ X θ̂nrl .
Using X + = X + (X + )′ X ′ , we have
X + y = X + (X + )′ (X ′ X + λI)(X ′ X + λI)−1 X ′ y.
Moreover, X + (X + )′ = (X ′ X)+ implies
θ̂nrl = (X ′ X)+ (X ′ X + λI)θ̂nr (λ).
Finally, since X + = limλ→0 (X ′ X + λI)−1 X ′ , we have limλ→0 θ̂nr (λ) = θ̂nrl .
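
Relations (1.14)–(1.16) are easy to verify numerically. The sketch below (illustrative only; it uses a deliberately rank-deficient design so that Ker(X) is non-trivial) checks that the ridge estimator approaches the ridgeless estimator as λ → 0 and that relation (1.14) holds.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, r = 20, 8, 4
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))   # Rank(X) = r < p
y = rng.standard_normal(n)

theta_rl = np.linalg.pinv(X) @ y                                 # ridgeless estimator X^+ y

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    print(lam, np.linalg.norm(ridge(lam) - theta_rl))            # distance shrinks as lam -> 0

# relation (1.14): ridge = (X'X + lam I)^{-1} X'X ridgeless
lam = 0.3
assert np.allclose(ridge(lam),
                   np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ X @ theta_rl))
```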

Expression (1.16) explains why estimator (1.3) is called the ridgeless esti-
mator. The ridge and lasso estimators can be expressed as constrained least
squares problems.
Proposition 1.1.4 (Equivalence between penalized and constrained least
squares). For c ≥ 0, λ ≥ 0, and some norm ∥·∥ : Rp → R, define:
C(c) := argmin_{θ∈Rp} { ∥y − Xθ∥22 /2 : ∥θ∥ ≤ c };

P(λ) := argmin_{θ∈Rp} { ∥y − Xθ∥22 /2 + λ ∥θ∥ }.

Then, for a given c > 0, there exists λ0 ≥ 0 such that C(c) ⊂ P(λ0 ). Con-
versely, for a given λ > 0, there exists c0 ≥ 0 such that P(λ) ⊂ C(c0 ).
Proof. The objective function h : θ 7→ ∥y − Xθ∥22 /2 is convex and continu-
ous, and the constraint set {θ ∈ Rp : ∥θ∥ ≤ c} is not empty, closed, bounded
and convex. By the KKT theorem for convex problems, θ̂ ∈ C(c) for any
c > 0 if and only if θ̂ satisfies the KKT conditions, for some corresponding
λ0 ≥ 0:
0 ∈ λ0 ∂∥θ̂∥ + X ′ X θ̂ − X ′ y,
∥θ̂∥ ≤ c,
λ0 (∥θ̂∥ − c) = 0.

By Theorem 2.2.1, the first of these conditions implies that θ̂ ∈ P(λ0 ). Now
fix a λ > 0 and notice that P(λ) is not empty, given that its objective function
is convex, continuous and coercive; see Proposition 2.2.3. We can thus take
some θ̂ ∈ P(λ). Then, θ̂ satisfies the KKT conditions for c0 = ∥θ̂∥, which
implies θ̂ ∈ C(c0 ).
Note that the link between the penalty parameter λ and the constraint
parameter c is not explicit.

1.1.3 Geometric interpretation
We illustrate the geometry of the least squares, ridge, and lasso solutions
through a simple example. Consider the linear model (1.1), with p = 2,
ε0i ∼ iidN (0, 1), θ0 = [1.5, 0.5]′ , E[xi ε0i ] = 0, and

xi ∼ iidN ( [ 0 ; 0 ], [ 2  0 ; 0  1 ] ).

Figure 1.4 shows the level curves of the least squares loss function f (θ) :=
∥y − Xθ∥22 /2, corresponding to values f1 < f2 < f3 < f4 . Its minimizer, or
least squares solution θ̂nls , which coincides with the ridgeless solution θ̂nrl , is
highlighted in the figure.

[Level curves of the least squares loss in the (θ1 , θ2 ) plane, with the minimizer θ̂nrl marked.]
Figure 1.4: Geometry of the least squares solution.

To illustrate the geometry of the ridge solution, we consider the con-


strained formulation of the ridge problem; see Proposition 1.1.4. Figure 1.5
demonstrates the impact of imposing the ridge constraint, represented by the
sphere {θ ∈ R2 : ∥θ∥2 ≤ c} with c = 0.5, on the least squares problem. The
ridge solution θ̂nr is located at the intersection between the ridge constraint
and the lower level set of the least squares loss at the lowest height (see
Appendix 2.2 Definition 71) for which the intersection is non-empty. If the
ridgeless solution θ̂nrl lies within the constraint boundary, then θ̂nr coincides
with θ̂nrl . Otherwise, the ridge solution θ̂nr , by construction, is closer to the
origin than θ̂nrl , demonstrating the shrinkage effect of the ridge penalty. In
general, θ̂nr is dense (i.e., contains no zero elements) with probability one.

[Level curves of the least squares loss, the ridge constraint {θ : ∥θ∥2 ≤ c}, and the solutions θ̂nrl and θ̂nr .]
Figure 1.5: Geometry of the ridge solution.

Figure 1.6 illustrates the effect of the lasso constraint, represented by the rotated square {θ ∈ R2 : ∥θ∥1 ≤ c} with c = 0.5, on the least squares solution. Like the ridge solution, the lasso solution θ̂nl is located at the intersection between the lasso constraint and the lower level set of the least squares loss at the lowest height for which the intersection is non-empty. For small values of c, this intersection is more likely to occur along one of the coordinate axes. As a result, the lasso solution tends to be sparse, meaning that some components of θ̂nl are exactly zero.

[Level curves of the least squares loss, the lasso constraint {θ : ∥θ∥1 ≤ c}, and the solutions θ̂nrl and θ̂nl .]
Figure 1.6: Geometry of the lasso solution.
As discussed in Section 1.1, the lasso estimator serves as an approxima-
tion to the l0 −estimator (1.6). This relationship becomes evident through
visual comparison of Figure 1.6 and Figure 1.7. The lasso constraint set
{θ : ∥θ∥1 ≤ c} is the convex hull (i.e., the smallest convex superset) of
the constraint set underlying the l0 −estimator, which in this example is given by {θ : ∥θ∥0 ≤ 1, ∥θ∥∞ ≤ c}. Further details on this approximation can be found
in Argyriou et al. [2012].

[Level curves of the least squares loss, the l0 constraint {θ : ∥θ∥0 ≤ 1, ∥θ∥∞ ≤ c}, and the solutions θ̂nrl and θ̂nl0 .]
Figure 1.7: Geometry of the l0 solution.

To illustrate the geometry of the ridgeless solution, consider the linear


model (1.1) with p = 2, ε0i ∼ iidN (0, 1), xi1 ∼ iidN (0, 1), E[xi1 ε0i ] = 0, and
xi2 = 2xi1 . In this case, the predictors are linearly dependent. As a result,
the second-moment matrix of the predictors is reduced-rank:
 
E[xx′ ] = [ 1  2 ; 2  4 ],

with Rank(E[xx′ ]) = 1. Let E[xi1 yi ] = 1. The identifying condition E[xi ε0i ] =


0 holds if and only if the population coefficient θ0 satisfies
   
E[xx′ ]θ0 = E[xi yi ] ⇐⇒ [ 1  2 ; 2  4 ] θ0 = [ 1 ; 2 ].

Thus, any coefficient in the set θ0rl + Kernel(E[xx′ ]) satisfies this condition,
where
θ0rl := E[xx′ ]+ E[xi yi ] = [0.2, 0.4]′ .
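
The population computation above can be reproduced with NumPy's pseudo-inverse (a small sketch of the same calculation, nothing new):

```python
import numpy as np

Exx = np.array([[1.0, 2.0],
                [2.0, 4.0]])          # E[xx'], rank 1
Exy = np.array([1.0, 2.0])            # E[x y], using E[x_{i1} y_i] = 1 and x_{i2} = 2 x_{i1}

theta0_rl = np.linalg.pinv(Exx) @ Exy
print(theta0_rl)                      # [0.2, 0.4], the ridgeless estimand

# any theta0_rl + v with v in Ker(E[xx']) satisfies E[xx'] theta = E[xy] as well
v = np.array([2.0, -1.0])             # spans Ker(E[xx'])
assert np.allclose(Exx @ (theta0_rl + 3.0 * v), Exy)
```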

If the sample size n > Rank(E[xx′ ]), then Ker(X) ⊃ Ker(E[xx′ ]), and
the same issue arises in the finite-sample least squares problem, where the
objective function f (θ) is minimized at any point on the affine set

θ̂nrl + Ker(X).

Figure 1.8 depicts the level curves of f (θ) at f (θ̂nrl ) = f1 < f2 < f3 .
These curves are parallel lines, unlike the typical ellipses seen in full-rank
cases. The ridgeless estimator is the minimum l2 -norm solution to the least
squares problem, as expected from its construction.

[Parallel level lines f1 < f2 < f3 of the least squares loss, the ball {θ : ∥θ∥2 ≤ ∥θ̂nrl ∥2 }, and the ridgeless solution θ̂nrl .]
Figure 1.8: Geometry of the ridgeless solution.

1.1.4 Computation of lasso


By Theorem 2.2.1, a convex function f : Rp → R is minimized at θ ∗ ∈ Rp if and only if 0 ∈ ∂f (θ ∗ ). In the lasso problem, the objective function

f l : Rp → R; θ 7→ (1/2n) ∥y − Xθ∥22 + λ ∥θ∥1    (1.17)
contains the l1 −norm, which makes f l non-smooth.5 As a result, the subdif-
ferential of f l at a minimizer θ̂nl (λ) is not a singleton, implying that the lasso
estimator may not be unique. Moreover, due to the complexity of ∂f l (θ̂nl (λ)),
no closed-form solution exists in general for the lasso estimator.
Fortunately, Proposition 1.1.4 implies that the lasso problem is a quadratic program with a convex constraint, which allows for the computation of the lasso estimator using various quadratic programming algorithms. One particularly simple and effective method is the cyclical coordinate descent algorithm, which minimizes the convex objective function by cycling through the coordinates and minimizing over one coordinate at a time. This approach provides insight into how the lasso solution is obtained.
5 Notice that f l is the objective function in (1.5), multiplied by 1/n. This term does not show up in the penalization term as it is absorbed by λ.
Consider the soft-thresholding operator for a given λ > 0, which is defined
as the function

Sλ : R → R; η 7→ { η − λ if η > λ;   0 if η ∈ [−λ, λ];   η + λ if η < −λ }.

This operator is illustrated in Figure 1.9.

[Plot of Sλ (η) against η: equal to zero on [−λ, λ] and shifted toward zero by λ outside.]
Figure 1.9: Soft-thresholding operator.

The soft-thresholding operator provides a direct way to compute the lasso


estimator in a univariate regression model, i.e., when there is only one pre-
dictor.
Proposition 1.1.5 (Lasso solution for univariate regression). Given λ > 0
and X1 ∈ Rn such that X1′ X1 > 0, we have
θ̂nl (λ) := argmin_{θ∈R} { (1/2n) ∥y − X1 θ∥22 + λ|θ| } = Sλ (X1′ y/n) / (X1′ X1 /n).

Proof. The subdifferential of f : θ 7→ (1/2n) ∥y − X1 θ∥22 + λ|θ| at θ̂ ∈ R reads

∂f (θ̂) = bθ̂ − a + λ∂|θ̂|,
where a := X1′ y/n and b := X1′ X1 /n. From Theorem 2.2.1, and the subd-
ifferential of the absolute value function (Appendix 2.2, Example 7), θ̂ is a
minimizer of f if and only if

0 ∈ ∂f (θ̂) ⇐⇒ a ∈ bθ̂ + {λ} if θ̂ > 0,   a ∈ bθ̂ + [−λ, λ] if θ̂ = 0,   a ∈ bθ̂ + {−λ} if θ̂ < 0.

This condition reads: (i) if θ̂ > 0, then θ̂ = (a − λ)/b, implying a > λ; (ii)
if θ̂ = 0, then −λ ≤ a ≤ λ; and (iii) if θ̂ < 0, then θ̂ = (a + λ)/b, implying
a < −λ. These cases are summarized by θ̂ = Sλ (a)/b.
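
A minimal NumPy sketch of the soft-thresholding operator and of Proposition 1.1.5 (illustrative only; the closed-form univariate solution is checked against a brute-force grid search):

```python
import numpy as np

def soft_threshold(eta, lam):
    """S_lambda(eta): shrink eta toward zero by lam and set it to zero on [-lam, lam]."""
    return np.sign(eta) * np.maximum(np.abs(eta) - lam, 0.0)

rng = np.random.default_rng(6)
n, lam = 100, 0.2
x1 = rng.standard_normal(n)
y = 0.5 * x1 + rng.standard_normal(n)

# closed-form univariate lasso solution of Proposition 1.1.5
a, b = x1 @ y / n, x1 @ x1 / n
theta_hat = soft_threshold(a, lam) / b

# brute-force check on a fine grid: f(theta) = (1/2n)||y - x1 theta||^2 + lam |theta|
grid = np.linspace(-2.0, 2.0, 400001)
obj = 0.5 / n * (y @ y - 2.0 * grid * (x1 @ y) + grid ** 2 * (x1 @ x1)) + lam * np.abs(grid)
print(theta_hat, grid[np.argmin(obj)])   # the two values agree up to the grid spacing
```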
Proposition 1.1.5 can be used to show that the j−th coordinate of the
lasso solution in a multivariate regression model, i.e., when there is more
than just one predictor, satisfies an expression based on the soft-thresholding
operator applied to the residual of a lasso regression of y onto the predictors
Xk at position k ̸= j.

Theorem 1.1.4 (Lasso solution). Let Xj denote the j−th column of X and
X(−j) denote X without the j−th column. Assume that Xj′ Xj > 0 for all
j = 1, . . . , p. Then, given λ > 0, any lasso solution θ̂nl (λ) is such that for all
j = 1, . . . , p:

θ̂n,j (λ) = argmin_{θ∈R} { (1/2n) ∥ej − Xj θ∥22 + λ|θ| } = Sλ (Xj′ ej /n) / (Xj′ Xj /n),    (1.18)

where θ̂n,j (λ) is the j−th element of θ̂nl (λ), θ̂n,(−j) (λ) is θ̂nl (λ) without the j−th element, and ej := y − X(−j) θ̂n,(−j) (λ).

Proof. The subdifferential of the lasso objective function f l defined in (1.17) at θ̂ ∈ Rp is

∂f l (θ̂) = (X ′ X/n)θ̂ − X ′ y/n + λ∂∥θ̂∥1 ,

where

∂∥θ̂∥1 = {v ∈ Rp : vj ∈ ∂|θ̂j | for all j = 1, . . . , p},

and the subdifferential of the absolute value function |·| is given in Appendix 2.2, Example 7. From Theorem 2.2.1, a minimizer θ̂nl (λ) of f l satisfies

0 ∈ ∂f l (θ̂nl (λ)) ⇐⇒ X ′ y/n ∈ (X ′ X/n)θ̂nl (λ) + λ∂∥θ̂nl (λ)∥1 .
This condition holds if and only if for all j = 1, . . . , p:

Xj′ y/n ∈ (Xj′ X/n)θ̂nl (λ) + λ∂|θ̂n,j (λ)| ⇐⇒
Xj′ ej /n ∈ (Xj′ Xj /n)θ̂n,j (λ) + λ∂|θ̂n,j (λ)| ⇐⇒
θ̂n,j (λ) = Sλ (Xj′ ej /n) / (Xj′ Xj /n),

where the first double implication follows from

Xj′ X θ̂nl (λ) = Σ_{k=1}^{p} Xj′ Xk θ̂n,k (λ) = Xj′ X(−j) θ̂n,(−j) (λ) + Xj′ Xj θ̂n,j (λ),

and the last double implication follows from Proposition 1.1.5 since, by Theorem 2.2.1,

Xj′ ej /n ∈ (Xj′ Xj /n)θ̂n,j (λ) + λ∂|θ̂n,j (λ)| ⇐⇒ θ̂n,j (λ) = argmin_{θ∈R} { (1/2n) ∥ej − Xj θ∥22 + λ|θ| }.

Theorem 1.1.4 suggests that the lasso solution can be computed by a cyclical coordinate minimization algorithm. Given a candidate solution θ̂^(t) at iteration t + 1, this iterative algorithm updates one coordinate j as

θ̂_j^(t+1) = argmin_{θ∈R} f (θ̂_1^(t) , . . . , θ̂_{j−1}^(t) , θ, θ̂_{j+1}^(t) , . . . , θ̂_p^(t) ),

and sets θ̂_k^(t+1) = θ̂_k^(t) for k ̸= j. A typical choice for the lasso solution would
be to cycle through the coordinates in their natural order: from 1 to p. The
coordinate descent algorithm is guaranteed to converge to a global minimizer
of any convex cost function f : Rp → R satisfying the additive decomposition:
f : θ 7→ g(θ) + Σ_{j=1}^{p} hj (θj ),

where g : Rp → R is differentiable and convex, and the univariate functions hj : R → R are convex (but not necessarily differentiable); see Tseng
[2001]. What makes this algorithm work for the lasso problem is the fact
that objective function (1.17) satisfies this separable structure.

Remark 5. If the predictors are measured in different units, it is recommended
to standardize them so that Xj′ Xj = 1 for all j. In this case, the lasso update
(1.18) has the simpler form:
θ̂n,j (λ) = Sλ (Xj′ ej /n).

Algorithm 1 summarizes the pseudo-code of the cyclical coordinate de-


scent algorithm for computing the lasso estimator. This algorithm proceeds
by cyclically applying the soft-thresholding update in (1.18) for each coordinate, simultaneously updating the residuals ej := y − X(−j) θ̂n,(−j) (λ). The
ridgeless or the ridge estimators can be used to initialize the procedure.

Algorithm 1 Cyclical coordinate descent method for the lasso estimator.


Require: y ∈ Rn and X ∈ Rn×p such that Xj′ Xj > 0 for all j = 1, . . . , p
Require: Penalty parameter λ > 0
Require: Initial estimator θ̂ni (e.g., ridgeless or ridge)
Require: Maximum number of iterations T
  Standardize y and X so that y ′ 1 = 0, X ′ 1 = 0 and diag(X ′ X/n) = I
  θ̂^(1) ← θ̂ni
  for t = 2, . . . , T do
    θ̂^(t) ← θ̂^(t−1)
    for j = 1, . . . , p do
      ej ← y − X(−j) θ̂_(−j)^(t)
      θ̂_j^(t) ← Sλ (Xj′ ej /n)
    end for
    if a suitable stopping rule is satisfied then
      Stop and output θ̂^(t)
    end if
  end for
  Output θ̂^(T )
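
Below is a minimal Python/NumPy implementation of Algorithm 1 (an illustrative sketch, not the notes' reference code; the function name lasso_cd, the stopping rule and the tolerance are arbitrary choices). It keeps the full residual in sync instead of recomputing ej from scratch, and the result is checked against the subgradient optimality condition |Xj′ (y − X θ̂)/n| ≤ λ.

```python
import numpy as np

def soft_threshold(eta, lam):
    return np.sign(eta) * max(abs(eta) - lam, 0.0)

def lasso_cd(X, y, lam, theta_init=None, max_iter=1000, tol=1e-10):
    """Cyclical coordinate descent for (1/2n)||y - X theta||_2^2 + lam ||theta||_1.

    Assumes y and the columns of X have been centered/scaled as in Algorithm 1.
    """
    n, p = X.shape
    theta = np.zeros(p) if theta_init is None else theta_init.copy()
    col_sq = (X ** 2).sum(axis=0) / n              # X_j' X_j / n for each column
    resid = y - X @ theta                          # full residual, updated incrementally
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # partial residual e_j = y - X_{(-j)} theta_{(-j)} = resid + X_j theta_j
            rho = X[:, j] @ resid / n + col_sq[j] * theta[j]
            new_j = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (theta[j] - new_j)  # keep the residual in sync
            max_change = max(max_change, abs(new_j - theta[j]))
            theta[j] = new_j
        if max_change < tol:                       # simple stopping rule
            break
    return theta

# usage on simulated, standardized data
rng = np.random.default_rng(7)
n, p, lam = 200, 10, 0.1
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.5] + [0.0] * 7) + rng.standard_normal(n)
y = y - y.mean()

theta_hat = lasso_cd(X, y, lam)
# optimality check of Theorem 1.1.4: |X_j'(y - X theta)/n| <= lam for every j
grad = X.T @ (y - X @ theta_hat) / n
assert np.all(np.abs(grad) <= lam + 1e-8)
print(np.round(theta_hat, 3))
```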

In practice, it is often desirable to compute the lasso solution not for a single fixed value of λ, but for the entire solution path over a range of λ values. A common approach begins by selecting a value of λ just large enough that the only optimal solution is the zero vector. This value is denoted as λmax = maxj |Xj′ y|/n. From there, we gradually decrease λ by a small amount and run coordinate descent until convergence, using the previous solution as a "warm start" for the new value of λ. Repeating this as λ is decreased further, we can efficiently compute the solutions over a grid of λ values. This approach is
known as pathwise coordinate descent.
Coordinate descent is particularly efficient for the lasso because the up-
date rule (1.18) is available in closed form, eliminating the need for iterative
searches along each coordinate. Additionally, the algorithm exploits the in-
herent sparsity of the problem: for sufficiently large values of λ, most coeffi-
cients will be zero and will remain unchanged. There are also computational
strategies that can predict the active set of variables, significantly speeding
up the algorithm. More details on the pathwise coordinate descent algorithm
for lasso can be found in Friedman et al. [2007].
Homotopy methods are another class of techniques for computing the lasso estimator. They produce the entire path of solutions in a sequential fashion, starting at zero. A homotopy method that is particularly efficient at com-
puting the entire lasso path is the least angle regression (LARS) algorithm;
see Efron et al. [2004].

1.1.5 Finite-sample properties of ridgeless and ridge


This section presents finite-sample expressions and bounds for the bias, vari-
ance, MSE and MPR of the LSE, ridgeless and ridge estimators. The main
underlying assumption is that linear model (1.1) satisfies the typical regres-
sion condition E[xε0 ] = 0. Furthermore, we work with a fixed design matrix
X (or equivalently we work conditionally on X).
The next proposition derives the bias, MSE and MPR of the LSE when
it is well-defined, that is, when Rank(X) = p, which implies p ≤ n.
Proposition 1.1.6 (Finite-sample properties of LSE (fixed design)). Assume
that the linear model (1.1) holds with E[xε0 ] = 0. Then, for a fixed design
matrix such that Rank(X) = p:
(i) The LSE is unbiased: E[θ̂nls ] = θ0 .

(ii) The variance of the LSE is given by

Var[θ̂nls ] = (X ′ X)−1 X ′ Var[ε0 ]X(X ′ X)−1 .

Further let Var[ε0 ] = σ 2 I with σ > 0. Then:


(iii) Var[θ̂nls ] = σ 2 (X ′ X)−1 .

(iv) The LSE is the best linear unbiased estimator, in the sense that Var[θ̃n ]−
Var[θ̂nls ] is positive semi-definite for any other unbiased linear estimator
θ̃n , i.e., θ̃n = Ay for some A ∈ Rp×n .

31
(v) The MSE of the LSE is given by MSE(θ̂nls , θ0 ) = (σ²/n) Σ_{j=1}^{p} 1/λj , where λ1 ≥ . . . ≥ λp > 0 are the eigenvalues of X ′ X/n. Therefore,

MSE(θ̂nls , θ0 ) ≤ σ²p/(λp n).

(vi) The mean predictive risk of the LSE is given by:


MPR(θ̂nls , θ0 ) = pσ 2 /n.

Proof. (i) From the closed form expression (1.8):


θ̂nls = (X ′ X)−1 X ′ ε0 + θ0 .
Thus, unbiasedness follows directly since:
E[θ̂nls ] − θ0 = (X ′ X)−1 E[X ′ ε0 ] = 0.

(ii) The closed form expression of θˆnls immediately implies the expression
Var[θ̂nls ] = Var[(X ′ X)−1 X ′ ε0 ]
=(X ′ X)−1 X ′ Var[ε0 ]X(X ′ X)−1 .

(iii) This is immediate from (ii).


(iv) A linear estimator θ̃n = Ay is unbiased if and only if AX = I. Let
M := X(X ′ X)−1 X ′ , and notice that (I − M )(I − M ) = (I − M ),
i.e., (I − M ) is idempotent. It follows that:
Var[θ̃n ] − Var[θ̂nls ] =σ 2 (AA′ − (X ′ X)−1 )
=σ 2 (AA′ − AX(X ′ X)−1 X ′ A′ )
=σ 2 A(I − M )A′
=σ 2 [A(I − M )][A(I − M )]′ ,
which is positive semi definite.
(v) Using the linearity of the Trace operator and the SVD decomposition
of X in Definition 18:
E[∥θ̂nls − θ0 ∥22 ] = E[Trace((θ̂nls − θ0 )(θ̂nls − θ0 )′ )]
= Trace(Var[θ̂nls ])
= (σ²/n) Trace((X ′ X/n)−1 )
= (σ²/n) Trace(V ′ (S ′ S/n)−1 V ) = (σ²/n) Σ_{j=1}^{p} 1/λj .

(vi) Simple computations give

E[∥lm(X, θ̂nls ) − lm(X, θ0 )∥22 /n]


=E[∥X(θ̂nls − θ0 )∥22 /n]
=E[Trace((θ̂nls − θ0 )′ X ′ X/n(θ̂nls − θ0 ))]
=E[Trace((θ̂nls − θ0 )(θ̂nls − θ0 )′ X ′ X/n)]
= Trace(Var[θ̂nls ]X ′ X/n) = σ 2 p/n.

This proposition shows that the LSE’s accuracy decreases:


• as the variance σ 2 of the error term increases;

• as the number of predictors per observation p/n increases;

• as the ”degree of singularity” of the design matrix 1/λp increases.
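
The exact finite-sample formulas of Proposition 1.1.6 can be checked by simulation. The following sketch (illustrative only; sizes, noise level and seed are arbitrary) estimates the MSE and MPR of the LSE under a fixed Gaussian design and compares them with (σ²/n) Σ 1/λj and pσ²/n.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 60, 5, 1.5
X = rng.standard_normal((n, p))                      # fixed design with Rank(X) = p
theta0 = rng.standard_normal(p)
Xpinv = np.linalg.pinv(X)                            # here equal to (X'X)^{-1} X'
eigvals = np.linalg.eigvalsh(X.T @ X / n)            # eigenvalues lambda_j of X'X/n

reps = 20000
mse, mpr = 0.0, 0.0
for _ in range(reps):
    y = X @ theta0 + sigma * rng.standard_normal(n)
    theta_ls = Xpinv @ y                             # LSE
    mse += np.sum((theta_ls - theta0) ** 2) / reps
    mpr += np.sum((X @ (theta_ls - theta0)) ** 2) / n / reps

print("MSE:", mse, "theory:", sigma**2 / n * np.sum(1.0 / eigvals))
print("MPR:", mpr, "theory:", p * sigma**2 / n)
```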


We now aim to drop the requirement that Rank(X) = p, to allow for high-
dimensional settings where:
• n < p, or even

• Rank(E[xx′ ]) = r0 , not necessarily equal to p.

We next show that, when Rank(E[xx′ ]) < p, the typical linear regression condition E[xε0 ] = 0 no longer identifies a unique estimand θ0 .
Proposition 1.1.7. Given random variables y ∈ R and x ∈ Rp , let ε(θ) :=
y − x′ θ for every θ ∈ Rp . The set

S := {θ0 ∈ Rp : y = x′ θ0 + ε(θ0 ), E[xε(θ0 )] = 0}

is either empty or S = E[xx′ ]+ E[xy] + Ker(E[xx′ ]).


Proof. The fact that, in some cases, set S can be empty is obvious. Moreover,
since
S = {θ0 ∈ Rp : E[xx′ ]θ0 = E[xy]},
then E[xy] ∈ Range(E[xx′ ]) when S is not empty. In this case, it follows
from Theorem 2.1.5 that S = E[xx′ ]+ E[xy] + Ker(E[xx′ ]).
However, when S is non-empty, it contains exactly one element of Range(E[xx′ ]). This element is well-defined even when Rank(E[xx′ ]) < p, and it is equal to θ0 by construction when Rank(E[xx′ ]) = p. We define it as follows.

Definition 19 (Ridgeless estimand). The ridgeless estimand is defined as the
vector θ0rl ∈ Range(E[xx′ ]) given by θ0rl := E[xx′ ]+ E[xy].6
We can now extend Proposition 1.1.6 to the fixed design setting where
Rank(X) ≤ p.
Proposition 1.1.8 (Finite-sample properties of ridgeless (fixed design)).
Assume that the linear model (1.1) holds with E[xε0 ] = 0 and denote r0 :=
Rank(E[xx′ ]). Then, for a fixed design matrix:
(i) E[θ̂nrl ] = PRange(X ′ ) θ0rl . If Range(X ′ ) = Range(E[xx′ ]), which implies
n ≥ r0 , then the ridgeless estimator is unbiased:

E[θ̂nrl ] = θ0rl .

(ii) The variance of the ridgeless estimator is given by

Var[θ̂nrl ] = X + Var[ε(θ0rl )](X + )′ ,

where ε(θ0rl ) := y − Xθ0rl .


Further let Var[ε(θ0rl )] = σ 2 I with σ > 0, and define r := Rank(X) ≤
min{n, p}. Then:
(iii) Var[θ̂nrl ] = σ 2 (X ′ X)+ .
(iv) The MSE of the ridgeless estimator is given by:
MSE(θ̂nrl , θ0rl ) = (σ²/n) Σ_{j=1}^{r} 1/λj + ∥ PKer(X) θ0rl ∥22 ,
n j=1 λj

where λ1 ≥ . . . ≥ λr > 0 are the positive eigenvalues of X ′ X/n.


(v) The mean predictive risk of the ridgeless estimator is given by:

MPR(θ̂nrl , θ0rl ) = rσ 2 /n.

(vi) If Range(X ′ ) = Range(E[xx′ ]), we have


MSE(θ̂nrl , θ0rl ) = (σ²/n) Σ_{j=1}^{r0} 1/λj ≤ σ²r0 /(λr0 n),

and
MPR(θ̂nrl , θ0rl ) = r0 σ 2 /n.
6 The result that θ0rl ∈ Range(E[xx′ ]) follows from the identity E[xx′ ]+ = PRange(E[xx′ ]) E[xx′ ]+ . Notice that we used the ridgeless estimand in Section 1.1.3.

Proof. (i) Using Proposition 1.1.7, we have E[Xε(θ0rl )] = 0, which implies
E[ε(θ0rl )] = 0 under a (non-trivial) fixed design. Simple computations
then give
E[θ̂nrl ] = X + E[y] = X + Xθ0rl + X + E[ε(θ0rl )] = PRange(X ′ ) θ0rl .
If Range(X ′ ) = Range(E[xx′ ]), we conclude that E[θ̂nrl ] = θ0rl since
θ0rl ∈ Range(E[xx′ ]).
(ii) The closed-form expression of θ̂ rl immediately implies
Var[θ̂nrl ] = X + Var[ε(θ0rl )](X + )′ .

(iii) It follows since X + (X + )′ = (X ′ X)+ .


(iv) Using the fact that Rank(X) = r:
Trace(Var[θ̂nrl ]) = (σ²/n) Trace((X ′ X/n)+ ) = (σ²/n) Σ_{j=1}^{r} 1/λj .

Moreover,
Bias(θ̂nrl , θ0rl ) = (PRange(X ′ ) −I)θ0rl = − PKer(X) θ0rl .
The result then follows using Proposition 1.0.3.
(v) Proposition 1.0.1 and E[θ̂nrl ] = PRange(X ′ ) θ0rl imply lm(X, θ0rl ) = XE[θ̂nrl ].
Therefore:
E[∥lm(X, θ̂nrl ) − lm(X, θ0rl )∥22 /n]
=E[∥X(θ̂nrl − E[θ̂nrl ])∥22 /n]
=E[Trace{(θ̂nrl − E[θ̂nrl ])′ X ′ X/n(θ̂nrl − E[θ̂nrl ])}]
=E[Trace{(θ̂nrl − E[θ̂nrl ])(θ̂nrl − E[θ̂nrl ])′ X ′ X/n}]
= Trace(Var[θ̂nrl ]X ′ X/n)
=σ 2 /n Trace[(X ′ X)+ X ′ X] = σ 2 /n Trace(X + X),
where the last equality follows from the identity (X ′ X)+ X ′ = X + . Fi-
nally, considering the spectral decomposition X = U SV ′ in Definition
18, we obtain:
σ²/n Trace(X + X) = σ²/n Trace(V S + SV ′ ) = σ²/n Trace( [ Ir   0r×(p−r) ; 0(p−r)×r   0(p−r)×(p−r) ] ) = σ²r/n.

(vi) If Range(X ′ ) = Range(E[xx′ ]), then Rank(X) = r0 and Ker(X) =
Ker(E[xx′ ]). Therefore Bias(θ̂nrl , θ0rl ) = 0 as θ0rl ∈ Range(E[xx′ ]), and
we have
MSE(θ̂nrl , θ0rl ) = (σ²/n) Σ_{j=1}^{r0} 1/λj .
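
Proposition 1.1.8 can be illustrated in the same way with a rank-deficient fixed design (sketch only; rank, sizes and seed are arbitrary). Choosing θ0rl ∈ Range(X ′ ), the simulated mean predictive risk of the ridgeless estimator should be close to rσ²/n.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, r, sigma = 50, 8, 3, 1.0
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))   # fixed design, Rank(X) = r < p
Xp = np.linalg.pinv(X)
theta0_rl = (Xp @ X) @ rng.standard_normal(p)                   # an estimand in Range(X')

reps = 20000
mpr = 0.0
for _ in range(reps):
    y = X @ theta0_rl + sigma * rng.standard_normal(n)
    theta_rl = Xp @ y                                           # ridgeless estimator X^+ y
    mpr += np.sum((X @ (theta_rl - theta0_rl)) ** 2) / n / reps

print("simulated MPR:", mpr, "theory r*sigma^2/n:", r * sigma**2 / n)
```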

Proposition 1.1.9 (Finite-sample properties of ridge (fixed design)). As-


sume that the linear model (1.1) holds with E[xε0 ] = 0. Denote r0 :=
Rank(E[xx′ ]), λ > 0 and Q(λ) := (X ′ X + λI)−1 X ′ X. Then, for a fixed
design matrix:
(i) The ridge estimator is biased: E[θ̂nr (λ)] = Q(λ)θ0rl .
(ii) The variance of the ridge estimator is given by

Var[θ̂nr (λ)] = (X ′ X + λI)−1 X ′ Var[ε0 ]X(X ′ X + λI)−1 .

Further let Var[ε0 ] = σ 2 I with σ > 0. Then:


(iii) Var[θ̂nr (λ)] = σ 2 (X ′ X+λI)−1 X ′ X(X ′ X+λI)−1 . Moreover, Var[θ̂nrl ]−
Var[θ̂nr (λ)] is positive definite.
(iv) The MSE of the ridge estimator is given by:
MSE(θ̂nr (λ), θ0rl ) = (σ²/n) Σ_{j=1}^{r} λj /(λj + λ/n)² + ∥ [I − Q(λ)]θ0rl ∥22 ,

where λ1 ≥ . . . ≥ λr > 0 are the positive eigenvalues of X ′ X/n.


(v) The mean predictive risk of the ridge estimator is given by:
MPR(θ̂nr (λ), θ0rl ) = (σ²/n) Σ_{j=1}^{r} λj² /(λj + λ/n)² + ∥X[I − Q(λ)]θ0rl ∥22 /n.

(vi) If Range(X ′ ) = Range(E[xx′ ]), we have


lim_{λ→0} MSE(θ̂nr (λ), θ0rl ) = (σ²/n) Σ_{j=1}^{r0} 1/λj ≤ σ²r0 /(λr0 n),

and

lim_{λ→0} MPR(θ̂nr (λ), θ0rl ) = r0 σ²/n.

Proof. (i) Using the link between ridge and ridgeless estimators in identity
(1.14) we have, with θ0rl as defined in Definition 19:

E[θ̂nr (λ)] =Q(λ)E[θ̂nrl ] = Q(λ) PRange(X ′ ) θ0rl .

The result then follows from Q(λ) PRange(X ′ ) = Q(λ).

(ii) The closed-form expression of θ̂ r immediately implies

Var[θ̂nr (λ)] = (X ′ X + λI)−1 X ′ Var[ε0 ]X(X ′ X + λI)−1 .

(iii) The expression follows trivially from the previous item. To show that
Var[θ̂nrl ] − Var[θ̂nr (λ)] is positive definite, consider the spectral decom-
position X = U SV ′ in Definition 18. Since Rank(X) = r, we have X ′ X/n = Σ_{j=1}^{r} λj vj vj′ , where λj = sj²/n for j = 1, . . . , r. It follows that Var[θ̂nrl ] = (σ²/n) Σ_{j=1}^{r} (1/λj ) vj vj′ . Instead,

Var[θ̂nr (λ)] = σ² V (S ′ S + λI)−1 S ′ S(S ′ S + λI)−1 V ′ = (σ²/n) Σ_{j=1}^{r} λj /(λj + λ/n)² vj vj′ .

The result then follows using that, for j = 1, . . . , r,

1/λj > λj /(λj + λ/n)².

(iv) Using the linearity of the Trace and the fact that V is orthogonal:
Trace(Var[θ̂nr (λ)]) = Trace( (σ²/n) Σ_{j=1}^{r} λj /(λj + λ/n)² vj vj′ ) = (σ²/n) Σ_{j=1}^{r} λj /(λj + λ/n)² .

Moreover, Bias(θ̂nr (λ), θ0rl ) = [Q(λ) − I]θ0rl . The result then follows
using Proposition 1.0.3.

(v) In Proposition 1.1.8 we obtained E[θ̂nrl ] = PRange(X ′ ) θ0rl . Proposition


1.0.1 thus implies lm(X, θ0rl ) = XE[θ̂nrl ], and so we can write

lm(X, θ̂nr (λ)) − lm(X, θ0rl ) = X(Q(λ)θ̂nrl − E[θ̂nrl ]).

37
Therefore:

MPR(θ̂nr (λ), θ0rl ) = E[∥lm(X, θ̂nr (λ)) − lm(X, θ0rl )∥22 /n]
= E[Trace{(Q(λ)θ̂nrl − E[θ̂nrl ])′ X ′ X/n(Q(λ)θ̂nrl − E[θ̂nrl ])}]
= E[Trace{X ′ X/n(Q(λ)θ̂nrl − E[θ̂nrl ])(Q(λ)θ̂nrl − E[θ̂nrl ])′ }]
n h io
= Trace X ′ X/nE (Q(λ)θ̂nrl − E[θ̂nrl ])(Q(λ)θ̂nrl − E[θ̂nrl ])′ .

Let Q := Q(λ), E rl := E[θ̂nrl ] and V rl := Var[θ̂nrl ]. Then, the expected


value inside the Trace reads:
h ih i′ 
rl rl rl rl rl rl
E θ̂n − E − (I − Q)θ̂n θ̂n − E − (I − Q)θ̂n

= V rl + (I − Q)V rl (I − Q)′ − (I − Q)V rl − (I − Q)(I − Q)′


= QV rl Q′ .

Using Var[θ̂nrl ] = σ 2 (X ′ X)+ and the SVD decomposition of X in Def-


inition 18, we conclude:

MPR(θ̂nr (λ), θ0rl )


= σ 2 Trace X ′ X/nQ(λ)(X ′ X)+ Q(λ)′
 

= σ 2 Trace X ′ X/n(X ′ X + λI)−1 X ′ X(X ′ X)+ X ′ X(X ′ X + λI)−1


 

= σ 2 /n Trace S ′ S/n(S ′ S/n + λ/nI)−1 S ′ S/n(S ′ S)+ S ′ S(S ′ S/n + λ/nI)−1


 
r
σ2 X λ2j
= .
n j=1 (λj + λ/n)2

(vi) If Range(X ′ ) = Range(E[xx′ ]), then Rank(X) = r0 and Ker(X) =


Ker(E[xx′ ]). Therefore, since

lim Q(λ) = lim [(X ′ X + λI)−1 X ′ ]X = PRange(E[xx′ ]) ,


λ→0 λ→0

and θ0rl ∈ Range(E[xx′ ]), we obtain


r0  r0
σ2 X σ2 X

λj 2 1
lim 2
+ [I − Q(λ)]θ0rl 2
= .
λ→∞ n (λj + λ/n) n j=1 λj
j=1

The next proposition shows that there are penalty parameter values for
which the MSE of ridge is lower than the MSE of ridgeless.

38
Proposition 1.1.10. Assume that the linear model (1.1) holds with E[xε0 ] =
0 and Var[ε0 ] = σ 2 I for σ > 0. Then, for a fixed design matrix X, there
exists λ∗ > 0 such that

MSE(θ̂nr (λ∗ ), θ0rl ) < MSE(θ̂nrl , θ0rl ).

Proof. See Farebrother [1976].

1.1.6 Finite sample properties of lasso


In this section, we study the finite sample properties of the lasso estimator
under a fixed design matrix X. Given the lack of a closed form expression for
the lasso estimator, we do not have access to closed form expressions for its
bias and variance. Therefore, instead of deriving its MSE and MPR, we find
concentration inequalities for its estimation risk ∥θ̂nl (λ) − θ0 ∥22 and predictive
risk ∥X(θ̂nl (λ) − θ0 )∥22 /n.
Before doing that, we first show some auxiliary properties satisfied by any
lasso solution. Given a set C, let |C| denote its cardinality, i.e., the number of
elements in C, and consider the index set S ⊂ {1, . . . , p} with complementary
index set S c = {1, . . . , p} \ S. We use the notation vS = [vi ]i∈S ∈ R|S| for
the subvector of v ∈ Rp with entries indexed by S. Further define, for some
α ≥ 1, the set

Cα (S) := {v ∈ Rp : ∥vS c ∥1 ≤ α ∥vS ∥1 }.

In words, Cα (S) is the set of vectors in Rp whose subvector in S c has size


smaller or equal to α times the size of the subvector in S, where the size
of vectors is measured using the l1 −norm. Finally, consider the following
definition.
Definition 20 (Support of a vector). The support of vector θ ∈ Rp is defined
as
Supp(θ) := {j ∈ {1, . . . , p} : θj ̸= 0}.
The next lemma shows that, for an appropriate choice of the penalty
parameter, the lasso estimator satisfies some basic inequalities and has an
estimation error contained in Cα (S) for some α ≥ 1 and some index set S.

Lemma 1.1.5 (Auxiliary properties of lasso). Suppose that the linear model
(1.1) holds. If λ ≥ 2 ∥X ′ ε0 /n∥∞ > 0, then any lasso solution θ̂nl (λ) satisfies:

(i) The predictive risk bound

PR(θ̂nl (λ), θ0 ) ≤ 12λ ∥θ0 ∥1 . (1.19)

39
(ii) An estimation error η̂ := θ̂nl (λ) − θ0 ∈ C3 (Supp(θ0 )) such that

∥X η̂∥22 /n ≤ 3 s0 λ∥η̂∥2 . (1.20)

Proof. Under the linear model (1.1), we have for any θ ∈ Rp :


∥y − Xθ∥22 = y ′ y + θ ′ XXθ − 2θ0′ X ′ Xθ − 2ε′0 Xθ.

Since θ̂nl (λ) is a lasso solution,


1 1
0≤ ∥y − X θ̂nl (λ)∥22 + λ∥θ̂nl (λ)∥1 ≤ ∥y − Xθ0 ∥22 + λ ∥θ0 ∥1 ,
2n 2n
which holds if and only if
1
∥X η̂∥22 ≤ ε′0 X η̂/n + λ(∥θ0 ∥1 − ∥θ̂nl (λ)∥1 ). (1.21)
2n
(i) By Hölder inequality,
ε′0 X η̂/n ≤ |ε′0 X η̂/n| ≤ ∥X ′ ε0 /n∥∞ ∥η̂∥1 .
Thus, using the choice λ ≥ 2 ∥X ′ ε0 /n∥∞ in (1.21) yields
1
0≤ ∥X η̂∥22 ≤ λ/2∥η̂∥1 + λ(∥θ0 ∥1 − ∥θ̂nl (λ)∥1 ). (1.22)
2n
Using the triangle inequality
∥η̂∥1 ≤ ∥θ̂nl (λ)∥1 + ∥θ0 ∥1 , (1.23)
we further obtain
0 ≤ λ/2(∥θ̂nl (λ)∥1 + ∥θ0 ∥1 ) + λ(∥θ0 ∥1 − ∥θ̂nl (λ)∥1 ),
which, using that λ > 0, implies ∥θ̂nl (λ)∥1 ≤ 3∥θ0 ∥1 . Substituting this
result into the triangle inequality (1.23) yields
∥η̂∥1 ≤ 4∥θ0 ∥1 . (1.24)
Moreover, again by the triangle inequality,
∥θ0 ∥1 = ∥(θ0 + η̂) − η̂∥1 ≤ ∥θ0 + η̂∥1 + ∥η̂∥1 ,
which implies:
∥θ0 + η̂∥1 ≥ ∥θ0 ∥1 − ∥η̂∥1 . (1.25)
Using (1.24) and (1.25) in the basic inequality (1.22), we obtain
∥X η̂∥22 /n ≤λ∥η̂∥1 + 2λ(∥θ0 ∥1 − ∥θ0 + η̂∥1 )
≤3λ∥η̂∥1 ≤ 12λ∥θ0 ∥1 .

40
(ii) Let S0 := Supp(θ0 ). Using that θ0S0c = 0, we have

∥θ0 ∥1 − ∥θ̂nl (λ)∥1 = ∥θ0S0 ∥1 − ∥θ0S0 + η̂S0 ∥1 − ∥η̂S0c ∥1 . (1.26)

Substituting (1.26) into the basic inequality (1.22) yields:

0 ≤ ∥X η̂∥22 /n ≤ λ∥η̂∥1 + 2λ(∥θ0S0 ∥1 − ∥θ0S0 + η̂S0 ∥1 − ∥η̂S0c ∥1 ). (1.27)

By the triangle inequality,

∥θ0S0 ∥1 = ∥θ0S0 + η̂S0 − η̂S0 ∥1 ≤ ∥θ0S0 + η̂S0 ∥1 + ∥η̂S0 ∥1 .

Therefore, using the decomposition ∥η̂∥1 = ∥η̂S0 ∥1 + ∥η̂S0c ∥1 , (1.27)


reads

0 ≤ ∥X η̂∥22 /n ≤λ∥η̂∥1 + 2λ(∥η̂S0 ∥1 − ∥η̂S0c ∥1 )


=λ(3∥η̂S0 ∥1 − ∥η̂S0c ∥1 ),

which implies that η̂ ∈ C3 (Supp(θ0 )). Finally,


√ using the relation be-
tween the l1 − and the l2 −norm (∥v∥1 ≤ s ∥v∥2 for every v ∈ Rs ), we
conclude that

∥X η̂∥22 /n ≤λ(3∥η̂S0 ∥1 − ∥η̂S0c ∥1 ) ≤ 3λ∥η̂S0 ∥1 ≤ 3 s0 λ∥η̂S0 ∥2

≤3 s0 λ∥η̂∥2 .

We derive the main properties of lasso under the following restricted eigen-
value condition on the design matrix, which leverages the result that for
λ ≥ 2 ∥X ′ ε0 /n∥∞ , the estimation error of lasso θ̂nl (λ) − θ0 ∈ C3 (Supp(θ0 )).
Assumption 1 (Restricted eigenvalue condition). The design matrix X ∈
Rn×p is such that for all η ∈ C3 (Supp(θ0 )) there exists κ > 0 for which:

∥Xη∥22 /n ≥ κ ∥η∥22 ,

where θ0 ∈ Rp is the coefficient of linear model (1.1).


In the next proposition we derive bounds on the squared l2 estimation
risk and predictive risk. We then provide intuition on why the restricted
eigenvalue condition is required.
Theorem 1.1.6. Suppose that the linear model (1.1) holds and that Assump-
tion 1 holds. Let s0 := | Supp(θ0 )| ≤ p. Then, any lasso solution θ̂nl (λ) with
λ ≥ 2 ∥X ′ ε0 /n∥∞ > 0 satisfies:

41
(i) The estimation risk bound
9
∥θ̂nl (λ) − θ0 ∥22 ≤ 2
s 0 λ2 . (1.28)
κ
(ii) The predictive risk bound
9
PR(θ̂nl (λ), θ0 ) ≤ s 0 λ2 . (1.29)
κ
Proof. In Lemma (1.1.5), we obtained Inequality (1.20), which reads

∥X η̂∥22 /n ≤ 3 s0 λ∥η̂∥2 ,
where η̂ := θ̂nl (λ) − θ0 ∈ C3 (Supp(θ0 )).
(i) Using Assumption 1 on the left hand side of Inequality (1.20) yields

κ∥η̂∥22 ≤ 3 s0 λ∥η̂∥2 .
If ∥η̂∥2 > 0, the result follows by dividing both sides of the inequality
by ∥η̂∥2 . If instead ∥η̂∥2 = 0, the result is trivially obtained.
(ii) Using Assumption 1 on the right hand side of Inequality (1.20) yields,
√ √
∥X η̂∥22 /n ≤ 3 s0 λ∥X η̂∥2 / nκ.
If ∥X η̂∥2 >√0, the result follows by dividing both sides of the inequality
by ∥X η̂∥2 / n. If instead ∥X η̂∥2 = 0, the result is trivially obtained.

The sparsity parameter s0 = | Supp(θ0 )| plays a major role in the bounds


of Theorem 1.1.6. We say that θ0 is hard sparse if it has some zero entries.
More formally:
Definition 21 (Hard sparsity). Coefficient θ0 ∈ Rp is hard sparse if s0 :=
| Supp(θ0 )| < p.
In high-dimensional regimes, hard sparsity is typically imposed as an
identifying condition for θ0 . Consider an asymptotic regime where
lim p/n = K > 0, and lim s0 /p = s∞ ∈ (0, 1].
n→∞ p→∞

Then, s0 → ∞ as n → ∞. In this setting, the lasso converges to θ0 only


if κ and λ compensate for the divergence of s0 , i.e., limn→∞ s0 λ/κ2 = 0.
Notice however that 2 ∥X ′ ε0 /n∥∞ , the lower bound for λ in Theorem 1.1.6,
is monotonically non decreasing as we add columns to X. Moreover, intuition
suggests that Assumption 1 with a large κ is an increasingly more restrictive
assumption as p → ∞.

42
Remark 6. It is possible to extend the results in Lemma 1.1.5 and Theorem
1.1.6 using a milder restriction than hard sparsity, called weak sparsity. This
restriction formalizes the notion that θ0 can be well approximated by means
of a hard sparse vector.
Definition 22 (Weak sparsity). Coefficient θ0 ∈ Rp is weak sparse if θ0 ∈
Bq (r) where, for q ∈ [0, 1] and radius r > 0,

Bq (r) := {θ ∈ Rp : ∥θ∥qq ≤ r}.

Setting q = 0 in Definition 22 recovers Definition 21 with s = r. For q ∈ (0, 1],


we restrict the way the ordered coefficients

max |θ0j | = θ0(1) ≥ θ0(2) ≥ . . . , ≥ θ0(p−1) ≥ θ0(p) = min |θ0j |


j=1,...,p j=1,...,p

decay. More precisely, if the ordered coefficients satisfy the bound |θ0j | ≤
Cj −α for some suitable C ≥ 0 and α > 0, then θ0 ∈ Bq (r) for a radius r that
depends on C and α.

Restricted eigenvalue condition


Inequality (1.20) of Lemma (1.1.5) establishes an upper bound for the pre-
diction risk of the lasso solution in terms of its estimation risk. Conversely,
for an appropriate choice of the penalty parameter, the restricted eigenvalue
condition in Assumption 1 provides an upper bound for the estimation risk
of the lasso solution based on its prediction risk. These two bounds are
combined to obtain the estimation and predictive risk bounds in Theorem
1.1.6.
To provide more intuition on why the restricted eigenvalue condition is
needed, consider the constrained version of the lasso estimator:
 
lc 1 2
θ̂n (R) := argmin ∥y − Xθ∥2 : ∥θ∥1 ≤ R ,
θ∈Rp 2n

where the radius R := ∥θ0 ∥1 . With this choice, the true parameter θ0 is
feasible for the problem. Additionally, we have Ln (θ̂nlc (R)) ≤ Ln (θ0 ) where

Ln : Rp → R; θ 7→ ∥y − Xθ∥22 /(2n)

is the least squares loss function. Under mild regularity conditions, it can be
shown that the loss difference Ln (θ0 )−Ln (θ̂nlc (R)) decreases as the sample size
n increases. Under what conditions does this imply that the estimation risk,
∥η̂∥22 with η̂ := θ̂nlc (R) − θ0 , also decreases? Since Ln is a quadratic function,

43
the estimation risk will decrease if the function has positive curvature in every
direction (i.e., if there are no flat regions). This occurs when the Hessian,
∇2 Ln (θ̂nlc (R)) = X ′ X/n, has eigenvalues that are uniformly lower-bounded
by a positive constant κ. This condition is equivalently expressed as
∥Xη∥22 /n ≥ κ∥η∥22 > 0
for all nonzero η ∈ Rp .
In the high-dimensional setting, where p > n, the Hessian has rank at
most n, meaning that the least squares loss is flat in at least p − n directions.
As a result, the uniform curvature condition must be relaxed. By Lemma
1.1.5, the estimation error of lasso lies in the subset C3 (Supp(θ0 )) ⊂ Rp for an
appropriate choice of the penalty parameter (equivalently, of the constrained
radius R). For this reason, we require the condition to hold only in the
directions η that lie in C3 (Supp(θ0 )), hoping that | Supp(θ0 )| ≤ Rank(X).
With this adjustment, even in high-dimensional settings, a small difference
in the loss function still leads to an upper bound on the difference between
the lasso estimate and the true parameter.
Verifying that a given design matrix X satisfies the restricted eigenvalue
condition is challenging. Developing methods to discover random design
matrices that satisfy this condition with high probability remains an active
area of research.

Slow rates and fast rates


Consider assuming that the error term in linear model (1.1) is sub-Gaussian
with mean zero and variance proxy σ 2 . It is then possible to find a choice of
λ that only depends on the unknown σ and that ensures that the estimation
and prediction risks are upper-bounded with high probability.
Theorem 1.1.7. Suppose that the linear model (1.1) holds and that ε0 is a
vector if independent random variables with ε0i ∼ sub-G(σ 2 ) where variance
proxy σ > 0. Further √ suppose that the columns of X are standardized so that
maxj=1,...,p ∥Xj ∥2 / n ≤ C for some constant C > 0. Then, for all δ > 0,
any lasso solution θ̂nl (λ) with regularization parameter
p 
λ = 2Cσ 2 ln(p)/n + δ (1.30)
2 /2
satisfies with probability 1 − 2e−nδ :
p
PR(θ̂nl (λ), θ0 ) ≤ 24C ∥θ0 ∥1 σ( 2 ln(p)/n + δ). (1.31)
Further suppose that Assumption 1 holds and let s0 := | Supp(θ0 )| ≤ p. Then,
2
with probability 1 − 2e−nδ /2 :

44
(i) The estimation risk bound

72C 2 σ 2 s0
∥θ̂nl (λ) − θ0 ∥22 ≤ (2 ln(p)/n + δ 2 ). (1.32)
κ2

(ii) The predictive risk bound

72C 2 σ 2 s0
PR(θ̂nl (λ), θ0 ) ≤ (2 ln(p)/n + δ 2 ). (1.33)
κ

Proof. From the union bound:


 

P [∥X ε0 /n∥∞ ≥ t] = P max |Xj′ ε0 /n| ≥t
j=1,...,p

= P ∪j=1,...,p {|Xj′ ε0 /n| ≥ t ]



p
X
P |Xj′ ε0 /n| ≥ t .
 

j=1

Since ε01 , . . . , εon are independent random variables with sub-G(σ 2 ) distribu-
tion, from Proposition 2.3.8 we have that for any t ∈ R:
p p
!
X  ′  X t2
P |Xj ε0 /n| ≥ t ≤ 2 exp −
j=1 j=1
2σ 2 ∥Xj /n∥22
t2 n
 
≤ 2p exp − 2 2 .
2σ C
p 
Substituting t = Cσ 2 ln(p)/n + δ we get

t2 n
   
p2
2p exp − 2 2 = 2 exp(ln(p)) exp −nδ /2 − ln(p) − δ 2n ln(p)
2σ C
 p 
= 2 exp −nδ 2 /2 exp −δ 2n ln(p)


≤ 2 exp −nδ 2 /2 ,


p
since −δ 2n ln(p) < 0. We conclude that, for all δ > 0:
2
p
P[2 ∥X ′ ε0 /n∥∞ ≤ 2Cσ( 2 ln(p)/n + δ)] ≥ 1 − 2e−nδ /2 .
p
Consequently, if we set λ = 2Cσ( 2 ln(p)/n + δ), we obtain from (1.19) of
2
Lemma 1.1.5 that (1.31) holds with probability at least 1 − 2e−nδ /2 . More-
over, under Assumption 1, we obtain from (1.28) and (1.29) of Theorem 1.1.6

45
2 /2
that (1.32) and (1.33) hold with probability at least 1 − 2e−nδ , by using
the inequality:7
p
2 ln(p)/n + δ 2 + 2 2 ln(p)/nδ ≤ 2(2 ln(p)/n + δ 2 ).

Asplong as n ≥ 2 ln(p), the ratio 2 ln(p)/n can be significantly smaller


than 2 ln(p)/n. For this reason, the bounds (1.31) and (1.33) are often
referred to as the slow rates and fast rates for the prediction risk of lasso,
respectively.

7
This inequality follows from 2ab ≤ a2 + b2 for any two real numbers a and b.

46
Bibliography

Arthur Albert. Regression and the moore-penrose pseudoinverse. 1972.

Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with
the k-support norm. Advances in Neural Information Processing Systems,
25, 2012.

Sheldon Axler. Linear algebra done right. Springer Nature, 2024.

Heinz H Bauschke, Patrick L Combettes, Heinz H Bauschke, and Patrick L


Combettes. Correction to: convex analysis and monotone operator theory
in Hilbert spaces. Springer, 2017.

Patrick Billingsley. Probability and measure. John Wiley & Sons, 2017.

Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data:
methods, theory and applications. Springer Science & Business Media,
2011.

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least
angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Richard William Farebrother. Further results on the mean square error of


ridge regression. Journal of the Royal Statistical Society. Series B (Method-
ological), pages 248–250, 1976.

Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani.


Pathwise coordinate optimization. The annals of applied statistics, 1(2):
302–332, 2007.

Carl F Gauss. Theoria motus corporum coelestium in sectionibus conicis


solem ambientium. sumtibus Frid. Perthes et I. H. Besser, 1809.

Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Fried-


man. The elements of statistical learning: data mining, inference, and
prediction, volume 2. Springer, 2009.

47
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learn-
ing with sparsity. Monographs on statistics and applied probability, 143
(143):8, 2015.

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation


for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

Adrien-Marie Legendre. Nouvelles méthodes pour la détermination des or-


bites des comètes. F. Didot, 1805.

Eliakim H Moore. On the reciprocal of the general algebraic matrix. Bulletin


of the american mathematical society, 26:294–295, 1920.

Marc Nerlove et al. Returns to scale in electricity supply. Institute for math-
ematical studies in the social sciences, 1961.

Roger Penrose. A generalized inverse for matrices. In Mathematical pro-


ceedings of the Cambridge philosophical society, volume 51, pages 406–413.
Cambridge University Press, 1955.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society Series B: Statistical Methodology, 58(1):
267–288, 1996.

Ryan J Tibshirani. The lasso problem and uniqueness. The Electronic Jour-
nal of Statistics, 7:1456–1490, 2013.

Paul Tseng. Convergence of a block coordinate descent method for nondif-


ferentiable minimization. Journal of optimization theory and applications,
109:475–494, 2001.

Roman Vershynin. High-dimensional probability: An introduction with ap-


plications in data science, volume 47. Cambridge university press, 2018.

Martin J Wainwright. High-dimensional statistics: A non-asymptotic view-


point, volume 48. Cambridge university press, 2019.

48
Chapter 2

Appendix

2.1 Linear algebra


This section introduces a selection of definitions and results from linear alge-
bra that are used in these lecture notes. A book-length exposition of linear
algebra can be found in Axler [2024], among others.

Vector space
We introduce useful definitions and results for real vector spaces.
Definition 23 (Vector space). A (real) vector space is a set V along with an
addition on V and a scalar multiplication on V with the following properties:
1. (commutativity) u + v = v + u for all u, v ∈ V ;
2. (associativity) (u + v) + w = u + (v + w) and (ab)v = a(bv) for all
u, v, w ∈ V and all a, b ∈ R;
3. (additive identity) there exists an element 0 ∈ V such that v + 0 = v
for all v ∈ V ;
4. (additive inverse) for every v ∈ V , there exists w ∈ V such that v +w =
0;
5. (multiplicative identity) 1v = v for all v ∈ V ;
6. (distributive properties) a(u + v) = au + av and (a + b)v = av + bv for
all a, b ∈ R and all u, v ∈ V .
Definition 24 (Subspace). A subset U of a vector space V is a subspace of V
if U is a vector space, (using the same addition and scalar multiplication as
on V ).

49
Proposition 2.1.1. A subset U of a vector space V is a subspace of V if
and only if it satisfies these conditions:
(i) (additive identity) 0 ∈ U ;

(ii) (closed under addition) u, v ∈ U implies u + v ∈ U ;

(iii) (closed under scalar multiplication) a ∈ R and u ∈ U implies au ∈ U .


Definition 25 (Linear combination). A linear combination of vectors v1 , . . . , vn
in vector space V with coefficients a1 , . . . , an ∈ R is:

a1 v1 + . . . , +an vn .

Definition 26 (Span). The span of vectors v1 , . . . , vn in vector space V is


defined as

span(v1 , . . . , vn ) := {a1 v1 + . . . + an vn : a1 , . . . , an ∈ R}.

Definition 27 (Linear independence). The vectors v1 , . . . , vn in vector space


V are linearly independent if

{a1 , . . . , an ∈ R : a1 v1 + . . . + an vn = 0} = {a1 = . . . = an = 0}.

Definition 28 (Linear dependence). The vectors v1 , . . . , vn in vector space V


are linearly dependent if they are not linearly independent.
Definition 29 (Basis). A basis of a vector space V is a set of vectors in V
that are linearly independent and span V .

Inner products and norms


Definition 30 (Inner product). An inner product on a vector space V is a
function that takes each ordered pair (v, u) of elements of V to a number
⟨v, u⟩ ∈ R and satisfies:
1. (positivity) ⟨v, v⟩ ≥ 0 for all v ∈ V ;

2. (definiteness) ⟨v, v⟩ = 0 if and only if v = 0;

3. (additivity in first slot) ⟨v + u, w⟩ = ⟨v, w⟩ + ⟨u, w⟩ for all v, u, w ∈ V ;

4. (homogeneity in first slot) ⟨av, u⟩ = a⟨v, u⟩ for all a ∈ R and all v, u ∈


V;

5. (conjugate symmetry) ⟨v, u⟩ = ⟨u, v⟩ for all v, u ∈ V .

50
Proposition 2.1.2 (Basic properties of an inner product). An inner product
⟨·, ·⟩ on vector space V satisfies:
1. ⟨0, v⟩ = ⟨v, 0⟩ for every v ∈ V .
2. ⟨v, u + w⟩ = ⟨v, u⟩ + ⟨v, w⟩ for every v, u, w ∈ V .
3. ⟨v, au⟩ = a⟨v, u⟩ for all a ∈ R and all v, u ∈ V .
Definition 31 (Orthogonal vectors). Two vectors v and u in vector space V
are orthogonal if ⟨v, u⟩ = 0.
Definition 32 (Orthogonal subspace). U and W are orthogonal subspaces of
vector space V if ⟨u, w⟩ = 0 for all u ∈ U and all w ∈ W .
Definition 33 (Orthonormal basis). The set of vectors {v1 , . . . , vn } in vector
space V is an orthonormal basis of V if it is a basis of V such that ⟨vi , vj ⟩ = 0
and ∥vi ∥ = 1 for all i, j = 1, . . . , n with i ̸= j.
Definition 34 (Norms). Given innerpproduct ⟨·, ·⟩ on vector space V , the
norm of v ∈ V is defined by ∥v∥ := ⟨v, v⟩.
Proposition 2.1.3 (Properties of norms). For v in vector space V :
1. ∥v∥ = 0 if and only if v = 0.
2. ∥av∥ = a ∥v∥ for all a ∈ R.
Definition 35 (Linear function). L : V → W from a vector space V to another
vector space W is a linear function if:
(i) T (v + u) = T (v) + T (u) for all v, u ∈ V ;
(ii) T (av) = aT (v) for all a ∈ R and v ∈ V .
Theorem 2.1.1 (Cauchy–Schwarz inequality). Suppose v and u are two
vectors in vector space V . Then,
|⟨v, u⟩| ≤ ∥v∥ ∥u∥ .
This inequality is an equality if and only if there is a ∈ R such that v = au.
Theorem 2.1.2 (Triangle inequality). Suppose v and u are two vectors in
vector space V . Then,
∥v + u∥ ≤ ∥v∥ + ∥u∥ .
This inequality is an equality if and only if there is a ≥ 0 such that v = au.
Theorem 2.1.3 (Parallelogram equality). Suppose v and u are two vectors
in vector space V . Then,
∥v + u∥2 + ∥v − u∥2 = 2(∥v∥2 + ∥u∥2 ).

51
The Euclidean space
Definition 36 ((Real) n−tuple). A (real) n−tuple is a ordered list of n real
numbers.
With a slight abuse of terminology, we sometimes we use the term vector
to mean a (real) n−tuple.
Definition 37 (Real Euclidean space). The real Euclidean space of dimension
n, denoted Rn , is the set of all n−tuples.
Elements of a real Euclidean space are written in bold. For example,
a ∈ Rn , which means a = (a1 , . . . , an ) with a1 , . . . , an ∈ R.
Definition 38 (Euclidean inner product).
Pn The Euclidean inner product of
n
v, u ∈ R is defined as ⟨v, u⟩e := i=1 vi ui .
Definition 39 (lp −norm). The lp −norm ∥·∥p on Rn is defined for all v ∈ Rn
1/p
as ∥v∥p := ( ni=1 |vi |p ) when p ∈ [1, +∞), and ∥v∥p := maxni=1 |vi | when
P
p = +∞.

Theorem 2.1.4 (Hölder inequality). Suppose v, u ∈ Rp . Then,

|⟨v, u⟩| ≤ ∥v∥p ∥u∥q .

Matrices
Definition 40 (Matrix). An n × p matrix is a collection of p n−tuples.
The collection of all n × p matrices is denoted Rn×p . For a matrix A ∈
Rn×p , we write A = [A1 , . . . , Ap ] where A1 , . . . , Ap ∈ Rn are p n−tuples.
Written more explicitly,
 
A1,1 . . . A1,p
 . .. 
 
A =  .. ...
. ,
 
An,1 . . . An,p

that is, the elements of A, the n−tuples, are organized in columns. We


denote:

• the i, j−th element of A by Ai,j ;

• the j−th column A by Aj ;

• the i−th row A by A(i) .

52
Notice that a matrix in Rn×p can be equivalently seen as a collection of n
p−tuples, where the p−tuples represent the rows of the matrix.
Definition 41 (Column and row vector). A n−column vector is a n−tuple
seen as a matrix in Rn×1 . A n−row vector is a n−tuple seen as a matrix in
R1×n .
Throughout these lecture notes, we denote n−tuples as column vectors,
and use the simple notation v ∈ Rn instead of v ∈ Rn×1 .
Definition 42 (Matrix addition). The sum of two matrices of the same size is
the matrix obtained by adding corresponding entries in the matrices. That is,
for A, B ∈ Rn×p , we define A+B = C where C ∈ Rn×p and Ci,j = Ai,j +Bi,j
for i = 1, . . . , n and j = 1, . . . , p.
Definition 43 (Matrix-scalar multiplication). The product of a scalar and a
matrix is the matrix obtained by multiplying each entry in the matrix by the
scalar That is, for A ∈ Rn×p and a ∈ R, we define aA = B where B ∈ Rn×p
and Bi,j = aAi,j for i = 1, . . . , n and j = 1, . . . , p.
Definition 44 (Matrix multiplication). Given AP∈ Rn×p and B ∈ Rp×m , the
product AB = C where C ∈ Rn×m and Ci,j = pr=1 Ai,r Br,j for i = 1, . . . , n
and j = 1, . . . , m.
Note that we define the product of two matrices only when the number of
columns of the first matrix equals the number of rows of the second matrix.
Definition 45 (Transpose of a matrix). The transpose of a matrix A ∈ Rn×p
is the matrix B ∈ Rp×n with j, i−entry given by Bj,i = Ai,j for i = 1, . . . , n
and j = 1, . . . , p. We denote it by A′ .
It follows that the Euclidean inner product between v, u ∈ Rn is

⟨v, u⟩e = v ′ u.

Definition 46 (Range of a matrix). The range of a matrix A ∈ Rn×p is defined


as
Range(A) := {u ∈ Rn : u = Av for some v ∈ Rp }.
The range of a matrix is also called the column space, i.e., the space
spanned by the matrix’s columns, since:
Proposition 2.1.4. Let A = [A1 , . . . , Ap ] ∈ Rn×p . Then, Range(A) =
span(A1 , . . . , Ap ).
Definition 47 (Kernel of a matrix). The kernel, or null space, of a matrix
A ∈ Rn×p is defined as

Ker(A) := {v ∈ Rp : Av = 0}.

53
Proposition 2.1.5. Let A ∈ Rn×p . Then, Range(A) and Ker(A′ ) are or-
thogonal subspaces of Rn such that Rn = Range(A) + Ker(A′ ).
Definition 48 (Rank of a matrix). The rank of a matrix A ∈ Rn×p , denoted
Rank(A), is the maximum number of linearly independent columns of A.
Proposition 2.1.6. Let A ∈ Rn×p . Then, Rank(A) ≤ min{n, p}.
Definition 49 (Eigenvalue). λ ∈ R is an eigenvalue of A ∈ Rn×p if there
exists v ∈ Rp such that v ̸= 0 and
Av = λv.
Definition 50 (Eigenvector). Given matrix A ∈ Rn×p with eigenvalue λ ∈ R,
v ∈ Rp is an eigenvector of A ∈ Rn×p corresponding to λ if v ̸= 0 and
Av = λv.
Proposition 2.1.7. Every matrix A ∈ Rn×p has an eigenvalue.
Proposition 2.1.8. Let A ∈ Rn×p . Then, A has at most Rank(A) distinct
eigenvalues.
Proposition 2.1.9. Suppose λ1 , . . . , λr ∈ R are distinct eigenvalues of A ∈
Rn×p and v1 , . . . , vr ∈ Rp are corresponding eigenvectors. Then, v1 , . . . , vr
are linearly independent.
Definition 51 (Singular values). The singular values of A ∈ Rn×p are the
nonnegative square roots of the eigenvalues of A′ A.
Definition 52 (Symmetric matrix). A square matrix A ∈ Rn×n is symmetric
if A′ = A.
Definition 53 (Positive definite matrix). A square symmetric matrix A ∈
Rn×n is positive definite if v ′ Av > 0 for all v ∈ Rn such that v ̸= 0.
Definition 54 (Positive semi-definite matrix). A square symmetric matrix
A ∈ Rn×n is positive semi-definite if v ′ Av ≥ 0 for all v ∈ Rn .
Proposition 2.1.10. A square symmetric matrix A ∈ Rn×n is positive def-
inite (positive semi-definite) if and only if all of its eigenvalues are positive
(nonnegative).
Definition 55 (Identity matrix). The identity matrix on Rn is defined as
 
1 0
 
I :=  . . .  ∈ Rn×n .
 
 
0 1

54
Definition 56 (Diagonal of a matrix). The diagonal of a square matrix A ∈
Rn×n indicates the elements ”on the diagonal”: A1,1 , . . . , An,n .
Definition 57 (Diagonal matrix). A square matrix A ∈ Rn×n is a diagonal
matrix if all its elements outside of the diagonal are zero. We can write
A = diag(A1,1 , . . . , An,n ).
Definition 58 (Invertible matrix, matrix inverse). A square matrix A ∈ Rn×n
is invertible if there is a matrix B ∈ Rn×n such that AB = BA = I. We
call B the inverse of A and denote it by A−1 .
Proposition 2.1.11. A square matrix A ∈ Rn×n is invertible if and only if
Rank(A) = n, or equivalently, if and only if Ker(A) = ∅.
Proposition 2.1.12. A square matrix A ∈ Rn×n is invertible if and only if
it is positive definite.
Definition 59 (Orthogonal matrix). A square matrix P ∈ Rp×p is orthogonal,
or orthonormal, if P ′ P = P P ′ = I.
Definition 60 (Projection matrix). A square matrix P ∈ Rp×p is a projection
matrix if P = P 2 .
Definition 61 (Orthogonal projection matrix). A square matrix P ∈ Rp×p is
an orthogonal projection matrix if it is a projection matrix and P = P ′ .
Projections and orthogonal projections have the following properties.
Proposition 2.1.13. For any projection matrix P ∈ Rp×p and vector b ∈
Rp , we have
b = P b + (I − P )b.
If P is an orthogonal projection matrix, then
(P b)′ (I − P )b = 0.
Definition 62 (Trace). The trace of a square matrix A ∈ Rn×n , denoted
Trace(A), is the sum of its diagonal elements:
Trace(A) = A11 + . . . , An,n .
Proposition 2.1.14. The Trace is a linear function.
Proposition 2.1.15 (Properties of the trace). 1. Trace(A) = λ1 + . . . +
n×n
λn for all A ∈ R with λ1 , . . . , λn denoting the (not necessarily dis-
tinct) eigenvalues of A.
2. Trace(A) = Trace(A′ ) for all A ∈ Rn×n .
3. Trace(AB) = Trace(BA) for all for all A, B ∈ Rn×n .
4. Trace(A′ B) = Trace(AB ′ ) = Trace(B ′ A) = Trace(BA′ ) for all A, B ∈
Rn×p .

55
2.1.1 Moore-Penrose inverse
The Moore-Penrose inverse, or matrix pseudoinverse, is a generalization of
the inverse of a matrix that was independently introduced by Moore [1920]
and Penrose [1955].
Definition 63 (Moore-Penrose inverse). The matrix A+ ∈ Rp×n is a Moore-
Penrose inverse of A ∈ Rn×p if

(i) AA+ A = A;

(ii) A+ AA+ = A+ ;

(iii) (AA+ )′ = AA+ ;

(iv) (A+ A)′ = A+ A.

Properties and examples of the Moore-Penrose inverse


We now list the main properties of the Moore-Penrose inverse.

Proposition 2.1.16. For any matrix A ∈ Rn×p , the Moore-Penrose inverse


A+ exists and is unique.

Proposition 2.1.17. Let A ∈ Rn×p have Rank(A) = p. Then, A+ =


(A′ A)−1 A′ .

Proposition 2.1.18. Let the square matrix A ∈ Rp×p have Rank(A) = p.


Then, A+ = A−1 .

Proposition 2.1.19. Let A ∈ Rn×p . Then,

A+ = lim (A′ A + λI)−1 A′ = lim A′ (AA′ + λI)−1 .


λ→0 λ→0

Proof. See Albert [1972].

Proposition 2.1.20. Let A ∈ Rn×p . Then:

1. A = (A+ )+ .

2. A+ = (A′ A)+ A′ = A′ (AA′ )+ .

3. (A′ )+ = (A+ )′ .

4. (A′ A)+ = A+ (A′ )+ .

5. (AA′ )+ = (A′ )+ A+ .

56
6. Range(A+ ) = Range(A′ ) = Range(A+ A) = Range(A′ A).

7. Ker(A+ ) = Ker(AA+ ) = Ker((AA′ )+ ) = Ker(AA′ ) = Ker(A′ ).

Proposition 2.1.21. For any matrix A ∈ Rn×p :

1. AA+ is an orthogonal projection onto Range(A).

2. I − AA+ is an orthogonal projection onto Ker(A′ ).

3. A+ A is an orthogonal projection onto Range(A′ ).

4. I − A+ A is an orthogonal projection onto Ker(A).

We also collect some examples of the Moore-Penrose inverse.


(
a−1 a ̸= 0
Example 2. If a ∈ R, then a+ = .
0 a=0
Example 3. If A = diag(A1 , . . . , Ap−k , 0, . . . , 0) ∈ Rp×p , then

A+ = diag(1/A1 , . . . , 1/Ap−k , 0, . . . , 0).


 
1
Example 4. If A =  , then A+ = [1/5.2/5].
2
   
1 0 1 0
Example 5. If A =  , then A+ =  .
0 0 0 0
   
1 1 1/4 1/4
Example 6. If A =  , then A+ =  .
1 1 1/4 1/4

Systems of linear equations and least squares


The Moore-Penrose inverse plays a central role in the study of solutions to
systems of linear equations and linear least squares problems.

Theorem 2.1.5 (Solutions of systems of linear equations). For A ∈ Rn×p


and b ∈ Rn , let L := {θ ∈ Rp : Aθ = b}. The following statements hold:

(i) If b ∈
/ Range(A), then L is empty.

(ii) If b ∈ Range(A), then L = A+ b + Ker(A).

57
Corollary 2.1.6. Given a square matrix A ∈ Rp×p and b ∈ Rp , let L :=
{θ ∈ Rp : Aθ = b}. Then, A+ b is the unique element of L if and only if
Rank(A) = p. In this case, A+ = A−1 .
Corollary 2.1.7. For X ∈ Rn×p and y ∈ Rn :
argmin ∥y − Xθ∥22 = X + y + Ker(X).
θ∈Rp

2.1.2 Eigenvalue and Singular value decomposition


This section introduces the eigenvalue and the singular value decomposi-
tions, which are matrix factorizations with many applications to statistics
and machine learning.
Definition 64 (Singular value decomposition). The Singular Value Decom-
position (SVD) of a matrix A ∈ Rn×p with rank r := Rank(A) is given
by
Xr
A = U SV ′ = si ui vi′ ,
i=1
n×n p×p
where U ∈ R and V ∈ R are orthogonal matrix, and
 
diag(s1 , . . . , sr ) 0
S=  ∈ Rn×p ,
0 0

where s1 , . . . , sr are the positive singular values of A.


Proposition 2.1.22 (Existence). Any matrix A ∈ Rn×p admits a singular
value decomposition.
The next proposition demonstrates the relation of the SVD to the four
fundamental subspaces of a matrix.
Proposition 2.1.23. Consider Definition 64. Then,
(i) {u1 , . . . , ur } is an orthonormal basis of Range(A).
(ii) {ur+1 , . . . , un } is an orthonormal basis of Ker(A′ ).
(iii) {v1 , . . . , vr } is an orthonormal basis of Range(A′ ).
(iv) {vr+1 , . . . , vp } is an orthonormal basis of Ker(A).
Pr ′ ′
Pr ′
We thus have Range(A) = j=1 uj uP
j and Range(A ) = j=1 vj vj .
′ p ′
Moreover, if r < P p, we have Ker(A ) = j=r+1 uj uj , and if r < n, we
have Ker(A) = nj=r+1 vj vj′ .

58
Proposition 2.1.24. The Moore-Penrose inverse of a matrix A ∈ Rn×p
admitting SVD decomposition A = U SV ′ is given by A+ = V S + U ′ .

Definition 65 (Eigenvalue decomposition). The eigenvalue decomposition of a


square matrix A ∈ Rn×n with n linearly independent eigenvectors Q1 , . . . , Qn
corresponding to eigenvalues λ1 , . . . , λn is given by

A = QΛQ−1 ,

where Q = [Q1 , . . . , Qn ] and Λ = diag(λ1 , . . . , λn ).

Proposition 2.1.25 (Relation between the singular value and the eigenvalue
decompositions). Given a matrix A ∈ Rn×p with SVD A = U SV ′ :

1. A′ A = V S ′ SV ′ ;

2. AA′ = U SS ′ U ′ .

2.2 Convex analysis


This section introduces a selection of definitions and results from convex
analysis that are used in these lecture notes. A book-length exposition of
convex analysis can be found in Bauschke et al. [2017], among others.

Basic definitions
Definition 66 (Closed set). A set C ⊂ Rp is closed if it contains all of its
limit points.
Definition 67 (Bounded set). A set C ⊂ Rp is bounded if there exists r > 0
such that for all θ, β ∈ Rp , we have ∥θ − β∥ < r.
Definition 68 (Convex set). A set C ⊂ Rp is convex if for all 0 < α < 1 and
all θ, β ∈ C:
αθ + (1 − α)β ∈ C.
In particular, Rp and ∅ are convex.
Definition 69 (Epigraph of a function). The epigraph of function f : Rp →
(−∞, +∞] is
epi(f ) := {(θ, ξ) ∈ Rp × R : f (θ) ≤ ξ}.
Definition 70 (Domain of a function). The domain of function f : Rp →
(−∞, +∞] is
dom(f ) := {θ ∈ Rp : f (θ) < +∞}.

59
Definition 71 (Lower level set of a function). The lower level set of function
f : Rp → (−∞, +∞] at height ξ ∈ R is

lev≤ξ (f ) := {θ ∈ Rp : f (θ) ≤ ξ}.

Definition 72 (Proper function). function f : Rp → (−∞, +∞] is proper if


dom(f ) ̸= ∅.
Definition 73 (Convex function). Let f : Rp → (−∞, +∞] be a proper
function. Then f is convex if its epigraph epi(f ) is convex. Equivalently, f
is convex if for all 0 < α < 1 and all θ, β ∈ Rp such that θ ̸= β:

f (αθ + (1 − α)β) ≤ αf (θ) + (1 − α)f (β).

Definition 74 (Strictly convex function). Let f : Rp → (−∞, +∞] be a


proper function. Then f is strictly convex if for all 0 < α < 1 and all
θ, β ∈ Rp such that θ ̸= β:

f (αθ + (1 − α)β) < αf (θ) + (1 − α)f (β).

Definition 75 (Limit inferior). The limit inferior of f : Rp → (−∞, +∞] at


a point θ ∗ ∈ Rp is

lim inf

f (θ) = lim (inf{f (θ) : θ ̸= θ ∗ , ∥θ − θ ∗ ∥ ≤ ε}) .
θ→θ ε→0

Definition 76 (Lower semicontinuous function). Function f : Rp → (−∞, +∞]


is lower semicontinuous at θ ∗ ∈ Rp if

lim inf

f (θ) ≥ f (θ ∗ ).
θ→θ

Definition 77 (Coercive function). Function f : Rp → (−∞, +∞] is coercive


if
lim f (θ) = +∞.
∥θ∥→+∞

Definition 78 (Subdifferential). Let f : Rp → (−∞, +∞] be a proper func-


tion. The subdifferential of f is the set-valued operator:1
p
∂f : Rp → 2R ; θ 7→ {β ∈ Rp : ⟨v − θ, β⟩ + f (θ) ≤ f (v) ∀v ∈ Rp } .

Let θ ∈ Rp . Then f is subdifferentiable at θ if ∂f (θ) ̸= ∅; the elements of


∂f (θ) are the subgradients of f at θ.
1
Given a set C, the set of all subsets of C, including the empty set and C itself, is
denoted 2C . This set is called the power set of C.

60
Graphically, a vector β ∈ Rp is a subgradient of a proper function f :
p
R → (−∞, +∞] at θ ∈ dom(f ) if

fβ,θ : v 7→ ⟨v − θ, β⟩ + f (θ),

which coincides with f at θ, lies below f .


Example 7. The subdifferential of the absolute value function | · | at θ ∈ R is
given by 
{1},
 θ>0
∂|θ| = [−1, 1], θ = 0 .

{−1}, θ < 0

See Bauschke et al. [2017, Example 16.15].

Minimizers of convex optimization problems


Definition 79 (Global minimizer). θ ∗ is a (global) minimizer of a proper
function f : Rp → (−∞, +∞] over C ⊂ Rp if f (θ ∗ ) = inf θ∈C f (θ). The set
of minimizers of f over C is denoted by

argmin f (θ) = argmin{f (θ) : θ ∈ C}.


θ∈C θ∈Rp

Definition 80 (Local minimizer). θ ∗ is a local minimizer of a proper function


f : Rp → (−∞, +∞] if there exists c > 0 such that

θ ∗ ∈ argmin f (θ) : ∥θ∥ ≤ c}.


θ∈Rp

Proposition 2.2.1 (Convex problems: local minimizers are global mini-


mizers). Let f : Rp → (−∞, +∞] be proper and convex. Then every local
minimizer of f is a minimizer.

Proposition 2.2.2 (Convex problems: argmin is convex). Let f : Rp →


(−∞, +∞] be proper and convex and C ⊂ Rp . Then argminθ∈C f (θ) is
convex.

Proposition 2.2.3 (Existence of minimizers). Let f : Rp → (−∞, +∞] be


proper, convex and lower semicontinuous and C be a closed convex subset of
Rp such that C ∩ dom(f ) ̸= ∅. Suppose that one of the following holds:

(i) f is coercive.

(ii) C is bounded.

61
Then f has a minimizer over C.

Proof. Since C ∩ dom(f ) ̸= ∅, there exists θ ∈ dom(f ) such that D =


C ∩ lev≤f (θ) (f ) is not empty, closed and convex. Moreover, D is bounded
since C or lev≤f (θ) (f ) is. The result therefore follows from Bauschke et al.
[2017, Thm. 11.10].

Proposition 2.2.4 (Uniqueness of minimizers). Let f : Rp → (−∞, +∞] be


proper and strictly convex. Then f has at most one minimizer.

Proof. Set µ := inf θ∈Rp f (θ) and suppose that there exist two distinct points
θ1 , θ2 ∈ dom(f ) such that f (θ1 ) = f (θ2 ) = µ. Since θ1 , θ2 ∈ lev≤µ (f ), which
is convex, so does β = (θ1 + θ2 )/2. Therefore f (β) = µ. It follows from the
strict convexity of f that

µ = f (β) < max{f (θ1 ), f (θ2 )} = µ,

which is impossible.
Global minimizers of proper functions can be characterized by a simple
rule which extends a seventeenth century result due to Pierre Fermat.

Theorem 2.2.1 (Fermat’s rule). Let f : Rp → (−∞, +∞] be proper. Then

argmin f (θ) = {θ ∗ ∈ Rp : 0 ∈ ∂f (θ ∗ )}.


θ∈Rp

Proof. Let θ ∗ ∈ Rp . Then θ ∗ ∈ argminθ∈Rp f (θ) if and only if, for every
β ∈ Rp ,
⟨β − θ ∗ , 0⟩ + f (θ ∗ ) ≤ f (β).
By definition of subgradient, this last requirement reads 0 ∈ ∂f (θ ∗ ).

Theorem 2.2.2 (Hilbert projection theorem). For every vector θ ∈ Rp and


every nonempty closed convex C ⊂ Rp , there exists a unique vector β ∈ Rp
for which
∥θ − β∥22 = inf ∥θ − η∥22 .
η∈C

If C is a vector subspace of Rp , then the minimizer β is the unique element


in C such that θ − β is orthogonal to C.

62
2.3 Probability theory
This section introduces a selection of definitions and results from probability
theory that are used in these lecture notes. A book-length exposition of
probability theory can be found in Billingsley [2017] and Vershynin [2018],
among others.
All random variables are (real valued and) defined on the complete prob-
ability space (Ω, F, P).
Definition 81 (Cumulative Distribution Function). The Cumulative Distri-
bution Function (CDF) of random variable X is the function

FX : R → [0, 1]; x 7→ P[X ≤ x].

Definition 82 (Expected value). The expected value of a random variable X


is Z
E[X] := XdP.

Definition 83 (Moment generating function). the Moment Generating Func-


tion (MGF) of a random variable X is the function

MX : R → [0, +∞]; t 7→ E[exp(tX)].

Definition 84 (Moment of order p). For p ∈ R, the moment of order p of a


random variable X is E[|X|p ].
Definition 85 (Lp −norm). The Lp −norm of a random variable X is, for
p > 0:
∥X∥Lp := E[|X|p ]1/p ,
and for p = ∞:

∥X∥Lp := ess sup |X| := sup{b ∈ R : P({ω : X(ω) < b}) = 0}.

Definition 86 (Lp −space). The space Lp = Lp (Ω, F, P) consists of all random


variables X with finite Lp norm:

Lp := {X : ∥X∥Lp < +∞}.

Definition 87 (Conjugate exponents). p, q ∈ [1, ∞] are conjugate exponents


if 1/p + 1/q = 1.
The inner product between Lp and Lq where p, q ∈ [1, ∞] are conjugate
exponents is for all X ∈ Lp and Y ∈ Lq :

⟨X, Y ⟩Lp −Lq := E[XY ].

63
The inner product in L2 is for all X, Y ′ inL2 :

⟨X, Y ⟩L2 := E[XY ].

The Variance of X ∈ L2 is

Var[X] := E[(X − E[X])2 ] = ∥X − E[X]∥2L2 ,

and the standard deviation is


p
σ(X) := Var[X] = ∥X − E[X]∥Lp .

The covariance between X, Y ∈ L2 is

Cov[X, Y ] := ⟨X − E[X], Y − E[Y ]⟩L2 = E[(X − E[X])(Y − E[Y ])].

Classical inequalities
Theorem 2.3.1 (Jensen’s inequality). For any random variable X and a
convex function f : R → R, we have

f (E[X]) ≤ E[f (x)].

The following proposition is a consequence of Jensen’s inequality.


Proposition 2.3.1. For any random variable X and any p, q ∈ [0, ∞] with
p ≤ q:
∥X∥Lp ≤ ∥Y ∥Lq .
Therefore, Lp ⊂ Lq for any p, q ∈ [0, ∞] with p ≤ q.
Theorem 2.3.2 (Minkowski’s inequality). For any p ∈ [1, ∞] and any ran-
dom variables X, Y ∈ Lp :

∥X + Y ∥Lp ≤ ∥X∥Lp + ∥Y ∥Lp .

Theorem 2.3.3 (Cauchy-Schwarz inequality). For any random variables


X, Y ∈ L2 :
|⟨X, Y ⟩| = |E[XY ]| ≤ ∥X∥L2 ∥Y ∥L2 .
Theorem 2.3.4 (Hölder’s inequality). For any random variables X ∈ Lp
and Y ∈ Lq with conjugate exponents p, q ∈ (1, ∞):

|E[XY ]| ≤ ∥X∥Lp ∥Y ∥Lq .

This inequality also holds for p = 1 and q = ∞.

64
The tails and the moments of a random variable are connected.
Proposition 2.3.2 (Integral identity). For any nonnegative random variable
X: Z ∞
E[X] = P[X > x]dx.
0
The two sides of this identity are either both finite or both infinite.
Theorem 2.3.5 (Markov’s inequality). For any nonnegative random variable
X and x > 0:
P[X ≥ x] ≤ E[X]/x.
A consequence of Markov’s inequality is Chebyshev’s inequality, which
bounds the concentration of a random variable about its mean.
Theorem 2.3.6 (Chebyshev’s inequality). Let X be a random variable with
finite mean µ and finite variance σ 2 . Then, for any x > 0:
P[|X − µ| ≥ x] ≤ σ 2 /x2 .
Proposition 2.3.3 (Generalization of Markov’s inequality). For any random
variable X with mean µ ∈ R and finite moment of order p ≥ 1, and for any
x > 0:
P[|X − µ| ≥ x] ≤ E[|X − µ|p ]/xp .

Concentration of sums of independent random variables


Concentration inequalities quantify how a random variable deviates around
its mean.
Definition 88 (Symmetric Bernoulli distribution). A random variable X has
a symmetric Bernoulli distribution if
P[X = −1] = P[X = +1] = 1/2.
Theorem 2.3.7 (Hoeffding’s inequality). Let X1 , . . . , Xn be an independent
symmetric Bernoulli random variables, and a ∈ Rn . Then, for any x ≥ 0:
" n # !
X x2
P ai Xi ≥ x ≤ exp − 2 .
i=1
2 ∥a∥ 2

Theorem 2.3.8 (Two-sided Hoeffding’s inequality). Let X1 , . . . , Xn be an


independent symmetric Bernoulli random variables, and a ∈ Rn . Then, for
any x > 0: " n # !
X x2
P ai Xi ≥ x ≤ 2 exp − .
i=1
2 ∥a∥22

65
Theorem 2.3.9 (Hoeffding’s inequality for bounded random variables). Let
X1 , . . . , Xn be an independent random variables. Assume that Xi ∈ [li , ui ]
with li , ui ∈ R and li ≤ ui . Then, for any x > 0:
" n #
2x2
X  
P (Xi − E[Xi ]) ≥ x ≤ exp − Pn 2
.
i=1 i=1 (ui − li )

Theorem 2.3.10 (Chernoff’s inequality). Chernoff ’s inequality Let Xi be


independent Bernoulli random variables with parameter pi ∈ [0, 1]. Let Sn :=
P n
i=1 Xi and its mean µ := E[Sn ]. Then, for any x > 0:
 eµ x
P[Sn ≥ x] ≤ exp(−µ) .
x
Proposition 2.3.4 (Tails of the standard normal distribution). Let Z ∼
N (0, 1). Then, for all z > 0:
 
1 1 1 2 1 1 2
− 3 √ e−z /2 ≤ P[Z ≥ z] ≤ √ e−z /2 .
z z 2π z 2π
In particular, for z ≥ 1:
1 2
P[Z ≥ z] ≤ √ e−z /2 .

Proposition 2.3.5 (Tails of the normal distribution). Let X ∼ N (µ, σ 2 )
with µ ∈ R and σ > 0. Then, for all x ≥ 0:
 2
−x
P[X − µ ≥ x] ≤ exp .
2σ 2
Proposition 2.3.6. Let Z ∼ N (0, 1). Then, for all z ≥ 0:

P[|Z| ≥ z] ≤ 2 exp(−z 2 /2).

Proposition 2.3.7 (Sub-Gaussian properties). Let X be a random variable.


Then, there are constants C1 , . . . , C5 > 0 for which the following properties
are equivalent.
(i) The tails of X satisfy for all x ≥ 0:

P[|X| ≥ x] ≤ 2 exp(−x2 /C12 ).

(ii) The moments of X satisfy for all p ≥ 1:



∥X∥Lp = E[|X|p ]1/p ≤ C2 p.

66
(iii) The MGF of X 2 satisfies for all t ∈ R such that |t| ≤ 1/C3 :

E[exp(t2 X 2 )] ≤ exp(C32 t2 ).

(iv) The MGF of X 2 is bounded at some point, namely

E[exp(X 2 /C42 )] ≤ 2.

If further E[X] = 0, these properties are equivalent to:


(v) The MGF of X satisfies for all t ∈ R:

E[exp(tX)] ≤ exp(C52 t2 ).

Definition 89 (Sub-Gaussian random variables). A random variable X that


satisfies the equivalent conditions of Proposition 2.3.7 is a sub-Gaussian ran-
dom variable, denoted X ∼ sub-G.
Gaussian, symmetric Bernoulli, uniform, and bounded random variables
are examples of sub-Gaussian random variables. The tails of the distribu-
tion of a sub-Gaussian random variable decay at least as fast as the tails
of a Gaussian distribution. The Poisson, exponential, Pareto, and Cauchy
distribution are examples of distributions that are not sub-Gaussian.
Definition 90 (Variance proxy). For a random variable X ∼ sub-G, if there
is some s > 0 such that for all t ∈ R:

E[e(X−E[X])t ] ≤ exp(s2 t2 /2),

then s2 is called variance proxy.


Proposition 2.3.8 (Weighted sum of independent sub-Gaussian random
variables). Let X1 , . . . , Xn be independent sub-Gaussian random variables,
all with variance proxy σ 2 where σ > 0. Then, for any a ∈ Rn :
" n # !
X t2
P ai Xi ≥ t ≤ exp − ,
2σ 2 ∥a∥2
i=1 2

and " n # !
X t2
P ai Xi ≤ −t ≤ exp − .
i=1
2σ 2 ∥a∥22
Definition 91 (Sub-Gaussian norm). The sub-Gaussian norm ∥X∥ψ2 of ran-
dom variable X is defined as

∥X∥ψ2 := inf{t > 0 : E[exp(X 2 /t2 )] ≤ 2}.

67
Proposition 2.3.9. If X is a sub-Gaussian random variable, then X − E[X]
is sub-Gaussian and for a constant C > 0:

∥X − E[X]∥ψ2 ≤ C ∥X∥ψ2 .

Proposition 2.3.10 (Sums of independent sub-Gaussian). Let X1 , . .P


. , Xn be
independent sub-Gaussian random variables with mean zero. Then ni=1 Xi
is also a sub-Gaussian random variable, and, for a constant C > 0:
n 2 n
X X
Xi ≤C ∥Xi ∥2ψ2 .
i=1 ψ2 i=1

We can now extend the Hoeffding’s inequality to sub-Gaussian distribu-


tions.
Proposition 2.3.11 (General Hoeffding’s inequality). Let X1 , . . . , Xn be in-
dependent sub-Gaussian random variables with mean zero and C > 0 a con-
stant. Then, for every t ≥ 0:
" n # !
X Ct2
P Xi ≥ t ≤ 2 exp − Pn 2 .
i=1 i=1 ∥Xi ∥ψ2

Proposition 2.3.12. Let X1 , . . . , Xn be independent sub-Gaussian random


variables with mean zero, a ∈ Rn , K = maxni=1 ∥Xi ∥ψ2 and C > 0 a constant.
Then, for every t ≥ 0:
" n # !
X Ct2
P ai Xi ≥ t ≤ 2 exp − .
K 2 ∥a∥2
i=1 2

Proposition 2.3.13 (Khintchine’s inequality). Let X1 , . . . , Xn be indepen-


dent sub-Gaussian random variables, all with mean zero and unit variance
proxy, a ∈ Rn , K = maxni=1 ∥Xi ∥ψ2 and C > 0 a constant. Then, for every
p ∈ [2, ∞):
n
!1/2 n n
!1/2
X X √ X
a2i ≤ ai X i ≤ CK p a2i .
i=1 i=1 Lp i=1

The sub-Gaussian distribution does not embed distributions whose tails


are heavier than Gaussian.
Proposition 2.3.14 (Sub-exponential properties). Let X be a random vari-
able. Then, there are constants K1 , . . . , K5 > 0 for which the following prop-
erties are equivalent.

68
(i) The tails of X satisfy for all x ≥ 0:

P[|X| ≥ x] ≤ 2 exp(−x/K1 ).

(ii) The moments of X satisfy for all p ≥ 1:

∥X∥Lp = E[|X|p ]1/p ≤ K2 p.

(iii) The MGF of |X| satisfies for all t ∈ R such that 0 ≤ t ≤ 1/K3 :

E[exp(t|X|)] ≤ exp(L3 t).

(iv) The MGF of |X| is bounded at some point, namely

E[exp(|X|/K4 )] ≤ 2.

If further E[X] = 0, these properties are equivalent to:

(v) The MGF of X satisfies for all t ∈ R such that |t| ≤ 1/K5 :

E[exp(tX)] ≤ exp(K52 t2 ).

Definition 92 (Sub-exponential random variables). A random variable X that


satisfies the equivalent conditions of Proposition 2.3.14 is a sub-exponential
random variable.
Sub-Gaussian, Poisson, exponential, Pareto, Levy, Weibull, log-normal,
Cauchy, t-distributed random variables are examples of sub-exponential ran-
dom variables.
Definition 93 (Sub-exponential norm). The sub-exponential norm ∥X∥ψ1 of
random variable X is defined as

∥X∥ψ1 := inf{t > 0 : E[exp(|X|/t)] ≤ 2}.

Proposition 2.3.15. If X is a sub-exponential random variable, then X −


E[X] is sub-exponential and for a constant C > 0:

∥X − E[X]∥ψ1 ≤ C ∥X∥ψ1 .

Proposition 2.3.16 (Sub-exponential is sub-Gaussian squared). A random


variable X is sub-exponential if and only if X 2 is sub-Gaussian. Moreover,

X2 ψ1
:= ∥X∥2ψ2 .

69
Proposition 2.3.17 (Product of sub-Gaussians is sub-exponential). Let X
and Y be sub-Gaussian random variables. Then XY is sub-exponential.
Moreover,
∥XY ∥ψ1 = ∥X∥ψ2 ∥Y ∥ψ2 .

Theorem 2.3.11 (Bernstein’s inequality). Let X1 , . . . , Xn be independent


sub-exponential random variables with mean zero and let C > 0 be a constant.
Then, for every t ≥ 0:
" n # ( )!
X t2 t
P Xi ≥ t ≤ 2 exp −C min Pn 2 , .
i=1 i=1 ∥Xi ∥ψ
maxi ∥Xi ∥ψ1
1

Theorem 2.3.12 (Bernstein’s inequality for weighted sums). Let X1 , . . . , Xn


be independent sub-exponential random variables with mean zero, C > 0 be
a constant, K := maxni=1 ∥Xi ∥ψ1 and a ∈ Rn . Then:
" n
# ( )!
X t2 t
P ai Xi ≥ t ≤ 2 exp −C min 2 , .
i=1
K 2 ∥a∥2 K ∥a∥∞

Corollary 2.3.13 (Bernstein’s inequality for averages). Let X1 , . . . , Xn be


independent sub-exponential random variables with mean zero, C > 0 be a
constant and K := maxni=1 ∥Xi ∥ψ1 . Then:
" n
#   2 
X t t
P Xi /n ≥ t ≤ 2 exp −Cn min , .
i=1
K2 K

70
Alphabetical Index

Asymptotic distribution, 11 Estimator, 7


Euclidean inner product, 52
Basis, 50 Euclidean space, 52
Bernstein’s inequality, 70 Expected value, 63
Bias, 8
Bias-variance tradeoff, 10 Fermat’s rule, 62
Bounded set, 59 Fixed design, 5

Cauchy-Schwartz inequality, 64 General Hoeffding’s inequality, 68


Cauchy–Schwarz inequality, 51 Global minimizer, 61
Chebyshev’s inequality, 65
Closed set, 59 Hard sparsity, 42
Coercive function, 60 Hilbert projection theorem, 62
Column vector, 53 Hoeffding’s inequality, 65
Conjugate exponents, 63 Hölder inequality, 52
Consistency, 11 Hölder’s inequality, 64
Convex function, 60 Identity matrix, 54
Convex set, 59 Inner product, 50
Coordinate descent algorithm, 29 Integral identity, 65
Cumulative distribution function, Invertible matrix, 55
63
Jensen’s inequality, 64
Design matrix, 6
Diagonal matrix, 55 Khintchine’s inequality, 68
Domain, 59
Lp space, 63
Eigenvalue, 54 lp −norm, 3, 52
Eigenvalue decomposition, 59 Lasso, 12
Eigenvector, 54 Least squares, 12
Epigraph, 59 Limit inferior, 60
Estimand, 7 Linear combination, 50
Estimate, 7 Linear dependence, 50
Estimation risk, 8 Linear function, 51

71
Linear independence, 50 Projection matrix, 55
Linear model, 5 Proper function, 60
Linear prediction, 7
Local minimizer, 61 Random design, 6
Lower level set, 60 Restricted eigenvalue condition,
Lower semicontinuous function, 60 41
Ridge, 12
Markov’s inequality, 65 Ridgeless, 12
Matrix, 52 Ridgeless estimand, 34
Matrix addition, 53 Row vector, 53
Matrix diagonal, 55
Singular value decomposition, 58
Matrix kernel, 53
Singular values, 54
Matrix multiplication, 53
Soft-thresholding operator, 27
Matrix Range, 53
Span, 50
Matrix rank, 54
Strictly convex function, 60
Matrix transpose, 53
Sub-exponential norm, 69
Matrix-scalar multiplication, 53
Sub-exponential properties, 68
Mean predictive risk, 9
Sub-exponential random variables,
Mean squared error, 8
69
Minkowski’s inequality, 64
Sub-Gaussian norm, 67
Moment generating function, 63
Sub-Gaussian properties, 66
Moment of order p, 63
Sub-Gaussian random variable, 67
Moore-Penrose inverse, 56
Subdifferential, 60
n-tuple, 52 Subgradient, 60
Norm, 51 Subspace, 49
Symmetric Bernoulli distribution,
Orthogonal matrix, 55 65
Orthogonal projection, 55 Symmetric matrix, 54
Orthogonal subspaces, 51 System of linear equations, 57
Orthogonal vectors, 51
Trace, 55
Orthonormal basis, 51
Triangular inequality, 51
Pathwise coordinate descent, 31 Variance proxy, 67
Positive definite matrix, 54 Vector space, 49
Positive semi-definite matrix, 54
Predictive risk, 8 Weak sparsity, 43

72

You might also like