
arXiv:2412.15633v1 [stat.ME] 20 Dec 2024

Lecture Notes on High Dimensional Linear Regression

Alberto Quaini1,2

December 23, 2024 (first version: October 17, 2024)

1 Department of Econometrics, Erasmus University of Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands.
2 Please let me know if you find typos or mistakes at quaini@ese.eur.nl.
Introduction
These lecture notes were developed for a Master’s course in advanced ma-
chine learning at Erasmus University of Rotterdam. The course is designed
for graduate students in mathematics, statistics and econometrics. The con-
tent follows a proposition-proof structure, making it suitable for students
seeking a formal and rigorous understanding of the statistical theory under-
lying machine learning methods.
At present, the notes focus on linear regression, with an in-depth ex-
ploration of the existence, uniqueness, relations, computation, and non-
asymptotic properties of the most prominent estimators in this setting: least
squares, ridgeless, ridge, and lasso.

Background
It is assumed that readers have a solid background in calculus, linear alge-
bra, convex analysis, and probability theory. Some definitions and results
from these fields, relevant for the course, are provided in the Appendix for
reference.

Book-length references
The content of these lecture notes is inspired by a wide range of existing liter-
ature, but the presentation of topics follows my own interpretation and logical
structure. Although most of the content can be traced back to established
sources, certain sections reflect my perspective, and some material is origi-
nal to this course. For those interested in more comprehensive, book-length
discussions of related topics, the following key references are recommended:
Hastie et al. [2009], Bühlmann and Van De Geer [2011], Hastie et al. [2015],
and Wainwright [2019].

Disclaimer
Please note that despite my efforts, these lecture notes may contain errors. I
welcome any feedback, corrections, or suggestions you may have. If you spot
any mistakes or have ideas for improvement, feel free to contact me via email
at quaini@ese.eur.nl.
Contents

Notation
1 Linear Regression
    1.0.1 Estimators and their properties
  1.1 Least Squares and Penalized Least Squares
    1.1.1 Existence and uniqueness
    1.1.2 Equivalent expressions and relations
    1.1.3 Geometric interpretation
    1.1.4 Computation of lasso
    1.1.5 Finite-sample properties of ridgeless and ridge
    1.1.6 Finite sample properties of lasso
2 Appendix
  2.1 Linear algebra
    2.1.1 Moore-Penrose inverse
    2.1.2 Eigenvalue and Singular value decomposition
  2.2 Convex analysis
  2.3 Probability theory
Alphabetical Index
Notation
• All random variables are defined on a complete probability space (Ω, F, P)
and take values in a real Euclidean space.

• For a random variable (vector) [matrix] x (x) [X], the notation x ∈ R


(x ∈ Rn ) [X ∈ Rn×p ] means that x (x) [X] takes values in R (Rn )
[Rn×p ].
• The symbols →_P and →_d denote convergence in probability and in distribution, respectively.

• Given a random variable x, its expectation is denoted E[x] and its


variance Var[x].

• For a vector x ∈ Rn , the i−th element is denoted xi for i = 1, . . . , n.

• For a matrix A ∈ Rn×p , the i, j−th element is denoted Ai,j , the j−th
column is denoted Aj and the i−th row is denoted A(i) , for i = 1, . . . , n
and j = 1, . . . , p.

• The transpose of a matrix A ∈ Rn×p is denoted A′ .

• The Moore-Penrose inverse of a matrix A ∈ Rn×p is denoted A+ .


• The lp −norm ∥·∥p on Rn is defined for all v ∈ Rn as ∥v∥p := ( Σ_{i=1}^{n} |vi |^p )^{1/p} when p ∈ [1, +∞), and ∥v∥p := max_{i=1,...,n} |vi | when p = +∞.

• Given a vector θ ∈ Rp , the l0 −norm (which is not a norm!) ∥θ∥0 counts the number of nonzero elements of θ.

• argminx∈X f (x) denotes the set of minimizers of f over set X.

• diag(A) denotes the diagonal elements of a square matrix A ∈ Rn×n

• diag(a1 , . . . , an ) denotes a square matrix in Rn×n that has diagonal


elements given by a1 , . . . , an ∈ R and that has zero elsewhere.

• Given a matrix A ∈ Rn×p , its rank is Rank(A), its range is Range(A),


its kernel is Ker(A), its trace is denoted Trace(A).

• We denote PS the orthogonal projection onto set S ⊂ Rn×p .

• Given a vector space V , the sum of two subsets A, B ⊂ V is defined as A + B := {a + b : a ∈ A, b ∈ B}. The sum of a set A ⊂ V and a vector b ∈ V is defined as A + b := {a + b : a ∈ A}.

• The symbol ∂ indicates the subdifferential.

Chapter 1

Linear Regression

Linear regression is a supervised learning technique aimed at predicting a


target random variable y using a linear combination

x′ θ = x1 θ1 + . . . + xp θp

of explanatory variables x = [x1 , . . . , xp ]′ , where θ ∈ Rp and p ∈ N.1 The


target variable y is also referred to as dependent or output variable, while the
explanatory variables x are also known as independent variables, predictors
or input variables.
In typical applications, we observe only a sample of size n ∈ N of these
random variables, represented by the pairs (xi , yi )ni=1 , where xi ∈ Rp and
yi ∈ R for each i. Given a regression coefficient θ0 ∈ Rp , a statistical linear
model, or simply linear model, is expressed as

yi = x′i θ0 + ε0i , i = 1, . . . , n, (1.1)

where ε0i are real-valued residual random variables. Figure 1.1 depicts a
linear model for i = 1, . . . , n observations yi , with two predictors x̃i = [1, xi ]′
consisting of a unit constant and a variable xi , a coefficient θ0 ∈ R2 and the
associated error terms ε0i . The Data Generating Process (DGP), i.e., the
joint distribution of the predictors x and the real-valued residual random
variables ε0 = [ε0i ]ni=1 , is subject to certain restrictions. Depending on the
type of restrictions imposed on the DGP, different types of linear models
are obtained. The two general forms of linear models are fixed and random
design models, which are defined as follows.
Definition 1 (Fixed design model). In a fixed design model, the sequence
(xi )ni=1 is fixed. The residuals ε0i are independent and identically distributed.
1 An intercept in f (x; θ) can be introduced by adding a constant term to the predictors.
Definition 2 (Random design model). In a random design model, the pair
(xi , yi )ni=1 is a sequence of independent and identically distributed random
variables.
The fixed design model is particularly suitable when the predictors are
controlled by the analyst, such as the dose of medication administered to
patients in the treatment group in a clinical trial. Conversely, the random
design model is appropriate when the explanatory variables are stochastic,
such as the wind speed observed at a specific time and location.

[Scatter plot of observations (xi , yi ) with the fitted line lm(xi , θ0 ) = 0.5 + 0.8xi and an error term ε0i indicated.]

Figure 1.1: Statistical linear model yi = x̃′i θ0 + ε0i where x̃i = [1, xi ]′ and
θ0 = [0.5, 0.8]′ .

We organize the observed values of the target variable in the vector y =


[y1 , . . . , yn ]′ ∈ Rn , and the observations on the predictors in the design matrix
 
X := [ x11 · · · x1p ; ⋮ ⋱ ⋮ ; xn1 · · · xnp ] ∈ Rn×p .
With this notation, the linear model (1.1) can be expressed as:
y = Xθ0 + ε0 ,
where ε0 = [ε01 , . . . , ε0n ]′ ∈ Rn .
Example 1. A classic example of linear regression is found in the work of
Nerlove et al. [1961], which examines returns to scale in the U.S. electricity
power supply industry. In this study, the total cost yi for firm i is predicted
using a linear model based on the firm’s output production xi1 , the wage rate
xi2 , the price of fuel xi3 , and the rental price of capital xi4 , with data from
n = 145 electric utility companies.

The rest of the chapter is organized as follows. First, we study the most
basic linear regression approach, the method of least squares projection, and
some of its recent machine learning extensions. Our study focuses on their
existence, uniqueness, connections, geometric interpretation, and computa-
tion. Then, we cover both their finite- or small-sample properties, which are valid for any given sample size, and their asymptotic properties, which are useful approximations when the sample size is large enough.

1.0.1 Estimators and their properties


Definition 3 (Estimand). An estimand is a feature, or parameter, of interest
of the population.
Definition 4 (estimator and estimate). An estimator is a function taking as
input the data, and possibly other auxiliary variables, and outputting an
estimate, which is a specific value assigned to the estimand.
For instance, in the context of the linear model (1.1), the coefficient θ0 ∈ Rp represents the estimand. An estimator is a function θ̂n : Rn × Rn×p → Rp

that takes as inputs the data (y, X) ∈ Rn × Rn×p and produces an estimate
θ̂n (y, X) ∈ Rp . For simplicity, we use the same notation, θ̂n , to refer to both
the estimator and the resulting estimate, although formally the estimate
should be written as θ̂n (y, X). We may also write θ̂n ∈ Rp to indicate that
an estimator outputs values in Rp .
Definition 5 (Linear prediction). The quantity lm(X, θ) := Xθ denotes
the linear prediction associated to coefficient vector θ ∈ Rp and predictors
X ∈ Rn×p .
Let M + denote the Moore-Penrose inverse of a generic real-valued matrix
M . We make extensive use of the following useful projections.2
Definition 6 (Useful projections). Given a fixed matrix X ∈ Rn×p :

• PRange(X ′ ) := X + X is the orthogonal projector onto Range(X ′ );

• PKer(X) := I − X + X is the orthogonal projector onto Ker(X);

• PRange(X) := XX + is the orthogonal projector onto Range(X);

• PKer(X ′ ) := I − XX + is the orthogonal projector onto Ker(X ′ ).


2 We use the notation Range (Ker) for the range (kernel) of a matrix. Details on these sets, the Moore-Penrose inverse and orthogonal projections are given in Appendix Section 2.1.
The next proposition demonstrates that, if we fix the design matrix X,
we can focus on regression coefficients in Range(X ′ ). Indeed, coefficients in
this set span all possible linear predictions that can be achieved through X.

Proposition 1.0.1. Given a matrix X ∈ Rn×p , for any θ ∈ Rp :

lm(X, θ) = lm(X, PRange(X ′ ) θ).

Proof. Using the identity I = PRange(X ′ ) + PKer(X) , we have for any θ ∈ Rp :

Xθ = X(PRange(X ′ ) + PKer(X) )θ = X PRange(X ′ ) θ.
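
As a quick numerical check of Definition 6 and Proposition 1.0.1, the following minimal NumPy sketch (illustrative only; the dimensions and random seed are arbitrary choices) builds the four projectors from the Moore-Penrose inverse and verifies that the linear prediction depends on θ only through PRange(X ′ ) θ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 8, 5, 3                      # arbitrary sizes; rank r < p makes Ker(X) non-trivial
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))   # Rank(X) = r
Xp = np.linalg.pinv(X)                 # Moore-Penrose inverse X^+

P_range_Xt = Xp @ X                    # projector onto Range(X')
P_ker_X = np.eye(p) - Xp @ X           # projector onto Ker(X)
P_range_X = X @ Xp                     # projector onto Range(X)
P_ker_Xt = np.eye(n) - X @ Xp          # projector onto Ker(X')

theta = rng.standard_normal(p)
# Proposition 1.0.1: the linear prediction only depends on the part of theta in Range(X')
assert np.allclose(X @ theta, X @ (P_range_Xt @ theta))
# all four projectors are idempotent and symmetric
for P in (P_range_Xt, P_ker_X, P_range_X, P_ker_Xt):
    assert np.allclose(P, P @ P) and np.allclose(P, P.T)
print("projection identities verified")
```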

Finite-sample properties
Since an estimator is derived from data, it is a random variable. Intuitively,
when comparing two estimators of the same estimand, we prefer the one
whose probability distribution is ”more concentrated” around the true value
of the estimand. Formally, estimators are compared using several key prop-
erties.3
Definition 7 (Bias). The bias of an estimator θ̂n of θ0 is the difference between
the expected value of the estimator and the estimand:

Bias(θ̂n , θ0 ) := E[θ̂n ] − θ0 .

Definition 8 (Estimation risk). The estimation risk of an estimator θ̂n of θ0


measures the difference between the estimate and the estimand as:

ER(θ̂n , θ0 ) := ∥θ̂n − θ0 ∥22 .

Definition 9 (MSE). The Mean Squared Error (MSE) of an estimator θ̂n of


θ0 is the expected estimation risk of the estimator:

MSE(θ̂n , θ0 ) := E[∥θ̂n − θ0 ∥22 ].

Definition 10 (Predictive risk). The predictive risk of an estimator θ̂n of θ0


measures the difference between the linear predictions of θ̂n and those of θ0 :

PR(θ̂n , θ0 ) := ∥lm(X, θ̂n ) − lm(X, θ0 )∥22 /n.


3 Some of these properties are defined by means of the l2 −norm. Note that this choice is typical, but arbitrary.
Definition 11 (Mean predictive risk). The Mean Predictive Risk (MPR) of
an estimator θ̂n of θ0 is the expected predictive risk of the estimator:

MPR(θ̂n , θ0 ) := E[∥lm(X, θ̂n ) − lm(X, θ0 )∥22 /n].

As a corollary to Proposition 1.0.1, the predictive risk of an estimator


is unchanged if both the estimator and the estimand are projected onto
Range(X ′ ).
Corollary 1.0.1. Given a matrix X ∈ Rn×p , for any estimator θ̂n of θ0 ∈
Rp :
PR(θ̂n , θ0 ) = PR(PRange(X ′ ) θ̂n , PRange(X ′ ) θ0 ).
Proof. The result follows from Proposition 1.0.1.
The next proposition justifies the definition of mean predictive risk given
in Definition 11.
Proposition 1.0.2. Assume that the linear model (1.1) holds with E[xε0 ] =
0. Then, for any θ ∈ Rp :

E[∥y − Xθ∥22 /n] = MPR(θ, θ0 ) + E[∥ε0 ∥22 /n].

Proof. Since y = Xθ0 + ε0 , we have

E[∥y − Xθ∥22 /n] = E[∥X(θ0 − θ)∥22 /n] + E[∥ε0 ∥22 /n] + 2(θ0 − θ)′ E[X ′ ε0 ]/n.

Then, the result follows since E[X ′ ε0 ] = Σ_{i=1}^{n} E[Xi ε0i ] = 0, where Xi denotes the i−th row of X.
If our primary goal is to accurately predict the target variable, we seek
estimators θ̂ with a low mean prediction risk E[∥y − X θ̂∥22 /n]. Since we
cannot control the error term ε0 , Proposition 1.0.2 suggests that we should
focus on estimators with a low mean predictive risk.
On the other hand, if our interest lies in understanding which predictors
influence the target variable and how they do so, the true coefficient θ0
becomes our focus. In this case, we might prefer unbiased estimators – those
with zero bias – over biased ones. However, estimators with lower mean
squared error (MSE) are generally favored, even if they feature some bias.
The following proposition demonstrates that the MSE can be decomposed
into a bias and a variance term.
Proposition 1.0.3 (Bias-variance decomposition of MSE). Given an esti-
mator θ̂n ∈ Rp for θ0 ∈ Rp , the MSE can be decomposed as follows:

MSE(θ̂n , θ0 ) = ∥Bias(θ̂n , θ0 )∥22 + Trace(Var[θ̂n ]).

Proof. The result follows from

MSE(θ̂n , θ0 ) = E[(θ̂n − θ0 )′ (θ̂n − θ0 )]


= E[Trace{(θ̂n − θ0 )(θ̂n − θ0 )′ }]
= Trace(E[(θ̂n − θ0 )(θ̂n − θ0 )′ ])
= Trace(E[(θ̂n − E[θ̂n ] + Bias(θ̂n , θ0 ))(θ̂n − E[θ̂n ] + Bias(θ̂n , θ0 ))′ ])
= Trace(Var[θ̂n ] + Bias(θ̂n , θ0 ) Bias(θ̂n , θ0 )′ +
E[θ̂n − E[θ̂n ]] Bias(θ̂n , θ0 )′ + Bias(θ̂n , θ0 )E[θ̂n − E[θ̂n ]]′ )
= Trace(Var[θ̂n ] + Bias(θ̂n , θ0 ) Bias(θ̂n , θ0 )′ )
= ∥Bias(θ̂n , θ0 )∥22 + Trace(Var[θ̂n ]).
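
To make the bias-variance decomposition concrete, here is a small Monte Carlo sketch (illustrative only; the design, the shrinkage factor and the number of replications are arbitrary choices) that approximates the MSE of a deliberately biased estimator and compares it with ∥Bias∥22 + Trace(Var).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 1.0
theta0 = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, p))        # fixed design, kept constant across replications
Xpinv = np.linalg.pinv(X)
shrink = 0.8                           # a deliberately biased estimator: shrink the LSE toward zero

reps = 20000
estimates = np.empty((reps, p))
for b in range(reps):
    y = X @ theta0 + sigma * rng.standard_normal(n)
    estimates[b] = shrink * (Xpinv @ y)   # biased estimator

mse = np.mean(np.sum((estimates - theta0) ** 2, axis=1))
bias = estimates.mean(axis=0) - theta0
var_trace = np.trace(np.cov(estimates, rowvar=False))
# Proposition 1.0.3: MSE = ||Bias||_2^2 + Trace(Var)
print(mse, np.sum(bias ** 2) + var_trace)   # the two numbers should nearly coincide
```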

Loosely speaking, the bias and the variance of an estimator are linked
to the estimator’s ”complexity”. Estimators with higher complexity often
fit the data better, resulting in lower bias, but they are more sensitive to
data variations, leading to higher variance. Conversely, estimators with lower
complexity tend to have lower variance but higher bias, a phenomenon known
as the bias-variance tradeoff.
Apart from simple cases, computing the finite-sample properties of esti-
mators, such as their MSE or predictive risk, is infeasible or overly compli-
cated. This is because they require computations under the DGP of complex
transformations of the data. When direct computation is not possible, we
can rely on concentration inequalities or asymptotic approximations.
Concentration inequalities are inequalities that bound the probability
that a random variable deviates from a particular value, typically its ex-
pectation. In this chapter, we focus on inequalities that control the MSE or
predictive risk of an estimator, such as:

P[d(θ̂n , θ0 ) ≤ h(y, X, n, p)] ≥ 1 − δ,

or
P[d(lm(X, θ̂n ), lm(X, θ0 )) ≤ h(y, X, n, p)] ≥ 1 − δ,
where δ ∈ (0, 1) sets the confidence level 1 − δ, d : Rp ×Rp → [0, +∞) is a distance,
and h is a real-valued function of the data, the sample size, and the number
of predictors.

Large-sample properties
Large-sample or asymptotic theory provides an alternative approach to study
and analyse estimators. Classically, this framework develops approximations
of the finite-sample properties of estimators, such as their distribution, MSE
or predictive risk, by letting the sample size n → ∞. Consequently, these
approximations work well when the sample size n is much larger than the
number of predictors p. More recently, asymptotic approximations are also
developed by letting p → ∞, or having both n, p → ∞ at some rate. Note
that, given a sample of size n and number of variables p, there is no general
indication on how to choose the appropriate asymptotic regime for n and p,
as the goodness of fit of the corresponding asymptotic approximations should
be assessed on a case by case basis. In this chapter, we work with two notions
from large-sample theory: consistency and asymptotic distribution.
Definition 12 (Consistency). Estimator θ̂n of θ0 is consistent, written θ̂n →_P θ0 as n → ∞, if for all ε > 0,

lim_{n→∞} P[|θ̂n − θ0 | > ε] = 0.

Definition 13 (Asymptotic distribution). Given a deterministic real-valued


sequence rn,p → ∞, let Fn,p be the probability distribution of rn,p (θ̂n − θ0 )
and F a non-degenerate probability distribution. Estimator θ̂n of θ0 has
asymptotic distribution F with rate of convergence rn,p if Fn,p (z) → F (z)
as rn,p → ∞ for all z at which F (z) is continuous. Equivalent short-hand
notations are rn,p (θ̂n − θ0 ) →_d F and rn,p (θ̂n − θ0 ) →_d η ∼ F , as rn,p → ∞.

1.1 Least Squares and Penalized Least Squares


In this chapter, we study the most widely used methods in linear regression
analysis: the method of least squares and some of its penalized variants. The
method of least squares was first introduced by Legendre [1805] and Gauss
[1809], and it consists in minimizing the squared l2 −distance between the
target values y and the linear prediction lm(X, θ) = Xθ in the coefficient
vector θ ∈ Rp .

Figure 1.2: Adrien-Marie Legendre (1752–1833) and Johann Carl Friedrich
Gauss (1777–1855).

Definition 14 (Least squares estimator). The Least Squares Estimator (LSE)


is defined as:
θ̂nls ∈ argmin_{θ∈Rp} (1/2) Σ_{i=1}^{n} (yi − θ ′ xi )² = argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 .    (1.2)

In addition to the LSE, we consider the following variants.


Definition 15 (Ridgeless estimator). The ridgeless estimator is defined as:
 
θ̂nrl = argmin_{θ̂∈Rp} { ∥θ̂∥22 : θ̂ ∈ argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 }.    (1.3)

Definition 16 (Ridge estimator). The ridge estimator is defined for λ > 0 as:
θ̂nr (λ) = argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 + (λ/2) ∥θ∥22 .    (1.4)

Definition 17 (Lasso estimator). The lasso estimator is defined for λ > 0 as:
θ̂nl (λ) ∈ argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 + λ ∥θ∥1 .    (1.5)

Here is a brief overview of the results that are discussed in detail in the
rest of this chapter. A solution to the least squares problem (1.2) always
exists. However, when the predictors (i.e., the columns of X) are linearly
dependent, there are infinitely many solutions.4 In such cases, the LSE typ-
ically considered is the ridgeless estimator, which is always unique.
The ridge and lasso estimators are penalized or regularized versions of
the LSE, with penalty term λ ∥θ∥22 and λ ∥θ∥1 , respectively. The penalty
parameter λ > 0 controls the strength of the penalty. The ridge estimator,
4 This situation always arises when p > n, and it may arise even when p ≤ n.
introduced by Hoerl and Kennard [1970], was developed to address certain
shortcomings of the LSE, particularly in scenarios involving collinear or mul-
ticollinear designs – where the predictors in X are linearly dependent or
nearly-linearly dependent. The ridge estimator is uniquely defined and often
exhibits better statistical properties compared to those of the LSE in set-
tings with multicollinear or many predictors. On the other hand, the lasso
estimator, popularized by Tibshirani [1996], offers an approximation of the
l0 estimator, which is defined for some R > 0:
 
θ̂nl0 ∈ argmin_{θ∈Rp} { (1/2) ∥y − Xθ∥22 : ∥θ∥0 ≤ R },    (1.6)
where ∥θ∥0 is the number of nonzero elements in θ. A key feature of this es-
timator is its ability to produce sparse solutions, i.e., to set some coefficients
exactly to zero. Consequently, the l0 estimator can be used to perform pa-
rameter estimation and variable selection simultaneously. However, it is the
solution of a non-convex problem, and, in general, computing it can be an
”NP-hard” problem. The lasso instead shares the ability to produce sparse
solutions and it can be easily computed even for large datasets.
Remark 1 (Data standardization). For computational stability, it is recom-
mended to compute linear regression estimators with a least squares loss
after having standardized the predictors X so that x̄ := X ′ 1/n = 0 and
Xj′ Xj = 1 for each j = 1, . . . , p. Without standardization, the solutions
would depend on the units used to measure the predictors. Moreover, we
may also center the target variable y, meaning ȳ := y ′ 1/n = 0. These
centering conditions are convenient, since they mean that we can omit the
intercept term. Given an optimal solution θ̂ on the centered data, we can
recover the optimal solutions for the uncentered data: θ̂ is the same and the
intercept is given by ȳ − x̄′ θ̂.
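
Remark 1 in code: a minimal NumPy sketch (the helper names standardize and destandardize are not from the notes, just illustrative) that centers y, centers and scales the columns of X so that Xj′ Xj = 1, and maps a coefficient vector fitted on the standardized data back to the original units, including the intercept ȳ − x̄′ θ̂.

```python
import numpy as np

def standardize(X, y):
    """Center y, and center/scale the columns of X so that Xj'Xj = 1."""
    x_bar = X.mean(axis=0)
    y_bar = y.mean()
    Xc = X - x_bar
    scale = np.sqrt((Xc ** 2).sum(axis=0))   # column norms after centering
    return Xc / scale, y - y_bar, x_bar, y_bar, scale

def destandardize(theta_std, x_bar, y_bar, scale):
    """Map a coefficient vector fitted on standardized data back to the original data."""
    theta = theta_std / scale                # undo the column scaling
    intercept = y_bar - x_bar @ theta        # intercept for the uncentered data
    return theta, intercept

# usage example with arbitrary simulated data
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4)) * np.array([1.0, 10.0, 0.1, 5.0]) + 3.0
y = X @ np.array([1.0, 0.2, -5.0, 0.0]) + 2.0 + 0.1 * rng.standard_normal(30)

Xs, yc, x_bar, y_bar, scale = standardize(X, y)
theta_std = np.linalg.lstsq(Xs, yc, rcond=None)[0]   # any estimator fitted on standardized data
theta, intercept = destandardize(theta_std, x_bar, y_bar, scale)
print(theta, intercept)   # approximately [1.0, 0.2, -5.0, 0.0] and 2.0
```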

1.1.1 Existence and uniqueness


From here on, we make extensive use of the spectral decomposition of X.
Definition 18 (Spectral decomposition of X). The spectral decomposition of
X is
X = U SV ′ ,
where U ∈ Rn×n and V ∈ Rp×p are orthogonal matrices and, for r :=
Rank(X) ≤ min{n, p},
 
S = [ diag(s1 , . . . , sr )   0 ; 0   0 ] ∈ Rn×p .

We establish the following key results: the existence of the LSE, the
ridgeless, the ridge and the lasso estimators; the closed-form expression of
the LSE, ridgeless, and ridge; the uniqueness of the ridgeless and ridge; the
uniqueness of the LSE when Rank(X) = p, i.e., when the predictors in X are
linearly independent. Notice that this rank condition cannot hold if n < p.

Theorem 1.1.1 (Existence and uniqueness of LSE, ridgeless, ridge and


lasso). The following statements hold:
(i) The set of solutions to the least squares problem (1.2) is non-empty and
given by
argmin ∥y − Xθ∥22 /2 = X + y + Ker(X).
θ∈Rp

(ii) The ridgeless estimator exists, is an element of Range(X ′ ), and is


uniquely given in closed form by

θ̂nrl = X + y. (1.7)

(iii) If Rank(X) = p, then the LSE and the ridgeless estimator are uniquely
given in closed form by:

θ̂nls = θ̂nrl = (X ′ X)−1 X ′ y. (1.8)

(iv) The ridge estimator with penalty parameter λ > 0 exists, is an element
of Range(X ′ ), and is uniquely given in closed form by:

θ̂nr (λ) = (X ′ X + λI)−1 X ′ y. (1.9)

(v) The lasso estimator exists and, in general, it is not unique.


Proof. (i) The least squares problem (1.2) is an unconstrained optimiza-
tion problem with the objective function

f : Rp → [0, +∞); θ 7→ ∥y − Xθ∥22 /2.

By Theorem 2.2.1, the set of least squares minimizers is given by

S := {θ̂ ∈ Rp : X ′ X θ̂ = X ′ y}.

Since X ′ y ∈ Range(X ′ ) and Range(X ′ ) = Range(X ′ X), set S is


not empty. Consider a vector θ̂ ∈ X + y + Ker(X). Using that X =
PRange(X) X which implies X ′ = X ′ PRange(X) , we obtain

X ′ X θ̂ = X ′ XX + y = X ′ PRange(X) y = X ′ y.

Therefore, X + y + Ker(X) ⊂ S. Now consider a vector v ∈ Rp not in
set X + y + Ker(X). That is, v = θ̂ + u with θ̂ ∈ X + y + Ker(X) and a nonzero u ∈ Range(X ′ ). Since X ′ Xu ̸= 0,

X ′ Xv = X ′ y + X ′ Xu ̸= X ′ y.

We conclude that X + y + Ker(X) = S.

(ii) The minimum norm least squares problem in (1.3) has a strictly convex
and coercive objective function

f : Rp → R; θ 7→ ∥θ∥22 ,

and a closed convex feasible set X + y + Ker(X) ⊂ Rp . It follows that a


solution exists and it is unique; see Propositions 2.2.3 and 2.2.4. Since,
for any v ∈ Ker(X),

∥X + y∥22 ≤ ∥X + y∥22 + ∥v∥22 = ∥X + y + v∥22 ,

the ridgeless estimator can be expressed in closed form as θ̂nrl = X + y,


which is an element of Range(X ′ ) since X + = PRange(X ′ ) X + .

(iii) If Rank(X) = p, then Ker(X) = {0}. Moreover, X ′ X is invertible


and we can use the identity X + = (X ′ X)−1 X ′ to conclude that the
LSE and the ridgeless estimator are uniquely given by (1.8).

(iv) The ridge problem in (1.4) is an unconstrained optimization problem


with the strictly convex, coercive and continuously differentiable objec-
tive function
f : Rp → [0, +∞); θ 7→ ∥y − Xθ∥22 /2 + (λ/2) ∥θ∥22 .

It follows that a solution θ̂ r ∈ Rp exists and it is unique; see, again,


Propositions 2.2.3 and 2.2.4. Theorem 2.2.1 implies

(X ′ X + λI)θ̂ r (λ) = X ′ y.

Consider the spectral decomposition X = U SV ′ in Definition 18.


Then,
 
X ′ X + λI = V S ′ SV ′ + λV V ′ = V [ diag(s1² + λ, . . . , sr² + λ)   0 ; 0   λI ] V ′ ,
which is positive definite, and thus θ̂nr (λ) = (X ′ X + λI)−1 X ′ y is the
solution to the FOCs. Finally, to prove that θ̂nr (λ) ∈ Range(X ′ ), notice
that PRange(X ′ ) = V S + SV ′ , where
 
S + S = [ I   0 ; 0   0 ] .

Thus,

PRange(X ′ ) θ̂nr (λ) =V S + SV ′ V (S ′ S + λI)−1 V V ′ S ′ U ′ y


=V (S ′ S + λI)−1 S ′ U ′ y = θ̂nr (λ).

We conclude that PRange(X ′ ) θ̂nr (λ) = V (S ′ S+λI)−1 S ′ U ′ y, i.e., θ̂nr (λ) ∈


Range(X ′ ).

(v) The lasso problem in (1.5) is an unconstrained optimization problem


with the convex and coercive objective function
f : Rp → [0, +∞); θ 7→ ∥y − Xθ∥22 /2 + λ ∥θ∥1 .

It follows that a solution θ̂ l (λ) ∈ Rp exists; see Proposition 2.2.3. How-


ever, we demonstrate by counterexample that this solution is, in gen-
eral, not unique. Consider a sample (y, X) ∈ Rn × Rn×2 where the
two predictors are identical, i.e., x1 = x2 ∈ Rn , and assume that there exists a corresponding lasso solution θ̂nl (λ) = [θ̂n1l (λ), θ̂n2l (λ)]′ ∈ R2 that is non-zero. At an optimum the two components cannot have strictly opposite signs, since merging them into a single coordinate would lower the l1 −norm without changing the fit. Then,

θ̂1 = [θ̂n1l (λ) + θ̂n2l (λ), 0]′    and    θ̂2 = [0, θ̂n1l (λ) + θ̂n2l (λ)]′

are two distinct coefficient vectors that produce the same fit, X θ̂1 = X θ̂2 = X θ̂nl (λ), and have the same l1 −norm as θ̂nl (λ). Consequently, in this example there exist at least two distinct lasso solutions.

Remark 2 (Computation of ridgeless and ridge). The closed form expressions of the LSE, ridgeless and ridge estimators are useful analytical results. However, for numerical stability, it is recommended to compute these estimators by solving their corresponding normal equations, which are X ′ X θ̂n = X ′ y for the LSE or ridgeless, and (X ′ X + λI)θ̂nr = X ′ y for the ridge.
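
A minimal NumPy illustration of Remark 2 (illustrative only; sizes and seed are arbitrary): the ridge estimator is obtained by solving its normal equations with np.linalg.solve, while np.linalg.lstsq returns the minimum l2 −norm least squares solution X + y directly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 40, 10, 0.5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# ridge: solve the normal equations (X'X + lambda I) theta = X'y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# ridgeless / LSE: lstsq returns the minimum l2-norm least squares solution,
# i.e. X^+ y, without explicitly forming an inverse
theta_ridgeless = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.allclose(theta_ridgeless, np.linalg.pinv(X) @ y)
assert np.allclose((X.T @ X + lam * np.eye(p)) @ theta_ridge, X.T @ y)
print(theta_ridge[:3], theta_ridgeless[:3])
```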

Remark 3 (Collinearity). Using the notation in Definition 18, the minimum
nonzero eigenvalue of X ′ X is s2r . If r < p, then X ′ X has p − r zero eigen-
values and the predictors are said to be collinear, that is, they are linearly
dependent. In this case Ker(X) is not trivial (it contains nonzero elements),
hence the LSE is not unique. Moreover, if sr ≈ 0, then the computation of
 
X + = V [ diag(1/s1 , . . . , 1/sr )   0 ; 0   0 ] U ′ ,

and hence of the ridgeless estimator, is unstable. The ridge estimator in-
stead may not display these computational hurdles, provided that the penalty
parameter λ is large enough. That is because the minimum eigenvalue of
(X ′ X + λI) is s2r + λ. In Section 1.1.5 we show that, if sr ≈ 0, the ridge-
less (ridge) estimator’s MSE and MPR satisfy loose (sharp) concentration
inequalities.
Remark 4 (Uniqueness of the lasso solution). Tibshirani [2013] shows that,
under some conditions, the lasso estimator is unique. For instance, if the
predictors in X are in general position, then the lasso solution is unique.
Specifically, a set (aj )pj=1 where aj ∈ Rn for all j is in general position if any
affine subspace of Rn of dimension k < n contains at most k + 1 elements
of the set {±a1 , ±a2 , . . . , ±ap }, excluding antipodal pairs of points (that is,
points differing only by a sign flip). If the predictors are (non redundant)
continuous random variables, they are almost surely in general position, and
hence the lasso solution is unique. As a result, non-uniqueness of the lasso
solution typically occurs with discrete-valued data, such as those comprising
dummy or categorical variables.
Since the LSE, ridgeless, ridge and lasso estimators exist, their linear
predictions exist too. Moreover, the linear predictions of the uniquely defined
estimators, like ridgeless and ridge, are trivially unique. Remarkably, some
estimators that may not be unique entail unique linear predictions. The next
lemma implies that the LSE and lasso are among these estimators.

Lemma 1.1.2. Let h : Rp → (−∞, +∞] be a proper convex function. Then


Xθ1 = Xθ2 and h(θ1 ) = h(θ2 ) for any two minimizers θ1 , θ2 ∈ Rp of

f : Rp → (−∞, +∞]; θ 7→ (1/2) ∥y − Xθ∥22 + h(θ).
Proof. Assume that Xθ1 ̸= Xθ2 , and let δ := inf θ∈Rp f (θ). By Proposition

2.2.2, the set of minimizers of f is convex. Thus, for any α ∈ (0, 1):

δ = f (αθ1 + (1 − α)θ2 )
= (1/2) ∥y − X[αθ1 + (1 − α)θ2 ]∥22 + h(αθ1 + (1 − α)θ2 )
< (α/2) ∥y − Xθ1 ∥22 + ((1 − α)/2) ∥y − Xθ2 ∥22 + h(αθ1 + (1 − α)θ2 )
≤ (α/2) ∥y − Xθ1 ∥22 + ((1 − α)/2) ∥y − Xθ2 ∥22 + αh(θ1 ) + (1 − α)h(θ2 )
= αf (θ1 ) + (1 − α)f (θ2 ) = δ,

where the strict inequality follows from the strict convexity of g : v 7→


∥y − v∥22 . Since the conclusion δ < δ is absurd, we must have Xθ1 = Xθ2 ,
and since f (θ1 ) = f (θ2 ), it follows that h(θ1 ) = h(θ2 ) as well.
While we make use of this lemma for proving uniqueness of the predic-
tions of lasso, we can use a more direct approach for the other estimators’
predictions, which directly provides their closed form expressions. We further
show that the LSE’s prediction has the geometric interpretation of being the
unique vector in the range of X that is closest to y in l2 distance, and that
the residual vector is orthogonal to the range of X.

Theorem 1.1.3 (Uniqueness of linear predictions). The following statements


hold:

(i) The linear predictions of the LSE and the ridgeless estimator are uniquely
given by:
lm(X, θ̂nls ) = lm(X, θ̂nrl ) = PRange(X) y, (1.10)
which is the unique vector v ∈ Range(X) such that

∥y − v∥2 = inf{∥y − z∥2 : z ∈ Range(X)}.

Moreover, the residual vector y − lm(X, θ̂nls ) = y − lm(X, θ̂nrl ) is or-


thogonal to Range(X).

(ii) The linear prediction of the ridge estimator is uniquely given, for λ > 0,
by
lm(X, θ̂nr (λ)) = X(X ′ X + λI)−1 X ′ y.

(iii) The linear prediction of the lasso estimator is unique.

Proof. (i) The linear predictions lm(X, θ̂nls ) and lm(X, θ̂nrl ) are uniquely
given by (1.10) because all solutions to the least squares problem θ̂ ∈
X + y + Ker(X) yield the same prediction

lm(X, θ̂) = XX + y = PRange(X) y.

By the definition of θ̂nls and the fact that Range(X) is a closed vector
subspace of Rn , the remaining claims follow as a direct application of
the Hilbert projection theorem (Theorem 2.2.2).
(ii) This result follows directly from the closed form expression (1.9) of the
ridge estimator.
(iii) Since the l1 −norm is convex, the result follows by Lemma 1.1.2.

1.1.2 Equivalent expressions and relations


The ridgeless and the ridge, together with their corresponding linear predic-
tions, admit the following simple expressions.
Proposition 1.1.1 (Spectral expression of ridgeless and ridge). Given the
spectral decomposition X = U SV ′ in Definition 18:
(i) The ridgeless estimator is given by
θ̂nrl = ( Σ_{j=1}^{r} (1/sj ) vj uj′ ) y.

The corresponding linear prediction is

lm(X, θ̂nrl ) = ( Σ_{j=1}^{r} uj uj′ ) y.    (1.11)

(ii) The ridge estimator with λ > 0 is given by

θ̂nr (λ) = ( Σ_{j=1}^{r} sj /(sj² + λ) vj uj′ ) y.

The corresponding linear prediction is

lm(X, θ̂nr (λ)) = ( Σ_{j=1}^{r} sj² /(sj² + λ) uj uj′ ) y.    (1.12)

Proof. (i) From the closed-form expression of the ridgeless estimator,
θ̂nrl = X + y = V S + U ′ y = ( Σ_{j=1}^{r} (1/sj ) vj uj′ ) y.

Therefore,

X θ̂nrl = U SS + U ′ y = ( Σ_{j=1}^{r} uj uj′ ) y.

(ii) From the closed-form expression of the ridge estimator,

θ̂nr (λ) = (X ′ X + λI)−1 X ′ y = V (S ′ S + λI)−1 S ′ U ′ y = ( Σ_{j=1}^{r} sj /(sj² + λ) vj uj′ ) y.

Therefore,

X θ̂nr (λ) = U S(S ′ S + λI)−1 S ′ U ′ y = ( Σ_{j=1}^{r} sj² /(sj² + λ) uj uj′ ) y.

Using Definition 18, matrix PRange(X) = XX + = Σ_{j=1}^{r} uj uj′ , where {u1 , . . . , ur } is an orthonormal basis of Range(X). From expression (1.11),
it follows that the prediction of the ridgeless estimator is the orthogonal
projection of y onto the range of X. Expression (1.12) instead shows that
the ridge estimator shrinks this projection, shrinking less the directions uj
associated to high variance (high sj ), and more the directions uj associated
to low variance (low sj ); see Figure 1.3. Indeed, for fixed λ > 0, the weight
s2j /(s2j + λ) → 0 as sj → 0, and s2j /(s2j + λ) → 1 as sj → ∞.
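
The spectral expressions of Proposition 1.1.1 translate directly into code. The sketch below (illustrative only; dimensions and seed are arbitrary) computes the ridgeless and ridge estimators from the SVD of X, applies the shrinkage weights sj²/(sj² + λ) to the projection of y, and checks the results against the closed-form expressions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 30, 6, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt
r = np.sum(s > 1e-12)                              # numerical rank

# Proposition 1.1.1: ridgeless and ridge as weighted sums over singular directions
theta_rl = Vt.T[:, :r] @ ((U.T[:r] @ y) / s[:r])
theta_r  = Vt.T[:, :r] @ ((s[:r] / (s[:r] ** 2 + lam)) * (U.T[:r] @ y))

# shrinkage weights s_j^2 / (s_j^2 + lambda) applied to the projection of y onto Range(X)
fit_rl = U[:, :r] @ (U.T[:r] @ y)
fit_r  = U[:, :r] @ ((s[:r] ** 2 / (s[:r] ** 2 + lam)) * (U.T[:r] @ y))

assert np.allclose(theta_rl, np.linalg.pinv(X) @ y)
assert np.allclose(theta_r, np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))
assert np.allclose(fit_rl, X @ theta_rl) and np.allclose(fit_r, X @ theta_r)
print("spectral expressions verified")
```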

[Plot of the shrinkage factor f (s) = s²/(s² + λ) as a function of s.]
Figure 1.3: Shrinkage of principal components in the linear prediction of ridge when λ = 1/2.

The ridgeless estimator can also be expressed as a penalized LSE.


Proposition 1.1.2 (Penalized expression of ridgeless). The ridgeless es-
timator is the only solution to the least squares problem (1.2) that is in
Range(X ′ ), and it can be expressed as:
θ̂nrl = argmin_{θ∈Rp} (1/2) ∥y − Xθ∥22 + (1/2) θ ′ PKer(X) θ.    (1.13)
Proof. From Theorem 1.1.1, the solution set of the least squares problem
(1.2) is θ̂nrl + Ker(X), where θ̂nrl = X + y is in Range(X ′ ). Since Ker(X) ⊥
Range(X ′ ), θ̂nrl is the only solution in Range(X ′ ). Moreover, penalty h :
θ 7→ θ ′ PKer(X) θ is zero in θ̂nrl , and strictly positive at any other least squares
solution. We conclude that θ̂nrl minimizes (1.13).
The following linear transformations relate the ridgeless and the ridge esti-
mators.
Proposition 1.1.3 (Links between ridgeless and ridge). The following relations
between the ridge and the minimum norm least squares estimators hold:

θ̂nr (λ) = (X ′ X + λI)−1 X ′ X θ̂nrl , (1.14)


θ̂nrl = (X ′ X)+ (X ′ X + λI)θ̂nr (λ), (1.15)
lim θ̂nr (λ) = θ̂nrl . (1.16)
λ→0

Proof. Using X = PRange(X) X which implies X ′ = X ′ PRange(X) , we have

(X ′ X + λI)−1 X ′ y = (X ′ X + λI)−1 X ′ XX + y,

and thus
θ̂nr (λ) = (X ′ X + λI)−1 X ′ X θ̂nrl .
Using X + = X + (X + )′ X ′ , we have
X + y = X + (X + )′ (X ′ X + λI)(X ′ X + λI)−1 X ′ y.
Moreover, X + (X + )′ = (X ′ X)+ implies
θ̂nrl = (X ′ X)+ (X ′ X + λI)θ̂nr (λ).
Finally, since X + = limλ→0 (X ′ X + λI)−1 X ′ , we have limλ→0 θ̂nr (λ) = θ̂nrl .
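
Relations (1.14)–(1.16) are easy to verify numerically. The sketch below (illustrative only; it uses a deliberately rank-deficient design so that Ker(X) is non-trivial) checks that the ridge estimator approaches the ridgeless estimator as λ → 0 and that relation (1.14) holds.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, r = 20, 8, 4
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))   # Rank(X) = r < p
y = rng.standard_normal(n)

theta_rl = np.linalg.pinv(X) @ y                                 # ridgeless estimator X^+ y

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    print(lam, np.linalg.norm(ridge(lam) - theta_rl))            # distance shrinks as lam -> 0

# relation (1.14): ridge = (X'X + lam I)^{-1} X'X ridgeless
lam = 0.3
assert np.allclose(ridge(lam),
                   np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ X @ theta_rl))
```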

Expression (1.16) explains why estimator (1.3) is called the ridgeless esti-
mator. The ridge and lasso estimators can be expressed as constrained least
squares problems.
Proposition 1.1.4 (Equivalence between penalized and constrained least
squares). For c ≥ 0, λ ≥ 0, and some norm ∥·∥ : Rp → R, define:
C(c) := argmin_{θ∈Rp} { ∥y − Xθ∥22 /2 : ∥θ∥ ≤ c };

P(λ) := argmin_{θ∈Rp} { ∥y − Xθ∥22 /2 + λ ∥θ∥ }.

Then, for a given c > 0, there exists λ0 ≥ 0 such that C(c) ⊂ P(λ0 ). Con-
versely, for a given λ > 0, there exists c0 ≥ 0 such that P(λ) ⊂ C(c0 ).
Proof. The objective function h : θ 7→ ∥y − Xθ∥22 /2 is convex and continu-
ous, and the constraint set {θ ∈ Rp : ∥θ∥ ≤ c} is not empty, closed, bounded
and convex. By the KKT theorem for convex problems, θ̂ ∈ C(c) for any
c > 0 if and only if θ̂ satisfies the KKT conditions, for some corresponding
λ0 ≥ 0:
0 ∈ λ0 ∂∥θ̂∥ + X ′ X θ̂ − X ′ y,
∥θ̂∥ ≤ c,
λ0 (∥θ̂∥ − c) = 0.

By Theorem 2.2.1, the first of these conditions implies that θ̂ ∈ P(λ0 ). Now
fix a λ > 0 and notice that P(λ) is not empty, given that its objective function
is convex, continuous and coercive; see Proposition 2.2.3. We can thus take
some θ̂ ∈ P(λ). Then, θ̂ satisfies the KKT conditions for c0 = ∥θ̂∥, which
implies θ̂ ∈ C(c0 ).
Note that the link between the penalty parameter λ and the constraint
parameter c is not explicit.

1.1.3 Geometric interpretation
We illustrate the geometry of the least squares, ridge, and lasso solutions
through a simple example. Consider the linear model (1.1), with p = 2,
ε0i ∼ iidN (0, 1), θ0 = [1.5, 0.5]′ , E[xi ε0i ] = 0, and

xi ∼ iidN ( [ 0 ; 0 ], [ 2  0 ; 0  1 ] ).

Figure 1.4 shows the level curves of the least squares loss function f (θ) :=
∥y − Xθ∥22 /2, corresponding to values f1 < f2 < f3 < f4 . Its minimizer, or
least squares solution θ̂nls , which coincides with the ridgeless solution θ̂nrl , is
highlighted in the figure.

[Level curves of the least squares loss in the (θ1 , θ2 ) plane, with the minimizer θ̂nrl marked.]
Figure 1.4: Geometry of the least squares solution.

To illustrate the geometry of the ridge solution, we consider the con-


strained formulation of the ridge problem; see Proposition 1.1.4. Figure 1.5
demonstrates the impact of imposing the ridge constraint, represented by the
sphere {θ ∈ R2 : ∥θ∥2 ≤ c} with c = 0.5, on the least squares problem. The
ridge solution θ̂nr is located at the intersection between the ridge constraint
and the lower level set of the least squares loss at the lowest height (see
Appendix 2.2 Definition 71) for which the intersection is non-empty. If the
ridgeless solution θ̂nrl lies within the constraint boundary, then θ̂nr coincides
with θ̂nrl . Otherwise, the ridge solution θ̂nr , by construction, is closer to the
origin than θ̂nrl , demonstrating the shrinkage effect of the ridge penalty. In
general, θ̂nr is dense (i.e., contains no zero elements) with probability one.

[Level curves of the least squares loss, the ridge constraint {θ : ∥θ∥2 ≤ c}, and the solutions θ̂nrl and θ̂nr .]
Figure 1.5: Geometry of the ridge solution.

Figure 1.6 illustrates the effect of the lasso constraint, represented by the rotated square {θ ∈ R2 : ∥θ∥1 ≤ c} with c = 0.5, on the least squares solution. Like the ridge solution, the lasso solution θ̂nl is located at the intersection between the lasso constraint and the lower level set of the least squares loss at the lowest height for which the intersection is non-empty. For small values of c, this intersection is more likely to occur along one of the coordinate axes. As a result, the lasso solution tends to be sparse, meaning that some components of θ̂nl are exactly zero.

[Level curves of the least squares loss, the lasso constraint {θ : ∥θ∥1 ≤ c}, and the solutions θ̂nrl and θ̂nl .]
Figure 1.6: Geometry of the lasso solution.
As discussed in Section 1.1, the lasso estimator serves as an approxima-
tion to the l0 −estimator (1.6). This relationship becomes evident through
visual comparison of Figure 1.6 and Figure 1.7. The lasso constraint set
{θ : ∥θ∥1 ≤ c} is the convex hull (i.e., the smallest convex superset) of
the constraint set underlying the l0 −estimator, which in this example is given by {θ : ∥θ∥0 ≤ 1, ∥θ∥∞ ≤ c}. Further details on this approximation can be found
in Argyriou et al. [2012].

[Level curves of the least squares loss, the l0 constraint {θ : ∥θ∥0 ≤ 1, ∥θ∥∞ ≤ c}, and the solutions θ̂nrl and θ̂nl0 .]
Figure 1.7: Geometry of the l0 solution.

To illustrate the geometry of the ridgeless solution, consider the linear


model (1.1) with p = 2, ε0i ∼ iidN (0, 1), xi1 ∼ iidN (0, 1), E[xi1 ε0i ] = 0, and
xi2 = 2xi1 . In this case, the predictors are linearly dependent. As a result,
the second-moment matrix of the predictors is reduced-rank:
 
E[xx′ ] = [ 1  2 ; 2  4 ],

with Rank(E[xx′ ]) = 1. Let E[xi1 yi ] = 1. The identifying condition E[xi ε0i ] =


0 holds if and only if the population coefficient θ0 satisfies
   
E[xx′ ]θ0 = E[xi yi ] ⇐⇒ [ 1  2 ; 2  4 ] θ0 = [ 1 ; 2 ].

Thus, any coefficient in the set θ0rl + Kernel(E[xx′ ]) satisfies this condition,
where
θ0rl := E[xx′ ]+ E[xi yi ] = [0.2, 0.4]′ .
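
The population computation above can be reproduced with NumPy's pseudo-inverse (a small sketch of the same calculation, nothing new):

```python
import numpy as np

Exx = np.array([[1.0, 2.0],
                [2.0, 4.0]])          # E[xx'], rank 1
Exy = np.array([1.0, 2.0])            # E[x y], using E[x_{i1} y_i] = 1 and x_{i2} = 2 x_{i1}

theta0_rl = np.linalg.pinv(Exx) @ Exy
print(theta0_rl)                      # [0.2, 0.4], the ridgeless estimand

# any theta0_rl + v with v in Ker(E[xx']) satisfies E[xx'] theta = E[xy] as well
v = np.array([2.0, -1.0])             # spans Ker(E[xx'])
assert np.allclose(Exx @ (theta0_rl + 3.0 * v), Exy)
```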

If the sample size n > Rank(E[xx′ ]), then Ker(X) ⊃ Ker(E[xx′ ]), and
the same issue arises in the finite-sample least squares problem, where the
objective function f (θ) is minimized at any point on the affine set

θ̂nrl + Ker(X).

Figure 1.8 depicts the level curves of f (θ) at f (θ̂nrl ) = f1 < f2 < f3 .
These curves are parallel lines, unlike the typical ellipses seen in full-rank
cases. The ridgeless estimator is the minimum l2 -norm solution to the least
squares problem, as expected from its construction.

[Parallel level lines f1 < f2 < f3 of the least squares loss, the ball {θ : ∥θ∥2 ≤ ∥θ̂nrl ∥2 }, and the ridgeless solution θ̂nrl .]
Figure 1.8: Geometry of the ridgeless solution.

1.1.4 Computation of lasso


By Theorem 2.2.1, a convex function f : Rp → R is minimized at θ ∗ ∈ Rp if and only if 0 ∈ ∂f (θ ∗ ). In the lasso problem, the objective function

f l : Rp → R; θ 7→ (1/2n) ∥y − Xθ∥22 + λ ∥θ∥1    (1.17)
contains the l1 −norm, which makes f l non-smooth.5 As a result, the subdif-
ferential of f l at a minimizer θ̂nl (λ) is not a singleton, implying that the lasso
estimator may not be unique. Moreover, due to the complexity of ∂f l (θ̂nl (λ)),
no closed-form solution exists in general for the lasso estimator.
Fortunately, Proposition 1.1.4 implies that the lasso problem is a quadratic program with a convex constraint, which allows for the computation of the lasso estimator using various quadratic programming algorithms. One particularly simple and effective method is the cyclical coordinate descent algorithm, which minimizes the convex objective function by cycling through the coordinates and minimizing over one coordinate at a time. This approach provides insight into how the lasso solution is obtained.
5 Notice that f l is the objective function in (1.5), multiplied by 1/n. This term does not show up in the penalization term as it is absorbed by λ.
Consider the soft-thresholding operator for a given λ > 0, which is defined
as the function

Sλ : R → R; η 7→ { η − λ if η > λ;   0 if η ∈ [−λ, λ];   η + λ if η < −λ }.

This operator is illustrated in Figure 1.9.

[Plot of Sλ (η) against η: equal to zero on [−λ, λ] and shifted toward zero by λ outside.]
Figure 1.9: Soft-thresholding operator.

The soft-thresholding operator provides a direct way to compute the lasso


estimator in a univariate regression model, i.e., when there is only one pre-
dictor.
Proposition 1.1.5 (Lasso solution for univariate regression). Given λ > 0
and X1 ∈ Rn such that X1′ X1 > 0, we have
θ̂nl (λ) := argmin_{θ∈R} { (1/2n) ∥y − X1 θ∥22 + λ|θ| } = Sλ (X1′ y/n) / (X1′ X1 /n).

Proof. The subdifferential of f : θ 7→ (1/2n) ∥y − X1 θ∥22 + λ|θ| at θ̂ ∈ R reads

∂f (θ̂) = bθ̂ − a + λ∂|θ̂|,
where a := X1′ y/n and b := X1′ X1 /n. From Theorem 2.2.1, and the subd-
ifferential of the absolute value function (Appendix 2.2, Example 7), θ̂ is a
minimizer of f if and only if

0 ∈ ∂f (θ̂) ⇐⇒ a ∈ bθ̂ + {λ} if θ̂ > 0,   a ∈ bθ̂ + [−λ, λ] if θ̂ = 0,   a ∈ bθ̂ + {−λ} if θ̂ < 0.

This condition reads: (i) if θ̂ > 0, then θ̂ = (a − λ)/b, implying a > λ; (ii)
if θ̂ = 0, then −λ ≤ a ≤ λ; and (iii) if θ̂ < 0, then θ̂ = (a + λ)/b, implying
a < −λ. These cases are summarized by θ̂ = Sλ (a)/b.
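
A minimal NumPy sketch of the soft-thresholding operator and of Proposition 1.1.5 (illustrative only; the closed-form univariate solution is checked against a brute-force grid search):

```python
import numpy as np

def soft_threshold(eta, lam):
    """S_lambda(eta): shrink eta toward zero by lam and set it to zero on [-lam, lam]."""
    return np.sign(eta) * np.maximum(np.abs(eta) - lam, 0.0)

rng = np.random.default_rng(6)
n, lam = 100, 0.2
x1 = rng.standard_normal(n)
y = 0.5 * x1 + rng.standard_normal(n)

# closed-form univariate lasso solution of Proposition 1.1.5
a, b = x1 @ y / n, x1 @ x1 / n
theta_hat = soft_threshold(a, lam) / b

# brute-force check on a fine grid: f(theta) = (1/2n)||y - x1 theta||^2 + lam |theta|
grid = np.linspace(-2.0, 2.0, 400001)
obj = 0.5 / n * (y @ y - 2.0 * grid * (x1 @ y) + grid ** 2 * (x1 @ x1)) + lam * np.abs(grid)
print(theta_hat, grid[np.argmin(obj)])   # the two values agree up to the grid spacing
```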
Proposition 1.1.5 can be used to show that the j−th coordinate of the
lasso solution in a multivariate regression model, i.e., when there is more
than just one predictor, satisfies an expression based on the soft-thresholding
operator applied to the residual of a lasso regression of y onto the predictors
Xk at position k ̸= j.

Theorem 1.1.4 (Lasso solution). Let Xj denote the j−th column of X and
X(−j) denote X without the j−th column. Assume that Xj′ Xj > 0 for all
j = 1, . . . , p. Then, given λ > 0, any lasso solution θ̂nl (λ) is such that for all
j = 1, . . . , p:

θ̂n,j (λ) = argmin_{θ∈R} { (1/2n) ∥ej − Xj θ∥22 + λ|θ| } = Sλ (Xj′ ej /n) / (Xj′ Xj /n),    (1.18)

where θ̂n,j (λ) is the j−th element of θ̂nl (λ), θ̂n,(−j) (λ) is θ̂nl (λ) without the j−th element, and ej := y − X(−j) θ̂n,(−j) (λ).

Proof. The subdifferential of the lasso objective function f l defined in (1.17) at θ̂ ∈ Rp is

∂f l (θ̂) = (X ′ X/n)θ̂ − X ′ y/n + λ∂∥θ̂∥1 ,

where

∂∥θ̂∥1 = {v ∈ Rp : vj ∈ ∂|θ̂j | for all j = 1, . . . , p},

and the subdifferential of the absolute value function |·| is given in Appendix 2.2, Example 7. From Theorem 2.2.1, a minimizer θ̂nl (λ) of f l satisfies

0 ∈ ∂f l (θ̂nl (λ)) ⇐⇒ X ′ y/n ∈ (X ′ X/n)θ̂nl (λ) + λ∂∥θ̂nl (λ)∥1 .
This condition holds if and only if for all j = 1, . . . , p:

Xj′ y/n ∈ (Xj′ X/n)θ̂nl (λ) + λ∂|θ̂n,j (λ)| ⇐⇒
Xj′ ej /n ∈ (Xj′ Xj /n)θ̂n,j (λ) + λ∂|θ̂n,j (λ)| ⇐⇒
θ̂n,j (λ) = Sλ (Xj′ ej /n) / (Xj′ Xj /n),

where the first double implication follows from

Xj′ X θ̂nl (λ) = Σ_{k=1}^{p} Xj′ Xk θ̂n,k (λ) = Xj′ X(−j) θ̂n,(−j) (λ) + Xj′ Xj θ̂n,j (λ),

and the last double implication follows from Proposition 1.1.5 since, by Theorem 2.2.1,

Xj′ ej /n ∈ (Xj′ Xj /n)θ̂n,j (λ) + λ∂|θ̂n,j (λ)| ⇐⇒ θ̂n,j (λ) = argmin_{θ∈R} { (1/2n) ∥ej − Xj θ∥22 + λ|θ| }.

Theorem 1.1.4 suggests that the lasso solution can be computed by a cyclical coordinate minimization algorithm. Given a candidate solution θ̂^(t) at iteration t + 1, this iterative algorithm updates one coordinate j as

θ̂_j^(t+1) = argmin_{θ∈R} f (θ̂_1^(t) , . . . , θ̂_{j−1}^(t) , θ, θ̂_{j+1}^(t) , . . . , θ̂_p^(t) ),

and sets θ̂_k^(t+1) = θ̂_k^(t) for k ̸= j. A typical choice for the lasso solution would
be to cycle through the coordinates in their natural order: from 1 to p. The
coordinate descent algorithm is guaranteed to converge to a global minimizer
of any convex cost function f : Rp → R satisfying the additive decomposition:
f : θ 7→ g(θ) + Σ_{j=1}^{p} hj (θj ),

where g : Rp → R is differentiable and convex, and the univariate functions hj : R → R are convex (but not necessarily differentiable); see Tseng
[2001]. What makes this algorithm work for the lasso problem is the fact
that objective function (1.17) satisfies this separable structure.

Remark 5. If the predictors are measured in different units, it is recommended
to standardize them so that Xj′ Xj = 1 for all j. In this case, the lasso update
(1.18) has the simpler form:
θ̂n,j (λ) = Sλ (Xj′ ej /n).

Algorithm 1 summarizes the pseudo-code of the cyclical coordinate de-


scent algorithm for computing the lasso estimator. This algorithm proceeds
by cyclically applying the soft-thresholding update in (1.18) for each coordinate, simultaneously updating the residuals ej := y − X(−j) θ̂n,(−j) (λ). The
ridgeless or the ridge estimators can be used to initialize the procedure.

Algorithm 1 Cyclical coordinate descent method for the lasso estimator.


Require: y ∈ Rn and X ∈ Rn×p such that Xj′ Xj > 0 for all j = 1, . . . , p
Require: Penalty parameter λ > 0
Require: Initial estimator θ̂ni (e.g., ridgeless or ridge)
Require: Maximum number of iterations T
  Standardize y and X so that y ′ 1 = 0, X ′ 1 = 0 and diag(X ′ X/n) = I
  θ̂^(1) ← θ̂ni
  for t = 2, . . . , T do
    θ̂^(t) ← θ̂^(t−1)
    for j = 1, . . . , p do
      ej ← y − X(−j) θ̂_(−j)^(t)
      θ̂_j^(t) ← Sλ (Xj′ ej /n)
    end for
    if a suitable stopping rule is satisfied then
      Stop and output θ̂^(t)
    end if
  end for
  Output θ̂^(T )
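
Below is a minimal Python/NumPy implementation of Algorithm 1 (an illustrative sketch, not the notes' reference code; the function name lasso_cd, the stopping rule and the tolerance are arbitrary choices). It keeps the full residual in sync instead of recomputing ej from scratch, and the result is checked against the subgradient optimality condition |Xj′ (y − X θ̂)/n| ≤ λ.

```python
import numpy as np

def soft_threshold(eta, lam):
    return np.sign(eta) * max(abs(eta) - lam, 0.0)

def lasso_cd(X, y, lam, theta_init=None, max_iter=1000, tol=1e-10):
    """Cyclical coordinate descent for (1/2n)||y - X theta||_2^2 + lam ||theta||_1.

    Assumes y and the columns of X have been centered/scaled as in Algorithm 1.
    """
    n, p = X.shape
    theta = np.zeros(p) if theta_init is None else theta_init.copy()
    col_sq = (X ** 2).sum(axis=0) / n              # X_j' X_j / n for each column
    resid = y - X @ theta                          # full residual, updated incrementally
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # partial residual e_j = y - X_{(-j)} theta_{(-j)} = resid + X_j theta_j
            rho = X[:, j] @ resid / n + col_sq[j] * theta[j]
            new_j = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (theta[j] - new_j)  # keep the residual in sync
            max_change = max(max_change, abs(new_j - theta[j]))
            theta[j] = new_j
        if max_change < tol:                       # simple stopping rule
            break
    return theta

# usage on simulated, standardized data
rng = np.random.default_rng(7)
n, p, lam = 200, 10, 0.1
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0, 0.5] + [0.0] * 7) + rng.standard_normal(n)
y = y - y.mean()

theta_hat = lasso_cd(X, y, lam)
# optimality check of Theorem 1.1.4: |X_j'(y - X theta)/n| <= lam for every j
grad = X.T @ (y - X @ theta_hat) / n
assert np.all(np.abs(grad) <= lam + 1e-8)
print(np.round(theta_hat, 3))
```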

In practice, it is often desirable to compute the lasso solution not for a single fixed value of λ, but for the entire solution path over a range of λ values. A common approach begins by selecting a value of λ just large enough that the only optimal solution is the zero vector. This value is denoted as λmax = maxj |Xj′ y|/n. From there, we gradually decrease λ by a small amount and run coordinate descent until convergence, using the previous solution as a "warm start" for the new value of λ. Repeating this as λ is decreased further, we can efficiently compute the solutions over a grid of λ values. This approach is
known as pathwise coordinate descent.
Coordinate descent is particularly efficient for the lasso because the up-
date rule (1.18) is available in closed form, eliminating the need for iterative
searches along each coordinate. Additionally, the algorithm exploits the in-
herent sparsity of the problem: for sufficiently large values of λ, most coeffi-
cients will be zero and will remain unchanged. There are also computational
strategies that can predict the active set of variables, significantly speeding
up the algorithm. More details on the pathwise coordinate descent algorithm
for lasso can be found in Friedman et al. [2007].
Homotopy methods are another class of techniques for computing the lasso estimator. They produce the entire path of solutions in a sequential fashion, starting at zero. A homotopy method that is particularly efficient at com-
puting the entire lasso path is the least angle regression (LARS) algorithm;
see Efron et al. [2004].

1.1.5 Finite-sample properties of ridgeless and ridge


This section presents finite-sample expressions and bounds for the bias, vari-
ance, MSE and MPR of the LSE, ridgeless and ridge estimators. The main
underlying assumption is that linear model (1.1) satisfies the typical regres-
sion condition E[xε0 ] = 0. Furthermore, we work with a fixed design matrix
X (or equivalently we work conditionally on X).
The next proposition derives the bias, MSE and MPR of the LSE when
it is well-defined, that is, when Rank(X) = p, which implies p ≤ n.
Proposition 1.1.6 (Finite-sample properties of LSE (fixed design)). Assume
that the linear model (1.1) holds with E[xε0 ] = 0. Then, for a fixed design
matrix such that Rank(X) = p:
(i) The LSE is unbiased: E[θ̂nls ] = θ0 .

(ii) The variance of the LSE is given by

Var[θ̂nls ] = (X ′ X)−1 X ′ Var[ε0 ]X(X ′ X)−1 .

Further let Var[ε0 ] = σ 2 I with σ > 0. Then:


(iii) Var[θ̂nls ] = σ 2 (X ′ X)−1 .

(iv) The LSE is the best linear unbiased estimator, in the sense that Var[θ̃n ]−
Var[θ̂nls ] is positive semi-definite for any other unbiased linear estimator
θ̃n , i.e., θ̃n = Ay for some A ∈ Rp×n .

31
(v) The MSE of the LSE is given by MSE(θ̂nls , θ0 ) = (σ²/n) Σ_{j=1}^{p} 1/λj , where λ1 ≥ . . . ≥ λp > 0 are the eigenvalues of X ′ X/n. Therefore,

MSE(θ̂nls , θ0 ) ≤ σ²p/(λp n).

(vi) The mean predictive risk of the LSE is given by:


MPR(θ̂nls , θ0 ) = pσ 2 /n.

Proof. (i) From the closed form expression (1.8):


θ̂nls = (X ′ X)−1 X ′ ε0 + θ0 .
Thus, unbiasedness follows directly since:
E[θ̂nls ] − θ0 = (X ′ X)−1 E[X ′ ε0 ] = 0.

(ii) The closed form expression of θˆnls immediately implies the expression
Var[θ̂nls ] = Var[(X ′ X)−1 X ′ ε0 ]
=(X ′ X)−1 X ′ Var[ε0 ]X(X ′ X)−1 .

(iii) This is immediate from (ii).


(iv) A linear estimator θ̃n = Ay is unbiased if and only if AX = I. Let
M := X(X ′ X)−1 X ′ , and notice that (I − M )(I − M ) = (I − M ),
i.e., (I − M ) is idempotent. It follows that:
Var[θ̃n ] − Var[θ̂nls ] =σ 2 (AA′ − (X ′ X)−1 )
=σ 2 (AA′ − AX(X ′ X)−1 X ′ A′ )
=σ 2 A(I − M )A′
=σ 2 [A(I − M )][A(I − M )]′ ,
which is positive semi definite.
(v) Using the linearity of the Trace operator and the SVD decomposition
of X in Definition 18:
E[∥θ̂nls − θ0 ∥22 ] = E[Trace((θ̂nls − θ0 )(θ̂nls − θ0 )′ )]
= Trace(Var[θ̂nls ])
= (σ²/n) Trace((X ′ X/n)−1 )
= (σ²/n) Trace(V ′ (S ′ S/n)−1 V ) = (σ²/n) Σ_{j=1}^{p} 1/λj .

(vi) Simple computations give

E[∥lm(X, θ̂nls ) − lm(X, θ0 )∥22 /n]


=E[∥X(θ̂nls − θ0 )∥22 /n]
=E[Trace((θ̂nls − θ0 )′ X ′ X/n(θ̂nls − θ0 ))]
=E[Trace((θ̂nls − θ0 )(θ̂nls − θ0 )′ X ′ X/n)]
= Trace(Var[θ̂nls ]X ′ X/n) = σ 2 p/n.

This proposition shows that the LSE’s accuracy decreases:


• as the variance σ 2 of the error term increases;

• as the number of predictors per observation p/n increases;

• as the ”degree of singularity” of the design matrix 1/λp increases.
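
The exact finite-sample formulas of Proposition 1.1.6 can be checked by simulation. The following sketch (illustrative only; sizes, noise level and seed are arbitrary) estimates the MSE and MPR of the LSE under a fixed Gaussian design and compares them with (σ²/n) Σ 1/λj and pσ²/n.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 60, 5, 1.5
X = rng.standard_normal((n, p))                      # fixed design with Rank(X) = p
theta0 = rng.standard_normal(p)
Xpinv = np.linalg.pinv(X)                            # here equal to (X'X)^{-1} X'
eigvals = np.linalg.eigvalsh(X.T @ X / n)            # eigenvalues lambda_j of X'X/n

reps = 20000
mse, mpr = 0.0, 0.0
for _ in range(reps):
    y = X @ theta0 + sigma * rng.standard_normal(n)
    theta_ls = Xpinv @ y                             # LSE
    mse += np.sum((theta_ls - theta0) ** 2) / reps
    mpr += np.sum((X @ (theta_ls - theta0)) ** 2) / n / reps

print("MSE:", mse, "theory:", sigma**2 / n * np.sum(1.0 / eigvals))
print("MPR:", mpr, "theory:", p * sigma**2 / n)
```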


We now aim to drop the requirement that Rank(X) = p, to allow for high-
dimensional settings where:
• n < p, or even

• Rank(E[xx′ ]) = r0 , not necessarily equal to p.

We next show that, when Rank(E[xx′ ]) < p, the typical linear regression condition E[xε0 ] = 0 no longer identifies a unique estimand θ0 .
Proposition 1.1.7. Given random variables y ∈ R and x ∈ Rp , let ε(θ) :=
y − x′ θ for every θ ∈ Rp . The set

S := {θ0 ∈ Rp : y = x′ θ0 + ε(θ0 ), E[xε(θ0 )] = 0}

is either empty or S = E[xx′ ]+ E[xy] + Ker(E[xx′ ]).


Proof. The fact that, in some cases, set S can be empty is obvious. Moreover,
since
S = {θ0 ∈ Rp : E[xx′ ]θ0 = E[xy]},
then E[xy] ∈ Range(E[xx′ ]) when S is not empty. In this case, it follows
from Theorem 2.1.5 that S = E[xx′ ]+ E[xy] + Ker(E[xx′ ]).
However, when S is non-empty, it contains exactly one element of Range(E[xx′ ]). This element is well-defined even when Rank(E[xx′ ]) < p, and it is equal to θ0 by construction when Rank(E[xx′ ]) = p. We define it as follows.

Definition 19 (Ridgeless estimand). The ridgeless estimand is defined as the
vector θ0rl ∈ Range(E[xx′ ]) given by θ0rl := E[xx′ ]+ E[xy].6
We can now extend Proposition 1.1.6 to the fixed design setting where
Rank(X) ≤ p.
Proposition 1.1.8 (Finite-sample properties of ridgeless (fixed design)).
Assume that the linear model (1.1) holds with E[xε0 ] = 0 and denote r0 :=
Rank(E[xx′ ]). Then, for a fixed design matrix:
(i) E[θ̂nrl ] = PRange(X ′ ) θ0rl . If Range(X ′ ) = Range(E[xx′ ]), which implies
n ≥ r0 , then the ridgeless estimator is unbiased:

E[θ̂nrl ] = θ0rl .

(ii) The variance of the ridgeless estimator is given by

Var[θ̂nrl ] = X + Var[ε(θ0rl )](X + )′ ,

where ε(θ0rl ) := y − Xθ0rl .


Further let Var[ε(θ0rl )] = σ 2 I with σ > 0, and define r := Rank(X) ≤
min{n, p}. Then:
(iii) Var[θ̂nrl ] = σ 2 (X ′ X)+ .
(iv) The MSE of the ridgeless estimator is given by:
MSE(θ̂nrl , θ0rl ) = (σ²/n) Σ_{j=1}^{r} 1/λj + ∥ PKer(X) θ0rl ∥22 ,
n j=1 λj

where λ1 ≥ . . . ≥ λr > 0 are the positive eigenvalues of X ′ X/n.


(v) The mean predictive risk of the ridgeless estimator is given by:

MPR(θ̂nrl , θ0rl ) = rσ 2 /n.

(vi) If Range(X ′ ) = Range(E[xx′ ]), we have


MSE(θ̂nrl , θ0rl ) = (σ²/n) Σ_{j=1}^{r0} 1/λj ≤ σ²r0 /(λr0 n),

and
MPR(θ̂nrl , θ0rl ) = r0 σ 2 /n.
6 The result that θ0rl ∈ Range(E[xx′ ]) follows from the identity E[xx′ ]+ = PRange(E[xx′ ]) E[xx′ ]+ . Notice that we used the ridgeless estimand in Section 1.1.3.

Proof. (i) Using Proposition 1.1.7, we have E[Xε(θ0rl )] = 0, which implies
E[ε(θ0rl )] = 0 under a (non-trivial) fixed design. Simple computations
then give
E[θ̂nrl ] = X + E[y] = X + Xθ0rl + X + E[ε(θ0rl )] = PRange(X ′ ) θ0rl .
If Range(X ′ ) = Range(E[xx′ ]), we conclude that E[θ̂nrl ] = θ0rl since
θ0rl ∈ Range(E[xx′ ]).
(ii) The closed-form expression of θ̂ rl immediately implies
Var[θ̂nrl ] = X + Var[ε(θ0rl )](X + )′ .

(iii) It follows since X + (X + )′ = (X ′ X)+ .


(iv) Using the fact that Rank(X) = r:
Trace(Var[θ̂nrl ]) = (σ²/n) Trace((X ′ X/n)+ ) = (σ²/n) Σ_{j=1}^{r} 1/λj .

Moreover,
Bias(θ̂nrl , θ0rl ) = (PRange(X ′ ) −I)θ0rl = − PKer(X) θ0rl .
The result then follows using Proposition 1.0.3.
(v) Proposition 1.0.1 and E[θ̂nrl ] = PRange(X ′ ) θ0rl imply lm(X, θ0rl ) = XE[θ̂nrl ].
Therefore:
E[∥lm(X, θ̂nrl ) − lm(X, θ0rl )∥22 /n]
=E[∥X(θ̂nrl − E[θ̂nrl ])∥22 /n]
=E[Trace{(θ̂nrl − E[θ̂nrl ])′ X ′ X/n(θ̂nrl − E[θ̂nrl ])}]
=E[Trace{(θ̂nrl − E[θ̂nrl ])(θ̂nrl − E[θ̂nrl ])′ X ′ X/n}]
= Trace(Var[θ̂nrl ]X ′ X/n)
=σ 2 /n Trace[(X ′ X)+ X ′ X] = σ 2 /n Trace(X + X),
where the last equality follows from the identity (X ′ X)+ X ′ = X + . Fi-
nally, considering the spectral decomposition X = U SV ′ in Definition
18, we obtain:
σ²/n Trace(X + X) = σ²/n Trace(V S + SV ′ ) = σ²/n Trace( [ Ir   0r×(p−r) ; 0(p−r)×r   0(p−r)×(p−r) ] ) = σ²r/n.

(vi) If Range(X ′ ) = Range(E[xx′ ]), then Rank(X) = r0 and Ker(X) =
Ker(E[xx′ ]). Therefore Bias(θ̂nrl , θ0rl ) = 0 as θ0rl ∈ Range(E[xx′ ]), and
we have
MSE(θ̂nrl , θ0rl ) = (σ²/n) Σ_{j=1}^{r0} 1/λj .
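
Proposition 1.1.8 can be illustrated in the same way with a rank-deficient fixed design (sketch only; rank, sizes and seed are arbitrary). Choosing θ0rl ∈ Range(X ′ ), the simulated mean predictive risk of the ridgeless estimator should be close to rσ²/n.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, r, sigma = 50, 8, 3, 1.0
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))   # fixed design, Rank(X) = r < p
Xp = np.linalg.pinv(X)
theta0_rl = (Xp @ X) @ rng.standard_normal(p)                   # an estimand in Range(X')

reps = 20000
mpr = 0.0
for _ in range(reps):
    y = X @ theta0_rl + sigma * rng.standard_normal(n)
    theta_rl = Xp @ y                                           # ridgeless estimator X^+ y
    mpr += np.sum((X @ (theta_rl - theta0_rl)) ** 2) / n / reps

print("simulated MPR:", mpr, "theory r*sigma^2/n:", r * sigma**2 / n)
```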

Proposition 1.1.9 (Finite-sample properties of ridge (fixed design)). As-


sume that the linear model (1.1) holds with E[xε0 ] = 0. Denote r0 :=
Rank(E[xx′ ]), λ > 0 and Q(λ) := (X ′ X + λI)−1 X ′ X. Then, for a fixed
design matrix:
(i) The ridge estimator is biased: E[θ̂nr (λ)] = Q(λ)θ0rl .
(ii) The variance of the ridge estimator is given by

Var[θ̂nr (λ)] = (X ′ X + λI)−1 X ′ Var[ε0 ]X(X ′ X + λI)−1 .

Further let Var[ε0 ] = σ 2 I with σ > 0. Then:


(iii) Var[θ̂nr (λ)] = σ 2 (X ′ X+λI)−1 X ′ X(X ′ X+λI)−1 . Moreover, Var[θ̂nrl ]−
Var[θ̂nr (λ)] is positive definite.
(iv) The MSE of the ridge estimator is given by:
MSE(θ̂nr (λ), θ0rl ) = (σ²/n) Σ_{j=1}^{r} λj /(λj + λ/n)² + ∥ [I − Q(λ)]θ0rl ∥22 ,

where λ1 ≥ . . . ≥ λr > 0 are the positive eigenvalues of X ′ X/n.


(v) The mean predictive risk of the ridge estimator is given by:
MPR(θ̂nr (λ), θ0rl ) = (σ²/n) Σ_{j=1}^{r} λj² /(λj + λ/n)² + ∥X[I − Q(λ)]θ0rl ∥22 /n.

(vi) If Range(X ′ ) = Range(E[xx′ ]), we have


lim_{λ→0} MSE(θ̂nr (λ), θ0rl ) = (σ²/n) Σ_{j=1}^{r0} 1/λj ≤ σ²r0 /(λr0 n),

and

lim_{λ→0} MPR(θ̂nr (λ), θ0rl ) = r0 σ²/n.

Proof. (i) Using the link between ridge and ridgeless estimators in identity
(1.14) we have, with θ0rl as defined in Definition 19:

E[θ̂nr (λ)] =Q(λ)E[θ̂nrl ] = Q(λ) PRange(X ′ ) θ0rl .

The result then follows from Q(λ) PRange(X ′ ) = Q(λ).

(ii) The closed-form expression of θ̂ r immediately implies

Var[θ̂nr (λ)] = (X ′ X + λI)−1 X ′ Var[ε0 ]X(X ′ X + λI)−1 .

(iii) The expression follows trivially from the previous item. To show that
Var[θ̂nrl ] − Var[θ̂nr (λ)] is positive definite, consider the spectral decom-
position X = U SV ′ in Definition 18. Since Rank(X) = r, we have X ′ X/n = Σ_{j=1}^{r} λj vj vj′ , where λj = sj²/n for j = 1, . . . , r. It follows that Var[θ̂nrl ] = (σ²/n) Σ_{j=1}^{r} (1/λj ) vj vj′ . Instead,

Var[θ̂nr (λ)] = σ² V (S ′ S + λI)−1 S ′ S(S ′ S + λI)−1 V ′ = (σ²/n) Σ_{j=1}^{r} λj /(λj + λ/n)² vj vj′ .

The result then follows using that, for j = 1, . . . , r,

1/λj > λj /(λj + λ/n)².

(iv) Using the linearity of the Trace and the fact that V is orthogonal:
Trace(Var[θ̂nr (λ)]) = Trace( (σ²/n) Σ_{j=1}^{r} λj /(λj + λ/n)² vj vj′ ) = (σ²/n) Σ_{j=1}^{r} λj /(λj + λ/n)² .

Moreover, Bias(θ̂nr (λ), θ0rl ) = [Q(λ) − I]θ0rl . The result then follows
using Proposition 1.0.3.

(v) In Proposition 1.1.8 we obtained E[θ̂nrl ] = PRange(X ′ ) θ0rl . Proposition


1.0.1 thus implies lm(X, θ0rl ) = XE[θ̂nrl ], and so we can write

lm(X, θ̂nr (λ)) − lm(X, θ0rl ) = X(Q(λ)θ̂nrl − E[θ̂nrl ]).

37
Therefore:

MPR(θ̂nr (λ), θ0rl ) = E[∥lm(X, θ̂nr (λ)) − lm(X, θ0rl )∥22 /n]
= E[Trace{(Q(λ)θ̂nrl − E[θ̂nrl ])′ X ′ X/n(Q(λ)θ̂nrl − E[θ̂nrl ])}]
= E[Trace{X ′ X/n(Q(λ)θ̂nrl − E[θ̂nrl ])(Q(λ)θ̂nrl − E[θ̂nrl ])′ }]
n h io
= Trace X ′ X/nE (Q(λ)θ̂nrl − E[θ̂nrl ])(Q(λ)θ̂nrl − E[θ̂nrl ])′ .

Let Q := Q(λ), E rl := E[θ̂nrl ] and V rl := Var[θ̂nrl ]. Then, the expected


value inside the Trace reads:
h ih i′ 
rl rl rl rl rl rl
E θ̂n − E − (I − Q)θ̂n θ̂n − E − (I − Q)θ̂n

= V rl + (I − Q)V rl (I − Q)′ − (I − Q)V rl − (I − Q)(I − Q)′


= QV rl Q′ .

Using Var[θ̂nrl ] = σ 2 (X ′ X)+ and the SVD decomposition of X in Def-


inition 18, we conclude:

MPR(θ̂nr (λ), θ0rl )


= σ 2 Trace X ′ X/nQ(λ)(X ′ X)+ Q(λ)′
 

= σ 2 Trace X ′ X/n(X ′ X + λI)−1 X ′ X(X ′ X)+ X ′ X(X ′ X + λI)−1


 

= σ 2 /n Trace S ′ S/n(S ′ S/n + λ/nI)−1 S ′ S/n(S ′ S)+ S ′ S(S ′ S/n + λ/nI)−1


 
r
σ2 X λ2j
= .
n j=1 (λj + λ/n)2

(vi) If Range(X ′ ) = Range(E[xx′ ]), then Rank(X) = r0 and Ker(X) =


Ker(E[xx′ ]). Therefore, since

lim Q(λ) = lim [(X ′ X + λI)−1 X ′ ]X = PRange(E[xx′ ]) ,


λ→0 λ→0

and θ0rl ∈ Range(E[xx′ ]), we obtain


r0  r0
σ2 X σ2 X

λj 2 1
lim 2
+ [I − Q(λ)]θ0rl 2
= .
λ→∞ n (λj + λ/n) n j=1 λj
j=1

The next proposition shows that there are penalty parameter values for
which the MSE of ridge is lower than the MSE of ridgeless.

38
Proposition 1.1.10. Assume that the linear model (1.1) holds with E[xε0 ] =
0 and Var[ε0 ] = σ 2 I for σ > 0. Then, for a fixed design matrix X, there
exists λ∗ > 0 such that

MSE(θ̂nr (λ∗ ), θ0rl ) < MSE(θ̂nrl , θ0rl ).

Proof. See Farebrother [1976].

1.1.6 Finite sample properties of lasso


In this section, we study the finite sample properties of the lasso estimator
under a fixed design matrix X. Given the lack of a closed form expression for
the lasso estimator, we do not have access to closed form expressions for its
bias and variance. Therefore, instead of deriving its MSE and MPR, we find
concentration inequalities for its estimation risk ∥θ̂nl (λ) − θ0 ∥22 and predictive
risk ∥X(θ̂nl (λ) − θ0 )∥22 /n.
Before doing that, we first show some auxiliary properties satisfied by any
lasso solution. Given a set C, let |C| denote its cardinality, i.e., the number of
elements in C, and consider the index set S ⊂ {1, . . . , p} with complementary
index set S c = {1, . . . , p} \ S. We use the notation vS = [vi ]i∈S ∈ R|S| for
the subvector of v ∈ Rp with entries indexed by S. Further define, for some
α ≥ 1, the set

Cα (S) := {v ∈ Rp : ∥vS c ∥1 ≤ α ∥vS ∥1 }.

In words, Cα (S) is the set of vectors in Rp whose subvector in S c has size


smaller or equal to α times the size of the subvector in S, where the size
of vectors is measured using the l1 −norm. Finally, consider the following
definition.
Definition 20 (Support of a vector). The support of vector θ ∈ Rp is defined
as
Supp(θ) := {j ∈ {1, . . . , p} : θj ̸= 0}.
The next lemma shows that, for an appropriate choice of the penalty
parameter, the lasso estimator satisfies some basic inequalities and has an
estimation error contained in Cα (S) for some α ≥ 1 and some index set S.

Lemma 1.1.5 (Auxiliary properties of lasso). Suppose that the linear model
(1.1) holds. If λ ≥ 2 ∥X ′ ε0 /n∥∞ > 0, then any lasso solution θ̂nl (λ) satisfies:

(i) The predictive risk bound

PR(θ̂nl (λ), θ0 ) ≤ 12λ ∥θ0 ∥1 . (1.19)

39
(ii) An estimation error η̂ := θ̂nl (λ) − θ0 ∈ C3 (Supp(θ0 )) such that

∥X η̂∥22 /n ≤ 3 s0 λ∥η̂∥2 . (1.20)

Proof. Under the linear model (1.1), we have for any θ ∈ Rp :


∥y − Xθ∥22 = y ′ y + θ ′ XXθ − 2θ0′ X ′ Xθ − 2ε′0 Xθ.

Since θ̂nl (λ) is a lasso solution,


1 1
0≤ ∥y − X θ̂nl (λ)∥22 + λ∥θ̂nl (λ)∥1 ≤ ∥y − Xθ0 ∥22 + λ ∥θ0 ∥1 ,
2n 2n
which holds if and only if
1
∥X η̂∥22 ≤ ε′0 X η̂/n + λ(∥θ0 ∥1 − ∥θ̂nl (λ)∥1 ). (1.21)
2n
(i) By Hölder inequality,
ε′0 X η̂/n ≤ |ε′0 X η̂/n| ≤ ∥X ′ ε0 /n∥∞ ∥η̂∥1 .
Thus, using the choice λ ≥ 2 ∥X ′ ε0 /n∥∞ in (1.21) yields
1
0≤ ∥X η̂∥22 ≤ λ/2∥η̂∥1 + λ(∥θ0 ∥1 − ∥θ̂nl (λ)∥1 ). (1.22)
2n
Using the triangle inequality
∥η̂∥1 ≤ ∥θ̂nl (λ)∥1 + ∥θ0 ∥1 , (1.23)
we further obtain
0 ≤ λ/2(∥θ̂nl (λ)∥1 + ∥θ0 ∥1 ) + λ(∥θ0 ∥1 − ∥θ̂nl (λ)∥1 ),
which, using that λ > 0, implies ∥θ̂nl (λ)∥1 ≤ 3∥θ0 ∥1 . Substituting this
result into the triangle inequality (1.23) yields
∥η̂∥1 ≤ 4∥θ0 ∥1 . (1.24)
Moreover, again by the triangle inequality,
∥θ0 ∥1 = ∥(θ0 + η̂) − η̂∥1 ≤ ∥θ0 + η̂∥1 + ∥η̂∥1 ,
which implies:
∥θ0 + η̂∥1 ≥ ∥θ0 ∥1 − ∥η̂∥1 . (1.25)
Using (1.24) and (1.25) in the basic inequality (1.22), we obtain
∥X η̂∥22 /n ≤λ∥η̂∥1 + 2λ(∥θ0 ∥1 − ∥θ0 + η̂∥1 )
≤3λ∥η̂∥1 ≤ 12λ∥θ0 ∥1 .

40
(ii) Let S0 := Supp(θ0 ). Using that θ0S0c = 0, we have

∥θ0 ∥1 − ∥θ̂nl (λ)∥1 = ∥θ0S0 ∥1 − ∥θ0S0 + η̂S0 ∥1 − ∥η̂S0c ∥1 . (1.26)

Substituting (1.26) into the basic inequality (1.22) yields:

0 ≤ ∥X η̂∥22 /n ≤ λ∥η̂∥1 + 2λ(∥θ0S0 ∥1 − ∥θ0S0 + η̂S0 ∥1 − ∥η̂S0c ∥1 ). (1.27)

By the triangle inequality,

∥θ0S0 ∥1 = ∥θ0S0 + η̂S0 − η̂S0 ∥1 ≤ ∥θ0S0 + η̂S0 ∥1 + ∥η̂S0 ∥1 .

Therefore, using the decomposition ∥η̂∥1 = ∥η̂S0 ∥1 + ∥η̂S0c ∥1 , (1.27)


reads

0 ≤ ∥X η̂∥22 /n ≤λ∥η̂∥1 + 2λ(∥η̂S0 ∥1 − ∥η̂S0c ∥1 )


=λ(3∥η̂S0 ∥1 − ∥η̂S0c ∥1 ),

which implies that η̂ ∈ C3 (Supp(θ0 )). Finally,


√ using the relation be-
tween the l1 − and the l2 −norm (∥v∥1 ≤ s ∥v∥2 for every v ∈ Rs ), we
conclude that

∥X η̂∥22 /n ≤λ(3∥η̂S0 ∥1 − ∥η̂S0c ∥1 ) ≤ 3λ∥η̂S0 ∥1 ≤ 3 s0 λ∥η̂S0 ∥2

≤3 s0 λ∥η̂∥2 .

We derive the main properties of lasso under the following restricted eigen-
value condition on the design matrix, which leverages the result that for
λ ≥ 2 ∥X ′ ε0 /n∥∞ , the estimation error of lasso θ̂nl (λ) − θ0 ∈ C3 (Supp(θ0 )).
Assumption 1 (Restricted eigenvalue condition). The design matrix X ∈
Rn×p is such that for all η ∈ C3 (Supp(θ0 )) there exists κ > 0 for which:

∥Xη∥22 /n ≥ κ ∥η∥22 ,

where θ0 ∈ Rp is the coefficient of linear model (1.1).


In the next proposition we derive bounds on the squared l2 estimation
risk and predictive risk. We then provide intuition on why the restricted
eigenvalue condition is required.
Theorem 1.1.6. Suppose that the linear model (1.1) holds and that Assump-
tion 1 holds. Let s0 := | Supp(θ0 )| ≤ p. Then, any lasso solution θ̂nl (λ) with
λ ≥ 2 ∥X ′ ε0 /n∥∞ > 0 satisfies:

41
(i) The estimation risk bound
9
∥θ̂nl (λ) − θ0 ∥22 ≤ 2
s 0 λ2 . (1.28)
κ
(ii) The predictive risk bound
9
PR(θ̂nl (λ), θ0 ) ≤ s 0 λ2 . (1.29)
κ
Proof. In Lemma (1.1.5), we obtained Inequality (1.20), which reads

∥X η̂∥22 /n ≤ 3 s0 λ∥η̂∥2 ,
where η̂ := θ̂nl (λ) − θ0 ∈ C3 (Supp(θ0 )).
(i) Using Assumption 1 on the left hand side of Inequality (1.20) yields

κ∥η̂∥22 ≤ 3 s0 λ∥η̂∥2 .
If ∥η̂∥2 > 0, the result follows by dividing both sides of the inequality
by ∥η̂∥2 . If instead ∥η̂∥2 = 0, the result is trivially obtained.
(ii) Using Assumption 1 on the right hand side of Inequality (1.20) yields,
√ √
∥X η̂∥22 /n ≤ 3 s0 λ∥X η̂∥2 / nκ.
If ∥X η̂∥2 >√0, the result follows by dividing both sides of the inequality
by ∥X η̂∥2 / n. If instead ∥X η̂∥2 = 0, the result is trivially obtained.

The sparsity parameter s0 = | Supp(θ0 )| plays a major role in the bounds


of Theorem 1.1.6. We say that θ0 is hard sparse if it has some zero entries.
More formally:
Definition 21 (Hard sparsity). Coefficient θ0 ∈ Rp is hard sparse if s0 :=
| Supp(θ0 )| < p.
In high-dimensional regimes, hard sparsity is typically imposed as an
identifying condition for θ0 . Consider an asymptotic regime where
lim p/n = K > 0, and lim s0 /p = s∞ ∈ (0, 1].
n→∞ p→∞

Then, s0 → ∞ as n → ∞. In this setting, the lasso converges to θ0 only


if κ and λ compensate for the divergence of s0 , i.e., limn→∞ s0 λ/κ2 = 0.
Notice however that 2 ∥X ′ ε0 /n∥∞ , the lower bound for λ in Theorem 1.1.6,
is monotonically non decreasing as we add columns to X. Moreover, intuition
suggests that Assumption 1 with a large κ is an increasingly more restrictive
assumption as p → ∞.

42
Remark 6. It is possible to extend the results in Lemma 1.1.5 and Theorem
1.1.6 using a milder restriction than hard sparsity, called weak sparsity. This
restriction formalizes the notion that θ0 can be well approximated by means
of a hard sparse vector.
Definition 22 (Weak sparsity). Coefficient θ0 ∈ Rp is weak sparse if θ0 ∈
Bq (r) where, for q ∈ [0, 1] and radius r > 0,

Bq (r) := {θ ∈ Rp : ∥θ∥qq ≤ r}.

Setting q = 0 in Definition 22 recovers Definition 21 with s = r. For q ∈ (0, 1],


we restrict the way the ordered coefficients

max |θ0j | = θ0(1) ≥ θ0(2) ≥ . . . , ≥ θ0(p−1) ≥ θ0(p) = min |θ0j |


j=1,...,p j=1,...,p

decay. More precisely, if the ordered coefficients satisfy the bound |θ0j | ≤
Cj −α for some suitable C ≥ 0 and α > 0, then θ0 ∈ Bq (r) for a radius r that
depends on C and α.

Restricted eigenvalue condition


Inequality (1.20) of Lemma (1.1.5) establishes an upper bound for the pre-
diction risk of the lasso solution in terms of its estimation risk. Conversely,
for an appropriate choice of the penalty parameter, the restricted eigenvalue
condition in Assumption 1 provides an upper bound for the estimation risk
of the lasso solution based on its prediction risk. These two bounds are
combined to obtain the estimation and predictive risk bounds in Theorem
1.1.6.
To provide more intuition on why the restricted eigenvalue condition is
needed, consider the constrained version of the lasso estimator:
 
lc 1 2
θ̂n (R) := argmin ∥y − Xθ∥2 : ∥θ∥1 ≤ R ,
θ∈Rp 2n

where the radius R := ∥θ0 ∥1 . With this choice, the true parameter θ0 is
feasible for the problem. Additionally, we have Ln (θ̂nlc (R)) ≤ Ln (θ0 ) where

Ln : Rp → R; θ 7→ ∥y − Xθ∥22 /(2n)

is the least squares loss function. Under mild regularity conditions, it can be
shown that the loss difference Ln (θ0 )−Ln (θ̂nlc (R)) decreases as the sample size
n increases. Under what conditions does this imply that the estimation risk,
∥η̂∥22 with η̂ := θ̂nlc (R) − θ0 , also decreases? Since Ln is a quadratic function,

43
the estimation risk will decrease if the function has positive curvature in every
direction (i.e., if there are no flat regions). This occurs when the Hessian,
∇2 Ln (θ̂nlc (R)) = X ′ X/n, has eigenvalues that are uniformly lower-bounded
by a positive constant κ. This condition is equivalently expressed as
∥Xη∥22 /n ≥ κ∥η∥22 > 0
for all nonzero η ∈ Rp .
In the high-dimensional setting, where p > n, the Hessian has rank at
most n, meaning that the least squares loss is flat in at least p − n directions.
As a result, the uniform curvature condition must be relaxed. By Lemma
1.1.5, the estimation error of lasso lies in the subset C3 (Supp(θ0 )) ⊂ Rp for an
appropriate choice of the penalty parameter (equivalently, of the constrained
radius R). For this reason, we require the condition to hold only in the
directions η that lie in C3 (Supp(θ0 )), hoping that | Supp(θ0 )| ≤ Rank(X).
With this adjustment, even in high-dimensional settings, a small difference
in the loss function still leads to an upper bound on the difference between
the lasso estimate and the true parameter.
Verifying that a given design matrix X satisfies the restricted eigenvalue
condition is challenging. Developing methods to discover random design
matrices that satisfy this condition with high probability remains an active
area of research.

Slow rates and fast rates


Consider assuming that the error term in linear model (1.1) is sub-Gaussian
with mean zero and variance proxy σ 2 . It is then possible to find a choice of
λ that only depends on the unknown σ and that ensures that the estimation
and prediction risks are upper-bounded with high probability.
Theorem 1.1.7. Suppose that the linear model (1.1) holds and that ε0 is a
vector if independent random variables with ε0i ∼ sub-G(σ 2 ) where variance
proxy σ > 0. Further √ suppose that the columns of X are standardized so that
maxj=1,...,p ∥Xj ∥2 / n ≤ C for some constant C > 0. Then, for all δ > 0,
any lasso solution θ̂nl (λ) with regularization parameter
p 
λ = 2Cσ 2 ln(p)/n + δ (1.30)
2 /2
satisfies with probability 1 − 2e−nδ :
p
PR(θ̂nl (λ), θ0 ) ≤ 24C ∥θ0 ∥1 σ( 2 ln(p)/n + δ). (1.31)
Further suppose that Assumption 1 holds and let s0 := | Supp(θ0 )| ≤ p. Then,
2
with probability 1 − 2e−nδ /2 :

44
(i) The estimation risk bound

72C 2 σ 2 s0
∥θ̂nl (λ) − θ0 ∥22 ≤ (2 ln(p)/n + δ 2 ). (1.32)
κ2

(ii) The predictive risk bound

72C 2 σ 2 s0
PR(θ̂nl (λ), θ0 ) ≤ (2 ln(p)/n + δ 2 ). (1.33)
κ

Proof. From the union bound:


 

P [∥X ε0 /n∥∞ ≥ t] = P max |Xj′ ε0 /n| ≥t
j=1,...,p

= P ∪j=1,...,p {|Xj′ ε0 /n| ≥ t ]



p
X
P |Xj′ ε0 /n| ≥ t .
 

j=1

Since ε01 , . . . , εon are independent random variables with sub-G(σ 2 ) distribu-
tion, from Proposition 2.3.8 we have that for any t ∈ R:
p p
!
X  ′  X t2
P |Xj ε0 /n| ≥ t ≤ 2 exp −
j=1 j=1
2σ 2 ∥Xj /n∥22
t2 n
 
≤ 2p exp − 2 2 .
2σ C
p 
Substituting t = Cσ 2 ln(p)/n + δ we get

t2 n
   
p2
2p exp − 2 2 = 2 exp(ln(p)) exp −nδ /2 − ln(p) − δ 2n ln(p)
2σ C
 p 
= 2 exp −nδ 2 /2 exp −δ 2n ln(p)


≤ 2 exp −nδ 2 /2 ,


p
since −δ 2n ln(p) < 0. We conclude that, for all δ > 0:
2
p
P[2 ∥X ′ ε0 /n∥∞ ≤ 2Cσ( 2 ln(p)/n + δ)] ≥ 1 − 2e−nδ /2 .
p
Consequently, if we set λ = 2Cσ( 2 ln(p)/n + δ), we obtain from (1.19) of
2
Lemma 1.1.5 that (1.31) holds with probability at least 1 − 2e−nδ /2 . More-
over, under Assumption 1, we obtain from (1.28) and (1.29) of Theorem 1.1.6

45
2 /2
that (1.32) and (1.33) hold with probability at least 1 − 2e−nδ , by using
the inequality:7
p
2 ln(p)/n + δ 2 + 2 2 ln(p)/nδ ≤ 2(2 ln(p)/n + δ 2 ).

Asplong as n ≥ 2 ln(p), the ratio 2 ln(p)/n can be significantly smaller


than 2 ln(p)/n. For this reason, the bounds (1.31) and (1.33) are often
referred to as the slow rates and fast rates for the prediction risk of lasso,
respectively.

7
This inequality follows from 2ab ≤ a2 + b2 for any two real numbers a and b.

46
Bibliography

Arthur Albert. Regression and the moore-penrose pseudoinverse. 1972.

Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with
the k-support norm. Advances in Neural Information Processing Systems,
25, 2012.

Sheldon Axler. Linear algebra done right. Springer Nature, 2024.

Heinz H Bauschke, Patrick L Combettes, Heinz H Bauschke, and Patrick L


Combettes. Correction to: convex analysis and monotone operator theory
in Hilbert spaces. Springer, 2017.

Patrick Billingsley. Probability and measure. John Wiley & Sons, 2017.

Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data:
methods, theory and applications. Springer Science & Business Media,
2011.

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least
angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Richard William Farebrother. Further results on the mean square error of


ridge regression. Journal of the Royal Statistical Society. Series B (Method-
ological), pages 248–250, 1976.

Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani.


Pathwise coordinate optimization. The annals of applied statistics, 1(2):
302–332, 2007.

Carl F Gauss. Theoria motus corporum coelestium in sectionibus conicis


solem ambientium. sumtibus Frid. Perthes et I. H. Besser, 1809.

Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Fried-


man. The elements of statistical learning: data mining, inference, and
prediction, volume 2. Springer, 2009.

47
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learn-
ing with sparsity. Monographs on statistics and applied probability, 143
(143):8, 2015.

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation


for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

Adrien-Marie Legendre. Nouvelles méthodes pour la détermination des or-


bites des comètes. F. Didot, 1805.

Eliakim H Moore. On the reciprocal of the general algebraic matrix. Bulletin


of the american mathematical society, 26:294–295, 1920.

Marc Nerlove et al. Returns to scale in electricity supply. Institute for math-
ematical studies in the social sciences, 1961.

Roger Penrose. A generalized inverse for matrices. In Mathematical pro-


ceedings of the Cambridge philosophical society, volume 51, pages 406–413.
Cambridge University Press, 1955.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society Series B: Statistical Methodology, 58(1):
267–288, 1996.

Ryan J Tibshirani. The lasso problem and uniqueness. The Electronic Jour-
nal of Statistics, 7:1456–1490, 2013.

Paul Tseng. Convergence of a block coordinate descent method for nondif-


ferentiable minimization. Journal of optimization theory and applications,
109:475–494, 2001.

Roman Vershynin. High-dimensional probability: An introduction with ap-


plications in data science, volume 47. Cambridge university press, 2018.

Martin J Wainwright. High-dimensional statistics: A non-asymptotic view-


point, volume 48. Cambridge university press, 2019.

48
Chapter 2

Appendix

2.1 Linear algebra


This section introduces a selection of definitions and results from linear alge-
bra that are used in these lecture notes. A book-length exposition of linear
algebra can be found in Axler [2024], among others.

Vector space
We introduce useful definitions and results for real vector spaces.
Definition 23 (Vector space). A (real) vector space is a set V along with an
addition on V and a scalar multiplication on V with the following properties:
1. (commutativity) u + v = v + u for all u, v ∈ V ;
2. (associativity) (u + v) + w = u + (v + w) and (ab)v = a(bv) for all
u, v, w ∈ V and all a, b ∈ R;
3. (additive identity) there exists an element 0 ∈ V such that v + 0 = v
for all v ∈ V ;
4. (additive inverse) for every v ∈ V , there exists w ∈ V such that v +w =
0;
5. (multiplicative identity) 1v = v for all v ∈ V ;
6. (distributive properties) a(u + v) = au + av and (a + b)v = av + bv for
all a, b ∈ R and all u, v ∈ V .
Definition 24 (Subspace). A subset U of a vector space V is a subspace of V
if U is a vector space, (using the same addition and scalar multiplication as
on V ).

49
Proposition 2.1.1. A subset U of a vector space V is a subspace of V if
and only if it satisfies these conditions:
(i) (additive identity) 0 ∈ U ;

(ii) (closed under addition) u, v ∈ U implies u + v ∈ U ;

(iii) (closed under scalar multiplication) a ∈ R and u ∈ U implies au ∈ U .


Definition 25 (Linear combination). A linear combination of vectors v1 , . . . , vn
in vector space V with coefficients a1 , . . . , an ∈ R is:

a1 v1 + . . . , +an vn .

Definition 26 (Span). The span of vectors v1 , . . . , vn in vector space V is


defined as

span(v1 , . . . , vn ) := {a1 v1 + . . . + an vn : a1 , . . . , an ∈ R}.

Definition 27 (Linear independence). The vectors v1 , . . . , vn in vector space


V are linearly independent if

{a1 , . . . , an ∈ R : a1 v1 + . . . + an vn = 0} = {a1 = . . . = an = 0}.

Definition 28 (Linear dependence). The vectors v1 , . . . , vn in vector space V


are linearly dependent if they are not linearly independent.
Definition 29 (Basis). A basis of a vector space V is a set of vectors in V
that are linearly independent and span V .

Inner products and norms


Definition 30 (Inner product). An inner product on a vector space V is a
function that takes each ordered pair (v, u) of elements of V to a number
⟨v, u⟩ ∈ R and satisfies:
1. (positivity) ⟨v, v⟩ ≥ 0 for all v ∈ V ;

2. (definiteness) ⟨v, v⟩ = 0 if and only if v = 0;

3. (additivity in first slot) ⟨v + u, w⟩ = ⟨v, w⟩ + ⟨u, w⟩ for all v, u, w ∈ V ;

4. (homogeneity in first slot) ⟨av, u⟩ = a⟨v, u⟩ for all a ∈ R and all v, u ∈


V;

5. (conjugate symmetry) ⟨v, u⟩ = ⟨u, v⟩ for all v, u ∈ V .

50
Proposition 2.1.2 (Basic properties of an inner product). An inner product
⟨·, ·⟩ on vector space V satisfies:
1. ⟨0, v⟩ = ⟨v, 0⟩ for every v ∈ V .
2. ⟨v, u + w⟩ = ⟨v, u⟩ + ⟨v, w⟩ for every v, u, w ∈ V .
3. ⟨v, au⟩ = a⟨v, u⟩ for all a ∈ R and all v, u ∈ V .
Definition 31 (Orthogonal vectors). Two vectors v and u in vector space V
are orthogonal if ⟨v, u⟩ = 0.
Definition 32 (Orthogonal subspace). U and W are orthogonal subspaces of
vector space V if ⟨u, w⟩ = 0 for all u ∈ U and all w ∈ W .
Definition 33 (Orthonormal basis). The set of vectors {v1 , . . . , vn } in vector
space V is an orthonormal basis of V if it is a basis of V such that ⟨vi , vj ⟩ = 0
and ∥vi ∥ = 1 for all i, j = 1, . . . , n with i ̸= j.
Definition 34 (Norms). Given innerpproduct ⟨·, ·⟩ on vector space V , the
norm of v ∈ V is defined by ∥v∥ := ⟨v, v⟩.
Proposition 2.1.3 (Properties of norms). For v in vector space V :
1. ∥v∥ = 0 if and only if v = 0.
2. ∥av∥ = a ∥v∥ for all a ∈ R.
Definition 35 (Linear function). L : V → W from a vector space V to another
vector space W is a linear function if:
(i) T (v + u) = T (v) + T (u) for all v, u ∈ V ;
(ii) T (av) = aT (v) for all a ∈ R and v ∈ V .
Theorem 2.1.1 (Cauchy–Schwarz inequality). Suppose v and u are two
vectors in vector space V . Then,
|⟨v, u⟩| ≤ ∥v∥ ∥u∥ .
This inequality is an equality if and only if there is a ∈ R such that v = au.
Theorem 2.1.2 (Triangle inequality). Suppose v and u are two vectors in
vector space V . Then,
∥v + u∥ ≤ ∥v∥ + ∥u∥ .
This inequality is an equality if and only if there is a ≥ 0 such that v = au.
Theorem 2.1.3 (Parallelogram equality). Suppose v and u are two vectors
in vector space V . Then,
∥v + u∥2 + ∥v − u∥2 = 2(∥v∥2 + ∥u∥2 ).

51
The Euclidean space
Definition 36 ((Real) n−tuple). A (real) n−tuple is a ordered list of n real
numbers.
With a slight abuse of terminology, we sometimes we use the term vector
to mean a (real) n−tuple.
Definition 37 (Real Euclidean space). The real Euclidean space of dimension
n, denoted Rn , is the set of all n−tuples.
Elements of a real Euclidean space are written in bold. For example,
a ∈ Rn , which means a = (a1 , . . . , an ) with a1 , . . . , an ∈ R.
Definition 38 (Euclidean inner product).
Pn The Euclidean inner product of
n
v, u ∈ R is defined as ⟨v, u⟩e := i=1 vi ui .
Definition 39 (lp −norm). The lp −norm ∥·∥p on Rn is defined for all v ∈ Rn
1/p
as ∥v∥p := ( ni=1 |vi |p ) when p ∈ [1, +∞), and ∥v∥p := maxni=1 |vi | when
P
p = +∞.

Theorem 2.1.4 (Hölder inequality). Suppose v, u ∈ Rp . Then,

|⟨v, u⟩| ≤ ∥v∥p ∥u∥q .

Matrices
Definition 40 (Matrix). An n × p matrix is a collection of p n−tuples.
The collection of all n × p matrices is denoted Rn×p . For a matrix A ∈
Rn×p , we write A = [A1 , . . . , Ap ] where A1 , . . . , Ap ∈ Rn are p n−tuples.
Written more explicitly,
 
A1,1 . . . A1,p
 . .. 
 
A =  .. ...
. ,
 
An,1 . . . An,p

that is, the elements of A, the n−tuples, are organized in columns. We


denote:

• the i, j−th element of A by Ai,j ;

• the j−th column A by Aj ;

• the i−th row A by A(i) .

52
Notice that a matrix in Rn×p can be equivalently seen as a collection of n
p−tuples, where the p−tuples represent the rows of the matrix.
Definition 41 (Column and row vector). A n−column vector is a n−tuple
seen as a matrix in Rn×1 . A n−row vector is a n−tuple seen as a matrix in
R1×n .
Throughout these lecture notes, we denote n−tuples as column vectors,
and use the simple notation v ∈ Rn instead of v ∈ Rn×1 .
Definition 42 (Matrix addition). The sum of two matrices of the same size is
the matrix obtained by adding corresponding entries in the matrices. That is,
for A, B ∈ Rn×p , we define A+B = C where C ∈ Rn×p and Ci,j = Ai,j +Bi,j
for i = 1, . . . , n and j = 1, . . . , p.
Definition 43 (Matrix-scalar multiplication). The product of a scalar and a
matrix is the matrix obtained by multiplying each entry in the matrix by the
scalar That is, for A ∈ Rn×p and a ∈ R, we define aA = B where B ∈ Rn×p
and Bi,j = aAi,j for i = 1, . . . , n and j = 1, . . . , p.
Definition 44 (Matrix multiplication). Given AP∈ Rn×p and B ∈ Rp×m , the
product AB = C where C ∈ Rn×m and Ci,j = pr=1 Ai,r Br,j for i = 1, . . . , n
and j = 1, . . . , m.
Note that we define the product of two matrices only when the number of
columns of the first matrix equals the number of rows of the second matrix.
Definition 45 (Transpose of a matrix). The transpose of a matrix A ∈ Rn×p
is the matrix B ∈ Rp×n with j, i−entry given by Bj,i = Ai,j for i = 1, . . . , n
and j = 1, . . . , p. We denote it by A′ .
It follows that the Euclidean inner product between v, u ∈ Rn is

⟨v, u⟩e = v ′ u.

Definition 46 (Range of a matrix). The range of a matrix A ∈ Rn×p is defined


as
Range(A) := {u ∈ Rn : u = Av for some v ∈ Rp }.
The range of a matrix is also called the column space, i.e., the space
spanned by the matrix’s columns, since:
Proposition 2.1.4. Let A = [A1 , . . . , Ap ] ∈ Rn×p . Then, Range(A) =
span(A1 , . . . , Ap ).
Definition 47 (Kernel of a matrix). The kernel, or null space, of a matrix
A ∈ Rn×p is defined as

Ker(A) := {v ∈ Rp : Av = 0}.

53
Proposition 2.1.5. Let A ∈ Rn×p . Then, Range(A) and Ker(A′ ) are or-
thogonal subspaces of Rn such that Rn = Range(A) + Ker(A′ ).
Definition 48 (Rank of a matrix). The rank of a matrix A ∈ Rn×p , denoted
Rank(A), is the maximum number of linearly independent columns of A.
Proposition 2.1.6. Let A ∈ Rn×p . Then, Rank(A) ≤ min{n, p}.
Definition 49 (Eigenvalue). λ ∈ R is an eigenvalue of A ∈ Rn×p if there
exists v ∈ Rp such that v ̸= 0 and
Av = λv.
Definition 50 (Eigenvector). Given matrix A ∈ Rn×p with eigenvalue λ ∈ R,
v ∈ Rp is an eigenvector of A ∈ Rn×p corresponding to λ if v ̸= 0 and
Av = λv.
Proposition 2.1.7. Every matrix A ∈ Rn×p has an eigenvalue.
Proposition 2.1.8. Let A ∈ Rn×p . Then, A has at most Rank(A) distinct
eigenvalues.
Proposition 2.1.9. Suppose λ1 , . . . , λr ∈ R are distinct eigenvalues of A ∈
Rn×p and v1 , . . . , vr ∈ Rp are corresponding eigenvectors. Then, v1 , . . . , vr
are linearly independent.
Definition 51 (Singular values). The singular values of A ∈ Rn×p are the
nonnegative square roots of the eigenvalues of A′ A.
Definition 52 (Symmetric matrix). A square matrix A ∈ Rn×n is symmetric
if A′ = A.
Definition 53 (Positive definite matrix). A square symmetric matrix A ∈
Rn×n is positive definite if v ′ Av > 0 for all v ∈ Rn such that v ̸= 0.
Definition 54 (Positive semi-definite matrix). A square symmetric matrix
A ∈ Rn×n is positive semi-definite if v ′ Av ≥ 0 for all v ∈ Rn .
Proposition 2.1.10. A square symmetric matrix A ∈ Rn×n is positive def-
inite (positive semi-definite) if and only if all of its eigenvalues are positive
(nonnegative).
Definition 55 (Identity matrix). The identity matrix on Rn is defined as
 
1 0
 
I :=  . . .  ∈ Rn×n .
 
 
0 1

54
Definition 56 (Diagonal of a matrix). The diagonal of a square matrix A ∈
Rn×n indicates the elements ”on the diagonal”: A1,1 , . . . , An,n .
Definition 57 (Diagonal matrix). A square matrix A ∈ Rn×n is a diagonal
matrix if all its elements outside of the diagonal are zero. We can write
A = diag(A1,1 , . . . , An,n ).
Definition 58 (Invertible matrix, matrix inverse). A square matrix A ∈ Rn×n
is invertible if there is a matrix B ∈ Rn×n such that AB = BA = I. We
call B the inverse of A and denote it by A−1 .
Proposition 2.1.11. A square matrix A ∈ Rn×n is invertible if and only if
Rank(A) = n, or equivalently, if and only if Ker(A) = ∅.
Proposition 2.1.12. A square matrix A ∈ Rn×n is invertible if and only if
it is positive definite.
Definition 59 (Orthogonal matrix). A square matrix P ∈ Rp×p is orthogonal,
or orthonormal, if P ′ P = P P ′ = I.
Definition 60 (Projection matrix). A square matrix P ∈ Rp×p is a projection
matrix if P = P 2 .
Definition 61 (Orthogonal projection matrix). A square matrix P ∈ Rp×p is
an orthogonal projection matrix if it is a projection matrix and P = P ′ .
Projections and orthogonal projections have the following properties.
Proposition 2.1.13. For any projection matrix P ∈ Rp×p and vector b ∈
Rp , we have
b = P b + (I − P )b.
If P is an orthogonal projection matrix, then
(P b)′ (I − P )b = 0.
Definition 62 (Trace). The trace of a square matrix A ∈ Rn×n , denoted
Trace(A), is the sum of its diagonal elements:
Trace(A) = A11 + . . . , An,n .
Proposition 2.1.14. The Trace is a linear function.
Proposition 2.1.15 (Properties of the trace). 1. Trace(A) = λ1 + . . . +
n×n
λn for all A ∈ R with λ1 , . . . , λn denoting the (not necessarily dis-
tinct) eigenvalues of A.
2. Trace(A) = Trace(A′ ) for all A ∈ Rn×n .
3. Trace(AB) = Trace(BA) for all for all A, B ∈ Rn×n .
4. Trace(A′ B) = Trace(AB ′ ) = Trace(B ′ A) = Trace(BA′ ) for all A, B ∈
Rn×p .

55
2.1.1 Moore-Penrose inverse
The Moore-Penrose inverse, or matrix pseudoinverse, is a generalization of
the inverse of a matrix that was independently introduced by Moore [1920]
and Penrose [1955].
Definition 63 (Moore-Penrose inverse). The matrix A+ ∈ Rp×n is a Moore-
Penrose inverse of A ∈ Rn×p if

(i) AA+ A = A;

(ii) A+ AA+ = A+ ;

(iii) (AA+ )′ = AA+ ;

(iv) (A+ A)′ = A+ A.

Properties and examples of the Moore-Penrose inverse


We now list the main properties of the Moore-Penrose inverse.

Proposition 2.1.16. For any matrix A ∈ Rn×p , the Moore-Penrose inverse


A+ exists and is unique.

Proposition 2.1.17. Let A ∈ Rn×p have Rank(A) = p. Then, A+ =


(A′ A)−1 A′ .

Proposition 2.1.18. Let the square matrix A ∈ Rp×p have Rank(A) = p.


Then, A+ = A−1 .

Proposition 2.1.19. Let A ∈ Rn×p . Then,

A+ = lim (A′ A + λI)−1 A′ = lim A′ (AA′ + λI)−1 .


λ→0 λ→0

Proof. See Albert [1972].

Proposition 2.1.20. Let A ∈ Rn×p . Then:

1. A = (A+ )+ .

2. A+ = (A′ A)+ A′ = A′ (AA′ )+ .

3. (A′ )+ = (A+ )′ .

4. (A′ A)+ = A+ (A′ )+ .

5. (AA′ )+ = (A′ )+ A+ .

56
6. Range(A+ ) = Range(A′ ) = Range(A+ A) = Range(A′ A).

7. Ker(A+ ) = Ker(AA+ ) = Ker((AA′ )+ ) = Ker(AA′ ) = Ker(A′ ).

Proposition 2.1.21. For any matrix A ∈ Rn×p :

1. AA+ is an orthogonal projection onto Range(A).

2. I − AA+ is an orthogonal projection onto Ker(A′ ).

3. A+ A is an orthogonal projection onto Range(A′ ).

4. I − A+ A is an orthogonal projection onto Ker(A).

We also collect some examples of the Moore-Penrose inverse.


(
a−1 a ̸= 0
Example 2. If a ∈ R, then a+ = .
0 a=0
Example 3. If A = diag(A1 , . . . , Ap−k , 0, . . . , 0) ∈ Rp×p , then

A+ = diag(1/A1 , . . . , 1/Ap−k , 0, . . . , 0).


 
1
Example 4. If A =  , then A+ = [1/5.2/5].
2
   
1 0 1 0
Example 5. If A =  , then A+ =  .
0 0 0 0
   
1 1 1/4 1/4
Example 6. If A =  , then A+ =  .
1 1 1/4 1/4

Systems of linear equations and least squares


The Moore-Penrose inverse plays a central role in the study of solutions to
systems of linear equations and linear least squares problems.

Theorem 2.1.5 (Solutions of systems of linear equations). For A ∈ Rn×p


and b ∈ Rn , let L := {θ ∈ Rp : Aθ = b}. The following statements hold:

(i) If b ∈
/ Range(A), then L is empty.

(ii) If b ∈ Range(A), then L = A+ b + Ker(A).

57
Corollary 2.1.6. Given a square matrix A ∈ Rp×p and b ∈ Rp , let L :=
{θ ∈ Rp : Aθ = b}. Then, A+ b is the unique element of L if and only if
Rank(A) = p. In this case, A+ = A−1 .
Corollary 2.1.7. For X ∈ Rn×p and y ∈ Rn :
argmin ∥y − Xθ∥22 = X + y + Ker(X).
θ∈Rp

2.1.2 Eigenvalue and Singular value decomposition


This section introduces the eigenvalue and the singular value decomposi-
tions, which are matrix factorizations with many applications to statistics
and machine learning.
Definition 64 (Singular value decomposition). The Singular Value Decom-
position (SVD) of a matrix A ∈ Rn×p with rank r := Rank(A) is given
by
Xr
A = U SV ′ = si ui vi′ ,
i=1
n×n p×p
where U ∈ R and V ∈ R are orthogonal matrix, and
 
diag(s1 , . . . , sr ) 0
S=  ∈ Rn×p ,
0 0

where s1 , . . . , sr are the positive singular values of A.


Proposition 2.1.22 (Existence). Any matrix A ∈ Rn×p admits a singular
value decomposition.
The next proposition demonstrates the relation of the SVD to the four
fundamental subspaces of a matrix.
Proposition 2.1.23. Consider Definition 64. Then,
(i) {u1 , . . . , ur } is an orthonormal basis of Range(A).
(ii) {ur+1 , . . . , un } is an orthonormal basis of Ker(A′ ).
(iii) {v1 , . . . , vr } is an orthonormal basis of Range(A′ ).
(iv) {vr+1 , . . . , vp } is an orthonormal basis of Ker(A).
Pr ′ ′
Pr ′
We thus have Range(A) = j=1 uj uP
j and Range(A ) = j=1 vj vj .
′ p ′
Moreover, if r < P p, we have Ker(A ) = j=r+1 uj uj , and if r < n, we
have Ker(A) = nj=r+1 vj vj′ .

58
Proposition 2.1.24. The Moore-Penrose inverse of a matrix A ∈ Rn×p
admitting SVD decomposition A = U SV ′ is given by A+ = V S + U ′ .

Definition 65 (Eigenvalue decomposition). The eigenvalue decomposition of a


square matrix A ∈ Rn×n with n linearly independent eigenvectors Q1 , . . . , Qn
corresponding to eigenvalues λ1 , . . . , λn is given by

A = QΛQ−1 ,

where Q = [Q1 , . . . , Qn ] and Λ = diag(λ1 , . . . , λn ).

Proposition 2.1.25 (Relation between the singular value and the eigenvalue
decompositions). Given a matrix A ∈ Rn×p with SVD A = U SV ′ :

1. A′ A = V S ′ SV ′ ;

2. AA′ = U SS ′ U ′ .

2.2 Convex analysis


This section introduces a selection of definitions and results from convex
analysis that are used in these lecture notes. A book-length exposition of
convex analysis can be found in Bauschke et al. [2017], among others.

Basic definitions
Definition 66 (Closed set). A set C ⊂ Rp is closed if it contains all of its
limit points.
Definition 67 (Bounded set). A set C ⊂ Rp is bounded if there exists r > 0
such that for all θ, β ∈ Rp , we have ∥θ − β∥ < r.
Definition 68 (Convex set). A set C ⊂ Rp is convex if for all 0 < α < 1 and
all θ, β ∈ C:
αθ + (1 − α)β ∈ C.
In particular, Rp and ∅ are convex.
Definition 69 (Epigraph of a function). The epigraph of function f : Rp →
(−∞, +∞] is
epi(f ) := {(θ, ξ) ∈ Rp × R : f (θ) ≤ ξ}.
Definition 70 (Domain of a function). The domain of function f : Rp →
(−∞, +∞] is
dom(f ) := {θ ∈ Rp : f (θ) < +∞}.

59
Definition 71 (Lower level set of a function). The lower level set of function
f : Rp → (−∞, +∞] at height ξ ∈ R is

lev≤ξ (f ) := {θ ∈ Rp : f (θ) ≤ ξ}.

Definition 72 (Proper function). function f : Rp → (−∞, +∞] is proper if


dom(f ) ̸= ∅.
Definition 73 (Convex function). Let f : Rp → (−∞, +∞] be a proper
function. Then f is convex if its epigraph epi(f ) is convex. Equivalently, f
is convex if for all 0 < α < 1 and all θ, β ∈ Rp such that θ ̸= β:

f (αθ + (1 − α)β) ≤ αf (θ) + (1 − α)f (β).

Definition 74 (Strictly convex function). Let f : Rp → (−∞, +∞] be a


proper function. Then f is strictly convex if for all 0 < α < 1 and all
θ, β ∈ Rp such that θ ̸= β:

f (αθ + (1 − α)β) < αf (θ) + (1 − α)f (β).

Definition 75 (Limit inferior). The limit inferior of f : Rp → (−∞, +∞] at


a point θ ∗ ∈ Rp is

lim inf

f (θ) = lim (inf{f (θ) : θ ̸= θ ∗ , ∥θ − θ ∗ ∥ ≤ ε}) .
θ→θ ε→0

Definition 76 (Lower semicontinuous function). Function f : Rp → (−∞, +∞]


is lower semicontinuous at θ ∗ ∈ Rp if

lim inf

f (θ) ≥ f (θ ∗ ).
θ→θ

Definition 77 (Coercive function). Function f : Rp → (−∞, +∞] is coercive


if
lim f (θ) = +∞.
∥θ∥→+∞

Definition 78 (Subdifferential). Let f : Rp → (−∞, +∞] be a proper func-


tion. The subdifferential of f is the set-valued operator:1
p
∂f : Rp → 2R ; θ 7→ {β ∈ Rp : ⟨v − θ, β⟩ + f (θ) ≤ f (v) ∀v ∈ Rp } .

Let θ ∈ Rp . Then f is subdifferentiable at θ if ∂f (θ) ̸= ∅; the elements of


∂f (θ) are the subgradients of f at θ.
1
Given a set C, the set of all subsets of C, including the empty set and C itself, is
denoted 2C . This set is called the power set of C.

60
Graphically, a vector β ∈ Rp is a subgradient of a proper function f :
p
R → (−∞, +∞] at θ ∈ dom(f ) if

fβ,θ : v 7→ ⟨v − θ, β⟩ + f (θ),

which coincides with f at θ, lies below f .


Example 7. The subdifferential of the absolute value function | · | at θ ∈ R is
given by 
{1},
 θ>0
∂|θ| = [−1, 1], θ = 0 .

{−1}, θ < 0

See Bauschke et al. [2017, Example 16.15].

Minimizers of convex optimization problems


Definition 79 (Global minimizer). θ ∗ is a (global) minimizer of a proper
function f : Rp → (−∞, +∞] over C ⊂ Rp if f (θ ∗ ) = inf θ∈C f (θ). The set
of minimizers of f over C is denoted by

argmin f (θ) = argmin{f (θ) : θ ∈ C}.


θ∈C θ∈Rp

Definition 80 (Local minimizer). θ ∗ is a local minimizer of a proper function


f : Rp → (−∞, +∞] if there exists c > 0 such that

θ ∗ ∈ argmin f (θ) : ∥θ∥ ≤ c}.


θ∈Rp

Proposition 2.2.1 (Convex problems: local minimizers are global mini-


mizers). Let f : Rp → (−∞, +∞] be proper and convex. Then every local
minimizer of f is a minimizer.

Proposition 2.2.2 (Convex problems: argmin is convex). Let f : Rp →


(−∞, +∞] be proper and convex and C ⊂ Rp . Then argminθ∈C f (θ) is
convex.

Proposition 2.2.3 (Existence of minimizers). Let f : Rp → (−∞, +∞] be


proper, convex and lower semicontinuous and C be a closed convex subset of
Rp such that C ∩ dom(f ) ̸= ∅. Suppose that one of the following holds:

(i) f is coercive.

(ii) C is bounded.

61
Then f has a minimizer over C.

Proof. Since C ∩ dom(f ) ̸= ∅, there exists θ ∈ dom(f ) such that D =


C ∩ lev≤f (θ) (f ) is not empty, closed and convex. Moreover, D is bounded
since C or lev≤f (θ) (f ) is. The result therefore follows from Bauschke et al.
[2017, Thm. 11.10].

Proposition 2.2.4 (Uniqueness of minimizers). Let f : Rp → (−∞, +∞] be


proper and strictly convex. Then f has at most one minimizer.

Proof. Set µ := inf θ∈Rp f (θ) and suppose that there exist two distinct points
θ1 , θ2 ∈ dom(f ) such that f (θ1 ) = f (θ2 ) = µ. Since θ1 , θ2 ∈ lev≤µ (f ), which
is convex, so does β = (θ1 + θ2 )/2. Therefore f (β) = µ. It follows from the
strict convexity of f that

µ = f (β) < max{f (θ1 ), f (θ2 )} = µ,

which is impossible.
Global minimizers of proper functions can be characterized by a simple
rule which extends a seventeenth century result due to Pierre Fermat.

Theorem 2.2.1 (Fermat’s rule). Let f : Rp → (−∞, +∞] be proper. Then

argmin f (θ) = {θ ∗ ∈ Rp : 0 ∈ ∂f (θ ∗ )}.


θ∈Rp

Proof. Let θ ∗ ∈ Rp . Then θ ∗ ∈ argminθ∈Rp f (θ) if and only if, for every
β ∈ Rp ,
⟨β − θ ∗ , 0⟩ + f (θ ∗ ) ≤ f (β).
By definition of subgradient, this last requirement reads 0 ∈ ∂f (θ ∗ ).

Theorem 2.2.2 (Hilbert projection theorem). For every vector θ ∈ Rp and


every nonempty closed convex C ⊂ Rp , there exists a unique vector β ∈ Rp
for which
∥θ − β∥22 = inf ∥θ − η∥22 .
η∈C

If C is a vector subspace of Rp , then the minimizer β is the unique element


in C such that θ − β is orthogonal to C.

62
2.3 Probability theory
This section introduces a selection of definitions and results from probability
theory that are used in these lecture notes. A book-length exposition of
probability theory can be found in Billingsley [2017] and Vershynin [2018],
among others.
All random variables are (real valued and) defined on the complete prob-
ability space (Ω, F, P).
Definition 81 (Cumulative Distribution Function). The Cumulative Distri-
bution Function (CDF) of random variable X is the function

FX : R → [0, 1]; x 7→ P[X ≤ x].

Definition 82 (Expected value). The expected value of a random variable X


is Z
E[X] := XdP.

Definition 83 (Moment generating function). the Moment Generating Func-


tion (MGF) of a random variable X is the function

MX : R → [0, +∞]; t 7→ E[exp(tX)].

Definition 84 (Moment of order p). For p ∈ R, the moment of order p of a


random variable X is E[|X|p ].
Definition 85 (Lp −norm). The Lp −norm of a random variable X is, for
p > 0:
∥X∥Lp := E[|X|p ]1/p ,
and for p = ∞:

∥X∥Lp := ess sup |X| := sup{b ∈ R : P({ω : X(ω) < b}) = 0}.

Definition 86 (Lp −space). The space Lp = Lp (Ω, F, P) consists of all random


variables X with finite Lp norm:

Lp := {X : ∥X∥Lp < +∞}.

Definition 87 (Conjugate exponents). p, q ∈ [1, ∞] are conjugate exponents


if 1/p + 1/q = 1.
The inner product between Lp and Lq where p, q ∈ [1, ∞] are conjugate
exponents is for all X ∈ Lp and Y ∈ Lq :

⟨X, Y ⟩Lp −Lq := E[XY ].

63
The inner product in L2 is for all X, Y ′ inL2 :

⟨X, Y ⟩L2 := E[XY ].

The Variance of X ∈ L2 is

Var[X] := E[(X − E[X])2 ] = ∥X − E[X]∥2L2 ,

and the standard deviation is


p
σ(X) := Var[X] = ∥X − E[X]∥Lp .

The covariance between X, Y ∈ L2 is

Cov[X, Y ] := ⟨X − E[X], Y − E[Y ]⟩L2 = E[(X − E[X])(Y − E[Y ])].

Classical inequalities
Theorem 2.3.1 (Jensen’s inequality). For any random variable X and a
convex function f : R → R, we have

f (E[X]) ≤ E[f (x)].

The following proposition is a consequence of Jensen’s inequality.


Proposition 2.3.1. For any random variable X and any p, q ∈ [0, ∞] with
p ≤ q:
∥X∥Lp ≤ ∥Y ∥Lq .
Therefore, Lp ⊂ Lq for any p, q ∈ [0, ∞] with p ≤ q.
Theorem 2.3.2 (Minkowski’s inequality). For any p ∈ [1, ∞] and any ran-
dom variables X, Y ∈ Lp :

∥X + Y ∥Lp ≤ ∥X∥Lp + ∥Y ∥Lp .

Theorem 2.3.3 (Cauchy-Schwarz inequality). For any random variables


X, Y ∈ L2 :
|⟨X, Y ⟩| = |E[XY ]| ≤ ∥X∥L2 ∥Y ∥L2 .
Theorem 2.3.4 (Hölder’s inequality). For any random variables X ∈ Lp
and Y ∈ Lq with conjugate exponents p, q ∈ (1, ∞):

|E[XY ]| ≤ ∥X∥Lp ∥Y ∥Lq .

This inequality also holds for p = 1 and q = ∞.

64
The tails and the moments of a random variable are connected.
Proposition 2.3.2 (Integral identity). For any nonnegative random variable
X: Z ∞
E[X] = P[X > x]dx.
0
The two sides of this identity are either both finite or both infinite.
Theorem 2.3.5 (Markov’s inequality). For any nonnegative random variable
X and x > 0:
P[X ≥ x] ≤ E[X]/x.
A consequence of Markov’s inequality is Chebyshev’s inequality, which
bounds the concentration of a random variable about its mean.
Theorem 2.3.6 (Chebyshev’s inequality). Let X be a random variable with
finite mean µ and finite variance σ 2 . Then, for any x > 0:
P[|X − µ| ≥ x] ≤ σ 2 /x2 .
Proposition 2.3.3 (Generalization of Markov’s inequality). For any random
variable X with mean µ ∈ R and finite moment of order p ≥ 1, and for any
x > 0:
P[|X − µ| ≥ x] ≤ E[|X − µ|p ]/xp .

Concentration of sums of independent random variables


Concentration inequalities quantify how a random variable deviates around
its mean.
Definition 88 (Symmetric Bernoulli distribution). A random variable X has
a symmetric Bernoulli distribution if
P[X = −1] = P[X = +1] = 1/2.
Theorem 2.3.7 (Hoeffding’s inequality). Let X1 , . . . , Xn be an independent
symmetric Bernoulli random variables, and a ∈ Rn . Then, for any x ≥ 0:
" n # !
X x2
P ai Xi ≥ x ≤ exp − 2 .
i=1
2 ∥a∥ 2

Theorem 2.3.8 (Two-sided Hoeffding’s inequality). Let X1 , . . . , Xn be an


independent symmetric Bernoulli random variables, and a ∈ Rn . Then, for
any x > 0: " n # !
X x2
P ai Xi ≥ x ≤ 2 exp − .
i=1
2 ∥a∥22

65
Theorem 2.3.9 (Hoeffding’s inequality for bounded random variables). Let
X1 , . . . , Xn be an independent random variables. Assume that Xi ∈ [li , ui ]
with li , ui ∈ R and li ≤ ui . Then, for any x > 0:
" n #
2x2
X  
P (Xi − E[Xi ]) ≥ x ≤ exp − Pn 2
.
i=1 i=1 (ui − li )

Theorem 2.3.10 (Chernoff’s inequality). Chernoff ’s inequality Let Xi be


independent Bernoulli random variables with parameter pi ∈ [0, 1]. Let Sn :=
P n
i=1 Xi and its mean µ := E[Sn ]. Then, for any x > 0:
 eµ x
P[Sn ≥ x] ≤ exp(−µ) .
x
Proposition 2.3.4 (Tails of the standard normal distribution). Let Z ∼
N (0, 1). Then, for all z > 0:
 
1 1 1 2 1 1 2
− 3 √ e−z /2 ≤ P[Z ≥ z] ≤ √ e−z /2 .
z z 2π z 2π
In particular, for z ≥ 1:
1 2
P[Z ≥ z] ≤ √ e−z /2 .

Proposition 2.3.5 (Tails of the normal distribution). Let X ∼ N (µ, σ 2 )
with µ ∈ R and σ > 0. Then, for all x ≥ 0:
 2
−x
P[X − µ ≥ x] ≤ exp .
2σ 2
Proposition 2.3.6. Let Z ∼ N (0, 1). Then, for all z ≥ 0:

P[|Z| ≥ z] ≤ 2 exp(−z 2 /2).

Proposition 2.3.7 (Sub-Gaussian properties). Let X be a random variable.


Then, there are constants C1 , . . . , C5 > 0 for which the following properties
are equivalent.
(i) The tails of X satisfy for all x ≥ 0:

P[|X| ≥ x] ≤ 2 exp(−x2 /C12 ).

(ii) The moments of X satisfy for all p ≥ 1:



∥X∥Lp = E[|X|p ]1/p ≤ C2 p.

66
(iii) The MGF of X 2 satisfies for all t ∈ R such that |t| ≤ 1/C3 :

E[exp(t2 X 2 )] ≤ exp(C32 t2 ).

(iv) The MGF of X 2 is bounded at some point, namely

E[exp(X 2 /C42 )] ≤ 2.

If further E[X] = 0, these properties are equivalent to:


(v) The MGF of X satisfies for all t ∈ R:

E[exp(tX)] ≤ exp(C52 t2 ).

Definition 89 (Sub-Gaussian random variables). A random variable X that


satisfies the equivalent conditions of Proposition 2.3.7 is a sub-Gaussian ran-
dom variable, denoted X ∼ sub-G.
Gaussian, symmetric Bernoulli, uniform, and bounded random variables
are examples of sub-Gaussian random variables. The tails of the distribu-
tion of a sub-Gaussian random variable decay at least as fast as the tails
of a Gaussian distribution. The Poisson, exponential, Pareto, and Cauchy
distribution are examples of distributions that are not sub-Gaussian.
Definition 90 (Variance proxy). For a random variable X ∼ sub-G, if there
is some s > 0 such that for all t ∈ R:

E[e(X−E[X])t ] ≤ exp(s2 t2 /2),

then s2 is called variance proxy.


Proposition 2.3.8 (Weighted sum of independent sub-Gaussian random
variables). Let X1 , . . . , Xn be independent sub-Gaussian random variables,
all with variance proxy σ 2 where σ > 0. Then, for any a ∈ Rn :
" n # !
X t2
P ai Xi ≥ t ≤ exp − ,
2σ 2 ∥a∥2
i=1 2

and " n # !
X t2
P ai Xi ≤ −t ≤ exp − .
i=1
2σ 2 ∥a∥22
Definition 91 (Sub-Gaussian norm). The sub-Gaussian norm ∥X∥ψ2 of ran-
dom variable X is defined as

∥X∥ψ2 := inf{t > 0 : E[exp(X 2 /t2 )] ≤ 2}.

67
Proposition 2.3.9. If X is a sub-Gaussian random variable, then X − E[X]
is sub-Gaussian and for a constant C > 0:

∥X − E[X]∥ψ2 ≤ C ∥X∥ψ2 .

Proposition 2.3.10 (Sums of independent sub-Gaussian). Let X1 , . .P


. , Xn be
independent sub-Gaussian random variables with mean zero. Then ni=1 Xi
is also a sub-Gaussian random variable, and, for a constant C > 0:
n 2 n
X X
Xi ≤C ∥Xi ∥2ψ2 .
i=1 ψ2 i=1

We can now extend the Hoeffding’s inequality to sub-Gaussian distribu-


tions.
Proposition 2.3.11 (General Hoeffding’s inequality). Let X1 , . . . , Xn be in-
dependent sub-Gaussian random variables with mean zero and C > 0 a con-
stant. Then, for every t ≥ 0:
" n # !
X Ct2
P Xi ≥ t ≤ 2 exp − Pn 2 .
i=1 i=1 ∥Xi ∥ψ2

Proposition 2.3.12. Let X1 , . . . , Xn be independent sub-Gaussian random


variables with mean zero, a ∈ Rn , K = maxni=1 ∥Xi ∥ψ2 and C > 0 a constant.
Then, for every t ≥ 0:
" n # !
X Ct2
P ai Xi ≥ t ≤ 2 exp − .
K 2 ∥a∥2
i=1 2

Proposition 2.3.13 (Khintchine’s inequality). Let X1 , . . . , Xn be indepen-


dent sub-Gaussian random variables, all with mean zero and unit variance
proxy, a ∈ Rn , K = maxni=1 ∥Xi ∥ψ2 and C > 0 a constant. Then, for every
p ∈ [2, ∞):
n
!1/2 n n
!1/2
X X √ X
a2i ≤ ai X i ≤ CK p a2i .
i=1 i=1 Lp i=1

The sub-Gaussian distribution does not embed distributions whose tails


are heavier than Gaussian.
Proposition 2.3.14 (Sub-exponential properties). Let X be a random vari-
able. Then, there are constants K1 , . . . , K5 > 0 for which the following prop-
erties are equivalent.

68
(i) The tails of X satisfy for all x ≥ 0:

P[|X| ≥ x] ≤ 2 exp(−x/K1 ).

(ii) The moments of X satisfy for all p ≥ 1:

∥X∥Lp = E[|X|p ]1/p ≤ K2 p.

(iii) The MGF of |X| satisfies for all t ∈ R such that 0 ≤ t ≤ 1/K3 :

E[exp(t|X|)] ≤ exp(L3 t).

(iv) The MGF of |X| is bounded at some point, namely

E[exp(|X|/K4 )] ≤ 2.

If further E[X] = 0, these properties are equivalent to:

(v) The MGF of X satisfies for all t ∈ R such that |t| ≤ 1/K5 :

E[exp(tX)] ≤ exp(K52 t2 ).

Definition 92 (Sub-exponential random variables). A random variable X that


satisfies the equivalent conditions of Proposition 2.3.14 is a sub-exponential
random variable.
Sub-Gaussian, Poisson, exponential, Pareto, Levy, Weibull, log-normal,
Cauchy, t-distributed random variables are examples of sub-exponential ran-
dom variables.
Definition 93 (Sub-exponential norm). The sub-exponential norm ∥X∥ψ1 of
random variable X is defined as

∥X∥ψ1 := inf{t > 0 : E[exp(|X|/t)] ≤ 2}.

Proposition 2.3.15. If X is a sub-exponential random variable, then X −


E[X] is sub-exponential and for a constant C > 0:

∥X − E[X]∥ψ1 ≤ C ∥X∥ψ1 .

Proposition 2.3.16 (Sub-exponential is sub-Gaussian squared). A random


variable X is sub-exponential if and only if X 2 is sub-Gaussian. Moreover,

X2 ψ1
:= ∥X∥2ψ2 .

69
Proposition 2.3.17 (Product of sub-Gaussians is sub-exponential). Let X
and Y be sub-Gaussian random variables. Then XY is sub-exponential.
Moreover,
∥XY ∥ψ1 = ∥X∥ψ2 ∥Y ∥ψ2 .

Theorem 2.3.11 (Bernstein’s inequality). Let X1 , . . . , Xn be independent


sub-exponential random variables with mean zero and let C > 0 be a constant.
Then, for every t ≥ 0:
" n # ( )!
X t2 t
P Xi ≥ t ≤ 2 exp −C min Pn 2 , .
i=1 i=1 ∥Xi ∥ψ
maxi ∥Xi ∥ψ1
1

Theorem 2.3.12 (Bernstein’s inequality for weighted sums). Let X1 , . . . , Xn


be independent sub-exponential random variables with mean zero, C > 0 be
a constant, K := maxni=1 ∥Xi ∥ψ1 and a ∈ Rn . Then:
" n
# ( )!
X t2 t
P ai Xi ≥ t ≤ 2 exp −C min 2 , .
i=1
K 2 ∥a∥2 K ∥a∥∞

Corollary 2.3.13 (Bernstein’s inequality for averages). Let X1 , . . . , Xn be


independent sub-exponential random variables with mean zero, C > 0 be a
constant and K := maxni=1 ∥Xi ∥ψ1 . Then:
" n
#   2 
X t t
P Xi /n ≥ t ≤ 2 exp −Cn min , .
i=1
K2 K

70
Alphabetical Index

Asymptotic distribution, 11 Estimator, 7


Euclidean inner product, 52
Basis, 50 Euclidean space, 52
Bernstein’s inequality, 70 Expected value, 63
Bias, 8
Bias-variance tradeoff, 10 Fermat’s rule, 62
Bounded set, 59 Fixed design, 5

Cauchy-Schwartz inequality, 64 General Hoeffding’s inequality, 68


Cauchy–Schwarz inequality, 51 Global minimizer, 61
Chebyshev’s inequality, 65
Closed set, 59 Hard sparsity, 42
Coercive function, 60 Hilbert projection theorem, 62
Column vector, 53 Hoeffding’s inequality, 65
Conjugate exponents, 63 Hölder inequality, 52
Consistency, 11 Hölder’s inequality, 64
Convex function, 60 Identity matrix, 54
Convex set, 59 Inner product, 50
Coordinate descent algorithm, 29 Integral identity, 65
Cumulative distribution function, Invertible matrix, 55
63
Jensen’s inequality, 64
Design matrix, 6
Diagonal matrix, 55 Khintchine’s inequality, 68
Domain, 59
Lp space, 63
Eigenvalue, 54 lp −norm, 3, 52
Eigenvalue decomposition, 59 Lasso, 12
Eigenvector, 54 Least squares, 12
Epigraph, 59 Limit inferior, 60
Estimand, 7 Linear combination, 50
Estimate, 7 Linear dependence, 50
Estimation risk, 8 Linear function, 51

71
Linear independence, 50 Projection matrix, 55
Linear model, 5 Proper function, 60
Linear prediction, 7
Local minimizer, 61 Random design, 6
Lower level set, 60 Restricted eigenvalue condition,
Lower semicontinuous function, 60 41
Ridge, 12
Markov’s inequality, 65 Ridgeless, 12
Matrix, 52 Ridgeless estimand, 34
Matrix addition, 53 Row vector, 53
Matrix diagonal, 55
Singular value decomposition, 58
Matrix kernel, 53
Singular values, 54
Matrix multiplication, 53
Soft-thresholding operator, 27
Matrix Range, 53
Span, 50
Matrix rank, 54
Strictly convex function, 60
Matrix transpose, 53
Sub-exponential norm, 69
Matrix-scalar multiplication, 53
Sub-exponential properties, 68
Mean predictive risk, 9
Sub-exponential random variables,
Mean squared error, 8
69
Minkowski’s inequality, 64
Sub-Gaussian norm, 67
Moment generating function, 63
Sub-Gaussian properties, 66
Moment of order p, 63
Sub-Gaussian random variable, 67
Moore-Penrose inverse, 56
Subdifferential, 60
n-tuple, 52 Subgradient, 60
Norm, 51 Subspace, 49
Symmetric Bernoulli distribution,
Orthogonal matrix, 55 65
Orthogonal projection, 55 Symmetric matrix, 54
Orthogonal subspaces, 51 System of linear equations, 57
Orthogonal vectors, 51
Trace, 55
Orthonormal basis, 51
Triangular inequality, 51
Pathwise coordinate descent, 31 Variance proxy, 67
Positive definite matrix, 54 Vector space, 49
Positive semi-definite matrix, 54
Predictive risk, 8 Weak sparsity, 43

72

You might also like