Lecture Notes on High Dimensional Linear Regression
Alberto Quaini
Background
It is assumed that readers have a solid background in calculus, linear alge-
bra, convex analysis, and probability theory. Some definitions and results
from these fields, relevant for the course, are provided in the Appendix for
reference.
Book-length references
The content of these lecture notes is inspired by a wide range of existing liter-
ature, but the presentation of topics follows my own interpretation and logical
structure. Although most of the content can be traced back to established
sources, certain sections reflect my perspective, and some material is origi-
nal to this course. For those interested in more comprehensive, book-length
discussions of related topics, the following key references are recommended:
Hastie et al. [2009], Bühlmann and Van De Geer [2011], Hastie et al. [2015],
and Wainwright [2019].
Disclaimer
Please note that despite my efforts, these lecture notes may contain errors. I
welcome any feedback, corrections, or suggestions you may have. If you spot
any mistakes or have ideas for improvement, feel free to contact me via email
at quaini@ese.eur.nl.
Contents
Notation
1 Linear Regression
  1.0.1 Estimators and their properties
  1.1 Least Squares and Penalized Least Squares
    1.1.1 Existence and uniqueness
    1.1.2 Equivalent expressions and relations
    1.1.3 Geometric interpretation
    1.1.4 Computation of lasso
    1.1.5 Finite-sample properties of ridgeless and ridge
    1.1.6 Finite-sample properties of lasso
2 Appendix
  2.1 Linear algebra
    2.1.1 Moore-Penrose inverse
    2.1.2 Eigenvalue and singular value decomposition
  2.2 Convex analysis
  2.3 Probability theory
Notation
• All random variables are defined on a complete probability space (Ω, F, P) and take values in a real Euclidean space.
• For a matrix A ∈ Rn×p, the (i, j)-th element is denoted Ai,j, the j-th column is denoted Aj, and the i-th row is denoted A(i), for i = 1, . . . , n and j = 1, . . . , p.
• The symbol ∂ indicates the subdifferential.
Chapter 1
Linear Regression
In a linear model, the target variable yi is related to the predictors xi ∈ Rp through the linear function1

f(x; θ) := x′θ = x1θ1 + . . . + xpθp,

evaluated at a true coefficient vector θ0 ∈ Rp:

yi = x′iθ0 + ε0i,   i = 1, . . . , n,   (1.1)
where ε0i are real-valued residual random variables. Figure 1.1 depicts a
linear model for i = 1, . . . , n observations yi , with two predictors x̃i = [1, xi ]′
consisting of a unit constant and a variable xi , a coefficient θ0 ∈ R2 and the
associated error terms ε0i . The Data Generating Process (DGP), i.e., the
joint distribution of the predictors x and the real-valued residual random
variables ε0 = [ε0i ]ni=1 , is subject to certain restrictions. Depending on the
type of restrictions imposed on the DGP, different types of linear models
are obtained. The two general forms of linear models are fixed and random
design models, which are defined as follows.
Definition 1 (Fixed design model). In a fixed design model, the sequence
(xi )ni=1 is fixed. The residuals ε0i are independent and identically distributed.
Footnote 1: An intercept in f(x; θ) can be introduced by adding a constant term to the predictors.
Definition 2 (Random design model). In a random design model, the pair
(xi , yi )ni=1 is a sequence of independent and identically distributed random
variables.
The fixed design model is particularly suitable when the predictors are
controlled by the analyst, such as the dose of medication administered to
patients in the treatment group in a clinical trial. Conversely, the random
design model is appropriate when the explanatory variables are stochastic,
such as the wind speed observed at a specific time and location.
[Figure 1.1 here: scatter plot of the observations (xi, yi), the fitted line lm(x̃i, θ0) = 0.5 + 0.8xi, and one residual ε0i; x on the horizontal axis, y on the vertical axis.]
Figure 1.1: Statistical linear model yi = x̃′i θ0 + ε0i where x̃i = [1, xi ]′ and
θ0 = [0.5, 0.8]′ .
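To make the setup of Figure 1.1 concrete, here is a minimal simulation sketch in Python. The coefficients θ0 = [0.5, 0.8]′ come from the caption; the uniform design on [3, 8] and the unit-variance Gaussian errors are illustrative assumptions, not specified in the notes.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(3.0, 8.0, size=n)            # predictor values (illustrative range)
eps = rng.normal(0.0, 1.0, size=n)           # residuals ε0i (assumed standard normal)
theta0 = np.array([0.5, 0.8])                # true coefficients [intercept, slope]

X_tilde = np.column_stack([np.ones(n), x])   # predictors x̃i = [1, xi]'
y = X_tilde @ theta0 + eps

# least squares fit recovers θ0 up to sampling noise
theta_hat, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
print("estimated [intercept, slope]:", theta_hat)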
The rest of the chapter is organized as follows. First, we study the most basic linear regression approach, the method of least squares projection, and some of its recent machine learning extensions. Our study focuses on their existence, uniqueness, connections, geometric interpretation, and computation. Then, we cover both their finite-sample (or small-sample) properties, which are valid for any given sample size, and their asymptotic properties, which are useful approximations when the sample size is large enough.
An estimator θ̂n is a function that takes as inputs the data (y, X) ∈ Rn × Rn×p and produces an estimate
θ̂n (y, X) ∈ Rp . For simplicity, we use the same notation, θ̂n , to refer to both
the estimator and the resulting estimate, although formally the estimate
should be written as θ̂n (y, X). We may also write θ̂n ∈ Rp to indicate that
an estimator outputs values in Rp .
Definition 5 (Linear prediction). The quantity lm(X, θ) := Xθ denotes
the linear prediction associated to coefficient vector θ ∈ Rp and predictors
X ∈ Rn×p .
Let M+ denote the Moore-Penrose inverse of a generic real-valued matrix M. We make extensive use of the following useful projections.

Definition 6 (Useful projections). Given a fixed matrix X ∈ Rn×p, define the orthogonal projections PRange(X) := XX+, PRange(X′) := X+X, and PKer(X) := I − X+X onto Range(X), Range(X′), and Ker(X), respectively.
The next proposition demonstrates that, if we fix the design matrix X, we can focus on regression coefficients in Range(X′). Indeed, coefficients in this set span all possible linear predictions that can be achieved through X.

Proposition 1.0.1. For any θ ∈ Rp, lm(X, θ) = lm(X, PRange(X′)θ), since XPRange(X′) = XX+X = X.
Finite-sample properties
Since an estimator is derived from data, it is a random variable. Intuitively, when comparing two estimators of the same estimand, we prefer the one whose probability distribution is "more concentrated" around the true value of the estimand. Formally, estimators are compared using several key properties.
Definition 7 (Bias). The bias of an estimator θ̂n of θ0 is the difference between
the expected value of the estimator and the estimand:
Bias(θ̂n , θ0 ) := E[θ̂n ] − θ0 .
Definition 11 (Mean predictive risk). The Mean Predictive Risk (MPR) of an estimator θ̂n of θ0 is the expected predictive risk of the estimator:

MPR(θ̂n, θ0) := E[∥X(θ̂n − θ0)∥²₂/n].

Proposition 1.0.2 decomposes the expected prediction error of a coefficient vector θ ∈ Rp as

E[∥y − Xθ∥²₂/n] = E[∥X(θ0 − θ)∥²₂/n] + E[∥ε0∥²₂/n] + (2/n)(θ0 − θ)′E[X′ε0].

Then, the result follows since E[X′ε0] = Σ_{i=1}^n E[Xi ε0i] = 0, where Xi denotes the i-th row of X.
If our primary goal is to accurately predict the target variable, we seek estimators θ̂ with a low expected prediction error E[∥y − Xθ̂∥²₂/n]. Since we cannot control the error term ε0, Proposition 1.0.2 suggests that we should focus on estimators with a low mean predictive risk.
On the other hand, if our interest lies in understanding which predictors
influence the target variable and how they do so, the true coefficient θ0
becomes our focus. In this case, we might prefer unbiased estimators – those
with zero bias – over biased ones. However, estimators with lower mean
squared error (MSE) are generally favored, even if they feature some bias.
The following proposition demonstrates that the MSE can be decomposed
into a bias and a variance term.
Proposition 1.0.3 (Bias-variance decomposition of MSE). Given an estimator θ̂n ∈ Rp for θ0 ∈ Rp, the MSE can be decomposed as follows:

MSE(θ̂n, θ0) = ∥Bias(θ̂n, θ0)∥²₂ + Trace(Var[θ̂n]).

Proof. The result follows from writing θ̂n − θ0 = (θ̂n − E[θ̂n]) + (E[θ̂n] − θ0), expanding E[∥θ̂n − θ0∥²₂], and noting that the cross term has zero expectation.
Loosely speaking, the bias and the variance of an estimator are linked to the estimator's "complexity". Estimators with higher complexity often fit the data better, resulting in lower bias, but they are more sensitive to data variations, leading to higher variance. Conversely, estimators with lower complexity tend to have lower variance but higher bias, a phenomenon known as the bias-variance tradeoff.
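The decomposition in Proposition 1.0.3 can be checked numerically. The sketch below (all parameters, the fixed Gaussian design, and the choice of a ridge estimator are illustrative assumptions) estimates the MSE, the squared bias, and the trace of the variance of a ridge estimator by Monte Carlo and compares the two sides of the decomposition.

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, lam, reps = 50, 5, 1.0, 2.0, 5000
X = rng.normal(size=(n, p))                  # fixed design, reused in every replication
theta0 = np.array([1.0, -0.5, 0.0, 0.25, 2.0])

def ridge(X, y, lam):
    # closed-form ridge estimator (X'X + λI)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

estimates = np.empty((reps, p))
for r in range(reps):
    y = X @ theta0 + sigma * rng.normal(size=n)
    estimates[r] = ridge(X, y, lam)

mse = ((estimates - theta0) ** 2).sum(axis=1).mean()        # E||θ̂ - θ0||²
bias_sq = ((estimates.mean(axis=0) - theta0) ** 2).sum()    # ||Bias||²
trace_var = estimates.var(axis=0).sum()                     # Trace(Var[θ̂])
print("MSE:", mse, " bias² + tr(Var):", bias_sq + trace_var)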
Apart from simple cases, computing the finite-sample properties of esti-
mators, such as their MSE or predictive risk, is infeasible or overly compli-
cated. This is because they require computations under the DGP of complex
transformations of the data. When direct computation is not possible, we
can rely on concentration inequalities or asymptotic approximations.
Concentration inequalities are inequalities that bound the probability
that a random variable deviates from a particular value, typically its ex-
pectation. In this chapter, we focus on inequalities that control the MSE or
predictive risk of an estimator, such as:

P[ d(θ̂n, θ0) ≤ h(y, X, n, p) ] ≥ 1 − δ

or

P[ d(lm(X, θ̂n), lm(X, θ0)) ≤ h(y, X, n, p) ] ≥ 1 − δ,

where δ ∈ (0, 1) controls the confidence level 1 − δ, d is a distance (on Rp in the first display and on Rn in the second), and h is a real-valued function of the data, the sample size, and the number of predictors.
Large-sample properties
Large-sample or asymptotic theory provides an alternative approach to study
and analyse estimators. Classically, this framework develops approximations
of the finite-sample properties of estimators, such as their distribution, MSE
or predictive risk, by letting the sample size n → ∞. Consequently, these
approximations work well when the sample size n is much larger than the
number of predictors p. More recently, asymptotic approximations have also been developed by letting p → ∞, or by letting both n and p → ∞ at some relative rate. Note that, given a sample of size n and number of variables p, there is no general rule for choosing the appropriate asymptotic regime for n and p: the quality of the corresponding asymptotic approximations should be assessed on a case-by-case basis. In this chapter, we work with two notions
from large-sample theory: consistency and asymptotic distribution.
Definition 12 (Consistency). Estimator θ̂n of θ0 is consistent, written θ̂n →P θ0 as n → ∞, if for all ε > 0:

lim_{n→∞} P[ ∥θ̂n − θ0∥2 > ε ] = 0.
Figure 1.2: Adrien-Marie Legendre (1752–1833) and Johann Carl Friedrich Gauss (1777–1855).
Definition 16 (Ridge estimator). The ridge estimator is defined for λ > 0 as:

θ̂nr(λ) = argmin_{θ∈Rp} (1/2) ∥y − Xθ∥²₂ + (λ/2) ∥θ∥²₂.   (1.4)

Definition 17 (Lasso estimator). The lasso estimator is defined for λ > 0 as:

θ̂nl(λ) ∈ argmin_{θ∈Rp} (1/2) ∥y − Xθ∥²₂ + λ ∥θ∥₁.   (1.5)
Here is a brief overview of the results that are discussed in detail in the
rest of this chapter. A solution to the least squares problem (1.2) always
exists. However, when the predictors (i.e., the columns of X) are linearly
dependent, there are infinitely many solutions.4 In such cases, the LSE typ-
ically considered is the ridgeless estimator, which is always unique.
The ridge and lasso estimators are penalized or regularized versions of
the LSE, with penalty term λ ∥θ∥22 and λ ∥θ∥1 , respectively. The penalty
parameter λ > 0 controls the strength of the penalty. The ridge estimator,
Footnote 4: This situation always arises when p > n, and it may arise even when p ≤ n.
introduced by Hoerl and Kennard [1970], was developed to address certain
shortcomings of the LSE, particularly in scenarios involving collinear or mul-
ticollinear designs – where the predictors in X are linearly dependent or
nearly-linearly dependent. The ridge estimator is uniquely defined and often
exhibits better statistical properties compared to those of the LSE in set-
tings with multicollinear or many predictors. On the other hand, the lasso
estimator, popularized by Tibshirani [1996], offers an approximation of the
l0 estimator, which is defined for some R > 0 as:

θ̂nl0(R) ∈ argmin_{θ∈Rp} { (1/2) ∥y − Xθ∥²₂ : ∥θ∥0 ≤ R },   (1.6)
where ∥θ∥0 is the number of nonzero elements in θ. A key feature of this es-
timator is its ability to produce sparse solutions, i.e., to set some coefficients
exactly to zero. Consequently, the l0 estimator can be used to perform pa-
rameter estimation and variable selection simultaneously. However, it is the
solution of a non-convex problem, and, in general, computing it can be an "NP-hard" problem. The lasso instead shares the ability to produce sparse solutions and can be computed efficiently even for large datasets.
Remark 1 (Data standardization). For computational stability, it is recom-
mended to compute linear regression estimators with a least squares loss
after having standardized the predictors X so that x̄ := X ′ 1/n = 0 and
Xj′ Xj = 1 for each j = 1 . . . , p. Without standardization, the solutions
would depend on the units used to measure the predictors. Moreover, we
may also center the target variable y, meaning ȳ := y ′ 1/n = 0. These
centering conditions are convenient, since they mean that we can omit the
intercept term. Given an optimal solution θ̂ on the centered data, we can
recover the optimal solutions for the uncentered data: θ̂ is the same and the
intercept is given by ȳ − x̄′ θ̂.
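A minimal sketch of the standardization in Remark 1 (the data-generating numbers below are illustrative): predictors are centered and scaled so that x̄ = 0 and Xj′Xj = 1, the target is centered, and the intercept is recovered as ȳ − x̄′θ̂.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(n, p))  # predictors in different units
y = 2.0 + X @ np.array([1.0, -0.2, 3.0]) + rng.normal(size=n)

x_bar = X.mean(axis=0)
y_bar = y.mean()
Xc = X - x_bar                          # centered predictors, so X'1/n = 0
Xs = Xc / np.linalg.norm(Xc, axis=0)    # scaled so that Xj'Xj = 1 for each column
yc = y - y_bar                          # centered target, so y'1/n = 0

# least squares on the standardized data (no intercept needed)
theta_std, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
# map back to the original scale of the predictors
theta_hat = theta_std / np.linalg.norm(Xc, axis=0)
intercept = y_bar - x_bar @ theta_hat
print("intercept:", intercept, "coefficients:", theta_hat)

The rescaling step at the end is needed because the claim in Remark 1 that θ̂ is unchanged refers to centering only; scaling the columns changes the units of the coefficients.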
We establish the following key results: the existence of the LSE, the
ridgeless, the ridge and the lasso estimators; the closed-form expression of
the LSE, ridgeless, and ridge; the uniqueness of the ridgeless and ridge; the
uniqueness of the LSE when Rank(X) = p, i.e., when the predictors in X are
linearly independent. Notice that this rank condition cannot hold if n < p.
θ̂nrl = X+y.   (1.7)

(iii) If Rank(X) = p, then the LSE and the ridgeless estimator are uniquely given in closed form by:

θ̂nls = θ̂nrl = (X′X)−1X′y.   (1.8)

(iv) The ridge estimator with penalty parameter λ > 0 exists, is an element of Range(X′), and is uniquely given in closed form by:

θ̂nr(λ) = (X′X + λI)−1X′y.   (1.9)
Proof. (i) The solutions of the least squares problem coincide with the solutions of the normal equations, collected in the set

S := {θ̂ ∈ Rp : X′Xθ̂ = X′y}.

The vector X+y belongs to S, since

X′XX+y = X′PRange(X)y = X′y.

Therefore, X+y + Ker(X) ⊂ S. Now consider a vector v ∈ Rp not in the set X+y + Ker(X). That is, v = θ̂ + u with θ̂ ∈ X+y + Ker(X) and 0 ≠ u ∈ Range(X′). Since Xu ≠ 0, and hence X′Xu ≠ 0,

X′Xv = X′y + X′Xu ≠ X′y.
(ii) The minimum norm least squares problem in (1.3) has a strictly convex and coercive objective function

f : Rp → R; θ ↦ ∥θ∥²₂,

restricted to the nonempty, closed and convex solution set X+y + Ker(X), and hence it admits a unique solution. Since X+y ∈ Range(X′) is orthogonal to Ker(X), for any v ∈ Ker(X):

∥X+y∥²₂ ≤ ∥X+y∥²₂ + ∥v∥²₂ = ∥X+y + v∥²₂,

so X+y is the minimum l2-norm element of X+y + Ker(X).
(X′X + λI) θ̂nr(λ) = X′y.

The matrix X′X + λI has all eigenvalues bounded below by λ > 0, so it is positive definite, and thus θ̂nr(λ) = (X′X + λI)−1X′y is the unique solution to the FOCs. Finally, to prove that θ̂nr(λ) ∈ Range(X′), notice that PRange(X′) = V S+S V′, where

S+S = [ Ir  0
        0   0 ].

Thus, writing θ̂nr(λ) = V (S′S + λI)−1S′U′y, only the first r coordinates of V′θ̂nr(λ) can be nonzero, and hence PRange(X′) θ̂nr(λ) = θ̂nr(λ), that is, θ̂nr(λ) ∈ Range(X′).
Remark 3 (Collinearity). Using the notation in Definition 18, the minimum
nonzero eigenvalue of X ′ X is s2r . If r < p, then X ′ X has p − r zero eigen-
values and the predictors are said to be collinear, that is, they are linearly
dependent. In this case Ker(X) is not trivial (it contains nonzero elements),
hence the LSE is not unique. Moreover, if sr ≈ 0, then the computation of
X+ = V [ diag(1/s1, . . . , 1/sr)  0
         0                        0 ] U′,

and hence of the ridgeless estimator, is unstable. The ridge estimator instead may not display these computational hurdles, provided that the penalty
parameter λ is large enough. That is because the minimum eigenvalue of
(X ′ X + λI) is s2r + λ. In Section 1.1.5 we show that, if sr ≈ 0, the ridge-
less (ridge) estimator’s MSE and MPR satisfy loose (sharp) concentration
inequalities.
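The following sketch illustrates Remark 3 on a made-up, nearly collinear design: the ridgeless estimator is computed through the Moore-Penrose inverse, while the ridge estimator solves the regularized normal equations, whose smallest eigenvalue is bounded below by λ.

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
z = rng.normal(size=n)
# third column is almost a copy of the first: nearly collinear design
X = np.column_stack([z, rng.normal(size=n), z + 1e-8 * rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

print("singular values of X:", np.linalg.svd(X, compute_uv=False))

# ridgeless estimator: X^+ y (numerically delicate when s_r ≈ 0)
theta_ridgeless = np.linalg.pinv(X) @ y

# ridge estimator: (X'X + λI)^{-1} X'y
lam = 1.0
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("ridgeless:", theta_ridgeless)
print("ridge    :", theta_ridge)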
Remark 4 (Uniqueness of the lasso solution). Tibshirani [2013] shows that,
under some conditions, the lasso estimator is unique. For instance, if the
predictors in X are in general position, then the lasso solution is unique.
Specifically, the predictors X1, . . . , Xp ∈ Rn are in general position if any affine subspace of Rn of dimension k < n contains at most k + 1 elements of the set {±X1, ±X2, . . . , ±Xp}, excluding antipodal pairs of points (that is,
points differing only by a sign flip). If the predictors are (non redundant)
continuous random variables, they are almost surely in general position, and
hence the lasso solution is unique. As a result, non-uniqueness of the lasso
solution typically occurs with discrete-valued data, such as those comprising
dummy or categorical variables.
Since the LSE, ridgeless, ridge and lasso estimators exist, their linear
predictions exist too. Moreover, the linear predictions of the uniquely defined
estimators, like ridgeless and ridge, are trivially unique. Remarkably, some
estimators that may not be unique entail unique linear predictions. The next
lemma implies that the LSE and lasso are among these estimators.
2.2.2, the set of minimizers of f is convex. Thus, for any α ∈ (0, 1):
δ = f(αθ1 + (1 − α)θ2)
  = (1/2) ∥y − X[αθ1 + (1 − α)θ2]∥²₂ + h(αθ1 + (1 − α)θ2)
  < α (1/2) ∥y − Xθ1∥²₂ + (1 − α) (1/2) ∥y − Xθ2∥²₂ + h(αθ1 + (1 − α)θ2)
  ≤ α (1/2) ∥y − Xθ1∥²₂ + (1 − α) (1/2) ∥y − Xθ2∥²₂ + α h(θ1) + (1 − α) h(θ2)
  = α f(θ1) + (1 − α) f(θ2) = δ,

which is a contradiction; hence Xθ1 = Xθ2.
(i) The linear predictions of the LSE and the ridgeless estimator are uniquely
given by:
lm(X, θ̂nls ) = lm(X, θ̂nrl ) = PRange(X) y, (1.10)
which is the unique vector v ∈ Range(X) such that ⟨y − v, w⟩ = 0 for all w ∈ Range(X).
(ii) The linear prediction of the ridge estimator is uniquely given, for λ > 0,
by
lm(X, θ̂nr (λ)) = X(X ′ X + λI)−1 X ′ y.
Proof. (i) The linear predictions lm(X, θ̂nls ) and lm(X, θ̂nrl ) are uniquely
given by (1.10) because all solutions to the least squares problem θ̂ ∈
X+y + Ker(X) yield the same prediction Xθ̂ = XX+y = PRange(X)y.
By the definition of θ̂nls and the fact that Range(X) is a closed vector
subspace of Rn , the remaining claims follow as a direct application of
the Hilbert projection theorem (Theorem 2.2.2).
(ii) This result follows directly from the closed form expression (1.9) of the
ridge estimator.
(iii) Since the l1 −norm is convex, the result follows by Lemma 1.1.2.
Proof. (i) From the closed-form expression of the ridgeless estimator,

θ̂nrl = X+y = V S+U′y = Σ_{j=1}^r (1/sj) vj uj′ y.

Therefore,

X θ̂nrl = U S S+ U′ y = Σ_{j=1}^r uj uj′ y.   (1.11)

Similarly, from the closed-form expression (1.9) of the ridge estimator, θ̂nr(λ) = V (S′S + λI)−1 S′U′ y. Therefore,

X θ̂nr(λ) = U S (S′S + λI)−1 S′ U′ y = Σ_{j=1}^r ( sj² / (sj² + λ) ) uj uj′ y.   (1.12)
Using Definition 18, matrix PRange(X) = XX+ = Σ_{j=1}^r uj uj′, where {u1, . . . , ur} is an orthonormal basis of Range(X). From expression (1.11),
it follows that the prediction of the ridgeless estimator is the orthogonal
projection of y onto the range of X. Expression (1.12) instead shows that
the ridge estimator shrinks this projection, shrinking less the directions uj
associated to high variance (high sj ), and more the directions uj associated
to low variance (low sj ); see Figure 1.3. Indeed, for fixed λ > 0, the weight
s2j /(s2j + λ) → 0 as sj → 0, and s2j /(s2j + λ) → 1 as sj → ∞.
[Figure 1.3: the shrinkage factor f(s) = s²/(s² + λ) plotted against the singular value s; it increases from 0 toward 1 as s grows.]
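Expressions (1.11) and (1.12) can be verified numerically. The sketch below (simulated data; all names are illustrative) recomputes the ridgeless and ridge predictions from the thin SVD of X and checks the shrinkage weights sj²/(sj² + λ).

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 4, 3.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(s) V'

pred_ridgeless = X @ (np.linalg.pinv(X) @ y)
pred_ridgeless_svd = U @ (U.T @ y)                 # Σ_j u_j u_j' y
print(np.allclose(pred_ridgeless, pred_ridgeless_svd))

theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
weights = s**2 / (s**2 + lam)                      # shrinkage factors s_j^2 / (s_j^2 + λ)
pred_ridge_svd = U @ (weights * (U.T @ y))         # Σ_j [s_j^2/(s_j^2+λ)] u_j u_j' y
print(np.allclose(X @ theta_ridge, pred_ridge_svd))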
(X′X + λI)−1X′y = (X′X + λI)−1X′XX+y,

and thus

θ̂nr(λ) = (X′X + λI)−1X′X θ̂nrl.

Using X+ = X+(X+)′X′, we have

X+y = X+(X+)′(X′X + λI)(X′X + λI)−1X′y.

Moreover, X+(X+)′ = (X′X)+ implies

θ̂nrl = (X′X)+(X′X + λI) θ̂nr(λ).

Finally, since X+ = limλ→0 (X′X + λI)−1X′, we have limλ→0 θ̂nr(λ) = θ̂nrl.
Expression (1.16) explains why estimator (1.3) is called the ridgeless esti-
mator. The ridge and lasso estimators can be expressed as constrained least
squares problems.
Proposition 1.1.4 (Equivalence between penalized and constrained least
squares). For c ≥ 0, λ ≥ 0, and some norm ∥·∥ : Rp → R, define:

C(c) := argmin_{θ∈Rp} { ∥y − Xθ∥²₂/2 : ∥θ∥ ≤ c };
P(λ) := argmin_{θ∈Rp} ∥y − Xθ∥²₂/2 + λ∥θ∥.
Then, for a given c > 0, there exists λ0 ≥ 0 such that C(c) ⊂ P(λ0 ). Con-
versely, for a given λ > 0, there exists c0 ≥ 0 such that P(λ) ⊂ C(c0 ).
Proof. The objective function h : θ 7→ ∥y − Xθ∥22 /2 is convex and continu-
ous, and the constraint set {θ ∈ Rp : ∥θ∥ ≤ c} is not empty, closed, bounded
and convex. By the KKT theorem for convex problems, θ̂ ∈ C(c) for any
c > 0 if and only if θ̂ satisfies the KKT conditions, for some corresponding
λ0 ≥ 0:
0 ∈ λ0 ∂∥θ̂∥ + X ′ X θ̂ − X ′ y,
∥θ̂∥ ≤ c,
λ0 (∥θ̂∥ − c) = 0.
By Theorem 2.2.1, the first of these conditions implies that θ̂ ∈ P(λ0 ). Now
fix a λ > 0 and notice that P(λ) is not empty, given that its objective function
is convex, continuous and coercive; see Proposition 2.2.3. We can thus take
some θ̂ ∈ P(λ). Then, θ̂ satisfies the KKT conditions for c0 = ∥θ̂∥, which
implies θ̂ ∈ C(c0 ).
Note that the link between the penalty parameter λ and the constraint
parameter c is not explicit.
1.1.3 Geometric interpretation
We illustrate the geometry of the least squares, ridge, and lasso solutions
through a simple example. Consider the linear model (1.1), with p = 2,
ε0i ∼ iiN(0, 1), θ0 = [1.5, 0.5]′, E[xi ε0i] = 0, and

xi ∼ iiN( [0, 0]′, diag(2, 1) ).
Figure 1.4 shows the level curves of the least squares loss function f (θ) :=
∥y − Xθ∥22 /2, corresponding to values f1 < f2 < f3 < f4 . Its minimizer, or
least squares solution θ̂nls , which coincides with the ridgeless solution θ̂nrl , is
highlighted in the figure.
[Figure 1.4: level curves of the least squares loss f(θ) and the ridgeless solution θ̂nrl in the (θ1, θ2) plane.]
[Figure 1.5: the ridgeless solution θ̂nrl, the ridge solution θ̂nr, and the ridge constraint set {θ : ∥θ∥2 ≤ c} in the (θ1, θ2) plane.]
Figure 1.6 illustrates the effect of the lasso constraint, represented by the
rotated square {θ ∈ R2 : ∥θ∥1 ≤ c} with c = 0.5, on the least squares
solution. Like the ridge solution, the lasso solution θ̂nl is located at the intersection of the lasso constraint and the lower level set of the least squares loss at the lowest height for which the intersection is non-empty.
For small values of c, this intersection is more likely to occur along one of the
coordinate axes. As a result, the lasso solution tends to be sparse, meaning
that some components of θ̂nl are exactly zero.
[Figure 1.6: the ridgeless solution θ̂nrl, the lasso solution θ̂nl, and the lasso constraint set {θ : ∥θ∥1 ≤ c} in the (θ1, θ2) plane.]
As discussed in Section 1.1, the lasso estimator serves as an approxima-
tion to the l0 −estimator (1.6). This relationship becomes evident through
visual comparison of Figure 1.6 and Figure 1.7. The lasso constraint set
{θ : ∥θ∥1 ≤ c} is the convex hull (i.e., the smallest convex superset) of
the constraint set underlying the l0 −estimator, which is given by: {θ :
∥θ∥0 ≤ c, ∥θ∥∞ ≤ 1}. Further details on this approximation can be found
in Argyriou et al. [2012].
[Figure 1.7: the ridgeless solution θ̂nrl, the l0-constrained solution θ̂nl0, and the constraint set {θ : ∥θ∥0 ≤ 1, ∥θ∥∞ ≤ c} in the (θ1, θ2) plane.]
Thus, any coefficient in the set θ0rl + Kernel(E[xx′ ]) satisfies this condition,
where
θ0rl := E[xx′ ]+ E[xi yi ] = [0.2, 0.4]′ .
If the sample size n > Rank(E[xx′ ]), then Ker(X) ⊃ Ker(E[xx′ ]), and
the same issue arises in the finite-sample least squares problem, where the
objective function f (θ) is minimized at any point on the affine set
θ̂nrl + Ker(X).
Figure 1.8 depicts the level curves of f (θ) at f (θ̂nrl ) = f1 < f2 < f3 .
These curves are parallel lines, unlike the typical ellipses seen in full-rank
cases. The ridgeless estimator is the minimum l2 -norm solution to the least
squares problem, as expected from its construction.
[Figure 1.8: parallel level curves f1 < f2 < f3 of the least squares loss in the rank-deficient case, the ridgeless solution θ̂nrl, and the ball {θ : ∥θ∥2 ≤ ∥θ̂nrl∥2}.]
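The minimum-norm property just described can be illustrated with a small rank-deficient example (the design below is made up): every element of θ̂nrl + Ker(X) fits the data equally well, but X+y has the smallest l2 norm.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 2
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2.0 * x1])        # rank-deficient: second column is a multiple of the first
y = X @ np.array([1.0, 0.0]) + rng.normal(size=n)

theta_rl = np.linalg.pinv(X) @ y           # ridgeless (minimum-norm) solution X^+ y

# any element of θ̂_rl + Ker(X) is also a least squares solution
kernel_dir = np.array([2.0, -1.0]) / np.sqrt(5.0)   # spans Ker(X) since X @ [2, -1]' = 0
theta_alt = theta_rl + 3.0 * kernel_dir

print("same fit:", np.allclose(X @ theta_rl, X @ theta_alt))
print("norms   :", np.linalg.norm(theta_rl), "<=", np.linalg.norm(theta_alt))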
1.1.4 Computation of lasso

The lasso estimator can be computed using various quadratic programming algorithms. One par-
ticularly simple and effective method is the cyclical coordinate descent algo-
rithm, which minimizes the convex objective function by iterating through
each coordinate independently. This approach provides insight into how the
lasso solution is obtained.
Consider the soft-thresholding operator for a given λ > 0, which is defined as the function

Sλ : R → R;  η ↦ { η − λ  if η > λ;   0  if η ∈ [−λ, λ];   η + λ  if η < −λ }.

[Figure: the graph of Sλ(η), which vanishes on [−λ, λ] and is linear with unit slope outside that interval.]
where a := X1′ y/n and b := X1′ X1 /n. From Theorem 2.2.1, and the subd-
ifferential of the absolute value function (Appendix 2.2, Example 7), θ̂ is a
minimizer of f if and only if
0 ∈ ∂f(θ̂)  ⟺  a ∈ bθ̂ + { {λ}  if θ̂ > 0;   [−λ, λ]  if θ̂ = 0;   {−λ}  if θ̂ < 0 }.
This condition reads: (i) if θ̂ > 0, then θ̂ = (a − λ)/b, implying a > λ; (ii)
if θ̂ = 0, then −λ ≤ a ≤ λ; and (iii) if θ̂ < 0, then θ̂ = (a + λ)/b, implying
a < −λ. These cases are summarized by θ̂ = Sλ (a)/b.
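Below is a small sketch (simulated data, an illustrative value of λ) that implements the soft-thresholding operator and checks that Sλ(a)/b minimizes the univariate objective f(θ) = ∥y − X1θ∥²₂/(2n) + λ|θ| over a fine grid.

import numpy as np

def soft_threshold(eta, lam):
    # S_λ(η): shrink toward zero by λ, exactly zero on [-λ, λ]
    return np.sign(eta) * np.maximum(np.abs(eta) - lam, 0.0)

rng = np.random.default_rng(0)
n, lam = 100, 0.1
x1 = rng.normal(size=n)
y = 0.3 * x1 + rng.normal(size=n)

a = x1 @ y / n                   # a := X1'y/n
b = x1 @ x1 / n                  # b := X1'X1/n
theta_closed = soft_threshold(a, lam) / b

grid = np.linspace(-2.0, 2.0, 20001)
loss = ((y[:, None] - np.outer(x1, grid)) ** 2).sum(axis=0) / (2 * n) + lam * np.abs(grid)
theta_grid = grid[loss.argmin()]
print("closed form:", theta_closed, " grid search:", theta_grid)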
Proposition 1.1.5 can be used to show that the j−th coordinate of the
lasso solution in a multivariate regression model, i.e., when there is more
than just one predictor, satisfies an expression based on the soft-thresholding
operator applied to the residual of a lasso regression of y onto the predictors
Xk at position k ̸= j.
Theorem 1.1.4 (Lasso solution). Let Xj denote the j−th column of X and
X(−j) denote X without the j−th column. Assume that Xj′ Xj > 0 for all
j = 1, . . . , p. Then, given λ > 0, any lasso solution θ̂nl (λ) is such that for all
j = 1, . . . , p:
θ̂n,jl(λ) = argmin_{θ∈R} (1/(2n)) ∥ej − Xjθ∥²₂ + λ|θ| = Sλ(Xj′ej/n) / (Xj′Xj/n),   (1.18)

where ej := y − X(−j) θ̂n,(−j)l(λ) denotes the partial residual obtained by removing from y the contribution of all predictors other than Xj.
This condition holds if and only if for all j = 1, . . . , p:
and the last double implication follows from Proposition 1.1.5 since, by The-
orem 2.2.1,
At iteration t + 1, the algorithm updates a single coordinate j via (1.18), evaluated at the current values of the other coordinates, and sets θ̂k(t+1) = θ̂k(t) for k ≠ j. A typical choice is to cycle through the coordinates in their natural order, from 1 to p. The coordinate descent algorithm is guaranteed to converge to a global minimizer of any convex cost function f : Rp → R satisfying the additive decomposition

f : θ ↦ g(θ) + Σ_{j=1}^p hj(θj),

where g is convex and differentiable and each hj is convex, as is the case for the lasso objective.
Remark 5. If the predictors are measured in different units, it is recommended to standardize them so that Xj′Xj/n = 1 for all j. In this case, the lasso update (1.18) has the simpler form:

θ̂n,jl(λ) = Sλ(Xj′ej/n).
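The following is a minimal sketch of cyclical coordinate descent for the objective ∥y − Xθ∥²₂/(2n) + λ∥θ∥₁, implementing update (1.18); the simulated data, the value of λ, and the fixed number of sweeps are illustrative choices, not prescriptions from the notes.

import numpy as np

def soft_threshold(eta, lam):
    return np.sign(eta) * np.maximum(np.abs(eta) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclical coordinate descent for (1/(2n))||y - X theta||_2^2 + lam ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_norms = (X ** 2).sum(axis=0) / n               # X_j'X_j / n
    for _ in range(n_iter):
        for j in range(p):                             # cycle through coordinates 1, ..., p
            e_j = y - X @ theta + X[:, j] * theta[j]   # partial residual e_j
            theta[j] = soft_threshold(X[:, j] @ e_j / n, lam) / col_norms[j]
    return theta

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
theta0 = np.zeros(p); theta0[:3] = [2.0, -1.0, 0.5]    # sparse true coefficients
y = X @ theta0 + rng.normal(size=n)

theta_hat = lasso_cd(X, y, lam=0.1)
print(np.round(theta_hat, 2))

For a sufficiently large λ, several coordinates are set exactly to zero, in line with the sparsity property of the lasso discussed above.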
In practice, the lasso solutions are computed over a decreasing grid of λ values, using the solution obtained at one value of λ as a warm start for the next, which makes it possible to efficiently compute the solutions over the whole grid. This approach is known as pathwise coordinate descent.
Coordinate descent is particularly efficient for the lasso because the up-
date rule (1.18) is available in closed form, eliminating the need for iterative
searches along each coordinate. Additionally, the algorithm exploits the in-
herent sparsity of the problem: for sufficiently large values of λ, most coeffi-
cients will be zero and will remain unchanged. There are also computational
strategies that can predict the active set of variables, significantly speeding
up the algorithm. More details on the pathwise coordinate descent algorithm
for lasso can be found in Friedman et al. [2007].
Homotopy methods are another class of techniques for computing the lasso estimator. They produce the entire path of solutions in a sequential fashion, starting at zero. A homotopy method that is particularly efficient at computing the entire lasso path is the least angle regression (LARS) algorithm; see Efron et al. [2004].
(iv) The LSE is the best linear unbiased estimator, in the sense that Var[θ̃n ]−
Var[θ̂nls ] is positive semi-definite for any other unbiased linear estimator
θ̃n , i.e., θ̃n = Ay for some A ∈ Rp×n .
(v) The MSE of the LSE is given by: MSE(θ̂nls, θ0) = (σ²/n) Σ_{j=1}^p (1/λj), where λ1 ≥ . . . ≥ λp > 0 are the eigenvalues of X′X/n. Therefore,

MSE(θ̂nls, θ0) ≤ σ²p/(λp n).
(ii) The closed form expression of θˆnls immediately implies the expression
Var[θ̂nls ] = Var[(X ′ X)−1 X ′ ε0 ]
=(X ′ X)−1 X ′ Var[ε0 ]X(X ′ X)−1 .
(vi) Simple computations give
Definition 19 (Ridgeless estimand). The ridgeless estimand is defined as the
vector θ0rl ∈ Range(E[xx′ ]) given by θ0rl := E[xx′ ]+ E[xy].6
We can now extend Proposition 1.1.6 to the fixed design setting where
Rank(X) ≤ p.
Proposition 1.1.8 (Finite-sample properties of ridgeless (fixed design)).
Assume that the linear model (1.1) holds with E[xε0 ] = 0 and denote r0 :=
Rank(E[xx′ ]). Then, for a fixed design matrix:
(i) E[θ̂nrl ] = PRange(X ′ ) θ0rl . If Range(X ′ ) = Range(E[xx′ ]), which implies
n ≥ r0 , then the ridgeless estimator is unbiased:
E[θ̂nrl ] = θ0rl .
and
MPR(θ̂nrl , θ0rl ) = r0 σ 2 /n.
Footnote 6: The result that θ0rl ∈ Range(E[xx′]) follows from the identity E[xx′]+ = PRange(E[xx′]) E[xx′]+. Notice that we used the ridgeless estimand in Section 1.1.3.
Proof. (i) Using Proposition 1.1.7, we have E[Xε(θ0rl )] = 0, which implies
E[ε(θ0rl )] = 0 under a (non-trivial) fixed design. Simple computations
then give
E[θ̂nrl ] = X + E[y] = X + Xθ0rl + X + E[ε(θ0rl )] = PRange(X ′ ) θ0rl .
If Range(X ′ ) = Range(E[xx′ ]), we conclude that E[θ̂nrl ] = θ0rl since
θ0rl ∈ Range(E[xx′ ]).
(ii) The closed-form expression of θ̂ rl immediately implies
Var[θ̂nrl ] = X + Var[ε(θ0rl )](X + )′ .
Moreover,
Bias(θ̂nrl , θ0rl ) = (PRange(X ′ ) −I)θ0rl = − PKer(X) θ0rl .
The result then follows using Proposition 1.0.3.
(v) Proposition 1.0.1 and E[θ̂nrl ] = PRange(X ′ ) θ0rl imply lm(X, θ0rl ) = XE[θ̂nrl ].
Therefore:
E[∥lm(X, θ̂nrl ) − lm(X, θ0rl )∥22 /n]
=E[∥X(θ̂nrl − E[θ̂nrl ])∥22 /n]
=E[Trace{(θ̂nrl − E[θ̂nrl ])′ X ′ X/n(θ̂nrl − E[θ̂nrl ])}]
=E[Trace{(θ̂nrl − E[θ̂nrl ])(θ̂nrl − E[θ̂nrl ])′ X ′ X/n}]
= Trace(Var[θ̂nrl ]X ′ X/n)
=σ 2 /n Trace[(X ′ X)+ X ′ X] = σ 2 /n Trace(X + X),
where the last equality follows from the identity (X ′ X)+ X ′ = X + . Fi-
nally, considering the spectral decomposition X = U SV ′ in Definition
18, we obtain:
σ²/n Trace(X+X) = σ²/n Trace(V S+S V′)
                = σ²/n Trace( [ Ir  0r×(p−r) ;  0(p−r)×r  0(p−r)×(p−r) ] )
                = σ² r/n.
(vi) If Range(X′) = Range(E[xx′]), then Rank(X) = r0 and Ker(X) = Ker(E[xx′]). Therefore Bias(θ̂nrl, θ0rl) = 0 as θ0rl ∈ Range(E[xx′]), and we have

MSE(θ̂nrl, θ0rl) = (σ²/n) Σ_{j=1}^{r0} (1/λj).
and

lim_{λ→0} MPR(θ̂nr(λ), θ0rl) = r0 σ²/n.
Proof. (i) Using the link between ridge and ridgeless estimators in identity
(1.14) we have, with θ0rl as defined in (19):
(iii) The expression follows trivially from the previous item. To show that Var[θ̂nrl] − Var[θ̂nr(λ)] is positive definite, consider the spectral decomposition X = USV′ in Definition 18. Since Rank(X) = r, we have X′X/n = Σ_{j=1}^r λj vj vj′, where λj = sj²/n for j = 1, . . . , r. It follows that

Var[θ̂nrl] = (σ²/n) Σ_{j=1}^r (1/λj) vj vj′.

Instead,
(iv) Using the linearity of the Trace and the fact that V is orthogonal:

Trace(Var[θ̂nr(λ)]) = Trace( (σ²/n) Σ_{j=1}^r ( λj / (λj + λ/n)² ) vj vj′ )
                   = (σ²/n) Σ_{j=1}^r λj / (λj + λ/n)².

Moreover, Bias(θ̂nr(λ), θ0rl) = [Q(λ) − I] θ0rl. The result then follows using Proposition 1.0.3.
Therefore:

MPR(θ̂nr(λ), θ0rl) = E[∥lm(X, θ̂nr(λ)) − lm(X, θ0rl)∥²₂/n]
 = E[Trace{(Q(λ)θ̂nrl − E[θ̂nrl])′ X′X/n (Q(λ)θ̂nrl − E[θ̂nrl])}]
 = E[Trace{X′X/n (Q(λ)θ̂nrl − E[θ̂nrl])(Q(λ)θ̂nrl − E[θ̂nrl])′}]
 = Trace{ X′X/n E[ (Q(λ)θ̂nrl − E[θ̂nrl])(Q(λ)θ̂nrl − E[θ̂nrl])′ ] }.
The next proposition shows that there are penalty parameter values for
which the MSE of ridge is lower than the MSE of ridgeless.
Proposition 1.1.10. Assume that the linear model (1.1) holds with E[xε0] = 0 and Var[ε0] = σ²I for σ > 0. Then, for a fixed design matrix X, there exists λ∗ > 0 such that

MSE(θ̂nr(λ∗), θ0rl) < MSE(θ̂nrl, θ0rl).
Lemma 1.1.5 (Auxiliary properties of lasso). Suppose that the linear model
(1.1) holds. If λ ≥ 2 ∥X ′ ε0 /n∥∞ > 0, then any lasso solution θ̂nl (λ) satisfies:
(ii) An estimation error η̂ := θ̂nl(λ) − θ0 ∈ C3(Supp(θ0)) such that

∥Xη̂∥²₂/n ≤ 3 √s0 λ ∥η̂∥2.   (1.20)
(ii) Let S0 := Supp(θ0 ). Using that θ0S0c = 0, we have
We derive the main properties of lasso under the following restricted eigen-
value condition on the design matrix, which leverages the result that for
λ ≥ 2 ∥X ′ ε0 /n∥∞ , the estimation error of lasso θ̂nl (λ) − θ0 ∈ C3 (Supp(θ0 )).
Assumption 1 (Restricted eigenvalue condition). There exists κ > 0 such that the design matrix X ∈ Rn×p satisfies, for all η ∈ C3(Supp(θ0)):

∥Xη∥²₂/n ≥ κ ∥η∥²₂.
(i) The estimation risk bound

∥θ̂nl(λ) − θ0∥²₂ ≤ (9/κ²) s0 λ².   (1.28)

(ii) The predictive risk bound

PR(θ̂nl(λ), θ0) ≤ (9/κ) s0 λ².   (1.29)
Proof. In Lemma 1.1.5, we obtained Inequality (1.20), which reads

∥Xη̂∥²₂/n ≤ 3 √s0 λ ∥η̂∥2,

where η̂ := θ̂nl(λ) − θ0 ∈ C3(Supp(θ0)).

(i) Using Assumption 1 on the left hand side of Inequality (1.20) yields

κ ∥η̂∥²₂ ≤ 3 √s0 λ ∥η̂∥2.

If ∥η̂∥2 > 0, the result follows by dividing both sides of the inequality by ∥η̂∥2 and squaring. If instead ∥η̂∥2 = 0, the result is trivially obtained.

(ii) Using Assumption 1 on the right hand side of Inequality (1.20) yields

∥Xη̂∥²₂/n ≤ 3 √s0 λ ∥Xη̂∥2 / √(nκ).

If ∥Xη̂∥2 > 0, the result follows by dividing both sides of the inequality by ∥Xη̂∥2/√n and squaring. If instead ∥Xη̂∥2 = 0, the result is trivially obtained.
Remark 6. It is possible to extend the results in Lemma 1.1.5 and Theorem
1.1.6 using a milder restriction than hard sparsity, called weak sparsity. This
restriction formalizes the notion that θ0 can be well approximated by means
of a hard sparse vector.
Definition 22 (Weak sparsity). Coefficient θ0 ∈ Rp is weak sparse if θ0 ∈ Bq(r) where, for q ∈ [0, 1] and radius r > 0,

Bq(r) := { θ ∈ Rp : Σ_{j=1}^p |θj|^q ≤ r }.

For q ∈ (0, 1], membership in Bq(r) requires the ordered coefficient magnitudes to exhibit a sufficiently fast decay. More precisely, if the ordered coefficients satisfy the bound |θ0j| ≤ C j^{−α} for some suitable C ≥ 0 and α > 0, then θ0 ∈ Bq(r) for a radius r that depends on C and α.
where the radius R := ∥θ0 ∥1 . With this choice, the true parameter θ0 is
feasible for the problem. Additionally, we have Ln (θ̂nlc (R)) ≤ Ln (θ0 ) where
Ln : Rp → R; θ 7→ ∥y − Xθ∥22 /(2n)
is the least squares loss function. Under mild regularity conditions, it can be
shown that the loss difference Ln (θ0 )−Ln (θ̂nlc (R)) decreases as the sample size
n increases. Under what conditions does this imply that the estimation risk,
∥η̂∥22 with η̂ := θ̂nlc (R) − θ0 , also decreases? Since Ln is a quadratic function,
the estimation risk will decrease if the function has positive curvature in every
direction (i.e., if there are no flat regions). This occurs when the Hessian,
∇2 Ln (θ̂nlc (R)) = X ′ X/n, has eigenvalues that are uniformly lower-bounded
by a positive constant κ. This condition is equivalently expressed as
∥Xη∥22 /n ≥ κ∥η∥22 > 0
for all nonzero η ∈ Rp .
In the high-dimensional setting, where p > n, the Hessian has rank at
most n, meaning that the least squares loss is flat in at least p − n directions.
As a result, the uniform curvature condition must be relaxed. By Lemma
1.1.5, the estimation error of lasso lies in the subset C3 (Supp(θ0 )) ⊂ Rp for an
appropriate choice of the penalty parameter (equivalently, of the constrained
radius R). For this reason, we require the condition to hold only in the
directions η that lie in C3 (Supp(θ0 )), hoping that | Supp(θ0 )| ≤ Rank(X).
With this adjustment, even in high-dimensional settings, a small difference
in the loss function still leads to an upper bound on the difference between
the lasso estimate and the true parameter.
Verifying that a given design matrix X satisfies the restricted eigenvalue
condition is challenging. Developing methods to discover random design
matrices that satisfy this condition with high probability remains an active
area of research.
(i) The estimation risk bound

∥θ̂nl(λ) − θ0∥²₂ ≤ (72 C²σ² s0 / κ²) (2 ln(p)/n + δ²).   (1.32)

(ii) The predictive risk bound

PR(θ̂nl(λ), θ0) ≤ (72 C²σ² s0 / κ) (2 ln(p)/n + δ²).   (1.33)
Since ε01, . . . , ε0n are independent random variables with sub-G(σ²) distribution, from Proposition 2.3.8 we have that for any t ∈ R:

P[ ∥X′ε0/n∥∞ ≥ t ] ≤ Σ_{j=1}^p P[ |Xj′ε0/n| ≥ t ] ≤ Σ_{j=1}^p 2 exp( −t² / (2σ² ∥Xj/n∥²₂) ) ≤ 2p exp( −t²n / (2σ²C²) ).

Substituting t = Cσ(√(2 ln(p)/n) + δ) we get

2p exp( −t²n / (2σ²C²) ) = 2 exp(ln(p)) exp( −nδ²/2 − ln(p) − δ√(2n ln(p)) )
                         = 2 exp(−nδ²/2) exp( −δ√(2n ln(p)) )
                         ≤ 2 exp(−nδ²/2),

since −δ√(2n ln(p)) < 0. We conclude that, for all δ > 0:

P[ 2∥X′ε0/n∥∞ ≤ 2Cσ(√(2 ln(p)/n) + δ) ] ≥ 1 − 2e^{−nδ²/2}.
Consequently, if we set λ = 2Cσ(√(2 ln(p)/n) + δ), we obtain from (1.19) of Lemma 1.1.5 that (1.31) holds with probability at least 1 − 2e^{−nδ²/2}. Moreover, under Assumption 1, we obtain from (1.28) and (1.29) of Theorem 1.1.6 that (1.32) and (1.33) hold with probability at least 1 − 2e^{−nδ²/2}, by using the inequality:7

2 ln(p)/n + δ² + 2√(2 ln(p)/n) δ ≤ 2(2 ln(p)/n + δ²).
Footnote 7: This inequality follows from 2ab ≤ a² + b² for any two real numbers a and b.
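The event controlled in the proof can be checked by simulation. In the sketch below (Gaussian errors and a Gaussian design are illustrative assumptions), the bound 2∥X′ε0/n∥∞ ≤ 2Cσ(√(2 ln(p)/n) + δ) holds in a fraction of replications far above the theoretical lower bound 1 − 2e^{−nδ²/2}.

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, delta, reps = 200, 1000, 1.0, 0.1, 2000
X = rng.normal(size=(n, p))
C = np.max(np.linalg.norm(X, axis=0)) / np.sqrt(n)   # column-norm constant: max_j ||X_j||_2 / sqrt(n)
lam = 2 * C * sigma * (np.sqrt(2 * np.log(p) / n) + delta)

hits = 0
for _ in range(reps):
    eps = sigma * rng.normal(size=n)                 # sub-Gaussian (here Gaussian) errors
    if 2 * np.linalg.norm(X.T @ eps / n, ord=np.inf) <= lam:
        hits += 1

print("empirical frequency:", hits / reps)
print("theoretical lower bound:", 1 - 2 * np.exp(-n * delta**2 / 2))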
Bibliography
Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with
the k-support norm. Advances in Neural Information Processing Systems,
25, 2012.
Patrick Billingsley. Probability and measure. John Wiley & Sons, 2017.
Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data:
methods, theory and applications. Springer Science & Business Media,
2011.
Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least
angle regression. The Annals of Statistics, 32(2):407–499, 2004.
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learn-
ing with sparsity. Monographs on statistics and applied probability, 143
(143):8, 2015.
Marc Nerlove et al. Returns to scale in electricity supply. Institute for math-
ematical studies in the social sciences, 1961.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society Series B: Statistical Methodology, 58(1):
267–288, 1996.
Ryan J Tibshirani. The lasso problem and uniqueness. The Electronic Jour-
nal of Statistics, 7:1456–1490, 2013.
Chapter 2
Appendix
2.1 Linear algebra

Vector space
We introduce useful definitions and results for real vector spaces.
Definition 23 (Vector space). A (real) vector space is a set V along with an
addition on V and a scalar multiplication on V with the following properties:
1. (commutativity) u + v = v + u for all u, v ∈ V ;
2. (associativity) (u + v) + w = u + (v + w) and (ab)v = a(bv) for all
u, v, w ∈ V and all a, b ∈ R;
3. (additive identity) there exists an element 0 ∈ V such that v + 0 = v
for all v ∈ V ;
4. (additive inverse) for every v ∈ V , there exists w ∈ V such that v +w =
0;
5. (multiplicative identity) 1v = v for all v ∈ V ;
6. (distributive properties) a(u + v) = au + av and (a + b)v = av + bv for
all a, b ∈ R and all u, v ∈ V .
Definition 24 (Subspace). A subset U of a vector space V is a subspace of V
if U is a vector space, (using the same addition and scalar multiplication as
on V ).
Proposition 2.1.1. A subset U of a vector space V is a subspace of V if
and only if it satisfies these conditions:
(i) (additive identity) 0 ∈ U ;
a1v1 + . . . + anvn.
Proposition 2.1.2 (Basic properties of an inner product). An inner product
⟨·, ·⟩ on vector space V satisfies:
1. ⟨0, v⟩ = ⟨v, 0⟩ for every v ∈ V .
2. ⟨v, u + w⟩ = ⟨v, u⟩ + ⟨v, w⟩ for every v, u, w ∈ V .
3. ⟨v, au⟩ = a⟨v, u⟩ for all a ∈ R and all v, u ∈ V .
Definition 31 (Orthogonal vectors). Two vectors v and u in vector space V
are orthogonal if ⟨v, u⟩ = 0.
Definition 32 (Orthogonal subspace). U and W are orthogonal subspaces of
vector space V if ⟨u, w⟩ = 0 for all u ∈ U and all w ∈ W .
Definition 33 (Orthonormal basis). The set of vectors {v1, . . . , vn} in vector space V is an orthonormal basis of V if it is a basis of V such that ⟨vi, vj⟩ = 0 for all i ≠ j and ∥vi∥ = 1 for all i = 1, . . . , n.
Definition 34 (Norms). Given an inner product ⟨·, ·⟩ on vector space V, the norm of v ∈ V is defined by ∥v∥ := √⟨v, v⟩.
Proposition 2.1.3 (Properties of norms). For v in vector space V :
1. ∥v∥ = 0 if and only if v = 0.
2. ∥av∥ = |a| ∥v∥ for all a ∈ R.
Definition 35 (Linear function). T : V → W from a vector space V to another vector space W is a linear function if:
(i) T(v + u) = T(v) + T(u) for all v, u ∈ V;
(ii) T(av) = aT(v) for all a ∈ R and v ∈ V.
Theorem 2.1.1 (Cauchy–Schwarz inequality). Suppose v and u are two
vectors in vector space V . Then,
|⟨v, u⟩| ≤ ∥v∥ ∥u∥ .
This inequality is an equality if and only if there is a ∈ R such that v = au.
Theorem 2.1.2 (Triangle inequality). Suppose v and u are two vectors in
vector space V . Then,
∥v + u∥ ≤ ∥v∥ + ∥u∥ .
This inequality is an equality if and only if there is a ≥ 0 such that v = au.
Theorem 2.1.3 (Parallelogram equality). Suppose v and u are two vectors
in vector space V . Then,
∥v + u∥2 + ∥v − u∥2 = 2(∥v∥2 + ∥u∥2 ).
The Euclidean space
Definition 36 ((Real) n-tuple). A (real) n-tuple is an ordered list of n real numbers.
With a slight abuse of terminology, we sometimes use the term vector to mean a (real) n-tuple.
Definition 37 (Real Euclidean space). The real Euclidean space of dimension
n, denoted Rn , is the set of all n−tuples.
Elements of a real Euclidean space are written in bold. For example,
a ∈ Rn , which means a = (a1 , . . . , an ) with a1 , . . . , an ∈ R.
Definition 38 (Euclidean inner product). The Euclidean inner product of v, u ∈ Rn is defined as ⟨v, u⟩e := Σ_{i=1}^n vi ui.
Definition 39 (lp-norm). The lp-norm ∥·∥p on Rn is defined for all v ∈ Rn as ∥v∥p := ( Σ_{i=1}^n |vi|^p )^{1/p} when p ∈ [1, +∞), and ∥v∥p := max_{i=1,...,n} |vi| when p = +∞.
Matrices
Definition 40 (Matrix). An n × p matrix is a collection of p n−tuples.
The collection of all n × p matrices is denoted Rn×p . For a matrix A ∈
Rn×p , we write A = [A1 , . . . , Ap ] where A1 , . . . , Ap ∈ Rn are p n−tuples.
Written more explicitly,

A = [ A1,1 . . . A1,p
      ...        ...
      An,1 . . . An,p ].
Notice that a matrix in Rn×p can be equivalently seen as a collection of n
p−tuples, where the p−tuples represent the rows of the matrix.
Definition 41 (Column and row vector). A n−column vector is a n−tuple
seen as a matrix in Rn×1 . A n−row vector is a n−tuple seen as a matrix in
R1×n .
Throughout these lecture notes, we denote n−tuples as column vectors,
and use the simple notation v ∈ Rn instead of v ∈ Rn×1 .
Definition 42 (Matrix addition). The sum of two matrices of the same size is
the matrix obtained by adding corresponding entries in the matrices. That is,
for A, B ∈ Rn×p , we define A+B = C where C ∈ Rn×p and Ci,j = Ai,j +Bi,j
for i = 1, . . . , n and j = 1, . . . , p.
Definition 43 (Matrix-scalar multiplication). The product of a scalar and a matrix is the matrix obtained by multiplying each entry in the matrix by the scalar. That is, for A ∈ Rn×p and a ∈ R, we define aA = B where B ∈ Rn×p and Bi,j = aAi,j for i = 1, . . . , n and j = 1, . . . , p.
Definition 44 (Matrix multiplication). Given A ∈ Rn×p and B ∈ Rp×m, the product AB = C where C ∈ Rn×m and Ci,j = Σ_{r=1}^p Ai,r Br,j for i = 1, . . . , n and j = 1, . . . , m.
Note that we define the product of two matrices only when the number of
columns of the first matrix equals the number of rows of the second matrix.
Definition 45 (Transpose of a matrix). The transpose of a matrix A ∈ Rn×p
is the matrix B ∈ Rp×n with j, i−entry given by Bj,i = Ai,j for i = 1, . . . , n
and j = 1, . . . , p. We denote it by A′ .
It follows that the Euclidean inner product between v, u ∈ Rn is
⟨v, u⟩e = v ′ u.
The range of a matrix A ∈ Rn×p is Range(A) := {Av : v ∈ Rp}, and its kernel is Ker(A) := {v ∈ Rp : Av = 0}.
Proposition 2.1.5. Let A ∈ Rn×p . Then, Range(A) and Ker(A′ ) are or-
thogonal subspaces of Rn such that Rn = Range(A) + Ker(A′ ).
Definition 48 (Rank of a matrix). The rank of a matrix A ∈ Rn×p , denoted
Rank(A), is the maximum number of linearly independent columns of A.
Proposition 2.1.6. Let A ∈ Rn×p . Then, Rank(A) ≤ min{n, p}.
Definition 49 (Eigenvalue). λ ∈ R is an eigenvalue of a square matrix A ∈ Rn×n if there exists v ∈ Rn such that v ≠ 0 and Av = λv.
Definition 50 (Eigenvector). Given a square matrix A ∈ Rn×n with eigenvalue λ ∈ R, v ∈ Rn is an eigenvector of A corresponding to λ if v ≠ 0 and Av = λv.
Proposition 2.1.7. Every symmetric matrix A ∈ Rn×n has an eigenvalue.
Proposition 2.1.8. Let A ∈ Rn×n. Then, A has at most Rank(A) distinct nonzero eigenvalues.
Proposition 2.1.9. Suppose λ1, . . . , λr ∈ R are distinct eigenvalues of A ∈ Rn×n and v1, . . . , vr ∈ Rn are corresponding eigenvectors. Then, v1, . . . , vr are linearly independent.
Definition 51 (Singular values). The singular values of A ∈ Rn×p are the
nonnegative square roots of the eigenvalues of A′ A.
Definition 52 (Symmetric matrix). A square matrix A ∈ Rn×n is symmetric
if A′ = A.
Definition 53 (Positive definite matrix). A square symmetric matrix A ∈
Rn×n is positive definite if v ′ Av > 0 for all v ∈ Rn such that v ̸= 0.
Definition 54 (Positive semi-definite matrix). A square symmetric matrix
A ∈ Rn×n is positive semi-definite if v ′ Av ≥ 0 for all v ∈ Rn .
Proposition 2.1.10. A square symmetric matrix A ∈ Rn×n is positive def-
inite (positive semi-definite) if and only if all of its eigenvalues are positive
(nonnegative).
Definition 55 (Identity matrix). The identity matrix on Rn is defined as I := diag(1, . . . , 1) ∈ Rn×n.
Definition 56 (Diagonal of a matrix). The diagonal of a square matrix A ∈
Rn×n indicates the elements ”on the diagonal”: A1,1 , . . . , An,n .
Definition 57 (Diagonal matrix). A square matrix A ∈ Rn×n is a diagonal
matrix if all its elements outside of the diagonal are zero. We can write
A = diag(A1,1 , . . . , An,n ).
Definition 58 (Invertible matrix, matrix inverse). A square matrix A ∈ Rn×n
is invertible if there is a matrix B ∈ Rn×n such that AB = BA = I. We
call B the inverse of A and denote it by A−1 .
Proposition 2.1.11. A square matrix A ∈ Rn×n is invertible if and only if Rank(A) = n, or equivalently, if and only if Ker(A) = {0}.
Proposition 2.1.12. A square symmetric positive semi-definite matrix A ∈ Rn×n is invertible if and only if it is positive definite.
Definition 59 (Orthogonal matrix). A square matrix P ∈ Rp×p is orthogonal,
or orthonormal, if P ′ P = P P ′ = I.
Definition 60 (Projection matrix). A square matrix P ∈ Rp×p is a projection
matrix if P = P 2 .
Definition 61 (Orthogonal projection matrix). A square matrix P ∈ Rp×p is
an orthogonal projection matrix if it is a projection matrix and P = P ′ .
Projections and orthogonal projections have the following properties.
Proposition 2.1.13. For any projection matrix P ∈ Rp×p and vector b ∈
Rp , we have
b = P b + (I − P )b.
If P is an orthogonal projection matrix, then
(P b)′ (I − P )b = 0.
Definition 62 (Trace). The trace of a square matrix A ∈ Rn×n , denoted
Trace(A), is the sum of its diagonal elements:
Trace(A) = A1,1 + . . . + An,n.
Proposition 2.1.14. The Trace is a linear function.
Proposition 2.1.15 (Properties of the trace). 1. Trace(A) = λ1 + . . . + λn for all A ∈ Rn×n with λ1, . . . , λn denoting the (not necessarily distinct) eigenvalues of A.
2. Trace(A) = Trace(A′ ) for all A ∈ Rn×n .
3. Trace(AB) = Trace(BA) for all A, B ∈ Rn×n.
4. Trace(A′ B) = Trace(AB ′ ) = Trace(B ′ A) = Trace(BA′ ) for all A, B ∈
Rn×p .
2.1.1 Moore-Penrose inverse
The Moore-Penrose inverse, or matrix pseudoinverse, is a generalization of
the inverse of a matrix that was independently introduced by Moore [1920]
and Penrose [1955].
Definition 63 (Moore-Penrose inverse). The matrix A+ ∈ Rp×n is a Moore-Penrose inverse of A ∈ Rn×p if
(i) AA+A = A;
(ii) A+AA+ = A+;
(iii) (AA+)′ = AA+;
(iv) (A+A)′ = A+A.
The Moore-Penrose inverse satisfies the following properties:
1. A = (A+)+.
3. (A′ )+ = (A+ )′ .
5. (AA′ )+ = (A′ )+ A+ .
6. Range(A+ ) = Range(A′ ) = Range(A+ A) = Range(A′ A).
(i) If b ∉ Range(A), then L is empty.
Corollary 2.1.6. Given a square matrix A ∈ Rp×p and b ∈ Rp , let L :=
{θ ∈ Rp : Aθ = b}. Then, A+ b is the unique element of L if and only if
Rank(A) = p. In this case, A+ = A−1 .
Corollary 2.1.7. For X ∈ Rn×p and y ∈ Rn:

argmin_{θ∈Rp} ∥y − Xθ∥²₂ = X+y + Ker(X).
Proposition 2.1.24. The Moore-Penrose inverse of a matrix A ∈ Rn×p
admitting SVD decomposition A = U SV ′ is given by A+ = V S + U ′ .
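A quick numerical sketch (using a random matrix purely for illustration) of Proposition 2.1.24 and of the defining conditions of the Moore-Penrose inverse:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(s) V'
A_plus_svd = Vt.T @ np.diag(1.0 / s) @ U.T         # V S^+ U' (here Rank(A) = 3 almost surely)
A_plus = np.linalg.pinv(A)

print(np.allclose(A_plus, A_plus_svd))             # Proposition 2.1.24
print(np.allclose(A @ A_plus @ A, A))              # (i)   A A^+ A = A
print(np.allclose(A_plus @ A @ A_plus, A_plus))    # (ii)  A^+ A A^+ = A^+
print(np.allclose((A @ A_plus).T, A @ A_plus))     # (iii) A A^+ symmetric
print(np.allclose((A_plus @ A).T, A_plus @ A))     # (iv)  A^+ A symmetric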
The eigenvalue decomposition of a diagonalizable square matrix A is A = QΛQ−1, where the columns of Q are eigenvectors of A and Λ is the diagonal matrix of the corresponding eigenvalues.
Proposition 2.1.25 (Relation between the singular value and the eigenvalue
decompositions). Given a matrix A ∈ Rn×p with SVD A = U SV ′ :
1. A′ A = V S ′ SV ′ ;
2. AA′ = U SS ′ U ′ .
2.2 Convex analysis

Basic definitions
Definition 66 (Closed set). A set C ⊂ Rp is closed if it contains all of its
limit points.
Definition 67 (Bounded set). A set C ⊂ Rp is bounded if there exists r > 0 such that for all θ, β ∈ C, we have ∥θ − β∥ < r.
Definition 68 (Convex set). A set C ⊂ Rp is convex if for all 0 < α < 1 and
all θ, β ∈ C:
αθ + (1 − α)β ∈ C.
In particular, Rp and ∅ are convex.
Definition 69 (Epigraph of a function). The epigraph of function f : Rp →
(−∞, +∞] is
epi(f ) := {(θ, ξ) ∈ Rp × R : f (θ) ≤ ξ}.
Definition 70 (Domain of a function). The domain of function f : Rp →
(−∞, +∞] is
dom(f ) := {θ ∈ Rp : f (θ) < +∞}.
Definition 71 (Lower level set of a function). The lower level set of function f : Rp → (−∞, +∞] at height ξ ∈ R is

lev≤ξ(f) := {θ ∈ Rp : f(θ) ≤ ξ}.

The limit inferior of f at θ∗ ∈ Rp is

lim inf_{θ→θ∗} f(θ) = lim_{ε→0} ( inf{ f(θ) : θ ≠ θ∗, ∥θ − θ∗∥ ≤ ε } ),

and f is lower semicontinuous at θ∗ if

lim inf_{θ→θ∗} f(θ) ≥ f(θ∗).
Graphically, a vector β ∈ Rp is a subgradient of a proper function f : Rp → (−∞, +∞] at θ ∈ dom(f) if the affine function

fβ,θ : v ↦ ⟨v − θ, β⟩ + f(θ)

lies everywhere below f and coincides with it at θ.
(i) f is coercive.
(ii) C is bounded.
Then f has a minimizer over C.
Proof. Set µ := inf_{θ∈Rp} f(θ) and suppose that there exist two distinct points θ1, θ2 ∈ dom(f) such that f(θ1) = f(θ2) = µ. Since θ1, θ2 ∈ lev≤µ(f), which is convex, β = (θ1 + θ2)/2 ∈ lev≤µ(f) as well, so that f(β) = µ. It follows from the strict convexity of f that

f(β) < (1/2) f(θ1) + (1/2) f(θ2) = µ,

which is impossible.
Global minimizers of proper functions can be characterized by a simple rule which extends a seventeenth century result due to Pierre Fermat.

Theorem 2.2.1 (Fermat's rule). Let f : Rp → (−∞, +∞] be a proper function. Then θ∗ ∈ argmin_{θ∈Rp} f(θ) if and only if 0 ∈ ∂f(θ∗).
Proof. Let θ ∗ ∈ Rp . Then θ ∗ ∈ argminθ∈Rp f (θ) if and only if, for every
β ∈ Rp ,
⟨β − θ ∗ , 0⟩ + f (θ ∗ ) ≤ f (β).
By definition of subgradient, this last requirement reads 0 ∈ ∂f (θ ∗ ).
2.3 Probability theory
This section introduces a selection of definitions and results from probability
theory that are used in these lecture notes. A book-length exposition of
probability theory can be found in Billingsley [2017] and Vershynin [2018],
among others.
All random variables are (real valued and) defined on the complete prob-
ability space (Ω, F, P).
Definition 81 (Cumulative Distribution Function). The Cumulative Distribution Function (CDF) of random variable X is the function FX : R → [0, 1]; x ↦ P[X ≤ x].

For p = +∞, the L∞ norm is ∥X∥L∞ := ess sup |X| := inf{b ∈ R : P({ω : |X(ω)| > b}) = 0}.
The inner product in L2 is, for all X, Y ∈ L2: ⟨X, Y⟩ := E[XY].
The Variance of X ∈ L2 is Var[X] := E[(X − E[X])²].
Classical inequalities
Theorem 2.3.1 (Jensen's inequality). For any random variable X and a convex function f : R → R, we have

f(E[X]) ≤ E[f(X)].
The tails and the moments of a random variable are connected.
Proposition 2.3.2 (Integral identity). For any nonnegative random variable X:

E[X] = ∫₀^∞ P[X > x] dx.
The two sides of this identity are either both finite or both infinite.
Theorem 2.3.5 (Markov’s inequality). For any nonnegative random variable
X and x > 0:
P[X ≥ x] ≤ E[X]/x.
A consequence of Markov’s inequality is Chebyshev’s inequality, which
bounds the concentration of a random variable about its mean.
Theorem 2.3.6 (Chebyshev’s inequality). Let X be a random variable with
finite mean µ and finite variance σ 2 . Then, for any x > 0:
P[|X − µ| ≥ x] ≤ σ 2 /x2 .
Proposition 2.3.3 (Generalization of Markov’s inequality). For any random
variable X with mean µ ∈ R and finite moment of order p ≥ 1, and for any
x > 0:
P[|X − µ| ≥ x] ≤ E[|X − µ|p ]/xp .
Theorem 2.3.9 (Hoeffding's inequality for bounded random variables). Let X1, . . . , Xn be independent random variables. Assume that Xi ∈ [li, ui] with li, ui ∈ R and li ≤ ui. Then, for any x > 0:

P[ Σ_{i=1}^n (Xi − E[Xi]) ≥ x ] ≤ exp( −2x² / Σ_{i=1}^n (ui − li)² ).
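A quick simulation sketch (Bernoulli variables chosen purely for illustration) comparing the empirical tail of Σi(Xi − E[Xi]) with the Hoeffding bound:

import numpy as np

rng = np.random.default_rng(0)
n, reps, x = 100, 50000, 10.0
# Xi ~ Bernoulli(0.5), so li = 0, ui = 1 and Σi (ui - li)² = n
samples = rng.binomial(1, 0.5, size=(reps, n))
deviations = samples.sum(axis=1) - n * 0.5

print("empirical P[Σ(Xi - E[Xi]) >= x]:", np.mean(deviations >= x))
print("Hoeffding bound exp(-2x²/n)    :", np.exp(-2 * x**2 / n))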
(iii) The MGF of X² satisfies for all t ∈ R such that |t| ≤ 1/C3:
E[exp(t²X²)] ≤ exp(C3²t²).
(iv) The MGF of X² is bounded at some point:
E[exp(X²/C4²)] ≤ 2.
(v) If E[X] = 0, the MGF of X satisfies for all t ∈ R:
E[exp(tX)] ≤ exp(C5²t²).
and

P[ Σ_{i=1}^n ai Xi ≤ −t ] ≤ exp( − t² / (2σ² ∥a∥²₂) ).
Definition 91 (Sub-Gaussian norm). The sub-Gaussian norm ∥X∥ψ2 of random variable X is defined as

∥X∥ψ2 := inf{ t > 0 : E[exp(X²/t²)] ≤ 2 }.
Proposition 2.3.9. If X is a sub-Gaussian random variable, then X − E[X]
is sub-Gaussian and for a constant C > 0:
∥X − E[X]∥ψ2 ≤ C ∥X∥ψ2 .
(i) The tails of X satisfy for all x ≥ 0:
P[|X| ≥ x] ≤ 2 exp(−x/K1).
(iii) The MGF of |X| satisfies for all t ∈ R such that 0 ≤ t ≤ 1/K3:
E[exp(t|X|)] ≤ exp(K3 t).
(iv) The MGF of |X| is bounded at some point:
E[exp(|X|/K4)] ≤ 2.
(v) If E[X] = 0, the MGF of X satisfies for all t ∈ R such that |t| ≤ 1/K5:
E[exp(tX)] ≤ exp(K5² t²).
∥X − E[X]∥ψ1 ≤ C ∥X∥ψ1 .
∥X²∥ψ1 = ∥X∥²ψ2.
Proposition 2.3.17 (Product of sub-Gaussians is sub-exponential). Let X
and Y be sub-Gaussian random variables. Then XY is sub-exponential.
Moreover,

∥XY∥ψ1 ≤ ∥X∥ψ2 ∥Y∥ψ2.