Function Spaces
A function space is a set of functions F that has some structure. Often a nonparametric
regression function or classifier is chosen to lie in some function space, where the assumed
structure is exploited by algorithms and theoretical analysis. Here we review some basic
facts about function spaces.
1 Hilbert Spaces
A norm on a vector space $V$ is a mapping $\|\cdot\| : V \to [0, \infty)$ such that:
1. $\|x + y\| \le \|x\| + \|y\|$;
2. $\|ax\| = |a|\,\|x\|$ for all $a \in \mathbb{R}$;
3. $\|x\| = 0$ if and only if $x = 0$.
An example of a norm on $V = \mathbb{R}^k$ is the Euclidean norm $\|x\| = \sqrt{\sum_i x_i^2}$. A sequence $x_1, x_2, \ldots$ in a normed space is a Cauchy sequence if $\|x_m - x_n\| \to 0$ as $m, n \to \infty$. The space is complete if every Cauchy sequence converges to a limit. A complete normed space is called a Banach space.
An inner product on $V$ is a mapping $\langle \cdot, \cdot\rangle : V \times V \to \mathbb{R}$ such that:
1. $\langle x, x\rangle \ge 0$;
2. $\langle x, x\rangle = 0$ if and only if $x = 0$;
3. $\langle ax + by, z\rangle = a\langle x, z\rangle + b\langle y, z\rangle$ for all $a, b \in \mathbb{R}$;
4. $\langle x, y\rangle = \langle y, x\rangle$.
Two vectors $x$ and $y$ are orthogonal if $\langle x, y\rangle = 0$. An inner product defines a norm $\|v\| = \sqrt{\langle v, v\rangle}$. We then have the Cauchy-Schwarz inequality
$$|\langle x, y\rangle| \le \|x\|\,\|y\|. \qquad (1)$$
A Hilbert space is a complete inner product space. Every Hilbert space is a Banach space but the reverse is not true in general. In a Hilbert space, we write $f_n \to f$ to mean that $\|f_n - f\| \to 0$ as $n \to \infty$. Note that $\|f_n - f\| \to 0$ does NOT imply that $f_n(x) \to f(x)$. For this to be true, we need the space to be a reproducing kernel Hilbert space, which we discuss later.
If $V$ is a Hilbert space and $L$ is a closed subspace then for any $v \in V$ there is a unique $y \in L$, called the projection of $v$ onto $L$, which minimizes $\|v - z\|$ over $z \in L$. The set of elements orthogonal to every $z \in L$ is denoted by $L^\perp$. Every $v \in V$ can be written uniquely as $v = w + z$ where $z$ is the projection of $v$ onto $L$ and $w \in L^\perp$. In general, if $L$ and $M$ are subspaces such that every $\ell \in L$ is orthogonal to every $m \in M$ then we define the orthogonal sum (or direct sum) as
$$L \oplus M = \{\ell + m : \ell \in L,\ m \in M\}. \qquad (2)$$
A set of vectors $\{e_t,\ t \in T\}$ is orthonormal if $\langle e_s, e_t\rangle = 0$ when $s \neq t$ and $\|e_t\| = 1$ for all $t \in T$. If $\{e_t,\ t \in T\}$ are orthonormal, and the only vector orthogonal to each $e_t$ is the zero vector, then $\{e_t,\ t \in T\}$ is called an orthonormal basis. Every Hilbert space has an orthonormal basis. A Hilbert space is separable if there exists a countable orthonormal basis.
Theorem 1 Let $V$ be a separable Hilbert space with countable orthonormal basis $\{e_1, e_2, \ldots\}$. Then, for any $x \in V$, we have $x = \sum_{j=1}^{\infty} \theta_j e_j$ where $\theta_j = \langle x, e_j\rangle$. Furthermore,
$$\|x\|^2 = \sum_{j=1}^{\infty} \theta_j^2,$$
which is known as Parseval's identity.
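As a finite-dimensional sanity check, here is a short Python sketch (all variable names are our own) that builds a random orthonormal basis of $\mathbb{R}^8$ via a QR factorization and verifies both the expansion and Parseval's identity:

```python
# Finite-dimensional illustration of Theorem 1.
import numpy as np

rng = np.random.default_rng(0)
d = 8
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # columns = orthonormal basis e_j
x = rng.normal(size=d)

theta = Q.T @ x                  # theta_j = <x, e_j>
x_rec = Q @ theta                # sum_j theta_j e_j
print(np.allclose(x, x_rec))               # True: the expansion recovers x
print(np.allclose(x @ x, theta @ theta))   # True: ||x||^2 = sum_j theta_j^2
```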
The set $\mathbb{R}^d$ with inner product $\langle v, w\rangle = \sum_j v_j w_j$ is a Hilbert space. Another example of a Hilbert space is the set of functions $f : [a,b] \to \mathbb{R}$ such that $\int_a^b f^2(x)\,dx < \infty$, with inner product $\langle f, g\rangle = \int f(x)\,g(x)\,dx$. This space is denoted by $L_2(a,b)$.
2 Lp Spaces
For $p \ge 1$, define $\|f\|_p = \left( \int_a^b |f(x)|^p\,dx \right)^{1/p}$. Sometimes we write $\|f\|_2$ simply as $\|f\|$. The space $L_p(a,b)$ is defined as follows:
$$L_p(a,b) = \Big\{ f : [a,b] \to \mathbb{R} \;:\; \|f\|_p < \infty \Big\}. \qquad (5)$$
If $\{\phi_1, \phi_2, \ldots\}$ is an orthonormal basis for $L_2(a,b)$, then any $f \in L_2(a,b)$ can be written as $f(x) = \sum_{j=1}^{\infty} \theta_j\,\phi_j(x)$, where
$$\theta_j = \int_a^b f(x)\,\phi_j(x)\,dx \qquad (7)$$
are the coefficients. Also, recall Parseval's identity
$$\int_a^b f^2(x)\,dx = \sum_{j=1}^{\infty} \theta_j^2. \qquad (8)$$
The Fourier basis on $[0,1]$ is defined by setting $\phi_1(x) = 1$ and
$$\phi_{2j}(x) = \sqrt{2}\,\cos(2j\pi x), \qquad \phi_{2j+1}(x) = \sqrt{2}\,\sin(2j\pi x), \qquad j = 1, 2, \ldots \qquad (10)$$
The cosine basis on $[0,1]$ is defined by
$$\phi_0(x) = 1, \qquad \phi_j(x) = \sqrt{2}\,\cos(2\pi j x), \qquad j = 1, 2, \ldots \qquad (11)$$
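To make the coefficient formulas concrete, the following Python sketch computes the coefficients $\theta_j$ of an arbitrary function in the Fourier basis (10) by numerical integration and checks Parseval's identity (8); the test function and all names are our own choices:

```python
# Fourier coefficients and Parseval's identity, numerically.
import numpy as np

x = np.linspace(0, 1, 20001)
f = x**2 + np.sin(3 * x)             # any square-integrable f on [0,1]

def phi(j):
    # Basis (10): phi_1 = 1, phi_{2k} = sqrt(2) cos(2k pi x),
    # phi_{2k+1} = sqrt(2) sin(2k pi x)
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2) * trig(2 * np.pi * k * x)

# theta_j = int_0^1 f phi_j, approximated by an average over the grid
theta = np.array([(f * phi(j)).mean() for j in range(1, 2001)])
print((f**2).mean(), (theta**2).sum())   # nearly equal, as (8) predicts
```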
The Haar basis is built from the father wavelet $\phi$ and the mother wavelet $\psi$, where
$$\phi(x) = \begin{cases} 1 & \text{if } 0 \le x < 1 \\ 0 & \text{otherwise,} \end{cases} \qquad (16)$$
$\psi_{jk}(x) = 2^{j/2}\,\psi(2^j x - k)$, and
$$\psi(x) = \begin{cases} -1 & \text{if } 0 \le x \le \frac{1}{2} \\ 1 & \text{if } \frac{1}{2} < x \le 1. \end{cases} \qquad (17)$$
This is a doubly indexed set of functions, so when $f$ is expanded in this basis we write
$$f(x) = \alpha\,\phi(x) + \sum_{j=1}^{\infty} \sum_{k=1}^{2^j - 1} \beta_{jk}\,\psi_{jk}(x) \qquad (18)$$
where $\alpha = \int_0^1 f(x)\,\phi(x)\,dx$ and $\beta_{jk} = \int_0^1 f(x)\,\psi_{jk}(x)\,dx$. The Haar basis is an example of a
wavelet basis.
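A small numerical sketch of the Haar expansion (18): we compute $\alpha$ and the $\beta_{jk}$ on a dyadic grid and reconstruct $f$ from the first few resolution levels. For completeness of the system the code indexes $j \ge 0$ and $k = 0, \ldots, 2^j - 1$ (indexing conventions vary); the test function is our own choice:

```python
# Haar coefficients and partial reconstruction.
import numpy as np

n = 2**12
x = (np.arange(n) + 0.5) / n                          # midpoints of a dyadic grid
f = np.where(x < 0.5, np.sin(2 * np.pi * x), 1 - x)   # an inhomogeneous f

def psi_jk(j, k):
    # psi_jk(x) = 2^{j/2} psi(2^j x - k), psi = -1 on [0, 1/2], +1 on (1/2, 1]
    u = 2**j * x - k
    return 2**(j / 2) * np.where(u <= 0.5, -1.0, 1.0) * ((u >= 0) & (u <= 1))

alpha = f.mean()                                      # int f phi, with phi = 1
f_hat = alpha * np.ones(n)
for j in range(8):
    for k in range(2**j):
        beta = (f * psi_jk(j, k)).mean()              # int f psi_jk
        f_hat += beta * psi_jk(j, k)
print(np.abs(f - f_hat).max())                        # small reconstruction error
```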
Let $[a,b]^d = [a,b] \times \cdots \times [a,b]$ be the $d$-dimensional cube and define
$$L_2\big([a,b]^d\big) = \left\{ f : [a,b]^d \to \mathbb{R} \;:\; \int_{[a,b]^d} f^2(x_1, \ldots, x_d)\,dx_1 \cdots dx_d < \infty \right\}. \qquad (19)$$
Suppose that $B = \{\phi_1, \phi_2, \ldots\}$ is an orthonormal basis for $L_2([a,b])$. Then the set of functions
$$B_d = B \otimes \cdots \otimes B = \Big\{ \phi_{i_1}(x_1)\,\phi_{i_2}(x_2) \cdots \phi_{i_d}(x_d) \;:\; i_1, i_2, \ldots, i_d \in \{1, 2, \ldots\} \Big\} \qquad (20)$$
is called the tensor product of $B$, and forms an orthonormal basis for $L_2([a,b]^d)$.
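The following sketch checks, for $d = 2$ and the cosine basis, that two tensor-product functions from (20) are orthonormal on a grid (a numerical approximation; the indices are our choice):

```python
# Tensor-product basis functions on [0,1]^2.
import numpy as np

def phi(j, t):
    # one-dimensional cosine basis (11)
    return np.ones_like(t) if j == 0 else np.sqrt(2) * np.cos(2 * np.pi * j * t)

t = np.linspace(0, 1, 1001)
X1, X2 = np.meshgrid(t, t)

def phi2(i1, i2):
    # phi_{i1}(x1) * phi_{i2}(x2), a member of B_2 in (20)
    return phi(i1, X1) * phi(i2, X2)

print((phi2(1, 2) * phi2(1, 2)).mean())   # ~1: unit L2 norm on [0,1]^2
print((phi2(1, 2) * phi2(2, 1)).mean())   # ~0: distinct indices are orthogonal
```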
3 Hölder Spaces
Let $\beta$ be a positive integer.¹ Let $T \subset \mathbb{R}$. The Hölder space $H(\beta, L)$ is the set of functions $g : T \to \mathbb{R}$ such that
$$|g^{(\beta - 1)}(y) - g^{(\beta - 1)}(x)| \le L\,|x - y|, \quad \text{for all } x, y \in T. \qquad (21)$$
The special case $\beta = 1$ is sometimes called the Lipschitz space. If $\beta = 2$ then we have
$$|g'(x) - g'(y)| \le L\,|x - y|, \quad \text{for all } x, y.$$
Roughly speaking, this means that the functions have bounded second derivatives.
¹ It is possible to define Hölder spaces for non-integers, but we will not need this generalization.
In the case of $\beta = 2$, this means that
$$|g(x + h) - g(x) - g'(x)\,h| \le \frac{L h^2}{2} \quad \text{for all } x, h,$$
which follows from Taylor's theorem together with (21).
We will see that in function estimation, the optimal rate of convergence over $H(\beta, L)$ under $L_2$ loss is $O(n^{-2\beta/(2\beta + d)})$.
4 Sobolev Spaces
The Sobolev space $W_{m,p}$ is, roughly speaking, the set of functions whose derivatives up to order $m$ are in $L_p$; on $[0,1]$ one can take the functions $f$ such that $f^{(m-1)}$ is absolutely continuous and $f^{(m)} \in L_p$. For the rest of this section we take $p = 2$ and write $W_m$ instead of $W_{m,2}$.
Theorem 2 The Sobolev space $W_m$ is a Hilbert space under the inner product
$$\langle f, g\rangle = \sum_{k=0}^{m-1} f^{(k)}(0)\,g^{(k)}(0) + \int_0^1 f^{(m)}(x)\,g^{(m)}(x)\,dx. \qquad (27)$$
Define
$$K(x, y) = \sum_{k=0}^{m-1} \frac{x^k y^k}{(k!)^2} + \int_0^{x \wedge y} \frac{(x - u)^{m-1} (y - u)^{m-1}}{((m-1)!)^2}\,du. \qquad (28)$$
Then, for each $f \in W_m$ we have
$$f(y) = \langle f, K(\cdot, y)\rangle \qquad (29)$$
and
$$K(x, y) = \langle K(\cdot, x), K(\cdot, y)\rangle. \qquad (30)$$
We say that $K$ is a kernel for the space and that $W_m$ is a reproducing kernel Hilbert space, or RKHS. See Section 7 for more on reproducing kernel Hilbert spaces.
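The reproducing property (29) can be checked numerically. For $m = 1$, (28) reduces to $K(x, y) = 1 + x \wedge y$ and (27) to $\langle f, g\rangle = f(0)g(0) + \int_0^1 f'g'$. A sketch, with an arbitrary smooth test function:

```python
# Check f(y) = <f, K(., y)> in the Sobolev space W_1.
import numpy as np

x = np.linspace(0, 1, 100001)
f = np.exp(x) * np.sin(2 * x)             # arbitrary smooth f
fprime = np.gradient(f, x)

def inner_with_Ky(y):
    Ky = 1 + np.minimum(x, y)             # K(., y) for m = 1
    Kyprime = np.gradient(Ky, x)          # = 1 for x < y, 0 for x > y
    return f[0] * Ky[0] + (fprime * Kyprime).mean()   # inner product (27)

for y in (0.25, 0.5, 0.9):
    print(inner_with_Ky(y), f[int(y * (len(x) - 1))])  # columns nearly agree
```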
It follows from Mercer's theorem (Theorem 4) that there is an orthonormal basis $\{e_1, e_2, \ldots\}$ for $L_2(a,b)$ and real numbers $\lambda_1, \lambda_2, \ldots$ such that
$$K(x, y) = \sum_{j=1}^{\infty} \lambda_j\,e_j(x)\,e_j(y). \qquad (31)$$
The functions $e_j$ are eigenfunctions of $K$ and the $\lambda_j$'s are the corresponding eigenvalues:
$$\int K(x, y)\,e_j(y)\,dy = \lambda_j\,e_j(x). \qquad (32)$$
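Numerically, the eigenfunctions and eigenvalues in (32) can be approximated by discretizing the integral operator: on a uniform grid of $n$ points, the operator becomes the matrix $\mathbb{K}/n$. A sketch using the Gaussian kernel (our choice of kernel and bandwidth):

```python
# Discretized eigenproblem for the integral operator.
import numpy as np

n = 500
x = (np.arange(n) + 0.5) / n
K = np.exp(-(x[:, None] - x[None, :])**2 / 0.1**2)  # Gaussian kernel, sigma = 0.1
lam, U = np.linalg.eigh(K / n)                      # K/n approximates the operator
lam, U = lam[::-1], U[:, ::-1]                      # sort eigenvalues descending

print(lam[:5])                 # rapidly decaying eigenvalues lambda_j
e1 = np.sqrt(n) * U[:, 0]      # rescaled so that mean(e1^2) = int e_1^2 = 1
print((e1**2).mean())          # ~1: e_1 is normalized in L2
```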
Next we discuss how the functions in a Sobolev space can be parameterized by using another convenient basis. An ellipsoid is a set of the form
$$\Theta = \left\{ \theta : \sum_{j=1}^{\infty} a_j^2\,\theta_j^2 \le c^2 \right\}. \qquad (34)$$
Roughly speaking, $f = \sum_j \theta_j \phi_j$ (expanded in the Fourier basis) belongs to a Sobolev ball if and only if $\theta$ belongs to an ellipsoid of this form, where $a_j = (\pi j)^m$ for $j$ even and $a_j = (\pi (j-1))^m$ for $j$ odd. Thus, a Sobolev space corresponds to a Sobolev ellipsoid with $a_j^2 \sim (\pi j)^{2m}$.
Note that (36) allows us to define the Sobolev space $W_m$ for fractional values of $m$ as well as integer values. A multivariate version of Sobolev spaces can be defined as follows. Let $\alpha = (\alpha_1, \ldots, \alpha_d)$ be non-negative integers and define $|\alpha| = \alpha_1 + \cdots + \alpha_d$. Given $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, write $x^\alpha = x_1^{\alpha_1} \cdots x_d^{\alpha_d}$ and
$$D^\alpha = \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}. \qquad (37)$$
Then the Sobolev space is defined by
$$W_{m,p} = \Big\{ f \in L_p([a,b]^d) : D^\alpha f \in L_p([a,b]^d) \text{ for all } |\alpha| \le m \Big\}. \qquad (38)$$
We will see that in function estimation, the optimal rate of convergence over $W_{\beta,2}$ under $L_2$ loss is $O(n^{-2\beta/(2\beta + d)})$.
5 Besov Spaces*
Functions in Sobolev spaces are homogeneous, meaning that their smoothness does not vary
substantially across the domain of the function. Besov spaces are richer classes of functions
that include inhomogeneous functions.
Let
$$\Delta_h^{(r)} f(x) = \sum_{k=0}^{r} \binom{r}{k} (-1)^{r-k}\,f(x + kh) \qquad (39)$$
denote the $r$-th order difference. Thus, $\Delta_h^{(0)} f(x) = f(x)$ and
$$\Delta_h^{(r)} f(x) = \Delta_h^{(r-1)} f(x + h) - \Delta_h^{(r-1)} f(x). \qquad (40)$$
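A quick numerical check that the explicit formula (39) and the recursion (40) agree (the test function, step size, and grid are arbitrary choices):

```python
# r-th order differences: formula (39) versus recursion (40).
import numpy as np
from math import comb

def delta(f, x, h, r):
    # Delta_h^{(r)} f(x) = sum_k C(r, k) (-1)^{r-k} f(x + k h)
    return sum(comb(r, k) * (-1)**(r - k) * f(x + k * h) for k in range(r + 1))

f = lambda t: np.sin(3 * t) + t**2
x = np.linspace(0, 1, 11)
h = 0.01
lhs = delta(f, x, h, 3)
rhs = delta(f, x + h, h, 2) - delta(f, x, h, 2)   # recursion (40)
print(np.allclose(lhs, rhs))                       # True
```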
Next define
$$w_{r,p}(f; t) = \sup_{|h| \le t} \big\| \Delta_h^{(r)} f \big\|_p \qquad (41)$$
where $\|g\|_p = \left( \int |g(x)|^p\,dx \right)^{1/p}$. Given $(p, q, \varsigma)$, let $r$ be such that $r - 1 \le \varsigma \le r$. The Besov seminorm is defined by
$$\|f\|_{p,q}^{\varsigma} = \left( \int_0^\infty \big( h^{-\varsigma}\,w_{r,p}(f; h) \big)^q\,\frac{dh}{h} \right)^{1/q}. \qquad (42)$$
For $q = \infty$ we define
$$\|f\|_{p,\infty}^{\varsigma} = \sup_{0 < h < 1} \frac{w_{r,p}(f; h)}{h^{\varsigma}}. \qquad (43)$$
The Besov space $B_{p,q}^{\varsigma}(c)$ is defined to be the set of functions $f$ mapping $[0,1]$ into $\mathbb{R}$ such that $\int |f|^p < \infty$ and $\|f\|_{p,q}^{\varsigma} \le c$.
Besov spaces include a wide range of familiar function spaces. The Sobolev space $W_{m,2}$ corresponds to the Besov ball $B_{2,2}^{m}$. The generalized Sobolev space $W_{m,p}$, which uses an $L_p$ norm on the $m$-th derivative, is almost a Besov space in the sense that $B_{p,1}^{m} \subset W_p(m) \subset B_{p,\infty}^{m}$. The Hölder space $H_\alpha$ with $\alpha = k + \beta$ is equivalent to $B_{\infty,\infty}^{k+\beta}$, and the set $T$ consisting of functions of bounded variation satisfies $B_{1,1}^{1} \subset T \subset B_{1,\infty}^{1}$.
6 Entropy and Dimension

One way to measure the size of a function space is its metric entropy $H(\epsilon)$: the logarithm of the smallest number of balls of radius $\epsilon$ needed to cover the space. For balls in the spaces above (in dimension $d$), the entropy scales as follows:

Space                        $H(\epsilon)$
Sobolev $W_{m,p}$            $\epsilon^{-d/m}$
Besov $B_{p,q}^{\varsigma}$  $\epsilon^{-d/\varsigma}$
Hölder $H_\alpha$            $\epsilon^{-d/\alpha}$

7 Reproducing Kernel Hilbert Spaces
Intuitively, a reproducing kernel Hilbert space (RKHS) is a class of smooth functions defined
by an object called a Mercer kernel. Here are the details.
Mercer Kernels. A Mercer kernel is a continuous function $K : [a,b] \times [a,b] \to \mathbb{R}$ such that $K(x, y) = K(y, x)$, and such that $K$ is positive semidefinite, meaning that
$$\sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j)\,c_i\,c_j \ge 0 \qquad (45)$$
for all finite sets of points $x_1, \ldots, x_n \in [a,b]$ and all real numbers $c_1, \ldots, c_n$. The function
$$K(x, y) = \sum_{k=0}^{m-1} \frac{x^k y^k}{(k!)^2} + \int_0^{x \wedge y} \frac{(x - u)^{m-1} (y - u)^{m-1}}{((m-1)!)^2}\,du \qquad (46)$$
introduced in Section 4 on Sobolev spaces is an example of a Mercer kernel. The most commonly used kernel is the Gaussian kernel
$$K(x, y) = e^{-\|x - y\|^2 / \sigma^2}.$$
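As a quick illustration of (45), the Gram matrix of the Gaussian kernel at any finite point set should have no negative eigenvalues (up to roundoff). A sketch with arbitrary points and bandwidth:

```python
# Empirical check of positive semidefiniteness for the Gaussian kernel.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=30)
sigma = 0.2
K = np.exp(-(x[:, None] - x[None, :])**2 / sigma**2)  # Gram matrix K(x_i, x_j)
print(np.linalg.eigvalsh(K).min() >= -1e-10)          # True, up to roundoff
```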
Theorem 4 (Mercer's theorem) Suppose that $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is symmetric and satisfies $\sup_{x,y} K(x, y) < \infty$, and define
$$T_K f(x) = \int_{\mathcal{X}} K(x, y)\,f(y)\,dy. \qquad (47)$$
Suppose that $T_K : L_2(\mathcal{X}) \to L_2(\mathcal{X})$ is positive semidefinite; thus,
$$\int_{\mathcal{X}} \int_{\mathcal{X}} K(x, y)\,f(x)\,f(y)\,dx\,dy \ge 0 \qquad (48)$$
for every $f \in L_2(\mathcal{X})$. Then there is an orthonormal basis of eigenfunctions $e_1, e_2, \ldots$ of $T_K$ with non-negative eigenvalues $\lambda_1, \lambda_2, \ldots$ such that $K(x, y) = \sum_j \lambda_j\,e_j(x)\,e_j(y)$.
The positive semidefinite requirement for Mercer kernels is generally difficult to verify directly. But the following basic results show how one can build up kernels in pieces: sums and products of Mercer kernels are Mercer kernels, as are positive scalar multiples of them.
RKHS. Given a kernel $K$, let $K_x(\cdot)$ be the function obtained by fixing the first coordinate. That is, $K_x(y) = K(x, y)$. For the Gaussian kernel, $K_x$ is a Normal density (up to rescaling), centered at $x$. We can create functions by taking linear combinations of the kernel:
$$f(x) = \sum_{j=1}^{k} \alpha_j\,K_{x_j}(x).$$
Given two such functions $f(x) = \sum_{j=1}^{k} \alpha_j K_{x_j}(x)$ and $g(x) = \sum_{j=1}^{m} \beta_j K_{y_j}(x)$ we define an inner product
$$\langle f, g\rangle = \langle f, g\rangle_K = \sum_i \sum_j \alpha_i\,\beta_j\,K(x_i, y_j).$$
In general, $f$ (and $g$) might be representable in more than one way. You can check that $\langle f, g\rangle_K$ is independent of how $f$ (or $g$) is represented. The inner product defines a norm:
$$\|f\|_K = \sqrt{\langle f, f\rangle} = \sqrt{\sum_j \sum_k \alpha_j\,\alpha_k\,K(x_j, x_k)} = \sqrt{\alpha^T \mathbb{K} \alpha}$$
where $\alpha = (\alpha_1, \ldots, \alpha_k)^T$ and $\mathbb{K}$ is the matrix with entries $\mathbb{K}_{jk} = K(x_j, x_k)$.
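These formulas are easy to compute directly. A sketch for two functions in the span of the Gaussian kernel, with arbitrary points and coefficients:

```python
# <f, g>_K and ||f||_K for f = sum_j alpha_j K_{x_j}, g = sum_j beta_j K_{y_j}.
import numpy as np

sigma = 0.5
def K(a, b):
    # Gram matrix of the Gaussian kernel between point sets a and b
    return np.exp(-(a[:, None] - b[None, :])**2 / sigma**2)

xs, alpha = np.array([0.1, 0.4, 0.8]), np.array([1.0, -2.0, 0.5])  # represents f
ys, beta = np.array([0.2, 0.9]), np.array([0.3, 1.5])              # represents g

print(alpha @ K(xs, ys) @ beta)             # <f, g>_K
print(np.sqrt(alpha @ K(xs, xs) @ alpha))   # ||f||_K = sqrt(alpha^T K alpha)
```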
Since $f = \sum_j \alpha_j K_{x_j}$, taking $g = K_x$ in the definition of $\langle f, g\rangle$ gives
$$\langle f, K_x\rangle = \sum_j \alpha_j K(x_j, x) = f(x).$$
This implies that
$$\langle K_x, K_y\rangle = K(x, y).$$
This is called the reproducing property. It also implies that $K_x$ is the representer of the evaluation functional.
To verify that this is a well-defined Hilbert space, you should check that the following properties hold:
$$\langle f, g\rangle = \langle g, f\rangle,$$
$$\langle cf + dg, h\rangle = c\langle f, h\rangle + d\langle g, h\rangle,$$
$$\langle f, f\rangle = 0 \text{ if and only if } f = 0.$$
The last one is not obvious, so let us verify it here. It is easy to see that $f = 0$ implies that $\langle f, f\rangle = 0$. Now we must show that $\langle f, f\rangle = 0$ implies that $f(x) = 0$. So suppose that $\langle f, f\rangle = 0$. Pick any $x$. Then, by the reproducing property and the Cauchy-Schwarz inequality,
$$f^2(x) = \langle f, K_x\rangle^2 \le \|f\|_K^2\,\|K_x\|_K^2 = \langle f, f\rangle\,K(x, x) = 0,$$
so $f(x) = 0$.
The evaluation functional $\delta_x$ maps a function to its value at $x$: $\delta_x f = f(x)$. In an RKHS, the evaluation functional is continuous. Intuitively, this means that the functions in the space are well-behaved. To see this, suppose that $f_n \to f$. Then
$$\delta_x f_n = \langle f_n, K_x\rangle \to \langle f, K_x\rangle = f(x) = \delta_x f.$$
Example 5 Let $\mathbb{H}$ be all functions $f$ on $\mathbb{R}$ such that the support of the Fourier transform of $f$ is contained in $[-a, a]$. Then
$$K(x, y) = \frac{\sin(a(y - x))}{a(y - x)}$$
and
$$\langle f, g\rangle = \int f g.$$
Example 6 Let $\mathbb{H}$ be the space of functions on $(0,1)$ with norm
$$\|f\|^2 = \int_0^1 \big(f^2(x) + (f'(x))^2\big)\,x^2\,dx.$$
Then the reproducing kernel is
$$K(x, y) = (xy)^{-1}\Big( e^{-x} \sinh(y)\,I(0 < x \le y) + e^{-y} \sinh(x)\,I(0 < y \le x) \Big).$$
Example 7 The Sobolev space of order $m$ is (roughly speaking) the set of functions $f$ such that $\int (f^{(m)})^2 < \infty$. For $m = 2$ and $\mathcal{X} = [0,1]$ the kernel is
$$K(x, y) = \begin{cases} 1 + xy + \frac{x y^2}{2} - \frac{y^3}{6} & 0 \le y \le x \le 1 \\ 1 + xy + \frac{y x^2}{2} - \frac{x^3}{6} & 0 \le x \le y \le 1 \end{cases}$$
and
$$\|f\|_K^2 = f(0)^2 + f'(0)^2 + \int_0^1 (f''(x))^2\,dx.$$
Spectral Representation. Suppose that $\sup_{x,y} K(x, y) < \infty$. Define eigenvalues $\lambda_j$ and orthonormal eigenfunctions $\psi_j$ by
$$\int K(x, y)\,\psi_j(y)\,dy = \lambda_j\,\psi_j(x).$$
Then $\sum_j \lambda_j < \infty$ and $\sup_x |\psi_j(x)| < \infty$. Also,
$$K(x, y) = \sum_{j=1}^{\infty} \lambda_j\,\psi_j(x)\,\psi_j(y).$$
Representer Theorem. Let $\ell$ be a loss function depending on $(X_1, Y_1), \ldots, (X_n, Y_n)$ and on $f(X_1), \ldots, f(X_n)$. Let $\hat{f}$ minimize
$$\ell + g\big(\|f\|_K^2\big)$$
where $g$ is any monotone increasing function. Then $\hat{f}$ has the form
$$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i\,K(x_i, x)$$
for some $\alpha_1, \ldots, \alpha_n$.
For example, in RKHS regression we minimize $\sum_i (Y_i - m(X_i))^2 + \lambda \|m\|_K^2$. By the representer theorem the minimizer has the form $\hat{m}(x) = \sum_j \hat{\alpha}_j K(X_j, x)$, and a direct computation gives
$$\hat{\alpha} = (\mathbb{K} + \lambda I)^{-1} Y$$
where $\mathbb{K}_{ij} = K(X_i, X_j)$ and $Y = (Y_1, \ldots, Y_n)^T$. The fitted values are $\hat{Y} = \mathbb{K}\hat{\alpha} = \mathbb{K}(\mathbb{K} + \lambda I)^{-1} Y$, so this estimator is a linear smoother.
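Here is a minimal sketch of this regression estimator with simulated data (the true regression function, bandwidth, and $\lambda$ are our own choices):

```python
# Kernel ridge regression: alpha-hat = (K + lambda I)^{-1} Y.
import numpy as np

rng = np.random.default_rng(2)
n, sigma, lam = 100, 0.2, 0.1
X = np.sort(rng.uniform(0, 1, n))
Y = np.sin(4 * np.pi * X) + 0.3 * rng.normal(size=n)

K = np.exp(-(X[:, None] - X[None, :])**2 / sigma**2)   # K_{ij} = K(X_i, X_j)
alpha_hat = np.linalg.solve(K + lam * np.eye(n), Y)
Y_hat = K @ alpha_hat                                   # fitted values

def m_hat(x):
    # m-hat(x) = sum_j alpha-hat_j K(X_j, x)
    return np.exp(-(x[:, None] - X[None, :])**2 / sigma**2) @ alpha_hat

print(np.mean((Y_hat - np.sin(4 * np.pi * X))**2))  # average squared error of the fit
```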
Support Vector Machines. Suppose $Y_i \in \{-1, +1\}$. Recall that the linear SVM minimizes the penalized hinge loss:
$$J = \sum_i \big[1 - Y_i(\beta_0 + \beta^T X_i)\big]_+ + \frac{\lambda}{2}\,\|\beta\|_2^2.$$
The dual optimization is to maximize
$$\sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i\,\alpha_j\,Y_i\,Y_j\,\langle X_i, X_j\rangle$$
subject to $0 \le \alpha_i \le C$.
The dual is the same except that $\langle X_i, X_j\rangle$ is replaced with $K(X_i, X_j)$. This is called the kernel trick.
The Kernel Trick. This is a fairly general trick. In many algorithms you can replace $\langle x_i, x_j\rangle$ with $K(x_i, x_j)$ and get a nonlinear version of the algorithm. This is equivalent to replacing $x$ with $\Phi(x)$ and replacing $\langle x_i, x_j\rangle$ with $\langle \Phi(x_i), \Phi(x_j)\rangle$. However, $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j)\rangle$ and $K(x_i, x_j)$ is much easier to compute.
In summary, by replacing $\langle x_i, x_j\rangle$ with $K(x_i, x_j)$ we turn a linear procedure into a nonlinear procedure without adding much computation.
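To make the feature-map identity concrete: for the degree-2 polynomial kernel on $\mathbb{R}^2$, $K(x, y) = (1 + \langle x, y\rangle)^2 = \langle \Phi(x), \Phi(y)\rangle$ for the explicit six-dimensional map below (one standard choice of $\Phi$; the check is numerical):

```python
# The kernel trick: (1 + <x, y>)^2 equals <Phi(x), Phi(y)>.
import numpy as np

def Phi(x):
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], x[0]**2, x[1]**2, s * x[0] * x[1]])

rng = np.random.default_rng(3)
x, y = rng.normal(size=2), rng.normal(size=2)
print((1 + x @ y)**2, Phi(x) @ Phi(y))   # the same number, computed two ways
```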
Hidden Tuning Parameters. There are hidden tuning parameters in the RKHS. Consider the Gaussian kernel
$$K(x, y) = e^{-\|x - y\|^2 / \sigma^2}.$$
For nonparametric regression we minimize $\sum_i (Y_i - m(X_i))^2$ subject to $\|m\|_K \le L$. We control the bias-variance tradeoff by doing cross-validation over $L$. But what about $\sigma$? This parameter seems to get mostly ignored. Suppose we have a uniform distribution on a circle. The eigenfunctions of $K(x, y)$ are the sines and cosines. The eigenvalues $\lambda_k$ die off like $(1/\sigma)^{2k}$. So $\sigma$ affects the bias-variance tradeoff since it weights things towards lower order Fourier functions. In principle we can compensate for this by varying $L$. But clearly there is some interaction between $L$ and $\sigma$. The practical effect is not well understood.
Now consider the polynomial kernel $K(x, y) = (1 + \langle x, y\rangle)^d$. This kernel has the same eigenfunctions but the eigenvalues decay at a polynomial rate depending on $d$. So there is an interaction between $L$, $d$, and the choice of kernel itself.
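These spectra are easy to inspect empirically. The sketch below (a uniform grid standing in for a uniform design, with $\mathbb{K}/n$ approximating the integral operator) shows that a larger $\sigma$ gives faster eigenvalue decay, while the degree-$d$ polynomial kernel in one dimension has only $d + 1$ nonzero eigenvalues:

```python
# Gram-matrix spectra for Gaussian and polynomial kernels.
import numpy as np

n = 200
x = (np.arange(n) + 0.5) / n
D2 = (x[:, None] - x[None, :])**2

for sigma in (0.1, 0.5):
    lam = np.linalg.eigvalsh(np.exp(-D2 / sigma**2) / n)[::-1]
    print("Gaussian sigma =", sigma, ":", np.round(lam[:6], 4))

Kpoly = (1 + np.outer(x, x))**3 / n          # polynomial kernel, d = 3
lam = np.linalg.eigvalsh(Kpoly)[::-1]
print("Polynomial d = 3:", np.round(lam[:6], 6))   # rank d + 1 = 4
```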