
DS303: Introduction to Machine Learning

Kernel Methods

Manjesh K. Hanawal
Limitation of Half Spaces
Consider the domain points {−10, −9, . . . , −1, 0, 1, . . . , 9, 10},
where the labels are:
▶ +1 for all x such that |x| > 2.
▶ −1 otherwise.
This data cannot be separated by the usual halfspaces.
Also consider the following 2D dataset.
[Figure: a 2D dataset plotted on the x1–x2 axes, with both coordinates ranging roughly over −2 to 2.]
This dataset too cannot be separated using halfspaces.


Feature Mapping

Coming back to the first problem discussed, we need to find some transformation.
First define a mapping ψ : R → R² as follows:

ψ(x) = (x, x²). (1)

We use the term feature space to denote the range of ψ. After applying ψ, the data can be easily explained using the half-space:

h(x) = sign(⟨w, ψ(x)⟩ − b), (2)

where w = (0, 1) and b = 5.
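As a quick numerical check (a minimal NumPy sketch; the helper names psi and h are ours, not from the slides), this half-space labels every domain point correctly after the mapping:

import numpy as np

def psi(x):
    # Feature map of Equation 1: psi(x) = (x, x^2).
    return np.array([x, x ** 2])

def h(x, w=np.array([0.0, 1.0]), b=5.0):
    # Half-space predictor of Equation 2: h(x) = sign(<w, psi(x)> - b).
    return np.sign(np.dot(w, psi(x)) - b)

xs = np.arange(-10, 11)                            # the domain {-10, ..., 10}
labels = np.where(np.abs(xs) > 2, 1, -1)           # +1 iff |x| > 2
print(all(h(x) == y for x, y in zip(xs, labels)))  # True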



The basic paradigm is as follows:
1. Given some domain set X and a learning task, choose a
mapping ψ : X → F, for some feature space F, that will
usually be R^n for some n.
2. Given a sequence of labeled examples, S = (x_1, y_1), . . . , (x_m, y_m), create the image sequence:

Ŝ = (ψ(x_1), y_1), . . . , (ψ(x_m), y_m).

3. Train a linear predictor h over Ŝ.


4. Predict the label of a test point, x, to be h(ψ(x)).
The success of this paradigm depends on how good our function ψ is.
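As an illustrative sketch of these four steps on the earlier one-dimensional example (our own code, assuming NumPy and scikit-learn are available; LinearSVC is just one convenient linear predictor for step 3, any other would do):

import numpy as np
from sklearn.svm import LinearSVC

# Step 1: choose a feature map psi : X -> F (here F = R^2, as in Equation 1).
psi = lambda x: np.array([x, x ** 2])

# Step 2: map the labeled sample S into the image sequence S_hat.
xs = np.arange(-10, 11)
ys = np.where(np.abs(xs) > 2, 1, -1)
S_hat = np.stack([psi(x) for x in xs])

# Step 3: train a linear predictor h over S_hat.
h = LinearSVC(C=10.0).fit(S_hat, ys)

# Step 4: predict the label of a test point x as h(psi(x)).
print(h.predict(psi(7).reshape(1, -1)))        # expected [1], since |7| > 2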



Polynomial Feature Mapping and Half-Spaces

The prediction of a standard half-space classifier is based on a linear mapping x ↦ ⟨w, x⟩. However, certain datasets require nonlinear decision boundaries, which can be achieved using polynomial feature mappings.
Example: Consider a polynomial mapping of degree k:

p(x) = Σ_{j=0}^{k} w_j x^j, (3)

which can be rewritten as p(x) = ⟨w, ψ(x)⟩, where

ψ : R → R^{k+1}, x ↦ (1, x, x², . . . , x^k). (4)

This transformation allows a linear classifier to separate data that was not linearly separable in the original space.
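A minimal sketch (our own, in NumPy) showing that evaluating the polynomial of Equation 3 coincides with taking an inner product against the feature map of Equation 4:

import numpy as np

def psi(x, k):
    # Feature map of Equation 4: x -> (1, x, x^2, ..., x^k) in R^(k+1).
    return np.array([x ** j for j in range(k + 1)])

k = 3
w = np.array([1.0, -2.0, 0.5, 3.0])                    # coefficients w_0, ..., w_k
x = 1.7
p_direct = sum(w[j] * x ** j for j in range(k + 1))    # p(x) as in Equation 3
p_linear = float(np.dot(w, psi(x, k)))                 # <w, psi(x)>
print(np.isclose(p_direct, p_linear))                  # True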



Polynomial Feature Mapping

▶ It follows that learning a degree-k polynomial over R can be done by learning a linear mapping in the (k + 1)-dimensional feature space.
▶ Polynomial-based classifiers yield much richer hypothesis
classes than halfspaces.
▶ While the classifier is always linear in the feature space, it can
have highly nonlinear behavior on the original space from
which instances were sampled.
▶ If the range of ψ is very large, we need many more samples in order to learn a halfspace in the range of ψ.
▶ Performing calculations in the high-dimensional space might be too costly.



Kernel Trick

However, computing linear separators in very high-dimensional data may be computationally expensive.
The common solution is kernel-based learning. The term "kernels" describes inner products in the feature space. Given an embedding ψ of some domain space X into a Hilbert space, we define the kernel function:

K (x, x′ ) = ⟨ψ(x), ψ(x′ )⟩. (5)


One can think of K as specifying similarity between instances and
of the embedding ψ as mapping the domain into a more expressive
space.



Kernel Trick

The SVM optimization problem can be generalized as:

min_w ( f(⟨w, ψ(x_1)⟩, . . . , ⟨w, ψ(x_m)⟩) + R(∥w∥) )   (6)

where f : R^m → R is an arbitrary function and R : R_+ → R is a monotonically non-decreasing function.
Special Cases:
▶ Soft-SVM (for homogeneous halfspaces): R(a) = λa² and f(a_1, . . . , a_m) = (1/m) Σ_i max{0, 1 − y_i a_i}.
▶ Hard-SVM (for nonhomogeneous halfspaces): R(a) = a², with the constraints y_i(a_i + b) ≥ 1 for all i.
Key Insight: There exists an optimal solution in the span of
{ψ(x1 ), . . . , ψ(xm )}.



Representer Theorem

Theorem (Representer Theorem)


Assume that ψ is a mapping from X to a Hilbert space. Then, there exists a vector α ∈ R^m such that

w = Σ_{i=1}^{m} α_i ψ(x_i)

is an optimal solution of Equation 6.



SVM optimization problem
Note: All versions of the SVM optimization problem we have
derived so far are instances of the following general problem:

min_w ( f(⟨w, ψ(x_1)⟩, . . . , ⟨w, ψ(x_m)⟩) + R(∥w∥) )   (7)

where f : R^m → R is an arbitrary function and R : R_+ → R is a monotonically nondecreasing function.
Instead of solving Equation 7, we can solve the equivalent problem:

 
min_{α ∈ R^m} f( Σ_{j=1}^{m} α_j K(x_j, x_1), . . . , Σ_{j=1}^{m} α_j K(x_j, x_m) ) + R( √( Σ_{i,j} α_i α_j K(x_j, x_i) ) )   (8)
To solve the optimization problem given in Equation 8, we do not
need any direct access to elements in the feature space. The only
thing we should know is how to calculate inner products in the
feature space, or equivalently, to calculate the kernel function.
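For concreteness, a small sketch (our own; the polynomial kernel and the function names are illustrative) that touches the data only through kernel evaluations, precomputing the matrix of pairwise kernel values used in Equation 8:

import numpy as np

def poly_kernel(x, xp, k=2):
    # Degree-k polynomial kernel K(x, x') = (1 + <x, x'>)^k.
    return (1.0 + np.dot(x, xp)) ** k

def gram_matrix(X, kernel):
    # G[i, j] = K(x_i, x_j); this is all that Equation 8 needs from the data.
    m = len(X)
    G = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = kernel(X[i], X[j])
    return G

X = np.random.randn(5, 3)               # five instances in R^3
G = gram_matrix(X, poly_kernel)
print(G.shape, np.allclose(G, G.T))     # (5, 5) True: the matrix is symmetric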
SVM optimization problem

In fact, to solve Equation 8 on the previous slide we solely need to know the value of the m × m matrix G such that G_{i,j} = K(x_i, x_j), which is often called the Gram matrix. We can then write the equivalent soft-margin problem as:
min_{α ∈ R^m} ( λ αᵀGα + (1/m) Σ_{i=1}^{m} max(0, 1 − y_i (Gα)_i) )   (9)

where (G α)i is the ith element of the vector obtained by


multiplying the Gram matrix G by the vector α.
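A minimal sketch of one way to solve Equation 9, using subgradient descent on α (the algorithm, step size, and toy data below are our own choices; the slides do not prescribe a particular solver):

import numpy as np

def kernel_soft_svm(G, y, lam=0.1, lr=0.01, T=2000):
    # Subgradient descent on  lam * a^T G a + (1/m) sum_i max(0, 1 - y_i (G a)_i).
    m = len(y)
    a = np.zeros(m)
    for _ in range(T):
        Ga = G @ a
        active = y * Ga < 1                          # examples with nonzero hinge loss
        grad = 2 * lam * Ga - (G[:, active] @ y[active]) / m
        a -= lr * grad
    return a

# Toy run: Gaussian kernel on the 1-D data labeled +1 iff |x| > 2.
xs = np.arange(-10, 11).astype(float)
y = np.where(np.abs(xs) > 2, 1.0, -1.0)
G = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / 2.0)
a = kernel_soft_svm(G, y)
preds = np.sign(G @ a)                               # h(x_i) = sign(sum_j a_j K(x_j, x_i))
print(np.mean(preds == y))                           # training accuracy (should be close to 1.0)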



Polynomial Kernel

The degree k polynomial kernel is defined as

K(x, x′) = (1 + ⟨x, x′⟩)^k.

Now, we will show that this is indeed a kernel function. That is,
we will show that there exists a mapping ψ from the original space
to some higher-dimensional space for which

K (x, x ′ ) = ⟨ψ(x), ψ(x ′ )⟩.

For simplicity, denote x_0 = x′_0 = 1. Then, we have:



Polynomial Kernel

K(x, x′) = (1 + ⟨x, x′⟩)^k = (1 + ⟨x, x′⟩) · · · (1 + ⟨x, x′⟩)

= ( Σ_{j=0}^{n} x_j x′_j ) · · · ( Σ_{j=0}^{n} x_j x′_j )

= Σ_{J ∈ {0,1,...,n}^k} Π_{i=1}^{k} x_{J_i} x′_{J_i}

= Σ_{J ∈ {0,1,...,n}^k} ( Π_{i=1}^{k} x_{J_i} ) ( Π_{i=1}^{k} x′_{J_i} ).

Now, if we define ψ : R^n → R^{(n+1)^k} such that for each J ∈ {0, 1, . . . , n}^k there is an element of ψ(x) that equals Π_{i=1}^{k} x_{J_i}, we obtain that
Polynomial Kernel

K (x, x ′ ) = ⟨ψ(x), ψ(x ′ )⟩.


▶ Since ψ contains all the monomials up to degree k, a
halfspace over the range of ψ corresponds to a polynomial
predictor of degree k over the original space. Hence, learning
a halfspace with a degree k polynomial kernel enables us to
learn polynomial predictors of degree k over the original space.
▶ Note that here the complexity of implementing K is O(n), while the dimension of the feature space is on the order of n^k.
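A brute-force check (our own sketch) of this equality for small n and k, enumerating the (n+1)^k coordinates of ψ explicitly:

import itertools
import numpy as np

def psi(x, k):
    # One coordinate per J in {0,...,n}^k, equal to prod_i x_{J_i}, with x_0 = 1 prepended.
    x_ext = np.concatenate(([1.0], x))
    n = len(x)
    return np.array([np.prod(x_ext[list(J)])
                     for J in itertools.product(range(n + 1), repeat=k)])

def poly_kernel(x, xp, k):
    return (1.0 + np.dot(x, xp)) ** k

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(3), rng.standard_normal(3)
k = 3
print(np.isclose(poly_kernel(x, xp, k), np.dot(psi(x, k), psi(xp, k))))  # True
print(len(psi(x, k)))   # 64 = (n+1)^k coordinates, versus O(n) work for the kernel itself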



Gaussian Kernel
Let the original instance space be R and consider the mapping ψ where, for each nonnegative integer n ≥ 0, there exists an element ψ(x)_n that equals √(1/n!) e^{−x²/2} x^n. Then,

⟨ψ(x), ψ(x′)⟩ = Σ_{n=0}^{∞} ( √(1/n!) e^{−x²/2} x^n ) ( √(1/n!) e^{−(x′)²/2} (x′)^n )

= e^{−(x² + (x′)²)/2} Σ_{n=0}^{∞} (x x′)^n / n!

= e^{−(x² + (x′)²)/2} e^{x x′}

= e^{−∥x − x′∥²/2}.

Here the feature space is of infinite dimension while evaluating the kernel is very simple. More generally, given a scalar σ > 0, the Gaussian kernel is defined to be

K(x, x′) = e^{−∥x − x′∥²/(2σ)}.
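A small sketch (our own) of the kernel, together with a numerical check of the series derivation above for scalar inputs and σ = 1 (the truncation point is arbitrary):

import math
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # Gaussian (RBF) kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma)).
    diff = np.atleast_1d(np.asarray(x, dtype=float) - np.asarray(xp, dtype=float))
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma)))

# psi(x)_n = sqrt(1/n!) * exp(-x^2/2) * x^n, so the truncated series should match K.
x, xp = 0.8, -1.3
series = sum(math.exp(-x ** 2 / 2) * x ** n * math.exp(-xp ** 2 / 2) * xp ** n / math.factorial(n)
             for n in range(60))
print(np.isclose(series, gaussian_kernel(x, xp)))  # True (series truncated at n = 60)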
Gaussian Kernel

▶ Intuitively, the Gaussian kernel sets the inner product in the feature space between x, x′ to be close to zero if the instances are far away from each other (in the original domain) and close to 1 if they are close. σ is a parameter that controls the scale determining what we mean by close.
▶ It is easy to verify that K implements an inner product in a space in which, for any n and any monomial of order k, there exists an element of ψ(x) that equals √(1/n!) e^{−∥x∥²/2} Π_{i=1}^{n} x_{J_i}.
Hence, we can learn any polynomial predictor over the original
space by using a Gaussian kernel.



The RBF Kernel

The Gaussian kernel is also called the RBF kernel, for “Radial
Basis Functions.”



Lemma

A symmetric function K : X × X → R implements an inner product in some Hilbert space if and only if it is positive semidefinite; namely, for all x_1, . . . , x_m, the Gram matrix G_{i,j} = K(x_i, x_j) is a positive semidefinite matrix.
Proof:
It is trivial to see that if K implements an inner product in some
Hilbert space then the Gram matrix is positive semidefinite. For
the other direction, define the space of functions over X as
R^X = {f : X → R}. For each x ∈ X, let ψ(x) be the function
x 7→ K (·, x). Define a vector space by taking all linear
combinations of elements of the form K (·, x). Define an inner
product on this vector space to be
⟨ Σ_i α_i K(·, x_i), Σ_j β_j K(·, x′_j) ⟩ = Σ_{i,j} α_i β_j K(x_i, x′_j).
Lemma

This is a valid inner product since it is symmetric (because K is symmetric), it is linear (immediate), and it is positive definite (it is easy to see that K(x, x) ≥ 0 with equality only for ψ(x) being the zero function). Clearly,

⟨ψ(x), ψ(x ′ )⟩ = ⟨K (·, x), K (·, x ′ )⟩ = K (x, x ′ ),


which concludes our proof.
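As a quick empirical illustration of the easy direction (our own sketch): since the Gaussian kernel implements an inner product, the Gram matrix of any set of points should have no negative eigenvalues, up to numerical error:

import numpy as np

def gaussian_gram(X, sigma=1.0):
    # Gram matrix G[i, j] = exp(-||x_i - x_j||^2 / (2 sigma)).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))                 # twenty arbitrary points in R^4
eigvals = np.linalg.eigvalsh(gaussian_gram(X))   # the Gram matrix is symmetric
print(eigvals.min() >= -1e-10)                   # True: positive semidefinite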

