01 Machine Learning Basics
Xiaogang Wang
xgwang@ee.cuhk.edu.hk
January 5, 2015
$f : \mathbb{R}^D \to \mathbb{R}^M$
$\nabla_w \text{MSE}_{\text{train}} = 0$
$$\text{Performance}_{\text{test}} = \frac{1}{M} \sum_{i=1}^{M} \text{Error}\big(f(x_i^{(\text{test})}),\, y_i^{(\text{test})}\big)$$
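A minimal numpy sketch tying these fragments together: fit a linear map by solving $\nabla_w \text{MSE}_{\text{train}} = 0$ in closed form (the normal equations) and then report the average held-out error. The synthetic data and names (`X_train`, `y_test`, ...) are assumptions for illustration, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set {x_i^(train), y_i^(train)} and test set of size M
X_train = rng.normal(size=(100, 3))                  # N x D design matrix
w_true = np.array([1.0, -2.0, 0.5])
y_train = X_train @ w_true + 0.1 * rng.normal(size=100)
X_test = rng.normal(size=(20, 3))                    # M x D
y_test = X_test @ w_true + 0.1 * rng.normal(size=20)

# Setting the gradient of MSE_train to zero gives the normal equations:
#   X^T X w = X^T y   =>   w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# Performance_test = (1/M) * sum_i Error(f(x_i^(test)), y_i^(test)), with squared error
performance_test = np.mean((X_test @ w - y_test) ** 2)
print(performance_test)
```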
$y = w_2 x^2 + w_1 x + b$
The learner cannot find a solution that fits the training examples well (underfitting).
For example, using linear regression to fit training examples $\{(x_i^{(\text{train})}, y_i^{(\text{train})})\}$ where $y_i^{(\text{train})}$ is a quadratic function of $x_i^{(\text{train})}$.
The learner fits the training data well but loses the ability to generalize, i.e., it has a small training error but a large generalization error.
A learner with large capacity tends to overfit
The family of functions is too large (compared with the size of the
training data) and it contains many functions which all fit the
training data well.
Without sufficient data, the learner cannot distinguish which one is
most appropriate and would make an arbitrary choice among
these apparently good solutions
A separate validation set helps to choose a more appropriate one
In most cases, data is contaminated by noise. A learner with large capacity tends to fit the random errors or noise rather than the underlying structure of the data (classes), as the sketch below illustrates.
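A small, self-contained sketch of this trade-off, assuming 1-D inputs, a quadratic data source with Gaussian noise, and polynomial learners of different capacities (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Quadratic data source, y = w2*x^2 + w1*x + b, plus Gaussian noise below
    return 2.0 * x**2 - 1.0 * x + 0.5

x_train = rng.uniform(-1, 1, size=10)
y_train = true_f(x_train) + 0.1 * rng.normal(size=10)
x_test = rng.uniform(-1, 1, size=1000)
y_test = true_f(x_test) + 0.1 * rng.normal(size=1000)

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)     # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (both errors large); degree 9 interpolates the 10 points and
    # typically overfits (tiny training error, larger test error); degree 2 matches the source
    print(degree, train_err, test_err)
```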
(Duda et al. Pattern Classification 2000)
Typical relationship between capacity and both training and generalization (or test)
error. As capacity increases, training error can be reduced, but the optimism
(difference between training and generalization error) increases. At some point, the
increase in optimism is larger than the decrease in training error (typically when the
training error is low and cannot go much lower), and we enter the overfitting regime,
where capacity is too large, above the optimal capacity. Before reaching optimal
capacity, we are in the underfitting regime.
As the number of training examples increases, optimal capacity (bold black) increases (we can afford a bigger and
more flexible model), and the associated generalization error (green bold) would decrease, eventually reaching the
(non-parametric) asymptotic error (green dashed line). If capacity were fixed (parametric setting), increasing the
number of training examples would also decrease generalization error (top red curve), but not as fast, and training
error would slowly increase (bottom red curve), so that both would meet at an asymptotic value (dashed red line)
corresponding to the best achievable solution in some class of learned functions.
In the figure above, the training data (10 black dots) were sampled from a
quadratic function plus Gaussian noise, i.e., $f(x) = w_2 x^2 + w_1 x + b + \epsilon$, where
$p(\epsilon) = N(0, \sigma^2)$. The degree-10 polynomial fits the data perfectly. Which learner
should be chosen in order to better predict new examples: the second-order
function or the 10th-degree function?
If the ten training examples were generated from a 10th-degree polynomial plus
Gaussian noise, which learner should be chosen?
If one million training examples were generated from a quadratic function
plus Gaussian noise, which learner should be chosen?
The more training samples in each cell, the more robust the
classifier
The number of cells grows exponentially with the dimensionality
of the feature space. If each dimension is divided into three
intervals, the number of cells is $N = 3^D$ (see the sketch below).
Some cells are empty when the number of cells is very large!
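A quick back-of-the-envelope sketch of how the cell count $N = 3^D$ outpaces any fixed training-set size; the sample budget below is an assumed value for illustration:

```python
# Each of D feature dimensions split into 3 intervals gives N = 3**D cells.
n_samples = 1_000_000  # hypothetical fixed training-set size

for D in (2, 5, 10, 20):
    n_cells = 3 ** D
    # Average samples per cell shrinks exponentially; for large D most cells are empty
    print(D, n_cells, n_samples / n_cells)
```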
(Duda et al. Pattern Classification 2000)
Examples
The objective function for linear regression becomes
$$\text{MSE}_{\text{train}} + \text{regularization} = \frac{1}{N} \sum_i \big(w^t x_i^{(\text{train})} - y_i^{(\text{train})}\big)^2 + \lambda \|w\|_2^2$$
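Minimizing this regularized objective has a closed form. A minimal numpy sketch, assuming a design matrix `X` whose rows are the $x_i^{(\text{train})}$ and a target vector `y`; the factor $N$ in front of $\lambda$ comes from the $\frac{1}{N}$ on the data term:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N) * sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2 in closed form."""
    N, D = X.shape
    # Setting the gradient to zero: (2/N) X^T (X w - y) + 2*lam*w = 0
    #   =>  (X^T X + N*lam*I) w = X^T y
    return np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)

# Illustrative usage with synthetic data; larger lam shrinks w toward zero
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
w = ridge_fit(X, y, lam=0.1)
```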
Multi-task learning, transfer learning, dropout, sparsity, pre-training
$\text{bias}(\hat\theta) = E(\hat\theta) - \theta$
where the expectation is over all training sets of size $n$ sampled from the
underlying distribution
An estimator is called unbiased if $E(\hat\theta) = \theta$
Example: Gaussian distribution, $p(x_i; \theta) = \mathcal{N}(\theta, \Sigma)$, with the estimator $\hat\theta = \frac{1}{n} \sum_{i=1}^{n} x_i^{(\text{train})}$:
$$E(\hat\theta) = E\left[\frac{1}{n} \sum_{i=1}^{n} x_i^{(\text{train})}\right] = \frac{1}{n} \sum_{i=1}^{n} E\left[x_i^{(\text{train})}\right] = \frac{1}{n} \sum_{i=1}^{n} \theta = \theta$$
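A short simulation of this fact: averaging $\hat\theta$ over many sampled training sets of size $n$ approaches $\theta$. The true mean, standard deviation, and $n$ below are assumed values:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n = 1.5, 2.0, 20          # assumed true mean, std, and training-set size

# Draw many training sets of size n; each row gives one estimate theta_hat = (1/n) sum_i x_i
estimates = rng.normal(theta, sigma, size=(100_000, n)).mean(axis=1)

# E(theta_hat) is approximated by averaging over training sets; it is close to theta,
# i.e. bias(theta_hat) = E(theta_hat) - theta is approximately 0
print(estimates.mean())
```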
$$\hat\theta = \arg\max_{\theta} \sum_{i=1}^{n} \log P\big(y_i^{(\text{train})} \mid x_i^{(\text{train})}, \theta\big) + \log p(\theta)$$
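A hedged numeric sketch of the MAP rule for a simpler (assumed) case than the conditional model above: estimating a Gaussian mean from direct observations with a Gaussian prior, where the grid maximizer of the log-posterior can be checked against the conjugate closed form. All parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed setup: observations from N(theta_true, sigma^2) with a Gaussian prior N(mu0, tau^2)
theta_true, sigma = 1.0, 0.5
mu0, tau = 0.0, 1.0
x = rng.normal(theta_true, sigma, size=10)

# MAP: maximize sum_i log p(x_i | theta) + log p(theta) over a grid of candidate thetas
# (additive constants independent of theta are dropped from the log densities)
grid = np.linspace(-2.0, 3.0, 5001)
log_post = -0.5 * (((x[:, None] - grid) / sigma) ** 2).sum(axis=0) \
           - 0.5 * ((grid - mu0) / tau) ** 2
theta_map = grid[np.argmax(log_post)]

# Conjugate closed form: a precision-weighted combination of the prior mean and the data
theta_closed = (sigma**2 * mu0 + tau**2 * x.sum()) / (sigma**2 + len(x) * tau**2)
print(theta_map, theta_closed)          # the two agree up to the grid resolution
```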
$$\arg\max_{\{e_i\}} \sum_{k=1}^{n} \|\tilde{x}_k - \bar{x}\|^2 \quad \text{or} \quad \arg\min_{\{e_i\}} \sum_{k=1}^{n} \|x_k - \tilde{x}_k\|^2$$
$$J_1(a_{11}, \ldots, a_{n1}, e_1) = \sum_{k=1}^{n} \|(\bar{x} + a_{k1} e_1) - x_k\|^2 = \sum_{k=1}^{n} \|a_{k1} e_1 - (x_k - \bar{x})\|^2$$
$$= \sum_{k=1}^{n} a_{k1}^2 \|e_1\|^2 - 2 \sum_{k=1}^{n} a_{k1} e_1^t (x_k - \bar{x}) + \sum_{k=1}^{n} \|x_k - \bar{x}\|^2$$
Since $e_1$ is a unit vector, $\|e_1\| = 1$. To minimize $J_1$, set $\frac{\partial J_1}{\partial a_{k1}} = 0$, which gives
$$a_{k1} = e_1^t (x_k - \bar{x})$$
We obtain a least-squares solution by projecting the vector $x_k$ onto the line in the direction of $e_1$ passing through the mean.
$$J_1(e_1) = \sum_{k=1}^{n} a_{k1}^2 - 2 \sum_{k=1}^{n} a_{k1}^2 + \sum_{k=1}^{n} \|x_k - \bar{x}\|^2$$
$$= -\sum_{k=1}^{n} \big(e_1^t (x_k - \bar{x})\big)^2 + \sum_{k=1}^{n} \|x_k - \bar{x}\|^2$$
$$= -\sum_{k=1}^{n} e_1^t (x_k - \bar{x})(x_k - \bar{x})^t e_1 + \sum_{k=1}^{n} \|x_k - \bar{x}\|^2$$
$$= -e_1^t S e_1 + \sum_{k=1}^{n} \|x_k - \bar{x}\|^2$$
$S = \sum_{k=1}^{n} (x_k - \bar{x})(x_k - \bar{x})^t$ is the scatter matrix.
$e_1^t S e_1 = \sum_{k=1}^{n} a_{k1}^2$ (with $\sum_{k=1}^{n} a_{k1} = 0$) is the variance of the projected data.
The vector $e_1$ that minimizes $J_1$ also maximizes $e_1^t S e_1$, subject to the constraint that $\|e_1\| = 1$.
Maximizing $e_1^t S e_1$ subject to $\|e_1\| = 1$ (e.g., via a Lagrange multiplier) leads to the eigenvalue problem
$$S e_1 = \lambda e_1$$
so $e_1$ is the eigenvector of the scatter matrix with the largest eigenvalue.
$d'$-dimensional representation: $\tilde{x}_k = \bar{x} + \sum_{i=1}^{d'} a_{ki} e_i$
Mean-squared criterion function:
$$J_{d'} = \sum_{k=1}^{n} \left\| \left(\bar{x} + \sum_{i=1}^{d'} a_{ki} e_i\right) - x_k \right\|^2$$
$$J_d = 0 \;\Rightarrow\; \sum_{k=1}^{n} \|x_k - \bar{x}\|^2 = \sum_{i=1}^{d} \lambda_i$$
$$J_{d'} = \sum_{i=d'+1}^{d} \lambda_i$$
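A compact numpy sketch of this result on assumed synthetic data: build the scatter matrix $S$, keep the top $d'$ eigenvectors, reconstruct, and check that $J_{d'}$ equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # n x d data with correlated features
x_bar = X.mean(axis=0)
Xc = X - x_bar

# Scatter matrix S = sum_k (x_k - x_bar)(x_k - x_bar)^t and its eigen-decomposition
S = Xc.T @ Xc
eigvals, eigvecs = np.linalg.eigh(S)                      # ascending order for symmetric S
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d_prime = 2
E = eigvecs[:, :d_prime]                                  # columns e_1, ..., e_{d'}
A = Xc @ E                                                # coefficients a_ki = e_i^t (x_k - x_bar)
X_tilde = x_bar + A @ E.T                                 # d'-dimensional reconstruction

# J_{d'} = sum_k ||x_tilde_k - x_k||^2 equals the sum of the discarded eigenvalues
J = np.sum((X_tilde - X) ** 2)
print(J, eigvals[d_prime:].sum())                         # match up to round-off
```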
$$f(x) = b + \sum_{i=1}^{n} \alpha_i K(x, x_i)$$
$K$ is a kernel function, e.g., the Gaussian kernel $K(u, v) = N(u - v;\, 0,\, \sigma^2 I)$.
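A brief sketch of a predictor of this form with a Gaussian kernel. The slides do not specify how $\alpha_i$ and $b$ are chosen, so the sketch assumes a ridge-regularized least-squares fit with $b$ fixed to the mean of the targets; the data and names are illustrative:

```python
import numpy as np

def gaussian_kernel(U, V, sigma=0.5):
    """Gaussian kernel, proportional to N(u - v; 0, sigma^2 I), for all pairs of rows."""
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(50, 1))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.normal(size=50)

# Fit f(x) = b + sum_i alpha_i K(x, x_i): alpha solves the regularized linear system
# (K + lam*I) alpha = y - b, with b fixed to mean(y) for simplicity
K = gaussian_kernel(X, X)
b = y.mean()
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y - b)

X_new = np.linspace(-2, 2, 5)[:, None]
f_new = b + gaussian_kernel(X_new, X) @ alpha             # predictions at new inputs
print(f_new)
```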