
ML Recap


ML Algorithms

Draft V1.0
Do you really understand...
Outline
Supervised Learning:
Linear regression, Logistic regression
Linear Discriminant Analysis
Principal Component Analysis
Neural networks
Support vector machines
K-nearest neighbors
Gradient Boosting Decision Tree
Decision trees (ID3, C4.5, CART), Random Forests
Kernels, Kernel-based PCA

Optimization Methods
Linear Regression
Using the dataset below to predict the price of a house.
The target is to find a line that fits these points well.

The equation of the line is shown below.
There are infinitely many possible lines, so how can we evaluate which one is best?
We introduce the concept of a cost function.

The most commonly used cost function is the squared error cost function.
The closer it is to zero, the better the line fits the dataset.
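For reference, a standard way to write this cost function (textbook notation with m training examples; the slide's own formula is an image and is not reproduced in the text):

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2, \qquad h_\theta(x) = \theta_0 + \theta_1 x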
Tuning the Parameters of Linear Regression
The line is determined by theta0 and theta1, and the cost function is a function of (theta0, theta1).

The target is to find the theta0 and theta1 for which the cost function reaches a global
or local minimum.

To find them, we introduce the concept of gradient descent.


Gradient Descent: optimization in
Linear Regression
Starting from a simple case where theta0 = 0,
the cost function J looks like this, and we use
the equation below to find the optimal theta1.
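A sketch of the corresponding update rule in standard notation (α is the learning rate; assumed textbook form, not copied from the slide):

\theta_1 := \theta_1 - \alpha \, \frac{\partial J(\theta_1)}{\partial \theta_1} = \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \big( \theta_1 x^{(i)} - y^{(i)} \big)\, x^{(i)} \qquad (\text{case } \theta_0 = 0)

In the general case both parameters are updated simultaneously: \theta_j := \theta_j - \alpha \, \partial J(\theta_0, \theta_1) / \partial \theta_j for j = 0, 1.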
Sigmoid Functions: Activation functions

“S” shaped functions

erf: Error Function


Logistic Function

Helpful in:
Logistic Regression
Neural Networks
Logistic Function

Features:
Smooth, monotonic, sigmoid (S-shaped)...
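For reference, the standard form of the logistic function (assumed here, since the slide's own formula is not in the text):

\theta(s) = \frac{1}{1 + e^{-s}} = \frac{e^{s}}{1 + e^{s}}, \qquad \theta(s) \in (0, 1)

It is smooth, monotonic, S-shaped, and satisfies 1 - \theta(s) = \theta(-s).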
Logistic Regression: Motivation
Target function: f(x) = P(+1|x) lies within [0,1]
Logistic Regression: binary
classification
Risk Score

Convert a score into an

estimated probability using the logistic function
Logistic Regression: likelihood estimation
Logistic hypothesis:

Error Function: Likelihood Estimation

h(x) = θ(x)

1 − h(x) = h(−x)
Logistic Regression: likelihood estimation

y: -1 or +1
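A sketch of the likelihood-based error for the y ∈ {−1, +1} convention, assuming the usual hypothesis h(x) = θ(wᵀx) (standard cross-entropy form, not copied from the slide):

\text{err}(w, x, y) = \ln\!\big(1 + e^{-y\, w^{\top} x}\big), \qquad E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\!\big(1 + e^{-y_n w^{\top} x_n}\big)

Minimizing E_in(w) is equivalent to maximizing the likelihood \prod_n \theta(y_n w^{\top} x_n).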
Logistic Regression: optimization
Details go here.
Logistic Regression: comparison
Some methods

erf: Error Function


Screenshots from here.
Logistic regression & SVM: in
practice

1. PCA analysis

components range < 25


Logistic regression
2. logistic regression with
{sklearn.linear_model.LogisticRegression} and {sklearn.linear_model.SGDClassifier} (logistic loss)

3. Conclusion (discussion 1)

(1) Logistic regression does not work well on this dataset because of its very high dimensionality, even with
regularization. I assume SVM might do better.

(2) Logistic regression uses the same likelihood machinery as the SVM here, via sklearn.svm.base._fit_liblinear.
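A minimal sketch of the pipeline described above, with a placeholder dataset (the actual dataset, component count, and regularization settings from the experiment are not given in the slides, so these values are illustrative):

# Hypothetical sketch: PCA (< 25 components) followed by logistic regression,
# comparing sklearn.linear_model.LogisticRegression and SGDClassifier.
from sklearn.datasets import load_breast_cancer            # placeholder dataset
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. PCA analysis, keeping fewer than 25 components
# 2. logistic regression on the projected features
lr_pipe = make_pipeline(StandardScaler(), PCA(n_components=20),
                        LogisticRegression(C=1.0, max_iter=1000))
sgd_pipe = make_pipeline(StandardScaler(), PCA(n_components=20),
                         SGDClassifier(loss="log_loss", alpha=1e-4))  # use loss="log" on older scikit-learn

print("LogisticRegression:", cross_val_score(lr_pipe, X, y, cv=5).mean())
print("SGDClassifier     :", cross_val_score(sgd_pipe, X, y, cv=5).mean())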
Random Forest
Bagging (bootstrap aggregating)

function bag(D, A): for t = 1, 2, ..., T

1. Request a size-N' dataset D' by bootstrapping from D

2. Obtain base hypothesis gt by A(D')

Return G = Uniform({gt})
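A minimal sketch of bag(D, A) with decision trees as the base algorithm A and a uniform majority vote; the dataset and T are placeholders, not from the slides:

# Hypothetical sketch of bagging: T bootstrap samples, one base learner per sample,
# and a uniform majority vote over the T learners.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
T, N = 25, len(X)

learners = []
for t in range(T):
    idx = rng.integers(0, N, size=N)                       # size-N' dataset D' bootstrapped from D
    learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # base g_t = A(D')

# G = Uniform({g_t}): majority vote of the base learners
votes = np.stack([g.predict(X) for g in learners])          # shape (T, N)
G_pred = np.array([np.bincount(votes[:, i]).argmax() for i in range(N)])
print("training accuracy of the bagged ensemble:", (G_pred == y).mean())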
Random Forest
Decision Tree

function Tree(D):

if the termination criterion is met, return base gt; else

1. learn branching criterion b(x), then split D into subsets Dc by b(x)

2. build Gc <- Tree(Dc)

3. return G(x) = Σc [b(x) = c] Gc(x)

Bagging: reduces variance by voting/averaging

Decision Tree: has large variance, especially when fully grown


Random Forest
Putting them together?

Random Forest (RF) = bagging + fully-grown C&RT decision trees

function bag(D, A): for t = 1, 2, ..., T

1. Request a size-N' dataset D' by

bootstrapping from D

2. Obtain tree gt by DTree(D')

1. Highly efficient/parallel to learn

2. Inherits the pros of the C&RT tree

3. Eliminates the cons of a fully-grown C&RT tree


Random Forest
Diversifying by Feature Projection

Recall: data randomness for diversity in bagging


randomly sample N' examples from D
Another possibility for diversity:
randomly sample d' features from X
Namely, the new dataset lies in a random d'-dimensional subspace of the d-dimensional feature space
often d' << d, efficient when d is large

Re-sample new subspace for each b(x) in C&RT

RF=bagging + random subspace C&RT


Random Forest
Projection (combination) with a random matrix P, so that the new features are Px

Often consider a low-dimensional projection:

only d'' non-zero components in each projection vector p

includes random subspace as a special case:

d'' = 1 and p is a natural basis vector

RF = bagging + (random-combination) C&RT


Random Forest
Decision Tree
•The major Decision Tree implementations are:
•ID3, or Iterative Dichotomiser, was the first of three Decision Tree implementations developed by Ross
Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.)
•CART, or Classification And Regression Trees, is often used as a generic acronym for the term Decision Tree,
though it apparently has a more specific meaning. In sum, the CART implementation is very similar to C4.5;
the one notable difference is that CART constructs the tree based on a numerical splitting criterion
recursively applied to the data, whereas C4.5 includes the intermediate step of constructing rule sets.
•C4.5, Quinlan's next iteration. The new features (versus ID3) are: (i) accepts both continuous and discrete
features; (ii) handles incomplete data points; (iii) solves the over-fitting problem with a (very clever) bottom-up
technique usually known as "pruning"; and (iv) different weights can be applied to the features that comprise
the training data. Of these, the first three are very important, and I would suggest that any DT
implementation you choose have all three. The fourth (differential weighting) is much less important.
Decision Tree
•ID3 and C4.5 use Shannon entropy to pick the features with the greatest information gain as
nodes. As an example, let's say we would like to classify animals. You would probably ask more
general questions (like "Is it a mammal?") first and, once confirmed, continue with more specific
questions (like "Is it a monkey?"). In terms of information gain, the general questions of our toy
example give you more information in addition to what you already know (that it is an
animal).
•CART uses Gini impurity instead. Gini impurity is a measure of the homogeneity (or "purity")
of the nodes. If all data points at one node belong to the same class, then this node is
considered "pure". So by minimising the Gini impurity the decision tree finds the features that
separate the data best.
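A small illustrative sketch of the two splitting criteria (generic helper functions written for this note, not taken from any particular library):

# Hypothetical helpers: Shannon entropy (ID3/C4.5) vs Gini impurity (CART),
# computed from the class labels that reach one node.
import numpy as np

def entropy(labels):
    # H = -sum(p_k * log2(p_k)); 0 for a "pure" node, larger for mixed nodes
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G = 1 - sum(p_k ** 2); 0 for a "pure" node, up to 1 - 1/K for mixed nodes
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

node = ["mammal", "mammal", "bird", "mammal"]
print(entropy(node), gini(node))                  # mixed node: both measures > 0
print(entropy(["bird"] * 4), gini(["bird"] * 4))  # pure node: both are 0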
Decision Tree
•Ensemble
•Build many “base” decision trees, using different subsets of the data.
•Trees can vote on the class of a new input example.
•Accuracy of the ensemble should be better than that of the individual trees.
•Bagging
•Randomly draw a “bootstrap” sample from training data with replacement.
•Apply a classifier to each sample independently.
•Combine the outputs of the classifiers (e.g. majority voting).
•Random Forests
•Ensemble built from multiple tree models, generated using both bagging and subspace
sampling strategies.
Random Forests
•Forest-RI
•Forest-RC
Decision Tree: in practice with sklearn.ensemble.RandomForestClassifier
•Main parameters:
•max_features: The number of features to consider when looking for the best split:
If int, then consider max_features features at each split
If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.

if “auto”, then max_features=sqrt(n_features).


If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
•max_depth: (default=None) The maximum depth of the tree. If None, then nodes are expanded
until all leaves are pure or until all leaves contain less than min_samples_split samples.

•n_estimators (default=10): The number of trees in the forest.
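A minimal usage sketch of these parameters (the dataset and values are placeholders, not taken from the slides):

# Hypothetical example: sklearn.ensemble.RandomForestClassifier with the main
# parameters discussed above (n_estimators, max_features, max_depth).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=10,      # number of trees in the forest
    max_features="sqrt",  # consider sqrt(n_features) features at each split
    max_depth=None,       # grow each tree until its leaves are pure
    random_state=0,
)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())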


Principal Component Analysis
Suppose the training samples are

x-dimensional.

PCA tries to find a set of y vectors (y < x)

that capture the maximum amount of

variance in the original training data.


Principal Component Analysis
How do we define “variance”?

If we can define V(v), it is

relatively easy to find the

maximum “variance”.
Principal Component Analysis
Using the cosine rule we can deduce that:

the length of the projection = XᵀV,

where X is the original data and V is the

target vector (direction).
Principal Component Analysis
C = the covariance matrix
Principal Component Analysis
Find v that maximizes

the value of σ (the variance of the projection).

f(v) = vᵀCv

g(v) = vᵀv

F(v, λ) = f(v) − λ(g(v) − 1)
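Writing the Lagrangian step out in full (standard derivation, with C the covariance matrix and the constraint vᵀv = 1):

F(v, \lambda) = v^{\top} C v - \lambda\,(v^{\top} v - 1), \qquad \frac{\partial F}{\partial v} = 2Cv - 2\lambda v = 0 \;\Rightarrow\; Cv = \lambda v

So the maximizing direction v is an eigenvector of C, and the variance achieved along it is the corresponding eigenvalue λ.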


Principal Component Analysis
Take the top n largest eigenvalues λ
until the total variance
(their sum) meets the
requirement.
The corresponding set
of eigenvectors ν and the projections
onto them are the result.

From the graph we can tell that:

the variance along each component equals its eigenvalue.
KPCA (Kernel-Based PCA)
Calculate the eigenvalues and
eigenvectors of the covariance
matrix in the virtual (feature) space?

φ is implicit, so we cannot directly

calculate the result.
KPCA (Kernel-Based PCA)

k(xi, xj) = the inner product of xi and xj in

the higher-dimensional space, which can be
calculated directly.
KPCA (Kernel-Based PCA)
where u and λ are

the eigenvectors and

eigenvalues of K.

We have to make sure


KPCA (Kernel-Based PCA)
Note that ‖u‖ = 1

But we still don’t know the

value of
KPCA (Kernel-Based PCA)
Never mind: we only have to

get the projection of the data

in the virtual space, which is


KPCA (Kernel-Based PCA)
In this case, KPCA is

better than PCA when

the structure of the data is not linear.


Kernel: Kernel Motivation
Non-linear problem
->
Linear problem
Kernel Trick: VIPs in Kernel
1) Feature mapping:
2) Feature/Original space: X/H
3) Kernel Function: ?
H: New space
Kernel Trick: Kernel Function
x = (x1, x2) and y = (y1, y2) in X are mapped to (v1, v2, v3) and (z1, z2, z3) in H

Dot product in H = (dot product in X)^2

→ Kernel Function!
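A worked instance of "dot product in H = (dot product in X)²", using the standard textbook mapping (assumed here, since the slide's own mapping is shown only as a figure):

\phi(x) = (x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2), \qquad \phi(x) \cdot \phi(y) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x \cdot y)^2 = k(x, y)

The dot product in the 3-dimensional space H is computed entirely in the original 2-dimensional space X, without ever forming φ explicitly.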
Kernel Trick: Distance of 2 points in
H

‖φ(x) − φ(x′)‖² = k(x, x) − 2k(x, x′) + k(x′, x′)
Kernel Trick: Angle of 2 points in H
Kernel Trick: A deep understanding
1) The mapping function is not necessary:
2) Only the kernel function is needed: k(·, ·)
3) k? A finitely positive semi-definite function.
Kernel Matrix / Gram Matrix:
is a finitely positive semi-definite matrix
Kernel Trick: FPSD Matrix
M is an FPSD matrix if and only if every non-zero vector x satisfies xᵀMx ≥ 0.

Is k(x, y) = <x, y> an FPSD function?


Kernel Trick: More..
If k(x, y) is an FPSD function, then there exists at least one feature mapping
function φ, and vice versa.
H: New space
Kernel Trick: Some Kernel Functions
Common kernel functions for vector-based data.

Linear kernel: K(x, y) = x · y

Polynomial kernel: K(x, y) = (x · y + 1)^d

Radial Basis Function (RBF) kernel: K(x, y) = exp(−‖x − y‖² / (2σ²))

(The bandwidth σ can be estimated via kernel smoothing.)


SVM
In general, an SVM is a linear classifier that uses the maximum margin in feature space
to perform binary classification.

First part of SVM (finding the maximum margin):

1. a simple example of linear classification

2. margin (functional & geometrical)

3. maximum margin classifier


SVM: a toy example
Introduce a classification function (it will be
proved in part Ⅲ or Ⅳ)
SVM: Margin (Functional & Geometrical)
functional margin

numeric distance:

where x0 = 0

decide the margin by its positive or negative sign:

weakness: the functional margin is only a scalar distance; if w and b are rescaled, the hyperplane itself does not

change, but the functional margin does, because the direction is not normalized


SVM: Margin (Geometrical)
geometrical margin

in vector space:

because

this means:

Then the geometric distance:

Similar to the functional margin, we get the geometrical margin as follows:


SVM: Maximum Margin Classifier
Since the definition of the SVM is to find the maximum margin, that is:

Owing to:

Then let

Because this is a constant in the calculation, any point that achieves:

will not be a point on the margin.
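The resulting optimization problem in its standard form (assuming the usual normalization that fixes the functional margin of the closest points to 1):

\max_{w, b} \frac{1}{\lVert w \rVert} \;\; \text{s.t.} \;\; y_i (w^{\top} x_i + b) \ge 1 \;\;\Longleftrightarrow\;\; \min_{w, b} \frac{1}{2} \lVert w \rVert^2 \;\; \text{s.t.} \;\; y_i (w^{\top} x_i + b) \ge 1, \; i = 1, \dots, N

The points with y_i (wᵀx_i + b) = 1 lie exactly on the margin; they are the support vectors.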


SVM: Maximum Margin Classifier

http://www.36dsj.com/archives/24596
KNN: Introduction
Instance-based
Distance-based
Supervised learning
Lazy learning: keep all!
Classification
KNN: Main Idea
. Set a distance threshold and
calculate the distances from
the given data point to the
others.
. Get the nearest k
neighbours (k is usually odd).
. Use majority voting to
determine the class label.
KNN: Distance Metrics
Manhattan distance (Taxicab geometry)
Euclidean distance
(both special cases of the Minkowski distance)

Cosine similarity

Others: Pearson correlation (Karl

Pearson),
Kullback–Leibler (KL)
divergence, etc.
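The standard formulas for the metrics listed above (textbook forms, assumed rather than copied from the slides):

\text{Manhattan: } d(x, y) = \sum_i |x_i - y_i| \qquad \text{Euclidean: } d(x, y) = \sqrt{\textstyle\sum_i (x_i - y_i)^2}

\text{Minkowski: } d(x, y) = \Big(\textstyle\sum_i |x_i - y_i|^p\Big)^{1/p} \quad (p = 1 \text{ gives Manhattan}, \; p = 2 \text{ gives Euclidean})

\text{Cosine similarity: } \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert\, \lVert y \rVert}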
KNN: Disadvantages
- Sensitive to the distance threshold.

- Majority voting: is more always better?

---> Weight = 1/distance

- Only works well with a properly chosen k.


KNN: Advantages
- Easy & lazy.
- Generalization performance: better than the Naïve Bayes classifier.

Nearest neighbours: Z
Given point: x
Probability of KNN making an error:

1 + p ≤ 2

NBC: best result


KNN: Iris dataset in practice
KNN: Code with sklearn

Load directly….
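A minimal sketch of that "load directly" step with scikit-learn (k and the weighting are illustrative choices, not taken from the slides):

# Hypothetical example: k-nearest neighbours on the Iris dataset with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# weights="distance" corresponds to the 1/distance weighting mentioned earlier
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))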
KNN: Practice
Which one is a possible result of KNN?
GBDT
What’s GBDT?

Gradient Boosting Decision Tree

or GBRT

Gradient Boosting Regression Tree

A regression tree instead of a classification tree


Difference between Classification Tree and
Regression Tree
Main difference:

Classification trees, as the name implies, are used to separate the dataset into classes belonging to the response variable. Usually the
response variable has two classes: Yes or No (1 or 0). If the target variable has more than 2 categories, then a variant of the algorithm,
called C4.5, is used. For binary splits, however, the standard CART procedure is used. Thus classification trees are used when the
response or target variable is categorical in nature.

Regression trees are needed when the response variable is numeric or continuous. For example, the predicted price of a consumer
good. Thus regression trees are applicable for prediction type of problems as opposed to classification.
Boosting
Boosting (not AdaBoost):

Figure from Machine Learning: A Probabilistic Perspective


GB-Gradient Boosting
Boosting means iteration: multiple trees jointly make the final decision. How is that achieved? Surely not by training each tree
independently, e.g. for person A the first tree says 10 years old, the second says 0, the third says 20, and we take the average
of 10 as the final answer? Of course not! Leaving aside that this would be voting rather than GBDT, as long as the training set
stays the same, three independently trained trees would be identical, so doing this would be pointless. As said before, GBDT sums
the conclusions of all trees to get the final result, so each tree's conclusion is not the age itself but an increment to be
accumulated. The core of GBDT is that each tree learns the residual of the sum of all previous trees' conclusions, i.e. the amount
that, added to the current prediction, gives the true value. For example, if A's true age is 18 but the first tree predicts 12, the
difference is 6, i.e. the residual is 6. In the second tree we therefore set A's age to 6 and learn again; if the second tree really
assigns A to the 6-year-old leaf, then summing the two trees' conclusions gives A's true age. If the second tree instead predicts 5,
A still has a residual of 1, so in the third tree A's age becomes 1 and learning continues. That is the meaning of Gradient
Boosting in GBDT. Simple, isn't it?
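The age example corresponds to the following update, written here for squared error, where the residual is exactly the negative gradient (standard form, not taken from the slides):

F_0(x) = \bar{y}, \qquad r_i^{(m)} = y_i - F_{m-1}(x_i), \qquad F_m(x) = F_{m-1}(x) + h_m(x)

where the m-th regression tree h_m is fitted to the residuals r^{(m)} (for A: 18 − 12 = 6, then 6 − 5 = 1, ...), and the final prediction is the accumulated sum F_M(x) over all trees.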
GBDT: Algorithm
A worked example of the GBDT process
Example: http://blog.csdn.net/w28971023/article/details/8240756
Regarding random forests and gradient boosting trees, select the correct statement.
1. In a random forest the intermediate trees are not independent of each other, while in gradient boosting trees they are
independent.
2. Both use random subsets of features to build the intermediate trees.
3. With gradient boosting trees we can build the trees in parallel, because the trees are independent of each other.
4. Gradient boosting trees perform better than random forests on any dataset.
Blending in Practice
- Blending: take advantage of the whole dataset
OldLee Sharing(2): A simple model of
NN
- 192 Features (Input)

- 99 Classes (Output)

- 1 hidden layer
Neural Network (forward & backward)

demonstrated by this link

https://sixunouyang.wordpress.com/2016/11/09/backpropagation-neural-networkformula/
Linear Discriminant Analysis
Searching for a linear combination of variables (predictors) that
best separates two classes.

Works well on multi-class tasks (more than 2 classes) when compared

to SVM.

Experiment result:
Linear Discriminant Analysis

Linear Discriminant Analysis
Linear Discriminant Analysis
Classification equation:
Linear Discriminant Analysis
For multi-class classification task:

https://users.cs.fiu.edu/~taoli/pub/Li-discrimant.pdf
Optimization: Motivation
What is optimization? Recall gradient descent.
Optimization: How?
No constraints: Gradient Descent, Newton's Method,
Quasi-Newton Methods (which approximate Newton's Method)
Constraints: KKT Conditions.

A generalized description of optimization...


Optimization: Unconstrained
Optimization
Description:

x* is the optimum.

"Newton's Method" finds the

roots of f(x) = 0.
In optimization, it is applied to the derivative f′ of a twice-differentiable
function f to find the roots of that derivative (the solutions to
f′(x) = 0), also known as the stationary points of f.
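The corresponding iterations in their standard one-dimensional form (assumed, since the slide's formulas are not in the text):

\text{root finding: } x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}, \qquad \text{optimization: } x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}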
Optimization: Unconstrained
Optimization
Description:

x* is the optimum.

"Newton's Method"
"Gradient Descent"

Both need iterations.


Optimization: Equality Optimization
min f(x,y)
s.t. g(x,y)=c

Introduces a Lagrange multiplier λ: (one constraint,

one multiplier)
Optimization: Why?
Optimization: Equality Optimization
E.P.
Optimization: Generalized Optimization
Problem
f(x): objective function, loss function,
or cost function

h(x): equality constraint


g(x): inequality constraint

Generalized Lagrange function
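For reference, the generalized Lagrange function in its standard form (using the multipliers α_i ≥ 0 and β_j introduced on the next slide):

L(x, \alpha, \beta) = f(x) + \sum_{i} \alpha_i\, g_i(x) + \sum_{j} \beta_j\, h_j(x), \qquad \alpha_i \ge 0

for the problem \min_x f(x) subject to g_i(x) \le 0 and h_j(x) = 0.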


Optimization: Generalized Optimization
Problem

α, β: Lagrange multipliers, α ≥ 0

Treat L as a function of α and β, with x held constant:

If x does not satisfy the constraints:

If x is feasible:
Optimization: Generalized Optimization
Problem

Without constraints!


Optimization: Dual Problem

Treat L as a function of x, with α and β held constant.


Optimization: How to Solve?
Optimization: Inequality Optimization
Karush-Kuhn-Tucker (KKT) Conditions:
Nonlinear Programming
Optimization: Inequality Optimization
Karush-Kuhn-Tucker (KKT) Conditions

KKT Conditions:
1. the partial derivative of
L(a, b, x) with respect to x is 0;
2. h(x) = 0;
3. a·g(x) = 0;
subject to a ≥ 0
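The same conditions written compactly (standard statement, in terms of the generalized Lagrangian L(x, α, β) from the earlier slides):

\nabla_x L(x^*, \alpha^*, \beta^*) = 0, \qquad h_j(x^*) = 0, \qquad g_i(x^*) \le 0, \qquad \alpha_i^* \ge 0, \qquad \alpha_i^*\, g_i(x^*) = 0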
Optimization: KKT Conditions
Karush-Kuhn-Tucker (KKT) Conditions
Thank you all!
