ML Recap
Draft V1.0
Do you really understand...
Outline
Supervised Learning:
Linear regression, Logistic regression
Linear Discriminant Analysis
Principal Component Analysis
Neural network
Support vector machines
K-nearest neighbor
Gradient Boosting Decision Tree
Decision trees (C4.5, ID3, CART), Random Forests
Kernels, Kernel-based PCA
Optimization Methods
Linear Regression
Using the dataset below to predict the price of a house.
The target is to find a line that fits these points well.
The most commonly used cost function is the squared error cost function. The closer it is to zero, the better the line fits the dataset.
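Written out (a standard form, assuming m training examples (x^{(i)}, y^{(i)}) and the line h_\theta(x) = \theta_0 + \theta_1 x):

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2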
Tuning of Parameters of Linear
Regression
It can be seen that the line is determined by theta0 and theta1, and the cost function is a function of (theta0, theta1).
The target is to find theta0 and theta1 such that the cost function reaches its global minimum (or a local minimum); a gradient descent sketch follows the list below.
Helpful in:
Logistic Regression
Neural Networks
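Below is a minimal sketch of this tuning by batch gradient descent on the squared error cost; the learning rate, iteration count, and toy house-price data are illustrative assumptions, not taken from the slides.

import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=2000):
    """Fit h(x) = theta0 + theta1 * x by minimizing the squared error cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(n_iters):
        error = theta0 + theta1 * x - y
        # Partial derivatives of J(theta0, theta1) = (1 / 2m) * sum(error^2)
        theta0 -= lr * error.sum() / m
        theta1 -= lr * (error * x).sum() / m
    return theta0, theta1

# Toy data: house size vs. price
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(gradient_descent(x, y))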
Logistic Function
Features:
Smooth, monotonic, sigmoid-shaped...
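Written out, the logistic (sigmoid) function is

\theta(s) = \frac{1}{1 + e^{-s}} = \frac{e^{s}}{1 + e^{s}}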
Logistic Regression: Motivation
Target function: f(x) = P(+1|x), which lies within [0, 1].
Logistic Regression: binary
classification
Risk score: s = w^T x
h(x) = \theta(w^T x) = \frac{1}{1 + e^{-w^T x}}
1 - h(x) = h(-x), since 1 - \theta(s) = \theta(-s)
Logistic Regression: likelihood estimation
y: -1 or +1
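With this label convention, P(y|x) = \theta(y \, w^T x), so maximizing the likelihood over N examples (x_n, y_n) is equivalent to minimizing the cross-entropy error (a standard derivation, stated here under the hypothesis h(x) = \theta(w^T x) above):

E(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)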
Logistic Regression: optimization
Details go here.
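A standard route (assumed here, since the slide leaves the details out) is gradient descent on the cross-entropy error above, whose gradient is

\nabla E(w) = \frac{1}{N} \sum_{n=1}^{N} \theta\left(-y_n w^T x_n\right) \left(-y_n x_n\right)

with the update w \leftarrow w - \eta \nabla E(w).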
Logistic Regression: comparison
Some methods
1. PCA analysis
3. Conclusion (discussing 1)
(1) Logistic regression does not work well on this dataset because of its very high dimensionality, even with regularization. I assume an SVM might do better.
(2) Logistic regression uses the same likelihood function as the SVM here, via sklearn.svm.base._fit_liblinear.
Random Forest
Bagging (bootstrap aggregating): train each base learner g_t on a bootstrap resample of the data.
Return G = Uniform({g_t})
Random Forest
Decision Tree
function Tree(D):
  if termination criteria are met: return a base hypothesis g_t
  else:
    1. learn a branching criterion b(x)
    2. split D into parts D_c = {(x, y) in D : b(x) = c} and build G_c = Tree(D_c)
    3. return G(x) = sum_c [b(x) = c] * G_c(x)
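A minimal sketch combining the two slides: bagging decision trees with per-split random feature subsets gives a random-forest-style ensemble. The base learner is sklearn's DecisionTreeClassifier; the tree count and max_features setting are illustrative, and integer class labels are assumed.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, T=25, seed=0):
    """Train T trees, each on a bootstrap resample of D (bagging);
    max_features='sqrt' adds the random-forest feature randomness."""
    rng = np.random.default_rng(seed)
    trees, n = [], len(X)
    for _ in range(T):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample of D
        g_t = DecisionTreeClassifier(max_features="sqrt")
        g_t.fit(X[idx], y[idx])
        trees.append(g_t)
    return trees

def forest_predict(trees, X):
    """G = Uniform({g_t}): uniform (majority) vote over the bagged trees."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(votes[:, i]).argmax() for i in range(votes.shape[1])])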
Principal Component Analysis
Given d-dimensional data x, find the projection directions of maximum "variance".
Using the cosine rule we can deduce the value of \sigma^2, the variance of the projection onto a unit vector v.
Maximizing it gives an eigenvalue problem, f(v) = Cv = \lambda v = g(v): the optimal v is an eigenvector of the covariance matrix, and the captured variance equals the corresponding eigenvalue of K (the slide's symbol for this matrix).
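A minimal sketch of PCA by eigendecomposition of the covariance matrix (the variable names and two-component choice are illustrative):

import numpy as np

def pca(X, n_components=2):
    """Project centred data onto the top eigenvectors of its covariance matrix."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)                 # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    V = eigvecs[:, order]                        # directions of maximum variance
    return Xc @ V, eigvals[order]                # projections and captured variances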
KPCA (Kernel-based PCA)
Never mind the explicit mapping (e.g. a 2-D point (y1, y2) mapped to a 3-D feature vector (z1, z2, z3)); we only have to compute kernel values. The squared distance between two points in the feature space H is
\|\phi(x) - \phi(x')\|^2 = k(x, x) - 2k(x, x') + k(x', x')
Kernel Trick: Angle of 2 points in H
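The angle, like the distance, needs only kernel values:

\cos\angle\big(\phi(x), \phi(x')\big) = \frac{k(x, x')}{\sqrt{k(x, x)\, k(x', x')}}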
Kernel Trick: A deeper understanding
1) The explicit mapping function is not necessary.
2) Only the kernel function is needed: k(·, ·).
3) Which k are valid? Finitely positive semi-definite functions.
Kernel Matrix / Gram Matrix
The kernel (Gram) matrix, with entries K_ij = k(x_i, x_j), is a finitely positive semi-definite matrix.
Kernel Trick: FPSD Matrix
M is an FPSD matrix if and only if, for all non-zero vectors x:
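Written out, the condition is:

x^T M x \ge 0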
Numeric (functional) distance: the score y(w^T x + b), taking the hyperplane w^T x + b = 0 (a point x_0 on the hyperplane satisfies w^T x_0 + b = 0).
Weakness: this distance is only a scalar score; if the hyperplane's parameters are rescaled, the hyperplane itself does not move but the score changes.
In vector space, the geometric distance is d(x) = |w^T x + b| / \|w\|, because it is invariant to rescaling (w, b) together.
Then, letting the closest points satisfy y(w^T x + b) = 1, the margin becomes 1 / \|w\| (see the reference below).
http://www.36dsj.com/archives/24596
KNN: Introduction
Instance-based
Distance-based
Supervised learning
Lazy learning: keep all training examples!
Classification
KNN: Main Idea
1. Set a distance threshold and calculate the distances from the given data point to all others.
2. Get the nearest k neighbours (k is usually an odd number).
3. Use majority voting to determine the class label.
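A minimal sketch of this idea using Euclidean distance (the toy data and k=3 are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data: two classes in 2-D
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))   # -> 1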
KNN: Distance Metrics
Manhattan distance (taxicab geometry)
Euclidean distance
(both are special cases of the Minkowski distance)
Cosine similarity
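Written out for points x, y \in \mathbb{R}^d (standard definitions):

Manhattan:  d_1(x, y) = \sum_i |x_i - y_i|
Euclidean:  d_2(x, y) = \sqrt{\sum_i (x_i - y_i)^2}
Minkowski:  d_p(x, y) = \big( \sum_i |x_i - y_i|^p \big)^{1/p}
Cosine similarity:  \cos(x, y) = \frac{x^T y}{\|x\| \, \|y\|}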
Prob of KNN making errors
Given point x with nearest neighbour z, the probability that 1-NN errs is
P(err) = 1 - \sum_c P(c|x) P(c|z) \approx 1 - \sum_c P(c|x)^2 \le (1 - P(c^*|x))(1 + P(c^*|x)) \le 2\,(1 - P(c^*|x)),
using 1 + P(c^*|x) \le 2: asymptotically, 1-NN makes at most twice the Bayes error.
Load directly….
KNN: Practice
Which one is a possible result of KNN?
GBDT
What's GBDT (Gradient Boosting Decision Tree)?
Also called GBRT (Gradient Boosting Regression Tree).
Classification trees, as the name implies, are used to separate the dataset into classes of the response variable. Usually the response variable has two classes: Yes or No (1 or 0). If the target variable has more than 2 categories, then a variant of the algorithm, called C4.5, is used. For binary splits, however, the standard CART procedure is used. Thus classification trees are used when the response or target variable is categorical in nature.
Regression trees are needed when the response variable is numeric or continuous, for example the predicted price of a consumer good. Thus regression trees are applicable to prediction-type problems, as opposed to classification.
Boosting
Gradient boosting (not AdaBoost): each new tree is fitted to the residuals (negative gradients) of the current ensemble, as sketched below.
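A minimal sketch of gradient boosting for regression with squared loss, where each tree fits the current residuals (sklearn's DecisionTreeRegressor as the base learner; the depth, learning rate, and tree count are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, n_trees=50, lr=0.1, max_depth=2):
    """Gradient boosting for squared loss: residuals are the negative gradients."""
    f0 = y.mean()
    pred = np.full(len(y), f0)        # initial constant model
    trees = []
    for _ in range(n_trees):
        residuals = y - pred          # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        pred += lr * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gbrt_predict(f0, trees, X, lr=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred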
Neural Network (forward & backward)
- 99 classes (output)
- 1 hidden layer
https://sixunouyang.wordpress.com/2016/11/09/backpropagation-neural-networkformula/
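A minimal sketch of the forward and backward pass for a 1-hidden-layer network with a softmax output and cross-entropy loss; the sigmoid hidden activation, learning rate, and shapes are assumptions for illustration, with the 99-class output taken from the slide.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(X, y_onehot, W1, b1, W2, b2, lr=0.1):
    """One training step. Shapes: W1 (d, h), b1 (h,), W2 (h, 99), b2 (99,)."""
    # Forward pass
    z1 = X @ W1 + b1                                    # hidden pre-activation
    a1 = sigmoid(z1)                                    # hidden activations
    z2 = a1 @ W2 + b2                                   # class scores (n, 99)
    exp = np.exp(z2 - z2.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)        # softmax
    loss = -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))

    # Backward pass: gradient of softmax + cross-entropy
    n = X.shape[0]
    dz2 = (probs - y_onehot) / n
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)                  # sigmoid derivative
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss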
Linear Discriminant Analysis
Searching for a linear combination of variables (predictors) that
best separates two classes.
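For two classes, the standard Fisher criterion makes this concrete: choose the projection direction w that maximizes the between-class scatter relative to the within-class scatter,

J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad w^* \propto S_W^{-1} (\mu_1 - \mu_2)

where \mu_1, \mu_2 are the class means and S_W, S_B the within- and between-class scatter matrices.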
Experiment result:
Linear Discriminant Analysis
Linear Discriminant Analysis
Classify equation:
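The slide's own equation is not reproduced here; a common form (an assumption, valid for equal class priors) projects x onto w and thresholds at the midpoint of the projected class means:

assign x to class 1 if w^T x > \tfrac{1}{2}\, w^T (\mu_1 + \mu_2), otherwise class 2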
Linear Discriminant Analysis
For multi-class classification task:
https://users.cs.fiu.edu/~taoli/pub/Li-discrimant.pdf
Optimization: Motivation
What is optimization? Recall gradient descent.
Optimization: How?
No constraints: gradient descent, Newton's method, quasi-Newton methods (cheaper approximations of Newton's method).
Constraints: KKT conditions.
"Gradient Descent": steps along the negative gradient towards the optimum x*.
"Newton Method": uses second-order (curvature) information; if the current x is already good (close to the optimum x*), it converges quickly.
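Their standard update rules (the step size \eta and Hessian notation are the usual ones, not taken from the slide):

Gradient descent:  x_{t+1} = x_t - \eta \nabla f(x_t)
Newton's method:   x_{t+1} = x_t - \big[\nabla^2 f(x_t)\big]^{-1} \nabla f(x_t)

At the optimum, \nabla f(x^*) = 0.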
Optimization: Generalized Optimization
Problem: minimize f(x) subject to g(x) <= 0 and h(x) = 0, with Lagrangian L(a, b, x) = f(x) + a*g(x) + b*h(x).
KKT Conditions:
1. the partial derivative of L(a, b, x) with respect to x equals 0;
2. h(x) = 0;
3. a*g(x) = 0 (complementary slackness);
s.t. a >= 0
Optimization: KKT Conditions
Karush-Kuhn-Tucker (KKT) Conditions
Thank you all!