KNN Algorithm & SVM Algorithm

Mithilesh Singh

KNN regression/classifier
- A non-parametric method.
- It approximates the association between the independent variables and the continuous outcome by averaging the observations in the same neighborhood.
- The size of the neighborhood (the K value) must be set by the analyst, or it can be chosen by cross-validation, selecting the size that minimizes the mean-squared error.
Parametric vs. non-parametric methods
Parametric methods assume the data is of sufficient "quality", e.g. linear regression, logistic regression.
- The results can be misleading if these assumptions are wrong.
- Quality is defined in terms of certain properties of the data, such as being normally distributed, having a symmetrical linear distribution, homogeneity of variance, etc.

Non-parametric tests can be used when the data is not of sufficient quality to satisfy the assumptions of a parametric test. Non-parametric tests still have assumptions, but they are less stringent.
Non-parametric tests can also be applied to normally distributed data, but parametric tests have greater power if their assumptions are met.
- Parametric tests are preferred when the assumptions are met because they are more sensitive.
The KNN model classifies points based on proximity, i.e. distance.
Important Distance Metrics in Machine Learning
 Euclidean Distance
 Manhattan Distance
 Minkowski distance

1. Euclidean Distance

Euclidean distance represents the shortest (straight-line) distance between two points; it follows from the Pythagorean theorem.

The formula for Euclidean distance in 2 dimensions:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2)

For an n-dimensional space:

d(p, q) = sqrt( Σ (pi - qi)^2 ), summed over i = 1 … n

Where,
 n = number of dimensions
 pi, qi = the i-th coordinates of the data points p and q
2. Manhattan Distance
Manhattan distance is the sum of absolute differences between two points across all dimensions. It is also called the L1 norm or the taxicab norm.

Manhattan distance is the sum of the absolute distances in the x and y directions. In a 2-dimensional space it is given as:

d(p, q) = |p1 - q1| + |p2 - q2|

And the generalized formula for an n-dimensional space is:

d(p, q) = Σ |pi - qi|, summed over i = 1 … n

Where,
 n = number of dimensions
 pi, qi = the i-th coordinates of the data points p and q
3. Minkowski Distance
This distance measure is the generalized form of the Euclidean and Manhattan distance metrics: with p equal to 2 it reduces to the Euclidean distance, and with p equal to 1 it reduces to the Manhattan distance.

Minkowski distance = ( Σ |pi - qi|^p )^(1/p), summed over i = 1 … n
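A minimal sketch of the three metrics, assuming NumPy and two illustrative points p and q (made-up values, not from the slides):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 8.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))            # Minkowski with p = 2
manhattan = np.sum(np.abs(p - q))                    # Minkowski with p = 1
order = 3                                            # any other order of the Minkowski distance
minkowski = np.sum(np.abs(p - q) ** order) ** (1 / order)

print(euclidean, manhattan, minkowski)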

 The KNN model is used for classification (and regression).
 KNN uses distance metrics to find similarities or dissimilarities between points.
 It works off the assumption that similar points can be found near one another.
 The distinction between the two uses is how the neighbors are combined: classification uses "majority voting" among the K neighbors, while regression averages their target values.
Example of KNN (Euclidean distance) in real life

Person |   a |  b |   c |    d | Status (loan approved)
     1 | 160 | 10 |  20 | 2000 | No
     2 | 180 | 20 |  30 | 5000 | Yes
     3 | 200 | 30 |  35 | 8000 | Yes
     4 | 150 | 20 |  25 | 4000 | No
     5 | 350 | 60 | 100 | 6500 | Yes
   ... |     |    |     |      |
   100 | 200 | 50 |  80 | 7500 | Yes
     A | 175 | 35 |  30 | 4500 | ?

(a, b, c, d are the liability features of each person; A is the new applicant.)

Euclidean distances from A to each person:

d(A, 1) = sqrt(15^2 + 25^2 + 10^2 + 2500^2)
d(A, 2) = sqrt(5^2 + 15^2 + 0^2 + 500^2)
d(A, 3) = sqrt(25^2 + 5^2 + 5^2 + 3500^2)
d(A, 4) = sqrt(25^2 + 15^2 + 5^2 + 500^2)
d(A, 5) = sqrt(175^2 + 25^2 + 70^2 + 2000^2)
...
d(A, 100) = ...

The nearest neighbor is the one with the lowest distance.
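A small sketch of this calculation in Python, assuming NumPy and using the feature values and labels for persons 1-5 from the table above:

import numpy as np

# Feature rows (a, b, c, d) for persons 1-5 from the table
X = np.array([
    [160, 10, 20, 2000],   # person 1, No
    [180, 20, 30, 5000],   # person 2, Yes
    [200, 30, 35, 8000],   # person 3, Yes
    [150, 20, 25, 4000],   # person 4, No
    [350, 60, 100, 6500],  # person 5, Yes
])
labels = np.array(["No", "Yes", "Yes", "No", "Yes"])

A = np.array([175, 35, 30, 4500])                # the new applicant
distances = np.sqrt(((X - A) ** 2).sum(axis=1))  # Euclidean distance to each person
nearest = distances.argsort()[:3]                # indices of the 3 nearest neighbors
print(distances)
print(labels[nearest])                           # the majority class among them decides A's status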


KNN method working concept:-

How K-NN works can be explained with the algorithm below:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new point to every training point.
o Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
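A minimal from-scratch sketch of these steps, assuming NumPy and that X_train, y_train and new_point are NumPy arrays of your own data:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=5):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Step 3: indices of the K nearest neighbors
    nearest = distances.argsort()[:k]
    # Steps 4-5: count the categories among them and return the most common one
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]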

Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we choose the number of neighbors, so we choose k=5.
o Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry; it is calculated with the formula given above.
o By calculating the Euclidean distances we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see, the 3 nearest neighbors are from category A, hence this new data point must belong to category A.

• KNN can also be used for regression: the continuous target value is computed as the mean of the target values of the k nearest neighbors.

For classification problems:
from sklearn.neighbors import KNeighborsClassifier

For regression problems:
from sklearn.neighbors import KNeighborsRegressor
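A minimal regression sketch, assuming X_train, y_train and X_test already exist as numeric arrays:

from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor(n_neighbors=5)   # prediction = mean target of the 5 nearest neighbors
knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)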
How to select the optimal value of k?
 Prefer odd values of k, since even values can produce ties.
 K should not be too small.
 Rule of thumb: k is generally around sqrt(n), where n denotes the total number of data points.
 Further, one can try different values of k and observe the evaluation metrics to decide on the best value, as in the code below.
Python code:-

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

model_name = 'K-Nearest Neighbor Classifier'
knnClassifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
# preprocessorForFeatures is assumed to be defined earlier (e.g. a ColumnTransformer)
knn_model = Pipeline(steps=[('preprocessor', preprocessorForFeatures),
                            ('classifier', knnClassifier)])
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)

# Find the most effective value of the n_neighbors parameter (k):
accuracy_K = []
for k in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy_K.append(accuracy_score(y_test, y_pred))

plt.figure(figsize=(12, 8))
plt.xlabel("k values")
plt.ylabel("Accuracy")
plt.plot(range(1, 50), accuracy_K, marker='o', markersize=9)
plt.show()
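Alternatively, as noted earlier, k can be chosen with cross-validation rather than a single train/test split; a minimal sketch, assuming X_train, y_train and the imports above:

from sklearn.model_selection import cross_val_score
import numpy as np

cv_scores = []
for k in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean 5-fold cross-validated accuracy for this value of k
    cv_scores.append(cross_val_score(knn, X_train, y_train, cv=5).mean())

best_k = int(np.argmax(cv_scores)) + 1   # +1 because k starts at 1
print("best k:", best_k)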
KNN is a lazy learner model
 Almost all models get trained on the training dataset, but KNN does not really get trained on it.
 When we call knn.fit(X_train, y_train), the model simply 'memorizes' the dataset. It does not try to understand it or learn the underlying trend.
 When we then ask the model to predict a value, it takes a lot of time, because it has to recall all the stored points and compute distances to them before it can predict the correct value.
 Hence, where most models take time during training, this model takes almost no time during training.
 Most models take little time for prediction, but the KNN model takes a lot of time during the prediction stage.
Important points
 Since KNN is a distance-based model, feature scaling is a must (see the sketch below).
 Unlike logistic regression, which is natively a binary classifier, KNN can handle multi-class classification directly.
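A minimal sketch of KNN with feature scaling in a pipeline, assuming numeric features in X_train/X_test and labels in y_train:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

scaled_knn = Pipeline(steps=[
    ('scaler', StandardScaler()),                  # puts all features on a comparable scale
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
scaled_knn.fit(X_train, y_train)
y_pred = scaled_knn.predict(X_test)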
Advantages of KNN Algorithm:

o It is simple to implement.
o It requires no explicit training phase or model building.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which can sometimes be complex.
o The entire dataset is processed for every prediction, so it is not good for large datasets.
o The computation cost is high because the distance between the new point and every training sample has to be calculated. The computational time complexity of each prediction is roughly O(M·N·log k), where M is the dimension of the data, N is the number of training instances, and k is the number of nearest neighbors selected.
Mithilesh Singh
Support vector machines (SVM)
The early 1990s are sometimes described as an AI winter: the USA had invested huge amounts in military AI without success, and confidence in the field fell.
SVM was invented during that period and became popular and successful; more than 200 papers were published on it.

SVM is also called the maximal margin classifier. It is commonly used and was originally intended for binary classification.

It is often considered one of the best "out of the box" classifiers.

 Used for regression and classification.
 Preferred for medium and small sized datasets.
 It separates the data into two components using a hyperplane, by maximizing the margin (hence it is also called a large margin classifier). The maximal margin classifier tries to find the optimal separating hyperplane.

Hyper-plane
A hyperplane is a flat surface that linearly divides the n-dimensional data points into two components. In 2D a hyperplane is a line, and in 3D it is a plane; in general it is an (n-1)-dimensional flat subspace of the n-dimensional space.

Hyperplane: the line that classifies with the highest margin is the maximum margin hyperplane.
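For reference (a standard formulation, not spelled out on the slides), a separating hyperplane can be written as the set of points x satisfying

w · x + b = 0

where w is the weight (normal) vector and b is the intercept. A new point is assigned to one class if w · x + b > 0 and to the other if w · x + b < 0; the margin is the distance from this plane to the closest points, the support vectors.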

The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM

SVM can be of two types, based on the separating hyperplane:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Support Vectors:

The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?

The kernel defines the approach used to find the hyperplane. There are different kernel functions available:
 Linear
 Gaussian (RBF kernel - Radial Basis Function)
 Polynomial
 Sigmoid
Linear Plane and Non Linear Plane
(Scenario-1)
Identify the right hyper-plane: Here, we have
three hyper-planes (A, B, and C). Now, identify the right
hyper-plane to classify stars and circles.

In this scenario, hyper-plane "B" has done this job excellently.
(Scenario-2)
 Identify the right hyper-plane: Here, we have three hyper-planes (A, B, and C) and all of them segregate the classes well. Now, how can we identify the right hyper-plane?

Here, we maximize the margin: the right hyper-plane is C. If we select a hyper-plane with a low margin, there is a high chance of misclassification.
(Scenario-3)
 Identify the right hyper-plane: Hint: use the rules discussed in the previous scenarios to identify the right hyper-plane.

Note: some would pick hyper-plane B, as it has a higher margin than A.
But here is the catch: SVM selects the hyper-plane that classifies the classes accurately before maximizing the margin. Here, hyper-plane B has a classification error and A has classified everything correctly. Therefore, the right hyper-plane is A.
(Scenario-4)

The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, SVM classification is robust to outliers. Similarly, decision trees (as implemented in scikit-learn) are also fairly robust to outliers.
Non-Linear SVM:

If the data is linearly arranged, we can separate it with a straight line, but for non-linear data we cannot draw a single straight line. Consider the below image:
(Scenario-5)
Find the hyper-plane that segregates the two classes: in the scenario below, we cannot have a linear hyper-plane between the two classes.

Does SVM have a solution?

SVM will add one extra dimension to the data points to make them separable.
This is the kernel trick: converting the data to a higher dimensionality.
To separate these data points, SVM adds one more dimension. For linear data, SVM uses the two dimensions x and y; for non-linear data, SVM adds a third dimension z. It can be calculated as:
z = x^2 + y^2
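A minimal sketch of this idea, assuming NumPy and some made-up 2-D points: after adding z, each point's new coordinate is simply its squared distance from the origin, so an inner cluster and an outer ring become separable by a flat plane in 3-D.

import numpy as np

# Made-up 2-D points: an inner cluster (first three) and an outer ring (last three)
x = np.array([0.5, -0.5, 0.0, 3.0, -3.0, 0.0])
y = np.array([0.0, 0.5, -0.5, 0.0, 1.0, 3.0])

z = x ** 2 + y ** 2   # the extra dimension added by the kernel trick
print(z)              # small values for the inner cluster, large values for the outer ring
# In 3-D (x, y, z) the two groups can now be split by a plane z = constant.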

SVM keeps increasing the dimensions until the classes become separable.
By adding the third dimension, the sample space becomes as shown in the image below:
Python code:-

from sklearn.svm import SVC  # "Support vector classifier"

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', since here we are creating an SVM for linearly separable data; we can change it for non-linear data. We then fit the classifier to the training dataset (x_train, y_train).

The model performance can be altered by changing the values of the kernel, gamma and C parameters.

Kernel functions transform non-linear spaces into linear ones: they map the data into a higher dimension so that it can be classified.
kernel: various options are available - 'linear', 'rbf', 'poly' and 'sigmoid' (the default is 'rbf'). Here 'rbf' and 'poly' are useful for a non-linear hyper-plane.
gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The higher the value of gamma, the more exactly the model tries to fit the training data set, which hurts generalization and causes over-fitting.
Try different gamma values, such as 0.001, 0.01, 1, 10 or 100.
C: penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly.
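A minimal sketch of tuning these parameters with a grid search, assuming X_train and y_train exist (the grid values are illustrative):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': [0.001, 0.01, 1, 10, 100],
    'C': [0.1, 1, 10],
}
grid = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation over the grid
grid.fit(X_train, y_train)
print(grid.best_params_)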

Pros & cons associated with SVM:

Advantages:
 Works really well when there is a clear margin of separation.
 Effective in high-dimensional spaces.
 Gives accurate results.
 Useful for both linearly separable and non-linearly separable data.

Disadvantages:
 It doesn't perform well on large datasets, because the required training time is very high.

Applications of SVM
 Sentiment analysis
 Spam detection
 Handwritten digit recognition
 Image recognition

from sklearn.svm import SVC  # "Support vector classifier"
from sklearn.svm import SVR  # "Support vector regressor"

# Building a Support Vector Machine on the training data
# (X_train, Y_train and X_test are assumed to be defined;
#  gamma is ignored by the linear kernel)
svc_model = SVC(C=1, kernel='linear', gamma=100)
svc_model.fit(X_train, Y_train)
prediction = svc_model.predict(X_test)
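For regression, a minimal SVR sketch under the same assumptions about X_train, Y_train (here with continuous targets) and X_test:

from sklearn.svm import SVR

svr_model = SVR(kernel='rbf', C=1.0, gamma='scale')   # RBF kernel with default-style settings
svr_model.fit(X_train, Y_train)
y_pred = svr_model.predict(X_test)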
