KNN - Algorithm - SVM - Algorithm
KNN Regression/Classifier
- A non-parametric method.
- Approximates the association between the independent variables and the continuous outcome by averaging the observations in the same neighborhood.
- The size of the neighborhood (the K value) must be set by the analyst, or it can be chosen using cross-validation to select the size that minimizes the mean-squared error.
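The cross-validation approach described above can be sketched with scikit-learn; the dataset here is synthetic placeholder data, not from these notes:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Placeholder regression data; in practice use your own features/target.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Search over neighborhood sizes, scoring by (negative) mean-squared error.
param_grid = {"n_neighbors": list(range(1, 21))}
search = GridSearchCV(KNeighborsRegressor(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print("K that minimizes CV mean-squared error:", best_k)
```

The grid search refits one KNN regressor per candidate K on each fold and keeps the K with the lowest average squared error.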
Parametric vs. Non-parametric methods
Parametric methods assume the data is of sufficient "quality", e.g. Linear Regression, Logistic Regression.
- The results can be misleading if these assumptions are wrong.
- "Quality" is defined in terms of certain properties of the data, such as being normally distributed, having a symmetrical linear distribution, homogeneity of variance, etc.
1. Euclidean Distance
Based on the Pythagorean theorem. The formula for Euclidean distance in n dimensions:

d(p, q) = sqrt( sum_{i=1..n} (p_i - q_i)^2 )

Where,
n = number of dimensions
p_i, q_i = data points
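A minimal sketch of this formula in Python; the sample points are made up for illustration:

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared differences across all dimensions."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# 2-D example: this is just the Pythagorean theorem (3-4-5 triangle).
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```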
2. Manhattan Distance
Manhattan distance is the sum of absolute differences between points across all the dimensions. Also called the L1 norm or Taxicab norm.

d(p, q) = sum_{i=1..n} |p_i - q_i|

Where,
n = number of dimensions
p_i, q_i = data points
(3) Minkowski distance:
This distance measure is the generalized form of the Euclidean and Manhattan distance metrics. It reduces to Euclidean distance when p is equal to 2, and to Manhattan distance when p is equal to 1.

Minkowski distance = ( sum_{i=1..n} |p_i - q_i|^p )^(1/p)
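The reduction to the two special cases can be checked directly in Python; the sample points are made up for illustration:

```python
def minkowski_distance(p, q, power):
    """Generalized distance: Manhattan when power=1, Euclidean when power=2."""
    return sum(abs(pi - qi) ** power for pi, qi in zip(p, q)) ** (1 / power)

a, b = (1, 2, 3), (4, 6, 3)
print(minkowski_distance(a, b, 1))  # Manhattan: |3| + |4| + |0| = 7.0
print(minkowski_distance(a, b, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```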
Loan-approval example: features a-d describe each person's liability, and A is the new applicant to classify.

Person |  a  |  b  |  c  |  d   | Status (loan approved)
   1   | 160 |  10 |  20 | 2000 | No
   2   | 180 |  20 |  30 | 5000 | Yes
   3   | 200 |  30 |  35 | 8000 | Yes
   4   | 150 |  20 |  25 | 4000 | No
   5   | 350 |  60 | 100 | 6500 | Yes
  ...  | ... | ... | ... | ...  | ...
  100  | 200 |  50 |  80 | 7500 | Yes
   A   | 175 |  35 |  30 | 4500 | ?
Euclidean distance: the distance from A to each of persons 1-100 is computed (the worked calculation is omitted in these notes).
o As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
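The worked example can be sketched in Python using the five fully listed rows of the table above (the elided rows are skipped); the majority vote among the K = 3 nearest neighbors decides A's status:

```python
import math

# The five fully listed training rows: person -> ((a, b, c, d), status).
train = {
    1: ((160, 10, 20, 2000), "No"),
    2: ((180, 20, 30, 5000), "Yes"),
    3: ((200, 30, 35, 8000), "Yes"),
    4: ((150, 20, 25, 4000), "No"),
    5: ((350, 60, 100, 6500), "Yes"),
}
A = (175, 35, 30, 4500)

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Sort persons by distance to A and vote among the 3 nearest.
nearest = sorted(train, key=lambda person: euclidean(A, train[person][0]))[:3]
votes = [train[p][1] for p in nearest]
prediction = max(set(votes), key=votes.count)
print(nearest, prediction)  # [2, 4, 5] Yes
```

Note that feature d (the large-valued column) dominates the distances here, which is exactly why feature scaling matters for KNN.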
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

accuracy_K = []
for k in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    accuracy = accuracy_score(Y_test, Y_pred)
    accuracy_K.append(accuracy)

plt.figure(figsize=(12, 8))
plt.xlabel("k values")
plt.ylabel("Accuracy")
plt.plot(range(1, 50), accuracy_K, marker='o', markersize=9)
plt.show()
KNN is a lazy learner model
Almost all models learn from the training dataset during a training step, but KNN does not.
When we call knn.fit(X_train, Y_train), the model simply "memorizes" the dataset; it does not try to understand it or learn any underlying trend.
When we then ask the model to predict a value, it takes a lot of time, because it must recall all the stored points and compute distances to them before it can produce a prediction.
Hence, where most models spend time during training, this model takes essentially no time to train; and where most models predict almost instantly, the KNN model spends a lot of time in the prediction stage.
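This fit-vs-predict asymmetry can be observed directly; the data below is synthetic and the exact timings depend on the machine:

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20000, 10))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(2000, 10))

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X_train, y_train)          # "training" is mostly just storing the data
fit_time = time.perf_counter() - t0

t0 = time.perf_counter()
y_pred = knn.predict(X_test)       # the distance work happens here, per query
predict_time = time.perf_counter() - t0

print(f"fit: {fit_time:.4f}s, predict: {predict_time:.4f}s")
```

On typical hardware the predict step takes noticeably longer than the fit step, illustrating the lazy-learner behavior described above.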
Important points
Since KNN is a distance-based model, feature scaling is a must for it.
Logistic regression is inherently a binary classifier (it needs extensions such as one-vs-rest or softmax for more classes), whereas most other classification models, including KNN, handle multi-class classification natively.
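Both points above can be sketched with a scikit-learn pipeline; the iris dataset is a stand-in, not from these notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features first so no single feature dominates the distances,
# then fit KNN; iris is a 3-class problem, which KNN handles natively.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```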
Advantages of the KNN algorithm:
o It is simple to implement.
o It is useful when labelled data is too expensive or impossible to obtain in large amounts.
o It is robust to noisy training data.
o It can be more effective when the training data is large.
Disadvantages of the KNN algorithm:
o Prediction is slow, especially on large datasets, since distances to all stored points must be computed at query time.
o The value of K must be chosen carefully (e.g. via cross-validation).
o It requires feature scaling, as distances are otherwise dominated by large-valued features.
Hyper-plane
A hyperplane is a flat surface that linearly divides the n-dimensional data points into two parts. In the 2-D case the hyperplane is a line; in the 3-D case it is a plane.
In general it is an (n-1)-dimensional flat subspace of the n-dimensional feature space.
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: for linearly separable data.
o Non-Linear SVM: for data that is not linearly separable (uses kernels).
Support Vectors:
The data points (vectors) that lie closest to the hyperplane, and which affect the position of the hyperplane, are termed support vectors. Since these vectors "support" the hyperplane, they are called support vectors.
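scikit-learn exposes these points directly after fitting; a minimal sketch with toy 2-D data made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters in 2-D.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only the points closest to the separating hyperplane become support vectors.
print("support vectors:\n", clf.support_vectors_)
```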
The SVM algorithm has a feature (the soft margin) that lets it ignore outliers and still find the hyperplane with the maximum margin. Hence, SVM classification is robust to outliers. Similarly, decision tree classifiers (such as scikit-learn's) are also robust to outliers.
Non-Linear SVM:
For linearly separable data we create the SVM with kernel='linear'. For non-linear data we can change the kernel (e.g. to 'rbf' or 'poly') and then fit the classifier to the training dataset (x_train, y_train).
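A sketch of swapping the kernel for non-linear data; the concentric-circles dataset is a stand-in, not from these notes:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not separable by any straight line.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_clf = SVC(kernel="linear").fit(x_train, y_train)
rbf_clf = SVC(kernel="rbf").fit(x_train, y_train)

print("linear kernel accuracy:", linear_clf.score(x_test, y_test))
print("rbf kernel accuracy:   ", rbf_clf.score(x_test, y_test))
```

The linear kernel performs near chance on this data, while the RBF kernel separates the circles almost perfectly.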
Disadvantages:
- SVM does not perform well when we have a large dataset, because the required training time is very high.
Applications of SVM
Sentiment analysis
Spam Detection
Handwritten digit recognition
Image recognition