A Study On Support Vector Machine Based Linear and Non-Linear Pattern Classification
constructs hyper-planes in a multidimensional space that separate different class boundaries; the number of dimensions is called the feature vector of the dataset. SVM has the capability to handle multiple continuous and categorical variables, as shown in Fig. 2 [5]. There are two kinds of circles, one filled and one outlined. The goal of the SVM is to separate the two types into classes based on the features. The model consists of three lines. One is w.x − b = 0, which is the marginal line or margin. The lines w.x − b = 1 and w.x − b = −1 represent the positions of the closest data points of the two classes. The circles lying on the hyper-plane are called the support vectors. The filled circle in the other class is called an outlier; it is ignored to avoid over-fitting and hence to obtain a nearly perfect classification. The objective of the SVM is to maximize the perpendicular distance between the two edges of the hyper-plane to minimize the occurrence of generalization error. Since the hyper-plane depends on the number of support vectors, the generalization capacity increases as the number of support vectors decreases.

Fig. 1. Non-linear classification.

A. Hyper-plane
SVM makes parallel partitions by generating two parallel lines to create a separating surface in a high-dimensional space using most of its attributes. This plane is called the hyper-plane. It creates hyper-planes that have the largest margin in the high-dimensional space, hence separating the given data into classes and creating margins. The margin represents the maximum distance between the closest data points of the two classes. The larger the margin, the lower the generalization error of the classifier. SVM provides the maximum flexibility of all the classifiers.

Fig. 2. SVM classifier.

SVMs can be viewed as probabilistic approaches but do not consider dependencies among the attributes. SVM works on empirical risk minimization, which leads to the optimization problem in (3),

min_w Σ_i l(x_i, y_i, w) + λ r(w)    (3)

where l is the loss function (hinge loss in SVMs) and r is the regularization function. SVM is a squared l2-regularized linear model, i.e. r(w) = ||w||². This guards against large coefficients, since the coefficient magnitudes are themselves penalized in the optimization. Regularization r(w) always gives a unique solution in m > n cases, where m is the number of dimensions or features and n is the number of training samples. Thus SVMs can be effective in a much higher dimensional space, where the number of features in the feature vector is greater than the number of training samples, but may tend to get slow while learning, as shown in Fig. 3(a).

Fig. 3. (a) Over-fitting (b) Almost perfect fit.
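As a concrete illustration of (3), the following minimal sketch (an assumption for illustration, not code from the paper) evaluates the regularized hinge-loss objective for a linear model w on a toy dataset; the names X, y, w, b and lam are hypothetical.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Regularized empirical risk as in (3): hinge loss plus lambda * ||w||^2."""
    margins = y * (X @ w + b)                 # y_i (w.x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)    # l(x_i, y_i, w), the hinge loss
    return hinge.sum() + lam * np.dot(w, w)   # empirical risk + L2 regularizer

# Toy usage: two features, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.5, 0.5]), 0.0
print(svm_objective(w, b, X, y, lam=0.1))
```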
B. Over fitting
In the case of m >> n, i.e. when the number of features is much greater than the number of training samples, the model can fit the training data too closely and heavily underperform on unseen data, causing over-fitting as shown in Fig. 3(b) [6]. Over-fitting is one of the major curses of dimensionality. This phenomenon can be overcome by tuning the regularization parameter carefully, as is done in linear regression, and by selecting the proper kernel and tuning it carefully.
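One common way to tune the regularization strength mentioned above is k-fold cross-validation over the SVM's C parameter, which acts as the inverse of the regularization weight. The sketch below uses scikit-learn's GridSearchCV purely as an illustration; the synthetic data and parameter grid are assumptions, not taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic classification data for demonstration only
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Search over the regularization parameter C (and a kernel choice) by 5-fold CV
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.01, 0.1, 1, 10, 100], "kernel": ["linear", "rbf"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```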
C. Algorithm of SVM
There are two cases, the separable case and the non-separable case. The separable case is where infinite boundaries are available to separate the data into two classes. The boundary giving the largest distance to the nearest observation is called the optimal hyper-plane. The optimal hyper-plane is derived using (4),

w.x + b = 0    (4)

This equation must satisfy two conditions: first, it should separate the two classes A and B well, i.e. f(x) = ω.x + b is positive iff x ∈ A; second, it should lie as far as possible from all the observations, adding to the robustness of the model. Given that the distance from the hyper-plane to an observation x is |ω.x + b| / ||ω||, the maximum margin is 2/||ω||.
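To make the margin quantities above concrete, the short sketch below (illustrative only; w, b and the sample point are hypothetical) computes the distance |ω.x + b| / ||ω|| from a point to a hyper-plane and the resulting margin width 2/||ω||.

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyper-plane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # normal vector of the hyper-plane (||w|| = 5)
b = -2.0
x = np.array([1.0, 1.0])

print(distance_to_hyperplane(w, b, x))   # |3 + 4 - 2| / 5 = 1.0
print(2.0 / np.linalg.norm(w))           # margin width between w.x + b = +1 and -1
```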
In the non-separable case the two classes cannot be separated properly; they overlap. A term measuring the error must be added, and the margins are normalized to 1/||ω||, giving a term ξi called the slack variable. An error of the model is an observation where ξi > 1, and the sum of all the ξi gives a measure of the total classification error. Thus, two constraints are formed to construct the hyper-plane: first, for every i, yi.(ω.xi + b) >= 1 − ξi, and second, 1/2 ||ω||² + δ Σi ξi is minimal. The quantity δ is a parameter that penalizes errors; increasing it increases the sensitivity to errors, and the adaptation of the model to the errors also increases.
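The slack variables and the soft-margin objective described above can be written out directly. The following minimal sketch (an illustration under assumed variable names, not the paper's code) computes ξi = max(0, 1 − yi(ω.xi + b)) and the quantity 1/2 ||ω||² + δ Σ ξi for a toy dataset.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, delta):
    """1/2 ||w||^2 + delta * sum(xi), with slack xi_i = max(0, 1 - y_i (w.x_i + b))."""
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i, zero for well-classified points
    errors = np.sum(slack > 1.0)                     # observations counted as errors (xi > 1)
    objective = 0.5 * np.dot(w, w) + delta * slack.sum()
    return objective, errors

X = np.array([[2.0, 2.0], [1.0, 1.5], [-1.0, -1.0], [0.5, -2.0]])
y = np.array([1, 1, -1, -1])
print(soft_margin_objective(np.array([1.0, 1.0]), 0.0, X, y, delta=1.0))
```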
III. TYPES OF SVM CLASSIFIER
There are two kinds of classifiers used for SVM, linear and non-linear, which are discussed below.
Here n training dataset(x1, y1 to xn, yn) are supplied. A polynomial.ie, when C = 0, the polynomial is homogeneous.
large margin classifier between the two classes of data will K is the inner product of the feature space based on a
be obtained. Any hyper-plane can be mentioned as the set of mapping φ i.e. K(x,y) = < (φ (x) , φ (y)) >. This kernel
points 𝑥⃗ satisfying (5), function finds its use in the natural language processing. In
𝑤
⃗⃗⃗. 𝑥⃗ − 𝑏 = 0 − (5) the Fig.4 shows how the kernel responds to the different
degrees.
Where 𝑤⃗⃗⃗ serves as the normal vector to the hyper-plane.
There can be of 2 types of margin, hard margin and soft d=0 d=2 d=5
margin. If the training data can be separable linearly and with
completely without errors (outliers and noise), hard margin is
used. In case of errors, either margin is smaller or hard
margin fails. The hard margins are constructed in the
following steps:
-It has to be enforced that all the points are out of margin
i.e. (wT. xj+b) yj ≥a
-The margin should be maximised i.e. Max γ = a/||ω|| where
a is the margin after points are projected onto ω. Fig. 4. Response of the kernel functions on different
degrees.
- Finally, setting a to 1, we get min || ω|| (ωt. xj + b ) >=1.
ii. Sigmoid kernel function
Dual form is when ω is a linear combination of training
observations i.e. ω = Σαl yl xl where α will be 0 except for The sigmoid kernel has its origin from the neural
support vectors. Soft margin can be called as extension of the networks. It is generally not as effective as the RBF but it
hard margin. It is used in case of nonseparable classes i.e. can be tuned to work approximately at par with the Gaussian
overlapping classes as explained earlier. It introduces a slack RBF kernels. Problems where the number of feature vectors
variable ξi and ξj to calculate the net error in the are high or non linear decision boundary in 2 dimensions,
classification i.e. min || ω|| + C Σ ξj (wT. xj+b) yj>=1- ξ j. In Sigmoid Kernels may be more or less as good as Gaussian
linear SVM two parallel hyper-planes are selected that RBF Kernels. The performance of the kernels in such
distinguishes between the two classes of data, so that the situation depends upon the level of cross validation one
distance between them can be maximized. Here by rescaling needs to do for each. It finds it application as an activation
the datasets can be represented by the two equations for function of artificial neurons as in equation (8),
labeling each of the classes separated by the boundary using K(x, y) = tanh(∝xT y + c) -(8)
(6) and (7),
Fig.5 shows how the surface plot of the C-SVM and the
𝑤
⃗⃗⃗. 𝑥⃗ − 𝑏 = 1 − (6) sigmoid function looks like.
𝑤
⃗⃗⃗. 𝑥⃗ − 𝑏 = −1 − (7)
The classifiers are designed in such a way that anything
above the boundary in (6) is of one class while anything
below the mentioned constraint in (7) will be considered as
other class.
B. Non-Linear SVM
This is the part where SVM plays its major role. Initially
SVM was designed to serve for linear classifications. But
later on in the late 20th century, it was designed to be used
for non linear classification as well by the help of the
kernels. The most common types of kernel are, polynomial
ii. Sigmoid kernel function
The sigmoid kernel has its origin in neural networks. It is generally not as effective as the RBF kernel, but it can be tuned to work approximately at par with the Gaussian RBF kernel. For problems where the number of feature vectors is high, or where the decision boundary is non-linear in two dimensions, sigmoid kernels may be more or less as good as Gaussian RBF kernels; the performance of the kernels in such situations depends upon the level of cross-validation done for each. It finds its application as an activation function of artificial neurons, as in equation (8),

K(x, y) = tanh(α xT y + c)    (8)

Fig. 5 shows how the surface plot of the C-SVM and the sigmoid function looks.
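The sigmoid kernel in (8) can be evaluated directly; the sketch below (toy values assumed for α, c and the vectors) also shows the matching scikit-learn setup, where gamma plays the role of α and coef0 the role of c.

```python
import numpy as np
from sklearn.svm import SVC

def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    """K(x, y) = tanh(alpha * x.y + c), the sigmoid (neural-network) kernel."""
    return np.tanh(alpha * np.dot(x, y) + c)

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 0.5, 0.5])
print(sigmoid_kernel(x, y, alpha=0.1, c=0.0))   # tanh(0.1 * 3.0)

# Equivalent kernel choice in scikit-learn
clf = SVC(kernel="sigmoid", gamma=0.1, coef0=0.0)
```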
with other algorithms. A classifier is judged on the basis of training time, testing time and classification accuracy. It will be very challenging to determine a relationship between the kernels and the data distribution that would choose a proper kernel function for a given dataset in order to maximize class separability between data points. It can also be seen that nowadays mobile applications collect huge amounts of user data for a better in-app experience of the users. This can be done using properly scalable kernels, with which collecting large volumes of data won't be necessary.
REFERENCES
[1] V. Vapnik, "Statistical Learning Theory", Wiley-Interscience, New York, 1998.
[2] V.N. Vapnik, "Principles of Risk Minimization for Learning Theory", in Proceedings of Advances in Neural Information Processing Systems, 1992.
[3] C. Junli, and J. Licheng, "Classification mechanism of support vector
machines," 5th International Conference on Signal Processing
Proceedings. 16th World Computer Congress, Beijing, 2000, vol.3,
pp. 1556-1559, 2000.
[4] L. Rosasco, E.D. De Vito, A. Caponnetto, M. Piana, A. Verri, "Are
Loss Functions All the Same?" Neural Computation, vol. 16, 2004.
[5] C.J.C Burges, “A Tutorial On Support Vector Machines for Pattern
Recognition”, Data Mining and Knowledge Discovery , vol. 2, no. 2,
pp. 121-167, 1998.
[6] H. Han, J. Xiaoqian, “Overcome Support Vector Machine Diagnosis
Overfitting.” Cancer Informatics, vol. 13, suppl 1, pp. 145–158, 2014.
[7] N. Gruzling, "Linear separability of the vertices of an n-dimensional
hypercube”, M.Sc Thesis, University of Northern British Columbia,
2006.
[8] D. Chaudhuri, and B.B. Chaudhuri, "A novel multiseed
nonhierarchical data clustering technique," IEEE Transactions on
Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 27, no. 5,
pp. 871-876, Sep 1997.
[9] S.P. Boyd, L. Vandenberghe, “Convex Optimization” Cambridge
University Press. ISBN 978-0-521-83378-3, 2004.
[10] M.A. Aizerman, E.M. Braverman, L.I. Rozonoer, "Theoretical
foundations of the potential function method in pattern recognition
learning", Automation and Remote Control, vol. 25, pp. 821–837,
1964.
[11] B.E. Boser, I.M. Guyon, V.N Vapnik, "A training algorithm for
optimal margin classifiers". Proceedings of the fifth annual workshop
on Computational learning theory, pp. 144, 1992.
[12] Y. Goldberg and M. Elhadad, “SVM: Fast, Space-Efficient, non-
Heuristic, Polynomial Kernel Computation for NLP Applications”,
Proc. ACL-08: HLT, 2008.
[13] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, “Choosing
multiple parameters for support vector machines”, Machine Learning,
vol. 46, pp. 131–159, 2002.
[14] A. Shashua, "Introduction to Machine Learning: Class Notes 67577".
arXiv:0904.3664v1 Freely accessible [cs.LG], 2009.
[15] C. Cortes, V. Vapnik, “Support-vector networks”, Machine learning,
vol. 20, no. 3, pp. 273-297, 1995.
[16] Y.W. Chang, C.J. Hsieh, K.W. Chang, M. Ringgaard, C.J. Lin,
"Training and testing low-degree polynomial data mappings via linear
SVM", Journal of Machine Learning Research, Vol. 11, pp. 1471–
1490, 2010.
[17] A. Ben-Hur, C.S. Ong, S. Sonnenburg, B. Schölkopf, G. Rätsch,
“Support Vector Machines and Kernels for Computational Biology”,
PLoS computational biology, vol. 4. 2008.
[18] D. Michie, D.J. Spiegelhalter, and C.C. Taylor, “Machine Learning,
Neural and Statistical Classification”, Englewood Cliffs, N.J.:
Prentice Hall, 1994.
[19] J. Vert, K. Tsuda, and B. Schölkopf, "A primer on kernel methods",
2004.
[20] D.S. Broomhead, D. Lowe, "Multivariable functional interpolation and adaptive networks", Complex Systems, vol. 2, pp. 21-355, 1988.
[21] D.K. Agarwal and R. Kumar, "Spam Filtering using SVM with different Kernel Functions", International Journal of Computer Applications, vol. 136, no. 5, February 2016.
[22] I. Kotsia, N. Nikolaidis, and I. Pitas, "Facial expression recognition in videos using a novel multi-class support vector machines variant", Aristotle University of Thessaloniki, Department of Informatics, Thessaloniki, Greece.
[23] S. Alsaleem, "Automated Arabic Text Categorization Using SVM and NB", International Arab Journal of e-Technology, vol. 2, no. 2, 2011.
[24] P. Fabian and K. Stąpor, "Developing a new SVM classifier for the extended ES protein structure prediction", Federated Conference on Computer Science and Information Systems, Prague, pp. 169-172, 2017.
[25] W. Astuti, R. Akmeliawati, W. Sediono, M.J.E. Salami, "Hybrid Technique Using Singular Value Decomposition (SVD) and Support Vector Machine (SVM) Approach for Earthquake Prediction", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 5, pp. 1719-1728, 2014.