SVM & Image Classification.
Benoît Patra
February 2014
Contents

1 Support Vector Machines
  1.1 Introduction
  1.2 Support Vector Machines
    1.2.1 Linearly separable set
    1.2.2 Nearly linearly separable set
    1.2.3 Linearly inseparable set
      1.2.3.1 The kernel trick
      1.2.3.2 Classification: projection
      1.2.3.3 Mapping conveniently
      1.2.3.4 Usual kernel functions

2 Computation under C++
  2.1 Libraries & datasets employed
  2.2 Project format
  2.3 Two-class SVM implementation
    2.3.1 First results
    2.3.2 Parameter selection
      2.3.2.1 Optimal training on parameter grid
      2.3.2.2 Iterating and sharpening results
  2.4 A good insight: testing on a small zone
  2.5 Central results: testing on a larger zone
    2.5.1 Results
    2.5.2 Case of an unreached minimum
  2.6 Going further: enriching our model
    2.6.1 Case (A): limited dataset
    2.6.2 Case (B): richer dataset
  2.7 Conclusions

A Unbalanced data set
  A.1 Different costs for misclassification

B Multi-class SVM
  B.1 One-versus-all
  B.2 One-versus-one

Bibliography
Chapter 1
Support Vector Machines

1.1 Introduction
Support vector learning is based on simple ideas which originated from statistical learning theory [1]. Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns. A basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
1.2 Support Vector Machines
A data set containing points which belong to two different classes can be represented by the following set:
\[
D = \{(x_i, y_i),\ 1 \le i \le m \mid \forall i,\ y_i \in \{-1; 1\},\ x_i \in \mathbb{R}^q\}, \qquad (m, q) \in \mathbb{N}^2 \tag{1.1}
\]
where $y_i$ represents the belonging to one of the two classes, $x_i$ the training points, and $q$ the dimension of the data set. One of the most important things we have to focus on is the shape of the data set. Our goal is to find the best way to distinguish between the two classes. Ideally, we would like to have a linearly separable data set, in which our two sets of points can be fully separated by a line in a two-dimensional space, or by a hyperplane in an n-dimensional space. However, this is not the case in general. We will look in the following subsections at three possible configurations for our dataset.
1.2.1 Linearly separable set
In the following example (Fig. 1.1), it is easy to see that the data points can be linearly separated. Most of the time, with a big data set, it is impossible to say just by visualizing the data whether it can be linearly separated or not; sometimes the data cannot even be visualized.
Figure 1.1: A simple linearly separable dataset. Blue points are labelled 1; red points are labelled -1.
To solve the problem analytically, we have to define several new objects.

Definition. A linear separator is a function $f$ that depends on two parameters $w$ and $b$, given by the following formula:
\[
f_{w,b}(x) = \langle w, x \rangle + b, \qquad b \in \mathbb{R},\ w \in \mathbb{R}^q. \tag{1.2}
\]
This separator can take more values than $-1$ and $1$. When $f_{w,b}(x) \ge 0$, $x$ will belong to the class of vectors such that $y_i = 1$; in the opposite case, to the other class (i.e. such that $y_i = -1$). The line of separation is the contour line defined by the equation $f_{w,b}(x) = 0$.
Definition. The margin of an element $(x_i, y_i)$, relatively to a separator $f$, noted $\gamma^f_{(x_i, y_i)}$, is the real number given by:
\[
\gamma^f_{(x_i, y_i)} = f(x_i)\, y_i \ge 0. \tag{1.3}
\]
Definition. The margin of a set of points $D$, relatively to a separator $f$, is the minimum of the margins over all the elements of $D$:
\[
\gamma^f_D = \min\left\{ \gamma^f_{(x_i, y_i)} \mid (x_i, y_i) \in D \right\}. \tag{1.4}
\]
Up to rescaling $(w, b)$, we can restrict our attention to separators whose margin on $D$ is at least 1, i.e.
\[
y_i(\langle w, x_i \rangle + b) \ge 1. \tag{1.5}
\]
The goal of the SVM is to maximize the margin of the data set.
Figure 1.2: Support vectors and minimal margin. The orange line represents the separation, while the pink and blue ones represent respectively the hyperplanes associated with the equations $f_{w,b}(x) = 1$ and $f_{w,b}(x) = -1$.
Lemma. The width of the band delimited by the hyperplanes $f_{w,b}(x) = 1$ and $f_{w,b}(x) = -1$ equals $\frac{2}{\|w\|}$.

Proof. Let $u$ be a point of the contour line defined by $f_{w,b}(x) = 1$, and let $u'$ be its orthogonal projection on the contour line $f_{w,b}(x) = -1$. Hence we have $f_{w,b}(u) - f_{w,b}(u') = 2$, i.e. $\langle u - u', w \rangle = 2$. Yet $u - u'$ and $w$ are collinear and have the same orientation, so $\langle u - u', w \rangle = \|u - u'\|\,\|w\|$. Besides, $\|u - u'\|$ is equal to the width of the band constituted by the two contour lines, which therefore equals $\frac{2}{\|w\|}$.
In order to find the best separator, i.e. the one providing the maximum margin, we have to seek within the class of separators such that $\gamma^f_{(x_i, y_i)} \ge 1$, $\forall (x_i, y_i) \in D$, and retain the one for which $\|w\|$ is minimal. This leads us to solve the following constrained optimization problem:
\[
\min_{w,b} \ \frac{\|w\|^2}{2} \qquad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1, \ \forall i. \tag{1.6}
\]
NB. We minimize $\frac{\|w\|^2}{2}$ for calculus purposes: derivations become easier; besides, it is better to work with the squared norm. By introducing Lagrange multipliers $\alpha_i$, the previous constrained problem can be expressed as:
\[
\underset{w,b}{\arg\min}\ \max_{\alpha_i \ge 0} \left\{ \frac{\|w\|^2}{2} - \sum_{i=1}^{m} \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 \right] \right\} \tag{1.7}
\]
that is, we look for a saddle point. In doing so, all the points which can be separated as $y_i(\langle w, x_i \rangle + b) - 1 > 0$ do not matter, since we must set the corresponding $\alpha_i$ to zero.
This problem can now be solved by standard quadratic programming techniques.
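For completeness, the standard derivation behind that quadratic program (not detailed in the text above, see [1]): setting the derivatives of the Lagrangian (1.7) with respect to $w$ and $b$ to zero gives
\[
w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0,
\]
and substituting back yields the dual problem actually handed to the solver:
\[
\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle
\qquad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{m} \alpha_i y_i = 0.
\]
The training points with $\alpha_i > 0$ are exactly the support vectors of Figure 1.2.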
1.2.2 Nearly linearly separable set
In this subsection, we will discuss the case of a nearly separable set, i.e. a dataset for which using a linear separator would be efficient enough. If there exists no hyperplane that can split the dataset entirely, the following method, called the soft margin method, will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. Let us modify the maximum margin idea to allow mislabeled examples to be treated the same way, by allowing points to have a margin which can be smaller than 1, even negative. The previous constraint in (1.6) now becomes:
\[
\forall (x_i, y_i) \in D, \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \tag{1.8}
\]
where the $\xi_i \ge 0$ are called the slack variables, and measure the degree of misclassification of the data $x_i$. The objective function we minimize also has to be changed: we increase it by a function which penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes:
\[
\min_{w,b,\xi} \ \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i \qquad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0. \tag{1.9}
\]
This constrained minimization problem can be solved using Lagrange multipliers, as done previously. We now solve the following problem:
\[
\underset{w,b,\xi}{\arg\min}\ \max_{\alpha_i, \beta_i \ge 0} \left\{ \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} \beta_i \xi_i \right\} \tag{1.10}
\]
with $\alpha_i, \beta_i \ge 0$.
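As a point of reference (a standard result stated here for completeness, not taken from the text above): eliminating $\xi$ and $\beta$ from (1.10) leaves the same dual problem as in the separable case, the only difference being that the multipliers are now bounded above by $C$:
\[
\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle
\qquad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{m} \alpha_i y_i = 0.
\]
Since this dual only involves the inner products $\langle x_i, x_j \rangle$, it is also the form in which the kernel trick of the next subsection is applied.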
1.2.3 Linearly inseparable set
We saw in the previous subsection that linear classification can lead to misclassifications; this is especially true if the dataset $D$ is not separable at all. Let us consider the following example (Fig. 1.3). For this set of data points, any linear classification would introduce too much misclassification to be considered accurate enough.
Figure 1.3: Linearly inseparable set. Blue points are labelled 1; red points are labelled -1.
1.2.3.1 The kernel trick
To solve our classification problem, let us introduce the kernel trick. For machine learning algorithms, the kernel trick is a way of mapping observations from a general data set $S$ into an inner product space $V$, without having to compute the mapping explicitly, such that the observations will have a meaningful linear structure in $V$. Hence linear classifications in $V$ are equivalent to generic classifications in $S$. The trick used to avoid the explicit mapping is to use learning algorithms that only require dot products between the vectors in $V$, and to choose the mapping such that these high-dimensional dot products can be computed within the original space, by means of a certain kernel function: a function $K : S^2 \to \mathbb{R}$ that can be expressed as an inner product in $V$.
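A classic textbook illustration of this (independent of the dataset used later): in $\mathbb{R}^2$, the kernel $K(x, z) = \langle x, z \rangle^2$ corresponds to the explicit mapping
\[
p(x) = \left( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \right), \qquad
\langle p(x), p(z) \rangle = x_1^2 z_1^2 + 2\, x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \langle x, z \rangle^2,
\]
so the dot product in the three-dimensional feature space is obtained from a single dot product in the original plane, without ever computing $p$.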
1.2.3.2 Classification: projection
To understand the usefulness of the trick, let us go back to our classification problem. Let us consider a simple projection of the vectors of our dataset $D$ into a much richer, higher-dimensional feature space. We project each point of $D$ into this bigger space and make a linear separation there. Let us name $p$ this projection:
\[
\forall (x_i, y_i) \in D, \quad p(x_i) = \begin{pmatrix} p_1(x_i) \\ \vdots \\ p_n(x_i) \end{pmatrix}
\]
as we express the projected vector $p(x_i)$ in a basis of the $n$-dimensional new space. This point of view can lead to problems, because $n$ can grow without any limit, and nothing assures us that the $p_i$ are linear in the vectors. Following the same method as above would imply working on a new set $D'$:
\[
D' = p(D) = \{(p(x_i), y_i),\ 1 \le i \le m \mid \forall i,\ y_i \in \{-1; 1\},\ x_i \in \mathbb{R}^q\}, \qquad (m, q) \in \mathbb{N}^2 \tag{1.11}
\]
Because it implies calculating $p$ for each vector of $D$, this method will never be used in practice.
1.2.3.3 Mapping conveniently
Let us first notice that it is not necessary to calculate $p$, as the optimization problem only involves inner products between the different vectors. We can now consider the kernel trick approach. We construct:
\[
K : D^2 \to \mathbb{R} \quad \text{such that} \quad K(x, z) = \langle p(x), p(z) \rangle, \qquad \forall (x, y_x), (z, y_z) \in D \tag{1.12}
\]
making sure that it corresponds to a projection into the (possibly unknown) space $V$. We then avoid the calculation of $p$, and the description of the space into which we are projecting. The optimization problem remains the same, through replacing $\langle \cdot, \cdot \rangle$ by $K(\cdot, \cdot)$:
\[
\min_{w,b,\xi} \ \frac{K(w, w)}{2} + C \sum_{i=1}^{m} \xi_i, \qquad \xi_i \ge 0. \tag{1.13}
\]
1.2.3.4 Usual kernel functions

Polynomial: $K(x, z) = (x^T z + c)^d$, where $c \ge 0$ is a constant trading off the influence of higher-order versus lower-order terms in the polynomial. Polynomials such that $c = 0$ are called homogeneous.

Gaussian radial basis (RBF): $K(x, z) = \exp(-\gamma \|x - z\|^2)$, $\gamma > 0$. Sometimes parametrized using $\gamma = \frac{1}{2\sigma^2}$.

Hyperbolic tangent: $K(x, z) = \tanh(\kappa\, x^T z + c)$, for $\kappa > 0$, $c < 0$ well chosen.
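The polynomial, RBF and hyperbolic tangent kernels above are available in the OpenCV/LibSVM interface used in the next chapter as CvSVM::POLY, CvSVM::RBF and CvSVM::SIGMOID. A minimal sketch of the corresponding configuration, assuming the OpenCV 2.4.x C++ API and placeholder parameter values (not the ones selected later):

#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

// Sketch: parameters of a two-class soft-margin SVM with an RBF kernel.
CvSVMParams makeRbfParams(double C, double gamma)
{
    CvSVMParams params;
    params.svm_type    = CvSVM::C_SVC;   // soft-margin classification (problem 1.9)
    params.kernel_type = CvSVM::RBF;     // K(x, z) = exp(-gamma * ||x - z||^2)
    params.C           = C;              // penalty on the slack variables
    params.gamma       = gamma;          // RBF width parameter
    params.term_crit   = cvTermCriteria(CV_TERMCRIT_ITER + CV_TERMCRIT_EPS,
                                        1000, 1e-6);
    return params;
}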
Chapter 2
Computation under C++

2.1 Libraries & datasets employed
We used for this project the computer vision and machine learning library OpenCV. All its SVM features are based on the specific library LibSVM, by Chih-Chung Chang and Chih-Jen Lin. We trained our models on the Image Classification Dataset from Andrea Vedaldi and Andrew Zisserman's Oxford assignment. It includes five different image classes (aeroplanes, motorbikes, people, horses and cars) of various sizes, and pre-computed feature vectors, in the form of a sequence of consecutive 6-digit values. The pictures used are all colour images in .jpg format, of various dimensions. The dataset can be downloaded at: http://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htm.
2.2 Project format
The C++ project itself possesses 4 branches, for the opening, saving, training and testing phases. In its original form, it allows opening two training files and a testing one, on a user-friendly, console-input basis. The user enters file directories, the format used and labels for the different training classes. For the testing phase, a label is asked, so that results obtained via the SVM classification can be compared with the prior label given by the user; the latter can directly see the misclassification results (rate, number of misclassified files) in the console output. The user can either choose their own kernel type and parameter values, or let the program find the optimal ones; classes have been created accordingly. The following results have been obtained using this program and additional versions (especially when including multiple training files) that derive directly from it; the latter will not be presented here. The project can be found on GitHub at: https://github.com/Parveez/CPP_Project_ENSAE_2013.
2.3 Two-class SVM implementation

2.3.1 First results
We first trained our SVM with the training sets aeroplane train.txt and horse train.txt; the data tested was contained in aeroplane val.txt and horse val.txt. As the images included in the two training classes may vary in size, we resized them all to a unique zone; the same goes for the testing set. All images are stored in two matrices, one for the training phase and one for the testing phase: each matrix row is a point (here, an image), and all its coefficients are features (here, pixels). For example, for 251 training images, all of size 50x50 pixels, the training matrix will be of dimensions 251x2500. For a 50x50 pixel zone, with respectively 112 and 139 elements in each class, learning time amounts to 0.458 seconds; testing time, for 274 elements, amounts to 11.147 seconds. But a classifier of any type produces bad results for randomly-assigned parameter values: for example, with the default values assigned to C and $\gamma$, a Gaussian classifier misclassifies 126 elements of the aeroplane val.txt file. The following section discusses the optimal selection of the statmodel parameters.
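The project's exact loading code is not reproduced here; the sketch below only illustrates the layout just described (one image per row, one pixel per column), assuming greyscale images resized to a square zone with the OpenCV 2.4.x API:

#include <string>
#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Build a training or testing matrix: one image per row, one pixel per column.
// For 251 images of 50x50 pixels this produces a 251x2500 matrix of floats.
cv::Mat buildFeatureMatrix(const std::vector<std::string>& files, int zoneSize)
{
    cv::Mat features;                              // N x (zoneSize*zoneSize), CV_32F
    for (size_t i = 0; i < files.size(); ++i) {
        cv::Mat img = cv::imread(files[i], CV_LOAD_IMAGE_GRAYSCALE);
        if (img.empty()) continue;                 // skip unreadable files
        cv::resize(img, img, cv::Size(zoneSize, zoneSize));
        img.convertTo(img, CV_32F);                // SVM training expects float data
        features.push_back(img.reshape(1, 1));     // flatten the image to one row
    }
    return features;
}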
2.3.2 Parameter selection

2.3.2.1 Optimal training on parameter grid
The effectiveness of an SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C. The best combination is here selected by a grid search with multiplicative growing sequences of the parameter, given a certain step. Input parameters for the parameter selection are:
- min val, max val: the extremal values tested;
- step: the step parameter.
Parameter values are tested through the following iteration sequence: (min val, min val * step, ..., min val * step^n), with n such that min val * step^n < max val. Parameters are considered optimal when they have the best cross-validation accuracy. Using an initial grid gives us a first approximation of the best parameter possible, and produces better results than default training and testing. It is important to mention here that, without specifying any kernel type to our program, the RBF kernel was always chosen as the best fit for our data. All the results presented thereafter will be presented for the RBF kernel, with optimization of the parameters C and $\gamma$; the following methods are applicable to other classifiers as well, even though they remain less efficient.
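In OpenCV this grid search is exposed through CvSVM::train_auto, which cross-validates over CvParamGrid objects built exactly from (min val, max val, step). A minimal sketch, with our own helper and variable names:

#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

// Sketch: grid search over (C, gamma) with k-fold cross-validation.
// trainData holds one sample per row (CV_32F); labels holds one label per row.
CvSVMParams gridSearch(const cv::Mat& trainData, const cv::Mat& labels,
                       double minC, double maxC,
                       double minGamma, double maxGamma, double step)
{
    CvSVMParams params;
    params.svm_type    = CvSVM::C_SVC;
    params.kernel_type = CvSVM::RBF;

    CvParamGrid cGrid(minC, maxC, step);           // values minC, minC*step, ...
    CvParamGrid gammaGrid(minGamma, maxGamma, step);

    CvSVM svm;
    svm.train_auto(trainData, labels, cv::Mat(), cv::Mat(), params,
                   10,                             // 10-fold cross-validation
                   cGrid, gammaGrid,
                   CvSVM::get_default_grid(CvSVM::P),
                   CvSVM::get_default_grid(CvSVM::NU),
                   CvSVM::get_default_grid(CvSVM::COEF),
                   CvSVM::get_default_grid(CvSVM::DEGREE));
    return svm.get_params();                       // contains the optimal C and gamma
}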
Even if results are improved by the use of a parameter grid, refinements can be added. Indeed, we sharpen our estimation by computing iterative parameter selection, each time on smaller grids:

Data: default initial grid
Result: optimal parameter for SVM training
while iterations under threshold do
    train the SVM on the grid through cross-validation;
    retrieve the best parameter;
    set parameter = best parameter;
    re-center the grid;
    diminish the grid size;
end
Algorithm 1: Basic iterative parameter testing.

One can initially think of:
\[
\mathrm{max\ val}^{(j)} = \mathrm{max\ val}^{(j-1)}, \qquad
\mathrm{min\ val}^{(j)} = \mathrm{min\ val}^{(j-1)}, \qquad
\mathrm{step}^{(j)} = \sqrt{\mathrm{step}^{(j-1)}}
\]
to implement the grid resizing at step $j$, with $\mathrm{param}^{(j)}$ the best parameter value obtained after training the SVM model. Yet such a recursion is not properly efficient: as $j$ grows, the calculation time grows very fast. Indeed, as the step gets smaller, the number of iterations needed to reach max val increases very quickly. As we usually initialize the C and $\gamma$ grid extremal values at different powers of ten, with $\mathrm{step}^{(0)} = 10$, a convenient way to resize the grid at step $j$ is the following:
\[
\mathrm{max\ val}^{(j)} = \mathrm{param}^{(j)} \cdot 10^{\frac{1}{2^{j}}}, \qquad
\mathrm{min\ val}^{(j)} = \mathrm{param}^{(j)} \cdot 10^{-\frac{1}{2^{j}}}, \qquad
\mathrm{step}^{(j)} = \sqrt{\mathrm{step}^{(j-1)}} = \dots = 10^{\frac{1}{2^{j}}}
\]
as we can express min val and max val using powers of ten after replacing $\mathrm{step}^{(j)}$. It only takes a couple of iterations to go through the grid, and produces equivalent or better results. Besides, the more precise the estimation of the parameters, the faster the iteration.
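Putting the two previous points together, the refinement loop can be sketched as follows (illustrative names only; gridSearch is the helper sketched in the previous subsection, the grid at step j is re-centred on the previous optimum, and the step is square-rooted at each iteration):

#include <cmath>
#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

// Cross-validation helper sketched in the previous subsection.
CvSVMParams gridSearch(const cv::Mat& trainData, const cv::Mat& labels,
                       double minC, double maxC,
                       double minGamma, double maxGamma, double step);

// Sketch: iterative refinement of (C, gamma) on shrinking multiplicative grids.
CvSVMParams refineParams(const cv::Mat& trainData, const cv::Mat& labels, int nIterations)
{
    double step = 10.0;                            // step^(0)
    double minC = 1e-3,  maxC = 7e3;               // initial C grid (illustrative)
    double minG = 1e-7,  maxG = 3e10;              // initial gamma grid (illustrative)
    CvSVMParams best;

    for (int j = 1; j <= nIterations; ++j) {
        best = gridSearch(trainData, labels, minC, maxC, minG, maxG, step);

        step = std::sqrt(step);                    // step^(j) = 10^(1/2^j)
        minC = best.C / step;     maxC = best.C * step;        // re-centre on best C
        minG = best.gamma / step; maxG = best.gamma * step;    // re-centre on best gamma
    }
    return best;
}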
2.4 A good insight: testing on a small zone
We first sought results for a small zone of 50x50 pixels, to get a primary overview of how our algorithm works. For such a zone, and the following initial grid and characteristics(1):

Initial grid: $\gamma \in [10^{-7},\ 3 \times 10^{10}]$, $C \in [10^{-3},\ 7 \times 10^{3}]$.
Number of class 1 files: 112; number of class -1 files: 139; files tested: 274.

we obtained the following results.

No iterations nor grid usage (latest calculation time(2): 0.599 seconds): default value $C = 1$, default value $\gamma = 1$, files misclassified: 126, misclassification rate: 0.459.

After 1 iteration (latest calculation time: 11.691 seconds): final value $\gamma = 10^{-7}$, final value $C = 1000$, files misclassified: 68, misclassification rate: 0.248.

After 5 iterations (latest calculation time: 4.138 seconds): final value $\gamma = 9.085 \times 10^{-8}$, final value $C = 90.851$, files misclassified: 68, misclassification rate: 0.248.

After 20 iterations (latest calculation time: 3.974 seconds): final values of $\gamma$, C, files misclassified and misclassification rate:
(1) Again, we point out that the RBF kernel type was not specified initially by the user, but chosen by the program during parameter optimization.
(2) Here, the latest calculation time represents the total calculation time, i.e. including training and testing time, for the last iteration mentioned.
What can we surmise from those results? Firstly, the number of misclassified images is improved by automatically training our model on a grid. Secondly, it is also improved by iterating the parameter selection process. Although the decay is slow, each iteration helps our SVM classify the testing data better. Lastly, the calculation time seems to be globally lower iteration after iteration, in acceptable proportions considering the small size of our zone.
2.5 Central results: testing on a larger zone

2.5.1 Results
Let us now run training and testing on a larger zone of 300x300 pixels, to gain a better comprehension of our model's behaviour. Parameter grids are initialized to the same values as in the previous subsection; here again, the RBF kernel is the optimal kernel type for the data.

No iterations nor grid usage (latest calculation time: 20.857 seconds): default value $C = 1$, default value $\gamma = 1$, files misclassified: 126, misclassification rate: 0.459.
After 1 iteration (latest calculation time: 420.265 seconds): final value $\gamma = 10^{-7}$, final value $C = 1000$, files misclassified: 118, misclassification rate: 0.430.

After 5 iterations (latest calculation time: 161.741 seconds): final value $\gamma = 9.085 \times 10^{-9}$, final value $C = 133.352$, files misclassified: 60, misclassification rate: 0.218.

After 15 iterations (latest calculation time: 143.982 seconds): final value $\gamma = 3.048 \times 10^{-9}$, final value $C = 38.983$, files misclassified: 68, misclassification rate: 0.248.
Figure 2.4: Values of C per iteration. Blue background, left: normal scale. Red background, right: logarithmic scale.

Figure 2.5: Values of $\gamma \times 10^{10}$ per iteration. Blue background, left: normal scale. Red background, right: logarithmic scale.
2.5.2 Case of an unreached minimum
Here the most intriguing fact is probably that, after 5 iterations, the number of misclassified files drops to 60 out of 274 tested, then rises to 62 at the next step. This can be explained by the following fact: the point $(\gamma^{(5)}, C^{(5)})$ is near the minimum value we are seeking, i.e. the one providing the minimal misclassification rate, whose exact value cannot be reached through the grid at the fifth step; and as we reposition $(\gamma, C)$ and resize the grid at $(\gamma^{(5)}, C^{(5)})$, we might actually re-center the problem on a new area that does not include the minimum at all.
Figure 2.6: Problem of the unreached minimum. Here the minimum is included in the upper-middle cell of the grid at step 5. (Gamma, C) is the best approximation available over the grid, but shrinking the grid around this exact point leaves the minimum off the new grid at step 6.
A solution to address this problem may be to use a smoother re-sizing algorithm, like the first one we presented. But this may actually have a negative impact on the calculation time at each step. For example, let us compare our results with those obtained with the initial, less efficient re-sizing algorithm; for the latter, with the same 300x300 pixel zone, the first three steps of iteration on parameter selection produced the following results.
After 1 iteration (latest calculation time: 432.228 seconds): final value $\gamma = 10^{-7}$, final value $C = 1000$, files misclassified: 118, misclassification rate: 0.430.

After 2 iterations (latest calculation time: 644.136 seconds): final value $\gamma = 10^{-7}$, final value $C = 1000$, files misclassified: 118, misclassification rate: 0.430.

After 3 iterations (latest calculation time: 1590.78 seconds): final value $\gamma = 10^{-7}$, final value $C = 1000$, files misclassified: 118, misclassification rate: 0.430.
At the first step, the misclassification rate is the same as with the second re-sizing method; the decay is indeed much slower (the resizing is so smooth that the second and third steps still give a rate of 0.430), and the calculation times are very poor. The third iteration takes 1590.78 seconds to compute, compared to 161.975 seconds with the convenient method. The conclusion of this section is that, in many cases, there is an actual trade-off between computing performance and avoiding the unreached-minimum problem.
2.6 Going further: enriching our model
In the first two sections, we trained our model on two different subsets, aeroplane train.txt and horse train.txt, trying to make predictions for both aeroplanes and horses. Here, we will include more objects (horses, background, motorbikes, and cars) in the class -1, and leave aeroplanes in the class 1; we will only try to classify files from the testing set aeroplane val.txt. Our goal here is to show how using a larger training set can improve our predictions. Let us compare the results between a class -1 training set containing only horses, case (A), and the enriched training set described above, case (B). The RBF kernel is the optimal kernel type in both cases. The zone used is of size 300x300.
2.6.1 Case (A): limited dataset
After 5 iterations (latest calculation time: 145.650 seconds): final value $\gamma = 3.83 \times 10^{-8}$, final value $C = 177.8$, files misclassified: 41, misclassification rate: 0.325.

After 10 iterations (latest calculation time: 137.342 seconds): final value $\gamma = 3.28 \times 10^{-8}$, final value $C = 56.51$, files misclassified: 41, misclassification rate: 0.325.

After 20 iterations (latest calculation time: 135.250 seconds): final value $\gamma = 2.27 \times 10^{-8}$, final value $C = 26.25$, files misclassified: 36, misclassification rate: 0.285.

After 40 iterations (latest calculation time: 141.561 seconds): final value $\gamma = 1.59 \times 10^{-8}$, final value $C = 13.98$, files misclassified: 34, misclassification rate: 0.269.
2.6.2 Case (B): richer dataset
After 1 iteration (latest calculation time: 681.084 seconds): final value $\gamma = 10^{-6}$, final value $C = 1000$, files misclassified: 12, misclassification rate: 0.095.
We directly see here, after only 1 iteration, that the classification accuracy is much better; the larger the initial training set, the better. Note that the calculation time can become quite high for very large datasets.
2.7 Conclusions
From all the experiments we conducted in this section, we can draw the following conclusions:
- The number of misclassified images is improved by automatically training our model on a parameter grid.
- It can also be improved by selecting the best parameters iteratively, shrinking our grid after each step.
- Choosing the right shrinking algorithm is very important, and can be very tricky. Indeed, for a very sharp resizing, the calculation time can be acceptable but we might leave the point of minimal misclassification out of the grid.
- Using a large training set is always a good thing, as it drastically improves classification accuracy.
Appendix A
Unbalanced data set

A.1 Different costs for misclassification
Let us consider an unbalanced data set of the following form:
\[
D = \{(x_i, y_i),\ 1 \le i \le m \mid \forall i,\ y_i \in \{-1; 1\},\ x_i \in \mathbb{R}^q\}, \qquad (m, q) \in \mathbb{N}^2 \tag{A.1}
\]
where one of the two classes contains many more elements than the other. The soft margin problem of Section 1.2.2 penalizes every slack variable with the same constant $C$:
\[
\min_{w,b,\xi} \ \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i, \qquad \xi_i \ge 0. \tag{A.2}
\]
To take the imbalance into account, we replace the single penalty term $C \sum_{i=1}^{m} \xi_i$ by a new one:
\[
C_+ \sum_{i \in J_+} \xi_i + C_- \sum_{i \in J_-} \xi_i, \qquad C_+ \ge 0,\ C_- \ge 0, \tag{A.3}
\]
where $J_+$ (resp. $J_-$) denotes the set of indices $i$ such that $y_i = 1$ (resp. $y_i = -1$).
One condition has to be satisfied in order to give equal overall weight to each class: the total penalty term has to be the same for each class. A hypothesis commonly made is to suppose that the number of misclassified vectors in each class is proportional to the number of vectors in that class, leading us to the following condition:
\[
C_- \, \mathrm{Card}(J_-) = C_+ \, \mathrm{Card}(J_+). \tag{A.4}
\]
If, for instance, $\mathrm{Card}(J_-) \ge \mathrm{Card}(J_+)$, then $C_- \le C_+$: a larger importance will be given to misclassified vectors $x_i$ such that $y_i = 1$.
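For instance, with $\mathrm{Card}(J_-) = 500$ and $\mathrm{Card}(J_+) = 100$, condition (A.4) gives $C_+ = 5\, C_-$. In the OpenCV interface used in Chapter 2, such per-class penalties can be set (for C_SVC) through the class_weights field of CvSVMParams, which multiplies C for each class. A hedged sketch, assuming the weights follow the order of the class labels (here -1 first, then +1):

#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

// Sketch: giving different penalties C- and C+ to the two classes.
// The effective penalty of class k becomes C * weight_k.
void trainWeightedSvm(CvSVM& svm, const cv::Mat& trainData, const cv::Mat& labels,
                      double C, float weightNeg, float weightPos)
{
    CvSVMParams params;
    params.svm_type    = CvSVM::C_SVC;
    params.kernel_type = CvSVM::RBF;
    params.gamma       = 1e-7;                     // placeholder value
    params.C           = C;

    cv::Mat weights  = (cv::Mat_<float>(2, 1) << weightNeg, weightPos);
    CvMat   cWeights = weights;                    // class_weights expects the C struct
    params.class_weights = &cWeights;

    svm.train(trainData, labels, cv::Mat(), cv::Mat(), params);
}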
Appendix B
Multi-class SVM
Several methods have been suggested to extend the previous SVM scheme to solve multiple-class problems [2]. All the following schemes are applicable to any binary classifier, and are not exclusively related to SVMs. The most famous methods are the one-versus-all and one-versus-one methods.
B.1 One-versus-all
In this and the following subsection, the training and testing sets can be classified in $M$ classes $C_1, C_2, \dots, C_M$. The one-versus-all method is based on the construction of $M$ binary classifiers, each labelling one specified class 1 and all the others -1. During the testing phase, the classifier providing the highest margin determines the class.
B.2 One-versus-one
The one-versus-one method is based on the construction of $\frac{M(M-1)}{2}$ binary classifiers, one for each pair of the $M$ classes. During the testing phase, every point is analysed by each classifier, and a majority vote is conducted to determine its class. If we denote $x_t$ the point to classify and $h_{ij}$ the SVM classifier separating the classes $C_i$ and $C_j$, then the label awarded to $x_t$ can formally be written:
\[
y(x_t) = \underset{i \in \{1, \dots, M\}}{\arg\max} \ \mathrm{Card}\left\{ j \ne i \mid h_{ij}(x_t) = C_i \right\}. \tag{B.1}
\]
This represents the class awarded to $x_t$ most of the time, after being analysed by all the classifiers $h_{ij}$. Some ambiguity may exist in the counting of votes, if there is no majority. Both methods present downsides. For the one-versus-all version, nothing indicates that the classification results of the $M$ classifiers are comparable. Besides, the problem isn't well-balanced anymore: for example, with $M = 10$, we use only 10% of positive examples, against 90% negative ones.
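A possible sketch of the one-versus-one vote of (B.1), assuming the $\frac{M(M-1)}{2}$ two-class machines have already been trained with label +1 for class $C_i$ and -1 for class $C_j$ (all names below are illustrative, not taken from the project):

#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

// One trained two-class SVM h_ij, labelling +1 for class i and -1 for class j.
struct PairwiseSvm {
    int   classI;
    int   classJ;
    CvSVM svm;
};

// One-versus-one majority vote: each classifier votes for one of its two classes;
// the most voted class wins (ties resolved arbitrarily).
int classifyOneVsOne(const std::vector<PairwiseSvm*>& classifiers,
                     const cv::Mat& sample, int nClasses)
{
    std::vector<int> votes(nClasses, 0);
    for (size_t k = 0; k < classifiers.size(); ++k) {
        float response = classifiers[k]->svm.predict(sample);
        if (response > 0) votes[classifiers[k]->classI]++;
        else              votes[classifiers[k]->classJ]++;
    }
    int best = 0;
    for (int c = 1; c < nClasses; ++c)
        if (votes[c] > votes[best]) best = c;
    return best;
}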
Bibliography
[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory, 1995.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning, 2006.