Bayesian Classifier Implementation Using MATLAB
Submitted by
Shabeeb Ali O.
M.Tech. Signal Processing
Semester -II
Roll No.15
March 6, 2018
Contents
0.1 Question
0.2 Theory
  0.2.1 Bayesian Decision Theory
  0.2.2 Discriminant Functions and Decision Surfaces
  0.2.3 Normal density
  0.2.4 Discriminant Functions For The Normal Density
  0.2.5 Confusion Matrix
0.3 Procedure
  0.3.1 Linearly Separable Data Set
  0.3.2 Non-Linearly Separable Data Set, Real-World Data Set
0.4 Results
  0.4.1 Linearly Separable Data Set
  0.4.2 Non Linearly Separable Data Set - Overlapping Data
  0.4.3 Non Linearly Separable - Spiral
  0.4.4 Non Linearly Separable - Helix
  0.4.5 Real World Data - Glass
  0.4.6 Real World Data - Jaffe
0.5 Observation
0.1 Question
The data set given to you contains three folders:
1. Linearly separable data set
2. Non-linearly separable data set
3. Real-world data set
Implement classifiers using the Bayes decision rule for Cases I, II and III for all data sets.
0.2 Theory
0.2.1 Bayesian Decision Theory
Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It
is considered the ideal case in which the probability structure underlying the categories is known perfectly,
although this sort of situation rarely occurs in practice.
Bayes formula

posterior = (likelihood × prior) / evidence

Mathematically:

P(ωj|x) = p(x|ωj) P(ωj) / p(x)

where

p(x) = ∑_{j=1}^{c} p(x|ωj) P(ωj)

c : number of classes
Posterior
Bayes formula shows that by observing the value of x we can convert the prior probability P(ωj) into the
a posteriori probability (or posterior) P(ωj|x) - the probability of the state of nature being ωj given that
the feature value x has been measured.
Likelihood
We call p(x|ωj) the likelihood of ωj with respect to x, a term chosen to indicate that, other things being
equal, the category ωj for which p(x|ωj) is large is more "likely" to be the true category.
Evidence
The evidence factor p(x) can be viewed as merely a scale factor that guarantees that the posterior
probabilities sum to one, as all good probabilities must.
Bayes Decision Rule
For a two-class case the Bayes decision rule can be stated as:
Decide ω1 if P (ω1 |x) > P (ω2 |x); otherwise decide ω2
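As a minimal sketch of this rule for a scalar feature (assuming Gaussian class-conditional densities with purely illustrative parameters; normpdf requires the Statistics and Machine Learning Toolbox), the decision could be coded in MATLAB as:

% Two-class Bayes decision for a scalar feature x (illustrative sketch).
x   = 1.3;                        % observed feature value
mu1 = 0; sigma1 = 1; P1 = 0.6;    % class 1 parameters (assumed)
mu2 = 2; sigma2 = 1; P2 = 0.4;    % class 2 parameters (assumed)

g1 = normpdf(x, mu1, sigma1) * P1;   % p(x|w1) P(w1)
g2 = normpdf(x, mu2, sigma2) * P2;   % p(x|w2) P(w2)

if g1 > g2
    disp('Decide class 1');
else
    disp('Decide class 2');
end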
Feature space
Allowing the use of more than one feature merely requires replacing the scalar x by the feature vector x, where
x is a point in a d-dimensional Euclidean space R^d, called the feature space.
Suppose that we observe a particular x and that we contemplate taking action αi. If the true state of
nature is ωj, then by definition we will incur the loss λ(αi|ωj). Because P(ωj|x) is the probability that the true
state of nature is ωj, the expected loss associated with taking action αi is
R(αi|x) = ∑_{j=1}^{c} λ(αi|ωj) P(ωj|x)
An expected loss is called a risk, and R(αi |x) is called the conditional risk. Whenever we encounter
a particular observation x, we can minimize our expected loss by selecting the action that minimizes the
conditional risk.
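A small sketch of this computation, assuming a hypothetical loss matrix lambda with lambda(i,j) = λ(αi|ωj) and an already-computed posterior vector:

% Choose the action with minimum conditional risk R(a_i|x).
lambda    = [0 1; 1 0];      % lambda(i,j) = loss of action i when the true class is j (assumed zero-one loss)
posterior = [0.7; 0.3];      % P(w_j|x), assumed already computed

R = lambda * posterior;      % R(a_i|x) = sum_j lambda(i,j) P(w_j|x)
[~, bestAction] = min(R);    % Bayes rule: take the minimum-risk action
fprintf('Take action %d (conditional risk %.2f)\n', bestAction, R(bestAction));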
0.2.2 Discriminant Functions and Decision Surfaces
A classifier can be represented by a set of discriminant functions gi(x), i = 1, . . . , c: the classifier assigns a
feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i. For minimum-error-rate classification a natural
choice is the posterior probability itself,

gi(x) = P(ωi|x) = p(x|ωi) P(ωi) / ∑_{j=1}^{c} p(x|ωj) P(ωj)
Decision Region
Even though the discriminant functions can be written in a variety of forms, the decision rules are equivalent.
The effect of any decision rule is to divide the feature space into c decision regions, R1, . . . , Rc. If gi(x) >
gj(x) for all j ≠ i, then x is in Ri, and the decision rule calls for us to assign x to ωi. The regions are
separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant
functions.
0.2.3 Normal density
Expectation
The definition of the expected value of a scalar function f (x) defined for some density p(x) is given by
E[f(x)] ≡ ∫_{−∞}^{∞} f(x) p(x) dx
Univariate density
The continuous univariate normal density is given by
p(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − µ)/σ)² ]

µ ≡ E[x] = ∫_{−∞}^{∞} x p(x) dx

σ² ≡ E[(x − µ)²] = ∫_{−∞}^{∞} (x − µ)² p(x) dx
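As a brief sketch, the sample estimates of µ and σ² can be computed in MATLAB and used to evaluate the fitted density (the data vector x below is purely illustrative):

% Estimate univariate normal parameters from samples and plot the fitted density.
x      = randn(1000,1)*2 + 5;   % illustrative samples from N(5, 4)
muHat  = mean(x);               % estimate of mu
varHat = var(x);                % estimate of sigma^2

t = linspace(min(x), max(x), 200);
p = 1./sqrt(2*pi*varHat) .* exp(-0.5*(t - muHat).^2./varHat);  % fitted density
plot(t, p), xlabel('x'), ylabel('p(x)')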
Multivariate density
The multivariate normal density in d dimensions is written as
p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp[ −(1/2) (x − µ)^t Σ^{−1} (x − µ) ]
where x is a d-component column vector, µ is the d-component mean vector, Σ is the d × d covariance matrix,
|Σ| is its determinant and Σ^{−1} is its inverse. The covariance matrix Σ is defined as the square
matrix whose ij-th element σij is the covariance of xi and xj. The covariance of two features measures their
tendency to vary together, i.e., to co-vary.
We can use the outer product (x − µ)(x − µ)^T to write the covariance matrix as Σ = E[(x − µ)(x − µ)^T].
Thus Σ is symmetric, and its diagonal elements are the variances of the respective individual elements
xi of x (i.e., σi²), which can never be negative; the off-diagonal elements are the covariances of xi and xj,
which can be positive or negative. If the variables xi and xj are statistically independent, the covariances
σij are zero, and the covariance matrix is diagonal. If all the off-diagonal elements are zero, p(x) reduces to
the product of the univariate normal densities for the components of x. The analog to the Cauchy-Schwarz
inequality comes from recognizing that if w is any d-dimensional vector, then the variance of w^T x can never
be negative. This leads to the requirement that the quadratic form w^T Σ w never be negative. Matrices for
which this is true are said to be positive semidefinite; thus, the covariance matrix is positive semidefinite.
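A sketch of evaluating this multivariate normal density directly from the formula (the point x, mean mu and covariance Sigma below are assumed values; mvnpdf from the Statistics and Machine Learning Toolbox gives the same result):

% Evaluate the d-dimensional normal density at a point x.
x     = [1; 2];            % test point (assumed)
mu    = [0; 0];            % mean vector (assumed)
Sigma = [2 0.5; 0.5 1];    % covariance matrix (assumed)

d  = length(mu);
dx = x - mu;
p  = 1/((2*pi)^(d/2)*sqrt(det(Sigma))) * exp(-0.5*dx'*(Sigma\dx));

% Cross-check (needs Statistics and Machine Learning Toolbox):
% mvnpdf(x', mu', Sigma)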
0.2.4 Discriminant Functions For The Normal Density
The discriminant functions can be easily evaluated if the class-conditional densities p(x|ωi) are multivariate
normal, i.e., p(x|ωi) ∼ N(µi, Σi). In this case

gi(x) = −(1/2)(x − µi)^t Σi^{−1}(x − µi) − (d/2) ln 2π − (1/2) ln|Σi| + ln P(ωi)    (1)
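Equation (1) maps directly into MATLAB. The following sketch assumes that mu_i, Sigma_i and prior_i have already been estimated from the training samples of class i:

% Discriminant function of eq. (1) for class i; all inputs are assumed to
% have been estimated beforehand from the training samples of class i.
function g = discriminantCase3(x, mu_i, Sigma_i, prior_i)
    d  = length(mu_i);
    dx = x(:) - mu_i(:);
    g  = -0.5*dx'*(Sigma_i\dx) - (d/2)*log(2*pi) ...
         - 0.5*log(det(Sigma_i)) + log(prior_i);
end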
Case I : Σi = σ²I
The simplest case occurs when the measured features are statistically independent of each other and
each feature has the same variance, σ². For example, if we were trying
to recognize an apple from an orange, and we measured the colour and the weight as our feature vector,
then chances are that there is no relationship between these two properties. The non-diagonal elements of
the covariance matrix are the covariances of the two features x1 =colour and x2 =weight. But because these
features are independent, their covariances would be 0. Therefore, the covariance matrix for both classes
would be diagonal, being merely σ 2 times the identity matrix I.
As a second simplification, assume that the variance of colours is the same as the variance of weights.
This means that there is the same degree of spreading out from the mean of colours as there is from the
mean of weights. If this is true for some class i then the covariance matrix for that class will have identical
diagonal elements. Finally, suppose that the variance for the colour and weight features is the same in both
classes. This means that the degree of spreading for these two features is independent of the class from
which you draw your samples. If this is true, then the covariance matrices will be identical. When normal
distributions with a covariance matrix that is just a constant multiplied by the identity matrix are plotted,
the samples cluster about the mean in spherical clouds.
Geometrically, this corresponds to the situation in which the samples fall in equal-size hyperspherical
clusters, the cluster for the ith class being centered about the mean vector µi (see Figure). The computation
of the determinant and the inverse of Σi is particularly easy:
|Σi| = σ^{2d}   and   Σi^{−1} = (1/σ²) I
Because both |Σi| and the (d/2) ln 2π term in eq. (1) are independent of i, they are unimportant additive
constants that can be ignored. Thus, we obtain the simple discriminant functions
gi(x) = −||x − µi||² / (2σ²) + ln P(ωi)

where ||·|| denotes the Euclidean norm, that is,

||x − µi||² = (x − µi)^t (x − µi)
If the prior probabilities are not equal, then the above equation shows that the squared distance ||x − µi||² must
be normalized by the variance σ 2 and offset by adding ln P (ωi ); thus, if x is equally near two different mean
vectors, the optimal decision will favor the a priori more likely category.
Regardless of whether the prior probabilities are equal or not, it is not actually necessary to compute
distances. Expansion of the quadratic form (x − µi )t (x − µi ) yields
gi(x) = −(1/(2σ²)) [x^t x − 2µi^t x + µi^t µi] + ln P(ωi)
which appears to be a quadratic function of x. However, the quadratic term xT x is the same for all i, making
it an ignorable additive constant. Thus, we obtain the equivalent linear discriminant functions
gi(x) = wi^t x + wi0

where

wi = (1/σ²) µi

and

wi0 = −(1/(2σ²)) µi^t µi + ln P(ωi)
We call wi0 the threshold or bias for the ith category.
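A minimal MATLAB sketch of the Case I classifier, assuming the class means are stored column-wise in Mu, a pooled common variance sigma2, and a vector of priors (these variable names are assumptions, not part of the original code):

% Case I (Sigma_i = sigma^2 I): linear discriminants g_i(x) = w_i'x + w_i0.
% Mu(:,i) is the mean of class i, sigma2 the common variance, priors(i) = P(w_i).
function label = classifyCase1(x, Mu, sigma2, priors)
    c = size(Mu, 2);
    g = zeros(c, 1);
    for i = 1:c
        w_i  = Mu(:,i)/sigma2;                                   % weight vector
        wi0  = -(Mu(:,i)'*Mu(:,i))/(2*sigma2) + log(priors(i));  % bias term
        g(i) = w_i'*x(:) + wi0;
    end
    [~, label] = max(g);   % assign x to the class with the largest g_i(x)
end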
Case II : Σi = Σ
Another simple case arises when the covariance matrices for all of the classes are identical but otherwise
arbitrary. Since it is quite likely that we may not be able to measure features that are independent, this
section allows for any arbitrary covariance matrix for the density of each class. In order to keep things
simple, assume also that this arbitrary covariance matrix is the same for each class ωi. This means that we
allow for the situation where the colour of fruit may covary with the weight, but the way in which it does
is exactly the same for apples as it is for oranges. Instead of having spherically shaped clusters about our
means, the shapes may be any type of hyperellipsoid, depending on how the features we measure relate to
each other. However, the clusters of each class are of equal size and shape and are still centered about the
mean for that class.
Geometrically, this corresponds to the situation in which the samples fall in hyperellipsoidal clusters of
equal size and shape, the cluster for the ith class being centered about the mean vector µi. Because both
the (1/2) ln|Σi| term and the (d/2) ln 2π term in eq. (1) are independent of i, they can be ignored as superfluous
additive constants. Removing these constants from the general discriminant function for the normal density
leaves discriminant functions of the form:
gi(x) = −(1/2)(x − µi)^t Σ^{−1}(x − µi) + ln P(ωi)
Note that, the covariance matrix no longer has a subscript i, since it is the same matrix for all classes.
If the prior probabilities P (ωi ) are the same for all c classes, then the ln P (ωi ) term can be ignored. In
this case, the optimal decision rule can once again be stated very simply: To classify a feature vector x,
measure the squared Mahalanobis distance (x − µi )T Σ−1 (x − µi ) from x to each of the c mean vectors, and
assign x to the category of the nearest mean. As before, unequal prior probabilities bias the decision in favor
of the a priori more likely category.
Expansion of the quadratic form (x − µi)^t Σ^{−1}(x − µi) results in a sum involving a quadratic term x^t Σ^{−1} x
which here is independent of i. After this term is dropped, the resulting discriminant functions are again linear:

gi(x) = wi^t x + wi0

where

wi = Σ^{−1} µi

and

wi0 = −(1/2) µi^t Σ^{−1} µi + ln P(ωi)
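A corresponding sketch for Case II, assuming a single pooled covariance matrix Sigma shared by all classes (again with assumed variable names):

% Case II (Sigma_i = Sigma): squared Mahalanobis distance to each mean.
% Mu(:,i) is the mean of class i, Sigma the shared covariance, priors(i) = P(w_i).
function label = classifyCase2(x, Mu, Sigma, priors)
    c = size(Mu, 2);
    g = zeros(c, 1);
    for i = 1:c
        dx   = x(:) - Mu(:,i);
        g(i) = -0.5*dx'*(Sigma\dx) + log(priors(i));
    end
    [~, label] = max(g);   % nearest mean in the Mahalanobis sense
end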
0.2.5 Confusion Matrix
A confusion matrix summarizes how the class labels predicted by a classifier compare with the actual class labels.
Example:
If a classification system has been trained to distinguish between class 1, class 2 and class 3, a confusion
matrix will summarize the results of testing the algorithm for further inspection. Assuming a sample of 27
test cases in which 8 cases belong to class 1, 6 cases belong to class 2, and 13 cases belong to class 3, the resulting
confusion matrix could look like the table below:
Class 1 Class 2 Class 3
Class 1 5 2 0
Class 2 3 3 2
Class 3 0 1 11
In this confusion matrix, of the 8 actual cases that belong to class 1, the system predicted that three belong
to class 2, and of the six actual cases that belong to class 2, it predicted that two belong to class 1, three to
class 2 and one to class 3. We can see from the matrix that the system in question has trouble distinguishing
between class 1 and class 2. All correct predictions are located on the diagonal of the table, so it is easy to
visually inspect the table for prediction errors, as they will be represented by values outside the diagonal.
Classification Accuracy:

classification accuracy (%) = (sum of all diagonal elements of the confusion matrix / sum of all elements of the confusion matrix) × 100
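A short MATLAB sketch of building the confusion matrix and computing this accuracy from vectors of true and predicted labels (the label vectors and the row = actual / column = predicted layout are assumptions; confusionmat from the Statistics and Machine Learning Toolbox could be used instead):

% Confusion matrix and accuracy from integer class labels 1..c.
trueLabels = [1 1 2 2 3 3 3];      % illustrative ground truth
predLabels = [1 2 2 2 3 3 1];      % illustrative predictions

c  = max([trueLabels predLabels]);
CM = zeros(c);
for n = 1:numel(trueLabels)
    CM(trueLabels(n), predLabels(n)) = CM(trueLabels(n), predLabels(n)) + 1;
end

accuracy = sum(diag(CM))/sum(CM(:))*100;   % percentage of correct predictions
fprintf('Classification accuracy = %.4f %%\n', accuracy);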
0.3 Procedure
0.3.1 Linearly Separable Data Set
1. Load the training data set and the test data set separately into train and test arrays respectively.
2. Compute the covariance matrix Σi for each training class using MATLAB's cov() function.
3. Compute the mean of each training class and store it as µi, where i denotes the ith class.
4. Determine the discriminant function for each case (Case I, Case II and Case III).
5. For each case and for each test sample, evaluate the discriminant function against every training class and
select the highest discriminant value among the classes.
6. The class for which the highest discriminant value is obtained is the predicted class for that test sample
(a condensed sketch of this procedure is given after this list).
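A condensed sketch of the above procedure for Case III (arbitrary per-class covariances); the cell array train{i} and matrix test are assumptions about how the data might be held in memory:

% Sketch of the procedure for Case III (arbitrary per-class covariance).
% train{i} is an N_i-by-d matrix of training samples of class i and
% test is an M-by-d matrix of test samples (both assumed already loaded).
c      = numel(train);
counts = cellfun(@(T) size(T,1), train);
priors = counts/sum(counts);                 % P(w_i) from class frequencies
Mu    = cell(1, c);
Sigma = cell(1, c);
for i = 1:c
    Mu{i}    = mean(train{i})';              % class mean (d x 1)
    Sigma{i} = cov(train{i});                % class covariance (d x d)
end

pred = zeros(size(test,1), 1);
for m = 1:size(test,1)
    x = test(m,:)';
    g = zeros(c, 1);
    for i = 1:c
        dx   = x - Mu{i};
        g(i) = -0.5*dx'*(Sigma{i}\dx) - 0.5*log(det(Sigma{i})) + log(priors(i));
    end
    [~, pred(m)] = max(g);                   % predicted class for test sample m
end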
0.4 Results
0.4.1 Linearly Separable Data Set
Scatter Plot for Case I
Confusion Matrix
Class 1 Class 2 Class 3
Class 1 199 0 1
Class 2 0 200 0
Class 3 0 0 200
Scatter Plot for Case II
Confusion Matrix
Class 1 Class 2 Class 3
Class 1 199 0 1
Class 2 0 200 0
Class 3 0 0 200
Scatter Plot for Case III
Confusion Matrix
Class 1 Class 2 Class 3
Class 1 200 0 0
Class 2 0 200 0
Class 3 0 0 200
0.4.2 Non Linearly Separable Data Set - Overlapping Data
Scatter Plot for Case I
Confusion Matrix
Class 1 Class 2 Class 3 Class4
Class 1 82 2 2 5
Class 2 3 72 8 8
Class 3 11 0 80 0
Class 4 11 3 0 77
Scatter Plot for Case II
Confusion Matrix
Class 1 Class 2 Class 3 Class4
Class 1 82 2 2 5
Class 2 3 72 8 8
Class 3 11 0 80 0
Class 4 10 3 1 77
Scatter Plot for Case III
Confusion Matrix
Class 1 Class 2 Class 3 Class4
Class 1 82 2 2 5
Class 2 3 72 8 8
Class 3 11 0 80 0
Class 4 10 3 1 77
0.4.3 Non Linearly Separable Data Set - Spiral Data
Scatter Plot for Case I
Confusion Matrix
Class 1 Class 2
Class 1 151 120
Class 2 120 151
Scatter Plot for Case II
Confusion Matrix
Class 1 Class 2
Class 1 157 114
Class 2 114 157
Scatter Plot for Case III
Confusion Matrix
Class 1 Class 2
Class 1 157 114
Class 2 114 157
0.4.4 Non Linearly Separable Data Set - Helix Data
Case I
Confusion Matrix:
Class 1 Class 2
Class 1 0 151
Class 2 0 151
Case II
Confusion Matrix:
Class 1 Class 2
Class 1 0 151
Class 2 0 151
Case III
Confusion Matrix:
Class 1 Class 2
Class 1 0 151
Class 2 0 151
0.4.5 Real World Data - Glass
Case I
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
Class 1 4 11 7 0 0 0
Class 2 13 1 5 3 1 0
Class 3 0 1 5 0 0 0
Class 4 0 0 0 4 0 0
Class 5 0 0 0 1 1 1
Class 6 0 0 0 0 0 9
Classification Accuracy (Case I) : 35.8209%
Case II
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
Class 1 17 3 2 0 0 0
Class 2 11 7 2 1 2 0
Class 3 1 1 4 0 0 0
Class 4 0 2 1 1 0 0
Class 5 0 0 0 0 2 1
Class 6 0 0 0 0 1 8
Classification Accuracy (Case II) : 52.2090%
Case III
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6
Class 1 17 3 2 0 0 0
Class 2 11 7 2 1 2 0
Class 3 1 1 4 0 0 0
Class 4 0 2 1 1 0 0
Class 5 0 0 0 0 2 1
Class 6 0 0 0 0 1 8
Classification Accuracy (Case III) : 52.2090%
0.4.6 Real World Data - Jaffe
Case I
Classification Accuracy (Case I) : 44.6518%
Case II
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 Class 7
Class 1 802 0 3 0 541 535 96
Class 2 1338 25 0 0 321 0 0
Class 3 48 0 78 0 1406 86 286
Class 4 140 0 21 143 1084 882 0
Class 5 816 0 4 0 483 578 96
Class 6 805 0 3 0 465 608 96
Class 7 821 0 0 0 1062 0 314
Case III
Confusion Matrix:
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 Class 7
Class 1 802 0 3 0 541 535 96
Class 2 1338 25 0 0 321 0 0
Class 3 48 0 78 0 1406 86 286
Class 4 140 0 21 143 1084 882 0
Class 5 816 0 4 0 483 578 96
Class 6 805 0 3 0 465 608 96
Class 7 821 0 0 0 1062 0 314
0.5 Observation
For complex data such as the non-linearly separable data sets, the Bayesian classifier gives poor classification accuracy.
In comparison, linearly separable data are handled efficiently by the classifier in all cases. Among the cases,
Case I (Σi = σ²I) exhibits much better classification compared to the other cases.