
Module 5

Nonlinear Classifiers


The design of linear classifiers is described by linear discriminant functions (hyperplanes) g(x). In
the simple two-class case, the perceptron algorithm computes the weights of the linear function
g(x), provided that the classes are linearly separable. For nonlinearly separable classes, linear
classifiers were designed optimally, for example, by minimizing the squared error. Here we will
deal with problems that are not linearly separable and for which the design of a linear classifier,
even in an optimal way, does not lead to satisfactory performance. The design of nonlinear
classifiers therefore emerges as an inescapable necessity.

THE XOR PROBLEM


The well-known Exclusive OR (XOR) Boolean function is a typical example of such a problem.
Boolean functions can be interpreted as classification tasks. Indeed, depending on the values of the
binary input data x = [x1, x2, ..., xl]^T, the output is either 0 or 1, and x is classified into one of
the two classes, A (output 1) or B (output 0).

Figure 1 and Table 1 show the positions of the classes in space. It is apparent from this figure that
no single straight line exists that separates the two classes.

Fig 1: Classes A and B for the XOR problem.
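For reference, the standard XOR truth table and the resulting class assignments are:

x1   x2   XOR(x1, x2)   Class
0    0    0             B
0    1    1             A
1    0    1             A
1    1    0             B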

In contrast, the other two Boolean functions, AND and OR, are linearly separable. The
corresponding truth tables for the AND and OR operations are given in Table 4.2 and the respective
class positions in the two-dimensional space are shown in Figure 4.2a and 4.2b. Figure 4.3 shows
a perceptron, introduced in the previous chapter, with synaptic weights computed so as to realize
an OR gate (verify).
Our major concern now is first to tackle the XOR problem and then to extend the procedure to
more general cases of nonlinearly separable classes.
THE TWO-LAYER PERCEPTRON
To separate the two classes A and B in Figure 4.1, a first thought is to draw two straight lines
instead of one. Figure 4.4 shows two such possible lines, g1(x) = 0 and g2(x) = 0, as well as the
regions in space for which g1(x) ≷ 0 and g2(x) ≷ 0. The classes can now be separated: class A is
to the right (+) of g1(x) = 0 and to the left (−) of g2(x) = 0, while the region corresponding to
class B lies either to the left or to the right of both lines. What we have really done is to attack
the problem in two successive phases.
During the first phase, we calculate the position of a feature vector x with respect to each of the
two decision lines. In the second phase, we combine the results of the previous phase and find
the position of x with respect to both lines, that is, outside or inside the shaded area. We will now
view this from a slightly different perspective, which will subsequently lead us easily to
generalizations.

Realization of the two decision lines (hyperplanes) g1(·) and g2(·) during the first phase of
computations is achieved with the adoption of two perceptrons with inputs x1, x2 and appropriate
synaptic weights. The corresponding outputs are yi = f(gi(x)), i = 1, 2, where the activation
function f(·) is the step function with levels 0 and 1. Table 4.3 summarizes the yi values for all
possible combinations of the inputs. These are nothing other than the relative positions of the input
vector x with respect to each of the two lines. From another point of view, the computations during
the first phase perform a mapping of the input vector x to a new vector y = [y1, y2]^T. The decision
during the second phase is now based on the transformed data; that is, our goal is now to separate
[y1, y2] = [0, 0] and [y1, y2] = [1, 1], which correspond to class B vectors, from [y1, y2] = [1, 0],
which corresponds to class A vectors.
As is apparent from Figure 4.5, this is easily achieved by drawing a third line g(y), which can be
realized via a third neuron. In other words, the mapping of the first phase transforms the
nonlinearly separable problem into a linearly separable one. We will return to this important issue
later on. Figure 4.6 gives a possible realization of these steps. Each of the three lines is realized
via a neuron with appropriate synaptic weights.

The resulting multilayer architecture can be considered a generalization of the perceptron, and
it is known as a two-layer perceptron, or a two-layer feedforward neural network. The two
neurons (nodes) of the first layer perform the computations of the first phase and constitute the
so-called hidden layer. The single neuron of the second layer performs the computations of the
final phase and constitutes the output layer.
In Figure 4.6 the input layer corresponds to the (nonprocessing) nodes where the input data are
applied. Thus, the number of input layer nodes equals the dimension of the input space; no
processing takes place at the input layer nodes. The lines realized by the two-layer perceptron of
the figure are the two hidden-layer decision lines g1(x) = 0 and g2(x) = 0 together with the
output-layer line g(y) = 0.
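Since the exact weight values of Figure 4.6 are not reproduced above, the following minimal
Python sketch uses one standard choice of weights (an assumption, not the figure's values):
g1(x) = x1 + x2 − 1/2, g2(x) = x1 + x2 − 3/2, and g(y) = y1 − y2 − 1/2. It shows the two phases:
the hidden neurons map x to y = [y1, y2], and the output neuron separates the mapped points.

```python
import numpy as np

def step(v):
    # Step activation with levels 0 and 1
    return (np.asarray(v) >= 0).astype(int)

# One standard choice of weights (an assumption) realizing the XOR construction
W_hidden = np.array([[1.0, 1.0],    # weights of g1(x) = x1 + x2 - 1/2
                     [1.0, 1.0]])   # weights of g2(x) = x1 + x2 - 3/2
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])       # output line g(y) = y1 - y2 - 1/2
b_out = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = step(W_hidden @ np.array(x) + b_hidden)      # first phase: position w.r.t. g1, g2
    out = step(w_out @ y + b_out)                    # second phase: separate the mapped points
    print(x, "->", "A" if out == 1 else "B")         # (0,1) and (1,0) give A; the others give B
```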

The multilayer perceptron architecture of Figure 4.6 can be generalized to l-dimensional input
vectors and to more than two neurons in the hidden layer (and more than one in the output layer).
We now turn our attention to the class discriminatory capabilities of such networks for more
complicated nonlinear classification tasks.

THREE-LAYER PERCEPTRONS
The inability of two-layer perceptrons to separate classes resulting from any union of
polyhedral regions springs from the fact that the output neuron can realize only a single
hyperplane. This is the same situation the basic perceptron faced with the XOR problem. There,
the difficulty was overcome by constructing two lines instead of one. A similar escape path will
be adopted here.

Figure 4.10 shows a three-layer perceptron architecture with two layers of hidden neurons and one
output layer. Such an architecture can separate classes resulting from any union of polyhedral
regions. Indeed, let us assume that all regions of interest are formed by intersections of p
l-dimensional half-spaces defined by p hyperplanes. These hyperplanes are realized by the p
neurons of the first hidden layer, which also perform a mapping of the input space onto the
vertices of the hypercube Hp of unit side length.
In the sequel let us assume that class A consists of the union of K of the resulting polyhedra and
class B of the rest. We then use K neurons in the second hidden layer. Each of these neurons
realizes a hyperplane in the p-dimensional space. The synaptic weights of each second-layer
neuron are chosen so that the realized hyperplane leaves only one of the Hp vertices on one
side and all the rest on the other. A different vertex is isolated for each neuron, namely one of the
K class A vertices. In other words, each time an input vector from class A enters the network, one
of the K neurons of the second layer outputs 1 and the remaining K − 1 output 0. In contrast, for
class B vectors all neurons in the second layer output 0. Classification is now a straightforward
task: choose the output layer neuron to realize an OR gate. Its output will be 1 for class A vectors
and 0 for class B vectors. The proof is now complete.
The number of neurons in the second hidden layer can be reduced by exploiting the geometry that
results from each specific problem—for example, whenever two of the K vertices are located in a
way that makes them separable from the rest using a single hyperplane. Finally, the multilayer
structure can be generalized to more than two classes. To this end, the number of output layer
neurons is increased, realizing one OR gate for each class. Thus, one of them outputs 1 every
time a vector from the respective class enters the network, and all the others output 0.
In summary, we can say that the neurons of the first layer form the hyperplanes, those of the
second layer form the regions, and the neurons of the output layer form the classes.
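The geometry of Figure 4.10 is not reproduced here, so the following sketch uses an assumed toy
setup (all regions and weights are illustrative assumptions) to show the construction: p = 2
hyperplanes x1 = 0.5 and x2 = 0.5 split the plane into four polyhedral regions, class A is taken as
the union of the K = 2 regions mapped to vertices (0,0) and (1,1) of H2, each second-layer neuron
isolates one of those vertices, and the output neuron realizes an OR gate.

```python
import numpy as np

def step(v):
    return (np.asarray(v) >= 0).astype(int)

# First hidden layer: the p = 2 hyperplanes x1 - 0.5 = 0 and x2 - 0.5 = 0
W1, b1 = np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([-0.5, -0.5])
# Second hidden layer: each neuron isolates one class A vertex of H2
W2, b2 = np.array([[1.0, 1.0],      # fires only for vertex (1, 1)
                   [-1.0, -1.0]]), np.array([-1.5, 0.5])   # fires only for vertex (0, 0)
# Output layer: OR gate over the K = 2 second-layer neurons
w3, b3 = np.array([1.0, 1.0]), -0.5

def classify(x):
    y = step(W1 @ x + b1)    # vertex of H2 for the region containing x
    z = step(W2 @ y + b2)    # exactly one z_i fires for class A regions, none for class B
    return "A" if step(w3 @ z + b3) == 1 else "B"

for x in [(0.2, 0.1), (0.8, 0.9), (0.2, 0.9), (0.8, 0.1)]:
    print(x, "->", classify(np.array(x)))   # first two -> A, last two -> B
```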

Backpropagation Algorithm

Backpropagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error obtained in the previous epoch (i.e., iteration).
Proper tuning of the weights reduces the error and makes the model reliable by improving its
generalization.
Backpropagation is short for "backward propagation of errors." It is a standard method of training
artificial neural networks. This method computes the gradient of a loss function with respect to
all the weights in the network.
The multilayer perceptron architectures we have considered so far have been developed around
the McCulloch–Pitts neuron, employing as the activation function the step function, f(x) = 1 for
x ≥ 0 and f(x) = 0 otherwise.

A popular family of continuous, differentiable functions that approximate the step function is the
family of sigmoid functions. A typical representative is the logistic function

f(x) = 1 / (1 + exp(−a x)),

where a is a slope parameter.

How Backpropagation Algorithm Works

The backpropagation algorithm computes the gradient of the loss function with respect to a single
weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct
computation. It computes the gradient, but it does not define how the gradient is used. It
generalizes the computation in the delta rule.
Consider the following backpropagation neural network example to understand how the algorithm
works:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually selected randomly.
3. Calculate the output of every neuron, from the input layer through the hidden layers to the
output layer.
4. Calculate the error in the outputs:

Error_B = Actual Output − Desired Output

5. Travel back from the output layer to the hidden layers and adjust the weights so that the error
decreases.
6. Keep repeating the process until the desired output is achieved.

A minimal code sketch of these steps is given below.
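The following Python sketch illustrates the numbered steps on an assumed 2-2-1 network (two
inputs, two hidden logistic neurons, one logistic output) trained on the XOR data with a squared-
error criterion. The architecture, learning rate, and epoch count are illustrative assumptions, not
values taken from the text.

```python
import numpy as np

def sigmoid(v):
    # Logistic activation f(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)    # step 2: weights selected randomly
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # step 1: inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs (XOR)
eta = 0.5                                                     # learning rate (assumed)

for epoch in range(20000):
    # Step 3: forward pass, one layer at a time
    H = sigmoid(X @ W1.T + b1)                   # hidden-layer outputs
    Y = sigmoid(H @ W2.T + b2)                   # output-layer outputs
    # Step 4: error at the output layer
    E = Y - T
    # Step 5: propagate the error backward (chain rule) and adjust the weights
    delta_out = E * Y * (1 - Y)                  # local gradient at the output neuron
    delta_hid = (delta_out @ W2) * H * (1 - H)   # local gradients at the hidden neurons
    W2 -= eta * delta_out.T @ H / len(X)
    b2 -= eta * delta_out.mean(axis=0)
    W1 -= eta * delta_hid.T @ X / len(X)
    b1 -= eta * delta_hid.mean(axis=0)

# Step 6: after enough epochs the outputs approach the 0/1 targets
# (how quickly depends on the random initialization).
print(np.round(Y, 2))
```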

Why Do We Need Backpropagation?

The most prominent advantages of backpropagation are:


● Backpropagation is fast, simple, and easy to program.
● It has no parameters to tune apart from the number of inputs.
● It is a flexible method, as it does not require prior knowledge about the network.
● It is a standard method that generally works well.
● It does not need any special mention of the features of the function to be learned.

What is a Feed Forward Network?


A feedforward neural network is an artificial neural network where the nodes never form a cycle.
This kind of neural network has an input layer, hidden layers, and an output layer. It is the first
and simplest type of artificial neural network.
Types of Backpropagation Networks

Two Types of Backpropagation Networks are:


● Static Back-propagation
● Recurrent Backpropagation
Static back-propagation:
It is a kind of backpropagation network that produces a mapping from a static input to a static
output. It is useful for solving static classification problems such as optical character recognition.
Recurrent backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between the two methods is that the mapping is rapid and static in static
back-propagation, whereas it is nonstatic in recurrent backpropagation.

Basic Concepts of Clustering, Introduction to Clustering, Proximity Measures

We now turn to the unsupervised case, where class labeling of the training patterns is not available.
Thus, our major concern now is to "reveal" the organization of patterns into "sensible" clusters
(groups), which will allow us to discover similarities and differences among patterns and to derive
useful conclusions about them.

Clustering may be found under different names in different contexts, such as unsupervised learning
and learning without a teacher (in pattern recognition), numerical taxonomy (in biology and
ecology), typology (in social sciences), and partition (in graph theory).

Consider the following animals: sheep, dog, cat (mammals); sparrow, seagull (birds); viper, lizard
(reptiles); goldfish, red mullet, blue shark (fish); and frog (amphibians). In order to organize these
animals into clusters, we need to define a clustering criterion.

Thus, if we employ the way these animals bear their progeny as a clustering criterion, the sheep,
the dog, the cat, and the blue shark will be assigned to the same cluster, while all the rest will form
a second cluster (Figure 11.1a). If the clustering criterion is the existence of lungs, the goldfish,
the red mullet, and the blue shark are assigned to the same cluster, while all the other animals are
assigned to a second cluster (Figure 11.1b). On the other hand, if the clustering criterion is the
environment where the animals live, the sheep, the dog, the cat, the sparrow, the seagull, the viper,
and the lizard will form one cluster (animals living outside water); the goldfish, the red mullet, and
the blue shark will form a second cluster (animals living only in water); and the frog will form a
third cluster by itself, since it may live in the water or out of it (Figure 11.1c).
As was the case with supervised learning, we will assume that all patterns are represented in terms
of features, which form l-dimensional feature vectors.

The basic steps that an expert must follow in order to develop a clustering task are the
following:
■ Feature selection. Features must be properly selected so as to encode as much information as
possible concerning the task of interest. Once more, parsimony and, thus, minimum information
redundancy among the features is a major goal. As in supervised classification, preprocessing of
features may be necessary prior to their utilization in subsequent stages. The techniques
discussed there are applicable here.
■ Proximity measure. This measure quantifies how "similar" or "dissimilar" two feature vectors
are. It is natural to ensure that all selected features contribute equally to the computation of the
proximity measure and that no features dominate others. This must be taken care of during
preprocessing.
■ Clustering criterion. This criterion depends on the interpretation the expert gives to the term
sensible, based on the type of clusters that are expected to underlie the data set. For example, a
compact cluster of feature vectors in the l-dimensional space may be sensible according to one
criterion, whereas an elongated cluster may be sensible according to another. The clustering
criterion may be expressed via a cost function or some other type of rule.
■ Clustering algorithms. Having adopted a proximity measure and a clustering criterion, this step
refers to the choice of a specific algorithmic scheme that unravels the clustering structure of the
data set.
■ Validation of the results. Once the results of the clustering algorithm have been obtained, we
have to verify their correctness. This is usually carried out using appropriate tests.
■ Interpretation of the results. In many cases, the expert in the application field must integrate
the results of clustering with other experimental evidence and analysis in order to draw the right
conclusions.
In a number of cases, a step known as clustering tendency should be involved. This includes
various tests that indicate whether or not the available data possess a clustering structure. For
example, the data set may be of a completely random nature, in which case trying to unravel
clusters would be meaningless.
Let us consider the following example (Figure 11.2). How many "sensible" ways of clustering can
we obtain for these points? The most "logical" answer seems to be two: the first clustering
contains four clusters (surrounded by solid circles), and the second clustering contains two
clusters (surrounded by dashed lines). Which clustering is "correct"? There seems to be no
definite answer; both clusterings are valid. The best thing to do is to give the results to an expert
and let the expert decide on the most sensible one. Thus, the final answer to these questions will
be influenced by the expert's knowledge.

11.1.1 Applications of Cluster Analysis


Clustering is a major tool used in a number of applications. To enrich the list of examples already
presented in the introductory chapter of the book, we summarize here four basic directions in which
clustering is of use.
■ Data reduction. In several cases, the amount of available data, N, is very large, and as a
consequence its processing becomes very demanding. Cluster analysis can be used to group the
data into a number of "sensible" clusters, m ≪ N, and to process each cluster as a single entity.
For example, in data transmission, a representative for each cluster is defined. Then, instead of
transmitting the data samples, we transmit a code number corresponding to the representative of
the cluster in which each specific sample lies. Thus, data compression is achieved.
■ Hypothesis generation. In this case we apply cluster analysis to a data set in order to infer some
hypotheses concerning the nature of the data. Thus, clustering is used here as a vehicle to suggest
hypotheses. These hypotheses must then be verified using other data sets.
■ Hypothesis testing. In this context, cluster analysis is used to verify the validity of a specific
hypothesis. Consider, for example, the following hypothesis: "Big companies invest abroad." One
way to verify whether this is true is to apply cluster analysis to a large and representative set of
companies. Suppose that each company is represented by its size, its activities abroad, and its
ability to successfully complete projects on applied research.
■ Prediction based on groups. In this case, we apply cluster analysis to the available data set, and
the resulting clusters are characterized based on the characteristics of the patterns from which they
are formed. Then, if we are given an unknown pattern, we can determine the cluster to which it
most likely belongs and characterize it based on the characterization of the respective cluster.
Suppose, for example, that cluster analysis is applied to a data set concerning patients infected by
the same disease. This results in a number of clusters of patients, according to their reaction to
specific drugs. Then, for a new patient, we identify the most appropriate cluster and, based on it,
decide on his or her medication.

11.1.2 Types of Features

A feature may take values from a continuous range (a subset of R) or from a finite discrete set. If
the finite discrete set has only two elements, then the feature is called binary or dichotomous.
A different categorization of features is based on the relative significance of the values they
take. We have four categories of features: nominal, ordinal, interval-scaled, and ratio-scaled.

The first category, nominal, includes features whose possible values code states. Consider, for
example, a feature that corresponds to the sex of an individual. Its possible values may be 1 for a
male and 0 for a female. Clearly, any quantitative comparison between these values is meaningless.
The next category, ordinal, includes features whose values can be meaningfully ordered.
Consider, for example, a feature that characterizes the performance of a student in the pattern
recognition course. Suppose that its possible values are 4, 3, 2, 1 and that these correspond to
the ratings "excellent," "very good," "good," and "not good." Obviously, these values are arranged
in a meaningful order. However, the difference between two successive values is of no meaningful
quantitative importance.
If, for a specific feature, the difference between two values is meaningful while their ratio is
meaningless, then it is an interval-scaled feature. A typical example is temperature measured in
degrees Celsius. If the temperatures in London and Paris are 5 and 10 degrees Celsius,
respectively, then it is meaningful to say that the temperature in Paris is 5 degrees higher than
that in London. However, it is meaningless to say that Paris is twice as hot as London.
Finally, if the ratio between two values of a specific feature is meaningful, then this is a ratio-
scaled feature, the fourth category. An example of such a feature is weight, since it is meaningful
to say that a person who weighs 100 kg is twice as heavy as a person who weighs 50 kg.

11.1.3 Definitions of Clustering

The definition of clustering leads directly to the definition of a single "cluster." Many definitions
have been proposed over the years. However, most of these definitions are based on loosely
defined terms, such as similar and alike, or they are oriented to a specific kind of cluster. As has
been pointed out, most of these definitions are vague and circular in nature. This fact reveals the
difficulty of giving a universally acceptable definition of the term cluster. If the vectors are
viewed as points in the l-dimensional space, the clusters can be described as "continuous regions
of this space containing a relatively high density of points, separated from other high-density
regions by regions of relatively low density of points." Clusters described in this way are
sometimes referred to as natural clusters. This definition is closer to our visual perception of
clusters in two- and three-dimensional spaces.
Let us now try to give some definitions of "clustering" which, although they may not be universal,
give us an idea of what clustering is. Let X be our data set, that is,
X = {x1, x2, ..., xN}.
In addition, the vectors contained in a cluster Ci are "more similar" to each other and "less similar"
to the feature vectors of the other clusters. Quantifying the terms similar and dissimilar depends
very much on the types of clusters involved. For example, different similarity measures are
required for compact clusters (e.g., Figure 11.3a), for elongated clusters (e.g., Figure 11.3b), and
for shell-shaped clusters (e.g., Figure 11.3c).

11.2 PROXIMITY MEASURES


• Whenever classification is carried out, it is done according to some similarity of the test
pattern to the training patterns.
• For clustering also, patterns that are similar to each other are to be put into the same
cluster, while patterns that are dissimilar are to be put into different clusters.
• To determine this similarity/dissimilarity, proximity measures are used.
• Some of the proximity measures are metrics, while some are not.
• The distance between two patterns is used as a proximity measure. If this distance is
smaller, then the two patterns are more similar.
• For a distance measure d(·, ·) to be a metric, the following properties should hold for all
patterns x, y, z: d(x, y) ≥ 0; d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x) (symmetry);
and d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
The preceding properties show that the Euclidean distance is a dissimilarity measure. In
addition, the Euclidean distance between two vectors takes its minimum value d0 = 0 when the
vectors coincide.

Distance measure
• These measures find the distance between points in a d-dimensional space, where each pattern
is represented as a point in the d-space.
• The distance is inversely proportional to the similarity. If d(X, Y) gives the distance between X
and Y, and s(X, Y) gives the similarity between X and Y, then d(X, Y) ∝ 1 / s(X, Y).

The Euclidean distance is the most popular distance measure. For two patterns X and Y, the
Euclidean distance is

d(X, Y) = sqrt( (x1 − y1)² + (x2 − y2)² + ... + (xd − yd)² ).
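As a small illustration (the vectors below are arbitrary examples, not taken from the text), the
Euclidean distance can be computed as follows; a smaller value indicates more similar patterns.

```python
import numpy as np

# Example patterns chosen arbitrarily for illustration
X = np.array([1.0, 2.0, 3.0])
Y = np.array([4.0, 6.0, 3.0])

d = np.sqrt(np.sum((X - Y) ** 2))  # Euclidean distance
print(d)                           # 5.0
```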
Pearson’s Correlation Coefficient

Correlation measures the association between two variables, and correlation coefficients are used
to find out how strong the relationship between the two variables is. The most popular correlation
coefficient is Pearson's correlation coefficient. It is very commonly used in linear regression.
Consider the example of car price prediction, where we have to predict the price considering all
the variables that affect it, such as carlength, curbweight, carheight, carwidth, fueltype, carbody,
horsepower, etc.

Pearson's correlation coefficient is represented as 'r'; it measures how strong the linear
association between two continuous variables is, using the formula

r = [n(∑xy) − (∑x)(∑y)] / √([n∑x² − (∑x)²][n∑y² − (∑y)²])
Values of Pearson's correlation coefficient:
The value of 'r' ranges from −1 to +1. A value of 0 specifies that there is no relation between the
two variables. A value greater than 0 indicates a positive relationship between the two variables,
where an increase in the value of one variable increases the value of the other. A value less than
0 indicates a negative relationship, where an increase in the value of one variable decreases the
value of the other.

Example :

Consider the given data and compute correlation between age and glucose level.

Age   Glucose Level
43    99
21    65
25    79
42    75
57    87
59    81

Solution :

SNo     Age (X)   Glucose Level (Y)   XY       X²       Y²
1       43        99                  4257     1849     9801
2       21        65                  1365     441      4225
3       25        79                  1975     625      6241
4       42        75                  3150     1764     5625
5       57        87                  4959     3249     7569
6       59        81                  4779     3481     6561
Total   247       486                 20485    11409    40022

From the table:

∑x = 247, ∑y = 486, ∑xy = 20,485, ∑x² = 11,409, ∑y² = 40,022, n = 6

r = [n(∑xy) − (∑x)(∑y)] / √([n∑x² − (∑x)²][n∑y² − (∑y)²])
r = [6(20,485) − (247 × 486)] / √([6(11,409) − 247²][6(40,022) − 486²])
r = 2868 / 5413.27 = 0.5298

The range of the correlation coefficient is from −1 to 1. Our result is 0.5298, which means the
variables have a moderate positive correlation.
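The hand computation above can be checked with a short Python sketch (NumPy assumed
available):

```python
import numpy as np

# Recomputing the worked example above as a check
age = np.array([43, 21, 25, 42, 57, 59], dtype=float)
glucose = np.array([99, 65, 79, 75, 87, 81], dtype=float)

n = len(age)
num = n * np.sum(age * glucose) - np.sum(age) * np.sum(glucose)
den = np.sqrt((n * np.sum(age ** 2) - np.sum(age) ** 2) *
              (n * np.sum(glucose ** 2) - np.sum(glucose) ** 2))
print(round(num / den, 4))              # 0.5298
print(np.corrcoef(age, glucose)[0, 1])  # the same value computed directly by NumPy
```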

Example -2

Consider four data points A, B, C, and D. Draw clusters of these points, using the squared
Euclidean distance as a similarity measure.

2. Choose two centroids, AB and CD, calculated as

AB = average of A and B
CD = average of C and D

3. Calculate the squared Euclidean distance between all data points and the centroids AB and CD.
For example, the distance between A(2, 3) and AB(4, 2) is given by
s = (2 − 4)² + (3 − 2)² = 5.

A is nearer to CD than to AB.

4. The (squared) distance between A and CD is 4, which is less than the distance between A and
AB, which is 5. Since point A is closer to CD, we move A to the CD cluster.
5. Two clusters have been formed so far; let us recompute the centroids, i.e., B and ACD,
similarly to step 2:

ACD = average of A, C, and D
B = B

The new centroids are B and ACD.

6. Since K-means is an iterative procedure, we now calculate the distance of all points
(A, B, C, D) to the new centroids (B, ACD), similarly to step 3.

7. From the resulting distances, A is far from cluster B and near cluster ACD. All data points are
assigned to the clusters (B, ACD) based on their minimum distance, and the iterative procedure
ends here.

8. To conclude, we started with two centroids and ended up with two clusters, K = 2. A code
sketch of this procedure is given below.
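Because the figure giving the coordinates of B, C, and D is not reproduced, the sketch below
assumes coordinates for those points (chosen to be consistent with the distances quoted above;
only A(2, 3) is taken from the text) and walks through the same steps.

```python
import numpy as np

# Only A(2, 3) is given in the text; B, C, D are assumed coordinates for illustration,
# chosen so that dist²(A, AB) = 5 and dist²(A, CD) = 4 as in the worked example.
points = {"A": np.array([2.0, 3.0]), "B": np.array([6.0, 1.0]),
          "C": np.array([1.0, 2.0]), "D": np.array([3.0, 0.0])}

def sq_dist(p, q):
    # Squared Euclidean distance, as in step 3
    return float(np.sum((p - q) ** 2))

# Step 2: initial centroids AB and CD
centroids = {"AB": (points["A"] + points["B"]) / 2,
             "CD": (points["C"] + points["D"]) / 2}
print(sq_dist(points["A"], centroids["AB"]))  # 5.0
print(sq_dist(points["A"], centroids["CD"]))  # 4.0 -> A moves to the CD cluster

# Step 5: recompute centroids for clusters {B} and {A, C, D}
centroids = {"B": points["B"],
             "ACD": (points["A"] + points["C"] + points["D"]) / 3}

# Steps 6-7: reassign every point to its nearest centroid
for name, p in points.items():
    nearest = min(centroids, key=lambda c: sq_dist(p, centroids[c]))
    print(name, "->", nearest)   # B stays alone; A, C, D remain in cluster ACD
```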

Cosine similarity

Cosine similarity is a metric helpful in determining how similar two data objects are, irrespective
of their size. We can measure the similarity between two sentences in Python using cosine
similarity. In cosine similarity, data objects in a dataset are treated as vectors. The formula for the
cosine similarity between two vectors is

Cos(x, y) = x · y / (||x|| ||y||)

where
• x · y = dot product of the vectors 'x' and 'y'.
• ||x|| and ||y|| = lengths (magnitudes) of the two vectors 'x' and 'y'.
• ||x|| ||y|| = product of the lengths of the two vectors 'x' and 'y'.
Example:
Consider an example of finding the similarity between two vectors, 'x' and 'y', using cosine
similarity.
The 'x' vector has values x = {3, 2, 0, 5}
The 'y' vector has values y = {1, 0, 0, 0}
The formula for calculating the cosine similarity is Cos(x, y) = x · y / (||x|| ||y||):

x · y = 3×1 + 2×0 + 0×0 + 5×0 = 3
||x|| = √(3² + 2² + 0² + 5²) = √38 ≈ 6.16
||y|| = √(1² + 0² + 0² + 0²) = 1

∴ Cos(x, y) = 3 / (6.16 × 1) = 0.49
The dissimilarity between the two vectors 'x' and 'y' is given by
∴ Dis(x, y) = 1 − Cos(x, y) = 1 − 0.49 = 0.51
• The cosine similarity between two vectors corresponds to the angle θ between them.
• If θ = 0°, the 'x' and 'y' vectors overlap, showing that they are similar.
• If θ = 90°, the 'x' and 'y' vectors are dissimilar.
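The same computation can be reproduced with a few lines of Python (a minimal sketch using
NumPy):

```python
import numpy as np

# Recomputing the worked example above
x = np.array([3.0, 2.0, 0.0, 5.0])
y = np.array([1.0, 0.0, 0.0, 0.0])

cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos_sim, 2))        # 0.49
print(round(1 - cos_sim, 2))    # dissimilarity, 0.51
```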
