Module 5
Figure 4.1 and Table 4.1 show the positions of the classes in space. It is apparent from this figure that
no single straight line exists that separates the two classes.
In contrast, the other two Boolean functions, AND and OR, are linearly separable. The
corresponding truth tables for the AND and OR operations are given in Table 4.2 and the respective
class positions in the two-dimensional space are shown in Figures 4.2a and 4.2b. Figure 4.3 shows
a perceptron, introduced in the previous chapter, with synaptic weights computed so as to realize
an OR gate (verify).
Our major concern now is first to tackle the XOR problem and then to extend the procedure to
more general cases of nonlinearly separable classes.
THE TWO-LAYER PERCEPTRON
To separate the two classes A and B in Figure 4.1, a first thought that comes to mind is to draw
two, instead of one, straight lines. Figure 4.4 shows two such possible lines, g1(x) = 0 and g2(x) = 0, as
well as the regions in space for which g1(x) ≷ 0, g2(x) ≷ 0. The classes can now be separated.
Class A is to the right (+) of g1(x) and to the left (-) of g2(x). The region corresponding to class B
lies either to the left or to the right of both lines. What we have really done is to attack the problem
in two successive phases.
During the first phase, we calculate the position of a feature vector x with respect to each of the
two decision lines. In the second phase, we combine the results of the previous phase and we find
the position of x with respect to both lines, that is, outside or inside the shaded area. We will now
view this from a slightly different perspective, which will subsequently lead us easily to
generalizations.
Realization of the two decision lines (hyperplanes), g1(·) and g2(·), during the first phase of
computations is achieved with the adoption of two perceptrons with inputs x1, x2 and appropriate
synaptic weights. The corresponding outputs are yi = f(gi(x)), i = 1, 2, where the activation
function f(·) is the step function with levels 0 and 1. Table 4.3 summarizes the yi values for all
possible combinations of the inputs. These are nothing else than the relative positions of the input
vector x with respect to each of the two lines. From another point of view, the computations during
the first phase perform a mapping of the input vector x to a new one, y = [y1, y2]^T. The decision during the
second phase is now based on the transformed data; that is, our goal is now to separate [y1, y2] = [0, 0] and
[y1, y2] = [1, 1], which correspond to class B vectors, from [y1, y2] = [1, 0], which corresponds
to class A vectors.
As is apparent from Figure 4.5, this is easily achieved by drawing a third line g(y), which can be
realized via a third neuron. In other words, the mapping of the first phase transforms the
nonlinearly separable problem to a linearly separable one. We will return to this important issue
later on. Figure 4.6 gives a possible realization of these steps. Each of the three lines is realized
via a neuron with appropriate synaptic weights.
The resulting multilayer architecture can be considered as a generalization of the perceptron, and
it is known as a two-layer perceptron or a two-layer feedforward neural network. The two
neurons (nodes) of the first layer perform computations of the first phase and they constitute the
so-called hidden layer. The single neuron of the second layer performs the computations of the
final phase and constitutes the output layer.
In Figure 4.6 the input layer corresponds to the (nonprocessing) nodes where input data are
applied. Thus, the number of input layer nodes equals the dimension of the input space. Note that
at the input layer nodes no processing takes place. The lines that are realized by the two-layer
perceptron of the figure follow directly from the synaptic weights of its three neurons.
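As an illustration, the following minimal Python/NumPy sketch realizes the two phases described above with step activations. The particular weights used here are one common choice for separating the XOR classes: the hidden neurons realize the lines g1(x) = x1 + x2 - 1/2 and g2(x) = x1 + x2 - 3/2, and the output neuron realizes g(y) = y1 - y2 - 1/2. These values are an assumption and need not coincide with those of Figure 4.6.

    import numpy as np

    def step(v):
        # Step activation with levels 0 and 1
        return (v > 0).astype(int)

    # First (hidden) layer: realizes g1(x) = x1 + x2 - 1/2 and g2(x) = x1 + x2 - 3/2
    W_hidden = np.array([[1.0, 1.0],
                         [1.0, 1.0]])
    b_hidden = np.array([-0.5, -1.5])

    # Second (output) layer: realizes g(y) = y1 - y2 - 1/2
    w_out = np.array([1.0, -1.0])
    b_out = -0.5

    def two_layer_perceptron(x):
        y = step(W_hidden @ x + b_hidden)   # phase 1: position of x w.r.t. g1, g2
        return step(w_out @ y + b_out)      # phase 2: position of y w.r.t. g(y)

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, two_layer_perceptron(np.array(x, dtype=float)))  # prints 0, 1, 1, 0 (XOR)

With these weights the two class A inputs, (0, 1) and (1, 0), are both mapped to the vertex y = (1, 0), while the class B inputs are mapped to (0, 0) and (1, 1), so the third line separates them.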
The multilayer perceptron architecture of Figure 4.6 can be generalized to l-dimensional input
vectors and to more than two (one) neurons in the hidden (output) layer. We will now turn our
attention to the investigation of the class discriminatory capabilities of such networks for more
complicated nonlinear classification tasks.
THREE-LAYER PERCEPTRONS
The inability of the two-layer perceptrons to separate classes resulting from any union of
polyhedral regions springs from the fact that the output neuron can realize
only a single hyperplane. This is the same situation confronting the basic perceptron when
dealing with the XOR problem. The difficulty was overcome by constructing two lines instead of
one. A similar escape path will be adopted here.
Figure 4.10 shows a three-layer perceptron architecture with two layers of hidden neurons and one
output layer. Such an architecture can separate classes resulting from any union of polyhedral
regions. Indeed, let us assume that all regions of interest are formed by intersections of the p
l-dimensional half-spaces defined by the p hyperplanes. These are realized by the p neurons of the
first hidden layer, which also perform the mapping of the input space onto the vertices of the Hp
hypercube of unit side length.
In the sequel let us assume that class A consists of the union of K of the resulting polyhedra and
class B of the rest. We then use K neurons in the second hidden layer. Each of these neurons
realizes a hyperplane in the p-dimensional space. The synaptic weights for each of the second-
layer neurons are chosen so that the realized hyperplane leaves only one of the Hp vertices on one
side and all the rest on the other. For each neuron a different vertex is isolated, that is, one of the
K class A vertices. In other words, each time an input vector from class A enters the network, one
of the K neurons of the second layer results in a 1 and the remaining K - 1 give 0. In contrast, for
class B vectors all neurons in the second layer output a 0. Classification is now a straightforward
task. Choose the output layer neuron to realize an OR gate. Its output will be 1 for class A and 0
for class B vectors. The proof is now complete.
The number of neurons in the second hidden layer can be reduced by exploiting the geometry that
results from each specific problem, for example, whenever two of the K vertices are located in a
way that makes them separable from the rest using a single hyperplane. Finally, the multilayer
structure can be generalized to more than two classes. To this end, the output layer neurons are
increased in number, realizing one OR gate for each class. Thus, one of them results in 1 every
time a vector from the respective class enters the network, and all the others give 0.
In summary, we can say that the neurons of the first layer form the hyperplanes, those of the second
layer form the regions, and finally the neurons of the output layer form the classes.
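To make the roles of the three layers concrete, here is a small hypothetical sketch. It assumes p = 2 hyperplanes (x1 = 1 and x2 = 1) and a class A that is the union of the K = 2 polyhedral regions mapped to the vertices (1, 1) and (0, 0) of H2; all weight values below are illustrative assumptions, not taken from Figure 4.10.

    import numpy as np

    def step(v):
        return (v > 0).astype(int)

    # First hidden layer: p = 2 hyperplanes, x1 - 1 = 0 and x2 - 1 = 0
    W1 = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
    b1 = np.array([-1.0, -1.0])

    # Second hidden layer: K = 2 neurons, each isolating one vertex of H2
    W2 = np.array([[ 1.0,  1.0],    # fires only for vertex (1, 1)
                   [-1.0, -1.0]])   # fires only for vertex (0, 0)
    b2 = np.array([-1.5, 0.5])

    # Output layer: a single OR-gate neuron
    w3 = np.array([1.0, 1.0])
    b3 = -0.5

    def three_layer_perceptron(x):
        y = step(W1 @ x + b1)      # first layer: hyperplanes
        z = step(W2 @ y + b2)      # second layer: regions (isolated vertices)
        return step(w3 @ z + b3)   # output layer: classes (OR gate)

    for x in [(0.5, 0.5), (2.0, 2.0), (0.5, 2.0), (2.0, 0.5)]:
        print(x, three_layer_perceptron(np.array(x)))  # 1, 1, 0, 0

Points in the lower-left and upper-right regions are assigned to class A (output 1), and points in the remaining two regions to class B (output 0).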
Backpropagation Algorithm
Backpropagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration).
Proper tuning of the weights allows you to reduce error rates and make the model reliable by
increasing its generalization.
"Backpropagation" in a neural network is short for "backward propagation of errors." It is a
standard method of training artificial neural networks. This method helps calculate the gradient of
a loss function with respect to all the weights in the network.
The multilayer perceptron architectures we have considered so far have been developed around
the McCulloch–Pitts neuron, employing as the activation function the step function, which outputs 1
when its argument is positive and 0 otherwise.
The backpropagation algorithm computes the gradient of the loss function with respect to
a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct
computation. It computes the gradient, but it does not define how the gradient is used. It generalizes
the computation in the delta rule.
How Backpropagation Algorithm Works
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually selected at random.
3. Calculate the output for every neuron, from the input layer, through the hidden layers, to the
output layer.
4. Calculate the error in the outputs.
5. Travel back from the output layer towards the hidden layers and adjust the weights so that the
error decreases.
6. Keep repeating the process until the error becomes sufficiently small (a sketch of these steps is
given below).
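The following minimal NumPy sketch illustrates these steps for a small two-layer network with sigmoid activations trained on the XOR truth table. The data set, network size, learning rate, and number of epochs are illustrative assumptions, not prescribed by the algorithm.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy training set: the XOR truth table (illustrative choice)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([[0], [1], [1], [0]], dtype=float)

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    # Step 2: weights are selected at random
    W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
    W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
    lr = 1.0

    for epoch in range(20000):
        # Step 3: forward pass, layer by layer
        h = sigmoid(X @ W1 + b1)          # hidden-layer outputs
        y = sigmoid(h @ W2 + b2)          # network outputs

        # Step 4: error at the outputs (squared-error loss)
        err = y - t

        # Steps 5-6: propagate the error backwards (chain rule) and adjust the weights
        delta_out = err * y * (1 - y)                  # gradient at the output layer
        delta_hid = (delta_out @ W2.T) * h * (1 - h)   # gradient back-propagated to the hidden layer
        W2 -= lr * h.T @ delta_out
        b2 -= lr * delta_out.sum(axis=0)
        W1 -= lr * X.T @ delta_hid
        b1 -= lr * delta_hid.sum(axis=0)

    # After training, the outputs should be close to [0, 1, 1, 0]
    print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))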
We now turn to the unsupervised case, where class labelling of the training patterns is not available.
Thus, our major concern now is to “reveal” the organization of patterns into “sensible” clusters
(groups), which will allow us to discover similarities and differences among patterns and to derive
useful conclusions about them.
Clustering may be found under different names in different contexts, such as unsupervised learning
and learning without a teacher (in pattern recognition), numerical taxonomy (in biology and ecology),
typology (in social sciences), and partition (in graph theory).
Consider the following animals: sheep, dog, cat (mammals), sparrow, seagull (birds), viper, lizard
(reptiles), goldfish, red mullet, blue shark (fish), and frog (amphibians). In order to organize these
animals into clusters, we need to define a clustering criterion.
Thus, if we employ the way these animals bear their progeny as a clustering criterion, the sheep,
the dog, the cat, and the blue shark will be assigned to the same cluster, while all the rest will form
a second cluster (Figure 11.1a). If the clustering criterion is the existence of lungs, the goldfish,
the red mullet, and the blue shark are assigned to the same cluster, while all the other animals are
assigned to a second cluster (Figure 11.1b). On the other hand, if the clustering criterion is the
environment where the animals live, the sheep, the dog, the cat, the sparrow, the seagull, the viper,
and the lizard will form one cluster (animals living outside water); the goldfish, the red mullet, and
the blue shark will form a second cluster (animals living only in water); and the frog will form a
third cluster by itself, since it may live in the water or out of it (Figure 11.1c).
As was the case with supervised learning, we will assume that all patterns are represented in terms
of features, which form l-dimensional feature vectors.
The basic steps that an expert must follow in order to develop a clustering task are the
following:
■ Feature selection. Features must be properly selected so as to encode as much information as
possible concerning the task of interest. Once more, parsimony and, thus, minimum information
redundancy among the features is a major goal. As in supervised classification, preprocessing of
features may be necessary prior to their utilization in subsequent stages. The techniques
discussed there are applicable here.
■ Proximity measure. This measure quantifies how "similar" or "dissimilar" two feature vectors are.
It is natural to ensure that all selected features contribute equally to the computation of the
proximity measure and that there are no features that dominate others. This must be taken care of
during preprocessing.
■ Clustering criterion. This criterion depends on the interpretation the expert gives to the term
sensible, based on the type of clusters that are expected to underlie the data set. For example, a
compact cluster of feature vectors in the l-dimensional space may be sensible according to one
criterion, whereas an elongated cluster may be sensible according to another. The clustering
criterion may be expressed via a cost function or some other type of rule.
■ Clustering algorithms. Having adopted a proximity measure and a clustering criterion, this step
refers to the choice of a specific algorithmic scheme that unravels the clustering structure of the
data set.
■ Validation of the results. Once the results of the clustering algorithm have been obtained, we
have to verify their correctness. This is usually carried out using appropriate tests.
■ Interpretation of the results. In many cases, the expert in the application field must integrate
the results of clustering with other experimental evidence and analysis in order to draw the right
conclusions.
In a number of cases, a step known as clustering tendency should be involved. This includes
various tests that indicate whether or not the available data possess a clustering structure. For
example, the data set may be of a completely random nature, in which case trying to unravel
clusters would be meaningless.
Let us consider the example of Figure 11.2. How many "sensible" ways of
clustering can we obtain for these points? The most “logical” answer seems to be two. The first
clustering contains four clusters (surrounded by solid circles). The second clustering contains two
clusters (surrounded by dashed lines). Which clustering is “correct”? It seems that there is no
definite answer. Both clusterings are valid. The best thing to do is give the results to an expert and
let the expert decide about the most sensible one. Thus, the final answer to these questions will be
influenced by the expert’s knowledge.
A feature may take values from a continuous range (a subset of R) or from a finite discrete set. If the
finite discrete set has only two elements, then the feature is called binary or dichotomous.
A different categorization of the features is based on the relative significance of the values they
take. We have four categories of features: nominal, ordinal, interval-scaled, and ratio-scaled.
The first category, nominal, includes features whose possible values code states. Consider for
example a feature that corresponds to the sex of an individual. Its possible values may be 1 for a
male and 0 for a female. Clearly, any quantitative comparison between these values is meaningless.
The next category, ordinal, includes features whose values can be meaningfully ordered.
Consider, for example, a feature that characterizes the performance of a student in the pattern
recognition course. Suppose that its possible values are 4, 3, 2, 1 and that these correspond to
the ratings "excellent," "very good," "good," and "not good." Obviously, these values are arranged in a
meaningful order. However, the difference between two successive values is of no meaningful
quantitative importance.
If, for a specific feature, the difference between two values is meaningful while their ratio is
meaningless, then it is an interval-scaled feature. A typical example is the measure of
temperature in degrees Celsius. If the temperatures in London and Paris are 5 and 10 degrees
Celsius, respectively, then it is meaningful to say that the temperature in Paris is 5 degrees higher
than that in London. However, it is meaningless to say that Paris is twice as hot as London.
Finally, if the ratio between two values of a specific feature is meaningful, then this is a ratio-
scaled feature, the fourth category. An example of such a feature is weight, since it is meaningful
to say that a person who weighs 100 kg is twice as heavy as a person who weighs 50 kg.
The definition of clustering leads directly to the definition of a single "cluster." Many definitions
have been proposed over the years. However, most of these definitions are based on loosely
defined terms, such as similar and alike, or they are oriented to a specific kind of cluster. As has
often been pointed out, most of these definitions are of a vague and circular nature. This fact reveals
the difficulty of having a universally acceptable definition for the term cluster. In one intuitive
definition, the vectors are viewed as points in the l-dimensional space, and the clusters are described
as "continuous regions of this space containing a relatively high density of points, separated from
other high-density regions by regions of relatively low density of points." Clusters described in this
way are sometimes referred to as natural clusters. This definition is closer to our visual perception
of clusters in the two- and three-dimensional spaces.
Let us now try to give some definitions of "clustering," which, although they may not be universal,
give us an idea of what clustering is. Let X be our data set, that is,
X = {x1, x2, . . . , xN}.
An m-clustering of X is a partition of X into m sets (clusters), C1, . . . , Cm, such that each cluster is
nonempty, the union of all clusters is X, and no vector belongs to two different clusters.
In addition, the vectors contained in a cluster Ci are "more similar" to each other and "less similar"
to the feature vectors of the other clusters. Quantifying the terms similar and dissimilar depends
very much on the types of clusters involved. For example, one measure (measuring similarity) is
required for compact clusters (e.g., Figure 11.3a), another for elongated clusters (e.g., Figure
11.3b), and a different one for shell-shaped clusters (e.g., Figure 11.3c).
Distance measure
• These measures find the distance between points in a d-dimensional space, where each pattern
is represented as a point in the d-space.
• The distance is inversely proportional to the similarity. If d(X,Y) gives the distance between X
and Y , and s(X,Y) gives the similarity between X and Y , then d(X, Y ) ∝ 1 / s(X,Y ).
The Euclidean distance is the most popular distance measure. If we have two patterns X and Y,
then the Euclidean distance is
d(X, Y) = sqrt((x1 - y1)² + (x2 - y2)² + · · · + (xd - yd)²).
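A minimal sketch of this computation in Python (the two example patterns below are arbitrary):

    import numpy as np

    def euclidean_distance(x, y):
        # Euclidean distance between two d-dimensional patterns
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return np.sqrt(np.sum((x - y) ** 2))

    # Two arbitrary 3-dimensional patterns
    print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0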
Pearson’s Correlation Coefficient
Correlation means finding the association between two variables, and correlation coefficients are
used to find out how strong the relationship between the two variables is. The most popular
correlation coefficient is Pearson's correlation coefficient. It is very commonly used in
linear regression.
Consider the example of car price prediction, where we have to predict the price considering all the
variables that affect it, such as carlength, curbweight, carheight, carwidth,
fueltype, carbody, horsepower, etc.
Pearson's correlation coefficient is represented as 'r'; it measures how strong the linear
association between two continuous variables is, using the formula
r = Σ(xi - x̄)(yi - ȳ) / sqrt(Σ(xi - x̄)² · Σ(yi - ȳ)²).
Values of Pearson’s Correlation are:
Value of ‘r’ ranges from ‘-1’ to ‘+1’. Value ‘0’ specifies that there is no relation between the two
variables. A value greater than ‘0’ indicates a positive relationship between two variables where
an increase in the value of one variable increases the value of another variable. Value less than ‘0’
indicates a negative relationship between two variables where an increase in the value of one
decreases the value of another variable.
Example 1:
Consider the given data and compute the correlation between age and glucose level.

Age    Glucose level
43     99
21     65
25     79
42     75
57     87
59     81
Solution:
Substituting the tabulated values into the formula gives r ≈ 0.53, indicating a moderate positive
relationship between age and glucose level.
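The value can be verified with a small Python sketch; np.corrcoef returns the same result.

    import numpy as np

    # Data from the example: (age, glucose level) pairs
    age     = np.array([43, 21, 25, 42, 57, 59], dtype=float)
    glucose = np.array([99, 65, 79, 75, 87, 81], dtype=float)

    def pearson_r(x, y):
        # Pearson's correlation coefficient r between two samples
        dx, dy = x - x.mean(), y - y.mean()
        return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

    print(round(pearson_r(age, glucose), 4))           # approximately 0.53
    print(round(np.corrcoef(age, glucose)[0, 1], 4))   # same value from NumPy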
Example 2:
Consider four data points A, B, C, and D as given below. Form clusters of the points, using the
squared Euclidean distance as a similarity measure.
AB = average of A and B
CD = average of C and D
3. Calculate the squared Euclidean distance between every data point and the
centroids AB and CD. For example, the distance between A(2, 3) and AB(4, 2)
is given by s = (2 - 4)² + (3 - 2)² = 5.
ACD = average of A, C, and D
B = B
Clusters: B, ACD
7. From the resulting distance values we can see that A is far from cluster B and
close to cluster ACD. All data points are therefore assigned to the clusters
(B, ACD) based on their minimum distance, and the iterative procedure ends here.
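The assignment step of this example can be sketched in Python as follows. Only the point A(2, 3) and the centroid AB(4, 2) are taken from the text; the coordinates of B, C, and D below are hypothetical placeholders chosen purely for illustration (B is chosen so that the average of A and B equals the stated centroid AB).

    import numpy as np

    def squared_euclidean(p, q):
        # Squared Euclidean distance, s = sum_i (p_i - q_i)^2
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum((p - q) ** 2))

    # A is from the text; B, C, D are placeholder coordinates
    points = {"A": (2, 3), "B": (6, 1), "C": (3, 5), "D": (4, 4)}
    centroids = {
        "AB": tuple(np.mean([points["A"], points["B"]], axis=0)),  # (4.0, 2.0), as in the text
        "CD": tuple(np.mean([points["C"], points["D"]], axis=0)),
    }

    # Assign every point to the centroid with the minimum squared distance
    for name, p in points.items():
        dists = {c: squared_euclidean(p, q) for c, q in centroids.items()}
        nearest = min(dists, key=dists.get)
        print(name, dists, "->", nearest)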
Cosine similarity
Cosine similarity is a metric that is helpful in determining how similar two data objects are,
irrespective of their size. For example, we can measure the similarity between two sentences in
Python using cosine similarity. In cosine similarity, the data objects in a dataset are treated as
vectors, and the similarity between two vectors is the cosine of the angle between them, that is,
their dot product divided by the product of their magnitudes.
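A minimal sketch of the computation (the two term-count vectors below are arbitrary examples):

    import numpy as np

    def cosine_similarity(x, y):
        # Cosine of the angle between x and y: dot product over product of magnitudes
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    # Arbitrary term-count vectors representing two short sentences
    a = [1, 1, 0, 1, 2]
    b = [2, 2, 1, 0, 1]
    print(round(cosine_similarity(a, b), 2))  # about 0.72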