AIML 4th and 5th Module Notes

ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING(21CS54)
MODULE 4
CHAPTER 6
DECISION TREE LEARNING
6.1 Introduction
• Why is it called a decision tree?

• Because it starts from a root node and branches out to a number of possible solutions, like a tree.
• The benefits of having a decision tree are as follows :
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
• Example : Toll free number
6.1.1 Structure of a Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label. The topmost node in the tree is the root node.
Decision trees apply to both classification and regression models.
The decision tree consists of 2 major procedures:
1) Building a tree and
2) Knowledge inference or classification.
Building the Tree
Knowledge Inference or Classification
Advantages of Decision Trees
Disadvantages of Decision Trees
6.1.2 Fundamentals of Entropy
• How to draw a decision tree ?

Entropy
Information gain
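The two quantities named above can be sketched in a few lines. This is a minimal illustration, assuming the standard Shannon definitions Entropy(S) = -∑ p_i log2(p_i) and Gain(S, A) = Entropy(S) - ∑ (|S_v| / |S|) Entropy(S_v); the function names and the toy data are hypothetical and not taken from these notes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr_index):
    """Entropy of the whole set minus the size-weighted entropy of the
    subsets obtained by splitting on the attribute at attr_index."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (hypothetical): one attribute that separates the classes perfectly
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(entropy(labels))
print(information_gain(rows, labels, 0))
```

An attribute with higher information gain produces purer subsets, which is why ID3 prefers it when choosing the next split.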
Algorithm 6.1: General Algorithm for Decision Trees
6.2 DECISION TREE INDUCTION ALGORITHMS
6.2.1 ID3 Tree Construction (ID3 stands for Iterative Dichotomiser 3)

A decision tree is one of the most powerful tools of supervised learning algorithms
used for both classification and regression tasks.
It builds a flowchart-like tree structure where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node (terminal
node) holds a class label. It is constructed by recursively splitting the training data
into subsets based on the values of the attributes until a stopping criterion is met, such
as the maximum depth of the tree or the minimum number of samples required to split
a node.
6.2.2 C4.5 Construction

C4.5 is a widely used algorithm for constructing decision trees from a dataset.
The disadvantages of ID3 are that attributes must be nominal, the dataset must not include
missing data, and the algorithm tends to overfit.
To overcome these disadvantages, Ross Quinlan, the inventor of ID3, created an improved
algorithm named C4.5. C4.5 can build more generalized models: it handles continuous as
well as discrete attributes, copes with missing data, and supports post-pruning.
Dealing with Continuous Attributes in C4.5
6.2.3 Classification and Regression Trees Construction

Classification and Regression Trees (CART) is a widely used algorithm for
constructing decision trees that can be applied to both classification and regression
tasks. CART is similar to C4.5 but has some differences in its construction and splitting
criteria.
For classification, CART constructs the decision tree using the Gini impurity index as its
splitting criterion. It serves as an example of how the values of other variables can be used
to predict the values of a target variable. It is a fundamental machine-learning
method with a wide range of use cases.
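As a rough illustration of the Gini criterion mentioned above, the sketch below assumes the usual definition Gini(S) = 1 - ∑ p_i^2 and scores a binary split by the size-weighted impurity of its two children. The function names and labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity = 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_of_split(left_labels, right_labels):
    """Size-weighted Gini impurity of a binary split (lower is better)."""
    n = len(left_labels) + len(right_labels)
    left_part = (len(left_labels) / n) * gini_impurity(left_labels)
    right_part = (len(right_labels) / n) * gini_impurity(right_labels)
    return left_part + right_part

print(gini_impurity(["yes", "yes", "no", "no"]))      # evenly mixed node
print(gini_of_split(["yes", "yes"], ["no", "no"]))    # perfectly pure split
```

A tree builder in the CART style would evaluate candidate splits this way and keep the one with the lowest weighted impurity.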
6.2.4 Regression Trees
6.3 VALIDATING AND PRUNING OF DECISION TREES
Validating and pruning decision trees is a crucial part of building accurate and robust
machine learning models. Decision trees are prone to overfitting, which means they can
learn to capture noise and details in the training data that do not generalize well to new,
unseen data.
Validation and pruning are techniques used to mitigate this issue and improve the
performance of decision tree models.
Pre-pruning tunes the hyperparameters of the decision tree before the training pipeline
runs. It relies on the heuristic known as ‘early stopping’, which stops the growth of the
decision tree before it reaches its full depth, so that the tree-building process does not
produce leaves with very few samples. At each stage of splitting, the cross-validation
error is monitored; when the error no longer decreases, the growth of the tree is stopped.
The hyperparameters that can be tuned for early stopping and preventing overfitting
are max_depth, min_samples_leaf, and min_samples_split.
The same parameters can also be tuned to obtain a robust model.
Post-pruning does the opposite of pre-pruning: it allows the decision tree to
grow to its full depth, and then removes branches to prevent the model from overfitting.
Left unpruned, the algorithm continues to partition the data into smaller subsets until
the final subsets are similar in terms of the outcome variable. The final subsets consist
of only a few data points, so the tree learns the training data to the T. However, when a
new data point that differs from the learned data is introduced, it may not be predicted well.
The hyperparameter that can be tuned for post-pruning and preventing overfitting
is ccp_alpha.
ccp stands for Cost Complexity Pruning, which is another option for controlling
the size of a tree. A higher value of ccp_alpha increases the number of nodes pruned.
CHAPTER 10
ARTIFICIAL NEURAL NETWORKS
The term "artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modelled after the brain.
An artificial neural network is a computational network modelled on the biological neural
networks that make up the human brain.
Just as the human brain has neurons interconnected with each other, an artificial neural network
has neurons that are linked to each other in the various layers of the network. These neurons are
known as nodes.
The biological neuron consists of four main parts:

• dendrites: nerve fibres carrying electrical signals to the cell.
• cell body: computes a non-linear function of its inputs.
• axon: a single long fibre that carries the electrical signal from the cell body to other neurons.
• synapse: the point of contact between the axon of one cell and the dendrite of another,
regulating a chemical connection whose strength affects the input to the cell.
Dendrites are tree-like networks of nerve fibre connected to the cell body.
An axon is a single, long connection extending from the cell body and carrying signals from the
neuron. The end of the axon splits into fine strands, and each strand terminates in a small
bulb-like organ called a synapse. It is through the synapses that the neuron passes its signals to
other nearby neurons. The receiving ends of these synapses on the nearby neurons can be found
both on the dendrites and on the cell body. There are approximately 10^4 synapses per neuron in the
human body. The electric impulse is passed between synapse and dendrite by a chemical process
that increases or decreases the electric potential inside the body of the receiving cell. If
the electric potential reaches a threshold value, the receiving cell fires, and a pulse (action
potential) of fixed strength and duration is sent through the axon to the synaptic junctions of
the cell. After that, the cell has to wait for a period called the refractory period.
Difference between biological and Artificial Neuron

ARTIFICIAL NEURONS:
Artificial neurons are like biological neurons: they are linked to each other in the various layers
of the network, and are known as nodes.
A node or neuron can receive one or more inputs and process them. Artificial neurons are
connected to other neurons by connection links, and each connection link is associated with a
synaptic weight. The structure of a single neuron is shown below:
Fig: McCulloch-Pitts Neuron Mathematical model.
Simple Model of an ANN

The first mathematical model of a biological neuron was designed by McCulloch-Pitts in 1943.
It includes 2 steps:
1. It receives weighted inputs from other neurons.
2. It operates with a threshold function or activation function.
Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), passes
the output through a cable like structure to other connected neurons (axon to synapse to
other neuron’s dendrite).
OR
Working:
The received inputs are combined into a weighted sum, which is given to the activation function;
if the sum exceeds the threshold value, the neuron fires. The neuron is the basic
processing unit that receives a set of inputs x1, x2, x3, …, xn and their associated weights
w1, w2, w3, …, wn. The summation function computes the weighted sum of the inputs
received by the neuron:
Sum = ∑ xi wi, for i = 1 to n
Activation functions:
• Just as some force or activation is applied to make work more efficient and to obtain an exact
output, an activation function is applied over the net input to calculate the output of an ANN.
The information processing of a processing element has two major parts: input and output. An
integration function (f) is associated with the input of the processing element.
• Several activation functions are in use:
1. Identity function or linear function: It is a linear function defined as
f(x) = x for all x
The output is the same as the input, i.e. the weighted sum. The function is useful when we do
not apply any threshold. The output value ranges between −∞ and +∞.
2. Binary step function: This function can be defined as
f(x) = 1 if x ≥ θ
f(x) = 0 if x < θ
where θ represents the threshold value. It is used in single-layer nets to convert
the net input to an output that is binary (0 or 1).
3. Bipolar step function: This function can be defined as
f(x) = 1 if x ≥ θ
f(x) = −1 if x < θ
where θ represents the threshold value. It is used in single-layer nets to convert
the net input to an output that is bipolar (+1 or −1).
4. Sigmoid function: It is used in back-propagation nets.
Two types:
a) Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar
sigmoid function. It is defined as
f(x) = 1 / (1 + e^(−λx))
where λ represents the steepness parameter. The range of this sigmoid function is 0 to 1.
b) Bipolar sigmoid function: This function is defined as
f(x) = 2 / (1 + e^(−λx)) − 1 = (1 − e^(−λx)) / (1 + e^(−λx))
where λ represents the steepness parameter, and the range is between −1 and +1.
5. Ramp function: The ramp function is commonly defined as
f(x) = 1 if x > 1
f(x) = x if 0 ≤ x ≤ 1
f(x) = 0 if x < 0
It is a linear function whose upper and lower limits are fixed.

6. Tanh-Hyperbolic tangent function : Tanh function is very similar to the sigmoid/logistic
activation function, and even has the same S-shape with the difference in output range of -1 to
1. In Tanh, the larger the input (more positive), the closer the output value will be to 1.0,
whereas the smaller the input (more negative), the closer the output will be to -1.0.
7. ReLU function: ReLU stands for Rectified Linear Unit, defined as f(x) = max(0, x).
Although it gives the impression of a linear function, ReLU has a derivative and
allows for backpropagation while remaining computationally efficient.
The main catch is that the ReLU function does not activate all the neurons at the same
time: a neuron is deactivated only if the output of the linear transformation is less than 0.
8. Softmax function: Softmax is an activation function that scales numbers (logits) into
probabilities. The output of softmax is a vector (say v) with the probabilities of each
possible outcome. The probabilities in vector v sum to one over all possible outcomes or
classes.
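The activation functions listed above can be sketched in plain Python. This is an illustrative sketch only: the function names, the default threshold theta = 0 and the default steepness lam = 1 are assumptions, and the sigmoid forms used are the standard logistic and bipolar definitions.

```python
import math

def identity(x):
    return x

def binary_step(x, theta=0.0):
    # 1 when the net input reaches the threshold, else 0
    return 1 if x >= theta else 0

def bipolar_step(x, theta=0.0):
    # +1 when the net input reaches the threshold, else -1
    return 1 if x >= theta else -1

def binary_sigmoid(x, lam=1.0):
    # Logistic sigmoid with steepness lam; output lies in (0, 1)
    return 1.0 / (1.0 + math.exp(-lam * x))

def bipolar_sigmoid(x, lam=1.0):
    # Bipolar sigmoid with steepness lam; output lies in (-1, +1)
    return 2.0 / (1.0 + math.exp(-lam * x)) - 1.0

def relu(x):
    return max(0.0, x)

def softmax(logits):
    # Subtracting the maximum keeps exp() numerically stable
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(binary_sigmoid(0.0), bipolar_sigmoid(0.0), math.tanh(0.5))
print(softmax([1.0, 2.0, 3.0]))
```

The hyperbolic tangent is available directly as `math.tanh`; with lam = 1 the bipolar sigmoid of x equals tanh(x / 2), which is why the two curves share the same S-shape.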
Artificial Neural Network Structure

• Artificial neural networks are computational models inspired by the human brain: massively
parallel, distributed systems made up of simple processing units (neurons). Synaptic
connection strengths among the neurons are used to store the acquired knowledge.
• Knowledge is acquired by the network from its environment through a learning process.
• The neural network is constructed from 3 types of layers:

• Input layer: the initial data for the neural network.
• Hidden layers: intermediate layers between the input and output layers, where all the
computation is done.
• Output layer: produces the result for the given inputs.
PERCEPTRON AND LEARNING THEORY

• The perceptron is also a simplified model of a biological neuron.
• The perceptron is an algorithm for supervised learning of binary classifiers. It is a type of
linear classifier, i.e. a classification algorithm that makes all of its predictions based on a
linear predictor function combining a set of weights with the feature vector.
• One type of ANN system is based on a unit called a perceptron.
OR
• The perceptron can represent the primitive Boolean functions AND, OR, NAND and NOR.
• Some Boolean functions cannot be represented by a single perceptron,
– e.g. the XOR function.
Major components of a perceptron

• Input
• Weight
• Bias
• Weighted summation
• Step/activation function
• output
WORKING:
• Feed the features of the model to be trained as input to the first layer. Each input is
multiplied by its weight, and the products are added up. The bias value is then added to
shift the output function. This value is presented to the activation function (the type of
activation function depends on the need). The value obtained after this last step is the
output value.
The activation function is a binary step function, which outputs 1 if the weighted sum f(x)
reaches the threshold value Θ and 0 if f(x) is below the threshold value Θ. The output of the
neuron is then:
y = 1 if f(x) ≥ Θ, and y = 0 if f(x) < Θ, where f(x) = ∑ wi xi + bias
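The working described above (weighted sum, plus bias, passed through a binary step) can be sketched as follows. The weights and biases chosen for each gate are hand-picked, hypothetical values that happen to realise the gate; they are not taken from these notes, and many other choices work equally well.

```python
def perceptron(inputs, weights, bias, theta=0.0):
    """Weighted sum plus bias, passed through a binary step at threshold theta."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if net >= theta else 0

# Hypothetical (weights, bias) pairs, one per gate
GATES = {
    "AND":  ([1.0, 1.0], -1.5),
    "OR":   ([1.0, 1.0], -0.5),
    "NAND": ([-1.0, -1.0], 1.5),
    "NOR":  ([-1.0, -1.0], 0.5),
}

def truth_table(gate):
    """Outputs for the inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    weights, bias = GATES[gate]
    return [perceptron([a, b], weights, bias) for a in (0, 1) for b in (0, 1)]

for name in GATES:
    print(name, truth_table(name))
```

No single choice of two weights and a bias reproduces XOR, because its two classes cannot be separated by one straight line; that is the limitation the multi-layer perceptron addresses.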
PROBLEM:
Design a 2-layer network of perceptrons to implement the NAND gate. Assume your own weights and
biases in the range [-0.5, 0.5]. Use a learning rate of 0.4.
Solution:
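Figure 1: Two-layer network for the NAND gate. Inputs X1 and X2 feed the hidden unit X3 (AND)
through weights w13 and w23; X3 feeds the output unit X4 (NOT) through weight w34; the bias
input X0 supplies θ3 to X3 and θ4 to X4.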
Table 1: Weights and Biases

X1   X2   Odesired   w13   w23    w34   θ3    θ4     X0
0    1    1          0.1   -0.4   0.3   0.2   -0.3   1

Table 2: Truth Table of NAND Gate
X1   X2   X1 AND X2   NAND = NOT(X1 AND X2)
0    0    0           1
0    1    0           1
1    0    0           1
1    1    1           0
ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in the input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input Layer   Ij   Oj
X1            0    0
X2            1    1
2. Calculate net inputs and outputs in the hidden and output layers as shown in Table 4.
Table 4: Net Input and Output Calculation in Hidden and Output layer
Unit j = X3 (hidden layer)
  Net input Ij:   I3 = X1*w13 + X2*w23 + X0*θ3
                     = 0(0.1) + 1(−0.4) + 1(0.2)
                     = −0.2
  Net output Oj:  O3 = 1 / (1 + e^(−I3))
                     = 1 / (1 + e^(−(−0.2)))
                     = 0.450
Unit k = X4 (output layer)
  Net input Ik:   I4 = O3*w34 + X0*θ4
                     = (0.450 * 0.3) + 1(−0.3)
                     = −0.165
  Net output Ok:  O4 = 1 / (1 + e^(−I4))
                     = 1 / (1 + e^(−(−0.165)))
                     = 0.458
3. Calculate Error
𝑬𝒓𝒓𝒐𝒓 = 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑶𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆𝒅
= 1 − 0.458
𝐸𝑟𝑟𝑜𝑟 = 0.542
Step 2: BACKWARD PROPAGATION

1. For each unit k in the output layer
Errork = Ok * (1 − Ok) * (Odesired − Ok)
For each unit j in the hidden layer
Errorj = Oj * (1 − Oj) * ∑k (Errork * wjk)
Table 5: Error Calculation

Output layer unit X4:
  Error4 = O4 * (1 − O4) * (Odesired − O4)
         = 0.458 * (1 − 0.458) * (1 − 0.458)
         = 0.134
Hidden layer unit X3:
  Error3 = O3 * (1 − O3) * (Error4 * w34)
         = 0.450 * (1 − 0.450) * 0.134 * 0.3
         = 0.0099
2. Update Weights and biases

Table 6: Weight and Bias Calculation
Weight update rule: wij = wij + (α * Errorj * Oi), with learning rate α = 0.4
  w13 = w13 + (0.4 * Error3 * O1) = 0.1 + (0.4 * 0.0099 * 0) = 0.1
  w23 = w23 + (0.4 * Error3 * O2) = −0.4 + (0.4 * 0.0099 * 1) = −0.396
  w34 = w34 + (0.4 * Error4 * O3) = 0.3 + (0.4 * 0.134 * 0.450) = 0.324
Bias update rule: θj = θj + (α * Errorj)
  θ3 = θ3 + (0.4 * Error3) = 0.2 + (0.4 * 0.0099) = 0.203
  θ4 = θ4 + (0.4 * Error4) = −0.3 + (0.4 * 0.134) = −0.246
ITERATION 2:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in the hidden and output layers

Table 7: Inputs and Outputs in Hidden and Output layer
Unit j = X3 (hidden layer)
  Net input Ij:   I3 = X1*w13 + X2*w23 + X0*θ3
                     = 0(0.1) + 1(−0.396) + 1(0.203)
                     = −0.193
  Net output Oj:  O3 = 1 / (1 + e^(−I3))
                     = 1 / (1 + e^(−(−0.193)))
                     = 0.451
Unit k = X4 (output layer)
  Net input Ik:   I4 = O3*w34 + X0*θ4
                     = (0.451 * 0.324) + 1(−0.246)
                     = −0.099
  Net output Ok:  O4 = 1 / (1 + e^(−I4))
                     = 1 / (1 + e^(−(−0.099)))
                     = 0.475
2. Calculate Error
𝑬𝒓𝒓𝒐𝒓 = 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑶𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆𝒅
= 1 − 0.475
𝐸𝑟𝑟𝑜𝑟 = 0.525
ITERATION   ERROR
1           0.542
2           0.525
Reduction in error = 0.542 − 0.525 = 0.017
In iteration 2 the error is reduced to 0.525. This process continues until the desired output
is achieved.
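The forward pass, error calculation and update rules used in this problem can be replayed in code. The sketch below is an illustration under stated assumptions: it uses the problem's initial weights, biases, input (0, 1), target 1 and learning rate 0.4, but it keeps full floating-point precision, so its figures can differ in the third decimal place from the hand-rounded values. The function and variable names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x1, x2, target, w13, w23, w34, theta3, theta4, lr):
    """One forward + backward pass for the X1, X2 -> X3 -> X4 network."""
    # Forward propagation
    o3 = sigmoid(x1 * w13 + x2 * w23 + theta3)
    o4 = sigmoid(o3 * w34 + theta4)
    error = target - o4
    # Backward propagation of the error terms
    err4 = o4 * (1 - o4) * (target - o4)
    err3 = o3 * (1 - o3) * err4 * w34
    # Weight and bias updates: w = w + lr * Error_j * O_i
    w13 += lr * err3 * x1
    w23 += lr * err3 * x2
    w34 += lr * err4 * o3
    theta3 += lr * err3
    theta4 += lr * err4
    return error, (w13, w23, w34, theta3, theta4)

params = (0.1, -0.4, 0.3, 0.2, -0.3)   # w13, w23, w34, theta3, theta4
errors = []
for _ in range(2):
    error, params = train_step(0, 1, 1, *params, lr=0.4)
    errors.append(error)
print([round(e, 3) for e in errors])
```

Running more iterations of the same loop keeps shrinking the error for this training sample, which is the sense in which the process continues until the desired output is reached.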
PROBLEM:
How does a multi-layer perceptron solve the XOR problem? Design an MLP with back
propagation to implement the XOR Boolean function.
Solution:
X1   X2   Y
0    0    0
0    1    1
1    0    1
1    1    0
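Figure 2: Multi-layer perceptron for XOR. Inputs X1 and X2 feed the hidden units X3 and X4,
which feed the output unit X5; the bias input X0 supplies θ3, θ4 and θ5. The weights and
biases shown in the figure are listed in Table 8.
Learning rate: α = 0.8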

Table 8: Weights and Biases
X1   X2   W13    W14   W23   W24    W35   W45    θ3    θ4    θ5
1    0    -0.2   0.4   0.2   -0.3   0.2   -0.3   0.4   0.1   -0.3
Step 1: Forward Propagation

1. Calculate Input and Output in the Input Layer shown in Table 9.
Table 9: Input and Output in the Input Layer
Input Layer   Ij   Oj
X1            1    1
X2            0    0
2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 10.
Table 10: Unit j at Hidden Layer and Output Layer – Net Input and Output Calculation
Unit X3:
  Net input Ij:  I3 = X1*W13 + X2*W23 + X0*θ3 = 1*(-0.2) + 0*0.2 + 1*0.4 = 0.2
  Output Oj:     O3 = 1 / (1 + e^(-I3)) = 1 / (1 + e^(-0.2)) = 0.549
Unit X4:
  Net input Ij:  I4 = X1*W14 + X2*W24 + X0*θ4 = 1*0.4 + 0*(-0.3) + 1*0.1 = 0.5
  Output Oj:     O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(-0.5)) = 0.622
Unit X5:
  Net input Ij:  I5 = O3*W35 + O4*W45 + X0*θ5 = 0.549*0.2 + 0.622*(-0.3) + 1*(-0.3) = -0.376
  Output Oj:     O5 = 1 / (1 + e^(-I5)) = 1 / (1 + e^(0.376)) = 0.407
3. Calculate Error = Odesired - Oestimated
So the error for this network is
Error = Odesired - O5 = 1 - 0.407 = 0.593
Step 2: Backward Propagation

1. Calculate Error at each node as shown in Table 11.
For each unit k in the output layer, calculate
Error k = Ok (1-Ok) (Odesired - Ok)
For each unit j in the hidden layer, calculate
Error j = Oj (1-Oj) ∑k Error k Wjk
Table 11: Error Calculation for each unit in the Output layer and Hidden layer
Output layer unit X5:
  Error 5 = O5 (1-O5) (Odesired - O5)
          = 0.407 * (1 - 0.407) * (1 - 0.407)
          = 0.143
Hidden layer unit X4:
  Error 4 = O4 (1-O4) ∑k Error k Wjk = O4 (1-O4) Error 5 W45
          = 0.622 * (1 - 0.622) * 0.143 * (-0.3)
          = -0.010
Hidden layer unit X3:
  Error 3 = O3 (1-O3) ∑k Error k Wjk = O3 (1-O3) Error 5 W35
          = 0.549 * (1 - 0.549) * 0.143 * 0.2
          = 0.007
2. Update weight using the below formula,

Learning rate α = 0.8
∆Wij = α * Error j * Oi
Wij = Wij + ∆Wij
The updated weights and biases are shown in Table 12 and Table 13.
Table 12: Weight Updation
Wij   Wij = Wij + α * Error j * Oi                                    New Weight
W13   W13 = W13 + 0.8 * Error 3 * O1 = -0.2 + 0.8 * 0.007 * 1         -0.194
W14   W14 = W14 + 0.8 * Error 4 * O1 = 0.4 + 0.8 * (-0.010) * 1       0.392
W23   W23 = W23 + 0.8 * Error 3 * O2 = 0.2 + 0.8 * 0.007 * 0          0.2
W24   W24 = W24 + 0.8 * Error 4 * O2 = -0.3 + 0.8 * (-0.010) * 0      -0.3
W35   W35 = W35 + 0.8 * Error 5 * O3 = 0.2 + 0.8 * 0.143 * 0.549      0.263
W45   W45 = W45 + 0.8 * Error 5 * O4 = -0.3 + 0.8 * 0.143 * 0.622     -0.229
Update bias using the below formula,

∆θj = α * Error j
θj = θj + ∆θj
Table 13: Bias Updation
θj    θj = θj + α * Error j                                  New Bias
θ3    θ3 = θ3 + α * Error 3 = 0.4 + 0.8 * 0.007              0.405
θ4    θ4 = θ4 + α * Error 4 = 0.1 + 0.8 * (-0.010)           0.092
θ5    θ5 = θ5 + α * Error 5 = -0.3 + 0.8 * 0.143             -0.185
Iteration 2
Now with the updated weights and biases,
1. Calculate Input and Output in the Input Layer shown in Table 14.
Table 14: Input and Output in the Input Layer
Input Layer   Ij   Oj
X1            1    1
X2            0    0
2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 15.
Table 15: Net Input and Output Calculation in the Hidden Layer and Output Layer
Unit X3:
  Net input Ij:  I3 = X1*W13 + X2*W23 + X0*θ3 = 1*(-0.194) + 0*0.2 + 1*0.405 = 0.211
  Output Oj:     O3 = 1 / (1 + e^(-I3)) = 1 / (1 + e^(-0.211)) = 0.552
Unit X4:
  Net input Ij:  I4 = X1*W14 + X2*W24 + X0*θ4 = 1*0.392 + 0*(-0.3) + 1*0.092 = 0.484
  Output Oj:     O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(-0.484)) = 0.618
Unit X5:
  Net input Ij:  I5 = O3*W35 + O4*W45 + X0*θ5 = 0.552*0.263 + 0.618*(-0.229) + 1*(-0.185) = -0.181
  Output Oj:     O5 = 1 / (1 + e^(-I5)) = 1 / (1 + e^(0.181)) = 0.455
The output we receive in the network at node 5 is 0.455.

Error = 1 - 0.455 = 0.545
Comparing the error in the previous iteration with the error in the current iteration shows that
the network has learnt: the error is reduced by 0.593 - 0.545 = 0.048.
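The same back-propagation steps can be checked in code for the 2-2-1 XOR network. This is a sketch under stated assumptions: it starts from the Table 8 weights and biases, trains on the single sample (1, 0) with target 1 and α = 0.8, and applies the stated rules Wij = Wij + α * Error j * Oi and θj = θj + α * Error j at full floating-point precision, so its values may differ slightly from hand-rounded figures. The dictionary layout and function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, theta):
    """Outputs of hidden units X3, X4 and output unit X5."""
    o3 = sigmoid(x[0] * w["13"] + x[1] * w["23"] + theta[3])
    o4 = sigmoid(x[0] * w["14"] + x[1] * w["24"] + theta[4])
    o5 = sigmoid(o3 * w["35"] + o4 * w["45"] + theta[5])
    return o3, o4, o5

def backprop_step(x, target, w, theta, lr):
    o3, o4, o5 = forward(x, w, theta)
    # Error terms for the output unit and the two hidden units
    err5 = o5 * (1 - o5) * (target - o5)
    err4 = o4 * (1 - o4) * err5 * w["45"]
    err3 = o3 * (1 - o3) * err5 * w["35"]
    out = {1: x[0], 2: x[1], 3: o3, 4: o4}   # O_i of each source unit
    err = {3: err3, 4: err4, 5: err5}        # Error_j of each target unit
    # Key "ij" means the weight from unit i to unit j
    new_w = {k: v + lr * err[int(k[1])] * out[int(k[0])] for k, v in w.items()}
    new_theta = {j: t + lr * err[j] for j, t in theta.items()}
    return target - o5, new_w, new_theta

w = {"13": -0.2, "14": 0.4, "23": 0.2, "24": -0.3, "35": 0.2, "45": -0.3}
theta = {3: 0.4, 4: 0.1, 5: -0.3}
errors = []
for _ in range(2):
    error, w, theta = backprop_step((1, 0), 1, w, theta, lr=0.8)
    errors.append(error)
print([round(e, 3) for e in errors])
```

A full training run would cycle through all four XOR input patterns for many epochs rather than repeating one sample; the two-step loop here only mirrors the hand calculation.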
PROBLEM:
Consider a network architecture with 4 input units and 2 output units, and four training
samples, each a vector of length 4.
Training samples
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]
Identify an algorithm that learns without supervision. How would you cluster the samples as
expected?
Solution:
Use Self Organizing Feature Map (SOFM)
Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]
Compute the squared Euclidean distance between X1: (1, 1, 1, 0) and the weights of each unit.
Unit 1: d^2 = (0.2 - 1)^2 + (0.8 - 1)^2 + (0.5 - 1)^2 + (0.1 - 0)^2
            = 0.94
Unit 2: d^2 = (0.3 - 1)^2 + (0.5 - 1)^2 + (0.4 - 1)^2 + (0.6 - 0)^2
            = 1.46
Unit 1 wins
Update the weights of the winning unit
New Unit 1 weights = [0.2 0.8 0.5 0.1] + 0.6 ([1 1 1 0] - [0.2 0.8 0.5 0.1])
= [0.2 0.8 0.5 0.1] + 0.6 [0.8 0.2 0.5 -0.1]
= [0.2 0.8 0.5 0.1] + [0.48 0.12 0.30 -0.06]
= [0.68 0.92 0.80 0.04]
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.3 0.5 0.4 0.6]
Iteration 2:
Training Sample X2: (0, 0, 1, 1)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.3 0.5 0.4 0.6]
Unit 1: d^2 = (0.68 - 0)^2 + (0.92 - 0)^2 + (0.80 - 1)^2 + (0.04 - 1)^2
            = 2.2704
Unit 2: d^2 = (0.3 - 0)^2 + (0.5 - 0)^2 + (0.4 - 1)^2 + (0.6 - 1)^2
            = 0.86
Unit 2 wins
New Unit 2 weights = [0.3 0.5 0.4 0.6] + 0.6 ([0 0 1 1] - [0.3 0.5 0.4 0.6])
= [0.3 0.5 0.4 0.6] + 0.6 [-0.3 -0.5 0.6 0.4]
= [0.3 0.5 0.4 0.6] + [-0.18 -0.30 0.36 0.24]
= [0.12 0.2 0.76 0.84]
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.12 0.2 0.76 0.84]
Iteration 3:
Training Sample X3: (1, 0, 0, 1)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.12 0.2 0.76 0.84]
Unit 1: d^2 = (0.68 - 1)^2 + (0.92 - 0)^2 + (0.80 - 0)^2 + (0.04 - 1)^2
            = 2.51
Unit 2: d^2 = (0.12 - 1)^2 + (0.2 - 0)^2 + (0.76 - 0)^2 + (0.84 - 1)^2
            = 1.42
Unit 2 wins
New Unit 2 weights = [0.12 0.2 0.76 0.84] + 0.6 ([1 0 0 1] - [0.12 0.2 0.76 0.84])
= [0.12 0.2 0.76 0.84] + 0.6 [0.88 -0.2 -0.76 0.16]
= [0.12 0.2 0.76 0.84] + [0.53 -0.12 -0.46 0.096]
= [0.65 0.08 0.3 0.94]
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.65 0.08 0.3 0.94]
Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.65 0.08 0.3 0.94]
Unit 1: d^2 = (0.68 - 0)^2 + (0.92 - 0)^2 + (0.80 - 1)^2 + (0.04 - 0)^2
            = 1.35
Unit 2: d^2 = (0.65 - 0)^2 + (0.08 - 0)^2 + (0.3 - 1)^2 + (0.94 - 0)^2
            = 1.8025
Unit 1 wins
New Unit 1 weights = [0.68 0.92 0.80 0.04] + 0.6 ([0 0 1 0] - [0.68 0.92 0.80 0.04])
= [0.68 0.92 0.80 0.04] + 0.6 [-0.68 -0.92 0.2 -0.04]
= [0.68 0.92 0.80 0.04] + [-0.408 -0.552 0.12 -0.024]
= [0.27 0.37 0.92 0.02]
Unit 1: [0.27 0.37 0.92 0.02]
Unit 2: [0.65 0.08 0.3 0.94]
Best mapping unit for each of the sample taken are,

X1: (1, 1, 1, 0) → Unit 1
X2: (0, 0, 1, 1) → Unit 2
X3: (1, 0, 0, 1) → Unit 2
X4: (0, 0, 1, 0) → Unit 1
This process is continued for many epochs until the feature map doesn’t change.
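One epoch of this competitive-learning procedure can be sketched in code. The sketch makes simplifying assumptions: with only two output units it updates the winning unit alone (no neighbourhood function), it keeps the learning rate fixed at 0.6, and it computes at full precision from the initial weight matrix given in the problem. The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def som_epoch(samples, weights, lr):
    """Present each sample once; move only the winning unit towards it."""
    weights = [list(w) for w in weights]   # work on a copy
    winners = []
    for sample in samples:
        distances = [squared_distance(w, sample) for w in weights]
        best = distances.index(min(distances))   # best matching unit
        weights[best] = [w + lr * (x - w) for w, x in zip(weights[best], sample)]
        winners.append(best + 1)                 # 1-based unit number
    return winners, weights

samples = [(1, 1, 1, 0), (0, 0, 1, 1), (1, 0, 0, 1), (0, 0, 1, 0)]
initial = [(0.2, 0.8, 0.5, 0.1), (0.3, 0.5, 0.4, 0.6)]
winners, weights = som_epoch(samples, initial, lr=0.6)
print(winners)
print([[round(v, 3) for v in w] for w in weights])
```

Calling `som_epoch` repeatedly, feeding the returned weights back in, corresponds to training for several epochs until the mapping stops changing.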
Learning Rules
Learning in NN is performed by adjusting the network weights in order to minimize the
difference between the desired and estimated output.
Delta Learning Rule and Gradient Descent

• Developed by Widrow and Hoff, the delta rule is one of the most common learning rules.
• It is a supervised learning rule.
• The delta rule is derived from the gradient descent method (as used in back-propagation).
• It can also be applied when the data are not linearly separable, and is also called the
continuous perceptron learning rule.
• It updates the connection weights using the difference between the target and the output
value. It is the least mean square (LMS) learning algorithm.
• The delta difference is measured by an error function, also called the cost function.
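A minimal sketch of the delta (LMS) rule for a single linear unit follows, assuming the usual update w_i = w_i + η (t − o) x_i, where t is the target and o the unit's output. The training data, learning rate and function name are hypothetical and chosen only so that the result is easy to check.

```python
def delta_rule_step(weights, bias, inputs, target, lr):
    """One delta-rule update for a linear unit: w += lr * (t - o) * x."""
    output = sum(w * x for w, x in zip(weights, inputs)) + bias
    delta = target - output                      # the "delta difference"
    new_weights = [w + lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias + lr * delta
    return new_weights, new_bias, delta

# Hypothetical training set whose target is t = x1 + x2
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 2)]
weights, bias = [0.0, 0.0], 0.0
for _ in range(200):
    for x, t in data:
        weights, bias, _ = delta_rule_step(weights, bias, x, t, lr=0.1)
print([round(w, 3) for w in weights], round(bias, 3))
```

Each step moves the weights a small distance down the gradient of the squared error for the current sample, which is why the rule is described as gradient descent on a cost function.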
TYPES OF ANN
1. Feed Forward Neural Network
2. Fully connected Neural Network
3. Multilayer Perceptron
4. Feedback Neural Network
Feed Forward Neural Network:
A feed-forward neural network in its simplest form is a single-layer perceptron. A sequence of inputs enters the
layer and is multiplied by the weights in this model. The weighted input values are then summed to form a total.
If the sum is more than a predetermined threshold, which is normally set at zero, the output
value is usually 1, and if the sum is less than the threshold, the output value is usually -1.
The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification.
The model may or may not contain a hidden layer, and there is no backpropagation.
Based on the number of hidden layers, feed-forward networks are further classified into single-layered and
multilayered feed-forward networks.
Fully connected Neural Network:
• A fully connected neural network consists of a series of fully connected layers that connect
every neuron in one layer to every neuron in the other layer.
• The major advantage of fully connected networks is that they are “structure agnostic” i.e. there
are no special assumptions needed to be made about the input.
Multilayer Perceptron:
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one
output layer with a single node for each output, and any number of hidden layers, each of
which can have any number of nodes.
Information flows in both directions: inputs are propagated forward and errors are propagated backward.
The weight adjustment training is done via backpropagation.
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation
function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid
formula.
Feedback Neural Network:

Feedback networks, also known as recurrent neural networks (RNNs) or interactive neural networks, are
deep learning models in which information can also flow in the backward direction.
They allow feedback loops in the network. Feedback networks are dynamic in nature, powerful, and
can become very complicated at some stage of execution.
Neuronal connections can be made in any way.
RNNs can process input sequences of different lengths by using their internal state, which
represents a form of memory.
They can therefore be used for applications like speech recognition or handwriting recognition.
LEARNING OF MULTI LAYER PERCEPTRON
WHY MULTI LAYER PERCEPTRON?

Imagine a group of 7-year-old students who are working on a math problem, imagine that each of
them can only do arithmetic with two numbers. But you are giving them an equation like this 5 x 3
+ 2 x 4 + 8 x 2, how can they solve it?
To solve this problem, we can break it down into smaller parts and give them to each of the
students. One student can solve the first part of the equation "5 x 3 = 15" and another student can
solve the second part of the equation "2 x 4 = 8". The third student can solve the third part "8 x 2 =
16".
Finally, we can simplify it to 15 + 8 + 16. Same way, one of the students in the group can solve "15
+ 8 = 23" and another one can solve "23 + 16 = 39", and that's the answer
So here we are breaking down the large math problem into different sections and giving them to
each of the students who are just doing really simple calculations, but as a result of the teamwork,
they can solve the problem efficiently.
This is exactly the idea of how a multi-layer perceptron (MLP) works. Each neuron in the MLP is
like a student in the group, and each neuron is only able to perform simple arithmetic operations.
However, when these neurons are connected and work together, they can solve complex problems.
The principal weakness of the perceptron was that it could only solve problems that were linearly
separable.
A multilayer perceptron (MLP) is a fully connected feed-forward artificial neural network with at
least three layers: input, output, and at least one hidden layer.
The mapping between inputs and output is non-linear (e.g. the XOR gate).
In a perceptron, the neuron must have an activation function that imposes a threshold, like ReLU or
sigmoid, whereas neurons in a multilayer perceptron can use any arbitrary activation function.
MLP networks use back propagation for supervised learning.
The activation functions used in the layers can be linear or non-linear, depending on the type of
problem.
NOTE : In each iteration, after the weighted sums are forwarded through all layers, the gradient of
the Mean Squared Error is computed across all input and output pairs. Then, to propagate it back,
the weights of the first hidden layer are updated with the value of the gradient. That’s how the
weights are propagated back to the starting point of the neural network.
This process keeps going until gradient for each input-output pair has converged, meaning the
newly computed gradient hasn’t changed more than a specified convergence threshold, compared to
the previous iteration.
Works in 2 stages.
1. Forward phase
2. Backward phase
Starting from initial random weights, multi-layer perceptron (MLP) minimizes the loss function
by repeatedly updating these weights. After computing the loss, a backward pass propagates it
from the output layer to the previous layers, providing each weight parameter with an update
value meant to decrease the loss.
ALGORITHM
Radial Basis Function Neural Network
These networks have a fundamentally different architecture from most neural networks.
Most neural network architectures consist of many layers and introduce nonlinearity by repeatedly
applying nonlinear activation functions.
An RBF network, on the other hand, consists only of an input layer, a single hidden layer, and an
output layer.
The input layer is not a computation layer; it just receives the input data and feeds it into the special
hidden layer of the RBF network. The computation that happens inside the hidden layer is very
different from that of most neural networks, and this is where the power of the RBF network comes from.
The output layer performs the prediction task, such as classification or regression.
RBF neural networks are conceptually similar to K-Nearest Neighbor (k-NN) models.
They are useful for interpolation, function approximation, time series prediction and classification.
RBFNN Architecture :
Self-organizing Feature Map
SOM is trained using unsupervised learning.
SOM does not learn by backpropagation with Stochastic Gradient Descent (SGD); it uses competitive
learning to adjust the weights of its neurons. Artificial neural networks often use competitive
learning models to classify input without the use of labeled data.
Use: in dimensionality reduction, to reduce the data by creating a spatially organized representation;
it also helps us discover correlations in the data.
Self-organizing maps have two layers: the first is the input layer and the second is the
output layer, or feature map.
SOM has no activation function in its neurons; the weights are passed directly to the output layer
without any further processing.
Network Architecture and operations
It consists of 2 layers:
1. Input layer
2. Output layer
No Hidden layer.
The mapping process of a self-organizing map starts with the initialization of the weight
vectors.
A sample vector is then chosen at random, and the map is searched to determine which weight
vector most accurately represents the chosen sample. Each weight vector has neighboring weights
that are near it. The chosen weight is allowed to become more like the randomly selected sample
vector. This encourages the map to develop and take on new forms. In a 2D feature space, the maps
typically form hexagonal or square shapes. The entire process is repeated many times, typically
more than 1,000.
To put it simply, learning takes place in the following steps:
• Each node is analyzed to determine whether its weights are similar to the input vector.
The most similar node is called the Best Matching Unit (BMU).
• The neighborhood of the Best Matching Unit is then determined. Over time, the number of
neighbors tends to decline.
• The winning weight evolves to resemble the sample vector more closely, and the neighbors
change similarly. The closer a node is to the Best Matching Unit, the more its weights change;
the farther away it is, the less they change.
• Repeat step two for N iterations.
Advantages and Disadvantages of ANN
Limitations of ANN
Challenges of Artificial Neural Networks
Chapter 13
CLUSTERING ALGORITHMS
• Clustering: the process of grouping a set of objects into classes of similar objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar
• Finding similarities between data according to the characteristics found in the data and
grouping similar data objects into clusters.
• Unsupervised learning: no predefined classes.
• Example: the figure below shows data points with two features, drawn as differently shaded
samples.
When the samples have only a few features, clustering can be done manually; when they have many
features it cannot, so automatic clustering is required.
Clusters are represented by centroids.

Example:
Consider the points (3,3), (2,6) and (7,9).
Centroid: ((3+2+7)/3, (3+6+9)/3) = (4,6). The clusters should not overlap, and every
cluster should represent only one class.
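As a quick check of the centroid calculation, the coordinate-wise mean can be computed as follows. The helper name `centroid` is only illustrative.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equally sized points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

print(centroid([(3, 3), (2, 6), (7, 9)]))  # (4.0, 6.0)
```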
Difference between Clustering and Classification
Applications of Clustering
Advantages and Disadvantages
Challenges of Clustering Algorithms

1. Collection of data with higher dimensions.
2. Designing a proximity measure is another challenge.
3. The curse of dimensionality
PROXIMITY MEASURES
Clustering algorithms need a measure to find the similarity or dissimilarity among the
objects to group them. Similarity and Dissimilarity are collectively known as proximity
measures. This is used by a number of data mining techniques, such as clustering,
nearest neighbour classification, and anomaly detection.
Distance measures are known as dissimilarity measures, as these indicate how one
object is different from another.
Measures like cosine similarity indicate the similarity among objects.
Distance measures and similarity measures are two sides of the same coin: more
distance indicates less similarity, and vice versa.
A distance measure is called a metric if it satisfies all of the following conditions: non-negativity, d(x, y) = 0 only when x = y, symmetry, and the triangle inequality.
Some of proximity measures:
1. Quantitative variables
a) Euclidean distance: It is one of the most important and common
distance measures. It is also called the L2 norm.
Advantage: The distance between two objects does not change when new objects are added.
Disadvantages: i) If the units change, the resulting Euclidean or squared
Euclidean distance changes drastically.
ii) The computational cost is high, because it involves squares and a square
root.
b) City Block Distance: Also known as Manhattan distance or the L1 norm.
c) Chebyshev Distance: Also known as maximum value distance. This is
the maximum absolute difference between the coordinates of a
pair of objects. This distance is also called the supremum distance, Lmax or
L∞ norm.
d) Minkowski Distance: In general, all the above distance measures
can be generalized as:
d(x, y) = (Σ |xi − yi|^r)^(1/r), where r = 1 gives the city block distance, r = 2 the Euclidean distance, and r → ∞ the Chebyshev distance.
Binary Attributes: Binary attributes have only two values. The distance
measures discussed above cannot be applied to find the distance
between objects that have binary attributes. To find the distance
between objects with binary attributes, a contingency table is used.
Hamming Distance: Hamming distance is a metric for comparing two binary data
strings. While comparing two binary strings of equal length, Hamming distance is the
number of bit positions in which the two bits are different. It is used for error detection
or error correction when data is transmitted over computer networks.
Example
Suppose there are two strings 1101 1001 and 1001 1101.
11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the Hamming distance,
d(11011001, 10011101) = 2.
2. Categorical variables
Ordinal Variables
Cosine Similarity
🞂 Cosine similarity is a metric used to measure how similar the documents are
irrespective of their size.
🞂 It measures the cosine of the angle between two vectors projected in a multi-
dimensional space.
🞂 The cosine similarity is advantageous because even if the two similar
documents are far apart by the Euclidean distance (due to the size of the
document), chances are they may still be oriented closer together.
🞂 The smaller the angle, higher the cosine similarity.
🞂 Consider 2 documents P1 and P2.
◦ If distance is more, then less similar.
◦ If distance is less, then more similar.
1. Consider the following data and calculate the Euclidean, Manhattan and
Chebyshev distances.
a. (2 3 4) and (1 5 6)
Solution
Euclidean distance = √((2 − 1)² + (3 − 5)² + (4 − 6)²) = √9 = 3
Manhattan distance = |2 − 1| + |3 − 5| + |4 − 6| = 1 + 2 + 2 = 5
Chebyshev distance = max{|2 − 1|, |3 − 5|, |4 − 6|} = max{1, 2, 2} = 2
b. (2 2 9) and (7 8 9)
Euclidean distance = √((2 − 7)² + (2 − 8)² + (9 − 9)²) = √(25 + 36 + 0) = √61 = 7.81
Manhattan distance = |2 − 7| + |2 − 8| + |9 − 9| = 5 + 6 + 0 = 11
Chebyshev distance = max{|2 − 7|, |2 − 8|, |9 − 9|} = max{5, 6, 0} = 6
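The three distances in this problem can be checked with a short script. This is a minimal sketch; the function names are illustrative, and the Euclidean and Manhattan distances are written as special cases of a Minkowski helper with exponent `r`.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r between two equally long vectors."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def euclidean(x, y):
    return minkowski(x, y, 2)   # L2 norm

def manhattan(x, y):
    return minkowski(x, y, 1)   # L1 norm, city block distance

def chebyshev(x, y):
    # Largest absolute coordinate difference (L-infinity norm).
    return max(abs(a - b) for a, b in zip(x, y))

for x, y in [((2, 3, 4), (1, 5, 6)), ((2, 2, 9), (7, 8, 9))]:
    print(round(euclidean(x, y), 2), manhattan(x, y), chebyshev(x, y))
# 3.0 5.0 2
# 7.81 11.0 6
```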
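2. Find cosine similarity, SMC and Jaccard coefficients for the following binary
data:
a. (1 0 1 1) and (1 1 0 0)
Solution
x = 1 0 1 1
y = 1 1 0 0
Let a be the number of positions where both values are 0, b where x = 0 and y = 1, c where x = 1 and y = 0, and d where both values are 1.
Here a = 0, b = 1, c = 2, d = 1.
SMC = (a + d) / (a + b + c + d) = 1/4 = 0.25
Jaccard Coefficient = d / (b + c + d) = 1/4 = 0.25
Cosine Similarity = (1×1 + 0×1 + 1×0 + 1×0) / (√3 × √2) = 1/√6 = 0.41
b. (1 0 0 0 1) and (1 0 0 0 0 1)
Solution
No match: the two vectors have different lengths, so they cannot be compared.
(1 0 0 0 1) and (1 1 0 0 0)
x = 1 0 0 0 1
y = 1 1 0 0 0
a = 2, b = 1, c = 1, d = 1
SMC = (a + d) / (a + b + c + d) = 3/5 = 0.6
Jaccard Coefficient = d / (b + c + d) = 1/3 = 0.33
Cosine Similarity = (1×1 + 0×1 + 0×0 + 0×0 + 1×0) / (√2 × √2) = 1/2 = 0.5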
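The contingency counts and the three coefficients can be computed mechanically. The sketch below assumes the convention a = (0,0) matches, b = (0,1), c = (1,0), d = (1,1); the function names are illustrative only.

```python
import math

def contingency(x, y):
    """Count the (0,0), (0,1), (1,0) and (1,1) positions of two binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    a = pairs.count((0, 0))
    b = pairs.count((0, 1))
    c = pairs.count((1, 0))
    d = pairs.count((1, 1))
    return a, b, c, d

def smc(x, y):
    a, b, c, d = contingency(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, d = contingency(x, y)
    return d / (b + c + d)

def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)

x, y = (1, 0, 1, 1), (1, 1, 0, 0)
print(contingency(x, y))                              # (0, 1, 2, 1)
print(smc(x, y), jaccard(x, y), round(cosine(x, y), 2))  # 0.25 0.25 0.41
```

Vectors of different lengths raise a `ValueError`, which corresponds to the "No match" case in part b.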
3. Find Hamming distance for the following binary data:
a. (1 1 1) and (1 0 0)
Solution
It differs in two positions; therefore Hamming distance is 2
b. (1 1 1 0 0) and (0 0 1 1 1)
Solution
It differs in four positions; therefore, Hamming distance is 4
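Counting the differing positions is easy to automate. A minimal sketch, with the illustrative name `hamming`, that works on both bit strings and tuples:

```python
def hamming(x, y):
    """Number of positions at which two equally long sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))

print(hamming("11011001", "10011101"))            # 2
print(hamming((1, 1, 1), (1, 0, 0)))              # 2
print(hamming((1, 1, 1, 0, 0), (0, 0, 1, 1, 1)))  # 4
```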
4. Find the distance between:
a. Employee ID: 1000 and 1001
Solution
For categorical values, the distance is 0 if the two values are equal and 1 if they are not.
The IDs are not equal. Therefore, the distance is 1.
b. Employee name – John & John and John & Joan
Solution
The distance between John and John is 0.
The distance between John and Joan is 1.
5. Find the distance between:
a. (Yellow, red, green) and (red, green, yellow)
Solution
Yellow = 1, red = 2, Green = 3. For ordinal values, distance = |position(x) − position(y)| / (n − 1), where n = 3 is the number of categories.
Distance between (yellow, red) = |1 − 2| / (3 − 1) = 1/2 = 0.5
Distance between (red, green) = |2 − 3| / (3 − 1) = 1/2 = 0.5
Distance between (green, yellow) = |3 − 1| / (3 − 1) = 2/2 = 1
Therefore, distance between (Yellow, red, green) and (red, green, yellow) is (0.5, 0.5, 1).
b. (bread, butter, milk) and (milk, sandwich, Tea)
Solution
Bread = 1, Butter = 2, Milk = 3, Sandwich = 4, Tea = 5, so n = 5.
The distance between (bread, milk) = |1 − 3| / (5 − 1) = 2/4 = 1/2
The distance between (butter, sandwich) = |2 − 4| / (5 − 1) = 2/4 = 1/2
The distance between (milk, tea) = |3 − 5| / (5 − 1) = 2/4 = 1/2
Therefore, the distance between (bread, butter, milk) and (milk, sandwich, Tea) = (1/2, 1/2, 1/2).
🞂 Hierarchical Clustering Algorithms
Hierarchical clustering involves creating clusters that have a predetermined ordering
from top to bottom.
For example, all files and folders on the hard disk are organized in a hierarchy.
The hierarchical relationship is shown in the form of a dendrogram.
There are two types of hierarchical clustering:
◦ Divisive and Agglomerative.
🞂 Divisive method: In the divisive or top-down clustering method, we assign all of the
observations to a single cluster and then partition the cluster into the two least similar clusters.
We then proceed recursively on each cluster until there is one cluster for each
observation. There is evidence that divisive algorithms produce more accurate
hierarchies than agglomerative algorithms in some circumstances, but they are conceptually
more complex.
🞂 Agglomerative method: In the agglomerative or bottom-up clustering method, we (1) assign
each observation to its own cluster, (2) compute the similarity (e.g., distance)
between each pair of clusters, and (3) join the two most similar clusters. Steps
2 and 3 are repeated until there is only a single cluster left. The related algorithm is shown below.
🞂 The following three methods differ in how the distance between each cluster is
measured.
1. Single Linkage
2. Average Linkage
3. Complete Linkage
Single Linkage or MIN algorithm
In single linkage hierarchical clustering, the distance between two clusters is
defined as the shortest distance between two points in each cluster. For example,
the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two closest points.
🞂 Complete Linkage : In complete linkage hierarchical clustering, the distance between
two clusters is defined as the longest distance between two points in each cluster. For
example, the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two furthest points.
OR
🞂 Average Linkage: In average linkage hierarchical clustering, the distance between two
clusters is defined as the average distance between each point in one cluster and every
point in the other cluster. For example, the distance between clusters “r” and “s” to the
left is equal to the average length of the arrows connecting the points of one
cluster to the other.
Mean-Shift Algorithm
Use dataset and apply hierarchical methods. Show the dendrogram.
SNo. X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
5. 20 8
Table Sample Data
Solution
The Euclidean distance between every pair of objects is computed from the coordinates and is shown in the
following proximity matrix.
Table: Proximity Matrix
Objects   1     2     3      4      5
1         -     5     9      13.60  17.26
2               -     5.83   9.06   13
3                     -      5.66   8.54
4                            -      4.12
5                                   -
Using single linkage, the minimum distance is 4.12. Therefore, the items 4 and 5 are clustered together. The resultant
table is given as shown in the following Table.
Table After Iteration 1
Clusters  {4,5}  1      2      3
{4,5}     -      13.60  9.06   5.66
1                -      5      9
2                       -      5.83
3                              -
The distance between the group {4,5} and the items 1, 2 and 3 is the minimum of the pairwise distances.
Thus, the distance between {4,5} and {1} is:
Minimum {d(4,1), d(5,1)} = Minimum {13.60, 17.26} = 13.60
The distance between {4,5} and {2} is:
Minimum {d(4,2), d(5,2)} = Minimum {9.06, 13} = 9.06
The distance between {4,5} and {3} is:
Minimum {d(4,3), d(5,3)} = Minimum {5.66, 8.54} = 5.66
The minimum distance in the above table is 5. Therefore, {1} and {2} are combined. This
results in the following Table.
Table After Iteration 2
Clusters  {1,2}  {4,5}  3
{1,2}     -      9.06   5.83
{4,5}            -      5.66
3                       -
Thus, the distance between {1,2} and {4,5} is:
Minimum {d(1,4), d(1,5), d(2,4), d(2,5)} = Minimum {13.60, 17.26, 9.06, 13} = 9.06
The distance between {1,2} and {3} is:
Minimum {d(1,3), d(2,3)} = Minimum {9, 5.83} = 5.83
The minimum is 5.66. Therefore, {4,5} and {3} are combined. Finally, {3,4,5} is combined with
{1,2} at distance Minimum {5.83, 9.06} = 5.83.
Therefore, the order of clustering is {4,5}, then {1,2}, then {3,4,5}, and finally {1,2,3,4,5}.
Complete Linkage or MAX or Clique
With the Euclidean distances computed from the sample coordinates, the smallest pairwise distance is
d(4,5) = 4.12, so {4,5} is combined first. The distance from {4,5} to each remaining item is then taken as the maximum of the pairwise distances:
The distance between {4,5} and {1} is:
Max {d(4,1), d(5,1)} = Max {13.60, 17.26} = 17.26
The distance between {4,5} and {2} is:
Max {d(4,2), d(5,2)} = Max {9.06, 13} = 13
The distance between {4,5} and {3} is:
Max {d(4,3), d(5,3)} = Max {5.66, 8.54} = 8.54
This results in the following Table.
Clusters  {4,5}  1      2      3
{4,5}     -      17.26  13     8.54
1                -      5      9
2                       -      5.83
3                              -
So, the minimum is 5. Therefore, {1,2} is combined. This is shown in the following Table.
Clusters  {1,2}  {4,5}  3
{1,2}     -      17.26  9
{4,5}            -      8.54
3                       -
The minimum is 8.54. Therefore, {3} is combined with {4,5}. Finally, {1,2} and {3,4,5} are combined at
distance 17.26. The order of clustering is {4,5}, {1,2}, {3,4,5}, and then all items.
Hint: The same procedure is used for the average link algorithm, where the average distance of all pairs of
points across the clusters is used to form clusters.
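The agglomerative procedure can be sketched so that the merge order is recomputed directly from the five sample points. This is an illustrative sketch only: the function name `agglomerative` is hypothetical, items are numbered from 1 as in the problem, and the linkage is passed as Python's built-in `min` (single linkage) or `max` (complete linkage).

```python
import math

def agglomerative(points, linkage=min):
    """Return the list of merges as (cluster_a, cluster_b, distance).

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [(i,) for i in range(1, len(points) + 1)]  # 1-based item ids

    def dist(c1, c2):
        # Cluster distance = min (or max) over all cross-cluster point pairs.
        return linkage(math.dist(points[i - 1], points[j - 1])
                       for i in c1 for j in c2)

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it.
        d, a, b = min((dist(a, b), a, b)
                      for idx, a in enumerate(clusters)
                      for b in clusters[idx + 1:])
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, round(d, 2)))
    return merges

points = [(3, 5), (7, 8), (12, 5), (16, 9), (20, 8)]
for step in agglomerative(points, linkage=min):
    print("single  ", step)    # merge distances: 4.12, 5.0, 5.66, 5.83
for step in agglomerative(points, linkage=max):
    print("complete", step)    # merge distances: 4.12, 5.0, 8.54, 17.26
```

The merge distances are the heights at which the branches join in the dendrogram.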
Consider the data shown in the following table. Use the k-means algorithm with k
= 2 and show the result.
Table Sample Data
SNO X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
Solution
Let us assume the seed points are (3,5) and (16,9). This is shown in the following table
as starting clusters.
Table Initial Cluster Table
Cluster 1 Cluster 2
(3,5) (16,9)
Centroid (3,5) Centroid (16,9)
Iteration 1: Compare all the data points or samples with the centroids and assign each to the
nearest centroid.
Take the sample object 2 and compare it with the two centroids as follows:
Dist(2, centroid 1) = √((7 − 3)² + (8 − 5)²) = √(16 + 9) = √25 = 5
Dist(2, centroid 2) = √((7 − 16)² + (8 − 9)²) = √(81 + 1) = √82 = 9.06
Object 2 is closer to the centroid of cluster 1 and hence is assigned to cluster 1.
For the object 3:
Dist(3, centroid 1) = √((12 − 3)² + (5 − 5)²) = √81 = 9
Dist(3, centroid 2) = √((12 − 16)² + (5 − 9)²) = √(16 + 16) = √32 = 5.66
Object 3 is closer to the centroid of cluster 2 and hence is assigned to cluster 2.
This is shown in the following Table.

Table Cluster Table After Iteration 1
Cluster 1                          Cluster 2
(3,5)                              (12,5)
(7,8)                              (16,9)
Centroid (10/2,13/2)=(5,6.5)       Centroid (28/2,14/2)=(14,7)
The second iteration is started with the centroids (5,6.5) and (14,7). Compute again, for object 2:
Dist(2, centroid 1) = √((7 − 5)² + (8 − 6.5)²) = √6.25 = 2.5
Dist(2, centroid 2) = √((7 − 14)² + (8 − 7)²) = √(49 + 1) = √50 = 7.07
Object 2 is closer to the centroid of cluster 1 and hence remains in the same cluster. Take the
sample object 3 and compute again:
Dist(3, centroid 1) = √((12 − 5)² + (5 − 6.5)²) = √51.25 = 7.16
Dist(3, centroid 2) = √((12 − 14)² + (5 − 7)²) = √(4 + 4) = √8 = 2.83
Object 3 is closer to the centroid of cluster 2 and remains in the same cluster.
Objects 1 and 4 likewise stay in clusters 1 and 2. No assignment changes, so the algorithm stops.
Therefore, the resultant clusters are
{(3,5), (7,8)} and {(12,5), (16,9)}.
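The same iterations can be reproduced with a small k-means loop. This is a minimal sketch assuming Euclidean distance and the seed points used in the problem; the function name `kmeans` and the `max_iter` safeguard are illustrative choices.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: returns (clusters, centroids) after convergence."""
    for _ in range(max_iter):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members
            else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:   # nothing moved: converged
            break
        centroids = new_centroids
    return clusters, centroids

points = [(3, 5), (7, 8), (12, 5), (16, 9)]
clusters, centroids = kmeans(points, [(3, 5), (16, 9)])
print(clusters)   # [[(3, 5), (7, 8)], [(12, 5), (16, 9)]]
print(centroids)  # [(5.0, 6.5), (14.0, 7.0)]
```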
PARTITIONAL CLUSTERING ALGORITHMS
Partitional clustering divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitional clustering is the K-Means
clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k is the pre-defined number
of groups. The cluster centers are chosen so that each data point is closer to
the centroid of its own cluster than to the centroid of any other cluster.
K-means can be viewed as a greedy algorithm, as it partitions the n samples into k
clusters so as to minimize the Sum of Squared Error (SSE). SSE is a measure of error that gives
the sum of the squared Euclidean distances of each data point to its closest centroid.
SSE = Σ (i = 1 to k) Σ (x ∈ Ci) dist(ci, x)²
Here ci = centroid of the ith cluster
Ci = set of samples assigned to the ith cluster
x = sample data
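As an illustration, the SSE of the two clusters found in the k-means example, {(3,5), (7,8)} with centroid (5, 6.5) and {(12,5), (16,9)} with centroid (14, 7), can be computed as follows. The function name `sse` is illustrative.

```python
import math

def sse(clusters, centroids):
    """Sum of squared Euclidean distances of every point to its cluster centroid."""
    return sum(math.dist(c, x) ** 2
               for c, members in zip(centroids, clusters)
               for x in members)

clusters = [[(3, 5), (7, 8)], [(12, 5), (16, 9)]]
centroids = [(5, 6.5), (14, 7)]
print(round(sse(clusters, centroids), 2))  # 28.5
```

Each point in cluster 1 contributes 6.25 and each point in cluster 2 contributes 8, giving 12.5 + 16 = 28.5.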
PROBLEM
Density-Based Clustering
A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region
of high point density, separated from other such clusters by contiguous regions of low point
density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm
for density-based clustering. It can discover clusters of different shapes and sizes from a large
amount of data that contains noise and outliers.
The DBSCAN algorithm uses two parameters:
minPts: The minimum number of points (a threshold) clustered together for a region to be
considered dense.
eps (ε): A distance measure that will be used to locate the points in the neighborhood of any
point.
These parameters can be understood if we explore two concepts called Density Reachability
and Density Connectivity.
Reachability in terms of density establishes a point to be reachable from another if it lies within
a particular distance (eps) from it. Connectivity, on the other hand, involves a transitivity based
chaining-approach to determine whether points are located in a particular cluster. For example,
p and q points could be connected if p->r->s->t->q, where a->b means b is in the neighborhood
of a.
There are three types of points after the DBSCAN clustering is complete:
• Core: This is a point that has at least minPts points within distance eps of itself.
• Border: This is a point that is not a Core point but has at least one Core point within distance eps.
• Noise: This is a point that is neither a Core nor a Border point. It has fewer than minPts points
within distance eps of itself.
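The three point types can be illustrated with a small labelling function. This is a sketch of the point classification only, not the full DBSCAN clustering; the name `classify_points` is hypothetical, and the neighborhood is assumed to include the point itself, which is one common convention.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    def neighbours(i):
        # Indices of all points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    core = {i for i in range(len(points)) if len(neighbours(i)) >= min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours(i)):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here (2, 1) has only one neighbor besides itself, but that neighbor (1, 1) is a core point, so it is a border point; (10, 10) has no neighbors and is noise.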
Grid-Based Approaches
The grid-based clustering method takes a space-driven approach by partitioning the embedding
space into cells independent of the distribution of the input objects.
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the
object space into a finite number of cells that form a grid structure on which all of the
operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically independent
of the number of data objects, yet dependent on only the number of cells.
Subspace Clustering
CLIQUE is a density-based and grid-based subspace clustering algorithm, useful for finding
clustering in subspace.
Concept of Dense cell
CLIQUE partitions each dimension into several non-overlapping intervals, which divides the space into
cells. Then, the algorithm determines whether each cell is dense or sparse. A cell is considered
dense if its density exceeds a threshold value.
Density is defined as the ratio of the number of points to the volume of the region. In one pass, the
algorithm finds the number of cells, the number of points in each cell, etc., and then combines the dense cells.
For that, the algorithm uses contiguous intervals and a set of dense cells.
MONOTONICITY Property
CLIQUE uses the anti-monotonicity property, as in the Apriori algorithm. It means that all the
subsets of a frequent itemset are frequent. Similarly, if a subset is infrequent, then all its
supersets are infrequent. For CLIQUE, a cell can be dense in k dimensions only if its projections onto k − 1 dimensions are also dense.
Algorithm works in 2 stages:

PROBABILITY MODEL BASED METHODS
Probability model-based methods in clustering are a class of techniques that use statistical models to
represent the underlying probability distributions of data points in a dataset.
These methods are used to group similar data points together into clusters based on their likelihood of
belonging to a particular cluster according to the assumed probability distribution.
Two popular probability model-based clustering methods are Gaussian Mixture Models (GMMs) and
Hidden Markov Models (HMMs). Other methods in this family include:
1. Fuzzy Clustering
2. EM algorithm
Fuzzy Clustering :
Fuzzy Clustering is a type of clustering algorithm in machine learning that allows a data point to belong
to more than one cluster with different degrees of membership. Unlike traditional clustering algorithms,
such as k-means or hierarchical clustering, which assign each data point to a single cluster, fuzzy
clustering assigns a membership degree between 0 and 1 for each data point for each cluster.
Consider two clusters ci and cj. An element x can belong to both clusters. The strength of
the association of object i with cluster j is given as wij. The value of wij lies between 0
and 1, and the weights of an object over all clusters sum to 1.
Expectation Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a statistical method used for estimating
parameters in statistical models when you have incomplete or missing data. It's commonly used
in unsupervised machine learning tasks such as clustering and Gaussian Mixture Model (GMM)
fitting.
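Given a mixture of distributions, data can be generated by randomly picking a distribution and
generating a point from it. The Gaussian distribution is a bell-shaped curve.
The density function of a Gaussian distribution with mean μ and standard deviation σ is given by:
f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))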
The EM algorithm iteratively optimizes a likelihood function in two steps: the E-step
(Expectation) and the M-step (Maximization).
Here's a high-level overview of how the EM algorithm works:
1. Initialization: Start with initial estimates of the model parameters. These initial values can be
random or based on some prior knowledge.
2. E-step (Expectation):
• In this step, you compute the expected values (expectation) of the latent (unobserved)
variables given the observed data and the current parameter estimates.
• This involves calculating the posterior probabilities or likelihoods of the missing data or
latent variables.
• Essentially, you're estimating how likely each possible value of the latent variable is,
given the current model parameters.
3. M-step (Maximization):
• In this step, you update the model parameters to maximize the expected log-likelihood
found in the E-step.
• This involves finding the parameters that make the observed data most likely given the
estimated values of the latent variables.
• The M-step involves solving an optimization problem to find the new parameter values.
4. Iteration:
• Repeat the E-step and M-step alternately until convergence criteria are met. Common
convergence criteria include a maximum number of iterations, a small change in
parameter values, or a small change in the likelihood.
5. Termination:
• Once the EM algorithm converges, you have estimates of the model parameters that
maximize the likelihood of the observed data.
6. Result:
• The final parameter estimates can be used for various purposes, such as clustering,
density estimation, or imputing missing data.
The EM algorithm is widely used in various fields, including machine learning, image
processing, and bioinformatics.
One of its notable applications is in Gaussian Mixture Models (GMMs), where it's used to
estimate the means and covariances of Gaussian distributions that are mixed to model
complex data distributions.
It's important to note that the EM algorithm can sometimes get stuck in local optima, so the
choice of initial parameter values can affect the results. To mitigate this, you may run the
algorithm multiple times with different initializations and select the best result.
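The E-step and M-step can be made concrete for a one-dimensional Gaussian mixture. This is a minimal sketch under stated assumptions: the function names `gaussian_pdf` and `em_gmm_1d`, the fixed number of iterations used instead of a convergence test, and the small lower bound on the standard deviation are all illustrative choices.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(data, mus, sigmas, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM, starting from the given initial estimates."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            probs = [weights[j] * gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate mean, spread and mixing weight of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)   # guard against a collapsing component
            weights[j] = nj / len(data)
    return mus, sigmas, weights

data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
mus, sigmas, weights = em_gmm_1d(data, mus=[0.0, 6.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
print([round(m, 2) for m in mus], [round(w, 2) for w in weights])
```

On this toy data the two estimated means settle near the two groups of points (around 1 and 5). With a poor initialization the result can differ, which is the local-optimum issue mentioned above.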
CLUSTER EVALUATION METHODS

Evaluating a clustering algorithm is a difficult task, as domain knowledge is absent most of the time.
So, validating clustering algorithms is more difficult than validating classification
algorithms.
Evaluation of Clustering
1. Internal
2. External
3. Relative
Cohesion and separation
Here, N – No. of cluster,

C – set of centroids
Xi – centroid
Mj – samples.
Here, x – centroid of the entire dataset

Xi – centroid of the cluster
Ci – size of the cluster
DUNN Index
This metric measures the ratio between the distance between the clusters and the distance within
the clusters. A high Dunn index indicates that the clusters are well-separated and distinct.
DUNN index is calculated as:
Here, α and β are parameters. The Dunn index is a useful measure that combines both cohesion and
separation.
Silhouette Coefficient
This metric measures how well each data point fits into its assigned cluster and ranges from -1 to
1. A high silhouette coefficient indicates that the data points are well-clustered, while a low
coefficient indicates that the data points may be assigned to the wrong cluster.
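As an illustration, the silhouette coefficient of a point is commonly computed as s = (b − a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below averages s over all points; the function name `silhouette` and the convention of scoring singleton clusters as 0 are assumptions, and at least two clusters are required.

```python
import math

def silhouette(clusters):
    """Mean silhouette coefficient of a clustering given as a list of point lists."""
    scores = []
    for ci, members in enumerate(clusters):
        for p in members:
            if len(members) == 1:
                scores.append(0.0)   # assumed convention for singleton clusters
                continue
            # a: mean distance to the other members (distance to itself is 0).
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            # b: mean distance to the nearest other cluster.
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

clusters = [[(3, 5), (7, 8)], [(12, 5), (16, 9)]]
print(round(silhouette(clusters), 2))  # 0.41
```

A value of about 0.41 indicates that the points are, on average, closer to their own cluster than to the other one, although the separation is moderate.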