AIML 4th and 5th Module Notes
MODULE 4
CHAPTER 6
DECISION TREE LEARNING
6.1 Introduction
6.1.1 Structure of a Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
Each leaf node holds a class label. The tree is constructed by recursively splitting the training data into subsets based on the values of the attributes until a stopping criterion is met, such as the maximum depth of the tree or the minimum number of samples required to split a node.
Validating and pruning decision trees is a crucial part of building accurate and robust
machine learning models. Decision trees are prone to overfitting, which means they can
learn to capture noise and details in the training data that do not generalize well to new,
unseen data.
Validation and pruning are techniques used to mitigate this issue and improve the
performance of decision tree models.
Several hyperparameters can be tuned for early stopping (pre-pruning) to prevent overfitting, such as the maximum depth of the tree and the minimum number of samples required to split a node or to form a leaf.
The same parameters can also be tuned to obtain a robust model.
Post-pruning does the opposite of pre-pruning and allows the decision tree model to grow to its full depth. Once the model grows to its full depth, tree branches are removed to prevent the model from overfitting. Without pruning, the algorithm continues to partition the data into smaller subsets until the final subsets are similar in terms of the outcome variable. The final subsets of the tree then consist of only a few data points, so the tree has learned the training data very closely. When a new data point that differs from the learned data is introduced, it may not be predicted well.
The hyperparameter that can be tuned for post-pruning and preventing overfitting
is: ccp_alpha
ccp stands for Cost Complexity Pruning and can be used as another option to control
the size of a tree. A higher value of ccp_alpha will lead to an increase in the number of
nodes pruned.
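A hedged scikit-learn sketch of both approaches (assuming scikit-learn is available; the Iris dataset and all parameter values are arbitrary choices for illustration): pre-pruning limits growth while the tree is built, post-pruning grows the full tree and then prunes it with ccp_alpha.

# Sketch: pre-pruning vs. post-pruning of a decision tree (illustrative values only)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning (early stopping): limit growth while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=4,
                                    min_samples_leaf=2, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: allow full growth, then prune with cost complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
post_pruned.fit(X_train, y_train)

print("Pre-pruned accuracy :", pre_pruned.score(X_test, y_test))
print("Post-pruned accuracy:", post_pruned.score(X_test, y_test))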
Chapter 10
Artificial Neural Networks
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modelled after the brain.
An Artificial neural network is usually a computational network based on biological neural
networks that construct the structure of the human brain.
Just as the human brain has neurons interconnected with each other, artificial neural networks also have neurons that are linked to each other in various layers of the network. These neurons are known as nodes.
• Dendrites are tree-like networks made of nerve fibers connected to the cell body.
An axon is a single, long connection extending from the cell body and carrying signals from the neuron. The end of the axon splits into fine strands, and each strand terminates in small bulb-like organs called synapses. It is through the synapses that the neuron introduces its signals to other nearby neurons. The receiving ends of these synapses on the nearby neurons can be found both on the dendrites and on the cell body. There are approximately 10^4 synapses per neuron in the human body. Electric impulses are passed between the synapse and the dendrites. It is a chemical process which results in an increase or decrease in the electric potential inside the body of the receiving cell. If the electric potential reaches a threshold value, the receiving cell fires and a pulse (action potential) of fixed strength and duration is sent through the axon to the synaptic junctions of the cell. After that, the cell has to wait for a period called the refractory period.
Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), and passes the output through a cable-like structure to other connected neurons (axon to synapse to another neuron's dendrite).
OR
Working:
The received inputs are computed as a weighted sum, which is given to the activation function; if the sum exceeds the threshold value, the neuron gets fired. The neuron is the basic processing unit that receives a set of inputs x1, x2, x3, ..., xn and their associated weights w1, w2, w3, ..., wn. The summation function computes the weighted sum of the inputs received by the neuron.
Sum = Σ xi wi
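As a small illustrative sketch (the input and weight values here are made up), a single neuron computes this weighted sum and fires when it reaches a threshold:

# Sketch of a single artificial neuron: weighted sum followed by a threshold
import numpy as np

def neuron_output(x, w, theta=0.0):
    net = np.dot(x, w)                 # Sum = sum_i x_i * w_i
    return 1 if net >= theta else 0    # fire only if the net input reaches the threshold

x = np.array([1.0, 0.0, 1.0])          # inputs x1, x2, x3
w = np.array([0.5, -0.2, 0.3])         # weights w1, w2, w3
print(neuron_output(x, w, theta=0.5))  # net = 0.8 >= 0.5, so the neuron fires (output 1)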
Activation functions:
• To make the network work more efficiently and produce an exact output, some force or activation is applied. Accordingly, an activation function is applied over the net input to calculate the output of an ANN.
Information processing in a processing element has two major parts: input and output. An integration function (f) is associated with the input of the processing element.
1. Identity function (linear function): The output is the same as the input, i.e., the weighted sum. This function is useful when we do not apply any threshold. The output value ranges between -∞ and +∞.
2. Binary step function: This function can be defined as
f(x) = 1 if x ≥ θ; 0 if x < θ
where θ represents the threshold value. It is used in single-layer nets to convert the net input to an output that is binary (0 or 1).
3. Bipolar step function: This function can be defined as
f(x) = 1 if x ≥ θ; -1 if x < θ
where θ represents the threshold value. It is used in single-layer nets to convert the net input to an output that is bipolar (+1 or -1).
4. Sigmoid function: It is used in backpropagation nets.
Two types:
a) Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar sigmoid function. It is defined as f(x) = 1 / (1 + e^(-x)).
7. ReLU Function
ReLU stands for Rectified Linear Unit and is defined as f(x) = max(0, x).
Although it gives the impression of a linear function, ReLU has a derivative function and allows for backpropagation while simultaneously being computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same time.
A neuron is deactivated only if the output of the linear transformation is less than 0.
8. Softmax function: Softmax is an activation function that scales numbers/logits into probabilities. The output of a softmax is a vector (say v) with the probabilities of each possible outcome. The probabilities in vector v sum to one over all possible outcomes or classes.
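A short NumPy sketch of the activation functions described above (identity, binary step, bipolar step, binary sigmoid, ReLU and softmax); the threshold θ is assumed to be 0 here purely for illustration:

import numpy as np

def identity(x):              return x                            # output equals the net input
def binary_step(x, theta=0):  return np.where(x >= theta, 1, 0)   # binary (0 or 1) output
def bipolar_step(x, theta=0): return np.where(x >= theta, 1, -1)  # bipolar (+1 or -1) output
def sigmoid(x):               return 1.0 / (1.0 + np.exp(-x))     # binary (logistic) sigmoid
def relu(x):                  return np.maximum(0, x)             # f(x) = max(0, x)
def softmax(x):
    e = np.exp(x - np.max(x))          # subtract the max for numerical stability
    return e / e.sum()                 # resulting probabilities sum to 1

z = np.array([-1.5, 0.0, 2.0])
print(binary_step(z), sigmoid(z), relu(z), softmax(z))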
• Knowledge is acquired by the network from its environment through a learning process.
OR
• The perceptron can represent all the primitive boolean functions AND, OR, NAND and NOR.
• Some boolean functions cannot be represented.
– E.g., the XOR function.
Solution:
[Figure: a two-layer network with bias input X0, inputs X1 and X2, a unit X3 computing AND and a unit X4 computing NOT, with weights w13, w23, w34 and thresholds θ3, θ4.]
X1 | X2 | X3 = X1 AND X2 | X4 = NOT X3
0 | 0 | 0 | 1
0 | 1 | 0 | 1
1 | 0 | 0 | 1
1 | 1 | 1 | 0
ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input Layer 𝑰𝒋 𝑶𝒋
𝑿𝟏 0 0
𝑿𝟐 1 1
2. Calculate net inputs and outputs in the hidden and output layer as shown in Table 4.
Table 4: Net Input and Output Calculation in Hidden and Output Layer
Unit k | Net Input Ik | Net Output Ok
X3 | (computed from the input layer) | O3 = 0.450
X4 | I4 = O3·W34 + X0·θ4 = (0.450 × 0.3) + 1 × (-0.3) = -0.165 | O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(0.165)) = 0.458
3. Calculate Error
Error = O_desired - O_estimated = 1 - 0.458 = 0.542
ITERATION 2:
Step 1: FORWARD PROPAGATION
Unit k | Net Input Ik | Net Output Ok
X4 | I4 = O3·W34 + X0·θ4 = (0.451 × 0.324) + 1 × (-0.246) = -0.099 | O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(0.099)) = 0.475
2. Calculate Error
Error = O_desired - O_estimated = 1 - 0.475 = 0.525
Iteration | Error
1 | 0.542
2 | 0.525
Reduction in error = 0.542 - 0.525 = 0.017
In iteration 2 the error is reduced to 0.525. This process continues until the desired output is achieved.
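A minimal NumPy sketch of the iteration-1 forward pass for unit X4 above, using the same values (O3 = 0.450, W34 = 0.3, θ4 = -0.3, desired output 1); the variable names simply mirror the notation of the example:

import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

O3, W34, theta4 = 0.450, 0.3, -0.3   # values from iteration 1
I4 = O3 * W34 + 1 * theta4           # net input: I4 = O3*W34 + X0*theta4, with bias X0 = 1
O4 = sigmoid(I4)                     # net output
error = 1 - O4                       # Error = O_desired - O_estimated, with O_desired = 1
print(I4, O4, error)                 # about -0.165, 0.459 and 0.541 (rounded to 0.458 and 0.542 above)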
How does a Multi-Layer Perceptron solve the XOR problem? Design an MLP with backpropagation to implement the XOR Boolean function.
Solution:
X1 X2 Y
0 0 1
0 1 0
1 0 0
1 1 1
[Figure: MLP with bias X0, inputs X1 and X2, hidden units X3 and X4, output unit X5; the initial weights and biases shown in the figure are 0.1, -0.3, -0.2, 0.4, 0.4, 0.2, 0.2, -0.3 and -0.3.]
Table 11: Error Calculation for each unit in the Output Layer and Hidden Layer
For the output layer (unit k):
X5: Error5 = O5 (1 - O5) (T - O5), with target T = 1
    = 0.407 × (1 - 0.407) × (1 - 0.407)
    = 0.143
For the hidden layer (unit j):
X4: Error4 = O4 (1 - O4) Σk Errork Wjk = O4 (1 - O4) Error5 W45
    = 0.622 × (1 - 0.622) × 0.143 × (-0.3)
    = -0.010
X3: Error3 = O3 (1 - O3) Σk Errork Wjk = O3 (1 - O3) Error5 W35
    = 0.549 × (1 - 0.549) × 0.143 × 0.2
    = 0.007
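A hedged sketch of how these error (delta) terms are computed for sigmoid units, matching the formulas of Table 11; the target T = 1 for the output unit is assumed, as in the worked example:

# Error terms for backpropagation with sigmoid units (values from Table 11)
def output_error(O_k, target):
    # Error_k = O_k * (1 - O_k) * (T - O_k)
    return O_k * (1 - O_k) * (target - O_k)

def hidden_error(O_j, downstream):
    # Error_j = O_j * (1 - O_j) * sum_k (Error_k * W_jk); downstream = [(Error_k, W_jk), ...]
    return O_j * (1 - O_j) * sum(err * w for err, w in downstream)

err5 = output_error(0.407, target=1)        # about  0.143
err4 = hidden_error(0.622, [(err5, -0.3)])  # about -0.010
err3 = hidden_error(0.549, [(err5, 0.2)])   # about  0.007
print(err5, err4, err3)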
2. Calculate Net Input and Output in the Hidden Layer and Output Layer as shown in Table 15.
Table 15: Net Input and Output Calculation in the Hidden Layer and Output Layer
Unit j | Net Input Ij | Output Oj
X3 | I3 = X1·W13 + X2·W23 + X0·θ3 = 1 × (-0.194) + 0 × 0.2 + 1 × 0.405 = 0.211 | O3 = 1 / (1 + e^(-I3)) = 1 / (1 + e^(-0.211)) = 0.552
X4 | I4 = X1·W14 + X2·W24 + X0·θ4 = 1 × 0.392 + 0 × (-0.3) + 1 × 0.092 = 0.484 | O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(-0.484)) = 0.618
X5 | I5 = O3·W35 + O4·W45 + X0·θ5 = 0.552 × 0.154 + 0.618 × (-0.288) + 1 × (-0.185) = -0.282 | O5 = 1 / (1 + e^(-I5)) = 1 / (1 + e^(0.282)) = 0.429
Consider a network architecture with 4 input units and 2 output units, and four training samples, each a vector of length 4.
Training samples
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]
Identify an algorithm that can learn without supervision. How are the samples clustered as expected?
Solution:
Use the Self-Organizing Feature Map (SOFM).
Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]
Iteration 3:
Training Sample X3: (1, 0, 0, 1)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.08]
Unit 2: [0.12 0.2 0.76 0.84]
Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.08]
Unit 2: [0.65 0.08 0.3 0.94]
This process is continued for many epochs until the feature map doesn’t change.
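A minimal sketch of the competitive (winner-take-all) update used in this example, with the initial weight matrix and learning rate η = 0.6 given above; the neighborhood function is omitted, so this is the simplified update of the worked solution rather than a full SOM implementation:

import numpy as np

weights = np.array([[0.2, 0.8, 0.5, 0.1],    # Unit 1
                    [0.3, 0.5, 0.4, 0.6]])   # Unit 2
eta = 0.6

def train_step(x, weights, eta):
    # the winner is the unit whose weight vector is closest (Euclidean) to the input
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # move only the winning unit's weights towards the input sample
    weights[winner] += eta * (x - weights[winner])
    return winner

x1 = np.array([1.0, 1.0, 1.0, 0.0])          # training sample i1
print(train_step(x1, weights, eta))          # Unit 1 (index 0) wins for this sample
print(weights)                               # the winner's weights have moved towards i1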
Learning Rules
Learning in NN is performed by adjusting the network weights in order to minimize the
difference between the desired and estimated output.
TYPES OF ANN
1. Feed Forward Neural Network
2. Fully connected Neural Network
3. Multilayer Perceptron
4. Feedback Neural Network
Feed Forward Neural Network:
In its simplest form, a feed-forward neural network is a single-layer perceptron. In this model, a sequence of inputs enters the layer and is multiplied by the weights.
If the sum of the values is more than a predetermined threshold, which is normally set at zero, the output
value is usually 1, and if the sum is less than the threshold, the output value is usually -1.
The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification.
The model may or may not contain hidden layers, and there is no backpropagation.
Based on the number of hidden layers, they are further classified into single-layered and multilayered feed-forward networks.
• A fully connected neural network consists of a series of fully connected layers that connect every neuron in one layer to every neuron in the next layer.
• The major advantage of fully connected networks is that they are "structure agnostic", i.e., no special assumptions need to be made about the input.
Multilayer Perceptron:
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output layer with a single node for each output, and it can have any number of hidden layers, each with any number of nodes.
Information flows forward through the network; during training, the error is propagated backwards.
The weight adjustment training is done via backpropagation.
Typically, every node in the multi-layer perceptron uses a sigmoid activation function, which takes real values as input and converts them to numbers between 0 and 1 using the sigmoid formula.
Consider a large arithmetic problem, for example 5 × 3 + 2 × 4 + 8 × 2, given to a group of students. To solve this problem, we can break it down into smaller parts and give them to each of the students. One student can solve the first part of the expression, "5 × 3 = 15", another student can solve the second part, "2 × 4 = 8", and the third student can solve the third part, "8 × 2 = 16".
Finally, we can simplify it to 15 + 8 + 16. In the same way, one of the students in the group can solve "15 + 8 = 23" and another one can solve "23 + 16 = 39", and that is the answer.
So here we are breaking down the large math problem into different sections and giving them to
each of the students who are just doing really simple calculations, but as a result of the teamwork,
they can solve the problem efficiently.
This is exactly the idea of how a multi-layer perceptron (MLP) works. Each neuron in the MLP is
like a student in the group, and each neuron is only able to perform simple arithmetic operations.
However, when these neurons are connected and work together, they can solve complex problems.
The principal weakness of the perceptron was that it could only solve problems that were linearly separable.
A multilayer perceptron (MLP) is a fully connected feed-forward artificial neural network with at least three layers: input, output, and at least one hidden layer.
The mapping between inputs and output is non-linear. (Ex: XOR gate)
While in a perceptron the neuron must have an activation function that imposes a threshold, like ReLU or sigmoid, neurons in a multilayer perceptron can use any arbitrary activation function.
MLP networks use backpropagation for supervised learning.
The activation functions used in the layers can be linear or non-linear depending on the type of problem.
NOTE : In each iteration, after the weighted sums are forwarded through all layers, the gradient of
the Mean Squared Error is computed across all input and output pairs. Then, to propagate it back,
the weights of the first hidden layer are updated with the value of the gradient. That’s how the
weights are propagated back to the starting point of the neural network.
This process keeps going until the gradient for each input-output pair has converged, meaning the newly computed gradient has not changed more than a specified convergence threshold compared to the previous iteration.
Works in 2 stages.
1. Forward phase
2. Backward phase
Starting from initial random weights, multi-layer perceptron (MLP) minimizes the loss function
by repeatedly updating these weights. After computing the loss, a backward pass propagates it
from the output layer to the previous layers, providing each weight parameter with an update
value meant to decrease the loss.
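As an illustrative scikit-learn sketch (the layer size, solver and other settings are arbitrary choices), an MLP with one hidden layer can learn the XOR function that a single perceptron cannot:

from sklearn.neural_network import MLPClassifier

# XOR truth table
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# one hidden layer with a few sigmoid units; gradients are computed by backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='logistic',
                    solver='lbfgs', max_iter=10000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))   # ideally [0 1 1 0]; such a tiny network may need another random_state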
ALGORITHM
Radial Basis Function Neural Network
These networks have a fundamentally different architecture from most neural network architectures. Most neural network architectures consist of many layers and introduce nonlinearity by repetitively applying nonlinear activation functions.
An RBF network, on the other hand, consists of only an input layer, a single hidden layer, and an output layer.
The input layer is not a computation layer; it just receives the input data and feeds it into the special hidden layer of the RBF network. The computation that happens inside the hidden layer is very different from most neural networks, and this is where the power of the RBF network comes from.
The output layer performs the prediction task, such as classification or regression.
RBF neural networks are conceptually similar to K-Nearest Neighbor (k-NN) models.
They are useful for interpolation, function approximation, time series prediction and classification.
RBFNN Architecture :
Self-organizing Feature Map
SOM is trained using unsupervised learning.
SOM does not learn by backpropagation with Stochastic Gradient Descent (SGD); it uses competitive learning to adjust the weights of its neurons. Artificial neural networks often utilize competitive learning models to classify input without the use of labeled data.
Uses: dimensionality reduction, by creating a spatially organized representation of the data; it also helps us discover the correlations between data.
Self-organizing maps have two layers: the first one is the input layer and the second one is the output layer, or feature map.
SOM does not have an activation function in its neurons; the weighted inputs are passed directly to the output layer without further processing.
Network Architecture and operations
It consists of 2 layers:
1. Input layer
2. Output layer
No Hidden layer.
The mapping process of the Self-Organizing Map begins with the initialization of the weight vectors.
A sample vector is then chosen at random, and the map of weight vectors is examined to determine which weight most accurately represents that sample. Each weight vector has neighboring weights that are close to it. The chosen (winning) weight is allowed to become more like the random sample vector. This encourages the map to grow and take on new forms. In a 2D feature space, the nodes typically form hexagonal or square grids. This entire process is repeated a large number of times, typically more than 1,000 times.
• To determine which node's weights are most similar to the input vector, each node is examined. The most appropriate node is called the best matching unit (BMU).
• The neighborhood of the Best Matching Unit is then determined. Over time, the number of neighbors tends to decrease.
The winning weight evolves to become more like the sample vector, and the surrounding weights change in a similar way. The closer a node is to the Best Matching Unit (BMU), the more its weights change, and the farther it is from the BMU, the less it changes.
Repeat this step for N iterations.
Advantages and Disadvantages of ANN
Limitations of ANN
Challenges of Artificial Neural Networks
Chapter 13
CLUSTERING ALGORITHMS
• Clustering: the process of grouping a set of objects into classes of similar objects
• Documents within a cluster should be similar.
• Finding similarities between data according to the characteristics found in the data and
grouping similar data objects into clusters.
• Example: the figure below shows data points with two features, drawn as differently shaded samples.
If there are only a few features, the examples can be grouped manually; but when the examples have many features, this cannot be done manually, so automatic clustering is required.
Applications of Clustering
Advantages and Disadvantages
PROXIMITY MEASURES
Clustering algorithms need a measure to find the similarity or dissimilarity among the
objects to group them. Similarity and Dissimilarity are collectively known as proximity
measures. This is used by a number of data mining techniques, such as clustering,
nearest neighbour classification, and anomaly detection.
Distance measures are known as dissimilarity measures, as these indicate how one object is different from another.
Measures like cosine similarity indicate the similarity among objects.
Distance measures and similarity measures are two sides of the same coin: more distance indicates less similarity, and vice versa.
If all the conditions (such as non-negativity, symmetry and the triangle inequality) are satisfied, then the distance measure is called a metric.
Some of proximity measures:
1. Quantitative variables
a) Euclidean distance: It is one of the most important and commonly used distance measures. It is also called the L2 norm.
Advantage: The distance does not change with the addition of new objects.
Disadvantages: i) If the unit changes, the resulting Euclidean or squared Euclidean distance changes drastically. ii) Computational complexity is high, because it involves square and square root operations.
b) City Block Distance: Also known as Manhattan Distance or the L1 norm.
Example
Suppose there are two strings, 1101 1001 and 1001 1101.
11011001 ⊕ 10011101 = 01000100. Since this contains two 1s, the Hamming distance is d(11011001, 10011101) = 2.
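A small NumPy sketch of the quantitative distance measures mentioned above (Euclidean/L2, Manhattan/L1 and Chebyshev) together with the Hamming distance for binary strings; the numbers reuse the examples in this section:

import numpy as np

def euclidean(a, b): return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))  # L2 norm
def manhattan(a, b): return np.sum(np.abs(np.asarray(a) - np.asarray(b)))           # L1 / city block
def chebyshev(a, b): return np.max(np.abs(np.asarray(a) - np.asarray(b)))           # maximum coordinate difference
def hamming(s1, s2): return sum(c1 != c2 for c1, c2 in zip(s1, s2))                  # number of differing positions

print(euclidean([2, 2, 9], [7, 8, 9]))   # sqrt(61) = 7.81
print(manhattan([2, 2, 9], [7, 8, 9]))   # 11
print(hamming("11011001", "10011101"))   # 2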
2. Categorical variables
Ordinal Variables
Cosine Similarity
🞂 Cosine similarity is a metric used to measure how similar the documents are
irrespective of their size.
🞂 It measures the cosine of the angle between two vectors projected in a multi-
dimensional space.
🞂 The cosine similarity is advantageous because even if the two similar
documents are far apart by the Euclidean distance (due to the size of the
document), chances are they may still be oriented closer together.
🞂 The smaller the angle, the higher the cosine similarity.
🞂 Consider 2 documents P1 and P2.
◦ If distance is more, then less similar.
◦ If distance is less, then more similar.
1. Consider the following data and calculate the Euclidean, Manhattan and Chebyshev distances.
a. (2 3 4) and (1 5 6)
Solution
Euclidean distance = √((2 - 1)² + (3 - 5)² + (4 - 6)²) = √(1 + 4 + 4) = √9 = 3
Manhattan distance = |2 - 1| + |3 - 5| + |4 - 6| = 1 + 2 + 2 = 5
Chebyshev distance = max(1, 2, 2) = 2
b. (2 2 9) and (7 8 9)
Euclidean distance = √((2 - 7)² + (2 - 8)² + (9 - 9)²) = √(25 + 36 + 0) = √61 = 7.81
Manhattan distance = |2 - 7| + |2 - 8| + |9 - 9| = 5 + 6 + 0 = 11
Chebyshev distance = max(5, 6, 0) = 6
2. Find the cosine similarity, SMC and Jaccard coefficients for the following binary data:
a. (1 0 1 1) and (1 1 0 0)
Solution
1 0 1 1
1 1 0 0
Here a (0-0 matches) = 0, b (0-1 mismatches) = 1, c (1-0 mismatches) = 2, d (1-1 matches) = 1.
SMC = (a + d) / (a + b + c + d) = 1/4 = 0.25
Jaccard coefficient = d / (b + c + d) = 1/4 = 0.25
Cosine similarity = (1·1 + 0·1 + 1·0 + 1·0) / (√3 × √2) = 1/2.449 = 0.41
b. (1 0 0 0 1) and (1 1 0 0 0)
Solution
1 0 0 0 1
1 1 0 0 0
Here a = 2, b = 1, c = 1, d = 1.
SMC = (a + d) / (a + b + c + d) = 3/5 = 0.6
Jaccard coefficient = d / (b + c + d) = 1/3 = 0.33
Cosine similarity = (1·1 + 0·1 + 0·0 + 0·0 + 1·0) / (√2 × √2) = 1/2 = 0.5
3. Find Hamming distance for the following binary data:
a. (1 1 1) and (1 0 0)
Solution
It differs in two positions; therefore Hamming distance is 2
b. (1 1 1 0 0) and (0 0 1 1 1)
Solution
It differs in four positions; therefore, Hamming distance is 4
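A short sketch of the binary proximity measures used in problem 2 (SMC, Jaccard coefficient and cosine similarity), reproducing the values of part (b); the a, b, c, d counts follow the same convention as above:

import math

def binary_counts(x, y):
    # a = 0-0 matches, b = 0-1 mismatches, c = 1-0 mismatches, d = 1-1 matches
    a = sum(xi == 0 and yi == 0 for xi, yi in zip(x, y))
    b = sum(xi == 0 and yi == 1 for xi, yi in zip(x, y))
    c = sum(xi == 1 and yi == 0 for xi, yi in zip(x, y))
    d = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
    return a, b, c, d

def smc(x, y):
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, d = binary_counts(x, y)
    return d / (b + c + d)

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y)))

x, y = [1, 0, 0, 0, 1], [1, 1, 0, 0, 0]
print(smc(x, y), jaccard(x, y), cosine(x, y))   # 0.6, 0.33, 0.5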
a. (yellow, red, green) and (red, green, yellow)
Solution
For ordinal variables, each value is replaced by its rank and the difference is normalized by (n - 1), where n is the number of possible values. Taking yellow = 1, red = 2, green = 3 and n = 3:
Distance between (yellow, red) = |1 - 2| / 2 = 0.5
Distance between (red, green) = |2 - 3| / 2 = 0.5
Distance between (green, yellow) = |3 - 1| / 2 = 1
Therefore, the distance between (yellow, red, green) and (red, green, yellow) is (0.5, 0.5, 1).
b. (bread, butter, milk) and (milk, sandwich, Tea)
Solution
Taking bread = 1, butter = 2, milk = 3, sandwich = 4, tea = 5 and n = 5:
Distance between (bread, milk) = |1 - 3| / (5 - 1) = 2/4 = 1/2
Distance between (butter, sandwich) = |2 - 4| / (5 - 1) = 2/4 = 1/2
Distance between (milk, tea) = |3 - 5| / (5 - 1) = 2/4 = 1/2
Therefore, the distance between (bread, butter, milk) and (milk, sandwich, tea) is (1/2, 1/2, 1/2).
🞂 Hierarchical Clustering Algorithms
Hierarchical clustering involves creating clusters that have a predetermined ordering
from top to bottom.
For example, all files and folders on the hard disk are organized in a hierarchy.
Hierarchical relationships are shown in the form of a dendrogram.
There are two types of hierarchical clustering.
◦ Divisive and Agglomerative.
🞂 The following three methods differ in how the distance between each cluster is
measured.
1. Single Linkage
2. Average Linkage
3. Complete Linkage
Single Linkage or MIN algorithm
In single linkage hierarchical clustering, the distance between two clusters is
defined as the shortest distance between two points in each cluster. For example,
the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two closest points.
🞂 Complete Linkage : In complete linkage hierarchical clustering, the distance between
two clusters is defined as the longest distance between two points in each cluster. For
example, the distance between clusters “r” and “s” to the left is equal to the length of the
arrow between their two furthest points.
OR
🞂 Average Linkage: In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters "r" and "s" is equal to the average length of the arrows connecting the points of one cluster to those of the other.
Mean-Shift Algorithm
Use the following dataset, apply the hierarchical methods and show the dendrogram.
SNo. | X | Y
1 | 3 | 5
2 | 7 | 8
3 | 12 | 5
4 | 16 | 9
5 | 20 | 8
Solution
The similarity table among the objects is computed using the Euclidean distance and is shown in the following table.
Objects | 1 | 2 | 3 | 4 | 5
1 | - | 5 | 9 | 9.85 | 17.26
2 | | - | 5.83 | 9.49 | 13
3 | | | - | 5.66 | 8.94
4 | | | | - | 4.12
5 | | | | | -
The minimum distance is 4.12. Therefore, the items 1 and 4 are clustered together. The resultant table is given as follows.
Clusters | {1,4} | 2 | 3 | 5
{1,4} | - | 5 | 5.66 | 4.12
2 | | - | 5.83 | 13
3 | | | - | 8.94
5 | | | | -
The distance between the group {1,4} and the items 2, 3, 5 is computed using the single-linkage (minimum) formula.
The distance between {1,4} and {2} is: min{ dist(1,2), dist(4,2) } = 5
The distance between {1,4} and {3} is: min{ dist(1,3), dist(4,3) } = min{9, 5.66} = 5.66
The distance between {1,4} and {5} is: min{ dist(1,5), dist(4,5) } = min{17.26, 4.12} = 4.12
The minimum distance in the above table is 4.12. Therefore, {1,4} and {5} are combined. This results in the following table.
Clusters | {1,4,5} | 2 | 3
{1,4,5} | - | 5 | 5.66
2 | | - | 5.83
3 | | | -
The minimum is 5. Therefore, {1,4,5} and {2} are combined, and finally the result is combined with {3}.
Therefore, the order of clustering is {1,4}, then {5}, then {2} and finally {3}.
Complete Linkage or MAX or Clique
Here, from the first iteration table, the minimum is taken and {1,4} is combined. The distances between clusters are then recomputed using the maximum (complete-link) distance. In the resulting table the minimum is d(3,5) = 8.94, so {3,5} is combined. In the next table the minimum is 9.49, so {1,4} and {2} are combined. The order of clustering is therefore {1,4}, then {3,5}, and then {1,4} with {2}.
Hint: The same procedure is used for the average-link algorithm, where the average distance of all pairs of points across the clusters is used to form clusters.
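A hedged SciPy sketch (assuming scipy and matplotlib are installed) that applies agglomerative clustering to the five sample points above and draws the dendrogram; changing method selects single, complete or average linkage:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.array([[3, 5], [7, 8], [12, 5], [16, 9], [20, 8]])   # samples 1..5

Z = linkage(points, method='single')      # use 'complete' or 'average' for the other variants
dendrogram(Z, labels=[1, 2, 3, 4, 5])     # the hierarchy is shown as a dendrogram
plt.show()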
Consider the data shown in the following table. Use the k-means algorithm with k = 2 and show the result.
Table Sample Data
SNO X Y
1. 3 5
2. 7 8
3. 12 5
4. 16 9
Solution
Let us assume the seed points are (3,5) and (16,9). This is shown in the following table
as starting clusters.
Cluster 1 Cluster 2
(3,5) (16,9)
Iteration 1: Compare each data point (sample) with the two centroids and assign it to the nearest one.
Take the sample object 2 and compare it with the two centroids as follows:
Dist(2, centroid 1) = √((7 - 3)² + (8 - 5)²) = √(16 + 9) = √25 = 5
Dist(2, centroid 2) = √((7 - 16)² + (8 - 9)²) = √(81 + 1) = √82 = 9.05
Object 2 is closer to the centroid of cluster 1 and is hence assigned to cluster 1. For object 3:
Dist(3, centroid 1) = √((12 - 3)² + (5 - 5)²) = √81 = 9
Dist(3, centroid 2) = √((12 - 16)² + (5 - 9)²) = √(16 + 16) = √32 = 5.66
Object 3 is closer to the centroid of cluster 2 and is hence assigned to cluster 2. After this iteration, the clusters are:
Cluster 1 | Cluster 2
(3,5) | (12,5)
(7,8) | (16,9)
Iteration 2: Recompute the centroids of the two clusters, giving centroid 1 = (5, 6.5) and centroid 2 = (14, 7), and compare each object with them again. For example, for object 2 = (7, 8):
Dist(2, centroid 1) = √((7 - 5)² + (8 - 6.5)²) = √6.25 = 2.5
Dist(2, centroid 2) = √((7 - 14)² + (8 - 7)²) = √(49 + 1) = √50 = 7.07
Object 2 is closer to the centroid of cluster 1 and hence remains in the same cluster. The remaining objects are compared in the same way, and the process is repeated until the cluster assignments no longer change.
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-
Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the number
of pre-defined groups. The cluster center is created in such a way that the distance between the
data points of one cluster is minimum as compared to another cluster centroid.
SSE = Σ (i = 1 to k) Σ (x ∈ Ci) dist(ci, x)²
where ci is the centroid of cluster Ci.
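A scikit-learn sketch of k-means with k = 2 on the sample data of the worked example, using the same seed points (3, 5) and (16, 9) as initial centroids; inertia_ is scikit-learn's name for the SSE defined above:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]])
init_centroids = np.array([[3, 5], [16, 9]])    # the seed points chosen above

km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(km.labels_)            # cluster assignment of each point
print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # SSE: sum of squared distances to the nearest centroid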
A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region
of high point density, separated from other such clusters by contiguous regions of low point
density.
There are three types of points after the DBSCAN clustering is complete:
• Core — This is a point that has at least m points within distance n from itself.
• Border — This is a point that has at least one Core point within distance n.
• Noise — This is a point that is neither a Core nor a Border. And it has less than m points
within distance n from itself.
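A minimal scikit-learn DBSCAN sketch; eps plays the role of the distance n and min_samples the role of the point count m described above, and the data is made up purely for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups plus one far-away point that should come out as noise
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [8, 8], [8, 9], [9, 8], [9, 9],
              [25, 25]])

db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)                # label -1 marks Noise points
print(db.core_sample_indices_)   # indices of the Core points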
Grid-Based Approaches
A grid-based clustering method takes a space-driven approach by partitioning the embedding space into cells, independently of the distribution of the input objects.
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the
object space into a finite number of cells that form a grid structure on which all of the
operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically independent
of the number of data objects, yet dependent on only the number of cells.
Subspace Clustering
CLIQUE is a density-based and grid-based subspace clustering algorithm, useful for finding clusters in subspaces.
Concept of Dense cell
CLIQUE partitions each dimension into several overlapping intervals and thereby partitions the data space into cells. The algorithm then determines whether a cell is dense or sparse; a cell is considered dense if its density exceeds a threshold value.
Density is defined as the ratio of the number of points to the volume of the region. In one pass, the algorithm finds the number of cells, the number of points, etc., and then combines the dense cells. For that, the algorithm uses contiguous intervals and a set of dense cells.
MONOTONICITY Property
CLIQUE uses the anti-monotonicity (Apriori) property. It means that all the subsets of a frequent itemset are frequent. Similarly, if a subset is infrequent, then its supersets are infrequent.
Two popular probability model-based clustering methods are Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs). Apart from these, there are other models:
1. Fuzzy Clustering
2. EM algorithm
Fuzzy Clustering :
Fuzzy Clustering is a type of clustering algorithm in machine learning that allows a data point to belong
to more than one cluster with different degrees of membership. Unlike traditional clustering algorithms,
such as k-means or hierarchical clustering, which assign each data point to a single cluster, fuzzy
clustering assigns a membership degree between 0 and 1 for each data point for each cluster.
Let us consider clusters ci and cj; an element, say x, can belong to both clusters. The strength of the association of an object with a cluster is given as wij. The value of wij lies between 0 and 1, and the sum of the weights of an object over all clusters is 1.
Expectation Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a statistical method used for estimating
parameters in statistical models when you have incomplete or missing data. It's commonly used
in unsupervised machine learning tasks such as clustering and Gaussian Mixture Model (GMM)
fitting.
Given a mix of distributions, data can be generated by randomly picking a distribution and
generating the point. Gaussian distribution is a bell shaped curve.
1. Initialization: Start with initial estimates of the model parameters. These initial values can be
random or based on some prior knowledge.
2. E-step (Expectation):
• In this step, you compute the expected values (expectation) of the latent (unobserved)
variables given the observed data and the current parameter estimates.
• This involves calculating the posterior probabilities or likelihoods of the missing data or
latent variables.
• Essentially, you're estimating how likely each possible value of the latent variable is,
given the current model parameters.
3. M-step (Maximization):
• In this step, you update the model parameters to maximize the expected log-likelihood
found in the E-step.
• This involves finding the parameters that make the observed data most likely given the
estimated values of the latent variables.
• The M-step involves solving an optimization problem to find the new parameter values.
4. Iteration:
• Repeat the E-step and M-step alternately until convergence criteria are met. Common
convergence criteria include a maximum number of iterations, a small change in
parameter values, or a small change in the likelihood.
5. Termination:
• Once the EM algorithm converges, you have estimates of the model parameters that
maximize the likelihood of the observed data.
6. Result:
• The final parameter estimates can be used for various purposes, such as clustering,
density estimation, or imputing missing data.
The EM algorithm is widely used in various fields, including machine learning, image
processing, and bioinformatics.
One of its notable applications is in Gaussian Mixture Models (GMMs), where it's used to
estimate the means and covariances of Gaussian distributions that are mixed to model
complex data distributions.
It's important to note that the EM algorithm can sometimes get stuck in local optima, so the
choice of initial parameter values can affect the results. To mitigate this, you may run the
algorithm multiple times with different initializations and select the best result.
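A scikit-learn sketch of the EM algorithm in action through a Gaussian Mixture Model; the data is synthetic (drawn from two assumed Gaussians with means 0 and 5) purely to illustrate the fit:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic 1-D data drawn from two Gaussians
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

# GaussianMixture runs EM internally (E-step and M-step repeated until convergence)
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(data)
print(gmm.means_.ravel())   # estimated means, roughly 0 and 5
print(gmm.weights_)         # mixing proportions, roughly 0.5 each
print(gmm.converged_)       # True if EM converged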
Here, α and β are parameters. The Dunn index is a useful measure that can combine both cohesion and separation.
Silhouette Coefficient
This metric measures how well each data point fits into its assigned cluster and ranges from -1 to
1. A high silhouette coefficient indicates that the data points are well-clustered, while a low
coefficient indicates that the data points may be assigned to the wrong cluster.
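A short scikit-learn sketch computing the silhouette coefficient for a k-means clustering, reusing the small sample dataset from the k-means example above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[3, 5], [7, 8], [12, 5], [16, 9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# mean silhouette over all points: close to +1 means well clustered, near 0 means overlapping,
# negative values suggest points assigned to the wrong cluster
print(silhouette_score(X, labels))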