ML Unsupervised Notes
Instructions:
• Read this study material carefully and make your own handwritten short notes. (Short
notes must not be more than 5-6 pages)
• Revise this material at least 5 times and once you have prepared your short notes, then
revise your short notes twice a week
• If you are not able to understand any topic or require a detailed explanation,
please mention it in our discussion forum on the website.
• Let me know if there are any typos or mistakes in the study material.
Mail me at piyushwairale100@gmail.com
• Unlike supervised learning, unsupervised machine learning models are given unlabeled
data and allowed to discover patterns and insights without any explicit guidance or
instruction.
• Unsupervised learning is a machine learning paradigm that deals with unlabeled data
and aims to group similar data items into clusters. It differs from supervised learning,
where labeled data is used for classification or regression tasks. Unsupervised learning
has applications in text clustering and other domains and can be adapted for supervised
learning when necessary.
• Unsupervised learning doesn’t refer to a specific algorithm but rather a general frame-
work. The process involves deciding on the number of clusters, initializing cluster
prototypes, and iteratively assigning data items to clusters based on similarity. These
clusters are then updated until convergence is achieved.
1.1 Working
As the name suggests, unsupervised learning uses self-learning algorithms that learn without any labels or prior training. Instead, the model is given raw, unlabeled data and has
to infer its own rules and structure the information based on similarities, differences, and
patterns without explicit instructions on how to work with each piece of data.
Unsupervised learning algorithms are better suited for more complex processing tasks,
such as organizing large datasets into clusters. They are useful for identifying previously
undetected patterns in data and can help identify features useful for categorizing data.
Imagine that you have a large dataset about weather. An unsupervised learning algorithm will go through the data and identify patterns in the data points. For instance, it
might group data by temperature or similar weather patterns.
While the algorithm itself does not understand these patterns based on any previous infor-
mation you provided, you can then go through the data groupings and attempt to classify
them based on your understanding of the dataset. For instance, you might recognize that
the different temperature groups represent all four seasons or that the weather patterns are
separated into different types of weather, such as rain, sleet, or snow.
1. Clustering
Clustering is a technique for exploring raw, unlabeled data and breaking it down into
groups (or clusters) based on similarities or differences. It is used in a variety of
applications, including customer segmentation, fraud detection, and image analysis.
Clustering algorithms split data into natural groups by finding similar structures or
patterns in uncategorized data.
Clustering is one of the most popular unsupervised machine learning approaches. There
are several types of unsupervised learning algorithms that are used for clustering, which
include exclusive, overlapping, hierarchical, and probabilistic.
2. Association
Association rule mining is a rule-based approach to reveal interesting relationships
between data points in large datasets. Unsupervised learning algorithms search for frequent if-then associations, also called rules, to discover correlations and co-occurrences
within the data and the different connections between data objects.
It is most commonly used to analyze retail baskets or transactional datasets to represent
how often certain items are purchased together. These algorithms uncover customer
purchasing patterns and previously hidden relationships between products that help
inform recommendation engines or other cross-selling opportunities. You might be
most familiar with these rules from the “Frequently bought together” and “People
who bought this item also bought” sections on your favorite online retail shop.
Association rules are also often used to organize medical datasets for clinical diagnoses.
Using unsupervised machine learning and association rules can help doctors identify
the probability of a specific diagnosis by comparing relationships between symptoms
from past patient cases.
Typically, Apriori algorithms are the most widely used for association rule learning to identify related collections of items, or itemsets. However, other types are used, such as the Eclat and FP-growth algorithms. (A small sketch of the underlying support and confidence measures appears after this list.)
3. Dimensionality reduction
Dimensionality reduction is an unsupervised learning technique that reduces the number of features, or dimensions, in a dataset. More data is generally better for machine learning, but it can also make it more challenging to visualize the data.
Dimensionality reduction extracts important features from the dataset, reducing the number of irrelevant or random features present. This method uses principal component analysis (PCA) and singular value decomposition (SVD) algorithms to reduce the number of data inputs without compromising the integrity of the properties in the original data.
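Returning to association rules (item 2 above): the short Python sketch below shows how the support and confidence measures behind if-then rules are computed on a tiny, made-up basket dataset. It illustrates only the underlying measures, not the Apriori, Eclat, or FP-growth algorithms themselves, and the item names and transactions are invented purely for illustration.

from itertools import combinations
from collections import Counter

# Toy transaction data (hypothetical baskets, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
n = len(transactions)

# Count how often each item and each pair of items occurs together.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions for pair in combinations(sorted(t), 2))

# Support of {X, Y} = fraction of transactions containing both items.
# Confidence of X -> Y = support({X, Y}) / support({X}).
for (x, y), count in pair_counts.items():
    support = count / n
    confidence = count / item_counts[x]
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")

Only one direction of each rule is shown; the reverse rule Y -> X would use the count of Y in the denominator instead.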
Commonly used clustering methods include:
• Hierarchical clustering
• Partitioning clustering
• Agglomerative clustering
• Divisive clustering
• K-Means clustering
There are many different fields where cluster analysis is used effectively, such as
• Text data mining: this includes tasks such as text categorization, text clustering,
document summarization, concept extraction, sentiment analysis, and entity relation
modelling
• Data mining: simplify the data mining task by grouping a large number of features
from an extremely large data set to make the analysis manageable
2 K-Means Clustering
K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. First, clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects such that all the objects in one cluster have similar traits. In partition clustering, a group of n objects is broken down into k clusters based on their similarities.
• It aims to find cluster centers (centroids) and assign data points to the nearest centroid
based on their similarity.
• K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
• Typically, unsupervised algorithms make inferences from datasets using only input
vectors without referring to known, or labelled, outcomes.
• We first assign each data point of the given data to the nearest centroid. Next, we
calculate the mean of each cluster and the means then serve as the new centroids. This
step is repeated until the positions of the centroids do not change anymore.
• The goal of k-means is to minimize the sum of the squared distances between each data point and its assigned cluster centroid:
J = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \| x_{ij} - c_i \|^2
where:
• K is the number of clusters,
• n_i is the number of data points assigned to cluster i,
• x_{ij} is the j-th data point in cluster i, and
• c_i is the centroid of cluster i.
The algorithm proceeds as follows (a minimal code sketch in Python appears below):
• Initialization: Choose the number of clusters (K) you want to create. Initialize K
cluster centroids randomly. These centroids can be selected from the data points or
with random values.
• Assignment Step: For each data point in the dataset, calculate the distance between
the data point and all K centroids. Assign the data point to the cluster associated with
the nearest centroid. This step groups data points into clusters based on similarity.
• Update Step: Recalculate the centroids for each cluster by taking the mean of all
data points assigned to that cluster. The new centroids represent the center of their
respective clusters.
• Results: The result of the K-Means algorithm is K clusters, each with its centroid.
Data points are assigned to the nearest cluster, and you can use these clusters for
various purposes, such as data analysis, segmentation, or pattern recognition.
• The choice of the number of clusters (K) is a critical decision and often requires do-
main knowledge or experimentation. Various methods, such as the elbow method or
silhouette score, can help in determining an optimal K value.
• K-Means is computationally efficient and works well for large datasets, but it may not
perform well on data with irregularly shaped or non-convex clusters.
• The algorithm may converge to a local minimum, and it’s not guaranteed to find the
global optimum.
• Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to
implement.
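The initialization, assignment, and update steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the procedure under the usual Euclidean-distance assumption, not a production implementation; in practice a library routine such as scikit-learn's KMeans would normally be used.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # X: (n, d) array of data points; k: number of clusters.
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Convergence: centroid positions no longer change.
        centroids = new_centroids
    # Final assignment and objective J (sum of squared distances to centroids).
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    J = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, J

# Example usage on random 2-D data.
X = np.random.default_rng(1).normal(size=(100, 2))
labels, centroids, J = kmeans(X, k=3)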
• The k-medoids clustering method is similar to k-means, but rather than minimizing the sum of squared distances to centroids, it minimizes the sum of dissimilarities between each data point and the medoid of its cluster.
• We find this useful since k-medoids aims to form clusters where the objects within each
cluster are more similar to each other and dissimilar to objects in the other clusters.
Instead of centroids, this approach makes use of medoids.
• Medoids are points in the dataset whose sum of distances to other points in the cluster
is minimal.
PAM (Partitioning Around Medoids) is the most powerful of the three k-medoids algorithms but has the disadvantage of high time complexity. The K-Medoids clustering described here is performed using PAM; in the further parts, we'll see what CLARA and CLARANS are. (A simplified code sketch follows.)
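The full PAM swap search is more involved, so the sketch below uses a simpler alternating-update variant: assign points to the nearest medoid, then move each medoid to the cluster member with the smallest total distance to the other members. It illustrates the idea of medoids but is not an implementation of PAM, CLARA, or CLARANS; the Euclidean distance matrix here is an assumption, and any dissimilarity measure could be substituted.

import numpy as np

def k_medoids(X, k, n_iters=50, seed=0):
    # Simple alternating-update k-medoids sketch (not the full PAM swap search).
    rng = np.random.default_rng(seed)
    # Pairwise distance matrix between all points (Euclidean for illustration).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iters):
        # Assign each point to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # For each cluster, pick the member with the smallest total distance
        # to the other members as the new medoid.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break  # Medoids stopped changing.
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids

X = np.random.default_rng(2).normal(size=(60, 2))
labels, medoids = k_medoids(X, k=3)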
3 Hierarchical clustering
Till now, we have discussed the various methods for partitioning the data into different clusters. But there are situations when the data needs to be partitioned into groups at different levels, such as in a hierarchy. Hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.
For example, in a machine learning problem of organizing employees of a university in different departments, first the employees are grouped under the different departments in the
university, and then within each department, the employees can be grouped according to
their roles such as professors, assistant professors, supervisors, lab assistants, etc. This
creates a hierarchical structure of the employee data and eases visualization and analysis.
Similarly, there may be a data set which has an underlying hierarchical structure that we want to discover, and we can use the hierarchical clustering methods to achieve that.
• The hierarchical clustering produces clusters in which the clusters at each level of the
hierarchy are created by merging clusters at the next lower level.
• At the lowest level, each cluster contains a single observation. At the highest level
there is only one cluster containing all of the data.
• The decision regarding whether two clusters are to be merged or not is taken based on
the measure of dissimilarity between the clusters. The distance between two clusters
is usually taken as the measure of dissimilarity between the clusters.
• The choice of linkage method and distance metric can significantly impact the results
and the structure of the dendrogram.
• Dendrograms are useful for visualizing the hierarchy and helping to decide how many
clusters are appropriate for a particular application.
3.1.1 Dendrogram
• Hierarchical clustering can be represented by a rooted binary tree. The nodes of the tree represent groups or clusters. The root node represents the entire data set. The
terminal nodes each represent one of the individual observations (singleton clusters).
Each nonterminal node has two daughter nodes.
• The distance between merged clusters is monotone increasing with the level of the
merger. The height of each node above the level of the terminal nodes in the tree is
proportional to the value of the distance between its two daughters (see Figure 13.9).
• The dendrogram may be drawn with the root node at the top and the branches growing
vertically downwards (see Figure 13.8(a)).
• It may also be drawn with the root node at the left and the branches growing horizontally rightwards (see Figure 13.8(b)).
Example: Figure 13.7 is a dendrogram of the dataset {a, b, c, d, e}. Note that the root node
represents the entire dataset and the terminal nodes represent the individual observations.
However, the dendrograms are presented in a simplified format in which only the terminal
nodes (that is, the nodes representing the singleton clusters) are explicitly displayed. Figure
13.8 shows the simplified format of the dendrogram in Figure 13.7. Figure 13.9 shows the
distances of the clusters at the various levels. Note that the clusters are at 4 levels. The
distance between the clusters {a} and {b} is 15, between {c} and {d} is 7.5, between {c, d}
and {e} is 15 and between {a, b} and {c, d, e} is 25.
Figure 13.9: A dendrogram of the dataset {a, b, c, d, e} showing the distances (heights) of the clusters at different levels.
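The example above can be encoded directly as a SciPy linkage matrix using the four merges and the distances stated in the text (7.5, 15, 15, and 25). SciPy is an assumed tool choice here, not something the notes prescribe; the sketch shows how cutting the tree at a height of 20 recovers the two clusters {a, b} and {c, d, e}.

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster

# Observations a, b, c, d, e are indexed 0..4. Each row of the linkage matrix
# is (cluster_i, cluster_j, merge distance, size of the new cluster); newly
# formed clusters get indices 5, 6, 7, ... in order of creation.
Z = np.array([
    [2, 3, 7.5, 2],   # {c, d} merged at distance 7.5   -> cluster 5
    [0, 1, 15.0, 2],  # {a, b} merged at distance 15    -> cluster 6
    [4, 5, 15.0, 3],  # {e} and {c, d} at distance 15   -> cluster 7
    [6, 7, 25.0, 5],  # {a, b} and {c, d, e} at 25      -> cluster 8 (root)
])

# Cutting the tree at height 20 yields the two clusters {a, b} and {c, d, e}.
print(fcluster(Z, t=20, criterion="distance"))  # e.g. [1 1 2 2 2]

# dendrogram(Z, labels=list("abcde")) would draw the tree (requires matplotlib);
# with no_plot=True it only returns the plotting structure.
info = dendrogram(Z, labels=list("abcde"), no_plot=True)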
If there are N observations in the dataset, there will be N − 1 levels in the hierarchy.
The pair chosen for merging consists of the two groups with the smallest “intergroup dissimilarity”. Each nonterminal node has two daughter nodes. The daughters represent the two
groups that were merged to form the parent.
• Step 1: Initialization Start with each data point as a separate cluster. If you have N data points, you initially have N clusters.
• Step 2: Cluster Distance Calculate the pairwise distances between clusters using a chosen linkage method, for example:
– Single Linkage: Merge clusters based on the minimum distance between any pair of data points from the two clusters.
• Step 3: Merge and Repeat Merge the two closest clusters, recompute the cluster distances, and repeat until all data points belong to a single cluster (or a desired number of clusters is reached).
• Step 4: Cutting the Dendrogram To determine the number of clusters, you can cut
the dendrogram at a specific level. The height at which you cut the dendrogram
corresponds to the number of clusters you obtain. The cut produces the final clusters
at the chosen level of granularity.
• Step 5: Results The resulting clusters are obtained based on the cut level. Each cluster
contains a set of data points that are similar to each other according to the chosen
linkage method.
For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the
agglomerative method as shown in Figure 13.10. Each nonterminal node has two daughter
nodes. The daughters represent the two groups that were merged to form the parent.
For example, the hierarchical clustering shown in Figure 13.7 can be constructed by the
divisive method as shown in Figure 13.11. Each nonterminal node has two daughter nodes.
The two daughters represent the two groups resulting from the split of the parent.
In both these cases, it is important to select the split and merge points carefully, because subsequent splits or mergers use the results of the previous ones, and there is no option to swap objects between clusters or rectify decisions made in earlier steps. Poor choices early on may therefore result in poor clustering quality at the end.
There are several ways in which the distance between two observations can be defined, and there are also several ways in which the distance between two groups of observations can be defined.
Often the distance measure is used to decide when to terminate the clustering algorithm. For example, in agglomerative clustering, the merging iterations may be stopped once the minimum distance between two neighbouring clusters becomes less than a user-defined threshold.
So, when an algorithm uses the minimum distance D_min to measure the distance between the clusters, it is referred to as a nearest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on D_min, then it is called a single linkage algorithm.
On the other hand, when an algorithm uses the maximum distance D_max to measure the distance between the clusters, it is referred to as a furthest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on D_max, then it is called a complete linkage algorithm.
As the minimum and maximum measures provide two extreme options for measuring the distance between clusters, they are prone to outliers and noisy data. Instead, using the mean or average distance helps avoid this problem and provides more consistent results.
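The minimum, maximum, and average inter-cluster distances just described can all be computed from the same matrix of pairwise point distances, as in the small NumPy sketch below (the array names and toy coordinates are illustrative):

import numpy as np

def cluster_distances(A, B):
    # Return (D_min, D_max, D_avg) between two clusters given as (n, d) arrays.
    # All pairwise Euclidean distances between points of A and points of B.
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return pairwise.min(), pairwise.max(), pairwise.mean()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
d_min, d_max, d_avg = cluster_distances(A, B)
# d_min = 2.0 (single linkage), d_max = 5.0 (complete linkage), d_avg = 3.5 (average linkage)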
1. Initialization: Start with each data point as an individual cluster. If you have N data
points, you initially have N clusters.
2. Cluster Distance: Calculate the pairwise distances between all clusters. The distance
between two clusters is defined as the minimum distance between any two data points,
one from each cluster.
3. Merge Clusters: Merge the two clusters with the shortest distance, as defined by
single linkage. This creates a new, larger cluster.
4. Repeat: Continue steps 2 and 3 iteratively until all data points are part of a single
cluster, or you reach a specified number of clusters.
5. Build the Dendrogram: Record each merge and the distance at which it occurred; this sequence of merges forms the dendrogram.
6. Cutting the Dendrogram: To determine the number of clusters, you can cut the dendrogram at a specific level. The height at which you cut the dendrogram corresponds to the number of clusters you obtain.
• It is sensitive to outliers and noise because a single close pair of points from different
clusters can cause a merger.
• Single linkage is just one of several linkage methods used in hierarchical clustering,
each with its own strengths and weaknesses. The choice of linkage method depends on
the nature of the data and the desired clustering outcome.
1. Initialization: Start with each data point as an individual cluster. If you have N data
points, you initially have N clusters.
2. Cluster Distance: Calculate the pairwise distances between all clusters. The distance
between two clusters is defined as the maximum distance between any two data points,
one from each cluster.
3. Merge Clusters: Merge the two clusters with the smallest complete-linkage distance, i.e., the pair whose maximum pairwise point distance is smallest. This creates a new, larger cluster.
4. Repeat: Continue steps 2 and 3 iteratively until all data points are part of a single
cluster or you reach a specified number of clusters.
5. Build the Dendrogram: Record each merge and the distance at which it occurred; this sequence of merges forms the dendrogram.
6. Cutting the Dendrogram: To determine the number of clusters, you can cut the dendrogram at a specific level. The height at which you cut the dendrogram corresponds to the number of clusters you obtain.
• It tends to produce compact, spherical clusters since it minimizes the maximum dis-
tance within clusters.
• It is less sensitive to outliers than single linkage because it focuses on the maximum
distance.
The choice of linkage method (single, complete, average, etc.) depends on the nature of the
data and the desired clustering outcome. Different linkage methods can produce different
cluster structures based on how distance is defined between clusters.
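In practice, the single- and complete-linkage procedures described above are rarely coded by hand. The sketch below uses SciPy's hierarchical clustering routines, one reasonable tool choice among several, to build both hierarchies on toy data and cut each into two flat clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated blobs of 2-D points (toy data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(3, 0.3, size=(20, 2))])

# Build the merge hierarchy with each linkage criterion.
Z_single = linkage(X, method="single")      # minimum pairwise distance between clusters
Z_complete = linkage(X, method="complete")  # maximum pairwise distance between clusters

# Cut each dendrogram into 2 flat clusters and compare the assignments.
labels_single = fcluster(Z_single, t=2, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=2, criterion="maxclust")
print(labels_single)
print(labels_complete)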
4 Dimensionality Reduction
• Curse of Dimensionality: High-dimensional data can suffer from the curse of dimensionality, where the data becomes sparse and noisy, making it challenging to analyze and model effectively. Reducing dimensionality can help mitigate this issue.
• Feature Selection: In this approach, you select a subset of the original features and discard the rest. This subset contains the most relevant features for the task at hand. Common methods for feature selection include correlation analysis, mutual information, and filter methods. (A small filter-style sketch follows this list.)
• Feature Extraction: In this approach, you create new features that are combinations
or transformations of the original features. Principal Component Analysis (PCA) and
Linear Discriminant Analysis (LDA) are popular techniques for feature extraction.
They find linear combinations of the original features that capture the most significant
variation in the data.
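As a small illustration of the filter-style feature selection mentioned above, the sketch below drops near-constant features and then removes one feature from any highly correlated pair. The synthetic data, thresholds (variance > 0.05, |correlation| <= 0.95), and variable names are arbitrary choices for illustration, not prescribed values.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 5 features, where feature 3 is nearly constant and
# feature 4 is almost a copy of feature 0 (i.e., redundant).
X = rng.normal(size=(200, 5))
X[:, 3] = 0.01 * rng.normal(size=200)
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=200)

# Filter 1: drop near-constant features (variance below a chosen threshold).
idx = np.where(np.var(X, axis=0) > 0.05)[0]

# Filter 2: among the remaining features, keep a feature only if it is not
# highly correlated (|r| > 0.95) with a feature that has already been kept.
corr = np.corrcoef(X[:, idx], rowvar=False)
selected = []
for i in range(len(idx)):
    if all(abs(corr[i, j]) <= 0.95 for j in selected):
        selected.append(i)

kept_features = idx[selected]
print(kept_features)  # indices of the retained original features (here: 0, 1, 2)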
Popular algorithms used for dimensionality reduction include principal component analysis (PCA). These algorithms seek to transform data from high-dimensional spaces to low-dimensional spaces without compromising meaningful properties in the original data. These
techniques are typically deployed during exploratory data analysis (EDA) or data processing
to prepare the data for modeling.
It’s helpful to reduce the dimensionality of a dataset during EDA to help visualize data:
this is because visualizing data in more than three dimensions is difficult. From a data processing perspective, reducing the dimensionality of the data simplifies the modeling problem.
1. Standardize the Data: Center each feature (and typically scale it to unit variance) so that features with large ranges do not dominate.
2. Compute the Covariance Matrix: Calculate the covariance matrix of the standardized data.
3. Compute Eigenvalues and Eigenvectors: Perform an eigen decomposition of the covariance matrix; the eigenvectors define the principal component directions.
4. Sort Eigenvalues: Order the eigenvalues in descending order. The first eigenvalue corresponds to the most significant variance, the second eigenvalue to the second-most significant variance, and so on.
5. Select Principal Components: Decide how many principal components you want
to retain in the lower-dimensional representation. This choice can be based on the
proportion of explained variance or specific requirements for dimensionality reduction.
10. Application: Use the lower-dimensional data for various tasks, such as visualization,
clustering, classification, or regression, with reduced computational complexity and
noise.
• Noise Reduction: PCA can help filter out noise in the data, leading to cleaner and
more interpretable patterns.
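The PCA steps above, from standardization through projection, can be written compactly with NumPy. This is a minimal sketch for illustration; in practice a library implementation such as sklearn.decomposition.PCA is the usual choice.

import numpy as np

def pca(X, n_components):
    # Minimal PCA sketch: returns the projected data and explained-variance ratios.
    # 1-2. Center the data and compute the covariance matrix.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # 3-4. Eigen-decompose the covariance matrix and sort eigenvalues in descending order.
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance matrices are symmetric
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Keep the leading n_components eigenvectors as the projection matrix.
    W = eigvecs[:, :n_components]
    # Project the centered data onto the principal components.
    Z = Xc @ W
    explained = eigvals[:n_components] / eigvals.sum()
    return Z, explained

# Example: reduce 5-D synthetic data to 2 dimensions.
X = np.random.default_rng(0).normal(size=(200, 5))
Z, explained = pca(X, n_components=2)
print(Z.shape, explained)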