
2023-2024

21IS5C05
Data Science

Module 5
Rampur Srinath
NIE, Mysuru
rampursrinath@nie.ac.in
Machine Learning

Machine learning involves coding programs that automatically adjust their performance in accordance with their exposure to information in data.

Machine learning can be considered a subfield of artificial intelligence (AI). We can roughly divide the field into the following three major classes:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised learning: Algorithms which learn from a training set of
labeled examples to generalize to the set of all possible inputs.
Examples of techniques in supervised learning: logistic regression,
support vector machines, decision trees, random forest, etc.

Unsupervised learning: Algorithms that learn from a training set of unlabeled examples. Used to explore data according to some statistical, geometric or similarity criterion. Examples of unsupervised learning include k-means clustering and kernel density estimation.

Reinforcement learning: Algorithms that learn via reinforcement from criticism that provides information on the quality of a solution, but not on how to improve it. Improved solutions are achieved by iteratively exploring the solution space.
• As a data scientist, the first step to apply when given a certain problem is to identify the question to be answered. According to the type of answer we are seeking, we are directly aiming for a certain set of techniques.
Unsupervised Learning

Unsupervised learning is defined as the task performed by algorithms that learn from a training set of unlabeled or unannotated examples, using the features of the inputs to categorize them according to some geometric or statistical criteria.

In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate the goodness of a potential solution.
Most unsupervised learning techniques can be summarized as
those that tackle the following four groups of problems:
• Clustering: has as a goal to partition the set of examples into
groups.
• Dimensionality reduction: aims to reduce the dimensionality of the data. Here, we encounter techniques such as Principal Component Analysis (PCA), independent component analysis, and nonnegative matrix factorization (a short PCA sketch follows this list).
• Outlier detection: has as a purpose to find unusual events (e.g., a malfunction) that distinguish part of the data from the rest according to certain criteria.
• Novelty detection: deals with cases when changes occur in the
data (e.g., in streaming data).
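As a minimal sketch of the dimensionality-reduction family mentioned in the list above (the random data and the use of scikit-learn's PCA are illustrative assumptions, not part of the original slides), PCA can project high-dimensional samples onto a few principal components:

import numpy as np
from sklearn.decomposition import PCA

# Purely illustrative random data: 100 samples with 5 features each
rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# Project onto the 2 directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component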
The most common unsupervised task is clustering.
Clustering

Clustering is a process of grouping similar objects together; i.e., to partition unlabeled examples into disjoint subsets of clusters, such that:
• Examples within a cluster are similar (in this case, we speak
of high intraclass similarity).
• Examples in different clusters are different (in this case, we
speak of low interclass similarity).

Depending on how we denote data as similar or dissimilar, two kinds of inputs can be used for grouping (a short sketch converting one into the other follows the list):
• in similarity-based clustering, the input to the algorithm is an n × n dissimilarity matrix or distance matrix;
• in feature-based clustering, the input to the algorithm is an n × D feature matrix or design matrix, where n is the number of examples in the dataset and D the dimensionality of each sample.
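As a small sketch of how the two kinds of input relate (the toy feature matrix below is invented for illustration), scikit-learn's pairwise_distances turns an n × D feature matrix into the n × n distance matrix used by similarity-based methods:

import numpy as np
from sklearn.metrics import pairwise_distances

# A toy n x D feature (design) matrix: n = 4 samples, D = 2 features
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])

# The corresponding n x n distance matrix (Euclidean by default)
D = pairwise_distances(X, metric='euclidean')
print(D.shape)  # (4, 4)
print(D)        # D[i, j] is the distance between sample i and sample j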
Several questions regarding the clustering process arise.
• What is a natural grouping among the objects? We need to define the "groupness" and the "similarity/distance" between data.
• How can we group samples? What are the best procedures? Are they efficient? Are they fast? Are they deterministic?
• How many clusters should we look for in the data? Shall we state this number a priori? Should the process be completely data driven or can the user guide the grouping process? How can we avoid "trivial" clusters? Should we allow final clustering results to have very large or very small clusters? Which methods work when the number of samples is large? Which methods work when the number of classes is large?
• What constitutes a good grouping? What objective measures can be defined to evaluate the quality of the clusters?
Similarity and Distances

• To speak of similar and dissimilar data, we need to introduce a notion of the similarity of data. There are several ways of modeling similarity. A simple way to model it is by means of a Gaussian kernel:

s(a, b) = exp( −‖a − b‖² / (2σ²) )

• Instead of similarity, a distance is often used as a surrogate. The most widespread distance metric is the Minkowski distance:

d_p(a, b) = ( Σ_{i=1}^{D} |a_i − b_i|^p )^(1/p)

The best-known instantiations of this metric are as follows (a short SciPy sketch follows the list):
• when p = 2, we have the Euclidean distance,
• when p = 1, we have the Manhattan distance, and
• when p = inf, we have the max-distance. In this case, the distance corresponds to the component |a_i − b_i| with the highest value.
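A minimal sketch of these three instantiations using SciPy's distance helpers (the two points are arbitrary examples):

import numpy as np
from scipy.spatial import distance

a = np.array([0.0, 4.0])
b = np.array([6.0, 2.0])

print(distance.minkowski(a, b, p=2))  # p = 2: Euclidean distance
print(distance.minkowski(a, b, p=1))  # p = 1: Manhattan distance
print(distance.chebyshev(a, b))       # p = inf: max-distance (Chebyshev)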
Euclidean Distance

Euclidean distance is defined as the distance between two points.

The Euclidean distance formula gives the length of the line segment joining two points. Let us assume two points, (x1, y1) and (x2, y2), in the two-dimensional coordinate plane.
Find the distance between two points P(0, 4)
and Q(6, 2).
Given: P(0, 4) = (x1, y1), Q(6, 2) = (x2, y2)
The distance between the points P and Q is
PQ = √[(x2 − x1)² + (y2 − y1)²]
PQ = √[(6 − 0)² + (2 − 4)²]
PQ = √[(6)² + (−2)²]
PQ = √(36 + 4)
PQ = √40
PQ = 2√10
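A quick numerical check of this worked example with NumPy (2√10 ≈ 6.3246):

import numpy as np

P = np.array([0.0, 4.0])
Q = np.array([6.0, 2.0])

# Euclidean distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
print(np.linalg.norm(Q - P))  # 6.3245..., i.e., 2*sqrt(10)
print(2 * np.sqrt(10))        # same value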
Manhattan distance

• Manhattan distance is a distance measure calculated by taking the sum of the absolute differences between the coordinates of the two points (e.g., the x and y coordinates in two dimensions).

• The Manhattan distance is also known as the Manhattan length. In other words, it is the distance between two points measured along axes at right angles.
• Manhattan distance works very well for high-dimensional
datasets. As it does not take any squares, it does not amplify the
differences between any of the features. It also does not ignore
any features.
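A minimal sketch of the Manhattan distance for the same two points used in the Euclidean example above (sum of absolute coordinate differences):

import numpy as np
from scipy.spatial import distance

P = np.array([0.0, 4.0])
Q = np.array([6.0, 2.0])

# |x2 - x1| + |y2 - y1| = 6 + 2 = 8
print(np.sum(np.abs(Q - P)))     # 8.0
print(distance.cityblock(P, Q))  # same result via SciPy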
What Constitutes a Good Clustering?
Defining Metrics to Measure Clustering
Quality
• When performing clustering, the question normally arises:
How do we measure the quality of the clustering result? Note
that in unsupervised clustering, we do not have ground truth
labels that would allow us to compute the accuracy of the
algorithm. Still, there are several procedures for assessing
quality. We find two families of techniques:
• those that allow us to compare clustering techniques,
• those that check on specific properties of the clustering, for
example “compactness”.
Rand Index, Homogeneity, Completeness
and V-measure Scores
• One of the best-known methods in statistics for comparing the results of clustering techniques is the Rand index or Rand measure (named after William M. Rand).
• The Rand index evaluates the similarity between two results
of data clustering.
Given a set of n elements S = {o1, . . . , on}, we can compare
two partitions of S:
X = {X1, . . . , Xr}, a partition of S into r subsets; and
Y = {Y1, . . . , Ys}, a partition of S into s subsets.
Let us use the annotations as follows:
• a is the number of pairs of elements in S that are in the same subset in both X and Y;
• b is the number of pairs of elements in S that are in different subsets in both X and Y;
• c is the number of pairs of elements in S that are in the same subset in X, but in different subsets in Y; and
• d is the number of pairs of elements in S that are in different subsets in X, but in the same subset in Y.
The Rand index is then defined as

R = (a + b) / (a + b + c + d) = (a + b) / C(n, 2)

where C(n, 2) is the total number of pairs of elements in S.
• A problem with the Rand index is that its expected value for two random partitions is not a constant, and it tends towards the upper limit of 1 as the number of clusters increases. To solve this problem, a form of the Rand index, called the Adjusted Rand index, is used that adjusts the Rand index with respect to chance grouping of elements. It is defined as follows:

ARI = (RI − E[RI]) / (max(RI) − E[RI])

where E[RI] is the expected Rand index of random partitions.
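A small sketch with scikit-learn's adjusted_rand_score (the two label vectors below are invented toy partitions of six elements):

from sklearn import metrics

# Two partitions of the same six elements, encoded as cluster labels
labels_x = [0, 0, 0, 1, 1, 1]
labels_y = [0, 0, 1, 1, 2, 2]

# 1.0 means identical partitions (up to relabeling); values near 0 mean chance-level agreement
print(metrics.adjusted_rand_score(labels_x, labels_y))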
V-measure

• Another way of comparing clustering results is the V-measure.

• A clustering result satisfies a homogeneity criterion if all of its clusters contain only data points which are members of the same original (single) class.
• The homogeneity criterion: each cluster should contain only data points that are members of a single class.

• A clustering result satisfies a completeness criterion if all the data points that are members of a given class are elements of the same predicted cluster.

• The completeness criterion: all of the data points that are members of a given class should be elements of the same cluster.
V-measure

• The V-measure is the harmonic mean of homogeneity and completeness, defined as follows:

v = 2 · (homogeneity · completeness) / (homogeneity + completeness)

In scikit-learn, these scores are available as:
• metrics.homogeneity_score( )
• metrics.completeness_score( )
• metrics.v_measure_score( )

In summary, we can say that the advantages of the V-measure include that it has bounded scores:
• 0.0 means the clustering is extremely bad;
• 1.0 indicates a perfect clustering result.
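A minimal sketch of the three scikit-learn calls listed above, on invented ground-truth and predicted labels:

from sklearn import metrics

# Invented ground-truth classes and predicted cluster labels for six samples
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(metrics.homogeneity_score(labels_true, labels_pred))   # does each cluster hold a single class?
print(metrics.completeness_score(labels_true, labels_pred))  # does each class land in a single cluster?
print(metrics.v_measure_score(labels_true, labels_pred))     # harmonic mean of the two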
Silhouette Score

• The Silhouette coefficient is defined as a function of the intracluster distance of a sample in the dataset, a, and the nearest-cluster distance, b, for each sample.

The Silhouette coefficient for a sample i can be written as follows:

s(i) = (b − a) / max(a, b)
• Hence, if the Silhouette s(i) is close to 0, it means that the sample lies on the border between its own cluster and the closest of the other clusters.
• A negative value means that the sample is closer to the neighbor
cluster. The average of the Silhouette coefficients of all samples
of a given cluster defines the “goodness” of the cluster.
• A high positive value, i.e., close to 1, would mean a compact cluster, and vice versa. The average of the Silhouette coefficients over all clusters gives an idea of the quality of the clustering result.
• Note that the Silhouette coefficient only makes sense when the
number of labels predicted is less than the number of samples
clustered.
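A minimal sketch of the Silhouette score on synthetic data (make_blobs and the choice of k = 3 are illustrative assumptions):

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean Silhouette coefficient over all samples; values close to 1 indicate compact, well-separated clusters
print(metrics.silhouette_score(X, labels))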
Taxonomies of Clustering Techniques

Within the different clustering algorithms, one can find:
• soft partition algorithms, which assign a probability of the data belonging to each cluster (contrasted with hard partitions in the short sketch at the end of this slide), and
• hard partition algorithms, where each data point is assigned precise membership of one cluster.
• According to the grouping process of the hard partition
algorithm, there are two large families of clustering
techniques:
• Partitional algorithms: these start with a random partition and refine it iteratively. That is why sometimes these algorithms are called "flat" clustering. (K-means and spectral clustering)
• Hierarchical algorithms: these organize the data into hierarchical structures, where data can be agglomerated in the bottom-up direction, or split in a top-down manner. (agglomerative clustering)
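A short sketch contrasting a hard partition (K-means) with a soft partition (a Gaussian mixture model, used here as one common soft-partition example; the data is synthetic):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Hard partition: each sample gets exactly one cluster label
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])

# Soft partition: each sample gets a probability of belonging to each cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict_proba(X)[:5])  # each row sums to 1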
K-means Clustering

K-means algorithm is a hard partition algorithm with the goal of assigning each data point to a single cluster.

K-means algorithm divides a set of n samples X into k disjoint clusters c_i, i = 1, . . . , k, each described by the mean μ_i of the samples in the cluster. The means are commonly called cluster centroids.

The K-means algorithm assumes that all k groups have equal variance.
K-means clustering solves the following minimization problem:

min_{c_1, ..., c_k}  Σ_{i=1}^{k} Σ_{x ∈ c_i} ‖x − μ_i‖²

• where c_i is the set of points that belong to cluster i and
• μ_i is the center of the class c_i.
• The K-means clustering objective function uses the square of the Euclidean distance, ‖x − μ_i‖², which is also referred to as the inertia or within-cluster sum-of-squares.
KMeans(copy_x=True, init='random', max_iter=300,
n_clusters=3, n_init=10, n_jobs=1, precompute_distances=True,
random_state=None, tol=0.0001, verbose=0)
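A runnable sketch of the constructor shown above on synthetic data (make_blobs is an illustrative assumption; note that the n_jobs and precompute_distances arguments have been removed in recent scikit-learn releases):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300,
                tol=0.0001, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # the k centroids (the means mu_i)
print(kmeans.labels_[:10])      # hard cluster assignment of each sample
print(kmeans.inertia_)          # within-cluster sum-of-squares (the objective above)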
Hierarchical Clustering

• Hierarchical clustering comprises a general family of clustering algorithms that construct nested clusters by successively merging or splitting the data.
• The hierarchy of clusters is represented as a tree. The tree is
usually called a dendrogram.
• The root of the dendrogram is the single cluster that contains
all the samples;
• the leaves are the clusters containing only one sample each.
Two types of hierarchical clustering:
• Top-down divisive clustering
• Bottom-up agglomerative clustering
Top-down divisive clustering

• Start with all the data in a single cluster.
• Consider every possible way to divide the cluster into two.
• Choose the best division.
• Operate recursively on both halves until a stopping criterion is met. That can be something like the following: there are as many clusters as data points; the predetermined number of clusters has been reached; the maximum distance between all possible partition divisions is smaller than a predetermined threshold; etc.
Bottom-up agglomerative clustering

• Start with each data point in a separate cluster.
• Repeatedly join the closest pair of clusters.
• At each step, a stopping criterion is checked: there is only
one cluster; a predetermined number of clusters has been
reached; the distance between the closest clusters is greater
than a predetermined threshold; etc.
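A minimal sketch of bottom-up agglomerative clustering and its dendrogram using SciPy (the synthetic data and the 'ward' merging rule are illustrative choices):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, centers=3, random_state=0)

# Agglomerate bottom-up: repeatedly merge the closest pair of clusters
Z = linkage(X, method='ward')

# The dendrogram is the tree of merges: root = one cluster, leaves = single samples
dendrogram(Z)
plt.show()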
Measures for the distance between two clusters
• The way the distance between two clusters is measured is crucial for hierarchical clustering.
• There are various ways to calculate the distance between two clusters, and these choices decide the rule for merging clusters.
• These measures are called linkage methods.
• Some of the popular linkage methods are given below:
• Single Linkage
• Complete Linkage
• Average Linkage
• Centroid Linkage
Single Linkage: the distance between two clusters is the minimum distance between any pair of samples, one from each cluster.
Complete Linkage: the distance between two clusters is the maximum distance between any pair of samples, one from each cluster.
Average Linkage: the distance between two clusters is the average of all pairwise distances between samples of the two clusters.
Centroid Linkage: the distance between two clusters is the distance between their centroids.
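A short sketch comparing linkage rules with scikit-learn's AgglomerativeClustering on synthetic data (scikit-learn exposes 'single', 'complete', 'average' and 'ward' linkages; centroid linkage is available through SciPy's linkage function instead):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Same data, different rules for measuring the distance between two clusters
for linkage in ('single', 'complete', 'average', 'ward'):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, labels[:10])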
Thank You
