Clustering Analysis (Unsupervised)
A "clustering" is essentially a set of clusters, usually containing all objects in the data set.
Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of
clusters embedded in each other. Clusterings can be roughly distinguished as:
Hard clustering: each object either belongs to a cluster or it does not (no probabilities involved)
Soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain
degree (for example, a likelihood of belonging to the cluster)
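As a toy illustration (not from the original notes), the distinction can be written as membership matrices in NumPy: a hard clustering assigns 0/1 memberships, while a soft clustering assigns per-cluster degrees that sum to 1 for each object.

import numpy as np

# Hard clustering: each object belongs to exactly one cluster (0/1 membership).
hard_membership = np.array([
    [1, 0],   # object 0 -> cluster 0
    [1, 0],   # object 1 -> cluster 0
    [0, 1],   # object 2 -> cluster 1
])

# Soft (fuzzy) clustering: each object belongs to every cluster to some degree,
# and the degrees for one object sum to 1.
soft_membership = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
])

print(hard_membership.sum(axis=1))  # [1 1 1]
print(soft_membership.sum(axis=1))  # [1. 1. 1.]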
Connectivity-based clustering (hierarchical clustering) is a whole family of methods that differ in
the way distances between clusters are computed:
Single-Linkage Clustering: Distance between the 2 closest points not belonging to the same cluster.
Eg: SLINK Algorithm (O(n^2))
Complete-Linkage Clustering: Distance between the 2 farthest points of 2 different clusters.
Eg: CLINK Algorithm (O(n^2))
Average-Linkage Clustering: Average distance between any 2 points of 2 different clusters.
Employing a different distance-calculation method can give different clusters. Moreover, hierarchical
clustering can be agglomerative (O(n^3)) (starting with single elements and aggregating them into
clusters, bottom-up approach) or divisive (O(2^(n-1))) (starting with the complete data set and dividing it
into partitions, top-down approach).
These methods are very slow and do not handle outliers well: outliers will either show up as additional
clusters or even cause other clusters to merge (known as the "chaining phenomenon", in particular
with single-linkage clustering).
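A minimal sketch of the three linkage strategies, assuming SciPy is available; the two-blob toy data and the cut into 2 clusters are illustrative choices, not taken from the notes.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # toy blob around (0, 0)
               rng.normal(3, 0.3, (20, 2))])  # toy blob around (3, 3)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomerative (bottom-up) merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per linkage method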
Centroid-based clustering (k-means clustering)
k-means has a number of interesting theoretical properties. First, it partitions the data space into a
structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor
classification. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm
as a variation of the expectation-maximization algorithm.
k-means cannot represent density-based clusters or clusters with non-convex shapes.
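A minimal NumPy sketch of Lloyd's algorithm under the usual simplifying assumptions (Euclidean distance, no handling of empty clusters); the function name, initialization, and stopping test are illustrative, not a reference implementation.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids on k distinct data points (other schemes exist).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid,
        # which partitions the space into Voronoi cells.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (empty clusters are not handled in this simplified sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids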
Distribution-based clustering
The clustering model most closely related to statistics is based on distribution models. These models
suffer from one key problem known as overfitting, unless constraints are put on the model complexity: a
more complex model will usually be able to explain the data better, which makes choosing the
appropriate model complexity inherently difficult.
One prominent method is known as Gaussian mixture models (using the expectation-maximization
algorithm). Here, the data set is usually modeled with a fixed (to avoid overfitting) number
of Gaussian distributions that are initialized randomly and whose parameters are iteratively
optimized to better fit the data set. This will converge to a local optimum, so multiple runs may
produce different results. In order to obtain a hard clustering, objects are often then assigned to the
Gaussian distribution they most likely belong to; for soft clusterings, this is not necessary.
Distribution-based clustering produces complex models for clusters that can capture correlation and
dependence between attributes. However, these algorithms put an extra burden on the user: for
many real data sets, there may be no concisely defined mathematical model (e.g. assuming
Gaussian distributions is a rather strong assumption on the data). Density-based clusters cannot
be modeled using Gaussian distributions.
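As a sketch, assuming scikit-learn: a Gaussian mixture with a fixed number of components is fitted by expectation-maximization, with several random restarts to reduce the risk of a poor local optimum; predict_proba gives the soft clustering and predict the corresponding hard assignment. The toy data and n_components=2 are arbitrary choices, not from the notes.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Fixed number of Gaussians, fitted by EM; n_init restarts guard against
# converging to a poor local optimum from one random initialization.
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

soft = gmm.predict_proba(X)   # soft clustering: probability per Gaussian
hard = gmm.predict(X)         # hard clustering: most likely Gaussian per object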
Density-based clustering
Clusters are defined as areas of higher density. Objects in the sparse areas - that are required to
separate clusters - are usually considered to be noise and border points.
The most popular density-based clustering method is DBSCAN.[13] It features a well-defined cluster
model called "density-reachability". Similar to linkage-based clustering, it is based on connecting
points within certain distance thresholds. However, it only connects points that satisfy a density
criterion. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary
shape, in contrast to many other methods). Its complexity is fairly low (it requires a linear number of
range queries on the database) and there is no need to run it multiple times. OPTICS is a generalization
of DBSCAN that removes the need to choose an appropriate value for the range parameter ε.
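A sketch of DBSCAN, assuming scikit-learn; the eps and min_samples values are illustrative and would normally be tuned for the data at hand.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(2, 0.2, (50, 2)),
               [[10.0, 10.0]]])              # one isolated point, expected to be noise

# Density criterion: at least min_samples points within distance eps.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))                       # cluster ids; -1 marks noise points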
CLUSTER EVALUATION
1) Internal Evaluation: Evaluating how well the clusters are formed (unsupervised ML)
a) Davies-Bouldin Index
b) Dunn Index
c) Silhouette Coefficient
2) External Evaluation: Evaluating how well the clusters agree with an external set of known
class labels (supervised ML)
a) Purity
b) Rand Index
c) F-measure
d) Jaccard Index
e) Dice Index
f) Confusion Matrix
3) Hopkins Statistic (to measure cluster tendency): Useless in practice as it can’t handle
multimodality
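A sketch of both evaluation styles, assuming scikit-learn: the silhouette and Davies-Bouldin scores use only the data and the clustering (internal evaluation), while the adjusted Rand index compares the clustering against known labels (external evaluation). The toy data and k=2 are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)            # known labels (external case only)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal evaluation: uses only the data and the clustering itself.
print("Silhouette:    ", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))

# External evaluation: compares the clustering against the known labels.
print("Adjusted Rand: ", adjusted_rand_score(y_true, labels))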
DECISION TREE ANALYSIS (SUPERVISED)
Tree models can be of 2 types:
a) Classification Trees: Here the target variable can take a discrete set of values. In these tree
structures, leaves represent class labels (to be predicted) and branches
represent conjunctions of features that lead to those class labels.
b) Regression Trees: These are decision trees where the target variable can take continuous
values (typically real numbers).
c) Classification and Regression Trees (CART): An umbrella term covering both of the above
procedures; a brief sketch contrasting the two follows below.
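A sketch contrasting a classification tree (discrete target) with a regression tree (continuous target), assuming scikit-learn; the synthetic datasets and max_depth=3 are illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 2))
y_class = (X[:, 0] + X[:, 1] > 10).astype(int)     # discrete class labels
y_real = 2.0 * X[:, 0] + rng.normal(0, 0.1, 200)   # continuous target values

clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)   # classification tree
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_real)     # regression tree

print(clf.predict([[3.0, 8.0]]))   # predicts a class label
print(reg.predict([[3.0, 8.0]]))   # predicts a real number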
Some techniques, often called ensemble methods, construct more than one decision tree (for
example, bagged trees/random forests and boosted trees).
A tree is built by splitting the source set, constituting the root node of the tree, into subsets -
which constitute the successor children. The splitting is based on a set of splitting rules derived
from the classification features.[2] This process is repeated on each derived subset in a recursive
manner called recursive partitioning. The recursion is completed when the subset at a node has
all the same values of the target variable, or when splitting no longer adds value to the
predictions. This process of top-down induction of decision trees (TDIDT)[3] is an example of
a greedy algorithm.
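A minimal sketch of a single greedy TDIDT step, under the assumption of numeric features and Gini impurity (defined in the list below) as the split criterion: scan candidate thresholds and keep the split with the largest impurity decrease. A full learner would apply this recursively to each child subset; the function names here are illustrative.

import numpy as np

def gini(y):
    # Gini impurity of a set of class labels.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # One greedy step: return (feature, threshold, gain) of the best binary split.
    best_feature, best_threshold, best_gain = None, None, 0.0
    parent_impurity = gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted impurity of the two child subsets after the split.
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if parent_impurity - child > best_gain:
                best_feature, best_threshold, best_gain = j, t, parent_impurity - child
    # A full TDIDT learner recurses on each child subset until a node is pure
    # or no split improves the predictions.
    return best_feature, best_threshold, best_gain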
Different parameters or metrics are used to build a decision tree. They are:
1) Gini impurity (different from the Gini coefficient): For discrete target variables
2) Information Gain: For discrete target variables
3) Variance Reduction: For continuous target variables
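For concreteness, these criteria map onto the criterion parameter of scikit-learn's tree estimators; the option strings below reflect recent library versions and are an assumption about the library, not something stated in the notes.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

gini_tree = DecisionTreeClassifier(criterion="gini")              # Gini impurity
entropy_tree = DecisionTreeClassifier(criterion="entropy")        # information gain
variance_tree = DecisionTreeRegressor(criterion="squared_error")  # variance reduction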
Advantages:
1) Able to handle both numerical and categorical data.
2) Requires little data preparation (Normalizing the data is not required).
3) Uses a white box model: Its working can be easily seen and visualized.
4) Makes no assumptions about the training data or prediction residuals
5) Performs well with large datasets.
6) Robust against collinearity, particularly when boosting is used
Disadvantages:
1) A small change in the training data can result in a large change in the tree and
consequently the final predictions.
2) Decision-tree learners can create over-complex trees that do not generalize well from the
training data. This is known as overfitting. Mechanisms such as pruning are necessary to
avoid this problem (with the exception of some algorithms such as the Conditional Inference
approach that does not require pruning); see the pruning sketch after this list.
3) For data including categorical variables with different numbers of levels, information gain in
decision trees is biased in favor of attributes with more levels. However, the issue of biased
predictor selection is avoided by the Conditional Inference approach, a two-stage approach,
or adaptive leave-one-out feature selection.
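As referenced in point 2, a hedged sketch of post-pruning via cost-complexity pruning, assuming scikit-learn's ccp_alpha parameter; the toy data, the injected label noise, and the alpha value are illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (300, 2))
y = (X[:, 0] > 5).astype(int)
noise = rng.choice(300, 30, replace=False)
y[noise] = 1 - y[noise]                          # label noise invites overfitting

full = DecisionTreeClassifier(random_state=0).fit(X, y)                   # unpruned
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y) # pruned

print("unpruned leaves:", full.get_n_leaves())    # many leaves, fits the noise
print("pruned leaves:  ", pruned.get_n_leaves())  # far fewer leaves after pruning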