CLUSTERING ANALYSIS (UNSUPERVISED)

A "clustering" is essentially a set of clusters, usually containing all objects in the data set.
Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of
clusters embedded in each other. Clusterings can be roughly distinguished as:

 Hard clustering: each object either belongs to a cluster or does not (no probabilities involved)
 Soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain
degree (for example, a likelihood of belonging to the cluster)

There are also finer distinctions possible, for example:


 Strict partitioning clustering: each object belongs to exactly one cluster
 Strict partitioning clustering with outliers: objects can also belong to no cluster, and are
considered outliers
 Overlapping clustering (also: alternative clustering, multi-view clustering): objects may
belong to more than one cluster (but not probabilistically); usually involving hard clusters
 Hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster
 Subspace clustering: while the complete dataset may show an overlapping clustering, within
a uniquely defined subspace of the original large dataset, clusters are not expected to overlap

Connectivity-based clustering (hierarchical clustering)


A cluster can be described largely by the maximum distance needed to connect parts of the cluster.
At different distances, different clusters will form, which can be represented using a dendrogram.  In
a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are
placed along the x-axis such that the clusters don't mix.

Connectivity-based clustering is a whole family of methods that differ by the way distances are
computed:
Single-Linkage Clustering: the distance between the two closest points belonging to different clusters.
Eg: SLINK Algorithm (O(n²))
Complete-Linkage Clustering: the distance between the two farthest points of two different clusters.
Eg: CLINK Algorithm (O(n²))
Average-Linkage Clustering: the average distance over all pairs of points drawn from two different clusters.

Employing a different distance calculation method can give different clusters. Moreover, hierarchical
clustering can be agglomerative (O(n³)) (starting with single elements and aggregating them into
clusters, bottom-up approach) or divisive (O(2^(n-1))) (starting with the complete data set and dividing it
into partitions, top-down approach).

These methods are very slow and do not handle outliers well: outliers either show up as additional
clusters or cause other clusters to merge (known as the "chaining phenomenon", in particular
with single-linkage clustering).
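
As an illustration (not part of the original notes), the sketch below runs agglomerative clustering with SciPy and compares single, complete, and average linkage; the synthetic data and the cut into three clusters are assumed choices for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic 2-D data: three loose blobs (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in ([0, 0], [3, 3], [0, 4])])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per linkage method

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge heights (y-axis = merge distance).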
Centroid-based clustering (k-means clustering)

When the number of clusters is fixed to k, k-means clustering gives a formal definition as an
optimization problem: find the k cluster centers and assign the objects to the nearest cluster center,
such that the squared distances from the cluster centers are minimized.
The k-means algorithm is also called ‘Lloyd’s Algorithm’. It only finds a local optimum, and is
commonly run multiple times with different random initializations.
Variations of k-means:
1) k-medoids: restricting the centroids to members of the data set
2) k-medians clustering: choosing medians instead of means as the cluster centers
3) k-means++: choosing the initial centers less randomly
4) fuzzy c-means: allowing a fuzzy cluster assignment.

K-means has several notable properties. First, it partitions the data space into a structure known as
a Voronoi diagram. Second, it is conceptually close to nearest-neighbor classification. Third, it can be
seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the expectation-maximization algorithm.
k-means cannot represent density-based clusters or clusters with non-convex shapes.
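
A minimal sketch of Lloyd's algorithm via scikit-learn (an assumed library choice; the notes do not name one), using k-means++ initialization and multiple random restarts as described above:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in ([0, 0], [4, 0], [2, 3])])

# k-means++ initialization, 10 random restarts; the best (lowest-inertia) run is kept.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # the k cluster centers
print(km.inertia_)           # sum of squared distances to the nearest center (the objective)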

Distribution-based clustering
The clustering model most closely related to statistics is based on distribution models. Such models
suffer from one key problem known as overfitting, unless constraints are put on the model complexity. A
more complex model will usually be able to explain the data better, which makes choosing the
appropriate model complexity inherently difficult.
One prominent method is known as Gaussian mixture models (using the expectation-maximization
algorithm). Here, the data set is usually modeled with a fixed (to avoid overfitting) number
of Gaussian distributions that are initialized randomly and whose parameters are iteratively
optimized to better fit the data set. This will converge to a local optimum, so multiple runs may
produce different results. In order to obtain a hard clustering, objects are often then assigned to the
Gaussian distribution they most likely belong to; for soft clusterings, this is not necessary.
Distribution-based clustering produces complex models for clusters that can capture correlation and
dependence between attributes. However, these algorithms put an extra burden on the user: for
many real data sets, there may be no concisely defined mathematical model (e.g. assuming
Gaussian distributions is a rather strong assumption on the data). Density-based clusters cannot
be modeled using Gaussian distributions.
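
As a sketch (assuming scikit-learn's GaussianMixture, which fits the mixture by expectation-maximization), both hard and soft assignments can be read off the fitted model:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [3, 3])])

# Fixed number of components; several EM restarts because EM only finds a local optimum.
gmm = GaussianMixture(n_components=2, covariance_type="full", n_init=5, random_state=0)
gmm.fit(X)

hard = gmm.predict(X)          # hard clustering: most likely Gaussian per object
soft = gmm.predict_proba(X)    # soft clustering: membership probability per component
print(gmm.means_)
print(soft[:3])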
Density-based clustering
Clusters are defined as areas of higher density. Objects in the sparse areas - that are required to
separate clusters - are usually considered to be noise and border points.
The most popular density based clustering method is DBSCAN.[13] It features a well-defined cluster
model called "density-reachability". Similar to linkage based clustering, it is based on connecting
points within certain distance thresholds. However, it only connects points that satisfy a density
criterion. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary
shape, in contrast to many other methods). Its complexity is fairly low – it requires a linear number of
range queries on the database. There is no need to run it multiple times. OPTICS is a generalization
of DBSCAN that removes the need to choose an appropriate value for the range parameter ε,
and produces a hierarchical result related to that of linkage clustering.
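
A minimal DBSCAN sketch, assuming scikit-learn and the classic two-moons data to show an arbitrarily shaped (non-convex) cluster; eps and min_samples are illustrative values:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters that k-means cannot recover.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = range parameter ε, min_samples = density criterion
labels = db.fit_predict(X)            # label -1 marks noise points in sparse areas
print(set(labels))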


The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect
cluster borders. On data sets with, for example, overlapping Gaussian distributions, the cluster
borders produced by these algorithms will often look arbitrary, because the cluster density
decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are
nearly always outperformed by methods such as EM clustering that are able to precisely model this
kind of data.
Mean-shift is a clustering approach where each object is moved to the densest area in its vicinity,
based on kernel density estimation. Eventually, objects converge to local maxima of density. This
algorithm is usually slower than DBSCAN or k-Means. Besides that, the applicability of the mean-
shift algorithm to multidimensional data is hindered by the unsmooth behaviour of the kernel density
estimate, which results in over-fragmentation of cluster tails.
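
A corresponding mean-shift sketch (assuming scikit-learn's MeanShift, with the kernel bandwidth estimated from the data rather than hand-tuned):

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# The bandwidth controls the kernel density estimate that objects climb.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)            # objects converge to local density maxima
print(ms.cluster_centers_)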

CLUSTER EVALUATION
1) Internal Evaluation: evaluating how well the clusters are formed, using only the clustered data itself (unsupervised)
a) Davies-Bouldin Index
b) Dunn Index
c) Silhouette Coefficient
2) External Evaluation: evaluating how well the clustering matches an external benchmark, such as
known class labels for the input data (supervised)
a) Purity
b) Rand-Index
c) F-measure
d) Jaccard Index
e) Dice Index
f) Confusion Matrix
3) Hopkins Statistic (to measure cluster tendency): of little practical use, as it cannot handle
multimodality
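
Several of these measures are available directly in scikit-learn (an assumed choice; the Dunn index, for instance, is not included there). The sketch below computes internal and external scores for a k-means result on labeled blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal evaluation: uses only the data and the cluster labels.
print("silhouette:", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))

# External evaluation: compares the clustering against known ground-truth labels.
print("adjusted Rand index:", adjusted_rand_score(y_true, labels))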
DECISION TREE ANALYSIS (SUPERVISED)
Tree models can be of the following types:
a) Classification Trees: Here the target variable can take a discrete set of values. In these tree
structures, leaves represent class labels (to be predicted) and branches
represent conjunctions of features that lead to those class labels.
b) Regression Trees: These are decision trees where the target variable can take continuous
values (typically real numbers).
c) Classification and Regression Trees (CART): an umbrella term covering both of the above
procedures (a brief example follows below).
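
For illustration, a classification tree and a regression tree fitted with scikit-learn (an assumed library; the toy data and depth limit are made up for the example):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))

# Classification tree: discrete target (class 1 if x0 + x1 > 10, else class 0).
y_class = (X[:, 0] + X[:, 1] > 10).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)

# Regression tree: continuous target.
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)

print(export_text(clf, feature_names=["x0", "x1"]))   # branches = feature tests, leaves = class labels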

Some techniques, often called ensemble methods, construct more than one decision tree:

d) Boosted trees: incrementally building an ensemble by training each new tree to emphasize the
training instances previously mis-modeled. Typical examples are AdaBoost (Adaptive
Boosting) and Gradient Boosting. These can be used for regression-type and classification-type
problems.
e) Bootstrap aggregated (or bagged) decision trees: an early ensemble method that builds multiple
decision trees by repeatedly resampling the training data with replacement and voting the trees for
a consensus prediction.[7] A random forest classifier is a specific type of bootstrap aggregating.
f) Rotation forest – in which every decision tree is trained by first applying principal component
analysis (PCA) on a random subset of the input features.
g) Decision List: A special case of a decision tree is a decision list, which is a one-sided decision
tree, so that every internal node has exactly 1 leaf node and exactly 1 internal node as a child
(except for the bottommost node, whose only child is a single leaf node). While less expressive,
decision lists are arguably easier to understand than general decision trees.
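
Several of the ensemble methods above are available in scikit-learn (again an assumed choice); a brief sketch comparing bagged trees, a random forest, AdaBoost, and gradient boosting on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each ensemble.
    print(name, cross_val_score(model, X, y, cv=5).mean())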

A tree is built by splitting the source set, constituting the root node of the tree, into subsets -
which constitute the successor children. The splitting is based on a set of splitting rules based
on classification features.[2] This process is repeated on each derived subset in a recursive
manner called recursive partitioning. The recursion is completed when the subset at a node has
all the same values of the target variable, or when splitting no longer adds value to the
predictions. This process of top-down induction of decision trees (TDIDT)[3] is an example of
a greedy algorithm.

Different parameters or metrics are used to build a decision tree. They are:

1) Gini impurity (different from the Gini coefficient): for a discrete target variable
2) Information Gain: for a discrete target variable
3) Variance Reduction: for a continuous target variable
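
As a worked illustration of the first two metrics (a hand-rolled sketch, not from the notes), Gini impurity and entropy for a node can be computed from the class proportions, and information gain is the parent entropy minus the weighted child entropies:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy: -sum(p_k * log2(p_k)).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]          # one candidate split of the node

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted
print(gini(parent), info_gain)                # 0.5 and roughly 0.55 for this split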

Advantages:
1) Able to handle both numerical and categorical data.
2) Requires little data preparation (normalizing the data is not required).
3) Uses a white-box model: its working can easily be inspected and visualized.
4) Makes no assumptions about the training data or prediction residuals.
5) Performs well on large datasets.
6) Robust against collinearity, particularly when boosting is used.

Disadvantages:
1) A small change in the training data can result in a large change in the tree and
consequently the final predictions.
2) Decision-tree learners can create over-complex trees that do not generalize well from the
training data. This is known as overfitting. Mechanisms such as pruning are necessary to
avoid this problem (with the exception of some algorithms, such as the Conditional Inference
approach, which does not require pruning).
3) For data including categorical variables with different numbers of levels, information gain in
decision trees is biased in favor of attributes with more levels. However, the issue of biased
predictor selection is avoided by the Conditional Inference approach, a two-stage approach,
or adaptive leave-one-out feature selection.
