Clustering Analysis (Unsupervised)
A "clustering" is essentially a set of clusters, usually containing all objects in the data set.
Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of
clusters embedded in each other. Clusterings can be roughly distinguished as:
Hard clustering: each object either belongs to a cluster or it does not (no probabilities involved)
Soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain
degree (for example, a likelihood of belonging to the cluster)
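As a toy illustration (not from the original notes), the distinction can be written as membership matrices in NumPy: a hard clustering assigns 0/1 memberships, while a soft clustering assigns per-cluster degrees that sum to 1 for each object.

import numpy as np

# Hard clustering: each object belongs to exactly one cluster (0/1 membership).
hard_membership = np.array([
    [1, 0],   # object 0 -> cluster 0
    [1, 0],   # object 1 -> cluster 0
    [0, 1],   # object 2 -> cluster 1
])

# Soft (fuzzy) clustering: each object belongs to every cluster to some degree,
# and the degrees for one object sum to 1.
soft_membership = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
])

print(hard_membership.sum(axis=1))  # [1 1 1]
print(soft_membership.sum(axis=1))  # [1. 1. 1.]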
Connectivity-based clustering (hierarchical clustering) is a whole family of methods that differ in
the way distances between clusters are computed:
Single-Linkage Clustering: Distance between the 2 closest points not belonging to the same cluster.
Eg: SLINK Algorithm (O(n^2))
Complete-Linkage Clustering: Distance between the 2 farthest points of 2 different clusters.
Eg: CLINK Algorithm (O(n^2))
Average-Linkage Clustering: Average distance between any 2 points of 2 different clusters.
Employing a different distance-calculation method can give different clusters. Moreover, hierarchical
clustering can be agglomerative (O(n^3)) (starting with single elements and aggregating them into
clusters, bottom-up approach) or divisive (O(2^(n-1))) (starting with the complete data set and dividing it
into partitions, top-down approach).
These methods are very slow and do not handle outliers well: outliers will either show up as additional
clusters or even cause other clusters to merge (known as the "chaining phenomenon", in particular
with single-linkage clustering).
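A minimal sketch of the three linkage strategies, assuming SciPy is available; the two-blob toy data and the cut into 2 clusters are illustrative choices, not taken from the notes.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # toy blob around (0, 0)
               rng.normal(3, 0.3, (20, 2))])  # toy blob around (3, 3)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomerative (bottom-up) merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per linkage method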
Centroid-based clustering (k-means clustering)
k-means has a number of interesting theoretical properties. First, it partitions the data space into a
structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor
classification. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm
as a variation of the expectation-maximization algorithm.
k-means cannot represent density-based clusters or clusters with non-convex shapes.
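A minimal NumPy sketch of Lloyd's algorithm under the usual simplifying assumptions (Euclidean distance, no handling of empty clusters); the function name, initialization, and stopping test are illustrative, not a reference implementation.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids on k distinct data points (other schemes exist).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid,
        # which partitions the space into Voronoi cells.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (empty clusters are not handled in this simplified sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids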
Distribution-based clustering
The clustering model most closely related to statistics is based on distribution models. These models
suffer from one key problem known as overfitting, unless constraints are put on the model complexity: a
more complex model will usually be able to explain the data better, which makes choosing the
appropriate model complexity inherently difficult.
One prominent method is known as Gaussian mixture models (using the expectation-maximization
algorithm). Here, the data set is usually modeled with a fixed (to avoid overfitting) number
of Gaussian distributions that are initialized randomly and whose parameters are iteratively
optimized to better fit the data set. This will converge to a local optimum, so multiple runs may
produce different results. In order to obtain a hard clustering, objects are often then assigned to the
Gaussian distribution they most likely belong to; for soft clusterings, this is not necessary.
Distribution-based clustering produces complex models for clusters that can capture correlation and
dependence between attributes. However, these algorithms put an extra burden on the user: for
many real data sets, there may be no concisely defined mathematical model (e.g. assuming
Gaussian distributions is a rather strong assumption on the data). Density-based clusters cannot
be modeled using Gaussian distributions.
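As a sketch, assuming scikit-learn: a Gaussian mixture with a fixed number of components is fitted by expectation-maximization, with several random restarts to reduce the risk of a poor local optimum; predict_proba gives the soft clustering and predict the corresponding hard assignment. The toy data and n_components=2 are arbitrary choices, not from the notes.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Fixed number of Gaussians, fitted by EM; n_init restarts guard against
# converging to a poor local optimum from one random initialization.
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

soft = gmm.predict_proba(X)   # soft clustering: probability per Gaussian
hard = gmm.predict(X)         # hard clustering: most likely Gaussian per object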
Density-based clustering
Clusters are defined as areas of higher density. Objects in the sparse areas - that are required to
separate clusters - are usually considered to be noise and border points.
The most popular density-based clustering method is DBSCAN.[13] It features a well-defined cluster
model called "density-reachability". Similar to linkage-based clustering, it is based on connecting
points within certain distance thresholds. However, it only connects points that satisfy a density
criterion. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary
shape, in contrast to many other methods). Its complexity is fairly low (it requires a linear number of
range queries on the database) and there is no need to run it multiple times. OPTICS is a generalization
of DBSCAN that removes the need to choose an appropriate value for the range parameter ε.
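A sketch of DBSCAN, assuming scikit-learn; the eps and min_samples values are illustrative and would normally be tuned for the data at hand.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(2, 0.2, (50, 2)),
               [[10.0, 10.0]]])              # one isolated point, expected to be noise

# Density criterion: at least min_samples points within distance eps.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))                       # cluster ids; -1 marks noise points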
CLUSTER EVALUATION
1) Internal Evaluation: Evaluating how well the clusters are formed (unsupervised ML)
a) Davies-Bouldin Index
b) Dunn Index
c) Silhouette Coefficient
2) External Evaluation: Evaluating how well the clusters agree with an external set of known
class labels (supervised ML)
a) Purity
b) Rand Index
c) F-measure
d) Jaccard Index
e) Dice Index
f) Confusion Matrix
3) Hopkins Statistic (to measure cluster tendency): Useless in practice as it can’t handle
multimodality
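A sketch of both evaluation styles, assuming scikit-learn: the silhouette and Davies-Bouldin scores use only the data and the clustering (internal evaluation), while the adjusted Rand index compares the clustering against known labels (external evaluation). The toy data and k=2 are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)            # known labels (external case only)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal evaluation: uses only the data and the clustering itself.
print("Silhouette:    ", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))

# External evaluation: compares the clustering against the known labels.
print("Adjusted Rand: ", adjusted_rand_score(y_true, labels))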
DECISION TREE ANALYSIS (SUPERVISED)
Tree models can be of 2 types:
a) Classification Trees: Here the target variable can take a discrete set of values. In these tree
structures, leaves represent class labels (to be predicted) and branches
represent conjunctions of features that lead to those class labels.
b) Regression Trees: These are decision trees where the target variable can take continuous
values (typically real numbers).
c) Classification and Regression Trees (CART): An umbrella term covering both of the above
procedures; a brief sketch contrasting the two follows below.
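A sketch contrasting a classification tree (discrete target) with a regression tree (continuous target), assuming scikit-learn; the synthetic datasets and max_depth=3 are illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 2))
y_class = (X[:, 0] + X[:, 1] > 10).astype(int)     # discrete class labels
y_real = 2.0 * X[:, 0] + rng.normal(0, 0.1, 200)   # continuous target values

clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)   # classification tree
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_real)     # regression tree

print(clf.predict([[3.0, 8.0]]))   # predicts a class label
print(reg.predict([[3.0, 8.0]]))   # predicts a real number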
Some techniques, often called ensemble methods, construct more than one decision tree (for
example, bagged trees/random forests and boosted trees).
A tree is built by splitting the source set, constituting the root node of the tree, into subsets -
which constitute the successor children. The splitting is based on a set of splitting rules derived
from the classification features.[2] This process is repeated on each derived subset in a recursive
manner called recursive partitioning. The recursion is completed when the subset at a node has
all the same values of the target variable, or when splitting no longer adds value to the
predictions. This process of top-down induction of decision trees (TDIDT)[3] is an example of
a greedy algorithm.
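A minimal sketch of a single greedy TDIDT step, under the assumption of numeric features and Gini impurity (defined in the list below) as the split criterion: scan candidate thresholds and keep the split with the largest impurity decrease. A full learner would apply this recursively to each child subset; the function names here are illustrative.

import numpy as np

def gini(y):
    # Gini impurity of a set of class labels.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # One greedy step: return (feature, threshold, gain) of the best binary split.
    best_feature, best_threshold, best_gain = None, None, 0.0
    parent_impurity = gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted impurity of the two child subsets after the split.
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if parent_impurity - child > best_gain:
                best_feature, best_threshold, best_gain = j, t, parent_impurity - child
    # A full TDIDT learner recurses on each child subset until a node is pure
    # or no split improves the predictions.
    return best_feature, best_threshold, best_gain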
Different parameters or metrics are used to build a decision tree. They are:
1) Gini impurity (different from the Gini coefficient): For discrete target variables
2) Information Gain: For discrete target variables
3) Variance Reduction: For continuous target variables
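For concreteness, these criteria map onto the criterion parameter of scikit-learn's tree estimators; the option strings below reflect recent library versions and are an assumption about the library, not something stated in the notes.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

gini_tree = DecisionTreeClassifier(criterion="gini")              # Gini impurity
entropy_tree = DecisionTreeClassifier(criterion="entropy")        # information gain
variance_tree = DecisionTreeRegressor(criterion="squared_error")  # variance reduction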
Advantages:
1) Able to handle both numerical and categorical data.
2) Requires little data preparation (Normalizing the data is not required).
3) Uses a white box model: Its working can be easily seen and visualized.
4) Makes no assumptions about the training data or prediction residuals
5) Performs well with large datasets.
6) Robust against collinearity, particularly when boosting is used
Disadvantages:
1) A small change in the training data can result in a large change in the tree and
consequently the final predictions.
2) Decision-tree learners can create over-complex trees that do not generalize well from the
training data. This is known as overfitting. Mechanisms such as pruning are necessary to
avoid this problem (with the exception of some algorithms such as the Conditional Inference
approach that does not require pruning); see the pruning sketch after this list.
3) For data including categorical variables with different numbers of levels, information gain in
decision trees is biased in favor of attributes with more levels. However, the issue of biased
predictor selection is avoided by the Conditional Inference approach, a two-stage approach,
or adaptive leave-one-out feature selection.
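As referenced in point 2, a hedged sketch of post-pruning via cost-complexity pruning, assuming scikit-learn's ccp_alpha parameter; the toy data, the injected label noise, and the alpha value are illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (300, 2))
y = (X[:, 0] > 5).astype(int)
noise = rng.choice(300, 30, replace=False)
y[noise] = 1 - y[noise]                          # label noise invites overfitting

full = DecisionTreeClassifier(random_state=0).fit(X, y)                   # unpruned
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y) # pruned

print("unpruned leaves:", full.get_n_leaves())    # many leaves, fits the noise
print("pruned leaves:  ", pruned.get_n_leaves())  # far fewer leaves after pruning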