Intermediate R - Cluster Analysis
• No assumptions are made concerning the number of groups or the group structure.
• Consider sorting the 16 face cards in an ordinary deck into clusters of similar objects.
DATA REQUIREMENT
• Nominal - numbers or symbols that do not imply any ordering. E.g., color, sex, presence or absence.
• Ordinal - numbers or symbols where the relation > holds. E.g., damage score (1,3,5,7,9), size (small, medium, large)
• Interval - numbers that have the characteristics of ordinal data and, in addition, the distance between any two numbers on the scale is known. E.g., scholastic grades, income
• Ratio - interval data with a meaningful zero point. E.g., yield, plant height, tiller number
HIERARCHICAL CLUSTER ANALYSIS
• Agglomerative - starts with individual objects. Most similar objects are first grouped. Eventually, as the similarity decreases, all subgroups are fused into a single cluster.
• Divisive - an initial single group of objects is divided into two dissimilar subgroups. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects. (Will not be covered in this course.)
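As a minimal sketch of the agglomerative idea, the following uses base R's hclust() on a small made-up data set (the plant labels and measurements are hypothetical and do not come from Ratio_agro.csv). It starts from n single-object clusters and performs n - 1 fusions until one cluster remains.

```r
# Hypothetical data: 5 plants measured on 2 traits (values made up for illustration)
x <- matrix(c(1.0, 1.2,
              1.1, 1.0,
              4.0, 4.2,
              4.1, 3.9,
              9.0, 9.1),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("P", 1:5), c("trait1", "trait2")))

hc <- hclust(dist(x), method = "single")  # agglomerative clustering

# Each row of hc$merge is one fusion; negative entries are single objects,
# positive entries refer to clusters formed at an earlier step.
hc$merge
hc$height            # distance at which each fusion happened

cutree(hc, k = 5)    # start: every object is its own cluster
cutree(hc, k = 1)    # end: all objects fused into a single cluster
```

The fusion heights increase from step to step, which mirrors the statement that objects are fused as similarity decreases.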
STEPS IN PERFORMING HIERARCHICAL
CLUSTER ANALYSIS
WARNING
DATAFRAME: Ratio_agro.csv
Step 1. Obtain the Data Matrix – Ratio_agro.csv
Read data file Ratio_agro.csv
> Ratio <- read.table("Ratio_agro.csv", header=TRUE, sep=",", row.names="CODE")
> str(Ratio)
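The contents of Ratio_agro.csv are not shown here, so the sketch below writes a tiny stand-in file to a temporary location and reads it the same way. The column names other than CODE and all values are assumptions made for illustration.

```r
# Hypothetical stand-in for Ratio_agro.csv: a tiny CSV written to a temporary file.
# The column names other than CODE are made up for illustration.
csv_lines <- c("CODE,YIELD,HEIGHT,TILLER",
               "08R144,4.2,95.1,12",
               "08R153,3.8,101.4,10",
               "08R160,5.1,88.7,15")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# header = TRUE: the first line holds the variable names
# row.names = "CODE": use the CODE column as row labels, so only numeric columns remain
Ratio_demo <- read.table(tmp, header = TRUE, sep = ",", row.names = "CODE")
str(Ratio_demo)
rownames(Ratio_demo)
```

Moving the identifier column into the row names keeps it out of the distance calculation while still labelling the objects in later output.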
Resemblance (or similarity) coefficient - measures the degree of similarity or distance between a pair of objects.
♦ Distance measures
• Euclidean distance: $d_{ij} = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2}$
  dist(method="euclidean")
• Scaled Euclidean distance: $d_{ij} = \sqrt{\sum_{k=1}^{p} w_k^2 (X_{ik} - X_{jk})^2}$, where $w_k = 1/s_k$
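The two distance measures can be checked by hand against dist(). This is a sketch on made-up data, and it assumes that s_k is the sample standard deviation of variable k, which the slide does not state.

```r
# Hypothetical data: 3 objects (rows) measured on p = 2 variables (columns)
X <- matrix(c(1, 10,
              4, 30,
              2, 50),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("v1", "v2")))

# Euclidean distance between objects i and j, written out from the formula
euclid <- function(i, j) sqrt(sum((X[i, ] - X[j, ])^2))

# Scaled Euclidean distance with w_k = 1 / s_k
# (assumption: s_k is the sample standard deviation of variable k)
s <- apply(X, 2, sd)
scaled_euclid <- function(i, j) sqrt(sum((1 / s)^2 * (X[i, ] - X[j, ])^2))

euclid("a", "b")
as.matrix(dist(X, method = "euclidean"))["a", "b"]   # same value from dist()

scaled_euclid("a", "b")
as.matrix(dist(scale(X)))["a", "b"]   # scale() divides each column by its sd
```

Scaling matters here because v2 has a much larger spread than v1 and would otherwise dominate the unscaled distance.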
• Another approach: Such variables can be ordered
  E.g. Endosperm type
  Discrete: Non-glutinous(1), Glutinous(2), Indeterminate(3)
• Use daisy(x, type=list(ordratio=...)) in R if ranks are treated as continuous
• Use daisy() if a discrete variable has its class set to "ordered"
DISSIMILARITY MATRIX : dist()
• Computes and returns the distance matrix computed by using the specified distance measure
> dist(x, method="euclidean")
# x – a numeric matrix, data frame, or "dist" object
# method – the distance measure to be used. This could be "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski"
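A short sketch of how the method argument of dist() changes the result, using three made-up points chosen so that the distances are easy to verify by hand.

```r
# Hypothetical data: 3 objects, 2 variables
x <- rbind(A = c(0, 0), B = c(3, 4), C = c(6, 8))

dist(x, method = "euclidean")  # straight-line distance; A-B is 5
dist(x, method = "manhattan")  # sum of absolute differences; A-B is 7
dist(x, method = "maximum")    # largest absolute difference; A-B is 4

# dist() returns a compact "dist" object (lower triangle only);
# as.matrix() expands it to a full symmetric matrix
as.matrix(dist(x))
```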
K-means Clustering
Example of SLINK

            1      (2,3)   4      5
    1       0
D = (2,3)   5.10   0
    4       4.58   4.12    0
    5       3.00   4.36    6.17   0

Minimum distance: d15 = 3.00
⇒ 1 & 5 most similar
⇒ fuse 1 & 5

d(1,5)(2,3) = min(d1(2,3), d5(2,3)) = min(5.10, 4.36) = 4.36
Similarly,
d(1,5)(4) = min(d14, d54) = min(4.58, 6.17) = 4.58
d(2,3)(4) = min(d24, d34) = min(4.12, 4.12) = 4.12

            (1,5)   (2,3)   4
    (1,5)   0
D = (2,3)   4.36    0
    4       4.58    4.12    0

Group together clusters (2,3) and 4

d(1,5)(2,3,4)
= min(d12, d13, d14, d52, d53, d54)
= min(5.10, 6.17, 4.58, 4.36, 5.74, 6.17) = 4.36

              (1,5)   (2,3,4)
D = (1,5)     0
    (2,3,4)   4.36    0
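The worked SLINK example can be reproduced with hclust(method="single"). In this sketch the already-fused pair (2,3) is entered as a single object, since the distance between 2 and 3 is not given in the example.

```r
# Distances taken from the worked example; "(2,3)" is treated as a single object
labs <- c("1", "(2,3)", "4", "5")
D <- matrix(0, 4, 4, dimnames = list(labs, labs))
D[lower.tri(D)] <- c(5.10, 4.58, 3.00,   # distances from 1 to (2,3), 4, 5
                     4.12, 4.36,         # distances from (2,3) to 4, 5
                     6.17)               # distance from 4 to 5

hc_single <- hclust(as.dist(D), method = "single")
hc_single$height   # fusion distances: 3.00, 4.12, 4.36
hc_single$merge    # which objects/clusters are fused at each step
```

The three heights match the hand calculation: 1 and 5 fuse at 3.00, (2,3) and 4 fuse at 4.12, and the two remaining clusters fuse at 4.36.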
Dendrogram using SLINK
[Figure: dendrogram; y-axis: Euclidean Distance; x-axis: Plants (2, 3, 4, 1, 5)]
COMPLETE LINKAGE (CLINK)
• Also known as the Furthest Neighbor
• Similar to SLINK except that the distance between two clusters is determined by the distance between two elements, one from each cluster, that are most distant.
[Figure: SLINK vs. CLINK]
• Can be obtained using hclust(method="complete")
[Figure: dendrogram; x-axis: Plants (2, 3, 4, 1, 5)]
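To see how complete linkage differs from single linkage, the sketch below runs both on the distances from the SLINK example, again treating (2,3) as a single object because the distance between 2 and 3 is not given.

```r
# Distances from the SLINK example; "(2,3)" is treated as a single object
labs <- c("1", "(2,3)", "4", "5")
D <- matrix(0, 4, 4, dimnames = list(labs, labs))
D[lower.tri(D)] <- c(5.10, 4.58, 3.00, 4.12, 4.36, 6.17)
d <- as.dist(D)

hc_single   <- hclust(d, method = "single")    # nearest neighbour
hc_complete <- hclust(d, method = "complete")  # furthest neighbour

hc_single$height     # 3.00, 4.12, 4.36
hc_complete$height   # 3.00, 4.12, 6.17
```

The first two fusions are the same, but the final one happens at 6.17 under complete linkage, the largest distance between a member of (1,5) and a member of (2,3,4), instead of the smallest, 4.36.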
$ESS = \sum_{i=1}^{n_k} \sum_{j=1}^{p} (x_{ki,j} - \bar{x}_{k,j})^2$
[Figure: dendrogram; y-axis: Euclidean Distance; x-axis: Plants (2, 3, 4, 1, 5)]
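A minimal sketch of the ESS computation for a single cluster, assuming ESS is the sum, over the objects i in cluster k and the variables j, of squared deviations from the cluster mean. The function name ess and the data are hypothetical.

```r
# Hypothetical cluster k: 3 objects (rows) measured on 2 variables (columns)
cluster_k <- matrix(c(1, 2,
                      3, 4,
                      5, 6),
                    ncol = 2, byrow = TRUE)

# ESS: squared deviations of every value from its column (variable) mean, summed
ess <- function(x) {
  centered <- sweep(x, 2, colMeans(x))  # subtract each variable's cluster mean
  sum(centered^2)
}

ess(cluster_k)                       # 16
ess(cluster_k[1, , drop = FALSE])    # a single-object cluster has ESS 0
```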
# object – any R object that can be made into one of class "dendrogram"
# center – logical; if TRUE, nodes are plotted centered with respect to the leaves in the branch. Otherwise (default), plot them in the middle of all direct child nodes.
# nodePar – a list of plotting parameters to use for the nodes
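The arguments above appear to belong to as.dendrogram() and the plot method for dendrograms; that pairing is an assumption, since the call itself is not shown. A sketch on made-up data:

```r
# Hypothetical data: 5 plants, 2 traits (made-up values)
x <- matrix(c(1.0, 1.2,  1.1, 1.0,  4.0, 4.2,  4.1, 3.9,  9.0, 9.1),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("P", 1:5), NULL))
hc <- hclust(dist(x), method = "single")

dend <- as.dendrogram(hc)   # object: here an "hclust" result

pdf(NULL)                   # null device so the sketch also runs in batch mode;
                            # remove this line to see the plot interactively
plot(dend,
     center  = TRUE,                               # centre nodes over their leaves
     nodePar = list(pch = c(NA, 19), cex = 0.8),   # no symbol at inner nodes, dots at leaves
     ylab    = "Euclidean Distance")
invisible(dev.off())
```

In nodePar, a length-two pch gives the symbol for inner nodes first and for leaves second.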
[1] 0.8069435
K-MEANS CLUSTERING: kmeans()
• Perform k-means clustering on a data matrix
> kmeans(x, centers, algorithm…)
# x – a numeric matrix of data, or an object that can be coerced to such a matrix.
# centers – no. of clusters or a set of initial distinct cluster centers
# algorithm – character string to specify the algorithm used in the clustering method. Default algorithm is "Hartigan-Wong"
> KMRatio <- kmeans(Ratio, centers=3)
> names(KMRatio)
[1] "cluster" "centers" "withinss" "size"
• Select cluster 1
> grp1 <- names(which(KMRatio$cluster==1))
> grp1
[1] "08R144" "08R153" "08R160" "08R164" "08R168" "08R169"
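The same steps can be tried without the course data. This sketch uses simulated, well-separated groups; the entry labels E1 to E12 are hypothetical, and set.seed() and nstart are additions that make the run reproducible and less dependent on the random starting centers.

```r
set.seed(1)  # k-means starts from random centers; fix the seed for reproducibility

# Hypothetical data: 3 well-separated groups of 4 entries, 2 traits each
x <- rbind(matrix(rnorm(8, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(8, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(8, mean = 10, sd = 0.3), ncol = 2))
rownames(x) <- paste0("E", 1:12)

km <- kmeans(x, centers = 3, nstart = 10)  # nstart: try 10 random starts, keep the best

km$size                                   # number of entries per cluster
grp1 <- names(which(km$cluster == 1))     # row names of the entries in cluster 1
grp1
```

Cluster numbers are arbitrary labels, so which entries end up in "cluster 1" can change between runs unless the seed is fixed.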
R APPLICATION
DIFFERENT LEVELS OF DATA
Convert V1-V4, from integer to factor
> FLWR$V1 <- as.factor(FLWR$V1)
> FLWR$V2 <- as.factor(FLWR$V2)
> FLWR$V3 <- as.factor(FLWR$V3)
> FLWR$V4 <- as.factor(FLWR$V4)
Convert V5-V6, from integer to ordinal
> FLWR$V5 <- as.ordered(FLWR$V5)
> FLWR$V6 <- as.ordered(FLWR$V6)
• Compute all the pairwise dissimilarities (distances) between observations in the data set.
> daisy(x, metric="euclidean", stand, type=list())
# x – a numeric matrix or data frame
# metric – character string specifying the metric to be used
# stand – logical value specifying whether measurements in x are standardized before calculating the dissimilarities.
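A sketch of the conversion and daisy() steps on a small made-up data frame. FLWR_demo and its values are hypothetical stand-ins for FLWR, and only four of the variables are imitated.

```r
library(cluster)  # provides daisy(); a recommended package shipped with standard R installs

# Hypothetical stand-in for FLWR: integer codes for two nominal traits (V1, V4),
# one ordinal score (V5) and one numeric measurement (V7)
FLWR_demo <- data.frame(V1 = c(0L, 1L, 0L, 1L),
                        V4 = c(1L, 2L, 3L, 2L),
                        V5 = c(1L, 3L, 2L, 3L),
                        V7 = c(25, 150, 150, 50))

FLWR_demo$V1 <- as.factor(FLWR_demo$V1)    # nominal
FLWR_demo$V4 <- as.factor(FLWR_demo$V4)    # nominal
FLWR_demo$V5 <- as.ordered(FLWR_demo$V5)   # ordinal

# With non-numeric columns daisy() uses Gower's coefficient,
# so dissimilarities are scaled to lie between 0 and 1
d <- daisy(FLWR_demo, metric = "gower")
d
```

Setting the column classes first is what tells daisy() how to treat each variable; left as integers, the codes would be handled as interval-scaled numbers.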
Thank you!