Intermediate R - Cluster Analysis

Cluster analysis is an exploratory technique used to group objects based on the natural structure of the data without any preexisting groups. It involves standardizing data, generating a resemblance or distance matrix, and executing a clustering method to group similar objects hierarchically or non-hierarchically. The data must be in at least an interval scale for meaningful cluster analysis results. Key steps include obtaining the data, standardizing it, generating a resemblance matrix, and performing hierarchical or non-hierarchical clustering.

Uploaded by

Vivay Salazar


CLUSTER ANALYSIS

Slides prepared by: Leilani Nora (Assistant Scientist) and Violeta Bartolome (Senior Associate Scientist), PBGB-CRIL

WHAT IS CLUSTER ANALYSIS?

• Exploratory technique which may be used to search for category structure based on natural groupings in the data, or to reduce a very large body of data to a relatively compact description

• Cluster Analysis vs. Discriminant Analysis
Discriminant: has prespecified groups
Cluster: categorization is based on data

CLUSTER ANALYSIS

• No assumptions are made concerning the number of groups or the group structure.

• Grouping is done on the basis of similarities or distances (dissimilarities).

• There are various techniques for doing this, which may give different results. The researcher should therefore consider the validity of the clusters found.

ILLUSTRATION

• Consider sorting the 16 face cards in an ordinary deck into clusters of similar objects.

⇒ Cards may be classified by color, suit, face value, etc.
GROUPING FACE CARDS

[Figure: the 16 face cards (A, K, Q, J of ♠ ♣ ♦ ♥) grouped six ways: (a) individual cards, (b) individual suits, (c) black and red suits, (d) major and minor suits, (e) hearts plus queen of spades and other suits, (f) like face cards]

TWO KINDS OF CLUSTER ANALYSIS

• Hierarchical - involves the construction of a tree-like structure (dendrogram)

• Non-hierarchical - designed to group objects into a collection of k clusters. k may either be specified in advance or determined as part of the clustering procedure.
HIERARCHICAL CLUSTER ANALYSIS

• Agglomerative - starts with individual objects. The most similar objects are grouped first. Eventually, as the similarity decreases, all subgroups are fused into a single cluster.

• Divisive - an initial single group of objects is divided into two dissimilar subgroups. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects. (Will not be covered in this course.)

DATA REQUIREMENT

• Nominal - numbers or symbols that do not imply any ordering. E.g., color, sex, presence or absence.

• Ordinal - numbers or symbols where the relation > holds. E.g., damage score (1, 3, 5, 7, 9), size (small, medium, large)

• Interval - numbers that have the characteristics of ordinal data and, in addition, the distance between any two numbers on the scale is known. E.g., scholastic grades, income

• Ratio - interval data with a meaningful zero point. E.g., yield, plant height, tiller number
DATA REQUIREMENT: WARNING

• Data in practically all scales are amenable to cluster analysis.

• The measurement scale of the data affects the manner by which the resemblance coefficients are computed, but the overall clustering procedure is unaffected.

STEPS IN PERFORMING HIERARCHICAL CLUSTER ANALYSIS

1. Obtain the Data Matrix

2. Standardize the data matrix if need be

3. Generate the resemblance or distance matrix

4. Execute the Clustering Method

STEPS IN PERFORMING HIERARCHICAL CLUSTER ANALYSIS

Step 1. Obtain the Data Matrix – Ratio_agro.csv

DATAFRAME: Ratio_agro.csv

Read data file Ratio_agro.csv
> Ratio <- read.table("Ratio_agro.csv",
    header=TRUE, sep=",", row.names="CODE")
> str(Ratio)
'data.frame': 176 obs. of 5 variables:
 $ GWT   : num 3.2 2.5 2.98 2.63 2.28 2.34 2.69...
 $ GrainL: num 10.3 8.89 9.11 9.11 9.02 ...
 $ GrainW: num 4.45 3.53 3.86 3.76 3.31 3.08 3.25 ...
 $ Flower: int 83 79 106 102 94 94 94 94 76 83 ...
 $ PlantH: num 94.2 94.6 128.4 97.6 107.4 ...
STEPS IN PERFORMING HIERARCHICAL CLUSTER ANALYSIS

Step 2: Standardize the data matrix if need be.

Reasons for standardizing the data matrix:
• The units chosen for measuring attributes can arbitrarily affect the resemblance among objects.
• Standardizing makes attributes contribute more equally to the resemblances among objects.

• For clustering of objects (Q-cluster analysis), standardize the attributes with a column-standardizing function.
• For clustering of attributes (R-cluster analysis), standardize the objects with a row-standardizing function.

STEPS IN PERFORMING HIERARCHICAL CLUSTER ANALYSIS

Step 2: Standardize the data matrix if need be.

Standardizing function: $Z_{ij} = \frac{X_{ij} - \bar{X}_j}{S_j}$

STANDARDIZE THE DATA: data.Normalization()

• One of the functions in the package 'clusterSim' that standardizes or normalizes the data.
> data.Normalization(x, type="n0")
# x – vector, matrix or a dataframe
# type – type of normalization
    n0 – without normalization
    n1 – standardization (x-mean)/sd
    n2 – Weber standardization (x-median)/MAD
    n3 – unitization (x-mean)/range
    n4 – unitization with 0 minimum (x-min)/range
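The standardizing function above (type n1) can be checked by hand in base R. The sketch below is only an illustration: the data values and column names are hypothetical, and base R's scale() is used as a stand-in for data.Normalization(), on the assumption that both centre by the column mean and divide by the sample standard deviation.

```r
# Toy data (hypothetical values and column names, not read from Ratio_agro.csv)
x <- data.frame(height_cm = c(94.2, 94.6, 128.4, 97.6),
                weight_g  = c(3.20, 2.50, 2.98, 2.63))

# Column standardization by the formula Z_ij = (X_ij - mean_j) / S_j
z_manual <- sapply(x, function(col) (col - mean(col)) / sd(col))

# Base R's scale() centres by the mean and divides by the sample SD by default
z_scale <- scale(x)

# After standardizing, every column has mean 0 and standard deviation 1,
# so the measurement units no longer influence the distances
round(colMeans(z_manual), 10)
apply(z_manual, 2, sd)
```

Because both columns end up on the same scale, height in cm no longer dominates weight in g when distances are computed.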
STANDARDIZE THE DATA

• Standardize the dataframe "Ratio" (type="n1" is required here; the default type="n0" leaves the data unchanged)

> library(clusterSim)
> RatioZ <- data.Normalization(Ratio, type="n1")
> round(RatioZ, digits=4)
           GWT  GrainL  GrainW  Flower  PlantH
08R124  1.4836  1.1679  1.3265 -0.4642 -0.6126
08R125 -0.2507 -0.5308  0.0300 -0.8062 -0.5943
08R126  0.9385 -0.2658  0.4950  1.5021  0.9514
08R127  0.0714 -0.2658  0.3541  1.1602 -0.4571
08R128 -0.7958 -0.3742 -0.2801  0.4762 -0.0090
. . .

STEP 3: GENERATE THE RESEMBLANCE OR DISTANCE MATRIX.

DISSIMILARITY / DISTANCE MEASURES

Resemblance (or similarity) coefficient - measures the degree of similarity or distance between a pair of objects.

♦ Measures of proximity or closeness

♦ Distance measures

DISTANCE MEASURES : INTERVAL AND RATIO

• Euclidean distance: $d_{ij} = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2}$
  dist(method="euclidean")

• Scaled Euclidean distance: $d_{ij} = \sqrt{\sum_{k=1}^{p} w_k^2 (X_{ik} - X_{jk})^2}$, with $w_k = \frac{1}{s_k}$

• City Block or Manhattan: $d_{ij} = \sum_{k=1}^{p} |X_{ik} - X_{jk}|$
  dist(method="manhattan")
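As a quick check of the Euclidean and city block formulas, the sketch below computes both by hand for two objects and compares them with dist(). The two objects and their values are hypothetical, chosen so the arithmetic is easy to follow.

```r
# Two hypothetical objects i and j measured on p = 3 attributes
x <- rbind(i = c(1, 2, 3),
           j = c(4, 6, 3))

# Euclidean: square root of the summed squared differences
d_euc <- sqrt(sum((x["i", ] - x["j", ])^2))   # sqrt(9 + 16 + 0)

# City block (Manhattan): sum of the absolute differences
d_man <- sum(abs(x["i", ] - x["j", ]))        # 3 + 4 + 0

# The same quantities from dist()
dist(x, method = "euclidean")
dist(x, method = "manhattan")
```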
DISTANCE MEASURES : INTERVAL AND RATIO

• Minkowski Metric: $d_{ij} = \left( \sum_{k=1}^{p} |X_{ik} - X_{jk}|^{\lambda} \right)^{1/\lambda}$
  dist(method="minkowski")

• Czekanowski: $d_{ij} = 1 - \frac{2 \sum_{k=1}^{p} \min(x_{ik}, x_{jk})}{\sum_{k=1}^{p} (x_{ik} + x_{jk})}$

DISTANCE MEASURES : BINARY DATA

• Binary data where x_ik can only take two values, coded 0 and 1:

                       Individual i
                       1      0      Total
Individual j    1      a      b      a+b
                0      c      d      c+d
            Total      a+c    b+d    p

• Gower distinguishes two types of binary variables, symmetric and asymmetric. Asymmetry in binary data arises when one state is considered more informative than the other.

DISTANCE MEASURES : SYMMETRIC BINARY

• Simple Matching Coefficient: $d_{ij} = 1 - \frac{a+d}{a+b+c+d} = \frac{b+c}{a+b+c+d}$
  - Can be calculated in daisy(metric="gower") of the cluster package

DISTANCE MEASURES : ASYMMETRIC BINARY

• Jaccard Coefficient: $d_{ij} = 1 - \frac{a}{a+b+c} = \frac{b+c}{a+b+c}$
  - Can be calculated in dist(method="binary")
DISTANCE MEASURES : ASYMMETRIC BINARY

• Czekanowski Coefficient (Dice): $d_{ij} = 1 - \frac{2a}{2a+b+c} = \frac{b+c}{2a+b+c}$
  - Increases the weights of agreements
  - Can be calculated in dist.binary(method=5) of the ade4 package

DISTANCE MEASURES : NOMINAL DATA

• One approach: Treat each level as a binary variable
  E.g. Endosperm type - Non-glutinous, Glutinous, Indeterminate

            Attributes       NG   G   I
  Plant 1   Glutinous         0   1   0
  Plant 2   Indeterminate     0   0   1
  Plant 3   Non-glutinous     1   0   0

• Use the Simple matching coefficient using daisy() in R, having specified that the variable concerned is a factor.
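The binary coefficients (simple matching, Jaccard, Dice) differ only in how they use the counts a, b, c and d. The sketch below tabulates those counts for two hypothetical individuals and evaluates the three dissimilarities directly from the formulas; the vectors xi and xj are made up for illustration.

```r
# Two hypothetical individuals scored on 8 binary attributes
xi <- c(1, 1, 0, 0, 1, 0, 1, 0)
xj <- c(1, 0, 0, 1, 1, 0, 0, 0)

# Cells of the 2 x 2 table (rows: individual j, columns: individual i)
a <- sum(xj == 1 & xi == 1)   # both present
b <- sum(xj == 1 & xi == 0)   # present in j only
c <- sum(xj == 0 & xi == 1)   # present in i only
d <- sum(xj == 0 & xi == 0)   # both absent

simple_matching <- (b + c) / (a + b + c + d)   # joint absences count as agreement
jaccard         <- (b + c) / (a + b + c)       # joint absences ignored
dice            <- (b + c) / (2 * a + b + c)   # joint presences weighted double

c(simple_matching = simple_matching, jaccard = jaccard, dice = dice)

# dist(method = "binary") returns the Jaccard dissimilarity
dist(rbind(xi, xj), method = "binary")
```

With these vectors a = 2, b = 1, c = 2 and d = 3, so ignoring the joint absences (Jaccard, Dice) gives a larger dissimilarity than simple matching.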

DISTANCE MEASURES : NOMINAL DATA

• Another approach: Such variables can be ordered
  E.g. Endosperm type
  Discrete: Non-glutinous (1), Glutinous (2), Indeterminate (3)

• Use daisy() with the variable listed under "ordratio" in its type argument if ranks are treated as continuous

• Use daisy() if a discrete variable has the class set to "ordered"

DISSIMILARITY MATRIX : dist()

• Computes and returns the distance matrix computed by using the specified distance measure

> dist(x, method="euclidean")

# x – a numeric matrix, dataframe, or "dist" object
# method – the distance measure to be used. This could be "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski"
DISSIMILARITY MATRIX : dist()

> Rdist <- dist(RatioZ, method="euclidean")
> round(Rdist, digits=4)
       08R124 08R125 08R126 08R127 08R128 08R129 ...
08R125 2.7734
08R126 3.0588 3.0690
08R127 2.7673 2.0407 1.6949
08R128 3.3769 1.5508 2.3654 1.3545
08R129 3.3198 1.5383 2.5671 1.3833 0.5585
. . .

STEP 4: EXECUTE THE CLUSTERING METHOD

CLUSTERING METHOD

Types of Hierarchical Agglomerative Clustering
1. Single Linkage (SLINK)
2. Complete Linkage (CLINK)
3. Average Linkage (ALINK)
4. Ward's Method - minimize the error SS
5. Centroid Method

K-means Clustering

SINGLE LINKAGE (SLINK)

• Also known as the Nearest Neighbor method
• Procedure is based on the minimum distance between elements
• Can be obtained in R using hclust(method="single")
Example of SLINK

• Consider the hypothetical distance matrix

          1      2      3      4      5
     1    0
     2   5.10    0
D =  3   6.17   1.42    0
     4   4.58   4.12   4.12    0
     5   3.00   4.36   5.74   6.17    0

Minimum distance is 1.42 ⇒ 2 & 3 most similar ⇒ fuse 2 & 3

Distance between (2,3) and 1:  d(2,3)(1) = min(d21, d31) = min(5.10, 6.17) = 5.10
Distance between (2,3) and 4:  d(2,3)(4) = min(d24, d34) = min(4.12, 4.12) = 4.12
Distance between (2,3) and 5:  d(2,3)(5) = min(d25, d35) = min(4.36, 5.74) = 4.36

           1     (2,3)    4      5
     1     0
D = (2,3) 5.10    0
     4    4.58   4.12     0
     5    3.00   4.36    6.17    0

Minimum distance is 3.00 ⇒ 1 & 5 most similar ⇒ fuse 1 & 5

Distance between (1,5) and (2,3):  d(1,5)(2,3) = min(d12, d13, d52, d53) = min(5.10, 6.17, 4.36, 5.74) = 4.36
Similarly,
d(1,5)(4) = min(d14, d54) = min(4.58, 6.17) = 4.58
d(2,3)(4) = min(d24, d34) = min(4.12, 4.12) = 4.12

          (1,5)  (2,3)    4
    (1,5)   0
D = (2,3)  4.36    0
     4     4.58   4.12    0

Minimum distance is 4.12 ⇒ group together clusters (2,3) and 4

d(1,5)(2,3,4) = min(d12, d13, d14, d52, d53, d54)
              = min(5.10, 6.17, 4.58, 4.36, 5.74, 6.17) = 4.36

            (1,5)  (2,3,4)
D = (1,5)     0
    (2,3,4)  4.36     0
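The hand calculation can be reproduced with hclust(). The sketch below types in the hypothetical 5 x 5 distance matrix from the example and runs single linkage on it; the object names D and hc_single are illustrative only.

```r
# Hypothetical distance matrix from the example (objects 1 to 5),
# entered column by column below the diagonal
D <- matrix(0, 5, 5, dimnames = list(as.character(1:5), as.character(1:5)))
D[lower.tri(D)] <- c(5.10, 6.17, 4.58, 3.00,   # d21, d31, d41, d51
                     1.42, 4.12, 4.36,         # d32, d42, d52
                     4.12, 5.74,               # d43, d53
                     6.17)                     # d54
D <- as.dist(D)

# Single linkage: the distance between clusters is the minimum pairwise distance
hc_single <- hclust(D, method = "single")

# Heights at which the successive fusions occur
hc_single$height

# Dendrogram of the five objects
plot(hc_single, main = "SLINK on the hypothetical distance matrix")
```

The fusion heights should match the minima found in the worked example: 1.42 for (2,3), 3.00 for (1,5), 4.12 when 4 joins (2,3), and 4.36 for the final merge.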
Dendrogram using SLINK

[Figure: dendrogram using SLINK. Vertical axis: Euclidean distance (0 to 4.0); horizontal axis: plants, in the order 2, 3, 4, 1, 5]

COMPLETE LINKAGE (CLINK)

• Also known as the Furthest Neighbor method
• Similar to SLINK, except that the distance between two clusters is determined by the distance between the two elements, one from each cluster, that are most distant.
• Can be obtained using hclust(method="complete")

[Figure: two point clouds contrasting the between-cluster distance used by SLINK and by CLINK]
Example of CLINK

• Consider the same distance matrix

          1      2      3      4      5
     1    0
     2   5.10    0
D =  3   6.17   1.42    0
     4   4.58   4.12   4.12    0
     5   3.00   4.36   5.74   6.17    0

Most similar: 2 & 3 ⇒ fuse 2 & 3

d(2,3)(1) = max(d21, d31) = max(5.10, 6.17) = 6.17
d(2,3)(4) = max(d24, d34) = max(4.12, 4.12) = 4.12
d(2,3)(5) = max(d25, d35) = max(4.36, 5.74) = 5.74
d14, d15, and d45 derive their values from the original matrix

           1     (2,3)    4      5
     1     0
D = (2,3) 6.17    0
     4    4.58   4.12     0
     5    3.00   5.74    6.17    0

Most similar: clusters 1 & 5

d(2,3)(1,5) = max(d21, d31, d25, d35) = max(5.10, 6.17, 4.36, 5.74) = 6.17
d(2,3)(4) = max(d24, d34) = max(4.12, 4.12) = 4.12
d(1,5)(4) = max(d14, d54) = max(4.58, 6.17) = 6.17

          (1,5)  (2,3)    4
    (1,5)   0
D = (2,3)  6.17    0
     4     6.17   4.12    0

Most similar: clusters (2,3) and 4

d(1,5)(2,3,4) = max(d12, d13, d14, d52, d53, d54)
              = max(5.10, 6.17, 4.58, 4.36, 5.74, 6.17) = 6.17

            (1,5)  (2,3,4)
D = (1,5)     0
    (2,3,4)  6.17     0
Dendrogram using CLINK

[Figure: dendrogram using CLINK. Vertical axis: Euclidean distance (0 to 6.0); horizontal axis: plants, in the order 2, 3, 4, 1, 5]

AVERAGE LINKAGE (ALINK)

• Also known as UPGMA (unweighted pair-group method using arithmetic averages)
• The clustering criterion is the average distance from objects in one cluster to those in another
• Can be obtained in R using agnes(method="average") in package 'cluster'

Example of ALINK

• Consider the same distance matrix

          1      2      3      4      5
     1    0
     2   5.10    0
D =  3   6.17   1.42    0
     4   4.58   4.12   4.12    0
     5   3.00   4.36   5.74   6.17    0

Most similar: 2 & 3 ⇒ fuse 2 & 3

d(2,3)(1) = mean(d21, d31) = mean(5.10, 6.17) = 5.64
d(2,3)(4) = mean(d24, d34) = mean(4.12, 4.12) = 4.12
d(2,3)(5) = mean(d25, d35) = mean(4.36, 5.74) = 5.05
d14, d15, and d45 derive their values from the original matrix

           1     (2,3)    4      5
     1     0
D = (2,3) 5.64    0
     4    4.58   4.12     0
     5    3.00   5.05    6.17    0

Most similar: clusters 1 & 5

d(2,3)(1,5) = mean(d21, d31, d25, d35) = mean(5.10, 6.17, 4.36, 5.74) = 5.34
d(2,3)(4) = mean(d24, d34) = mean(4.12, 4.12) = 4.12
d(1,5)(4) = mean(d14, d54) = mean(4.58, 6.17) = 5.38

          (1,5)  (2,3)    4
    (1,5)   0
D = (2,3)  5.34    0
     4     5.38   4.12    0

Most similar: clusters (2,3) and 4

d(1,5)(2,3,4) = mean(d12, d13, d14, d52, d53, d54)
              = mean(5.10, 6.17, 4.58, 4.36, 5.74, 6.17) = 5.35

            (1,5)  (2,3,4)
D = (1,5)     0
    (2,3,4)  5.35     0
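The CLINK and ALINK examples can be checked the same way. This sketch rebuilds the hypothetical distance matrix and compares the fusion heights of complete and average linkage; hclust() is used for both here for brevity, although the slides obtain ALINK through agnes().

```r
# Hypothetical distance matrix from the examples (objects 1 to 5)
D <- matrix(0, 5, 5, dimnames = list(as.character(1:5), as.character(1:5)))
D[lower.tri(D)] <- c(5.10, 6.17, 4.58, 3.00,   # d21, d31, d41, d51
                     1.42, 4.12, 4.36,         # d32, d42, d52
                     4.12, 5.74,               # d43, d53
                     6.17)                     # d54
D <- as.dist(D)

hc_complete <- hclust(D, method = "complete")  # maximum pairwise distance
hc_average  <- hclust(D, method = "average")   # mean pairwise distance

# Fusion heights side by side; only the last merge differs between the methods
rbind(CLINK = hc_complete$height,
      ALINK = round(hc_average$height, 2))
```

The first three fusions happen at the same heights in both methods (1.42, 3.00, 4.12); the final merge is at 6.17 under CLINK and about 5.35 under ALINK, as in the worked examples.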
Dendrogram using ALINK

[Figure: dendrogram using ALINK. Vertical axis: Euclidean distance (0 to 5.0); horizontal axis: plants, in the order 2, 3, 4, 1, 5]

WARD'S METHOD

• Ward (1963) proposed a method in which clustering proceeds by selecting those groupings which minimize the error sum of squares.

$ESS = \sum_{i=1}^{n_k} \sum_{j=1}^{p} \left( x_{ki,j} - \bar{x}_{k,j} \right)^2$

• Where: $\bar{x}_{k,j}$ is the mean of cluster k with respect to variable j
         $x_{ki,j}$ is the value of variable j for each object i in cluster k.
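To make the error sum of squares concrete, the sketch below evaluates it for two candidate groupings of a small data set. The helper ess() and the data are hypothetical; the function sums the per-cluster ESS from the formula over all clusters, which is an assumption about how the criterion is totalled.

```r
# Total error sum of squares: for each cluster k, sum the squared deviations
# of its objects from the cluster mean on every variable j, then add up
ess <- function(x, cluster) {
  x <- as.matrix(x)
  per_cluster <- sapply(split(seq_len(nrow(x)), cluster), function(idx) {
    xk <- x[idx, , drop = FALSE]
    sum(sweep(xk, 2, colMeans(xk))^2)   # deviations from the cluster means
  })
  sum(per_cluster)
}

# Hypothetical data: six objects, two variables, two obvious groups
x <- cbind(v1 = c(1, 2, 3, 10, 11, 12),
           v2 = c(1, 1, 1,  5,  5,  5))

ess_tight <- ess(x, c(1, 1, 1, 2, 2, 2))   # objects grouped with their neighbours
ess_mixed <- ess(x, c(1, 2, 1, 2, 1, 2))   # groups that mix the two ends

c(tight = ess_tight, mixed = ess_mixed)
```

A grouping that keeps similar objects together has a much smaller ESS, which is why Ward's method prefers the fusion that increases ESS the least.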

SUMMARY OF HIERARCHICAL AGGLOMERATIVE CLUSTERING

Method      R call
SLINK       method = "single"
CLINK       method = "complete"
ALINK       method = "average"
WARD        method = "ward"
MEDIAN      method = "median"
CENTROID    method = "centroid"

• From SLINK to WARD you can use either agnes() of the cluster package or hclust()
• For Median and Centroid use hclust()

HIERARCHICAL CLUSTERING : hclust()

• Hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it.

> hclust(d, method="complete", …)

# d – a dissimilarity structure as produced by dist() or daisy()
# method – the agglomeration method to be used, such as "ward", "single", "complete", "average", "mcquitty", "median" or "centroid" (from R 3.1.0 onward hclust() names the Ward option "ward.D" and adds "ward.D2")

> SIN1 <- hclust(Rdist, method="single")
GENERAL TREE STRUCTURE : as.dendrogram()

• as.dendrogram() converts an object to class "dendrogram", which provides general functions for handling tree-like structures.

> as.dendrogram(object, …)

# object – any R object that can be made into one of class "dendrogram"

> denSIN1 <- as.dendrogram(SIN1)

DENDROGRAM: plot()

> plot(denSIN1, center = TRUE,
    nodePar = list(lab.cex = 0.6,
    lab.col = "black", pch = NA),
    main = "Dendrogram using SLINK Clustering Method")

# center – logical; if TRUE, nodes are plotted centered with respect to the leaves in the branch. Otherwise (default), they are plotted in the middle of all direct child nodes.
# nodePar – a list of plotting parameters to use for the nodes

[Figures: DENDROGRAM USING SLINK, DENDROGRAM USING CLINK, DENDROGRAM USING ALINK, DENDROGRAM USING WARD]

AGGLOMERATIVE METHOD : agnes()

• Computes agglomerative hierarchical clustering of the dataset. Required package: 'cluster'

> agnes(x, diss, metric="euclidean",
    stand=FALSE, method="average", …)

# x – data matrix or dataframe, or dissimilarity matrix
# diss – logical value indicating whether x is a dissimilarity matrix or a data matrix
# metric – character string specifying the metric to be used for calculating dissimilarities
# stand – logical value indicating whether measurements should be standardized for each variable before calculating dissimilarities
# method – character string defining the clustering method ("average", "single", "complete", "ward", "weighted", "flexible")
DRAW RECTANGLES : rect.hclust()

• Draws rectangles around the branches of a dendrogram, highlighting the corresponding clusters.

> rect.hclust(tree, k, h, border)

# tree – hclust object
# k, h – scalar; cuts the dendrogram either so that exactly k clusters are produced or at height h
# border – vector with border colors for the rectangles

AGGLOMERATIVE METHOD : agnes()

> library(cluster)
> AVE <- agnes(Ratio, stand = TRUE,
    metric = "euclidean",
    method = "average")
> denAVE <- as.dendrogram(AVE)
> par(cex=0.8, mar=c(3,2,2,2),
    cex.axis=0.8)
> plot(denAVE, horiz=FALSE,
    center=TRUE, nodePar=list(lab.cex=0.6,
    lab.col="forest green", pch=NA),
    main = "Dendrogram using ALINK Clustering Method")
> rect.hclust(tree=as.hclust(AVE), k=4,
    border=c("red", "blue", "green", "purple"))

AGGLOMERATIVE METHOD : agnes()

[Figure: output of the plot() and rect.hclust() calls above]

CUTTING A CLUSTER: cutree()

• Cuts a tree into several groups, either by specifying the desired number of groups or the cut heights.

> cutree(tree, k, h)

# tree – a tree as produced by hclust
# k – an integer scalar or vector with the desired number of groups
# h – numeric scalar or vector with the heights at which the tree should be cut
CUTTING A CLUSTER: cutree()

> AVE2 <- cutree(as.hclust(AVE), k=3)
> table(AVE2)
AVE2
 1  2  3      (cluster number)
 3 46  2      (number of units)

COPHENETIC CORRELATION

• Can be used as a measure of goodness of fit of a particular dendrogram.

$\rho_{cophenetic} = \frac{\sum_{i<j} (d_{ij} - \bar{d})(h_{ij} - \bar{h})}{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2 \sum_{i<j} (h_{ij} - \bar{h})^2}}$

where d_ij is the original distance between objects i and j, and h_ij is the dendrogram height at which i and j are first joined.

• Use hclust() and cophenetic()
COPHENETIC CORRELATION : cophenetic()

• Computes the cophenetic distances for a hierarchical clustering

> cophenetic(x)

# x – an R object representing a hierarchical clustering

> AVECOP <- cophenetic(AVE)
> cor(Rdist, AVECOP)
[1] 0.8069435

K-MEANS CLUSTERING

• A different method of clustering, aimed at finding "more homogeneous" subgroups within the data
• Given k starting points, the data are classified, the centroids are recalculated, and the process iterates until stable.
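The classify, recalculate, iterate loop can be sketched in a few lines of base R. This is a plain Lloyd-style iteration written for illustration, not the Hartigan-Wong algorithm that kmeans() uses by default; the simulated data and the choice of starting points are assumptions.

```r
set.seed(1)
# Hypothetical data: two well-separated groups of 20 points in two variables
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

k <- 2
centers <- x[c(1, 21), , drop = FALSE]   # k starting points (one from each half)

repeat {
  # Classify: squared Euclidean distance from every point to every centroid
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")   # index of the nearest centroid

  # Recalculate the centroids as the mean of the points assigned to them
  new_centers <- t(sapply(seq_len(k), function(j)
    colMeans(x[cluster == j, , drop = FALSE])))

  # Stop when the centroids no longer move
  if (isTRUE(all.equal(new_centers, centers))) break
  centers <- new_centers
}

table(cluster)
centers
```

In practice kmeans() is preferable; the loop is only meant to show what "iterates until stable" refers to.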
K-MEANS CLUSTERING: kmeans()

• Performs k-means clustering on a data matrix

> kmeans(x, centers, algorithm, …)

# x – a numeric matrix of data, or an object that can be coerced to such a matrix
# centers – number of clusters, or a set of initial (distinct) cluster centers
# algorithm – character string to specify the algorithm used in the clustering method. The default algorithm is "Hartigan-Wong"

> KMRatio <- kmeans(Ratio, centers=3)
> names(KMRatio)
[1] "cluster"  "centers"  "withinss" "size"

• Select cluster 1
> grp1 <- names(which(KMRatio$cluster==1))
> grp1
[1] "08R144" "08R153" "08R160" "08R164" "08R168" "08R169"
K-MEANS CLUSTERING: plot()

> plot(Ratio, col=KMRatio$cluster,
    pch=KMRatio$cluster)

> plot(prcomp(Ratio, center=TRUE)$x[,c(1, 2, 3)],
    col=KMRatio$cluster,
    pch=KMRatio$cluster)
R APPLICATION: DIFFERENT LEVELS OF DATA

DATAFRAME: Flower.csv

• Data with 8 characteristics for 18 popular flowers

V1 – asymmetric binary, indicates whether the plant may be left in the garden when it freezes
V2 – binary, shows whether the plant needs to stand in the shadow
V3 – asymmetric binary, distinguishes between plants with tubers and plants that grow in any other way
V4 – nominal, specifies the flower's color
V5 – ordinal, indicates whether the plant grows in dry, normal or wet soil
V6 – ordinal, preference ranking
V7 – ratio, plant's height in cm
V8 – ratio, distance in cm between plants

Read data file Flower.csv
> FLWR <- read.table("Flower.csv", sep=",",
    header=TRUE)
> str(FLWR)
'data.frame': 18 obs. of 8 variables:
 $ V1: int 0 1 0 0 0 0 0 0 1 1 ...
 $ V2: int 1 0 1 0 1 1 0 0 1 1 ...
 $ V3: int 1 0 0 1 0 0 0 1 0 0 ...
 $ V4: int 4 2 3 4 5 4 4 2 3 5 ...
 $ V5: int 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: int 15 3 1 16 2 12 13 7 4 14 ...
 $ V7: int 25 150 150 125 20 50 40...
 $ V8: int 15 50 50 50 15 40 20 15...
DATAFRAME: Flower.csv

Convert V1-V4 from integer to factor

> FLWR$V1 <- as.factor(FLWR$V1)
> FLWR$V2 <- as.factor(FLWR$V2)
> FLWR$V3 <- as.factor(FLWR$V3)
> FLWR$V4 <- as.factor(FLWR$V4)

Convert V5-V6 from integer to ordinal

> FLWR$V5 <- as.ordered(FLWR$V5)
> FLWR$V6 <- as.ordered(FLWR$V6)

DISSIMILARITY MATRIX : daisy()

• Computes all the pairwise dissimilarities (distances) between observations in the data set.

> daisy(x, metric="euclidean",
    stand=FALSE, type=list())

# x – a numeric matrix or dataframe
# metric – character string specifying the metric to be used
# stand – logical value specifying whether measurements in x are standardized before calculating the dissimilarities

DISSIMILARITY MATRIX : daisy()

> daisy(x, metric="euclidean",
    stand=FALSE, type=list())

# type – list for specifying some (or all) of the types of the variables in x.
  The list may contain the following components:
  "ordratio" - ratio scaled variables to be treated as ordinal
  "logratio" - ratio scaled variables that must be logarithmically transformed
  "asymm" - asymmetric binary
  "symm" - symmetric binary

> library(cluster)
> df1 <- daisy(FLWR,
    type=list(asymm = c("V1","V3"),
    symm=2, ordratio=6))
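For readers without Flower.csv, the cluster package ships a built-in data set named flower that also has 18 flowers and variables V1 to V8. The sketch below uses it on the assumption that it is the same data as the CSV file, which the slides do not state; only the binary variable types are declared, and the remaining types are left for daisy() to infer from the column classes.

```r
library(cluster)

# Built-in data set (assumed here to match Flower.csv): factors V1-V4,
# ordered factors V5-V6, numeric V7-V8
data(flower)
str(flower)

# Mixed variable types make daisy() use Gower's coefficient;
# V1 and V3 are declared asymmetric binary, V2 symmetric binary
d_flower <- daisy(flower,
                  type = list(asymm = c("V1", "V3"), symm = "V2"))

# Summary of the 18 x 18 dissimilarity structure
summary(d_flower)
```

Gower dissimilarities lie between 0 and 1, so a cut height such as h = 0.3 on the resulting dendrogram is on that scale.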
AGGLOMERATIVE METHOD : agnes()

> AGN.FLWR <- agnes(df1)
> plot(AGN.FLWR, hang=-1)
> rect.hclust(tree=as.hclust(AGN.FLWR), h=0.3,
    border=c("red", "blue"))

HIERARCHICAL CLUSTERING WITH P-VALUES VIA MULTISCALE BOOTSTRAP RESAMPLING

Package 'pvclust'

• An R package for assessing the uncertainty in hierarchical cluster analysis.
• P-values are calculated via multiscale bootstrap resampling.
• P-values are between 0 and 1 and indicate how strongly the cluster is supported by the data.
• Two types of p-values:
  1. Approximately Unbiased (AU)
  2. Bootstrap Probability value (BP)

MULTISCALE BOOTSTRAP RESAMPLING

• A method which assesses the accuracy of the clusters by calculating p-values through resampling of the data.
• For a cluster with p-value > 0.95, reject Ho (the cluster does not exist). Thus, the cluster exists.
• The p-value calculated by multiscale bootstrap resampling (AU) is an approximation, but it is less biased than BP.
STEPS IN PERFORMING HIERARCHICAL CLUSTERING WITH P-VALUES

1. Perform hierarchical clustering with p-values via multiscale bootstrapping
2. Diagnostic – identify clusters with an extremely high standard error by examining the plot
3. Obtain the estimated values of the clusters identified in step 2 and evaluate the result
4. Apply a remedial measure whenever needed – this requires a larger number of bootstrap samples

DATAFRAME: lung.csv

• DNA microarray data of lung tumors, where rows correspond to genes and columns to individuals.
DATAFRAME: lung.csv

Read data file lung.csv
> LUNG <- read.table("lung.csv", sep=",",
    header=TRUE, row.names="Gene")
> str(LUNG)
'data.frame': 916 obs. of 73 variables:
 $ X1 : num -0.4 -2.22 -1.35 0.68 NA ...
 $ X2 : num 4.28 5.21 -0.84 0.56 4.14...
 $ X3 : num 3.68 4.75 -2.88 -0.45 3.58...
 $ X4 : num -1.35 -0.91 3.35 -0.2 -0.4...
 $ X5 : num -1.74 -0.33 3.02 1.14 -...
 $ X6 : num 2.2 2.56 -4.48 0.22 1.59...
 . . .

P-VALUES FOR HIERARCHICAL CLUSTERING : pvclust()

• Performs hierarchical cluster analysis via the function 'hclust' and calculates p-values for all the clusters contained in the clustering of the original data, via multiscale bootstrap resampling.

> pvclust(data, method.hclust,
    method.dist, nboot, r)

# data – a numeric matrix or a dataframe
# method.hclust – the agglomerative method used in hierarchical clustering. Same as the method argument of hclust()
P-VALUES FOR HIERARCHICAL CLUSTERING : pvclust()

> pvclust(data, method.hclust,
    method.dist, nboot)

# method.dist – distance measure to be used, such as "correlation", "uncentered", "abscor", or the same methods as in dist()
# nboot – the number of bootstrap replications. Default is 1000

FIND CLUSTERS WITH HIGH P-VALUES: pvrect()

> pvrect(x, pv="au", alpha=0.95)

# x – object of class 'pvclust'
# alpha – threshold value for p-values
# pv – specifies the p-value to be used, either "au" or "bp"

PRINT CLUSTERS WITH HIGH P-VALUES: pvpick()

> pvpick(x, pv="au", alpha=0.95)

# x – object of class 'pvclust'
# alpha – threshold value for p-values
# pv – specifies the p-value to be used, either "au" or "bp"

P-VALUES FOR HIERARCHICAL CLUSTERING : pvclust()

> lung.pv <- pvclust(LUNG,
    nboot=1000, method.hclust="average",
    method.dist="correlation")
> windows(17, 10)    # Windows only; dev.new(width=17, height=10) elsewhere
> plot(lung.pv, hang=-1, ylim=c(0,1.2))
> pvrect(lung.pv, alpha=0.95)
> pvpick(lung.pv, alpha=0.95)

[Figure: DENDROGRAM : plot()]

[Output: pvpick()]

DIAGNOSTIC PLOT FOR STANDARD ERROR OF P-VALUE : seplot()

• Draws a diagnostic plot of the standard error of the p-values for a pvclust object

> seplot(object, type, identify=FALSE)

# object – object of class 'pvclust'
# type – the type of p-value to be plotted, either "au" or "bp"
# identify – logical value specifying whether edge numbers can be identified interactively

> seplot(lung.pv, identify=TRUE)
PRINT VALUES : print()

• A generic function, which means that new printing methods can easily be added for new classes.

> print(x, …)

# x – the object to be printed

• Print the results for clusters 21, 65, and 67
> print(lung.pv, which=c(21,65,67))

REMEDIAL MEASURE

• Increase the number of bootstrap replications specified in "nboot"

Thank you!