Lecture 23
Cluster Analysis
(Type of data in Cluster Analysis)
Data Structures
Measure the Quality of Clustering
Type of data in cluster analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Summary
• Dissimilarity/similarity metric:
• Similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
• A separate “quality” function measures the “goodness” of a cluster.
• The definitions of distance functions usually differ considerably for interval-scaled,
Boolean, categorical, ordinal, and ratio variables.
• Weights should be assigned to the variables based on the application and the
data semantics.
• It is hard to define “similar enough” or “good enough”;
the answer is typically highly subjective.
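Interval-scaled variables
• Standardize the data.
• Calculate the mean absolute deviation of each variable f:
\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right) \]
• where m_f is the mean of variable f over the n objects:
\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right) \]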
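The two formulas above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. The function names and the sample data are hypothetical. The final division by s_f (a z-score style step) is an assumption about what "standardize" means here, since the notes give only m_f and s_f.

```python
def mean_absolute_deviation(values):
    """Return (m_f, s_f) for one variable f measured on n objects."""
    n = len(values)
    m_f = sum(values) / n                        # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    return m_f, s_f


def standardize(values):
    """Assumed standardization step: z_if = (x_if - m_f) / s_f."""
    m_f, s_f = mean_absolute_deviation(values)
    if s_f == 0:
        raise ValueError("all values are identical; s_f is zero")
    return [(x - m_f) / s_f for x in values]


# Hypothetical measurements of one interval-scaled variable.
heights_cm = [150.0, 160.0, 170.0, 180.0]
m_f, s_f = mean_absolute_deviation(heights_cm)  # 165.0, 10.0
z = standardize(heights_cm)                     # [-1.5, -0.5, 0.5, 1.5]
```

After this step the variable no longer depends on its original unit of measurement, so it does not dominate a distance computation merely because of its scale.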
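• Minkowski distance:
\[ d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q} \]
• where \(i = (x_{i1}, x_{i2}, \dots, x_{ip})\) and \(j = (x_{j1}, x_{j2}, \dots, x_{jp})\) are two p-dimensional data
objects, and q is a positive integer.
• Manhattan distance: if q = 1,
\[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}| \]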
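As a rough sketch of the distance described above, the function below computes the order-q distance between two p-dimensional objects. The function name and the sample objects are hypothetical. The q = 2 call is included only as the well-known Euclidean special case, which these notes do not discuss.

```python
def minkowski_distance(i, j, q=1):
    """Order-q distance between two p-dimensional objects i and j."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)


# Hypothetical 3-dimensional objects.
obj_i = [1.0, 2.0, 3.0]
obj_j = [4.0, 6.0, 3.0]

manhattan = minkowski_distance(obj_i, obj_j, q=1)  # 3 + 4 + 0 = 7.0
euclidean = minkowski_distance(obj_i, obj_j, q=2)  # sqrt(9 + 16 + 0) = 5.0
```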
Binary variables
• If all binary variables are regarded as having the same weight, construct a 2-by-2
contingency table of the match and mismatch counts for objects i and j.
• A binary variable is symmetric if both of its states are equally valuable and carry the
same weight; that is, there is no preference as to which outcome is coded as 0 or 1.
• Ex: gender (male or female)
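• Dissimilarity based on symmetric binary variables is called symmetric binary
dissimilarity:
\[ d(i, j) = \frac{b + c}{a + b + c + d} \]
• where, in the contingency table, a is the number of variables equal to 1 for both
objects, b the number equal to 1 for i and 0 for j, c the number equal to 0 for i and
1 for j, and d the number equal to 0 for both.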
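The symmetric binary dissimilarity can be sketched as below. This is a minimal illustration that assumes the usual reading of the contingency table (a: both 1, b: 1 for i and 0 for j, c: 0 for i and 1 for j, d: both 0). The function name and the sample vectors are hypothetical.

```python
def symmetric_binary_dissimilarity(i, j):
    """Return (b + c) / (a + b + c + d) for two equal-length 0/1 vectors."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    pairs = list(zip(i, j))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # both 1
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # 1 for i, 0 for j
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # 0 for i, 1 for j
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # both 0
    return (b + c) / (a + b + c + d)


# Hypothetical objects described by six binary variables.
obj_i = [1, 0, 1, 0, 0, 1]
obj_j = [1, 1, 0, 0, 0, 1]

# a = 2, b = 1, c = 1, d = 2, so the dissimilarity is 2 / 6.
dissimilarity = symmetric_binary_dissimilarity(obj_i, obj_j)
```

Because both matches (a and d) appear in the denominator, swapping the 0/1 coding of any variable leaves the result unchanged, which is what makes the measure suitable for symmetric variables.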
• A categorical variable, sometimes called a nominal variable, is one that has two or
more categories, with no intrinsic ordering among them.
• For example, gender is a categorical variable having two categories (male and
female) and there is no intrinsic ordering to the categories
• Hair colour is also a categorical variable having a number of categories (blonde,
brown, brunette, red, etc.) and again, there is no agreed way to order these from
highest to lowest.
• A purely categorical variable is one that simply lets you assign categories; you
cannot clearly order those categories. If the categories have a clear ordering, the
variable is an ordinal variable.
• A categorical variable generalizes the binary variable in that it can take more than
two states, e.g., red, yellow, blue, green.
• Let the number of states of a categorical variable be M.
• The states can be denoted as 1, 2, …, M.
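• Method: simple matching
\[ d(i, j) = \frac{p - m}{p} \]
• where m is the number of matches (variables on which i and j are in the same state)
and p is the total number of variables.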
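The simple matching method can be illustrated with a short sketch. The function name, the variables, and their states are hypothetical and chosen only for illustration.

```python
def simple_matching_dissimilarity(i, j):
    """Return (p - m) / p for two objects described by categorical variables."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    p = len(i)                                   # total number of variables
    m = sum(1 for x, y in zip(i, j) if x == y)   # number of matching states
    return (p - m) / p


# Hypothetical objects described by (colour, shape, size).
obj_i = ["red", "round", "small"]
obj_j = ["red", "square", "small"]

# p = 3 and m = 2, so the dissimilarity is 1 / 3.
dissimilarity = simple_matching_dissimilarity(obj_i, obj_j)
```

Only equality of states is tested, so the sketch never relies on an ordering of the categories.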
• Ratio variables are those in which the ratio of two values has meaning, such as
miles per gallon. If car A gets 15 mpg and car B gets 20 mpg, you can take the ratio
of the two, 15/20 = 0.75, meaning car A gets 75% of the mileage of car B.
• Statistical computations and analyses assume that the variables have specific
levels of measurement.
• For example, it would not make sense to compute an average hair colour.
• An average of a categorical variable does not make much sense because there is
no intrinsic ordering of the levels of the categories.
• Moreover, if you tried to compute the average of educational experience as
defined in the ordinal section above, you would also obtain a questionable result,
because the spacing between the four levels of educational experience is very
uneven.