Lecture 23

Data mining

Uploaded by

Nancy Kumari
Copyright
© © All Rights Reserved

Lecture-23 Data Mining

Cluster Analysis
(Type of data in Cluster Analysis)

Dr. Kaberi Das


Associate Professor
Department of Computer Science and Engineering
ITER, Siksha ‘O’ Anusandhan University.
Content

• Data Structures
• Measure the Quality of Clustering
• Type of data in cluster analysis
  • Interval-scaled variables
  • Binary variables
  • Nominal, ordinal, and ratio variables
  • Variables of mixed types
• Summary

1/15/2021 Type of data in Cluster Analysis 2


Data Structures

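• Data matrix (two modes)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

• Dissimilarity matrix (one mode)

\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]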
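
As a rough illustration of how the two structures relate, the sketch below derives a lower-triangular dissimilarity matrix from a small data matrix. The function names, the sample values, and the choice of absolute-difference (Manhattan) distance are assumptions made for this example, not part of the lecture.

```python
def manhattan(u, v):
    """Sum of absolute coordinate differences between two objects."""
    return sum(abs(a - b) for a, b in zip(u, v))


def dissimilarity_matrix(data, dist=manhattan):
    """Return the lower triangle of the n-by-n dissimilarity matrix.

    Row i holds d(i, 0), ..., d(i, i), so the diagonal entries are 0.
    """
    return [[dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]


# Hypothetical data matrix: n = 3 objects (rows), p = 2 variables (columns).
data = [[1.0, 2.0],
        [2.0, 4.0],
        [4.0, 6.0]]

D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0, which is why the slide shows a triangular matrix.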



Measure the Quality of Clustering

• Dissimilarity/similarity metric:
• Similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled,
Boolean, Categorical, Ordinal and Ratio variables.
• Weights should be associated with different variables based on applications and
data semantics.
• It is hard to define “similar enough” or “good enough”: the answer is typically highly subjective.



Type of data in cluster analysis

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types



Interval-scaled variables:

• Standardize data
• Calculate the mean absolute deviation:
\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]
• where
\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]
• Calculate the standardized measurement (z-score):
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
• Using mean absolute deviation is more robust than using standard deviation
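
The standardization steps above can be sketched in a few lines of Python. The function name `standardize` and the sample measurements are assumptions for illustration; the formulas for the mean, the mean absolute deviation, and the z-score follow the slide.

```python
def standardize(values):
    """Standardize one variable f using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                        # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical measurements of a single variable for n = 4 objects.
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, which is the robustness the slide refers to.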



Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include:
Minkowski Distance:
\[ d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \]
• where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer.
Manhattan Distance: If q = 1,
\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]



Similarity and Dissimilarity Between Objects (Cont.)

Euclidean Distance: If q = 2,
\[ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]
• Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
• Also one can use weighted distance, parametric Pearson product moment
correlation, or other dissimilarity measures.
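
A minimal sketch of the Minkowski distance, with Manhattan (q = 1) and Euclidean (q = 2) as special cases, might look as follows. The function name `minkowski` and the sample points are assumptions chosen for this example.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical two-dimensional objects.
i = (1.0, 2.0)
j = (4.0, 6.0)
k = (2.0, 2.0)

d_manhattan = minkowski(i, j, 1)   # |1 - 4| + |2 - 6|
d_euclidean = minkowski(i, j, 2)   # sqrt(3**2 + 4**2)
print(d_manhattan, d_euclidean)

# Spot-check of the metric properties on these three points (q = 2).
print(minkowski(i, i, 2) == 0.0)
print(minkowski(i, j, 2) <= minkowski(i, k, 2) + minkowski(k, j, 2))
```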



Binary Variables

Binary variable: It has only two states (0 or 1)


• 0 means variable is absent
• 1 means variable is present
• Ex: variable smoker (1 indicates patient smokes and 0 indicates patient does
not).
• Treating binary variables as interval-scaled can lead to misleading results.
• Binary variables may be symmetric or asymmetric.
• To compute the dissimilarity between two objects described by binary variables:
• If all binary variables are thought of as having the same weight, construct a 2-by-2 contingency table.



Binary Variables (Contd…)

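• A contingency table for binary data

                      Object j
                  1        0        sum
Object i    1     a        b        a + b
            0     c        d        c + d
           sum    a + c    b + d    p

• Here a is the number of variables equal to 1 for both objects, b the number equal to 1 for object i and 0 for object j, c the number equal to 0 for object i and 1 for object j, and d the number equal to 0 for both. p is the total number of variables, p = a + b + c + d.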



Binary Variables (Contd…)

• A binary variable is symmetric if both of its states are equally valuable and carry the same weight, that is, there is no preference as to which outcome should be coded as 0 or 1.
• Ex: gender (male or female)
• Dissimilarity based on symmetric binary variables is called symmetric binary dissimilarity:
\[ d(i,j) = \frac{b + c}{a + b + c + d} \]



Binary Variables (Contd…)

• A binary variable is asymmetric if its states are not equally important.
• Ex: test (positive or negative)
• By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g. HIV positive) and the other by 0 (e.g. HIV negative).
• Given two asymmetric binary variables, the agreement of two 1s (a positive match) is considered more significant than that of two 0s (a negative match).
• Dissimilarity based on asymmetric binary variables is called asymmetric binary dissimilarity, where the number of negative matches, d, is considered unimportant and is thus ignored in the computation:
\[ d(i,j) = \frac{b + c}{a + b + c} \]



Binary Variables (Contd…)

• Complementarily, we can measure the distance between two binary variables based on the notion of similarity instead of dissimilarity.
• Jaccard coefficient:
\[ \mathrm{sim}(i,j) = \frac{a}{a + b + c} = 1 - d(i,j) \]



Dissimilarity between Binary Variables

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

• Gender is a symmetric attribute.
• The remaining attributes are asymmetric binary.
• Let the values Y and P be set to 1, and the value N be set to 0.
• Using the asymmetric binary dissimilarity on the asymmetric attributes only:

\[ d(i,j) = \frac{b + c}{a + b + c} \]

\[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
Nominal Variables (or categorical)

Categorical variable: sometimes called a nominal variable, is one that has two or
more categories.
• But there is no intrinsic ordering to the categories
• For example, gender is a categorical variable having two categories (male and
female) and there is no intrinsic ordering to the categories
• Hair colour is also a categorical variable having a number of categories (blonde,
brown, brunette, red, etc.) and again, there is no agreed way to order these from
highest to lowest.



Nominal Variables (or categorical) [Cont..]

• A purely categorical variable is one that simply allows you to assign categories, but you cannot clearly order the categories. If the variable has a clear ordering, then it is an ordinal variable.
• A nominal variable generalizes the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Let the number of states of a categorical variable be M
• The states can be denoted as 1, 2, …, M
• Method: simple matching
\[ d(i,j) = \frac{p - m}{p} \]
• m: number of matches (variables for which objects i and j are in the same state), p: total number of variables



Nominal Variables (or categorical) [Cont..]

• Example: Dissimilarity between categorical variables


• Suppose only the object identifier and the variable test-1 are used.

Object Identifier   Test-1 (categorical)   Test-2 (ordinal)   Test-3 (ratio-scaled)
1                   Code-A                 Excellent          445
2                   Code-B                 Fair               22
3                   Code-C                 Good               164
4                   Code-A                 Excellent          1,210

• The dissimilarity matrix has the form:

\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
\]

• Since we have one categorical variable, test-1, we set p = 1 in the simple matching equation, so that d(i,j) is 0 if objects i and j match and 1 if they differ:

\[
\begin{bmatrix}
0 \\
1 & 0 \\
1 & 1 & 0 \\
0 & 1 & 1 & 0
\end{bmatrix}
\]
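
The simple matching computation for test-1 can be sketched as below. The function name `simple_matching` is an assumption for illustration; each object is represented as a list of its categorical values so that the same function also covers p > 1.

```python
def simple_matching(x, y):
    """Simple matching dissimilarity (p - m) / p for categorical vectors."""
    p = len(x)                                  # total number of variables
    m = sum(1 for u, v in zip(x, y) if u == v)  # number of matches
    return (p - m) / p


# Test-1 values of objects 1 to 4 from the table (p = 1).
objects = [["Code-A"], ["Code-B"], ["Code-C"], ["Code-A"]]

D = [[simple_matching(objects[i], objects[j]) for j in range(i + 1)]
     for i in range(len(objects))]
for row in D:
    print(row)
```

Objects 1 and 4 share Code-A, so their dissimilarity is 0; every other pair differs and gets 1.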
Ordinal Variables

Ordinal variable: It is similar to a categorical variable.
• The difference between the two is that there is a clear ordering of the categories
• For example, suppose you have a variable, economic status, with three categories
(low, medium and high).
• In addition to being able to classify people into these three categories, you can
order the categories as low, medium and high
• Now consider a variable like educational experience (with values such as
elementary school graduate, high school graduate, some college and college
graduate).
• Even though we can order these from lowest to highest, the spacing between the
values may not be the same across the levels of the variables.
Ordinal Variables [Cont..]

• Say we assign scores 1, 2, 3 and 4 to these four levels of educational experience and we compare the difference in education between categories one and two with the difference in educational experience between categories two and three, or the difference between categories three and four.
• The difference between categories one and two (elementary and high school) is
probably much bigger than the difference between categories two and three
(high school and some college).
• In this example, we can order the people in level of educational experience but
the size of the difference between categories is inconsistent (because the spacing
between categories one and two is bigger than categories two and three).



How are Ordinal Variables handled ?

• Ordinal variables are handled much like interval-scaled variables when computing the dissimilarity between objects.
• Let f be a variable from a set of ordinal variables describing n objects
• The dissimilarity w.r.t. f involves the following steps:
• replace each xif by its rank rif ∈ {1, …, Mf}
• map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
• compute the dissimilarity using methods for interval-scaled variables



Ordinal Variables (Example)

• Example: Dissimilarity between ordinal variables


• Suppose only the object identifier and the variable test-2 are used.
• There are 3 states for test-2 (fair, good, excellent), i.e. Mf = 3
• Step 1: Replace each value of test-2 by its rank; objects 1 to 4 receive the ranks 3, 1, 2, 3.
• Step 2: Normalize the ranking by mapping rank 1 to 0, rank 2 to 0.5 and rank 3 to 1.
• Step 3: Use Euclidean distance to find the dissimilarity matrix:

Object Identifier   Test-1 (categorical)   Test-2 (ordinal)   Test-3 (ratio-scaled)
1                   Code-A                 Excellent          445
2                   Code-B                 Fair               22
3                   Code-C                 Good               164
4                   Code-A                 Excellent          1,210

\[
\begin{bmatrix}
0 \\
1 & 0 \\
0.5 & 0.5 & 0 \\
0 & 1.0 & 0.5 & 0
\end{bmatrix}
\]
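
The three steps for test-2 can be sketched as follows. The function name `ordinal_to_unit_interval` is an assumption for illustration; the state order (fair, good, excellent) is the one implied by the ranks 3, 1, 2, 3 on the slide. With a single variable, the Euclidean distance reduces to an absolute difference.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), with ranks r in 1..M."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("an ordinal variable needs at least two states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Test-2 values of objects 1 to 4 from the table.
test2 = ["excellent", "fair", "good", "excellent"]
z = ordinal_to_unit_interval(test2, ["fair", "good", "excellent"])
print(z)

# One variable only, so the Euclidean distance is |z_i - z_j|.
D = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]
for row in D:
    print(row)
```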



Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
• treat them like interval-scaled variables: not a good choice, since it is likely that the scale may be distorted
• apply a logarithmic transformation: yif = log(xif)
• treat them as continuous ordinal data and treat their ranks as interval-scaled.



Ratio-Scaled Variables (Example)

• Ratio variables are those in which the ratio of two of the numbers has meaning, such as miles per gallon. If car A gets 15 mpg and car B gets 20 mpg, you can take the ratio of the two, 15/20 = 0.75, meaning car A gets 75% of the mileage of car B.



Ratio-Scaled Variables (Example)[Cont..]

• Example: Dissimilarity between ratio-scaled variables


• Suppose only the object identifier and the variable test-3 are used.
• Step 1: Apply the logarithmic transformation yif = log(xif); the results for objects 1 to 4 are 2.65, 1.34, 2.21 and 3.08 (these values correspond to base-10 logarithms).
• Step 2: Use Euclidean distance on the transformed values to find the dissimilarity matrix:

Object Identifier   Test-1 (categorical)   Test-2 (ordinal)   Test-3 (ratio-scaled)
1                   Code-A                 Excellent          445
2                   Code-B                 Fair               22
3                   Code-C                 Good               164
4                   Code-A                 Excellent          1,210

\[
\begin{bmatrix}
0 \\
1.31 & 0 \\
0.44 & 0.87 & 0 \\
0.43 & 1.74 & 0.87 & 0
\end{bmatrix}
\]
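
The two steps for test-3 can be sketched as below. The base-10 logarithm is inferred from the transformed values quoted on the slide, which only says "log". Rounding the logarithms to two decimals before taking differences is an assumption made here so that the output matches the slide's matrix; without that rounding, d(3,1) comes out as 0.43 instead of 0.44.

```python
import math

# Test-3 values of objects 1 to 4 from the table.
test3 = [445, 22, 164, 1210]

# Step 1: logarithmic transformation (base 10, rounded as on the slide).
y = [round(math.log10(v), 2) for v in test3]
print(y)

# Step 2: one variable only, so the Euclidean distance is |y_i - y_j|.
D = [[round(abs(y[i] - y[j]), 2) for j in range(i + 1)]
     for i in range(len(y))]
for row in D:
    print(row)
```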



Why does it matter whether a variable is categorical, ordinal
or interval?

• Statistical computations and analyses assume that the variables have specific levels of measurement.
• For example, it would not make sense to compute an average hair colour.
• An average of a categorical variable does not make much sense because there is
no intrinsic ordering of the levels of the categories.
• Moreover, if you tried to compute the average of educational experience as
defined in the ordinal section above, you would also obtain a nonsensical result.
• Because the spacing between the four levels of educational experience is very
uneven, the meaning of this average would be very questionable.



Why does it matter whether a variable is categorical, ordinal
or interval?[Cont..]

• In short, an average requires a variable to be interval.


• Sometimes you have variables that are "in between" ordinal and interval, for
example, a five-point scale with values "strongly agree", "agree", "neutral",
"disagree" and "strongly disagree".
• If we cannot be sure that the intervals between each of these five values are the
same, then we would not be able to say that this is an interval variable, but we
would say that it is an ordinal variable.
• However, in order to be able to use statistics that assume the variable is interval,
we will assume that the intervals are equally spaced.



Variables of Mixed Types

• A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
• One may use a weighted formula to combine their effects:
\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
• The indicator δij(f) is 0 if xif or xjf is missing, or if xif = xjf = 0 and f is asymmetric binary; otherwise it is 1.
• f is binary or nominal: dij(f) = 0 if xif = xjf, and dij(f) = 1 otherwise
• f is interval-based: use the normalized distance
• f is ordinal or ratio-scaled: compute ranks rif and
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
and treat zif as interval-scaled
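
A minimal sketch of the weighted combination, applied to the four-object sample table used earlier in the lecture, is shown below. Several choices are assumptions made for this example rather than the lecture's prescription: the function names, taking "normalized distance" to mean the absolute difference divided by the variable's range, handling test-3 by a base-10 log transform followed by that range normalization (one of the options from the ratio-scaled slide) instead of ranks, and setting every indicator to 1 because the table has no missing values.

```python
import math


def combine(d_values, deltas):
    """Weighted combination: sum(delta * d) / sum(delta) over variables f."""
    denominator = sum(deltas)
    if denominator == 0:
        raise ValueError("no variable contributes to this pair")
    return sum(dl * d for dl, d in zip(deltas, d_values)) / denominator


# Sample table (objects 1 to 4 are stored at indices 0 to 3).
test1 = ["Code-A", "Code-B", "Code-C", "Code-A"]      # nominal
test2 = ["excellent", "fair", "good", "excellent"]    # ordinal
test3 = [445, 22, 164, 1210]                          # ratio-scaled

rank = {"fair": 1, "good": 2, "excellent": 3}
z2 = [(rank[v] - 1) / (len(rank) - 1) for v in test2]  # ranks mapped to [0, 1]
y3 = [math.log10(v) for v in test3]                    # log transform
range3 = max(y3) - min(y3)                             # for normalization


def mixed_dissimilarity(i, j):
    d_values = [
        0.0 if test1[i] == test1[j] else 1.0,  # nominal: match or mismatch
        abs(z2[i] - z2[j]),                    # ordinal: treated as interval
        abs(y3[i] - y3[j]) / range3,           # ratio: log, then normalized
    ]
    deltas = [1, 1, 1]  # assumption: nothing missing, so every variable counts
    return combine(d_values, deltas)


D = [[mixed_dissimilarity(i, j) for j in range(i + 1)] for i in range(4)]
for row in D:
    print([round(v, 2) for v in row])
```

Because every per-variable dissimilarity lies in [0, 1], the combined value does too, so no single variable type dominates the result.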



Summary

• The quality of clustering can be assessed based on a measure of dissimilarity of objects, which can be computed for various types of data, including interval-scaled, binary, categorical, ordinal, and ratio-scaled variables, or combinations of these variable types.



Thank You
