Lecture 23

Data mining

Uploaded by

Nancy Kumari

Lecture-23 Data Mining

Cluster Analysis
(Type of data in Cluster Analysis)

Dr. Kaberi Das

Associate Professor
Department of Computer Science and Engineering
ITER, Siksha ‘O’ Anusandhan University.

Content

• Data Structures
• Measure the Quality of Clustering
• Type of data in cluster analysis
    • Interval-scaled variables
    • Binary variables
    • Nominal, ordinal, and ratio variables
    • Variables of mixed types
• Summary

1/15/2021 Type of data in Cluster Analysis 2

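Data Structures

• Data matrix (two modes): n objects described by p variables

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

• Dissimilarity matrix (one mode): pairwise dissimilarities d(i, j) between the n objects

\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]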
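
The relation between the two structures can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names `dissimilarity_matrix` and `abs_diff_sum`, the sample values, and the choice of the sum of absolute differences as a placeholder distance are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def dissimilarity_matrix(
    data: Sequence[Sequence[float]],
    dist: Callable[[Sequence[float], Sequence[float]], float],
) -> List[List[float]]:
    """Build the lower-triangular dissimilarity matrix of an n-by-p data matrix.

    Row i holds d(i, 0), ..., d(i, i-1), followed by the zero diagonal entry.
    """
    rows = []
    for i, obj_i in enumerate(data):
        row = [dist(obj_i, data[j]) for j in range(i)]
        row.append(0.0)  # d(i, i) = 0
        rows.append(row)
    return rows


def abs_diff_sum(x: Sequence[float], y: Sequence[float]) -> float:
    """Placeholder distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Illustrative data matrix with n = 3 objects and p = 2 variables
# (values are made up for this sketch, not taken from the lecture).
data = [[1.0, 2.0], [3.0, 5.0], [0.0, 0.0]]
D = dissimilarity_matrix(data, abs_diff_sum)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i), which is why the dissimilarity matrix is called a one-mode structure.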


Measure the Quality of Clustering

• Dissimilarity/similarity metric: similarity is expressed in terms of a distance
function d(i, j), which is typically a metric.
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled,
Boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications and
data semantics.
• It is hard to define “similar enough” or “good enough”: the answer is typically
highly subjective.


Type of data in cluster analysis

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types


Interval-scaled variables:

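• Calculate the standardized measurement (z-score):

\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]

  where $x_{if}$ is the value of variable f for object i, $m_f$ is the mean of
  variable f, and $s_f$ is its mean absolute deviation.
• Using the mean absolute deviation is more robust to outliers than using the
standard deviation.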
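
As a rough sketch of the standardization step, the following Python function computes z-scores for one variable using the mean absolute deviation as the spread. The function name `standardize` and the sample values are assumptions introduced for illustration only.

```python
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Return z-scores of one variable, using the mean absolute deviation as s_f."""
    n = len(values)
    m_f = sum(values) / n                         # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Made-up measurements of a single interval-scaled variable.
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)
```

Because deviations from the mean are not squared, a single extreme value inflates the spread less than it would with the standard deviation, so outliers remain visible after standardization.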


Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between
two data objects.
• Some popular ones include:

Minkowski distance:

\[ d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q} \]

• where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two
p-dimensional data objects, and q is a positive integer.

Manhattan distance: if q = 1,

\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]


Similarity and Dissimilarity Between Objects (Cont.)

Euclidean distance: if q = 2,

\[ d(i,j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 } \]

• Properties
    • d(i, j) ≥ 0
    • d(i, i) = 0
    • d(i, j) = d(j, i)
    • d(i, j) ≤ d(i, k) + d(k, j)
• One can also use weighted distance, the parametric Pearson product-moment
correlation, or other dissimilarity measures.
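
A minimal Python sketch of the Minkowski family follows; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance. The function name `minkowski` and the sample points are assumptions chosen for the example.

```python
from typing import Sequence


def minkowski(x: Sequence[float], y: Sequence[float], q: int = 2) -> float:
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Made-up 2-dimensional objects.
i, j, k = [1.0, 2.0], [4.0, 6.0], [2.0, 2.0]

manhattan = minkowski(i, j, q=1)  # |1-4| + |2-6|
euclidean = minkowski(i, j, q=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)

# Spot-check of the triangle inequality d(i,j) <= d(i,k) + d(k,j) on these points.
print(euclidean <= minkowski(i, k) + minkowski(k, j))
```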


Binary Variables

Binary variable: It has only two states (0 or 1)

• 0 means variable is absent
• 1 means variable is present
• Ex: variable smoker (1 indicates patient smokes and 0 indicates patient does
not).
• Treating binary variables as interval-scaled can lead to misleading results.
• Binary variables may be symmetric or asymmetric.
• To compute the dissimilarity between two objects described by binary variables:
    • If all binary variables are thought of as having the same weight, construct a
    2-by-2 contingency table.


Binary Variables (Contd…)

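• A contingency table for binary data

                     Object j
                 1        0        sum
Object i   1     a        b        a + b
           0     c        d        c + d
           sum   a + c    b + d    p

• Here a is the number of variables equal to 1 for both objects, b the number equal
to 1 for object i but 0 for object j, c the number equal to 0 for object i but 1 for
object j, and d the number equal to 0 for both objects.
• p is the total number of variables, and p = a + b + c + d.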


Binary Variables (Contd…)

• A binary variable is symmetric if both of its states are equally valuable and carry
the same weight; that is, there is no preference on which outcome should be coded
as 0 or 1.
• Ex: gender (male or female)
• Dissimilarity based on symmetric binary variables is called symmetric binary
dissimilarity:

\[ d(i,j) = \frac{b + c}{a + b + c + d} \]


Binary Variables (Contd…)

• A binary variable is asymmetric if the states are not equally important.

• Ex: test (positive or negative)
• By convention, we code the most important outcome, which is usually the
rarest one, by 1 (e.g. HIV positive) and the other by 0 (e.g. HIV negative).
• Given two asymmetric binary variables, the agreement of two 1s (a positive match)
is considered more significant than that of two 0s (a negative match).
• Dissimilarity based on asymmetric binary variables is called asymmetric binary
dissimilarity, where the number of negative matches, d, is considered unimportant
and thus ignored in the computation:

\[ d(i,j) = \frac{b + c}{a + b + c} \]


Binary Variables (Contd…)

• Complementarily, we can compare two objects described by binary variables
based on the notion of similarity instead of dissimilarity.
• Jaccard coefficient:

\[ \mathrm{sim}(i,j) = \frac{a}{a + b + c} = 1 - d(i,j) \]

  where d(i, j) is the asymmetric binary dissimilarity.


Dissimilarity between Binary Variables

Name    Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack    M        Y       N       P        N        N        N
Mary    F        Y       N       P        N        P        N
Jim     M        Y       P       N        N        N        N

• Gender is a symmetric attribute.
• The remaining attributes are asymmetric binary.
• Let the values Y and P be set to 1, and the value N be set to 0.
• Using only the asymmetric attributes, count a, b and c from the 2-by-2
contingency table of each pair of objects and apply

\[ d(i,j) = \frac{b + c}{a + b + c} \]

\[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]

\[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]

\[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
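
The three values above can be checked with a short Python sketch. The helper names `contingency` and `asymmetric_dissimilarity`, the dictionary layout, and the convention of returning 0.0 when a + b + c = 0 are assumptions made for this illustration; the patient records are the ones in the table, with Y and P coded as 1 and N as 0, and gender left out because it is symmetric.

```python
from typing import Sequence, Tuple


def contingency(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return the counts (a, b, c, d) of the 2-by-2 contingency table."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def asymmetric_dissimilarity(x: Sequence[int], y: Sequence[int]) -> float:
    """(b + c) / (a + b + c); negative matches d are ignored."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumption of this sketch: no 1s at all means no evidence of difference
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0.
patients = {
    "jack": [1, 0, 1, 0, 0, 0],
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

d_jack_mary = asymmetric_dissimilarity(patients["jack"], patients["mary"])
d_jack_jim = asymmetric_dissimilarity(patients["jack"], patients["jim"])
d_jim_mary = asymmetric_dissimilarity(patients["jim"], patients["mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

The Jaccard similarity of a pair is simply one minus the value returned by `asymmetric_dissimilarity`.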
Nominal Variables (or categorical)

Categorical variable: also called a nominal variable, it has two or more
categories.
• There is no intrinsic ordering to the categories.
• For example, gender is a categorical variable having two categories (male and
female) and there is no intrinsic ordering to the categories
• Hair colour is also a categorical variable having a number of categories (blonde,
brown, brunette, red, etc.) and again, there is no agreed way to order these from
highest to lowest.


Nominal Variables (or categorical) [Cont..]

• A purely categorical variable is one that simply allows you to assign categories but
you cannot clearly order the categories. If the variable has a clear ordering, then
that variable would be an ordinal variable.
• A categorical variable generalizes the binary variable in that it can take more
than two states, e.g., red, yellow, blue, green.
• Let the number of states of a categorical variable be M.
• The states can be denoted as 1, 2, …, M.
• Method: simple matching

\[ d(i,j) = \frac{p - m}{p} \]

  where m is the number of matches (variables for which i and j are in the same
  state) and p is the total number of variables.


Nominal Variables (or categorical) [Cont..]

• Example: Dissimilarity between categorical variables
• Suppose that only the object identifier and the variable test-1 are available.

Object       Test-1          Test-2      Test-3
identifier   (categorical)   (ordinal)   (ratio-scaled)
1            Code-A          Excellent   445
2            Code-B          Fair        22
3            Code-C          Good        164
4            Code-A          Excellent   1,210

• Since we have one categorical variable, test-1, we set p = 1 in the simple matching
equation, so that d(i, j) is 0 if objects i and j match, and 1 if they differ.
• The dissimilarity matrix is:

\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
=
\begin{bmatrix}
0 \\
1 & 0 \\
1 & 1 & 0 \\
0 & 1 & 1 & 0
\end{bmatrix}
\]
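
The simple matching computation for this example can be sketched as follows. The function name `simple_matching` and the list-of-lists layout are assumptions for illustration; the test-1 codes are the ones in the table.

```python
from typing import List, Sequence


def simple_matching(x: Sequence[str], y: Sequence[str]) -> float:
    """Simple matching dissimilarity (p - m) / p over p categorical variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)  # number of matching variables
    return (p - m) / p


# Test-1 values of objects 1..4; each object is described by p = 1 variable.
test1 = ["Code-A", "Code-B", "Code-C", "Code-A"]

# Lower-triangular dissimilarity matrix, row i = d(i, 1), ..., d(i, i).
D: List[List[float]] = [
    [simple_matching([test1[i]], [test1[j]]) for j in range(i + 1)]
    for i in range(len(test1))
]
for row in D:
    print(row)
```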
Ordinal Variables

Ordinal variable: It is similar to a categorical variable.

• The difference between the two is that there is a clear ordering of the categories.
• For example, suppose you have a variable, economic status, with three categories
(low, medium and high).
• In addition to being able to classify people into these three categories, you can
order the categories as low, medium and high
• Now consider a variable like educational experience (with values such as
elementary school graduate, high school graduate, some college and college
graduate).
• Even though we can order these from lowest to highest, the spacing between the
values may not be the same across the levels of the variable.
Ordinal Variables [Cont..]

• Say we assign scores 1, 2, 3 and 4 to these four levels of educational experience
and we compare the difference in education between categories one and two
with the difference in educational experience between categories two and three,
or the difference between categories three and four.
• The difference between categories one and two (elementary and high school) is
probably much bigger than the difference between categories two and three
(high school and some college).
• In this example, we can order the people in level of educational experience but
the size of the difference between categories is inconsistent (because the spacing
between categories one and two is bigger than categories two and three).


How are Ordinal Variables handled ?

• Ordinal variables are handled much like interval-scaled variables when computing
the dissimilarity between objects.
• Let f be a variable from a set of ordinal variables describing n objects.
• The dissimilarity with respect to f involves the following steps:
    • Replace each $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$, where $M_f$ is the number
    of ordered states of f.
    • Map the range of each variable onto [0, 1] by replacing the rank of the i-th
    object in the f-th variable by

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

    • Compute the dissimilarity using the methods for interval-scaled variables,
    applied to $z_{if}$.


Ordinal Variables (Example)

• Example: Dissimilarity between ordinal variables
• Suppose that only the object identifier and the variable test-2 are available. Its
values for objects 1 to 4 are Excellent, Fair, Good and Excellent.
• There are 3 states for test-2 (Fair, Good, Excellent), i.e. $M_f = 3$.
• Step 1: Replace each value of test-2 by its rank, i.e. 3, 1, 2, 3.
• Step 2: Normalize the ranking by mapping rank 1 to 0, rank 2 to 0.5 and rank 3 to 1.
• Step 3: Use the Euclidean distance to find the dissimilarity matrix:

\[
\begin{bmatrix}
0 \\
1 & 0 \\
0.5 & 0.5 & 0 \\
0 & 1.0 & 0.5 & 0
\end{bmatrix}
\]
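
The three steps of the ordinal example can be sketched in Python as below. The function name `ordinal_to_unit_interval` and the explicit `order` argument are assumptions introduced for this illustration; with a single variable the Euclidean distance reduces to an absolute difference.

```python
from typing import List, Sequence


def ordinal_to_unit_interval(values: Sequence[str], order: Sequence[str]) -> List[float]:
    """Replace each ordinal value by its rank r and map it to (r - 1) / (M - 1)."""
    M = len(order)
    if M < 2:
        raise ValueError("at least two ordered states are needed")
    rank = {state: r for r, state in enumerate(order, start=1)}  # lowest state has rank 1
    return [(rank[v] - 1) / (M - 1) for v in values]


# Test-2 values of objects 1..4 and the assumed ordering of the three states.
test2 = ["Excellent", "Fair", "Good", "Excellent"]
z = ordinal_to_unit_interval(test2, order=["Fair", "Good", "Excellent"])

# Lower-triangular dissimilarity matrix on the normalized ranks.
D = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]
print(z)
for row in D:
    print(row)
```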


Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale,
approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$.

Methods:
• Treat them like interval-scaled variables. This is not a good choice, since it is
likely that the scale may be distorted.
• Apply a logarithmic transformation: $y_{if} = \log(x_{if})$
• Treat them as continuous ordinal data and treat their ranks as interval-scaled.


Ratio-Scaled Variables (Example)

• Ratio variables are those in which the ratio of two values has meaning, such as
miles per gallon. If car A gets 15 mpg and car B gets 20 mpg, you can take the
ratio of the two, 15/20 = 0.75, meaning car A gets 75% of the mileage of car B.


Ratio-Scaled Variables (Example)[Cont..]

• Example: Dissimilarity between ratio-scaled variables
• Suppose that only the object identifier and the variable test-3 are available. Its
values for objects 1 to 4 are 445, 22, 164 and 1,210.
• Step 1: Apply the logarithmic transformation $y_{if} = \log_{10}(x_{if})$; the results are
2.65, 1.34, 2.21 and 3.08.
• Step 2: Use the Euclidean distance on the transformed values to find the
dissimilarity matrix:

\[
\begin{bmatrix}
0 \\
1.31 & 0 \\
0.44 & 0.87 & 0 \\
0.43 & 1.74 & 0.87 & 0
\end{bmatrix}
\]
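
The two steps can be reproduced with the short Python sketch below. The variable names are assumptions for illustration. The logarithm is taken to base 10, which is the base that yields the transformed values quoted in the example, and the logs are rounded to two decimals before differencing so that the arithmetic follows the slide.

```python
import math

# Test-3 values of objects 1..4.
test3 = [445, 22, 164, 1210]

# Step 1: base-10 logarithmic transformation, rounded to two decimals
# (rounding first mirrors the two-decimal arithmetic used in the example).
y = [round(math.log10(x), 2) for x in test3]

# Step 2: with one variable, the Euclidean distance is an absolute difference.
D = [[round(abs(y[i] - y[j]), 2) for j in range(i + 1)] for i in range(len(y))]

print(y)
for row in D:
    print(row)
```

Differencing the unrounded logarithms instead would change d(3, 1) in the second decimal place, so small discrepancies with hand calculations are expected.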


Why does it matter whether a variable is categorical, ordinal
or interval?

• Statistical computations and analyses assume that the variables have a specific
level of measurement.
• For example, it would not make sense to compute an average hair colour.
• An average of a categorical variable does not make much sense because there is
no intrinsic ordering of the levels of the categories.
• Moreover, if you tried to compute the average of educational experience as
defined in the ordinal section above, you would also obtain a nonsensical result.
• Because the spacing between the four levels of educational experience is very
uneven, the meaning of this average would be very questionable.


Why does it matter whether a variable is categorical, ordinal
or interval?[Cont..]

• In short, an average requires a variable to be interval.

• Sometimes you have variables that are "in between" ordinal and interval, for
example, a five-point scale with values "strongly agree", "agree", "neutral",
"disagree" and "strongly disagree".
• If we cannot be sure that the intervals between each of these five values are the
same, then we would not be able to say that this is an interval variable, but we
would say that it is an ordinal variable.
• However, in order to be able to use statistics that assume the variable is interval,
we will assume that the intervals are equally spaced.


Variables of Mixed Types

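• A database may contain all six types of variables: symmetric binary, asymmetric
binary, nominal, ordinal, interval and ratio.
• One may use a weighted formula to combine their effects:

\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]

  where $d_{ij}^{(f)}$ is the contribution of variable f to the dissimilarity between
  i and j, and $\delta_{ij}^{(f)}$ is its weight: 0 when variable f should not contribute
  for this pair (for example, because a value is missing) and 1 otherwise.
• If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise.
• If f is interval-based: use the normalized distance.
• If f is ordinal or ratio-scaled: compute the ranks $r_{if}$ and

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

  and treat $z_{if}$ as interval-scaled.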
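
A minimal Python sketch of the weighted combination is given below, applied to objects 2 and 1 of the four-object table used in the earlier examples. All function names are assumptions for illustration, and the ratio-scaled variable test-3 is handled through ranks, which is one of the treatments listed above; a log transformation followed by normalization would give a different value.

```python
from typing import List, Sequence, Tuple


def mixed_dissimilarity(contributions: Sequence[Tuple[int, float]]) -> float:
    """Combine per-variable terms (delta, d) into sum(delta * d) / sum(delta)."""
    num = sum(delta * d for delta, d in contributions)
    den = sum(delta for delta, _ in contributions)
    if den == 0:
        raise ValueError("no variable contributes to this pair")
    return num / den


def nominal(x: str, y: str) -> float:
    """Binary or nominal contribution: 0 on a match, 1 otherwise."""
    return 0.0 if x == y else 1.0


def rank_scaled(values: Sequence[float]) -> List[float]:
    """Map values to [0, 1] through their ranks: z = (r - 1) / (M - 1)."""
    states = sorted(set(values))
    M = len(states)
    rank = {v: r for r, v in enumerate(states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]


test1 = ["Code-A", "Code-B", "Code-C", "Code-A"]  # nominal
test2_z = [1.0, 0.0, 0.5, 1.0]                     # ordinal, already mapped to [0, 1]
test3_z = rank_scaled([445, 22, 164, 1210])        # ratio-scaled, treated via ranks

i, j = 1, 0  # object 2 versus object 1 (0-based indices)
d_21 = mixed_dissimilarity([
    (1, nominal(test1[i], test1[j])),
    (1, abs(test2_z[i] - test2_z[j])),
    (1, abs(test3_z[i] - test3_z[j])),
])
print(d_21)
```

Setting a variable's delta to 0 removes it from both the numerator and the denominator, so the result stays within [0, 1] however many variables are skipped.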


Summary

• The quality of clustering can be assessed based on a measure of dissimilarity of
objects, which can be computed for various types of data, including
interval-scaled, binary, categorical, ordinal, and ratio-scaled variables, or
combinations of these variable types.


Thank You
