Data Preprocessing Data Basics

The scales represented are: Nominal: Favourite candy bar Ratio: Weight of luggage Interval: Year of your birth Ordinal: Egg size (small, medium, large, extra large, jumbo) Ordinal: Military rank Ratio: Number of children in a family Nominal: Jersey numbers for a football team Interval: Shoe size

Uploaded by

Jaydeep Dodiya
Copyright
© © All Rights Reserved
Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
 Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents represented as term-frequency vectors

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1   3     0      5     0     2      6     0    2     0        2
Document 2   0     7      0     2     1      0     0    3     0        0
Document 3   0     1      0     0     1      2     2    0     3        0

  Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

 Graph and network
  World Wide Web
  Social or information networks
  Molecular structures
 Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential data: transaction sequences
  Genetic sequence data
 Spatial, image and multimedia
  Spatial data: maps
  Image data
  Video data

Basic terms
 Data
  Data is raw, unorganized facts that need to be processed.
  Example: marks of students
  Student_1 = 50/100, Student_2 = 25/100.
 Information
  When data is processed, organized, structured, or presented in a given context so as to make it useful, it is called information.
  Example: result of students (Pass or Fail)
  Student_1 = Pass, Student_2 = Fail.

Basic terms
 Metadata
 Metadata is data about data.
 Data such as the table name, column names, data types, authorized users, and user access privileges for a table are called metadata for that table.

Table: Faculty
Emp_Name      Address    Mobile_No  Subject
Prof. Sharma  Ahmedabad  1234       Mgmt

 Metadata of the above table:
  Table name, such as Faculty
  Column names, such as Emp_Name, Address, Mobile_No, Subject
  Data types, such as Varchar, Decimal
  Access privileges, such as Read, Write (Update)

Basic terms
 Data dictionary
  A data dictionary is an information repository that contains metadata.
• Table Name – Faculty
• Column Name – EmpName, Address, Mob, Subject, Salary
• Datatype – Varchar, Decimal
• Access Privileges – Read, Write (Update)

 Data warehouse
  A data warehouse is an information repository that stores historical data.

Table: Faculty
Emp_Name      Address    Mobile_No  Subject
Prof. Sharma  Ahmedabad  1234       Mgmt
Prof. Verma   Ahmedabad  5678       DBMS

Database schema

 Relational database schema (a set of related relations)
  student (ID, name, dept_name, tot_cred)
  advisor (s_id, i_id)
  takes (ID, course_id, sec_id, semester, year, grade)
  classroom (building, room_number, capacity)
  time_slot (time_slot_id, day, start_time, end_time)

Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.

Attributes
 Attribute (also called dimension, feature, or variable): a data field representing a characteristic or feature of a data object.
  E.g., customer_ID, name, address
 Types:
  Nominal
  Binary
  Ordinal
  Numeric: quantitative
   Interval-scaled
   Ratio-scaled

Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings

Interval scales
 Measurements with two defining principles: equidistant scales and no true zero.
 Equidistance refers to intervals whose values are distributed in equal units.
 A true zero refers to a scale where 0 indicates the absence of something.
 An interval scale lacks a true zero.
 Values have order.
 Examples of scales without a true zero include rating scales, temperature in °C or °F, measures of latitude and longitude, and calendar dates.

Ratio scales
 Ratio scales: measurements with two defining principles, equidistant scales and a true zero.
 We can speak of one value as being a multiple (or ratio) of another (e.g., 10 K is twice as high as 5 K).
 Examples of scales with a true zero include weight, height, time, and calories.

Quantitative vs Qualitative Data

Numeric attribute

Discrete vs. Continuous Attributes
 Discrete Attribute
  Has only a finite or countably infinite set of values
  E.g., zip codes, profession, or the set of words in a collection of documents
  Sometimes represented as integer variables
  Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
  Has real numbers as attribute values
  E.g., temperature, height, or weight
  Practically, real values can only be measured and represented using a finite number of digits
  Continuous attributes are typically represented as floating-point variables

Discrete vs. Continuous Attributes
Classify each of the following as discrete (d) or continuous (c):
 Number of emergency room patients
 Blood pressure of a patient
 Weight of a patient
 Pulse for a patient
 Emergency room wait time rounded to the nearest minute
 Tumor size

Answers: d, c, c, d, d, c

Nominal, Ordinal, Interval, and Ratio Scales

Each scale is represented once in the list below.
 Favourite candy bar
 Weight of luggage
 Year of your birth
 Egg size (small, medium, large, extra large, jumbo)

Each scale is represented once in the list below.
 Military rank
 Number of children in a family
 Jersey numbers for a football team
 Shoe size

Answers (N = nominal, O = ordinal, I = interval, R = ratio): N, R, I, O and O, R, N, I

Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities
of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$

◦ Distributive measures: sum() and count()
◦ Algebraic measure: avg()
◦ Weighted arithmetic mean (weighted average):

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population):

◦ Trimmed mean: the mean obtained after chopping off values at the high and low extremes.
For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.

Problem: the mean is sensitive to extreme values.

Measuring the Central Tendency

Median
◦ Middle value if odd number of values, or average of the
middle two values otherwise

Mode
◦ Value that occurs most frequently in the data set

Midrange
◦ The average of the largest and smallest values in the set; it can also be used to assess the central tendency of a data set.
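
The measures of central tendency above can be sketched in Python as follows. This is a minimal illustration, not the deck's own code: the `salaries` list and the helper names `weighted_mean`, `trimmed_mean`, and `midrange` are hypothetical, and the trimming rule (drop a fixed fraction from each end, rounded down) is one possible convention.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut per end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def midrange(values):
    """Average of the smallest and largest value."""
    return (min(values) + max(values)) / 2


# Hypothetical salaries (in thousands); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(salaries))
print("median:", median(salaries))
print("modes:", multimode(salaries))
print("midrange:", midrange(salaries))
print("10% trimmed mean:", trimmed_mean(salaries, 0.10))
print("weighted mean:", weighted_mean([1, 2, 3], [1, 1, 2]))
```

Comparing the plain mean with the trimmed mean on this list shows how a single extreme value pulls the mean upward.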
Symmetric vs. Skewed Data
 Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
[Figure: three distribution curves labelled symmetric, positively skewed, and negatively skewed]

6/22/23 Data Mining: Concepts and Techniques 26


Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

 Standard deviation s (or σ) is the square root of variance s² (or σ²)
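
As a rough check of the two forms of the variance formula, the sketch below computes the sample variance both from deviations and from running sums, plus the population variance. The function names and the `data` list are hypothetical and chosen only for illustration.

```python
from math import isclose, sqrt


def sample_variance(xs):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_variance_from_sums(xs):
    """Same s^2, computed from sum(x) and sum(x^2) in one pass."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = sum(x^2) / N - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
print("sample variance:", sample_variance(data))
print("population variance:", population_variance(data))
print("population std dev:", sqrt(population_variance(data)))
```

The sums-based form only needs sum(), count(), and a sum of squares, which is why the slide calls the computation scalable.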

Outlier detection using quartile

1. Arrange the data in ascending order.
2. Calculate Q1 (the first quartile).
3. Calculate Q3 (the third quartile).
4. Find IQR = Q3 − Q1.
5. Find the lower bound = Q1 − (1.5 × IQR).
6. Find the upper bound = Q3 + (1.5 × IQR).
7. Flag data points below the lower bound or above the upper bound as outliers and remove them.
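
The steps above can be sketched as follows. This is one possible implementation: it assumes the quartile convention of Python's `statistics.quantiles` (default `method="exclusive"`), which may give slightly different quartiles from a hand calculation, and the `readings` list and function name `iqr_outliers` are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(data):
    """Return (lower bound, upper bound, outliers, kept values)."""
    ordered = sorted(data)                   # step 1
    q1, _, q3 = quantiles(ordered, n=4)      # steps 2-3
    iqr = q3 - q1                            # step 4
    lower = q1 - 1.5 * iqr                   # step 5
    upper = q3 + 1.5 * iqr                   # step 6
    outliers = [x for x in ordered if x < lower or x > upper]  # step 7
    kept = [x for x in ordered if lower <= x <= upper]
    return lower, upper, outliers, kept


readings = [12, 14, 14, 15, 16, 17, 18, 19, 20, 48]  # hypothetical data
lower, upper, outliers, kept = iqr_outliers(readings)
print("bounds:", lower, upper)
print("outliers:", outliers)
```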
Boxplot Analysis
 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the
box
 Whiskers: two lines outside the box extended
to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually

Example

Data: 4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11
Solution
Put them in order:
3, 4, 4, 4, 7, 10, 11, 12, 14, 16, 17, 18
Cut it into quarters:
3, 4, 4 | 4, 7, 10 | 11, 12, 14 | 16, 17, 18
All the quartiles fall between two numbers:
Quartile 1 (Q1) = (4+4)/2 = 4
Quartile 2 (Q2) = (10+11)/2 = 10.5
Quartile 3 (Q3) = (14+16)/2 = 15
The lowest value is 3
The highest value is 18
Interquartile range: Q3 − Q1 = 15 − 4 = 11
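
The worked example uses the "cut into quarters" convention, where Q1 and Q3 are the medians of the lower and upper halves. A minimal sketch of that convention is shown below; the function name `quartiles_by_halves` is hypothetical, and other quartile conventions (including library defaults) may return slightly different values for the same data.

```python
from statistics import median


def quartiles_by_halves(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]
q1, q2, q3 = quartiles_by_halves(data)
print("Q1, Q2, Q3:", q1, q2, q3)
print("min, max:", min(data), max(data))
print("IQR:", q3 - q1)
```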

Graphic Displays of Basic Statistical Descriptions

 Boxplot: graphic display of five-number summary


 Histogram: the x-axis shows values, the y-axis shows frequencies
 Quantile plot: each value x_i is paired with f_i, indicating that approximately 100·f_i % of the data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Analysis
 Histogram: graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis values from 10000 to 90000, y-axis frequencies from 0 to 40]

Histograms Often Tell More than Boxplots

 The two histograms shown on the left may have the same boxplot representation
  The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions

Histogram vs Box plot

What do box plots tell you that histograms don't?
Box plots show the least and greatest values in the data, as well as the median and the lower and upper quartiles. Those values can only be estimated from histograms.
Scatter plot
 Provides a first look at bivariate data to see clusters of
points, outliers, etc
 Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Positively and Negatively Correlated Data

 The left half fragment is positively correlated
 The right half is negatively correlated

Uncorrelated Data

Example I

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

1. Find the mean, median, mode (noting the modality of the data: bimodal, trimodal, etc.), and midrange.
2. Find the five-number summary.
3. Plot a histogram.
4. Plot a box plot.

Solution
1. Mean = 809/27 ≈ 30, median = 25, mode = 25 and 35 (bimodal), midrange (average of the largest and smallest values) = (13 + 70)/2 = 41.5
2. Q1 = 20, Q3 = 35, so the five-number summary (min, Q1, median (Q2), Q3, max) = 13, 20, 25, 35, 70
3. Frequency table for the histogram:

Class   Count
13-25   14
26-38   8
39-51   3
52-64   1
65-77   1
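
The figures in this solution can be reproduced with a short script. This is a sketch only: it assumes the quartile convention of `statistics.quantiles` (default `method="exclusive"`), which happens to agree with the slide's Q1 and Q3 for this data, and it bins ages into classes of width 13 starting at 13 to match the frequency table.

```python
from collections import Counter
from statistics import mean, median, multimode, quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

q1, _, q3 = quantiles(ages, n=4)
five_number = (min(ages), q1, median(ages), q3, max(ages))
midrange = (min(ages) + max(ages)) / 2

# Class index 0 is 13-25, 1 is 26-38, and so on (width 13).
bins = Counter((age - 13) // 13 for age in ages)
counts = [bins[k] for k in range(5)]

print("mean:", round(mean(ages), 2))
print("median:", median(ages))
print("modes:", multimode(ages))
print("midrange:", midrange)
print("five-number summary:", five_number)
print("class counts:", counts)
```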
Example II

Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following results:

age    23    23    27    27    39    41    47    49    50
%fat   9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2

age    52    54    54    56    57    58    58    60    61
%fat   34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7

1. Calculate the mean, median, and standard deviation of age and %fat.
2. Draw the boxplots for age and %fat.
3. Draw a scatter plot and a q-q plot based on these two variables.
Example II solution

Statistic            Age     %fat
Population size      18      18
Median               51      30.7
Minimum              23      7.8
Maximum              61      42.5
First quartile       36      26.35
Third quartile       57.25   34.225
Interquartile range  21.25   7.875
Outliers             none    7.8, 9.5
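
The table above can be checked, and the mean and standard deviation asked for in part 1 computed, with a sketch like the following. It assumes the default `method="exclusive"` of `statistics.quantiles`, which reproduces the quartiles in the table, and uses the sample standard deviation (`stdev`); the helper name `describe` is hypothetical.

```python
from statistics import mean, median, quantiles, stdev

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8,
       33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def describe(values):
    """Summary statistics plus outliers by the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "outliers": sorted(v for v in values if v < low or v > high),
    }


age_stats = describe(age)
fat_stats = describe(fat)
for name, stats in (("age", age_stats), ("%fat", fat_stats)):
    print(name, {k: round(v, 3) if isinstance(v, float) else v
                 for k, v in stats.items()})
```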
Example II solution
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

Similarity and Dissimilarity
 Similarity
  Numerical measure of how alike two data objects are
  Value is higher when objects are more alike
  Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
  Numerical measure of how different two data objects are
  Lower when objects are more alike
  Minimum dissimilarity is often 0
  Upper limit varies
 Proximity refers to a similarity or dissimilarity

Similarity and Dissimilarity
 Suppose that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements or features, such as age, height, weight, or gender).
 The objects are x_1 = (x_11, x_12, ..., x_1p), x_2 = (x_21, x_22, ..., x_2p), and so on, where x_ij is the value for object x_i of the jth attribute.

Data Matrix
 Data matrix (object-by-attribute structure, or n-by-p matrix)
  n data points with p dimensions
  Two-mode

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity Matrix
 Dissimilarity matrix (object-by-object structure)
  n data points, but registers only the distance
  A triangular matrix
  Single-mode

$$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Proximity Measure for Binary Attributes
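 A contingency table for binary data

                 Object j
                 1       0       sum
Object i   1     q       r       q + r
           0     s       t       s + t
           sum   q + s   r + t   p

 Distance measure for symmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
 Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s}$$
 Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)$$
 Note: Jaccard coefficient is the same as “coherence”
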
Dissimilarity between Binary Variables
 Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0

Distance measure for asymmetric attributes

       Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   1      0      1       0       0       0
Mary   1      0      1       0       1       0

Contingency table (Jack vs. Mary)

              Mary
              1    0    sum
Jack   1      2    0    2
       0      1    3    4
       sum    3    3    6

$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
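
The three distances can be reproduced with a small function that counts the cells of the contingency table and ignores negative matches, as the asymmetric measure requires. This is a minimal sketch; the function name and the dictionary layout are hypothetical, and the vectors use the encoding Y/P = 1, N = 0 over the six asymmetric attributes (Gender is left out).

```python
def asymmetric_binary_distance(a, b):
    """d = (r + s) / (q + r + s); 0/0 matches (t) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)  # undefined if both vectors are all zeros


# Fever, Cough, Test-1, Test-2, Test-3, Test-4
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

for a, b in (("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")):
    d = asymmetric_binary_distance(patients[a], patients[b])
    print(f"d({a}, {b}) = {d:.2f}")
```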
Distance measure for symmetric attributes

Exercise

Find the distance measure for symmetric attributes

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

Proximity Measure for Nominal Attributes

 Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
 Method 1: Simple matching
  m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$

Example I

d(i, j) = (p − m)/p = (1 − 0)/1 = 1
p = 1 (one nominal attribute)

Example II
RollNo  Marks  Grade
1       90     A
2       80     B
3       82     B
4       90     A

d(RollNo1,RollNo1) d(RollNo1,RollNo2) d(RollNo1,RollNo3) d(RollNo1,RollNo4)
d(RollNo2,RollNo1) d(RollNo2,RollNo2) d(RollNo2,RollNo3) d(RollNo2,RollNo4)
d(RollNo3,RollNo1) d(RollNo3,RollNo2) d(RollNo3,RollNo3) d(RollNo3,RollNo4)
d(RollNo4,RollNo1) d(RollNo4,RollNo2) d(RollNo4,RollNo3) d(RollNo4,RollNo4)

distance(object1, object2) = (P − M) / P
P is the total number of attributes
M is the total number of matches

Distance measure

With P = 2 attributes (Marks and Grade) and d(i, j) = (P − M) / P, the lower triangle of the matrix is:

d(1,1) = (2 − 2)/2 = 0
d(2,1) = (2 − 0)/2 = 1    d(2,2) = (2 − 2)/2 = 0
d(3,1) = (2 − 0)/2 = 1    d(3,2) = (2 − 1)/2 = 0.5    d(3,3) = (2 − 2)/2 = 0
d(4,1) = (2 − 2)/2 = 0    d(4,2) = (2 − 0)/2 = 1      d(4,3) = (2 − 0)/2 = 1    d(4,4) = (2 − 2)/2 = 0

The remaining entries d(RollNo1,RollNo2), d(RollNo1,RollNo3), d(RollNo1,RollNo4), d(RollNo2,RollNo3), d(RollNo2,RollNo4), and d(RollNo3,RollNo4) follow by symmetry.
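
The matrix can be reproduced with a few lines of Python. This sketch follows the example in treating both Marks and Grade as nominal values compared by exact match; the function and variable names are hypothetical.

```python
def simple_matching_distance(a, b):
    """d = (p - m) / p, where m counts attributes with equal values."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


# RollNo -> (Marks, Grade)
records = {1: (90, "A"), 2: (80, "B"), 3: (82, "B"), 4: (90, "A")}

# Lower triangle of the dissimilarity matrix, keyed by (i, j) with j <= i.
matrix = {
    (i, j): simple_matching_distance(records[i], records[j])
    for i in records for j in records if j <= i
}

for i in records:
    print(" ".join(f"{matrix[(i, j)]:.1f}" for j in records if j <= i))
```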

Distance on Numeric Data: Minkowski Distance
 Minkowski distance: a popular distance measure

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
 Properties
  d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  d(i, j) = d(j, i) (Symmetry)
  d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)
 A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
  E.g., the Hamming distance: the number of bits that are different between two binary vectors

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

 h = 2: Euclidean (L2 norm) distance

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
  The maximum difference in values over all components (attributes) f of the vectors (objects)

$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$

Example

f1 difference |1-3| = 2
f2 difference |2-5| = 3
Select maximum difference i.e. 3
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

[Figure: scatter plot of the points x1, x2, x3, and x4]

Dissimilarity Matrix (with Euclidean Distance)

      x1     x2    x3    x4
x1    0
x2    3.61   0
x3    2.24   5.1   0
x4    4.24   1     5.39  0

Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

[Figure: scatter plot of the points x1, x2, x3, and x4]

Manhattan (L1)

L1    x1     x2    x3    x4
x1    0
x2    5      0
x3    3      6     0
x4    6      1     7     0

Euclidean (L2)

L2    x1     x2    x3    x4
x1    0
x2    3.61   0
x3    2.24   5.1   0
x4    4.24   1     5.39  0

Supremum

L∞    x1     x2    x3    x4
x1    0
x2    3      0
x3    2      5     0
x4    3      1     5     0
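
The three dissimilarity matrices can be regenerated from the four points with one general function. This is a minimal sketch: the function name `minkowski` and the use of `float("inf")` to select the supremum distance are choices made for this illustration.

```python
from itertools import combinations


def minkowski(a, b, h):
    """Minkowski distance of order h; h = float('inf') gives the supremum."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for label, h in (("Manhattan (L1)", 1), ("Euclidean (L2)", 2),
                 ("Supremum", float("inf"))):
    print(label)
    for p, q in combinations(points, 2):
        print(f"  d({p}, {q}) = {minkowski(points[p], points[q], h):.2f}")
```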
Example

Given two objects represented by the tuples (22, 1,


42, 10) and (20, 0, 36, 8):
1. Compute the Euclidean distance between the
two objects.
2. Compute the Manhattan distance between the
two objects.
3. Compute the Minkowski distance between the
two objects, using h = 3.
4. Compute the supremum distance between the
two objects.
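
One way to work through the four parts, using the Minkowski definitions given earlier, is to start from the attribute-wise absolute differences |22 − 20| = 2, |1 − 0| = 1, |42 − 36| = 6, and |10 − 8| = 2. The rounded values below are approximate.

```latex
\begin{align*}
\text{Euclidean } (h = 2):\quad & d = \sqrt{2^2 + 1^2 + 6^2 + 2^2} = \sqrt{45} \approx 6.71 \\
\text{Manhattan } (h = 1):\quad & d = 2 + 1 + 6 + 2 = 11 \\
\text{Minkowski } (h = 3):\quad & d = \sqrt[3]{2^3 + 1^3 + 6^3 + 2^3} = \sqrt[3]{233} \approx 6.15 \\
\text{Supremum } (h \to \infty):\quad & d = \max(2, 1, 6, 2) = 6
\end{align*}
```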

Ordinal Variables

 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
  replace x_if by its rank r_if ∈ {1, ..., M_f}
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
  compute the dissimilarity using methods for interval-scaled variables

Example

Dissimilarity measure for ordinal data
 There are three states for test-2: fair, good, and excellent, that is, M_f = 3.
 Step 1: Replace each value for test-2 by its rank; the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
 Step 2: Normalize the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
 Step 3: Use the Euclidean distance on the normalized values.
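
The three steps can be sketched in Python as follows. The `test_2` values are an assumption inferred from the ranks 3, 1, 2, 3 stated above (excellent, fair, good, excellent), and the function name is hypothetical. With a single attribute, the Euclidean distance reduces to an absolute difference.

```python
def normalize_ranks(values, ordered_states):
    """Map each ordinal value to z = (rank - 1) / (M - 1), in [0, 1]."""
    m = len(ordered_states)
    rank = {state: k + 1 for k, state in enumerate(ordered_states)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Assumed values, consistent with the ranks 3, 1, 2, 3 in the text.
test_2 = ["excellent", "fair", "good", "excellent"]
z = normalize_ranks(test_2, ["fair", "good", "excellent"])

# One attribute, so the Euclidean distance is |z_i - z_j|.
dist = [[abs(zi - zj) for zj in z] for zi in z]

print("normalized ranks:", z)
for i, row in enumerate(dist):
    print(" ".join(f"{d:.1f}" for d in row[:i + 1]))
```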

Example

Cosine Similarity
 A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.

 Other vector objects: gene features in micro-arrays, …


 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$
where · indicates the vector dot product and ||d|| is the length of vector d

Example: Cosine Similarity
 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
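
The same calculation can be written as a short function. This is a minimal sketch using only the standard library; the function name `cosine_similarity` is hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(f"cos(d1, d2) = {cosine_similarity(d1, d2):.2f}")
```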

Attributes of Mixed Type
 A database may contain all attribute types
  Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
 The indicator δ_ij^(f) = 0 if either
  x_if or x_jf is missing (i.e., there is no measurement of attribute f for object i or object j), or
  x_if = x_jf = 0 and attribute f is asymmetric binary;
 otherwise, δ_ij^(f) = 1

Attributes of Mixed Type
 f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf
  d_ij^(f) = 1 otherwise
 f is numeric: use the normalized distance
$$d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$$
  where h runs over all non-missing objects for the attribute f.
 f is ordinal
  Compute ranks r_if and z_if = (r_if − 1)/(M_f − 1)
  Treat z_if as interval-scaled
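
A possible way to combine the per-attribute rules into one dissimilarity is sketched below. Everything in it is illustrative: the four `rows`, the attribute kinds, and the function name are hypothetical and are not taken from the slides. Ordinal values are assumed to be already normalized to z in [0, 1], and missing values are represented by `None`.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average the per-attribute dissimilarities whose indicator delta is 1."""
    total = 0.0
    used = 0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # delta = 0: a value is missing
        if kind == "asymmetric_binary" and x == 0 and y == 0:
            continue  # delta = 0: negative match on an asymmetric attribute
        if kind == "numeric":
            d = abs(x - y) / value_range      # normalized distance
        elif kind == "ordinal":
            d = abs(x - y)                    # x and y are already z-values
        else:                                 # nominal or binary
            d = 0.0 if x == y else 1.0
        total += d
        used += 1
    if used == 0:
        raise ValueError("no comparable attributes")
    return total / used


# Hypothetical objects: (nominal code, normalized ordinal z, numeric score)
rows = [("A", 1.0, 45), ("B", 0.0, 22), ("C", 0.5, 64), ("A", 1.0, 28)]
kinds = ("nominal", "ordinal", "numeric")
numeric_range = max(r[2] for r in rows) - min(r[2] for r in rows)
ranges = (None, None, numeric_range)

for i in range(len(rows)):
    print(" ".join(f"{mixed_dissimilarity(rows[i], rows[j], kinds, ranges):.2f}"
                   for j in range(i + 1)))
```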

Example

 Dissimilarity Matrix for test I
 d(i, j) = (p − m)/p (simple matching)
 p = 1

 Dissimilarity Matrix for test II
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

 Dissimilarity Matrix for test III

Solution
 Dissimilarity Matrix

Example

Solution (a)

Solution (b)

X1' = (A1 / sqrt(A1² + A2²), A2 / sqrt(A1² + A2²))

Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion,
graphical displays
 Data visualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing.
 Many methods have been developed but still an active area of
research.
