Data Preprocessing Data Basics
Summary
Types of Data Sets
Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents represented as term-frequency vectors

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1     3      0     5     0      2     6    0     2        0       2
  Document 2     0      7     0     2      1     0    0     3        0       0
  Document 3     0      1     0     0      1     2    2     0        3       0

  Transaction data

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

Graph and network
  World Wide Web
  Social or information networks
  Molecular structures
Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential data: transaction sequences
  Genetic sequence data
Spatial, image and multimedia
  Spatial data: maps
  Image data
  Video data
Basic terms
Data
Data is raw, unorganized facts that need to be
processed.
Example: Marks of students
Information
When data is processed, organized, structured or
presented in a given context so as to make it useful, it is called information.
Metadata
Metadata is data about data.
Data such as table name, column name, data type, authorized user and
user access privileges for any table is called metadata for that table.
Table: Faculty
Emp_Name      Address    Mobile_No  Subject
Prof. Sharma  Ahmedabad  1234       Mgmt
Data dictionary
A data dictionary is an information repository which contains
metadata.
• Table Name – Faculty
• Column Name – EmpName, Address, Mob, Subject, Salary
• Datatype – Varchar, Decimal
• Access Privileges – Read, Write (Update)
Data warehouse
A data warehouse is an information repository which stores
historical data.
Database schema
Attributes
Attribute (also called dimension, feature, or variable):
a data field representing a characteristic or feature
of a data object.
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric (quantitative):
  Interval-scaled
  Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
Interval scales
Measurements with two defining principles:
equidistant scales and no true zero.
Equidistance means that the values on the scale are
separated by equal units.
A true zero refers to a scale where 0 indicates the
absence of something.
An interval scale lacks a true zero.
Values have order.
Examples of scales without a true zero include
temperature in Celsius or Fahrenheit and calendar dates.
Quantitative vs Qualitative Data
Numeric attribute
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., the set of words in a collection of documents
Sometimes represented as integer variables
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Typically represented as floating-point variables
Discrete vs. Continuous Attributes
Classify each attribute as discrete (d) or continuous (c):
1. Number of emergency room patients
2. Blood pressure of a patient
3. Weight of a patient
4. Pulse for a patient
5. Emergency room wait time rounded to the nearest minute
6. Tumor size
Answers: 1 d, 2 c, 3 c, 4 d, 5 d, 6 c
Nominal, Ordinal, Interval, and Ratio Scales
Weight of luggage
Shoe size
Chapter 2: Getting to Know Your Data
Data Visualization
Summary
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities
of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
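Mean of n values x_1, ..., x_n:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Weighted arithmetic mean, with weights w_1, ..., w_n:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$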
◦ Trimmed mean: the mean obtained after chopping off values
at the high and low extremes.
For example, we can sort the values observed for salary and remove the top and
bottom 2% before computing the mean. We should avoid trimming too large a portion
(such as 20%) at both ends, as this can result in the loss of valuable information.
Median
◦ Middle value if odd number of values, or average of the
middle two values otherwise
Mode
◦ Value that occurs most frequently in the data set
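The measures of central tendency above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the `salaries` sample and the helper names `weighted_mean` and `trimmed_mean` are hypothetical, and the trimming rule (drop the same number of values from each end after sorting) is one common convention.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the sorted values at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])


# Hypothetical salary sample (in thousands) with one extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

sample_mean = mean(salaries)                     # pulled up by the value 110
sample_median = median(salaries)                 # average of the two middle values
sample_modes = multimode(salaries)               # every most-frequent value
sample_trimmed = trimmed_mean(salaries, 0.10)    # drops 30 and 110
sample_weighted = weighted_mean([1, 2, 3], [3, 2, 1])
```

On this sample the trimmed mean and the median sit below the plain mean, which is the effect of the single extreme value.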
Outlier detection using quartile
Example
Data:
4, 17, 7, 14, 18, 12, 3, 16,
10, 4, 4, 11
Solution
Put them in order:
3, 4, 4, 4, 7, 10, 11, 12, 14, 16, 17, 18
Cut it into quarters:
3, 4, 4 | 4, 7, 10 | 11, 12, 14 | 16, 17, 18
In this example every quartile falls between two values, so each is the average of its two neighbours:
Quartile 1 (Q1) = (4+4)/2 = 4
Quartile 2 (Q2) = (10+11)/2 = 10.5
Quartile 3 (Q3) = (14+16)/2 = 15
The Lowest Value is 3,
The Highest Value is 18
Interquartile Range is: Q3 − Q1 = 15 − 4 = 11
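The worked example can be reproduced with a short Python sketch. It assumes the same convention as the solution above (Q1 and Q3 are the medians of the lower and upper halves of the sorted data); library routines such as `statistics.quantiles` use interpolation rules that can give slightly different quartiles. The 1.5 x IQR fences used to flag outliers are a common rule of thumb and an assumption here, since the slide on outlier detection did not survive extraction. The function name `quartiles_by_halves` is hypothetical.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1

# Assumed rule of thumb: values beyond 1.5 * IQR from the quartiles are outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
```

For this data set the fences are -12.5 and 31.5, so no value is flagged as an outlier.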
Graphic Displays of Basic Statistical Descriptions
Histograms Often Tell More than Boxplots
Histogram vs Box plot
Positively and Negatively Correlated Data
Uncorrelated Data
Example I
Frequency Table
Class Count
13-25 14
26-38 8
39-51 3
52-64 1
65-77 1
Example II
Attributes: Age, %fat
Example II solution
Statistic            Age     %fat
Population size      18      18
Median               51      30.7
Minimum              23      7.8
Maximum              61      42.5
First quartile       36      26.35
Third quartile       57.25   34.225
Interquartile range  21.25   7.875
Outliers             none    7.8, 9.5
Similarity and Dissimilarity
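Similarity
Numerical measure of how alike two data objects are
Higher when objects are more alike
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike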
Suppose that we have n objects (e.g., persons,
items, or courses) described by p attributes (also
called measurements or features, such as age,
height, weight, or gender).
Data Matrix
Data matrix (object by attribute structure or n by p
matrix)
n data points with p dimensions
Two-mode
Dissimilarity Matrix
Dissimilarity matrix (object by object structure)
n data points, but registers only the distance
A triangular matrix
Single-mode
$$\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
Proximity Measure for Binary Attributes
A contingency table for binary data (rows: Object i; columns: Object j)
Dissimilarity between Binary Variables
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
Distance measure for asymmetric attributes
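Contingency table
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$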
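The three distances can be checked with a short Python sketch. It rests on assumptions that are not spelled out in the surviving slide text: Gender is treated as symmetric and left out, the values Y and P are coded as 1 and N as 0, and the asymmetric binary dissimilarity is computed as (r + s) / (q + r + s), where q counts attributes that are 1 for both objects and r and s count the two kinds of mismatch. The function name is hypothetical.

```python
def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s); 0-0 matches are ignored as uninformative."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a only
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 1 in b only
    return (r + s) / (q + r + s)  # undefined if both vectors are all zeros


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (assumed coding: Y/P -> 1, N -> 0)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = asymmetric_binary_dissimilarity(jack, mary)
d_jack_jim = asymmetric_binary_dissimilarity(jack, jim)
d_jim_mary = asymmetric_binary_dissimilarity(jim, mary)
```

Under this measure Jack and Mary are the most alike pair, and Jim and Mary the least alike.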
Distance measure for symmetric attributes
Exercise
Proximity Measure for Nominal Attributes
Example I
d(i, j) = (p - m) / p = (1 - 0) / 1 = 1
p = 1 (one nominal attribute)
Example II
RollNo Marks Grade
1 90 A
2 80 B
3 82 B
4 90 A
Distance measure
d(i, j) = (p - m) / p, with p = 2 attributes (Marks, Grade) and m = number of matching attributes
The matrix is symmetric, so only the lower triangle is shown.

          RollNo 1         RollNo 2           RollNo 3         RollNo 4
RollNo 1  (2 - 2) / 2 = 0
RollNo 2  (2 - 0) / 2 = 1  (2 - 2) / 2 = 0
RollNo 3  (2 - 0) / 2 = 1  (2 - 1) / 2 = 0.5  (2 - 2) / 2 = 0
RollNo 4  (2 - 2) / 2 = 0  (2 - 0) / 2 = 1    (2 - 0) / 2 = 1  (2 - 2) / 2 = 0
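As a cross-check of the matrix, here is a minimal Python sketch of simple matching on the RollNo table, treating both Marks and Grade as nominal attributes as the example does. The function and variable names are hypothetical.

```python
def nominal_dissimilarity(a, b):
    """Simple matching: (p - m) / p for p attributes with m matches."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


# RollNo -> (Marks, Grade), copied from the table in Example II.
records = {1: (90, "A"), 2: (80, "B"), 3: (82, "B"), 4: (90, "A")}

# Lower triangle of the dissimilarity matrix, keyed by (row, column).
matrix = {
    (i, j): nominal_dissimilarity(records[i], records[j])
    for i in records
    for j in records
    if j <= i
}
```

Students 1 and 4 agree on both attributes and get distance 0, while students 2 and 3 share only the grade and get 0.5.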
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
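h = 1: Manhattan (city block, L1 norm) distance
E.g., the Hamming distance: the number of bits that are
different between two binary vectors
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
h = 2: Euclidean (L2 norm) distance
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
h → ∞: supremum (L∞ norm) distance, the maximum difference over all attributes
$$d(i, j) = \max_{f} |x_{if} - x_{jf}|$$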
Example
Supremum distance between x1 = (1, 2) and x2 = (3, 5):
f1 difference: |1 - 3| = 2
f2 difference: |2 - 5| = 3
Select the maximum difference, i.e., d(x1, x2) = 3
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
Dissimilarity Matrix
(with Euclidean Distance)
      x1    x2   x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1  0
x4    4.24  1    5.39  0
Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Manhattan (L1)
L1    x1    x2   x3    x4
x1    0
x2    5     0
x3    3     6    0
x4    6     1    7     0

Euclidean (L2)
L2    x1    x2   x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1  0
x4    4.24  1    5.39  0

Supremum (L∞)
L∞    x1    x2   x3    x4
x1    0
x2    3     0
x3    2     5    0
x4    3     1    5     0
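The three matrices can be recomputed with a small Python sketch of the Minkowski distance. The helper names `minkowski` and `lower_triangle` are hypothetical, and passing `math.inf` as the order is simply this sketch's way of selecting the supremum distance.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    if math.isinf(h):
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)


points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def lower_triangle(h):
    """Pairwise distances below the diagonal, rounded to two decimals."""
    names = list(points)
    return {
        (names[i], names[j]): round(minkowski(points[names[i]], points[names[j]], h), 2)
        for i in range(len(names))
        for j in range(i)
    }


manhattan = lower_triangle(1)
euclidean = lower_triangle(2)
supremum = lower_triangle(math.inf)
```

For any pair of points the supremum distance is at most the Euclidean distance, which is at most the Manhattan distance, as the three tables show.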
Example
Ordinal Variables
Dissimilarity measure for ordinal data
There are three states for test-2: fair, good, and
excellent, that is, Mf = 3.
Step 1: Replace each value for test-2 by its rank; the
four objects are assigned the ranks 3, 1, 2, and
3, respectively.
Step 2: Normalize the ranking by mapping rank
1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
Step 3: Use the Euclidean distance on the normalized values.
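The three steps can be sketched in Python as follows. The ranks come from the text above. The normalization (rank - 1) / (Mf - 1) is the mapping that sends ranks 1, 2, 3 to 0.0, 0.5, 1.0, and with a single attribute the Euclidean distance reduces to an absolute difference. The names are hypothetical.

```python
def normalize_rank(rank, num_states):
    """Map a rank in 1..num_states onto [0, 1]."""
    return (rank - 1) / (num_states - 1)


ranks = [3, 1, 2, 3]  # step 1: ranks of the four objects for test-2
z = [normalize_rank(r, 3) for r in ranks]  # step 2: normalized ranks

# Step 3: Euclidean distance on one attribute is |z_i - z_j|.
# Row i holds the distances from object i to objects 0..i (lower triangle).
dissimilarity = [
    [abs(z[i] - z[j]) for j in range(i + 1)]
    for i in range(len(z))
]
```

Objects 1 and 4 share the same rank, so their distance is 0; objects with ranks 1 and 3 are maximally far apart at 1.0.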
Example
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Example: Cosine Similarity
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5
= 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5
= 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
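The same calculation as a minimal Python sketch, using the two term-frequency vectors from the example. The function name `cosine_similarity` is hypothetical, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes neither vector is all zeros


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
```

Because the measure depends only on the angle between the vectors, scaling a document's counts by a constant factor leaves the similarity unchanged.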
Attributes of Mixed Type
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary,
numeric, ordinal
One may use a weighted formula to combine their effects
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
The indicator δij(f) = 0 if either
xif or xjf is missing (i.e., there is no measurement of attribute f for
object i or object j), or
xif = xjf = 0 and attribute f is asymmetric binary;
otherwise, δij(f) = 1.
f is binary or nominal:
dij(f) = 0 if xif = xjf
dij(f) = 1 otherwise
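f is numeric: use the normalized distance
$$d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$$
where h runs over all nonmissing objects for attribute f.
f is ordinal:
Compute the ranks rif and
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
Treat zif as interval-scaled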
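The combined formula can be sketched in Python as below. This is an illustration under stated assumptions, not the slides' own example: the four `rows` are hypothetical data (one nominal, one ordinal and one numeric attribute), the ordinal column is assumed to be already converted to normalized ranks in [0, 1], missing values are represented by `None`, and the function and kind names are made up for this sketch.

```python
def mixed_dissimilarity(i, j, rows, kinds):
    """Average of per-attribute dissimilarities over attributes with delta = 1."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xi, xj = rows[i][f], rows[j][f]
        if xi is None or xj is None:
            continue  # delta = 0: a measurement is missing
        if kind == "asymmetric_binary" and xi == 0 and xj == 0:
            continue  # delta = 0: 0-0 match on an asymmetric binary attribute
        if kind in ("nominal", "binary", "asymmetric_binary"):
            d = 0.0 if xi == xj else 1.0
        elif kind == "ordinal":
            d = abs(xi - xj)  # values are already normalized ranks in [0, 1]
        else:  # "numeric": normalize by the range over nonmissing objects
            column = [row[f] for row in rows if row[f] is not None]
            spread = max(column) - min(column)
            d = abs(xi - xj) / spread if spread else 0.0
        numerator += d
        denominator += 1.0
    return numerator / denominator  # undefined if every delta is 0


# Hypothetical objects: (nominal code, normalized ordinal rank, numeric value).
rows = [("A", 1.0, 45), ("B", 0.0, 22), ("C", 0.5, 64), ("A", 1.0, 28)]
kinds = ("nominal", "ordinal", "numeric")

d_21 = mixed_dissimilarity(1, 0, rows, kinds)
d_41 = mixed_dissimilarity(3, 0, rows, kinds)
d_32 = mixed_dissimilarity(2, 1, rows, kinds)
```

With these values the first and fourth objects agree on the nominal and ordinal attributes, so only the numeric attribute contributes and their distance is small.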
Example
Dissimilarity Matrix for test I
d(i, j) = (p - m) / p (simple matching)
p = 1
Dissimilarity Matrix for test II
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
Dissimilarity Matrix for test III
Solution
Dissimilarity Matrix
Example
Solution (a)
Solution (b)
$$X_1' = \left( \frac{A_1}{\sqrt{A_1^2 + A_2^2}},\ \frac{A_2}{\sqrt{A_1^2 + A_2^2}} \right)$$
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
Many types of data sets, e.g., numerical, text, graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
Above steps are the beginning of data preprocessing.
Many methods have been developed, but this is still an active area of
research.