Data Preprocessing Data Basics

The scales represented are: Nominal: Favourite candy bar Ratio: Weight of luggage Interval: Year of your birth Ordinal: Egg size (small, medium, large, extra large, jumbo) Ordinal: Military rank Ratio: Number of children in a family Nominal: Jersey numbers for a football team Interval: Shoe size

Uploaded by

Jaydeep Dodiya


Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
• Record
  ◦ Relational records
  ◦ Data matrix, e.g., numerical matrix, crosstabs
  ◦ Document data: text documents represented as term-frequency vectors
  ◦ Transaction data
• Graph and network
  ◦ World Wide Web
  ◦ Social or information networks
  ◦ Molecular structures
• Ordered
  ◦ Video data: sequence of images
  ◦ Temporal data: time series
  ◦ Sequential data: transaction sequences
  ◦ Genetic sequence data
• Spatial, image and multimedia
  ◦ Spatial data: maps
  ◦ Image data
  ◦ Video data

Document data (term-frequency vectors):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Basic terms
• Data
  ◦ Data is raw, unorganized facts that need to be processed.
  ◦ Example: marks of students
    Student_1 = 50/100, Student_2 = 25/100.
• Information
  ◦ When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
  ◦ Example: result of students (Pass or Fail)
    Student_1 = Pass, Student_2 = Fail.
Basic terms
• Metadata
  ◦ Metadata is data about data.
  ◦ Data such as the table name, column names, data types, authorized users and user access privileges for a table is called the metadata for that table.

Faculty
Emp_Name      Address    Mobile_No  Subject
Prof. Sharma  Ahmedabad  1234       Mgmt

• Metadata of the above table:
  ◦ Table name, such as Faculty
  ◦ Column names, such as Emp_Name, Address, Mobile_No, Subject
  ◦ Data types, such as Varchar, Decimal
  ◦ Access privileges, such as Read, Write (Update)
Basic terms
• Data dictionary
  ◦ A data dictionary is an information repository which contains metadata.
    • Table Name – Faculty
    • Column Name – EmpName, Address, Mob, Subject, Salary
    • Datatype – Varchar, Decimal
    • Access Privileges – Read, Write (Update)
• Data warehouse
  ◦ A data warehouse is an information repository which stores historical data.

Faculty
Emp_Name      Address    Mobile_No  Subject
Prof. Sharma  Ahmedabad  1234       Mgmt
Prof. Verma   Ahmedabad  5678       DBMS
Database schema

• Relational database schema (a set of related relations):
  ◦ student(ID, name, dept_name, tot_cred)
  ◦ advisor(s_id, i_id)
  ◦ takes(ID, course_id, sec_id, semester, year, grade)
  ◦ classroom(building, room_number, capacity)
  ◦ time_slot(time_slot_id, day, start_time, end_time)
Data Objects

• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
  ◦ sales database: customers, store items, sales
  ◦ medical database: patients, treatments
  ◦ university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (also called dimension, feature or variable): a data field representing a characteristic or feature of a data object.
  ◦ E.g., customer_ID, name, address
• Types:
  ◦ Nominal
  ◦ Binary
  ◦ Ordinal
  ◦ Numeric (quantitative)
    • Interval-scaled
    • Ratio-scaled
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings

12
Interval scales
• Measurements with two defining principles: equidistant scales and no true zero.
• Equidistance refers to intervals whose values are distributed in equal units.
• A true zero refers to a scale where 0 indicates the absence of something.
• An interval scale lacks a true zero.
• Values have order.
• Examples of scales without a true zero include rating scales, temperature in Celsius or Fahrenheit, latitude and longitude, and calendar dates.
Ratio scales
• Measurements with two defining principles: equidistant scales and a true zero.
• Because zero is meaningful, we can speak of one value as a multiple of another (10 K is twice as high as 5 K).
• Examples of scales with a true zero include weight, height, time, and calories.
Quantitative vs Qualitative Data

15
Numeric attribute

16
Discrete vs. Continuous Attributes
• Discrete attribute
  ◦ Has only a finite or countably infinite set of values
  ◦ E.g., zip codes, profession, or the set of words in a collection of documents
  ◦ Sometimes represented as integer variables
  ◦ Note: binary attributes are a special case of discrete attributes
• Continuous attribute
  ◦ Has real numbers as attribute values
  ◦ E.g., temperature, height, or weight
  ◦ Practically, real values can only be measured and represented using a finite number of digits
  ◦ Continuous attributes are typically represented as floating-point variables
Discrete vs. Continuous Attributes
 Number of emergency room patients
 Blood pressure of a patient
 Weight of a patient
 Pulse for a patient
 Emergency room wait time rounded to the nearest minute
 Tumor size

Answers: d,c,c,d,d,c

18
19
Nominal, Ordinal, Interval, and Ratio Scales

Each scale is represented once in the list below.

 Favourite candy bar

 Weight of luggage

 Year of your birth

 Egg size (small, medium, large, extra large, jumbo)

Each scale is represented once in the list below.

 Military rank

 Number of children in a family

 Jersey numbers for a football team

 Shoe size

Answers: N,R,I,O and O,R,N,I

20
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

21
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities
of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube

22
Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\qquad \mu = \frac{\sum x}{N}$$

◦ Distributive measures: sum() and count()
◦ Algebraic measure: avg()
◦ Weighted arithmetic mean (weighted average):

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population):

◦ Trimmed mean: the mean obtained after chopping off values at the high and low extremes. For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.

Problem: the mean is sensitive to extreme values.
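
The three means described on these slides can be compared on a small sample. The sketch below is only an illustration: the salary values, the weights and the helper names `weighted_mean` and `trimmed_mean` are assumptions made for the example, not part of the slides.

```python
from statistics import fmean


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    return fmean(ordered[k:len(ordered) - k])


# Hypothetical salaries (in thousands) with one extreme value.
salaries = [30, 31, 32, 33, 34, 35, 36, 37, 38, 1000]

plain = fmean(salaries)                   # pulled upward by the value 1000
trimmed = trimmed_mean(salaries, 0.10)    # drops 30 and 1000 before averaging
weighted = weighted_mean([50, 25], [3, 1])  # hypothetical marks and weights

print(plain, trimmed, weighted)
```

The plain mean lands far above every typical salary, while the trimmed mean stays close to the bulk of the data, which is the sensitivity problem noted above.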

24
Measuring the Central Tendency

• Median
  ◦ Middle value if there is an odd number of values, or the average of the middle two values otherwise
• Mode
  ◦ Value that occurs most frequently in the data set
• Midrange
  ◦ The midrange can also be used to assess the central tendency of a data set. It is the average of the largest and smallest values in the set.
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively skewed and negatively skewed data
[Figure: three distribution curves, labelled symmetric, positively skewed and negatively skewed]

6/22/23 Data Mining: Concepts and Techniques 26

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
  ◦ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  ◦ Inter-quartile range: IQR = Q3 – Q1
  ◦ Five-number summary: min, Q1, median, Q3, max
  ◦ Boxplot: the ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  ◦ Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
  ◦ Variance (algebraic, scalable computation):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

  ◦ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
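
The variance can be computed either from its definition or from running sums of x and x², which is what makes it an algebraic, scalable measure. The following is a minimal sketch of both routes; the function names and the sample values are assumptions chosen for illustration, and the result is cross-checked against Python's standard `statistics` module.

```python
from statistics import fmean, pvariance, variance


def sample_variance(xs):
    """s^2 from the definition: squared deviations from the mean, over n - 1."""
    n = len(xs)
    mean = fmean(xs)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_from_sums(xs):
    """s^2 from the running sums of x and x^2 (single pass over the data)."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def population_variance_from_sums(xs):
    """sigma^2 = (1/N) * sum(x^2) - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu


data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]  # illustrative sample

s2_definition = sample_variance(data)
s2_sums = sample_variance_from_sums(data)
sigma2 = population_variance_from_sums(data)

print(s2_definition, s2_sums, variance(data))
print(sigma2, pvariance(data))
```

The sum-based form needs only count, sum and sum of squares, so partial results from separate data partitions can be merged; on data with a large mean and small spread it can lose floating-point precision, so the definition-based form is the safer default for small data sets.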
Outlier detection using quartiles

1. Arrange the data in ascending order.
2. Calculate Q1 (the first quartile).
3. Calculate Q3 (the third quartile).
4. Find IQR = Q3 - Q1.
5. Find the lower range = Q1 - (1.5 * IQR).
6. Find the upper range = Q3 + (1.5 * IQR).
7. Detect outliers (data points < lower range or data points > upper range) and remove them.
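
The steps above can be sketched as a short function. The data values and the name `iqr_outliers` are hypothetical. Quartile conventions differ between textbooks and tools; this sketch relies on `statistics.quantiles` with its default "exclusive" method, which is one convention among several.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers, kept) using the k * IQR rule."""
    ordered = sorted(values)                # step 1: ascending order
    q1, _, q3 = quantiles(ordered, n=4)     # steps 2-3: Q1 and Q3
    iqr = q3 - q1                           # step 4
    lower = q1 - k * iqr                    # step 5
    upper = q3 + k * iqr                    # step 6
    outliers = [v for v in ordered if v < lower or v > upper]
    kept = [v for v in ordered if lower <= v <= upper]   # step 7
    return lower, upper, outliers, kept


# Hypothetical measurements with one suspicious value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 40]
lower, upper, outliers, kept = iqr_outliers(data)

print(lower, upper)   # bounds of the accepted range
print(outliers)       # values flagged by the rule
```

Whether flagged values should actually be removed depends on the analysis; the rule only marks candidates.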
Boxplot Analysis
 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the
box
 Whiskers: two lines outside the box extended
to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually

29
Example

Data:
4, 17, 7, 14, 18, 12, 3, 16,
10, 4, 4, 11
Solution
Put them in order:
3, 4, 4, 4, 7, 10, 11, 12, 14, 16, 17, 18
Cut it into quarters:
3, 4, 4 | 4, 7, 10 | 11, 12, 14 | 16, 17, 18
all the quartiles are between numbers:
Quartile 1 (Q1) = (4+4)/2 = 4
Quartile 2 (Q2) = (10+11)/2 = 10.5
Quartile 3 (Q3) = (14+16)/2 = 15
The Lowest Value is 3,
The Highest Value is 18
Interquartile Range is: Q3 − Q1 = 15 − 4 = 11
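
The "cut it into quarters" procedure amounts to taking the median of the lower half and of the upper half of the sorted data. A small sketch follows; the function name is an assumption, and leaving the middle value out of both halves when the count is odd is one common convention rather than something stated on the slide.

```python
from statistics import median


def quartiles_by_halves(values):
    """Q1, Q2, Q3 as the medians of the lower half, all data, and the upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count the middle value is left out of both halves (assumed convention).
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)   # 4.0 10.5 15.0 11.0
```

Library routines that interpolate between positions can return slightly different quartiles for the same data, so it helps to state which convention was used.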

31
Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of the five-number summary
• Histogram: the x-axis shows values, the y-axis represents frequencies
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis (frequency) from 0 to 40]
Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation
  ◦ The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions
Histogram vs Box plot

What do box plots tell you that histograms don't?
Box plots give the least and greatest values in the data, as well as the median. They also give the lower and upper quartiles. Those values can only be estimated from histograms.
Scatter plot
 Provides a first look at bivariate data to see clusters of
points, outliers, etc
 Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

37
Positively and Negatively Correlated Data

• The left half of the fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data

39
Example I

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

1. Find the mean, median, mode with data modality (bimodal/trimodal), and midrange
2. Find the five-number summary
3. Plot a histogram
4. Plot a box plot
Solution
1. Mean ≈ 30 (809/27 = 29.96), median = 25, mode = 25 and 35 (bimodal), midrange (average of the largest and smallest values) = (13 + 70)/2 = 41.5
2. Q1 = 20, Q3 = 35, so the five-number summary (min, Q1, median (Q2), Q3, max) = 13, 20, 25, 35, 70
3. Frequency table for the histogram:

Class   Count
13-25   14
26-38   8
39-51   3
52-64   1
65-77   1
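
The figures for Example I can be re-derived with Python's standard `statistics` module. This is only a checking sketch: the variable names are illustrative, and the quartiles come from `statistics.quantiles` with its default "exclusive" method, which happens to agree with the values on the slide for this data.

```python
from statistics import fmean, median, multimode, quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean_age = fmean(ages)                    # 809 / 27
median_age = median(ages)
modes = multimode(ages)                   # every value tied for most frequent
midrange = (min(ages) + max(ages)) / 2

q1, q2, q3 = quantiles(ages, n=4)
five_number_summary = (min(ages), q1, q2, q3, max(ages))

# Frequency table with the class boundaries used on the slide.
classes = [(13, 25), (26, 38), (39, 51), (52, 64), (65, 77)]
counts = [sum(lo <= a <= hi for a in ages) for lo, hi in classes]

print(round(mean_age, 2), median_age, modes, midrange)
print(five_number_summary)
print(counts)
```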
Example II

Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following results:

age    23    23    27    27    39    41    47    49    50
%fat   9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2

age    52    54    54    56    57    58    58    60    61
%fat   34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7

1. Calculate the mean, median, and standard deviation of age and %fat.
2. Draw the boxplots for age and %fat.
3. Draw a scatter plot and a q-q plot based on these two variables.
Example II solution

                      Age      %fat
Population size       18       18
Median                51       30.7
Minimum               23       7.8
Maximum               61       42.5
First quartile        36       26.35
Third quartile        57.25    34.225
Interquartile range   21.25    7.875
Outliers              none     7.8, 9.5
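
The summary values above can be reproduced, and the mean and standard deviation asked for in the exercise obtained, with a short script. It is a sketch under two assumptions: quartiles use the default "exclusive" method of `statistics.quantiles`, which matches the tabulated values, and the standard deviation is the sample version (`stdev`); use `pstdev` if the 18 adults are treated as the whole population.

```python
from statistics import fmean, median, quantiles, stdev

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]


def describe(values):
    """Mean, median, sample standard deviation, quartiles and 1.5 * IQR outliers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": fmean(values),
        "median": median(values),
        "stdev": stdev(values),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "outliers": [v for v in sorted(values) if v < low or v > high],
    }


age_stats = describe(age)
fat_stats = describe(fat)

for name, stats in (("age", age_stats), ("%fat", fat_stats)):
    print(name, {key: value for key, value in stats.items()})
```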
Example II solution
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Summary

47
Similarity and Dissimilarity
• Similarity
  ◦ Numerical measure of how alike two data objects are
  ◦ Value is higher when objects are more alike
  ◦ Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
  ◦ Numerical measure of how different two data objects are
  ◦ Lower when objects are more alike
  ◦ Minimum dissimilarity is often 0
  ◦ Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity and Dissimilarity
• Suppose that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements or features, such as age, height, weight, or gender).
• The objects are x_1 = (x_11, x_12, ..., x_1p), x_2 = (x_21, x_22, ..., x_2p), and so on, where x_ij is the value for object x_i of the j-th attribute.
Data Matrix
• Data matrix (object-by-attribute structure, or n-by-p matrix)
  ◦ n data points with p dimensions
  ◦ Two-mode

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$
Dissimilarity Matrix
• Dissimilarity matrix (object-by-object structure)
  ◦ n data points, but registers only the distance
  ◦ A triangular matrix
  ◦ Single-mode

$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
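Proximity Measure for Binary Attributes
• A contingency table for binary data

                    Object j
                    1        0        sum
Object i    1       q        r        q + r
            0       s        t        s + t
            sum     q + s    r + t    p

• Distance measure for symmetric binary variables:

$$d(i, j) = \frac{r + s}{q + r + s + t}$$

• Distance measure for asymmetric binary variables:

$$d(i, j) = \frac{r + s}{q + r + s}$$

• Jaccard coefficient (similarity measure for asymmetric binary variables):

$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)$$

• Note: the Jaccard coefficient is the same as "coherence"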
Dissimilarity between Binary Variables
 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

 Gender is a symmetric attribute

 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N 0

54
Distance measure for asymmetric attributes

        Fever  Cough  Test I  Test II  Test III  Test IV
Jack    1      0      1       0        0         0
Mary    1      0      1       0        1         0

Contingency table (Jack vs. Mary)

              Mary = 1   Mary = 0   Sum
Jack = 1      2          0          2
Jack = 0      1          3          4
Sum           3          3          6

$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
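
The three distances can be recomputed from the patient table by counting, for each pair, the attributes where both values are 1 (q), where only the first is 1 (r) and where only the second is 1 (s); joint absences are ignored for asymmetric attributes. The sketch below assumes the encoding Y/P = 1 and N = 0 given earlier; the function names and the letters q, r, s, t are illustrative choices.

```python
def binary_counts(a, b):
    """Contingency counts for two equal-length 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both present
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # only in a
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # only in b
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both absent
    return q, r, s, t


def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s): joint absences (t) carry no information."""
    q, r, s, _ = binary_counts(a, b)
    denominator = q + r + s
    return (r + s) / denominator if denominator else 0.0


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Gender is symmetric and left out).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = asymmetric_binary_dissimilarity(jack, mary)
d_jack_jim = asymmetric_binary_dissimilarity(jack, jim)
d_jim_mary = asymmetric_binary_dissimilarity(jim, mary)

print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

The Jaccard similarity for a pair is one minus the value returned here.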
Distance measure for symmetric attributes

Exercise

Find the distance measure for symmetric attributes

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

56
Proximity Measure for Nominal Attributes

• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
• Method 1: Simple matching
  ◦ m: # of matches, p: total # of variables

$$d(i, j) = \frac{p - m}{p}$$
Example I

d(i, j) = (p - m) / p = (1 - 0) / 1 = 1
p = 1 (one nominal attribute)
Example II
RollNo Marks Grade
1 90 A
2 80 B
3 82 B
4 90 A

d(RollNo1,RollNo1) d(RollNo1,RollNo2) d(RollNo1,RollNo3) d(RollNo1,RollNo4)

d(RollNo2,RollNo1) d(RollNo2,RollNo2) d(RollNo2,RollNo3) d(RollNo2,RollNo4)

d(RollNo3,RollNo1) d(RollNo3,RollNo2) d(RollNo3,RollNo3) d(RollNo3,RollNo4)

d(RollNo4,RollNo1) d(RollNo4,RollNo2) d(RollNo4,RollNo3) d(RollNo4,RollNo4)

distance(object1, object2) = (P – M) / P

P is the total number of attributes
M is the total number of matches

59
Distance measure

With P = 2 attributes (Marks and Grade) and d(i, j) = (P – M) / P:

d(1,1) = (2 – 2)/2 = 0
d(2,1) = (2 – 0)/2 = 1    d(2,2) = (2 – 2)/2 = 0
d(3,1) = (2 – 0)/2 = 1    d(3,2) = (2 – 1)/2 = 0.5    d(3,3) = (2 – 2)/2 = 0
d(4,1) = (2 – 2)/2 = 0    d(4,2) = (2 – 0)/2 = 1      d(4,3) = (2 – 0)/2 = 1    d(4,4) = (2 – 2)/2 = 0

The entries above the diagonal, d(RollNo1,RollNo2) through d(RollNo3,RollNo4), follow by symmetry.
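
The lower-triangular matrix can be generated by applying simple matching to every pair of rows. A minimal sketch, treating both Marks and Grade as nominal values as the slide does; the dictionary layout and function name are illustrative assumptions.

```python
def simple_matching_distance(a, b):
    """(P - M) / P, where P is the attribute count and M the number of matches."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


# RollNo -> (Marks, Grade)
records = {1: (90, "A"), 2: (80, "B"), 3: (82, "B"), 4: (90, "A")}

# Lower triangle only: the matrix is symmetric with zeros on the diagonal.
matrix = {
    (i, j): simple_matching_distance(records[i], records[j])
    for i in records
    for j in records
    if j <= i
}

for i in records:
    print([matrix[(i, j)] for j in records if j <= i])
```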
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: a popular distance measure

$$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$$

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L_h norm)
• Properties
  ◦ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  ◦ d(i, j) = d(j, i) (symmetry)
  ◦ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
  ◦ E.g., the Hamming distance: the number of bits that are different between two binary vectors

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

• h = 2: Euclidean (L2 norm) distance

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

• h → ∞: "supremum" (L_max norm, L_∞ norm) distance
  ◦ The maximum difference in values over all components (attributes) f of the vectors (objects)

$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$
Example

f1 difference |1-3| = 2
f2 difference |2-5| = 3
Select maximum difference i.e. 3
63
Example: Data Matrix and Dissimilarity Matrix

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

[Figure: the four points x1 to x4 plotted in the plane, axes marked 0, 2, 4]

Dissimilarity Matrix (with Euclidean Distance)
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
Example: Minkowski Distance
Dissimilarity Matrices

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

[Figure: the four points x1 to x4 plotted in the plane, axes marked 0, 2, 4]

Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
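
All three matrices come from the same Minkowski formula with h = 1, h = 2 and h → ∞. The following sketch recomputes them for the four points; the function names and the use of `math.inf` to select the supremum case are illustrative choices, not part of the slides.

```python
from math import inf


def minkowski(a, b, h):
    """Minkowski distance of order h; h = inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


def dissimilarity_matrix(points, h):
    """Lower-triangular matrix as a dict keyed by (row, column) names."""
    names = list(points)
    return {
        (a, b): round(minkowski(points[a], points[b], h), 2)
        for i, a in enumerate(names)
        for b in names[: i + 1]
    }


points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

manhattan = dissimilarity_matrix(points, 1)
euclidean = dissimilarity_matrix(points, 2)
supremum = dissimilarity_matrix(points, inf)

for label, matrix in (("L1", manhattan), ("L2", euclidean), ("Lmax", supremum)):
    print(label)
    for i, a in enumerate(points):
        print(a, [matrix[(a, b)] for b in list(points)[: i + 1]])
```

For any pair of points the supremum distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance, as the three matrices show.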
Example

Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
1. Compute the Euclidean distance between the two objects.
2. Compute the Manhattan distance between the two objects.
3. Compute the Minkowski distance between the two objects, using h = 3.
4. Compute the supremum distance between the two objects.
Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables:
  ◦ replace x_if by its rank r_if ∈ {1, ..., M_f}
  ◦ map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

  ◦ compute the dissimilarity using methods for interval-scaled variables
Example

70
Dissimilarity measure for ordinal data
• There are three states for test-2: fair, good, and excellent, that is, M_f = 3.
• Step 1: Replace each value for test-2 by its rank; the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
• Step 2: Normalize the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
• Step 3: Use the Euclidean distance on the normalized values.
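
The three steps can be sketched in a few lines. The test-2 values below are inferred from the stated ranks 3, 1, 2, 3 (excellent, fair, good, excellent) and should be read as an assumption, as should the function name; with a single attribute the Euclidean distance reduces to an absolute difference.

```python
def normalized_ranks(values, ordered_states):
    """Map each ordinal value to z = (rank - 1) / (M - 1), with ranks 1..M."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: position + 1 for position, state in enumerate(ordered_states)}
    return [(rank[v] - 1) / (m - 1) for v in values]


states = ["fair", "good", "excellent"]               # lowest to highest, M_f = 3
test2 = ["excellent", "fair", "good", "excellent"]   # inferred from ranks 3, 1, 2, 3

z = normalized_ranks(test2, states)

# Lower-triangular dissimilarity matrix on the normalized values.
matrix = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]

print(z)
for row in matrix:
    print(row)
```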
Example

72
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

where · indicates the vector dot product and ||d|| is the length of vector d
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
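
The same calculation can be written as a small function; the name `cosine_similarity` and the zero-vector check are illustrative additions.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))   # 0.94
```

Because only the angle between the vectors matters, scaling either term-frequency vector (for example, a document twice as long with the same word proportions) leaves the result unchanged.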

74
Attributes of Mixed Type
• A database may contain all attribute types
  ◦ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

• The indicator δ_ij^(f) = 0 if either
  ◦ x_if or x_jf is missing (i.e., there is no measurement of attribute f for object i or object j), or
  ◦ x_if = x_jf = 0 and attribute f is asymmetric binary;
• otherwise, δ_ij^(f) = 1.
Attributes of Mixed Type
• f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf
  d_ij^(f) = 1 otherwise
• f is numeric: use the normalized distance

$$d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$$

  where h runs over all non-missing objects for attribute f.
• f is ordinal
  ◦ Compute the ranks r_if and

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

  ◦ Treat z_if as interval-scaled
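
A compact sketch of the mixed-type formula for objects described by one nominal, one ordinal and one numeric attribute. The attribute values below are illustrative assumptions (the slide's own example table is not reproduced in this text), and since nothing is missing and no attribute is asymmetric binary, every indicator δ equals 1, so the result is a plain average of the per-attribute dissimilarities.

```python
# Illustrative values (assumed): four objects, three attributes of different types.
test1 = ["A", "B", "C", "A"]                          # nominal
test2 = ["excellent", "fair", "good", "excellent"]    # ordinal
test3 = [45, 22, 64, 28]                              # numeric
states = ["fair", "good", "excellent"]                # ordinal order, low to high


def d_nominal(a, b):
    """0 if the two values match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def z_values(values, ordered_states):
    """(rank - 1) / (M - 1); list.index gives rank - 1 directly."""
    m = len(ordered_states)
    return [ordered_states.index(v) / (m - 1) for v in values]


def d_numeric(a, b, value_range):
    """Absolute difference normalized by max - min of the attribute."""
    return abs(a - b) / value_range


def mixed_dissimilarity(i, j):
    z = z_values(test2, states)
    value_range = max(test3) - min(test3)
    parts = [
        d_nominal(test1[i], test1[j]),
        abs(z[i] - z[j]),
        d_numeric(test3[i], test3[j], value_range),
    ]
    # All indicators are 1 here, so the weighted formula is a simple mean.
    return sum(parts) / len(parts)


matrix = [[round(mixed_dissimilarity(i, j), 2) for j in range(i + 1)]
          for i in range(4)]
for row in matrix:
    print(row)
```

Handling missing values or asymmetric binary attributes would mean dropping the corresponding term from both the numerator and the denominator, which this sketch deliberately leaves out.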
Example

• Dissimilarity matrix for test I
  ◦ d(i, j) = (p – m) / p (simple matching)
  ◦ p = 1

• Dissimilarity matrix for test II
  ◦ z_if = (r_if – 1) / (M_f – 1)
79
 Dissimilarity Matrix for test III

80
Solution
 Dissimilarity Matrix

81
Example

82
Solution (a)

83
Solution (b)

$$X_1' = \left( \frac{A_1}{\sqrt{A_1^2 + A_2^2}},\ \frac{A_2}{\sqrt{A_1^2 + A_2^2}} \right)$$

84
Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion,
graphical displays
 Data visualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing.
 Many methods have been developed but still an active area of
research.

85
