Lecture 2:
Getting to Know Your Data
Lecturer: Dr. Nguyen, Thi Thanh Sang
(nttsang@hcmiu.edu.vn)
References:
Chapter 2 in Data Mining: Concepts and Techniques (Third Edition), by Jiawei
Han, Micheline Kamber, and Jian Pei
31/01/2021
Summary
Relational records
Transaction data
Molecular Structures
Ordered
Image data:
Video data:
Dimensionality
Curse of dimensionality
Sparsity
Distribution
Centrality and dispersion
Data Objects
Attributes
Attribute Types
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Activities
Summary
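Mean (N values): $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$
Trimmed mean: chopping extreme values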
Example:
Suppose we have the following values for salary (in
thousands of dollars), shown in increasing order: 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Mean = ?
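The example can be checked with a short Python sketch. The names `salaries` and `trimmed_mean` are ours, and the 10% trimming level is an illustrative assumption; the slide does not say how much to chop.

```python
# Salary values from the example, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean: sum of the values divided by their count.
arithmetic_mean = sum(salaries) / len(salaries)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values at each end (assumed convention)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values chopped at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


print(arithmetic_mean)               # 58.0
print(trimmed_mean(salaries, 0.10))  # drops 30 and 110 before averaging
```

The gap between the two results shows how strongly the single extreme value 110 pulls the plain mean upward.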
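Empirical formula (unimodal, moderately skewed data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$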
Example:
Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Median = ?
Mode = ?
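A minimal sketch for this example using Python's standard `statistics` module (Python 3.8 or later for `multimode`); the variable names are ours.

```python
from statistics import median, multimode

# Salary values from the example, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# With an even number of values, the median is the average of the two
# middle values (here the 6th and 7th: 52 and 56).
med = median(salaries)

# multimode returns every value tied for the highest frequency.
modes = multimode(salaries)

print(med)    # 54.0
print(modes)  # [52, 70]
```

Two values share the top frequency, so the data set is bimodal.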
Boxplot Analysis
Fig. 2.3. Boxplot for the unit price data for items sold at four
branches of AllElectronics during a given time period.
(Boxplot labels shown in the figure: Min, Q1, Median)
Example:
Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110.
Q1 = ?; Q3 = ?
IQR = ?
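Quartile conventions differ between textbooks and libraries, so the sketch below fixes one explicitly. It assumes the simple positional convention in which the k-th quartile of N sorted values is the value at 1-based position k*N/4 (so the 3rd and 9th of the 12 values here). The function name `positional_quartile` is ours; interpolating routines such as `statistics.quantiles` can return slightly different numbers.

```python
# Salary values from the example, in thousands of dollars (already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def positional_quartile(sorted_values, k):
    """k-th quartile as the value at 1-based position k*N/4 (assumed convention)."""
    n = len(sorted_values)
    if n % 4 != 0:
        raise ValueError("this simple convention assumes N is a multiple of 4")
    return sorted_values[k * n // 4 - 1]


q1 = positional_quartile(salaries, 1)  # 3rd value
q3 = positional_quartile(salaries, 3)  # 9th value
iqr = q3 - q1                          # interquartile range

print(q1, q3, iqr)  # 47 63 16
```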
Histogram Analysis
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Plots quantile information
For data values x_i sorted in increasing order, f_i
indicates that approximately 100 f_i% of the data are
below or equal to the value x_i
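The pairs (f_i, x_i) behind a quantile plot can be computed as sketched below. The slide does not say how f_i is obtained; the sketch assumes the common choice f_i = (i - 0.5) / N, and reuses the salary values from the earlier examples as sample data. The function name is ours.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs, assuming f_i = (i - 0.5) / N for sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Salary values in thousands of dollars, used here only as sample input.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
points = quantile_plot_points(salaries)

for f, x in points:
    print(f"{f:.3f}  {x}")
```

Plotting x_i against f_i then gives the quantile plot; every observation appears as one point.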
Scatter plot
Uncorrelated Data
Summary
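Data matrix
n data points with p dimensions
Two modes
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$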
Dissimilarity matrix
n data points, but registers only the distance
A triangular matrix
Single mode
$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
$d(i, j) = \frac{p - m}{p}$, where m is the number of matches and p is the total number of attributes
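Reading the formula above as the simple matching ratio d(i, j) = (p - m) / p, with m matching attributes out of p, a minimal sketch looks as follows. The three objects and their attribute values are hypothetical and serve only as an illustration.

```python
def nominal_dissimilarity(a, b):
    """Simple matching dissimilarity (p - m) / p for two nominal objects."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                  # total number of attributes
    m = sum(1 for u, v in zip(a, b) if u == v)  # number of matching attributes
    return (p - m) / p


# Hypothetical objects described by three nominal attributes.
obj_a = ("red", "circle", "small")
obj_b = ("red", "square", "small")
obj_c = ("blue", "square", "large")

print(nominal_dissimilarity(obj_a, obj_b))  # one of three attributes differs
print(nominal_dissimilarity(obj_a, obj_c))  # no attribute matches
```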
standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
Using mean absolute deviation is more robust than using standard deviation
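A sketch of the standardization for one attribute f. It assumes m_f is the attribute mean and s_f is the mean absolute deviation, s_f = (1/n) * sum(|x_if - m_f|); the defining formula for s_f did not survive on the slide, so this is our reading. The sample values are hypothetical.

```python
def mad_z_scores(values):
    """Standardize values using the mean absolute deviation as the scale s_f."""
    n = len(values)
    m_f = sum(values) / n                         # mean of the attribute
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values.
scores = mad_z_scores([1, 3, 2, 4])
print(scores)  # [-1.5, 0.5, -0.5, 1.5]
```

Because deviations are not squared, a single outlier inflates s_f less than it would inflate the standard deviation, so the outlier's own z-score stays more visible.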
Example:
Data Matrix and Dissimilarity Matrix
(Figure: scatter plot of the points x1, x2, x3, x4; both axes run from 0 to 4.)

Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Dissimilarity Matrix (with Euclidean Distance)
        x1      x2      x3      x4
x1      0
x2      3.61    0
x3      2.24    5.1     0
x4      4.24    1       5.39    0
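The Euclidean dissimilarity matrix of this example can be reproduced with a few lines of Python (`math.dist` needs Python 3.8 or later). The dictionary layout and variable names are ours.

```python
from math import dist

# The four points of the example: (attribute1, attribute2).
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

# Lower-triangular matrix: row i holds d(i, j) for j <= i,
# rounded to two decimals as on the slide.
matrix = [
    [round(dist(points[a], points[b]), 2) for b in names[: i + 1]]
    for i, a in enumerate(names)
]

for name, row in zip(names, matrix):
    print(name, row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.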
Supremum ($L_\infty$)
        x1   x2   x3   x4
x1      0
x2      3    0
x3      2    5    0
x4      3    1    5    0
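The supremum distance takes the largest absolute difference over all attributes. A minimal sketch that rebuilds the matrix above from the four example points x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), x4 = (4, 5); the function and variable names are ours.

```python
def supremum_distance(a, b):
    """L-infinity distance: the largest per-attribute absolute difference."""
    return max(abs(u - v) for u, v in zip(a, b))


# The four points of the example: (attribute1, attribute2).
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

# Lower-triangular matrix: row i holds d(i, j) for j <= i.
sup_matrix = [
    [supremum_distance(points[a], points[b]) for b in names[: i + 1]]
    for i, a in enumerate(names)
]

for name, row in zip(names, sup_matrix):
    print(name, row)
```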
Ordinal Variables
Dissimilarity matrix:
Dissimilarity matrix:
For test-3: ; for the data:
Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||) = 25 / (6.481 * 4.12) = 0.94
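The same computation as a small Python sketch; the function name `cosine_similarity` is ours, and the vectors are the two document vectors from the example.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = sqrt(sum(u * u for u in a))
    norm_b = sqrt(sum(v * v for v in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

A value close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.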
Activities
Summary
Summary