Data Mining (DM) : Lecture 3: Know Your Data
Announcements
Install Weka on your machines
Read about arff file format
E.g. http://www.cs.waikato.ac.nz/ml/weka/arff.html
Explore Iris plants dataset
Chapter 2: Getting to Know Your Data
Data Visualization
Summary
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector
Transaction data
Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data:
Video data
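As a rough illustration of the "document data: term-frequency vector" entry above, the sketch below turns one text document into a vector of term counts. The vocabulary, the sample sentence, and the helper name term_frequency_vector are all hypothetical, chosen only for this example.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never appear.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["data", "mining", "team", "score"]
document = "Data mining turns raw data into knowledge"

tf_vector = term_frequency_vector(document, vocabulary)
print(tf_vector)  # [2, 1, 0, 0]
```

Each document becomes one record, and each vocabulary term becomes one attribute of that record.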
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data points,
objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
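The row/column correspondence can be sketched with a tiny in-memory table. The customer records below are made up for illustration; each dictionary plays the role of one data object and each key the role of one attribute.

```python
# Hypothetical "customers" table: each dict is one data object (a row),
# each key is one attribute (a column).
customers = [
    {"customer_ID": 1, "name": "Ana", "city": "Lima"},
    {"customer_ID": 2, "name": "Ben", "city": "Oslo"},
]

# The attributes describing every data object.
attributes = list(customers[0].keys())

# One column: the values of a single attribute across all objects.
name_column = [row["name"] for row in customers]

print(attributes)   # ['customer_ID', 'name', 'city']
print(name_column)  # ['Ana', 'Ben']
```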
Attributes
Attribute (or dimensions, features, variables): a
data field, representing a characteristic or feature of
a data object.
o E.g., customer_ID, name, address
Types:
o Nominal
o Binary
o Ordinal
o Numeric: quantitative
i. Interval-scaled
ii. Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
o Hair_color = {auburn, black, blond, brown, grey, red, white}
o marital status, occupation, ID numbers, zip codes
Binary
o Nominal attribute with only 2 states (0 and 1)
o Symmetric binary: both outcomes equally important
e.g., gender; likewise, the binary variable "is evergreen?" for a plant has the
possible states "loses leaves in winter" and "does not lose leaves in
winter." Both states are equally valuable and carry the same weight.
o Asymmetric binary: outcomes not equally important.
o e.g., medical test (positive vs. negative)
o Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
o Values have a meaningful order (ranking) but magnitude between
successive values is not known.
o Size = {small, medium, large}, grades, army rankings
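One possible way to see what "meaningful order" buys for an ordinal attribute is to map each value to its rank and then sort or compare by rank. The names SIZE_ORDER, RANK, and is_larger are assumptions made for this sketch, and the shirt list is invented.

```python
# Order of the ordinal attribute Size, from lowest to highest.
SIZE_ORDER = ["small", "medium", "large"]
RANK = {value: position for position, value in enumerate(SIZE_ORDER)}

def is_larger(a, b):
    """Compare two ordinal values by rank, not by spelling."""
    return RANK[a] > RANK[b]

# Hypothetical observations.
shirts = ["large", "small", "medium", "small"]
sorted_shirts = sorted(shirts, key=RANK.get)

print(sorted_shirts)                # ['small', 'small', 'medium', 'large']
print(is_larger("large", "small"))  # True
```

The ranks 0, 1, 2 encode order only; the gap between "small" and "medium" is not known to equal the gap between "medium" and "large", so arithmetic on the ranks should be avoided. A nominal attribute such as Hair_color supports only equality tests, so no such ranking exists for it.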
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
o Measured on a scale of equal-sized units
o Values have order and the difference between
each value is the same.
o E.g., temperature in °C or °F, calendar dates.
For example, the difference between 60 and 50
degrees is a measurable 10 degrees, as is the
difference between 80 and 70 degrees
o No true zero-point
Ratio
o Inherent zero-point
Zero means no value, i.e., the absence of the property.
E.g., height, weight, counts, monetary quantities
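The practical difference between interval and ratio scales can be sketched with temperatures. Differences in Celsius are meaningful, but ratios are not, because 0 °C is not a true zero. Kelvin, which does have a true zero, is used here as the ratio-scaled comparison; the two temperatures are arbitrary example values.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scaled value (Celsius) to a ratio-scaled one (Kelvin)."""
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0  # arbitrary example temperatures

# Differences are meaningful on an interval scale.
difference = warm_c - cool_c

# This ratio is just arithmetic on the labels: 20 °C is not "twice as hot".
naive_ratio = warm_c / cool_c

# On the Kelvin scale the ratio has a physical interpretation.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference)              # 10.0
print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 4))  # 1.0353
```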
Discrete vs. Continuous Attributes
Discrete Attribute
o Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of
documents
o Sometimes, represented as integer variables
o Note: Binary attributes are a special case of discrete
attributes
Continuous Attribute
o Has real numbers as attribute values
E.g., temperature, height, or weight
o Practically, real values can only be measured and
represented using a finite number of digits
o Continuous attributes are typically represented as
floating-point variables
Class Activity
eye color
hardness of minerals {good, better, best}
calendar dates
Sex {male, female}
Angles as measured in degrees between 0° and 360°
------
Solution
Eye Color: Nominal
Hardness of minerals {good, better, best}: Ordinal
Calendar dates: Interval, discrete
Sex {male, female}: Nominal, binary (symmetric)
Angles as measured in degrees between 0° and 360°: Ratio
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median:
Middle value if odd number of values, or average of the
middle two values otherwise
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
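Empirical formula (moderately skewed, unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$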
Data: 3, 1, 5
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
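The measures of central tendency above can be checked on the twelve values just listed. This is a minimal sketch using Python's standard statistics module; the helpers trimmed_mean and weighted_mean, the 10% trimming fraction, and the weights in the last example are assumptions chosen for illustration.

```python
import statistics

# The twelve values listed on the slide.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)
median = statistics.median(data)    # even count: average of the two middle values
modes = statistics.multimode(data)  # every value tied for most frequent

def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values off each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

trimmed = trimmed_mean(data, 0.10)  # drops 30 and 110
weighted = weighted_mean([1, 2, 3], [1, 1, 2])  # hypothetical weights

print(mean, median, modes)  # 58 54.0 [52, 70]
print(trimmed, weighted)    # 55.6 2.25
```

The data set is bimodal (52 and 70 each occur twice), and the single large value 110 pulls the mean (58) above the median (54). Trimming one value from each end lowers the mean to 55.6.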
Symmetric vs. Skewed Data
6/11/20 Data Mining: Concepts and Techniques
Find different sets which have same
mean, median and mode (symmetric)
3,4,5,5,8
1,1,1
1, 2, 2, 3, 3, 3, 4, 4, 5
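A quick way to confirm that the mean, median, and mode coincide for each of the three sets is to compute all three with the standard statistics module. The helper name central_values is an assumption for this sketch.

```python
import statistics

# The three candidate sets from the slide.
candidate_sets = [
    [3, 4, 5, 5, 8],
    [1, 1, 1],
    [1, 2, 2, 3, 3, 3, 4, 4, 5],
]

def central_values(values):
    """Return (mean, median, mode) for one data set."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )

results = [central_values(s) for s in candidate_sets]
all_agree = all(mean == median == mode for mean, median, mode in results)

print(results)    # [(5, 5, 5), (1, 1, 1), (3, 3, 3)]
print(all_agree)  # True
```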
Measuring the Dispersion of Data
Quartiles, outliers and box-plots
Quartiles: Q1 (25th percentile), Q3 (75th
percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1,
median, Q3, max
Box-plot: ends of the box are the
quartiles; median is marked; add
whiskers, and plot outliers individually
Outlier: usually, a value more than 1.5 x IQR
above Q3 or below Q1
Midrange = (min + max) / 2
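The five-number summary and the usual outlier rule (flag values more than 1.5 x IQR below Q1 or above Q3) can be sketched as follows, reusing the twelve values 30, ..., 110 listed earlier in this lecture. Quartile conventions differ between textbooks and tools; this sketch assumes the "inclusive" interpolation method of Python's statistics.quantiles, so other software may report slightly different Q1 and Q3. The function names are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = five_number_summary(data)
outliers = iqr_outliers(data)
midrange = (min(data) + max(data)) / 2

print(summary)   # (30, 49.25, 54.0, 64.75, 110)
print(outliers)  # [110]
print(midrange)  # 70.0
```

Under this convention IQR = 64.75 - 49.25 = 15.5, so the outlier fences sit at 26 and 88, and only 110 falls outside them.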
Boxplot Analysis
Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extended
to Minimum and Maximum
Outliers: points beyond a specified outlier
threshold, plotted individually
Class activity
54, 60, 65, 66, 67, 69, 70, 72, 73, 75, 76, 89, 90, 92, 95, 100, 115,
117, 119, 120, 122, 123, 125
Box plots for categorical data
Visualization of Data Dispersion: 3-D Boxplots
Graphic Displays of Basic Statistical Descriptions
Histogram Analysis
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Plots quantile information
For data $x_i$ sorted in increasing order, $f_i$ indicates
that approximately $100 f_i\%$ of the data are below or
equal to the value $x_i$
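The points of a quantile plot can be computed directly from the sorted data. The slide does not say how f_i is obtained; the sketch below assumes the common convention f_i = (i - 0.5) / n for the i-th smallest of n values. The function name and the four sample values are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical observations.
points = quantile_plot_points([70, 40, 60, 50])

for f, x in points:
    print(f"about {100 * f:.1f}% of the data are <= {x}")
```

Plotting x_i on the vertical axis against f_i on the horizontal axis gives the quantile plot. Every observation appears as one point, so both the overall shape and unusual values stay visible.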
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one uni-variate distribution against the
corresponding quantiles of another
View: Is there a shift in going from one distribution to another?
Example shows unit price of items sold at Branch 1 vs. Branch 2
for each quantile. Unit prices of items sold at Branch 1 tend to be
lower than those at Branch 2.
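A Q-Q plot pairs matching quantiles of two samples. The sketch below does this for the three quartiles only. The unit prices for the two branches are invented for illustration (the slide's actual data are not given), and the quartiles again assume the "inclusive" method of statistics.quantiles.

```python
import statistics

def qq_points(sample_a, sample_b, n=4):
    """Pair the cut points of two samples that split each into n equal parts."""
    qa = statistics.quantiles(sample_a, n=n, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical unit prices at two branches.
branch_1 = [40, 45, 52, 60, 68, 75, 80, 90, 100]
branch_2 = [48, 55, 60, 70, 78, 84, 92, 104, 118]

pairs = qq_points(branch_1, branch_2)
branch_1_lower = all(a < b for a, b in pairs)

print(pairs)           # [(52.0, 60.0), (68.0, 78.0), (80.0, 92.0)]
print(branch_1_lower)  # True
```

With Branch 1 on the horizontal axis and Branch 2 on the vertical axis, every point lies above the line y = x, which is the visual signature of a shift toward higher prices at Branch 2.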
Scatter plot
Provides a first look at bivariate data to see clusters of
points, outliers, etc.
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
The left half is positively correlated
The right half is negatively correlated
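The direction of a correlation seen in a scatter plot can also be summarized numerically. The slides do not name a measure; the sketch below assumes Pearson's correlation coefficient, whose sign is positive when y tends to rise with x and negative when y tends to fall. The sample points are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical bivariate data.
xs = [1, 2, 3, 4, 5]
rising = [2, 1, 4, 3, 5]   # y tends to increase with x
falling = [5, 3, 4, 1, 2]  # y tends to decrease with x

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)

print(round(r_up, 2), round(r_down, 2))  # 0.8 -0.8
```

A coefficient near zero would correspond to a scatter plot with no visible linear trend.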
Uncorrelated Data
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto graphical
primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships
among data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of computer representations derived
Categorization of visualization methods:
Pixel-oriented visualization techniques
Geometric projection visualization techniques
Icon-based visualization techniques
Hierarchical visualization techniques
Visualizing complex data and relations
Pixel-Oriented Visualization Techniques
For a data set of m dimensions, create m windows on the screen, one
for each dimension
The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
The colors of the pixels reflect the corresponding values
(a) Income (b) Credit Limit (c) transaction volume (d) age
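The mapping from records to pixel windows can be sketched without any graphics library: one list of gray levels per dimension, with the same index in every list belonging to the same record. The three customer records, the gray-level scaling to 0..255, and the choice to sort records by income are assumptions made for this example; the function names are hypothetical.

```python
def to_gray_levels(column):
    """Scale one dimension's values linearly to gray levels 0..255."""
    low, high = min(column), max(column)
    span = (high - low) or 1  # avoid division by zero for constant columns
    return [round(255 * (v - low) / span) for v in column]

def pixel_windows(records, names):
    """Build one 'window' (list of pixel values) per dimension."""
    return {
        name: to_gray_levels([record[i] for record in records])
        for i, name in enumerate(names)
    }

names = ["income", "credit_limit", "transaction_volume", "age"]

# Hypothetical records: (income, credit limit, transaction volume, age).
records = [
    (60_000, 5_000, 40, 40),
    (30_000, 2_000, 15, 25),
    (90_000, 8_000, 20, 55),
]

# Assumed ordering: sort by income so pixel position follows income.
records.sort(key=lambda r: r[0])
windows = pixel_windows(records, names)

print(windows["transaction_volume"])  # [0, 255, 51]
```

Pixel position k in every window refers to the k-th record, so comparing the same position across windows shows how the dimensions relate. Here the brightest transaction-volume pixel sits at the middle income, not the highest.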
Laying Out Pixels in Circle Segments
http://www.jmp.com/support/help/The_Scatterplot_3D_Report.shtml
http://support.sas.com/documentation/
Quiz 2
Ch 02-Getting to know your data