Lect2 - Data Preprocessing



What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• An attribute is also known as a field or feature
  – Examples: eye color or age of a person
• A collection of attributes describes an object, entity, or instance

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  – Same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
    But properties of attribute values can be different: ID has no limit, but age has a maximum and minimum value

Types of Attributes
• There are different types of attributes
  – Nominal: categories, states
    Examples: ID numbers, eye color, zip codes
  – Binary: nominal attribute with only 2 states (0 or 1)
    Example: gender
  – Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
    Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
  – Interval: measured on a scale of equal-sized units; values have order
    Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio: we can speak of values as being an order of magnitude larger than the unit of measurement
    Examples: temperature in Kelvin, length, time, counts
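The attribute type determines how a column should be represented before analysis. Below is a minimal pandas sketch, not part of the original slides, in which all column names and values are invented for illustration.

    import pandas as pd

    # Hypothetical data set illustrating the attribute types above
    df = pd.DataFrame({
        "eye_color": ["brown", "blue", "green", "brown"],   # nominal
        "smoker": [0, 1, 0, 0],                              # binary
        "taste_rank": [3, 1, 2, 4],                          # ordinal
        "temperature_c": [21.5, 19.0, 25.3, 22.1],           # interval
        "height_cm": [172.0, 180.5, 165.2, 158.0],           # ratio
    })

    # Nominal values have no order: store as an unordered categorical
    df["eye_color"] = pd.Categorical(df["eye_color"])

    # Ordinal values have a meaningful order, but the magnitude
    # between successive levels is unknown
    df["taste_rank"] = pd.Categorical(df["taste_rank"], ordered=True)

    # Ratios of interval values are not meaningful (20 C is not "twice" 10 C),
    # while ratios of ratio-scaled values are (180 cm is twice 90 cm)
    print(df.dtypes)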


Discrete and Continuous Attributes
• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
• Motivation: to better understand the data (central tendency, variation and spread)
• Data dispersion characteristics
  – median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  – Data dispersion: analyzed with multiple granularities of precision
• Dispersion analysis on computed measures
  – Folding measures into numerical dimensions

Summary Statistics: Measuring the Central Tendency
• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
  – Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
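As one concrete illustration of the single-pass point, a running mean and variance can be maintained with Welford's online algorithm. This sketch is not from the original slides; the example values are invented.

    def running_mean_variance(values):
        """Welford's online algorithm: one pass, no storage of the full data set."""
        n = 0
        mean = 0.0
        m2 = 0.0  # sum of squared deviations from the current mean
        for x in values:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0  # sample variance
        return mean, variance

    print(running_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # (5.0, ~4.57)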


Frequency and Mode
• The frequency of an attribute value is the percentage of time the value occurs in the data set
  – For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data

Percentiles
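The original Percentiles slide was a figure; the short sketch below, with made-up data, shows how frequency, mode, and percentiles are typically computed in practice.

    import numpy as np
    import pandas as pd

    gender = pd.Series(["female", "male", "female", "female", "male"])
    print(gender.value_counts(normalize=True))  # relative frequency of each value
    print(gender.mode()[0])                     # most frequent value: 'female'

    heights = np.array([158.0, 165.2, 172.0, 180.5, 176.3])
    # The p-th percentile is the value below which roughly p% of the data fall
    print(np.percentile(heights, [25, 50, 75]))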

Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.

Arithmetic Mean
• The arithmetic mean of a set of N values x1, x2, ..., xN is their sum divided by N:
  mean = (x1 + x2 + ... + xN) / N
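To illustrate the outlier sensitivity mentioned above, here is a small numpy/scipy sketch; the numbers are invented.

    import numpy as np
    from scipy import stats

    values = np.array([12, 14, 15, 15, 16, 17, 18, 200])  # 200 is an outlier

    print(np.mean(values))                  # ~38.4, pulled up by the outlier
    print(np.median(values))                # 15.5, unaffected
    print(stats.trim_mean(values, 0.125))   # mean after trimming 12.5% from each end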


Median
• If N is odd, then the median is the middle value of the ordered set.
• If N is even, then the median is not unique; it is the two middlemost values and any value in between.
• If X is a numeric attribute, in this case the median is by convention taken as the average of the two middlemost values.
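A short sketch of the even-N convention described above, with made-up numbers:

    import numpy as np

    odd = np.array([3, 7, 9, 11, 15])        # N odd: the middle value
    even = np.array([3, 7, 9, 11, 15, 20])   # N even: average of the two middle values

    print(np.median(odd))    # 9.0
    print(np.median(even))   # (9 + 11) / 2 = 10.0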

Measures of Spread: Range and Variance


Variance and Standard Deviation

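The formulas on these two slides did not survive extraction. As a standard reference (not reproduced from the slides): the range is the maximum minus the minimum, the sample variance is s^2 = (1/(N-1)) * sum((x_i - mean)^2), and the standard deviation is its square root. A small numpy check with invented values:

    import numpy as np

    x = np.array([12.0, 14.0, 15.0, 15.0, 16.0, 17.0, 18.0])

    range_ = x.max() - x.min()      # range: max minus min
    var_sample = x.var(ddof=1)      # sample variance, divides by N - 1
    std_sample = x.std(ddof=1)      # standard deviation: square root of the variance

    print(range_, var_sample, std_sample)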


Types of data sets
• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Examples of data quality problems
• Noise: refers to modification of original values
• Outliers: data objects that are considerably different from most of the other data objects in the data set
• Missing values
  – Reasons for missing values
    Information is not collected (e.g., people decline to give their age and weight)
    Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
  – Handling missing values (see the sketch after this list)
    Eliminate data objects
    Estimate missing values
    Ignore the missing value during analysis
    Replace with all possible values (weighted by their probabilities)
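A minimal pandas sketch of the first three strategies; the column names and values are hypothetical.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":    [23, np.nan, 35, 41, np.nan],
        "income": [30_000, 42_000, np.nan, 55_000, 38_000],
    })

    dropped = df.dropna()                            # eliminate data objects with missing values
    imputed = df.fillna(df.mean(numeric_only=True))  # estimate missing values with the column mean
    mask = df["age"].notna()                         # ignore missing ages during a specific analysis

    print(dropped, imputed, df.loc[mask], sep="\n\n")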

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=" "
  – noisy: containing errors or outliers
    e.g., Salary="-10"
  – inconsistent: containing discrepancies in codes or names
    e.g., Age="42", Birthday="03/07/1997"
    e.g., was rating "1,2,3", now rating "A, B, C"
    e.g., discrepancy between duplicate records
  – redundant: including everything, some of which is irrelevant to our task

Why Preprocessing? Data Can Be Incomplete!
• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of the transaction, so they were not recorded
• Data were not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted
• Missing/unknown values for some data


Fingerprint Recognition Case
• Fingerprint identification at the gym
• HOW?

Feature Extraction in Fingerprint Recognition
• "It is not the points, but what is in between the points that matters..." (Edward German)
• Identifying/extracting a good feature set is the most challenging part of data mining.
• Feature vector: 10.2, 0.23, 0.34, 0.34, 20, …

Forms of Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – DM needs consistent integration of quality data


Forms of Data Preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data exploratory analysis

What is Data Exploration?
• A preliminary exploration of the data to better understand its characteristics
• Key motivations include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans' abilities to recognize patterns
    People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook

Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  – Data reduction: reduce the number of attributes or objects
  – Change of scale: cities aggregated into regions, states, countries, etc.
  – More "stable" data: aggregated data tends to have less variability

Exploratory Data Analysis Techniques
• Summary Statistics
• Visualization
• Feature Selection (big topic)
• Dimension Reduction (big topic)
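A small pandas sketch of aggregation as a change of scale, rolling city-level sales up to regions; all names and numbers are made up.

    import pandas as pd

    sales = pd.DataFrame({
        "city":   ["Adama", "Bishoftu", "Hawassa", "Mekelle", "Gondar", "Bahir Dar"],
        "region": ["Oromia", "Oromia", "Sidama", "Tigray", "Amhara", "Amhara"],
        "amount": [120, 95, 80, 60, 70, 110],
    })

    # Aggregating cities into regions reduces the number of objects
    # and typically yields less variable (more "stable") totals
    regional = sales.groupby("region", as_index=False)["amount"].sum()
    print(regional)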


Sampling
• Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
• The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative
  – A sample is representative if it has approximately the same property (of interest) as the original set of data

Types of Sampling
• Simple Random Sampling
  – There is an equal probability of selecting any particular item
• Sampling without replacement
  – As each item is selected, it is removed from the population
• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample
  – The same object can be picked more than once
• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition
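A pandas sketch of the sampling variants above; the data are invented, and stratified sampling is done here with groupby followed by per-group sampling.

    import pandas as pd

    df = pd.DataFrame({
        "id": range(10),
        "stratum": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"],
    })

    simple = df.sample(n=4, replace=False, random_state=0)  # simple random sampling without replacement
    boot = df.sample(n=4, replace=True, random_state=0)     # with replacement: duplicates are possible

    # Stratified sampling: draw the same fraction from each partition
    stratified = df.groupby("stratum", group_keys=False).sample(frac=0.5, random_state=0)

    print(simple, boot, stratified, sep="\n\n")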

Dimensionality Reduction: Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Purpose of dimensionality reduction:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise
• Techniques of dimensionality reduction:
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques
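A minimal scikit-learn sketch of PCA as a dimensionality reduction step; the data are random and purely illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))        # 100 objects, 10 attributes

    pca = PCA(n_components=2)             # project onto the 2 directions of largest variance
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (100, 2)
    print(pca.explained_variance_ratio_)  # fraction of variance captured by each component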


Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
  – Duplicate much or all of the information contained in one or more other attributes
  – Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
  – Contain no information that is useful for the data mining task at hand
  – Example: students' ID is often irrelevant to the task of predicting students' GPA

Feature Selection and Correlation Matrix
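The correlation-matrix slide was a figure; the sketch below, with invented data, shows the underlying idea of flagging highly correlated (and therefore likely redundant) attribute pairs with pandas.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "price":     [100, 250, 80, 400, 150],
        "sales_tax": [15.0, 37.5, 12.0, 60.0, 22.5],   # redundant: a fixed fraction of price
        "weight_kg": [1.2, 0.4, 3.1, 0.8, 2.0],
    })

    corr = df.corr()   # pairwise Pearson correlations between numeric attributes
    print(corr)

    # Pairs with |correlation| near 1 are candidates for removal as redundant features
    upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    print(pairs[pairs > 0.95])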

Feature Subset Selection
• Techniques:
  – Brute-force approach: try all possible feature subsets as input to the data mining algorithm
  – Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
  – Filter approaches: features are selected before the data mining algorithm is run
  – Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes

Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• Three general methodologies:
  1. Feature extraction (domain-specific)
  2. Mapping data to a new space
  3. Feature construction (combining features)
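A scikit-learn sketch of one filter approach: score each feature against the target before any mining algorithm is run, and keep the top-scoring ones. The data are a synthetic toy set, not from the slides.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Toy data: 8 features, only 3 of which are informative
    X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

    selector = SelectKBest(score_func=f_classif, k=3)   # filter: rank features by ANOVA F-score
    X_selected = selector.fit_transform(X, y)

    print(selector.scores_.round(1))           # score per original feature
    print(selector.get_support(indices=True))  # indices of the 3 selected features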


DM Assignment-I
• Compare and contrast DM and RDBMS
  – Describe the basic differences and similarities
  – Describe the pros and cons (merits & demerits)
• A summarized report of about two pages (Font: Times New Roman 12, 1.5 spacing) should be submitted on May 28, 2020. Use aastukk@gmail.com to submit your assignments before the due date.
