Lect2 - Data Preprocessing



What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• An attribute is also known as a field or feature
  – Examples: eye color or age of a person
• A collection of attributes describes an object, entity, or instance

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  – Same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
    But properties of attribute values can be different: ID has no limit, but age has a maximum and minimum value

Types of Attributes
• There are different types of attributes
  – Nominal: categories, states
    Examples: ID numbers, eye color, zip codes
  – Binary: nominal attribute with only 2 states (0 or 1)
    Example: gender
  – Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
    Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
  – Interval: measured on a scale of equal-sized units; values have order
    Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio: we can speak of values as being an order of magnitude larger than the unit of measurement
    Examples: temperature in Kelvin, length, time, counts
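The attribute type determines how a column should be represented before analysis. Below is a minimal pandas sketch, not part of the original slides, in which all column names and values are invented for illustration.

    import pandas as pd

    # Hypothetical data set illustrating the attribute types above
    df = pd.DataFrame({
        "eye_color": ["brown", "blue", "green", "brown"],   # nominal
        "smoker": [0, 1, 0, 0],                              # binary
        "taste_rank": [3, 1, 2, 4],                          # ordinal
        "temperature_c": [21.5, 19.0, 25.3, 22.1],           # interval
        "height_cm": [172.0, 180.5, 165.2, 158.0],           # ratio
    })

    # Nominal values have no order: store as an unordered categorical
    df["eye_color"] = pd.Categorical(df["eye_color"])

    # Ordinal values have a meaningful order, but the magnitude
    # between successive levels is unknown
    df["taste_rank"] = pd.Categorical(df["taste_rank"], ordered=True)

    # Ratios of interval values are not meaningful (20 C is not "twice" 10 C),
    # while ratios of ratio-scaled values are (180 cm is twice 90 cm)
    print(df.dtypes)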


Discrete and Continuous Attributes
• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
• Motivation: to better understand the data (central tendency, variation and spread)
• Data dispersion characteristics
  – median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  – Data dispersion: analyzed with multiple granularities of precision
• Dispersion analysis on computed measures
  – Folding measures into numerical dimensions

Summary Statistics: Measuring the Central Tendency
• Summary statistics are numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
  – Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
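As one concrete illustration of the single-pass point, a running mean and variance can be maintained with Welford's online algorithm. This sketch is not from the original slides; the example values are invented.

    def running_mean_variance(values):
        """Welford's online algorithm: one pass, no storage of the full data set."""
        n = 0
        mean = 0.0
        m2 = 0.0  # sum of squared deviations from the current mean
        for x in values:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0  # sample variance
        return mean, variance

    print(running_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # (5.0, ~4.57)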


Frequency and Mode
• The frequency of an attribute value is the percentage of time the value occurs in the data set
  – For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data

Percentiles
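The original Percentiles slide was a figure; the short sketch below, with made-up data, shows how frequency, mode, and percentiles are typically computed in practice.

    import numpy as np
    import pandas as pd

    gender = pd.Series(["female", "male", "female", "female", "male"])
    print(gender.value_counts(normalize=True))  # relative frequency of each value
    print(gender.mode()[0])                     # most frequent value: 'female'

    heights = np.array([158.0, 165.2, 172.0, 180.5, 176.3])
    # The p-th percentile is the value below which roughly p% of the data fall
    print(np.percentile(heights, [25, 50, 75]))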

Measures of Location: Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.

Arithmetic Mean
• The arithmetic mean of a set of N values x1, x2, ..., xN is their sum divided by N:
  mean = (x1 + x2 + ... + xN) / N
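To illustrate the outlier sensitivity mentioned above, here is a small numpy/scipy sketch; the numbers are invented.

    import numpy as np
    from scipy import stats

    values = np.array([12, 14, 15, 15, 16, 17, 18, 200])  # 200 is an outlier

    print(np.mean(values))                  # ~38.4, pulled up by the outlier
    print(np.median(values))                # 15.5, unaffected
    print(stats.trim_mean(values, 0.125))   # mean after trimming 12.5% from each end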


Median
• If N is odd, then the median is the middle value of the ordered set.
• If N is even, then the median is not unique; it is the two middlemost values and any value in between.
• If X is a numeric attribute, in this case the median is by convention taken as the average of the two middlemost values.
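A short sketch of the even-N convention described above, with made-up numbers:

    import numpy as np

    odd = np.array([3, 7, 9, 11, 15])        # N odd: the middle value
    even = np.array([3, 7, 9, 11, 15, 20])   # N even: average of the two middle values

    print(np.median(odd))    # 9.0
    print(np.median(even))   # (9 + 11) / 2 = 10.0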

Measures of Spread: Range and Variance


Variance and Standard Deviation

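The formulas on these two slides did not survive extraction. As a standard reference (not reproduced from the slides): the range is the maximum minus the minimum, the sample variance is s^2 = (1/(N-1)) * sum((x_i - mean)^2), and the standard deviation is its square root. A small numpy check with invented values:

    import numpy as np

    x = np.array([12.0, 14.0, 15.0, 15.0, 16.0, 17.0, 18.0])

    range_ = x.max() - x.min()      # range: max minus min
    var_sample = x.var(ddof=1)      # sample variance, divides by N - 1
    std_sample = x.std(ddof=1)      # standard deviation: square root of the variance

    print(range_, var_sample, std_sample)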


Types of data sets
• Record
  – Data Matrix
  – Document Data
  – Transaction Data
• Graph
  – World Wide Web
  – Molecular Structures
• Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

Examples of data quality problems
• Noise: refers to modification of original values
• Outliers: data objects that are considerably different from most of the other data objects in the data set
• Missing values
  – Reasons for missing values
    Information is not collected (e.g., people decline to give their age and weight)
    Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
  – Handling missing values (see the sketch after this list)
    Eliminate data objects
    Estimate missing values
    Ignore the missing value during analysis
    Replace with all possible values (weighted by their probabilities)
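A minimal pandas sketch of the first three strategies; the column names and values are hypothetical.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":    [23, np.nan, 35, 41, np.nan],
        "income": [30_000, 42_000, np.nan, 55_000, 38_000],
    })

    dropped = df.dropna()                            # eliminate data objects with missing values
    imputed = df.fillna(df.mean(numeric_only=True))  # estimate missing values with the column mean
    mask = df["age"].notna()                         # ignore missing ages during a specific analysis

    print(dropped, imputed, df.loc[mask], sep="\n\n")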

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=" "
  – noisy: containing errors or outliers
    e.g., Salary="-10"
  – inconsistent: containing discrepancies in codes or names
    e.g., Age="42", Birthday="03/07/1997"
    e.g., was rating "1,2,3", now rating "A, B, C"
    e.g., discrepancy between duplicate records
  – redundant: including everything, some of which is irrelevant to our task

Why Preprocessing? Data Can Be Incomplete!
• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of the transaction, so they were not recorded
• Data were not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted
• Missing/unknown values for some data


Fingerprint Recognition Case
• Fingerprint identification at the gym
• HOW?

Feature Extraction in Fingerprint Recognition
• "It is not the points, but what is in between the points that matters..." (Edward German)
• Identifying/extracting a good feature set is the most challenging part of data mining.
• Feature vector: 10.2, 0.23, 0.34, 0.34, 20, …

Forms of Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – DM needs consistent integration of quality data


Forms of Data Preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data exploratory analysis

What is Data Exploration?
• A preliminary exploration of the data to better understand its characteristics
• Key motivations include
  – Helping to select the right tool for preprocessing or analysis
  – Making use of humans' abilities to recognize patterns
    People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
  – Created by statistician John Tukey
  – Seminal book is Exploratory Data Analysis by Tukey
  – A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook

Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  – Data reduction: reduce the number of attributes or objects
  – Change of scale: cities aggregated into regions, states, countries, etc.
  – More "stable" data: aggregated data tends to have less variability

Exploratory Data Analysis Techniques
• Summary Statistics
• Visualization
• Feature Selection (big topic)
• Dimension Reduction (big topic)
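A small pandas sketch of aggregation as a change of scale, rolling city-level sales up to regions; all names and numbers are made up.

    import pandas as pd

    sales = pd.DataFrame({
        "city":   ["Adama", "Bishoftu", "Hawassa", "Mekelle", "Gondar", "Bahir Dar"],
        "region": ["Oromia", "Oromia", "Sidama", "Tigray", "Amhara", "Amhara"],
        "amount": [120, 95, 80, 60, 70, 110],
    })

    # Aggregating cities into regions reduces the number of objects
    # and typically yields less variable (more "stable") totals
    regional = sales.groupby("region", as_index=False)["amount"].sum()
    print(regional)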


Sampling
• Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
• The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative
  – A sample is representative if it has approximately the same property (of interest) as the original set of data

Types of Sampling
• Simple Random Sampling
  – There is an equal probability of selecting any particular item
• Sampling without replacement
  – As each item is selected, it is removed from the population
• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample
  – The same object can be picked more than once
• Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition
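A pandas sketch of the sampling variants above; the data are invented, and stratified sampling is done here with groupby followed by per-group sampling.

    import pandas as pd

    df = pd.DataFrame({
        "id": range(10),
        "stratum": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"],
    })

    simple = df.sample(n=4, replace=False, random_state=0)  # simple random sampling without replacement
    boot = df.sample(n=4, replace=True, random_state=0)     # with replacement: duplicates are possible

    # Stratified sampling: draw the same fraction from each partition
    stratified = df.groupby("stratum", group_keys=False).sample(frac=0.5, random_state=0)

    print(simple, boot, stratified, sep="\n\n")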

Dimensionality Reduction: Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Purpose of dimensionality reduction:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise
• Techniques of dimensionality reduction:
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques
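A minimal scikit-learn sketch of PCA as a dimensionality reduction step; the data are random and purely illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))        # 100 objects, 10 attributes

    pca = PCA(n_components=2)             # project onto the 2 directions of largest variance
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (100, 2)
    print(pca.explained_variance_ratio_)  # fraction of variance captured by each component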


Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
  – Duplicate much or all of the information contained in one or more other attributes
  – Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
  – Contain no information that is useful for the data mining task at hand
  – Example: students' ID is often irrelevant to the task of predicting students' GPA

Feature Selection and Correlation Matrix
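The correlation-matrix slide was a figure; the sketch below, with invented data, shows the underlying idea of flagging highly correlated (and therefore likely redundant) attribute pairs with pandas.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "price":     [100, 250, 80, 400, 150],
        "sales_tax": [15.0, 37.5, 12.0, 60.0, 22.5],   # redundant: a fixed fraction of price
        "weight_kg": [1.2, 0.4, 3.1, 0.8, 2.0],
    })

    corr = df.corr()   # pairwise Pearson correlations between numeric attributes
    print(corr)

    # Pairs with |correlation| near 1 are candidates for removal as redundant features
    upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    print(pairs[pairs > 0.95])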

Feature Subset Selection
• Techniques:
  – Brute-force approach: try all possible feature subsets as input to the data mining algorithm
  – Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
  – Filter approaches: features are selected before the data mining algorithm is run
  – Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes

Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• Three general methodologies:
  1. Feature extraction (domain-specific)
  2. Mapping data to a new space
  3. Feature construction (combining features)
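A scikit-learn sketch of one filter approach: score each feature against the target before any mining algorithm is run, and keep the top-scoring ones. The data are a synthetic toy set, not from the slides.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Toy data: 8 features, only 3 of which are informative
    X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

    selector = SelectKBest(score_func=f_classif, k=3)   # filter: rank features by ANOVA F-score
    X_selected = selector.fit_transform(X, y)

    print(selector.scores_.round(1))           # score per original feature
    print(selector.get_support(indices=True))  # indices of the 3 selected features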


DM Assignment-I
• Compare and contrast DM and RDBMS
  – Describe the basic differences and similarities
  – Describe the pros and cons (merits & demerits)
• A summarized report of about two pages (Font: Times New Roman 12, 1.5 spacing) should be submitted on May 28, 2020. Use aastukk@gmail.com to submit your assignments before the due date.
