Data Preprocessing

- Data preprocessing involves cleaning, transforming, and reducing raw data to prepare it for machine learning algorithms. This includes handling missing data, encoding categorical variables, normalizing numeric features, and splitting the data into training and test sets.
- The goals of data preprocessing are to clean dirty or inconsistent data, reduce data volume and dimensionality, handle missing values, and scale features for model training. This prepares the data for analysis and improves machine learning results.
- Techniques such as imputation, binning, clustering, and regression can be used to clean noisy data. Data integration combines sources and resolves conflicts, while transformation techniques include aggregation, generalization, and normalization. Dimensionality reduction further yields a more compact representation of the data.

Uploaded by

Dinh Duy Hiep
Copyright
© Public Domain

DATA PREPROCESSING

Content
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Data Preprocessing for Machine learning
Why data preprocessing?
Data in the real world is dirty
o incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
o noisy: containing errors or outliers
o inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
o Quality decisions must be based on quality data
o Data warehouse needs consistent integration of quality data
A multi-dimensional measure of data quality
o A well-accepted multi-dimensional view:
accuracy, completeness, consistency, timeliness, believability, value added,
interpretability, accessibility
Broad categories
intrinsic, contextual, representational, and accessibility
Major Tasks of Data Preprocessing
• Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
o Integration of multiple databases, data cubes, or files
• Data transformation
o Normalization (scaling to a specific range)
o Aggregation
• Data reduction
o Obtains a reduced representation that is much smaller in
volume yet produces the same or similar analytical results
o Data discretization: of particular importance, especially for numerical data
o Data aggregation, dimensionality reduction, data compression, generalization
Data cleaning
Importance
• “Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data
warehousing”—DCI survey
Tasks of Data Cleaning
• Fill in missing values
• Identify outliers and smooth noisy data
• Correct inconsistent data
Data cleaning
Manage Missing Data
• Ignore the tuple: usually done when class label is missing
(assuming the task is classification—not effective in certain
cases)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”,
a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill in
the missing value: smarter
• Use the most probable value to fill in the missing value: inference
based such as regression, Bayesian formula, decision tree
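The mean-based strategies above can be sketched with pandas; the table and column names here are hypothetical toy data:

```python
import pandas as pd

# Toy table with missing "income" values (hypothetical data).
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, 70.0, None],
})

# Fill with the overall attribute mean.
overall = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples in the same class.
by_class = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```

Here the class-conditional fill gives 30 for the missing class-A value and 60 for the missing class-B value, instead of the overall mean of 50.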
Data cleaning
Manage Noisy Data
Binning Method:
• first sort data and partition into (equal-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc…
Clustering:
• detect and remove outliers
Semi-automated (combined computer and human inspection)
• detect suspicious values automatically, then verify them manually
Regression
• smooth by fitting the data into regression functions
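A minimal sketch of equal-depth binning with NumPy, using a small sorted toy array (the values are illustrative):

```python
import numpy as np

# Sorted toy data, partitioned into three equal-depth bins.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = prices.reshape(3, 3)

# Smoothing by bin means: every value becomes its bin's mean.
smoothed_by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value snaps to the
# nearer of its bin's minimum or maximum.
low, high = bins[:, :1], bins[:, -1:]
smoothed_by_bounds = np.where(bins - low <= high - bins, low, high)
```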
Data cleaning
Inconsistent Data
Manual correction using external references
Semi-automatic using various tools
• To detect violation of known functional dependencies and
data constraints
• To correct redundant data
Data integration and transformation
Tasks of Data Integration and transformation
• Data integration:
• combines data from multiple sources into a coherent store
• Schema integration
• integrate metadata from different sources
• Entity identification problem:
• identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
• for the same real world entity, attribute values from different
sources are different
• possible reasons: different representations, different scales,
e.g., metric vs. British units, different currency
Data integration and transformation
Manage Data Integration
• Redundant data often occur when integrating multiple
databases
• Object identification: The same attribute or object may have
different names in different databases
• Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes can often be detected by
correlation analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
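The correlation check for redundant attributes can be sketched as follows; the column names and values are hypothetical, with annual revenue exactly derivable from monthly revenue:

```python
import pandas as pd

# Hypothetical integrated table: annual_revenue_k is derivable
# from monthly_revenue_k, so one of the two is redundant.
df = pd.DataFrame({
    "monthly_revenue_k": [10, 20, 30, 40],
    "annual_revenue_k":  [120, 240, 360, 480],
    "employees":         [5, 9, 2, 7],
})

corr = df.corr()
# A correlation near +/-1 flags a candidate redundant attribute.
redundant = corr.loc["monthly_revenue_k", "annual_revenue_k"] > 0.99
```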
Data integration and transformation
Manage Data Transformation
• Smoothing: remove noise from data (binning, clustering,
regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified
range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
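The three normalization schemes listed above can be sketched on a toy column of values:

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # toy values

# Min-max normalization: rescale to [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation.
zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by the smallest power of 10 that
# makes every absolute value less than 1.
j = int(np.ceil(np.log10(np.abs(x).max() + 1)))
decimal = x / 10 ** j
```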
Data reduction
Why data reduction?
• A database/data warehouse may store terabytes of
data
• Complex data analysis/mining may take a very long
time to run on the complete data set
Data reduction strategies
• Data cube aggregation
• Dimensionality reduction e.g., remove unimportant
attributes
• Attribute subset selection
• Numerosity reduction e.g., fit data into models
• Discretization and concept hierarchy generation
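As an illustration of dimensionality reduction, here is a minimal PCA sketch via SVD on synthetic data in which one feature is nearly a copy of another:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(100, 1))
b = rng.normal(size=(100, 1))
# The third feature is almost a duplicate of the first.
X = np.hstack([a, b, a + 0.01 * rng.normal(size=(100, 1))])

# PCA via SVD of the centered data; keep the top 2 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T  # reduced representation, shape (100, 2)

# Fraction of total variance retained by the 2 components.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
```

Because the third column carries almost no new information, two components retain nearly all of the variance, so the reduced data produce the same or similar analytical results.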
Data Preprocessing for ML
• Get Dataset
• Importing the Libraries
• Importing the Dataset
• Missing data
• Categorical data
• Splitting the Dataset into the Training set and Test set
• Feature scaling
• Data Preprocessing template
Data Preprocessing
 Get Dataset
 Can be downloaded from the Internet
 Create your own dataset
Data Preprocessing
 Importing the Libraries
 A library is a tool that you can use to do a specific job

import numpy as np               # numerical arrays
import matplotlib.pyplot as plt  # plotting
import pandas as pd              # data loading and manipulation

Data Preprocessing
 Importing Dataset

dataset = pd.read_csv('Data.csv')  # load the dataset
X = dataset.iloc[:, :-1].values    # features: all columns except the last
y = dataset.iloc[:, -1].values     # target: the last column

Data Preprocessing
 Missing Data
 Missing values occur quite often in real-life data, so we
need techniques to handle this problem.
 Ideas:
- Remove rows with missing values: risky, since useful data
may be lost (generally avoided).
- Replace the missing data with the median or mean of the
feature column.
- Replace the missing data with the most frequent value of the
feature column.
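With scikit-learn (assuming it is available), the mean-replacement idea can be sketched with SimpleImputer; the feature values below are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (age, salary) with missing entries.
X = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [np.nan, 54000.0],
              [38.0, np.nan]])

# Replace each missing entry with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

Passing strategy="median" or strategy="most_frequent" instead selects the other two replacement ideas listed above.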
Data Preprocessing
 Categorical Data
 Machine learning models are based on mathematical
equations, so we must encode categorical data.

One-hot encoding of a Country column:

Country    France  Spain  Germany
France     1       0      0
Spain      0       1      0
Germany    0       0      1
Spain      0       1      0
Germany    0       0      1
France     1       0      0
Spain      0       1      0
France     1       0      0
Germany    0       0      1
France     1       0      0
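The one-hot encoding in the table above can be reproduced with pandas:

```python
import pandas as pd

countries = pd.Series(["France", "Spain", "Germany", "Spain",
                       "Germany", "France", "Spain", "France",
                       "Germany", "France"])

# One binary indicator column per category.
dummies = pd.get_dummies(countries).astype(int)
```

Note that get_dummies orders the indicator columns alphabetically (France, Germany, Spain).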
Data Preprocessing
 Splitting the Dataset into the Training set and
Test set
 Training set: used to build the machine learning model
 Test set: used to evaluate the performance of the
machine learning model.
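A minimal split with scikit-learn; toy arrays stand in for the X and y built earlier:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.arange(10)                 # toy targets

# Hold out 20% of the samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```

Fixing random_state makes the split reproducible across runs.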
Data Preprocessing
 Feature scaling
 Rescales features to a common range, e.g., min-max
normalization or z-score standardization (see Data
transformation above), so that no feature dominates
merely because of its units.
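A sketch of standardization with scikit-learn's StandardScaler, on a toy matrix whose two columns live on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Each column is rescaled to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

In practice the scaler is fit on the training set only and then applied to the test set with scaler.transform, so no test-set information leaks into training.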
