Data preprocessing using Machine Learning
1. Dr. Gopal Sakarkar,
IEEE-CIS Member, Ph.D. (CSE)
Department of AI and Machine Learning,
G H Raisoni College of Engineering, Nagpur
Data Pre-processing Services
using
Machine Learning Algorithms
7. What is Machine Learning?
• According to Arthur Samuel (1959), machine learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed.
• Machine learning (ML) is a category of algorithms that allows software applications to become more accurate in predicting outcomes without being explicitly programmed.
• The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, updating outputs as new data becomes available.
9. Types of Machine Learning
Machine Learning Algorithms
• Supervised Learning
• Unsupervised Learning
10. Where is Data Cleaning used?
Machine Learning Life Cycle
11. Data Pre-processing
• Data preprocessing is an important step in ML.
• The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.
• It involves transforming raw data into an understandable format.
• Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors.
• Data preprocessing is a proven method of resolving such issues.
13. Why Data Pre-processing?
• A manager at AllElectronics has been charged with analyzing the company's data with respect to the sales at a branch.
• He carefully inspects the company's database and data warehouse, identifying dimensions to be included, such as item, price, units sold, and session.
• He notices that several of the attributes for various tuples have no recorded value. For analysis, he would like to include this information.
• In other words, the data he wishes to analyze by machine learning techniques is incomplete, noisy, and inconsistent.
14. Why Data Pre-processing?

Item            Price      Unit Sold   Session
TV              7200       44          All
Fan             480        27          Summer
Tube light      54         30          All
AC              27000      38          (missing)
Fridge          (missing)  40          Summer
Switches        58         35          (missing)
2 mm Wire       520        (missing)   All
Backup Light    790        48          Winter
Fan Regulator   83         50          All
Bulb            87         37          Rainy Session

(Blank cells in the original slide are marked "(missing)"; this incompleteness is exactly what pre-processing must handle.)
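To make the problem concrete, here is a minimal sketch (assuming pandas and NumPy are available, with NaN standing in for the blank cells above) that loads this table and counts the missing entries per column:

import pandas as pd
import numpy as np

# The AllElectronics branch data from the slide; NaN marks the blank cells.
sales = pd.DataFrame({
    "Item": ["TV", "Fan", "Tube light", "AC", "Fridge", "Switches",
             "2 mm Wire", "Backup Light", "Fan Regulator", "Bulb"],
    "Price": [7200, 480, 54, 27000, np.nan, 58, 520, 790, 83, 87],
    "Unit Sold": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Session": ["All", "Summer", "All", np.nan, "Summer", np.nan,
                "All", "Winter", "All", "Rainy Session"],
})

# Count the missing values per column before deciding how to handle them.
print(sales.isna().sum())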
15. What do you mean by data Pre-processing?
• It is cleaning and exploring data for analysis.
• Prepping data for modeling.
• Modeling in Python requires numerical input.
• Data preprocessing is a technique that involves transforming raw data into an understandable format.
• Data preprocessing is a proven method of resolving such issues.
16. Data Understanding : Relevance of data
• What data is available for the task?
• Is this data relevant?
• Is additional relevant data available?
• How much historical data is available?
17. Data Understanding: Quantity of data
• Number of instances (records, objects)
Rule of thumb: 5,000 or more desired
If less, results are less reliable; use special methods (boosting, …)
• Number of attributes (fields)
Rule of thumb: for each attribute, 10 or more instances
If there are more fields, use feature reduction and selection
• If very unbalanced, use sampling
20. Data Pre-processing Steps
• Data Cleaning
Data cleaning is the process of filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
• Data Integration
Integration of multiple databases, data cubes, or files.
• Data Transformation
Data transformation is the task of data normalization and aggregation.
21. Data Pre-processing Steps
• Data Reduction
The process of obtaining a reduced representation of the data that is smaller in volume but produces the same or similar analytical results.
• Data Discretization
Part of data reduction, but of particular importance, especially for numerical data.
23. Data Cleaning
• Importance
Data cleaning is the number one problem when working with large data.
Data Cleaning Tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
24. Data Cleaning: Missing Data
• Data is not always available
E.g., when a student fills in the admission form, he might not know his local guardian's contact number.
• Missing data may be due to
equipment malfunction
data being inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
no registered history or changes of the data
expansion of the data schema
25. How to Handle Missing Data?
• Ignore the tuple (loss of information)
• Fill in missing values manually: tedious, often infeasible
• Fill them in automatically with
a global constant, e.g., "unknown" (which may form a new class!)
Imputation: use the attribute mean to fill in the missing value,
or use the most probable value to fill in the missing value.
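A minimal sketch of such automatic imputation, assuming the sales DataFrame built after slide 14 and scikit-learn's SimpleImputer:

from sklearn.impute import SimpleImputer

# Numeric attributes: fill missing entries with the attribute mean.
num_cols = ["Price", "Unit Sold"]
sales[num_cols] = SimpleImputer(strategy="mean").fit_transform(sales[num_cols])

# Categorical attribute: fill with the most probable (most frequent) value.
sales[["Session"]] = SimpleImputer(strategy="most_frequent").fit_transform(sales[["Session"]])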
26. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
• Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
27. How to handle noisy data?
• Binning method:
first sort the data and partition it into (equi-depth) bins;
then smooth by bin means, by bin medians, by bin boundaries, etc.
• Combined computer and human inspection:
detect suspicious values automatically and have a human check them
28. Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 ((4+8+9+15)/4 = 9)
- Bin 2: 23, 23, 23, 23 ((21+21+24+25)/4 = 22.75 ≈ 23)
- Bin 3: 29, 29, 29, 29 ((26+28+29+34)/4 = 29.25 ≈ 29)
* Smoothing by bin boundaries (each value is replaced by the closer bin boundary):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
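The same smoothing can be reproduced in a few lines of plain Python (a sketch; the rounding to whole dollars matches the slide):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Partition the sorted data into equi-depth bins of 4 values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smooth by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smooth by bin boundaries: every value snaps to the nearer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]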
29. Data Integration
Data integration:
combines data from multiple sources
• Schema integration
Integrate metadata from different sources.
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources differ, e.g., different scales, metric vs. British units
• Removing duplicates and redundant data
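A minimal sketch of the entity identification step with pandas (the two tables and their column names are hypothetical): rename B's customer key to A's before merging, so both sources refer to the same entity:

import pandas as pd

# Two hypothetical sources describing the same customers.
a = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
b = pd.DataFrame({"cust_no": [1, 2], "city": ["Nagpur", "Pune"]})

# Schema integration: A.cust_id and B.cust_no name the same real-world key.
merged = a.merge(b.rename(columns={"cust_no": "cust_id"}), on="cust_id")
print(merged)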
30. Data Transformation
• Smoothing: remove noise from the data
• Normalization: scale values to fall within a small, specified range
• Attribute/feature construction:
new attributes constructed from the given ones
• Aggregation: summarization;
integrate data from different sources (tables)
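For instance, min-max normalization rescales an attribute x into [0, 1] via (x - min) / (max - min). A short sketch on the Price column, assuming the sales DataFrame from slide 14:

# Min-max normalization: map Price into the range [0, 1].
p = sales["Price"]
sales["Price_scaled"] = (p - p.min()) / (p.max() - p.min())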
31. Data Reduction
• Data is too big to work with:
too many instances,
too many features (attributes).
• Data Reduction:
obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results (easily said but difficult to do).
• Data reduction strategies:
Dimensionality reduction: remove unimportant attributes
Aggregation and clustering: remove redundant or closely associated ones
Sampling
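One widely used realization of dimensionality reduction (a transformation rather than plain attribute removal) is principal component analysis; a minimal sketch with scikit-learn on illustrative random data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)                      # 100 instances, 10 attributes
X_small = PCA(n_components=3).fit_transform(X)   # reduced to 3 components
print(X_small.shape)                             # (100, 3)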
32. Data Reduction
Clustering
• Partition the data set into clusters; then one can store the cluster representations only.
• Can be very effective if the data is clustered, but not if the data is dirty.
• There are many choices of clusterings and clustering algorithms.
33. Data Reduction
Sampling
• Choose a representative subset of the data.
Simple random sampling may perform poorly in the presence of skew.
• Develop adaptive sampling methods, e.g.
Stratified sampling:
approximate the percentage of each class (or subpopulation of interest) in the overall database.
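A minimal sketch of stratified sampling with pandas (the data set df and its class column are hypothetical): sampling 10% from each class preserves the class percentages:

import pandas as pd

# Hypothetical unbalanced data set: 90 'no' instances, 10 'yes' instances.
df = pd.DataFrame({"class": ["no"] * 90 + ["yes"] * 10,
                   "value": range(100)})

# Stratified sample: taking 10% from each class keeps the class proportions.
sample = df.groupby("class", group_keys=False).sample(frac=0.1, random_state=0)
print(sample["class"].value_counts())  # no: 9, yes: 1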
35. Data Discretization
• Discretization is a process that transforms quantitative data into qualitative data.
• It can significantly improve the quality of the discovered knowledge.
• It reduces the running time of various machine learning tasks such as association rule discovery, classification, clustering, and prediction.
• It reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
• Interval labels can then be used to replace actual data values.
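A minimal sketch with pandas, reusing the price list from slide 28 (pd.cut divides the attribute's range into equal-width intervals and replaces each value with an interval label; the label names are illustrative):

import pandas as pd

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Divide the range into 3 equal-width intervals and replace each value
# with a qualitative interval label.
labels = pd.cut(prices, bins=3, labels=["low", "medium", "high"])
print(list(labels))  # ['low', 'low', 'low', 'medium', ..., 'high']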