Dr. Gopal Sakarkar,
IEEE-CIS Member, Ph.D. (CSE)
Department of AI and Machine Learning,
G H Raisoni College of Engineering, Nagpur
Data Pre-processing Services
using
Machine Learning Algorithms
Data Cleaning Services
Good data preparation is key
to producing valid and reliable
models.
Applications of Machine Learning
What is Machine Learning?
• According to Arthur Samuel (1959), machine learning algorithms enable
computers to learn from data, and even improve themselves, without being
explicitly programmed.
• Machine learning (ML) is a category of algorithms that allows software
applications to become more accurate in predicting outcomes without being
explicitly programmed.
• The basic premise of machine learning is to build algorithms that can receive
input data and use statistical analysis to predict an output, updating
outputs as new data becomes available.
Types of Machine Learning
(Figure: machine learning algorithms, divided into supervised learning and unsupervised learning.)
Where is Data Cleaning used?
Machine Learning Life Cycle
Data Pre-processing
• Data preprocessing is an important step in ML.
• The phrase "garbage in, garbage out" is particularly applicable to data
mining and machine learning projects.
• It involves transforming raw data into an understandable format.
• Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors.
• Data preprocessing is a proven method of resolving such issues.
Why Data Pre-processing?
• Consider a manager at All Electronics who has been charged with analyzing the company's data with
respect to the sales at a branch.
• He carefully inspects the company's database and data warehouse, identifying dimensions to be
included, such as item, price, units sold, and session.
• He notices that several of the attributes for various tuples have no recorded value. For the analysis,
he would like to include this information.
• In other words, the data he wishes to analyze with machine learning techniques is incomplete,
noisy, and inconsistent.
Item Price Units Sold Session
TV 7200 44 All
Fan 480 27 Summer
Tube light 54 30 All
AC 27000 38
Fridge 40 Summer
Switches 58 35
2 mm Wire 520 All
Backup Light 790 48 Winter
Fan Regulator 83 50 All
Bulb 87 37 Rainy Session
What do you mean by data pre-processing?
• It is cleaning and exploring data for analysis.
• It is prepping data for modeling.
• Modeling in Python requires numerical input.
• Data preprocessing is a technique that involves transforming raw data into an understandable
format.
• Data preprocessing is a proven method of resolving issues such as incomplete, noisy, and inconsistent data.
Data Understanding: Relevance of data
• What data is available for the task?
• Is this data relevant?
• Is additional relevant data available?
• How much historical data is available?
Data Understanding: Quantity of data
• Number of instances (records, objects)
  Rule of thumb: 5,000 or more desired
  If fewer, results are less reliable; use special methods (boosting, …)
• Number of attributes (fields)
  Rule of thumb: for each attribute, 10 or more instances
  If more fields, use feature reduction and selection
• If the classes are very unbalanced, use sampling
Data Pre-processing Steps
• Data Cleaning
Data cleaning is the process of filling in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies.
• Data Integration
Integration of multiple databases, data cubes, or files.
• Data Transformation
Data transformation is the task of data normalization and aggregation.
• Data Reduction
Obtaining a reduced representation of the data that is smaller in volume but produces the same
or similar analytical results.
• Data Discretization
Part of data reduction but with particular importance, especially for numerical data.
Data Cleaning
• Importance
Data cleaning is the number one problem when working with large data sets.
Data Cleaning Tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Data Cleaning: Missing Data
• Data is not always available
E.g., when filling in an admission form, a student might not know
the local guardian's contact number.
• Missing data may be due to
equipment malfunction
data that was inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
history or changes of the data not being registered
expansion of the data schema
How to Handle Missing Data?
• Ignore the tuple (loss of information)
• Fill in missing values manually: tedious, infeasible?
• Fill it in automatically with
- a global constant, e.g., "unknown" (in effect a new class?)
- imputation: use the attribute mean to fill in the missing value
- imputation: use the most probable value to fill in the missing value
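The strategies above can be sketched in plain Python. This is a minimal illustration, not the deck's own implementation: the records and the helper names (drop_incomplete, fill_constant, fill_mean, fill_most_frequent) are hypothetical, None stands for a missing value, and the most frequent value is used as a simple stand-in for the "most probable" value.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value.
records = [
    {"item": "TV", "price": 7200, "session": "All"},
    {"item": "Fan", "price": 480, "session": "Summer"},
    {"item": "AC", "price": 27000, "session": None},
    {"item": "Fridge", "price": None, "session": "All"},
    {"item": "Bulb", "price": 87, "session": "All"},
]


def drop_incomplete(rows):
    """Ignore the tuple: keep only rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_constant(rows, column, constant="unknown"):
    """Replace missing values in one column with a global constant."""
    return [{**r, column: constant if r[column] is None else r[column]} for r in rows]


def fill_mean(rows, column):
    """Replace missing numeric values with the mean of the observed values."""
    observed_mean = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: observed_mean if r[column] is None else r[column]} for r in rows]


def fill_most_frequent(rows, column):
    """Replace missing values with the most frequent observed value.

    Assumption: the mode is only a simple proxy for the most probable value.
    """
    most_frequent = mode(r[column] for r in rows if r[column] is not None)
    return [{**r, column: most_frequent if r[column] is None else r[column]} for r in rows]


complete_rows = drop_incomplete(records)
with_constant = fill_constant(records, "session")
with_mean_price = fill_mean(records, "price")
with_mode_session = fill_most_frequent(records, "session")
```

Dropping tuples is the simplest option but loses two of the five records here, which is why the fill-in strategies are usually preferred when many rows have gaps.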
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
• Other data problems that require data cleaning
- duplicate records
- incomplete data
- inconsistent data
How to handle noisy data?
• Binning method:
first sort the data and partition it into (equi-depth) bins;
then smooth by bin means, by bin medians, by bin boundaries, etc.
• Combined computer and human inspection:
detect suspicious values automatically and have a human check them
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
-Bin 1: 4, 8, 9, 15
-Bin 2: 21, 21, 24, 25
-Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
-Bin 1: 9, 9, 9, 9 ((4+8+9+15)/4 = 9)
-Bin 2: 23, 23, 23, 23 ((21+21+24+25)/4 = 22.75 ≈ 23)
-Bin 3: 29, 29, 29, 29 ((26+28+29+34)/4 = 29.25 ≈ 29)
* Smoothing by bin boundaries:
-Bin 1: 4, 4, 4, 15
-Bin 2: 21, 21, 25, 25
-Bin 3: 26, 26, 26, 34
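The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, the code assumes the number of values divides evenly into the bins, and a value that is equally close to both boundaries is assigned to the lower one (the example does not cover that case).

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    depth = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * depth:(i + 1) * depth] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the rounded bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With the price list from the slide, by_means and by_boundaries match the bins listed above.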
Data Integration
Data integration combines data from multiple sources.
• Schema integration
Integrate metadata from different sources
Entity identification problem: identify real-world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different
sources may differ, e.g., different scales or metric vs. British units
• Removing duplicates and redundant data
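A small sketch of these three steps, under stated assumptions: the two sources, their field names (cust_id, cust_no, weight_kg, weight_lb), and the rule "keep source A's value on conflict" are all hypothetical choices made for illustration.

```python
KG_PER_LB = 0.45359237  # exact definition of the international pound

# Hypothetical sources: A uses cust_id and kilograms, B uses cust_no and pounds.
source_a = [
    {"cust_id": 1, "weight_kg": 70.0},
    {"cust_id": 2, "weight_kg": 55.0},
]
source_b = [
    {"cust_no": 2, "weight_lb": 121.25},
    {"cust_no": 3, "weight_lb": 200.0},
]


def integrate(a_rows, b_rows):
    """Merge both sources into one list keyed by customer id, in kilograms."""
    merged = {}
    for row in a_rows:
        merged[row["cust_id"]] = {"cust_id": row["cust_id"], "weight_kg": row["weight_kg"]}
    for row in b_rows:
        key = row["cust_no"]  # schema integration: B.cust_no identifies the same entity as A.cust_id
        weight_kg = round(row["weight_lb"] * KG_PER_LB, 2)  # resolve the unit conflict
        # Duplicate removal: if the customer already exists, keep source A's record.
        merged.setdefault(key, {"cust_id": key, "weight_kg": weight_kg})
    return [merged[k] for k in sorted(merged)]


customers = integrate(source_a, source_b)
```

Customer 2 appears in both sources, so the merged list has three customers rather than four.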
Data Transformation
• Smoothing: remove noise from data
• Normalization: attribute values are scaled to fall within a small, specified range
• Attribute/feature construction
- New attributes constructed from the given ones
• Aggregation: summarization
- Integrate data from different sources (tables)
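As an illustration of normalization, attribute construction, and aggregation, here is a minimal sketch that uses the three complete rows of the sales table. Min-max scaling is one common way to map values into a specified range; it is chosen here as an assumption, and the constructed attribute "revenue" and the helper name min_max_normalize are hypothetical.

```python
from collections import defaultdict

# Complete rows from the sales table: (item, price, units_sold, session).
rows = [
    ("TV", 7200, 44, "All"),
    ("Fan", 480, 27, "Summer"),
    ("Tube light", 54, 30, "All"),
]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Normalization: prices mapped into [0, 1].
normalized_prices = min_max_normalize([price for _, price, _, _ in rows])

# Attribute construction: a new attribute derived from the given ones.
revenue = {item: price * units for item, price, units, _ in rows}

# Aggregation: summarize units sold per session.
units_per_session = defaultdict(int)
for _, _, units, session in rows:
    units_per_session[session] += units
```

Min-max scaling is sensitive to outliers, because a single extreme price stretches the range and compresses all other values toward one end.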
Data Reduction
• Data is too big to work with
- Too many instances
- Too many features (attributes)
• Data reduction
- Obtain a reduced representation of the data set that is much smaller
in volume yet produces the same (or almost the same) analytical
results (easily said but difficult to do)
• Data reduction strategies
- Dimensionality reduction: remove unimportant attributes
- Aggregation and clustering: remove redundant or closely associated ones
- Sampling
Data Reduction: Clustering
• Partition the data set into clusters, and store only the cluster
representation.
• Can be very effective if the data is clustered, but not if the data is dirty.
• There are many choices of clustering definitions and clustering algorithms.
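To make "store only the cluster representation" concrete, the sketch below runs a very small one-dimensional k-means and keeps just each cluster's centroid and size. K-means is only one of the many possible algorithms; the function name, the starting centroids, and the fixed iteration count are assumptions made for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small 1-D k-means: returns the final centroids and clusters."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to the nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
centroids, clusters = kmeans_1d(prices, centroids=[4, 21, 34])

# Reduced representation: one (centroid, size) pair per cluster instead of 12 values.
summary = [(round(c, 2), len(members)) for c, members in zip(centroids, clusters)]
```

The twelve prices are reduced to three pairs, which is effective only because the values fall into reasonably distinct groups.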
Data Reduction: Sampling
• Choose a representative subset of the data
- Simple random sampling may perform poorly when the data is skewed.
• Develop adaptive sampling methods
- Stratified sampling: approximate the percentage of each class
(or subpopulation of interest) in the overall database
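Stratified sampling can be sketched as follows. The data set, the function name stratified_sample, and the fixed seed are hypothetical; the sketch draws the same fraction from every class so that the class percentages in the sample approximate those in the full data.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from each class so class proportions are kept."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one row per class
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical skewed data: 90 rows of class "All", 10 rows of class "Winter".
rows = [(i, "All") for i in range(90)] + [(i, "Winter") for i in range(90, 100)]
sample = stratified_sample(rows, key=lambda row: row[1], fraction=0.2)
class_counts = Counter(label for _, label in sample)
```

A simple random sample of 20 rows could easily contain no "Winter" rows at all, whereas the stratified sample keeps the 90:10 ratio.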
Data Discretization
• Discretization is a process that transforms quantitative data into
qualitative data.
• It can significantly improve the quality of the discovered knowledge.
• It reduces the running time of various machine learning tasks such as
association rule discovery, classification, clustering, and prediction.
• It reduces the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
• Interval labels can then be used to replace actual data values.
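A minimal sketch of discretization, assuming equal-width intervals (one of several possible ways to divide the range) and hypothetical labels "low", "medium", and "high":

```python
def equal_width_discretize(values, labels):
    """Replace each numeric value with the label of its equal-width interval."""
    n_intervals = len(labels)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    result = []
    for v in values:
        if width == 0:  # all values identical: everything falls in the first interval
            index = 0
        else:
            # The maximum value would land in interval n; clamp it into the last one.
            index = min(int((v - lo) / width), n_intervals - 1)
        result.append(labels[index])
    return result


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
price_levels = equal_width_discretize(prices, ["low", "medium", "high"])
```

Here the range 4 to 34 is split into three intervals of width 10, so twelve distinct-looking prices are replaced by just three labels.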
Email: gopal.sakarkar@raisoni.net
Part 2
Implementation of Data
Cleaning Services
Using
Python Programming

Data preprocessing using Machine Learning

  • 1. Dr. Gopal Sakarkar, IEEE-CIS Member, Ph.D(CSE) Department of AI and Machine Learning , G H RaisoniCollegeof Engineering , Nagpur Data Pre-processing Services using Machine Learning Algorithms
  • 2. Data Cleaning Services Good data preparation is key to producing valid and reliable models.
  • 7. What is Machine Learning? • According to Arthur Samuel(1959), Machine Learning algorithms enable the computers to learn from data, and even improve themselves, without being explicitly programmed. • Machine learning (ML) is a category of an algorithm that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. • The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output while updating outputs as new data becomes available.
  • 8. Types of Machine Learning
  • 9. Types of Machine Learning Supervised Learning Unsupervised Learning MachineLearningAlgorithms
  • 10. Where is Data Cleaning used? Machine Learning Life Cycle
  • 11. Data Pre-processing • Data preprocessing is an important step in ML • The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. • It involves transforming raw data into an understandable format. • Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. • Data preprocessing is a proven method of resolving such issues
  • 13. Why Data Pre-processing? • A manager at All Electronics and have been charged with analyzing the company's data with respect to the sales at a branch. • He carefully inspect the company's database and data warehouse, identifying dimensions to be included, such as item, price, units sold, and session . • He notice that several of the attributes for various tuples have no recorded value. For analysis, he would like to include information. • In other words, the data he wish to analyze by machine learning techniques is incomplete, noisy and inconsistent.
  • 14. Why Data Pre-processing? Item Price Unit Sold Session TV 7200 44 All Fan 480 27 Summer Tube light 54 30 All AC 27000 38 Fridge 40 Summer Switches 58 35 2 mm Wire 520 All Backup Light 790 48 Winter Fan Regulator 83 50 All Bulb 87 37 Rainy Session
  • 15. What do you mean by data Pre-processing ? • It is cleaning and explorating data for analysis • Prepping data for modeling • Modeling in Python requires numerical input • Data preprocessing is a technique that involves transforming raw data into an understandable format. • Data preprocessing is a proven method of resolving such issues.
  • 16. Data Understanding : Relevance of data • What data is available for the task? • Is this data relevant? • Is additional relevant data available? • How much historical data is available?
  • 17. Data Understanding: Quantity of data • Number of instances (records, objects) • Rule of thumb: 5,000 or more desired • if less, results are less reliable; use special methods (boosting, …) • Number of attributes (fields) • Rule of thumb: for each attribute, 10 or more instances • If more fields, use feature reduction and selection • if very unbalanced, use sampling
  • 20. • Data Cleaning Data cleaning is process of fill in missing values, smoothing the noisy data, identify or remove outliers, and resolve inconsistencies. • Data Integration Integration of multiple databases, data cubes, or files. • Data Transformation Data transformation is the task of data normalization and aggregation. Data Pre-processing Steps
  • 21. Data Pre-processing Steps • Data Reduction: obtaining a representation that is reduced in volume but produces the same or similar analytical results. • Data Discretization: part of data reduction, but of particular importance, especially for numerical data.
  • 23. Data Cleaning • Importance: data cleaning is the number one problem when working with large data. • Data cleaning tasks: fill in missing values; identify outliers and smooth out noisy data; correct inconsistent data; resolve redundancy caused by data integration
  • 24. Data Cleaning: Missing Data • Data is not always available, e.g., when filling in an admission form, a student might not know the local guardian's contact number. • Missing data may be due to: equipment malfunction; data that was inconsistent with other recorded data and was therefore deleted; data not entered due to misunderstanding; certain data not being considered important at the time of entry; history or changes of the data not being recorded; expansion of the data schema
  • 25. How to Handle Missing Data? • Ignore the tuple (loss of information) • Fill in missing values manually: tedious, infeasible? • Fill them in automatically with a global constant, e.g., "unknown" (a new class?!) • Imputation: use the attribute mean to fill in the missing value, or use the most probable value to fill in the missing value.
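The options on this slide can be sketched in plain Python. This is a minimal illustration, not the deck's own implementation: the `records` list is hypothetical, `None` is assumed to mark a missing value, and the helper names are made up for the example. The "most probable value" is approximated here by the most frequent observed value, which is a simplifying assumption.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value.
records = [
    {"item": "TV", "price": 7200, "session": "All"},
    {"item": "Fan", "price": 480, "session": "Summer"},
    {"item": "AC", "price": 27000, "session": None},
    {"item": "Wire", "price": None, "session": "All"},
]

def drop_missing(rows):
    """Ignore every tuple that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_constant(rows, key, constant="unknown"):
    """Fill missing values of one attribute with a global constant."""
    return [{**r, key: constant if r[key] is None else r[key]} for r in rows]

def fill_mean(rows, key):
    """Impute a numeric attribute with the mean of its observed values."""
    observed = [r[key] for r in rows if r[key] is not None]
    avg = mean(observed)
    return [{**r, key: avg if r[key] is None else r[key]} for r in rows]

def fill_mode(rows, key):
    """Impute an attribute with its most frequent observed value
    (a simple stand-in for the 'most probable value')."""
    observed = [r[key] for r in rows if r[key] is not None]
    most_common = mode(observed)
    return [{**r, key: most_common if r[key] is None else r[key]} for r in rows]

print(len(drop_missing(records)))               # tuples left after dropping
print(fill_mean(records, "price")[3]["price"])  # imputed price for "Wire"
```

Dropping tuples is the simplest choice but shrinks the data set; the imputation helpers keep every tuple at the cost of introducing estimated values.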
  • 26. Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions • Other data problems that require data cleaning: duplicate records; incomplete data; inconsistent data
  • 27. How to handle noisy data? • Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc. • Combined computer and human inspection: detect suspicious values automatically and have a human check them
  • 28. Binning Methods for Data Smoothing • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 • Partition into (equi-depth) bins: Bin 1: 4, 8, 9, 15; Bin 2: 21, 21, 24, 25; Bin 3: 26, 28, 29, 34 • Smoothing by bin means: Bin 1: 9, 9, 9, 9, since (4+8+9+15)/4 = 9; Bin 2: 23, 23, 23, 23, since (21+21+24+25)/4 = 22.75 ≈ 23; Bin 3: 29, 29, 29, 29, since (26+28+29+34)/4 = 29.25 ≈ 29 • Smoothing by bin boundaries: Bin 1: 4, 4, 4, 15; Bin 2: 21, 21, 25, 25; Bin 3: 26, 26, 26, 34
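The binning example on this slide can be reproduced with a short Python sketch. The function names are illustrative, the bin means are rounded to the nearest integer as on the slide, and ties in boundary smoothing are assumed to go to the lower boundary (the slide does not say how ties are resolved).

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size.
    Assumes len(values) is divisible by n_bins."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum.
    Ties go to the lower boundary (an assumption)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```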
  • 29. Data Integration • Data integration combines data from multiple sources • Schema integration: integrate metadata from different sources • Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-# • Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ, e.g., because of different scales (metric vs. British units) • Removing duplicates and redundant data
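A small Python sketch of these integration steps, under stated assumptions: the two sources, their field names (`cust_id`, `cust_no`, `weight_kg`, `weight_lb`), and the rule of preferring source A when both sources describe the same customer are all hypothetical and chosen only to illustrate schema mapping, unit conflicts, and duplicate removal.

```python
LB_TO_KG = 0.45359237  # kilograms per international pound

# Hypothetical sources: A uses cust_id and kilograms, B uses cust_no and pounds.
source_a = [{"cust_id": 1, "weight_kg": 70.0}, {"cust_id": 2, "weight_kg": 55.5}]
source_b = [{"cust_no": 2, "weight_lb": 122.4}, {"cust_no": 3, "weight_lb": 180.0}]

def integrate(a_rows, b_rows):
    """Merge both sources into one schema keyed by customer id."""
    merged = {}
    for row in a_rows:
        merged[row["cust_id"]] = {"cust_id": row["cust_id"],
                                  "weight_kg": row["weight_kg"]}
    for row in b_rows:
        key = row["cust_no"]  # entity identification: B.cust_no matches A.cust_id
        if key in merged:
            continue  # duplicate entity: keep source A's value (an assumption)
        merged[key] = {"cust_id": key,
                       # resolve the unit conflict by converting pounds to kilograms
                       "weight_kg": round(row["weight_lb"] * LB_TO_KG, 2)}
    return sorted(merged.values(), key=lambda r: r["cust_id"])

customers = integrate(source_a, source_b)
print([c["cust_id"] for c in customers])  # [1, 2, 3]
```

In practice the conflict-resolution rule (which source wins) depends on which source is more trustworthy for that attribute.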
  • 30. Data Transformation • Smoothing: remove noise from data • Normalization: scale values to fall within a small, specified range • Attribute/feature construction: new attributes constructed from the given ones • Aggregation: summarization; integrate data from different sources (tables)
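Normalization to a specified range is commonly done with min-max scaling. The sketch below is one such approach, not necessarily the one intended by the slides; the function name and the choice to map a constant column to `new_min` are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # A constant attribute has no spread; map everything to new_min.
        return [new_min for _ in values]
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

print(min_max_normalize([0, 5, 10]))  # [0.0, 0.5, 1.0]
```

Min-max scaling is sensitive to outliers, since a single extreme value stretches the range that all other values are mapped into.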
  • 31. Data Reduction • Data is too big to work with: too many instances, too many features (attributes) • Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results (easily said but difficult to do) • Data reduction strategies: dimensionality reduction (remove unimportant attributes); aggregation and clustering (remove redundant or closely associated ones); sampling
  • 32. Data Reduction Clustering • Partition data set into clusters, and one can store cluster representation only. • Can be very effective if data is clustered but not if data is dirty. • There are many choices of clustering and clustering algorithms.
  • 33. Data Reduction: Sampling • Choose a representative subset of the data: simple random sampling may perform poorly in the presence of skew. • Develop adaptive sampling methods, e.g., stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
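Stratified sampling can be sketched with the standard library alone. The function name, the `fraction` and `seed` parameters, and the rule of keeping at least one row per class are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw roughly `fraction` of the rows from each class so that the
    class proportions of the full data set are approximately preserved."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one row per class
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical unbalanced data: 90 rows of class "A", 10 rows of class "B".
data = [{"label": "A"}] * 90 + [{"label": "B"}] * 10
sample = stratified_sample(data, "label", 0.1)
print(sum(r["label"] == "A" for r in sample), sum(r["label"] == "B" for r in sample))  # 9 1
```

A plain random sample of 10 rows from the same data could easily contain no "B" rows at all, which is the situation stratification avoids.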
  • 35. Data Discretization • Discretization is a process that transforms quantitative data into qualitative data. • It can significantly improve the quality of the discovered knowledge. • It reduces the running time of various machine learning tasks such as association rule discovery, classification, clustering, and prediction. • It reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. • Interval labels can then be used to replace actual data values.
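One simple way to divide an attribute's range into intervals is equal-width discretization. The sketch below assumes that approach (the slides do not name a specific method); the function name and the labels are illustrative.

```python
def equal_width_discretize(values, labels):
    """Split the range of `values` into len(labels) equal-width intervals
    and replace each value with the label of its interval."""
    low, high = min(values), max(values)
    n = len(labels)
    width = (high - low) / n
    result = []
    for v in values:
        if width == 0:
            index = 0  # constant attribute: everything falls in the first interval
        else:
            # The maximum value would land one past the end, so clamp it.
            index = min(int((v - low) / width), n - 1)
        result.append(labels[index])
    return result

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_discretize(prices, ["low", "medium", "high"]))
```

With these prices the range 4 to 34 is split into intervals of width 10, so values from 4 up to (but not including) 14 become "low", values from 14 up to 24 become "medium", and the rest become "high".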
  • 38. Part 2 Implementation of Data Cleaning Services Using Python Programming