Machine Learning
Dataset Preparation
Portland Data Science Group
Created by Andrew Ferlitsch
Community Outreach Officer
July, 2017
Dataset Preparation
• Prior to using a dataset to train a model, the dataset
must be prepared.
1. Import the data
2. Clean the data (Data Wrangling)
3. Replace Missing Values
4. Categorical Value Conversion
5. Feature Scaling
Importing the Dataset
• Datasets are generally imported as raw data files
(e.g., US Census) or via an API service (e.g., NWS
Weather Data SOAP API).
• Datasets are generally in CSV, JSON, or XML format.
• For the purpose of this tutorial, CSV is used in the
accompanying examples.
Importing the Dataset - Python
import pandas as pd # use pandas library for data frames
# pd.read_csv() is the function that reads a CSV file; its argument is the pathname of the raw data file.
# The CSV data is converted to a data frame.
dataset = pd.read_csv( 'data.csv' ) # read CSV file into a data frame
Example Data (CSV File):
Age, Gender, Income, Spending
22,M,18000,6000
25,F,30000,8000
31,F,35000,12000
35,M,40000,18000
Generated Data Frame:
   Age  Gender  Income  Spending
0  22   M       18000   6000
1  25   F       30000   8000
2  31   F       35000   12000
3  35   M       40000   18000
(The data frame adds the row indices 0 to 3 in the first column.)
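The import step can be tried without a file on disk. The sketch below feeds the example CSV text to pandas through an in-memory buffer. The `CSV_TEXT` name and the use of `skipinitialspace=True` are choices made for this illustration: the example header has a blank after each comma, and without that option the column names would keep the leading blank.

```python
import io

import pandas as pd

# The example CSV from the slide, held in memory instead of in data.csv.
CSV_TEXT = """Age, Gender, Income, Spending
22,M,18000,6000
25,F,30000,8000
31,F,35000,12000
35,M,40000,18000
"""

# skipinitialspace=True drops the blank that follows each comma in the header,
# so the second column is named "Gender" rather than " Gender".
dataset = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)

print(dataset)
print(list(dataset.columns))
```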
Cleaning the Data (Data Wrangling)
• It is not uncommon for datasets to have some dirty
data entries (i.e., samples, rows in CSV file, …)
• Common Problems
• Bad Character Encodings (Funny Characters)
• Misaligned Data (e.g., row has too few/many columns)
• Data in wrong format.
Great Britain and the United States are two of the few places in the world that use a period to indicate the
decimal place. Many other countries use a comma instead. Likewise, while the U.K. and U.S. use a comma to
separate groups of thousands, many other countries use a period instead, and some countries separate
thousands groups with a thin space.
https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html
• Data Wrangling is an expertise/occupation all its own.
Common Practices in Data Wrangling
• Know the character encoding of the data file and
intended character encoding of the data.
Convert the data encoding format of the file if necessary.
e.g., Notepad++ -> Encodings
• Know the data format of the source and expected
data format.
Convert the data format using a batch preprocessing file.
e.g., 1 000 000 -> 1,000,000
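As a minimal sketch of these two practices, the snippet below re-encodes text from a known source encoding and converts number formats. The helper names `parse_number` and `format_us` are hypothetical, and Latin-1 is only an assumed source encoding for the example.

```python
def parse_number(text, thousands=" ", decimal="."):
    """Convert a formatted number string such as '1 000 000' to a float."""
    cleaned = text.strip().replace(thousands, "")
    if decimal != ".":
        cleaned = cleaned.replace(decimal, ".")
    return float(cleaned)


def format_us(value):
    """Format a number with commas between groups of thousands."""
    return f"{value:,.0f}"


# Re-encode bytes from a known source encoding (Latin-1 assumed here) to UTF-8.
latin1_bytes = "caf\u00e9".encode("latin-1")
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")

print(format_us(parse_number("1 000 000")))
print(parse_number("1.234,56", thousands=".", decimal=","))
```

A thin-space separator can be handled the same way by passing `thousands="\u2009"`.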
Replace Missing Values
• Not unusual for samples (rows) to contain missing (blank)
entries, or not a number (NaN).
• Blank/NaN entries do not work for Machine Learning!
• Need to replace the blank/NaN entry with something
meaningful.
• Delete the rows (generally not desirable)
• Replace with a Single Value
• Mean Average
• Multivariate Imputation using Chained Equations (MICE)
https://msdn.microsoft.com/en-us/library/azure/dn906028.aspx
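To make the mean-replacement option concrete, here is a small sketch in plain Python. The function name `replace_missing_with_mean` and the sample income values are made up for illustration; blanks are represented as `None` and NaN as `float("nan")`.

```python
import math
from statistics import mean


def replace_missing_with_mean(column):
    """Return a copy of column with None/NaN entries replaced by the mean of the known values."""

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    known = [value for value in column if not is_missing(value)]
    fill = mean(known)  # raises StatisticsError if every entry is missing
    return [fill if is_missing(value) else value for value in column]


income = [18000, None, 35000, float("nan"), 40000]
filled = replace_missing_with_mean(income)
print(filled)
```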
Missing Values – Mean Value
import numpy as np
from sklearn.impute import SimpleImputer # scikit-learn class for handling missing data (replaces the removed sklearn.preprocessing.Imputer)
# Create imputer object to replace NaN values with the mean value of the column
imputer = SimpleImputer( missing_values=np.nan,
strategy='mean' )
# dataset is assumed here to be a NumPy array (e.g., dataset.values).
# [ :, 2:3 ] selects all rows of column 2 (index starts at 0) as a 2-D slice, which scikit-learn requires.
# Fit the imputer object to the data
imputer = imputer.fit( dataset[ :, 2:3 ] )
# Do the replacement and update the original dataset (must be the same columns that were used in fit)
dataset[ :, 2:3 ] = imputer.transform( dataset[ :, 2:3 ] )
Categorical Variables
Age  Gender  Income
25   Male    25000
26   Female  22000
30   Male    45000
24   Female  26000
• Age, Gender: Independent Variables (Features)
• Income: Dependent Variable (Label), the Value to Predict
• Age: Real Values
• Gender: Categorical Values
Dummy Variable Conversion
Known in scikit-learn as OneHotEncoder
For each categorical feature:
1. Scan the dataset and determine all the unique instances.
2. Create a new feature (i.e., dummy variable) in dataset, one
per unique instance.
3. Remove the categorical feature from the dataset.
4. For each sample (row), set a 1 in the feature (dummy
variable) that corresponds to that sample's categorical value.
5. Set a 0 in the remaining features (dummy variables) for that
categorical field.
6. Remove one dummy variable field.
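The six steps above can be sketched in plain Python as follows. This is an illustration only, not how scikit-learn implements it; the function name `to_dummy_variables` and the list-of-dicts row format are assumptions made for the example.

```python
def to_dummy_variables(rows, column, drop_first=True):
    """Replace a categorical column with 0/1 dummy variables.

    rows is a list of dicts (one per sample); a new list is returned.
    """
    # Step 1: determine the unique instances (sorted for a stable order).
    categories = sorted({row[column] for row in rows})
    # Step 6: remove one dummy variable (here, the first category).
    kept = categories[1:] if drop_first else categories

    result = []
    for row in rows:
        # Step 3: copy the sample without the categorical feature.
        new_row = {key: value for key, value in row.items() if key != column}
        # Steps 2, 4 and 5: one new feature per kept category, set to 1 or 0.
        for category in kept:
            new_row[category] = 1 if row[column] == category else 0
        result.append(new_row)
    return result


samples = [
    {"Age": 25, "Gender": "Male", "Income": 25000},
    {"Age": 26, "Gender": "Female", "Income": 22000},
    {"Age": 30, "Gender": "Male", "Income": 45000},
    {"Age": 24, "Gender": "Female", "Income": 26000},
]
encoded = to_dummy_variables(samples, "Gender")
print(encoded)
```

With `drop_first=True` only the `Male` column remains; a 0 in that column then stands for `Female`.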
Dummy Variable Trap
Gender   Male  Female
Male     1     0
Female   0     1
Male     1     0
Female   0     1
Need to Drop one Dummy Variable!
x1 x2 x3
Multicollinearity occurs when one variable predicts another.
i.e., x2 = ( 1 – x3)
As a result, a regression analysis cannot distinguish between the
contribution of x2 and x3.
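A short worked derivation may help show why the two contributions cannot be separated. It assumes a linear regression with coefficients b_0 to b_3, where x_1 stands for any other feature; those symbols are introduced here for illustration and are not from the slide.

```latex
\begin{align*}
y &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \\
  &= b_0 + b_1 x_1 + b_2 (1 - x_3) + b_3 x_3 && \text{since } x_2 = 1 - x_3 \\
  &= (b_0 + b_2) + b_1 x_1 + (b_3 - b_2)\, x_3
\end{align*}
```

Only the combinations (b_0 + b_2) and (b_3 - b_2) affect the prediction, so many different values of b_2 and b_3 fit the data equally well. Dropping one dummy variable removes this ambiguity.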
Categorical Variable Conversion
from sklearn.compose import ColumnTransformer # scikit-learn class for applying a transformer to selected columns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder # scikit-learn classes for categorical variable conversion
# dataset is assumed here to be a NumPy array (e.g., dataset.values).
# Create an encoder object to numerically encode (enumerate) the categorical variable
labelEncoder = LabelEncoder()
# Fit the encoder to column 1 (index starts at 0, all rows selected) and update the same column in the original dataset
dataset[ :, 1 ] = labelEncoder.fit_transform( dataset[ :, 1 ] )
# Create an encoder to convert the numerical encodings in column 1 to one-hot encoded dummy variables.
# OneHotEncoder( categorical_features = [ 1 ] ) was removed in scikit-learn 0.22; ColumnTransformer selects the column instead.
# drop='first' removes one dummy variable to avoid the dummy variable trap.
onehotencoder = ColumnTransformer( [ ( 'onehot', OneHotEncoder( drop='first' ), [ 1 ] ) ], remainder='passthrough' )
# Replace the encoded categorical values with the dummy variables (the dummy columns are placed first in the result)
dataset = onehotencoder.fit_transform( dataset )
Feature Scaling
• If features do not have the same numerical scale,
it will cause issues in training a model.
• If the scale of one independent variable (feature) is
greater than another independent variable, the model
will give more importance (skew) to the independent
variable with the larger range.
• To eliminate this problem, one converts all the
independent variables to use the same scale.
• Normalization ( 0 to 1 )
• Standardization ( mean of 0, standard deviation of 1 )
Scaling Issue - Euclidean Distance
• Many machine learning models use the Euclidean distance
between two points in 2D Cartesian space:
\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
• Given two independent variables (x1 = Age, x2 = Income)
and a dependent variable (y = spending), this becomes for
a given sample (row) i:
\sqrt{(x_{2i} - x_{1i})^2 + (y_i - y_i)^2} = \sqrt{(x_{2i} - x_{1i})^2}
• If x1 or x2 is on a substantially larger scale than the other,
the corresponding independent variable will dominate
the result and contribute more to the model.
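A related illustration, using the ordinary Euclidean distance between two samples rather than the per-sample expression above: with unscaled Age and Income, the Income difference accounts for nearly all of the distance. The two (age, income) pairs are hypothetical values chosen for the example.

```python
import math

# Two hypothetical samples as (age, income) pairs.
a = (25, 30000)
b = (35, 31000)

# Euclidean distance between the two samples (math.dist needs Python 3.8+).
raw_distance = math.dist(a, b)

# Contribution of each feature to the squared distance.
age_part = (b[0] - a[0]) ** 2
income_part = (b[1] - a[1]) ** 2
income_share = income_part / (age_part + income_part)

print(raw_distance)
print(income_share)
```

Here a 10-year age gap is almost invisible next to a 1000-unit income gap, which is the imbalance that feature scaling is meant to remove.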
Normalization or Standardization
• Feature Scaling means scaling features to the same scale.
• Normalization scales features between 0 and 1, retaining
their proportional range to each other.
• Standardization scales features to have a mean (μ) of 0
and standard deviation (σ) of 1.
Normalization:
X' = \frac{x - \min(x)}{\max(x) - \min(x)}
where x is the original value and X' is the new value.
Standardization:
X' = \frac{x - \mu}{\sigma}
where x is the original value, X' is the new value, μ is the mean, and σ is the standard deviation.
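Both formulas can be checked by hand with a few lines of plain Python. This is a sketch for illustration: the function names `normalize` and `standardize` are made up, and the population standard deviation is assumed for σ.

```python
from statistics import mean, pstdev


def normalize(values):
    """Scale values to the range 0 to 1: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Scale values to mean 0 and standard deviation 1: (x - mean) / std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


income = [18000, 30000, 35000, 40000]
normalized = normalize(income)
standardized = standardize(income)

print(normalized)
print(standardized)
```

Both functions divide by zero when every value in the column is identical, so a constant column would need to be handled separately.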
Feature Scaling in Python
from sklearn.preprocessing import StandardScaler # scikit-learn class for Feature Scaling
# Create a scaling object to scale the features.
scale = StandardScaler()
# dataset is assumed here to be a NumPy array with a float (or object) dtype;
# an integer array would truncate the scaled values on assignment.
# Fit the Scaling object to the data and transform the data.
# [ :, :-1 ] feature scales all the variables except the last column (y or label)
dataset[ :, :-1 ] = scale.fit_transform( dataset[ :, :-1 ] )

Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
 
DealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 editionDealBook of Ukraine: 2024 edition
DealBook of Ukraine: 2024 edition
 
Artificial Intelligence (AI), Robotics and Computational fluid dynamics
Artificial Intelligence (AI), Robotics and Computational fluid dynamicsArtificial Intelligence (AI), Robotics and Computational fluid dynamics
Artificial Intelligence (AI), Robotics and Computational fluid dynamics
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)HTTP Adaptive Streaming – Quo Vadis (2024)
HTTP Adaptive Streaming – Quo Vadis (2024)
 
“Transforming Enterprise Intelligence: The Power of Computer Vision and Gen A...
“Transforming Enterprise Intelligence: The Power of Computer Vision and Gen A...“Transforming Enterprise Intelligence: The Power of Computer Vision and Gen A...
“Transforming Enterprise Intelligence: The Power of Computer Vision and Gen A...
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
 

Machine Learning - Dataset Preparation

  • 1. Machine Learning Dataset Preparation Portland Data Science Group Created by Andrew Ferlitsch Community Outreach Officer July, 2017
  • 2. Dataset Preparation • Prior to using a dataset to train a model, the dataset must be prepared. 1. Import the data 2. Clean the data (Data Wrangling) 3. Replace Missing Values 4. Categorical Value Conversion 5. Feature Scaling
  • 3. Importing the Dataset • Datasets are generally imported as raw data files (e.g., US Census) or via an API service (e.g., NWS Weather Data SOAP API). • Datasets are generally in CSV, JSON or XML format. • For the purpose of this tutorial, CSV is used in the accompanying examples.
  • 4. Importing the Dataset - Python
      import pandas as pd  # use the pandas library for data frames
      dataset = pd.read_csv('data.csv')  # function to read a CSV file; the argument is the pathname to the raw data file
    The CSV data is converted to a data frame.
    Example Data (CSV File):
      Age, Gender, Income, Spending
      22,M,18000,6000
      25,F,30000,8000
      31,F,35000,12000
      35,M,40000,18000
    Generated Data Frame (the data frame adds the row indices 0 to 3):
         Age  Gender  Income  Spending
      0   22       M   18000      6000
      1   25       F   30000      8000
      2   31       F   35000     12000
      3   35       M   40000     18000
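The header row of the example CSV has a space after each comma, which pandas would otherwise keep as part of the column names (' Gender' rather than 'Gender'). A minimal sketch, assuming the slide's sample rows are held in an in-memory string that stands in for data.csv, and using the `skipinitialspace` option of `read_csv` to drop those spaces:

```python
import io

import pandas as pd

# Sample rows copied from the slide; the in-memory string stands in for data.csv.
CSV_TEXT = """Age, Gender, Income, Spending
22,M,18000,6000
25,F,30000,8000
31,F,35000,12000
35,M,40000,18000
"""

# skipinitialspace=True strips the blank that follows each comma in the header.
dataset = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)

print(dataset.columns.tolist())  # column names without leading spaces
print(dataset.shape)             # (rows, columns)
```

Reading from a real file works the same way: pass the pathname instead of the `io.StringIO` object.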
  • 5. Cleaning the Data (Data Wrangling) • It is not uncommon for datasets to have some dirty data entries (i.e., samples, rows in CSV file, …) • Common Problems • Bad Character Encodings (Funny Characters) • Misaligned Data (e.g., row has too few/many columns) • Data in the wrong format. Great Britain and the United States are two of the few places in the world that use a period to indicate the decimal place. Many other countries use a comma instead. Likewise, while the U.K. and U.S. use a comma to separate groups of thousands, many other countries use a period instead, and some countries separate thousands groups with a thin space. https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html • Data Wrangling is an expertise/occupation in its own right.
  • 6. Common Practices in Data Wrangling • Know the character encoding of the data file and intended character encoding of the data. Convert the data encoding format of the file if necessary. e.g., Notepad++ -> Encodings • Know the data format of the source and expected data format. Convert the data format using a batch preprocessing file. e.g., 1 000 000 -> 1,000,000
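The two practices above can be sketched with the Python standard library alone. The function names `convert_to_utf8` and `regroup_thousands` are hypothetical helpers introduced for illustration, and the sketch assumes the source encoding is already known and that the numbers are whole numbers grouped with spaces:

```python
import re


def convert_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def regroup_thousands(value: str) -> str:
    """Turn a space-grouped integer such as '1 000 000' into '1,000,000'."""
    # \s matches ordinary spaces as well as thin and no-break spaces.
    digits = re.sub(r"\s", "", value)
    return f"{int(digits):,}"


# "caf\u00e9" is the word "cafe" with an accented e, stored here as Latin-1 bytes.
latin1_bytes = "caf\u00e9".encode("latin-1")
utf8_bytes = convert_to_utf8(latin1_bytes, "latin-1")

regrouped = regroup_thousands("1 000 000")
print(regrouped)
```

In a batch preprocessing step, helpers like these would be applied to each file or field before the data is loaded into a data frame.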
  • 7. Replace Missing Values • It is not unusual for samples (rows) to contain missing (blank) entries, or not-a-number (NaN) values. • Most machine learning algorithms cannot train on blank/NaN entries. • Each blank/NaN entry needs to be removed or replaced with something meaningful: • Delete the rows (generally not desirable) • Replace with a Single Value • Mean Average • Multivariate Imputation by Chained Equations (MICE) https://msdn.microsoft.com/en-us/library/azure/dn906028.aspx
  • 8. Missing Values – Mean Value
      import numpy as np
      from sklearn.impute import SimpleImputer  # scikit-learn class for handling missing data (replaces the Imputer class, removed in scikit-learn 0.22)
      # Create an imputer object to replace NaN values with the mean value of the column
      imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
      # Fit the imputer to column 2 of the original dataset (index starts at 0);
      # iloc[:, 2:3] selects all rows and keeps the 2-D shape that scikit-learn expects
      imputer = imputer.fit(dataset.iloc[:, 2:3])
      # Do the replacement and update the dataset; transform needs the same columns that were fitted
      dataset.iloc[:, 2:3] = imputer.transform(dataset.iloc[:, 2:3])
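For a single column, the same mean replacement can be sketched with pandas alone. The sample values below are hypothetical (an Income column with one missing entry), chosen so the arithmetic is easy to follow: the mean of the three known incomes is (18000 + 35000 + 40000) / 3 = 31000.

```python
import pandas as pd

# Hypothetical sample: the Income of the second row is missing.
dataset = pd.DataFrame(
    {
        "Age": [22, 25, 31, 35],
        "Income": [18000.0, float("nan"), 35000.0, 40000.0],
    }
)

# Series.mean() skips NaN by default, so this is the mean of the known incomes.
income_mean = dataset["Income"].mean()

# Replace every missing Income with that mean.
dataset["Income"] = dataset["Income"].fillna(income_mean)

print(dataset)
```

Whether the mean is a sensible replacement depends on the column; for heavily skewed data a median may be a safer choice.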
  • 9. Categorical Variables
      Age   Gender   Income
      25    Male     25000
      26    Female   22000
      30    Male     45000
      24    Female   26000
    Age and Gender are the independent variables (features); Income is the dependent variable (label), the value to predict. Age holds real values; Gender holds categorical values.
  • 10. Dummy Variable Conversion
    Implemented in scikit-learn as OneHotEncoder. For each categorical feature:
      1. Scan the dataset and determine all the unique values of the feature.
      2. Create a new feature (i.e., dummy variable) in the dataset, one per unique value.
      3. Remove the categorical feature from the dataset.
      4. For each sample (row), set a 1 in the dummy variable that corresponds to the sample's categorical value.
      5. Set a 0 in the remaining dummy variables for that categorical field.
      6. Remove one dummy variable field (to avoid the dummy variable trap).
  • 11. Dummy Variable Trap
      Gender (x1)   Male (x2)   Female (x3)
      Male          1           0
      Female        0           1
      Male          1           0
      Female        0           1
    Need to drop one dummy variable! Multicollinearity occurs when one variable predicts another, i.e., x2 = 1 - x3. As a result, a regression analysis cannot distinguish between the contribution of x2 and x3.
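The conversion steps and the dummy variable trap can be sketched with pandas `get_dummies`, which is a swapped-in alternative to the scikit-learn encoder used in the deck. The rows are those of the Categorical Variables slide; `dtype=int` is set so the dummy columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Sample rows from the Categorical Variables slide.
dataset = pd.DataFrame(
    {
        "Age": [25, 26, 30, 24],
        "Gender": ["Male", "Female", "Male", "Female"],
        "Income": [25000, 22000, 45000, 26000],
    }
)

# One dummy column per unique value: Male is fully determined by Female.
full = pd.get_dummies(dataset["Gender"], dtype=int)
redundant = bool((full["Male"] == 1 - full["Female"]).all())

# drop_first=True removes one dummy column (the first category in sorted
# order, here Female), which avoids the dummy variable trap.
encoded = pd.get_dummies(dataset, columns=["Gender"], drop_first=True, dtype=int)

print(redundant)
print(encoded)
```

The remaining `Gender_Male` column carries all the information: a 0 can only mean Female.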
  • 12. Categorical Variable Conversion
      from sklearn.compose import ColumnTransformer
      from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # scikit-learn classes for categorical variable conversion
      # Create an encoder object to numerically (enumeration) encode categorical variables
      labelEncoder = LabelEncoder()
      # Fit the encoder and encode the categorical values in column 1 of the original dataset
      # (index starts at 0; iloc[:, 1] selects all rows of that column)
      dataset.iloc[:, 1] = labelEncoder.fit_transform(dataset.iloc[:, 1])
      # Create an encoder to convert the numerical encodings in column 1 to one-hot dummy variables
      # (the categorical_features argument was removed in scikit-learn 0.22; ColumnTransformer selects the column instead)
      transformer = ColumnTransformer([('onehot', OneHotEncoder(), [1])], remainder='passthrough')
      # Replace the encoded categorical values with the dummy variables;
      # the result is the dataset with converted categorical variables, dummy columns first
      dataset = transformer.fit_transform(dataset)
  • 13. Feature Scaling • If features do not share the same numerical scale, training a model can be distorted. • If the scale of one independent variable (feature) is greater than that of another, many models (e.g., distance-based ones) will give more importance (skew) to the independent variable with the larger range. • To eliminate this problem, one converts all the independent variables to use the same scale. • Normalization (values from 0 to 1) • Standardization (mean of 0, standard deviation of 1)
  • 14. Scaling Issue - Euclidean Distance • Many machine learning models use the Euclidean distance between two points in 2D Cartesian space: $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ • Given two independent variables (x1 = Age, x2 = Income) and a dependent variable (y = Spending), this becomes, for a given sample (row) i: $\sqrt{(x_{2i} - x_{1i})^2 + (y_i - y_i)^2} = \sqrt{(x_{2i} - x_{1i})^2}$ • If x1 or x2 has a substantially greater scale than the other, the corresponding independent variable will dominate the result and contribute more to the model.
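A small worked example may make the dominance concrete. It reads the slide's point as the distance between two samples in (Age, Income) space, which is an assumption about the intended setup, and uses the first two rows of the example data:

```python
import math

# Two samples from the example data, as (Age, Income) pairs.
sample_a = (22, 18000)
sample_b = (25, 30000)

age_gap = sample_b[0] - sample_a[0]        # 3
income_gap = sample_b[1] - sample_a[1]     # 12000

# Euclidean distance between the two samples on the unscaled features.
distance = math.sqrt(age_gap ** 2 + income_gap ** 2)

# Share of the squared distance that comes from Income alone.
income_share = income_gap ** 2 / (age_gap ** 2 + income_gap ** 2)

print(distance, income_share)
```

The distance is almost exactly the income gap; the three-year age difference has practically no effect until the features are brought to a common scale.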
  • 15. Normalization or Standardization • Feature Scaling means scaling features to the same scale. • Normalization scales features between 0 and 1, retaining their proportional range to each other. • Standardization scales features to have a mean ($\mu$) of 0 and standard deviation ($\sigma$) of 1.
    Normalization: $X' = \frac{x - \min(x)}{\max(x) - \min(x)}$
    Standardization: $X' = \frac{x - \mu}{\sigma}$
    Here $x$ is the original value, $X'$ is the new value, $\mu$ is the mean and $\sigma$ is the standard deviation.
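Both formulas can be checked by hand on the Income column of the example data. This is a minimal sketch using only the standard library; it assumes the population standard deviation (`statistics.pstdev`), which is also what scikit-learn's StandardScaler uses.

```python
import statistics

incomes = [18000, 30000, 35000, 40000]  # Income column from the example data

# Normalization (min-max): X' = (x - min(x)) / (max(x) - min(x))
low, high = min(incomes), max(incomes)
normalized = [(x - low) / (high - low) for x in incomes]

# Standardization: X' = (x - mean) / standard deviation
mean = statistics.mean(incomes)
std = statistics.pstdev(incomes)  # population standard deviation (an assumption)
standardized = [(x - mean) / std for x in incomes]

print(normalized)
print(standardized)
```

Normalized values are bounded by 0 and 1, while standardized values are centred on 0 and are not confined to a fixed interval.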
  • 16. Feature Scaling in Python
      from sklearn.preprocessing import StandardScaler  # scikit-learn class for feature scaling
      # Create a scaling object to scale the features
      scale = StandardScaler()
      # Fit the scaler and transform the data: feature scale all the variables
      # except the last column (y or label); assumes dataset is a NumPy array at this point
      dataset[:, :-1] = scale.fit_transform(dataset[:, :-1])