Machine Learning - Dataset Preparation

Machine Learning
Dataset Preparation
Portland Data Science Group
Created by Andrew Ferlitsch
Community Outreach Officer
July, 2017

Dataset Preparation
• Prior to using a dataset to train a model, the dataset
must be prepared.
1. Import the data
2. Clean the data (Data Wrangling)
3. Replace Missing Values
4. Categorical Value Conversion
5. Feature Scaling

Importing the Dataset
• Datasets are generally imported as raw data files
(e.g., US Census) or via an API service (e.g., NWS
Weather Data SOAP API).
• Datasets are generally in the form of CSV, JSON or XML
data format.
• For the purpose of this tutorial, CSV is used in the
accompanying examples.

Importing the Dataset - Python
import pandas as pd                  # use pandas library for data frames
dataset = pd.read_csv('data.csv')    # read CSV file into a data frame
• pd.read_csv: function to read a CSV file
• 'data.csv': pathname to raw data file
• dataset: CSV data converted to data frame
Example Data (CSV File):
Age, Gender, Income, Spending
22,M,18000,6000
25,F,30000,8000
31,F,35000,12000
35,M,40000,18000
Generated Data Frame (the data frame adds the indices in the first column):
   Age  Gender  Income  Spending
0   22       M   18000      6000
1   25       F   30000      8000
2   31       F   35000     12000
3   35       M   40000     18000
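The header row of the example CSV has a space after each comma, which pandas would otherwise keep as part of the column names. A minimal sketch, assuming the example data is held in an in-memory string (the name `csv_text` is illustrative) rather than in data.csv:

```python
import io
import pandas as pd

# The example CSV from the slide, held in memory so the sketch is self-contained.
csv_text = """Age, Gender, Income, Spending
22,M,18000,6000
25,F,30000,8000
31,F,35000,12000
35,M,40000,18000
"""

# skipinitialspace=True strips the blank that follows each comma in the header row.
dataset = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(dataset.columns.tolist())  # column names without leading spaces
print(dataset.index.tolist())    # the row indices the data frame adds
```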

Cleaning the Data (Data Wrangling)
• It is not uncommon for datasets to have some dirty
data entries (i.e., samples, rows in CSV file, …)
• Common Problems
• Bad Character Encodings (Funny Characters)
• Misaligned Data (e.g., row has too few/many columns)
• Data in wrong format.
"Great Britain and the United States are two of the few places in the world that use a period to indicate the
decimal place. Many other countries use a comma instead. Likewise, while the U.K. and U.S. use a comma to
separate groups of thousands, many other countries use a period instead, and some countries separate
thousands groups with a thin space."
Source: https://docs.oracle.com/cd/E19455-01/806-0169/overview-9/index.html
• Data Wrangling is an expertise/occupation all its own.
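Numbers written with a decimal comma can often be handled at import time. The sketch below assumes a semicolon-delimited file with period-grouped thousands; the sample values and the names `csv_text` and `prices` are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical data: ';' separates fields, '.' groups thousands, ',' marks decimals.
csv_text = """Item;Price
A;1.234,50
B;999,99
"""

# Tell pandas which characters act as the thousands separator and the decimal point.
prices = pd.read_csv(io.StringIO(csv_text), sep=";", thousands=".", decimal=",")

print(prices["Price"].tolist())
```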

Common Practices in Data Wrangling
• Know the character encoding of the data file and
intended character encoding of the data.
Convert the data encoding format of the file if necessary.
e.g., Notepad++ -> Encodings
• Know the data format of the source and expected
data format.
Convert the data format using a batch preprocessing file.
e.g., 1 000 000 -> 1,000,000
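Both conversions can be scripted as part of a batch preprocessing step. This is a rough sketch only; the helper names `regroup_thousands` and `to_utf8` are hypothetical, and it assumes the source encoding of the file is already known:

```python
import re

def regroup_thousands(text: str) -> str:
    """Convert a space-grouped integer such as '1 000 000' to '1,000,000'."""
    digits = re.sub(r"\s", "", text)  # drop any whitespace, including thin spaces
    return f"{int(digits):,}"

def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

print(regroup_thousands("1 000 000"))
print(to_utf8("café".encode("latin-1"), "latin-1"))
```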

Replace Missing Values
• Not unusual for samples (rows) to contain missing (blank)
entries, or not a number (NaN).
• Most Machine Learning algorithms cannot handle blank/NaN entries!
• Need to replace the blank/NaN entry with something
meaningful.
• Delete the rows (generally not desirable)
• Replace with a Single Value
• Mean Average
• Multivariate Imputation by Chained Equations (MICE)
https://msdn.microsoft.com/en-us/library/azure/dn906028.aspx
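The first two options can be tried directly in pandas before reaching for a dedicated imputer. A minimal sketch with an invented data frame that has one missing Income value:

```python
import numpy as np
import pandas as pd

# Illustrative data frame with one missing Income value.
dataset = pd.DataFrame({
    "Age": [22, 25, 31, 35],
    "Income": [18000.0, np.nan, 35000.0, 40000.0],
})

# Option 1: delete every row that contains a missing value.
dropped = dataset.dropna()

# Option 2: replace the missing value with the mean of its column.
filled = dataset.fillna({"Income": dataset["Income"].mean()})

print(len(dropped))
print(filled["Income"].tolist())
```

Deleting rows shrinks the training set, which is why replacement is usually preferred when only a few entries are missing.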

Missing Values – Mean Value
import numpy as np
from sklearn.impute import SimpleImputer # scikit-learn class for handling missing data
# Create imputer object to replace NaN values with the mean value of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit the imputer to column 2 of the original dataset (index starts at 0);
# the ':' selects all rows, and [2] keeps the selection two-dimensional
imputer = imputer.fit(dataset.iloc[:, [2]])
# Do the replacement and update the dataset (needs to be the same column used in fit)
dataset.iloc[:, 2] = imputer.transform(dataset.iloc[:, [2]]).ravel()

Categorical Variables
Age   Gender   Income
25    Male     25000
26    Female   22000
30    Male     45000
24    Female   26000
• Independent Variables (Features): Age (real values), Gender (categorical values)
• Dependent Variable (Label): Income (value to predict)

Dummy Variable Conversion
Known in scikit-learn as OneHotEncoder
For each categorical feature:
1. Scan the dataset and determine all the unique instances.
2. Create a new feature (i.e., dummy variable) in dataset, one
per unique instance.
3. Remove the categorical feature from the dataset.
4. For each sample (row), set a 1 in the feature (dummy
variable) that corresponds to that sample's categorical value.
5. Set a 0 in the remaining features (dummy variables) for that
categorical field.
6. Remove one dummy variable field (to avoid the dummy variable trap).
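The steps above can be sketched by hand in pandas. The data frame below is invented for illustration, and dropping the alphabetically first dummy variable in step 6 is an arbitrary choice:

```python
import pandas as pd

dataset = pd.DataFrame({
    "Age": [25, 26, 30, 24],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# Step 1: determine the unique instances of the categorical feature.
categories = sorted(dataset["Gender"].unique())

# Steps 2, 4 and 5: one new 0/1 column per unique instance.
for category in categories:
    dataset[category] = (dataset["Gender"] == category).astype(int)

# Step 3: remove the original categorical feature.
dataset = dataset.drop(columns="Gender")

# Step 6: remove one dummy variable (here the first one).
dataset = dataset.drop(columns=categories[0])

print(dataset)
```

In practice, `pd.get_dummies(dataset, columns=["Gender"], drop_first=True)` performs the same conversion in one call.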

Dummy Variable Trap
Gender   Male   Female
Male     1      0
Female   0      1
Male     1      0
Female   0      1
Need to drop one dummy variable!
Multicollinearity occurs when one variable predicts another,
i.e., x2 = (1 - x3), where x2 and x3 are the Male and Female dummy variables.
As a result, a regression analysis cannot distinguish between the
contribution of x2 and x3.
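One way to see the problem numerically is to check the rank of the feature matrix. This sketch assumes a regression with an intercept column of ones, which the slide does not state explicitly:

```python
import numpy as np

# Dummy columns from the slide: Male and Female.
male = np.array([1, 0, 1, 0])
female = np.array([0, 1, 0, 1])
intercept = np.ones(4)  # assumed intercept term of the regression

# With both dummies plus an intercept, one column is a combination of the others.
with_both = np.column_stack([intercept, male, female])
# Dropping one dummy removes the redundancy.
with_one = np.column_stack([intercept, male])

rank_both = np.linalg.matrix_rank(with_both)
rank_one = np.linalg.matrix_rank(with_one)
print(rank_both, with_both.shape[1])  # rank is lower than the number of columns
print(rank_one, with_one.shape[1])    # rank equals the number of columns
```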

Categorical Variable Conversion
from sklearn.compose import ColumnTransformer # scikit-learn class to apply an encoder to selected columns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder # scikit-learn classes for categorical variable conversion
# Create an encoder object to numerically (enumeration) encode categorical variables
labelEncoder = LabelEncoder()
# Fit the encoder to column 1 of the original dataset (index starts at 0) and encode its values;
# the ':' selects all rows, and both sides need to refer to the same column
dataset[dataset.columns[1]] = labelEncoder.fit_transform(dataset.iloc[:, 1])
# Create an encoder to convert the numerical encodings in column 1 to one-hot encoded dummy variables;
# drop='first' removes one dummy variable, remainder='passthrough' keeps the other columns
transformer = ColumnTransformer([('onehot', OneHotEncoder(drop='first'), [1])], remainder='passthrough')
# Replace the encoded categorical values with the dummy variables (placed in the first columns of the result)
dataset = transformer.fit_transform(dataset)

Feature Scaling
• If features do not have the same numerical scale
in values, this will cause issues in training a model.
• If the scale of one independent variable (feature) is
greater than another independent variable, the model
will give more importance (skew) to the independent
variable with the larger range.
• To eliminate this problem, one converts all the
independent variables to use the same scale.
• Normalization ( 0 to 1 )
• Standardization ( mean of 0, standard deviation of 1 )

Scaling Issue - Euclidean Distance
• Many machine learning models use the Euclidean distance
between two points in 2D Cartesian space:
$\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
• Given two independent variables (x1 = Age, x2 = Income)
and a dependent variable (y = Spending), this becomes for
a given sample (row) i:
$\sqrt{(x_{2i} - x_{1i})^2 + (y_i - y_i)^2} = \sqrt{(x_{2i} - x_{1i})^2}$
• If x1 or x2 has a substantially greater scale than the other,
the corresponding independent variable will dominate
the result, and will contribute more to the model.
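A rough illustration of the dominance effect, computing the distance between two samples from the example data on their unscaled (Age, Income) values. Comparing two rows this way is an assumption about how a distance-based model would use the features:

```python
import numpy as np

# Two samples described by (Age, Income), taken from the example data.
a = np.array([22.0, 18000.0])
b = np.array([35.0, 40000.0])

diff = b - a
distance = np.sqrt(np.sum(diff ** 2))

# Share of the squared distance contributed by each feature.
share = diff ** 2 / np.sum(diff ** 2)
print(distance)
print(share)  # the Income share is close to 1, the Age share close to 0
```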

Normalization or Standardization
• Feature Scaling means scaling features to the same scale.
• Normalization scales features between 0 and 1, retaining
their proportional range to each other.
$X' = \frac{x - \min(x)}{\max(x) - \min(x)}$
where x is the original value and X' is the new value.
• Standardization scales features to have a mean (μ) of 0
and standard deviation (σ) of 1.
$X' = \frac{x - \mu}{\sigma}$
where x is the original value, X' is the new value, μ is the mean,
and σ is the standard deviation.
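Both formulas can be applied by hand with NumPy. A small sketch using the Income column from the example data:

```python
import numpy as np

income = np.array([18000.0, 30000.0, 35000.0, 40000.0])

# Normalization: (x - min(x)) / (max(x) - min(x)) maps values into [0, 1].
normalized = (income - income.min()) / (income.max() - income.min())

# Standardization: (x - mean) / standard deviation gives mean 0 and std 1.
standardized = (income - income.mean()) / income.std()

print(normalized)
print(standardized.mean(), standardized.std())
```

Here `income.std()` is the population standard deviation (NumPy's default), which is also what scikit-learn's StandardScaler uses.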

Feature Scaling in Python
from sklearn.preprocessing import StandardScaler # scikit-learn class for feature scaling
# Create a scaling object to scale the features.
scale = StandardScaler()
# Fit the scaling object to the data and transform the data:
# feature scale all the variables except the last column (y or label).
# dataset is assumed to be a numeric NumPy array at this point.
dataset[:, :-1] = scale.fit_transform(dataset[:, :-1])
