Regression Methods in
Machine Learning
Splitting Datasets
Portland Data Science Group
Andrew Ferlitsch
Community Outreach Officer
July, 2017
Splitting Datasets
• To use a dataset in Machine Learning, the dataset is
first split into a training and test set.
• The training set is used to train the model.
• The test set is used to test the accuracy of the model.
• Typically, split 80% training, 20% test.
It’s About Training
Machine Learning is about using data to train a model
[Diagram: the dataset is split into training data and test data. The training data is used to train and produce the model; the test data is then used to determine the model's accuracy.]
Serial Splitting of the Dataset
• The simplest method of splitting a dataset is to split it serially.
• Take the first 80% of rows and put them into the training set.
• Take the remaining 20% of rows and put them into the test set.
import pandas as pd                  # pandas library

dataset = pd.read_csv("Data.csv")    # read in data as a pandas dataframe
nrows = dataset.shape[ 0 ]           # property shape[ 0 ] is the number of rows

# First 80% of rows (all columns) become the training set
train = dataset.iloc[ :int(nrows * .8), : ]
# Remaining 20% of rows (all columns) become the test set
test = dataset.iloc[ int(nrows * .8):, : ]
Random Splitting of the Dataset
• Another method is to pick rows at random.
• Scikit-learn has a built-in method for this: train_test_split.
# train_test_split has moved from sklearn.cross_validation to sklearn.model_selection
from sklearn.model_selection import train_test_split

# Assume the label is the last column in the dataset
X = dataset.iloc[ :, :-1 ]   # X is all the features (exclude last column)
y = dataset.iloc[ :, -1 ]    # y is the label (last column)

# Split the data, with 80% train and 20% test
# test_size is the split fraction; random_state seeds the random number generator
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Data Imbalance - Overfitting
• If the training data is overly unbalanced, then the model will predict a non-meaningful result.
• For example, if the model is a binary classifier (e.g., apple vs. pear) and nearly all the samples have the same label (e.g., apple), then the model will simply learn that everything is that label (apple).
• This is called overfitting. To prevent overfitting, there needs to be a fairly equal distribution of training samples for each class, or for each value range if the label is a real value (a stratified-split sketch follows below).
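A stratified split does not balance the classes themselves, but it keeps the label proportions of the original dataset in both the training and test sets, so a minority class is not left out of either split. A minimal sketch, reusing the X and y dataframes from the random-splitting slide and scikit-learn's stratify option of train_test_split (an illustration, not code from the slide):

from sklearn.model_selection import train_test_split

# Preserve the apple/pear label proportions in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 0, stratify = y)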
Data Imbalance - Overfitting
[Diagram: training data that is 90% apple and 10% pear is used to train the model. In an imbalance, the model fits itself to the imbalance, not the predictor: since nearly all samples are apples, a test sample that is a pear will most likely be predicted as an apple.]
Cross Validation
[Diagram: the dataset is split into training data and test data. The training data is further split, with a randomly picked split, into training data and validation data. The training data is used to train the model, and the validation data is used to predict the model's accuracy. Adjustments are made to the training method to improve accuracy, and the process is repeated until it converges to the desired accuracy.]
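One way to realize the diagram above is to carve the validation data out of the training data with a second call to train_test_split. This is a minimal sketch under that assumption, not code from the slide:

from sklearn.model_selection import train_test_split

# First split off the 20% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 0)

# Then randomly split the remaining data into training and validation data
# (25% of the remaining 80%, i.e. 20% of the original dataset)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size = 0.25, random_state = 0)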
K-Fold Cross Validation
• K-Fold is a well-known form of cross validation (a sketch using scikit-learn follows the steps below).
• Steps:
1. Partition the dataset into k equal-sized partitions.
2. Select one partition as the validation data.
3. Use the remaining k-1 partitions as the training data.
4. Train the model and determine its accuracy from the validation data.
5. Repeat the process k times, selecting a different partition each time as the validation data.
6. Average the accuracy results.
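A minimal sketch of these steps with scikit-learn's KFold, reusing the X and y dataframes from the random-splitting slide; the choice of LinearRegression as the model is an illustrative assumption, since the slide does not name an estimator.

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

model = LinearRegression()                              # assumed estimator for illustration
kf = KFold(n_splits = 5, shuffle = True, random_state = 0)   # k = 5 partitions

scores = []
for train_idx, val_idx in kf.split(X):
    # k-1 partitions for training, 1 partition for validation
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    model.fit(X_train, y_train)
    scores.append(model.score(X_val, y_val))            # score on the validation partition

print(sum(scores) / len(scores))                        # average the results over the k folds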
