Abstract: This PDSG workshop introduces basic concepts of splitting a dataset for training a model in machine learning. Concepts covered are training, test and validation data, serial and random splitting, data imbalance and k-fold cross validation.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
1 of 9
More Related Content
Machine Learning - Splitting Datasets
1. Regression Methods in
Machine Learning
Splitting Datasets
Portland Data Science Group
Andrew Ferlitsch
Community Outreach Officer
July, 2017
2. Splitting Datasets
• To use a dataset in Machine Learning, the dataset is
first split into a training and test set.
• The training set is used to train the model.
• The test set is used to test the accuracy of the model.
• Typically, split 80% training, 20% test.
3. It’s About Training
Machine Learning is about using data to train a model
DATA
Training
Data
Test
Data
Train
Split Dataset into Training and Test
Model
Use Training Data to Train the Model
Produce Model
Test the Model Accuracy
Determine
Accuracy
4. Serial Splitting of the Dataset
• Simplest method of splitting data is to split it serially.
• Take first 80% rows and put into training set.
• Take remaining 20% rows and put into test set.
import pandas as pd # pandas library
dataset = pd.read_csv("Data.csv") # read in data as panda dataframe
nrows = dataset.shape[ 0 ] # property shape[ 0 ] is the number of rows
train = dataset.iloc[ 1: int(nrows * .8) , : ]
80% rows 20% rows all columns
test = dataset.iloc[int(nrows * .8) +1, nrows, : ]
5. Random Splitting of the Dataset
• Another method is too pick rows at random.
• Sci-kit learn has built-in method
from sklearn.cross_validation import train_test_split
ncols = dataset.shape[ 1 ] # property shape[ 0 ] is the number of columns
# Assume label is last column in dataset
X = dataset.iloc[ :, :-1 ] # X is all the features (exclude last column)
y = dataset.iloc[ :, ncols ] # Y is the label (last column)
# Split the data, with 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
split size
seed for random
Number generator
6. Data Imbalance - Overfitting
• If the training data is overly unbalanced, then the model
will predict a non-meaningful result.
• For example, if the model is a binary classifier (e.g.,
apple vs. pear), and nearly all the samples are of the
same label (e.g., apple), then the model will simply
learn that everything is a that label (apple).
• This is called overfitting. To prevent overfitting, there
needs to be a fairly equal distribution of training samples
for each classification, or range if label is a real value.
7. Data Imbalance - Overfitting
Training
Data
(90%
Apple,
10%
Pear)
Train
In an imbalance, the model will fit itself to the imbalance, not the predictor.
Nearly all samples are an Apple
ModelPear
Test Sample is a Pear
Apple
The model will most likely predict
the pear as an apple.
8. Cross Validation
DATA
Training
Data
Test
Data
Train
Split Dataset into Training and Test
Model
Use Training Data to Train the Model
Accuracy
Predict
Accuracy
Training
Data
Validation
Data
Use Validation Data
To predict accuracy
of the model.
Make adjustments to training
method to improve accuracy
Randomly pick split
Repeat Process until Converge to Desired Accuracy
9. K-Fold Cross Validation
• K-Fold is a well-known form of cross validation.
• Steps:
1. Partition the dataset into k equal sized partitions.
2. Select one partition as the validation data.
3. Use the remaining k-1 as the training data.
4. Train the model and determine accuracy from
the validation data.
5. Repeat the process k times, selecting a different
partition each time for the validation data.
5. Average the accuracy results.