Validation Slides

The document discusses techniques for estimating the performance of machine learning models, including validation, resampling methods like cross validation and bootstrapping, and estimating bias and variance. Cross validation techniques like k-fold and leave-one-out are described, as well as how to use resampling to estimate a model's bias and variance. The text also recommends splitting data into separate training, validation, and test sets.


Machine Learning and Network Analysis (MA4207)
Performance Estimation
Pattern recognition techniques usually have one or more free parameters
◼ The number of neighbors in a kNN classifier
◼ The feature dimension in feature extraction or selection process
Model Selection: How do we select the “optimal” parameter(s) for a given
classification problem?
Validation: After model selection, how do we estimate its true error rate?
◼ The true error rate is the classifier’s error rate when tested on the ENTIRE POPULATION
◼ Choose the model that provides the lowest error rate on the entire population
Only a finite set of examples is available
◼ The training set is often smaller than required (e.g., biomedical data)
◼ Data collection and storage are expensive
Using the entire training data to both train and select the best classifier gives rise to two problems
◼ Overfitting
◼ The computed error rate is typically optimistic (apparent accuracy approaching 100% in some cases)
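As a concrete illustration of the second problem, the error rate measured on the data used for training is far too optimistic. A minimal sketch (assuming NumPy and scikit-learn are available; the synthetic dataset and the 1-NN classifier are illustrative stand-ins, not part of the slides):

```python
# Resubstitution error: train and evaluate on the same data.
# A 1-NN classifier memorizes the training set, so its apparent error is ~0%.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("error rate on the training data:", 1 - clf.score(X, y))  # ~0.0, wildly optimistic
```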
Performance Estimation

◼ Training
◼ Model selection
◼ Resampling Techniques
◼ Cross Validation
◼ Bootstrap
◼ Jackknife
The holdout method
Split dataset into two groups
◼ Training set: used to train the classifier
◼ Test set: used to estimate the error rate of the trained classifier
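A minimal sketch of the holdout split (scikit-learn assumed; the synthetic data and kNN classifier are placeholders, not part of the slides):

```python
# Holdout method: one split into a training set and a test set;
# the error rate on the held-out test set estimates the true error rate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
holdout_error = 1 - clf.score(X_te, y_te)   # single train-and-test estimate
```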
The holdout method has two basic drawbacks
◼ For a sparse dataset, setting aside a portion of the data for testing may not be feasible
◼ With a single train-and-test experiment, the holdout estimate of the error rate depends on how the data happen to be split into training and test sets
These limitations of the holdout can be overcome with a family of resampling
methods at the expense of higher computational cost
◼ Cross validation
◼ Random subsampling
◼ K-fold cross-validation
◼ Leave-one-out cross-validation
◼ Bootstrap
◼ Jackknife
Random subsampling

Random subsampling performs K data splits of the entire dataset


◼ Each data split randomly selects a (fixed) number of examples without
replacement
◼ For each data split we retrain the classifier from scratch with the training
examples and then estimate 𝐸𝑖 with the test examples
◼ The true error estimate is obtained as the average of the separate estimates 𝐸𝑖
◼ This estimate is significantly better than the holdout estimate
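A sketch of random subsampling under the same illustrative setup (synthetic data and kNN are stand-ins, not from the slides):

```python
# Random subsampling: K independent random splits; within each split the test
# examples are drawn without replacement, and the error estimates E_i are averaged.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
K, n_test = 20, 60                                   # number of splits, test size per split
errors = []
for _ in range(K):
    perm = rng.permutation(len(y))                   # one random split (no replacement)
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(1 - clf.score(X[test_idx], y[test_idx]))   # E_i
true_error_estimate = np.mean(errors)                # average of the separate E_i
```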
K-fold Cross Validation

Create a K-fold partition of the dataset


◼ For each of 𝐾 experiments, use 𝐾−1 folds for training and a different fold for
testing
◼ K-Fold cross validation is similar to random subsampling
◼ The advantage of K-fold CV is that all the examples in the dataset are eventually used for both training and testing
◼ As before, the true error is estimated as the average error rate on test examples
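The same idea sketched with scikit-learn's KFold (the dataset and classifier are again illustrative choices):

```python
# K-fold cross-validation: K experiments, each using K-1 folds for training
# and the remaining fold for testing; every example is tested exactly once.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(1 - clf.score(X[test_idx], y[test_idx]))
kfold_error = np.mean(errors)                        # average error over the K folds
```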
Leave-one-out cross validation
◼ LOO is the degenerate case of K-fold CV, where K is chosen as the total number of examples
◼ For a dataset with 𝑁 examples, perform 𝑁 experiments
◼ For each experiment use 𝑁−1 examples for training and the remaining example
for testing
◼ As usual, the true error is estimated as the average error rate on test examples
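Leave-one-out is just the K = N special case; with scikit-learn it can be written as follows (illustrative setup as above):

```python
# Leave-one-out CV: N experiments, each testing on a single held-out example.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
loo_error = 1 - scores.mean()                        # average error over the N experiments
```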
Selecting the number of folds (K)
Large K
◼ The bias of the true error rate estimator will be small (the estimator will be very
accurate)
◼ The variance of the true error rate estimator will be large
◼ The computational time will be very large as well (many experiments)

Small K
◼ The number of experiments and, therefore, computation time are reduced
◼ The variance of the estimator will be small
◼ The bias of the estimator will be large (conservative or larger than the true error rate)

Choice for K depends on the size of the dataset


◼ For large datasets, even 3-fold cross validation will be quite accurate
◼ For very sparse datasets, we may have to use leave-one-out in order to train on as many
examples as possible

A commonly used value is K = 10


Bootstrap
Resampling technique with replacement
◼ Randomly select (with replacement) 𝑁 examples from the dataset of size 𝑁 and use this set for training
◼ The remaining examples that were not selected for training are used for testing
◼ This test set differs from fold to fold
◼ Repeat this process for a specified number of folds (𝐾)
◼ The true error is estimated as the average error rate on test data
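A sketch of bootstrap error estimation (synthetic data and kNN are illustrative; the examples never drawn into a bootstrap sample form the test set):

```python
# Bootstrap error estimation: train on N examples drawn with replacement,
# test on the examples that were not selected ("out-of-bag"), repeat K times.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
N, K = len(y), 100
errors = []
for _ in range(K):
    boot = rng.integers(0, N, size=N)                # N indices drawn with replacement
    oob = np.setdiff1d(np.arange(N), boot)           # examples not selected -> test set
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[boot], y[boot])
    errors.append(1 - clf.score(X[oob], y[oob]))
bootstrap_error = np.mean(errors)                    # average error over the K repetitions
```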
CV vs Bootstrap
Compared to basic CV, the bootstrap increases the variance
that can occur in each fold [Efron and Tibshirani, 1993]
◼ This is a desirable property, since it is a more realistic simulation of the real-life experiment from which our dataset was obtained
Consider a classification problem with 𝐶 classes, a total of 𝑁
examples and 𝑁𝑖 examples for each class 𝝎𝑖
◼ The a priori probability of choosing an example from class 𝜔i is 𝑁𝑖/𝑁
◼ Once we choose an example from class 𝜔i, if we do not replace it for the next selection, then the a priori probabilities will have changed, since the probability of choosing an example from class 𝜔i will now be (𝑁𝑖−1)/(𝑁−1)
◼ Thus, sampling with replacement preserves the a priori probabilities of the
classes throughout the random selection process *
◼ An additional benefit is that the bootstrap can provide accurate measures of
BOTH the bias and variance of the true error estimate

* Stratified CV is used to overcome the problem of maintaining similar class priors across folds
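A quick illustration of the footnote: scikit-learn's StratifiedKFold keeps the class priors roughly equal in every fold (the imbalanced synthetic dataset here is only an example):

```python
# Stratified K-fold preserves the class proportions N_i/N (approximately) in each fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
for _, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    print("fraction of class 1 in this fold:", np.mean(y[test_idx]))  # ~0.2 every time
```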


Jackknife
◼ One of the earliest resampling methods
◼ Leave one sample out and compute model
parameters on the remaining data
◼ Computationally simple
◼ Mean Estimate: 𝛼(·) = (1/𝑁) Σᵢ 𝛼(𝑖), where 𝛼(𝑖) is the statistic computed with the 𝑖-th example left out
◼ Variance Estimate: Var(𝛼) = ((𝑁−1)/𝑁) Σᵢ (𝛼(𝑖) − 𝛼(·))²
The jackknife is a linear approximation of the bootstrap and only works well for linear statistics (e.g., the mean). It fails to give accurate estimates for non-smooth (e.g., the median) and nonlinear (e.g., the correlation coefficient) statistics.
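A sketch of the jackknife estimates for the sample mean, using the same toy numbers as the bootstrap example later in the slides (NumPy assumed):

```python
# Jackknife: leave one example out at a time and recompute the statistic.
import numpy as np

x = np.array([3, 5, 2, 1, 7], dtype=float)
N = len(x)
theta_i = np.array([np.delete(x, i).mean() for i in range(N)])   # leave-one-out replicates
theta_dot = theta_i.mean()                                       # jackknife mean estimate
var_jack = (N - 1) / N * np.sum((theta_i - theta_dot) ** 2)      # jackknife variance estimate
# For the sample mean this reduces to the usual s^2 / N.
```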
Bias and Variance of a Statistical Estimate
Problem: estimate parameter 𝛼 of unknown distribution 𝐺
◼ To emphasize the fact that 𝛼 concerns 𝐺, we write 𝛼(𝐺)
Solution
◼ We collect 𝑁 examples 𝑋={𝑥1 ,𝑥2 …𝑥𝑁} from 𝐺
◼ 𝑋 defines a discrete distribution 𝐺’ with mass 1/𝑁 at each example
◼ We compute the statistic 𝛼’=𝛼(𝐺’) as an estimator of 𝛼(𝐺)
◼ e.g., 𝛼(𝐺’) may be the estimate of the true error rate for a classifier
How good is this estimator?
◼ Bias: how much it deviates from the true value: Bias(𝛼’) = E[𝛼’] − 𝛼
◼ Variance: how much variability it shows for different samples: Var(𝛼’) = E[(𝛼’ − E[𝛼’])²]

Bias and variance of the sample mean

Bias: the sample mean is known to be an unbiased estimator of the population mean, i.e., E[𝑥̄] = 𝜇

Variance?
The standard deviation of the sample mean is equal to 𝜎/√𝑁, where 𝜎 is the standard deviation of the underlying distribution

This term is also known in statistics as the STANDARD ERROR
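A short Monte Carlo check of the standard-error formula (NumPy assumed; the normal distribution and sample size are arbitrary choices):

```python
# Empirical check: the std of the sample mean matches sigma / sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
sigma, N = 2.0, 50
means = rng.normal(0.0, sigma, size=(10_000, N)).mean(axis=1)    # 10,000 sample means
print("empirical std of the sample mean:", means.std())
print("sigma / sqrt(N):                 ", sigma / np.sqrt(N))   # ~0.283
```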


Unfortunately, there is no such neat algebraic formula for almost any estimate other than the sample mean
Bootstrap Estimates
The bootstrap allows us to estimate bias and variance for practically any statistical
estimate, be it a scalar or vector (matrix)
◼ Here we will only describe the estimation procedure
◼ For more details refer to “Advanced algorithms for neural networks” [Masters,
1995], which provides an excellent introduction
Approach
◼ Consider a dataset of 𝑁 examples 𝑋 ={𝑥1 ,𝑥2 …𝑥𝑁} from distribution 𝐺
◼ This dataset defines a discrete distribution 𝐺’
◼ Compute 𝛼’=𝛼(𝐺’) as our initial estimate of 𝛼(𝐺)
◼ Let {𝑥1∗,𝑥2∗,…,𝑥𝑁∗} be a bootstrap dataset drawn from 𝑋 ={𝑥1 ,𝑥2 …𝑥𝑁}
◼ Estimate the parameter 𝛼 using this bootstrap dataset, obtaining 𝛼∗(𝐺∗)
◼ Generate 𝐾 bootstrap datasets and obtain 𝐾 estimates
{𝛼∗1(𝐺∗),𝛼∗2(𝐺∗)…,𝛼∗𝐾(𝐺∗)}
The bias and variance estimates of 𝛼′ are
Bias(𝛼′) = 𝛼∗(·) − 𝛼′  and  Var(𝛼′) = (1/(𝐾−1)) Σₖ (𝛼∗ₖ(𝐺∗) − 𝛼∗(·))², where 𝛼∗(·) = (1/𝐾) Σₖ 𝛼∗ₖ(𝐺∗)
◼ The effect of generating a bootstrap dataset from the distribution 𝐺’ is similar to


the effect of obtaining the dataset 𝑋={𝑥1,𝑥2 …,𝑥𝑁} from the original distribution 𝐺
◼ In other words, the distribution {𝛼∗1(𝐺∗),𝛼∗2(𝐺∗)…,𝛼∗𝐾(𝐺∗)} is related to the initial
estimate 𝛼′ in the same fashion that multiple estimates 𝛼′ are related to the true
value 𝛼
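The procedure above can be sketched for an arbitrary statistic as follows (NumPy assumed; the helper name bootstrap_bias_variance is purely illustrative):

```python
# Bootstrap estimates of the bias and variance of a statistic alpha.
import numpy as np

def bootstrap_bias_variance(x, alpha, K=500, seed=0):
    """Bias and variance of alpha(x) estimated from K bootstrap datasets."""
    rng = np.random.default_rng(seed)
    alpha_prime = alpha(x)                                        # initial estimate alpha(G')
    alpha_star = np.array([alpha(rng.choice(x, size=len(x), replace=True))
                           for _ in range(K)])                    # K bootstrap estimates
    bias = alpha_star.mean() - alpha_prime                        # alpha*(.) - alpha'
    var = np.sum((alpha_star - alpha_star.mean()) ** 2) / (K - 1)
    return bias, var

x = np.array([3, 5, 2, 1, 7], dtype=float)
print(bootstrap_bias_variance(x, np.mean))                        # bias and variance of the mean
```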
Example
◼ Assume a small dataset 𝑥={3,5,2,1,7}, and we want to compute the bias and
variance of the sample mean 𝛼′=3.6
We generate a number of bootstrap samples (three in this case)
◼ Assume that the first bootstrap yields the dataset {7,3,2,3,1}
◼ the sample mean 𝛼∗1=3.2
◼ The second bootstrap sample yields the dataset {5,1,1,3,7}
◼ the sample mean 𝛼∗2=3.4
◼ The third bootstrap sample yields the dataset {2,2,7,1,3}
◼ the sample mean 𝛼∗3=3.0
◼ Averaging these estimates gives 𝛼∗(·) = 3.2
What are the bias and variance of the sample mean 𝛼’?
Bias(𝛼’) = 3.2 − 3.6 = −0.4
◼ Resampling introduced a downward bias on the mean, so we would be inclined to use 3.6 + 0.4 = 4.0 as an unbiased estimate of 𝛼
Var(𝛼’) = 1/2 × [(3.2 − 3.2)² + (3.4 − 3.2)² + (3.0 − 3.2)²] = 0.04
◼ NOTES
Example given for the sample mean, but 𝛼 could be any other statistical operator!
How many bootstrap samples should we use?
As a rule of thumb, several hundred resamples will suffice for most problems
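The arithmetic of the worked example can be checked directly by hard-coding the three bootstrap samples given above:

```python
# Reproducing the slide's numbers: bias = -0.4, variance = 0.04.
import numpy as np

alpha_prime = np.mean([3, 5, 2, 1, 7])                            # 3.6
boot_means = np.array([np.mean([7, 3, 2, 3, 1]),                  # 3.2
                       np.mean([5, 1, 1, 3, 7]),                  # 3.4
                       np.mean([2, 2, 7, 1, 3])])                 # 3.0
bias = boot_means.mean() - alpha_prime                            # 3.2 - 3.6 = -0.4
var = np.sum((boot_means - boot_means.mean()) ** 2) / (len(boot_means) - 1)   # 0.04
```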
Three Way Data Split
If model selection and true error rate estimation are to be carried out simultaneously, the data should be divided into three disjoint sets [Ripley, 1996]
◼ Training set: used for learning, e.g., to fit the parameters of the classifier
◼ Validation set: used to select among several trained classifiers
◼ Test set: used only to assess the performance of a fully-trained classifier
Why separate test and validation sets?
◼ The error rate of the final model on validation data will be biased (smaller than the true error rate)
since the validation set is used to select the final model
◼ After assessing the final model on the test set, YOU MUST NOT tune the model any further!
Procedure outline
1. Divide the available data into training, validation and test sets
2. Select an architecture and training parameters
3. Train the model using the training set
4. Evaluate the model using the validation set
5. Repeat steps 2 through 4 using different architectures and training parameters
6. Select the best model and train it using data from the training and validation sets
7. Assess this final model using the test set

This outline assumes a holdout method; if CV or bootstrap are used, steps 3 and 4 have to be repeated for each of the K folds
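A sketch of the three-way procedure for a holdout setting (scikit-learn assumed; the split proportions, the synthetic data and the kNN candidate models are illustrative choices, not prescribed by the slides):

```python
# Three-way split: fit on the training set, select the model on the validation set,
# and assess the chosen model once on the untouched test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

# Model selection: pick the free parameter (here k of kNN) on the validation set.
val_scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
              for k in (1, 3, 5, 7)}
best_k = max(val_scores, key=val_scores.get)

# Retrain on training + validation data, then assess once on the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trval, y_trval)
test_error = 1 - final.score(X_test, y_test)    # no further tuning after this point
```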
