
DEPARTMENT OF APPLIED MATHEMATICS, COMPUTER SCIENCE AND STATISTICS

HYPERPARAMETER OPTIMIZATION
Big Data Science (Master in Statistical Data Analysis)
PARAMETER OPTIMIZATION
̶ So far, we have talked about parameter optimization:
̶ Our model contains trainable parameters
̶ We define a loss function
̶ An optimization algorithm searches for the parameters that minimize the loss:
‒ Analytic solutions
‒ Newton-Raphson
‒ (Stochastic) gradient descent
‒ ...

2
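
For concreteness, below is a minimal sketch (not from the slides) of one such algorithm, plain batch gradient descent for a linear model; the toy data, learning rate, and iteration count are arbitrary choices for illustration. Note that the learning rate is itself a hyperparameter, which foreshadows the next slide.

import numpy as np

# Toy data: y = 2x + 1 plus noise (arbitrary example data)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.normal(size=100)

# Trainable parameters: weight w and bias b
w, b = 0.0, 0.0
lr = 0.1                  # learning rate: fixed before training, i.e. a hyperparameter
for _ in range(500):
    pred = w * X[:, 0] + b
    grad_w = 2 * np.mean((pred - y) * X[:, 0])   # d(MSE)/dw
    grad_b = 2 * np.mean(pred - y)               # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b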
HYPERPARAMETER OPTIMIZATION
̶ Most models also have hyperparameters:
̶ Fixed before training the model
̶ Involve assumptions of the model
̶ Not taken into account by the gradient of the loss function that is optimized

3
EXAMPLES OF HYPERPARAMETERS
Linear models:
• Regularization constant

Random Forest:
• Number of trees
• Maximum depth
• Minimum leaf size
• Criterion for split
• Number of features per split
• ...

SVM:
• Kernel
• Margin
• Kernel parameters:
  • Polynomial degree
  • Gaussian kernel width
  • ...

Neural networks:
• Architecture
• Number of layers
• Size of each layer
• Activation function
• Dropout
• Regularization
• ...

KNN:
• 𝐾
• Distance metric
• Parameters of approximate structures

4
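
As a sketch of where such hyperparameters appear in practice, assuming scikit-learn (the specific values below are arbitrary examples; they are all fixed before calling fit):

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=8,           # maximum depth
    min_samples_leaf=5,    # minimum leaf size
    criterion="gini",      # criterion for split
    max_features="sqrt",   # number of features per split
)
svm = SVC(C=1.0, kernel="rbf", gamma=0.1)                      # margin constant, kernel, kernel width
knn = KNeighborsClassifier(n_neighbors=7, metric="euclidean")  # K, distance metric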
CHOOSING HYPERPARAMETERS
̶ Manual search
̶ Grid search
̶ Random search
̶ Automated methods:
̶ Bayesian optimization
̶ Evolutionary optimization

5
MANUAL TUNING
̶ Using assumptions or knowledge to select the hyperparameters

̶ Pros:
̶ Computationally efficient

̶ Cons:
̶ Requires manual effort
̶ Prone to bias
̶ Only a limited number of combinations is tested

6
GRID SEARCH
̶ For each hyperparameter, define a subset of values to be tested
̶ Exhaustively test all combinations of those values

̶ Pros:
̶ The individual effect of parameters can be studied
̶ Cons:
̶ The number of combinations can become very high
̶ Few values are tested for every parameter
̶ The combined effect of parameters is not completely modeled

7
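
A minimal grid search sketch, assuming scikit-learn's GridSearchCV and its iris toy dataset; the grid values are arbitrary examples:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
param_grid = {
    "C": [0.1, 1, 10],          # 3 values
    "gamma": [0.01, 0.1, 1],    # 3 values -> 9 combinations in total
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)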
RANDOM SEARCH
̶ A probability distribution is specified for each parameter
̶ Samples are drawn and tested

̶ Pros:
̶ The combined effect of parameters is somewhat modeled
̶ More values per parameter can be considered
̶ Cons:
̶ The search is not guided
̶ The individual effect of parameters is not clear

8
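
A minimal random search sketch, assuming scikit-learn's RandomizedSearchCV and SciPy's loguniform distribution; the distributions and the budget of 30 samples are arbitrary examples:

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
param_distributions = {
    "C": loguniform(1e-2, 1e2),      # log-uniform over four orders of magnitude
    "gamma": loguniform(1e-3, 1e1),
}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=30, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)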
GRID VS RANDOM

[Figure: comparison of grid and random sampling of hyperparameter values]

J. Bergstra and Y. Bengio, "Random Search for Hyper-Parameter Optimization", Journal of Machine Learning Research 13 (2012), pp. 281-305

9
HYPERPARAMETER OPTIMIZATION AS AN OPTIMIZATION PROBLEM
10
AUTOMATED HYPERPARAMETER OPTIMIZATION
̶ Why not solve hyperparameter optimization in the
same way as parameter optimization?

̶ Main approaches:
̶ Bayesian optimization
̶ Evolutionary algorithms

11
SEQUENTIAL MODEL-BASED BAYESIAN OPTIMIZATION (SMBO)
1. Query the function 𝑓 at 𝑡 values and record the resulting pairs S = {(𝜽ᵢ, 𝑓(𝜽ᵢ)) : i = 1, …, 𝑡}

2. For a fixed number of iterations:


1. Fit a probabilistic model ℳ to the pairs in S
2. Apply an acquisition function 𝑎(𝜽, ℳ) to select a promising input 𝜽 to evaluate next
3. Evaluate 𝑓(𝜽) and add the pair (𝜽, 𝑓(𝜽)) to S

12
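
A self-contained sketch of the SMBO loop above, assuming scikit-learn's Gaussian process regressor as the probabilistic model ℳ and a simple lower-confidence-bound acquisition function over a grid of candidates; the 1-D objective f is a toy stand-in for a validation error:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(theta):                         # toy objective to minimize
    return (theta - 0.3) ** 2 + 0.05 * np.sin(20 * theta)

candidates = np.linspace(0, 1, 200).reshape(-1, 1)

# 1. Query f at t initial values and record the pairs S
rng = np.random.default_rng(0)
thetas = rng.uniform(0, 1, size=(5, 1))
values = np.array([f(t[0]) for t in thetas])

# 2. For a fixed number of iterations
for _ in range(20):
    model = GaussianProcessRegressor().fit(thetas, values)    # 2.1 fit M to the pairs in S
    mu, sigma = model.predict(candidates, return_std=True)
    acquisition = mu - 1.96 * sigma                            # 2.2 lower confidence bound
    theta_next = candidates[np.argmin(acquisition)]
    thetas = np.vstack([thetas, theta_next])                   # 2.3 evaluate f and add to S
    values = np.append(values, f(theta_next[0]))

best_theta = thetas[np.argmin(values)]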
GENETIC ALGORITHMS
̶ Applying the principles of natural selection to optimization
̶ Solutions are encoded as "chromosomes"
̶ A crossover operator combines two chromosomes into new ones
̶ A mutation operator introduces random mutations

1. Generate an initial population of solutions


2. For a number of generations:
1. Apply the crossover operator to increase the population size
2. Apply the mutation operator
3. Evaluate the new solutions
4. Discard some "bad" solutions to maintain a "good" population

13
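
A minimal sketch of this loop for two numeric hyperparameters; the chromosome encoding, operators, and toy fitness function are arbitrary illustrations rather than a prescribed implementation:

import random

def fitness(chrom):                          # higher is better (toy stand-in for CV performance)
    c, gamma = chrom
    return -((c - 1.0) ** 2 + (gamma - 0.1) ** 2)

def crossover(a, b):                         # combine two chromosomes into two new ones
    return (a[0], b[1]), (b[0], a[1])

def mutate(chrom, rate=0.2):                 # random perturbation of each gene
    return tuple(g * random.uniform(0.8, 1.2) if random.random() < rate else g
                 for g in chrom)

# 1. Generate an initial population of solutions
population = [(random.uniform(0.01, 10), random.uniform(0.001, 1)) for _ in range(10)]

# 2. For a number of generations
for _ in range(30):
    random.shuffle(population)
    children = []
    for a, b in zip(population[::2], population[1::2]):
        children.extend(crossover(a, b))             # 2.1 crossover to grow the population
    children = [mutate(c) for c in children]         # 2.2 mutation
    population = sorted(population + children,       # 2.3 evaluate, 2.4 keep the best
                        key=fitness, reverse=True)[:10]

best = population[0]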
PARTITIONING
14
PARTITIONING FOR HYPERPARAMETER OPTIMIZATION
̶ Remember: NEVER TRAIN ON THE TEST SET

̶ This also applies when optimizing hyperparameters

15
TEST SET + CROSS VALIDATION
[Diagram: the data is split into a training portion and a held-out test set; cross-validation rotates the validation fold within the training portion]
16
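
A sketch of this partitioning in code, assuming scikit-learn: hyperparameters are selected by cross-validation on the training portion only, and the held-out test set is touched exactly once at the end:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)   # cross-validation on the training portion only
search.fit(X_train, y_train)

final_score = search.score(X_test, y_test)                # the test set is used exactly once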


NESTED CROSS VALIDATION

[Diagram: nested cross-validation; an outer loop rotates the test fold, and within each outer training set an inner loop rotates the validation fold]


17
NESTED CROSS VALIDATION: EXAMPLE
̶ 5 folds
̶ 3 classifiers: Logistic Regression, Random Forest, SVM

̶ We want to know which classifier is best suited to our problem


̶ We also want to optimize the hyperparameters of each classifier
̶ 3 inner folds for hyperparameter optimization
̶ The ultimate goal is to have a system in production making real predictions

18
NESTED CROSS VALIDATION: EXAMPLE
1. For each outer fold i in [1...5]:
   1. Validation set: fold i
   2. Training set: folds {1,2,3,4,5}\{i}
   3. Split the training set into 3 inner folds
   4. For each classifier 𝐶 in {LR, RF, SVM}:
      1. For each combination of hyperparameters 𝜃𝑐 for 𝐶:
         1. For each inner fold j in [1...3]:
            1. (Inner) validation set: fold j
            2. (Inner) training set: folds {1,2,3}\{j}
            3. Train classifier 𝐶(𝜃𝑐) on the inner training set
            4. Evaluate 𝐶(𝜃𝑐) on the inner validation set
         2. Calculate the average performance of 𝐶(𝜃𝑐) across the 3 inner folds
      2. Select the best-performing parameters 𝜃𝑐*(𝑖) for classifier 𝐶
      3. Evaluate 𝐶(𝜃𝑐*(𝑖)) on the (outer) validation set
2. Calculate the average performance of each 𝐶(𝜃𝑐*(𝑖)) across all outer validation folds
3. Select the best classifier 𝐶*
4. Select 𝜃*, the optimal hyperparameters for 𝐶*
5. Train 𝐶*(𝜃*) on the entire dataset

̶ Note that the best parameters 𝜃𝑐*(𝑖) for each classifier depend on the outer fold that was used for training

19
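
A sketch of the procedure above, assuming scikit-learn: GridSearchCV supplies the 3 inner folds and cross_val_score the 5 outer folds; the hyperparameter grids are arbitrary examples rather than the ones from the lecture:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "LR":  (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "RF":  (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 5]}),
    "SVM": (SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}),
}

outer_scores = {}
for name, (estimator, grid) in models.items():
    inner = GridSearchCV(estimator, grid, cv=3)                      # inner 3-fold hyperparameter search
    outer_scores[name] = cross_val_score(inner, X, y, cv=5).mean()   # outer 5-fold performance estimate

best_name = max(outer_scores, key=outer_scores.get)                  # select the best classifier
# Final model: rerun the hyperparameter search for the winner on the entire dataset
final_model = GridSearchCV(*models[best_name], cv=3).fit(X, y)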
