
Machine Learning Basics

The document provides an overview of machine learning basics and concepts that are important for understanding deep learning. It outlines topics like learning algorithms, overfitting and underfitting, hyperparameters, supervised and unsupervised learning, stochastic gradient descent, and challenges that motivated deep learning. The document aims to give the reader a solid foundation in machine learning principles before delving into deep learning.

Uploaded by

20010700
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd

Deep Learning

Machine Learning Basics:


Learning Algorithms


Overview of discussion
• Deep learning is a specific kind of Machine
Learning
– To understand deep learning well, one needs a
solid understanding of basic principles of ML
• This chapter is a brief course on most
important general principles of ML
• Such as discussed in Intro to ML (Bishop 2006)

Plan of discussion

– Definition of a learning algorithm
• Example of linear regression algorithm
– How fitting data differs from generalizing to new data
– ML algorithms have hyperparameters that are determined outside the learning algorithm; how to set them using additional data
– Statistical approaches: frequentist and Bayesian
– Supervised versus unsupervised learning
– Optimization using stochastic gradient descent
– Combining components into an ML algorithm
• Optimization, cost function, model, and dataset
– Factors that have limited the ability of traditional ML to generalize
• Motivating deep learning to overcome these obstacles

Topics in Machine Learning Basics


1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning

1. Learning Algorithms

• An ML algorithm is an algorithm that is able to learn from data
– But what do we mean by learning?
– Definition (well-posed learning problem):
• A computer program is said to learn from experience E
• with respect to some class of tasks T and performance measure P,
• if its performance at tasks in T, as measured by P, improves with experience E
• One can imagine a wide variety of experiences E, tasks T, and performance measures P
– We intuitively describe what these are

The Task, T

• ML enables tackling tasks too difficult to solve with fixed programs written and designed by human beings
– Scientific and philosophical perspective:
• ML is interesting because developing ML entails understanding principles that underlie intelligence
• Process of learning itself is not the task
– Learning is our means of attaining ability to perform the task
• E.g., if we want a robot to be able to walk, then walking is the task
• We could either program the robot to learn to walk or we could directly write a program manually for walking

Machine Learning Task Description


• Usually described in terms of how the machine learning system should process an example
• An example is a collection of features that have been quantitatively measured for some object/event that we want the ML system to process
• Typically represent an example as a vector $x \in \mathbb{R}^n$ where each entry $x_i$ of the vector is another feature

Inference in Machine Learning


[Figure: training versus inference. Training: an untrained neural net model is fit to a training dataset. Inference: the trained model is applied to new data.]
Source: https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/

Kinds of Tasks solved using ML


1. Classification
2. Classification with missing inputs
3. Regression
4. Transcription
5. Machine Translation
6. Structured Output
7. Anomaly Detection
8. Synthesis and Sampling
9. Imputation of Missing Values
10. Denoising
11. Density Estimation

1. Classification
• Computer program asked to specify which of k categories some input belongs to
– Learning algorithm is asked to produce a function $f: \mathbb{R}^n \to \{1, \dots, k\}$, where n = no. of input variables
• When $y = f(x)$, model assigns input vector x to a category identified by a numeric code y
• Other variants of classification task:
– f outputs a probability distribution over classes
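A minimal sketch of both variants, assuming a linear model followed by a softmax. The model choice and the parameter values in WEIGHTS and BIASES are hypothetical and are not taken from the slides.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(x, weights, biases):
    """Variant of f that outputs a distribution over the k classes."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)

def classify(x, weights, biases):
    """f: R^n -> {1, ..., k}; returns the code of the most probable class."""
    probs = class_probabilities(x, weights, biases)
    return probs.index(max(probs)) + 1  # categories are coded 1..k

# Hypothetical parameters for n = 2 inputs and k = 3 categories.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
BIASES = [0.0, 0.0, 0.0]

x = [2.0, 0.5]
y = classify(x, WEIGHTS, BIASES)                 # numeric category code
p = class_probabilities(x, WEIGHTS, BIASES)      # distribution over classes
```

Here classify plays the role of f, while class_probabilities is the variant that returns a distribution instead of a single code.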

2. Classification with missing inputs

• Classification more challenging when not every measurement in its input vector is present
– Simple classification requires a single function mapping from a vector input to a category output
• When inputs are missing, must learn a set of functions
• Each corresponds to classifying x with a different subset of its inputs missing
– Ex: medical diagnosis when medical tests are expensive/invasive
– Solution: learn distribution over all variables, solve by marginalizing over missing variables
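The marginalization idea can be sketched with a small discrete example. The joint table JOINT below is hypothetical (two binary inputs, one binary label, illustrative numbers only); a real system would learn this distribution from data.

```python
# Hypothetical joint distribution p(x1, x2, y) over binary variables,
# stored as {(x1, x2, y): probability}. The numbers are illustrative only.
JOINT = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def posterior(observed):
    """Return p(y | observed inputs), summing the joint over missing inputs.

    `observed` maps an input index (0 for x1, 1 for x2) to its value;
    any index not present is treated as missing and marginalized out.
    """
    unnormalized = {0: 0.0, 1: 0.0}
    for (x1, x2, y), prob in JOINT.items():
        inputs = (x1, x2)
        if all(inputs[i] == v for i, v in observed.items()):
            unnormalized[y] += prob
    total = sum(unnormalized.values())
    return {y: prob / total for y, prob in unnormalized.items()}

def classify(observed):
    """Pick the most probable label given whichever inputs are present."""
    post = posterior(observed)
    return max(post, key=post.get)

label_with_x2_missing = classify({0: 1})       # only x1 = 1 is observed
label_with_all_inputs = classify({0: 0, 1: 0})
```

One learned distribution therefore serves every pattern of missing inputs, instead of one classifier per subset.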
3. Regression

• Computer program required to predict a numerical value given some input
• Algorithm is asked to output a function $f: \mathbb{R}^n \to \mathbb{R}$
– Task similar to classification except that format of output is different
– Ex: expected claim amount an insured person will make (used to set insurance premiums) or prediction of future prices of securities
• Also used for algorithmic trading

4. Transcription
• System observes unstructured representation of some kind of data and transcribes the information into discrete textual form
• OCR: input is an image of text
– asked to return this text in the form of a sequence of characters in ASCII/Unicode
• Google Streetview uses deep learning for address nos.
• Speech recognition: an audio waveform is converted to word ID codes
– Google Cloud Speech API recognizes 110 language variants

5. Machine Translation
• Input already consists of a sequence of symbols in some language
• Computer program must convert it into a sequence of symbols in another language
• Commonly applied to natural languages, e.g., English to French
• Deep learning is having an important impact

6. Structured Output

• Output data structure has relationships between elements
– Parsing: map sentence into a tree
• Nodes tagged as verbs, adverbs, ...
– Pixel-wise segmentation: every superpixel in image is assigned to a specific category
– Image captioning: observe an image and output a sentence describing the image
• Words produced by captioning program must form a valid sentence

7. Anomaly Detection
• Sifts through a set of events or objects and flags some of them as being unusual or atypical
• Ex: credit card fraud detection
– Model purchase habits to detect misuse of cards
• Thief’s purchases come from a different probability distribution than your own

8. Synthesis and Sampling


• Generate new samples similar to those in training data
• Useful when generating large volumes of content by hand is expensive, boring, or requires too much time. Examples:
– Video games: generate textures for large objects without requiring an artist to manually label each pixel
– Speech synthesis: provide input sentence and ask to generate audio waveform containing a spoken version, but with large variation to seem natural

9. Imputation of Missing Values

• ML algorithm is given a new example $x \in \mathbb{R}^n$
– But with some entries $x_i$ of x missing
– Algorithm must predict values of missing entries
• In R, missing values are indicated by NA’s. To see data in the Social Indicators Survey (picking rows 91–95), we type the R code
    cbind(sex, race, educ_r, r_age, earnings, police)[91:95,]
and get the R output
          sex race educ_r r_age earnings police
    [91,]   1    3      3    31       NA      0
    [92,]   2    1      2    37   135.00      1
    [93,]   2    3      2    40       NA      1
    [94,]   1    1      3    42     3.00      1
    [95,]   1    3      1    24     0.00     NA
• In regression, R excludes cases in which inputs are missing
• This limits information available in analysis, especially if there are many inputs with missingness
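To contrast dropping incomplete cases with filling them in, here is a small sketch on the earnings column of rows 91–95 shown above. Mean imputation is used only as a simple stand-in baseline; it is an assumption of this sketch, not the method of the slides, and an ML imputer would instead predict each missing entry from the other features.

```python
# Earnings column for rows 91-95 of the survey excerpt; None stands in for R's NA.
earnings = [None, 135.0, None, 3.0, 0.0]

def complete_cases(values):
    """Listwise deletion: keep only the observed entries."""
    return [v for v in values if v is not None]

def mean_impute(values):
    """Replace each missing entry by the mean of the observed entries."""
    observed = complete_cases(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

kept = complete_cases(earnings)   # 3 of the 5 rows survive deletion
imputed = mean_impute(earnings)   # all 5 rows are kept
```

Deletion discards two of the five rows, which is the loss of information the last bullet refers to.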

10. Denoising
• The ML algorithm is given as input a corrupted example $\tilde{x} \in \mathbb{R}^n$ obtained by an unknown corruption process
• Learner must predict the clean example x from its corrupted version $\tilde{x}$
• Or more generally the conditional probability distribution $p(x \mid \tilde{x})$

11. Density Estimation

• Learn a function $p_{model}: \mathbb{R}^n \to \mathbb{R}$ where $p_{model}(x)$ is a pdf if x is continuous and a pmf if x is discrete
– It must know where examples cluster tightly and where they are unlikely to occur
• If we explicitly capture the distribution we can solve other tasks as well
– If we can capture p(x) we can solve the missing value imputation task: if $x_i$ is missing and the values of the other entries $x_{-i}$ are given, then we use $p(x_i \mid x_{-i})$

The Performance Measure, P


• Need measure of performance specific to task T
– For classification, classification with missing inputs and transcription:
• Accuracy: proportion of examples for which correct output is produced
– Equivalent information is given by the error rate, the proportion of examples for which the model produces incorrect output (error rate = 1 − accuracy)
» Error rate is the expected 0-1 loss: 0 loss if correct, 1 loss if incorrect
– For density estimation
• Average log probability assigned to some examples
• Usually measured on data not seen before, a test set
• Often difficult to choose a good measure
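The relation between accuracy, error rate, and 0-1 loss can be sketched as follows. The target and prediction lists are made-up illustrative values.

```python
def zero_one_loss(target, prediction):
    """0 loss if the output is correct, 1 loss if it is incorrect."""
    return 0 if target == prediction else 1

def error_rate(targets, predictions):
    """Average 0-1 loss: proportion of examples with incorrect output."""
    losses = [zero_one_loss(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

def accuracy(targets, predictions):
    """Proportion of examples with correct output."""
    return 1.0 - error_rate(targets, predictions)

# Illustrative labels only.
targets = [1, 0, 2, 2, 1]
predictions = [1, 0, 1, 2, 0]
err = error_rate(targets, predictions)
acc = accuracy(targets, predictions)
```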

Example of Performance Measures


• Autonomous Vehicle Navigation
– Task T : driving on public highway using vision sensors
– Performance measure P : average distance traveled
before an error (as judged by human overseer)
– Training experience E : sequence of images and
steering commands recorded observing a human driver


Example of Performance Measures


• Text Categorization Problem
– Task T : assign a document to its content category
– Performance measure P : Precision and Recall
– Training experience E : Example pre-classified
documents
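Precision and recall for one content category can be computed as in the sketch below. The documents and category names are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall(targets, predictions, category):
    """Precision and recall of `category`, treating it as the positive class."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if p == category and t == category)
    fp = sum(1 for t, p in pairs if p == category and t != category)
    fn = sum(1 for t, p in pairs if p != category and t == category)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of those assigned, how many correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # of those that belong, how many found
    return precision, recall

# Illustrative pre-classified documents and model outputs.
targets = ["sports", "politics", "sports", "tech", "sports"]
predictions = ["sports", "sports", "sports", "tech", "politics"]
precision, recall = precision_recall(targets, predictions, "sports")
```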


The Experience, E
• Learning algorithms are broadly categorized as unsupervised or supervised
– Based on what kind of experience they are allowed to have during learning process
• Most algorithms experience a dataset
– A dataset is a collection of examples, which are also called data points

Example of a data set

• Anderson’s Iris data (one of the oldest data sets in stat/ML)
– Measurements of 150 iris flowers
• sepal length, sepal width, petal length, petal width
– in 3 species: setosa, versicolor, virginica

Unsupervised learning experience

• Data set contains many features
• Learn useful properties of the structure of this data set
• Learn the entire probability distribution that generated this data set
– Explicitly for density estimation
– Implicitly for synthesis or denoising
• Perform a role such as clustering
– i.e., dividing data set into clusters of similar examples

Supervised Learning experience


• Data set contains features, but each example is
associated with a label or target
• Iris dataset is annotated with the species of
each iris plant
• A supervised learning algorithm can learn to
classify iris plants into three different species

[Figure: a test case]

Blur between supervised/unsupervised


• Unsupervised: observe several examples of x and learn p(x)
• Supervised: observe examples of x and y and learn to estimate p(y|x)
• Many ML methods can be used for both tasks
– E.g., using the chain rule of probability, for $x \in \mathbb{R}^n$:
$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})$
• We can model p(x) by splitting it into n supervised learning problems
• Alternatively we can solve the supervised problem of learning p(y|x) by using unsupervised learning to determine p(x,y) and then inferring p(y|x)
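The last step, inferring p(y|x) from a learned joint p(x,y), follows from the definition of conditional probability. For a discrete label y (for continuous y the sum becomes an integral):

```latex
p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')}
```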

Semi-supervised Learning
• Some examples include a supervision target
but others do not
• In multi-instance learning, an entire collection of
samples is labeled as containing or not
containing an example of a class, but individual
members of the collection are not labeled

Reinforcement Learning
• Algorithms interact with environment, no fixed data set
– A feedback loop between system and its experience
• Dog is given a reward/punishment for an action
– Policies: what actions to take in a particular situation
– Utility estimation: how good is a state (used by policy)
• No supervised output but delayed reward
– Credit assignment
• what was responsible for outcome
• Applications:
– Game playing, robot in a maze, multiple agents, partial observability, …

Reinforcement learning (RL) connected to a deep neural network proves to be an effective solution for learning to navigate in complex environments without any prior knowledge.
An RL agent can infer from complex environments through a punishment-reward system. It can model the decision making process.
Applications: AlphaGo Zero beat the world’s professionals (December 2017). OpenAI presented a bot that won in Dota2 world championship (Aug 2018). RL also has many applications in robotics.

Paper: “Playing Atari with Deep Reinforcement Learning” by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller, NIPS 2013 (Cited by 1490)
Dataset: Q-learning generates data exclusively from experience, without incorporation of prior knowledge. If we put all our history data into a table with state, action, reward, next state and then sample from it, it should be possible to train our agent that way, without a dataset.
Backend: Python3, Keras, Tensorflow
Core libraries: OpenAI Gym, Keras-RL
Code: https://github.com/nathanmargaglio/DQN
Strategy: (1) estimate the discounted sum of rewards of taking action a in state s, the Q(s, a) function, (2) choose the action with the maximum Q-value in any given state.
Notation: s - state; a - action to take; r - reward; γ - discounting factor.
An agent learns by getting positive or negative rewards.
Loss: Huber Loss (modified MSE/MAE)
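The strategy can be illustrated with tabular Q-learning, which keeps Q(s, a) in a table rather than in a deep network as the cited paper does. The sketch below makes several assumptions not found in the slides: the four-state chain environment, the learning rate ALPHA, and the number of replay sweeps are all hypothetical.

```python
GAMMA = 0.9   # discounting factor
ALPHA = 0.5   # learning rate (an assumption; not given on the slide)
N_STATES, GOAL = 4, 3
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical chain environment: reward 1 on reaching the goal state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return reward, next_state

# History table of (state, action, reward, next state) transitions.
history = [(s, a, *step(s, a)) for s in range(GOAL) for a in ACTIONS]

Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
for _ in range(200):  # replay the stored transitions repeatedly
    for s, a, r, s_next in history:
        # No future reward once the goal (terminal state) is reached.
        future = 0.0 if s_next == GOAL else max(Q[s_next])
        # Move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
        Q[s][a] += ALPHA * (r + GAMMA * future - Q[s][a])

def greedy_action(state):
    """Choose the action with the maximum Q-value in the given state."""
    return max(ACTIONS, key=lambda a: Q[state][a])
```

After training, the greedy policy moves right in every non-terminal state, since that is the only way to collect the reward.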

Reinforcement Learning: ATARI


Environment: BreakoutDeterministic-v4
Backend: Keras, Python3
Libraries: OpenAI Gym, Keras-RL
Reward: max score - 208 (the benchmark in the paper
225)
Preprocessing: original image was downsampled from
210×160 pixel images to 105×80 and converted from RGB
to gray-scale to decrease the computation
Training time: 15 hours including simulation time on a
GTX 650 with 1 GB of RAM
Notations:
Frame - a snapshot of the environment state at every point
Action (a) - a set of actions, that agent can take {0, 1, 2, 3}
Upper left corner - score (our evaluation metric)
Upper middle - number of “lives” for each game (initially 5)
Upper right corner - might be version

Network structure & parameters



Describing a data set


• Common way of describing a data set is with a design matrix
• Each row contains a different example
• Each column corresponds to a different feature
• Iris dataset contains 150 examples with four features for each example
• Data set is a design matrix $X \in \mathbb{R}^{150 \times 4}$
• $X_{i,1}$ is sepal length of plant i, $X_{i,2}$ is sepal width of plant i

Dealing with varying sizes


• For a design matrix, each example is a vector of same size
– But photos of varying size contain different nos. of pixels
• Rather than describing the data set as a matrix with m rows we describe it as a set of m elements $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$. This notation does not imply that vectors $x^{(i)}$ and $x^{(j)}$ have the same size
– Instead of multiplying by a weight matrix of fixed size, convolution with a kernel is applied a different no. of times depending on size of input

Machine Learning Basics:


Capacity, Over-and Under-fitting


Generalization
• Central challenge of ML is that the algorithm
must perform well on new, previously unseen
inputs
– Not just those on which our model has been trained
• Ability to perform well on previously unobserved
inputs is called generalization

Generalization Error
• When training a machine learning model we have access to a training set
– We can compute some error measure on the training set, called training error
• ML differs from optimization
– We also want the generalization error (also called test error) to be low
• Generalization error definition
– Expected value of the error on a new input
• Expected value is computed as average over inputs taken from distribution encountered in practice
– In linear regression with squared error criterion the expected error is estimated by the test error:
$\frac{1}{m^{(test)}} \left\| X^{(test)} w - y^{(test)} \right\|_2^2$

Estimating the generalization error


• We estimate generalization error of a ML model by measuring performance on a test set of examples that were collected separately from the training set
• In linear regression example we train model by minimizing the training error
$\frac{1}{m^{(train)}} \left\| X^{(train)} w - y^{(train)} \right\|_2^2$
– Polynomial model: $y(x,w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
– But we actually care about the test error
$\frac{1}{m^{(test)}} \left\| X^{(test)} w - y^{(test)} \right\|_2^2$
• How can we affect performance when we observe only the training set?
– Statistical learning theory provides some answers

Statistical Learning Theory


• Need assumptions about training and test sets
– Training/test data arise from same process
– We make use of i.i.d. assumptions
1. Examples in each data set are independent
2. Training set and testing set are identically distributed
– We call the shared distribution the data generating distribution $p_{data}$
• Probabilistic framework and i.i.d. assumptions allow us to study relationship between training and testing error

Why are training/test errors unequal?


• Expected training error of a randomly selected
model is equal to expected test error of model
– If we have a joint distribution p(x,y) and we
randomly sample from it to generate the training
and test sets. For some fixed value w the expected
training set error is the same as the expected test
set error
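• But we don’t fix the parameters w in advance!
– We sample the training set and then use it to choose w to reduce training set error
– Only then do we sample the test set, so the expected test error is greater than or equal to the expected training error

Under- and Over-fitting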

• Factors determining how well an ML algorithm will perform are its ability to:
1. Make the training error small
2. Make gap between training and test errors small
• They correspond to two ML challenges
– Underfitting
• Inability to obtain low enough error rate on the training set
– Overfitting
• Gap between training error and testing error is too large
• We can control whether a model is more likely to overfit or underfit by altering its capacity

Capacity of a model

• Model capacity is ability to fit variety of functions
– Model with low capacity struggles to fit training set
– A high capacity model can overfit by memorizing properties of training set not useful on test set
• A model with higher capacity is more prone to overfitting
– One way to control capacity of a learning algorithm is by choosing the hypothesis space
• i.e., set of functions that the learning algorithm is allowed to select as being the solution
– E.g., the linear regression algorithm has the set of all linear functions of its input as the hypothesis space
– We can generalize to include polynomials in its hypothesis space, which increases model capacity

Capacity of Polynomial Curve Fits


• A polynomial of degree 1 gives a linear regression model with the prediction
$\hat{y} = b + wx$
– By introducing $x^2$ as another feature provided to the regression model, we can learn a model that is quadratic as a function of x
$\hat{y} = b + w_1 x + w_2 x^2$
• The output is still a linear function of the parameters so we can use normal equations to train in closed-form
• We can continue to add more powers of x as additional features, e.g., a polynomial of degree 9
$\hat{y} = b + \sum_{i=1}^{9} w_i x^i$
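Because the model is linear in its parameters, the fit reduces to solving the normal equations $(X^\top X) w = X^\top y$, where each row of X holds the powers of x for one example. The sketch below assumes a small noiseless quadratic data set and uses a plain Gaussian elimination solver written for this illustration; a numerical library would normally be used instead.

```python
def design_matrix(xs, degree):
    """Each row holds the features 1, x, x^2, ..., x^degree for one example."""
    return [[x ** j for j in range(degree + 1)] for x in xs]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w

def fit_polynomial(xs, ys, degree):
    """Closed-form least squares: solve (X^T X) w = X^T y."""
    X = design_matrix(xs, degree)
    cols = range(degree + 1)
    XtX = [[sum(row[i] * row[j] for row in X) for j in cols] for i in cols]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in cols]
    return solve(XtX, Xty)

# Illustrative data: a noiseless quadratic y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]
b, w1, w2 = fit_polynomial(xs, ys, degree=2)
```

The bias b is handled here as the weight of the constant feature 1, which is a common convention rather than something stated on the slide.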

Appropriate Capacity
• Machine Learning algorithms will perform well when their capacity is appropriate for the true complexity of the task that they need to perform and the amount of training data they are provided with
• Models with insufficient capacity are unable to solve complex tasks
• Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit

Principle of Capacity in action

• We fit three models to a training set
– Data generated synthetically by sampling x values and choosing y deterministically (a quadratic function)
– Polynomial model: $y(x,w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
[Figure: three fits to the same training set]
• Linear function fit: cannot capture curvature present in data
• Quadratic function fit: generalizes well to unseen data. No underfitting or overfitting
• Polynomial of degree 9: suffers from overfitting
– Used Moore-Penrose inverse to solve underdetermined normal equations (we have N equations corresponding to N training samples)
– Solution passes through all points but does not capture correct structure: deep valley between two points is not true of underlying function; also function is decreasing at first point, not increasing

Ordering Learning Machines by Capacity

Goal of learning is to choose an optimal element of a structure (e.g., polynomial degree) and estimate its coefficients from a given training sample.

For approximating functions linear in parameters such as polynomials, complexity is given by the no. of free parameters.

For functions nonlinear in parameters, the complexity is defined as VC-dimension.

The optimal choice of model complexity provides the minimum of the expected risk.

Representational and Effective Capacity

• Representational capacity:
– Specifies family of functions learning algorithm can choose from
• Effective capacity:
– May be less than the representational capacity, because imperfections in the optimization algorithm can keep it from finding the best function in the family
• Occam’s razor:
– Among competing hypotheses that explain known observations equally well, choose the simplest one
• Idea is formalized in VC dimension

VC Dimension
• Statistical learning theory quantifies model capacity
• VC dimension quantifies capacity of a binary classifier
• VC dimension is the largest value of m for which there exists a training set of m different points that the classifier can label arbitrarily
• Ex: VC dimension (capacity) of a linear classifier in $\mathbb{R}^2$ is 3
Capacity and Learning Theory


• Quantifying the capacity of a model enables
statistical learning theory to make quantitative
predictions
• Most important results in statistical learning
theory show that:
• The discrepancy between training error and
generalization error
– is bounded from above by a quantity that grows as the
model capacity grows
– But shrinks as the number of training examples increases


Usefulness of statistical learning theory


• Provides intellectual justification that machine
learning algorithms can work
• But rarely used in practice with deep learning
• This is because:
– The bounds are loose
– Also difficult to determine capacity of deep learning
algorithms


Typical generalization error


[Figure: relationship between capacity and error]
Typically generalization error has a U-shaped curve

Arbitrarily high capacity: Nonparametric models
• When do we reach most extreme case of arbitrarily high capacity?
• Parametric models such as linear regression:
• learn a function described by a parameter vector whose size is finite and fixed before data is observed
• Nonparametric models have no such limitation
• Nearest-neighbor regression is an example
• Their complexity is a function of training set size

Nearest neighbor regression

• Simply store the X and y from the training set
• When asked to classify a test point x the model looks up the nearest entry in the training set and returns the associated target, i.e.,
$\hat{y} = y_i$ where $i = \arg\min_i \| X_{i,:} - x \|_2^2$
• Algorithm can be generalized to distance metrics other than L2 norm, such as learned distance metrics
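A direct sketch of this rule, with a tiny made-up training set. Ties are broken in favor of the earliest training row, which is a choice made here and not specified on the slide.

```python
def nearest_neighbor_predict(X, y, x):
    """Return y_i for the training row X[i] closest to x in squared L2 distance."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, x))
    i = min(range(len(X)), key=lambda idx: sq_dist(X[idx]))
    return y[i]

# Illustrative training set: "training" is just storing X and y.
X_train = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
y_train = [0.0, 1.0, 5.0]

prediction_a = nearest_neighbor_predict(X_train, y_train, [0.9, 1.2])
prediction_b = nearest_neighbor_predict(X_train, y_train, [2.5, 2.0])
```

There are no parameters of fixed size: the stored training set is the model, so its complexity grows with the number of examples.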

Bayes Error

• Ideal model is an oracle that knows the true probability distributions that generate the data
• Even such a model incurs some error due to noise/overlap in the distributions
• The error incurred by an oracle making predictions from the true distribution p(x,y) is called the Bayes error

Effect of size of training set

• Expected generalization error can never increase as the no. of training examples increases
• For nonparametric models, more data yields better generalization until best possible error is achieved
• Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error

Effect of training set size


[Figure: effect of training set size on error and on optimal capacity]
• Synthetic regression problem: noise added to a 5th degree polynomial
• Generated a single test set and several different sizes of training sets
• Error bars show 95% confidence interval
• As training set size increases, optimal capacity increases
• Plateaus after reaching sufficient complexity

Probably Approximately Correct


• Learning theory claims that an ML algorithm can generalize well from a finite training set of examples
• This seems to contradict basic principles of logic
• Inductive reasoning, or inferring general rules from a limited no. of samples, is not logically valid
• To infer a rule describing every member of a set, one must have information about every member of the set
• ML avoids this problem by offering only probabilistic rules
• Rather than the entirely certain rules of logical reasoning
• Find rules that are probably correct about most members of the set they concern

The no-free lunch theorem

• PAC does not entirely resolve the problem
• No free lunch theorem states:
• Averaged over all possible data generating distributions, every classification algo has same error rate when classifying previously unobserved points
• i.e., no ML algo is universally better than any other
• The most sophisticated algorithm has the same average error rate as one that merely predicts that every point belongs to the same class
• Fortunately results hold only when we average over all possible data generating distributions
• If we can make assumptions about the distributions encountered in practice, we can design algorithms that perform well on them
• We don’t seek a universal learning algorithm

Regularization
• No free lunch theorem implies that we must design our ML algorithms to perform well on a specific task
• We do so by building a set of preferences into the learning algorithm
• When these preferences are aligned with the learning problem, the algorithm performs better

Other ways of solution preference


• Only method so far for modifying learning algo is to change representational capacity
• Adding/removing functions from hypothesis space
• Specific example: degree of polynomial
• Specific identity of those functions can also affect behavior of our algorithm
• Ex of linear regression: include weight decay
$J(w) = \mathrm{MSE}_{train} + \lambda w^\top w$
• λ chosen to control preference for smaller weights

Tradeoff between fitting data and being small
• Controlling a model’s tendency to overfit or underfit via weight decay
[Figure: weight decay for a high degree polynomial (degree 9) while the data comes from a quadratic]
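The effect of λ is easiest to see in a one-parameter model $\hat{y} = wx$ with no bias, which is a simplification assumed for this sketch. Setting the derivative of $J(w) = \frac{1}{m}\sum_i (w x_i - y_i)^2 + \lambda w^2$ to zero gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + m\lambda)$. The data values below are illustrative.

```python
def fit_with_weight_decay(xs, ys, lam):
    """Minimize J(w) = MSE_train + lam * w^2 for the model y_hat = w * x.

    Closed form: w = sum(x*y) / (sum(x*x) + m * lam), where m = len(xs).
    """
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + m * lam)

# Illustrative data generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_no_decay = fit_with_weight_decay(xs, ys, lam=0.0)    # plain least squares
w_some_decay = fit_with_weight_decay(xs, ys, lam=1.0)  # shrunk toward zero
w_large_decay = fit_with_weight_decay(xs, ys, lam=100.0)
```

With λ = 0 the fit recovers the least squares solution; larger λ trades training error for a smaller weight.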

Regularization defined
• Regularization is any modification we make to a
learning algorithm that is intended to reduce its
generalization error but not its training error
• We must choose a form of regularization that is
well-suited to the particular task


Hyperparameters and Validation Sets


Hyperparams control ML Behavior


• Most ML algorithms have hyperparameters
– We can use them to control algorithm behavior
– Values of hyperparameters are not adapted by learning algorithm itself
• Although, we can design a nested learning procedure
– In which one learning algorithm learns the best hyperparameters for another learning algorithm
Deep Learning Srihari

Hyperparams for complexity, regularization


Linear Regression problem
$y(x,w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$

Model complexity hyperparameter M
$J(w) = \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2$
[Figure: models with different capacities M: linear (low capacity: struggles to fit), quadratic, degree 9 (high capacity: overfit)]

Regularization with weight decay hyperparameter λ
$J(w) = \mathrm{MSE}_{train} + \lambda w^\top w$
$J(w) = \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \lambda w^\top w$
[Figure: model with high degree polynomial (degree 9) fit with different weight decay hyperparameters λ]

Using Moore-Penrose pseudoinverse to solve underdetermined normal equations:
$w_{ML} = \Phi^{+} t$, where $\Phi^{+} = (\Phi^\top \Phi)^{-1} \Phi^\top$
Design matrix:
$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$

Reasons for hyperparameters


• Sometimes a setting is chosen as a hyperparam because it is too difficult to optimize
• More frequently, the setting is a hyperparam because it is not appropriate to learn that hyperparam on the training set
– Applies to all hyperparameters that control model capacity
• If learned on training set, such hyperparameters would always choose maximum model capacity, resulting in overfitting
• Can always fit the training set better with a higher degree polynomial and weight decay λ=0

Validation Set
• To solve the problem we use a validation set
– Examples that training algorithm does not observe
• Test examples should not be used to make choices about the model hyperparameters
• Training data is split into two disjoint parts
– First to learn the parameters
– Other is the validation set to estimate generalization error during or after training
• allowing for the hyperparameters to be updated
– Typically 80% of training data for training and 20% for validation
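A minimal sketch of the 80/20 split and of choosing a hyperparameter on the validation part. The function names, the fixed seed, and the fit/loss callables are hypothetical placeholders introduced for this illustration; any learner with that shape could be plugged in.

```python
import random

def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle, then split into disjoint training and validation parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def select_hyperparameter(candidates, train, validation, fit, loss):
    """Pick the candidate whose fitted model has the lowest validation loss.

    fit(train, h) returns a model trained with hyperparameter h;
    loss(model, z) returns the loss of that model on one example z.
    """
    def validation_loss(h):
        model = fit(train, h)
        return sum(loss(model, z) for z in validation) / len(validation)
    return min(candidates, key=validation_loss)

examples = list(range(100))  # stand-in for 100 training examples
train, validation = train_validation_split(examples)
```

The test set plays no part in either function, so it stays untouched for the final estimate of generalization error.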

Test sets also need to change


• Over many years, the same test set is used repeatedly to evaluate performance of different algorithms
• With repeated attempts to beat state-of-the-art performance, we end up with optimistic evaluations on the test set as well
• Community tends to move to new, usually more ambitious and larger benchmark data sets

Cross-Validation
• When data set is too small, dividing into a fixed training set and fixed testing set is problematic
– If it results in a small test set
• Small test set implies statistical uncertainty around the estimated average test error
• Cannot claim algorithm A works better than algorithm B for a given task
• k-fold cross-validation
– Partition the data into k non-overlapping subsets
– On trial i, the i-th subset of data is used as the test set
– Rest of the data is used as the training set

k-fold Cross Validation


• Supply of data is limited
• All available data is partitioned into k groups (folds); the figure shows k=4
• k-1 groups are used to train, and the model is evaluated on the remaining group
• Repeat for all k choices of held-out group
– If k=N this is the leave-one-out method
• Performance scores from the k runs are averaged

Algorithm KFoldXV(D,A,L,k)

• Estimates generalization error of a learning
algorithm A when the given data set D is too small
to yield an accurate estimate
– Because mean loss L will have too high a variance
– Data set D contains examples z(i) (for ith example)
• For supervised learning z(i)=(x(i), y(i))
• For unsupervised learning z(i)=x(i)
• Algorithm returns the vector of errors e for each
example in D whose mean is the estimated
generalization error

k -fold cross validation algorithm

Train A on dataset without Di

Determine errors for samples in Di

Return vector of errors e for samples in D
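The three steps above can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `k_fold_xv`, the calling conventions for `A` and `L`, and the toy mean predictor are illustrative choices, and the data is assumed to be already shuffled so that contiguous folds are a fair partition.

```python
import numpy as np

def k_fold_xv(D, A, L, k):
    """Return the vector of per-example errors e from k-fold cross-validation.

    D: list of examples z, A: learning algorithm mapping a list of examples
    to a fitted function f, L: loss L(f, z) on one example, k: number of folds.
    """
    folds = np.array_split(np.arange(len(D)), k)    # k non-overlapping index sets
    e = np.empty(len(D))
    for held_out in folds:
        held = set(held_out.tolist())
        train = [D[j] for j in range(len(D)) if j not in held]
        f = A(train)                                # train A on the dataset without D_i
        for j in held_out:
            e[j] = L(f, D[j])                       # errors for the samples in D_i
    return e

# Toy supervised setup, assumed for illustration: z = (x, y)
def fit_mean(train):
    mean_y = float(np.mean([y for _, y in train]))
    return lambda x: mean_y                         # predicts the training mean of y

def squared_error(f, z):
    x, y = z
    return (f(x) - y) ** 2

D = [(float(i), float(i % 3)) for i in range(12)]
e = k_fold_xv(D, fit_mean, squared_error, k=4)
estimated_generalization_error = e.mean()
e_loo = k_fold_xv(D, fit_mean, squared_error, k=len(D))   # leave-one-out case
```

Every example is held out exactly once, so `e` has one entry per example in D.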


Cross validation confidence


• Cross-validation algorithm returns vector of
errors e for examples in D
– Whose mean is the estimated generalization error
– The errors can be used to compute a confidence
interval around the mean
• 95% confidence interval centered around mean $\hat{\mu}_m$ is
$(\hat{\mu}_m - 1.96\,SE(\hat{\mu}_m),\ \hat{\mu}_m + 1.96\,SE(\hat{\mu}_m))$
where the standard error of the mean is
$SE(\hat{\mu}_m) = \sqrt{Var\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right]} = \frac{\sigma}{\sqrt{m}}$
which is the square root of the variance of the estimator
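As a rough sketch of this computation, the snippet below builds the interval from a vector of per-example errors. The helper name `confidence_interval_95` and the example error values are assumptions, and σ is replaced by its estimate from the same errors.

```python
import math
import numpy as np

def confidence_interval_95(e):
    """95% interval around the mean of e, using the estimated standard error."""
    e = np.asarray(e, dtype=float)
    m = e.size
    mean = e.mean()
    se = e.std(ddof=1) / math.sqrt(m)    # estimate of sigma / sqrt(m)
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical 0/1 errors on ten held-out examples
errors = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
low, high = confidence_interval_95(errors)
```

With only ten examples the interval is wide, which is the statistical uncertainty that small test sets suffer from.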

Caveats for Cross-validation


• No unbiased estimators of the average error
exist; approximations are used
• Confidence intervals are not well-justified after
use of cross-validation
• It is still common practice to declare that
Algorithm A is better than Algorithm B only if
the confidence interval for the error of Algorithm A lies
below and does not intersect the confidence
interval for the error of Algorithm B

Machine Learning Basics:
Estimators, Bias and Variance

Topics in Basics of ML

1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning

Topics in Estimators, Bias, Variance


0. Statistical tools useful for generalization
1. Point estimation
2. Bias
3. Variance and Standard Error
4. Bias-Variance tradeoff to minimize MSE
5. Consistency


Statistics provides tools for ML


• The field of statistics provides many tools to
achieve the ML goal of solving a task not only
on the training set but also to generalize
• Foundational concepts such as
– Parameter estimation
– Bias
– Variance
• They characterize notions of generalization,
over- and under-fitting

Point Estimation
• Point Estimation is the attempt to provide the
single best prediction of some quantity of
interest
– Quantity of interest can be:
• A single parameter
• A vector of parameters
– E.g., weights in linear regression
• A whole function


Point estimator or Statistic


• To distinguish estimates of parameters from
their true value, a point estimate of a parameter
θ is represented by θ̂
• Let $\{x^{(1)}, \ldots, x^{(m)}\}$ be m independent and
identically distributed data points
– Then a point estimator or statistic is any function of
the data
$\hat{\theta}_m = g(x^{(1)}, \ldots, x^{(m)})$
• It need not be close to the true θ
– A good estimator is a function whose output is close
to the true underlying θ that generated the data

Function Estimation
• Point estimation can also refer to estimation of
relationship between input and target variables
– Referred to as function estimation
• Here we predict a variable y given input x
– We assume f(x) is the relationship between x and y
• We may assume y=f(x)+ε
– Where ε stands for a part of y not predictable from x
– We are interested in approximating f with a model $\hat{f}$
• Function estimation is the same as estimating a parameter θ
– where $\hat{f}$ is a point estimator in function space
• Ex: in polynomial regression we are either estimating a
parameter w or estimating a function mapping from x to y

Properties of Point Estimators


• Most commonly studied properties of point
estimators are:
1. Bias
2. Variance
• They inform us about the estimators


1. Bias of an estimator
• The bias of an estimator $\hat{\theta}_m = g(x^{(1)}, \ldots, x^{(m)})$ for
parameter θ is defined as
$\mathrm{bias}(\hat{\theta}_m) = E[\hat{\theta}_m] - \theta$
• The estimator is unbiased if $\mathrm{bias}(\hat{\theta}_m) = 0$
– which implies that $E[\hat{\theta}_m] = \theta$
• An estimator is asymptotically unbiased if
$\lim_{m \to \infty} \mathrm{bias}(\hat{\theta}_m) = 0$

Examples of Estimator Bias


• We look at common estimators of the following
parameters to determine whether there is bias:
– Bernoulli distribution: mean θ
– Gaussian distribution: mean µ
– Gaussian distribution: variance $\sigma^2$

Estimator of Bernoulli mean


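• Bernoulli distribution for binary variable $x \in \{0,1\}$
with mean θ has the form $P(x;\theta) = \theta^{x}(1-\theta)^{1-x}$
• Estimator for θ given samples $\{x^{(1)}, \ldots, x^{(m)}\}$ is $\hat{\theta}_m = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$
• To determine whether this estimator is biased, determine
$$
\begin{aligned}
\mathrm{bias}(\hat{\theta}_m) &= E[\hat{\theta}_m] - \theta \\
&= E\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right] - \theta \\
&= \frac{1}{m}\sum_{i=1}^{m} E\left[x^{(i)}\right] - \theta \\
&= \frac{1}{m}\sum_{i=1}^{m}\sum_{x^{(i)}=0}^{1}\left(x^{(i)}\,\theta^{x^{(i)}}(1-\theta)^{(1-x^{(i)})}\right) - \theta \\
&= \frac{1}{m}\sum_{i=1}^{m}(\theta) - \theta = \theta - \theta = 0
\end{aligned}
$$
– Since $\mathrm{bias}(\hat{\theta}_m) = 0$ we say that the estimator is unbiased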

Estimator of Gaussian mean


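• Samples $\{x^{(1)}, \ldots, x^{(m)}\}$ are independently and
identically distributed according to $p(x^{(i)}) = N(x^{(i)}; \mu, \sigma^2)$
– Sample mean is an estimator of the mean parameter
$\hat{\mu}_m = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$
– To determine bias of the sample mean:
$\mathrm{bias}(\hat{\mu}_m) = E[\hat{\mu}_m] - \mu = \frac{1}{m}\sum_{i=1}^{m} E\left[x^{(i)}\right] - \mu = \mu - \mu = 0$
– Thus the sample mean is an unbiased estimator of the
Gaussian mean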

Estimator for Gaussian variance


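• The sample variance is
$\hat{\sigma}_m^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} - \hat{\mu}_m\right)^2$
• We are interested in computing
$\mathrm{bias}(\hat{\sigma}_m^2) = E[\hat{\sigma}_m^2] - \sigma^2$
• We begin by evaluating $E[\hat{\sigma}_m^2] = \frac{m-1}{m}\sigma^2$
• Thus the bias of $\hat{\sigma}_m^2$ is $-\sigma^2/m$
• Thus the sample variance is a biased estimator
• The unbiased sample variance estimator is
$\tilde{\sigma}_m^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(x^{(i)} - \hat{\mu}_m\right)^2$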
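A quick simulation can make the bias of the sample variance visible. This is only a sketch: the sample size m=5, the true variance 4.0, the mean 1.0 and the number of trials are arbitrary values assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma2, trials = 5, 4.0, 200_000

# Each row is one dataset of m Gaussian samples with true variance sigma2
x = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(trials, m))

biased = x.var(axis=1, ddof=0)      # sample variance: divides by m
unbiased = x.var(axis=1, ddof=1)    # divides by m - 1

# Averaging over many datasets approximates the expectation of each estimator
bias_of_biased = biased.mean() - sigma2       # theory: -sigma2 / m = -0.8
bias_of_unbiased = unbiased.mean() - sigma2   # theory: 0
```

The first bias shrinks as m grows, which is why the sample variance is asymptotically unbiased even though it is biased for any finite m.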

2. Variance and Standard Error


• Another property of an estimator:
– How much we expect the estimator to vary as a
function of the data sample
• Just as we computed the expectation of the
estimator to determine its bias, we can compute
its variance
• The variance of an estimator is simply Var(θ̂ )
where the random variable is the training set
• The square root of the variance is called the
standard error, denoted SE(θ̂)

Importance of Standard Error


• It measures how we would expect the estimate
to vary as we obtain different samples from the
same distribution
• The standard error of the mean is given by
$SE(\hat{\mu}_m) = \sqrt{Var\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right]} = \frac{\sigma}{\sqrt{m}}$
– where $\sigma^2$ is the true variance of the samples $x^{(i)}$
– Standard error is often estimated using an estimate of σ
• Although not unbiased, the approximation is reasonable
– The square root of the unbiased variance estimator is less of an underestimate than the square root of the sample variance

Standard Error in Machine Learning


• We often estimate generalization error by
computing error on the test set
– The number of samples in the test set determines the accuracy of this estimate
– Since the mean will be approximately normally distributed (according
to the Central Limit Theorem), we can compute the
probability that the true expectation falls in any chosen
interval
• Ex: 95% confidence interval centered on mean $\hat{\mu}_m$ is
$(\hat{\mu}_m - 1.96\,SE(\hat{\mu}_m),\ \hat{\mu}_m + 1.96\,SE(\hat{\mu}_m))$
• ML algorithm A is better than ML algorithm B if
– the upper bound of the 95% confidence interval for the error of A is less than the lower bound of the 95% confidence interval for the error of B

Confidence Intervals for error

95% confidence intervals for error estimate


Trading-off Bias and Variance


• Bias and Variance measure two different
sources of error of an estimator
• Bias measures the expected deviation from the
true value of the function or parameter
• Variance provides a measure of the expected
deviation that any particular sampling of the
data is likely to cause


Negotiating the bias-variance tradeoff


• How to choose between two algorithms, one
with a large bias and another with a large
variance?

– Most common approach is to use cross-validation


– Alternatively we can minimize Mean Squared Error
which incorporates both bias and variance


Mean Squared Error


• Mean Squared Error of an estimate is
$MSE = E\left[(\hat{\theta}_m - \theta)^2\right] = \mathrm{Bias}(\hat{\theta}_m)^2 + Var(\hat{\theta}_m)$
• Minimizing the MSE keeps both bias and
variance in check
As capacity increases, bias (dotted)
tends to decrease and variance (dashed)
tends to increase
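The decomposition of the MSE into squared bias plus variance follows from adding and subtracting $E[\hat{\theta}_m]$ inside the square. A short derivation, using only the definitions of bias and variance given earlier:

```latex
\begin{aligned}
\mathrm{MSE}
  &= E\big[(\hat{\theta}_m - \theta)^2\big] \\
  &= E\big[(\hat{\theta}_m - E[\hat{\theta}_m] + E[\hat{\theta}_m] - \theta)^2\big] \\
  &= E\big[(\hat{\theta}_m - E[\hat{\theta}_m])^2\big]
     + 2\,(E[\hat{\theta}_m] - \theta)\,E\big[\hat{\theta}_m - E[\hat{\theta}_m]\big]
     + (E[\hat{\theta}_m] - \theta)^2 \\
  &= \mathrm{Var}(\hat{\theta}_m) + \mathrm{Bias}(\hat{\theta}_m)^2
\end{aligned}
```

The cross term vanishes because $E\big[\hat{\theta}_m - E[\hat{\theta}_m]\big] = 0$, and $E[\hat{\theta}_m] - \theta$ is a constant that can be taken outside the expectation.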


Underfit-Overfit : Bias-Variance
Relationship of bias-variance to capacity is similar to
underfitting and overfitting relationship to capacity

[Figure panels: bias and variance vs. capacity; model complexity vs. capacity]

Both have a U-shaped curve of generalization
error as a function of capacity

Consistency
• So far we have discussed behavior of an
estimator for a fixed training set size
• We are also interested in the behavior of the
estimator as the training set grows
• As the number of data points m in the training set
grows, we would like our point estimates to
converge to the true value of the parameters:
$\mathrm{plim}_{m \to \infty}\, \hat{\theta}_m = \theta$
– Symbol plim indicates convergence in probability

Weak and Strong Consistency


• $\mathrm{plim}_{m \to \infty}\, \hat{\theta}_m = \theta$ means that
for any ε > 0, $P(|\hat{\theta}_m - \theta| > \epsilon) \to 0$ as $m \to \infty$
• It is also known as weak consistency
• It is implied by, but does not imply, almost sure convergence of θ̂ to θ
• Strong consistency refers to almost sure convergence of θ̂ to θ
– Almost sure convergence of a sequence of random variables $x^{(1)}, x^{(2)}, \ldots$ to a
value x occurs when
$p(\lim_{m \to \infty} x^{(m)} = x) = 1$
• Consistency ensures that the bias induced by the
estimator decreases with m

Machine Learning Basics:
Maximum Likelihood Estimation


Topics in Maximum Likelihood


0. The maximum likelihood principle
– Maximizing likelihood is minimizing KL divergence
1. Conditional log-likelihood & MSE
– Minimizing negative log-likelihood is equivalent to
minimizing MSE in linear regression with Gaussian
noise
2. Properties of maximum likelihood
– No consistent estimator has lower MSE than
maximum likelihood estimator

How to obtain good estimators?


• We have seen some definitions of common
estimators and their properties
– Ex: sample mean, bias-variance
• Where do they come from?
• Rather than guessing some function and
determining its bias and variance, better to
have some principle from which we can derive
specific functions that are good estimators
• The most common such principle is the
maximum likelihood principle

Maximum Likelihood Principle


• Consider a set of m examples $X = \{x^{(1)}, \ldots, x^{(m)}\}$
– Drawn independently from the true but unknown
data generating distribution $p_{data}(x)$
• Let $p_{model}(x;\theta)$ be a parametric family of
distributions over the same space indexed by θ
• i.e., $p_{model}(x;\theta)$ maps any configuration of x to a real number
estimating the true probability $p_{data}(x)$
• The maximum likelihood estimator for θ is:
$\theta_{ML} = \arg\max_{\theta}\, p_{model}(X;\theta) = \arg\max_{\theta} \prod_{i=1}^{m} p_{model}(x^{(i)};\theta)$
– This product over many probabilities is inconvenient
• ex: underflow

Alternative form of max likelihood


• An equivalent optimization problem is to take the
logarithm of the likelihood
$\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}(x^{(i)};\theta)$
– Since dividing by m does not change the problem,
this maximization can be written as
$\theta_{ML} = \arg\max_{\theta}\, E_{x \sim \hat{p}_{data}} \log p_{model}(x;\theta)$
• The expectation is with respect to the empirical distribution $\hat{p}_{data}$
defined by the training data
– One way to interpret maximum likelihood estimation
is to view it as minimizing the dissimilarity between
the empirical distribution $\hat{p}_{data}$ defined by the training
set and the model distribution $p_{model}(x)$
Maximizing likelihood is minimizing KL divergence
• Max likelihood = minimizing KL divergence between:
– empirical distribution $\hat{p}_{data}$ and model distribution $p_{model}(x)$
• The KL divergence is
$D_{KL}(\hat{p}_{data} \,\|\, p_{model}) = E_{x \sim \hat{p}_{data}}\left[\log \hat{p}_{data}(x) - \log p_{model}(x)\right]$
– First term is a function of the data generating process, not the model
– Thus we only need to minimize $-E_{x \sim \hat{p}_{data}}\left[\log p_{model}(x)\right]$
• i.e., cross entropy between distribution of training set and
probability distribution defined by model
– Definition of cross entropy between distributions p and q is
$H(p,q) = E_p[-\log q] = H(p) + D_{KL}(p \,\|\, q)$
– For discrete distributions $H(p,q) = -\sum_x p(x)\log q(x)$
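The identity $H(p,q) = H(p) + D_{KL}(p \,\|\, q)$ can be checked numerically for small discrete distributions. This is a minimal sketch: the two distributions are arbitrary values assumed for illustration, and all probabilities are taken to be strictly positive so that the logarithms are defined.

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.5, 0.3, 0.2])   # stands in for the empirical distribution
q = np.array([0.4, 0.4, 0.2])   # stands in for the model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
```

Because `entropy(p)` does not depend on q, the q that minimizes the cross entropy also minimizes the KL divergence.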

Summary of Max Likelihood


• It is an attempt to make model distribution $p_{model}$
match the empirical distribution $\hat{p}_{data}$
– Ideally we would like to match the true $p_{data}$, but it is unknown
• The optimal θ is the same whether we
maximize likelihood or minimize KL divergence
• In software both are phrased as minimization
– Maximum likelihood becomes minimization of
negative log-likelihood (NLL)
• Equivalently minimization of cross entropy
– KL divergence has a minimal value 0
– Negative log-likelihood can become negative when x is real-valued, since densities can exceed 1

Conditional Log-likelihood and MSE


• Maximum likelihood estimator can be readily
generalized to parameters of an input-output
relationship
– Goal is prediction: conditional probability P(y|x;θ)
– Which forms the basis of most supervised learning
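• If X represents all our inputs and Y represents
all our targets then the conditional maximum
likelihood estimator is
$\theta_{ML} = \arg\max_{\theta}\, P(Y \mid X;\theta)$
– If examples are i.i.d. then
$\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)};\theta)$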

Linear Regression as Maximum Likelihood


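• Basic linear regression:
– Takes an input x and produces an output ŷ
– Mapping from x to ŷ is chosen to minimize MSE
• Revisit linear regression as maximum likelihood
– Think of model as producing a conditional
distribution p(y|x)
• To derive the same algorithm, define $p(y \mid x) = N(y; \hat{y}(x,w), \sigma^2)$
– Function $\hat{y}(x,w)$ predicts the mean of the Gaussian
– Since samples are i.i.d., the conditional log-likelihood is
$\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)};\theta) = -m\log\sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{2\sigma^2}$
– Only the last term depends on w, and it is proportional to the training MSE
Thus maximizing the log-likelihood with respect to w
is the same as minimizing MSE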
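A small numerical check of this equivalence is sketched below. The helper `gaussian_nll`, the synthetic data and the fixed σ=1 are assumptions made for illustration. The least-squares solution is computed first, then the Gaussian negative log-likelihood is evaluated there and at nearby points.

```python
import numpy as np

def gaussian_nll(w, X, y, sigma=1.0):
    """Negative conditional log-likelihood under p(y|x) = N(y; x^T w, sigma^2)."""
    m = len(y)
    resid = X @ w - y
    return float(m * np.log(sigma) + 0.5 * m * np.log(2 * np.pi)
                 + np.sum(resid ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=50)

w_mse, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizer of the training MSE
nll_at_mse = gaussian_nll(w_mse, X, y)

# Moving away from the MSE minimizer in any coordinate increases the NLL
nll_perturbed = [gaussian_nll(w_mse + d, X, y) for d in 0.1 * np.eye(3)]
```

The value of σ changes the scale of the NLL but not the location of its minimum in w.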

Properties of Maximum Likelihood


• Main appeal of maximum likelihood estimator:
– It is the best estimator asymptotically
• In terms of its rate of convergence, as m→∞
– Under some conditions, it has the consistency property
• As m→∞ it converges to the true parameter value
• Conditions for consistency
– $p_{data}$ must lie within the model family $p_{model}(\cdot\,;\theta)$
– $p_{data}$ must correspond to exactly one value of θ

Statistical Efficiency
• Several estimators, based on inductive
principles other than maximum likelihood, can be consistent
• They can differ in statistical efficiency: a more efficient estimator has lower
generalization error for a fixed number of samples
• Or equivalently, needs fewer examples to obtain a fixed
generalization error
• Needs measuring closeness to true parameter
– MSE between estimated and true parameter values
– Parameter MSE decreases as m increases
– Using the Cramér-Rao bound
• For large m, no consistent estimator has lower MSE than the Maximum
Likelihood Estimator

Machine Learning Basics:
Bayesian Statistics


Topics in Bayesian Statistics


• Frequentist versus Bayesian Statistics
• Prior Probability Distribution
• Bayesian Estimation
• Ex: Bayesian Linear Regression
• Maximum A Posteriori (MAP) Estimation


Frequentist Statistics
• So far we have discussed frequentist statistics
• The approaches are based on estimating a
single value of θ, then making all predictions
thereafter based on that one estimate
• Summary of the frequentist perspective:
– True parameter value θ is fixed but unknown
– Point estimate θ̂ is a random variable
• On account of it being a function of the dataset
• The Bayesian perspective is quite different


Bayesian Perspective
• The Bayesian approach is to consider all
possible values of θ before making predictions
• The Bayesian uses probability to reflect
degrees of uncertainty in states of knowledge
• The dataset is directly observed and so is not
random
• On the other hand, the true parameter θ is
unknown or uncertain and is thus represented
as a random variable


Prior Distribution
• Before observing the data, we represent our
knowledge of θ using the prior probability
distribution p(θ)
– Sometimes simply referred to as the prior
– ML practitioner typically selects a prior distribution that is broad
• i.e., with high entropy to reflect high uncertainty
• Examples
– θ is in a finite range/volume with uniform distribution
– Many priors reflect preferences for “simpler”
solutions
• Such as smaller magnitude coefficients
• Or a function closer to being constant

Bayes rule in Bayesian approach


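• Consider we have a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$
• We can recover the effect of data on our belief
about θ by combining the data likelihood
$p(x^{(1)}, \ldots, x^{(m)} \mid \theta)$ with the prior p(θ) via Bayes rule:
$p(\theta \mid x^{(1)}, \ldots, x^{(m)}) = \frac{p(x^{(1)}, \ldots, x^{(m)} \mid \theta)\, p(\theta)}{p(x^{(1)}, \ldots, x^{(m)})}$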

From prior to posterior


• In Bayesian estimation the prior begins as a
relatively uniform or Gaussian with high entropy
• Data causes the posterior to lose entropy
– And concentrate around a few highly likely values of
parameters


Two Differences between maximum
likelihood and Bayesian estimation
1. Making predictions
2. Contribution of the prior distribution


1. Making Predictions in Bayesian approach


• MLE predictions use a point estimate of θ
• Bayesian predicts using the full distribution over θ
– Ex: after m samples, the predicted distribution over the next sample $x^{(m+1)}$ is
$p(x^{(m+1)} \mid x^{(1)}, \ldots, x^{(m)}) = \int p(x^{(m+1)} \mid \theta)\, p(\theta \mid x^{(1)}, \ldots, x^{(m)})\, d\theta$
• where $p(\theta \mid x^{(1)}, \ldots, x^{(m)})$ is the posterior over θ obtained from Bayes rule
• Here each value of θ with positive probability density
contributes to the prediction of the next example
– With the contribution weighted by the posterior density itself
• After having observed $\{x^{(1)}, \ldots, x^{(m)}\}$, if we are still quite
uncertain about the value of θ, then the uncertainty is
incorporated directly into any predictions we might make

Handling uncertainty in θ
• In the frequentist approach, uncertainty in a
given point estimate of θ is handled by
evaluating its variance
– Variance of the estimator is an assessment of how
the estimate might change with alternative
samplings of the observed data
• Bayesian answer to the question of how to
handle uncertainty in the estimator is to simply
integrate over it


Bayesian approach avoids overfitting


• Bayesian approach tends to protect well
against overfitting
– In the Bayesian approach there is no fitting, just
computing the posterior from the prior
• Integral is just an application of the laws of
probability making the Bayesian approach
simple to justify
• While the frequentist machinery for constructing
an estimator is based on an ad-hoc decision to
summarize all knowledge contained in the
dataset with a single point estimate

Second difference: Bayesian vs MLE


2. Contribution of the prior distribution
– Prior has the influence of shifting probability mass
density towards regions of the parameter space
that are preferred a priori
– In practice the prior often expresses a preference for
models that are simpler or more smooth
• Critics of the Bayesian approach identify the prior
as a source of subjective human judgment
affecting the predictions

When is Bayesian approach better?


• Bayesian methods typically generalize much
better when limited training data is available
• But suffer from high computational cost when
the number of training examples is large


Ex: Bayesian Linear Regression


• Consider the Bayesian estimation approach to
learning linear regression parameters
• Here we learn a linear mapping from an input
vector $x \in R^n$ to predict the value of a scalar $y \in R$
• The prediction is parameterized by the vector
$w \in R^n$: $\hat{y} = w^T x$
• Given a set of m training samples $(X^{(train)}, y^{(train)})$,
– we can express the prediction of y over the entire
training set as
$\hat{y}^{(train)} = X^{(train)} w$

Prediction as a Gaussian conditional


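• Prediction of $y^{(train)}$ can be expressed as a
Gaussian conditional distribution:
$p(y^{(train)} \mid X^{(train)}, w) = N(y^{(train)}; X^{(train)} w, I) \propto \exp\left(-\frac{1}{2}(y^{(train)} - X^{(train)} w)^T (y^{(train)} - X^{(train)} w)\right)$
– where we follow the standard MSE formulation in
assuming that the Gaussian variance on y is one
– In what follows we refer to $(X^{(train)}, y^{(train)})$ as simply
(X,y)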

Gaussian prior distribution


• To specify a posterior distribution over the
parameters w, we first need to specify a prior
• The prior should reflect our naive belief about
the value of these parameters
– Assume a fairly broad distribution to express a high
degree of uncertainty about θ
• For real-valued parameters it is common to use
a Gaussian as a prior distribution
$p(w) = N(w; \mu_0, \Lambda_0) \propto \exp\left(-\frac{1}{2}(w - \mu_0)^T \Lambda_0^{-1} (w - \mu_0)\right)$
• where $\mu_0$ and $\Lambda_0$ are the prior distribution mean and
covariance matrix; typically assume diagonal $\Lambda_0 = \mathrm{diag}(\lambda_0)$

Posterior on the weights


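• With prior specified, we can determine the
posterior distribution over parameters
$p(w \mid X, y) \propto p(y \mid X, w)\, p(w) \propto \exp\left(-\frac{1}{2}(y - Xw)^T (y - Xw)\right) \exp\left(-\frac{1}{2}(w - \mu_0)^T \Lambda_0^{-1} (w - \mu_0)\right)$
Now define $\Lambda_m = (X^T X + \Lambda_0^{-1})^{-1}$ and $\mu_m = \Lambda_m (X^T y + \Lambda_0^{-1} \mu_0)$
– Using these new variables, we find that the
posterior can be written as a Gaussian distribution:
$p(w \mid X, y) \propto \exp\left(-\frac{1}{2}(w - \mu_m)^T \Lambda_m^{-1} (w - \mu_m)\right)$
All terms that do not include the
parameter vector w have been omitted.
They are implied by normalization.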

Maximum A Posteriori (MAP) Estimation
• While the most principled approach is to use the full posterior
distribution over parameters θ, it is still often
desirable to have a single point estimate
• Most often a Bayesian posterior is intractable
• A point estimate offers a tractable approximation
• Rather than simply returning the maximum
likelihood estimate, we can still gain some
benefit of the Bayesian approach by allowing
the prior to influence the choice of the point
estimate

Intuition on Bayesian inference


• In most situations we set $\mu_0 = 0$
• If we set $\Lambda_0 = \frac{1}{\alpha} I$ then $\mu_m$ gives the same
estimate of w as does frequentist linear
regression with a weight decay penalty of $\alpha w^T w$
• The most important difference is that the
Bayesian estimate provides a covariance
matrix, showing how likely all the different
values of w are, rather than providing only the
estimate $\mu_m$
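This correspondence can be checked numerically. The sketch below assumes the standard Gaussian posterior for linear regression with unit noise variance, $\Lambda_m = (X^T X + \Lambda_0^{-1})^{-1}$ and $\mu_m = \Lambda_m (X^T y + \Lambda_0^{-1}\mu_0)$. The function name `posterior`, the synthetic data and α=2 are illustrative assumptions.

```python
import numpy as np

def posterior(X, y, mu0, Lambda0):
    """Gaussian posterior over w for prior N(w; mu0, Lambda0) and unit noise variance."""
    Lambda0_inv = np.linalg.inv(Lambda0)
    Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)     # posterior covariance
    mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)     # posterior mean
    return mu_m, Lambda_m

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(size=30)

alpha = 2.0
mu_m, Lambda_m = posterior(X, y, mu0=np.zeros(4), Lambda0=np.eye(4) / alpha)

# Frequentist linear regression with weight decay penalty alpha * w^T w:
# minimizes ||y - X w||^2 + alpha * w^T w
w_weight_decay = np.linalg.solve(X.T @ X + alpha * np.eye(4), X.T @ y)
```

The two point estimates coincide, but only the Bayesian computation also returns `Lambda_m`, which describes the remaining uncertainty about w.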

The MAP point estimate


• The MAP estimate chooses the point of
maximum a posteriori probability (or maximum
probability density in the case of a continuous
distribution)
$\theta_{MAP} = \arg\max_{\theta}\, p(\theta \mid x) = \arg\max_{\theta}\, \log p(x \mid \theta) + \log p(\theta)$
• On the right-hand side, log p(x|θ) is the standard log-likelihood term
• log p(θ) corresponds to the prior distribution

Example of MAP estimate


• Linear regression with Gaussian prior on w
• If this prior is given by $N(w; 0, \frac{1}{\lambda} I^2)$
then the log prior term is proportional to
the familiar $\lambda w^T w$ weight decay penalty
plus a term that does not depend on w
and does not affect the learning process
• MAP Bayesian inference with a Gaussian
prior on the weights thus corresponds to
weight decay

Advantage of MAP point estimate


• Leverages information that is brought by the prior and
cannot be found in the training data
• It helps reduce the variance in the MAP point
estimate (in comparison to the ML estimate)
• However it does so at the risk of increased bias
• A more complicated penalty term can be
derived by using a mixture of Gaussians rather
than a single Gaussian distribution as prior

Machine Learning Basics:
Stochastic Gradient Descent


Stochastic Gradient Descent (SGD)


• Nearly all deep learning is powered by SGD
– SGD extends the gradient descent algorithm
• Recall gradient descent:
– Suppose y=f(x) where both x and y are real numbers
– Derivative is a function denoted as f ’(x) or dy/dx
• It gives the slope of f(x) at the point x
• i.e., it specifies how to make a small change in the input
to make a corresponding change in the output
– f(x+ε)≈f(x)+εf ’(x)


How Gradient Descent uses derivatives


• Criterion f(x) minimized by moving from current
solution in direction of the negative of gradient


Gradient with multiple inputs


• For multiple inputs we need partial derivatives
– $\frac{\partial}{\partial x_i} f(x)$ is how f changes as only $x_i$ increases
– Gradient of f is a vector of partial derivatives $\nabla_x f(x)$
• Gradient descent proposes a new point
$x' = x - \epsilon \nabla_x f(x)$
– where ε is the learning rate, a positive scalar.
• Set to a small constant
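The update rule above can be sketched in a few lines. The quadratic criterion, its hand-written gradient `grad_f`, the learning rate 0.1 and the step count are all assumptions chosen so the minimum is known in advance.

```python
import numpy as np

def gradient_descent(grad, x0, epsilon=0.1, steps=100):
    """Repeatedly apply x' = x - epsilon * grad f(x) with a constant learning rate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - epsilon * grad(x)
    return x

# Hypothetical criterion f(x) = (x1 - 1)^2 + 2 * (x2 + 3)^2, minimized at (1, -3)
def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
```

For this criterion a much larger ε (above 0.5) would make the second coordinate diverge, which is one reason the learning rate is set to a small constant.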


Learning rate for deep learning


• Useful to reduce ε as training progresses
• Constant learning rate is default in Keras
– Momentum and decay are set to 0 by default
• keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)

Constant learning rate

Time-based decay: decay_rate = learning_rate / epochs

SGD(lr=0.1, momentum=0.8, decay=decay_rate,
nesterov=False)

Computational bottleneck
• A recurring problem in machine learning:
– large training sets are necessary for good
generalization
– but large training sets are also computationally
expensive
• SGD is an extension of gradient descent that
offers a solution
– Moreover it is a method of generalization beyond
the training set

Cost function is sum over samples


• Criterion in machine learning is a cost function
• Cost function often decomposes as a sum of
per-sample loss functions
– E.g., negative conditional log-likelihood of training
data is
$J(\theta) = E_{x,y \sim \hat{p}_{data}} L(x,y,\theta) = \frac{1}{m}\sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta)$
where m is the number of samples and
L is the per-example loss $L(x,y,\theta) = -\log p(y \mid x;\theta)$
– In linear regression we minimize
$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left\{y^{(i)} - \theta^T x^{(i)}\right\}^2$


Gradient is also sum over samples


• For these additive cost functions, gradient
descent requires computing
$\nabla_\theta J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$
– In linear regression
$\nabla \ln p(y \mid X, \theta, \beta) = \beta \sum_{i=1}^{m}\left\{y^{(i)} - \theta^T x^{(i)}\right\} x^{(i)T}$
• Computational cost of this operation is O(m)
• As training set size grows to billions, time
taken for a single gradient step becomes
prohibitively long


Insight of SGD
• Insight: Gradient is an expectation
$\nabla_\theta J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$
– Expectation may be approximated using a small set of
samples
• In each step of SGD we can sample a
minibatch of examples $B = \{x^{(1)}, \ldots, x^{(m')}\}$
– drawn uniformly from the training set
– Minibatch size m’ is typically chosen to be small: 1
to a hundred
• Crucially m’ is held fixed even if sample set is in billions
• We may fit a training set with billions of examples using
updates computed on only a hundred examples

SGD Estimate on minibatch


• Estimate of gradient is formed as
$g = \frac{1}{m'} \nabla_\theta \sum_{i=1}^{m'} L(x^{(i)}, y^{(i)}, \theta)$
– using examples from minibatch B
• SGD then follows the estimated gradient
downhill
$\theta \leftarrow \theta - \epsilon g$
– where ε is the learning rate
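The minibatch estimate and update can be sketched for linear regression as follows. This is an illustrative sketch only: the function name `sgd_linear_regression`, the per-example loss of half the squared residual, the synthetic data, and the values of ε, m' and the step count are assumptions.

```python
import numpy as np

def sgd_linear_regression(X, y, epsilon=0.05, batch_size=32, steps=2000, seed=0):
    """Minibatch SGD for per-example loss L = 0.5 * (theta^T x - y)^2."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(steps):
        idx = rng.integers(0, m, size=batch_size)    # minibatch B drawn uniformly (with replacement)
        resid = X[idx] @ theta - y[idx]
        g = X[idx].T @ resid / batch_size            # gradient estimate from m' examples
        theta = theta - epsilon * g                  # theta <- theta - epsilon * g
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=10_000)

theta_hat = sgd_linear_regression(X, y)
```

Each update touches only 32 examples, so its cost does not change if the training set grows from ten thousand rows to billions.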


How good is SGD?


• In the past gradient descent was regarded as
slow and unreliable
• Application of gradient descent to non-convex
optimization problems was regarded as
unprincipled
• SGD is not guaranteed to arrive at even a local
minimum in reasonable time
• But it often finds a very low value of the cost
function quickly enough to be useful

SGD and Training Set Size


• Outside of deep learning
– SGD is the main way to train large linear models on
very large data sets
• Without SGD cost per update increases with m
– Cost per SGD update does not depend on the
training set size m (it depends only on m’)
• As m→∞ the model will eventually converge to its best
possible test error before SGD has sampled every
example in the training set
• Asymptotic cost of training a model with SGD is O(1) as a
function of m

Deep Learning vs SVM


• Prior to the advent of DL, the main way to learn
nonlinear models was to use the kernel trick in
combination with a linear model
– SVM:
$f(x) = w^T x + b = b + \sum_{i=1}^{m} \alpha_i\, x^T x^{(i)}$
• Replace x by a feature function $\phi(x)$ and the dot product
with a kernel function $k(x, x^{(i)}) = \phi(x) \cdot \phi(x^{(i)})$
– Requires constructing an m x m matrix $G_{i,j} = k(x^{(i)}, x^{(j)})$
• Constructing this matrix is $O(m^2)$
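The quadratic cost is visible in a direct construction of the matrix. The sketch below assumes a Gaussian (RBF) kernel with an arbitrary width; the slides do not specify which kernel is used, and `rbf_kernel` and `gram_matrix` are hypothetical helper names.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel, chosen here only as an example of k(x, x')."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def gram_matrix(X, kernel):
    """Build G[i, j] = k(x_i, x_j); the nested loops make the O(m^2) cost explicit."""
    m = len(X)
    G = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = kernel(X[i], X[j])
    return G

X = np.random.default_rng(0).normal(size=(6, 2))
G = gram_matrix(X, rbf_kernel)
```

Doubling the number of training examples quadruples the number of kernel evaluations and the memory for G, whereas the cost of one SGD update is unchanged.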


Growth of interest in Deep Learning


• In academia (with medium sized data sets)
– Starting in 2006, deep learning was interesting
because it performed better on data sets with
thousands of examples
• In industry (with large data sets)
– Because it provided a scalable way of training
nonlinear models on large datasets
