Machine Learning Basics
Deep Learning
Overview of discussion
• Deep learning is a specific kind of Machine
Learning
– To understand deep learning well, one needs a
solid understanding of basic principles of ML
• This chapter is a brief course on the most
important general principles of ML
• Such as those discussed in Intro to ML (Bishop 2006)
Plan of discussion
[Figure: training vs. inference. An untrained neural net model is fit to a training dataset to produce a trained model, which is then applied to new data. Source: https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/]
1. Classification
• The computer program is asked to specify which of k
categories some input belongs to
– The learning algorithm is asked to produce a function
$f : \mathbb{R}^n \to \{1, \dots, k\}$, where n is the number of input variables
• When y = f(x), the model assigns the input vector x to a category
identified by the numeric code y
• Other variants of the classification task:
– f outputs a probability distribution over classes
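As a minimal sketch of the two variants above, the following assumes a linear scoring model with one weight vector per class. The names `class_scores`, `classify` and `softmax`, and the parameter values, are illustrative assumptions and not part of the original slides.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(x, weights, biases):
    """One linear score per class: w_c . x + b_c (an assumed model family)."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def classify(x, weights, biases):
    """f : R^n -> {1, ..., k}: return the numeric code of the best-scoring class."""
    scores = class_scores(x, weights, biases)
    return 1 + max(range(len(scores)), key=scores.__getitem__)

# Hypothetical parameters for n = 2 inputs and k = 3 classes.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
x = [2.0, 0.5]

y = classify(x, weights, biases)                   # hard label in {1, 2, 3}
probs = softmax(class_scores(x, weights, biases))  # distribution over the classes
```

Here `classify` plays the role of f, while `probs` is the variant in which the output is a distribution over classes.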
2. Classification with missing inputs
4. Transcription
• The system observes an unstructured representation
of some kind of data and transcribes the
information into discrete textual form
• OCR: the input is an image of text
– The system is asked to return this text as a sequence of
characters in ASCII/Unicode
• Google Street View uses deep learning to read address numbers
• Speech recognition: an audio waveform is
converted to word ID codes
– Google Cloud Speech API recognizes 110 language variants
5. Machine Translation
• The input already consists of a
sequence of symbols in some
language
• The computer program must convert it
into a sequence of symbols in
another language
• Commonly applied to natural
languages, e.g., English to French
• Deep learning is having an
important impact on this task
6. Structured Output
7. Anomaly Detection
• The program sifts through a set of events or objects and flags
some of them as being unusual or atypical
10. Denoising
• The ML algorithm is given as input a corrupted
example $\tilde{x} \in \mathbb{R}^n$ obtained by an unknown
corruption process from a clean example $x \in \mathbb{R}^n$
• The learner must predict the clean example x from
its corrupted version $\tilde{x}$
• Or, more generally, predict the conditional probability
distribution $p(x \mid \tilde{x})$
11. Density Estimation
The Experience, E
• Learning algorithms are broadly categorized as unsupervised or
supervised
– Based on what kind of experience they are
allowed to have during the learning process
• Most algorithms experience a dataset
– A dataset is a collection of examples, which are also called data points
Example of a data set
Semi-supervised Learning
• Some examples include a supervision target
but others do not
• In multi-instance learning, an entire collection of
samples is labeled as containing or not
containing an example of a class, but individual
members of the collection are not labeled
Reinforcement Learning
• Algorithms interact with an environment; there is no fixed
data set
– A feedback loop between the system and its experience
• Analogy: a dog is given a reward or punishment for an action
– Policies: what actions to take in a particular situation
– Utility estimation: how good a state is (→ used by the policy)
• No supervised output, but a delayed reward
– Credit assignment
• what was responsible for the outcome
• Applications:
– Game playing, robot in a maze, multiple agents, partial
observability, …
Reinforcement learning (RL) connected to a deep neural network proves to be an effective solution for learning to navigate complex environments without any prior knowledge.
An RL agent can learn from complex environments through a punishment-reward system. It can model a decision-making process.
Applications: AlphaGo Zero beat the world’s professionals (December 2017). OpenAI presented a bot that won in the Dota2 world championship (Aug 2018). RL also has many applications in robotics.
Paper: “Playing Atari with Deep Reinforcement Learning” by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller, NIPS 2013 (Cited by 1490)
Dataset: Q-learning generates data exclusively from experience, without incorporating prior knowledge. If we put all our history data into a table with state, action, reward, next state and then sample from it, it should be possible to train our agent that way, without a dataset.
Backend: Python3, Keras, Tensorflow
Core libraries: OpenAI Gym, Keras - RL
Code: https://github.com/nathanmargaglio/DQN
Strategy: (1) estimate the discounted sum of rewards of taking action a in state s, the Q(s, a) function; (2) choose the action with the maximum Q-value in any given state.
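The strategy above can be sketched with a small tabular example. This is not the code from the linked repository and uses no neural network: the four-state chain environment, the `step` function, and the values of `GAMMA` and `ALPHA` are assumptions made purely for illustration.

```python
import random

random.seed(0)

N_STATES, ACTIONS = 4, (0, 1)   # hypothetical chain: action 1 moves right, 0 moves left
GAMMA, ALPHA = 0.9, 0.5         # discount factor and learning rate (assumed values)

def step(state, action):
    """Toy environment: reaching the last state gives reward 1 and ends the episode."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

# 1. Collect experience with a random policy into a table of transitions.
history = []
for _ in range(200):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)
        s2, r, done = step(s, a)
        history.append((s, a, r, s2, done))
        s = s2

# 2. Sample transitions from the table and apply the Q-learning update.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(5000):
    s, a, r, s2, done = random.choice(history)
    target = r if done else r + GAMMA * max(Q[s2])
    Q[s][a] += ALPHA * (target - Q[s][a])

# 3. Act greedily: pick the action with the maximum Q-value in each state.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

In this toy setting the greedy policy moves right in every non-terminal state. A deep Q-network replaces the table `Q` with a neural network that maps a state to one Q-value per action.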
Topics
1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning
Generalization
• Central challenge of ML is that the algorithm
must perform well on new, previously unseen
inputs
– Not just those on which our model has been trained
• Ability to perform well on previously unobserved
inputs is called generalization
Generalization Error
• When training a machine learning model we
have access to a training set
– We can compute some error measure on the
training set, called training error
• ML differs from pure optimization
– We also want the generalization error (also called test error) to be low
• Generalization error definition
– Expected value of the error on a new input
• The expectation is taken as an average over inputs drawn
from the distribution encountered in practice
– In linear regression with a squared error criterion, the expected
error is estimated on a test set as:
$\frac{1}{m^{(test)}} \left\| X^{(test)} w - y^{(test)} \right\|_2^2$
Training error:
$\frac{1}{m^{(train)}} \left\| X^{(train)} w - y^{(train)} \right\|_2^2$
Test error:
$\frac{1}{m^{(test)}} \left\| X^{(test)} w - y^{(test)} \right\|_2^2$
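To make the distinction between training and test error concrete, here is a small sketch for a one-feature linear model y = w x. The data-generating process in `make_data` and the helper names are assumptions for illustration only.

```python
import random

random.seed(0)

def make_data(m):
    """Hypothetical data-generating process: y = 2x + Gaussian noise."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(m)]
    ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def fit_w(xs, ys):
    """Least-squares weight for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(w, xs, ys):
    """(1/m) * ||X w - y||_2^2 for a single input feature."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

x_train, y_train = make_data(50)
x_test, y_test = make_data(50)    # drawn from the same distribution

w = fit_w(x_train, y_train)       # w is chosen using the training set only
train_error = mse(w, x_train, y_train)
test_error = mse(w, x_test, y_test)
```

The weight minimizes `train_error` by construction; `test_error` is the estimate of how well that same weight does on unseen inputs.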
Under- and Over-fitting
Appropriate Capacity
• Machine learning algorithms perform well
when their capacity is appropriate for the true
complexity of the task they need to perform
and the amount of training data they are
provided with
• Models with insufficient capacity are unable to
solve complex tasks
• Models with high capacity can solve complex
tasks, but when their capacity is higher than
needed to solve the present task, they may
overfit
Principle of Capacity in action
Polynomial Model
$y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
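A rough illustration of how the degree M controls capacity, assuming NumPy is available. The quadratic target function, the noise level and the sample sizes are hypothetical choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy samples of a quadratic function.
x_train = rng.uniform(-1, 1, size=10)
y_train = x_train**2 + rng.normal(0, 0.05, size=10)
x_test = rng.uniform(-1, 1, size=200)
y_test = x_test**2 + rng.normal(0, 0.05, size=200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for M in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=M)  # least-squares fit of degree M
    errors[M] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"M={M}: train={errors[M][0]:.4f}  test={errors[M][1]:.4f}")
```

Training error can only go down as M grows, because each model family contains the previous one. The linear model underfits this data, and the degree-9 fit passes through all ten training points, which typically hurts its test error.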
• Representational capacity:
– Specifies the family of functions the learning algorithm can
choose from
• Effective capacity:
– Imperfections in the optimization algorithm can make the effective
capacity lower than the representational capacity
• Occam’s razor:
– Among competing hypotheses that explain known
observations equally well, choose the simplest one
• The idea is formalized by the VC dimension
VC Dimension
• Statistical learning theory quantifies model
capacity
• VC dimension quantifies capacity of a binary
classifier
• VC dimension is the largest value of m for
which there exists a training set of m different
points that the classifier can label arbitrarily
VC dimension (capacity) of a linear classifier in $\mathbb{R}^2$ is 3
Arbitrarily high capacity: Nonparametric models
• When do we reach most extreme case of
arbitrarily high capacity?
• Parametric models such as linear regression:
• learn a function described by a parameter vector whose
size is finite and fixed before any data is observed
• Nonparametric models have no such limitation
• Nearest-neighbor regression is an example
• Their complexity is a function of training set size
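A minimal sketch of nearest-neighbor regression, to show why it has no fixed-size parameter vector: the stored training set is the model. The function name, the squared Euclidean distance and the toy data are assumptions for illustration.

```python
def nearest_neighbor_predict(x_query, X_train, y_train):
    """Return the target of the training point closest to x_query (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(X_train)), key=lambda i: sq_dist(X_train[i], x_query))
    return y_train[best]

# Hypothetical training set; the "model" is simply the stored data.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y_train = [0.0, 1.0, 2.0]

prediction = nearest_neighbor_predict([0.9, 0.2], X_train, y_train)
```

Every training point is reproduced exactly, so training error is zero (when inputs are distinct), and adding more data makes the learned function more complex.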
Bayes Error
Regularization
• No free lunch theorem implies that we must
design our ML algorithms to perform well on a
specific task
• We do so by building a set of preferences into
learning algorithm
• When these preferences are aligned with the
learning problem, the algorithm performs better
Tradeoff between fitting the data and keeping the weights small
Controlling a model’s tendency to overfit
or underfit via weight decay
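One common way to write weight decay is to minimize the squared error plus λ times the squared norm of the weights, which has a closed-form solution. The sketch below assumes NumPy and that particular objective; the data, the degree-9 feature map and the λ values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 noisy samples of a quadratic, fit with a degree-9 polynomial.
x = rng.uniform(-1, 1, size=10)
y = x**2 + rng.normal(0, 0.05, size=10)
X = np.vander(x, N=10, increasing=True)   # columns 1, x, x^2, ..., x^9

def fit_weight_decay(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 (closed-form ridge solution)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

norms = {}
for lam in (1e-6, 1e-2, 10.0):
    w = fit_weight_decay(X, y, lam)
    train_mse = float(np.mean((X @ w - y) ** 2))
    norms[lam] = (float(np.linalg.norm(w)), train_mse)
    print(f"lambda={lam:g}: ||w||={norms[lam][0]:.3f}  train MSE={train_mse:.5f}")
```

A larger λ gives smaller weights and a higher training error, which is the tradeoff named in the title: very small λ lets the degree-9 model overfit, very large λ forces it to underfit.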
Regularization defined
• Regularization is any modification we make to a
learning algorithm that is intended to reduce its
generalization error but not its training error
• We must choose a form of regularization that is
well-suited to the particular task
Deep Learning Srihari
Models with different capacities M: Linear, Quadratic, Degree 9
Model with a high-degree polynomial (degree 9): different weight decay hyperparameters λ
Using the Moore-Penrose pseudoinverse to solve underdetermined normal equations:
$w_{ML} = \Phi^{+} t, \qquad \Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T$
Design matrix:
$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$
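The pseudoinverse solution can be checked numerically. The sketch below assumes NumPy and polynomial basis functions φ_j(x) = x^j; the helper `design_matrix` and the toy targets are illustrative assumptions.

```python
import numpy as np

def design_matrix(x, M):
    """Phi[n, j] = phi_j(x_n) with polynomial basis functions phi_j(x) = x**j, j = 0..M-1."""
    return np.vander(x, N=M, increasing=True)

# Hypothetical 1-D inputs and targets lying exactly on t = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x

Phi = design_matrix(x, M=2)                      # shape (N, M) = (4, 2)
w_ml = np.linalg.pinv(Phi) @ t                   # w_ML = Phi^+ t

# When Phi has full column rank, Phi^+ equals (Phi^T Phi)^{-1} Phi^T.
Phi_plus_explicit = np.linalg.inv(Phi.T @ Phi) @ Phi.T
```

`np.linalg.pinv` also works when ΦᵀΦ is singular, where the explicit inverse formula does not apply; in that case it returns the minimum-norm least-squares solution.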
Validation Set
• To choose hyperparameters without touching the test set, we use a validation set
– Examples that the training algorithm does not observe
• Test examples should not be used to make
choices about the model hyperparameters
• Training data is split into two disjoint parts
– First to learn the parameters
– Other is the validation set to estimate generalization
error during or after training
• allowing for the hyperparameters to be updated
– Typically 80% of training data for training and 20%
for validation
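The 80/20 split and its use for picking a hyperparameter can be sketched as follows. The one-feature weight-decay model, the candidate values and all function names are assumptions chosen to keep the example self-contained.

```python
import random

random.seed(0)

def train_val_split(examples, val_fraction=0.2):
    """Shuffle and split examples into disjoint training and validation parts."""
    shuffled = examples[:]
    random.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def fit(train, lam):
    """1-D weight-decay regression y = w * x: w = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + lam)

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Hypothetical dataset of (x, y) pairs with y = 3x + noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.5)) for x in xs]

train, val = train_val_split(data)            # 80 training, 20 validation examples

# The parameter w is learned on `train`; the hyperparameter lam is chosen on `val`.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: mse(fit(train, lam), val))
```

The test set plays no part in this loop, so it can still give an honest estimate of generalization error once `best_lam` is fixed.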
Cross-Validation
• When data set is too small, dividing into a fixed
training set and fixed testing set is problematic
– If it results in a small test set
• Small test set implies statistical uncertainty around the
estimated average test error
• Cannot claim algorithm A works better than algorithm B
for a given task
• k - fold cross-validation
– Partition the data into k non-overlapping subsets
– On trial i, i th subset of data is used as the test set
– Rest of the data is used as the training set
Algorithm KFoldXV(D,A,L,k)
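The slide body for this algorithm did not survive extraction, so the following is only a sketch of k-fold cross-validation as described in the bullets above. It assumes D is a list of examples that is already shuffled, A maps a dataset to a learned function, and L scores that function on one example; the learner and loss shown are hypothetical.

```python
def k_fold_xv(D, A, L, k):
    """Estimate the generalization error of learning algorithm A on dataset D.

    D: list of examples, A: function mapping a dataset to a learned function f,
    L: loss L(f, example) -> float, k: number of folds.
    Returns the list of per-example losses; their mean is the estimate.
    """
    folds = [D[i::k] for i in range(k)]          # k non-overlapping subsets
    errors = []
    for i in range(k):
        train = [z for j, fold in enumerate(folds) if j != i for z in fold]
        f_i = A(train)                           # train on everything except fold i
        errors.extend(L(f_i, z) for z in folds[i])   # test on fold i
    return errors

# Hypothetical learner: predict the mean training target, scored by squared error.
def mean_predictor(train):
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def squared_loss(f, example):
    x, y = example
    return (f(x) - y) ** 2

D = [(float(i), float(i % 3)) for i in range(12)]
errors = k_fold_xv(D, mean_predictor, squared_loss, k=4)
cv_estimate = sum(errors) / len(errors)
```

Every example is used for testing exactly once, which is what makes the procedure attractive when the dataset is too small for a separate fixed test set.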
Point Estimation
• Point Estimation is the attempt to provide the
single best prediction of some quantity of
interest
– Quantity of interest can be:
• A single parameter
• A vector of parameters
– E.g., weights in linear regression
• A whole function
Function Estimation
• Point estimation can also refer to estimation of
relationship between input and target variables
– Referred to as function estimation
• Here we predict a variable y given input x
– We assume f(x) is the relationship between x and y
• We may assume y=f(x)+ε
– Where ε stands for a part of y not predictable from x
– We are interested in approximating f with a model f̂
• Function estimation is the same as estimating a parameter θ
– where f̂ is a point estimator in function space
• Ex: in polynomial regression we are either estimating a
parameter w or estimating a function mapping from x to y
1. Bias of an estimator
• The bias of an estimator $\hat{\theta}_m = g(x^{(1)}, \dots, x^{(m)})$ for
parameter θ is defined as
$\mathrm{bias}(\hat{\theta}_m) = E[\hat{\theta}_m] - \theta$
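Example: the sample mean $\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$ of Bernoulli samples with mean θ is unbiased:
$$
\begin{aligned}
\mathrm{bias}(\hat{\theta}_m) &= E[\hat{\theta}_m] - \theta \\
&= E\left[ \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right] - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} E\left[ x^{(i)} \right] - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} \sum_{x^{(i)}=0}^{1} \left( x^{(i)} \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})} \right) - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} (\theta) - \theta = \theta - \theta = 0
\end{aligned}
$$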
$\mathrm{MSE} = E\left[ (\hat{\theta}_m - \theta)^2 \right] = \mathrm{Bias}(\hat{\theta}_m)^2 + \mathrm{Var}(\hat{\theta}_m)$
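Bias, variance and their relation to MSE can be checked by simulation. The sketch below repeatedly draws Bernoulli datasets and compares the sample mean with a deliberately shrunk version of it; the values of `THETA`, `M`, `TRIALS` and the shrink factor 0.8 are assumptions for illustration.

```python
import random

random.seed(0)

THETA, M, TRIALS = 0.3, 20, 20000   # assumed true parameter, sample size, repetitions

def sample_mean_estimates(shrink=1.0):
    """Draw TRIALS datasets of M Bernoulli(THETA) samples; return shrink * sample mean for each."""
    return [shrink * sum(random.random() < THETA for _ in range(M)) / M
            for _ in range(TRIALS)]

def bias_var_mse(estimates):
    """Empirical bias, variance and MSE of a list of estimates of THETA."""
    mean_est = sum(estimates) / len(estimates)
    bias = mean_est - THETA
    var = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
    mse = sum((e - THETA) ** 2 for e in estimates) / len(estimates)
    return bias, var, mse

bias_u, var_u, mse_u = bias_var_mse(sample_mean_estimates())        # unbiased sample mean
bias_s, var_s, mse_s = bias_var_mse(sample_mean_estimates(0.8))     # deliberately biased
```

The sample mean shows a bias close to zero, the shrunk estimator shows a bias close to 0.8 × 0.3 − 0.3 = −0.06 with a smaller variance, and in both cases the MSE equals the squared bias plus the variance.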
Underfit-Overfit : Bias-Variance
Relationship of bias-variance to capacity is similar to
underfitting and overfitting relationship to capacity
Consistency
• So far we have discussed behavior of an
estimator for a fixed training set size
• We are also interested in the behavior of the
estimator as the training set grows
• As the number of data points m in the training set
grows, we would like our point estimates to
converge to the true value of the parameters:
$\mathrm{plim}_{m \to \infty} \hat{\theta}_m = \theta$
– The symbol plim indicates convergence in probability
KL divergence
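• Maximum likelihood is equivalent to minimizing the KL divergence between:
– the empirical distribution p̂data and the model distribution pmodel(x)
• The KL divergence is
$D_{KL}(\hat{p}_{data} \,\|\, p_{model}) = E_{x \sim \hat{p}_{data}} \left[ \log \hat{p}_{data}(x) - \log p_{model}(x) \right]$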
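The equivalence between maximum likelihood and minimizing the KL divergence can be illustrated for a Bernoulli model with a simple grid search. The coin-flip data, the grid of candidate parameters and the function names are assumptions for illustration.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]            # hypothetical coin flips
p_hat = {1: data.count(1) / len(data),            # empirical distribution of the data
         0: data.count(0) / len(data)}

def model(x, theta):
    """Bernoulli model p_model(x; theta)."""
    return theta if x == 1 else 1.0 - theta

def kl(theta):
    """KL divergence from the empirical distribution to the model with parameter theta."""
    return sum(p * (math.log(p) - math.log(model(x, theta)))
               for x, p in p_hat.items() if p > 0)

def avg_log_likelihood(theta):
    return sum(math.log(model(x, theta)) for x in data) / len(data)

grid = [i / 100 for i in range(1, 100)]
theta_kl = min(grid, key=kl)                      # minimizes the KL divergence
theta_ml = max(grid, key=avg_log_likelihood)      # maximizes the likelihood
```

Both searches pick the same θ (the fraction of ones, 0.7), because the term involving only the empirical distribution does not depend on θ.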
Statistical Efficiency
• Several estimators, based on inductive
principles other than maximum likelihood, can be consistent
• Define statistical efficiency: an estimator is more efficient if it has lower
generalization error for a fixed number of samples
• Or equivalently, needs fewer examples to obtain a fixed
generalization error
• This requires measuring closeness to the true parameter
– MSE between estimated and true parameter values
– Parameter MSE decreases as m increases
– Using the Cramér-Rao bound
• For large m, no consistent estimator has a lower MSE than the maximum
likelihood estimator
Frequentist Statistics
• So far we have discussed frequentist statistics
• The approaches are based on estimating a
single value of θ, then making all predictions
thereafter based on that one estimate
• Summary of the frequentist perspective:
– True parameter value θ is fixed but unknown
– Point estimate θ̂ is a random variable
• On account of it being a function of the dataset
• The Bayesian perspective is quite different
Bayesian Perspective
• The Bayesian approach is to consider all
possible values of θ before making predictions
• The Bayesian uses probability to reflect
degrees of uncertainty in states of knowledge
• The dataset is directly observed and so is not
random
• On the other hand, the true parameter θ is
unknown or uncertain and is thus represented
as a random variable
Prior Distribution
• Before observing the data, we represent our
knowledge of θ using the prior probability
distribution p(θ)
– Sometimes simply referred to as the prior
– ML practitioner selects prior distribution to be broad
• i.e., with high entropy to reflect high uncertainty
• Examples
– θ is in a finite range/volume with uniform distribution
– Many priors reflect preferences for “simpler”
solutions
• Such as smaller magnitude coefficients
• Or a function closer to being constant
Handling uncertainty in θ
• In the frequentist approach, uncertainty in a
given point estimate of θ is handled by
evaluating its variance
– Variance of the estimator is an assessment of how
the estimate might change with alternative
samplings of the observed data
• Bayesian answer to the question of how to
handle uncertainty in the estimator is to simply
integrate over it
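The contrast between a point estimate and integrating over θ can be sketched with a Bernoulli likelihood and a Beta prior, for which the integral has a closed form. The choice of a uniform Beta(1, 1) prior, the toy data and the function names are assumptions for illustration.

```python
def posterior_beta(data, a=1.0, b=1.0):
    """Beta(a, b) prior on theta plus Bernoulli data gives Beta posterior parameters."""
    ones = sum(data)
    return a + ones, b + len(data) - ones

def predictive_prob_one(data, a=1.0, b=1.0):
    """p(x_new = 1 | data): the integral of theta * p(theta | data), i.e. the posterior mean."""
    a_post, b_post = posterior_beta(data, a, b)
    return a_post / (a_post + b_post)

def mle_prob_one(data):
    """Frequentist plug-in prediction using the single point estimate of theta."""
    return sum(data) / len(data)

data = [1, 1, 1]                       # hypothetical: three observations, all ones
p_bayes = predictive_prob_one(data)    # (1 + 3) / (2 + 3) under a uniform prior
p_mle = mle_prob_one(data)             # fraction of ones in the data
```

With only three observations the point estimate predicts a one with probability 1.0, while the Bayesian prediction of 0.8 still allows for a zero because it averages over all values of θ that remain plausible.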
Computational bottleneck
• A recurring problem in machine learning:
– large training sets are necessary for good
generalization
– but large training sets are also computationally
expensive
• SGD is an extension of gradient descent that
offers a solution
– Moreover it is a method of generalization beyond
the training set
$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$
$\nabla \ln p(y \mid X, \theta, \beta) = \beta \sum_{i=1}^{m} \left\{ y^{(i)} - \theta^T x^{(i)} \right\} x^{(i)T}$
Insight of SGD
• Insight: the gradient is an expectation
$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$
– The expectation may be approximated using a small set of
samples
• In each step of SGD we can sample a
minibatch of examples B ={x(1),..,x(m’)}
– drawn uniformly from the training set
– Minibatch size m’ is typically chosen to be small: 1
to a hundred
• Crucially m’ is held fixed even if sample set is in billions
• We may fit a training set with billions of examples using
updates computed on only a hundred examples
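A minimal sketch of minibatch SGD for a one-parameter linear regression: each step averages the per-example gradients over a small minibatch and moves against that estimate. The dataset, the squared-error loss, the learning rate `lr` and the batch size are assumptions chosen for illustration.

```python
import random

random.seed(0)

# Hypothetical training set for 1-D linear regression y = theta * x + noise.
TRUE_THETA = 2.0
xs = [random.uniform(-1.0, 1.0) for _ in range(10000)]
data = [(x, TRUE_THETA * x + random.gauss(0.0, 0.1)) for x in xs]

def grad_loss(x, y, theta):
    """Gradient of the per-example loss L = 0.5 * (theta * x - y)^2 w.r.t. theta."""
    return (theta * x - y) * x

theta, lr, batch_size = 0.0, 0.5, 32      # assumed learning rate and minibatch size m'
for _ in range(500):
    batch = random.sample(data, batch_size)   # drawn uniformly from the training set
    # Average of per-example gradients over the minibatch: an estimate of the full gradient.
    g = sum(grad_loss(x, y, theta) for x, y in batch) / batch_size
    theta -= lr * g                           # step against the estimated gradient
```

The cost of each update depends on `batch_size` only, not on the 10,000 training examples, which is why the same loop scales to much larger training sets.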