Today:
- Calculus
- Lagrange Multipliers
- Linear Regression
1
Optimization with constraints
What if I want to constrain the parameters of the model?
For example: the mean is less than 10.
2
Lagrange Multipliers
Find the maxima of f(x, y) subject to a constraint:

f(x, y) = x + 2y

x^2 + y^2 = 1
3
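For reference, here is the worked solution with a Lagrange multiplier (this derivation is mine, not from the slide):

L(x, y, \lambda) = x + 2y + \lambda (1 - x^2 - y^2)
\partial L / \partial x = 1 - 2\lambda x = 0, \quad \partial L / \partial y = 2 - 2\lambda y = 0 \quad \Rightarrow \quad y = 2x
x^2 + (2x)^2 = 1 \quad \Rightarrow \quad x = \pm 1/\sqrt{5}
The maximum is at (x, y) = (1/\sqrt{5}, 2/\sqrt{5}), where f(x, y) = \sqrt{5}.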
General form
Maximizing: f (x, y)
Subject to: g(x, y) = c
[Figure: targets t plotted against input x]
9
Graphical Example of Regression
[Figure: t plotted against input x]
10
Graphical Example of Regression
[Figure: t plotted against input x]
11
Definition
In linear regression, we assume that the model that generates the data involves only a linear combination of the input variables.

y(x, w) = w_0 + w_1 x_1 + \ldots + w_D x_D

y(x, w) = w_0 + \sum_{j=1}^{D} w_j x_j

where w is the vector of weights that defines the parameters of the model.
12
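As a quick illustration (not from the slides), the linear model above is one line of NumPy; the function and variable names here are my own:

```python
import numpy as np

def predict(x, w):
    """Linear model y(x, w) = w_0 + sum_j w_j * x_j."""
    # x: (D,) input features; w: (D+1,) weights with w[0] as the bias w_0
    return w[0] + np.dot(w[1:], x)

# Example with D = 2 features
w = np.array([0.5, 1.0, -2.0])   # w_0, w_1, w_2
x = np.array([3.0, 4.0])
print(predict(x, w))             # 0.5 + 1*3 - 2*4 = -4.5
```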
Evaluation
How can we evaluate the performance of a regression solution?
Error functions (or loss functions):

Squared error:
E(t_i, y(x_i, w)) = \frac{1}{2} (t_i - y(x_i, w))^2

Linear (absolute) error:
E(t_i, y(x_i, w)) = |t_i - y(x_i, w)|
13
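As a small sketch (the function names are mine, not the slides'), the two error functions in NumPy:

```python
import numpy as np

def squared_error(t, y):
    # E(t, y) = 1/2 * (t - y)^2
    return 0.5 * (t - y) ** 2

def absolute_error(t, y):
    # E(t, y) = |t - y|
    return np.abs(t - y)

print(squared_error(3.0, 2.5))   # 0.125
print(absolute_error(3.0, 2.5))  # 0.5
```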
Regression Error
14
Empirical Risk
Empirical risk is the average loss measured on the data:

R_{emp} = \frac{1}{N} \sum_{i=1}^{N} E(t_i, y(x_i, w))
        = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} (t_i - y(x_i, w))^2

By minimizing the risk on the training data, we optimize the fit with respect to the loss function: \nabla_w R = 0.
15
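A minimal sketch of the empirical risk just defined, assuming a design matrix X whose first column is all ones (the helper names are hypothetical):

```python
import numpy as np

def empirical_risk(t, X, w):
    """R_emp = (1/N) * sum_i 1/2 * (t_i - y(x_i, w))^2."""
    residuals = t - X @ w
    return np.mean(0.5 * residuals ** 2)

# Toy data generated by t = 1 + 2x, so the perfect weights give zero risk
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])
print(empirical_risk(t, X, np.array([1.0, 2.0])))  # 0.0
```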
Model Likelihood and Empirical Risk
Two related but distinct ways to look at a model:
1. Model Likelihood: What is the likelihood that a model generated the observed data?
2. Empirical Risk: How much error does the model have on the training data?
16
Model Likelihood
p(t | x, w, \beta) = N(t; y(x, w), \beta^{-1}), where \beta = \frac{1}{\sigma^2}

p(t | x, w, \beta) = \prod_{i=0}^{N-1} N(t_i; y(x_i, w), \beta^{-1})

Assuming independently, identically distributed (i.i.d.) data.
17
Understanding Model Likelihood
p(t | x, w, \beta) = \prod_{i=0}^{N-1} N(t_i; y(x_i, w), \beta^{-1})

p(t | x, w, \beta) = \prod_{i=0}^{N-1} \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} (y(x_i, w) - t_i)^2 \right)
(Substitute the equation of a Gaussian)

\ln p(t | x, w, \beta) = \ln \prod_{i=0}^{N-1} \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} (y(x_i, w) - t_i)^2 \right)
(Apply a log function)

\ln p(t | x, w, \beta) = -\frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi
(The log dissolves products into sums)
18
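As a numeric sanity check of the expression above (a sketch with my own names; beta is the noise precision):

```python
import numpy as np

def log_likelihood(t, y_pred, beta):
    """ln p(t | x, w, beta) for i.i.d. Gaussian noise with precision beta."""
    N = len(t)
    return (-beta / 2.0 * np.sum((y_pred - t) ** 2)
            + N / 2.0 * np.log(beta)
            - N / 2.0 * np.log(2.0 * np.pi))

t = np.array([1.1, 1.9, 3.2])
y_pred = np.array([1.0, 2.0, 3.0])
print(log_likelihood(t, y_pred, beta=4.0))
```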
Understanding Model Likelihood
\ln p(t | x, w, \beta) = -\frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi

\nabla_w \ln p(t | x, w, \beta) = -\nabla_w \frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2
(Optimize the weights: Maximum Likelihood Estimation)

Log likelihood:
\nabla_w \frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 = 0

Empirical risk with the squared loss function:
R_{emp} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} (t_i - y(x_i, w))^2

Setting the gradient of the log likelihood to zero gives the same condition as minimizing the empirical risk under the squared loss.
19
Maximizing Log Likelihood (1-D)
Find the optimal settings of w.

w = [w_0, w_1]^T

\nabla_w R = 0 \;\Rightarrow\; \frac{\partial R}{\partial w_0} = 0 and \frac{\partial R}{\partial w_1} = 0

R(w) = \frac{1}{2N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2
20
Maximizing Log Likelihood
\nabla_w R(w) = \nabla_w \frac{1}{2N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2

\frac{\partial R}{\partial w_0} = \frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-1)
(Partial derivative)

\frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-1) = 0
(Set to zero)

\frac{1}{N} \sum_{i=0}^{N-1} w_0 = \frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i)
(Separate the sum to isolate w_0)

w_0 = \frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i)

w_0 = \frac{1}{N} \sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i
21
Maximizing Log Likelihood
\nabla_w R(w) = \nabla_w \frac{1}{2N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2

\frac{\partial R}{\partial w_1} = \frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-x_i)
(Partial derivative)

\frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-x_i) = 0
(Set to zero)

\frac{1}{N} \sum_{i=0}^{N-1} (t_i x_i - w_1 x_i^2 - w_0 x_i) = 0

w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i^2 = \frac{1}{N} \sum_{i=0}^{N-1} t_i x_i - w_0 \frac{1}{N} \sum_{i=0}^{N-1} x_i
(Separate the sum to isolate w_1)

w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - w_0 \sum_{i=0}^{N-1} x_i
22
Maximizing Log Likelihood
w_0 = \frac{1}{N} \sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i
(From the previous partial derivative)

w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - w_0 \sum_{i=0}^{N-1} x_i
(From the previous slide)

w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - \left( \frac{1}{N} \sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i \right) \sum_{i=0}^{N-1} x_i
(Substitute)

w_1 \left( \sum_{i=0}^{N-1} x_i^2 - \frac{1}{N} \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i \right) = \sum_{i=0}^{N-1} t_i x_i - \frac{1}{N} \sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i
(Isolate w_1)

w_1 = \frac{ \sum_{i=0}^{N-1} t_i x_i - \frac{1}{N} \sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i }{ \sum_{i=0}^{N-1} x_i^2 - \frac{1}{N} \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i }
23
Maximizing Log Likelihood
Clean and easy.
w_0 = \frac{1}{N} \sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i

w_1 = \frac{ \sum_{i=0}^{N-1} t_i x_i - \frac{1}{N} \sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i }{ \sum_{i=0}^{N-1} x_i^2 - \frac{1}{N} \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i }
Or not
24
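A sketch of the closed-form 1-D fit derived above (the function names are mine):

```python
import numpy as np

def fit_line(x, t):
    """Least-squares fit of t ~ w0 + w1 * x using the sums derived above."""
    N = len(x)
    w1 = (np.sum(t * x) - np.sum(t) * np.sum(x) / N) / \
         (np.sum(x ** 2) - np.sum(x) * np.sum(x) / N)
    w0 = np.mean(t) - w1 * np.mean(x)
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2.0 + 0.5 * x           # noiseless line with w0 = 2, w1 = 0.5
print(fit_line(x, t))       # approximately (2.0, 0.5)
```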
Likelihood using linear algebra
Representing the linear regression
function in terms of vectors.
y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_{N-1} x_{N-1}

x = [1, x_1, x_2, \ldots, x_{N-1}]^T

w = [w_0, w_1, w_2, \ldots, w_{N-1}]^T

y = x^T w
25
Likelihood using linear algebra
Stack the vectors x^T into a matrix of data points, X.

R_{emp}(w) = \frac{1}{2N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2

R_{emp}(w) = \frac{1}{2N} \sum_{i=0}^{N-1} \left( t_i - \begin{bmatrix} 1 & x_i \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} \right)^2
(Representation as vectors)

R_{emp}(w) = \frac{1}{2N} \left\| \begin{bmatrix} t_0 \\ t_1 \\ \vdots \\ t_{N-1} \end{bmatrix} - \begin{bmatrix} 1 & x_0 \\ 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{N-1} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} \right\|^2
(Stack the data into a matrix and use the norm to handle the sum)

R_{emp}(w) = \frac{1}{2N} \| t - Xw \|^2
26
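A sketch of the vectorized form, assuming 1-D inputs (the helper names are mine):

```python
import numpy as np

def design_matrix(x):
    """Stack rows [1, x_i] into the matrix X."""
    return np.column_stack([np.ones_like(x), x])

def risk(t, X, w):
    """R_emp(w) = 1/(2N) * ||t - X w||^2."""
    r = t - X @ w
    return r @ r / (2 * len(t))

x = np.array([0.0, 1.0, 2.0])
t = np.array([1.0, 3.0, 5.0])
print(risk(t, design_matrix(x), np.array([1.0, 2.0])))   # 0.0 for the perfect fit
```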
Likelihood in multiple dimensions
This representation of risk has no inherent
dimensionality.
R_{emp}(w) = \frac{1}{2N} \| t - Xw \|^2

\nabla_w R_{emp}(w) = 0

\nabla_w \frac{1}{2N} \| t - Xw \|^2 = 0
27
Maximum Likelihood Estimation redux
\nabla_w R_{emp}(w) = 0

\nabla_w \frac{1}{2N} \| t - Xw \|^2 = 0

\frac{1}{2N} \nabla_w (t - Xw)^T (t - Xw) = 0
(Decompose the norm)

\frac{1}{2N} \nabla_w \left( t^T t - t^T Xw - w^T X^T t + w^T X^T X w \right) = 0
(Expand the product, FOIL linear-algebra style)

\frac{1}{2N} \left( -X^T t - X^T t + 2 X^T X w \right) = 0
(Differentiate)

\frac{1}{2N} \left( -2 X^T t + 2 X^T X w \right) = 0
(Combine terms)

X^T X w = X^T t
(Isolate w)

w = (X^T X)^{-1} X^T t
28
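The normal equations above can be solved directly; a minimal sketch (in practice np.linalg.lstsq is the numerically safer route):

```python
import numpy as np

def fit_least_squares(X, t):
    """Solve X^T X w = X^T t for w."""
    return np.linalg.solve(X.T @ X, X.T @ t)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 2.9, 5.1, 7.0])
print(fit_least_squares(X, t))    # close to [1.0, 2.0]
```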
Extension to polynomial regression
29
Extension to polynomial regression
y = c_0 + c_1 x_1 + c_2 x_2

y = c_0 + c_1 x + c_2 x^2
30
Generate new features
Standard polynomial with coefficients w:

y(x, w) = w_0 + \sum_{d=1}^{D} w_d x^d

Risk:

R = \frac{1}{2} \left\| \begin{bmatrix} t_0 \\ t_1 \\ \vdots \\ t_{n-1} \end{bmatrix} - \begin{bmatrix} 1 & x_0 & \cdots & x_0^p \\ 1 & x_1 & \cdots & x_1^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n-1} & \cdots & x_{n-1}^p \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_p \end{bmatrix} \right\|^2
31
Generate new features
Feature trick: to fit a D-dimensional polynomial, create a D-element vector from x_i:

x_i = [x_i^0, x_i^1, \ldots, x_i^P]^T
32
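A sketch of the feature trick for scalar inputs (np.vander builds the powers; the function name is mine):

```python
import numpy as np

def poly_features(x, degree):
    """Map each scalar x_i to [x_i^0, x_i^1, ..., x_i^degree]."""
    return np.vander(np.asarray(x), degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0])
print(poly_features(x, 3))
# [[1. 0. 0. 0.]
#  [1. 1. 1. 1.]
#  [1. 2. 4. 8.]]
```
The resulting matrix plays the role of X in the earlier normal equations.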
How is this still linear regression?
The regression is linear in the parameters,
despite projecting xi from one dimension to D
dimensions.
Now we fit a plane (or hyperplane) to a
representation of xi in a higher dimensional
feature space.
This generalizes to any set of functions \phi_i : \mathbb{R} \rightarrow \mathbb{R}

x_i = [\phi_0(x_i), \phi_1(x_i), \ldots, \phi_P(x_i)]^T
33
Basis functions as feature extraction
These functions are called basis functions.
- They define the bases of the feature space.
- They allow a linear decomposition of any type of function of the data points.
Common choices (see the sketch after this list):
- Polynomials \phi_i : \mathbb{R} \rightarrow \mathbb{R}
- Gaussians
- Sigmoids
- Wave functions (sine, etc.)
34
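A sketch of one non-polynomial choice, Gaussian basis functions; the centers, width, and names are my own illustration:

```python
import numpy as np

def gaussian_basis(x, centers, width=1.0):
    """Map each scalar x_i to [phi_0(x_i), ..., phi_P(x_i)] with Gaussian bumps."""
    x = np.asarray(x).reshape(-1, 1)          # (N, 1)
    mu = np.asarray(centers).reshape(1, -1)   # (1, P+1)
    return np.exp(-(x - mu) ** 2 / (2 * width ** 2))

x = np.array([0.0, 0.5, 1.0])
Phi = gaussian_basis(x, centers=[0.0, 0.5, 1.0], width=0.25)
print(Phi.shape)   # (3, 3); Phi then replaces X in the least-squares solution
```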
Training data vs. Testing Data
Evaluating the performance of a classifier on
training data is meaningless.
With enough parameters, a model can simply
memorize (encode) every training point
To evaluate performance, data is divided into
training and testing (or evaluation) data.
Training data is used to learn model parameters
Testing data is used to evaluate performance
35
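A minimal sketch of such a split, assuming a random hold-out fraction (the names are mine; library utilities do the same thing):

```python
import numpy as np

def train_test_split(X, t, test_fraction=0.25, seed=0):
    """Randomly hold out a fraction of the data for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(t))
    n_test = int(len(t) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], t[train], X[test], t[test]

X = np.column_stack([np.ones(8), np.arange(8.0)])
t = 1.0 + 2.0 * np.arange(8.0)
X_tr, t_tr, X_te, t_te = train_test_split(X, t)
print(len(t_tr), len(t_te))   # 6 2
```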
Overfitting
36
Overfitting
37
Overfitting performance
38
Definition of overfitting
When the model describes the noise,
rather than the signal.
39
Possible detection of overfitting
Stability
An appropriately fit model is stable under
different samples of the training data
An overfit model generates inconsistent
performance
Performance
A good model has low test error
A bad model has high test error
40
What is the optimal model size?
The best model size is the one that generalizes best to unseen data.
Approximate this by testing error.
One way to optimize parameters is to
minimize testing error.
This operation uses testing data as tuning or
development data
Sacrifices training data in favor of parameter
optimization
Can we do this without explicit evaluation
data?
41
Context for linear regression
Simple approach
Efficient learning
Extensible
Regularization provides robust models
42
Break
Coffee. Stretch.
43
Linear Regression
Identify the best parameters, w, for a
regression function
y = w_0 + \sum_{i=1}^{N} w_i x_i

w = (X^T X)^{-1} X^T t
44
Overfitting
Recall: overfitting happens when a model
is capturing idiosyncrasies of the data
rather than generalities.
Often caused by too many parameters relative
to the amount of training data.
E.g. an order-N polynomial can intersect any
N+1 data points
45
Dealing with Overfitting
Use more data
Use a tuning set
Regularization
Be a Bayesian
46
Regularization
In a linear regression model, overfitting is characterized by large weights.
       M=0      M=1      M=3            M=9
w0     0.19     0.82     0.31           0.35
w1             -1.27     7.99         232.37
w2                     -25.43       -5321.83
w3                      17.37       48568.31
w4                                 -231639.30
w5                                  640042.26
w6                                -1061800.52
w7                                 1042400.18
w8                                 -557682.99
w9                                  125201.43
47
Penalize large weights
Introduce a penalty term in the loss function.

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2

Regularized regression (L2 regularization or ridge regression):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \| w \|^2
48
Regularization Derivation
\nabla_w E(w) = 0

\nabla_w \left( \frac{1}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{\lambda}{2} \| w \|^2 \right) = 0

\nabla_w \left( \frac{1}{2} \| t - Xw \|^2 + \frac{\lambda}{2} \| w \|^2 \right) = 0

\nabla_w \left( \frac{1}{2} (t - Xw)^T (t - Xw) + \frac{\lambda}{2} w^T w \right) = 0
49
\nabla_w \left( \frac{1}{2} (t - Xw)^T (t - Xw) + \frac{\lambda}{2} w^T w \right) = 0

-X^T t + X^T X w + \frac{\lambda}{2} \nabla_w (w^T w) = 0

-X^T t + X^T X w + \lambda w = 0

-X^T t + X^T X w + \lambda I w = 0

-X^T t + (X^T X + \lambda I) w = 0

(X^T X + \lambda I) w = X^T t

w = (X^T X + \lambda I)^{-1} X^T t
50
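A sketch of the regularized solution above (note that, like the slide's derivation, this penalizes the bias weight as well):

```python
import numpy as np

def fit_ridge(X, t, lam):
    """w = (X^T X + lambda * I)^(-1) X^T t."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.1, 4.9, 7.0])
print(fit_ridge(X, t, lam=0.0))   # ordinary least squares
print(fit_ridge(X, t, lam=1.0))   # weights shrink toward zero
```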
Regularization in Practice
51
Regularization Results
52
More regularization
The penalty term defines the style of regularization.

L2 regularization:
E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \| w \|_2^2

L1 regularization:
E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \| w \|_1

L0 regularization (the L0 norm counts the non-zero weights, so minimizing it selects the optimal subset of features):
E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \sum_{n=0}^{N-1} \mathbb{1}(w_n \neq 0)
53
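As a sketch, the three penalty terms for a concrete weight vector (the lambda factors are omitted):

```python
import numpy as np

w = np.array([0.0, 3.0, -4.0, 0.0])

l2_penalty = np.sum(w ** 2)         # ||w||_2^2 = 25
l1_penalty = np.sum(np.abs(w))      # ||w||_1   = 7
l0_penalty = np.count_nonzero(w)    # number of non-zero weights = 2

print(l2_penalty, l1_penalty, l0_penalty)
```
Unlike the L2 and L1 penalties, the L0 count is not differentiable, which is why minimizing it amounts to searching over subsets of features.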
Curse of dimensionality
Increasing the dimensionality of the features increases the data requirements exponentially.
For example, if a single feature can be accurately approximated with 100 data points, optimizing the joint over two features requires 100 * 100 data points.
54
Bayesians v. Frequentists
What is a probability?
Frequentists
A probability is the likelihood that an event will happen
It is approximated by the ratio of the number of observed events to the
number of total events
Assessment is vital to selecting a model
Point estimates are absolutely fine
Bayesians
A probability is a degree of believability of a proposition.
Bayesians require that probabilities be prior beliefs conditioned on data.
The Bayesian approach is optimal, given a good model, a good prior
and a good loss function. Don't worry so much about assessment.
If you are ever making a point estimate, you've made a mistake. The
only valid probabilities are posteriors based on evidence given some
prior
55
Bayesian Linear Regression
The previous MLE derivation of linear regression uses
point estimates for the weight vector, w.
Bayesians say, hold it right there.
Use a prior distribution over w to estimate the parameters:

p(w | \alpha) = N(w | 0, \alpha^{-1} I) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left( -\frac{\alpha}{2} w^T w \right)

Alpha is a hyperparameter over w: the precision, or inverse variance, of the prior distribution.

Now optimize the posterior:

p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) \, p(w | \alpha)
56
Optimize the Bayesian posterior
p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) \, p(w | \alpha)
58
Optimize the Bayesian posterior
Maximize the log posterior: \ln p(t | x, w, \beta) + \ln p(w | \alpha)

\ln p(t | x, w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2

\ln p(w | \alpha) = \frac{M+1}{2} \ln \alpha - \frac{M+1}{2} \ln 2\pi - \frac{\alpha}{2} w^T w

Ignoring terms that do not depend on w, maximizing \ln p(t | x, w, \beta) + \ln p(w | \alpha) is equivalent to minimizing

\frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\alpha}{2} w^T w
60
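Since the objective above is the squared-error risk plus an L2 penalty, the posterior mode can be computed with the ridge formula using lambda = alpha / beta; a sketch under that assumption:

```python
import numpy as np

def map_weights(X, t, alpha, beta):
    """Posterior mode for a Gaussian likelihood (precision beta) and Gaussian prior (precision alpha)."""
    lam = alpha / beta
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.2, 6.9])
print(map_weights(X, t, alpha=1.0, beta=10.0))
```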
Next Time
Logistic Regression
61