
Today

Calculus
Lagrange Multipliers

Linear Regression

Optimization with constraints
What if I want to constrain the parameters
of the model?
e.g., the mean is less than 10

Find the best likelihood, subject to a
constraint.

Two functions:
An objective function to maximize
A constraint that must be satisfied
Lagrange Multipliers
Find the maxima of f(x, y) subject to a
constraint.

f(x, y) = x + 2y
x^2 + y^2 = 1
General form
Maximizing: f(x, y)
Subject to: g(x, y) = c

Introduce a new variable, \lambda, and find the
maxima of

\Lambda(x, y, \lambda) = f(x, y) + \lambda (g(x, y) - c)
Example
Maximizing: f(x, y) = x + 2y
Subject to: x^2 + y^2 = 1

Introduce a new variable, \lambda, and find the
maxima of

\Lambda(x, y, \lambda) = x + 2y + \lambda (x^2 + y^2 - 1)
Example

\partial\Lambda(x, y, \lambda) / \partial x = 1 + 2\lambda x = 0
\partial\Lambda(x, y, \lambda) / \partial y = 2 + 2\lambda y = 0
\partial\Lambda(x, y, \lambda) / \partial\lambda = x^2 + y^2 - 1 = 0

We now have 3 equations with 3 unknowns.
Example
Eliminate \lambda:
1 = -2\lambda x
2 = -2\lambda y
1/x = 2/y  (= -2\lambda)
y = 2x

Substitute and solve:
x^2 + y^2 = 1
x^2 + (2x)^2 = 1
5x^2 = 1
x = \pm 1/\sqrt{5}
y = \pm 2/\sqrt{5}
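Not part of the original slides: a minimal sketch, assuming SymPy is available, that reproduces this worked example by solving the three stationarity equations symbolically.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + 2*y                       # objective f(x, y)
g = x**2 + y**2 - 1               # constraint g(x, y) - c
L = f + lam * g                   # Lagrangian Lambda(x, y, lam)

# Stationary points: all partial derivatives set to zero
solutions = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
for s in solutions:
    print(s, ' f =', f.subs(s))   # maximum at x = 1/sqrt(5), y = 2/sqrt(5)
```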
Basics of Linear Regression
Regression algorithm
A supervised technique.
In one dimension:
Identify y : R \to R
In D dimensions:
Identify y : R^D \to R
Given training data: {x_0, x_1, ..., x_N}
And targets: {t_0, t_1, ..., t_N}
Graphical Example of Regression
[Figure slides: training targets t plotted against inputs x; the "?" marks the unknown target to be predicted at a new x.]
Definition
In linear regression, we assume that the
model that generates the data involves only
a linear combination of the input variables.

y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D

y(x, w) = w_0 + \sum_{j=1}^{D} w_j x_j

where w is a vector of weights which
defines the parameters of the model.
Evaluation
How can we evaluate the performance of a
regression solution?
Error functions (or loss functions):

Squared Error: E(t_i, y(x_i, w)) = \frac{1}{2} (t_i - y(x_i, w))^2
Linear Error:  E(t_i, y(x_i, w)) = |t_i - y(x_i, w)|
Regression Error
[Figure]
Empirical Risk
Empirical risk is a measure of the loss computed from the
data.

R_{emp} = \frac{1}{N} \sum_{i=1}^{N} E(t_i, y(x_i, w))
        = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} (t_i - y(x_i, w))^2

By minimizing the risk on the training data, we
optimize the fit with respect to the loss function:
\nabla_w R = 0
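Not from the slides: a short NumPy sketch of the squared and absolute losses and the empirical risk for a toy 1-D linear model; the data and parameter values here are illustrative.

```python
import numpy as np

def squared_error(t, y):
    """Squared-error loss: (1/2) * (t - y)^2."""
    return 0.5 * (t - y) ** 2

def absolute_error(t, y):
    """Linear (absolute) error: |t - y|."""
    return np.abs(t - y)

def empirical_risk(t, y, loss=squared_error):
    """Average loss over the training set."""
    return np.mean(loss(t, y))

# Toy data and a candidate model y(x, w) = w0 + w1 * x
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 0.9, 2.2, 2.8])
w0, w1 = 0.0, 1.0
y = w0 + w1 * x
print(empirical_risk(t, y), empirical_risk(t, y, loss=absolute_error))
```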
Model Likelihood and Empirical Risk
Two related but distinct ways to look at a
model.
1. Model Likelihood
   What is the likelihood that a model generated
   the observed data?
2. Empirical Risk
   How much error does the model have on the
   training data?
Model Likelihood

p(t | x, w, \beta) = N(t; y(x, w), \beta^{-1}),  where \beta = 1/\sigma^2

p(t | x, w, \beta) = \prod_{i=0}^{N-1} N(t_i; y(x_i, w), \beta^{-1})

Assuming independently and identically
distributed (iid) data.
Understanding Model Likelihood

p(t | x, w, \beta) = \prod_{i=0}^{N-1} N(t_i; y(x_i, w), \beta^{-1})

Substitute the equation of a Gaussian:

p(t | x, w, \beta) = \prod_{i=0}^{N-1} \left(\frac{\beta}{2\pi}\right)^{1/2} \exp\left(-\frac{\beta}{2} (y(x_i, w) - t_i)^2\right)

Apply a log function:

\ln p(t | x, w, \beta) = \ln \prod_{i=0}^{N-1} \left(\frac{\beta}{2\pi}\right)^{1/2} \exp\left(-\frac{\beta}{2} (y(x_i, w) - t_i)^2\right)

Let the log dissolve products into sums:

\ln p(t | x, w, \beta) = -\frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi
Understanding Model Likelihood

\ln p(t | x, w, \beta) = -\frac{\beta}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi

Optimize the weights (Maximum Likelihood Estimation):

Log likelihood:
\nabla_w \ln p(t | x, w, \beta) = -\frac{\beta}{2} \nabla_w \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2

-\frac{\beta}{2} \nabla_w \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 = 0

Empirical risk with the squared loss function:
R_{emp} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} (t_i - y(x_i, w))^2

Maximizing the log likelihood is therefore equivalent to
minimizing the empirical risk under the squared loss.
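Not from the slides: a NumPy/SciPy sketch (assuming scipy is available) checking that the expanded log likelihood above matches a direct evaluation of the Gaussian log density on synthetic data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, beta = 50, 4.0                                  # beta = 1 / sigma^2
x = rng.uniform(0.0, 1.0, N)
w0, w1 = 0.5, 2.0
y = w0 + w1 * x                                    # model predictions y(x_i, w)
t = y + rng.normal(0.0, 1.0 / np.sqrt(beta), N)    # noisy targets

# Direct evaluation of the iid Gaussian log likelihood
ll_direct = norm.logpdf(t, loc=y, scale=1.0 / np.sqrt(beta)).sum()

# Expanded form from the slide
ll_expanded = (-beta / 2 * np.sum((y - t) ** 2)
               + N / 2 * np.log(beta) - N / 2 * np.log(2 * np.pi))

print(np.isclose(ll_direct, ll_expanded))          # True
```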
Maximizing Log Likelihood (1-D)
Find the optimal settings of w.

w = [w_0  w_1]^T

\nabla_w R = 0, i.e., \partial R / \partial w_0 = 0 and \partial R / \partial w_1 = 0

R(w) = \frac{1}{2N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2
Maximizing Log Likelihood

R(w) = \frac{1}{2N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2

Partial derivative:
\partial R / \partial w_0 = \frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-1)

Set to zero:
\frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-1) = 0

Separate the sum to isolate w_0:
w_0 = \frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i)

w_0 = \frac{1}{N} \sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i
Maximizing Log Likelihood

R(w) = \frac{1}{2N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2

Partial derivative:
\partial R / \partial w_1 = \frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-x_i)

Set to zero:
\frac{1}{N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)(-x_i) = 0

\frac{1}{N} \sum_{i=0}^{N-1} (t_i x_i - w_1 x_i^2 - w_0 x_i) = 0

Separate the sum to isolate w_1:
w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i^2 = \frac{1}{N} \sum_{i=0}^{N-1} t_i x_i - w_0 \frac{1}{N} \sum_{i=0}^{N-1} x_i

w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - w_0 \sum_{i=0}^{N-1} x_i
Maximizing Log Likelihood

From the previous partial:
w_0 = \frac{1}{N} \sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i

From the previous slide:
w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - w_0 \sum_{i=0}^{N-1} x_i

Substitute:
w_1 \sum_{i=0}^{N-1} x_i^2 = \sum_{i=0}^{N-1} t_i x_i - \left(\frac{1}{N} \sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i\right) \sum_{i=0}^{N-1} x_i

Isolate w_1:
w_1 \left(\sum_{i=0}^{N-1} x_i^2 - \frac{1}{N} \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i\right) = \sum_{i=0}^{N-1} t_i x_i - \frac{1}{N} \sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i

w_1 = \frac{\sum_{i=0}^{N-1} t_i x_i - \frac{1}{N} \sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i}{\sum_{i=0}^{N-1} x_i^2 - \frac{1}{N} \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i}
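Not from the slides: a NumPy sketch of the 1-D closed-form solution above, cross-checked against numpy.polyfit on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-2.0, 2.0, N)
t = 0.7 + 1.5 * x + rng.normal(0.0, 0.3, N)      # noisy 1-D targets

# Closed-form estimates from the derivation above
w1 = (np.sum(t * x) - np.sum(t) * np.sum(x) / N) / \
     (np.sum(x ** 2) - np.sum(x) * np.sum(x) / N)
w0 = np.mean(t) - w1 * np.mean(x)

# Reference: least-squares line fit (returns [w1, w0])
w1_ref, w0_ref = np.polyfit(x, t, deg=1)
print(np.allclose([w0, w1], [w0_ref, w1_ref]))   # True
```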
Maximizing Log Likelihood
Clean and easy:

\begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \begin{pmatrix} \frac{1}{N} \sum_{i=0}^{N-1} t_i - w_1 \frac{1}{N} \sum_{i=0}^{N-1} x_i \\[4pt] \dfrac{\sum_{i=0}^{N-1} t_i x_i - \frac{1}{N} \sum_{i=0}^{N-1} t_i \sum_{i=0}^{N-1} x_i}{\sum_{i=0}^{N-1} x_i^2 - \frac{1}{N} \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} x_i} \end{pmatrix}

Or not.

Apply some linear algebra.
Likelihood using linear algebra
Representing the linear regression
function in terms of vectors:

y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_{N-1} x_{N-1}

x = [1  x_1  x_2  ...  x_{N-1}]^T
w = [w_0  w_1  w_2  ...  w_{N-1}]^T

y = x^T w
Likelihood using linear algebra
Stack the x^T into a matrix of data points, X.

R_{emp}(w) = \frac{1}{2N} \sum_{i=0}^{N-1} (t_i - w_1 x_i - w_0)^2

Representation as vectors:
           = \frac{1}{2N} \sum_{i=0}^{N-1} \left( t_i - [1 \; x_i] \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} \right)^2

Stack the data into a matrix and use the norm operation to handle the sum:
           = \frac{1}{2N} \left\| \begin{pmatrix} t_0 \\ t_1 \\ \vdots \\ t_{N-1} \end{pmatrix} - \begin{pmatrix} 1 & x_0 \\ 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{N-1} \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} \right\|^2

           = \frac{1}{2N} \| t - Xw \|^2
Likelihood in multiple dimensions
This representation of the risk has no inherent
dimensionality.

R_{emp}(w) = \frac{1}{2N} \| t - Xw \|^2

\nabla_w R_{emp}(w) = 0

\nabla_w \frac{1}{2N} \| t - Xw \|^2 = 0
Maximum Likelihood Estimation redux

\nabla_w R_{emp}(w) = 0

\nabla_w \frac{1}{2N} \| t - Xw \|^2 = 0

Decompose the norm:
\frac{1}{2N} \nabla_w (t - Xw)^T (t - Xw) = 0

FOIL, linear algebra style:
\frac{1}{2N} \nabla_w (t^T t - t^T Xw - w^T X^T t + w^T X^T X w) = 0

Differentiate:
\frac{1}{2N} (-X^T t - X^T t + 2 X^T X w) = 0

Combine terms:
\frac{1}{2N} (-2 X^T t + 2 X^T X w) = 0

Isolate w:
X^T X w = X^T t
w = (X^T X)^{-1} X^T t
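Not from the slides: a NumPy sketch of the closed-form solution, with the usual caveat that solving the normal equations (or using lstsq) is numerically preferable to forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x = rng.uniform(-1.0, 1.0, N)
t = 0.3 - 2.0 * x + rng.normal(0.0, 0.2, N)

# Design matrix X with a leading column of ones for the bias w0
X = np.column_stack([np.ones(N), x])

w_closed = np.linalg.inv(X.T @ X) @ X.T @ t      # w = (X^T X)^{-1} X^T t
w_solve = np.linalg.solve(X.T @ X, X.T @ t)      # solve the normal equations
w_lstsq = np.linalg.lstsq(X, t, rcond=None)[0]   # most robust in practice

print(np.allclose(w_closed, w_solve), np.allclose(w_closed, w_lstsq))
```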
Extension to polynomial regression
[Figure]

Extension to polynomial regression

y = c_0 + c_1 x_1 + c_2 x_2
y = c_0 + c_1 x + c_2 x^2

Polynomial regression is the same as
linear regression in D dimensions.
Generate new features
Standard polynomial with coefficients, w:

y(x, w) = \sum_{d=1}^{D} w_d x^d + w_0

Risk:

R = \frac{1}{2} \left\| \begin{pmatrix} t_0 \\ t_1 \\ \vdots \\ t_{n-1} \end{pmatrix} - \begin{pmatrix} 1 & x_0 & \ldots & x_0^p \\ 1 & x_1 & \ldots & x_1^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n-1} & \ldots & x_{n-1}^p \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_p \end{pmatrix} \right\|^2
Generate new features
Feature trick: to fit a D-dimensional polynomial,
create a D-element vector from x_i:

x_i = [x_i^0  x_i^1  ...  x_i^P]^T

Then apply standard linear regression in D dimensions.
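Not from the slides: a NumPy sketch of the feature trick, building the polynomial design matrix as a Vandermonde matrix and then running ordinary linear regression on it.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map each scalar x_i to [x_i^0, x_i^1, ..., x_i^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 50)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, 50)

X = polynomial_features(x, degree=3)           # shape (50, 4)
w = np.linalg.lstsq(X, t, rcond=None)[0]       # same linear regression as before
print(w)
```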
How is this still linear regression?
The regression is linear in the parameters,
despite projecting x_i from one dimension to D
dimensions.
Now we fit a plane (or hyperplane) to a
representation of x_i in a higher-dimensional
feature space.
This generalizes to any set of functions
\phi_i : R \to R

x_i = [\phi_0(x_i)  \phi_1(x_i)  ...  \phi_P(x_i)]^T
Basis functions as feature extraction
These functions \phi_i : R \to R are called basis functions.
They define the bases of the feature space.
They allow a linear decomposition of any type of
function to be fit to the data points.
Common choices:
Polynomial
Gaussian
Sigmoid
Wave functions (sine, etc.)
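Not from the slides: a NumPy sketch using Gaussian basis functions; the centers and width are illustrative choices.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2)) for each center mu_j."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 80)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, 80)

centers = np.linspace(0.0, 1.0, 9)
Phi = np.column_stack([np.ones_like(x),               # phi_0(x) = 1 for the bias
                       gaussian_basis(x, centers, width=0.1)])
w = np.linalg.lstsq(Phi, t, rcond=None)[0]            # still linear in w
print(Phi.shape, w.shape)                             # (80, 10) (10,)
```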
Training Data vs. Testing Data
Evaluating the performance of a model on the
training data is meaningless.
With enough parameters, a model can simply
memorize (encode) every training point.
To evaluate performance, the data is divided into
training and testing (or evaluation) data.
Training data is used to learn the model parameters.
Testing data is used to evaluate performance.
Overfitting
[Two figure slides illustrating overfitting, followed by a figure slide on overfitting performance.]
Definition of overfitting
Overfitting is when the model describes the noise
rather than the signal.

How can you tell the difference between
overfitting and a bad model?
Possible detection of overfitting
Stability
An appropriately fit model is stable under
different samples of the training data
An overfit model generates inconsistent
performance
Performance
A good model has low test error
A bad model has high test error

What is the optimal model size?
The best model size is the one that generalizes
best to unseen data.
Approximate this by the testing error.
One way to optimize parameters is to
minimize the testing error.
This uses the testing data as tuning or
development data.
It sacrifices training data in favor of parameter
optimization.
Can we do this without explicit evaluation
data?
Context for linear regression
Simple approach
Efficient learning
Extensible
Regularization provides robust models

Break
Coffee. Stretch.

Linear Regression
Identify the best parameters, w, for a
regression function:

y = w_0 + \sum_{i=1}^{N} w_i x_i

w = (X^T X)^{-1} X^T t
Overfitting
Recall: overfitting happens when a model
captures idiosyncrasies of the data
rather than generalities.
It is often caused by too many parameters relative
to the amount of training data.
E.g., an order-N polynomial can intersect any
N+1 data points.
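Not from the slides: a quick NumPy check of the claim that an order-N polynomial can pass through any N+1 points.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 9
x = np.linspace(-1.0, 1.0, N + 1)             # N+1 distinct inputs
t = rng.normal(0.0, 1.0, N + 1)               # arbitrary targets

V = np.vander(x, N + 1, increasing=True)      # square polynomial design matrix
w = np.linalg.solve(V, t)                     # order-N polynomial coefficients
print(np.allclose(V @ w, t))                  # True: the curve hits every point
```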
Dealing with Overfitting
Use more data
Use a tuning set
Regularization
Be a Bayesian

Regularization
In a linear regression model, overfitting is
characterized by large weights.

      M=0     M=1      M=3        M=9
w0    0.19    0.82     0.31       0.35
w1            -1.27    7.99       232.37
w2                     -25.43     -5321.83
w3                     17.37      48568.31
w4                                -231639.30
w5                                640042.26
w6                                -1061800.52
w7                                1042400.18
w8                                -557682.99
w9                                125201.43
Penalize large weights
Introduce a penalty term in the loss function.

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2

Regularized regression
(L2-regularization or ridge regression):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \|w\|^2
Regularization Derivation

\nabla_w E(w) = 0

\nabla_w \left[ \frac{1}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{\lambda}{2} \|w\|^2 \right] = 0

\nabla_w \left[ \frac{1}{2} \|t - Xw\|^2 + \frac{\lambda}{2} \|w\|^2 \right] = 0

\nabla_w \left[ \frac{1}{2} (t - Xw)^T (t - Xw) + \frac{\lambda}{2} w^T w \right] = 0

\nabla_w \left[ \frac{1}{2} (t - Xw)^T (t - Xw) + \frac{\lambda}{2} w^T w \right] = 0

-X^T t + X^T X w + \frac{\lambda}{2} \nabla_w (w^T w) = 0

-X^T t + X^T X w + \lambda w = 0

-X^T t + X^T X w + \lambda I w = 0

-X^T t + (X^T X + \lambda I) w = 0

(X^T X + \lambda I) w = X^T t

w = (X^T X + \lambda I)^{-1} X^T t
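Not from the slides: a NumPy sketch of the ridge closed form, compared with the unregularized fit on a high-order polynomial; the value of lam is illustrative.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T t."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 30)

X = np.vander(x, 10, increasing=True)               # degree-9 polynomial features
w_mle = np.linalg.lstsq(X, t, rcond=None)[0]        # unregularized weights
w_ridge = ridge_fit(X, t, lam=1e-3)                 # penalized weights

print(np.abs(w_mle).max(), np.abs(w_ridge).max())   # ridge weights are typically much smaller
```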
Regularization in Practice
[Figure]

Regularization Results
[Figure]
More regularization
The penalty term defines the style of
regularization.

L2-Regularization:
E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \|w\|_2^2

L1-Regularization:
E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \|w\|_1

L0-Regularization:
E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \sum_{n=0}^{N-1} \mathbb{1}(w_n \neq 0)

The L0-norm selects the optimal subset of features.
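Not from the slides: a tiny NumPy sketch of the three penalty terms for an illustrative weight vector (the lambda scaling is omitted).

```python
import numpy as np

w = np.array([0.0, 3.2, -0.004, 0.0, 1.7])

l2_penalty = 0.5 * np.sum(w ** 2)     # (1/2) * ||w||_2^2
l1_penalty = np.sum(np.abs(w))        # ||w||_1
l0_penalty = np.count_nonzero(w)      # number of non-zero weights

print(l2_penalty, l1_penalty, l0_penalty)
```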
Curse of dimensionality
Increasing the dimensionality of the features increases the
data requirements exponentially.
For example, if a single feature can be accurately
approximated with 100 data points, optimizing the
joint over two features requires 100*100 data points.

Models should be small relative to the amount of
available data.
Dimensionality reduction techniques and feature
selection can help.
L0-regularization is explicit feature selection.
L1- and L2-regularization approximate feature selection.
Bayesians v. Frequentists
What is a probability?
Frequentists
A probability is the likelihood that an event will happen.
It is approximated by the ratio of the number of observed events to the
number of total events.
Assessment is vital to selecting a model.
Point estimates are absolutely fine.
Bayesians
A probability is a degree of believability of a proposition.
Bayesians require that probabilities be prior beliefs conditioned on data.
The Bayesian approach is optimal, given a good model, a good prior
and a good loss function. Don't worry so much about assessment.
If you are ever making a point estimate, you've made a mistake. The
only valid probabilities are posteriors based on evidence given some
prior.
Bayesian Linear Regression
The previous MLE derivation of linear regression uses
point estimates for the weight vector, w.
Bayesians say, "hold it right there."
Use a prior distribution over w to estimate the parameters:

p(w | \alpha) = N(w | 0, \alpha^{-1} I) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left(-\frac{\alpha}{2} w^T w\right)

Alpha is a hyperparameter over w, where alpha is the
precision or inverse variance of the distribution.
Now optimize:

p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) \, p(w | \alpha)
Optimize the Bayesian posterior

p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) \, p(w | \alpha)

As usual, it's easier to optimize after a log transform:

\ln p(t | x, w, \beta) + \ln p(w | \alpha)

p(t | x, w, \beta) = \prod_{n=0}^{N-1} \left(\frac{\beta}{2\pi}\right)^{1/2} \exp\left(-\frac{\beta}{2} (t_n - y(x_n, w))^2\right)

\ln p(t | x, w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2
Optimize the Bayesian posterior

p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta) \, p(w | \alpha)

As usual, it's easier to optimize after a log transform:

\ln p(t | x, w, \beta) + \ln p(w | \alpha)

p(w | \alpha) = N(w | 0, \alpha^{-1} I) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left(-\frac{\alpha}{2} w^T w\right)

\ln p(w | \alpha) = \frac{M+1}{2} \ln \alpha - \frac{M+1}{2} \ln 2\pi - \frac{\alpha}{2} w^T w
Optimize the Bayesian posterior

\ln p(t | x, w, \beta) + \ln p(w | \alpha)

\ln p(t | x, w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2

\ln p(w | \alpha) = \frac{M+1}{2} \ln \alpha - \frac{M+1}{2} \ln 2\pi - \frac{\alpha}{2} w^T w

Ignoring terms that do not depend on w:

\ln p(t | x, w, \beta) + \ln p(w | \alpha) \propto -\frac{\beta}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 - \frac{\alpha}{2} w^T w

This is an IDENTICAL formulation to L2-regularization:
maximizing the log posterior is the same as minimizing the
squared error with an L2 penalty of \lambda = \alpha / \beta.
Context
Overfitting is bad.
Bayesians vs. Frequentists
Is one better?
Machine Learning uses techniques from both
camps.

Next Time
Logistic Regression

Read Chapter 4.1, 4.3
