CS550 Regression Aug12
Regression
Flexibility vs. Interpretability Tradeoff
• There are many methods of regression (that estimate f)
• Some are less flexible but more interpretable
• These are useful for inference problems where we want to study the relationships between predictor variables
• But highly flexible methods can also lead to over-fitting!
Error Evaluation
In order to quantify how well a model performs, we define a loss or error function. A common loss function for quantitative outcomes is the Mean Squared Error (MSE):

MSE = (1/n) Σ_{i=1}^{n} ( y_i − ŷ_i )²

The quantity e_i = y_i − ŷ_i is called a residual and measures the error at the i-th prediction. The square root of the MSE is the RMSE:

RMSE = √MSE
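As a quick illustration (a minimal numpy sketch with made-up numbers, just to make the definitions concrete):

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # observed outcomes (illustrative values)
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # model predictions (illustrative values)

residuals = y_true - y_pred                # e_i = y_i - y_hat_i
mse = np.mean(residuals ** 2)              # Mean Squared Error
rmse = np.sqrt(mse)                        # Root Mean Squared Error
print(mse, rmse)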
R-squared Error
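For reference (the standard definition; the slide's own formula is not reproduced in these notes), the R-squared of a regression fit is

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

i.e., the fraction of the variance in the response explained by the model. R² = 1 indicates a perfect fit, while R² = 0 means the model does no better than always predicting the mean ȳ.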
Bias Variance Tradeoff
The Advertising data set consists of the sales of a product in 200 different markets, along with the advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. Everything is given in units of $1000.
Some of the figures in this presentation are taken from the ISL book: "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Response vs. Predictor Variables
• X (the p predictors): also called predictors, features, or covariates
• Y: also called the outcome, response variable, or dependent variable
Estimate of the regression coefficients (cont.)
Which of the three candidate lines (in the figure) fits the data points best?
a. The one that goes through the maximum number of points
b. The one with the least slope
c. The one from which no point is too far, i.e., the one that passes approximately through the middle of all the points
So how do Linear Regression solvers work?
• Matrix methods
  • Exact methods that solve the set of linear equations
  • Involve computation of a matrix inverse, a QR decomposition, or the pseudoinverse (more efficient)
• Gradient Descent
  • A generic method for solving optimization problems
  • Begin with a random point and reach the optimal solution through a sequence of improvements
  • Faster improvements can be obtained by stochastic methods
Matrix Algebra for n dimensions
• Computational complexity of computing the matrix inverse: O(d^2.4) to O(d^3), depending on the implementation.
• Scikit-learn's LinearRegression class uses an SVD-based approach, O(d^2)
  • SVD stands for singular value decomposition
  • It uses the pseudoinverse (Moore-Penrose) approach: numpy.linalg.pinv()
  • β̂ = X⁺ y, the pseudoinverse solution (equal to the normal-equation solution (XᵀX)⁻¹ Xᵀ y when XᵀX is invertible)
• Both approaches have linear complexity in the number of instances, n, but at least quadratic complexity in d
• So we need to look at alternative techniques if d is very large, e.g., 100,000
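As a quick check of the two approaches above (a minimal sketch on synthetic data; the data and variable names are illustrative, not from the slides):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

Xb = np.c_[np.ones(n), X]               # add an intercept column
beta_pinv = np.linalg.pinv(Xb) @ y      # Moore-Penrose pseudoinverse solution

model = LinearRegression().fit(X, y)    # scikit-learn's (SVD-based) solver
print(beta_pinv[1:], model.coef_)       # the two coefficient estimates should agree closely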
Optimization Problems in ML
The general form of an optimization problem in ML will usually be

ŵ = arg min_{w ∈ C} L(w)

• L(w) denotes the loss function to be optimized, usually a sum of the training error and a regularizer.
• C is the constraint set that the solution must belong to. It is possible to have problems where the solution has some constraints (e.g., non-negativity, sparsity, or even both), for example:
  • Non-negativity constraint: all entries in w must be non-negative
  • Sparsity constraint: w is a sparse vector with at most a given number of non-zero entries
• Linear and ridge regression, which we saw earlier, were unconstrained (w was a real-valued vector).
• Constrained optimization problems can be converted into unconstrained ones (will see later).
• For now, assume we have an unconstrained optimization problem.
Method 1: Using First-Order Optimality
• Very simple; we already used this approach for linear and ridge regression.
• Called "first order" since only the gradient is used, and the gradient provides the first-order information about the function being optimized.
• First-order optimality: the gradient must be equal to zero at the optima:

  ∇_w L(w) = 0

• Sometimes, setting the gradient to zero and solving for w gives a closed-form solution.
• If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algorithms, like gradient descent.
• Note: this approach works directly only for very simple problems where the objective is convex and there are no constraints on the values w can take.
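As a worked example of first-order optimality (a sketch using the standard ridge objective; the notation here is assumed, since the slide's own equations are not reproduced):

L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2,
\qquad
\nabla_{\mathbf{w}} L = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\lambda \mathbf{w} = 0
\;\Rightarrow\;
\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}

Setting λ = 0 recovers the ordinary least-squares (normal equation) solution, provided XᵀX is invertible.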
Method 2: Iterative Optimization via Gradient Descent
• Iterative, since it requires several steps/iterations to find the optimal solution. (Can this approach solve maximization problems? Yes: for maximization problems we can use gradient ascent.)
• Fact: the gradient gives the direction of steepest change in the function's value, so we move in the direction opposite to the gradient.
• For convex functions, GD will converge to the global minima; good initialization is needed for non-convex functions.

Gradient Descent
• Initialize w^(0)
• For iteration t = 0, 1, 2, ... (or until convergence):
  • Calculate the gradient g^(t) using the current iterate w^(t)
  • Set the learning rate η_t
  • Move in the opposite direction of the gradient: w^(t+1) = w^(t) − η_t g^(t)

The learning rate is very important and should be set carefully (fixed or chosen adaptively); some strategies are discussed later. Sometimes it may be tricky to assess convergence (will see some methods later). The justification for this update will be seen shortly.
Gradient Descent: An Illustration
[Figure: a 1-D loss curve L(w) with iterates w^(0), w^(1), w^(2), w^(3) converging towards the minimum w*. Where the gradient is positive, GD moves in the negative direction; where the gradient is negative, it moves in the positive direction. With a good initialization the global minimum is found; with a poor initialization on a non-convex curve, GD can get stuck at a local minimum. The learning rate is very important.]
GD: An Example
Let's apply GD to least squares linear regression:

L(w) = Σ_{n=1}^{N} ( y_n − wᵀ x_n )²

The gradient:

g = ∇_w L(w) = −2 Σ_{n=1}^{N} ( y_n − wᵀ x_n ) x_n

Each GD update will be of the form w^(t+1) = w^(t) + 2 η_t Σ_{n=1}^{N} ( y_n − w^(t)ᵀ x_n ) x_n. Training examples on which the current model's prediction error is large contribute more to the update. Stochastic Gradient Descent (SGD) approximates the gradient using a single training example (the (sub)gradient of the loss on that example).

When do we stop? One criterion is that the objective's value does not change much across iterations, which is equivalent to saying that the gradients are close to zero (g^(t) ≈ 0). Caution: we may not yet be at the optima, so use this at your own risk; this is mainly a concern for non-convex loss functions, not so much for convex loss functions. Other criteria: the objective's value has become small enough that we are happy with it, or a validation set shows that the model's performance is acceptable (early stopping).
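A minimal sketch of batch gradient descent for least squares on synthetic data (the learning rate, iteration count, and data are illustrative assumptions, not from the slide):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w = np.zeros(3)                        # initialize w^(0)
eta = 0.001                            # fixed learning rate, for simplicity
for t in range(1000):
    g = -2 * X.T @ (y - X @ w)         # gradient of the squared-error loss
    w = w - eta * g                    # move opposite to the gradient
print(w)                               # should be close to w_true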
Some Practical Aspects: Learning Rate (Step Size)
Some guidelines to select a good learning rate (a.k.a. step size):
• Use a decaying schedule, e.g., η_t = C / √t, where C is a hyperparameter.
• Use a vector of learning rates, one along each dimension, and slow down along some directions by using smaller learning rates. AdaGrad (Duchi et al., 2011) does this with

  η_{t,d} = η / √( ε + Σ_{τ=1}^{t} ( g_d^(τ) )² )

  and the update uses an element-wise product of the learning-rate vector and the gradient.
• Use a momentum term to stabilize gradients by reusing information from past gradients: move faster along directions that were previously good, and slow down along directions where the gradient has changed abruptly:

  m^(t) = β m^(t−1) + η_t g^(t)

  The "momentum" term m is set to 0 at initialization, and β is usually set to about 0.9.
• In an even faster version of this, g^(t) is replaced by the gradient computed at the next step if the previous direction were used. This is called Nesterov's Accelerated Gradient (NAG) method.
• Several more advanced methods combine the above: RMS-Prop (AdaGrad + momentum) and Adam (NAG + RMS-Prop).
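A minimal sketch of the momentum update described above (β = 0.9 as on the slide; the data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w = np.zeros(3)
m = np.zeros(3)                        # momentum term, set to 0 at initialization
beta, eta = 0.9, 0.0005
for t in range(500):
    g = -2 * X.T @ (y - X @ w)         # gradient at the current iterate
    m = beta * m + eta * g             # m^(t) = beta * m^(t-1) + eta * g^(t)
    w = w - m                          # move using the accumulated momentum
print(w)                               # should be close to w_true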
Optimization for ML: Some Final Comments
• Gradient methods are simple to understand and implement
• More sophisticated optimization methods also often use gradient methods
  • The backpropagation algorithm used in deep neural nets is GD + the chain rule of differentiation
• Use subgradient methods if the function is not differentiable
• Constrained optimization can use the Lagrangian or projected/proximal GD
• Second-order methods such as Newton's method are faster but computationally expensive
• "But computing all this gradient-related stuff by hand looks scary to me. Any help?"
  • Don't worry: Automatic Differentiation (AD) methods are available now (will see them later)
  • AD only requires specifying the loss function (especially useful for deep neural nets)
  • Many packages such as TensorFlow, PyTorch, etc. provide AD support
  • But having a good understanding of optimization is still helpful
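As an illustration of automatic differentiation (a minimal PyTorch sketch on made-up data; none of this code is from the slides), the gradient of the loss is obtained without deriving it by hand:

import torch

torch.manual_seed(0)
X = torch.randn(200, 3)
w_true = torch.tensor([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * torch.randn(200)

w = torch.zeros(3, requires_grad=True)     # parameters tracked by autograd
eta = 0.001
for t in range(1000):
    loss = ((y - X @ w) ** 2).sum()        # squared-error loss
    loss.backward()                        # AD computes d(loss)/d(w)
    with torch.no_grad():
        w -= eta * w.grad                  # plain gradient descent step
    w.grad.zero_()                         # reset gradients for the next iteration
print(w.detach())                          # should be close to w_true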
Parametric or Non-Parametric?
Linear Regression (parametric) vs. the k-NN approach (non-parametric):
• Assumption on the function f: linear regression assumes a linear function; k-NN can work even if the function is non-linear, but it has to be locally constant
• High dimensions: linear regression has complexity problems, which can be overcome by efficient algorithms; for k-NN it becomes difficult to find nearby neighbors, which can cause errors
• Bias: low for linear regression; for k-NN, small K => low bias, large K => high bias
• Variance: depends on the problem for linear regression; for k-NN, small K => high variance, large K => low variance
• Computations: linear regression computes once, during the model-fitting phase, after which predictions are quick; k-NN looks at all the training points every time a prediction has to be made
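A minimal comparison sketch using scikit-learn on synthetic data (the data-generating function and the choice k = 10 are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)           # a non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lin = LinearRegression().fit(X_tr, y_tr)                   # parametric: assumes f is linear
knn = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)  # non-parametric: local averaging
print(lin.score(X_te, y_te), knn.score(X_te, y_te))        # R^2 on held-out data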
Possible Questions
• How accurately do we know our model parameters?
• Is at least one predictor variable useful in the prediction?
• We have to examine the p-values
• Which subset of the predictor variables is important?
• There are several techniques of predictor variable/feature selection
• What would be the accuracy of predictions on unseen data?
• We can generate confidence intervals on our estimates
• Cross-validation gives us an estimate.
• Do I need more predictor variables/features?
• Look at patterns in the residual errors
Confidence intervals for predictor estimates
• What causes errors in the estimation of ŵ? The standard errors shrink with:
  • More data: larger n
  • Larger coverage: a wider spread of x values, i.e., larger Σ_i ( x_i − x̄ )²
  • Better data: smaller noise σ

SE( ŵ_0 ) = σ √( 1/n + x̄² / Σ_i ( x_i − x̄ )² )
SE( ŵ_1 ) = σ √( 1 / Σ_i ( x_i − x̄ )² )
General formula: SE( ŵ )² = σ² ( Xᵀ X )⁻¹
Significance of predictor variables
• As we saw, there are inherent uncertainties in the estimation of w (= β)
• We evaluate the importance of predictors using hypothesis testing, using the t-statistics and p-values (a small p-value, < 0.05, indicates a significant predictor)
• The null hypothesis is that β_i = 0
Sample Results
import statsmodels.api as sm

# assumes X holds the predictor columns and y the response (e.g., the Advertising data)
X2 = sm.add_constant(X)   # add an intercept column to the predictors
est = sm.OLS(y, X2)       # ordinary least squares
est2 = est.fit()
print(est2.summary())     # coefficients with their t-statistics and p-values
Subset Selection Techniques
• Total number of subsets of a set of size J = 2^J
• Goal: all the variables in the model should have sufficiently low p-values, and all the variables outside the model should have a large p-value if added to the model.
• Three possible approaches
• Forward selection
• Backward selection
• Mixed selection
Subset Selection Techniques
• Forward selection:
  • Begin with a null (empty) set, S
  • Perform J linear regressions, each with exactly one variable
  • Add the variable that results in the lowest cross-validation error to the set S
  • Again, perform J−1 linear regressions, each with 2 variables
  • Add the variable that results in the lowest cross-validation error to the set S
  • Continue until some stopping criterion is reached, e.g., the CV error is no longer decreasing (a code sketch follows below)
• Backward selection begins with all the variables and removes the variable with the highest p-value at successive steps
• Mixed selection is similar to forward selection, but it may also remove a variable if it doesn't yield any improvement to the model
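A minimal sketch of forward selection driven by cross-validation error (scikit-learn on made-up data; the stopping rule is the "CV error no longer decreasing" criterion from the slide):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)   # only variables 0 and 2 matter

selected, remaining, best_err = [], list(range(X.shape[1])), np.inf
while remaining:
    # try adding each remaining variable; keep the one giving the lowest CV error
    errs = {j: -cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                                scoring="neg_mean_squared_error", cv=5).mean()
            for j in remaining}
    j_best = min(errs, key=errs.get)
    if errs[j_best] >= best_err:          # stop: CV error is not decreasing any more
        break
    best_err = errs[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
print(selected)                           # typically recovers [0, 2]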
Do I need more predictors / a change of model?
• When we estimated the variance of ε, we assumed that the residuals were uncorrelated and normally distributed with mean 0 and fixed variance.
• These assumptions need to be verified using the data. In residual analysis, we typically create two types of plots:
  1. a plot of the residuals with respect to the predictor values x or the predicted values ŷ. This allows us to compare the distribution of the noise at different values of the predictors.
  2. a histogram of the residuals. This allows us to explore the distribution of the noise independently of x or ŷ.
Patterns in Residuals
• The uncertainty in predictions depends on our confidence in w: different w => different predicted values of y
• Given x, examine the distribution of the resulting predictions for y, and determine its mean and standard deviation.
Potential problems of Linear Models
• Non-linearity
  • Can use polynomial (linear) regression or design better features
• Outliers
  • They disturb the model because of the quadratic penalty; discard outliers carefully
• High-leverage points
  • Outliers in the predictor variables
• Collinearity (2 or more predictor variables have high correlation)
  • Keep only one of them, or design a good combined feature
• Correlation of error terms, non-constant variance of error terms
  • These give an unwarranted (overly high) confidence in the model; we can't trust the confidence intervals on the model parameters
Polynomial Regression
• The simplest non-linear model we can consider, for a response Y and a predictor X, is a polynomial model of degree M:

  y = β_0 + β_1 x + β_2 x² + ... + β_M x^M + ε

• Just as in the case of linear regression with cross terms, polynomial regression is a special case of linear regression: we treat each power of X as a separate predictor. Thus, we can write the model as a linear regression on the expanded features (x, x², ..., x^M).
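A minimal sketch of fitting a polynomial model as a linear regression on expanded features (degree 3 and the synthetic data are illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + 0.2 * rng.normal(size=200)

# each power of x becomes a separate predictor; ordinary linear regression is then fit
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict(np.array([[2.0]])))    # prediction at x = 2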
Polynomial Regression (cont.)
• The degree M can be chosen by K-fold cross-validation:

  CV( Model ) = (1/K) Σ_{i=1}^{K} L_i

  where L_i is the loss of the model on the i-th held-out fold.
• Fitting the model using a modified (regularized) loss function L_reg would result in model parameters with desirable properties (specified by the regularization term R).
Ridge Regression
• Alternatively, we can choose a regularization term that penalizes the squares of the parameter magnitudes. Then, our regularized loss function is:

  L_ridge(β) = (1/n) Σ_{i=1}^{n} ( y_i − ŷ_i )² + λ Σ_j β_j²
Ridge Regression
• We often say that L_ridge is the loss function for l2 regularization.
• Finding the model parameters β̂_ridge that minimize the l2-regularized loss function is called ridge regression.
LASSO (Least Absolute Shrinkage and Selection Operator) Regression
• Ridge regression reduces the parameter values but doesn't force them to go to zero. LASSO is very effective in doing that.
• It uses the following regularized loss function:

  L_LASSO(β) = (1/n) Σ_{i=1}^{n} ( y_i − ŷ_i )² + λ Σ_j |β_j|
LASSO Regression
• Hence, we often say that L_LASSO is the loss function for l1 regularization.
• Finding the model parameters β̂_LASSO that minimize the l1-regularized loss function is called LASSO regression.
Choosing λ
• In both ridge and LASSO regression, the larger our choice of the regularization parameter λ, the more heavily we penalize large values in β.
• If λ is close to zero, we recover the MSE; i.e., ridge and LASSO regression reduce to ordinary regression.
• If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force β̂_ridge and β̂_LASSO to be close to zero.
• To avoid ad-hoc choices, we should select λ using cross-validation.
• Once the model is trained, we use the unregularized performance measure to evaluate the model's performance.
Elastic Net
• A middle ground between ridge and LASSO regression
• The regularization term is a simple mix of the l1 and l2 penalties, controlled by a mixing parameter r (r = 1 gives the LASSO penalty, r = 0 gives the ridge penalty)
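A minimal sketch comparing the three regularized regressions in scikit-learn (the alpha and l1_ratio values are illustrative; in practice they would be chosen by cross-validation as discussed above):

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)   # only 2 of 10 features matter

ridge = Ridge(alpha=1.0).fit(X, y)                     # l2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                     # l1 penalty: sets some coefficients to exactly 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of l1 and l2 (l1_ratio plays the role of r)
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))
print(enet.coef_.round(2))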
Constrained Optimization
Projected Gradient Descent
Consider an optimization problem of the form: minimize L(w) subject to w ∈ C, where C is the constraint set. Projected GD alternates a standard gradient step with a projection back onto C:

  z^(t+1) = w^(t) − η_t g^(t)
  w^(t+1) = Π_C( z^(t+1) )   (project onto the constraint set)
Projected GD: How to Project?
• Here, projecting a point means finding the "closest" point to it in the constraint set C.
• Projected GD is commonly used only when the projection step is simple and efficient.
• For some sets the projection step is easy to compute:
  • C = unit-radius Euclidean ball: projection = normalize the vector to unit Euclidean length
  • C = set of non-negative reals: projection = set each negative entry to zero
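A minimal sketch of projected GD with a non-negativity constraint (a synthetic least-squares problem; the learning rate and data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, 2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

w = np.zeros(3)
eta = 0.001
for t in range(1000):
    g = -2 * X.T @ (y - X @ w)     # ordinary gradient step on the unconstrained objective
    w = w - eta * g
    w = np.maximum(w, 0.0)         # projection onto the non-negative orthant
print(w)                           # all entries stay non-negative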
Proximal Gradient Descent
Consider minimizing a regularized loss function of the form L(w) + R(w), where R is the regularizer. (Note: the regularization hyperparameter is assumed to be part of R itself.)
Constrained Optimization via the Lagrangian
For a constrained problem of minimizing L(w) subject to g(w) ≤ 0, the Lagrangian combines the objective and the constraint using a Lagrange multiplier α ≥ 0:

  𝓛(w, α) = L(w) + α g(w)

Therefore, we can write our original problem as

  min_w max_{α ≥ 0} 𝓛(w, α)
Co-ordinate Descent (CD)
• Standard gradient descent updates all entries of w at once: w^(t+1) = w^(t) − η_t g^(t).
• CD: in each iteration, update only one entry (co-ordinate) of w, keeping all the others fixed.
• Usually converges to a local optimum, but it is very very useful in practice; will see examples later.
Newton's Method
• Unlike GD and its variants, Newton's method uses second-order information (the second derivative, a.k.a. the Hessian).
• At each point w^(t), minimize the quadratic (second-order) approximation of L(w) around w^(t).
• Exercise: show that minimizing this quadratic approximation yields the update

  w^(t+1) = w^(t) − [ H^(t) ]⁻¹ g^(t),   where H^(t) = ∇² L( w^(t) ) is the Hessian and g^(t) the gradient.
SVM Regression
ε-insensitive Loss Function
[Figure: the loss is zero for prediction errors inside the interval (−ε, ε) and increases only outside it.]
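For reference (the standard definition used by SVM regression; the slide shows it only as a figure), the ε-insensitive loss is

L_\varepsilon(y, \hat{y}) = \max\big(0,\; |y - \hat{y}| - \varepsilon\big)

so prediction errors smaller than ε are ignored, and larger errors are penalized linearly.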
SVM Regression
Non-linear data
• SVMs allow for a computationally efficient way of transforming the dataset to higher dimensions using the kernel trick (see the sketch below)
• Common kernels that are used are:
  • Linear, polynomial, Gaussian RBF, sigmoid
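A minimal sketch of kernelized SVM regression in scikit-learn (the RBF kernel and the hyperparameter values C and epsilon are illustrative choices):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# epsilon sets the width of the insensitive tube; the RBF kernel handles the non-linearity
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(svr.predict(np.array([[1.0]])))     # prediction near sin(1.0)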