CS550 Regression Aug12
Regression
Flexibility vs. Interpretability Tradeoff
• There are many methods of regression (that estimate f)
• Some are less flexible but more interpretable
• These are useful for inference problems where we want to study the relationships between predictor variables
• But highly flexible methods can also lead to over-fitting!
Error Evaluation
In order to quantify how well a model performs, we define a loss or error function. A common loss function for quantitative outcomes is the Mean Squared Error (MSE):

MSE = (1/n) Σ_{i=1}^{n} ( y_i − ŷ_i )²

The quantity e_i = y_i − ŷ_i is called a residual and measures the error at the i-th prediction. The square root of the MSE is the RMSE:

RMSE = √MSE
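As a quick illustration (a minimal numpy sketch with made-up numbers, just to make the definitions concrete):

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # observed outcomes (illustrative values)
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # model predictions (illustrative values)

residuals = y_true - y_pred                # e_i = y_i - y_hat_i
mse = np.mean(residuals ** 2)              # Mean Squared Error
rmse = np.sqrt(mse)                        # Root Mean Squared Error
print(mse, rmse)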
R-squared Error
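For reference (the standard definition; the slide's own formula is not reproduced in these notes), the R-squared of a regression fit is

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

i.e., the fraction of the variance in the response explained by the model. R² = 1 indicates a perfect fit, while R² = 0 means the model does no better than always predicting the mean ȳ.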
Bias Variance Tradeoff
The Advertising data set consists of the sales of a product in 200 different markets, along with the advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. Everything is given in units of $1000.
Some of the figures in this presentation are taken from the ISL book: "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Response vs. Predictor Variables
• X (the p predictors): also called predictors, features, or covariates
• Y: also called the outcome, response variable, or dependent variable
Estimate of the regression coefficients (cont.)
Which of the three candidate lines (in the figure) fits the data points best?
a. The one that goes through the maximum number of points
b. The one with the least slope
c. The one from which no point is too far, i.e., the one that passes approximately through the middle of all the points
So how do Linear Regression solvers work?
• Matrix methods
  • Exact methods that solve the set of linear equations
  • Involve computation of a matrix inverse, a QR decomposition, or the pseudoinverse (more efficient)
• Gradient Descent
  • A generic method for solving optimization problems
  • Begin with a random point and reach the optimal solution through a sequence of improvements
  • Faster improvements can be obtained by stochastic methods
Matrix Algebra for n dimensions
• Computational complexity of computing the matrix inverse: O(d^2.4) to O(d^3), depending on the implementation.
• Scikit-learn's LinearRegression class uses an SVD-based approach, O(d^2)
  • SVD stands for singular value decomposition
  • It uses the pseudoinverse (Moore-Penrose) approach: numpy.linalg.pinv()
  • β̂ = X⁺ y, the pseudoinverse solution (equal to the normal-equation solution (XᵀX)⁻¹ Xᵀ y when XᵀX is invertible)
• Both approaches have linear complexity in the number of instances, n, but at least quadratic complexity in d
• So we need to look at alternative techniques if d is very large, e.g., 100,000
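As a quick check of the two approaches above (a minimal sketch on synthetic data; the data and variable names are illustrative, not from the slides):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

Xb = np.c_[np.ones(n), X]               # add an intercept column
beta_pinv = np.linalg.pinv(Xb) @ y      # Moore-Penrose pseudoinverse solution

model = LinearRegression().fit(X, y)    # scikit-learn's (SVD-based) solver
print(beta_pinv[1:], model.coef_)       # the two coefficient estimates should agree closely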
Optimization Problems in ML
The general form of an optimization problem in ML will usually be

ŵ = arg min_{w ∈ C} L(w)

• L(w) denotes the loss function to be optimized, usually a sum of the training error and a regularizer.
• C is the constraint set that the solution must belong to. It is possible to have problems where the solution has some constraints (e.g., non-negativity, sparsity, or even both), for example:
  • Non-negativity constraint: all entries in w must be non-negative
  • Sparsity constraint: w is a sparse vector with at most a given number of non-zero entries
• Linear and ridge regression, which we saw earlier, were unconstrained (w was a real-valued vector).
• Constrained optimization problems can be converted into unconstrained ones (will see later).
• For now, assume we have an unconstrained optimization problem.
Method 1: Using First-Order Optimality
• Very simple; we already used this approach for linear and ridge regression.
• Called "first order" since only the gradient is used, and the gradient provides the first-order information about the function being optimized.
• First-order optimality: the gradient must be equal to zero at the optima:

  ∇_w L(w) = 0

• Sometimes, setting the gradient to zero and solving for w gives a closed-form solution.
• If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algorithms, like gradient descent.
• Note: this approach works directly only for very simple problems where the objective is convex and there are no constraints on the values w can take.
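As a worked example of first-order optimality (a sketch using the standard ridge objective; the notation here is assumed, since the slide's own equations are not reproduced):

L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2,
\qquad
\nabla_{\mathbf{w}} L = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\lambda \mathbf{w} = 0
\;\Rightarrow\;
\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}

Setting λ = 0 recovers the ordinary least-squares (normal equation) solution, provided XᵀX is invertible.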
Method 2: Iterative Optimization via Gradient Descent
• Iterative, since it requires several steps/iterations to find the optimal solution. (Can this approach solve maximization problems? Yes: for maximization problems we can use gradient ascent.)
• Fact: the gradient gives the direction of steepest change in the function's value, so we move in the direction opposite to the gradient.
• For convex functions, GD will converge to the global minima; good initialization is needed for non-convex functions.

Gradient Descent
• Initialize w^(0)
• For iteration t = 0, 1, 2, ... (or until convergence):
  • Calculate the gradient g^(t) using the current iterate w^(t)
  • Set the learning rate η_t
  • Move in the opposite direction of the gradient: w^(t+1) = w^(t) − η_t g^(t)

The learning rate is very important and should be set carefully (fixed or chosen adaptively); some strategies are discussed later. Sometimes it may be tricky to assess convergence (will see some methods later). The justification for this update will be seen shortly.
Gradient Descent: An Illustration
[Figure: a 1-D loss curve L(w) with iterates w^(0), w^(1), w^(2), w^(3) converging towards the minimum w*. Where the gradient is positive, GD moves in the negative direction; where the gradient is negative, it moves in the positive direction. With a good initialization the global minimum is found; with a poor initialization on a non-convex curve, GD can get stuck at a local minimum. The learning rate is very important.]
GD: An Example
Let's apply GD to least squares linear regression:

L(w) = Σ_{n=1}^{N} ( y_n − wᵀ x_n )²

The gradient:

g = ∇_w L(w) = −2 Σ_{n=1}^{N} ( y_n − wᵀ x_n ) x_n

Each GD update will be of the form w^(t+1) = w^(t) + 2 η_t Σ_{n=1}^{N} ( y_n − w^(t)ᵀ x_n ) x_n. Training examples on which the current model's prediction error is large contribute more to the update. Stochastic Gradient Descent (SGD) approximates the gradient using a single training example (the (sub)gradient of the loss on that example).

When do we stop? One criterion is that the objective's value does not change much across iterations, which is equivalent to saying that the gradients are close to zero (g^(t) ≈ 0). Caution: we may not yet be at the optima, so use this at your own risk; this is mainly a concern for non-convex loss functions, not so much for convex loss functions. Other criteria: the objective's value has become small enough that we are happy with it, or a validation set shows that the model's performance is acceptable (early stopping).
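A minimal sketch of batch gradient descent for least squares on synthetic data (the learning rate, iteration count, and data are illustrative assumptions, not from the slide):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w = np.zeros(3)                        # initialize w^(0)
eta = 0.001                            # fixed learning rate, for simplicity
for t in range(1000):
    g = -2 * X.T @ (y - X @ w)         # gradient of the squared-error loss
    w = w - eta * g                    # move opposite to the gradient
print(w)                               # should be close to w_true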
Some Practical Aspects: Learning Rate (Step Size)
Some guidelines to select a good learning rate (a.k.a. step size):
• Use a decaying schedule, e.g., η_t = C / √t, where C is a hyperparameter.
• Use a vector of learning rates, one along each dimension, and slow down along some directions by using smaller learning rates. AdaGrad (Duchi et al., 2011) does this with

  η_{t,d} = η / √( ε + Σ_{τ=1}^{t} ( g_d^(τ) )² )

  and the update uses an element-wise product of the learning-rate vector and the gradient.
• Use a momentum term to stabilize gradients by reusing information from past gradients: move faster along directions that were previously good, and slow down along directions where the gradient has changed abruptly:

  m^(t) = β m^(t−1) + η_t g^(t)

  The "momentum" term m is set to 0 at initialization, and β is usually set to about 0.9.
• In an even faster version of this, g^(t) is replaced by the gradient computed at the next step if the previous direction were used. This is called Nesterov's Accelerated Gradient (NAG) method.
• Several more advanced methods combine the above: RMS-Prop (AdaGrad + momentum) and Adam (NAG + RMS-Prop).
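A minimal sketch of the momentum update described above (β = 0.9 as on the slide; the data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w = np.zeros(3)
m = np.zeros(3)                        # momentum term, set to 0 at initialization
beta, eta = 0.9, 0.0005
for t in range(500):
    g = -2 * X.T @ (y - X @ w)         # gradient at the current iterate
    m = beta * m + eta * g             # m^(t) = beta * m^(t-1) + eta * g^(t)
    w = w - m                          # move using the accumulated momentum
print(w)                               # should be close to w_true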
Optimization for ML: Some Final Comments
• Gradient methods are simple to understand and implement
• More sophisticated optimization methods also often use gradient methods
  • The backpropagation algorithm used in deep neural nets is GD + the chain rule of differentiation
• Use subgradient methods if the function is not differentiable
• Constrained optimization can use the Lagrangian or projected/proximal GD
• Second-order methods such as Newton's method are faster but computationally expensive
• "But computing all this gradient-related stuff by hand looks scary to me. Any help?"
  • Don't worry: Automatic Differentiation (AD) methods are available now (will see them later)
  • AD only requires specifying the loss function (especially useful for deep neural nets)
  • Many packages such as TensorFlow, PyTorch, etc. provide AD support
  • But having a good understanding of optimization is still helpful
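As an illustration of automatic differentiation (a minimal PyTorch sketch on made-up data; none of this code is from the slides), the gradient of the loss is obtained without deriving it by hand:

import torch

torch.manual_seed(0)
X = torch.randn(200, 3)
w_true = torch.tensor([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * torch.randn(200)

w = torch.zeros(3, requires_grad=True)     # parameters tracked by autograd
eta = 0.001
for t in range(1000):
    loss = ((y - X @ w) ** 2).sum()        # squared-error loss
    loss.backward()                        # AD computes d(loss)/d(w)
    with torch.no_grad():
        w -= eta * w.grad                  # plain gradient descent step
    w.grad.zero_()                         # reset gradients for the next iteration
print(w.detach())                          # should be close to w_true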
Parametric or Non-Parametric?
Linear Regression (parametric) vs. the k-NN approach (non-parametric):
• Assumption on the function f: linear regression assumes a linear function; k-NN can work even if the function is non-linear, but it has to be locally constant
• High dimensions: linear regression has complexity problems, which can be overcome by efficient algorithms; for k-NN it becomes difficult to find nearby neighbors, which can cause errors
• Bias: low for linear regression; for k-NN, small K => low bias, large K => high bias
• Variance: depends on the problem for linear regression; for k-NN, small K => high variance, large K => low variance
• Computations: linear regression computes once, during the model-fitting phase, after which predictions are quick; k-NN looks at all the training points every time a prediction has to be made
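A minimal comparison sketch using scikit-learn on synthetic data (the data-generating function and the choice k = 10 are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)           # a non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lin = LinearRegression().fit(X_tr, y_tr)                   # parametric: assumes f is linear
knn = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)  # non-parametric: local averaging
print(lin.score(X_te, y_te), knn.score(X_te, y_te))        # R^2 on held-out data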
Possible Questions
• How accurately do we know our model parameters?
• Is at least one predictor variable useful in the prediction?
• We have to examine the p-values
• Which subset of the predictor variables is important?
• There are several techniques of predictor variable/feature selection
• What would be the accuracy of predictions on unseen data?
• We can generate confidence intervals on our estimates
• Cross-validation gives us an estimate.
• Do I need more predictor variables/features?
• Look at patterns in the residual errors
Confidence intervals for predictor estimates
• What causes errors in the estimation of ŵ? The standard errors shrink with:
  • More data: larger n
  • Larger coverage: a wider spread of x values, i.e., larger Σ_i ( x_i − x̄ )²
  • Better data: smaller noise σ

SE( ŵ_0 ) = σ √( 1/n + x̄² / Σ_i ( x_i − x̄ )² )
SE( ŵ_1 ) = σ √( 1 / Σ_i ( x_i − x̄ )² )
General formula: SE( ŵ )² = σ² ( Xᵀ X )⁻¹
Significance of predictor variables
• As we saw, there are inherent uncertainties in the estimation of w (= β)
• We evaluate the importance of predictors using hypothesis testing, using the t-statistics and p-values (a small p-value, < 0.05, indicates a significant predictor)
• The null hypothesis is that β_i = 0
Sample Results
import statsmodels.api as sm

# assumes X holds the predictor columns and y the response (e.g., the Advertising data)
X2 = sm.add_constant(X)   # add an intercept column to the predictors
est = sm.OLS(y, X2)       # ordinary least squares
est2 = est.fit()
print(est2.summary())     # coefficients with their t-statistics and p-values
Subset Selection Techniques
• Total number of subsets of a set of size J = 2^J
• Goal: all the variables in the model should have sufficiently low p-values, and all the variables outside the model should have a large p-value if added to the model.
• Three possible approaches
• Forward selection
• Backward selection
• Mixed selection
Subset Selection Techniques
• Forward selection:
  • Begin with a null (empty) set, S
  • Perform J linear regressions, each with exactly one variable
  • Add the variable that results in the lowest cross-validation error to the set S
  • Again, perform J−1 linear regressions, each with 2 variables
  • Add the variable that results in the lowest cross-validation error to the set S
  • Continue until some stopping criterion is reached, e.g., the CV error is no longer decreasing (a code sketch follows below)
• Backward selection begins with all the variables and removes the variable with the highest p-value at successive steps
• Mixed selection is similar to forward selection, but it may also remove a variable if it doesn't yield any improvement to the model
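A minimal sketch of forward selection driven by cross-validation error (scikit-learn on made-up data; the stopping rule is the "CV error no longer decreasing" criterion from the slide):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)   # only variables 0 and 2 matter

selected, remaining, best_err = [], list(range(X.shape[1])), np.inf
while remaining:
    # try adding each remaining variable; keep the one giving the lowest CV error
    errs = {j: -cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                                scoring="neg_mean_squared_error", cv=5).mean()
            for j in remaining}
    j_best = min(errs, key=errs.get)
    if errs[j_best] >= best_err:          # stop: CV error is not decreasing any more
        break
    best_err = errs[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
print(selected)                           # typically recovers [0, 2]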
Do I need more predictors / a change of model?
• When we estimated the variance of ε, we assumed that the residuals were uncorrelated and normally distributed with mean 0 and fixed variance.
• These assumptions need to be verified using the data. In residual analysis, we typically create two types of plots:
  1. a plot of the residuals with respect to the predictor values x or the predicted values ŷ. This allows us to compare the distribution of the noise at different values of the predictors.
  2. a histogram of the residuals. This allows us to explore the distribution of the noise independently of x or ŷ.
Patterns in Residuals
• The uncertainty in predictions depends on our confidence in w: different w => different predicted values of y
• Given x, examine the distribution of the resulting predictions for y, and determine its mean and standard deviation.
Potential problems of Linear Models
• Non-linearity
  • Can use polynomial (linear) regression or design better features
• Outliers
  • They disturb the model because of the quadratic penalty; discard outliers carefully
• High-leverage points
  • Outliers in the predictor variables
• Collinearity (2 or more predictor variables have high correlation)
  • Keep only one of them, or design a good combined feature
• Correlation of error terms, non-constant variance of error terms
  • These give an unwarranted (overly high) confidence in the model; we can't trust the confidence intervals on the model parameters
Polynomial Regression
• The simplest non-linear model we can consider, for a response Y and a predictor X, is a polynomial model of degree M:

  y = β_0 + β_1 x + β_2 x² + ... + β_M x^M + ε

• Just as in the case of linear regression with cross terms, polynomial regression is a special case of linear regression: we treat each power of X as a separate predictor. Thus, we can write the model as a linear regression on the expanded features (x, x², ..., x^M).
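A minimal sketch of fitting a polynomial model as a linear regression on expanded features (degree 3 and the synthetic data are illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + 0.2 * rng.normal(size=200)

# each power of x becomes a separate predictor; ordinary linear regression is then fit
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict(np.array([[2.0]])))    # prediction at x = 2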
Polynomial Regression (cont.)
• The degree M can be chosen by K-fold cross-validation:

  CV( Model ) = (1/K) Σ_{i=1}^{K} L_i

  where L_i is the loss of the model on the i-th held-out fold.
• Fitting the model using a modified (regularized) loss function L_reg would result in model parameters with desirable properties (specified by the regularization term R).
Ridge Regression
• Alternatively, we can choose a regularization term that penalizes the squares of the parameter magnitudes. Then, our regularized loss function is:

  L_ridge(β) = (1/n) Σ_{i=1}^{n} ( y_i − ŷ_i )² + λ Σ_j β_j²
Ridge Regression
• We often say that L_ridge is the loss function for l2 regularization.
• Finding the model parameters β̂_ridge that minimize the l2-regularized loss function is called ridge regression.
LASSO (Least Absolute Shrinkage and Selection Operator) Regression
• Ridge regression reduces the parameter values but doesn't force them to go to zero. LASSO is very effective in doing that.
• It uses the following regularized loss function:

  L_LASSO(β) = (1/n) Σ_{i=1}^{n} ( y_i − ŷ_i )² + λ Σ_j |β_j|
LASSO Regression
• Hence, we often say that L_LASSO is the loss function for l1 regularization.
• Finding the model parameters β̂_LASSO that minimize the l1-regularized loss function is called LASSO regression.
Choosing λ
• In both ridge and LASSO regression, the larger our choice of the regularization parameter λ, the more heavily we penalize large values in β.
• If λ is close to zero, we recover the MSE; i.e., ridge and LASSO regression reduce to ordinary regression.
• If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force β̂_ridge and β̂_LASSO to be close to zero.
• To avoid ad-hoc choices, we should select λ using cross-validation.
• Once the model is trained, we use the unregularized performance measure to evaluate the model's performance.
Elastic Net
• A middle ground between ridge and LASSO regression
• The regularization term is a simple mix of the l1 and l2 penalties, controlled by a mixing parameter r (r = 1 gives the LASSO penalty, r = 0 gives the ridge penalty)
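A minimal sketch comparing the three regularized regressions in scikit-learn (the alpha and l1_ratio values are illustrative; in practice they would be chosen by cross-validation as discussed above):

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)   # only 2 of 10 features matter

ridge = Ridge(alpha=1.0).fit(X, y)                     # l2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                     # l1 penalty: sets some coefficients to exactly 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of l1 and l2 (l1_ratio plays the role of r)
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))
print(enet.coef_.round(2))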
Constrained Optimization
Projected Gradient Descent
Consider an optimization problem of the form: minimize L(w) subject to w ∈ C, where C is the constraint set. Projected GD alternates a standard gradient step with a projection back onto C:

  z^(t+1) = w^(t) − η_t g^(t)
  w^(t+1) = Π_C( z^(t+1) )   (project onto the constraint set)
Projected GD: How to Project?
• Here, projecting a point means finding the "closest" point to it in the constraint set C.
• Projected GD is commonly used only when the projection step is simple and efficient.
• For some sets the projection step is easy to compute:
  • C = unit-radius Euclidean ball: projection = normalize the vector to unit Euclidean length
  • C = set of non-negative reals: projection = set each negative entry to zero
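A minimal sketch of projected GD with a non-negativity constraint (a synthetic least-squares problem; the learning rate and data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, 2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

w = np.zeros(3)
eta = 0.001
for t in range(1000):
    g = -2 * X.T @ (y - X @ w)     # ordinary gradient step on the unconstrained objective
    w = w - eta * g
    w = np.maximum(w, 0.0)         # projection onto the non-negative orthant
print(w)                           # all entries stay non-negative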
Proximal Gradient Descent
Consider minimizing a regularized loss function of the form L(w) + R(w), where R is the regularizer. (Note: the regularization hyperparameter is assumed to be part of R itself.)
Constrained Optimization via the Lagrangian
For a constrained problem of minimizing L(w) subject to g(w) ≤ 0, the Lagrangian combines the objective and the constraint using a Lagrange multiplier α ≥ 0:

  𝓛(w, α) = L(w) + α g(w)

Therefore, we can write our original problem as

  min_w max_{α ≥ 0} 𝓛(w, α)
Co-ordinate Descent (CD)
• Standard gradient descent updates all entries of w at once: w^(t+1) = w^(t) − η_t g^(t).
• CD: in each iteration, update only one entry (co-ordinate) of w, keeping all the others fixed.
• Usually converges to a local optimum, but it is very very useful in practice; will see examples later.
Newton's Method
• Unlike GD and its variants, Newton's method uses second-order information (the second derivative, a.k.a. the Hessian).
• At each point w^(t), minimize the quadratic (second-order) approximation of L(w) around w^(t).
• Exercise: show that minimizing this quadratic approximation yields the update

  w^(t+1) = w^(t) − [ H^(t) ]⁻¹ g^(t),   where H^(t) = ∇² L( w^(t) ) is the Hessian and g^(t) the gradient.
SVM Regression
ε-insensitive Loss Function
[Figure: the loss is zero for prediction errors inside the interval (−ε, ε) and increases only outside it.]
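For reference (the standard definition used by SVM regression; the slide shows it only as a figure), the ε-insensitive loss is

L_\varepsilon(y, \hat{y}) = \max\big(0,\; |y - \hat{y}| - \varepsilon\big)

so prediction errors smaller than ε are ignored, and larger errors are penalized linearly.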
SVM Regression
Non-linear data
• SVMs allow for a computationally efficient way of transforming the dataset to higher dimensions using the kernel trick (see the sketch below)
• Common kernels that are used are:
  • Linear, polynomial, Gaussian RBF, sigmoid
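A minimal sketch of kernelized SVM regression in scikit-learn (the RBF kernel and the hyperparameter values C and epsilon are illustrative choices):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# epsilon sets the width of the insensitive tube; the RBF kernel handles the non-linearity
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(svr.predict(np.array([[1.0]])))     # prediction near sin(1.0)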