Regression
Sabber Ahamed
sabbers@gmail.com
linkedin.com/in/sabber-ahamed
github.com/msahamed
Contents

1 Linear Regression
  1.1 Estimating β0 and β1 using least squares
    1.1.1 Analytical solution
    1.1.2 Gradient Descent solution
  1.2 Interpretation of Coefficients
    1.2.1 Case 1: when predictors (xi) are binary (0 or 1)
    1.2.2 Case 2: when predictors (xi) are continuous
    1.2.3 Example using the Boston Housing Dataset
  1.3 Limitations of Linear Regression
  1.4 Why heteroscedasticity and non-normality are problematic
    1.4.1 Homoscedasticity
    1.4.2 Normality of Residuals
  1.5 Lasso and Ridge Regression
    1.5.1 Lasso Regression
    1.5.2 Advantages
    1.5.3 Disadvantages
    1.5.4 Ridge Regression
    1.5.5 Advantages
    1.5.6 Disadvantages
2 Logistic Regression
1 Linear Regression
Linear regression is a statistical method that is used to model the relationship
between a dependent variable and one or more independent variables. In
other words, it is a method to find the line of best fit for a given set of data
points.
Linear regression can be used for both simple and multiple regression prob-
lems. In simple regression, we have one dependent variable and one indepen-
dent variable, while in multiple regression, we have one dependent variable
and more than one independent variable.
In this lecture, we will focus on simple linear regression.
In simple linear regression, we have a dependent variable y and an indepen-
dent variable x. We assume that there is a linear relationship between x and
y, which can be represented as:
y = β0 + β1 x + ϵ (1)
where β0 is the y-intercept, β1 is the slope of the line, and ϵ is the error term.
The goal of linear regression is to estimate the values of β0 and β1 so that
the line of best fit can be determined.
1.1 Estimating β0 and β1 using least squares

1.1.1 Analytical solution

In the least squares approach, we choose β0 and β1 to minimize the sum of squared errors E:

E = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2    (2)
To minimize E, we take the partial derivatives of E with respect to β0 and
β1 and set them equal to 0:
\frac{\partial E}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right) = 0    (3)

\frac{\partial E}{\partial \beta_1} = -2 \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right) x_i = 0    (4)
Solving these two equations for β0 and β1 gives the least squares estimates:

\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (5)

\beta_0 = \bar{y} - \beta_1 \bar{x}    (6)
As an example, suppose we have the following data points:

x:  1  2  3  4  5  6
y:  1  3  2  5  4  6
We want to find the line of best fit for these data points.
First, we calculate the means of x and y:
\bar{x} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5    (7)

\bar{y} = \frac{1 + 3 + 2 + 5 + 4 + 6}{6} = 3.5    (8)
Next, using formula (5), we can calculate β1:

\beta_1 = \frac{15.5}{17.5} \approx 0.886    (9)
Finally, we calculate β0 using formula (6):

\beta_0 = \bar{y} - \beta_1 \bar{x} = 3.5 - 0.886 \times 3.5 \approx 0.4    (10)
Let’s implement simple linear regression in Python using the NumPy and Matplotlib libraries:
import numpy as np
import matplotlib.pyplot as plt
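Before computing β1, we need the example data and the sample means; the following restores that step (the values come from the example table above, and the names x, y, x_mean, y_mean are chosen to match the code that follows):

# Data points from the example table
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1, 3, 2, 5, 4, 6], dtype=float)

# Sample means used in formulas (5) and (6)
x_mean = np.mean(x)
y_mean = np.mean(y)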
Next, we calculate β1 :
num = np.sum((x - x_mean) * (y - y_mean))
denum = np.sum((x - x_mean) ** 2)
beta_1 = num / denum
Finally, we calculate β0 :
beta_0 = y_mean - beta_1 * x_mean
pred = beta_0 + beta_1 * x
We can plot the data points and the line of best fit using matplotlib:
plt.scatter(x, y, label='data points')
plt.plot(x, pred, 'r', label='line of best fit')
plt.legend()
plt.show()
Figure 1: Linear regression
The resulting plot should show the data points and the line of best fit.
In this lecture, we covered simple linear regression, including the estimation
of β0 and β1 using the method of least squares. We also provided an example
and implementation in Python using NumPy and Matplotlib. Linear regres-
sion is a powerful tool for modeling relationships between variables and can
be applied in a wide range of fields.
1.1.2 Gradient Descent solution

An alternative to the analytical solution is gradient descent, an iterative optimization algorithm. Here the cost function is the sum of squared errors, averaged over the number of observations m and multiplied by a factor of 1/2
for mathematical convenience during the derivation (i.e., the derivative of a
squared function). If x^{(i)} is the feature vector for the i-th training example and y^{(i)} is the corresponding target value, then the cost function J(β) is
defined as follows:
J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\beta(x^{(i)}) - y^{(i)} \right)^2    (12)
where h_\beta(x) = \beta^T x is the hypothesis function that gives our predicted value of y for a given x.
The idea of the gradient descent algorithm is to update each parameter βj in β so that the cost function J(β) decreases at every step. The update rule for βj is obtained by taking the partial derivative of the cost function with respect to βj and moving βj a small step in the direction of the negative gradient:
\beta_j := \beta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\beta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}    (13)
In this equation, α is the learning rate and x_j^{(i)} is the j-th feature of the i-th training example.
This update rule should be applied to all βj in β simultaneously, which can be done efficiently using vectorized operations. The vectorized update rule is:
\beta := \beta - \alpha \frac{1}{m} X^T (X\beta - y)    (14)
In this equation, X is the m × (n + 1) matrix of feature vectors, with an
extra column of 1s added for the intercept term, and y is the m-dimensional
vector of target values.
The gradient descent algorithm iteratively applies the update rule (Equation
14) until the parameters β converge to their optimal values that minimize
J(β), or until a maximum number of iterations is reached.
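As a quick illustration (a sketch, not part of the notes' own code), one iteration of the vectorized update in Equation 14 can be written with NumPy as follows, assuming X already includes the extra column of 1s for the intercept:

import numpy as np

def gradient_step(beta, X, y, alpha):
    # One vectorized update: beta := beta - (alpha / m) * X^T (X beta - y), as in Equation 14
    m = len(y)
    return beta - (alpha / m) * X.T @ (X @ beta - y)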
Let’s implement gradient descent in Python using the NumPy and Matplotlib libraries.
First, we import the necessary libraries:
import numpy as np
import matplotlib.pyplot as plt
We also need to define the cost function (Equation 12), the sum of squared errors scaled by 1/(2n):
def cost_function(x, y, beta_0, beta_1):
    n = len(x)
    a = np.sum((beta_0 + beta_1 * x - y) ** 2)
    return 1 / (2 * n) * a
Next, we define the gradient descent function (a minimal sketch is given below). It takes in the data points, the learning rate, and the number of iterations. It initializes β0 and β1 to 0, calculates the predicted values, and updates the parameters using the gradients. It also keeps track of the cost at each iteration.
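The gradient descent function itself does not appear in this extract; the following is a minimal sketch consistent with the description above and with the call below (same argument order x, y, alpha, iterations and the same three return values):

def gradient_descent(x, y, alpha, iterations):
    n = len(x)
    beta_0, beta_1 = 0.0, 0.0           # initialize the parameters to 0
    cost = []
    for _ in range(iterations):
        pred = beta_0 + beta_1 * x      # predicted values with current parameters
        error = pred - y
        # Simple-regression form of the update in Equation 13 (simultaneous update)
        beta_0 -= alpha * np.sum(error) / n
        beta_1 -= alpha * np.sum(error * x) / n
        cost.append(cost_function(x, y, beta_0, beta_1))
    return beta_0, beta_1, cost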
We can now call the gradient descent function:
alpha = 0.01
iterations = 1000
beta_0, beta_1, cost = gradient_descent(x, y, alpha, iterations)
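The plotting code for the cost curve is not shown in this extract; a minimal sketch, assuming cost is the list of per-iteration costs returned above, is:

plt.plot(range(iterations), cost)
plt.xlabel('iteration')
plt.ylabel('cost')
plt.show()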
The resulting plot should show a decreasing cost over the iterations.
Finally, we can plot the data points and the line of best fit using the estimated
values of β0 and β1 :
plt.scatter(x, y)
plt.plot(x, beta_0 + beta_1 * x)
plt.show()
The resulting plot should show the data points and the line of best fit.
1.2 Interpretation of Coefficients

1.2.1 Case 1: when predictors (xi) are binary (0 or 1)

Consider a simple model with a single binary predictor x:

y = β0 + β1 x + ϵ (15)
where β0 is the intercept, β1 is the coefficient for the binary predictor x, and
ϵ is the error term.
If x = 0, the predicted value of y is ŷ = β0 . If x = 1, the predicted value
of y is ŷ = β0 + β1 . Thus, the coefficient β1 represents the difference in the
predicted value of y between the two levels of the predictor.
1.2.2 Case 2: when predictors (xi) are continuous

Now consider a model with a single continuous predictor x:

y = β0 + β1 x + ϵ (16)
where β0 is the intercept, β1 is the coefficient for the predictor x, and ϵ is the
error term.
If we increase x by one unit, the predicted value of y increases by β1 . Thus,
the coefficient β1 represents the expected change in y associated with a one-
unit increase in x, holding all other predictors constant.
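This follows directly from the model: comparing the predicted values at x + 1 and at x,

\hat{y}(x+1) - \hat{y}(x) = (\beta_0 + \beta_1 (x+1)) - (\beta_0 + \beta_1 x) = \beta_1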
1.2.3 Example using the Boston Housing Dataset

As an example, suppose we model house prices using several predictors, including the square footage of the house and a binary indicator for a particular neighborhood. If the coefficient on square footage is
positive, this means that the price of the house is expected to increase with
the square footage of the house, holding all other predictors constant. If β3
the square footage of the house, holding all other predictors constant. If β3
is positive, this means that the price of the house is expected to be higher in
that neighborhood, holding all other predictors constant.
By interpreting the coefficients, we can gain insight into the relationships
between the predictors and the response variable, and make predictions about
new observations based on the values of the predictors.
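As an illustration (not taken from the notes), such a multiple regression can be fitted with NumPy's least squares solver; the data values and predictor names below are made up for the sketch:

import numpy as np

# Hypothetical data: square footage, a 0/1 neighborhood indicator, and price (in $1000s)
sqft = np.array([1500., 2000., 1200., 1800., 2400., 1600.])
neighborhood = np.array([0., 1., 0., 1., 1., 0.])
price = np.array([200., 320., 150., 290., 400., 230.])

# Design matrix with a column of 1s for the intercept
X = np.column_stack([np.ones_like(sqft), sqft, neighborhood])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

print('expected price change per extra square foot:', beta[1])
print('expected price difference for the neighborhood:', beta[2])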
1.3 Limitations of Linear Regression

Linear regression relies on several assumptions: a linear relationship between the predictors and the response, limited influence from outliers, constant error variance (homoscedasticity), and normally distributed residuals. Figure 2 illustrates cases where these assumptions hold and where they are violated.
Figure 2: Variations in Linear Regression. The figure demonstrates differ-
ent scenarios in linear regression modeling. The top row from left to right
shows an ideal linear case, a non-linear case, and a case with outliers. The
bottom row, from left to right, displays normality of residuals, homoscedas-
ticity (constant error variance), and heteroscedasticity (non-constant error
variance). Each plot demonstrates key considerations and assumptions in
linear regression analysis.
1.4 Why heteroscedasticity and non-normality are problematic
1.4.1 Homoscedasticity:
The assumption of homoscedasticity (meaning ”equal scatter”) is central
to linear regression models. Homoscedasticity describes a situation in which
the error term (that is, the ”noise” or random disturbance in the relationship
between the independent variables and the dependent variable) is the same
across all levels of the independent variables.
Here’s an example. Suppose you’re trying to predict a person’s weight based
on their height. If we have homoscedasticity, that means the variability in
weights is the same for all heights. In other words, whether you’re looking at
short people, medium-height people, or tall people, you see the same amount
of variation in weights.
If we don’t have homoscedasticity (i.e., we have heteroscedasticity), the vari-
ability in weights changes with height. Maybe there’s a lot of variability in
weights among tall people, but less variability among short people.
This matters for linear regression because when we have heteroscedasticity,
the estimates of the coefficients can be inefficient, although they are still
unbiased. This means that our predictions won’t be as good as they could
be. Also, hypothesis tests about the coefficients could give the wrong results.
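A quick way to see heteroscedasticity in practice is to plot the residuals against the predictor; the simulated data below (not from the notes) mimics the height-and-weight example, with the spread of the noise growing with height:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
height = rng.uniform(150, 200, size=200)                   # hypothetical heights in cm
noise_sd = 0.1 * (height - 140)                            # noise spread grows with height
weight = -80 + 0.9 * height + rng.normal(scale=noise_sd)   # heteroscedastic errors

# Fit a simple linear regression and inspect the residuals
beta_1, beta_0 = np.polyfit(height, weight, 1)
residuals = weight - (beta_0 + beta_1 * height)

plt.scatter(height, residuals)
plt.axhline(0, color='r')
plt.xlabel('height')
plt.ylabel('residual')
plt.show()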
1.4.2 Normality of Residuals:

Linear regression also assumes that the residuals (the differences between the observed and predicted values) are approximately normally distributed. If the residuals deviate strongly from normality, this is a sign that the model isn't appropriate and won't give valid hypothesis tests or confidence intervals.
1.5 Lasso and Ridge Regression

Lasso and Ridge regression are regularization techniques that add a penalty on the size of the coefficients to the least squares cost function in order to prevent overfitting.

Figure 3: The plots for Lasso and Ridge penalties. The blue contours represent the cost function (SSE) and the red circle/ellipse represents the L1/L2 penalty. The optimal coefficients (β0 and β1) are represented by the red dot.
1.5.1 Lasso Regression

Lasso adds an L1 penalty on the coefficients to the least squares cost function:

J(\beta) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|    (18)
The L1 penalty has the effect of shrinking some of the coefficients to zero,
effectively performing variable selection. This is because the L1 norm has
sharp corners at the axes (Figure 3), which can cause the optimizer to set
some coefficients exactly to zero. This property is shown graphically by the
diamond shape touching the contour at an axis, implying a beta value of zero.
Thus, Lasso can be used for feature selection and reducing the dimensionality
of the problem.
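As a quick illustration of this zeroing-out behaviour (not part of the notes, and using scikit-learn, which the notes do not otherwise use), consider data where only two of five predictors actually matter:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                    # five candidate predictors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)  # only the first two matter

lasso = Lasso(alpha=0.1)   # the tuning parameter lambda is called alpha in scikit-learn
lasso.fit(X, y)
print(lasso.coef_)         # coefficients of the irrelevant predictors are typically exactly zero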
1.5.2 Advantages
• Performs feature selection by setting some coefficients to exactly zero.
• Can reduce the dimensionality of the problem, which can improve the
computational efficiency and generalization performance.
1.5.3 Disadvantages
• Can be unstable and sensitive to the choice of the tuning parameter λ.
• If there are highly correlated predictors, Lasso tends to select one and
ignore the others, which can lead to bias and poor generalization per-
formance.
1.5.4 Ridge Regression

Ridge regression instead adds an L2 penalty on the coefficients:

J(\beta) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2    (19)
The L2 penalty has the effect of shrinking all the coefficients towards zero,
but not exactly to zero. This property is shown graphically by the circle
shape touching the contour in an area where beta values are not exactly
zero (Figure 3). This can reduce the impact of the predictors with small
coefficients and prevent overfitting, but it does not perform variable selection
like Lasso.
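The contrast with Lasso can be seen with the same kind of sketch (again using scikit-learn, which the notes do not use, on the same synthetic data):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0)  # larger alpha means stronger shrinkage
ridge.fit(X, y)
print(ridge.coef_)         # all coefficients are shrunk towards zero, but none is exactly zero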
1.5.5 Advantages
• Can improve the computational stability and generalization perfor-
mance by reducing the impact of predictors with small coefficients.
• Does not suffer from the instability and sensitivity to the choice of the
tuning parameter that Lasso does.
1.5.6 Disadvantages
• Does not perform feature selection, so it does not reduce the dimen-
sionality of the problem.
• Can still suffer from bias and poor generalization performance if there
are irrelevant or redundant predictors.
Overall, Lasso and Ridge regression are useful regularization techniques for
linear regression. However, they do not address the problem of irrelevant or
redundant predictors, so they may not be sufficient for some problems.
2 Logistic Regression
Logistic Regression is a statistical model that in its basic form uses a logistic
function to model a binary dependent variable. It’s an extension of the linear
regression model for classification problems.
In logistic regression, we are interested in predicting a binary outcome. The
function used to make predictions is the logistic function or the sigmoid
function. The hypothesis function for logistic regression is defined as:
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}    (20)
Here are the notations in the equation:
• hθ (x): This is the predicted output, representing the probability that
the input example x belongs to the positive class.
Figure 4: The logistic function. In this code, sigmoid(x) is a function that
takes in a real-valued number x and returns a number between 0 and 1. This
is the logistic function. We then create an array x of 100 evenly spaced
numbers between -10 and 10 using the linspace function from numpy. We
apply the sigmoid function to these numbers to get the corresponding y-
values.
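The code referred to in the caption is not included in this extract; a minimal sketch that matches the description is:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Logistic function: maps any real number to a value between 0 and 1
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 100)  # 100 evenly spaced numbers between -10 and 10
y = sigmoid(x)

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sigmoid(x)')
plt.show()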