
Regression for applied machine learning

Sabber Ahamed
sabbers@gmail.com
linkedin.com/in/sabber-ahamed
github.com/msahamed

July 17, 2023

Contents
1 Linear Regression
  1.1 Estimating β0 and β1 using least squares
    1.1.1 Analytical solution
    1.1.2 Gradient Descent solution
  1.2 Interpretation of Coefficients
    1.2.1 Case 1: when predictors (xi) are binary (0 or 1)
    1.2.2 Case 2: when predictors (xi) are continuous
    1.2.3 Example using the Boston Housing Dataset
  1.3 Limitations of Linear Regression
  1.4 Why heteroscedasticity and non-normality are problematic
    1.4.1 Homoscedasticity
    1.4.2 Normality of Residuals
  1.5 Lasso and Ridge Regression
    1.5.1 Lasso Regression
    1.5.2 Advantages
    1.5.3 Disadvantages
    1.5.4 Ridge Regression
    1.5.5 Advantages
    1.5.6 Disadvantages

2 Logistic Regression

1 Linear Regression
Linear regression is a statistical method that is used to model the relationship
between a dependent variable and one or more independent variables. In
other words, it is a method to find the line of best fit for a given set of data
points.
Linear regression can be used for both simple and multiple regression prob-
lems. In simple regression, we have one dependent variable and one indepen-
dent variable, while in multiple regression, we have one dependent variable
and more than one independent variable.
In this lecture, we will focus on simple linear regression.
In simple linear regression, we have a dependent variable y and an indepen-
dent variable x. We assume that there is a linear relationship between x and
y, which can be represented as:

y = β0 + β1 x + ϵ (1)

where β0 is the y-intercept, β1 is the slope of the line, and ϵ is the error term.
The goal of linear regression is to estimate the values of β0 and β1 so that
the line of best fit can be determined.

1.1 Estimating β0 and β1 using least squares


To estimate the values of β0 and β1, we need a set of data points. Let
(x1, y1), (x2, y2), ..., (xn, yn) be n data points.

1.1.1 Analytical solution


We can use the method of least squares to estimate β0 and β1 . The idea is
to find the values of β0 and β1 that minimize the sum of squared errors (E):

E = Σ_{i=1}^{n} (yi − (β0 + β1 xi))²    (2)

To minimize E, we take the partial derivatives of E with respect to β0 and
β1 and set them equal to 0:

∂E/∂β0 = −2 Σ_{i=1}^{n} (yi − (β0 + β1 xi)) = 0    (3)

∂E/∂β1 = −2 Σ_{i=1}^{n} (yi − (β0 + β1 xi)) xi = 0    (4)

Solving these equations simultaneously, we get:


β1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²    (5)

β0 = ȳ − β1 x̄    (6)

where x̄ and ȳ are the means of x and y, respectively.


Let’s say we have the following data points:

x 1 2 3 4 5 6
y 1 3 2 5 4 6

Table 1: Example dataset

We want to find the line of best fit for these data points.
First, we calculate the means of x and y:

x̄ = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5    (7)

ȳ = (1 + 3 + 2 + 5 + 4 + 6) / 6 = 3.5    (8)

Next, using Equation (5), we can calculate β1:

β1 = 15.5 / 17.5 ≈ 0.886    (9)

Finally, we calculate β0:

β0 = 3.5 − 0.886 × 3.5 ≈ 0.4    (10)

Therefore, the line of best fit is:

y = 0.4 + 0.886x    (11)

Let's implement simple linear regression in Python using the NumPy and Matplotlib libraries. First, we import the necessary libraries:

import numpy as np
import matplotlib.pyplot as plt

Next, we define the data points:


x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 3, 2, 5, 4, 6])

Then, we calculate the means of x and y:


x_mean = np.mean(x)
y_mean = np.mean(y)

Next, we calculate β1 :
num = np.sum((x - x_mean) * (y - y_mean))
denom = np.sum((x - x_mean) ** 2)
beta_1 = num / denom

Finally, we calculate β0 :
beta_0 = y_mean - beta_1 * x_mean
pred = beta_0 + beta_1 * x

We can plot the data points and the line of best fit using matplotlib:
plt.scatter(x, y, label='data points')
plt.plot(x, pred, 'r', label='line of best fit')
plt.legend()
plt.show()

Figure 1: Linear regression

The resulting plot should show the data points and the line of best fit.
In this lecture, we covered simple linear regression, including the estimation
of β0 and β1 using the method of least squares. We also provided an example
and implementation in Python using NumPy and Matplotlib. Linear regres-
sion is a powerful tool for modeling relationships between variables and can
be applied in a wide range of fields.

1.1.2 Gradient Descent solution


Another way to estimate the values of β0 and β1 is through gradient descent.
Gradient descent is an optimization algorithm used to minimize a function
by iteratively moving in the direction of steepest descent as defined by the
negative of the gradient. In this case, the function we want to minimize is
the sum of squared errors (SSE). The algorithm starts with an initial guess
for β0 and β1 (collectively denoted as β) and updates them in the opposite
direction of the gradient until convergence. The learning rate, denoted by α,
controls the step size at each iteration.
We first define the cost function J(β), which is essentially the SSE averaged over the number of observations m and multiplied by a factor of 1/2 for mathematical convenience (the 1/2 cancels the factor of 2 that appears when differentiating the square). If x^(i) is the feature vector for the i-th training example and y^(i) is the corresponding target value, then the cost function J(β) is defined as follows:

J(β) = (1 / 2m) Σ_{i=1}^{m} (hβ(x^(i)) − y^(i))²    (12)

where hβ(x) = β^T x is the hypothesis function that gives our predicted value
of y for a given x.
The idea of the gradient descent algorithm is to update each parameter βj
in β such that the cost function J(β) is minimized. The update rule for βj
is obtained by taking the partial derivative of the cost function with respect
to βj and moving βj a small step in the direction of the negative gradient:

βj := βj − α (1/m) Σ_{i=1}^{m} (hβ(x^(i)) − y^(i)) xj^(i)    (13)

In this equation, α is the learning rate and xj^(i) is the j-th feature of the i-th training example.
This update rule should be applied simultaneously to all βj in β which can
be performed efficiently using vectorized operations. The vectorized update
rule is:

β := β − α (1/m) X^T (Xβ − y)    (14)
In this equation, X is the m × (n + 1) matrix of feature vectors, with an
extra column of 1s added for the intercept term, and y is the m-dimensional
vector of target values.
The gradient descent algorithm iteratively applies the update rule (Equation
14) until the parameters β converge to their optimal values that minimize
J(β), or until a maximum number of iterations is reached.
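Before walking through the scalar implementation below, here is a minimal sketch of the vectorized update in Equation 14 on synthetic data (the data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

# Synthetic data: one feature plus a column of 1s for the intercept term.
rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])
y = X @ np.array([0.5, 2.0]) + rng.normal(0, 0.5, m)  # true beta = [0.5, 2.0]

beta = np.zeros(2)   # initial guess
alpha = 0.01         # learning rate
for _ in range(5000):
    # beta := beta - (alpha / m) * X^T (X beta - y), i.e. Equation 14
    beta -= alpha / m * X.T @ (X @ beta - y)

print(beta)  # should be close to the true values [0.5, 2.0]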
Let’s implement gradient descent in Python using NumPy and Matplotlib
libraries.

First, we import the necessary libraries:
import numpy as np
import matplotlib.pyplot as plt

Next, we define the data points:


x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 3, 2, 5, 4, 6])

We also need to define the cost function from Equation 12 (the squared errors averaged over the observations and halved):

def cost_function(x, y, beta_0, beta_1):
    n = len(x)
    squared_error = np.sum((beta_0 + beta_1 * x - y) ** 2)
    return 1 / (2 * n) * squared_error

Next, we define the gradient descent function:


def gradient_descent(x, y, alpha, iterations):
    beta_0 = 0
    beta_1 = 0
    n = len(x)
    cost = np.zeros(iterations)

    for i in range(iterations):
        y_hat = beta_0 + beta_1 * x                    # current predictions
        loss = y_hat - y                               # residuals
        gradient_beta_0 = 1 / n * np.sum(loss)         # partial derivative w.r.t. beta_0
        gradient_beta_1 = 1 / n * np.sum(loss * x)     # partial derivative w.r.t. beta_1
        beta_0 -= alpha * gradient_beta_0
        beta_1 -= alpha * gradient_beta_1
        cost[i] = cost_function(x, y, beta_0, beta_1)  # track convergence
    return beta_0, beta_1, cost

The function takes in the data points, learning rate, and number of iterations.
It initializes β0 and β1 to 0, calculates the predicted values, and updates
the parameters using the gradients. It also keeps track of the cost at each
iteration.
We can now call the gradient descent function:

alpha = 0.01
iterations = 1000
beta_0, beta_1, cost = gradient_descent(x, y, alpha, iterations)

We can plot the cost history to ensure convergence:


plt.plot(cost)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

The resulting plot should show a decreasing cost over the iterations.
Finally, we can plot the data points and the line of best fit using the estimated
values of β0 and β1 :
plt.scatter(x, y)
plt.plot(x, beta_0 + beta_1 * x)
plt.show()

The resulting plot should show the data points and the line of best fit.

1.2 Interpretation of Coefficients


After fitting a linear regression model, it is important to interpret the coeffi-
cients to understand the relationship between the predictors and the response
variable. The coefficients represent the change in the response variable asso-
ciated with a one-unit change in the corresponding predictor variable, holding
all other predictors constant.

1.2.1 Case 1: when predictors (xi ) are binary (0 or 1)


For binary predictors, the coefficient represents the difference in the response
variable between the two levels of the predictor. Let’s consider a simple
example where we have a binary predictor variable x that takes on values 0
or 1, and a continuous response variable y. The linear regression model is:

y = β0 + β1 x + ϵ (15)

where β0 is the intercept, β1 is the coefficient for the binary predictor x, and
ϵ is the error term.
If x = 0, the predicted value of y is ŷ = β0 . If x = 1, the predicted value
of y is ŷ = β0 + β1 . Thus, the coefficient β1 represents the difference in the
predicted value of y between the two levels of the predictor.
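As a quick sanity check of this interpretation, the sketch below applies the least-squares formulas (Equations 5 and 6) to synthetic data with a binary predictor; the data and coefficient values are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=200)                  # binary predictor (0 or 1)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, size=200)  # true beta_0 = 2.0, beta_1 = 1.5

num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sum((x - x.mean()) ** 2)
beta_1 = num / den
beta_0 = y.mean() - beta_1 * x.mean()

# For a binary predictor, beta_1 equals the difference between the two group
# means and beta_0 equals the mean response of the x = 0 group.
print(beta_1, y[x == 1].mean() - y[x == 0].mean())
print(beta_0, y[x == 0].mean())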

1.2.2 Case 2: when predictors (xi) are continuous


For continuous predictors, the coefficient represents the change in the re-
sponse variable associated with a one-unit increase in the predictor, holding
all other predictors constant. Let’s consider a simple example where we have
a continuous predictor variable x and a continuous response variable y. The
linear regression model is:

y = β0 + β1 x + ϵ (16)

where β0 is the intercept, β1 is the coefficient for the predictor x, and ϵ is the
error term.
If we increase x by one unit, the predicted value of y increases by β1 . Thus,
the coefficient β1 represents the expected change in y associated with a one-
unit increase in x, holding all other predictors constant.

1.2.3 Example using the Boston Housing Dataset


Let’s consider an example where we have a dataset of housing prices and
predictors such as the number of bedrooms, the square footage of the house,
and the neighborhood. We want to predict the price of a house based on
these predictors using linear regression. The linear regression model is:

price = β0 + β1 · bedrooms + β2 · square footage + β3 · neighborhood + ϵ (17)

where β0 is the intercept, β1 is the coefficient for the number of bedrooms,


β2 is the coefficient for the square footage of the house, β3 is the coefficient
for the neighborhood, and ϵ is the error term.
If β1 is positive, this means that the price of the house is expected to increase
with the number of bedrooms, holding all other predictors constant. If β2 is

positive, this means that the price of the house is expected to increase with
the square footage of the house, holding all other predictors constant. If β3
is positive, this means that the price of the house is expected to be higher in
that neighborhood, holding all other predictors constant.
By interpreting the coefficients, we can gain insight into the relationships
between the predictors and the response variable, and make predictions about
new observations based on the values of the predictors.
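In practice, a model like Equation 17 can be fit with a library such as scikit-learn. The sketch below uses a small, made-up housing dataset (the numbers and the 0/1 neighborhood indicator are purely illustrative, not the actual Boston data):

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: bedrooms, square footage, neighborhood indicator (0 or 1).
X = np.array([
    [2,  850, 0],
    [3, 1200, 0],
    [3, 1500, 1],
    [4, 1800, 1],
    [4, 2100, 0],
    [5, 2400, 1],
])
y = np.array([150000, 200000, 260000, 310000, 320000, 400000])  # prices

model = LinearRegression().fit(X, y)
print(model.intercept_)  # beta_0
print(model.coef_)       # beta_1 (bedrooms), beta_2 (square footage), beta_3 (neighborhood)
# Each coefficient is the expected change in price for a one-unit change in
# that predictor, holding the other predictors constant.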

1.3 Limitations of Linear Regression


Linear regression is a statistical method used to study the relationship be-
tween a dependent variable and one or more independent variables. Here are
some of the limitations and assumptions of linear regression:
• Linearity: Linear regression assumes that the relationship between
the dependent variable and the independent variables is linear. If the
relationship is non-linear, then linear regression may not be an appro-
priate method to use.
• Normality: Linear regression assumes that the residuals (the dif-
ferences between the predicted and actual values) are normally dis-
tributed. If the residuals are not normally distributed, then the results
may be biased or unreliable.
• Homoscedasticity: Linear regression assumes that the variance of the
residuals is constant across all levels of the independent variables. If the
variance is not constant, then the results may be biased or unreliable.
• Independence: Linear regression assumes that the observations are
independent of each other. If the observations are correlated, then the
results may be biased or unreliable.
Figure 2 shows six different situations that one may encounter in a linear
regression analysis. The top row, from left to right, displays an ideal case, a
non-linear case, and a case with outliers. In the ideal case, the data points
are scattered around the line of best fit, which is characteristic of a well-
behaved linear regression model. The non-linear case shows a scenario where
the relationship between the predictor variable (X) and the response variable
(Y) isn’t linear, resulting in a poor fit of the linear regression model. In the
case with outliers, the model is influenced by extreme values, causing a shift
in the line of best fit.
Figure 2: Variations in Linear Regression. The figure demonstrates differ-
ent scenarios in linear regression modeling. The top row from left to right
shows an ideal linear case, a non-linear case, and a case with outliers. The
bottom row, from left to right, displays normality of residuals, homoscedas-
ticity (constant error variance), and heteroscedasticity (non-constant error
variance). Each plot demonstrates key considerations and assumptions in
linear regression analysis.



The bottom row, from left to right, demonstrates the normality of residuals
case, a homoscedasticity case, and a heteroscedasticity case. For the normal-
ity of residuals case, the plot shows the frequency distribution of residuals
(prediction errors), where a bell-shaped curve suggests that residuals are
normally distributed. The homoscedasticity case displays residuals scattered
randomly around zero, indicating that the variance of errors is constant across
values of X, an essential assumption of linear regression. In contrast, the het-
eroscedasticity case shows that the spread of residuals increases as the value
of X increases, indicating a violation of the homoscedasticity assumption.

1.4 Why heteroscedasticity and non-normality are problematic
1.4.1 Homoscedasticity:
The assumption of homoscedasticity (meaning ”equal scatter”) is central
to linear regression models. Homoscedasticity describes a situation in which
the error term (that is, the ”noise” or random disturbance in the relationship
between the independent variables and the dependent variable) is the same
across all levels of the independent variables.
Here’s an example. Suppose you’re trying to predict a person’s weight based
on their height. If we have homoscedasticity, that means the variability in
weights is the same for all heights. In other words, whether you’re looking at
short people, medium-height people, or tall people, you see the same amount
of variation in weights.
If we don’t have homoscedasticity (i.e., we have heteroscedasticity), the vari-
ability in weights changes with height. Maybe there’s a lot of variability in
weights among tall people, but less variability among short people.
This matters for linear regression because when we have heteroscedasticity,
the estimates of the coefficients can be inefficient, although they are still
unbiased. This means that our predictions won’t be as good as they could
be. Also, hypothesis tests about the coefficients could give the wrong results.
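The residual plot below is a minimal sketch of what heteroscedasticity looks like, using made-up height and weight data in which the noise deliberately grows with height:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
height = rng.uniform(150, 200, 300)              # heights in cm (synthetic)
noise_sd = 0.2 * (height - 140)                  # error spread grows with height
weight = -100 + 1.0 * height + rng.normal(0, noise_sd)

# Fit simple linear regression (Equations 5 and 6) and compute residuals.
num = np.sum((height - height.mean()) * (weight - weight.mean()))
den = np.sum((height - height.mean()) ** 2)
b1 = num / den
b0 = weight.mean() - b1 * height.mean()
residuals = weight - (b0 + b1 * height)

plt.scatter(height, residuals, s=10)
plt.axhline(0, color='r')
plt.xlabel('Height (cm)')
plt.ylabel('Residual')
plt.show()  # the fan shape (wider spread at taller heights) signals heteroscedasticity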

1.4.2 Normality of Residuals:


The assumption of normally distributed residuals (which is really an as-
sumption about the errors in the model, which we estimate using residuals)
is needed for conducting hypothesis tests and constructing confidence inter-
vals.
The hypothesis tests for the coefficients in a linear regression model are based
on a t-distribution, and this t-distribution is valid when the errors (estimated
by residuals) follow a normal distribution. If this assumption is violated, then
these hypothesis tests and confidence intervals may be inaccurate.
An example of this can be seen when you have a binary dependent variable.
Suppose you are trying to predict whether a coin flip comes up heads (1) or
tails (0) based on the temperature of the room. The errors here clearly don’t
follow a normal distribution - they’re binary. In this case, a linear regression

model isn’t appropriate and won’t give valid hypothesis tests or confidence
intervals.
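A quick, informal check of this assumption is to look at the distribution of the residuals, for example with a histogram and a Shapiro-Wilk test from SciPy; the residuals below are synthetic stand-ins for the quantity y − ŷ from a fitted model:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 200)  # replace with y - y_hat from your fitted model

plt.hist(residuals, bins=20)       # should look roughly bell-shaped if normal
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.show()

stat, p_value = stats.shapiro(residuals)  # Shapiro-Wilk test of normality
print(p_value)  # a very small p-value suggests the residuals are not normal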

1.5 Lasso and Ridge Regression


Linear regression assumes that all predictor variables are equally important
and that their coefficients are all non-zero. However, in some cases, there
may be a large number of predictors, and some of them may be irrelevant or
redundant. This can lead to overfitting and poor generalization performance.
Regularization is a technique that addresses this problem by adding a penalty
term to the cost function that favors smaller coefficients. Two popular reg-
ularization methods for linear regression are Lasso and Ridge regression.

Figure 3: The plots for the Lasso and Ridge penalties. The blue contours represent the cost function (SSE) and the red diamond/circle represents the L1/L2 penalty region. The optimal coefficients (β0 and β1) are represented by the red dot.

1.5.1 Lasso Regression


Lasso (Least Absolute Shrinkage and Selection Operator) regression adds an
L1 penalty term to the cost function. The L1 penalty is the sum of the
absolute values of the coefficients, multiplied by a tuning parameter λ:

J(β) = (1 / 2n) Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} xij βj)² + λ Σ_{j=1}^{p} |βj|    (18)

The L1 penalty has the effect of shrinking some of the coefficients to zero,
effectively performing variable selection. This is because the L1 norm has
sharp corners at the axes (Figure 3), which can cause the optimizer to set
some coefficients exactly to zero. This property is shown graphically by the
diamond shape touching the contour at an axis, implying a beta value of zero.
Thus, Lasso can be used for feature selection and reducing the dimensionality
of the problem.
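As an illustration of this shrinkage-to-zero behaviour, here is a minimal sketch using scikit-learn's Lasso on synthetic data where only a few features are truly informative (alpha plays the role of λ in Equation 18, and its value here is an arbitrary choice):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0.5])
y = X @ true_coef + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # several coefficients are driven exactly to zero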

1.5.2 Advantages
• Performs feature selection by setting some coefficients to exactly zero.
• Can reduce the dimensionality of the problem, which can improve the
computational efficiency and generalization performance.

1.5.3 Disadvantages
• Can be unstable and sensitive to the choice of the tuning parameter λ.
• If there are highly correlated predictors, Lasso tends to select one and
ignore the others, which can lead to bias and poor generalization per-
formance.

1.5.4 Ridge Regression


Ridge regression adds an L2 penalty term to the cost function. The L2
penalty is the sum of the squares of the coefficients, multiplied by a tuning
parameter λ:

J(β) = (1 / 2n) Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} xij βj)² + λ Σ_{j=1}^{p} βj²    (19)

The L2 penalty has the effect of shrinking all the coefficients towards zero,
but not exactly to zero. This property is shown graphically by the circle
shape touching the contour in an area where beta values are not exactly
zero (Figure 3). This can reduce the impact of the predictors with small
coefficients and prevent overfitting, but it does not perform variable selection
like Lasso.
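For comparison, the sketch below fits scikit-learn's Ridge on the same kind of synthetic data (alpha again plays the role of λ, here from Equation 19):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0.5]) + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)  # coefficients are shrunk towards zero, but none is exactly zero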

1.5.5 Advantages
• Can improve the computational stability and generalization perfor-
mance by reducing the impact of predictors with small coefficients.
• Does not suffer from the instability and sensitivity to the choice of the
tuning parameter that Lasso does.

1.5.6 Disadvantages
• Does not perform feature selection, so it does not reduce the dimen-
sionality of the problem.
• Can still suffer from bias and poor generalization performance if there
are irrelevant or redundant predictors.
Overall, Lasso and Ridge regression are useful regularization techniques for
linear regression. However, they do not address the problem of irrelevant or
redundant predictors, so they may not be sufficient for some problems.

2 Logistic Regression
Logistic Regression is a statistical model that in its basic form uses a logistic
function to model a binary dependent variable. It’s an extension of the linear
regression model for classification problems.
In logistic regression, we are interested in predicting a binary outcome. The
function used to make predictions is the logistic function or the sigmoid
function. The hypothesis function for logistic regression is defined as:

hθ(x) = 1 / (1 + e^(−θ^T x))    (20)
Here are the notations in the equation:
• hθ (x): This is the predicted output, representing the probability that
the input example x belongs to the positive class.

• θ: This is the parameter vector of the model. It includes the weights and bias term that the model learns during training. The superscript T denotes the transpose of the vector, which is needed to match the dimensionality when multiplying it with the input vector x.
• x: This is the input vector, representing the features of the input example. It matches the dimensionality of the θ vector, so that they can be multiplied together.

Figure 4: The logistic function. In this code, sigmoid(x) is a function that takes in a real-valued number x and returns a number between 0 and 1. This is the logistic function. We then create an array x of 100 evenly spaced numbers between -10 and 10 using the linspace function from NumPy. We apply the sigmoid function to these numbers to get the corresponding y-values.
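A minimal sketch of the plotting code the caption describes, assuming NumPy and Matplotlib:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Logistic function: maps any real number into the interval (0, 1).
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 100)  # 100 evenly spaced numbers between -10 and 10
y = sigmoid(x)                 # corresponding y-values

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sigmoid(x)')
plt.show()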
For example, suppose we have a binary classification problem where we want to
predict whether an email is spam (1) or not spam (0), based on the frequency of
certain keywords in the email text. The features x could be the frequencies
of these keywords, and the model would learn weights θ for each keyword
that indicate how strongly that keyword predicts whether an email is spam
or not.
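A minimal sketch of how such a model could be fit with scikit-learn; the keyword-frequency matrix and labels below are made up for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows are emails; columns are frequencies of three hypothetical keywords.
X = np.array([
    [5, 3, 0],
    [4, 2, 1],
    [6, 4, 0],
    [0, 0, 3],
    [1, 0, 4],
    [0, 1, 5],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = not spam

model = LogisticRegression().fit(X, y)
print(model.coef_)                       # learned weights (theta) for each keyword
print(model.predict_proba([[3, 2, 1]]))  # [P(not spam), P(spam)] for a new email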
