Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables, particularly in predicting continuous outcomes. It involves fitting a line to data points to minimize the difference between actual and predicted values, utilizing loss functions such as the Sum of Squares for Error (SSE). The document also discusses the gradient descent optimization process for minimizing the loss function to find the best-fitting model parameters.

Uploaded by Sajad Ulhaq PK

Linear regression is a method that uses linear techniques to model the relationship between a dependent

variable and one or more independent variables. In the context of the model, independent variables are
the input values, and the dependent variable is the output value predicted by the model based on the
independent variables. Linear regression is suitable for applications where the data exhibits a linear
relationship between the variables.
Linear regression has widespread applications in data science, particularly in predicting continuous
outcomes. For example, in areas such as house price prediction and sales forecasting, linear regression
models can help predict the dependent variable (such as house prices or sales volume) based on
independent variables (such as house size, location, or the amount spent on advertising by agents).

[Figure: house price over time — axes "Price" vs. "Year", showing the buying price and the system estimate]
We take y as the dependent variable and x as the independent variable, resulting in a
linear equation with two variables.
y = θ₀ + θ₁x (1)
When the parameters θ₀ and θ₁ are given, the plot on the coordinate plane is a straight line
(this is the meaning of "linear").
When we use only one x to predict y, it is called simple linear regression, which means
finding a straight line to fit the data. For example, if I have a set of data plotted as a scatter
plot, with the horizontal axis representing the advertising expenditure of a real estate
agency and the vertical axis representing the sales volume, linear regression aims to find a
straight line that best fits the data points on the graph.

Here, the fitted equation we obtain is y = 0.0512x + 7.1884. With this equation, when we get a new
advertising expenditure value, we can use it to predict the approximate sales volume. The predicted
sales volume is usually denoted as ŷ.

[Figure: scatter plot of sales volume vs. advertising expenditure with the fitted line]
Since we are fitting a line to the scatter points, why is the final line y = 0.0512x + 7.1884 instead of y =
0.0624x + 5, as shown in the diagram? Both lines seem to fit the data well, right? After all, the data doesn’t fall
exactly on a single line but is scattered around it. So, we need to establish a criterion to evaluate which line is
the “best fit.” We refer to this criterion as the loss.
A loss function is a method to measure how well the model fits the data. It quantifies the difference between
the actual measured values and the predicted values. The higher the loss function value, the more inaccurate
the predictions; conversely, the lower the loss function value, the closer the predictions are to the actual values.
[Figure: scatter plot of sales volume vs. advertising expenditure, showing the residuals between the points and the fitted line]
Let’s start with residuals. Simply put, the residual is the difference between the actual value and the predicted
value (it can also be understood as the gap or distance). In formula terms, it’s represented as:
e = y − ŷ (2)
For a given advertising expenditure xᵢ, we have the actual sales volume yᵢ (the label) and the predicted sales
volume ŷᵢ (calculated by plugging xᵢ into Equation 1). The residual for this point is eᵢ = yᵢ − ŷᵢ (according to
Equation 2). We then square this residual (to eliminate negative signs), and do this for each data point in our
dataset. After calculating the squared residuals eᵢ² for all the points, we sum them up. This gives us a measure
of the total error between the fitted line and the actual labels.

Q = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² (3)

This formula is known as the Sum of Squares for Error (SSE),
and it is one of the most commonly used loss functions in
regression problems in machine learning.
Now we know that the loss function is a function that
measures the error of a regression model, which serves
as the evaluation criterion for the “line” we are seeking.
The smaller the value of this function, the better the
line fits our data.
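As a quick illustration, the SSE loss from Equation 3 can be computed directly. This is a minimal sketch; the advertising-expenditure and sales numbers below are hypothetical, not the data behind the document's scatter plot.

```python
def sse(y_true, y_pred):
    """Sum of Squares for Error (Equation 3): sum of squared residuals."""
    return sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))

def predict(x, theta0, theta1):
    """Simple linear model y = theta0 + theta1 * x (Equation 1)."""
    return [theta0 + theta1 * xi for xi in x]

# Hypothetical advertising expenditures and observed sales volumes
x = [10.0, 25.0, 40.0, 55.0]
y = [7.5, 8.6, 9.1, 10.1]

# Loss of the fitted line y = 0.0512x + 7.1884 on this (made-up) data
loss = sse(y, predict(x, 7.1884, 0.0512))
```

A line with a smaller `sse` value fits the points better; this is exactly the criterion used to compare the two candidate lines above.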
Given a set of sample observations {(xᵢ, yᵢ)}ᵢ₌₁ⁿ, the goal is to have the regression function fit this set of values
as closely as possible. The criterion provided by the ordinary least squares method is to minimize the sum of
squared residuals (Equation 3).
Equation 3 is quadratic in the parameters. We know that a univariate quadratic function looks something like the
left diagram below. However, for Equation 3, both θ₀ and θ₁ are unknown, making it a quadratic function with
two unknown parameters. When plotted, it forms a surface in three-dimensional space, similar to the diagram
on the right.
This type of function is called a convex function in mathematics (it has only one extremum, which means the
local minimum is also the global minimum). If you remember calculus, you'll know that Q reaches its minimum
when the derivatives equal zero. Therefore, we take the partial derivatives of Q with respect to θ₀ and θ₁ and set
them to zero:

∂Q/∂θ₀ = −2 Σᵢ (yᵢ − θ₀ − θ₁xᵢ) = 0
∂Q/∂θ₁ = −2 Σᵢ (yᵢ − θ₀ − θ₁xᵢ) xᵢ = 0

{(xᵢ, yᵢ)}ᵢ₌₁ⁿ are known values, and by substituting them into the two equations above, we can solve
for θ₀ and θ₁. This is the least squares method, where "squares" refers to the squared terms.

Reference: https://zhuanlan.zhihu.com/p/72513104
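The least squares solution described above can be sketched in a few lines of pure Python. The closed form used here (slope from centered cross-products, intercept from the means) is what falls out of solving the two zero-derivative equations; the data points are hypothetical.

```python
def least_squares_fit(x, y):
    """Return (theta0, theta1) minimizing the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # theta1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = sum((xi - x_mean) ** 2 for xi in x)
    theta1 = num / den
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Points lying exactly on y = 2x + 1 should recover those parameters
theta0, theta1 = least_squares_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```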
Linear regression is a method for modeling the relationship between a dependent variable and one or more
independent variables. The example given above is the one-dimensional case (with only one x). If there are two
features, it becomes bivariate linear regression, where we fit a plane. If there are multiple features
{(xᵢ₁, xᵢ₂, …, xᵢₙ, yᵢ)}ᵢ₌₁ᵐ (such as how a house's price is influenced by the number of rooms, number of floors,
and age of the house, among other variables), then it becomes multiple linear regression:

y = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ = w₀ + WᵀX

where W and X can be interpreted as column vectors:

W = (w₁, w₂, …, wₙ)ᵀ,  X = (x₁, x₂, …, xₙ)ᵀ
The loss function of multiple linear regression is

Q = Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)² = Σᵢ₌₁ᵐ (yᵢ − (w₀ + w₁xᵢ₁ + w₂xᵢ₂ + … + wₙxᵢₙ))²

Similar to simple linear regression, we take the partial derivatives with respect to w₀, w₁, …, wₙ and set them
to zero.
Stacking all m samples, the quantities involved can be written in matrix form:

X = [ 1  x₁₁  x₁₂  …  x₁ₙ
      1  x₂₁  x₂₂  …  x₂ₙ
      ⋮   ⋮    ⋮        ⋮
      1  xₘ₁  xₘ₂  …  xₘₙ ],   w = (w₀, w₁, …, wₙ)ᵀ,   y = (y₁, y₂, …, yₘ)ᵀ

We use X to represent all the samples and w (a column vector) to represent the weights of the sample
features. The loss function can then be rewritten as:

Q = (Xw − y)ᵀ(Xw − y)

1) Let Xw − y = N; then Q = NᵀN.
2) The derivative of NᵀN with respect to w is 2Xᵀ N, i.e. ∂Q/∂w = 2Xᵀ(Xw − y).
3) Setting the derivative to zero and solving:

2Xᵀ(Xw − y) = 0
XᵀXw = Xᵀy        (drop the 2 and move Xᵀy to the right of the equal sign; XᵀX is a
                   square matrix, which makes it easy to form its inverse)
w = (XᵀX)⁻¹Xᵀy    (left-multiply both sides by the inverse (XᵀX)⁻¹, which turns XᵀX
                   into the identity matrix)

This gives the final w vector.

Note that the first element of the w vector, w₀, is also called the bias.

1. Any matrix multiplied by the identity matrix equals the matrix itself.
2. Matrix multiplication is not commutative: MN is not necessarily equal to NM.
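The derivation above ends in the normal equation w = (XᵀX)⁻¹Xᵀy. A minimal NumPy sketch follows; the feature matrix and targets are hypothetical, and `np.linalg.solve` is used on XᵀXw = Xᵀy rather than forming the inverse explicitly, which is the numerically preferred route.

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X w = X^T y for the weight vector w (bias w0 first)."""
    # Prepend a column of ones so the first element of w is the bias w0
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Two features; targets generated from y = 1 + 2*x1 + 3*x2 (made-up data)
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
w = normal_equation(X, y)   # recovers approximately [1, 2, 3]
```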
The basic idea of the gradient descent method can be analogized to a downhill process.

Assume a scenario like this:


A person is trapped on a mountain and needs to get down (i.e. find the lowest point of the mountain, the
valley). However, the mountain is very foggy, so visibility is low: the path down cannot be seen, and he has
to use the information around him to find it. Here he can use a gradient descent algorithm. Specifically,
taking his current location as a benchmark, he looks for the steepest direction at that location and then walks
in the direction in which the height of the mountain decreases. (Similarly, if our goal were to go up the
mountain, i.e. to climb to the top, we would walk in the steepest upward direction.) Repeating this procedure
step by step, he will eventually reach the valley.
The basic process of gradient descent is similar to this downhill scenario.
First, we have a differentiable function, which represents the mountain. Our goal is to find the minimum of
this function, which is the bottom of the mountain.
Following the scenario above, the fastest way down the mountain is to find the steepest direction at the
current location and go down in that direction. For the function, this means finding the gradient at a given
point and then moving in the direction opposite to the gradient, which produces the fastest decrease in the
function value, because the direction of the gradient is the direction in which the function value increases
fastest. We reuse this method to repeatedly find the gradient, and eventually we reach a local minimum,
just like descending the mountain. Solving for the gradient is how we determine the steepest direction:
it is the means of measuring direction in the scenario.
The concept of the gradient
The gradient is a very important concept in calculus.
- For a function of one variable, the gradient is simply the derivative of the function, representing the slope of
the tangent line to the function at a given point.
- For a multivariate function, the gradient is a vector; vectors have directions, and the direction of the gradient
points in the direction in which the function rises fastest at a given point.
In calculus, taking the partial derivative with respect to each parameter of a multivariate function, and writing
the partial derivatives as a vector, gives the gradient.
This explains why we need the gradient! To get to the bottom of the hill, we need to observe at each step the
steepest direction at that point, and the gradient tells us exactly that. The direction of the gradient is the
direction in which the function rises fastest at a given point, so the opposite direction is the direction in which
the function falls fastest, which is exactly what we need. By repeatedly moving in the direction opposite to the
gradient, we reach a local minimum.
The formula of gradient descent

θₜ₊₁ = θₜ − α ∇J(θₜ)

- α: the learning rate (step size). It can't be too big or too small.
- The gradient points in the direction of fastest ascent; we need the direction of fastest descent, so we put a minus sign in front of it.

α is called the learning rate or step size in gradient descent. It controls the distance traveled at each step,
ensuring that we don't take too big a step, i.e. that we don't move so fast that we overshoot the lowest point.
It is also important not to walk too slowly, or the sun will go down before we reach the bottom of the hill.
So the choice of α is important in gradient descent: too small a value may delay reaching the lowest point,
and too big a value may cause us to miss it!
The negative sign in front of the gradient means going in the opposite direction of the gradient. As mentioned
earlier, the direction of the gradient is the direction in which the function rises fastest at a point, and we need to
go in the direction of fastest descent, which is naturally the direction of the negative gradient; hence the
negative sign.
Gradient descent optimization process

1. Start from a given initial position.
2. Calculate the gradient at the current point.
3. Move a step in the negative gradient direction.
4. Repeat steps 2–3 until convergence:
   - the difference between two successive iterations is less than a specified threshold, or
   - the specified number of iterations is reached.
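The steps above can be sketched as a short loop. This is a minimal univariate version, assuming we can evaluate the derivative at any point; both stopping criteria (step size below a threshold, iteration cap) are included.

```python
def gradient_descent(grad, theta_start, alpha, tol=1e-8, max_iters=1000):
    """Repeat theta <- theta - alpha * grad(theta) until convergence."""
    theta = theta_start
    for _ in range(max_iters):          # stop at the iteration cap
        step = alpha * grad(theta)      # move against the gradient
        theta -= step
        if abs(step) < tol:             # difference below the threshold
            break
    return theta

# Minimize J(theta) = theta^2, whose gradient is 2*theta
theta_min = gradient_descent(lambda t: 2 * t, theta_start=1.0, alpha=0.4)
```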

For example (univariate):

Function J(θ) = θ². For what value of θ is J(θ) minimized?

Initialize:
1. The starting point is θ = 1.
2. The learning rate α is 0.4.
We begin the iterative computation of gradient descent:
Step 1
Step 2
Step 3
Step 4
Step 5
…
Step N: θ is already extremely close to the optimal value of 0, and J(θ) is close to the minimum value.
For example (multivariate):

Function J(θ₁, θ₂) = θ₁² + θ₂². For what values of θ₁ and θ₂ is J(θ₁, θ₂) minimized?

Initialize:
1. The starting point is (1, 3).
2. The learning rate α is 0.1.
We begin the iterative computation of gradient descent:
Step 1
Step 2
…
Step N: θ₁ and θ₂ are already extremely close to the optimal value of 0, and J(θ₁, θ₂) is close to the minimum value.
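The multivariate example works the same way, component by component: the gradient of J is (2θ₁, 2θ₂), so each coordinate shrinks by a factor of (1 − 0.1·2) = 0.8 per step.

```python
# J(theta1, theta2) = theta1^2 + theta2^2, start (1, 3), learning rate 0.1
t1, t2, alpha = 1.0, 3.0, 0.1
for _ in range(50):
    # componentwise update: theta_j <- theta_j - alpha * 2 * theta_j
    t1, t2 = t1 - alpha * 2 * t1, t2 - alpha * 2 * t2

# After 50 steps both components are extremely close to the optimum (0, 0)
```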
Derivation of the gradient descent formula
Steps:
1. Determine the hypothesis function and loss function of the model to optimize.
2. Initialize the algorithm's parameters, e.g., initial values of the weights and bias, and the learning rate.
3. Use the gradient descent formula to iteratively solve for the model parameters (weights, bias).
Loss function (Mean Square Error):

J(w) = (1/2m) Σᵢ₌₁ᵐ (h(xᵢ) − yᵢ)²

Parameter update formula:

wⱼ := wⱼ − α (1/m) Σᵢ₌₁ᵐ (h(xᵢ) − yᵢ) xᵢⱼ

Note: each update of w uses the gradient values of all samples.
Full Gradient Descent (FGD)
- Uses the gradient values of all samples at each iteration.
Stochastic Gradient Descent (SGD)
- Randomly selects and uses one sample's gradient value per iteration.
Minibatch Gradient Descent (MBGD)
- At each iteration, a small batch of samples is randomly selected and their gradient values are used.
Stochastic Average Gradient (SAG)
- At each iteration, randomly select one sample, then update using the mean of that sample's gradient value
and the stored gradient values of previously seen samples.
- Assumption: the training set has samples A B C D E F G H, 8 samples in total.
1. Randomly select a sample, say sample D; compute its gradient value and store it in the list: [D]. Then use
the mean of the gradient values in the list to update the model parameters.
2. Randomly select another sample, say sample G; compute its gradient value and store it in the list: [D, G].
Then use the mean of the gradient values in the list to update the model parameters.
3. Choose another sample at random, say sample D again; recompute its gradient value and update sample D's
stored gradient value in the list. Then use the mean of the gradient values in the list to update the model
parameters.
4. Continue until the algorithm converges.
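The SAG bookkeeping described above (one stored gradient slot per sample, refreshed when that sample is redrawn) can be sketched as follows. This is a toy one-parameter version on a hypothetical problem, not a production implementation: `grad_i(i, w)` is the per-sample gradient, and here each sample i contributes the loss (w − xᵢ)² so the minimizer of the averaged loss is the mean of the data.

```python
import random

def sag(grad_i, n_samples, w_start, alpha, iters, seed=0):
    """Stochastic Average Gradient for a single scalar parameter w."""
    rng = random.Random(seed)
    memory = {}                              # last stored gradient per sample
    w = w_start
    for _ in range(iters):
        i = rng.randrange(n_samples)         # pick a sample at random
        memory[i] = grad_i(i, w)             # (re)compute and store its gradient
        avg = sum(memory.values()) / len(memory)
        w = w - alpha * avg                  # update with the mean of stored gradients
    return w

# Hypothetical dataset of eight samples (A..H); per-sample gradient of
# (w - x_i)^2 is 2*(w - x_i), so w should approach the mean of the data, 4.5
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = sag(lambda i, w: 2 * (w - data[i]), len(data), w_start=0.0, alpha=0.1, iters=500)
```

Compared with plain SGD, each update averages over every gradient seen so far, which smooths the update direction at the cost of storing one gradient per sample.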
