Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables, particularly in predicting continuous outcomes. It involves fitting a line to data points to minimize the difference between actual and predicted values, utilizing loss functions such as the Sum of Squares for Error (SSE). The document also discusses the gradient descent optimization process for minimizing the loss function to find the best-fitting model parameters.

Uploaded by Sajad Ulhaq PK

Linear regression is a method that uses linear techniques to model the relationship between a dependent

variable and one or more independent variables. In the context of the model, independent variables are
the input values, and the dependent variable is the output value predicted by the model based on the
independent variables. Linear regression is suitable for applications where the data exhibits a linear
relationship between the variables.
Linear regression has widespread applications in data science, particularly in predicting continuous
outcomes. For example, in areas such as house price prediction and sales forecasting, linear regression
models can help predict the dependent variable (such as house prices or sales volume) based on
independent variables (such as house size, location, or the amount spent on advertising by agents).

[Figure: house price over time — axes "Price" vs. "Year", showing the buying price and the system estimate]
We take y as the dependent variable and x as the independent variable, resulting in a
linear equation with two variables.
y = θ₀ + θ₁x (1)
When the parameters θ₀ and θ₁ are given, the plot on the coordinate plane is a straight line
(this is the meaning of "linear").
When we use only one x to predict y, it is called simple linear regression, which means
finding a straight line to fit the data. For example, if I have a set of data plotted as a scatter
plot, with the horizontal axis representing the advertising expenditure of a real estate
agency and the vertical axis representing the sales volume, linear regression aims to find a
straight line that best fits the data points on the graph.

Here, the fitted equation we obtain is y = 0.0512x + 7.1884. With this equation, when we get a new
advertising expenditure value, we can use it to predict the approximate sales volume. The predicted
sales volume is usually denoted as ŷ.

[Figure: scatter plot of sales volume vs. advertising expenditure with the fitted line]
Since we are fitting a line to the scatter points, why is the final line y = 0.0512x + 7.1884 instead of y =
0.0624x + 5, as shown in the diagram? Both lines seem to fit the data well, right? After all, the data doesn’t fall
exactly on a single line but is scattered around it. So, we need to establish a criterion to evaluate which line is
the “best fit.” We refer to this criterion as the loss.
A loss function is a method to measure how well the model fits the data. It quantifies the difference between
the actual measured values and the predicted values. The higher the loss function value, the more inaccurate
the predictions; conversely, the lower the loss function value, the closer the predictions are to the actual values.
[Figure: scatter plot of sales volume vs. advertising expenditure, showing the residuals between the points and the fitted line]
Let’s start with residuals. Simply put, the residual is the difference between the actual value and the predicted
value (it can also be understood as the gap or distance). In formula terms, it’s represented as:
e = y − ŷ (2)
For a given advertising expenditure xᵢ, we have the actual sales volume yᵢ (the label) and the predicted sales
volume ŷᵢ (calculated by plugging xᵢ into Equation 1). The residual for this point is eᵢ = yᵢ − ŷᵢ (according to
Equation 2). We then square this residual (to eliminate negative signs), and do this for each data point in our
dataset. After calculating the squared residuals eᵢ² for all the points, we sum them up. This gives us a measure
of the total error between the fitted line and the actual labels.

Q = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² (3)

This formula is known as the Sum of Squares for Error (SSE),
and it is one of the most commonly used loss functions in
regression problems in machine learning.
Now we know that the loss function is a function that
measures the error of a regression model, which serves
as the evaluation criterion for the “line” we are seeking.
The smaller the value of this function, the better the
line fits our data.
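As a quick illustration, the SSE loss from Equation 3 can be computed directly. This is a minimal sketch; the advertising-expenditure and sales numbers below are hypothetical, not the data behind the document's scatter plot.

```python
def sse(y_true, y_pred):
    """Sum of Squares for Error (Equation 3): sum of squared residuals."""
    return sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))

def predict(x, theta0, theta1):
    """Simple linear model y = theta0 + theta1 * x (Equation 1)."""
    return [theta0 + theta1 * xi for xi in x]

# Hypothetical advertising expenditures and observed sales volumes
x = [10.0, 25.0, 40.0, 55.0]
y = [7.5, 8.6, 9.1, 10.1]

# Loss of the fitted line y = 0.0512x + 7.1884 on this (made-up) data
loss = sse(y, predict(x, 7.1884, 0.0512))
```

A line with a smaller `sse` value fits the points better; this is exactly the criterion used to compare the two candidate lines above.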
Given a set of sample observations {(xᵢ, yᵢ)}ᵢ₌₁ⁿ, the goal is to have the regression function fit this set of values
as closely as possible. The criterion provided by the ordinary least squares method is to minimize the sum of
squared residuals (Equation 3).
Equation 3 is quadratic in the parameters. We know that a univariate quadratic function looks something like the
left diagram below. However, for Equation 3, both θ₀ and θ₁ are unknown, making it a quadratic function with
two unknown parameters. When plotted, it forms a surface in three-dimensional space, similar to the diagram
on the right.
This type of function is called a convex function in mathematics (it has only one extremum, which means the
local minimum is also the global minimum). If you remember calculus, you'll know that Q reaches its minimum
when the derivatives equal zero. Therefore, we take the partial derivatives of Q with respect to θ₀ and θ₁ and set
them to zero:

∂Q/∂θ₀ = −2 Σᵢ (yᵢ − θ₀ − θ₁xᵢ) = 0
∂Q/∂θ₁ = −2 Σᵢ (yᵢ − θ₀ − θ₁xᵢ) xᵢ = 0

{(xᵢ, yᵢ)}ᵢ₌₁ⁿ are known values, and by substituting them into the two equations above, we can solve
for θ₀ and θ₁. This is the least squares method, where "squares" refers to the squared terms.

Reference: https://zhuanlan.zhihu.com/p/72513104
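The least squares solution described above can be sketched in a few lines of pure Python. The closed form used here (slope from centered cross-products, intercept from the means) is what falls out of solving the two zero-derivative equations; the data points are hypothetical.

```python
def least_squares_fit(x, y):
    """Return (theta0, theta1) minimizing the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # theta1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = sum((xi - x_mean) ** 2 for xi in x)
    theta1 = num / den
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Points lying exactly on y = 2x + 1 should recover those parameters
theta0, theta1 = least_squares_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```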
Linear regression is a method for modeling the relationship between a dependent variable and one or more
independent variables. The example given above is the one-dimensional case (with only one x). If there are two
features, it becomes bivariate linear regression, where we fit a plane. If there are multiple features
{(xᵢ₁, xᵢ₂, …, xᵢₙ, yᵢ)}ᵢ₌₁ᵐ (such as how a house's price is influenced by the number of rooms, number of floors,
and age of the house, among other variables), then it becomes multiple linear regression:

y = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ = w₀ + WᵀX

where W and X can be interpreted as column vectors:

W = (w₁, w₂, …, wₙ)ᵀ,  X = (x₁, x₂, …, xₙ)ᵀ
The loss function of multiple linear regression is

Q = Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)² = Σᵢ₌₁ᵐ (yᵢ − (w₀ + w₁xᵢ₁ + w₂xᵢ₂ + … + wₙxᵢₙ))²

Similar to simple linear regression, we take the partial derivatives with respect to w₀, w₁, …, wₙ and set them
to zero.
Stacking all m samples, the quantities involved can be written in matrix form:

X = [ 1  x₁₁  x₁₂  …  x₁ₙ
      1  x₂₁  x₂₂  …  x₂ₙ
      ⋮   ⋮    ⋮        ⋮
      1  xₘ₁  xₘ₂  …  xₘₙ ],   w = (w₀, w₁, …, wₙ)ᵀ,   y = (y₁, y₂, …, yₘ)ᵀ

We use X to represent all the samples and w (a column vector) to represent the weights of the sample
features. The loss function can then be rewritten as:

Q = (Xw − y)ᵀ(Xw − y)

1) Let Xw − y = N; then Q = NᵀN.
2) The derivative of NᵀN with respect to w is 2Xᵀ N, i.e. ∂Q/∂w = 2Xᵀ(Xw − y).
3) Setting the derivative to zero and solving:

2Xᵀ(Xw − y) = 0
XᵀXw = Xᵀy        (drop the 2 and move Xᵀy to the right of the equal sign; XᵀX is a
                   square matrix, which makes it easy to form its inverse)
w = (XᵀX)⁻¹Xᵀy    (left-multiply both sides by the inverse (XᵀX)⁻¹, which turns XᵀX
                   into the identity matrix)

This gives the final w vector.

Note that the first element of the w vector, w₀, is also called the bias.

1. Any matrix multiplied by the identity matrix equals the matrix itself.
2. Matrix multiplication is not commutative: MN is not necessarily equal to NM.
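The derivation above ends in the normal equation w = (XᵀX)⁻¹Xᵀy. A minimal NumPy sketch follows; the feature matrix and targets are hypothetical, and `np.linalg.solve` is used on XᵀXw = Xᵀy rather than forming the inverse explicitly, which is the numerically preferred route.

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X w = X^T y for the weight vector w (bias w0 first)."""
    # Prepend a column of ones so the first element of w is the bias w0
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Two features; targets generated from y = 1 + 2*x1 + 3*x2 (made-up data)
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
w = normal_equation(X, y)   # recovers approximately [1, 2, 3]
```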
The basic idea of the gradient descent method can be analogized to a downhill process.

Assume a scenario like this:


A person is trapped on a mountain and needs to get down (i.e. find the lowest point of the mountain, the
valley). However, the mountain is very foggy, so visibility is low: the path down cannot be seen, and he has
to use the information around him to find it. Here he can use a gradient descent algorithm. Specifically,
taking his current location as a benchmark, he looks for the steepest direction at that location and then walks
in the direction in which the height of the mountain decreases. (Similarly, if our goal were to go up the
mountain, i.e. to climb to the top, we would walk in the steepest upward direction.) Repeating this procedure
step by step, he will eventually reach the valley.
The basic process of gradient descent is similar to this downhill scenario.
First, we have a differentiable function, which represents the mountain. Our goal is to find the minimum of
this function, which is the bottom of the mountain.
Following the scenario above, the fastest way down the mountain is to find the steepest direction at the
current location and go down in that direction. For the function, this means finding the gradient at a given
point and then moving in the direction opposite to the gradient, which produces the fastest decrease in the
function value, because the direction of the gradient is the direction in which the function value increases
fastest. We reuse this method to repeatedly find the gradient, and eventually we reach a local minimum,
just like descending the mountain. Solving for the gradient is how we determine the steepest direction:
it is the means of measuring direction in the scenario.
The concept of the gradient
The gradient is a very important concept in calculus.
- For a function of one variable, the gradient is simply the derivative of the function, representing the slope of
the tangent line to the function at a given point.
- For a multivariate function, the gradient is a vector; vectors have directions, and the direction of the gradient
points in the direction in which the function rises fastest at a given point.
In calculus, taking the partial derivative with respect to each parameter of a multivariate function, and writing
the partial derivatives as a vector, gives the gradient.
This explains why we need the gradient! To get to the bottom of the hill, we need to observe at each step the
steepest direction at that point, and the gradient tells us exactly that. The direction of the gradient is the
direction in which the function rises fastest at a given point, so the opposite direction is the direction in which
the function falls fastest, which is exactly what we need. By repeatedly moving in the direction opposite to the
gradient, we reach a local minimum.
The formula of gradient descent

θₜ₊₁ = θₜ − α ∇J(θₜ)

- α: the learning rate (step size). It can't be too big or too small.
- The gradient points in the direction of fastest ascent; we need the direction of fastest descent, so we put a minus sign in front of it.

α is called the learning rate or step size in gradient descent. It controls the distance traveled at each step,
ensuring that we don't take too big a step, i.e. that we don't move so fast that we overshoot the lowest point.
It is also important not to walk too slowly, or the sun will go down before we reach the bottom of the hill.
So the choice of α is important in gradient descent: too small a value may delay reaching the lowest point,
and too big a value may cause us to miss it!
The negative sign in front of the gradient means going in the opposite direction of the gradient. As mentioned
earlier, the direction of the gradient is the direction in which the function rises fastest at a point, and we need to
go in the direction of fastest descent, which is naturally the direction of the negative gradient; hence the
negative sign.
Gradient descent optimization process

1. Start from a given initial position.
2. Calculate the gradient at the current point.
3. Move a step in the negative gradient direction.
4. Repeat steps 2–3 until convergence:
   - the difference between two successive iterations is less than a specified threshold, or
   - the specified number of iterations is reached.
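The steps above can be sketched as a short loop. This is a minimal univariate version, assuming we can evaluate the derivative at any point; both stopping criteria (step size below a threshold, iteration cap) are included.

```python
def gradient_descent(grad, theta_start, alpha, tol=1e-8, max_iters=1000):
    """Repeat theta <- theta - alpha * grad(theta) until convergence."""
    theta = theta_start
    for _ in range(max_iters):          # stop at the iteration cap
        step = alpha * grad(theta)      # move against the gradient
        theta -= step
        if abs(step) < tol:             # difference below the threshold
            break
    return theta

# Minimize J(theta) = theta^2, whose gradient is 2*theta
theta_min = gradient_descent(lambda t: 2 * t, theta_start=1.0, alpha=0.4)
```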

For example (univariate):

Function J(θ) = θ². For what value of θ is J(θ) minimized?

Initialize:
1. The starting point is θ = 1.
2. The learning rate α is 0.4.
We begin the iterative computation of gradient descent:
Step 1
Step 2
Step 3
Step 4
Step 5
…
Step N: θ is already extremely close to the optimal value of 0, and J(θ) is close to the minimum value.
For example (multivariate):

Function J(θ₁, θ₂) = θ₁² + θ₂². For what values of θ₁ and θ₂ is J(θ₁, θ₂) minimized?

Initialize:
1. The starting point is (1, 3).
2. The learning rate α is 0.1.
We begin the iterative computation of gradient descent:
Step 1
Step 2
…
Step N: θ₁ and θ₂ are already extremely close to the optimal value of 0, and J(θ₁, θ₂) is close to the minimum value.
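The multivariate example works the same way, component by component: the gradient of J is (2θ₁, 2θ₂), so each coordinate shrinks by a factor of (1 − 0.1·2) = 0.8 per step.

```python
# J(theta1, theta2) = theta1^2 + theta2^2, start (1, 3), learning rate 0.1
t1, t2, alpha = 1.0, 3.0, 0.1
for _ in range(50):
    # componentwise update: theta_j <- theta_j - alpha * 2 * theta_j
    t1, t2 = t1 - alpha * 2 * t1, t2 - alpha * 2 * t2

# After 50 steps both components are extremely close to the optimum (0, 0)
```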
Derivation of the gradient descent formula
Steps:
1. Determine the hypothesis function and loss function of the model to optimize.
2. Initialize the algorithm's parameters, e.g., initial values of the weights and bias, and the learning rate.
3. Use the gradient descent formula to iteratively solve for the model parameters (weights, bias).
Loss function (Mean Square Error):

J(w) = (1/2m) Σᵢ₌₁ᵐ (h(xᵢ) − yᵢ)²

Parameter update formula:

wⱼ := wⱼ − α (1/m) Σᵢ₌₁ᵐ (h(xᵢ) − yᵢ) xᵢⱼ

Note: each update of w uses the gradient values of all samples.
Full Gradient Descent (FGD)
- Uses the gradient values of all samples at each iteration.
Stochastic Gradient Descent (SGD)
- Randomly selects and uses one sample's gradient value per iteration.
Minibatch Gradient Descent (MBGD)
- At each iteration, a small batch of samples is randomly selected and their gradient values are used.
Stochastic Average Gradient (SAG)
- At each iteration, randomly select one sample, then update using the mean of that sample's gradient value
and the stored gradient values of previously seen samples.
- Assumption: the training set has samples A B C D E F G H, 8 samples in total.
1. Randomly select a sample, say sample D; compute its gradient value and store it in the list: [D]. Then use
the mean of the gradient values in the list to update the model parameters.
2. Randomly select another sample, say sample G; compute its gradient value and store it in the list: [D, G].
Then use the mean of the gradient values in the list to update the model parameters.
3. Choose another sample at random, say sample D again; recompute its gradient value and update sample D's
stored gradient value in the list. Then use the mean of the gradient values in the list to update the model
parameters.
4. Continue until the algorithm converges.
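The SAG bookkeeping described above (one stored gradient slot per sample, refreshed when that sample is redrawn) can be sketched as follows. This is a toy one-parameter version on a hypothetical problem, not a production implementation: `grad_i(i, w)` is the per-sample gradient, and here each sample i contributes the loss (w − xᵢ)² so the minimizer of the averaged loss is the mean of the data.

```python
import random

def sag(grad_i, n_samples, w_start, alpha, iters, seed=0):
    """Stochastic Average Gradient for a single scalar parameter w."""
    rng = random.Random(seed)
    memory = {}                              # last stored gradient per sample
    w = w_start
    for _ in range(iters):
        i = rng.randrange(n_samples)         # pick a sample at random
        memory[i] = grad_i(i, w)             # (re)compute and store its gradient
        avg = sum(memory.values()) / len(memory)
        w = w - alpha * avg                  # update with the mean of stored gradients
    return w

# Hypothetical dataset of eight samples (A..H); per-sample gradient of
# (w - x_i)^2 is 2*(w - x_i), so w should approach the mean of the data, 4.5
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = sag(lambda i, w: 2 * (w - data[i]), len(data), w_start=0.0, alpha=0.1, iters=500)
```

Compared with plain SGD, each update averages over every gradient seen so far, which smooths the update direction at the cost of storing one gradient per sample.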
