Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables. In the context of the model, the independent variables are
the input values, and the dependent variable is the output value predicted by the model based on the
independent variables. Linear regression is suitable for applications where the data exhibits a linear
relationship between the variables.
Linear regression has widespread applications in data science, particularly in predicting continuous
outcomes. For example, in areas such as house price prediction and sales forecasting, linear regression
models can help predict the dependent variable (such as house prices or sales volume) based on
independent variables (such as house size, location, or the amount spent on advertising by agents).
[Figure: a scatter plot of price versus year, showing past buying prices and the system's estimated price.]
We take y as the dependent variable and x as the independent variable, resulting in a
linear equation with two variables.
y = w_0 + w_1 x    (1)
When the parameters w_0 and w_1 are given, the plot on the coordinate plane is a straight line
(this is the meaning of “linear”).
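Equation (1) maps directly to code; a minimal sketch (the function and variable names are my own):

```python
def predict(x, w0, w1):
    """Predicted y for input x under the line y = w0 + w1 * x."""
    return w0 + w1 * x

# Example: a line with intercept 2 and slope 0.5
print(predict(10, 2.0, 0.5))  # 7.0
```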
When we use only one x to predict y, it is called simple linear regression, which means
finding a straight line to fit the data. For example, if I have a set of data plotted as a scatter
plot, with the horizontal axis representing the advertising expenditure of a real estate
agency and the vertical axis representing the sales volume, linear regression aims to find a
straight line that best fits the data points on the graph.
[Figure: a scatter plot of sales volume versus advertising expenditure, with a fitted straight line.]
Let’s start with residuals. Simply put, the residual is the difference between the actual value and the predicted
value (it can also be understood as the gap or distance). In formula terms, it’s represented as:
e = y − ŷ    (2)
For a given advertising expenditure x_i, we have the actual sales volume y_i (the label) and the predicted sales
volume ŷ_i (calculated by plugging x_i into Equation 1). The residual for this point is e_i = y_i − ŷ_i (according to Equation
2). We then square this residual (to eliminate negative signs), and do this for each data point in our dataset.
After calculating the squared residuals e_i² for all the points, we sum them up. This gives us a measure of the
total error between the fitted line and the actual labels.
Q = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − (w_0 + w_1 x_i))²    (3)
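As a quick sketch of this sum of squared residuals (the data points and candidate line are made up for illustration):

```python
# Candidate line: y = 1.0 + 2.0 * x (coefficients chosen arbitrarily)
w0, w1 = 1.0, 2.0

xs = [1, 2, 3]          # advertising expenditure
ys = [3.5, 4.5, 7.5]    # actual sales volume (labels)

# Residual e_i = y_i - yhat_i for each point, then square and sum
sse = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
print(sse)  # 0.25 + 0.25 + 0.25 = 0.75
```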
The data points {(x_i, y_i)}_{i=1}^{n} are known values. Setting the partial derivatives of Q with respect to w_0
and w_1 to zero gives two equations; substituting the known values into them, we can solve for the estimates
ŵ_0 and ŵ_1. This is the least squares method, where “squares” refers to the squared terms.
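For simple linear regression the least squares solution has a well-known closed form, ŵ_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and ŵ_0 = ȳ − ŵ_1 x̄; a sketch using only the standard library (function name is my own):

```python
def least_squares_fit(xs, ys):
    """Solve for (w0, w1) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Data that lies exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
w0, w1 = least_squares_fit(xs, ys)
print(w0, w1)  # 1.0 2.0
```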
Reference: https://zhuanlan.zhihu.com/p/72513104
Linear regression is a method for modeling the relationship between a dependent variable and one or more
independent variables. The example given above is the one-dimensional case (with only one x). If there are two
features, it becomes bivariate linear regression, where we fit a plane in three-dimensional space. If there are
multiple features {x_i^(1), x_i^(2), …, x_i^(d)}_{i=1}^{n}
(such as how a house's price is influenced by the number of rooms, the number of floors, and the age of the house,
among other variables), then it becomes multiple linear regression:
y = w_0 + w_1 x_1 + w_2 x_2 + … + w_d x_d = w_0 + W^T X, where W and X can be interpreted
as column vectors:
W = (w_1, w_2, …, w_d)^T,    X = (x_1, x_2, …, x_d)^T
Similar to simple linear regression, we take the partial derivatives with respect to w_0, w_1, w_2, …, w_d and set them
to zero.
In matrix form, with n samples of d features each (plus a leading 1 for the bias):

X = [ 1  x_1^(1)  x_1^(2)  …  x_1^(d)
      1  x_2^(1)  x_2^(2)  …  x_2^(d)
      ⋮     ⋮        ⋮             ⋮
      1  x_n^(1)  x_n^(2)  …  x_n^(d) ],    w = (w_0, w_1, …, w_d)^T,    y = (y_1, y_2, …, y_n)^T
We use X to represent all the samples, and w (a column vector) to represent the weights of the sample
features. The loss function can then be rewritten as L(w) = (y − Xw)^T (y − Xw).
Note that the first element of the w vector, w_0, is also called the bias.
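Setting the gradient of L(w) to zero yields the normal equation X^T X w = X^T y, which can be solved directly; a sketch with NumPy (the data here is made up so that the true weights are known):

```python
import numpy as np

# Design matrix: first column of ones supplies the bias w0
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])
# Labels generated from y = 0.5 + 2*x1 + 3*x2
y = X @ np.array([0.5, 2.0, 3.0])

# Normal equation: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # recovers [0.5, 2.0, 3.0] up to floating-point precision
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is preferred over forming X^T X explicitly, since it is numerically more stable.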
1. Any matrix multiplied by the identity matrix equals the matrix itself.
2. Matrix multiplication is not commutative: M × N is not necessarily equal to N × M.
The basic idea of the gradient descent method can be analogized to a downhill process. Imagine a person
stranded on a mountain in thick fog: he cannot see the path down, so he has to use the information around him
to find it. This is where a gradient-descent-like strategy helps. Taking his current location as a benchmark, he
looks for the steepest direction at that spot and walks the way the height of the mountain decreases.
(Similarly, if our goal were to go up the mountain, that is, to climb to the top, he would walk in the
steepest upward direction instead.) Repeating the same procedure step after step, he eventually succeeds in
reaching the valley.
The basic process of gradient descent is similar to the downhill scenario described above.
First, we have a differentiable function. This function represents the mountain, and our goal is to find its
minimum, which is the bottom of the mountain.
Following the scenario above, the fastest way down the mountain is to find the steepest direction at the
current location and step that way. For a function, this corresponds to computing the gradient at the current
point and moving in the direction opposite to the gradient, which produces the fastest decrease in the function
value, because the direction of the gradient is the direction in which the function value increases fastest. We
then repeat this procedure, recomputing the gradient at each new point, and eventually we reach a local
minimum, just as in the descent down the mountain. Solving for the gradient is how we determine the steepest
direction, i.e., the means of measuring direction in the scenario.
The concept of Gradient
The gradient is a very important concept in calculus.
- In a function of one variable, the gradient is simply the derivative of the function, representing the slope of the
tangent line to the function at a given point;
- In a multivariate function, the gradient is a vector; vectors have direction, and the direction of the gradient points
in the direction in which the function rises fastest at a given point.
In calculus, taking the partial derivatives of a multivariate function with respect to each parameter, and writing
the partial derivatives obtained as a vector, gives the gradient.
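To make this concrete, the gradient of f(x, y) = x² + 3y at a point is (∂f/∂x, ∂f/∂y) = (2x, 3); a sketch that checks this with central finite differences (the function and step size are my own choices):

```python
def f(x, y):
    return x**2 + 3 * y

def numerical_gradient(func, x, y, h=1e-6):
    """Approximate (df/dx, df/dy) with central differences."""
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return dfdx, dfdy

dfdx, dfdy = numerical_gradient(f, 2.0, 1.0)
print(round(dfdx, 4), round(dfdy, 4))  # 4.0 3.0, matching (2x, 3) at x = 2
```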
This explains why we care so much about finding the gradient! To get to the bottom of the hill, we need to
find, at each step, the steepest direction at the current point, and the gradient tells us exactly that direction. The
direction of the gradient is the direction in which the function rises fastest at a given point, so the opposite direction
of the gradient is the direction in which the function falls fastest at a given point, which is exactly what we need. So
we just keep going in the opposite direction of the gradient, and we'll get to the local minimum!
The Formula of Gradient Descent

θ := θ − α ∇J(θ)

- The gradient points the fastest way up; we need the fastest way down, so we need to put a minus sign on it.
𝛼 is called the learning rate or step size in gradient descent algorithms. It controls the distance traveled at each
step, ensuring that we don't take too big a step, which really means that we don't go too fast and miss the lowest
point. It is also important to ensure that we don't walk too slowly, resulting in the sun going down before we make
it to the bottom of the hill. So the choice of 𝛼 is often important in the gradient descent method: 𝛼 should be
neither too big nor too small, since too small may lead to a long delay in getting to the lowest point, and too big
will lead to repeatedly missing the lowest point!
The negative sign in front of the gradient means going in the opposite direction of the gradient. As mentioned
earlier, the direction of the gradient is the direction in which the function rises fastest at the current point, and
we need to go in the direction of fastest descent, which is naturally the direction of the negative gradient; this is
why we add the negative sign here.
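Putting the pieces together, here is a sketch of gradient descent on the squared-error loss for simple linear regression (the learning rate, iteration count, and data are my own arbitrary choices):

```python
def gradient_descent(xs, ys, alpha=0.01, steps=5000):
    """Minimize mean((y - (w0 + w1*x))^2) by stepping against the gradient."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error w.r.t. w0 and w1
        grad0 = sum(2 * ((w0 + w1 * x) - y) for x, y in zip(xs, ys)) / n
        grad1 = sum(2 * ((w0 + w1 * x) - y) * x for x, y in zip(xs, ys)) / n
        # Step in the direction opposite the gradient
        w0 -= alpha * grad0
        w1 -= alpha * grad1
    return w0, w1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 1 + 2x
w0, w1 = gradient_descent(xs, ys)
print(round(w0, 2), round(w1, 2))  # 1.0 2.0
```

For this well-conditioned toy problem the iterates converge to the same line the least squares formula gives.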
Gradient descent optimization process
Initialize:
1. The starting point is θ₀ = 1.
2. The learning rate 𝛼 is 0.4.
We begin the iterative computational process of gradient descent:
[Figure: the value of θ after each of the first five steps of the gradient descent iteration.]
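The source does not show the function being minimized, so as a stand-in take J(θ) = θ² (an assumption of mine), whose gradient is 2θ. Starting from θ = 1 with 𝛼 = 0.4, each update multiplies θ by (1 − 2𝛼) = 0.2, so the iterates shrink rapidly toward the minimum at θ = 0:

```python
alpha = 0.4
theta = 1.0
history = [theta]
for _ in range(5):
    grad = 2 * theta           # gradient of J(theta) = theta^2
    theta -= alpha * grad      # each step multiplies theta by (1 - 2*alpha) = 0.2
    history.append(theta)
print(history)  # approximately [1.0, 0.2, 0.04, 0.008, 0.0016, 0.00032]
```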