Linear Regression
When the relationship between two variables is linear, a simple mathematical model can be developed by finding the line of best fit. The equation of this line can be used to make predictions by interpolation (estimates within the given data range) and extrapolation (estimates outside the given data range).
[Figure: regions of interpolation and extrapolation along the line of best fit]
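The distinction between the two kinds of prediction can be sketched in a few lines of Python. The slope, intercept, and x-values here are hypothetical and chosen only for illustration; `predict` and `prediction_type` are illustrative helper names.

```python
def predict(a, b, x):
    """Evaluate the line of best fit y = a*x + b at x."""
    return a * x + b


def prediction_type(x, xs):
    """Label a prediction at x relative to the range of observed x-values."""
    if min(xs) <= x <= max(xs):
        return "interpolation"
    return "extrapolation"


xs = [1, 2, 3, 4, 5]   # hypothetical observed x-values
a, b = 2.0, 1.0        # hypothetical slope and intercept

for x in (3.5, 8):
    print(x, predict(a, b, x), prediction_type(x, xs))
```

Extrapolated estimates are generally less reliable, because nothing in the data confirms that the linear relationship continues outside the observed range.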
The line of best fit can be determined analytically by the method of least squares. To understand this method, we must first define the term residual. In regression analysis, a residual is the vertical deviation of a particular data point from the line of best fit (positive for points above the line, negative for points below it).
Residual (vertical deviation)
For the line of best fit found by the method of least squares:
The sum of the residuals is zero (the total distance of the points above the line equals the total distance of the points below the line).
The sum of the squares of the residuals has the least possible value (the boxes shown below have the smallest possible total area).
Boxes represent the Residuals squared
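Both properties can be checked numerically. The sketch below uses a small hypothetical data set and takes y = 0.6x + 2.2 as its least squares line; it then compares the sum of squared residuals against a few nearby lines. The data and the helper names `residuals` and `sum_sq` are assumptions made for illustration.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
a, b = 0.6, 2.2        # least squares line for this data set


def residuals(a, b):
    """Signed vertical deviations of each point from y = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


def sum_sq(a, b):
    """Sum of the squared residuals (total area of the boxes)."""
    return sum(r * r for r in residuals(a, b))


best = sum_sq(a, b)
print("sum of residuals:", round(sum(residuals(a, b)), 9))
print("sum of squares:  ", round(best, 9))

# Nudging the slope or the intercept in either direction makes the total larger.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
for da, db in nudges:
    print(da, db, sum_sq(a + da, b + db) > best)
```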
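Statisticians have developed the following formula to determine the equation of the line of best fit using the least squares method:

$$y = ax + b$$

where

$$a = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^{2}\right) - \left(\sum x\right)^{2}}$$

and

$$b = \bar{y} - a\bar{x}$$

Here n is the number of data points, and $\bar{x}$ and $\bar{y}$ are the means of the x-values and y-values.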
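As a rough sketch, the formula translates directly into Python. The function name `least_squares_line` and the sample data are hypothetical, not taken from the slides.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a*x + b using the sums formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x-values are equal; the slope is undefined")

    a = (n * sum_xy - sum_x * sum_y) / denominator
    b = sum_y / n - a * (sum_x / n)   # b = y_bar - a * x_bar
    return a, b


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y = {a:.2f}x + {b:.2f}")
```

For this data the sums are Σx = 15, Σy = 20, Σxy = 66 and Σx² = 55, so a = (5·66 − 15·20) / (5·55 − 15²) = 30/50 = 0.6 and b = 4 − 0.6·3 = 2.2.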
The table and scatter plot show data for the full-time employees of a company:
In order to calculate the line of best fit using the least squares method, the following table and calculations are set up.
The slope a indicates only how y varies with x on the line of best fit. The slope a does NOT tell us anything about the strength of the correlation between the two variables (the correlation coefficient r does). It is possible to have a weak correlation with a large slope, or a strong correlation with a small slope.
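To make this concrete, the sketch below computes both the slope a and the correlation coefficient r for two hypothetical data sets. It assumes the standard Pearson formula for r, which is not written out in the slides; the function name `slope_and_r` and both data sets are invented for illustration.

```python
import math


def slope_and_r(xs, ys):
    """Return the least squares slope a and the Pearson correlation r."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    x_part = n * sum_x2 - sum_x ** 2
    y_part = n * sum_y2 - sum_y ** 2

    a = numerator / x_part
    r = numerator / math.sqrt(x_part * y_part)
    return a, r


xs = [1, 2, 3, 4, 5]
tight = [0.10, 0.21, 0.29, 0.40, 0.50]   # points hug a gently rising line
scattered = [40, -30, 60, -20, 50]       # points scatter widely around a steep line

a_small, r_strong = slope_and_r(xs, tight)
a_large, r_weak = slope_and_r(xs, scattered)
print(f"tight data:     slope = {a_small:.3f}, r = {r_strong:.3f}")
print(f"scattered data: slope = {a_large:.3f}, r = {r_weak:.3f}")
```

The first data set has a slope of roughly 0.1 with r close to 1, while the second has a slope of 3 with r close to 0.1.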
Outliers
An outlier is an observation that is numerically distant from the rest of the data.
Why Do We Have Outliers?
Measurement error
Miscoding
Misinterpretation
Data entered incorrectly
Relationship is non-linear
etc.
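A single outlier can change the line of best fit considerably. The sketch below fits a line to hypothetical data lying exactly on y = 2x, then refits after one value is entered incorrectly (100 instead of 10). The helper `fit_line` uses the mean-deviation form of the slope, which is algebraically equivalent to the sums formula; the name and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least squares line via deviations from the means; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    return a, y_bar - a * x_bar


xs = [1, 2, 3, 4, 5]
clean = [2, 4, 6, 8, 10]       # hypothetical data on the line y = 2x
miscoded = [2, 4, 6, 8, 100]   # last value entered incorrectly

a_clean, b_clean = fit_line(xs, clean)
a_bad, b_bad = fit_line(xs, miscoded)
print(f"clean:    y = {a_clean:.1f}x + {b_clean:.1f}")
print(f"miscoded: y = {a_bad:.1f}x + {b_bad:.1f}")
```

One bad entry moves the slope from 2 to 20, which is why unusual points are worth checking against the original records before fitting.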