Notes 3 - Linear Regression
Recall
If we think that a variable x may explain or even cause changes in another variable y, we call x an
__________________ and y a ___________________.
_________________ displays the relationship between two quantitative variables measured on the same
individuals.
When examining a scatterplot, look for an overall pattern showing the ________, ___________, and
_____________ of the relationship. Then look for ____________ or other deviations from the pattern.
Example
(A) A correlation of 0.2 means that 20% of the points are highly correlated.
(B) Perfect correlation, that is, when the points lie exactly on a straight line, results in r = 0.
(C) Correlation is not affected by which variable is called x and which is called y.
(D) Correlation is not affected by extreme values.
(E) A correlation of 0.75 indicates a relationship that is 3 times as linear as one for which the correlation is only
0.25.
The fitted line is called the line of best fit, linear regression line, or least-squares regression line (LSRL), and
has an equation in a form that should look very familiar:
____ : slope
____ : y-intercept
_______________________ - A form on the calculator (in addition to the form we see above)
The line is fitted to the data through a process called the method of least squares. The main idea behind this
method is that the sum of the squared vertical distances between the data points and the line is made as small as possible.
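The idea can be sketched in a few lines of Python. This is only an illustration: the five data points are invented for the sketch (they are not from these notes), and the helper name sum_squared_residuals is hypothetical. It fits the line with the standard least-squares formulas, then checks that nudging the slope or intercept makes the total squared vertical distance larger.

```python
# Toy data invented for illustration (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sum_squared_residuals(a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Least-squares slope and intercept from the standard formulas.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

best = sum_squared_residuals(a, b)

# Nudging the intercept or the slope in either direction makes the total larger.
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    assert sum_squared_residuals(a + da, b + db) > best

print(f"intercept = {a:.2f}, slope = {b:.2f}, sum of squares = {best:.2f}")
```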
Slope: The slope of the regression line gives us the predicted change in y with respect to x.
In other words, it is the amount y is predicted to change when x increases by 1.
Intercept: The intercept is statistically meaningful only when x can actually take values close to zero. When it
does make sense to have an x-value of zero, the y-intercept is the y-value we would predict when x = 0.
When we have a data set (x,y), we can calculate the LSRL by hand or with technology. In our Vitruvian Man
activity, we were already introduced to both. Let’s go through the process again, but this time, we will use a
different example.
Example
Many schools require teachers to have evaluations done by students. A study investigated the extent to which
student evaluations are related to grades. Teacher evaluations and grades are both given on a scale of 100. The
evaluation scores (y) that 10 of Mrs. H's students gave her are shown below, together with each student's class average (x).
x   40  60  70  73  75  68  65  85  98  90
y   10  50  60  65  75  73  78  80  90  95
Step 1: Enter the data into your calculator. x values go into L1 and y values into L2.
Slope = r(Sy / Sx)
Sy = _______
Sx = _______
*Finding r: We could build lists to find “r” by hand, like we did in the activity, but we can just use
our calculator to get “r”. You will NOT have to find “r” by hand on the AP exam. You will have to show
work for calculating the slope by hand.
*Also, when you find “r” on your calculator, you will see the LSRL there too. Still show your
work on your paper, but feel free to check that your answers match!
*Fun fact about the LSRL: it will ALWAYS pass through the point (x̄, ȳ), the mean of x and the mean of y.
We use this property to find the y-intercept.
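As a way to check hand work, the slope and intercept steps above can be sketched in Python. Treat this as a rough sketch under assumptions: the x and y lists are the ten values read from the data table in this example (with digits that wrapped onto a second line rejoined, e.g. 6 and 0 read as 60), and the variable names are chosen for the sketch. It uses the sample standard deviations, which is what the calculator reports as Sx and Sy.

```python
from statistics import mean, stdev

# Student averages (x) and evaluation scores (y) for Mrs. H's 10 students,
# as read from the data table in the example (assumed values, see lead-in).
x = [40, 60, 70, 73, 75, 68, 65, 85, 98, 90]
y = [10, 50, 60, 65, 75, 73, 78, 80, 90, 95]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (Sx and Sy on the calculator)

# Correlation r: sum of the products of z-scores, divided by n - 1.
n = len(x)
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)

slope = r * s_y / s_x
intercept = y_bar - slope * x_bar  # because the line passes through (x_bar, y_bar)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.2f}")
```

With these values the sketch gives a slope of about 1.34 and an intercept of about -29.3, so each additional point in a student's average goes with a predicted increase of about 1.34 points in the evaluation.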
(b) Use your equation to predict what evaluation Mrs. H will get from a student who scored an 81.
(d) Do you think student grades and the evaluations students give their teachers are related? Explain.
LSRL on Calculator
At this point, I’m sure you know the trick to finding the LSRL on your calculator. There are two ways to get the
line; which one you use depends on how you like to write the equation.
__________________ is the use of a regression line for prediction far outside the interval of values of the
explanatory variable x used to obtain the line. Such predictions are often not accurate.
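One way to see why such predictions are risky is to flag any x-value that falls outside the data used to fit the line. The sketch below is an illustration only: it assumes the ten (x, y) values from the teacher-evaluation example, and the function name predict is hypothetical. A student average of 81 sits inside the observed range, while an average of 10 does not.

```python
# Data assumed from the teacher-evaluation example (x = student average, y = evaluation).
x = [40, 60, 70, 73, 75, 68, 65, 85, 98, 90]
y = [10, 50, 60, 65, 75, 73, 78, 80, 90, 95]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

def predict(x_new):
    """Return (predicted y, True if x_new lies outside the observed x range)."""
    is_extrapolation = x_new < min(x) or x_new > max(x)
    return intercept + slope * x_new, is_extrapolation

print(predict(81))  # inside the observed range of 40 to 98
print(predict(10))  # far below the smallest observed x, so flagged as extrapolation
```

For x = 10 the line predicts a negative evaluation, which is impossible on a 0 to 100 scale. The line was never fitted to data anywhere near that value.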
CORRELATION DOES NOT IMPLY CAUSATION. For example, there was a study done that showed a
strong, positive linear relationship between ice cream sales and homicides in New York City. Does this mean
that if we stop selling ice cream, we will have no more homicides?
Coefficient of Determination
The strength of a prediction which uses the LSRL depends on how close the data points are to the regression
line. The mathematical approach to describing this strength is via the coefficient of determination (________).
The coefficient of determination gives us the proportion of variation in the values of y that is explained by least-
squares regression of y on x. The coefficient of determination turns out to be the correlation coefficient squared.
Whenever you use the regression line for prediction, also include r² as a measure of how successful the regression
is in explaining the response.
In our example, r² = 0.818. This means that 81.8% of the variation in teacher evaluations (the dependent
variable) can be explained by the linear relationship it has with the student class average (the independent
variable).
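The "proportion of variation explained" reading can be checked numerically. The following is a minimal sketch, assuming the ten (x, y) values from the teacher-evaluation example; the variable names are chosen for the sketch. It computes the coefficient of determination two ways: as 1 minus (leftover variation around the line) / (total variation in y), and as the correlation squared.

```python
# Data assumed from the teacher-evaluation example (x = student average, y = evaluation).
x = [40, 60, 70, 73, 75, 68, 65, 85, 98, 90]
y = [10, 50, 60, 65, 75, 73, 78, 80, 90, 95]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)  # total variation in y
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Variation in y left unexplained by the line (sum of squared residuals).
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / syy
r = sxy / (sxx * syy) ** 0.5

# Both values round to 0.818.
print(round(r_squared, 3), round(r ** 2, 3))
```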
Residuals
In most cases, no line will pass exactly through all the points. This means that even if we use the LSRL
to make predictions about our dependent variable, there will still be some error from the actual y-value.
Because we use the line to predict __ from __, the prediction errors we make are errors in y, the
_________ direction in the scatterplot.
A good regression makes the vertical deviations of the points from the line
____________________________________ (remember least-squares regression?)
A residual is the _____________________ between an observed value of the response variable and the
value predicted by the regression line.
Residual = observed – predicted OR _________________________
If the residual is positive, the observed point lies ___________ the least squares regression line.
If the residual is negative, the observed point lies ____________ the least squares regression line.
Fun fact: If you add up all the residuals from the LSRL, you will get 0, because the positive and negative residuals
cancel. That is why the method of least squares squares the residuals before adding them up, and then minimizes that sum.
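The residual definition and the sum-to-zero fact can be illustrated with a short Python sketch. It assumes the ten (x, y) values from the teacher-evaluation example earlier in these notes; the variable names are chosen for the sketch.

```python
# Data assumed from the teacher-evaluation example (x = student average, y = evaluation).
x = [40, 60, 70, 73, 75, 68, 65, 85, 98, 90]
y = [10, 50, 60, 65, 75, 73, 78, 80, 90, 95]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

# Residual = observed - predicted, one per data point.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

for xi, yi, res in zip(x, y, residuals):
    position = "above" if res > 0 else "below"
    print(f"x = {xi}, y = {yi}, residual = {res:7.2f} ({position} the line)")

# The residuals cancel out, up to floating-point rounding.
print("sum of residuals:", round(sum(residuals), 8))

# The sum of SQUARED residuals does not cancel; this is what least squares minimizes.
print("sum of squared residuals:", round(sum(res ** 2 for res in residuals), 2))
```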
Example: Everyone knows that cars and trucks lose value the more they are driven. Can we predict the price of
a used Ford F-150 SuperCrew 4x4 if we know how many miles it has on the odometer? A random sample of 16
used trucks was selected from autotrader.com. Here is a graph of the data:
Find and interpret the residual for the Ford F-150 that had 70,583 miles driven and a price of $21,994.
Residual Plots
A residual plot makes it easy to study the residuals by plotting them against the explanatory variable.
When an obvious curved pattern exists in a residual plot, the model we are using is not appropriate:
The TI-83/84 will generate a complete set of residuals when you perform a LinReg. They are stored in a list
called RESID which can be found in the LIST menu. RESID stores only the current set of residuals. That is, a
new set of residuals is stored in RESID each time you perform a new regression.
To draw a residual plot on the TI-83/84, first enter your data and perform a LinReg. Next, create a
STAT PLOT where XList is L1 and YList is RESID (get this by pressing 2nd STAT 9:RESID).
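To see what a curved pattern in a residual plot looks like in numbers, here is a small Python sketch. The data are invented for illustration (y = x squared, which is deliberately not linear) and are not from these notes. It fits a least-squares line and lists the (x, residual) pairs that a residual plot would display.

```python
# Invented curved data for illustration: y = x^2, which a straight line cannot describe well.
x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

# These (x, residual) pairs are what a residual plot displays.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
for xi, res in zip(x, residuals):
    print(f"x = {xi}: residual = {res:+.1f}")
```

The residuals are positive at both ends and negative in the middle. That U shape is the kind of obvious curved pattern that signals a linear model is not appropriate, even though the residuals still add up to 0.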