Regression Lecture Summary
Review Lecture
Major Points
• Is there a relationship between x and y?
• What is the strength of this relationship?
• Pearson’s r
• Can we describe this relationship and use this to predict y
from x?
• Regression
• Is the relationship we have described statistically
significant?
• t test
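As a minimal sketch of the first and last questions above, the following Python computes Pearson's r and the t statistic commonly used to test whether r differs from zero. The data values and the function names `pearson_r` and `t_for_r` are made up for illustration and do not come from the lecture.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical data, invented for this example only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
t = t_for_r(r, len(x))
print(round(r, 4), round(t, 4))
```

With n - 2 = 3 degrees of freedom, the two-tailed critical t at alpha = 0.05 is 3.182, so a t statistic smaller than that would not be judged statistically significant.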
The relationship between x and y
• Correlation: is there a relationship between two
variables?
• Regression: how well does a given independent variable
predict the dependent variable?
• CORRELATION ≠ CAUSATION
• To infer causality: manipulate the independent
variable and observe the effect on the dependent variable
Regression
• How well a set of data points fits a straight line can
be measured by calculating the distance between
the data points and the line.
• The total error between the data points and the
line is obtained by squaring each distance and then
summing the squared values.
• The regression equation is designed to produce the
minimum sum of squared errors.
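The idea of minimizing the sum of squared errors can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture's own procedure: the data are invented, the names `fit_line` and `sum_squared_errors` are hypothetical, and the slope and intercept use the standard closed-form least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sp / ss_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Square each vertical distance to the line, then add them up."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for this example only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, a = fit_line(x, y)
sse_best = sum_squared_errors(x, y, b, a)

# Any other line, such as this arbitrary one, gives a larger total error.
sse_other = sum_squared_errors(x, y, 0.5, 2.5)
print(b, a, sse_best, sse_other)
```

Comparing `sse_best` with `sse_other` shows the sense in which the regression line is "best": no other straight line through these points has a smaller sum of squared errors.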
Regression
• Regression is the statistical technique for finding the
best-fitting straight line for a set of data.
• Its goal is to find the line that best describes the
relationship between X and Y in a set of data.
Regression Analysis
• Question asked: Given one variable, can we predict
values of another variable?
• Examples: Given the weight of a person, can we
predict how tall he/she is? Given the IQ of a person,
can we predict their performance in statistics? Given
a basketball team’s wins, can we predict the
extent of a riot? ...
Regression line
• makes the relationship between variables easier to
see.
• identifies the center, or central tendency, of the
relationship, just as the mean describes central
tendency for a set of scores.
• can be used for prediction.
The Equation for a Line
Y = bX + a
• b = the slope
• a = y-intercept
• Y = the predicted value
Regression
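• The mathematical equation for a line:
Y = mX + b
Where: Y = the line’s position on the vertical axis at any point
X = the line’s position on the horizontal axis at any point
m = the slope of the line
b = the y-intercept (the value of Y when X = 0)
This is the same line as Y = bX + a, with m in place of b and b in place of a.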
$$r^2 = \frac{\sum (Y - \bar{Y})^2 - \sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$
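To make r² concrete, here is a hedged Python sketch that computes it as the share of the total variability in Y that the regression line removes: total squared deviation from the mean of Y, minus squared error around the predicted values, divided by the total. The data and the name `r_squared` are assumptions made for illustration.

```python
def r_squared(xs, ys):
    """Proportion of the variability in Y explained by the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sp / ss_x
    intercept = mean_y - slope * mean_x

    # Variability of Y around its mean (total).
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # Variability of Y around the predicted values (error).
    ss_error = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return (ss_total - ss_error) / ss_total


# Hypothetical data, invented for this example only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2 = r_squared(x, y)
print(round(r2, 4))
```

For a single predictor, this value equals the square of Pearson's r for the same data.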
Correlation, r, is significant.
Regression Example: Answer this question using Regression.
What is the best predicted size of a household that discards 0.50 lb of plastic?
Before you get too excited about this output, let’s cross off the info that we
are not going to discuss or learn about in class. I’m only trying to give you an
elementary exposure to Regression. The following slides will show you the
things you need to understand, and each of those items will be explained.
Info to help you understand Regression Info that you MUST know for test
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
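A multiple regression prediction is the intercept plus each coefficient times its x value. The short Python sketch below shows that arithmetic; the function name `predict` and every number in it are hypothetical, chosen only to keep the sum easy to check by hand.

```python
def predict(intercept, coefficients, x_values):
    """Return b0 + b1*x1 + b2*x2 + ... + bk*xk."""
    if len(coefficients) != len(x_values):
        raise ValueError("need exactly one x value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, x_values))


# Hypothetical coefficients and x values, invented for this example only.
b0 = 10.0
b = [2.0, 0.5]   # b1, b2
x = [3.0, 4.0]   # x1, x2

y_hat = predict(b0, b, x)  # 10 + 2*3 + 0.5*4
print(y_hat)
```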
Multiple Regression Guidelines
• More x variables are NOT necessarily better
• Remember that R Square is a measure of how effective
our regression equation is. Therefore, if adding an x
variable does not appreciably increase the R Square value,
then DON’T add it
• Use the fewest x variables that give you the biggest
R Square (or Adjusted R Square) value. We want efficiency,
so a few variables that provide a big R Square are best
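The guidelines mention Adjusted R Square without giving a formula. A commonly used definition is 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of x variables. The Python sketch below applies that definition to invented numbers to show how a small gain in R Square from an extra variable can still lower the adjusted value; the name `adjusted_r2` and all inputs are assumptions.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R Square for n observations and k x variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than x variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical results for the same 20 observations, invented for illustration.
two_vars = adjusted_r2(0.900, n=20, k=2)    # model with 2 x variables
three_vars = adjusted_r2(0.905, n=20, k=3)  # third variable adds little

print(round(two_vars, 4), round(three_vars, 4))
```

Here R Square rises slightly with the third variable, but the adjusted value falls, which is the situation in which the guideline says not to add it.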
Multiple Regression Example: Using the following data
(measurements taken from Bears that had been anesthetized), construct a
multiple regression equation to predict the weight of Bears.
Step 1: Construct a “Correlation Matrix” to see which x variables have the strongest
linear relationships with the y variable (weight). Use the Excel function Tools, Data
Analysis, Correlation to construct a correlation matrix. An Excel file containing the
Bear data and the Correlation Matrix is on the class website (mrbear.xls).
• Ideally, we want to pick those few x variables that have strong correlations (close to
-1 or +1) with the y variable, BUT we also want the x variables to NOT be highly
correlated with each other
• The addition of an x variable that is strongly correlated with any x variable(s)
already in a multiple regression equation WILL NOT do much to increase the R
Squared or Adjusted R Square value
• On the other hand, adding an x variable that is strongly correlated with the y
variable, but NOT with any x variables already in the regression equation WILL
increase our R Squared value substantially
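The screening step described above can be sketched in Python as a small correlation matrix. The numbers below are invented and are NOT the values in mrbear.xls; the column names and the helper `correlation_matrix` are likewise hypothetical. The sketch only shows how to read the matrix: look for x variables strongly correlated with weight, then check how strongly those x variables correlate with each other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


def correlation_matrix(columns):
    """Map each pair of column names to the correlation of those columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical measurements, invented for this example only.
bears = {
    "weight": [80, 150, 220, 340, 420],
    "neck": [16, 20, 24, 29, 31],
    "length": [40, 52, 60, 70, 72],
    "age": [10, 45, 30, 80, 60],
}

matrix = correlation_matrix(bears)

# Correlation of each candidate x variable with the y variable (weight).
for name in bears:
    if name != "weight":
        print(name, round(matrix["weight"][name], 3))

# Correlation between two candidate x variables.
print("neck vs length", round(matrix["neck"]["length"], 3))
```

In this made-up data, neck and length are both strongly correlated with weight and also with each other, so adding the second of them to an equation that already contains the first would be expected to raise R Square only a little.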
Again, before we analyze the output, let’s cross off the info that we are not
going to discuss or learn about in class. The following slides will explain
those things you need to understand.
Excel calculates a t test statistic for each of the x variables (ignore the t Stat for Intercept).
The next column (P-value) is the important one because it reports the result of comparing this
test statistic with the critical value. The critical values are not shown, but we
don’t need to see them because the P-value effectively tells us whether this test statistic is
inside or outside the critical value (see P-value for more explanation).
The P-value tells us whether the x variable is statistically significant in the regression equation.
If the P-value is less than alpha (usually 0.05) then that x variable is a significant contributor in
the regression equation. If the P-value is greater than alpha then that x variable does not
contribute significantly to the regression equation. In this case, Neck has a P-value much less
than 0.05, so we see that Neck size contributes significantly to our regression equation to predict
a Bear’s weight. On the other hand, Age has a P-value greater than 0.05, so we see that adding
the Age x variable into the equation was not a good idea since Age does not significantly help us
predict a Bear’s weight in our regression equation.
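The decision rule above can be written as a short Python sketch. The rows below are hypothetical and are not the values from the lecture's Excel output; the names `t_stat` and `is_significant` are also assumptions. The sketch relies on the t Stat being the coefficient divided by its standard error, and it takes the P-values as given rather than computing them.

```python
ALPHA = 0.05  # significance level used in the lecture


def t_stat(coefficient, standard_error):
    """t statistic for one x variable: coefficient over its standard error."""
    return coefficient / standard_error


def is_significant(p_value, alpha=ALPHA):
    """True when the x variable contributes significantly at the given alpha."""
    return p_value < alpha


# Hypothetical regression output rows, invented for this example only.
rows = {
    "Neck": {"coefficient": 20.0, "standard_error": 2.5, "p_value": 0.0001},
    "Age": {"coefficient": 1.5, "standard_error": 2.0, "p_value": 0.46},
}

decisions = {}
for name, row in rows.items():
    t = t_stat(row["coefficient"], row["standard_error"])
    decisions[name] = is_significant(row["p_value"])
    verdict = "keep" if decisions[name] else "consider dropping"
    print(f"{name}: t = {t:.2f}, P-value = {row['p_value']}, {verdict}")
```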