1-1 - Simple and Multiple Linear Regression
OVERVIEW
Linear regression model
\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon\)
• \(Y\) is the dependent variable (response variable); the \(X\)'s are the independent variables (explanatory variables)
• \(p\): number of independent variables
– \(p = 1\): simple linear regression
– \(p > 1\): multiple linear regression
• \(\beta_0\) is the intercept; the remaining terms \(\beta_1, \ldots, \beta_p\) are the slopes
• \(\epsilon\) is the error term, whose distribution is \(N(0, \sigma)\): normal with mean 0 and standard deviation \(\sigma\)
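As a minimal sketch of what the model says, the Python snippet below generates one response value as the linear part plus a normal error. The function name `simulate_response` and the coefficient values in the usage line are hypothetical, chosen only for illustration; `sigma` is treated as the standard deviation of the error term.

```python
import random


def simulate_response(betas, xs, sigma, rng=random):
    """Draw one Y = beta_0 + beta_1*x_1 + ... + beta_p*x_p + eps.

    betas: [beta_0, beta_1, ..., beta_p] (intercept first)
    xs:    [x_1, ..., x_p] values of the independent variables
    sigma: standard deviation of the error term eps ~ N(0, sigma)
    """
    intercept, slopes = betas[0], betas[1:]
    if len(slopes) != len(xs):
        raise ValueError("need exactly one slope per independent variable")
    # Deterministic (linear) part of the model.
    linear_part = intercept + sum(b * x for b, x in zip(slopes, xs))
    # Random error with mean 0 and standard deviation sigma.
    eps = rng.gauss(0.0, sigma)
    return linear_part + eps


# Hypothetical coefficients with p = 2; sigma = 0 switches the error off,
# so the result is just the linear part: 1 + 2*4 + 3*5.
print(simulate_response([1.0, 2.0, 3.0], [4.0, 5.0], sigma=0.0))
```

With `sigma` greater than zero, repeated calls scatter around the same linear part, which is the variation the regression has to see through when estimating the coefficients.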
Overhead Costs.xlsx
• The manager of Bendrix Manufacturing Company wants
to get a better understanding of their overhead costs.
• Dependent variable is Overhead Costs ($) of production
• Independent variables are MachHrs (# of machine hours)
and ProdRuns (# of production runs)
• Are both independent variables useful to explain the
variation of overhead costs?
Simple linear regression
• Eventually we will estimate a regression
equation with both of the variables included.
• However, if we include only ONE at a time, what
do they tell us about the overhead costs?
Analysis ToolPak add-in for Excel (Windows)
• Go to the Data tab. If you cannot see Data Analysis, enable the add-in as follows:
1. Go to File -> Options -> Add-ins
2. At the bottom, under Manage, select Excel Add-ins and click on “Go”
3. Check Analysis ToolPak
• Mac users: see recording
Linear regression in Excel
• With the Excel Analysis ToolPak, we can run a linear regression easily
1. Go to Data -> Data Analysis -> Regression
• Check “Labels” if the first row of your ranges contains variable names
2. Specify the Input Y Range and the Input X Range, then click on “OK”
3. By default, the regression output appears in a new worksheet
• Video tutorial
https://www.youtube.com/watch?v=Q5JlRmmHzsg
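For readers who want to see what the Regression tool computes for the coefficients in the one-variable case, here is a minimal Python sketch of the closed-form least-squares estimates. The function name `fit_simple_regression` and the four toy data points are hypothetical; they are not the Bendrix data.

```python
def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares around the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must vary to estimate a slope")
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data for illustration only (hypothetical values).
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.1, 4.9, 7.2, 8.8]
toy_intercept, toy_slope = fit_simple_regression(toy_x, toy_y)
print(f"intercept = {toy_intercept:.2f}, slope = {toy_slope:.2f}")
```

The Excel output additionally reports standard errors, t statistics, and the ANOVA table, which this sketch leaves out.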
Regression output for Overhead vs. MachHrs
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.631884525
R Square             0.399278053
Adjusted R Square    0.38160976
Standard Error       8584.739353
Observations         36

ANOVA
              df    SS            MS          F           Significance F
Regression    1     1665463368    1.67E+09    22.59856    3.57424E-05
Residual      34    2505723492    73697750
Total         35    4171186860

                 Coefficients    Standard Error    t Stat      P-value     Lower 95%      Upper 95%      Lower 95.0%    Upper 95.0%
Intercept        48621.35463     10725.3327        4.533319    6.86E-05    26824.85615    70417.8531     26824.85615    70417.85312
Machine Hours    34.70223642     7.299902097       4.753795    3.57E-05    19.86705047    49.5374224     19.86705047    49.53742238
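Several entries in this output are linked by standard identities, and it can help to verify them once by hand. The Python sketch below recomputes R Square, Standard Error, F, and the slope's t Stat using only numbers copied from the output above; the variable names are my own.

```python
import math

# Values copied from the Overhead vs. MachHrs output.
ss_regression = 1665463368
ss_residual = 2505723492
df_regression = 1
df_residual = 34
slope = 34.70223642
slope_se = 7.299902097

# Total variation splits into explained plus unexplained variation.
ss_total = ss_regression + ss_residual

# R Square: share of total variation explained by the regression.
r_square = ss_regression / ss_total

# Standard Error: square root of the residual mean square.
ms_residual = ss_residual / df_residual
standard_error = math.sqrt(ms_residual)

# F: regression mean square divided by residual mean square.
f_stat = (ss_regression / df_regression) / ms_residual

# t Stat: coefficient divided by its standard error.
t_stat = slope / slope_se

print(f"R Square       = {r_square:.6f}")
print(f"Standard Error = {standard_error:.3f}")
print(f"F              = {f_stat:.5f}")
print(f"t Stat (slope) = {t_stat:.6f}")
```

With a single independent variable, the F statistic equals the square of the slope's t statistic, which is why Significance F and the slope's P-value coincide in this output.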
Regression output for Overhead vs. ProdRuns
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.520543527
R Square             0.270965564
Adjusted R Square    0.249523375
Standard Error       9457.239463
Observations         36

ANOVA
              df    SS            MS             F             Significance F
Regression    1     1130247999    1130247999     12.6370288    0.001135484
Residual      34    3040938861    89439378.26
Total         35    4171186860

                   Coefficients    Standard Error    t Stat         P-value        Lower 95%      Upper 95%      Lower 95.0%    Upper 95.0%
Intercept          75605.51571     6808.610629       11.10439704    7.46473E-13    61768.75415    89442.27728    61768.75415    89442.27728
Production Runs    655.0706602     184.2746779       3.554859885    0.001135484    280.5794579    1029.561862    280.5794579    1029.561862
Example: regression line
• The two regression lines are
Predicted Overhead = 48,621 + 34.7MachHrs
Predicted Overhead = 75,606 + 655.1ProdRuns
• The two regression models, with \(X_1\) = MachHrs and \(X_2\) = ProdRuns, are
\(Y = 48{,}621 + 34.7 X_1 + \epsilon\), where \(\epsilon \sim N(0, 8584.7)\)
\(Y = 75{,}606 + 655.1 X_2 + \epsilon\), where \(\epsilon \sim N(0, 9457.2)\)
• Clearly these two models are quite different, although each effectively
breaks Overhead into a fixed component and a variable component.
• The equations imply that expected overhead increases by about $35
for each extra machine hour and about $655 for each extra production
run.
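To make the fixed and variable components concrete, the Python sketch below evaluates both fitted lines. The function names and the input values (1,500 machine hours, 35 production runs) are hypothetical, picked only to show the arithmetic; the coefficients are the rounded ones from the two lines above.

```python
def predict_from_machine_hours(mach_hrs):
    """Predicted overhead ($) from the MachHrs-only line."""
    return 48_621 + 34.7 * mach_hrs


def predict_from_production_runs(prod_runs):
    """Predicted overhead ($) from the ProdRuns-only line."""
    return 75_606 + 655.1 * prod_runs


# Hypothetical inputs for illustration.
print(predict_from_machine_hours(1500))   # 48,621 fixed + 34.7 * 1500 variable
print(predict_from_production_runs(35))   # 75,606 fixed + 655.1 * 35 variable

# One extra machine hour adds the slope, about $35, to the prediction.
extra_hour = predict_from_machine_hours(1501) - predict_from_machine_hours(1500)
print(round(extra_hour, 2))
```

Each line gives its own prediction for the same month because each one uses only part of the available information.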
Multiple regression
• There was a positive linear relationship between
Overhead and each of the MachHrs and
ProdRuns variables
• The differences between these two lines can be
attributed to neither one telling the whole story
• What happens if we include both variables?
Regression output for Overhead vs. MachHrs and ProdRuns
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.930819542
R Square             0.866425021
Adjusted R Square    0.858329567
Standard Error       4108.99309
Observations         36

ANOVA
              df    SS             MS             F              Significance F
Regression    2     3614020661     1807010330     107.0261279    3.75374E-15
Residual      33    557166199.1    16883824.22
Total         35    4171186860

                   Coefficients    Standard Error    t Stat         P-value        Lower 95%       Upper 95%      Lower 95.0%     Upper 95.0%
Intercept          3996.678209     6603.650932       0.605222512    0.549170949    -9438.550632    17431.90705    -9438.550632    17431.90705
Machine Hours      43.53639812     3.5894837         12.12887472    1.04645E-13    36.23353862     50.83925761    36.23353862     50.83925761
Production Runs    883.6179252     82.25140753       10.74289124    2.6114E-12     716.2761784     1050.959672    716.2761784     1050.959672
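As a quick consistency check on the two-variable output, the Python sketch below recomputes R Square, Adjusted R Square, Standard Error, F, and the two slope t statistics from numbers printed above. It assumes the usual formula Adjusted R Square = 1 - (1 - R Square)(n - 1)/(n - p - 1); the variable names are my own.

```python
import math

# Values copied from the two-variable regression output.
n_obs = 36          # Observations
p = 2               # number of independent variables
ss_regression = 3614020661
ss_residual = 557166199.1
coefficients = {
    "Machine Hours": (43.53639812, 3.5894837),       # (estimate, std. error)
    "Production Runs": (883.6179252, 82.25140753),
}

ss_total = ss_regression + ss_residual
r_square = ss_regression / ss_total

# Adjusted R Square penalizes R Square for the number of variables used.
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - p - 1)

# Residual mean square, its square root, and the overall F statistic.
ms_residual = ss_residual / (n_obs - p - 1)
standard_error = math.sqrt(ms_residual)
f_stat = (ss_regression / p) / ms_residual

# t Stat for each slope: estimate divided by its standard error.
t_stats = {name: est / se for name, (est, se) in coefficients.items()}

print(f"R Square          = {r_square:.6f}")
print(f"Adjusted R Square = {adj_r_square:.6f}")
print(f"Standard Error    = {standard_error:.3f}")
print(f"F                 = {f_stat:.4f}")
for name, t in t_stats.items():
    print(f"t Stat ({name}) = {t:.4f}")
```

Compared with either one-variable output, R Square rises from below 0.40 to about 0.87 and the Standard Error roughly halves.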
Example
• What is the fitted regression line?
\(\hat{Y} = 3997 + 43.54 X_1 + 883.62 X_2\)
• What is the fitted regression model?
\(Y = 3997 + 43.54 X_1 + 883.62 X_2 + \epsilon\), where \(\epsilon \sim N(0, 4109)\)
Interpretation
• The interpretation of the equation is that if the number of
production runs is held constant, then the overhead cost is
expected to increase by $43.54 for each extra machine hour; and if
the number of machine hours is held constant, the overhead is
expected to increase by $883.62 for each extra production run.
• The Bendrix manager can interpret $3997 as the fixed component
of overhead. The slope terms involving MachHrs and
ProdRuns are the variable components of overhead.
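The "held constant" reading can be checked numerically. In the Python sketch below, the function name `predict_overhead` and the baseline month (1,500 machine hours, 35 production runs) are hypothetical; the coefficients are the rounded estimates quoted above.

```python
def predict_overhead(mach_hrs, prod_runs):
    """Predicted overhead ($) from the fitted two-variable equation."""
    return 3997 + 43.54 * mach_hrs + 883.62 * prod_runs


# Hypothetical baseline month for illustration.
base = predict_overhead(1500, 35)

# One more machine hour, production runs held constant.
effect_of_hour = predict_overhead(1501, 35) - base

# One more production run, machine hours held constant.
effect_of_run = predict_overhead(1500, 36) - base

print(round(base, 2))
print(round(effect_of_hour, 2))  # equals the MachHrs slope
print(round(effect_of_run, 2))   # equals the ProdRuns slope
```

Because the equation is linear, these two differences are the same at any baseline, not just the one chosen here.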
Regression line comparison
It is interesting to compare this equation with the separate equations
found in the previous example:
Predicted Overhead = 48,621 + 34.7MachHrs
Predicted Overhead = 75,606 + 655.1ProdRuns
vs
Predicted Overhead = 3997 + 43.54MachHrs + 883.62ProdRuns
• Note that both slope coefficients have increased. Also, the intercept is
now lower than either intercept in the single-variable equations.
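• The coefficients have different meanings: in the multiple regression each slope
is the expected change with the other variable held constant, whereas in a simple
regression the other variable is ignored. It is therefore not surprising that
we obtain different estimates.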
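One way slopes can grow when a second variable is added is that the two independent variables are negatively related in the sample. The Python sketch below illustrates this with a small made-up data set (not the Bendrix data) in which y is exactly 2 + 3*x1 + 5*x2; the function names are hypothetical, and the two-variable slopes come from solving the centered normal equations.

```python
def cross_sum(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def simple_slope(x, y):
    """Least-squares slope of y on a single variable x."""
    return cross_sum(x, y) / cross_sum(x, x)


def two_variable_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 together."""
    s11, s22, s12 = cross_sum(x1, x1), cross_sum(x2, x2), cross_sum(x1, x2)
    s1y, s2y = cross_sum(x1, y), cross_sum(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Made-up data: x1 and x2 are negatively related, y has no noise.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [4, 5, 2, 6, 1, 3]
y = [2 + 3 * a + 5 * b for a, b in zip(x1, x2)]

b1, b2 = two_variable_slopes(x1, x2, y)
print("multiple slopes:", round(b1, 3), round(b2, 3))
print("simple slope on x1:", round(simple_slope(x1, y), 3))
print("simple slope on x2:", round(simple_slope(x2, y), 3))
```

In this toy sample each simple slope is smaller than its multiple-regression counterpart, because each one absorbs part of the opposing movement of the omitted variable. The same direction of change appears in the overhead example, although the sketch does not use that data.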
Summary
• Linear regression model
– Simple vs. Multiple
– Getting familiar with the Analysis ToolPak add-in for Excel
• Readings: Textbook Sections 10.1, 10.4, 10.5
• Video tutorial
https://www.youtube.com/watch?v=Q5JlRmmHzsg