Multiple Regression and Correlation Analysis
• The linear regression equation using one independent variable has the form:

  Y′ = a + bX

• The multiple regression case extends the equation to include additional independent variables. For two independent variables, the general form of the multiple regression equation is:

  Y′ = a + b1X1 + b2X2
Where:
• X1 and X2 are the two independent variables.
• a is the Y-intercept.
• b1 is the net change in Y for each unit change in X1, holding X2 constant. It is called a partial regression coefficient, a net regression coefficient, or just a regression coefficient.
• b2 is the net change in Y for each unit change in X2, holding X1 constant. It is also called a partial regression coefficient, a net regression coefficient, or just a regression coefficient.
• For three independent variables designated X1, X2, and X3, the general multiple regression equation is:

  Y′ = a + b1X1 + b2X2 + b3X3
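As a sketch of how the intercept a and the coefficients b1, b2, b3 could be estimated in practice, the following uses NumPy's least-squares solver on a small made-up data set (the values and coefficients here are hypothetical, chosen so the fit is exact):

```python
import numpy as np

# Hypothetical data: Y is generated exactly from
# Y = 2 + 3*X1 - 1*X2 + 0.5*X3, so least squares recovers
# the coefficients without error.
X = np.array([
    [1.0, 2.0, 4.0],
    [2.0, 1.0, 0.0],
    [3.0, 5.0, 2.0],
    [0.0, 4.0, 6.0],
    [5.0, 0.0, 1.0],
    [4.0, 3.0, 3.0],
])
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + 0.5 * X[:, 2]

# Prepend a column of ones so the intercept a is estimated too.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2, b3 = coeffs
print(a, b1, b2, b3)
```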
The Multiple Standard Error of Estimate:

  s_y.12...k = √[ Σ(Y - Y′)² / (n - (k + 1)) ]

Where:
• Y is the observation.
• Y′ is the value estimated from the regression equation.
• n is the number of observations in the sample.
• k is the number of independent variables.
• From our previous example, the Y′ values are found by using the regression equation and are given in the following table:
Home   Mean Outside      Attic Insulation   Age of Furnace   Cost Y   Y′       Y - Y′       (Y - Y′)²
       Temperature (°F)  (inches)           (years)                            (Residual)
1      35                3                  6                250      258.90    -8.90         79.21
2      29                4                  10               360      295.98    64.02       4098.56
3      36                7                  3                165      176.82   -11.82        139.71
4      60                6                  9                 43      118.30   -75.30       5670.09
5      65                5                  6                 92       91.90     0.10          0.01
6      30                5                  5                200      246.10   -46.10       2125.21
7      10                6                  7                355      335.10    19.90        396.01
8       7                10                 10               290      307.94   -17.94        321.84
9      21                9                  11               230      264.72   -34.72       1205.48
10     55                2                  5                120      176.00   -56.00       3136.00
11     54                12                 4                 73       26.48    46.52       2164.11
12     48                5                  1                205      139.26    65.74       4321.75
13     20                5                  15               400      352.90    47.10       2218.41
14     39                4                  7                320      231.88    88.12       7765.13
15     60                8                  6                 72       70.40     1.60          2.56
16     20                5                  8                272      310.20   -38.20       1459.24
17     58                7                  3                 94       76.06    17.94        321.84
18     40                8                  11               190      192.50    -2.50          6.25
19     27                9                  8                235      218.94    16.06        257.92
20     30                7                  5                139      216.50   -77.50       6006.25
Total                                                                                     41695.58
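As a check, the regression can be refit from the raw columns of the table with NumPy; the residual sum of squares from an ordinary least-squares fit should reproduce the table total of 41695.58 up to the rounding in the tabulated Y′ values (a sketch, not the original software output):

```python
import numpy as np

# The 20 observations: temperature (°F), insulation (in.),
# furnace age (years), heating cost Y.
data = np.array([
    [35, 3, 6, 250], [29, 4, 10, 360], [36, 7, 3, 165],
    [60, 6, 9, 43],  [65, 5, 6, 92],   [30, 5, 5, 200],
    [10, 6, 7, 355], [7, 10, 10, 290], [21, 9, 11, 230],
    [55, 2, 5, 120], [54, 12, 4, 73],  [48, 5, 1, 205],
    [20, 5, 15, 400], [39, 4, 7, 320], [60, 8, 6, 72],
    [20, 5, 8, 272], [58, 7, 3, 94],   [40, 8, 11, 190],
    [27, 9, 8, 235], [30, 7, 5, 139],
], dtype=float)
X, y = data[:, :3], data[:, 3]

A = np.column_stack([np.ones(len(X)), X])   # intercept column
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coeffs                          # fitted values Y′
sse = float(np.sum((y - y_hat) ** 2))
print(round(sse, 1))
```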
• Here n = 20 and k = 3 (three independent variables), so the multiple standard error of estimate is:

  s_y.123 = √[ 41695.58 / (20 - (3 + 1)) ] = 51.05
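The arithmetic can be verified directly with plain Python, using the totals above:

```python
import math

sse = 41695.58          # sum of squared residuals from the table
n, k = 20, 3            # observations, independent variables
s = math.sqrt(sse / (n - (k + 1)))
print(round(s, 2))  # 51.05
```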
• How do you interpret the 51.05?
• It is the typical error we make when we use this equation to predict the cost. If the errors are normally distributed, about 68 percent of the residuals should be within ±51.05 and about 95 percent should be within ±2(51.05), or ±102.10.
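That interpretation can be checked against the residual column of the table; roughly 68 percent of the 20 residuals should fall within ±51.05:

```python
# Residuals (Y - Y′) copied from the table above.
residuals = [-8.90, 64.02, -11.82, -75.30, 0.10, -46.10, 19.90,
             -17.94, -34.72, -56.00, 46.52, 65.74, 47.10, 88.12,
             1.60, -38.20, 17.94, -2.50, 16.06, -77.50]
s = 51.05

within_1s = sum(abs(r) < s for r in residuals)      # within ±51.05
within_2s = sum(abs(r) < 2 * s for r in residuals)  # within ±102.10
print(within_1s, within_2s)  # 14 of 20 within one s; all 20 within two
```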
THE ANOVA TABLE
• The multiple regression calculations are lengthy. Many
software systems are available to perform the
calculations
• In the analysis of variance table, the variation is
divided into two components: that due to the
treatments and that due to random error. Here the
total is also divided into two components: that explained by the regression (that is, by the independent variables) and the error, or unexplained, variation.
These two categories are identified in the source
column of the analysis of variance table. In our
example, there are 20 observations, so n=20. The
total number of degrees of freedom is n-1=20-1=19.
The number of degrees of freedom in the regression
row is the number of independent variables. We let k
represent the number of independent variables , so
k=3. The number of degrees of freedom in the error row is n-(k+1)=20-(3+1)=16 df.
• The heading “SS” in the middle of the ANOVA table refers to the sum of squares, or the variation:
• Total variation = SS total = Σ(Y - Ȳ)² = 212915
• Error variation = SSE = Σ(Y - Y′)² = 41695
• Regression variation = SSR = SS total - SSE = 212915 - 41695 = 171220
The general format of the ANOVA table is:

Source       df           SS         MS                       F
Regression   k            SSR        MSR = SSR/k              MSR/MSE
Error        n - (k+1)    SSE        MSE = SSE/[n - (k+1)]
Total        n - 1        SS total
• The coefficient of multiple determination, written as R², is the percent of the variation explained by the regression. It is the sum of squares due to the regression divided by the sum of squares total.

Coefficient of Multiple Determination:

  R² = SSR / SS total = 171220 / 212915 = 0.804
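The decomposition and R² can be reproduced from the totals in the ANOVA table:

```python
ss_total = 212915        # total variation, sum of (Y - Ybar)^2
sse = 41695              # unexplained (error) variation
ssr = ss_total - sse     # variation explained by the regression
r_squared = ssr / ss_total
print(ssr, round(r_squared, 3))  # 171220 and 0.804
```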
The multiple standard error may be found directly from the ANOVA table:

  s_y.123 = √[ SSE / (n - (k + 1)) ] = √[ 41695 / (20 - (3 + 1)) ] = 51.05
Correlation matrix (excerpt): the correlation between heating cost and mean outside temperature is -0.81151.
Source DF SS MS F p
Regression 3 171220 57073 21.90 0.000
Residual Error 16 41695 2606
Total 19 212915
  F = (SSR / k) / (SSE / [n - (k + 1)]) = MSR / MSE

  F = (171220 / 3) / (41695 / [20 - (3 + 1)]) = 21.90
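The same F ratio in code, using the sums of squares above:

```python
ssr, sse = 171220, 41695
n, k = 20, 3
msr = ssr / k                 # mean square regression
mse = sse / (n - (k + 1))     # mean square error
f_stat = msr / mse
print(f"{f_stat:.2f}")  # 21.90
```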
The critical value is found in Appendix G. It is 3.24.
• The decision rule is: if the computed F is greater
than 3.24, reject H0 and accept the alternate
hypothesis, H1. The computed value of F is 21.90,
which is in the rejection region. The null hypothesis
that all the multiple regression coefficients are zero is
rejected. The p-value is 0.000 from the above table, so
it is quite unlikely that H0 is true. The alternate hypothesis is accepted, indicating that at least one of the regression variables (amount of insulation, etc.) has the ability to explain the variation in the dependent variable (heating cost).
• The outside temperature, the amount of insulation and
age of the furnace have a great bearing on heating
costs.
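The Appendix G critical value and the p-value can also be reproduced in code; this sketch assumes SciPy is available:

```python
from scipy.stats import f

# Upper 5% point of the F distribution with k = 3 and
# n - (k + 1) = 16 degrees of freedom.
critical = f.ppf(0.95, dfn=3, dfd=16)
print(round(critical, 2))  # 3.24

# p-value for the computed F statistic of 21.90
p_value = f.sf(21.90, dfn=3, dfd=16)
print(p_value < 0.001)  # True
```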
Evaluating Individual Regression
Coefficients
• The next step is to test the variables individually
to determine which regression coefficients may
be 0 and which are not.
• [Why is it important to find whether it is possible that any of the β's equal 0? If a β could equal 0, it implies that this particular independent variable is of no value in explaining any variation in the dependent variable. If there are coefficients for which H0 cannot be rejected, we may want to eliminate them from the regression equation.]
• We will now conduct three separate tests of
hypothesis – for temperature, for insulation, and
for the age of the furnace.
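A sketch of how those three individual t tests could be computed from the same data, assuming an ordinary least-squares fit: the standard error of each coefficient is √(MSE · c_ii), where c_ii is the corresponding diagonal element of (XᵀX)⁻¹.

```python
import numpy as np

# Same 20 observations as in the residual table above.
data = np.array([
    [35, 3, 6, 250], [29, 4, 10, 360], [36, 7, 3, 165],
    [60, 6, 9, 43],  [65, 5, 6, 92],   [30, 5, 5, 200],
    [10, 6, 7, 355], [7, 10, 10, 290], [21, 9, 11, 230],
    [55, 2, 5, 120], [54, 12, 4, 73],  [48, 5, 1, 205],
    [20, 5, 15, 400], [39, 4, 7, 320], [60, 8, 6, 72],
    [20, 5, 8, 272], [58, 7, 3, 94],   [40, 8, 11, 190],
    [27, 9, 8, 235], [30, 7, 5, 139],
], dtype=float)
X, y = data[:, :3], data[:, 3]
A = np.column_stack([np.ones(len(X)), X])

coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
n, k = len(y), X.shape[1]
mse = np.sum((y - A @ coeffs) ** 2) / (n - (k + 1))

# t statistic for each coefficient: b_i / se(b_i).
se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
t_stats = coeffs / se
for name, t in zip(["intercept", "temperature", "insulation", "furnace age"],
                   t_stats):
    print(f"{name}: t = {t:.2f}")
```

Each t statistic would then be compared with the two-tailed critical t value for n - (k + 1) = 16 degrees of freedom.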