COLLEGE OF ENGINEERING
DEPARTMENT OF INDUSTRIAL ENGINEERING
MODULE 5
SIMPLE LINEAR REGRESSION AND CORRELATION
I. Empirical Models

E(Y|x) = μ_{Y|x} = β₀ + β₁x
Table 1-1: Oxygen and Hydrocarbon Levels
Observation Number    Hydrocarbon Level x (%)    Purity y (%)
1 0.99 90.01
2 1.02 89.05
3 1.15 91.43
4 1.29 93.74
5 1.46 96.73
6 1.36 94.45
7 0.87 87.59
8 1.23 91.77
9 1.55 99.42
10 1.40 93.65
11 1.19 93.54
12 1.15 92.52
13 0.98 90.56
14 1.01 89.54
15 1.11 89.85
16 1.20 90.39
17 1.26 93.25
18 1.32 93.41
19 1.43 94.98
20 0.95 87.33
Figure 1. Scatter Diagram of Oxygen Purity vs Hydrocarbon
Level
where β₀ and β₁ are referred to as the intercept and slope of the
line; collectively, they are called the regression coefficients.

Y = β₀ + β₁x + ε        (Equation 1-1)

E(Y|x) = E(β₀ + β₁x + ε) = β₀ + β₁x + E(ε) = β₀ + β₁x

V(Y|x) = V(β₀ + β₁x + ε) = V(β₀ + β₁x) + V(ε) = 0 + σ² = σ²
The slope β₁ can be interpreted as the change in the mean of Y for
a unit change in x. Also, the variability of Y at a particular
value of x is determined by the error variance σ².

Y = β₀ + β₁x + ε
Figure 2. Deviations of the data from the estimated regression
model
yᵢ = β₀ + β₁xᵢ + εᵢ,   i = 1, 2, …, n

L = Σ εᵢ² = Σ (yᵢ − β₀ − β₁xᵢ)²     (all sums run over i = 1, …, n)

∂L/∂β₀ = −2 Σ (yᵢ − β̂₀ − β̂₁xᵢ) = 0

∂L/∂β₁ = −2 Σ (yᵢ − β̂₀ − β̂₁xᵢ)xᵢ = 0
Simplifying the two preceding partial derivatives, we get the
least-squares normal equations:
n β̂₀ + β̂₁ Σ xᵢ = Σ yᵢ

β̂₀ Σ xᵢ + β̂₁ Σ xᵢ² = Σ yᵢxᵢ        (Equations 1-2 and 1-3)
The solutions to the normal equations are:

β̂₀ = ȳ − β̂₁x̄

β̂₁ = [Σ yᵢxᵢ − (Σ yᵢ)(Σ xᵢ)/n] / [Σ xᵢ² − (Σ xᵢ)²/n]

The fitted regression line is:

ŷ = β̂₀ + β̂₁x

and each observation satisfies:

yᵢ = β̂₀ + β̂₁xᵢ + eᵢ,   i = 1, 2, …, n
where eᵢ = yᵢ − ŷᵢ is called the residual. The residual describes
the error in the fit of the model for each actual observation.
S_xx = Σ (xᵢ − x̄)² = Σ xᵢ² − (Σ xᵢ)²/n

S_xy = Σ yᵢ(xᵢ − x̄) = Σ xᵢyᵢ − (Σ xᵢ)(Σ yᵢ)/n

β̂₁ = S_xy / S_xx
EXAMPLE: Fit a simple linear regression model to the oxygen
purity data in Table 1-1.
n = 20,  Σ xᵢ = 23.92,  Σ yᵢ = 1,843.21,  x̄ = 1.196,  ȳ = 92.1605

Σ yᵢ² = 170,044.5321,  Σ xᵢ² = 29.2892,  Σ xᵢyᵢ = 2,214.6566
S_xx = Σ xᵢ² − (Σ xᵢ)²/n = 29.2892 − (23.92)²/20 = 0.68088

S_xy = Σ xᵢyᵢ − (Σ xᵢ)(Σ yᵢ)/n = 2,214.6566 − (23.92)(1,843.21)/20 = 10.17744

β̂₁ = S_xy / S_xx = 10.17744 / 0.68088 = 14.94748

β̂₀ = ȳ − β̂₁x̄ = 92.1605 − (14.94748)(1.196) = 74.28331

The fitted regression model is therefore:

ŷ = 74.283 + 14.947x
Using the regression model above, we would predict an oxygen
purity of ŷ = 89.23% when the hydrocarbon level is x = 1.00%.
The purity 89.23% may be interpreted either as an estimate of the
true population mean purity when x = 1.00%, or as an estimate of
a new observation when x = 1.00%. Such an estimate is, of course,
subject to error; we cannot expect a future observation of purity
at x = 1.00% to be exactly equal to 89.23%.
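The least-squares computation above can be sketched in plain Python. This block is an illustration, not part of the module; the variable names (b0, b1, y_at_1) are my own:

```python
# Least-squares fit of the oxygen purity data in Table 1-1,
# following the formulas for S_xx, S_xy, b1-hat and b0-hat above.
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]
n = len(x)
Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
b1 = Sxy / Sxx                      # estimated slope, about 14.947
b0 = sum(y) / n - b1 * sum(x) / n   # estimated intercept, about 74.283
y_at_1 = b0 + b1 * 1.00             # predicted purity at x = 1.00
```

Running this reproduces the fitted model ŷ = 74.283 + 14.947x and the predicted purity of about 89.23% at x = 1.00%.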
Estimating σ²

The sum of the squares of the residuals, which is often called the
error sum of squares, is equal to

SS_E = Σ eᵢ² = Σ (yᵢ − ŷᵢ)²

An unbiased estimator of σ² is:

σ̂² = SS_E / (n − 2)
However, computing SS_E using the equation that was just presented
can be very mind-numbing. Another way to compute SS_E is through:

SS_E = SS_T − β̂₁S_xy,   where SS_T = Σ (yᵢ − ȳ)² = Σ yᵢ² − nȳ²
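As a sketch (not part of the module), the shortcut can be checked numerically with the summary statistics of the oxygen purity example:

```python
# SS_E via the shortcut SS_E = SS_T - b1*S_xy, then sigma2-hat = SS_E/(n-2),
# using the summary statistics from the oxygen purity example.
n = 20
sum_y2 = 170_044.5321   # sum of y_i squared
ybar = 92.1605
Sxy = 10.17744
b1 = 14.94748
SST = sum_y2 - n * ybar ** 2   # total sum of squares, about 173.38
SSE = SST - b1 * Sxy           # error sum of squares, about 21.25
sigma2_hat = SSE / (n - 2)     # estimate of error variance, about 1.18
```

The value σ̂² ≈ 1.18 is used in the hypothesis tests and confidence intervals that follow.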
EXERCISES
selling price and annual taxes for 24 houses. The data are given in
the table immediately following.

Sale Price (in US$k)   Taxes (in US$k)   Sale Price (in US$k)   Taxes (in US$k)
25.9 4.9176 30.0 5.0500
29.5 5.0208 36.9 8.2464
27.9 4.5429 41.9 6.6969
25.9 4.5573 40.5 7.7841
29.9 5.0597 43.9 9.0384
29.9 3.8910 37.5 5.9894
30.9 5.8980 37.9 5.7422
28.9 5.6039 44.5 8.7951
35.9 5.8282 37.9 6.0831
31.5 5.3003 38.9 8.3607
31.0 6.2712 36.9 8.1400
30.9 5.9592 45.8 9.1416
b. Find the mean selling price given that the taxes paid are
x = 7.50.
c. Calculate the fitted value of y corresponding to x =
5.8980. Find the corresponding residual.
Recall that we have assumed that the error term ε in the model
Y = β₀ + β₁x + ε is a random variable with mean zero and
variance σ². Because the values of x are fixed and Y is a random
variable, the slope and intercept estimators have sampling
distributions, with estimated standard errors:

se(β̂₁) = √(σ̂² / S_xx)   and   se(β̂₀) = √(σ̂² [1/n + x̄²/S_xx])
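As a quick numerical sketch (not part of the module), the standard errors for the oxygen purity fit can be evaluated directly:

```python
import math

# Estimated standard errors for the oxygen purity fit, using
# sigma2-hat = 1.18 as the estimate of the error variance.
n, Sxx, xbar, sigma2_hat = 20, 0.68088, 1.196, 1.18
se_b1 = math.sqrt(sigma2_hat / Sxx)                   # about 1.316
se_b0 = math.sqrt(sigma2_hat * (1 / n + xbar ** 2 / Sxx))  # about 1.593
```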
Use of t-Tests

Suppose we wish to test the hypotheses:

H₀: β₁ = β₁,₀
H₁: β₁ ≠ β₁,₀

The test statistic is:

T₀ = (β̂₁ − β₁,₀) / √(σ̂²/S_xx) = (β̂₁ − β₁,₀) / se(β̂₁)

We reject H₀ if |t₀| > t_{α/2, n−2}.
Similarly, for the intercept:

H₀: β₀ = β₀,₀
H₁: β₀ ≠ β₀,₀

T₀ = (β̂₀ − β₀,₀) / √(σ̂² [1/n + x̄²/S_xx]) = (β̂₀ − β₀,₀) / se(β̂₀)

We reject H₀ if |t₀| > t_{α/2, n−2}.

An important special case is the test for significance of regression:

H₀: β₁ = 0
H₁: β₁ ≠ 0
EXAMPLE: Let us test the significance of regression using the
model for the oxygen purity data. Use α = 0.01.

Step 1: State the hypotheses:

H₀: β₁ = 0
H₁: β₁ ≠ 0

Step 2: Set the level of significance: α = 0.01

Step 3: Declare the test statistic:

T₀ = (β̂₁ − β₁,₀) / √(σ̂²/S_xx) = (β̂₁ − β₁,₀) / se(β̂₁)

Step 4: Compute the test statistic:

t₀ = (14.947 − 0) / √(1.18/0.68088) = 11.35
Step 5: Declare the critical region: reject H₀ if |t₀| > t_{0.005,18} = 2.878.

Since the computed test statistic is within the critical region, i.e.
11.35 > 2.878, we reject the null hypothesis.
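The test in the example above can be sketched in Python (an illustration, not part of the module; the critical value is taken from a t-table):

```python
import math

# Significance-of-regression t-test for the oxygen purity model,
# mirroring the worked example (alpha = 0.01, n = 20).
b1, Sxx, sigma2_hat = 14.947, 0.68088, 1.18
t0 = (b1 - 0) / math.sqrt(sigma2_hat / Sxx)   # about 11.35
t_crit = 2.878                 # t_{0.005,18} from a t-table
reject_h0 = abs(t0) > t_crit   # True: the regression is significant
```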
EXERCISES
6. A rocket motor is manufactured by bonding together two
types of propellants, an igniter and a sustainer. The shear
strength of the bond y is thought to be a linear function of the
age of the propellant x when the motor is cast. Data from twenty
observations are shown below:
V. Confidence Intervals
these parameters. The width of the confidence interval is a
measure of the overall quality of the regression line.

A 100(1 − α)% confidence interval on the slope β₁ is:

β̂₁ − t_{α/2,n−2} √(σ̂²/S_xx) ≤ β₁ ≤ β̂₁ + t_{α/2,n−2} √(σ̂²/S_xx)

Similarly, a 100(1 − α)% confidence interval on the intercept β₀ is:

β̂₀ − t_{α/2,n−2} √(σ̂² [1/n + x̄²/S_xx]) ≤ β₀ ≤ β̂₀ + t_{α/2,n−2} √(σ̂² [1/n + x̄²/S_xx])

For the oxygen purity model, a 95% confidence interval on the slope is:

β̂₁ − t_{0.025,18} √(σ̂²/S_xx) ≤ β₁ ≤ β̂₁ + t_{0.025,18} √(σ̂²/S_xx)

14.947 − 2.101 √(1.18/0.68088) ≤ β₁ ≤ 14.947 + 2.101 √(1.18/0.68088)

12.197 ≤ β₁ ≤ 17.697
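The slope interval can be sketched numerically (an illustration, not part of the module; small rounding differences from the printed bounds are expected):

```python
import math

# 95% confidence interval on the slope of the oxygen purity model.
b1, Sxx, sigma2_hat = 14.947, 0.68088, 1.18
t = 2.101                                 # t_{0.025,18}
half = t * math.sqrt(sigma2_hat / Sxx)    # half-width of the interval
lo, hi = b1 - half, b1 + half             # roughly 12.2 and 17.7
```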
Confidence Interval on the Mean Response

A 100(1 − α)% confidence interval on the mean response at x = x₀ is:

μ̂_{Y|x₀} − t_{α/2,n−2} √(σ̂² [1/n + (x₀ − x̄)²/S_xx]) ≤ μ_{Y|x₀} ≤ μ̂_{Y|x₀} + t_{α/2,n−2} √(σ̂² [1/n + (x₀ − x̄)²/S_xx])

EXAMPLE: Construct a 95% confidence interval on the mean oxygen
purity when x₀ = 1.00% using the fitted model.

μ̂_{Y|x₀} = 74.283 + 14.947(1.00) = 89.23

Then we construct the 95% confidence interval:

89.23 ± 2.101 √(1.18 [1/20 + (1.00 − 1.196)²/0.68088])

89.23 ± 0.75

88.48 ≤ μ_{Y|1.00} ≤ 89.98
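The interval on the mean response can be sketched as follows (an illustration, not part of the module):

```python
import math

# 95% CI on the mean oxygen purity at x0 = 1.00, as in the example above.
n, Sxx, xbar, sigma2_hat = 20, 0.68088, 1.196, 1.18
x0, t = 1.00, 2.101
mu_hat = 74.283 + 14.947 * x0                  # point estimate, 89.23
half = t * math.sqrt(sigma2_hat * (1 / n + (x0 - xbar) ** 2 / Sxx))
lo, hi = mu_hat - half, mu_hat + half          # about 88.48 and 89.98
```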
VI. Prediction of New Observations

The point estimate of a new observation Y₀ at x = x₀ is:

ŷ₀ = β̂₀ + β̂₁x₀

A new observation is independent of the observations used to fit
the model, while ŷ₀ depends only on the data used to fit the
regression model. Thus, we use a different equation to construct
the interval for predicting new values:

ŷ₀ − t_{α/2,n−2} √(σ̂² [1 + 1/n + (x₀ − x̄)²/S_xx]) ≤ y₀ ≤ ŷ₀ + t_{α/2,n−2} √(σ̂² [1 + 1/n + (x₀ − x̄)²/S_xx])
EXAMPLE: For the oxygen purity data, recall from the previous
example that ŷ₀ = 89.23 when x₀ = 1.00. Constructing the 95%
prediction interval gives us:

89.23 ± 2.101 √(1.18 [1 + 1/20 + (1.00 − 1.196)²/0.68088])

86.83 ≤ y₀ ≤ 91.63
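The prediction interval can be sketched the same way (an illustration, not part of the module); note the extra "1 +" term compared with the interval on the mean response:

```python
import math

# 95% prediction interval for a new purity observation at x0 = 1.00.
n, Sxx, xbar, sigma2_hat = 20, 0.68088, 1.196, 1.18
x0, t = 1.00, 2.101
y0_hat = 74.283 + 14.947 * x0
half = t * math.sqrt(sigma2_hat * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
lo, hi = y0_hat - half, y0_hat + half          # about 86.83 and 91.63
```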
EXERCISES
VII. Adequacy of the Regression Model
1. Residual Analysis
2. Coefficient of Determination (R2)
Residual Analysis
As can be seen from the two immediately preceding figures, the
assumptions on the residuals appear to be satisfied. The first
graph plots the residuals against the predicted values; the random
pattern, with no obvious structure, suggests that the model is
adequate and that the error variance is constant. The second
figure, on the other hand, is a normal probability plot of the
residuals. We can see that the residuals fall approximately along
a straight line, which implies that there is no severe departure
from normality.
To summarize:
Coefficient of Determination (R2)
The quantity

R² = SS_R / SS_T = 1 − SS_E / SS_T

is called the coefficient of determination, with 0 ≤ R² ≤ 1.
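For the oxygen purity model, R² can be sketched from the sums of squares computed earlier out of the Table 1-1 summary statistics (these values do not appear explicitly in the module's text):

```python
# Coefficient of determination for the oxygen purity model.
SST = 173.38          # total sum of squares
SSE = 21.25           # error sum of squares, SS_T - b1*S_xy
R2 = 1 - SSE / SST    # about 0.877: the model accounts for roughly
                      # 87.7% of the variability in purity
```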
VIII. Transformations to a Straight Line

Consider the exponential function:

Y = β₀ e^{β₁x} ε

Taking logarithms of both sides linearizes the model:

ln Y = ln β₀ + β₁x + ln ε

Another example is the reciprocal model:

Y = β₀ + β₁(1/x) + ε

If we let z = 1/x, the model is linearized to:

Y = β₀ + β₁z + ε

As a final example, consider:

Y = 1 / exp(β₀ + β₁x + ε)

If we let Y* = 1/Y, this becomes the linear model:

ln Y* = β₀ + β₁x + ε
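The log transformation can be sketched as follows. This is an illustration, not part of the module, and the data below are made up (roughly y = eˣ) purely to show the mechanics:

```python
import math

# Linearizing Y = beta0 * exp(beta1 * x) * eps by regressing ln(y) on x
# with the same least-squares formulas as before.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.7, 7.4, 20.1, 54.6, 148.4]     # illustrative data, roughly e^x
ly = [math.log(v) for v in y]         # transformed response ln(y)
n = len(x)
Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Sxy = sum(a * b for a, b in zip(x, ly)) - sum(x) * sum(ly) / n
b1 = Sxy / Sxx                                 # estimate of beta1
b0 = math.exp(sum(ly) / n - b1 * sum(x) / n)   # estimate of beta0
```

Since the data were generated from roughly y = eˣ, the fit recovers β₁ ≈ 1 and β₀ ≈ 1.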
IX. Correlation
The estimators of the intercept and slope when both X and Y are
random variables are identical to what was already discussed.
An additional estimator, though, can be computed in the case
where both X and Y are random variables: the estimator of the
correlation coefficient, ρ. Recall that the correlation
coefficient is defined as:

ρ = σ_XY / (σ_X σ_Y)
The sample correlation coefficient is:

R = Σ Yᵢ(Xᵢ − X̄) / √[ Σ (Xᵢ − X̄)² · Σ (Yᵢ − Ȳ)² ] = S_XY / √(S_XX · SS_T)

Note that R² = SS_R / SS_T = 1 − SS_E / SS_T, the coefficient of
determination.

To test the hypotheses:

H₀: ρ = 0
H₁: ρ ≠ 0

the appropriate test statistic is:

T₀ = R √(n − 2) / √(1 − R²)

We reject H₀ if |t₀| > t_{α/2, n−2}.
For testing:

H₀: ρ = ρ₀
H₁: ρ ≠ ρ₀

the approximate test statistic is:

Z₀ = (arctanh R − arctanh ρ₀) √(n − 3)

We reject the null hypothesis if the value of the test statistic
falls within the critical region |z₀| > z_{α/2}.
It is also possible to construct an approximate 100(1 − α)%
confidence interval for ρ:

tanh(arctanh r − z_{α/2}/√(n − 3)) ≤ ρ ≤ tanh(arctanh r + z_{α/2}/√(n − 3))
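The arctanh-based interval can be sketched as follows. This is an illustration, not part of the module; the sample values r = 0.8 and n = 30 are made up:

```python
import math

# Approximate 95% confidence interval for rho via the arctanh
# (Fisher z) transform.
r, n = 0.8, 30      # illustrative sample correlation and sample size
z = 1.96            # z_{0.025}
half = z / math.sqrt(n - 3)
lo = math.tanh(math.atanh(r) - half)   # about 0.618
hi = math.tanh(math.atanh(r) + half)   # about 0.901
```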
EXERCISES
10. The final test and exam averages for 20 randomly selected
students taking a course in engineering statistics and a course in
operations research follow. Assume that the final averages are
jointly normally distributed.
Statistics 86 75 69 75 90
OR 80 81 75 81 82
Statistics 94 83 86 71 65
OR 95 80 81 76 72
Statistics 84 71 62 90 83
OR 85 72 65 93 81
Statistics 75 71 76 84 97
OR 70 73 72 80 98
11. A random sample of 50 observations was made on the
diameter of spot welds and the corresponding weld shear
strength.