Topic10 Written
• Regression analysis is the part of statistics that investigates the relationship between
two or more variables related in a nondeterministic fashion.
• If the two variables are not deterministically related, then for a fixed value of x, there is
uncertainty in the value of y.
• For example, suppose we are investigating the relationship between the age of a child and
the size of the child's vocabulary, and we decide to select a child of age x = 5 years. Before
the selection, vocabulary size is a random variable Y. After a particular 5-year-old child has
been selected and tested, the observed value of Y could be, say, y = 2,000 words.
• Let x1, x2, ..., xn denote values of the independent variable for which observations are
made, and let Yi and yi, respectively, denote the random variable and observed value
associated with xi. The available bivariate data then consist of the n pairs
(x1, y1), (x2, y2), ..., (xn, yn).
Example 10.1. Visual and musculoskeletal problems associated with the use of visual display
terminals (VDTs) have become rather common in recent years. Some researchers have focused
on vertical gaze direction as a source of eye strain and irritation. The direction is known to be
closely related to ocular surface area (OSA). The accompanying representative data set on y =
OSA (cm²) and x = width of the palpebral fissure (i.e., the horizontal width of the eye opening,
in cm) is from the article “Analysis of Ocular Surface Area for Comfortable VDT Workstation
Layout” (Ergonomics, 1996: 877–884). The order of the observations was not given, so they
are listed in increasing order of x values.
Here are some things to notice about the data and plot:
• Several observations have identical x values yet different y values. For example, x8 = x9 =
, but y8 = and y9 = . Thus the value of y is not determined
solely by x but also by various other factors.
Example 10.2. Arsenic is found in many ground waters and some surface waters. Recent
health effects research has prompted the Environmental Protection Agency to reduce allowable
arsenic levels in drinking water so that many water systems are no longer compliant with
standards. This has spurred interest in the development of methods to remove arsenic. The
accompanying data on x = pH and y = arsenic removed (%) by a particular process were
read from a scatterplot in the article “Optimizing Arsenic Removal During Iron Removal:
Theoretical and Practical Considerations” (J. of Water Supply Res. and Tech., 2005: 545–560).
In plot (a), Minitab (a statistical software) selected the scale for both axes. In plot (b), the
scale for both axes was specified so that they would intersect near the origin. Plot (b) is clearly
much more crowded than plot (a), making it difficult to ascertain the general nature of any
relationship. For example, curvature can be overlooked in a crowded plot.
Definition 10.1. The simple linear regression model is
\[ Y = \beta_0 + \beta_1 x + \varepsilon \]
where ε is the random error term, assumed to be normally distributed with E(ε) = 0 and
Var(ε) = σ².
Assumptions:
1. E(Y |X = x) = β0 + β1 x (Linearity)
3. Y |X = x ∼ Normal (Normality)
Example 10.3. Suppose the relationship between applied stress x and time-to-failure y is
described by the simple linear regression model with true regression line y = 65 − 1.2x and
σ = 8.
a. State the distribution of time-to-failure for any fixed value of applied stress.
c. Find the probability that time-to-failure is more than 50 when applied stress is 20.
d. Find the probability that time-to-failure is more than 50 when applied stress is 25.
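The probabilities in parts (c) and (d) can be checked numerically. The sketch below assumes, as the model states, that Y given x is normal with mean 65 − 1.2x and standard deviation 8; the helper names `normal_cdf` and `prob_y_exceeds` are illustrative and not part of the notes.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_y_exceeds(threshold: float, x: float,
                   beta0: float = 65.0, beta1: float = -1.2,
                   sigma: float = 8.0) -> float:
    """P(Y > threshold | x) when Y ~ Normal(beta0 + beta1 * x, sigma^2)."""
    mean = beta0 + beta1 * x          # value of the true regression line at x
    z = (threshold - mean) / sigma    # standardize the threshold
    return 1.0 - normal_cdf(z)


p20 = prob_y_exceeds(50, 20)  # mean 41, so z = 9/8
p25 = prob_y_exceeds(50, 25)  # mean 35, so z = 15/8
print(round(p20, 4), round(p25, 4))
```

Because the slope is negative, raising the stress from 20 to 25 lowers the mean time-to-failure, so the second probability is the smaller of the two.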
• According to the principle of least squares, a line provides a good fit to the data if the
vertical distances (deviations) from the observed points to the line are small.
• The best-fit line is then the one having the smallest possible sum of squared deviations.
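Model:
\[ Y = \beta_0 + \beta_1 x + \varepsilon \]
Slope:
\[ \hat{\beta}_1 = b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} \]
where
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) \]
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \]
Intercept:
\[ \hat{\beta}_0 = b_0 = \bar{y} - b_1 \bar{x} \]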
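The slope and intercept formulas translate directly into code. The following is a minimal sketch using the computational forms of Sxy and Sxx; the function name `least_squares_line` and the five data points are hypothetical and chosen only for illustration.

```python
def least_squares_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Computational forms: S_xy = sum(x*y) - (sum x)(sum y)/n, S_xx = sum(x^2) - (sum x)^2/n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    sxx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    b1 = sxy / sxx                      # slope
    b0 = sum_y / n - b1 * sum_x / n     # intercept: y-bar minus b1 times x-bar
    return b0, b1


# Hypothetical data, not taken from any example in these notes.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares_line(x, y)
print(b0, b1)
```

A point prediction at a new value x* is then b0 + b1 * x*, which should only be trusted for x* inside the range of the observed x values.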
Example 10.4. The cetane number is a critical property in specifying the ignition quality of a
fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and
time-consuming. The article “Relating the Cetane Number of Biodiesel Fuels to Their Fatty
Acid Composition: A Critical Study” (J. of Automobile Engr., 2009: 565–583) included the
following data on x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The
iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil.
a. Assuming the simple linear regression model is appropriate, estimate the regression line
by finding the equation of the least squares line. Then, interpret the coefficients.
b. Calculate a point prediction of the cetane number for a future observation made when
iodine value is 100.
The least squares line should not be used to make a prediction for an x value much beyond the
range of the data, such as x = 40 or x = 150 in this example. The danger of extrapolation is
that the fitted relationship may not be valid for such x values. For the same reason, the
intercept often has no sensible interpretation when x = 0 lies well outside the range of the data.
Estimating σ² and σ
• The parameter σ² determines the amount of variability inherent in the regression model.
A large value of σ² will lead to observed points that are typically quite spread out about
the true regression line, whereas when σ² is small the observed points will tend to fall
very close to the true line.
The estimate of σ² is
\[ \hat{\sigma}^2 = s^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n - 2} = \frac{\sum_{i=1}^{n} e_i^2}{n - 2} \]
The estimate of σ is
\[ \hat{\sigma} = s = \sqrt{\hat{\sigma}^2} \]
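Example 10.5. Show that the error sum of squares (or residual sum of squares) satisfies
\[ \mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = S_{yy} - b_1 S_{xy} \]
where
\[ S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2 \]
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) \]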
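One possible route to the identity in Example 10.5 is sketched below. It assumes the residuals are defined as e_i = y_i − (b_0 + b_1 x_i), uses b_0 = ȳ − b_1 x̄ and b_1 = S_xy / S_xx, and writes S_xx for the sum of squared deviations of the x values; other derivations are equally valid.

```latex
\begin{align*}
e_i &= y_i - b_0 - b_1 x_i
     = (y_i - \bar{y}) - b_1 (x_i - \bar{x})
     && \text{since } b_0 = \bar{y} - b_1 \bar{x} \\
\mathrm{SSE} &= \sum_{i=1}^{n} \bigl[ (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \bigr]^2 \\
    &= S_{yy} - 2 b_1 S_{xy} + b_1^2 S_{xx} \\
    &= S_{yy} - 2 b_1 S_{xy} + b_1 S_{xy}
     && \text{since } b_1 S_{xx} = S_{xy} \\
    &= S_{yy} - b_1 S_{xy}
\end{align*}
```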
\[ \sum x = 890 \qquad \sum x^2 = 67182 \qquad n = 14 \]
\[ \sum y = 37.6 \qquad \sum y^2 = 103.54 \qquad \sum xy = 2234.3 \]
Compute an estimate of σ².
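The arithmetic for this exercise can be organized as follows. This is a sketch that plugs the summary statistics given above into the computational formulas for Sxx, Syy and Sxy and then uses SSE = Syy − b1 Sxy; the variable names are illustrative.

```python
from math import sqrt

# Summary statistics as given in the exercise.
n = 14
sum_x, sum_x2 = 890.0, 67182.0
sum_y, sum_y2 = 37.6, 103.54
sum_xy = 2234.3

sxx = sum_x2 - sum_x ** 2 / n        # S_xx
syy = sum_y2 - sum_y ** 2 / n        # S_yy
sxy = sum_xy - sum_x * sum_y / n     # S_xy

b1 = sxy / sxx                       # least squares slope
sse = syy - b1 * sxy                 # error sum of squares
mse = sse / (n - 2)                  # estimate of sigma^2
s = sqrt(mse)                        # estimate of sigma
print(b1, sse, mse, s)
```

The divisor is n − 2 rather than n because two parameters, β0 and β1, are estimated before the residuals can be computed.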
• In plot (a), all variation in y is explained. In plot (b), most variation is explained. And
in plot (c), little variation is explained.
• The error sum of squares SSE can be interpreted as a measure of how much variation in
y is left unexplained by the model.
• A quantitative measure of the total amount of variation in y is given by the total sum
of squares
\[ \mathrm{SST} = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2 . \]
• The coefficient of determination is r² = 1 − SSE/SST. Since 0 ≤ SSE ≤ SST, 0 ≤ r² ≤ 1.
The higher the value of r², the more successful the simple linear regression model is in
explaining y variation.
• If r² is small, an analyst will usually want to search for an alternative model (either a
nonlinear model or a multiple regression model that involves more than a single independent
variable) that can more effectively explain y variation.
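A short sketch of the computation follows. It assumes the standard definition r² = 1 − SSE/SST with SSE = Syy − b1 Sxy and SST = Syy; the function name `r_squared` and the data are hypothetical.

```python
def r_squared(x, y):
    """Coefficient of determination for the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    sse = syy - b1 * sxy   # variation left unexplained by the line
    sst = syy              # total variation in y
    return 1.0 - sse / sst


# Hypothetical data, not taken from any example in these notes.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_squared(x, y)
print(round(r2, 4))
```

For these nearly collinear points almost all of the variation in y is explained, so r² is close to 1.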
Example 10.8. Calculate and interpret the coefficient of determination for the iodine-cetane
number data (Example 10.4).
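• Recall that the values of the xi's are assumed to be chosen before the experiment is
performed, so only the Yi's are random. The estimator for β1 is then
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} , \]
where
\[ \hat{\beta}_1 \sim N\!\left( \beta_1 ,\ \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right) . \]
Replacing σ² by its estimate MSE gives the estimated standard error
$S_{\hat{\beta}_1} = \sqrt{\mathrm{MSE} / \sum_{i=1}^{n} (x_i - \bar{x})^2}$, and
\[ \frac{\hat{\beta}_1 - \beta_1}{S_{\hat{\beta}_1}} \sim t_{n-2} . \]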
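The t ratio above is the basis for tests about the slope. The sketch below computes the test statistic for a null value of β1, assuming the estimated standard error of the slope is the square root of MSE / Sxx; the function name, the data and the null values are hypothetical, and the critical value 3.182 is the tabulated t value for α/2 = 0.025 with 3 degrees of freedom.

```python
from math import sqrt


def slope_t_statistic(x, y, beta1_null=0.0):
    """Return (t, df) for testing H0: beta1 = beta1_null."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    mse = (syy - b1 * sxy) / (n - 2)   # estimate of sigma^2
    se_b1 = sqrt(mse / sxx)            # estimated standard error of the slope
    return (b1 - beta1_null) / se_b1, n - 2


# Hypothetical data, not taken from any example in these notes.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
t_stat, df = slope_t_statistic(x, y)   # H0: beta1 = 0
t_crit = 3.182                         # t table value for alpha/2 = 0.025, df = 3
reject_two_sided = abs(t_stat) >= t_crit
print(round(t_stat, 2), df, reject_two_sided)
```

For a one-sided alternative such as Ha: β1 > 0, the same statistic is compared with the upper-tail critical value for α instead of α/2. A nonzero null value is handled by passing it as `beta1_null`.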
Example 10.9. When damage to a timber structure occurs, it may be more economical to
repair the damaged area rather than replace the entire structure. The article “Simplified Model
for Strength Assessment of Timber Beams Joined by Bonded Plates” (J. of Materials in Civil
Engr., 2013: 980–990) investigated a particular strategy for repair. The accompanying data
were used by the authors of the article as a basis for fitting the simple linear regression model.
The dependent variable is y = rupture load (N) and the independent variable is x = anchorage
length (the additional length of material used to bond at the junction, in mm).
Sxx = 18,000    Syy = 331,839,568.9    Sxy = 2,225,580
b. The figure below shows a scatterplot of the data (also displayed in the cited article); there
appears to be a rather substantial positive linear relationship between the two variables.
Test if there is a positive linear relationship between the two variables. Use α = 0.05.
c. Test if there is a linear relationship between the two variables. Use α = 0.05.
d. Suppose it had previously been believed that when anchorage length increased by 1 mm,
the associated true average change in rupture load would be at most 100. Do the sample
data contradict this belief? State and test the relevant hypotheses using α = 0.05.
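\[ \hat{\beta}_0 + \hat{\beta}_1 x^* \sim N\!\left( \beta_0 + \beta_1 x^* ,\ \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] \right) . \]
A 100(1 − α)% confidence interval for the mean response β0 + β1 x* is
\[ \hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\, n-2} \sqrt{ \mathrm{MSE} \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] } \]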
• An investigator may wish to obtain an interval of plausible values for the value of Y
associated with some future observation when the independent variable has value x∗ –
Prediction Interval.
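\[ Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*) \sim N\!\left( 0 ,\ \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] \right) . \]
\[ \frac{Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*)}{S_{Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*)}} \sim t_{n-2} . \]
or equivalently, a 100(1 − α)% prediction interval for Y is
\[ \hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\, n-2} \sqrt{ \mathrm{MSE} \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] } \]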
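The two intervals differ only by the extra "1 +" under the square root, which the following sketch makes explicit. The function name, the data and x* = 3.5 are hypothetical; the critical value is passed in by hand (3.182 is the tabulated t value for α/2 = 0.025 with 3 degrees of freedom) so that only the standard library is needed.

```python
from math import sqrt


def mean_ci_and_pi(x, y, x_star, t_crit):
    """Return (fit, ci_half_width, pi_half_width) at x = x_star."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    mse = (syy - b1 * sxy) / (n - 2)
    fit = b0 + b1 * x_star                         # point estimate / prediction
    spread = 1 / n + (x_star - x_bar) ** 2 / sxx   # bracket term in the CI formula
    ci_half = t_crit * sqrt(mse * spread)          # confidence interval for the mean
    pi_half = t_crit * sqrt(mse * (1 + spread))    # prediction interval for a new Y
    return fit, ci_half, pi_half


# Hypothetical data, not taken from any example in these notes.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit, ci_half, pi_half = mean_ci_and_pi(x, y, x_star=3.5, t_crit=3.182)
print((fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half))
```

The prediction interval is always wider than the confidence interval at the same x*, because it accounts for the variability of a single future observation in addition to the uncertainty in the estimated line. Both intervals are narrowest at x* = x̄.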
Example 10.10. Corrosion of steel reinforcing bars is the most important durability problem
for reinforced concrete structures. Carbonation of concrete results from a chemical reaction
that lowers the pH value by enough to initiate corrosion of the rebar. Representative data
on x = carbonation depth (mm) and y = strength (MPa) for a sample of core specimens
taken from a particular building follows (read from a plot in the article “The Carbonation of
Concrete Structures in the Tropical Environment of Singapore,” Magazine of Concrete Res.,
1996: 293–300).
a. Calculate an interval with a confidence level of 95% for the true average strength of all
core specimens having a carbonation depth of 45 mm. Interpret the interval obtained.
b. Calculate an interval with confidence level of 95% for the strength value that would
result from selecting a single core specimen whose depth is 45 mm. Interpret the interval
obtained.
Note 10.1. There are cases where we would want to regress a dependent variable y on more
than one independent or predictor variable. In this case, we can perform multiple regression.
In multiple regression, the objective is to build a probabilistic model that relates a dependent
variable y to more than one independent or predictor variable. Let k represent the number
of predictor variables (k ≥ 2) and denote these predictors by x1, x2, ..., xk. For example, in
attempting to predict the selling price of a house, we might have k = 3 with x1 = size (ft²),
x2 = age (years), and x3 = number of rooms. The general additive multiple regression model
equation is
\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \]
where ε is assumed to be normally distributed with E(ε) = 0 and Var(ε) = σ².
Thus just as β0 + β1 x describes the mean Y value as a function of x in simple linear regression,
the true (or population) regression function β0 + β1 x1 + β2 x2 + ... + βk xk gives the expected
value of Y as a function of x1, x2, ..., xk. The βi's are the true (or population) regression
coefficients. The regression coefficient β1 is interpreted as the expected change in Y associated
with a 1-unit increase in x1 while x2, ..., xk are held fixed. Analogous interpretations hold for
β2, ..., βk.
The interested reader may refer to Chapter 13 of the main reference for further details.
10.5 Correlation
• There are many situations in which the objective in studying the joint behavior of
two variables is to see whether they are related, rather than to use one to predict the
value of the other.
Properties of r
1. The value of r does not depend on which of the two variables under study is labeled x
and which is labeled y.
3. −1 ≤ r ≤ 1.
4. r = 1 if and only if all (xi, yi) pairs lie on a straight line with positive slope, and r = −1
if and only if all (xi, yi) pairs lie on a straight line with negative slope.
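The properties listed above can be checked numerically. The sketch below assumes the usual formula for the sample correlation coefficient, r = Sxy / sqrt(Sxx · Syy), which is not written out in this part of the notes; the function name and the data are hypothetical.

```python
from math import sqrt


def sample_correlation(x, y):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Hypothetical data, not taken from any example in these notes.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r_xy = sample_correlation(x, y)
r_yx = sample_correlation(y, x)                       # roles of x and y swapped
r_line = sample_correlation([1, 2, 3], [2, 4, 6])     # points exactly on a line
print(round(r_xy, 4), round(r_yx, 4), r_line)
```

Swapping the two arguments leaves r unchanged because the formula is symmetric in x and y, and points lying exactly on a line with positive slope give r = 1.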
A frequently asked question is, “When can it be said that there is a strong correlation
between the variables, and when is the correlation weak?” Here is an informal rule of thumb for
characterizing the value of r:
Moderate either −0.8 < r < −0.5 or 0.5 < r < 0.8
Example 10.11. The article “Productivity Ratings Based on Soil Series” (Prof. Geographer,
1980: 158–163) presents data on corn yield x and peanut yield y (mT/Ha) for eight different
types of soil. Find the sample correlation coefficient r. How would you describe the correlation
between corn yield and peanut yield?
Exercises
Sections 12.1 to 12.4 of textbook: 2, 3, 7, 9(a)-(d), 11(a)(b), 12, 15(b)(c)(d), 17, 19, 21, 23(b),
27, 31, 33, 35, 38, 45, 48, 49, 53, 59(a)(b)(c)(d), 61(b)
2. scatter plot
4. residual
8. estimated intercept, estimated slope coefficient (know how to interpret this value!)
9. error sum of squares (SSE), total sum of squares (SST), regression sum of squares (SSR)
13. distinguish between the sample correlation coefficient r and the coefficient of
determination r²