
Topic 10: Linear Statistical Models

10.1 The Simple Linear Regression Model

• Regression analysis is the part of statistics that investigates the relationship between
two or more variables related in a nondeterministic fashion.

• The simplest deterministic mathematical relationship between two variables x and y is a


linear relationship
y = β0 + β1 x

• If the two variables are not deterministically related, then for a fixed value of x, there is
uncertainty in the value of y.

• For example, suppose we are investigating the relationship between age of child and size of
vocabulary and decide to select a child of age x = 5 years. Before the selection, vocabulary
size is a random variable Y . After a particular 5-year-old child has been selected and
tested, the observed value of Y could be, say, y = 2,000 words.

• In the example, age is called the independent, predictor, or explanatory variable,


while size of vocabulary is called the dependent or response variable.

• Let x1 , x2 , . . . , xn denote values of the independent variable for which observations are
made, and let Yi and yi , respectively, denote the random variable and observed value
associated with xi . The available bivariate data then consists of the n pairs

(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )

which can be visualized on a scatterplot.

Example 10.1. Visual and musculoskeletal problems associated with the use of visual display
terminals (VDTs) have become rather common in recent years. Some researchers have focused
on vertical gaze direction as a source of eye strain and irritation. The direction is known to be
closely related to ocular surface area (OSA). The accompanying representative data set on y =
OSA (cm2 ) and x = width of the palprebal fissure (i.e. the horizontal width of the eye opening,
in cm) is from the article “Analysis of Ocular Surface Area for Comfortable VDT Workstation
Layout” (Ergonomics, 1996: 877–884). The order of the observations was not given, so they
are listed in increasing order of x values.

Here are some things to notice about the data and plot:

• Several observations have identical x values yet different y values. For example, x8 = x9 =
____, but y8 = ____ and y9 = ____. Thus the value of y is not determined
solely by x but also by various other factors.

• There is a strong tendency for y to ____ as x increases. That is, larger
values of OSA tend to be associated with ____ values of fissure width – a ____
relationship between the variables.

• It appears that the value of y could be predicted from x by finding a ____
that is reasonably close to the points in the plot. In other words, there is evidence of a
substantial relationship between the two variables.

Example 10.2. Arsenic is found in many ground waters and some surface waters. Recent
health effects research has prompted the Environmental Protection Agency to reduce allowable
arsenic levels in drinking water so that many water systems are no longer compliant with
standards. This has spurred interest in the development of methods to remove arsenic. The
accompanying data on x = pH and y = arsenic removed (%) by a particular process were
read from a scatterplot in the article “Optimizing Arsenic Removal During Iron Removal:
Theoretical and Practical Considerations” (J. of Water Supply Res. and Tech., 2005: 545–560).

In plot (a), Minitab (a statistical software) selected the scale for both axes. In plot (b), the
scale for both axes was specified so that they would intersect near the origin. Plot (b) is clearly
much more crowded than plot (a), making it difficult to ascertain the general nature of any
relationship. For example, curvature can be overlooked in a crowded plot.

Observations from the plot:

• ____ values of arsenic removal tend to be associated with low pH, a ____
relationship.

• The two variables appear to be approximately ____ related, although the
points in the plot would spread out somewhat about any superimposed ____.

A Linear Probabilistic Model

Definition 10.1.

The Simple Linear Regression Model

Y = β0 + β1 x + ε

where ε is the random error term, assumed to be normally distributed with E(ε) = 0 and
Var(ε) = σ².

Assumptions:

1. E(Y |X = x) = β0 + β1 x (Linearity)

2. Var(Y |X = x) = σ 2 (Homogeneity of Variance)

3. Y |X = x ∼ Normal (Normality)

4. All observations are independent of one another. (Independence)



Example 10.3. Suppose the relationship between applied stress x and time-to-failure y is
described by the simple linear regression model with true regression line y = 65 − 1.2x and
σ = 8.

a. State the distribution of time-to-failure for any fixed value of applied stress.

b. State the distribution of time-to-failure when applied stress is 20.

c. Find the probability that time-to-failure is more than 50 when applied stress is 20.

d. Find the probability that time-to-failure is more than 50 when applied stress is 25.
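Parts (b)–(d) reduce to normal probabilities under the stated model. Here is a minimal sketch in Python (standard library only; `normal_sf` is a helper defined here, not a library function):

```python
from math import erf, sqrt

def normal_sf(y, mean, sd):
    """P(Y > y) for Y ~ N(mean, sd**2), via the standard normal CDF."""
    z = (y - mean) / sd
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

beta0, beta1, sigma = 65, -1.2, 8   # true regression line y = 65 - 1.2x

# (b) At x = 20, Y ~ N(65 - 1.2(20), 8^2) = N(41, 64)
mean_20 = beta0 + beta1 * 20
# (c) P(Y > 50 | x = 20) = P(Z > (50 - 41)/8)
p_20 = normal_sf(50, mean_20, sigma)
# (d) P(Y > 50 | x = 25): the mean drops to 35, so the probability shrinks
p_25 = normal_sf(50, beta0 + beta1 * 25, sigma)
print(mean_20, round(p_20, 4), round(p_25, 4))
```

Note that the same x-increase of 5 lowers the mean time-to-failure by 1.2 × 5 = 6, which is why the probability in (d) is smaller than in (c).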

10.2 Estimating Model Parameters


• Our estimate of y = β0 + β1 x should be a line that provides in some sense a best fit to the
observed data points. That is what motivates the principle of least squares, which
can be traced back to the German mathematician Gauss (1777 – 1855).

• According to this principle, a line provides a good fit to the data if the vertical distances
(deviations) from the observed points to the line are small.

• The measure of goodness-of-fit is the sum of the squares of these deviations,


Σ_{i=1}^{n} (yi − b0 − b1 xi)².

• The best-fit line is then the one having the smallest possible sum of squared deviations.

Model:
Y = β0 + β1 x + ε

Slope:
β̂1 = b1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = Sxy / Sxx

where
Sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = Σ xi yi − (1/n)(Σ xi)(Σ yi)

Sxx = Σ_{i=1}^{n} (xi − x̄)² = Σ xi² − (1/n)(Σ xi)²

Intercept:
β̂0 = b0 = ȳ − b1 x̄

Fitted Line or Least Squares Line:

ŷ = b0 + b1 x
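The least squares formulas above can be applied directly. A short sketch, using made-up toy data (not from any of the examples in these notes):

```python
def least_squares(x, y):
    """Slope and intercept of the least squares line via Sxy/Sxx."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# Toy data: y is roughly 2 + 3x plus noise
x = [1, 2, 3, 4, 5]
y = [5.1, 7.9, 11.2, 13.8, 17.0]
b0, b1 = least_squares(x, y)
print(round(b0, 3), round(b1, 3))
```

The fitted values are then ŷᵢ = b0 + b1·xᵢ, exactly as in the "Fitted Line" equation above.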

Example 10.4. The cetane number is a critical property in specifying the ignition quality of a
fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and
time-consuming. The article “Relating the Cetane Number of Biodiesel Fuels to Their Fatty
Acid Composition: A Critical Study” (J. of Automobile Engr., 2009: 565–583) included the
following data on x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The
iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil.

a. Assuming the simple linear regression model is appropriate, estimate the regression line
by finding the equation of the least squares line. Then, interpret the coefficients.

b. Calculate a point prediction of the cetane number for a future observation made when
iodine value is 100.

The least squares line should not be used to make a prediction for an x value much beyond the
range of the data, such as x = 40 or x = 150 in this example. The danger of extrapolation is
that the fitted relationship may not be valid for such x values. Therefore, many interpretations
of the intercept are not sensible.

Estimating σ 2 and σ

• The parameter σ 2 determines the amount of variability inherent in the regression model.
A large value of σ 2 will lead to observed points that are typically quite spread out about
the true regression line, whereas when σ 2 is small the observed points will tend to fall
very close to the true line.

• Plot (a) has a smaller σ² compared to plot (b).

• Fitted Values (Predicted Values): ŷi = b0 + b1 xi , for i = 1, 2, . . . , n.

• Residuals (Observed − Fitted): ε̂i = ei = yi − ŷi , for i = 1, 2, . . . , n.

The estimate of σ² is

σ̂² = s² = MSE = SSE / (n − 2) = Σ_{i=1}^{n} ei² / (n − 2)

The estimate of σ is

σ̂ = s = √MSE
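A sketch of the σ² estimate, continuing with the same made-up toy data (not from the notes' examples): fit the line, form the residuals, and divide the residual sum of squares by n − 2.

```python
# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [5.1, 7.9, 11.2, 13.8, 17.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Residuals and the MSE estimate of sigma^2 (n - 2 degrees of freedom)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in residuals)
mse = sse / (n - 2)     # estimate of sigma^2
s = mse ** 0.5          # estimate of sigma
print(round(mse, 5), round(s, 5))
```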

Example 10.5. Show that the error sum of squares (or, residual sum of squares)

SSE = Σ_{i=1}^{n} ei² = Syy − b1 Sxy

where
Syy = Σ_{i=1}^{n} (yi − ȳ)² = Σ yi² − (1/n)(Σ yi)²

Sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = Σ xi yi − (1/n)(Σ xi)(Σ yi)

Example 10.6. The article “Promising Quantitative Nondestructive Evaluation Techniques


for Composite Materials” (Materials Evaluation, 1985: 561–565) reports on a study to inves-
tigate how the propagation of an ultrasonic stress wave through a substance depends on the
properties of the substance. The accompanying data on fracture strength (x, as a percentage
of ultimate tensile strength) and attenuation (y, in neper/cm, the decrease in amplitude of the
stress wave) in fibreglass-reinforced polyester composites were read from a graph that appeared
in the article. The simple linear regression model is suggested by the substantial linear pattern
in the scatterplot.

The data and several summary statistics are given below:

Σx = 890     Σx² = 67182     n = 14
Σy = 37.6    Σy² = 103.54    Σxy = 2234.3

Compute an estimate of σ 2 .

The Coefficient of Determination

• The proportion of variation in y that is explained by the linear relationship with x.

• In plot (a), all variation in y is explained. In plot (b), most variation is explained. And
in plot (c), little variation is explained.

• The error sum of squares SSE can be interpreted as a measure of how much variation in
y is left unexplained by the model.

• A quantitative measure of the total amount of variation in y is given by the total sum
of squares

SST = Syy = Σ_{i=1}^{n} (yi − ȳ)² = Σ yi² − (1/n)(Σ yi)².

The coefficient of determination, denoted by r², is given by

r² = 1 − SSE / SST

• Since SSE ≤ SST, 0 ≤ r2 ≤ 1. The higher the value of r2 , the more successful is the
simple linear regression model in explaining y variation.

• If r2 is small, an analyst will usually want to search for an alternative model (either a non-
linear model or a multiple regression model that involves more than a single independent
variable) that can more effectively explain y variation.

Example 10.7. Starting with

yi − ȳ = (yi − ŷi) + (ŷi − ȳ)

show that SST = SSR + SSE, where SSR = Σ_{i=1}^{n} (ŷi − ȳ)².

Thus, the coefficient of determination can also be written as

r² = SSR / SST

where the regression sum of squares, SSR, is interpreted as the amount of total variation that is
explained by the model.
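The decomposition, and the agreement of the two forms of r², can be verified numerically. A sketch on the same made-up toy data used earlier (not from the notes' examples):

```python
# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [5.1, 7.9, 11.2, 13.8, 17.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / \
    (sum(a * a for a in x) - sum(x) ** 2 / n)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained

# SST = SSR + SSE holds (up to floating-point error),
# so 1 - SSE/SST and SSR/SST give the same r^2
assert abs(sst - (ssr + sse)) < 1e-8
r2 = 1 - sse / sst
print(round(r2, 4))
```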

Example 10.8. Calculate and interpret the coefficient of determination for the iodine-cetane
number data (Example 10.4).

10.3 Inferences About the Slope Parameter β1

• Recall that the values of xi ’s are assumed to be chosen before the experiment is performed,
so only the Yi ’s are random. The estimator for β1 is then
β̂1 = Σ_{i=1}^{n} (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^{n} (xi − x̄)²,

with sampling distribution

β̂1 ∼ N( β1 , σ² / Σ_{i=1}^{n} (xi − x̄)² ).

• The assumptions of the simple linear regression model imply

(β̂1 − β1) / S_β̂1 ∼ t_{n−2}.

• A summary of the t test and t interval for β1:

Assumptions Checking: Linearity, Homogeneity of Variance, Normality, and Independence.

Null Hypothesis:    H0: β1 = β10

Alt. Hypothesis:    Ha: β1 ≠ β10   |   Ha: β1 < β10   |   Ha: β1 > β10

Test Statistic:     t = (β̂1 − β10) / s_β̂1

P-value:            2P(t_{n−2} > |t|)   |   P(t_{n−2} < t)   |   P(t_{n−2} > t)

100(1 − α)% CI:     β̂1 ± t_{α/2, n−2} s_β̂1
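The mechanics of the test and interval can be sketched on the same made-up toy data used earlier (not from the notes' examples). The critical value t_{0.025, 3} = 3.182 is taken from a t table for df = n − 2 = 3:

```python
# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [5.1, 7.9, 11.2, 13.8, 17.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
b1 = sxy / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
s = (sse / (n - 2)) ** 0.5      # estimate of sigma
s_b1 = s / sxx ** 0.5           # standard error of the slope

# Test H0: beta1 = 0 against Ha: beta1 != 0
t_stat = (b1 - 0) / s_b1

# 95% CI for beta1; t_{0.025, 3} = 3.182 from a t table
t_crit = 3.182
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
print(round(t_stat, 2), round(ci[0], 3), round(ci[1], 3))
```

A large |t| (here far beyond 3.182) leads to rejecting H0: β1 = 0, consistent with the strong linear trend in the toy data.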



Example 10.9. When damage to a timber structure occurs, it may be more economical to
repair the damaged area rather than replace the entire structure. The article “Simplified Model
for Strength Assessment of Timber Beams Joined by Bonded Plates” (J. of Materials in Civil
Engr., 2013: 980–990) investigated a particular strategy for repair. The accompanying data
were used by the authors of the article as a basis for fitting the simple linear regression model.
The dependent variable is y = rupture load (N) and the independent variable is anchorage
length (the additional length of material used to bond at the junction, in mm).

x 50 50 80 80 110 110 140 140 170 170


y 17,052 14,063 26,264 19,600 21,952 26,362 26,362 26,754 31,654 32,928

Given the following summaries:

Σx = 1100    Σx² = 139,000    Σy = 242,991    Σy² = 6,236,302,177    Σxy = 28,954,590

Sxx = 18,000    Syy = 331,839,568.9    Sxy = 2,225,580

a. Compute a 95% confidence interval for β1 .



b. The figure below shows a scatterplot of the data (also displayed in the cited article); there
appears to be a rather substantial positive linear relationship between the two variables.

Test if there is a positive linear relationship between the two variables. Use α = 0.05.

c. Test if there is a linear relationship between the two variables. Use α = 0.05.

d. Suppose it had previously been believed that when anchorage length increased by 1 mm,
the associated true average change in rupture load would be at most 100. Do the sample
data contradict this belief? State and test the relevant hypotheses using α = 0.05.

10.4 Inferences Concerning µY and the Prediction of Future Y Values

• Let x* denote a specified value of the independent variable X; then

µ̂_{Y|X=x*} = β̂0 + β̂1 x*

is a point estimator of the true average value of Y when X = x*.

• Given that the Yi's are independent N(β0 + β1 xi, σ²) random variables, for i = 1, 2, . . . , n,

β̂0 + β̂1 x* ∼ N( β0 + β1 x* , σ² [ 1/n + (x* − x̄)² / Σ_{i=1}^{n} (xi − x̄)² ] ).

• The assumptions of the simple linear regression model imply

[ (β̂0 + β̂1 x*) − (β0 + β1 x*) ] / S_{β̂0 + β̂1 x*} ∼ t_{n−2}.

A 100(1 − α)% CI for µY, the expected value of Y when X = x*, is

β̂0 + β̂1 x* ± t_{α/2, n−2} s_{β̂0 + β̂1 x*}

or equivalently,

β̂0 + β̂1 x* ± t_{α/2, n−2} √( MSE [ 1/n + (x* − x̄)² / Σ_{i=1}^{n} (xi − x̄)² ] )

• The confidence interval is narrowest when x* = ____.

• The confidence interval becomes wider as x* moves ____ from x̄ in either
direction.

• An investigator may wish to obtain an interval of plausible values for the value of Y
associated with some future observation when the independent variable has value x* –
a Prediction Interval.

• There is more uncertainty in prediction than in estimation, so a prediction interval will
be ____ than a confidence interval for X = x*.

• Let Y denote a future value when X = x*; then the error of prediction is

Y − (β̂0 + β̂1 x*)

where the future value Y is independent of the observed Yi's.



• Given that the Yi's are independent N(β0 + β1 xi, σ²) random variables, for i = 1, 2, . . . , n,

Y − (β̂0 + β̂1 x*) ∼ N( 0 , σ² [ 1 + 1/n + (x* − x̄)² / Σ_{i=1}^{n} (xi − x̄)² ] ).

• The assumptions of the simple linear regression model imply

[ Y − (β̂0 + β̂1 x*) ] / S_{Y − (β̂0 + β̂1 x*)} ∼ t_{n−2}.

A 100(1 − α)% PI for a future Y observation to be made when X = x*, is

β̂0 + β̂1 x* ± t_{α/2, n−2} s_{Y − (β̂0 + β̂1 x*)}

or equivalently,

β̂0 + β̂1 x* ± t_{α/2, n−2} √( MSE [ 1 + 1/n + (x* − x̄)² / Σ_{i=1}^{n} (xi − x̄)² ] )
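The only difference between the two intervals is the extra "1 +" inside the radical, which accounts for the new observation's own error. A sketch on the same made-up toy data used earlier (not from the notes' examples), with t_{0.025, 3} = 3.182 from a t table:

```python
# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [5.1, 7.9, 11.2, 13.8, 17.0]
n = len(x)
xbar = sum(x) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
b1 = sxy / sxx
b0 = sum(y) / n - b1 * xbar
mse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x_star = 3.5
y_hat = b0 + b1 * x_star
t_crit = 3.182                  # t_{0.025, 3} from a t table

# CI for the mean response at x*: no "1 +" inside the radical
half_ci = t_crit * (mse * (1 / n + (x_star - xbar) ** 2 / sxx)) ** 0.5
# PI for a single future Y at x*: extra "1 +" for the new observation
half_pi = t_crit * (mse * (1 + 1 / n + (x_star - xbar) ** 2 / sxx)) ** 0.5
print(round(y_hat, 3), round(half_ci, 3), round(half_pi, 3))
```

Both intervals are centered at ŷ = b0 + b1·x*, but the PI half-width is always larger than the CI half-width.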

• Confidence Intervals versus Prediction Intervals:



Example 10.10. Corrosion of steel reinforcing bars is the most important durability problem
for reinforced concrete structures. Carbonation of concrete results from a chemical reaction
that lowers the pH value by enough to initiate corrosion of the rebar. Representative data
on x = carbonation depth (mm) and y = strength (MPa) for a sample of core specimens
taken from a particular building follows (read from a plot in the article “The Carbonation of
Concrete Structures in the Tropical Environment of Singapore,” Magazine of Concrete Res.,
1996: 293–300).

a. Calculate an interval with a confidence level of 95% for the true average strength of all
core specimens having a carbonation depth of 45 mm. Interpret the interval obtained.

b. Calculate an interval with confidence level of 95% for the strength value that would
result from selecting a single core specimen whose depth is 45 mm. Interpret the interval
obtained.

Note 10.1. There are cases where we would want to regress a dependent variable y on more
than one independent or predictor variable. In this case, we can perform multiple regression.
In multiple regression, the objective is to build a probabilistic model that relates a dependent
variable y to more than one independent or predictor variable. Let k represent the number
of predictor variables (k ≥ 2) and denote these predictors by x1 , x2 , . . . , xk . For example, in
attempting to predict the selling price of a house, we might have k = 3 with x1 = size (ft2 ),
x2 = age (years), and x3 = number of rooms. The general additive multiple regression model
equation is
Y = β0 + β1 x1 + β2 x2 + · · · + βk xk + ε

where ε is assumed to be normally distributed with E(ε) = 0 and Var(ε) = σ².

Let x1*, x2*, . . . , xk* be particular values of x1, . . . , xk. Then

µ_{Y·x1*, x2*, ..., xk*} = β0 + β1 x1* + · · · + βk xk*.

Thus, just as β0 + β1 x describes the mean Y value as a function of x in simple linear regression,
the true (or population) regression function β0 + β1 x1 + β2 x2 + · · · + βk xk gives the expected
value of Y as a function of x1, x2, . . . , xk. The βi's are the true (or population) regression
coefficients. The regression coefficient β1 is interpreted as the expected change in Y associated
with a 1-unit increase in x1 while x2, . . . , xk are held fixed. Analogous interpretations hold for
β2, . . . , βk.

The interested reader may refer to Chapter 13 of the main reference for further details.

10.5 Correlation

• There are many situations in which the objective in studying the joint behavior of
two variables is to see whether they are related, rather than to use one to predict the
value of the other.

The sample correlation coefficient for the n pairs (x1, y1), . . . , (xn, yn) is

r = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^{n} (xi − x̄)² · Σ_{i=1}^{n} (yi − ȳ)² ) = Sxy / √(Sxx Syy)

where Sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ).

Properties of r

1. The value of r does not depend on which of the two variables under study is labeled x
and which is labeled y.

2. The value of r is independent of the units in which x and y are measured.

3. −1 ≤ r ≤ 1.

4. r = 1 if and only if all (xi, yi) pairs lie on a straight line with positive slope, and r = −1
if and only if all (xi, yi) pairs lie on a straight line with negative slope.

A frequently asked question is, “When can it be said that there is a strong correlation be-
tween the variables, and when is the correlation weak?” Here is an informal rule of thumb for
characterizing the value of r:

Weak −0.5 ≤ r ≤ 0.5

Moderate either −0.8 < r < −0.5 or 0.5 < r < 0.8

Strong either r ≤ −0.8 or r ≥ 0.8
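As a sketch, r can be computed from the shortcut sums, here on the same made-up toy data used earlier (not the corn/peanut data of Example 10.11):

```python
# Toy data (illustrative only)
x = [1, 2, 3, 4, 5]
y = [5.1, 7.9, 11.2, 13.8, 17.0]
n = len(x)
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n
r = sxy / (sxx ** 0.5 * syy ** 0.5)
# For a simple linear regression fit, r**2 equals the
# coefficient of determination of Section 10.2
print(round(r, 4))
```

By the rule of thumb above, r ≥ 0.8 here would be described as a strong (positive) correlation.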

Example 10.11. The article “Productivity Ratings Based on Soil Series” (Prof. Geographer,
1980: 158–163) presents data on corn yield x and peanut yield y (mT/Ha) for eight different
types of soil. Find the sample correlation coefficient r. How would you describe the correlation
between corn yield and peanut yield?

Exercises
Sections 12.1 to 12.4 of textbook: 2, 3, 7, 9(a)-(d), 11(a)(b), 12, 15(b)(c)(d), 17, 19, 21, 23(b),
27, 31, 33, 35, 38, 45, 48, 49, 53, 59(a)(b)(c)(d), 61(b)

R output for Question 48:

As you work on your exercises, check your understanding of the following:


1. response/dependent variable; explanatory/factor/predictor/independent variable

2. scatter plot

3. actual/observed value; fitted/predicted value

4. residual

5. homogeneity of variance/homogeneous variance

6. line of mean values/true regression line

7. estimated regression equation/equation of least squares line

8. estimated intercept, estimated slope coefficient (know how to interpret this value!)

9. error sum of squares (SSE), total sum of squares (SST), regression sum of squares (SSR)

10. coefficient of determination (know how to interpret this value!)

11. distinguish between σ̂ = s and σ̂b1 = sb1 = sβˆ1

12. distinguish between confidence interval and prediction interval

13. distinguish between the sample correlation coefficient r and the coefficient of determina-
tion r2
