(15 July 2009) Space Shuttle Endeavour and its seven-member STS-127 crew head toward Earth orbit and rendezvous with the International Space Station. Courtesy NASA.

Simple Linear Regression and Correlation

The space shuttle Challenger accident in January 1986 was the result of the failure of O-rings used to seal field joints in the solid rocket motor due to the extremely low ambient temperatures at the time of launch. Prior to the launch there were data on the occurrence of O-ring failure and the corresponding temperature on 24 prior launches or static firings of the motor. In this chapter we will see how to build a statistical model relating the probability of O-ring failure to temperature. This model provides a measure of the risk associated with launching the shuttle at the low temperature occurring when Challenger was launched.

CHAPTER OUTLINE
11-1 EMPIRICAL MODELS
11-2 SIMPLE LINEAR REGRESSION
11-3 PROPERTIES OF THE LEAST SQUARES ESTIMATORS
11-4 HYPOTHESIS TESTS IN SIMPLE LINEAR REGRESSION
11-4.1 Use of t-Tests
11-4.2 Analysis of Variance Approach to Test Significance of Regression
11-5 CONFIDENCE INTERVALS
11-5.1 Confidence Intervals on the Slope and Intercept
11-5.2 Confidence Interval on the Mean Response
11-6 PREDICTION OF NEW OBSERVATIONS
11-7 ADEQUACY OF THE REGRESSION MODEL
11-7.1 Residual Analysis
11-7.2 Coefficient of Determination (R²)
11-8 CORRELATION
11-9 REGRESSION ON TRANSFORMED VARIABLES
11-10 LOGISTIC REGRESSION

LEARNING OBJECTIVES
After careful study of this chapter, you should be able to do the following:
1. Use simple linear regression for building empirical models for engineering and scientific data
2. Understand how the method of least squares is used to estimate the parameters in a linear regression model
3. Analyze residuals to determine whether the regression model is an adequate fit to the data or whether any underlying assumptions are violated
4. Test statistical hypotheses and construct confidence intervals on regression model parameters
5. Use the regression model to make a prediction of a future observation and construct an appropriate prediction interval on the future observation
6. Apply the correlation model
7. Use simple transformations to achieve a linear regression model

11-1 EMPIRICAL MODELS

Many problems in engineering and the sciences involve a study or analysis of the relationship between two or more variables. For example, the pressure of a gas in a container is related to the temperature, the velocity of water in an open channel is related to the width of the channel, and the displacement of a particle at a certain time is related to its velocity. In this last example, if we let d₀ be the displacement of the particle from the origin at time t = 0 and v be the velocity, then the displacement at time t is dₜ = d₀ + vt. This is an example of a deterministic linear relationship, because (apart from measurement errors) the model predicts displacement perfectly.

However, there are many situations where the relationship between variables is not deterministic. For example, the electrical energy consumption of a house (y) is related to the size of the house (x, in square feet), but it is unlikely to be a deterministic relationship. Similarly, the fuel usage of an automobile (y) is related to the vehicle weight x, but the relationship is not a deterministic one.
In both of these examples the value of the response of interest y (energy consumption, fuel usage) cannot be predicted perfectly from knowledge of the corresponding x. It is possible for different automobiles to have different fuel usage even if they weigh the same, and it is possible for different houses to use different amounts of electricity even if they are the same size.

The collection of statistical tools that are used to model and explore relationships between variables that are related in a nondeterministic manner is called regression analysis. Because problems of this type occur so frequently in many branches of engineering and science, regression analysis is one of the most widely used statistical tools. In this chapter we present the situation where there is only one independent or predictor variable x and the relationship with the response y is assumed to be linear. While this seems to be a simple scenario, there are many practical problems that fall into this framework.

For example, in a chemical process, suppose that the yield of the product is related to the process-operating temperature. Regression analysis can be used to build a model to predict yield at a given temperature level. This model can also be used for process optimization, such as finding the level of temperature that maximizes yield, or for process control purposes.

As an illustration, consider the data in Table 11-1. In this table y is the purity of oxygen produced in a chemical distillation process, and x is the percentage of hydrocarbons that are present in the main condenser of the distillation unit. Figure 11-1 presents a scatter diagram of the data in Table 11-1. This is just a graph on which each (xᵢ, yᵢ) pair is represented as a point plotted in a two-dimensional coordinate system. This scatter diagram was produced by Minitab, and we selected an option that shows dot diagrams of the x and y variables along the top and right margins of the graph, respectively, making it easy to see the distributions of the individual variables (box plots or histograms could also be selected). Inspection of this scatter diagram indicates that, although no simple curve will pass exactly through all the points, there is a strong indication that the points lie scattered randomly around a straight line. Therefore, it is probably reasonable to assume that the mean of the random variable Y is related to x by the following straight-line relationship:

E(Y|x) = μ_{Y|x} = β₀ + β₁x

where the slope and intercept of the line are called regression coefficients. While the mean of Y is a linear function of x, the actual observed value y does not fall exactly on a straight line. The appropriate way to generalize this to a probabilistic linear model is to assume that the expected value of Y is a linear function of x, but that for a fixed value of x the actual value of Y is determined by the mean value function (the linear model) plus a random error term, say,

Y = β₀ + β₁x + ε   (11-1)

Table 11-1 Oxygen and Hydrocarbon Levels

Observation   Hydrocarbon     Purity
Number        Level, x (%)    y (%)
 1            0.99            90.01
 2            1.02            89.05
 3            1.15            91.43
 4            1.29            93.74
 5            1.46            96.73
 6            1.36            94.45
 7            0.87            87.59
 8            1.23            91.77
 9            1.55            99.42
10            1.40            93.65
11            1.19            93.54
12            1.15            92.52
13            0.98            90.56
14            1.01            89.54
15            1.11            89.85
16            1.20            90.39
17            1.26            93.25
18            1.32            93.41
19            1.43            94.98
20            0.95            87.33

[Figure 11-1: Scatter diagram of oxygen purity versus hydrocarbon level from Table 11-1.]
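Figure 11-1 was produced with Minitab, but the plot is easy to reproduce elsewhere. The short sketch below is our added illustration, not part of the original text; it assumes NumPy and Matplotlib are available and uses the Table 11-1 data.

import matplotlib.pyplot as plt
import numpy as np

# Hydrocarbon level (%) and oxygen purity (%) from Table 11-1
x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

plt.scatter(x, y)
plt.xlabel("Hydrocarbon level, x (%)")
plt.ylabel("Oxygen purity, y (%)")
plt.title("Oxygen purity versus hydrocarbon level (Table 11-1)")
plt.show()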
where ε is the random error term. We will call this model the simple linear regression model, because it has only one independent variable or regressor. Sometimes a model like this will arise from a theoretical relationship. At other times, we will have no theoretical knowledge of the relationship between x and y, and the choice of the model is based on inspection of a scatter diagram, such as we did with the oxygen purity data. We then think of the regression model as an empirical model.

To gain more insight into this model, suppose that we can fix the value of x and observe the value of the random variable Y. Now if x is fixed, the random component ε on the right-hand side of the model in Equation 11-1 determines the properties of Y. Suppose that the mean and variance of ε are 0 and σ², respectively. Then,

E(Y|x) = E(β₀ + β₁x + ε) = β₀ + β₁x + E(ε) = β₀ + β₁x

Notice that this is the same relationship that we initially wrote down empirically from inspection of the scatter diagram in Fig. 11-1. The variance of Y given x is

V(Y|x) = V(β₀ + β₁x + ε) = V(β₀ + β₁x) + V(ε) = 0 + σ² = σ²

Thus, the true regression model μ_{Y|x} = β₀ + β₁x is a line of mean values; that is, the height of the regression line at any value of x is just the expected value of Y for that x. The slope, β₁, can be interpreted as the change in the mean of Y for a unit change in x. Furthermore, the variability of Y at a particular value of x is determined by the error variance σ². This implies that there is a distribution of Y-values at each x and that the variance of this distribution is the same at each x.

For example, suppose that the true regression model relating oxygen purity to hydrocarbon level is μ_{Y|x} = 75 + 15x, and suppose that the variance is σ² = 2. Figure 11-2 illustrates this situation. Notice that we have used a normal distribution to describe the random variation in ε. Since Y is the sum of a constant β₀ + β₁x (the mean) and a normally distributed random variable, Y is a normally distributed random variable. The variance σ² determines the variability in the observations Y on oxygen purity. Thus, when σ² is small, the observed values of Y will fall close to the line, and when σ² is large, the observed values of Y may deviate considerably from the line. Because σ² is constant, the variability in Y at any value of x is the same.

[Figure 11-2: The distribution of Y for a given value of x for the oxygen purity-hydrocarbon data.]

The regression model describes the relationship between oxygen purity Y and hydrocarbon level x. Thus, for any value of hydrocarbon level, oxygen purity has a normal distribution with mean 75 + 15x and variance 2. For example, if x = 1.25, Y has mean value μ_{Y|x} = 75 + 15(1.25) = 93.75 and variance 2.
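The line-of-mean-values idea can be checked by simulation. The following sketch is an added illustration (assuming NumPy); it uses the hypothetical model μ_{Y|x} = 75 + 15x with σ² = 2 from the text, draws many observations of Y at the fixed value x = 1.25, and confirms that their sample mean and variance are close to 93.75 and 2.

import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma2 = 75.0, 15.0, 2.0
x = 1.25                      # fixed regressor value

# Y = beta0 + beta1*x + eps, with eps ~ N(0, sigma2)
eps = rng.normal(0.0, np.sqrt(sigma2), size=100_000)
y = beta0 + beta1 * x + eps

print(y.mean())   # ~ 93.75 = E(Y | x = 1.25)
print(y.var())    # ~ 2.00  = V(Y | x)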
In most real-world problems, the values of the intercept and slope (β₀, β₁) and the error variance σ² will not be known, and they must be estimated from sample data. Then this fitted regression equation or model is typically used in prediction of future observations of Y, or for estimating the mean response at a particular level of x. To illustrate, a chemical engineer might be interested in estimating the mean purity of oxygen produced when the hydrocarbon level is 1.25%. This chapter discusses such procedures and applications for the simple linear regression model. Chapter 12 will discuss multiple linear regression models that involve more than one regressor.

Historical Note: Sir Francis Galton first used the term regression analysis in a study of the heights of fathers (x) and sons (y). Galton fit a least squares line and used it to predict the son's height from the father's height. He found that if a father's height was above average, the son's height would also be above average, but not by as much as the father's height was. A similar effect was observed for below-average heights. That is, the son's height "regressed" toward the average. Consequently, Galton referred to the least squares line as a regression line.

Abuses of Regression: Regression is widely used and frequently misused; several common abuses of regression are briefly mentioned here. Care should be taken in selecting variables with which to construct regression equations and in determining the form of the model. It is possible to develop statistically significant relationships among variables that are completely unrelated in a causal sense. For example, we might attempt to relate the shear strength of spot welds with the number of empty parking spaces in the visitor parking lot. A straight line may even appear to provide a good fit to the data, but the relationship is an unreasonable one on which to rely. You can't increase the weld strength by blocking off parking spaces. A strong observed association between variables does not necessarily imply that a causal relationship exists between those variables. This type of effect is encountered fairly often in retrospective data analysis, and even in observational studies. Designed experiments are the only way to determine cause-and-effect relationships.

Regression relationships are valid only for values of the regressor variable within the range of the original data. The linear relationship that we have tentatively assumed may be valid over the original range of x, but it may be unlikely to remain so as we extrapolate, that is, if we use values of x beyond that range. In other words, as we move beyond the range of values of x for which data were collected, we become less certain about the validity of the assumed model. Regression models are not necessarily valid for extrapolation purposes. Now this does not mean don't ever extrapolate. There are many problem situations in science and engineering where extrapolation of a regression model is the only way to even approach the problem. However, there is a strong warning to be careful.
A modest extrapolation may be perfectly all right in many cases, but a large extrapolation will almost never produce acceptable results.

11-2 SIMPLE LINEAR REGRESSION

The case of simple linear regression considers a single regressor variable or predictor variable x and a dependent or response variable Y. Suppose that the true relationship between Y and x is a straight line and that the observation Y at each level of x is a random variable. As noted previously, the expected value of Y for each value of x is

E(Y|x) = β₀ + β₁x

where the intercept β₀ and the slope β₁ are unknown regression coefficients. We assume that each observation, Y, can be described by the model

Y = β₀ + β₁x + ε   (11-2)

where ε is a random error with mean zero and (unknown) variance σ². The random errors corresponding to different observations are also assumed to be uncorrelated random variables.

Suppose that we have n pairs of observations (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). Figure 11-3 shows a typical scatter plot of observed data and a candidate for the estimated regression line. The estimates of β₀ and β₁ should result in a line that is (in some sense) a "best fit" to the data. The German scientist Karl Gauss (1777-1855) proposed estimating the parameters β₀ and β₁ in Equation 11-2 to minimize the sum of the squares of the vertical deviations in Fig. 11-3.

[Figure 11-3: Deviations of the data from the estimated regression model.]

We call this criterion for estimating the regression coefficients the method of least squares. Using Equation 11-2, we may express the n observations in the sample as

yᵢ = β₀ + β₁xᵢ + εᵢ,  i = 1, 2, ..., n   (11-3)

and the sum of the squares of the deviations of the observations from the true regression line is

L = Σᵢ₌₁ⁿ εᵢ² = Σᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ)²   (11-4)

The least squares estimators of β₀ and β₁, say, β̂₀ and β̂₁, must satisfy

∂L/∂β₀ evaluated at (β̂₀, β̂₁) = −2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ) = 0
∂L/∂β₁ evaluated at (β̂₀, β̂₁) = −2 Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)xᵢ = 0   (11-5)

Simplifying these two equations yields

nβ̂₀ + β̂₁ Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ yᵢ
β̂₀ Σᵢ₌₁ⁿ xᵢ + β̂₁ Σᵢ₌₁ⁿ xᵢ² = Σᵢ₌₁ⁿ yᵢxᵢ   (11-6)

Equations 11-6 are called the least squares normal equations. The solution to the normal equations results in the least squares estimators β̂₀ and β̂₁.

Least Squares Estimates: The least squares estimates of the intercept and slope in the simple linear regression model are

β̂₀ = ȳ − β̂₁x̄   (11-7)

β̂₁ = [Σᵢ₌₁ⁿ yᵢxᵢ − (Σᵢ₌₁ⁿ yᵢ)(Σᵢ₌₁ⁿ xᵢ)/n] / [Σᵢ₌₁ⁿ xᵢ² − (Σᵢ₌₁ⁿ xᵢ)²/n]   (11-8)

where ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ and x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ.

The fitted or estimated regression line is therefore

ŷ = β̂₀ + β̂₁x   (11-9)

Note that each pair of observations satisfies the relationship

yᵢ = β̂₀ + β̂₁xᵢ + eᵢ,  i = 1, 2, ..., n

where eᵢ = yᵢ − ŷᵢ is called the residual. The residual describes the error in the fit of the model to the ith observation yᵢ. Later in this chapter we will use the residuals to provide information about the adequacy of the fitted model.

Notationally, it is occasionally convenient to give special symbols to the numerator and denominator of Equation 11-8. Given data (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), let

S_xx = Σᵢ₌₁ⁿ (xᵢ − x̄)² = Σᵢ₌₁ⁿ xᵢ² − (Σᵢ₌₁ⁿ xᵢ)²/n   (11-10)

and

S_xy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) = Σᵢ₌₁ⁿ xᵢyᵢ − (Σᵢ₌₁ⁿ xᵢ)(Σᵢ₌₁ⁿ yᵢ)/n   (11-11)
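The step from the normal equations (11-6) to the estimates (11-7) and (11-8) is short and worth writing out; the following worked derivation is an added clarification. Dividing the first normal equation by n gives β̂₀ = ȳ − β̂₁x̄, which is Equation 11-7. Substituting this into the second normal equation gives

(ȳ − β̂₁x̄) Σᵢ₌₁ⁿ xᵢ + β̂₁ Σᵢ₌₁ⁿ xᵢ² = Σᵢ₌₁ⁿ xᵢyᵢ

so that

β̂₁ (Σᵢ₌₁ⁿ xᵢ² − n x̄²) = Σᵢ₌₁ⁿ xᵢyᵢ − n x̄ȳ

Since n x̄² = (Σᵢ₌₁ⁿ xᵢ)²/n and n x̄ȳ = (Σᵢ₌₁ⁿ xᵢ)(Σᵢ₌₁ⁿ yᵢ)/n, this is exactly β̂₁ = S_xy/S_xx, which is Equation 11-8 in the notation of Equations 11-10 and 11-11.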
EXAMPLE 11-1 Oxygen Purity

We will fit a simple linear regression model to the oxygen purity data in Table 11-1. The following quantities may be computed:

n = 20   Σᵢ₌₁²⁰ xᵢ = 23.92   Σᵢ₌₁²⁰ yᵢ = 1,843.21
x̄ = 1.1960   ȳ = 92.1605
Σᵢ₌₁²⁰ yᵢ² = 170,044.5321   Σᵢ₌₁²⁰ xᵢ² = 29.2892   Σᵢ₌₁²⁰ xᵢyᵢ = 2,214.6566

S_xx = Σᵢ₌₁²⁰ xᵢ² − (Σᵢ₌₁²⁰ xᵢ)²/20 = 29.2892 − (23.92)²/20 = 0.68088

and

S_xy = Σᵢ₌₁²⁰ xᵢyᵢ − (Σᵢ₌₁²⁰ xᵢ)(Σᵢ₌₁²⁰ yᵢ)/20 = 2,214.6566 − (23.92)(1,843.21)/20 = 10.17744

Therefore, the least squares estimates of the slope and intercept are

β̂₁ = S_xy/S_xx = 10.17744/0.68088 = 14.94748

and

β̂₀ = ȳ − β̂₁x̄ = 92.1605 − (14.94748)1.196 = 74.28331

The fitted simple linear regression model (with the coefficients reported to three decimal places) is

ŷ = 74.283 + 14.947x

This model is plotted in Fig. 11-4, along with the sample data.

[Figure 11-4: Scatter plot of oxygen purity versus hydrocarbon level from Table 11-1, with the fitted regression model.]

Practical Interpretation: Using the regression model, we would predict oxygen purity of ŷ = 89.23% when the hydrocarbon level is x = 1.00%. The purity 89.23% may be interpreted as an estimate of the true population mean purity when x = 1.00%, or as an estimate of a new observation when x = 1.00%. These estimates are, of course, subject to error.
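The computations in Example 11-1 are easy to verify. The sketch below is an added illustration (assuming NumPy); it implements Equations 11-7, 11-8, 11-10, and 11-11 directly on the Table 11-1 data and then evaluates the fitted line at x = 1.00.

import numpy as np

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

n = len(x)
Sxx = np.sum(x**2) - np.sum(x)**2 / n            # Equation 11-10
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # Equation 11-11
b1 = Sxy / Sxx                                   # slope, Equation 11-8
b0 = y.mean() - b1 * x.mean()                    # intercept, Equation 11-7

print(b1, b0)          # about 14.94748 and 74.28331
print(b0 + b1 * 1.00)  # predicted purity at x = 1.00, about 89.23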
1.80, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.47, 2.19, 2.26, 2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43, 2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.70, 2.72, 2.57
(a) Find the least squares estimates of the slope and the intercept in the simple linear regression model. Find an estimate of σ².
(b) Estimate the mean length of dugongs at age 11.
(c) Obtain the fitted values ŷᵢ that correspond to each observed value yᵢ. Plot ŷᵢ versus yᵢ, and comment on what this plot would look like if the linear relationship between length and age were perfectly deterministic (no error). Does this plot indicate that age is a reasonable choice of regressor variable in this model?

11-16. Consider the regression model developed in Exercise 11-2.
(a) Suppose that temperature is measured in °C rather than °F. Write the new regression model.
(b) What change in expected pavement deflection is associated with a 1°C change in surface temperature?

11-17. Consider the regression model developed in Exercise 11-6. Suppose that engine displacement is measured in cubic centimeters instead of cubic inches.
(a) Write the new regression model.
(b) What change in gasoline mileage is associated with a 1 cm³ change in engine displacement?

11-18. Show that in a simple linear regression model the point (x̄, ȳ) lies exactly on the least squares regression line.

11-19. Consider the simple linear regression model Y = β₀ + β₁x + ε. Suppose that the analyst wants to use z = x − x̄ as the regressor variable.
(a) Using the data in Exercise 11-11, construct one scatter plot of the (xᵢ, yᵢ) points and then another of the (zᵢ = xᵢ − x̄, yᵢ) points. Use the two plots to intuitively explain how the two models, Y = β₀ + β₁x + ε and Y = β₀* + β₁*z + ε, are related.
(b) Find the least squares estimates of β₀* and β₁* in the model Y = β₀* + β₁*z + ε. How do they relate to the least squares estimates β̂₀ and β̂₁?

11-20. Suppose we wish to fit a regression model for which the true regression line passes through the point (0, 0). The appropriate model is Y = βx + ε. Assume that we have n pairs of data (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ).
(a) Find the least squares estimate of β.
(b) Fit the model Y = βx + ε to the chloride concentration-roadway area data in Exercise 11-10. Plot the fitted model on a scatter diagram of the data and comment on the appropriateness of the model.

11-3 PROPERTIES OF THE LEAST SQUARES ESTIMATORS

The statistical properties of the least squares estimators β̂₀ and β̂₁ may be easily described. Recall that we have assumed that the error term ε in the model Y = β₀ + β₁x + ε is a random variable with mean zero and variance σ². Since the values of x are fixed, Y is a random variable with mean μ_{Y|x} = β₀ + β₁x and variance σ². Therefore, the values of β̂₀ and β̂₁ depend on the observed y's; thus, the least squares estimators of the regression coefficients may be viewed as random variables. We will investigate the bias and variance properties of the least squares estimators β̂₀ and β̂₁.

Consider first β̂₁. Because β̂₁ is a linear combination of the observations Yᵢ, we can use properties of expectation to show that the expected value of β̂₁ is

E(β̂₁) = β₁   (11-15)

Thus, β̂₁ is an unbiased estimator of the true slope β₁.

Now consider the variance of β̂₁. Since we have assumed that V(εᵢ) = σ², it follows that V(Yᵢ) = σ². Because β̂₁ is a linear combination of the observations Yᵢ, the results in Section 5-5 can be applied to show that

V(β̂₁) = σ²/S_xx   (11-16)
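Unbiasedness and the variance formula (11-16) can be made concrete with a small Monte Carlo experiment. The sketch below is an added illustration; the values β₀ = 74.283, β₁ = 14.947, and σ² = 1.18 are simply borrowed from the oxygen purity model to serve as a "true" model, and data are generated repeatedly at the fixed x-values of Table 11-1.

import numpy as np

rng = np.random.default_rng(7)
x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
beta0, beta1, sigma2 = 74.283, 14.947, 1.18
Sxx = np.sum((x - x.mean())**2)

slopes = []
for _ in range(20_000):
    y = beta0 + beta1 * x + rng.normal(0, np.sqrt(sigma2), size=x.size)
    slopes.append(np.sum((x - x.mean()) * y) / Sxx)   # beta1-hat for this sample

slopes = np.array(slopes)
print(slopes.mean())   # ~ beta1 = 14.947   (Equation 11-15)
print(slopes.var())    # ~ sigma2/Sxx       (Equation 11-16)
print(sigma2 / Sxx)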
For the intercept, we can show in a similar manner that

E(β̂₀) = β₀   and   V(β̂₀) = σ²[1/n + x̄²/S_xx]   (11-17)

Thus, β̂₀ is an unbiased estimator of the intercept β₀. The covariance of the random variables β̂₀ and β̂₁ is not zero. It can be shown (see Exercise 11-98) that cov(β̂₀, β̂₁) = −σ²x̄/S_xx.

The estimate of σ² could be used in Equations 11-16 and 11-17 to provide estimates of the variance of the slope and the intercept. We call the square roots of the resulting variance estimators the estimated standard errors of the slope and intercept, respectively.

Estimated Standard Errors: In simple linear regression the estimated standard error of the slope and the estimated standard error of the intercept are

se(β̂₁) = √(σ̂²/S_xx)   and   se(β̂₀) = √(σ̂²[1/n + x̄²/S_xx])

respectively, where σ̂² is computed from Equation 11-13.

The Minitab computer output in Table 11-2 reports the estimated standard errors of the slope and intercept under the column heading "SE Coef."
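For the oxygen purity fit, the estimated standard errors take one line each to compute. The fragment below is an added sketch; it assumes the values n = 20, x̄ = 1.196, S_xx = 0.68088, and σ̂² = 1.18 used throughout this chapter's examples.

import math

n, xbar, Sxx, sigma2_hat = 20, 1.196, 0.68088, 1.18

se_b1 = math.sqrt(sigma2_hat / Sxx)                    # estimated standard error of slope
se_b0 = math.sqrt(sigma2_hat * (1/n + xbar**2 / Sxx))  # estimated standard error of intercept

print(se_b1)   # about 1.317
print(se_b0)   # about 1.593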
11-4 HYPOTHESIS TESTS IN SIMPLE LINEAR REGRESSION

An important part of assessing the adequacy of a linear regression model is testing statistical hypotheses about the model parameters and constructing certain confidence intervals. Hypothesis testing in simple linear regression is discussed in this section, and Section 11-5 presents methods for constructing confidence intervals. To test hypotheses about the slope and intercept of the regression model, we must make the additional assumption that the error component in the model, ε, is normally distributed. Thus, the complete assumptions are that the errors are normally and independently distributed with mean zero and variance σ², abbreviated NID(0, σ²).

11-4.1 Use of t-Tests

Suppose we wish to test the hypothesis that the slope equals a constant, say, β₁,₀. The appropriate hypotheses are

H₀: β₁ = β₁,₀   H₁: β₁ ≠ β₁,₀   (11-18)

where we have assumed a two-sided alternative. Since the errors εᵢ are NID(0, σ²), it follows directly that the observations Yᵢ are NID(β₀ + β₁xᵢ, σ²). Now β̂₁ is a linear combination of independent normal random variables, and consequently, β̂₁ is N(β₁, σ²/S_xx), using the bias and variance properties of the slope discussed in Section 11-3. In addition, (n − 2)σ̂²/σ² has a chi-square distribution with n − 2 degrees of freedom, and β̂₁ is independent of σ̂². As a result of those properties, the statistic

Test Statistic:   T₀ = (β̂₁ − β₁,₀)/√(σ̂²/S_xx)   (11-19)

follows the t distribution with n − 2 degrees of freedom under H₀: β₁ = β₁,₀. We would reject H₀: β₁ = β₁,₀ if

|t₀| > t_{α/2,n−2}   (11-20)

where t₀ is computed from Equation 11-19. The denominator of Equation 11-19 is the standard error of the slope, so we could write the test statistic as

T₀ = (β̂₁ − β₁,₀)/se(β̂₁)

A similar procedure can be used to test hypotheses about the intercept. To test

H₀: β₀ = β₀,₀   H₁: β₀ ≠ β₀,₀   (11-21)

we would use the statistic

Test Statistic:   T₀ = (β̂₀ − β₀,₀)/√(σ̂²[1/n + x̄²/S_xx]) = (β̂₀ − β₀,₀)/se(β̂₀)   (11-22)

and reject the null hypothesis if the computed value of this test statistic, t₀, is such that |t₀| > t_{α/2,n−2}. Note that the denominator of the test statistic in Equation 11-22 is just the standard error of the intercept.

A very important special case of the hypotheses of Equation 11-18 is

H₀: β₁ = 0   H₁: β₁ ≠ 0   (11-23)

These hypotheses relate to the significance of regression. Failure to reject H₀: β₁ = 0 is equivalent to concluding that there is no linear relationship between x and Y. This situation is illustrated in Fig. 11-5. Note that this may imply either that x is of little value in explaining the variation in Y and that the best estimator of Y for any x is ŷ = ȳ [Fig. 11-5(a)] or that the true relationship between x and Y is not linear [Fig. 11-5(b)]. Alternatively, if H₀: β₁ = 0 is rejected, this implies that x is of value in explaining the variability in Y (see Fig. 11-6). Rejecting H₀: β₁ = 0 could mean either that the straight-line model is adequate [Fig. 11-6(a)] or that, although there is a linear effect of x, better results could be obtained with the addition of higher order polynomial terms in x [Fig. 11-6(b)].

[Figure 11-5: The hypothesis H₀: β₁ = 0 is not rejected. Figure 11-6: The hypothesis H₀: β₁ = 0 is rejected.]

EXAMPLE 11-2 Oxygen Purity Tests of Coefficients

We will test for significance of regression using the model for the oxygen purity data from Example 11-1. The hypotheses are

H₀: β₁ = 0   H₁: β₁ ≠ 0

and we will use α = 0.01. From Example 11-1 and Table 11-2 we have

β̂₁ = 14.947,  n = 20,  S_xx = 0.68088,  σ̂² = 1.18

so the t-statistic in Equation 11-19 becomes

t₀ = β̂₁/√(σ̂²/S_xx) = β̂₁/se(β̂₁) = 14.947/√(1.18/0.68088) = 11.35

Practical Interpretation: Since the reference value of t is t_{0.005,18} = 2.88, the value of the test statistic is very far into the critical region, implying that H₀: β₁ = 0 should be rejected. There is strong evidence to support this claim. The P-value for this test is P ≈ 1.23 × 10⁻⁹. This was obtained manually with a calculator.

Table 11-2 presents the Minitab output for this problem. Notice that the t-statistic value for the slope is computed as 11.35 and that the reported P-value is P = 0.000. Minitab also reports the t-statistic for testing the hypothesis H₀: β₀ = 0. This statistic is computed from Equation 11-22, with β₀,₀ = 0, as t₀ = 46.62. Clearly, then, the hypothesis that the intercept is zero is rejected.
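The arithmetic of Example 11-2, including the P-value that the text notes was obtained manually, can be checked with SciPy. This is an added sketch, not Minitab output; it assumes SciPy is installed and reuses the quantities from the example.

import math
from scipy import stats

b1, Sxx, sigma2_hat, n = 14.947, 0.68088, 1.18, 20

t0 = b1 / math.sqrt(sigma2_hat / Sxx)          # Equation 11-19 with beta_1,0 = 0
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)    # two-sided P-value

print(t0)        # about 11.35
print(p_value)   # about 1.2e-09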
EXAMPLE 11-3 Oxygen Purity ANOVA ‘We will use the analysis of variance approach to test for sigifi- ‘The analysis of variance for testing H: 8; = 0 is sum- cance of regression using the oxygen purty data model fom marized inthe Minitab output in Table 11-2. The test statistic Example 11-1, Recall that SS, = 17338, B, = 14.947, S,,= 18 f = MSy/MS- = 152.13/1.18 = 128.86, for which we 10.1744, and n = 20, The regression sum of squares is find that the P-value is P = 1.23 x 10™, so we conclude that B, isnot zero S5q = BiSy = (14947)10.17744 = 152.13, ‘Thece are frequently minor differences in terminology among computer packages. For example, sometimes the re- and the error sum of squares is “gression sum of squares is called the “model” sum of squares, and the error sum of squares is called the “residual” sum of S5p = SSp = Sp = 173K — 152.13 = 21.25 squares. Table 11-3 Analysis of Variance for Testing Significance of Regression Souree of Sum oF Degrees of Mean Variation Squares Freedom Square hy Regression 1 MS, MSx/MSe Enror 85; — BS, n-2 MS, Talal nol Note that My = @°11-4 HYPOTHESIS TESTS IN SIMPLE LINEAR REGRESSION 419) Note that the analysis of variance procedure for testing for significance of regression is «equivalent to the -test in Section 11-4.1. Thatis, either procedure will lead tothe same conclusions. This is easy to demonstrate by starting wit the test statisti in Equation 11-19 with By = 0, say n= (27) ViPS. Squating both sides of Equation 11-27 and using the fact that 2 ~ MS, results in o BiSy _ MSp OMS. Se MSs mes Note that 73 in Equation 11-28 is identical to Fy in Equation 11-26, Iti true, in general, that the square of ar random variable with v degrees of freedom is an F random variable, with one and v degrees of freedom in the numerator and denominator, respectively. Thus, the test using To is equivalent to the test based on F, Note, however, that the (test is somewhat more flexible in that it would allow testing against a one-sided alternative hypothesis, while the F-test is, restricted to a two-sided alternative. EXERCISES FOR SECTION 11-4 11-21, Consider the computer output below. (@ Fill in the missing information, You may use bounds for the P-values (b) Can you conclude that the model defines a useful linear relationship? ‘The regression equation is Y= 129 + 234% Prsdsor Cos «SE Cosh TP) What youre of Contant 2857108135. onsder the dts fom Exo 1-1 on x= “compressive strength and y = intrinsic permeability of concrete. S= 148111 R-Sq = 98.1% — R—Sqfadj) (a) Test for significance of regression using « = 0.05. Find Anas Vs the Poa fr ht Can yosconto hat the dl ann bess sep se at elnino eo Region 1 Sing 9124877 Exim oad he stand deviation of Res 5 ss (What the snd or ft ner idl 11-24. Consider the data from Exercise 11-2 on x = road (2) Fill in the missing information. You may use bounds for the P-values. (b) Can you conclude that the model defines # useful linear relationship? (©) What is your estimate of o*? 11-22, Consider the computer output below. ‘The regression equation is Y=268 + 148% Predictor Cool SECoel = P Constant 26.753 2373 2 2 x 14756 0.1063 ” 2 S = 270040 R-Sq= 93.7% — R-Sq (adi) = 93.2% Analysis of Variance Souree DF ss oMS oF P Regression 1 ? > 2 9 Residual Error? 948 Total 1s 1500.0 ‘way surface temperature and y = pavement deflection, (a) Test for significance of regression using a ~ 0.05. Find the P-value for this test. What conclusions can you draw? (b) Estimate the standard errors of the slope and intercept. 11-25. 
EXERCISES FOR SECTION 11-4

11-21. Consider the computer output below.

The regression equation is
Y = 12.9 + 2.34 X

Predictor   Coef     SE Coef   T   P
Constant    12.857   1.032     ?   ?
X           2.3445   0.1150    ?   ?

S = 1.48111   R-Sq = 98.1%   R-Sq(adj) = ?

Analysis of Variance
Source           DF   SS       MS       F   P
Regression       1    912.43   912.43   ?   ?
Residual Error   ?    ?        ?
Total            ?    ?

(a) Fill in the missing information. You may use bounds for the P-values.
(b) Can you conclude that the model defines a useful linear relationship?
(c) What is your estimate of σ²?

11-22. Consider the computer output below.

The regression equation is
Y = 26.8 + 1.48 X

Predictor   Coef     SE Coef   T   P
Constant    26.753   2.373     ?   ?
X           1.4756   0.1063    ?   ?

S = 2.70040   R-Sq = 93.7%   R-Sq(adj) = 93.2%

Analysis of Variance
Source           DF   SS       MS   F   P
Regression       1    ?        ?    ?   ?
Residual Error   ?    94.8     ?
Total            15   1500.0

(a) Fill in the missing information. You may use bounds for the P-values.
(b) Can you conclude that the model defines a useful linear relationship?
(c) What is your estimate of σ²?

11-23. Consider the data from Exercise 11-1 on x = compressive strength and y = intrinsic permeability of concrete.
(a) Test for significance of regression using α = 0.05. Find the P-value for this test. Can you conclude that the model specifies a useful linear relationship between these two variables?
(b) Estimate σ² and the standard deviation of β̂₁.
(c) What is the standard error of the intercept in this model?

11-24. Consider the data from Exercise 11-2 on x = roadway surface temperature and y = pavement deflection.
(a) Test for significance of regression using α = 0.05. Find the P-value for this test. What conclusions can you draw?
(b) Estimate the standard errors of the slope and intercept.

11-25. Consider the National Football League data in Exercise 11-3.
(a) Test for significance of regression using α = 0.01. Find the P-value for this test. What conclusions can you draw?
(b) Estimate the standard errors of the slope and intercept.
(c) Test H₀: β₁ = 10 versus H₁: β₁ ≠ 10 with α = 0.01. Would you agree with the statement that this is a test of the hypothesis that a one-yard increase in the average yards per attempt results in a mean increase of 10 rating points?

11-26. Consider the data from Exercise 11-4 on y = sales price and x = taxes paid.
(a) Test H₀: β₁ = 0 using the t-test; use α = 0.05.
(b) Test H₀: β₁ = 0 using the analysis of variance with α = 0.05. Discuss the relationship of this test to the test from part (a).
(c) Estimate the standard errors of the slope and intercept.
(d) Test the hypothesis that β₀ = 0.

11-27. Consider the data from Exercise 11-5 on y = steam usage and x = average temperature.
(a) Test for significance of regression using α = 0.01. What is the P-value for this test? State the conclusions that result from this test.
(b) Estimate the standard errors of the slope and intercept.
(c) Test the hypothesis H₀: β₁ = 10 versus H₁: β₁ ≠ 10 using α = 0.01. Find the P-value for this test.
(d) Test H₀: β₀ = 0 versus H₁: β₀ ≠ 0 using α = 0.01. Find the P-value for this test and draw conclusions.

11-28. Consider the data from Exercise 11-6 on y = highway gasoline mileage and x = engine displacement.
(a) Test for significance of regression using α = 0.01. Find the P-value for this test. What conclusions can you reach?
(b) Estimate the standard errors of the slope and intercept.
(c) Test H₀: β₁ = −0.05 versus H₁: β₁ < −0.05 using α = 0.01 and draw conclusions. What is the P-value for this test?
(d) Test the hypothesis H₀: β₀ = 0 versus H₁: β₀ ≠ 0 using α = 0.01. What is the P-value for this test?

11-29. Consider the data from Exercise 11-7 on y = green liquor Na₂S concentration and x = production in a paper mill.
(a) Test for significance of regression using α = 0.05. Find the P-value for this test.
(b) Estimate the standard errors of the slope and intercept.
(c) Test H₀: β₀ = 0 versus H₁: β₀ ≠ 0 using α = 0.05. What is the P-value for this test?

11-30. Consider the data from Exercise 11-8 on y = blood pressure rise and x = sound pressure level.
(a) Test for significance of regression using α = 0.05. What is the P-value for this test?
(b) Estimate the standard errors of the slope and intercept.
(c) Test H₀: β₀ = 0 versus H₁: β₀ ≠ 0 using α = 0.05. Find the P-value for this test.

11-31. Consider the data from Exercise 11-11 on y = shear strength of a propellant and x = propellant age.
(a) Test for significance of regression with α = 0.01. Find the P-value for this test.
(b) Estimate the standard errors of β̂₀ and β̂₁.
(c) Test H₀: β₁ = −30 versus H₁: β₁ ≠ −30 using α = 0.01. What is the P-value for this test?
(d) Test H₀: β₀ = 0 versus H₁: β₀ ≠ 0 using α = 0.01. What is the P-value for this test?
(e) Test H₀: β₀ = 2500 versus H₁: β₀ > 2500 using α = 0.01. What is the P-value for this test?

11-32. Consider the data from Exercise 11-10 on y = chloride concentration in surface streams and x = roadway area.
(a) Test the hypothesis H₀: β₁ = 0 versus H₁: β₁ ≠ 0 using the analysis of variance procedure with α = 0.01.
(b) Find the P-value for the test in part (a).
(c) Estimate the standard errors of β̂₁ and β̂₀.
(d) Test H₀: β₀ = 0 versus H₁: β₀ ≠ 0 using α = 0.01. What conclusions can you draw? Does it seem that the model might be a better fit to the data if the intercept were removed?
11-33. Consider the data in Exercise 11-13 on y = oxygen demand and x = time.
(a) Test for significance of regression using α = 0.01. Find the P-value for this test. What conclusions can you draw?
(b) Estimate the standard errors of the slope and intercept.
(c) Test the hypothesis that β₀ = 0.

11-34. Consider the data in Exercise 11-14 on y = deflection and x = stress level.
(a) Test for significance of regression using α = 0.01. What is the P-value for this test? State the conclusions that result from this test.
(b) Does this model appear to be adequate?
(c) Estimate the standard errors of the slope and intercept.

11-35. An article in The Journal of Clinical Endocrinology and Metabolism ["Simultaneous and Continuous 24-Hour Plasma and Cerebrospinal Fluid Leptin Measurements: Dissociation of Concentrations in Central and Peripheral Compartments" (2004, Vol. 89, pp. 258-265)] studied the demographics of simultaneous and continuous 24-hour plasma and cerebrospinal fluid leptin measurements. The data follow:

y = BMI (kg/m²):  19.92  20.59  29.02  20.78  25.97  20.39  33.29  17.27  35.24
x = Age (yr):     45.5   34.6   40.6   32.9   28.2   30.1   51     33.3   47.0

(a) Test for significance of regression using α = 0.05. Find the P-value for this test. Can you conclude that the model specifies a useful linear relationship between these two variables?
(b) Estimate σ² and the standard deviation of β̂₁.
(c) What is the standard error of the intercept in this model?

11-36. Suppose that each value of xᵢ is multiplied by a positive constant a, and each value of yᵢ is multiplied by another positive constant b. Show that the t-statistic for testing H₀: β₁ = 0 versus H₁: β₁ ≠ 0 is unchanged in value.

11-37. The type II error probability for the t-test for H₀: β₁ = β₁,₀ can be computed in a similar manner to the t-tests of Chapter 9. If the true value of β₁ is β₁′, the value d = |β₁,₀ − β₁′|/(σ√((n − 1)/S_xx)) is calculated and used as the horizontal scale factor on the operating characteristic curves for the t-test (Appendix Charts VIIe through VIIh) and the type II error probability is read from the vertical scale using the curve for n − 2 degrees of freedom. Apply this procedure to the football data of Exercise 11-3, using σ = 5.5 and β₁′ = 12.5, where the hypotheses are H₀: β₁ = 10 versus H₁: β₁ ≠ 10.

11-38. Consider the no-intercept model Y = βx + ε with the ε's NID(0, σ²). The estimate of σ² is s² = Σᵢ₌₁ⁿ (yᵢ − β̂xᵢ)²/(n − 1) and V(β̂) = σ²/Σᵢ₌₁ⁿ xᵢ².
(a) Devise a test statistic for H₀: β = 0 versus H₁: β ≠ 0.
(b) Apply the test in (a) to the model from Exercise 11-20.

11-5 CONFIDENCE INTERVALS

11-5.1 Confidence Intervals on the Slope and Intercept

In addition to point estimates of the slope and intercept, it is possible to obtain confidence interval estimates of these parameters. The width of these confidence intervals is a measure of the overall quality of the regression line. If the error terms, εᵢ, in the regression model are normally and independently distributed,

(β̂₁ − β₁)/√(σ̂²/S_xx)   and   (β̂₀ − β₀)/√(σ̂²[1/n + x̄²/S_xx])

are both distributed as t random variables with n − 2 degrees of freedom. This leads to the following definition of 100(1 − α)% confidence intervals on the slope and intercept.

Confidence Intervals on Parameters: Under the assumption that the observations are normally and independently distributed, a 100(1 − α)% confidence interval on the slope β₁ in simple linear regression is

β̂₁ − t_{α/2,n−2}√(σ̂²/S_xx) ≤ β₁ ≤ β̂₁ + t_{α/2,n−2}√(σ̂²/S_xx)   (11-29)

Similarly, a 100(1 − α)% confidence interval on the intercept β₀ is

β̂₀ − t_{α/2,n−2}√(σ̂²[1/n + x̄²/S_xx]) ≤ β₀ ≤ β̂₀ + t_{α/2,n−2}√(σ̂²[1/n + x̄²/S_xx])   (11-30)

EXAMPLE 11-4 Oxygen Purity Confidence Interval on the Slope

We will find a 95% confidence interval on the slope of the regression line using the data in Example 11-1. Recall that β̂₁ = 14.947, S_xx = 0.68088, and σ̂²
= 1.18 (see Table 11-2). Then, from Equation 11-29 we find

β̂₁ − t_{0.025,18}√(σ̂²/S_xx) ≤ β₁ ≤ β̂₁ + t_{0.025,18}√(σ̂²/S_xx)

or

14.947 − 2.101√(1.18/0.68088) ≤ β₁ ≤ 14.947 + 2.101√(1.18/0.68088)

This simplifies to

12.181 ≤ β₁ ≤ 17.713

Practical Interpretation: This CI does not include zero, so there is strong evidence (at α = 0.05) that the slope is not zero. The CI is reasonably narrow (±2.766) because the error variance is fairly small.

11-5.2 Confidence Interval on the Mean Response

A confidence interval may be constructed on the mean response at a specified value of x, say, x₀. This is a confidence interval about E(Y|x₀) = μ_{Y|x₀} and is often called a confidence interval about the regression line. Since E(Y|x₀) = μ_{Y|x₀} = β₀ + β₁x₀, we may obtain a point estimate of the mean of Y at x = x₀, that is, of μ_{Y|x₀}, from the fitted model as

μ̂_{Y|x₀} = β̂₀ + β̂₁x₀

Now μ̂_{Y|x₀} is an unbiased point estimator of μ_{Y|x₀}, since β̂₀ and β̂₁ are unbiased estimators of β₀ and β₁. The variance of μ̂_{Y|x₀} is

V(μ̂_{Y|x₀}) = σ²[1/n + (x₀ − x̄)²/S_xx]

This last result follows from the fact that μ̂_{Y|x₀} = ȳ + β̂₁(x₀ − x̄) and cov(Ȳ, β̂₁) = 0. The zero covariance result is left as a mind-expanding exercise. Also, μ̂_{Y|x₀} is normally distributed, because β̂₀ and β̂₁ are normally distributed, and if we use σ̂² as an estimate of σ², it is easy to show that

(μ̂_{Y|x₀} − μ_{Y|x₀}) / √(σ̂²[1/n + (x₀ − x̄)²/S_xx])

has a t distribution with n − 2 degrees of freedom. This leads to the following confidence interval definition.

Confidence Interval on the Mean Response: A 100(1 − α)% confidence interval about the mean response at the value of x = x₀, say μ_{Y|x₀}, is given by

μ̂_{Y|x₀} − t_{α/2,n−2}√(σ̂²[1/n + (x₀ − x̄)²/S_xx]) ≤ μ_{Y|x₀} ≤ μ̂_{Y|x₀} + t_{α/2,n−2}√(σ̂²[1/n + (x₀ − x̄)²/S_xx])   (11-31)

where μ̂_{Y|x₀} = β̂₀ + β̂₁x₀ is computed from the fitted regression model.

Note that the width of the CI for μ_{Y|x₀} is a function of the value specified for x₀. The interval width is a minimum for x₀ = x̄ and widens as |x₀ − x̄| increases.

EXAMPLE 11-5 Oxygen Purity Confidence Interval on the Mean Response

We will construct a 95% confidence interval about the mean response for the data in Example 11-1. The fitted model is μ̂_{Y|x₀} = 74.283 + 14.947x₀, and the 95% confidence interval on μ_{Y|x₀} is found from Equation 11-31 as

μ̂_{Y|x₀} ± 2.101√(1.18[1/20 + (x₀ − 1.1960)²/0.68088])

Suppose that we are interested in predicting mean oxygen purity when x₀ = 1.00%. Then

μ̂_{Y|1.00} = 74.283 + 14.947(1.00) = 89.23

and the 95% confidence interval is

89.23 ± 2.101√(1.18[1/20 + (1.00 − 1.196)²/0.68088])

or

89.23 ± 0.75

Therefore, the 95% CI on μ_{Y|1.00} is

88.48 ≤ μ_{Y|1.00} ≤ 89.98

This is a reasonably narrow CI. Minitab will also perform these calculations. Refer to Table 11-2. The predicted value of y at x = 1.00 is shown along with the 95% CI on the mean of y at this level of x.

By repeating these calculations for several different values for x₀, we can obtain confidence limits for each corresponding value of μ_{Y|x₀}. Figure 11-7 displays the scatter diagram with the fitted model and the corresponding 95% confidence limits plotted as the upper and lower lines.

[Figure 11-7: Scatter diagram of oxygen purity data from Example 11-1 with fitted regression line and 95 percent confidence limits on μ_{Y|x₀}.]
The 95% confidence level applies only to the interval obtained at one value of x and not to the entire set of x-levels. Notice that the width of the confidence interval on μ_{Y|x₀} increases as |x₀ − x̄| increases.

11-6 PREDICTION OF NEW OBSERVATIONS

An important application of a regression model is predicting new or future observations Y corresponding to a specified level of the regressor variable x. If x₀ is the value of the regressor variable of interest,

Ŷ₀ = β̂₀ + β̂₁x₀   (11-32)

is the point estimator of the new or future value of the response Y₀.

Now consider obtaining an interval estimate for this future observation Y₀. This new observation is independent of the observations used to develop the regression model. Therefore, the confidence interval for μ_{Y|x₀} in Equation 11-31 is inappropriate, since it is based only on the data used to fit the regression model. The confidence interval about μ_{Y|x₀} refers to the true mean response at x = x₀ (that is, a population parameter), not to future observations.

Let Y₀ be the future observation at x = x₀, and let Ŷ₀ given by Equation 11-32 be the estimator of Y₀. Note that the error in prediction

e_p = Y₀ − Ŷ₀

is a normally distributed random variable with mean zero and variance

V(e_p) = V(Y₀ − Ŷ₀) = σ²[1 + 1/n + (x₀ − x̄)²/S_xx]

because Y₀ is independent of Ŷ₀. If we use σ̂² to estimate σ², we can show that

(Y₀ − Ŷ₀) / √(σ̂²[1 + 1/n + (x₀ − x̄)²/S_xx])

has a t distribution with n − 2 degrees of freedom. From this we can develop the following prediction interval definition.

Prediction Interval: A 100(1 − α)% prediction interval on a future observation Y₀ at the value x₀ is given by

ŷ₀ − t_{α/2,n−2}√(σ̂²[1 + 1/n + (x₀ − x̄)²/S_xx]) ≤ Y₀ ≤ ŷ₀ + t_{α/2,n−2}√(σ̂²[1 + 1/n + (x₀ − x̄)²/S_xx])   (11-33)

The value ŷ₀ is computed from the regression model ŷ₀ = β̂₀ + β̂₁x₀.

Notice that the prediction interval is of minimum width at x₀ = x̄ and widens as |x₀ − x̄| increases. By comparing Equation 11-33 with Equation 11-31, we observe that the prediction interval at the point x₀ is always wider than the confidence interval at x₀. This results because the prediction interval depends on both the error from the fitted model and the error associated with future observations.

EXAMPLE 11-6 Oxygen Purity Prediction Interval

To illustrate the construction of a prediction interval, suppose we use the data in Example 11-1 and find a 95% prediction interval on the next observation of oxygen purity at x₀ = 1.00%. Using Equation 11-33 and recalling from Example 11-5 that ŷ₀ = 89.23, we find that the prediction interval is

89.23 − 2.101√(1.18[1 + 1/20 + (1.00 − 1.196)²/0.68088]) ≤ Y₀ ≤ 89.23 + 2.101√(1.18[1 + 1/20 + (1.00 − 1.196)²/0.68088])

which simplifies to

86.83 ≤ y₀ ≤ 91.63

This is a reasonably narrow prediction interval. Minitab will also calculate prediction intervals. Refer to the output in Table 11-2. The 95% PI on the future observation at x₀ = 1.00 is shown in the display.

By repeating the foregoing calculations at different levels of x₀, we may obtain the 95% prediction intervals shown graphically as the lower and upper lines about the fitted regression model in Fig. 11-8. Notice that this graph also shows the 95% confidence limits on μ_{Y|x₀} calculated in Example 11-5. It illustrates that the prediction limits are always wider than the confidence limits.

[Figure 11-8: Scatter diagram of oxygen purity data from Example 11-1 with fitted regression line, 95% prediction limits (outer lines), and 95% confidence limits on μ_{Y|x₀}.]
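Examples 11-4, 11-5, and 11-6 differ only in the standard error that multiplies t_{α/2,n−2}. The sketch below is an added illustration (assuming SciPy); it computes all three intervals for the oxygen purity model at x₀ = 1.00.

import math
from scipy import stats

n, xbar, Sxx, sigma2 = 20, 1.196, 0.68088, 1.18
b0, b1 = 74.283, 14.947
t = stats.t.ppf(0.975, df=n - 2)          # t_{0.025,18} = 2.101

# 95% CI on the slope (Equation 11-29)
half = t * math.sqrt(sigma2 / Sxx)
print(b1 - half, b1 + half)               # about (12.18, 17.71)

x0 = 1.00
y0_hat = b0 + b1 * x0                     # about 89.23

# 95% CI on the mean response at x0 (Equation 11-31)
half = t * math.sqrt(sigma2 * (1/n + (x0 - xbar)**2 / Sxx))
print(y0_hat - half, y0_hat + half)       # about (88.48, 89.98)

# 95% prediction interval on a new observation at x0 (Equation 11-33)
half = t * math.sqrt(sigma2 * (1 + 1/n + (x0 - xbar)**2 / Sxx))
print(y0_hat - half, y0_hat + half)       # about (86.83, 91.63)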
EXERCISES FOR SECTIONS 11-5 AND 11-6

11-39. Refer to the data in Exercise 11-1 on y = intrinsic permeability of concrete and x = compressive strength. Find a 95% confidence interval on each of the following:
(a) Slope
(b) Intercept
(c) Mean permeability when x = 2.5
(d) Find a 95% prediction interval on permeability when x = 2.5. Explain why this interval is wider than the interval in part (c).

11-40. Exercise 11-2 presented data on roadway surface temperature x and pavement deflection y. Find a 99% confidence interval on each of the following:
(a) Slope
(b) Intercept
(c) Mean deflection when temperature x = 85°F
(d) Find a 99% prediction interval on pavement deflection when the temperature is 90°F.

11-41. Refer to the NFL quarterback ratings data in Exercise 11-3. Find a 95% confidence interval on each of the following:
(a) Slope
(b) Intercept
(c) Mean rating when the average yards per attempt is 8.0
(d) Find a 95% prediction interval on the rating when the average yards per attempt is 8.0.

11-42. Refer to the data on y = house selling price and x = taxes paid in Exercise 11-4. Find a 95% confidence interval on each of the following:
(a) β₁
(b) β₀
(c) Mean selling price when the taxes paid are x = 7.50
(d) Compute the 95% prediction interval for selling price when the taxes paid are x = 7.50.

11-43. Exercise 11-5 presented data on y = steam usage and x = monthly average temperature.
(a) Find a 99% confidence interval for β₁.
(b) Find a 99% confidence interval for β₀.
(c) Find a 95% confidence interval on mean steam usage when the average temperature is 55°F.
(d) Find a 95% prediction interval on steam usage when temperature is 55°F. Explain why this interval is wider than the interval in part (c).

11-44. Exercise 11-6 presented gasoline mileage performance for 21 cars, along with information about the engine displacement. Find a 95% confidence interval on each of the following:
(a) Slope
(b) Intercept
(c) Mean highway gasoline mileage when the engine displacement is x = 150 in³
(d) Construct a 95% prediction interval on highway gasoline mileage when the engine displacement is x = 150 in³.

11-45. Consider the data in Exercise 11-7 on y = green liquor Na₂S concentration and x = production in a paper mill. Find a 99% confidence interval on each of the following:
(a) β₁
(b) β₀
(c) Mean Na₂S concentration when production x = 910 tons/day
(d) Find a 99% prediction interval on Na₂S concentration when x = 910 tons/day.

11-46. Exercise 11-8 presented data on y = blood pressure rise and x = sound pressure level. Find a 95% confidence interval on each of the following:
(a) β₁
(b) β₀
(c) Mean blood pressure rise when the sound pressure level is 85 decibels
(d) Find a 95% prediction interval on blood pressure rise when the sound pressure level is 85 decibels.

11-47. Refer to the data in Exercise 11-9 on y = wear volume of mild steel and x = oil viscosity. Find a 95% confidence interval on each of the following:
(a) Intercept
(b) Slope
(c) Mean wear when oil viscosity x = 30

11-48. Exercise 11-10 presented data on chloride concentration y and roadway area x on watersheds in central Rhode Island. Find a 99% confidence interval on each of the following:
(a) β₁
(b) β₀
(c) Mean chloride concentration when roadway area x = 1.0%
(d) Find a 99% prediction interval on chloride concentration when roadway area x = 1.0%.

11-49. Refer to the data in Exercise 11-11 on rocket motor shear strength y and propellant age x. Find a 95% confidence interval on each of the following:
(a) Slope β₁
(b) Intercept β₀
(c) Mean shear strength when age x = 20 weeks
(d) Find a 95% prediction interval on shear strength when age x = 20 weeks.

11-50. Refer to the data in Exercise 11-12 on the microstructure of zirconia. Find a 95% confidence interval on each of the following:
(a) Slope
(b) Intercept
(c) Mean length when x = 1500
(d) Find a 95% prediction interval on length when x = 1500. Explain why this interval is wider than the interval in part (c).

11-51. Refer to the data in Exercise 11-13 on oxygen demand. Find a 99% confidence interval on each of the following:
(a) β₁
(b) β₀
(c) Find a 95% confidence interval on mean BOD when the time is 8 days.

11-7 ADEQUACY OF THE REGRESSION MODEL

Fitting a regression model requires several assumptions. Estimation of the model parameters requires the assumption that the errors are uncorrelated random variables with mean zero and constant variance. Tests of hypotheses and interval estimation require that the errors be normally distributed. In addition, we assume that the order of the model is correct; that is, if we fit a simple linear regression model, we are assuming that the phenomenon actually behaves in a linear or first-order manner.

The analyst should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model that has been tentatively entertained. In this section we discuss methods useful in this respect.

11-7.1 Residual Analysis

The residuals from a regression model are eᵢ = yᵢ − ŷᵢ, i = 1, 2, ..., n, where yᵢ is an actual observation and ŷᵢ is the corresponding fitted value from the regression model. Analysis of the residuals is frequently helpful in checking the assumption that the errors are approximately normally distributed with constant variance, and in determining whether additional terms in the model would be useful.

As an approximate check of normality, the experimenter can construct a frequency histogram of the residuals or a normal probability plot of residuals. Many computer programs will produce a normal probability plot of residuals, and since the sample sizes in regression are often too small for a histogram to be meaningful, the normal probability plotting method is preferred. It requires judgment to assess the abnormality of such plots. (Refer to the discussion of the "fat pencil" method in Section 6-6.)

We may also standardize the residuals by computing dᵢ = eᵢ/√σ̂², i = 1, 2, ..., n. If the errors are normally distributed, approximately 95% of the standardized residuals should fall in the interval (−2, +2). Residuals that are far outside this interval may indicate the presence of an outlier, that is, an observation that is not typical of the rest of the data. Various rules have been proposed for discarding outliers. However, outliers sometimes provide important information about unusual circumstances of interest to experimenters and should not be automatically discarded. For further discussion of outliers, see Montgomery, Peck, and Vining (2006).
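Computing the residuals and standardized residuals takes only a few lines. The sketch below is an added illustration using the Table 11-1 data; np.polyfit is used here only as a convenient way to recover the least squares coefficients.

import numpy as np

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

b1, b0 = np.polyfit(x, y, 1)               # least squares slope and intercept
e = y - (b0 + b1 * x)                      # residuals e_i = y_i - yhat_i, as in Table 11-4

sigma2_hat = np.sum(e**2) / (len(x) - 2)   # sigma-hat squared (the residual mean square)
d = e / np.sqrt(sigma2_hat)                # standardized residuals

print(np.round(e, 3))
print(np.any(np.abs(d) > 2))               # True would suggest a possible outlier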
It is frequently helpful to plot the residuals (1) in time sequence (if known), (2) against the ŷᵢ, and (3) against the independent variable x. These graphs will usually look like one of the four general patterns shown in Fig. 11-9. Pattern (a) in Fig. 11-9 represents the ideal situation, while patterns (b), (c), and (d) represent anomalies.

[Figure 11-9: Patterns for residual plots. (a) Satisfactory, (b) Funnel, (c) Double bow, (d) Nonlinear. Adapted from Montgomery, Peck, and Vining (2006).]

If the residuals appear as in (b), the variance of the observations may be increasing with time or with the magnitude of yᵢ or xᵢ. Data transformation on the response y is often used to eliminate this problem. Widely used variance-stabilizing transformations include the use of √y, ln y, or 1/y as the response. See Montgomery, Peck, and Vining (2006) for more details regarding methods for selecting an appropriate transformation. Plots of residuals against ŷᵢ and xᵢ that look like (c) also indicate inequality of variance. Residual plots that look like (d) indicate model inadequacy; that is, higher order terms should be added to the model, a transformation on the x-variable or the y-variable (or both) should be considered, or other regressors should be considered.

EXAMPLE 11-7 Oxygen Purity Residuals

The regression model for the oxygen purity data in Example 11-1 is ŷ = 74.283 + 14.947x. Table 11-4 presents the observed and predicted values of y at each value of x from this data set, along with the corresponding residual. These values were computed using Minitab and show the number of decimal places typical of computer output. A normal probability plot of the residuals is shown in Fig. 11-10. Since the residuals fall approximately along a straight line in the figure, we conclude that there is no severe departure from normality. The residuals are also plotted against the predicted value ŷᵢ in Fig. 11-11 and against the hydrocarbon levels xᵢ in Fig. 11-12. These plots do not indicate any serious model inadequacies.

Table 11-4 Oxygen Purity Data from Example 11-1, Predicted Values, and Residuals

      Hydrocarbon   Oxygen      Predicted   Residual
      Level, x      Purity, y   Value, ŷ    e = y − ŷ
 1    0.99          90.01       89.081       0.929
 2    1.02          89.05       89.530      −0.480
 3    1.15          91.43       91.473      −0.043
 4    1.29          93.74       93.566       0.174
 5    1.46          96.73       96.107       0.623
 6    1.36          94.45       94.612      −0.162
 7    0.87          87.59       87.288       0.302
 8    1.23          91.77       92.669      −0.899
 9    1.55          99.42       97.452       1.968
10    1.40          93.65       95.210      −1.560
11    1.19          93.54       92.071       1.469
12    1.15          92.52       91.473       1.047
13    0.98          90.56       88.932       1.628
14    1.01          89.54       89.380       0.160
15    1.11          89.85       90.875      −1.025
16    1.20          90.39       92.220      −1.830
17    1.26          93.25       93.117       0.133
18    1.32          93.41       94.014      −0.604
19    1.43          94.98       95.658      −0.678
20    0.95          87.33       88.483      −1.153

[Figure 11-10: Normal probability plot of residuals, Example 11-7. Figure 11-11: Plot of residuals versus predicted oxygen purity ŷ, Example 11-7. Figure 11-12: Plot of residuals versus hydrocarbon level x, Example 11-7.]

11-7.2 Coefficient of Determination (R²)

A widely used measure for a regression model is the following ratio of sums of squares.

Coefficient of Determination: The coefficient of determination is

R² = SS_R/SS_T = 1 − SS_E/SS_T   (11-34)

The coefficient is often used to judge the adequacy of a regression model. Subsequently, we will see that in the case where X and Y are jointly distributed random variables, R² is the square of the correlation coefficient between X and Y. From the analysis of variance identity in Equations 11-24 and 11-25, 0 ≤ R² ≤ 1. We often refer loosely to R² as the amount of variability in the data explained or accounted for by the regression model. For the oxygen purity regression model, we have R² = SS_R/SS_T = 152.13/173.38 = 0.877; that is, the model accounts for 87.7% of the variability in the data.
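Equation 11-34 is a ratio of two quantities already computed in Example 11-3, so the calculation reduces to the added fragment below (our illustration, using those summary values).

# Summary values from Example 11-3 (oxygen purity data)
SST = 173.38            # total corrected sum of squares
SSR = 152.13            # regression sum of squares
SSE = SST - SSR         # error sum of squares

R2 = SSR / SST          # Equation 11-34
print(R2)               # about 0.877
print(1 - SSE / SST)    # identical, using the 1 - SS_E/SS_T form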
The statistic R² should be used with caution, because it is always possible to make R² unity by simply adding enough terms to the model. For example, we can obtain a "perfect" fit to n data points with a polynomial of degree n − 1. In addition, R² will always increase if we add a variable to the model, but this does not necessarily imply that the new model is superior to the old one. Unless the error sum of squares in the new model is reduced by an amount equal to the original error mean square, the new model will have a larger error mean square than the old one, because of the loss of one error degree of freedom. Thus, the new model will actually be worse than the old one.

There are several misconceptions about R². In general, R² does not measure the magnitude of the slope of the regression line. A large value of R² does not imply a steep slope. Furthermore, R² does not measure the appropriateness of the model, since it can be artificially inflated by adding higher order polynomial terms in x to the model. Even if y and x are related in a nonlinear fashion, R² will often be large. For example, R² for the regression equation in Fig. 11-6(b) will be relatively large, even though the linear approximation is poor. Finally, even though R² is large, this does not necessarily imply that the regression model will provide accurate predictions of future observations.

EXERCISES FOR SECTION 11-7

11-52. Refer to the compressive strength data in Exercise 11-1. Use the summary statistics provided to calculate R² and provide a practical interpretation of this quantity.

11-53. Refer to the NFL quarterback ratings data in Exercise 11-3.
(a) Calculate R² for this model and provide a practical interpretation of this quantity.
(b) Prepare a normal probability plot of the residuals from the least squares model. Does the normality assumption seem to be satisfied?
(c) Plot the residuals versus the fitted values and against x. Interpret these graphs.

11-54. Refer to the data in Exercise 11-4 on house selling price y and taxes paid x.
(a) Find the residuals for the least squares model.
(b) Prepare a normal probability plot of the residuals and interpret this display.
(c) Plot the residuals versus ŷ and versus x. Does the assumption of constant variance seem to be satisfied?
(d) What proportion of total variability is explained by the regression model?

11-55. Refer to the data in Exercise 11-5 on y = steam usage and x = average monthly temperature.
(a) What proportion of total variability is accounted for by the simple linear regression model?
(b) Prepare a normal probability plot of the residuals and interpret this graph.
(c) Plot residuals versus ŷ and x. Do the regression assumptions appear to be satisfied?

11-56. Refer to the gasoline mileage data in Exercise 11-6.
(a) What proportion of total variability in highway gasoline mileage performance is accounted for by engine displacement?
(b) Plot the residuals versus ŷ and x, and comment on the graphs.
(c) Prepare a normal probability plot of the residuals. Does the normality assumption appear to be satisfied?

11-57. Exercise 11-9 presents data on wear volume y and oil viscosity x.
(a) Calculate R² for this model. Provide an interpretation of this quantity.
(b) Plot the residuals from this model versus ŷ and versus x. Interpret these plots.
(c) Prepare a normal probability plot of the residuals. Does the normality assumption appear to be satisfied?
11-58. Refer to Exercise 11-8, which presented data on blood pressure rise y and sound pressure level x.
(a) What proportion of total variability in blood pressure rise is accounted for by sound pressure level?
(b) Prepare a normal probability plot of the residuals from this least squares model. Interpret this plot.
(c) Plot residuals versus ŷ and versus x. Comment on these plots.

11-59. Refer to Exercise 11-10, which presented data on chloride concentration y and roadway area x.
(a) What proportion of the total variability in chloride concentration is accounted for by the regression model?
(b) Plot the residuals versus ŷ and versus x. Interpret these plots.
(c) Prepare a normal probability plot of the residuals. Does the normality assumption appear to be satisfied?

11-60. An article in the Journal of the American Statistical Association ["Markov Chain Monte Carlo Methods for Computing Bayes Factors: A Comparative Review" (2001, Vol. 96, pp. 1122-1132)] analyzed the tabulated data on compressive strength parallel to the grain versus resin-adjusted density for specimens of radiata pine.

Compressive              Compressive
Strength     Density     Strength     Density
3040         29.2        3840         30.7
2470         24.7        3800         32.7
3610         32.3        4600         32.6
3480         31.3        1900         22.1
3810         31.5        2530         25.3
2330         24.5        2920         30.8
1800         19.9        4990         38.9
3110         27.3        1670         22.1
3160         27.1        3310         29.2
2310         24.0        3450         30.1
4360         33.8        3600         31.4
1880         21.5        2850         26.7
3670         32.2        1590         22.1
1740         22.5        3770         30.3
2250         27.5        3850         32.0
2650         25.6        2480         23.2
4970         34.5        3570         30.3
2620         26.2        2620         29.9
2900         26.7        1890         20.8
1670         21.1        3030         33.2
2540         24.1        3030         28.2

(a) Fit a regression model relating compressive strength to density.
(b) Test for significance of regression with α = 0.05.
(c) Estimate σ² for this model.
(d) Calculate R² for this model. Provide an interpretation of this quantity.
(e) Prepare a normal probability plot of the residuals and interpret this display.
(f) Plot the residuals versus ŷ and versus x. Does the assumption of constant variance seem to be satisfied?

11-61. Consider the rocket propellant data in Exercise 11-11.
(a) Calculate R² for this model. Provide an interpretation of this quantity.
(b) Plot the residuals on a normal probability scale. Do any points seem unusual on this plot?
(c) Delete the two points identified in part (b) from the sample and fit the simple linear regression model to the remaining 18 points. Calculate the value of R² for the new model. Is it larger or smaller than the value of R² computed in part (a)? Why?
(d) Did the value of σ̂² change dramatically when the two points identified above were deleted and the model fit to the remaining points? Why?

11-62. Consider the data in Exercise 11-7 on y = green liquor Na₂S concentration and x = paper machine production. Suppose that a 14th sample point is added to the original data, where y₁₄ = 59 and x₁₄ = 855.
(a) Prepare a scatter diagram of y versus x. Fit the simple linear regression model to all 14 observations.
(b) Test for significance of regression with α = 0.05.
(c) Estimate σ² for this model.
(d) Compare the estimate of σ² obtained in part (c) above with the estimate of σ² obtained from the original 13 points. Which estimate is larger and why?
(e) Compute the residuals for this model. Does the value of e₁₄ appear unusual?
(f) Prepare and interpret a normal probability plot of the residuals.
(g) Plot the residuals versus ŷ and versus x. Comment on these graphs.

11-63. Consider the rocket propellant data in Exercise 11-11. Calculate the standardized residuals for these data. Does this provide any helpful information about the magnitude of the residuals?

11-64. Studentized Residuals. Show that the variance of the ith residual is

\[
V(e_i) = \sigma^2\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right)\right]
\]

Hint: \(\operatorname{cov}(Y_i, \hat{Y}_i) = \sigma^2\left[\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right]\).

The ith studentized residual is defined as

\[
r_i = \frac{e_i}{\sqrt{\hat{\sigma}^2\left[1 - \left(\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{S_{xx}}\right)\right]}}, \qquad i = 1, 2, \ldots, n
\]

(a) Explain why r_i has unit standard deviation.
(b) Do the standardized residuals have unit standard deviation?
(c) Discuss the behavior of the studentized residual when the sample value x_i is very close to the middle of the range of x.
(d) Discuss the behavior of the studentized residual when the sample value x_i is very near one end of the range of x.

11-65. Show that an equivalent way to define the test for significance of regression in simple linear regression is to base the test on R² as follows: to test H₀: β₁ = 0 versus H₁: β₁ ≠ 0, calculate

\[
F_0 = \frac{R^2(n-2)}{1 - R^2}
\]

and to reject H₀: β₁ = 0 if the computed value f₀ > f_{α,1,n-2}. Suppose that a simple linear regression model has been fit to n = 25 observations and R² = 0.90.
(a) Test for significance of regression at α = 0.05.
(b) What is the smallest value of R² that would lead to the conclusion of a significant regression if α = 0.05?
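Both quantities defined in these last two exercises are straightforward to compute. The sketch below is an illustration of the definitions, not a solution to the exercises: it implements the studentized residual r_i of Exercise 11-64 for an arbitrary simple linear regression fit, and evaluates the statistic F₀ of Exercise 11-65 using n = 20 and R² = 0.877 from the oxygen purity model of Example 11-7. The five data points fed to the function are hypothetical.

```python
import numpy as np
from scipy import stats

def studentized_residuals(x, y):
    """Studentized residuals r_i = e_i / sqrt(sigma2_hat * (1 - h_i)),
    where h_i = 1/n + (x_i - xbar)^2 / Sxx (Exercise 11-64)."""
    n = len(x)
    xbar = x.mean()
    sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum(y * (x - xbar)) / sxx        # least squares slope
    b0 = y.mean() - b1 * xbar                # least squares intercept
    e = y - (b0 + b1 * x)                    # ordinary residuals
    sigma2 = np.sum(e ** 2) / (n - 2)        # error mean square
    h = 1.0 / n + (x - xbar) ** 2 / sxx      # leverage of each observation
    return e / np.sqrt(sigma2 * (1.0 - h))

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
print(np.round(studentized_residuals(x, y), 3))

# R^2-based test for significance of regression (Exercise 11-65):
# reject H0: beta1 = 0 when f0 exceeds f_{alpha,1,n-2}.
n, r2, alpha = 20, 0.877, 0.05               # oxygen purity fit, Example 11-7
f0 = r2 * (n - 2) / (1 - r2)
f_crit = stats.f.ppf(1 - alpha, 1, n - 2)
print(f"f0 = {f0:.1f} vs f_crit = {f_crit:.2f}")   # about 128.3 vs 4.41
```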
11-8 CORRELATION

Our development of regression analysis has assumed that x is a mathematical variable, measured with negligible error, and that Y is a random variable. Many applications of regression analysis involve situations in which both X and Y are random variables. In these situations, it is usually assumed that the observations (X_i, Y_i), i = 1, 2, ..., n, are jointly distributed random variables obtained from the distribution f(x, y). For example, suppose we wish to develop a regression model relating the shear strength of spot welds to the weld diameter. In this example, weld diameter cannot be controlled. We would randomly select n spot welds and observe a diameter (X_i) and a shear strength (Y_i) for each. Therefore (X_i, Y_i) are jointly distributed random variables.

We assume that the joint distribution of X_i and Y_i is the bivariate normal distribution presented in Chapter 5, where μ_Y and σ_Y² are the mean and variance of Y, μ_X and σ_X² are the mean and variance of X, and ρ is the correlation coefficient between Y and X. Recall that the correlation coefficient is defined as

\[
\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \tag{11-35}
\]

where σ_XY is the covariance between Y and X.

The conditional distribution of Y for a given value of X = x is

\[
f_{Y|x}(y) = \frac{1}{\sqrt{2\pi}\,\sigma_{Y|x}} \exp\left[-\frac{1}{2}\left(\frac{y - \beta_0 - \beta_1 x}{\sigma_{Y|x}}\right)^2\right] \tag{11-36}
\]

where

\[
\beta_0 = \mu_Y - \mu_X \rho \frac{\sigma_Y}{\sigma_X} \tag{11-37}
\]

\[
\beta_1 = \frac{\sigma_Y}{\sigma_X}\,\rho \tag{11-38}
\]

and the variance of the conditional distribution of Y given X = x is

\[
\sigma_{Y|x}^2 = \sigma_Y^2 (1 - \rho^2) \tag{11-39}
\]

That is, the conditional distribution of Y given X = x is normal with mean

\[
E(Y \mid x) = \beta_0 + \beta_1 x \tag{11-40}
\]

and variance σ²_{Y|x}. Thus, the mean of the conditional distribution of Y given X = x is a simple linear regression model. Furthermore, there is a relationship between the correlation coefficient ρ and the slope β₁. From Equation 11-38 we see that if ρ = 0, then β₁ = 0, which implies that there is no regression of Y on X. That is, knowledge of X does not assist us in predicting Y.

The method of maximum likelihood may be used to estimate the parameters β₀ and β₁. It can be shown that the maximum likelihood estimators of these parameters are

\[
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \tag{11-41}
\]

and

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} Y_i (X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}} \tag{11-42}
\]

We note that the estimators of the intercept and slope in Equations 11-41 and 11-42 are identical to those given by the method of least squares in the case where X was assumed to be a mathematical variable. That is, the regression model with Y and X jointly normally distributed is equivalent to the model with X considered as a mathematical variable. This follows because the random variables Y given X = x are independently and normally distributed with mean β₀ + β₁x and constant variance σ²_{Y|x}. These results will also hold for any joint distribution of Y and X such that the conditional distribution of Y given X is normal.
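To make Equations 11-37 through 11-40 concrete, here is a minimal sketch that computes the conditional-mean line and conditional variance implied by an assumed bivariate normal distribution. All parameter values are hypothetical, chosen only for illustration.

```python
# Hypothetical bivariate normal parameters for (X, Y).
mu_x, mu_y = 4.0, 50.0        # means of X and Y
sigma_x, sigma_y = 0.8, 6.0   # standard deviations of X and Y
rho = 0.9                     # correlation coefficient between Y and X

beta1 = rho * sigma_y / sigma_x                 # Eq. 11-38
beta0 = mu_y - mu_x * rho * sigma_y / sigma_x   # Eq. 11-37
var_y_given_x = sigma_y ** 2 * (1 - rho ** 2)   # Eq. 11-39

x = 4.5
print(f"E(Y | x = {x}) = {beta0 + beta1 * x:.3f}")   # Eq. 11-40: 53.375
print(f"Var(Y | x)    = {var_y_given_x:.2f}")        # 6.84
# Note: rho = 0 would give beta1 = 0, i.e., no regression of Y on X.
```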
It is possible to draw inferences about the correlation coefficient ρ in this model. The estimator of ρ is the sample correlation coefficient

\[
R = \frac{\sum_{i=1}^{n} Y_i (X_i - \bar{X})}{\left[\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2\right]^{1/2}} = \frac{S_{XY}}{\sqrt{S_{XX}\,SS_T}} \tag{11-43}
\]

Note that

\[
\hat{\beta}_1 = \left(\frac{SS_T}{S_{XX}}\right)^{1/2} R \tag{11-44}
\]

so the slope β̂₁ is just the sample correlation coefficient R multiplied by a scale factor that is the square root of the "spread" of the Y values divided by the "spread" of the X values. Thus, β̂₁ and R are closely related, although they provide somewhat different information. The sample correlation coefficient R measures the linear association between Y and X, while β̂₁ measures the predicted change in the mean of Y for a unit change in X. In the case of a mathematical variable x, R has no meaning because the magnitude of R depends on the choice of spacing of x. We may also write, from Equation 11-44,

\[
R^2 = \hat{\beta}_1^2 \frac{S_{XX}}{SS_T} = \frac{\hat{\beta}_1 S_{XY}}{SS_T} = \frac{SS_R}{SS_T}
\]
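As a numerical check on Equations 11-43 and 11-44, the sketch below computes the sample correlation coefficient for the oxygen purity data of Table 11-4 (treating the pairs as a jointly distributed sample purely for illustration) and verifies that the slope and R² identities hold.

```python
import numpy as np

# Oxygen purity data from Table 11-4 (x = hydrocarbon level, y = purity).
x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
              93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33])

sxy = np.sum(y * (x - x.mean()))      # S_XY
sxx = np.sum((x - x.mean()) ** 2)     # S_XX
sst = np.sum((y - y.mean()) ** 2)     # SS_T

r  = sxy / np.sqrt(sxx * sst)         # Eq. 11-43: sample correlation
b1 = sxy / sxx                        # Eq. 11-42: ML / least squares slope

# Eq. 11-44: the slope is r rescaled by the spreads of the y and x values.
assert np.isclose(b1, r * np.sqrt(sst / sxx))

print(f"r = {r:.4f}, r^2 = {r**2:.4f}")  # r ~ 0.937; r^2 ~ 0.877 = SSR/SST
```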