Linear Regression
Introduction
In Chapters 7 and 8, two areas of inferential statistics, confidence intervals and
hypothesis testing, were explained. Another area of inferential statistics involves
determining whether a relationship exists between two or more numerical or quantitative
variables. For example, a businessperson may want to know whether the volume of sales
for a given month is related to the amount of advertising the firm does that month.
Educators are interested in determining whether the number of hours a student studies
is related to the student’s score on a particular exam. Medical researchers are
interested in questions such as, Is caffeine related to heart damage? or Is there a
relationship between a person’s age and his or her blood pressure? A zoologist may want
to know whether the birth weight of a certain animal is related to its life span. These
are only some of the many questions that can be answered by using the techniques of
correlation and regression analysis.
The purpose of this chapter then is to answer these questions statistically:
1. Are two or more variables linearly related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?
Unusual Stat: A person walks on average 100,000 miles in his or her lifetime. This is
about 3.4 miles per day.
Student   Hours of study x   Grade y (%)
A         6                  82
B         2                  63
C         1                  57
D         5                  88
E         2                  68
F         3                  75
The two variables for this study are called the independent variable and the dependent
variable. The independent variable is the variable in regression that can be controlled
or manipulated. In this case, the number of hours of study is the independent variable
and is designated as the x variable. The dependent variable is the variable in
regression that cannot be controlled or manipulated. The grade the student received on
the exam is the dependent variable, designated as the y variable. The reason for this
distinction between the variables is that you assume that the grade the student earns
depends on the number of hours the student studied. Also, you assume that, to some
extent, the student can regulate or control the number of hours he or she studies for
the exam. The independent variable is also known as the explanatory variable, and the
dependent variable is also called the response variable.
The determination of the x and y variables is not always clear-cut and is sometimes an
arbitrary decision. For example, if a researcher studies the effects of age on a person’s blood
pressure, the researcher can generally assume that age affects blood pressure. Hence, the
variable age can be called the independent variable, and the variable blood pressure can be
called the dependent variable. On the other hand, if a researcher is studying the attitudes of
husbands on a certain issue and the attitudes of their wives on the same issue, it is difficult
to say which variable is the independent variable and which is the dependent variable. In this
study, the researcher can arbitrarily designate the variables as independent and dependent.
Section 10–1 Scatter Plots and Correlation 551
FIGURE 10–1 Types of Relationships: (a) Positive linear relationship; (b) Negative
linear relationship; (c) Curvilinear relationship; (d) No relationship
The independent and dependent variables can be plotted on a graph called a scatter
plot. The independent variable x is plotted on the horizontal axis, and the dependent
variable y is plotted on the vertical axis.
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the
independent variable x and the dependent variable y.
The scatter plot is a visual way to describe the nature of the relationship between the
independent and dependent variables. The scales of the variables can be different, and
the coordinates of the axes are determined by the smallest and largest data values of the
variables.
Researchers look for various types of patterns in scatter plots. For example, in
Figure 10–1(a), the pattern in the points of the scatter plot shows a positive linear
relationship. Here, as the values of the independent variable (x variable) increase,
the values of the dependent variable (y variable) increase. Also, the points form
somewhat of a straight line going in an upward direction from left to right.
The pattern of the points of the scatter plot shown in Figure 10–1(b) shows a negative
linear relationship. In this case, as the values of the independent variable increase, the
values of the dependent variable decrease. Also, the points show a somewhat straight line
going in a downward direction from left to right.
The pattern of the points of the scatter plot shown in Figure 10–1(c) shows some type
of a nonlinear relationship or a curvilinear relationship.
Finally, the scatter plot shown in Figure 10–1(d) shows basically no relationship
between the independent variable and the dependent variable since no pattern (line or
curve) can be seen.
The procedure table for drawing a scatter plot is given next.
Procedure Table
Drawing a Scatter Plot
Step 1 Draw and label the x and y axes.
Step 2 Plot each point on the graph.
Step 3 Determine the type of relationship (if any) that exists for the variables.
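The three steps can be tried out with a short program. The following is only a minimal sketch, not a replacement for a proper graph: it uses the hours-of-study and grade values from the table in this section, and the function name `text_scatter` and the grid size are illustrative assumptions. It scales each axis from the smallest to the largest data value and marks each ordered pair with an asterisk, and it assumes that neither variable is constant.

```python
def text_scatter(xs, ys, width=30, height=10):
    """Return a crude text scatter plot of the ordered pairs (x, y).

    Hypothetical helper for illustration; assumes xs and ys each contain
    at least two distinct values so the axis ranges are nonzero.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a grid index; axes run from min to max of the data.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    lines = ["|" + "".join(r).rstrip() for r in grid]  # "|" stands in for the y axis
    lines.append("+" + "-" * width)                    # "+---" stands in for the x axis
    return "\n".join(lines)


hours = [6, 2, 1, 5, 2, 3]          # independent variable x
grades = [82, 63, 57, 88, 68, 75]   # dependent variable y
plot = text_scatter(hours, grades)
print(plot)
```

Reading the printed grid from left to right, the asterisks rise, which is the pattern described for a positive linear relationship.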
552 Chapter 10 Correlation and Regression
The procedure for drawing a scatter plot is shown in Examples 10–1 through 10–3.
SOLUTION
[Scatter plot: Cars (in 10,000s) on the x axis; Revenue (billions of dollars) on the y axis]
SOLUTION
[Scatter plot: Number of absences on the x axis; Final grade on the y axis]
SOLUTION
[Scatter plot: Age on the x axis; Wealth ($ billions) on the y axis]
Correlation
OBJECTIVE 2: Compute the correlation coefficient.
Correlation Coefficient
Statisticians use a measure called the correlation coefficient to determine the
strength of the linear relationship between two variables. There are several types of
correlation coefficients.
The linear correlation coefficient computed from the sample data measures the
strength and direction of a linear relationship between two quantitative variables.
The symbol for the sample correlation coefficient is r.
The linear correlation coefficient explained in this section is called the Pearson product
moment correlation coefficient (PPMC), named after statistician Karl Pearson, who
pioneered the research in this area.
The range of the linear correlation coefficient is from −1 to +1. If there is a strong
positive linear relationship between the variables, the value of r will be close to +1.
If there is a strong negative linear relationship between the variables, the value of r
will be close to −1. When there is no linear relationship between the variables or only
a weak relationship, the value of r will be close to 0. See Figure 10–5. When the value
of r is 0 or close to zero, it implies only that there is no linear relationship
between the variables. The data may be related in some other nonlinear way.
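A small made-up data set illustrates the last point. In the sketch below, the y values are exactly the squares of the x values, so the two variables are perfectly related, yet the linear correlation coefficient is 0. The function name `pearson_r` and the data are illustrative assumptions; the computation uses the sum-based formula for r that appears in the procedure table later in this section.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear correlation coefficient r from the column sums (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Made-up curvilinear data: every y is determined by x, but not along a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

Here the sum of x and the sum of xy are both 0, so the numerator vanishes even though knowing x tells you y exactly.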
FIGURE 10–6 Relationship Between the Correlation Coefficient and the Scatter Plot
The graphs in Figure 10–6 show the relationship between the correlation coefficients
and their corresponding scatter plots. Notice that as the value of the correlation
coefficient increases from 0 to +1 (parts a, b, and c), the data values become closer
to a straight line, which indicates an increasingly strong relationship. As the value
of the correlation coefficient decreases from 0 to −1 (parts d, e, and f), the data
values also become closer to a straight line. Again this suggests a stronger
relationship.
In this book, the assumptions will be stated in the exercises; however, when encountering
statistics in other situations, you must check to see that these assumptions have been met
before proceeding.
There are several ways to compute the value of the correlation coefficient. One
method is to use the formula shown here.
Rounding Rule for the Correlation Coefficient Round the value of r to three decimal
places.
The formula looks somewhat complicated, but using a table to compute the values, as
shown in Example 10–4, makes it somewhat easier to determine the value of r.
There are no units associated with r, and the value of r will remain unchanged if the
x and y values are switched.
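Both properties can be checked numerically. The sketch below reuses the hours-of-study and grade values from earlier in this section; the helper name `pearson_r` and the change of units from hours to minutes are illustrative assumptions. It computes r three ways: as given, with x and y switched, and with the study time rescaled.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear correlation coefficient r from the column sums (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


hours = [6, 2, 1, 5, 2, 3]
grades = [82, 63, 57, 88, 68, 75]

r_xy = pearson_r(hours, grades)                         # x = hours, y = grades
r_yx = pearson_r(grades, hours)                         # variables switched
r_minutes = pearson_r([h * 60 for h in hours], grades)  # study time in minutes

print(r_xy, r_yx, r_minutes)
```

The three printed values agree (up to floating-point rounding), because the formula treats x and y symmetrically and a change of scale cancels between the numerator and the denominator.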
The procedure for finding the value of the linear correlation coefficient is given next.
Procedure Table
Finding the Value of the Linear Correlation Coefficient
Step 1 Make a table as shown.
x    y    xy    x²    y²
Step 2 Place the values of x in the x column and the values of y in the y column.
Multiply each x value by the corresponding y value, and place the products in the
xy column.
Square each x value and place the squares in the x² column.
Square each y value and place the squares in the y² column.
Find the sum of each column.
Step 3 Substitute in the formula and find the value for r.
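$$
r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}
$$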
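The procedure table translates almost line for line into code. The sketch below is one possible way to organize it, under the assumption that the data are held in two Python lists; the function names `correlation_table` and `r_from_sums` are hypothetical. It builds the xy, x², and y² columns, sums each column, and substitutes the sums into the formula, using the cars and revenue values of Example 10–4.

```python
from math import sqrt


def correlation_table(xs, ys):
    """Steps 1 and 2: rows of (x, y, xy, x^2, y^2) plus the sum of each column."""
    rows = [(x, y, x * y, x * x, y * y) for x, y in zip(xs, ys)]
    sums = [sum(column) for column in zip(*rows)]
    return rows, sums


def r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Step 3: substitute the column sums into the formula for r."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]   # x, in ten thousands
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]     # y, in billions of dollars

rows, sums = correlation_table(cars, revenue)
r = r_from_sums(len(rows), *sums)

for row in rows:
    print(*(f"{value:10.2f}" for value in row))
print("sums:", *(f"{total:10.2f}" for total in sums))
print("r =", round(r, 3))  # rounded to three decimal places per the rounding rule
```

Keeping the table and the substitution in separate functions mirrors the hand calculation, so each intermediate column can be compared with a worked example.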
Company   Cars x (in ten thousands)   Revenue y (in billions)   xy   x²   y²
A         63.0                        $7.0
B         29.0                        3.9
C         20.8                        2.1
D         19.1                        2.8
E         13.4                        1.4
F         8.5                         1.5
Step 2 Find the values of xy, x², and y², and place these values in the corresponding
columns of the table.
Company   Cars x (in 10,000s)   Revenue y (in billions of dollars)   xy       x²        y²
A         63.0                  7.0                                  441.00   3969.00   49.00
B         29.0                  3.9                                  113.10   841.00    15.21
C         20.8                  2.1                                  43.68    432.64    4.41
D         19.1                  2.8                                  53.48    364.81    7.84
E         13.4                  1.4                                  18.76    179.56    1.96
F         8.5                   1.5                                  12.75    72.25     2.25
Σx = 153.8   Σy = 18.7   Σxy = 682.77   Σx² = 5859.26   Σy² = 80.67
$$
r = \frac{(6)(682.77) - (153.8)(18.7)}{\sqrt{\left[(6)(5859.26) - (153.8)^2\right]\left[(6)(80.67) - (18.7)^2\right]}} = 0.982
$$
The linear correlation coefficient suggests a strong positive linear
relationship between the number of cars a rental agency has and its annual
revenue. That is, the more cars a rental agency has, the more annual revenue
the company will have.
$$
r = \frac{(7)(3745) - (57)(511)}{\sqrt{\left[(7)(579) - (57)^2\right]\left[(7)(38{,}993) - (511)^2\right]}} = -0.944
$$
$$
r = \frac{(10)(14{,}644.2) - (601)(232.9)}{\sqrt{\left[(10)(37{,}571) - (601)^2\right]\left[(10)(8477.43) - (232.9)^2\right]}}
  = \frac{6469.1}{\sqrt{(14{,}509)(30{,}531.89)}} = \frac{6469.1}{21{,}047.26091} = 0.307
$$
The value of r indicates a weak positive linear relationship between age and wealth of
the richest people in the world.
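When only the column sums are available, as in the two substitutions above with n = 7 and n = 10, the arithmetic can be rechecked directly. The following sketch assumes the sums are exactly those appearing in the worked substitutions; the function name `r_steps` is hypothetical. It returns the numerator and the two bracketed terms alongside r, so each intermediate value can be compared with the hand calculation.

```python
from math import sqrt


def r_steps(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return (numerator, x bracket, y bracket, r) for the correlation formula."""
    numerator = n * sum_xy - sum_x * sum_y
    x_bracket = n * sum_x2 - sum_x ** 2
    y_bracket = n * sum_y2 - sum_y ** 2
    return numerator, x_bracket, y_bracket, numerator / sqrt(x_bracket * y_bracket)


# Sums taken from the n = 7 substitution above.
num7, xb7, yb7, r7 = r_steps(7, 57, 511, 3745, 579, 38993)

# Sums taken from the n = 10 substitution above.
num10, xb10, yb10, r10 = r_steps(10, 601, 232.9, 14644.2, 37571, 8477.43)

# The n = 7 numerator, (7)(3745) - (57)(511), is below zero, so r7 is negative.
print(num7, xb7, yb7, round(r7, 3))
print(round(num10, 1), xb10, round(yb10, 2), round(r10, 3))
```

The sign of r always comes from the numerator, since the square root in the denominator is positive.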
In Example 10–4, the value of r was high (close to 1.00); in Example 10–6, the value
of r was much lower (close to 0). This question then arises, When is the value of r due to
chance, and when does it suggest a significant linear relationship between the variables?
This question will be answered next.
OBJECTIVE 3: Test the hypothesis H0: ρ = 0.
The Significance of the Correlation Coefficient
As stated before, the range of the correlation coefficient is between −1 and +1. When
the value of r is near +1 or −1, there is a strong linear relationship. When the value
of r is near 0, the linear relationship is weak or nonexistent. Since the value of r is
computed from data obtained from samples, there are two possibilities when r is not
equal to zero: either the value of r is high enough to conclude that there is a
significant linear relationship between the variables, or the value of r is due to
chance.