We introduce the linear correlation coefficient r, which is a number that measures how well paired
sample data fit a straight-line pattern when graphed. We use the sample of paired data (sometimes
called bivariate data) to find the value of r, and then we use r to decide whether there is a linear
correlation between the two variables.
We consider only linear relationships, which means that when graphed in a scatterplot, the points
approximate a straight-line pattern. Then, we discuss methods for conducting a formal hypothesis test
that can be used to decide whether there is a linear correlation between all population values for the
two variables. Finally, we discuss a method of randomization whereby we resample many times to test
the null hypothesis of no correlation.
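The randomization idea mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure from the text: the function names `pearson_r` and `randomization_p_value`, the 10,000 resamples, and the fixed seed are all choices made for this example. The data are the ten white/red blood cell count pairs used in the examples of this section.

```python
import math
import random

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def randomization_p_value(xs, ys, resamples=10_000, seed=0):
    """Estimate a two-tailed P-value by shuffling y, which breaks any pairing."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(resamples):
        rng.shuffle(shuffled)
        # Count shuffles whose |r| is at least as large as the observed |r|
        if abs(pearson_r(xs, shuffled)) >= observed:
            extreme += 1
    return extreme / resamples

white = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]
red = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]

p_estimate = randomization_p_value(white, red)
print(f"estimated P-value: {p_estimate:.4f}")
```

A small estimated P-value means that shuffled (unpaired) data rarely produce a correlation as strong as the observed one, which counts as evidence against the null hypothesis of no correlation.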
A correlation exists between two variables when the values of one variable are somehow associated
with the values of the other variable.
A linear correlation exists between two variables when there is a correlation and the plotted points of
paired data result in a pattern that can be approximated by a straight line.
We use the linear correlation coefficient r, which is a number that measures the strength of the linear
association between the two variables.
The linear correlation coefficient r measures the strength of the linear correlation between the paired
quantitative x values and y values in a sample.
The linear correlation coefficient r is computed by using Formula 10-1 or Formula 10-2, included in the
following Key Elements box. [The linear correlation coefficient is sometimes referred to as the Pearson
product moment correlation coefficient in honor of Karl Pearson (1857–1936), who originally developed
it.]
Objective
∑x² indicates that each x value should be squared and then those squares added.
(∑x)² indicates that the x values should be added and the total then squared. Avoid confusing ∑x² and
(∑x)².
∑xy indicates that each x value should first be multiplied by its corresponding y value. After obtaining all
such products, find their sum.
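The difference between these sums is easy to see numerically. The sketch below is only an illustration; the variable names are chosen for this example, and the data are the ten white/red blood cell count pairs used in the examples of this section.

```python
xs = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]            # white counts
ys = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]   # red counts

sum_of_squares = sum(x ** 2 for x in xs)          # ∑x²: square each x, then add
square_of_sum = sum(xs) ** 2                      # (∑x)²: add the x values, then square
sum_of_products = sum(x * y for x, y in zip(xs, ys))  # ∑xy: multiply pairs, then add

print(f"sum of x squared    = {sum_of_squares:.2f}")
print(f"(sum of x) squared  = {square_of_sum:.2f}")
print(f"sum of xy           = {sum_of_products:.3f}")
```

The first two results differ by roughly a factor of ten here, which is why confusing them gives a badly wrong value of r.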
r denotes the linear correlation coefficient for sample data.
Given any collection of sample paired quantitative data, the linear correlation coefficient r can always be
computed, but the following requirements should be satisfied when using the sample paired data to
make a conclusion about linear correlation in the corresponding population of paired data.
1. The sample of paired (x, y) data is a simple random sample of quantitative data.
2. Visual examination of the scatterplot must confirm that the points approximate a straight-line
pattern.*
3. Because results can be strongly affected by the presence of outliers, any outliers must be removed
if they are known to be errors. The effects of any other outliers should be considered by calculating
r with and without the outliers included.*
*Requirements 2 and 3 above are simplified attempts at checking this formal requirement: The
pairs of (x, y) data must have a bivariate normal distribution.
FORMULA 10-1
\[ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}} \]
FORMULA 10-2
\[ r = \frac{\sum (z_x z_y)}{n - 1} \]
where zx denotes the z score for an individual sample value x and zy is the z score for the corresponding
sample value y.
Round the linear correlation coefficient r to three decimal places so that its value can be directly
compared to critical values.
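Using P-Value from Technology to Interpret r: Use the P-value and significance level α as follows:
• Correlation If the P-value ≤ α, conclude that there is sufficient evidence to support the claim
of a linear correlation.
• No Correlation If the P-value > α, conclude that there is not sufficient evidence to support
the claim of a linear correlation.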
Using Table A-6 to Interpret r: Consider critical values from Table A-6 or technology as being both
positive and negative:
• Correlation If | r | ≥ critical value, conclude that there is sufficient evidence to support the claim
of a linear correlation.
• No Correlation If | r | < critical value, conclude that there is not sufficient evidence to support
the claim of a linear correlation.
• If all values of either variable are converted to a different scale, the value of r does not change.
• The value of r is not affected by the choice of x or y. Interchange all x values and y values, and
the value of r will not change.
• r measures the strength of a linear relationship. It is not designed to measure the strength of a
relationship that is not linear.
• r is very sensitive to outliers in the sense that a single outlier could dramatically affect its value.
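The properties listed above can be checked numerically. The sketch below is an illustration only: `pearson_r` is a helper written for this example, the factor of 1000 stands in for an arbitrary change of scale, and the extra point (20.0, 1.0) is a hypothetical outlier invented for the demonstration. The remaining data are the ten white/red blood cell count pairs used in the examples of this section.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

white = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]
red = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]

r_original = pearson_r(white, red)

# Changing the scale of one variable leaves r unchanged.
r_rescaled = pearson_r([1000 * x for x in white], red)

# Interchanging x and y leaves r unchanged.
r_swapped = pearson_r(red, white)

# A single hypothetical outlier (made up for illustration) changes r dramatically.
r_with_outlier = pearson_r(white + [20.0], red + [1.0])

print(f"original:     {r_original:.3f}")
print(f"rescaled x:   {r_rescaled:.3f}")
print(f"x, y swapped: {r_swapped:.3f}")
print(f"with outlier: {r_with_outlier:.3f}")
```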
Use technology to find the value of the linear correlation coefficient r for the paired white/red blood
cell counts listed in the table.
White Red
8.7 4.80
6.9 4.47
8.1 4.60
8.0 4.09
6.9 4.15
8.1 5.22
6.4 4.22
6.3 4.30
10.9 6.34
4.8 3.54
The value of r will be automatically calculated with software or a calculator. r = 0.900 (rounded to three
decimal places).
Use Formula 10-1 to find the linear correlation coefficient r for the paired white/red blood cell counts
listed in the table.
x (White) y (Red) x² y² xy
The variable x is used for the white blood cell counts, and the variable y is used for the red blood cell
counts. Because there are 10 pairs of data, n = 10, and the other required values are computed in the
table.
\[ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}} = \frac{102.637}{\sqrt{246.29}\,\sqrt{52.8171}} = 0.900 \]
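The Formula 10-1 calculation can be reproduced step by step. The function name `r_formula_10_1` is an assumption made for this sketch. Because the code carries unrounded sums, intermediate quantities may differ in later decimal places from a hand calculation that rounds along the way, although r agrees to three decimal places.

```python
import math

def r_formula_10_1(xs, ys):
    """Compute r from the raw sums, following Formula 10-1."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x ** 2 for x in xs)              # ∑x²
    sum_y2 = sum(y ** 2 for y in ys)              # ∑y²
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # ∑xy
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

white = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]
red = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]

r = r_formula_10_1(white, red)
print(f"r = {r:.3f}")
```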
Use Formula 10-2 to find the linear correlation coefficient r for the paired white/red blood cell counts
listed in the table.
White Red
8.7 4.80
6.9 4.47
8.1 4.60
8.0 4.09
6.9 4.15
8.1 5.22
6.4 4.22
6.3 4.30
10.9 6.34
4.8 3.54
If manual calculations are absolutely necessary, Formula 10-1 is much easier than Formula 10-2, but
Formula 10-2 has the advantage of making it easier to understand how r works. The variable x is used for
the white blood cell counts, and the variable y is used for the red blood cell counts. In Formula 10-2,
each sample value is replaced by its corresponding z score.
Solution
For example, using unrounded numbers, the white blood cell counts have a mean of x̄ = 7.51 and a
standard deviation of s_x = 1.654254, so the first x value of 8.7 is converted to a z score of 0.71936:
\[ z_x = \frac{x - \bar{x}}{s_x} = \frac{8.7 - 7.51}{1.654254} = 0.71936 \]
The table on the next slide lists the z scores for all of the white blood cell counts (third column) and the
z scores for all of the red blood cell counts (fourth column). The last column of the table lists the
products zx · zy .
x (White) y (Red) zx zy zx · zy
\[ r = \frac{\sum (z_x z_y)}{n - 1} = \frac{8.09870}{10 - 1} = 0.900 \]
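The z-score route of Formula 10-2 can be sketched the same way. The function name `r_formula_10_2` is an assumption made for this example; `statistics.stdev` is used because it returns the sample standard deviation (dividing by n − 1), which is what the z scores here require.

```python
import statistics

def r_formula_10_2(xs, ys):
    """Compute r as the sum of z-score products divided by n - 1 (Formula 10-2)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample standard deviations
    z_x = [(x - mean_x) / s_x for x in xs]
    z_y = [(y - mean_y) / s_y for y in ys]
    z_products = [zx * zy for zx, zy in zip(z_x, z_y)]
    return sum(z_products) / (n - 1), z_x, z_y

white = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]
red = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]

r, z_white, z_red = r_formula_10_2(white, red)
print(f"z score of first white count: {z_white[0]:.5f}")
print(f"r = {r:.3f}")
```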
Using the value of r = 0.900 for the 10 pairs of data in the table, and using a significance level of 0.05,
determine whether there is sufficient evidence to support a claim that there is a linear correlation
between white blood cell counts and red blood cell counts.
White Red
8.7 4.80
6.9 4.47
8.1 4.60
8.0 4.09
6.9 4.15
8.1 5.22
6.4 4.22
6.3 4.30
10.9 6.34
4.8 3.54
Requirement Check
The data are a simple random sample, so the first requirement is satisfied.
Using P-Value from Technology to Interpret r: Use the P-value and significance level α as follows:
• The Statdisk display shows that the P-value is 0.00039. Because the P-value is less than or equal
to the significance level of 0.05, we conclude there is sufficient evidence to support the
conclusion that there is a linear correlation between white blood cell counts and
red blood cell counts.
Using Table A-6 to Interpret r: Consider critical values from Table A-6 as being both positive and
negative, and draw a graph similar to the figure below.
No Correlation If the computed linear correlation coefficient lies between the two critical values,
conclude that there is not sufficient evidence to support the claim of a linear correlation.
Because the figure shows that the computed value of r = 0.900 lies beyond the upper critical value of
0.632, we conclude that there is sufficient evidence to support the claim of a linear correlation between
white blood cell counts and red blood cell counts.
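Both decision rules reduce to a single comparison. The sketch below uses hypothetical helper names (`decide_by_critical_value`, `decide_by_p_value`) and plugs in the values from this example: r = 0.900, the critical value 0.632, the P-value 0.00039, and α = 0.05.

```python
def decide_by_critical_value(r, critical_value):
    """True when |r| reaches the critical value (evidence of linear correlation)."""
    return abs(r) >= critical_value

def decide_by_p_value(p_value, alpha):
    """True when the P-value is at or below the significance level."""
    return p_value <= alpha

r, critical_value = 0.900, 0.632
p_value, alpha = 0.00039, 0.05

by_table = decide_by_critical_value(r, critical_value)
by_technology = decide_by_p_value(p_value, alpha)
print(f"critical-value method supports correlation: {by_table}")
print(f"P-value method supports correlation:        {by_technology}")
```

Taking the absolute value of r is what treats the critical value as both positive and negative, so a strongly negative r would also count as evidence of a linear correlation.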
Interpretation
It appears that there is a linear correlation between white blood cell counts and red blood cell counts. It
appears that higher white blood cell counts correspond to higher red blood cell counts. But let’s
consider the second “caution” presented earlier. It is possible that in reality there is not a correlation but
random chance makes it appear that there is a correlation. The table lists only 10 pairs of white/red
blood cell counts. But watch what happens when we include all of the 147 pairs of white/red blood cell
counts for females listed in Data Set 1 “Body Data” in Appendix B. See Example 5 in your text.
Spurious correlations will become more common with the increased use of big data, and they are more
likely to occur with time-series data that have similar trends.
The value of r² is the proportion of the variation in y that is explained by the linear relationship between
x and y.
Using the 147 pairs of white/red blood cell counts, we get a linear correlation coefficient of r = 0.05815.
What proportion of the variation in the red blood cell counts can be explained by the variation in the
white blood cell counts?
With r = 0.05815, we get r² = (0.05815)² = 0.00338. We conclude that 0.00338 (or about 0.338%) of the
variation in the red blood cell counts can be explained by the linear relationship between white blood
cell counts and red blood cell counts. That’s just about nothing! (If we had used r = 0.900 from the
10 pairs of data, then r² = 0.810, which shows that 81% of the variation in red blood cell counts can be
explained by the linear relationship between the white/red blood cell counts. It would then follow that
19% of that variation cannot be explained by the linear correlation between those two variables.)
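The arithmetic behind explained variation is short enough to check directly. In this sketch the helper name `explained_variation` is an assumption; the unexplained share is simply 1 − r².

```python
def explained_variation(r):
    """Proportion of the variation in y explained by the linear relationship."""
    return r ** 2

for r in (0.05815, 0.900):
    explained = explained_variation(r)
    unexplained = 1 - explained
    print(f"r = {r}: explained = {explained:.5f}, unexplained = {unexplained:.5f}")
```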
We noted previously that there is sufficient evidence to support the claim of a linear correlation
between white blood cell counts and red blood cell counts. We should not make any conclusion that
includes a statement about a cause-effect relationship between the two variables. We should not
conclude that an increase in the white blood cell count will cause the red blood cell count to increase.
REQUIREMENT CHECK In a previous example, we noted that the requirements appear to be
satisfied. To claim that there is a linear correlation is to claim that the population linear
correlation coefficient ρ is different from 0. We therefore have the following hypotheses:
H0: ρ = 0 (No correlation)
H1: ρ ≠ 0 (Correlation)
The linear correlation coefficient is r = 0.900 (from technology) and n = 10 (because there are 10
pairs of sample data), so the test statistic is
\[ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.900}{\sqrt{\dfrac{1 - 0.900^2}{10 - 2}}} = 5.840 \]
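The test statistic is a one-line computation. The function name `t_statistic` is an assumption made for this sketch.

```python
import math

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t = t_statistic(0.900, 10)
print(f"t = {t:.3f}")
```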
With n – 2 = 8 degrees of freedom, Table A-3 shows that the test statistic of t = 5.840 yields a
P-value that is less than 0.01. Technology shows that the P-value is 0.000389. Because the P-value
of 0.000389 is less than the significance level of 0.05, we reject H0. (“If the P is low, the null must
go.” The P-value of 0.000389 is low.)
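For readers without a statistics package, the two-tailed P-value can be approximated with the standard library alone. The sketch below is one possible approach, not the method used by any particular technology: it integrates the Student t density with Simpson's rule, and the function names and step count are assumptions. Because it starts from the rounded t = 5.840, the result is close to, though not necessarily identical in the last digit to, the reported 0.000389.

```python
import math

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t, df, steps=10_000):
    """Area in both tails beyond |t|, using Simpson's rule (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2   # Simpson weights alternate 4, 2, 4, ...
        total += weight * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric, so the central area is twice the area from 0 to |t|.
    return 1 - 2 * area_zero_to_t

p_value = two_tailed_p_value(5.840, df=8)
alpha = 0.05
print(f"P-value = {p_value:.6f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```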
We conclude that there is sufficient evidence to support the claim of a linear correlation
between white/red blood cell counts of adult females. (Note that if we use the larger data set of
147 pairs of data, the P-value becomes 0.4842, indicating that there is not sufficient evidence to
support the claim of a linear correlation between white/red blood cell counts of adult females.)
The examples and exercises in this section generally involve two-tailed tests, but one-tailed tests
can occur with a claim of a positive linear correlation or a claim of a negative linear correlation.
In such cases, the hypotheses will be as shown here.
Claim of negative correlation (left-tailed test): H0: ρ = 0, H1: ρ < 0
Claim of positive correlation (right-tailed test): H0: ρ = 0, H1: ρ > 0