
We introduce the linear correlation coefficient r

The document introduces the linear correlation coefficient r, which quantifies the strength of the linear relationship between two paired variables. It explains how to compute r using specific formulas and discusses the significance of the results through hypothesis testing and P-values. Additionally, it highlights the importance of understanding correlation versus causation and the potential for spurious correlations in data analysis.

We introduce the linear correlation coefficient r, which is a number that measures how well paired
sample data fit a straight-line pattern when graphed. We use the sample of paired data (sometimes
called bivariate data) to find the value of r, and then we use r to decide whether there is a linear
correlation between the two variables.
We consider only linear relationships, which means that when graphed in a scatterplot, the points
approximate a straight-line pattern. Then, we discuss methods for conducting a formal hypothesis test
that can be used to decide whether there is a linear correlation between all population values for the
two variables. Finally, we discuss a method of randomization whereby we resample many times to test
the null hypothesis of no correlation.

A correlation exists between two variables when the values of one variable are somehow associated
with the values of the other variable.

A linear correlation exists between two variables when there is a correlation and the plotted points of
paired data result in a pattern that can be approximated by a straight line.

Scatterplots of paired data can show four typical patterns:

• Distinct straight-line, or linear, pattern. We say that there is a positive linear correlation
between x and y, since as the x values increase, the corresponding y values also increase.

• Distinct straight-line, or linear, pattern. We say that there is a negative linear correlation
between x and y, since as the x values increase, the corresponding y values decrease.

• No distinct pattern, which suggests that there is no correlation between x and y.

• Distinct pattern suggesting a correlation between x and y, but the pattern is not that of a
straight line.

We use the linear correlation coefficient r, which is a number that measures the strength of the linear
association between the two variables.

The linear correlation coefficient r measures the strength of the linear correlation between the paired
quantitative x values and y values in a sample.

The linear correlation coefficient r is computed by using Formula 10-1 or Formula 10-2, included in the
following Key Elements box. [The linear correlation coefficient is sometimes referred to as the Pearson
product moment correlation coefficient in honor of Karl Pearson (1857–1936), who originally developed
it.]

Objective

Determine whether there is a linear correlation between two variables.

Notation for the Linear Correlation Coefficient

n number of pairs of sample data.

∑ denotes addition of the items indicated.

∑x sum of all x values.

∑x² indicates that each x value should be squared and then those squares added.

(∑x)² indicates that the x values should be added and the total then squared. Avoid confusing ∑x² and
(∑x)².


∑xy indicates that each x value should first be multiplied by its corresponding y value. After obtaining all
such products, find their sum.
r linear correlation coefficient for sample data

ρ linear correlation coefficient for a population of paired data

Given any collection of sample paired quantitative data, the linear correlation coefficient r can always be
computed, but the following requirements should be satisfied when using the sample paired data to
make a conclusion about linear correlation in the corresponding population of paired data.

1. The sample of paired (x, y) data is a simple random sample of quantitative data.

2. Visual examination of the scatterplot must confirm that the points approximate a straight-line
pattern.

3. Because results can be strongly affected by the presence of outliers, any outliers must be removed
if they are known to be errors. The effects of any other outliers should be considered by calculating
r with and without the outliers included.*

*Requirements 2 and 3 above are simplified attempts at checking this formal requirement: The pairs of
(x, y) data must have a bivariate normal distribution.

FORMULA 10-1

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$$

FORMULA 10-2

$$r = \frac{\sum (z_x z_y)}{n - 1}$$

where zx denotes the z score for an individual sample value x and zy is the z score for the corresponding
sample value y.

Round the linear correlation coefficient r to three decimal places so that its value can be directly
compared to critical values.

Interpreting the Linear Correlation Coefficient r

Using P-Value from Technology to Interpret r: Use the P-value and significance level α as follows:

P-value ≤ α: Supports the claim of a linear correlation.

P-value > α: Does not support the claim of a linear correlation.


Using Table A-6 to Interpret r: Consider critical values from Table A-6 or technology as being both
positive and negative:

• Correlation: If | r | ≥ critical value, conclude that there is sufficient evidence to support the
claim of a linear correlation.
• No Correlation: If | r | < critical value, conclude that there is not sufficient evidence to support
the claim of a linear correlation.

• The value of r is always between −1 and 1 inclusive. That is, −1 ≤ r ≤ 1.

• If all values of either variable are converted to a different scale, the value of r does not change.

• The value of r is not affected by the choice of x or y. Interchange all x values and y values, and
the value of r will not change.

• r measures the strength of a linear relationship. It is not designed to measure the strength of a
relationship that is not linear.

• r is very sensitive to outliers in the sense that a single outlier could dramatically affect its value.

Use technology to find the value of the linear correlation coefficient r for the white and red blood
cell counts listed in the table.

White    Red
8.7      4.80
6.9      4.47
8.1      4.60
8.0      4.09
6.9      4.15
8.1      5.22
6.4      4.22
6.3      4.30
10.9     6.34
4.8      3.54

The value of r will be automatically calculated with software or a calculator. r = 0.900 (rounded to three
decimal places).
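
As one way to reproduce the technology result, and to check the properties of r listed earlier on this sample, here is a minimal Python sketch. The helper name `pearson_r` is hypothetical, and dropping the pair (10.9, 6.34) is only an illustration of outlier sensitivity; the text does not identify that pair as an outlier.

```python
import math

white = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]
red = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]


def pearson_r(xs, ys):
    """Linear correlation coefficient of paired samples (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(white, red)
print(f"r = {r:.3f}")

# Property: converting a variable to a different scale does not change r.
r_rescaled = pearson_r([1000 * x for x in white], red)

# Property: interchanging x and y does not change r.
r_swapped = pearson_r(red, white)

# Property: a single pair can change r noticeably.
# Illustration only: recompute r without the pair (10.9, 6.34) at index 8.
r_without = pearson_r(white[:8] + white[9:], red[:8] + red[9:])
print(f"r without (10.9, 6.34) = {r_without:.3f}")
```

The rescaled and swapped values agree with r up to floating-point error, while removing the single largest pair lowers r to roughly 0.78 for these data.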
Use Formula 10-1 to find the linear correlation coefficient r for the paired white/red blood cell counts
listed in the table.

x (White)    y (Red)    x²        y²         xy
8.7          4.80       75.69     23.0400    41.760
6.9          4.47       47.61     19.9809    30.843
8.1          4.60       65.61     21.1600    37.260
8.0          4.09       64.00     16.7281    32.720
6.9          4.15       47.61     17.2225    28.635
8.1          5.22       65.61     27.2484    42.282
6.4          4.22       40.96     17.8084    27.008
6.3          4.30       39.69     18.4900    27.090
10.9         6.34       118.81    40.1956    69.106
4.8          3.54       23.04     12.5316    16.992

∑x = 75.1    ∑y = 45.73    ∑x² = 588.63    ∑y² = 214.405    ∑xy = 353.696

The variable x is used for the white blood cell counts, and the variable y is used for the red blood cell
counts. Because there are 10 pairs of data, n = 10, and the other required values are computed in the
table.

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$$

$$= \frac{10(353.696) - (75.1)(45.73)}{\sqrt{10(588.63) - (75.1)^2}\,\sqrt{10(214.405) - (45.73)^2}}$$

$$= \frac{102.637}{\sqrt{246.29}\,\sqrt{52.8171}} = 0.900$$
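
The Formula 10-1 arithmetic can be checked with a short Python sketch that builds the five sums from the raw pairs. The variable names are illustrative assumptions, not notation from the text.

```python
import math

white = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]  # x values
red = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]  # y values

n = len(white)
sum_x = sum(white)
sum_y = sum(red)
sum_x2 = sum(x ** 2 for x in white)   # sum of squares: square first, then add
sum_y2 = sum(y ** 2 for y in red)
sum_xy = sum(x * y for x, y in zip(white, red))

# Formula 10-1: numerator and the two square-root factors of the denominator.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(f"numerator = {numerator:.3f}")
print(f"r = {r:.3f}")
```

Note that the unrounded value of ∑y² is 214.4055, so the second radicand computed here is about 52.822 rather than 52.8171; r still rounds to 0.900.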
Use Formula 10-2 to find the linear correlation coefficient r for the paired white/red blood cell counts
listed in the table.


If manual calculations are absolutely necessary, Formula 10-1 is much easier than Formula 10-2, but
Formula 10-2 has the advantage of making it easier to understand how r works. The variable x is used for
the white blood cell counts, and the variable y is used for the red blood cell counts. In Formula 10-2,
each sample value is replaced by its corresponding z score.

Solution

For example, using unrounded numbers, the white blood cell counts have a mean of x̄ = 7.51 and a
standard deviation of sx = 1.654254, so the first x value of 8.7 is converted to a z score of 0.71936:

$$z_x = \frac{x - \bar{x}}{s_x} = \frac{8.7 - 7.51}{1.654254} = 0.71936$$

The following table lists the z scores for all of the white blood cell counts (third column) and the
z scores for all of the red blood cell counts (fourth column). The last column of the table lists the
products zx · zy.

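x (White)    y (Red)    zx          zy          zx · zy
8.7          4.80       0.71936     0.29631     0.21315
6.9          4.47       –0.36875    –0.13445    0.04958
8.1          4.60       0.35666     0.03524     0.01257
8.0          4.09       0.29621     –0.63046    –0.18675
6.9          4.15       –0.36875    –0.55215    0.20361
8.1          5.22       0.35666     0.84454     0.30121
6.4          4.22       –0.67100    –0.46077    0.30918
6.3          4.30       –0.73145    –0.35635    0.26065
10.9         6.34       2.04926     2.30648     4.72658
4.8          3.54       –1.63820    –1.34838    2.20892

(The products in the last column are computed from the rounded z scores shown, so their sum can differ
from the unrounded total in the last decimal place.)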

Using ∑(zx · zy) = 8.09870, the value of r is calculated:

$$r = \frac{\sum (z_x z_y)}{n - 1} = \frac{8.09870}{10 - 1} = 0.900$$
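
The z-score route of Formula 10-2 can be sketched in Python with the standard library. This is a minimal illustration; it assumes the sample standard deviation (divisor n − 1), which is what `statistics.stdev` computes, and the list names are hypothetical.

```python
import statistics

white = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]  # x values
red = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]  # y values
n = len(white)

# Sample means and sample standard deviations (divisor n - 1).
x_bar, s_x = statistics.mean(white), statistics.stdev(white)
y_bar, s_y = statistics.mean(red), statistics.stdev(red)

# Replace every sample value by its z score.
z_x = [(x - x_bar) / s_x for x in white]
z_y = [(y - y_bar) / s_y for y in red]

# Formula 10-2: sum of the products of paired z scores, divided by n - 1.
sum_products = sum(zx * zy for zx, zy in zip(z_x, z_y))
r = sum_products / (n - 1)

print(f"first z scores: zx = {z_x[0]:.5f}, zy = {z_y[0]:.5f}")
print(f"r = {r:.3f}")
```

Each product zx · zy is positive when both values lie on the same side of their means, which is why a sample with mostly positive products yields a positive r.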
Using the value of r = 0.900 for the 10 pairs of data in the table, and using a significance level of 0.05,
determine whether there is sufficient evidence to support a claim that there is a linear correlation
between white blood cell counts and red blood cell counts.

Requirement Check

The data are a simple random sample, so the first requirement is satisfied.

The second requirement of a scatterplot showing a straight-line pattern is satisfied.

Using P-Value from Technology to Interpret r: Use the P-value and significance level α as follows:

• P-value ≤ α: Supports the claim of a linear correlation.

• P-value > α: Does not support the claim of a linear correlation.

The Statdisk display shows that the P-value is 0.00039. Because the P-value is less than or equal
to the significance level of 0.05, we conclude that there is sufficient evidence to support the
claim of a linear correlation between white blood cell counts and red blood cell counts.
Using Table A-6 to Interpret r: Consider critical values from Table A-6 as being both positive and
negative, and draw a graph of r with the two critical values marked.

For the 10 pairs of data, Table A-6 yields a critical value of r = 0.632 and technology yields a
critical value of r = 0.632. We can now compare the computed value of r = 0.900 to the critical values
of r = ±0.632.
Correlation: If the computed linear correlation coefficient r lies in the left or right tail region
beyond the critical value for that tail, conclude that there is sufficient evidence to support the
claim of a linear correlation.

No Correlation: If the computed linear correlation coefficient lies between the two critical values,
conclude that there is not sufficient evidence to support the claim of a linear correlation.

Because the figure shows that the computed value of r = 0.900 lies beyond the upper critical value of
0.632, we conclude that there is sufficient evidence to support the claim of a linear correlation between
white blood cell counts and red blood cell counts.
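
The critical-value comparison can be expressed as a small Python sketch. The function name is hypothetical. The second part rests on an assumption not stated in the text: that the critical value of r follows from the t critical value through r = t / √(t² + n − 2), with t = 2.306 taken from a standard t table for a two-tailed test at α = 0.05 with 8 degrees of freedom.

```python
import math


def supports_linear_correlation(r, critical_value):
    """Decision rule: |r| at or beyond the critical value supports a linear correlation."""
    return abs(r) >= critical_value


# Values from the example: r = 0.900, critical value 0.632 for n = 10 pairs.
decision = supports_linear_correlation(0.900, 0.632)
print("Sufficient evidence of a linear correlation:", decision)

# Assumed relation between the t critical value and the r critical value.
n = 10
t_critical = 2.306  # two-tailed, alpha = 0.05, df = n - 2 = 8 (standard t table)
r_critical = t_critical / math.sqrt(t_critical ** 2 + (n - 2))
print(f"critical r = {r_critical:.3f}")
```

Under that assumption the computed critical value agrees with the 0.632 quoted from Table A-6.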

Interpretation
It appears that there is a linear correlation between white blood cell counts and red blood cell counts. It
appears that higher white blood cell counts correspond to higher red blood cell counts. But let’s
consider the second “caution” presented earlier. It is possible that in reality there is no correlation,
but random chance makes it appear that there is one. The table lists only 10 pairs of white and red
blood cell counts. But watch what happens when we include all of the 147 pairs of white and red blood
cell counts for females listed in Data Set 1 “Body Data” in Appendix B. See Example 5 in your text.

A spurious correlation is a correlation that doesn’t have an actual association.

Spurious correlations will become more common with the increased use of big data, and they are more
likely to occur with time-series data that have similar trends.

The value of r² is the proportion of the variation in y that is explained by the linear relationship between
x and y.

Using the 147 pairs of white and red blood cell counts, we get a linear correlation coefficient of
r = 0.05815. What proportion of the variation in the red blood cell counts can be explained by the
variation in the white blood cell counts?

With r = 0.05815, we get r² = (0.05815)² = 0.00338. We conclude that 0.00338 (or about 0.338%) of the
variation in the red blood cell counts can be explained by the linear relationship between white blood
cell counts and red blood cell counts. That’s just about nothing! (If we had used r = 0.900 from the
10 pairs of data, then r² = 0.810, which shows that 81% of the variation in red blood cell counts can
be explained by the linear relationship between the white and red blood cell counts. It would then
follow that 19% of that variation cannot be explained by the linear correlation between those two
variables.)
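
The explained-variation arithmetic is simple enough to check directly. A minimal Python sketch, with a hypothetical helper name, using the two values of r quoted in the text:

```python
def explained_variation(r):
    """Proportion of the variation in y explained by the linear relationship: r squared."""
    return r ** 2


# r for the 147 pairs and for the 10 pairs, as quoted in the text.
for label, r in [("147 pairs", 0.05815), ("10 pairs", 0.900)]:
    explained = explained_variation(r)
    unexplained = 1 - explained
    print(f"{label}: explained = {explained:.5f}, unexplained = {unexplained:.5f}")
```

The explained and unexplained proportions always sum to 1, so an r² of 0.810 leaves 0.190 of the variation unexplained.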

Correlation does not imply causality!

We noted previously that there is sufficient evidence to support the claim of a linear correlation
between lottery jackpot amounts and numbers of tickets sold. We should not make any conclusion that
includes a statement about a cause-effect relationship between the two variables. We should not
conclude that an increase in the jackpot amount will cause ticket sales to increase.

Common Errors Involving Correlation

1. Assuming that correlation implies causality


2. Using data based on averages

3. Ignoring the possibility of a nonlinear relationship

4. Obtaining false correlations arising with the use of many tests.

Hypotheses: If conducting a formal hypothesis test to determine whether there is a significant
linear correlation between two variables, use the following null and alternative hypotheses that
use ρ to represent the linear correlation coefficient of the population:

Null Hypothesis H0: ρ = 0 (No correlation)

Alternative Hypothesis H1: ρ ≠ 0 (Correlation)


Test Statistic: The same methods of Part 1 can be used with the test statistic r, or the t test
statistic can be found using the following:

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

P-values and critical values can be found using technology or Table A-3 as described in earlier
chapters.
Use the paired white and red blood cell counts in the table to conduct a formal hypothesis test of
the claim that there is a linear correlation between the two variables. Use a 0.05 significance
level with the P-value method of testing hypotheses.


REQUIREMENT CHECK In a previous example, we noted that the requirements appear to be
satisfied. To claim that there is a linear correlation is to claim that the population linear
correlation coefficient ρ is different from 0. We therefore have the following hypotheses:

H0: ρ = 0 (No correlation)
H1: ρ ≠ 0 (Correlation)

The linear correlation coefficient is r = 0.900 (from technology) and n = 10 (because there are 10
pairs of sample data), so the test statistic is

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.900}{\sqrt{\dfrac{1 - 0.900^2}{10 - 2}}} = 5.840$$
With n – 2 = 8 degrees of freedom, Table A-3 shows that the test statistic of t = 5.840 yields a
P-value that is less than 0.01. Technology shows that the P-value is 0.000389. Because the P-value
of 0.000389 is less than the significance level of 0.05, we reject H0. (“If the P is low, the null must
go.” The P-value of 0.000389 is low.)
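
The test statistic and its two-tailed P-value can be approximated without a statistics package. The sketch below is one possible approach, not the method used by Statdisk or any other technology named in the text: it integrates the Student t density numerically with Simpson's rule. All function names are hypothetical.

```python
import math


def t_statistic(r, n):
    """t test statistic for H0: rho = 0, computed from r and the number of pairs n."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


def t_density(x, df):
    """Probability density of the Student t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p_value(t, df, steps=20000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_half = total * h / 3  # area under the density between 0 and |t|
    return 1 - 2 * central_half


r, n = 0.900, 10
t = t_statistic(r, n)
p_value = two_tailed_p_value(t, n - 2)
print(f"t = {t:.3f}")
print(f"P-value = {p_value:.6f}")
print("Reject H0" if p_value <= 0.05 else "Fail to reject H0")
```

Because this sketch starts from the rounded r = 0.900, its P-value can differ slightly in the last digits from the 0.000389 that technology reports from unrounded data; the decision to reject H0 is the same.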

We conclude that there is sufficient evidence to support the claim of a linear correlation
between white and red blood cell counts of adult females. (Note that if we use the larger data set of
147 pairs of data, the P-value becomes 0.4842, indicating that there is not sufficient evidence to
support the claim of a linear correlation between white and red blood cell counts of adult females.)
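
The introduction mentions a randomization method that resamples many times to test the null hypothesis of no correlation. The text does not give the procedure, so the Python sketch below assumes one common scheme: keep the x values fixed, shuffle the y values to break any pairing, and count how often the shuffled |r| is at least as large as the observed |r|. The function names, the 10,000 resamples, and the fixed seed are all illustrative assumptions.

```python
import math
import random

white = [8.7, 6.9, 8.1, 8.0, 6.9, 8.1, 6.4, 6.3, 10.9, 4.8]
red = [4.80, 4.47, 4.60, 4.09, 4.15, 5.22, 4.22, 4.30, 6.34, 3.54]


def pearson_r(xs, ys):
    """Linear correlation coefficient of paired samples (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def randomization_p_value(xs, ys, resamples=10000, seed=0):
    """Estimate a two-tailed P-value by shuffling ys against fixed xs (assumed scheme)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(resamples):
        rng.shuffle(shuffled)  # simulates "no correlation" between x and y
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / resamples


p_estimate = randomization_p_value(white, red)
print(f"estimated P-value = {p_estimate:.4f}")
```

A small estimated P-value means that shuffled data rarely produce a correlation as strong as the observed one, which points to the same conclusion as the t test for these 10 pairs.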

The examples and exercises in this section generally involve two-tailed tests, but one-tailed tests
can occur with a claim of a positive linear correlation or a claim of a negative linear correlation.
In such cases, the hypotheses will be as shown here.

Claim of Negative Correlation    Claim of Positive Correlation
(Left-Tailed Test)               (Right-Tailed Test)
H0: ρ = 0                        H0: ρ = 0
H1: ρ < 0                        H1: ρ > 0
