
Pearson R Correlation

Pearson’s Correlation

Definition
Pearson's correlation coefficient is the covariance of the two variables divided by the product of
their standard deviations. The form of the definition involves a "product moment", that is, the
mean (the first moment about the origin) of the product of the mean-adjusted random
variables; hence the modifier product-moment in the name.
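
As a minimal sketch of this definition (the function name `pearson_r` and the sample data are illustrative, not from the source), the coefficient can be computed directly as the mean product of the mean-adjusted values divided by the product of the standard deviations:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance / (std_x * std_y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Mean of the product of the mean-adjusted variables (the "product moment").
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        # The ratio is 0/0 when either variable is constant.
        raise ValueError("correlation is undefined when a variable has zero variance")
    return cov / (std_x * std_y)

# Hypothetical data, chosen only for illustration; r comes out close to 0.8.
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Dividing both the covariance and the variances by n (rather than n − 1) is a choice made here for simplicity; the factor cancels in the ratio, so either convention gives the same r.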

In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/), also referred to
as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC) or the bivariate
correlation,[1] is a measure of the linear correlation between two variables X and Y. Owing to
the Cauchy–Schwarz inequality it has a value between +1 and −1, where 1 is total positive linear
correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used
in the sciences. It was developed by Karl Pearson from a related idea introduced by Francis
Galton in the 1880s.

Examples of scatter diagrams with different values of correlation coefficient (ρ)

Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the
correlation reflects the strength and direction of a linear relationship (top row), but not the slope of
that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the
center has a slope of 0 but in that case the correlation coefficient is undefined because the variance
of Y is zero.

Mathematical properties
The absolute values of both the sample and population Pearson correlation coefficients are less
than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a
line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a
line (in the case of the population correlation). The Pearson correlation coefficient is symmetric:
corr(X,Y) = corr(Y,X).
A key mathematical property of the Pearson correlation coefficient is that it is invariant under
separate changes in location and scale in the two variables. That is, we may
transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0,
without changing the correlation coefficient. (This holds for both the population and sample
Pearson correlation coefficients.) Note that more general linear transformations do change the
correlation: see § Decorrelation of n random variables for an application of this.
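
The invariance can be checked numerically. The following is a small sketch with made-up data and constants (the helper `pearson_r` and all values are assumptions for illustration); it also shows that a negative scale factor on one variable, which falls outside the b, d > 0 condition, flips the sign of the coefficient:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from sums of centered products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data and constants, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [3.0, 1.0, 6.0, 5.0, 9.0]
a, b, c, d = 10.0, 2.5, -4.0, 0.3  # b, d > 0 as required

r_original = pearson_r(x, y)
# X -> a + bX and Y -> c + dY leave the coefficient unchanged.
r_transformed = pearson_r([a + b * xi for xi in x], [c + d * yi for yi in y])
# A negative scale on X alone reverses the sign of the coefficient.
r_flipped = pearson_r([a - b * xi for xi in x], [c + d * yi for yi in y])

print(math.isclose(r_original, r_transformed, rel_tol=1e-9))
print(math.isclose(r_flipped, -r_original, rel_tol=1e-9))
```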

Geometric interpretation
For uncentered data, there is a relation between the correlation coefficient and the angle φ
between the two regression lines, y = g_x(x) and x = g_y(y), obtained by regressing y on x and
x on y respectively. (Here φ is measured counterclockwise within the first quadrant formed
around the lines' intersection point if r > 0, or counterclockwise from the fourth to the second
quadrant if r < 0.) One can show[9] that if the standard deviations are equal, then
r = sec φ − tan φ, where sec and tan are trigonometric functions.
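
For the case 0 < r ≤ 1, the identity can be checked with a short derivation. This is a sketch, not the cited proof. It assumes equal standard deviations, so that the regression of y on x has slope r and the regression of x on y, drawn in the same (x, y) plane, has slope 1/r, and it takes φ to be the angle between those two lines:

```latex
\begin{align*}
\tan\varphi &= \frac{\tfrac{1}{r} - r}{1 + \tfrac{1}{r}\cdot r} = \frac{1 - r^2}{2r}, \\
\sec\varphi &= \sqrt{1 + \tan^2\varphi} = \frac{1 + r^2}{2r}, \\
\sec\varphi - \tan\varphi &= \frac{(1 + r^2) - (1 - r^2)}{2r} = r .
\end{align*}
```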

[Figure: Regression lines for y = g_x(x) [red] and x = g_y(y) [blue]]

For centered data (i.e., data which have been shifted by the sample means of their respective
variables so as to have an average of zero for each variable), the correlation coefficient can also
be viewed as the cosine of the angle θ between the two observed vectors in N-dimensional space
(for N observations of each variable)[10]:ch. 5 (as illustrated for a special case in the
next paragraph).
Both the uncentered (non-Pearson-compliant) and centered correlation coefficients can be determined
for a dataset. As an example, suppose five countries are found to have gross national products of 1, 2, 3,
5, and 8 billion dollars, respectively. Suppose these same five countries (in the same order) are found to
have 11%, 12%, 13%, 15%, and 18% poverty. Then let x and y be ordered 5-element vectors containing
the above data: x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).
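By the usual procedure for finding the angle θ between two vectors (see dot product),
the uncentered correlation coefficient is:

\[ \cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|} = \frac{2.93}{\sqrt{103}\,\sqrt{0.0983}} \approx 0.9208 \]
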
This uncentered correlation coefficient is identical with the cosine similarity. Note that the above
data were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01 x. The Pearson
correlation coefficient must therefore be exactly one. Centering the data (shifting x by E(x) =
3.8 and y by E(y) = 0.138) yields x = (−2.8, −1.8, −0.8, 1.2, 4.2) and y = (−0.028, −0.018, −0.008,
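0.012, 0.042), from which

\[ \cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|} = \frac{0.308}{\sqrt{30.8}\,\sqrt{0.00308}} = 1 = \rho_{xy} \]

as expected.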
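
The two coefficients in this example can be reproduced numerically. The sketch below uses the x and y vectors given in the text; the helper names `cosine_similarity` and `center` are illustrative choices, not part of the source:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center(v):
    """Shift a vector by its mean so that it averages to zero."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

# Data from the example: gross national product and poverty rate.
x = [1, 2, 3, 5, 8]
y = [0.11, 0.12, 0.13, 0.15, 0.18]

# Uncentered coefficient (cosine similarity of the raw vectors), about 0.9208.
uncentered = cosine_similarity(x, y)
# Centered coefficient (Pearson's r), equal to 1 up to floating-point rounding.
centered = cosine_similarity(center(x), center(y))

print(round(uncentered, 4), round(centered, 4))
```

Centering is what separates the two results: the raw vectors are not parallel because of the 0.10 offset in y, whereas the centered vectors are exact multiples of each other.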
