Pearson R Correlation
Definition
Pearson's correlation coefficient is the covariance of the two variables divided by the product of
their standard deviations. The form of the definition involves a "product moment", that is, the
mean (the first moment about the origin) of the product of the mean-adjusted random
variables; hence the modifier product-moment in the name.
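The definition can be sketched directly in code. The following is a minimal illustration rather than a reference implementation; the function name pearson_r is an assumption made for this example. It divides by n when forming the covariance and the standard deviations; dividing by n − 1 instead would give the same result, because the factor cancels in the ratio.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation: covariance / (std_x * std_y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The "product moment": the mean of the product of the mean-adjusted variables.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        # The ratio is 0/0 when either variable is constant.
        raise ValueError("correlation is undefined when a variable has zero variance")
    return cov / (std_x * std_y)

print(round(pearson_r([1, 2, 3, 4], [2, 1, 4, 3]), 6))  # 0.6
```

For the small data set in the last line, the centered cross-product sum is 3 and both centered sums of squares are 5, so r = 3/5.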
Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the
correlation reflects the strength and direction of a linear relationship (top row), but not the slope of
that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the
center has a slope of 0, but in that case the correlation coefficient is undefined because the variance
of Y is zero.
Mathematical properties
The absolute values of both the sample and population Pearson correlation coefficients are less
than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a
line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a
line (in the case of the population correlation). The Pearson correlation coefficient is symmetric:
corr(X,Y) = corr(Y,X).
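These properties can be checked numerically. The sketch below uses simulated data; the helper pearson_r, the seed, and the slopes and intercepts are illustrative choices, not taken from the text. It compares a noisy linear relationship with points lying exactly on a rising and a falling line.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation from centered sums of squares and products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.gauss(0, 1) for _ in range(200)]
noisy = [2 * x + rng.gauss(0, 1) for x in xs]   # linear trend plus noise
on_rising_line = [3 * x + 1 for x in xs]        # exactly on a line, positive slope
on_falling_line = [-3 * x + 1 for x in xs]      # exactly on a line, negative slope

r_noisy = pearson_r(xs, noisy)
r_swapped = pearson_r(noisy, xs)                # arguments exchanged
r_up = pearson_r(xs, on_rising_line)
r_down = pearson_r(xs, on_falling_line)

print(abs(r_noisy) <= 1, math.isclose(r_noisy, r_swapped))
print(math.isclose(r_up, 1.0), math.isclose(r_down, -1.0))
```

The exact-line cases are compared with math.isclose rather than ==, since floating-point rounding can leave the computed value a hair away from ±1.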
A key mathematical property of the Pearson correlation coefficient is that it is invariant under
separate changes in location and scale in the two variables. That is, we may
transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0,
without changing the correlation coefficient. (This holds for both the population and sample
Pearson correlation coefficients.) Note that more general linear transformations do change the
correlation: see § Decorrelation of n random variables for an application of this.
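The invariance can be illustrated with a short numerical check. This is a sketch under assumed values: the data, the constants a, b, c, d, and the helper names are chosen only for the example. It also shows the complementary case where one scale factor is negative, which falls outside the b, d > 0 condition and flips the sign of the coefficient.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from centered sums of squares and products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def affine(values, shift, scale):
    """Apply v -> shift + scale * v to every value."""
    return [shift + scale * v for v in values]

xs = [1.0, 2.0, 3.0, 5.0, 8.0]
ys = [2.0, 1.0, 4.0, 3.0, 7.0]

r_original = pearson_r(xs, ys)
# a = 10, b = 2.5 applied to X; c = -4, d = 0.3 applied to Y (both scales positive).
r_transformed = pearson_r(affine(xs, 10.0, 2.5), affine(ys, -4.0, 0.3))
# A negative scale on X violates b > 0 and reverses the sign of r.
r_flipped = pearson_r(affine(xs, 10.0, -2.5), affine(ys, -4.0, 0.3))

print(math.isclose(r_original, r_transformed))  # True
print(math.isclose(r_flipped, -r_original))     # True
```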
Geometric interpretation
For uncentered data, there is a relation between the correlation coefficient and the angle φ between
the two regression lines, y = g_x(x) and x = g_y(y), obtained by regressing y on x and x on y
respectively. (Here φ is measured counterclockwise within the first quadrant formed around the lines'
intersection point if r > 0, or counterclockwise from the fourth to the second quadrant if r < 0.)
One can show[9] that if the standard deviations are equal, then r = sec φ − tan φ, where sec and tan
are trigonometric functions.
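The relation r = sec φ − tan φ can be checked numerically. The sketch below is an illustration under stated assumptions: the function name and the sample data are invented for the example, and y is rescaled inside the function so that both variables have equal standard deviations, as the relation requires. The angle of the x-on-y line is taken with atan2 so that it lands in the first quadrant when r > 0 and in the second quadrant when r < 0, matching the convention described above.

```python
import math

def r_and_r_from_angle(xs, raw_ys):
    """Return (r, sec(phi) - tan(phi)) after rescaling y to match the spread of x."""
    n = len(xs)

    def centered_sums(us, vs):
        mu, mv = sum(us) / n, sum(vs) / n
        suu = sum((u - mu) ** 2 for u in us)
        svv = sum((v - mv) ** 2 for v in vs)
        suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
        return suu, svv, suv

    # Rescale y so that std(y) == std(x); this leaves r unchanged.
    sxx, syy, _ = centered_sums(xs, raw_ys)
    scale = math.sqrt(sxx / syy)
    ys = [y * scale for y in raw_ys]

    sxx, syy, sxy = centered_sums(xs, ys)
    r = sxy / math.sqrt(sxx * syy)

    slope_y_on_x = sxy / sxx  # line y = g_x(x)
    slope_x_on_y = sxy / syy  # line x = g_y(y); its slope in the (x, y) plane is 1 / slope_x_on_y
    angle_y_on_x = math.atan(slope_y_on_x)
    angle_x_on_y = math.atan2(1.0, slope_x_on_y)
    phi = angle_x_on_y - angle_y_on_x  # counterclockwise angle between the two lines
    return r, 1.0 / math.cos(phi) - math.tan(phi)

xs = [1.0, 2.0, 3.0, 5.0, 8.0]
ys = [2.0, 1.0, 4.0, 3.0, 7.0]
r_pos, check_pos = r_and_r_from_angle(xs, ys)
r_neg, check_neg = r_and_r_from_angle(xs, [-y for y in ys])
print(math.isclose(r_pos, check_pos), math.isclose(r_neg, check_neg))
```

The check breaks down at r = 0, where φ is a right angle and sec φ and tan φ are both unbounded; the sample data avoid that case.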
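The uncentered correlation coefficient, computed from the raw data vectors without subtracting their
means, is identical to the cosine similarity. The example data, x = (1, 2, 3, 5, 8) and
y = (0.11, 0.12, 0.13, 0.15, 0.18), were deliberately chosen to be perfectly correlated:
y = 0.10 + 0.01 x. The Pearson correlation coefficient must therefore be exactly one. Centering the
data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) yields x = (−2.8, −1.8, −0.8, 1.2, 4.2) and
y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which the cosine of the angle θ between the centered
vectors is
cos θ = (x · y) / (‖x‖ ‖y‖) = 0.308 / (√30.8 × √0.00308) = 1,
as expected.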
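The contrast between the uncentered and centered coefficients can be reproduced in a few lines. In this sketch the raw vectors are recovered by adding the stated means (3.8 and 0.138) back to the centered values, and the helper names cosine and center are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center(values):
    """Shift values so that their mean is zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

x = [1, 2, 3, 5, 8]
y = [0.10 + 0.01 * xi for xi in x]  # perfectly linear in x

uncentered = cosine(x, y)                # cosine similarity of the raw vectors
centered = cosine(center(x), center(y))  # Pearson correlation coefficient

print(round(uncentered, 6))  # 0.920815
print(round(centered, 6))    # 1.0
```

Although y is an exact linear function of x, the uncentered value falls short of one because the intercept 0.10 tilts the raw vector y away from the direction of x; subtracting the means removes that offset.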