Statistical Concepts Series

Correlation and Simple Linear Regression¹

Kelly H. Zou, PhD · Kemal Tuncali, MD · Stuart G. Silverman, MD

Radiology 2003; 227:617–622 · Published online 10.1148/radiol.2273011499

Index terms: Data analysis; Statistical analysis

In this tutorial article, the concepts of correlation and regression are reviewed and demonstrated. The authors review and compare two correlation coefficients, the Pearson correlation coefficient and the Spearman ρ, for measuring linear and nonlinear relationships between two continuous variables. In the case of measuring the linear relationship between a predictor and an outcome variable, simple linear regression analysis is conducted. These statistical concepts are illustrated by using a data set from published literature to assess a computed tomography–guided interventional technique. These statistical methods are important for exploring the relationships between variables and can be applied to many radiologic studies. © RSNA, 2003

¹ From the Department of Radiology, Brigham and Women's Hospital (K.H.Z., K.T., S.G.S.), and the Department of Health Care Policy (K.H.Z.), Harvard Medical School, 180 Longwood Ave, Boston, MA 02115. Received September 10, 2001; revision requested October 31; revision received December 26; accepted January 21, 2002. Address correspondence to K.H.Z. (e-mail: zou@bwh.harvard.edu).

Results of clinical studies frequently yield data that are dependent on each other (eg, total procedure time versus dose in computed tomographic [CT] fluoroscopy, signal-to-noise ratio versus number of signals acquired during magnetic resonance imaging, and cigarette smoking versus lung cancer). The statistical concepts correlation and regression, which are used to evaluate the relationship between two continuous variables, are reviewed and demonstrated in this article.
Analyses between two variables may focus on (a) any association between the variables, (b) the value of one variable in predicting the other, and (c) the amount of agreement. Agreement will be discussed in a future article. Regression analysis focuses on the form of the relationship between variables, while the objective of correlation analysis is to gain insight into the strength of the relationship (1,2). Note that these two techniques are used to investigate relationships between continuous variables, whereas the χ² test is an example of a test for association between categorical variables. Continuous variables, such as procedure time, patient age, and number of lesions, have no gaps on the measurement scale. In contrast, categorical variables, such as patient sex and tissue classification based on segmentation, have gaps in their possible values. These two types of variables and the assumptions about their measurement scales were reviewed and distinguished in an article by Applegate and Crewson (3) published earlier in this Statistical Concepts Series in Radiology.
Specifically, the topics covered herein include two commonly used correlation coefficients, the Pearson correlation coefficient (4,5) and the Spearman ρ (6–10), for measuring linear and nonlinear relationships, respectively, between two continuous variables. Correlation analysis is often conducted in a retrospective or observational study. In a clinical trial, on the other hand, the investigator may also wish to manipulate the values of one variable and assess the changes in values of another variable. To evaluate the relative impact of the predictor variable on the particular outcome, simple regression analysis is preferred. We illustrate these statistical concepts with existing data to assess patient skin dose based on total procedure time by using a quick-check method in CT fluoroscopy–guided abdominal interventions (11).
These statistical methods are useful tools for assessing the relationships between continuous variables collected from a clinical study. However, it is also important to distinguish these statistical methods. While they are similar mathematically, their purposes are different. Correlation analysis is generally overused. It is often interpreted incorrectly (to establish "causation") and should be reserved for generating hypotheses rather than for testing them. On the other hand, regression modeling is a more useful statistical technique that allows us to assess the strength of the relationships in the data and the uncertainty in the model by using confidence intervals (12,13).


Figure 1. Scatterplots of four sets of data generated by means of the following Pearson correlation coefficients (from left to right): r = 0 (uncorrelated data), r = 0.8 (strongly positively correlated), r = 1.0 (perfectly positively correlated), and r = −1.0 (perfectly negatively correlated).

CORRELATION

The purpose of correlation analysis is to measure and interpret the strength of a linear or nonlinear (eg, exponential, polynomial, and logistic) relationship between two continuous variables. When conducting correlation analysis, we use the term association to mean "linear association" (1,2). Herein, we focus on the Pearson and Spearman ρ correlation coefficients. Both correlation coefficients take on values between −1 and +1, ranging from being negatively correlated (−1) to uncorrelated (0) to positively correlated (+1). The sign of the correlation coefficient (ie, positive or negative) defines the direction of the relationship. The absolute value indicates the strength of the correlation (Table 1, Fig 1). We elaborate on two correlation coefficients, linear (eg, Pearson) and rank (eg, Spearman), that are commonly used for measuring linear and general relationships between two variables.

TABLE 1
Interpretation of Correlation Coefficient

Correlation Coefficient Value    Direction and Strength of Correlation
−1.0                             Perfectly negative
−0.8                             Strongly negative
−0.5                             Moderately negative
−0.2                             Weakly negative
 0.0                             No association
+0.2                             Weakly positive
+0.5                             Moderately positive
+0.8                             Strongly positive
+1.0                             Perfectly positive

Note.—The sign of the correlation coefficient (ie, positive or negative) defines the direction of the relationship. The absolute value indicates the strength of the correlation.
Linear Correlation

The Pearson correlation coefficient is also known as the sample correlation coefficient (r), product-moment correlation coefficient, or coefficient of correlation (14). It was introduced by Galton in 1877 (15,16) and developed later by Pearson (17). It measures the linear relationship between two random variables. For example, when the value of the predictor is manipulated (increased or decreased) by a fixed amount, the outcome variable changes proportionally (linearly). A linear correlation coefficient can be computed by means of the data and their sample means (Appendix A). When a scientific study is planned, the required sample size may be computed on the basis of a certain hypothesized value with the desired statistical power at a specified level of significance (Appendix B) (18).

Rank Correlation

The Spearman ρ is the sample correlation coefficient (rs) of the ranks (the relative order) based on continuous data (19,20). It was first introduced by Spearman in 1904 (6). The Spearman ρ is used to measure the monotonic relationship between two variables (ie, whether one variable tends to take either a larger or a smaller value, though not necessarily linearly, as the value of the other variable increases).

Linear versus Rank Correlation Coefficients

The Pearson correlation coefficient necessitates use of interval or continuous measurement scales of the measured outcome in the study population. In contrast, rank correlations also work well with ordinal rating data, and continuous data are reduced to their ranks (Appendix C) (20,21). The rank procedure will also be illustrated briefly with our example data. The smallest value in the sample has rank 1, and the largest has the highest rank. In general, rank correlations are not easily influenced by the presence of skewed data or data that are highly variable.
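As a brief illustration of this rank reduction, the following Python sketch (our choice of tooling; the article itself demonstrates Excel, S-Plus, and R) ranks a handful of made-up values, averaging the ranks of tied observations:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical measurements; not from the article's data set.
x = np.array([3.2, 1.7, 4.8, 1.7, 2.9])

# rankdata assigns rank 1 to the smallest value and, by default
# (method="average"), averages the ranks of tied observations.
print(rankdata(x))  # [4.  1.5  5.  1.5  3. ]
```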
Statistical Hypothesis Tests for a Correlation Coefficient

The null hypothesis states that the underlying linear correlation has a hypothesized value, ρ0. The one-sided alternative hypothesis is that the underlying value exceeds (or is less than) ρ0. When the sample size (n) of the paired data is large (n ≥ 30 for each variable), the standard error (s) of the linear correlation (r) is approximately s(r) = (1 − r²)/√n. The test statistic value (r − ρ0)/s(r) may be computed by means of the z test (22). If the P value is below .05, the null hypothesis is rejected. The P value based on the Spearman ρ can be found in the literature (20,21).
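The large-sample z test just described is simple to express in code. Here is a minimal Python sketch under the stated approximation for s(r); the function name and the illustrative numbers are ours, not from the article:

```python
import numpy as np
from scipy.stats import norm

def corr_z_test(r, rho0, n):
    """One-sided large-sample (n >= 30) z test of H0: rho = rho0
    versus H1: rho > rho0, using s(r) = (1 - r**2)/sqrt(n)."""
    s = (1 - r**2) / np.sqrt(n)
    z = (r - rho0) / s
    return z, norm.sf(z)  # test statistic and upper-tail P value

z, p = corr_z_test(r=0.60, rho0=0.40, n=50)  # illustrative values only
print(f"z = {z:.2f}, one-sided P = {p:.4f}")
```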
Limitations and Precautions

It is worth noting that even if two variables (eg, cigarette smoking and lung cancer) are highly correlated, this is not sufficient proof of causation. One variable may cause the other or vice versa, a third factor may be involved, or a rare event may have occurred. To conclude causation, the causal variable must precede the variable it causes, and several conditions must be met (eg, reversibility, strength, and exposure response on the basis of the Bradford-Hill criteria or the Rubin causal model) (23–26).



SIMPLE LINEAR REGRESSION

The purpose of simple regression analysis is to evaluate the relative impact of a predictor variable on a particular outcome. This is different from a correlation analysis, where the purpose is to examine the strength and direction of the relationship between two random variables. In this article, we deal with only linear regression of one continuous variable on another continuous variable with no gaps on each measurement scale (3). There are other types of regression (eg, multiple linear, logistic, and ordinal) analyses, which will be covered in a future article in this Statistical Concepts Series in Radiology.

A simple regression model contains only one independent (explanatory) variable, Xi, for i = 1, . . ., n subjects, and is linear with respect to both the regression parameters and the dependent variable. The corresponding dependent (outcome) variable is labeled Yi. The model is expressed as

$Y_i = a + bX_i + e_i$,  (1)

where the regression parameter a is the intercept (on the y axis), and the regression parameter b is the slope of the regression line (Fig 2). The random error term ei is assumed to be uncorrelated, with a mean of 0 and constant variance. For convenience in inference and improved efficiency in estimation (27), analyses often incur an additional assumption that the errors are distributed normally. Transformation of the data to achieve normality may be applied (28,29). Thus, the word line (linear, independent, normal, equal variance) summarizes these requirements.

Typical steps for regression model analysis are the following: (a) determine if the assumptions underlying a normal relationship are met in the data, (b) obtain the equation that best fits the data, (c) evaluate the equation to determine the strength of the relationship for prediction and estimation, and (d) assess whether the data fit these criteria before the equation is applied for prediction and estimation.

Figure 2. Simple linear regression model shows that the expectation of the dependent variable Y is linear in the independent variable X, with an intercept a = 1.0 and a slope b = 2.0.

Least Squares Method

The main goal of linear regression is to fit a straight line through the data that predicts Y based on X. To estimate the intercept and slope regression parameters that determine this line, the least squares method is commonly used. It is not necessary for the errors to have a normal distribution, although the regression analysis is more efficient with this assumption (27). With this regression method, a set of regression parameters is found such that the sum of squared residuals (ie, the differences between the observed values of the outcome variable and the fitted values) is minimized (14). The fitted y value is then computed as a function of the given x value and the estimated intercept and slope regression parameters (Appendix D). For example, in Equation (1), once the estimates of a and b are obtained from the regression analysis, the predicted y value at any given x value is calculated as a + bx.
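To make the least squares step concrete, here is a minimal Python sketch (our tooling choice) that simulates the model shown in Figure 2, with a = 1.0 and b = 2.0, and recovers the parameters from the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the Figure 2 model: Y = 1.0 + 2.0*X + e, with uncorrelated,
# mean-zero errors of constant variance.
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=100)

# np.polyfit with deg=1 estimates the slope and intercept by
# minimizing the sum of squared residuals.
b_hat, a_hat = np.polyfit(x, y, deg=1)
print(f"intercept = {a_hat:.2f}, slope = {b_hat:.2f}")  # near 1.0 and 2.0

# The fitted y value at any given x is a_hat + b_hat*x.
print(a_hat + b_hat * 4.0)
```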
Coefficient of Determination, R²

It is meaningful to interpret the value of the Pearson correlation coefficient r by squaring it; hence, the term R-square (R²), or coefficient of determination. This measure (with a range of 0–1) is the fraction of the variability in Y that can be explained by the variability in X through their linear relationship, or vice versa. That is, R² = SSregression/SStotal, where SS stands for the sum of squares. Note that R² is calculated only on the basis of the Pearson correlation coefficient in the linear regression analysis. Thus, it is not appropriate to compute R² on the basis of rank correlation coefficients such as the Spearman ρ.
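A short sketch of this identity, again with simulated (not article) data, computes R² from the sums of squares and checks that it equals the square of the Pearson r:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 2, 50)   # hypothetical data

b, a = np.polyfit(x, y, 1)                 # least squares fit
y_fit = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_residual = np.sum((y - y_fit) ** 2)
r2 = 1.0 - ss_residual / ss_total          # = SS_regression / SS_total

r = np.corrcoef(x, y)[0, 1]                # Pearson correlation coefficient
print(np.isclose(r2, r ** 2))              # True
```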
Statistical Hypothesis Tests

There are several hypotheses in the context of regression analysis, for example, a test of whether the slope of the regression line is b = 0 (ie, there is no linear association between Y and X). One may also test whether the intercept a takes on a certain value. The significance of the effects of the intercept and slope may also be computed by means of a Student t statistic, introduced earlier in this Statistical Concepts Series in Radiology (30).
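In practice this t test is reported by most statistical software; for example, a Python sketch with scipy (simulated data, our tooling choice):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 30)   # hypothetical data

res = linregress(x, y)
# res.pvalue is the two-sided P value of the Student t test of
# H0: slope b = 0 (ie, no linear association between Y and X).
print(f"slope = {res.slope:.2f}, P = {res.pvalue:.3g}")
```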
Limitations and Precautions

The following understandings should be considered when regression analysis is performed. (a) To understand whether the assumptions have been met, determine the magnitude of the gap between the data and the assumptions of the model. (b) No matter how strong a relationship is demonstrated with regression analysis, it should not be interpreted as causation (as in the correlation analysis). (c) The regression should not be used to predict or estimate outside the range of values of the independent variable of the sample (eg, extrapolation of radiation cancer risk from the Hiroshima data to that of diagnostic radiologic tests).

AN EXAMPLE: DOSE VERSUS TOTAL PROCEDURE TIME IN CT FLUOROSCOPY

We applied these statistical methods to help assess the benefit of the use of CT fluoroscopy to guide interventions in the abdomen (11). During CT fluoroscopy–guided interventions, one might postulate that the radiation dose received by a patient is related to (or correlated with) the total procedure time, because the more difficult the procedure is, the more CT fluoroscopic scanning is required, which means a longer procedure time. The rationale was to assess whether radiation dose could be estimated by simply measuring the total CT fluoroscopic procedure time, with the null hypothesis that the slope of the regression line is 0.

Earlier, we discussed two methods to target lesions with CT fluoroscopy. In one method, continuous CT scanning is used during needle placement. In the other method, short CT scanning is used to image the needle after it is placed. The latter method, the so-called quick-check method, has been adopted almost exclusively at our institution. Now, we demonstrate correlation and regression analyses based on a subset of the interventional procedures (n = 19). With the quick-check method, we examine the relationship between total procedure time (in minutes) and dose (in rads) on a natural log scale. We also examine the marginal ranks of the x (log of total time) and y (log of dose) components (Table 2). For convenience, the x data are given in ascending order.

In Table 2, each set of rank data is derived by first placing the 19 observations in each sample in ascending order and then assigning ranks 1–19. Ties are broken by means of averaging the respective adjacent ranks. Finally, the ranks are identified for the observations of each of the paired x and y samples.



The natural log (ln) transformation of the total time is used to make the data appear normal, for more efficient analysis (Appendix D), with normality verified statistically (31). However, normality is not necessary in the subsequent regression analysis. We created a scatterplot of the data, with the log of total time (ln[minutes]) on the x axis and the log of dose (ln[rad]) on the y axis (Fig 3).

For illustration purposes, we will conduct both correlation and regression analyses; however, the choice of analysis depends on the aim of the research. For example, if the investigators wish to assess whether there is a relationship between time and dose, then correlation analysis is appropriate. In comparison, if the investigators wish to evaluate the impact of the total time on the resulting dose, then regression analysis is preferred.

TABLE 2
Total Procedure Time and Dose of CT Fluoroscopy–guided Procedures, by Means of the Quick-Check Method

Subject No.   x Data: Log Time (ln[min])   Ranks of x Data   y Data: Log Dose (ln[rad])   Ranks of y Data
1             3.61                         1                 1.48                         2
2             3.87                         2                 1.24                         1
3             3.95                         3                 2.08                         5.5
4             4.04                         4                 1.70                         3
5             4.06                         5                 2.08                         5.5
6             4.11                         6                 2.94                         10
7             4.19                         7                 2.24                         7
8             4.20                         8                 1.85                         4
9             4.32                         9.5               2.84                         9
10            4.32                         9.5               3.93                         16
11            4.42                         11.5              3.03                         11
12            4.42                         11.5              3.23                         13
13            4.45                         13                3.87                         15
14            4.50                         14                3.55                         14
15            4.52                         15                2.81                         8
16            4.57                         16                4.07                         17
17            4.58                         17                4.44                         19
18            4.61                         18                3.16                         12
19            4.74                         19                4.19                         18

Source.—Reference 11.
Note.—Paired x and y data are sorted according to the x component; therefore, the log of the total procedure time and its corresponding ranks are in increasing order. When ties are present in the data, the average of the adjacent ranks is used. Pearson correlation coefficient between log time and log dose, r = 0.85; Spearman ρ = 0.84.

Figure 3. Scatterplot of the log of dose (y axis) versus the log of total time (x axis). Each point in the scatterplot represents the values of two variables for a given observation.

Correlations

The Pearson correlation coefficient between the log of total time and the log of dose was r = 0.85. To compute the Spearman ρ, the marginal ranks of time and dose were derived separately; consequently, rs = 0.84. Both correlation coefficients confirm that the log of total time and the log of dose are correlated strongly and positively.
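Both coefficients can be reproduced from the Table 2 data; a Python sketch (our tooling choice, agreeing with the published values up to rounding):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Log time (x) and log dose (y) for the n = 19 procedures in Table 2.
log_time = np.array([3.61, 3.87, 3.95, 4.04, 4.06, 4.11, 4.19, 4.20, 4.32,
                     4.32, 4.42, 4.42, 4.45, 4.50, 4.52, 4.57, 4.58, 4.61,
                     4.74])
log_dose = np.array([1.48, 1.24, 2.08, 1.70, 2.08, 2.94, 2.24, 1.85, 2.84,
                     3.93, 3.03, 3.23, 3.87, 3.55, 2.81, 4.07, 4.44, 3.16,
                     4.19])

r, _ = pearsonr(log_time, log_dose)      # should give r = 0.85
rho, _ = spearmanr(log_time, log_dose)   # should give rho = 0.84 (ties averaged)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```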

Regression

We first conducted a simple linear regression analysis of the data on a log scale (n = 19); results are shown in Table 3. The value calculated for R² was 0.73, which suggests that 73% of the variability of the data could be explained by the linear regression.

The regression line, expressed in the form given in Equation (1), is Y = −9.28 + 2.83X, where the predictor variable X represents the log of total time, and the outcome variable Y represents the log of dose. The estimated regression parameters are a = −9.28 (intercept) and b = 2.83 (slope) (Fig 4). This regression line can be interpreted as follows: At X = 0, the value of Y is −9.28. For every one-unit increase in X, the value of Y will increase on average by 2.83. The effects of both the intercept and the slope are statistically significant (P < .005) (Excel; Microsoft, Redmond, Wash); therefore, the null hypothesis (H0: the dose remains constant as the total procedure time increases) is rejected. Thus, we confirm the alternative hypothesis (H1: the dose increases with the total procedure time).

Figure 4. Scatterplot of the log of dose (y axis) versus the log of total time (x axis). The regression line has the intercept a = −9.28 and slope b = 2.83. We conclude that there is a possible association between the radiation dose and the total time of the procedure.

TABLE 3
Results Based on Correlation and Regression Analysis for Example Data

Regression Statistic        Numerical Result
Correlation coefficient r   0.85
R-square (R²)               0.73
Regression parameters
  Intercept                 −9.28
  Slope                     2.83

Source.—Reference 11.

The regression line may be used to give predicted values of Y. For example, if in a future CT fluoroscopy procedure the log total time is specified at x = 4 (translated to e^4 ≈ 55 minutes), then the predicted log dose is approximately y = −9.28 + 2.83 × 4 = 2.04 (translated to e^2.04 = 7.69 rad). On the other hand, if the log total time is specified at x = 4.5 (translated to e^4.5 ≈ 90 minutes), then the predicted log dose is approximately y = −9.28 + 2.83 × 4.5 = 3.46 (translated to e^3.46 = 31.82 rad). Such prediction can be useful for future clinical practice.
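This back-transformed prediction is easy to script; a Python sketch using the Table 3 estimates (the function name is ours):

```python
import numpy as np

A, B = -9.28, 2.83                  # estimated intercept and slope (Table 3)

def predicted_dose(total_time_min):
    """Predict dose (rad) from total procedure time (min) using the
    fitted line y = A + B*x on the natural log scale."""
    x = np.log(total_time_min)      # ln of total time
    y = A + B * x                   # predicted ln of dose
    return np.exp(y)                # back-transform to rad

print(predicted_dose(55))  # about 7.9 rad (article: 7.69 rad, using x = 4 exactly)
print(predicted_dose(90))  # about 31.6 rad (article: 31.82 rad, using x = 4.5 exactly)
```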
SUMMARY AND REMARKS

Two important statistical concepts, correlation and regression, which are used commonly in radiology research, are reviewed and demonstrated herein. Additional sources of information and electronic textbooks on statistical analysis methods found on the World Wide Web are listed in Appendix E. A glossary of the statistical terms used in this article is presented in Appendix F.



When correlation analysis is conducted to measure the association between two random variables, either the Pearson linear correlation coefficient or the Spearman rank correlation coefficient ρ may be adopted. The former coefficient is used to measure the linear relationship but is not recommended for use with skewed data or data with extremely large or small values (often called outliers). In contrast, the latter coefficient is used to measure a general association, and it is recommended for use with data that are skewed or that have outliers.

When simple regression analysis is conducted to assess the linear relationship of a dependent variable as a function of the independent variable, caution must be used when determining which of the two variables is viewed as the independent variable that makes sense clinically. A useful graphical aid is a scatterplot. Once the regression line is obtained, caution should also be used to avoid prediction of a y value for any value of x that is outside the range of the data. Finally, correlation and regression analyses do not infer causality, and more rigorous analyses are required if causal inference is to be made (23–26).

APPENDIX A

Formula for computing the Pearson correlation coefficient, r: The formula for computing r between bivariate data, Xi and Yi values (i = 1, . . ., n), is

$$r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2\,\sum_{i=1}^{n}(Y_i-\bar{Y})^2}},$$

where X̄ and Ȳ are the sample means of the Xi and Yi values, respectively.

The Pearson correlation coefficient may be computed by means of a computer-based statistics program (Excel; Microsoft) by using the option "Correlation" under the option "Data Analysis Tools". Alternatively, it may also be computed by means of a built-in software function "Cor" (Insightful; MathSoft, Seattle, Wash [MathSoft S-Plus 4 guide to statistics, 1997; 89–96]. Available at: www.insightful.com) or with a free software program (R Software. Available at: lib.stat.cmu.edu/R).
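The formula is also straightforward to implement directly; a minimal Python sketch (the function name is ours):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient per the Appendix A formula."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Agrees with library routines such as np.corrcoef(x, y)[0, 1].
```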
APPENDIX B

Total sample size based on the Pearson correlation coefficient: Specify r = expected correlation coefficient, C = 0.5 × ln[(1 + r)/(1 − r)], N = total number of subjects required, α = type I error (ie, significance level, typically fixed at 0.05), and β = type II error (ie, 1 minus statistical power, typically fixed at 0.10). Then N = [(Z_α + Z_β)/C]² + 3, where Z_α is the inverse of the cumulative probability of a standard normal distribution with the tail probability of α. Similarly, Z_β is the inverse of the cumulative probability of a standard normal distribution with the tail probability of β. Consequently, compute the smallest integer, n, such that n ≥ N, as the required sample size.

For example, an investigator wishes to conduct a clinical trial of a paired design based on a one-tailed hypothesis test of the correlation coefficient. The null hypothesis is that the correlation between two variables is r = 0.60 (ie, C = 0.693) in the population of interest. The alternative hypothesis is that the correlation is r > 0.60. Type I error is fixed at 0.05 (ie, Z_α = 1.645), while type II error is fixed at 0.10 (ie, Z_β = 1.282). Thus, the required sample size is N = 21 subjects. A sample size table may also be found in reference 18.
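A Python sketch of this calculation (the function name is ours) reproduces the worked example:

```python
import numpy as np
from scipy.stats import norm

def corr_sample_size(r, alpha=0.05, beta=0.10):
    """N = [(Z_alpha + Z_beta)/C]**2 + 3 with C = 0.5*ln[(1 + r)/(1 - r)],
    rounded up to the smallest integer n >= N."""
    c = 0.5 * np.log((1 + r) / (1 - r))
    z_alpha = norm.isf(alpha)   # 1.645 for alpha = 0.05
    z_beta = norm.isf(beta)     # 1.282 for beta = 0.10
    return int(np.ceil(((z_alpha + z_beta) / c) ** 2 + 3))

print(corr_sample_size(0.60))   # 21 subjects, matching the example above
```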


APPENDIX C

Formula for computing the Spearman ρ and the Pearson rs of the ranks: Replace the bivariate data, Xi and Yi (i = 1, . . ., n), by their respective ranks, Ri = rank(Xi) and Si = rank(Yi). The rank correlation coefficient, rs, is defined as the Pearson correlation coefficient between the Ri and Si values, which can be computed by means of the formula given in Appendix A. An alternative direct formula was given by Hettmansperger (19).

The Spearman ρ may also be computed by first reducing the continuous data to their marginal ranks by using the "rank and percentile" option with Data Analysis Tools (Excel; Microsoft), the "rank" function (Insightful; MathSoft), or the free software. The latter two programs correctly rank the data in ascending order. However, the rank and percentile option in Excel ranks the data in descending order (the largest is 1). Therefore, to compute the correct ranks, one may first multiply all of the data by −1 and then apply the rank function. Excel also gives integer ranks in the presence of ties, in contrast with methods that yield possibly noninteger ranks, as described in the standard statistics literature (19).

Subsequently, the sample correlation coefficient is computed on the basis of the ranks of the two marginal data by using the Correlation option in Data Analysis Tools (Excel; Microsoft), the Cor function (Insightful; MathSoft), or the free software.
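The defining property (the Spearman ρ is the Pearson r of the ranks) can be checked directly; a Python sketch with made-up data:

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(3)
x = rng.normal(size=25)
y = x + rng.normal(size=25)          # hypothetical paired data

# Spearman rho is the Pearson coefficient of the rank-transformed data.
rs_from_ranks, _ = pearsonr(rankdata(x), rankdata(y))
rs_direct, _ = spearmanr(x, y)
print(np.isclose(rs_from_ranks, rs_direct))  # True
```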
APPENDIX D

Simple regression analysis: Regression analysis may be performed by using the "Regression" option with Data Analysis Tools (Excel; Microsoft). This regression analysis tool yields the sample correlation R²; estimates of the regression parameters, along with their statistical significance on the basis of the Student t test; residuals; and standardized residuals. Scatter, line fit, and residual plots may also be created. Alternatively, the analyses can be performed by using the function "lsfit" (Insightful; MathSoft) or the free software.

With either program, one may choose to transform the data or exclude outliers before conducting a simple regression analysis. A commonly used variance-stabilizing transformation is the natural log function (ln) applied to one or both variables. Other transformations (eg, the Box-Cox transformation) and weighting methods in regression analysis may also be used (28,29).

APPENDIX E

Uniform resource locator, or URL, links to electronic statistics textbooks: www.davidmlane.com/hyperstat/index.html, www.statsoft.com/textbook/stathome.html, www.ruf.rice.edu/~lane/rvls.html, www.bmj.com/collections/statsbk/index.shtml, espse.ed.psu.edu/statistics/investigating.htm.

APPENDIX F

Glossary of statistical terms:
Bivariate data.—Measurements obtained on more than one variable for the same unit or subject.
Correlation coefficient.—A statistic between −1 and 1 that measures the association between two variables.
Intercept.—The constant a in the regression equation, which is the value for y when x = 0.
Least squares method.—The method that finds the regression line best fitting the data, for which the sum of the squared residuals is minimized.
Outlier.—An extreme observation far away from the bulk of the data, often caused by faulty measuring equipment or recording error.
Pearson correlation coefficient.—Sample correlation coefficient for measuring the linear relationship between two variables.
R².—The square of the Pearson correlation coefficient r, which is the fraction of the variability in Y that can be explained by the variability in X through their linear relationship, or vice versa.



Rank.—The relative ordering of the measurements in a variable, which can be noninteger numbers in the presence of ties.
Residual.—The difference between the observed value of the outcome variable and the fitted value based on a linear regression analysis.
Scatterplot.—A plot of the observed bivariate outcome variable (y axis) against its predictor variable (x axis), with a dot for each pair of bivariate observations.
Simple linear regression analysis.—A linear regression analysis with one predictor and one outcome variable.
Skewed data.—A distribution is skewed if there are more extreme data on one side of the mean. Otherwise, the distribution is symmetric.
Slope.—The constant b in the regression equation, which is the change in y that corresponds to a one-unit increase (or decrease) in x.
Spearman ρ.—A rank correlation coefficient for measuring the monotone relationship between two variables.

Acknowledgments: We thank Kimberly E. Applegate, MD, MS, and Philip E. Crewson, PhD, co-editors of this Statistical Concepts Series in Radiology, for their constructive comments on earlier versions of this article.

References
1. Krzanowski WJ. Principles of multivariate analysis: a user's perspective. Oxford, England: Clarendon, 1988; 405–432.
2. Rodriguez RN. Correlation. In: Kotz S, Johnson NL, eds. Encyclopedia of statistical sciences. New York, NY: Wiley, 1982; 193–204.
3. Applegate KE, Crewson PE. An introduction to biostatistics. Radiology 2002; 225:318–322.
4. Goldman RN, Weinberg JS. Statistics: an introduction. Upper Saddle River, NJ: Prentice Hall, 1985; 72–98.
5. Freund JE. Mathematical statistics. 5th ed. Upper Saddle River, NJ: Prentice Hall, 1992; 494–546.
6. Spearman C. The proof and measurement of association between two things. Am J Psychol 1904; 15:72–101.
7. Fieller EC, Hartley HO, Pearson ES. Tests for rank correlation coefficients. I. Biometrika 1957; 44:470–481.
8. Fieller EC, Pearson ES. Tests for rank correlation coefficients. II. Biometrika 1961; 48:29–40.
9. Kruskal WH. Ordinal measures of association. J Am Stat Assoc 1958; 53:814–861.
10. David FN, Mallows CL. The variance of Spearman's rho in normal samples. Biometrika 1961; 48:19–28.
11. Silverman SG, Tuncali K, Adams DF, Nawfel RD, Zou KH, Judy PF. CT fluoroscopy-guided abdominal interventions: techniques, results, and radiation exposure. Radiology 1999; 212:673–681.
12. Daniel WW. Biostatistics: a foundation for analysis in the health sciences. 7th ed. New York, NY: Wiley, 1999.
13. Altman DG. Practical statistics for medical research. Boca Raton, Fla: CRC, 1990.
14. Neter J, Wasserman W, Kutner MH. Applied linear models: regression, analysis of variance, and experimental designs. 3rd ed. Homewood, Ill: Irwin, 1990; 38–44, 62–104.
15. Galton F. Typical laws of heredity. Proc R Inst Great Britain 1877; 8:282–301.
16. Galton F. Correlations and their measurement, chiefly from anthropometric data. Proc R Soc London 1888; 45:219–247.
17. Pearson K. Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Phil Trans R Soc Lond Series A 1896; 187:253–318.
18. Hulley SB, Cummings SR. Designing clinical research: an epidemiological approach. Baltimore, Md: Williams & Wilkins, 1988; appendix 13.C.
19. Hettmansperger TP. Statistical inference based on ranks. Malabar, Fla: Krieger, 1991; 200–205.
20. Kendall M, Gibbons JD. Rank correlation methods. 5th ed. New York, NY: Oxford University Press, 1990; 8–10.
21. Zou KH, Hall WJ. On estimating a transformation correlation coefficient. J Appl Stat 2002; 29:745–760.
22. Fisher RA. Frequency distributions of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 1915; 10:507–521.
23. Duncan OD. Path analysis: sociological examples. In: Blalock HM Jr, ed. Causal models in the social sciences. Chicago, Ill: Aldine-Atherton, 1971; 115–138.
24. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Ed Psych 1974; 66:688–701.
25. Holland P. Statistics and causal inference. J Am Stat Assoc 1986; 81:945–970.
26. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc 1996; 91:444–455.
27. Seber GAF. Linear regression analysis. New York, NY: Wiley, 1997; 48–51.
28. Carroll RJ, Ruppert D. Transformation and weighting in regression. New York, NY: Chapman & Hall, 1988; 2–61.
29. Box GEP, Cox DR. An analysis of transformations. J R Stat Soc Series B 1964; 26:211–252.
30. Tello R, Crewson PE. Hypothesis testing II: means. Radiology 2003; 227:1–4.
31. Mudholkar GS, McDermott M, Srivastava DK. A test of p-variate normality. Biometrika 1992; 79:850–854.

