Statistics are used in medicine for data description and inference. Inferential statistics are used to answer questions about the data, to test hypotheses (formulating the alternative and null hypotheses), to generate a measure of effect (typically a ratio of rates or risks), to describe associations (correlations) or to model relationships (regression) within the data, and for many other purposes. Point estimates are the usual measures of association or of the magnitude of an effect. Confounding, measurement error, selection bias and random error make it unlikely that a point estimate equals the true value. Random error cannot be avoided in the estimation process. One way to account for it is to compute p-values for a range of possible parameter values (including the null). The range of values for which the p-value exceeds a specified alpha level (typically 0.05) is called a confidence interval. An interval estimation procedure will, in 95% of repetitions (identical studies in all respects except for random error), produce limits that contain the true parameter. It has been argued that the question of whether the pair of limits produced by a particular study contains the true parameter cannot be answered by the ordinary (frequentist) theory of confidence intervals1. Frequentist approaches derive estimates by using probabilities of the data (either p-values or likelihoods) as measures of compatibility between data and hypotheses, or as measures of the relative support that the data provide for hypotheses. Another approach, the Bayesian, uses the data to update existing (prior) estimates in light of new information. Proper use of any approach requires careful interpretation of the statistics1,2.
The goal of any data analysis is to extract accurate estimates from the raw information. One of the most important and common questions is whether there is a statistical relationship between a response variable (Y) and explanatory variables (Xi). One way to answer this question is to employ regression analysis in order to model the relationship. There are various types of regression analysis, and the appropriate type depends on the distribution of Y: if Y is continuous and approximately normal, we use a linear regression model; if it is dichotomous, we use logistic regression; if it follows a Poisson or multinomial distribution, we use log-linear analysis; and for time-to-event data in the presence of censored cases (survival-type data), we use Cox regression. By modeling we try to predict the outcome (Y) based on the values of a set of predictor variables (Xi). These methods allow us to assess the impact of multiple variables (covariates and factors) in the same model3,4.
In this article we focus on linear regression. Linear regression is the procedure that estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable, which should be quantitative. Logistic regression is similar to linear regression but is suited to models where the dependent variable is dichotomous; logistic regression coefficients can be used to estimate odds ratios for each of the independent variables in the model.
Linear equation
In most statistical packages, a curve estimation procedure produces curve estimation regression statistics and related plots for many different models (linear, logarithmic, inverse, quadratic, cubic, power, S-curve, logistic, exponential, etc.). It is essential to plot the data in order to determine which model to use for each dependent variable. If the variables appear to be related linearly, a simple linear regression model can be used; if the variables are not linearly related, data transformation might help. If the transformation does not help, a more complicated model may be needed. It is strongly advised to view a scatterplot of your data early on; if the plot resembles a mathematical function you recognize, fit the data to that type of model. For example, if the data resemble an exponential function, an exponential model should be used. Alternatively, if it is not obvious which model best fits the data, one option is to try several models and select among them. In short, screen the data graphically (e.g. with a scatterplot) in order to determine how the independent and dependent variables are related (linearly, exponentially, etc.)4–6.
The most appropriate model could be a straight line, a higher-degree polynomial, a logarithmic or an exponential curve. Strategies for finding an appropriate model include the forward method, in which we start by assuming a very simple model, i.e. a straight line (Y = a + bX or Y = b0 + b1X), and find the best estimate of that model. If this model does not fit the data satisfactorily, we then assume a more complicated model, e.g. a 2nd-degree polynomial (Y = a + bX + cX2), and so on. In the backward method we assume a complicated model, e.g. a high-degree polynomial, fit the model and then try to simplify it. We might also use a model suggested by theory or experience. Often a straight-line relationship fits the data satisfactorily, and this is the case of simple linear regression. The simplest case of linear regression analysis is that with one predictor variable6,7.
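As an illustration of the forward strategy, the following is a minimal sketch (with simulated data, not data from the article) that fits a straight line first and then a 2nd-degree polynomial, comparing the residual sums of squares:

```python
import numpy as np

# Simulated example data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + 0.1 * x**2 + rng.normal(scale=1.0, size=x.size)

# Forward strategy: start with a straight line, then try a quadratic
for degree in (1, 2):
    coefs = np.polyfit(x, y, deg=degree)      # least-squares fit of the polynomial
    fitted = np.polyval(coefs, x)             # fitted (predicted) values
    ss_res = np.sum((y - fitted) ** 2)        # residual sum of squares
    print(f"degree {degree}: coefficients {np.round(coefs, 3)}, SSres = {ss_res:.1f}")
```

If the quadratic term reduces the residual sum of squares only trivially, the simpler straight-line model is retained.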
Assuming a linear relation in the population, the mean of Y for a given X equals α + βX, i.e. the "population regression line".
Ŷi = a + bXi is called the fitted (or predicted) value, and Yi − Ŷi is called the residual.
The estimated regression line is determined in such a way that Σ(residuals)2 is minimal, i.e. the standard deviation of the residuals is minimized (the residuals are on average zero). This is called the "least squares" method. In the equation, b is the slope (the average increase of the outcome per unit increase of the predictor).
The least-squares estimates a and b, which define the fitted regression line with the highest attainable precision for these coefficients, are computed directly from the data.
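In standard notation (a sketch of the usual least-squares formulas, with x̄ and ȳ the sample means of X and Y; the notation is not quoted from the article):

$$ b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x} $$

so that the fitted line can equivalently be written as Ŷ = ȳ + b(x − x̄).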
Further inference about the regression line can be made by estimating a confidence interval (e.g. the 95% CI for the slope b). The calculation is based on the standard error of b, and the test for H0: β = 0 is t = b / se(b), with the p-value derived from the t-distribution with df = n − 2.
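In standard notation (a sketch using the residual standard deviation s_res, whose estimate is given below), the standard error, the test statistic and the 95% CI are:

$$ \mathrm{se}(b) = \frac{s_{\mathrm{res}}}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}, \qquad t = \frac{b}{\mathrm{se}(b)} \sim t_{n-2}\ \text{under } H_0{:}\ \beta=0, \qquad b \pm t_{0.975,\,n-2}\,\mathrm{se}(b). $$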
If the p-value lies above 0.05, the null hypothesis is not rejected, which means that a straight-line model in X does not help in predicting Y. Either the straight-line model holds with slope = 0, or there is a curved relation with a zero linear component. On the other hand, if the null hypothesis is rejected, either the straight-line model holds or, in a curved relationship, the straight-line model helps but is not the best model. Of course, there remains the possibility of a type II or a type I error in the first and second case, respectively.
The standard deviation of the residuals (σres) characterizes the variability around the regression line: the smaller σres is, the better the fit. It is estimated with a certain number of degrees of freedom, which is the number to divide by in order to obtain an unbiased estimate of the variance; in this case df = n − 2, because two parameters, α and β, are estimated7.
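In standard notation (a sketch, not the article's own rendering), this estimate is:

$$ s_{\mathrm{res}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{n-2}} $$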
In the multiple linear regression model, Y has a normal distribution with mean β0 + β1X1 + ... + βpXp, where
β0 = intercept
β1, ..., βp = regression coefficients
βi equals the mean increase in Y per unit increase in Xi, while the other X's are kept fixed. In other words, βi is the influence of Xi corrected (adjusted) for the other X's. The estimation method follows the least squares criterion: if b0, b1, ..., bp are the estimates of β0, β1, ..., βp, then the "fitted" value of Y is Yfit = b0 + b1X1 + ... + bpXp. The b0, b1, ..., bp are computed such that Σ(Y − Yfit)2 is minimal. Since Y − Yfit is called the residual, one can also say that the sum of squared residuals is minimized.
In our example, the statistical package gives estimates of the regression coefficients (bi) and their standard errors (se) for toluene personal exposure levels, from which the regression equation for toluene personal exposure levels is obtained.
The estimated coefficient for time spent outdoors (0.582) means that the estimated mean increase in personal toluene levels is 0.582 µg/m3 if time spent outdoors increases by 1 hour, while home levels and wind speed remain constant. More precisely, one could say that individuals differing by one hour in the time spent outdoors, but having the same values on the other predictors, will have a mean difference in toluene exposure levels equal to 0.582 µg/m3 8.
Be aware that this interpretation does not imply any causal relation.
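For readers who want to reproduce this kind of analysis, the following is a minimal sketch using Python's statsmodels with hypothetical file and column names (exposure.csv, toluene_personal, time_outdoors, toluene_home, wind_speed); it is not the authors' code or data.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical dataset: one row per individual (illustrative names only)
df = pd.read_csv("exposure.csv")

y = df["toluene_personal"]                       # outcome: personal toluene level (µg/m3)
X = df[["time_outdoors", "toluene_home", "wind_speed"]]
X = sm.add_constant(X)                           # adds the intercept b0

model = sm.OLS(y, X).fit()                       # ordinary least squares fit
print(model.summary())                           # coefficients, se's, t-tests, 95% CIs, R2, overall F
```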
The 95% CI for βi is given by bi ± t0.975 × se(bi) with df = n − 1 − p (df: degrees of freedom). In our example this means that the 95% CI for the coefficient of time spent outdoors is -0.19 to 0.49.
As another example, if we test H0: βhumidity = 0 and find P = 0.40, which is not significant, we may assume that the association between personal toluene exposure and humidity could be explained by the correlation between humidity and wind speed8.
In order to estimate the standard deviation of the residuals (Y − Yfit), i.e. the estimated standard deviation of Y for a given set of predictor values in a population sample, we have to estimate σ.
The ANOVA table gives the total variability in Y, which can be partitioned into a part due to regression and a part due to residual variation.
In statistical packages the ANOVA table in which this partition is given usually has the following format6:
SS: "sums of squares"; df: degrees of freedom; MS: "mean squares" (SS/df); F: F statistic (see below)
As a measure of the strength of the linear relation one can use R. R is called the multiple correlation coefficient between Y and the predictors (X1, ..., Xp); it equals the correlation between Y and Yfit, and R square is the proportion of total variation explained by the regression (R2 = SSreg / SStot).
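In standard notation (a sketch consistent with the table legend above), the partition and the derived quantities are:

$$ SS_{tot} = \sum_i (y_i - \bar{y})^2 = SS_{reg} + SS_{res}, \qquad \text{df: } n-1 = p + (n-p-1) $$
$$ MS_{reg} = \frac{SS_{reg}}{p}, \quad MS_{res} = \frac{SS_{res}}{n-p-1}, \quad F = \frac{MS_{reg}}{MS_{res}}, \quad R^2 = \frac{SS_{reg}}{SS_{tot}} $$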
Test on overall or reduced model
In our example: Tpers = β0 + β1 (time outdoors) + β2 Thome + β3 (wind speed) + residual.
The null hypothesis (H0) is that there is no regression overall, i.e. β1 = β2 = ... = βp = 0.
The test is based on the proportion of the SS explained by the regression relative to the residual SS. The test statistic (F = MSreg / MSres) has an F-distribution with df1 = p and df2 = n − p − 1 (F-distribution table). In our example F = 5.49 (P < 0.01).
In general, k of the p regression coefficients are set to zero under H0. The model that is valid if H0 is true is called the "reduced model". The idea is to compare the explained variability of the model at hand with that of the reduced model. If one or two variables are left out, SSreg is recalculated (the statistical package does this), and if the resulting F test gives 0.05 < P < 0.10, this means that there is some evidence, although not strong, that these variables together, independently of the others, contribute to the prediction of the outcome.
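A sketch of the corresponding test statistic (the standard comparison of a full model with p predictors against a reduced model with k coefficients set to zero; the notation is not quoted from the article):

$$ F = \frac{\left(SS_{res}^{\,reduced} - SS_{res}^{\,full}\right)/k}{MS_{res}^{\,full}} \sim F_{k,\; n-p-1} \quad \text{under } H_0 $$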
Assumptions
If a linear model is used, the following assumptions should be met. For each value of the
independent variable, the distribution of the dependent variable must be normal. The variance of the
distribution of the dependent variable should be constant for all values of the independent variable.
The relationship between the dependent variable and the independent variables should be linear, and
all observations should be independent. So the assumptions are: independence; linearity; normality;
homoscedasticity. In other words, the residuals of a good model should be normally and randomly distributed, and the unknown σ should not depend on X ("homoscedasticity")2,4,6,9.
To discover deviations from linearity and homogeneity of variance we can plot the residuals against each predictor or against the predicted values. Alternatively, by using a partial plot we can assess the linearity of a predictor variable. The partial plot for a predictor Xi is a plot of the residuals of Y regressed on the other X's against the residuals of Xi regressed on the other X's; this plot should be linear. To check the normality of the residuals we can use a histogram (with normal curve) or a normal probability plot6,7.
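A minimal sketch of these diagnostic plots in Python, assuming `model` is a fitted statsmodels OLS result such as the one in the earlier sketch:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = model.resid                       # residuals
fitted = model.fittedvalues               # predicted values

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(fitted, resid)            # residuals vs predicted: linearity / homoscedasticity
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="Predicted values", ylabel="Residuals")

axes[1].hist(resid, bins=20)              # histogram of residuals: normality
axes[1].set(xlabel="Residuals")

sm.qqplot(resid, line="45", fit=True, ax=axes[2])   # normal probability (Q-Q) plot
plt.tight_layout()
plt.show()
```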
The goodness of fit of the model is assessed by studying the behavior of the residuals, looking for "special" observations/individuals such as outliers, observations with high "leverage" and influential points. Observations deserving extra attention are outliers, i.e. observations with an unusually large residual; high-leverage points, i.e. observations with an unusual x-pattern (outliers in predictor space); and influential points, i.e. individuals with a high influence on the estimate or standard error of one or more β's. An observation can be all three. It is recommended to inspect individuals with a large residual to detect outliers; to use leverage (distance) measures to identify cases with unusual combinations of values for the independent variables, which may have a large impact on the regression model; and, for influential points, to use influence statistics, i.e. the change in the regression coefficients (DfBeta(s)) and in the predicted values (DfFit) that results from the exclusion of a particular case. An overall measure of influence on all β's jointly is Cook's distance (COOK); analogously, the overall measure for the standard errors is COVRATIO6.
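A minimal sketch of how such influence statistics can be obtained in Python (the statsmodels names differ slightly from the SPSS terms used above; `model` is again an assumed fitted OLS result):

```python
infl = model.get_influence()              # OLSInfluence object

leverage = infl.hat_matrix_diag           # leverage (hat values): unusual x-patterns
dfbetas  = infl.dfbetas                   # change in each coefficient when a case is dropped
dffits   = infl.dffits[0]                 # change in the fitted value when a case is dropped
cooks_d  = infl.cooks_distance[0]         # overall influence on all coefficients jointly
cov_rat  = infl.cov_ratio                 # overall influence on the standard errors

print(infl.summary_frame().head())        # per-case table of the main diagnostics
```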
Some tips can be used to correct deviations from the model assumptions. In case of curvilinearity in one or more plots we could add quadratic term(s). In case of non-homogeneity of the residual sd, we can try a transformation: log Y if Sres is proportional to predicted Y; square root of Y if the Y distribution is Poisson-like; 1/Y if Sres2 is proportional to predicted Y; Y2 if Sres2 decreases with Y. If linearity and homogeneity hold, then non-normality does not matter provided the sample size is big enough (n ≥ 50–100). If linearity holds but homogeneity does not, the estimates of the β's are correct but their standard errors are not; these can be corrected by computing "robust" se's (sandwich, Huber's estimate)4,6,9.
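In Python, for example, such robust (sandwich) standard errors can be requested directly when fitting (a sketch; `y` and `X` as in the earlier example):

```python
import statsmodels.api as sm

# Heteroscedasticity-robust (sandwich) standard errors, e.g. the HC3 variant
robust_model = sm.OLS(y, X).fit(cov_type="HC3")
print(robust_model.summary())             # same coefficients, corrected standard errors
```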
Relative issues
Binary logistic regression models can be fitted using either the logistic regression procedure or the multinomial logistic regression procedure. An important theoretical distinction is that the logistic regression procedure produces all statistics and tests using the data at the level of individual cases, while the multinomial logistic regression procedure internally aggregates cases to form subpopulations with identical covariate patterns for the predictors and bases its statistics and tests on these subpopulations. If all predictors are categorical, or if any continuous predictors take on only a limited number of values, the multinomial procedure is preferred. As previously mentioned, a scatterplot procedure can be used to screen the data for multicollinearity. As with other forms of regression, multicollinearity among the predictors can lead to biased estimates and inflated standard errors. If all of your predictor variables are categorical, you can also use the loglinear procedure.
In order to explore the correlation between variables, the Pearson or Spearman correlation for a pair of variables, r(Xi, Xj), is commonly used. For each pair of variables (Xi, Xj), Pearson's correlation coefficient (r) can be computed. Pearson's r(Xi; Xj) is a measure of the linear association between two (ideally normally distributed) variables, and r2 is the proportion of the total variation of the one explained by the other (with r = b × Sx/Sy, in agreement with regression). Each correlation coefficient measures the association between two variables without taking other variables into account, but there are several useful correlation concepts involving more variables. The partial correlation coefficient between Xi and Xj, adjusted for other X's, e.g. r(X1; X2 / X3), can be viewed as an adjustment of the simple correlation taking into account the effect of a control variable: r(X; Y / Z), i.e. the correlation between X and Y controlled for Z. The multiple correlation coefficient between one variable and several other variables, e.g. r(X1; X2, X3, X4) or r(Y; X1, X2, ..., Xk), is a measure of the association between one variable and a set of others. The multiple correlation coefficient between Y and X1, X2, ..., Xk is defined as the simple Pearson correlation coefficient r(Y; Yfit) between Y and its fitted value in the regression model Y = β0 + β1X1 + ... + βkXk + residual. The square of r(Y; X1, ..., Xk) is interpreted as the proportion of variability in Y that can be explained by X1, ..., Xk. The null hypothesis [H0: ρ(Y; X1, ..., Xk) = 0] is tested with the F-test for overall regression, as in the multivariate regression model (see above)6,7. The multiple-partial correlation coefficient between one variable and several others, adjusted for some other X's, e.g. r(X1; X2, X3, X4 / X5, X6), equals the relative increase in the percentage of explained variability in Y obtained by adding X1, ..., Xk to a model already containing Z1, ..., Zp as predictors6,7.
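A minimal sketch of these quantities in Python (with hypothetical arrays x1, x2, x3 and y; the partial correlation is computed, as described above, as the correlation between two sets of residuals):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr, spearmanr

# Hypothetical data (for illustration only)
rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 100))
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=100)

r, p = pearsonr(x1, x2)                      # simple Pearson correlation
rho, p_s = spearmanr(x1, x2)                 # Spearman rank correlation

# Partial correlation r(x1; y / x3): correlate residuals of y|x3 and of x1|x3
Z = sm.add_constant(x3)
res_y  = sm.OLS(y,  Z).fit().resid
res_x1 = sm.OLS(x1, Z).fit().resid
r_partial, _ = pearsonr(res_x1, res_y)

# Multiple correlation r(y; x1, x2, x3) = correlation between y and its fitted value
X = sm.add_constant(np.column_stack([x1, x2, x3]))
fit = sm.OLS(y, X).fit()
r_multiple, _ = pearsonr(y, fit.fittedvalues)   # its square equals the R2 of the regression
```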
Other interesting cases of multiple linear regression analysis include the comparison of two group means. Suppose, for example, we wish to answer the question whether mean HEIGHT differs between men and women. This can be examined with a simple linear regression model in which sex enters as a 0/1 (dummy) variable. The linear regression model assumes a normal distribution of HEIGHT in both groups, with equal σ; this is exactly the model of the two-sample t-test. In the case of comparison of several group means, we may wish to answer the question whether mean HEIGHT differs between different SES classes; testing β1 = β2 = 0 is then equivalent to the one-way ANalysis Of VAriance (ANOVA) F-test. The statistical model in both cases is in fact the same4,6,7,9.
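A sketch of how these models can be written with dummy coding (SEX = 0 for one group and 1 for the other; Z1 and Z2 indicator variables for two of the three SES classes; the coding shown is illustrative, not quoted from the article):

$$ \text{HEIGHT} = \beta_0 + \beta_1\,\text{SEX} + \varepsilon \qquad (\beta_1 = \text{difference in group means; two-sample t-test}) $$
$$ \text{HEIGHT} = \beta_0 + \beta_1 Z_1 + \beta_2 Z_2 + \varepsilon \qquad (H_0{:}\ \beta_1=\beta_2=0 \text{ is the one-way ANOVA F-test}) $$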
If we wish to compare a continuous variable Y (e.g. HEIGHT) between groups (e.g. men and women), corrected (adjusted or controlled) for one or more covariables X (confounders, e.g. X = age or weight), then the question is formulated as: are the means of HEIGHT of men and women different if men and women of equal weight are compared? Be aware that this question is different from asking whether there is a difference between the means of HEIGHT for men and women, and the answers can be quite different! The corrected difference between men and women could be opposite in sign, larger or smaller than the crude difference. In order to estimate the corrected difference, the following multiple regression model is used:
Y = β0 + β1Z + β2X + residual,
where Y: response variable (for example HEIGHT); Z: grouping variable (for example Z = 0 for men and Z = 1 for women); X: covariable (confounder) (for example weight).
So, for men the regression line is Y = β0 + β2X and for women it is Y = (β0 + β1) + β2X.
This model assumes that the regression lines are parallel. β1 is therefore the vertical difference between the lines, and can be interpreted as the difference in mean response Y between the groups, corrected for X. If the regression lines are not parallel, the difference in mean Y depends on the value of X; this is called "interaction" or "effect modification".
As an example, suppose we are interested in the difference in HEIGHT between men and women in a population sample, corrected for body weight. Testing β3 = 0 gave a p-value much larger than 0.05, so we assume that there is no interaction, i.e. that the regression lines are parallel. Analysis of covariance for ≥ 3 groups can further be used if we ask about the difference in mean HEIGHT between people with different levels of education (primary, medium, high), corrected for body weight. In a model where the three lines may not be parallel we have to check for interaction (effect modification)7. If testing the hypothesis that the coefficients of the interaction terms equal 0 gives a non-significant result, it is reasonable to assume a model without interaction. Testing the hypothesis H0: β1 = β2 = 0, i.e. no differences between education levels when corrected for weight, is based on the fitted model; note that the individual P-values for Z1 and Z2 depend on the choice of the reference group. The purposes of ANCOVA are to correct for confounding and to increase the precision of an estimated difference.
With Y the response variable, k groups (represented by dummy variables Z1, Z2, ..., Zk-1) and confounders X1, ..., Xp, there is a straightforward extension to an arbitrary number of groups and covariables.
One always has to figure out which way of coding categorical factors is used, in order to be able to interpret the parameter estimates. In "reference cell" coding, one of the categories plays the role of the reference category ("reference cell"), while the other categories are indicated by dummy variables; the β's corresponding to the dummies are interpreted as the difference between the corresponding category and the reference category. In "difference with overall mean" coding, in the model of the previous example [Y = β0 + β1Z1 + β2Z2 + residual], β0 is interpreted as the overall mean of the three levels of education, while β1 and β2 are interpreted as the deviations of the means of primary and medium education from the overall mean, respectively; the deviation of the mean of the high level from the overall mean is given by (−β1 − β2). In "cell means" coding the model is written without an intercept [Y = β1Z1 + β2Z2 + β3Z3 + residual], and β1 is the mean of the primary, β2 of the medium and β3 of the high level of education6,7,9.
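A minimal sketch of how such coding choices can be specified in Python's statsmodels formula interface (with a hypothetical data frame `df` containing columns height, weight and educ; the Treatment contrast corresponds to reference-cell coding and Sum to difference-with-overall-mean coding):

```python
import statsmodels.formula.api as smf

# Reference-cell coding: 'high' education is the reference category
m_ref = smf.ols("height ~ C(educ, Treatment(reference='high')) + weight", data=df).fit()

# Difference-with-overall-mean coding (sum-to-zero contrasts)
m_sum = smf.ols("height ~ C(educ, Sum) + weight", data=df).fit()

print(m_ref.params)   # dummies: difference of each category from the reference
print(m_sum.params)   # deviations of category means from the overall mean
```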
Conclusions
It is apparent to anyone who reads the medical literature today that some knowledge of biostatistics and epidemiology is a necessity. The goal of any data analysis is to extract accurate estimates from the raw information. But before any testing or estimation, careful data editing to check for errors is essential, followed by data summarization. One of the most important and common questions is whether there is a statistical relationship between a response variable (Y) and explanatory variables (Xi). One way to answer this question is to employ regression analysis, of which there are various types. All these methods allow us to assess the impact of multiple variables on the response variable.
References
1. Rothman KJ, Greenland S. Modern Epidemiology. 2nd ed. Lippincott-Raven; 1998.
2. Altman DG. Practical Statistics for Medical Research. Chapman & Hall/CRC; 1991.
3. Rosner BA. Fundamentals of Biostatistics. 4th ed. Duxbury; 1995.
4. Draper NR, Smith H. Applied Regression Analysis. Wiley Series in Probability and Statistics; 1998.
5. Munro BH. Statistical Methods for Health Care Research. 5th ed. Lippincott Williams & Wilkins; 2005.
6. SPSS 15.0 Command Syntax Reference 2006. Chicago, Ill: SPSS Inc; 2006.
7. Stijnen T, Mulder PGH. Classical methods for data analyses. Rotterdam: NIHES program; 1999.
8. Alexopoulos EC, Chatzis C, Linos A. An analysis of factors that influence personal exposure to toluene and xylene in residents of Athens, Greece. BMC Public Health. 2006;6:50.
9. Snedecor GW, Cochran WG. Statistical Methods. 8th ed. Iowa State Univ Press; 1989.