Performing Exact Logistic Regression with the SAS System
ABSTRACT
Exact logistic regression has become an important analytical technique, especially in the pharmaceutical industry, since the usual asymptotic methods for analyzing small, skewed, or sparse data sets are unreliable. Inference based on enumerating the exact distributions of sufficient statistics for parameters of interest in a logistic regression model, conditional on the remaining parameters, is computationally infeasible for many problems. Hirji, Mehta, and Patel (1987) developed an efficient algorithm for generating the required conditional distributions, thus making these methods computationally available. This paper discusses the theory and methods for exact logistic regression and illustrates their application in Version 8 of the SAS System with new facilities in the LOGISTIC procedure.
INTRODUCTION
Many clinical trials deal with the comparison of populations of subjects with categorical responses. Historically, statistical inference for such studies has involved large-sample approximations, and fitting logistic regression models to such data is performed through the unconditional likelihood function. However, asymptotic methods may be inadequate when sample sizes are small or the data are sparse, skewed, or heavily tied. Exact conditional inference remains valid in such situations.
Dose-Response Study
First, consider a small dose-response study to motivate the usefulness of exact logistic regression. Researchers are interested in analyzing how mortality rates change with respect to dosage of a drug. The dose data set contains life/death outcomes for six levels of drug dosage (0 to 5). Three subjects are given each specific dose of the drug, and the number of deaths is recorded.
data dose;
input Dose Deaths Total @@;
datalines;
0 0 3
1 0 3
2 0 3
3 0 3
4 1 3
5 2 3
;
run;
All of the cells have counts that are less than 5, which makes the applicability of large-sample theory questionable. For each subject $i$ receiving dosage $x_i$, $i = 1, \ldots, 18$, let $y_i = 1$ if the subject died and $y_i = 0$ otherwise, and let $\pi_i = \Pr(y_i = 1 \mid x_i)$. Then the linear logistic model $\mathrm{logit}(\pi_i) = \alpha + \beta x_i$ can be fit to the subjects. In the PROC LOGISTIC invocation below, the EXACT statement requests an exact analysis and the ESTIMATE option produces exact parameter estimates.
proc logistic data=dose descending;
model Deaths/Total = Dose;
exact Dose / estimate=both;
run;
Figure 1 displays some of the unconditional asymptotic results that are produced by default. The likelihood ratio and score tests reject the null hypothesis that the Dose slope $\beta$ is zero. However, the Wald test does not reject this null hypothesis. The seemingly conflicting conclusions of these tests are a telltale sign that the large-sample approximation is unreliable. The estimates for the intercept and the slope both have p-values greater than 0.05, indicating marginal influence. The confidence limits for the odds ratio of the dose parameter contain the value 1, from which you could conclude, if you accept the model, that there is no change in mortality with a change in dosage.
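The Wald quantities quoted from Figure 1 follow directly from the Dose estimate and its standard error. The following Python sketch is an illustration only, not part of the SAS analysis; the variable names are assumptions made for the example, and the inputs are the rounded values printed in Figure 1, so the results agree with the figure only up to rounding.

```python
import math
from statistics import NormalDist

# Rounded values copied from the Dose row of Figure 1.
estimate = 2.0804
std_err = 1.2603

# Wald chi-square: squared ratio of the estimate to its standard error.
wald_chisq = (estimate / std_err) ** 2

# Upper tail of a chi-square(1) variate: P(Z^2 > x) = erfc(sqrt(x / 2)).
wald_p = math.erfc(math.sqrt(wald_chisq / 2.0))

# 95% Wald limits for the odds ratio: exponentiate estimate +/- z * SE.
z = NormalDist().inv_cdf(0.975)
odds_ratio = math.exp(estimate)
lower = math.exp(estimate - z * std_err)
upper = math.exp(estimate + z * std_err)

print(f"Wald chi-square = {wald_chisq:.4f}, p = {wald_p:.4f}")
print(f"odds ratio = {odds_ratio:.3f}, 95% limits = ({lower:.3f}, {upper:.3f})")
```

The interval is wide and straddles 1 because the standard error is large relative to the estimate, which is the situation the exact analysis is meant to address.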
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        8.1478     1        0.0043
Score                   5.7943     1        0.0161
Wald                    2.7249     1        0.0988

                           Standard
Parameter   DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept    1   -9.4745     5.5677      2.8958      0.0888
Dose         1    2.0804     1.2603      2.7249      0.0988

                          95% Wald
Effect   Estimate     Confidence Limits
Dose        8.007      0.677     94.679

Figure 1.

                       95% Confidence
Effect   Estimate          Limits          p-Value
Dose        6.049      1.123    353.000     0.0245

Figure 2. (continued)

METHODOLOGY
Figure 2 shows the results from the EXACT statement. The p-values in the Conditional Exact Tests table lead to rejecting the null hypothesis that the Dose slope $\beta$ is zero (no conclusions can be made about the intercept $\alpha$ since it is conditioned away). Note that the p-values for the asymptotic estimates are larger than those for the exact estimates; however, Stokes, Davis, and Koch (1995) observe that, in general, the exact methods tend to produce more conservative results. The Exact Parameter Estimates table shows that the slope is estimated to be 1.8000, and since the confidence interval for the odds ratio of $\beta$ does not contain 1, the odds of death increase significantly with dosage. Note that the exact tests do not produce standard errors for the estimates.
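The exact estimate and confidence limits for the Dose slope can be reproduced by brute force for a data set this small. The sketch below is a Python illustration under stated assumptions: it enumerates the conditional distribution of the Dose sufficient statistic directly instead of using the algorithm of Hirji, Mehta, and Patel, it maximizes the conditional likelihood by solving the score equation with bisection, and it obtains the limits by equating each tail probability to alpha/2. All function and variable names are hypothetical.

```python
import math
from itertools import combinations

# Three subjects at each dose 0..5; observed deaths: one at dose 4, two at dose 5.
doses = [d for d in range(6) for _ in range(3)]
deaths = {4: 1, 5: 2}
t0 = sum(deaths.values())                    # intercept sufficient statistic (3)
t1 = sum(d * k for d, k in deaths.items())   # Dose sufficient statistic (14)

# counts[u]: number of ways t0 of the 18 subjects can die with dose total u.
counts = {}
for subset in combinations(doses, t0):
    u = sum(subset)
    counts[u] = counts.get(u, 0) + 1

def weights(beta):
    """Unnormalized conditional probabilities, centered at t1 for stability."""
    return {u: c * math.exp(beta * (u - t1)) for u, c in counts.items()}

def cond_mean(beta):
    w = weights(beta)
    return sum(u * v for u, v in w.items()) / sum(w.values())

def cond_prob(beta, keep):
    w = weights(beta)
    return sum(v for u, v in w.items() if keep(u)) / sum(w.values())

def root(f, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection for an increasing function f with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha = 0.05
# Conditional MLE: the conditional mean of T1 equals the observed t1.
cmle = root(lambda b: cond_mean(b) - t1)
# Lower limit: P(T1 >= t1) = alpha / 2; upper limit: P(T1 <= t1) = alpha / 2.
lower = root(lambda b: cond_prob(b, lambda u: u >= t1) - alpha / 2)
upper = root(lambda b: alpha / 2 - cond_prob(b, lambda u: u <= t1))

print(f"estimate = {cmle:.4f}, limits = ({lower:.4f}, {upper:.4f})")
print(f"odds ratio = {math.exp(cmle):.3f}, "
      f"limits = ({math.exp(lower):.3f}, {math.exp(upper):.3f})")
```

Exponentiating the three values gives the odds ratio and its limits, which is why the odds-ratio interval excludes 1 exactly when the slope interval excludes 0.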
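Conditional Exact Tests
                                       Mid
Effect   Test           Statistic   p-Value
Dose     Score             5.4724    0.0190
         Probability       0.0110    0.0190

Exact Parameter Estimates
                          95% Confidence
Parameter   Estimate          Limits          p-Value
Dose          1.8000     0.1157     5.8665     0.0245

Figure 2.

Logistic Regression
For each observation $i$, $i = 1, \ldots, n$, let $x_i$ be a vector of explanatory variables, and denote $X = (x_1, \ldots, x_n)'$. Let $\pi_i$ be the event probability for each binary response $y_i$, and denote $\pi = (\pi_1, \ldots, \pi_n)'$. Then the logistic regression model is $\mathrm{logit}(\pi) = X\beta$, or
$$\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$
where $\beta$ is the unknown parameter vector. Consider the likelihood function of the observed responses $y = (y_1, \ldots, y_n)'$:
$$L(\beta) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} = \frac{\exp(y'X\beta)}{\prod_{i=1}^{n} \left[ 1 + \exp(x_i'\beta) \right]}$$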
Unconditional likelihood inference is based on maximizing this likelihood function, and several asymptotic statistics (likelihood ratio, score, and Wald) can
be used to perform hypothesis tests.
Observation   x0   x1
     1         1    1
     2         1    1
     3         1    2
     4         1    0
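where $C(t)$ is the number of $y$ sequences that generate $t$. Suppose the parameters $\beta_0$ are nuisance parameters; that is, the current analysis is geared toward the last parameters $\beta_1$. Denote the sufficient statistics for the nuisance parameters as $T_0$, the corresponding observed values as $t_0$, and the corresponding columns of $X$ as $X_0$. Similarly, define $T_1$, $t_1$, and $X_1$ for the parameters of interest. The nuisance parameters can be removed from the analysis by conditioning on their sufficient statistics to create the conditional likelihood
$$\Pr(T_1 = t_1 \mid T_0 = t_0) = \frac{C(t_0, t_1) \exp(t_1'\beta_1)}{\sum_{u} C(t_0, u) \exp(u'\beta_1)}$$
where $C(t_0, u)$ is the number of $y$ sequences that generate $t_0$ and $u$ simultaneously.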
       y1  y2  y3  y4    t0  t1
  1     0   0   0   0     0   0
  2     0   0   0   1     1   0
  3     0   0   1   0     1   2
  4     0   0   1   1     2   2
  5     0   1   0   0     1   1
  6     0   1   0   1     2   1
  7     0   1   1   0     2   3
  8     0   1   1   1     3   3
  9     1   0   0   0     1   1
 10     1   0   0   1     2   1
 11     1   0   1   0     2   3
 12     1   0   1   1     3   3
 13     1   1   0   0     2   2
 14     1   1   0   1     3   2
 15     1   1   1   0     3   4
 16     1   1   1   1     4   4

t0  t1   Frequency   Probability
 2   1       2           2/6
 2   2       2           2/6
 2   3       2           2/6
total        6            1
vec-
be the
relation results: .
Note that, in order to obtain the correct distribution, each node descended from this combined
node must count as outcomes.
Figure 3 displays a tree diagram where each row (after the first) corresponds to an observation $i$, and each node of the tree is denoted by a pair of digits representing the value of $(t_0, t_1)$. The top node in the
t0  t1   Frequency   Probability
 0   0       1          1/16
 1   0       1          1/16
 1   1       2          2/16
 1   2       1          1/16
 2   1       2          2/16
 2   2       2          2/16
 2   3       2          2/16
 3   2       1          1/16
 3   3       2          2/16
 3   4       1          1/16
 4   4       1          1/16
total       16            1

   00
1: 00 11
2: 00 11 11 22
3: 00 12 11 23 11 23 22 34
4: 00 10 12 22 11 21 23 33 11 21 23 33 22 32 34 44
Figure 3.
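The tree in Figure 3 can be traced with a few lines of code. The following Python sketch is a simplified illustration of the shifting idea only: it processes one observation per stage, shifts every node by that observation's covariate row, and merges identical nodes by adding their counts. It omits the trimming of infeasible nodes that makes the published algorithm efficient, and the function and variable names are assumptions made for this example.

```python
from collections import Counter

# Covariate rows (x0, x1) for the four observations in the small example.
X = [(1, 1), (1, 1), (1, 2), (1, 0)]

def shift_distribution(rows):
    """Count the y sequences that generate each sufficient-statistic vector."""
    counts = Counter({(0,) * len(rows[0]): 1})   # top node: nothing processed yet
    for x in rows:
        nxt = Counter()
        for t, c in counts.items():
            nxt[t] += c                                    # y_i = 0: node unchanged
            nxt[tuple(a + b for a, b in zip(t, x))] += c   # y_i = 1: node shifted by x_i
        counts = nxt                                       # equal nodes are merged here
    return counts

joint = shift_distribution(X)
total = sum(joint.values())

# Condition on the nuisance statistic t0 = 2 to get the distribution of t1.
t0_obs = 2
cond = {t1: c for (t0, t1), c in sorted(joint.items()) if t0 == t0_obs}
denom = sum(cond.values())

for (t0, t1), c in sorted(joint.items()):
    print(f"t0={t0} t1={t1} frequency={c} probability={c}/{total}")
for t1, c in cond.items():
    print(f"given t0={t0_obs}: t1={t1} frequency={c} probability={c}/{denom}")
```

Because merged nodes carry their counts forward, the final row needs only 11 distinct nodes rather than 16 separate sequences, and the saving grows quickly with the number of observations.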
two tests for the null hypothesis that the parameters for the effects specified in the EXACT statement are zero: the exact probability test and the exact conditional scores test. For each test, the Conditional Exact Tests table displays
- a test statistic
- an exact p-value, which is the probability of obtaining a more extreme statistic than the observed, assuming the null hypothesis
- a mid p-value, which adjusts for the discreteness of the distribution
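The three quantities in the list above can be computed by hand for the dose-response data from the introduction. The Python sketch below is an illustration under stated assumptions: it enumerates the null conditional distribution of the Dose sufficient statistic, treats "more extreme" as "no more probable than the observed value" for the probability test and "score at least as large as observed" for the score test, and takes the mid p-value as the exact p-value minus half the probability of the observed value (Lancaster, 1961). Variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Three subjects at each dose 0..5; 3 deaths observed with dose total 14.
doses = [d for d in range(6) for _ in range(3)]
n_deaths, t_obs = 3, 14

# Null conditional distribution of the Dose statistic given 3 deaths.
counts = Counter(sum(s) for s in combinations(doses, n_deaths))
total = sum(counts.values())
prob = {u: c / total for u, c in counts.items()}

# Conditional score statistic: squared standardized distance from the null mean.
mean = sum(u * p for u, p in prob.items())
var = sum((u - mean) ** 2 * p for u, p in prob.items())
score = {u: (u - mean) ** 2 / var for u in prob}

eps = 1e-9  # guards the comparisons against floating-point ties
# Probability test: sum the outcomes that are no more probable than the observed one.
p_prob = sum(p for p in prob.values() if p <= prob[t_obs] * (1 + eps))
# Score test: sum the outcomes whose score is at least the observed score.
p_score = sum(p for u, p in prob.items() if score[u] >= score[t_obs] - eps)
# Mid p-values: remove half the probability of the observed outcome.
mid_prob = p_prob - 0.5 * prob[t_obs]
mid_score = p_score - 0.5 * prob[t_obs]

print(f"probability test: statistic={prob[t_obs]:.4f} p={p_prob:.4f} mid={mid_prob:.4f}")
print(f"score test:       statistic={score[t_obs]:.4f} p={p_score:.4f} mid={mid_score:.4f}")
```

Both tests pick out the same four extreme outcomes here (dose totals 0, 1, 14, and 15), so their p-values coincide; in less symmetric problems the two orderings can differ.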
MAXTIME=seconds
STATUSTIME=seconds
optionally, output data sets containing the derived distributions and summary statistics
EXACT Options
Several options can be specified in each EXACT statement. The available options are
ALPHA=value
ESTIMATE<=keyword>
JOINT
JOINTONLY
ONESIDED
OUTDIST=SAS data set
SYNTAX
The following statements control the exact analyses
in the LOGISTIC procedure. Items within the <> are
optional.
The JOINT option requests a test that all the parameters for the EXACT statement are simultaneously equal to zero in addition to the tests of the individual parameters, while the JOINTONLY option suppresses the default individual tests. The test is indicated in the Conditional Exact Tests table by the label Joint.
The ONESIDED option requests one-sided confidence intervals and p-values for the individual parameter estimates and odds ratios. Note that the two-sided p-values are twice the one-sided p-values.
The EXACTONLY option suppresses the unconditional likelihood analyses that PROC LOGISTIC usually performs, and only the exact analyses are executed. Input data sets can be in single-trial or
The OUTDIST= data set contains all of the exact conditional distributions requested in its EXACT statement. This data set contains the possible sufficient statistics for the effects specified in the EXACT statement, the counts derived from the multivariate shift algorithm, the probability of occurrence, and the score value for each sufficient statistic. When you request an OUTDIST= data set, the observed sufficient statistics are displayed in the Sufficient Statistics table.
EXAMPLES
The following examples illustrate different types of exact analysis. The data in these examples were constructed solely for illustrative purposes. The Sparse Data example illustrates that the MLE for the unconditional likelihood analysis may not exist, rendering the asymptotic inference impossible, while the exact conditional inference is still plausible. The Stratified Analyses example demonstrates how to use exact conditional analysis to adjust for within-strata correlation. The Crossover Clinical Trial example is a popular phase II analysis for the pharmaceutical industry.
If you receive messages indicating that the Newton-Raphson iterations for the parameter estimates or confidence intervals did not converge, specifying the ABSFCONV=, FCONV=, XCONV=, or MAXITER= options in the MODEL statement may help.
Exact analyses are not performed when you specify
a WEIGHT statement, a non-logit link, an offset variable, the NOFIT option, or a model-selection method.
Sparse Data
There are several types of data for which unconditional maximum likelihood estimates fail to exist, or
for which the theory is not applicable. For data with
small cell counts, tests based on the asymptotic normality of the maximum likelihood estimates may not
be valid. For other data, the maximum likelihood estimates may not exist and the estimated dispersion
matrix may be unbounded. In this example, the data
set separate contains variables which perfectly predict the response, yielding a complete separation of
data points.
Output
PROC LOGISTIC presents the exact conditional analysis results in several tables:
data separate;
input A B Response count @@;
datalines;
0 0 1 1 0 1 0 2 1 0 1 8 1 1 1 21
;
The Exact Odds Ratios table displays odds ratios for individual parameters, confidence limits, and a p-value for testing that the odds ratio is 1.
The Sufficient Statistics table displays the sufficient statistic for each parameter in the model. This table is only generated when you also specify the OUTDIST= option to output the distribution to a SAS data set. The information is useful for certain further analyses.
Figure 4 shows that the usual asymptotic analysis indicates that complete separation has occurred. You
can see that the parameter estimates do not converge
if you specify both the ITPRINT and NOCHECK options in the MODEL statement. However, exact tests
and estimates for the conditional analysis can still be
computed and are displayed in Figure 5.
Model Convergence Status
Complete separation of data points detected.

Figure 4. Convergence Status
In Figure 5, the joint exact test of A and B is significant, but the B parameter appears insignificant. The median unbiased estimate is computed instead of the CMLE because the value of the observed sufficient statistic lies at an extreme of the derived distribution, implying that the CMLE does not exist. Even though the asymptotic results are unreliable, the exact analysis allows you to conclude that there is a significant effect due to A.
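A median unbiased estimate at an extreme of the distribution can be found by solving a single tail equation. The Python sketch below is an illustration under stated assumptions: it takes the conditional counts for A and B as listed in Figure 6, uses the usual convention that the estimate is the value at which the relevant tail probability of the observed statistic equals 0.5 (averaging the two solutions when both exist), and finds the finite 95% limit by setting the same tail to 0.025. The helper names are hypothetical.

```python
import math

# Conditional counts of the sufficient statistics, as listed in Figure 6.
dist_a = {0: 1, 1: 42, 2: 210}   # A: observed value 0 is the minimum
dist_b = {1: 2, 2: 1}            # B: observed value 2 is the maximum

def tail_prob(dist, beta, t, upper):
    """P(T >= t) if upper else P(T <= t) under the conditional distribution."""
    w = {u: c * math.exp(beta * (u - t)) for u, c in dist.items()}
    chosen = sum(v for u, v in w.items() if (u >= t if upper else u <= t))
    return chosen / sum(w.values())

def root(f, lo=-30.0, hi=30.0, tol=1e-10):
    """Bisection for an increasing function f with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def median_unbiased(dist, t):
    """Average of the two tail solutions; only one exists at an extreme."""
    lower_root = upper_root = None
    if t > min(dist):
        lower_root = root(lambda b: tail_prob(dist, b, t, True) - 0.5)
    if t < max(dist):
        upper_root = root(lambda b: 0.5 - tail_prob(dist, b, t, False))
    if lower_root is None:
        return upper_root
    if upper_root is None:
        return lower_root
    return 0.5 * (lower_root + upper_root)

mue_a = median_unbiased(dist_a, 0)
mue_b = median_unbiased(dist_b, 2)
# Only one confidence limit is finite when the statistic sits at an extreme.
upper_a = root(lambda b: 0.025 - tail_prob(dist_a, b, 0, False))
lower_b = root(lambda b: tail_prob(dist_b, b, 2, True) - 0.025)

print(f"A: estimate={mue_a:.4f}, limits=(-Infinity, {upper_a:.4f})")
print(f"B: estimate={mue_b:.4f}, limits=({lower_b:.4f}, Infinity)")
```

The conditional likelihood is monotone in the parameter when the observed statistic is at an extreme, which is why no finite maximizer exists and one confidence limit is infinite.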
Obs    A    B    Count      Score       Prob
  1    0    1        2    20.2622    0.00403
  2    0    2        1    21.1153    0.00202
  3    1    0        8     8.9654    0.01613
  4    1    1       37     4.4055    0.07460
  5    1    2       42     4.9644    0.08468
  6    2    0       28     5.5822    0.05645
  7    2    1      168     0.7281    0.33871
  8    2    2      210     0.9929    0.42339
  9    0    .        1    22.0000    0.00395
 10    1    .       42     4.5023    0.16601
 11    2    .      210     0.1995    0.83004
 12    .    1        2     0.5000    0.66667
 13    .    2        1     2.0000    0.33333

Figure 6.
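The counts and probabilities in Figure 6 can be checked by direct enumeration, since the data set has only four covariate patterns. The Python sketch below is an illustration only and does not use the multivariate shift algorithm: it chooses how many events fall in each (A, B) cell, multiplies binomial coefficients to count the arrangements, and then conditions on the observed statistics. The score for a single parameter is taken to be the squared standardized distance from the conditional mean, which is an assumption of this example; all names are hypothetical.

```python
from itertools import product
from math import comb

# Cell sizes from the separate data set: (A, B) -> number of subjects.
cells = {(0, 0): 1, (0, 1): 2, (1, 0): 8, (1, 1): 21}
n_events = 2   # observed sufficient statistic for the intercept

# Joint counts of (tA, tB) given the number of events.
keys = list(cells)
joint = {}
for picks in product(*(range(cells[k] + 1) for k in keys)):
    if sum(picks) != n_events:
        continue
    ways = 1
    for k, n in zip(keys, picks):
        ways *= comb(cells[k], n)          # ways to place n events in cell k
    t_a = sum(n * k[0] for k, n in zip(keys, picks))
    t_b = sum(n * k[1] for k, n in zip(keys, picks))
    joint[(t_a, t_b)] = joint.get((t_a, t_b), 0) + ways

# Single-parameter distributions: condition on the other observed statistic.
dist_a = {ta: c for (ta, tb), c in joint.items() if tb == 2}
dist_b = {tb: c for (ta, tb), c in joint.items() if ta == 0}

def summarize(dist):
    """Return {t: (count, score, probability)} for a one-parameter distribution."""
    total = sum(dist.values())
    mean = sum(t * c for t, c in dist.items()) / total
    var = sum((t - mean) ** 2 * c for t, c in dist.items()) / total
    return {t: (c, (t - mean) ** 2 / var, c / total) for t, c in sorted(dist.items())}

table_a = summarize(dist_a)
table_b = summarize(dist_b)
for name, table in (("A", table_a), ("B", table_b)):
    for t, (c, s, p) in table.items():
        print(f"{name}={t} count={c} score={s:.4f} prob={p:.5f}")
```

Conditioning simply filters the joint table to the rows that match the observed value of the other statistic, which is how observations 9 through 13 of Figure 6 relate to observations 1 through 8.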
Exact Conditional Analysis

Sufficient Statistics
Parameter    Value
Intercept        2
A                0
B                2

Conditional Exact Tests
                                       Mid
Effect   Test           Statistic   p-Value
Joint    Score            21.1153    0.0010
         Probability      0.00202    0.0010
A        Score            22.0000    0.0020
         Probability      0.00395    0.0020
B        Score             2.0000    0.1667
         Probability       0.3333    0.1667

Exact Parameter Estimates
                            95% Confidence
Parameter    Estimate           Limits           p-Value
A            -3.8398*    -Infinity     -1.0718    0.0079
B             0.6931*      -2.9704    Infinity    0.6667

Figure 5.

Stratified Analyses
where $h$ indexes the strata, $\alpha_h$ are the strata intercepts, and $i$ indexes the subjects within the strata.
Stratum
1
2
3
Level 1
1
0
0
Level 2
0
1
0
Z to be 1 if the response is an event and 2 if the response is a nonevent. This variable is used as the time variable as well as the censoring indicator (with 2 as the censored value) in the MODEL statement of PROC PHREG. Also specify the TIES=DISCRETE option to request the discrete logistic model, and the STRATA statement to specify the strata to be conditioned on.
Level 1
1
0
0
Level 2
0
1
0
Level 3
0
0
1
proc phreg;
freq count;
strata Stratum;
model Z*Z(2)=X1 X2 / ties=discrete;
run;
2
2
0
1
3
3
1
2
3
3
3
3
3
0
0
1
1
1
1
2
1
2
3
0
1
0
2
2
2
1
1
2
1
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        9.6425     2        0.0081
Score                   7.9291     2        0.0190
Wald                    4.6510     2        0.0977

               Parameter   Standard                             Hazard
Variable   DF   Estimate      Error   Chi-Square   Pr > ChiSq    Ratio
X1          1    2.32474    1.11585       4.3404       0.0372   10.224
X2          1   -1.11430    0.72917       2.3353       0.1265    0.328

Figure 8.
The conditional score statistic for testing the overall null hypothesis is 7.9291 for both the asymptotic conditional analysis in PROC PHREG and the exact analysis in PROC LOGISTIC. However, PROC PHREG computes a p-value of 0.0190 by comparing the value of the conditional score statistic to a chi-squared distribution with 2 degrees of freedom (since there are two parameters), while PROC LOGISTIC derives a p-value of 0.0165 from the exact conditional distribution. Inference on individual parameters is often not the same between the exact conditional analysis and the asymptotic conditional likelihood results.
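The asymptotic p-values in Figure 8 are easy to check because the upper tail of a chi-squared distribution with 2 degrees of freedom has the closed form exp(-x/2). The Python sketch below is an illustration only; the statistics are copied from Figure 8, the exact p-value is copied from Figure 7 rather than recomputed, and the variable names are assumptions made for the example.

```python
import math

# Global test statistics copied from Figure 8 (2 degrees of freedom each).
statistics = {
    "Likelihood Ratio": 9.6425,
    "Score": 7.9291,
    "Wald": 4.6510,
}

def chisq2_upper_tail(x):
    """P(X > x) for a chi-squared variate with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

asymptotic_p = {name: chisq2_upper_tail(x) for name, x in statistics.items()}
exact_score_p = 0.0165   # exact conditional score p-value copied from Figure 7

for name, p in asymptotic_p.items():
    print(f"{name}: asymptotic p = {p:.4f}")
print(f"Score: exact p = {exact_score_p:.4f}")
```

The same statistic therefore yields two different p-values only because it is referred to two different reference distributions: a continuous chi-squared approximation and the discrete exact conditional distribution.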
Conditional Exact Tests
                                    --- p-Value ---
Effect   Test           Statistic     Exact      Mid
Joint    Score             7.9291    0.0165   0.0162
         Probability     0.000612    0.0077   0.0074

Exact Parameter Estimates
                          95% Confidence
Parameter   Estimate          Limits          p-Value
X1            1.9979     0.3140     5.2012     0.0126
X2           -1.0097    -2.9152     0.4142     0.1931

Figure 7.
Exact Results
This exact analysis should be compared to an asymptotic conditional likelihood analysis, which is available with the PHREG procedure. First, define a variable
Drug A
Drug B
where $i$ indexes the subject, $\alpha_i$ are the subject intercepts, $j$ indexes the period, and the $I(\cdot)$ are indicator variables taking the value 1 when the condition is true. Note that this model ignores carryover effects.
where
, and
The exact conditional score p-value for the test of significance of all the parameters is ; hence, you cannot reject the null hypothesis. However, the exact conditional score p-value for the test of no drug effects, , is , while the p-value for the test of no period effects, , is , which suggests that the period term should be dropped from this model.
APPENDIX
Hypothesis Tests
where
there exist with
, and .
The model to be fit is
for
and
REFERENCES
Hirji, Karim F. and Tang, Man-Lai (1998), "A Comparison of Tests for Trend," Communications in Statistics - Theory and Methods, 27, 943-963.
Hirji, Karim F., Tsiatis, Anastasios A., and Mehta, Cyrus R. (1989), "Median Unbiased Estimation for Binary Data," American Statistician, 43, 7-11.
Lancaster, H. O. (1961), "Significance Tests in Discrete Distributions," JASA, 56, 223-234.
against
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks
of SAS Institute Inc. in the USA and other countries.
® indicates USA registration.
ACKNOWLEDGMENTS
Other brand and product names are registered trademarks or trademarks of their respective companies.
Version 3.0