Journal of Statistical Software
Abstract
Accelerated failure time (AFT) models are alternatives to relative risk models which
are used extensively to examine the covariate effects on event times in censored data
regression. Nevertheless, AFT models have been much less utilized in practice due to
lack of reliable computing methods and software. This paper describes an R package aftgee
that implements recently developed inference procedures for AFT models with both
the rank-based approach and the least squares approach. For the rank-based approach,
the package allows various weight choices and uses an induced smoothing procedure that
leads to much more efficient computation than the linear programming method. With the
rank-based estimator as an initial value, the generalized estimating equation approach is
used as an extension of the least squares approach to the multivariate case. Additional
sampling weights are incorporated to handle missing data, as needed in case-cohort studies
or general sampling schemes. A simulated dataset and two real-life examples from
biomedical research are employed to illustrate the usage of the package.
1. Introduction
The linear regression model is the most commonly used regression model in data analysis
for uncensored data. When survival data are right-censored, two of the most frequently used
regression models are the relative risk model (Cox 1972) and the accelerated failure time (AFT)
model (e.g., Kalbfleisch and Prentice 2002, Chapter 4). The AFT model is appealing because
it is analogous to the classical linear regression approach, directly linking the expected failure
time to covariates. The AFT model with an unspecified error distribution is known as the
semiparametric AFT model, which has been studied extensively and is an alternative to the
2 aftgee: Fitting AFT Models with R
relative risk model with an unspecified baseline hazard function. Two methods for fitting
such models have been popular. One is the rank-based approach motivated by inverting
the weighted log-rank test (Prentice 1978). Its asymptotic properties have been rigorously
studied by Tsiatis (1990) and Ying (1993). The other method is an extension of the least
squares principle, such as the Buckley-James (BJ) estimator (Buckley and James 1979). The
theoretical properties of the BJ estimator were investigated in Ritov (1990) and Lai and Ying
(1991). Due to lack of efficient and reliable computing algorithms, both approaches have
not been widely used in practice until recently (Jin, Lin, Wei, and Ying 2003; Jin, Lin, and
Ying 2006b,c). Our R package aftgee (Chiou, Kang, and Yan 2014c) aims to provide easy
access to AFT models with both methods based on the recent methodological developments.
Package aftgee is available from the Comprehensive R Archive Network (CRAN) at
http://CRAN.R-project.org/package=aftgee.
Several packages for AFT models are available for the R environment (R Core Team 2014).
For parametric AFT models, where the error distribution is parametrically specified, one can
use survreg in package survival (Therneau 2014), psm in package rms (Harrell 2014) or aftreg
in package eha (Broström 2014). Misspecified error distributions in parametric AFT modeling
may lead to bias in estimation and false conclusions in the presence of censoring. For
semiparametric AFT models with unspecified error distribution, one can use bj in package
rms (Harrell 2014) or lss in package lss (Jin and Huang 2007). Function bj provides the
BJ estimator but it has several limitations: it computes the variance estimator based on
non-censored observations only which, although this has been reported to behave well in
simulation studies, lacks theoretical justification (Wei 1992); its convergence is slow and not
guaranteed; and it is only implemented for univariate failure time data. Package lss provides a
rank-based estimator with Gehan’s weight obtained from a linear programming approach (Jin
et al. 2003) and a least squares estimator with an iterative algorithm starting from the rank-
based estimator (Jin et al. 2006b). The variance estimators for both methods are bootstrap
based with validity theoretically justified. Nevertheless, lss falls short in several
respects. Its rank-based estimator is limited to Gehan’s weight, which may not be the optimal
weight (Tsiatis 1990). The linear programming approach used for the rank-based estimator
is computationally very intensive, which also affects the least squares estimator through the
initial estimator. The bootstrap based variance estimation is very time consuming. Although
easily fixable, the package does not support user-specified initial values for the least squares
estimator. For clustered failure times, it operates with working independence and disregards
the within-cluster dependence, which may lead to efficiency loss especially when the within-
cluster dependence is strong (Chiou, Kang, Kim, and Yan 2014a).
Our package aftgee overcomes the aforementioned limitations in existing implementations and
provides a set of comprehensive tools for semiparametric AFT models in practical survival
analysis. For the rank-based estimator with Gehan’s weight, we implemented the induced
smoothing approach which is much faster than the linear programming approach without
loss in accuracy (Brown and Wang 2005, 2007). The induced smoothing approach has been
extended to work with any general weight (in addition to Gehan’s weight; Chiou, Kang, and
Yan 2013). Our efficient sandwich variance estimators provide much faster alternatives to the
full bootstrap variance estimation (Chiou, Kang, and Yan 2014b). With the fast rank-based
estimators as initial estimators, we implemented an iterative least squares procedure
that extends generalized estimating equations (GEE) to clustered censored data (Chiou et al.
2014a). The resulting estimator is robust to misspecification of the working covariance matrix,
and the efficiency is higher when the working covariance structure is closer to the truth.
Furthermore, these methodologies are generalized to incorporate additional sampling weights
for handling missing data and various sampling schemes (Chiou, Kang, and Yan 2014d).
Because of these features, the aftgee package is appealing to analysts who would like to fit
AFT models in their routine analysis of survival data.
The rest of the article is organized as follows. In the next section, we introduce the notation
and model formulation for the univariate AFT model. The multivariate extension is presented
in Section 3. Incorporation of sampling weights, with application to case-cohort data, is
discussed in Section 4. Detailed usage is described in Section 5. A simulated dataset, a
univariate example and a multivariate example are used for illustration in Section 6.
Conclusions and some remarks are given in Section 7.
$$T_i = X_i^\top \beta + \epsilon_i, \quad i = 1, \ldots, n,$$
where $e_i(\beta) = Y_i - X_i^\top \beta$ and $\varphi_i(\beta)$ is a possibly data-dependent
nonnegative weight function with values between 0 and 1. Let $\hat{F}_{e_i(\beta)}(t)$ be the
estimated cumulative distribution function based on the censored residuals $e_i(\beta)$. Some
common choices of $\varphi_i(\beta)$ are 1, $n^{-1} \sum_{j=1}^{n} I[e_j(\beta) \ge e_i(\beta)]$,
$1 - \hat{F}_{e_i(\beta)}(t)$ and $[1 - \hat{F}_{e_i(\beta)}(t)]^{\rho}$, $\rho \ge 0$, corresponding to
log-rank (Prentice 1978), Gehan (Gehan 1965), Prentice-Wilcoxon (Prentice 1978) and the more
general $G^{\rho}$ class (Harrington and Fleming 1982), respectively. The Kaplan-Meier estimator is
typically used to obtain $\hat{F}_{e_i(\beta)}(t)$. The solution of Equation 1, $\hat{\beta}_{n,\varphi}$,
is consistent for the true parameter, $\beta_0$, and is asymptotically normal (Tsiatis 1990; Ying
1993). Noting that Equation 1 with Gehan’s weight is the gradient of an objective function, Jin
et al. (2003) used a linear programming approach to obtain the estimator, which is
computationally demanding, especially for larger datasets and for obtaining variance estimators
through bootstrap. In our implementation, we used the Barzilai-Borwein spectral method
implemented in package BB (Varadhan and Gilbert 2009) to solve Equation 1 directly.
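As a rough illustration of the weight choices listed above, the following base R sketch evaluates the rank weights for a given vector of residuals. The helper names km_surv and rank_weights are hypothetical and are not functions of aftgee; evaluating the Kaplan-Meier estimate at each residual with a right-continuous step function is an assumption made here for concreteness.

```r
# Kaplan-Meier survival estimate 1 - F(t) of the censored residuals,
# evaluated at each residual e[i] (right-continuous step function; an assumption).
km_surv <- function(e, delta) {
  ut <- sort(unique(e[delta == 1]))                # distinct uncensored residuals
  if (length(ut) == 0) return(rep(1, length(e)))   # no events: survival stays at 1
  at_risk <- vapply(ut, function(t) sum(e >= t), numeric(1))
  events  <- vapply(ut, function(t) sum(e == t & delta == 1), numeric(1))
  surv <- cumprod(1 - events / at_risk)
  stepfun(ut, c(1, surv))(e)
}

# Rank weights for the four choices named in the text.
rank_weights <- function(e, delta,
                         type = c("logrank", "gehan", "PW", "GP"), rho = 1) {
  type <- match.arg(type)
  switch(type,
    logrank = rep(1, length(e)),
    gehan   = vapply(e, function(ei) mean(e >= ei), numeric(1)),
    PW      = km_surv(e, delta),
    GP      = km_surv(e, delta)^rho
  )
}

# Toy residuals: the second observation is censored.
e <- c(1, 2, 3, 4)
delta <- c(1, 0, 1, 1)
w_gehan <- rank_weights(e, delta, "gehan")
w_pw    <- rank_weights(e, delta, "PW")
```

The Gehan weight only needs the at-risk proportions, whereas the Prentice-Wilcoxon and $G^{\rho}$ weights depend on the censoring indicators through the Kaplan-Meier estimate.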
where $\kappa_{ij}(\beta) = [e_j(\beta) - e_i(\beta)] / r_{ij}$. The asymptotic equivalence between
Equation 3 and the smooth version of Equation 1 for the log-rank weight is established in
Chiou et al. (2013). For general weights, the regression parameters can be estimated from an
iterative induced smoothing procedure with the following steps:
1. Obtain an initial estimate $\tilde{\beta}^{(0)}_{n,\varphi} = b_n$ of $\beta$ and initialize with $m = 1$.
2. Update $\tilde{\beta}^{(m)}_{n,\varphi}$ by solving $\tilde{U}_{n,\varphi}(\tilde{\beta}^{(m-1)}_{n,\varphi}, \tilde{\beta}^{(m)}_{n,\varphi}) = 0$.
3. Increase $m$ by one and repeat Step 2 until $|\tilde{\beta}^{(m-1)}_{n,\varphi,q} - \tilde{\beta}^{(m)}_{n,\varphi,q}| < t$ for all $q = 1, \ldots, p$,
where $\tilde{\beta}^{(m)}_{n,\varphi,q}$ is the $q$th component of $\tilde{\beta}^{(m)}_{n,\varphi}$ and $t$ is a prefixed tolerance.
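The three steps above amount to a fixed-point iteration with a componentwise stopping rule. A minimal sketch of that control flow follows; iterate_is and solve_step are hypothetical names, and solve_step stands in for whatever root finder solves the smoothed estimating equation with the weights held at the previous iterate. The toy solver is only a contraction used to exercise the loop, not an estimating equation.

```r
# Generic fixed-point iteration mirroring Steps 1-3.
# solve_step(b) is a hypothetical user-supplied function that returns the root in
# beta of the smoothed estimating equation with the weights evaluated at b.
iterate_is <- function(b0, solve_step, tol = 1e-3, maxiter = 50) {
  b_old <- b0                                      # Step 1: initial estimate
  for (m in seq_len(maxiter)) {
    b_new <- solve_step(b_old)                     # Step 2: update
    if (all(abs(b_new - b_old) < tol)) {           # Step 3: componentwise check
      return(list(beta = b_new, iter = m, converged = TRUE))
    }
    b_old <- b_new
  }
  list(beta = b_old, iter = maxiter, converged = FALSE)
}

# Toy stand-in for the solver: a contraction with fixed point (2, -1).
toy_step <- function(b) 0.5 * b + c(1, -0.5)
fit <- iterate_is(c(0, 0), toy_step, tol = 1e-6)
```

Returning the iteration count and a convergence flag makes it easy to detect the case where the maximum number of iterations is reached before the tolerance is met.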
A simple choice of the initial estimator is the easy-to-compute Gehan’s estimator, β̃n,G .
Since estimating Equation 3 is not necessarily monotone in β, it might cause numerical prob-
lems in solving the estimating equations. Inspired by a discussion in Jin et al. (2003), a
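where $\phi_i(\beta) = \varphi_i(\beta) / \sum_{j=1}^{n} I[e_j(\beta) \ge e_i(\beta)]$. Fixing the weight
$\phi_i(b)$ evaluated at $b$ and applying induced smoothing on Equation 4 lead to
$$\tilde{U}_{n,\phi}(b, \beta) = \sum_{i=1}^{n} \sum_{j=1}^{n} \Delta_i \phi_i(b) (X_i - X_j) \Phi\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0. \tag{5}$$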
This is the same as Equation 2 except for the weight φi (b) which is free from β. For an
initial estimator b of β, an estimator β̃n,φ can be obtained from the iterative procedure with
Ũn,ϕ (b, β) replaced by Ũn,φ (b, β). Using the arguments in Jin et al. (2003), the consistency and
asymptotic normality of the resulting estimators can be established (Chiou et al. 2013). The
equations within each iteration can be solved with package BB. The convergence is usually fast
with the initial Gehan’s estimator. Variance estimation can be done with the full resampling
method (Jin et al. 2003) or a fast sandwich variance estimator (Chiou et al. 2014b, 2013).
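To make the structure of Equation 5 concrete, here is a base R sketch that evaluates the smoothed estimating function for fixed weights. The function name smooth_rank_ee is hypothetical and this is not the package's internal code. The smoothing scale is taken as $r_{ij}^2 = (X_i - X_j)^\top \Sigma (X_i - X_j)$ with $\Sigma$ defaulting to the identity matrix divided by $n$; that default is an assumption made for illustration. With all weights equal to one the function corresponds to the smoothed Gehan case.

```r
# Smoothed rank estimating function with fixed weights phi_i(b).
# X is an n x p matrix, Y the (log) observed times, delta the event indicator.
smooth_rank_ee <- function(beta, Y, delta, X, phi = rep(1, length(Y)),
                           Sigma = diag(ncol(X)) / length(Y)) {
  e <- drop(Y - X %*% beta)                  # residuals e_i(beta)
  U <- numeric(ncol(X))
  for (i in which(delta == 1)) {             # only uncensored i contribute
    D <- sweep(X, 2, X[i, ], "-")            # row j holds X_j - X_i
    r <- sqrt(rowSums((D %*% Sigma) * D))    # r_ij
    w <- pnorm((e - e[i]) / r)               # Phi[(e_j - e_i) / r_ij]
    w[r == 0] <- 0                           # X_i = X_j contributes nothing
    U <- U + phi[i] * colSums(-D * w)        # sum_j (X_i - X_j) * w_j
  }
  U
}

# One-covariate illustration without censoring; the true slope is 1.
set.seed(1)
n <- 200
X <- matrix(rnorm(n), ncol = 1)
Y <- drop(X) + rnorm(n, sd = 0.5)
delta <- rep(1, n)
root <- uniroot(function(b) smooth_rank_ee(b, Y, delta, X), c(-10, 10))$root
```

Because the weights are held fixed, the function is smooth in $\beta$, so a generic root finder can be applied; uniroot is used here only because the example has a single covariate.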
The theoretical properties of the BJ estimator have been studied by Ritov (1990) and Lai and
Ying (1991). The method, however, is rarely used in practice due to numerical challenges.
Jin, Lin, Wei, and Ying (2006a) proposed a more practical solution that generalizes the BJ
estimator. Given an initial estimator $b_n$ of $\beta$, the least squares estimator is the solution of
the following estimating equation
$$U_{n,ls}(\beta, b) = \sum_{i=1}^{n} (X_i - \bar{X})^\top \{\hat{Y}_i(b) - X_i \beta\} = 0, \tag{6}$$
where $\bar{X} = \sum_{i=1}^{n} X_i / n$. The solution to $U_{n,ls}(\beta, \beta) = 0$ is the BJ estimator. The
advantage of fixing the initial value $b_n$ is to avoid the numerical complexity of solving
$U_{n,ls}(\beta, \beta) = 0$, which is neither continuous nor monotone in $\beta$. Jin et al. (2006a) devised
an iterative procedure $\hat{\beta}^{(m)}_{n,ls} = L_n(\hat{\beta}^{(m-1)}_{n,ls})$ for $m \ge 1$ with
$\hat{\beta}^{(0)}_{n,ls} = b_n$, where
$$L_n(b) = \left[\sum_{i=1}^{n} (X_i - \bar{X})^\top (X_i - \bar{X})\right]^{-1} \left[\sum_{i=1}^{n} (X_i - \bar{X})^\top \{\hat{Y}_i(b) - \bar{Y}(b)\}\right],$$
and $\bar{Y}(b) = \sum_{i=1}^{n} \hat{Y}_i(b) / n$. If the initial estimator $b_n$ is consistent and
asymptotically normal, then $\hat{\beta}^{(m)}_{n,ls}$ is also consistent and asymptotically normal for
every $m$ (Jin et al. 2006b). A good
candidate for the initial estimator is the induced smoothing Gehan estimator. The variance
of the resulting estimator can be approximated by a resampling procedure (Jin et al. 2006b).
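The update $L_n(b)$ is an ordinary least squares fit on centered covariates with the imputed responses. The sketch below shows that update and the fixed-length iteration in base R. The names ls_update, ls_iterate and yhat are hypothetical; yhat stands for a user-supplied function returning the imputed responses $\hat{Y}_i(b)$, whose construction is not reproduced in this sketch.

```r
# One least squares update L_n(b); yhat is a hypothetical function returning the
# vector of imputed responses Y-hat_i(b).
ls_update <- function(b, X, yhat) {
  Xc <- sweep(X, 2, colMeans(X), "-")        # X_i - X-bar
  yh <- yhat(b)
  drop(solve(crossprod(Xc), crossprod(Xc, yh - mean(yh))))
}

# Iterate beta^(m) = L_n(beta^(m-1)) for a fixed number of steps M.
ls_iterate <- function(b0, X, yhat, M = 5) {
  b <- b0
  for (m in seq_len(M)) b <- ls_update(b, X, yhat)
  b
}

# Without censoring the imputed responses are the observed ones, so a single
# update returns the ordinary least squares slopes.
set.seed(2)
X <- cbind(x1 = rbinom(30, 1, 0.5), x2 = rnorm(30))
Y <- drop(2 + X %*% c(1, 1) + rnorm(30))
b1 <- ls_iterate(c(0, 0), X, function(b) Y, M = 1)
```

In the censored case yhat depends on b, which is why the update has to be repeated rather than applied once.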
where $e_{ik}(\beta) = Y_{ik} - X_{ik}^\top \beta$ and $\varphi_{ik}(\beta)$ is a possibly data-dependent
nonnegative weight function. If $K_i = 1$ for all $i = 1, \ldots, n$, Equation 8 will reduce to
Equation 1. This estimating equation also yields a consistent estimator for $\beta_0$ (Jin et al.
2006a). Applying the aforementioned induced smoothing technique, the smoothed version of
Equation 8 with Gehan’s weight is
$$\tilde{U}_{n,G}(\beta) = \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j=1}^{n} \sum_{l=1}^{K_j} \Delta_{ik} (X_{ik} - X_{jl}) \Phi\left[\frac{e_{jl}(\beta) - e_{ik}(\beta)}{r_{ikjl}}\right] = 0, \tag{9}$$
where $r_{ikjl}^2 = (X_{ik} - X_{jl})^\top \Sigma_n (X_{ik} - X_{jl})$. The consistency and asymptotic
properties continue to hold (Johnson and Strawderman 2009). The multivariate versions of
Equations 3 and 5 are
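$$\tilde{U}_{n,\varphi}(\beta) = \sum_{i=1}^{n} \sum_{k=1}^{K_i} \Delta_{ik} \varphi_{ik}(b) \left[X_{ik} - \frac{\sum_{j=1}^{n} \sum_{l=1}^{K_j} X_{jl} \Phi[\kappa_{ikjl}(\beta)]}{\sum_{j=1}^{n} \sum_{l=1}^{K_j} \Phi[\kappa_{ikjl}(\beta)]}\right] = 0, \tag{10}$$
and
$$\tilde{U}_{n,\phi}(\beta) = \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j=1}^{n} \sum_{l=1}^{K_j} \Delta_{ik} \phi_{ik}(\beta) (X_{ik} - X_{jl}) \Phi\left[\frac{e_{jl}(\beta) - e_{ik}(\beta)}{r_{ikjl}}\right] = 0, \tag{11}$$
where $\kappa_{ikjl}(\beta) = [e_{jl}(\beta) - e_{ik}(\beta)] / r_{ikjl}$ and
$\phi_{ik}(\beta) = \varphi_{ik}(\beta) / \sum_{j=1}^{n} \sum_{l=1}^{K_j} I[e_{jl}(\beta) \ge e_{ik}(\beta)]$.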
The same iterative procedure as for the univariate case can be used and the asymptotic
properties of the resulting estimator, β̃n,φ , continue to hold.
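where $\bar{X} = \sum_{i=1}^{n} X_i / n$. Given $\alpha$ and $b$, the solution to Equation 12 has a closed form
$$L_n(b, \alpha) = \left[\sum_{i=1}^{n} (X_i - \bar{X})^\top \Omega_i^{-1}\{\alpha(b)\} (X_i - \bar{X})\right]^{-1} \left[\sum_{i=1}^{n} (X_i - \bar{X})^\top \Omega_i^{-1}\{\alpha(b)\} \{\hat{Y}_i(b) - \bar{Y}(b)\}\right],$$
where $\bar{Y}(b) = \sum_{i=1}^{n} \hat{Y}_i(b) / n$.
The GEE estimator, denoted by $\hat{\beta}_{n,GEE}$, can be obtained from an iterative procedure:
1. Obtain an initial estimate $\hat{\beta}^{(0)}_{n,GEE} = b_n$ of $\beta$ and initialize with $m = 1$.
2. Obtain an estimate $\hat{\alpha}_n$ of $\alpha$ given $\hat{\beta}^{(m-1)}_{n,GEE}$, $\hat{\alpha}_n(\hat{\beta}^{(m-1)}_{n,GEE})$.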
The iteration proceeds with the aid of function geese in package geepack (Højsgaard, Halekoh,
and Yan 2014; Halekoh, Højsgaard, and Yan 2006). The estimator reduces to the least squares
estimator of Jin et al. (2006a) when the working weight matrices Ωi ’s are the identity matrices.
We refer to Chiou et al. (2014a) for more details. The working parameter estimate α̂n does
not affect the consistency of the GEE estimator, but may affect its efficiency. Higher efficiency
can be achieved if Ωi is closer to the covariance matrix of Ŷi (b) and even an imperfect working
weight still improves the efficiency (Chiou et al. 2014a). The variance of the estimator can
again be estimated by resampling procedures.
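The closed form $L_n(b, \alpha)$ can be sketched in base R as a weighted least squares accumulation over clusters. The function name gee_update is hypothetical, the working matrix is taken to be exchangeable with a fixed correlation alpha, and covariates are centered by their overall column means; all three are assumptions for illustration. In the package the working parameter is estimated with geese rather than supplied by hand.

```r
# One GEE-type update given imputed responses yhat (a numeric vector) and a
# fixed exchangeable working correlation alpha (an illustrative assumption).
gee_update <- function(X, yhat, id, alpha = 0) {
  Xc <- sweep(X, 2, colMeans(X), "-")        # centered covariates
  yc <- yhat - mean(yhat)                    # centered imputed responses
  p <- ncol(X)
  A <- matrix(0, p, p)
  v <- numeric(p)
  for (g in unique(id)) {
    idx <- which(id == g)
    k <- length(idx)
    Omega <- matrix(alpha, k, k)             # exchangeable working matrix
    diag(Omega) <- 1
    Oinv <- solve(Omega)
    Xg <- Xc[idx, , drop = FALSE]
    A <- A + t(Xg) %*% Oinv %*% Xg
    v <- v + drop(t(Xg) %*% Oinv %*% yc[idx])
  }
  drop(solve(A, v))
}

# Ten clusters of size two, no censoring, so yhat is simply the response.
set.seed(3)
id <- rep(1:10, each = 2)
X <- cbind(rnorm(20), rnorm(20))
Y <- drop(X %*% c(1, -1) + rnorm(20))
b_ind <- gee_update(X, Y, id, alpha = 0)     # identity working matrices
b_ex  <- gee_update(X, Y, id, alpha = 0.5)   # exchangeable working matrices
```

With alpha set to zero the working matrices are identity matrices and the update coincides with the least squares slopes, in line with the reduction noted in the text.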
and
$$\tilde{U}_{n,\phi}(\beta) = \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j=1}^{n} \sum_{l=1}^{K_j} h_i h_j \Delta_{ik} \phi_{ik}(\beta) (X_{ik} - X_{jl}) \Phi\left[\frac{e_{jl}(\beta) - e_{ik}(\beta)}{r_{ikjl}}\right] = 0. \tag{14}$$
Note that if we sample all subjects within each stratum, then Equations 13 and 14 reduce to
Equations 10 and 11, respectively. The variance estimates can be obtained via resampling
procedures or fast sandwich variance estimators similar to the unweighted versions (Chiou
et al. 2014b,d).
5. Package implementation
The two major functions in package aftgee are aftsrr for the rank-based approach and aftgee
for the least squares or GEE approach. The synopsis of aftsrr is:
The required arguments are formula and data. Argument formula specifies the model to be
fit with the variables coming with data. The formula is the same as the argument of function
survreg in package survival, with response created from Surv. The ‘Surv’ object consists
of two columns, where the first column is the survival time or censored time and the second
column is the censoring indicator, indicating right censored data. Since ranks are invariant to
location shift, the intercept cannot be estimated and the estimation will ignore the intercept
term whether it is specified or not. Clusters are defined by vector id. The weights argument
is a vector containing sampling weights ($h_i$) as described in Section 4. When data arise
from a stratified design, a vector of integers that specifies the stratification is supplied in
strata. The length of the arguments id, weights and strata needs to be the same as the
number of observations. The rank weight, controlled by argument rankWeights, includes the
aforementioned log-rank weight ("logrank"), Gehan’s weight ("gehan"), Prentice-Wilcoxon
weight ("PW") and general $G^{\rho}$ class weight ("GP"). Argument method determines the type of
weighted estimating equations to be used. When method = "nonsm", regression parameters
are estimated by directly solving the nonsmooth estimating Equations 1 or 8. When method =
"sm" and rankWeights = "gehan", the induced smoothing estimating Equations 2 or 9 are used.
For the non-Gehan weights, method = "sm" and method = "monosm" apply the iterative
procedure with the smooth estimating Equations 3 and 5, respectively. The initial values for
the variance estimator, or the $\Sigma_n$ in the smoothing process, are determined by sigmainit.
The identity matrix is used for sigmainit if it is left unspecified.
Given a point estimate, variance estimates can be obtained from several approaches which
are specified by argument variance. A straightforward but computationally inefficient
variance estimator is the multiplier bootstrap approach ("MB"). A more efficient method is to
consider sandwich variance estimators (Chiou et al. 2014b). Suppose the variance of the
estimator has a sandwich form, $\Sigma = A^{-1} V (A^{-1})^\top$, where $V$ is the asymptotic variance of
the estimating function and $A$ is the slope matrix. Chiou et al. (2014b) proposed to estimate $V$
by either a closed-form formulation (CF) or by bootstrapping the estimating equations (MB).
The bootstrap estimate of $V$ is much less demanding than the full multiplier bootstrap,
because it only involves evaluations of estimating equations instead of solving them. On the
other hand, to estimate the slope matrix $A$, Chiou et al. (2014b) proposed three methods
based on the induced smoothing approach (IS), smoothed Huang’s approach (sH) motivated
by Huang (2002) or Zeng and Lin’s approach (ZL) by Zeng and Lin (2008). Combining the
choices for $V$ and $A$ yields six sandwich variance estimators: "ISCF", "ISMB", "ZLCF", "ZLMB",
"sHCF" and "sHMB". When a bootstrap is needed, the bootstrap size is controlled
by B with default value 100.
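The sandwich form itself is a two-line computation once estimates of $A$ and $V$ are available. The sketch below also indicates how a bootstrap estimate of $V$ only requires repeated evaluation of the estimating function. The names sandwich_var, boot_V and U_weighted are hypothetical, and the use of unit exponential multipliers is an assumption; neither function reproduces the package's internal routines.

```r
# Sandwich variance from a slope matrix A and a variance matrix V.
sandwich_var <- function(A, V) {
  Ainv <- solve(A)
  Ainv %*% V %*% t(Ainv)
}

# Bootstrap estimate of V: U_weighted(eta) is a hypothetical function that
# evaluates the estimating function at the point estimate with subject-level
# multipliers eta. No equation is solved inside the loop.
boot_V <- function(U_weighted, n, B = 100) {
  reps <- matrix(replicate(B, U_weighted(rexp(n))), ncol = B)   # p x B
  cov(t(reps))
}

# Toy estimating function (a centered, scaled sum) with a scalar parameter.
set.seed(4)
x <- rnorm(50)
U_toy <- function(eta) sum(eta * (x - mean(x))) / sqrt(length(x))
V_hat <- boot_V(U_toy, n = length(x), B = 200)
Sigma_hat <- sandwich_var(matrix(2), V_hat)
```

The cost of boot_V grows with B times the cost of one evaluation of the estimating function, which is why it is cheaper than re-solving the estimating equation in every bootstrap replicate.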
The convergence for the procedure is controlled by relative tolerance. The iteration stops and
the output is given when the tolerance is met or iteration reaches the pre-specified maximum
iteration number. The default relative tolerance is set at 0.001 and the default maximum
The maximum number of iterations is controlled by maxiter and relative convergence toler-
ance is controlled by reltol. A logical value, trace, is used to determine whether to print
the output for each iteration.
The least squares estimator can be obtained by calling aftgee with the following arguments
Most of the arguments and the convergence criterion of aftsrr are shared by aftgee. With
aftgee, the intercept, if included, is estimated by the mean of the estimated cumulative
distribution function based on the censored residual computed from the slope estimator.
The margin argument is a vector with the same length as data. It is used to specify the
marginal distribution within clusters. Identical marginal distributions are assumed with un-
specified margin. A character string, corstr, is used to specify the working correlation
structure, as offered by package geepack. Four working correlation structures are indepen-
dence ("independence"), exchangeable ("exchangeable"), autoregressive model of order one
("ar1") and unstructured ("unstructured"). The default is "independence". The initial
value is specified by binit with default "srrgehan" giving the induced smoothing rank-based
estimator with Gehan’s weight. Alternatively, although not recommended, the simple linear
regression with censored observations ignored ("lm") can also be used for faster results.
In the uncensored case, aftgee with independence working correlation structure will return
an ordinary least squares estimate. In the multivariate case, efficiency can be improved in
aftgee when the working correlation structure is close to the true correlation even in the
absence of censoring. A more detailed multivariate illustration is presented with the kidney
catheter data in Section 6.3.
6. Illustrations
$$T = 2 + X_1 + X_2 + \epsilon,$$
where $X_1$ is Bernoulli with rate 0.5 and $X_2$ is a standard normal variable. The error term,
$\epsilon$, follows an exponential distribution with mean 3. The censoring time was generated from
Uniform(0, τ ) with τ adjusted to yield approximately 50% censoring rate. A dataset with 500
subjects was generated with the following code:
+ x2 <- rnorm(n)
+ e <- rweibull(n, 1, 3)
+ T <- exp(2 + x1 + x2 + e)
+ cstime <- runif(n, 0, tau)
+ delta <- (T < cstime) * 1
+ Y <- pmin(T, cstime)
+ out <- data.frame(T = T, Y = Y, delta = delta, x1 = x1, x2 = x2)
+ }
R> set.seed(1)
R> mydata <- datgen()
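Since the listing above begins partway through the function body, a self-contained sketch of a comparable generator may be useful. The name datgen_sketch and its arguments are hypothetical, and the way tau is chosen here (root finding on the expected censoring proportion) is an illustrative device rather than the procedure used to produce the results reported in this section. It relies on the fact that, for a censoring time uniform on (0, tau), the conditional censoring probability given the failure time is min(T, tau) / tau.

```r
# Simulate right-censored data from the model described in the text and pick
# tau so that the expected censoring proportion is close to `target`.
datgen_sketch <- function(n = 500, target = 0.5) {
  x1 <- rbinom(n, 1, 0.5)
  x2 <- rnorm(n)
  e <- rweibull(n, 1, 3)                     # shape 1: exponential with mean 3
  T <- exp(2 + x1 + x2 + e)
  # For C ~ Uniform(0, tau), P(C < T | T) = min(T, tau) / tau.
  cens_rate <- function(ltau) mean(pmin(T, exp(ltau))) / exp(ltau) - target
  tau <- exp(uniroot(cens_rate, log(range(T)) + c(-1, 20))$root)
  cstime <- runif(n, 0, tau)
  delta <- (T < cstime) * 1                  # 1 = failure observed
  Y <- pmin(T, cstime)
  list(data = data.frame(T = T, Y = Y, delta = delta, x1 = x1, x2 = x2),
       tau = tau)
}

set.seed(1)
sim <- datgen_sketch()
cens_prop <- mean(sim$data$delta == 0)       # realized censoring proportion
```

Working on the log scale of tau keeps the root search stable even though the simulated failure times span many orders of magnitude.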
On a 3.3 GHz Linux machine, we start with the comparison between two versions of Gehan’s
estimator: one is the nonsmooth version estimated from Equation 1 fitted with lss and the
other is the smooth version estimated from Equation 2 fitted with aftsrr. For lss, variances
are estimated with the multiplier bootstrap approach with bootstrap sample size 100. For
aftsrr, in addition to the full multiplier bootstrap variance estimator from Equation 2 ("MB"),
the sandwich variance estimator using the induced smoothing approach ("ISMB") is also considered.
R> library("aftgee")
R> library("survival")
R> library("lss")
R> system.time(rk.lss <- lss(Surv(log(Y), delta) ~ x1 + x2, data = mydata,
+ gehanonly = TRUE, mcsize = 100))
x1 x2
rk.lss 0.9412 0.9496
srrMB 0.9399 0.9499
srrISMB 0.9399 0.9499
x1 x2
rk.lss 0.1552 0.07141
srrMB 0.1333 0.07133
srrISMB 0.1385 0.06940
The output indicates that aftsrr clearly outperforms lss in timing. The timing result also
suggests that the efficient sandwich estimator is substantially faster.
We next fit the simulated data with the least squares approach. For the parametric approach,
survreg is used with dist = "lognormal". For the semiparametric approach, the BJ esti-
mator (bj from package rms), the lss estimator and the aftgee estimator are considered.
R> library("rms")
R> system.time(ls.bj <- bj(Surv(Y, delta) ~ x1 + x2, data = mydata))
Intercept x1 x2
bj 4.509 0.9833 0.9332
lss NA 0.9838 0.9338
gee 4.510 0.9838 0.9338
sur 4.313 0.9126 0.8652
Table 1: Comparison of the bj, lss, aftgee and survreg estimators. Bias and MSE represent
the bias and mean squared error of the estimator, respectively. The true regression coefficient
is β0 . Each cell is the average of 1000 replicates.
Intercept x1 x2
bj 0.07846 0.1261 0.06448
lss NA 0.1932 0.09152
gee 0.10588 0.1758 0.09061
sur 0.11078 0.1672 0.08321
Estimation for both the lss estimator and the aftgee estimator is based on a rank-based
initial value that is invariant to the intercept. Once the slope estimator is obtained, the lss
estimator leaves out the intercept whereas the bj and aftgee approaches estimate the intercept
by the mean of the estimated cumulative distribution function based on the censored residuals
computed from the slope estimator. The semiparametric methods from bj, lss and aftgee
provide fairly close point estimates. In terms of timing, the lss estimator took the longest
with more than six minutes. For further investigation, the estimation performance is assessed
via bias and mean squared error with a full scale simulation. Table 1 summarizes the results
for 1000 replicates.
The performance of bj, lss and aftgee is similar in terms of the biases and mean squared
errors. As expected, when the error distribution is misspecified in the parametric model, the
survreg approach produced a biased estimate.
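The two summary measures used in such a simulation are simple column-wise averages over replicates. A minimal sketch follows; sim_summary is a hypothetical helper and the small matrix of estimates is made up purely to show the arithmetic, not taken from the study.

```r
# Bias and mean squared error from a matrix of replicate estimates
# (one row per replicate, one column per coefficient).
sim_summary <- function(est, beta0) {
  err <- sweep(est, 2, beta0, "-")           # estimate minus true value
  rbind(Bias = colMeans(err), MSE = colMeans(err^2))
}

# Three made-up replicates of two coefficients with true value (1, 1).
est <- rbind(c(0.9, 1.2), c(1.1, 0.8), c(1.0, 1.3))
res <- sim_summary(est, beta0 = c(1, 1))
```

Because the mean squared error equals the variance plus the squared bias, reporting both quantities shows how much of the error comes from systematic bias, as in the misspecified parametric fit.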
To take advantage of the full cohort data, we fit the full-cohort data with aftsrr using two
types of sandwich variance estimators ("ISMB" and "ISCF").
R> set.seed(1)
R> system.time(fit.IS <- aftsrr(Surv(edrel, rel) ~ histol + age,
+ data = nwtco, variance = c("ISCF", "ISMB")))
R> summary(fit.IS)
Call:
aftsrr(formula = Surv(edrel, rel) ~ histol + age, data = nwtco,
variance = c("ISCF", "ISMB"))
All point estimators and variance estimators are close to each other. The coefficient of
the central lab histological diagnosis is found to be significantly different from zero. In
addition, this coefficient is negative. This suggests that patients with an unfavorable
central lab histological diagnosis tend to have a shorter time to tumor relapse.
With the same dataset, we next demonstrate incorporating weights via a case-cohort design.
Define cases and controls as those who experience the event of interest by the end of the
study period and who do not, respectively. In nwtco, there are 571 cases who experienced
the relapse of tumor and 3457 controls who did not experience the relapse of tumor. The
case-cohort sample is the union of all the cases and the sub-cohort sample selected via a
simple random sampling. The case-cohort sample of the data had 1154 subjects, including all
571 cases and 583 controls. This gave sampling weights 1 and 5.93 for the cases and controls,
respectively. The following code gives a summary of the case-cohort weight, $h_i$.
0 1
FALSE 2874 486
TRUE 583 85
0 1 5.93
2874 571 583
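The weights summarized above can be reproduced from the counts alone. The sketch below builds synthetic indicator vectors that match the reported cross-tabulation and computes $h_i$ as 1 for cases, the inverse sampling fraction for sampled controls and 0 for unsampled controls. The variable names are hypothetical and the vectors are not the nwtco data; only the counts are taken from the text.

```r
# Synthetic indicators matching the reported counts.
rel   <- c(rep(1, 571), rep(0, 3457))                   # 1 = case (relapse)
insub <- c(rep(c(TRUE, FALSE), c(85, 486)),             # cases in / out of sub-cohort
           rep(c(TRUE, FALSE), c(583, 2874)))           # controls in / out

# Sampling fraction among controls and the resulting case-cohort weights.
p_ctrl <- sum(insub & rel == 0) / sum(rel == 0)
hi <- ifelse(rel == 1, 1, ifelse(insub, 1 / p_ctrl, 0))

tab_design  <- table(insub, rel)
tab_weights <- table(round(hi, 2))
```

The control weight is 3457 / 583, which rounds to the 5.93 quoted in the text, and the zero weights correspond to controls outside the case-cohort sample.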
For the case-cohort design, we also demonstrate the usage of different rank weights in a rank-
based approach; we considered the Gehan, log-rank and PW weights. For the log-rank and
PW weights, the monotone function approach was used. The lss package of Jin and Huang
(2007) was not considered in this analysis because it does not have the capability of handling
general rank weights and sampling weights. Standard errors are estimated with the efficient
sandwich variance estimator, ZLMB. Commands for these estimators are presented below
followed by a summary.
R> summary(fit.gh)
Call:
aftsrr(formula = Surv(edrel, rel) ~ histol + age, data = nwtco,
subset = in.casecohort, weights = hi, variance = "ZLMB")
R> summary(fit.lk)
Call:
aftsrr(formula = Surv(edrel, rel) ~ histol + age, data = nwtco,
subset = in.casecohort, weights = hi, rankWeights = "logrank",
variance = "ZLMB")
R> summary(fit.pw)
Call:
aftsrr(formula = Surv(edrel, rel) ~ histol + age, data = nwtco,
subset = in.casecohort, weights = hi, rankWeights = "PW",
method = "monosm", variance = "ZLMB")
Although the differences in standard errors among the three weights are noticeable, they all
lead to the same conclusion. The p values suggest that the coefficient of central histological
diagnosis is significantly different from zero, that is, the diagnosis has a significant effect
on the time to relapse. This result is also found in the full cohort analysis. Compared to the
full cohort analysis, all point estimates are reasonably close.
Call:
aftgee(formula = Surv(time, status) ~ age + sex, data = kidney,
id = id)
AFTGEE Estimator
Estimate StdErr z.value p.value
(Intercept) 2.071 0.609 3.40 0.001 ***
age -0.005 0.008 -0.64 0.523
sex 1.374 0.346 3.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R> summary(fit.ex)
Call:
aftgee(formula = Surv(time, status) ~ age + sex, data = kidney,
AFTGEE Estimator
Estimate StdErr z.value p.value
(Intercept) 2.070 0.688 3.01 0.003 **
age -0.005 0.010 -0.54 0.589
sex 1.374 0.369 3.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient of sex is found to be significantly different from zero for both models. This
suggests that female patients tend to have longer recurrence times to infection. The efficiency
gain is expected on average but, unfortunately, this dataset does not show an efficiency gain.
In addition to the common marginal error distribution and common coefficient assumption,
we also consider the case where the marginal error distributions and regression coefficients
are different. In this case, we need to specify margin and construct the corresponding block
diagonal design matrix. After the block diagonal design matrix is constructed, least squares
estimators with both the independence working structure and the exchangeable working
structure are fitted. For each model, we use the smooth Gehan estimator as the initial value.
Call:
aftgee(formula = Surv(time, status) ~ age:margin + sex:margin +
margin - 1, data = kidney, id = id, margin = margin)
AFTGEE Estimator
Estimate StdErr z.value p.value
margin1 1.676 0.804 2.08 0.037 *
margin2 2.542 0.881 2.89 0.004 **
age:margin1 -0.013 0.011 -1.18 0.238
age:margin2 0.005 0.013 0.41 0.679
margin1:sex 1.744 0.439 3.97 <2e-16 ***
margin2:sex 0.895 0.451 1.99 0.047 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R> summary(fit2.ex)
Call:
aftgee(formula = Surv(time, status) ~ age:margin + sex:margin +
AFTGEE Estimator
Estimate StdErr z.value p.value
margin1 1.672 0.897 1.86 0.062 .
margin2 2.544 0.901 2.82 0.005 **
age:margin1 -0.013 0.012 -1.14 0.253
age:margin2 0.005 0.012 0.47 0.638
margin1:sex 1.744 0.467 3.73 <2e-16 ***
margin2:sex 0.887 0.473 1.88 0.060 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This model allows hypothesis testing of equal coefficients for each covariate across the two
margins from Wald-type tests with covariance matrix estimates. However, the coefficients of
age and sex are found to be not significantly different across the two margins, with p values
of 0.87 and 0.06, respectively, under the exchangeable structure. The coefficients of sex for
both margins and the two working structures are found to be significantly different from zero.
These results coincide with those from the common margin model. The efficiency gain from
the exchangeable structure in the margin-specific case is also absent. This is probably because
there is not much strength to borrow with distinctive marginal fits.
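A Wald-type test of equal coefficients across margins is a test of a linear contrast, which needs the estimated covariance between the two coefficients as well as their standard errors. The sketch below shows the computation; wald_contrast is a hypothetical helper and the numbers are made up for illustration, since the covariance estimates behind the reported p values are not shown in the output.

```r
# Wald test of a linear contrast of coefficients: H0: contrast' beta = 0.
wald_contrast <- function(beta, V, contrast) {
  est <- sum(contrast * beta)
  se <- sqrt(drop(t(contrast) %*% V %*% contrast))
  z <- est / se
  c(estimate = est, se = se, z = z, p.value = 2 * pnorm(-abs(z)))
}

# Made-up coefficients and covariance matrix for one covariate in two margins.
beta <- c(margin1 = 1, margin2 = 3)
V <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
out <- wald_contrast(beta, V, contrast = c(1, -1))
```

A positive covariance between the two margin-specific estimates shrinks the standard error of their difference, so ignoring it would make the test conservative.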
7. Conclusion
Package aftgee provides easy access to fitting semiparametric AFT models for possibly
clustered failure times with both rank-based approaches and the least squares approach.
For rank-based approaches, we implemented the induced smoothing method with Gehan’s
weight and extended it to allow arbitrary rank weight. The method is much faster than
those based on linear programming. Computationally efficient sandwich variance estimators
are provided for all the estimators, and additional sampling weight can be incorporated for
various sampling schemes. Our least squares approach uses rank-based estimators as initial
estimators in an iterative estimation procedure. For clustered data, we exploited within-
cluster dependence through working correlation structure in a GEE framework which enhances
efficiency when within-cluster dependence is strong. The implementation is fast and reliable,
making it possible for AFT models to be much more widely applied in routine survival analysis.
Our package can be expanded in several directions. The current version allows weights for
handling missing data in the rank-based approach; similar weights can also be made available
to our GEE approach. For the rank-based approach with clustered data, Wang and Fu
(2011) considered estimating equations that can be decomposed into between- and within-
cluster estimating equations for better efficiency. An implementation of this method would
be desirable. To account for measurement errors in covariates, package simexaft (He, Xiong,
and Yi 2012) implemented a simulation-extrapolation approach for AFT models. Such an
approach can be extended to the semiparametric AFT model. Furthermore, our methods can
also be extended to accommodate survival data other than with right censoring.
Acknowledgments
This research was partially supported by NSF DMS grant 120922.
References
Broström G (2014). eha: Event History Analysis. R package version 2.4-1, URL
http://CRAN.R-project.org/package=eha.
Brown BM, Wang YG (2005). “Standard Errors and Covariance Matrices for Smoothed Rank
Estimators.” Biometrika, 92(1), 149–158.
Brown BM, Wang YG (2007). “Induced Smoothing for Rank Regression with Censored Sur-
vival Times.” Statistics in Medicine, 26(4), 828–836.
Buckley J, James I (1979). “Linear Regression with Censored Data.” Biometrika, 66(3),
429–436.
Chiou SH, Kang S, Kim J, Yan J (2014a). “Marginal Semiparametric Multivariate Accelerated
Failure Time Model with Generalized Estimating Equations.” Lifetime Data Analysis,
20(4), 599–618.
Chiou SH, Kang S, Yan J (2013). “Rank-Based Estimating Equations with General Weight for
Accelerated Failure Time Models: An Induced Smoothing Approach.” Technical Report 39,
Department of Statistics, University of Connecticut.
Chiou SH, Kang S, Yan J (2014b). “Fast Accelerated Failure Time Modeling for Case-Cohort
Data.” Statistics and Computing, 24(4), 559–568.
Chiou SH, Kang S, Yan J (2014c). aftgee: Accelerated Failure Time Model with Generalized
Estimating Equations. R package version 1.0-0, URL
http://CRAN.R-project.org/package=aftgee.
Chiou SH, Kang S, Yan J (2014d). “Semiparametric Accelerated Failure Time Modeling
for Clustered Failure Times from Stratified Sampling.” Technical report.
doi:10.1080/01621459.2014.917978. Forthcoming.
Cox DR (1972). “Regression Models and Life-Tables.” Journal of the Royal Statistical Society
B, 34(2), 187–220.
Green DM, Breslow NE, Beckwith JB, Finklestein JZ, Grundy PE, Thomas PR, Kim T,
Shochat SJ, Haase GM, Ritchey ML, Kelalis PP, D’Angio GJ (1998). “Comparison between
Single-Dose and Divided-Dose Administration of Dactinomycin and Doxorubicin for
Patients with Wilms’ Tumor: A Report from the National Wilms’ Tumor Study Group.”
Journal of Clinical Oncology, 16(1), 237–245.
Halekoh U, Højsgaard S, Yan J (2006). “The R Package geepack for Generalized Estimating
Equations.” Journal of Statistical Software, 15(2), 1–11. URL
http://www.jstatsoft.org/v15/i02/.
Harrell Jr FE (2014). rms: Regression Modeling Strategies. R package version 4.2-1, URL
http://CRAN.R-project.org/package=rms.
Harrington DP, Fleming TR (1982). “A Class of Rank Test Procedures for Censored Survival
Data.” Biometrika, 69(3), 553–566.
He W, Xiong J, Yi GY (2012). “SIMEX R Package for Accelerated Failure Time Models with
Covariate Measurement Error.” Journal of Statistical Software, Code Snippets, 46(1), 1–14.
URL http://www.jstatsoft.org/v46/c01/.
Huang Y (2002). “Calibration Regression of Censored Lifetime Medical Cost.” Journal of the
American Statistical Association, 97(457), 318–327.
Jin Z, Huang L (2007). “lss: An S-PLUS/R Program for the Accelerated Failure Time Model to
Right Censored Data Based on Least-Squares Principle.” Computer Methods and Programs
in Biomedicine, 86(1), 45–50.
Jin Z, Lin DY, Wei LJ, Ying Z (2003). “Rank-Based Inference for the Accelerated Failure
Time Model.” Biometrika, 90(2), 341–353.
Jin Z, Lin DY, Wei LJ, Ying Z (2006a). “Rank Regression Analysis of Multivariate Failure
Time Data Based on Marginal Linear Models.” Scandinavian Journal of Statistics, 33(1),
1–23.
Jin Z, Lin DY, Ying Z (2006b). “On Least-Squares Regression with Censored Data.”
Biometrika, 93(1), 147–161.
Jin Z, Lin DY, Ying Z (2006c). “Rank Regression Analysis of Multivariate Failure Time Data
Based on Marginal Linear Models.” Scandinavian Journal of Statistics, 33(1), 1–23.
Johnson LM, Strawderman RL (2009). “Induced Smoothing for the Semiparametric Accelerated
Failure Time Model: Asymptotics and Extensions to Clustered Data.” Biometrika,
96(3), 577–590.
Kalbfleisch JD, Prentice RL (2002). The Statistical Analysis of Failure Time Data. John
Wiley & Sons.
Lai TL, Ying Z (1991). “Large Sample Theory of a Modified Buckley-James Estimator for
Regression Analysis with Censored Data.” The Annals of Statistics, 19(3), 1370–1402.
McGilchrist CA, Aisbett CW (1991). “Regression with Frailty in Survival Analysis.”
Biometrics, 47(2), 461–466.
Prentice RL (1978). “Linear Rank Tests with Right Censored Data.” Biometrika, 65(1),
167–180.
R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Ritov Y (1990). “Estimation in a Linear Regression Model with Censored Data.” The Annals
of Statistics, 18(1), 303–328.
Tsiatis AA (1990). “Estimating Regression Parameters Using Linear Rank Tests for Censored
Data.” The Annals of Statistics, 18(1), 354–372.
Varadhan R, Gilbert P (2009). “BB: An R Package for Solving a Large System of Nonlinear
Equations and for Optimizing a High-Dimensional Nonlinear Objective Function.” Journal
of Statistical Software, 32(4), 1–26. URL http://www.jstatsoft.org/v32/i04/.
Wang YG, Fu L (2011). “Rank Regression for Accelerated Failure Time Model with Clustered
and Censored Data.” Computational Statistics & Data Analysis, 55(7), 2334–2343.
Wei LJ (1992). “The Accelerated Failure Time Model: A Useful Alternative to the Cox
Regression Model in Survival Analysis.” Statistics in Medicine, 11(14–15), 1871–1879.
Ying Z (1993). “A Large Sample Study of Rank Estimation for Censored Regression Data.”
The Annals of Statistics, 21(1), 76–99.
Zeng D, Lin DY (2008). “Efficient Resampling Methods for Nonsmooth Estimating Functions.”
Biostatistics, 9(2), 355–363.
Affiliation:
Sy Han Chiou
Department of Mathematics and Statistics
University of Minnesota, Duluth
1117 University Drive,
Duluth, MN 55812-3000, United States of America
Telephone: 218/726-7032
Fax: 218/726-8399
E-mail: schiou@d.umn.edu
Sangwook Kang
Department of Applied Statistics
Yonsei University
50 Yonsei Road
Seodaemun-Gu, Seoul 120-749, Korea
E-mail: kanggi1@yonsei.ac.kr
Jun Yan
Department of Statistics
University of Connecticut
215 Glenbrook Road U-4120
Storrs, CT 06279-4120, United States of America
Telephone: 860/486-3414
Fax: 860/486-4113
E-mail: jun.yan@uconn.edu