Modeling Discrete Time-to-Event Data
Gerhard Tutz · Matthias Schmid
Springer Series in Statistics
Series editors
Peter Bickel, CA, USA
Peter Diggle, Lancaster, UK
Stephen E. Fienberg, Pittsburgh, PA, USA
Ursula Gather, Dortmund, Germany
Ingram Olkin, Stanford, CA, USA
Scott Zeger, Baltimore, MD, USA
More information about this series at http://www.springer.com/series/692
Gerhard Tutz, LMU Munich, Munich, Germany
Matthias Schmid, University of Bonn, Bonn, Germany

Preface
In recent years, a large variety of textbooks dealing with time-to-event analysis has
been published. Most of these books focus on the statistical analysis of observations
in continuous time. In practice, however, one often observes discrete event times—
either because of grouping effects or because event times are intrinsically measured
on a discrete scale. Statistical methodology for discrete event times has been mainly
presented in journal articles and a few book chapters. In this book we introduce basic concepts and give several extensions that allow one to model discrete time-to-event data adequately. In particular, modeling discrete time-to-event data profits strongly from the smoothing and regularization methods that have been developed in recent decades. The presented approaches include methods that allow one to find much more flexible models than in the early days of survival modeling.
The book is aimed at applied statisticians, students of statistics, and researchers from areas such as biometrics, the social sciences, and econometrics. The mathematical level is moderate; instead, we focus on basic concepts and data analysis.
Objectives
Special Topics
• All numerical results presented in this book were obtained by using the R
System for Statistical Computing (R Core Team 2015). Hence readers are able to
reproduce all the results by using freely available software.
• Various functions and tools for the analysis of discrete time-to-event data are
collected in the R package discSurv (Welchowski and Schmid 2015).
We are grateful to many colleagues for valuable discussions and suggestions, in
particular to Kaveh Bashiri, Moritz Berger, Jutta Gampe, Andreas Groll, Wolfgang
Hess, Stephanie Möst, Vito M. R. Muggeo, Margret Oelker, Hein Putter, Micha
Schneider and Steffen Unkel. Silke Janitza carefully read preliminary versions of
the book and helped to reduce the number of mistakes. We also thank Helmut
Küchenhoff for late but substantial suggestions.
Special thanks go to Thomas Welchowski for his excellent programming work
and to Pia Oberschmidt for assisting us in compiling the subject index.
Contents

1 Introduction  1
  1.1 Survival and Time-to-Event Data  1
  1.2 Continuous Versus Discrete Survival  4
  1.3 Overview  6
  1.4 Examples  7
2 The Life Table  15
  2.1 Life Table Estimates  15
    2.1.1 Distributional Aspects  19
    2.1.2 Smooth Life Table Estimators  20
    2.1.3 Heterogeneous Intervals  23
  2.2 Kaplan–Meier Estimator  25
  2.3 Life Tables in Demography  27
  2.4 Literature and Further Reading  31
  2.5 Software  31
  2.6 Exercises  32
3 Basic Regression Models  35
  3.1 The Discrete Hazard Function  35
  3.2 Parametric Regression Models  37
    3.2.1 Logistic Discrete Hazards: The Proportional Continuation Ratio Model  38
    3.2.2 Alternative Models  42
  3.3 Discrete and Continuous Hazards  48
    3.3.1 Concepts for Continuous Time  48
    3.3.2 The Proportional Hazards Model  50
  3.4 Estimation  51
    3.4.1 Standard Errors  58
  3.5 Time-Varying Covariates  59
  3.6 Continuous Versus Discrete Proportional Hazards  64
  3.7 Subject-Specific Interval Censoring  67
  3.8 Literature and Further Reading  70
  3.9 Software  70
  3.10 Exercises  71
4 Evaluation and Model Choice  73
  4.1 Relevance of Predictors: Tests  73
  4.2 Residuals and Goodness-of-Fit  77
    4.2.1 No Censoring  78
    4.2.2 Deviance in the Case of Censoring  80
    4.2.3 Martingale Residuals  81
  4.3 Measuring Predictive Performance  86
    4.3.1 Predictive Deviance and R2 Coefficients  86
    4.3.2 Prediction Error Curves  88
    4.3.3 Discrimination Measures  92
  4.4 Choice of Link Function and Flexible Links  96
    4.4.1 Families of Response Functions  97
    4.4.2 Nonparametric Estimation of Link Functions  101
  4.5 Literature and Further Reading  101
  4.6 Software  102
  4.7 Exercises  102
5 Nonparametric Modeling and Smooth Effects  105
  5.1 Smooth Baseline Hazard  105
    5.1.1 Estimation  109
    5.1.2 Smooth Life Table Estimates  112
  5.2 Additive Models  115
  5.3 Time-Varying Coefficients  118
    5.3.1 Penalty for Smooth Time-Varying Effects and Selection  119
    5.3.2 Time-Varying Effects and Additive Models  121
  5.4 Inclusion of Calendar Time  122
  5.5 Literature and Further Reading  124
  5.6 Software  125
  5.7 Exercises  125
6 Tree-Based Approaches  129
  6.1 Recursive Partitioning  130
  6.2 Recursive Partitioning Based on Covariate-Free Discrete Hazard Models  132
  6.3 Recursive Partitioning with Binary Outcome  133
  6.4 Ensemble Methods  141
    6.4.1 Bagging  141
    6.4.2 Random Forests  142
  6.5 Literature and Further Reading  144
  6.6 Software  144
  6.7 Exercises  144
References  225
Chapter 1
Introduction

1.1 Survival and Time-to-Event Data

Survival analysis consists of a body of methods that are known under different names. In biostatistics, where one often examines the time to death, survival analysis is the most commonly used name. In the social sciences one often speaks of event history data, and in technical applications of reliability methods. In all of these areas one wants
to model time-to-event data. The focus is on the modeling of the time it takes until a
specific event occurs. More generally, one has mutually exclusive states that can be
taken over time. For example, in the analysis of unemployment the states can refer
to unemployment, part-time employment, full-time employment or retirement. One
wants to model the course of an individual between these states over time. An event
occurs if an individual moves from one state to another. In a single spell analysis,
which is the most extensively treated case in this book, one considers just one time
period between two events, for example, how long it takes until unemployment ends.
Since one models the transition between states, one also uses the name transition
models. More general names for the type of data to be modeled, which do not refer to
a specific area of applications, are duration data, sojourn data or failure time data,
and the corresponding models are called duration models or failure time models. We
will most often use the term survival data and survival models but, depending on the
context, also use alternative names.
What makes survival data special? In a regression model, if one wants to
investigate how predictors determine a specific survival time T, time takes the
role of the response variable. Thus one has a response variable with a restricted support because T ≥ 0 has to hold. Nevertheless, by using some transformation, for example, log(T), such that all values can occur, one might be tempted to consider it as a common regression problem of the form log(T) = x^T β + ε, where x denotes the predictors, β is a vector of coefficients, and ε is a noise variable. Although models
like that can be used in simple cases they do not work in more general settings.
There are in particular two issues that are important in the modeling of survival
data, namely the modeling of the underlying dynamics of the process, which can be
captured in the form of a risk or hazard function, and censoring, which means that
in some cases the exact time is not available. In the following these two aspects are
briefly sketched.
In time-to-event data one often considers the so-called hazard function. In the case
of discrete time (e.g., if time is given in months), it has the simple form of a
conditional probability. Then the hazard or intensity function for a given vector of predictors x is defined by

λ(t|x) = P(T = t | T ≥ t, x),   t = 1, 2, ... .

It represents the conditional probability that the time period T ends at time t, given T ≥ t (and x). In survival analysis, for example, the hazard is the current risk
of dying at time t given the individual has survived until then. When considering
duration of unemployment it can be the probability that unemployment ends in
month t given the person was unemployed until then. In the latter case a positive
event ends the spell, and the hazard represents an opportunity rather than a risk.
But the important point is that the hazard function is a current (local on the time
scale) measure for the strength of the tendency to move from one state to the
other. It measures at each time point the tendency that a transition takes place.
In this sense it measures the underlying dynamics of survival. Typically one is
interested in studying the dynamics behind the time under investigation. The hazard
rate becomes even more important when covariates vary over time, for example, if
treatment in a clinical study is modified over time. Then a simple regression model
with (transformed) time as response will not work, because in a regression model
one has to consider covariates that are fixed at the beginning of the time under
investigation. However, time-varying covariates can be considered within the hazard
function framework by specifying
λ(t|x_t) = P(T = t | T ≥ t, x_t),   t = 1, 2, ...,

where x_t can include the available information on covariates up until time t. Then the hazard function measures the current risk given the covariate values up until time t, so that λ(t|x_t) represents the dynamics of the underlying process given any value
of the covariates at or before t. More formally the phenomenon to be modeled is
a stochastic process, that is, a collection of random variables indexed by time with
values that correspond to the states. The modeling as a stochastic process (more
concisely, a counting process) is extensively treated in Andersen et al. (1993) and
Fleming and Harrington (2011).
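To make the role of the discrete hazard concrete, the following small R sketch (our own illustration, not code from the book; all object names are made up) simulates event times from a prespecified hazard and recovers the hazard empirically as the proportion of events among the subjects still at risk:

## lambda[t] plays the role of P(T = t | T >= t)
set.seed(1)
q <- 10
lambda <- rep(0.15, q)                      # true discrete hazard, constant here
sim_one <- function(lambda) {               # run through the periods until an event occurs
  for (t in seq_along(lambda)) if (runif(1) < lambda[t]) return(t)
  length(lambda) + 1                        # no event within the q periods
}
T_obs <- replicate(5000, sim_one(lambda))
hat_lambda <- sapply(1:q, function(t) sum(T_obs == t) / sum(T_obs >= t))
round(hat_lambda, 3)                        # close to 0.15 in every period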
Even without using the counting process framework one way to handle survival
data is to specify a parametric or non-parametric model for the hazard function. If
the hazard given x or xt is specified, the behavior of the survival time is implicitly
defined. In fact, most of the models considered in this book are hazard rate models.
The second issue that makes survival data special is censoring. An observation
is called “censored” if its survival time has not been fully observed, that is, the
exact time within one state is not known. Censoring may, for example, occur if
observations in a study drop out early (so that they are lost before the event of
interest occurs), or if the event of an observation occurs after the study has been
finished. These situations are illustrated in Fig. 1.1. For observation 1 a spell started at t = 0, and the observation remained in the study until its event occurred. The
black dot indicates that the event of observation 1 has actually been observed.
In contrast, for observation 2 the spell started later than for observation 1, but it
was right censored because it dropped out of the study before its event occurred.
Similarly, for observation 4 the exact survival time is not known because it has
not occurred before the end of the study (indicated by the dashed line). It should
be noted that in Fig. 1.1 time refers to calendar time. What is actually modeled in
survival is the spell length, that is, the time from entry time until transition to another
state.
Fig. 1.1 Four observations for which spells start at different times. Exact survival time is observed
for observations 1 and 3 (black dots); for observations 2 and 4 the end of the spell is not observed
(circles), since observation 2 drops out early and observation 4 is still alive at the end of the study,
which is shown as a dashed line
1.2 Continuous Versus Discrete Survival

Most textbooks on survival analysis assume that the survival time is continuous
and the event to be modeled may occur at any particular time point. Several
books are available that treat continuous survival data extensively, for example,
Lawless (1982), Lancaster (1992), Kalbfleisch and Prentice (2002) and Klein and
Moeschberger (2003). What in these books is often considered very briefly, if at all,
is the case of discrete survival, which is the topic of the present book.
Although we imagine time as a continuum, in practice measurement of time is
always discrete. In particular in the medical sciences, economics and the social
sciences duration is usually measured, for example, in days, years or months.
Thus, even though the transition between states takes place at a specific time point,
the exact time points are usually not known. What is available are the data that
summarize what was happening during a specific interval. One can use positive
integers, 1, 2, 3, ... to denote time. More formally, continuous time is divided into intervals

[0, a_1), [a_1, a_2), ..., [a_{q−1}, a_q), [a_q, ∞).

In this book discrete event times are denoted by T, where T = t means that the event has occurred in the interval [a_{t−1}, a_t), also called “time period t”. One also speaks
of grouped survival data or interval censoring. In grouped survival data there are
typically some observations that have the same survival time. This phenomenon is
usually referred to as “ties”. In continuous time, ties ideally should not occur. In
fact, some models and estimation methods for continuous time even assume that
there are no ties in the data. Nevertheless, in practical applications with continuous
event times ties occur, which might be taken as a hint for underlying grouping. In
some areas, for example in demography, discrete data are quite natural. For example,
life tables traditionally use years as a measure for life span.
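A quick way to see the grouping mechanism at work is to discretize continuous times in R; the interval boundaries and event times below are purely hypothetical and only illustrate the definition T = t for T_c in [a_{t−1}, a_t):

a  <- c(0, 2, 4, 6, 8, 10)              # interval boundaries a_0, ..., a_q
Tc <- c(0.4, 2.0, 5.7, 9.99, 3.2)       # continuous event times (made up)
T_discrete <- findInterval(Tc, a)       # period index t with Tc in [a_{t-1}, a_t)
T_discrete                              # 1 2 3 5 2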
In some cases the underlying transition process is what Jenkins (2004) calls
intrinsically discrete. Consider, for example, time to pregnancy. A natural measure
for the time it takes a couple to conceive is the number of menstrual cycles, which
is a truly discrete response, see also Scheike and Jensen (1997) for an application of
discrete survival models to model fertility. Genuinely discrete measurements may
also result from surveys that are taken every month or year. For example, the IFO
Business Climate for Germany (http://www.cesifo-group.de/ifoHome), is based on
a survey in which firms from Germany are asked each month if “Production has
decreased”, “Production has remained unchanged”, or “Production has increased”.
Consequently, when investigating the factors that determine how long the answer
of a firm remains the same, one obtains time-discrete measurements. Also, when
considering the important problem of panel mortality, where one investigates for
how long a firm or an individual is in a panel, the response is the number of times
the questionnaire was sent back and is therefore genuinely discrete.
In summary, discrete time-to-event data occur as
• intrinsically discrete measurements, where the measurements represent natural
numbers, or
• grouped data, which represent events in underlying time intervals, and the
response refers to an interval.
The basic modeling approaches are the same for both types of data, in particular when the intervals in grouped data have the same length (e.g., if they represent months or years).
1.3 Overview
Whereas Chaps. 3 and 4 rely on the “classical” assumption of linear covariate effects of the form x^T β, Chap. 5 extends
this approach by considering smooth hazard and survival functions. Moreover,
the linear discrete hazard models of Chap. 3 are extended by smooth nonlinear
covariate effects (modeled, e.g., via kernel smoothers or splines). The underlying
methodology closely follows the Generalized Additive Model (GAM) framework
introduced by Hastie and Tibshirani (1990) for regression models with uncensored
response. Chapter 6 deals with discrete-time survival trees, which are a convenient
nonparametric approach to model discrete survival data in situations where the
GAM framework is not appropriate or where complex (non-additive) interaction
patterns between the covariates exist. Chapter 7 deals with techniques for variable
selection and model choice, which are important in situations where the set of
available covariates is large. In these situations one is often interested in fitting
a “sparse” survival model that contains a small subset of highly informative
predictors. To identify these predictors, statistical methods such as penalized
regression and gradient boosting (presented in Chap. 7) can be applied. In Chap. 8,
the focus is on competing risks models, which are needed to model discrete survival
data with several “competing” target events. For example, when modeling duration
of unemployment, these target events might be defined by full-time or part-time
jobs that end the unemployment spell. In Chap. 9 methods to deal with unobserved
heterogeneity are presented. They are important when some relevant covariates that
affect survival behavior have not been observed. This missing information may
cause severe artefacts when not accounted for in statistical analysis. Regression
techniques such as the class of “frailty models” covered in Chap. 9 can be used to
address this issue. The final chapter (Chap. 10) provides a brief overview of multiple
spell analysis.
At the end of each chapter, subsections on software and on references for further reading are found. Data sets and additional program code are contained in the R add-on package discSurv (Welchowski and Schmid 2015), which accompanies the book.
1.4 Examples
This section introduces some example data sets that serve as typical applications
of discrete-time survival modeling. The data will be analyzed in later chapters.
Additional data sets will also be described and analyzed in later chapters, see
page 236 for a complete list of examples considered in this book.
Example 1.1 Duration of Unemployment
This data set was originally analyzed by McCall (1996) and Cameron and Trivedi (2005).
It contains information about the duration of unemployment spells of n = 3343 U.S. citizens.
Data were derived from the January Current Population Survey’s Displaced Workers Supplements
(DWS) in the years 1986, 1988, 1990 and 1992. The events of interest are the re-employment
of a person in a part-time job or a full-time job. Unemployment duration was measured in
2-week intervals; observed event times ranged from one interval (2 weeks) to 28 intervals
(56 weeks).
Table 1.1 Explanatory variables that are used to model the time to re-employment (UnempDur
data set, as contained in the R add-on package Ecdat, Croissant 2015)
Variable Categories/unit Sample proportion/median (range)
Age Years 34 (20–61)
Filed unemployment claim? Yes/no 55 %/45 %
Eligible replacement rate 0.50 (0.07–2.06)
Eligible disregard rate 0.10 (0.00–1.02)
Log weekly earnings in lost job $ 5.68 (2.71–7.60)
Tenure in lost job Years 2 (0–40)
Unemployment spells were measured in 2-week intervals. The replacement rate is defined as the
weekly benefit amount divided by the amount of weekly earnings in the lost job (cf. Cameron and
Trivedi 2005, p. 604). The disregard rate is defined as the disregard (i.e., the amount up to which
recipients of unemployment insurance who accept part-time work can earn without any reduction
in unemployment benefits) divided by the weekly earnings in the lost job
In this book we analyze a publicly available version of the data that is part of the R add-on
package Ecdat (Croissant 2015). The list of explanatory variables that will be used for modeling
unemployment duration is presented in Table 1.1.
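For readers who want to follow along in R, the data can be loaded directly from the Ecdat package named in Table 1.1; the column name spell for the observed number of 2-week intervals is taken from the package documentation and should be checked against the installed version:

# install.packages("Ecdat")              # if the package is not yet installed
data("UnempDur", package = "Ecdat")
str(UnempDur)                            # covariates of Table 1.1 plus the spell length
table(UnempDur$spell)                    # distribution of observed spell lengths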
Table 1.3 Explanatory variables for the Copenhagen Stroke Study (data set cost in R package
pec)
Variable Categories/unit Sample proportion/median (range)
Sex Female/male 53 %/47 %
Hypertension Yes/no 33 %/67 %
Ischemic heart disease Yes/no 20 %/80 %
Previous stroke Yes/no 18 %/82 %
Other disabling disease Yes/no 16 %/84 %
Alcohol intake Yes/no 32 %/68 %
Diabetes Yes/no 14 %/86 %
Smoking status Yes/no 46 %/54 %
Atrial fibrillation Yes/no 13 %/87 %
Hemorrhage (stroke subtype) Yes/no 5 %/95 %
Age Years 75 (25–95)
Scandinavian stroke score 0 (worst)–58 (best) 46 (0–58)
Cholesterol level mmol/L 5.9 (1.5–11.6)
election (general). The dependent variable is defined by the transition process of a Congressman
from his first election up to one of the competing events general, primary, retirement or ambition.
The duration until the occurrence of one of the competing events is measured as terms served,
where a maximum of 16 terms can be reached. Career path data were collected on every member
of the House of Representatives from each freshman class elected from 1950 to 1976. Each
incumbent in the data set was tracked from the first re-election bid until the last term served in
office. A member initially elected in 1950 does not enter the risk set until the election cycle of
1952, as the members of the House of Representatives serve two-year terms. At each subsequent
election, a terminating event or re-election is observed. Once a terminating event is experienced,
the incumbent is no longer observed. The data set covers all election cycles from 1952 up to 1992.
A detailed description can be found in the book by Box-Steffensmeier and Jones (2004) and in
Jones (1994).
Originally, up to 20 terms occurred, however, only for very few Congressmen. Hence, due to
stability reasons, durations that exceed 15 terms have been aggregated. Furthermore, only complete
cases, that is, observations with no missing values for any covariate, have been incorporated in
the analysis. The data set considered in this book contains the career paths of 860 Congressmen.
Several covariates are available: The covariate age gives the incumbent’s age at each election cycle
and, to improve interpretability, is centered around 51 years. The incumbent’s margin of victory in
his or her previous election is collected in the variable priorm, which is centered around a margin
of 35. The covariate redistricting indicates if the incumbent’s district was substantially redistricted.
The covariate scandal captures if an incumbent was involved in an ethical or sexual misconduct
scandal or if the incumbent was under criminal investigation. The covariates openGub and openSen
indicate if there is an open gubernatorial and/or open senatorial seat available in the incumbent’s
state. The data set considers members of the Republican and the Democratic party. Whether the
Congressman is a member of the Republican party is gathered in the variable republican. Finally,
leadership describes if a member is in the House leadership and/or is a chair of a standing House
committee. With the exception of the predictor republican all covariates are time-varying, that is,
the covariate values per object may vary over the duration time. Further details and descriptive
statistics are presented in Tables 1.4 and 1.5.
Table 1.4 Description of variables for the Congressional Careers data; response (top) and
covariates (bottom)
Variable Description
surv Duration of time (measured in terms served) the incumbent has spent in Congress
prior to the election cycle
general Terminating event (∈ {0, 1}), coded 1 if the incumbent lost the general election and 0 if he won the general election
primary Terminating event (∈ {0, 1}), coded 1 if the incumbent lost the primary election and 0 if he won the primary election
retirement Terminating event (∈ {0, 1}), coded 1 if the incumbent retires from work and 0 otherwise
ambition Terminating event (∈ {0, 1}), coded 1 if the incumbent has a higher ambition than re-election and 0 otherwise. The baseline event for the binary indicator variables above (general=0 & primary=0 & retirement=0 & ambition=0) occurred when the incumbent ran for re-election and won it
age Incumbent's age measured in years at each election cycle
district Reciprocal of the number of Congressional districts in the state; measures the proportion of the state the incumbent's district encompasses
leader Indicator of prestige position (∈ {0, 1}), coded 1 if a member is in the House leadership and/or is a chair of a standing House committee and 0 otherwise
opengub Dummy variable (∈ {0, 1}), coded 1 if there is an open gubernatorial seat available in the incumbent's state and 0 if not
opensen Dummy variable (∈ {0, 1}), coded 1 if there is an open senatorial seat available in the incumbent's state and 0 if not
prespart Dummy variable (∈ {0, 1}), coded 1 if the incumbent's party affiliation is the same as the president's
priorm The incumbent's margin of victory in his or her previous election
redist Dummy variable (∈ {0, 1}), coded 1 if the incumbent's district was substantially redistricted and 0 otherwise
reform Coded 1 for the election cycles 1968, 1970 and 1972 (because there was a house reform) and 0 otherwise
republican Dummy variable (∈ {0, 1}), coded 1 if the incumbent is Republican and 0 otherwise
scandal Dummy variable (∈ {0, 1}), coded 1 if the incumbent was involved in an ethical or sexual misconduct scandal or was under criminal investigation and 0 if not
Table 1.5 Descriptive statistics of covariates for the Congressional Careers data
Variable Categories/unit Sample proportion/median (range)
age Years 51 (27–83)
district Proportion 7 % (0–100 %)
leader Prestige position 3%
Otherwise 97 %
opengub Gubernatorial seat available 20 %
Otherwise 80 %
opensen Senatorial seat available 13 %
Otherwise 87 %
prespart Same party as president 48 %
Otherwise 52 %
priorm Margin of victory in percent 29 % (0–100 %)
redist Incumbent’s district was substantially 2%
redistricted
Otherwise 98 %
reform Era of house reform 17 %
Otherwise 83 %
republican Republican 42 %
Otherwise 58 %
scandal Ethical, sexual misconduct scandal or
incumbent under criminal investigation 1%
Otherwise 99 %
Table 1.6 Description of covariates for the Pairfam data: response (top), control (middle) and
leisure variables (bottom)
Variable Description
child A dummy (∈ {0, 1}) indicating if the woman gave birth to her first child
within the regarded interval (or is currently pregnant)
age Age (in years) of the anchor woman
page Age (in years) of the male partner
sat6 Degree of life satisfaction (∈ {0, 1, ..., 10}) of the anchor woman
reldur Duration of the relationship (in months)
relstat Status of relationship (categorical with three levels: “living apart together”,
“cohabitation”, “married”)
yeduc Years of education (∈ [8, 20]) of the anchor woman
pyeduc Years of education (∈ [8, 20]) of the male partner
casprim Employment status of the anchor woman (categorical with five levels: “in
education”, “full-time employed”, “part-time employed”, “non-working”,
“other”)
pcasprim Employment status of the male partner (categorical—see casprim)
siblings Number of siblings of the anchor woman
hlt7 Average sleep length of the anchor woman (in hours)
leisure (Approx.) yearly leisure time of the anchor woman (in hours) spent for the
following five major categories: (1) bar/cafe/restaurant; (2) sport; (3)
internet/tv; (4) meet friends; (5) discotheque
leisure.partner Relative proportion (∈ [0, 1]) of leisure that the partner spends together with
the anchor woman
holiday Time of the anchor woman (in weeks) spent on holiday
Chapter 2
The Life Table
The life table is one of the oldest tools to analyze survival in homogeneous
populations. In classical applications in demography and actuarial science it was
used to estimate the probability of death for each age given an interval or year of
birth. In the following it serves as a model for all kinds of observed times. Time
can refer to survival time, waiting time, lifetime, duration of marriage, duration of
unemployment, or any other time-to-event. It is assumed that individuals or, more generally, statistical units are at risk of experiencing a single target event.
We will assume that time is recorded in discrete intervals. Consequently,
continuous time is subdivided into intervals [0, a_1), [a_1, a_2), ..., [a_{q−1}, a_q), [a_q, ∞). In many applications the first q intervals are equally spaced (representing years, months, or weeks) and time is measured in these units. Discrete time-to-event is denoted by T, where T = t means that the event has occurred in the interval [a_{t−1}, a_t), which is also called time period t.
In most empirical studies the occurrence of the event under consideration is not observed for all observations. Rather, for part of the observations it is only known that the time-to-event exceeds a certain value. This phenomenon is called censoring, more specifically right censoring, since it is known that the survival time is larger than or equal to the time the individual has been observed for the last time. Censoring is denoted by C, where C = t means that an observation has been censored in time period t.
2.1 Life Table Estimates

In life tables one assumes that the population is homogeneous. The probability that for a randomly drawn individual the target event occurs in period t is denoted by

π_t = P(T = t).

The discrete hazard, the conditional probability of an event in period t given that period t is reached, is

λ_t = P(T = t | T ≥ t),   t = 1, ..., q,

and the hazards determine

S(t) = P(T > t) = ∏_{s=1}^{t} (1 − λ_s),   (2.1)

which is the probability of surviving interval t, also called the survival function (see Exercise 2.1). The probability of an event in interval t is determined by

π_t = P(T = t) = λ_t ∏_{s=1}^{t−1} (1 − λ_s) = λ_t S(t − 1).
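The relations above are easy to verify numerically; the following R lines (our own illustration with an arbitrary, made-up hazard vector) compute S(t) and P(T = t) from the hazards:

lambda <- c(0.10, 0.15, 0.20, 0.25)          # lambda_1, ..., lambda_q (made up)
S  <- cumprod(1 - lambda)                    # S(t) = prod_{s<=t} (1 - lambda_s), Eq. (2.1)
pi <- lambda * c(1, head(S, -1))             # pi_t = lambda_t * S(t-1)
cbind(lambda, S, pi)
sum(pi) + S[length(S)]                       # P(T <= q) + P(T > q) = 1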
Thus the numbers of observations at risk in the first, second, or, more generally, the rth interval are given by

n_1 = n,   n_2 = n − d_1 − w_1,   ...,   n_r = n − Σ_{t=1}^{r−1} (d_t + w_t),

where d_t denotes the number of events and w_t the number of withdrawals (censored observations) in interval t. The hazard in interval t can be estimated by

λ̂_t = d_t / n_t.   (2.3)
If censoring occurs, that is, w_t > 0, the estimator is appropriate only if censoring occurs at the end of the interval. If censoring is assumed to occur at the beginning of the interval [a_{t−1}, a_t), a better choice is λ̂_t = d_t/(n_t − w_t).
The standard life table estimator takes the withdrawals into account by using

λ̂_t = d_t / (n_t − w_t/2).   (2.4)

Hence the estimator implicitly assumes that withdrawals are at risk during half the interval. It is a compromise between censoring at the beginning and the end of the interval. Based on (2.1) the probability of surviving beyond a_t can be estimated by

Ŝ(t) = ∏_{s=1}^{t} (1 − λ̂_s)   (2.5)

and the probability of an event in interval t by

P̂(T = t) = λ̂_t ∏_{i=1}^{t−1} (1 − λ̂_i)

for t = 1, ..., q.
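As a worked illustration, the following R lines reproduce the first three rows of Table 2.1 from the counts n_t, w_t and d_t; the numbers are taken from the table itself, and the code is only a minimal sketch:

n_t <- c(1669, 1341, 1156)                   # at risk at the start of periods 1-3
w_t <- c(131, 7, 12)                         # withdrawals (censored observations)
d_t <- c(197, 178, 159)                      # observed events
lambda_hat <- d_t / (n_t - w_t / 2)          # standard life table estimate (2.4)
S_hat <- cumprod(1 - lambda_hat)             # survival estimate (2.5)
p_hat <- lambda_hat * c(1, head(S_hat, -1))  # estimated P(T = t)
round(cbind(lambda_hat, S_hat, p_hat), 4)    # 0.1229 0.8771 0.1229; 0.1331 0.7604 0.1167; ...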
It should be noted that the life table estimator and the corresponding probabilities are computed for the q intervals [0, a_1), [a_1, a_2), ..., [a_{q−1}, a_q) only. The probability P(T = q + 1), which corresponds to the interval [a_q, ∞), is then implicitly given. This is because if a random variable T can take values {1, ..., q + 1}, the probabilities π_t = P(T = t) sum up to 1, that is, Σ_{t=1}^{q+1} π_t = 1. For the hazards one obtains λ_t = P(T = t | T ≥ t), in particular λ_{q+1} = 1 since P(T = q + 1 | T ≥ q + 1) = 1. If the life table estimator is used to estimate mortality, λ_{q+1} = 1 is quite natural because any human being is mortal. In other cases, for example, if life tables are used to model the duration of unemployment, λ_{q+1} = 1 does not mean that unemployment ends in the last category. But the last hazard λ_{q+1} can be used to compute P̂(T = q + 1) = λ̂_{q+1} ∏_{i=1}^{q} (1 − λ̂_i), which is the probability that the duration of unemployment is beyond a_q. Table 2.1 shows the estimates of the hazard in an unemployment study considered by Fahrmeir et al. (1996).
Table 2.1 Life table estimates obtained from the German unemployment data (Example 2.1)
[a_{t−1}, a_t) (months)   n_t    w_t   n_t − w_t/2   d_t   λ̂_t      P̂(T > t)   P̂(T = t)
[1, 2)     1669   131   1603.5   197   0.1229   0.8771   0.1229
[2, 3)     1341     7   1337.5   178   0.1331   0.7604   0.1167
[3, 4)     1156    12   1150.0   159   0.1383   0.6553   0.1051
[4, 5)      985     3    983.5    89   0.0905   0.5960   0.0593
[5, 6)      893     9    888.5    86   0.0968   0.5383   0.0577
[6, 7)      798     6    795.0    81   0.1019   0.4834   0.0548
[7, 8)      711     4    709.0    53   0.0748   0.4473   0.0361
[8, 9)      654     5    651.5    58   0.0890   0.4075   0.0398
[9, 10)     591     7    587.5    45   0.0766   0.3763   0.0312
[10, 11)    539     0    539.0    20   0.0371   0.3626   0.0140
[11, 12)    519     4    517.0    25   0.0484   0.3448   0.0175
[12, 13)    490    21    479.5   188   0.3921   0.2096   0.1352
[13, 14)    281     2    280.0    30   0.1071   0.1871   0.0225
[14, 15)    249     3    247.5    22   0.0889   0.1705   0.0166
[15, 16)    224     2    223.0    26   0.1166   0.1506   0.0199
[16, 17)    196     1    195.5    16   0.0818   0.1383   0.0123
[17, 18)    179     2    178.0    15   0.0843   0.1267   0.0117
[18, 19)    162     2    161.0    18   0.1118   0.1125   0.0142
[19, 20)    142     3    140.5     7   0.0498   0.1069   0.0056
[20, 21)    132     3    130.5     8   0.0613   0.1003   0.0066
[21, 22)    121     6    118.0    12   0.1017   0.0901   0.0102
[22, 23)    103     1    102.5     6   0.0585   0.0849   0.0053
[23, 24)     96     0     96.0     4   0.0417   0.0813   0.0035
[24, 25)     92     8     88.0    16   0.1818   0.0665   0.0148
[25, 26)     68     0     68.0     3   0.0441   0.0636   0.0029
[26, 27)     65     1     64.5     2   0.0310   0.0616   0.0020
[27, 28)     62     1     61.5     3   0.0488   0.0586   0.0030
[28, 29)     58     1     57.5     4   0.0696   0.0545   0.0041
[29, 30)     53     0     53.0     0   0.0000   0.0545   0.0000
[30, 31)     53     2     52.0     3   0.0577   0.0514   0.0032
[31, 32)     48     2     47.0     3   0.0638   0.0481   0.0033
[32, 33)     43     0     43.0     1   0.0233   0.0470   0.0011
[33, 34)     42     2     41.0     2   0.0488   0.0447   0.0023
[34, 35)     38     0     38.0     3   0.0789   0.0412   0.0035
[35, 36)     35     0     35.0     0   0.0000   0.0412   0.0000
The study starts with 1669 unemployed persons; 36 were still unemployed after 36 months. It is discussed in detail in Example 2.1.
2.1.1 Distributional Aspects

In the simple case without censoring, the survival function can be estimated by

Ŝ(t) = (n − d_1 − ... − d_t) / n,

which is the number of individuals surviving beyond a_t divided by the sample size. Since n − d_1 − ... − d_t ~ B(n, S(t)), expectation and variance of Ŝ(t) are given by

E(Ŝ(t)) = S(t),
var(Ŝ(t)) = S(t)(1 − S(t))/n,
cov(Ŝ(t_1), Ŝ(t_2)) = (1 − S(t_1)) S(t_2) / n,   t_1 ≤ t_2.
For the hazard estimates in the case without censoring one obtains

E(λ̂_t) = λ(t),   var(λ̂_t) = λ(t)(1 − λ(t)) E(1/n_t),

where λ(t) denotes the true hazard. Since the frequencies (d_1/n, w_1/n, ..., d_q/n, w_q/n) are asymptotically normally distributed, the standard life table estimate λ̂_t = d_t/(n_t − w_t/2) is also asymptotically normal with expectation

λ_t* = π_t^d / (π_t^0 − π_t^w/2),

where π_t^d = E(d_t/n), π_t^w = E(w_t/n) and π_t^0 = E(n_t/n). However, only in the case without withdrawals does it hold that λ_t* = λ(t). Thus, the standard life table estimator is not a consistent estimator. Concerning the asymptotic variance of λ̂_t in the random censorship model, Lawless (1982) derives

vâr(λ̂_t) = (1/n) (λ_t* − (λ_t*)²) (π_t^0 − π_t^w/4) / (π_t^0 − π_t^w/2)².
The simpler estimate

vâr(λ̂_t) = (λ̂_t − λ̂_t²) / (n_t − w_t/2)

will overestimate var(λ̂_t) if λ_t* and λ(t) are not too different. For Ŝ(t), Lawless (1982) derives

var(Ŝ(t)) ≈ S*(t)² Σ_{i=1}^{t} var(1 − λ̂_i) / (1 − λ_i*)²

for large sample sizes, where S*(t) = ∏_{i=1}^{t} (1 − λ_i*). Approximating var(1 − λ̂_i) by λ̂_i(1 − λ̂_i)/(n_i − w_i/2) and S*(t), λ_i* by Ŝ(t), λ̂_i, respectively, yields Greenwood's (1926) often used formula

vâr(Ŝ(t)) = Ŝ(t)² Σ_{i=1}^{t} λ̂_i / ((1 − λ̂_i)(n_i − w_i/2)).
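Greenwood's formula is straightforward to evaluate; the R sketch below (our own minimal illustration) does so for the first three periods of Table 2.1 and also forms the ±1.96 standard error bands of the kind shown later in Fig. 2.2:

n_t <- c(1669, 1341, 1156); w_t <- c(131, 7, 12); d_t <- c(197, 178, 159)
lambda_hat <- d_t / (n_t - w_t / 2)
S_hat <- cumprod(1 - lambda_hat)
green_var <- S_hat^2 * cumsum(lambda_hat / ((1 - lambda_hat) * (n_t - w_t / 2)))
se <- sqrt(green_var)
round(cbind(S_hat, lower = S_hat - 1.96 * se, upper = S_hat + 1.96 * se), 4)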
2.1.2 Smooth Life Table Estimators

The life table estimate of the hazards, which in the simplest case without withdrawals has the form λ̂_t = d_t/n_t, can be very volatile. In particular for large t, typically only few observations are at risk and estimates become unstable. The result is jumps in the estimated function when the hazards λ̂_t are plotted against time. This problem can be addressed by applying smoothing techniques to λ̂_t. For example, one might use smoothing techniques like averaging over neighborhood estimates or spline fitting. Note that one should be careful to take the local sample size into account, which for the estimate in interval t is n_t = n_{t−1} − d_{t−1} − w_{t−1}.
A simple smoothed estimate of the hazard at time t is the weighted average

λ̃_t = Σ_{s=1}^{q} λ̂_s n_s w*(s, t),

where w*(s, t) is a weight function that gives observations that are close to t large weights and observations that are far away from t small weights. It should be standardized such that Σ_s n_s w*(s, t) = 1. Often one uses weight functions that are based on kernels. A kernel is a continuous symmetric function that fulfills ∫ K(u) du = 1, where the corresponding kernel weight is given by

w*(s, t) ∝ K((s − t)/γ),

and where γ is a tuning parameter (the bandwidth). Candidates for the choice of K are the Gaussian kernel, which uses the standard Gaussian density for K, and kernels with finite support like the Epanechnikov kernel K(u) = (3/4)(1 − u²) for |u| ≤ 1 and zero otherwise. If γ is small, the weights decrease fast with increasing distance |s − t|, and estimates are based solely on observations from the close neighborhood of t. For large γ, estimates from a wide range of neighbors are included. More refined local smoothers are local polynomial regression smoothers, see, for example, Hastie and Loader (1993) and Loader (1999). In Chap. 5 more details are given, and an alternative smooth estimator that is based on an expansion in basis functions (“penalized regression spline”) is considered in more detail.
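A minimal R implementation of such a kernel-weighted smoother might look as follows (illustrative only; the bandwidth gamma and the simulated inputs are arbitrary choices, not taken from the book):

epa <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)   # Epanechnikov kernel
smooth_hazard <- function(lambda_hat, n_t, gamma = 2) {
  q <- length(lambda_hat)
  sapply(1:q, function(t) {
    w <- epa((1:q - t) / gamma) * n_t      # kernel weight times local sample size
    sum(w / sum(w) * lambda_hat)           # standardized weighted average
  })
}
set.seed(2)
lambda_hat <- pmin(pmax(0.1 + rnorm(20, sd = 0.04), 0), 1)    # noisy raw estimates
n_t <- round(seq(1000, 100, length.out = 20))                 # shrinking risk sets
round(smooth_hazard(lambda_hat, n_t), 3)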
Example 2.1 German Unemployment Data (Socio-Economic Panel)
Table 2.1 shows the life table for the socio-economic panel data considered in Fahrmeir
et al. (1996). Figure 2.1 visualizes the corresponding estimated hazard rates of transition from
unemployment to employment. Inspection of the estimate shows that the hazard of the transition to
employment has a peak at approximately 13 months and another peak at approximately 25 months
after beginning of the unemployment spell. These effects should be seen in connection with the
collection of data: In the socio-economic panel people are asked in intervals of 1 year, and the
unemployment status is established retrospectively. Therefore, it is likely that the effects are due to
heaping, which means that respondents tend to remember that they were unemployed for a year or
two when the exact times were close to 1 or 2 years. Because the raw estimates of the hazard rates
are relatively volatile, Fig. 2.1 also shows a smoothed hazard rate (gray line) that was obtained by
applying a penalized regression spline estimator (as implemented in the R package mgcv, Wood
2015). It is seen that smoothed estimation ameliorates the heaping bias in the estimates. Short-term
unemployment is captured by the peak of the smoothed hazard rate at 4 months after beginning
of the unemployment spell. Note that the spline estimate has been weighted by the observations at
risk (given by nt wt =2), thereby imposing large weights on time points with a large number of
individuals at risk. In addition to presenting point estimates, Fig. 2.1 also contains 95 % confidence
bands for the estimated hazards. The bands were obtained by addition and subtraction of 1.96 times
the estimated standard deviation of the hazard estimates.
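A rough sketch of such a weighted penalized regression spline fit with mgcv is given below; the simulated inputs and the simple Gaussian working model are our own choices and do not reproduce the book's exact computations:

library(mgcv)
set.seed(3)
q <- 36
true_hazard <- 0.12 * exp(-0.05 * (1:q))                      # made-up smooth hazard
atrisk <- round(1600 * cumprod(c(1, 1 - true_hazard[-q])))    # rough risk sets n_t - w_t/2
events <- rbinom(q, size = atrisk, prob = true_hazard)
lambda_hat <- events / atrisk                                 # raw, volatile estimates
dat <- data.frame(t = 1:q, lambda_hat = lambda_hat, atrisk = atrisk)
fit <- gam(lambda_hat ~ s(t), weights = atrisk, data = dat)   # weights = observations at risk
dat$smoothed <- pmax(predict(fit, newdata = dat), 0)          # smoothed hazard curve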
Fig. 2.1 Hazard estimates obtained from the German unemployment data (Example 2.1)
Figure 2.2 shows the estimated survival function obtained from the German unemployment
data. For each time point t, this function corresponds to the estimated probability of still being
unemployed after t months. It is seen that heaping effects in the estimated hazards (for example,
at 13 months after beginning of the unemployment spell) lead to a relatively large decrease in the
estimated survival function at the respective time points. The 95 % confidence bands shown in the
upper panel of Fig. 2.2 were obtained by addition and subtraction of 1.96 times the estimated
standard deviation of the estimated survival function. The lower panel of Fig. 2.2 shows the
smoothed estimated survival function that is based on the smoothed hazard estimates presented
in Fig. 2.1. Similar to the hazard estimates in Fig. 2.1, heaping effects in the estimated survival
function have become much smaller in size.
Fig. 2.2 Estimates of the survival function obtained from the German unemployment data
(Example 2.1). Life table estimates are shown in the upper panel, smoothed estimates are shown
in the lower panel
2.1.3 Heterogeneous Intervals

The life table estimates considered so far determine the risk of survival for fixed t, with t denoting the tth interval. This strategy works fine as long as the intervals are equally spaced, for example when t is measured in months or years. If, however, the intervals [0, a_1), [a_1, a_2), ..., [a_{q−1}, a_q) have varying length, some care should be taken when presenting and interpreting estimates.
Fig. 2.3 Raw and smoothed hazard estimates obtained from the U.S. unemployment data (Exam-
ple 2.2). Smoothed estimates were obtained by applying a penalized regression spline estimator
(as implemented in the R package mgcv)
Fig. 2.4 Raw and smoothed estimated survival functions obtained from the U.S. unemployment
data (Example 2.2). Smoothed estimates were obtained by applying a penalized regression spline
estimator (as implemented in the R package mgcv). It is seen that people who filed a UI claim have
lower probabilities of getting re-employed than people who did not file a UI claim
To derive life table estimators in situations where time intervals are of varying length, we assume that there is an underlying continuous time (denoted by T_c). It should be noted that T_c itself is not observed. What is observed are data in intervals, that means, T = t if T_c ∈ [a_{t−1}, a_t). In this case one also refers to interval-censored data. In particular, discrete survival transforms into continuous time by

S(t) = P(T > t) = P(T_c > a_t) = S_c(a_t),

where S_c(a_t) denotes survival on the continuous time scale. The life table estimator provides an estimate of P̂(T = t) in the tth interval, which is also an estimator of the probability P(T_c ∈ [a_{t−1}, a_t)). If one assumes that the density is constant over fixed intervals, one obtains the estimator for the density of the continuous time (denoted by f(x)) by

f̂(x) = (Ŝ(t − 1) − Ŝ(t)) / (a_t − a_{t−1})   for x ∈ [a_{t−1}, a_t).
The corresponding estimate of the cumulative distribution function is

F̂(x) := P̂(T_c ≤ x) = Σ_{s=1}^{t−1} P̂(T = s) + f̂(x)(x − a_{t−1})   for x ∈ [a_{t−1}, a_t),

and the resulting estimate of the survival function,

P̂(T_c > x) = 1 − F̂(x),

is again a polygon that connects estimates at the boundaries of the intervals linearly.
2.2 Kaplan–Meier Estimator

In the following an estimator is considered that is strongly related to the life table estimator, although it assumes that time is observed on a continuous scale. It is also used in applications to compare estimates based on grouped data to estimates that use the exact lifetime when available.
Let t_(1) < ... < t_(m) denote the observed continuous lifetimes, which, for simplicity, are assumed to be distinct. Thus the data contain no ties. Based on the observations one constructs the intervals [t_(j−1), t_(j)), j = 1, 2, ..., with t_(0) = 0. The conditional probability of surviving the jth interval is estimated by

p̂_j = 1 − λ̂_j = 1 − 1/|R_j|,   j = 2, 3, ...,

where |R_j| denotes the number of individuals at risk (“risk set”) in the interval [t_(j−1), t_(j)).
The Kaplan–Meier estimator (Kaplan and Meier 1958), also called product-limit estimator, is defined in analogy to the survival function in life table estimates (Eq. (2.1)) by

Ŝ(t) = 1   if t < t_(1),
Ŝ(t) = ∏_{j: t_(j) < t} (1 − 1/|R_j|)   if t ≥ t_(1).
It is a step function that decreases by 1/m just after each observed lifetime. If ties occur, the factor 1 − 1/|R_j| is replaced by 1 − d_j/|R_j|, where d_j is the number of observations with value t_(j). Censored observations that do not coincide with times of deaths automatically reduce the risk set in the next interval. If censored observations coincide with observed values t_(j), it is customary to assume that the censoring time is infinitesimally larger than the observed lifetime to obtain the size of the risk set.
The Kaplan–Meier estimator can also be derived as a maximum likelihood estimator, see Lawless (1982) or Kalbfleisch and Prentice (2002). Under the random censorship model, Breslow and Crowley (1974) showed that the random function √n (Ŝ(t) − S(t)) converges weakly to a mean zero Gaussian process. An asymptotically motivated estimator of the variance (Kaplan and Meier 1958) is given by

vâr(Ŝ(t)) = Ŝ(t)² Σ_{j: t_(j) < t} d_j / (|R_j| (|R_j| − d_j)).
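In practice the Kaplan–Meier estimate and its Greenwood-type standard errors are obtained with the survival package; the small example below uses made-up times and censoring indicators and is only meant as a sketch:

library(survival)
time   <- c(2.1, 3.4, 3.9, 5.0, 6.2, 7.7, 8.1, 9.3)    # observed times (made up)
status <- c(1,   1,   0,   1,   0,   1,   1,   0)       # 1 = event, 0 = censored
km <- survfit(Surv(time, status) ~ 1)
summary(km)        # step-function estimate with standard errors and confidence limits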
Fig. 2.5 Kaplan–Meier and life table estimates obtained from the Copenhagen Stroke Study. The
black lines correspond to life table estimates based on 12-month time intervals. Gray step functions
correspond to Kaplan–Meier estimates. The vertical dashes indicate censored survival times
2.3 Life Tables in Demography

Life tables in demography are a special presentation of discrete survival data that
involve some specific notation. In demography there are in particular two forms
of life tables: the cohort life table and the current life table. The cohort life table
reflects the mortality of a specific group of individuals, the so-called cohort. It is
constructed for this group of people from birth to death of the last member of the
group. The methods considered in Sect. 2.1 refer to this type of data; one assumes
that the population is homogeneous and that a random sample from the population is
available. In contrast, current life tables contain the mortality of a population during
a current time interval, for example, a year. They represent a cross-sectional view on
survival. The observed mortality aims at measuring survival that is to be expected
if the same mortality pattern that holds for the current year holds throughout life.
Complete life tables are computed for each year whereas so-called abridged life
tables deal with greater time intervals. We consider the principal structure of life
tables following closely Chiang (1984). As an example we consider the life tables
given in Tables 2.2 and 2.3, which refer to Swedish citizens that were born in 1920.
Example 2.4 Swedish Life Table Data
The data set was extracted from the Human Mortality Database at www.mortality.org. It contains
the mortality numbers of a virtual population of 100,000 Swedish men and 100,000 Swedish
woman born in 1920. Excerpts of the data are presented in Tables 2.2 and 2.3.
Table 2.2 Excerpt from the Swedish life table data (male population)
Age   m_x       q_x       a_x    l_x       d_x    L_x      T_x         ê_x
0     0.07886   0.07450   0.26   100,000   7450   94,463   6,568,254   65.68
1     0.01259   0.01251   0.48    92,550   1158   91,951   6,473,792   69.95
2     0.00495   0.00494   0.49    91,393    451   91,160   6,381,840   69.83
3     0.00346   0.00346   0.51    90,941    314   90,786   6,290,680   69.17
4     0.00310   0.00309   0.49    90,627    280   90,484   6,199,894   68.41
5     0.00225   0.00224   0.48    90,346    203   90,240   6,109,410   67.62
...
104   0.56391   0.43103   0.45        32     14       24          47    1.45
105   0.90000   0.60000   0.44        18     11       12          22    1.21
106   0.46875   0.38462   0.53         7      3        6          10    1.36
107   0.70588   0.50000   0.42         4      2        3           4    0.88
108   3.00000   1.00000   0.33         2      2        1           1    0.33
Table 2.3 Excerpt from the Swedish life table data (female population)
Age   m_x       q_x       a_x    l_x       d_x    L_x      T_x         ê_x
0     0.05943   0.05680   0.22   100,000   5680   95,567   7,269,901   72.70
1     0.01069   0.01063   0.48    94,320   1003   93,796   7,174,334   76.06
2     0.00474   0.00473   0.47    93,318    442   93,085   7,080,538   75.88
3     0.00265   0.00265   0.51    92,876    246   92,755   6,987,453   75.23
4     0.00250   0.00249   0.48    92,630    231   92,510   6,894,698   74.43
5     0.00211   0.00211   0.49    92,399    195   92,300   6,802,188   73.62
...
104   0.54815   0.42775   0.49       200     86      156         328    1.64
105   0.66443   0.49500   0.48       115     57       85         172    1.50
106   0.73585   0.53061   0.47        58     31       42          86    1.49
107   0.64706   0.48889   0.50        27     13       21          44    1.64
108   0.67925   0.52174   0.56        14      7       11          24    1.72
109   0.33333   0.28571   0.50         7      2        6          13    2.00
.     0.62500   1.00000   1.60         5      5        8           8    1.60
In the following we consider the discrete table for intervals of length 1, [x, x + 1). The first interval is [0, 1), the second [1, 2), etc. The last interval, [q, ∞), is open-ended. The entries in the table are:
l_x: Number alive at age x. The first number, l_0, is an arbitrary figure called the radix. Typically one assigns the value 100,000. Thus the successive figures represent the number of survivors at age x from a group of size l_0. Slightly misleadingly, it is sometimes called the “probability” of survival from birth to age x (multiplied by 100,000).
d_x: Number of deaths within the age interval [x, x + 1). Of course it is not the number of actually observed deaths but has meaning only in conjunction with the radix, 100,000.
q_x: Proportion of those alive at age x dying in the interval [x, x + 1). It represents an estimate of the probability that an individual still alive at the exact age x will die in the interval [x, x + 1); it is given by q_x = d_x/l_x.
m_x: Mortality or death rate at age x (see below).
a_x: Fraction of the interval [x, x + 1) lived by persons who die in the interval. If a_x = 0.5, it is assumed that they live on average during half of the interval.
L_x: Total number of years lived within the age interval [x, x + 1) by all persons. With the fraction a_x it is given by L_x = (l_x − d_x) + a_x d_x. As a sum over the years lived by all the persons it represents the so-called person-years.
T_x: Total number of years lived beyond age x, given by T_x = L_x + L_{x+1} + ... + L_q.
ê_x: Expected number of years yet to be lived by a person of age x, given by ê_x = T_x/l_x.
Typically life tables are given for x ∈ {0, 1, ...}. Thus l_0 refers to the number of persons with which the life table starts, l_1 is the number of persons who reach age 1, etc. The number of deaths is the difference for successive ages, d_x = l_x − l_{x+1}, and the proportion of deaths within the age interval [x, x + 1) is given by q_x = d_x/l_x. It can be seen as a probability estimate and is a pure number, as all probabilities are.
Life table analysis in demography customarily also uses the age-specific death rate,
which should be distinguished from the probability of dying in the interval. Let us
consider more generally the interval [x, x + Δx). Then the age-specific death rate or mortality rate is defined by

m_x = (number of deaths in [x, x + Δx)) / (total time lived in [x, x + Δx)).

It can be seen as the proportion of the number of occurrences and the exposure time. Formally it is given by

m_x = d_x / (Δx (l_x − d_x) + Δx a_x d_x) = q_x / (Δx − Δx (1 − a_x) q_x),

where a_x denotes the fraction of the interval [x, x + Δx) lived by persons who die in the interval, d_x is the corresponding number of deaths, and q_x = d_x/l_x. The death rate is a rate, and its units are deaths per person-year. It is considered as a measure of mortality which should not be confused with the probability of death in an interval. In general it is not the same as q_x. Formally the relation is

q_x = m_x Δx / (1 + (1 − a_x) m_x Δx).
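The conversion between death rates and death probabilities can be checked directly against the first row of Table 2.2; the following R lines are a small worked example for intervals of length one (only a sketch, using numbers from the table):

m0 <- 0.07886; a0 <- 0.26                 # death rate and fraction lived, age 0, Table 2.2
q0 <- m0 / (1 + (1 - a0) * m0)            # q_x = m_x / (1 + (1 - a_x) m_x) for Delta x = 1
round(q0, 5)                              # 0.07451, matching q_0 = 0.07450 up to rounding
6568254 / 100000                          # e_0 = T_0 / l_0 = 65.68 years, as in the table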
Let us consider the case where the length of the interval is one. Then the estimate
of the probability qx D dx =lx simply uses the number of deaths in the interval and
the number of individuals entering the interval. It makes no assumptions on the time
lived during the interval. In contrast, the death rate given by mx D dx =.lx dx Cax dx /
includes the fraction ax . If one assumes that on average the persons dying in interval
Œx; x C 1/ live through half of the interval, that is, the fraction ax equals 0:5, one
obtains mx D dx =.lx dx =2/, which has the same form as the standard life table
estimator (2.4) considered in Sect. 2.1. Apart from notation, the only difference is
that withdrawals are replaced by deaths. But withdrawals refer to all individuals
that get lost, for example, during a treatment study. The correction by dx =2 in the
denominator aims at correcting for the distribution of deaths within the interval.
In general, the correction factors in life tables can be determined empirically if exact lifetimes (not rounded to years) are available. It has been shown that, in particular for the first entries, the fractions differ from 0.5. For example, Chiang (1972) showed that for the 1960 California mortality data the values $a_0 = 0.09$, $a_1 = 0.43$, $a_2 = 0.45$, $a_3 = 0.47$, $a_4 = 0.49$ are more appropriate to account for the large proportion of infant deaths.
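To make the relation between the rate $m_x$ and the probability $q_x$ concrete, the following R sketch computes both quantities for single-year intervals. The survivor and death counts are invented for illustration, and Chiang's fractions for the first ages are used for $a_x$.

## Hypothetical survivor and death counts for single-year intervals (Delta_x = 1)
lx <- c(100000, 98500, 98300, 98150)   # persons alive at exact age x = 0, 1, 2, 3
dx <- c(1500, 200, 150, 120)           # deaths in [x, x+1)
ax <- c(0.09, 0.43, 0.45, 0.47)        # fraction of the interval lived by those who die

qx <- dx / lx                          # probability of dying in the interval
Lx <- (lx - dx) + ax * dx              # person-years lived in the interval
mx <- dx / Lx                          # death rate (deaths per person-year)

## Converting the rate back into a probability (interval length 1):
qx_from_mx <- mx / (1 + (1 - ax) * mx)
all.equal(qx, qx_from_mx)              # TRUE: both formulas agree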
For an extensive discussion of all forms of life tables, see Chiang (1984). Formal
concepts are also given by Chiang (1972).
It is of particular interest to study the dynamics of mortality over time. For example,
life expectancy in the USA rose from 47 to 75 years from 1900 to 1988 (Lee and
Carter 1992). Especially for the social security systems the decline of mortality
over time raises problems. In their seminal paper, Lee and Carter (1992) proposed
a bilinear model for age-specific mortality that has been widely used and extended.
For the death rate at age $x$ in year $t$, in the following denoted by $m_x(t)$, they consider the model
$$\log m_x(t) = \alpha_x + \beta_x \kappa_t + \varepsilon_{x,t}.$$
The parameters $\alpha_x$ and $\beta_x$ are age-specific while $\kappa_t$ is time-varying. The error term $\varepsilon_{x,t}$ with mean 0 and variance $\sigma_\varepsilon^2$ reflects particular age-specific historical influences. Following Brouhns et al. (2002) one can interpret the parameters in the following way:
$\alpha_x$ represents the basic effect of age; $\exp(\alpha_x)$ determines the general shape of the mortality pattern.
$\beta_x$ represents the age-specific change of mortality. It indicates the sensitivity of mortality at age $x$ to variations in the time effect $\kappa_t$.
$\kappa_t$ represents the variation over time. The actual mortality is modulated by the age-specific response $\beta_x$.
For a fixed number of time points and ages the constraints $\sum_x \beta_x = 0$ and $\sum_t \kappa_t = 0$ are used to obtain identifiability. The model is typically fit to historical data and the resulting estimates are used to model and forecast future mortality as a stochastic time series.
Lee and Carter (1992) estimated the parameters by using least squares methods.
However, special fitting procedures are necessary; one cannot use ordinary least
squares methods here because there are no regressors, only the bilinear form that
contains parameters. Various extensions and modifications have been proposed, for
example, Brouhns et al. (2002) proposed a Poisson regression approach. Currie et al.
(2004) obtain smooth mortality surfaces by penalized estimation methods using
splines.
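As a rough illustration of the fitting step, the classical least squares solution can be obtained from a singular value decomposition of the centered log death rates. The following R sketch does this for an artificial matrix of rates; the data, the normalization (taken from Lee and Carter 1992: $\beta_x$ summing to one, $\kappa_t$ summing to zero), and all names are illustrative choices, not the procedure behind any results reported in this book.

## Minimal Lee-Carter estimation via singular value decomposition (illustrative only).
## M is a matrix of death rates m_x(t): rows = ages, columns = calendar years.
set.seed(1)
ages  <- 0:90
years <- 1950:2000
M <- outer(exp(-4 + 0.09 * ages), exp(-0.01 * (years - 1950))) *
     exp(matrix(rnorm(length(ages) * length(years), sd = 0.05),
                nrow = length(ages)))           # artificial rates

logM  <- log(M)
alpha <- rowMeans(logM)                          # age effect alpha_x
Z     <- sweep(logM, 1, alpha)                   # centered log rates
sv    <- svd(Z)
beta  <- sv$u[, 1]                               # age-specific sensitivity beta_x
kappa <- sv$d[1] * sv$v[, 1]                     # time effect kappa_t

## Normalization as in Lee and Carter (1992): sum(beta) = 1, sum(kappa) = 0
kappa <- kappa * sum(beta); beta <- beta / sum(beta)
alpha <- alpha + beta * mean(kappa); kappa <- kappa - mean(kappa)

## Fitted log rates: alpha_x + beta_x * kappa_t
fit <- alpha + outer(beta, kappa)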
Life Table Estimators. Properties of estimates of cohort life tables are extensively
discussed in Lawless (1982). Current life tables and concepts used in demography
are found in Chiang (1984) and Preston et al. (2000). Smoothing procedures for
life table estimates by localizing techniques were considered by Tutz and Pritscher
(1996) and Patil and Bagkavos (2012).
Lee Carter Models. The basic approach is outlined in Lee and Carter (1992).
Lee (2000) gives an overview on extensions and various applications. Smoothing
procedures were proposed by Currie et al. (2004) and Delwarde et al. (2007).
2.5 Software
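A minimal base R sketch of the life table computations of Sect. 2.1 is given below. The data vectors are artificial, censoring is assumed to occur at the end of each period (so censored subjects count as at risk in their last observed period), and the code is only meant to illustrate the calculations.

## Life table estimate of the discrete hazard and survival function (illustrative sketch).
## 'time'  : observed discrete event/censoring times (1, 2, 3, ...)
## 'event' : 1 = event observed, 0 = censored
time  <- c(1, 2, 2, 3, 3, 3, 4, 5, 5, 6)
event <- c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1)

tmax   <- max(time)
hazard <- numeric(tmax)
n.risk <- numeric(tmax)
for (t in seq_len(tmax)) {
  n.risk[t] <- sum(time >= t)                   # subjects still under observation at t
  hazard[t] <- sum(time == t & event == 1) / n.risk[t]   # estimated discrete hazard
}
surv      <- cumprod(1 - hazard)                # estimated survival function S(t)
se.hazard <- sqrt(hazard * (1 - hazard) / n.risk)   # binomial standard errors
data.frame(t = seq_len(tmax), n.risk, hazard, se.hazard, surv)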
2.6 Exercises
2.1 Consider the hazard function $\lambda_t = P(T = t \mid T \geq t)$ and the survival function $S(t) = P(T > t)$ for $t = 1, \ldots, q$. Show that the following links between these functions hold:
(a) $S(t) = P(T > t) = \prod_{s=1}^{t} (1 - \lambda_s)$,
(b) $P(T = t) = \lambda_t \prod_{s=1}^{t-1} (1 - \lambda_s) = \lambda_t S(t-1)$.
2.2 Consider the multinomial distribution $(Y_1, \ldots, Y_k) \sim M(n, (\pi_1, \ldots, \pi_k))$, which has the mass function
$$P(Y_1 = y_1, \ldots, Y_q = y_q) = \frac{n!}{y_1! \cdots y_q! \, (n - y_1 - \cdots - y_q)!}\, \pi_1^{y_1} \cdots \pi_q^{y_q} (1 - \pi_1 - \cdots - \pi_q)^{n - y_1 - \cdots - y_q}. \qquad (2.6)$$
Show that, with the discrete hazards $\lambda_i$ defined by
$$\pi_i = \lambda_i \prod_{j=1}^{i-1} (1 - \lambda_j),$$
the mass function can be written as a product of binomial mass functions,
$$P(Y_1 = y_1, \ldots, Y_q = y_q) = \prod_{i=1}^{q} \frac{(n - y_1 - \cdots - y_{i-1})!}{y_i! \, (n - y_1 - \cdots - y_i)!}\, \lambda_i^{y_i} (1 - \lambda_i)^{n - y_1 - \cdots - y_i}. \qquad (2.7)$$
(c) Does this mean that the binomial distributions are independent?
2.3 Let a continuous survival time with hazard function $\lambda_c(t)$ be grouped into intervals $[a_{t-1}, a_t)$. Show that the discrete time hazard of the grouped time is $\lambda_t = 1 - \exp(-\int_{a_{t-1}}^{a_t} \lambda_c(u)\,du)$. If the hazard function does not depend on time, that is, $\lambda_c(t) = \lambda_c$ for $t \in [0, \infty)$, one obtains $\lambda_t = 1 - \exp(-\lambda_c (a_t - a_{t-1}))$, so that for intervals of equal length the discrete hazard is constant.
2.4 We consider a data set of $n = 423$ couples that took part in a Danish
study on fertility (Bonde et al. 1998). The outcome variable of interest is time
to pregnancy (TTP), which is defined as the “duration that a couple waits from
initiating attempts to conceive until conception occurs” (Scheike and Keiding 2006,
http://publicifsv.sund.ku.dk/~ts/survival/sas/ttp.txt). It is measured by the number
of menstrual cycles until conception and is therefore an intrinsically discrete time
variable. The aim is to analyze the effects of various environmental and lifestyle
factors on TTP (cf. Table 2.4). Among the inclusion criteria of the study were “no
use of contraception to conceive,” as well as “prior knowledge of fertility.” The
median number of observed menstrual cycles was four; 39 % of the observations
were censored.
(a) Convert the data into life table format by considering the number of menstrual
cycles as discrete time variable.
(b) Estimate the discrete hazard rate, its standard deviation, and a weighted smooth
version of the discrete hazard rate.
(c) Plot the estimated hazard rates and their 95 % CIs.
2.5 Again consider the TTP data. The aim is now to analyze the effects of various
covariates on TTP.
Table 2.4 Explanatory variables for the analysis of time to pregnancy (TTP)
Variable                                | Categories/unit            | Sample proportion/median (range)
Intake of caffeine for the male         | mg per day                 | 378.55 (0.00–3661.42)
Intake of caffeine for the female       | mg per day                 | 250 (0–1,100)
Number of drinks for the male           | Counts                     | 7 (0–84)
Number of drinks for the female         | Counts                     | 2 (0–39)
Smoking status of the male              | Yes / No                   | 31 % / 69 %
Smoking status of the female            | Yes / No                   | 29 % / 71 %
Smoking status of the male's mother     | Yes / No / Not available   | 36 % / 54 % / 10 %
Smoking status of the female's mother   | Yes / No / Not available   | 38 % / 53 % / 9 %
(a) Split the data into cohorts of non-smoking and smoking females. Convert the
data into life table format, in the same way as in Exercise 2.4.
(b) Estimate the survival function in both cohorts. In addition, compute the standard
deviations of the survival function estimates and illustrate the results. Is there
an effect of smoking on the time to pregnancy?
(c) Repeat the above analysis for couples where both partners smoke and compare
the results to those obtained from couples where neither partner smokes.
2.6 Verify the formulas for $m_x$ and $q_x$ in Sect. 2.3.
Chapter 3
Basic Regression Models
Life tables as considered in the previous chapter yield estimates of discrete hazard
rates and survival functions without relating them to covariates such as age, sex,
etc. If covariates are available one can estimate separate life tables for specific
combinations of covariate values. The so-obtained life tables allow to investigate
the effect of covariates on survival. The method is, however, restricted to cases with
not too many combinations of covariate values.
This chapter extends the life table methodology by introducing statistical models
that directly link the hazard rate to an additive combination of multiple covariates.
As explained in Chap. 1, many applications involve survival times that are measured on a discrete scale, for example, in days, months, or weeks. One can consider the measurement as a discretized version of the underlying continuous time, but discrete time often is the natural way observations are collected. In the following let time take values in $\{1, \ldots, k\}$. If time results from intervals one has $k$ underlying intervals $[a_0, a_1), [a_1, a_2), \ldots, [a_{q-1}, a_q), [a_q, \infty)$, where $q = k - 1$. Often $a_0 = 0$ is assumed for the first interval, and $a_q$ denotes the final follow-up. Discrete time $T \in \{1, \ldots, k\}$ means that $T = t$ is observed if failure occurs within the interval $[a_{t-1}, a_t)$. In the following we will consider the response $T$ for given covariates $x = (x_1, \ldots, x_p)^T$, which are assumed to have an impact on the survival time.
The stochastic behavior of the discrete random variable $T$ given $x$ can be described in the usual way by specifying the probability function $P(T = t \mid x)$, $t = 1, \ldots, k$, or, equivalently, the cumulative distribution function $F(t|x) = P(T \leq t \mid x)$. As already mentioned, in survival analysis one often uses an alternative representation that captures the dynamic aspect of time as a response. The main tool is the hazard function, which for a given vector of explanatory variables $x$ has the form
$$\lambda(t|x) = P(T = t \mid T \geq t, x), \qquad t = 1, \ldots, q. \qquad (3.1)$$
It is the conditional probability of failure in interval $[a_{t-1}, a_t)$ given the interval is reached, and describes the instantaneous rate of death at time $t$ given that the individual survives until $t$.
The corresponding discrete survival function is given by
$$S(t|x) = P(T > t \mid x) = \prod_{i=1}^{t} (1 - \lambda(i|x)). \qquad (3.2)$$
The survival function describes the probability that failure occurs later than at time $t$. In other words, if one considers the underlying intervals, it represents the probability of surviving interval $[a_{t-1}, a_t)$. The survival function is directly linked to the more conventional cumulative distribution function by $F(t|x) = P(T \leq t \mid x) = 1 - P(T > t \mid x) = 1 - S(t|x)$.
Alternatively, one can also consider the probability of reaching period $t$ or interval $[a_{t-1}, a_t)$, which is given by an alternative definition of the survival function:
$$\tilde{S}(t|x) := P(T \geq t \mid x) = \prod_{i=1}^{t-1} (1 - \lambda(i|x)). \qquad (3.3)$$
The only difference to $S(t|x)$ is that now period $t$ is included. The link between the alternative survival functions is given by $\tilde{S}(t|x) = S(t-1|x)$.
The unconditional probability of failure at time $t$ (i.e., of an event in interval $[a_{t-1}, a_t)$) is simply computed as
$$P(T = t \mid x) = \lambda(t|x) \prod_{s=1}^{t-1} (1 - \lambda(s|x)) = \lambda(t|x)\,\tilde{S}(t|x). \qquad (3.4)$$
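The relations (3.2)–(3.4) translate directly into code. The following R sketch, with an arbitrary illustrative hazard vector, computes the survival function and the unconditional probabilities from a given discrete hazard.

## Discrete hazard lambda(t) for t = 1, ..., q (values chosen for illustration)
lambda  <- c(0.10, 0.15, 0.20, 0.25, 0.30)
S       <- cumprod(1 - lambda)       # S(t)  = P(T > t),            Eq. (3.2)
S.tilde <- c(1, head(S, -1))         # S~(t) = P(T >= t) = S(t-1),  Eq. (3.3)
p       <- lambda * S.tilde          # P(T = t) = lambda(t) * S~(t), Eq. (3.4)
cbind(t = seq_along(lambda), lambda, S, S.tilde, p)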
Basic Concepts
Discrete hazard: $\lambda(t|x) = P(T = t \mid T \geq t, x)$
Survival function: $S(t|x) = P(T > t \mid x) = \prod_{i=1}^{t} (1 - \lambda(i|x))$
The main extension (compared to (3.5)) is that the intercept $\gamma_{0t}$ now depends on time. Since the response function $h(\cdot)$ is strictly monotonically increasing, one can construct the inverse function $g = h^{-1}$, and the model has the form
$$g(\lambda(t|x)) = \gamma_{0t} + x^T\gamma. \qquad (3.7)$$
In generalized linear model (GLM) terminology $g(\cdot)$ is called the link function. The two models, (3.6) and (3.7), are equivalent, but reveal different aspects of the model. In the following sections we will first consider simple models of the form (3.6). We will introduce several popular versions and discuss properties of the model and the interpretation of parameters, which depend on the chosen response function.
The most widely used binary regression model is the logit model, which uses the logistic distribution function $h(\eta) = \exp(\eta)/(1 + \exp(\eta))$. The corresponding logistic discrete hazard model has the form
$$\lambda(t|x) = \frac{\exp(\gamma_{0t} + x^T\gamma)}{1 + \exp(\gamma_{0t} + x^T\gamma)},$$
which models the occurrence of the event at time $t$ (given it is reached) as a logistic model. An alternative version is
$$\log\frac{\lambda(t|x)}{1 - \lambda(t|x)} = \gamma_{0t} + x^T\gamma. \qquad (3.8)$$
The ratio $P(T = t \mid x)/P(T > t \mid x)$ is known as the continuation ratio, to which the model owes its name continuation ratio model (Agresti 2013). It compares the probability of an event at $t$ to the probability of an event later than $t$. It also represents conditional odds because it is equivalent to $P(T = t \mid T \geq t, x)/(1 - P(T = t \mid T \geq t, x))$, which compares the conditional probability of an event at time $t$ to the conditional probability of an event later than $t$, both under the condition $T \geq t$. This representation is particularly useful for the interpretation of parameters.
In model (3.8) it is assumed that the intercepts $\gamma_{0t}$ vary over time whereas the parameter $\gamma$ is fixed. This separation of time variation and covariate effects makes interpretation of parameters easy. The intercepts, which vary over time, can be interpreted as a baseline hazard, that is, the hazard that is always present for any given set of covariates. Specifically, $\gamma_{0t}$ is the log continuation ratio for the covariate
vector $x^T = (0, \ldots, 0)$. Alternatively one can consider the exponential form of model (3.8), which is given by
$$\gamma(t|x) := \frac{P(T = t \mid x)}{P(T > t \mid x)} = e^{\gamma_{0t}} (e^{\gamma_1})^{x_1} \cdots (e^{\gamma_p})^{x_p}, \qquad (3.9)$$
where $\gamma(t|x)$ denotes the continuation ratio at value $t$. It follows that the exponential of the intercept is given by
$$\exp(\gamma_{0t}) = \gamma(t|0) = \frac{P(T = t \mid 0)}{P(T > t \mid 0)}$$
and can be directly interpreted as the continuation ratio for covariate values $x^T = 0^T = (0, \ldots, 0)$. The presence of the baseline hazard for all values of covariates can be illustrated by considering the ratio of continuation ratios at two time periods, $t$ and $s$. From
$$\frac{\gamma(t|x)}{\gamma(s|x)} = \exp(\gamma_{0t} - \gamma_{0s}) \qquad (3.10)$$
it is seen that the ratio of the continuation ratios of two periods does not depend on the value of the covariate vector. Therefore the basic shape of the continuation ratios is the same for all values of the predictor.
The interpretation of the coefficients is easily obtained from (3.9) and (3.10). The parameter $\gamma_j$ represents the change in the log continuation ratios (or the conditional log odds ratios) if $x_j$ increases by one unit. Correspondingly,
$$\exp(\gamma_j) = \frac{\gamma(t \mid x_1, \ldots, x_j + 1, \ldots, x_p)}{\gamma(t \mid x_1, \ldots, x_j, \ldots, x_p)}$$
is the factor by which the continuation ratio changes if $x_j$ is increased by one unit. It is important to note that the change does not depend on time. It is the same for all periods, allowing for a simple interpretation of effects: If a predictor increases the continuation ratio, the increase is the same for all periods, and, of course, the same relationship holds for a decrease. An alternative view on this strong property of the model is obtained by considering two populations characterized by the values of the predictor $x$ and $\tilde{x}$. One sees from
$$\frac{\gamma(t|x)}{\gamma(t|\tilde{x})} = \exp((x - \tilde{x})^T\gamma) \qquad (3.11)$$
that the comparison of these subpopulations in terms of the continuation ratio does not depend on time. If one compares, for example, two therapies, represented by $x = 1$ and $x = 0$, one obtains $\gamma(t \mid x = 1)/\gamma(t \mid x = 0) = \exp(\gamma)$. If the modeled event is death and $\gamma$ is positive, this means that the therapy coded as $x = 0$ is to be preferred over the therapy coded as $x = 1$ because the continuation ratio of the latter therapy is higher at all time points. Property (3.11) implies in particular that a therapy cannot have superior short-time effects while being inferior in terms of chances of survival after a certain time point. In such a case it would be hard to choose among therapies since the one that supports short-time survival could be inferior in terms of long-time survival. Property (3.11), which excludes such effects, makes the model a proportional continuation ratio model.
One consequence of the property is that survival functions do not cross. If one compares two therapies represented by $x = 1$ and $x = 0$ with $\gamma$ positive, one has $\lambda(t \mid x = 1) > \lambda(t \mid x = 0)$ for all $t$. Since the survival function is given by $S(t|x) = \prod_{i=1}^{t} (1 - \lambda(i|x))$ one obtains
$$S(t \mid x = 1) < S(t \mid x = 0)$$
for all $t$, which means the probability of survival for therapy $x = 1$ is smaller than for therapy $x = 0$ at all time points. Consequently, the survival functions never cross.
Example 3.1 Copenhagen Stroke Study
For illustration we investigate survival in the Copenhagen Stroke Study with just one predictor in the model. Figure 3.2 presents the resulting estimates of the continuation ratio model for patients with diabetes and without diabetes. The upper panel of Fig. 3.2 shows the estimated hazard functions for the two populations whereas the lower panel shows the corresponding survival functions. Observed survival times (originally measured in days) were grouped into 12-month intervals. The coefficient for the diabetes group (represented by $x = 1$) was estimated as $\hat{\gamma} = 0.401$. From Eq. (3.11) it follows that the continuation ratio of the diabetes group increases by the factor $\exp(0.401) \approx 1.49$ for all time points (compared to the non-diabetes group represented by $x = 0$). This implies that the risk of dying after hospital admission, measured by the continuation ratio, is approximately 1.5 times larger in the diabetes group than in the non-diabetes group. Details on how to estimate the parameters of the continuation ratio model will be presented in Sect. 3.4.
In addition to the continuation ratio estimates, the Kaplan–Meier estimates for the two groups are included in the lower panel of Fig. 3.2. It is seen that the continuation ratio model for discrete-time survival data yields a good approximation of the Kaplan–Meier estimates. Specifically, the model estimates are close to the life table estimates presented in Fig. 2.5, which were calculated separately for each of the two groups.
Modeling of Hazards
Before considering alternative models let us make some remarks on the basic
approach. Especially in discrete survival analysis the parameterization of the hazard
function is the most widespread method: One models the effect of a set of covariates
on the hazard. Subsequently one can consider the effect of the covariates on
Fig. 3.2 Hazard rates (upper panel) and survival functions (lower panel) for the Copenhagen Stroke Study. Estimates were obtained by fitting a continuation ratio model with binary covariate diabetes/no diabetes
the distribution of the survival time or, equivalently, on the survival function.
This approach differs from the common regression model, where one typically
investigates the effect of covariates on the mean of the dependent variable. The
advantages of modeling the hazard function have already been mentioned; in
particular this approach allows to include time-varying covariates. Also, in modeling
the transition process from one state to another it focusses on a feature that is of high
interest in duration analysis.
Nevertheless, some care is needed when interpreting the form of the hazard
function. If no covariates are involved (as in life tables) one observes the transition
to other states in the population. In parametric regression models one conditions on
the covariates, and the transition rate captured in the hazard function is conditional
on covariates. This means that one observes transition rates in populations that are
defined by the covariates. In particular when covariates are categorical, for example,
defining gender, the underlying populations are clearly defined. But since usually
not all relevant covariates are available, there might be considerable heterogeneity
in the population that is not accounted for. Therefore one should not necessarily
assume that each subject in a subpopulation defined by covariates has the same
hazard function. For example, different individual hazard functions can yield the
same population level hazard function. Also, constant hazards on the individual level
will produce decreasing hazard functions on the population level, a phenomenon
that is considered in Chap. 9. In frailty models, which are discussed in Chap. 9,
one tries to explicitly account for the heterogeneity within the populations and to
investigate the effects on the individual level. Since inference on the true hazard
function on the individual level is hard to obtain, there is a tendency in biostatistics
to focus on the effects of covariates on the survival function but not to (over)interpret
the functional form of the hazard function. Nevertheless, in some areas the focus
is on the individual hazards. When modeling the duration of unemployment,
for example, it makes a difference whether a decreasing hazard observed in the
population is caused by decreasing individual hazards, signaling declining chances
in the labor market, or time-constant but population-varying individual hazards that
yield decreasing hazard function in the populations because individuals with small
hazards will stay unemployed for a longer time. In the latter case the individual
hazards do not decrease but the population hazard does. In this spirit Van den Berg
(2001) states that “the hazard function of the duration distribution is the focal point
and basic building block of econometric duration models. Properties of the duration
distribution are generally discussed in terms of properties of the hazard function.
The individual hazard function and the way it depends on its determinants are the
parameters of interest.”
In the logistic discrete hazard model the transition to the next category is modeled by a logistic distribution function. Alternative response functions are in common use for binary models and yield alternative models. A special distribution function is the minimum extreme value or Gompertz distribution $h(\eta) = 1 - \exp(-\exp(\eta))$, which yields the model
$$\lambda(t|x) = 1 - \exp(-\exp(\gamma_{0t} + x^T\gamma)),$$
or equivalently
$$\log(-\log(1 - \lambda(t|x))) = \gamma_{0t} + x^T\gamma.$$
For the exponential model, which uses the response function $h(\eta) = \exp(\eta)$, note that the parameter space needs to be restricted because the hazard rate is restricted (i.e., $\lambda(t|x) \in [0, 1]$). For an application of the exponential model see Weinberg and Gladen (1986). The box also includes the probit model, which uses the cumulative standard normal distribution $\Phi(\cdot)$, which apart from scaling is very similar to the logistic distribution function. In particular in economic applications the probit model is preferred over the logistic model when binary responses are modeled.
The use of common binary regression models for the modeling of the transition
to period t C 1 (given period t was reached) has the big advantage that software
for binary models can be used to fit the corresponding discrete hazard models.
However, alternative names are often used for the models. As already mentioned, the binary model behind the grouped proportional hazards model is often called "complementary log-log model" because of the link function used; the model behind the Gumbel model is known as the "log-log model."
All models use a cumulative distribution function as response function $h(\cdot)$, for example, the logistic or the Gompertz distribution. The use of distribution functions has the nice side effect that the estimated hazard rates are between zero and one. However, if one wants to compare estimated coefficients for alternative models one should keep in mind that the distribution functions correspond to underlying random variables with different variances. For example, the logistic distribution function $h(\eta) = \exp(\eta)/(1 + \exp(\eta))$ corresponds to a random variable with variance $\pi^2/3$, whereas the Gompertz distribution refers to a random variable with variance $\pi^2/6$ (with $\pi = 3.14159\ldots$). Therefore comparisons of estimates obtained from different link functions should be based on standardized coefficients. Standardization to response functions that refer to random variables with variance 1 is obtained by computing $\hat{\gamma}_i/\sigma_h$, where $\sigma_h^2$ denotes the variance of the distribution function $h(\cdot)$ (see, for example, Tutz 2012, Sect. 5.1). In Example 3.2 various models with different link functions are fitted and compared.
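As a small illustration, the standardization can be done directly in R; the sketch below uses the diabetes coefficients reported in Table 3.1 for the logistic and Gompertz models and reproduces the corresponding standardized values in Table 3.2.

## Standardizing coefficients from models with different response functions
## so that they refer to latent variables with variance 1.
coef.logistic <- 0.325          # diabetes coefficient, logit model (Table 3.1)
coef.gompertz <- 0.280          # diabetes coefficient, clog-log model (Table 3.1)

sd.logistic <- pi / sqrt(3)     # sd of the standard logistic distribution
sd.gompertz <- pi / sqrt(6)     # sd of the standard Gompertz/Gumbel distribution

coef.logistic / sd.logistic     # approx. 0.179, as in Table 3.2
coef.gompertz / sd.gompertz     # approx. 0.218, as in Table 3.2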
Example 3.2 Copenhagen Stroke Study
We consider survival in the Copenhagen Stroke Study to illustrate the fitting of various discrete
hazard models. We start with one predictor and then include all of the available predictors.
Figure 3.4 presents the estimated survival functions of patients that suffered a previous stroke
and of patients without previous stroke. Five different survival models (logistic, probit, Gompertz,
Gumbel, and exponential) were fitted with observations grouped into 12-month intervals. It is seen
from Fig. 3.4 that the estimated survival functions obtained for the five models are very similar.
The model-free Kaplan–Meier estimates, which are also shown, indicate that the structure of the patient group without previous stroke is captured relatively well by the models, whereas survival in the previous-stroke group is overestimated in the interval [1000, 2100] days (corresponding to approximately 3–6 years after admission to hospital). It is seen that the risk of dying
an interval of approximately 3–6 years after admission to hospital). It is seen that the risk of dying
is considerably larger in the previous-stroke group than in the group of patients with a first-time
stroke.
Fig. 3.4 Copenhagen Stroke Study. The figures show the estimated survival functions of patients that suffered a previous stroke and of patients without previous stroke. Different link functions were used to fit discrete hazard models
Table 3.1 shows the coefficient estimates that were obtained from the Copenhagen Stroke Study
when all covariates were included in the hazard models. Positive coefficient estimates imply that
the corresponding covariates have a positive effect on hazards (and hence a negative effect on
survival). For example, patients with diabetes and a previous stroke have a higher estimated risk
of dying than patients without diabetes or previous stroke. As expected, because of the negative
coefficient signs, patients with a large Scandinavian stroke score tend to have a decreased risk
of dying early. In addition to the coefficient estimates, Table 3.1 shows the estimated standard
deviations and the p-values obtained from statistical hypothesis tests. Generally, p-values are a
measure of the statistical significance of a covariate; they indicate whether the effect of a covariate
on survival is systematically different from zero or not. In Table 3.1, for example, p-values for
Table 3.1 Copenhagen Stroke Study. The table presents the parameter estimates that were obtained from fitting discrete hazard models with different link functions (coef = parameter estimate, se = estimated standard deviation, p = p-value obtained from t-test)
                                Logistic               Probit                 Gompertz
                                coef    se     p       coef    se     p       coef    se     p
Age                             0.058   0.007  0.000   0.031   0.004  0.000   0.053   0.006  0.000
Sex (male)                      0.458   0.128  0.000   0.255   0.070  0.000   0.412   0.115  0.000
Hypertension (yes)              0.233   0.123  0.058   0.132   0.068  0.050   0.206   0.110  0.061
Ischemic heart disease (yes)    0.128   0.143  0.372   0.079   0.079  0.316   0.094   0.128  0.461
Previous stroke (yes)           0.213   0.148  0.150   0.125   0.083  0.130   0.179   0.131  0.172
Other disabling disease (yes)   0.148   0.153  0.333   0.086   0.086  0.318   0.119   0.136  0.379
Alcohol intake (yes)            0.107   0.133  0.420   0.058   0.072  0.423   0.090   0.121  0.458
Diabetes (yes)                  0.325   0.165  0.049   0.188   0.092  0.040   0.280   0.146  0.055
Smoking status (yes)            0.345   0.127  0.006   0.204   0.069  0.003   0.301   0.114  0.008
Atrial fibrillation (yes)       0.407   0.172  0.018   0.233   0.099  0.019   0.367   0.148  0.013
Hemorrhage (yes)                0.009   0.274  0.973   0.025   0.150  0.868   -0.018  0.246  0.941
Scandinavian stroke score       -0.025  0.004  0.000   -0.014  0.003  0.000   -0.022  0.004  0.000
Cholesterol                     0.010   0.045  0.821   0.004   0.025  0.882   0.009   0.041  0.833

                                Gumbel                 Exponential
                                coef    se     p       coef    se     p
Age                             0.025   0.003  0.000   0.048   0.006  0.000
Sex (male)                      0.215   0.057  0.000   0.372   0.101  0.000
Hypertension (yes)              0.114   0.056  0.041   0.184   0.097  0.057
Ischemic heart disease (yes)    0.077   0.066  0.242   0.067   0.112  0.549
Previous stroke (yes)           0.113   0.070  0.107   0.148   0.113  0.189
Other disabling disease (yes)   0.077   0.073  0.291   0.088   0.117  0.451
Alcohol intake (yes)            0.047   0.059  0.423   0.069   0.107  0.520
Diabetes (yes)                  0.169   0.078  0.030   0.234   0.125  0.061
Smoking status (yes)            0.184   0.057  0.001   0.262   0.101  0.009
Atrial fibrillation (yes)       0.201   0.088  0.022   0.336   0.122  0.006
Hemorrhage (yes)                0.045   0.125  0.715   -0.042  0.219  0.846
Scandinavian stroke score       -0.012  0.002  0.000   -0.019  0.003  0.000
Cholesterol                     0.001   0.020  0.954   0.007   0.036  0.851
the covariates age, sex, smoking status, atrial fibrillation, and Scandinavian stroke score were
smaller than 0.05 in all five models. This result indicates that the corresponding covariates had
a significant effect on survival (at significance level 0.05). When comparing the different models,
most of the p-values were similar in size. Nevertheless, there were also differences. For example,
in the logistic, probit, and Gumbel models diabetes had a significant negative effect on survival (if one considers the often-used significance level 0.05) while the p-values were slightly larger than 0.05 in the Gompertz and exponential models. Also, the p-values of hypertension on survival showed some variation around 0.05. In some cases even the signs of the coefficient estimates changed. For example, the estimates seem to indicate an increased risk of dying for hemorrhage patients in the logistic, probit, and Gumbel models and a decreased risk in the Gompertz and exponential models. But the standard deviations and p-values are so large that these effects could also be due to chance. Details on statistical hypothesis tests will be presented in Chap. 4.
Table 3.2 presents the standardized coefficient estimates of the five models, which were calculated by dividing the coefficient estimates by the standard deviations of the respective response function. More specifically, the standard deviations are given by $\pi/\sqrt{3}$ for the logistic model, $\pi/\sqrt{6}$ for the Gompertz and Gumbel models, and 1 for the probit and exponential models. As mentioned earlier, the standardized estimates are measured on the same scale and are more suitable for the comparison of coefficients than the raw estimates. Similar to the results presented in Fig. 3.4, the estimates obtained from the five modeling approaches (logistic, probit, Gompertz, Gumbel, and exponential) are similar with respect to the standardized coefficient estimates. Note that, in addition to comparing standardized coefficient estimates, it is also necessary to evaluate which of the five modeling approaches fits the data best. Strategies for model comparison are presented in Chap. 4.
Table 3.2 Copenhagen Stroke Study. The table presents the standardized coefficient estimates that were obtained from fitting discrete hazard models with different link functions
                                Logistic  Probit   Gompertz  Gumbel   Exponential
Age                             0.032     0.031    0.041     0.020    0.048
Sex (male)                      0.253     0.255    0.321     0.167    0.372
Hypertension (yes)              0.128     0.132    0.161     0.089    0.184
Ischemic heart disease (yes)    0.070     0.079    0.074     0.060    0.067
Previous stroke (yes)           0.117     0.125    0.139     0.088    0.148
Other disabling disease (yes)   0.082     0.086    0.093     0.060    0.088
Alcohol intake (yes)            0.059     0.058    0.070     0.037    0.069
Diabetes (yes)                  0.179     0.188    0.218     0.132    0.234
Smoking status (yes)            0.190     0.204    0.234     0.144    0.262
Atrial fibrillation (yes)       0.224     0.232    0.286     0.156    0.336
Hemorrhage (yes)                0.005     0.025    -0.014    0.035    -0.042
Scandinavian stroke score       -0.014    -0.014   -0.017    -0.010   -0.019
Cholesterol                     0.006     0.004    0.007     0.001    0.007
Concerning the choice among the models, two aspects are of major importance:
The first one is goodness-of-fit (i.e., how well the model fits the underlying
data), and the second one is ease of interpretation (i.e., how easily the model
parameters can be interpreted). The difference between models can be small if
the grouping intervals are very short. For example, Thompson (1977) showed that
estimates are typically very similar for the grouped proportional hazards model and
the logistic model in this case. Although differences between the standard links
considered here can be small, in some applications quite different link functions
can be more appropriate. In Chap. 4 more general families of response functions
are considered. Some of the families contain the logistic and the clog-log model as
special cases. By fitting a whole family one tries to find the most appropriate link
function, which can be the logistic link function but can also be far away from it.
Example 4.7 demonstrates that in some applications the underlying random variable
can be strongly skewed. Even more flexible models are obtained by allowing for
nonparametrically estimated response functions. These more general approaches
focussing on the choice of the link are deferred to Chap. 4.
Proportionality
It should be noted that all models considered in this chapter postulate some proportionality property that is important w.r.t. the interpretation of parameters. Because $h(\cdot)$ is a monotonic function in the model equation $\lambda(t|x) = h(\gamma_{0t} + x^T\gamma)$, one can always compare two populations characterized by the values of the predictors $x$ and $\tilde{x}$ by
$$g(\lambda(t|x)) - g(\lambda(t|\tilde{x})) = (x - \tilde{x})^T\gamma, \qquad (3.13)$$
which does not depend on $t$. Therefore, if a population has a larger hazard rate than the other, this ordering is the same over time. Which specific property holds for which model can be derived from computing the left-hand side of (3.13). As shown before, for the logistic model one obtains the log continuation ratios, which are postulated to not depend on time. For the grouped proportional hazards model it can be shown that the ratio of the logarithms of the survival functions is given by
$$\frac{\log(S(t|x))}{\log(S(t|\tilde{x}))} = \exp((x - \tilde{x})^T\gamma)$$
and therefore depends on $x$ and $\tilde{x}$ but not on time. It means that the logarithms of the discrete survival functions for two populations are proportional over time. For the Gumbel model the corresponding quantity is the logarithm of $\log(\lambda(t|x))/\log(\lambda(t|\tilde{x}))$, which seems less easy to interpret.
Models for continuous survival time use the same concepts as discrete survival models. But the definition of the hazard function and the link between the hazard function and the survival function are different. In the following we briefly sketch the concepts for continuous time and the link between continuous time survival and discrete time survival. In particular the discretized version of the widely used Cox model is considered.
Let $T_c$ denote a continuous survival time, which has density function $f(\cdot)$. The survival function, which represents the probability of survival until time $t$, is now given without ambiguity by
$$S_c(t) = P(T_c > t) = 1 - F_c(t),$$
where $F_c(t) = P(T_c \leq t) = \int_0^t f(u)\,du$ denotes the distribution function of $T_c$. The hazard function for continuous time is defined as a limit by
$$\lambda_c(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T_c < t + \Delta t \mid T_c \geq t)}{\Delta t}. \qquad (3.14)$$
Generally, the stochastic properties of a survival time are fixed if one of the following quantities is specified: distribution function, density, hazard function, or cumulative hazard function. It is straightforward to derive the connection between these quantities:
$$\lambda_c(t) = \frac{f(t)}{S_c(t)},$$
$$S_c(t) = \exp\left(-\int_0^t \lambda_c(u)\,du\right) = \exp(-\Lambda_c(t)),$$
$$f(t) = \lambda_c(t)\exp\left(-\int_0^t \lambda_c(u)\,du\right) = \lambda_c(t)\exp(-\Lambda_c(t))$$
(see, for example, Lawless 1982). The simplest model for continuous time assumes that the hazard is constant over time, $\lambda_c(t) = \lambda_c$ for $t \in [0, \infty)$. In this case the survival function is the exponential function $S_c(t) = \exp(-\lambda_c t)$. The corresponding cumulative distribution function is $F_c(t) = 1 - \exp(-\lambda_c t)$ and the density is $f(t) = \lambda_c \exp(-\lambda_c t)$. The distribution function and the density are familiar from introductory courses to statistics as describing a random variable that is exponentially distributed with parameter $\lambda_c$. Therefore the assumption of an exponential distribution is equivalent to postulating that the underlying hazard function is constant over time. The exponential distribution is also known as a distribution without memory, having the property $P(T_c > t + \Delta \mid T_c > t) = P(T_c > \Delta)$. This means the probability of a duration time longer than $t + \Delta$ given $T_c > t$ is the same as the probability of a duration time longer than $\Delta$ right from the beginning. In other words, when starting at $T_c > t$ the conditional distribution is
the same as when starting at zero. This is another way of expressing that the hazard
does not depend on time. More flexible models such as, for example, the Weibull
distribution, are considered extensively in the literature on continuous time survival
(Lawless 1982; Klein and Moeschberger 2003). A very flexible regression model
that comprises all these models is the Cox model considered in the next section.
The most widely used model in continuous survival modeling is Cox's proportional hazards model (Cox 1972), which specifies
$$\lambda_c(t|x) = \lambda_0(t)\exp(x^T\gamma). \qquad (3.16)$$
The model is composed of two parts, namely the unknown baseline hazard $\lambda_0(t)$ and the effects of the covariates on $\lambda_c$ given by $\exp(x^T\gamma)$. Thus the variation of the hazard over time is separated from the effects of the covariates. The important consequence is that effects are constant over time. This becomes clear when considering two populations defined by the values of the predictor $x$ and $\tilde{x}$: One obtains that the ratio of the continuous hazards
$$\frac{\lambda_c(t|x)}{\lambda_c(t|\tilde{x})} = \exp((x - \tilde{x})^T\gamma) \qquad (3.17)$$
does not depend on time. This property makes the model a proportional hazards model. The model is usually referred to as "Cox model" or simply as "proportional hazards model." The fact that hazards are proportional helps in the interpretation of the parameters. For example, in a treatment study where the hazard refers to survival, let $x = 1$ indicate treatment 1 and $\tilde{x} = 0$ indicate treatment 2. Then for positive $\gamma$ the hazard for treatment 1 is $\exp(\gamma)$ times the hazard for treatment 2 at all times, which makes treatment 2 the better choice. It is crucial that the ratio of the hazards does not depend on time because then one of the treatments is to be preferred regardless of the survival time.
A different view on the proportional hazards property is obtained by considering the corresponding survival functions. It is easily derived that the survival function $S_c(t|x) = P(T_c > t \mid x)$ of the Cox model has the form
$$S_c(t|x) = \exp\left(-\int_0^t \lambda_0(u)\,du\right)^{\exp(x^T\gamma)} = S_0(t)^{\exp(x^T\gamma)}, \qquad (3.18)$$
where $S_0(t) = \exp(-\int_0^t \lambda_0(u)\,du)$ is the baseline survival function. Therefore, the survival function given $x$ is given as the baseline survival function to the power of $\exp(x^T\gamma)$. The proportional hazards assumption has the consequence that the survival functions for distinct values of explanatory variables $x$ and $\tilde{x}$ never cross. Thus, either $S_c(t|x) > S_c(t|\tilde{x})$ or $S_c(t|x) < S_c(t|\tilde{x})$ holds for all values of $t$, indicating that one of the values, $x$ or $\tilde{x}$, yields higher chances of survival at all times.
Let now time be coarsened by use of the intervals $[0, a_1), [a_1, a_2), \ldots, [a_{q-1}, a_q), [a_q, \infty)$, and assume that the proportional hazards model holds. If discrete time is observed, that is, if one observes $T = t$ when failure occurs within the interval $[a_{t-1}, a_t)$, the discrete hazard $\lambda(t|x) = P(T = t \mid T \geq t, x)$ takes the form
$$\lambda(t|x) = 1 - \exp(-\exp(\gamma_{0t} + x^T\gamma)), \qquad (3.19)$$
where the parameters $\gamma_{0t}$ are derived from the baseline hazard function $\lambda_0(u)$ (Exercise 3.1). Model (3.19) is equivalent to the Gompertz model or clog-log model for discrete hazards. It is essential that the effect of covariates contained in the parameter $\gamma$ is the same as in the original Cox model for continuous time. Therefore, if the data generating model is the Cox model but data come in a coarsened form by grouping into intervals, one can use the clog-log model for inference on the effect of covariates specified by the Cox model. Of course, if continuous lifetimes are available, the grouped version of the model uses less information than the original Cox model specified in (3.16). We defer the investigation of the resulting information loss to Sect. 3.6, after estimation concepts for discrete hazards have been considered.
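A hedged sketch of the coarsening step in R: continuous survival times are mapped to discrete periods by fixed interval borders; the borders and times below are arbitrary illustrative values.

## Grouping continuous survival times into discrete periods T = 1, ..., k
## defined by fixed interval borders a_0 = 0 < a_1 < ... < a_q.
borders   <- c(0, 30, 60, 90, 180, 365)            # a_0, ..., a_q (days, illustrative)
time.cont <- c(12.3, 45.0, 150.7, 200.2, 400.9)    # continuous observed times

## findInterval() returns t such that the time falls into [a_{t-1}, a_t);
## values beyond the last border fall into the final interval [a_q, Inf).
T.disc <- findInterval(time.cont, borders)
T.disc
# [1] 1 2 4 5 6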
When considering proportionality one should be careful about whether one refers to proportionality in continuous time or in discrete time. In fact, the proportionality in (3.17) holds for continuous time. However, proportionality takes a different form for the corresponding time-discrete version. With the clog-log model, proportionality does not refer to the discrete hazards but to the discrete survival functions, that is, the ratio $\log(S(t|x))/\log(S(t|\tilde{x}))$ does not depend on time (compare (3.13)). Nevertheless we will call the clog-log model also "grouped proportional hazards model." The latter name refers to the connection to the proportional hazards model for continuous time and is in common use.
3.4 Estimation
Let the basic discrete survival model $\lambda(t|x_i) = h(\gamma_{0t} + x_i^T\gamma)$ be given in the form
$$\lambda(t|x_i) = h(x_{it}^T\beta), \qquad (3.20)$$
where $x_{it}$ is a design vector composed of the indicator for period $t$ and the covariates $x_i$, and $\beta^T = (\gamma_{01}, \ldots, \gamma_{0q}, \gamma^T)$.
Estimation
In the following we assume that observations are subject to censoring, which means that only a portion of the observed times can be considered as exact survival times. For the rest of the observations one only knows that the survival time exceeds a certain time point. We assume random censoring, that is, each individual $i$ has a survival time $T_i$ and a censoring time $C_i$, where $T_i$ and $C_i$ are independent random variables. In this framework the observed time is given by $t_i := \min(T_i, C_i)$ as the minimum of the survival time $T_i$ and the censoring time $C_i$. It is often useful to introduce an indicator variable for censoring given by
$$\delta_i = \begin{cases} 1 & \text{if } T_i \leq C_i, \\ 0 & \text{if } T_i > C_i, \end{cases}$$
where it is implicitly assumed that censoring occurs at the end of the interval. Let the total data with potential censoring be given by $(t_i, \delta_i, x_i)$, $i = 1, \ldots, n$, where $n$ is the sample size. It is useful to first consider non-censored and censored observations separately. Since time and censoring are independent by assumption, the probability of observing the exact survival time $(t_i, \delta_i = 1)$ is given by
$$P(T_i = t_i, C_i \geq t_i) = P(T_i = t_i)\,P(C_i \geq t_i).$$
A failure in interval $[a_{t_i-1}, a_{t_i})$ implies $C_i \geq t_i$, and censoring in interval $[a_{t_i-1}, a_{t_i})$ (i.e., $C_i = t_i$) implies survival beyond $a_{t_i}$. The contribution of the $i$th observation to the likelihood function is therefore given by
$$L_i = \left[P(T_i = t_i)\,P(C_i \geq t_i)\right]^{\delta_i}\left[P(T_i > t_i)\,P(C_i = t_i)\right]^{1-\delta_i}.$$
By using the discrete hazard function and including the covariates, one obtains with the help of (3.2) and (3.4)
$$L_i = c_i\, \lambda(t_i|x_i)^{\delta_i} (1 - \lambda(t_i|x_i))^{1-\delta_i} \prod_{j=1}^{t_i-1} (1 - \lambda(j|x_i)). \qquad (3.24)$$
This likelihood seems to be very specific for a discrete survival model with censoring. But closer inspection shows that it is in fact equivalent to the likelihood of a sequence of binary responses. By defining for a non-censored observation ($\delta_i = 1$) the sequence $(y_{i1}, \ldots, y_{it_i}) = (0, \ldots, 0, 1)$ and for a censored observation ($\delta_i = 0$) the sequence $(y_{i1}, \ldots, y_{it_i}) = (0, \ldots, 0)$, the likelihood (omitting $c_i$) can be written as
$$L_i = \prod_{s=1}^{t_i} \lambda(s|x_i)^{y_{is}} (1 - \lambda(s|x_i))^{1-y_{is}}.$$
Thus $L_i$ is equivalent to the likelihood of a binary response model with values $y_{is} \in \{0, 1\}$. Generally, for each $i$ one has $t_i$ such observations. Note, however, that the binary observations are not just artificially constructed values. In fact, the values $y_{is}$ code the binary decisions for the transition to the next period. Specifically, $y_{is}$ codes the transition from interval $[a_{s-1}, a_s)$ to $[a_s, a_{s+1})$ in the form
$$y_{is} = \begin{cases} 1 & \text{if the individual fails in } [a_{s-1}, a_s), \\ 0 & \text{if the individual survives } [a_{s-1}, a_s), \end{cases}$$
$s = 1, \ldots, t_i$. For exact survival times one has the observation vector $(0, \ldots, 0, 1)$ of length $t_i$, that is, survival for the first $t_i - 1$ intervals and failure in the $t_i$-th interval. For censored observations, it is known that the first $t_i$ intervals have been survived.
As a consequence, the total log-likelihood of the discrete survival model is given by
$$l \propto \sum_{i=1}^{n}\sum_{s=1}^{t_i} y_{is}\log\lambda(s|x_i) + (1 - y_{is})\log(1 - \lambda(s|x_i)). \qquad (3.25)$$
Consequently, software for binary regression models can be used for estimation; one only has to generate the binary observations before fitting the binary model for the transitions. For illustration let us consider the discrete hazard model with linear predictor $\eta_{ir} = \gamma_{0r} + x_i^T\gamma$. In closed form the linear predictor is given by $x_{ir}^T\beta$ with parameter vector $\beta^T = (\gamma_{01}, \ldots, \gamma_{0q}, \gamma^T)$. For the linear predictors of the binary model one has to adapt the covariate vector to the observations under consideration. For the generation of the design matrix one has to distinguish between censored and non-censored observations. If $T = t_i$ and $\delta_i = 1$, the $t_i$ binary observations and design variables are given by the augmented data matrix. Instead of constructing the design matrix one can also construct the observations in long format with a running time $t$, which has to be treated as a factor when using appropriate software. Then for $T = t_i$, $\delta_i = 1$ one uses a coding with one row per period at risk, as in the sketch below.
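The following R sketch illustrates this data augmentation and the resulting fit; the data, the variable names, and the use of glm() with a logit link are illustrative choices (any of the binary response functions discussed above could be used).

## Construct the augmented (person-period) data set and fit a discrete hazard model.
## ti: observed discrete time, delta: 1 = event, 0 = censored (artificial toy data)
ti    <- c(3, 2, 4, 1, 5, 5, 2, 3)
delta <- c(1, 0, 1, 1, 0, 1, 1, 0)
x     <- c(0, 1, 1, 0, 1, 0, 1, 0)             # a single binary covariate

long <- do.call(rbind, lapply(seq_along(ti), function(i) {
  data.frame(id   = i,
             time = seq_len(ti[i]),                       # running time s = 1, ..., t_i
             y    = c(rep(0, ti[i] - 1), delta[i]),       # 0,...,0,1 if event; 0,...,0 if censored
             x    = x[i])
}))

## Fit the transitions as a binary regression; factor(time) yields the
## time-varying intercepts gamma_0t (logit link, i.e., continuation ratio model).
fit <- glm(y ~ factor(time) + x, family = binomial(link = "logit"), data = long)
summary(fit)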
In the previous section the contribution of censoring to the likelihood has been essentially ignored. However, the censoring process can also be analyzed by specifying a discrete hazard model. To this purpose let $\lambda_{cens}(t|x_i) = P(C = t \mid C \geq t, x_i)$ denote the hazard function for the censoring time. With censoring at the end of the interval the contribution of the $i$th observation to the likelihood, as considered before, is
$$L_i = c_i\, \lambda(t_i|x_i)^{\delta_i} (1 - \lambda(t_i|x_i))^{1-\delta_i} \prod_{j=1}^{t_i-1} (1 - \lambda(j|x_i)),$$
where $c_i = P(C_i \geq t_i)^{\delta_i} P(C_i = t_i)^{1-\delta_i}$ contains the contribution of the censored observations. It is straightforward to derive
$$c_i = \lambda_{cens}(t_i|x_i)^{1-\delta_i} \prod_{s=1}^{t_i-1} (1 - \lambda_{cens}(s|x_i)).$$
If $\delta_i = 0$ one obtains
$$c_i = \prod_{s=1}^{t_i} \lambda_{cens}(s|x_i)^{y^{(c)}_{is}} (1 - \lambda_{cens}(s|x_i))^{1-y^{(c)}_{is}},$$
where $(y^{(c)}_{i1}, \ldots, y^{(c)}_{it_i}) = (0, \ldots, 0, 1)$ codes the censoring process. If $\delta_i = 1$ one obtains
$$c_i = \prod_{s=1}^{t_i-1} (1 - \lambda_{cens}(s|x_i))^{1-y^{(c)}_{is}},$$
which means that censoring has not occurred up to and including period $t_i - 1$. In closed form the contribution of the $i$th observation to the likelihood function is given by
$$c_i = \prod_{s=1}^{t_i - \delta_i} \lambda_{cens}(s|x_i)^{y^{(c)}_{is}} (1 - \lambda_{cens}(s|x_i))^{1-y^{(c)}_{is}}.$$
Concerning the design matrix, one obtains for the censoring process with running time $t$ and $\delta_i = 0$ the $t_i$ observations $(y^{(c)}_{i1}, \ldots, y^{(c)}_{it_i}) = (0, \ldots, 0, 1)$; for the censoring process with running time $t$ and $\delta_i = 1$ one obtains the $t_i - 1$ observations $(y^{(c)}_{i1}, \ldots, y^{(c)}_{i,t_i-1}) = (0, \ldots, 0)$.
Simultaneous maximum likelihood estimation for the failure time and censoring processes uses the log-likelihood $l = l_T + l_C$, where
$$l_T = \sum_{i=1}^{n}\sum_{s=1}^{t_i} \left( y_{is}\log\lambda(s|x_i) + (1 - y_{is})\log(1 - \lambda(s|x_i)) \right)$$
is the log-likelihood that depends on the discrete hazard of failure (i.e., on $\lambda(s|x_i)$), and
$$l_C = \sum_{i=1}^{n}\sum_{s=1}^{t_i - \delta_i} \left( y^{(c)}_{is}\log\lambda_{cens}(s|x_i) + (1 - y^{(c)}_{is})\log(1 - \lambda_{cens}(s|x_i)) \right)$$
is the part that depends on the hazard of the censoring process.
The derivation of the log-likelihood in the previous sections was based on the assumption that censoring occurs at the end of the interval. Definitions and derivations change if one assumes censoring at the beginning of the interval. In this case one has
$$\delta_i = \begin{cases} 1 & \text{if } T_i < C_i, \\ 0 & \text{if } T_i \geq C_i, \end{cases}$$
and
$$L_i = c_i\, \lambda(t_i|x_i)^{\delta_i} \prod_{j=1}^{t_i-1} (1 - \lambda(j|x_i)). \qquad (3.26)$$
For a non-censored observation one defines the sequence $(y_{i1}, \ldots, y_{it_i}) = (0, \ldots, 0, 1)$, yielding
$$L_i = \prod_{s=1}^{t_i} \lambda(s|x_i)^{y_{is}} (1 - \lambda(s|x_i))^{1-y_{is}}.$$
For a censored observation one defines the sequence $(y_{i1}, \ldots, y_{i,t_i-1}) = (0, \ldots, 0)$, yielding
$$L_i = \prod_{s=1}^{t_i-1} \lambda(s|x_i)^{y_{is}} (1 - \lambda(s|x_i))^{1-y_{is}}.$$
In closed form the contribution of the $i$th observation to the likelihood is given by
$$L_i = \prod_{s=1}^{t_i - (1-\delta_i)} \lambda(s|x_i)^{y_{is}} (1 - \lambda(s|x_i))^{1-y_{is}},$$
and the total log-likelihood is
$$l \propto \sum_{i=1}^{n}\sum_{s=1}^{t_i - (1-\delta_i)} \left( y_{is}\log\lambda(s|x_i) + (1 - y_{is})\log(1 - \lambda(s|x_i)) \right).$$
Let us consider the case where censoring occurs at the end of the interval. Then the log-likelihood (3.25) can be used to derive approximate standard errors for the coefficient estimates of a discrete hazard model. It is straightforward to derive the information matrix, which is given by
$$F(\beta) := E\left(-\frac{\partial^2 l(\beta)}{\partial\beta\,\partial\beta^T}\right) = \sum_{i=1}^{n}\sum_{t=1}^{t_i} x_{it} x_{it}^T \left(\frac{\partial h(\eta_{it})}{\partial\eta}\right)^2 / \sigma_{it}^2,$$
where $\eta_{it} = x_{it}^T\beta$ (with $x_{it}$ from model (3.20)) and $\sigma_{it}^2 = h(\eta_{it})(1 - h(\eta_{it}))$.
By standard arguments, asymptotic standard errors of the maximum likelihood estimator $\hat{\beta}$ are obtained from the approximation
$$\hat{\beta} \overset{a}{\sim} N\left(\beta, F(\hat{\beta})^{-1}\right). \qquad (3.27)$$
For example, the asymptotic standard errors presented in Table 3.1 have been computed by using the above approximation. Note that the approximation in (3.27) also allows to construct Wald confidence intervals at the level $1 - \alpha$ for $\hat{\beta}$, which are given by
$$\hat{\beta}_k \pm z_{1-\alpha/2}\sqrt{F(\hat{\beta})^{-1}_{kk}}, \qquad (3.28)$$
where $z_{1-\alpha/2}$ denotes the $(1-\alpha/2)$ quantile of the standard normal distribution.
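In R, the approximation (3.27) corresponds to the estimated covariance matrix returned by vcov() for the fitted binary model. A small sketch of the resulting Wald intervals (3.28), reusing the hypothetical glm fit from the augmented-data example above, is:

## Wald confidence intervals based on (3.27) and (3.28) for the fit 'fit' above.
alpha    <- 0.05
beta.hat <- coef(fit)
se.hat   <- sqrt(diag(vcov(fit)))     # square roots of the diagonal of F(beta)^{-1}
lower <- beta.hat - qnorm(1 - alpha / 2) * se.hat
upper <- beta.hat + qnorm(1 - alpha / 2) * se.hat
cbind(estimate = beta.hat, lower, upper)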
3.5 Time-Varying Covariates
In the previous sections it has been assumed that predictors are fixed values that were observed at the beginning of the duration under investigation. Although we sometimes wrote the model as $\lambda(t|x_i) = h(x_{it}^T\beta)$, this was just a convenient form to denote the variation of the intercept over time. In fact, the actual model was always $\lambda(t|x_i) = h(\gamma_{0t} + x_i^T\gamma)$ with fixed predictor value $x_i$.
In practice, however, it is not uncommon that potential explanatory variables vary over time. For example, general measures of economic activity like business cycles may have an impact on the duration of unemployment of individuals. The corresponding measurements will vary over an individual's duration of unemployment. In the following we will show how time-dependent covariate information may be incorporated into discrete-time hazard models. To this purpose, let $x_{i1}, \ldots, x_{it}$ denote the sequence of observations of covariates for the $i$th unit (or individual) until time $t$, where $x_{it}$ is a vector observed at discrete time $t$ (or, if "discrete time" refers to intervals, at the beginning of the interval $[a_{t-1}, a_t)$). When modeling the hazard one might want to include part of the history up to time point $t$, $\{x_{is}\}_{s \leq t}$. Let this more general hazard be given by $\lambda(t \mid \{x_{is}\}_{s \leq t}) = h(\eta_{it})$, where a simple version of the predictor uses only the current covariate value,
$$\eta_{it} = \gamma_{0t} + x_{it}^T\gamma,$$
whereas more general predictors may also include past values,
$$\eta_{it} = \gamma_{0t} + x_{it}^T\gamma + x_{i,t-1}^T\gamma_1 + x_{i,t-2}^T\gamma_2 + \ldots$$
$$P(\{y_{is}, x_{is}\}_{s \leq t}) = P(y_{it}, x_{it} \mid \{y_{is}, x_{is}\}_{s<t})\, P(y_{i,t-1}, x_{i,t-1} \mid \{y_{is}, x_{is}\}_{s<t-1}) \cdots P(y_{i1}, x_{i1}),$$
$$P(y_{it}, x_{it} \mid \{y_{is}, x_{is}\}_{s<t}) = P(y_{it} \mid x_{it}, \{y_{is}, x_{is}\}_{s<t})\, P(x_{it} \mid \{y_{is}, x_{is}\}_{s<t}),$$
$$P(y_{it}, x_{it} \mid \{y_{is}, x_{is}\}_{s<t}) = \lambda(t \mid \{x_{is}\}_{s \leq t})\, P(x_{it} \mid \{y_{is}, x_{is}\}_{s<t}).$$
Therefore, if $P(x_{it} \mid \{y_{is}, x_{is}\}_{s<t})$ is not informative for the parameters in the hazard rate, one obtains for an observation $T_i = t_i$, which corresponds to $(y_{i1}, \ldots, y_{it_i}) = (0, \ldots, 0, 1)$,
$$P(\{y_{is}, x_{is}\}_{s \leq t_i}) = P(T_i = t_i, \{x_{is}\}_{s \leq t_i}) = c_i\, \lambda(t_i \mid \{x_{is}\}_{s \leq t_i}) \prod_{l=1}^{t_i-1} (1 - \lambda(l \mid \{x_{is}\}_{s \leq l})), \qquad (3.29)$$
where $c_i$ is determined by the non-informative probabilities $P(x_{it} \mid \{y_{is}, x_{is}\}_{s<t})$.
In this representation the crucial assumption is that $P(x_{it} \mid \{y_{is}, x_{is}\}_{s<t})$ is not informative. In particular, when the covariate process depends on survival, that is, on $\{y_{is}\}_{s<t}$, this assumption is questionable. But if, for example, the aim is to model the duration of unemployment, then covariate processes such as countrywide economic activity are hardly determined by the duration of unemployment of single persons. One can also postulate $P(x_{it} \mid \{y_{is}, x_{is}\}_{s<t}) = P(x_{it} \mid \{x_{is}\}_{s<t})$, which corresponds to the concept of an external covariate process (Kalbfleisch and Prentice 2002). A much more general framework for the use of time-varying covariates is provided by counting processes and martingales, see, for example, Fleming and Harrington (2011).
The advantage of representation (3.29) is that under similar assumptions on the censoring process one can rely on the likelihood (3.25). For illustration we consider the case where the hazard depends on the current $x_{it}$ only. If all assumptions hold, one obtains for the log-likelihood of the corresponding model $\lambda(t \mid \{x_{is}\}_{s \leq t}) = h(\gamma_{0t} + x_{it}^T\gamma)$
$$l \propto \sum_{i=1}^{n}\sum_{s=1}^{t_i} y_{is}\log\lambda(s|x_{is}) + (1 - y_{is})\log(1 - \lambda(s|x_{is})).$$
When using software for the corresponding binary response model that describes the transitions, one has to construct again the corresponding design matrices and distinguish between censored and non-censored observations. If $T = t_i$, $\delta_i = 1$ the $t_i$ binary observations and design variables are constructed as in Sect. 3.4; the crucial modification compared to the design matrices in Sect. 3.4 is that the covariate values now vary over time (see the sketch below).
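A hedged sketch of the augmented data with a time-varying covariate: compared to the earlier construction, each period row now carries the covariate value observed at the beginning of that period. All data and names below are artificial.

## Person-period data with a time-varying covariate z (artificial example).
ti     <- c(3, 2, 4)
delta  <- c(1, 0, 1)
## covariate history per subject: one value per period 1, ..., t_i
z.hist <- list(c(1.2, 1.5, 1.1), c(0.4, 0.6), c(2.0, 1.8, 1.7, 1.9))

long.tv <- do.call(rbind, lapply(seq_along(ti), function(i) {
  data.frame(id   = i,
             time = seq_len(ti[i]),
             y    = c(rep(0, ti[i] - 1), delta[i]),
             z    = z.hist[[i]])            # value of z at (the beginning of) each period
}))

## Fitting proceeds exactly as before, e.g. (logit link, illustrative):
## glm(y ~ factor(time) + z, family = binomial, data = long.tv)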
Example 3.3 German Socio-Economic Panel (SOEP)
The Socio-Economic Panel is an ongoing longitudinal survey of about 12,000 private households
in Germany. Data collection started in 1984 and is focussed on topics such as employment, income,
quality of life, and health. In this example the event of interest is drop-out of the SOEP. Since data
are collected annually, the time until drop-out is measured in years (median duration = 3 years).
Approximately 37 % of the households are censored, meaning that they continue to participate in
the study. The data are available after registration at http://www.diw.de/de/soep; for a complete
description of the household data we refer to https://data.soep.de/datasets/11 and https://data.soep.
de/datasets/14. For survival modeling we applied various pre-processing steps to the SOEP data. In
particular, we considered categories such as “could be any value” and “does not apply” as missing
values. Also, the categories “answer improbable” and “no answer” were set to missing.
Fig. 3.5 Coefficient estimates for the baseline hazard rate in Example 3.3 (SOEP data); confidence intervals are 95 % Wald intervals
Table 3.3 Estimated effect of the continuous covariate in Example 3.3 (SOEP data)
Covariate   | $\hat{\gamma}$ | sd($\hat{\gamma}$) | p
HGAHINC     | 0.000059       | 0.00000654         | <0.0001
Table 3.4 Estimated effects of the categorical covariate "household typology" (HGTYP2HH) in Example 3.3 (SOEP data)
Category                          | $\hat{\gamma}$ | sd($\hat{\gamma}$) | p
1-Person HH (ref. category)       | 0.0000         | –                  | –
Single parent, 1 child            | 0.3301         | 0.04788            | <0.001
Single parent, more than 1 child  | 0.1223         | 0.05883            | 0.038
Couple without children           | 0.2098         | 0.02327            | <0.001
Couple, 1 child                   | 0.2640         | 0.02954            | <0.001
Couple, more than 1 child         | 0.2430         | 0.02814            | <0.001
3- and more-generation HH         | 0.2836         | 0.08896            | 0.001
Table 3.5 Estimated effects of the categorical covariate "interview method" (HGHMODE) in Example 3.3 (SOEP data)
Category                                     | $\hat{\gamma}$ | sd($\hat{\gamma}$) | p
With interviewer assistance (ref. category)  | 0.0000         | –                  | –
Oral interview                               | 0.3157         | 0.0887             | 0.0004
Written ques. interviewer                    | 0.2670         | 0.9009             | 0.0030
Written ques. no interviewer                 | 0.6012         | 0.1142             | <0.0001
Oral and written                             | 0.3103         | 0.1059             | 0.0034
With interpreter                             | 0.1371         | 0.1436             | 0.3398
Exc interpreter                              | 0.1459         | 0.0992             | 0.1415
CAPI - Wave O onwards                        | 0.2532         | 0.0897             | 0.0047
Written, by mail                             | 0.2630         | 0.0905             | 0.0036
Phone interview                              | 0.5464         | 0.2276             | 0.0163
Table 3.6 Estimated effects of the time-dependent categorical covariate "month of interview" (HGHMONTH) in Example 3.3 (SOEP data). The reference category with $\hat{\gamma} = 0$ is February. There were no interviews in January, March, and May
Category | $\hat{\gamma}$ | sd($\hat{\gamma}$) | p
April    | 0.345          | 0.0225             | <0.001
June     | 0.668          | 0.0315             | <0.001
July     | 0.786          | 0.0357             | <0.001
Aug      | 0.938          | 0.0333             | <0.001
Sep      | 1.013          | 0.0548             | <0.001
Oct      | 1.306          | 0.0742             | <0.001
Nov      | 1.160          | 0.1596             | <0.001
Dec      | 1.309          | 0.1682             | <0.001
3.6 Continuous Versus Discrete Proportional Hazards
As outlined in Sect. 3.3.2 the Cox (or proportional hazards) model for continuous time, given by $\lambda_c(t|x) = \lambda_0(t)\exp(x^T\gamma)$, yields for data that are grouped in fixed intervals the discrete hazard model $\lambda(t|x) = 1 - \exp(-\exp(\gamma_{0t} + x^T\gamma))$, which is called the grouped proportional hazards model.
If the Cox model holds and continuous lifetimes are observed, estimation is typically based on the maximization of the partial likelihood
$$PL(\gamma) := \prod_{i=1}^{k} \frac{\exp(x_{(i)}^T\gamma)}{\sum_{j \in R(t_{(i)})} \exp(x_j^T\gamma)}, \qquad (3.30)$$
where $t_{(1)} < \cdots < t_{(k)}$ refer to the ordered observed event times in the data (assuming that there are $k$ events) and $x_{(1)}, \ldots, x_{(k)}$ refer to the corresponding vectors of explanatory variables. Since $T$ is continuous, all event times are assumed to be distinct. The terms $R(t_{(i)})$, $i = 1, \ldots, k$, are the so-called risk sets, i.e., the groups of observations that are still alive at time point $t_{(i)}$. By definition of the Cox model in continuous time, the term $\exp(x_{(i)}^T\gamma)$ is a time-constant inflation factor of the conditional hazard of observation $i$. Therefore, the partial likelihood can be interpreted as the product of $k$ risk probability factors, where each factor quantifies the conditional risk of observing an event in observation $i$ at time point $t_{(i)}$.
It follows from (3.30) that the estimation of a Cox model in continuous time is based on a finite set of discrete time points $t_{(i)}$ and is therefore similar to the estimation of a discrete survival model with intervals $[0, t_{(1)}), [t_{(1)}, t_{(2)}), \ldots, [t_{(k)}, \infty)$. The main conceptual difference between the estimation of discrete survival models and the estimation of a continuous Cox model is that the interval borders $t_{(i)}$ are random (and therefore data-dependent) whereas in the case of discrete survival data they are fixed. Also, unlike in discrete survival modeling, each of the intervals $[t_{(i)}, t_{(i+1)})$ contains exactly one observation.
Because of rounded database entries, in practice ties often occur even when the survival times have not been grouped. If the Cox model is applied to such data, the partial likelihood in (3.30) has to be modified, as some of the event times $t_{(i)}$ are not unique but refer to several observations. One way to adjust (3.30) in the presence of ties is to construct the so-called exact partial likelihood (Cox 1972). Assuming that the data contain $k_i$ events in observations $j_1, \ldots, j_{k_i}$ at time point $t_{(i)}$, the factors in (3.30) are replaced by the conditional risks of realizing these events given that there are $k_i$ events at $t_{(i)}$ (Box-Steffensmeier and Jones 2004). The tie-handling method based on the exact partial likelihood is sometimes referred to as the discrete method (for example, within the software system SAS). Other methods for tie handling, which are not considered here, are the Breslow and Efron approximations, and the exact marginal likelihood technique (Kalbfleisch and Prentice 2002).
Applying the exact partial likelihood correction technique reveals another
interesting relationship between the continuous Cox model and discrete survival
modeling. It can be shown that the optimization of the exact partial likelihood is
exactly the same as fitting a so-called conditional logistic regression model to the
data. The latter model is based on the same assumptions as the continuation ratio
model in Sect. 3.2.1 but does not involve estimation of the baseline parameters 0t
(Chamberlain 1980). As a consequence, fitting a Cox model with the exact partial
likelihood to survival data (even if there are may ties and strong grouping effects)
will often yield results that are similar to those obtained from a continuation ratio
model. Finally, it is important to note that if a Cox model in continuous time holds,
estimates obtained for grouped observations when using the original Cox model
differ from those obtained from the grouped proportional hazards model (with
clog-log link). This situation is illustrated in Fig. 3.6, which was produced using
simulated data (n = 100) that were generated from an exponential survival model.
The exponential survival model is a special case of the continuous proportional
hazards model with time-constant hazard function $\lambda(t) = \exp(\gamma_0 + x^T\gamma)$. For the
simulation study we considered five independent normally distributed covariates
$x = (x_1, \ldots, x_5)^T$ with mean $\mu = (4.5, 4.5, 4.5, 4.5, 4.5)^T$ and unit covariance
matrix. The coefficient vector was set to $\gamma = (-0.1, -0.2, -0.3, -0.4, -0.5)^T$, yielding
survival times with median 530 and 75 % quantile 1550. Censoring times were
generated independently from an exponential distribution with mean 2500. Survival
times that were larger than 1500 were also considered censored, yielding an overall
censoring rate of 35 %.
In Fig. 3.6, estimates that use the originally continuous survival data, but also
estimates that use grouped data with different interval lengths are considered.
Specifically, the continuous Cox model based on the exact partial likelihood,
the discrete continuation ratio logit model, and the discrete grouped proportional
hazards model were fitted. The upper panel of Fig. 3.6 shows boxplots for the
estimates of $\gamma_2$, which corresponds to the parameter for covariate $x_2$, for varying
interval lengths (based on B = 100 simulation runs). The first number on the
abscissa refers to the interval length while the second number refers to the resulting
number of intervals. For example, 30 (50) means that the interval length was 30
yielding 50 intervals. White box plots refer to the estimates for the discrete clog-
log model, and gray box plots refer to the estimates for the continuous Cox model
based on the exact partial likelihood. It is seen that estimates based on the grouped
proportional hazards model show larger variability than estimates based on the
continuous Cox model if many intervals (and hence almost the exact times) are used.
However, for strong grouping the variability is smaller for the grouped proportional
hazards model.
The lower panel of Fig. 3.6 depicts the corresponding values of the root mean
squared error
$$RMSE_2 = \sqrt{\frac{1}{B} \sum_{j=1}^{B} \left(\gamma_2 - \hat{\gamma}_2^{(j)}\right)^2} \qquad (3.31)$$
of the estimator $\hat{\gamma}_2$. As expected, estimates obtained from the Cox model in
continuous time (based on the exact partial likelihood) are highly similar to those
Fig. 3.6 Results from a simulation study with five normally distributed covariates. Survival times
and covariate data (n D 100) were generated from a survival model with exponentially distributed
outcome and five covariates. The number of simulation runs was B D 100. The upper panel shows
box plots of the estimates of the regression coefficient $\gamma_2$ referring to the second covariate. White
box plots refer to the discrete clog-log model, and gray box plots refer to the continuous Cox model
based on the exact partial likelihood. The numbers in brackets refer to the number of time intervals;
the dashed line refers to the true value of $\gamma_2$. The lower panel shows the respective RMSE estimates
for various grouping scenarios
obtained from the continuation ratio model with logistic link function. However,
there are notable differences between the root mean squared error values of the
proportional hazards models in continuous and discrete time. Since by definition
the grouped proportional hazards model is correctly specified even if the interval
lengths (and hence the number of ties) are large, RMSE values of this model are
smaller than those of the Cox model in continuous time. It is also seen that larger
interval lengths result in larger RMSE values and thus in a worse performance of
the estimators. This result is to be expected, as stronger grouping usually entails
a loss of information. Nevertheless it is remarkable that the performance of the
grouped proportional hazards model is quite stable. Its performance in terms of
MSE deteriorates only when the number of intervals is quite small; for example, at
the right end the data have been grouped into only five intervals. Very similar results
were obtained for the estimates of the other coefficients, see, for example, Fig. 3.7.
where $S_c(t|x) = P(T > t \mid x)$ is the continuous time survival function. For right-censored
data with $r_i = \infty$ the contribution is reduced to $S_c(l_i|x_i)$. For the case
of fixed intervals $[0, a_1), [a_1, a_2), \ldots, [a_{q-1}, a_q)$ and individuals that do not
miss visits one obtains the same log-likelihood contribution as in discrete survival
(Eq. (3.23)) but with censoring at the end of the interval.
Fig. 3.7 Results from a simulation study with five normally distributed covariates. Survival times
and covariate data (n D 100) were generated from a survival model with exponentially distributed
outcome and five covariates. The number of simulation runs was B D 100. The upper panel shows
boxplots of the estimates of the regression coefficient $\gamma_3$ referring to the third covariate. White
boxplots refer to the discrete clog-log model, and gray boxplots refer to the continuous Cox model
based on the exact partial likelihood. The numbers in brackets refer to the number of time intervals;
the dashed line refers to the true value of $\gamma_3$. The lower panel shows the respective RMSE estimates
for various grouping scenarios
baseline survival function. Then the likelihood contribution of the ith individual is
$$L_i = S_0(l_i)^{\exp(x_i^T \gamma)} - S_0(r_i)^{\exp(x_i^T \gamma)}.$$
The total log-likelihood can be written in a more convenient form. From the sets
$(l_i, r_i)$, $i = 1, \ldots, n$, one determines the set of times $0 = a_0 < a_1 < \cdots < a_{q+1} = \infty$,
such that each $l_i$ and $r_i$ is contained in the set. Then the contribution of the i-th
observation to the likelihood can be expressed as $\sum_{j=1}^{q+1} \alpha_{ij} \{S_c(a_{j-1}|x_i) - S_c(a_j|x_i)\}$,
where $\alpha_{ij} = 1$ if $[a_{j-1}, a_j)$ is a subset of $[l_i, r_i)$, and 0 otherwise. If one assumes
a proportional hazards model in continuous time, one obtains for the total log-likelihood
$$\begin{aligned}
l &= \sum_{i=1}^{n} \log \sum_{j=1}^{q+1} \alpha_{ij} \{S_c(a_{j-1}|x_i) - S_c(a_j|x_i)\} \\
  &= \sum_{i=1}^{n} \log \sum_{j=1}^{q+1} \alpha_{ij} \{S_0(a_{j-1})^{\exp(x_i^T \gamma)} - S_0(a_j)^{\exp(x_i^T \gamma)}\} \\
  &= \sum_{i=1}^{n} \log \sum_{j=1}^{q+1} \alpha_{ij} \{\exp(-\exp(\theta_{j-1} + x_i^T \gamma)) - \exp(-\exp(\theta_j + x_i^T \gamma))\},
\end{aligned}$$
where $\theta_j = \log(-\log(S_0(a_j)))$ represents a parameterization of the baseline survival
function. Finkelstein (1986) derived the first and second derivatives of the log-likelihood
function to obtain maximum likelihood estimates for the survival function
and the parameter $\gamma$.
It should be noted that for the intervals that were built from all interval boundaries
one can also use, for estimation, the representation of the proportional hazards
model with clog-log link as considered in (3.19). The main difference is that
in (3.19) the parameters $\gamma_{0j} = \log(\int_{a_{j-1}}^{a_j} \lambda_0(u)\,du) = \log(\log S_0(a_{j-1}) - \log S_0(a_j))$
are used to parameterize the baseline hazard, while here the parameters $\theta_j$ are used.
When using the likelihood for grouped data as given in Sect. 3.4, one just has to
include the weights ˛ij , which encode the intervals in which censoring occurred.
More on the modeling of individual-specific interval-censored data is found, for
example, in Rabinowitz et al. (1995), Lindsey and Ryan (1998), Cai and Betensky
(2003), Sun (2006), and Chen et al. (2012).
3.8 Literature and Further Reading
Discrete Hazard Models. Thompson (1977) and Prentice and Gloeckler (1978) were
among the first to consider discrete survival data in biometrics. While Thompson
(1977) focussed on the logistic model, Prentice and Gloeckler (1978) considered the
grouped Cox model and derived a score test for hypothesis testing. Early references
for the representation of hazard models as binary Bernoulli trials are Brown (1975)
and Laird and Olivier (1981). An overview of discrete survival models was given by
Fahrmeir (1998).
Extensions. Mantel and Hankey (1978) replaced the parameters of the baseline
hazard by a polynomial. Similarly, Efron (1988) proposed regression splines to
model the large number of baseline hazard parameters. He also discussed exten-
sively the use of the binary response model representation of the discrete hazard
model.
Regression Models for Interval-Censored Data. The proportional hazards model
for interval-censored data was proposed by Finkelstein (1986). Extensions to
additive models with smooth effects were considered in Cai and Betensky (2003).
Estimation techniques for accelerated failure time models were proposed by
Rabinowitz et al. (1995) and Betensky et al. (2001); additive hazards models were
studied by Zeng et al. (2006) and Wang et al. (2010).
Bayesian Approaches. Fahrmeir and Knorr-Held (1997) proposed a flexible
Bayesian nonparametric analysis for dynamic models with time-varying effects.
Fahrmeir and Kneib (2011) considered simulation-based full Bayesian Markov
chain Monte Carlo (MCMC) inference for longitudinal and event history data.
3.9 Software
Discrete hazard models can be fitted in R by using the dataLong function that is
part of the R package discSurv and by applying the glm function for generalized
linear models. In the first step, dataLong is used to convert a set of discrete survival
data to its corresponding binary representation (as demonstrated in Sect. 3.4).
Next, binomial regression models can be fitted by using glm with appropriate link
function. When specifying family = binomial(), a logistic discrete hazard model
is fitted. Gompertz and exponential models can be fitted by specifying family
= binomial(link = "cloglog") and family = binomial(link = "log"), respectively.
Gumbel models can be fitted by specifying family = binomial(gumbel()), with the
gumbel() link function being available as part of the discSurv package.
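The following minimal sketch illustrates this workflow. The data set, its column names ("time", "event", "x1") and the simulated values are purely illustrative, and the argument names of dataLong may differ between versions of discSurv (older releases use censColumn instead of eventColumn).

library(discSurv)

## illustrative data: observed discrete time, event indicator (1 = event,
## 0 = censored), and one covariate
set.seed(1)
dataShort <- data.frame(time  = sample(1:6, 100, replace = TRUE),
                        event = rbinom(100, 1, 0.7),
                        x1    = rnorm(100))

## Step 1: augmented binary representation (one row per subject and period)
dataAug <- dataLong(dataShort, timeColumn = "time", eventColumn = "event")

## Step 2: discrete hazard models via glm()
fitLogit    <- glm(y ~ factor(timeInt) + x1, data = dataAug,
                   family = binomial())                       # logistic model
fitGompertz <- glm(y ~ factor(timeInt) + x1, data = dataAug,
                   family = binomial(link = "cloglog"))       # grouped proportional hazards model
summary(fitLogit)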
3.10 Exercises
3.1 Assume that the proportional hazards model holds for continuous time $T_c$, i.e.,
$\lambda_c(t|x) = \lambda_0(t)\exp(x^T\gamma)$.
Let now time be coarsened by use of the intervals $[0, a_1), [a_1, a_2), \ldots, [a_{q-1}, a_q)$,
$[a_q, \infty)$ and let T denote discrete time with T = t if failure occurs within the interval
$[a_{t-1}, a_t)$. Show that the discrete hazard $\lambda(t|x) = P(T = t \mid T \ge t, x)$ is given by
$\lambda(t|x) = 1 - \exp(-\exp(\gamma_{0t} + x^T\gamma))$,
and write the parameters $\gamma_{0t}$ as functions of quantities of the proportional hazards
model. (Basic concepts of continuous survival are given in Sect. 3.3.1.)
3.2 Consider the congressional careers data described in Example 1.4 with response
“losing the general election.”
1. Convert the data to a set of augmented data with binary response.
2. Fit a proportional continuation ratio model with logistic link function and with
covariates age, priorm, prespart, leader, scandal, and redist to the data.
3. Illustrate the baseline hazard rates graphically over time.
4. Interpret the covariate effects. Which effects are significant at level α = 0.05
(according to the respective t-tests)?
5. Fit the same model using probit and complementary log-log link functions.
Compare the models with respect to their fitted hazard rates and coefficient
estimates.
3.3 Consider a data set collected for the Second National Survey on Fertility of
Italian women. This data set and its covariates are described in detail in Chap. 4
(Example 4.1). In the following, the time to first childbirth (measured in years) will
be the outcome of interest.
1. Convert the data to a set of augmented data. Use 16 intervals for categorizing the
continuous event times: $[0,1], (1,2], \ldots, (14,15], (15,30]$.
2. Fit a proportional continuation ratio model to the data and illustrate the baseline
hazard rates graphically over time.
3. Interpret the covariate effects obtained from the proportional continuation ratio
model. Which effects are significant at level α = 0.05?
3.4 Derive the total log-likelihood of a discrete-time survival model with interval-
censored observations, assuming that the censoring process does not depend on the
parameters of the survival time.
3.5 Recall the definition of the hazard function with discrete time T (Eq. (3.1)).
Discuss the relationship between this function and the hazard function for continu-
ous time, which is given by
$$\lambda_c(t|x) = \lim_{\Delta t \to 0} \frac{P(t \le T_c < t + \Delta t \mid T_c \ge t, x)}{\Delta t}$$
with $\Delta t > 0$.
3.6 Show that the identities for $\lambda_c$, $S_c$, and f stated in Sect. 3.3.1 hold for continuous
survival time. Derive the corresponding relations for discrete time.
3.7 Consider the Weibull model in continuous time, which is defined as
$$\log(T_c) = x^T\gamma + \sigma\epsilon,$$
with $\alpha := 1/\sigma$ and $x^T\gamma = \log(\varphi)$, $\varphi > 0$. Show that this model satisfies the
proportional hazards assumption, i.e., verify that its hazard rate can be written as
Chapter 4
Evaluation and Model Choice
In Chap. 3 basic hazard models for discrete survival have been introduced and
estimation methods have been discussed. In the present chapter we consider
diagnostic tools for these models. In particular we discuss test statistics that evaluate
the significance of predictors, consider goodness-of-fit tests and residuals as well as
measures for the predictive performance and more flexible links. A summary of the
concepts presented in this chapter is given in Fig. 4.1.
The discrete survival model considered in the previous chapter has the form
$\lambda(t|x) = h(\gamma_{0t} + x^T\gamma)$ (4.1). To test whether a single covariate $x_j$ has an effect on the
hazard, one considers the pair of hypotheses
$$H_0: \gamma_j = 0 \quad \text{against} \quad H_1: \gamma_j \neq 0.$$
But this simple pair of hypotheses works only if a variable is represented by only one
parameter. If, for example, one has a factorial explanatory variable or if quadratic
terms of a continuous variable are included, one has to test simultaneously if all the
corresponding parameters are zero. A more general pair of hypotheses that covers
these cases is
$$H_0: C\beta = \xi \quad \text{against} \quad H_1: C\beta \neq \xi,$$
where C is a fixed matrix of full rank $s \le p$ and $\xi$ is a fixed vector. The vector $\beta$ in
the linear hypothesis $C\beta = \xi$ collects all the parameters of the model, that is, $\beta^T =
(\gamma_{01}, \ldots, \gamma_{0q}, \gamma^T)$. For example, if the model contains one factorial explanatory
variable with three categories (represented by the coefficients $\gamma^T = (\gamma_1, \gamma_2)$) and if
the aim is to test this covariate against zero, C becomes
$$C = \begin{pmatrix} 0 & \cdots & 0 & 1 & 0 \\ 0 & \cdots & 0 & 0 & 1 \end{pmatrix}.$$
Fig. 4.1 Summary of basic concepts for the evaluation of discrete hazard models
Let $\hat{\beta}$ denote the maximum likelihood estimate of model (4.1) and $\tilde{\beta}$ denote the estimate
under the constraint $C\beta = \xi$. Then the likelihood ratio statistic is given by
$$LR = -2\{l(\tilde{\beta}) - l(\hat{\beta})\},$$
which quantifies the change of the log-likelihood l (given in (3.25)) when evaluated
at $\hat{\beta}$ and $\tilde{\beta}$. Under regularity conditions LR follows asymptotically a $\chi^2$-distribution
with $s = \text{rk}(C)$ degrees of freedom, where rk(C) denotes the rank of the matrix C.
That means one fits the model twice, once without constraints and once with the
constraints. Then one computes the difference in the log-likelihoods.
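As an illustration, the following hedged R sketch fits a discrete hazard model twice, with and without the constraint that the parameters of a (hypothetical) three-category factor x2 are zero, and computes the likelihood ratio statistic; the augmented data are simulated purely for demonstration.

## simulated augmented data (binary response y, interval factor timeInt,
## covariate x1 and a three-category factor x2) -- for illustration only
set.seed(1)
dataAug <- data.frame(y       = rbinom(300, 1, 0.3),
                      timeInt = factor(sample(1:5, 300, replace = TRUE)),
                      x1      = rnorm(300),
                      x2      = factor(sample(c("a", "b", "c"), 300, replace = TRUE)))

fitFull    <- glm(y ~ timeInt + x1 + x2, data = dataAug, family = binomial())
fitReduced <- glm(y ~ timeInt + x1,      data = dataAug, family = binomial())  # constraint: x2 effects = 0

## LR = -2 {l(beta_tilde) - l(beta_hat)}; df = number of constrained parameters
LR <- as.numeric(2 * (logLik(fitFull) - logLik(fitReduced)))
df <- attr(logLik(fitFull), "df") - attr(logLik(fitReduced), "df")
pchisq(LR, df = df, lower.tail = FALSE)          # p-value
## the same test in one call: anova(fitReduced, fitFull, test = "Chisq")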
Alternative test statistics that can be derived as approximations of LR are the
Wald statistic and the score statistic. The Wald statistic has the form
$$w = (C\hat{\beta} - \xi)^T [C F(\hat{\beta})^{-1} C^T]^{-1} (C\hat{\beta} - \xi),$$
and the score statistic is given by
$$u = s(\tilde{\beta})^T F(\tilde{\beta})^{-1} s(\tilde{\beta}),$$
where $s(\cdot)$ denotes the score function and $F(\cdot)$ the Fisher information matrix. Under
$H_0$ all three statistics have the same asymptotic distribution,
$$LR,\, w,\, u \;\overset{a}{\sim}\; \chi^2(\text{rank}\, C).$$
Table 4.1 Explanatory variables for the time between cohabitation and first childbirth (Second
National Survey on Fertility in Italy)

Variable                                       Categories/unit      Sample proportion/median (range)
Age at the beginning of cohabitation           Years                22.7 (12.20–49.10)
Cohort of birth                                1946–1950            20 %
                                               1951–1955            21 %
                                               1956–1960            20 %
                                               1961–1965            21 %
                                               1966–1975            18 %
Educational attainment                         First stage basic    21 %
                                               Second stage basic   33 %
                                               Upper secondary      37 %
                                               Degree                9 %
Geographic area                                North                45 %
                                               Center               22 %
                                               South                33 %
Occupational status                            Worker               48 %
                                               Non-worker           52 %
Siblings in the family of origin of the woman  0                     7 %
                                               1                    29 %
                                               2                    24 %
                                               3                    40 %
In the following we will analyze the results of various statistical hypothesis tests that were
obtained from modeling the time to first childbirth (grouped in years). Predictor variables included
all six covariates presented in Table 4.1. Observations with missing values in any of the predictor
variables were excluded from the analysis (resulting in a reduced sample size of n = 3147). Also,
event times that were larger than 10 years were considered censored (resulting in a censoring rate
equal to 14.8 %).
Table 4.2 shows the results that were obtained from fitting a discrete-time hazard model with
logistic link to the data. The parameter estimates $\hat{\gamma}$ in Table 4.2 confirm several often observed
results: For example, women with a high educational attainment tend to give birth to their first
children later than women with low educational attainment. Also, women belonging to later cohorts
(1960+) tend to give birth to their first children later than earlier cohorts born before 1961. Table 4.3
shows the corresponding p-values obtained from likelihood ratio, Wald, and score tests. Each
test was applied to each of the covariates separately in order to investigate statistical significance
(“marginal”/“type II” tests). Obviously, all covariates are significant at level α = 0.05, with only
small differences between the covariate-specific test statistics.
4.2 Residuals and Goodness-of-Fit 77
Table 4.2 Years from cohabitation to first childbirth. The table shows the parameter estimates
and estimated standard deviations that were obtained from fitting a logistic discrete-time hazard
model to the data (ageCo = age at cohabitation, edu = educational attainment, area = geographic
area, cohort = cohort of birth, occ = occupational status, sibl = number of siblings)

Covariate                   Parameter estimate    Est. std. error
ageCo                       0.0499                0.0069
edu First stage basic       (Ref. category)
edu Second stage basic      0.0249                0.0745
edu Upper secondary         0.2106                0.0786
edu degree                  0.2693                0.1090
cohort 1946–1950            (Ref. category)
cohort 1951–1955            0.0253                0.0764
cohort 1956–1960            0.2346                0.0781
cohort 1961–1965            0.2920                0.0790
cohort 1966–1975            0.7555                0.0907
area North                  (Ref. category)
area Center                 0.2844                0.0624
area South                  0.6695                0.0617
occ worker                  (Ref. category)
occ non-worker              0.2296                0.0528
sibl                        0.0549                0.0267
Table 4.3 Years from cohabitation to first childbirth. The table shows the test statistics and
p-values of likelihood ratio, Wald, and score tests that were obtained from fitting a logistic discrete-
time hazard model to the data (ageCo = age at cohabitation, edu = educational attainment, area =
geographic area, cohort = cohort of birth, occ = occupational status, sibl = number of siblings)

            LR test                  Wald test                Score test
Covariate   Statistic   p-value      Statistic   p-value      Statistic   p-value
ageCo       53.395      <0.00001     51.774      <0.00001     53.395      <0.00001
edu         18.721      0.00031      18.768      0.00031      18.721      0.00030
cohort      92.095      <0.00001     88.456      <0.00001     92.095      <0.00001
area        117.863     <0.00001     117.898     <0.00001     117.863     <0.00001
occ         18.862      0.00001      18.905      0.00001      18.862      0.00001
sibl        4.220       0.03993      4.217       0.04002      4.220       0.03997
here. In the following we will present strategies on how to construct valid residuals
and goodness-of-fit statistics for discrete survival data.
4.2.1 No Censoring
Let $\hat{\pi}_{it}$ denote the estimated probability of observing the event at time t for an
observation with covariate vector $x_i$, that is,
$$\hat{\pi}_{it} = \hat{P}(T = t|x_i) = \hat{\lambda}(t|x_i) \prod_{s=1}^{t-1} (1 - \hat{\lambda}(s|x_i)).$$
If $n_i$ observations share the covariate value $x_i$ and $p_{it}$ denotes the corresponding
observed proportion of events at time t, the Pearson statistic is given by
$$\chi_P^2 = \sum_{i=1}^{N} \sum_{t=1}^{k} n_i \frac{(p_{it} - \hat{\pi}_{it})^2}{\hat{\pi}_{it}},$$
with observation-specific contributions
$$r_{P,i}^2 = \sum_{t=1}^{k} n_i \frac{(p_{it} - \hat{\pi}_{it})^2}{\hat{\pi}_{it}}.$$
The smaller the Pearson statistic (i.e., the smaller the sum of Pearson residuals), the
better the model fit. An alternative goodness-of-fit statistic is the deviance
$$D = 2 \sum_{i=1}^{N} \sum_{t=1}^{k} n_i\, p_{it} \log\left(\frac{p_{it}}{\hat{\pi}_{it}}\right),$$
with observation-specific contributions
$$r_{D,i}^2 = 2 n_i \sum_{t=1}^{k} p_{it} \log\left(\frac{p_{it}}{\hat{\pi}_{it}}\right).$$
Similar to the Pearson statistic, the deviance becomes small in situations where
the model fits the data well. Under the assumptions of the fixed cells asymptotic
($n_i/N \to \lambda_i \in (0, 1)$) and regularity conditions, $\chi_P^2$ and D are asymptotically
$\chi^2$-distributed with $N(k-1) - p$ degrees of freedom, where p is the number of
estimated parameters. When using a significance level α the model is considered
inappropriate if both or one of the test statistics are larger than the $1-\alpha$-quantile of
the corresponding distribution, $\chi^2_{1-\alpha}(N(k-1) - p)$. It should be noted that these tests
serve as goodness-of-fit tests only if $n_i$ is not too small. For $n_i = 1$, with only one
observation being available at fixed value $x_i$, they are useless since no asymptotic
distribution is available.
Nevertheless, residuals can also be used as diagnostic tools in the case $n_i = 1$. In
this case the squared deviance residual for observation i is
$$r_{D,i}^2 = 2 \log\frac{1}{\hat{\pi}_{i t_i}} = -2 \log(\hat{\lambda}_{i t_i}) - 2 \sum_{s=1}^{t_i - 1} \log(1 - \hat{\lambda}_{is}),$$
where $\hat{\lambda}_{is} = \hat{\lambda}(s|x_i)$. It can be re-written in a form that is familiar from binary
response models,
$$r_{D,i}^2 = -2 \sum_{s=1}^{t_i} \{y_{is} \log(\hat{\lambda}_{is}) + (1 - y_{is}) \log(1 - \hat{\lambda}_{is})\}, \qquad (4.2)$$
where $(y_{i1}, \ldots, y_{i t_i}) = (0, \ldots, 0, 1)$ denotes the transitions over periods. Thus, the
deviance residual can be written as the residual for the binary response vector that
codes the non-transitions. How observations and fit are compared becomes even
more obvious in the representation
$$r_{D,i}^2 = 2 \sum_{s=1}^{t_i} \left\{ y_{is} \log\frac{y_{is}}{\hat{\lambda}_{is}} + (1 - y_{is}) \log\frac{1 - y_{is}}{1 - \hat{\lambda}_{is}} \right\}, \qquad (4.3)$$
which shows that at each time point the discrepancy between data and fit is measured
by $\log(y_{is}/\hat{\lambda}_{is})$ if $y_{is} = 1$ and by $\log((1-y_{is})/(1-\hat{\lambda}_{is}))$ if $y_{is} = 0$. For the Pearson
statistic no such representation seems available.
A transformation of the deviance residuals that typically is closer to a normal
distribution (in case of a well fitting model) is the adjusted deviance residual
$$d_i = \text{sign}(y_{is} - \hat{\lambda}_{is}) \left\{ \sum_{s=1}^{t_i} \left[ y_{is} \log\frac{y_{is}}{\hat{\lambda}_{is}} + (1 - y_{is}) \log\frac{1 - y_{is}}{1 - \hat{\lambda}_{is}} \right] \right\}^{1/2} + \sum_{s=1}^{t_i} \frac{1 - 2\hat{\lambda}_{is}}{\sqrt{36\,\hat{\lambda}_{is}(1 - \hat{\lambda}_{is})}}. \qquad (4.4)$$
Note that we assume $0 \cdot \log(0) \equiv 0$ in (4.4). When considering this type of residual,
model fit can be assessed by inspecting normal quantile–quantile plots.
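A possible way to compute adjusted deviance residuals from a fitted discrete hazard model is sketched below. The augmented data (subject identifier obj, period factor timeInt, binary response y) are simulated purely for illustration, and the sign convention is tied to the last observed period, which is an assumption on our part.

## simulated augmented data; the fitted hazards serve as lambda_hat_is
set.seed(1)
dataAug <- data.frame(obj = rep(1:100, times = sample(1:6, 100, replace = TRUE)))
dataAug$timeInt <- factor(ave(dataAug$obj, dataAug$obj, FUN = seq_along))
dataAug$x1 <- rnorm(nrow(dataAug))
dataAug$y  <- rbinom(nrow(dataAug), 1, 0.25)

fit <- glm(y ~ timeInt + x1, data = dataAug, family = binomial())
dataAug$haz <- fitted(fit)

xlogy <- function(a, b) ifelse(a == 0, 0, a * log(a / b))     # 0 * log(0) := 0

adjDevRes <- sapply(split(dataAug, dataAug$obj), function(d) {
  dev  <- sum(xlogy(d$y, d$haz) + xlogy(1 - d$y, 1 - d$haz))  # bracketed sum in (4.4)
  corr <- sum((1 - 2 * d$haz) / sqrt(36 * d$haz * (1 - d$haz)))
  ## sign convention: residual of the last observed period (assumption)
  sign(tail(d$y, 1) - tail(d$haz, 1)) * sqrt(dev) + corr
})

qqnorm(adjDevRes); qqline(adjDevRes)              # normal quantile-quantile plot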
In the presence of censoring, analysis of the fitted model is more complicated. In this
case one models the observed time periods $\min(T_i, C_i)$. In particular, with $q = k - 1$
one can only observe the combined events
$$(t, \delta = 1): \{T = t, C \ge t\}, \quad t = 1, \ldots, q,$$
$$(t, \delta = 0): \{T > t, C = t\}, \quad t = 1, \ldots, q.$$
Therefore one has to distinguish between 2q categories. Under the random censoring
assumption the corresponding probabilities are
$$\pi_t^{(\delta)} = \begin{cases} P(T = t)\, P(C \ge t) & \text{if } \delta = 1, \\ P(T > t)\, P(C = t) & \text{if } \delta = 0. \end{cases}$$
Let the probabilities for observation i be collected in $\pi_i^T = ((\pi_i^{(1)})^T, (\pi_i^{(0)})^T)$, where
$\pi_i^{(1)} := (\pi_{i1}^{(1)}, \ldots, \pi_{iq}^{(1)})^T$ and $\pi_i^{(0)} := (\pi_{i1}^{(0)}, \ldots, \pi_{iq}^{(0)})^T$. Then the corresponding
$$\log\left(\frac{p_{ir}^{(1)}}{\hat{\pi}_{ir}^{(1)}}\right) = \log\left(\frac{P(T_i = r)}{\hat{P}(T_i = r)}\right) + \log\left(\frac{P(C_i \ge r)}{\hat{P}(C_i \ge r)}\right),$$
$$\log\left(\frac{p_{ir}^{(0)}}{\hat{\pi}_{ir}^{(0)}}\right) = \log\left(\frac{P(T_i > r)}{\hat{P}(T_i > r)}\right) + \log\left(\frac{P(C_i = r)}{\hat{P}(C_i = r)}\right).$$
In order to use the deviance as a goodness-of-fit statistic one has to specify two
models, one for the survival time and one for the censoring process. Goodness-of-fit
then refers to both models. This is somewhat unsatisfying since one would prefer to
evaluate the fit of the survival model separately.
Although the likelihood based on the binary variables that code transitions does
not yield a goodness-of-fit statistic, it can be used to define residuals also in the case
of censored observations. For single observations (i.e., ni D 1) the contribution of
the ith observation to the corresponding “deviance” D D 2l (with l denoting the
log-likelihood (3.25)) is
2
rD;i O i D ti // C .1 ıi / log.P.T
D 2 fıi log.P.T O i > ti //g
X
ti
D 2 yis log.O is / C .1 yis / log.1 O is /; (4.5)
sD1
An alternative type of residual that takes censoring into account and is particularly
suited for assessing the functional forms of predictor effects is the martingale
residual defined by
$$m_i = \delta_i - \sum_{s=1}^{t_i} \hat{\lambda}_{is}, \quad i = 1, \ldots, n, \qquad (4.6)$$
where $\hat{\lambda}_{is} = \hat{\lambda}(s|x_i)$. Here, $\hat{\Lambda}(t_i) = \sum_{s=1}^{t_i} \hat{\lambda}_{is}$ measures the cumulative risk of
observation i up to time $t_i$. This definition is equivalent to the cumulative hazard
function in survival analysis for continuous time (see Sect. 3.3.1 or Klein and
Moeschberger 2003). The idea of the martingale residual is to compare for each
individual the observed number of events up to $t_i$ (measured by $\delta_i$) with the expected
number of events up to $t_i$ (measured by $\hat{\Lambda}(t_i)$). If one uses the binary variables
representation with $(y_{i1}, \ldots, y_{i t_i}) = (0, \ldots, 0, \delta_i)$ the residuals can be defined as
$$m_i = \sum_{s=1}^{t_i} (y_{is} - \hat{\lambda}_{is}), \quad i = 1, \ldots, n.$$
Thus the martingale residuals use the differences between the transition indicators
and the estimated probabilities of failure, in contrast to the deviance residuals, which
use the log-transformed proportion of observation and fit (Eq. (4.3)). It can be shown
that for the logistic model the martingale residuals sum up to zero (Exercise 4.1).
For a well fitting model that includes all relevant predictors, the martingale
residuals should be “random” and uncorrelated with the covariate values. Following
this idea, martingale residuals can be used to assess the importance and the
functional forms of the covariates in a discrete-time survival model. For each
covariate this is done by plotting the residuals vs. the covariate values of interest.
Adding a smooth estimate to the plot (e.g., obtained via a P-spline estimator)
provides further information on the functional form of the respective covariate effect
(see Example 4.2).
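The following self-contained sketch shows how martingale residuals can be plotted against a covariate together with a smooth trend line; the data are simulated, and mgcv::gam with a penalized regression spline is used as a stand-in for the cross-validated P-spline estimator mentioned above.

library(mgcv)

## simulated augmented data with a subject-constant covariate x1
set.seed(1)
dataAug <- data.frame(obj = rep(1:200, times = sample(1:6, 200, replace = TRUE)))
dataAug$timeInt <- factor(ave(dataAug$obj, dataAug$obj, FUN = seq_along))
dataAug$x1 <- rep(rnorm(200), times = table(dataAug$obj))
dataAug$y  <- rbinom(nrow(dataAug), 1, plogis(-2 + 0.8 * dataAug$x1))

fit <- glm(y ~ timeInt + x1, data = dataAug, family = binomial())
dataAug$haz <- fitted(fit)

perSubject <- split(dataAug, dataAug$obj)
delta  <- sapply(perSubject, function(d) tail(d$y, 1))   # event indicator delta_i
cumHaz <- sapply(perSubject, function(d) sum(d$haz))     # cumulative risk Lambda_hat(t_i)
mart   <- delta - cumHaz                                 # martingale residuals (4.6)

x1Subj <- sapply(perSubject, function(d) d$x1[1])
plot(x1Subj, mart, xlab = "x1", ylab = "martingale residual")
trend <- gam(mart ~ s(x1Subj, bs = "ps"))                # P-spline trend line
ord <- order(x1Subj)
lines(x1Subj[ord], fitted(trend)[ord])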
Example 4.2 Promotions in Rank for Biochemists
For illustration we consider a data set from the book by Allison (1995). It contains information on
the careers of 301 biochemists who were employed as assistant professors at graduate departments
in the USA. The event of interest is the promotion of a biochemist to the rank of associate professor
during his/her career. The data are available under the link http://support.sas.com/publishing/bbu/
zip/61339.zip. Observed event times were measured in years and ranged from 1 year to 10 years;
28 % of the event times were censored. The variables that will be used to model the time to
promotion are presented in Table 4.4.
In the following we will analyze the martingale and deviance residuals obtained from a discrete-
time hazard model with logit link. Two models are considered: The first model is the full model
with the six covariates “Ph.D. from medical school?”, “prestige of the Ph.D. institution,” “prestige
of the first employing institution,” “selectivity of the undergraduate institution,” “cumulative
number of articles,” and “cumulative number of citations.” This model is compared to a “reduced”
model, which is the same as the full model but without the covariate “cumulative number of
articles.” Note that the covariates “cumulative number of articles” and “cumulative number of
citations” are time-dependent. We start with the reduced model: In Fig. 4.2 the martingale residuals
of the reduced model are plotted vs. the number of articles in year 1. In addition, the plot contains
a trend line that was estimated via a P-spline with cross-validated smoothing parameter. The
functional form of the trend line suggests that the number of articles in year 1 is an influential
covariate and should be included in the model, as there is a positive linear effect of this variable
on the martingale residuals. Similarly, the normal quantile–quantile plot of the adjusted deviance
residuals of the reduced model (Fig. 4.3) indicates deviations from normality, suggesting that the
model fit is suboptimal.
Table 4.4 Variables that are used to model the time to promotion in rank for biochemists (Allison
1995). Note: None of the biochemists had more than two employers during the observation period

Variable                                    Categories/unit   Sample proportion/median (range)
Selectivity of undergraduate institution    Score             5 (1–7)
Ph.D. from medical school?                  Yes/no            63 %/37 %
Prestige of the Ph.D. institution           Score             3.36 (0.92–4.62)
Number of articles published in year 1                        3 (0–22), n = 301
Number of articles published in year 2                        4 (0–27), n = 299
Number of articles published in year 3                        5 (0–36), n = 292
Number of articles published in year 4                        7 (0–44), n = 263
Number of articles published in year 5                        7 (0–35), n = 211
Number of articles published in year 6                        8 (0–36), n = 149
Number of articles published in year 7                        8 (0–32), n = 96
Number of articles published in year 8                        8 (0–44), n = 59
Number of articles published in year 9                        6 (2–49), n = 42
Number of articles published in year 10                       6 (2–35), n = 29
Number of citations in year 1                                 20 (0–420), n = 301
Number of citations in year 2                                 25 (0–421), n = 299
Number of citations in year 3                                 32.5 (0–566), n = 292
Number of citations in year 4                                 41 (0–566), n = 263
Number of citations in year 5                                 45 (0–579), n = 211
Number of citations in year 6                                 53 (0–430), n = 149
Number of citations in year 7                                 53 (0–496), n = 96
Number of citations in year 8                                 39 (0–724), n = 59
Number of citations in year 9                                 29.5 (1–864), n = 42
Number of citations in year 10                                30 (1–566), n = 29
Prestige of first employing institution     Score             2.56 (0.65–4.64)
Prestige of second employing institution    Score             2.52 (0.65–4.64)
  (= prestige of 1st institution if biochemist did not change employer)
Year of employer change                                       4.5 (2–10), n = 74
  (if biochemist changed employer)
Figure 4.4 shows the martingale residuals obtained from the full model (including the cumula-
tive number of articles as time-dependent covariate). Now the P-spline estimate is close to zero,
indicating only very little correlation between the number of articles in year 1 and the residuals.
Also, the adjusted deviance residuals of the full model (Fig. 4.5) are slightly closer to normality
than the respective residuals of the reduced model in Fig. 4.3. These findings indicate that the model
fit improves by adding the cumulative number of articles to the model. Still, Fig. 4.5 shows that
there remain deviations of the residuals from normality, which might have been caused by other
missing covariates or by an insufficient small-sample approximation of the normal distribution.
Fig. 4.2 Promotions in rank for biochemists. The plot shows the martingale residuals obtained
from the reduced model with covariates “Ph.D. from medical school?”, “prestige of the Ph.D.
institution,” “prestige of the first employing institution,” “selectivity of the undergraduate insti-
tution,” and “cumulative number of citations.” The residuals are plotted vs. the values of the
variable “number of articles in year 1.” The trend line was obtained via a P-spline with cross-
validated smoothing parameter. Obviously, there is a positive effect of the variable on the residuals,
indicating that the number of published articles should be added to the model equation
Fig. 4.3 Promotions in rank for biochemists. The figure contains a normal quantile–quantile plot
of the adjusted deviance residuals obtained from the reduced model with covariates “Ph.D. from
medical school?”, “prestige of the Ph.D. institution,” “prestige of the first employing institution,”
“selectivity of the undergraduate institution,” and “cumulative number of citations.” The plot
indicates deviations from normality.
Fig. 4.4 Promotions in rank for biochemists. The plot shows the martingale residuals obtained
from the full model with covariates “Ph.D. from medical school?”, “prestige of the Ph.D. insti-
tution,” “prestige of the first employing institution,” “selectivity of the undergraduate institution,”
“cumulative number of citations,” and “cumulative number of articles.” The residuals are plotted
vs. the values of the variable “number of articles in year 1.” The trend line (obtained via fitting
a P-spline with cross-validated smoothing parameter) suggests that there is only little correlation
between the number of articles in year 1 and the residuals
Fig. 4.5 Promotions in rank for biochemists. The figure contains a normal quantile–quantile plot
of the adjusted deviance residuals obtained from the full model with covariates “Ph.D. from
medical school?”, “prestige of the Ph.D. institution,” “prestige of the first employing institution,”
“selectivity of the undergraduate institution,” “cumulative number of citations,” and “cumulative
number of articles.” The figure indicates a slightly improved model fit compared to the reduced
model (Fig. 4.3)
The predictive deviance is given by
$$D = -2 \sum_{j=1}^{n^T} \sum_{s=1}^{t_j^T} \left\{ y_{js}^T \log(\hat{\lambda}_{js}) + (1 - y_{js}^T) \log(1 - \hat{\lambda}_{js}) \right\}, \qquad (4.7)$$
where $\hat{\lambda}_{js} = \hat{P}(T_j^T = s \mid T_j^T \ge s, x_j^T)$ and where $T_j^T$ are the (unobserved) true
survival times in the test data. As before, $(y_{j1}^T, \ldots, y_{j t_j^T}^T) = (0, \ldots, 0, 1)$ if $\delta_j^T = 1$
and $(y_{j1}^T, \ldots, y_{j t_j^T}^T) = (0, \ldots, 0, 0)$ if $\delta_j^T = 0$ denote the transitions over periods.
The predictive deviance is equivalent to the negative log-likelihood (3.25) of
a binomial regression model evaluated on the test data. Consequently, prediction
accuracy is large if (4.7) is small and vice versa.
Note that the predictive deviance is an unbounded measure. To facilitate inter-
pretation, it is sometimes convenient to consider $R^2$-type coefficients given by
$$R^2 = \frac{1 - \exp\{(\sum_{j=1}^{n^T} t_j^T)^{-1} (D - D_0)\}}{1 - \exp\{-(\sum_{j=1}^{n^T} t_j^T)^{-1} D_0\}}, \qquad (4.8)$$
where D0 is the predictive deviance obtained from a null model without covariate
information. It can be shown that (4.8) is equal to 1 if prediction accuracy is perfect
and equal to zero if predictions are based on the null model (Nagelkerke 1991).
However, R2 coefficients have several problems when used in practical applications.
In particular, R2 coefficients are often much smaller than 1 even in case of very
well-predicting models, which makes interpretation difficult. Also, in contrast to the
predictive deviance, it is unclear whether R2 coefficients are “proper” in the sense
that they become maximal if computed from the true underlying model (cf. Gneiting
and Raftery 2007).
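A hedged sketch of how the predictive deviance (4.7) and the $R^2$-type coefficient (4.8) can be computed from a single learning/test split is given below (the text uses 50 random splits). The data, column names, and the choice of a baseline-only model as the "null model without covariate information" are assumptions made for illustration.

library(discSurv)
set.seed(1)
dataShort <- data.frame(time  = sample(1:10, 300, replace = TRUE),
                        event = rbinom(300, 1, 0.7),
                        x1    = rnorm(300))
idx   <- sample(nrow(dataShort), size = 200)
learn <- dataShort[idx, ]; test <- dataShort[-idx, ]

learnAug <- dataLong(learn, timeColumn = "time", eventColumn = "event")
testAug  <- dataLong(test,  timeColumn = "time", eventColumn = "event")
learnAug$timeInt <- factor(learnAug$timeInt, levels = 1:10)
testAug$timeInt  <- factor(testAug$timeInt,  levels = 1:10)

fit  <- glm(y ~ timeInt + x1, data = learnAug, family = binomial())
fit0 <- glm(y ~ timeInt,      data = learnAug, family = binomial())  # "null" model (baseline only)

## predictive deviance (4.7), evaluated on the test data
predDev <- function(model, newdata) {
  lambda <- predict(model, newdata = newdata, type = "response")
  -2 * sum(newdata$y * log(lambda) + (1 - newdata$y) * log(1 - lambda))
}
D  <- predDev(fit,  testAug)
D0 <- predDev(fit0, testAug)

## R^2-type coefficient (4.8); nrow(testAug) equals the total number of
## binary observations sum_j t_j^T in the test data
nBin <- nrow(testAug)
R2 <- (1 - exp((D - D0) / nBin)) / (1 - exp(-D0 / nBin))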
Example 4.3 Promotions in Rank for Biochemists
In Chap. 3 we introduced five different types of link functions for discrete-time hazard models
(logistic, probit, Gompertz, Gumbel, and exponential). Because the models result in different
interpretations regarding the magnitude of coefficients and the significance of predictor effects,
it is of interest to investigate which model is most suitable to predict future or unseen survival
times.
Here we use the predictive deviance to investigate which link function results in the best
prediction accuracy for the biochemists data. For statistical analysis we used the six covariates
“Ph.D. from medical school?”, “prestige of the Ph.D. institution,” “prestige of the first employing
institution,” “selectivity of the undergraduate institution,” “cumulative number of articles,” and
“cumulative number of citations.” To compare the models with respect to their predictive
performance, we calculated the predictive deviance in combination with resampling. This was
done as follows: First, we used 50 random splits of the data and generated 50 learning data sets (of
size 2/3·n ≈ 200 each) and 50 test data sets (of size 1/3·n ≈ 101 each). The discrete-time hazard
models were fitted to each of the 50 training data sets, and the respective predictive deviances were
computed from the 50 test data sets.
The resulting 50 values of the predictive deviance are visualized in Fig. 4.6 for the logistic,
probit, Gompertz, and Gumbel models. It is seen that the logistic model resulted in the best
performance among the models. The Gumbel model showed the worst performance; the probit and
Gompertz models also performed slightly worse than the logistic model. Note that the exponential
model was excluded from this study because of its numerical instability (which is due to the
restrictions that have to be imposed on the parameter space, cf. Sect. 3.2.2).
Fig. 4.6 The left panel shows the predictive deviances for the biochemists data, as obtained from
50 test samples of size $n^T = 101$. The right panel shows the predictive deviances of the probit,
Gompertz, and Gumbel models divided by the respective predictive deviances obtained from the
logistic model. Therefore, the right panel provides information on the percentage decrease in
prediction accuracy that resulted from using the probit, Gompertz, and Gumbel link functions
instead of the logistic link function. Note that two outlying values in the Gompertz and Gumbel
models were omitted from the panels
$$\widehat{PE}(t) = \frac{1}{n^T} \sum_{j=1}^{n^T} \left( \hat{S}_j(t) - \tilde{S}_j(t) \right)^2 \left\{ \frac{I(t_j^T \le t)\, \delta_j^T}{\hat{G}_j(t_j^T - 1)} + \frac{I(t_j^T > t)}{\hat{G}_j(t)} \right\} = \frac{1}{n^T} \sum_{j=1}^{n^T} w_j(t) \left( \hat{S}_j(t) - \tilde{S}_j(t) \right)^2, \qquad (4.9)$$
where $\hat{S}_j(t)$ denotes the predicted survival function of test observation j, $\tilde{S}_j(t)$ its
observed survival status, and the weights $w_j(t)$ are based on the estimated survival
function $\hat{G}_j$ of the censoring process (inverse probability of censoring weighting).
estimates. Similarly, groups with missing data are often up-weighted by their inverse
probabilities of being completely observed. Mathematically, inverse probability of
censoring weighting guarantees the consistency of the estimator $\widehat{PE}(t)$ for the mean
of the random variable $(\hat{S}_j(t) - \tilde{S}_j(t))^2$; for details, see in particular van der Laan and
Robins (2003) and also Gerds and Schumacher (2006).
where the running time t has to be considered as a factor when using appropriate
software. Similarly, if the observed survival time is ti and ıi D 0, the censoring
event has taken place at ti and the censoring time Ci is therefore equal to ti . In this
case, the binary observations and design variables for the estimation of the censoring
process are given by
By definition, the PE curve becomes small if the predicted survival functions agree
closely with the observed survival functions. It can further be shown that the PE
curve is a “proper scoring rule” for each t in the sense that it becomes minimal if $\hat{S}$
is equal to the true survival function $S(t|x) = P(T > t \mid x)$ (Gneiting and Raftery 2007). A
time-independent coefficient of prediction error is given by the integrated PE curve,
which is defined as
$$\widehat{PE}_{int} = \hat{E}_T[\widehat{PE}(T)] = \sum_t \widehat{PE}(t)\, \hat{P}(T = t). \qquad (4.10)$$
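For the special case in which censoring can only occur at the end of follow-up (so that, before that time, the weights $w_j(t)$ reduce to 1), the prediction error curve (4.9) and its integrated version (4.10) can be computed as in the following illustrative sketch; the data and variable names are simulated assumptions, and the empirical distribution of observed event times is used as a simple stand-in for $\hat{P}(T = t)$.

library(discSurv)
set.seed(2)
trueTime <- sample(1:12, 250, replace = TRUE)
dat <- data.frame(time  = pmin(trueTime, 8),               # administrative censoring at t = 8
                  event = as.numeric(trueTime <= 8),
                  x1    = rnorm(250))
datAug <- dataLong(dat, timeColumn = "time", eventColumn = "event")
datAug$timeInt <- factor(datAug$timeInt, levels = 1:8)
fit <- glm(y ~ timeInt + x1, data = datAug, family = binomial())

## model-based survival functions S_hat_j(t) = prod_{s <= t} (1 - lambda_hat(s | x_j))
times  <- 1:7                                              # before the censoring time
hazMat <- sapply(times, function(s)
  predict(fit, newdata = data.frame(timeInt = factor(s, levels = 1:8), x1 = dat$x1),
          type = "response"))
survMat <- t(apply(1 - hazMat, 1, cumprod))                # n x length(times)

## PE curve: mean squared difference to the observed survival status I(T > t);
## before t = 8 every subject's status is observed, so the weights equal 1
peCurve <- sapply(seq_along(times), function(k)
  mean((survMat[, k] - as.numeric(dat$time > times[k]))^2))

## integrated PE curve (4.10); simple empirical approximation of P_hat(T = t)
probT <- prop.table(table(factor(dat$time[dat$event == 1], levels = times)))
peInt <- sum(peCurve * as.numeric(probT))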
Brier Score
Fig. 4.7 Years from cohabitation to first childbirth. The plot shows the average prediction error
curves obtained from the full and reduced models. Discrete-time hazard models with logit link
were fitted to 50 learning samples of size n = 2098 each; predictions were obtained from the
remaining sets of observations (of size $n^T = 1049$ each)
Table 4.5 Years from cohabitation to first childbirth. Average integrated PE curves, as predicted
from 50 learning and 50 test samples

                 $\widehat{PE}_{int}$
                 Mean        sd
Full model       0.18939     0.00372
Reduced model    0.19509     0.00372
Null model       0.20772     0.00302
Generally, the Brier score is small if prediction accuracy is high and vice versa. In
contrast to the prediction error curves defined in the previous subsection (which
are also based on quadratic deviations and are sometimes referred to as “Brier
score” in the literature on continuous survival times, see Gerds and Schumacher
2006), the above definition does not involve the survival function but the estimated
probabilities $\hat{\pi}_{it} = \hat{P}(T = t \mid x_i)$.
$$\text{sens}(c, t) := P(\eta > c \mid T = t) \qquad (4.11)$$
and
$$\text{spec}(c, t) := P(\eta \le c \mid T > t), \qquad (4.12)$$
where $\eta$ denotes the predictor (here the linear predictor $x^T\gamma$) and c a threshold value.
The ROC curve plots the hit rate (sensitivity) against the false positive rate (1 −
specificity) for varying thresholds. It typically has a concave shape connecting the
points (0,0) and (1,1). If $\eta$ has a high discriminative power the curve is strongly
concave and has a large area below the curve (see Example 4.5).
Analogously to ROC analysis for binary outcomes (as outlined, for example,
in Pepe 2003) it is possible to calculate the areas under the time-dependent ROC
curves for each time point t. This results in the time-dependent AUC curve, denoted
by AUC.t/. The AUC curve should be larger than 0.5 for all time points, because
0.5 corresponds to the AUC value that is obtained by a null model without covariate
information.
Similar to the integrated prediction error curve, the area under the time-dependent
AUC curve can be used as a time-independent measure of discriminative power.
Following the approach by Heagerty and Zheng (2005), we consider the index
$$C = \sum_t \text{AUC}(t)\, w(t) \qquad (4.14)$$
with weights $w(t) = P(T = t)\, P(T > t) / \sum_t P(T = t)\, P(T > t)$. It can be
shown that C equals the probability $P(\eta_{j_1}^T > \eta_{j_2}^T \mid T_{j_1}^T < T_{j_2}^T)$, which is a global
concordance index measuring the probability that observations with large values
of $\eta$ have shorter survival times than observations with small values of $\eta$. Here,
$\eta_{j_1}^T$, $\eta_{j_2}^T$, $T_{j_1}^T$, and $T_{j_2}^T$ denote the predictors and survival times of two randomly chosen
observations $j_1$ and $j_2$ in the test sample. Analogously to the time-dependent AUC
curve, C should be larger than 0.5 if the predictor $\eta$ performs better than chance.
Following Uno et al. (2007), we estimate sens(c, t) and spec(c, t) by
$$\widehat{\text{sens}}(c, t) = \frac{\sum_j \delta_j^T\, I(\hat{\eta}_j > c \,\cap\, t_j^T = t) / \hat{G}_j(t_j^T - 1)}{\sum_j \delta_j^T\, I(t_j^T = t) / \hat{G}_j(t_j^T - 1)}, \qquad (4.15)$$
$$\widehat{\text{spec}}(c, t) = \frac{\sum_j I(\hat{\eta}_j \le c \,\cap\, t_j^T > t)}{\sum_j I(t_j^T > t)}, \qquad (4.16)$$
respectively, where $\hat{\eta}_j$, $j = 1, \ldots, n^T$, denote the estimates of $x^T\gamma$ in the
test data. Similar to the estimator of the prediction error curve in (4.9), the
weights $1/\hat{G}_j(t_j^T - 1)$ ensure the consistency of (4.15). Estimates of AUC(t)
can be obtained by using numerical integration of the estimated ROC curve
$\{(1 - \widehat{\text{spec}}(c, t),\, \widehat{\text{sens}}(c, t))\}_{c \in \mathbb{R},\, t \ge 0}$. The concordance index C can be estimated by
$$\hat{C} = \sum_t \widehat{\text{AUC}}(t)\, \hat{P}(T = t)\, \hat{P}(T > t) \Big/ \sum_t \hat{P}(T = t)\, \hat{P}(T > t), \qquad (4.17)$$
where $\widehat{\text{AUC}}(t)$ denotes the estimated time-dependent AUC curve.
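The following sketch illustrates (4.15)–(4.17) for the simple case without censoring, in which the weights $1/\hat{G}_j$ reduce to 1; the predictor values and event times are simulated assumptions, and the numerical integration of the ROC curve uses a trapezoidal rule.

set.seed(3)
tt  <- sample(1:8, 150, replace = TRUE)            # event times in the test data
eta <- -0.5 * tt + rnorm(150)                      # predictor: larger eta, earlier event

sensF <- function(c, t) sum(eta >  c & tt == t) / sum(tt == t)
specF <- function(c, t) sum(eta <= c & tt >  t) / sum(tt >  t)

aucT <- function(t) {
  cs  <- c(Inf, sort(unique(eta), decreasing = TRUE), -Inf)
  tpr <- sapply(cs, function(c) sensF(c, t))
  fpr <- sapply(cs, function(c) 1 - specF(c, t))
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)   # trapezoidal rule
}

times    <- 1:7                                    # time points with P(T > t) > 0
aucCurve <- sapply(times, aucT)

## concordance index (4.17): AUC(t) weighted by P_hat(T = t) * P_hat(T > t)
pT   <- sapply(times, function(t) mean(tt == t))
pGt  <- sapply(times, function(t) mean(tt >  t))
Chat <- sum(aucCurve * pT * pGt) / sum(pT * pGt)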
Example 4.5 Promotions in Rank for Biochemists
The residual analysis in Example 4.2 suggested that the addition of the covariate “cumulative
number of articles” improved the fit of the logistic discrete-time hazard model. We now analyze
whether considering this covariate also improves prediction accuracy on external or future data.
To this purpose, we generated a learning sample of size n = 200 that was drawn randomly from
the biochemists data. The remaining observations ($n^T = 101$) were used as a test sample for the
evaluation of prediction accuracy. Again we considered two models: The full model (including
the six covariates already used in Example 4.2) and the reduced model without the covariate
“cumulative number of articles.”
Figure 4.8 shows the ROC curves that were obtained from the test data at time point t = 5.
It is seen that the addition of the cumulative number of published articles leads to an increase of
prediction accuracy at t = 5, as the ROC curve of the full model is above the curve of the reduced
model at almost all thresholds. The diagonal line (with AUC = 0.5) corresponds to a survival
model without covariates. Therefore any model predicting better than chance should result in an
ROC curve that is above the diagonal line. Generally, the closer the ROC curve is to the upper and
left borders of the unit square, the better the predictive performance of the corresponding model
will be. From Fig. 4.8 it is seen that the reduced model (being close to the diagonal line) predicts
only slightly better than chance. Of course, this does not mean that the other covariates do not have
any predictive value in the population. In fact, the relatively small sample size might also have
contributed to the low prediction accuracy of the reduced model.
The AUC values at t = 5 (computed from the test data) were 0.562 for the full model and 0.512
for the reduced model, again indicating that the former model outperforms the latter one and that
Fig. 4.8 Promotions in rank for biochemists. The plot shows the ROC curves (computed from
the test data) that were obtained from the full and reduced models at t = 5. Discrete-time hazard
models with logistic link were fitted to a learning sample of size n = 200; predictions were
obtained from the remaining 101 observations in the data
the reduced model predicts only slightly better than chance at t = 5. The AUC curve for all time
points is presented in Fig. 4.9. It shows that the reduced model performs worse than the full model
at almost all time points, confirming the prognostic value of the cumulative number of articles.
In the final step we computed the summary index $\hat{C}$ for both models. We obtained $\hat{C} = 0.583$
for the full model and $\hat{C} = 0.561$ for the reduced model. This result is in line with the results
obtained from the time-dependent AUC curves.
$$\text{sens}_{cum}(c, t) = P(\eta > c \mid T \le t), \qquad (4.18)$$
Fig. 4.9 Promotions in rank for biochemists. The plot shows the time-dependent AUC curves
(computed from the test data) that were obtained from the full and reduced models. Discrete-
time hazard models with logistic link function were fitted to a learning sample of size n = 200;
predictions were obtained from the remaining 101 observations in the data
unlike $C = P(\eta_{j_1}^T > \eta_{j_2}^T \mid T_{j_1}^T < T_{j_2}^T)$, integrated versions of the AUC curve based
on cumulative sensitivities do not seem to have an easy probabilistic interpretation.
$$\lambda(t|x) = h(\gamma_{0t} + x^T\gamma)$$
with given response function h(·) have been considered and several choices of h(·)
have been discussed. Although these response functions yield satisfactory results in
many applications, it is sometimes possible to improve the model fit by considering
more flexible choices of response functions. More flexible models can either be
obtained if
(1) the response function is embedded into a family of response functions, or if
(2) the response function is estimated nonparametrically.
We first introduce families of response functions; afterwards nonparametric estima-
tors of response functions are briefly described.
$$\lambda(t|x) = F_\xi(\gamma_{0t} + x^T\gamma), \qquad F_\xi(u) = 1 - (1 + \xi \exp(u))^{-1/\xi}, \qquad (4.19)$$
with $\xi > 0$. For $\xi = 1$ one obtains the logistic distribution function; for the limit
$\xi \to 0$ one obtains the clog-log model $F_0(u) = 1 - \exp(-\exp(u))$ (Exercise 4.6).
Thus the family comprises the two models that are most widely used in discrete
survival modeling, namely the logistic and the grouped proportional hazards model.
The function $F_\xi(u) = 1 - (1 + \xi\exp(u))^{-1/\xi}$ is also known as the distribution
function of the log-Burr distribution. The corresponding density is given by $f_\xi(u) =
(1 + \xi\exp(u))^{-1/\xi - 1}\exp(u)$; it is left-skewed for $\xi < 1$ and right-skewed for $\xi > 1$
(see Fig. 4.10). If $\xi = 1$ it is symmetric, which is a well-known property of the logistic
distribution. The generalized logistic distribution has been considered by Prentice
(1975) and Prentice (1976) in the modeling of binary data and by Hess (2009) in
discrete survival modeling. Prentice showed that ξ can be consistently estimated
along with the other parameters by maximum likelihood. A Wald test based on the
estimate of ξ can be used to test the parameter within the family of distributions.
If the logistic model holds ($\xi = 1$) the asymptotic distribution of the maximum
likelihood estimator $\hat{\xi}$ is normal and can be approximated by $N(1,\, 4(\pi^2 + 3)/(\tilde{n}(\pi^2 - 6)))$,
where $\tilde{n}$ denotes the total number of binary observations in the
uncensored case. In the limiting case $\xi \to 0$, the asymptotic distribution of $\hat{\xi}$ is
equal to the distribution of a random variable $\psi_{trunc}$ defined as $\psi_{trunc} = \psi$ if $\psi \ge 0$ and
$\psi_{trunc} = 0$ if $\psi < 0$, where $\psi \sim N(0,\, \pi^2/(\tilde{n}(\pi^2 - 6)))$.
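The response function $F_\xi$ can be coded directly; the short sketch below checks numerically that ξ = 1 gives the logistic response and that small ξ approaches the clog-log response, as stated above.

Fxi <- function(u, xi) 1 - (1 + xi * exp(u))^(-1 / xi)

u <- seq(-3, 3, by = 0.5)
all.equal(Fxi(u, xi = 1), plogis(u))                         # logistic case (xi = 1)
all.equal(Fxi(u, xi = 1e-8), 1 - exp(-exp(u)), tol = 1e-5)   # clog-log limit (xi -> 0)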
Example 4.6 Simulation Study on the Generalized Logistic Family
In the following the generalized logistic family is used to demonstrate that strongly biased effects
can occur if the true model is far away from the typically used logistic and clog-log models. The
illustration closely follows Hess et al. (2014), where the family was used to model the duration
of trade relationships. First we consider a simulation study in which the true response function is
given by the log-Burr distribution (4.19), with the parameter given by ξ = 5. That means the
true underlying model is definitely not a logistic model, but, as shown later, is not unrealistic in
applications.
Let the model contain two explanatory variables, $x_1, x_2$, with parameters given by $\gamma_1 = \gamma_2 = 1$.
The variables are generated as independent random draws from a normal distribution with zero
mean and unit variance. The baseline hazard is given by $\gamma_{0t} = \log(t)$. To illustrate the impact of
response functions on estimation, four different hazard models are fitted: the models with ξ = 0
(clog-log) and ξ = 1 (logit), the true model with ξ = 5, and a probit model which is not nested
in the class of the generalized logistic family. In all models the baseline hazard was modeled
Fig. 4.10 Illustration of the generalized logistic distribution family. The figure depicts the
density $f_\xi(u)$ for various values of the parameter ξ. For ξ = 1 the density is symmetric
Table 4.6 Estimated covariate effects for different response functions, as obtained from the
simulation study described in Example 4.6 (genLog = generalized logistic)

                          Estimated models
                          cloglog   logit    genLog (ξ = 5)   probit
γ̂_1                      0.713     0.815    0.953            0.843
γ̂_2                      0.590     0.738    0.967            0.788
γ̂_1/γ̂_2                 1.209     1.104    0.986            1.070
Hazard ratio at t = 1     1.566     1.643    1.659            1.626
Hazard ratio at t = 12    1.625     1.784    2.177            1.968
by dummies for each discrete time point t ∈ {1, ..., 12}. To make parameters comparable, the
parameters are transformed by using the conversion factors proposed by Amemiya (1981). The
corresponding estimates are denoted by γ̂_1, γ̂_2. Table 4.6 shows an overview of the impact of
response functions on the estimated covariate effects.
It is seen that the covariate effects show almost no bias if the correct response function is used,
but are distinctly underestimated when the response function is misspecified. Table 4.6 also shows
the ratios of the estimated covariate effects, and the results indicate that also the relative effects
of explanatory variables are biased if the response function is misspecified. Moreover, estimated
hazard ratios at the shortest (t D 1) and longest (t D 12) durations were considered. The hazard
ratios were calculated for an increase in x1 from zero to one, keeping x2 D 0. For misspecified
response functions the estimated hazard ratios are smaller than their counterparts obtained from the
correct specification. Also the differences in the hazard ratios at t D 1 and t D 12 vary substantially
across the fitted models. For the model with D 5 and the probit model, the estimated hazard
ratios increase by about 31 and 21 %, respectively, while they are rather constant for the clog-log
model. The latter result was to be expected because the clog-log model is the grouped-duration
analogue of Cox’s proportional hazards model. The effect is also illustrated in Fig. 4.11, which
shows the estimated hazard rates obtained from the clog-log model relative to the true hazard rates
generated by the model with ξ = 5. It is seen that the hazard estimates obtained from the clog-log
model are substantially biased. Small and large hazard rates are overestimated, whereas medium-
sized hazard rates are underestimated. In summary, bias of various forms concerning parameter
estimates, relative effects and hazard rates have to be expected if the true response function is
strongly skewed but standard models are fitted.
Fig. 4.11 Predicted hazards obtained from the simulation study described in Example 4.6. The
true link function was a generalized logistic function with ξ = 5 while the estimated model was a
clog-log model
$$\lambda(t|x) = F_\alpha(\gamma_{0t} + x^T\gamma),$$
that depends on the parameter α. For α → 0 one obtains the grouped proportional
hazards model. For α = 1 one gets the model
$$\lambda(t|x) = 1 - \exp(-(1 + \gamma_{0t} + x^T\gamma)),$$
where the constant 1 is absorbed into the parameters $\gamma_{0t}$. The model for α = 1 is a
discretized version of the additive continuous time model $\lambda_c(t|x) = \lambda_0(t) + x^T\gamma$.
Therefore the family includes the grouped proportional hazards model and a
discretized version of an additive model as special cases. A disadvantage of the
family is that the range of predictors is restricted since the linear predictor has to
fulfill $\gamma_{0t} + x^T\gamma > -1/\alpha$. An extension of the model that includes polynomial
terms was proposed by Tibshirani and Ciampi (1983).
Pregibon (1980) considered a family of link functions for binary responses. For
discrete hazards, these link functions refer to the representation
$$g_{\alpha,\delta}(\lambda) = \frac{\lambda^{\alpha-\delta} - 1}{\alpha - \delta} - \frac{(1 - \lambda)^{\alpha+\delta} - 1}{\alpha + \delta}.$$
The family also contains the symmetric logit link as the limiting case α, δ → 0, but
is not symmetric for ˛; ı > 0. Pregibon (1980) also shows how to test the deviation
of the link function from a hypothesized link function.
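Assuming the representation of the Pregibon family given above, the link can be coded as follows; the numerical check illustrates that the logit link is recovered in the limit α, δ → 0.

gPregibon <- function(p, alpha, delta) {
  (p^(alpha - delta) - 1) / (alpha - delta) -
    ((1 - p)^(alpha + delta) - 1) / (alpha + delta)
}

p <- seq(0.1, 0.9, by = 0.1)
all.equal(gPregibon(p, alpha = 1e-6, delta = 0), qlogis(p), tol = 1e-4)  # logit limit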
Several further families that include the link functions in common use have been
proposed; see Morgan (1985), Stukel (1988), Czado (1992), Czado (1997), and
Koenker and Yoon (2009).
In binary regression several tools have been developed to estimate unknown link
functions. Weisberg and Welsh (1994) proposed to estimate regression coefficients
using the canonical link and then to estimate the link via kernel smoothers given
the estimated parameters. In the next step, all the parameters are re-estimated, and
alternating between estimation of link and parameters yields consistent estimates.
The basic principle of alternating between these two estimates was also used by
Yu and Ruppert (2002), but instead of kernel smoothers the unknown function is
approximated by an expansion in basis functions. Alternatives have been considered
by Ruckstuhl and Welsh (1999) and Muggeo and Ferrara (2008). Leitenstorfer and
Tutz (2011) and Tutz and Petry (2012) used boosting techniques. The latter approach
additionally includes variable selection.
4.6 Software
4.7 Exercises
4.1 Let the logistic model be used to model the hazards. Show that the martingale
residuals defined by $m_i = \delta_i - \sum_{s=1}^{t_i} \hat{\lambda}_{is}$, $i = 1, \ldots, n$, sum up to zero. Hint: Consider
the log-likelihood for the binary representation of the transitions and use that the ML
estimate is found if the derivative of the log-likelihood function (the score function)
is zero.
4.2 Consider the pairfam data, which are described in detail in Chap. 9.
1. Estimate the hazard rate for the time to first childbirth by fitting a logistic discrete
hazard model. Use the covariates “age of woman,” “age of partner,” “duration of
relationship,” and “status of the relationship.”
2. Specify the expressions C, β, and ξ to test the null hypothesis that age of both
partners does not have any effect on the time to first childbirth.
3. Calculate the likelihood ratio (LR) test statistic.
4. Calculate the p-value. Can H0 be rejected?
5. Test the same hypothesis using a Wald test and compare the result to the result
obtained from the LR test.
6. Conduct LR and Wald tests to investigate whether the relationship categories
“living apart together” and “married” are significantly different from the category
“living together.”
4.3 Consider the TTP data from Exercise 2.4 and fit a grouped proportional hazards
model (using the covariates presented in Table 2.4) to the data. Apply square root
transformations to the continuous covariates before including them in the model.
1. Calculate the adjusted deviance residuals for the full model and for the intercept
model without covariates.
2. Analyze the adjusted deviance residuals by inspecting normal quantile–quantile
plots. How does the inclusion of the covariates affect the distribution of the
adjusted deviance residuals?
3. Conduct a goodness-of-fit test on normality of the adjusted deviance residuals
(for example, an Anderson–Darling or a Shapiro–Wilk test).
4.4 Consider the congressional careers data. The aim is to compare different
modeling options for the response variable time to loss of a general election by
carrying out tenfold cross-validation.
1. Subdivide the data randomly into ten folds and create ten training samples for
tenfold cross-validation.
2. Fit five logistic discrete hazard models to each of the ten training data sets.
Specifically, use the following sets of predictor variables:
(a) age, scandal, redist
(b) age, prespart, opengub, redist, district
(c) age, opengub, opensen, district
(d) age, priorm, prespart, opengub, redist, district
(e) age, priorm, prespart, opengub, opensen, redist
3. Compute predictions from the five models using the ten test samples.
4. Evaluate the predictive deviances of the five models.
5. Draw boxplots of the ten predictive deviances for each model and interpret the
results. Which model has the best predictive performance?
4.5 Consider the US unemployment data of Example 1.1. The outcome variable
of interest in Example 1.1 was the time to re-employment (regardless of whether
re-employment was at a full-time job or at a part-time job).
1. Subdivide the observations into ten equally sized parts, to be used for tenfold
cross-validation.
2. Convert all samples to sets of augmented data with binary outcome variables yis .
3. Fit logistic discrete hazard models to the ten training samples. Use the covariates
age, filed unemployment claim, log weekly earnings in lost job, and tenure in lost
job.
4. For each of the ten test samples calculate the true positive rate (TPR), the false positive rate (FPR), and the area under the curve (AUC) at t = 5.
5. Compute the AUC values at each time point and draw the cross-validated time-
dependent AUC curve. In addition, estimate the cross-validated concordance
index. Evaluate the contributions of the four covariates to the prediction accuracy
of the model.
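A much simplified sketch of the evaluation at t = 5 is given below. It assumes a model fit estimated on an augmented training sample (formula y ~ factor(period) + covariates) and a test sample testShort with one row per person and columns time and event; the censoring weights that a proper time-dependent ROC estimator would use are ignored here, and all names are hypothetical.

tEval <- 5

# predicted probability of re-employment up to t = 5: 1 - prod_{s<=5} (1 - hazard(s))
riskAt5 <- sapply(seq_len(nrow(testShort)), function(i) {
  nd <- testShort[rep(i, tEval), , drop = FALSE]
  nd$period <- 1:tEval
  1 - prod(1 - predict(fit, newdata = nd, type = "response"))
})

# observed status at t = 5 (individuals with unknown status at t = 5 are left out)
cases    <- testShort$time <= tEval & testShort$event == 1
controls <- testShort$time >  tEval

# TPR and FPR over a grid of thresholds; AUC as the Mann-Whitney statistic
cuts <- sort(unique(riskAt5))
TPR  <- sapply(cuts, function(c) mean(riskAt5[cases]    > c))
FPR  <- sapply(cuts, function(c) mean(riskAt5[controls] > c))
AUC  <- mean(outer(riskAt5[cases], riskAt5[controls], ">") +
             0.5 * outer(riskAt5[cases], riskAt5[controls], "=="))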
4.6 Show that the generalized logistic distribution family defined in (4.19) converges to the clog-log model as the additional family parameter tends to 0.
4.7 Consider a data set from Singer and Willett (2003) that was obtained from a sample of 180 middle-school boys. After each grade, the boys were asked whether they had had sex for the first time (see Capaldi et al. 1996). As a consequence, all times to the event "first sex" were measured in years. Because boys were observed between the 7th and the 12th grade, the maximum observed event time was 12 years (median of the observed event times: 11 years). The censoring rate was 30 %, implying that 30 % of the boys were still virgins after the 12th grade. The explanatory variable was a binary predictor that indicated whether a boy did not live with his biological parents any more at the beginning of the 7th grade ("parental transition," observed for 60 % of the boys). The data are available at http://www.ats.ucla.edu/stat/examples/alda.
1. Convert the sample to a set of augmented data with binary outcome variable yis .
2. Estimate the hazard rate for the time to first sex by fitting discrete hazard models.
Use the logistic, probit, and cloglog link functions and compare the estimates.
3. Conduct LR and Wald tests to investigate whether the covariate “parental
transition” has a significant effect on the time to first sex.
4. Fit a Cox proportional hazards model in continuous time to the non-augmented
data. Compare the results to those obtained from the grouped proportional
hazards model with cloglog link.
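A possible sketch of this analysis is shown below. It assumes a data frame firstsex with one row per boy, the last observed grade grade, an event indicator event, and the binary covariate pt ("parental transition"); the names are hypothetical, and the helper expandSurv() from the sketch for Exercise 4.2 above is reused.

firstsex$time <- firstsex$grade - 6          # recode grades 7-12 to discrete times 1-6
fsLong <- expandSurv(firstsex, "time", "event")

fitLogit   <- glm(y ~ factor(period) + pt, family = binomial("logit"),   data = fsLong)
fitProbit  <- glm(y ~ factor(period) + pt, family = binomial("probit"),  data = fsLong)
fitCloglog <- glm(y ~ factor(period) + pt, family = binomial("cloglog"), data = fsLong)
c(logit = coef(fitLogit)["pt"], probit = coef(fitProbit)["pt"], cloglog = coef(fitCloglog)["pt"])

# LR test for the effect of "parental transition" (Wald test: see summary(fitLogit))
anova(update(fitLogit, . ~ . - pt), fitLogit, test = "Chisq")

# Cox proportional hazards model in continuous time for comparison
library(survival)
summary(coxph(Surv(time, event) ~ pt, data = firstsex))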
Chapter 5
Nonparametric Modeling and Smooth Effects
The basic discrete survival model considered in the previous chapters has the form $\lambda(t|x_i) = h(\eta_{it})$ with the linear predictor given by
$$\eta_{it} = \gamma_0(t) + x_i^T\gamma, \qquad (5.1)$$
where, for notational convenience, $\gamma_0(t)$ denotes the parameters that vary over time. The model is parametric, with the parameters given by $\gamma_0(1), \ldots, \gamma_0(q), \gamma^T$. More specifically, the predictor $\eta_{it}$ is linear in $\gamma$, implying that each covariate contained in $x_i$ has a linear effect on the transformed hazard $g(\lambda(t|x_i))$. In practice, however, the linearity assumption is often too restrictive, for example, when there are quadratic or logarithmic predictor effects. In the following we will consider models that allow for a more flexible predictor which is not necessarily linear. We will first consider smooth nonlinear versions of the baseline hazard $\gamma_0(t)$. In the next step, additive hazard models that relax the linearity assumption for the covariate effects will be considered. Finally, we will introduce time-varying coefficients that allow parameter estimates to vary smoothly over time.
5.1 Smooth Baseline Hazard
The model with predictor (5.1) assumes that all parameters are fixed. The number of parameters in the model is determined by the number of intervals, because for each interval one has a separate intercept $\gamma_0(t)$. Thus, if the number of intervals is large, the number of parameters is large as well. In particular in this case one obtains more stable estimates if one assumes that the baseline hazard represented by the parameters $\gamma_0(1), \ldots, \gamma_0(q)$ is a smooth function in time. Then the smooth function can be specified by a simpler parameterization that contains fewer parameters.
A common way to fit a smooth function is to assume that the function can be approximated by a finite sum of basis functions. Let the parameters $\gamma_0(1), \ldots, \gamma_0(q)$ be approximated by
$$\gamma_0(t) = \sum_{s=1}^{m} \gamma_{0s}\,\phi_s(t), \qquad (5.2)$$
where $\phi_s(\cdot)$ are fixed basis functions. Common choices are polynomial splines in the form of the truncated power series basis and B-splines, which are considered later. The crucial point is that one considers the parameters $\gamma_0(t)$ as a function in t and approximates this function by a weighted sum of m basis functions. The number of basis functions, m, can usually be chosen much smaller than the number of intervals, q, without losing much in terms of accuracy of fit.
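As a small illustration of the expansion (5.2), the sketch below represents the baseline by a B-spline basis evaluated at the period index and fits the resulting binary model with glm. It assumes an augmented data set dLong with binary outcome y, numeric period period, and a covariate x (names hypothetical).

library(splines)

# gamma_0(t) = sum_s gamma_0s * phi_s(t): the predictor stays linear in the coefficients,
# so an ordinary binary GLM can be used
fitSpline <- glm(y ~ bs(period, df = 5) + x, family = binomial, data = dLong)

# for comparison: the unrestricted baseline with one intercept per interval
fitFactor <- glm(y ~ factor(period) + x, family = binomial, data = dLong)
AIC(fitSpline, fitFactor)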
As a first example let us consider basis functions that are given in the form of polynomial splines. Polynomial regression splines are obtained by dividing the time domain into consecutive intervals $[\tau_i, \tau_{i+1}]$ and representing the unknown function by a separate polynomial of degree d in each interval. In addition, the polynomials are supposed to join smoothly at the knots $\tau_1 < \ldots < \tau_{m_s}$, $m_s = m - 1$, which determine the boundaries of the intervals. In discrete hazard models the knots are chosen from the time domain $[0, q]$. A simple representation of polynomial splines of degree d is the truncated power series basis, which yields
$$\gamma_0(t) = \gamma_{00} + \gamma_{01} t + \ldots + \gamma_{0d} t^d + \sum_{i=1}^{m_s} \gamma_{0,d+i}\,(t - \tau_i)_+^d, \qquad (5.3)$$
with basis functions $1, t, \ldots, t^d, (t-\tau_1)_+^d, \ldots, (t-\tau_{m_s})_+^d$, which form the truncated power series basis of degree d (or, alternatively, of order d + 1).
An alternative representation is by B-spline basis functions, which are defined
recursively. Since the definition is not very instructive they are visualized in Fig. 5.1.
Fig. 5.1 Example of a set of B-spline basis functions of degree d = 3. The gray lines indicate the positions of the boundary knots (at t = 0 and t = 7) and the $m_s = 6$ interior knots. The black lines represent the $m_s + d + 1 = 10$ basis functions. Because d = 3, each basis function is larger than zero in d + 1 = 4 adjacent intervals. The 2d = 6 additional knots needed for the recursive construction of the B-spline basis are not shown here. They were placed at the positions t = −3, −2, −1 and t = 8, 9, 10
Fig. 5.2 This figure shows how a spline function of degree d = 3 (represented by the thick black line) can be constructed from the set of B-spline basis functions defined in Fig. 5.1. Each of the ten basis functions (represented by the dashed and solid black lines below the thick black line) was weighted by a coefficient $\gamma_{0s}$, s = 1, ..., 10. Summing up the weighted basis functions results in the smooth function represented by the thick black line
An alternative choice are radial basis functions, where $\mu_s$ denotes the center of the basis function and $\sigma^2$ is an additional parameter that determines the spread of the basis function. Radial basis functions have been used mainly in the machine learning community; for a more detailed description, we refer to Ripley (1996).
Generally, when time-varying parameters such as the baseline hazard $\gamma_0(t)$ are expanded in basis functions, two strategies are in common use:
• Choosing a small number of basis functions, say 4 or 5, such that numerically
stable estimates of the coefficients exist.
• Choosing a relatively large number of basis functions and using a penalty term
to obtain stable estimates. This approach has been propagated, in particular, by
Eilers and Marx (1996). Generally, a large number of basis functions result in
very flexible spline functions. On the other hand, estimates are usually wiggly
and have many local optima. Therefore, in order to maintain the flexibility and
to obtain sufficiently smooth spline estimates that are numerically stable, penalty
terms are used. Illustrations and examples of penalty terms will be given in the
next subsection.
5.1.1 Estimation
One advantage of the expansion in basis functions (5.2) is that the predictor is again linear, not in the covariates but in the coefficients $\gamma_{0s}$, which is the essential condition for using the GLM framework of Chap. 3. As has been shown in Sect. 3.4, the likelihood for an observation $t_i$ is equivalent to the likelihood for the binary observations $(y_{i1}, \ldots, y_{it_i}) = (0, \ldots, 0, \delta_i)$, which code whether the failure has occurred over the first $t_i$ intervals. The binary model for $y_{it}$ has been specified in the form $\lambda(t|x_i) = h(\gamma_0(t) + x_i^T\gamma)$.
If $\gamma_0(t)$ is expanded in basis functions, one obtains the linear predictor
$$\eta_{it} = \gamma_0(t) + x_i^T\gamma = \sum_{s=1}^{m} \gamma_{0s}\,\phi_s(t) + x_i^T\gamma. \qquad (5.4)$$
Since the values $\phi_s(t)$ are known, one has an "extended" linear predictor
$$\eta_{it} = (\phi_1(t), \ldots, \phi_m(t), x_i^T)\,\beta,$$
where all model coefficients are collected in the vector $\beta$. Consequently, if one uses a small number of basis functions, fitting procedures for binary regression models can be used based on the likelihood representation (3.25).
If one uses a large number of basis functions to obtain increased flexibility, estimation is usually based on penalized log-likelihood approaches. In penalized likelihood estimation, the usual log-likelihood is replaced by
$$l_p(\beta) = l(\beta) - \frac{\lambda}{2} J(\beta),$$
where $\lambda$ is a tuning parameter and $J(\beta)$ is a penalty term. For coefficients of adjacent B-splines a common choice is the difference penalty
$$J_\delta = \sum_{j=\delta+1}^{m} (\Delta^\delta \gamma_{0j})^2, \qquad (5.5)$$
where $\Delta$ is the difference operator on adjacent B-spline coefficients, that is, $\Delta\gamma_{0j} = \gamma_{0j} - \gamma_{0,j-1}$, $\Delta^2\gamma_{0j} = \Delta(\gamma_{0j} - \gamma_{0,j-1}) = \gamma_{0j} - 2\gamma_{0,j-1} + \gamma_{0,j-2}$, etc. This penalty has the effect that the parameters are estimated smoothly, with the degree of smoothness determined by the tuning parameter $\lambda$. If $\lambda$ increases, smoothness is enforced. If one uses, for example, first differences $\gamma_{0j} - \gamma_{0,j-1}$ ($\delta = 1$), in the extreme case ($\lambda \to \infty$) the baseline hazard $\gamma_0(t)$ will become a constant. In general, if a penalty of order $\delta$ is used and the degree of the B-spline is higher than $\delta$, for large values of $\lambda$ the fit will approach a polynomial of degree $\delta - 1$. For this approach Eilers and Marx (1996) coined the term P-splines ("penalized splines").
With $\gamma_0^T = (\gamma_{01}, \ldots, \gamma_{0q})$ the penalty (5.5) has the general form
$$J_\delta = \gamma_0^T K_0 \gamma_0 = \beta^T K \beta, \qquad (5.6)$$
which is in wide use in penalty approaches. The matrices $K_0$ and $K$ are easily constructed for differences of fixed order (Exercise 5.2). It should be noted that using B-splines of degree 1 is equivalent to penalizing the size of the original parameters $\gamma_0(t)$ because then $\gamma_0(t) = \gamma_{0t}$. Maximization of the penalized log-likelihood can be obtained by modifying maximization methods used in GLMs (see, for example, Wood 2006). Alternative smoothing approaches that also use a quadratic form for a penalty are smoothing splines, which are based on non-equidistant knots but which are not considered in detail here (see, e.g., Hastie and Tibshirani 1990 for a description).
By using matrix notation, approximations to standard errors are obtained in a similar way as in generalized linear models. For example, approximate covariances are obtained by the sandwich matrix
$$\widehat{\text{cov}}(\hat\beta) \approx (F(\hat\beta) + \lambda K)^{-1}\, F(\hat\beta)\, (F(\hat\beta) + \lambda K)^{-1},$$
where $F(\beta)$ denotes the information matrix of the ML estimate (see Eilers and Marx 1996).
In R, smooth estimates of the baseline hazard can be obtained by using the gam function in the package mgcv. For estimation one uses the same binary codings of the transitions as in Sect. 3.4, with one data line per interval under risk: for an observation with $T = t_i$, $\delta_i = 1$ the binary outcomes are $(0, \ldots, 0, 1)$, for an observation with $T = t_i$, $\delta_i = 0$ they are $(0, \ldots, 0)$. However, the time column is no longer coded as a factor variable but is treated as a numerical covariate.
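A minimal sketch of such a fit follows; dLong is assumed to be the augmented data set with binary outcome y, the period as numeric variable time, and covariates x1 and x2 (all names hypothetical).

library(mgcv)

# penalized spline for the baseline hazard, linear covariate effects;
# the smoothing parameter is selected by generalized cross-validation
fitGam <- gam(y ~ s(time, bs = "ps") + x1 + x2,
              family = binomial(link = "logit"), data = dLong, method = "GCV.Cp")
summary(fitGam)
plot(fitGam, select = 1)   # estimated smooth baseline gamma_0(t)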
If the intervals from which the discrete time has been generated have varying length and one wants to approximate the underlying continuous time, one might want to modify the expansion in basis functions. Let $\{m_1, \ldots, m_q\}$ denote the mean values of the intervals, that is, $m_i = (a_{i-1} + a_i)/2$. Then the knots are chosen from $[a_0, a_q]$ and the expansion $\gamma_0(t) = \sum_{s=1}^{m} \gamma_{0s}\,\phi_s(t)$ is considered for values $t \in \{m_1, \ldots, m_q\}$. Also the penalty term should be modified because simple differences do not reflect the varying interval lengths. As in smoothing splines one can use the penalty term $\int (\gamma_0^{(2)}(t))^2\,dt$, where $\gamma_0^{(2)}(t)$ denotes the second derivative. Generally, the second derivative is a measure of the wiggliness of a function and is related to the second-order difference penalty, see Eilers and Marx (1996). The penalty term can also be written as the quadratic form $\gamma_0^T K_0 \gamma_0$, with the entries of the matrix $K_0 = (k_{ij})$ given by $k_{ij} = \int \phi_i^{(2)}(t)\,\phi_j^{(2)}(t)\,dt$.
Example 5.1 Munich Founder Study
For illustration we use the Munich Founder study where the dependent variable is the failure time
measured in months (Example 1.2). Figure 5.3 shows the estimates of a continuation ratio model
for founders with working experience less than 10 years and 10 or more years. The black lines in
the upper panel are the estimates of the survival functions (“time to failure”) of the two groups
that were obtained from a continuation ratio model with non-smooth intercepts $\gamma_{0t}$. In contrast, the black lines in the lower panel were obtained by smoothing these estimates, i.e., by fitting a
continuation ratio model that treats time as a continuous variable. Smoothing was accomplished
by using P-spline basis functions of degree d D 3 with a second-order difference penalty and
ms D 6 interior knots. The smoothing parameter was determined by generalized cross-validation
(see Wood 2006).
It is seen that founders with little working experience (<10 years) tend to have smaller survival
rates than founders with 10 or more years of working experience. Also, the model with non-
smoothed intercepts (upper panel of Fig. 5.3) yields estimated survival functions that are close
to a smooth function, which is due to the relatively large number of equally spaced time points
(corresponding to 1-month intervals). Figure 5.4 shows the estimates of the transformed intercept parameters $\gamma_{0t}$, as well as their smoothed versions. It is seen that there is a large variation between the non-smooth coefficient estimates of $\gamma_{0t}$. By applying smoothing (gray line in Fig. 5.4), one
sees a clear trend in the baseline: The baseline risk of failure increases up to a time point of
approximately 15 months. After this time point, the baseline risk gradually decreases.
Fig. 5.3 Munich Founder Study. The two plots show the estimated survival functions for the time
to failure (obtained from fitting a continuation ratio model). The predictor “working experience of
the founder” was considered without smoothing (upper panel) and smoothing (lower panel). Gray
lines represent Kaplan–Meier estimates
If one fits data from a homogeneous population without including any covariates, the
smooth baseline hazard is equivalent to a smoothed life table estimator. Therefore,
if one applies the methods described in the previous sections to life table data, one
obtains a smoothed life table estimator with smoothing based on the expansion in
basis functions.
An alternative approach that can be used for life tables is local smoothing. Local
smoothing borrows strength from the neighborhood of a target value by including
observations that are close to the target value. The latter observations are included
with weights that decrease with the distance between target value and observation.
Fig. 5.4 Munich Founder Study. The plot shows the unsmoothed (dots) and smoothed (curve) estimates of the baseline hazard $\exp(\gamma_0(t))/(1 + \exp(\gamma_0(t)))$ obtained from a continuation ratio model. Estimates represent the hazard rate for the reference group (founders with less than 10 years of working experience)
Let t denote the target value, that is, the value at which the hazard rate $\lambda(t|x)$ is to be estimated. Then for the estimation of $\lambda(t|x)$ one uses not only observations that were collected at t but also observations that were collected in a neighborhood of t. The weighting of neighborhood observations is obtained by including weights into the binomial log-likelihood given in Eq. (3.25),
$$l = \sum_{i=1}^{n} \sum_{s=1}^{t_i} y_{is} \log \lambda(s) + (1 - y_{is}) \log(1 - \lambda(s)),$$
where $y_{is}$ codes the transition from interval $[a_{s-1}, a_s)$ to $[a_s, a_{s+1})$ in the form
$$y_{is} = \begin{cases} 1 & \text{individual } i \text{ fails in } [a_{s-1}, a_s), \\ 0 & \text{individual } i \text{ survives in } [a_{s-1}, a_s), \end{cases}$$
for $s = 1, \ldots, q$.
With $R_s$ denoting the set of individuals at risk in interval $[a_{s-1}, a_s)$, the log-likelihood can be rewritten as
$$l = \sum_{s=1}^{q} \sum_{i \in R_s} y_{is} \log \lambda(s) + (1 - y_{is}) \log(1 - \lambda(s)).$$
For fixed target value t one fits a model that assumes in the simplest case that the hazard is constant, $\lambda(s) = \lambda$. Specifically, one maximizes the weighted log-likelihood
$$l = \sum_{s=1}^{q} \sum_{i \in R_s} \{ y_{is} \log \lambda + (1 - y_{is}) \log(1 - \lambda) \}\, w_\nu(t, s),$$
where $w_\nu(t, s)$ is a weight function that decreases with the distance between t and s, with the decrease depending on a tuning parameter $\nu$. For the extreme weights, i.e., $w_\nu(t, s) = 1$ if $t = s$ and $w_\nu(t, s) = 0$ if $t \neq s$, one obtains by maximizing l
over $\lambda$ the expression $\hat\lambda = \hat\lambda(t) = \sum_{i \in R_t} y_{it} / |R_t|$, which is the number of failures at t divided by the number of individuals at risk. This estimator is the unsmoothed life table estimator. In general one obtains the estimator
$$\hat\lambda = \hat\lambda(t) = \sum_{s=1}^{q} \Big\{ \sum_{i \in R_s} y_{is} / |R_s| \Big\}\, \tilde w_\nu(t, s),$$
where $\tilde w_\nu(t, s) = w_\nu(t, s)\,|R_s| \big/ \big(\sum_{j=1}^{q} w_\nu(t, j)\,|R_j|\big)$ are standardized weights (see Exercise 5.1). It represents a weighted sum of the life table estimates at s, $\sum_{i \in R_s} y_{is} / |R_s|$.
Weight functions that are in common use in localized estimation are built from kernels and have the form $w_\nu(t, s) \propto K((t - s)/\nu)$, where the kernel $K(\cdot)$ is a symmetric density function, for example, the Gaussian density. If $\nu \to 0$, one obtains the unsmoothed life table estimator, whereas for $\nu \to \infty$ one obtains the ultrasmooth estimates $\hat\lambda(1) = \ldots = \hat\lambda(q)$.
Fitting a constant function $\lambda = \lambda(t)$ by using weights yields smooth hazard estimates, but this local constant fitting procedure can suffer from severe bias. Better procedures are often obtained by locally fitting a polynomial. To this purpose, one fits (for fixed target value t) a polynomial that is centered around t. In other words, one fits the model $\lambda_t(s) = h(\beta_0 + (s - t)\beta_1 + \ldots + (s - t)^m \beta_m)$, where $h(\cdot)$ is a response function, for example, the logistic distribution function, and the explanatory term is a polynomial of degree m. The subscript t in $\lambda_t(s)$ is used only as a reminder that one wants to estimate the hazard function at the fixed value t. In this case the weighted log-likelihood is again
$$l = \sum_{s=1}^{q} \sum_{i \in R_s} \{ y_{is} \log \lambda_t(s) + (1 - y_{is}) \log(1 - \lambda_t(s)) \}\, w_\nu(t, s),$$
but the fitted model now is $\lambda_t(s) = h(\beta_0 + (s - t)\beta_1 + \ldots + (s - t)^m \beta_m)$. Maximization is straightforward by using GLM methodology with weights on the observations. With $\hat\beta_0, \ldots, \hat\beta_m$ denoting the estimated parameters, the estimated hazard at t is $\hat\lambda_t(t) = h(\hat\beta_0)$.
The tuning parameter $\nu$ can be chosen by cross-validation, with the performance measured as in Chap. 4. Typical choices for the degree of the polynomial are 1 and 3. More details on local smoothing are found in Hastie and Loader (1993), Loader (1999), and Fan and Gijbels (1996). For an application to the German unemployment data, see Exercise 5.3.
For heterogeneous intervals the weights can be modified to include the distances in continuous time. With $m_i = (a_{i-1} + a_i)/2$ again denoting the middle of the intervals, the weights $w_\nu(t, s)$ are replaced by $w_\nu(m_t, m_s)$.
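A compact sketch of the locally weighted life table estimator with Gaussian kernel weights is given below; dLong is an augmented data set with binary outcome y and period index period (equidistant intervals), and the bandwidth is denoted nu as in the text (names hypothetical).

localHazard <- function(dLong, t, nu) {
  s   <- sort(unique(dLong$period))
  lam <- sapply(s, function(si) mean(dLong$y[dLong$period == si]))  # raw life table estimates
  Rs  <- sapply(s, function(si) sum(dLong$period == si))            # risk set sizes |R_s|
  w   <- dnorm((t - s) / nu)                                        # kernel weights w_nu(t, s)
  wt  <- w * Rs / sum(w * Rs)                                       # standardized weights
  sum(wt * lam)                                                     # smoothed hazard at t
}
localHazard(dLong, t = 18, nu = 1)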
5.2 Additive Models
In the previous section only the baseline hazard was modeled as a smooth function over time whereas the effect of the explanatory variables was still captured by a linear predictor. In additive discrete hazard models this assumption is weakened by assuming that the predictor has the additive form
$$\eta_{it} = f_0(t) + f_1(x_{i1}) + \ldots + f_p(x_{ip}),$$
where the functions $f_0(\cdot), \ldots, f_p(\cdot)$ are unknown and are to be determined by the data. The function $f_0(t)$ corresponds to the time-varying baseline hazard $\gamma_0(t)$ considered in the previous section.
One approach to estimate the functional form of the predictors is again to assume that they can be expanded as a sum of basis functions
$$f_j(x_j) = \sum_{s=1}^{m_j} \gamma_{js}\,\phi_{js}(x_j),$$
where the basis functions may depend on the covariate $x_j$. For example, the basis functions $\phi_{j1}(\cdot), \ldots, \phi_{jm_j}(\cdot)$ represent the basis functions for the jth covariate and have to be defined on the corresponding domain.
The weight parameters $\gamma_{js}$ are again estimated by maximizing a penalized log-likelihood. If one uses an equally spaced grid for the basis functions, one can again use a difference penalty for all covariates. For example, with differences of order $\delta$ one uses the penalty
$$J_\delta = \sum_{j=0}^{p} \sum_{s=\delta+1}^{m_j} (\Delta^\delta \gamma_{js})^2 = \sum_{j=0}^{p} \gamma_j^T K_j \gamma_j = \beta^T K \beta, \qquad (5.7)$$
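In software, such an additive model can again be fitted on the augmented data. The following hedged sketch uses mgcv with a smooth baseline and one smooth covariate effect (e.g., age at the beginning of cohabitation), plus further parametric terms z1 and z2; all names are hypothetical.

library(mgcv)
fitAdd <- gam(y ~ s(time, bs = "ps") + s(age, bs = "ps") + z1 + z2,
              family = binomial(link = "logit"), data = dLong, method = "GCV.Cp")
plot(fitAdd, select = 2)   # estimated smooth effect f(age)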
Table 5.1 Time between cohabitation and first childbirth. The table shows the parameter estimates
and estimated standard deviations that were obtained from fitting a logistic discrete-time hazard
model to the data (edu = educational attainment, area = geographic area, cohort = cohort of
birth, occ = occupational status, sibl = number of siblings, time measured in years). In contrast
to Table 4.2, a smooth nonlinear effect was specified for the “covariate age at the beginning of
cohabitation.” The respective effect estimate is not included here but is visualized in Fig. 5.5
Covariate Parameter estimate Est. std. error
edu First stage basic (Ref. category)
edu Second stage basic 0.0351 0.0747
edu Upper secondary 0.1960 0.0793
edu Degree 0.2787 0.1099
cohort 1946–1950 (Ref. category)
cohort 1951–1955 0.0421 0.0767
cohort 1956–1960 0.2638 0.0787
cohort 1961–1965 0.3112 0.0794
cohort 1966–1975 0.7682 0.0908
area North (Ref. category)
area Center 0.2957 0.0626
area South 0.6784 0.0621
occ worker (Ref. category)
occ non-worker 0.2272 0.0529
sibl 0.0548 0.0269
Fig. 5.5 Time between cohabitation and first childbirth. The solid line shows the P-spline estimate
of the effect of the covariate “age at beginning of cohabitation” that was obtained from a
proportional continuation ratio model. The smoothing parameter was obtained via generalized
cross-validation. The dashed line corresponds to the respective estimate that was obtained when
including the covariate in a linear fashion (Example 4.1). Both effect estimates were centered such
that the fitted values computed from the data had zero mean
5.3 Time-Varying Coefficients
In the previous sections it has been assumed that the effect of the covariates is the same for all transitions from t to t + 1, that is, it has been assumed that the predictor in the model $\lambda(t|x_i) = h(\eta_{it})$ has the form $\eta_{it} = \gamma_0(t) + x_i^T\gamma$ with fixed parameter $\gamma$. But in many applications it has to be assumed that the effects vary over time. In particular when the covariates code some initial condition, for example, the type of treatment at the beginning of the study, the effect on the hazard at earlier times is expected to be stronger than at later times during the study.
A more general approach lets the effects vary over time,
$$\eta_{it} = \gamma_0(t) + x_i^T\gamma(t), \qquad (5.8)$$
where $\gamma(t) = (\gamma_1(t), \ldots, \gamma_p(t))$ is a vector-valued function of time with $\gamma_j(t)$ representing the time-varying coefficient of the jth covariate.
It is quite natural to assume that the effects for one covariate vary smoothly over time. Within the basis functions approach one assumes that the functions can be represented by
$$\gamma_j(t) = \sum_{s=1}^{m} \gamma_{js}\,\phi_s(t). \qquad (5.9)$$
It can be assumed that the basis functions are the same for all variables since they all are defined on the same time domain. With $x_{i0} = 1$ the basis functions yield the predictor
$$\eta_{it} = \sum_{j=0}^{p} x_{ij}\,\gamma_j(t) = \sum_{j=0}^{p} \sum_{s=1}^{m} x_{ij}\,\phi_s(t)\,\gamma_{js},$$
which can be written as
$$\eta_{it} = (x_{i0}\phi_1(t), x_{i0}\phi_2(t), \ldots, x_{ip}\phi_m(t))\,\beta,$$
where $\beta$ collects all the parameters to be estimated. The penalty is the same as for additive models, only the values of the predictors differ from those of the additive model.
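One way to fit such time-varying coefficients in practice is via varying-coefficient smooths. The sketch below uses mgcv, where s(time, by = x1) represents the term $x_1\gamma_1(t)$; the data set dLong and all variable names are hypothetical.

library(mgcv)
fitTvc <- gam(y ~ s(time, bs = "ps") + s(time, by = x1, bs = "ps") + x2,
              family = binomial(link = "cloglog"), data = dLong)
plot(fitTvc, select = 2)   # estimated coefficient function gamma_1(t) of x1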
The penalty for smooth effects is the same as for additive models, i.e.,
$$J_\delta = \sum_{j=0}^{p} \sum_{s=\delta+1}^{m} (\Delta^\delta \gamma_{js})^2 = \sum_{j=0}^{p} \gamma_j^T K_j \gamma_j = \beta^T K \beta. \qquad (5.10)$$
The penalty $\gamma_j^T K_j \gamma_j$ only enforces smoothness of the estimated function but not selection of covariates. Moreover, when using B-splines, the penalty does not penalize all parameters but shrinks the estimate towards a polynomial. For example, if $K_j$ represents first order differences between adjacent basis functions, a constant term is not penalized, and the parameters are shrunk towards a constant. Therefore, to obtain selection of covariates alternative penalties have to be used.
For simplicity, let us consider first order differences with difference vector $\delta_j = (\gamma_{j2} - \gamma_{j1}, \ldots, \gamma_{jm} - \gamma_{j,m-1})^T$. A penalty that enforces variable selection is
$$J = \lambda_0\,\gamma_0^T K_0 \gamma_0 + \lambda_1 \sum_{j=1}^{p} \|\delta_j\| + \lambda_2 \sum_{j=1}^{p} \|\gamma_j\|, \qquad (5.11)$$
where the tuning parameters are already included. The first term of the penalty is a simple smoothing penalty, which enforces the smoothness of the baseline hazard, with the amount of smoothing determined by $\lambda_0$. The second term simultaneously penalizes the vector of differences. By using the norm of $\delta_j$ rather than the squared norm (which would mean smoothing) the components in the difference vector are simultaneously shrunk towards zero. The term is strongly related to the group lasso, which is designed to simultaneously select sets of variables, see Yuan and Lin (2006). For $\lambda_1 \to \infty$ all differences are set to zero and one obtains for each variable a constant term in the linear predictor. The third term penalizes the weight parameters themselves. It enforces that also constant effects are set to zero and enforces selection of variables.
The use of basis functions together with tailored penalty terms has the advantage
that one can distinguish between effects that vary over time and effects that are
constant. In alternative approaches that use localization techniques, where the
effects are estimated by using weights on the observations in the neighborhood of
the time for which one wants to estimate the effect, all effects are estimated as
time-varying. We refer to Kauermann et al. (2005) for an application of localization
methods in discrete survival.
Example 5.3 Munich Founder Study
For illustration we use the Munich Founder study introduced in Example 1.2 and use the
penalty (5.11) to determine which of the variables have time-varying coefficients. For all coefficient
functions cubic B-splines with ten interior knots were used. The tuning parameter for the baseline
effect was fixed at 0.001, the tuning parameters for the other terms in the penalty were determined
by fivefold cross-validation. The discrete survival model used the complementary log-log link.
(Fig. 5.6, described below, contains one panel per covariate, including Legal Form, Equity Capital, Seed Capital, Debt Capital, Education Degree, Target Market, Clientele, Professional Experience, Number of Employees, and Gender, each estimated coefficient plotted against time in quarters.)
Fig. 5.6 Smooth effect estimates for the Munich Founder Study. The bold curves represent the
estimates resulting from the use of cubic B-splines with penalty (5.11). Dashed-dotted curves
stand for the estimates from mgcv. Dashed lines represent 95 % bootstrap confidence intervals
Figure 5.6 shows the estimated time-varying coefficients. The curves represent the estimates
resulting from use of penalty (5.11). Dashed-dotted curves stand for the estimates when using the
R package mgcv (Wood 2015), which also offers an option to select variables. The dashed lines
represent 95 % bootstrap confidence intervals for the estimates obtained by use of (5.11). The
coefficient curve for age is omitted because age was estimated to have no impact on the survival of
firms. The resulting curves for (5.11) and mgcv are rather similar for most of the effects, although
5.3 Time-Varying Coefficients 121
mgcv has several internal constraints. For example, it enforces the inclusion of a main effect if
a variable is categorical and selects only metric covariates. It is seen that some variables show
distinct variation of the effects over time while others can be considered as constant over time. For
example, the effect of debt capital (1 = yes, 0 = no) is decreasing over time while the legal form
(1 = partnership, 0 = small trade) has a negative but time-constant effect. For more details, see
Möst (2014).
Models with time-varying coefficients are more flexible by allowing for the variation of the effect size over time. However, one should be careful when interpreting these effects because time-varying effects can also be found when the link between response and covariate is misspecified. This may be seen from a simple example where the true effect of a covariate is quadratic; Fig. 5.7 shows the spurious time variation that is estimated when a model with time-varying effects is fitted to such data.
A more general model lets the effect of each covariate vary over both the covariate values and time, with predictor
$$\eta_{it} = f_0(t) + f_1(x_{i1}, t) + \ldots + f_p(x_{ip}, t),$$
where the two-dimensional functions are expanded in tensor products of basis functions,
$$f_j(x_j, t) = \sum_{s=1}^{m_j} \sum_{l=1}^{m} \gamma_{jsl}\,\phi_{js1}(x_j)\,\phi_{jl2}(t),$$
where $\phi_{j11}(\cdot), \ldots, \phi_{jm_j1}(\cdot)$ denote the basis functions for variable $x_j$, and $\phi_{j12}(\cdot), \ldots, \phi_{jm2}(\cdot)$ denote the basis functions for time. Since many parameters are involved, penalty terms are needed that restrict the variation over time and covariates. Penalties for differences $\gamma_{j,s+1,l} - \gamma_{jsl}$ and $\gamma_{j,s,l+1} - \gamma_{jsl}$ were considered by Eilers and Marx (2003) and Currie et al. (2004). In continuous time survival modeling the general hazard regression model (HARE) of Kooperberg et al. (1995) uses the tensor-product approach.
Fig. 5.7 Mean estimates of $\gamma_0(t)$ (left panel) and $\gamma(t)$ (right panel) when fitting a model with time-varying effects $\eta_{it} = \gamma_0(t) + x_i^T\gamma(t)$ to data generated from a model with quadratic effect $\eta_{it} = x_i^2$ (dashed lines: empirical confidence bands, dashed-dotted lines: estimated confidence bands)
The model is very flexible by allowing effects of covariates and time to vary freely but yields as many two-dimensional surfaces as variables are available. A more restrictive form has been considered by Tutz and Binder (2004). The predictor
$$\eta_{it} = f_0(t) + \alpha_{t1}\, f_1(x_{i1}) + \ldots + \alpha_{tp}\, f_p(x_{ip}) \qquad (5.12)$$
assumes that the basic form of the jth predictor, $f_j(\cdot)$, is the same but the strength of the effect varies over time. For example, if $\alpha_{tj}$ decreases over time, the effect of the jth predictor is damped. Since the model includes multiplicative effects, alternative fitting methods have to be used, see Tutz and Binder (2004). Similar models were considered by Abrahamowicz and MacKenzie (2007) for continuous time data. For an application, see Example 5.4.
5.4 Inclusion of Calendar Time
In studies that cover a long time period, calendar time may have an additional effect on the hazard function. For example, in the analysis of the duration of unemployment the hazard might not only be determined by the time a person is unemployed, but also by calendar time (which reflects the economic conditions under which an unemployed person is looking for a job).
Calendar time can be incorporated into discrete hazard models as follows: Let the time of duration as well as calendar time be measured on the same discrete scale, representing, for example, weeks or months. The linear predictor, which determines the hazard for person i after t units of time, can be specified by
$$\eta_{it} = \gamma_0(t) + \gamma_c(c_i + t) + x_i^T\gamma, \qquad (5.13)$$
where $c_i$ is the calendar time at the beginning of the duration time that is to be investigated. For example, when modeling the duration of unemployment, $c_i$ is the calendar time when unemployment started, $\gamma_0(t)$ is the effect on the hazard after t months of unemployment, and $\gamma_c(c_i + t)$ is the effect on the hazard at the corresponding calendar time. Both functions $\gamma_0(\cdot)$ and $\gamma_c(\cdot)$ are unknown and are assumed to be smooth functions. Estimates of the unknown functions can be obtained by expanding them in basis functions and using penalty methods as in additive models. Of course it is necessary that there is enough variation in the sample; if unemployment for most of the observations starts at about the same time, there is not enough information to distinguish between the two time scales.
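A sketch of such a two-time-scale fit with mgcv follows; the augmented data set is assumed to contain the duration time, the calendar time at the start c0, and covariates x1 and x2 (all names hypothetical).

library(mgcv)
dLong$calTime <- dLong$c0 + dLong$time     # calendar time c_i + t of each data line
fitCal <- gam(y ~ s(time, bs = "ps") + s(calTime, bs = "ps") + x1 + x2,
              family = binomial(link = "logit"), data = dLong)
plot(fitCal, pages = 1)   # estimated gamma_0(t) and gamma_c(c_i + t)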
Example 5.4 Psychiatric Hospital Data
For illustration of the semiparametric model (5.12) and the inclusion of calendar time we give
an application that was considered by Tutz and Binder (2004). For 1922 patients of a German
psychiatric hospital with diagnosis “schizophrenic disorder of paranoid type” the modeled response
was the time spent in hospital measured in days. The covariates that were modeled as smooth
functions were age at admission, calendar time between January 1, 1995 and December 31, 1999
(measured in days), and the GAF (Global Assessment of Functioning) score at admission, which
evaluates the patient’s level of functioning. Large values indicate high levels of functioning.
Categorical predictors that were included as 0–1 variables were gender (“Male,” 1 = male),
education (“Edu,” 1 = above high school level), partner situation (“Part,” 1 = has a permanent
partner), job situation (“Job,” 1 = full/part time job at admission), first hospitalization (“First,” 1
= first admission in a psychiatric hospital), and suicidal action (“Sui,” 1 = suicidal act previous to
admission). The total predictor of the model had the form
it D f0 .t/ C fT .calendar time/ C Male ˇMale C Edu ˇEdu C Part ˇPart
CJob ˇJob C First ˇFirst C Sui ˇSui C fA .Age/ C t fG .GAF score/
with smooth functions fT , fA , and fG . Table 5.2 shows the parameter estimates of the parametric
terms. The strongest effect is found for the variable “partner situation,” all others can be neglected.
Figure 5.8 shows the smooth estimates for the continuous variables. It is seen that there is a
tendency that younger people stayed longer in the hospital. The effect of calendar time signals
that the time spent in hospital decreases almost continuously with calendar time. The GAF score,
which is an assessment score at admission, indicates that a lower level of functioning resulted in a
lower probability of dismissal. It seems that there is an essential difference between low and high
GAF score, with the effect changing only between 30 and 50 points. The estimates of the modifying factors $\alpha_t$ (lower right panel of Fig. 5.8) show that the effect of the GAF score at admission vanishes over time, indicating that the predictive power of the initial score diminishes.
Table 5.2 Estimates of the parametric terms of the model for psychiatric hospital data
Covariates Estimates Standard deviations
Male (male) 0.0525 0.0550
Edu (above high school) 0.0387 0.0716
Part (has a permanent partner) 0.2619 0.0654
Job (full/part time job at admission) 0.0428 0.0731
First (first admission in a psychiatric hospital) 0.0845 0.0680
Sui (suicidal act previous to admission) 0.0828 0.1145
Fig. 5.8 Estimated smooth effects for the psychiatric hospital data (estimates: solid lines;
confidence bands: dashed lines)
5.6 Software
After transforming the original data into binary data for coding transitions over
intervals, one can fit smooth baseline hazards and additive models by using software
packages for binary regression. For example, an efficient implementation of additive
models is contained in the gam function of the add-on package mgcv (Wood 2015).
The mgcv package contains a variety of options for the specification of basis
functions (specified via the bs argument in the model formula). In particular, it is
possible to estimate nonlinear predictor effects via cubic regression splines and
P-splines. The method argument of gam provides several algorithms to estimate
smoothing parameters (e.g., generalized cross-validation and restricted maximum
likelihood estimation). For details, we refer to Wood (2015).
5.7 Exercises
5.1 Locally constant life table estimators use for fixed target value t the weighted
log-likelihood
X
q X
lD fyis log C .1 yis / log.1 /g w .t; s/:
sD1 i2Rs
Jı D T0 K 0 0 D ˇ T Kˇ
for 0T D .01 ; : : : ; 0q / and calculate the entries kij of the matrix K explicitly for
ı D 1; 2.
5.3 Apply local polynomial fitting to the German unemployment data introduced in Example 2.1.
(a) Convert the data to a set of augmented data with binary response.
(b) Consider the fixed target value t = 18 and calculate the hazard estimate $\hat\lambda(18)$ by using a Gaussian kernel with bandwidth $\nu = 1$.
(c) Next consider local polynomial smoothing by fitting the hazard $\lambda_t(s) = h(\beta_0 + (s - t)\beta_1 + \ldots + (s - t)^m \beta_m)$ with degree m = 3. For each of the bandwidths $\nu = 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 3, 10, 50$ estimate the hazard rate at t = 18.
(d) Compare the resulting estimates to the raw hazard estimates in Table 2.1.
5.4 In this exercise we consider additive models with smooth baseline hazard and
smooth covariate effects. To this purpose, the data set on promotions in rank for
biochemists (Example 4.2) is considered.
(a) Convert the data to a set of augmented data with binary response.
(b) Fit a discrete hazard model with logistic link function and a smooth baseline
hazard. Consider the prestige of the Ph.D. institution (smooth effect), the
cumulative number of published articles, and the cumulative number of citations
as covariates. Use penalized B-splines to estimate smooth effects and determine
the smoothing parameters via generalized cross-validation. Which covariates
have a significant effect (at level $\alpha = 0.05$) on the time to event?
(c) Plot the estimated spline functions for the baseline hazard and the effect of the
prestige of the Ph.D. institution. How does the prestige of the Ph.D. institution
affect the hazard?
5.5 The aim is to fit a discrete survival model to a set of echocardiogram data that
was collected for analyzing the survival behavior of patients after heart attack. The
data, which were originally collected at The Reed Institute (Miami), are stored in
the UCI Machine Learning Repository (Lichman 2013) and are publicly available
at https://archive.ics.uci.edu/ml/datasets/Echocardiogram. All patients contained in
the data (n D 130) suffered a heart attack at some point in the past; the survival
time after heart attack was measured in months. The event of interest is death of the
patients. The median time of observed patient survival was 24.5 months; 32.3 % of
the observations were censored. A description and summary of the covariates are
given in Tables 5.3 and 5.4, respectively.
(a) Prepare the data set for analysis:
(i) Delete observations with missing values in the response.
(ii) Generate an event indicator (1 = observed death, 0 = still alive).
(iii) Impute missing values in the continuous covariates by inserting the
respective sample means.
Table 5.3 Description of the covariates contained in the echocardiogram data (Exercise 5.5)
Variable Description
AHA Age at heart attack (measured in years)
PE Indicator if pericardial effusion is fluid (= 1) or not (= 0)
FS A measure of contractility around the heart; lower numbers are
increasingly abnormal
epss e-point septal separation; another measure of contractility; larger numbers
are increasingly abnormal
lvdd Left ventricular end-diastolic dimension; measures the size of the heart at
end-diastole
WMI Wall-motion-score; measures how the segments of the left ventricle are
moving; standardized by the number of segments seen
Table 5.4 Descriptive analysis of the covariates contained in the echocardiogram data (Exer-
cise 5.5)
Variable Categories/unit Sample proportion/median (range)
AHA Years 62 (35–86)
PE Fluid 18.5 %
Otherwise 81.5 %
FS Score/continuous 0.22 (0.01–0.61)
epss Score/continuous 12 (0–40)
lvdd Score/continuous 12 (0–40)
WMI Score/continuous 1.22 (1–3)
Chapter 6
Tree-Based Approaches
The modeling techniques described in Chaps. 3 and 5 are based on the assumption
that the predictor is an additive function of the covariates. Although this assumption
is intuitive and facilitates interpretation of the models, it often happens in practice
that additive predictors may not capture the true data structure very well. This is,
for example, the case when interactions between categorical covariates are present,
i.e., when the combination of two (or more) levels of some covariates affects the
survival in more than just an additive way.
In principle, interaction terms (being the products of two or more covariates) may
be easily incorporated into discrete hazard models. However, the necessary speci-
fication of the model formula requires data analysts to know about the respective
interactions in advance. In other words, unknown interactions between covariates,
especially if they involve more than two variables, cannot be automatically detected
by the models presented in Chaps. 3 and 5. Although, in principle, one could apply
traditional stepwise variable selection techniques (known from linear regression)
to build models that contain both main effects and interaction terms, this strategy
is often unfeasible because of the large number of possible interaction terms.
Moreover, backward selection strategies face the problem that estimation becomes
numerically unstable or that estimators do not even exist if the number of parameters
is large relative to the sample size. This issue is particularly cumbersome when
the covariate space contains a set of categorical variables with large numbers of categories, or when the predictor space is high-dimensional ("small n, large p").
Recursive partitioning techniques (also termed “tree-based” techniques) are a
popular method to address these problems. The method has its roots in automatic
interaction detection (AID), proposed by Morgan and Sonquist (1963). A modern
version is due to Breiman et al. (1984) and is known by the name of classification
and regression trees, often abbreviated as CART. In this chapter we consider
versions of recursive partitioning that are useful in discrete survival modeling.
6.1 Recursive Partitioning
The idea of recursive partitioning is to subdivide the covariate space into a set
of rectangles and then fit a covariate-free model to the observations in each of
them. For example, if the outcome variable is a continuous event time, a covariate-
free estimate of the survival function is obtained by applying the Kaplan–Meier
estimator. The most popular algorithm for recursive partitioning is the classification
and regression tree (CART) algorithm (Breiman et al. 1984), which is defined by the
following procedure: Starting with the “root node” (referring to the unpartitioned
covariate space and containing the complete set of observations), partitioning is
done in a hierarchical way by recursively applying binary splits to the data (see
Fig. 6.1 for an illustration).
Fig. 6.1 Illustration of recursive partitioning. The first split subdivides the covariate space into two rectangles at threshold c1. In a second step, one of the resulting rectangles is split into two more rectangles by the rule "x2 > c2". Two more splits are carried out at thresholds c3 and c4, so that the whole covariate space is subdivided into five rectangles. The resulting tree with five terminal nodes (referring to the five rectangles) is visualized in the lower figure
In each node (i.e., in each rectangle), the idea is to
select a single covariate $x_k$, $k \in \{1, \ldots, p\}$, and a binary splitting rule $R(x_k)$ that is obtained by the optimization of a pre-defined splitting criterion. For example, if $x_k$ is continuous or ordinal, the splitting rule $R(x_k)$ is of the form "$x_k \le c_k$ vs. $x_k > c_k$" with threshold $c_k$ (Fig. 6.1). Consequently, the covariate space is subdivided into two rectangles (termed "children nodes") according to $R(x_k)$. If $x_k$ is a categorical predictor with categories $C_k = \{c_{k1}, \ldots, c_{kq_k}\}$, the idea is to subdivide $C_k$ into two mutually exclusive sets $C_{k1}$ and $C_{k2}$ with $C_{k1} \cup C_{k2} = C_k$, and the splitting rule $R(x_k)$ becomes "$x_k \in C_{k1}$ vs. $x_k \in C_{k2}$". Of course $x_k$, $c_k$, $C_{k1}$, and $C_{k2}$ may vary across the nodes.
The choice of the splitting criterion depends on the scale of the outcome variable Y. For example, if Y is a binary variable, a popular strategy is to construct a classification tree. An often used splitting criterion for binary Y is the Gini impurity measure, which is computed as follows: Assume that a node has to be split into two children nodes using covariate $x_k$. Denote the children nodes by $N_m(x_k, R(x_k))$, $m = 1, 2$, and let $p_m(x_k, R(x_k))$ be the proportions of ones in the children nodes. The Gini impurity measure is defined as $G_m(x_k, R(x_k)) := 2\,p_m(x_k, R(x_k))\,(1 - p_m(x_k, R(x_k)))$, so that it attains its minimum value 0 in pure nodes with $p_m \in \{0, 1\}$. The optimization problem in each split is then given by minimizing the weighted sum of the Gini coefficients in the children nodes,
$$\min_{x_k, R(x_k)} \Big\{ |N_1(x_k, R(x_k))|\, G_1(x_k, R(x_k)) + |N_2(x_k, R(x_k))|\, G_2(x_k, R(x_k)) \Big\}, \qquad (6.1)$$
where $|N_1|$ and $|N_2|$ denote the cardinalities of the sets of observations contained in the nodes $N_1$ and $N_2$, respectively.
in the nodes N1 and N2 , respectively. Similarly, if Y is a continuous survival time, a
survival tree can be constructed by maximizing the log-rank statistic obtained from
the survival times in the children nodes (Bou-Hamad et al. 2011b). The result of
a recursive partitioning procedure is a set of “terminal nodes,” each containing a
subset of the observations of the data. Tree estimates are given by fitting covariate-
free models to the observations in the terminal nodes.
In the literature, many types of recursive partitioning methods have been
proposed. These methods mainly differ in the algorithm that is used to determine
the optimal covariate xk , and also in the choice of splitting criteria and techniques
for determining the optimal tree size. Apart from the CART algorithm, popular
methods are C4.5 (Quinlan 1993) and conditional inference trees (Hothorn et al.
2006). Overviews of the various methods are given, e.g., in Hastie et al. (2009),
Tutz (2012), and Strobl et al. (2009). In the following we consider two recursive
partitioning methods that are specifically designed for discrete-time survival out-
comes.
An algorithm for discrete failure times that extends the CART approach to discrete-time survival data has been developed by Bou-Hamad et al. (2009). The approach uses the log-likelihood of a discrete survival model given by
$$l \propto \sum_{i=1}^{n} \sum_{s=1}^{t_i} y_{is} \log \lambda(s|x_i) + (1 - y_{is}) \log(1 - \lambda(s|x_i)) \qquad (6.2)$$
to define a splitting criterion for tree construction. In each child node one fits a covariate-free discrete hazard model of the simple form $\lambda(t|x_i) = h(\gamma_{0t})$, which contains only intercepts. Splits are obtained by maximizing the sum of the two observed log-likelihoods in the children nodes. More specifically, the optimization problem in each split is given by
$$\max_{x_k, R(x_k)} \Big\{ l_1(x_k, R(x_k)) + l_2(x_k, R(x_k)) \Big\}, \qquad (6.3)$$
where $l_1$ and $l_2$ refer to the log-likelihood functions of the intercept models in the children nodes $N_1$ and $N_2$, respectively.
After having grown the tree, the hazard functions of the covariate-free model with intercepts $\gamma_{0t}$ only are estimated in each of the terminal nodes. These functions constitute the tree estimate of the hazard $\lambda(t|\cdot)$ conditional on covariate combinations x.
A problem of recursive partitioning methods is their tendency to overfit the data
if the number of terminal nodes becomes too large. For example, growing the
largest possible tree with exactly one observation in each terminal node is not a
good strategy, as the estimation of the hazard functions would be based on only
one observation each. For this reason, classification and regression trees are usually
“pruned” after tree construction. Starting with the terminal nodes of the full tree, a
cost-complexity criterion (governing the trade-off between classification accuracy
and tree size) is successively evaluated by collapsing the nodes that result in the
smallest decrease of classification accuracy. The optimal subtree is given by the tree
with the minimum value of the cost-complexity criterion.
Bou-Hamad et al. (2009) adapted this strategy to discrete-time survival trees as follows: After tree construction, pruning is accomplished by minimizing the information criterion
$$-2\, l_T + \kappa\, q\, Q, \qquad (6.4)$$
where $l_T$ is the sum of the log-likelihood functions in the terminal nodes and Q is the number of terminal nodes. The tuning parameter $\kappa$ governs the trade-off between model fit (measured by $l_T$) and model complexity (measured by Q). Popular values are $\kappa = 2$ (giving rise to Akaike's information criterion) and $\kappa = \log(n)$ (giving rise to the Bayesian information criterion). The optimal tree is then given by the subtree with minimum value of (6.4).
The tree method described above follows the same rationale as recursive
partitioning techniques for continuous survival data. But for continuous time one
uses the Kaplan–Meier estimate instead of a model that contains intercepts only
to obtain covariate-free estimates in the terminal nodes. In fact, the two methods
are not so different because the Kaplan–Meier estimator presented in Chap. 2 is
strongly related to the life table estimator. Also, continuous survival trees are
often based on the maximum value of the log-rank statistic for obtaining splits
of covariates (LeBlanc and Crowley 1993). Because this measure is sensitive to
the differences between the (covariate-free) Kaplan–Meier estimates in the children
nodes, it follows the same principle as the sum of the covariate-free discrete log-
likelihood functions in (6.3).
Because of these similarities, survival trees for continuous failure times might
also work when applied to discrete failure time data, especially when the number of
time points is large and intervals are equidistant. Nevertheless, splitting criteria such
as the log-rank statistic are problematic when data are grouped and when the number
of ties is large. These issues are avoided when using the recursive partitioning
techniques for discrete survival times that are presented in this chapter.
6.3 Recursive Partitioning with Binary Outcome
An alternative way to obtain trees for discrete failure times is to explicitly use the binary variables in the representation of the log-likelihood function given in (6.2) and to fit a classification tree to the augmented learning data (Schmid et al. 2016). Basically one uses trees for binary responses. Since one uses the augmented data in this case, the approach comprises the following steps:
In the first step, each observation is expanded into the augmented data representation with binary outcome. An uncensored observation ($\delta_i = 1$) with event time $t_i$ contributes the data lines
Binary observations   Design variables
0                     1 0 0 ... 0   $x_i^T$
0                     0 1 0 ... 0   $x_i^T$
...
1                     0 0 0 ... 1   $x_i^T$
whereas a censored observation contributes the same design variables with all binary observations equal to 0. The same matrices have already been used to estimate the parametric discrete-time hazard models in Chap. 3.
Also, the definition of the hazard function
$$\lambda(t|x) = P(T = t \mid T \ge t, x)$$
implies that one can condition on both the covariate values x and the event "T ≥ t". To account for the ordinal structure of T, the dummy variables for the baseline hazard in the design matrices above are replaced by the column vector $(1, 2, \ldots)^\top$. The input data for an individual observation are thus given by
Binary observations   Design variables
0                     1      $x_i^T$
0                     2      $x_i^T$
...
1                     $t_i$  $x_i^T$
for uncensored observations, and by the analogous matrix with last binary entry 0 for censored observations. The augmented data set for tree construction is obtained by combining the above matrices. By definition, the resulting design matrix has $\tilde n := \sum_{i=1}^{n} t_i$ rows. This data structure, which implies that T is treated as an ordinal covariate during tree construction, represents a much more flexible form than the specification of the time trend $\gamma_{0t}$ in a discrete hazard model.
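A minimal sketch of this approach with the rpart implementation of CART follows. Here dLong is the augmented data set with binary outcome y, time treated as a numeric/ordinal covariate time, covariates x1 and x2, and xNew is a one-row data frame with new covariate values; all names are hypothetical.

library(rpart)

tree <- rpart(factor(y) ~ time + x1 + x2, data = dLong, method = "class",
              control = rpart.control(minbucket = 80, cp = 0))  # minimum node size as tuning parameter

# hazard estimates in the terminal nodes = estimated probabilities for y = 1
fittedHaz <- predict(tree, type = "prob")[, "1"]

# predicted hazard and survival function for the new covariate combination
tmax <- max(dLong$time)
nd   <- data.frame(time = 1:tmax, x1 = xNew$x1, x2 = xNew$x2)
haz  <- predict(tree, newdata = nd, type = "prob")[, "1"]
Shat <- cumprod(1 - haz)            # S(t | xNew) = prod_{i<=t} (1 - lambda(i | xNew))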
The hazard estimates in the Q terminal nodes are denoted by $\hat\lambda_1, \ldots, \hat\lambda_Q$. Estimates of the survival function are then obtained by plugging the estimated hazards into
$$S(t|x) = P(T > t \mid x) = \prod_{i=1}^{t} (1 - \lambda(i|x)).$$
Concerning the choice of an appropriate splitting criterion, discrete-time survival trees do not focus on the correct classification of the individual values $y_j \in \{0, 1\}$ but on the accurate estimation of the probabilities $\lambda_1, \ldots, \lambda_Q$ ("probability estimation tree," Provost and Domingos 2003). Therefore, the aim is to minimize the deviations of the true node probabilities from the estimated probabilities in each node during tree construction. Denoting by $N$ the set of observations in some arbitrary node, a natural measure of node impurity is given by the Brier score
$$BS_N := \frac{1}{|N|} \sum_{j \in N} \Big[ \big( \hat P(y_j = 1 \mid j \in N) - I(y_j = 1) \big)^2 + \big( \hat P(y_j = 0 \mid j \in N) - I(y_j = 0) \big)^2 \Big] = \frac{2}{|N|} \sum_{j \in N} \big( \hat\lambda_N - y_j \big)^2, \qquad (6.5)$$
Fig. 6.2 Illustration of the tree building approach with binary outcome. The tree was constructed by applying the CART algorithm to a simulated data set with four binary predictor variables $x_1, \ldots, x_4$. Tree construction resulted in 20 terminal nodes, with the terminal nodes defining time intervals $T_{(1)}, \ldots, T_{(Q)} \subseteq \{1, \ldots, 20\}$. For example, the time interval corresponding to the first node (referring to observations with $x_2 = 0$ and $x_4 = 0$) is $T_{(1)} = \{1, \ldots, 20\}$ because T was not used in the construction of node 1. Consequently, the estimated hazard for the respective observations is time-constant. Conversely, the time interval for nodes 7 and 8 is given by $T_{(7)} = T_{(8)} = \{4, 5, \ldots\}$ because T was used as a splitting variable in the construction of the two nodes
which quantifies the squared distance between the "observed" hazards $y_j \in \{0, 1\}$, $j \in N$, and the estimated hazard $\hat\lambda_N := \sum_{j \in N} y_j / |N|$. The Brier score is a "proper" measure in the sense that (6.5) becomes minimal if the true probabilities $P(y_j = 1 \mid j \in N)$ are used instead of $\hat\lambda_N$ (Gneiting and Raftery 2007).
Because the outcome values $y_j$ are binary, the Brier score as a measure of node impurity is equivalent to the Gini impurity measure defined above (Exercise 6.2). Consequently, the traditional CART algorithm based on the Gini criterion can be used for survival tree construction, and the hazard $\lambda(t|\cdot)$ can be estimated from the terminal nodes as described previously.
Similar to classification and regression trees, probability estimation trees may overfit the data if they involve too many terminal nodes. For example, growing the largest possible tree with exactly one observation in each terminal node is not desirable, as the "probability" estimates $\hat\lambda_q$ would all be either 0 or 1 in this case. Moreover, the variance of $\hat\lambda_N$ is inversely related to the node size, implying that larger nodes lead to more accurate estimates of $P(y_j = 1 \mid j \in N)$.
In view of these considerations, tree construction for discrete survival times can be optimized by using the minimum number of observations in the terminal nodes as the main tuning parameter of the algorithm ("cardinality pruning"). This strategy implies that tree construction is stopped when further splitting of any of the current nodes would result in children nodes that contain fewer observations than the minimum node size. In practice, the optimum minimum node size can either be determined by means of information criteria or by means of cross-validation techniques. Similar to (6.4), information criteria can be defined by
$$-2\, l + \kappa\,(Q - 1), \qquad (6.6)$$
The final step is to obtain an estimate of the conditional survival function $S(t|x_j^T)$ for a new observation that is contained in a test sample $(\tilde t_j^T, \delta_j^T, x_j^T)$, $j = 1, \ldots, n_T$. This is done as follows: First, the new covariate values $x_j^T$ are combined with every possible time point t, yielding a set of vectors $((x_j^T)^\top, 1), \ldots, ((x_j^T)^\top, t_{\max})$, $j = 1, \ldots, n_T$. These vectors are dropped down the tree, and the estimates $\hat\lambda_q$ in the respective terminal nodes form the estimated hazard function $\hat\lambda(t|x_j^T)$. By definition of the survival function (3.2), a prediction of $S(t|x_j^T)$ (denoted by $\hat S(t|x_j^T)$) is obtained.
A schematic overview of Steps 1–4 is given in Fig. 6.3.
Remark A key feature of the method is that the estimated hazard rate of an
observation may depend on different sets of predictor variables at different time
points. Also, the method allows for a flexible modeling of the baseline hazard.
In particular, the method includes a data-driven detection of time-constant hazards
(because T is treated in the same way as the other covariates and does not necessarily
have to be selected as a splitting variable for tree construction).
Fig. 6.4 Copenhagen stroke study. The panel shows the AIC values that were obtained from
Method 1 (recursive partitioning based on covariate-free discrete hazard models). Estimation was
based on a random sample of size n D 345. The black dot indicates the optimal number of splits,
which was estimated to be equal to 5
Fig. 6.5 Copenhagen stroke study. The figure shows the survival tree that was estimated via Method 1 (recursive partitioning based on covariate-free discrete hazard models) from a random sample of size n = 345. Each node contains an estimate of the median survival time, i.e., of the time point for which $\hat S(t|x) = 0.5$
the survival time T was used in addition to the covariates to construct the 13 terminal nodes. For
example, there was a change in the estimated hazard rate after 3 years (“time.interval < 3.5”),
as this splitting rule was implemented in the second step of Method 2. The numbers below the
terminal nodes in Fig. 6.7 are the estimates of the hazard rates. Figure 6.8 shows two examples
of survival functions that were obtained by averaging the predictions obtained from Method 2 for
patients with Scandinavian stroke score below and above median (which was equal to 46 in the
learning data). As shown in the figure, patients with a stroke score below the median had smaller
survival probabilities on average. □
Fig. 6.6 Copenhagen stroke study. The panel shows the prediction error $\widehat{PE}_{\mathrm{int}}$ that was obtained from applying Method 2 (recursive partitioning with binary outcome) to five sets of learning and evaluation data (of sizes 230 and 115, respectively, each) that were drawn from a random sample of size n = 345. The black dot indicates the optimum minimum node size for the augmented data (2022 data lines), which was estimated to be 80
Fig. 6.7 Copenhagen stroke study. The figure shows the survival tree with minimum node size 80 that was estimated via Method 2 (recursive partitioning with binary outcome) from a random sample of size n = 345. The splitting variables include strokeScore, cholest, sex, hypTen, and timeInt; the numbers below the terminal nodes are the estimated hazards $\hat{\lambda}_1, \ldots, \hat{\lambda}_{13}$ (timeInt = time interval, strokeScore = Scandinavian stroke score, cholest = cholesterol level, hypTen = hypertension)
Fig. 6.8 Copenhagen stroke study. The figure shows the survival functions S(t) over time t that were obtained by averaging the predictions for patients with Scandinavian stroke score below and above the median (= 46). Estimates were obtained via Method 2 (recursive partitioning with binary outcome). It is seen that patients with below-median score values tend to have an increased risk of dying at all time points
6.4 Ensemble Methods

6.4.1 Bagging

Bagging ("bootstrap aggregating") can be adapted to discrete failure time outcomes by either applying Method 1 to B bootstrap samples drawn from the original data or by applying Method 2 to bootstrap samples drawn from the augmented data. The bagged hazard and survival function estimates are then given by the averages of the B individual hazard and survival function estimates, respectively.
While bagging reduces the variance of a single tree, the variance of the bagged
estimate may still be large if the B individual trees are very similar (i.e., if the B tree
estimates resemble each other despite being constructed from different input data). It
may therefore be of interest to decorrelate the B trees, i.e., to reduce their similarity
in order to obtain a smaller variance. This approach is implemented in the random
forest algorithm (Breiman 2001), which artificially reduces the number of covariates
that are available for splitting in each node. Similar to bagging, random forests apply
recursive partitioning methods to B bootstrap samples. The main difference between
the two methods is that random forests only use a small number of covariates (denoted by mtry, where mtry ≤ p) that are sampled randomly from the whole set
of covariates in each node. With the random forest approach, mtry is an additional
tuning parameter that needs to be determined using cross-validation methods.
Usually, optimization of mtry is performed in a highly efficient way by using the
out-of-bootstrap observations (i.e., those observations that are not contained in the
bootstrap samples) as test data for each of the B trees. Similar to bagging, random
forests can be adapted to discrete failure time outcomes by either applying Method 1
to bootstrap samples from the original data or by applying Method 2 to bootstrap
samples from the augmented data. The random forest estimates are then given by
the averages of the B individual hazard and survival function estimates.
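A minimal sketch of this adaptation (Method 2 with the randomForest package; the data frame dat_long and its columns are again hypothetical, and the parameter values are only illustrative):

# Sketch: random forest for discrete survival via Method 2 (binary outcome).
# Assumes dat_long with columns y (0/1), timeInt, x1, x2.
library(randomForest)

rf <- randomForest(
  factor(y) ~ timeInt + x1 + x2,
  data     = dat_long,
  ntree    = 500,       # number B of bootstrapped trees
  mtry     = 2,         # covariates sampled at each split (tuning parameter)
  nodesize = 80         # minimum terminal node size
)
# mtry can be tuned via the out-of-bag error, see rf$err.rate

# Averaged hazard estimates for new time/covariate combinations,
# turned into a survival curve as before
newdata <- data.frame(timeInt = 1:12, x1 = 0.5, x2 = 1)
haz  <- predict(rf, newdata = newdata, type = "prob")[, "1"]
surv <- cumprod(1 - haz)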
Compared to single trees, bagging and random forests have the advantage that
predictions are usually more accurate (see Hastie et al. 2009). On the other hand,
interpretability of the model (which is simple and intuitive in case of a single tree)
is lost. Despite the high prediction accuracy of bagging and random forest methods, this loss of interpretability may be considered a disadvantage, especially when the aim is to build an easy-to-interpret prediction formula and to quantify the effect of a specific predictor
variable on survival. Overviews of the bagging and random forest algorithms are
given in Fig. 6.9.
Example 6.2 Copenhagen Stroke Study
Here we consider Method 2 (recursive partitioning with binary data). In Example 6.1 we fitted a
discrete-time survival tree with minimum node size 80 to the augmented data of 345 observations
(“learning” sample) that were drawn randomly from the complete data. The value of the summary measure $\widehat{PE}_{\mathrm{int}}$ that was computed from the remaining 173 “test” observations is shown in Fig. 6.10. Figure 6.10 also shows the values of $\widehat{PE}_{\mathrm{int}}$ that were obtained from applying bagging and random forests to the learning sample. The number of trees (with minimum node size 80 each) that was
used for both bagging and random forests was 500. The value of mtry was set to 3, which is the
default value of the R package randomForest (Breiman et al. 2015). It is seen that both bagging and
random forests improved the prediction accuracy of the single tree. Also, both the single tree and
the ensemble methods performed better than a null model without covariate information (that was
Bagging.
Step 1: Generate a fixed number B of bootstrap samples (with replacement) from the
complete data set.
Step 2: Apply the recursive partitioning algorithms described in Sections 6.2 and 6.3 to
each of the B samples. In case of Method 2, the minimum node size can be optimized
using the complete augmented data, and this minimum node size can be used for each of
the B trees.
Step 3: Calculate the average of the B outputs (i.e., of the individual hazard estimates) to
obtain the bagging estimate.
Random Forests.
Step 1: Generate a fixed number B of bootstrap samples (with replacement) from the
complete data set.
Step 2: Apply recursive partitioning techniques to each of the B samples. In each split only a randomly chosen subset of size mtry ≤ p of the covariates is considered. In case
of Method 2, the minimum node size can be optimized using the complete augmented
data, and this minimum node size can be used for each of the B trees.
Step 3: Calculate the average of the B outputs (i.e., of the individual hazard estimates) to
obtain the random forest estimate.
Fig. 6.10 Copenhagen stroke study. The figure shows the values of the summary measure $\widehat{PE}_{\mathrm{int}}$ that were obtained from the learning ("training") and test data sets (of sizes 345 and 173, respectively)
obtained by fitting a discrete-time survival tree with T as the only splitting variable). In addition to the values of $\widehat{PE}_{\mathrm{int}}$ computed from the test data, Fig. 6.10 also shows the "in-sample" values of $\widehat{PE}_{\mathrm{int}}$ that were obtained by computing $\widehat{PE}_{\mathrm{int}}$ from the learning data. As expected, the in-sample values of $\widehat{PE}_{\mathrm{int}}$ (which can be considered as a measure of deviation from perfect model fit) were smaller for all methods than the estimates of prediction error that were computed from the test data. Regarding the comparison of estimation techniques, however, a similar pattern was observed
in both situations: random forests had the smallest prediction error, followed by bagging and by the single survival tree. □
6.5 Literature and Further Reading

Various splitting criteria for tree construction have been proposed for continuous
survival outcomes. A review of early approaches is given in LeBlanc and Crowley
(1995); for an overview of more recent approaches, see Bou-Hamad et al. (2011b).
An extension of Method 1 to time-dependent covariates has been proposed by Bou-
Hamad et al. (2011a). Method 2 can be extended by using alternative techniques for
the estimation of the hazards in the terminal nodes (such as the Laplace correction
and the m-estimate, see, e.g., Ferri et al. (2003) or Broström (2007) for a discussion
of various node probability estimators). A bagging method for continuous survival
outcomes has been proposed by Hothorn et al. (2004); random survival forests for
continuous outcomes were first considered in Ishwaran et al. (2008). For recent
developments in this field, especially with regard to variable selection, see Ishwaran
et al. (2011).
6.6 Software
6.7 Exercises
6.1 Construct a tree from the partition of the covariate space shown below
(analogous to Fig. 6.1).
(Partition of the (x1, x2) covariate space into the regions c1, c2, c3, c4, and c5; figure not shown.)
Table 6.1 Description of the covariates contained in the stage C prostate cancer data
Variable Description
age Age of patients in years
eet Was an early endocrine therapy conducted (yes/no)?
g2 Percent of cells in G2 phase, as found by flow cytometry
grade Grade of the tumor, Farrow system
gleason Grade of the tumor, Gleason system
ploidy Ploidy status of the tumor, from flow cytometry
Table 6.2 Summary of the covariates contained in the stage C prostate cancer data
Variable Categories/unit Sample proportion/median (range)
age Years 63 (47–75)
eet Yes 76 %
No 24 %
g2 Percent 13 % (2–55 %)
grade 1 2%
2 40 %
3 54 %
4 4%
gleason 3 2%
4 4%
5 25 %
6 24 %
7 27 %
8 13 %
9 3%
10 2%
ploidy diploid 48 %
tetraploid 48 %
aneuploid 4%
1. Delete observations with missing values and convert the data into a set of
augmented data with binary outcome yis .
2. Fit random forests with 500 trees to the augmented data, i.e., apply Method 2 to
1000 bootstrap samples with replacement. Consider all available covariates and
vary the number of randomly selected covariates in each split from 1 to 10.
3. Choose the model with lowest prediction error on the data that were not included
in the bootstrap samples (“out-of-bag data”).
4. Inspect the Gini variable importance coefficients. Which variables have a strong
effect on the time to hospitalized pneumonia?
5. Fit a single survival tree (using Method 2 based on the Gini splitting criterion)
to the data. Are the covariates with high random forest Gini variable importance
also contained in the first splits of the single tree?
6. Compare the Gini variable importance and the permutation-based variable
importance coefficients of the random forest. Can the differences be explained?
Table 6.3 Description of the covariates contained in the pneumon data set
Variable Description
alcohol Alcohol use by mother during pregnancy
bweight Indicator for normal birthweight of the child (>5.5 lbs.)
education Education of the mother, measured in years
mthage Age of the mother in years
nsibs Number of siblings of the child
poverty Indicator for the mother living at poverty level
race Race of the mother
region Geographic region
sfmonth Month the child was ready for solid food
smoke Indicator for cigarette smoking during pregnancy
urban Does the mother live in an urban environment?
wmonth Month the child was weaned
Table 6.4 Summary of the covariates contained in the pneumon data set
Variable Categories/unit Sample proportion/median (range)
alcohol Alcohol 21 %
No 79 %
bweight Weight > 5.5 lbs 35 %
Weight ≤ 5.5 lbs 65 %
education Years 12 (0–19)
mthage Years 21 (14–29)
nsibs Counts 0 (0–6)
poverty Poverty 92 %
No poverty 8%
race White 52 %
Black 29 %
Other 19 %
region Northeast 14 %
North central 24 %
South 41 %
West 21 %
sfmonth Months 0 (0–18)
smoke Cigarette smoker 23 %
Non-smoker 77 %
urban Urban environment 76 %
Non-urban environment 24 %
wmonth Months 0 (0–28)
Chapter 7
High-Dimensional Models: Structuring
and Selection of Predictors
In this chapter we consider strategies to select the relevant variables in cases where
many explanatory variables are available. It is important to select the relevant ones
in order to obtain a reduced model that is easier to interpret than a big model with a
multitude of variables. Moreover, prediction performance typically suffers if many
irrelevant variables are included in the model. Variable selection even becomes the
central issue in applications where the number of predictors exceeds the number
of observations, for example, when the effects of genes are to be investigated.
Typical data of this type are microarray data, where the expression levels of thousands of predictors (genes) are observed while only a few hundred samples are available. In
these cases the maximum likelihood estimates for the full model do not exist, and
alternatives are needed. In the following we will consider a data set on breast cancer,
which contains information on clinical variables as well as on a pre-selected number
of gene expression levels.
Example 7.1 Breast Cancer
Breast cancer is the most common invasive cancer in women worldwide. In the Western World
breast cancer accounts for approximately 20 % of all cancers diagnosed in female patients.
Although 5-year survival rates are larger than 80 % in most OECD countries, breast cancer is
still the most common cause of cancer-related death in women. Depending on a variety of risk
factors (such as age and disease stage), response rates of patients to breast cancer treatment vary
considerably. It is therefore essential to analyze the effect of clinical and genetic variables on breast
cancer prognosis.
We consider a data set collected by the Netherlands Cancer Institute to validate predictions for
breast cancer in n D 144 lymph node positive women (van de Vijver et al. 2002). The outcome
of the study was the time to the development of distant metastases, which is an important measure
of treatment response. Clinical predictor variables included the age of the patients (median: 45
years, range: 26–53 years), the grade of the tumor, the number of affected lymph nodes, the tumor
diameter, and the estrogen receptor status (indicating whether estrogen receptor proteins were over-
expressed in breast cancer cells, see Table 7.1). In addition to the clinical predictor variables, the
data contain expression measurements of 70 genes (measured on a continuous scale using cDNA
microarray technology). Observed metastasis-free survival times ranged from 0.05 months to 17.66 months, with two thirds of the observations being censored. Details on the clinical variables
Table 7.1 Clinical explanatory variables for the breast cancer data (nki70 data set in R package
penalized)
Sample
Variable Category proportion (%)
Tumor diameter ≤2 cm 50.69
>2 cm 49.31
Number of pos. lymph nodes 1–3 73.61
≥4 26.39
Estrogen receptor status Negative 18.75
Positive 81.25
Tumor grade Poorly differentiated 33.33
Intermediate 38.19
Well differentiated 28.47
are given in Table 7.1. The data are publicly available as part of the R add-on package penalized
(Goeman et al. 2014).
The aim of our analysis is to quantify the effects of the predictor variables on metastasis-free
survival. An important issue in this regard is whether the expression values of the 70 genes can be
used to build a prediction model for the time to development of distant metastases. Specifically,
the question is whether all 70 genes need to be included into the model, or whether it is sufficient
to include only a small number of informative genes. When building a prediction model, one also
has to take into account that measuring gene expression values is relatively expensive; on the other
hand, the clinical variables in Table 7.1 are well-established predictors for survival that are readily
available. The task therefore is to investigate the predictive value of the genes in addition to the
clinical predictors. In this situation, one wants to identify only those genes that improve the survival
predictions derived from the clinical variables. □
We consider the hazard model with linear predictor from Chap. 3, which is given by $\lambda(t \mid x_i) = h(\gamma_{0t} + x_i^T \boldsymbol{\gamma})$. For variable selection the log-likelihood is replaced by the penalized log-likelihood
$$l_p(\boldsymbol{\gamma}) = l(\boldsymbol{\gamma}) - \frac{\lambda}{2}\, J(\boldsymbol{\gamma}), \qquad (7.2)$$
where $l(\boldsymbol{\gamma})$ is the usual log-likelihood function, $\lambda$ is a tuning parameter, and $J(\boldsymbol{\gamma})$ is a penalty that puts restrictions on $\boldsymbol{\gamma}$ that enforce variable selection. Note that the tuning parameter $\lambda$ should not be confused with the hazard function $\lambda(t \mid x)$. We will denote both terms by $\lambda$, as this is standard notation that is commonly used in the literature. Penalized likelihood approaches have already been considered in Chap. 5, but with the focus on smoothing and not on variable selection. A penalty that distinctly enforces variable selection is the $L_1$ or lasso penalty (for "least absolute shrinkage and selection operator"),
$$J(\boldsymbol{\gamma}) = \sum_{j=1}^{p} |\gamma_j|, \qquad (7.3)$$
which was proposed by Tibshirani (1996). The penalty contains the sum over the absolute values of all the parameters. When used in (7.2), in the extreme case $\lambda = 0$ one obtains the maximum likelihood estimate, whereas for $\lambda \to \infty$ all parameters are set to zero. The most interesting case is therefore the intermediate case with an appropriately chosen tuning parameter $\lambda$ between 0 and $\infty$. In this case all coefficients are shrunk toward zero and, depending on $\lambda$, few or many parameters are set to zero. The latter property of the lasso method effectively implies variable selection. Generally, the lasso tends to avoid the high variability of stepwise selection while producing a sparse model that shows good prediction performance.
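As an illustration of how such an L1-penalized continuation ratio model can be fitted in practice (a sketch with the glmnet package, not the software used for the examples in this chapter; the objects X, y, and unpen are hypothetical):

# Sketch: lasso-penalized continuation ratio model on augmented binary data.
# X: model matrix with dummies for the time intervals, clinical covariates,
#    and gene expression values; y: binary transition indicator.
library(glmnet)

# Penalty factors: 0 = unpenalized (baseline dummies, clinical covariates),
#                  1 = penalized (gene expression values); layout hypothetical
unpen <- c(rep(0, 10), rep(0, 6), rep(1, 70))

cvfit <- cv.glmnet(X, y, family = "binomial",
                   penalty.factor = unpen, nfolds = 5)

coef(cvfit, s = "lambda.min")   # selected genes have nonzero coefficients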
Example 7.2 Breast Cancer
We used a penalized likelihood approach to investigate the effects of the 70 genes on metastasis-
free survival. To this purpose, we subdivided the data into a learning data set containing 96
observations (i.e., two thirds of the complete data set) and a test data set containing 48 observations.
Survival times were grouped into 3-month intervals (with the last interval being defined as “>18
months”). This grouping pattern was chosen because the focus in cancer research is often on a small
number of survival probabilities at distinct time points (e.g., 3-month survival, 6-month survival,
etc.).
In the first step, we built a prediction model using the clinical covariates only. To this purpose
we fitted a continuation ratio model to the learning data that contained the covariates “tumor
diameter,” “number of affected lymph nodes,” “estrogen receptor status,” “tumor grade,” and
“age at diagnosis.” The coefficient estimates obtained from this model (in the following termed
“clinical model”) are presented in Table 7.2. These estimates reflect often observed effects in cancer
research: For example, a large number of affected lymph nodes increases the risk of developing distant metastases. The same is true for patients with negative estrogen receptor status.
Table 7.2 Breast cancer data. The table shows the coefficient estimates obtained from two continuation ratio models fitted to a learning data set of size 96 (clinical model = continuation ratio model with the clinical variables only, combined model = L1 penalized continuation ratio model with unpenalized clinical covariates). Abbreviations of covariates are as follows: Diam = tumor diameter, N = number of affected lymph nodes, ER = estrogen receptor status
Coefficient estimates
Clinical model Combined model
Gene name (combined model only)
TSPYL5 0.4755
QSCN6L1 0.9312
Contig32125_RC 1.4342
RUNDC1 1.2185
GPR180 1.0926
ZNF533 0.1447
COL4A2 0.1205
ORC6L 0.3014
LOC643008 0.9720
IGFBP5.1 0.9958
NMU 0.1133
LGP2 1.8194
PRC1 1.3828
Contig20217_RC 0.2920
NM_004702 0.0165
Variable name
Diam ≤2 cm 0.0000 0.0000
Diam >2 cm 0.4217 0.3936
N ≥4 0.0000 0.0000
N 1–3 0.9769 0.8368
ER negative 0.0000 0.0000
ER positive 0.7032 1.5282
Grade poorly diff. 0.0000 0.0000
Grade intermediate 0.3974 0.2073
Grade well diff. 0.3252 0.3697
Age 0.0278 0.0237
Fig. 7.1 Breast cancer data. The figure shows the cross-validated log-likelihood of an L1 penalized continuation ratio model that was obtained by applying fivefold cross-validation to a learning sample of size 96. The curve corresponds to the average over the five folds. Note that the values of $\lambda$ only refer to the 70 genes but not to the clinical covariates. The latter covariates entered the model in an unpenalized fashion to ensure their inclusion in the model. The black dot indicates the cross-validated log-likelihood at the optimal value of $\lambda = 0.014$
In the next step, we investigated the predictive value of the 70 genes in addition to the available
clinical covariates. To this purpose we fitted a penalized continuation ratio model with L1 penalty
(“lasso”) to the learning data. This model contains both the clinical covariates and the expression
values of the 70 genes. In contrast to the values of the 70 genes, the clinical covariates (and also
the dummy variables for the baseline hazard) entered the model in an unpenalized fashion, i.e.
their coefficients were not included in the penalty in (7.3). In contrast, the coefficients of the 70 genes were penalized, and the tuning parameter $\lambda$ of the lasso penalty was determined by fivefold
cross-validation. This strategy implied that the 70 genes were subject to variable selection whereas
all clinical covariates were “forced” a priori to enter the model. Consequently, only those genes
that increased the predictive power of the clinical variables were selected.
The results of fivefold cross-validation are presented in Fig. 7.1. As seen from Fig. 7.1, the optimal value of $\lambda$ was estimated to be $\lambda = 0.014$. From Table 7.2 it is seen that only 15 of the 70
genes entered the model. This result suggests that part of the predictive power contained in the 70
genes is already contained in the clinical covariates.
In the final step, we compared the predictive power of the clinical model using the clinical
covariates only and the combined model containing the unpenalized clinical covariates and the
penalized expression values of the 15 selected genes. The integrated prediction error curves for
the two models (evaluated on the test data) resulted in $\widehat{PE}_{\mathrm{int}} = 0.187$ for the model with clinical variables and $\widehat{PE}_{\mathrm{int}} = 0.157$ for the combined model. This indicates that the combined model (incorporating both genetic and clinical information) had a larger predictive power than the clinical model. □
The basic lasso has the disadvantage that the selection procedure is not consistent in terms of variable selection. Therefore, Zou (2006) proposed the adaptive lasso, which is an extended version of the lasso for which the penalty has the form
$$J(\boldsymbol{\gamma}) = \sum_{j=1}^{p} w_j\, |\gamma_j|, \qquad (7.4)$$
where the $w_j$ are known weights. He showed that for appropriately chosen data-dependent weights the adaptive lasso is consistent. One choice of weights is based on a root-n consistent estimator $\tilde{\boldsymbol{\gamma}}$ of $\boldsymbol{\gamma}$, for example, the maximum likelihood estimate. In this case the weights are defined by $w_j = 1/|\tilde{\gamma}_j|^{\delta}$ for a fixed chosen $\delta > 0$. For growing sample size the weights for zero coefficients get inflated, whereas the weights on non-zero coefficients converge to a finite constant. Further results on the optimality of the adaptive lasso were given by Zou (2006).
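In software, the adaptive lasso can be approximated by passing coefficient-dependent penalty factors; a minimal sketch with glmnet (continuing the hypothetical X and y from above, and using a ridge fit as the initial estimator because a maximum likelihood fit need not exist when p > n):

# Sketch: adaptive lasso via coefficient-dependent penalty factors.
library(glmnet)

# Initial estimate; a ridge fit is used here purely for illustration
init  <- cv.glmnet(X, y, family = "binomial", alpha = 0)
gamma <- as.numeric(coef(init, s = "lambda.min"))[-1]   # drop intercept

delta <- 1
w <- 1 / abs(gamma)^delta          # adaptive weights w_j = 1 / |gamma_j|^delta

afit <- cv.glmnet(X, y, family = "binomial",
                  penalty.factor = w, nfolds = 5)
coef(afit, s = "lambda.min")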
One restriction of the lasso in the form considered so far is that all coefficients
enter the penalty term in an unweighted fashion. Hence all coefficients are penalized
in the same way, and it is recommended to standardize variables before analysis
to avoid problems with covariates that are measured on different scales. Another
problem occurs when a categorical covariate enters the model in the form of a set
of dummy variables. Then the lasso has no information that the coefficients of these
dummy variables are linked together, and it may happen that some of the categories
end up with zero-coefficients while others do not.
A procedure that enforces selection of whole groups of parameters is the group lasso proposed by Yuan and Lin (2006). Let the p-dimensional predictor be structured as $x_i^T = (x_{i1}^T, \ldots, x_{iG}^T)$, where $x_{ij}$ corresponds to the jth group of variables. A group of variables typically refers to the dummy variables of one categorical predictor, with $df_j$ denoting the number of variables in the jth group. But it can also contain only one parameter if the variable is a continuous variable that enters the model as a main effect. Let the parameter vector be partitioned into the corresponding subvectors $\boldsymbol{\gamma}^T = (\boldsymbol{\gamma}_1^T, \ldots, \boldsymbol{\gamma}_G^T)$. Then the group lasso uses the penalty
$$J(\boldsymbol{\gamma}) = \sum_{j=1}^{G} \sqrt{df_j}\, \|\boldsymbol{\gamma}_j\|_2 , \qquad (7.5)$$
7.2 Boosting
Another popular technique for simultaneous model fitting and variable selection
is boosting. Similar to the lasso, boosting is especially useful in high-dimensional
settings where the number of available covariates exceeds the number of obser-
vations ( p > n). In the literature, several approaches to construct boosting
algorithms have been proposed. Historically, the first boosting method (the so-called
AdaBoost algorithm, Freund and Schapire 1996) originated in the machine learning
community and was designed for the prediction of binary outcome variables,
without explicitly focussing on variable selection. Later, boosting algorithms were
adapted to become an optimization method for building a wide range of regression
models with many possible types of outcomes and predictor effects (see Mayr et al.
2014a,b for a review). Often used boosting algorithms include gradient boosting
(Friedman 2001; Bühlmann and Hothorn 2007) and likelihood-based boosting (Tutz
and Binder 2006). Since both approaches often yield similar results, we only present
gradient boosting here.
The aim is again to estimate the coefficient vector of the discrete-time hazard model
$$\lambda(t \mid x_i) = h\big(\gamma_{0t} + x_i^T \boldsymbol{\gamma}\big). \qquad (7.6)$$
In the generic boosting framework with an arbitrary outcome variable y, the aim is to estimate the "optimal" prediction function $f^*$ in terms of minimization of the expected loss
$$f^* := \operatorname*{argmin}_{f}\; E_{y,x}\, \rho\big(y, f(x)\big). \qquad (7.7)$$
The idea is hence to minimize a risk function that is given by the expectation of the loss function $\rho$. For example, in Gaussian regression this approach corresponds to minimizing the theoretical mean $E\big(y - f(x)\big)^2$ over $f$.
In practice, the true data-generating process (and hence the theoretical mean in (7.7)) is usually unknown. Instead, one has a data set $(y_i, x_i)$, $i = 1, \ldots, n$, that contains realizations of the random variables y and x. Consequently, one does not minimize the theoretical risk in (7.7) but replaces it by the empirical risk
$$R := \frac{1}{n} \sum_{i=1}^{n} \rho\big(y_i, f(x_i)\big), \qquad (7.8)$$
which is the empirical mean of the individual deviations between $y_i$ and $f(x_i)$, as measured by $\rho$ (and hence is an unbiased estimator of the theoretical risk). For example, in Gaussian regression with the squared error loss, the empirical risk is defined by $R = \sum_{i=1}^{n} (y_i - f(x_i))^2 / n$, which is (up to the factor $1/n$) the well-known residual sum of squares used in linear regression.
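To make the connection to the gradient step in the algorithm below explicit, note (a standard fact, stated here with the conventional factor 1/2 that is not used in the display above) that for the squared error loss $\rho(y, f) = \frac{1}{2}(y - f)^2$ the negative gradient is simply the residual,
$$-\frac{\partial \rho(y, f)}{\partial f} = y - f,$$
so that in each iteration the base-learners are effectively fitted to the current residuals $y_i - \hat{f}^{[m-1]}(x_i)$.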
Non-Gaussian Loss Functions If the outcome variable is not normally distributed, a common strategy is to define $\rho$ as the negative logarithm of the probability density function of the outcome distribution. Then, if the distribution of the outcome belongs to the exponential family of distributions, the empirical risk is equivalent to the negative log-likelihood function of a generalized linear model:
$$R = -\frac{1}{n} \sum_{i=1}^{n} \log \varphi\big(y_i, f(x_i)\big), \qquad (7.9)$$
where $\varphi$ denotes the probability density of $y_i$. In (7.9), $f(x_i)$ is a linear predictor that is associated with the conditional mean of $y_i \mid x_i$ via a known link function (which is implicitly assumed to be incorporated into $\varphi$). In other words, minimizing the empirical risk in (7.9) over $f$ is equivalent to maximum likelihood estimation of the linear predictor $f$ in a generalized linear model.
Specification of the Gradient Boosting Algorithm Having defined the optimiza-
tion problem, a gradient boosting algorithm is used to estimate f . The idea is to
begin with an n-dimensional vector of starting values (which represents the starting
values for the predicted values of the n observations in the data) and to update this
vector iteratively. The updates are carried out by the so-called set of base-learners,
which are defined as simple linear regression models with one input variable and one
output variable. Their role will become clear below. Gradient boosting is formally
defined as follows:
1. Initialize the n-dimensional vector $\hat{f}^{[0]} = (\hat{f}_1^{[0]}, \ldots, \hat{f}_n^{[0]})^T$ with starting values. For example, start with the same constant value for each of the n elements of $\hat{f}^{[0]}$. This constant value can, e.g., be obtained by minimizing the empirical risk of a covariate-free model numerically over f.
2. For each of the predictor variables specify a base-learner, i.e., a simple linear regression model with the respective predictor variable as input variable and one output variable (which will be defined below). Set the iteration counter m to 0.
3. Increase m by 1.
4. (a) Compute the negative gradient $-\frac{\partial \rho}{\partial f}$ of the loss function and evaluate it at $\big(y_i, \hat{f}_i^{[m-1]}\big)$, $i = 1, \ldots, n$. This yields the negative gradient vector
$$U^{[m]} = \big(U_i^{[m]}\big)_{i=1,\ldots,n} := \left( -\frac{\partial}{\partial f}\, \rho\big(y_i, \hat{f}_i^{[m-1]}\big) \right)_{i=1,\ldots,n}. \qquad (7.10)$$
(b) Fit the negative gradient vector $U^{[m]}$ to each of the p covariates separately by using the p base-learners specified in Step 2. Hence the negative gradient $U^{[m]}$, whose values are measured on a continuous scale, becomes the output variable of the base-learners. Since the base-learners are simple linear models, ordinary least squares estimation is carried out to estimate $U^{[m]}$. This procedure yields p vectors of predicted values (one vector per base-learner), where each vector is an estimate of the negative gradient vector $U^{[m]}$.
(c) Select the base-learner that fits $U^{[m]}$ best according to the $R^2$ goodness-of-fit criterion. Set $\hat{U}^{[m]}$ equal to the fitted values of the best model.
(d) Update $\hat{f}^{[m]} \leftarrow \hat{f}^{[m-1]} + \nu\, \hat{U}^{[m]}$, where $0 < \nu \le 1$ is a real-valued step length factor.
5. Iterate Steps 3 and 4 until the stopping iteration $m_{\mathrm{stop}}$ is reached. The choice of $m_{\mathrm{stop}}$ will be discussed below.
From Step 4 it is seen that the algorithm descends the gradient of the empirical risk R: in each iteration, an estimate of the true negative gradient of R is added to the current estimate of f. Moreover, as seen from Steps 4(c) and 4(d), gradient boosting additionally carries out variable selection, as only one base-learner (and therefore only one covariate) is selected for updating $\hat{f}^{[m]}$ in each iteration. Due to the additive update in Step 4(d), the final boosting estimate at iteration $m_{\mathrm{stop}}$ can be interpreted as an additive prediction function. Importantly, it can be shown that this update results in an additive update of the estimated coefficient vector $\hat{\boldsymbol{\gamma}}$ in each iteration (for details see Example 7.3).
Definition of the Risk Function Having defined a generic gradient boosting algorithm for arbitrary outcomes, the task is to specify a suitable loss function for the discrete hazard model. Keeping in mind that the log-likelihood function of model (7.6) is equivalent to the log-likelihood function of a binary regression model (cf. Chap. 3), a convenient choice is to set R equal to the negative binomial log-likelihood function, i.e.,
$$R = -\sum_{i=1}^{n} \sum_{s=1}^{t_i} \Big( y_{is} \log \lambda(s \mid x_i) + (1 - y_{is}) \log\big(1 - \lambda(s \mid x_i)\big) \Big), \qquad (7.11)$$
where the $y_{is}$ code the transition to the next period. As in Sect. 3.4 we use the linear predictor $f(x_{it}) := x_{it}^T \boldsymbol{\beta}$ with $x_{it}^T = (0, \ldots, 0, 1, 0, \ldots, 0, x_i^T)$ and $\boldsymbol{\beta}^T = (\gamma_{01}, \ldots, \gamma_{0q}, \boldsymbol{\gamma}^T)$. Then the generic boosting algorithm defined in the previous subsection can be applied to the augmented data. As before, the base-learners that are used consist of only one predictor variable, that is, one component of x. The corresponding intercept parameters are estimated without penalization in the first step of the algorithm, and the resulting estimates are used as offset values for gradient boosting.
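A minimal sketch of such a gradient boosting fit with the mboost package (the augmented data frame dat_long is hypothetical, and the handling of the intercept/offset parameters is simplified here rather than following the exact procedure described above):

# Sketch: gradient boosting of a discrete hazard (continuation ratio) model
# on augmented data with binary outcome y and covariates x1, x2.
library(mboost)

fit <- glmboost(
  factor(y) ~ x1 + x2,
  data    = dat_long,
  family  = Binomial(),                       # binomial log-likelihood loss
  control = boost_control(mstop = 500, nu = 0.1)
)

cv  <- cvrisk(fit)          # resampling-based choice of the stopping iteration
fit <- fit[mstop(cv)]       # set the model to the optimal m_stop
coef(fit)                   # only selected covariates have nonzero coefficients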
Interpretation of the Model Fit The interpretation of the model fit is directly
related to the choice of simple linear models as base-learners in Step 2. In fact, the
linearity of the base-learners directly results in the linearity of the whole estimated
prediction function. This can be easily seen from the following example, which for
simplicity ignores the parameters $\gamma_{0t}$ that are linked to the baseline hazard.
Example 7.3 A Simple Example with Three Covariates
Assume that there are three predictor variables $x_1$, $x_2$, $x_3$. Then, if predictor variable $x_j$, $j \in \{1, 2, 3\}$, is chosen in boosting iteration m, a simple linear model is fitted to the negative gradient, and there is a coefficient estimate $\hat{\gamma}_j^{[m]}$ resulting from this model. Further suppose that the number of boosting iterations is $m_{\mathrm{stop}} = 5$, and that $x_1$ was selected in iterations 1, 2, and 5. For iterations 3 and 4 we assume that $x_3$ was selected. Keeping in mind the additive structure of the update Step 4(d), the estimated prediction function can be written as
$$\hat{f}^{[5]}(x) = \hat{f}^{[0]} + \nu\,\big(\hat{\gamma}_1^{[1]} + \hat{\gamma}_1^{[2]} + \hat{\gamma}_1^{[5]}\big)\, x_1 + \nu\,\big(\hat{\gamma}_3^{[3]} + \hat{\gamma}_3^{[4]}\big)\, x_3 .$$
Hence the estimated prediction function is linear in the covariates. If the algorithm is stopped early, so that nonzero estimates have not been obtained for all elements of $\boldsymbol{\gamma}$, gradient boosting uses only a subset of the covariates for estimating f. For example, $x_2$ is not included in the estimated prediction function above. □
Table 7.3 Breast cancer data. The table shows the coefficient estimates obtained from two continuation ratio models fitted to a learning data set of size 96 (clinical model = continuation ratio model with the clinical variables only, combined model = gradient boosting with the 70 genes as covariates and the clinical covariates as offset values). Abbreviations of covariates are as follows: Diam = tumor diameter, N = number of affected lymph nodes, ER = estrogen receptor status
Coefficient estimates
Clinical model Combined model
Gene name (combined model only)
TSPYL5 0.6008
QSCN6L1 0.3378
GPR180 0.8991
IGFBP5.1 1.4568
LGP2 0.8382
PRC1 1.5119
NUSAP1 1.7227
Variable name
Diam ≤2 cm 0.0000 0.0000
Diam >2 cm 0.4217 0.4217
N ≥4 0.0000 0.0000
N 1–3 0.9769 0.9769
ER negative 0.0000 0.0000
ER positive 0.7032 0.7032
Grade poorly diff. 0.0000 0.0000
Grade intermediate 0.3974 0.3974
Grade well diff. 0.3252 0.3252
Age 0.0278 0.0278
The results of fivefold cross-validation are presented in Fig. 7.2. As seen from Fig. 7.2, the optimal stopping iteration was estimated to be $m_{\mathrm{stop}} = 84$. At $m_{\mathrm{stop}} = 84$, 7 of the 70 genes entered the model (Table 7.3). Again this result suggests that part of the predictive power contained in the 70 genes is already contained in the clinical covariates.
In the final step we compared the predictive power of the clinical model and the combined model. The integrated prediction error curves for these models (evaluated on the test data) resulted in $\widehat{PE}_{\mathrm{int}} = 0.186$ and $\widehat{PE}_{\mathrm{int}} = 0.157$, respectively. This result indicates that the combined model (incorporating both genetic and clinical information) had a larger predictive power than the clinical model, and that incorporating genetic information into the survival model may increase prediction accuracy. Note that the values of $\widehat{PE}_{\mathrm{int}}$ are almost identical to those obtained from L1-penalized regression in Sect. 7.1. In addition, 6 of the 7 genes selected by boosting are also included in the L1-penalized model, highlighting the similarities between the two methods (cf. Table 7.2). □
Fig. 7.2 Breast cancer data. The figure shows the cross-validated negative log-likelihood of a continuation ratio model that was obtained by applying fivefold cross-validation to a learning sample of size 96. The curve corresponds to the average over the five folds. Gradient boosting with the 70 genes as covariates was used to fit continuation ratio models to the data in the five folds. The fitted values of the clinical model were used as offset values to ensure that all clinical covariates were included in the model. The black dot indicates the cross-validated negative log-likelihood at the optimal boosting iteration $m_{\mathrm{stop}}$, which was found to be 84
of 50 patients who underwent surgery due to stage II colon cancer and did not receive postoperative
adjuvant chemotherapy; we thank Sandrine Dudoit for providing us with the data. Times to the
development of metachronous metastases were grouped by the authors according to the criterion "t < 60 months," so that only two time intervals ($[0, 60)$ and $[60, \infty)$) were considered. Twenty-five patients developed metachronous metastases within the first 60 months after surgery (t < 60), and another 25 patients remained disease free for at least 60 months (t ≥ 60). Gene expression levels of p = 22,283 genes (measured with the Affymetrix HGU133A GeneChip) were available. Unlike for the analysis of the breast cancer data in Example 7.2, clinical covariates were not used in the predictive analysis by Barrier et al. (2006).
In the same way as in Examples 7.2 and 7.4, we used L1 penalized regression and gradient boosting to carry out variable selection and to investigate the effects of the 22,283 genes on the development of metachronous metastases. The results of fivefold cross-validation for the continuation ratio model with logistic link function are presented in Fig. 7.3. The optimal value of $\lambda$ for the L1 penalized regression model was $\lambda = 0.1331$. For gradient boosting (with step length $\nu = 0.1$) the optimal stopping iteration was found to be $m_{\mathrm{stop}} = 13$. From Table 7.4 it is seen that 10 of the 22,283 genes entered the L1 penalized regression model, whereas 9 genes were selected by gradient boosting.
It is remarkable that the sets of genes that were selected by L1 penalized regression and gradient boosting almost coincide. In fact, all 9 genes that were selected by gradient boosting were also contained in the set of genes selected by L1 penalized regression. In addition, the signs of the respective coefficient estimates coincide. It is also noteworthy that 8 of the 10 genes selected by L1 penalized regression are also contained in the 30-gene prognosis predictor that was identified in the original publication by Barrier et al. (2006) with quite different methods. □
Fig. 7.3 Stage II colon cancer data. The upper panel shows the cross-validated log-likelihood of an L1 penalized continuation ratio model that was obtained by applying fivefold cross-validation to the data. The lower panel shows the cross-validated negative log-likelihood of a continuation ratio model that was fitted via gradient boosting. The curves correspond to the averages over the five folds. The black dots refer to the optimal values of $\lambda$ and $m_{\mathrm{stop}}$
The variable selection methods considered in the previous sections refer to models with a linear predictor. It is straightforward to extend the methods to the case of additive predictors, where the model has the form $\lambda(t \mid x_i) = h(\eta_{it})$ with predictor
$$\eta_{it} = f_0(t) + f_1(x_{i1}) + \cdots + f_p(x_{ip}),$$
where the functions $f_0(\cdot), \ldots, f_p(\cdot)$ are unknown and will be determined by the data. As in Sect. 5.2, let the functional form of the predictors be determined by an expansion in basis functions,
$$f_j(x_j) = \sum_{s=1}^{m_j} \gamma_{js}\, \phi_{js}(x_j),$$
where the basis functions $\phi_{js}$ depend on the covariate $x_j$. Let $\boldsymbol{\gamma}_j^T = (\gamma_{j1}, \ldots, \gamma_{jm_j})$ represent the vector of parameters linked to the jth predictor.
Then one can use lasso-type selection procedures by penalizing the parameter vectors $\boldsymbol{\gamma}_j$ in the form of the group lasso (7.5). Alternative penalty terms for variable selection in additive models were used by Cantoni et al. (2011) and Marra and Wood (2011). In boosting procedures one can use blockwise boosting, which means that the group of parameters linked to one variable is updated simultaneously (Bühlmann and Yu 2003; Tutz and Binder 2006; Schmid and Hothorn 2008). In this case the base-learner is a linear model with the basis functions of the corresponding variable as input. It has been shown, in particular by Tutz and Binder (2006), that the method selects additive predictors efficiently also in very high-dimensional settings.
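A sketch of blockwise boosting with smooth base-learners in mboost (the augmented data frame dat_long is hypothetical; the base-learner choices are only illustrative):

# Sketch: boosting an additive discrete hazard model on augmented data.
# Smooth effects via penalized B-spline base-learners (bbs), linear effects
# via ordinary least squares base-learners (bols).
library(mboost)

fit <- gamboost(
  factor(y) ~ bbs(timeInt) + bbs(x1) + bols(x2),
  data    = dat_long,
  family  = Binomial(),
  control = boost_control(mstop = 1000, nu = 0.1)
)

cv  <- cvrisk(fit)          # resampling-based choice of the stopping iteration
fit <- fit[mstop(cv)]
plot(fit)                   # partial effects of the selected base-learners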
7.5 Software
7.6 Exercises
7.1 Consider the gradient boosting algorithm for discrete hazard models and derive the negative gradient vector $-\partial R/\partial f$ for the case where the risk function is given by (7.11). Derive the negative gradient separately for the logistic, probit, Gompertz, and Gumbel link functions.
7.2 A popular approach to illustrate penalized regression and gradient boosting
methods is to draw coefficient path plots. With this technique, coefficient estimates
(on the y-axis) are plotted against different values of the respective tuning parameter
(on the x-axis). The behavior of the coefficient estimates can thus be analyzed for
varying levels of regularization.
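For orientation, a minimal sketch of such a plot for an L1-penalized fit (assuming the hypothetical objects X, y, and unpen from Sect. 7.1; this is not a complete solution of the exercise):

# Sketch: coefficient path plot for an L1 penalized fit.
library(glmnet)
fit <- glmnet(X, y, family = "binomial", penalty.factor = unpen)
plot(fit, xvar = "lambda", label = TRUE)   # paths against log(lambda)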
1. Generate a coefficient path plot for the L1 penalized regression model of Example 7.2 and visualize the relationship between the tuning parameter $\lambda$ (in decreasing order on the x-axis) and the resulting coefficient estimates (on the y-axis). Which is the first gene to enter the model as $\lambda$ starts to decrease?
2. Similarly, generate a coefficient path plot for the gradient boosting model of
Example 7.4 and visualize the relationship between the number of boosting
iterations (on the x-axis) and the resulting coefficient estimates (on the y-axis).
Compare the plot to the respective plot for the L1 penalized regression model and analyze the orders in which the genes enter the model.
7.3 In Sect. 7.1 the adaptive lasso and the group lasso were introduced. It is possible
to use the advantages of both methods simultaneously by penalizing categorical
variables together and by adaptively choosing the weights of each variable. Wang
and Leng (2008) showed that this method is able to consistently identify the true
underlying model. Apply the adaptive group lasso to the SOEP data and use the
time until drop-out as outcome variable:
1. Estimate the initial weights $w_j = 1/|\hat{\gamma}_j|^{\delta}$ from a discrete hazard model with logistic link, using $\delta = 1$. For illustration purposes and for numerical reasons exclude the
following variables from the model (but include all other variables in the SOEP
data):
• HID, HGTYP1HH, HGNUTS1, HGRSUBS, HGOWNER, HGACQUIS.
2. Based on the specified weights fit the adaptive group lasso to the data and estimate the optimal value of the tuning parameter $\lambda$ via fivefold cross-validation. Hint: Use the function cv.glmnet(..., penalty.factor, type.multinomial = "grouped").
3. Refit the survival model by using all observations of the SOEP data as well as the optimal $\lambda$. Draw the coefficient paths. Do any variables have coefficients with zero values?
7.4 Again consider the SOEP data. The aim is to analyze the data by using gradient
boosting techniques with smooth nonlinear predictor effects and by considering the
same covariates as in the previous exercise.
1. For all continuous covariates specify penalized B-spline base-learners (using the
default implementation in the bbs function of R package mboost). For categorical
variables use linear base-learners (bols function in R package mboost, see Hofner
et al. 2014 for details on the implementation of base-learners in mboost).
2. Fit a discrete survival model with logistic link function and determine an
appropriate number of boosting iterations using the cvrisk function in mboost.
Visualize the effect of each selected covariate on the linear predictor and interpret
the partial effects of the variables.
3. What are the relative frequencies of the selected variables across the boosting
iterations?
Chapter 8
Competing Risks Models
In the previous chapters we have considered various statistical techniques that model
the time to a particular event of interest. There are applications, however, where
these models do not apply because the interest is in several distinct types of target
events. For example, in survival analysis the events may refer to several causes of
death. Similarly, when modeling duration of unemployment one often distinguishes
between full-time and part-time jobs that end the unemployment spell. Models for
this type of data are often referred to as competing risks models. In this chapter we
will first consider parametric competing risks models for discrete time-to-event data
and then show how the estimation can be embedded into the framework of GLMs.
In simple time-to-event models with one target event the dynamics of the process were described by one hazard function. In the case of several target events it is useful to define several hazard functions, one for each type of event. Let in the following $R \in \{1, \ldots, m\}$ denote the distinct target events. For discrete time $T \in \{1, \ldots, q+1\}$, the cause-specific hazard function resulting from cause or risk r is defined by
$$\lambda_r(t \mid x) = P(T = t,\, R = r \mid T \ge t,\, x), \qquad r = 1, \ldots, m.$$
The overall hazard, i.e., the conditional probability of an event of any type in period t, is given by
$$\lambda(t \mid x) = \sum_{r=1}^{m} \lambda_r(t \mid x) = P(T = t \mid T \ge t,\, x).$$
The survival function and the unconditional probability of an event in period t have the same form as in the simple case of one target event, i.e.,
$$S(t \mid x) = P(T > t \mid x) = \prod_{i=1}^{t} \big(1 - \lambda(i \mid x)\big)$$
and
$$P(T = t \mid x) = \lambda(t \mid x) \prod_{i=1}^{t-1} \big(1 - \lambda(i \mid x)\big).$$
For an individual reaching interval $[a_{t-1}, a_t)$, there are m possible outcomes, namely the end of the duration in one of the m target events, or survival beyond $[a_{t-1}, a_t)$. The corresponding conditional response probabilities are given by $\lambda_1(t \mid x), \ldots, \lambda_m(t \mid x)$ and $1 - \sum_{r=1}^{m} \lambda_r(t \mid x)$.
Fig. 8.1 Basic concepts for competing risks models in discrete time (overall hazard, survival function, and event probability)
The multinomial logit model is the most widely used model for categorical responses, see, for example, Tutz (2012) or Agresti (2013). In discrete survival the responses are either the target events or survival. The corresponding model is given by
$$\lambda_r(t \mid x) = \frac{\exp(\gamma_{0tr} + x^T \boldsymbol{\gamma}_r)}{1 + \sum_{i=1}^{m} \exp(\gamma_{0ti} + x^T \boldsymbol{\gamma}_i)}, \qquad r = 1, \ldots, m, \qquad (8.1)$$
with conditional survival probability
$$P(T > t \mid T \ge t, x) = 1 - \sum_{r=1}^{m} \lambda_r(t \mid x) = \frac{1}{1 + \sum_{j=1}^{m} \exp(\gamma_{0tj} + x^T \boldsymbol{\gamma}_j)}.$$
Thus the conditional model that is used is the multinomial logit model for m + 1 categories. With $R \in \{0, 1, \ldots, m\}$, where $R = 0$ denotes conditional survival, the conditional probabilities that sum up to 1 are given by $\lambda_0(t \mid x) = P(T > t \mid T \ge t, x), \lambda_1(t \mid x), \ldots, \lambda_m(t \mid x)$.
For the interpretation of parameters it is useful to consider the model in the form
$$\log\!\left(\frac{\lambda_r(t \mid x)}{\lambda_0(t \mid x)}\right) = \gamma_{0tr} + x^T \boldsymbol{\gamma}_r .$$
It is seen that the linear predictor determines the cause-specific log-odds, that is, the logarithm of the proportion $\lambda_r(t \mid x)/\lambda_0(t \mid x)$, which compares the conditional probability of the target event R = r to conditional survival. With parameter vector $\boldsymbol{\gamma}_r^T = (\gamma_{r1}, \ldots, \gamma_{rp})$ one obtains
$$\frac{\lambda_r(t \mid x)}{\lambda_0(t \mid x)} = \exp(\gamma_{0tr})\, \exp(\gamma_{r1})^{x_1} \cdots \exp(\gamma_{rp})^{x_p} .$$
Thus an increase of $x_j$ by one unit increases the cause-specific odds by the factor $\exp(\gamma_{rj})$. While $\gamma_{rj}$ gives the additive effect of variable $x_j$ on the log-odds, the transformed parameter $\exp(\gamma_{rj})$ shows the multiplicative effect on the odds, which is often more intuitive.
As in the single-event case, the baseline hazard function may be simplified by assuming a smooth function, and the weights $\boldsymbol{\gamma}_r$ may depend on time.
The multinomial logit model can be used for any number of risk categories but contains many parameters. For each category of the risk one has to estimate the risk-specific baseline hazard coefficients $\gamma_{0tr}$ as a function of t and the parameter vector $\boldsymbol{\gamma}_r$. If the causes or risks are ordered, sparser parameterizations can be found. For example, in a study on unemployment, the target events $R \in \{1, \ldots, m\}$ can represent "part-time job" or "full-time job" as alternatives to remaining unemployed, where the latter category can be denoted by R = 0. In cases like this one can consider R as an ordered categorical response given that a specific time point is reached.
If R has a natural order, one can use classical ordinal response models as proposed by McCullagh (1980) to model the cause-specific hazards. The cumulative-type model uses cumulative probabilities, which in the case of discrete hazards are given by
$$\pi_r(t \mid x) = \sum_{j=1}^{r} \lambda_j(t \mid x) = P(T = t,\, R \le r \mid T \ge t,\, x).$$
The cumulative model then assumes
$$\pi_r(t \mid x) = F(\gamma_{0tr} + x^T \boldsymbol{\gamma}),$$
where $F(\cdot)$ is a fixed cumulative distribution function, and the intercepts must satisfy $\gamma_{0tr} \le \gamma_{0t,r+1}$ for all r. Specifically, if $F(\cdot)$ is the logistic distribution function, one obtains McCullagh's proportional odds model for responses R given by
$$\log\!\left(\frac{\pi_r(t \mid x)}{1 - \pi_r(t \mid x)}\right) = \gamma_{0tr} + x^T \boldsymbol{\gamma} .$$
Given that the model fits the data well, the advantage of the model over the multinomial logit model is that one needs only one parameter vector $\boldsymbol{\gamma}$ instead of one parameter vector $\boldsymbol{\gamma}_r$ for each category. For an application and further discussion, see Tutz (1995).
The models considered above all have a common form: they can be written as multivariate generalized linear models $\lambda_r(t \mid x) = h_r(X_t \boldsymbol{\beta})$, where for the multinomial logit model the response function is
$$h_r(\eta_1, \ldots, \eta_m) = \frac{\exp(\eta_r)}{1 + \sum_{i=1}^{m} \exp(\eta_i)},$$
and a 1 in the rth row of the design matrix $X_t$ is at the $(t + r)$th position. The parameter vector in this case collects the baseline coefficients $\gamma_{0tr}$ and the cause-specific covariate effects $\boldsymbol{\gamma}_r$. Time-dependent covariates can be included by using $x_i(t)^T = (x_{i1}, \ldots, x_{it})$, the sequence of observations until time t; the model for the hazard function then has the form (8.3), where the design matrix $X_t$ is a function of t and x(t).
In competing risks models multiple target events are modeled simultaneously. One
might be tempted to use a simpler modeling approach by focussing on one target
event and considering the occurrence of other targets as censored observations.
Then, for fixed $r_0$ one would use the hazard function of a single-event model for cause $r_0$, fitted to the following transformed data.
Let the data for the competing risks again be given by $(t_i, r_i, \delta_i, x_i)$, where $r_i \in \{1, \ldots, m\}$ indicates the target event. When focussing on this event, one considers the transformed data
$$\big(t_i, \delta_i^{(r_0)}, x_i\big), \qquad \text{where } \delta_i^{(r_0)} = 0 \text{ if } \delta_i = 0 \text{ or } r_i \ne r_0, \text{ and } \delta_i^{(r_0)} = 1 \text{ otherwise}.$$
The indicator function $\delta_i^{(r_0)}$ denotes censoring in the single-cause model for the $r_0$th target event.
Although this approach seems attractive because simple binary models are used,
it has severe disadvantages. In particular, estimation can be strongly biased. This is
because when fitting simple discrete survival models, it is assumed that censoring
and survival are independent (the so-called random censoring property). Thus, when
the underlying survival times for the competing events are correlated, the censoring
process in the simplified model for separate single targets is correlated with the
survival time. If correlation between survival time and censoring is ignored, the
estimated hazards will be biased. As a consequence, if survival times are correlated,
separate modeling of single targets cannot be recommended. Since in practice the
correlation structure is usually unknown, competing risks modeling is often the
better choice.
Let the data be given by $(t_i, r_i, \delta_i, x_i)$, where $r_i \in \{1, \ldots, m\}$ indicates the target event. We again assume random censoring at the end of the interval with $t_i = \min\{T_i, C_i\}$, where events are defined by the indicator function
$$\delta_i = \begin{cases} 1, & T_i \le C_i, \text{ i.e., the event of interest occurred in interval } [a_{t_i - 1}, a_{t_i}), \\ 0, & T_i > C_i, \text{ which refers to censoring in interval } [a_{t_i - 1}, a_{t_i}). \end{cases}$$
Under the assumption of random censoring, the factor $P(C_i \ge t_i)^{\delta_i}\, P(C_i = t_i)^{1 - \delta_i}$ can be omitted, and the likelihood reduces to
$$L = \prod_{i=1}^{n} \lambda_{r_i}(t_i \mid x_i)^{\delta_i}\, \big(1 - \lambda(t_i \mid x_i)\big)^{1 - \delta_i} \prod_{t=1}^{t_i - 1} \big(1 - \lambda(t \mid x_i)\big).$$
Analogous to Chap. 3, embedding model (8.3) into the GLM framework uses the representation by binary indicators for the transition to the next time period. Therefore one specifies indicator variables in the following way: For each observation one defines for $t < t_i$
$$(y_{it0}, y_{it1}, \ldots, y_{itm}) = (1, 0, \ldots, 0),$$
which encodes survival for all time points before $t_i$. For $t = t_i$ one defines for $\delta_i = 1$
$$(y_{it0}, y_{it1}, \ldots, y_{itm}) = (0, \ldots, 0, 1, 0, \ldots, 0),$$
with the 1 at position $r_i$, indicating which of the target events was observed; for a censored observation ($\delta_i = 0$) one again sets $(y_{it0}, y_{it1}, \ldots, y_{itm}) = (1, 0, \ldots, 0)$.
With these indicator variables the likelihood for the ith observation can be written in the form
$$L_i = \prod_{t=1}^{t_i} \left\{ \prod_{r=1}^{m} \lambda_r(t \mid x_i)^{y_{itr}} \right\} \big(1 - \lambda(t \mid x_i)\big)^{y_{it0}} = \prod_{t=1}^{t_i} \left\{ \prod_{r=1}^{m} \lambda_r(t \mid x_i)^{y_{itr}} \right\} \Big(1 - \sum_{r=1}^{m} \lambda_r(t \mid x_i)\Big)^{y_{it0}}.$$
This means that the likelihood for the ith observation is the same as the likelihood for the $t_i$ observations $y_{i1}, \ldots, y_{i t_i}$ of a multinomial response model. The indicator variables actually represent the distributions given that a specific interval is reached. Given that an individual reaches interval $[a_{t-1}, a_t)$, the response is multinomially distributed with $y_{it}^T = (y_{it0}, y_{it1}, \ldots, y_{itm}) \sim M\big(1;\, 1 - \lambda(t \mid x_i), \lambda_1(t \mid x_i), \ldots, \lambda_m(t \mid x_i)\big)$. The dummy variable $y_{it0} = 1 - y_{it1} - \ldots - y_{itm}$ has value 1 if individual i does not fail in interval $[a_{t-1}, a_t)$ and $y_{it0} = 0$ if individual i fails in $[a_{t-1}, a_t)$.
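A minimal sketch of this augmented-data representation in R (the short-format data frame dat with columns time, status (0 = censored, 1, ..., m = observed cause), x1, and x2 is hypothetical; here m = 2 causes are assumed):

# Sketch: augmented multinomial data and cause-specific hazards via a
# multinomial logit fit (reference category 0 = "survival of the interval").
library(nnet)

aug <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  ti <- dat$time[i]
  data.frame(id      = i,
             timeInt = seq_len(ti),
             # 0 ("survival") in all periods before t_i; in period t_i the
             # observed cause r_i, or 0 again if the observation is censored
             resp    = c(rep(0, ti - 1), dat$status[i]),
             dat[i, c("x1", "x2")],
             row.names = NULL)
}))
aug$timeInt <- factor(aug$timeInt)   # time-specific baseline coefficients
aug$resp    <- factor(aug$resp)      # levels: 0 = survival, 1, 2 = causes

fit <- multinom(resp ~ timeInt + x1 + x2, data = aug)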
As a consequence, the likelihood is that of the multi-categorical model $P(Y_{it} = r) = h_r(X_t \boldsymbol{\beta})$, where $Y_{it} = r$ if $y_{itr} = 1$. As in the single-cause model, maximum likelihood estimates may be calculated within the framework of multivariate generalized linear models after augmenting the design matrices. For the ith observation, the response and design matrices are given by
$$\begin{pmatrix} y_{i1} \\ \vdots \\ y_{i, t_i} \end{pmatrix}, \qquad \begin{pmatrix} X_1 \\ \vdots \\ X_{t_i} \end{pmatrix}.$$
Fig. 8.2 Duration of unemployment of U.S. citizens. The figure shows the smooth baseline hazard estimates that were obtained from a competing risks model with logistic link function. Event times were measured in 2-week intervals
Table 8.1 Duration of unemployment of U.S. citizens. The table shows the effect estimates for the
three covariates “age,” “filed UI claim?” and “log weekly earnings in lost job.” The 95 % interval
estimates were obtained from 250 bootstrap samples
Coef. estimate Coef. estimate Coef. estimate for
Event for age for UI claim log weekly earnings
Re-employed 0.01248 1.14683 0.49464
at ft job (0.01895, 0.00617) (1.28556, 1.02068) (0.36700, 0.61977)
Re-employed 0.00119 1.17169 0.30369
at pt job (0.00892, 0.01226) (1.37452, 0.94198) (0.51747, 0.08458)
The linear predictor in the multinomial model for the cause-specific hazard $\lambda_r(t \mid x)$ has the form
$$\eta_r = \gamma_{0tr} + x^T \boldsymbol{\gamma}_r .$$
A lasso-type penalty on the individual parameters $\gamma_{rj}$ enforces the selection of parameters but not of variables. With an appropriate choice of the tuning parameters part of the parameters will be set to zero, but the remaining parameters can be linked to any of the variables. Thus it might occur that all variables still have to be kept in the model.
An alternative method that explicitly enforces variable selection instead of parameter selection uses a penalty that specifies groups of parameters and links them to one covariate each. Let all the effects of the jth variable be collected in $\boldsymbol{\gamma}_{\cdot j}^T = (\gamma_{1j}, \ldots, \gamma_{mj})$ and consider the grouped penalty
$$J(\boldsymbol{\gamma}) = \sum_{j=1}^{p} \|\boldsymbol{\gamma}_{\cdot j}\| = \sum_{j=1}^{p} \big(\gamma_{1j}^2 + \cdots + \gamma_{mj}^2\big)^{1/2},$$
where $\|u\| = \|u\|_2 = \sqrt{u^T u}$ denotes the $L_2$ norm. The penalty enforces variable selection, that is, all the parameters in $\boldsymbol{\gamma}_{\cdot j}$ are simultaneously shrunk toward zero. It is strongly related to the group lasso (Yuan and Lin 2006), see Sect. 7.1. However,
in the group lasso the grouping refers to the parameters that are linked to a
categorical predictor within a univariate regression model, whereas in the present
model grouping arises from the multivariate response structure. It was originally
proposed for multinomial responses by Tutz et al. (2015). In discrete survival
the penalty should be amended by a term that ensures that the baseline hazard
is sufficiently smooth over time. A penalty that enforces structured and effective
variable selection and that smooths the baseline hazards over time is given by
$$J_{\lambda_1, \lambda_2}(\boldsymbol{\gamma}) = \lambda_1 \sum_{r=1}^{m} \sum_{t=2}^{q} (\gamma_{0tr} - \gamma_{0,t-1,r})^2 + \lambda_2 \sum_{j=1}^{p} \psi_j\, \|\boldsymbol{\gamma}_{\cdot j}\|, \qquad (8.7)$$
where $\psi_j = \sqrt{m}$ is a weight that adjusts the penalty level on the parameter vectors $\boldsymbol{\gamma}_{\cdot j}$ for their dimension. The importance of the penalty terms is determined by the tuning parameters $\lambda_1$ and $\lambda_2$. Without a penalty, that is, with $\lambda_1 = \lambda_2 = 0$, ordinary maximum likelihood estimation is obtained.
Generally, a common penalty level $\lambda_2$ for all $\boldsymbol{\gamma}_{\cdot j}$ is not an optimal choice. As was shown by Zou (2006), penalties of the form (8.7) are inconsistent if used with a common penalty parameter. The proposed remedy is the use of so-called adaptive weights, which are obtained by replacing the weights $\psi_j$ by
$$\psi_j^{a} = \frac{\sqrt{m}}{\|\hat{\boldsymbol{\gamma}}_{\cdot j}^{\mathrm{Init}}\|},$$
where $\hat{\boldsymbol{\gamma}}_{\cdot j}^{\mathrm{Init}}$ denotes an appropriate initial estimator. For example, the penalized estimator that results from application of penalty (8.7) with $\lambda_2 = 0$ can be used as the initial estimator. In this case, the initial estimator uses unpenalized covariate effects, but an active smoothing penalty on the baseline effects. The tuning parameters themselves have to be chosen, for example, by cross-validation. For details, see Möst et al. (2015).
Example 8.2 Congressional Careers
We consider the data on congressional careers described in Example 1.4. The competing risks were
defined by the way a career ends, by retirement, an alternative office (ambition), losing a primary
election (primary) or losing a general election (general). The dependent variable is defined by
the transition process of a Congressman from his/her first election up to one of the competing
events general, primary, retirement, or ambition. The duration until the occurrence of one of the
competing events is measured as terms served, where a maximum of 16 terms can be reached.
Predictors were described in Example 1.4, and summaries were given in Tables 1.4 and 1.5.
We fitted a penalized multinomial logit model with risks defined by cause 1 (general), 2 (primary), 3 (retirement), and 4 (ambition). The effect of covariates is specified by the cause-specific linear predictors $\eta_{itr} = \gamma_{0tr} + x_{it}^T \gamma_r$. All covariates described in Table 1.4 were considered. To be on comparable scales, all covariates were standardized to have equal variance. Moreover, we included
all pairwise interactions with the exception of republican:leader, leader:redist, opengub:scandal,
scandal:redist because the data contained too few observations of the corresponding combinations.
Such a high-dimensional interaction model cannot be properly handled by unpenalized maximum
likelihood estimation, but stable estimation and efficient variable selection is obtained by using
penalization. Since the adaptive version of the penalty (8.7) yielded better cross-validation scores,
Fig. 8.3 Parameter estimates of the cause-specific time-varying baseline effects for the Congres-
sional careers data. Dashed lines represent the 95 % pointwise bootstrap intervals
adaptive weights were included. Tuning parameters 1 and 2 were chosen on a two-dimensional
grid by fivefold cross validation with the predictive deviance as loss criterion.
Figure 8.3 shows the parameter estimates for the cause-specific time-varying baseline effects.
The corresponding pointwise confidence intervals, marked by light gray dashed lines, were
estimated by a nonparametric bootstrap method with 1000 bootstrap replications. It can be seen that
cause-specific baseline effects are needed because the shapes are quite different. For retirement the
parameters increase over early terms and eventually become stable, while for ambition there is an
early peak at about five terms and then a decrease. Due to the penalization of adjacent coefficients,
the estimated baseline effects are rather smooth.
Parameter estimates of the covariate effects are summarized in Table 8.2. It shows the ordinary
maximum likelihood estimates and the estimates resulting from the penalized competing risk
model with their corresponding standard errors. The computation of the standard errors is again
based on a nonparametric bootstrap approach with 1000 bootstrap replications. It is immediately
seen that the penalization removes a considerable number of effects, that is, only 68 out of
128 parameters remain in the model, leading to a strong reduction of the model complexity.
The selection procedure suggests that the main effects Republican and Leader are not needed
in the predictor. Moreover, a large number of interaction effects have been deleted. Concerning
Table 8.2 Parameter estimates for the Congressional careers data. Ordinary maximum likelihood estimates are denoted by “ML”; penalized estimates are
denoted by “pen.” Estimated standard errors for the penalized model obtained by bootstrapping are given in the columns denoted by “sd”
General Primary Retirement Ambition
ML pen. sd ML pen. sd ML pen. sd ML pen. sd
Age 0.069 0.046 0.008 0.071 0.046 0.011 0.070 0.068 0.008 0.034 0.037 0.007
Republican 0.255 0 0.005 0.188 0 0.002 0.201 0 0.009 0.343 0 0.018
Priorm 0.078 0.060 0.005 0.006 0.001 0.005 0.007 0.005 0.003 0.010 0.004 0.002
Leader 0.272 0 0.087 2.779 0 0.081 0.393 0 0.065 0.033 0 0.080
Opengub 0.815 0.205 0.116 0.598 0.181 0.097 0.227 0.109 0.077 0.528 0.208 0.121
Opensen 0.638 0.243 0.125 0.215 0.193 0.134 0.086 0.062 0.125 1.136 0.878 0.134
Scandal 3.750 2.689 0.370 3.215 3.272 0.428 1.921 1.611 0.441 3.118 1.532 0.073
Redist 2.548 1.617 0.447 1.465 1.149 0.499 0.563 0.431 0.251 0.574 0.801 0.309
Age:Republican 0.007 0.011 0.007 0.045 0.010 0.007 0.041 0.030 0.009 0.038 0.029 0.009
Age:Priorm 0.001 0.000 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Age:Leader 0.014 0 0.002 0.117 0 0.002 0.018 0 0.002 0.269 0 0.001
Age:Opengub 0.006 0 0 0.034 0 0 0.016 0 0 0.011 0 0
Age:Opensen 0.005 0 0.001 0.074 0 0.001 0.039 0 0.004 0.015 0 0.002
Age:Scandal 0.106 0 0 0.022 0 0 0.090 0 0 0.009 0 0
Age:Redist 0.001 0.007 0.016 0.066 0.039 0.018 0.174 0.097 0.031 0.037 0.018 0.016
Republican:Priorm 0.016 0.005 0.004 0.041 0.016 0.005 0.008 0.004 0.004 0.015 0.012 0.004
Republican:Opengub 0.532 0.342 0.200 4.282 1.337 0.147 0.147 0.233 0.201 0.063 0.294 0.184
Republican:Opensen 0.323 0 0.001 0.092 0 0.002 0.802 0 0.010 0.260 0 0.011
Republican:Scandal 0.007 0 0.021 2.121 0 0.054 0.182 0 0.005 1.418 0 0.001
Republican:Redist 1.833 0 0.076 0.447 0 0.059 1.247 0 0.050 0.276 0 0.051
Priorm:Leader 0.025 0 0 0.009 0 0 0.008 0 0.001 0.057 0 0
Priorm:Opengub 0.020 0 0 0.001 0 0.001 0.008 0 0.001 0.009 0 0.001
Priorm:Opensen 0.016 0 0.001 0.019 0 0.002 0.013 0 0.002 0.011 0 0.004
Priorm:Scandal 0.006 0.007 0.005 0.017 0.010 0.004 0.071 0.019 0.006 0.028 0.001 0
Priorm:Redist 0.066 0.037 0.019 0.000 0.002 0.003 0.030 0.010 0.006 0.013 0.009 0.007
Leader:Opengub 5.168 0 0.117 1.693 0 0.087 1.054 0 0.359 5.402 0 0.116
Leader:Opensen 4.513 0 0 0.941 0 0 1.001 0 0 6.053 0 0
Leader:Scandal 0.213 0.029 0.594 4.212 1.803 0.733 8.621 1.925 0.756 0.897 0.108 0.047
Opengub:Opensen 0.436 0 0 0.124 0 0 0.280 0 0 0.429 0 0
Opengub:Redist 0.175 0.172 0.663 4.274 0.415 0.125 5.297 0.666 0.237 2.751 2.126 0.932
Opensen:Scandal 2.277 0 0.307 1.482 0 0.206 8.270 0 0.266 3.311 0 0.058
Opensen:Redist 0.914 0 0.052 4.560 0 0.006 0.522 0 0.031 1.771 0 0.147
Fig. 8.4 Estimated hazard rates for the Congressional careers data. The following covariate
specifications were used: Age = 51 (left), Age = 41 (right), Prior Margin = 35, no Republican, no
Leadership, no open Gubernatorial seat, no open Senatorial seat, no Scandal and no Redistricting
interpretation, for example, the absolute value for the effect of the covariate Scandal indicates a
strong effect. If a Congressman became embroiled in a scandal it is more likely that he/she loses
a primary or general election or that he/she retires. A scandal also decreases the probability of
seeking an alternative office as compared to re-election.
In Fig. 8.4 a selection of the resulting hazard rates is depicted. It shows hazard functions for
the following covariate specifications: Age = 51 (left) and Age = 41 (right), Prior Margin = 35, no Republican, no Leadership, no open Gubernatorial seat, no open Senatorial seat, no Scandal,
and no Redistricting for the transitions to General, Primary, Retirement, and Ambition. It can be
seen that the probability of retirement tends to increase over early terms and then remains rather
stable. The probability of seeking an alternative office as compared to re-election increases for
early terms and then decreases. The hazard rate for losing either a primary or a general election is
rather constant in the considered group. For more details, in particular on the selection of effects,
see Möst (2014) and Möst et al. (2015).
8.4 Literature and Further Reading
Competing Risks for Continuous Time Most of the literature on competing risks considers the case of continuous time, see, for example, Kalbfleisch and Prentice (2002), Klein and Moeschberger (2003), Beyersmann et al. (2011), and Kleinbaum and Klein (2013).
Discrete Hazards Narendranathan and Stewart (1993) considered discrete compet-
ing risks in the modeling of unemployment spells with flexible baseline hazards.
Fahrmeir and Wagenpfeil (1996) proposed to estimate smooth hazard functions and
time-varying effects by posterior modes. A random effects model of workfare tran-
sitions was considered by Enberg et al. (1990), and a general multilevel multistate
competing risks model for repeated episodes was considered by Steele et al.
(2004). Han and Hausman (1990) considered discretized versions of continuous-
time models.
8.5 Software
The VGAM package (Yee 2010) can be used to fit multinomial and ordinal additive
models. Ordinal models can also be fitted with the R package ordinal (Bojesen
Christensen 2015). Variable selection can be carried out by use of the MLSP
package, which is available at http://www.statistik.lmu.de/~poessnecker/software.html. The dataLongCompRisks function in the R package discSurv can be used to convert survival data with competing events to a data frame with multinomial response.
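As a rough illustration of this workflow, the following sketch converts a small artificial data set with two competing events to the augmented format and fits a cause-specific multinomial logit model. The simulated data, the response column names produced by discSurv (typically e0 for no event and e1, e2 for the two causes), and the exact argument names are assumptions that should be checked against the installed package versions.

## Sketch: discrete-time competing risks with discSurv and VGAM (illustrative)
library(discSurv)
library(VGAM)

set.seed(123)
n <- 300
dat <- data.frame(time = sample(1:8, n, replace = TRUE),
                  x1   = rnorm(n),
                  x2   = rbinom(n, 1, 0.5))
## two mutually exclusive event indicators; rows with neither event are censored
dat$event1 <- rbinom(n, 1, 0.35)
dat$event2 <- ifelse(dat$event1 == 1, 0, rbinom(n, 1, 0.4))

## augmented (long) format: one multinomial observation per person-period
datLong <- dataLongCompRisks(dataSet = dat, timeColumn = "time",
                             eventColumns = c("event1", "event2"))

## cause-specific multinomial logit with time-varying baseline coefficients;
## check names(datLong) to confirm the response columns created by discSurv
fit <- vglm(cbind(e0, e1, e2) ~ as.factor(timeInt) + x1 + x2,
            family = multinomial(refLevel = 1), data = datLong)
summary(fit)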
8.6 Exercises
8.1 Assume that for continuous time the cause-specific proportional hazards model holds. Show that the corresponding discrete hazard model with $T = t$ if failure occurs in the interval $[a_{t-1}, a_t)$ has the form
$$\lambda_r(t \mid x) = \frac{\exp(\eta_r)}{\sum_{j=1}^{m}\exp(\eta_j)}\Big\{1-\exp\Big(-\sum_{j=1}^{m}\exp(\eta_j)\Big)\Big\},$$
where $\gamma_{0t} = \log \int_{a_{t-1}}^{a_t} \lambda_0(u)\,du$.

8.2 Consider the model from Exercise 8.1, written with the response function
$$h_r(\eta_1,\ldots,\eta_m) = \frac{\exp(\eta_r)}{\sum_{j=1}^{m}\exp(\eta_j)}\Big\{1-\exp\Big(-\sum_{j=1}^{m}\exp(\eta_j)\Big)\Big\}.$$
Calculate the corresponding design matrix $X_t$ and the parameter vector $\beta$, that is, rewrite $\lambda_r$ as
$$\lambda_r(t \mid x) = h_r(X_t \beta).$$
8.3 In Example 8.1 a multinomial logit model was used to analyze the U.S.
unemployment data. This model is best suited for unordered event categories
because it allows for estimating a separate coefficient vector for each target event.
The aim of this exercise is to investigate whether a cumulative logit model fits
the data better (or at least equally well). In principle, fitting a cumulative model
is justified by the fact that the target events of the U.S. unemployment data have
a natural order (“still jobless,” “re-employed at part-time job,” “re-employed at
full-time job”). On the other hand, Example 8.1 suggested very different effects
of some of the covariates on the target events; this would in turn justify the use of a
multinomial logit model.
1. Subdivide the U.S. unemployment data randomly into a learning sample (com-
prising two thirds of the observations) and a validation sample (comprising the
remaining third of observations).
2. Convert the learning and validation samples into sets of augmented data.
3. Fit a multinomial logit model with covariates age, filed UI claim and log weekly
earnings in lost job to the learning data. In addition, estimate the parameters of a
cumulative logit model that contains the same covariates.
4. For both models compute the predictive deviance in the validation sample.
5. Repeat the above steps 100 times. For both logistic models (multinomial
and cumulative) compute the means, medians and standard deviations of the
predicted deviance values of the 100 validation samples. Which model results
in the higher prediction accuracy?
8.4 The aim of this exercise is to investigate how the estimates obtained from
competing risks models and the estimates obtained from separate modeling of
single targets differ in the case of correlated events. Since the exercise is based on
simulated data, it also serves to illustrate how survival data with correlated events
can be generated for simulation purposes. It is assumed that there are three events
of interest and that the respective event times (denoted by E1 , E2 , and E3 ) follow
a Weibull distribution each. The correlation structure between the event times is
modeled via a Gaussian copula approach (see below). The censoring time (denoted
by C) is assumed to follow an exponential distribution. Scale and shape parameters
of the distributions are specified as follows:
where $\eta_e := (X_1, X_2, X_3)^T \beta_{E_e}$, $e = 1, 2, 3$, are linear predictors based on three covariates and event-specific vectors of coefficients $\beta_{E_e}$; $\Gamma(\cdot)$ refers to the gamma function.
Regarding the covariates X1 , X2 , X3 and the coefficient vectors ˇE1 , ˇE2 , ˇE3 , the
following specifications are made:
Regarding the correlation structure between the event times, three different
correlation matrices are considered. We specify each of the matrices via Kendall’s
rank correlation coefficients:
$$\mathrm{RespCorr}_1 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}, \quad \mathrm{RespCorr}_2 = \begin{pmatrix} 1 & 0.1 & 0.2\\ 0.1 & 1 & 0.3\\ 0.2 & 0.3 & 1 \end{pmatrix}, \quad \mathrm{RespCorr}_3 = \begin{pmatrix} 1 & 0.3 & 0.4\\ 0.3 & 1 & 0.5\\ 0.4 & 0.5 & 1 \end{pmatrix}.$$
The idea of Gaussian copula modeling is to first draw vectors of correlated normally distributed random numbers. Since the population version of Kendall's $\tau$ is related to the population version of the Bravais–Pearson correlation coefficient by the equation $\rho = \sin(\pi\tau/2)$ (Kruskal 1958), the values of $\tau$ can be used to specify the correlation structure of the normally distributed random variables. In the next step, the normally distributed random numbers are converted into uniformly distributed random numbers via the standard normal cumulative distribution function. Finally, Weibull distributed random numbers are generated by re-transforming the uniformly distributed random numbers via the quantile functions (i.e., the inverse cumulative distribution functions) of the Weibull distributions specified above. Since Kendall's $\tau$ is invariant under monotone transformations, the correlation structures specified above do not only apply to the normally distributed random numbers in the first step but also to the Weibull distributed random numbers generated in the last step.
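The sketch below runs exactly this chain of transformations for one correlation matrix. It is added here for illustration only; the Kendall matrix, the Weibull parameters, and the use of MASS::mvrnorm for the multivariate normal draw are assumptions and not part of the exercise specification.

## Sketch: correlated Weibull event times via a Gaussian copula
library(MASS)   # for mvrnorm()

set.seed(42)
n   <- 1000
tau <- matrix(c(1, 0.3, 0.4,
                0.3, 1, 0.5,
                0.4, 0.5, 1), nrow = 3)       # Kendall's tau (here: RespCorr3)
rho <- sin(pi * tau / 2)                      # convert to Bravais-Pearson correlations

Z <- mvrnorm(n, mu = c(0, 0, 0), Sigma = rho) # step 1: correlated normals
U <- pnorm(Z)                                 # step 2: uniforms via the standard normal cdf

## step 3: re-transform to Weibull event times via quantile functions;
## shape and scale values are purely illustrative
shape <- c(1.2, 1.5, 0.9)
scale <- c(4, 5, 3)
E <- sapply(1:3, function(e) qweibull(U[, e], shape = shape[e], scale = scale[e]))

cor(E, method = "kendall")   # approximately reproduces the specified tau matrix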
The correlation structure between the three covariates is specified as follows:
$$\mathrm{CovariateCorr} = \begin{pmatrix} 1 & 0.25 & 0\\ 0.25 & 1 & 0.25\\ 0 & 0.25 & 1 \end{pmatrix}.$$
1. Specify 100 random seeds and save them to make the results of the simulation
study reproducible.
2. Repeat the following steps 100 times for each of the three correlation matrices (→ 3 × 100 Monte Carlo samples):
(a) Express the correlation matrix of the covariates in terms of Bravais–
Pearson correlation coefficients. (Hint: Use the function tauToPearson or
the simulation function for competing risk models simCompRisk in the R
package discSurv.)
(b) Generate the covariate values via the Gaussian copula approach:
i. First draw $n = 1000$ random vectors of length 3 each from a multivariate normal distribution with expectation $\mu = (0, 0, 0)$ and the Bravais–Pearson correlation matrix calculated in (a).
ii. Use the univariate standard normal distribution function to convert the
random numbers into sets of uniformly distributed data.
iii. Insert the uniformly distributed random numbers into the quantile func-
tions of X1 ; X2 ; X3 .
(c) In the next step generate the values of the event times E1 ; E2 ; E3 via the
Gaussian copula approach:
i. Express the correlation matrix of the event times in terms of Bravais–
Pearson correlation coefficients.
ii. Draw $n = 1000$ random vectors of length 3 each from a multivariate normal distribution with expectation $\mu = (0, 0, 0)$ and the correlation matrix calculated in (i).
iii. Use the univariate standard normal distribution function to convert the
random numbers into sets of uniformly distributed data.
iv. Use the covariate values generated in (b) and the coefficients ˇEe to
calculate the linear predictors. Insert the uniformly distributed random
numbers into the quantile functions of E1 ; E2 ; E3 .
(d) Simulate the censoring process by independently drawing random numbers
from the exponential distribution specified above.
(e) Calculate the observed event times and censoring indicators (assuming right
censoring).
3. Discretize all simulated event times using the grid $g = \{(i-1)/20;\ i = 1, \ldots, 100\}$. (Hint: Use the R function contToDisc.)
4. Generate sets of augmented binary data from the discretized random numbers.
5. Use the R package VGAM to estimate multinomial logistic competing risks
models with constant baseline hazards.
6. For each of the three events fit a discrete single spell survival model with logistic
link function.
7. For each correlation matrix calculate the squared deviations between the true
survival functions and the estimated survival functions.
8. For each correlation matrix average the results across the observations and Monte
Carlo samples.
9. Display and interpret the results by comparing the squared deviations from
competing risks and single spell models.
Chapter 9
Frailty Models and Heterogeneity
In regression modeling one tries to include all relevant variables. But in empirical
studies typically only a limited number of potentially influential variables are
observed and one has to suspect that part of the heterogeneity in the population
remains unobserved. In particular in survival modeling unobserved heterogeneity,
when ignored, may cause severe artifacts.
A simple example illustrates the potential effects. Let the population be par-
titioned into M subpopulations, where in each subpopulation the hazard rate is
constant over time. Thus, in the jth subpopulation the hazard rate is $\lambda_j(t) = \lambda_j$. One easily derives that for a randomly sampled individual the population hazard is given by
$$\lambda(t) = \frac{\sum_{j=1}^{M} \lambda_j(t)\, S_j(t)\, p(j)}{\sum_{j=1}^{M} S_j(t)\, p(j)},$$
Fig. 9.1 Time-varying hazard resulting from a mixture of two subpopulations with time-constant hazard functions. The solid line shows the overall hazard function $\lambda(t)$ whereas the dashed lines refer to the time-constant hazards ($\lambda_1 = 0.2$, $\lambda_2 = 0.4$) of the subpopulations
to be smooth and nonlinear. The model class that is considered is that of discrete additive hazard frailty models. Because model misspecification is a critical issue in random effects models, Sect. 9.4 presents an efficient strategy for variable selection in discrete hazard frailty models. Alternative approaches to incorporating unobserved heterogeneity in discrete time-to-event models are presented in Sects. 9.5 and 9.6, which deal with penalized fixed-effects and finite mixture modeling, respectively. The final section of this chapter extends the basic discrete hazard frailty model to sequential models in item response theory (Sect. 9.7).
Consider an example where the probability of an event on the individual level is the same at all time points (within given time intervals) but may vary in the population. It can be seen as a demographic model for fecundability: A couple who wants to have a child has in each interval the same probability of succeeding, but the probabilities vary strongly over couples. Let the probability for one couple at each time point be given by $\pi$. Thus, one assumes on the individual level of the couple a constant hazard function $\lambda(t) = \pi$. For the probability of being successful at time $t$ one obtains the geometric distribution with possible outcomes $\{1, 2, \ldots\}$,
$$P(T = t \mid \pi) = \pi (1-\pi)^{t-1}.$$
If one assumes for the distribution of $\pi$ a beta distribution with density $f(\pi) = [\Gamma(\alpha+\beta)/(\Gamma(\alpha)\Gamma(\beta))]\, \pi^{\alpha-1}(1-\pi)^{\beta-1}$, $\alpha, \beta > 0$, one obtains for the marginal distribution of $T$
$$P(T = t) = \int_0^1 P(T = t \mid \pi) f(\pi)\, d\pi = \frac{B(\alpha+1, \beta+t-1)}{B(\alpha, \beta)},$$
where $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$. This distribution is a special case of the beta negative binomial distribution. For the marginal hazard function one obtains
$$\lambda_m(t) = P(T = t \mid T \ge t) = \frac{\alpha}{\alpha+\beta+t-1}.$$
Thus for a couple drawn at random from the population the probability that an event occurs at time $t$, given it has not occurred before, is determined by $\lambda_m(t)$.
While one has a constant hazard, $\lambda(t) = \pi$, on the individual level, the hazard on the population level is a decreasing function of $t$ determined by the parameters $\alpha$ and $\beta$. What one observes are in fact realizations of $T$ on the population level, not on
and ˇ. What one observes are in fact realizations of T on the population level, not on
the individual level. Therefore the (estimated) hazard function refers to the marginal
hazard rate, not to the individual hazard rate which is not observed. It should be
noted that even if the model for the marginal hazard fits the data well, this does not
mean that the model on the individual level holds. This is because similar population
models can be derived from quite different individual level models. Aalen (1988)
demonstrated that the mixture model defined previously can be used to fit a data set
referring to incidence rates of conception per month. But from this result one can
hardly infer that the hazard is constant over time for single couples. Nevertheless,
individual level models are a tool to derive models that may hold on the population
level. In the following we will discuss various strategies to model individual effects.
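A few lines of R (with parameter values chosen arbitrarily for illustration) make the contrast visible: on the individual level the hazard is flat, whereas the marginal hazard $\alpha/(\alpha+\beta+t-1)$ derived above declines with $t$.

## Illustration (arbitrary parameter values): individual vs. marginal hazard
alpha <- 2; beta <- 8
t <- 1:20
marginal <- alpha / (alpha + beta + t - 1)    # population-level hazard, decreasing in t
individual <- alpha / (alpha + beta)          # mean of the beta distribution, E(pi) = 0.2
plot(t, marginal, type = "b", ylim = c(0, 0.25), xlab = "t", ylab = "hazard")
abline(h = individual, lty = 2)               # constant hazard of an 'average' couple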
The basic random effects model assumes that the hazard given covariates depends on the sum of the linear predictor $x_i^T\gamma$ and a subject-specific random effect $b_i$:

(Exercise 9.2). Therefore the survival function of an individual with random effect $b_i$ is obtained as a power function of the survival function of the "reference" individual with $b_i = 0$. Figure 9.2 illustrates the modification of the reference survival function by individual effects. It is seen that the hazard rates show distinct variation over individuals. For the Gumbel model (log-log model) with response function $h(\eta) = \exp(-\exp(\eta))$ one obtains a similar relation for the hazard function, i.e.,
Typically there is no closed form for the corresponding marginal hazard function.
Scheike and Jensen (1997) used the clog-log model and, for convenience, assumed that $\exp(b_i)$ is gamma distributed with mean 1 and variance $\nu$. Then the marginal hazard has the form
$$\lambda(t \mid x_i) = 1 - \left(\frac{1 - \nu \log S(t-1 \mid x_i)}{1 - \nu \log S(t \mid x_i)}\right)^{1/\nu},$$
Fig. 9.2 Survival functions for individuals with varying random effects for the grouped proportional hazards model. The bold line refers to the survival function of the reference individual with $b_i = 0$
where S.tjxi / is the marginal survival function of the clog-log model. However, in
general, no closed form is available.
In the following we consider the general form with censoring, where $\delta_i$ denotes the censoring indicator. The unconditional probability of observing $(t_i, \delta_i)$ is given by
$$P(t_i, \delta_i \mid x_i) = \int P(t_i, \delta_i \mid x_i, b_i) f(b_i)\, db_i$$
and therefore by
$$P(t_i, \delta_i \mid x_i) = c_i \int P(T_i = t_i \mid x_i, b_i)^{\delta_i}\, P(T_i > t_i \mid x_i, b_i)^{1-\delta_i} f(b_i)\, db_i,$$
where again it is assumed that censoring occurs at the end of the interval and that $c_i = P(C_i \ge t_i)^{\delta_i} P(C_i = t_i)^{1-\delta_i}$ does not depend on the parameters.
As shown in Chap. 3, the model can also be represented by binary observations as
$$P(t_i, \delta_i \mid x_i) = c_i \int \prod_{s=1}^{t_i} \lambda(s \mid x_i, b_i)^{y_{is}} \big(1 - \lambda(s \mid x_i, b_i)\big)^{1-y_{is}} f(b_i)\, db_i, \qquad (9.2)$$
The inclusion of a frailty term accounts for the hidden heterogeneity in the
population and should therefore be closer to the underlying data-generating process.
On the other hand, adding a frailty term may also lead to model misspecification.
While the link function and the predictor (which is assumed to be linear) may
already be misspecified in models without subject-specific effects, now there is an
additional risk of misspecifying the frailty term. Moreover, identifiability issues can
arise.
The problem of misspecified distributions has been studied in particular for
continuous survival models. The basic case of misspecified heterogeneity is ignored
heterogeneity. Heckman and Singer (1984a) showed analytically that ignoring heterogeneity biases the estimated hazards towards negative duration dependence, where negative duration dependence means that the hazard function decreases as in
Fig. 9.1. The occurring estimation bias implies that the derivatives of the marginal
hazard function tend to be smaller than the integrated derivatives of the individual
hazards. The intuition is that individuals who have a high unobserved risk are
more likely to have shorter survival times, such that the individuals who survive
are those with small unobserved risk. These individuals form the selected sample
that is considered at later time points. As a consequence the observed marginal
hazard decreases. Nicoletti and Rondinelli (2010) referred to this phenomenon
with explanatory variables $w_{it}, z_{it}$ and random effect vector $b_i$. In Model (9.3) different sets of predictor variables are collected in the vectors $w_{it}$ and $z_{it}$, referring to the fixed and random effects $\beta$ and $b_i$, respectively. For example, the simple frailty model with random intercepts uses $w_{it}^T = (0, \ldots, 1, \ldots, 0, x_i^T)$ with parameters $\beta^T = (\gamma_{01}, \ldots, \gamma_{0q}, \gamma^T)$ and $z_{it} = 1$.
A common assumption for random effects is a normal distribution, $b_i \sim N(0, Q)$. The representation of the probability $P(t_i, \delta_i \mid x_i, b_i)$ (see Eq. (9.2)) as a binary response model allows one to use estimation concepts for generalized linear mixed models (GLMMs) for the binary responses $y_{i1}, \ldots, y_{it_i}$.
To estimate $\beta$ and $Q$ simultaneously, one can apply numerical integration techniques that solve the integral in (9.2). This can be done by maximizing the marginal log-likelihood
$$l(\beta, Q) = \sum_{i=1}^{n} \log \int \prod_{s=1}^{t_i} \lambda(s \mid x_i, b_i)^{y_{is}} \big(1 - \lambda(s \mid x_i, b_i)\big)^{1-y_{is}} f(b_i)\, db_i. \qquad (9.4)$$
Two popular approaches for numerical integration are the Gauss–Hermite quadra-
ture and Monte Carlo approximations. The Gauss–Hermite procedure approxi-
mates the integral in (9.4) by using a pre-specified number of quadrature points.
With increasing number of quadrature points the exactness of the approximation
increases. Typically, for simple models with random intercept estimates are stable
if 8–15 quadrature points are used. Gauss–Hermite has been considered by Hinde
(1982), and Anderson and Aitkin (1985). A procedure that may reduce the number
of quadrature points is the adaptive Gauss–Hermite quadrature (Liu and Pierce
1994; Pinheiro and Bates 1995; Hartzel et al. 2001).
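As a minimal sketch of this estimation step, the following code simulates person-period binary data with a normally distributed frailty and fits the model with glmmML using the Gauss–Hermite option (see also Sect. 9.9). The simulation settings are arbitrary, the person-period data are built in a simplified way, and the call should be checked against the installed version of the package.

## Sketch: discrete frailty model as a binary GLMM, fitted by Gauss-Hermite quadrature
library(glmmML)

set.seed(7)
n <- 200; Tmax <- 8
id      <- rep(1:n, each = Tmax)
timeInt <- rep(1:Tmax, n)
x       <- rep(rnorm(n), each = Tmax)
b       <- rep(rnorm(n, sd = 0.8), each = Tmax)     # subject-specific frailties
y       <- rbinom(n * Tmax, 1, plogis(-2 + 0.1 * timeInt + 0.7 * x + b))
long    <- data.frame(id = id, timeInt = timeInt, x = x, y = y)
## (in a real application rows after an individual's event would be dropped,
##  e.g., by constructing the augmented data with dataLong from discSurv)

fit <- glmmML(y ~ timeInt + x, cluster = id, family = binomial,
              data = long, method = "ghq", n.points = 15)
summary(fit)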
An alternative is the penalized quasi-likelihood method, which maximizes
$$l_p(\delta) = \sum_{i=1}^{n} \log\big(f(y_i \mid b_i, \beta)\big) - \frac{1}{2}\sum_{i=1}^{n} b_i^T Q^{-1} b_i, \qquad (9.5)$$
where $f(y_i \mid b_i, \beta) = \prod_{s=1}^{t_i} \lambda(s \mid x_i, b_i)^{y_{is}} \big(1 - \lambda(s \mid x_i, b_i)\big)^{1-y_{is}}$. A disadvantage of this method
is its tendency to underestimate the variance of the mixing distribution and therefore
the true values of the random effects (see, for example, McCulloch 1997). However,
this effect can be ameliorated by using the modifications proposed by Breslow and
Lin (1995) and Lin and Breslow (1996). The penalized quasi-likelihood method can
be justified in various ways (see also Schall 1991; Wolfinger 1994; McCulloch and
Searle 2001).
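For comparison, the penalized quasi-likelihood route can be taken with the glmmPQL function from the MASS package. The call below is a minimal assumed usage and reuses the simulated person-period data frame `long` from the previous sketch.

## Sketch: the same frailty model fitted by penalized quasi-likelihood (PQL)
library(MASS)   # provides glmmPQL(), which in turn relies on the nlme package

fitPQL <- glmmPQL(fixed = y ~ timeInt + x, random = ~ 1 | id,
                  family = binomial, data = long, verbose = FALSE)
summary(fitPQL)
## note: PQL tends to underestimate the random-effect variance, as discussed above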
Example 9.1 Family Dynamics
We illustrate parameter estimation in frailty models on data from Germany’s current panel analysis
of intimate relationships and family dynamics (“Pairfam”) described in Example 1.5.
For each of the anchor women from the two age groups [24; 30] and [34; 40] it is known whether
she has given birth to her first child within the year between two interview dates. Altogether, 137
events were observed. We consider years as the unit in our discrete survival model and start with 24
years, which is the age of the youngest woman in the sample. We consider a discrete hazard model
with the covariates “relstat,” “siblings,” “yeduc,” and “leisure” (see Table 9.1) and investigate
whether the inclusion of a frailty term accounts for the heterogeneity among the anchor women. For
the categorical variable “relstat” the reference level “cohabitation” was chosen. Table 9.2 contains
the coefficient estimates that were obtained from fitting a continuation ratio model with person-
specific random intercepts to the Pairfam data. For numerical optimization of the log-likelihood we
applied the Gauss–Hermite quadrature with 20 quadrature points (first column of Table 9.2) using
the R package glmmML (Broström 2013) and a penalized likelihood-based method implemented
in the R package mgcv (bs = "re" option of the gam function, second column of Table 9.2). The
last column of Table 9.2 contains the effects obtained from an ordinary continuation ratio model
without frailty term. As expected, the coefficient estimates of the frailty models were larger in
absolute size than the respective estimates of the discrete hazard model. The variance estimates for
Table 9.1 Pairfam data. Description of the variables that were used in Example 9.1
Variable Description
age Age of the anchor woman (in years)
relstat status of relationship (categorical with three levels:
“living apart together”, “cohabitation”, “married”)
yeduc Years of education (2 Œ8; 20) of the anchor woman
siblings number of siblings of the anchor woman
leisure (approx.) yearly leisure time of the anchor woman (in hours)
spent for the following five major categories:
(1) bar/cafe/restaurant
(2) sports; (3) internet/TV; (4) meet friends; (5) discotheque
194 9 Frailty Models and Heterogeneity
Table 9.2 Pairfam data. The first two columns contain the coefficient estimates obtained from a frailty model (continuation ratio model with person-specific random intercepts) that was fitted to the complete data via the Gauss–Hermite quadrature (first column) and via the penalized likelihood-based approach (second column). Column 3 contains the respective estimates of an ordinary continuation ratio model without frailty term (lat = living apart together)
Coefficient estimates
Frailty, Gauss–Hermite Frailty, pen. log-lik Discrete hazard
relstat D lat 1.0741 1.0460 1.0440
relstat D married 0.9412 0.9001 0.8809
Siblings 0.0227 0.0215 0.0201
yeduc 0.0462 0.0445 0.0440
Leisure 0.0002 0.0002 0.0002
$\hat\sigma^2$ 0.55632 0.50602 –
the random intercept were 0.55632 (Gauss–Hermite) and 0.50602 (penalized likelihood), implying a notable (yet non-significant, p = 0.346 and p = 0.246, respectively) heterogeneity among the anchor women. Note that, as expected, the variance estimate was smaller for the penalized likelihood-based fitting routine than for the Gauss–Hermite routine.
Cautionary Remark Although the inclusion of frailty terms into discrete hazard
models is a well-founded strategy to account for unobserved heterogeneity, maxi-
mization of the resulting log-likelihood function is often numerically problematic.
This is especially true if there are many time-constant covariates in the model, and
also if the baseline hazard is allowed to vary freely (e.g., if baseline parameters
are modeled via dummy variables or splines). In these cases, the inclusion of a
subject-specific random effect often results in an almost perfect model fit and hence
in numerical problems associated with the various fitting routines. It is therefore
recommended to use not only one but several of the routines for model fitting and to
carefully inspect and compare the results. If convergence problems occur, a solution
might be to restrict the functional form of the baseline hazard (for example, to
a lower-order polynomial or to a time-constant function). Alternatively, adding a
ridge penalty to the baseline parameters might increase the numerical stability of
the fitting process.
In the model considered previously the linear predictor had the form $\eta_{it} = x_{it}^T\gamma + z_{it}^T b_i$. For a more general additive predictor let the explanatory variables be given by $(x_{it}, u_{it}, z_{it})$, $i = 1, \ldots, n$, $t = 1, \ldots, t_i$, with $x_{it}^T = (x_{it1}, \ldots, x_{itp})$, $u_{it}^T = (u_{it1}, \ldots, u_{itm})$, $z_{it}^T = (z_{it1}, \ldots, z_{its})$ denoting vectors of covariates, which may vary across individuals and observations. The components in $u_{it}$ are assumed to represent continuous variables which do not necessarily have a linear effect
within the model. The corresponding additive semiparametric mixed model has the
predictor
$$\eta_{it} = x_{it}^T\gamma + \sum_{j=1}^{m}\alpha_{(j)}(u_{itj}) + z_{it}^T b_i,$$
where, as in the generalized linear mixed model, $b_i$ is a random vector for which it is typically assumed that $b_i \sim N(0, Q)$. The new terms in the predictor are the unknown functions $\alpha_{(1)}(\cdot), \ldots, \alpha_{(m)}(\cdot)$. Let these functions be expanded in basis functions, that is,
$$\alpha_{(j)}(u) = \sum_{s=1}^{m_j} \alpha_s^{(j)}\,\phi_s^{(j)}(u) = \alpha_j^T \phi_j(u),$$
where $\phi_1^{(j)}(u), \ldots, \phi_{m_j}^{(j)}(u)$ are appropriately chosen basis functions for variable $u_{itj}$. Defining $\alpha^T := (\alpha_1^T, \ldots, \alpha_m^T)$ and $\phi_{it}^T := (\phi_1(u_{it1})^T, \ldots, \phi_m(u_{itm})^T)$, one obtains the linear predictor
$$\eta_{it} = x_{it}^T\gamma + \phi_{it}^T \alpha + z_{it}^T b_i,$$
Estimation can then be based on maximizing the penalized log-likelihood
$$l_p(\delta) = \sum_{i=1}^{n}\log\big(f(y_i \mid b_i)\big) - \frac{1}{2}\sum_{i=1}^{n} b_i^T Q^{-1} b_i - \sum_{j=1}^{m}\lambda_j\sum_{s}(\alpha_{j,s+1}-\alpha_{j,s})^2,$$
which, when compared to (9.5), contains an additional penalty that smooths the differences between adjacent weights on basis functions $\alpha_{j,s}, \alpha_{j,s+1}$. The parameters $\lambda_1, \ldots, \lambda_m$ are tuning parameters that have to be chosen appropriately.
The tuning parameters can be chosen by applying cross-validation or by max-
imizing some information criterion, but alternative ways are available as well. In
particular, models with additive terms and random effects can be embedded into the
mixed modeling representation of smoothing where tuning parameters are estimated
as variance components (see, e.g., Lin and Zhang 1999; Ruppert et al. 2003; Wood
2006). The versatile R package mgcv contains the function gamm that allows one
to fit generalized additive mixed models. Alternatively, the bs = “re” option for
smooth fits can be used in combination with the gam function of mgcv.
Table 9.3 Years between cohabitation and first childbirth. The table shows the estimates of the linear effects and the respective estimated standard deviations that were obtained from fitting a logistic discrete hazard model to the data (edu = educational attainment, area = geographic area, cohort = cohort of birth, occ = occupational status, sibl = number of siblings). In contrast to Table 5.1, an additional random intercept term was added to the model in order to account for unobserved heterogeneity. Columns 4 and 5 contain the estimates of the original additive model without random intercept (see also Table 5.1)
                Model with random intercept          Model without random intercept
Covariate       Parameter estimate   Est. std. error   Parameter estimate   Est. std. error
edu First stage basic (ref.
category)
edu Second stage basic 0.0153 0.0825 0.0351 0.0747
edu Upper secondary 0.2423 0.0876 0.1960 0.0793
edu Degree 0.3261 0.1218 0.2787 0.1099
cohort 1946–1950 (ref.
category)
cohort 1951–1955 0.0477 0.0850 0.0421 0.0767
cohort 1956–1960 0.2949 0.0871 0.2638 0.0787
cohort 1961–1965 0.3532 0.0877 0.3112 0.0794
cohort 1966–1975 0.8268 0.0992 0.7682 0.0908
area North (ref. category)
area Center 0.3245 0.0692 0.2957 0.0626
area South 0.7407 0.0684 0.6784 0.0621
occ Worker (ref. category)
occ Non-worker 0.2499 0.0584 0.2272 0.0529
sibl 0.0634 0.0296 0.0548 0.0269
Fig. 9.3 Time between cohabitation and first childbirth. The solid line shows the P-spline estimate
of the effect of the covariate “age at beginning of cohabitation” that was obtained from a
proportional continuation ratio model with random intercept term. The dashed line corresponds
to the respective estimate that was obtained from the original model without random intercept (cf.
Fig. 5.5). Both effect estimates were centered such that the fitted values had zero mean
9.4 Variable Selection in Frailty Models
Variable selection in discrete hazard frailty models can be based on maximizing a lasso-penalized version of (9.5),
$$l_{\mathrm{lasso}}(\delta) = l_p(\delta) - \lambda \sum_{j=1}^{p} |\gamma_j|. \qquad (9.6)$$
The additional lasso-type penalty $\lambda \sum_j |\gamma_j|$ enforces variable selection. For details on the maximization of $l_{\mathrm{lasso}}(\delta)$ see Groll and Tutz (2014).
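A minimal assumed call of this kind is sketched below. The simulated person-period data, the formula, and the fixed value of the tuning parameter (in practice chosen via AIC, BIC, or cross-validation, as in Example 9.3) are illustrative, and the interface should be checked against the installed version of glmmLasso.

## Sketch: lasso-type variable selection in a discrete frailty model with glmmLasso
library(glmmLasso)

set.seed(9)
n <- 200; Tmax <- 6
long <- data.frame(id = as.factor(rep(1:n, each = Tmax)),
                   timeInt = rep(1:Tmax, n),
                   x1 = rep(rnorm(n), each = Tmax),
                   x2 = rep(rnorm(n), each = Tmax),
                   x3 = rep(rnorm(n), each = Tmax))
b <- rep(rnorm(n, sd = 0.7), each = Tmax)
long$y <- rbinom(n * Tmax, 1, plogis(-2.5 + 0.2 * long$timeInt + 0.8 * long$x1 + b))

fit.lasso <- glmmLasso(fix = y ~ timeInt + x1 + x2 + x3,
                       rnd = list(id = ~ 1),
                       family = binomial(link = "logit"),
                       data = long, lambda = 20)
summary(fit.lasso)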
Example 9.3 Family Dynamics
We illustrate variable selection in frailty models using the Pairfam data. Again we consider years
as the unit of discrete survival and start with 24 years. The baseline hazard, which corresponds to
the effect of age, is now included in the form of a penalized smooth effect. Similarly, we allow for
a nonlinear effect of the male partner’s age by including polynomial terms of this covariate. For the
categorical variables relstat, casprim, and pcasprim the reference levels “living apart together” and
“non-working,” respectively, are chosen. It should be noted that all variables can vary over time
and are included as time-varying.
A frailty model with smooth effect of age and with variable selection based on the lasso penalty
was performed by use of the package glmmLasso (Groll 2015). To demonstrate the difference
between AIC and BIC with regard to model sparsity, we used both criteria to select the tuning parameter $\lambda$. The results of the estimation of fixed effects and the amount of heterogeneity (measured by the estimated variance $\hat\sigma^2$) are given in Table 9.4. Figure 9.4 shows the estimated
smooth effect(s) of age. As expected, the estimated functions are nonlinear and bell-shaped with a
maximum in the mid-twenties (gray line: AIC, black line: BIC). For more details on the modeling
of family dynamics with frailty models, see Groll and Tutz (2016).
Table 9.4 Estimated effects and standard deviation of the random intercept for the pairfam data
(standard errors in brackets)
glmmLasso (AIC) glmmLasso (BIC)
Intercept 2.41 (0.14) 2.32 (0.13)
Page 0.39 (4.96) .
Page2 0.25 (10.19) .
Page3 0.48 (5.57) .
Page4 . .
hlt7 . .
sat6 . .
reldur^(1/3) 0.06 (0.14) .
Siblings . .
relstat:cohab 0.53 (0.18) 0.52 (0.15)
relstat:married 0.87 (0.17) 0.80 (0.14)
yeduc . .
pyeduc 0.24 (0.11) .
Leisure^(1/3) 0.07 (0.11) .
Leisure.partner 0.06 (0.12) .
Holiday 0.26 (0.10) .
casprim:educ 0.38 (0.50) .
casprim:fulltime 0.72 (0.54) .
casprim:parttime 0.37 (0.28) .
casprim:other 0.19 (0.22) .
pcasprim:educ 0.31 (0.18) .
pcasprim:fulltime 0.32 (0.18) .
pcasprim:parttime 0.23 (0.15) .
pcasprim:other 0.07 (0.12) .
$\hat\sigma$ 1.13 1.04
Fig. 9.4 Estimates of the smoothed hazard as a function of age for the pairfam data (gray line:
tuning parameter selection based on AIC, black line: tuning parameter selection based on BIC)
9.5 Fixed-Effects Model
Random effects models are a strong tool to model heterogeneity. However, the approach also has drawbacks. One is that inference on the unknown distributional assumption is hard to obtain and that the choice of the distribution may affect the results, see, for example, Heagerty and Kurland (2001), Agresti et al. (2004), and McCulloch and Neuhaus (2011).
Alternatives to random effects models that do not assume a specific distribution
are fixed effects models. In discrete survival one assumes that the hazard function
has the form
where $\beta_i$ is not a random but a fixed parameter that characterizes the ith individual.
Fixed effects models have several advantages. One is that they do not assume that the individual parameters are independent of the explanatory variables. In particular in econometrics this assumption is regarded as a disadvantage of the random effects model, a disadvantage that also concerns the frailty model (9.1). The reason is that
correlation between random effects and covariates leads to biases and inconsistent
estimators, as demonstrated, for example, by Neuhaus and McCulloch (2006). In
special cases, it is possible to use alternative estimators that are consistent. For
example, conditional likelihood methods (Diggle et al. 2002) can be used for
canonical links. Also mixed-effects models that decompose covariates into between-
and within-cluster components (Neuhaus and McCulloch 2006) can alleviate the
problem of biased estimates in specific settings.
Of course, model (9.7) contains many parameters, and estimation tends to fail if the number of individuals is large relative to the number of observations. However, if one assumes that the variation of the individual parameters is not overly large (so that many of the individuals share parameters of similar size), estimates can nevertheless be obtained. One strategy to fit the model is to use penalized likelihood methods. In particular one can include in the log-likelihood the penalty
$$J := \sum_{r<s} (\beta_r - \beta_s)^2, \qquad (9.8)$$
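One way to impose the penalty (9.8) in practice, and the route mentioned in the caption of Table 9.5, is to pass the matrix of person dummies together with a suitable penalty matrix to gam via its paraPen argument. The sketch below uses artificial data and the identity $\sum_{r<s}(\beta_r - \beta_s)^2 = \beta^T(m I - \mathbf{1}\mathbf{1}^T)\beta$; all names and settings are illustrative assumptions.

## Sketch: fixed person-specific intercepts with the pairwise-difference penalty (9.8)
library(mgcv)

set.seed(3)
m <- 40; n.per <- 6
id <- rep(1:m, each = n.per)
x  <- rnorm(m * n.per)
y  <- rbinom(m * n.per, 1, plogis(-1 + 0.5 * x + rep(rnorm(m, sd = 1), each = n.per)))
dat <- data.frame(y = y, x = x)

Z <- model.matrix(~ 0 + factor(id))        # dummy matrix of person indicators
K <- m * diag(m) - matrix(1, m, m)         # penalty matrix encoding sum_{r<s} (beta_r - beta_s)^2
dat$Z <- Z

## no global intercept, since the unpenalized mean level of the person effects plays that role
fit <- gam(y ~ x + Z - 1, paraPen = list(Z = list(K)),
           family = binomial, data = dat, method = "REML")
summary(fit)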
Table 9.5 Breast cancer data. The first column contains the coefficient estimates obtained from
a continuation ratio model with fixed patient-specific intercept terms (clinical model, n D 144).
The variation of the intercepts was restricted by the penalty in (9.8) (paraPen argument in the gam
function of R package mgcv). The second column contains the estimates of a frailty model with
patient-specific random effects and time-constant baseline hazard (fitted via the Gauss–Hermite
quadrature with 20 quadrature points, as implemented in the R package glmmML, Broström 2013).
Column 3 contains the effect estimates obtained from an ordinary continuation ratio model without
frailty term. Abbreviations of covariates are as follows: Diam = tumor diameter, N = number of affected lymph nodes, ER = estrogen receptor status
Coefficient estimates
Frailty, fixed effects Frailty, random effects Discrete hazard
Diam ≤ 2 cm 0.0000 0.0000 0.0000
Diam > 2 cm 0.4027 0.5375 0.3979
N ≥ 4 0.0000 0.0000 0.0000
N 1–3 0.9086 1.1704 0.8485
ER negative 0.0000 0.0000 0.0000
ER positive 0.5145 0.6543 0.5266
Grade poorly diff. 0.0000 0.0000 0.0000
Grade intermediate 0.7248 0.9839 0.6066
Grade well diff. 0.2132 0.2574 0.2533
Age 0.0726 0.0988 0.0585
For a finite mixture the hazard in the jth component is specified as
$$\lambda_j(t \mid x; \gamma_{0tj}, \gamma_j) = h(\gamma_{0tj} + x^T \gamma_j),$$
and the unconditional probability of an event at time $t$ is
$$P(T = t \mid x) = \sum_{j=1}^{m} \pi_j P_j(T = t \mid x),$$
with mixture weights $\pi_j$. The corresponding survival function is
$$S(t \mid x) = \sum_{j=1}^{m} \pi_j S_j(t \mid x), \qquad (9.9)$$
where $S_j(t \mid x)$ is the survival function in the jth component. The population level hazard can be directly derived from $\lambda(t \mid x) = P(T = t \mid x)/P(T \ge t \mid x)$.
The simplest mixture, which in the following is mostly used for illustration, contains just two components. With
$$\lambda_1(t \mid x; \gamma_{0t}, \gamma) = h(\gamma_{0t} + x^T \gamma), \qquad \lambda_2(t \mid x) = 0,$$
the second component consists of long-time survivors, and $S_1(t \mid x)$ is the survival function for the population at risk. The model has been
used by Muthén and Masyn (2005) in discrete survival modeling. In continuous
survival modeling the concept of long-time survivors has been used much earlier,
for an overview see Maller and Zhou (1996). Similar models have been considered
under the name cure model. Cure models were originally developed for use in
biomedical applications because for some severe diseases patients often react
differently to a treatment. In particular, a class of patients who respond to treatment
and are free of symptoms may be considered cured (and therefore as long-time
survivors). An example of modeling unobserved heterogeneity by finite mixtures of
hazard functions is given in Exercise 9.4.
If covariates $z$ are available that determine the class membership, the membership probabilities can be specified by a multinomial logit model,
$$P(C = j \mid z) = \frac{\exp(\alpha_{0j} + z^T \alpha_j)}{1 + \sum_{s=1}^{m-1} \exp(\alpha_{0s} + z^T \alpha_s)}, \quad j = 1, \ldots, m-1,$$
combined with constant hazards within classes,
$$\lambda_j(t \mid \gamma_{0j}) = h(\gamma_{0j}).$$
This model assumes that the hazards are constant within classes, that is, the hazard
is low or high but does not vary over time. The effect of covariates captured in ˛j
has a simple interpretation, because it indicates which variables are responsible for
being in a high or low risk group. The model imposes a very simple structure but
profits from good and simple interpretation.
It is tempting to allow the vector z to be identical to the vector x, which
determines the effect of covariates on the hazard within classes. But then it is hard
to separate the effects because both relate to the risk. For example, let us consider
the simple case of two classes with constant baseline hazards, $\lambda_j(t \mid x; \gamma_{0j}, \gamma_j) = h(\gamma_{0j} + x^T \gamma_j)$. Then the survival function has the form
$$S(t \mid x, z) = \frac{\exp(\alpha_0 + z^T \alpha)}{1 + \exp(\alpha_0 + z^T \alpha)} \big(1 - h(\gamma_{01} + x^T \gamma_1)\big)^{t-1} + \frac{1}{1 + \exp(\alpha_0 + z^T \alpha)} \big(1 - h(\gamma_{02} + x^T \gamma_2)\big)^{t-1}.$$
Let $x$ and $z$ represent the same variable, for example, gender in 0–1 coding (with 1 indicating males), and let us focus on the first class. If $\alpha$ is positive, men tend to be in the first class. If class 1 is the group with higher risk ($\gamma_{01} > \gamma_{02}$), males show an increased risk. But there is an additional effect: If $\gamma_1$ is positive, the risk in class 1 is higher for males than for females; if $\gamma_1$ is negative, the risk for males is smaller than for females. In particular for negative $\gamma_1$ the effect within the class contradicts the effect in the membership probability. Thus interpretation has to refer to both parameters and, moreover, also has to include the intercepts, which makes interpretation tricky.
It therefore seems advisable to include a covariate either in the membership
probability or in the hazard function, but not in both terms. But there still remains a
choice to be made. One strategy is to either include all variables in the hazard and
to let membership probabilities be fixed, or to let the membership probability be
determined by covariates and assume constant hazards.
A simple model for m classes with constant hazard within classes and effects on membership only is
$$S(t \mid z) = \sum_{j=1}^{m} P(C = j \mid z)\, \big(1 - h(\gamma_{0j})\big)^{t-1},$$
which within the classes assumes a geometric distribution of survival time. Even then a multinomial logistic model for the membership parameters $P(C = j \mid z)$ can yield a complicated interpretation of parameters. This can be simplified by ordering the hazards such that $\gamma_{01} < \cdots < \gamma_{0m}$ (that is, class 1 has the lowest and class m the highest hazard) and by assuming for $P(C = j \mid z)$ an ordered model, for example, the cumulative model $P(C \le j \mid z) = F(\alpha_{0j} + z^T \alpha)$. Then the effect of a predictor is contained in just one parameter (see Agresti 2009). A weaker model is obtained by assuming time-varying hazard functions $h(\gamma_{0tj})$ with the restriction $\gamma_{0t1} < \cdots < \gamma_{0tm}$ for all t, such that classes are still ordered.
The long-time survivor or cure model has the advantage that the mixture components are more structured. By assuming for the long-time survivors the hazard $\lambda_2(t \mid x) = 0$ and for individuals in the susceptible group, who will eventually experience the event of interest if followed for long enough, the survival function
$$S_1(t \mid x) = \prod_{s=1}^{t-1} \big(1 - h(\gamma_{0s} + x^T \gamma)\big),$$
one obtains the corresponding mixture model. If $\gamma$ is set to zero, the hazard depends on time but covariates determine the mixture probabilities only.
As in all mixture models the problem of identifiability arises, that is, one has to
ensure that the parameters that describe a specific response structure are unique. The
cure model is certainly not identifiable if both the survival function S1 .tjx/ and the
mixture probabilities P.C D 1 j z/ do not depend on covariates. The same holds if
the survival function does not depend on x and a single binary covariate determines
the mixture probabilities. This results from the identifiability conditions given by
Li et al. (2001). The authors investigate cure models for continuous time and show
in particular that even for continuous time the cure model is not identified in these
cases. However, they also show that models are identified if the mixture probability
is specified as a logistic function depending on continuous covariate z although the
survival function does not depend on covariates. They also investigate the case of
proportional hazards models with covariates and show that the corresponding cure
model is identifiable under weak conditions.
Example 9.5 SEER Breast Cancer Data
For illustration of the cure model we analyze data from the Surveillance, Epidemiology, and
End Results (SEER) Program of the U.S. National Cancer Institute (http://seer.cancer.gov), which
collects information on cancer incidences and survival from various locations throughout the USA.
Here we consider a random sample of 6000 breast cancer patients that entered the SEER database
between 1997 and 2011 (SEER 1973–2011 Research Data, version of November 2013). Discrete-
time survival models are used to analyze the time from diagnosis to death from breast cancer
in years. Tables 9.6 and 9.7 show the variables that are used for statistical analysis. Categorical
predictor variables include tumor grade (I–III), estrogen receptor (ER) status, progesterone receptor
(PR) status, and number of positive lymph nodes. In addition, we consider the age at diagnosis and
the tumor size; for details on the predictors, we refer to the SEER text data file description at
http://seer.cancer.gov. We included all variables from Tables 9.6 and 9.7 in the mixture component
and in the hazard function. For the hazards of the subpopulation at risk we use the continuation
ratio model. The mixture is determined by a logit model. Table 9.8 shows the coefficient estimates
together with bootstrap-based standard errors and confidence intervals (500 bootstrap samples).
It is seen that most of the covariates show no significant effect on the probability of being cured.
Exceptions are the size of the tumor and age. With the exception of the PR status all variables seem
to have an effect on survival in the subpopulation at risk that cannot be neglected.
The estimated survival functions for different grades and different tumor sizes are presented in
Fig. 9.5. Tumor sizes for which survival functions are shown correspond to the minimum value and
the quartiles in the sample; the respective values are 1, 10, 16, 21, and 25 mm. The curves refer
Table 9.6 Quantitative explanatory variables for the SEER breast cancer data
Minimum First Quartile Median Mean Third Quartile Maximum
Age (years) 18 48 56 56 64 75
Tumor size (mm) 1 10 16 21 25 230
Table 9.7 Categorical explanatory variables for the SEER breast cancer data
Category Observations Proportions (%)
Grade I 1300 22
II 2569 43
III 2131 35
No of pos. nodes 0 4018 67
1–3 1416 24
>3 566 9
ER status Positive 4760 79
Negative 1240 21
PR status Positive 4241 71
Negative 1759 29
Table 9.8 Estimates for the cure model (SEER breast cancer data). The upper part of the table
contains the mixture coefficients, in the lower part the coefficients for the hazard of the population
at risk are given
Estimate   Bootstrap SE   95 % confidence interval
Constant 0.0224 0.0176 [0.0694, 0.0062]
Age 0.0170 0.0067 [0.0274, 0.0008]
Grade II 0.0115 0.0189 [0.0120, 0.0612]
Grade III 0.0297 0.0330 [0.1198, 0.0009]
No of pos. nodes 1–3 0.0650 0.0550 [0.1895, 0.0011]
No of pos. nodes > 3 0.0714 0.0736 [0.0013, 0.2534]
Size of tumor 0.0202 0.0120 [0.0033, 0.0520]
ER status (negative) 0.0544 0.0509 [0.1712, 0.0007]
PR status (negative) 0.0349 0.0400 [0.1401, 0.0007]
Age 0.0180 0.0066 [0.0053, 0.0298]
Grade II 0.6330 0.2205 [0.2498, 1.0708]
Grade III 1.4980 0.2345 [1.0622, 1.9777]
No of pos. nodes 1–3 1.0990 0.1960 [0.7034, 1.4466]
No of pos. nodes > 3 1.8510 0.1929 [1.4018, 2.1525]
Size of tumor 0.0060 0.0046 [0.0006, 0.0165]
ER status (negative) 0.7430 0.1911 [0.3719, 1.1253]
PR status (negative) 0.1210 0.1724 [0.2165, 0.4438]
to a person who is 56 years old with ER and PR status positive and no positive nodes. In the plot
for different grades the tumor size was 16 mm, whereas in the plot for different tumor sizes the
grade was one. The left column shows the mixture survival functions, that is, the survival functions
in the total population. The right column shows the survival functions for the population at risk.
It is seen that the grade has a strong effect on survival, yielding quite different functions for both
populations. Since the effect of grade on the mixture is weak and the probability of being at risk
is large for the person under consideration the curves found for different grades are very similar
in the total population and the population at risk. In contrast, the curves for different tumor sizes
found for the mixture are quite different from the curves found for the population at risk. While
the curves for tumor sizes between 1 and 25 are not so far apart in the population at risk, the curves
differ much more strongly in the mixture population, because in these curves the strong effect of
the tumor size on the mixture is included and the other variables have almost no effect on the
mixture.
Fig. 9.5 Survival functions for the SEER breast cancer data. The left column shows the mixture
survival functions, whereas the right column shows the survival functions for the population at risk
The basic assumption in finite mixtures is that the probability of survival time is given by $P(T = t \mid x) = \sum_{j=1}^{m} \pi_j P_j(T = t \mid x)$. For random censoring one derives in a similar way as in Sect. 3.4 that the probability of an observation $(t_i, \delta_i)$ is given by
$$P(t_i, \delta_i \mid x_i) = c_i \Big(\sum_{j=1}^{m} \pi_j P_j(T_i = t_i \mid x_i)\Big)^{\delta_i} \Big(\sum_{j=1}^{m} \pi_j S_j(t_i \mid x_i)\Big)^{1-\delta_i}$$
$$= c_i \sum_{j=1}^{m} \pi_j \prod_{s=1}^{t_i} \lambda_j(s \mid x_i)^{y_{is}} \big(1 - \lambda_j(s \mid x_i)\big)^{1-y_{is}}, \qquad (9.11)$$
where $c_i = P(C_i \ge t_i)^{\delta_i} P(C_i = t_i)^{1-\delta_i}$ is assumed not to depend on the parameters and $(y_{i1}, \ldots, y_{it_i}) = (0, \ldots, 0, 1)$ if $\delta_i = 1$, $(y_{i1}, \ldots, y_{it_i}) = (0, \ldots, 0)$ if $\delta_i = 0$
(Exercise 9.3). The last formula contains a product of binomial terms for the
probability of failure within classes. Although this representation does not help to
construct a likelihood for a binary mixture model, it can be used to construct an EM
algorithm for binary response models.
Mixture models in general were considered, for example, by Follmann and
Lambert (1989) and Aitkin (1999). An extensive treatment was given by Frühwirth-
Schnatter (2006). Follmann and Lambert (1989) investigated the identifiability of
finite mixtures of binomial regression models and gave sufficient identifiability
conditions for mixing at the binary and the binomial level.
9.7 Sequential Models in Item Response Theory
In item response theory the basic model for binary items is the Rasch model, which specifies the probability of solving an item as
$$P(R_{ij} = 1) = \frac{\exp(\beta_i - \delta_j)}{1 + \exp(\beta_i - \delta_j)},$$
where $R_{ij} \in \{0, 1\}$ indicates that item j is not solved or solved, $\beta_i$ is the ability of the person, and $\delta_j$ is the difficulty of the item. The Rasch model is simply a logistic regression model with latent traits as regressors. With $F(\eta) = \exp(\eta)/(1 + \exp(\eta))$ denoting the logistic distribution function, it has the form $P(R_{ij} = 1) = F(\beta_i - \delta_j)$. Therefore, if the ability of the person, $\beta_i$, equals the difficulty of the item, $\delta_j$, the probability of solving the item is 0.5. If the ability is larger than the difficulty of the item, the probability of solving the item is accordingly increased.
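For instance (with invented values), an ability 0.5 above the item difficulty yields a solution probability of about 0.62:

## numeric illustration of the Rasch probability (invented values)
beta_i  <- 1.2   # ability of the person
delta_j <- 0.7   # difficulty of the item
plogis(beta_i - delta_j)   # P(R_ij = 1) = exp(0.5) / (1 + exp(0.5)) = 0.622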
In modern item response theory one frequently uses items that have more than two categories. Then one has graded response categories that reflect to what extent an item is solved. More specifically, the responses for item j take values from $\{1, \ldots, k_j\}$, where larger numbers indicate better performance. Of particular interest are items that contain several steps and are solved in a consecutive manner. A simple example is the item $\sqrt{9.0/0.3} - 5$ considered by Masters (1982). Three levels of
where $F(\cdot)$ again is the logistic distribution function, $\beta_i$ is the ability of the person, and $\delta_{jr}$ is the difficulty of solving the rth step. Thus the transition to the next level
of performance is determined by a binary Rasch model determined by the ability of
the person and the corresponding difficulty of the transition. The alternative form of
the model,
shows that it is a discrete hazard model with latent traits as regressors. The response
in categories (denoted here by the random variable Rij ) can also be a genuine time,
for example, if several trials at solving an item are allowed in answer-until-correct
test administrations. For an application to this type of data, see Culpepper (2014).
The sequential item response model (9.12) has the structure of a discrete survival
model but with repeated measurements since each person tries to solve all the
items. Estimation refers to the item difficulties as well as the person abilities.
An estimation strategy that uses the Rasch model structure for single transitions
was given by Tutz (1990). Better estimates for the item difficulties, which are of
special importance when designing an assessment test, were given by De Boeck
et al. (2011). Effectively they use a random effects representation of the model.
Overviews on item response theory were given by van der Linden and Hambleton
(1997) and De Boeck and Wilson (2004). Further details on the sequential model
are found in Rijmen et al. (2003) and Tutz (2015).
9.8 Literature and Further Reading
Frailty in Duration Models Vaupel et al. (1979), Elbers and Ridder (1982),
and Vaupel and Yashin (1985) were among the first to investigate the impact of
heterogeneity on survival. Heckman and Singer (1984a) and Heckman and Singer
(1984b) investigated it from an econometric perspective. Hougaard (1984) and
Aalen (1988) considered survival from a biostatistical perspective.
Random Effects in Discrete Survival Ham and Rea Jr. (1987) used a logistic model
to analyze unemployment duration. Vermunt (1996) proposed a modified loglinear
model, which is restricted to categorical covariates. Land et al. (2001) extended the
model to allow for metric covariates. McDonald and Rosina (2001) investigated
a mixture model for recurrent event times and gave an analysis of Hutterite birth
histories. Nicoletti and Rondinelli (2010) focused on the misspecification of
discrete survival models; applications are found in Hedeker et al. (2000). Xue
and Brookmeyer (1997) avoided the specification of the frailty distribution by
using a marginal model where regression coefficients have population-averaged
interpretation. Frederiksen et al. (2007) considered the modeling of group level
heterogeneity.
Cure Models Proportional hazards cure models were proposed by Kuk and Chen
(1992) and Sy and Taylor (2000). Yu et al. (2004) considered cure models for
grouped survival times with fixed distributions of survival time. A cure model for
interval-censored data was considered by Kim and Jhun (2008).
Discrete Mixture Models A standard reference for finite mixture models is the book
by McLachlan and Peel (2000), which also contains a chapter on continuous survival
mixtures. Almansa et al. (2014) considered a factor mixture model for multivariate
survival data. Muthén and Masyn (2005) proposed a latent variable approach to
mixture models.
9.9 Software
Generalized Linear Mixed Models The function glmmML that is contained in the
R package glmmML allows one to fit GLMs with random intercepts by maximum
likelihood estimation with numerical integration via Gauss–Hermite quadrature.
The function glmer contained in the package lme4 offers the adaptive Gauss–
Hermite approximation proposed by Liu and Pierce (1994). glmmML and glmer
also allow the use of the Laplace approximation. Another function that fits GLMMs by
using penalized quasi-likelihood methods is glmmPQL from the package MASS.
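As an illustration, the following minimal sketch simulates person-period data and fits a discrete hazard model with a person-specific random intercept by glmer; the data and all settings are chosen for illustration only and do not correspond to any example of this book:

library(lme4)
set.seed(1)
n <- 200                                        # number of persons
b <- rnorm(n, sd = 0.5)                         # person-specific random effects
x <- rbinom(n, 1, 0.5)                          # binary covariate
rows <- list()
for (i in 1:n) {
  for (t in 1:10) {                             # at most 10 discrete periods per person
    haz <- plogis(-2 + 0.5 * x[i] + b[i])       # hazard of an event in period t
    y <- rbinom(1, 1, haz)
    rows[[length(rows) + 1]] <- data.frame(id = i, time = t, x = x[i], y = y)
    if (y == 1) break                           # no further periods after the event
  }
}
dat <- do.call(rbind, rows)
dat$id <- factor(dat$id)
# random intercept discrete hazard model; nAGQ > 1 requests adaptive
# Gauss-Hermite quadrature, nAGQ = 1 corresponds to the Laplace approximation
fit <- glmer(y ~ factor(time) + x + (1 | id), data = dat,
             family = binomial("logit"), nAGQ = 15)
summary(fit)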
Variable Selection Variable selection for generalized mixed models is implemented
in the R package glmmLasso (Groll 2015).
Generalized Additive Mixed Models The package mgcv contains the function
gamm, which allows one to fit GLMMs that contain smooth functions. Alternatively,
the bs = “re” option for smooth fits can be used in combination with the gam
function of mgcv.
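A corresponding sketch with mgcv, reusing the simulated data frame dat from the previous snippet (an assumption made here for brevity), replaces the factor-coded baseline by a penalized smooth of time and adds the subject-specific random intercept via the bs = "re" basis:

library(mgcv)
# smooth baseline hazard plus person-specific random intercept (bs = "re")
fit_gam <- gam(y ~ s(time, k = 5) + x + s(id, bs = "re"),
               data = dat, family = binomial("logit"), method = "REML")
summary(fit_gam)
plot(fit_gam, select = 1)   # estimated smooth baseline effect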
Mixture Models The package flexmix (Grün and Leisch 2008) provides a flexible
tool to estimate discrete mixtures.
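A hedged sketch of fitting finite mixtures of discrete hazard models with stepFlexmix, again reusing the simulated data frame dat from above; the model driver and settings are purely illustrative, and in contrast to the hint in Exercise 9.4 the covariate effects are allowed to vary across components here:

library(flexmix)
set.seed(2)
# mixtures with 1 to 3 components; nrep restarts guard against local optima
mix <- stepFlexmix(cbind(y, 1 - y) ~ factor(time) + x, data = dat,
                   k = 1:3, nrep = 3, model = FLXMRglm(family = "binomial"))
mix_best <- getModel(mix, which = "AIC")   # number of components chosen by AIC
summary(mix_best)
parameters(mix_best)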
9.10 Exercises
9.1 Let the population be partitioned into m subpopulations. Further denote by $\lambda_j(t)$
the hazard rate and by $P_j(T \ge t)$ the survival function in the jth subpopulation. Show
that for a randomly sampled individual the hazard rate is given by
$$\lambda(t) = \frac{\sum_{j=1}^{m} \lambda_j(t)\, P_j(T \ge t)\, p(j)}{\sum_{j=1}^{m} P_j(T \ge t)\, p(j)}.$$
9.3 Show that under the assumption of random censoring the probability of an
observation $(t_i, \delta_i)$ is given by
$$P(t_i, \delta_i \mid x_i) = c_i \sum_{j=1}^{m} \prod_{s=1}^{t_i} \lambda_j(s \mid x_i)^{y_{is}} \bigl(1 - \lambda_j(s \mid x_i)\bigr)^{1 - y_{is}}$$
with a term $c_i$ that depends on censoring only and dummy variables $y_{is}$.
9.4 Reconsider the clinical model for the breast cancer data of Example 9.4. The
analysis of these data in Sect. 9.5 suggested considerable heterogeneity among the
patients, so that fixed and random effects modeling was appropriate to account for
the heterogeneity. The aim of this exercise is to investigate whether unobserved
heterogeneity can also be modeled using finite mixtures of hazard functions, as
presented in Sect. 9.6.
(a) Fit finite mixture models with varying numbers of components to the breast
cancer data. (Hint: Use the stepFlexmix function of the R package flexmix (Grün
and Leisch 2008) for model fitting. Use the fixed argument of stepFlexmix
to guarantee that the estimates of covariate effects do not vary across the
components.)
(b) For each of the models calculate Akaike’s information criterion (AIC). (Hint:
use the AIC function in R package flexmix.) What is the optimum number of
components according to the AIC criterion?
(c) Compare the coefficient estimates of the optimal model to the estimates
obtained from random and fixed effects modeling (as presented in Table 9.5).
Which of the models performs best according to the AIC criterion?
Chapter 10
Multiple-Spell Analysis
In the previous chapters only single spells of duration have been considered.
This is an appropriate modeling strategy in many applications; for example, if in
biostatistics the transition to an absorbing state like death is modeled, only single
spells matter. But in many areas like economics and the social sciences subjects can
experience a sequence of events as time elapses. For instance, subjects can have a
first spell of unemployment, then be employed for some time, then have a second
spell of unemployment, etc. Thus a person's history can be divided into spells with
transitions between various states.
The term event history, which is used in particular in sociology, refers to the
modeling of duration and transition between states of interest. For example, Willett
and Singer (1995) illustrated their method using longitudinal data on exit from,
and re-entry into, the teaching profession. Johnson (2006) modeled the enrolment
history of students, which was divided into periods of enrolment and non-enrolment.
Hamerle and Tutz (1989) considered breast cancer data with the competing risks of
occurrence of metastases and death.
In the following we briefly consider basic concepts in discrete-time multiple-
spell modeling.
It is the conditional probability of leaving the state $y_{k-1}$ within the kth spell
conditional on $T_k \ge t$. Thus, given $T_k \ge t$, the survival function is
$$S^{(k)}(t \mid H_{k-1}, x_k) = P(T_k > t \mid H_{k-1}, x_k) = \prod_{i=t_{k-1}+1}^{t} \bigl(1 - \lambda^{(k)}(i \mid H_{k-1}, x_k)\bigr).$$
Parametric models for the kth spell have the familiar form
$$\lambda_r^{(k)}(t \mid H_{k-1}, x_k) = h\bigl(\gamma_{0tr}^{(k)} + x_k^T \gamma_r^{(k)}\bigr), \qquad (10.1)$$
10.1.1 Estimation
Let us first consider the case of uncensored data, with observed data given by
$y_{i0}, (t_{ik}, y_{ik}, x_{ik})$, $i = 1, \dots, n$, $k = 1, \dots, k_i$. The likelihood contribution of the ith
observation is
$$L_i = \prod_{k=1}^{k_i} P(T_k = t_k, Y_k = y_k, x_k \mid H_{k-1}),$$
where the subscript i is suppressed on the right-hand side. Since duration times are
consecutive, the contribution is given by
$$L_i = \prod_{k=1}^{k_i} P(T_k = t_k, Y_k = y_k \mid T_k \ge t_k, H_{k-1}, x_k)\, P(T_k \ge t_k \mid H_{k-1}, x_k)\, P(x_k \mid H_{k-1})$$
$$\;\;\;\; = \prod_{k=1}^{k_i} \lambda_{y_k}^{(k)}(t_k \mid H_{k-1}, x_k) \prod_{s=t_{k-1}+1}^{t_k - 1} \bigl(1 - \lambda^{(k)}(s \mid H_{k-1}, x_k)\bigr)\, P(x_k \mid H_{k-1}),$$
where $P(x_k \mid H_{k-1})$ represents the conditional density of the covariate in spell k given
the history. If it is not informative, the likelihood contribution is determined by the
hazards. If censoring occurs in the last spell, one has
$$L_i = \prod_{k=1}^{k_i} \bigl[\lambda_{y_k}^{(k)}(t_k \mid H_{k-1}, x_k)\bigr]^{\delta_k} \prod_{s=t_{k-1}+1}^{t_k - 1} \bigl(1 - \lambda^{(k)}(s \mid H_{k-1}, x_k)\bigr)\, P(x_k \mid H_{k-1}),$$
$$L = \prod_{i=1}^{n} \prod_{k=1}^{k_i} \prod_{r=1}^{m} \bigl[\lambda_r^{(k)}(t_{ik} \mid H_{k-1}, x_{ik})\bigr]^{\delta_{ikr}} \prod_{s=t_{i,k-1}+1}^{t_{ik} - 1} \bigl(1 - \lambda^{(k)}(s \mid H_{k-1}, x_{ik})\bigr)\, \bigl(1 - \lambda^{(k)}(t_{ik} \mid H_{k-1}, x_{ik})\bigr)^{\zeta_{ik}}\, P(x_k \mid H_{k-1}),$$
where
$$\delta_{ikr} = \begin{cases} 1, & \text{ith individual ends at } t_{ik} \text{ in state } r,\\ 0, & \text{otherwise}, \end{cases} \qquad
\zeta_{ik} = \begin{cases} 1, & \text{ith individual survives spell } k,\\ 0, & \text{otherwise}. \end{cases}$$
If the parameters are specific for the spell (as in Model (10.1)), the likelihood can
be maximized separately for the spells by considering only those individuals who
were observed for the specific spell. Similar forms of the likelihood can be derived
by assuming that the covariate process $x_{it}$, $t = 1, 2, \dots$, is linked to discrete time
rather than being fixed within spells. But for time-varying covariates the counting
process approach provides a much more general framework, and estimation should
be embedded into this framework (see, for example, Fleming and Harrington 2011).
In many applications the objective is less ambitious than described in the previous
section. One does not aim at modeling all transitions between various states over
time but focusses on one specific recurrent transition. For example, in studies on
unemployment duration, one person can be unemployed several times during the
time of the study. The main interest is often on the duration of unemployment,
whereas the duration of employment serves as a predictor but is not itself modeled.
For the ith individual, the corresponding simplified model uses the hazard function
which determines the duration of the kth spell given covariates $x_{ik}$.
Since $T_{i1}, \dots, T_{ik_i}$ are measurements on the same individual, the model should
include a subject-specific effect as in frailty models. The basic model for the hazard
of the ith individual in the kth spell is
where $b_i$ is the individual effect and it is assumed that the spell durations are
independent drawings from the common distribution of the frailty. For example,
one often assumes $b_i \sim N(0, \sigma^2)$.
Let the data for the survival times $T_{i1}, \dots, T_{ik_i}$ of the spells be given by
$t_{i1}, \dots, t_{ik_i}, \delta_i$, where $\delta_i = 0$ denotes censoring in the last spell. The kth spell is
again represented as a sequence of binary variables $y_{ik1}, \dots, y_{ikt_{ik}}$, where
$$y_{ikt} = \begin{cases} 1, & \text{ith individual fails at } t \text{ in spell } k \text{ given it reaches } t,\\ 0, & \text{otherwise}. \end{cases}$$
The representation with binary variables implies that the model can be estimated
in the same way as a binary mixed-effects model for repeated measurements. The
repeated measurements now refer to the separate spells and time points.
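To make the representation concrete, the following sketch expands hypothetical spell-level records (one row per spell with person id, spell number, observed duration, event indicator, and a covariate; all names are placeholders) into the binary person-spell-period format; the expanded data can then be passed to a binary mixed-effects model as described above:

# hypothetical spell-level data: event = 1 means a transition was observed,
# event = 0 means the spell was censored
spells <- data.frame(id = c(1, 1, 2), spell = c(1, 2, 1),
                     dur = c(3, 2, 4), event = c(1, 1, 0), x = c(0, 0, 1))
# expand each spell into one row per discrete period
long <- do.call(rbind, lapply(seq_len(nrow(spells)), function(i) {
  s <- spells[i, ]
  data.frame(id = s$id, spell = s$spell, t = seq_len(s$dur), x = s$x,
             y = c(rep(0, s$dur - 1), s$event))  # y = 1 only in the period of an observed event
}))
long$id <- factor(long$id)
head(long)
# with realistic sample sizes the expanded data could be analysed, e.g., by
# glmer(y ~ factor(t) + x + (1 | id), family = binomial) from lme4 or by
# gam(y ~ s(t) + x + s(id, bs = "re"), family = binomial) from mgcv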
Example 10.1 Unemployment Spells
We analyze the duration of unemployment spells using data from the German socio-economic
panel. Here we consider the time of unemployment (measured in 1-month intervals) with
terminating event “full-time job.” This is a typical example of multiple spells, since each
person in the socio-economic panel can have several unemployment spells during the study
period. For statistical analysis we use a subsample of 1693 persons who live in the German
region of North Rhine-Westphalia. For these persons data were collected between January 1990
and December 2011. Altogether, the data comprised 2512 unemployment spells; the overall
number of events was 1010. Using a continuation ratio model with subject-specific random
effects (bs = “re” option in the gam function of R package mgcv), we investigate the effects
of the covariates “age” (in years), “gender” (male/female), and “status” (married/married but
separated/single/divorced/widowed/husband/wife abroad) on the time to employment at full-time
job. In addition to the main effects of the covariates, we include an interaction term between
gender and status in the model. The baseline hazard is modeled by a penalized regression spline,
as described in Chap. 5.
Coefficient estimates are presented in Table 10.1. Judged by the main effects, men had a higher
chance of getting re-employed at full-time job than women. The results presented in Table 10.1
Table 10.1 German socio-economic panel. The table contains the coefficient estimates that were
obtained from a multiple-spell continuation ratio model with subject-specific random intercept
terms. Estimation was based on data from a subsample of 1693 persons that live in the German
region of North Rhine-Westphalia. The response variable was the time to re-employment at full-
time job (measured in months)
Variable name                                   Coef. estimate  Standard dev.  z value   p-value
Intercept                                             -2.8262        0.1584    -17.841   <0.0001
Gender female (ref. category)                          0.0000
Gender male                                            1.1075        0.1083     10.218   <0.0001
Age                                                   -0.0396        0.0031    -12.660   <0.0001
Status married (ref. category)                         0.0000
Status married but separated                           0.3514        0.2516      1.396    0.1625
Status single                                          0.4978        0.1346      3.698    0.0002
Status divorced                                       -0.3235        0.2496     -1.296    0.1948
Status widowed                                        -0.5645        1.0088     -0.560    0.5757
Status husband/wife abroad                             1.8210        0.6236      2.920    0.0035
Gender male: status married (ref. category)            0.0000
Gender male: status married but separated             -0.2015        0.3592     -0.561    0.5747
Gender male: status single                            -0.8586        0.1537     -5.585   <0.0001
Gender male: status divorced                           0.1261        0.2959      0.426    0.6698
Gender male: status widowed                            0.5682        1.1068      0.513    0.6076
Gender male: status husband/wife abroad               -2.4181        0.8590     -2.815    0.0048
Fig. 10.1 German socio-economic panel. The left panel shows the estimated smooth function $\hat{\gamma}_0(t)$ for the baseline hazard in Example 10.1. The right panel shows the respective hazard estimate $\hat{\lambda}(t \mid x)$ for a 40-year-old male person with “status = married”
Table 10.2 German socio-economic panel. The table contains the hazard estimates for a 40-year-
old person at t D 10 months. Estimates were obtained from the multiple-spell continuation ratio
model in Example 10.1. They reflect the chance of getting re-employed at full-time job given a
10-month unemployment spell. Numbers in brackets are 95 % confidence intervals
Hazard (= chance of getting re-employed)
Status Male Female
Married 0.056 (0.050, 0.064) 0.019 (0.016, 0.023)
Married but separated 0.065 (0.040, 0.103) 0.027 (0.017, 0.043)
Single 0.040 (0.034, 0.047) 0.031 (0.025, 0.038)
Divorced 0.047 (0.035, 0.063) 0.014 (0.009, 0.022)
Widowed 0.057 (0.024, 0.129) 0.011 (0.001, 0.075)
Husband/wife abroad 0.032 (0.010, 0.095) 0.109 (0.035, 0.292)
also show the well-known effect of age on the (conditional) probability of getting re-employed:
The older the persons, the smaller the probability of getting re-employed (coefficient estimate $=
-0.04$ per year, p-value < 0.0001). Figure 10.1 shows the estimated smooth function $\hat{\gamma}_0(t)$ for the
baseline hazard. It is seen that the baseline hazard increases in the time interval $(0, 10]$, reflecting
short-term unemployment. After 10 months, the baseline hazard starts to decrease, indicating that
the probability of re-employment decreases with the duration of unemployment. Figure 10.1 also
shows the corresponding hazard function for a 40-year-old male person with “status = married.”
It is seen that for an unemployment duration longer than 5 years the probability of re-employment
becomes very small. The variance of the subject-specific random effect is significantly different
from zero (p-value = 0.0228) but small in magnitude ($\hat{\sigma} = 1.1524 \cdot 10^{-8}$), suggesting that not
much heterogeneity is left when the explanatory variables are included.
The interaction effects between gender and status are summarized in Table 10.2. It is seen that
the probability of getting re-employed is higher for men than for women in all status categories,
with the only exception being the category “husband/wife abroad.” For some categories of “status,”
the relation between effects is similar within the male and female groups. For example, for both
male and female subjects chances of getting re-employed are higher in the “married but separated”
group than in the “married” group. However, there are also notable differences between male and
female subjects. In particular, women have a higher chance of getting re-employed if they are
single instead of married, whereas for male subjects the respective coefficient estimates suggest
an opposite trend. It should be noted that the results in Tables 10.1 and 10.2 do not provide
information on whether subjects were actively looking for a full-time job during the study period.
For this reason the hazard estimates of the continuation ratio model do not only reflect unequal
opportunities for the various subgroups in the labor market, but also the family situation(s) of the
study participants and transitions to part-time employment.
10.3 Generalized Estimation Approach to Repeated Measurements
As in the preceding section we consider the simplified model for repeated failure
times, which for the ith person uses the hazard function
In estimation one should be aware that Ti1 ; : : : ; Tiki are measurements on the same
person and therefore cannot be assumed to be independent. For example, persons
with many job opportunities (large values in the hazard) tend to have shorter
unemployment spells during the whole observation period. In contrast, less qualified
persons with low values in the hazard tend to have longer spells. Model (10.2)
accounts for this potential dependence by including a subject-specific random effect.
Alternatively, one can consider
$$\lambda^{(k)}(t \mid x_{ik}) = h\bigl(\gamma_{0t}^{(k)} + x_{ik}^T \gamma^{(k)}\bigr) \qquad (10.3)$$
as a model for the marginal responses $T_{ik}$. Model (10.3) does not specify the
association between components of the vector $(T_{i1}, \dots, T_{ik_i})$. Since each component
follows a multinomial distribution, the vector $(T_{i1}, \dots, T_{ik_i})$ itself is multinomially
distributed. Maximum likelihood estimation for marginal models with multinomi-
ally distributed components is a rather advanced topic, and most approaches are
limited to binary responses (McCullagh and Nelder 1989, Fitzmaurice and Laird
1993, Lang and Agresti 1994, Glonek and McCullagh 1995, and Bergsma et al.
2009). A more easily accessible approach is the estimation of Model (10.3) by
generalized estimation equations (GEEs), which is sketched in the following.
Let us consider one component, $T_{ik}$, of the multiple response vector
$(T_{i1}, \dots, T_{ik_i})$. It can again be represented as a sequence of binary variables
$y_{ik1}, \dots, y_{ikt_{ik}}$, where
$$y_{ikt} = \begin{cases} 1, & \text{ith individual fails at } t \text{ in spell } k \text{ given it reaches } t,\\ 0, & \text{otherwise}. \end{cases}$$
The model for the binary variables is determined by $P(y_{ikt} = 1) = \lambda^{(k)}(t \mid x_{ik})$. It
is marginal in the sense that only one spell is modeled, but conditional on previous
responses within one spell because only sequences of the form $(0, 0, \dots, 1)$ can
occur. The corresponding variance is $\operatorname{var}(y_{ikt}) = \lambda^{(k)}(t \mid x_{ik})(1 - \lambda^{(k)}(t \mid x_{ik}))$. The
strength of generalized estimation equations (GEEs) is that the covariance matrix of
the responses does not have to be specified correctly. Instead one works with a so-
called working covariance structure. A simple working covariance structure for the
whole sequence is given by
$$\widetilde{\operatorname{cov}}(y_{ikr}, y_{iks}) = -\lambda^{(k)}(r \mid x_{ik})\,\lambda^{(k)}(s \mid x_{ik}), \quad r \ne s,$$
and
$$\widetilde{\operatorname{cov}}(y_{ikr}, y_{i\tilde{k}s}) = 0, \quad k \ne \tilde{k}.$$
The first equation specifies the working covariance within spells as the usual covari-
ance of the multinomial distribution, and the second equation uses independence
as working covariance for components that refer to distinct spells. Let $W_i$ contain
the working variances and covariances in correspondence to the vector $y_i$. Then the
GEE that has to be solved to obtain an estimate of the coefficients is
$$\sum_{i=1}^{n} X_i^T D_i(\beta)\, W_i^{-1}(\beta, \alpha)\, \bigl(y_i - \mu_i(\beta)\bigr) = 0. \qquad (10.4)$$
The covariance of the resulting estimate can be approximated by the sandwich matrix
$$\widehat{\operatorname{cov}}(\hat{\beta}) = V_W^{-1} V_\Sigma V_W^{-1},$$
where $V_W = \sum_{i=1}^{n} X_i^T D_i W_i^{-1} D_i X_i$, $V_\Sigma = \sum_{i=1}^{n} X_i^T D_i W_i^{-1} \Sigma_i W_i^{-1} D_i X_i$, and $\Sigma_i$ is
the covariance matrix of $y_i$.
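A hedged sketch of GEE estimation for the binary representation, using the geeglm function of the geepack package (Hojsgaard et al. 2014) and reusing the simulated person-period data dat from the earlier sketch; the independence working correlation chosen here is a simplification of the working covariance structure described in this section:

library(geepack)
dat <- dat[order(dat$id), ]                # geeglm expects data ordered by cluster
fit_gee <- geeglm(y ~ factor(time) + x, id = id, data = dat,
                  family = binomial("logit"), corstr = "independence")
summary(fit_gee)                           # coefficients with robust (sandwich) standard errors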
10.5 Software
10.6 Exercises
10.1 Reconsider the data from the German socio-economic panel that were ana-
lyzed in Example 10.1. Figure 10.2 and Tables 10.3 and 10.4 contain the results
obtained from a continuation ratio model with subject-specific random effects that
was fitted to a subsample of 995 persons living in the German region of Baden–
Wuerttemberg. This subsample comprised 1489 unemployment spells; the overall
number of events was 652.
(a) Interpret the results and compare them to the results obtained from the region
of North Rhine-Westphalia shown in Fig. 10.1 and Tables 10.1 and 10.2.
(b) Compare the numbers of events and analyzed persons in the two German
regions and discuss the results in Fig. 10.2 and Tables 10.3 and 10.4 in the light
of the remark on page 194.
Fig. 10.2 German socio-economic panel. The left panel shows the estimated smooth function $\hat{\gamma}_0(t)$ for the baseline hazard in Exercise 10.1. The right panel shows the respective hazard estimate $\hat{\lambda}(t \mid x)$ for a 40-year-old male person with “status = married”
Table 10.3 German socio-economic panel. The table contains the coefficient estimates that were
obtained from a multiple-spell continuation ratio model with subject-specific random intercept
terms. Estimation was based on data from a subsample of 995 persons who live in the German
region of Baden–Wuerttemberg. The response variable was the time to re-employment at full-time
job (measured in months)
Variable name                                   Coef. estimate  Standard dev.  z value   p-value
Intercept                                             -2.323          0.201    -11.556   < 0.001
Gender female (ref. category)                          0.000
Gender male                                            0.906          0.135      6.714   < 0.001
Age                                                   -0.044          0.004    -10.573   < 0.001
Status married (ref. category)                         0.000
Status married but separated                           0.169          0.376      0.449     0.653
Status single                                          0.596          0.158      3.762   < 0.001
Status divorced                                        0.268          0.244      1.099     0.271
Status widowed                                        -1.067          1.010     -1.057     0.290
Status husband/wife abroad                           -15.06        1669.0       -0.009     0.992
Gender male: status married (ref. category)            0.000
Gender male: status married but separated             -1.099          0.635     -1.729     0.083
Gender male: status single                            -0.835          0.181     -4.591   < 0.001
Gender male: status divorced                           0.106          0.343      0.309     0.757
Gender male: status widowed                            1.474          1.095      1.347     0.178
Gender male: status husband/wife abroad                0.000       1833.0        0.000     1.000
Table 10.4 German socio-economic panel. The table contains the hazard estimates for a 40-year-
old person at t D 10 months. Estimates were obtained from the multiple-spell continuation ratio
model in Exercise 10.1. They reflect the chance of getting re-employed at full-time job given a
10-month unemployment spell. Numbers in brackets are 95 % confidence intervals
Hazard (= chance of getting re-employed)
Status Male Female
Married 0.046 (0.040, 0.053) 0.019 (0.015, 0.023)
Married but separated 0.018 (0.007, 0.049) 0.022 (0.011, 0.044)
Single 0.036 (0.030, 0.043) 0.034 (0.027, 0.042)
Divorced 0.065 (0.042, 0.099) 0.024 (0.016, 0.037)
Widowed 0.067 (0.030, 0.141) 0.006 (0.001, 0.045)
Husband/wife abroad 0.000 (0.000, 1.000) 0.000 (0.000, 1.000)
10.2 Investigate the behavior of the binary mixed effects and GEE procedures by
generating simulated data as follows:
(a) Specify the number of subjects and generate values of a binary covariate x1 and
a three-level covariate x2 by using the following probability table:
            x1 = 0   x1 = 1
x2 = 0       0.26     0.30
x2 = 1       0.15     0.14
x2 = 2       0.09     0.06
Betensky, R. A., Rabinowitz, D., & Tsiatis, A. A. (2001). Computationally simple accelerated
failure time regression for interval censored data. Biometrika, 88, 703–711.
Beyersmann, J., Allignol, A., & Schumacher, M. (2011). Competing risks and multistate models
with R. New York: Springer.
Bojesen Christensen, R. H. (2015). Ordinal: Regression models for ordinal data. R package
version 2015.6-28. http://cran.r-project.org/web/packages/ordinal/
Bonde, J., Hjollund, N., Jensen, T., Ernst, E., Kolstad, H., Henriksen, T., et al. (1998). A follow-up
study of environmental and biologic determinants of fertility among 430 Danish first-pregnancy
planners: Design and methods. Reproductive Toxicology, 12, 19–27.
Bondell, H. D., & Reich, B. J. (2009). Simultaneous factor selection and collapsing levels in ANOVA.
Biometrics, 65, 169–177.
Bou-Hamad, I., Larocque, D., & Ben-Ameur, H. (2011a). Discrete-time survival trees and forests
with time-varying covariates: Application to bankruptcy data. Statistical Modelling, 11, 429–
446.
Bou-Hamad, I., Larocque, D., & Ben-Ameur, H. (2011b). A review of survival trees. Statistics
Surveys, 5, 44–71.
Bou-Hamad, I., Larocque, D., Ben-Ameur, H., Masse, L., Vitaro, F., & Tremblay, R. (2009).
Discrete-time survival trees. Canadian Journal of Statistics, 37, 17–32.
Boulesteix, A.-L., & Hothorn, T. (2010). Testing the additional predictive value of high-
dimensional data. BMC Bioinformatics, 11, 78.
Boulesteix, A.-L., & Sauerbrei, W. (2011). Added predictive value of high-throughput molecular
data to clinical data and its validation. Briefings in Bioinformatics, 12, 215–229.
Box-Steffensmeier, J. M., & Jones, B. S. (2004). Event history modeling: A guide for social
scientists. New York: Cambridge University Press.
Breheny, P. (2015). grpreg: Regularization paths for regression models with grouped covariates.
R package version 2.8-1. http://cran.r-project.org/web/packages/grpreg/index.html
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Breiman, L., Cutler, A., Liaw, A., & Wiener, M. (2015). randomForest: Breiman and Cutler’s
random forests for classification and regression. R package version 4.6-12. http://cran.r-
project.org/web/packages/randomForest
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression
trees. Monterey, CA: Wadsworth.
Breslow, N., & Crowley, J. (1974). A large sample study of the life table and product limit estimates
under random censorship. The Annals of Statistics, 2, 437–453.
Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed
models. Journal of the American Statistical Association, 88, 9–25.
Breslow, N. E., & Lin, X. (1995). Bias correction in generalized linear mixed models with a single
component of dispersion. Biometrika, 82, 81–91.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather
Review, 78, 1–3.
Broström, G. (2013). glmmML: Generalized linear models with clustering. R package version 1.0.
http://cran.r-project.org/web/packages/glmmML
Boström, H. (2007). Estimating class probabilities in random forests. In ICMLA ’07: Proceedings
of the 6th International Conference on Machine Learning and Applications (pp. 211–216).
Washington, DC: IEEE Computer Society.
Brouhns, N., Denuit, M., & Vermunt, J. K. (2002). A Poisson log-bilinear regression approach to
the construction of projected lifetables. Insurance: Mathematics and Economics, 31, 373–393.
Brown, C. (1975). On the use of indicator variables for studying the time-dependence of parameters
in a response-time model. Biometrics, 31, 863–872.
Brüderl, J., Preisendörfer, P., & Ziegler, R. (1992). Survival chances of newly founded business
organizations. American Sociological Review, 57, 227–242.
Bühlmann, P. (2006). Boosting for high-dimensional linear models. Annals of Statistics, 34, 559–
583.
Bühlmann, P., Gertheiss, J., Hieke, S., Kneib, T., Ma, S., Schumacher, M., et al. (2014). Discussion
of “The evolution of boosting algorithms” and “Extending statistical boosting”. Methods of
Information in Medicine, 53, 436–445.
Bühlmann, P., & Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model
fitting (with discussion). Statistical Science, 22, 477–505.
Bühlmann, P., & Yu, B. (2003). Boosting with the L2 loss: Regression and classification. Journal
of the American Statistical Association, 98, 324–339.
Cai, T., & Betensky, R. A. (2003). Hazard regression for interval-censored data with penalized
spline. Biometrics, 59, 570–579.
Callens, M., & Croux, C. (2009). Poverty dynamics in Europe: A multilevel recurrent discrete-time
hazard analysis. International Sociology, 24, 368–396.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and applications.
Cambridge: Cambridge University Press.
Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger
than n. Annals of Statistics, 35, 2313–2351.
Cantoni, E., Flemming, J. M., & Ronchetti, E. (2011). Variable selection in additive models by
non-negative garrote. Statistical modelling, 11, 237–252.
Capaldi, D. M., Crosby, L., & Stoolmiller, M. (1996). Predicting the timing of first sexual
intercourse for adolescent males. Child Development, 67, 344–359.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. The Review of Economic
Studies, 47, 225–238.
Chen, D., Sun, J., & Peace, K. E. (2012). Interval-censored time-to-event data: Methods and
applications. New York: Chapman & Hall/CRC.
Chiang, C. L. (1972). On constructing current life tables. Journal of the American Statistical
Association, 67, 538–541.
Chiang, C. L. (1984). The life table and its applications. Malabar, FL: Robert E. Krieger
Publishing.
Claeskens, G., Krivobokova, T., & Opsomer, J. D. (2009). Asymptotic properties of penalized
spline estimators. Biometrika, 96, 529–544.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal
of the American Statistical Association, 74, 829–836.
Cox, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal
Statistical Society, Series B, 34, 187–220.
Croissant, Y. (2015). Ecdat: Data sets for econometrics. R package version 0.2-9. http://cran.r-
project.org/web/packages/Ecdat/index.html
Culpepper, S. A. (2014). If at first you don’t succeed, try, try again – applications of sequential
IRT models to cognitive assessments. Applied Psychological Measurement, 38, 632–644.
Currie, I. D., Durban, M., & Eilers, P. H. C. (2004). Smoothing and forecasting mortality rates.
Statistical Modelling, 4, 279–298.
Czado, C. (1992). On link selection in generalized linear models. In Advances in GLIM and
statistical modelling. Springer lecture notes in statistics (Vol. 78, pp. 60–65). New York:
Springer.
Czado, C. (1997). On selecting parametric link transformation families in generalized linear
models. Journal of Statistical Planning and Inference, 61, 125–139.
De Boeck, P., Bakker, M., Zwitser, R., Nivard, M., Hofman, A., Tuerlinckx, F., et al. (2011). The
estimation of item response models with the lmer function from the lme4 package in R. Journal
of Statistical Software, 39(12), 1–28.
De Boeck, P., & Wilson, M. (2004). A framework for item response models. New York: Springer.
De Boor, C. (1978). A practical guide to splines. New York: Springer.
Delwarde, A., Denuit, M., & Eilers, P. (2007). Smoothing the Lee–Carter and Poisson log-bilinear
models for mortality forecasting – a penalized log-likelihood approach. Statistical Modelling, 7,
29–48.
Diggle, P. J., Heagerty, P., Liang, K.-Y., & Zeger, S. L. (2002). Analysis of longitudinal data (2nd
ed.). New York: Oxford University Press.
Efron, B. (1988). Logistic regression, survival analysis, and the Kaplan-Meier-curve. Journal of
the American Statistical Association, 83, 414–425.
Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and Penalties. Statistical
Science, 11, 89–121.
Eilers, P. H. C., & Marx, B. D. (2003). Multivariate calibration with temperature interaction
using two-dimensional penalized signal regression. Chemometrics and Intelligent Laboratory
Systems, 66, 159–174.
Elbers, C., & Ridder, G. (1982). True and spurious duration dependence: The identifiability of the
proportional hazard model. The Review of Economic Studies, 49, 403–409.
Enberg, J., Gottschalk, P., & Wolf, D. (1990). A random-effects logit model of work-welfare
transitions. Journal of Econometrics, 43, 63–75.
Fahrmeir, L. (1994). Dynamic modelling and penalized likelihood estimation for discrete time
survival data. Biometrika, 81, 317–330.
Fahrmeir, L. (1998). Discrete survival-time models. In P. Armitage & T. Colton (Eds.),
Encyclopedia of biostatistics (Vol. 2). Chichester: Wiley.
Fahrmeir, L., Hamerle, A., & Tutz, G. (1996). Regressionsmodelle zur Analyse von Verweildauern.
In L. Fahrmeir, A. Hamerle, & G. Tutz (Eds.), Multivariate statistische Verfahren. Berlin: De
Gruyter.
Fahrmeir, L., & Kneib, T. (2011). Bayesian smoothing and regression for longitudinal, spatial and
event history data. Oxford: Oxford University Press.
Fahrmeir, L., & Knorr-Held, L. (1997). Dynamic discrete-time duration models: Estimation via
Markov Chain Monte Carlo. Sociological Methodology, 27, 417–452.
Fahrmeir, L., & Tutz, G. (2001). Multivariate statistical modelling based on generalized linear
models. New York: Springer.
Fahrmeir, L., & Wagenpfeil, S. (1996). Smoothing hazard functions and time-varying effects
in discrete duration and competing risks models. Journal of the American Statistical
Association, 91, 1584–1594.
Fan, J., & Gijbels, I. (1996). Local polynomial modelling and its applications. London: Chapman
& Hall.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical Association, 96, 1348–1360.
Ferri, C., Flach, P. A., & Hernandez-Orallo, J. (2003). Improving the AUC of probabilistic
estimation trees. In Proceedings of the 14th European Conference on Artifical Intelligence
(Vol. 2837, pp. 121–132). Berlin: Springer.
Finkelstein, D. M. (1986). A proportional hazards model for interval-censored failure time data.
Biometrics, 42, 845–854.
Fitzmaurice, G. M., & Laird, N. M. (1993). A likelihood-based method for analysing longitudinal
binary responses. Biometrika, 80, 141–151.
Fleming, T. R., & Harrington, D. P. (2011). Counting processes and survival analysis. New York:
Wiley.
Follmann, D., & Lambert, D. (1989). Generalizing logistic regression by non-parametric mixing.
Journal of the American Statistical Association, 84, 295–300.
Fox, J., & Weisberg, S. (2015). car: Companion to applied regression. R package version 2.1-0.
http://cran.r-project.org/web/packages/car
Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools
(with discussion). Technometrics, 35, 109–148.
Frederiksen, A., Honoré, B. E., & Hu, L. (2007). Discrete time duration models with group-level
heterogeneity. Journal of Econometrics, 141, 1014–1043.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In
Proceedings of the Thirteenth International Conference on Machine Learning (pp. 148–156).
San Francisco: Morgan Kaufmann.
Friedman, J., Hastie, T., & Tibshirani, R. (2015). glmnet: Lasso and elastic-net regularized
generalized linear models. R package version 2.0-2. http://cran.r-project.org/web/packages/
glmnet/
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals
of Statistics, 29, 1189–1232.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view
of boosting. Annals of Statistics, 28, 337–407.
Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. New York: Springer.
Gerds, T. A. (2015). pec: Prediction error curves for survival models. R package version 2.4.7.
http://cran.r-project.org/web/packages/pec/
Gerds, T. A., & Schumacher, M. (2006). Consistent estimation of the expected Brier score in
general survival models with right-censored event times. Biometrical Journal, 48, 1029–1040.
Gertheiss, J., & Tutz, G. (2010). Sparse modeling of categorial explanatory variables. Annals of
Applied Statistics, 4, 2150–2180.
Glonek, G. F. V., & McCullagh, P. (1995). Multivariate logistic models. Journal of the Royal
Statistical Society, Series B, 57, 533–546.
Gneiting, T., & Raftery, A. (2007). Strictly proper scoring rules, prediction, and estimation. Journal
of the American Statistical Association, 102, 359–376.
Goeman, J., Meijer, R., & Chaturvedi, N. (2014). penalized: L1 (lasso and fused lasso) and L2
(ridge) penalized estimation in GLMs and in the Cox model. R package version 0.9-45. http://
cran.r-project.org/web/packages/penalized/index.html
Graf, E., Schmoor, C., Sauerbrei, W., & Schumacher, M. (1999). Assessment and comparison of
prognostic classification schemes for survival data. Statistics in Medicine, 18, 2529–2545.
Greenwood, M. (1926). The natural duration of cancer. Reports on Public Health and Medical
Subjects 33, His Majesty’s Stationery Office, London.
Groll, A. (2015). glmmLasso: Variable selection for generalized linear mixed models by
L1-penalized estimation. R package version 1.3.6. http://cran.r-project.org/web/packages/
glmmLasso
Groll, A., & Tutz, G. (2014). Variable selection for generalized linear mixed models by L1 -
penalized estimation. Statistics and Computing, 24, 137–154.
Groll, A., & Tutz, G. (2016). Variable selection in discrete survival models including heterogeneity.
Lifetime Data Analysis [published online].
Grün, B., & Leisch, F. (2008). FlexMix version 2: Finite mixtures with concomitant variables and
varying and constant parameters. Journal of Statistical Software, 28(4), 1–35.
Gu, C. (2002). Smoothing splines ANOVA models. New York: Springer.
Gu, C., & Wahba, G. (1993). Semiparametric analysis of variance with tensor product thin plate
splines. Journal of the Royal Statistical Society, Series B, 55, 353–368.
Ham, J. C., & Rea, S. A., Jr. (1987). Unemployment insurance and male unemployment duration
in Canada. Journal of Labor Economics, 5, 325–353.
Hamerle, A. (1989). Multiple-spell regression models for duration data. Applied Statistics, 38,
127–138.
Hamerle, A., & Tutz, G. (1989). Diskrete Modelle zur Analyse von Verweildauern und Leben-
szeiten. Frankfurt/New York: Campus Verlag.
Han, A., & Hausman, J. A. (1990). Flexible parametric estimation of duration and competing risk
models. Journal of Applied Econometrics, 5, 1–28.
Harrell, F. E., Jr., Lee, K. L., & Mark, D. B. (1996). Multivariable prognostic models: Issues in
developing models, evaluating assumptions and adequacy, and measuring and reducing errors.
Statistics in Medicine, 15, 361–387.
Hartzel, J., Liu, I., & Agresti, A. (2001). Describing heterogenous effects in stratified ordinal
contingency tables, with applications to multi-center clinical trials. Computational Statistics &
Data Analysis, 35, 429–449.
Hastie, T., & Loader, C. (1993). Local regression: Automatic kernel carpentry. Statistical
Science, 8, 120–143.
Hastie, T., & Tibshirani, R. (1990). Generalized additive models. London: Chapman & Hall.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning (2nd ed.).
New York: Springer.
Heagerty, P. J., & Kurland, B. F. (2001). Misspecified maximum likelihood estimates and
generalised linear mixed models. Biometrika, 88, 973–984.
Heagerty, P. J., Lumley, T., & Pepe, M. S. (2000). Time-dependent ROC curves for censored
survival data and a diagnostic marker. Biometrics, 56, 337–344.
Heagerty, P. J., & Zheng, Y. (2005). Survival model predictive accuracy and ROC curves.
Biometrics, 61, 92–105.
Heckman, J. J., & Singer, B. (1984a). Econometric duration analysis. Journal of Econometrics, 24,
63–132.
Heckman, J. J., & Singer, B. (1984b). A method for minimizing the impact of distributional
assumptions in econometric models of duration. Econometrica, 52, 271–320.
Hedeker, D., Siddiqui, O., & Hu, F. B. (2000). Random-effects regression analysis of correlated
grouped-time survival data. Statistical Methods in Medical Research, 9, 161–179.
Hess, W. (2009). A flexible hazard rate model for grouped duration data. Working Paper
No. 2009:18, Department of Economics, Lund University.
Hess, W., & Persson, M. (2012). The duration of trade revisited – continuous-time vs. discrete-time
hazards. Empirical Economics, 43, 1083–1107.
Hess, W., Tutz, G., & Gertheiss, J. (2014). A flexible link function for discrete-time duration
models. Technical Report 155, Department of Statistics, University of Munich.
Hinde, J. (1982). Compound Poisson regression models. In R. Gilchrist (Ed.), GLIM 1982
International Conference on Generalized Linear Models (pp. 109–121). New York: Springer.
Hofner, B., Mayr, A., Robinzonov, N., & Schmid, M. (2014). Model-based boosting in R: A
hands-on tutorial using the R package mboost. Computational Statistics, 29, 3–35.
Hojsgaard, S., Halekoh, U., & Yan, J. (2014). geepack: Generalized estimating equation package.
R package version 1.2-0. http://cran.r-project.org/web/packages/geepack/index.html
Hothorn, T., Bühlmann, P., Kneib, T., Schmid, M., & Hofner, B. (2015). mboost: Model-based
boosting. R package version 2.5-0. http://cran.r-project.org/web/packages/mboost/
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional
inference framework. Journal of Computational and Graphical Statistics, 15, 651–674.
Hothorn, T., Lausen, B., Benner, A., & Radespiel-Tröger, M. (2004). Bagging survival trees.
Statistics in Medicine, 23, 77–91.
Hougaard, P. (1984). Life table methods for heterogeneous populations: Distributions describing
the heterogeneity. Biometrika, 71, 75–83.
Huinink, J., Brüderl, J., Nauck, B., Walper, S., Castiglioni, L., & Feldhaus, M. (2011). Panel
analysis of intimate relationships and family dynamics (pairfam): Conceptual framework and
design. Journal of Family Research, 23, 77–101.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests.
Annals of Applied Statistics, 2, 841–860.
Ishwaran, H., Kogalur, U. B., Chen, X., & Minn, A. J. (2011). Random survival forests for high-
dimensional data. Statistical Analysis and Data Mining, 4, 115–132.
Jackman, S. (2015). pscl: Political science computational laboratory, Stanford University. R
package version 1.4.9. http://cran.r-project.org/web/packages/pscl
James, G. M., & Radchenko, P. (2009). A generalized Dantzig selector with shrinkage tuning.
Biometrika, 96, 323–337.
Jenkins, S. P. (2004). Survival analysis. Unpublished manuscript, Institute for Social and Economic
Research, University of Essex. http://www.iser.essex.ac.uk/teaching/degree/stephenj/ec968/
pdfs/ec968lnotesv6.pdf
Joergensen, H. S., Nakayama, H., Reith, J., Raaschou, H. O., & Olsen, T. S. (1996). Acute stroke
with atrial fibrillation - the Copenhagen Stroke Study. Stroke, 27, 1765–1769.
Johnson, I. Y. (2006). Analysis of stopout behavior at a public research university: The multi-spell
discrete-time approach. Research in Higher Education, 47, 905–934.
Jones, B. (1994). A Longitudinal Perspective on Congressional Elections. Ph.D. thesis, State
University of New York at Stony Brook.
Kalbfleisch, J. D., & Prentice, R. L. (2002). The statistical analysis of failure time data (2nd ed.).
New York: Wiley.
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations.
Journal of the American Statistical Association, 53, 457–481.
Kauermann, G., Krivobokova, T., & Fahrmeir, L. (2009). Some asymptotic results on generalized
penalized spline smoothing. Journal of the Royal Statistical Society, Series B, 71, 487–503.
Kauermann, G., Tutz, G., & Brüderl, J. (2005). The survival of newly founded firms: A case-
study into varying-coefficient models. Journal of the Royal Statistical Society, Series A, 168,
145–158.
Kim, J.-H. (2009). Estimating classification error rate: Repeated cross-validation, repeated hold-
out and bootstrap. Computational Statistics and Data Analysis, 53, 3735–3745.
Kim, Y.-J., & Jhun, M. (2008). Cure rate model with interval censored data. Statistics in
Medicine, 27, 3–14.
Klein, J. P., & Moeschberger, M. L. (2003). Survival analysis: Statistical methods for censored
and truncated data (2nd ed.). New York: Springer.
Klein, J. P., Moeschberger, M. L., & Yan, J. (2012). KMsurv: Data sets from Klein and
Moeschberger (1997), survival analysis. R package version 0.1-5. http://cran.r-project.org/
web/packages/KMsurv
Kleinbaum, D. G., & Klein, M. (2013). Survival analysis: A self-learning text (3rd ed.). New York:
Springer.
Koenker, R., & Yoon, J. (2009). Parametric links for binary choice models: A Fisherian–Bayesian
colloquy. Journal of Econometrics, 152, 120–130.
Kooperberg, C., Stone, C. J., & Truong, Y. K. (1995). Hazard regression. Journal of the American
Statistical Association, 90, 78–94.
Kruskal, W. H. (1958). Ordinal measures of association. Journal of the American Statistical
Association, 53, 814–861.
Kuk, A. Y., & Chen, C.-H. (1992). A mixture model combining logistic regression with
proportional hazards regression. Biometrika, 79, 531–541.
Laird, N., & Olivier, D. (1981). Covariance analysis of censored survival data using log-linear
analysis techniques. Journal of the American Statistical Association, 76, 231–240.
Lancaster, T. (1985). Generalised residuals and heterogeneous duration models: With applications
to the Weibull model. Journal of Econometrics, 28, 155–169.
Lancaster, T. (1992). The econometric analysis of transition data. Cambridge: Cambridge
University Press.
Land, K. C., Nagin, D. S., & McCall, P. L. (2001). Discrete-time hazard regression models with
hidden heterogeneity: The semiparametric mixed Poisson regression approach. Sociological
Methods & Research, 29, 342–373.
Lang, J. B., & Agresti, A. (1994). Simultaneously modelling joint and marginal distributions of
multivariate categorical responses. Journal of the American Statistical Association, 89, 625–
632.
Lawless, J. F. (1982). Statistical models and methods for lifetime data. New York: Wiley.
LeBlanc, M., & Crowley, J. (1993). Survival trees by goodness of split. Journal of the American
Statistical Association, 88, 457–467.
LeBlanc, M., & Crowley, J. (1995). A review of tree-based prognostic models. Journal of Cancer
Treatment and Research, 75, 113–124.
Lee, R. (2000). The Lee-Carter method for forecasting mortality, with various extensions and
applications. North American Actuarial Journal, 4, 80–91.
Lee, R. D., & Carter, L. R. (1992). Modeling and forecasting US mortality. Journal of the American
Statistical Association, 87, 659–671.
Leitenstorfer, F., & Tutz, G. (2011). Estimation of single-index models based on boosting
techniques. Statistical Modelling, 11, 183–197.
Li, C.-S., Taylor, J. M., & Sy, J. P. (2001). Identifiability of cure models. Statistics & Probability
Letters, 54, 389–395.
Li, Y., & Ruppert, D. (2008). On the asymptotics of penalized splines. Biometrika, 95, 415–436.
Liang, K.-Y., & Zeger, S. (1986). Longitudinal data analysis using generalized linear models.
Biometrika, 73, 13–22.
Liang, K.-Y., & Zeger, S. (1993). Regression analysis for correlated data. Annual Review of Public
Health, 14, 43–68.
Liang, K.-Y., Zeger, S., & Qaqish, B. (1992). Multivariate regression analysis for categorical data
(with discussion). Journal of the Royal Statistical Society, Series B, 54, 3–40.
Lichman, M. (2013). UCI machine learning repository. School of Information and Computer
Sciences, University of California, Irvine. http://archive.ics.uci.edu/ml
Lillard, L. A., & Panis, C. W. (1996). Marital status and mortality: The role of health.
Demography, 33, 313–327.
Lin, X., & Breslow, N. E. (1996). Bias correction in generalized linear mixed models with multiple
components of dispersion. Journal of the American Statistical Association, 91, 1007–1016.
Lin, X., & Zhang, D. (1999). Inference in generalized additive mixed models by using smoothing
splines. Journal of the Royal Statistical Society, Series B, 61, 381–400.
Lindsey, J. C., & Ryan, L. M. (1998). Methods for interval-censored data. Statistics in
Medicine, 17, 219–238.
Liu, Q., & Pierce, D. A. (1994). A note on Gauss-Hermite quadrature. Biometrika, 81, 624–629.
Loader, C. (1999). Local regression and likelihood. New York: Springer.
Maller, R. A., & Zhou, X. (1996). Survival analysis with long-term survivors. New York: Wiley.
Mantel, N., & Hankey, B. F. (1978). A logistic regression analysis of response time data where the
hazard function is time dependent. Communications in Statistics – Theory and Methods, A7,
333–347.
Marra, G., & Wood, S. N. (2011). Practical variable selection for generalized additive models.
Computational Statistics & Data Analysis, 55, 2372–2387.
Marx, B. D., & Eilers, P. H. C. (1998). Direct generalized additive modelling with penalized
likelihood. Computational Statistics & Data Analysis, 28, 193–209.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Mayer, P., Larocque, D., & Schmid, M. (2014). DStree: Recursive partitioning for discrete-time
survival trees. R package version 1.0. http://cran.r-project.org/web/packages/DStree/index.
html
Mayr, A., Binder, H., Gefeller, O., & Schmid, M. (2014a). The evolution of boosting algorithms
(with discussion). Methods of Information in Medicine, 53, 419–427.
Mayr, A., Binder, H., Gefeller, O., & Schmid, M. (2014b). Extending statistical boosting (with
discussion). Methods of Information in Medicine, 53, 428–435.
Mayr, A., & Schmid, M. (2014). Boosting the concordance index for survival data – a unified
framework to derive and evaluate biomarker combinations. PLoS One, 9(1), e84483.
McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal
Statistical Society, Series B, 42, 109–127.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). New York: Chapman
& Hall.
McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models.
Journal of the American Statistical Association, 92, 162–170.
McCulloch, C. E., & Neuhaus, J. M. (2011). Misspecifying the shape of a random effects
distribution: Why getting it wrong may not matter. Statistical Science, 26, 388–402.
McCulloch, C. E., & Searle, S. (2001). Generalized, linear, and mixed models. New York: Wiley.
McDonald, J. W., & Rosina, A. (2001). Mixture modelling of recurrent event times with long-
term survivors: Analysis of Hutterite birth intervals. Statistical Methods and Applications, 10,
257–272.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
McCall, B. P. (1996). Unemployment insurance rules, joblessness, and part-time work. Economet-
rica, 64, 647–682.
Meier, L. (2015). grplasso: Fitting user specified models with Group Lasso penalty. R package
version 0.4-5. http://cran.r-project.org/web/packages/grplasso/index.html
Meier, L., van de Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal
of the Royal Statistical Society, Series B, 70, 53–71.
Molinaro, A., Simon, R., & Pfeiffer, R. M. (2005). Prediction error estimation: A comparison of
resampling methods. Bioinformatics, 21, 3301–3307.
Morgan, B. J. T. (1985). The cubic logistic model for quantal assay data. Applied Statistics, 34,
105–113.
Morgan, J. N., & Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal.
Journal of the American Statistical Association, 58, 415–435.
Möst, S. (2014). Regularization in Discrete Survival Models. Ph.D. Thesis, Department of
Statistics, University of Munich.
Möst, S., Pößnecker, W., & Tutz, G. (2015). Variable selection for discrete competing risks models.
Quality & Quantity. doi:10.1007/s11135-015-0222-0.
Muggeo, V. M., Attanasio, M., & Porcu, M. (2009). A segmented regression model for event
history data: An application to the fertility patterns in Italy. Journal of Applied Statistics, 36,
973–988.
Muggeo, V. M., & Ferrara, G. (2008). Fitting generalized linear models with unspecified link
function: A P-spline approach. Computational Statistics & Data Analysis, 52, 2529–2537.
Muthén, B., & Masyn, K. (2005). Discrete-time survival mixture analysis. Journal of Educational
and Behavioral Statistics, 30, 27–58.
Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination.
Biometrika, 78, 691–692.
Nakazawa, M. (2015). fmsb: Functions for medical statistics book with some demographic data.
R package version 0.5.2. http://cran.r-project.org/web/packages/fmsb
Narendranathan, W., & Stewart, M. B. (1993). Modelling the probability of leaving unemployment:
Competing risks models with flexible base-line hazards. Applied Statistics, 42, 63–83.
Nauck, B., Brüderl, J., Huinink, J., & Walper, S. (2013). The German Family
Panel (pairfam). GESIS Data Archive, Cologne. ZA5678 Data file Version 4.0.0.
doi:10.4232/pairfam.5678.4.0.0.
Neuhaus, J. M., & McCulloch, C. E. (2006). Separating between- and within-cluster covariate
effects by using conditional and partitioning methods. Journal of the Royal Statistical Society,
Series B, 68, 859–872.
Nicoletti, C., & Rondinelli, C. (2010). The (mis)specification of discrete duration models with
unobserved heterogeneity: A Monte Carlo study. Journal of Econometrics, 159, 1–13.
Ondrich, J., & Rhody, S. E. (1999). Multiple spells in the Prentice-Gloeckler-Meyer likelihood
with unobserved heterogeneity. Economics Letters, 63, 139–144.
Patil, P. N., & Bagkavos, D. (2012). Semiparametric smoothing of discrete failure time data.
Biometrical Journal, 54, 5–19.
Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction.
New York: Chapman & Hall.
Pinheiro, J. C., & Bates, D. M. (1995). Approximations to the log-likelihood function in the
nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4, 12–35.
Pregibon, D. (1980). Goodness of link tests for generalized linear models. Applied Statistics, 29,
15–24.
Prentice, R. L. (1975). Discrimination among some parametric models. Biometrika, 62, 607–614.
Prentice, R. L. (1976). A generalization of the probit and logit methods for dose response curves.
Biometrics, 32, 761–768.
Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary
observation. Biometrics, 44, 1033–1084.
Prentice, R. L., & Gloeckler, L. A. (1978). Regression analysis of grouped survival data with
application to breast cancer data. Biometrics, 34, 57–67.
Preston, S., Heuveline, P., & Guillot, M. (2000). Demography: Measuring and modeling
population processes. Chichester: Wiley-Blackwell.
Provost, F., & Domingos, P. (2003). Tree induction for probability-based ranking. Machine
Learning, 52, 199–215.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan
Kaufmann.
R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria:
R Foundation for Statistical Computing. Software version 3.2.2, http://www.R-project.org
Rabinowitz, D., Tsiatis, A., & Aragon, J. (1995). Regression with interval-censored data.
Biometrika, 82, 501–513.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In
J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics
and Probability. Berkeley: University of California Press.
Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model
framework for item response theory. Psychological Methods, 8, 185–205.
Ripley, B. (2015). gee: Generalized estimation equation solver. R package version 4.13-19. http://
cran.r-project.org/web/packages/gee/index.html
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University
Press.
Ruckstuhl, A., & Welsh, A. (1999). Reference bands for nonparametrically estimated link
functions. Journal of Computational and Graphical Statistics, 8, 699–714.
Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric regression. Cambridge:
Cambridge University Press.
Schall, R. (1991). Estimation in generalised linear models with random effects. Biometrika, 78,
719–727.
Scheike, T., & Jensen, T. (1997). A discrete survival model with random effects: An application to
time to pregnancy. Biometrics, 53, 318–329.
Scheike, T., & Keiding, N. (2006). Design and analysis of time-to-pregnancy. Statistical Methods
in Medical Research, 15, 127–140.
Schmid, M., & Hothorn, T. (2008). Boosting additive models using component-wise P-splines.
Computational Statistics & Data Analysis, 53, 298–311.
Schmid, M., Hothorn, T., Maloney, K. O., Weller, D. E., & Potapov, S. (2011). Geoadditive regres-
sion modeling of stream biological condition. Environmental and Ecological Statistics, 18,
709–733.
Schmid, M., Kestler, H. A., & Potapov, S. (2015). On the validity of time-dependent AUC
estimators. Briefings in Bioinformatics, 16, 153–168.
Schmid, M., Küchenhoff, H., Hoerauf, A., & Tutz, G. (2016). A survival tree method for the
analysis of discrete event times in clinical and epidemiological studies. Statistics in Medicine,
35, 734–751.
Schmid, M., & Potapov, S. (2012). A comparison of estimators to evaluate the discriminatory
power of time-to-event models. Statistics in Medicine, 31, 2588–2609.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and
event occurrence. Oxford: Oxford University Press.
Steele, F., Goldstein, H., & Browne, W. (2004). A general multilevel multistate competing risks
model for event history data, with an application to a study of contraceptive use dynamics.
Statistical Modelling, 4, 145–159.
Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale,
application and characteristics of classification and regression trees, bagging and random
forests. Psychological Methods, 14, 323–348.
Stukel, T. A. (1988). Generalized logistic models. Journal of the American Statistical
Association, 83, 426–431.
Sun, J. (2006). The statistical analysis of interval-censored failure time data. New
York/Heidelberg: Springer.
Sy, J. P., & Taylor, J. M. (2000). Estimation in a Cox proportional hazards cure model.
Biometrics, 56, 227–236.
Therneau, T., Atkinson, B., & Ripley, B. (2015). rpart: Recursive partitioning. R package version
4.1-9. http://cran.r-project.org/web/packages/rpart
Thompson, W. A. (1977). On the treatment of grouped observations in life studies. Biometrics, 33,
463–470.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B, 58, 267–288.
Tibshirani, R., & Ciampi, A. (1983). A family of proportional- and additive-hazards models for
survival data. Biometrics, 39, 141–147.
Tutz, G. (1990). Sequential item response models with an ordered response. British Journal of
Mathematical and Statistical Psychology, 43, 39–55.
Tutz, G. (1995). Competing risks models in discrete time with nominal or ordinal categories of
response. Quality & Quantity, 29, 405–420.
Tutz, G. (2012). Regression for categorical data. Cambridge: Cambridge University Press.
Tutz, G. (2015). Sequential models for ordered responses. In W. van der Linden & R. Hambleton
(Eds.), Handbook of modern item response theory. New York: Springer.
Tutz, G., & Binder, H. (2004). Flexible modelling of discrete failure time including time-varying
smooth effects. Statistics in Medicine, 23, 2445–2461.
Tutz, G., & Binder, H. (2006). Generalized additive modeling with implicit variable selection by
likelihood-based boosting. Biometrics, 62, 961–971.
Tutz, G., & Oelker, M. (2015). Modeling clustered heterogeneity: Fixed effects, random effects
and mixtures. International Statistical Review (to appear).
Tutz, G., & Petry, S. (2012). Nonparametric estimation of the link function including variable
selection. Statistics and Computing, 21, 545–561.
Tutz, G., Pößnecker, W., & Uhlmann, L. (2015). Variable selection in general multinomial logit
models. Computational Statistics & Data Analysis, 82, 207–222.
Tutz, G., & Pritscher, L. (1996). Nonparametric estimation of discrete hazard functions. Lifetime
Data Analysis, 2, 291–308.
Uno, H., Cai, T., Pencina, M. J., D’Agostino, R. B., & Wei, L. J. (2011). On the C-statistics for
evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics
in Medicine, 30, 1105–1117.
Uno, H., Cai, T., Tian, L., & Wei, L. J. (2007). Evaluating prediction rules for t-year survivors with
censored regression models. Journal of the American Statistical Association, 102, 527–537.
van de Vijver, M. J., He, Y. D., van 't Veer, L. J., Dai, H., Hart, A. A. M., Voskuil, D. W., et al.
(2002). A gene-expression signature as a predictor of survival in breast cancer. New England
Journal of Medicine, 347, 1999–2009.
Van den Berg, G. J. (2001). Duration models: Specification, identification and multiple durations.
In J. J. Heckman & E. Leamer (Eds.), Handbook of econometrics (Vol. V, pp. 3381–3460).
Amsterdam: North Holland.
van der Laan, M. J., & Robins, J. M. (2003). Unified methods for censored longitudinal data and
causality. New York: Springer.
van der Linden, W., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory.
New York: Springer.
Vaupel, J. W., Manton, K. G., & Stallard, E. (1979). The impact of heterogeneity in individual
frailty on the dynamics of mortality. Demography, 16, 439–454.
Vaupel, J. W., & Yashin, A. I. (1985). Heterogeneity’s ruses: Some surprising effects of selection
on population dynamics. The American Statistician, 39, 176–185.
Verhelst, N. D., Glas, C., & De Vries, H. (1997). A steps model to analyze partial credit. In
W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp.
123–138). New York: Springer.
Vermunt, J. K. (1996). Log-linear event history analysis: A general approach with missing data,
latent variables, and unobserved heterogeneity. Tilburg: Tilburg University Press.
Wand, M. P. (2000). A comparison of regression spline smoothing procedures. Computational
Statistics, 15, 443–462.
Wang, H., & Leng, C. (2008). A note on adaptive group lasso. Computational Statistics & Data
Analysis, 52, 5277–5286.
Wang, L. (2011). GEE analysis of clustered binary data with diverging number of covariates. The
Annals of Statistics, 39, 389–417.
Wang, L., Sun, J., & Tong, X. (2010). Regression analysis of case II interval-censored failure time
data with the additive hazards model. Statistica Sinica, 20, 1709–1723.
Weinberg, C., & Gladen, B. (1986). The beta-geometric distribution applied to comparative
fecundability studies. Biometrics, 42, 547–560.
Weisberg, S., & Welsh, A. H. (1994). Adapting for the missing link. The Annals of Statistics, 22,
1674–1700.
Welchowski, T., & Schmid, M. (2015). discSurv: Discrete time survival analysis. R package
version 1.1.1. http://cran.r-project.org/web/packages/discSurv
Willett, J. B., & Singer, J. D. (1995). It's déjà vu all over again: Using multiple-spell discrete-time
survival analysis. Journal of Educational and Behavioral Statistics, 20, 41–67.
Wolfinger, R. D. (1993). Laplace's approximation for nonlinear mixed models. Biometrika, 80,
791–795.
Wood, S. (2015). mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness
estimation. R package version 1.8-9. http://cran.r-project.org/web/packages/mgcv
Wood, S. N. (2006). Generalized additive models: An introduction with R. London: Chapman &
Hall/CRC.
Xie, M., & Yang, Y. (2003). Asymptotics for generalized estimating equations with large cluster
sizes. The Annals of Statistics, 31, 310–347.
Xue, X., & Brookmeyer, R. (1997). Regression analysis of discrete time survival data under
heterogeneity. Statistics in Medicine, 16, 1983–1993.
Yee, T. (2010). The VGAM package for categorical data analysis. Journal of Statistical
Software, 32(10), 1–34.
Yu, B., Tiwari, R. C., Cronin, K. A., & Feuer, E. J. (2004). Cure fraction estimation from the
mixture cure models for grouped survival data. Statistics in Medicine, 23, 1733–1747.
Yu, Y., & Ruppert, D. (2002). Penalized spline estimation for partially linear single-index models.
Journal of the American Statistical Association, 97, 1042–1054.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B, 68, 49–67.
Zeng, D., Cai, J., & Shen, Y. (2006). Semiparametric additive risks model for interval-censored
data. Statistica Sinica, 16, 287–302.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical
Association, 101, 1418–1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of
the Royal Statistical Society, Series B, 67, 301–320.
List of Examples

Author Index

van de Geer, S., 154, 163
Van den Berg, G., 42, 191, 192, 221
Van der Laan, M., 90
Van der Linden, W., 209
Vaupel, J., 209
Verhelst, N., 209
Vermunt, J., 30, 31, 209, 210
Vitaro, F., 132
Wagenpfeil, S., 124, 180
Wahba, G., 116
Walper, S., 12
Wand, M., 116, 124, 195
Wang, H., 165
Wang, L., 70, 220
Wei, L., 94, 101
Weinberg, C., 44
Weisberg, S., 101, 102
Welchowski, T., vi, 7, 31
Weller, D., 155
Welsh, A., 101
Willett, J., 104, 213, 221
Wilson, M., 209
Wolf, D., 180
Wolfinger, R., 193
Wood, S., 21, 31, 110, 111, 124, 125, 163, 195
Xie, M., 220
Xue, X., 210
Yan, J., 146, 221
Yang, Y., 220
Yashin, A., 209
Yee, T., 181
Yoon, J., 100
Yu, B., 155, 163, 210
Yu, Y., 101
Yuan, M., 119, 154, 163, 175
Zeger, S., 200, 220
Zeileis, A., 131
Zeng, D., 70
Zhang, D., 195
Zheng, Y., 93, 95, 101
Zhou, X., 202
Ziegler, R., 8
Zou, H., 153, 154, 163, 176
Zwitser, R., 209