John O’Quigley
Survival Analysis
Proportional and Non-Proportional Hazards Regression
Springer (The Data Sciences), 2021
John O’Quigley
Department of Statistical Science
University College London
London WC1E 6BT, UK
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
In loving memory of my father, Arthur
O’Quigley, whose guiding strategy when
confronting the most challenging of problems
– use a sense of proportion – would have
baffled many a latter-day epidemiologist.
Preface
In common with most scientific texts this one is the result of several iterations. The
seed for the iterations was the 2008 text Proportional Hazards Regression and the
first step was in the direction of a second edition to that work. The next several
iterations from there took us in a number of different directions before slowly
converging to a text that appears to be more in harmony with modern approaches to
data analysis, graphical techniques in particular. These are given greater emphasis
in this book and, following recent work on the regression effect process, it seems
clear that this process, first presented in O’Quigley (2003), ought to be given a more
prominent place. Just about everything we need, from complex estimation to simple
tests, can be readily obtained from the regression effect process. Formal analysis
can be backed up by intuitive graphical evaluation based directly on this process.
Proportional hazards models and their extensions (models with time-dependent
covariates, models with time-dependent regression coefficients, stratified models,
models with random coefficients, and any mixture of these) can be used to char-
acterize just about any applied problem to which the techniques of survival analysis
are appropriate. This simple observation enables us to find an elegant statistical
expression for all plausible practical situations arising in the analysis of survival
data. We have a single unifying framework. In consequence, a solid understanding
of the framework itself offers the investigator the ability to tackle the thorniest of
questions which may arise when dealing with survival data.
Our goal is not to present or review the very substantial amount of research that
has been carried out on proportional hazards and related models. Rather, the goal is
to consider the many questions which are of interest in a regression analysis of
survival data, questions relating to prediction, goodness of fit, model construction,
inference and interpretation in the presence of misspecified models. To address
these questions the standpoint taken is that of the proportional hazards and the
non-proportional hazards models.
This standpoint is essentially theoretical in that the aim is to put all of the
inferential questions on a firm conceptual footing. However, unlike the commonly
preferred approach among academic statisticians, based almost entirely on the
construction of stochastic integrals and the associated conditions for the valid
application of the martingale central limit theorem for multivariate counting pro-
cesses, we work with more classical and better known central limit theorems. In
particular we appeal to theorems dealing with sums of independent but not nec-
essarily identically distributed univariate random variables and of special interest is
the classical functional central limit theorem establishing the Brownian motion limit
for standardized univariate sums.
So, we avoid making use of Rebolledo’s central limit theorem for multivariate
counting processes, including the several related works required in order to validate
this theorem, the Lenglart-Rebolledo inequality for example. Delicate
measure-theoretic arguments, such as uniform integrability, borrowed from math-
ematical analysis can then be wholly avoided. In our view, these concepts have
strayed from their natural habitat—modern probability theory—to find themselves
in a less familiar environment—that of applied probability and applied statistics—
where they are neither well understood nor fully appreciated. While it could be
advanced that this abstract theory affords greater generality, the author has yet to
see an applied problem in which the arguably less general but more standard and
well known central limit theorems for univariate sums fail to provide an equally
adequate solution. None of the results presented here lean on Rebolledo’s central
limit theorem.
A central goal of this text is to shift the emphasis away from such abstract theory
and to bring back the focus on those areas where, traditionally, we have done well
—careful parsimonious modelling of medical, biological, physical and social
phenomena. Our aim is to put the main spotlight on robust model building using
analytical and graphical techniques. Powerful tests, including uniformly most
powerful tests, can be obtained, or at least approximated, under given conditions.
We discuss the theory at length but always view it through the magnifying glass
of real applied problems.
I would like to express my gratitude to several colleagues for their input over
many years. The collaborative work and countless discussions I have had with
Philippe Flandre, Alexia Iasonos, Janez Stare and Ronghui Xu in particular have had
a major impact on this text. I have worked at several institutions, all of them having
provided a steady support to my work. Specifically, I would like to acknowledge: the
Department of Mathematics of the University of California at San Diego, U.S.A., the
Fred Hutchinson Cancer Research Center, Seattle, U.S.A., the Department of
Mathematics at the University of Leeds, U.K., the Department of Mathematics at
Lancaster University, U.K., the Department of Biostatistics at the University of
Washington, U.S.A., the Department of Biostatistics at Harvard University, Boston,
U.S.A., the Division of Biostatistics at the University of Virginia School of
Medicine, the Laboratory for Probability, Statistics and Modelling, University of
Paris, Sorbonne and last (but not least), the Department of Statistical Science,
University College London. I am very grateful to them all for this support.
A final word. At the heart of our endeavor is the goal of prediction, the idea of
looking forward and making statements about the future. Perhaps somewhat
paradoxically, this only makes real sense when looking backwards, that is when all
the observations are in, allowing us to make some general summary statements.
Public health policy might use such statements to make planning decisions; clinical
trials specialists may use them to increase the power of a randomized study. But, to
use them to make individual predictions is unlikely to be of any help and could
potentially be of harm. Even for the terminally ill patient it is not rare for those with
the very worst prognosis to confound the predictions while others, looking much
better in theory, can fare less well than anticipated. Such imprecision can only be
greatly magnified when dealing with otherwise healthy subjects, the example of the
so-called BRCA1 and BRCA2 susceptibility genes being a striking one. We
consider this question in the chapter dealing with epidemiology. While past history
is indispensable to improving our understanding of given phenomena and how they
can describe group behaviour, on an individual level, it can never be taken to be a
reliable predictor of what lies ahead. Assigning probabilities to individuals, or to
single events—however accurate the model—is not helpful.
When anyone asks me how I can best describe my experience in nearly 40 years at
sea, I merely say ... uneventful. Captain Edward John Smith, RMS Titanic, April 1912.
Contents

1 Introduction
1.1 Chapter summary
1.2 Context and motivation
1.3 Some examples
1.4 Main objectives
1.5 Neglected and underdeveloped topics
1.6 Model-based prediction
1.7 Data sets
1.8 Use as a graduate text
1.9 Classwork and homework
2 Survival analysis methodology
2.1 Chapter summary
2.2 Context and motivation
2.3 Basic tools
2.4 Some potential models
2.5 Censoring
2.6 Competing risks
2.7 Classwork and homework
3 Survival without covariates
3.1 Chapter summary
3.2 Context and motivation
3.3 Parametric models for survival functions
3.4 Empirical estimate (no censoring)
3.5 Kaplan-Meier (empirical estimate with censoring)
3.6 Nelson-Aalen estimate of survival
3.7 Model verification using empirical estimate
3.8 Classwork and homework
3.9 Outline of proofs
Chapter 1
Introduction
in sociological surveys. The more classic motivating examples behind the bulk
of theoretical advances made in the area have come from reliability problems in
engineering and, especially, clinical studies in chronic diseases such as cancer and
AIDS. Many of the examples given in this text come from clinical research. Par-
allels for these examples from other disciplines are usually readily transposable.
Proportional hazards regression is now widely appreciated as a very powerful
technique for the statistical analysis of broad classes of problems. The ability to
address issues relating to complex time-dependent effects and non-proportional
hazards structures broadens this class yet further.
Financial analysis
Mathematical techniques of stochastic integration have seen wide application in
recent years in the study of financial products such as derivatives, futures, and
other pricing schemes for certain kinds of options. An alternative and potentially
more flexible approach to many of these questions is via regression modeling.
The increasing volatility of many of the markets has also given rise to the use
of survival modeling as a tool to identify, among large-scale borrowers (large
industries, countries, and even financial institutions themselves), the ones most
likely to fail on their repayment schedule. Modeling techniques can
make such analyses more precise, identifying just which factors are most strongly
predictive.
than one endpoint, tracing out the lifetime of a product. The first event may be
something minor and subsequent events of increasing degrees of seriousness, or
involving different components, until the product is finally deemed of no further
use. The many different paths through these states which any product may follow
in the course of a lifetime can be very complex. Models can usefully shed light
on this.
The purpose of this book is to provide the structure necessary to the building
of a coherent framework in which to view proportional and non-proportional
hazards regression. The essential idea is that of prediction, a feature common to
all regression models, but one that sometimes slips from sight amid the wealth
of methodological innovations that has characterized research in this area. Our
motivation should mostly derive from the need to obtain insights into the complex
data sets that can arise in the survival setting, keeping in mind the key notion
that the outcome time, however measured, and its dependence on other factors
is at the center of our concerns.
The predictive power of a model, proportional and non-proportional hazards
models in particular, is an area to which we pay a lot of attention. It is necessary
to investigate how we can obtain predictions in the absence of information
(often referred to as explanatory variables or covariate information) relating
to survival and, subsequently, how to obtain predictions in the presence of such
information. How best such information can be summarized brings us into the whole
area of model adequacy and model performance. Measures of explained varia-
tion and explained randomness can indicate to what extent our prediction accu-
racy improves when we include additional covariate information into any model.
They also play a central role in helping identify the form of the unknown time-
dependent regression effects model that is taken to generate the observations.
In order to give the reader, new to the area, a feeling as to how the Cox
model fits into the general statistical literature we provide some discussion of
the original paper of Professor Cox in 1972, some of the background leading up
to that seminal paper, and some of the scientific discussions that ensued. The
early successes of the model in characterizing and generalizing several classes of
statistics are described.
As is true of much of science—statistics is no exception—important ideas
can be understood on many levels. Some researchers, with a limited training in
even the most basic statistical methods, can still appreciate the guiding principles
behind proportional hazards regression. Others are mainly interested in some of
the deeper inferential mathematical questions raised by the estimation techniques
employed. Hopefully both kinds of readers will find something in this book to their
taste. The aim of the book is to achieve a good balance, a necessary compromise,
between the theoretical and the applied. This will necessarily be too theoretical for
some potential readers, not theoretical enough for others. Hopefully the average
gap is not too great.
The essential techniques from probability and statistics that are needed
throughout the text are gathered together in several appendices. These can be
used for reference. Any of these key ideas called upon throughout the book can
be found more fully developed in these appendices. The text can therefore serve
as a graduate text for students with a relatively limited background in either
probability or statistics. Advanced measure theory is not really necessary either
in terms of understanding the proportional hazards model or for gaining insight
into applied problems. It is not emphasized in this book and this is something
of a departure from a number of other available texts which deal with these and
related topics. Proofs of results are clearly of importance, partly to be reassured
as to the validity of the techniques we apply, but also in their own right and of
interest to those focusing mostly on the methods. In order not to interrupt the
text with proofs, we give theorems, corollaries, and lemmas, but leave the proofs
to be gathered together at the end of each chapter. The reader, less interested in
the more formal presentation in terms of theorems and proofs, will nonetheless,
it is hoped, find the style helpful in that, by omitting the proofs at the time of
development, the necessary results are organized and brought out in a sharper
focus. While our motivation always comes from real practical problems, the pre-
sentation aims to dig deep enough into the mathematics so that full confidence
can be obtained in using the results as well as using the machinery behind the
results to obtain new ones if needed.
otherwise, we do not study left truncation in its own right. Readers looking for
more on this topic are referred to Hyde (1977), Tsai et al. (1987), Keiding et al.
(1987), Keiding and Gill (1990), Jiang et al. (2005), Shen (2006), Cain et al.
(2011), and Geskus (2011).
Partial likelihood
The term “partial likelihood” is common in survival texts as a description of a
particular approach to inference. It is not a very useful concept in our view, and
perhaps it is time to put the term “partial likelihood” to a long-deserved rest.
In his introductory paper on proportional hazards models, Professor Sir David
Cox (Cox, 1972) made no reference at all to partial likelihood. He presented a
suitable likelihood with which to make inference, given the observations, and, no
doubt unintentionally, generated some confusion by referring to that likelihood
as a conditional likelihood. Some of the discussants of Cox’s paper picked up
on that and were puzzled as to the sense of a conditional likelihood since the
statistic upon which we would be conditioning is not very transparent. These
discussants pointed out that there is no obvious statistic whose observed value
is taken as fixed before we set about making inferences. Nonetheless, there is
a lot of sequential conditioning that leads to the expression, first derived by
Professor Cox, and, for want of a better term, “conditional likelihood” was not
so bad. Cox’s choice for likelihood was a fully legitimate one. It was also a very
good one, having among several properties that of leaving inference unaffected
by increasing transformations on the time variable. More explanation may have
been needed but, instead, regrettably, we all moved off in something of a wrong
direction: not dramatically wrong but wrong enough to cloud and confuse issues
that ought, otherwise, to have been quite plain.
These were very early days for proportional hazards regression and, for those
concerned, it was not always easy to see clearly ahead. Efforts were made either
to justify the likelihood given by Cox, as arising within a particular conditional
sampling framework or to appeal to different techniques from probability that,
by good fortune, led us to the same estimating equations as those derived
by Cox. All of this had two unfortunate consequences that turned out to be very
significant for the later development of the field. Those initial worries about the
nature, and indeed the legitimacy, of the likelihood, took us on something of
an arduous path over the next few years. In the late seventies, early eighties,
a solution to the problem was believed to have been found. And in the place
where we might least expect to find it: within the very abstract French school
of probability. It was argued that the key to the solution lay in the central limit
theorem of Rebolledo (1978) for multivariate counting processes.
Now, while stochastic processes, and counting processes, in particular, are at
the very heart of survival analysis, this particular branch of the French school
of probability, built around Lenglart’s theorem, Rebolledo’s multivariate central
limit theorem for stochastic integrals, Jacod’s formula, Brémaud’s conditions,
bracket processes, and locally square integrable martingales, is, and has remained,
almost inaccessible to those lacking a strong background in real analysis. This
development had the unfortunate effect of pushing the topic of survival analysis
beyond the reach of most biostatisticians. Not only that but the field became for
very many years too focused on the validity of Rebolledo’s findings under different
conditions. The beauty of the Cox model and the enormous range of possibilities
it opened up took second place so that questions at the heart of model building:
predictive strength, model validation, goodness of fit, and related issues did not
attract the attention that they deserved.
Our goal is not to use the benefit of hindsight to try to steer the course of
past history—we couldn’t anyway—but we do give much less emphasis to the
multivariate central limit theorem of Rebolledo in favor of more classical central
limit theorems, in particular the functional central limit theorem. This appears
to allow a very simple approach to inference. No more than elementary calculus
is required to understand the underpinnings of this approach.
The concept of partial likelihood as a general technique of inference in its own
right, and for problems other than inference for the proportional hazards model
has never really been thoroughly developed. One difficulty with the concept of
partial likelihood, as currently defined, is that, for given problems, it would not
be unique. In general situations it may then not be clear how best to proceed,
for instance, which of many potential partial likelihoods we ought to choose to
work with. For these reasons we do not study partial likelihood as a tool
for inference and the concept is not given particular weight. This is a departure
from several other available texts on survival analysis. In this work, we do not
Predictive indices
The main purpose of model construction is ultimately that of prediction. The
amount of variation explained is viewed naturally as a quantity that reflects the
usually that of prediction of some quantity of interest. Any model is then simply
judged by its predictive performance. The second school sees the statistician’s
job as tracking down the “true” model that can be considered to have generated
the data. The position taken in this work is very much closer to the first than
the second. This means that certain well-studied concepts, such as efficiency,
a concept which assumes our models are correct, are given less attention than
is often the case. The regression parameter β in our model is typically taken
to be some sort of average where the proportional hazards model corresponds
to a summary representation of the broader non-proportional hazards model.
We view the restricted model as a working model and not some attempt to
represent the true situation. Let’s remind ourselves that the proportional hazards
model stipulates that effects, as quantified by β, do not change through time.
In reality the effects must surely change, hopefully not too much, but absolute
constancy of effects is perhaps too strong an assumption to hold up precisely. The
working model enables us to estimate useful quantities, one of them being average
regression effect, the average of β taken through time. Interestingly, the usual
partial likelihood estimator in the situation of changing regression effects does
not estimate an average effect, as is often believed. Even so, we can estimate an
average effect but we do require an estimator different from that commonly used.
Details are given in Xu and O’Quigley (2000), O’Quigley (2008) and recalled in
Section 7.6.
ing in terms of predictive strength. But the scale is still not something that we
can grasp on an intuitive level. The great majority of the so-called predictive
indices reviewed by Choodari-Oskooei et al. (2012) offer no advantage over the
C-statistic and fail to provide any useful statistical insight into the problem of
prediction for a survival model.
Pepe et al. (2015) show that some alternative measures, such as the net
reclassification index (Pencina et al., 2008) do no better than the measures they
set out to improve upon. The main problem is almost always that of interpre-
tation and, in this context, improvised solutions that take care of the censoring
have rarely met with success. The way to proceed, described in Section
3.9 of O’Quigley (2008), is to start from first principles, considering the elemen-
tary definitions of explained variation, what they mean, and how they might be
estimated in the presence of independent censoring. Some suggestions in the
literature have no clear meaning and it is something of a puzzle that they were
ever used at all, let alone that they continue to attract interest. The so-called
proportion of treatment effect explained (Freedman et al., 1992), believed to
quantify how much of the treatment effect can be explained by another variable,
typically some kind of surrogate endpoint for treatment, enjoys no interpretation
as a proportion. Flandre and Saidi (1999) find values for this quantity, using
data from published clinical trials, ranging from −13% to 249%. These are the
point estimates and not the endpoints of the confidence intervals which reach
as far as 416%. There was nothing unusual in the data from those examples
and we can conclude that Freedman’s measure is not any kind of a percentage
which, necessarily, would lie between zero and one hundred percent. What it is
we cannot say although there is no reason to believe that 249% corresponds to
stronger predictive effects than −13%.
These measures are quite useless. Some measures, while not necessarily being
useless, appear to be not very useful. Take, for example, the lifetime risk of
getting breast cancer for carriers of the BRCA1 and BRCA2 genetic mutation. A
few years back we learned that a celebrity was deemed to have an 87% lifetime
risk of getting breast cancer on the basis of her BRCA status. Despite enjoying
full health, as a result of that risk she chose to have a double mastectomy. This
so-called risk prompting such drastic action became not only a topic of discussion
in scholarly medical journals but also made it into the mainstream press where,
almost universally, the celebrity was praised for her good judgment and her pro-
active stance, a stance taken to avoid the perils that appeared to be confronting
her. As far as we are aware, there was little or no discussion on the obvious
question ... what is meant by the number 87%. Lifetime risk is near impossible
to interpret (O’Quigley, 2017) and, by any standards, cannot be considered a
useful measure of anything. More interpretable measures are available and we
take the view that lifetime risk is not only not useful but is in fact misleading.
It does not measure any clinical or physiological characteristic and basing major
treatment and surgical decisions on its calculated value makes no sense. We
The text can be used as support for an introductory graduate course in survival
analysis with particular emphasis on proportional and non-proportional hazards
models. The approach to inference is more classical than often given in such
courses, steering mostly away from the measure-theoretic difficulties associated
with multivariate counting processes and stochastic integrals and focusing instead
on the more classical, and well known, results of empirical processes. Brownian
motion and functions of Brownian motion play a central role. Exercises are pro-
vided in order to reinforce the coursework. Their aim is not so much to help
develop a facility with analytic calculation but more to build insight into the
important features of models in this setting. Some emphasis then is given to
practical work carried out using a computer. No particular knowledge of any
specific software package is assumed.
1. For the several examples described in Section 1.2 write down those features
of the data which appear common to all examples. Which features are
distinctive?
3. Suppose that the methods of survival analysis were not available to us.
Suggest how we might analyze a randomized clinical trial using (i) multiple
linear regression, (ii) multiple logistic regression.
9. Check out the definitions for explained variation and explained randomness
for bivariate continuous distributions. Show that, in the case of a multi-
normal model, explained randomness and explained variation coincide.
10. Suppose that T1 and T2 are two survival variates considered in a competing
risks setting. We suppose independence. Imagine though that the observa-
tions of T2 can be modeled by treating T1 as a time-dependent covariate in
a Cox model. When the regression coefficient is non-zero describe how this
would impact the Kaplan-Meier curves obtained for a set of observations;
T21 ,..., T2n . What can be said when the model is an NPH one and the
regression coefficient a function of time?
Chapter 2
Survival analysis methodology
Survival time T will be a positive random variable, typically right skewed and
with a non-negligible probability of sampling large values, far above the mean.
The fact that an ordering, T1 > T2 , corresponds to a solid physical interpretation
has led some authors to argue that time is somehow different from other contin-
uous random variables, reminiscent of discussion among early twentieth-century
physicists about the nature of time “flowing inexorably in and of itself”. These
characteristics are sometimes put forward as a reason for considering techniques
other than the classic techniques of linear regression. From a purely statistical
Conditional distributions
Our goal is to investigate dependence, its presence, and, if present, the degree of
dependence. The information on this is conveyed to us via the conditional distri-
bution of time given the covariate, keeping in mind that this covariate may be a
combination of several covariates. The conditional distribution of the covariate
given time may not immediately strike us as particularly relevant. However, it is
very relevant because: (1) the marginal distribution of time and the marginal dis-
tribution of the covariate contain little or no information on dependency so that
the two conditional distributions can be viewed, in some sense, as being equiv-
alent and, (2) problems relating to censoring are near intractable when dealing
with the conditional distribution of time given the covariate whereas they become
It helps understanding to contrast Equations (2.2) and (2.1), where we see that
λ(t) and f (t) are closely related quantities. In a sense, the function f (t) for all
values of t is seen from the standpoint of an observer sitting at T = 0, whereas,
for the function λ(t), the observer moves along with time looking at the same
quantity but viewed from the position T = t. Analogous to a density, conditioned
by some event, we can define
λ(t|C > t) = lim_{Δt→0+} (1/Δt) Pr(t < T < t + Δt | T > t, C > t).    (2.3)
The conditioning event C > t is of great interest since, in practical investigations,
all our observations at time t have necessarily been conditioned by the event. All
associated probabilities are also necessarily conditional. But note that, under an
independent censoring mechanism, we have that λ(t|C > t) = λ(t). This result
underlies the great importance of certain assumptions, in this case that of inde-
pendence between C and T . The conditional failure rate, λ(t), is also sometimes
referred to as the hazard function, the force of mortality, the instantaneous failure
rate or the age-specific failure rate. If we consider a small interval then λ(t) × Δt
closely approximates the probability of failing in a small interval for those aged
t, the approximation improving as Δt goes to zero. If units are one year then
these are yearly death rates. The cumulative hazard function is also of interest
and this is defined as Λ(t) = ∫_0^t λ(u) du. For continuous λ(t), using elementary
calculus we can see that:
From this it is clear that S(t, u) = S(t)/S(u) and that S(u, u) = 1 so that it is as
though the process had been restarted at time t = u. Other quantities that may
be of interest in some particular contexts are the mean residual lifetime, m(t),
and the mean time lived in the interval [0, t], μ(t), defined as
m(t) = E(T − t | T ≥ t),    μ(t) = ∫_0^t S(u) du.    (2.4)
Like the hazard itself, these functions provide a more direct reflection on the
impact of having survived until time t. The mean residual lifetime provides a very
interpretable measure of how much more time we can expect to survive, given
that we have already reached the timepoint t. This can be useful in actuarial appli-
cations. The mean time lived in the interval [0, t] is not so readily interpretable,
requiring a little more thought (it is not the same as the expected lifetime given
that T < t). It has one strong advantage in that it can be readily estimated from
right-censored data in which, without additional assumptions, we may not even be
able to estimate the mean itself. The functions m(t) and μ(t) are mathematically
equivalent to one another, as well as to the three functions described above, and,
for example, a straightforward integration by parts shows that m(t) = S^{−1}(t) ∫_t^∞ S(u) du
and that μ(∞) = E(T). If needed, it follows that the survivorship function can be
expressed in terms of the mean residual lifetime by
S(t) = m^{−1}(t) m(0) exp{ −∫_0^t m^{−1}(u) du }.
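The integration by parts referred to above is short enough to give in full; it assumes only that (u − t)S(u) → 0 as u → ∞, which holds whenever E(T) is finite:

m(t) = E(T − t | T ≥ t) = S^{−1}(t) ∫_t^∞ (u − t) f(u) du
     = S^{−1}(t) { [ −(u − t) S(u) ]_t^∞ + ∫_t^∞ S(u) du } = S^{−1}(t) ∫_t^∞ S(u) du,

since f(u) = −S'(u) and the bracketed term vanishes at both limits. Setting t = 0, and noting that S(0) = 1, gives m(0) = ∫_0^∞ S(u) du = μ(∞) = E(T).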
We may wish to model directly in terms of m(t), allowing this function to depend
on some vector of parameters θ. If the expression for m(t) is not too intractable
then, using f(t) = −S'(t) and the above relationship between m(t) and S(t),
we can write down a likelihood for estimation purposes in the situation of inde-
pendent censoring. An interesting and insightful relationship (see, for instance,
the Kaplan-Meier estimator) between S(t) and S(t, u) follows from considering
some discrete number of time points of interest. Thus, for any partition of the
time axis, 0 = a0 < a1 < . . . < an = ∞, we see that
S(aj) = S(aj−1) S(aj, aj−1) = ∏_{ℓ≤j} S(aℓ, aℓ−1).    (2.5)
The implication of this is that the survival function S(t) can always be viewed
as the product of a sequence of conditional survival functions, S(t, u). This
simple observation often provides a foundation that helps develop our intuition,
one example being the case of competing risks looked at later in this chapter.
Although more cumbersome, a theory could equally well be constructed for the
discrete case whereby f(ti) = Pr(T = ti) and S(ti) = Σ_{ℓ≥i} f(tℓ). We do not
explore this here.
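As a small numerical illustration, not taken from the text, the product form in (2.5) can be checked directly; the exponential survival function and the partition used below are arbitrary choices.

import numpy as np

lam = 0.5                                  # assumed hazard rate for an exponential model
S = lambda t: np.exp(-lam * t)             # S(t) = Pr(T > t)

a = np.array([0.0, 0.5, 1.2, 2.0, 3.5])    # partition 0 = a_0 < a_1 < ... < a_4
cond = S(a[1:]) / S(a[:-1])                # conditional survival S(a_j, a_{j-1}) = S(a_j)/S(a_{j-1})
print(np.cumprod(cond))                    # running product of conditional survival probabilities
print(S(a[1:]))                            # direct evaluation of S(a_j); the two lines agree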
Figure 2.1: The two states, Alive and Dead.
t = 0, remaining at this same value until some time point, say T = u, at which
the event under study occurs and then N(t) = 1{t ≥ u}. We can then define, in an
infinitesimal sense, i.e., the equality only holds precisely in the limit as dt goes
where Ft−dt , written as Ft− when we allow dt > 0 to be arbitrarily close to zero,
is the accumulated information, on all processes under consideration, observed
up until time t − dt (Figure 2.1).
The observed set Ft− is referred to as the history at time t. The set is nec-
essarily non-decreasing in size as t increases, translating the fact that more is
being observed or becoming known about the process. The Kolmogorov axioms of
probability, in particular sigma additivity, may not hold for certain noncountable
infinite sets. For this reason probabilists take great care, and use considerable
mathematical sophistication, to ensure, in broad terms, that the size of the set
Ft− does not increase too quickly with t. The idea is to ensure that we remain
within the Kolmogorov axiomatic framework, in particular that we do not vio-
late sigma additivity. Many of these concerns have spilled over into the applied
statistical literature where they have no place. No difficulties will arise
in applications, with the possible exception of theoretical physics, and the prac-
titioner, unfamiliar with measure theory, ought not to be deterred from applying
the techniques of stochastic processes simply because he or she lacks a firm grasp
of concepts such as filtrations. It is hard to imagine an application in which a lack
of understanding of the term “filtration” could have led to error. On the other
hand, the more accessible notions of history, stochastic process, and conditioning
sets are central and of great importance both to understanding and to deriving
creative structures around which applied problems can be solved. Viewing t as
an index to a stochastic process rather than simply the realization of a random
variable T , and defining the intensity process α(t) as above, will enable great
flexibility and the possibility to model events dynamically as they unfold.
Figure 2.2: A compartment model with 3 covariate levels and an absorbing state.
The model is fully specified by the 3 transition rates from states A, B, or C to the
state Dead.
This relation is important in that, under the above condition, referred to as the
independent censoring condition, the link between the intensity function and the
hazard function is clear. Note that the intensity function is random since Y is
random when looking forward in time. Having reached some time point, t say,
then α(t) is fixed and known since the function Y (u), 0 < u < t is known and
Y (t) is left continuous.
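A rough sketch, not taken from the text, of these ingredients for a single subject may help fix ideas. The observed time, event indicator, and constant hazard below are made-up values, and the link between intensity and hazard under independent censoring is taken in its standard form, α(t) = Y(t)λ(t).

lam = 0.3                      # assumed constant hazard for the subject
X, delta = 2.4, 1              # observed time and event indicator (made-up values)

def Y(t):
    # at-risk indicator, left continuous: the subject is at risk at t if observation has not yet ended
    return 1.0 if t <= X else 0.0

def N(t):
    # counting process: jumps from 0 to 1 at the observed event time (if the event was observed)
    return 1.0 if (t >= X and delta == 1) else 0.0

def alpha(t):
    # intensity process: the hazard is switched off once the subject is no longer at risk
    return Y(t) * lam

for t in (1.0, 2.4, 3.0):
    print(t, Y(t), N(t), alpha(t))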
We call Y (·) the “at risk” function (left continuous specifically so that at
time t the intensity function α(t) is not random). The idea generalizes readily
and in order to cover a wide range of situations we also allow Y to have an
argument w where w takes integer values counting the possible changes of state.
For the ith subject in any study we will typically define Yi (w, t) to take the value
1 if this subject, at time t, is at risk of making a transition of type w, and 0
otherwise. Figure 2.2 summarizes a situation in which there are four states of
interest, an absorbing state, death, and three states from which an individual
is able to make a transition into the death state. Transitions among the three
non-death states themselves cannot occur. Later we will consider different ways
of modeling such a situation, depending upon further assumptions we may wish
or not wish to make.
In Figure 2.3 there is one absorbing state, the death state, and two non-
absorbing states between which an individual can make transitions. We can define
w = 1 to indicate transitions from state 1 to state 2, w = 2 to indicate transitions
from state 2 to state 1, w = 3 to indicate transitions from state 1 to state 3 and,
finally, w = 4 to indicate transitions from state 2 to state 3. Note that such an
enumeration only deals with whether or not a subject is at risk for making the
transition, the transition probabilities (intensities) themselves could depend on
the path taken to get to the current state.
Figure 2.3: Binary time-dependent covariate and single absorbing state; the states
are State 1 (No symptoms), State 2 (Progression), and State 3 (Dead). The
probabilistic structure is fully specified by the transition rates at time t, αij(t), for
leaving state i for state j. Fix and Neyman’s illness-death model, also known as
the semi-competing risks model, is a special case when α21(t) = 0, ∀t.
Figure 2.4: Binary time-dependent covariate and two absorbing states; the states
are labeled State 1 to State 4. Transitions to state 4 pass through state 2, i.e.,
αj4(t) = 0, j ≠ 2, ∀t. Note also that α4j(t) = 0.
the repeated incidence of benign breast disease in its own right. Clearly a patient
can only be at risk of having a third incident of benign breast disease if she has
already suffered two earlier incidents. We can model the rate of incidence for the
j th occurrence of benign disease as,
Simple exponential
The simple exponential model is fully specified by a single parameter λ. The
hazard function, viewed as a function of time, does not in fact depend upon
time so that λ(t) = λ. By simple calculation we find that Pr(T > t) = exp(−λt).
Note that E(T ) = 1/λ and, indeed, the exponential model is often parameter-
ized directly in terms of the mean θ = E(T) = 1/λ. Also Var(T) = 1/λ^2. This
model expresses the physical phenomenon of no aging or wearing out since, by
elementary calculations, we obtain S(t + u, u) = S(t); the probability of surviving
a further t units of time, having already survived until time u, is the same as that
associated with surviving the initial t units of time. The property is sometimes
referred to as the lack of memory property of the exponential model.
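A two-line numerical check of this property (illustration only; the rate and the time points below are arbitrary):

import numpy as np

lam = 0.8
S = lambda t: np.exp(-lam * t)   # exponential survival function

t, u = 1.5, 3.0
print(S(t + u) / S(u))           # S(t + u, u): survival a further t units given survival to u
print(S(t))                      # S(t); identical, the lack of memory property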
For practical application the exponential model may suggest itself in view
of its simplicity or sometimes when the constant hazard assumption appears
realistic. A good example is that of a light bulb which may only fail following
a sudden surge in voltage. The fact that no such surge has yet occurred may
provide no information about the chances for such a surge to take place in the
next given time period. If T has an exponential distribution with parameter λ then
λT has the so-called standard exponential distribution, i.e., mean and variance
are equal to one.
Recall that for a random variable Y having normal distribution N(μ, σ^2)
it is useful to think in terms of a simple linear model Y = μ + σε, where ε
has the standard distribution N(0, 1). As implied above, scale changes for the
exponential model lead to a model still within the exponential class. However,
this is no longer so for location changes so that, unlike the normal model in which
linear transformations lead to other normal models, a linear formulation for the
exponential model is necessarily less straightforward. It is nonetheless of interest
to consider the closest analogous structure and we can write
where W has the standard extreme value density f (w) = exp{w − exp(w)}.
When α = 0 we recover an exponential model for T with parameter b, values
other than zero for α pushing the variable T out of the restricted exponential
class into the broader Weibull class discussed below.
The important point to note here is that the ratio of the hazards, λ(t|Z =
1)/λ(t|Z = 0) does not involve t. It also follows that S(t|Z = 1) = S(t|Z = 0)^α
where α = exp(β). The survival curves are power transformations of one another.
This is an appealing parameterization since, unlike a linear parameterization,
whatever the true value of β, the constraints that we impose upon S(t|Z = 1)
and S(t|Z = 0) in order to be well-defined probabilities, i.e., remaining between
0 and 1, are always respected. Such a model is called a proportional hazards
model. For three groups we can employ two indicator variables, Z1 and Z2 , such
that, for group 1 in which the hazard rate is equal to λ0 , Z1 = 0 and Z2 = 0, for
group 2, Z1 = 1 and Z2 = 0 whereas for group 3, Z1 = 0 and Z2 = 1. We can
then write:
log{−log S(t|Z)} = log{−log S(t|0)} + Σ_{i=1}^{p} βi Zi.
This formulation is the same as the proportional hazards formulation. Noting that
− log S(T |Z = z) is an exponential variate some authors prefer to write a model
down as a linear expression in the transformed random variable itself with an
exponential error term. This then provides a different link to the more standard
linear models we are familiar with.
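The following sketch, not taken from the text, illustrates the two points made above for the proportional hazards formulation: the survival curve for Z = 1 is a power transformation of the baseline curve, so it automatically respects the probability constraints, and the hazard ratio is free of t. The baseline exponential hazard and the value of β are arbitrary.

import numpy as np

lam0, beta = 0.4, 0.9                    # assumed baseline hazard and log hazard ratio
t = np.linspace(0.0, 5.0, 6)
S0 = np.exp(-lam0 * t)                   # S(t | Z = 0), here exponential
S1 = S0 ** np.exp(beta)                  # S(t | Z = 1) = S(t | Z = 0)^alpha with alpha = exp(beta)
print(np.all((S1 >= 0) & (S1 <= 1)))     # True, whatever the value of beta
print((lam0 * np.exp(beta)) / lam0)      # hazard ratio exp(beta), not involving t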
Piecewise exponential
The lack of flexibility of the exponential model will often rule it out as a potential
candidate for application. Many other models, only one or two of which are
mentioned here, are more tractable, a property stemming from the inclusion
of at least one additional parameter. Even so, it is possible to maintain the
advantages of the exponential model’s simplicity while simultaneously gaining
flexibility. One way to achieve this is to construct a partition of the time axis
0 = a0 < a1 < . . . < ak = ∞. Within the jth interval (aj−1 , aj ) , (j = 1, . . . , k) the
hazard function is given by λ(t) = λj . We can imagine that this may provide quite
a satisfactory approximation to a more involved smoothly changing hazard model
in which the hazard function changes through time. We use S(t) = exp{−Λ(t)}
to obtain the survival function where
Λ(t) = Σ_{j=1}^{k} I(t ≥ aj) λj (aj − aj−1) + Σ_{j=1}^{k} I(aj−1 ≤ t < aj) λj (t − aj−1).    (2.12)
Properties such as the lack of memory property of the simple exponential have
analogs here by restricting ourselves to remaining within an interval. Another
attractive property of the simple exponential is that the calculations are straight-
forward and can be done by hand and, again, there are ready analogues for the
piecewise case. Although the ready availability of sophisticated computer pack-
ages tends to eliminate the need for hand calculation, it is still useful to be able
to work by hand if for no other purposes than those of teaching. Students gain
invaluable insight by doing these kinds of calculations the long way.
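The kind of hand calculation referred to above is easily mimicked; the sketch below evaluates Λ(t) of Equation (2.12), and S(t) = exp{−Λ(t)}, for an assumed partition and assumed interval hazards.

import numpy as np

a = np.array([0.0, 1.0, 2.5, np.inf])    # partition 0 = a_0 < a_1 < a_2 < a_3 = infinity
lam = np.array([0.2, 0.5, 0.1])          # hazard lambda_j on the interval (a_{j-1}, a_j)

def Lambda(t):
    # completed intervals contribute lambda_j (a_j - a_{j-1}); the current interval lambda_j (t - a_{j-1})
    total = 0.0
    for j in range(len(lam)):
        if t >= a[j + 1]:
            total += lam[j] * (a[j + 1] - a[j])
        elif a[j] <= t < a[j + 1]:
            total += lam[j] * (t - a[j])
    return total

for t in (0.5, 1.0, 2.0, 3.0):
    print(t, round(Lambda(t), 4), round(np.exp(-Lambda(t)), 4))   # Lambda(t) and S(t) = exp{-Lambda(t)}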
Weibull model
Another way to generalize the exponential model to a wider class is to consider
a power transformation of the random variable T . For any positive γ, if the
distribution of T γ is exponential with parameter λ, then the distribution of T
itself is said to follow a Weibull model whereby
and S(t) = exp{−(λt)^γ}. The hazard function follows immediately from this and
we see, as expected, that when γ = 1 an exponential model with parameter λ is
recovered. It is of interest to trace out the possible forms of the hazard function
for any given λ. It is monotonic, increasing for values of γ greater than 1 and
decreasing for values less than 1. This property, if believed to be reflected in some
given physical situation, may suggest the appropriateness of the model for that
same situation. An example might be the time taken to fall over for a novice one-
wheel skateboard enthusiast—the initial hazard may be high, initially decreasing
somewhat rapidly as learning sets in and thereafter continuing to decrease to
zero, albeit more slowly.
The Weibull model, containing the exponential model as a special case, is
an obvious candidate structure for framing questions of the sort—is the hazard
decreasing to zero or is it remaining at some constant level? A null hypothesis
would express this as H0: γ = 1. Straightforward integration shows that E(T^r) =
λ^{−r} Γ(1 + r/γ), where Γ(·) is the gamma function,
Γ(p) = ∫_0^∞ u^{p−1} e^{−u} du,    p > 0.
For integer p, Γ(p) = (p − 1)!. The mean and the variance are λ^{−1} Γ(1 + 1/γ)
and λ^{−2} Γ(1 + 2/γ) − E^2(T), respectively. The Weibull model can be motivated by
the theory of statistics of extremes. The distribution coincides with the limiting
distribution of the smallest of a collection of random variables, under broad
conditions on the random variables in question (Kalbfleisch and Prentice, 2002,
page 48).
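A quick simulation check of the moment formula, under the parameterization S(t) = exp{−(λt)^γ} used above so that (λT)^γ is standard exponential; the parameter values below are arbitrary.

import numpy as np
from scipy.special import gamma as G

lam, gam = 0.5, 1.7
rng = np.random.default_rng(0)
T = rng.exponential(1.0, size=200_000) ** (1.0 / gam) / lam   # T = U^{1/gamma}/lambda, U standard exponential

mean_theory = G(1 + 1 / gam) / lam                   # lambda^{-1} Gamma(1 + 1/gamma)
var_theory = G(1 + 2 / gam) / lam**2 - mean_theory**2
print(T.mean(), mean_theory)                         # close
print(T.var(), var_theory)                           # close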
Log-minus-log transformation
As a first step to constructing a model for S(t|Z) we may think of a linear
shift, based upon the value of Z, the amount of the shift to be estimated from
data. However, the function S(t|Z) is constrained, becoming severely restricted
for both t = 0 and for large t where it approaches one and zero respectively.
Any model would need to accommodate these natural constraints. It is usu-
ally easiest to do this by eliminating the constraints themselves during the ini-
tial steps of model construction. Thus, log S(t|Z) = −Λ(t) is a better starting
point for modeling, weakening the hold the constraints have on us. However,
log{−log S(t|Z)} = log Λ(t) is better still. This is because log Λ(t) can take any
value between −∞ and +∞, whereas Λ(t) itself is constrained to be positive.
The transformation log{−log S(t|Z)} is widely used and is called the log-minus-
log transformation. The above cases of the exponential and Weibull proportional
hazards models, as already seen, fall readily under this heading.
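One way the transformation is commonly exploited, sketched here rather than taken from the text, is graphical: under proportional hazards the curves log{−log S(t|Z)} for different covariate levels differ only by a vertical shift. Two Weibull curves with assumed parameters stand in below for estimated survival curves.

import numpy as np

t = np.linspace(0.1, 5.0, 50)
lam, gam, beta = 0.4, 1.3, 0.8
S0 = np.exp(-(lam * t) ** gam)                       # baseline group
S1 = S0 ** np.exp(beta)                              # second group, proportional hazards
gap = np.log(-np.log(S1)) - np.log(-np.log(S0))      # vertical distance between the transformed curves
print(gap.min(), gap.max())                          # constant, and equal to beta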
Other models
The exponential, piecewise exponential, and Weibull models are of particular
interest to us because they are especially simple and of the proportional hazards
form. Nonetheless there are many other models which have found use in practical
applications. Some are directly related to the above, such as the extreme value
model in which
S(t) = exp{ −exp( (t − μ)/σ ) },
since, if T is Weibull, then log T is extreme value with σ = 1/γ and μ = −log λ.
These models may also be simple when viewed from some particular angle. For
instance, if M (s) is the moment-generating function for the extreme value density
then we can readily see that M (s) = Γ(1 + s). A distribution, closely related to
the extreme value distribution (Balakrishnan and Johnson, 1994), and which has
found wide application in actuarial work is the Gompertz where
The hazard rates for these distributions increase with time, and, for actuar-
ial work, in which time corresponds to age, such a constraint makes sense for
studying disease occurrence or death. The normal distribution is not a natural
candidate in view of the tendency for survival data to exhibit large skewness, not
2.5 Censoring
We typically view the censoring as a nuisance feature of the data, and not
of direct interest in its own right, essentially something that hinders us from
estimating what it is we would like to estimate. In order for our endeavors to
succeed we have to make some assumptions about the nature of the censoring
mechanism. The assumptions may often be motivated by convenience, in which
case it is necessary to give consideration as to how well grounded the assumptions
appear to be, as well as to how robust the procedures are to departures from any
such assumptions. In other cases the assumptions may appear natural given the
physical context of interest, a common case being the uniform recruitment into
a clinical trial over some predetermined time interval. When the study closes
patients for whom the outcome of interest has not been observed are censored
at study close and until that point occurs it may be reasonable to assume that
patients are included in the study at a steady rate.
It is helpful to think of a randomly chosen subject being associated with a
pair of random variables (T, C), an observation on one of the pair impeding
observation on the other, while at the same time indicating that the unobserved
member of the pair must be greater than the observed member. This idea is
made more succinct by saying that only the random variable X = min(T, C)
can be fully observed. Clearly Pr (X > x) = Pr (T > x, C > x) and we describe
censoring as being independent whenever
Type I censoring
Such censoring most often occurs in industrial or animal experimentation. Items
or animals are put on test and observed until failure. The study is stopped at
some time T ∗ . If any subject does not fail it will have observed survival time at
least equal to T ∗ . The censoring times for all those individuals being censored
are then equal to T ∗ . Equation (2.16) is satisfied and so this is a special case of
independent censoring, although not very interesting since all subjects, from any
random sample, have the same censoring time.
Type II censoring
The proportion of censoring is determined in advance. So if we wish to study
100 individuals and observe half of them as failures, we determine the number
of failures to be 50. Again all censored observations have the same value T ∗
although, in this case, this value is not known in advance. This is another special
case of independent censoring.
Pr (Xi > x) = Pr (Ti > x, Ci > x) = Pr (Ti > x) Pr (Ci > x). (2.17)
The assumption is strong but not entirely arbitrary. For the example of the clinical
trial with a fixed closing date for recruitment it seems reasonable to take the
length of time from entry up until this date as not being associated with the
mechanism generating the failures. For loss to follow-up due to an automobile
accident or due to leaving the area, again the assumption may be reasonable,
or, at least, a good first approximation to a much more complex, unknown, and
almost certainly unknowable, reality.
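A small simulation, not from the text, of this random censorship setup; the exponential failure times and uniform censoring times below are illustrative stand-ins for, say, staggered entry with a fixed closing date.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
T = rng.exponential(scale=2.0, size=n)    # failure times
C = rng.uniform(0.0, 5.0, size=n)         # censoring times, independent of T
X = np.minimum(T, C)                      # only X = min(T, C) is fully observed
delta = (T <= C).astype(int)              # indicator of an observed failure

x = 1.5
print((X > x).mean())                     # empirical Pr(X > x)
print((T > x).mean() * (C > x).mean())    # Pr(T > x) Pr(C > x): close, as in (2.17)
print(delta.mean())                       # proportion of observed failures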
interest, from the study implies a change in risk. Informative censoring is nec-
essarily more involved than non-informative censoring and we have to resort to
more elaborate models for the censoring itself in order to make progress. If, as
might be the case for a clinical trial where the only form of censoring would
be the termination of the study, we know for each subject, in advance, their
censoring time C, we might then postulate that
are only allowed to take place on the set {a0 , a1 , . . . , ak }. This is not a practi-
cal restriction since we can make the division (aj−1 , aj ) as fine as we wish. We
will frequently need to consider the empirical distribution function and analogues
(Kaplan-Meier estimate, Nelson-Aalen estimate) in the presence of censoring. If
we adopt this particular censoring setup of finite censoring support, then gener-
alization from the empirical distribution function to an analogue incorporating
censoring is very straightforward. We consider this in greater detail when we
discuss the estimation of marginal survival.
A number of models for competing risks are in use and we consider these briefly
without going into technical detail. We might view the simplest approach to this
question as one, seen from a somewhat abstract theoretical angle, where everyone
will ultimately suffer any and all events given enough time. We only observe the
first such event and all others are censored at that time. This corresponds to
the competing risks model based on latent events and with an independence
realistic, but there are many technical issues associated with this model and
there is considerable practical difficulty in outlining an appropriate partition of
the event space. The survival probabilities of main interest are confounded with
all of the other survival probabilities and, necessarily, dependent on the chosen
partition. This can be problematic from the point of view of interpretation. It is
also very difficult in terms of finding suitable data that allow estimation of the
main quantities of interest. Competing risks models have seen greater application
in epidemiology than in clinical research and we consider this again in the context
of genetic epidemiology (Chapter 5).
For the simplest case, that of just two competing risks, say failure and censor-
ing, a quite fundamental result was derived by Tsiatis (1975). The result indicates
that the marginal distributions of failure or censoring—viewed as competing risks,
i.e., the observation on one impedes the observation on the other—cannot, under
general conditions, be estimated. These distributions are not identifiable. In order
to find consistent estimates of these distributions from observations arising under
such competing risks, it is necessary to add further restrictive conditions. The
most common condition used in this context is that of independence. Tsiatis
(1975), in a paper that is something of a cornerstone to subsequent theory,
showed that we can obtain consistent estimates using, for example, the Kaplan-
Meier estimator, under the assumption of independence.
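The point can be illustrated with a hand-rolled product-limit calculation, a sketch rather than anything taken from the text: with independently generated failure and censoring times the Kaplan-Meier estimate lands close to the true survival probability. All distributions and parameter values below are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
n = 5_000
T = rng.exponential(scale=2.0, size=n)          # true failure times
C = rng.exponential(scale=3.0, size=n)          # independent censoring times
X = np.minimum(T, C)
delta = (T <= C).astype(int)

order = np.argsort(X)
X, delta = X[order], delta[order]
at_risk = np.arange(n, 0, -1)                   # numbers at risk just before each ordered time
km = np.cumprod(1.0 - delta / at_risk)          # product-limit (Kaplan-Meier) estimate

x0 = 1.0
print(km[np.searchsorted(X, x0, side="right") - 1])   # estimate of S(1.0)
print(np.exp(-x0 / 2.0))                              # true S(1.0) under the exponential with mean 2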
Consider a clinical trial where patients enter the study according to a uniform
law over some time interval. This appears to be a somewhat reasonable model.
Their time to incidence of some event of interest is then recorded. At some point
the study is closed and the available data analyzed. Not only will we consider
those patients yet to experience the event to be censored but such censoring will
be taken to be independent. The assumption seems reasonable. It would break
down, of course, if the recruitment over the interval were subject to change, in
particular if the more robust patients were recruited early on. So, even in such a
simple case, the needed independence assumption ought to come under scrutiny.
We can encounter other situations where the independence assumption requires
something of a stretch of the imagination. For example, if patients who are
responding poorly to treatment were to be removed from the study.
Next, let’s consider a more realistic but greatly more complex situation in
which there are several competing risks. We cannot reasonably assume indepen-
dence. The occurrence of any particular outcome will prevent us from observing
all of the other outcomes. In real-life situations this is what happens of course,
when dealing with an absorbing state such as death. If a patient were to die of
heart failure then he or she will no longer be in a position to die of anything else.
In order to make progress in analyzing such data it is necessary to make a large
number of, essentially, unverifiable assumptions. In making these assumptions
we need to be guided by a basic principle in all statistical modeling, the prin-
ciple of parsimony. The situation we wish to model is, almost always, infinitely
more complex than that we can realistically hope to model accurately. Even if we
could find a plausible model deemed to describe the mechanism that generates
the observations, the lack of precision of our parameter estimates will be such
as to drown in noise any worthwhile signal we are aiming to identify. Modeling
requires something of the artist’s ability to sense when, in trying too hard, their
work can fall short of the goal in mind. The dilemma confronting the statisti-
cal modeler is not altogether different. Competing risks models are particularly
complex and great care is called for in order to remain fully confident in our
conclusions. The field is very large and several published texts are devoted solely
to the topic. We do not have the space here to do justice to this extensive body
of work other than to make some observations on the three main approaches that
have been widely adopted. For those new to the field, two clear overviews are
given by Putter et al. (2007) and Xu and Tai (2010). There is a large body of
literature on these topics and suggested reading would include Crowder (1994),
Tsiatis (2005), Moeschberger and Kochar (2007), Andersen and Rosthoj (2002),
Bakoyannis and Touloumi (2012) and Austin and Fine (2017). The impact of
competing risks on clinical trials is looked at in Satagopan and Auerbach (2004)
and Freidlin and Korn (2005).
In order to assess the impact of model choice on estimation we can consider
the development more formally. For CRM-I, the independence assumption for
all causes is made together with the idea of latent event times. In this setup
every individual is taken, initially, to be at risk of an event of all causes. If we
have m competing risks and times to event, T1 , T2 , ..., Tm , then we only get to
observe X = min(T1 , T2 , ..., Tm ) and all event times greater than the minimum
remain unobserved and recorded as censored at X. This structure is simple and
appealing. Cancer registries such as the SEER registries compiling data from
several millions of individuals provide information of a form that can be used by
CRM-I. Very large numbers of individuals, at given ages, are taken to be at risk
from one of the possible outcome events under study. The number of occurrences
of any particular cause over some small age interval divided by the number at
risk, either at the beginning of the interval or some average over the interval,
provides an estimate of the age-specific incidence rate for that particular cause.
This is then repeated for the next adjacent interval. Cumulative incidence follows
from just adding together the age-specific incidence rates. Using the elementary
formulas introduced in Section 2.3 we can obtain a survival probability. This
probability is not easily interpreted, leaning as it does on many near impossibly
restrictive assumptions and, more bothersome still, an idealized characterization
of reality that is so far removed from human existence as to make its direct
applicability problematic.
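As a concrete, if idealized, illustration of the CRM-I structure just described, the following sketch simulates latent times for a handful of causes, observes only the minimum, and accumulates interval-specific incidence rates for one cause. The rate parameters, the interval grid, and the variable names are arbitrary choices made only for this example.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 10_000, 3                          # subjects and competing causes
    rates = np.array([0.02, 0.01, 0.015])     # illustrative cause-specific hazards

    # CRM-I: every subject carries a latent time for every cause;
    # only the smallest is observed, the others are censored at that value
    latent = rng.exponential(1.0 / rates, size=(n, m))
    x = latent.min(axis=1)                    # observed time
    cause = latent.argmin(axis=1)             # observed cause

    # interval-specific incidence for cause 0, then cumulative incidence
    grid, cum = np.arange(0.0, 100.0, 5.0), 0.0
    for a0, a1 in zip(grid[:-1], grid[1:]):
        at_risk = np.sum(x >= a0)                             # still event-free at a0
        events = np.sum((cause == 0) & (x >= a0) & (x < a1))  # cause-0 events in interval
        cum += events / at_risk if at_risk else 0.0           # add the interval rates
    print(round(cum, 3))

Run over a long enough time span, the cumulative quantity for any single cause keeps creeping upwards, which is exactly the interpretational difficulty raised above.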
The implications of choosing CRM-J rather than CRM-I are highlighted by
consideration of Equation 2.5. At some time point, let’s say aj , we have that
where SI (t) is the survival function for the event of interest under an assumption
of independence of the competing events, i.e., SI (t) = S(t) for all t. In Equation
2.5 there was only a single competing event, the censoring itself, taken to be
independent, but let us extend that to all competing events. The left-hand side
of the equation gives the probability of surviving beyond time point aj and this
is the simple product of the probabilities of reaching time point aj−1 together
with the probability of surviving the interval (aj−1 , aj ). These probabilities are
not affected by an independent censoring mechanism, the working assumption
that we are making. So, when it comes to estimation we can use convergent
estimators of survival, such as the Kaplan-Meier or the Nelson-Aalen formulas
together with the observed rate estimated from the interval (aj−1 , aj ). From
a theoretical viewpoint any subject censored before time aj−1 remains in an
abstract sense at risk for the event under study albeit, given actual data, for this
subject, the event can not be observed. The event for this subject is considered
a latent variable, it exists but is not observable.
For the competing risk model CRM-J, this is no longer true. Once one type of
event has been observed then all other types of event are considered to not exist,
we do not have a latent variable structure. In consequence, in the above formula,
S(aj ) describes the probability of the event of interest occurring after time point
aj , and this is the product of the incidence rate for this event between aj−1 and
aj and the probability, SJ(aj−1), that all events, the one of interest together with
all competing events, occur some time later than aj−1. In this case we have:
Clearly, SI (t) and SJ (t) in Equations 2.19 and 2.20 may differ so that in turn,
the cumulative incidence rates, 1 − S(aj ), depending on whether model CRM-I
or CRM-J is assumed, can also differ. For a breast cancer study, Satagopan and
Auerbach (2004) found the cumulative incidence rates based on CRM-I or CRM-
J to all but coincide while, for an example in haematologic malignancy, the two
working models led to quite different estimates.
The working assumption behind the characterization of CRM-I in terms of
latent variables is that, ultimately, if given enough time, then everyone dies from
everything. Of course this is a hypothetical construct and while of value in the
appropriate context, it has little role to play in individual prediction. Such pre-
diction can give an entirely false impression to the non-specialist. For any cause,
say, for example, breast cancer, the only reason registry data will furnish a prob-
ability that grows to anything less than one (100%) is a lack of observations,
a limitation on the overall time span. This is a somewhat abstract, theoretical
observation—its mathematical expression comes from the divergence of the har-
monic series—that does require that we think carefully about how we interpret any
cumulative risks. The age-specific risks are calculated by constantly updating
the risk sets, having removed deaths from all other causes in the denominator.
fortunately, this is not the goal. The goal is to provide no more than some rough
and ready tools to help the investigators gain better insight into the myriad of
factors, and their approximate relative importance, that may be implicated in the
outcomes on which we are focused.
A slightly modified structure, that is well described within the framework
of the compartment models described earlier, gives rise to the so-called semi-
competing risks models. For these models, one of the competing causes is not
an absorbing state so that, having arrived in the non-absorbing state, subjects
are still free to leave this state in order to transition to the absorbing state.
Seen from the viewpoint of a compartment model it is clear that semi-competing
risks models coincide with the classical illness-death model (Fix and Neyman,
1951). For this reason we refer to the model as CRM-ID. Some of the underlying
assumptions can differ however, in particular the existence and behavior of latent
times, and these assumptions will impact the task of estimation. There is an
extensive literature on such models and the methods of this book can be adapted
with more or less effort in order to deal with inferential questions that arise in
this context. In this text our central focus is on regression and while we do not
study in depth the question of regression with competing risks models there are
several papers that do, among which Larson and Dinse (1985), Lunn and McNeil
(1995), Kim (2007), Scheike and Zhang (2008), Lau and Gange (2009), Dignam
and Kocherginsky (2012), and Haller et al. (2013).
Which of the three main models, CRM-I, CRM-J, and CRM-ID, we may use
for analysis will mostly be guided by the application. The different underlying
assumptions can have a more or less strong impact on inferences. The most
useful focus is regression modeling, which can involve assessing the impact of
certain risk factors after having adjusted for others, the joint impact of these risk
factors and the relative importance of different risk factors. In this, arguably the
most important, setting the model choice tends to be relatively robust. If one
model provides a hierarchy of importance of risk factors then, in the main, the
same or a very similar hierarchy obtains with other models. The main differences
due to model choice will occur when we consider absolute rather than relative
quantities such as the probability of surviving so many years without being inci-
dent of some particular disease. A great deal of circumspection is recommended
when describing absolute rather than relative assessments, especially when any
such assessments can be heavily model dependent. The author’s own limited
experiment in asking fellow statisticians—mostly not in the particular field—to
interpret the 87% chance of breast cancer risk given to some carriers of a mutated
gene was revealing. Not one person got it right—the author is no exception, at
least initially—leading to the inescapable conclusion that this number, 87%, is
almost impossibly difficult to interpret. For non-specialists, and most particularly
the carriers themselves, it is hard to understand the wisdom behind asking them
to make life-changing decisions on the basis of any such number. We return to
this in the chapter on epidemiology and, here, we point this out as a way to
underline an important fact: no less effort ought to go into the purpose of our
competing risks models than into their construction.
The main focus of this text is on regression and we do not investigate the
specifics of this in the competing risks setting. Many ideas carry through. Many
do not. An introduction to regression modeling of the sub-distribution function
for models CRM-J and CRM-ID can be found in Fine and Gray (1999) and Zhang
and Fine (2011).
1. Using the definition for λ(t) = f(t)/S(t), show that S(t) = exp{−Λ(t)} and
that f(t) = λ(t) exp{−Λ(t)}.
2. For a Weibull variate with parameters λ and k, derive an expression for the
conditional survivorship function S(t + u, u). How does this function vary
with t for fixed u? With u for fixed t?
6. Repeat the previous class project, focusing this time on the mean residual
lifetime. Again, what conclusions can be drawn from the graphs?
7. Consider a disease with three states of gravity (state 1, state 2, and state
3), the severity corresponding to the size of the number. State 4 corre-
sponds to death and is assumed to follow state 3. New treatments offer
the hope of prolonged survival. The first treatment, if it is effective, is antic-
ipated to slow down the rate of transition from state 2 to state 3. Write
down a compartmental model and a survival model, involving a treatment
indicator, for this situation. A second treatment, if effective, is anticipated
to slow down all transition rates. Write down the model for this. Write
down the relevant null and alternative hypotheses for the two situations.
10. For a proportional hazards Weibull model describe the relationship between
the respective medians.
11. Investigate the function S(t, u) for different parametric models described
in this chapter. Draw conclusions from the form of this two-dimensional
function and suggest how we might make use of these properties in order
to choose suitable parametric models when faced with actual data.
13. Consider a clinical trial comparing two treatments in which patients enter
sequentially. Identify situations in which an assumption of an independent
censoring mechanism may seem a little shaky.
14. On the basis of a single data set, fit the exponential, the Weibull, the
Gompertz, and the log-normal models. On the basis of each model estimate
the mean survival. On the basis of each model estimate the 90th percentile.
What conclusions would you draw from this?
15. Suppose our focus of interest is on the median. Can you write down a model
directly in terms of the median? Would there be any advantage/drawback
to modeling in this way rather than modeling the hazard and then obtaining
the median via transformations of the hazard function?
16. A cure model will typically suppose two populations; one for which the
probability of an event follows some distribution and another for which
the hazard rate can be taken to be zero. The ratio of the sizes of the
populations is θ/(1 − θ). Write down such a model and indicate how we
could go about estimating the mixing parameter, θ. How might a Bayesian
tackle such a problem?
The marginal survival function is of central interest even when dealing with covari-
ates. We need to find good estimates of this function and we use those estimates
in several different contexts. Good estimates only become possible under certain
assumptions on the censoring mechanism. Attention is paid to the exponential
and piecewise exponential models, both of which are particularly transparent.
The exponential model, fully characterized by its mean, can appear over restric-
tive. However, via the probability integral transform and empirical estimates of
marginal survival, it can be used in more general situations. The piecewise expo-
nential is seen, in some sense, to lie somewhere between the simple exponential
and the empirical estimate. Particular attention is paid to empirical processes
and how the Kaplan-Meier estimator, very commonly employed in survival-type
problems, can be seen to be a natural generalization of the empirical distribution
function. In the presence of parametric assumptions, it is also straightforward to
derive suitable estimating equations. The equations for the exponential model
are very simple.
Our interest is mostly in the survival function S(t). Later we will focus on how
S(t), written as S(t|Z), depends on covariates Z. Even though such studies of
dependence are more readily structured around the hazard function λ(t|Z), the
most interpretable quantity we often would like to be able to say something about
is the survival function itself. In order to distinguish the study of the influence
of Z on S(t|Z) from the less ambitious goal of studying S(t), we refer to the
former as conditional survival and the latter as marginal survival.
Since we will almost always have in mind some subset Z from the set of
all possible covariates, and some distribution for this subset, we should remind
ourselves that, although Z has been “integrated out” of the quantity S(t), the
distribution of Z does impact S(t). Different experimental designs will gener-
ally correspond to different S(t). Marginal survival, S(t), corresponds to two
situations: (i) the subjects are considered as i.i.d. replicates from a single
population, or (ii) the subjects are drawn from many, potentially infinitely many,
distinct populations, each population being indexed by a value of some
covariate Z. It may also be that we have no information on the covariates Z
that might distinguish these populations. In case (ii), S(t) is an average over
these several populations, not necessarily representing any particular population
of interest in itself. It is important to appreciate that, in the absence of distribu-
tional assumptions, and the absence of observable Z, it is not possible, on the
basis of data, to distinguish case (i) from case (ii). The homogeneous case then
corresponds to either case (i) or case (ii) and it is not generally useful to speculate
on which of the cases we might be dealing with. They are not, in the absence
of observable Z, identifiable from data. We refer to S(t|Z) as the conditional
survival function given the covariate Z. This whole area, the central focus of this
work, is studied in the following chapters. First, we need to consider the simpler
case of a single homogeneous group.
Let us suppose that the survival distribution can be completely specified via
some parametric model, the parameter vector being, say, θ. We take θ to be a
scalar in most cases in order to facilitate the presentation. The higher-dimensional
generalization is, in most cases, very straightforward.
This covers the majority of cases in which parametric models are used.
Later, when we focus on conditional survival involving covariates Z, rather than
marginal survival, the same arguments follow through. In this latter case the
common assumption, leading to an analogous expression for the log-likelihood,
is that of conditional independence of the pair (T, C) given Z.
Estimating equation
The maximum likelihood estimate is obtained as the value of θ, denoted θ̂, which
maximizes L(θ) over the parameter space. Such a value also maximizes log L(θ)
(by monotonicity) and, in the usual case where log L(θ) is a continuous function
of θ this value is then the solution to the estimating equation (see Appendix
D.1),
U(θ) = ∂ log L(θ)/∂θ = Σ_i ∂ log Li(θ)/∂θ = 0.
Next, notice that at the true value of θ, denoted θ0 , we have Var{U (θ0 )} =
EU 2 (θ0 ) = EI(θ0 ) where
I(θ) = Σ_{i=1}^n Ii(θ) = −∂² log L(θ)/∂θ² = −Σ_{i=1}^n ∂² log Li(θ)/∂θ².
As for likelihood in general, some care is needed in thinking about the meaning
of these expressions and the fact that the operators E(·) and Var(·) are taken
with respect to the distribution of the pairs (xi , δi ) but with θ0 fixed. The score
equation is U (θ̂) = 0 and the large sample variance is approximated by Var(θ̂) ≈
1/I(θ̂). It is usually preferable to base calculations on I(θ̂) rather than EI(θ̂),
the former being, in any event, a consistent estimate of the latter (after dividing
both sides of the equation by n). The expectation itself would be complicated to
evaluate, involving the distribution of the censoring, and unlikely, in view of the
study by Efron and Hinkley (1978) to be rewarded by more accurate inference.
Newton-Raphson iteration is set up from the updating formula θ̂_{j+1} = θ̂_j + I(θ̂_j)^{−1} U(θ̂_j),
where θ̂1 is some starting value, often zero, for the iterative cycle. The Newton-
Raphson formula arises as an immediate application of the mean value theorem
(Appendix A). The iteration is brought to a halt once we achieve some desired
level of precision.
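A minimal sketch of this iterative cycle, written for a generic one-parameter score, is given below. The censored-exponential score and information used to exercise it anticipate the next section, and the data values and starting point are invented purely for illustration.

    import numpy as np

    def newton_raphson(score, info, theta, tol=1e-8, max_iter=50):
        # iterate theta_{j+1} = theta_j + U(theta_j) / I(theta_j)
        for _ in range(max_iter):
            step = score(theta) / info(theta)
            theta += step
            if abs(step) < tol:
                break
        return theta

    # censored exponential model: U(lam) = k/lam - sum(x), I(lam) = k/lam**2
    x = np.array([3.0, 5.0, 8.0, 2.0, 12.0])
    delta = np.array([1, 0, 1, 1, 0])
    k, total = delta.sum(), x.sum()
    lam = newton_raphson(lambda l: k / l - total, lambda l: k / l**2, theta=0.05)
    print(round(lam, 4), round(k / total, 4))   # agrees with the closed form k / sum(x)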
Large sample inference can be based on any one of the three tests based
on the likelihood function; the score test, the likelihood ratio test, or the Wald
test. For the score test there is no need to estimate the unknown parameters.
Many well-established tests can be derived in this way. In exponential families,
also the so-called curved exponential families (Efron et al., 1978), such tests
reduce to contrasting some observed value to its expected value under the model.
Confidence intervals with optimal properties (Cox and Hinkley, 1979) can be
constructed from uniformly most powerful tests. For the exponential family class
of distributions the likelihood ratio forms a uniformly most powerful test and,
as such, allows us to obtain confidence intervals with optimal properties. The
other tests are asymptotically equivalent so that confidence intervals based on
the above test procedures will agree as sample size increases. Also we can use
such intervals for other quantities of interest such as the survivorship function
which depends on these unknown parameters.
If Θα denotes a 100(1 − α)% confidence region for θ and we define

Sα+(t; θ̂) = sup_{θ∈Θα} S(t; θ),    Sα−(t; θ̂) = inf_{θ∈Θα} S(t; θ),    (3.3)
then Sα+ (t; θ̂) and Sα− (t; θ̂) form the endpoints of the 100(1 − α)% confidence
interval for S(t; θ). Such a quantity may not be so easy to calculate in general,
simulating from Θα or subdividing the space being an effective way to approx-
imate the interval. Some situations nonetheless simplify such as the following
example, for scalar θ, based on the exponential model in which S(t; θ) is mono-
tonic in θ. For such cases it is only necessary to invert any interval for θ to obtain
an interval with the same coverage properties for S(t; θ).
Exponential survival
For this model we only need estimate a single parameter, λ, which will then
determine the whole survival curve. Referring to Equation 3.1, for δi = 1
the contribution to the likelihood is f(xi; λ) = λ exp(−λxi) and, for δi = 0, the
contribution is S(xi; λ) = exp(−λxi). Equation 3.1 then becomes:
log L(λ) = k log λ − λ Σ_{j=1}^n xj,    (3.4)
where k = Σ_{i=1}^n Ni(∞). Differentiating this and equating with zero we find that
λ̂ = k / Σ_{j=1}^n xj. Differentiating a second time we obtain I(λ) = k/λ². Note, by
conditioning upon the observed number of failures k, that EI(λ) = I(λ), the
observed information coinciding with the expected Fisher information, a property
of exponential families, but which we are not generally able to recover in the
presence of censoring.
An ancillary argument would nonetheless treat k as being fixed and this is
what we will do as a general principle in the presence of censoring, the observed
information providing the quantity of interest. Some discussion of this is given
by Efron and Hinkley (1978) and Barndorff-Nielsen and Cox (1994). We can now
write down an estimate of the large sample variance which, interestingly, only
depends on the number of observed failures. Thus, in order to correctly estimate
the average, it is necessary to take into account the total time on study for both
the failures and those observations that result in censoring. On the other hand,
given this estimate of the average, the precision we will associate with this only
depends on the observed number of failures. This is an important observation
and will be made again in the more general stochastic process framework.
Multivariate setting
In the majority of applications, the parameter θ will be a vector of dimension p.
The notation becomes heavier but otherwise everything is pretty much the same.
The estimating equation, U (θ) = 0 then corresponds to a system of p estimating
equations and I(θ) is a p × p symmetric matrix in which the (q, r) th element
is given by −∂ 2 log L(θ)/∂θq ∂θr where θq and θr are elements of the vector θ.
Also, the system of Newton-Raphson iteration can be applied to each one of the
components of θ so that we base our calculations on the set of updates
θ̂_{j+1} = θ̂_j + I(θ̂_j)^{−1} U(θ̂_j), where, in this case, θ̂1 is a vector of starting
values for the iterative cycle, again most often zero.
Example 3.1. For the Freireich data (Cox, 1972), we calculate for the 6-MP
group λ̂ = 9/359 = 0.025. For the placebo group we obtain λ̂ = 21/182 = 0.115.
Furthermore in the 6-MP group we have Var (λ̂) = 9/(359)2 = 0.000070 whereas
for the placebo group we have Var (λ̂) = 21/(182)2 = 0.0006.
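The arithmetic of Example 3.1 is easily checked; the only inputs are the event counts and total follow-up times quoted above.

    # lambda_hat = k / sum(x_j) and Var(lambda_hat) ~ 1/I(lambda_hat) = k / (sum(x_j))**2
    for group, k, total in [("6-MP", 9, 359), ("placebo", 21, 182)]:
        print(group, round(k / total, 3), round(k / total**2, 6))
    # 6-MP: 0.025 and 0.00007; placebo: 0.115 and 0.000634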
The non-parametric empirical estimate (described below) agrees well with
curves based on the exponential model and this is illustrated in Figure 3.1.

Figure 3.1: Empirical estimate of survival plotted against time, together with the fitted exponential curves.

Inference can be based on an appeal to the usual large sample theory. In this particular
case, however, we can proceed in a direct way by recognizing that, for the case of
no censoring, k = n, the sample size, and Σ_{j=1}^n Tj is a sum of n independent
random variables, each exponential with parameter λ. We can therefore treat n/λ̂
as a gamma variate with parameters (λ, n). When there is censoring, in view
of the consistency of λ̂, we can take k/λ̂ as a gamma variate with parameters
(λ, k), when k < n. This is not an exact result, since it hinges on a large sample
approximation, but it may provide greater accuracy than the large sample normal
approximation.
In order to be able to use standard tables we can multiply each term of the
sum by 2λ since this then produces a sum of n exponential variates, each with
mean 2. Such a distribution is a gamma (2, n), equivalent to a chi-square
distribution with 2n degrees of freedom (Evans et al., 2001). Taking the range of
values of 2kλ/λ̂ to be between χ²α/2 and χ²1−α/2, the corresponding quantiles of a
chi-square with 2k degrees of freedom, gives a 100(1 − α)% confidence
interval for λ. For the Freireich data we find 95% CI = (0.0115, 0.0439). Once
we have intervals for λ, we immediately have intervals with the same coverage
properties for the survivorship function, this being a monotonic function of λ.
Denoting the upper and lower limits of the 100(1 − α)% confidence interval
Sα+ (t; λ̂) and Sα− (t; λ̂) respectively, we have:
Sα+(t; λ̂) = exp{−(λ̂ χ²α/2 / 2k) t},    Sα−(t; λ̂) = exp{−(λ̂ χ²1−α/2 / 2k) t},
where, this time, the corresponding expression for Sα− (t; λ̂) obtains by replacing
z1−α/2 by zα/2 .
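The interval (0.0115, 0.0439) quoted above for the Freireich 6-MP group can be reproduced directly, treating 2kλ/λ̂ = 2λ Σ xj as approximately chi-square on 2k degrees of freedom; scipy is used only for the chi-square quantiles, and the time point used for the survivorship limits is arbitrary.

    import numpy as np
    from scipy.stats import chi2

    k, total, alpha = 9, 359, 0.05            # events and total time, 6-MP group
    lo = chi2.ppf(alpha / 2, 2 * k) / (2 * total)        # lower limit for lambda
    hi = chi2.ppf(1 - alpha / 2, 2 * k) / (2 * total)    # upper limit for lambda
    print(round(lo, 4), round(hi, 4))         # approximately (0.0115, 0.0439)

    # monotonicity carries the interval over to the survivorship function
    t = 10.0
    print(round(np.exp(-hi * t), 3), round(np.exp(-lo * t), 3))   # limits for S(10)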
λ̂j = kj / Σ_{i: xi > aj−1} {(xi − aj−1) I(xi < aj) + (aj − aj−1) I(xi ≥ aj)}.
Confidence intervals for this function can be based on Equation 3.3. We can view
the simple exponential survival model as being at one extreme of the parametric
spectrum, leaning as it does on a single parameter, the mean. It turns out that
we can view the piecewise exponential model, with a division so fine that only
single failures occur in any interval, as being at the other end of the parametric
spectrum, i.e., a non-parametric estimate. Such an estimate corresponds to that
obtained from the empirical distribution function. This is discussed below.
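A sketch of the piecewise rate estimate λ̂j displayed above, applied to a small artificial data set, is given below. The cut-points aj are arbitrary, and the seven observations are those used later to illustrate the Kaplan-Meier estimate.

    import numpy as np

    x     = np.array([2, 4, 7, 9, 13, 16, 22], dtype=float)   # observed times
    delta = np.array([1, 1, 0, 1, 0, 1, 0])                   # 1 = failure, 0 = censored
    cuts  = np.array([0.0, 5.0, 10.0, 25.0])                  # a_0 < a_1 < ... illustrative

    for a0, a1 in zip(cuts[:-1], cuts[1:]):
        at_risk = x > a0                                       # subjects with x_i > a_{j-1}
        k_j = np.sum((delta == 1) & at_risk & (x <= a1))       # failures in the interval
        exposure = np.sum(np.where(x[at_risk] < a1,            # time at risk in the interval
                                   x[at_risk] - a0, a1 - a0))
        print((a0, a1), int(k_j), round(k_j / exposure, 4))    # interval-specific rate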
For each t = aℓ, (ℓ > 0), the empirical estimate of S(t) based on a sample of size
n, and denoted Sn(t), is simply the observed proportion of observations that are
greater than t. For a random sample of observations Ti, (i = 1, . . . , n) we use the
indicator variable I(·) to describe whether or not the subject i survives beyond
point t, i.e., for t = aj ,
Sn(aj) = (1/n) Σ_{i=1}^n I(Ti > aj) = Π_{ℓ=1}^j Sn(aℓ, aℓ−1).    (3.8)
Lemma 3.2. For any fixed value of t, Sn(t) is asymptotically normal with
mean S(t) and variance S(t){1 − S(t)}/n.
Finally these results apply more generally than just to the uniform case for,
as long as T has a continuous distribution, there exists a unique monotonic
transformation from T to the uniform, such a transformation not impacting
√n{Fn(t) − F(t)} itself. In particular this enables us to use the result of
Appendix B.2 to make inference for an arbitrary continuous cumulative
distribution function, F(t), whereby, for Wn(t) = √n{Fn(t) − F(t)},

Pr{ sup_t |Wn(t)| ≤ D } → 1 − 2 Σ_{k=1}^∞ (−1)^{k+1} exp(−2k²D²),   D ≥ 0.    (3.12)
Most often we are interested in events occurring with small probability in which
case a good approximation obtains by only taking the first term of the sum, i.e.,
k = 1. Under this approximation |√n{Fn(t) − F(t)}| will be greater than about
1.4 less than 5% of the time. This is a simple and effective working rule.
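The value of about 1.4 follows from the first-term approximation itself: setting 2 exp(−2D²) equal to 0.05 and solving for D.

    import numpy as np

    # Pr(sup |W_n| > D) is approximately 2 exp(-2 D^2), the first term of (3.12)
    D = np.sqrt(-np.log(0.05 / 2) / 2)
    print(round(D, 3))                         # about 1.36, i.e. "about 1.4"
    print(round(2 * np.exp(-2 * 1.4**2), 3))   # exceedance probability at 1.4, below 5%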
cannot be calculated. Simply taking the observed censored times as though they
were failure times will clearly not work so that the estimator

(1/n) Σ_{i=1}^n {1 − Yi(t)} = (1/n) Σ_{i=1}^n 1{Xi ≤ t},   0 ≤ t ≤ T,
will exhibit increasing bias as the amount of censoring increases. This follows
since we will be estimating P(T ≥ t, C ≥ t) and this underestimates P(T ≥ t).
A little more work is needed although it is only really a question of treating the
various needed ingredients in a sensible way for things to work. The most famous
empirical estimate is that of Kaplan and Meier (1958) which we can show to be
consistent under an independent censoring mechanism.
X    2    4    7    9    13   16   22
δ    1    1    0    1    0    1    0

Figure 3.2: The estimated Kaplan-Meier curve Ŝ(t) for these data, plotted against time.

Table 3.2:

Xi    Σj Yj(Xi)    1 − δi/Σj Yj(Xi)    Running product                              Ŝ(Xi)
 2    7            1 − 1/7             (1 − 1/7)                                    0.857
 4    6            1 − 1/6             (1 − 1/7)(1 − 1/6)                           0.714
 7    5            1                                                                0.714
 9    4            1 − 1/4             (1 − 1/7)(1 − 1/6)(1 − 1/4)                  0.536
13    3            1                                                                0.536
16    2            1 − 1/2             (1 − 1/7)(1 − 1/6)(1 − 1/4)(1 − 1/2)         0.268
22    1            1                                                                0.268
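The column Ŝ(Xi) of the table can be reproduced by a few lines of code, with no survival library needed; the data are those listed at the top of the table.

    import numpy as np

    x     = np.array([2, 4, 7, 9, 13, 16, 22], dtype=float)
    delta = np.array([1, 1, 0, 1, 0, 1, 0])

    surv = 1.0
    for xi, di in zip(x, delta):              # the data are already ordered in time
        n_risk = np.sum(x >= xi)              # size of the risk set at xi
        if di == 1:
            surv *= 1.0 - 1.0 / n_risk        # the estimate only changes at failure times
        print(int(xi), int(n_risk), round(surv, 3))   # reproduces 0.857, 0.714, 0.714, ...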
role for values lower than themselves. The Kaplan-Meier curve does not change
at the censored observation time and, beyond this, the role of the censored
observation is indirect. It does not influence the actual calculation. Figure 3.2
shows the estimated Kaplan–Meier curve, Ŝ(t). Unlike the empirical function,
1 − Fn (t), the curve does not reach zero in this example and this will be the
case whenever the last observation is a censored observation. The jumps in the
Kaplan-Meier curve play an important role in proportional and non-proportional
hazards modeling. The size of these jumps at time t, written dŜ(t), is readily
described and we have (Table 3.2):
Proposition 3.1. For t an observed event time and where Ŝ(t−) = lim_{s→t−} Ŝ(s), we can write

dŜ(t) = Ŝ(t) − Ŝ(t−) = −Ŝ(t−) / Σ_{j=1}^n Yj(t).    (3.13)
In later chapters we will see how to make full use of these increments. They can
be viewed as weights or as approximations to infinitesimal contributions to an
integral with respect to the survival function. These are sometimes referred to as
Kaplan-Meier integrals and, in the light of Helly-Bray’s theorem (Appendix A),
allow us to consistently estimate functions of T of interest in the presence of cen-
soring. Details can be found in Stute (1995) where, in particular, under general
conditions, we can obtain the asymptotic normality of these integrals. Another
important example in which we use standard software to fit a proportional haz-
ards model to data generated under a non-proportional hazards mechanism has
been studied by Xu and O’Quigley (2000) and O’Quigley (2008). The use of the
Kaplan-Meier increments enables us to consistently estimate an average regres-
sion effect Eβ(T ) when β(t) is not a constant. Failing to correctly incorporate
the Kaplan-Meier increments into the estimating equation—a common oversight
encouraged by the availability of a lot of standard software—will lead to serious
bias in the estimation of Eβ(T ).
This is because of the non-zero masses being associated to the times at which the
censorings occur. Nonetheless, the rest follows through readily, although, unlike
(3.9), we now define nℓ = Σ_{i=1}^n I(Xi ≥ aℓ), noting that this definition contains
(3.9) as a special case when there is no censoring.
evaluation at the distinct observed failure times t1 < t2 < · · · < tk . All divisions
of the time interval into (aℓ−1, aℓ), ℓ = 1, . . . , N, for different N, lead to the same
estimate Ŝ(t), provided that the set of observed failure points is contained within
the set {aℓ; ℓ = 1, . . . , N}. A minimal division of the time axis arises by taking
the set {aℓ} to be the same as the set of the distinct observed failure times. So,
for practical purposes, aj = tj, j = 1, . . . , k, and the Kaplan-Meier estimate can
be defined as
Ŝ(t) = Π_{j: tj < t} (nj − dj)/nj = Π_{j: tj < t} (1 − dj/nj).    (3.16)
where dj = Σ_{i: Xi = tj} δi, the sum being over the ties at time tj. If there are no ties at
tj then dj = δj. Note that Ŝ(t) is a left-continuous step function that equals
1 at t = 0 and drops immediately after each failure time tj . The estimate does
not change at censoring times. When a censoring time and a failure time tj
are recorded as equal, the convention is that censoring times are adjusted an
infinitesimal amount to the right so that the censoring time is considered to
be infinitesimally larger than tj. Any subjects censored at time tj are therefore
included in the risk set of size nj , as are those that fail at tj . This convention is
sensible because a subject censored at time tj almost certainly survives beyond
tj . Note also that when the last observation is a censoring time rather than
a failure time, the KM estimate is taken as being defined only up to this last
observation.
S̄(t) = Ŝ(tj−1) + {(t − tj−1)/(tj − tj−1)} {Ŝ(tj) − Ŝ(tj−1)};   t ∈ (tj−1, tj).    (3.17)
Note that at the distinct failure times tj the two estimates, Ŝ(t) and S̄(t) coin-
cide. An example of where we make an appeal to S̄(t) is illustrated in the two-
The above expression for the variance of Ŝ(t) is known as Greenwood’s formula.
Breslow and Crowley (1974) in a detailed large sample study of the Kaplan-
Meier estimator obtained a result asymptotically equivalent to the Greenwood
formula, making a slight correction to overestimation of the variation in the
estimated survival probability. The formula’s simplicity, however, made it the
most commonly used when computing the variance of the Kaplan-Meier estimate
and related quantities. We also have:
The usual use to which we put such variance estimates is in obtaining approxi-
mate confidence intervals. Thus, using the large sample normality of Ŝ(t), adding
and subtracting to this z1−α/2 (the 1 − α/2 quantile from the standard normal
distribution) multiplied by the square root of the variance estimate, provides
approximate 100(1 − α)% confidence intervals for Ŝ(t). As mentioned before the
constraints on Ŝ(t), lying between 0 and 1, will impact the operating characteris-
tics of such intervals, in particular, it may not be realistic, unless sample sizes are
large, to limit attention to symmetric intervals around Ŝ(t). Borgan and Liestol
(1990) investigate some potential transformations, especially the log-minus-log
transformation discussed in Section 2.4, leading to
Corollary 3.4. Let w(α) = Var^{1/2}{Ŝ(t)} z1−α/2 / {Ŝ(t) log Ŝ(t)}. For each t = aℓ, a
100(1 − α)% confidence interval for S(t) can be approximated by Ŝ(t)^{exp{±w(α)}}.
The same arguments which led to Greenwood’s formula also lead to approxi-
mate variance expressions for alternative transformations of the survivorship func-
tion. In particular we have
Corollary 3.5. For each t = aℓ the estimate log Ŝ(t) is asymptotically normal
with asymptotic mean log S(t) and variance

Var{log Ŝ(aℓ)} ≈ Σ_{m≤ℓ} dm / {nm(nm − dm)}.    (3.20)
Corollary 3.6. For each t = aℓ the estimate log[Ŝ(t)/{1 − Ŝ(t)}] is asymptotically
normal with asymptotic mean log[S(t)/{1 − S(t)}] and variance

Var( log[ S(aℓ)/{1 − S(aℓ)} ] ) ≈ {1 − S(aℓ)}^{−2} Σ_{m≤ℓ} dm / {nm(nm − dm)}.    (3.21)
Confidence intervals calculated using any of the above results will be of help
in practice. Following some point estimate, obtained from Ŝ(t) at some given
t, these intervals are useful enough to quantify the statistical precision that we
wish to associate with the estimate. All of the variance estimates involve a com-
parable degree of complexity of calculation so that choice is to some extent a
question of taste. Nonetheless, intervals based on the log-minus-log or the logit
transformation will behave better for smaller samples, and guarantee that the
endpoints of the intervals themselves stay within the interval (0,1). This is not
so for the Greenwood formula, the main argument in its favor being that it has
been around the longest and is the most well known. For moderate to large sam-
ple sizes, and for Ŝ(t) not too close to 0 or 1, all the intervals will, for practical
purposes, coincide.
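A sketch of Greenwood's formula and of the log-minus-log interval of Corollary 3.4, for the small worked example of Table 3.2, is given below. The time point and the level are arbitrary, and scipy supplies only the normal quantile.

    import numpy as np
    from scipy.stats import norm

    x     = np.array([2, 4, 7, 9, 13, 16, 22], dtype=float)
    delta = np.array([1, 1, 0, 1, 0, 1, 0])
    t, alpha = 10.0, 0.05
    z = norm.ppf(1 - alpha / 2)

    surv, gw = 1.0, 0.0
    for xi, di in zip(x, delta):
        if xi > t:
            break
        n = np.sum(x >= xi)                   # risk set size n_m
        if di == 1:
            surv *= 1 - 1 / n
            gw += 1 / (n * (n - 1))           # d_m / {n_m (n_m - d_m)} with d_m = 1
    var = surv**2 * gw                        # Greenwood's formula

    # log-minus-log interval: endpoints stay inside (0, 1) by construction
    se = np.sqrt(var) / abs(surv * np.log(surv))
    lo, hi = surv ** np.exp(z * se), surv ** np.exp(-z * se)
    print(round(surv, 3), round(var, 4), round(lo, 2), round(hi, 2))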
Figure 3.3: Kaplan-Meier curves obtained from the Curie breast cancer study.
A proportional hazards assumption suggests itself as a possibility to model the
observed differences. A good candidate test would be the log-rank test.
Figure 3.4: Kaplan-Meier curves obtained from a randomized clinical trial and
a study in breast cancer. In both cases, a proportional hazards assumption is
doubtful. The log-rank test would show poor performance in such situations.
been misclassified and, as we will see in later theoretical work, the consequences
of this would be to produce an impression of a diminishing regression effect.
Figure 3.4 also illustrates two Kaplan-Meier curves taken from a randomized
clinical trial in lung cancer. There are three groups and the short-term and
long-term effects appear to indicate no real treatment effect. Initially, the curves
more or less coincide, as they do in the long term. However, for a significant part
of the study, say between the median and the 10th percentile, there appears to
be a real advantage in favor of the two active treatment groups. The log-rank
test fails to detect this and more suitable tests, described in later chapters, are
able to confirm the treatment differences. The clear lack of proportionality of
hazards is at the root of the problem here.
Striving to obtain tests that to a greater or lesser extent can reverse the power
deficit arising as a result of non-proportional hazards has stimulated the work of
many authors in this field. Weighting the linear contributions to the log-rank
test, leading to the so-called weighted log-rank tests, has a long history (Fleming
and Harrington, 1991; Gehan, 1965; Peto and Peto, 1972). Prentice (1978) and
others showed that a weighting that mirrors the actual form of the departure from
proportional hazards will lead to powerful alternative tests. What is trickier, but
arguably much more relevant, is to obtain tests that will have good performance
in both situations, that will be close to optimal under proportional hazards, if not
quite optimal, but that will retain good power in situations where the alternative
is significantly remote from that of proportional hazards. We consider this in later
chapters.
size, of the function at points where it changes. If we denote tj + the time instant
immediately after tj , then Ŝ(tj ) − Ŝ(tj +) is the stepsize, or jump, of the KM
curve at time tj . From Equation 3.16 we see that
Ŝ(tj+) = Ŝ(tj) · (nj − dj)/nj,

so the stepsize is Ŝ(tj) · dj/nj. That is to say, when the total “leftover” probability
mass is Ŝ(tj), each observed failure gets one nj-th of it, where nj = Σ_{i=1}^n Yi(tj)
is the number of subjects at risk at time tj . In the absence of censoring this
corresponds exactly to the way in which the empirical estimate behaves. When
there is censoring, then one way of looking at a censored observation is to consider
that the mass that would have been associated with it is simply reallocated to all
of those observations still remaining in the risk set (hence the term “redistribution
to the right.”)
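The redistribution idea is easy to check numerically on the worked example: the jump masses Ŝ(tj) dj/nj, together with the mass left over at the final censored observation, account for all of the probability.

    import numpy as np

    x     = np.array([2, 4, 7, 9, 13, 16, 22], dtype=float)
    delta = np.array([1, 1, 0, 1, 0, 1, 0])

    surv, jumps = 1.0, []
    for xi, di in zip(x, delta):
        n = np.sum(x >= xi)
        if di == 1:
            jumps.append(surv / n)            # step size S_hat(tj) * dj / nj with dj = 1
            surv *= 1 - 1 / n
    print(np.round(jumps, 3))                 # 0.143, 0.143, 0.179, 0.268
    print(round(sum(jumps) + surv, 3))        # jumps plus leftover mass equal one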
the mean itself being then estimated by μ̂(∞). However, the theory comes a
little unstuck here since, not only must we restrict the time scale to be within
the range determined by the largest observation, the empirical distribution itself
will not correspond to a probability distribution whenever F̂ (t) = 1 − Ŝ(t) fails to
reach one. In practice then it makes more sense to consider mean life time μ̂(t)
over intervals [0, t], acknowledging that t needs to be kept within the range of
our observations. The following result provides the required inference for μ̂(t);
Lemma 3.3. For large samples, μ̂(t) can be approximated by a normal distribu-
tion with E μ̂(t) = μ(t) and
Var{μ̂(t)} ≈ Σ_{m≤ℓ} {μ̂(t) − μ̂(am)}² dm / {nm(nm − dm)}.    (3.23)
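A sketch of μ̂(t) computed as the area under the Kaplan-Meier step function over [0, t], for the same small data set, is given below; Equation 3.23 would then supply its variance. The grid construction used here is only one of several possible ways to code the integral.

    import numpy as np

    x     = np.array([2, 4, 7, 9, 13, 16, 22], dtype=float)
    delta = np.array([1, 1, 0, 1, 0, 1, 0])

    def km(u):
        # Kaplan-Meier estimate evaluated just after time u
        s = 1.0
        for xi, di in zip(x, delta):
            if xi <= u and di == 1:
                s *= 1 - 1 / np.sum(x >= xi)
        return s

    def mu_hat(t):
        # area under the step function S_hat on [0, t]
        grid = np.concatenate(([0.0], x[x < t], [t]))
        return sum((b - a) * km(a) for a, b in zip(grid[:-1], grid[1:]))

    print(round(mu_hat(20.0), 2))             # estimated mean lifetime restricted to [0, 20]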
An analogous result applies to the estimated quantile ξ̂p, for which

Var{ξ̂p} ≈ (1 − p)² f^{−2}(ξ̂p) Σ_{m≤ℓ} dm / {nm(nm − dm)}.    (3.24)
It is difficult to use the above result in practice in view of the presence of the
density f (·) in the expression. Smoothing techniques and the methods of density
estimation can be used to make progress here but our recommendation would be
to use a more direct, albeit heavier, approach, constructing intervals based
on sequences of hypothesis tests. In principle at least the programming of these
is straightforward.
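One direct, if computationally heavier, way to implement this suggestion is sketched below: candidate time points are scanned and those at which a pointwise test of H0: S(t) = 1 − p is not rejected are retained. The grid, the level, and the use of the Greenwood standard error as the scale of the test statistic are all choices made only for the illustration.

    import numpy as np
    from scipy.stats import norm

    x     = np.array([2, 4, 7, 9, 13, 16, 22], dtype=float)
    delta = np.array([1, 1, 0, 1, 0, 1, 0])
    p, alpha = 0.5, 0.05                              # target quantile and level
    z = norm.ppf(1 - alpha / 2)

    def km_and_var(t):
        s, gw = 1.0, 0.0
        for xi, di in zip(x, delta):
            if xi <= t and di == 1:
                n = np.sum(x >= xi)
                s *= 1 - 1 / n
                gw += 1 / (n * (n - 1))
        return s, s**2 * gw                           # estimate and Greenwood variance

    kept = []
    for t in np.arange(1.0, 22.0, 0.5):               # candidate values for the median
        s, v = km_and_var(t)
        if v > 0 and abs(s - (1 - p)) <= z * np.sqrt(v):
            kept.append(t)
    print(min(kept), max(kept))                       # retained points form a crude interval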
λ(aℓ)(aℓ − aℓ−1) ≈ P(aℓ−1 < T < aℓ | T > aℓ−1) = 1 − S(aℓ, aℓ−1).

Applying Theorem 3.2 and then, first replacing S(aℓ, aℓ−1) by G(aℓ, aℓ−1), second
replacing G(aℓ, aℓ−1) by Gn(aℓ, aℓ−1), i.e., dℓ/nℓ, we obtain, at t = aj,
Λ̃(aj) = Σ_{ℓ=1}^j dℓ/nℓ as a consistent estimator for Λ(t). The resulting estimator
is called the Nelson-Aalen estimate of survival. Recalling the Taylor series expan-
sion, exp(x) = 1 + x + x2 /2! + · · · , for small values of x we have exp(−x) =
1 − x + O(x2 ), the error of the approximation being strictly less than x2 /2 since
the series is convergent with alternating sign. Applying this approximation to
S̃(t) we recover the Kaplan-Meier estimate described above. In fact we can use
this idea to obtain:
Lemma 3.5. Under the Breslow-Crowley conditions, |S̃(t) − Ŝ(t)| converges
almost surely to zero.
In view of the lemma, large sample results for the Nelson-Aalen estimate
can be deduced from those already obtained for the Kaplan-Meier estimate.
This is the main reason that there is relatively little study of the Nelson-Aalen
estimate in its own right. We can exploit the wealth of results for the Kaplan-
Meier estimate that are already available to us. Indeed, in most practical finite
sample applications, the level of agreement is also very high and the use of
one estimator rather than the other is really more a question of taste than any
theoretical advantage. In some ways the Nelson-Aalen estimate appears very
natural in the survival setting, and it would be nice to see it used more in
practice.
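A direct comparison on the worked example of Table 3.2 is given below; the two estimates track each other closely early on, with the difference only opening up once the risk sets become very small.

    import numpy as np

    x     = np.array([2, 4, 7, 9, 13, 16, 22], dtype=float)
    delta = np.array([1, 1, 0, 1, 0, 1, 0])

    km, cumhaz = 1.0, 0.0
    for xi, di in zip(x, delta):
        if di == 1:
            n = np.sum(x >= xi)
            km *= 1 - 1 / n                   # Kaplan-Meier
            cumhaz += 1 / n                   # Nelson-Aalen cumulative hazard
            print(int(xi), round(km, 3), round(np.exp(-cumhaz), 3))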
to manifest themselves mostly in the tails of the distribution where there may be
few observations. As a goodness of fit tool these procedures are not usually very
powerful.
1. Write down the estimating equations for a Weibull model based on maxi-
mum likelihood. Write down the estimating equations for a Weibull model
based on the mean and variance.
3. For the data of the previous question, calculate and plot the survivorship
function. Calculate an approximate 90% confidence interval for S(4). Do
the same for S(7).
5. Compare the variance expression for S(7) with that approximated by the
binomial formula based on Ŝ(7) and 8 failure times.
6. Take 100 bootstrap samples, fit the Weibull model to each one separately
and estimate S(4) and S(7). Calculate empirical variances based on the
100 sample estimates. How do these compare with those calculated on the
basis of large sample theory?
11. Explain the importance of Theorem 3.2 and how it is used in order to
obtain consistent estimates of survival in the presence of an independent
censoring mechanism.
13. For the 200 observations of the previous question, introduce an independent
censoring mechanism so that approximately half of the observations are
censored. Calculate the logarithm of the Kaplan-Meier and Nelson-Aalen
estimates and plot one against the other. Fit a least squares line to the
plot and comment on the values of the slope.
14. Use the results of Lemma 3.3 to show that μ̂(t) is consistent for μ(t).
15. Show that when there is no censoring, the Greenwood estimate of the vari-
ance of the Kaplan-Meier estimate reduces to the usual variance estimate
for the empirical distribution function. Conclude from this that confidence
intervals based on the Greenwood estimate of variance are only valid at
a single given time point, t, and would not provide bounds for the whole
Kaplan-Meier curve.
16. Carry out a study on the coverage properties based on Ŝ(t), log Ŝ(t) and
log Ŝ(t)/{1 − Ŝ(t)}. Describe what you anticipate to be the relative merits
of the different functions.
18. Following the idea of Malani, suppose, in the presence of dependent cen-
soring, we obtained a Nelson-Aalen estimate of survival for each level of the
covariate. Subsequently, appealing to the law of total probability, we esti-
mate marginal survival by a linear combination of these several estimates.
Comment on such an estimate and contrast it with that of Malani.
19. Recall that the uncensored Kaplan-Meier estimator, i.e., the usual empirical
estimate, is unbiased. This is no longer generally so for the Kaplan-Meier
estimate. Can you construct a situation in which the estimate of Equation
3.17 would exhibit less bias than the Kaplan-Meier estimate?
20. Using data from a cancer registry, show how you could make use of the
piecewise exponential model to obtain conditional survival estimates for
P(T > t + s | T > s).
Lemma 3.1 and Theorem 3.2: There are two possibilities; either the observation xi
corresponds to a failure or it corresponds to a censoring time. For
the first possibility let dxi be an infinitesimally small interval around this point.
The probability that we can associate with this event is Pr(T ∈ dxi, C > xi) =
Pr(T ∈ dxi) × Pr(C > xi), i.e., f(xi; θ)dxi × G(xi; θ). For a censored observation
at time xi (δi = 0) we have Pr(C ∈ dxi, T > xi) = Pr(C ∈ dxi) × Pr(T > xi), i.e.,
g(xi; θ)dxi × S(xi; θ). We can then write the likelihood as

L(θ) = Π_{i=1}^n {f(xi; θ) G(xi; θ)}^{δi} {g(xi; θ) S(xi; θ)}^{1−δi}.

In most cases the further assumption that the censoring itself does not depend
on θ may be reasonable. Taking logs and ignoring constants we obtain the result.
For Theorem 3.2 note that
Pr{Xi ≥ aℓ | Xi > aℓ−1} = Pr{Ti ≥ aℓ | Xi > aℓ−1} × Pr{Ci ≥ aℓ | Xi > aℓ−1}, and
Pr{Ti ≥ aℓ | Xi > aℓ−1} = Pr{Ti ≥ aℓ | Ci > aℓ−1, Ti > aℓ−1} = Pr{Ti ≥ aℓ | Ti > aℓ−1}
by independence.
Theorem 3.3: We have Var{log S(aℓ)} ≈ Σ_{m≤ℓ} Var{log(1 − πm)}. This is
Σ_{m≤ℓ} (1 − πm)^{−2} Var(πm), which we write as Σ_{m≤ℓ} (1 − πm)^{−2} πm(1 − πm)/nm,
where nm corresponds to the number at risk, i.e., the denominator. This we write as
Σ_{m≤ℓ} dm/{nm(nm − dm)}. A further application of the delta method, this time to the
exponential of log S(aℓ), gives

Var{S(aℓ)} ≈ S(aℓ)² Σ_{m≤ℓ} dm/{nm(nm − dm)}.
An approach not using the delta method follows Greenwood (1926). Let t0 <
t1 < · · · < tk. Consider an estimate of the survival probability P of the form
P = p1 × p2 × · · · × pk, where each pi is the estimated probability of survival from
time ti−1 to time ti, qi = 1 − pi, and P is therefore the estimated probability of
survival from t0 to tk. Assuming that the pi's are independent of one another, we
have E(P) = E(p1) × E(p2) × · · · × E(pk), as well as E(P²) = E(p1²) × E(p2²) ×
· · · × E(pk²), and E(pi²) = (Epi)² + σi², where σi² = Var(pi). Then

Var(P) = E(P²) − {E(P)}² = {E(P)}² [{1 + σ1²/(Ep1)²}{1 + σ2²/(Ep2)²} · · · {1 + σk²/(Epk)²} − 1]
       ≈ {E(P)}² {σ1²/(Ep1)² + σ2²/(Ep2)² + · · · + σk²/(Epk)²}.
Proposition 3.1: Take l ∈ {1, . . . , n}. If subject l fails at time t, then we have
that Xl = t and δl = 1, so that, denoting ΔŜ(t) = Ŝ(t) − Ŝ(t−), then

ΔŜ(t) = Π_{i: Xi ≤ t} {1 − δi / Σ_{j=1}^n Yj(Xi)} − Π_{i: Xi ≤ t−} {1 − δi / Σ_{j=1}^n Yj(Xi)}
calculated fitted parametric curves. If, however, the curves are related, then each
estimate provides information not only about its own population curve but also
about the other group’s population curve. The curve estimates would not be inde-
pendent. Exploiting such dependence can lead to considerable gains in our esti-
mating power. The agreement between an approach modeling dependence and
Figure 4.1: Kaplan-Meier survival curves and PH model curves for two groups
defined by a binary covariate. Dashed lines represent PH estimates. The fit is
adequate up to 150 months, after which the fit becomes progressively poorer.
one ignoring it can be more or less strong and, in Figure 4.1, agreement is
good apart from observations beyond 150 months where a proportional hazards
assumption may not hold very well. Returning to the simplest case, we can imag-
ine a compartmental model describing the occurrence of deaths independently
of group status in which all individuals are assumed to have the same hazard
rates. As pointed out in the previous chapter, the main interest then is in the
survival function S(t) when the Z are either unobservable or being ignored. Here
we study the conditional survival function given the covariates Z and we write
this as S(t|Z). In the more complex situations (multicompartment models, time-
dependent Z) it may be difficult, or even impossible, to give an interpretation
to S(t) as an average over conditional distributions, but the idea of condition-
ing is still central although we may not take it beyond that of the probability
of a change of state conditional upon the current state as well as the relevant
covariate history which led to being in that state.
The goal here is to consider models with varying degrees of flexibility applied
to the summary of n subjects each with an associated covariate vector Z of
dimension p. The most flexible models will be able to fully describe any data at
hand but, as a price for their flexibility, offer little reduction in dimension from the
n × p data matrix we begin with. Such models will have small bias in prediction
but large sampling errors. The most rigid models can allow for striking
reductions in dimension. Their consequent impact on prediction will be associated
with much smaller sampling errors. However, as a price for such gains, the biases
in prediction can be large. The models we finally work with will lie between these
two extremes. Their choice then depends on an artful balance between the two
conflicting characteristics. A central task, guided by the principle of parsimony, is
to use as few parameters as possible to achieve whatever purpose we have in mind.
where λ(t|·) is the conditional hazard function, λ0 (t) the baseline hazard
corresponding to Z = 0, and β(t) a time-varying regression effect. Whenever
Z has dimension greater than one we view β(t)Z as an inner product in which
β(t) has the same dimension as Z so that β(t)Z = β1(t)Z1 + · · · + βp(t)Zp.
As long as we do not view Z(t) as random, i.e., the whole time path of Z(t) is
known at t = 0, then a hazard function interpretation for λ(t|Z) is maintained.
Otherwise we lose the hazard function interpretation, since this requires knowl-
edge of the whole function at the origin t = 0, i.e., the function is a deterministic
and not a random one. In some ways this loss is of importance in that the equiv-
alence of the hazard function, the survival function, and the density function
means that we can easily move from one to another. However, when Z(t) is ran-
dom, we can reason in terms of intensity functions and compartmental models,
a structure that enables us to deal with a wide variety of applied problems such
as clinical trials using cross-over designs, studies in HIV that account for var-
ied accumulated treatment histories and involved epidemiological investigations
in which exposure history over time can be complex. The parameter β(t) is of
infinite dimension and therefore the model would not be useful without some
restrictions upon β(t).
Corresponding to the truth or reality under scrutiny, we can view Equation 4.2
as being an extreme point on a large scale which calibrates model complexity.
The opposite extreme point on this scale might have been the simple exponential
model, although we will start with a restriction that is less extreme, specifically
the proportional hazards model in which β(t) = β so that;
Putting restrictions on β(t) can be done in many ways, and the whole art of sta-
tistical modeling, not only for survival data, is in the search for useful restrictions
upon the parameterization of the problem in hand. Our interpretation of the word
“useful” depends very much on the given particular context. Just where different
models find themselves on the infinite scale between Equation 4.3 and Equation
4.2 and how they can be ordered is a very important concept we need to master
if we are to be successful at the modeling process, a process which amounts to
feeling our way up this scale (relaxing constraints) or down this scale (adding
constraints), guided by the various techniques at our disposal. From the outset it
is important to understand that the goal is not one of establishing some unknown
hidden truth. We already have this, expressed via the model described in Equation
4.1. The goal is to find a much smaller, more restrictive model, which, for practical
purposes is close enough or which is good enough to address those questions that
we have in mind; for example, deciding whether or not there is an effect of treat-
ment on survival once we have accounted for known prognostic factors which may
not be equally distributed across the groups we are comparing. For such purposes,
no model to date has seen more use than the Cox regression model (Figure 4.2).
In tackling the problem of subject heterogeneity, the Cox model has enjoyed
outstanding success, a success, it could be claimed, matching that of classic
multilinear regression itself. The model has given rise to considerable theoretical
work and continues to provoke methodological advances. Research and develop-
ment into the model and the model’s offspring have become so extensive that
we cannot here hope to cover the whole field, even at the time of writing. We
aim nonetheless to highlight what seems to be the essential ideas and we begin
with a recollection of the seminal paper of D.R. Cox, presented at a meeting of
the Royal Statistical Society in London, England, March 8, 1972.
Figure 4.2: An illustration of survival curves and associated hazard functions for
a proportional hazards model.
λ(t|Z) = λ0(t) exp(βZ),    (4.4)

where λ0(t) is a fixed “baseline” hazard function, and β is a relative risk param-
eter to be estimated. Whenever Z = 0 has a concrete interpretation (which we
can always obtain by re-coding) then so does the baseline hazard λ0 (t) since, in
this case, λ(t|Z = 0) = λ0 (t). As mentioned just above, when Z is a vector of
covariates, then the model is the same, although with the product of vectors βZ
interpreted as an inner product. It is common to replace the expression βZ by
β′Z or β^T Z where β and Z are p × 1 vectors, and a′b, or a^T b, denotes the inner
product of vectors a and b. Usually, though, we will not distinguish notationally
between the two situations since the former is just a special case of the latter.
We write them both as βZ. Again we can interpret λ0 (t) as being the hazard
corresponding to the group for which the vector Z is identically zero.
The model is described as a multiplicative model, i.e., a model in which factors
related to the survival time have a multiplicative effect on the hazard function.
An illustration in which two binary variables are used to summarize the effects
of four groups is shown in Figure 4.3. As pointed out by Cox, the function (βZ)
can be replaced by any function of β and Z, the positivity of exp(·) guaranteeing
that, for any hazard function λ0 (t), and any Z, we can always maintain a hazard
function interpretation for λ(t|Z). Indeed it is not necessary to restrict ourselves
Figure 4.3: Proportional hazards with two binary covariates indicating 4 groups.
Log-hazard rate written as h(t) = log λ(t).
to exp(·), and we may wish to work with other functions R(·), although care is
required to ensure that R(·) remains positive over the range of values of β and Z
of interest. Figure 4.3 represents the case of two binary covariables indicating four
distinct groups (in the figure we take the logarithm of λ(t)) and the important
thing to observe is that the distance between any two groups on this particular
scale, i.e., in terms of the log-hazards, does not change through time. In view of
the relation between the hazard function and the survival function, there is an
equivalent form of Equation 4.4 in terms of the survival function. Defining S0 (t)
to be the baseline survival function; that is, the survival function corresponding
to S(t|Z = 0), then, for scalar or vector Z, we have that

S(t|Z) = S0(t)^{exp(βZ)}.    (4.5)
When the covariate is a single binary variable indicating, for example, treatment
groups, the model simply says that the survival function of one group is a power
transformation of the other, thereby making an important connection to the class
of Lehmann alternatives (Lehmann et al., 1953).
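The power relationship can be checked numerically for any baseline; the exponential baseline, the value of β, and the time grid below are arbitrary choices made for the illustration.

    import numpy as np

    beta, lam0 = 0.7, 0.05                    # illustrative values
    t = np.linspace(0.0, 40.0, 9)

    s0 = np.exp(-lam0 * t)                              # baseline survival, Z = 0
    s1_hazard = np.exp(-lam0 * np.exp(beta) * t)        # from the hazard form of the model
    s1_power = s0 ** np.exp(beta)                       # from the survival form (4.5)
    print(np.allclose(s1_hazard, s1_power))             # True: the two forms agree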
Cox took the view that “parametrization of the dependence on Z is required so
that our conclusions about that dependence are expressed concisely”, adding that
any choice “needs examination in the light of the data”. “So far as secondary fea-
tures of the system are concerned ... it is sensible to make a minimum of assump-
tions.” This view led to focusing on inference that allowed λ0 (t) to remain arbi-
trary. The resulting procedures are nonparametric with respect to t in that infer-
ence is invariant to any increasing monotonic transformation of t, but parametric
in as much as concerns Z. For this reason the model is often referred to as Cox’s
semi-parametric model. Let’s keep in mind, however, that it is the adopted infer-
ential procedures that are semi-parametric rather than the model itself. Although,
of course, use of the term λ0 (t) in the model, in which λ0 (t) is not specified,
implies the use of procedures that will work for all allowable functions λ0 (t).
Having recalled to the reader how inference could be carried out following
some added assumptions on λ0 (t), the most common assumptions being that
λ0 (t) is constant, that λ0 (t) is a piecewise constant function, or that λ0 (t)
is equal to tγ for some γ, Cox presented his innovatory likelihood expression
for inference, an expression that subsequently became known as a partial likeli-
hood (Cox, 1975). We look more closely at these inferential questions in later
chapters. First note that the quantity λ0 (t) does not appear in the likelihood
expression given by
L(β) = ∏_{i=1}^n [ exp(βZi) / Σ_{j=1}^n Yj(Xi) exp(βZj) ]^{δi} ,    (4.6)
and, in consequence, λ0 (t) can remain arbitrary. Secondly, note that each term in
the product is the conditional probability that at time Xi of an observed failure,
it is precisely individual i who is selected to fail, given all the individuals at risk
and given that one failure would occur. Taking the logarithm of Equation 4.6
and differentiating with respect to β, we obtain an estimating equation which,
upon being set equal to zero, can generally be solved without difficulty using the
Newton-Raphson method to obtain the maximum partial likelihood estimate β̂
of β. We will discuss more deeply the function U (β) under the various approaches
to inference. We can see already that it has the same form as that encountered
in the standard linear regression situation where the observations are contrasted
to some kind of weighted mean. The exact nature of this mean is described later.
Also, even though the expression
U(β) = Σ_{i=1}^n δi { Zi − Σ_{j=1}^n Yj(Xi) Zj exp(βZj) / Σ_{j=1}^n Yj(Xi) exp(βZj) }    (4.7)
looks slightly involved, we might hope that the discrepancies between the Zi and
the weighted mean, clearly some kind of residual, would be uncorrelated, at least
for large samples, since the Zi themselves are uncorrelated.
All of this turns out to be so and makes it relatively easy to carry out appropri-
ate inference. The simplest and most common approach to inference is to treat
β̂ as asymptotically normally distributed with mean β and large sample vari-
ance I(β̂)−1 , where I(β), called the information in view of its connection to the
likelihood, is the second derivative of − log L(β) with respect to β, i.e., letting
Ii(β) = Σ_{j=1}^n Yj(Xi) Zj² exp(βZj) / Σ_{j=1}^n Yj(Xi) exp(βZj) − { Σ_{j=1}^n Yj(Xi) Zj exp(βZj) / Σ_{j=1}^n Yj(Xi) exp(βZj) }² ,    (4.8)
then I(β) = Σ_{i=1}^n δi Ii(β). Inferences can also be based on likelihood ratio meth-
ods. A third possibility, which is sometimes convenient, is to base tests on the
score U (β), which in large samples can be considered to be normally distributed
with mean zero and variance I(β). Multivariate extensions are completely natu-
ral, with the score being a vector and I an information matrix.
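To see these quantities in action, the following minimal sketch in Python (simulated data under assumed parameter values, not an example taken from the text) computes the score U(β) and information I(β) of Equations 4.7 and 4.8 for a single covariate and carries out the Newton-Raphson iteration described above.

import numpy as np

def cox_score_info(beta, X, delta, Z):
    # Score U(beta) and information I(beta) of Equations 4.7 and 4.8 for a
    # single covariate; the inner sums run over the risk set {j : X_j >= X_i}.
    U, I = 0.0, 0.0
    for i in range(len(X)):
        if delta[i] == 0:
            continue
        at_risk = X >= X[i]                              # Y_j(X_i) = 1
        w = np.exp(beta * Z[at_risk])                    # exp(beta * Z_j)
        zbar = np.sum(w * Z[at_risk]) / np.sum(w)        # weighted mean of Z
        z2bar = np.sum(w * Z[at_risk] ** 2) / np.sum(w)  # weighted mean of Z^2
        U += Z[i] - zbar
        I += z2bar - zbar ** 2
    return U, I

# Simulated data: binary covariate, true beta = 0.7, independent censoring.
rng = np.random.default_rng(1)
n = 200
Z = rng.integers(0, 2, n).astype(float)
T = rng.exponential(1.0 / np.exp(0.7 * Z))
C = rng.exponential(1.5, n)
X, delta = np.minimum(T, C), (T <= C).astype(int)

beta = 0.0
for _ in range(5):                                       # Newton-Raphson steps
    U, I = cox_score_info(beta, X, delta, Z)
    beta += U / I
print("beta_hat:", round(beta, 3), " s.e.:", round(1.0 / np.sqrt(I), 3))

The same structure extends directly to vector Z, with U becoming a vector and I a matrix.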
The observed rates and the expected rates are simply summed across the
distinct failure points, each of which gives rise to its own contingency table
where the margins are obtained from the available risk sets at that time. From
the above, if Zi = 1 when subject i is in group A and zero otherwise, then
elementary calculation gives that
U(0) = Σ_{i=1}^n δi {Zi − π(Xi)} ,    I(0) = Σ_{i=1}^n δi π(Xi){1 − π(Xi)},
where π(t) = nA (t)/{nA (t) + nB (t)}. The statistic U then contrasts the
observations with their expectations under the null hypothesis of no effect. This
expectation is simply the probability of choosing, from the subjects at risk, a
subject from group A. The variance expression is the well-known expression for
a Bernoulli variable. Readers interested in a deeper insight into this test should
also consult (Cochran, 1954; Mantel, 1963; Mantel and Haenszel, 1959; Peto
and Peto, 1972). As pointed out by Cox, “whereas the test in the contingency
table situation is, at least in principle, exact, the test here is only asymptotic ...”
This statement is not fully precise since there is still an appeal to the DeMoivre-
Laplace approximation. Nonetheless, we can understand his point.
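By way of illustration, the sketch below (simulated two-group data, not the example discussed by Cox) computes U(0) and I(0) as given above and forms the score statistic U(0)²/I(0), which is the usual two-group log-rank statistic.

import numpy as np

def logrank_score(X, delta, Z):
    # U(0) and I(0) from the expressions above, with Z = 1 for group A and
    # pi(t) = n_A(t) / {n_A(t) + n_B(t)} computed from the risk set at each failure.
    U, I = 0.0, 0.0
    for i in range(len(X)):
        if delta[i] == 0:
            continue
        at_risk = X >= X[i]
        pi = np.mean(Z[at_risk])       # probability of selecting a group A subject
        U += Z[i] - pi                 # observed minus expected
        I += pi * (1.0 - pi)           # Bernoulli variance
    return U, I

# Simulated two-group data with a true treatment effect.
rng = np.random.default_rng(7)
n = 100
Z = np.repeat([1.0, 0.0], n // 2)
T = rng.exponential(np.where(Z == 1, 0.5, 1.0))
C = rng.exponential(1.2, n)
X, delta = np.minimum(T, C), (T <= C).astype(int)

U, I = logrank_score(X, delta, Z)
print("log-rank chi-square:", round(U * U / I, 2))   # refer to chi-square on 1 d.f.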
However, the real advantage of Cox’s approach was that while contributing
significantly toward a deeper understanding of the log-rank and related tests, it
opened up the way for more involved situations; additional covariates, continuous
covariates, random effects, and, perhaps surprisingly, in view of the attribute
“proportional hazards”, a way to tackle problems involving time-varying effects
or time-dependent covariates. Cox illustrated his model via an application to the
now famous Freireich data (Acute Leukemia Group B et al., 1963) describing a
clinical trial in leukemia in which a new treatment was compared to a placebo.
Treating the two groups independently and estimating either survivorship function
using a Kaplan-Meier curve gave good agreement with the survivorship estimates
derived from the Cox model. Such a result can also, of course, be anticipated
by taking a log(− log) transform of the Kaplan-Meier estimates and noting that
they relate to one another via a simple shift. This shift exhibits only the weakest,
if any, dependence on time itself.
Figure 4.4: Kaplan-Meier curves and model-based curves for Freireich data.
Dashed lines represent Model-based estimates; exponential model (left), Cox
model (right).
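The log(− log) check described above is easily carried out. The sketch below uses simulated two-group data in place of the Freireich data (which are not reproduced here), computes the Kaplan-Meier estimate in each group, and evaluates the difference of the log(− log) transforms on a common grid of time points; under proportional hazards this difference is roughly constant.

import numpy as np

def kaplan_meier(X, delta):
    # Kaplan-Meier estimate evaluated at the ordered distinct event times.
    times = np.unique(X[delta == 1])
    S, surv = 1.0, []
    for t in times:
        d = np.sum((X == t) & (delta == 1))   # events at t
        r = np.sum(X >= t)                    # number at risk just before t
        S *= 1.0 - d / r
        surv.append(S)
    return times, np.array(surv)

rng = np.random.default_rng(3)
n = 42
Z = np.repeat([1, 0], n // 2)
T = rng.exponential(np.where(Z == 1, 20.0, 8.0))   # treated group survives longer
C = rng.uniform(5.0, 35.0, n)
X, delta = np.minimum(T, C), (T <= C).astype(int)

t1, S1 = kaplan_meier(X[Z == 1], delta[Z == 1])
t0, S0 = kaplan_meier(X[Z == 0], delta[Z == 0])

grid = np.linspace(max(t1.min(), t0.min()), min(t1.max(), t0.max()), 6)
S1g = np.array([S1[t1 <= g][-1] for g in grid])
S0g = np.array([S0[t0 <= g][-1] for g in grid])
clip = lambda s: np.clip(s, 1e-6, 1 - 1e-6)
shift = np.log(-np.log(clip(S1g))) - np.log(-np.log(clip(S0g)))
print(np.round(shift, 2))    # roughly constant when proportional hazards holds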
Recovering the usual two-group log-rank statistic as a special case of a test
based on model (4.4) is reassuring. In fact, exactly the same approach extends
to the several group comparison (Breslow, 1972). More importantly, Equation
4.4 provides the framework for considering the multivariate problem from its
many angles; global comparisons of course but also more involved conditional
comparisons in which certain effects are controlled for while others are tested.
We look at this in more detail below under the heading “Modeling multivariate
problems”. The partially proportional hazards model (in particular the stratified
model) was to appear later than Cox’s original work of 1972 and provide great
flexibility. In the published discussion of Cox’s paper, Richard Peto referred
to some of his own work with Julian Peto. Their work demonstrated the asymp-
totic efficiency of the log-rank test and that, for the two-group problem and for
Lehmann alternatives, this test was locally most powerful. Since the log-rank
test coincides with a score test based on Cox’s likelihood, Peto argued that Cox’s
method necessarily inherits the same properties.
Professor Bartholomew of the University of Kent considered a lognormal
model in current use and postulated its extension to the regression situation by
writing down the likelihood. Such an analysis, being fully parametric, represents
an alternative approach since the structure is not nested in a proportional hazards
one. Bartholomew made an insightful observation that allowing for some depen-
dence of the explanatory variable Z on t can enable the lognormal model and
a proportional hazards model to better approximate each other. This is indeed
true and allows for a whole development of a class of non-proportional hazards
models where Z is a function of time and within which the proportional hazards
model arises as a special case.
Professors Oakes and Breslow discussed the equivalence between a saturated
piecewise exponential model and the proportional hazards model. By a satu-
rated piecewise exponential model we mean one allowing for constant hazard
rates between adjacent failures. The model is data dependent in that it does not
specify in advance time regions of constant hazard but will allow these to be
determined by the observed failures. From an inferential standpoint, in particular
making use of likelihood theory, we may expect to run into some difficulties.
This is because the number of parameters of the model (number of constant
hazard rates) increases at the same rate as the effective sample size (number of
observed failure times). However, the approach does nonetheless work, although
justification requires the use of techniques other than standard likelihood. Simple
estimates of the hazard rate, the cumulative hazard rate, and the survivorship
function are then available. When β = 0 the estimate of the cumulative hazard
rate coincides with that of Nelson (1969).
Professor Lindley of University College London writes down the full likelihood
which involves λ0 (t) and points out that, since terms involving λ0 (t) do not factor
out, we cannot justify Cox’s conditional likelihood. If we take λ0 (t) as an unknown
nuisance parameter having some prior distribution, then we can integrate the full
likelihood with respect to this in order to obtain a marginal likelihood (this would
be different from the marginal likelihood of ranks studied later by Kalbfleisch and
Prentice (1973)). Lindley argues that the impact of censoring is greater for the
Cox likelihood than for this likelihood which is then to be preferred. The author
of this text confesses to not fully understanding Lindley’s argument and there
is some slight confusion there since, either due to a typo or to a subtlety that
escapes me, Lindley calls the Cox likelihood a “marginal likelihood” and what I
am referring to as a marginal likelihood, an “integrated likelihood”. We do, of
course, integrate a full likelihood to obtain a marginal likelihood, but it seems
as though Professor Lindley was making other, finer, distinctions which are best
understood by those in the Bayesian school. His concern on the impact of cen-
In Feigl and Zelen the model was not written exactly this way but was instead
expressed as λ = α + βZ. However, since λ is constant, the two expressions are
equivalent and highlight the link to Cox’s more general formulation. Feigl and
Zelen only considered the case of uncensored data. Zippin and Armitage (1966)
used a modeling approach, essentially the same as that of Feigl and Zelen,
although allowing for censored observations.
4.6 Modeling multivariate problems
The strength of the Cox model lies in its ability to describe and characterize
involved multivariate situations. Crucial issues concern the adequacy of fit of
the model, how to make predictions based on the model, and how strong is the
model’s predictive capability. These are considered in detail later. Here, in the
following sections and in the chapter on inference we consider how the model
can be used as a tool to formulate questions of interest to us in the multivari-
ate setting. The simplest case is that of a single binary covariate Z taking the
values zero and one. The zero might indicate a group of patients undergoing a
standard therapy, whereas the group for which Z = 1 could be undergoing some
experimental therapy. Model 4.4 then indicates the hazard rate for the standard
group to be λ0 (t) and for the experimental group to be λ0 (t) exp(β). Testing
whether or not the new therapy has any effect on survival translates as testing
the hypothesis H0 : β = 0. If β is less than zero then the hazard rate for the
experimental therapy is less than that for the standard therapy at all times and
is such that the arithmetic difference between the respective logarithms of the
hazards is of magnitude β. Suppose the problem is slightly more complex and we
have two new experimental therapies, indicated by binary covariates Z1 and Z2 . We can write:

λ(t|Z1 , Z2 ) = λ0 (t) exp(β1 Z1 + β2 Z2 )

and obtain Table 4.2. As we shall see the two covariate problem is very much more
complex than the case of a single covariate. Not only do we need to consider the
effect of each individual treatment on the hazard rate for the standard therapy
but we also need to consider the effect of each treatment in the presence or
absence of the other as well as the combined effect of both treatments together.
The particular model form in which we express any relationships will typically
imply assumptions on those relationships and an important task is to bring under
scrutiny (goodness of fit) the soundness of any assumptions.
It is also worth noting that if we are to assume that a two-dimensional covari-
ate proportional hazards model holds exactly, then, integrating over one of the
covariates to obtain a one-dimensional model will not result (apart from in very
particular circumstances) in a lower-dimensional proportional hazards model. The
lower-dimensional model would be in a much more involved non-proportional
hazards form. This observation also holds when adding a covariate to a one-
dimensional proportional hazards model, a finding that compels us, in realistic
modeling situations, to only ever consider the model as an approximation.
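This lack of closure under marginalization is easily seen numerically. The sketch below (assumed parameter values, chosen only for illustration) takes a two-covariate proportional hazards model with a constant baseline hazard, integrates out an independent binary Z2, and evaluates the hazard ratio for Z1 in the resulting one-dimensional model; it is no longer constant at exp(β1).

import numpy as np

beta1, beta2, lam0 = 1.5, 1.5, 1.0      # assumed values, for illustration only

def marginal_surv(t, z1):
    # Survival given Z1 after integrating out Z2 ~ Bernoulli(0.5), under
    # lambda(t | Z1, Z2) = lam0 * exp(beta1*Z1 + beta2*Z2).
    h0 = lam0 * np.exp(beta1 * z1)             # stratum Z2 = 0
    h1 = lam0 * np.exp(beta1 * z1 + beta2)     # stratum Z2 = 1
    return 0.5 * np.exp(-h0 * t) + 0.5 * np.exp(-h1 * t)

def marginal_hazard(t, z1, eps=1e-5):
    # Numerical derivative of -log S(t | Z1) for the marginalized model.
    return (np.log(marginal_surv(t, z1)) - np.log(marginal_surv(t + eps, z1))) / eps

for t in (0.1, 0.5, 1.0, 2.0):
    hr = marginal_hazard(t, 1) / marginal_hazard(t, 0)
    print(f"t = {t}:  marginal hazard ratio = {hr:.2f}   (exp(beta1) = {np.exp(beta1):.2f})")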
By extension the case of several covariates becomes rapidly very complicated.
If, informally, we were to define complexity as the number of things you have to
worry about, then we could, even more informally, state an important theorem.
Obviously such a theorem cannot hold in any precise mathematical sense without
the need to add conditions and restrictions such that its simple take-home mes-
sage would be lost. For instance, if each added covariate was a simple constant
multiple of the previous one, then there would really be no added complexity.
But, in some broad sense, the theorem does hold and to convince ourselves of
this we can return to the case of two covariates. Simple combinatorial arguments
show that the number of possible hypotheses of potential interest is increasing
exponentially. But it is more complex than that. Suppose we test the hypothesis
H0 : β1 = β2 = 0. This translates the clinical null hypothesis that neither of the exper-
imental therapies impacts survival, against the alternative H1 : βi ≠ 0 for at least one of i = 1, 2.
This is almost, yet not exactly, the same as simply regrouping the two experi-
mental treatments together and reformulating the problem in terms of a single
binary variable.
Note that fitting the above models requires no new procedures or software since
both cases come under the standard heading. In the first equation
all we do is write α1 = β1 and α2 = β1 + β2 . In the second we simply redefine
the covariates themselves. The equivalence expressed in the above equation is
important. It implies two things. Firstly, that this previous question concerning
differential treatment effects can be re-expressed in a standard way enabling us
to use existing structures, and computer programs. Secondly, since the effects in
our models express themselves via products of the form βZ, any re-coding of β
can be artificially carried out by re-coding Z and vice versa. This turns out to be
an important property and anticipates the fact that a non-proportional hazards
model β(t)Z can be re-expressed as a time-dependent proportional hazards model
βZ(t). Hence the very broad sweep of proportional hazards models.
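The interchangeability of re-coding β and re-coding Z amounts to no more than a linear reparameterization of the linear predictor. The sketch below (an arbitrary coding matrix chosen to mirror the α1 = β1, α2 = β1 + β2 example; the covariate values are hypothetical) verifies the equivalence numerically.

import numpy as np

rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(8, 2)).astype(float)   # hypothetical binary Z1, Z2

beta = np.array([0.4, 0.9])
B = np.array([[1.0, 0.0],
              [1.0, 1.0]])          # alpha = B @ beta: alpha1 = beta1, alpha2 = beta1 + beta2
alpha = B @ beta

Z_recoded = Z @ np.linalg.inv(B)    # equivalently, re-code the covariates instead
print(np.allclose(Z @ beta, Z_recoded @ alpha))   # True: identical linear predictors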
of these binary coding variables, noting that, as before, there are different ways
of expressing this. In standard form we write
so that the hazard rate for those exposed to the risk factor at level i, i = 1, . . . , 4,
is given by λ0 (t) exp(βi ) where we take β0 = 0. Our interest may be more on
the incremental nature of the risk as we increase through the levels of exposure
to the risk factor. The above model can be written equivalently as
The cost, however, is much less so, and is investigated more thoroughly in the
chapters on prediction (explained variation, explained randomness) and goodness
of fit. If the fit is good, i.e., the assumed linearity is reasonable, then we would
certainly prefer the latter model to the former. If we are unsure we may prefer
to make less assumptions and use the extra flexibility afforded by a model which
includes three binary covariates rather than a single linear covariate. In real data
analytic situations we are likely to find ourselves somewhere between the two,
using the tools of fit and predictability to guide us.
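The choice just described can be written out explicitly. The sketch below (hypothetical exposure levels) constructs the two design matrices and confirms that the single linear covariate is the special case of the three binary covariates in which the coefficients are constrained to be β, 2β and 3β.

import numpy as np

levels = np.array([0, 1, 2, 3, 2, 1, 0, 3])     # hypothetical ordered exposure levels

# Flexible coding: three binary covariates indicating levels 1, 2 and 3.
Z_binary = np.column_stack([(levels == k).astype(float) for k in (1, 2, 3)])

# Parsimonious coding: a single covariate taking the values 0, 1, 2, 3.
Z_linear = levels.astype(float).reshape(-1, 1)

beta = 0.4
constrained = np.array([1.0, 2.0, 3.0]) * beta        # beta_k = k * beta
print(np.allclose(Z_binary @ constrained, (Z_linear * beta).ravel()))   # True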
Returning once more to Table 4.4 we can see that the same idea prevails for
the βi not all assuming the same values. A situation in which four ordered levels
are described by three binary covariates could be recoded so that we only have a
single covariate Z, together with a single coefficient β. Next, suppose that in the
model, λ(t|Z) = λ0 (t) exp(βZ), Z not only takes the ordered values, 0, 1, 2 and
3 but also all of those in between. In a clinical study this might correspond to
some prognostic indicator, such as blood pressure or blood cholesterol, recorded
continuously and re-scaled to lie between 0 and 3.
Including the value of Z, as a continuous covariate, in the model amounts to
making very strong assumptions. It supposes that the log hazard increases by the
same amount for every given increase in Z, so that the relative risk associated
with Δ = z2 − z1 is the same for all values of z1 between 0 and 3 − Δ. Let’s make
things a little more involved. Suppose we have the same continuous covariate,
this time let’s call it Z1 , together with a single binary covariate Z2 indicating
one of two groups. We can write
supposes that we know the functional form of the relative risk, at least up to the
constant multiple β. Then, a power series approximation to this would allow us to
write ψ(Z) = Σ_j βj Z^j in which any constant term β0 is absorbed into λ0 (t). We
then introduce the covariates Zj = Z^j to bring the model into its standard form.

4.7 Classwork and homework
1. One of the early points of discussion on Cox’s 1972 paper was how to deal
with tied data. Look up the Cox paper and write down the various different
ways that Cox and the contributors to the discussion suggested that tied data
be handled. Explain the advantages and disadvantages of each approach.
2. One suggestion for dealing with tied data, not in that discussion, is to simply
break the ties via some random split mechanism. What are the advantages
and drawbacks of such an approach?
4. Show that the relation; S(t|Z) = {S0 (t)}exp(βZ) implies the Cox model and
vice versa.
5. Suppose that we have two groups and that a proportional hazards model
is believed to apply. Suppose also that we know for one of the groups that
the hazard rate is a linear function of time, and equal to zero at the origin.
Given data from such a situation, suggest different ways in which it can
be analyzed and the possible advantages and disadvantages of the various
approaches.
6. Explain in what sense the components of Equation 4.7 and equation (4.8)
can be viewed as an equation for the mean and an equation for the variance.
7. Using equations (4.7) and (4.8) work out the calculations explicitly for the
two-group case, i.e., the case in which there are n1 (t) subjects at risk from
group 1 at time t and n2 (t) from group 2.
10. Consider an experiment in which there are eight levels of treatment. The
levels are ordered. The null hypothesis is that there is no treatment effect.
The alternative is that there exists a non-null effect increasing with level
until it reaches one of the levels, say level j, after which the remaining levels
all have the same effect as level j. How would you test for this?
11. Write down the joint likelihood for the underlying hazard rate and the regres-
sion parameter β for the two-group case in which we assume the saturated
piecewise exponential model. Use this likelihood to recover the partial like-
lihood estimate for β. Obtain an estimate of the survivorship function for
both groups.
12. For the previous question derive an approximate large sample confidence
interval for the estimate of the survivorship function for both groups in
cases: (i) where the parameter β is exactly known, (ii) where the parameter
is replaced by an estimate with approximate large sample variance σ 2 .
13. Carry out a large sample simulation for a model with two binary variables.
Each study is balanced with a total of 100 subjects. Choose β1 = β2 = 1.5
and simulate binary Z1 and Z2 to be uncorrelated. Show the distribution of
β̂1 in two cases: (i) where the model used includes Z2 , (ii) where the model
used includes only Z1 . Comment on the distributions, in particular the mean
value of β̂1 in either case.
14. In the previous exercise, rather than include in the model Z2 , use Z2 as a
variable of stratification. Repeat the simulation in this case for the stratified
model. Comment on your findings.
15. Consider the following regression situation. We have one-dimensional covari-
ates Z, sampled from a density g(z). Given z we have a proportional hazards
model for the hazard rates. Suppose that, in addition, we are
in a position
to know exactly the marginal survivorship function S(t) = S(t|z)g(z)dz.
How can we use this information to obtain a more precise analysis of data
generated under the PH model with Z randomly sampled from g(z)?
16. Suppose we have two groups defined by the indicator variable Z = {0, 1}.
In this example, unlike the previous in which we know the marginal survival,
we know the survivorship function S0 (t) for one of the groups. How can this
information be incorporated into a two-group comparison in which survival
for both groups is described by a proportional hazards model? Use a likelihood
approach.
17. Use known results for the exponential regression model in order to construct
an alternative analysis to that of the previous question based upon likelihood.
18. A simple test in the two-group case for absence of effects is to calculate the
area between the two empirical survival curves. We can evaluate the null
19. Carry out a study of the advantages, drawbacks, and potentially restrictive
assumptions of the test of the previous example. How does this test compare
with the score test based on the proportional hazards model?
20. Obtain a plot of the likelihood function for the Freireich data. Using simple
numerical integration routines, standardize the area under the curve to be
equal to one.
21. For the previous question, treat the curve as a density. Use the mean as
an estimate of the unknown β. Use the upper and lower 2.5% percentiles
as limits to a 95% confidence interval. Compare these results with those
obtained using large sample theory.
Chapter 5. Proportional hazards models in epidemiology

The basic questions of epidemiology are reconsidered in this chapter from the
standpoint of a survival model. We rework the calculations of relative risk, where
the time factor is now age, and we see how our survival models can be used to con-
trol for the effects of age. Series of 2×2 tables, familiar to epidemiologists, can be
structured within the regression model setting. The well-known Mantel-Haenszel
test arises as a model-based score test. Logistic regression, conditional logistic
regression as well as stratified regression are all considered. These various mod-
els, simple proportional hazards model, stratified models, and time-dependent
models can all be exploited in order to better evaluate risk factors, how they
interrelate, and how they relate to disease incidence in various situations. The
use of registry data is looked at in relation to the estimation of survival in specific
risk sub-groups. This motivates the topic of relative survival.
can be used to describe states. Multistate models in which subjects can move in
and out of different states, or into an absorbing state such as death, can then
be analyzed using the same methodology.
For arbitrary random variables X and Y with joint density f (x, y), conditional
densities g(x|y) and h(y|x), and marginal densities v(x) and w(y), we know that

f(x, y) = g(x|y)w(y) = h(y|x)v(x),

so that, in the context of postulating a model for the pair (X, Y ), we see that
there are two natural potential characterizations. Recalling the discussion from
Section 2.3 note that, for survival studies, our interest in the binary pair (T, Z),
time and covariate, can be seen equivalently from the viewpoint of the conditional
distribution of time given the covariate, along with the marginal distribution of
the covariate, or from the viewpoint of the conditional distribution of the covari-
ate given time, along with the marginal distribution of time. This equivalence we
exploit in setting up inference where, even though the physical problem concerns
time given the covariate, our analysis describes the distribution of the covariate
given time.
In epidemiological studies the variable time T is typically taken to be age.
Calendar time and time elapsed from some origin may also be used but, mostly,
the purpose is to control for age in any comparisons we wish to make. Usually we
will consider rates of incidence of some disease within small age groups or possibly,
via the use of models, for a large range of values of age. Unlike the relatively
artificial construction of survival analysis which exploits the equivalent ways of
expressing joint distributions, in epidemiological studies our interest naturally
falls on the rates of incidence for different values of Z given fixed values of age
T . It is not then surprising that the estimating equations we work with turn out
to be essentially the same for the two situations.
The main results of proportional hazards regression, focused on the condi-
tional distribution of the covariable given time, rather than the other way around,
apply more immediately and in a more natural way in epidemiology than in sur-
vival type studies. We return to this in the later chapters that consider inference
more closely. One important distinction, although already well catered for by use
of our “at risk” indicator variables, is that for epidemiological studies the subjects
in different risk sets are often distinct subjects. This is unlike the situation for
survival studies where the risk sets are typically nested. Even so, as we will see,
the form of the equations is the same, and software which allows an analysis of
survival data will also allow an analysis of certain problems in epidemiology.
In the above and in what follows, in order for the notation not to become too
cluttered, we write Pr (A) = P (A). Under a “rare disease assumption”, i.e., when
P (Y = 0|Z = 0) and P (Y = 0|Z = 1) are close to 1, then the odds ratio and
relative risk approximate one another.
One reason for being interested in the odds ratio, as a measure of the impact
of different levels of the covariate (risk factor) Z, follows from the identity

ψ = {P(Y = 1|Z = 1)/P(Y = 0|Z = 1)} / {P(Y = 1|Z = 0)/P(Y = 0|Z = 0)}
  = {P(Z = 1|Y = 1)/P(Z = 0|Y = 1)} / {P(Z = 1|Y = 0)/P(Z = 0|Y = 0)}.

Thus, the impact of different levels of the risk factor Z can equally well be
estimated by studying groups defined on the basis of this same risk factor and
their corresponding incidence rates of Y = 1. This provides the rationale for the
case-control study in which, in order to estimate ψ, we make our observations on
Z over fixed groups of cases and controls (distribution of Y fixed), rather than
the more natural, but practically difficult if not impossible, approach of making
our observations on Y for a fixed distribution of Z. Assumptions and various
subtleties are involved. The subject is vast and we will not dig too deeply into
this. The points we wish to underline in this section are those that establish the
link between epidemiological modeling and proportional hazards regression.
Series of 2 × 2 tables
The most elementary presentation of data arising from either a prospective study
(distribution of Z fixed) or a case-control study (distribution of Y fixed) is in
the form of a 2 × 2 contingency table in which the counts of the number of
observations are expressed. Estimated probabilities or proportions of interest are
readily calculated.
In Table 5.1, a1∗ = a11 + a12 , a2∗ = a21 + a22 , a∗1 = a11 + a21 , a∗2 =
a12 + a22 and a∗∗ = a1∗ + a2∗ = a∗1 + a∗2 . For prospective studies the pro-
portions a11 /a∗1 and a12 /a∗2 estimate the probabilities of being a case (Y = 1)
for both exposure groups while, for case-control studies, the proportions a11 /a1∗
and a21 /a2∗ estimate the probabilities of exhibiting the risk or exposure factor
Table 5.1: Observed counts from a prospective or case-control study.

           Z = 1    Z = 0    Totals
Y = 1      a11      a12      a1∗
Y = 0      a21      a22      a2∗
Totals     a∗1      a∗2      a∗∗
(Z = 1) for both cases and controls. For both types of studies we can esti-
mate ψ by the ratio (a11 a22 )/(a21 a12 ), which is also the numerator of the usual
chi-squared test for equality of the two probabilities. If we reject the null hypoth-
esis of the equality of the two probabilities we may wish to say something about
how different they are based on the data from the table.
As explained below, in Section 5.4, quantifying the difference between two
proportions is not best done via the most obvious, and simple, arithmetic differ-
ence. There is room for more than one approach, the simple arithmetic difference
being perfectly acceptable when sample sizes are large enough to be able to use
the De Moivre-Laplace approximation (Appendix C.2) but, more generally, the
most logical in our context is to express everything in terms of the odds ratio.
We can then exploit the following theorem:
Theorem 5.1. Taking all the marginal totals as fixed, the conditional distri-
bution of a11 is written
P(a11 = a | a1∗ , a2∗ , a∗1 , a∗2 ) = (a1∗ choose a)(a2∗ choose a∗1 − a) ψ^a / Σ_u (a1∗ choose u)(a2∗ choose a∗1 − u) ψ^u ,
the sum over u being over all integers compatible with the marginal totals.
The conditionality principle appears once more, in this instance in the form of
fixed margins. The appropriateness of such conditioning, as in other cases, can
be open to discussion. And again, insightful conditioning has greatly simplified
the inferential structure. Following conditioning of the margins, it is only nec-
essary to study the distribution of any single entry in the 2 × 2 table, the other
entries being then determined. This kind of approach forms the basis of the well-
known Fisher’s exact test. It is usual to study the distribution of a11 . A non-linear
estimating equation can be based on a11 − E(a11 ), expectation obtained from
Theorem 5.1, and from which we can estimate ψ and associate a variance term
with the estimator. The non-linearity of the estimating equation, the only approx-
imate normality of the estimator, and the involved form of variance expressions
has led to much work in the methodological epidemiology literature; improving
the approximations, obtaining greater robustness and so on. However, all of this
can be dealt with in the context of a proportional hazards (conditional logistic)
Table i     Z = 1      Z = 0      Totals
Y = 1       a11(i)     a12(i)     a1∗(i)
Y = 0       a21(i)     a22(i)     a2∗(i)
Totals      a∗1(i)     a∗2(i)     a∗∗(i)

Table 5.2: 2 × 2 table for ith age group of cases and controls.
regression model. Since it would seem more satisfactory to work with a single
structure rather than deal with problems on a case-by-case basis, our recom-
mendation is to work with proportional and non-proportional hazards models.
Not only does a model enable us to more succinctly express the several assump-
tions which we may be making, it offers, more readily, well-established ways of
investigating the validity of any such assumptions. In addition the framework for
studying questions such as explained variation, explained randomness and partial
measures of these is clear and requires no new work.
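For completeness, the conditional distribution of Theorem 5.1 is straightforward to compute, and the estimating equation a11 − E(a11) = 0 described above can be solved numerically. The sketch below uses made-up counts, not data from any study.

import numpy as np
from scipy.special import comb
from scipy.optimize import brentq

def conditional_dist(psi, a1s, a2s, as1):
    # Distribution of a11 given the margins (Theorem 5.1): the probability of
    # a11 = a is proportional to (a1* choose a)(a2* choose a*1 - a) psi^a.
    lo, hi = max(0, as1 - a2s), min(a1s, as1)
    a = np.arange(lo, hi + 1)
    w = comb(a1s, a) * comb(a2s, as1 - a) * psi ** a
    return a, w / w.sum()

def expected_a11(psi, a1s, a2s, as1):
    a, p = conditional_dist(psi, a1s, a2s, as1)
    return float(np.sum(a * p))

a11, a12, a21, a22 = 20, 10, 12, 18          # made-up 2 x 2 counts
a1s, a2s, as1 = a11 + a12, a21 + a22, a11 + a21

psi_hat = brentq(lambda p: a11 - expected_a11(p, a1s, a2s, as1), 1e-6, 100.0)
print("conditional estimate of psi:", round(psi_hat, 2))
print("simple cross-product ratio :", round(a11 * a22 / (a12 * a21), 2))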
The “rare disease” assumption, allowing the odds ratio and relative risk to
approximate one another, is not necessary in general. However, the assumption
can be made to hold quite easily and is therefore not restrictive. To do this
we construct fine strata, within which the probabilities P (Y = 0|Z = 0) and
P (Y = 0|Z = 1) can be taken to be close to 1. For each stratum, or table,
we have a 2 × 2 table as in Table 5.2, indexed by i. Each table provides an
estimate of relative risk at that stratum level and, assuming that the relative
risk itself does not depend upon this stratum, although the actual probabilities
themselves composing the relative risk definition may themselves depend upon
strata, then the problem is putting all these estimates of the same thing into
a single expression. The most common such expression for this purpose is the
Mantel-Haenszel estimate of relative risk.
Table 5.3: 2 × 2 table for ith age group of cases and controls. Left-hand table:
observed counts. Right-hand table: expected counts.
If we first define for the ith sub-table Ri = a11(i)a22(i)/a∗∗(i) and
Si = a12(i)a21(i)/a∗∗(i), then the Mantel-Haenszel summary relative risk
estimate across the tables is given by ψ̂MH = Σi Ri / Σi Si. Breslow (1996)
makes the following useful observations concerning ψ̂MH and β̂MH = log ψ̂MH.
First, E(Ri) = ψi E(Si) where the true odds ratio in the ith table is given
by ψi. When all of these odds ratios coincide, ψ̂MH is the solution to the
unbiased estimating equation R − ψS = 0, where R = Σi Ri and S = Σi Si.
E {[a11 (i)a22 (i) + ψa12 (i)a21 (i)] [a11 (i) + a22 (i) + ψ (a12 (i) + a21 (i))]} ,
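A minimal sketch of the Mantel-Haenszel computation just described, using made-up stratified counts rather than the data of any study:

import numpy as np

# Made-up series of 2 x 2 tables, one row per age stratum, in the order
# (a11, a12, a21, a22), with Z in columns and Y in rows as in Table 5.2.
tables = np.array([
    [12, 30,  6, 40],
    [20, 55, 11, 70],
    [ 9, 25,  5, 33],
], dtype=float)

a11, a12, a21, a22 = tables.T
a_ss = tables.sum(axis=1)            # a**(i), the stratum totals

R = a11 * a22 / a_ss                 # R_i = a11(i) a22(i) / a**(i)
S = a12 * a21 / a_ss                 # S_i = a12(i) a21(i) / a**(i)
psi_mh = R.sum() / S.sum()           # Mantel-Haenszel summary estimate
print("psi_MH  =", round(psi_mh, 2))
print("beta_MH =", round(float(np.log(psi_mh)), 2))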
Without any loss in generality we can express the two probabilities of interest,
P (Y = 1|Z = 1) and P (Y = 1|Z = 0) as simple power transforms of one another.
This follows, since, whatever the true values of these probabilities, there exists
some positive number α such that P (Y = 1|Z = 1) = P (Y = 1|Z = 0)^α . The
parameter α is constrained to be positive in order that the probabilities themselves
remain between 0 and 1. To eliminate any potential dangers that may arise,
particularly in the estimation context where, even though the true value of α
is positive, the estimate itself may not be, a good strategy is to re-express this
parameter as α = exp(β). We then have

log{− log P (Y = 1|Z = 1)} = β + log{− log P (Y = 1|Z = 0)}.

The parameter β can then be interpreted as a linear shift in the log-log trans-
formation of the probabilities, and can take any value between −∞ and ∞, the
inverse transformations being one-to-one and guaranteed to lie in the interval
(0,1). An alternative model to the above is

P (Y = 1|Z = 1)/{1 − P (Y = 1|Z = 1)} = exp(β) P (Y = 1|Z = 0)/{1 − P (Y = 1|Z = 0)}.

Since the groups are indicated by a binary Z, we can exploit this in order to
obtain the more concise notation, now common for such models, whereby

logit P (Y = 1|Z) = β0 + βZ,    (5.6)

where β0 = logit P (Y = 1|Z = 0). Maintaining an analogy with the usual linear
model we can interpret β0 as an intercept, simply a function of the risk for a
“baseline” group defined by Z = 0.
Assigning the value Z = 0 to some group and thereby giving that group
baseline status is, naturally, quite arbitrary and there is nothing special about
the baseline group apart from the fact that we define it as such. We are at lib-
erty to make other choices and, in all events, the only quantities of real interest to
us are relative ones. In giving thought to the different modeling possibilities that
arise when dealing with a multivariate Z, the exact same kind of considerations,
already described via several tables in the section on modeling multivariate prob-
lems will guide us (see Section 4.6 and those immediately following it). Rather
than repeat or reformulate those ideas again here, the reader, interested in these
aspects of epidemiological modeling, is advised to go over those earlier sections.
Indeed, without a solid understanding as to why we choose to work with a par-
ticular model rather than another, and as to what the different models imply
concerning the complex inter-relationships between the underlying probabilities,
it is not really possible to carry out successful modeling in epidemiology.
P (Y = 1|Z, S) / {1 − P (Y = 1|Z, S)} = exp(β0 + βZ),    (5.7)
where, in the same way as before, β0 = logit P (Y = 1|Z = 0, S). The important
aspect of a stratified model is that the levels of S only appear on the left-hand
side of the equation.
We might conclude that this is the same model as the previous one but it is
not quite and, in later discussions on inference, we see that it does impact the
way in which inferences are made. It also impacts interpretation. In the simpler
cases, in as far as β is concerned, the stratified model is exactly equivalent to a
regular logistic model if we include in the regression function indicator variables,
of dimension one less than the number of strata. However, when the number of
strata is large, the use of the stratified model enables us to bypass estimation of
the stratum-level effects. If these are not of real interest then this may be useful in
that it can result in gains in estimating efficiency even though the underlying mod-
els may be equivalent. In a rough intuitive sense we are spending the available esti-
mating power on the estimation of far fewer parameters, thereby increasing the
precision of each one. This underlines an important point in that the question of
stratification is more to do with inference than the setting up of the model itself.
This last remark is even more true when we speak of conditional logistic
regression. The model will look almost the same as the unconditional one but
the process of inference will be quite different. Suppose we have a large number
of strata, very often in this context defined by age. A full model would be as in
Equation 5.6, including in addition to the risk factor vector Z, a vector parameter
of indicator variables of dimension one less than the number of strata. Within
each age group, for the sake of argument let’s say age group i, we have the simple
logistic model. However, rather than write down the likelihood in terms of the
products P (Y = 1|Z) and P (Y = 0|Z) we consider a different probability upon
which to construct the likelihood, namely the probability that the event of interest
(the outcome, or case) occurred on the very individual for whom the event did
occur, given that exactly one event occurred among the set S{i} of the a∗∗ (i)
cases and controls. Denoting by Zi the risk factor for the case corresponding to
the age group i, this probability is simply exp(βZi )/ Σ_j I[j ∈ S{i}] exp(βZj ).
The likelihood is then the product of such
terms across the number of different age groups for which a case was selected.
If we carefully define the “at-risk” indicator Y (t) where t now represents age,
we can write the conditional likelihood as
L(β) = ∏_{i=1}^n [ exp(βZi) / Σ_{j=1}^n Yj(Xi) exp(βZj) ]^{δi} .    (5.8)
Here we take the at-risk indicator function to be zero unless, for the subject j,
Xj has the same age, or is among the same age group as that given by Xi .
In this case the at-risk indicator Yj (Xi ) takes the value one. To begin with, we
assume that there is only a single case per age group, that the ages are distinct
between age groups, and that, for individual i, the indicator δi takes the value
one if this individual is a case. Use of the δi would enable us to include in an
analysis sets of controls for which there was no case. This would be of no value
in the simplest case but, generalizing the ideas along exactly the same lines as
for standard proportional hazards models, we could easily work with indicators
Y (t) taking the value one for all values less than t and becoming zero if the
subject becomes incident or is removed from the study. A subject is then able
to make contributions to the likelihood at different values of t, i.e., at different
ages, and appears therefore in different sets of controls. Indeed, the use of the
risk indicator Y (t) can be generalized readily to other complex situations.
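To fix ideas, the sketch below evaluates the conditional likelihood of Equation 5.8 and maximizes it by Newton-Raphson for made-up matched data with one case and several controls per age group; the group-level sums are exactly the risk-set sums of the survival setting.

import numpy as np

def cond_logistic_score(beta, strata, case, Z):
    # Score and information for Equation 5.8: within each age group the
    # contribution is Z for the case minus the exp(beta*Z)-weighted mean of Z
    # over that group's cases and controls.
    U, I = 0.0, 0.0
    for s in np.unique(strata):
        in_s = strata == s
        w = np.exp(beta * Z[in_s])
        zbar = np.sum(w * Z[in_s]) / np.sum(w)
        z2bar = np.sum(w * Z[in_s] ** 2) / np.sum(w)
        U += Z[in_s][case[in_s] == 1][0] - zbar
        I += z2bar - zbar ** 2
    return U, I

# Made-up data: 30 age groups, 1 case and 4 controls per group, binary exposure
# more frequent among the cases.
rng = np.random.default_rng(5)
strata = np.repeat(np.arange(30), 5)
case = np.tile([1, 0, 0, 0, 0], 30)
Z = np.where(case == 1, rng.random(150) < 0.5, rng.random(150) < 0.25).astype(float)

beta = 0.0
for _ in range(6):                   # Newton-Raphson
    U, I = cond_logistic_score(beta, strata, case, Z)
    beta += U / I
print("beta_hat:", round(beta, 3), " s.e.:", round(1.0 / np.sqrt(I), 3))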
One example is to allow it to depend on two time variables, for example, an
age and a cohort effect, denoting this as Y (t, u). Comparisons are then made
between individuals having the same age and cohort status. Another useful gen-
eralization of Y (t) is where individuals go on and off risk, either because they
leave the risk set for a given period or, possibly, because their status cannot be
ascertained. Judicious use of the at-risk indicator Y makes it possible then to
analyze many types of data that, at first glance, would seem quite intractable.
This can be of particular value in longitudinal studies involving time-dependent
measurements where, in order to carry out unmodified analysis we would need, at
each observed failure time, the time-dependent covariate values for all subjects at
risk. These would not typically all be available. A solution based on interpolation,
assuming that measurements do not behave too erratically, is often employed.
Alternatively we can allow for subjects for whom, at an event time, no reliable
measurement is available, to simply temporarily leave the risk set, returning later
when measurements have been made.
The striking thing to note about the above conditional likelihood is that it
coincides with the expression for the partial likelihood. This is no real coincidence
of course and the main theorems of proportional hazards models apply equally
well here. This result anticipates a very important concept, and that is the idea
of sampling from the risk set. The difference between the Y (t) in a classical
survival study, where it is equal to one as long as the subject is under study
and then drops to zero, as opposed to the Y (t) in the simple epidemiological
application in which it is zero most of time, taking the value one when indicating
the appropriate age group, is a small one. It can be equated with having taken
a small random sample from a conceptually much larger group followed since
time (age) is zero. On the basis of the above conditional likelihood we obtain
the estimating equation
U(β) = Σ_{i=1}^n δi { Zi − Σ_{j=1}^n Yj(Xi) Zj exp(βZj) / Σ_{j=1}^n Yj(Xi) exp(βZj) } ,    (5.9)
which we equate to zero in order to estimate β. The equation contrasts the same
quantities written down in Table 5.3 in which the expectations are taken with
respect to the model. The estimating equations are then essentially the same
as those given in Table 5.3 for the Mantel-Haenszel estimator. Furthermore,
taking the second derivative of the expression for the log-likelihood, we have
that I(β) = Σ_{i=1}^n δi Ii(β) where

Ii(β) = Σ_{j=1}^n Yj(Xi) Zj² exp(βZj) / Σ_{j=1}^n Yj(Xi) exp(βZj) − { Σ_{j=1}^n Yj(Xi) Zj exp(βZj) / Σ_{j=1}^n Yj(Xi) exp(βZj) }² .    (5.10)

Inferences can then be carried out on the basis of
these expressions. In fact, once we have established the link between the applied
problem in epidemiology and its description via a proportional hazards model,
we can then appeal to those model-building techniques (explained variation,
explained randomness, goodness of fit, conditional survivorship function etc.)
which we use for applications in time to event analysis. In this context the building
of models in epidemiology is no less important, and no less delicate, than the
building of models in clinical research. Although many of the regression modeling
ideas came to the field of epidemiology later than they did for clinical research,
several of the deeper concepts such as stratification, risk-set sampling and non-
nested studies were already well known to epidemiologists. As a result some
of the most important contributions to the survival literature have come from
epidemiologists. For readers looking for more of an epidemiological flavor to
survival problems we would recommend taking a look at Breslow (1978), Annesi
and Lellouch (1989), Rosenberg and Anderson (2010), Cologne et al. (2012),
Xue et al. (2013), Moolgavkar and Lau (2018) and Fang and Wang (2020).
The issue of competing risks is never far from our attention, no less so when we
discuss problems in epidemiology. Almost without exception, outcomes other than
that of the investigators’ main focus, will hinder its observation and will bring into
play one of the approaches, CRM-I, CRM-J or CRM-ID (Section 2.6), in order to
enable our work to progress. Some obvious questions, such as how a given
population would fare if we were able to remove some particular cause of death,
turn out to be almost impossibly difficult even to formulate correctly. How probable
it is that a member of some high-risk group will succumb to the disease is, again,
a question that is very difficult to address. This may at first seem surprising.
Survival methods can help and, in this and the following section dealing with
genetic epidemiology, we will see how very great care is needed to avoid making
quite misleading inferences. One case in point is that of the so-called breast
cancer susceptibility genes, BRCA1 and BRCA2. Any additional risk to carriers,
if indeed there is any additional risk, has been very greatly exaggerated.
S(t|Z) = α(t) S0 (t),    (5.11)

where S(t|Z) is the survival of the sub-group of interest and S0 (t) the population
survival. For this model, we refer to α(t) as relative survival. It is described as
a measure of net survival for this group in the absence of all those other factors
that influence the reference group. We often assume that the size of the group
Z is not large enough to have a significant impact on S0 (t). In consequence, the
sub-group, S(t|Z) forms a negligible component of S0 (t) so that α(t) will be
described as the ratio of survival in the sub-group with respect to the expected
population survival unaffected by whatever handicap the sub-group suffers from.
Equation 5.11 is very popular in this context due to its simple interpretation. As
pointed out in Section 2.4 it is generally preferable to appeal to the log-minus-log
transformation as the basis for a model. We would then have a non-proportional
hazards model alternative to Equation 5.11 as

log{− log S(t|Z)} = β(t) + log{− log S0 (t)}.    (5.12)

Aside from taking logs it is no more difficult to estimate β(t) than α(t) and,
should β(t) appear to be reasonably constant over time, then we could appeal to
Table 5.4: Based on NIH-SEER data. n(t) = # at risk at age t, nC (t) = # cases
between t and t + 5, I(t) = overall incidence rate, IH (t) = incidence rate (high
risk, BRCA1/2) on the basis of a relative risk of 16.
λ{t|Cmin > t} = lim_{Δt→0+} (Δt)^{−1} Pr{t < T < t + Δt | T > t, Cmin > t},
where Cmin = min(C1 , ..., Ck ). The conditioning event Cmin > t is of central
importance in epidemiological studies since, in practical investigations—in par-
ticular the compilation of registry data—all our observations at time t have
necessarily been conditioned by the events, T > t and Cmin > t. All associated
probabilities are also necessarily conditional. But note that, under an indepen-
dent censoring mechanism, λ(t|Cmin > t) = λ(t). This result is crucial since,
given independence of the competing causes of failure, by partitioning the
interval (0, t) into t0 < t1 < · · · < tm , where t0 = 0 and tm = t, then, as a
consequence of the above expression, we can write:

Λ(t) = ∫_0^t λ{u|Cmin > u} du ≈ Σ_{j=1}^m Pr{t_{j−1} < T < t_j | T > t_{j−1} , Cmin > t_{j−1} }    (5.13)
so that we can approximate F (t) via F (t) = 1 − exp{−Λ(t)}. When using this
approximation with real data, the quality of the approximation can be investigated
on a case-by-case basis. It is common, in epidemiological applications, to use
gaps, tj − tj−1 , as large as 5 years. Equation 5.13 can be directly used and
interval probability estimates taken directly from registry data. Data such as the
SEER data, for very large samples, indicate those subjects entering each 5-year
age interval, i.e., those satisfying the conditioning restriction in Equation 5.13,
and just how many observed cases could be counted during that 5-year interval.
For each component to the sum we have an empirical estimate of the required
probability, the numerator being simply the number of cases seen during the
interval and the denominator typically the number of individuals satisfying the
criteria to be “at-risk” at the beginning of the interval. Attempts to improve the
approximation may involve looking at the average number at-risk throughout the
interval rather than just at the beginning or some other model to improve the
numerical approximation to the integral. Working with model CRM-J, rather than
CRM-I will, in many cases, and these can include breast cancer, have little impact
on the estimation of the cumulative incidence rate (Satagopan and Auerbach,
2004).
Equation 5.13 is the basis for estimating probabilities of cancer incidence over
given periods. Cancer registry data, such as the SEER data set, provide for the
numbers exposed to risk of breast and other cancers and the number of cases
observed during those 5-year intervals. These two numbers provide the numerator
and the denominator to our conditional probability estimates. We can first use
Table 5.4 to confirm the widely quoted figure that one woman in eight will have
breast cancer in her lifetime. If, in Equation 5.13, we let H = 1/39 + 1/41 + 1/46
+ · · · + 1/4,947,580, then we find that 1 − exp(−H) = 0.121 ≈ 1/8, a much quoted
result in not just scientific journals but in everyday popular sources. Note that
this estimated probability is unaltered in the first 3 figures if we start the clock
from age 20 rather than age zero. The reason for this can be seen in the table
since those early rates are so small as to be negligible. On the other hand, at the
top end, the summation stops for women over 75 years of age. Some authors go
beyond 75 years when evaluating lifetime risk (Brose et al. (2002), for example,
calculate out to age 110 years).
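The arithmetic of Equation 5.13 is easily reproduced. The sketch below uses made-up 5-year interval counts (the SEER-based figures of Table 5.4 are not reproduced here) and returns the cumulative risk 1 − exp(−H).

import numpy as np

# Made-up 5-year age intervals: numbers entering each interval at risk and
# incident cases observed during the interval (illustrative values only).
n_at_risk = np.array([950_000, 900_000, 820_000, 700_000, 560_000, 400_000, 250_000])
n_cases   = np.array([     60,     900,   4_100,   8_000,   9_500,   8_200,   5_400])

# Each ratio estimates Pr{t_{j-1} < T < t_j | T > t_{j-1}, Cmin > t_{j-1}}.
interval_probs = n_cases / n_at_risk

H = interval_probs.sum()                      # approximation to Lambda(t) in (5.13)
print("cumulative risk:", round(1.0 - np.exp(-H), 3))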
Our estimates are limited by the rapidly declining number of observations
for the higher age groups but, theoretically, if we were able to estimate the
death rate due to cancer without fixing some upper limit then it would go to
one hundred percent. This is easily seen intuitively since, everyone has to die
eventually, and the calculations do not allow for anyone to die of causes other
than breast cancer (in technical language, other causes of death are “censored
out” under CRM-I when we calculate the risk sets). When appealing to model
CRM-J, instead of CRM-I, we will need to introduce into the calculation the
marginal survival time to either death or breast cancer incidence (see Section
2.6) which is often approximated from life tables by registry death rates. In this
case the choice of model did not have any significant impact on estimation.
5.6 Genetic epidemiology

It is not easy to find a subject more fascinating than genetics. Seyerle and Avery
(2013) describe genetic epidemiology as being at the crossroads of epidemiology
and genetics, the discipline’s aim being to identify the myriad of relationships
between inherited risk factors and disease etiology. The field is truly a vast one and
our purpose here is very limited—to consider the contribution to such an aim that
can be made by modern techniques of survival analysis, proportional and non-
H0 : Pr(Y ≥ 2 | Y ≥ 1) = Pr(Y ≥ 1 | Y ≥ 0)

ψ = Pr(Y ≥ 2 | Y ≥ 1) / Pr(Y ≥ 1 | Y ≥ 0).
We can re-express the null hypothesis—are cases no more likely than controls
to have other family members with breast cancer—as H0 : ψ = 1. Now, while
the hypothesis, H0 : Pr (Y ≥ 2 | Y ≥ 1, n) = Pr (Y ≥ 1 | Y ≥ 0, n) will correctly
control the size of the test, the more obvious formulation, H0 : Pr (Y ≥ 2 | Y ≥
1) = Pr (Y ≥ 1 | Y ≥ 0) will not. Indeed, under H0 it is not generally true that
ψ = 1. We have that

Pr(Y ≥ 2 | Y ≥ 1) = Σ_n Pr(Y ≥ 2 | Y ≥ 1, n) × g(n | Y ≥ 1)    (5.14)

Pr(Y ≥ 1 | Y ≥ 0) = Σ_n Pr(Y ≥ 1 | Y ≥ 0, n) × g(n | Y ≥ 0)    (5.15)
Under the null hypothesis we would like for ψ, the ratio of Equation 5.14 to
Equation 5.15, to be equal to one. This however is not the case. Choosing a
particular distribution for g(n), specifically one with support restricted to n =
{3, 25}, and with g(3)/g(25) for the under 50 age group twice that of the other
group, we obtain the results shown in Table 5.5. In brackets are the values taken
from early works that supposedly justify the conclusion of family association.
Table 5.5 is based on an independence assumption and we obtain very similar
results to those taken from the literature. Such results do not therefore give
support to a conclusion of family association.
Table 5.5: Ratio of the left-hand sides of Equation 5.14 to Equation 5.15 under
binomial sampling with unknown n and independence. In brackets, ψ from case-
control studies; the strong similarity of the numbers tends to contradict a
conclusion of dependence.

Of course, the particular distribution chosen for g(n) is not at all plausible.
This is, however, beside the point since it underlines what we need to know,
and that is that, even under independence, the null hypothesis will not hold. The
reason for this is that knowing that Y ≥ 1 tells us something about n, i.e., a case
is more likely to be associated with another case in the family than is a control.
This dependence will disappear if we are able to condition (take as fixed) n. This
is not usually feasible and, in any event, the bias would not disappear due to
other less obvious biases, recall bias in particular. In statistical terms, the test
cannot be shown to be unbiased (note that the meaning of the term unbiased
differs when we refer to a test rather than an estimator) and will fail to control
for the false-positive rate.
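The point is easy to verify directly. The sketch below computes the ratio of Equation 5.14 to Equation 5.15 under independence, with Y binomial given the family size n and a two-point distribution for n; the support and weights chosen here are purely illustrative and do not reproduce the values of Table 5.5.

from scipy.stats import binom

def psi_under_independence(p, sizes=(3, 25), weights=(0.9, 0.1)):
    # Ratio of Pr(Y >= 2 | Y >= 1) to Pr(Y >= 1 | Y >= 0) when, given n,
    # Y ~ Binomial(n, p) and n follows the two-point distribution g(n).
    p_ge1 = sum(w * (1.0 - binom.cdf(0, n, p)) for n, w in zip(sizes, weights))
    p_ge2 = sum(w * (1.0 - binom.cdf(1, n, p)) for n, w in zip(sizes, weights))
    # Pr(Y >= 1 | Y >= 0) = Pr(Y >= 1); Pr(Y >= 2 | Y >= 1) = Pr(Y >= 2)/Pr(Y >= 1).
    return (p_ge2 / p_ge1) / p_ge1

for p in (0.02, 0.05, 0.10):
    print(f"p = {p:.2f}:  psi = {psi_under_independence(p):.2f}")

Even though cases arise independently within and across families, the ratio differs from one because conditioning on Y ≥ 1 favors the larger families.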
Essentially, the likelihood (LOD scores) method is of little help in cases like
this. For more regular problems the log-likelihood that is used to estimate an
unknown parameter will furnish us with an estimating equation, the zero of which
corresponds to our parameter estimate. Small perturbations in the likelihood
around its maximum correspond to small perturbations in our parameter estimate.
We have continuity but, usually more, we have a log-likelihood that is twice
differentiable allowing us, after verifying that the order of limiting processes can
be switched, to provide an assessment of the precision of our estimate. For our
genome-wide search we have no such conditions and no such reassuring estimates
of precision. So much so that, for the kind of limited data available, the gene
location identified by a likelihood maximum has much less chance of being the
quantity we are seeking than has the sum total of all other locations. If, as a
statistical technique, likelihood were up to the task, it would tell us that, even if
our best estimate of the location on the genome is not made with perfect
precision, the true location must at least be nearby. No such statement can be
made, even in an approximate way, and this tells us that a heavy reliance on
likelihood is not warranted.
In order to get a better insight into the size of the challenge here, consider
an experiment quite unrelated to genetics. Suppose we have a fair deck of cards
and one card is randomly chosen. The outcome of interest is obtaining a black
queen. The probability of this outcome is 1/26. Suppose now that we have a
second deck, this time a very flawed deck in which half of the cards are black
queens. On the basis of observations our goal is to identify the flawed deck. To
this end we will use the likelihood function. The likelihood that a random draw
obtains a black queen is 13 times higher in the second deck than the first. If,
for each deck, 10 cards were drawn with replacement and we count the number
of times X , X = 0, ...., 10, we observe a black queen, then, identifying (using
the likelihood) the fair (1/26) from the flawed (13/26) deck, we would choose
incorrectly the flawed deck less than one time in ten thousand. We are just about
certain to correctly identify the faulty deck.
How would this work though when our faulty deck is surrounded by, not
one, but one hundred thousand fair decks. Choosing the deck that maximizes
the likelihood will very rarely lead to the correct choice. We are far more likely
to incorrectly identify a fair deck than correctly identify the faulty deck. And
this despite the huge difference in risk, 13:1. Note also, returning to the case of
interest, that of the BRCA gene, the relative risk is not generally believed to be
this large. And of course, while the card example relates to a search across one
hundred thousand candidate choices, in the case of alleles defined by nucleotide
variations, we are talking about millions of choices.
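The scale of the problem can be explored by direct simulation. In the sketch below the number of competing fair decks, the number of draws per deck and the number of trials are adjustable assumptions; the deck with the largest count of black queens is the maximum likelihood choice.

import numpy as np

rng = np.random.default_rng(11)

def proportion_correct(n_fair=100_000, n_draws=10, n_trials=200):
    # Proportion of trials in which the maximum likelihood choice (the deck with
    # the most black queens among n_draws draws with replacement) is the single
    # flawed deck; ties at the maximum are resolved by a random pick.
    correct = 0.0
    for _ in range(n_trials):
        fair = rng.binomial(n_draws, 1.0 / 26.0, size=n_fair)   # fair decks
        flawed = rng.binomial(n_draws, 0.5)                     # the flawed deck
        best_fair = fair.max()
        if flawed > best_fair:
            correct += 1.0
        elif flawed == best_fair:
            correct += 1.0 / (1.0 + np.sum(fair == best_fair))
    return correct / n_trials

print("proportion correctly identified:", round(proportion_correct(), 3))

The proportion falls as the number of candidate decks grows and as the amount of data per deck shrinks, which is the situation the genome-wide search places us in.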
An approach based on the likelihood can be enhanced using techniques
such as linkage analysis that weight more strongly the mother-daughter relationship
relative to, say, the mother-niece relationship. It will nonetheless do little to refine what remains a
rough and ready approach involving a lot of statistical uncertainty. Although it
can be argued that the best estimate of the implicated gene is the likelihood-
based estimate, we ought not to overlook the great deal of imprecision associated
with this estimate. More problematic is the highly erratic nature of an estimate,
such as this one, where the parameter of interest (the gene) is not a contin-
uous function of the log-likelihood. Very small non-significant differences from
the maximum of the likelihood (LOD score) can be associated with genes that
are not only very far removed from the so-called BRCA genes but may well
be found on entirely different chromosomes. The situation here is quite different
from that we are familiar with when the log-likelihood is a continuous differen-
tiable function of the unknown parameter(s), enabling us to structure reliable
inference based on estimating equations. All we have here is the best point esti-
mate of the implicated gene and no measure at all regarding the amount of
sampling uncertainty that ought to be associated with it.
how does that probability change? The answer to that is ... not much. It can
be estimated using registry data and some calculations involving order statistics
(Appendix A.9). It would greatly strengthen intuitive understanding of the implications
of carrier status. The idea is to come up with quantifiers of risk that can
be understood on some level, as opposed to a lifetime risk that a carrier will
struggle to interpret. More work could be done. The example of being the first to die out of a
small group of friends extends immediately to not being one of the first two,
and to how this might change given BRCA status. As well as short-term probabilities,
another useful quantity from survival analysis is the expected remaining
lifetime at some given point. For a 25-year-old carrier, assuming that surgery would
put them in the same position as a non-carrier, we can calculate a difference in
mean remaining lifetimes of 3.7 years. We might tell a carrier that they have an
87% probability of getting breast cancer without surgery. Or that, with surgery,
they will be expected to gain, on average, some 3.7 years a half-century down
the line. The carrier would interpret those two statements very differently. And
yet they use the very same data, the same assumptions, and the same belief as
to the existence of the BRCA gene as well as the unfavorable outcome associated
with many of the variant polymorphisms.
2. Write down a conditional logistic model in which we adjust for both age and
cohort effects, where cohorts are grouped by intervals of births: 1930-35,
1936-40, 1941-45, etc. For such a model, is it possible to answer the question:
was there a peak in relative risk during the nineteen sixties?
Describe this situation via the use of compartment models. Complete the
description via the use of regression models. What assumptions do you need
to make? How can you go about testing the validity of those assumptions?
6. Simulate the number of cases per family in the following way. First choose
family size, n, based on a mixture of two Poisson laws: one with a mean equal
to 6 and one with a mean equal to 20, and where P (20) = 1 − P (6) = 0.1.
Next, use a Binomial law B(n, p), where p = 0.15, to simulate the number of
observed cases. Note down this number and repeat for 200 families. Finally,
produce a histogram of the observed number of cases per family: 0, 1, 2, ...
(a simulation sketch in code is given after this list).
7. For the data obtained from the previous exercise, ask a classmate or colleague
to look at the data and, in the absence of any further analysis, to judge
whether or not there appears to be evidence of a family effect. Do they
consider this effect to be absent or weak, moderate to strong, or very strong?
Do not provide any explanation as to how the data were obtained.
8. If given data like that of the previous question, and no indication as to how
the data were obtained, how might you go about testing the null hypothesis
that there is no family association? What would be the nature of a statistic
that would lead to a consistent test while maintaining power against the
alternative hypothesis?
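A minimal sketch in code of the simulation described in Exercise 6, taking the mixture weights to apply to the two Poisson means (probability 0.9 on mean 6 and 0.1 on mean 20); the function names and plotting choices are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_families, p_case = 200, 0.15

cases = []
for _ in range(n_families):
    mean = 20 if rng.random() < 0.1 else 6    # mixture of two Poisson laws
    n = rng.poisson(mean)                     # family size
    cases.append(rng.binomial(n, p_case))     # observed cases in the family

plt.hist(cases, bins=range(0, max(cases) + 2), edgecolor="black")
plt.xlabel("observed cases per family")
plt.ylabel("number of families")
plt.show()
```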
Chapter 6
Non-proportional hazards models

The most general model, described in Chapter 4, covers a very broad spread of
possibilities and, in this chapter, we consider some special cases. Proportional
hazards models, partially proportional hazards models (O’Quigley and Stare,
2002), stratified models, or models with frailties or random coefficients all arise
as special cases of this model (Xu and O’Quigley, 2000). One useful parame-
terization (O’Quigley and Pessione, 1991; O’Quigley and Prentice, 1991) can
be described as a non-proportional hazards model with intercept. Changepoint
models are a particular form of a non-proportional hazards model with intercept
(O’Quigley and Natarajan, 2004). Any model can be viewed as a special case of
the general model, lying somewhere on a conceptual scale between this general
model and the most parametric extreme, which would be the simple exponential
model. Models can be placed on this scale according to the extent of model con-
straints and, for example, a random effects model would lie somewhere between
a stratified model and a proportional hazards model.
It is very easy to extend the simple proportional hazards model to deal with
much more general problems. Allowing regression effects, β(t), to change with
time introduces great generality to our model structure. We can then view the
proportional hazards structure as a special case of a non-proportional hazards
structure. Intermediary cases, where some parameters depend on time and some
do not, are also very easily accommodated. The fact that so many different
model structures can all be put under the same heading brings two major
benefits. The first is that a general understanding of the overall model struc-
ture allows a better understanding of the specific structure of the special cases.
The second benefit of this generality is that we only need to tackle inferential
questions in the broadest setting. Applications to special cases are then almost
immediate. An important result is that any non-proportional hazards model can,
under a suitable transformation, be made into a proportional hazards one. The
transformation depends on the unknown regression function, β(t), and so is of
no obvious practical value. However, since we have optimality results for the
proportional hazards setting, we can use this result, in a theoretical framework,
to investigate for instance how well any test performs when contrasted to the
(generally unavailable) optimal test.
In the case of a single binary variable, model (4.2) and model (4.4) represent
the two extremes of the modeling options open to us. Under model (4.2) there
would be no model constraint and any consequent estimation techniques would
amount to dealing with each level of the variable independently. Under model
(4.4) we make a strong assumption about the nature of the relative hazards,
an assumption that allows us to completely share information between the two
levels. There exists an important class of models lying between these extremes
and, in order to describe this class, let us now imagine a more complex situation;
that of three groups, A, B, and C, identified by a vector Z of binary covariates;
Z = (Z2 , Z3 ). This is summarized in Table 6.1.
Our assumptions are becoming stronger in that not only are we modeling the
treatment effect via β1 but also the group effects via β2 and β3 . Expressing this
problem in complete generality, i.e., in terms of model (4.2), we write
Unlike the simple case of a single binary variable where our model choices were
between the two extremes of model (4.2) and model (4.4), as the situation
becomes more complex, we have open to us the possibility of a large number
of intermediary models. These are models that make assumptions lying between
model (4.2) and model (4.4) and, following O’Quigley and Stare (2002) we call
them partially proportional hazards models. A model in between (6.1) and (6.2) is
Figure 6.1: A stratified model with transitions only to the death state. For all 3
transitions the log-relative risk is given by β. The baseline rates λ0w , w = 1, 2, 3,
however, depend on the stratum w. Risk sets are stratum specific.
This model is of quite some interest in that the strongly modeled part of
the equation concerns Z1 , possibly the major focus of our study. Figure 6.1
illustrates a simple situation. The only way to leave any state is to die, the
probabilities of making this transition varying from state to state and the rates
of transition themselves depending on time. Below, under the heading of time-dependent
covariates, we consider the case where it is possible to move between
states. There it will be possible to move from a low-risk state to a high-risk state,
to move from either of these to the death state, and also, without having made the
transition to the absorbing death state, to move back from high-risk to low-risk.
Stratified models
Coming under the heading of a partially proportional hazards model is the class
of models known as stratified models. In the same way these models can be
considered as being situated between the two extremes of Equation 4.2 and
Equation 4.3 and have been discussed by Kalbfleisch and Prentice (2002) among
others. Before outlining why stratified models are simply partially proportional
hazards models we recall the usual expression for the stratified model as
$$\lambda(t\,|\,Z(t), w) \;=\; \lambda_{0w}(t)\exp\{\beta Z(t)\}, \qquad (6.4)$$
a model clearly lying, in a well-defined way, between models (4.3) and (4.2). For
a two-covariate partially proportional hazards model in which only the coefficient
of Z1 is allowed to depend on time, it follows that
$$\lambda(t\,|\,Z_1(t)=0,\,Z_2(t)) \;=\; \lambda_0(t)\exp\{\beta_2 Z_2(t)\}$$
and
$$\lambda(t\,|\,Z_1(t)=1,\,Z_2(t)) \;=\; \lambda_0^{*}(t)\exp\{\beta_2 Z_2(t)\},$$
where $\lambda_0^{*}(t) = \lambda_0(t)e^{\beta_1(t)}$. Recoding the binary Z1 (t) to take the values 1 and
2, and rewriting $\lambda_0^{*}(t) = \lambda_{02}(t)$, $\lambda_0(t) = \lambda_{01}(t)$, we recover the stratified PH
model (6.4) for Z2 (t). The argument is easily seen to be reversible and readily
extended to higher dimensions so we can conclude an equivalence between the
stratified model and the partially proportional hazards model in which some
of the β(t) are constrained to be constant. We can exploit this idea in the
goodness of fit or the model construction context. If a PH model holds as a
good approximation, then the main effect of Z2 say, quantified by β2 , would be
similar over different stratifications of Z1 and remain so when these stratifications
are re-expressed as a PH component to a two-covariate model. Otherwise the
indication is that β1 (t) should be allowed to depend on t. The predictability of
any model is studied later under the headings of explained variation and explained
randomness and it is of interest to compare the predictability of a stratified model
and an un-stratified one. For instance, we might ask ourselves just how much
predictive strength Z2 carries after having accounted for Z1 . Since we can account
for the effects of Z1 either by stratification or by its inclusion in a single PH model
we may obtain different results. Possible discrepancies tell us something about
our model choice.
The relation between the hazard function and the survival function follows
as a straightforward extension of (4.5). Specifically, we have
$$S(t\,|\,Z) \;=\; \sum_{w}\phi(w)\,\{S_{0w}(t)\}^{\exp(\beta Z)}, \qquad (6.6)$$
where S0w (t) is the corresponding baseline survival function in stratum w and
φ(w) is the probability of coming from that particular stratum. This is then
slightly more involved than the nonstratified case in which, for two groups, the
model expresses the survival function of one group as a power transformation
of the other. The connection to the class of Lehmann alternatives is still there
although somewhat weaker. For the stratified model, once again the quantity
λ0w (t) does not appear in the expression for the partial likelihood given now by
$$L(\beta) \;=\; \prod_{i=1}^{n}\left\{\frac{\exp(\beta Z_i)}{\sum_{j=1}^{n} Y_j\{w_i(X_i), X_i\}\exp(\beta Z_j)}\right\}^{\delta_i} \qquad (6.7)$$
and, in consequence, once again, λ0w (t) can remain arbitrary. Note also that
each term in the product is the conditional probability that, at the time Xi of an
observed failure, it is precisely individual i who is selected to fail, given all the
individuals at risk from stratum w and given that one failure from this stratum occurs.
The notation wi (t) indicates the stratum in which the subject i is found at
time t. Although we mostly consider wi (t) which do not depend on time, i.e.,
the stratum is fixed at the outset and thereafter remains the same, it is almost
immediate to generalize this idea to time dependency and we can anticipate the
later section on time-dependent covariates where the risk indicator Yj {wi (t), t}
is not just a function taking the value one until it drops at some point to zero,
but can change between zero and one with time, as the subject moves from one
stratum to another. For now the function Yj {wi (t), t} will be zero unless the
subject is at risk of failure from stratum wi , i.e., the same stratum in which the
subject i is to be found. Taking the logarithm in (6.7) and differentiating with respect
to β, we obtain the score function
$$U(\beta) \;=\; \sum_{i=1}^{n}\delta_i\left[Z_i - \frac{\sum_{j=1}^{n} Y_j\{w_i(X_i), X_i\}\,Z_j\exp(\beta Z_j)}{\sum_{j=1}^{n} Y_j\{w_i(X_i), X_i\}\exp(\beta Z_j)}\right], \qquad (6.8)$$
which, upon setting equal to zero, can generally be solved without difficulty using
standard numerical routines, to obtain the maximum partial likelihood estimate
β̂ of β. The parameter β then is assumed to be common across the different
strata.
Inferences about β are made by treating β̂ as asymptotically normally distributed
with mean β and variance I(β̂)−1 , where, now, I(β) is given by
$I(\beta) = \sum_{i=1}^{n}\delta_i I_i(\beta)$. In this case each $I_i$ is, as before, obtained as the derivative
of each component of the score statistic U (β). For the stratified score this is
$$I_i \;=\; \frac{\sum_{j=1}^{n} Y_j\{w_i(X_i), X_i\}\,Z_j^2\exp(\beta Z_j)}{\sum_{j=1}^{n} Y_j\{w_i(X_i), X_i\}\exp(\beta Z_j)} \;-\; \left[\frac{\sum_{j=1}^{n} Y_j\{w_i(X_i), X_i\}\,Z_j\exp(\beta Z_j)}{\sum_{j=1}^{n} Y_j\{w_i(X_i), X_i\}\exp(\beta Z_j)}\right]^{2}.$$
The central notion of the risk set is once more clear from the above expressions
and we most usefully view the score function as contrasting the observed covari-
ates at each distinct failure time with the means of those at risk from the same
stratum. A further way of looking at the score function is to see it as having put
the individual contributions on a linear scale. We simply add them up within a
stratum and then, across the strata, it only remains to add up the different sums.
Once again, inferences can also be based on likelihood ratio methods or on the
score U (β), which in large samples can be considered to be normally distributed
with mean zero and variance I(β).
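The calculations just described amount to only a few lines of code. The sketch below is a minimal, illustrative implementation of U(β), I(β), and the Newton-Raphson step for the stratified model; the data are assumed to be held in NumPy arrays x (observed times), delta (event indicators), z (a single covariate), and stratum (stratum labels), names that are ours and not standard. Ties are handled naively and no convergence check is made; the point is only to display the structure of the computation.

```python
import numpy as np

def stratified_score_info(beta, x, delta, z, stratum):
    # Score U(beta) and information I(beta) for the stratified model,
    # following the two displays above: within-stratum risk sets only.
    U, I = 0.0, 0.0
    for i in np.where(delta == 1)[0]:
        in_risk = (stratum == stratum[i]) & (x >= x[i])   # Y_j{w_i(X_i), X_i}
        w = np.exp(beta * z[in_risk])
        zbar = np.sum(w * z[in_risk]) / np.sum(w)          # weighted mean of Z
        z2bar = np.sum(w * z[in_risk] ** 2) / np.sum(w)    # weighted mean of Z^2
        U += z[i] - zbar
        I += z2bar - zbar ** 2
    return U, I

def fit_stratified(x, delta, z, stratum, beta=0.0, n_iter=25):
    # Newton-Raphson iteration toward the maximum partial likelihood estimate
    for _ in range(n_iter):
        U, I = stratified_score_info(beta, x, delta, z, stratum)
        beta = beta + U / I
    return beta, I
```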
Multivariate extensions follow as before. For the stratified model the only
important distinction impacting the calculation of U (β) and Ii (β) is that the
sums are carried out over each stratum separately and then combined at the
end. The indicator Yj {wi (Xi )} enables this to be carried out in a simpler way as
indicated by the equation. The random effects model has proved to be a valuable
tool in applications and this can be seen in several detailed applications (Binder,
1992; Carlin and Hodges, 1999; Collaboration, 2009; Natarajan and O’Quigley,
2002; O’Quigley and Stare, 2002; Zhou et al., 2011; Hanson, 2012).
These models are also partially proportional in that some effects are allowed not
to follow a proportional hazards constraint. However, unlike the stratified models
described above, restrictions are imposed. The most useful view of a random
effects model is to see it as a stratified model with some structure imposed upon
the strata. A random effects model is usually written
$$\lambda(t\,|\,Z(t), w) \;=\; \lambda_0(t)\,e^{w}\exp\{\beta Z(t)\}, \qquad (6.10)$$
in which we take w as having been sampled from some distribution G(w; θ).
Practically there will only be a finite number of distinct values of w, however
large. For any value w we can rewrite λ0 (t)ew = λ0w (t) and recover model (6.4).
For the right-hand side of this equation, and as we might understand from (6.4),
we suppose w to take the values 1,2, ... The values on the left-hand side, being
generated from G(·) would generally not be integers but this is an insignificant
notational issue and not one involving concepts. Consider the equation to hold. It
implies that the random effects model is a stratified model in which added struc-
ture is placed on the strata. In view of Equation 6.5 and the arguments following
this equation we can view a random effects model equivalently as in Equation
4.2 where, not only are PH restrictions imposed on some of the components of
β(t), but the time dependency of the other components is subject to constraints.
These latter constraints, although weaker than imposing constancy of effect, are
all the stronger the more concentrated the distribution G(w; θ) is. In applications it
is common to choose forms for G(w; θ) that are amenable to ready calculation
(Duchateau and Janssen, 2007; Gutierrez, 2002; Hanagal, 2011; Li and Ryan,
2002; Liu et al., 2004; Wienke, 2010).
The illustration in Figure 6.2 makes it clear that, under this assumption, a weaker one
than that implied by Equation 4.3, we can estimate the treatment effect whilst
ignoring center effects. A study of these figures is important to understanding
what takes place when we impose a random effects model as in Equation 6.10.
For many centers (Figure 6.3), rather than having two curves per center, parallel
but otherwise arbitrary, we have a family of parallel curves. We are no longer able
to say anything about the distance between any given pair of centers, as we could for the
model of Equation 4.3, a so-called fixed effects model, but the distribution of the
distances between centers is something we aim to quantify. This is summarized
by the distribution G(w; θ) and our inferences are then partly directed at θ.
The random effects model can thus be viewed as a stratified model in which some
structure is imposed upon the differences between strata. The stratified model
not only leaves any distribution of differences between
strata unspecified, but it also makes no assumption about the form of any given
stratum. Whenever the random effects model is valid, then so also is the stratified
model, the converse not being the case.
It may then be argued that we are making quite a strong assumption when we
impose this added structure upon the stratified model. In exchange we would hope
to make non-negligible inferential gains, i.e., greater precision in our estimates
of error for the parameters of main interest, the treatment parameters.
In practice gains tend to be small for most situations and give relatively little
reward for the extra effort made. Since any such gains are only obtainable under
the assumption that the chosen random effects model generates the data, actual
gains in practice are likely to be yet smaller and, of course, possibly negative
when our additional model assumptions are incorrect. A situation where gains
for the random effects model may be of importance is one where a non-negligible
subset of the data include strata containing only a single subject. In such a case
simple stratification would lose information on those subjects. A random effects
model, assuming the approximation to be sufficiently accurate, enables us to
recover such information.
Table 6.2: Point estimates of the regression coefficient (true value 1.0), with the
variability of the estimates in parentheses, from data simulated under a random
effects model; columns give the strata × stratum size configurations.

                          100 × 5        250 × 2        25 × 20
  Ignoring effect         0.52 (0.16)    0.51 (0.16)    0.54 (0.16)
  Random effects model    1.03 (0.19)    0.99 (0.22)    1.01 (0.17)
  Stratified model        1.03 (0.22)    1.02 (0.33)    1.01 (0.18)
The random effects analysis, via the inclusion of a different w per group, will
estimate the relevant expectations over the whole risk set and not just that
relative to the group defined by the covariate value.
Comparisons for the stratified model are made with respect to the relatively
few subjects of the group risk sets. This may lead us to believe that much infor-
mation could be recovered were we able to make the comparison, as does the
alternative random effects analysis, with respect to the whole risk set. Unfortu-
nately this is not quite so because each contribution to the score statistic involves
a difference between an observation on a covariate and its expectation under the
model, and the "noise" in the expectation estimate is of lower order than the
covariate observations themselves. There is not all that much to be gained by
improving the precision of the expectation estimate.
In other words, using the whole of the risk set or just a small sample from
it will provide similar results. This idea of risk set sampling has been studied in
epidemiology and it can be readily seen that the efficiency of estimates based on
risk set samples of size k, rather than the whole risk set, is of the order
$$\frac{k}{k+1}\left\{\,1 + \sum_{j=1}^{n}\frac{1}{n(n-j+1)}\,\right\}. \qquad (6.11)$$
This function increases very slowly to one but, with as few as four subjects on
average in each risk set comparison, we have already achieved 80% efficiency.
With nine subjects this figure is close to 90%. Real efficiency will be higher for
two reasons: (1) the above assumes that the estimate based on the full risk set
is without error, (2) in our context we are assuming that each random effect w
is observed precisely.
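Expression (6.11) is easily evaluated; in the sketch below the full risk set size n is a hypothetical value, and the leading factor k/(k + 1) on its own already gives the 80% and 90% figures quoted above for k = 4 and k = 9, the bracketed sum adding only a small correction.

```python
def sampling_efficiency(k, n):
    # efficiency of risk set samples of size k relative to the full risk
    # set, expression (6.11)
    return k / (k + 1) * (1 + sum(1 / (n * (n - j + 1)) for j in range(1, n + 1)))

for k in (1, 4, 9, 19):
    print(k, round(sampling_efficiency(k, n=100), 3))
```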
Added to this is the fact that, since the stronger assumptions of the random
effects model must necessarily depart to some degree from the truth, it is by no
means clear that there is much room to make any kind of significant gains. As
an aside, it is of interest to note that, since we do not gain much by considering
the whole of the risk set as opposed to a small sample from it, the converse
must also hold, i.e., we do not lose very much by working with small samples
rather than the whole of the risk set. In certain studies, there may be great
economic savings to be made by only using covariate information, in particular
when time-dependent, from a subset of the full risk set.
Table 6.2 was taken from O’Quigley and Stare (2002). The table was constructed
from simulated failure times where the random effects model was taken to be
exactly correct. Data were generated from this model in which the gamma frailty
had a mean and variance equal to one. The regression coefficient of interest
was exactly equal to 1.0. Three situations were considered: 100 strata each of
size 5, 250 strata each of size 2, and 25 strata each of size 20. The take-home
message from the table is that, in these cases, not much is to be gained from
the random effects model in terms of efficiency. Any biases appear negligible and the
means of the point estimates for both the random effects and the stratified models, while
differing notably from those of a crude analysis that ignores the effect altogether, are
effectively indistinguishable from one another. As we would expect there is a gain for the
variance of estimates based on the random effects model but, even for highly
stratified data (100 × 5), any gain is very small. Indeed for the extreme case of
250 strata, each of size 2, surely the worst situation for the stratified model, it
is difficult to become enthusiastic over the comparative performance of the random effects model.
We might conclude that we only require around 80% of the sample size needed
when estimating relative risk on the basis of the stratified model. But such a
conclusion leans entirely on the assumption that we know not only the class of
distributions from which the random effects come but also the exact value of
the population parameters; in practice, even in this most hopeful of cases, the
required fraction is likely to be greater than the 80% indicated by our calculations.
The only real situation that is clearly disadvantageous to the stratified model is
one where a non-negligible subset of the strata contain only a single observation.
For such cases, and assuming a random effects model to provide an adequate fit,
information from strata with a single observation (which would be lost by a
stratified analysis) can be recovered by a random effects analysis.
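A sketch of the kind of simulation behind Table 6.2, under our reading of the setup (exponential baseline, a shared gamma frailty of mean and variance one per stratum, a binary covariate with true coefficient 1.0, no censoring). It assumes the lifelines package is available, whose CoxPHFitter accepts a strata argument; everything here is illustrative and will not reproduce the table's entries exactly.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n_strata, size, beta = 100, 5, 1.0

center = np.repeat(np.arange(n_strata), size)
z = rng.integers(0, 2, center.size)               # binary covariate
frailty = rng.gamma(1.0, 1.0, n_strata)           # gamma frailty, mean = var = 1
t = rng.exponential(1.0 / (frailty[center] * np.exp(beta * z)))
df = pd.DataFrame({"T": t, "E": 1, "z": z, "center": center})

cph = CoxPHFitter()
cph.fit(df, duration_col="T", event_col="E", strata=["center"])     # stratified
print("stratified:", float(cph.params_["z"]))

cph_naive = CoxPHFitter().fit(df[["T", "E", "z"]],                  # ignores centers
                              duration_col="T", event_col="E")
print("ignoring the center effect:", float(cph_naive.params_["z"]))
```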
Recalling the general model, i.e., the non-proportional hazards model for which
there is no restriction on β(t), note that we can re-express this so that the
function β(t) is written as a constant term, the intercept, plus some function of
time multiplied by a constant coefficient. Writing this as β(t) = β0 + θQ(t),
we can describe the term β0 as the intercept and Q(t) as reflecting the nature
of the time dependency. The coefficient θ will simply scale this dependency and
we may often be interested in testing the particular value, θ = 0, since this value
corresponds to a hypothesis of proportional hazards. Fixing the function Q(t)
to be of some special functional form allows us to obtain tests of proportional-
ity against alternatives of a particular nature. Linear or quadratic decline in the
log-relative risk, changepoint, and crossing hazard situations are all then easily
accommodated by this simple formulation. Tests of goodness of fit of the pro-
portional hazards assumption can then be constructed which may be optimal
for certain kinds of departures.
Although not always needed it can sometimes be helpful to divide the time
axis into r non-overlapping intervals, B1 , . . . , Br in an ordered sequence beginning
at the origin. In a data-driven situation these intervals may be chosen so as to
have a comparable number of events in each interval or so as not to have too
few events in any given interval. Defined on these intervals is a vector, also of
dimension r, of some known or estimable functions of time, not involving the
parameters of interest, β. This is denoted Q(t) = {Q1 (t), . . . , Qr (t)}. This model
is then written in the form
$$\lambda(t\,|\,Z(t)) \;=\; \lambda_0(t)\exp\{[\beta + \theta Q(t)]Z(t)\},$$
where θ is a vector of dimension r. Thus, θQ(t) (here the usual inner product) has
the same dimension as β, i.e., one. In order to investigate the time dependency of
particular covariates in the case of multivariate Z we would have β of dimension
greater than one, in which case Q(t) and θ are best expressed in matrix notation
(O’Quigley and Pessione, 1989).
Here, as through most of this text, we concentrate on the univariate case since
the added complexity of the multivariate notation does not bring any added light
to the concepts being discussed. Also, for the majority of the cases of interest,
r = 1 and θ becomes a simple scalar. We will often have in mind some partic-
ular form for the time-dependent regression coefficient Q(t), common examples
being a linear slope (Cox, 1972), an exponential slope corresponding to rapidly
declining effects (Gore et al., 1984) or some function related to the marginal
distribution, F (t) (Breslow et al., 1984). In practice we may be able to estimate
this function of F (t) with the help of consistent estimates of F (t) itself, in par-
ticular the Kaplan-Meier estimate. The non-proportional hazards model with
intercept is of particular use in questions of goodness of fit of the proportional
hazards model pitted against specific alternatives. These specific alternatives can
be quantified by appropriate forms of the function Q(t). We could also test a
joint null hypothesis H0 : β = θ = 0 corresponding to no effect, against an alter-
native H1 : either θ or β nonzero. This leads to a test with the ability to detect
non-proportional hazards departures, as well as proportional hazards departures,
from the null hypothesis of no effect. We could also test a null hypothesis H0 : θ = 0 against
H1 : θ ≠ 0, leaving β itself unspecified. This would then provide a goodness of
fit test of the proportional hazards assumption. We return to these issues later
on when we investigate in greater detail how these models give rise to simple
goodness of fit tests.
Changepoint models
A simple special case of a non-proportional hazards model with an intercept is
that of a changepoint model. O’Quigley and Pessione (1991), O’Quigley (1994)
and O’Quigley and Natarajan (2004) develop such models whereby we take the
function Q(t) to be defined by Q(t) = I(t ≤ γ) − I(t > γ), with γ an unknown
changepoint. This function Q(t) depends upon γ but otherwise does not depend
upon the unknown regression coefficients and comes under the above heading
of a non-proportional hazards model with an intercept. For the purposes of a
particular structure for a goodness of fit test we can choose the intercept to be
equal to some fixed value, often zero (O’Quigley and Pessione, 1991). The model
is then
λ(t|Z) = λ0 (t) exp{[β + αQ(t)]Z(t)}. (6.14)
The parameter α is simply providing a scaling (possibly of value zero) to the
time dependency as quantified by the function Q(t). The chosen form of Q(t),
itself fixed and not a parameter, determines the way in which effects change
through time; for instance, whether they decline exponentially to zero, whether
they decline less rapidly or any other way in which effects might potentially
change through time.
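The particular forms of Q(t) mentioned here are easily coded; in the illustrative sketch below the decline rate and the changepoint γ are arbitrary values of our own choosing.

```python
import numpy as np

def Q_linear(t):                  # linear decline in the log-relative risk
    return -t

def Q_exponential(t, rate=0.1):   # rapidly declining effects
    return np.exp(-rate * t)

def Q_changepoint(t, gamma=2.0):  # Q(t) = I(t <= gamma) - I(t > gamma)
    return np.where(t <= gamma, 1.0, -1.0)

def log_relative_risk(t, beta, alpha, Q):
    # beta + alpha * Q(t), the time-dependent coefficient of model (6.14)
    return beta + alpha * Q(t)
```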
Inference for the changepoint model is not straightforward and in the later
sections dealing with inference we pay particular attention to some of the diffi-
culties raised. Note that were γ to be known, then inference would come under
the usual headings with no additional work required. The changepoint model
expressed by Equation 6.14 deals with the regression effect changing through
time and putting the model under the heading of a non-proportional hazards
model. A related, although entirely different model, is one which arises as a sim-
plification of a proportional model with a continuous covariate and the idea is to
replace the continuous covariate with a discrete classification.
The classification problem itself falls into two categories. If we are convinced
of the presence of effects and simply wish to derive the most predictive clas-
sification into, say, two groups, then the methods using explained randomness
or explained variation will achieve this goal. If, on the other hand, we wish to
test a null hypothesis of absence of effect, and, in so doing, wish to consider all
possible classifications based on a family of potential cutpoints of the continuous
covariate, then, as mentioned above, special techniques of inference are required.
We return to these questions in later chapters where they are readily addressed
via the regression effect process.
In all of the above models we can make a simple change by writing the covariate
Z as Z(t), allowing the covariate to assume different values at different time
points. Our model then becomes
$$\lambda(t\,|\,Z(t)) \;=\; \lambda_0(t)\exp\{\beta Z(t)\}.$$
Figure 6.4: Compartment model where the ability to move between states other than
the death state can be characterized by time-dependent indicator covariates Z(t).
States: 1 (no symptoms), 2 (progression), 3 (dead). Any paths not contradicting
the arrows are allowed.
If, however, we do not wish to model the effects of this second covariate, either
because it is only of indirect concern or because its effects might be hard to
model, then we could appeal to a stratified model. We write
$$\lambda(t\,|\,Z(t)) \;=\; \lambda_{0w(t)}(t)\exp\{\beta Z(t)\},$$
where, as for the non-time-dependent case, w(t) takes integer values 1, . . . , m
indicating status. The subject can move in and out of the m strata as time
proceeds. Two examples illustrate this. Consider a new treatment to reduce the
incidence of breast cancer. An important time-dependent covariate would be
the number of previous incidents of benign disease. In the context of inference,
the above model simply means that, as far as treatment is concerned, the new
treatment and the standard are only ever contrasted within patients having the
same previous history. These contrasts are then summarized in final estimates
and possibly tests. Any patient works her way through the various states, being
unable to return to a previous state. The states themselves are not modeled. A
second example might be a sociological study on the incidence of job loss and
how it relates to covariates of main interest such as training, computer skills, etc.
Here, a stratification variable would be the type of work or industry in which the
individual finds him or herself. Unlike the previous example a subject can move
between states and return to previously occupied states.
Time-dependent covariates describing states can be used in the same way
for transition models in which there is more than one absorbing “death” state.
Many different kinds of situations can be constructed, these situations being
well described by compartment models with arrows indicating the nature of the
transitions that are possible (Figure 6.5).

Figure 6.5: Compartment model with 2 absorbing "death" states. State 4 cannot
be reached from State 3 and can only be reached from State 1 indirectly via
State 2.

For compartment models with time-dependent covariates there is a need for some
thought when our interest focuses on the survival function. The term external
covariate is used to describe any covariate Z(t) such that, at t = 0, we already
know the value of Z(t) for all t > 0. The paths can then be described as
deterministic. In the great majority of the
problems that we face this is not the case and a more realistic way of describing
the situation is to consider the covariate path Z(t) to be random. Also open to
us as a modeling possibility, when some covariate Z1 (t) of secondary interest
assumes a finite number of possible states, is to use the at-risk function Y (s, t).
This restricts our summations to those subjects in state s as described above for
stratified models.
Lemma 6.1. For given β(t) and covariate Z(t) there exists a constant β0 and a
time-dependent covariate Z ∗ (t) such that λ(t|Z(t)) = λ0 (t) exp{β(t)Z(t)} =
λ0 (t) exp{β0 Z ∗ (t)}.
The important thing to note is that we have the same λ0 (t) either side of
the equation and that, whatever the value of λ(t|Z(t)), for all values of t, these
values are exactly reproduced by either expression, i.e., we have equivalence.
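The construction behind Lemma 6.1 is immediate; a one-line sketch of one such choice, taking any fixed β0 ≠ 0, is:

$$Z^{*}(t) \;=\; \frac{\beta(t)}{\beta_0}\,Z(t) \quad\Longrightarrow\quad \lambda_0(t)\exp\{\beta_0 Z^{*}(t)\} \;=\; \lambda_0(t)\exp\{\beta(t)Z(t)\} \;=\; \lambda(t\,|\,Z(t)).$$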
This equivalence is a formal one and does not of itself provide any new angle
on model development. The function β(t) will not generally be known to us.
For time-dependent covariates the partial likelihood takes the same form as before,
$$L(\beta) \;=\; \prod_{i=1}^{n}\left\{\frac{\exp(\beta Z_i(X_i))}{\sum_{j=1}^{n} Y_j(X_i)\exp(\beta Z_j(X_i))}\right\}^{\delta_i}. \qquad (6.19)$$
As before, taking the logarithm in Equation 6.19 and its derivative with respect
to β, we obtain the estimating equation which, upon setting equal to zero,
can generally be solved without difficulty using the Newton-Raphson method.
Chapter 7 on estimating equations looks at this more closely. The form of U (β)
is as before with only the time dependency marking any distinction.
$$U(\beta) \;=\; \sum_{i=1}^{n}\delta_i\left[Z_i(X_i) - \frac{\sum_{j=1}^{n} Y_j(X_i)\,Z_j(X_i)\exp(\beta Z_j(X_i))}{\sum_{j=1}^{n} Y_j(X_i)\exp(\beta Z_j(X_i))}\right] \qquad (6.20)$$
Additive models
Instead of considering a model with multiplicative risks, some authors appeal to
a structure closer to the one we know from linear regression, e.g., the
additive intensity model proposed by Aalen (1989, 1980). This is written
$$\lambda(t\,|\,Z) \;=\; \beta_0(t) + \beta_1(t)Z_1 + \cdots + \beta_p(t)Z_p,$$
where the vector of regression coefficients, β(t) = (β0 (t), β1 (t), . . . , βp (t)), depends
on time. The estimation of the cumulative effects $\int_0^t \beta_i(s)\,ds$ for i = 1, . . . , p is
done by weighted least squares.
done by weighted least squares. McKeague and Sasieni (1994) studied the case
of fixed effects for some covariates and time-dependent effects for the others
creating a model analogous to the partially proportional hazards model. Lin and
Ying (1995) proposed a more complex combined additive and multiplicative risk
model written:
where β(t) is time dependent. Aalen et al. (2008) present an analysis of the
properties of these kinds of model. Beran (1981) and McKeague and Utikal
(1990) studied the model $\lambda(t\,|\,Z) = \alpha(t, Z)$,
without making explicit the function α, which can be estimated by kernel methods
and local smoothing.
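For the additive intensity model, estimation of the cumulative effects reduces to a least-squares solve at each event time. The sketch below is a bare-bones, unweighted version of that recursion in NumPy; the data layout and function name are ours, ties are handled crudely, and no weighting or smoothing is attempted.

```python
import numpy as np

def aalen_cumulative(x, delta, Z):
    """Cumulative regression functions B_i(t) = int_0^t beta_i(s) ds for the
    additive model lambda(t|Z) = beta_0(t) + beta_1(t)Z_1 + ... + beta_p(t)Z_p,
    estimated by an (unweighted) least-squares solve at each event time."""
    n, p = Z.shape
    X = np.column_stack([np.ones(n), Z])      # design with intercept column
    times = np.sort(x[delta == 1])
    B, cum = np.zeros((len(times), p + 1)), np.zeros(p + 1)
    for k, t in enumerate(times):
        at_risk = x >= t
        Xk = X * at_risk[:, None]             # rows of subjects not at risk are zeroed
        dN = ((x == t) & (delta == 1)).astype(float)
        cum += np.linalg.lstsq(Xk, dN, rcond=None)[0]   # increment dB(t)
        B[k] = cum
    return times, B
```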
$$g(T) = -\beta^{T} Z + \varepsilon, \qquad (6.24)$$
$$S(t\,|\,Z) = S_0\!\left(t\exp(\beta^{T} Z)\right), \qquad 0 \le t \le T, \qquad (6.25)$$
$$\lambda_T(t) = \frac{\theta_1\theta_0}{\theta_1 + (\theta_0-\theta_1)S_P(t)}\,\lambda_P(t), \qquad 0 \le t \le T, \qquad (6.26)$$
test outperformed the log-rank test, something that is not theoretically possible.
So, some kind of adjustment is needed before too much reliance is put on the
calculated p-value. The second reason is rather more serious, not easily overcome,
and stems from the somewhat non-linear specification of the model. Different
codings will give different results: coding the treatment variable (1,0) rather than
(0,1) will lead to different answers. Since there is often no natural way to code,
this is very problematic. It is quite plausible, for example, that the same set of
observations allows us to conclude that the survival experience of women differs
significantly from that of men and that, simultaneously, the survival experience
of men does not significantly differ from that of women. This is clearly not a
coherent summary of the data and has to be fixed before we could have confidence
in any results based on this test. A simple potential solution would be to calculate
both p-values, say, and then take the minimum. This would solve the coherence
problem but would certainly exacerbate the difficulty in exercising good control
over Type 1 error. And in the situation of several levels, or several covariates, it
would seem to be very difficult to anticipate the operating characteristics in any
broad sense.
Nonetheless, there is something rather attractive in this formulation and it
would be worth the extra effort to fix this coherence problem. The root of the
difficulty would stem from the lack of symmetry in Equation 6.26 and this obser-
vation may open up a path toward a solution. Further discussion is given in
Flandre and O’Quigley (2020).
1. Consider the approach of partitioning the time axis as a way to tackle non-
proportional hazards. For example, we might choose a partition consisting
of 3 intervals, any one of which respects the proportional hazards constraint
with an interval-specific regression coefficient. Find the simplest way of
writing this model down. How might you express a null hypothesis of no
effect against an alternative of a non-null effect? How would this expression
change if, aside from the null, the only possibility is of an effect that
diminishes through time?
2. When biological or other measures are expensive to make, sampling from
the risk set can enable great savings to be made. Explain why this would
be the case and consider the criteria to be used to decide the size of the
risk set sampled. Look up and describe the case-cohort design which has
been successfully used in large epidemiological studies.
3. Show formally that a non-proportional hazards model with a constant
covariate is equivalent to a proportional hazards one with a time-dependent
covariate. Does any such result hold for a non-proportional hazards model
with a non-constant covariate?
4. Suppose, for a binary group indicator, that the observations are generated
by a linear model with a constant regression coefficient. Show that such a
model is equivalent to a proportional hazards model with a time-dependent
covariate. Conclude that a linear model is equivalent to a non-proportional
hazards one.
7. Show formally that, if our covariate space does not include continu-
ous covariates, then any true situation can be modeled precisely by a
non-proportional hazards model. Conclude that, for any arbitrary situa-
tion, including that of continuous covariates, we can postulate a non-
proportional hazards model that is arbitrarily close to that generating the
observations.
Chapter 7
Model-based estimating equations
The regression effect process, described in Chapter 9, shapes our main approach
to inference. At its heart are differences between observations and their model-
based expectations. The flavor is very much that of linear estimating equations
(Appendix D.1). Before we study this process, we consider here an approach
to inference that makes a more direct appeal to estimating equations. The two
chapters are closely related and complement one another. This chapter leans
less heavily on stochastic processes and links in a natural and direct way to the
large body of theory available for estimating equations. Focusing attention on
the expectation operator, leaning upon different population models and different
working assumptions, makes several important results transparent. For example,
it is readily seen that the so-called partial likelihood estimator is not consistent for
average effect, E{β(T )}, under independent censoring and non-constant β(t).
One example we show indicates that, under heavy censoring, the commonly used
partial likelihood estimate converges to a value greater than 4 times its true
value. Linear estimating equations provide a way to investigate the statistical behavior
of estimates for small samples. Several examples are considered.
This chapter and Chapter 9 provide the inferential tools needed to analyze survival
data. Either approach provides several techniques and results that we can exploit.
Taken together we have a broad array of methods, and their properties, that,
when used carefully, can help gain deep understanding of real datasets that arise
in the setting of a survival study.
The earlier chapter on marginal survival is important in its own right and we
lean on the results of that chapter throughout this work. We need to keep in mind
the idea of marginal survival for two reasons: (1) it provides a natural backdrop to
the ideas of conditional survival and (2), together with the conditional distribution
of the covariate given T = t, we are able to consider the joint distribution of
covariate and survival time T. A central concern is conditional survival, where
we investigate the conditional distribution of survival given different potential
covariate configurations, as well as variables such as time elapsed. More generally
we are interested in survival distributions corresponding to transitions from one
state to another, conditional on being in some particular state or of having
mapped out some particular covariate path. The machinery that will enable us
to obtain insight into these conditional distributions is that of proportional and
non-proportional hazards regression.
When we consider any data at hand as having arisen from some experi-
ment, the most common framework for characterizing the joint distribution of
the covariate Z and survival T is one where the distribution of Z is fixed and
known, and the conditional survivorship distribution is the subject of our infer-
ential endeavors. Certainly, this characterization would bring the model to most
closely resemble the experiment as set-up. However, in order to accommodate
censoring, it is more useful to characterize the joint distribution of Z and T via
the conditional distribution of Z given T = t and the marginal distribution of
T . This is one of the reasons why we dealt first with the marginal distribution
of T. Having dealt with that we can now focus our attention on the conditional
distribution of Z given T . We can construct estimating equations based on these
ideas and from these build simple tests or make more general inferences.
The main theorem of proportional hazards regression, introduced by O’Quigley
(2008), generalizes earlier results of Schoenfeld (1980), O’Quigley and Flandre
(1994), and Xu and O’Quigley (2000). The theorem has several immediate corol-
laries and we can use these to write down estimating equations upon which we
can then construct suitable inferential procedures for our models. The regression
effect process of subsequent chapters also has a clear connection to this theorem.
While a particular choice of estimating equation can result in high efficiency when
model assumptions are correct or close to being correct, other equations may be
less efficient but still provide estimates which can be interpreted when model
assumptions are incorrect. For example, when the regression function β(t) might
vary with time, we are able to construct an estimating equation, the solution of
which provides a consistent estimate of E{β(T )}, the average effect. The usual
partial likelihood estimate fails to achieve this, and the resulting errors, depend-
ing on the level of censoring, even when independent, can be considerable. This
can be seen in Table 7.1 where the resulting bias, for high censoring rates, can
be more than 400%. Most currently available software makes no account of this
and, given that a non-constant β(t) will be more the rule than the exception,
the point is one worth keeping in mind.
7.3 Likelihood solution for parametric models
For almost any statistical model, use of the likelihood is usually the chosen
method for dealing with inference on unknown parameters. Bayesian inference,
in which prior information is available, can be viewed as a broadening of the
approach and, aside from the prior, it is again the likelihood function that will be
used to estimate parameters and carry out tests. Maximum likelihood estimates,
broadly speaking, have good properties and, for exponential families, a class to
which our models either belong or are close, we can even claim some optimality.
A useful property of maximum likelihood estimators of some parameter is that
the maximum likelihood estimator of some monotonic function of the parameter
is the same monotonic function of the maximum likelihood estimator of the
parameter itself. Survival functions themselves will often come under this heading
and so, once we have estimated parameters that provide the hazard rate, then we
immediately have estimates of survival. Variance expressions are also obtained
quite easily, either directly or by the approximation techniques of Appendix A.10.
Keeping in mind that our purpose is to make inference on the unknown regression
coefficients, invariant to monotonic increasing transformations on T, we might
also consider lesser used likelihood approaches such as marginal likelihood and
conditional likelihood. It can be seen that these kinds of approaches lead to
the so-called partial likelihood. In practice we will treat the partial likelihood as
though it were any regular likelihood, the justification for this being possible
through several different arguments.
For fixed covariates, in the presence of parametric assumptions concerning
λ0 (t), inference can be carried out on the basis of the following theorem that
simply extends that of Theorem 3.1. We suppose that the survival distribution
is completely specified via some parametric model, the parameter vector being
say θ. A subset of θ is a vector of regression coefficients, β, to the covariates in
the model. The usual working assumption is that of a conditionally independent
censoring mechanism, i.e., the pair (T, C) is independent given Z. This would
mean, for instance, that within any covariate grouping, T and C are independent
but that C itself can depend on the covariate. Such dependence would generally
induce a marginal dependency between C and T .
log Li (θ) = I(δi = 1) log f (xi |zi ; θ) + I(δi = 0) log S(xi |zi ; θ). (7.1)
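As a concrete instance of (7.1), the sketch below writes down the log-likelihood for an exponential regression model, λ(t|z) = λ e^{βz}, under censoring, and maximizes it numerically; the simulated data, the parameter values, and the reparametrization in terms of log λ are all illustrative choices of ours.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, lam0, beta0 = 200, 0.1, 0.7
z = rng.integers(0, 2, n).astype(float)
t = rng.exponential(1.0 / (lam0 * np.exp(beta0 * z)))   # true failure times
c = rng.exponential(15.0, n)                            # independent censoring
x, delta = np.minimum(t, c), (t <= c).astype(float)

def neg_log_lik(theta):
    lam, beta = np.exp(theta[0]), theta[1]              # work with log(lambda)
    rate = lam * np.exp(beta * z)
    # I(delta=1) log f + I(delta=0) log S, with f = rate*exp(-rate x), S = exp(-rate x)
    return -np.sum(delta * np.log(rate) - rate * x)

fit = minimize(neg_log_lik, x0=np.zeros(2))
print("lambda_hat, beta_hat:", np.exp(fit.x[0]), fit.x[1])
```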
The maximum likelihood estimates obtain the values of θ, denoted θ̂, that
maximize log L(θ) over the parameter space. For log L(θ) a differentiable function
of θ, this value is then the solution to the estimating equation U (θ) = 0 where
$U(\theta) = \sum_i \partial \log L_i(\theta)/\partial\theta$. Next notice that, at the true value of θ, i.e., the value
which supposedly generates the observations, denoted θ0 , we have $\mathrm{Var}(U(\theta_0)) =
E\{U^2(\theta_0)\} = E\{I(\theta_0)\}$ where
$$I(\theta) \;=\; \sum_{i=1}^{n} I_i(\theta) \;=\; -\partial^2\log L(\theta)/\partial\theta^2 \;=\; -\sum_{i=1}^{n}\partial^2\log L_i(\theta)/\partial\theta^2 .$$
As for likelihood in general, some care is needed in thinking about the meaning
of these expressions and the fact that the operators E(·) and Var(·) are taken
with respect to the distribution of the pairs (xi , δi ) but with θ0 fixed. The score
equation is U (θ̂) = 0 and the large sample variance is approximated by Var(θ̂) ≈
1/I(θ̂). Newton-Raphson iteration is set up by a simple application of the mean
value theorem so that
where θ1 is some starting value, often zero, to the iterative cycle. The iteration is
brought to a halt once we achieve some desired level of precision. An interesting
result that is not well known and is also, at first glance, surprising is that θ2
is a fully efficient estimator. It does no less well than θ̂ and this is because,
while subsequent iterations may bring us closer and closer to θ̂, they do not, on
average, bring us any closer to θ0 . A one step estimator such as θ2 can save time
when dealing with onerous simulations.
Note that likelihood theory would imply that we work with the expected
information (called Fisher information) E{I(θ)} but in view of Efron and Hink-
ley (1978) and the practical difficulty of specifying the censoring we usually prefer
to work with a quantity allowing us to consistently estimate the expected infor-
mation, in particular the observed information.
Large sample inference can be based on any one of the three tests derived
from the likelihood. For the score test there is no need to carry out parameter
estimation or to maximize some function. Many well-established tests can be
derived in this way. In exponential families, and also in the so-called curved exponential
families (Efron et al., 1978), such tests reduce to contrasting some observed
value with its expected value under the model. Good confidence intervals (Cox and
Hinkley, 1979) can be constructed from “good” tests. For the exponential family
class of distributions the likelihood ratio forms a uniformly most powerful test
and, as such, qualifies as a “good” test in the sense of Cox and Hinkley. The
other tests are asymptotically equivalent so that confidence intervals based on
the above test procedures will agree as sample size increases. Also, we can use
such intervals for other quantities of interest such as the survivorship function
since this function depends on these unknown parameters.
Recall from Chapter 3 that we can estimate the survival function as S(t; θ̂).
If Θα provides a 100(1 − α)% confidence region for the vector θ, then we can
obtain a 100(1 − α)% confidence region for S(t; θ) in the following way. For each
t let
$$S_\alpha^{+}(t;\hat\theta) \;=\; \sup_{\theta\in\Theta_\alpha} S(t;\theta)\,, \qquad S_\alpha^{-}(t;\hat\theta) \;=\; \inf_{\theta\in\Theta_\alpha} S(t;\theta)\,, \qquad (7.3)$$
then Sα+ (t; θ̂) and Sα− (t; θ̂) form the endpoints of the 100(1 − α)% confidence
interval for S(t; θ). Such a quantity may not be so easy to calculate in general,
simulating from Θα or subdividing the space being an effective way to approx-
imate the interval. Some situations nonetheless simplify. The most straight-
forward is where the survival function is a monotonic function of the one-
dimensional parameter θ. As an illustration, the scalar location parameter, θ,
for the exponential model corresponds to the mean. We have that S(t; θ) is
monotonic in θ. For such cases it is only necessary to invert any interval for θ
to obtain an interval with the same coverage properties for S(t; θ). Denoting the
upper limit of the 100(1 − α)% confidence interval for θ as $\theta_\alpha^{+}$ and the lower
limit as $\theta_\alpha^{-}$, we can then write $S_\alpha^{+}(t;\hat\theta) = S(t;\theta_\alpha^{-})$
and $S_\alpha^{-}(t;\hat\theta) = S(t;\theta_\alpha^{+})$. Note that these intervals are calcu-
lated under the assumption that t is fixed. For the exponential model, since the
whole distribution is defined by θ, the confidence intervals calculated pointwise
at each t also provide confidence bands for the whole distribution.
Consider now the case of two groups indicated by a binary covariate z taking the
value one or zero. We use the variable wi = I(zi = 1) to indicate which group the subject
is from. From this and Theorem 7.1 we have:
Corollary 7.1. For the 2-sample exponential model, the likelihood satisfies
$$\log L(\lambda,\beta) \;=\; k\log\lambda + \beta k_2 - \lambda\left\{\sum_{j=1}^{n} x_j(1-w_j) + e^{\beta}\sum_{j=1}^{n} x_j w_j\right\},$$
where $w_i = 1_{\{z_i=1\}}$ and where there are k1 distinct failures in group 1, k2 in group
2, and k = k1 + k2 .
Differentiating the log-likelihood with respect to both λ and β and equating
both partial derivatives to zero we readily obtain an analytic solution to the pair
of equations given by:
Corollary 7.2. The maximum likelihood estimates β̂ and λ̂ for the two-group
exponential model are written as
$$\hat\beta \;=\; \log\frac{k_2}{\sum_{j=1}^{n} x_j w_j} \;-\; \log\frac{k_1}{\sum_{j=1}^{n} x_j(1-w_j)}\,; \qquad \hat\lambda \;=\; \frac{k_1}{\sum_{j=1}^{n} x_j(1-w_j)}\,.$$
It follows immediately that $\hat\lambda_1 = \hat\lambda$ and that $\hat\lambda_2 = \hat\lambda\exp(\hat\beta) = k_2/\sum_{j=1}^{n} x_j w_j$.
In order to carry out tests and construct confidence intervals we construct the
matrix of second derivatives of the log-likelihood, I(λ, β), obtaining
$$I(\lambda,\beta) \;=\; \begin{pmatrix} -\partial^2\log L(\lambda,\beta)/\partial\lambda^2 & -\partial^2\log L(\lambda,\beta)/\partial\lambda\partial\beta\\[2pt] -\partial^2\log L(\lambda,\beta)/\partial\lambda\partial\beta & -\partial^2\log L(\lambda,\beta)/\partial\beta^2 \end{pmatrix} \;=\; \begin{pmatrix} k/\lambda^2 & e^{\beta}\sum_j x_j w_j\\[2pt] e^{\beta}\sum_j x_j w_j & \lambda e^{\beta}\sum_j x_j w_j \end{pmatrix}.$$
The advantage of the two parameter case is that the matrix can be explicitly
inverted. We then have:
Corollary 7.3. Let $D = \lambda^{-1} e^{\beta}\sum_j x_j w_j\,\{k - \lambda e^{\beta}\sum_j x_j w_j\}$. Then, for the two-group
exponential model the inverse of the information matrix is given by
$$I^{-1}(\lambda,\beta) \;=\; D^{-1}\begin{pmatrix} \lambda e^{\beta}\sum_j x_j w_j & -e^{\beta}\sum_j x_j w_j\\[2pt] -e^{\beta}\sum_j x_j w_j & k/\lambda^2 \end{pmatrix}.$$
The score test is given by $X_S^2 = U(\hat\lambda, 0)\,I^{-1}(\hat\lambda, 0)\,U(\hat\lambda, 0)$. Following some
simple calculations and recalling that $\exp(-\hat\beta) = \sum_{j=1}^{n} x_j w_j/k_2$, we have:
Corollary 7.4. For the two-group exponential model the score test is given by
At first glance the above expression, involving as it does β̂, might appear
to contradict our contention that the score statistic does not require estimation.
Corollary 7.5. For the two-group exponential model the likelihood ratio test is
given by
$$X_L^2 \;=\; 2\left\{ k_2\log\frac{k_2}{\sum_j x_j w_j} + k_1\log\frac{k_1}{\sum_j x_j(1-w_j)} - k\log\frac{k}{\sum_j x_j}\right\}.$$
The third of the tests based on the likelihood, the Wald test, is also straight-
forward to calculate and we have the corresponding lemma:
Corollary 7.6. For the two-group exponential model, the Wald test is given by
$$X_W^2 \;=\; k^{-1} k_1 k_2\,\hat\beta^{\,2}.$$
For large samples we anticipate the three different tests to give very similar
results. For smaller samples the Wald test, although the most commonly used,
is generally considered to be the least robust. In particular, a monotonic trans-
formation of the parameter will, typically, lead to a different value of the test
statistic.
For the Freireich data, the maximum likelihood estimates of the hazard rates
in the two groups are λ̂1 = 9/359 = 0.025 and λ̂2 = 21/182 = 0.115.
We might note that the above results are those that we would have obtained
had we used the exponential model separately in each of the groups. In this par-
ticular case then the model structure has not added anything or allowed us to
achieve any greater precision in our analysis. The reason is simple. The expo-
nential model only requires a single parameter. In the above model we have two
groups and, allowing these to be parameterized by two parameters, the rate λ
and the multiplicative factor exp(β), we have a saturated model. The saturated
model is entirely equivalent to using two parameters, λ1 and λ2 , in each of the
two groups separately. More generally, for exponential models with many groups
or with continuous covariates, or for other models, we will not usually obtain the
same results from separate analyses as those we obtain via the model structure.
The model structure will, as long as it is not seriously misspecified, usually lead
to inferential gains in terms of precision of parameter estimates.
Since exp(β̂) = 21/182 × 359/9 = 4.60 we have that the estimate of the
log-relative risk parameter β̂ is 1.53. We also have that the score test $X_S^2$ =
17.8, the Wald test $X_W^2$ = 14.7, and the likelihood ratio test $X_L^2$ = 16.5. The
agreement between the test statistics is good and, in all cases, the significance
level is sufficiently strong to enable us to conclude in favor of clear evidence of a
difference between the groups.
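The corollaries above are easily checked numerically. The sketch below takes the group totals as they can be read off the expression for exp(β̂), namely k1 = 9 events with total follow-up 359 in one group and k2 = 21 events with total follow-up 182 in the other, and, since the closed-form score expression is not reproduced here, the score statistic is computed generically as U(λ̂, 0) I⁻¹(λ̂, 0) U(λ̂, 0).

```python
import numpy as np

k1, T1 = 9, 359.0     # events and total follow-up, group with w = 0
k2, T2 = 21, 182.0    # events and total follow-up, group with w = 1
k, T = k1 + k2, T1 + T2

# Corollary 7.2: maximum likelihood estimates
beta_hat = np.log(k2 / T2) - np.log(k1 / T1)
lam_hat = k1 / T1
print(f"beta_hat = {beta_hat:.2f}, exp(beta_hat) = {np.exp(beta_hat):.2f}")

# Wald test (Corollary 7.6) and likelihood ratio test (Corollary 7.5)
X_wald = k1 * k2 * beta_hat**2 / k
X_lr = 2 * (k2 * np.log(k2 / T2) + k1 * np.log(k1 / T1) - k * np.log(k / T))
print(f"Wald = {X_wald:.1f}, LR = {X_lr:.1f}")

# Score test: U(lam0, 0) I^{-1}(lam0, 0) U(lam0, 0) with lam0 = k / T
lam0 = k / T
U = np.array([k / lam0 - T, k2 - lam0 * T2])          # (d/d lambda, d/d beta)
I = np.array([[k / lam0**2, T2], [T2, lam0 * T2]])    # information at (lam0, 0)
X_score = U @ np.linalg.inv(I) @ U
print(f"Score = {X_score:.1f}")
```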
Had there been no censoring then k = n, the sample size, and $\sum_{j=1}^{n} t_j$ corresponds
to a sum of n independent random variables each exponential with
parameter λ. We could therefore treat n/λ̂ as a gamma variate with parameters
(λ, n). In view of the consistency of λ̂, when there is censoring, we can take k/λ̂
as a gamma variate with parameters (λ, k), when k < n. This is not an exact
result, since it hinges on a large sample approximation, but it may provide greater
accuracy than the large sample normal approximation.
Recall from Chapter 3 that we can make use of standard tables by multiplying
each term of the sum by 2λ. The result of this product is a sum of n exponential
variates in which each component of the sum has mean equal to 2. This
corresponds to a gamma (2, n) distribution which is also equivalent to a chi-
square distribution with 2n degrees of freedom. Taking the range of values of
2kλ/λ̂ to be between χα/2 and χ1−α/2 gives a 100(1 − α)% confidence interval
for λ. For the Freireich data we obtained a 95% CI = (0.0115, 0.0439). On the
basis of intervals for λ, we can obtain intervals for the survivorship function which
is, in this particular case, a monotonic function of λ. The upper and lower limits
of the 100(1 − α)% confidence interval are denoted by Sα+ (t; λ̂) and Sα− (t; λ̂),
respectively. We write:
$$\left[S_\alpha^{+}(t;\hat\lambda),\; S_\alpha^{-}(t;\hat\lambda)\right] \;=\; \left[\exp\!\left(-\frac{\hat\lambda\,\chi_{\alpha/2}}{2k}\,t\right),\; \exp\!\left(-\frac{\hat\lambda\,\chi_{1-\alpha/2}}{2k}\,t\right)\right]. \qquad (7.4)$$
An alternative is to use a large sample normal approximation for λ̂, in which case
the corresponding expression for $S_\alpha^{-}(t;\hat\lambda)$ is obtained by replacing the percentile
$z_{1-\alpha/2}$ by $z_{\alpha/2}$ of the standard normal distribution. Agreement between these
two approximations appears to be very close. It would be of interest to have a
more detailed comparison between the approaches.
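A sketch of the chi-square based interval just described, applied to the group with k = 9 events and total follow-up time 359 (the group whose estimated rate is 0.025); the illustrative time point t and the use of scipy's chi-square quantile function are our own choices.

```python
from scipy.stats import chi2
import numpy as np

k, total_time = 9, 359.0
lam_hat = k / total_time

# 2 k lambda / lambda_hat is treated as chi-square with 2k degrees of freedom
lo = lam_hat * chi2.ppf(0.025, 2 * k) / (2 * k)
hi = lam_hat * chi2.ppf(0.975, 2 * k) / (2 * k)
print(f"95% CI for lambda: ({lo:.4f}, {hi:.4f})")   # roughly (0.0115, 0.0439)

# Corresponding pointwise interval for S(t) = exp(-lambda t), as in (7.4)
t = 52.0   # an arbitrary illustrative time point
print(f"S({t:.0f}) in ({np.exp(-hi * t):.3f}, {np.exp(-lo * t):.3f})")
```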
For the case of three groups, defined by the pair of binary indicator variables, Z1
and Z2 , the model states that $S(t\,|\,Z_1, Z_2) = \{S_1(t)\}^{\alpha}$ where, in this more complex
set-up, $\log\alpha = \beta_1 Z_1 + \beta_2 Z_2$. Here, in exactly the same way, we obtain analytic
expressions for β1 and β2 as the arithmetic difference between the log-log
transformations of the respective marginal survival curves.
For two independent groups G1 and G2 we can consider two separate estima-
tors, Ŝ1 (t) and Ŝ2 (t). Since we are assuming a proportional hazards model, we will
carry over this restriction to the sample-based estimates whereby $\hat S_2(t) = \{\hat S_1(t)\}^{\alpha}$
and where, as before, α = exp(β). In view of the above result we have:
All of the simple results that are available to us when data are generated by
an exponential distribution can be used. In particular, if we wish to compare the
means of two distributions, both subject to censoring, then we can transform one
of them to standard exponential via its empirical survival function, then use this
same transformation on the other group. The simple results for contrasting two
censored exponential samples can then be applied even though, at least initially,
the data arose from samples generated by some other mechanism.
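The following sketch illustrates the idea on simulated data: a crude Kaplan-Meier estimate for the first group supplies the transformation ε = −log Ŝ_1(t), the same transformation is applied to the second group, and the transformed, still censored, samples are then compared as two exponential samples. The Weibull form of the data and the censoring rates are hypothetical choices made for illustration only.

```python
import numpy as np

def km_survival(times, events):
    """Crude Kaplan-Meier estimate returned as a step function (times, S)."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    at_risk = np.arange(len(t), 0, -1)
    return t, np.cumprod(1.0 - d / at_risk)

def km_eval(t_grid, surv, query):
    """Value of the Kaplan-Meier step function just before each query time."""
    idx = np.searchsorted(t_grid, query, side="left")
    out = np.where(idx == 0, 1.0, surv[np.clip(idx - 1, 0, len(surv) - 1)])
    return np.clip(out, 1e-12, 1.0)

def censor(t_true, c):
    return np.minimum(t_true, c), (t_true <= c).astype(float)

# Hypothetical censored samples from an arbitrary (non-exponential) mechanism
rng = np.random.default_rng(1)
x1, d1 = censor(rng.weibull(1.5, 60), rng.exponential(2.0, 60))
x2, d2 = censor(0.7 * rng.weibull(1.5, 60), rng.exponential(2.0, 60))

t_grid, s1 = km_survival(x1, d1)
eps1 = -np.log(km_eval(t_grid, s1, x1))   # approximately standard exponential
eps2 = -np.log(km_eval(t_grid, s1, x2))   # same transformation for group 2

# The transformed samples can now be contrasted with the two-sample
# exponential machinery of the previous paragraphs.
rate1, rate2 = d1.sum() / eps1.sum(), d2.sum() / eps2.sum()
print(rate1, rate2, np.log(rate2 / rate1))
```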
The most immediate departure from proportional hazards would be one where
effects decline through time and, as mentioned before, perhaps even changing
direction (Stablein et al., 1981). The standardized cumulative score, however the
moments are calculated, would, under such departures, increase (or decrease)
steadily until the increase (or decrease) dies away and the process would pro-
ceed on average horizontally or, if effects change direction, make its way back
toward the origin. A test based on the maximum would have the ability to pick
up this kind of behavior. Again, the visual impression given by the cumulative
standardized score process can, of itself, suggest the nature of the departure
from proportional hazards. This kind of test is not focused on parameters in the
model other than the regression effect given by β or possibly β(t). The goal is to
consider the proportionality or lack of such proportionality and, otherwise, how
well the overall model may fit is secondary (Figure 7.1).
Recall that the data consist of the observations (Z_i(t), Y_i(t), t ≤ X_i; X_i), i = 1, . . . , n. The Z_i are the covariates (possibly time-dependent), X_i = min(T_i, C_i) is the observed survival time, the smaller of the censoring time and the actual survival time, and the Y_i(t) are time-dependent indicators taking the value one as long as the ith subject is at risk at time t and zero otherwise. For the sake of large sample constructions we take Y_i(t) to be left continuous. At some level we will be making an assumption of independence, an
assumption that can be challenged via the data themselves, but that is often left
unchallenged, the physical context providing the main guide. Mostly, we think
of independence as existing across the indices i (i = 1, . . . , n), i.e., the triplets
{Zi (t), Yi (t), Xi ; i = 1, . . . , n}. It is helpful to our notational construction to have:
The reason for this definition is to unify notation. Our practical interest will be on
sums of quantities such as Zi (Xi ) with i ranging from 1 to n. Using the Stieltjes
integral (Appendix A), we will be able to write such sums as integrals with respect
to an empirical process. In view of the Helly-Bray theorem (Appendix A.2) this
makes it easier to gain an intuitive grasp on the population structure behind
the various statistics of interest. Both T and C are assumed to have supports
on some finite interval, the first of which is denoted T . The time-dependent
covariate Z(·) is assumed to be a left-continuous stochastic process and, for
notational simplicity, is taken to be of dimension one whenever possible.
We use the function Pr(A) to return the probability measure associated with
the event A, the reference sets for this being the largest probability space in the
context. In other words, we have not restricted our outcome space by conditioning
on any particular events. It is usually clear which probability space is assumed,
if not we include F to denote this space preceded by a colon, i.e., we write
Pr(A : F). The function P(A) also returns a probability measure associated with
the event A and we reserve this usage for those cases where significant reduction
of the original probability space has taken place. In other words, we view P(A) as
a probability arising after conditioning, and, in general, after conditioning on a
substantial part of the data structure. Again, the context is usually sufficient to
know what is being conditioned on. If not this is made explicit. It is of interest,
although not generally exploited, to observe that, under repeated sampling, A
under P(A) will generally converge in distribution to that of A under Pr(A). The
discrete probabilities πi (β(t), t), defined in Equation 7.10, are so central to the
development that they are given a notation all of their own. The expectation
operator E(·) is typically associated with Pr(·) whereas the expectation operator E(·|t) is associated with the model-based π_i(β(t), t). Let F(t) = Pr(T < t), D(t) = Pr(C < t), and H(t) = F(t){1 − D(t)} − ∫_0^t F(u) dD(u).
For each subject i we observe Xi = min(Ti , Ci ), and δi = I(Ti ≤ Ci ) so that
δi takes the value one if the ith subject corresponds to a failure and is zero if the
subject corresponds to a censored observation. A more general situation allows a
subject to be dynamically censored in that he or she can move in and out of the
risk set. To do this we define the “at-risk” indicator Yi (t) where Yi (t) = I(Xi ≥ t).
The events on the ith individual are counted by N_i(t) = I{T_i ≤ t, T_i ≤ C_i}, and N̄(t) = Σ_{i=1}^n N_i(t) counts the number of events before t. Some other sums arise repeatedly, namely the Andersen and Gill quantities

S^(r)(β, t) = n^{-1} Σ_{i=1}^n Y_i(t) Z_i(t)^r exp{β Z_i(t)},   s^(r)(β, t) = E S^(r)(β, t),

for r = 0, 1, 2, where the expectations are taken with respect to the true distribution of (T, C, Z(·)). Define also
V(β, t) = S^(2)(β, t)/S^(0)(β, t) − {S^(1)(β, t)/S^(0)(β, t)}²,   v(β, t) = s^(2)(β, t)/s^(0)(β, t) − {s^(1)(β, t)/s^(0)(β, t)}².   (7.9)
The Andersen and Gill notation is now classic in this context. Their notation lends
itself more readily to large sample theory based upon Rebolledo’s multivariate
central limit theorem for martingales and stochastic integrals. We will keep this
notation in mind for this chapter although, for subsequent chapters, we use
a lighter notation since our approach to inference does not appeal to special
central limit theorems (Rebolledo’s theorem in particular). One reason for using
the Andersen and Gill notation in this chapter is to help the reader familiar with
that theory to join up the dots and readily see the connections with chapters 9,
10, and 11. The required conditions for the Andersen and Gill theory to apply are
slightly broader than those of our development although this advantage is more of
a theoretical than a practical one. For their results, as well as ours, the censorship
is restricted in such a way that, for large samples, there remains information on
F in the tails. The conditional means and the conditional variances, E_{β(t)}(Z|t) and V_{β(t)}(Z|t), introduced immediately below, are related to the above via V(β, t) ≡ V_β(Z|t) and S^(1)(β, t)/S^(0)(β, t) ≡ E_β(Z|t). In the counting process framework
of Andersen and Gill (1982), we imagine n as remaining fixed, the asymptotic results following from the theory for n-dimensional counting
processes, in which we understand the expectation operator E to be with respect
to infinitely many repetitions of the process. Subsequently we allow n to increase
without bound. For the quantities Eβ(t) (Z k |t) we take the E operator to be these
same quantities when n grows without bound.
We most often view time as providing the set of indices to certain stochastic
processes, so that, for example, we consider Z(t) to be a random variable having
different distributions for different t. Also, the failure time variable T can be
viewed as a non-negative random variable with distribution F (t) and, whenever
the set of indices t to the stochastic process coincides with the support for T ,
then not only can we talk about the random variables Z(t) for which the dis-
tribution corresponds to Pr(Z ≤ z|T = t) but also marginal quantities such as
the random variable Z(T ) having distribution G(z) = Pr(Z ≤ z). An important
result concerning the conditional distribution of Z(t) given T = t follows. How-
ever, the true population joint distribution of (T, Z) turns out to be of little
interest. Concerning T , we will view its support mostly in terms of providing
indices to a stochastic process. The variable Z depends of course on rather arbi-
trary design features. In order to study dependency, quantified by β(t), we focus
on the conditional distribution of Z given T = t.
The π_i(β(t), t) are easily seen to be bona fide probabilities (for all real values of β(t)) since π_i ≥ 0 and Σ_i π_i = 1. Note that this continues to hold for values
of β(t) different from those generating the data, and even when the model is
incorrectly specified. As a consequence, replacing β by β̂ results in a probability
distribution that is still valid but different from the true one. Means and vari-
ances with respect to this distribution maintain their interpretation as means and
variances.
Under the proportional hazards assumption, i.e., the constraint β(t) = β, the
product of the π’s over the observed failure times gives the partial likelihood (Cox
1972, 1975). When β = 0, πi (0, t) is the empirical distribution that assigns equal
weight to each sample subject in the risk set. Based on the πi (β(t), t) we have:
Definition 7.3. Conditional moments of Z with respect to π_i(β(t), t) are given by

E_{β(t)}(Z^k|t) = Σ_{i=1}^n Z_i^k(t) π_i(β(t), t),   k = 1, 2, . . . .   (7.11)
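In practice these quantities are elementary weighted sums over the risk set. A minimal sketch, for a time-fixed scalar covariate, might look as follows; the data values are hypothetical and π_i(β, t) is computed directly from its definition.

```python
import numpy as np

def pi_weights(beta, t, x, z):
    """Discrete probabilities pi_i(beta, t) over the risk set at time t."""
    at_risk = (x >= t).astype(float)                  # Y_i(t)
    w = at_risk * np.exp(beta * z)
    return w / w.sum()

def cond_moment(beta, t, x, z, k=1):
    """E_beta(Z^k | t) of Definition 7.3."""
    return np.sum(pi_weights(beta, t, x, z) * z**k)

def cond_var(beta, t, x, z):
    """V_beta(Z | t) = E_beta(Z^2 | t) - E_beta(Z | t)^2."""
    return cond_moment(beta, t, x, z, 2) - cond_moment(beta, t, x, z, 1)**2

# Hypothetical observed times and a fixed binary covariate
x = np.array([2.0, 3.5, 4.1, 6.0, 7.2, 9.9])
z = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

print(pi_weights(0.7, 4.1, x, z))          # sums to one, zero off the risk set
print(cond_moment(0.7, 4.1, x, z), cond_var(0.7, 4.1, x, z))
```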
These two definitions are all that we need in order to set about building the
structures upon which inference is based. This is particularly so when we are able
to assume an independent censoring mechanism, although the weaker assumption
of a conditionally independent censoring mechanism (see Chapter 2) will mostly
cause no conceptual difficulties; simply a slightly more burdensome notation.
Another, somewhat natural, definition will also be appealed to on occasion and
this concerns unconditional expectations.
Definition 7.4. Marginal moments of Z with respect to the bivariate distribution characterized by π_i(β(t), t) and F(t) are given by

E_{β(t)}(Z^k) = ∫ E_{β(t)}(Z^k|t) dF(t),   k = 1, 2, . . . .   (7.12)
Note that when censoring does not depend upon z then φ(z, t) will depend upon
neither z nor t and is, in fact, equal to one. Otherwise, under a conditionally
independent censoring assumption, we can consistently estimate φ(z, t) and we
call this φ̂(z, t). This is not explored in this text.
Theorem 7.2. (O’Quigley 2003). Under model (4.2) and assuming β(t)
known, the conditional distribution function of Z(t) given T = t is given by
P{Z(t) ≤ z | T = t} = Σ_{z_i ≤ z} Y_i(t) exp{β(t)z_i(t)} φ̂(z_i, t) / Σ_{j=1}^n Y_j(t) exp{β(t)z_j(t)} φ̂(z_j, t).   (7.13)
Corollary 7.7. Under model (4.2) and an independent censorship, assuming β(t)
known, the conditional distribution function of Z(t) given T = t is given by
P(Z(t) ≤ z | T = t) = Σ_{j=1}^n π_j(β(t), t) I(Z_j(t) ≤ z).   (7.14)
The observation we would like to make here is that we can fully describe a
random variable indexed by t, i.e., a stochastic process. This idea underlies the
development of the regression effect process described in the following chapters.
All of our inferences can be based on this. In essence, we first fix t and then
we fix our attention on the conditional distribution of Z given that T = t. This
distribution brings into play the models of interest. These models are character-
ized by this distribution making it straightforward to construct tests, to estimate
parameters, and to build confidence regions. Indeed, under the broader censoring
definition of conditional independence, common in the survival context, we can
still make the same basic observation. In this case we condition upon something
more complex than just T = t. The actual random outcome that we condition
upon is of less importance than the simple fact that we are able to describe sets
of conditional distributions all indexed by t, i.e., a stochastic process indexed by
t. Specifically
In practical data analysis the quantity β(t) may be replaced by a value con-
strained by some hypothesis or an estimate. The quantity Vβ(t) (Z|t) can be
viewed as a conditional variance which may vary little with t, in a way analogous to the residual variance in linear regression which, under classic assump-
tions, remains constant with different levels of the independent variable. Since
Vβ(t) (Z|t) may change with t, even if not a lot, it is of interest to consider some
average quantity and so we also introduce
Definition 7.7. σ² = E{V_{β(t)}(Z|t)} = ∫ V_{β(t)}(Z|t) dF(t).
Interpretation requires some care. For example, although E Vβ̂ (Z|t) is, in
some sense, a marginal quantity, it is not the marginal variance of Z since we
have neglected the variance of Eβ(t) (Z(t)|t) with respect to the distribution of T.
The easiest case to interpret is the one where we have an independent censoring
mechanism (Equation 7.14). However, we do not need to be very concerned
about any interpretation difficulty, arising for instance in Equation 7.15 where
the censoring time appears in the expression, since, in this or the simpler case,
all that matters to us is that our observations can be considered as arising from
some process, indexed by t and, for this process, we are able, under the model,
to consistently estimate the mean and the variance of the quantities that we
observe. It is also useful to note another natural relation between Vβ (Z|t) and
Eβ (Z|t) since
This relation is readily verified for fixed β. In the case of time-dependent β(t)
then, at each given value of t, it is again clear that the same relation holds. The
result constitutes one of the building blocks in the overall inferential construction
and, under weak conditions, essentially no more than Z being bounded, then it
also follows that
Vβ (Z|t) = ∂ Eβ (Z|t)/∂β = ∂ Eβ (Z|t) /∂β.
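The relation can be checked numerically: for weights proportional to exp(βZ_i) over a risk set, a central difference of E_β(Z|t) in β should reproduce V_β(Z|t). The sketch below does this for hypothetical data with a time-fixed covariate.

```python
import numpy as np

def e_and_v(beta, t, x, z):
    """Model-based conditional mean and variance of Z over the risk set at t."""
    w = (x >= t) * np.exp(beta * z)
    p = w / w.sum()
    m = np.sum(p * z)
    return m, np.sum(p * z**2) - m**2

x = np.array([1.2, 2.4, 3.1, 4.8, 6.5, 7.7, 9.0])
z = np.array([0.3, -1.0, 0.7, 1.5, -0.2, 0.9, 0.1])

beta, t, h = 0.4, 3.1, 1e-5
m_plus, _ = e_and_v(beta + h, t, x, z)
m_minus, _ = e_and_v(beta - h, t, x, z)
_, v = e_and_v(beta, t, x, z)

# Central difference of E_beta(Z|t) in beta should match V_beta(Z|t)
print((m_plus - m_minus) / (2 * h), v)
```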
Essentially all the information we need, for almost any conceivable statistical
goal, arising from considerations of any of the models considered, is contained
in the joint probabilities πi (β(t), t) of the fundamental definition 7.2. We are
often interested, in the multivariate setting for example, in the evaluation of the
effects of some factor while having controlled for others. This can be immediately
accommodated. Specifically, taking Z to be of some dimension greater than one (β being of the same dimension), writing Z^T = (Z_1^T, Z_2^T) and Z_i^T = (Z_{1i}^T, Z_{2i}^T),
and then summing over the multivariate probabilities, we have two obvious exten-
sions to Corollaries 7.7 and 7.8.
The corollary enables component-wise inference. We can consider the com-
ponents of the vector Zi individually. Also we could study some functions of the
components, usually say a simple linear combination of the components such as
the prognostic index. Note also that
where in Definition 7.2 for π_j(β(t), t) we take β(t)Z_j(t) to be an inner product, which we may prefer to write using boldface or as β(t)^T Z_j(t), and where Z_j(t)
are the observed values of the vector Z(t) for the jth subject. Also, by Z2 (t) ≤ z
we mean that all of the scalar components of Z2 (t) are less than or equal to
the corresponding scalar components of z. As for the corollaries and definitions
following Corollaries 7.7 and 7.8 they have obvious equivalents in the multivariate
setting and so we can readily write down expressions for expectations, variances,
and covariances as well as their corresponding estimates.
where s takes integer values 1, ..., m. In view of the equivalence between strat-
ified models and partially proportional hazards models described in the previ-
ous chapter, the main theorem and its corollaries apply immediately. However,
in light of the special importance of stratified models, as proportional hazards
models with relaxed assumptions, it will be helpful to our development to devote
a few words to this case. Analogous to the above definition for πi (β(t), t), and
using the, possibly time-dependent, stratum indicator s(t) we now define these
probabilities via
Definition 7.8. For the stratified model, having strata s = 1, . . . , m, the dis-
crete probabilities πi (β(t), t) are now given by
When there is a single stratum then this definition coincides with the earlier
one and, indeed, we use the same πi (β(t), t) for both situations, since it is only
used indirectly and there is no risk of confusion. Under equation (4.3), i.e., the
constraint β(t) = β, the product of the π’s over the observed failure times gives
the so-called stratified partial likelihood (Kalbfleisch and Prentice, 2002). The
series of above definitions for the non-stratified model, in particular Definition
7.2, theorems, and corollaries, all carry over in an obvious way to the stratified
model and we do not propose any additional notation. It is usually clear from
the context although it is worth making some remarks. Firstly, we have no direct
interest in the distribution of Z given t (note that this distribution depends on
the distribution of Z given T > 0, a distribution which corresponds to our design
and is quite arbitrary).
We will exploit the main theorem in order to make inferences on β and, in the
stratified case, we would also condition upon the strata from which transitions
can be made. In practice, we contrast the observations Zi (Xi ), made at time
point Xi at which an event occurs (δi = 1) with those subjects at risk of the same
event. The “at-risk” indicator, Y (s(t), t), makes this very simple to express. We
can use Y (s(t), t) to single out appropriate groups for comparison. This formalizes
a standard technique in epidemiology whereby the groups for comparison may be
matched by not just age but by other variables. Such variables have then been
controlled for and eliminated from the analysis. Their own specific effects can
be quite general and we are not in a position to estimate them. Very complex
situations, such as subjects moving in and out of risk categories, can be easily
modeled by the use of these indicator variables.
and where, mostly, β(t) is not time-varying, being equal to some unknown con-
stant. The most common choices for the function R(r) are exp(r), in which
case we recover the usual model, and 1 + r which leads to the so-called addi-
tive model. Since both λ(t|Z) and λ0 are necessarily positive we would generally
need constraints on the function R(r). In practice this can be a little bothersome
and is, among several other good reasons, a cause for favoring the multiplicative
risk model exp(r) over the additive risk model 1 + r. If we replace our earlier
definition for πi (β(t), t) by
π_i(β(t), t) = Y_i(t) R{β(t)Z_i(t)} / Σ_{j=1}^n Y_j(t) R{β(t)Z_j(t)},   (7.21)

then all of the above definitions, theorems, and corollaries have imme-
diate analogues and we do not write them out explicitly. Apart from one inter-
esting exception, which we look at more closely in the chapters dealing with
inference, there are no particular considerations we need to concern ourselves
over if we choose R(r) = 1 + r rather than R(r) = exp(r).
What is more, if we allow the regression functions, β(t), to depend arbitrarily
upon time then, given either model, the other model exists with a different
function of β(t). The only real reason for preferring one model over another
would be due to parsimony; for example, we might find in some given situation
that in the case of the additive model the regression function β(t) is in fact
constant unlike the multiplicative model where it may depend on time. But
otherwise both functions may depend, at least to some extent, on time and then
the multiplicative model ought to be preferred since it is the more natural. We
say the more natural because the positivity constraint is automatically satisfied.
All of the calculations proceed as above and no real new concept is involved.
Such models can be considered in the case of continuous covariates, Z, which
may be sufficiently asymmetric, implying very great changes of risk at the high
or low values, to be unlikely to provide a satisfactory fit. Taking logarithms,
or curbing the more extreme values via a defined plateau, or some other such
transformation will produce models of potentially wider applicability. Note that
this is a different approach to work with, say,
and using the main theorem, in conjunction with estimating equations described
here below and basing inference upon the observations ψZ(Xi ) and their expec-
tations under this model. In this latter case we employ ψ in the estimating
equation as a means to obtain greater robustness or to reduce sensitivity to large
observations. In the former case the model itself is different and would lead to
different estimates of survival probabilities.
Our discussion so far has turned around the hazard function. However, it
is equally straightforward to work with intensity functions and these allow for
increased generality, especially when tackling complex time-dependent effects.
O’Brien (1978) introduced the logit-rank test for survival data when investi-
gating the effect of a continuous covariate on survival time. His purpose was
to construct a test that was rank invariant with respect to both time and the
covariate itself. O’Quigley and Prentice (1991) showed how a broad class of rank
invariant procedures can be developed within the framework of proportional haz-
ards models. The O’Brien logit-rank procedure was a special case of this class.
In these cases we work with intensity rather than hazard functions. Suppose
then that λi (t) indicates an intensity function for the ith subject at time t. A
proportional hazards model for this intensity function can be written as
where Yi (t) indicates whether or not the ith subject is at risk at time t, λ0 (t)
the usual “baseline” hazard function, and Zi (t) is a constructed covariate for
the ith subject at time t. Typically, Zi (t) in the estimating equation is defined
as a function of measurements on the ith subject alone, but it can be defined
more generally as Zi (t) = ψi (t, Ft ) for ψ some function of Ft , the collective
failure, censoring, and covariate information prior to time t on the entire study
group. The examples in O’Quigley and Prentice (1991) included the rank of the
subject’s covariate at Xi and transformations on this such as the normal order
statistics. This represents a departure from most regression situations because
the value used in the estimating equation depends not only on what has been
observed on the particular individual but also upon what has been observed on
other relevant subsets of individuals.
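A sketch of how such a constructed covariate might be built is given below: at any time t the covariate value assigned to a subject in the risk set is a function of the rank of the raw measurement among those still at risk, so that the procedure is invariant to monotone transformations of the measurement. The particular centering and the normal-scores alternative shown here are illustrative choices and not the exact definitions used by O'Quigley and Prentice (1991).

```python
import numpy as np
from scipy.stats import norm

def rank_covariate(t, x, z_raw, kind="centered-rank"):
    """Constructed covariate for every subject in the risk set at time t.

    The value is a function of the ranks of the raw measurements among
    subjects still at risk, hence invariant to monotone transformations
    of z_raw.
    """
    in_risk = np.flatnonzero(x >= t)
    ranks = np.argsort(np.argsort(z_raw[in_risk])) + 1   # 1, ..., n_i
    n_i = len(in_risk)
    if kind == "centered-rank":
        values = (ranks - 0.5 * (n_i + 1)) / n_i
    else:                                   # normal scores as one alternative
        values = norm.ppf(ranks / (n_i + 1.0))
    out = np.full(len(x), np.nan)           # undefined off the risk set
    out[in_risk] = values
    return out

x = np.array([1.0, 2.0, 2.5, 4.0, 5.5])
z_raw = np.array([3.2, 0.1, 7.4, 2.2, 5.0])
print(rank_covariate(2.5, x, z_raw))
print(rank_covariate(2.5, x, np.log(z_raw)))   # identical: rank invariance
```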
where Q(t) is a function of time that does not depend on the parameters β and
α. The null hypothesis that the proportional hazards model is adequate corresponds to H_0 : α = 0, under which we recover the proportional hazards model. In the context of sequential
group comparisons of survival data, the above model has been considered by
Tsiatis (1982) and Fleming and Harrington (1984). In keeping with the usual
notation we denote Eβ,α (Z|t) to be the expectation taken with respect to the
probability distribution πi (β, α, t), where
Lemma 7.2. The components of the score vector U (β, α) can be expressed as
U_β(β, α) = Σ_{i=1}^n δ_i {Z_i(X_i) − E_{β,α}(Z|X_i)},   (7.25)

U_α(β, α) = Σ_{i=1}^n δ_i Q(X_i) {Z_i(X_i) − E_{β,α}(Z|X_i)}.   (7.26)
A test would be carried out using any one of the large sample tests aris-
ing from considerations of the likelihood. If we let β̂ be the maximum par-
tial likelihood estimate of β under the null hypothesis of proportional hazards,
i.e., H_0 : α = 0, then U_β(β̂, 0) = 0. The score test statistic arising under H_0 is B = U_α(β̂, 0) G^{-1} U_α(β̂, 0), where G = I_22 − I_21 I_11^{-1} I_12 and G^{-1} is the lower right corner element of I^{-1}. The elements of the information matrix required to
carry out the calculation are given below in Lemma 7.3. Under H0 , the hypothesis
of proportional hazards, B has asymptotically a χ2 distribution with one degree
of freedom.
Lemma 7.3. Taking k = 1, 2 and ℓ = 1, 2, the components of I are

I(β, α) = − ( U_ββ  U_βα ; U_αβ  U_αα ) = ( I_11  I_12 ; I_21  I_22 ),  where

I_kℓ(β, α) = Σ_{i=1}^n δ_i Q(X_i)^{k+ℓ−2} {E_{β,α}(Z²|X_i) − E²_{β,α}(Z|X_i)}.
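Lemmas 7.2 and 7.3 translate directly into a few lines of code. The sketch below, for a time-fixed scalar covariate and Q(t) = t, finds β̂ as the root of U_β(β, 0), then assembles U_α(β̂, 0), the information components I_kℓ, and the statistic B; the simulated data and the choice of root-finding interval are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import brentq

def cond_mean_var(beta, t, x, z):
    """Model-based mean and variance of Z over the risk set at time t."""
    w = (x >= t) * np.exp(beta * z)
    p = w / w.sum()
    m = np.sum(p * z)
    return m, np.sum(p * z**2) - m**2

def u_beta(beta, x, d, z):
    return sum(z[i] - cond_mean_var(beta, x[i], x, z)[0]
               for i in np.flatnonzero(d))

def ph_score_test(x, d, z, Q=lambda t: t):
    """Score test of H0: alpha = 0 for a time by covariate interaction Q(t)*Z."""
    beta_hat = brentq(u_beta, -5, 5, args=(x, d, z))
    U_alpha, I = 0.0, np.zeros((2, 2))
    for i in np.flatnonzero(d):
        m, v = cond_mean_var(beta_hat, x[i], x, z)
        U_alpha += Q(x[i]) * (z[i] - m)
        for k in range(2):
            for l in range(2):
                I[k, l] += Q(x[i])**(k + l) * v   # exponent k + l - 2 with k,l in {1,2}
    G = I[1, 1] - I[1, 0] * I[0, 1] / I[0, 0]
    return beta_hat, U_alpha**2 / G               # B ~ chi-square(1) under H0

rng = np.random.default_rng(7)
n = 100
z = rng.binomial(1, 0.5, n).astype(float)
t_true = rng.exponential(np.exp(-0.8 * z))        # proportional hazards data
x = np.minimum(t_true, rng.exponential(2.0, n))
d = (t_true <= x).astype(float)
print(ph_score_test(x, d, z))
```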
The literature on the goodness-of-fit problem for the Cox model has consid-
ered many formulations that correspond to particular choices for Q(t). The first
of these was given in the founding paper of Cox (1972). Cox’s suggestion was
equivalent to taking Q(t) = t. Defining Q(t) as a two-dimensional vector, Stablein et al. (1981) considered Q(t) = (t, t²), and Brown (1975), Anderson and
Senthilselvan (1982), O’Quigley and Moreau (1984), Moreau et al. (1985), and
where the only key requirement is that Q(·) be a predictable process (see Sec-
tion B.4) that converges in probability to a non-negative bounded function uni-
formly in t. Let β̂Q be the zero of (7.27) and β̂ be the partial likelihood esti-
mate. Under the assumption that the proportional hazards model holds and that
(X_i, δ_i, Z_i) (i = 1, . . . , n) are i.i.d. replicates of (X, δ, Z), n^{1/2}(β̂_Q − β̂) is asymp-
totically normal with zero mean and covariance matrix that can be consistently
estimated. It then follows that a simple test can be based on the standardized
difference between the two estimates. Lin (1991) showed such a test to be con-
sistent against any model misspecification under which βQ = β, where βQ is the
probability limit of β̂Q . In particular, it can be shown that choosing a mono-
tone weight function for Q(t) such as F̂ (t), where F̂ (·) is the Kaplan-Meier
estimate, is consistent against monotone departures (e.g., decreasing regression
effect) from the proportional hazards assumption.
It has been argued that the careful use of residual techniques can indicate
which kind of model failure may be present. This is not so. Whenever a poor
fit could be due to either cause it is readily seen that a misspecified covariate
form can be represented correctly via a time-dependent effect. In some sense the
two kinds of misspecification are unidentifiable. We can fix the model by working
either with the covariate form or the regression coefficient β(t). Of course, in
certain cases, a discrete binary covariate describing two groups, for example, there
can only be one cause of model failure—the time dependency of the regression
coefficient. This is because the binary coding imposes no restriction of itself since
all possible codings are equivalent.
and, for the case of a model making the stronger assumption of an indepen-
dent censoring mechanism as opposed to a conditionally independent censoring
mechanism given the covariate, we have
n
P(Z(t) ≤ z|T = t) ≈ πj (μ, t)I(Zj (t) ≤ z). (7.29)
j=1
The definition enables us to make sense out of using estimates based on (4.3)
when the data are in fact generated by (4.2). Since we can view T as being
random, whenever β(t) is not constant, we can think of having sampled from
β(T ). The right-hand side of the above equation is then a double expectation
and β ∗ , occurring in the left-hand side of the equation, is the best fitting value
under the constraint that β(t) = β. We can show the existence and uniqueness of
solutions to Equation 7.30 (Xu and O’Quigley, 2000). More importantly, β ∗ can
be shown to have the following three properties: (i) under model (4.3) β ∗ = β;
(ii) under a subclass of the broad class of models known as the Harrington-
Fleming models, we have an exact result in that β* = ∫_T β(t) dF(t); and (iii) for very general situations we can write that β* ≈ ∫_T β(t) dF(t), an approximation
which is in fact very accurate. Estimates of β ∗ are discussed in Xu and O’Quigley
(2000) and, in the light of the foregoing, we can take these as estimates of μ.
The above integral is simply the difference of two sums, the first the empirical
mean without reference to any model and the second the average of model-based
means. It makes intuitive sense as an estimating equation and the only reason
for writing the sum in the less immediate form as an integral is that it helps
understand the large sample theory when F_n(t) converges in probability to F(t). Each component in the
above sum includes the size of the increment, 1/n, a quantity that can then
be taken outside of the summation (or integral) as a constant factor. Since the
right-hand side of the equation is identically equal to zero, the incremental size
1/n can be canceled, enabling us to rewrite the equation as
U_2(β) = ∫ {Z(t) − E_β(Z|t)} dN̄(t) = 0.   (7.32)
It is this expression, in which the integral is taken with respect to the increments dN̄(t) rather than with respect to dF_n(t), that is the more classic representation in this context. The expression writes U_2(β) in terms of the counting processes N_i(t). These
processes, unlike the empirical distribution function, are available in the presence
of censoring. It is the above equation that is used to define the partial likelihood
estimator, since, unless the censoring is completely absent, the quantity U1 (β)
is not defined.
Now, suppose that two observers were to undertake an experiment to estimate
β. A certain percentage of observations remain unobservable to the first observer
as a result of an independent censoring mechanism but are available to the second
observer. The first observer uses Equation 7.32 to estimate β, whereas the second
observer uses Equation 7.31. Will the two estimates agree? By “agree” we mean,
under large sample theory, will they converge to the same quantity. We might
hope that they would; at least if we are to be able to usefully interpret estimates
obtained from Equation 7.32. Unfortunately though (especially since Equation 7.32 is so widely used), the estimates do not typically agree: the greater the degree of censoring, even when independent, the greater the disagreement. Table 7.1
below indicates just how severe the disagreement might be. However, the form of
U1 (β) remains very much of interest and, before discussing the properties of the
above equations let us consider a third estimating equation which we write as
U_3(β) = ∫ {Z(t) − E_β(Z|t)} dF̂(t) = ∫ W(t){Z(t) − E_β(Z|t)} dN̄(t) = 0,   (7.33)

upon defining the stochastic process W(t) = Ŝ(t){Σ_{i=1}^n Y_i(t)}^{-1}. For practical calculation note that W(X_i) = F̂(X_i+) − F̂(X_i) at each observed failure time X_i, i.e., the jump in the KM curve. When there is no censoring, then clearly U_1(β) and U_3(β) coincide, each being equal to n^{-1} U_2(β), so that all three equations share the same solution.
More generally U1 (β) may not be available and solutions to U2 (β) = 0 and
U3 (β) = 0 do not coincide or converge to the same population counterparts
even under independent censoring. They would only ever converge to the same
quantities under the unrealistic assumption that the data are exactly generated by the proportional hazards model itself.
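A small simulation makes the point concrete. In the sketch below the data are generated with a regression effect that vanishes after a changepoint, so that the proportional hazards model is misspecified; solving the unweighted equation U_2(β) = 0 and the Kaplan-Meier weighted equation U_3(β) = 0 then gives visibly different answers, and only the latter targets an average effect free of the censoring distribution. The changepoint mechanism and the censoring rate are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq

def cond_mean(beta, t, x, z):
    w = (x >= t) * np.exp(beta * z)
    return np.sum(w * z) / w.sum()

def estimating_eq(beta, x, d, z, weights):
    return sum(weights[i] * (z[i] - cond_mean(beta, x[i], x, z))
               for i in np.flatnonzero(d))

def km_jumps(x, d):
    """Kaplan-Meier jump of F-hat at each observation (zero at censored points)."""
    order = np.argsort(x)
    surv, jumps = 1.0, np.zeros(len(x))
    for rank, i in enumerate(order):
        n_risk = len(x) - rank
        if d[i]:
            jumps[i] = surv / n_risk
            surv *= 1.0 - 1.0 / n_risk
    return jumps

def sim_changepoint(rng, z, b_early=1.5, tau=0.5):
    """Failure times with hazard exp(b_early * z) before tau and constant 1 after."""
    e = rng.exponential(1.0, len(z))
    rate = np.exp(b_early * z)
    return np.where(e < rate * tau, e / rate, tau + (e - rate * tau))

rng = np.random.default_rng(3)
n = 500
z = rng.binomial(1, 0.5, n).astype(float)
t_true = sim_changepoint(rng, z)               # non-proportional hazards
cens = rng.exponential(1.0, n)                 # independent censoring
x, d = np.minimum(t_true, cens), (t_true <= cens).astype(float)

b2 = brentq(estimating_eq, -5, 5, args=(x, d, z, np.ones(n)))      # U_2 = 0
b3 = brentq(estimating_eq, -5, 5, args=(x, d, z, km_jumps(x, d)))  # U_3 = 0
print(b2, b3)   # the two solutions differ once the model is misspecified
```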
Even weaker assumptions (not taking the marginal F (t) to be common across
strata) can be made and, at present, this is a topic that remains to be studied.
where ξ lies strictly in the interior of the interval with endpoints β_0 and β̂. Now U_2(β̂) = 0 and −U_2'(ξ) = Σ_{i=1}^n δ_i V_ξ(Z|X_i), so that Var(β̂) ≈ 1/Σ_{i=1}^n δ_i Var(Z|X_i). This is the Cramér-Rao bound and so the estimate is a good one. Although the
sums are of variables that we can take to be independent, they are not identically distributed. Showing large sample normality requires verification of the Lindeberg condition which, although awkward, is not difficult. All the neces-
sary ingredients are then available for inference. However, as our recommended
approach, we adopt a different viewpoint based on the functional central limit
theorem rather than a central limit theorem for independent variables. This is
outlined in some detail in Chapter 9.
Note that the averaging does not produce the marginal variance. For that
we would need to include a further term which measures the variance of the
conditional expectations. Under the conditions on the censoring of Breslow and
Crowley (1974), essentially requiring that, for each t, as n increases, the infor-
mation increases at the same rate, then nW (t) converges in probability to w(t).
Under these same conditions, recall that, under model (4.2), the sample-based quantities E_β(Z|t), E_β(Z²|t), and V_β(Z|t) converge in probability, as n → ∞, to their population counterparts. The population conditional expectation and variance, whether the
model is correct or not, are denoted by E(Z|t) and V (Z|t), respectively. We
have an important result due to Struthers and Kalbfleisch (1986).
Theorem 7.3. Under model 4.2 the estimator β̂, such that U2 (β̂) = 0, con-
verges in probability to the constant βP L , where βP L is the unique solution
to the equation
∫_0^∞ w^{-1}(t) {E(Z|t) − E_β(Z|t)} dF(t) = 0,   (7.35)
Theorem 7.4. Under model 4.2 the estimator β̃, such that U3 (β̃) = 0, con-
verges in probability to the constant β ∗ , where β ∗ is the unique solution to
the equation
∫_0^∞ {E(Z|t) − E_β(Z|t)} dF(t) = 0,   (7.36)
Theorem 7.5. (Xu 1996). Under the non-proportional hazards model and
an independent censorship the estimator β̃ converges in probability to the
constant β ∗ , where β ∗ is the unique solution to the equation
∫_0^∞ {s^(1)(β(t), t)/s^(0)(β(t), t) − s^(1)(β, t)/s^(0)(β, t)} dF(t) = 0,   (7.38)

provided that ∫_0^∞ v(β*, t) dF(t) > 0.
It is clear that equation (7.38) does not involve censoring. Neither then does
the solution to the equation, β*. By contrast, the maximum partial likelihood estimator β̂_PL from the estimating equation U_2 = 0 converges to the solution of
the equation
∫_0^∞ {s^(1)(β(t), t)/s^(0)(β(t), t) − s^(1)(β, t)/s^(0)(β, t)} s^(0)(β(t), t) λ_0(t) dt = 0.   (7.39)
This result was obtained by Struthers and Kalbfleisch (1986). If the data are
generated by the proportional hazards model, then the solutions of (7.38) and
(7.39) are both equal to the true regression parameter β. In general, however,
these solutions will be different, the solution to (7.39) depending on the unknown
censoring mechanism through the factor s(0) (β(t), t). The simulation results of
Table 7.1 serve to underline this fact in a striking way. The estimate β̃ can be shown to be asymptotically normal, centered at β*, with a variance that can be written down. The expression for the variance is nonetheless complicated and is
not reproduced here since it is not used. Instead we base inference on functions
of Brownian motion which can be seen to describe the limiting behavior of the
regression effect process.
is a weighted average of β(t) over time. According to Equation 7.41 more weight is given to those β(t)'s where the marginal distribution of T is concentrated,
which simply means that, on average, we anticipate there being more individuals
subjected to those particular levels of β(t). The approximation of Equation 7.41
also has an interesting connection with Murphy and Sen (1991), where they show
that if we divide the time domain into disjoint intervals and estimate a constant
β on each interval, in the limit as n → ∞ and the intervals become finer at a
certain rate, the resulting β̂(t) estimates β(t) consistently. In their large sample
β* ≈ ∫_0^∞ β(t) dF(t) = E{β(T)}.   (7.42)
In the linear setting we study the discrepancies between the observed responses
and their corresponding model-based predictions. In the proportional hazards
setting, rather than being the response variable, survival, it is the covariate given
survival that provides the basis for suitable residuals. On a deeper level, there is
no real difference between these two settings once we accept that our outcome variable is not survival (since it is only determined up to its rank) but rather the covariate, or covariate vector, observed given T = t. This is of course
anticipated in the main theorem of this chapter.
Schoenfeld considered a vector of covariates Z that were fixed over time.
Once we fix time however, the extension of the Schoenfeld residuals to time-
dependent covariates is immediate and, aside from the need for great care in the
calculations, presents no added difficulty. The Schoenfeld residuals are defined
at each death time t by
r_j(t; β) = Z_j(t) − E_β(Z|t),   (7.43)
where j indexes the individual failing at time t. These residuals have formed
the basis for many goodness-of-fit tests as well as the basis for graphical pro-
cedures, (Grambsch and Therneau, 1994; Lin, 1991; Lin et al., 1993; O’Quigley
and Pessione, 1991; Wei, 1984).
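Computed from the risk sets, the residuals of (7.43) require nothing more than the conditional means E_β(Z|t). The sketch below evaluates them at each failure time for a time-fixed covariate, together with their cumulative sum, the raw ingredient of the regression effect process used in later chapters; the data are simulated and, for simplicity, the generating value of β stands in for a fitted β̂.

```python
import numpy as np

def schoenfeld_residuals(beta, x, d, z):
    """Residuals r_j(t; beta) = Z_j(t) - E_beta(Z|t) at each failure time."""
    fail = np.flatnonzero(d)
    fail = fail[np.argsort(x[fail])]           # order by failure time
    res = []
    for j in fail:
        w = (x >= x[j]) * np.exp(beta * z)
        e_beta = np.sum(w * z) / w.sum()
        res.append(z[j] - e_beta)
    return x[fail], np.array(res)

rng = np.random.default_rng(11)
n = 200
z = rng.normal(size=n)
t_true = rng.exponential(np.exp(-0.5 * z))      # hazard exp(0.5 * z)
cens = rng.exponential(2.0, n)
x, d = np.minimum(t_true, cens), (t_true <= cens).astype(float)

beta_hat = 0.5                                  # stand-in for the fitted value
times, r = schoenfeld_residuals(beta_hat, x, d, z)
process = np.cumsum(r)                          # raw cumulative score process
print(r.mean(), process[-1])                    # near zero when beta_hat fits well
```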
Given the basic nature of martingales as discrepancies between random vari-
ables conditional upon a history and their corresponding conditional expectation
it is natural to consider such objects as a way to create a large class of residuals.
All that is needed is the history, some function of this—technically referred to
as a filtration, and the conditional expectations. These discrepancies have the
martingale property, essentially no more than the expectations exist, and a large
class of residuals becomes available to us. These are known as martingale residu-
als and, of course, the Schoenfeld residuals are a particular case. In our view, the
Schoenfeld residuals are not just a special case of the martingale residuals but,
in practice, are the only martingale residuals of real interest. We might qualify
that statement by adding after some appropriate standardization. For this rea-
son we do not dwell on any particular features of martingale residuals and we
present little of the work that has been done on these. We recall the bare bones
nonetheless for completeness. For subject j at time t, the martingale residuals
are (Appendix B.3)
M_j(t) = N_j(t) − Λ̂_j(t),   where   Λ̂_j(t) = ∫_0^t π_j(β̂, s) dN̄(s).
These behave as independent martingales (Gill, 1980), which means that they should be centered about zero and uncorrelated. These properties can be observed on a graph of the residuals, M_j(T), j = 1, . . . , n. Barlow and Prentice (1988) studied a class of martingale residuals by focusing on ∫ φ(t) dM_j(t) where φ is a predictable function. For a predictable function, φ(t), conditioning on the history at time t allows us to treat φ(t) as a constant so that ∫ φ(t) dM_j(t) will also be a
martingale. These residuals have formed the basis of tests of fit by Kay (1977),
Lin (1991), Lin et al. (1993) as well as Therneau et al. (1990). As already
mentioned, we mostly focus attention on the Schoenfeld residuals which arise
under the definition, φ = Zj (Schoenfeld, 1982).
The finite sample distribution of the score statistic is considered more closely.
Since the other test statistics are derived from this, and the regression coefficient estimate is itself a monotonic function of the score, it is enough to restrict attention to the score statistic alone. One direct approach leads to a simple convolution expression which can be evaluated by iterated integration. It is also possible to
make improvements to the large sample normal approximation via the use of
saddlepoint approximations or Cornish-Fisher expansions. For these we can use
the results of Corollary 7.9. Corrections to the distribution of the score statistic
can be particularly useful when the distribution of the explanatory variable is
asymmetric. Corrections to the distribution of the score equation have a rather
small impact in the case of the fourth moment but can be of significance in the
case of the third moment. The calculations themselves are uncomplicated and
simplify further in the case of an exponential distribution. Since we can transform
an arbitrary marginal distribution to one of the exponential form, while preserv-
ing the ranks, we can then consider the results for the exponential case to be of
broader generality. The focus of our inferential efforts, regardless of the particular
technique we choose, is mostly the score statistic. For this statistic, based on the
properties of the estimating equation, we can claim large sample normality.
Recall that our underlying probability model is focused on the distribution
of the covariate, or covariate vector, at each time t given that the subject is
still at risk at this time. From this the mean and the variance of the conditional
distribution can be consistently estimated and this is typically the cornerstone of any test or confidence interval construction. Implicitly we are
summarizing these key distributions by their means and variances or, at least, our
best estimates of these means and variances. The fact that it is the distributions
themselves that are of key interest, and not just their first two moments, suggests
that we may be able to improve the accuracy of any inference if we were to take
into account higher order moments. As a consequence of the main theorem it
turns out that this is particularly straightforward, at least for the third moment
and relatively uncomplicated for the fourth moment. In fact, corrections based
on the fourth moment seem to have little impact and so only the third moment
might be considered in practice.
We can incorporate information on these higher moments via a Cornish-Fisher
expansion or via the use of a saddlepoint approximation. Potential improvements
over large sample results would need to be assessed on a case-by-case basis,
often via the use of simulation. Some limited simulations are given in O’Quigley
(2008) and suggest that these small sample corrections can lead to more accurate
inference, in particular for situations where there is strong group imbalance.
The first two derivatives of A(θ) are well known and widely available from
any software which fits the proportional hazards model. The third and fourth
derivatives are a little fastidious although, nonetheless straightforward to obtain.
Pulling all of these together we have:
Lemma 7.5. The first four derivatives of K(θ) are obtained from:

A'(θ)/A(θ) = K'(θ),
A''(θ)/A(θ) = [K'(θ)]² + K''(θ),
A'''(θ)/A(θ) = [K'(θ)]³ + 3K'(θ)K''(θ) + K'''(θ),
A^{(4)}(θ)/A(θ) = [K'(θ)]⁴ + 6[K'(θ)]²K''(θ) + 4K'(θ)K'''(θ) + 3[K''(θ)]² + K^{(4)}(θ).
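The expansions in Lemma 7.5 can be verified symbolically, for instance with sympy, by differentiating A(θ) = exp{K(θ)} and dividing by A(θ):

```python
import sympy as sp

theta = sp.symbols('theta')
K = sp.Function('K')(theta)
A = sp.exp(K)                     # A(theta) = exp{K(theta)}

# Each ratio reproduces the corresponding expansion of Lemma 7.5
for order in range(1, 5):
    expr = sp.simplify(sp.diff(A, theta, order) / A)
    print(order, sp.expand(expr))
```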
The adjustment can be used either when carrying out a test of a point hypothesis
or when constructing test-based confidence intervals. In the simulations we can
see that the correction, relatively straightforward to implement, leads to improved
control on type I error in a number of situations.
We can use what we know about U to make statements about β̂. This only
works because U (β) is monotonic in β and the same will continue to hold in the
being an expression in terms of total probability rather than the usual Bayes’
formula since, instead of a data statistic depending on the model parameter, we
have a direct expression for the parameter estimate. For the small sample exact
distribution of the sum we use the following lemma:
We use the above form dQn (s) in order to accommodate the discrete and the
continuous cases in a single expression. The lemma is proved by recurrence of an
elementary convolution result (see for example Kendall et al. (1987)). Following
Cox (1975) and Andersen et al. (1993), Andersen (1982), and Andersen and
Gill (1982) we will take the contributions to the score statistic U (Xi ) to be
independent with different distributions given by Theorem 7.2. We can then
apply the result by letting U_i = H_i(X_i) where the index i now runs over the k failure times rather than all n observations. The distribution of U_1 is given by G_0(X_1), of
U2 by G0 (X2 ) and we can then construct a sequence of equations based on the
above expression to finally obtain the distribution Qk (s) of the sum. Any prior
information can be incorporated in this expression in the same way as before.
We can estimate E{U³(β, ∞)} consistently by replacing λ_i(s)ds and λ_i(s_1)λ_j(s_2)ds_1ds_2 by R{β̂Z_i(s)}dΛ̂_0(s) and R{β̂Z_i(s_1)}R{β̂Z_j(s_2)}dΛ̂_0(s_1)dΛ̂_0(s_2), respectively.
E{U⁴(β, ∞)} = ∫_0^∞ Σ_{i=1}^n E{Y_i(s)H_i⁴(s)} λ_i(s) ds
  + 6 ∫_0^∞ ∫_0^∞ Σ_{i=1}^n Σ_{j>i} E{Y_i(s_1)H_i²(s_1)} E{Y_j(s_2)H_j²(s_2)} λ_i(s_1) λ_j(s_2) ds_1 ds_2.   (7.48)
where

ψ(v) = π_2 e^β exp(−ve^β) / {π_1 e^{−v} + π_2 e^β exp(−ve^β)}.
Corollary 7.12. The second, third, and fourth moments of U are given by the following, where U_s is U standardized to have unit variance:

E{U²(0, ∞)} = nπ_1π_2;   E{U_s³(0, ∞)} = n^{−1/2}(π_1 − π_2)(π_1π_2)^{−1/2};
E{U_s⁴(0, ∞)} = n^{−1}(π_1³ + π_2³)(π_1π_2)^{−1} + 3n^{−1}(n − 1).
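The correction formula itself (Equation 7.45) is not reproduced here, but the following sketch shows how the moments of Corollary 7.12 might feed a standard one-term Edgeworth adjustment to the tail area of the standardized score, the adjustment being driven by the skewness term n^{−1/2}(π_1 − π_2)(π_1π_2)^{−1/2}; the group imbalance and sample size are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import norm

def edgeworth_tail(x, n, pi1):
    """One-sided tail P(U_s > x) with a one-term Edgeworth correction.

    Uses the null moments of Corollary 7.12 for a two-group comparison with
    group proportions pi1 and pi2 = 1 - pi1; gamma1 is the standardized
    third moment (skewness) of the score under the null.
    """
    pi2 = 1.0 - pi1
    gamma1 = (pi1 - pi2) / np.sqrt(n * pi1 * pi2)
    phi, Phi = norm.pdf(x), norm.cdf(x)
    # Standard one-term Edgeworth expansion of the distribution function
    cdf = Phi - phi * gamma1 * (x**2 - 1.0) / 6.0
    return 1.0 - cdf

# Strong imbalance (10% versus 90%) with 40 subjects: the corrected tail
# probability differs visibly from the plain normal approximation
x = 1.645
print(norm.sf(x), edgeworth_tail(x, n=40, pi1=0.1))
```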
More generally, consider the case of a continuous variable Z with support I and
density f . Furthermore, suppose that survival time is distributed exponentially
with underlying hazard equal to λ0 R(βz) with R(βz) = 1 for β = 0, and that
there is no censoring. Then, for β = 0 and p ≥ 2, we have:
∫_0^∞ Σ_{i=1}^n E{Y_i(s)H_i^p(s)} λ_i(s) ds = ∫_I {z − E(Z)}^p {∫_0^∞ λ_0 e^{−λ_0 t} dt} f(z) dz

and this integral can be readily evaluated so that the right-hand term becomes:

∫_I {z − E(Z)}^p {∫_0^∞ λ_0 e^{−λ_0 t} dt} f(z) dz = ∫_I {z − E(Z)}^p f(z) dz = E{Z − E(Z)}^p.
Therefore, the required terms can easily be evaluated from the central moments
of Z. Specifically, taking the subscript s to refer to the standardized variable, we
obtain:
Corollary 7.13. The second, third, and fourth moments of U are given, respec-
tively, by:
where the subscript s refers to the standardized variable. Note that these results
also hold for a discrete variable Z.
Ĝ(z|t) = P̂(Z(t) ≤ z|T = t, C > t) = Σ_{j=1}^n π_j(β, t) I(Z_j(t) ≤ z).   (7.49)
Note that the definition of πj (β, t) restricts the subjects under consideration to
those in the risk set at time t. The cumulative distribution Ĝ(z|t) is restricted by
both z and t. We will need to invert this function, at each point Xi corresponding
to a failure. Assuming no ties in the observations (we will randomly break them
if there are any) then, at each time point Xi , we order the observations Z in the
risk set. We express the order statistics as Z(1) < Z(2) < . . . < Z(ni ) where there
are ni subjects in the risk set at time Xi . We define the estimator G̃(z|Xi ) at
time t = Xi and for z ∈ (Z(m) , Z(m+1) ) by
G̃(z|X_i) = Ĝ(Z_(m)|X_i) + {Ĝ(Z_(m+1)|X_i) − Ĝ(Z_(m)|X_i)} (z − Z_(m))/(Z_(m+1) − Z_(m)),
noting that, at the observed values Z(m) , m = 1, . . . , ni , the two estimators coin-
cide so that G̃(z|Xi ) = Ĝ(z|Xi ) for all values of z taken in the risk set at
time Xi . Otherwise, G̃(z|Xi ) linearly interpolates between adjacent values of
the observed order statistics Z(m) , m = 1, . . . , ni . Also, we are assuming no ties,
in which case, the function G̃(z|Xi ), between the values Z(1) and Z(ni ) , is a
strictly increasing function and can thereby be inverted. We denote the inverse
function by G̃−1 (α) , 0 < α < 1.
Our purpose is achieved by using, instead of G̃−1 (α) which would take us back
to where we began, the inverse of the cumulative normal distribution Φ−1 (α).
We define the transform
Z*_(m) = Φ^{−1}{G̃(Z_(m)|X_i)},   (7.50)
noting that the transform is strictly increasing so that the order of the covari-
ate observations in the risk set is respected. We are essentially transforming to
normality via the observed empirical distribution of the covariate in the risk set.
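A sketch of the transformation is given below. The conditional distribution Ĝ of (7.49) is accumulated over the ordered covariate values in the risk set and then passed through Φ^{−1}; the rescaling by n_i/(n_i + 1) is an assumption of this sketch, introduced only so that Φ^{−1} is never evaluated at one, and at β = 0 it reproduces the familiar normal scores Φ^{−1}{m/(n_i + 1)}.

```python
import numpy as np
from scipy.stats import norm

def normal_scores_at(t, beta, x, z):
    """Normal-scores transform of the covariate values in the risk set at t.

    G-hat of (7.49) is evaluated at the observed order statistics; the
    rescaling by n_i/(n_i + 1) is an assumption of this sketch so that the
    inverse normal is never applied at the value one.
    """
    in_risk = np.flatnonzero(x >= t)
    z_r = z[in_risk]
    n_i = len(in_risk)
    w = np.exp(beta * z_r)
    p = w / w.sum()                          # pi_j(beta, t) over the risk set
    order = np.argsort(z_r)
    G = np.cumsum(p[order]) * n_i / (n_i + 1.0)
    z_star = np.empty_like(z_r)
    z_star[order] = norm.ppf(G)
    return in_risk, z_star

x = np.array([1.5, 2.2, 3.0, 4.4, 5.1, 6.3])
z = np.array([0.2, 1.4, -0.7, 0.9, 2.2, -1.3])
idx, z_star = normal_scores_at(3.0, beta=0.0, x=x, z=z)
print(idx, z_star)     # at beta = 0: the usual normal scores for a sample of 4
```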
Under the null hypothesis that β = 0 the cumulative distribution G̃(Z(m) |Xi )
is discrete uniform where each atom of probability has mass 1/n_i. Thus, the Z*_(m), m = 1, . . . , n_i, will be close (the degree of closeness increasing with n_i) to the expectation of the mth smallest order statistic from a normal sample of
size ni (Appendix A). The statistic U (β) is then a linear sum of zero mean
and symmetric variables that will be closer to normal than that for the untrans-
formed sequence. At the same time any information in the covariate is captured
via the ranks of the covariate values among those subjects at risk and so local
power to departures from the null would be model dependent. Under the null
the suggested transformation achieves our purpose, the mean of U (0) is zero
and the distribution of U (0) is symmetric. Under the alternative, however, we
would effectively have changed our model by the transformation and a choice
of model which coincides with the mechanism generating the selection from the
risk set would maximize power. The above choice would not necessarily be the
most efficient. An expression for the statistical efficiency of using some particular
covariate transformation model when another one generates the observations is
given in O’Quigley and Prentice (1991).
One way to maintain exact control over type I error using Z*_(m) = Φ^{−1}{G̃(Z_(m)|X_i)} is to consider, at each observed failure time, alongside Z*_(m), its reflection in the origin −Z*_(m), such values, and any more extreme in absolute value, arising with the same probability under the null hypothesis of no effect. A nonparametric test considers the distribution of the test statistic under all possible configurations of the vector, of dimension equal to the number of observed failures, having entries Z*_(m) or −Z*_(m).
The number of possibilities grows exponentially so that it is possible, with
even quite small samples, to achieve almost exact control over type I error.
The significance level is simply the proportion of configurations yielding a value more extreme than that obtained from the configuration corresponding to the observed data themselves. This approach would be very attractive apart from the drawback
of the intensity of calculation. With as few as 10 observations per group, in a
two-group case, the number of cases to evaluate is over one million. Finding, say,
the most extreme five percent of these requires comparisons taking us into the
thousands of billions. Approximations are therefore unavoidable.
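A sketch of the reflection test follows: the observed statistic is the sum of the transformed scores at the failure times, and its null distribution is built by flipping the sign of each entry, by full enumeration when the number of failures is small and by Monte Carlo sampling of configurations otherwise. The data and the enumeration cutoff are illustrative.

```python
import numpy as np
from itertools import product

def reflection_test(z_star, exact_limit=16, n_draws=100_000, rng=None):
    """Reflection (sign-flip) test on transformed scores at the failure times."""
    z_star = np.asarray(z_star, dtype=float)
    obs = abs(z_star.sum())
    k = len(z_star)
    if k <= exact_limit:
        signs = np.array(list(product([-1.0, 1.0], repeat=k)))  # full enumeration
    else:
        rng = rng or np.random.default_rng(0)
        signs = rng.choice([-1.0, 1.0], size=(n_draws, k))      # Monte Carlo
    null = np.abs(signs @ z_star)
    return np.mean(null >= obs)            # two-sided significance level

rng = np.random.default_rng(5)
z_star = rng.normal(0.4, 1.0, size=12)     # hypothetical transformed scores
print(reflection_test(z_star))
```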
Assessing accuracy
The most useful tool in assessing which of the several approaches is likely to
deliver the best rewards is that of simulation. It is difficult otherwise because,
even when we can show that taking into account higher moments will reduce the
order of error in an estimate, the exact value of these moments is not typically
known. The further error involved in replacing them by estimates involving error
can often lead us back to an overall order of error no less than we had in the first
place. In some cases we can carry out exact calculation. Even here though caution
is needed since if we need to evaluate integrals numerically, although there is no
statistical error involved, there is a risk of approximation error. Among the three
available tests based on the likelihood (the score test, the likelihood ratio test, and the Wald test), the score test is arguably the most satisfactory. Although all three
are asymptotically equivalent, the Wald test’s sensitivity to parameterization has
raised questions as to its value in general situations. For the remaining two, the
score test (log-rank test) has the advantage of not requiring estimation under the
alternative hypothesis and has nice interpretability in terms of simple comparisons
between observed and expected quantities. Indeed it is this test, the log-rank
test in the case of a discrete covariate, that is by far the most used. The higher
moments are also evaluated very easily, again not requiring estimation under
the alternative hypothesis, and therefore it is possible to improve the accuracy
of inference based on the score test at little cost. Only tests of the hypothesis
H0 : β = 0 have been discussed. More generally, we may wish to consider testing
H0 : β = β0 , β0 = 0, such a formulation enabling us to construct confidence
intervals about non-null values of β. The same arguments apply to this case also
and, by extension, will lead to intervals with more accurate coverage properties.
2. Show that the variance expression V (β, t) using the Andersen and Gill nota-
tion (see Section 7.4) is the same as Vβ (Z|t) using the notation of Section
7.5. Explain why Var(Z|t) is consistently estimated by Vβ̂ (Z|t) but that
Var(Z|t) is not generally equal to v(β, t).
3. For the general model, suppose that β(t) is linear so that β(t) = α0 + βt.
Show that Eβ(t) (Z k |t) does not depend upon α0 .
8. Use some dataset to fit the proportional hazards model. Estimate the param-
eter β on the basis of estimating equations for the observations Z_i² rather than Z_i. Derive another estimate based on estimating equations for √Z_i.
Compare the estimates.
11. Consider a proportional hazards model in which we also know that the
marginal survival is governed by a distribution F (t; θ) where θ is not known.
Suppose that it is relatively straightforward to estimate θ, by maximum
12. Use the approach of the preceding question on some dataset by (1) approxi-
mating the marginal distribution by an exponential distribution, (2) approx-
imating the marginal distribution by a log-normal distribution.
13. Using again the approach of the previous two questions show that if the
proportional hazards models are correctly specified then the estimate β̂ based
on F (t; θ) is consistent whether or not the marginal model F (t; θ) is correctly
specified.
14. Suppose that the function β(t) is linear so that β(t) = α_0 + βt. Show how to estimate the function β(t) in this simple case. Note that we can use this model to base a test of the proportional hazards assumption via a hypothesis test of H_0 : β = 0, leaving α_0 unrestricted (Cox 1972).
15. Investigate the assertion that it is not anticipated for v(t), the conditional
variance of Z(t), to change much with time. Use the model-based estimates
of v(t) and different datasets to study this question informally.
16. In epidemiological studies of breast cancer it has been observed that the
tumor grade is not well modeled on the basis of a proportional hazards
assumption. A model allowing a monotonic decline in the regression coeffi-
cient β(t) provides a better fit to observed data. On the basis of observa-
tions some epidemiologists have argued that the disease is more aggressive
(higher grade) in younger women. Can you think of other explanations for
this observed phenomenon?
17. Try different weights in the weighted log-rank test and apply these to a
dataset. Suppose we decide to use the weight that leads to the most sig-
nificant result. Would such an approach maintain control over Type I error
under the null hypothesis of no association? Suggest at least two ways in
which we might improve control over the Type I error rate.
18. Use a two-sample dataset such as the Freireich data and carry out a one-
sided test at the 5% level. How does the p-value change if we make the
Edgeworth correction given in Equation 7.45?
19. Repeat the above question but this time using a saddlepoint approximation.
20. Using bootstrap resampling, calculate a 95% confidence interval for the esti-
mated regression coefficient for the above data by the percentile method
on β̂. Compare this to a 95% confidence interval obtained by inverting the
monotone function U (b) and by determining values of b for which the esti-
mated Pr {U (b) < 0} ≤ 0.025 and Pr {U (b) > 0} ≤ 0.025.
21. Consider the following two priors on β : (1) Pr (β < −1) = 0.1; Pr (β > 2) =
0.1; Pr (−1 < β < 2) = 0.8, (2) Pr (β < 0) = 0; Pr (β > 1) = 0.2; Pr (0 < β <
1) = 0.8. Using these priors repeat the above confidence interval calculations
and comment on the impact of the priors.
23. Either use an existing dataset or generate censored data with a single contin-
uous covariate. Evaluate the empirical distribution of the covariate at each
failure time in the risk set. Use several transformations of this distribution,
e.g., to approximately normal, exponential, or uniform, and take as a test
statistic the maximum across all considered transformations. How would you
ensure correct control of type I error for this test? What are the advantages
and drawbacks of this test?
25. Suppose that the censoring mechanism is not independent of the survival
mechanism, in particular suppose that
Write down the likelihood for a parametric model for which the censoring
mechanism is governed by this equation. Next, suppose that we can take the
above equation to represent the general form for the censoring model but
that, instead of the constant value 2, it depends on an unknown parameter,
i.e., the number 2 is replaced by α. What kind of data would enable us to
estimate the parameter α?
27. Fit a Weibull proportional hazards model to data including at least two
binary regressors, Z_1 and Z_2. Calculate a 90% confidence interval for the probability that a subject from the most unfavorable prognosis group among the 4 groups has
a survival greater than the marginal median. Calculate a 90% confidence
interval that a subject chosen randomly from either the most unfavorable, or
the second most unfavorable, group has a survival greater than the marginal
median.
29. For the two-sample exponential model, write down the likelihood and confirm
the maximum likelihood estimates given in Section 7.3. Calculate the score
test, the likelihood ratio test, and Wald’s test, and compare these with the
expressions given in Section 7.3. For a dataset with a single binary covariate
calculate and compare the three test statistics.
30. On the basis of data, estimate the unknown regression coefficient, β, as the
expected value of the conditional likelihood (see Appendix D.4). Do this
for both an exponential based likelihood and the partial likelihood. Next,
consider the distribution of log β in this context and take an estimate as
exp E(log β). Do you anticipate these two estimators to agree? Note that the
corresponding maximum likelihood estimators do agree exactly. Comment.
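As referenced in exercise 20, here is a minimal sketch of the percentile bootstrap for β̂ with a single binary covariate. It is our own illustration in Python, not the authors' code: the data are synthetic stand-ins (not the Freireich data), ties are handled with a Breslow-type partial likelihood, and all function names are assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_partial_likelihood(beta, time, event, z):
    # Breslow-type log partial likelihood for a single covariate
    ll = 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        failed = (time == t) & (event == 1)
        ll += beta * z[failed].sum() - failed.sum() * np.log(np.exp(beta * z[at_risk]).sum())
    return -ll

def beta_hat(time, event, z):
    res = minimize_scalar(neg_log_partial_likelihood, bounds=(-10, 10),
                          args=(time, event, z), method="bounded")
    return res.x

# synthetic two-group data standing in for a dataset such as the Freireich data
rng = np.random.default_rng(0)
n = 40
z = np.repeat([0.0, 1.0], n // 2)
t_true = rng.exponential(scale=np.where(z == 1, 2.0, 1.0))
c = rng.exponential(scale=3.0, size=n)
time, event = np.minimum(t_true, c), (t_true <= c)

# percentile bootstrap for beta_hat
B = 500
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    boot[b] = beta_hat(time[idx], event[idx], z[idx])
print(beta_hat(time, event, z), np.percentile(boot, [2.5, 97.5]))
```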
P(C ≥ t|z, T ≥ t) = P(C ≥ t|z).

Now, by the law of total probability, we have P(C ≥ t) = ∫ P(C ≥ t|z) g(z) dz, where the integral is over the domain of definition of Z. Next we replace P(Z ≤ z|T ≥ t, C ≥ t) by the consistent estimate Σ_{j: Zj(t) ≤ z} Yj(t) / Σ_{j=1}^n Yj(t), which is simply the empirical distribution in the risk set, which leads to

P̂{Z(t) ≤ z|T = t} = Σ_{i: zi ≤ z} Yi(t) exp{β(t)zi(t)} φ̂(zi, t) / Σ_{j=1}^n Yj(t) exp{β(t)zj(t)} φ̂(zj, t).
We begin by considering the probability that one subject with particular covariates
will have a greater survival time than another subject with different covariates,
i.e., Pr (Ti > Tj |Zi , Zj ). Note that this also provides a Kendall τ -type measure of
predictive strength (Gönen and Heller, 2005) and, although not explored in this
work, provides a potential alternative to the R2 that we recommend. Confidence
intervals are simple to construct and maintain the same coverage properties as
those for β. Using the main results of Chapter 7 we obtain a simple expression
for survival probability given a particular covariate configuration, i.e., S(t|Z ∈ H)
where H is some given covariate subspace. When the subspace is the full covariate
space then this function coincides with S(t) and the estimate coincides with the
Kaplan-Meier estimate. Simple adjustments to cater for the classical difficulty
of Kaplan-Meier estimates not necessarily reaching zero are provided. Several
different situations are highlighted including survival under informative censoring.
The provision of such information may help guide decision making in an applied
context.
While it is usually technically difficult to estimate densities and hazards (some
kind of smoothing typically being required), it is easier to estimate cumula-
tive hazards and distribution (survivorship) functions. These have already been
smoothed, in some sense, via the summing inherently taking place.
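As a small illustration of this point, the cumulative hazard can be estimated by nothing more than a running sum over risk sets, with no smoothing required. The sketch below is ours; the array names time and event are assumptions.

```python
import numpy as np

def nelson_aalen(time, event):
    """Cumulative hazard estimate: at each distinct failure time add d(t)/n(t),
    the number of failures divided by the number still at risk."""
    failure_times = np.sort(np.unique(time[event == 1]))
    d = np.array([((time == t) & (event == 1)).sum() for t in failure_times])
    n_risk = np.array([(time >= t).sum() for t in failure_times])
    return failure_times, np.cumsum(d / n_risk)
```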
Breslow (1972), Breslow and Crowley (1974), using an equivalence between
the proportional hazards model and a piecewise exponential regression model,
with as many parameters as there are failure times, derived a simple expression
for conditional survival given covariate information. An expression for the variance
of the Breslow estimate was derived by O’Quigley (1986). Appealing to Bayes’
rule, Xu and O’Quigley (2000) obtained the simple expression
S(t|Z ∈ H) = ∫_t^∞ P(Z ∈ H|u) dF(u) / ∫_0^∞ P(Z ∈ H|u) dF(u).   (8.2)
If two individuals are independently sampled from the same distribution then, by
simple symmetry arguments, it is clear that the probability of the first having a
longer survival time than the second is just 0.5. If, instead of sampling from the
same distribution, each individual is sampled from a distribution determined by
the value of their covariate information, then, the stronger the impact of this
covariate information, the further away from 0.5 will this probability be. When
the covariates do not depend on time then this probability is very easily evalu-
ated using:
Theorem 8.1. For subjects i and j, having covariate values Zi and Zj then,
under the proportional hazards model, we can write
Pr(Ti > Tj |Zi, Zj) = exp(βZj) / {exp(βZi) + exp(βZj)}.
An important observation to make is that the expression does not involve Λ0 (t).
If we define ψ(a, b : β) to be exp(βb)/{exp(βb) + exp(βa)} we then have:
Corollary 8.1. A consistent estimate of Pr(Ti > Tj |Zi, Zj), under the proportional hazards model, is given by ψ(Zi, Zj : β̂) and Var log{ψ/(1 − ψ)} ≈ (Zj − Zi)² Var(β̂).
The approximation in the corollary arises from an immediate application of
the mean value theorem (Appendix A). In the theorem and corollary it is assumed
that Zi and Zj are scalars and that the model involves only a one-dimensional
covariate. Extension to the multivariate case is again immediate, and instead of
β̂Zi in ψ(Zi , Zj : β̂) being a scalar it can be replaced by the usual inner product
(prognostic index). Suppose that the dimension of β and Z is p and that we use
the notation Zjr to indicate, for subject j, the rth component of Zj . Applying
the delta method (Appendix A.10),
Var log{ψ/(1 − ψ)} ≈ Σ_{r=1}^p Σ_{s=1}^p (Zjr − Zir)(Zjs − Zis) Cov(β̂r, β̂s).
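A short sketch of Theorem 8.1 and Corollary 8.1 in Python; the values of beta_hat and var_beta are assumed to come from some fitted proportional hazards model, and the numerical values shown are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def psi(z_i, z_j, beta_hat):
    """Pr(T_i > T_j | Z_i, Z_j) = exp(beta Z_j) / {exp(beta Z_i) + exp(beta Z_j)} (Theorem 8.1)."""
    return np.exp(beta_hat * z_j) / (np.exp(beta_hat * z_i) + np.exp(beta_hat * z_j))

def psi_confidence_interval(z_i, z_j, beta_hat, var_beta, level=0.95):
    """Interval on the logit scale, where log{psi/(1-psi)} = beta (z_j - z_i),
    so that Var log{psi/(1-psi)} ~ (z_j - z_i)^2 Var(beta_hat) (Corollary 8.1)."""
    p = psi(z_i, z_j, beta_hat)
    logit = np.log(p / (1 - p))
    se = abs(z_j - z_i) * np.sqrt(var_beta)
    q = norm.ppf(0.5 + level / 2)
    lo, hi = logit - q * se, logit + q * se
    return 1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi))

# e.g., a binary covariate with beta_hat = 1.5 and Var(beta_hat) = 0.16 (illustrative values)
print(psi(0, 1, 1.5), psi_confidence_interval(0, 1, 1.5, 0.16))
```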
Figure 8.1: Kaplan-Meier survival plot for gastric cancer data and model-based
plot based on selection of the covariate CEA to lie in interval (0,100).
in Theorem 8.1 but could be worked out. Applications to cure studies follow since
we could conceive of situations in which this expression diminishes with t, becom-
ing, at some point, sufficiently close to the value 0.5 to claim that the exposed
group no longer carries a disadvantage when compared to the reference group.
An expression for the large sample variance of Y = log − log S(t|z) was obtained
by O’Quigley (1986). Symmetric intervals for Y can then be transformed into
more plausible ones, at least having better coverage properties according to the
arguments of O’Quigley (1986), by simply applying the exponential function
twice. The Breslow estimate concerns a single point z. It is a natural question
to ask what is the survival probability given that the covariates belong to some
subset H. The set H may denote for example an age group, or a certain range
of continuous measurement, or a combination of those.
In general we assume H to be a subset of the p-dimensional Euclidean space.
A natural approach may be to take the above formula, which is applied to a
point, and average a set of curves over all points belonging to the set H of
interest. For this we would need some distribution for the z across the set H.
Keiding (1995) has a discussion on expected survival curves over a historical, or
background, population, where the main approaches are to take an average of
the individual survival curves obtained from the above equation. See also Sasieni
(2003). Following that, one might use the equation to estimate S(t|z) for all
z in H, then average over an estimated distribution of Z. Xu and O’Quigley
(2000) adopted a different starting point in trying to estimate directly the survival
probabilities given that Z ∈ H. Apart from being direct, this approach is the more
natural in view of the main theorem of Section 7.5. What is more, the method
can also have application to situations in which the regression effect varies with
time. In the following, for notational simplicity, we will assume p = 1. Extensions
to p > 1 are immediate. As for almost all of the quantities we have considered it
turns out to be most useful to work with the conditional distribution of Z given
T = t rather than the other way around. Everything is fully specified by the joint
distribution of (T, Z) and we keep in mind that this can be expressed either as
the conditional distribution of T given Z, together with the marginal distribution
of Z or as the conditional distribution of Z given T, together with the marginal
distribution of T.
This is a very simple and elegant expression and we can see from it how con-
ditioning on the covariates modifies the underlying survival distribution. If H
were to be the whole domain of definition of Z, in which case Z is contained in
H with probability one, then the left-hand side of the equation simply reduces
to the marginal distribution of T . This is nice and, below, we will see that we
have something entirely analogous when dealing with sample-based estimates
whereby, if we are to consider the whole of the covariate space, then we simply
recover the usual empirical estimate. In particular this is just the Kaplan-Meier
estimate when the increments of the right-hand side of the equation are those
of the Kaplan-Meier function. The main theorem of Section 7.5 implies that
P(Z ∈ H|t) can be consistently estimated from
P̂(Z ∈ H|t) = Σ_{j: Zj ∈ H} πj(β̂, t) = Σ_{j: Zj ∈ H} Yj(t) exp{β̂Zj} / Σ_{j=1}^n Yj(t) exp{β̂Zj}.   (8.3)
This striking, and simple, result is the main ingredient needed to obtain survival
function estimates conditional on particular covariate configurations. The rest, essentially the step increments in the Kaplan-Meier curve, is readily available.
Unfortunately, a problem that is always present when dealing with censored data
remains and that is the possibility that the estimated survival function does not
decrease all the way to zero. This will happen when the largest observation is not
a failure. To look at this more closely, let F̂ (·) = 1 − Ŝ(·) be the left-continuous
Kaplan-Meier (KM) estimator of F (·). Let 0 = t0 < t1 < ... < tk be the distinct
failure times, and let W (ti ) = dF̂ (ti ) be the stepsize of F̂ at ti . If the last
observation is a failure, then,
Ŝ(t|Z ∈ H) = ∫_t^∞ P̂(Z ∈ H|u) dF̂(u) / ∫_0^∞ P̂(Z ∈ H|u) dF̂(u) = Σ_{ti > t} P̂(Z ∈ H|ti) W(ti) / Σ_{i=1}^k P̂(Z ∈ H|ti) W(ti).   (8.4)
When the last observation is not a failure and Σ_{i=1}^k W(ti) < 1, an application of the law of total probability indicates that the quantity B1, where B1 = P̂(Z ∈ H|T > tk)Ŝ(tk), should be added to both the numerator and the denominator in (8.4). This is due to the fact that the estimated survival distribution is not
summing to one. Alternatively, we could simply reduce the support of the time
frame to be less than or equal to the greatest observed failure. In addition,
using the empirical estimate over all the subjects that are censored after the last
observed failure, we have:
P̂(Z ∈ H|T > tk) = Σ_{j: Zj ∈ H} Yj(tk+) / Σ_{j=1}^n Yj(tk+),   (8.5)
where tk + denotes the moment right after time tk . Therefore we can write:
Ŝ(t|Z ∈ H) = [ Σ_{ti > t} P̂(Z ∈ H|ti)W(ti) + P̂(Z ∈ H|T > tk){1 − Σ_{i=1}^k W(ti)} ] / [ Σ_{i=1}^k P̂(Z ∈ H|ti)W(ti) + P̂(Z ∈ H|T > tk){1 − Σ_{i=1}^k W(ti)} ].
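The estimator above can be assembled directly from standard Cox model output. The following is our own minimal implementation for a single covariate, assuming arrays time, event, z and a boolean mask in_H; ties are handled crudely and no claim is made to reproduce the authors' software.

```python
import numpy as np

def km_increments(time, event):
    """Distinct failure times t_i, Kaplan-Meier increments W(t_i) = dF_hat(t_i),
    and the leftover mass S_hat(t_k) when the largest observation is censored."""
    ts = np.sort(np.unique(time[event == 1]))
    surv, W = 1.0, []
    for t in ts:
        d = ((time == t) & (event == 1)).sum()
        n = (time >= t).sum()
        W.append(surv * d / n)
        surv *= 1.0 - d / n
    return ts, np.array(W), surv

def conditional_survival(time, event, z, in_H, beta, t_eval):
    ts, W, leftover = km_increments(time, event)
    # P_hat(Z in H | t_i) from the weights pi_j(beta, t_i) of equation (8.3)
    risk_weights = np.exp(beta * z)
    PH = np.array([risk_weights[(time >= t) & in_H].sum() / risk_weights[time >= t].sum()
                   for t in ts])
    # tail term B1 = P_hat(Z in H | T > t_k) S_hat(t_k), from (8.5) and the adjustment above
    after = time > ts[-1]
    B1 = leftover * (after & in_H).sum() / after.sum() if after.any() else 0.0
    num = (PH * W)[ts > t_eval].sum() + B1
    den = (PH * W).sum() + B1
    return num / den
```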
Ŝ(t|Z ∈ H) = Ŝ(t|Z ∈ H)|β0 + (β̂ − β0) ∂Ŝ(t|Z ∈ H)/∂β |β=β̇,   (8.6)
where β̇ lies on the line segment between β0 and β̂. We then need to bring
together some results.
The first term in (8.7) gives the variation due to the estimation of the conditional
survival, the second term the variation caused by β̂. Details are given at the end
of the chapter. In addition, we have:
Theorem 8.2. Under the proportional hazards model Ŝ(t|Z ∈ H) is asymptoti-
cally normal.
As a consequence one can use the above estimated variance to construct
confidence intervals for S(t|Z ∈ H) at each t.
Theorem 8.3. √n U(β0) is asymptotically equivalent to n^{-1/2} Σ_{i=1}^n ωi(β0), where

ωi(β) = ∫_0^1 { Zi − s^{(1)}(β, t)/s^{(0)}(β, t) } dNi(t) − ∫_0^1 Yi(t) e^{βZi} { Zi − s^{(1)}(β, t)/s^{(0)}(β, t) } λ0(t) dt.
Figure 8.2: Survival probabilities for myeloma data based on prognostic index of
lower 0.33 percentile using Kaplan-Meier, Breslow, and Xu-O’Quigley estimators.
As for the first term, we can use Theorem II.5 of Xu (1996), which derives from a result of Stute (1995). With φ(t) = 1[t∗,1](t)P(Z ∈ H|t) in the theorem, the first term is equal to n^{-1/2} Σ_{i=1}^n νi + √n Rn, where |Rn| = op(n^{-1/2}) and the ν's are i.i.d. with mean zero.
Figure 8.3: Survival probabilities for myeloma data based on prognostic index between 0.33 and 0.66 percentiles using Kaplan-Meier, Breslow, and Xu-O’Quigley estimators.
Figure 8.4: Survival probabilities for myeloma data based on prognostic index for
values greater than the 0.33 percentile using 3 different estimators.
was not very far from the median value and provided enough observations for a
good empirical Kaplan-Meier estimate. The empirical estimate and the model-
based estimates show good agreement. For the three prognostic groups, defined
on the basis of a division of the prognostic index from the multivariate model,
the empirical Kaplan-Meier estimates, the Breslow estimates, and the Xu and
O’Quigley estimates are shown in Figures 8.2, 8.3, and 8.4. Again agreement is
strong among the three estimators.
The estimator described above is not limited to proportional hazards models. The
formula itself came from a simple application of Bayes’ rule, and the marginal
distribution of T can always be estimated by the Kaplan-Meier estimator or some
other estimator if we wish to use other assumptions. Take a general form of relative risk r(t; z), so that λ(t|z) = λ0(t)r(t; z). Assume also that r(t; z) can be estimated by r̂(t; z), for example, that it has a known functional form and a finite or infinite dimensional parameter that can be consistently estimated. Special cases of r(t; z) are exp(βz), 1 + βz, and exp{β(t)z}. Since the main theorem of Section 7.5 extends readily to other relative risk models it is straightforward to derive analogous results to those above. We are still able to estimate S(t|Z ∈ H), with P̂(Z ∈ H|t) = Σ_{j: Zj ∈ H} πj(t). This is an important extension of
the estimator since we may wish to directly work with some form of a non-
proportional hazards model.
as it does not require the estimation of the baseline hazards. Suppose that V
is the stratification variable, and we are interested in the survival given Z ∈ H.
Note P(Z ∈ H|t) = Σ_v P(Z ∈ H|t, V = v)P(V = v). We can estimate P(Z ∈ H|t, V = v) by Σ_{j: Zj ∈ H} πjv(β̂, t), where πjv(β, t) is the conditional probability defined within strata v, and estimate P(V = v) by the empirical distribution of
V. Similarly to the stratified case, (8.3) can also be used to estimate survival under random effects models arising from clustered data, such as genetic or familial data. The frailty models under such settings can be written as
where λij is the hazard function of the jth individual in the ith cluster. This
is the same as a stratified model, except that we do not observe the values of
the “stratification variable” ω; but such values are not needed in the calculation
described above for stratified models. So the procedure described above can be
used to estimate S(t|Z ∈ H). In both cases considered here, we need reasonable
stratum or cluster sizes in order to get a good estimate of P (Z ∈ H|t, v) or
P (Z ∈ H|t, ω).
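A hedged sketch of the stratified decomposition just described, with v an assumed stratum label array; the within-stratum term uses the πjv(β̂, t) weights and P(V = v) is estimated empirically. The names are ours, not the authors'.

```python
import numpy as np

def p_in_H_given_t_stratified(time, z, v, in_H, beta, t):
    """P_hat(Z in H | t) = sum_v P_hat(Z in H | t, V = v) P_hat(V = v), with the
    within-stratum term built from the pi_jv(beta, t) weights."""
    total = 0.0
    for stratum in np.unique(v):
        s = (v == stratum)
        at_risk = s & (time >= t)
        if not at_risk.any():
            continue
        w = np.exp(beta * z[at_risk])
        p_within = w[in_H[at_risk]].sum() / w.sum()
        total += p_within * s.mean()      # s.mean() is the empirical P(V = v)
    return total
```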
f(z|T = t) = e^{βz} S(t|z) g(z) / ∫ e^{βz} S(t|z) g(z) dz.   (8.9)
Ŝ(t|η ∈ βH) = [ Σ_{ti > t} P̂(η ∈ βH|ti)W(ti) + P̂(η ∈ βH|T > tk)Ŝ(tk) ] / [ Σ_{i=1}^k P̂(η ∈ βH|ti)W(ti) + P̂(η ∈ βH|T > tk)Ŝ(tk) ],

where P̂(η ∈ βH|t) = Σ_{j: ηj ∈ β̂H} πj(β̂, t), and P̂(η ∈ βH|T > tk) = Σ_{j: ηj ∈ β̂H} Yj(tk+) / Σ_{j=1}^n Yj(tk+).
As before, since one can consistently estimate β, the above expression still pro-
vides a consistent estimate of S(t|Z ∈ H). Note that when z is a single covariate,
z ∈ H is exactly the same as η ∈ βH (unless β = 0 in which case the covariates
have no predictive capability), so the above is consistent with the one-dimensional
case developed earlier. While we regard the above as one possible approach under
high dimensions when there are not “enough" observations falling into the ranges
Events that occur through time, alongside the main outcome of interest, may
often provide prognostic information on the outcome itself. These can be viewed
as time-dependent covariates and, as before, in light of the main theorem (Section
7.5), it is still straightforward to use such information in the expression of the
survivorship function. Since, in essence, we sum, or integrate, future information,
it can be necessary to postulate paths that the covariate process might take.
Emphasis remains on the conditional distribution of survival given current and
evolving covariate information. This differs slightly from an approach, more even
handed with respect to the covariate process alongside the survival endpoints,
that makes an appeal to joint modeling. Even so, the end goal is mostly the
same, that of characterizing the survival experience given covariate information.
Paths that remain constant are the easiest to interpret and, in certain cases,
the simple fact of having a value tells us that the subject is still at risk for the event of interest (Kalbfleisch and Prentice, 2002). The use of the main theorem
(Section 7.5) in this context makes things particularly simple since the relevant
probabilities express themselves in terms of the conditional distribution of the
covariate at given time points. We can make an immediate generalization of
Equation 8.2 if we also wish to condition on the fact that T > s. We have:
S(t + s|Z ∈ H, T > s) = ∫_{t+s}^∞ P(Z ∈ H|u) dF(u|u > s) / ∫_s^∞ P(Z ∈ H|u) dF(u|u > s),

and, in exactly the same way as before, and assuming that the last observation is a failure, we replace this expression in practice by its empirical equivalent

Ŝ(t + s|Z ∈ H, T > s) = ∫_{t+s}^∞ P̂(Z ∈ H|u) dF̂(u|u > s) / ∫_s^∞ P̂(Z ∈ H|u) dF̂(u|u > s) = Σ_{ti > t+s} P̂(Z ∈ H|ti) W(ti) / Σ_{ti > s} P̂(Z ∈ H|ti) W(ti).
When the last observation is not a failure and Σ_{i=1}^k W(ti) < 1 we can make a further adjustment for this in the same way as before.
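Conditioning additionally on T > s changes only the range of the sums. The small helper below reuses the quantities ts, PH, and W computed as in the earlier sketch and assumes the last observation is a failure; it is an illustration only.

```python
def conditional_survival_given_s(ts, PH, W, s, t):
    """S_hat(t + s | Z in H, T > s): restrict both sums to failure times beyond s,
    assuming the last observation is a failure (so no tail correction is needed)."""
    num = (PH * W)[ts > t + s].sum()
    den = (PH * W)[ts > s].sum()
    return num / den if den > 0 else 0.0
```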
One reason for favoring the Xu-O’Quigley estimate of survival over the Bres-
low estimate is the immediate extension to time-dependent covariates and to
time-dependent covariate effects. Keeping in mind the discussion of Kalbfleisch
and Prentice (2002), concerning internal and external time-dependent covariates,
whether or not these are determined in advance or can be considered as an indi-
vidual process generated through time, we can, at least formally, leaving aside
interpretation questions, apply the above formulae. The interpretation questions
are solved by sequentially conditioning on time as we progress along the time axis
and, thereby, the further straightforward extension which seeks to quantify the
probability of the event T > t, conditioned by T > s where s < t, is particularly
useful.
If the effect of the occurrence of the intermediary event is to change the hazard function of death λ(t) from λ1(t) to λ2(t), that is, λ(t) = λ1(t) if t ≤ C and λ(t) = λ2(t) otherwise, then λ1(t) = λ2(t) when the intermediary response variable has no influence on survival. When λ2(t) < λ1(t) or λ2(t) > λ1(t) then
the intermediary, or surrogate, response variable carries relatively favorable or
unfavorable predictions of survival. Thus the quantity π(t) = λ2 (t)/λ1 (t) is a
measure of the effect of the surrogate response on survival. When f1 (t) and
f2 (t) are the density functions, S1 (t) and S2 (t) the survivorship functions corre-
sponding to the hazard functions λ1 (t) and λ2 (t), respectively, then the marginal
survival function is

S(t) = ∫_0^t exp[ −{ ∫_0^c λ1(u)du + ∫_c^t λ2(u)du } ] dG(c) + exp{ −∫_0^t λ1(u)du } {1 − G(t)}.
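The displayed formula is easily evaluated numerically under simple working assumptions. The sketch below takes constant hazards λ1 and λ2 and an exponential distribution for the time C of the intermediary event; all parameter values are illustrative.

```python
import numpy as np
from scipy.integrate import quad

def marginal_survival(t, lam1, lam2, mu):
    """S(t) = int_0^t exp{-(Lambda1(c) + Lambda2(t) - Lambda2(c))} dG(c)
              + exp{-Lambda1(t)} {1 - G(t)}, with C ~ Exponential(mu)."""
    integrand = lambda c: np.exp(-(lam1 * c + lam2 * (t - c))) * mu * np.exp(-mu * c)
    with_event, _ = quad(integrand, 0.0, t)               # intermediary event occurred by t
    without_event = np.exp(-lam1 * t) * np.exp(-mu * t)   # 1 - G(t) = exp(-mu t)
    return with_event + without_event

# sanity check: with lam1 == lam2 the intermediary event is irrelevant and S(t) = exp(-lam1 t)
print(marginal_survival(1.0, 0.5, 0.5, 0.8), np.exp(-0.5))
```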
In the first stage of a two-stage design, all patients are followed to the end-
point of primary concern and for some subset of the patients there will be the
surrogate information collected at an intermediary point during the follow-up.
The purpose of the first stage is to estimate the relationship between the occur-
rence of the surrogate response variable and the remaining survival time. This
information can then be used in the second stage, at which time, for patients
who reach the surrogate endpoint, follow-up is terminated. Such patients could
be considered as censored under a particular dependent censorship model, the
censoring being, in general “informative.” The Kaplan-Meier estimator will not
generally be consistent if the survival time and an informative censoring time are
dependent but treated as though they were independent. Flandre and O’Quigley
(1995) proposed a nonparametric estimator of the survival function for data col-
lected in a two-stage procedure. A nonparametric permutation test for comparing
the survival distributions of two treatments using the two-stage procedure is also
readily derived.
The idea behind a two-stage design in the context of a time-dependent sur-
rogate endpoint is to reduce the overall duration of the study. This potential
reduction occurs at the second stage, where follow-up is terminated on patients
for whom the surrogate variable has been observed. The first stage is used to
quantify the strength of the relationship between occurrence of the surrogate
variable and subsequent survival. It is this information, obtained from the first
stage analysis, that will enable us to make inferences on survival on the basis of
not only observed failures, but also observed occurrences of the surrogate. In the
context of clinical trials, as pointed out by Prentice (1989), the surrogate variable
must attempt to “capture” any relationship between the treatment and the true
endpoint. We may wish to formally test the validity of a surrogate variable before
proceeding to the second stage using a standard likelihood ratio test.
In the first stage N1 patients are enrolled and followed to the endpoint of
primary concern (e.g., death) or to censorship, as in a classical study, and infor-
mation concerning the surrogate variable is recorded. Survival time is then either
the true survival time or the informative censoring time. The information avail-
able for some patients will consist of both time until the surrogate variable and
survival time, while for others (i.e., those who die or are censored without the
occurrence of the surrogate variable) it consists only of survival time. In the sec-
ond stage a new set of patients (N2 ) is enrolled in the study. For those patients,
the follow-up is completed when the surrogate variable has been observed. Thus,
the information collected consists only of one time, either the time until the sur-
rogate variable is reached or the survival time. In some cases the two stages may
correspond to separate and distinct studies; the second stage being the clinical
trial of immediate interest while the first stage would be an earlier trial carried
out under similar conditions.
When parametric models are assumed, then the likelihood function can be
obtained directly and provides the basis for inference. The main idea follows
that of Lagakos et al. (1978), Lagakos (1976, 1977) who introduced a stochastic
model that utilizes the information on a time-dependent event (auxiliary variable)
that may be related to survival time. By taking λ0 (t) = λ2 , where λ0 (.) is the
hazard function for the occurrence of the surrogate response, λ1 (t) = λ1 , and
λ2 (t) = λ3 , the survival function has the marginal distribution function given
by Lagakos. The Lagakos model itself is a special case of the bivariate model
of Freund (1961), applicable to the lifetimes of certain two-component systems where a failure of the first or second component alters the failure rate of the second or first component from β to β′ (or from α to α′). By taking α = λ1(t), α′ = λ2(t), and β = β′ = λ0(t), the Freund model can be viewed as a special case of the model described above.
Slud and Rubinstein (1983) make simple nonparametric assumptions on the
joint density of (T, C) and consider the function ρ(t) defined by
ordered failure times Xj and Xj+1 . When ρi = 1 it follows that Ŝρ (t) reduces to
the usual Kaplan-Meier estimator. This model is a special case of the nonpara-
metric assumption presented by Slud and Rubinstein. The focus here is not on
the dependence of T and C but on the dependence of T and Cs where Cs is a
dependent censoring indicator, in particular a surrogate endpoint. The function
of interest is
ρs(t) = lim_{δ→0} Pr(t < T < t + δ | T > t, Cs < t) / Pr(t < T < t + δ | T > t, Cs ≥ t).
This function is equivalent to the function π(t) and can be estimated from data
from the first stage. Suppose that the conditional hazard, λ(t|z), of death at t given Z = z has the form h0(t) exp(βz(t)), where zi(t) takes the value 0 if ti ≤ ci and the value 1 if ti > ci; then ρs(t) = ρs = exp(β). Thus, an estimate of ρs
is given by exp(β̂). The estimate of β using data from the first stage quantifies
the increase in the risk of death occurring after the surrogate variable has been
observed. The first stage is viewed as a training set of data to learn about the
relationship between the potential surrogate endpoint and the survival time.
The estimator is constructed from the entire sample N (N = N1 +N2 ). In the
sample of size N , the ordered failure times Xi for which δi = 1 are X1 ≤ . . . ≤ Xd ,
where d is the number of deaths of patients enrolled either in the first stage or in
the second stage. Using the notation εi = 1 if Xi > ci and εi = 0 otherwise, the random variable V defines either the observed survival time or the time to the surrogate variable. For patients in the second stage, let us denote the number of Xi with εi = 1 and δi = 0 between Xj and Xj+1 by Wj. In this way, Wj denotes the number of individuals from the second stage having a surrogate response between two consecutive failure times Xj and Xj+1. Let W′j denote either the number of individuals censored in the first stage or the number of individuals censored without a surrogate response in the second stage between Xj and Xj+1. Clearly, Wj is the number of patients censored with “informative censoring” between two consecutively ordered failure times Xj and Xj+1 while W′j is the number of patients censored with “non-informative censoring” between two consecutively ordered failure times Xj and Xj+1. Finally, nj is the number of subjects i with Xi ≥ Xj. The product-limit estimator then becomes
Ŝ(t; ρs) = N^{-1} { n(t) + Σ_{k=0}^{d(t)-1} Wk Π_{i=k+1}^{d(t)} (ni − 1)/(ni + ρs − 1) + Σ_{k=0}^{d(t)-1} W′k Π_{i=k+1}^{d(t)} (ni − 1)/ni }.
Notice that when ρs = 1 (i.e., the occurrence of the surrogate variable has no influence on survival) then Ŝ(t; ρs) is simply the Kaplan-Meier estimator.
Considering the two-sample case, for example, a controlled clinical trial where
group membership is indicated by a single binary variable Z, then the survival
where ψ(a, b) is some distance function (metric) between a and b at the point
s and w(s) is some positive weighting function, often taken to be 1 although a
variance-stabilizing definition such as w(t)² = Ŝ0(t; ρ̂s0){1 − Ŝ0(t; ρ̂s0)} can be useful in certain applications. The choice ψ(a, b) = a − b leads to a test with good power against alternatives of stochastic ordering, whereas the choice ψ(a, b) = |a − b| or ψ(a, b) = (a − b)², for instance, would provide better power against
crossing hazard alternatives. Large sample theory is simplified by assuming that
the non-informative censoring distributions are identical in both groups and this
may require more critical examination in practical examples. Given data we can
observe some value Y = y0 from which the significance level can be calculated by
randomly permuting the (0, 1) labels corresponding to treatment assignment. For
the ith permutation (i = 1, . . . , np ) the test statistic can be calculated, resulting
in the value say yi . Out of the np permutations suppose that there are n+ values
of yi greater than or equal to y0 and, therefore, np − n+ values of yi less than y0 .
The significance level for a two-sided test is then given by 2 min(n+ , np −n+ )/np .
In practice we sample from the set of all permutations so that np does not
correspond to the total number of possible permutations but, rather, the number
actually used, of which some may even be repeated. This is the same idea that
is used in bootstrap resampling.
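A sketch of the permutation scheme just described. The callable stat_fn stands for whatever statistic Y is chosen, for instance a weighted distance between the two estimated survival curves, and is an assumption of the sketch rather than something defined in the text.

```python
import numpy as np

def permutation_pvalue(stat_fn, time, event, group, n_perm=999, seed=1):
    """Two-sided permutation significance level 2 min(n+, n_perm - n+)/n_perm, obtained by
    randomly permuting the treatment labels and recomputing the statistic."""
    rng = np.random.default_rng(seed)
    y0 = stat_fn(time, event, group)
    ys = np.array([stat_fn(time, event, rng.permutation(group)) for _ in range(n_perm)])
    n_plus = int((ys >= y0).sum())
    return 2 * min(n_plus, n_perm - n_plus) / n_perm
```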
1. For subjects i and j with covariate values Zi and Zj , write down the proba-
bility that the ratio of the survival time for subject i to the survival time for
subject j is greater than 2/3.
2. Calculate an estimate of the above probability for the Freireich data in which
Zi = 1 and Zj = 0. Derive an expression for a 95% confidence interval for this
quantity and use it to derive a confidence interval for use with the Freireich
data.
3. Referring to the paper of Kent and O’Quigley (1988), consider the approx-
imation suggested in that paper for the coefficient of randomness. Use this
4. Consider some large dataset in which there exist two prognostic groups.
Divide the time scale into m non-overlapping intervals, a0 = 0 < a1 < ... <
am = ∞. Calculate Pr (Ti > Tj |Zi , Zj , Tj > ak ) for all values of k less than
m and use this information to make inferences about the impact of group
effects through time.
5. Write down the likelihood for the piecewise exponential model in which,
between adjacent failure times, the hazard can be any positive value (Bres-
low, 1972). Find an expression for the cumulative hazard function and use
this to obtain an expression for the survivorship function. Although such
an estimator can be shown to be consistent (Breslow and Crowley, 1974),
explain why the usual large sample likelihood theory would fail to apply.
6. Consider again the two-group case. Suppose we are told that the survival
time of a given subject is less than t0 . We also know that the groups are
initially balanced. Derive an expression for the probability that the subject in
question belongs to group 1. For the Freireich data, given that a subject has
a survival time less than 15 weeks, estimate the probability that the subject
belongs to the group receiving placebo.
9. Use the results of Andersen and Gill (1982) to conclude that β̂ −β0 is asymp-
totically uncorrelated with Ŝ(t|Z ∈ H)|β0 .
10. On the basis of a large dataset construct some simple prognostic indices
using the most important risk factors. Divide the data into 3 groups based on the prognostic index and calculate the different survival estimates for each
subgroup. Comment on the different features of these estimators as observed
in this example. How would you investigate more closely the relative benefits
of the different estimators?
11. Show that when Ti and Tj have the same covariate values, then Pr (Ti >
Tj ) = 0.5.
12. For the Freireich data calculate the probability that a randomly chosen sub-
ject from the treated group lives longer than a randomly chosen subject from
the control group.
8.8. Outline of proofs
Theorem 8.2 We know how to estimate the asymptotic variance of β̂ under the
model. So all that remains for the second term on the right-hand side of (8.7) is
to calculate the partial derivative of Ŝ(t|Z ∈ H) with respect to β. For this we
have:
∂Ŝ(t|Z ∈ H)/∂β = { (Σ_{ti > t} Di)(Σ_{ti ≤ t} Ci) − (Σ_{ti ≤ t} Di)(Σ_{ti > t} Ci + B1) } / (Σ_{i=1}^k Ci + B1)².   (8.12)
The first term on the right-hand side of (8.7) can be estimated using Greenwood’s
formula
V̂ar{Ŝ(t|Z ∈ H)|β0} ≈ Ŝ(t|Z ∈ H)|²β0 { Π_{ti ≤ t} (1 + q̂i/(ni p̂i)) − 1 },   (8.13)

where

p̂i = P̂(T > ti |T > ti−1, Z ∈ H) = (Σ_{j=i+1}^k Cj + B1) / (Σ_{j=i}^k Cj + B1),

q̂i = 1 − p̂i and ni = Σ_j Yj(ti). Then each p̂i is a binomial probability based on a sample of size ni and Ŝ(t|Z ∈ H) = Π_{ti ≤ t} p̂i. The p̂i's may be treated
as conditionally independent given the ni ’s, with β0 fixed. Thus, Greenwood’s
formula applies. All the quantities involved in (8.12) and (8.13) are those routinely
calculated in a Cox model analysis.
P̂(Z ∈ H|t) = S^{(H)}(β̂, t)/S^{(0)}(β̂, t),   Eβ(Z|t; π^H) = S^{(H1)}(β̂, t)/S^{(H)}(β̂, t).

Using the main theorem of Section 7.5 we have s^{(H)}(β0, t)/s^{(0)}(β0, t) = P(Z ∈ H|t). Under the usual regularity and continuity conditions (Xu, 1996) it can be shown that {∂Ŝ(t∗|Z ∈ H)/∂β}|β=β̇ is asymptotically constant. Now β̂ − β0 = I^{-1}(β̌)U(β0) where β̌ is on the line segment between β̂ and β0, U(β) = ∂ log L(β)/∂β and I(β) = −∂U(β)/∂β. Combining these we have:

√n Ŝ(t∗|Z ∈ H) = √n Ŝ(t∗|Z ∈ H)|β0 + I^{-1}(β̌) √n U(β0) ∂Ŝ(t∗|Z ∈ H)/∂β |β=β̇.
Andersen and Gill (1982) show that I(β̌) converges in probability to a well-defined
population parameter. In the following theorem, Lin and Wei (1989) showed that
U (β0 ) is asymptotically equivalent to 1/n times a sum of i.i.d. random variables:
Theorem 8.4. (Lin and Wei, 1989) √n U(β0) is asymptotically equivalent to n^{-1/2} Σ_{i=1}^n ωi(β0), where Ni(t) = I{Ti ≤ t, Ti ≤ Ci} and

ωi(β) = ∫_0^1 { Zi − s^{(1)}(β, t)/s^{(0)}(β, t) } dNi(t) − ∫_0^1 Yi(t) e^{βZi} { Zi − s^{(1)}(β, t)/s^{(0)}(β, t) } λ0(t) dt.
It only then remains to show the asymptotic normality of Ŝ(t∗|Z ∈ H). We see that the numerator of Ŝ(t∗|Z ∈ H)|β0 is asymptotically equivalent to 1/n times a sum of n i.i.d. random variables like the above, since we know that its denominator is consistent for P(Z ∈ H). We drop the subscript β0 in Ŝ(t∗|Z ∈ H)|β0. The numerator of Ŝ(t∗|Z ∈ H) is ∫_{t∗}^∞ P̂(Z ∈ H|t) dF̂(t). Note that

√n { ∫_{t∗}^∞ P̂(Z ∈ H|t) dF̂(t) − P(Z ∈ H, T > t∗) }
 = √n ∫_{t∗}^∞ P(Z ∈ H|t) d{F̂(t) − F(t)}
 + √n ∫_{t∗}^∞ {P̂(Z ∈ H|t) − P(Z ∈ H|t)} d{F̂(t) − F(t)}
 + √n ∫_{t∗}^∞ {P̂(Z ∈ H|t) − P(Z ∈ H|t)} dF(t).
Now √n{F̂(t) − F(t)} converges in distribution to a zero-mean Gaussian process. Therefore the second term on the right-hand side of the preceding equation is op(1). The last term is A1 + op(1) (see also Lemma II.4 of Xu (1996)), where

A1 = √n ∫_{t∗}^1 { S^{(H)}(β0, t)/s^{(0)}(β0, t) − s^{(H)}(β0, t) S^{(0)}(β0, t)/s^{(0)}(β0, t)² } dF(t)
   = n^{-1/2} Σ_{i=1}^n ∫_{t∗}^1 { Qi/s^{(0)}(β0, t) − Yi(t) e^{β0 Zi} s^{(H)}(β0, t)/s^{(0)}(β0, t)² } dF(t).
As for the first term on the right-hand side of the equation preceding the
one immediately above, we use Theorem II.5 of Xu (1996). With φ(t) =
1[t∗ ,1] (t)P (Z ∈ H|t) in the theorem, the first term in this equation is equal
to
n
√
n−1/2 νi + nRn ,
i=1
where |Rn | = op (n−1/2 )and ν’s are i.i.d. with mean zero, each being a function
of Xi and δi . Thus the proof is complete.
Chapter 9
In this chapter we describe the regression effect process. This can be established in
different ways and provides all of the essential information that we need in order to
gain an impression of departures from some null structure, the most common null
structure corresponding to an absence of regression effect. Departures in specific
directions enable us to make inferences on model assumptions and can suggest,
of themselves, richer more plausible models. The regression effect process, in its
basic form, is much like a scatterplot for linear regression telling us, before any
formal statistical analysis, whether the dependent variable really does seem to
depend on the explanatory variable as well as the nature, linear or more complex,
of that relationship. Our setting is semi-parametric and the information on the
time variable is summarized by its rank within the time observations. We make
use of a particular time transformation and see that a great body of known theory
becomes available to us immediately. An important objective is to glean what
we can from a graphical presentation of the regression effect process. The two
chapters following this one—building test statistics for particular and general
situations, and robust, effective model-building—lean heavily on the results of
this chapter.
Figure 9.1: The regression effect process as a function of time for simulated data
under the proportional hazards model for different values of β.
under certain conditions. O’Quigley (2003) and O’Quigley (2008) indicate how
we can move back to a regression effect process on the original scale if needed.
While conceptually the multivariate situation is not really a greater challenge than
the univariate case, the number of possibilities increases greatly. For instance,
two binary covariates give rise to several processes; two marginal processes, two
conditional processes, and a process based on the combined effect as expressed
through the prognostic index. In addition we can construct processes for stratified
models. All of these processes have known limiting distributions when all model
assumptions are met. However, under departures from working assumptions, we
are faced with the problem of deciding which processes provide the most insight
into the collection of regression effects, taken as a whole or looked at individually
(Figure 9.1).
In the presentation of this chapter we follow the outline given by O’Quigley
(2003) and O’Quigley (2008). The key results that enable us to appeal to the form
of Brownian motion and the Brownian bridge are proven in O’Quigley (2003),
O’Quigley (2008), Chauvel and O’Quigley (2014, 2017), and Chauvel (2014).
The processes of interest to us are based on cumulative quantities having the
flavor of weighted residuals, i.e., the discrepancy between observations and some
mean effect determined by the model. The model in turn can be seen as a
mechanism that provides a prediction and this prediction can then be indexed
by one or more unknown parameters. Different large sample results depend on
what values we choose to use for these parameters. They may be values fixed
under an assumption given by a null hypothesis or they may be some or all of the
values provided by estimates from our data. In this latter case the process will
tell us something about goodness of fit (O’Quigley, 2003). In the former case the
process will tell us about different regression functions that can be considered to
be compatible with the observations.
9.3. Elements of the regression effect process
In the most basic case of a single binary covariate, the process developed below
in Section 9.4 differs from that of Wei (1984) in three very simple ways: the
sequential standardization of the cumulative process, the use of a transformed
time scale, and the direct interpolation that leads to the needed continuity. These
three features allow for a great deal of development, both on a theoretical and
practical level.
The process corresponds to a sum which can be broken down into a series of
sequential increments, each increment being the discrepancy between the obser-
vation and the expected value of this observation, conditional upon time s and the
history up to that point. It is assumed that the model generates the observations.
Following the last usable observation (by usable we mean that the conditional
variance of the covariable in the risk set is strictly greater than zero), the result-
ing quantity is the same as the score obtained by the first derivative of the log
likelihood before time transformation. For this reason the process is also referred
to as the score process. In view of the immediate generalization of the score
process and the use we will make of this process we prefer to refer to it as the
regression effect process.
The point of view via empirical processes has been adopted by other authors.
Arjas (1988) developed a process that is similar to that considered by Wei.
Barlow and Prentice (1988) developed a class of residuals by conditioning in
such a way—not unlike our own approach—that the residuals can be treated as
martingales. The term martingale residuals is used. In essence any weight that
we can calculate at time t, using only information available immediately prior
to that time, can be used without compromising the martingale property. Such
quantities can be calculated for each individual and subsequently summed. The
result is a weighted process that contains the usual score process as a particular
case (Lin et al., 1993; Therneau and Grambsch, 2000). The constructions of
all of these processes take place within the context of the proportional hazards
model. Extending the ideas to non-proportional hazards does not appear to be
very easy and we do not investigate it outside of our own construction based on
sequential standardization and a transformation of the time scale. Wei (1984)
was interested in model fit and showed how a global, rather than a sequential
standardization, could still result in a large sample result based on a Brownian
bridge.
Figure 9.2: Kaplan-Meier curves for the two age groups from the Curie Institute
breast cancer study. The regression effect process U ∗ (t) approximates that of
a linear drift while the transformation U ∗ (t) − tU ∗ (1) of the rightmost figure
appears well approximated by a Brownian bridge. This indicates a satisfactory fit.
hazards. All of this can be quantified and framed within more formal inferential
structures. The key point here is that before any formal inference takes place, the
eyeball factor—a quick glance at the Kaplan-Meier curves, the regression effect
process, and its transform—tells us a lot.
The empirical process we construct in the following sections allows us to
perform hypothesis tests on the value of the parameter and study the adequacy
of the non-proportional hazards model without having to estimate the regres-
sion coefficient. In addition, increments of the process are standardized at each
time point, rather than adopting an overall standardization for all increments as
in Wei (1984). This makes it possible to take into account multiple and cor-
related covariates because, unlike the non-standardized score process (9.1), the
correlation between covariates is not assumed to be constant over time. We plot
the process according to ranked times of failure, in the manner of Arjas (1988),
to obtain a simple and explicit asymptotic distribution. Before constructing the
process, we need to change the time scale. This transformation is crucial for
obtaining asymptotic results for the process without having to use the theorem
of Rebolledo (1980), nor the inequality of Lenglart (1977), which are classical
tools in survival analysis. In the following section, we present the time transfor-
mation used. In Section 9.4, we define the univariate regression effect process
and study its asymptotic behavior. In Section 9.5, we look at the multivariate
case. Section 9.8 gathers together the proofs of the theorems and lemmas that
are presented.
actual times of death and censored values are not used for inference; only their
order of occurrence counts. Among all possible monotonic increasing transforma-
tions we choose one of particular value. The order of observed times is preserved
and, by applying the inverse transformation, we can go back to the initial time
scale. Depending on how we carry out inference, there can be a minor, although
somewhat sticky, technical difficulty, in that the number of distinct, and usable,
failure times, depends on the data and is not known in advance. Typically we
treat this number as though it were known and, indeed, our advice is to do just
that in practice. This amounts to conditioning on the number of distinct and
usable failure times. We use the word “usable” to indicate that information on
regression effects is contained in the observations at these time points. In the case of 2 groups, for example, once one group has been extinguished, all failure times greater than the largest failure time in that group carry no information on regression effects.
Definition 9.1. With this in mind we define the effective sample size as

kn = Σ_{i=1}^n δi × Δ(Xi),   where Δ(t) = 1{Vβ(t)(Z|t) > 0}.
If the variance is zero at a time of death, this means that the set of individuals
at risk is composed of a homogeneous population sharing the same value of the
covariate. Clearly the outcome provides no information on differential rates as
quantified by the regression parameter β. This translates mathematically as zero
contribution of these times of death to inference, since the value of the covariate
of the individual who dies is equal to its expectation.
We want to avoid such variances from a technical point of view, because
later we will want to normalize by the square root of the variance. For example,
in the case of a comparison of two treatment groups, a null conditional variance
corresponds to the situation in which one group is now empty but individuals from
the other group are still present in the study. Intuitively, we understand that these
times of death do not contribute to estimation of the regression coefficient β since
no information is available to compare the two groups. Note that the nullity of
a conditional variance at a given time implies nullity at later time points. Thus,
according to the definition of kn , the conditional variances Vβ(t) (Z|t) calculated
at the first kn times of death t are strictly greater than zero. For a continuous covariate, when the last time point observed is a death, we have the equality (algebraic if conditioning, almost sure otherwise) kn = Σ_{i=1}^n δi − 1. In effect, the conditional variance Vβ(t)(Z|t) is zero at the last observed time point t, and this is almost surely the only time at which the set of at-risk individuals shares the same value of the covariate. If, for a continuous covariate, the last observed time point is censored, then we have kn = Σ_{i=1}^n δi.
φn^{-1}(t) = inf{Xi : φn(Xi) ≥ t, i = 1, . . . , n},  0 ≤ t ≤ 1.
Recall that the counting process {N̄ (t)}t∈[0,T ] has a jump of size 1 at each
time of death. Thus, on the new time scale representing the image of the
observed times {X1 , . . . , Xn } under φn , the values in the set {1/kn , 2/kn , . . . , 1}
correspond to times of death, and the ith time of death ti is such that
ti = i/kn , i = 1, . . . , kn where t0 = 0. The set {1/kn , 2/kn , . . . , 1} is included
in but is not necessarily equal to the transformed set of times of death.
Each time of death ti is an (Ft∗)t∈[0,1]-stopping time where, for t ∈ [0, 1], the σ-algebra Ft∗ is defined by

Ft∗ = σ{ Ni∗(u), Yi∗(u+), Zi ; i = 1, . . . , n ; u = 0, 1, . . . , ⌊tkn⌋ },

where ⌊·⌋ is the floor function and Yi∗(t+) = lim_{s→t+} Yi∗(s). Notice that if 0 ≤ s < t ≤ 1, Fs∗ ⊂ Ft∗.
Remark. Denote a ∧ b = min(a, b), for a, b ∈ R. If the covariates are time-dependent, for t ∈ [0, 1], the σ-algebra Ft∗ is defined by

Ft∗ = σ{ Ni∗(u), Yi∗(u+), Zi((φn^{-1}(u/kn) ∧ Xi)+) ; i = 1, . . . , n ; u = 0, 1, . . . , ⌊tkn⌋ }

since, as we recall, the covariate Zi(·) is not observed after time Xi, i = 1, . . . , n. To simplify notation in the following, we will write Zi(φn^{-1}(t)) in the place of Zi((φn^{-1}(t) ∧ Xi)+), t ∈ [0, 1]. We define the counting process associated with transformed times, with jumps of size 1 at times of death in the new scale, by

N̄∗(t) = Σ_{i=1}^n 1{φn(Xi) ≤ t, δi = 1},  0 ≤ t ≤ 1.
sup_{s∈[0,1]} | (1/kn) ∫_0^s An(t) dN̄∗(t) − ∫_0^s a(t) dt |  −→P  0, n → ∞.
The proofs are given at the end of the chapter. To ease notation, let us define
a process Z which is 0 everywhere except at times of death. At each such time,
the process takes the value of the covariate of the individual who fails at that
time.
Definition 9.3. We denote Z = {Z(t), t ∈ [0, 1]} a process such that

Z(t) = Σ_{i=1}^n Zi(Xi) 1{φn(Xi) = t, δi = 1},  t ∈ [0, 1].   (9.2)
The family of probabilities {πi (β(t), t), i = 1, . . . , n}, with t ∈ [0, T ] can be
extended to a non-proportional hazards model on the transformed scale as fol-
lows.
Definition 9.4. For i = 1, . . . , n and t ∈ [0, 1], the probability that individual
i dies at time t under the non-proportional hazards model with parameter
β(t), conditional on the set of individuals at risk at time of death t, is defined
by
πi(β(t), t) = Yi∗(t) exp{β(t)^T Zi(φn^{-1}(t))} 1{kn t ∈ N} / Σ_{j=1}^n Yj∗(t) exp{β(t)^T Zj(φn^{-1}(t))}.
Remark. The only values at which these quantities will be calculated are the
times of death t1 , . . . , tkn . We nevertheless define Z(t) and {πi (β(t), t), i =
1, . . . , n} for all t ∈ [0, 1] in order to be able to write their respective sums
as integrals with respect to the counting process N̄ ∗ . The values 0 taken
by them at times other than that of death are chosen arbitrarily as long
as they remain bounded by adjacent failures. Note also that at t0 = 0, we
have Z(0) = 0. Expectations and variances with respect to the family of
probabilities {πi (β(t), t), i = 1, . . . , n} can now be defined.
Definition 9.5. Let t ∈ [0, 1]. The expectation and variance of Z with respect
to the family of probabilities {πi (β(t), t), i = 1, . . . , n} are respectively a vector
in Rp and a matrix in Mp×p (R) such that
Eβ(t)(Z | t) = Σ_{i=1}^n Zi(φn^{-1}(t)) πi(β(t), t),   (9.3)

and

Vβ(t)(Z | t) = Σ_{i=1}^n Zi(φn^{-1}(t))^{⊗2} πi(β(t), t) − Eβ(t)(Z | t)^{⊗2}.   (9.4)
It can be shown that the Jacobian matrix of Eβ(t)(Z | t) is ∂Eβ(t)(Z | t)/∂β = Vβ(t)(Z | t), for t ∈ [0, 1].
Proposition 9.2. Let t ∈ [0, 1]. Under the non-proportional hazards model
with parameter β(t), the expectation Eβ(t) (Z(t) | Ft∗− ) and variance
Vβ(t) (Z(t) | Ft∗− ) of the random variable Z(t) given the σ-algebra Ft∗− are
where, by using two parameter functions, α(t) and β(t), we allow ourselves
increased flexibility. The α(t) concerns the variance and allowing this to not be
tied to β(t) may be of help in some cases where we would like estimates of
the variance to be valid both under the null and the alternative. This would be
similar to what takes place in an analysis of variance where the residual variance
is estimated in such a way as to remain consistent both under the null and the
alternative. Our study on this, in this particular context, is very limited, and this
could be something of an open problem. For the remainder of this text we will
suppose that α(t) = β(t) and we will consequently write Un∗ (β(tj ), tj ) as only
having two arguments.
noting that all of the ingredients in the above expression are obtained rou-
tinely from all of the currently available software packages.
We can now define the process Un∗ on [0, 1] by linearly interpolating the
kn + 1 random variables {Un∗ (β(tj ), tj ), j = 0, 1, . . . , kn }.
Definition 9.7. The standardized score process evaluated at β(t) is defined
by {Un∗ (β(t), t), t ∈ [0, 1]}, where for j = 0, . . . , kn and t ∈ [tj , tj+1 [,
Un∗ (β(t), t) = Un∗ (β(tj ), tj ) + (tkn − j) {Un∗ (β(tj+1 ), tj+1 ) − Un∗ (β(tj ), tj )} .
By definition, the process Un∗ (β(·), ·) is continuous on the interval [0, 1]. The pro-
cess depends on two parameters: the time t and regression coefficient β(t). For
the sake of clarity, we recall that the temporal function β : t → β(t) is denoted
β(t), which will make it easier to distinguish the proportional hazards model
with parameter β from the non-proportional hazards model with parameter β(t).
Under the non-proportional hazards model with β(t), the increments of the pro-
cess Un∗ are centered with variance 1. Moreover, the increments are uncorrelated,
as shown in the following proposition.
Proposition 9.3. (Cox 1975). Let Dn∗ (β(tj ), tj ) = Un∗ (β(tj ), tj ) − Un∗ (β(tj−1 ),
tj−1 ). Under the non-proportional hazards model with parameter β(t), the ran-
dom variables Dn∗ (β(tj ), tj ) for j = 1, 2, . . . , kn are uncorrelated.
The uncorrelated property of the increments is also used in the calculation of the log-rank statistic. All of these properties together make it possible to obtain our
essential convergence results for the process Un∗ (β(t), t).
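To fix ideas, the following is our own minimal computation of the process Un∗(β0, ·) for a single covariate on the transformed time scale, in the spirit of the construction leading to Definition 9.7; the array names are assumptions and ties are handled crudely.

```python
import numpy as np

def regression_effect_process(time, event, z, beta0=0.0):
    """Standardized score (regression effect) process U_n*(beta0, .): cumulative sum,
    over usable failure times, of the standardized discrepancies between the covariate
    of the failing subject and its conditional expectation in the risk set, divided by
    sqrt(k_n) and indexed by j / k_n on the transformed time scale."""
    order = np.argsort(time)
    time, event, z = time[order], event[order], z[order]
    increments = []
    for i in np.where(event == 1)[0]:
        at_risk = time >= time[i]
        w = np.exp(beta0 * z[at_risk])
        pi = w / w.sum()
        e = (pi * z[at_risk]).sum()
        v = (pi * z[at_risk] ** 2).sum() - e ** 2
        if v > 0:                                  # only "usable" failure times count
            increments.append((z[i] - e) / np.sqrt(v))
    kn = len(increments)
    grid = np.arange(kn + 1) / kn
    path = np.concatenate(([0.0], np.cumsum(increments))) / np.sqrt(kn)
    return grid, path

# Under beta(t) = beta0 the plotted path should resemble standard Brownian motion;
# an effect away from beta0 shows up as an approximately linear drift.
```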
and

Vβ(t)(Z | t) = ∂²/∂b² log S^{(0)}(b, t) |_{b=β(t)} = S^{(2)}(β(t), t)/S^{(0)}(β(t), t) − { S^{(1)}(β(t), t)/S^{(0)}(β(t), t) }².
A1 (Asymptotic stability). There exists some δ1 > 0, a neighborhood B = {γ : sup_{t∈[0,1]} |γ(t) − β(t)| < δ1} of β of radius δ1 containing the zero function, and functions s^{(r)} defined on B × [0, 1] for r = 0, 1, 2, such that

√n sup_{t∈[0,1], γ∈B} | S^{(r)}(γ(t), t) − s^{(r)}(γ(t), t) |  −→P  0, n → ∞.   (9.6)
A3 (Homoscedasticity). For any t ∈ [0, 1] and γ ∈ B, we have ∂v(γ(t), t)/∂t = 0.

A4 (Uniformly bounded covariates). There exists L ∈ R∗+ such that
The first two conditions are standard in survival analysis. They were introduced by
Andersen and Gill (1982) in order to use theory from counting processes and mar-
tingales such as the inequality of Lenglart (1977) and the theorem of Rebolledo
(1980). The variance Vβ(t) (Z | t) is, by definition, an estimator of the variance
of Z given T = t under the non-proportional hazards model with parameter β(t).
Thus, the homoscedasticity condition A3 means that the asymptotic variance is
independent of time for parameters close to the true regression coefficient β(t).
This condition is often implicitly encountered in the use of the proportional
hazards model. This is the case, for example, in the estimation of the variance
of the parameter β, or in the expression for the log-rank statistic, where the
contribution to the overall variance is the same at each time of death. Indeed,
the overall variance is an unweighted sum of the conditional variances. The
temporal stability of the variance has been pointed out by several authors, notably
Grambsch and Therneau (1994), Xu (1996), and Xu and O’Quigley (2000). Next,
we need two lemmas. Proofs are given at the end of the chapter.
Lemma 9.2. Under conditions A1 and A3, for all t ∈ [0, 1] and γ ∈ B,
v(γ(t), t) > 0.
Lemma 9.3. Under hypotheses A1, A2, and A3, and if kn = kn(β0), for all γ1 ∈ B, there exist constants C(γ1) and C(β0) ∈ R∗+ such that

√n sup_{t∈{t1,t2,...,tkn}} | Vγ1(t)(Z | t) − C(γ1) |  −→P  0,   (9.8)

√n sup_{t∈{t1,t2,...,tkn}} | Vγ1(t)(Z | t)/Vβ0(Z | t) − C(γ1)/C(β0) |  −→P  0,   (9.9)

√n sup_{t∈{t1,t2,...,tkn}} | Vγ1(t)(Z | t)^{1/2}/Vβ0(Z | t)^{1/2} − C(γ1)^{1/2}/C(β0)^{1/2} |  −→P  0.   (9.10)
The proof, given at the end of the chapter, is built on the use of the functional
central limit theorem for martingale differences introduced by Helland (1982).
This theorem can be seen as an extension of the functional central limit theo-
rem of Donsker (1951) to standardized though not necessarily independent nor
identically distributed variables. Figure 9.3(a) shows the result of simulating a
standardized score process Un∗ (0, t) as a function of time t, when the true param-
eter is constant. The immediate visual impression would lead us to see Brownian
motion—standard or with linear drift—as a good approximation. We will study
how to calculate this approximation on a given sample in Chapters 10 and 11.
where β̃ is in the ball with center β and radius sup_{t∈[0,1]} |β(t) − β0|. Then, there exist two constants C1(β, β0) and C2 ∈ R∗+ such that C1(β, β) = 1 and

Un∗(β0, ·) − √kn An  −→L  C1(β, β0) W,  n → ∞.   (9.12)
Figure 9.3: The process Un∗ (0, ·) as a function of time for simulated data under
the proportional hazards model. For left-hand figure, β = 1.0. For right-hand
figure, the absence of effects manifests itself as approximate standard Brownian
motion.
Furthermore,
sup_{0≤t≤1} | An(t) − C2 ∫_0^t {β(s) − β0} ds |  −→P  0, n → ∞.   (9.13)
Figure 9.4: The process Un∗ (0, ·) as a function of time for simulated data under the
non-proportional hazards model with different values of β(t). Temporal effects
are clearly reflected in the process on the transformed time scale.
some point τ , piecewise constant over time, increase or decrease continuously,
and so on. In all these situations, the cumulative effect ∫ β(s)ds will be seen in the
the drift of the process (Theorem 9.2) and will, of itself, describe to us the nature
of the true effects. For example, Figure 9.4(a) shows a simulated process with a
piecewise-constant effect β(t) = 1t≤0.5 . Before t = 0.5, we see a positive linear
trend corresponding to β = 1, and for t > 0.5, the coefficient becomes zero and
the drift of the process Un∗ (0, ·) is parallel to the time axis. Figure 9.4(b) shows
the standardized score process for simulated data under a failure time model with regression function β(t) = 1{t≤1/3} + 0.5 · 1{t≥2/3}. The drift of the process can be
separated into three linear sections reflecting the effect’s strength, each of which
can be approximated by a straight line. The slope of the first line appears to be
twice that of the last, while the slope of the second line is zero.
In summary, whether the effect corresponds to proportional or non-proportional
hazards, all information about the regression coefficient β(t) can be found in the
process Un∗ (0, ·). We can then use this process to test the value of the regres-
sion coefficient and evaluate the adequacy of the proportional hazards model. A
simple glance at the regression effect process is often all that is needed to let
us know that there exists an effect and the nature of that effect. More precise
inference can then follow. Before delving into these aspects in Chapters 10 and
11, we consider more closely the standardized score process for the multivariate
case. As mentioned earlier the univariate process, even in the multivariate setting,
will enable us to answer most of the relevant questions, such as the behavior of
many variables combined in a prognostic index, or, for example, the impact of
some variable after having controlled for the effects of other variables, whether
via the model or via stratification.
9.5. Regression effect processes for several covariates
Here we extend the regression effect process of the previous section to the case
of several covariates grouped in a vector Z(t) of dimension p > 1. Suppose that
we have the non-proportional hazards model with regression function β(t) and a
vector of covariates Z(t) which are functions from [0, 1] to Rp . Following Chauvel
(2014), let t ∈ [0, 1] and
\[
S^{(0)}(\beta(t),t) = n^{-1}\sum_{i=1}^n Y_i^*(t)\exp\big\{\beta(t)^T Z_i(\phi_n^{-1}(t))\big\}\ \in\mathbb R,
\]
\[
S^{(1)}(\beta(t),t) = n^{-1}\sum_{i=1}^n Y_i^*(t)\,Z_i(\phi_n^{-1}(t))\exp\big\{\beta(t)^T Z_i(\phi_n^{-1}(t))\big\}\ \in\mathbb R^p,
\]
\[
S^{(2)}(\beta(t),t) = n^{-1}\sum_{i=1}^n Y_i^*(t)\,Z_i(\phi_n^{-1}(t))^{\otimes 2}\exp\big\{\beta(t)^T Z_i(\phi_n^{-1}(t))\big\}\ \in\mathcal M_{p\times p}(\mathbb R).
\]
For t ∈ {t1, ..., tkn}, the conditional expectation Eβ(t)(Z | t) and the conditional variance-covariance matrix Vβ(t)(Z | t) defined in (9.3) and (9.4) can be written as functions of S^(r)(β(t), t) (r = 0, 1, 2):
\[
E_{\beta(t)}(Z\mid t) = \frac{S^{(1)}(\beta(t),t)}{S^{(0)}(\beta(t),t)},\qquad
V_{\beta(t)}(Z\mid t) = \frac{S^{(2)}(\beta(t),t)}{S^{(0)}(\beta(t),t)} - \left(\frac{S^{(1)}(\beta(t),t)}{S^{(0)}(\beta(t),t)}\right)^{\otimes 2}.
\]
By the definition of k̂n = k̂n(β) in the multivariate case, and since kn ≤ k̂n almost surely (Equation (9.23)), the conditional variance-covariance matrices Vβ(t)(Z | t) calculated at the first kn times of death are positive definite, and
\[
\forall\, t\in\{t_1,\ldots,t_{k_n}\},\qquad \big\|V_{\beta(t)}(Z\mid t)^{-1}\big\|\le C_V^{-1}\quad\text{a.s.}
\]
Here and below, for a function a = (a1, ..., ap), we use the componentwise notation
\[
\int_0^{t_j}a(s)\,d\bar N^*(s) = \Big(\int_0^{t_j}a_1(s)\,d\bar N^*(s),\ \ldots,\ \int_0^{t_j}a_p(s)\,d\bar N^*(s)\Big).
\]
Remark. As in the univariate case, the random variable Un∗(β(tj), tj) can be written as a sum:
\[
U_n^*(\beta(t_j),t_j) = \frac{1}{\sqrt{k_n}}\sum_{i=1}^{j}V_{\beta(t_i)}(Z\mid t_i)^{-1/2}\big\{Z(t_i)-E_{\beta(t_i)}(Z\mid t_i)\big\}.
\]
Definition 9.9. The multivariate standardized score process {Un∗(β(t), t), t ∈ [0, 1]} evaluated at β(t) is such that, for j = 0, 1, ..., kn and t ∈ [tj, tj+1[,
\[
U_n^*(\beta(t),t) = U_n^*(\beta(t_j),t_j) + (t\,k_n-j)\,\big\{U_n^*(\beta(t_{j+1}),t_{j+1})-U_n^*(\beta(t_j),t_j)\big\}.
\]
These conditions are extensions of hypotheses A1, A2, A3, and A4 from Section
9.4 to the multiple covariates case. We make the additional assumption that the
conditional variance-covariance matrices do not depend on the parameter when it
is close to the true regression parameter. This is a technical assumption required
due to the use of the multidimensional Taylor-Lagrange inequality in the proofs.
In the univariate case, we do not need this hypothesis thanks to the availability
of a Taylor-Lagrange equality.
Proposition 9.4 (Chauvel, 2014). Under hypotheses B1, B3, and B4, we have the following:
\[
\sqrt n\sup_{t\in\{t_1,\ldots,t_{k_n}\},\ \gamma\in\mathcal B}\big\|V_{\gamma(t)}(Z\mid t)-\Sigma\big\|\ \xrightarrow[n\to\infty]{P}\ 0. \qquad (9.16)
\]
The proposition is proved at the end of the chapter. The following theorem gives the limit behavior of the multivariate regression effect process. Let a = (a1, ..., ap) be a function from [0, 1] to R^p. Denote
\[
\int_0^t a(s)\,ds = \Big(\int_0^t a_1(s)\,ds,\ \ldots,\ \int_0^t a_p(s)\,ds\Big),\qquad 0\le t\le 1.
\]
Suppose γ ∈ B. Let Un∗(γ, ·) be the multivariate regression effect process calculated at γ. To construct this process we consider kn = kn(γ), which can be estimated by k̂n(γ). Recall that kn(γ) ≤ k̂n(γ) almost surely (Equation 9.23) and, by the definition of kn(γ) in the multivariate case, we have:
\[
\forall\, t\in\{t_1,\ldots,t_{k_n}\},\qquad \big\|V_{\gamma(t)}(Z\mid t)^{-1}\big\|\le C_V^{-1}\quad\text{a.s.} \qquad (9.17)
\]
Define
\[
B_n(t) = \frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}V_{\beta_0}(Z\mid t_i)^{-1/2}\big\{E_{\beta(t_i)}(Z\mid t_i)-E_{\beta_0}(Z\mid t_i)\big\}.
\]
Furthermore,
\[
\sup_{t\in[0,1]}\Big\|B_n(t)-\Sigma^{1/2}\int_0^t\{\beta(s)-\beta_0\}\,ds\Big\|\ \xrightarrow[n\to\infty]{P}\ 0. \qquad (9.19)
\]
Each component of the drift √kn Σ^{1/2}∫₀ᵗβ(s)ds represents a linear combination of the cumulative effects ∫₀ᵗβ1(s)ds, ..., ∫₀ᵗβp(s)ds, except when the variance-covariance matrix Σ is diagonal. Under the theorem's hypotheses, a direct consequence of Equation 9.16 is
\[
\hat\Sigma := \frac{1}{k_n}\sum_{i=1}^{k_n}V_{\beta_0}(Z\mid t_i)\ \xrightarrow[n\to\infty]{P}\ \Sigma, \qquad (9.20)
\]
and
\[
\sup_{t\in[0,1]}\Big\|\Sigma^{-1/2}B_n(t)-\int_0^t\{\beta(s)-\beta_0\}\,ds\Big\|\ \xrightarrow[n\to\infty]{P}\ 0.
\]
Thus, when there are several covariates, the drifts of the process Σ̂^{-1/2}Un∗(β0, ·) help us to make inference on the regression coefficients. Each drift represents a cumulative coefficient and inference is done separately for each drift, as in the univariate case.
attention to some finicky details. We consider this more closely here. Let us suppose that there exists α0 ∈ ]0, 1] such that
\[
\frac{\hat k_n(\beta)}{n}\ \xrightarrow[n\to\infty]{\text{a.s.}}\ \alpha_0,
\]
together with a sequence (a_n)_n such that
\[
a_n\ge 1,\qquad a_n\,\frac{\hat k_n(\beta)}{n}\ \ge\ \alpha_0\quad\text{a.s.} \qquad (9.21)
\]
This would imply in particular that k̂n(β) tends to infinity when n tends to infinity, i.e., the number of deaths increases without bound as the sample size increases. If Z is a discrete variable, this means that for a sample of infinite size, no group will run out of individuals. If Z is continuous, k̂n(β) = Σᵢ₌₁ⁿ δᵢ − 1 or k̂n(β) = Σᵢ₌₁ⁿ δᵢ almost surely. If so, n^{-1}k̂n(β) is an estimator that converges to P(T ≤ C) and α0 = P(T ≤ C). The law of the iterated logarithm then allows us to construct a sequence (a_n) satisfying Equation 9.21.
Thus, for all ε > 0 there exists Nε ∈ N* such that, for all n ≥ Nε,
\[
\frac{\sqrt n\,\big\{n^{-1}\hat k_n(\beta)-\alpha_0\big\}}{\sqrt{2\alpha_0(1-\alpha_0)\log\log n}}\ \ge\ -1-\varepsilon\quad\text{a.s.},
\]
so that
\[
\frac{\hat k_n(\beta)}{n}\ \ge\ \alpha_0 - \frac{3}{\sqrt{2n}}\sqrt{\alpha_0(1-\alpha_0)\log\log n},
\]
and by setting
\[
a_n = \left(1 - \frac{3}{\sqrt{2\alpha_0 n}}\sqrt{(1-\alpha_0)\log\log n}\right)^{-1}, \qquad (9.22)
\]
we find n^{-1}k̂n(β)·a_n ≥ α0. From Equation 9.22, a_n ≥ 1 and we therefore have lim_{n→∞} a_n = 1. We have thus constructed a sequence (a_n) which satisfies Equation 9.21.
The random variable k̂n(β) is known only at the end of data collection. It cannot be used during the experiment to construct a predictable process (in the sense of an increasing family of σ-algebras as described below). To overcome this we define the deterministic quantity kn(β) = n·a_n^{-1}·α0, in order to obtain theoretical results. We have the almost sure convergence
\[
\frac{k_n(\beta)}{\hat k_n(\beta)}\ \xrightarrow[n\to\infty]{}\ 1,
\]
and Equation 9.21 implies that there exists N such that
\[
\forall\, n\ge N,\qquad \hat k_n(\beta)\ \ge\ k_n(\beta)\quad\text{a.s.} \qquad (9.23)
\]
√ √
7. Let Xᵢᶜ take the value Xᵢ/√2.5 for i = 1, ..., 10; the value (Xᵢ − 0.25)/√2.5 for i = 11, ..., 20; and the value (Xᵢ − 0.50)/√2.5 otherwise. Plot these series of values against i. What conclusions can be drawn? Calculate Sᵢ⁰ = Sᵢ − i × S₃₀/30. What can be said about this process?
9.8. Outline of proofs

We provide here some detail of the proofs needed to support the main findings of this chapter. Broad outlines for these proofs have been given in Chauvel and O'Quigley (2014, 2017) and further details can be found in Chauvel (2014).
Proposition 9.1. The counting process N̄ on the initial time scale takes values in N; thus, for all u ∈ [0, T], N̄(u) ∈ N. The inverse process is such that N̄^{-1}(j) is the jth ordered time of death, j = 1, ..., kn. Note that N̄(N̄^{-1}(j)) = j. Thus, by the definition of N̄∗, we have, for 0 ≤ t ≤ 1,
\[
\bar N^*(t) = \sum_{i=1}^n 1_{\{\bar N(X_i)\le \lfloor k_n t\rfloor,\ \delta_i=1\}} = \sum_{i=1}^n 1_{\{X_i\le \bar N^{-1}(\lfloor k_n t\rfloor),\ \delta_i=1\}}.
\]
Suppose also that s ∈ [0, 1]. For the first term we have:
\[
\frac{1}{k_n}\int_0^s\{A_n(t)-a(t)\}\,d\bar N^*(t) = \frac{1}{k_n}\sum_{i=1}^{\lfloor k_n s\rfloor}\{A_n(t_i)-a(t_i)\},
\]
so that
\[
\Big|\frac{1}{k_n}\int_0^s\{A_n(t)-a(t)\}\,d\bar N^*(t)\Big|\ \le\ \frac{\lfloor k_n s\rfloor}{k_n}\sup_{i=1,\ldots,\lfloor k_n s\rfloor}\big|A_n(t_i)-a(t_i)\big|\ \le\ \sup_{i=1,\ldots,k_n}\big|A_n(t_i)-a(t_i)\big|.
\]
Therefore,
\[
\sup_{0\le s\le 1}\Big|\frac{1}{k_n}\int_0^s\{A_n(t)-a(t)\}\,d\bar N^*(t)\Big|\ \le\ \sup_{i=1,\ldots,k_n}\big|A_n(t_i)-a(t_i)\big|,
\]
and sup_{i=1,...,kn}|A_n(t_i) − a(t_i)| → 0 in probability as n → ∞. As a result,
\[
\lim_{n\to\infty}P\left(\sup_{0\le s\le 1}\Big|\frac{1}{k_n}\int_0^s\{A_n(t)-a(t)\}\,d\bar N^*(t)\Big|>\frac{\varepsilon}{2}\right)=0.
\]
For the second term on the right-hand side of Equation 9.24 we have:
\[
\sup_{0\le s\le 1}\Big|\frac{1}{k_n}\int_0^s a(t)\,d\bar N^*(t)-\int_0^s a(t)\,dt\Big| = \sup_{0\le s\le 1}\Big|\frac{1}{k_n}\sum_{i=1}^{\lfloor k_n s\rfloor}a(t_i)-\int_0^s a(t)\,dt\Big|.
\]
Note that a_n is piecewise constant on the interval [0, 1], with a_n(t) = a(t_i) for t ∈ [t_i, t_{i+1}[, i = 0, ..., kn. We then have:
\[
\frac{1}{k_n}\sum_{i=1}^{\lfloor k_n s\rfloor}a(t_i) = \int_0^s a_n(t)\,dt - \Big(s-\frac{\lfloor k_n s\rfloor}{k_n}\Big)\,a\Big(\frac{\lfloor k_n s\rfloor}{k_n}\Big).
\]
As a result we have:
\[
\sup_{0\le s\le 1}\Big|\frac{1}{k_n}\int_0^s a(t)\,d\bar N^*(t)-\int_0^s a(t)\,dt\Big|
\le \sup_{0\le s\le 1}\Big|\int_0^s a_n(t)\,dt-\int_0^s a(t)\,dt\Big| + \sup_{0\le s\le 1}\Big|\Big(s-\frac{\lfloor k_n s\rfloor}{k_n}\Big)a\Big(\frac{\lfloor k_n s\rfloor}{k_n}\Big)\Big|
\]
\[
\le \sup_{0\le s\le 1}\Big|\int_0^s a_n(t)\,dt-\int_0^s a(t)\,dt\Big| + \frac{1}{k_n}\sup_{0\le s\le 1}\Big|a\Big(\frac{\lfloor k_n s\rfloor}{k_n}\Big)\Big|.
\]
Note that lim_{n→∞} sup_{s∈[0,1]}|a_n(s) − a(s)| = 0 by the uniform continuity of the function a. This implies the uniform convergence
\[
\lim_{n\to\infty}\sup_{s\in[0,1]}\Big|\int_0^s a_n(t)\,dt-\int_0^s a(t)\,dt\Big|=0.
\]
Furthermore, a is bounded so that sup_{0≤s≤1}|a(⌊k_n s⌋/k_n)| < ∞, which shows the result since
\[
\lim_{n\to\infty}\sup_{0\le s\le 1}\Big|\frac{1}{k_n}\int_0^s a(t)\,d\bar N^*(t)-\int_0^s a(t)\,dt\Big|=0. \qquad (9.25)
\]
Proposition 9.2. Let t ∈ [0, 1]. The expectation of Z(t) (defined in (9.2)) conditional on the σ-algebra F∗_{t⁻} is
\[
E_{\beta(t)}\big(Z(t)\mid\mathcal F^*_{t^-}\big)
= \sum_{i=1}^n E_{\beta(t)}\big(Z_i(X_i)\,1_{\{\phi_n(X_i)=t,\ \delta_i=1\}}\mid\mathcal F^*_{t^-}\big)
= \sum_{i=1}^n E_{\beta(t)}\big(Z_i(\phi_n^{-1}(t))\,1_{\{\phi_n(X_i)=t,\ \delta_i=1\}}\mid\mathcal F^*_{t^-}\big)
\]
\[
= \sum_{i=1}^n Z_i(\phi_n^{-1}(t))\,E_{\beta(t)}\big(1_{\{\phi_n(X_i)=t,\ \delta_i=1\}}\mid\mathcal F^*_{t^-}\big)
= \sum_{i=1}^n Z_i(\phi_n^{-1}(t))\,P_{\beta(t)}\big(\phi_n(X_i)=t,\ \delta_i=1\mid\mathcal F^*_{t^-}\big).
\]
Indeed, Z and φ_n^{-1} are left continuous, so Z_i(φ_n^{-1}(t)) = Z_i(φ_n^{-1}(t⁻)) is F∗_{t⁻}-measurable. The same reasoning for the second-order moment gives the variance result.
\[
E\Big(V_{\beta(t_j)}(Z\mid t_j)^{-1/2}\big\{Z(t_j)-E_{\beta(t_j)}(Z\mid t_j)\big\}\ \Big|\ \mathcal F^*_{t_j^-}\Big)
= V_{\beta(t_j)}(Z\mid t_j)^{-1/2}\,E\big(Z(t_j)-E_{\beta(t_j)}(Z\mid t_j)\mid\mathcal F^*_{t_j^-}\big)=0.
\]
Thus,
\[
E\Big(V_{\beta(t_i)}(Z\mid t_i)^{-1/2}\big\{Z(t_i)-E_{\beta(t_i)}(Z\mid t_i)\big\}\,V_j^Z\,\big\{Z(t_j)-E_{\beta(t_j)}(Z\mid t_j)\big\}\Big)
\]
\[
= E\Big(E\Big(V_{\beta(t_i)}(Z\mid t_i)^{-1/2}\big\{Z(t_i)-E_{\beta(t_i)}(Z\mid t_i)\big\}\,V_j^Z\,\big\{Z(t_j)-E_{\beta(t_j)}(Z\mid t_j)\big\}\ \Big|\ \mathcal F^*_{t_j^-}\Big)\Big)
\]
\[
= E\Big(V_{\beta(t_i)}(Z\mid t_i)^{-1/2}\big\{Z(t_i)-E_{\beta(t_i)}(Z\mid t_i)\big\}\ E\big(V_j^Z\{Z(t_j)-E_{\beta(t_j)}(Z\mid t_j)\}\mid\mathcal F^*_{t_j^-}\big)\Big)=0.
\]
Lemma 9.2. Appealing to the law of large numbers, condition A1 implies that for r = 0, 1, 2, we have:
where
\[
\Phi_{\gamma(0)}(z) = \frac{\exp\{\gamma(0)z\}}{E\big(\exp\{\gamma(0)Z(0)\}\big)},\qquad z\in\mathbb R.
\]
Furthermore, denoting by P_{Z(0)} the distribution function of the random variable Z(0), and by X a random variable having distribution function P_X such that dP_X(x) = Φ_{γ(0)}(x)dP_{Z(0)}(x), x ∈ R, we have:
Lemma 9.3. Assume that conditions A1 and A2 hold and that γ1 ∈ B. Recall the definition of v(γ1(·), ·). We write supₖⁿ to denote √n sup_{t∈{t1,...,tkn}}, so that we have:
\[
\sup{}^n_k\big|V_{\gamma_1(t)}(Z\mid t)-v(\gamma_1(t),t)\big|
= \sup{}^n_k\left|\left\{\frac{S^{(2)}(\gamma_1(t),t)}{S^{(0)}(\gamma_1(t),t)}-\Big(\frac{S^{(1)}(\gamma_1(t),t)}{S^{(0)}(\gamma_1(t),t)}\Big)^2\right\}
-\left\{\frac{s^{(2)}(\gamma_1(t),t)}{s^{(0)}(\gamma_1(t),t)}-\Big(\frac{s^{(1)}(\gamma_1(t),t)}{s^{(0)}(\gamma_1(t),t)}\Big)^2\right\}\right|
\]
\[
\le \sqrt n\sup_{t\in[0,1]}\left|\left\{\frac{S^{(2)}(\gamma_1(t),t)}{S^{(0)}(\gamma_1(t),t)}-\Big(\frac{S^{(1)}(\gamma_1(t),t)}{S^{(0)}(\gamma_1(t),t)}\Big)^2\right\}
-\left\{\frac{s^{(2)}(\gamma_1(t),t)}{s^{(0)}(\gamma_1(t),t)}-\Big(\frac{s^{(1)}(\gamma_1(t),t)}{s^{(0)}(\gamma_1(t),t)}\Big)^2\right\}\right|
\ \le\ V_n + W_n,
\]
where
\[
V_n = \sqrt n\sup_{t\in[0,1]}\left|\frac{S^{(2)}(\gamma_1(t),t)}{S^{(0)}(\gamma_1(t),t)}-\frac{s^{(2)}(\gamma_1(t),t)}{s^{(0)}(\gamma_1(t),t)}\right|,
\qquad
W_n = \sqrt n\sup_{t\in[0,1]}\left|\Big(\frac{S^{(1)}(\gamma_1(t),t)}{S^{(0)}(\gamma_1(t),t)}\Big)^2-\Big(\frac{s^{(1)}(\gamma_1(t),t)}{s^{(0)}(\gamma_1(t),t)}\Big)^2\right|.
\]
We can study the large sample behavior of these two terms separately. Denote by m0, M1, and M2 strictly positive constants such that s^(0)(γ(t), t) ≥ m0 and |s^(i)(γ(t), t)| ≤ Mi (i = 1, 2) for all t ∈ [0, 1] and γ ∈ B. Their existence follows from condition A2. We have:
\[
V_n \le \sqrt n\sup_{t\in[0,1]}\frac{1}{S^{(0)}(\gamma_1(t),t)}\big|S^{(2)}(\gamma_1(t),t)-s^{(2)}(\gamma_1(t),t)\big|
+ \sqrt n\sup_{t\in[0,1]}\big|s^{(2)}(\gamma_1(t),t)\big|\left|\frac{1}{S^{(0)}(\gamma_1(t),t)}-\frac{1}{s^{(0)}(\gamma_1(t),t)}\right|
\]
\[
\le \sqrt n\,\sup_{t\in[0,1]}\frac{1}{S^{(0)}(\gamma_1(t),t)}\ \sup_{t\in[0,1]}\big|S^{(2)}(\gamma_1(t),t)-s^{(2)}(\gamma_1(t),t)\big|
+ M_2\sqrt n\sup_{t\in[0,1]}\left|\frac{1}{S^{(0)}(\gamma_1(t),t)}-\frac{1}{s^{(0)}(\gamma_1(t),t)}\right|. \qquad (9.26)
\]
Furthermore,
\[
\sqrt n\sup_{t\in[0,1],\ \gamma\in\mathcal B}\left|\frac{1}{S^{(0)}(\gamma(t),t)}-\frac{1}{s^{(0)}(\gamma(t),t)}\right|
= \sqrt n\sup_{t\in[0,1],\ \gamma\in\mathcal B}\left|\frac{s^{(0)}(\gamma(t),t)-S^{(0)}(\gamma(t),t)}{S^{(0)}(\gamma(t),t)\,s^{(0)}(\gamma(t),t)}\right|
\]
\[
\le m_0^{-1}\,\sqrt n\sup_{t\in[0,1],\ \gamma\in\mathcal B}\big|s^{(0)}(\gamma(t),t)-S^{(0)}(\gamma(t),t)\big|\ \sup_{t\in[0,1],\ \gamma\in\mathcal B}\frac{1}{S^{(0)}(\gamma(t),t)}.
\]
From Equation 9.6, √n sup_{t∈[0,1],γ∈B}|s^(0)(γ(t), t) − S^(0)(γ(t), t)| = o_P(1), and since s^(0) is bounded below uniformly by m0, we have sup_{t∈[0,1],γ∈B} S^(0)(γ(t), t)^{-1} = O_P(1), which indicates that this term is bounded after a certain rank with high probability. As a result,
\[
\sqrt n\sup_{t\in[0,1],\ \gamma\in\mathcal B}\left|\frac{1}{S^{(0)}(\gamma(t),t)}-\frac{1}{s^{(0)}(\gamma(t),t)}\right|\ \xrightarrow[n\to\infty]{P}\ 0. \qquad (9.27)
\]
Moreover, since sup_{t∈[0,1],γ∈B} S^(0)(γ(t), t)^{-1} = O_P(1), we have the convergence
\[
\sqrt n\sup_{t\in[0,1]}\frac{1}{S^{(0)}(\gamma_1(t),t)}\ \sup_{t\in[0,1]}\big|S^{(2)}(\gamma_1(t),t)-s^{(2)}(\gamma_1(t),t)\big|\ \xrightarrow[n\to\infty]{P}\ 0.
\]
In the same way,
\[
W_n \le \sqrt n\sup_{t\in[0,1]}\frac{1}{S^{(0)}(\gamma_1(t),t)^2}\ \sup_{t\in[0,1]}\big|S^{(1)}(\gamma_1(t),t)^2-s^{(1)}(\gamma_1(t),t)^2\big|
+ M_1^2\,\sqrt n\sup_{t\in[0,1]}\left|\frac{1}{S^{(0)}(\gamma_1(t),t)^2}-\frac{1}{s^{(0)}(\gamma_1(t),t)^2}\right|. \qquad (9.28)
\]
We have:
\[
\sqrt n\sup_{t\in[0,1],\ \gamma\in\mathcal B}\left|\frac{1}{S^{(0)}(\gamma(t),t)^2}-\frac{1}{s^{(0)}(\gamma(t),t)^2}\right|
= \sqrt n\sup_{t\in[0,1],\ \gamma\in\mathcal B}\left|\frac{1}{S^{(0)}(\gamma(t),t)}-\frac{1}{s^{(0)}(\gamma(t),t)}\right|\left|\frac{1}{S^{(0)}(\gamma(t),t)}+\frac{1}{s^{(0)}(\gamma(t),t)}\right|
\]
\[
\le \sqrt n\sup_{t\in[0,1],\ \gamma\in\mathcal B}\left|\frac{1}{S^{(0)}(\gamma(t),t)}-\frac{1}{s^{(0)}(\gamma(t),t)}\right|\left\{\sup_{t\in[0,1],\ \gamma\in\mathcal B}\frac{1}{S^{(0)}(\gamma(t),t)}+m_0^{-1}\right\}.
\]
Equation 9.27 and the fact that sup_{t∈[0,1],γ∈B} S^(0)(γ(t), t)^{-1} = O_P(1) imply that
\[
\sqrt n\sup_{t\in[0,1],\ \gamma\in\mathcal B}\left|\frac{1}{S^{(0)}(\gamma(t),t)^2}-\frac{1}{s^{(0)}(\gamma(t),t)^2}\right|\ \xrightarrow[n\to\infty]{P}\ 0.
\]
Since sup_{t∈[0,1],γ∈B} S^(0)(γ(t), t)^{-2} = O_P(1), we have the convergence
\[
\sqrt n\sup_{t\in[0,1]}\frac{1}{S^{(0)}(\gamma_1(t),t)^2}\ \sup_{t\in[0,1]}\big|S^{(1)}(\gamma_1(t),t)^2-s^{(1)}(\gamma_1(t),t)^2\big|\ \xrightarrow[n\to\infty]{P}\ 0.
\]
Under condition A3, there exists a positive constant C(γ1) such that v(γ1(t), t) = C(γ1) for all t ∈ [0, 1]. Lemma 9.2 indicates the strict positivity of C(γ1), and Equation 9.8 is thus demonstrated.
Equations 9.9 and 9.10. Let γ1 ∈ B. Following Equation 9.8, under conditions A1, A2, and A3, there exist constants C(γ1), C(β0) > 0 such that
\[
\sqrt n\sup_{t\in\{t_1,\ldots,t_{k_n}\}}\big|V_{\gamma_1(t)}(Z\mid t)-C(\gamma_1)\big|\ \xrightarrow[n\to\infty]{P}\ 0, \qquad (9.29)
\]
\[
\sqrt n\sup_{t\in\{t_1,\ldots,t_{k_n}\}}\big|V_{\beta_0}(Z\mid t)-C(\beta_0)\big|\ \xrightarrow[n\to\infty]{P}\ 0. \qquad (9.30)
\]
We then have:
\[
\sup{}^n_k\left|\frac{V_{\gamma_1(t)}(Z\mid t)}{V_{\beta_0}(Z\mid t)}-\frac{C(\gamma_1)}{C(\beta_0)}\right|
\le \sup{}^n_k\frac{1}{V_{\beta_0}(Z\mid t)}\big|V_{\gamma_1(t)}(Z\mid t)-C(\gamma_1)\big| + C(\gamma_1)\sup{}^n_k\left|\frac{1}{V_{\beta_0}(Z\mid t)}-\frac{1}{C(\beta_0)}\right|
\]
\[
\le C_V^{-1}\sup{}^n_k\big|V_{\gamma_1(t)}(Z\mid t)-C(\gamma_1)\big| + C(\gamma_1)\sup{}^n_k\left|\frac{1}{V_{\beta_0}(Z\mid t)}-\frac{1}{C(\beta_0)}\right|.
\]
Moreover,
\[
\sup{}^n_k\left|\frac{1}{V_{\beta_0}(Z\mid t)}-\frac{1}{C(\beta_0)}\right|
= \sup{}^n_k\left|\frac{V_{\beta_0}(Z\mid t)-C(\beta_0)}{C(\beta_0)\,V_{\beta_0}(Z\mid t)}\right|
\le C(\beta_0)^{-1}C_V^{-1}\sup{}^n_k\big|V_{\beta_0}(Z\mid t)-C(\beta_0)\big|.
\]
This convergence, together with that of Equation 9.29, leads to the conclusion that
\[
\sqrt n\sup_{t\in\{t_1,\ldots,t_{k_n}\}}\left|\frac{V_{\gamma_1(t)}(Z\mid t)}{V_{\beta_0}(Z\mid t)}-\frac{C(\gamma_1)}{C(\beta_0)}\right|\ \xrightarrow[n\to\infty]{P}\ 0.
\]
Theorem 9.1 This is a special case of the more general proof given for Theorem
9.2.
Theorem 9.2. Let t ∈ [0, 1]. Under hypotheses A1, A2, A3, and A4, we suppose that β(t) is the true value for the model and that the standardized score, Un∗, is evaluated at the value β0 ∈ B. Recall the equalities of Proposition 9.2:
\[
E_{\beta(t)}(Z\mid t) = E_{\beta(t)}\big(Z(t)\mid\mathcal F^*_{t^-}\big),\qquad V_{\beta_0}(Z\mid t) = V_{\beta_0}\big(Z(t)\mid\mathcal F^*_{t^-}\big).
\]
This process is piecewise constant with a jump at each time of event. The outline of the proof is the following. Firstly, we decompose Un∗∗(β0, ·) into two processes belonging to (D[0, 1], R). We show that the first process converges in distribution to a Brownian motion by appealing to a theorem for the convergence of differences of martingales. The convergence of the second process can be obtained by making use of Lemma 9.1. The large sample behavior of the process Un∗(β0, ·) belonging to (C[0, 1], R), which is the linear interpolation of the variables {Un∗∗(β0, ti), i = 1, ..., kn}, will be obtained by showing that the difference between Un∗ and Un∗∗ tends in probability to 0 when n → ∞. A Taylor-Lagrange expansion of the conditional expectation Eβ(t){Z(t) | F∗_{t⁻}} at the point β(t) = β0 gives:
\[
E_{\beta_0}\big(Z(t)\mid\mathcal F^*_{t^-}\big) = E_{\beta(t)}\big(Z(t)\mid\mathcal F^*_{t^-}\big) + \{\beta_0-\beta(t)\}\,\frac{\partial}{\partial\beta}E_{\beta}\big(Z(t)\mid\mathcal F^*_{t^-}\big)\Big|_{\beta=\tilde\beta(t)}, \qquad (9.32)
\]
since β0 ∈ B with radius δ1 (condition A1). Under conditions A1, A2, and A3, Equation 9.10 of Lemma 9.3 indicates the existence of a constant C2 > 0 such that
\[
\sup_{s\in\{t_1,\ldots,t_{k_n}\}}\left|\frac{V_{\tilde\beta(s)}\big(Z(s)\mid\mathcal F^*_{s^-}\big)}{V_{\beta_0}\big(Z(s)\mid\mathcal F^*_{s^-}\big)^{1/2}}-C_2\right|\ \xrightarrow[n\to\infty]{P}\ 0.
\]
Equation (9.33) now indicates that the convergence of Equation 9.13 is shown.
We see that Xn converges in distribution to Brownian motion by making use of
the functional central limit theorem for the differences of martingales developed
by Helland (1982) and described in Theorem C.4. Let j = 1, . . . , kn . We write
ξj,kn as the jth increment of Xn :
\[
\xi_{j,k_n} = \frac{Z(t_j)-E_{\beta(t_j)}\big(Z(t_j)\mid\mathcal F^*_{t_j^-}\big)}{k_n^{1/2}\,V_{\beta_0}\big(Z(t_j)\mid\mathcal F^*_{t_j^-}\big)^{1/2}}.
\]
The increment ξ_{j,kn} is F∗_{tj}-measurable and E_{β(tj)}(ξ_{j,kn} | F∗_{tj⁻}) = 0. The process N̄∗ contains a jump at each time of an event tj = j/kn (Proposition 9.1), so we have:
\[
X_n(t) = \int_0^{\lfloor tk_n\rfloor/k_n}\frac{Z(s)-E_{\beta(s)}\big(Z(s)\mid\mathcal F^*_{s^-}\big)}{k_n^{1/2}\,V_{\beta_0}\big(Z(s)\mid\mathcal F^*_{s^-}\big)^{1/2}}\,d\bar N^*(s) = \sum_{j=1}^{\lfloor tk_n\rfloor}\xi_{j,k_n},
\]
where the function t ↦ ⌊t kn⌋ is positive, increasing, integer-valued and right continuous. To check the assumptions of Theorem C.4 we need to consider the expectation, the variance, and the Lyapunov condition. We begin with the expectation:
\[
E_{\beta(t_{j-1})}\big(\xi_{j,k_n}\mid\mathcal F^*_{t_{j-1}}\big) = E_{\beta(t_{j-1})}\Big(E_{\beta(t_j)}\big(\xi_{j,k_n}\mid\mathcal F^*_{t_j^-}\big)\ \Big|\ \mathcal F^*_{t_{j-1}}\Big) = 0. \qquad (9.34)
\]
b. (Variance). Under the hypotheses A1, A2, and A3, following Lemma 9.3, there exists a constant C1(β, β0) > 0 such that
\[
\sup_{j=1,\ldots,k_n}\left|\frac{V_{\beta(t_j)}\big(Z(t_j)\mid\mathcal F^*_{t_j^-}\big)}{V_{\beta_0}\big(Z(t_j)\mid\mathcal F^*_{t_j^-}\big)}-C_1(\beta,\beta_0)^2\right|\ \xrightarrow[n\to\infty]{P}\ 0.
\]
since s ∈ {t1, t2, ..., tkn}, so that Σᵢ₌₁ⁿ 1_{φn(Xi)=s, δi=1} = 1. As a result,
\[
V_{\beta(s)}\big(Z(s)\mid\mathcal F^*_{s^-}\big)\ \le\ E_{\beta(s)}\big(Z(s)^2\mid\mathcal F^*_{s^-}\big)\ \le\ L^2.
\]
To keep things uncluttered, we first write
\[
W_0^F(t_j) = \frac{V_{\beta(t_j)}\big(Z(t_j)\mid\mathcal F^*_{t_j^-}\big)}{V_{\beta_0}\big(Z(t_j)\mid\mathcal F^*_{t_j^-}\big)},
\]
then,
\[
\left|\sum_{j=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{j-1})}\big(\xi_{j,k_n}^2\mid\mathcal F^*_{t_{j-1}}\big)-C_1(\beta,\beta_0)^2\,t\right|
= \left|\sum_{j=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{j-1})}\Big(E_{\beta(t_j)}\big(\xi_{j,k_n}^2\mid\mathcal F^*_{t_j^-}\big)\,\Big|\,\mathcal F^*_{t_{j-1}}\Big)-C_1(\beta,\beta_0)^2\,t\right|
\]
\[
= \left|\frac{1}{k_n}\sum_{j=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{j-1})}\big(W_0^F(t_j)\mid\mathcal F^*_{t_{j-1}}\big)-C_1(\beta,\beta_0)^2\,t\right|
\le \frac{1}{k_n}\sum_{j=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{j-1})}\Big(\big|W_0^F(t_j)-C_1(\beta,\beta_0)^2\big|\,\Big|\,\mathcal F^*_{t_{j-1}}\Big)
+ C_1(\beta,\beta_0)^2\left|\frac{\lfloor tk_n\rfloor}{k_n}-t\right|.
\]
Equation 9.36 then leads to the conclusion that Σ_{j=1}^{⌊tkn⌋} E_{β(t_{j-1})}(ξ²_{j,kn} | F∗_{t_{j-1}}) converges in mean and, as a result, in probability to C1(β, β0)²t when n → ∞.

c. (Lyapunov condition.) We have the convergence in mean of Σ_{j=1}^{⌊tkn⌋} E_{β(t_{j-1})}(|ξ_{j,kn}|³ | F∗_{t_{j-1}}) to 0 when n → ∞. Indeed,
\[
E\left(\sum_{j=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{j-1})}\big(|\xi_{j,k_n}|^3\mid\mathcal F^*_{t_{j-1}}\big)\right)
\ \le\ \sum_{j=1}^{k_n}E\big(|\xi_{j,k_n}|^3\big). \qquad (9.37)
\]
Also, on the basis of Equation 9.9, for t ∈ {t1, t2, ..., tkn}, V_{β0}(Z(t) | F∗_{t⁻}) is bounded below by C_V. As a result, Equation 9.37 becomes:
\[
E\left(\sum_{j=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{j-1})}\big(|\xi_{j,k_n}|^3\mid\mathcal F^*_{t_{j-1}}\big)\right)
\le \frac{1}{(k_n C_V)^{3/2}}\sum_{j=1}^{k_n}E\Big(\big|Z(t_j)-E_{\beta(t_j)}\big(Z(t_j)\mid\mathcal F^*_{t_j^-}\big)\big|^3\Big)
\le \frac{8L^3}{k_n^{1/2}C_V^{3/2}}, \qquad (9.38)
\]
and Σ_{j=1}^{⌊tkn⌋} E_{β(t_{j-1})}(|ξ_{j,kn}|³ | F∗_{t_{j-1}}) converges in mean to 0 when n → ∞ and, in consequence, in probability. The Lyapunov condition is verified.
where the constant C1(β, β0)² is the ratio of two asymptotic conditional variances v evaluated at β and β0. When β = β0, we have C1(β, β0) = 1. We have the relation
\[
U_n^*(\beta_0,\cdot)-k_n^{1/2}A_n = \big\{U_n^*(\beta_0,\cdot)-U_n^{**}(\beta_0,\cdot)\big\} + \big\{U_n^{**}(\beta_0,\cdot)-k_n^{1/2}A_n\big\}
= \big\{U_n^*(\beta_0,\cdot)-U_n^{**}(\beta_0,\cdot)\big\} + X_n. \qquad (9.39)
\]
Therefore, Un∗(β0, ·) − Un∗∗(β0, ·) → 0 in probability as n → ∞. Slutsky's theorem applied to the decomposition of Equation 9.39 leads us to conclude that
\[
U_n^*(\beta_0,\cdot)-k_n^{1/2}A_n\ \xrightarrow[n\to\infty]{\mathcal L}\ C_1(\beta,\beta_0)\,W,
\]
\[
\le\ p\,\sqrt n\sup_{t\in\{t_1,\ldots,t_{k_n}\},\ \gamma\in\mathcal B}\big\|V_{\gamma(t)}(Z\mid t)^{1/2}-\Sigma^{1/2}\big\|
\ \times\ \sup_{t\in\{t_1,\ldots,t_{k_n}\},\ \gamma\in\mathcal B}\big\|V_{\gamma(t)}(Z\mid t)^{1/2}+\Sigma^{1/2}\big\|.
\]
Now, for all t ∈ {t1, ..., tkn} and γ ∈ B, according to condition B4,
\[
\big\|V_{\gamma(t)}(Z\mid t)\big\|\ \le\ \big\|E_\gamma\big(Z^{\otimes 2}\mid t\big)\big\| + \big\|E_\gamma(Z\mid t)^{\otimes 2}\big\|\ \le\ 2L^2. \qquad (9.40)
\]
Theorem 9.3. The proof follows the same outline as in the univariate setting. Recall that the space of cadlag functions (D[0, 1], R^p) is equipped with the Skorokhod product topology, described in Definition A.5. Furthermore, the norm of a vector in R^p or of a matrix in M_{p×p}(R) is the sup norm. The process being evaluated at β0 ∈ B, we have kn = kn(β0) and the upper bound of Equation 9.17 becomes:
\[
\forall\, t\in\{t_1,\ldots,t_{k_n}\},\qquad \big\|V_{\beta_0(t)}(Z\mid t)^{-1}\big\|\le C_V^{-1}\quad\text{a.s.} \qquad (9.42)
\]
\[
U_n^{**}(\beta_0,t) = \frac{1}{\sqrt{k_n}}\sum_{i=1}^{\lfloor tk_n\rfloor}V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}\big\{Z(t_i)-E_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\big\},\qquad 0\le t\le 1,
\]
\[
X_n(t) = \frac{1}{\sqrt{k_n}}\sum_{i=1}^{\lfloor tk_n\rfloor}V_{\beta_0}(Z\mid t_i)^{-1/2}\big\{Z(t_i)-E_{\beta(t_i)}(Z\mid t_i)\big\},
\]
\[
B_n(t) = \frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}V_{\beta_0}(Z\mid t_i)^{-1/2}\big\{E_{\beta(t_i)}(Z\mid t_i)-E_{\beta_0}(Z\mid t_i)\big\}.
\]
We have:
\[
\big\|U_n^*(\beta_0,\cdot)-k_n^{1/2}B_n-W_p\big\|
\le \big\|U_n^*(\beta_0,\cdot)-U_n^{**}(\beta_0,\cdot)\big\| + \big\|U_n^{**}(\beta_0,\cdot)-k_n^{1/2}B_n-W_p\big\|
= \big\|U_n^*(\beta_0,\cdot)-U_n^{**}(\beta_0,\cdot)\big\| + \big\|X_n-W_p\big\|. \qquad (9.43)
\]
Let us consider the limiting behavior of the two terms on the right of the inequality. Considering just the first term and letting ε > 0, we have:
\[
P\Big(\sup_{t\in[0,1]}\big\|U_n^*(\beta_0,t)-U_n^{**}(\beta_0,t)\big\|\ge\varepsilon\Big)
= P\Big(\sup_{i=1,\ldots,k_n}\Big\|U_n^{**}\Big(\beta_0,\frac{i}{k_n}\Big)-U_n^{**}\Big(\beta_0,\frac{i-1}{k_n}\Big)\Big\|\ge\varepsilon\Big)
\]
\[
= P\Big(\frac{1}{\sqrt{k_n}}\sup_{i=1,\ldots,k_n}\Big\|V_{\beta_0}(Z\mid t_i)^{-1/2}\big\{Z(t_i)-E_{\beta_0}(Z\mid t_i)\big\}\Big\|\ge\varepsilon\Big)
\le P\Big(\frac{p}{\sqrt{k_n}}\sup_{i=1,\ldots,k_n}\big\|V_{\beta_0}(Z\mid t_i)^{-1/2}\big\|\,\big\|Z(t_i)-E_{\beta_0}(Z\mid t_i)\big\|\ge\varepsilon\Big), \qquad (9.44)
\]
where the last inequality is obtained from Equations 9.41 and 9.42. Also we have:
\[
\big\|Z(t_i)-E_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\big\|\ \le\ \|Z(t_i)\| + \big\|E_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\big\|\ \le\ 2L,
\]
and lim_{n→∞} P(sup_{t∈[0,1]}‖Un∗(β0, t) − Un∗∗(β0, t)‖ ≥ ε) = 0. This implies the convergence in probability of Un∗(β0, ·) − Un∗∗(β0, ·) to 0 when n → ∞ in the space (D[0, 1], R^p) equipped with the Skorokhod product topology, as a consequence of the results of Section A.5. We now consider the second term. The convergence in distribution of Xn to Wp is obtained by an application of the multivariate functional central limit theorem C.5 described in Helland (1982). Before applying this theorem we need to check some working hypotheses. Let i ∈ {1, 2, ..., kn} and t ∈ [0, 1]. Note that
\[
\xi_{i,k_n} = \big(\xi^1_{i,k_n},\ldots,\xi^p_{i,k_n}\big)^T
= \frac{1}{\sqrt{k_n}}\,V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}\big\{Z(t_i)-E_{\beta(t_i)}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\big\}
\]
is the ith increment in R^p of the process Xn, so that Xn(t) = Σ_{i=1}^{⌊tkn⌋} ξ_{i,kn}. Note that ξ_{i,kn} is F∗_{ti}-measurable and that E_{β(ti)}(ξ_{i,kn} | F∗_{ti⁻}) = 0. Let l, m ∈ {1, ..., p}, l ≠ m, and let e_l be the lth vector of the canonical basis of R^p: all of its elements are zero with the exception of the lth, which is 1. Also, e_l^T e_m = 0, e_l^T e_l = 1 and ‖e_l‖ = max_{i=1,...,p}|(e_l)_i| = 1. For i = 1, ..., kn, we have:
\[
\xi^l_{i,k_n} = e_l^T\,\xi_{i,k_n} = \xi_{i,k_n}^T\,e_l\ \in\mathbb R, \qquad (9.47)
\]
\[
E_{\beta(t_{i-1})}\big(\xi^l_{i,k_n}\mid\mathcal F^*_{t_{i-1}}\big)
= E_{\beta(t_{i-1})}\Big(E_{\beta(t_i)}\big(\xi^l_{i,k_n}\mid\mathcal F^*_{t_i^-}\big)\,\Big|\,\mathcal F^*_{t_{i-1}}\Big) = 0.
\]
\[
E_{\beta(t_i)}\big(\xi^l_{i,k_n}\xi^m_{i,k_n}\mid\mathcal F^*_{t_i^-}\big)
= e_l^T\,E_{\beta(t_i)}\big(\xi_{i,k_n}\xi_{i,k_n}^T\mid\mathcal F^*_{t_i^-}\big)\,e_m
\]
\[
= \frac{1}{k_n}\,e_l^T\,V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}\,V_{\beta(t_i)}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\,V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}\,e_m
= \frac{1}{k_n}\,e_l^T\,V\big(t_i^-,\beta_0,\beta(t_i)\big)\,e_m,
\]
where
\[
V\big(t_i^-,\beta_0,\beta(t_i)\big)
= V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}\,V_{\beta(t_i)}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\,V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}.
\]
It follows that, for l ≠ m,
\[
E\left(\left|\sum_{i=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{i-1})}\big(\xi^l_{i,k_n}\xi^m_{i,k_n}\mid\mathcal F^*_{t_{i-1}}\big)\right|\right)
\le \frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}E\Big(\big|e_l^T\,V\big(t_i^-,\beta_0,\beta(t_i)\big)\,e_m\big|\Big)
= \frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}E\Big(\big|e_l^T\big\{V\big(t_i^-,\beta_0,\beta(t_i)\big)-I_p\big\}e_m\big|\Big)
\]
\[
\le \frac{\lfloor tk_n\rfloor}{k_n}\sup_{i=1,\ldots,\lfloor tk_n\rfloor}E\Big(\big|e_l^T\big\{V(t_i^-,\beta_0,\beta(t_i))-I_p\big\}e_m\big|\Big)
\le \frac{\lfloor tk_n\rfloor}{k_n}E\Big(\sup_{i=1,\ldots,\lfloor tk_n\rfloor}\big|e_l^T\big\{V(t_i^-,\beta_0,\beta(t_i))-I_p\big\}e_m\big|\Big)
\]
\[
\le p^2\,\|e_l\|\,\|e_m\|\,\frac{\lfloor tk_n\rfloor}{k_n}E\Big(\sup_{i=1,\ldots,\lfloor tk_n\rfloor}\big\|V(t_i^-,\beta_0,\beta(t_i))-I_p\big\|\Big)
= p^2\,\frac{\lfloor tk_n\rfloor}{k_n}E\Big(\sup_{i=1,\ldots,\lfloor tk_n\rfloor}\big\|V(t_i^-,\beta_0,\beta(t_i))-I_p\big\|\Big)
\]
\[
\le p^2\,\frac{\lfloor tk_n\rfloor}{k_n}E\Big(\sup_{i=1,\ldots,k_n}\big\|V(t_i^-,\beta_0,\beta(t_i))-I_p\big\|\Big). \qquad (9.48)
\]
We have:
\[
\sup_{i=1,\ldots,k_n}\big\|V\big(t_i^-,\beta_0,\beta(t_i)\big)-I_p\big\|
= \sup_{i=1,\ldots,k_n}\Big\|V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}\Big\{V_{\beta(t_i)}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)-V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\Big\}V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}\Big\|
\]
\[
\le p^2\sup_{i=1,\ldots,k_n}\big\|V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)^{-1/2}\big\|^2\ \sup_{i=1,\ldots,k_n}\big\|V_{\beta(t_i)}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)-V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\big\|
\]
\[
\le 2\,p^{10}\,C_V^{-2}\,L^{2}\,\Big\{\sup_{i=1,\ldots,k_n}\big\|V_{\beta(t_i)}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)-\Sigma\big\|
+ \sup_{i=1,\ldots,k_n}\big\|\Sigma-V_{\beta_0}\big(Z(t_i)\mid\mathcal F^*_{t_i^-}\big)\big\|\Big\}.
\]
It follows that
\[
\sup_{i=1,\ldots,k_n}\big\|V\big(t_i^-,\beta_0,\beta(t_i)\big)-I_p\big\|\ \xrightarrow[n\to\infty]{P}\ 0.
\]
Equations 9.40 and 9.45 indicate that sup_{i=1,...,kn}‖V(ti⁻, β0, β(ti)) − Ip‖ is bounded by a finite constant. The convergence is then also a convergence in mean:
\[
\sup_{i=1,\ldots,k_n}\big\|V\big(t_i^-,\beta_0,\beta(t_i)\big)-I_p\big\|\ \xrightarrow[n\to\infty]{L^1}\ 0. \qquad (9.49)
\]
c.' (Variance.) Similar arguments to a.' and b.' can be used, implying the equality
\[
\sum_{i=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{i-1})}\Big(\big(\xi^l_{i,k_n}\big)^2\mid\mathcal F^*_{t_{i-1}}\Big) - t
= \frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{i-1})}\Big(e_l^T\big\{V\big(t_i^-,\beta_0,\beta(t_i)\big)-I_p\big\}e_l\mid\mathcal F^*_{t_{i-1}}\Big) + \frac{\lfloor k_n t\rfloor}{k_n}-t.
\]
This leads to
\[
E\left(\left|\sum_{i=1}^{\lfloor tk_n\rfloor}E_{\beta(t_{i-1})}\Big(\big(\xi^l_{i,k_n}\big)^2\mid\mathcal F^*_{t_{i-1}}\Big)-t\right|\right)
\le \frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}E\Big(\big|e_l^T\big\{V(t_i^-,\beta_0,\beta(t_i))-I_p\big\}e_l\big|\Big) + \left|\frac{\lfloor k_n t\rfloor}{k_n}-t\right|
\]
\[
\le \frac{p^2}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}E\Big(\big\|V(t_i^-,\beta_0,\beta(t_i))-I_p\big\|\Big) + \left|\frac{\lfloor k_n t\rfloor}{k_n}-t\right|
\le p^2\,\frac{\lfloor k_n t\rfloor}{k_n}\sup_{i=1,\ldots,k_n}E\Big(\big\|V(t_i^-,\beta_0,\beta(t_i))-I_p\big\|\Big) + \left|\frac{\lfloor k_n t\rfloor}{k_n}-t\right|
\]
\[
\le p^2\,\frac{\lfloor k_n t\rfloor}{k_n}\,E\Big(\sup_{i=1,\ldots,k_n}\big\|V(t_i^-,\beta_0,\beta(t_i))-I_p\big\|\Big) + \left|\frac{\lfloor k_n t\rfloor}{k_n}-t\right|.
\]
d.' (Lyapunov condition.) We have |ξ^l_{i,kn}| ≤ max_{l=1,...,p}|ξ^l_{i,kn}| = ‖ξ_{i,kn}‖. Therefore,
\[
E\left(\sum_{i=1}^{\lfloor k_n t\rfloor}E_{\beta(t_{i-1})}\Big(\big|\xi^l_{i,k_n}\big|^3\mid\mathcal F^*_{t_{i-1}}\Big)\right)
\le \sum_{i=1}^{\lfloor k_n t\rfloor}E\Big(\big|\xi^l_{i,k_n}\big|^3\Big)
\le \sum_{i=1}^{k_n}E\big(\|\xi_{i,k_n}\|^3\big).
\]
\[
\Big\|B_n(t)-\Sigma^{1/2}\int_0^t\{\beta(s)-\beta_0\}\,ds\Big\|
\le \frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}\Big\|V_{\beta_0}(Z\mid t_i)^{-1/2}\Big[\big\{E_{\beta(t_i)}(Z\mid t_i)-E_{\beta_0}(Z\mid t_i)\big\}-V_{\beta_0}(Z\mid t_i)\{\beta(t_i)-\beta_0\}\Big]\Big\|
\]
\[
\quad + \frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}\Big\|\big\{V_{\beta_0}(Z\mid t_i)^{1/2}-\Sigma^{1/2}\big\}\{\beta(t_i)-\beta_0\}\Big\|
+ \Big\|\Sigma^{1/2}\Big\{\frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}\beta(t_i)-\int_0^t\beta(s)\,ds\Big\} + \Sigma^{1/2}\beta_0\Big(\frac{\lfloor tk_n\rfloor}{k_n}-t\Big)\Big\|
\]
\[
\le \frac{\lfloor tk_n\rfloor}{k_n}\,\frac{pM_n}{2}\sup_{i=1,\ldots,\lfloor tk_n\rfloor}\|\beta(t_i)-\beta_0\|^2\ \sup_{i=1,\ldots,\lfloor tk_n\rfloor}\big\|V_{\beta_0}(Z\mid t_i)^{-1/2}\big\|
+ p\,\frac{\lfloor tk_n\rfloor}{k_n}\sup_{i=1,\ldots,\lfloor tk_n\rfloor}\|\beta(t_i)-\beta_0\|\ \sup_{i=1,\ldots,\lfloor tk_n\rfloor}\big\|V_{\beta_0}(Z\mid t_i)^{1/2}-\Sigma^{1/2}\big\|
\]
\[
\quad + p\,\big\|\Sigma^{1/2}\big\|\sup_{l=1,\ldots,p}\Big|\frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}\beta_l(t_i)-\int_0^t\beta_l(s)\,ds\Big|
+ p\,\big\|\Sigma^{1/2}\beta_0\big\|\,\Big|\frac{\lfloor tk_n\rfloor}{k_n}-t\Big|.
\]
Therefore,
\[
\Big\|B_n(t)-\Sigma^{1/2}\int_0^t\{\beta(s)-\beta_0\}\,ds\Big\|
\le \frac{\lfloor tk_n\rfloor}{k_n}\,M_n\,\frac{p^5}{\sqrt 2}\,\delta_2^2\,C_V^{-1}L
+ p\,\frac{\lfloor tk_n\rfloor}{k_n}\,\delta_2\sup_{i=1,\ldots,k_n}\big\|V_{\beta_0}(Z\mid t_i)^{1/2}-\Sigma^{1/2}\big\|
\]
\[
\quad + p\,\big\|\Sigma^{1/2}\big\|\sup_{l=1,\ldots,p}\Big|\frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}\beta_l(t_i)-\int_0^t\beta_l(s)\,ds\Big|
+ 2\delta_2\,p\,\Big|\frac{\lfloor tk_n\rfloor}{k_n}-t\Big|\,\big\|\Sigma^{1/2}\big\|
\]
\[
\le M_n\,\frac{p^5}{\sqrt 2}\,\delta_2^2\,C_V^{-1}L
+ p\,\delta_2\sup_{i=1,\ldots,k_n}\big\|V_{\beta_0}(Z\mid t_i)^{1/2}-\Sigma^{1/2}\big\|
+ p\,\big\|\Sigma^{1/2}\big\|\sup_{t\in[0,1]}\sup_{l=1,\ldots,p}\Big|\frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}\beta_l(t_i)-\int_0^t\beta_l(s)\,ds\Big|
+ \frac{2\delta_2\,p}{k_n}\,\big\|\Sigma^{1/2}\big\|. \qquad (9.51)
\]
It follows that
\[
\sup_{t\in[0,1]}\sup_{l=1,\ldots,p}\Big|\frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}\beta_l(t_i)-\int_0^t\beta_l(s)\,ds\Big|
\ \le\ \sum_{l=1}^{p}\sup_{t\in[0,1]}\Big|\frac{1}{k_n}\sum_{i=1}^{\lfloor tk_n\rfloor}\beta_l(t_i)-\int_0^t\beta_l(s)\,ds\Big|,
\]
The basic ideas of linear regression run through this chapter. It is structured
as follows: In Section 10.3, we present a literature review of graphical methods
and goodness of fit tests for proportional hazards models. In Section 10.6, we
consider the R2 coefficient for non-proportional hazards models, which is the
measure of predictive ability used from here on. Although other suggestions for
R2 have been discussed extensively in the literature (O’Quigley, 2008), we limit
our attention to this particular one in view of its many desirable properties, most
importantly a key property, not known to exist for rival measures, that the popula-
tion equivalent for R2 , Ω2 , is maximized for the model closest to that generating
the observations. Section 10.4 presents a graphical method and a goodness of fit
test for proportional hazards models, based on the regression effect process. We
also present a method for constructing non-proportional hazards models from
this process and the R2 coefficient. Next, simulations are presented in Section
10.9 comparing the performance of the goodness of fit test we have developed
with some standard tests. The method for constructing non-proportional hazards
models will then be illustrated on simulated data. Applications on real data are
presented in Section 10.10. We terminate the chapter with a discussion.
suggesting visually the type of alternative hypothesis, i.e., more realistic models
to consider. Even here, results can be confusing or difficult to summarize if we
are not used to interpreting such plots. In our view the regression effect process,
processes in the multivariate case, can ease interpretation and play a fundamental
role in model construction. Before looking more closely at the regression effect
process and especially the time-transformed regression effect process, we present
a brief overview of some of the less recent proposals in the literature.
The first graphical method for checking the proportional hazards hypothesis for categorical variables was proposed by Kay (1977). The method is based on the following formulation of the proportional hazards model:
\[
\log\big\{-\log S(t\mid Z)\big\} = \beta^T Z + \log\Big\{\int_0^t\lambda_0(s)\,ds\Big\}.
\]
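In practice, Kay's check amounts to plotting log{−log Ŝ(t|Z = z)} against time on a log scale for each level of a categorical covariate; under proportional hazards the curves should be roughly parallel. A minimal sketch using the survival package, with an illustrative dataset and grouping variable of our own choosing, might read:

    library(survival)
    # Kaplan-Meier curves by treatment group, displayed on the complementary
    # log-log scale; roughly parallel curves are consistent with proportional hazards.
    fit <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)
    plot(fit, fun = "cloglog", lty = 1:2,
         xlab = "time (log scale)", ylab = "log(-log S(t))")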
where j denotes the individual who dies at time t. Grambsch and Therneau (1994) have suggested plotting the Schoenfeld residuals standardized over time in order to examine the validity of the proportional hazards model and, if it is rejected, to obtain some idea of the time-dependent effect. Essentially, they have shown that for each time of death t:
\[
E\Big\{V_{\hat\beta}(Z\mid t)^{-1}\,r_{\hat\beta}(t)\Big\}_j + \hat\beta_j\ \approx\ \beta_j(t),\qquad j=1,\ldots,p,
\]
where β̂ = (β̂1, ..., β̂p) is the partial maximum likelihood estimator of the proportional hazards model and β(t) = (β1(t), ..., βp(t)) the true time-dependent coefficient of the non-proportional hazards model. This approach is often used and is implemented in the survival package in R. Furthermore, the authors have proposed a confidence band for the regression effect, but this is not reliable if the
proportional hazards model is not satisfied. More recently, Sasieni and Winnett
(2003) have suggested using the smoothed residuals of martingale differences.
These can be plotted as a function of time for a fixed value of the covariate, allow-
ing a glimpse at temporal dependency. These can also be plotted as a function
of a covariate at a fixed time, to study how to introduce it into the model. In the
former, several plots need to be output for the various values of the covariates,
and in the latter, for various times. In practice, it may be difficult to interpret
all of these. Also, these methods—based on non-cumulative residuals—require
smoothing functions to join up the residuals. As underlined by Lin et al. (1993),
results are often sensitive to the choice of smoothing technique employed. Dif-
ferent smoothings can lead to very different conclusions. Examining the quality
of proportional hazards models with such methods can be tricky.
To get around these kinds of problems, several authors have proposed using
cumulative martingale residuals. Arjas (1988) suggests plotting the expected
number of deaths under the proportional hazards model as a function of the ranks
of the times of death. The model fits the data well if the resulting curve is close to
the diagonal. A score process, inspired by the goodness of fit test of Wei (1984),
was introduced by Therneau et al. (1990) and is denoted {U (β, t), 0 ≤ t ≤ T },
where
\[
U(\beta,t) = \sum_{i=1}^n\int_0^t\big\{Z_i(s)-E_\beta(Z\mid s)\big\}\,dN_i(s),\qquad 0\le t\le T. \qquad (10.2)
\]
Many methods depend on the covariance between the covariates, such as the
ones proposed by Grambsch and Therneau (1994) and Lin et al. (1993). To deal
with this problem, Scheike and Martinussen (2004), looking at a non-proportional
hazards model, developed estimation techniques and goodness of fit tests for the
proportional hazards model for each covariate, while allowing for the possibility
that other covariates have time-dependent effects. Their simulations show that the
method works well compared to the often-used ones of Grambsch and Therneau
(1994) and Lin et al. (1993) when the proportional hazards hypothesis is invalid
and/or there are correlated covariates. Their test statistics depend on the estima-
tion of the regression parameter, which requires a hard-to-understand algorithm
based on kernel smoothing. Also, the form of the regression parameter estimator
is not a smooth and explicit function of time. The goodness of fit tests are of the
Kolmogorov-Smirnov and Cramér-von Mises type and the p-values are calculated
by simulation since the limiting distributions of the statistics are not known.
For more details on these less recent goodness of fit tests for proportional
hazards models, the reader may wish to consult the works by Therneau and
Grambsch (2000), Klein and Moeschberger (2003) and Martinussen and Scheike
(2006). Our preference as outlined below is to make full use of known and estab-
lished properties of the regression effect process which, depending on the cir-
cumstances, we will approximate by a Brownian motion or a Brownian bridge.
We make use of Theorem 9.2 indicating how a correctly specified model corre-
sponds to a Brownian motion with linear drift. This theorem, together with an
elementary conditioning argument, leads to
This corollary is slightly misplaced in the text: if we were guided only by logical flow, it would follow immediately from Theorem 9.2. We
present it here because this is where we need it, as a key support to assessing
goodness of fit. It tells us that the above elementary transformation will look
like a Brownian bridge in practice, when our chosen model can be considered
well specified and one that could plausibly have generated the observations.
We will see this in later real examples, but as an appetizer, the reader might
consider Figure 10.1 where—leaving aside any formal testing procedure—the
visual impression is enough to tell us that, in this case, a non-proportional hazards
model with time-dependent regression effects provides a much better fit to the
observations than does the simple proportional hazards model. In a multivariate
setting we can obtain several such graphs; examples being the process for the
prognostic index, the process for particular covariates where others are kept fixed
and processes for stratified models.
10.4. Confidence bands for regression effect process
Figure 10.1: Dotted lines show upper 5% limit of the supremum of a Brownian
bridge process. Left-hand process for PH model encroaches on this limit. Right-
hand process for the NPH model with time-dependent effects shows a much
improved fit. The process appears to be not unlike a Brownian bridge
The regression effect process provides us with insight into both the strength
of effect as well as the fit of the assumed model. No less importantly, when
the mechanism generating the observations differs from the assumed model,
the regression effect process will indicate to us in which direction to look for a
better fitting model. Here, we consider confidence bands for the process. These
offer some help in seeing whether the model’s assumed functional form for the
parameter can be deemed sufficiently plausible.
Let us consider the non-proportional hazards model with regression coefficient
β(t) = (β1 (t), . . . , βp (t)), and suppose that conditions B1, B2, B3 and B4 of
Chapter 9 hold. We suppose furthermore that the sequence (Mn)n in hypothesis B3 is such that √n·Mn → 0 as n → ∞. For i = 1, ..., p, consider the following
null and alternative hypotheses:
Under the hypothesis H0,i , the covariate Z (i) is consistent with the proportional
hazards model, while under H1,i , its coefficient changes over time. The propo-
sition below allows us to define the limit behavior of the ith component of the
regression effect process under H0,i . This then allows us to construct confidence
bands for the constant nature of a parameter around the corresponding compo-
nent of the process, as well as a goodness of fit test for the proportional hazards
model. Recall that the limit distribution of the supremum of the absolute value
of a Brownian bridge is the Kolmogorov distribution. For α = 5%, the upper
quantile of order α of this distribution is a(α) = 1.358.
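In practical terms, the band amounts to transforming the standardized process to a bridge and comparing its maximum absolute value with this quantile. A minimal R sketch of one way to do this, using a simulated stand-in for one component of the standardized process (all names are ours), is:

    set.seed(2)
    kn <- 100
    u  <- cumsum(rnorm(kn)) / sqrt(kn)    # stand-in for one component of the process
    tt <- seq_len(kn) / kn
    bridge <- u - tt * u[kn]              # bridge transform removes any linear drift
    stat   <- max(abs(bridge))            # supremum of the absolute bridge
    stat > 1.358                          # TRUE would suggest rejecting H0,i at the 5% level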
If the ith component of the process Σ̂−1/2 Un∗ (β0 , ·) exits the confidence band
IC(α)i or the p-value is less than α, the hypothesis H0,i that the effect βi (t)
is constant is rejected at level α. However, by simultaneously testing several
hypotheses H0,i for different covariates, we see inflation in the overall level of
the test. This means that a process can exit the confidence band even if the
corresponding effect is constant over time, with an overall level greater than α
(Figure 10.2).
This could be a problem if our definitive conclusion on the goodness of fit
of the proportional hazards model is based on these confidence bands or the
goodness of fit test. This is not the case however since such formal testing does
not have any central role in model building. Confidence bands are just one tool
in model construction. The non-detection of a constant effect may be corrected
10.5. Structured tests for time dependency
In his article introducing the proportional hazards model, Cox (1972) suggested
replacing the constant regression coefficient by a non-constant one, then testing
this postulated time dependency. Formally, this amounts to considering the class
of tests based on the non-proportional hazards model with parameter
β(t) = β0 + β1 g(t),
Explained variance
Let X be a random variable whose second-order moment exists. The Bienaymé-Chebyshev inequality,
\[
P\big(|X-E(X)|\ge\varepsilon\big)\ \le\ \frac{\operatorname{Var}(X)}{\varepsilon^2},\qquad \varepsilon>0,
\]
says that the spread of X around its expectation can be bounded in terms of its
variance. The smaller the variance, the greater the probability that X is closer, on
average, to its expectation. Using a statistical model, the idea of evaluating the
proximity of a response variable to its predicted value, in terms of the variance,
would appear natural. The smaller the variance, the better the quality of the prediction, which then allows us to define the explained variance parameter. Suppose that
we want to model a real-valued response variable Y with the help of a vector of
explanatory variables X. If the chosen model and the covariate information, X,
allow a good characterization of Y , then the expected values of Y given X will
show relatively large dispersion when compared to the simple imprecision (noise)
associated with any given fixed value of X. Whenever Var(E(Y |X)) is large
compared to the average of Var(Y |X), then prediction becomes easy. Conversely,
whenever the opposite holds, then the residual noise will mask to a greater or
lesser extent any underlying signal. It makes sense to describe Var(E(Y |X))
as the signal and E(Var(Y |X)) as the noise, so that we can write: Var(Y ) =
signal + noise. We can formally introduce a parameter that can be viewed as the
explained variation in Y given X via
\[
\Omega^2 = \frac{\operatorname{Var}\{E(Y\mid X)\}}{\operatorname{Var}(Y)} = 1 - \frac{E\{\operatorname{Var}(Y\mid X)\}}{\operatorname{Var}(Y)}.
\]
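The decomposition is easy to verify numerically. The short R sketch below simulates a simple pair (X, Y) and checks that the empirical versions of Var(E(Y|X)) and E(Var(Y|X)) add up, approximately, to Var(Y); all names and parameter values are purely illustrative.

    set.seed(3)
    n <- 1e5
    x <- rbinom(n, 1, 0.5)                    # a binary explanatory variable
    y <- 2 * x + rnorm(n, sd = 1)             # signal 2x, noise of variance 1
    signal <- var(tapply(y, x, mean)[as.character(x)])   # Var of E(Y|X)
    noise  <- mean(tapply(y, x, var)[as.character(x)])   # E of Var(Y|X)
    c(total = var(y), decomposed = signal + noise,       # approximately equal
      omega2 = signal / var(y))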
Figure 10.3: Conditional survival curves S(·|Z) as a function of time under the
proportional hazards model with λ0 (t) = 1, β = 1.5 or 4, and binary covariates.
Also, O’Quigley and Flandre (1994) have shown that a non-linear trans-
form which changes the time axis but keeps the order of deaths the same alters
the explained variance of T given Z. For proportional hazards models however,
changing the time scale in this way does not change the estimation of the regres-
sion coefficient β. The predictive ability of the model should not change. The
explained variance of T given Z would therefore seem to be a poor indicator of
the predictive power of variables in proportional hazards models, and even less so
in non-proportional ones. Let us now look at the explained variance of Z given T .
\[
\Omega^2(\beta(t)) = \frac{\operatorname{Var}\{E(\eta\mid T)\}}{\operatorname{Var}(\eta)} = 1 - \frac{E\{\operatorname{Var}(\eta\mid T)\}}{\operatorname{Var}(\eta)},\qquad \eta(t)=\beta(t)^T Z(t). \qquad (10.6)
\]
This way of writing the explained variance will be used later to construct an
estimator. The difficulties we had with the explained variance based on the dis-
tribution of T given Z in the previous section disappear when considering the
explained variance of Z conditional on T , thanks to the following result.
Generalizing the third statement above requires a little thought. Since β(t) is no
longer constant it may not be immediately clear what we mean when we talk of
increasing β. The following is however true. For any model based on β(t)Z, there
exists an equivalent model based on β ∗ Z ∗ (t) in which the time dependency has
now been absorbed by Z ∗ (t). This is a purely formal manipulation but allows us to
place any non-proportional hazards model under a proportional hazards heading.
As a result we see that, given Z ∗ (t), it is true that Ω2 is an increasing function
of |β ∗ |. What we have is that part 3 of Proposition 10.2 holds in a particular
direction of the functional coefficient space, the direction that remains within
the functional form of the model. The explained variance coefficient remains
unchanged when applying strictly increasing transformations in time. Only the
ranks matter. In consequence, we can continue to work with the standardized
10.7. The R2 estimate of Ω2
where Z(t) is the value of the covariate of the individual who has died at t and
Eβ(t) (Z|t) its expectation under the model. Let us denote F̃ the estimator of the
empirical cumulative distribution function of T on the transformed scale:
\[
\tilde F(t) = \frac{1}{k_n}\bar N^*(t) = \frac{1}{k_n}\sum_{i=1}^n 1_{\{\phi_n(X_i)\le t,\ \delta_i=1\}},\qquad 0\le t\le 1.
\]
With one covariate in the model, the R2 coefficient is the ratio of the sum of the
squared Schoenfeld residuals for the model with parameter β̂(t), over the sum of
the squared residuals for the model whose parameter is zero. When instead the
model has several covariates, the coefficient is still the ratio of sums of squared
residuals, but the residuals are now evaluated using the prognostic indicator
β̂(t)T Z(t) rather than the covariate Z(t) directly. A quantity very closely related
to R2(α(t)) is R_S²(α(t)), defined as:
\[
R_S^2(\alpha(t)) = 1 - \hat Q^{-1}(\hat F,0,\alpha(t))\,\hat Q(\hat F,\alpha(t),\alpha(t)), \qquad (10.10)
\]
the only change from Equation 10.9 being that F̃ is replaced by F̂. In practical work this is only very rarely of interest and we can safely ignore R_S²(α(t)) and only keep in mind R²(α(t)). The purpose of R_S²(α(t)) is theoretical study. In the absence of censoring R_S²(α(t)) and R²(α(t)) coincide but, when censoring is present, we will need to work with R_S²(α(t)) in order to demonstrate large sample consistency. The explanation for this is that the estimator F̃ does not converge to the cumulative distribution function of T. As a consequence the R² coefficient will not generally converge to the explained variance Ω². Noting that dF̂ = −dŜ, then
\[
d\hat S(t_i) = \hat S\big\{\phi_n^{-1}(t_i)\big\} - \hat S\big\{\phi_n^{-1}(t_i)^{-}\big\}
\]
and Xu, 2001; Xu, 1996). However, in real-world settings, it has been observed
that R2 depends very weakly on the censoring, even for rates as high as 90% and
that, in the great majority of applications, the standard deviation of R2 will be an
order of magnitude greater than any bias. Simulation work by Choodari-Oskooei
et al. (2012) has confirmed the independence of the unweighted R2 coefficient
with respect to censoring, and a higher variance for RŜ 2 in return for its lack of
bias. For these reasons, in our work we mostly work with the unweighted R2
coefficient in real applications.
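The ratio-of-residuals description given above translates almost directly into code. The rough R sketch below computes, for a proportional hazards fit with a single covariate, the sum of squared Schoenfeld residuals at β̂ and at β = 0, and takes one minus their ratio, following the form of Equation 10.10. The iter.max = 0 device for holding the coefficient at zero, the choice of dataset, and the absence of the weighting used in the formal definition make this an unweighted, illustrative version rather than the estimator studied in this chapter.

    library(survival)
    d    <- ovarian                                   # illustrative data from survival
    fit1 <- coxph(Surv(futime, fustat) ~ age, data = d)
    fit0 <- coxph(Surv(futime, fustat) ~ age, data = d,
                  init = 0, control = coxph.control(iter.max = 0))  # coefficient fixed at 0
    r1 <- residuals(fit1, type = "schoenfeld")        # residuals at the fitted beta-hat
    r0 <- residuals(fit0, type = "schoenfeld")        # residuals at beta = 0
    R2 <- 1 - sum(r1^2) / sum(r0^2)                   # rough, unweighted R2-type summary
    R2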
The coefficient of explained variance Ω2 (β(t)) can be estimated with R2 (β̂(t)),
where β̂(t) is a consistent estimator of the true regression parameter β(t). The
following theorem justifies the use of the R2 coefficient when evaluating the
goodness of fit of the non-proportional hazards model, and gives its asymptotic
behavior under poorly-specified models.
and
\[
\arg\max_{b\in B}\ \lim_{n\to+\infty}R^2(b(t)) = \beta\quad\text{a.s.}
\]
If p > 1 and conditions B1, B2, B3 and B4 of Section 9.5 hold, then for all α ∈ B with α ≠ 0 we have:
\[
\lim_{n\to\infty}R^2(\alpha(t)) = 1 - \frac{\displaystyle\int_0^1\alpha(t)^T\Sigma\,\alpha(t)\,dt + \int_0^1\Big[\alpha(t)^T\big\{e(\beta(t),t)-e(\alpha(t),t)\big\}\Big]^2dt}
{\displaystyle\int_0^1\alpha(t)^T\Sigma\,\alpha(t)\,dt + \int_0^1\Big[\alpha(t)^T\big\{e(\beta(t),t)-e(0,t)\big\}\Big]^2dt}.
\]
Theorem 10.1 says that in the univariate case, if β(t) is the true regression
coefficient and if the sample size is large enough, the maximum of the R2 function
is reached for the true parameter β(t). Generalizing this to the multivariate case
is not a simple task. Consider the two covariate case Z (1) and Z (2) with effects
β1 (t) and β2 (t). If we suppose that β2 (t) is known, this becomes equivalent to
estimating β1 (t) in the univariate model. Theorem 10.1 then applies and β1 (t)
reaches the maximum for R2 in the limit. In practice, β2 (t) is not known but
we can get close to it using a convergent estimator. By conditioning on this estimator, i.e., treating β2(t) as a known constant, the result again holds for β1(t). We can therefore consider that the result of Theorem 10.1 applies for
each variable taken separately, conditional on convergent estimators for all of the
other regression coefficients.
The R2 coefficient and the regression effect process are constructed using the
same residuals. The regression effect process allows us to check the goodness of
fit of the non-proportional hazards model, while the R2 coefficient is a measure of
the model’s predictive ability. Though these aspects are different, their construc-
tion using the same quantities is natural. This means that in essence all of the
relevant information regarding the predictive power of a model as well as model
adequacy is contained in the regression effect process. It is natural to expect that
introducing additional covariates into any model will result in some increase, if
only an apparent one, in predictive power. Here, we see that for any given set of
covariables, improvements in fit will result in improvements in predictive power.
Formal theorems enable us to safely rely on this result as a means to guide model
construction.
Using the results of Theorem 9.3 and its corollary, the regression effect process Un∗
can be used to determine the form of the multivariate regression coefficient β(t).
The process’s drift will mirror the form of β(t). No other techniques like local
smoothing, projection onto basis functions, or kernel estimation are necessary
(Cai and Sun, 2003; Hastie and Tibshirani, 1990; Scheike and Martinussen, 2004).
For example, as illustrated in Figure 9.4a, an effect which is constant until time
τ , followed by a zero-valued effect is easy to spot, even for small sample sizes.
This simple notion generalizes immediately. Suppose that the p components of
a time-dependent regression parameter
drift is a linear function, then the function hj is constant. If the drift is concave,
hj is decreasing, and if convex, increasing. Also, the confidence bands defined in
Section 10.4 can help to evaluate the plausibility of a constant effect over time
for βj (t), resulting in a constant function hj for j = 1, . . . , p.
In the presence of non-proportional hazards, we need a more complex strategy,
one that takes on board both the predictive strength of the covariates as well as
the accuracy of the modeling of the time dependency. We propose using the R2
coefficient which—on top of indicating the predictive abilities of the model—also
converges to a function whose maximum is attained when the true function h is
chosen (Theorem 10.1). When several models provide plausible suggestions for h,
the estimators β̂(t) and corresponding R2 coefficients should be evaluated. The
time-dependent function h retained is the one that maximizes the R2 . In this
way, we obtain a non-proportional hazards model with a good fit and maximal
predictive ability. The measure of predictive ability is maximized over the set
B of m preselected time-dependent regression effects. More formally, denote
B = {β1 (t), . . . , βm (t)} the set of m functions from [0, 1] to Rp . The regression
coefficient β∗(t) selected is the one for which
\[
\beta^*(t) = \arg\max_{\alpha(t)\in B}\ R^2(\alpha(t)).
\]
Note that the set B may be arbitrarily large and not even limited to countable
sets, albeit such a level of generalization would not bring any practical benefit.
The following theorem shows the equivalence between this maximization problem
when n → ∞ and one in which we minimize an L2 norm.
is a solution to
\[
\beta^*(t) = \arg\min_{\alpha(t)\in B}\ \big\|\beta(t)-\alpha(t)\big\|_2,
\]
where, for α(t) ∈ B, ‖β(t) − α(t)‖₂² = ∫₀¹ {β(t) − α(t)}² dt.
In other words, for a large enough sample size, selecting the regression coef-
ficient which maximizes the R2 is the same as choosing the coefficient closest to
the true regression function in terms of L2 distance. It is possible that the true
regression coefficient, if it exists, is not in the set B. We can, however, make it
arbitrarily close. In any event, models are chosen to represent data either because
they fit well, or for their predictive performance. Many models may correspond to
one or both of these aspects, not just the “true model”. Here, priority is given to
goodness of fit, with the selection of candidate regression coefficients, and only
then is a model’s predictive performance considered. We have chosen to work
with the R2 coefficient, but any measure of predictive performance that satisfies
Theorem 10.1 could be considered in principle.
When the trend in the process’s drift is a concave function, the effect
decreases with time, and if it is convex, the effect increases. In order to get the
largest possible R2 , it would be possible to construct a time-dependent effect
which is closer and closer to the observed drift of the process, like for example
a piecewise constant effect with many change-points. In general, this will lead
to overfitting, and interpretation of the coefficient will not be straightforward.
A compromise needs to be made between high predictive ability and simplic-
ity of the coefficient, notably for interpretation purposes. This is comparable to
the linear model situation in which the estimated explained variance—the R2
coefficient—is positively biased and increasingly so as the model’s dimension
increases. A balance needs to be found between the goal of improving predictive
strength and the danger of poor prediction due to overfitting. This boils down
to respecting the elementary principles of parsimonious model building.
Chauvel and O’Quigley (2017) present two distinct simulated situations. The
first looks at the goodness of fit of proportional hazards models, and the second
illustrates methods for constructing non-proportional hazards models with the
help of the regression effect process and the R2 coefficient. We generate samples
under the non-proportional hazards model as described in the appendix. Let us
consider the behavior of the goodness of fit test for the proportional hazards
model presented in Section 10.4. This will be compared to standard tests which
are also based on graphical methods, allowing visualization of the form of the
time-dependent effect in the case of poor model fitting.
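One simple way to generate such samples, not necessarily the scheme described in the appendix, exploits the fact that for a piecewise constant β(t) on the initial time scale and a constant baseline hazard, the survival time given Z is piecewise exponential. The R sketch below, with our own parameter choices, generates data with β(t) = 1 for t ≤ 0.4 and β(t) = 0 afterwards.

    set.seed(4)
    n   <- 200
    z   <- rbinom(n, 1, 0.5)
    tau <- 0.4                                 # change-point on the initial time scale
    b1  <- 1; b2 <- 0                          # beta(t) before and after tau
    # Piecewise exponential: hazard exp(b1*z) on [0, tau], exp(b2*z) afterwards.
    t1  <- rexp(n, rate = exp(b1 * z))         # candidate time from the first piece
    t2  <- tau + rexp(n, rate = exp(b2 * z))   # time beyond tau from the second piece
    time  <- ifelse(t1 <= tau, t1, t2)
    cens  <- rexp(n, rate = 0.3)               # independent censoring
    x     <- pmin(time, cens)
    delta <- as.numeric(time <= cens)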
The first comparison test, proposed by Lin et al. (1993), is based on the standardized score process defined in equation (10.2). This differs from the regression effect process defined and studied here in that the increments of the process of Lin et al. (1993) are non-standardized Schoenfeld residuals, summed with respect to the times of death on the initial time scale, and an overall standardization is applied to all increments. The test statistic is the supremum over time
of the absolute value of the globally standardized process. If the covariates are
non-correlated, the statistic’s limit distribution is the Kolmogorov one (Therneau
et al., 1990). If they are correlated, the limit distribution is not known but can be
evaluated numerically using Monte Carlo simulation. Thus, the confidence enve-
lope of the process and the goodness of fit test are obtained by simulating N
processes according to the process’s Gaussian limit distribution. We have chosen
N = 103 here. The R code for evaluating the process, the simulated envelope,
and the goodness of fit test can be found in the timereg package (Martinussen
and Scheike, 2006). Below we refer to this test with the abbreviation LWY.
A second goodness of fit test for the proportional hazards model is due to
Grambsch and Therneau (1994). They consider an effect of the form β(t) =
β0 + β1 log(t) and propose a score test for the null hypothesis H0 : β1 = 0 versus
H1 : β1 = 0. This test can be found in the survival package in R. We label this
test GT in the following.
The final goodness of fit test compared here was proposed by Scheike and
Martinussen (2004). Under the non-proportional hazards model, they propose a
Cramér-von Mises-type test based on the cumulative value of the estimator of
β(t). This estimate is made using an algorithm involving kernel smoothing. Code
for this test can be found in the timereg package in R. This test is labeled SM
below. These authors also proposed a Kolmogorov-Smirnov test, but based on
their simulations it appears less powerful, so we have not retained it here. The
test based on the regression effect process introduced in Section 10.4 is denoted
Un∗ in the tables taken from Chauvel (2014), Chauvel and O’Quigley (2017).
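As far as we are aware, the SM approach is usually accessed through the timecox function of the timereg package, which fits the non-proportional hazards model with nonparametrically estimated cumulative coefficients and reports supremum and Cramér-von Mises type tests for time-invariant effects. The call below is only a sketch; the exact arguments and output may differ across package versions.

    library(timereg)
    sm <- timecox(Surv(futime, fustat) ~ age + rx, data = ovarian)
    summary(sm)   # includes tests for time-invariant effects, covariate by covariate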
where β2(t) takes the values 0, 0.5, 1_{t≤0.5}, 1.5·1_{t≤0.5} or 0.5·1_{t≤0.5} − 0.5·1_{t≥0.5}
on the transformed time scale. The covariate Z (1) respects the proportional
hazards hypothesis. In the first two cases, Z (2) does too. In the next two, the
hypothesis is not satisfied because the effect is constant at the start and becomes
zero when half of the deaths have been observed. In the final case, the hypothesis
is not satisfied and the type of effect changes half-way through the study (the
variable’s effect on survival is positive at the start and negative at the end). This
situation corresponds to risks which cross over each other. For each trial setting,
3000 samples were simulated in order to evaluate the level and empirical power
of the tests.
LWY GT SM Un∗
n ρ(Z (1) , Z (2) ) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t)
50 0.0 6.3 5.3 5.5 4.4 7.4 7.0 3.3 2.9
50 0.3 5.4 4.8 5.2 4.6 8.1 7.0 2.8 3.1
50 0.5 5.0 4.7 5.2 4.5 6.8 6.7 3.0 2.6
50 0.7 5.7 6.5 5.0 5.1 7.2 8.8 3.1 3.2
100 0.0 5.4 5.7 4.8 5.0 5.7 6.5 3.4 3.7
100 0.3 5.5 5.2 4.9 4.7 5.9 5.9 3.4 3.6
100 0.5 5.7 5.9 4.5 4.8 5.4 6.6 3.4 3.5
100 0.7 5.4 5.4 5.7 6.1 6.5 6.7 3.9 3.7
200 0.0 4.6 5.6 4.3 5.1 5.0 5.7 3.9 3.9
200 0.3 5.5 5.7 4.9 4.8 5.2 5.4 4.5 4.2
200 0.5 6.2 5.7 4.4 4.9 4.9 5.1 3.8 3.9
200 0.7 5.5 5.2 5.7 5.0 5.3 5.0 4.6 3.8
400 0.0 5.4 5.7 4.7 4.7 4.6 4.7 4.5 4.6
400 0.3 5.6 4.5 5.1 4.4 5.0 5.2 4.4 3.6
400 0.5 4.8 5.6 4.8 5.2 5.3 5.2 4.4 5.2
400 0.7 6.0 5.2 5.1 5.3 5.7 5.9 5.1 5.0
Table 10.1: Empirical level of tests (in %) for β2 (t) = 0. Taken from Chauvel
(2014).
LWY GT SM Un∗
n ρ(Z (1) , Z (2) ) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t)
50 0.0 5.3 5.9 4.8 4.9 7.3 6.7 3.0 2.8
50 0.3 6.1 5.4 5.0 4.5 7.8 7.0 3.2 3.0
50 0.5 5.1 5.1 5.0 4.5 7.0 7.2 2.9 2.6
50 0.7 5.4 5.5 5.1 5.0 8.1 7.7 3.1 3.0
100 0.0 5.9 5.3 4.3 4.3 5.6 5.3 3.6 3.3
100 0.3 5.0 5.7 4.9 4.3 5.3 5.0 3.5 3.8
100 0.5 5.4 5.5 4.8 4.9 6.1 6.7 3.5 3.4
100 0.7 5.4 5.8 5.6 5.4 5.4 6.0 3.6 3.6
200 0.0 5.2 5.5 3.9 4.6 5.3 5.7 3.5 3.4
200 0.3 6.0 5.6 4.8 4.1 5.7 4.9 4.1 3.6
200 0.5 5.7 5.8 4.9 4.5 5.3 5.0 4.0 4.1
200 0.7 4.6 4.9 5.1 4.4 5.2 4.7 3.4 4.2
400 0.0 4.7 5.6 4.7 5.6 4.6 5.4 3.9 4.6
400 0.3 4.2 5.0 4.5 4.4 5.2 4.6 4.2 4.5
400 0.5 6.6 5.3 5.8 4.9 5.1 5.4 4.6 5.0
400 0.7 5.7 6.3 5.2 4.9 5.1 5.3 4.9 5.1
Table 10.2: Empirical level of tests (in %) for β2 (t) = 0.5. Taken from Chauvel
(2014)
LWY GT SM Un∗
n ρ(Z (1) , Z (2) ) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t)
50 0.0 5.3 14.9 4.4 13.6 7.2 7.8 2.6 11.3
50 0.3 6.3 14.7 5.0 12.8 7.4 7.6 3.3 9.6
50 0.5 6.1 15.1 4.6 11.9 6.8 7.6 2.7 8.3
50 0.7 9.3 16.8 5.4 11.0 6.3 7.4 2.8 7.9
100 0.0 4.6 28.3 4.8 21.2 5.2 9.0 2.4 22.4
100 0.3 6.4 27.9 5.2 20.7 5.8 8.3 4.0 21.7
100 0.5 9.4 30.6 5.0 20.3 5.5 9.1 4.1 19.8
100 0.7 14.6 30.0 5.5 15.1 5.8 7.2 3.4 13.5
200 0.0 5.5 55.5 4.6 37.9 5.4 14.7 3.7 50.4
200 0.3 8.2 57.5 4.7 38.4 4.9 13.2 4.1 50.7
200 0.5 13.9 59.2 5.0 34.0 4.4 12.8 4.6 42.7
200 0.7 29.3 59.9 4.9 23.8 5.3 9.2 3.8 28.3
400 0.0 5.3 89.6 4.7 60.8 4.4 23.4 4.3 87.6
400 0.3 11.5 90.4 5.1 62.5 5.4 25.9 4.2 86.0
400 0.5 27.7 91.2 5.6 56.0 5.1 21.5 4.6 79.6
400 0.7 56.3 92.5 5.5 46.2 4.9 16.8 4.7 63.3
Table 10.3: Empirical level (β1 (t) column) and power (β2 (t) column) of several
tests (in %) for β2 (t) = 1t≤0.5 . Taken from Chauvel (2014).
LWY GT SM Un∗
n ρ(Z (1) , Z (2) ) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t)
50 0.0 7.0 51.6 5.6 57.4 8.9 53.4 2.9 56.3
50 0.3 5.8 44.9 5.4 53.1 7.4 47.5 3.1 49.5
50 0.5 10.0 41.9 5.9 49.2 8.0 43.1 3.4 41.9
50 0.7 14.6 37.3 5.5 37.2 7.9 31.4 2.3 27.4
100 0.0 7.5 82.2 4.9 85.4 6.0 81.9 3.4 88.0
100 0.3 7.0 80.9 5.5 83.9 6.2 80.0 4.3 86.6
100 0.5 13.9 76.5 5.6 78.6 6.4 72.4 3.8 77.9
100 0.7 26.1 70.4 5.7 63.6 5.4 56.5 2.8 58.8
200 0.0 8.7 99.0 5.3 98.9 5.1 98.6 4.6 99.7
200 0.3 8.4 98.5 6.0 98.9 5.4 97.9 4.4 99.5
200 0.5 22.3 97.5 5.9 97.4 5.1 95.8 4.5 98.6
200 0.7 53.4 96.6 5.7 92.4 5.1 88.3 3.6 92.9
400 0.0 13.7 100.0 5.0 100.0 4.9 100.0 4.6 100.0
400 0.3 11.4 100.0 5.4 100.0 4.7 100.0 4.3 100.0
400 0.5 43.5 100.0 6.1 100.0 4.6 99.9 5.0 100.0
400 0.7 84.5 100.0 5.7 99.6 4.9 99.4 5.0 100.0
Table 10.4: Empirical level (β1 (t) column) and power (β2 (t) column) of several
tests (in %) for β2 (t) = 1.51t≤0.5 . Taken from Chauvel (2014).
LWY GT SM Un∗
n ρ(Z (1) , Z (2) ) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t) β1 (t) β2 (t)
50 0.0 5.4 14.4 4.6 12.6 6.8 7.8 2.8 9.8
50 0.3 5.8 14.9 4.9 13.0 7.0 8.1 2.7 9.8
50 0.5 8.0 15.2 5.4 12.6 7.1 8.4 3.2 9.4
50 0.7 8.4 15.2 5.3 9.8 6.6 6.7 3.0 6.7
100 0.0 6.2 26.8 5.4 21.6 5.9 9.1 3.4 21.3
100 0.3 6.8 29.0 5.0 21.3 5.5 8.7 3.6 21.6
100 0.5 9.8 29.6 5.2 20.1 5.8 8.2 4.3 19.9
100 0.7 15.2 31.1 5.3 16.0 5.4 7.5 3.1 14.6
200 0.0 6.0 52.8 5.2 35.8 5.5 12.4 4.2 47.7
200 0.3 8.3 56.7 5.3 37.1 5.1 13.0 4.3 48.6
200 0.5 14.9 59.7 4.7 34.0 5.1 11.6 3.6 43.0
200 0.7 29.9 61.8 6.2 26.0 5.2 10.2 4.3 29.7
400 0.0 5.4 87.3 4.5 62.4 5.6 25.0 4.3 86.2
400 0.3 11.1 90.0 5.3 61.6 5.3 22.4 4.3 85.2
400 0.5 27.0 91.7 4.5 55.4 4.1 18.7 4.1 79.6
400 0.7 56.6 93.2 4.9 46.6 4.6 15.4 4.0 64.5
Table 10.5: Empirical level (β1 (t) column) and power (β2 (t) column) of several
tests (in %) for β2 (t) = 0.51t≤0.5 − 0.51t≥0.5 . Taken from Chauvel (2014).
Figure 10.4: The regression effect process Un∗ (0, ·) (solid line) and its confidence
band (dotted lines) for the simulated dataset with β(t) = 1 (left-hand) and trans-
formed process (right-hand).
The rate of censoring is 18%. The regression effect process Un∗ (0, ·) and its
confidence band under the proportional hazards model are plotted as a function
of time in Figure 10.4. We observe drift, which implies a non-zero effect. The
drift would appear to be linear and the process does not exit the 95% confidence
bands; thus, the proportionality hypothesis appears reasonable. As the drift is
increasing, the coefficient will be positive. The estimator based on the partial
maximum likelihood of the proportional hazards models gives 1.08 and the R2
coefficient is 0.21. Note that the transformation to the Brownian bridge is valid
when applied to Brownian motion with drift, as long as the drift is linear. The
slope term disappears in the transformation. (Software for these computations is
available at http://cran.r-project.org/web/packages/PHeval/index.html.)
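As a small illustration, the R sketch below simulates a discretized Brownian motion with a hypothetical linear drift and applies the transformation W(t) = U(t) − tU(1); the slope is removed and the transformed path behaves like a Brownian bridge.

```r
# Sketch: a linear drift is removed by the bridge transformation W(t) = U(t) - t * U(1).
set.seed(1)
kn    <- 200
tgrid <- seq_len(kn) / kn
drift <- 2                                                  # hypothetical slope
U     <- cumsum(drift / kn + rnorm(kn, sd = 1 / sqrt(kn)))  # Brownian motion with linear drift
W     <- U - tgrid * U[kn]                                  # bridged process: the slope term disappears
oldpar <- par(mfrow = c(1, 2))
plot(tgrid, U, type = "l", xlab = "t", ylab = "process with drift")
plot(tgrid, W, type = "l", xlab = "t", ylab = "transformed (bridge) process")
par(oldpar)
```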
The second simulated dataset was generated under the change-point model β(t) = 1t≤0.4 ,
with t on the initial time scale. The coefficient is equal to 1 for the first part
of the study, then zero afterwards. The simulated sample has 26% censoring.
Figure 10.5(a) shows the regression effect process Un∗ (0, ·) as a function of the
transformed time between 0 and 1. Drift can be seen, indicating the presence of
a non-zero effect. The drift does not seem to be linear and the process exits the
95% confidence band. There is evidence of an effect that changes over time. The
drift appears linear until time 0.6 on the transformed time scale, and the effect
appears to be weaker afterwards. This suggests that a single regression coefficient
alone may not be adequate to provide a good summary of the observations. A
more involved model would postulate a constant effect until time 0.6 on the
transformed time scale, then zero after, denoted β0.6a (t) = β0 1t≤0.6 . Here, the
observed drift after time 0.6 corresponds to noise. The effect after 0.6 could also
be non-zero, so we consider the effect β0.6b (t) = β0 (1t≤0.6 + C0.6 1t>0.6 ),
Figure 10.5: The regression effect process Un∗ (0, ·) for the dataset simulated
according to a change-point model.
with unknown β0 and C0.6 . The value C0.6 is the one multiplied by the regression
coefficient in the second part of the study. In Figure 10.5(b), two straight lines
have been fitted to the process using linear regression, before and after t = 0.6.
The ratio of the second one’s slope to the first’s gives C0.6 = −0.17.
We also consider other change-point times at t = 0.4, t = 0.5, and t = 0.7,
and in particular the coefficients β0.4a (t) = β0 1t≤0.4 , β0.5a (t) = β0 1t≤0.5 and
β0.7a (t) = β0 1t≤0.7 , which are zero in the second part of the study. Piecewise
constant coefficients that are not zero in the second part of the study, β0.4b , β0.5b
and β0.7b , were also considered and plotted over the process in Figure 10.6. Fitting
was performed using linear regression.
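A minimal sketch of this fitting step is given below; the vectors tgrid and U are assumed to hold the transformed event times and the corresponding values of the regression effect process, and the candidate change-point t0 is supplied by the user.

```r
# Sketch: fit straight lines to the regression effect process before and after a
# candidate change-point t0 and return the ratio of their slopes (an estimate of C_t0).
# 'tgrid' and 'U' are assumed to hold the transformed times and the process values.
slope_ratio <- function(tgrid, U, t0 = 0.6) {
  before <- tgrid <= t0
  s1 <- coef(lm(U[before] ~ tgrid[before]))[2]       # slope of the first line
  s2 <- coef(lm(U[!before] ~ tgrid[!before]))[2]     # slope of the second line
  unname(s2 / s1)
}
```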
The shape of the regression effect process indicates that the effect β(t)
decreases with time. Several models with continuously decreasing effects β(t) =
β0 h(t) were selected together with the already mentioned coefficients. For each
model, the maximum likelihood estimator β̂0 and the R2 coefficient were calcu-
lated. The coefficients β0.6a (t) and β0.6b (t) give the largest value of R2 here,
0.25, which corresponds to an increase of 80% in predictive ability with respect
to the proportional hazards model, which had R2 = 0.14. Of the two models
with excellent predictive power, we retain the one with regression parameter
β0.6a (t) = β0 1t≤0.6 . Indeed, as the two models fit the data well and have good
predictive ability, we would choose to retain the one with the simplest coefficient.
The time t = 0.6 on the transformed scale corresponds to the time 0.39 on the
initial scale. Recall that on the initial scale, the change-point time used to sim-
ulate the data was 0.4. The regression coefficient selected is therefore close to
the one used to simulate the data (Table 10.6).
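In standard software, one way to fit such competing change-point models is to split each subject's follow-up at the candidate change-point and allow a separate coefficient on either side. The sketch below does this with the survival package for hypothetical data generated under β(t) = 1t≤0.4 ; since the R2 used in the text is a dedicated measure of predictive ability, the fits are compared here simply through their maximized log partial likelihoods.

```r
# Sketch: piecewise constant effects via data splitting, for hypothetical data
# simulated under beta(t) = 1 for t <= 0.4 and 0 afterwards (initial time scale).
library(survival)

set.seed(2)
n    <- 200
z    <- rbinom(n, 1, 0.5)
H    <- rexp(n)                                   # draws of the cumulative hazard
rate <- exp(1 * z)                                # hazard ratio exp(beta) before the change-point
tm   <- ifelse(H <= 0.4 * rate, H / rate, 0.4 + (H - 0.4 * rate))
cens <- runif(n, 0, 3)
df   <- data.frame(time = pmin(tm, cens), status = as.numeric(tm <= cens), z = z)

# Constant effect (proportional hazards) fit
fit_ph <- coxph(Surv(time, status) ~ z, data = df)

# Change-point fit: one coefficient before the cut, another after
df2    <- survSplit(Surv(time, status) ~ ., data = df, cut = 0.4, episode = "tgroup")
fit_cp <- coxph(Surv(tstart, time, status) ~ I(z * (tgroup == 1)) + I(z * (tgroup == 2)),
                data = df2)

coef(fit_cp)                                      # one estimate per time interval
c(constant = logLik(fit_ph), change.point = logLik(fit_cp))
```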
Figure 10.6: Regression effect process Un∗ (0, ·) (solid line) and piecewise con-
stant coefficient fits (dotted lines) with a change-point at time t0 , for a dataset
simulated according to a change-point model.
Table 10.6: R2 coefficients and partial maximum likelihood estimators β̂0 for
the dataset simulated under a change-point model.
Figure 10.7: The regression effect process Un∗ (0, ·) for a dataset simulated with
a continuously decreasing effect.
For the dataset simulated with a continuously decreasing effect (Figure 10.7(a)), two
lines have again been fitted to the process by linear regression, before and after the
change-point at t = 0.6 (Figure 10.7(b)). The ratio of the second slope to the first gives
the value C0.6 = −0.2. We also considered the coefficient β0.6b (t) = β0 1t≤0.6 ,
which equals zero after t = 0.6. Using the same methodology, we looked at
piecewise constant coefficients with change-points at t = 0.4, 0.5 or 0.7 (see
the definitions of the coefficients in the previous example). Several models with
continuously decreasing effects over time were investigated.
Table 10.7: R2 coefficients and maximum partial likelihood estimators β̂0 for
the dataset simulated under some different NPH models. Taken from Chauvel
(2014).
The next dataset was simulated under a bivariate model, with coefficients β1 (t) = 1t≤0.7
on the initial time scale and β2 = −1. For this example, we chose to simulate, with-
out censoring, n = 200 individuals. Each component of the bivariate process
Σ̂−1/2 Un∗ (0, ·) and its corresponding 95% confidence bands are plotted as a func-
tion of the transformed time between 0 and 1 in Figures 10.8(a) and 10.8(b). The
proportional hazards hypothesis for covariate Z (1) is rejected at the 5% level since
the process exits the confidence bands. The shape of the process’s drift suggests a
piecewise constant effect with a change-point at t0 = 0.6 on the transformed time
scale. As in the univariate case, two fits were implemented using linear regression,
one before t0 = 0.6 and one after. The ratio of the slopes is −0.12, so we con-
sider the regression effect β1 (t) = β1 B0.6 (t), with B0.6 (t) = 1t≤0.6 − 0.12 1t>0.6 .
Other piecewise constant regression coefficients β(t) = β1 Bt0 (t) were also investigated.
Figure 10.8: The regression effect process Σ̂−1/2 Un∗ (0, ·) (solid lines), confidence
bands (dashed lines) and piecewise constant coefficient fits (dotted lines) for
simulated multivariate data.
Table 10.8: Partial maximum likelihood estimates β̂(t) and R2 coefficients for
simulated multivariate data, for models with β1 (t) taken to be β1 , β1 B0.45 (t),
β1 B0.5 (t), β1 B0.55 (t), β1 B0.6 (t), and β1 B0.7 (t). Best fitting model has R2 = 0.39.
Estimation results are shown in Table 10.8. The proportional hazards model
gives an R2 of 0.24. The largest R2 is obtained with β1 (t) = β1 B0.6 (t), increasing
the predictive ability by 60% with respect to the proportional hazards model. In
conclusion, we retain the model with β1 (t) = β1 B0.6 (t) and β2 (t) = β2 . Note
that the time t0 = 0.6 on the transformed scale corresponds to the time 0.68 on
the initial one, which is close to the 0.7 used to simulate the data.
10.10 Illustrations from clinical studies
Figure 10.9: Kaplan-Meier curves and model based curves for Freireich data.
Regression effect process indicates strong effects and good model fit.
Figure 10.10: Kaplan-Meier curves for variable tumor size for Curie Institute
breast cancer study. Regression effect process indicates clear effects that diminish
gradually with time.
The regression effect process for the tumor size variable is plotted as a function of
the transformed time [0, 1] in Figure 10.10. The process exits the
95% confidence band and the effect appears not to be constant, with a gradually
decreasing slope over time. Hence, one model that fits the data better than
the proportional hazards one has a piecewise constant effect for the tumor size
variable, with a change-point at t = 0.2 on the transformed time scale, and
constant effects for the other two prognostic factors. As in the simulations, two
straight lines have been fitted to the process, before and after t = 0.2, which
leads us to consider a regression effect for the tumor size variable of βsize (t) =
β0 (1t≤0.2 + 0.24 1t≥0.2 ). The predictive ability of this model with respect to the
proportional hazards one corresponds to an increase of more than 30% in the
R2 , moving from 0.29 to 0.39 (Table 10.9).
Figure 10.11 shows the process for the presence of progesterone receptor effect
as a function of the transformed time scale. The drift is not entirely linear but
the process stays within the confidence bands corresponding to a constant effect
over time. We considered several potential regression effects: a change-point
model with a jump at t = 0.5 on the transformed time scale, i.e., βrec0 (t) =
β0 (1t≤0.5 + 0.39 1t>0.5 ), and several continuous effects: βrec1 (t) = β0 (1 − t),
βrec2 (t) = β0 (1−t)2 , βrec3 (t) = β0 (1−t2 ) and βrec4 (t) = β0 log(t). Figure 10.12
shows the process for the cancer grade effect. It touches the lower bound in the
constant effect confidence band, but does not breach it. The drift does not seem
to be linear, and gives the impression of a negative effect that decreases over
time. The simplest effect possible, i.e., constant over time, would nevertheless
appear to be a good candidate; however, in the model construction context, we
Tumor size                      Progesterone receptor          Stage                          R2
0.84 1.03 -0.68 0.29
1.77(1t≤0.2 + 0.241t>0.2 ) 1.03 -0.66 0.39
0.85 −1.02 log(t) -0.67 0.39
1.74(1t≤0.2 + 0.241t>0.2 ) −1.02 log(t) -0.66 0.51
1.72(1t≤0.2 + 0.241t>0.2 ) −1.02 log(t) −0.82(1t≤0.4 + 0.691t>0.4 ) 0.52
Table 10.9: Partial likelihood estimates and R2 coefficients for the clinical trial
data.
Some results are shown in Table 10.9. The proportional hazards model gives
an R2 of 0.29. As we have mentioned previously, a model allowing the tumor
size effect to change in a simple way over time gives a great improvement in
predictive ability (in the order of 30%). The largest R2 is obtained with piecewise
constant effects for tumor size and grade, and an effect proportional to log(t)
for the progesterone receptor. The predictive ability of this model is 80% better
than the proportional hazards one for the three prognostic factors. If we instead
introduce a time-dependent effect for the cancer grade while also allowing the
tumor size and progesterone effects to change over time, we get an increase in
Figure 10.11: Kaplan-Meier curves for variable progesterone status for breast can-
cer study. Regression effect process is suggestive of some weak time dependency.
Figure 10.12: Kaplan-Meier curves for variable tumor grade for breast cancer
study. Process indicates the presence of effects that appear to diminish with
time.
the R2 from 0.51 to 0.52. Such a small improvement cannot justify the added
complexity of this model; therefore, we retain the one with constant cancer stage
effect and non-constant effects for tumor size and progesterone.
Figure 10.13: Fitting a simple change-point model to the regression effect process
for the variable receptor in the breast cancer study in order to obtain a better fit.
The aim of the model construction process is to move away from models showing poor adequacy in the direction of models
with improved adequacy. The sense of adequacy is both in terms of providing
a model based description that can plausibly be considered to have generated
the observations, and in terms of the strength of prediction. These two senses
feed off one another so that model construction is to some degree of an iterative
nature. In Figure 10.15 we can see a further small improvement in the model fit
5. Show how we can approximate the drift parameter function for an observed
Brownian motion with non-linear drift by using multiple change-points. How
would you make use of R2 to obtain the best possible fit while minimizing
the number of change-points?
6. Given two models: one has a higher R2 than the other but shows a signif-
icantly poorer fit. Which of the two models would you recommend? Give
reasons.
7. On the basis of a real data set make use of stepwise model building techniques
to obtain the model that appears to be the most satisfactory. At each step
argue the basis underlying any specific decision. On what basis would we
conclude that, given the information we have, we are unlikely to find a better
performing model?
    Ω²_{T|Z} = 1 − φZ(−2β) / {2φZ(−2β) − φZ(−β)²} = 1 − 1 / {2 − φZ(−β)² φZ(−2β)⁻¹}.   (10.14)
where γ is in the ball with center α and radius supt∈[0,1] |α(t) − β(t)|. Under A3,
there exists a constant C(γ) such that v (γ(t), t) = C(γ). Using this expansion
in equation (10.11), we get:
    lim_{n→∞} R²(α(t)) = 1 − {C(β) + C(γ)² ∫_0^1 (α(t) − β(t))² dt} / {C(β) + ∫_0^1 (e(0, t) − e(β(t), t))² dt}.
We revisit the standard log-rank test and several modifications of it that come
under the heading of weighted log-rank tests. Taken together these provide us
with an extensive array of tools for the hypothesis testing problem. Importantly,
all of these tests can be readily derived from within the proportional and non-
proportional hazards framework. Given our focus on the regression effect pro-
cess, it is equally important to note that these tests can be based on established
properties of this process under various assumptions. These properties allow us
to cover a very broad range of situations. With many different tests, including
goodness-of-fit tests, coming under the same heading, it becomes particularly
straightforward to carry out comparative studies on the relative merits of dif-
ferent choices. Furthermore, we underline the intuitive value of the regression
effect process since it provides us with a clear visual impression of the possible
presence of effects as well as the nature of any such effects. In conjunction with
formal testing, the investigator has a powerful tool to study dependencies and
co-dependencies in survival data.
Significance tests have an important role to play in formal decision making but
also as a way to provide the shortest possible confidence intervals in a model
fitting and estimation setting. There are several possible statistics for testing a
null hypothesis of no difference in survival experience between groups against
some alternative hypothesis. The alternative may be of a general nature or may
be more specific. Specific directions moving away from the null will be mirrored
in the regression effect process. If the form of this process can be anticipated
ahead of time this will allow us to derive tailor made tests with high power
for certain specific alternatives. Mostly, the null hypothesis corresponds to an
absence of regression effects for two or more distinct groups, i.e., β(t) = 0 for
all t, and, in this case, the test statistics often assume a particularly simple
form. A classical example arises in the context of a clinical trial when we wish
to compare the survival experience of patients from different treatment groups,
the goal being to assess whether or not a new treatment improves survival. For
the sake of simplicity, and clarity of presentation, we mostly consider the case
with two groups defined by a single binary variable. Extensions to several groups
raise both operational and conceptual issues. We may wish to test the impact
of all variables taken together—an example is provided of a clinical trial with
3 treatment arms defined via 2 binary covariates—or we may wish to test the
impact of one variable having controlled, in some way, for the effects of one or
more of the other covariables. Often we will restrict the alternative hypothesis
to one that belongs to the proportional hazards family.
Consider a straightforward clinical trial. Patients in the first group receive the
new treatment, indexed by T , and patients in the second group receive a placebo,
indexed by P . A comparison of the two survival functions is carried out by testing
a null hypothesis H0 against an alternative H1 , expressed more formally as:
    H0 : ∀ t, β(t) = 0,   versus   H1 : there exists a set T of time points such that ∫_T dt > 0 and ∫_T β²(t) dt > 0.
The above alternative is very general and while it will have power against any
kind of departure from the null, we can increase this power, in some cases greatly
increase this power, by considering more specific alternative hypotheses. This is a
constant theme throughout this chapter. We focus more on two-sided tests, but
one-sided tests can be structured once we figure out how to order different pos-
sible functions for β(t). This is simple when β is constant and does not depend
on t, otherwise the question itself is not always easily framed. A comparison of
survival functions can also be useful in epidemiological studies, to compare for
example the survival of groups of subjects exposed to different environmental
factors. In these kinds of studies we would usually expect that, under the alter-
native hypothesis, at any time (age), t, the difference between groups would be
in the same direction.
and
    I(0) = ∫_0^1 V0(Z | s) dN̄∗(s) = Σ_{i=1}^{kn} V0(Z | ti).
By definition, the test with statistic U(0)²/I(0) is a score test for testing
H0 : β = 0 in the proportional hazards model with parameter β.
Definition 11.1. The log-rank test rejects H0 with Type 1 risk α if |Ln| ≥
z_{α/2} , where z_{α/2} is the upper quantile of order α/2 of the standardized
normal distribution, and
    Ln = U(0)/√I(0) = Σ_{j=1}^{kn} {Z(tj) − E0(Z | tj)} / √( Σ_{i=1}^{kn} V0(Z | ti) ).   (11.1)
When β ∈ Rp , the statistic L²n is defined by L²n = U(0)ᵀ I(0)⁻¹ U(0) and con-
verges to a χ² distribution with p − 1 degrees of freedom under the null hypothe-
sis. Written in this way, the log-rank test can be extended to continuous variables
(e.g., age) to test for their effect on survival. The test statistic is easy to imple-
ment, and the p-value of the test can be accurately approximated since the
asymptotic distribution is known. The great popularity of this test is in particular
due to it being the most powerful (Peto and Peto, 1972) under local alternative
hypotheses of the proportional hazards type, which are written
    H1 : ∃ β0 ≠ 0, ∀ t, β(t) = β0 .
This test is powerful for detecting the presence of an effect β that is constant
in time between the groups, under the proportional hazards model. On the other
hand, in the presence of non-proportional hazards, i.e., an effect β(t) which varies
over time, the log-rank test is no longer optimal in the sense that there exist
alternative consistent tests with greater power. This is the case, for instance,
in the presence of hazards that intersect, which happens when the coefficient
changes sign during the study period (Leurgans, 1983, 1984).
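The log-rank statistic of Definition 11.1 is easily computed directly from its score-test form and, in the absence of tied event times, its square should agree with the chi-square statistic returned by survdiff() in the survival package. The data below are simulated purely for illustration.

```r
# Sketch: the log-rank statistic as the score test U(0)/sqrt(I(0)), compared with
# survdiff(). A binary group indicator and untied event times are assumed.
library(survival)
set.seed(3)
n    <- 120
z    <- rbinom(n, 1, 0.5)
tm   <- rexp(n, rate = exp(0.5 * z))
cens <- runif(n, 0, 2)
time <- pmin(tm, cens); status <- as.numeric(tm <= cens)

ord   <- order(time)
ev    <- ord[status[ord] == 1]                       # indices of events, in time order
terms <- sapply(ev, function(i) {
  at_risk <- z[time >= time[i]]
  e0 <- mean(at_risk)                                # E_0(Z | t_i)
  v0 <- mean(at_risk^2) - e0^2                       # V_0(Z | t_i)
  c(z[i] - e0, v0)
})
Ln <- sum(terms[1, ]) / sqrt(sum(terms[2, ]))        # equation (11.1)
c(Ln_squared = Ln^2, survdiff_chisq = survdiff(Surv(time, status) ~ z)$chisq)
```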
Definition 11.2. The weighted log-rank test has the statistic LWn given by
    LWn = Σ_{j=1}^{kn} Wn(tj) {Z(tj) − E0(Z | tj)} / √( Σ_{i=1}^{kn} Wn(ti)² V0(Z | ti) ),   (11.2)
where the Wn(ti) are weights.
Many other weights have been proposed (Lagakos et al., 1990; Mantel and Sta-
blein, 1988; Wu and Gilbert, 2002). Note that the test with the weights proposed
in Gehan (1965) corresponds to a modification of the Wilcoxon (1945) test that
takes censored data into account. This test is equivalent to the two sample
Mann-Whitney test and to the Kruskal-Wallis test when there are more than two
samples (Breslow, 1970). Aalen (1978) and Gill (1980) have shown that weighted
log-rank statistics converge to centered normal distributions. Jones and Crowley
(1989, 1990) have defined a broader class of tests that includes weighted and
unweighted log-rank tests. This class also includes tests for comparing more than
two survival functions, as well as specific constructions such as a trend across
groups.
All of these weighted log-rank tests were developed on the initial time scale.
However, since they are rank tests, they remain unchanged on the transformed
time scale [0, 1] using the mapping φn from Equation (9.2). When the weights
are larger for earlier times, we will put more emphasis there so that such a test
ought to be more sensitive to the detection of early effects. This can be expressed
as the coefficients β(t) decreasing with time (Fig. 11.1).
Many other choices are possible. The weights proposed by Tarone and Ware
(1977) and Fleming and Harrington (1991) involve classes of weights for which
a good choice of function g or (p, q) can help detect late effects, i.e., coefficients
β(t) which increase with time. In immunotherapy trials there is often an early
period in which no effects can be seen, followed by a period in which effects
become manifest. A judicious choice of the weights described by Tarone and Ware
(1977) and Fleming and Harrington (1991) enables us to increase the power of
the basic log-rank test. The optimal choice of g or (p, q) depends on the data,
and may not be easy to make. Peckova and Fleming (2003) have proposed a
weighted log-rank test whose weights are chosen adaptively as a function of the
data. Such tests will generally display good behavior, although their properties,
both under the null and the alternative, may be difficult to anticipate.
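For reference, the survdiff() function in the survival package implements the Gρ family of weighted log-rank tests: rho = 0 gives the log-rank test and rho = 1 an early-weighted version (the Peto-Peto modification of the Gehan-Wilcoxon test). More general weights, such as the (p, q) family of Fleming and Harrington, would need a dedicated implementation.

```r
# Sketch: weighted log-rank tests via the G^rho family implemented in survdiff().
# rho = 0 gives the log-rank test; rho = 1 weights by the survival estimate and so
# puts more emphasis on earlier failure times.
library(survival)
data(veteran, package = "survival")                           # example data shipped with 'survival'
survdiff(Surv(time, status) ~ trt, data = veteran, rho = 0)   # log-rank
survdiff(Surv(time, status) ~ trt, data = veteran, rho = 1)   # early-weighted version
```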
While these tests can exhibit good power under non-proportional hazards-
type alternative hypotheses, they would be expected to lose power with respect to
unweighted log-rank tests under proportional hazards-type alternative hypotheses.
The asymptotic behavior of the regression effect process Un∗ , given by Theorems
9.1 and 9.2, allows us to construct hypothesis tests on the value of the regression
parameter β(t). The null hypothesis H0 and alternative hypothesis H1 that we
wish to test are
    H0 : ∀ t, β(t) = β0 ,   versus   H1 : ∫_T {β(t) − β0}² dt > 0,
where β0 is a fixed constant and T is a set of time points with ∫_T dt > 0. The
reason for the slightly involved
expression for H1 is to overcome situations in which the test could be inconsis-
tent. It enables us to cater for cases where, for example, β(t) differs from β0 only
on a finite set of points or, to be more precise, on a set of time points of mea-
sure zero. Under H0 , the drift of the regression effect process Un∗ (β0 , ·) at β0 is
zero, and the process converges to standard Brownian motion. We can therefore
base tests on the properties of Brownian motion. When β0 = 0, this amounts to
testing for the absence of an effect between the various values of the covariate
Z. In particular, if the covariate is categorical and represents different treatment
groups, this means testing the presence of a treatment effect on survival. The
β0 = 0 case thus corresponds to the null hypothesis of the log-rank test.
Several possibilities are mentioned in the book of O’Quigley (2008), including
tests of the distance from the origin at time t, the maximal distance covered in
a given time interval, the integral of the Brownian motion, the supremum of a
Brownian bridge, reflected Brownian motion, and the arcsine law. We will take a
closer look at the first two of these, followed by a new proposition. In this section
we look at the univariate case with one covariate and one effect to test (p = 1),
before extending the results to the case of several coefficients.
The following, almost trivial, result turns out to be of great help in our theo-
retical investigations. It is that every non-proportional hazard model, λ(t|Z(t)) =
λ0 (t) exp{β(t)Z(t)}, is equivalent to a proportional hazards model.
Lemma 11.1. For given β(t) and covariate Z(t) there exists a constant β0
and time-dependent covariate Z ∗ (t) so that λ(t|Z(t)) = λ0 (t) exp{β(t)Z(t)} =
λ0 (t) exp{β0 Z ∗ (t)} .
There is no loss of generality if we take β0 = 1. The result is immedi-
ate upon introducing β0 ≠ 0 and defining a time-dependent covariate Z ∗ (t) ≡
Z(t)β(t)β0⁻¹ . The important thing to note is that we have the same λ0 (t) either
side of the equation and that, whatever the value of λ(t|Z(t)), for all values of
t, these values are exactly reproduced by either expression, i.e., we have equiv-
alence. Simple though the lemma is, it has strong and important ramifications.
It allows us to identify tests that are unbiased, consistent, and indeed, most
powerful in given situations. A non-proportional hazards effect can thus be made
equivalent to a proportional hazards one simply by multiplying the covariate Z
by β(t). The value of this may appear somewhat limited in that we do not know
the form or magnitude of β(t). However in terms of theoretical study, this simple
observation is very useful. It will allow us to identify the uniformly most powerful
test, generally unavailable to us, and to gauge how close in performance comes
any of those tests that are available in practice.
Our structure hinges on a useful result of Cox (1975). Under the non-
proportional hazards model with β(t), the increments of the process Un∗ are
centered with variance 1. From Proposition 9.3 we also have that these incre-
ments are uncorrelated, a key property in the derivation of the log-rank statistic.
Based on the regression effect process, two clear candidate statistics stand out
for testing the null hypothesis of no effect. The first is the “distance-traveled” at
time t test and the second is the area under the curve test. Under proportional
hazards the correlation between these two test statistics approaches one as we
move away from the null. Even under the null this correlation is high and we
consider this below. Under non-proportional hazards the two tests behave very
differently. A combination of the two turns out to be particularly valuable for
testing the null against various non-proportional hazards alternatives, including
declining effects, delayed effects, and effects that change direction during the
study.
    lim_{n→∞} [E Un∗(0, t) − E Un∗(0, s)] = R(kn, β)(t − s),   lim_{n→∞} Var{Un∗(0, t) − Un∗(0, s)} = t − s.
Corollary 11.5. Under H1 : β > 0, suppose that P(s) is the p-value for
Un∗(0, s)/√s. Then, assuming that kn/n → C as n → ∞, for t > s, E P(t) <
E P(s).
By applying the Neyman-Pearson lemma to the normal distribution, the above
two corollaries tell us that the distance traveled test is the uniformly most pow-
erful test of the null hypothesis, H0 : β = 0 against the alternative, H1 : β > 0.
The log-rank test, Ln shares this property and this is formalized in Proposition
11.2. Lemma 11.1 allows us to make a stronger conclusion. Assume the non-
proportional hazards model and consider the null hypothesis, H0 : β(t) = 0 for
all t. Then:
Lemma 11.3. The uniformly most powerful test of H0 : β(t) = 0 for all t, is
the distance traveled test applied to the proportional hazards model λ(t|Z(t)) =
λ0 (t) exp{β0 Z ∗ (t)} in which Z ∗ (t) = β0−1 β(t)Z(t) and where we take 0/0 to
be equal to 1.
Since, under the alternative, we may have no information on the size or shape
of β(t), this result is not of any immediate practical value. It can though act as
a bound, not unlike the Cramer-Rao bound for unbiased estimators and let us
judge how well any test is performing when compared with the optimal test in
that situation.
The following proposition states that the distance test statistic is consistent;
that is, the larger the sample, the better we will be able to detect the alternative
hypothesis of a proportional hazards nature. The result continues to hold for
tests of the null versus non-proportional hazards alternatives but restrictions
are needed, the most common one being that any non-null effect is monotonic
through time.
The quantity ∫_0^1 V0(Z | s) dF̂(s) is consistent for E(var(Z|t)) under the proportional hazards model with β(t) = 0.
Rather than integrate with respect to dF̂ it is more common, in the counting
process context, to integrate with respect to dN̄ , the two coinciding in the
absence of censoring. If, we replace V0 (Z|s) by V̄0 (Z) then the distance traveled
test coincides exactly with the log-rank test. The following proposition gives an
asymptotic equivalence under the null between the log-rank and the distance
traveled tests:
Proposition 11.2. Let Ln denote the log-rank statistic and let Un∗ (0, 1) be
the distance statistic Un∗ (β0 , 1) evaluated at β0 = 0. We have that
    |Un∗(0, 1) − Ln| → 0 in probability as n → ∞.
Lemma 11.4. The process U0∗ (β̂, β0 , t) converges in distribution to the Brownian
bridge, in particular E U0∗ (β̂, β0 , t) = 0 and Cov {U0∗ (β̂, β0 , s), U0∗ (β̂, β0 , t)} =
s(1 − t).
The Brownian bridge is also called tied-down Brownian motion for the obvious
reason that at t = 0 and t = 1 the process takes the value 0. Carrying out a test
at t = 1 will not then be particularly useful and it is more useful to consider, as
a test statistic, the greatest distance of the bridged process from the time axis.
We can then appeal to:
Lemma 11.5.
    Pr { sup_u |U0∗(β̂, β0, u)| ≥ a } ≈ 2 exp(−2a²).   (11.4)
Defining Wr (t) accordingly, we have the same result. The process Wr (t) coincides
exactly with W(t) until such a time as a barrier is reached. We can imagine this
barrier as a mirror and beyond the barrier the process Wr (t) is a simple reflec-
tion of W(t). So, consider the process U r (β̂, β0 , t) defined to be U ∗ (β̂, β0 , t) if
|U ∗ (β̂, β0 , t)| < r and to be equal to 2r − U ∗ (β̂, β0 , t) if |U ∗ (β̂, β0 , t)| ≥ r.
Lemma 11.6. The process U r (β̂, β0 , t) converges in distribution to Brownian
motion, in particular, for large samples, E U r (β̂, β0 , t) = 0 and Cov {U r (β̂, β0 , s),
U r (β̂, β0 , t)} = s.
Under proportional hazards there is no obvious role to be played by U r .
However, imagine a non-proportional hazards alternative where the direction of
the effect reverses at some point, the so-called crossing hazards problem. The
statistic U ∗ (β̂, 0, t) would increase up to some point and then decrease back to a
value close to zero. If we knew this point, or had some reasons for guessing it in
advance, then we could work with U r (β̂, β0 , t) instead of U ∗ (β̂, β0 , t). A judicious
choice of the point of reflection would result in a test statistic that continues to
increase under such an alternative so that a distance from the origin test might
have reasonable power. In practice we may not have any ideas on a potential
point of reflection. We could then consider trying a whole class of points of
reflection and choosing that point which results in the greatest test statistic. We
require different inferential procedures for this.
A bound for a supremum-type test can be derived by applying the results of
Davies (1977, 1987). Under the alternative hypothesis we could imagine incre-
ments of the same sign being added together until the value r is reached, at
which point the sign of the increments changes. Under the alternative hypoth-
esis the absolute value of the increments is strictly greater than zero. Under
the null, r is not defined and, following the usual standardization, this set-
up fits in with that of Davies. We can define γr to be the time point satis-
fying U ∗ (β̂, β0 , γr ) = r. A two-sided test can then be based on the statistic
M = supr {|U r (β̂, β0 , 1)| : 0 ≤ γr ≤ 1}, so that:
    Pr { sup |U r (β̂, β0, 1)| > c : 0 ≤ γr ≤ 1 } ≤ Φ(−c) + {exp(−c²/2) / (2π)} ∫_0^1 {−ρ11(γ)}^{1/2} dγ
several pairs of U r (β̂, β0 , 1) and U s (β̂, β0 , 1) can be obtained. Using these pairs,
an empirical, i.e., product moment, correlation coefficient can be calculated.
Under the usual conditions (Efron, 1981a,b,c, 1987; Efron and Stein, 1981), the
empirical estimate provides a consistent estimate of the true value. This sam-
pling strategy is further investigated in related work by O’Quigley and Natarajan
(2004). A simpler approximation is available (O’Quigley, 1994) and this has the
advantage that the autocorrelation is not needed. This may be written down as
    Pr { sup |U r (β̂, β0, 1)| > M } ≈ Φ(−M) + 2^{−3/2} Vρ exp(−M²/2)/√π,   (11.5)
where Vρ = Σ_i |U r (β̂, β0, 1) − U s (β̂, β0, 1)|, the γi , ranging over (L, U), are the
turning points of T(0, β̂; ·) and M is the observed maximum of T(0, β̂; ·). Turning
points only occur at the kn distinct failure times and, to keep the notation
consistent with that of the next section, it suffices to take γi , i = 2, . . . , kn , as
being located half way between adjacent failures. To this set we can add γ1 = 0
and γkn +1 to be any value greater than the largest failure time, both resulting
in the usual constant estimator.
Definition 11.4. For any γ ∈ B, the process representing the area under the
standardized score function at time t, denoted {Jn (γ, t), 0 ≤ t ≤ 1}, is defined
by
    Jn(γ, t) = ∫_0^t Un∗(γ, u) du,   0 ≤ t ≤ 1.
Theorem 9.1 and properties of Brownian motion (Bhattacharya and Waymire,
1990) allow us to establish the following result.
Recall that the random variable ∫_0^t W(s) ds has a centered normal distribution
with variance t³/3. The proposition above allows us to define a test using the
area under the score process at time t.
Proposition 11.4. Let t ∈ [0, 1]. The statistic for the area under the standardized
score process at time t is Jn (β0 , t). The test for the area under the curve at
time t rejects H0 at asymptotic level α if
    (3t⁻³)^{1/2} |Jn(β0, t)| ≥ z_{α/2}.
The asymptotic p-value of the test is 2{1 − Φ((3t⁻³)^{1/2} |Jn(β0, t)|)}.
Remark. As for the distance from origin traveled, we are at liberty to choose
values for t other than t = 1. This opens up the possibility of a very wide
range of potential tests. For now, we only consider t = 1, and to keep the
text simple, we will simply call the test the “area under the curve test”. The
following proposition indicates that the test statistic for the area under the
curve is consistent.
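A sketch of the area under the curve test follows, again taking proc from the reg_effect_process sketch and approximating Jn(0, t) by a Riemann sum on the transformed time scale.

```r
# Sketch: the area under the curve test of Proposition 11.4, evaluated here at t = 1.
auc_test <- function(proc, t = 1) {
  keep <- proc$t <= t
  Jn   <- sum(proc$U[keep]) / length(proc$t)       # integral of U_n^*(0, u) up to t
  stat <- sqrt(3 * t^(-3)) * abs(Jn)               # standardized statistic
  c(statistic = stat, p.value = 2 * (1 - pnorm(stat)))
}
auc_test(proc)
```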
Figure 11.2: Under proportional hazards the upper two figures indicate very close
agreement between log-rank and AUC. Lower curves indicate substantial gains
of AUC over log-rank when effects diminish with time.
Lemma 11.7. Let t ∈ [0, 1]. Under the proportional hazards model with
parameter β0 , the covariance function of Jn (β0 , t) and Un∗ (β0 , t) is such
that
    Cov{Jn(β0, t), Un∗(β0, t)} → t²/2 in probability as n → ∞.
The result opens up the possibility for whole classes of tests based on different
ways of combining the component contributions. We can build tests based on
different combinations of the two that can have good power under both types
of alternative hypothesis (proportional and non-proportional hazards). Below, we
define a class of adaptive tests that we describe as restrictive adaptive tests. They
combine the distance statistic Un∗ (β0 , t) with the area under the curve statistic
Jn (β0 , t). The user is allowed to bias the choice, hence the term restrictive.
The resulting statistic takes the value of one or the other of the tests. The
class is adaptive in the sense that users do not directly choose which statistic
to apply; instead, it is chosen according to the data, with a parameter set by
the user. We also present an unrestricted adaptive test that does not depend on
any constraining parameters. We can control the degree to which we want the
data to, in some sense, have their say. At one extreme, no amount of data will
convince us to move away from the distance traveled (log-rank) test, whereas,
on the other hand, a more even handed approach might specify no preference for
either test.
Figure 11.3: Two examples of the regression effect process Un∗ (0, ·) under the
null, β = 0. When reading such a figure, note the different scales.
The integrated log-rank test behaves very much like the area under the curve test under, and in the vicinity of, the null and so is not studied deeply in its own right. Since
the absolute difference between the area under the curve test and the integrated
log-rank test converges in probability to zero, under the usual conditions on n,
kn , and the censoring, we may sometimes refer to them as essentially the same
thing. The term “integrated log-rank test” may be used informally in either case.
We can glean a lot from the visual impression given by Figure 11.2. The
upper two processes correspond to a case arising under proportional hazards.
The most efficient estimate of the slope of a Brownian motion with linear drift
is given by connecting the origin (0,0) to the point of arrival of the process at
t = 1. The triangular area under this slope corresponds to precisely 0.5 multiplied
by the distance from origin (log-rank test). The variance of this would be 0.25
times the variance of the log-rank statistic and so we see, immediately, that
the area of the triangle can itself be taken to be the log-rank test. Now, under
proportional hazards, we have linear drift so that the integrated log-rank test—
the area under the process—will be almost identical in value to the log-rank test.
The triangular area and the area under the process almost coincide. This is well
confirmed in simulations and, even when the drift is equal to zero, i.e., under
the null hypothesis, the correlation between the log-rank test and the integrated
log-rank test is very high. Let us now consider the lower two graphs of Figure
11.2. We have no difficulty in seeing that the triangular area in this case will be
quite different from the area under the process. Not only that, but in cases like
these, where this area corresponds to the test statistic, it is clear that we will lose
a lot of power by working with the triangular area (log-rank test) rather than
the area under the process (integrated log-rank test). The differences in these
two areas could be large. The lower graphs indicate an effect that diminishes
with time. A yet stronger illustration is presented in Figure 11.4(b) where we can
see that a very great loss of power would be consequent upon the choice of a
statistic based on the triangular area (log-rank) rather than all of the area under
the curve (Fig. 11.3).
Note also that it is easy to encounter such a diminishing effect, even under
proportional hazards. Suppose, for example, that some percentage of the treated
group fails to stick with the treatment. This may be due to undesirable secondary
effects or a lack of conviction concerning the potential of the treatment. Either
way, we will see a dilution of effect through time resulting in a heterogeneous
rather than a homogeneous sample. The impact of this can be a regression process
where the effect appears to diminish with time. When so, it is very likely that the
integrated log-rank would be a more powerful choice of test than the log-rank
test. As a result, in very many practical situations, it can be more logical to
take the integrated log-rank test, rather than the log-rank test, as our working
standard. There are several other possibilities, many of which remain open to
exploration. Combining the log-rank test with the integrated log-rank test can
also provide a valuable tool, in particular in situations where we anticipate a
proportion of non-responders. We look at this in the following section.
Definition 11.5. The area over the curve (AOC) test statistic Tn(β0, 1) is
given by:
    Tn(β(t), t) = Un∗(β(t), t) − ∫_0^t Un∗(β(u), u) du.   (11.7)
Note that the time scale is on the interval (0,1) so that Un (0, 1) corresponds to
the area of the rectangle of width 1 and height Un (0, 1). The area
below the regression effect process is given by ∫_0^1 Un∗(0, u) du so that subtracting
this from Un (0, 1) results in the shaded area. This area is our test statistic. Before
considering more precisely the statistical properties of the test statistic we can
build our intuition by making some observations. The expected value of the test
statistic under H0 : β(t) = 0 , t ∈ (0, 1), will be zero. This is because the expected
value of the regression effect process and the area under its path are both zero
under H0 . These two quantities are correlated but that does not impact the
Figure 11.4: The regression effect process Un∗ (0, ·) under PH and strong, non-
proportional hazards (NPH) effect. It is clear that, under PH, log-rank test (tri-
angular area) will be close to AUC. Under NPH, the area under the curve will
provide a more powerful test statistic.
expectation. It will impact the variance and we consider this in the next section.
Under a proportional hazards departure from H0 we anticipate a linear drift in
the regression effect process. The test statistic will then be well approximated by
one half of the area given by Un (0, 1) which is, in turn, equivalent to the log-rank
statistic. The test will then be very close to the log-rank test and a small amount
of power is lost. For a delayed effect, as long as it is not too weak and allows
Un (0, 1) to drift ultimately away from the origin, we will obtain a large value of
the test statistic, all the more so as Un (0, 1) finally climbs away from the region
close to the axis.
Lemma 11.8. Under the model, with parameter β0 (t), Tn (β0 (t), t) is a
Gaussian process for which we have E Tn (β0 (t), t) = 0 and Var Tn (β0 (t), t) =
t(1 − t) + t³/3. In particular, we have that Var Tn (β0 (t), 1) = 1/3.
Tests based upon Tn (β0 (t), t) would control for Type I error and would be antic-
ipated to have good power, albeit suboptimal, in proportional hazards situations
and to have high power for delayed effect alternatives.
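The corresponding sketch for the area over the curve statistic at t = 1 is immediate, using Var Tn(β0(t), 1) = 1/3 from Lemma 11.8 to standardize; proc is again the output of the reg_effect_process sketch.

```r
# Sketch: the area over the curve statistic T_n(0, 1) of Definition 11.5.
aoc_test <- function(proc) {
  kn <- length(proc$t)
  Jn <- sum(proc$U) / kn                           # area under the process on (0, 1)
  Tn <- proc$U[kn] - Jn                            # distance minus area under the curve
  c(Tn = Tn, p.value = 2 * (1 - pnorm(abs(Tn) / sqrt(1 / 3))))   # Var T_n(0, 1) = 1/3
}
aoc_test(proc)
```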
Figure 11.5 is taken from Flandre and O’Quigley (2019) and shows how the
test statistic might look under an alternative of a delayed effect. This study is
fully described in Hodi et al. (2018). For a period not far short of the median
survival times there appears to be no effect. Beyond this the effect appears to be
a significant one that would tend to rule out the plausibility of the null hypothesis
of absence of effect.
If the absolute value of Δn (β0 , t) is fairly small we can expect the hazards to be
proportional, and the distance test or, equally so, the log-rank test (Proposition
11.2) to be powerful. If not, the area under the curve test is likely to be more
powerful. In this way, we define a class of restricted adaptive tests that depend
on the threshold γ ≥ 0.
Definition 11.6. The class of statistics for restricted adaptive tests at time
t is denoted Mnγ(β0, t), γ ≥ 0, where for γ ≥ 0,
    Mnγ(β0, t) = { √3 |Jn(β0, t)| − |Un∗(β0, t)| } 1_{|Δn(β0, t)| ≥ γ} + |Un∗(β0, t)|.
We have choice in which value of t to work with, and if effects are known to
drop to zero beyond some point τ then it makes sense to choose t = τ. In most
situations, and wishing to make full use of all of the observations, we will typically
choose t = 1.
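A sketch of the statistic at t = 1 follows, with proc taken from the reg_effect_process sketch. The quantity Δn(β0, t) measures how far the process departs from linear drift; in the sketch it is taken, as a working assumption suggested by the indicator |2Xi − Yi| ≥ γ appearing in the expression for f(Xi, Yi) just below, to be 2Jn(0, 1) − Un∗(0, 1).

```r
# Sketch: the restricted adaptive statistic M_n^gamma(0, 1) of Definition 11.6.
# The form of Delta_n used here (2 * Jn - Un) is a working assumption.
adaptive_stat <- function(proc, gamma = 1.5) {
  kn    <- length(proc$t)
  Un1   <- proc$U[kn]                              # distance from the origin at t = 1
  Jn1   <- sum(proc$U) / kn                        # area under the curve at t = 1
  delta <- 2 * Jn1 - Un1                           # assumed form of Delta_n(0, 1)
  if (abs(delta) >= gamma) sqrt(3) * abs(Jn1) else abs(Un1)
}
adaptive_stat(proc, gamma = 1.5)
```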
where for i = 1, . . . , N,
    f(Xi, Yi) = { √3 |Xi| − |Yi| } 1_{|2Xi − Yi| ≥ γ} + |Yi|.
When γ = 0, Mnγ (β0 , t) is the absolute value of the area under the curve.
When γ → ∞, Mnγ (β0 , t) tends to the absolute value of the distance from the
origin. Between the two, Mnγ (β0 , t) is equal to the absolute value of one or the
other. The optimal choice of γ depends on the unknown form of the alternative
hypothesis H1,P H or H1,N P H . One strategy is to work with a large value of
γ, which essentially amounts to applying the log-rank test and keeping some
power in reserve for unanticipated alternative hypotheses in which the effect
changes strongly over time (for example, a change of sign). However, this test
will be less powerful than the log-rank in detecting the presence of a non-zero
constant effect. If we have no wish to give priority to either of the alternative
hypotheses, rather than estimating the optimal parameter γ and losing power to
detect the alternative hypothesis, we could simply just take the test that provides
the smallest p-value. Of course this then requires adjustment in order to maintain
control on Type 1 error.
Proposition 11.7. (Chauvel and O’Quigley 2014). Denote φ{·; 0, Σ(t)} the
density of the centered normal distribution in R2 with variance-covariance matrix
    Σ(t) = ( t          √3 t²/2 )
           ( √3 t²/2    t³      ).
Recall that we have already studied the multivariate standardized score pro-
cess in Section 9.5. Here, we suppose that B1, B2, B3 and B4 hold, and that
β, β0 ∈ B . The standardized score process is a random function from [0, 1] to
Rp , written
    Un∗(β0, t) = ( Un∗(β0, t)1 , Un∗(β0, t)2 , . . . , Un∗(β0, t)p ),   0 ≤ t ≤ 1.
By integrating each component of this process with respect to time, we can also
define the area under the curve process of the multivariate standardized score
process.
Definition 11.8. The area under the curve process of the multivariate standard-
ized score process is a random function from [0, 1] to Rp such that
    Jn(β0, t) = ( ∫_0^t Un∗(β0, s)1 ds, ∫_0^t Un∗(β0, s)2 ds, . . . , ∫_0^t Un∗(β0, s)p ds ),   0 ≤ t ≤ 1.
Proposition 11.8. (Chauvel and O’Quigley 2014). Let t ∈ [0, 1]. Under the
null hypothesis H0 : β = β0 , that is, under the PH model with parameter β0 , the
vector
    ( Un∗(β0, t)1 , Un∗(β0, t)2 , √3 Jn(β0, t)1 , √3 Jn(β0, t)2 )ᵀ
converges in distribution to a centered normal vector in R⁴.
We can extend this multivariate result to the distance from the origin, area
under the curve, and restricted, or unrestricted, adaptive tests, in order to test
the null hypothesis H0 : β(t) = β0 against H1 : β(t) = β0 . The first corollary
extends the test to the distance from the origin.
Corollary 11.6. The distance from the origin statistic is given by:
The next corollary deals with the multivariate area under the curve.
Corollary 11.7. The area under the curve statistic is given by:
Lastly, we also extend the unrestricted adaptive test to the multivariate case.
Corollary 11.8. Denote Mn (β0 , t) the statistic for the multivariate unrestricted
adaptive test, where
and φ(· ; 0, Σ̃1 ) is the density of the normal distribution N4 (0, Σ̃1 ). The p-value
of the test is:
    1 − ∫_{R⁴} 1_{ p²+q² ≤ Mn(β0,t), r²+s² ≤ Mn(β0,t) } φ(p, q, r, s; 0, Σ̃1) dp dq dr ds.
In a planned experiment we will work out our sample size and other design
features in line with statistical requirements such as control of Type 1 error and
power. These, in turn, lean upon some postulated possibilities, both under the null
and under the alternative. As we have seen, if the alternative differs significantly
from that postulated we can lose a lot of power. We may be working with a
promising treatment but our inability to see just how the patient population
will respond to the treatment can easily result in a failed study. We would like
to minimize the risk of such failures. This is a goal that the results of this
chapter show to be achievable if we have information on time dependencies under
the alternative hypothesis. Unfortunately our knowledge here may be imprecise.
Nonetheless, even if we fall a little short of some optimal goal, we can still reduce
the risk of trial failure by choosing our test to accommodate effectively plausible
outcomes. We look at some broad considerations in this section.
Figure 11.6: Test based on PH assumption near optimal for left hand figure, and
near optimal after recoding via a cutpoint model for right hand figure.
Figure 11.7: Regression effect process transformed to a bridge process for the
Freireich data via Wn∗ (t) = Un∗ (β̂, t) − tUn∗ (β̂, 1). The process lies well within the
limits of the supremum of the limiting process, the Brownian bridge.
a more powerful test than either of these two. We consider this below. Before
looking at some particular cases note that the concepts convex and concave,
in this context, while inspired by geometrical concepts can differ slightly. Small
departures from the assumption will imply small departures from test behavior
so, if a curve is mostly concave, then, by and large, it will still be advantageous
to make use of the area under the curve test.
where Wi (t) = I(t > τi ). Each patient has their own specific time point τi at
which the effect begins to take a hold. At time t, the percentage of “trigger
points” τi , less than t, will be an increasing function of t, say G(t) and, for the
effect β(t), governing the whole study group:
Figure 11.8: When the straight line connecting the origin to the arrival point of
the process lies wholly below the regression effect process, we refer to this as a
concave effect.
Figure 11.9: An idealized model for immunotherapy effects, absent initially and
then manifesting themselves at and beyond some approximate time point.
the relative balance between groups stabilizes through time. In the long run the
groups will look more similar than they did initially.
Presence of non-responders
Consider a clinical trial for which under the usual arguments, such as those laid
out above, we could take a proportional hazards assumption to be a reasonable
one. In this case the distance from the origin (log-rank) test would be close to
optimal. This assumes however that we are dealing with homogeneous groups; a
treated group and a, for the sake of argument, placebo group. Suppose though
that among those treated there exists a subset of patients who will not be sus-
ceptible to the treatment’s action. For these patients they behave similarly to
patients from the placebo group. The treatment group is not then homogeneous
and is a mixture of both types of patients. Under the alternative hypothesis, the
prognostic impact of belonging to the treatment group will grow in strength.
The reason for this is that the composition of the treatment group will change
through time, losing more of the non responding patients within this group with
respect to the responders.
The relative balance within the treatment group of responders to non-
responders increases with time. The regression effect process will also appear
to increase with time. The regression effect process will look similar to that we
obtain with delayed treatment effect, an example being immunotherapy trials.
Knowing, even if only roughly, how many non-responders we may have at the
outset, would allow us to construct a close to optimal test. Without such knowl-
edge we can still improve on the log-rank test by noting that we will be observing
effects of a convex nature. The area over the curve test would be a good choice
in such cases, in particular when the percentage of non-responders is likely to be
quite large.
Figure 11.10: Curie Institute breast cancer study. Prognostic factors grade and
tumor size both show evidence of waning effects. Grade lies just within the 95%
PH limits. Size clearly violates the PH assumption.
We now know that, under proportional hazards alternatives, the distance from
the origin test—equivalently the log-rank test—is not only unbiased and con-
sistent but is the uniformly most powerful unbiased test (UMPU). What hap-
pens though when these optimality conditions are not met, specifically when
the non-proportionality shown by the regression effect takes a particular form?
We might first consider different codings for Z(t) in the model: λ(t|Z(t)) =
λ0 (t) exp{β(t)ψ[Z(t)]}, and in view of Lemma 11.3, if ψ[Z(t)] = C0 β(t)Z(t)
for any strictly positive C0 , then this results in the most powerful test of
H0 : β(t) ≡ 0. We can make use of results from Lagakos et al. (1984) in order
to obtain a direct assessment of efficiency that translates as the number of extra
failures that we would need to observe in order to nullify the handicap resulting
from not using the most powerful local test. Proposition 11.9 provides the results
we need and its validity only requires the application of Lemma 11.3.
Exploiting the formal equivalence between time-dependent effects and time
dependency of the covariable, we can make use of a result of O’Quigley and
Prentice (1991) which extends a basic result of Lagakos et al. (1984). This
allows us to assess the impact of using the coding ψ[Z(t)] in the model in place
of the optimal coding ψ ∗ [Z(t)] = C0 β(t)Z(t). See also Schoenfeld (1981) and
Lagakos (1988).
The integrals are over the interval (0,1) and the expectations are taken with
respect to the distribution of Ft again on the interval (0,1). We can use this
expression to evaluate how much we lose by using the log-rank statistic when,
say, regression effects do not kick in for the first 25% of the study (on the φn (t)
timescale). It can also inform us on how well we do when using, say, the area
over the curve test as opposed to the optimal (unknown in practice) test. In the
absence of censoring this expression simplifies to that of the correlation between
the different processes. In some cases, an example being the log-rank process,
and the AUC process, the known properties of the limiting processes are readily
available and provide an accurate guide.
11.7 Supremum tests over cutpoints
The above tests will be optimal, i.e., consistent and uniformly most powerful unbi-
ased tests, in given circumstances. Moving away from particular circumstances,
we will continue to obtain tests with near optimal properties. This requires us
to carefully consider the type of behavior that is likely to be exhibited. When
the alternative mirrors the form of β(t), we have an optimal test. Typically we
look for something broader since β(t) is not known. The three classes consid-
ered above—linear, concave, and convex—are precise enough to obtain powerful
tests, albeit falling short of the optimal test for specific situations. We might
then conclude that we have an array of tools from which we can find tests with
very good performance in a wide array of cases. This is true. Yet, there are still
other ways of tackling the problem.
For example, we may not wish to favor any class of alternatives; linear, con-
cave or convex. The alternative is then very broad. An appropriate class of tests
for this situation, a class that we would anticipate to have good power proper-
ties, might be based on cutpoints. The several examples of Chapter 10 suggest
that change-point models can closely approximate the regression effect process.
In most cases a single cutpoint will be good enough and provide a significantly
improved fit to that attained by a model with a constant regression effect. Extend-
ing to more than a single cutpoint is, in principle, quite straightforward, raising no
additional methodological considerations, although more cumbersome computa-
tionally. Under the null, the cutpoint would not be defined. Under the alternative,
we would have as many cutpoints as we choose to model and for s − 1 cutpoints
we would obtain s regression effect processes, each one, under the null and con-
ditional upon its starting point, being uncorrelated with the others. A single
cutpoint for example leads to two regression effect processes; one from zero up
to the cutpoint, the other from the cutpoint to the point 1 on the transformed
time scale (Fig. 11.11).
Tests within such a structure fall under the heading studied by Davies (1977,
1987) whereby the nuisance parameter (cutpoint) is defined only under the alter-
native and not under the null. In the case of a single cutpoint, we write:
Tn2 (β(t), θ) = θ−1 {Un∗ (β(t), θ)}2 + (1 − θ)−1 {Un+ (β(t), θ)}2 (11.9)
where Un+ (β(t), θ) = Un∗ (β(t), 1) − Un∗ (β(t), θ). The ingredients we need are pro-
vided by the variance-covariance matrix of the unsquared quantities. Writing
these as:
    Var( Un∗(β(t), θ)/√θ ,  Un+(β(t), 1)/√(1 − θ) ,  DUn∗(β(t), θ)/√θ ,  DUn+(β(t), 1)/√(1 − θ) )ᵀ
        =  ( I        A(θ) )
           ( Aᵀ(θ)    B(θ) )
336 Chapter 11. Hypothesis tests based on regression effect process
where DU (θ) = ∂U/∂θ, we have most of what we need. Finally, suppose that
η(θ) are zero-mean normal random variables with variances given by λ1 and λ2
where λi , i = 1, 2, are the eigenvalues of B(θ) − AT (θ)A(θ). Then:
    P{ sup Tn²(β(t), θ) > u ; 0 < θ < 1 } ≤ P(χ²_2 > u) + (√u e^{−u/2}/π) ∫_0^1 E‖η(θ)‖ dθ
Clearly, the calculations are not that straightforward although, with modern soft-
ware such as R, they are readily programmed. In addition, Davies (1987) makes a
number of suggestions for simpler approximations. Some study of these sugges-
tions, in the context of survival data with changepoints, has been carried out by
O’Quigley and Pessione (1991), O’Quigley (1994), and O’Quigley and Natarajan
(2004) and indicate that the accuracy of the approximations is such that they
can be reliably used.
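The statistic itself is straightforward to compute; the sketch below evaluates Tn²(0, θ) of equation (11.9) over a grid of cutpoints, taking proc from the reg_effect_process sketch. A p-value for the supremum would then be bounded using the Davies-type results above.

```r
# Sketch: the two-piece statistic of equation (11.9) over a grid of candidate cutpoints.
# Only the statistic is computed; inference for the supremum needs the Davies bounds.
sup_cutpoint <- function(proc, grid = seq(0.1, 0.9, by = 0.05)) {
  U_at <- function(s) proc$U[max(which(proc$t <= s))]   # process value at time s
  U1   <- proc$U[length(proc$U)]
  Tn2  <- sapply(grid, function(theta) {
    Ua <- U_at(theta)                                   # first piece, up to theta
    Ub <- U1 - Ua                                       # second piece, theta to 1
    Ua^2 / theta + Ub^2 / (1 - theta)
  })
  c(sup = max(Tn2), theta.hat = grid[which.max(Tn2)])
}
sup_cutpoint(proc)
```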
A full study of tests based on taking the supremum over an interval of cut-
points has yet to be carried out. In terms of power and test optimality any
advantages will be necessarily small since we already know how to obtain near
optimal tests. These optimal, and near optimal, tests apply in a range of situa-
tions and include cutpoints. So, despite the extra work, there may not be much
to be gained. One useful advantage is that any such supremum test immediately
furnishes both interval and point estimates of the changepoint itself. This can
be valuable in moving forward from the rejection of a null hypothesis to the
construction of a more plausible hypothesis.
Figure 11.11: Two possible cutpoints for a single cutpoint model. Note the impact
of small changes in cutpoint on the slopes of the regression effect process.
11.8 Some simulated comparisons
A large simulation study was carried out by Chauvel (2014), some of which was
reproduced in Chauvel and O’Quigley (2014) and Chauvel and O’Quigley (2017).
The behavior of the tests described above for finite sample sizes was compared
with a number of other established tests. The advantages are readily highlighted.
λ            0.5     1       2
P(C ≤ T)     0.3     0.5     0.7

β(t)                              Log-rank   Jn(0,1)   Mnγ(0,1)
                                                        γ=0.5   γ=1     γ=1.5   γ=2
0                                  4.9        5.2       5.0     5.1     4.8     4.6
0.5                               50.9       40.9      42.7    46.3    49.5    50.3
0.8                               88.1       76.8      79.2    83.5    86.6    87.5
1{t≤0.3}                          21.8       38.5      35.4    32.0    25.3    21.2
1{t≤0.5}                          54.1       72.6      70.1    66.7    58.3    52.8
1{t≤0.7}                          83.4       89.0      88.3    85.5    82.3    80.9
−1.5·1{t≤0.5} + 1.5·1{t≥0.5}      16.8       85.3      82.7    83.4    84.4    80.0
1.5·1{t≤0.5} − 1.5·1{t≥0.5}       15.5       85.1      82.3    83.2    84.4    79.5
1.5(1 − t)                        84.7       91.6      91.0    88.8    85.2    83.0
2(1 − t)²                         73.8       90.6      89.2    86.0    78.7    72.8

Table 11.2: Empirical level of significance and power of each test (in %) based
on 3000 simulated datasets for each β(t). Each dataset has 100 elements (50
subjects per group). The rate of censoring is fixed at 30%.
As in the previous cases, these tests have acceptable power under all alternative hypotheses
considered. On this basis we find γ = 1.5 provides a good all-round compromise
and tends to keep to a minimum any loss in power with respect to the log-rank
test under proportional hazards-type alternative hypotheses.
11.9 Illustrations
We consider several publicly available data sets. Also included are data collected
by the Curie Institute for studying breast cancer and a number of data sets from
the University of Leeds, U.K. focused on the impact of markers on survival in
cancer. Note that the analysis of a multivariate problem often comes down to
considering several univariate regression effect processes. This is because the
relevant questions typically concern the impact of some variable, often treat-
ment, having taken into account several potentially confounding variables. Once
these confounding variables have been included in the model then the process
of interest to us is a univariate one where the expectation and variance have
been suitably adjusted via model estimation. This is also the case when using
stratification or, indeed, when considering the prognostic index which would be
a linear combination of several variables.
Despite the noise, the drifts of the processes are clearly visible, corresponding
to the alternative hypothesis of the presence of an effect—which appears to be
nonlinear. Its value decreases with time in such a way that the multivariate log-
rank test is not significant, with a p-value of 0.10. The multivariate distance
test has a p-value of 0.09. In contrast, the multivariate area under the curve
and multivariate restricted adaptive tests are highly significant, with respective
p-values of 0.005 and 0.008. These tests are less affected by the decrease in
effect, and for this reason they detect it. They thus correspond more closely to
our intuition when looking at the Kaplan-Meier curves which do seem to suggest
the presence of true differential effects.
In order to compare the survival of several groups, the log-rank test is commonly
used. This test is the most powerful for detecting a fixed effect over time. On
the other hand, several authors have shown its weakness in the presence of
an effect which varies. The purpose of Chauvel and O’Quigley (2014) was to
develop a test with good power under both types of alternative hypothesis. The
asymptotic properties of the regression effect process studied in Chapter 9 lead to
the construction of several new tests. The first of these is the distance from the
origin test, which is asymptotically equivalent to the log-rank. There are many
other possibilities including the area under the curve test. This test, slightly less
powerful than the log-rank test in the presence of a constant effect, allows for
significant power gains when there is an effect that changes with time.
Figure 11.12: Kaplan-Meier estimators and standardized score process for the
Curie Institute data.
11.10 Some further thoughts
The class of adaptive tests considered will assume either the value of the
distance traveled or the value of the area under the curve. The parameter allows
us to calibrate the proximity of the tests to the log-rank. However, fixing this
parameter may be tricky, so if there is no compelling reason to privilege the
log-rank test, Chauvel and O’Quigley (2014) proposed the use of an unrestricted
adaptive test. This corresponds to a statistic that is the maximum absolute value
of the distance from the origin and the area under the curve.
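As an indication of how little computation is involved, here is a minimal Python sketch, not code from the text: given the standardized score process on a grid of transformed times, it forms the distance-from-origin and area-under-the-curve statistics, takes the larger of the two as the unrestricted adaptive statistic, and approximates its null distribution by simulating the corresponding functionals of Brownian motion. The names and the simple Riemann-sum integration are illustrative choices.

```python
import numpy as np

def unrestricted_adaptive_test(U, grid, nsim=100_000, seed=2):
    """Unrestricted adaptive statistic: the larger of the standardized
    distance-from-origin statistic |U*(1)| and the standardized area
    statistic sqrt(3) * |integral of U*|; the null distribution is
    approximated via the same functionals of simulated Brownian motion."""
    U = np.asarray(U, dtype=float)
    dt = np.diff(np.concatenate(([0.0], grid)))
    dist = abs(U[-1])                              # ~ N(0,1) under the null
    area = np.sqrt(3.0) * abs(np.sum(U * dt))      # Var{int_0^1 W(s) ds} = 1/3
    stat = max(dist, area)
    rng = np.random.default_rng(seed)
    W = np.cumsum(rng.normal(size=(nsim, grid.size)) * np.sqrt(dt), axis=1)
    null = np.maximum(np.abs(W[:, -1]), np.sqrt(3.0) * np.abs(W @ dt))
    return stat, float((null >= stat).mean())
```

In practice the only input needed is the standardized score process extracted from the fitted model together with its grid of transformed time points.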
Simulations show that these adaptive tests perform well compared to other
tests in the literature. In addition to the tests’ control of the level and their good
power properties, they also have the advantage of being easily interpretable and
simple to plot. In applications, at the same time as we look at the p-values,
parameter estimates, and confidence intervals, it would certainly be helpful to
consider plots of the processes in order to get a visual impression of how close or
far we appear to be from the conditions under which we know we would achieve
something close to an optimal test. This tells us something not only about the
potential lack of plausibility of the null hypothesis but also something about the
nature of a more plausible alternative hypothesis.
By calculating the standardized score process Un∗ at β0 , the results can imme-
diately be extended to the null hypothesis H0 : β(t) = β0 , ∀t, against the alter-
native H1 : ∃ t0 , β(t0 ) ≠ β0 . If β0 is non-zero, we are no longer testing for the
presence of a difference in survival between groups, but rather whether the relative
risk between groups is equal to β0 or not. Although we have mostly considered
testing a null against an alternative of a constant value for the regression parame-
ter, it is however possible to consider testing the null hypothesis H0 : β(t) = β0 (t)
against H1 : β(t) ≠ β0 (t), where β0 (t) changes with time.
Figure 11.13: Kaplan-Meier estimators and standardized score processes for clin-
ical trial data with 3 treatment groups.
Other combinations of the distance from the origin and area under the curve
tests could also be studied, rather than the linear combination looked at here:
5. On the basis of different data sets, calculate the distance from the origin
test, the log-rank test, and the arcsine test. On the basis of these results
what does your intuition tell you?
8. One of the conditions for consistency relates to the rate of censoring; specif-
ically we need to have kn /n → C as n → ∞ where 0 < C < 1. Suppose,
for some true situation, that kn /n → 0 as n → ∞. Indicate why all of the
tests, including the log-rank test, would be inconsistent in such a situation.
11. Repeat the previous question, changing the word concave to convex. What
would the impact of this be? Can you generalize the arguments to embrace
the more general setting in which one of three situations may prevail:
proportional hazards, concave non-proportional hazards, and convex non-
proportional hazards?
13. Consider a two-group clinical trial in which, beyond the median of the com-
bined groups there is no further effect. Use the expression for efficiency to
calculate how many more patients would be needed to maintain the power
of the most powerful test when the design is based on the log-rank test.
Hint—use the convergence in distribution result between the log-rank and
the distance traveled tests.
14. Guidance on the choice of tests has so far made little reference to censor-
ing beyond the fact that it is assumed either independent or conditionally
independent of the failure mechanism. The impact of a conditionally inde-
pendent censoring mechanism is potentially greater on the distance from
origin test than it is on the log-rank test. Explain why this would be so.
11.12 Outline of proofs

Propositions 11.1 and 11.5. First, recall the result of Theorem 9.2:

$$U_n^*(\beta_0, t) - \sqrt{k_n}\,A_n(t) \;\xrightarrow[n\to\infty]{\mathcal{L}}\; C_1(\beta,\beta_0)\,\mathcal{W}(t), \qquad 0 \le t \le 1,$$

where

$$A_n(t) \;\xrightarrow[n\to\infty]{P}\; C_2\int_0^t \{\beta(s) - \beta_0\}\,ds, \qquad 0 \le t \le 1.$$

Let $t \in [0,1]$. The functions $U_n^*(\beta_0,\cdot)$ and $A_n$ are almost surely bounded so that
we can write $J_n(\beta_0,t) - \sqrt{k_n}\int_0^t A_n(s)\,ds$ as:

$$\int_0^t U_n^*(\beta_0,s)\,ds - \sqrt{k_n}\int_0^t A_n(s)\,ds \;\xrightarrow[n\to\infty]{\mathcal{L}}\; C_1(\beta,\beta_0)\int_0^t \mathcal{W}(s)\,ds,$$

and

$$\frac{1}{\sqrt{k_n}}\,J_n(\beta_0,t) \;\xrightarrow[n\to\infty]{P}\; C_2\int_0^t\!\!\int_0^s \{\beta(u) - \beta_0\}\,du\,ds.$$

Recall that $\lim_{n\to\infty}\sqrt{k_n} = \infty$ and that, for any $s \in [0,1]$, $\int_0^s \{\beta(u) - \beta_0\}\,du \neq 0$,
which together imply:

$$\lim_{n\to\infty} P\left\{\frac{1}{\sqrt{t}}\,|U_n^*(\beta_0,t)| \ge z_{\alpha/2}\right\} = 1, \qquad \lim_{n\to\infty} P\left\{\sqrt{3}\,t^{-3/2}\,|J_n(\beta_0,t)| \ge z_{\alpha/2}\right\} = 1.$$
Table 11.3: Empirical level of significance and power of each test (in %) based on
3000 simulated datasets for each β(t) and each rate of censoring. Each dataset
has 100 elements (50 subjects per group). The tests have nominal levels of 5%.
Table 11.4: Empirical level of significance and power of each test (in %) based on
3000 simulated datasets for each β(t) and each rate of censoring. Each dataset
has 60 elements (30 subjects per group). The tests have nominal levels of 5%.
Table 11.5: Empirical level of significance and power of each test (in %) based on
3000 simulated datasets for each β(t) and each rate of censoring. Each dataset
has 200 elements (100 subjects per group). The tests have nominal levels of 5%.
Table 11.6: Empirical levels of significance and test powers (in %) based on 3000
datasets for each β(t) and covariate distribution. The rate of censoring is set at
30%. Each dataset is of size 100.
Appendix A
Probability
We recall some of the fundamental tools used to establish the inferential basis
for our models. The main ideas of stochastic processes, in particular Brownian
motion and functions of Brownian motion, are explained in terms that are not
overly technical. The background to this, i.e., distribution theory and large sam-
ple results, is recalled. Rank invariance is an important concept, i.e., the ability
to transform some variable, usually time, via monotonic increasing transforma-
tions without having an impact on inference. These ideas hinge on the theory
of order statistics and the basic notions of this theory are recalled. An outline
of the theory of counting processes and martingales is presented, once again
without leaning too heavily upon technical measure-theoretic constructions. The
important concepts of explained variation and explained randomness are outlined
in elementary terms, i.e., only with reference to random variables, and at least
initially, making no explicit appeal to any particular model. This is important
since the concepts are hardly any less fundamental than a concept such as vari-
ance itself. They ought therefore stand alone, and not require derivation as a
particular feature of some model. In practice, of course, we may need to estimate
conditional distributions and making an appeal to a model at this point is quite
natural.
The reader is assumed to have some elementary knowledge of set theory and
calculus. We do not recall here any of the basic notions concerning limits, con-
tinuity, differentiability, convergence of infinite series, Taylor series, and so on
and the rusty reader may want to refer to any of the many standard calculus
texts when necessary. One central result which is frequently called upon is the
mean value theorem. This can be deduced as an immediate consequence of the
following result known as Rolle’s theorem:
Theorem A.1. If f (x) is continuously differentiable at all interior points of the
interval [a, b] and f (a) = f (b), then there exists a real number ξ ∈ (a, b) such
that f ′(ξ) = 0.
A simple sketch would back up our intuition that the theorem would be
correct. Simple though the result appears to be, it has many powerful implications
including:
Theorem A.2. If f (x) is continuously differentiable on the interval [a, b], then
there exists a real number ξ ∈ (a, b) such that

$$f(b) - f(a) = (b - a)\,f'(\xi).$$

When f ′(x) is monotone then ξ is unique. This elementary theorem can form
the basis for approximation theory and series expansions such as the Edgeworth
and Cornish-Fisher (see Section A.10). For example, a further immediate corollary
to the above theorem obtains by expanding in turn f ′(ξ) about f ′(a) whereby:
Corollary A.1. If f (x) is at least twice differentiable on the interval [a, b] then
there exists a real number ξ ∈ (a, b) such that

$$f(b) = f(a) + (b - a)\,f'(a) + \frac{(b - a)^2}{2}\,f''(\xi).$$
The ξ of the theorems and corollary would not typically be the same and we
can clearly continue the process, resulting in an expansion of m + 1 terms, the
last term being the m th derivative of f (x), evaluated at some point ξ ∈ (a, b) and
multiplied by (b − a)m /m!. An understanding of Riemann integrals as limits of
sums, definite and indefinite integrals, is mostly all that is required to follow the
text. It is enough to know that we can often interchange the limiting processes
of integration and differentiation. The precise conditions for this to be valid are
not emphasized. Indeed, we almost entirely avoid the tools of real analysis. The
Lebesgue theory of measure and integration is on occasion referred to, but a lack
of knowledge of this will not hinder the reader. Likewise we will not dig deeply
into the measure-theoretic aspects of the Riemann-Stieltjes integral apart from
the following extremely useful construction:
Definition A.1. The Riemann integral of the function f (x) with respect to x, on
the interval [a, b], is the limit of the sum $\sum_i \Delta_i f(x_{i-1})$, where $\Delta_i = x_i - x_{i-1} > 0$,
for an increasing partition of [a, b] in which max Δi goes to zero.

The limit is written $\int_a^b f(x)\,dx$ and can be seen to be the area under the
curve f (x) between a and b. If b = ∞ then we understand the integral to exist
if the limit exists for any b > 0, the result itself converging to a limit as b → ∞.
This is the Helly-Bray theorem. The theorem will also hold (see the Exercises)
when h(x) is unbounded provided that some broad conditions are met. A deep
study of Fn (x) as an estimator of F (x) is then all that is needed to obtain insight
into the sample behavior of the empirical mean, the empirical variance and many
other quantities. Of particular importance for the applications of interest to us
here, and developed, albeit very briefly, in Appendix B.3, is the fact that, letting
M (x) = Fn (x) − F (x), then

$$E\left\{\int h(x)\,dM(x)\right\} = \int h(x)\,dF(x) - \int h(x)\,dF(x) = 0. \qquad (A.1)$$
The possible outcomes of any experiment are called events where any event
represents some subset of the sample space. The sample space is the collection
of all events, in particular the set of elementary events. A random variable X is
a function from the set of outcomes to the real line. A probability measure is a
function from subsets of the real line to the interval [0,1]. Kolmogorov (2018)
provides axioms which enable us to identify any measure as being a probability
measure. These axioms appear very reasonable and almost self-evident, apart
from the last, which concerns assigning probability measure to infinite collections
of events. There are, in a well defined sense, many more members in the set of
all subsets of any infinite set than in the original set itself, an example being
the set of all subsets of the positive integers which has as many members as the
real line. This fact would have hampered the development of probability without
the inclusion of Kolmogorov’s third axiom which, broadly speaking, says that the random
variable is measurable, or, in other words, that the sample space upon which the
probability function is defined is restricted in such a way that the probability we
associate with the sum of an infinite collection of mutually exclusive events is
the same as the sum of the probabilities associated with each composing event.
We call such a space a measurable space or a Borel space, the core idea being
that the property of additivity for infinite sums of probabilities, as axiomatized by
Kolmogorov, holds. The allowable operations on this space are referred to as a
sigma-algebra. Subsets of a sigma-algebra—the most common case being under
some kind of conditioning—are referred to as sub sigma-algebras and inherit the
axiomatic properties defined by Kolmogorov.
A great deal of modern probability theory is based on measure-theoretic ques-
tions, questions that essentially arise from the applicability or otherwise of Kol-
mogorov’s third axiom in any given context. This is an area that is highly techni-
cal and relatively inaccessible to non-mathematicians, or even to mathematicians
lacking a firm grounding in real analysis. The influence of measure theory has
been strongly felt in the area of survival analysis over the last 20 or so years
and much modern work is now of a very technical nature. Even so, none of the
main statistical ideas, or any of the needed demonstrations in this text, require
such knowledge. We can therefore largely avoid measure-theoretic arguments,
although some of the key ideas that underpin important concepts in stochas-
tic processes are touched upon whenever necessary. The reader is expected to
understand the meaning of the term random variable on some level.
Observations or outcomes as random variables and, via models, the proba-
bilities we will associate with them are all part of a theoretical, and therefore
artificial, construction. The hope is that these probabilities will throw light on
real applied problems and it is useful to keep in mind that, in given contexts,
there may be more than one way to set things up. Conditional expectation is a
recurring central topic but can arise in ways that we did not originally anticipate.
We may naturally think of the conditional expected survival time given that a
subject begins the study under, say, some treatment. It may be less natural to
think of the conditional expectation of the random variable we use as a treatment
indicator given some value of time after the beginning of treatment. Yet, this
latter conditional expectation, as we shall see, turns out to be the more relevant
for many situations.
A.4 Convergence for random variables

Simple geometrical constructions (intervals, balls) are all that are necessary to
formalize the concept of convergence of a sequence in real and complex analy-
sis. For random variables there are a number of different kinds of convergence,
depending upon which aspect of the random variable we are looking at. Consider
any real value Z and the sequence Un = Z/n. We can easily show that Un → 0
as n → ∞. Now let Un be defined as before except for values of n that are
prime. Whenever n is a prime number then Un = 1. Even though, as n becomes
large, Un is almost always arbitrarily close to zero, a simple definition of conver-
gence would not be adequate and we need to consider more carefully the sizes
of the relevant sets in order to accurately describe this. Now, suppose that Z
is a uniform random variable on the interval (0,1). We can readily calculate the
probability that the distance between Un and 0 is greater than any arbitrarily
small positive number and this number goes to zero with n. We have conver-
gence in probability. Nonetheless there is something slightly erratic about such
convergence, large deviations occurring each time that n is prime. When possi-
ble, we usually prefer a stronger type of convergence. If, for all integer values m
greater than n and as n becomes large, we can assert that the probability of the
distance between Um and 0 being greater than some arbitrarily small positive
number goes to zero, then such a mode of convergence is called strong conver-
gence. This stronger convergence is also called convergence with probability one
or almost sure convergence. Consider also (n + 3)Un . This random variable will
converge almost surely to the random variable Z. But, also, we can say that the
distribution of loge (n + 3)Un , at all points of continuity z, becomes arbitrarily
close to that of a standard exponential distribution. This is called convergence
in distribution. The three modes of convergence are related by:
Theorem A.4. Convergence with probability one implies convergence in proba-
bility. Convergence in probability implies convergence in distribution.
Note also that, for a sequence that converges in probability, there exists a
subsequence that converges with probability one. This latter result requires the
tools of measure theory and is not of wide practical applicability since we may
not have any obvious way of identifying such a subsequence. Added conditions
can enable the direction of the “implies” arrow to be inverted. For example
convergence in distribution implies convergence in probability when the limiting
Here, we describe some further tools that are helpful in determining large sample
behavior. Such behavior, in particular almost sure convergence and the law of
the iterated logarithm allow us to anticipate what we can expect in moderate to
large sample sizes. Small sample behavior is considered separately.
to transformed time or time on the original scale. Moving between the scales is
described precisely.
The interval [0,1] arises in a very natural way when dealing with distribu-
tion functions and it also appears natural to work with uniform distance and to
consider uniform convergence as the basic concept behind our metric definition.
With this in mind we have:
Definition A.3. The uniform distance d between two elements of (C[0, 1], R) is
defined by:
$$d(f, g) = \sup_{0 \le t \le 1} |f(t) - g(t)|\,, \qquad f, g \in (C[0,1], \mathbb{R}). \qquad (A.2)$$
In order to get around the problem of functions with jumps at the disconti-
nuities, Skorokhod developed a more suitable metric providing the basis for the
Skorokhod topology. Specifically, we have the definition:
Definition A.4. The Skorokhod distance δ is defined by:

$$\delta(f, g) = \inf_{\lambda \in \Lambda}\,\{\,d(f, g \circ \lambda) \vee d(\lambda, I)\,\}\,, \qquad f, g \in (D[0,1], \mathbb{R}), \qquad (A.3)$$

where I is the identity transform on [0, 1], Λ indicates the class of structure-
preserving homeomorphisms of [0, 1] into itself such that λ(0) = 0 and λ(1) = 1
for all λ ∈ Λ, and a ∨ b = max(a, b).
The idea is to not allow any jumps to dominate and is more fully explored
and explained in Billingsley (1999). Note also that
δ(f, g) ≤ d(f, g), f, g ∈ (D[0, 1], R). (A.4)
In other words, in the space (D[0, 1], R), convergence with respect to a topol-
ogy of uniform convergence implies convergence with respect to the topology of
Skorokhod. At the same time, if the limit is a continuous function then the two
forms of convergence are equivalent.
Proposition A.2. (Billingsley, 1999). For all f ∈ (C[0, 1], R) and all sequences
(gn )n∈N in (D[0, 1], R),

$$\delta(g_n, f) \to 0 \iff d(g_n, f) \to 0, \qquad n \to \infty.$$
Some simple properties and the fact that cadlag functions arise naturally
when considering empirical distribution functions, enable us to find results easily.
The uniform limit of a sequence of cadlag functions is cadlag and the proposition
shows that it is equivalent to show convergence of a sequence of cadlag functions
to a continuous function with a uniform convergence topology or the Skorokhod
topology.
A.6 Distributions and densities

We anticipate that most readers will have some familiarity with the basic ideas
of a distribution function F (t) = Pr (T < t), a density function f (t) = dF (t)/dt,
expectation and conditional expectation, the moments of a random variable, and
other basic tools. Nonetheless we will go over these elementary notions in the
context of survival in the next chapter. We write
$$E\,\psi(T) = \int \psi(t)\,f(t)\,dt = \int \psi(t)\,dF(t)$$
for the expected value of the function ψ(T ). Such an expression leaves much
unsaid, that ψ(t) is a function of t and therefore ψ(T ) itself random, that the
integrals exist, the domain of definition of the function being left implicit, and
that the density f (t) is an antiderivative of the cumulative distribution F (t) (in
fact, a slightly weaker mathematical construct, absolute continuity, is enough but
we do not feel the stronger assumption has any significant cost attached to it).
There is a wealth of solid references for the rusty reader on these topics, among
which Billingsley (1999), Rao et al. (1973) and Serfling (2009) are particularly
outstanding. It is very common to wish to consider some transformation of a
random variable, the simplest situation being that of a change in origin or scale.
The distribution of sums of random variables arises by extension to the bivariate
and multivariate cases.
Theorem A.5. Suppose that the distribution of X is F (x) and that F ′(x) =
f (x). Suppose that y = φ(x) is a monotonic function of x and that φ−1 (y) = x.
Then, if the distribution of Y is G(y) and G′(y) = g(y),

$$G(y) = F\{\phi^{-1}(y)\}\,; \qquad g(y) = f\{\phi^{-1}(y)\}\left[\frac{d\phi(x)}{dx}\right]^{-1}_{x=\phi^{-1}(y)}. \qquad (A.5)$$
Theorem A.6. Let X and Y have joint density f (x, y). Then the density g(w)
of W = X + Y is given by

$$g(w) = \int_{-\infty}^{\infty} f(x, w - x)\,dx = \int_{-\infty}^{\infty} f(w - y, y)\,dy. \qquad (A.6)$$
Normal distribution

A random variable X is taken to be a normal variate with parameters μ and
σ when we write X ∼ Φ(μ, σ 2 ). The parameters μ and σ 2 are the mean and
variance, respectively, so that σ −1 (X − μ) ∼ Φ(0, 1). The distribution Φ(0, 1) is
called the standard normal. The density of the standard normal variate, that is,
having mean zero and variance one, is typically denoted φ(x) and the cumulative
distribution Φ(x). The density f (x), for x ∈ (−∞, ∞), is given by

$$f(x) = \frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right\}.$$
us in survival analysis we can use this in one of two ways: firstly, to transform
the response variable time in order to eliminate the impact of its distribution,
and secondly, in the context of regression problems, to transform the distribution
of regressors as a way to obtain greater robustness by reducing the impact of
outliers.
where $\Phi_\theta^{(2)}$ indicates the standardized bivariate normal distribution with correla-
tion coefficient θ, Φ(·) the univariate normal distributions and ux , uy , indepen-
dent uniform variates. The probability integral transform, described just above,
tells us that, for continuous F (x) and G(y), we have a one-to-one correspon-
dence with the uniform distribution. We can break down the steps by starting
with X and Y , transforming these to the uniform interval via ux = F (x) and
uy = G(y), subsequently transforming once more to normal marginals via Φ−1 (·),
and finally, via the bivariate normal distribution with parameter θ, creating the
bivariate model with normal (0,1) marginals and association parameter θ. We
can then use F −1 and G−1 to return to our original scale. This is referred to as
the normal copula and the creation of a bivariate model with a particular asso-
ciation parameter, θ, is, in this case, quite transparent. Also, the dependency
parameter, θ, has a concrete interpretation as a measure of explained variation
on the transformed scales. Clearly, this set-up is readily generalized by replacing
the first and second occurrences of Φ in Equation A.7 by any other distribution
functions, not necessarily the same, and by replacing $\Phi_\theta^{(2)}$ by a different joint
distribution with a different dependency structure.
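As a concrete illustration of these steps, here is a minimal Python sketch, not code from the text: it draws from a normal copula with association parameter θ and maps the resulting uniforms to two arbitrary survival-time marginals through their quantile functions. The particular marginals and parameter values chosen are purely illustrative.

```python
import numpy as np
from scipy import stats

def normal_copula_sample(n, theta, F_inv, G_inv, seed=0):
    """Draw (X, Y) with marginals F and G linked through the normal copula
    with correlation parameter theta: simulate a standardized bivariate
    normal pair, map each coordinate to a uniform via Phi, then map the
    uniforms back to the requested marginals via the quantile functions."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, theta], [theta, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    u = stats.norm.cdf(z)                       # uniform(0,1) marginals
    return F_inv(u[:, 0]), G_inv(u[:, 1])

# Example: exponential and Weibull survival-time marginals, theta = 0.6.
x, y = normal_copula_sample(
    10_000, 0.6,
    F_inv=lambda u: stats.expon.ppf(u, scale=2.0),
    G_inv=lambda u: stats.weibull_min.ppf(u, c=1.5),
)
```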
There are very many more, in fact an infinite number of alternative ways to
create bivariate models with some association structure from given marginals.
Many of these have been used in the survival context, in particular when dealing
with the problem of competing risks, or when considering surrogate endpoints.
These are mentioned in the main text. Building and exploiting the structure of
copula models is a field in its own right discussed in many texts and articles. A
thorough and clear discussion is given in Trivedi and Zimmer (2007).
A.8 Expectation
The moments of a random variable X with distribution F (x) are given by
$E(X^r) = \int x^r\,dF(x) = \int x^r f(x)\,dx$, r = 1, 2, . . . ,
where the integrals, viewed as limiting processes, are all assumed to converge.
The normal distribution function for a random variable X is completely specified
by E(X) and E(X 2 ). In more general situations we can assume a unique corre-
spondence between the moments of X, E(X r ) , r = 1, 2, . . . , and the distribution
functions as long as these moments all exist. While it is true that the distribution
function determines the moments the converse is not always true. However, it is
almost always true (Kendall et al., 1987) and, for all the distributions of interest
to us here, the assumption can be made without risk. It can then be helpful to
view each moment, beginning with E(X), as providing information about F (x).
This information typically diminishes quickly with increasing r. We can use this
idea to improve inference for small samples when large sample approximations
may not be sufficiently accurate. Moments can be obtained from the moment
generating function, M (t) = E{exp(tX)} since we have:
Lemma A.2. If $\int \exp(tx)\,f(x)\,dx < \infty$ then

$$E(X^r) = \left.\frac{\partial^r M(t)}{\partial t^r}\right|_{t=0}, \qquad \text{for all } r.$$
It is usually sufficient to take convexity to mean that w′ (x) and w′′ (x) are
greater than or equal to zero at all interior points of I since this is a consequence
of the definition. We have (Jensen’s inequality): E{w(X)} ≥ w{E(X)}.
For the variance function we see that w(x) = x2 is a convex function and so
the variance is always non-negative. The further away from the mean, on average, the
observations are to be found, then the greater the variance. We return to this
in Chapter 10. Although very useful, the moment-generating function, M (t) =
E{exp(tX)} has a theoretical weakness in that the integrals may not always
converge. It is for this, mainly theoretical, reason that it is common to study
instead the characteristic function, which has an almost identical definition, the
only difference being the introduction of complex numbers into the setting. The
characteristic function, denoted by φ(t), always exists and is defined as:

$$\phi(t) = M(it) = \int_{-\infty}^{\infty} \exp(itx)\,dF(x)\,, \qquad i^2 = -1.$$
Note that the contour integral in the complex plane is restricted to the whole real
axis. Analogous to the above lemma concerning the moment-generating function
we have:

$$E(X^r) = (-i)^r \left.\frac{\partial^r \phi(t)}{\partial t^r}\right|_{t=0}, \qquad \text{for all } r.$$
This is important in that it allows us to anticipate the cumulant generating
function which turns out to be of particular importance in obtaining improved
approximations to those provided by assuming normality. We return to this below
in Section A.10. If we expand the exponential function then we can write:

$$\phi(t) = \int_{-\infty}^{\infty} \exp(itx)\,dF(x) = \exp\left\{\sum_{r=1}^{\infty} \kappa_r\,(it)^r/r!\right\},$$

identifying κr as the coefficient of (it)r /r! in the expansion of log φ(t). The
function ψ(t) = log φ(t) is called the cumulant generating function. When this
function can be found then the density f (x) can be defined in terms of it. We
have the important relation
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\phi(t)\,dt\,, \qquad \phi(t) = \int_{-\infty}^{\infty} e^{itx}\,f(x)\,dx\,.$$
This important result has two immediate and well-known corollaries dealing
with the maximum and minimum of a sample of size n.
Corollary A.4.
in which p(x) = P ′(x). This rather involved expression leads to many useful
results including the following corollaries:
Corollary A.6. The joint distribution of X(r) and X(s) is

$$F_{rs}(x, y) = \sum_{j=s}^{n}\sum_{i=r}^{j} \frac{n!}{i!\,(j-i)!\,(n-j)!}\,P^{i}(x)\,[P(y) - P(x)]^{j-i}\,[1 - P(y)]^{n-j}.$$
same kind of calculations from scratch or, making use once more of the probability
integral transform (see Section A.3), use the above result for the uniform and
transform into arbitrary F . Even this is not that straightforward since, for some
fixed interval (w1 , w2 ), corresponding to w = w2 − w1 from the uniform, the
corresponding F −1 (w2 ) − F −1 (w1 ) depends not only on w2 − w1 but on w1
itself. Again we can appeal to the law of total probability, integrating over all
values of w1 from 0 to 1 − w. In practice, it may be good enough to divide the
interval (0, 1 − w) into a number of equally spaced points, ten would suffice, and
simply take the average. Interval estimates for any given quantile, defined by
P (ξα ) = α, follow from the basic result and we have:
Corollary A.9. In the continuous case, for r < s, the pair (X(r) , X(s) ) covers ξα
with probability given by Iπ (r, n − r + 1) − Iπ (s, n − s + 1).
$$Y_i = Z_{(i)} - Z_{(i-1)}\,, \quad i = 1, \ldots, n\,, \qquad\qquad Z_{(r)} = \sum_{i=1}^{r}\{Z_{(i)} - Z_{(i-1)}\} = \sum_{i=1}^{r} Y_i\,,$$

Lemma A.5. For a sample of size n from the standard exponential distribution
and letting αi = 1/(n − i + 1), we have:

$$E[Z_{(r)}] = \sum_{i=1}^{r} E(Y_i) = \sum_{i=1}^{r} \alpha_i\,, \qquad \mathrm{Var}\,[Z_{(r)}] = \sum_{i=1}^{r} \mathrm{Var}\,(Y_i) = \sum_{i=1}^{r} \alpha_i^2\,.$$
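A quick simulation makes the lemma concrete. The following Python sketch, illustrative only, sorts standard exponential samples and compares the empirical mean and variance of the rth order statistic with the two sums above.

```python
import numpy as np

# Check of Lemma A.5 by simulation: for a standard exponential sample of
# size n, E[Z_(r)] = sum_{i<=r} 1/(n-i+1) and Var[Z_(r)] = sum of the squares.
rng = np.random.default_rng(0)
n, r, nsim = 10, 4, 200_000
z_r = np.sort(rng.exponential(size=(nsim, n)), axis=1)[:, r - 1]
alpha = 1.0 / (n - np.arange(1, r + 1) + 1)
print(z_r.mean(), alpha.sum())         # both close to 0.479
print(z_r.var(), (alpha ** 2).sum())   # both close to 0.058
```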
The general flavor of the above result applies more generally than just to the
exponential and, applying the probability integral transform (Section A.6), we
have:
Lemma A.6. For an i.i.d. sample of size n from an arbitrary distribution, G(x),
the rth order statistic, X(r) , can be written:

$$X_{(r)} = G^{-1}\left\{1 - \exp\left(-\sum_{i=1}^{r} Y_i\right)\right\}.$$
One immediate conclusion that we can make from the above expression is
that the order statistics from an arbitrary distribution form a Markov chain.
The conditional distribution of X(r+1) given X(1) , X(2) , . . . , X(r) depends only
on the observed value of X(r) and the distribution of Yr+1 . This conditional
distribution is clearly the same as that for X(r+1) given X(r) alone, hence the
Markov property. If needed we can obtain the joint density, frs , of X(r) and
X(s) , (1 ≤ r < s ≤ n) by a simple application of Theorem A.10. We then write:
From this we can immediately deduce the conditional distribution of X(s) given
that X(r) = x as:
A simple visual inspection of this formula confirms again the Markov property.
Given that X(r) = x we can view the distribution of the remaining (n − r) order
statistics as an ordered sample of size (n − r) from the conditional distribution
P (u|u > x).
and
$$\mathrm{Var}\,\{X_{(r)}\} = \frac{p_r q_r}{n+2}\,[Q'(p_r)]^2 + \frac{p_r q_r}{(n+2)^2}\left[2(q_r - p_r)\,Q'(p_r)Q''(p_r) + p_r q_r\left\{Q'(p_r)Q'''(p_r) + \tfrac{1}{2}[Q''(p_r)]^2\right\}\right].$$
so that, if μ and σ 2 are the mean and variance in the parent population, then

$$\sum_{r=1}^{n} \mu_r = n\mu \qquad \text{and} \qquad \sum_{r=1}^{n} E\{X_{(r)}^2\} = nE(X^2) = n(\mu^2 + \sigma^2).$$
A simple way to have a marginal normal distribution in at least one of the variables under study is
to replace the observations by the expectations of the order statistics. These are
sometimes called normal scores, typically denoted by ξrn = E(X(r) ) for a random
sample of size n from a standard normal parent with distribution function Φ(x)
and density φ(x). For a random sample of size n from a normal distribution with
mean μ and variance σ 2 we can reduce everything to the standard case since
E(X(r) ) = μ + ξrn σ. Note that, if n is odd, then, by symmetry, it is immedi-
ately clear that E(X(r) ) = 0 for r = (n + 1)/2. We can see that E(X(r) ) =
−E(X(n−r+1) ). For n as small as, say, 5 we can use integration by parts to
evaluate ξr5 for different values of r. For example, $\xi_{55} = 5\cdot 4\int \Phi^3(x)\,\phi^2(x)\,dx$,
which then simplifies to: $\xi_{55} = 5\pi^{-1/2}/4 + 15\pi^{-3/2}\sin^{-1}(1/3)/2 = 1.16296$.
Also, $\xi_{45} = 5\pi^{-1/2}/2 - 15\pi^{-3/2}\sin^{-1}(1/3) = 0.49502$ and $\xi_{35} = 0$. Finally,
$\xi_{15} = -1.16296$ and $\xi_{25} = -0.49502$. For larger sample sizes in which the inte-
gration becomes too fastidious we can appeal to the above approximations using
the fact that
$$Q'(p_r) = \frac{1}{\phi(Q)}\,, \qquad Q''(p_r) = \frac{Q}{\phi^2(Q)}\,, \qquad Q'''(p_r) = \frac{1 + 2Q^2}{\phi^3(Q)}\,, \qquad Q''''(p_r) = \frac{Q(7 + 6Q^2)}{\phi^4(Q)}\,.$$
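The normal scores themselves are easily checked numerically. The short Python sketch below, illustrative and not from the text, estimates ξ_{r,5} by simulation and can be compared with the exact values just quoted.

```python
import numpy as np

# Normal scores xi_{r,n} = E(X_(r)) for a standard normal sample, estimated
# by simulation; for n = 5 compare with the exact values quoted in the text:
# (-1.16296, -0.49502, 0, 0.49502, 1.16296).
rng = np.random.default_rng(0)
n, nsim = 5, 500_000
xi = np.sort(rng.standard_normal((nsim, n)), axis=1).mean(axis=0)
print(np.round(xi, 3))
```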
A.10 Approximations
$$\psi(\theta_n) = \psi(\theta) + (\theta_n - \theta)\,\psi'(\theta) + \frac{(\theta_n - \theta)^2}{2}\,\psi''(\xi) \qquad (A.16)$$

for ξ between θ and θn . Rearranging this expression, ignoring the third term on the right-
hand side, and taking expectations we obtain

$$\mathrm{Var}\,\{\psi(\theta_n)\} \approx E\{\psi(\theta_n) - \psi(\theta)\}^2 \approx \{\psi'(\theta)\}^2\,\mathrm{Var}\,(\theta_n) \approx \{\psi'(\theta_n)\}^2\,\mathrm{Var}\,(\theta_n)$$
generally, the smaller the absolute value of this second derivative, the better
we might anticipate the approximation to be. For θn close to θ the squared
term will be small in absolute value when compared with the linear term, an
additional motivation to neglecting the third term. For the mean, the second
term of Equation (A.16) is zero when θn is unbiased, otherwise close to zero
and, this time, ignoring this second term, we obtain
$$E\{\psi(\theta_n)\} \approx \psi(\theta_n) + \frac{1}{2}\,\mathrm{Var}\,(\theta_n)\,\psi''(\theta_n) \qquad (A.17)$$
as an improvement over the rougher approximation based on the first term alone
of the above expression. Extensions of these expressions to the case of a consistent
estimator ψ(θn ) = ψ(θ1n , . . . , θpn ) of ψ(θ) proceed in the very same way, only
this time based on a multivariate version of Taylor’s theorem. These are:
$$\mathrm{Var}\,\{\psi(\theta_n)\} \approx \sum_{j=1}^{p}\sum_{m \ge j} \frac{\partial \psi(\theta)}{\partial \theta_j}\,\frac{\partial \psi(\theta)}{\partial \theta_m}\,\mathrm{Cov}\,(\theta_{jn}, \theta_{mn})\,,$$

$$E\{\psi(\theta_n)\} \approx \psi(\theta_{1n}, \ldots, \theta_{pn}) + \frac{1}{2}\sum_{j}\sum_{m} \frac{\partial^2 \psi(\theta_n)}{\partial \theta_j\,\partial \theta_m}\,\mathrm{Cov}\,(\theta_{jn}, \theta_{mn})\,.$$
When p = 1 then the previous expressions are recovered as special cases. Again,
the result is an exact one in the case where ψ(·) is a linear combination of the
components θj and this helps guide us in situations where the purpose is that of
confidence interval construction. If, for example, our interest is on ψ and some
strictly monotonic transformation of this, say ψ ∗ , is either linear or close to linear
in the θj , then it may well pay, in terms of accuracy of interval coverage, to use
the delta-method on ψ ∗ , obtaining the end points of the confidence interval for
ψ ∗ and subsequently inverting these, knowing the relationship between ψ and
ψ ∗ , in order to obtain the interval of interest for ψ. Since ψ and ψ ∗ are related by
one-to-one transformations then the coverage properties of an interval for ψ ∗ will
be identical to those of its image for ψ. Examples in this book include confidence
intervals for the conditional survivorship function, given covariate information,
based on a proportional hazards model as well as confidence intervals for indices
of predictability and multiple coefficients of explained variation.
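The point about working on a near-linear scale is easy to illustrate. The Python sketch below, an illustration under assumed numbers rather than an analysis from the text, computes a delta-method interval for a relative risk exp(β) directly, and compares it with the interval obtained by working on the β scale and then exponentiating the endpoints.

```python
import numpy as np
from scipy.stats import norm

def delta_interval(psi, dpsi, est, var_est, level=0.95):
    """First-order delta-method interval for psi(theta): psi(est) plus or
    minus z * |psi'(est)| * sqrt(var_est)."""
    z = norm.ppf(0.5 + level / 2)
    half = z * abs(dpsi(est)) * np.sqrt(var_est)
    return psi(est) - half, psi(est) + half

# Hypothetical log relative-risk estimate and variance.
beta_hat, v = 0.8, 0.09
direct = delta_interval(np.exp, np.exp, beta_hat, v)
z = norm.ppf(0.975)
via_beta = (np.exp(beta_hat - z * np.sqrt(v)), np.exp(beta_hat + z * np.sqrt(v)))
# `via_beta` inherits the coverage of the symmetric interval for beta itself,
# which is the argument made in the text for transforming before inverting.
print(direct, via_beta)
```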
Cornish-Fisher approximations
In the construction of confidence intervals, the δ-method makes a normal-
ity approximation to the unknown distribution and then replaces the first two
moments by local linearization. A different approach, while still working with a
normal density φ(x) = (2π)−1/2 exp(−x2 /2), in a way somewhat analogous to
the construction of a Taylor series, is to express the density of interest, f (x), in
terms of a linear combination of φ(x) and derivatives of φ(x). Normal distribu-
tions with non-zero means and variances not equal to one are obtained by the
usual simple linear transformation and, in practical work, the simplest approach
is to standardize the random variable X so that the mean and variance corre-
sponding to the density f (x) are zero and one, respectively.
The derivatives of φ(x) are well-known, arising in many fields of mathe-
matical physics and numerical approximations. Since φ(x) is simply a constant
multiplying an exponential term it follows immediately that all derivatives of φ(x)
are of the form of a polynomial that multiplies φ(x) itself. These polynomials
(apart from an alternating sign coefficient (−1)i ) are the Hermite polynomials,
Hi (x) , i = 0, 1, . . . , and we have:
$$H_0 = 1\,, \quad H_1 = x\,, \quad H_2 = x^2 - 1\,, \quad H_3 = x^3 - 3x\,, \quad H_4 = x^4 - 6x^2 + 3\,,$$
with H5 and higher terms being calculated by simple differentiation. The polyno-
mials are of importance in their own right, belonging to the class of orthogonal
polynomials and useful in numerical integration. Indeed, we have that:
$$\int_{-\infty}^{\infty} H_i^2(x)\,\phi(x)\,dx = i!\,, \quad i = 0, 1, \ldots\,; \qquad \int_{-\infty}^{\infty} H_i(x)\,H_j(x)\,\phi(x)\,dx = 0\,, \quad i \neq j\,.$$
We look for an expansion of the form $f(x) = \sum_{j=0}^{\infty} c_j H_j(x)\,\phi(x)$ (A.18)
and, in order to achieve this, we multiply both sides of equation (A.18) by Hj (x),
subsequently integrating to obtain the coefficients

$$c_j = \frac{1}{j!}\int_{-\infty}^{\infty} f(x)\,H_j(x)\,dx. \qquad (A.19)$$
Note that the polynomial Hj (x) is of order j so that the right-hand side of
equation (A.19) is a linear combination of the moments, (up to the jth), of
the random variable X having associated density f (x). These can be calculated
step-by-step. For many standard densities several of the lower-order moments
have been worked out and are available. Thus, it is relatively straightforward to
approximate some given density f (x) in terms of a linear combination of φ(x).
The expansion of Equation (A.18) can be used in theoretical investigations
as a means to study the impact of ignoring higher-order terms when we make
a normal approximation to the density of X. We will use the expansion in an
attempt to obtain more accurate inference for proportional hazards models fitted
using small samples. Here the large sample normal assumption may not be suf-
ficiently accurate and the approximating equation is used to motivate potential
improvements obtained by taking into account moments of higher order than
just the first and second. When dealing with actual data, the performance of any
such adjustments need to be evaluated on a case-by-case basis. This is because
theoretical moments will have to be replaced by observed moments and the sta-
tistical error involved in that can be of the same order, or greater, than the error
involved in the initial normal approximation. If we know or are able to calculate
the moments of the distribution, then the ci are immediately obtained. When
the mean is zero we can write down the first four terms as:
This series is known as the Gram-Charlier series, and stopping the development
at the fourth term corresponds to making corrections for skewness and kurtosis.
In the development of the properties of estimators in the proportional hazards
model we see that making corrections for skewness can help make inference more
accurate, whereas, at least in that particular application, corrections for kurtosis
appear to have little impact (Chapter 7).
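To fix ideas, here is a minimal Python sketch of the standardized form of such an expansion: the commonly quoted version with skewness and kurtosis corrections only, which is not necessarily identical in presentation to the book's own four-term display. It uses the probabilists' Hermite polynomials of this section and compares the approximation with the exact density of a standardized gamma variable; all choices of distribution and parameters are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.stats import norm, gamma

def gram_charlier_density(x, skew, kurt_excess):
    """Gram-Charlier (type A) approximation for a variable standardized to
    mean 0 and variance 1: phi(x) times a correction built from He_3, He_4."""
    coef = np.zeros(5)
    coef[0] = 1.0
    coef[3] = skew / 6.0
    coef[4] = kurt_excess / 24.0
    return norm.pdf(x) * hermeval(x, coef)   # hermeval uses He_j (probabilists')

# Standardized gamma(shape = 4): skewness 2/sqrt(4) = 1, excess kurtosis 6/4.
shape = 4.0
x = np.linspace(-2, 4, 7)
approx = gram_charlier_density(x, 2 / np.sqrt(shape), 6 / shape)
exact = gamma.pdf(x * np.sqrt(shape) + shape, a=shape) * np.sqrt(shape)
print(np.round(approx, 4))
print(np.round(exact, 4))
```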
Saddlepoint approximations
A different, although quite closely related, approach to the above uses saddlepoint
approximations. Theoretical and practical work on these approximations indicate
them to be surprisingly accurate for the tails of a distribution. We work with the
inversion formula for the cumulant generating function, a function that is defined
in the complex plane, and in this two-dimensional plane, around the point of
interest (which is typically a mean or a parameter estimate) the function looks
like a minimum in one direction and a maximum in an orthogonal direction: hence
the name “saddlepoint.” Referring back to Section A.8 recall that we identified
κr as the coefficient of (it)r /r! in the expansion of the cumulant generating
function K(t) = log φ(t) where φ(t) is the characteristic function. We can exploit
the relationship between φ(t) and f (x); that is:
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\phi(t)\,dt\,, \qquad \phi(t) = \int_{-\infty}^{\infty} e^{itx}\,f(x)\,dx\,.$$
Appendix B
Stochastic processes
Gaussian processes
If for every partition of (0,1), 0 = t0 < t1 < t2 < · · · < tn = 1, the set of random
variables X(t1 ), . . . , X(tn ) has a multivariate normal distribution, then the pro-
cess X(t) is called a Gaussian process. Brownian motion, described below, can be
thought of as simply a standardized Gaussian process. A Gaussian process being
uniquely determined by the multivariate means and covariances it follows that
such a process will have the property of stationarity if for any pair (s, t : t > s),
Cov {X(s), X(t)} depends only on (t − s). In practical studies we will often deal
with sums indexed by t and the usual central limit theorem will often underlie
the construction of Gaussian processes.
Consider a stochastic process X(t) on (0, 1) with the following three properties:
1. X(0) = 0, i.e., at time t = 0 the starting value of X is fixed at 0.
2. X(t) , t ∈ (0, 1) has independent stationary increments.
3. At each t ∈ (0, 1) the distribution of X(t) is N (0, t).
This simple set of conditions completely describes a uniquely determined stochas-
tic process called Brownian motion. It is also called the Wiener process or Wiener
measure. It has many important properties and is of fundamental interest as a
limiting process for a large class of sums of random variables on the interval (0,1).
An important property is described in Theorem B.1 below. Firstly we make an
attempt to describe just what a single realization of such a process might look
like. Later we will recognize the same process as being the limit of a sum of
independent random contributions. The process is continuous and so, approxi-
mating it by any drawing, there cannot be any gaps. At the same time, in a sense
that can be made more mathematically precise, the process is infinitely jumpy.
Nowhere does a derivative exist. Figure B.1 illustrates this via simulated approx-
imations. The right-hand figure could plausibly be obtained from the left-hand
one by homing in on any small interval, e.g., (0.20, 0.21), subtracting off the
value observed at t = 0.20, and rescaling by a multiple of ten to restore the inter-
val of length 0.01 to the interval (0,1). The point we are trying to make is that
the resulting process itself looks like (and indeed is) a realization of Brownian
motion. Theoretically, this could be repeated without limit which allows us to
understand in some way how infinitely jumpy is the process. In practical exam-
ples we can only ever approximate the process by linearly connecting up adjacent
simulated points.
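Realizations like those in Figure B.1 are produced by exactly this kind of linear interpolation of cumulative normal increments. A minimal Python sketch, illustrative only:

```python
import numpy as np

# Approximate realizations of Brownian motion on (0,1): cumulative sums of
# independent N(0, dt) increments, joined linearly between grid points.
rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 2001)
dt = np.diff(t)
paths = np.zeros((2, t.size))
paths[:, 1:] = np.cumsum(rng.normal(size=(2, dt.size)) * np.sqrt(dt), axis=1)
# Zooming in on, say, (0.20, 0.21), subtracting the value at t = 0.20 and
# rescaling time by 100 and space by 10 yields, approximately, another
# realization of Brownian motion on (0, 1).
```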
Figure B.1: Two independently simulated Brownian motions on the interval (0,1).
So, when looking ahead from time point s to time point t + s, the previous
history indicating how we arrived at s is not relevant. The only thing that matters
is the point at which we find ourselves at time point s. This is referred to as the
Markov property. The joint density of X(t1 ), . . . , X(tn ) can be written as:
$$f(x_1, x_2, \ldots, x_n) = f_{t_1}(x_1)\,f_{t_2 - t_1}(x_2 - x_1)\cdots f_{t_n - t_{n-1}}(x_n - x_{n-1})$$
Corollary B.1. The conditional distribution of X(s) given X(t) (t > s) is normal
with a mean and a variance given by:

$$E\{X(s)\,|\,X(t)\} = \frac{s}{t}\,X(t)\,, \qquad \mathrm{Var}\,\{X(s)\,|\,X(t)\} = \frac{s(t - s)}{t}\,.$$
This result helps provide insight into another useful process, the Brownian
bridge described below. Other important processes arise as simple transformations
of Brownian motion. The most obvious to consider is where we have a Gaussian
process satisfying conditions (1) and (2) for Brownian motion but where, instead
of the variance increasing linearly, i.e., Var X(t) = t, the variance increases either
too quickly or too slowly so that Var X(t) = φ(t) where φ(·) is some monotonic
increasing function of t. Then we can transform the time axis using φ(·) to
produce a process satisfying all three conditions for Brownian motion. Consider
also the transformation
V (t) = exp(−αt/2)X{exp(αt)}
Brownian bridge
Let W(t) be Brownian motion. We know that W(0) = 0. We also know that with
probability one the process W(t) will return at some point to the origin. Let’s
choose a point, and in particular the point t = 1 and consider the conditional
process W 0 (t), defined to be Brownian motion conditioned by the fact that
W(1) = 0. For small t this process will look very much like the Brownian motion
from which it is derived. As t goes to one the process is pulled back to the
origin since at t = 1 we have that W 0 (1) = 0 and W(t) is continuous. Also
W 0 (0) = W(0) = 0. Such a process is called tied down Brownian motion or the
Brownian bridge. We will see below that realizations of a Brownian bridge can
be viewed as linearly transformed realizations of Brownian motion itself, and vice
versa.
From the results of the above section we can investigate the properties of
W 0 (t). The process is a Gaussian process so we only need to consider the mean
and covariance function for the process to be completely determined. We have:
Theorem B.2. For all t ∈ [0, 1], E{W 0 (t)} = 0,
so that the only remaining question is the covariance function for the process to
be completely and uniquely determined. The following corollary is all we need.
Corollary B.3. The covariance function for the process defined as W 0 (t) is, for
s ≤ t,

$$\mathrm{Cov}\,\{W^0(s),\, W^0(t)\} = s\,(1 - t).$$
and, by uniqueness, the process is therefore itself the Brownian bridge. Such a
covariance function is characteristic of many observed phenomena. The covari-
ance decreases linearly with distance from s. As for Brownian motion, should the
covariance function decrease monotonically rather than linearly, then a suitable
transformation of the time scale enables us to write the covariance in this form.
At t = s we recover the usual binomial expression s(1 − s).
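This covariance is simple to confirm empirically, using the construction of a bridge from Brownian motion that is given immediately below. A minimal Python sketch, illustrative only:

```python
import numpy as np

# Empirical check of Corollary B.3: build bridges as W0(t) = W(t) - t*W(1)
# and compare the sample covariance at (s, t) with s(1 - t) for s <= t.
rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.0, 101)
dW = rng.normal(size=(50_000, grid.size - 1)) * np.sqrt(np.diff(grid))
W = np.concatenate([np.zeros((50_000, 1)), np.cumsum(dW, axis=1)], axis=1)
bridge = W - grid * W[:, [-1]]
s, t = 30, 70                               # grid points s = 0.3, t = 0.7
print(np.cov(bridge[:, s], bridge[:, t])[0, 1], grid[s] * (1 - grid[t]))
```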
Notice that not only can we go from Brownian motion to a Brownian bridge
via the simple transformation Z(t) = X(t) − tX(1), 0 ≤ t ≤ 1,
but the converse is also true, i.e., we can recover Brownian motion, X(t), from
the Brownian bridge, Z(t), via the transformation
$$X(t) = (t + 1)\,Z\!\left(\frac{t}{t + 1}\right). \qquad (B.2)$$
To see this, first note that, assuming Z(t) to be a Brownian bridge, then X(t)
is a Gaussian process. It will be completely determined by its covariance process
Cov {X(s), X(t)}. All we then require is the following lemma:
The three processes: Brownian motion, the Brownian bridge, and the Ornstein-
Uhlenbeck are then closely related and are those used in the majority of applica-
tions. Two further related processes are also of use in our particular applications:
integrated Brownian motion and reflected Brownian motion.
Lemma B.3. The covariance function for the integrated process $Z(t) = \int_0^t W(s)\,ds$ and W(t) is: $\mathrm{Cov}\{Z(t), W(t)\} = t^2/2$.
For a model in which inference derives from cumulative sums, this would
provide a way of examining how reasonable are the underlying assumptions if
repetitions are available. Repetitions can be obtained by bootstrap resampling if
only a single observed process is available. Having standardized, a plot of the log-
covariance function between the process and the integrated process against log-
time ought to be linear with slope of two and intercept of minus log 2 assuming
that model assumptions hold.
for a process, we may want to, for example, describe an approximate confidence
interval for the whole process rather than just a confidence interval at a single
point t. In such a case the above result comes into play. The joint distribution is
equally simple and we make use of the following:
Lemma B.6. If W(t) is standard Brownian motion and M (t) the maximum value
attained on the interval (0, t), i.e., M (t) = supu∈(0,t) W(u), then, for a > 0 and b ≥ 0,

$$\Pr\{W(t) < a - b,\; M(t) > a\} = \Pr\{W(t) > a + b\}.$$
The conditional distribution Pr {W(t) < a − b |M (t) > a} can then be derived
immediately by using the results of the two lemmas.
X(t) = W(t) + μt
where W(t) is Brownian motion. We can immediately see that E{X(t)} = μt and
Var {X(t)} = t. As for Brownian motion, Cov {X(s), X(t)} = s , s < t. Alternatively
we can define the process in a way analogous to our definition for Brownian
motion as a process having the following three properties:
1. X(0) = 0.
2. X(t) , t ∈ (0, 1) has independent stationary increments.
3. At each t ∈ (0, 1), X(t) is N (μt, t).
Clearly, if X(t) is Brownian motion with drift parameter μ, then the process
X(t) − μt is standard Brownian motion. Also, for the more common situation in
which the mean may change non-linearly with time, provided the increments are
independent, we can always construct a standard Brownian motion by first sub-
tracting the mean at time t, then transforming the timescale in order to achieve
a linearly increasing variance. Note that for non-linear differentiable functions of
drift, these can always be approximated locally by a linear function so that the
essential nature of the process, whether the drift is linear or smoothly non-linear,
is the same. We will make use of this idea in those chapters devoted to fit and
model building.
The sum can be seen to be convergent since this is an alternating sign series
in which the kth term goes to zero. Furthermore, the error in ignoring all terms
higher than the nth is less, in absolute value, than the size of the (n + 1)th
term. Given that the variance of W0 (t) depends on t it is also of interest to
study the standardized distribution B0 (t) = W0 (t)/√(t(1 − t)). This is, in fact,
the Ornstein-Uhlenbeck process. Simple results for the supremum of this are not
possible since the process becomes unbounded at t = 0 and t = 1. Nonetheless,
if we are prepared to reduce the interval from (0, 1) to (ε1 , ε2 ) where ε1 > 0 and
ε2 < 1 then we have an approximation due to Miller and Siegmund (1982):
$$\Pr\left\{\sup_t |B_0(t)| \ge \alpha\right\} \approx \frac{4\phi(\alpha)}{\alpha} + \phi(\alpha)\left(\alpha - \frac{1}{\alpha}\right)\log\left\{\frac{\varepsilon_2(1 - \varepsilon_1)}{\varepsilon_1(1 - \varepsilon_2)}\right\}, \qquad (B.7)$$
where φ(x) denotes the standard normal density. This enables us to construct
confidence intervals for a bridged process with limits themselves going to zero at
the endpoints. To obtain these we use the fact that Pr {W0 (t) > α} =
Pr {√(t(1 − t)) B0 (t) > α}. For most practical purposes though it is good enough
to work with Equation B.6 and approximate the infinite sum by curtailing sum-
mation for values of k greater than 2.
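The approximation is simple to evaluate and to invert for a critical value. A minimal Python sketch; the interval and level chosen are illustrative:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def miller_siegmund_tail(alpha, eps1, eps2):
    """Approximation (B.7) to Pr{ sup |B0(t)| >= alpha } over (eps1, eps2)."""
    phi = norm.pdf(alpha)
    log_ratio = np.log(eps2 * (1 - eps1) / (eps1 * (1 - eps2)))
    return 4 * phi / alpha + phi * (alpha - 1 / alpha) * log_ratio

# Critical value for an approximate 5% simultaneous band on (0.05, 0.95).
crit = brentq(lambda a: miller_siegmund_tail(a, 0.05, 0.95) - 0.05, 1.5, 5.0)
print(round(crit, 3))
```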
B.3 Counting processes and martingales

While the basic ideas behind counting processes and martingales are both simple
and natural, if we wish to maintain the fullest generality—in practice allowing the
time interval to stretch without limit, as well as allowing the unboundedness for
covariates—then things become difficult and very technical. Rebolledo’s multi-
variate central limit theorem for martingales requires great care in its application.
Our preference is to assume for both time and the covariates, at least for the
purposes of obtaining large sample results, support on finite bounded intervals.
This enables us to work with standard well-known central limit theorems. Specif-
ically we put time on the interval (0,1). In this appendix we motivate the use
of counting processes and martingales in the context of survival analysis prob-
lems and we describe their application in real situations. This also helps establish
the links with other works in the field that do make an appeal to Rebolledo’s
martingale central limit theorem. The goal of this appendix is to provide some
understanding of the probability structure upon which the theory is based.
where Δi = xi − xi−1 > 0 and where, as described in Section A.2 the summation
is understood to be over an increasing partition in which Δi > 0 and max Δi
goes to zero. Now, changing the order of taking limits, the above expression
becomes
$$\lim_{\max \Delta_i \to 0}\; \sum_i E\{[M(x_i) - M(x_{i-1})]\,H(x_{i-1})\} = 0, \qquad (B.9)$$
a result which looks simple enough but that has a lot of force when each of the
infinite number of expectations can be readily evaluated. Let’s view Equation B.9
in a different light, one that highlights the sequential and ordered nature of the
partition. Rather than focus on the collection of M (xi ) and H(xi ), we can focus
our attention on the increments M (xi ) − M (xi−1 ) themselves, the increments
being multiplied by H(xi−1 ), and, rather than work with the overall expectation
implied by the operator E, we will set up a sequence of conditional expectations.
Also, for greater clarity, we will omit the term limmax Δi →0 altogether. We will
put it back when it suits us. This lightens the notation and helps to make certain
ideas more transparent. Later, we will equate the effect of adding back in the term
limmax Δi →0 to that of replacing finite differences by infinitesimal differences.
Consider then

$$U = \sum_i \{M(x_i) - M(x_{i-1})\}\,H(x_{i-1}), \qquad (B.10)$$
and unlike the preceding two equations, we are able to greatly relax the require-
ment that H(x) be a known function or that M (x) be restricted to being the
difference between the empirical distribution function and the distribution func-
tion. By sequential conditioning upon F(xi ) where F(xi ) are increasing sequence
of sets denoting observations on M (x) and H(x), for all values of x less than
or equal to xi , we can derive results of wide applicability. In particular, we can
now take M (x) and H(x) to be stochastic processes. Some restrictions are still
needed for M (x), in particular that the incremental means and variances exist.
We will suppose that

$$E\{M(x_i) - M(x_{i-1}) \mid F(x_{i-1})\} = 0\,;$$
in words, when given F(xi−1 ), the quantity M (xi−1 ) is fixed and known and the
expected size of the increment is zero. This is not a strong requirement and only
supposes the existence of the mean. If the expected size of the increment is other
than zero, then we can subtract this difference to recover the desired property.
Furthermore, given F(x), the quantity H(x) is fixed. The trick is then to exploit
the device of double expectation whereby for events, A and B, it is always true
that E(A) = EE(A|B). In the context of this expression, B = F(xi−1 ), leading
to
$$E(U) = E\sum_i H(x_{i-1})\,E\{M(x_i) - M(x_{i-1}) \mid F(x_{i-1})\} = 0, \qquad (B.12)$$
and, under the assumption that the increments are uncorrelated, the variance of
the sum is the sum of the variances of each component of the sum. Thus

$$\mathrm{Var}(U) = E\sum_i E\{H^2(x_{i-1})\,[M(x_i) - M(x_{i-1})]^2 \mid F(x_{i-1})\}. \qquad (B.13)$$
For instance, in Equation B.13, the inner expectation is taken with respect to rep-
etitions over all possible outcomes in which the set F(xi−1 ) remains unchanged,
whereas the outer expectation is taken with respect to all possible repetitions. In
Equation B.12 the outer expectation, taken with respect to the distribution of all
potential realizations of all the sets F(xi−1 ), is not written and is necessarily zero
since all of the inner expectations are zero. The analogous device to double expec-
tation for the variance is not so simple since Var(Y ) = E Var(Y |Z)+Var E(Y |Z).
Applying this we have
since Var E{M (xi ) − M (xi−1 )|F(xi−1 )} is equal to zero, this being the case
because each term is itself equal to the constant zero. The first term also requires
a little thought, the outer expectation indicated by E being taken with respect
to the distribution of F(xi−1 ), i.e., all the conditional distributions M (x) and
H(x) where x ≤ xi−1 . The next key point arises through the sequential nesting.
These outer expectations, taken with respect to the distribution of F(xi−1 )
are the same as those taken with respect to the distribution of any F(x) for
which x ≥ xi−1 . This is an immediate consequence of the fact that the lower-
dimensional distribution results from integrating out all the additional terms in
the higher-dimensional distribution. Thus, if xmax is the greatest value of x
for which observations are made then we can consider that all of these outer
expectations are taken with respect to F(xmax ). Each time that we condition
upon F(xi−1 ) we will treat H(xi−1 ) as a fixed constant and so it can be simply
squared and moved outside the inner expectation. It is still governed by the outer
expectation which we will take to be with respect to the distribution of F(xmax ).
Equation B.13 then follows.
Making a normal approximation for U and supposing, as in the theory of estimating
equations, that, for any given set of observations, U depends monotonically on some
parameter β, it is then very straightforward to set up hypothesis tests for β = β0 .
Many situations, including that of proportional hazards regression, lead to esti-
mating equations of the form of U. The above set-up, which is further developed
below in a continuous form, i.e., after having “added in” the term limmax Δi →0 ,
applies very broadly. We need the concept of a process, usually indexed by time
t, together with the conditional means and variances of its increments given the
accumulated information up until time t.
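To make the sequential-conditioning argument concrete, the following small simulation (our own construction, not taken from the text; the symmetric random walk for M and the sign-based weight for H are arbitrary choices) builds U as in Equation B.10 and checks Equations B.12 and B.13 numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_rep = 200, 20000

# M is a centred random walk: each increment has mean zero given the past.
increments = rng.choice([-1.0, 1.0], size=(n_rep, n_steps))
M = np.cumsum(increments, axis=1)
M = np.hstack([np.zeros((n_rep, 1)), M])            # M(x_0) = 0

# A simple predictable weight: H(x_{i-1}) depends only on the path so far.
H = np.where(M[:, :-1] >= 0, 1.0, 0.5)

U = np.sum(H * (M[:, 1:] - M[:, :-1]), axis=1)      # Equation B.10

print("mean of U          :", U.mean())             # close to 0 (Equation B.12)
print("variance of U      :", U.var())
print("sum of E[H^2] terms:", (H ** 2).mean(axis=0).sum())  # close to Var(U), Equation B.13
```

Since the increments here have unit conditional variance, the sum of the terms E{H²(xi−1)} should track Var(U) closely.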
We have restricted our attention here to the Riemann-Stieltjes definition of
the integral. The broader Lebesgue definition allows the inclusion of subsets
of t tolerating serious violations of our conditions such as conditional means
and variances not existing. The conditioning sets can be also very much more
involved. Only in a very small number of applications has this extra generality
been exploited. Given that it makes the main ideas much more difficult for all but
those familiar with measure theory, it seems preferable to avoid it altogether.
Counting processes
The above discussion started off with some consideration of the empirical cumula-
tive distribution function Fn (t) which is discussed in much more detail in Section
C.5. Let’s consider the function N (t) = {nFn (t) : 0 ≤ t ≤ 1}. We can view this as
a stochastic process, indexed by time t so that, given any t we can consider N (t)
to be a random variable taking values from 0 to n. We include here a restriction
that we generally make which is that time has some upper limit, without loss
of generality, we call this 1. This restriction can easily be avoided but it implies
no practical constraint and is often convenient in practical applications. We can
broaden the definition of N (t) beyond that of nFn (t) and we have:
Definition B.1. A counting process N = {N (t) : 0 ≤ t ≤ 1} is a stochastic pro-
cess that can be thought of as counting the occurrences (as time t proceeds) of a
certain type of event. We suppose these events occur singly.
Very often N (t) can be expressed as the sum of n individual counting pro-
cesses, Ni (t), each one counting no more than a single event. In this case Ni (t)
is a simple step function, taking the value zero at t = 0 and jumping to the
value one at the time of an event. The realizations of N (t) are integer-valued
step functions with jumps of size +1 only. These functions are right continuous
and N (t) is the (random) number of events in the time interval [0, t]. We asso-
ciate with the stochastic process N (t) an intensity function α(t). The intensity
function serves the purpose of standardizing the increments to have zero mean.
In order to better grasp what is happening here, the reader might look back to
Equation B.11 and the two sentences following that equation. The mean is not
determined in advance but depends upon Ft− where, in a continuous framework,
Ft− is to Ft what F(xi−1 ) is to F(xi ). In technical terms:
Definition B.2. A filtration, Ft , is an increasing right continuous family of
sub-sigma-algebras (see A.3 for the meaning of sigma-algebra).
This definition may not be very transparent to those unfamiliar with the
requirement of sigma additivity for probability spaces and there is no real need
to expand on it here. The requirement is a theoretical one which imposes a
mathematical restriction on the size, in an infinite sense, of the set of subsets
of Ft . The restriction guarantees that the probability we can associate with any
infinite sum of disjoint sets is simply the sum of the probabilities associated
with those sets composing the sum. For our purposes, the only key idea of
importance is that Ft− is a set containing all the accumulated information (hence
“increasing”) on all processes contained in the past up until but not including
time t. The intensity function can then be specified through
$$\alpha(t)\,dt = \Pr\{dN(t) = 1 \mid \mathcal{F}_{t-}\},$$
the equality being understood in an infinitesimal sense, i.e., the functional part
of the left-hand side, α(t), is the limit of the right-hand side divided by dt > 0
as dt goes to zero. In the chapter on survival analysis we will see that the hazard
function, λ(t), expressible as the ratio of the density, f (t), to the survivorship
function, S(t), i.e., f (t)/S(t), can be expressed in fundamental terms by first
letting Y (t) = I(T ≥ t). Under this interpretation, we can also write
$$\lambda(t)\,dt = \Pr\{t \le T < t + dt \mid T \ge t\}.$$
It is instructive to compare the above definitions of α(t) and λ(t). The first def-
inition is the more general since, choosing the sets Ft to be defined from the
at-risk function Y (t) when it takes the value one, enables the first definition to
reduce to a definition equivalent to the second. The difference is an important
one in that if we do not provide a value for I(T ≥ t) then this is a (0, 1) ran-
dom variable, and in consequence, α(t) is a (0, λ(t)) random variable. For this
particular case we can express this idea succinctly via the formula
$$\alpha(t)\,dt = Y(t)\,\lambda(t)\,dt.$$
Replacing Y (t) by a more general “at risk” indicator variable will allow for great
flexibility, including the ability to obtain a simple expression for the intensity
in the presence of censoring as well as the ability to take on-board multistate
problems where the transitions are not simply from alive to dead but from, say,
state j to state k summarized via αjk (t)dt = Yjk (t)λjk (t)dt in which Yjk (t) is
left continuous and therefore equal to the limit of Yjk (t − ε) as ε > 0 goes to zero
through positive values, an indicator variable taking the value one if the subject is
in state j and available to make a transition to state k at time t − ε as ε → 0. The
hazards λjk (t) are known in advance, i.e., at t = 0 for all t, whereas the αjk (t)
are randomly viewed from time point s where s < t, with the subtle condition of
left continuity which leads to the notion of “predictability” described below. The
idea of sequential standardization, the repeated subtraction of the mean, that
leans on the evaluation of intensities, can only work when the mean exists. This
requires a further technical property, that of being “adapted.” We say
Definition B.3. A stochastic process X(t) is said to be adapted to the filtration
Ft if X(t) is a random variable with respect to Ft .
Once again the definition is not particularly transparent to nonprobabilists
and the reader need not be over-concerned since it will not be referred to here
apart from in connection with the important concept of a predictable process.
The basic idea is that the relevant quantities upon which we aim to use the
i.e., the same as P (N1 (t) or N2 (t) jump in [t, t + dt)|Ft− ) and, if, as is reasonable
in the great majority of applications, we take the probability of events occurring
simultaneously to be negligible compared to that of their occurring singly, then the
intensity of the sum is simply the sum of the intensities, α(t)dt = α1 (t)dt + α2 (t)dt.
When the censoring mechanism can be taken to be independent of the failure
mechanism, i.e., when Pr (Ti > t|Ci > c) = Pr (Ti > t), then a simple result is available.
Theorem B.3. Let the counting process, Ni (t), depend on two independent and
positive random variables, Ti and Ci such that Ni (t) = I{Ti ≤ t, Ti ≤ Ci }. Let
Xi = min(Ti , Ci ), Yi (t) = I(Xi ≥ t); then Ni (t) has intensity process
$$\alpha_i(t)\,dt = Y_i(t)\,\lambda_i(t)\,dt,$$
where λi (t) is the hazard function of Ti .
The counting process, Ni (t), is one of great interest to us since the response
variable in most studies will be of such a form, i.e., an observation when the event
of interest occurs but an observation that is only possible when the censoring
variable is greater than the failure variable. Also, when we study a heterogeneous
group, our principal focus in this book, the theorem still holds in a modified form.
Thus, if we can assume that Pr (Ti > t|Ci > c, Z = z) = Pr (Ti > t|Z = z), we
then have:
Theorem B.4. Let the counting processes, Ni (t), depend on two independent
and positive random variables, Ti and Ci , as well as Z such that
$$N_i(t) = I\{T_i \le t,\ T_i \le C_i\}, \qquad X_i = \min(T_i, C_i), \qquad Y_i(t) = I(X_i \ge t).$$
Then the intensity process for Ni (t) can be written as αi (t, z)dt = Yi (t)λi (t, z)dt.
The assumption needed for Theorem B.4, known as the conditional independence
assumption, is weaker than that needed for Theorem B.3, in that the former theorem
contains the latter as a special case. Note that the stochastic processes Yi (t)
and αi (t) are left continuous and adapted to Ft . They are therefore predictable
stochastic processes, which means that, given Ft− , we treat Yi (t), αi (t) and,
assuming that Z(t) is predictable, αi (t, z) as fixed constants.
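As a rough numerical illustration of Theorem B.3 (a sketch of our own; the exponential failure and censoring distributions, the rates and the time point are all arbitrary choices), the code below simulates censored pairs, forms Ni(t) = I{Ti ≤ t, Ti ≤ Ci} and Yi(t) = I(Xi ≥ t), and compares the number of observed events by time t with the integrated intensity Σi ∫ Yi(s)λ(s)ds.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, cens_rate, t0 = 5000, 0.8, 0.5, 1.0

T = rng.exponential(1.0 / lam, size=n)        # failure times with constant hazard lam
C = rng.exponential(1.0 / cens_rate, size=n)  # independent censoring times
X = np.minimum(T, C)

# N_bar(t0): events observed by t0; the integrated intensity uses
# integral_0^t0 Y_i(s) ds = min(X_i, t0) for each subject.
N_bar = np.sum((T <= t0) & (T <= C))
integrated_intensity = lam * np.sum(np.minimum(X, t0))

print("observed events N_bar(t0) :", N_bar)
print("integrated intensity      :", integrated_intensity)   # close to N_bar(t0)
```

The two quantities agree in expectation, which is the content of the martingale property developed in the following pages.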
The reader might look over Section B.3 for the probability background behind
martingales. A martingale M = {M (t) : t ≥ 0} is a stochastic process whose
increment over an interval (u, v], given the past up to and including time u, has
expectation zero, i.e., E{M (v) − M (u)|Fu } = 0 for all 0 ≤ u < v < 1. Equation
B.11 provides the essential idea for the discrete time case. We can rewrite the
above defining property of martingales by taking the time instants u and v to be
just before and just after the time instant t. Letting both v and u tend to t and
u play the role of t−, we can write
$$E\{dM(t) \mid \mathcal{F}_{t-}\} = 0.$$
Note that this is no more than a formal way of stating that, whatever the history
Ft may be, given this history, expectations exist. If these expectations are not
themselves equal to zero then we only need to subtract the nonzero means to
achieve this end. A counting process Ni (t) is not of itself a martingale but note,
for 0 ≤ u < v ≤ 1, that E{Ni (v)|Fu } > E{Ni (u)|Fu } and, as above, by taking
the time instants u and v to be just before and just after the time instant t,
letting v and u tend to t and u play the role of t−, we have
$$E\{dN_i(t) \mid \mathcal{F}_{t-}\} = \alpha_i(t)\,dt. \tag{B.15}$$
Doob-Meyer decomposition
For the submartingale Ni (t), having associated intensity process α(t), we have
from Equation B.15 that E{dN (t)|Ft− } = α(t)dt. If we write dM (t) = dN (t) −
α(t)dt then E{dM (t)|Ft− } = 0. Thus M (t) is a martingale. For the counting
processes of interest to us we will always be able to integrate α(t) and we define
$A(t) = \int_0^t \alpha(s)\,ds$. We can write
$$N(t) = A(t) + M(t),$$
the decomposition of the submartingale N(t) into its compensator A(t) and the
martingale M(t). In the simplest case, a single counting process Ni (t) takes the
value zero at the outset and then jumps to the
value one for all times greater than or equal to that at which the event of interest
occurs. But, generally, Ni (t) may assume many, or all, integer values. Note that
any sum of counting processes can be immediately seen to be a counting process
in its own right. An illustrative example could be the number of goals scored
during a soccer season by some team. Here, the indexing variable t counts the
minutes from the beginning of the season. The expectation of Ni (t) (which must
exist given the physical constraints of the example) may vary in a complex way
with t, certainly non-decreasing and with long plateaus when it is not possible for
a goal to be scored, for instance when no game is being played. At time t = 0,
it might make sense to look forward to any future time t and to consider the
expectation of Ni (t).
As the season unfolds, at each t, depending on how the team performs, we
may exceed, possibly greatly, or fall short of, the initial expectation of Ni (t).
As the team’s performance is progressively revealed to us, the original expecta-
tions are of diminishing interest and it is clearly more useful to consider those
conditional expectations in which we take account of the accumulated history
at time point t. Working this out as we go along, we determine Ai (t) so that
Ni (t) − Ai (t), given all that has happened up to time t, has zero expectation.
When αi (s) is the intensity function for Ni (s), then
$$A_i(t) = \int_0^t \alpha_i(s)\,ds$$
and this important result is presented in Theorem B.5 given immediately below.
Conditional upon Ft− , we can view the random variable dN (t) as a Bernoulli
(0,1) having mean α(t)dt and variance given by α(t)dt{1 − α(t)dt}. In contrast,
the random variable dM (t), conditional on Ft− , has mean zero and the same
variance. This follows since, given Ft− , α(t) is fixed and known. As usual, all
the equations are in an infinitesimal sense, the equal sign indicating a limiting
value as dt → 0. In this sense α²(t)(dt)² is negligible when compared to α(t)dt
since the ratio of the first to the second goes to zero as dt goes to zero. Thus,
the incremental variances are simply the same as the means, i.e., α(t)dt. This,
of course, ties in exactly with the theory for Poisson counting processes.
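A quick check of the Poisson connection just mentioned, under the assumption (ours, for illustration) of a homogeneous Poisson process with known rate: the martingale M(t) = N(t) − λt has mean zero and variance equal to the compensator λt.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t, n_rep = 2.0, 3.0, 100000

# N(t) for a homogeneous Poisson process with intensity lam; A(t) = lam * t
N_t = rng.poisson(lam * t, size=n_rep)
M_t = N_t - lam * t                      # martingale M(t) = N(t) - A(t)

print("E[M(t)]  :", M_t.mean())          # close to 0
print("Var[M(t)]:", M_t.var())           # close to lam * t = 6
```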
Definition B.5. The predictable variation process of a martingale M (t), denoted
by ⟨M⟩ = {⟨M⟩(t) : t ≥ 0}, is such that
$$d\langle M\rangle(t) = \mathrm{Var}\{dM(t) \mid \mathcal{F}_{t-}\} = E\{[dM(t)]^2 \mid \mathcal{F}_{t-}\}.$$
The use of pointed brackets has become standard notation here and, indeed,
the process is often referred to as the pointed brackets process. Note that ⟨M⟩
is clearly a stochastic process and that the process is predictable and non-
decreasing. It can be thought of as the sum of conditional variances of the
increments of M over small time intervals partitioning [0, t], each conditional
variance being taken given what has happened up to the beginning of the corre-
sponding interval. We then have the following important result:
Theorem B.5. Let $M_i(t) = N_i(t) - A_i(t)$ where $A_i(t) = \int_0^t \alpha_i(s)\,ds$. Then
$M_i(t)$ is a martingale with predictable variation process $\langle M_i\rangle(t) = A_i(t)$.
Corollary B.4. Define, for all t and i ≠ j, the predictable covariation process,
⟨Mi , Mj ⟩, of two martingales, Mi and Mj , analogously to the above. Then
$$\langle M_i, M_j\rangle(t) = 0.$$
The corollary follows readily if, for i ≠ j, the counting processes Ni (t) and
Nj (t) can never jump simultaneously. In this case the product dNi (t)dNj (t)
is always equal to zero. Thus, the conditional covariance between dNi (t) and
dNj (t) is −αi (t)dt · αj (t)dt.
Stochastic integrals
The concept of a stochastic integral is very simple; essentially we take a
Riemann-Stieltjes integral, from zero to time point t, of a function which,
at the outset when t = 0 and looking forward, would be random. Examples
of most immediate interest to us are: $N(t) = \int_0^t dN(s)$, $A(t) = \int_0^t dA(s)$ and
$M(t) = \int_0^t dM(s)$. Of particular value are integrals of the form $\int_0^t H(s)\,dM(s)$
where M (s) is a martingale and H(s) a predictable function. By predictable we
mean that if we know all the values of H(s) for s less than t then we also know
H(t), and this value is the same as the limit of H(s) as s → t for values of s
less than t.
The martingale transform theorem provides a tool for carrying out inference
in the survival context. Many statistics arising in practice will be of a form U
described in Appendix D on estimating equations. For these the following result
will find immediate application:
Theorem B.6. Let M be a martingale and H a predictable stochastic process.
Then M ∗ is also a martingale where it is defined by:
$$M^*(t) = \int_0^t H(s)\,dM(s). \tag{B.24}$$
Corollary B.5. The predictable variation process of the stochastic process M ∗ (t)
can be written
$$\langle M^*\rangle(t) = \int_0^t H^2(s)\,d\langle M\rangle(s) = \int_0^t H^2(s)\,dA(s). \tag{B.25}$$
Then, the martingale M (t) converges to a Gaussian process with mean zero and
variance A(t).
Added conditions can make it easier to obtain the first one of these conditions,
and as a result, there are a number of slightly different versions of these two
criteria. The multivariate form has the same structure. In practical situations, we
take M (∞) to be N (0, σ²) where we estimate σ² by ⟨M⟩(∞).
In the presence of censoring we work with a process N ∗ (t) that counts observable
events. If the censoring does not modify the compensator,
A(t), of N (t), then N (t) and N ∗ (t) have the same compensator. The differ-
ence, M ∗ (t) = N ∗ (t) − A(t) would typically differ from the martingale M (t) but
would nonetheless still be a martingale in its own right. In addition, it is easily
anticipated how we might go about tackling the much more complex situation in
which the censoring would not be independent of the failure mechanism. Here,
the compensators for N ∗ (t) and N (t) do not coincide. For this more complex
case, we would need some model, A∗ (t), for the compensator of N ∗ (t) in order
that M ∗ (t) = N ∗ (t) − A∗ (t) would be a martingale.
The most common and the simplest form of the at-risk indicator Y (t) is one
where it assumes the value one at t = 0, retaining this value until censored or
failed, beyond which time point it assumes the value zero. When dealing with
n individuals, and n counting processes, we can write $\bar N(t) = \sum_{i=1}^n N_i(t)$ and
use the at-risk indicator to denote the risk set. If Yi (t) refers to individual i,
then $\bar Y(t) = \sum_{i=1}^n Y_i(t)$ is the risk set at time t. The intensity process for Ni (t) is
αi (t) = Yi (t)λi (t), where λi (t) is the hazard for subject i, written simply as λ(t)
in the case of i.i.d. replications. Then, the compensator, Ā(t), for N̄ (t) is:
$$\bar A(t) = \int_0^t \Big\{\sum_{i=1}^n Y_i(s)\Big\}\,\lambda(s)\,ds = \int_0^t \bar Y(s)\,\lambda(s)\,ds.$$
The intensity process for N̄ (t) is then given by Ȳ (t)λ(t). The multiplicative
intensity model (Aalen, 1978) has as its cornerstone the product of the fully
observable quantity Ȳ (t) and the hazard rate, λ(t) which, typically, will involve
unknown model parameters. In testing specific hypotheses we might fix some of
these parameters at particular population values, most often the value zero.
Nonparametric statistics
The multiplicative intensity model just described and first recognized by Aalen
(1978) allows a simple expression, and simple inference, for a large number of
nonparametric statistics that have been used in survival analysis over the past half
century. Martingales are immediate candidates for forming an estimating equation
with which inference can be made on unknown parameters in the model. For
our specific applications, these estimating equations will almost always present
themselves in the form of a martingale. For example, the martingale structure
can be used to underpin several nonparametric tests. Using a martingale as an
estimating equation we can take N̄ (t) as an estimator of its compensator Ā(t).
Dividing by the risk set (assumed to be always greater than zero), we have:
$$\int_0^t \frac{d\bar N(s)}{\bar Y(s)} - \int_0^t \lambda(s)\,ds,$$
a martingale in its own right. Comparing two groups leads to the weighted two-sample statistic
$$K(t) = \int_0^t W(s)\left\{\frac{d\bar N_1(s)}{\bar Y_1(s)} - \frac{d\bar N_2(s)}{\bar Y_2(s)}\right\},$$
where a subscript 1 denotes subjects from group 1 and a 2 from group 2. The
choice of the weighting function W (s) can be made by the user and might be
chosen when some particular alternative is in mind. The properties of differ-
ent weights were investigated by Prentice (1978) and Harrington and Fleming
(1982). We can readily claim that K(∞) converges in distribution to N (0, σ²).
We estimate σ² by ⟨K⟩(∞) where
$$\langle K\rangle(t) = \int_0^t \frac{W^2(s)}{\bar Y_1^2(s)}\,d\bar N_1(s) + \int_0^t \frac{W^2(s)}{\bar Y_2^2(s)}\,d\bar N_2(s). \tag{B.28}$$
The choice W (s) = Ȳ1 (s)Ȳ2 (s)/[Ȳ1 (s) + Ȳ2 (s)] leads to the log-rank test statis-
tic and would maintain good power under a proportional hazards alternative of
a constant group difference as opposed to the null hypothesis of no group differ-
ences. The choice W (s) = Y¯1 (s)Y¯2 (s) corresponds to the weighting suggested
by Gehan (1965) in his generalization of the Wilcoxon statistic. This test may
offer improved power over the log-rank test in situations where the group differ-
ence declines with time. These weights and therefore the properties of the test
are impacted by the censoring, and in order to obtain a test free of the impact
of censoring, Prentice (1978) suggested the weights, W (s) = Ŝ1 (s)Ŝ2 (s). These
weights would also offer the potential for improved power when the regression
effect declines with time. These weights are also considered in the chapter on
test statistics.
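The following sketch puts the two-sample statistic and the variance estimate of Equation B.28 into code. It is our own illustration: the data-generating mechanism and the helper name weighted_logrank are assumptions made for the example, and the statistic is computed in its usual discrete form with the log-rank weight W = Ȳ1Ȳ2/(Ȳ1 + Ȳ2).

```python
import numpy as np

rng = np.random.default_rng(3)

def weighted_logrank(time, event, group):
    """Two-sample statistic K and its variance estimate with log-rank weights."""
    K, V = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        y1 = np.sum(at_risk & (group == 1))
        y2 = np.sum(at_risk & (group == 2))
        if y1 == 0 or y2 == 0:
            continue
        d1 = np.sum((time == t) & (event == 1) & (group == 1))
        d2 = np.sum((time == t) & (event == 1) & (group == 2))
        w = y1 * y2 / (y1 + y2)                      # log-rank weight W(s)
        K += w * (d1 / y1 - d2 / y2)                 # increment of K(t)
        V += w ** 2 * (d1 / y1 ** 2 + d2 / y2 ** 2)  # increment of <K>(t)
    return K, V

# simulated data: exponential failures, uniform censoring, no group difference
n = 200
group = np.repeat([1, 2], n // 2)
T = rng.exponential(1.0, size=n)
C = rng.uniform(0.5, 2.5, size=n)
time, event = np.minimum(T, C), (T <= C).astype(int)

K, V = weighted_logrank(time, event, group)
print("standardized statistic:", K / np.sqrt(V))     # approximately N(0,1) under H0
```

Replacing the weight line with w = y1 * y2 would give the Gehan version discussed above.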
Appendix C
Limit theorems
We outline the main theorems providing inference for sums of random variables.
The theorem of de Moivre-Laplace is a well-known special case of the central
limit theorem and helps provide the setting. Our main interest is in sums which
can be considered to be composed of independent increments. The empirical
distribution function Fn (t) is readily seen to be a consistent estimator for F (t)
at all continuity points of F (t). However, we can also view Fn (t) as a constant
number multiplying a sum of independent Bernoulli variates and this enables us
to construct inference for F (t) on the basis of Fn (t). Such inference can then
be extended to the more general context of estimating equations. Inference for
counting processes and stochastic integrals is described since this is commonly
used in this area and, additionally, shares a number of features with an approach
based on empirical processes.
The importance of estimating equations is stressed, in particular equations
based on the method of moments and equations derived from the likelihood.
Resampling techniques can also be of great value for problems in inference. All
of the statistics that arise in practical situations of interest can be seen quite
easily to fall under the headings described here. These statistics will have known
large sample distributions. We can then appeal immediately to known results from
Brownian motion and other functions of Brownian motion. Using this approach
to inference is reassuring since (1) the building blocks are elementary ones, well-
known to those who have followed introductory courses on inference (this is
not the case, for instance, for the martingale central limit theorem) and (2) we
obtain, as special cases, statistics that are currently widely used, the most notable
examples being the partial likelihood score test and weighted log-rank statistics.
However, we will obtain many more statistics, all of which can be seen to sit in a
single solid framework and some of which, given a particular situation of interest,
will suggest themselves as being potentially more suitable than others.
The majority of statistics of interest that arise in practical applications are directly
or indirectly (e.g., after taking the logarithm to some base) expressible as sums of
random variables. It is therefore of immense practical value that the distribution
theory for such sums can, in a wide variety of cases, be approximated by normal
distributions. Moreover, we can obtain some idea as to how well the approxima-
tion may be expected to behave. It is also possible to refine the approximation.
In this section we review the main limit theorems applicable to sums of random
variables.
Theorem of De Moivre-Laplace
Let $N_n = \sum_{i=1}^n X_i$ be the number of successes in n independent Bernoulli trials
Xi , each trial having probability of success equal to p. Then
$$\{N_n - np\}\big/\sqrt{np(1-p)} \;\to\; N(0, 1).$$
Less formally we state that x̄ converges to a normal distribution with mean μ and
variance σ 2 /n. This result is extremely useful and also quite general. For example,
applying the mean value theorem, then for g(x̄), where g(x) is a differentiable
function of x, we can see, using the same kind of informal statement, that g(x̄)
converges to a normal distribution with mean g(μ) and variance {g′(μ)}²σ²/n.
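A short simulation, constructed here only to illustrate the two statements above (the binomial success probability and the choice g(x) = log x are ours), checking the de Moivre-Laplace normalization and the delta-method variance {g′(μ)}²σ²/n.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, n_rep = 400, 0.3, 50000

# De Moivre-Laplace: standardized binomial counts behave like N(0,1)
N_n = rng.binomial(n, p, size=n_rep)
Z = (N_n - n * p) / np.sqrt(n * p * (1 - p))
print("mean, var of Z :", Z.mean(), Z.var())          # close to 0 and 1

# Delta method with g(x) = log(x): variance {g'(mu)}^2 sigma^2 / n
mu, sigma2 = p, p * (1 - p)
g_xbar = np.log(N_n / n)
print("empirical var  :", g_xbar.var())
print("delta method   :", (1 / mu) ** 2 * sigma2 / n)
```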
$$B_n^{-2}\sigma_n^2 \to 0\,, \qquad B_n \to \infty\,, \qquad \text{as } n \to \infty.$$
It can be fairly easily shown that the condition below implies the Lindeberg
condition and provides a more ready way of evaluating whether or not asymptotic
normality obtains.
Condition C.2. The Lindeberg condition holds if, for some k > 2,
$$B_n^{-k}\kappa_k \to 0 \quad \text{as } n \to \infty.$$
As we might guess, the conditions in this case are much more involved and we
need to use array notation in order to express the cross dependencies that are
generated. If we take an extreme case we see immediately why the dependen-
cies have to be carefully considered for, suppose $X_i = \alpha_{i-1}X_{i-1}$ where the $\alpha_i$
are nonzero deterministic coefficients such that $\prod_{1}^{n}\alpha_i \to 1$, then clearly $X_n$ con-
verges in distribution to X1 which can be any chosen distribution. In rough terms,
there needs to be enough independence between the variables for the result to
hold. Describing what is meant by “enough” is important in certain contexts,
time series analysis being an example, but, since it is not needed in this work, we
do not spend any time on it here. A special case of nonidentical distributions, of
value in survival analysis, is the following.
$$\frac{\max_{1\le i\le n} |a_i|}{\big(\sum_{j=1}^n a_j^2\big)^{1/2}} \;\to\; 0\,.$$
Many statistics arising in nonparametric theory come under this heading, e.g.,
linear sums of ranks. The condition is a particularly straightforward one to verify
and leads us to conclude large sample normality for the great majority of the com-
monly used rank statistics in nonparametric theory. A related condition, which is
sometimes of more immediate applicability, can be derived as a consequence of
the above large sample result together with an application of Slutsky’s theorem.
Suppose, as before, that $X_i$, $i = 1, 2, \ldots$, are independent random variables having
the same distribution $F(\cdot)$, that $\sigma^2 = \int u^2\,dF(u) < \infty$, $\mu = \int u\,dF(u)$ and that
$a_i$, $i = 1, \ldots, n$, are constants. Again, letting $S_n = n^{-1/2}\sum_{i=1}^n a_i(X_i - \mu)$ and
$\sigma^2_S(n) = \sigma^2\sum_{i=1}^n a_i^2/n$, then $S_n \to N(0, \sigma^2\alpha^2)$ where:
$$n^{-1}\sum_{i=1}^n a_i^2 \;\to\; \alpha^2 < \infty.$$
The condition is useful in that it will allow us to both conclude normality for
the linear combination Sn , and at the same time, provide us with a variance for
the linear combination. Weighted log-rank statistics and score statistics under
non-proportional hazards models can be put under this heading. The weights in
that case are not fixed in advance but, since the weights are typically bounded
and converge to given quantities, it is relatively straightforward to put in the
extra steps to obtain large sample normality in those cases too.
from which, letting t = k/n, we readily obtain the mean and the variance of Uk∗
as
$$E(U_k^*) = (\sigma\sqrt n)^{-1}\sum_{i\le k} E(X_i) = 0\ ; \qquad \mathrm{Var}(U_k^*) = (\sigma\sqrt n)^{-2}\,k\sigma^2 = t. \tag{C.1}$$
Although this and the following section are particularly simple, the reader should
make sure that he or she has a very solid understanding as to what is taking
place. It underscores all the main ideas behind the methods of inference that are
used. An example of such a process in which σ² = 1 and n = 30 is shown in
Figure C.1.
The important thing to note is that the increments are independent, implying
convergence to a Gaussian process. All we then need is the covariance process.
Figure C.1 and Figure C.2 represent approximations to Brownian motion in view
of discreteness and the linear interpolation. The figures indicate two realizations
from the above summed processes, and the reader is encouraged to carry out
his or her own such simulations, an easy exercise, and yet invaluable in terms
of building good intuition. An inspection of any small part of the curve (take,
for example, the curve between 0.30 and 0.31 where the curve is based on less
than 100 points), might easily be a continuous straight line, nothing at all like
the limiting process, Brownian motion. But imagine, as very often is the case
in applications, that we are only concerned about some simple aspect of the
process, for instance, the greatest absolute distance traveled from the origin for
the transformed process, tied down at t = 1. With as few as 30 observations our
intuition would correctly lead us to believe that the distribution of this maximum
will be accurately approximated by the same distribution, evaluated under the
assumption of a Brownian bridge. Of course, such a statement can be made more
technically precise via use, for example, of the law of the iterated logarithm or
the use of Berry-Esseen bounds.
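The simulation encouraged in the paragraph above might look as follows. It is a sketch under our own choices (standard normal increments, n = 30 and the comparison point d = 1): the partial sums are tied down at t = 1 and the distribution of the maximum absolute value is compared with the Brownian bridge (Kolmogorov) limit.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_rep = 30, 20000

X = rng.normal(0.0, 1.0, size=(n_rep, n))
U = np.cumsum(X, axis=1) / np.sqrt(n)            # U*_k with sigma = 1
t = np.arange(1, n + 1) / n
bridge = U - t * U[:, -1][:, None]               # tie the process down at t = 1
max_abs = np.abs(bridge).max(axis=1)

def kolmogorov_cdf(d, terms=100):
    """P(sup |Brownian bridge| <= d) = 1 - 2 * sum (-1)^{k+1} exp(-2 k^2 d^2)."""
    k = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1) ** (k + 1) * np.exp(-2 * k ** 2 * d ** 2))

d = 1.0
print("simulated P(max <= 1)  :", np.mean(max_abs <= d))
print("Brownian bridge limit  :", kolmogorov_cdf(d))
```

Even with only 30 increments the two probabilities are close, which is the point being made in the text.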
will look like the process defined above. In particular, straightforward manipula-
tion as above shows that $E(U_k^*) = 0$ and $\mathrm{Cov}(U_k^*, U_m^*) = t$ where $k < m$ and
t = k/n. We allow k to increase at the same rate, i.e., k = nt where 0 < t < 1.
As n → ∞ the number of possible values of t, t ∈ (0, 1) also increases without
limit to the set of all rationals on this interval. We can also suppose that as
k, n → ∞ ; k < n, such that k/n = t then σt2 converges almost everywhere to
some function σ 2 (t). We then allow n to increase without bound.
The functional central limit theorem states that the above process goes to
a limit. The limiting process is defined on the real interval. Choosing any set of
points {t1 , . . . , tk } , (0 < ti < 1 , i = 1, . . . , k) then the process Ut∗1 , Ut∗2 , . . . , Ut∗k
converges in distribution to the multivariate normal. As indicated above the
covariance only depends on the distance between points so that Cov {Us∗ , Ut∗ } =
s ; s < t. The basic idea is that the increments, making up the sum U ∗ (t), get
smaller and smaller as n increases. The increments have expectation equal to
zero unless there is drift. Also, the way in which the increments become smaller
with n is precisely of order $\sqrt n$. The variance therefore increases linearly with
time out in the process. In practical applications, it is only necessary that the
increments be independent and that these increments have a finite variance.
Figure C.2: Two independent simulations of sums of 500 points on interval (0,1).
Perhaps the most well-known application of the above is to the empirical distribu-
tion function. A great deal can be said about the empirical distribution function
by appealing to the large sample results that can be anticipated from Donsker’s
theorem. Brownian motion can be seen to be the limit process obtained by apply-
ing the functional central limit theorem. We make wide use of these results in
this book. Donsker’s theorem focuses on the case of linear interpolation of a
stochastic process having independent increments with the same distribution
and, specifically, having the same variance.
Theorem C.3. (Donsker, 1951). Let $(\xi_n)_{n\in\mathbb N}$ be a sequence of i.i.d. random
variables on the probability space $(\Omega, \mathcal F, P)$ such that $E(\xi_n) = 0$ and $V(\xi_n) = \sigma^2$
for all $n \in \mathbb N$. Then
$$X_n \xrightarrow[n\to\infty]{\mathcal L} \mathcal W,$$
where
$$X_n(t) = \frac{1}{\sigma\sqrt n}\sum_{i=1}^{\lfloor nt\rfloor} \xi_i + (nt - \lfloor nt\rfloor)\,\frac{1}{\sigma\sqrt n}\,\xi_{\lfloor nt\rfloor + 1}.$$
If we relax the assumption of identical distributions and limit our attention to
zero-mean random variables not having the same distribution, and specifically
having different variances, we can still apply a functional central limit theorem
described by Helland (1982). The main idea here is to view the sequences
of random variables as martingale differences with respect to a particular family
of σ-algebras. Unlike Donsker we do not use linear interpolation. In practice, of
course, linear interpolation does not add any essential restriction and important
properties, such as continuity, follow very easily.
Theorem C.4. (Helland, 1982). Let $\xi_{j,n}$ be a random variable defined on the
probability space $(\Omega, \mathcal F, P)$, for $j = 1, \ldots, n$. Let $\mathcal F_{j,n}$ be a σ-algebra such that
$\xi_{j,n}$ is $\mathcal F_{j,n}$-measurable and $\mathcal F_{j-1,n} \subset \mathcal F_{j,n} \subset \mathcal F$ for all $j = 2, \ldots, n$. Let $r_n$ be
a function defined on $\mathbb R^+$ such that $r_n(t)$ is a stopping time with respect to
$\mathcal F_{j,n}$, $j = 1, \ldots, n$. We suppose that the paths $r_n$ are right continuous and increasing
with $r_n(0) = 0$. Write
$$X_n(t) = \sum_{j=1}^{r_n(t)} \xi_{j,n}. \tag{C.2}$$
Let $f$ be a positive measurable function such that $\int_0^t f^2(s)\,ds < \infty$ for all $t > 0$. When
the following conditions are verified:

a. the $\xi_{j,n}$ are martingale differences, i.e., $E(\xi_{j,n} \mid \mathcal F_{j-1,n}) = 0$, $j = 1, \ldots, n$,

b. $\sum_{j=1}^{r_n(t)} E(\xi_{j,n}^2 \mid \mathcal F_{j-1,n}) \xrightarrow[n\to\infty]{P} \int_0^t f^2(s)\,ds$,

c. $\sum_{j=1}^{r_n(t)} E\{\xi_{j,n}^2\, I(|\xi_{j,n}| > \varepsilon) \mid \mathcal F_{j-1,n}\} \xrightarrow[n\to\infty]{P} 0$, for all $\varepsilon > 0$,

then
$$X_n \xrightarrow[n\to\infty]{\mathcal L} \int f\,d\mathcal W,$$
where $\int f\,d\mathcal W(t) = \int_0^t f(s)\,d\mathcal W(s)$ for $t \in \mathbb R^+$.
When f is constant and equal to one, the process Xn converges in distribution
toward standard Brownian motion. Condition (c) follows from the condition of
Lyapunov, denoted (c'), such that
$$\exists\, \delta > 0, \qquad \sum_{j=1}^{r_n(t)} E\big(|\xi_{j,n}|^{2+\delta} \,\big|\, \mathcal F_{j-1,n}\big) \xrightarrow[n\to\infty]{P} 0. \tag{C.3}$$
These somewhat elementary results provide the framework in which we can read-
ily anticipate the large sample behavior of the processes of interest to us. Such
behavior we outline specifically at the relevant place via key theorems.
Theorem C.5. (Helland, 1982). Let $n, m \in \mathbb N^*$. Let $\xi_{j,n}^{(l)}$ be a random variable
defined on the probability space $(\Omega, \mathcal F, P)$, for $j = 1, \ldots, n$ and $l = 1, \ldots, m$. We
assume that the sets $\{\xi_{j,n}^{(l)},\; j = 1, \ldots, n\}$, $n = 1, 2, \ldots$, are arrays of martingale
differences with respect to the increasing sequence of σ-algebras $(\mathcal F_{j,n})_{j=1,2,\ldots,n}$,
for $l = 1, \ldots, m$. Let $r_n$ be a function defined on $\mathbb R^+$ such that $r_n(t)$ is a stopping
time with respect to $\mathcal F_{j,n}$, $j = 1, \ldots, n$. We suppose also that the paths $r_n$ are
right continuous and increasing with $r_n(0) = 0$. We have:
$$X_n(t) = \big(X_n^{(1)}(t), \ldots, X_n^{(m)}(t)\big), \qquad X_n^{(l)}(t) = \sum_{j=1}^{r_n(t)} \xi_{j,n}^{(l)}, \quad l = 1, \ldots, m. \tag{C.4}$$
Let $f_1, \ldots, f_m$ be $m$ positive measurable functions such that $\int_0^t f_l^2(s)\,ds <
\infty$ for all $t > 0$ and $l = 1, \ldots, m$. When the following conditions hold, then, for all
$i, l = 1, \ldots, m$:

a. $\sum_{j=1}^{r_n(t)} E\big(\xi_{j,n}^{(l)}\xi_{j,n}^{(i)} \mid \mathcal F_{j-1,n}\big) \xrightarrow[n\to\infty]{P} 0$, for $t > 0$ and $l \ne i$,

b. $\sum_{j=1}^{r_n(t)} E\big\{\big(\xi_{j,n}^{(l)}\big)^2 \mid \mathcal F_{j-1,n}\big\} \xrightarrow[n\to\infty]{P} \int_0^t f_l^2(s)\,ds$,

c. $\sum_{j=1}^{r_n(t)} E\big\{\big(\xi_{j,n}^{(l)}\big)^2 I\big(|\xi_{j,n}^{(l)}| > \varepsilon\big) \mid \mathcal F_{j-1,n}\big\} \xrightarrow[n\to\infty]{P} 0$, for all $\varepsilon > 0$,

then $X_n \xrightarrow[n\to\infty]{\mathcal L} \big(\int f_1\,d\mathcal W_1, \ldots, \int f_m\,d\mathcal W_m\big)$ with respect to the product
topology of Skorokhod, where $\int_0^t f_l\,d\mathcal W_l = \int_0^t f_l(s)\,d\mathcal W_l(s)$ for $t \in \mathbb R^+$, $l = 1, \ldots, m$,
and $\mathcal W_1, \ldots, \mathcal W_m$ are $m$ independent Brownian motions.
The above results can be directly applied to the sample empirical distribution
function Fn (t), defined for a sample of size n (uncensored) to be the number of
observations less than or equal to t divided by n, i.e., $F_n(t) = n^{-1}\sum_{i=1}^n I(T_i \le t)$.
For each t, and we may assume F (t) to be a continuous function of t, we would
hope that Fn (t) converges to F (t) in probability. This is easy to see but, in fact,
we have stronger results, starting with the Glivenko-Cantelli theorem whereby
$$\sup_t |F_n(t) - F(t)| = \sup_t |S_n(t) - S(t)|$$
converges to zero with probability one and where Sn (t) = 1 − Fn (t). This is
analogous to the law of large numbers, and although important, is not all that
informative. A central limit theorem can tell us much more and this obtains
by noticing how Fn (t) will simulate a process relating to the Brownian bridge.
To see this it suffices to note that nFn (t), for each value of t, is a sum of
independent Bernoulli variables. Therefore, for each t as n → ∞, we have that
$\sqrt n\{F_n(t) - F(t)\}$ converges to normal with mean zero and variance F (t){1 −
F (t)}. We have marginal normality. However, we can claim conditional normality
in the same way, since, for each t and s (s < t), nFn (t) given nFn (s), is also a
sum of independent Bernoulli variables. Take k1 and k2 to be integers (1 < k1 <
k2 < n) such that k1 /n is the nearest rational smaller than or equal to s (i.e.,
$k_1 = \max\{j : j \in \{1, \ldots, n\},\ j \le ns\}$) and k2 /n is the nearest rational smaller than
or equal to t. Thus, k1 /n converges with probability one to s, k2 /n to t, and
k2 − k1 increases without bound at the same rate as n. We then have:
Theorem C.6. $\sqrt n\{F_n(t) - F(t)\}$ is a Gaussian process with mean zero and
covariance given by:
$$\mathrm{Cov}\big[\sqrt n\{F_n(s)\},\ \sqrt n\{F_n(t)\}\big] = F(s)\{1 - F(t)\}, \qquad s \le t. \tag{C.5}$$
It follows immediately that, for T uniform, the process $\sqrt n\{F_n(t) - t\}$ (0 ≤
t ≤ 1), converges in distribution to the Brownian bridge. But note that, whatever
the distribution of T , as long as it has a continuous distribution function, mono-
tonic increasing transformations on T leave the distribution of $\sqrt n\{F_n(t) - F(t)\}$
unaltered. This means that we can use the Brownian bridge for inference quite
generally. In particular, consider results of the Brownian bridge, such as the distri-
bution of the supremum over the interval (0,1), that do not involve any particular
value of t (and thereby F (t)). These results can be applied without modification
to the process $\sqrt n\{F_n(t) - F(t)\}$ whether or not F (t) is uniform. Among other
useful results concerning the empirical distribution function we have the statistic
$D_n = \sup_t |F_n(t) - F(t)|$. For the most common case in which F (t) is continuous,
we know that $\sup_t [F(t)\{1 - F(t)\}]^{1/2}$ is equal to 0.5. As an illustration, for 50 subjects, we find that
Dn is around 0.12. For 50 i.i.d. observations coming from some known or some
hypothesized distribution, if the hypothesis is correct then we expect to see the
greatest discrepancy between the empirical and the hypothesized distribution to
be close to 0.12. Values far removed from that might then be indicative of either
a rare event or that the assumed distribution is not correct. Other quantities,
indicating how close to F (t) we can anticipate Fn (t) to be, are of interest, one
in particular being;
$$C_n = n\int_0^\infty \{F_n(t) - F(t)\}^2\, f(t)\,dt.$$
As for Dn the asymptotic distribution of Cn does not depend upon F (t). For
this case the law of the iterated logarithm is expressed as follows:
Theorem C.8. With probability one,
$$\lim_{n\to\infty} \frac{C_n}{2\log\log n} = \frac{1}{\pi^2}. \tag{C.7}$$
The results for both Dn and Cn are large sample ones but can nonetheless
provide guidance when dealing with actual finite samples. Under assumed models
it is usually possible to calculate the theoretical distribution of some quantity
which can also be observed. We are then able to contrast the two and test the
plausibility of given hypotheses.
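As a quick numerical companion to the statements about Dn and Cn above (a sketch of our own; standard uniform samples suffice since both statistics are distribution-free, and the sample size of 50 matches the illustration in the text):

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_rep = 50, 5000

U = np.sort(rng.uniform(size=(n_rep, n)), axis=1)
i = np.arange(1, n + 1)

# D_n = sup_t |F_n(t) - t| for uniform data
D_n = np.maximum(np.abs(i / n - U), np.abs((i - 1) / n - U)).max(axis=1)

# C_n via the usual computing formula for the Cramer-von Mises statistic
C_n = 1.0 / (12 * n) + np.sum((U - (2 * i - 1) / (2 * n)) ** 2, axis=1)

print("average D_n:", D_n.mean())   # close to 0.12 for n = 50
print("average C_n:", C_n.mean())   # close to 1/6, the limiting mean
```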
Appendix D
Inferential tools
Most researchers, together with a large section of the general public, even
if uncertain as to what the study of statistics entails, will be familiar with
the concept, if not the expression itself, of an average of the type $\bar T = n^{-1}\sum_{i=1}^n T_i$.
The statistician may formulate this in somewhat more abstract terms, stating that
$\bar T = n^{-1}\sum_{i=1}^n T_i$ is a solution to the linear estimating equation for the parameter μ,
the population mean of the random variable T, in terms of the n i.i.d.
replicates of T. The estimating equation is simply $\mu - n^{-1}\sum_{i=1}^n T_i = 0$. This
basic idea is very useful in view of the potential for immediate generalization.
The most useful approach to analyzing data is to postulate plausible mod-
els that may approximate some unknown, most likely very complex, mechanism
generating the observations. These models involve unknown parameters and we
use the observations, in conjunction with an estimating equation, to replace the
unknown parameters by estimates. Deriving “good” estimating equations is a
sizeable topic whose surface we only need to scratch here. We appeal to some
general principles, the most common of which are very briefly recalled below,
and note that, unfortunately, the nice simple form for the estimating equation
for μ just above is more an exception than the rule. Estimating equations are
mostly nonlinear and need to be solved by numerical algorithms. Nonetheless, an
understanding of the linear case is more helpful than it may at first appear since
solutions to the nonlinear case are achieved by local linearization (also called the
Newton-Raphson approximation) in the neighborhood of the solution. A funda-
mental result in the theory of estimation is described in the following theorem.
Firstly, we define two important functions, L(θ) and I(θ), of the parameter θ by
$$L(\theta) = f(t_1, \ldots, t_n; \theta), \qquad I(\theta) = -\,\frac{\partial^2 \log L(\theta)}{\partial\theta^2}.$$
We refer to L(θ) as the observed likelihood, or simply just the likelihood (note
that, for n = 1, the expected log-likelihood is the negative of the entropy, also
called the information). When the observations Ti , i = 1, . . . , n, are independent
and identically distributed then we can write $L(\theta) = \prod_{i=1}^n f(t_i; \theta)$ and
$\log L(\theta) = \sum_{i=1}^n \log f(t_i; \theta)$. We refer to I(θ) as the information in the sam-
ple. Unfortunately the negative of the entropy is also called the information (the
two are of course related, both quantifying precision in some sense). The risks
of confusion are small given that the contexts are usually distinct. The function
I(θ) is random because it depends on the data and reaches a maximum in the
neighborhood of θ0 since this is where the slope of the log-likelihood is changing
the most quickly.
$$\mathrm{Var}(T) \;\ge\; \frac{\{\partial E(T)/\partial\theta\}^2}{E\{I(\theta)\}}.$$
This inequality is called the Cramer-Rao inequality (Cox and Hinkley, 1979).
When T is an unbiased estimate of θ then ∂ E(T )/∂θ = 1 and Var(T ) ≥
1/E{I(θ)}. The quantity 1/E{I(θ)} is called the Cramer-Rao bound. Taking
the variance as a measure of preciseness then, given unbiasedness, we prefer the
estimator T that has the smallest variance. The Cramer-Rao bound provides the
best that we can do in this sense, and below, we see that the maximum likeli-
hood estimator achieves this for large samples, i.e., the variance of the maximum
likelihood estimator becomes progressively closer to the bound as sample size
increases without limit.
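A small simulation of the statement just made, using an exponential model with rate θ as a convenient example of our own choosing; the variance of the maximum likelihood estimator approaches the Cramer-Rao bound θ²/n as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(7)
theta0 = 2.0   # true rate of the exponential model f(t; theta) = theta * exp(-theta * t)

for n in (10, 50, 250):
    T = rng.exponential(1.0 / theta0, size=(20000, n))
    theta_hat = 1.0 / T.mean(axis=1)              # maximum likelihood estimator
    bound = theta0 ** 2 / n                       # 1 / E{I(theta0)} = theta0^2 / n
    print(n, "var(mle):", round(theta_hat.var(), 4), " CR bound:", round(bound, 4))
```

For small n the variance exceeds the bound noticeably; the ratio tends to one as n increases.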
Basic equations
For a scalar parameter θ0 we will take some function U (θ) that depends on the
observations as well as θ. We then use U (θ) to obtain an estimate θ̂ of θ0 via
an estimating equation of the form
U (θ̂) = 0. (D.2)
This is too general to be of use and so we limit the class of possible choices of
U (·). We require the first two moments of U to exist in which case, without loss
of generality, we can say
$$E\{U(\theta_0)\} = 0.$$
Two widely used methods for constructing U (θ) are described below. It is quite
common that U be expressible as a sum of independent and identically distributed
contributions, Ui , each having a finite second moment. An immediate applica-
tion of the central limit theorem then provides the large sample normality for
U (θ0 ). For independent but nonidentically distributed Ui , it is still usually not
difficult to verify the Lindeberg condition and apply the central limit theorem
for independent sums. Finally, in order for inference for U to carry over to θ̂,
some further weak restrictions on U will be all we need. These require that U
be monotone and continuous in θ and differentiable in some neighborhood of θ0 .
This is less restrictive than it sounds. In practice it means that we can simply
apply the mean value theorem (A.2) whereby:
Corollary D.1. For any ε > 0, when θ̂ lies in an interval (θ0 − ε, θ0 + ε) within
which U (θ) is continuously differentiable, then there exists a real number ξ ∈
(θ0 − ε, θ0 + ε) such that
$$U(\hat\theta) = U(\theta_0) - (\hat\theta - \theta_0)\,I(\xi).$$
This expression is useful for the following reasons. A likelihood for θ will, with
increasing sample size, look more and more normal. As a consequence, I(ξ) will
look more and more like a constant, depending only on sample size, and not ξ
itself. This is useful since ξ is unknown. We approximate I(ξ) by I(θ̂). We can
then express θ̂ in terms of approximate constants and U (θ̂) whose distribution
we can approximate by a normal distribution.
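The local linearization described above can be sketched as follows for a simple case of our own choosing: the score equation U(θ) = n/θ − ΣTi of an exponential model, solved by Newton-Raphson steps that use I(θ) = −U′(θ).

```python
import numpy as np

rng = np.random.default_rng(8)
T = rng.exponential(1.0 / 2.0, size=100)      # data with true rate 2

# Score equation for the exponential rate and the corresponding information
U = lambda th: len(T) / th - T.sum()          # U(theta) = n/theta - sum(T)
I = lambda th: len(T) / th ** 2               # I(theta) = -U'(theta)

theta = 1.0                                   # starting value
for _ in range(8):
    theta = theta + U(theta) / I(theta)       # Newton-Raphson step

print("theta_hat   :", theta, " closed form:", 1.0 / T.mean())
print("approx s.e. :", 1.0 / np.sqrt(I(theta)))   # from the observed information
```

The iteration converges to the closed-form estimate 1/T̄, and the final information evaluation supplies the usual approximate standard error.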
Finding equations
The guiding principle is always the same, that of replacing unknown parameters
by values that minimize the distance between empirical (observed) quantities
and their theoretical (model-based) equivalents. The large range of potential
choices stem from two central observations: (1) there can be many different
definitions of distance (indeed, the concept of distance is typically made wider
than the usual mathematical one which stipulates that the distance between a
and b must be the same as that between b and a) and (2) there may be a number
of competing empirical and theoretical quantities to consider. To make this more
concrete, consider a particular situation in which the mean is modeled by some
parameter θ such that Eθ (T ) is monotone in θ. Let’s say that the true mean
E(T ) corresponds to the value θ = θ0 . Then the mean squared error, variance
about a hypothesized Eθ (T ), let's say σ²(θ), can be written as
$$\sigma^2(\theta) = E\{T - E_\theta(T)\}^2,$$
a quantity minimized at θ = θ0 .
Minimizing Dn (θ) with respect to θ will often provide a good, although not
necessarily very tractable, estimating equation. The same will apply to Cn and
related expressions such as the Anderson-Darling statistic. We will see later that
the so-called partial likelihood estimate for the proportional hazards model can
be viewed as an estimate arising from an empirical process. It can also be seen
as a method of moments estimate and closely relates to the maximum likelihood
estimate. Indeed, these latter two methods of obtaining estimating equations are
those most commonly used, and in particular, the ones given the closest attention
in this work. It is quite common for different techniques, and even contending
approaches from within the same technique, to lead to different estimators. It is
not always easy to argue in favor of one over the others.
Other principles can sometimes provide guidance in practice, the principle
of efficiency holding a strong place in this regard. The idea of efficiency is to
minimize the sample size required to achieve any given precision, or equivalently,
to find estimators having the smallest variance. However, since we are almost
always in situations where our models are only approximately correct, and on
occasion, even quite far off, it is more useful to focus attention on other qualities
of an estimator. How can it be interpreted when the data are generated by a
mechanism much wider than that assumed by the model? How useful is it to
us in our endeavor to build predictive models, even when the model is, at least
to some extent, incorrectly specified. This is the reality of modeling data and
efficiency, as an issue for us to be concerned with, does not take us very far. On
the other hand, estimators that have demonstrably poor efficiency, when model
assumptions are correct, are unlikely to redeem themselves in a broader context
and so it would be a mistake to dismiss efficiency considerations altogether even
though they are rather limited.
Method of moments
This very simple method derives immediately as an application of the Helly-Bray
theorem (Theorem A.3). The idea is to equate population moments to empirical
ones obtained from the observed sample. Given that $\mu = \int x\,dF(x)$, the above
example is a special case since we can write $\bar\mu = \bar x = \int x\,dF_n(x)$. Properties of the
estimate can be deduced from the well-known properties of Fn (x) as an estimate
of F (x) (see Section C.5). For the broad exponential class of distributions, the
method of moments estimator, based on the first moment, coincides with the
maximum likelihood estimator recalled below. In the survival context we will
see that the so-called partial likelihood estimator can be viewed as a method of
moments estimator. The main difficulty with method of moments estimators is
that they are not uniquely defined for any given problem. For example, suppose
we aim to estimate the rate parameter λ from a series of observations,
assumed to have been generated by a Poisson distribution. We can use either the
empirical mean or the empirical variance as an estimate for λ. Typically they will
not be the same. Indeed we can construct an infinite class of potential estimators
as linear combinations of the two.
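To illustrate the non-uniqueness just mentioned, a minimal sketch with simulated Poisson counts (the sample size and rate are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(9)
lam = 3.5
x = rng.poisson(lam, size=200)

lambda_mean = x.mean()          # moment estimator based on E(X) = lambda
lambda_var = x.var(ddof=1)      # moment estimator based on Var(X) = lambda
print("mean-based:", lambda_mean, " variance-based:", lambda_var)
# Any convex combination a*lambda_mean + (1-a)*lambda_var is also consistent.
```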
normality for U (θ̂n ). By expressing θ̂n as a smooth (not necessarily explicit) func-
tion of U we can also then claim large sample normality for θ̂n . The fact that
U (θ̂n ) (having subtracted off the mean and divided by its standard deviation)
will converge in distribution to a standard normal and that a smooth function
of this, notably θ̂n , will do the same, does not mean that their behavior can be
considered to be equivalent. The result is a large sample one, i.e., as n tends to
infinity, a concept that is not so easy to grasp, and for finite samples, behavior
will differ. Since U is a linear sum we may expect the large sample approxima-
tion to be more accurate, more quickly, than for θ̂n itself. Inference is often more
accurate if we work directly with U (θ̂n ), exploiting the monotonicity of U (·), and
inverting intervals for U (θ̂n ) into intervals for θ̂n . In either case we need some
expression for the variance and this can be obtained from the second important
theorem:
achieved via the constraint $\sum_{i=1}^n W_i = 1$. We might wish to make the condition,
Cov (Wi , Ui ) = 0, a large sample rather than a finite sample result and
that would also work providing the rate of convergence is higher than $\sqrt n$. Suppose
that $X_i$, $i = 1, \ldots, n$, are standard exponential variates and we define $W_i$ by:
$W_i = X_i\big/\sum_{j=1}^n X_j$.
Rubin (1981) and, more recently, Xu and Meeker (2020) studied inference
based on such weighted estimating equations. Inference is fully efficient and
particularly easy to carry out. We can view inference as being conditional upon the
data, i.e., the observations T1 ... Tn . Rather than consider theoretical replications
of U1 ... Un , we treat these as being fixed. In particular, we can write down the
empirical distribution for the ordered U(1) ... U(n) . If we then make an appeal to
the probability integral transform (Appendix A.6), we see that the Wi , the gaps
between the cumulative observations, i.e., the empirical distribution function, are
exactly distributed as described above. This is a consequence of the fact that the
gaps between uniform order statistics are distributed as: $W_i = X_i\big/\sum_{j=1}^n X_j$
(Appendix A.9). Large sample inference will agree to a very high degree with
inference based on normal approximations, and for finite samples, we can often
obtain more precise and more robust results.
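A minimal sketch of the weighted scheme described above, applied to the linear estimating equation for the mean; the data-generating distribution, the number of replications and the use of percentile limits of the replicates are assumptions made here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
T = rng.lognormal(0.0, 1.0, size=60)            # observed data, held fixed
n, B = len(T), 5000

reps = np.empty(B)
for b in range(B):
    X = rng.exponential(1.0, size=n)            # standard exponential variates
    W = X / X.sum()                             # weights W_i, summing to one
    reps[b] = np.sum(W * T)                     # root of sum_i W_i (T_i - mu) = 0

print("point estimate :", T.mean())
print("weighted 95% CI:", np.percentile(reps, [2.5, 97.5]))
print("normal 95% CI  :", T.mean() + np.array([-1.96, 1.96]) * T.std(ddof=1) / np.sqrt(n))
```

The two intervals agree closely for well-behaved data; the weighted version remains usable when normal approximations are shaky.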
Three further advantages of such an approach to inference for linear esti-
mating equations are worthy of mention. The first is that we can either do any
calculations analytically, given the particularly simple form of the exponential
distribution, or via simulated replications which, again, are very simple to set up.
For small samples where the bootstrap may fail as a result of unboundedness of
the estimator itself, use of the continuously weighted estimating equation will
avoid such difficulties. The bootstrap for survival data is described below and it
is not difficult to see how the non-zero chance of obtaining samples with infinite
parameter estimates can cause problems. The wild bootstrap is an alternative
way around this problem but we do not investigate that here.
Finally, when a parameter is defined in terms of large sample results for an
estimating equation, we are able to interpret what is being estimated when our
model assumptions are not correct. But, if our estimating equations are linear, we
can do more and we can use these weighted linear estimating equations to make
valid inferences even when the data are generated by a model that is different
from that assumed. This is of great value in cases where we assume some narrow
model, e.g., proportional hazards, but the mechanism generating the observations
is a broader one.
which is the Cramer-Rao inequality (Cox and Hinkley, 1979). When T is an unbi-
ased estimate of θ then ∂ E(T )/∂θ = 1 and Var(T ) ≥ 1/E{I(θ)}. The quantity
1/E{I(θ)} is called the Cramer-Rao bound and it provides the best that we can
do. This bound divided by the variance of any alternative unbiased estimator
measures the relative efficiency of the estimator. This would depend on sample
size n and, when this ratio achieves a limit as n increases without bound then
we call this the asymptotic relative efficiency. Since the ratio of the variance of
the maximum likelihood estimator and this bound tends to one with increasing
n we can see that the maximum likelihood estimator is fully efficient.
e(T1 , T2 ) = n1 /n2 .
In this expression, if we know which is the more powerful test, then this would
correspond to n1 so that this ratio lets us know how much extra work is needed,
in terms of sample size, if we prefer to use test T2 .
Bootstrap resampling
The purpose of bootstrap resampling is twofold: (1) to obtain more accurate
inference, in particular more accurate confidence intervals, than is available via
the usual normal approximation, and (2) to facilitate inference for parameter
estimators in complex situations. A broad discussion including several challenging
applications is provided by Politis (1998). Here we will describe the basic ideas in
so far as they are used for most problems arising in survival analysis. Consider the
empirical distribution function Fn (t) as an estimate for the unknown distribution
function F (t). The observations are T1 , T2 , ... , Tn . A parameter of interest,
such as the mean, the median, some percentile, let’s say θ, depends only on F .
This dependence can be made more explicit by writing θ = θ(F ). The core idea
of the bootstrap can be summarized via the simple expression θ̃ = θ(Fn ) as an
estimator for θ(F ).
Taking infinitely many i.i.d. samples, each of size n, from F would provide
us with the exact sampling properties of any estimator θ̃ = θ(Fn ). If, instead
of taking infinitely many samples, we were to take a very large number, say
B, of samples, each sample again of size n from F , then this would provide
us with accurate approximations to the sampling properties of θ̃, the errors of
the approximations diminishing to zero as B becomes infinitely large. Since F
is not known we are unable to carry out such a prescription. However, we do
have available our best possible estimator of F (t), the empirical distribution
function Fn (t). The bootstrap idea is to sample from Fn (t), which is known and
available, instead of from F (t) which, apart from theoretical investigations, is
typically unknown and unavailable.
To do this we draw observations with replacement from T1 , T2 , ..., Tn to obtain a
resample having size n. We repeat this whole process B times where B is a large
number, typically in the thousands. Each sample is viewed as an i.i.d. sample
from Fn (t). The i th resample of size n can be written $T^*_{1i}, T^*_{2i}, \ldots, T^*_{ni}$ and has
empirical distribution $F_n^{*i}(t)$. For any parameter of interest θ, the mean, median,
coefficient of variation for example, it is helpful to remind ourselves of the several
quantities of interest, θ(F ), θ(Fn ), θ(Fn∗i ) and FB (θ), the significance of each of
these quantities needing a little explanation. First, θ(F ) is simply the population
quantity of interest. Second, θ(Fn ) is this same quantity defined with respect to
the empirical distribution of the data T1 , ... , Tn . Third, θ(Fn∗i ) is again the same
quantity defined with respect to the i th empirical distribution of the resamples
$T^*_{1i}, T^*_{2i}, \ldots, T^*_{ni}$. Finally, $F_B(\theta)$ is the bootstrap distribution of $\theta(F_n^{*i})$, i.e., the
empirical distribution of $\theta(F_n^{*i})$ $(i = 1, \ldots, B)$.
To keep track of our asymptotic thinking we might note that, as $B \to \infty$,
$\int u\,dF_B(u)$ converges in probability to θ(Fn ) and, as $n \to \infty$, θ(Fn ) converges in
probability to θ(F ). Thus, there is an important conceptual distinction between
FB and the other distribution functions. These latter concern the distribution of
the original observations or resamples of these observations. FB itself deals with
the distribution of $\theta(F_n^{*i})$ $(i = 1, \ldots, B)$ and therefore, when our focus of interest
changes from one parameter to another, from say θ1 to θ2 the function FB will
be generally quite different. This is not the case for F , Fn , and Fn∗i which are
not affected by the particular parameter we are considering. Empirical quantities
with respect to the bootstrap distribution, FB are evaluated in a way entirely
analogous to those evaluated with respect to Fn . For example,
$$\mathrm{Var}\{\theta(F_n)\} = \sigma_B^2 = \int u^2\,dF_B(u) - \left\{\int u\,dF_B(u)\right\}^2, \tag{D.6}$$
where it is understood that the variance operator, Var() is with respect to the
distribution FB (t). Of greater interest in practice is the fact that Var {θ(Fn )},
where Var(·) is with respect to the distribution FB (t), can be used as an esti-
mator of Var {θ(Fn )}, where Var(·) is with respect to the distribution F (t).
which obtains from a rearrangement of the expression $\Pr[\sigma z_{\alpha/2} < \theta(F_n) - \theta <
\sigma z_{1-\alpha/2}] = 1 - \alpha$. Since σ² is not generally available we would usually work
with $\sigma^2_B$. Instead of using the normal approximation it is possible to define $Q_\alpha$
to be the α-quantile of the bootstrap distribution of $\theta(F_n^{*i}) - \theta(F_n)$ and to base
interval limits directly upon these quantiles.
These intervals are called percentile bootstrap confidence intervals. Whenever the
distribution FB is symmetric then the root intervals and the percentile intervals
coincide since Qα/2 + Q1−α/2 = 0. In particular, they coincide if we make a
normal approximation.
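A minimal sketch of the resampling scheme and the percentile intervals just described, taking the median as the parameter of interest (an arbitrary choice) and the bootstrap variance as in Equation D.6.

```python
import numpy as np

rng = np.random.default_rng(11)
T = rng.exponential(1.0, size=80)               # original sample from F
n, B = len(T), 4000

theta_hat = np.median(T)                        # theta(F_n)
boot = np.array([np.median(rng.choice(T, size=n, replace=True)) for _ in range(B)])

sigma2_B = boot.var()                           # bootstrap variance, as in (D.6)
print("theta(F_n)        :", theta_hat)
print("bootstrap variance:", sigma2_B)
print("percentile 95% CI :", np.percentile(boot, [2.5, 97.5]))
```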
and we then consider the standardized distribution of the quantity $\theta(F_n^{*i})/\sigma_{*i}$.
The essence of the studentized approach, having thus standardized, is to use the
bootstrap sampling force to focus on the higher-order questions, those concern-
ing bias and skewness in particular. Having, in some sense, spared our bootstrap
resources from being dilapidated to an extent via estimation of the mean and
variance, we can make real gains in accuracy when applying the resulting distri-
butional results to the statistic of interest. For most day-to-day situations this is
probably the best approach to take. It is computationally intensive (no longer a
serious objection) but very simple conceptually. An alternative approach, with the
same ultimate end in mind, is not to standardize the statistic but to make adjust-
ments to the derived bootstrap distribution, the adjustments taking into account
any bias and skewness. These lead to the so-called bias corrected, accelerated
intervals (often written as BCa intervals).
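As an illustration of the studentized idea, the following sketch uses the sample mean, for which a within-resample standard error is available in closed form; the data and all constants are hypothetical. This is the usual bootstrap-t construction, given here as a generic example rather than as a transcription of any particular formula in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data; theta is taken to be the mean so that a standard error
# is available in closed form within each resample (s / sqrt(n)).
t = rng.exponential(scale=2.0, size=50)
n, B, alpha = len(t), 2000, 0.05

theta_hat = t.mean()
se_hat = t.std(ddof=1) / np.sqrt(n)

# Studentized bootstrap statistics: the resampling effort is spent on the shape
# (bias, skewness) of the standardized quantity rather than on its scale.
t_star = np.empty(B)
for b in range(B):
    r = rng.choice(t, size=n, replace=True)
    se_star = r.std(ddof=1) / np.sqrt(n)
    t_star[b] = (r.mean() - theta_hat) / se_star

q_lo, q_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
studentized_ci = (theta_hat - q_hi * se_hat, theta_hat - q_lo * se_hat)
print(studentized_ci)
```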
Empirical likelihood in which certain parameters are assumed fixed at some value
will also come under this heading. We next describe an interesting approach to
inference in which those aspects of the observations that impact the precision
and not the location of the parameter of interest could be assumed fixed at the
values observed in the data.
Conditional likelihood
A conditional likelihood, which we may like to view as a conditional density, sim-
ply fixes some parameters at certain values. An obvious example is that of a
parameter that is a function of some variance, a quantity which indicates the
precision in an estimate of other parameters but tells us nothing about where
such parameters may be located. It has often been argued that it is appropriate
to condition on such quantities. The variance parameter is then taken as fixed
and known and the remaining likelihood is a conditional likelihood.
Cox (1958) gives an example of two instruments measuring the same quantity
but having very different precision. His argument, which is entirely persuasive,
says that we should use the conditional likelihood and not the full likelihood
that involves the probabilities of having chosen one or the other instrument.
The fact that we may have chosen the less precise instrument a certain proportion
of the time is neither here nor there. All that matters are the observations and
the particular instruments from which they arise. Another compelling example
arises in binomial sampling. It is not at all uncommon that we would not know
in advance the exact number n of repetitions of the experiment. However, we
rarely would think of working with a full likelihood in which the distribution of n
is explicitly expressed. Typically, we would simply use the actual value of n that
we observed. This amounts to working with a conditional likelihood.
Fisher (1934) derived an exact expression for the distribution of the maximum
likelihood estimate for a location or scale parameter conditional on observed
spread in the data. Fisher’s expression is particularly simple and corresponds to
the likelihood function itself standardized so that the integral with respect to
the parameter over the whole of the parameter space is equal to one. It is quite
an extraordinary result in its generality and the fact that it is exact. Mostly we
are happy to use large sample approximations based on central limit theorems
for the score statistic and Taylor series approximations applied to a development
around the true value of an unknown parameter. Here we have an exact result
regardless of how small our sample is as long as we are prepared to accept the
argument that it is appropriate to condition on the observed spread, the so-called
“configuration” of the sample. For a model in which the parameter of interest is a
location parameter, the configuration of the sample is simply the set of distances
between the observations and the empirical mean. For a location-scale family
this set consists of these same quantities standardized by the square root of the
variance.
Fisher’s results were extended to the very broad class of models coming under
the heading of curved exponential family models (Efron et al., 1978). This exten-
sion was carried out for more general situations by conditioning on the observed
information in the sample (a quantity analogous to the information contained in
the set of standardized residuals). Although no longer exact, the result
turns out to be very accurate and has been studied extensively by Barndorff-
Nielsen and Hall (1988). For the proportional hazards model we can use these
results to obtain g(β) as the conditional distribution of the maximum likelihood
estimator via the expression
\[
g(u) = \prod_{i} \pi(u, X_i) \Big/ \int_{u} \prod_{i} \pi(u, X_i)\, du.
\]
In Figure D.1 we illustrate the function g(u), i.e., the estimated conditional dis-
tribution of the maximum likelihood estimator for the proportional hazards model
given the two sample data from the Freireich study. It is interesting to compare
the findings with those available via more standard procedures. Rather than base
inference on the mode of this distribution (i.e., the maximum partial likelihood
estimate) and the second derivative of the log-likelihood we can consider
\[
\tilde{\beta} = \int_{u} u\, g(u)\, du, \qquad v(\beta) = \int_{u} u^2\, g(u)\, du - \left(\int_{u} u\, g(u)\, du\right)^2,
\]
and base tests, point, and interval estimation on these. Estimators of this kind
have been studied extensively by Pitman (1948), and generally, have smaller mean
squared error than estimators based on the mode. Note also, in a way analogous
to the calculation of bootstrap percentile intervals, that we can simply consider
the curve g(u) and tick off areas of size α/2 on the upper and lower halves of
the function to obtain a 100(1 − α)% confidence interval for β. Again, analogous
to bootstrap percentile intervals, these intervals can pick up asymmetries and
provide more accurate confidence intervals for small samples than those based
on the symmetric large sample normal approximation. For the Freireich data,
agreement between the approaches is good and that is to be expected since, in
this case, the normal curve is clearly able to give a good approximation to the
likelihood function.
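The construction of g(u) lends itself to direct computation. The sketch below evaluates the standardized partial likelihood on a grid for simulated two-group data (standing in for a real example such as the Freireich study, whose values are not reproduced here), then obtains the mean β̃, the variance v, and an interval by ticking off areas of size α/2 in each tail; the grid limits, sample size and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated two-group data (not the Freireich data): group indicator Z in {0, 1},
# exponential times with hazard exp(beta_true * Z), independent censoring.
n, beta_true = 40, 1.0
Z = np.repeat([0.0, 1.0], n // 2)
T = rng.exponential(1.0 / np.exp(beta_true * Z))
C = rng.exponential(2.0, size=n)
X = np.minimum(T, C)                      # observed time
delta = (T <= C).astype(float)            # failure indicator

def log_partial_likelihood(beta):
    # log of the partial likelihood: sum over failures of
    # beta*Z_i - log(sum over the risk set of exp(beta*Z_j))
    ll = 0.0
    for i in range(n):
        if delta[i] == 1:
            at_risk = X >= X[i]
            ll += beta * Z[i] - np.log(np.sum(np.exp(beta * Z[at_risk])))
    return ll

# g(u): the partial likelihood standardized so that its area is one.
u = np.linspace(-3.0, 5.0, 801)
du = u[1] - u[0]
logL = np.array([log_partial_likelihood(b) for b in u])
L = np.exp(logL - logL.max())             # rescaled for numerical stability
g = L / (L.sum() * du)

beta_tilde = np.sum(u * g) * du                      # mean of g
v = np.sum(u**2 * g) * du - beta_tilde**2            # variance of g

# Interval obtained by ticking off areas of size alpha/2 in each tail of g.
G = np.cumsum(g) * du
lower, upper = u[np.searchsorted(G, 0.025)], u[np.searchsorted(G, 0.975)]
print(beta_tilde, v, (lower, upper))
```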
For higher dimensions the approach can be extended almost immediately, at
least conceptually. A potential difficulty arises in the following way: suppose we
are interested in β1 and we have a second parameter, β2 , in the model. Logically
we would just integrate the two-dimensional density with respect to β2 , leaving
us with a single marginal density for β1 . This is straightforward since we have
all that we need to do this, although of course there will usually be no analytical
solution and it will be necessary to appeal to numerical techniques. The difficulty
occurs since we could often parameterize a model differently, say incorporating
the square of β2 instead of β2 alone. The marginal distribution for β1, after having integrated
out β2, will not generally be the same as before. Nonetheless, for the proportional
hazards model at least, the parameterization is quite natural and it would suffice
to work with the models as they are usually expressed.

Figure D.1: The standardized (i.e., area integrates to one) partial likelihood func-
tion for the proportional hazards model based on the Freireich data.
Partial likelihood
Again, starting from a full likelihood, or full density, we can focus interest on
some subset, “integrating out” those parameters of indirect interest. Integrating
the full density (likelihood) with respect to those parameters of secondary interest
produces a marginal likelihood. Cox (1975) develops a particular expression for
the full likelihood in which an important component term is called the partial
likelihood. Careful choices lead us back to conditional likelihood and marginal
likelihood and Cox then describes these as special cases. For a given problem, Cox
provides some guidelines for finding partial likelihoods: (1) the omitted factors
should have distributions depending in an essential way on nuisance parameters
and should contain no information about the parameters of interest and (2)
incidental parameters, in particular the nuisance parameters, should not occur in
the partial likelihood.
Lemma D.1. For the proportional hazards model with constant effects:
\[
L(\beta) = \prod_{i=1}^{n} \left\{ \frac{\exp(\beta Z_i)}{\sum_{j=1}^{n} Y_j(X_i)\exp(\beta Z_j)} \right\}^{\delta_i}. \qquad \text{(D.10)}
\]
With this expression in hand we are able to verify the extent to which the guidelines apply. Unfortunately, at least
when compared to usual likelihood, marginal and conditional likelihood, or profile
likelihood (maximizing out rather than integrating out nuisance parameters),
partial likelihood is a very difficult concept. For general situations, it is not at
all clear how to proceed. Furthermore, however obtained, such partial likelihoods
are unlikely to be unique. Unlike the more commonly used likelihoods, great
mathematical skill is required and even well steeled statistical modelers, able to
obtain likelihood estimates in complex applied settings, will not generally be able
to do the same should they wish to proceed on the basis of partial likelihood.
In the counting process framework of Andersen et al. (1993), some six pages (pages
103–109) are required in order to formulate an appropriate partial likelihood.
However, none of this impedes our development since the usual reason for
seeking an expression for the likelihood is to be able to take its logarithm, differ-
entiate it with respect to the unknown parameters, and equating this to zero, to
enable the construction of an estimating equation. Here we already have, via the
main theorem (Section 7.5), or stochastic integral considerations, appropriate
estimating equations. Thus, and in light of the above mentioned difficulties, we
do not put emphasis on partial likelihood as a concept or as a general statistical
technique. Nonetheless, for the main models we consider here, the partial like-
lihood, when calculated, can be seen to coincide with other kinds of likelihood,
derived in different ways, and that the estimating equation, arising from use of
the partial likelihood, can be seen to be a reasonable estimating equation in its
own right, however obtained. The above expression was first presented in Cox’s
original paper where it was described as a conditional likelihood.
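For reference, a minimal sketch of the estimating equation obtained from the logarithm of (D.10), for a single covariate and simulated data, is given below; the score and information are accumulated over the observed failure times and the equation is solved by Newton-Raphson. The data-generating choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated single-covariate data (hypothetical), proportional hazards with beta = 0.7.
n = 200
Z = rng.normal(size=n)
T = rng.exponential(1.0 / np.exp(0.7 * Z))
C = rng.exponential(2.0, size=n)
X, delta = np.minimum(T, C), (T <= C).astype(float)

def score_and_information(beta):
    # U(beta) and I(beta) from the log of (D.10): at each failure time the
    # contribution is Z_i minus the weighted mean of Z over the risk set,
    # the weights being exp(beta * Z).
    U, I = 0.0, 0.0
    for i in np.flatnonzero(delta == 1):
        Y = (X >= X[i]).astype(float)            # Y_j(X_i): at-risk indicators
        w = Y * np.exp(beta * Z)
        e = np.sum(w * Z) / np.sum(w)            # weighted mean of Z over the risk set
        v = np.sum(w * Z**2) / np.sum(w) - e**2  # weighted variance of Z
        U += Z[i] - e
        I += v
    return U, I

beta = 0.0
for _ in range(20):                               # Newton-Raphson iterations
    U, I = score_and_information(beta)
    beta += U / I
print(beta, 1.0 / I)                              # estimate and approximate variance
```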
For subjects having either failed or being lost to follow-up before time t we can
still carry out the sum over all n subjects in our evaluation at time t. This is
because of the indicator variable Yj (t) that takes the value zero for all such
subjects, so that their transition probabilities become zero. The same idea can
be expressed via the concept of risk sets, i.e., those subjects alive and available
to make the transition under study. However, whenever possible, it is preferable
to make use of the indicator variables Yj (t), thereby keeping the sums over n.
Multiplying all the above terms over the observed failure times produces L(β).
In his 1972 paper Cox described this as a conditional likelihood and suggested it
be treated as a regular likelihood for the purposes of inference. In their contribu-
tion to the discussion of Cox’s paper, Kalbfleisch and Prentice point out that the
above likelihood does not have the usual probabilistic interpretation. If we take
the failure times to be fixed then $\exp(\beta Z_j)\big/\sum_{\ell} Y_\ell(X_i)\exp(\beta Z_\ell)$ is the probability that the
subject indexed by j fails at Xi and that all other subjects, regardless of order,
fail after Xi. Cox's deeper study (Cox, 1975) into L(β) led to the recognition
of L(β) as a partial likelihood and not a conditional likelihood in the usual sense.
The flavor of L(β) is nonetheless that of a conditional quantity, even if
the conditioning is done sequentially and not all at once. Cox’s discovery of
L(β), leading to a host of subsequent applications (time-dependent effects, time-
dependent covariates, random effects), represents one of the most important
statistical advances of the twentieth century. Although it took years of subse-
quent research in order to identify the quantity introduced by Cox as the relevant
quantity with which to carry out inference, and although it was argued that Cox’s
likelihood was not a conditional likelihood in the usual sense (where all of the
conditioning is done at once), his likelihood was all the same the right quantity
to work with.
From this standpoint the full likelihood can be broken into two components.
This argument is also sometimes given to motivate the idea of partial likeli-
hood, stating that the full likelihood can be decomposed into a product of two
terms, one of which contains the nuisance parameters λ0 (t) inextricably mixed
in with the parameter of interest β and a term that only depends upon β. This
second term is then called the partial likelihood. Once again, however, any such
decomposition is unlikely to be unique and it is not clear how to express in precise
mathematical terms just what we mean by “inextricably mixed in” since this is
more of an intuitive notion suggesting that we do not know how to separate the
parameters. Not knowing how to separate the parameters does not mean that no
procedure exists that might be able to separate them. And, if we were to sharpen
the definition by stating, for example, that within some large class there exists no
transformation or re-parameterization that would separate out the parameters,
then we would be left with the difficulty of verifying this in practice, a task that
would not be feasible.
Appendix E
We look at two methods described by Chauvel (2014) that will allow us to sim-
ulate samples {Ti , Zi , i = 1, . . . , n} under the non-proportional hazards model.
The first allows us to simulate data when we have a piecewise constant β(t),
corresponding to a change-point model, using the distribution of T given Z. The
second is based on the distribution of Z given T , and allows us to generate data
for any β(t) which is non-constant over time. Censoring is independent of this;
we thus simulate the times {C1 , . . . , Cn } independently of the data.
The model is
\[
\lambda(t \mid Z) = \lambda_0(t)\exp\{\beta(t)Z\} = \exp\Big\{ Z \sum_{i=1}^{k} \beta_i \, 1_{\,t_i \le t < t_{i+1}} \Big\}, \qquad t \in [0, T], \qquad \text{(E.1)}
\]
where the values β1, . . . , βk are constants. To begin with, we suppose that the
k − 1 break-points t2, t3, . . . , tk are known and we show how to simulate data
under the model with parameter β(t). After that, we show how to define the
change-points corresponding to the places where the value of β(t) changes.
Calculating S(. | Z)
Let t ∈ [0, T ] and calculate the value of the conditional survival S(t | Z). If t is
between t1 = 0 and t2 , the survival is
\[
S(t \mid Z) = \exp\left(-\int_0^t \lambda(u \mid Z)\, du\right) = \exp\left(-\int_0^t \exp(\beta(u)Z)\, du\right)
= \exp\left(-\int_{t_1=0}^{t} \exp(\beta_1 Z)\, du\right) = \exp\left(-t\exp(\beta_1 Z)\right).
\]
Then, if t is between tj and tj+1 , j ∈ {2, ..., k}, the survival of T given Z
calculated at t is
\[
\begin{aligned}
S(t \mid Z) &= \exp\left(-\int_0^t \lambda(u \mid Z)\, du\right) = \exp\left(-\int_0^t \exp(\beta(u)Z)\, du\right) \\
&= \exp\left(-\sum_{i=1}^{j-1} \int_{t_i}^{t_{i+1}} \exp(\beta_i Z)\, du - \int_{t_j}^{t} \exp(\beta_j Z)\, du\right) \\
&= \exp\left(-\sum_{i=1}^{j-1} (t_{i+1} - t_i)\exp(\beta_i Z) - (t - t_j)\exp(\beta_j Z)\right).
\end{aligned}
\]
For simulating a random variable whose survival function, given the covariates Z,
is S(. | Z), we use the fact that, if U ∼ U[0, 1], then F −1 (U ) has the cumulative
distribution function F. All that is needed then is to invert the cumulative
distribution function F(· | Z) = 1 − S(· | Z).
The inverse of F(· | Z)
Let γ ∈ [0, 1]. We are looking for the t such that γ = F(t | Z). If 0 ≤ γ < F(t2 | Z),
then t lies in [t1, t2) and, inverting on this first interval the expression for S(t | Z)
obtained above, t = −exp(−β1 Z) log(1 − γ). More generally, if F(tj | Z) ≤ γ < F(tj+1 | Z)
for j ∈ {2, . . . , k}, then
\[
\gamma = F(t \mid Z) = 1 - \exp\left(-\sum_{i=1}^{j-1} (t_{i+1} - t_i)\exp(\beta_i Z) - (t - t_j)\exp(\beta_j Z)\right).
\]
In conclusion, solving for t,
\[
t = \exp(-\beta_j Z)\left[-\log(1 - \gamma) - \sum_{i=2}^{j} t_i\{\exp(\beta_{i-1} Z) - \exp(\beta_i Z)\}\right].
\]
Starting with a variable Z, a coefficients vector (β1 , ..., βk ), and a vector with
the break-points (t2 , ..., tk ), we simulate U from the uniform distribution U[0, 1];
then, F −1 (U | Z) is a random variable with cumulative distribution function
F(· | Z). This variable is the survival time T and is consistent with the non-
proportional hazards model.
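A short sketch of this inverse-transform simulation is given below. It inverts the piecewise-linear cumulative hazard directly, which is algebraically equivalent to the closed-form expression for t given above; the coefficients (2, 1, 0) and the first break-point 0.1 follow the example discussed below, while the second break-point 0.3 and the censoring distribution are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_time(Z, betas, breakpoints, rng):
    """One survival time under the change-point model (E.1): the cumulative
    hazard is piecewise linear in t with slope exp(beta_j * Z) on [t_j, t_{j+1})."""
    t_knots = np.concatenate(([0.0], breakpoints))      # t_1 = 0, t_2, ..., t_k
    E = -np.log(1.0 - rng.uniform())                     # -log(1 - gamma)
    H = 0.0                                              # cumulative hazard so far
    for j, t_j in enumerate(t_knots):
        slope = np.exp(betas[j] * Z)
        width = (t_knots[j + 1] - t_j) if j + 1 < len(t_knots) else np.inf
        if E - H <= slope * width:                       # the time falls in this interval
            return t_j + (E - H) / slope
        H += slope * width

# Example: k = 3 pieces; the second break-point is an illustrative assumption.
betas, breakpoints = [2.0, 1.0, 0.0], [0.1, 0.3]
Z = rng.binomial(1, 0.5, size=1000)
T = np.array([simulate_time(z, betas, breakpoints, rng) for z in Z])
C = rng.exponential(1.0, size=1000)                      # independent censoring
X, delta = np.minimum(T, C), (T <= C).astype(int)
```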
The break-point times need to be selected in such a way that we do not run
out of individuals. In effect, the times need to be calibrated so that individuals
are present and in sufficient number in each interval [tj, tj+1). The choice of
times is made as a function of the coefficients β1 , . . . , βk that have been selected
for the study.
As an illustration, take k = 3 with β1 = 2, β2 = 1 and β3 = 0. Before being able to simulate the data using the
method presented in Section E, we need to choose the values t2 and t3.
Step 1: We simulate survival data for 2000 individuals according to model (E.1)
with β(t) = β1 = 2 for all t. We then calculate the corresponding Kaplan-Meier
estimator. Next, we select a time such that 1/3 of individuals have died before
that time. This value corresponds to t2 . In our example, we get t2 = 0.1.
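Step 1 can be carried out in a few lines of code. In the sketch below the calibration run has no censoring, so the Kaplan-Meier estimator reduces to the empirical survival curve and t2 is simply the empirical one-third quantile of the simulated times; the unit baseline hazard and the balanced binary covariate are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Step 1: constant beta(t) = beta_1 = 2, 2000 individuals, lambda_0(t) = 1.
m, beta1 = 2000, 2.0
Z = rng.binomial(1, 0.5, size=m)
T = rng.exponential(1.0 / np.exp(beta1 * Z))   # hazard exp(beta_1 * Z)

# With no censoring the Kaplan-Meier estimator is the empirical survival curve,
# so the time by which one third of individuals have died is the empirical
# one-third quantile of the simulated times.
t2 = np.quantile(T, 1.0 / 3.0)
print(t2)
```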
In order to generate data under the more general non-proportional hazards mod-
els, we use the conditional distribution of Z given T . We work with a single
covariate Z that does not change with time; extensions to multiple covariates
are straightforward with the help of the prognostic index.
We simulate the vectors for times of death (T1 , . . . , Tn ), censoring (C1 , . . . , Cn ),
and covariate vectors (Z1 , . . . , Zn ) independently, via their marginal distributions.
With the help of the probabilities {πi(β(t), t), i = 1, . . . , n}, we connect up the
covariates and the ranked times of death T(1) ≤ · · · ≤ T(n) in the following way.
Having initialized the risk set R to contain all n individuals, we proceed, for i
from 1 to n, as follows (a sketch in code is given after the list):
– We calculate πj(β(T(i)), T(i)) for each individual j ∈ R, setting
Yl(t) = 1 if l ∈ R and Yl(t) = 0 otherwise.
– We randomly select an individual j∗ in R, where individual j in R is
chosen with probability πj(β(T(i)), T(i)).
– We link Zj∗ with T(i).
– We remove j∗ from R.
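The sketch referred to above is given here; it follows the listed steps directly, with an exponential marginal distribution for the times, a binary covariate and a smoothly decreasing β(t), all of which are illustrative assumptions rather than choices made in the text.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_conditional_on_T(n, beta_fn, rng):
    """Link independently simulated covariates to the ranked death times using
    the probabilities pi_j(beta(t), t), as in the algorithm above."""
    T = np.sort(rng.exponential(1.0, size=n))      # ranked times T_(1) <= ... <= T_(n)
    Z = rng.binomial(1, 0.5, size=n)               # covariates, simulated separately
    linked_Z = np.empty(n)
    R = list(range(n))                             # risk set: indices not yet assigned
    for i in range(n):
        b = beta_fn(T[i])
        w = np.exp(b * Z[R])                       # Y_l = 1 only for l in R
        p = w / w.sum()                            # pi_j(beta(T_(i)), T_(i)), j in R
        j_star = rng.choice(len(R), p=p)           # select one individual from R
        linked_Z[i] = Z[R[j_star]]                 # link Z_{j*} with T_(i)
        del R[j_star]                              # remove j* from the risk set
    return T, linked_Z

# Example with a smoothly decreasing regression effect (any non-constant beta(t) works).
beta_fn = lambda t: 1.5 * np.exp(-t)
T, Z_linked = simulate_conditional_on_T(500, beta_fn, rng)
C = rng.exponential(2.0, size=500)                 # independent censoring
X, delta = np.minimum(T, C), (T <= C).astype(int)
```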
The appendices are referred to throughout the text and provide the background
to many important results. They are of interest in their own right and, for this
reason, we include here proofs of some key results as well as classwork and
exercises. These are for the benefit of instructors and students who wish to dig a
little more deeply into this background. A course in survival analysis might well
include certain aspects covered in these appendices and this would depend on
the type of course being given as well as the flavor that the instructor wishes for
it to assume.
3. Let g(x) take the value 0 for −∞ < x ≤ 0; 1/2 for 0 < x ≤ 1; 1 for 1 < x ≤ 2;
and 0 otherwise. Let f (x) = x2 + 2. Evaluate the Riemann-Stieltjes integral of
f (x) with respect to g(x) over the real line.
4. Note that $\sum_{i=1}^{n} i = n(n + 1)/2$. Describe a function such that a Riemann-
Stieltjes integral of it is equal to n(n + 1)/2. Viewing integration as an area
under a curve, conclude that this integral behaves like n²/2 as n becomes large.
5. Suppose that in the Helly-Bray theorem for $\int h(x)\, dF_n(x)$, the function h(x)
is unbounded. Break the integral into components over the real line. For regions
where h(x) is bounded the theorem holds. For the other regions obtain conditions
that would lead to the result holding generally.
8. The order statistics for a random sample of size n from a discrete distribution
are defined as in the continuous case except that now we have X(1) ≤ X(2) ≤
· · · ≤ X(n) . Suppose a random sample of size 5 is taken with replacement from
the discrete distribution f (x) = 1/6 for x = 1, 2, . . . , 6. Find the probability mass
function of X(1) , the smallest order statistic.
9. Ten points are chosen randomly and independently on the interval (0,1). Find
(a) the probability that the point nearest 1 exceeds 0.8, (b) the number c such
that the probability is 0.4 that the point nearest zero will exceed c.
10. Find the expected value of the largest order statistic in a random sample of
size 3 from (a) the exponential distribution f (x) = exp(−x) for x > 0, (b) the
standard normal distribution.
11. Find the probability that the range of a random sample of size n from the
population f (x) = 2e−2x for x ≥ 0 does not exceed the value 4.
12. Approximate the mean and variance of (a) the median of a sample of size
13 from a normal distribution with mean 2 and variance 9, (b) the fifth-order
statistic of a random sample of size 15 from the standard exponential distribution.
13. Simulate 100 observations from a uniform distribution. Do the same for
an exponential, Weibull, and log-logistic distribution with different parameters.
Next, generate normal and log-normal variates by summing a small number of
uniform variates. Obtain histograms. Do the same for 5000 observations.
14. Obtain the histogram of 100 Weibull observations. Obtain the histogram of
the logarithms of these observations. Compare this with the histogram obtained
by the empirical transformation to normality.
19. Recall that the information contained in the density g(t) with respect to
h(t) is defined by $V(g, h) = E \log g(T) = \int \log g(t)\, h(t)\, dt$ and the entropy is just
−V (g, h). Suppose that the entropy depends on a parameter θ and is written
−Vθ (f, f ). Consider −Vα (f, f ) as a function of α. Show that this function is
maximized when α = θ.
20. For random variables, T and Z with finite second moments, use the device
of double expectation to show that Var(T) = Var{E(T | Z)} + E{Var(T | Z)}. Thus,
the total variance, Var (T ), breaks down into two parts. Why is this breakdown
interpreted as one component corresponding to “signal” and one component
corresponding to “noise?”
22. Consider a stochastic process X(t) on the interval (2, 7) with the following
properties: (a) X(0) = 2, (b) X(t), t ∈ (2, 7), has increments such that, (c) for
each t ∈ (2, 7), the distribution of X(t) is Weibull with mean 2 + λt^γ. Can these
increments be independent and stationary? Can the process be described using
the known results of Brownian motion?
23. For Brownian motion, explain why the conditional distribution of X(s) given
X(t) (t > s) is normal with E{X(s)|X(t) = w} = ws/t and Var {X(s)|X(t) =
w} = s(t − s)/t. Deduce the mean and the covariance process for the Brownian
bridge.
26. Find the value of t ∈ (0, 1) for which the variance of a Brownian bridge is
maximized.
27. Suppose that under H0 , X(t) is Brownian motion. Under H1 , X(t) is Brow-
nian motion with drift, having drift parameter 2 as long as X(t) < 1 and drift
parameter minus 2 otherwise. Describe likely paths for reflected Brownian motion
under both H0 and H1 . As a class exercise simulate ten paths under both
hypotheses. Comment on the resulting figures.
5. Repeat the above class exercise, replacing the uniform distribution by (1) the
log-logistic distribution with different means and variances, (2) the exponential
distribution, and (3) the normal distribution. Again replicate each graph ten
times. Comment on your findings.
the 1000 values of D20 , and then plot a histogram of these values. Add to the
histogram the distribution of D20 using the Brownian bridge approximation.
8. For a sample size of 220, provide an approximate calculation of how large you
would anticipate the greatest discrepancy between Fn (t) and F (t) to be.
11. In the next chapter we discuss the idea of right censoring where, for certain
of the observations Ti , the exact value is not known. All that we can say for sure
is that it is greater than some censoring time. How might the discussion of the
previous exercise on the two types of estimating equations arising from reversing
the conditioning variable have a bearing on this?
Outline of proofs
Proof of Theorem A.2, Theorem A.7 and Corollary A.9. The importance of the
first of these results is difficult to overstate. All the useful large sample
results, for instance, hinge ultimately on the theorem. An elegant proof of the
theorem, together with well thought out illustrations and some examples, is given
in Shenk (1979). For Theorem A.7 let T have a continuous and invertible distri-
bution function F (t) and let U = F (T ). The inverse function is denoted F −1 so
that F −1 {F (t)} = t. Then
\[
\Pr(U \le u) = \Pr\{F(T) \le u\} = \Pr\{T \le F^{-1}(u)\} = F\{F^{-1}(u)\} = u, \qquad 0 \le u \le 1.
\]
Thus U has the distribution of a standard uniform. The proof of the corollary
leans upon some straightforward manipulation of elementary events. An outline
is provided in David (1994).
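A quick numerical check of Theorem A.7 is easily carried out; the Weibull choice below, its parameters and the sample size are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

# If T has continuous distribution F, then U = F(T) is uniform on (0, 1).
# F is taken here to be a Weibull distribution with shape 1.5 and scale 2.
shape, scale = 1.5, 2.0
T = scale * rng.weibull(shape, size=5000)
U = 1.0 - np.exp(-(T / scale) ** shape)       # F(T) for this Weibull

# The empirical distribution of U should lie close to the uniform: the largest
# discrepancy from the diagonal plays the role of a Kolmogorov-Smirnov distance.
U_sorted = np.sort(U)
grid = np.arange(1, U.size + 1) / U.size
print(np.max(np.abs(U_sorted - grid)))        # small, of order 1/sqrt(n)
```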
Proof of Theorem A.9 and Corollary A.4. The theorem states that, letting F(x) =
P(X ≤ x) and Fr(x) = P(X(r) ≤ x), then
\[
F_r(x) = \sum_{i=r}^{n} \binom{n}{i} F^i(x)\,[1 - F(x)]^{n-i}.
\]
Recall that the Xi from the parent population with distribution F (x) are i.i.d.
The event that X(r) ≤ x is the event that at least r of the Xi are less than or
equal to x. This is then the sum of the binomial probabilities summed over all
values of i greater than or equal to r. The first part of the corollary is clear upon
inspection. For the second part note that
\[
\sum_{i=0}^{n} \binom{n}{i} F^i(x)\,[1 - F(x)]^{n-i} = 1, \qquad \binom{n}{0} F^0(x)\,[1 - F(x)]^{n} = [1 - F(x)]^{n},
\]
where the final inequality is obtained by applying Equation (A.4). The conclusion
follows.
Proof of Theorem B.1 and Corollaries B.1 and B.2. We have that:
\[
f_{s|t}(x \mid w) = f_s(x)\, f_{t-s}(w - x)/f_t(w) = \text{const} \times \exp\{-x^2/2s - (w - x)^2/2(t - s)\}
= \text{const} \times \exp\{-t(x - ws/t)^2/2s(t - s)\}.
\]
For Corollary B.2 note that the process V(t) is a Gaussian process in which
\[
E\{V(t)\} = 0, \qquad \mathrm{Cov}\{V(t), V(s)\} = E\{\exp(-\alpha t/2)\exp(-\alpha s/2)\, X(e^{\alpha t}) X(e^{\alpha s})\}.
\]
This can be written $\exp(-\alpha(t + s)/2) \times E\{X(e^{\alpha t}) X(e^{\alpha s})\}$ which, taking s < t, is
equal to $\exp(-\alpha(t + s)/2)\exp(\alpha s)$ and this is just $\exp(-\alpha(t - s)/2)$.
Proof of Theorem B.2 and Corollary B.3. Note that Corollary B.3 can be deduced
from this theorem via the following simple steps. From the definition of the
Brownian bridge, Z(t) = W(t) − tW(1), so that, for s < t, expanding the product
Z(s)Z(t) and using Cov{W(u), W(v)} = min(u, v), we obtain
\[
E\{W(s)W(t) - sW(1)W(t) + stW(1)W(1) - tW(1)W(s)\} = s - st + st - st = s(1 - t).
\]
Expanding in the same way the covariance of the time-transformed process we obtain
\[
\mathrm{Cov}\{W(t/(t+1)), W(s/(s+1))\} - \tfrac{t}{t+1}\,\mathrm{Cov}\{W(1), W(s/(s+1))\}
- \tfrac{s}{s+1}\,\mathrm{Cov}\{W(1), W(t/(t+1))\} + \tfrac{st}{(s+1)(t+1)}\,\mathrm{Cov}\{W(1), W(1)\},
\]
which, after multiplying through by the factor (s + 1)(t + 1), equals s(t + 1) − st − st + st = s.
Interchanging the order of the expectation and integration operators we have:
\[
E\int_0^s\!\!\int_0^t X(y)X(u)\, dy\, du = \int_0^s\!\!\int_0^t E\{X(y)X(u)\}\, dy\, du
= \int_0^s\!\!\int_0^t \min(y, u)\, dy\, du = \int_0^s\left\{\int_0^u y\, dy + \int_u^t u\, dy\right\} du = s^2(t/2 - s/6).
\]
\[
\mathrm{Cov}(S_k^*, S_m^*) = E(S_k^* S_m^*) - E(S_k^*)E(S_m^*)
= (\sigma\sqrt{n})^{-2} E\{S_k S_m\}
= (\sigma\sqrt{n})^{-2} E\Big\{S_k\Big(S_k + \sum_{k < i \le m} X_i\Big)\Big\}.
\]
Proof of Theorem C.6 and B.6. For the first of these two theorems, note that the
marginal and conditional normality of any sequence of $\sqrt{n}\{F_n(t_i) - F(t_i)\}$,
$(0 < t_1 < \ldots < t_m)$, for some m, indicates the multivariate normality of
$\sqrt{n}\{F_n(t_i) - F(t_i)\}$, $(0 < t_1 < \ldots < t_m)$. Thus $\sqrt{n}\{F_n(t) - F(t)\}$ is a Gaussian process. It
has mean zero and variance F(t){1 − F(t)}. To obtain the covariance, note first
that the indicator variables I(Ti ≤ t) and I(Tj ≤ s) are independent for all t and
s and i ≠ j. It only remains to evaluate, for s < t,
Cov {I(Ti ≤ t), I(Ti ≤ s)} = E{I(Ti ≤ t)I(Ti ≤ s)} − E{I(Ti ≤ t)}E{I(Ti ≤ s)}
= F (s) − F (t)F (s) = F (s){1 − F (t)}.
Furthermore, writing $\tilde{M}(t) = \int_0^t H(s)\, dM(s)$,
\[
\mathrm{Var}\{d\tilde{M}(t) \mid \mathcal{F}_{t-}\} = \mathrm{Var}\{H(t)\, dM(t) \mid \mathcal{F}_{t-}\} = H(t)^2\, \mathrm{Var}\{dM(t) \mid \mathcal{F}_{t-}\} = H(t)^2\, d\langle M, M\rangle(t).
\]
Similarly,
\[
\Big\langle \int H\, dM, \int K\, dM \Big\rangle(t) = \int_0^t H(s)K(s)\, d\langle M, M\rangle(s).
\]
Bibliography

O.O. Aalen, A linear regression model for the analysis of life times. Stat. Med.
8, 907–925 (1989)
Acute Leukemia Group B, E.J. Freireich, E. Gehan, E. Frei, L.R. Schroeder, I.J.
Wolman, R. Anbari, E.O. Burgert, S.D. Mills, D. Pinkel, O.S. Selawry, J.H.
Moon, B.R. Gendel, C.L. Spurr, R. Storrs, F. Haurani, B. Hoogstraten, S. Lee,
The effect of 6-mercaptopurine on the duration of steroid-induced remissions
in acute leukemia: a model for evaluation of other potentially useful therapy.
Blood 21(6), 699–716 (1963)
P.K. Andersen, R.D. Gill, Cox’s regression model for counting processes: a large
sample study. Ann. Stat. 10, 1100–1120 (1982)
P.C. Austin, J.P. Fine, Accounting for competing risks in randomized controlled
trials: a review and recommendations for improvement. Stat. Med. 36(8),
1203–1209 (2017)
W.E. Barlow, R.L. Prentice, Residuals for relative risk regression. Biometrika
75(1), 65–74 (1988)
S. Bennett, Analysis of survival data by the proportional odds model. Stat. Med.
2, 273–277 (1983a)
S. Bennett, Log-logistic regression models for survival data. Appl. Stat. 32, 165–
171 (1983b)
D.A. Binder, Fitting Cox’s proportional hazards models from survey data.
Biometrika 79(1), 139–147 (1992)
O. Borgan, K. Liestol, A note on confidence intervals and bands for the survival
function based on transformations. Scand. J. Stat. 35–41 (1990)
N. Breslow, J. Crowley, A large sample study of the life table and product limit
estimates under random censorship. Ann. Stat. 437–453 (1974)
N. Breslow, L. Elder, L. Berger, A two sample censored-data rank test for accel-
eration. Biometrics 40, 1042–1069 (1984)
M.S. Brose, T.R. Rebbeck, K.A. Calzone, J.E. Stopfer, K.L. Nathanson, B.L.
Weber, Cancer risk estimates for BRCA1 mutation carriers identified in a risk
evaluation program. J. Natl. Cancer Inst. 94(18), 1365–1372 (2002)
K.C. Cain, S.D. Harlow, R.J. Little, B. Nan, M. Yosef, J.R. Taffe, M.R. Elliott,
Bias due to left truncation and left censoring in longitudinal studies of devel-
opmental and disease processes. Am. J. Epidemiol. 173(9), 1078–1084 (2011)
B.P. Carlin, J.S. Hodges, Hierarchical proportional hazards regression models for
highly stratified data. Biometrics 55(4), 1162–1170 (1999)
Q. Chen, R.C. May, J.G. Ibrahim, H. Chu, S.R. Cole, Joint modeling of longitu-
dinal and survival data with missing and left-censored time-varying covariates.
Stat. Med. 33(26), 4560–4576 (2014)
W.G. Cochran, Some methods for strengthening the common χ² tests. Biomet-
rics 10(4), 417–451 (1954)
J. Cologne, W.-L. Hsu, R.D. Abbott, W. Ohishi, E.J. Grant, S. Fujiwara, H.M.
Cullings, Proportional hazards regression in epidemiologic follow-up studies:
an intuitive consideration of primary time scale. Epidemiology 565–573 (2012)
D.R. Cox, Regression models and life–tables (with discussion). J. R. Stat. Soc.
Ser. B 34(2), 187–220 (1972)
D.R. Cox, D.V. Hinkley, Theoretical Statistics (Chapman and Hall/CRC, 1979)
D.R. Cox, E.J. Snell, A general definition of residuals. J. R. Stat. Soc. Ser. B
30(2), 248–275 (1968)
J.J. Crowley, B.E. Storer, Comment on ’A reanalysis of the Stanford Heart Trans-
plant Data’, by M. Aitkin, N. Laird and B. Francis. J. Am. Stat. Assoc. 78,
277–281 (1983)
M.J. Crowther, K.R. Abrams, P.C. Lambert, Joint modeling of longitudinal and
survival data. Stata J. 13(1), 165–184 (2013)
R.B. Davies, Hypothesis testing when a nuisance parameter is present only under
the alternative. Biometrika 64(2), 247–254 (1977)
R.B. Davies, Hypothesis testing when a nuisance parameter is present only under
the alternative. Biometrika 74(1), 33–43 (1987)
N.R. Draper, The Box-Wetz criterion versus R2. J. R. Stat. Soc. Ser. A (General)
147(1), 100–103 (1984)
N.R. Draper, Corrections: the Box-Wetz criterion versus R2. J. R. Stat. Soc. Ser.
A (General) 148(4), 357–357 (1985)
B. Efron, Censored data and the bootstrap. J. Am. Stat. Assoc. 76(374), 312–
319 (1981a)
B. Efron, D.V. Hinkley, Assessing the accuracy of the maximum likelihood esti-
mator: observed versus expected Fisher information. Biometrika 65, 457–483
(1978)
B. Efron, C. Stein, The jackknife estimate of variance. Ann. Stat. 586–596 (1981)
B. Efron et al., The geometry of exponential families. Ann. Stat. 6(2), 362–376
(1978)
S.S. Ellenberg, J.M. Hamilton, Surrogate endpoints in clinical trials: cancer. Stat.
Med. 8(4), 405–413 (1989)
K.H. Eng, M.R. Kosorok, A sample size formula for the supremum log-rank
statistic. Biometrics 61, 86–91 (2005)
J.P Fine, R.J. Gray, A proportional hazards model for the subdistribution of a
competing risk. J. Am. Stat. Assoc. 94(446), 496–509 (1999)
E. Fix, J. Neyman, A simple stochastic model of recovery, relapse, death and loss
of patients. Hum. Biol. 23(3), 205–241 (1951)
T.R. Fleming, D.P. Harrington, Counting Processes and Survival Analysis (Wiley,
New York, 1991)
T.R. Fleming, D.P. Harrington, Counting Processes and Survival Analysis, 2nd
edn. (Wiley, New York, 2005)
T.R. Fleming, D.P. Harrington, Evaluation of censored survival data test proce-
dures based on single and multiple statistics, in Topics in Applied Statistics
(Marcel Dekker, New York, 1984), pp. 97–123
D.D. Hanagal, Modeling Survival Data Using Frailty Models (Chapman and Hall,
CRC, 2011)
S. Haneuse, K.H. Lee, Semi-competing risks data analysis: accounting for death
as a competing risk when the outcome of interest is nonterminal. Circ. Car-
diovasc. Qual. Outcomes 9(3), 322–331 (2016)
D.P. Harrington, T.R. Fleming, A class of rank test procedures for censored
survival data. Biometrika 69(3), 553–566 (1982)
N.L. Hjort, On inference in parametric survival data models. Int. Stat. Rev./Revue
Internationale de Statistique 355–387 (1992)
F. Hsieh, Y.-K. Tseng, J.-L. Wang, Joint modeling of survival and longitudinal
data: likelihood approach revisited. Biometrics 62(4), 1037–1043 (2006)
X. Huang, L. Liu, A joint frailty model for survival and gap times between recur-
rent events. Biometrics 63(2), 389–397 (2007)
J. Hyde, Testing survival under right censoring and left truncation. Biometrika
64(2), 225–230 (1977)
H. Jiang, J.P. Fine, R. Chappell, Semiparametric analysis of survival data with left
truncation and dependent right censoring. Biometrics 61(2), 567–575 (2005)
M.P. Jones, J. Crowley, A general class of nonparametric tests for survival anal-
ysis. Biometrics 45, 157–170 (1989)
J.D. Kalbfleisch, R.L. Prentice, The Statistical Analysis of Failure Time Data
(Wiley, 2002)
R. Kay, Proportional hazard regression models and the analysis of censored sur-
vival data. J. R. Stat. Soc. Ser. C (Appl. Stat.) 26(3), 227–237 (1977)
N. Keiding, Historical controls and modern survival analysis. Lifetime Data Anal.
1(1), 19–25 (1995)
N. Keiding, R.D. Gill, Random truncation models and Markov processes. Ann.
Stat. 582–602 (1990)
M.G. Kendall, A. Stuart, J.K. Ord, S.F. Arnold, Kendall’s advanced theory of
statistics (1987)
J.T. Kent, J. O’Quigley, Measures of dependence for censored survival data.
Biometrika 75(3), 525–534 (1988)
H.T. Kim, Cumulative incidence in competing risks data and competing risks
regression analysis. Clin. Cancer Res. 13(2), 559–565 (2007)
K. Kim, A.A. Tsiatis, Study duration for clinical trials with survival response and
early stopping rule. Biometrics 81–92 (1990)
J.P. Klein, M.L. Moeschberger, Survival Analysis Techniques for Censored and
Truncated Data (Springer, 2003)
J.P. Klein, S.-C. Lee, M. Moeschberger, A partially parametric estimator of sur-
vival in the presence of randomly censored data. Biometrics 795–811 (1990)
A.N. Kolmogorov, Foundations of the Theory of Probability: Second English
Edition (2018)
M.R. Kosorok, C.Y. Lin, The versatility of function-indexed weighted log-rank
statistics. J. Am. Stat. Assoc. 94, 320–332 (1999)
J.A. Koziol, S.B. Green, A Cramer-von Mises statistic for randomly censored
data. Biometrika 63(3), 465–474 (1976)
J.A. Koziol, J.-Y. Zhang, C.A. Casiano, X.-X. Peng, F.-D. Shi, A.C. Feng, E.K.
Chan, E.M. Tan, Recursive partitioning as an approach to selection of immune
markers for tumor diagnosis. Clin. Cancer Res. 9(14), 5120–5126 (2003)
T.O. Kvaalseth, Cautionary note about R2. Am. Stat. 39(4), 279–285 (1985)
S. Lagakos, D. Schoenfeld, Properties of proportional-hazards score tests under
misspecified regression models. Biometrics 1037–1048 (1984)
S. Lagakos, The graphical evaluation of explanatory variables in proportional
hazards regression models. Biometrika 68, 93–98 (1981)
S. Lagakos, The loss in efficiency from misspecifying covariates in proportional
hazards regression models. Biometrika 75(1), 156–160 (1988)
S.W. Lagakos, A stochastic model for censored-survival data in the presence of
an auxiliary variable. Biometrics 551–559 (1976)
S.W. Lagakos, Using auxiliary variables for improved estimates of survival time.
Biometrics 399–404 (1977)
S.W. Lagakos, L.L. Kim, J.M. Robins, Adjusting for early treatment termination
in comparative clinical trials. Stat. Med. 9, 1417–1424 (1990)
S.W. Lagakos, C.J. Sommer, M. Zelen, Semi-Markov models for partially cen-
sored data. Biometrika 65(2), 311–317 (1978)
M.G. Larson, G.E. Dinse, A mixture model for the regression analysis of compet-
ing risks data. J. R. Stat. Soc. Ser. C (Appl. Stat.) 34(3), 201–211 (1985)
B. Lau, S.R. Cole, S.J. Gange, Competing risk regression models for epidemiologic
data. Am. J. Epidemiol. 170(2), 244–256 (2009)
M. LeBlanc, J. Crowley, Relative risk trees for censored survival data. Biometrics
411–425 (1992)
J. Lee, Some versatile tests based on the simultaneous use of weighted log-rank
statistics. Biometrics 52, 721–725 (1996)
S.-H. Lee, Maximum of the weighted Kaplan-Meier tests for the two-sample
censored data. J. Stat. Comput. Simul. 81, 1017–1026 (2011)
E.L. Lehmann et al., The power of rank tests. Ann. Math. Stat. 24(1), 23–43
(1953)
S. Leurgans, Three classes of censored data rank tests: strengths and weaknesses
under censoring. Biometrika 70, 651–658 (1983)
Y. Li, L. Ryan, Modeling spatial survival data using semiparametric frailty models.
Biometrics 58(2), 287–297 (2002)
D. Lin, Goodness-of-fit analysis for the Cox regression model based on a class of
parameter estimators. J. Am. Stat. Assoc. 86, 153–180 (1991)
D. Lin, Cox regression analysis of multivariate failure time data: the marginal
approach. Stat. Med. 13(21), 2233–2247 (1994)
D. Lin, J. Robins, L. Wei, Comparing two failure time distributions in the presence
of dependent censoring. Biometrika 83, 381–393 (1996)
D. Lin, L. Wei, Z. Ying, Checking the Cox model with cumulative sums of mar-
tingale based residuals. Biometrika 80, 557–572 (1993)
D.Y. Lin, Z. Ying, Semiparametric analysis of the additive risk model. Biometrika
81(1), 61–71 (1994)
D.Y. Lin, L.-J. Wei, The robust inference for the cox proportional hazards model.
J. Am. Stat. Assoc. 84(408), 1074–1078 (1989)
C.L. Link, Confidence intervals for the survival function using Cox’s proportional-
hazard model with covariates. Biometrics 601–609 (1984)
L. Liu, R.A. Wolfe, X. Huang, Shared frailty models for recurrent events and a
terminal event. Biometrics 60(3), 747–756 (2004)
W.-Y. Loh, Classification and regression trees. Wiley Interdisc. Rev. Data Min.
Knowl. Disc. 1(1), 14–23 (2011)
W.-Y. Loh, Fifty years of classification and regression trees. Int. Stat. Rev. 82(3),
329–348 (2014)
N. Mantel, Chi-square tests with one degree of freedom; extensions of the Mantel-
Haenszel procedure. J. Am. Stat. Assoc. 58(303), 690–700 (1963)
N. Mantel, Evaluation of survival data and two new rank order statistics arising
in its consideration. Cancer Chemother. Rep. 50, 163–70 (1966)
N. Mantel, D.M. Stablein, The crossing hazard function problem. The Statistician
37, 59–64 (1988)
I.W. McKeague, P.D. Sasieni, A partly parametric additive risk model. Biometrika
81, 501–514 (1994)
I.W. McKeague, K.J. Utikal, Inference for a nonlinear counting process regression
model. Ann. Stat. 18, 1172–1187 (1990)
S.H. Moolgavkar, E.T. Chang, H.N. Watson, E.C. Lau, An assessment of the
Cox proportional hazards regression model for epidemiologic studies. Risk Anal.
38(4), 777–794 (2018)
S. Murray, A.A. Tsiatis, Sequential methods for comparing years of life saved in
the two sample censored data problem. Biometrics 55 (1999)
M.H. Myers, B.F. Hankey, N. Mantel, A logistic-exponential model for use with
response-time data involving regressor variables. Biometrics 257–269 (1973)
W. Nelson, Hazard plotting for incomplete failure data. J. Qual. Technol. 1(1),
27–52 (1969)
M.A. Newton, A.E. Raftery, Approximate Bayesian inference with the weighted
likelihood bootstrap. J. R. Stat. Soc. Ser. B (Methodological) 56(1), 3–26
(1994)
P.C. O’Brien, A nonparametric test for association with censored data. Biometrics
243–250 (1978)
P.C. O’Brien, T.R. Fleming, A paired Prentice-Wilcoxon test for censored paired
data. Biometrics 169–180 (1987)
J. O’Quigley, Faulty BRCA1, BRCA2 genes: how poor is the prognosis? Ann. Epi-
demiol. 27(10), 672–676 (2017)
M. Peckova, T.R. Fleming, Adaptive test for testing the difference in survival
distributions. Lifetime Data Anal. 9, 223–238 (2003)
M.J. Pencina, R.B. D’Agostino, R.S. Vasan, Evaluating the added predictive
ability of a new marker: from area under the ROC curve to reclassification and
beyond. Stat. Med. 27, 157–172 (2008)
M.S. Pepe, J. Fan, Z. Feng, T. Gerds, J. Hilden, The net reclassification index
(NRI): a misleading measure of prediction improvement even with independent
test data sets. Stat. Biosci. 7(2), 282–295 (2015)
R.L. Prentice, Linear rank tests with right censored data. Biometrika 65, 167–179
(1978)
R.L. Prentice, J.D. Kalbfleisch, Hazard rate models with covariates. Biometrics
25–39 (1979)
C.R. Rao, Linear Statistical Inference and Its Applications, 2nd edn. (Wiley,
New York, 1973)
T.H. Scheike, M.-J. Zhang, Flexible competing risks regression modeling and
goodness-of-fit. Lifetime Data Anal. 14(4), 464 (2008)
A.A. Seyerle, C.L. Avery, Genetic epidemiology: the potential benefits and chal-
lenges of using genetic information to improve human health. N. C. Med. J.
74(6), 505–508 (2013)
E.V. Slud, L.V. Rubinstein, Dependent competing risks and summary survival
curves. Biometrika 70(3), 643–649 (1983)
W. Stute, Strong consistency under the Koziol-Green model. Stat. Prob. Lett.
14(4), 313–320 (1992)
W. Stute, The central limit theorem under random censorship. Ann. Stat. 23,
422–439 (1995)
R.E. Tarone, On the distribution of the maximum of the log-rank statistic and
the modified Wilcoxon statistic. Biometrics 37, 79–85 (1981)
R.E. Tarone, J.H. Ware, On distribution-free tests for equality for survival distri-
butions. Biometrika 64, 156–160 (1977)
T.M. Therneau, P.M. Grambsch, Modeling Survival Data: Extending the Cox
Model (Springer, New York, 2000)
W.-Y. Tsai, N.P. Jewell, M.-C. Wang, A note on the product-limit estimator
under right censoring and left truncation. Biometrika 74(4), 883–886 (1987)
A.A. Tsiatis, Group sequential methods for survival analysis with staggered entry.
Lect. Notes Monogr. Ser. 2, 257–268 (1982)
D.J. Venzon, S.H. Moolgavkar, Origin-invariant relative risk functions for case-
control and survival studies. Biometrika 75(2), 325–333 (1988)
L. Wei, The accelerated failure time model: a useful alternative to the Cox regres-
sion model in survival analysis. Stat. Med. 11, 1871–1879 (1992)
J.B. Willett, J.D. Singer, Another cautionary note about R 2: its use in weighted
least-squares regression analysis. Am. Stat. 42(3), 236–238 (1988)
L. Wu, P.B. Gilbert, Flexible weighted log-rank tests optimal for detecting early
and/or late survival differences. Biometrics 58, 997–1004 (2002)
R. Xu, Inference for the proportional hazards model. Thèse de doctorat, University
of California, San Diego (1996)
X. Xue, M.Y. Kim, M.M. Gaudet, Y. Park, M. Heo, A.R. Hollenbeck, H.D.
Strickler, M.J. Gunter, A comparison of the polytomous logistic regression and
joint Cox proportional hazards models for evaluating multiple disease subtypes
in prospective cohort studies. Cancer Epidemiol. Prev. Biomark. 22(2), 275–
285 (2013)
S. Yang, R. Prentice, Semiparametric analysis of short term and long term relative
risks with two sample survival data. Biometrika 92, 1–17 (2005)
S. Yang, R. Prentice, Improved logrank-type tests for survival data using adaptive
weights. Biometrics 66, 30–38 (2010)
M.J. Zhang, X. Zhang, J. Fine, A proportional hazards regression model for the
subdistribution with right-censored and left-truncated competing risks data.
Stat. Med. 30(16), 1933–1951 (2011)
D.M. Zucker, E. Lakatos, Weighted log rank type statistics for comparing survival
curves when there is a time lag in the effectiveness of treatment. Biometrika
77, 853–864 (1990)
Index
  Historical background, 86
Cramer-Rao inequality, 414
Cumulative generating function, 365
Cumulative hazard transform, 362

D
DeMoivre-Laplace approximation, 366
Densities, 358
Density of a sum, 359
Distributions, 358
  Difference of random variables, 359
  Sums of random variables, 359

E
Empirical distribution function, 353
Empirical estimates
  Model verification, 69
Epidemiology, 97
Estimating equations, 141, 413
  Censoring, 170
  Distribution of solution, 178
  Large sample properties, 169
  Marginal survival, 51
  Maximum likelihood, 50, 417
  Method of moments, 416
  Minimum chi-square, 415
  Misspecified models, 164
  Moments, 156
  Newton-Raphson iteration, 51
  Non-proportional hazards, 169
  Proportional hazards, 153
  Regularity conditions, 418
  Relative risk models, 159
  Residuals, 174
  Semi-parametric, 151
  Small samples, 176
  Stratified models, 158
Expectation, 364

F
Finance and insurance, 5
Frailty model, 9
Freireich data, 53, 147

G
Glivenko-Cantelli theorem, 57
Goodness of fit, 278
Goodness-of-fit tests, 69
Graphical methods, 263

H
Hazard and related functions, 21
Hazard function, 390
Helly-Bray theorem, 152, 353, 416
High dimensional sparse data, 203
Hypothesis tests, 301
  Area above curve, 320
  Area under curve, 315
  Combined tests, 306
  Concave alternatives, 328
  Convex alternatives, 329
  Delayed effect, 330
  Diminishing effect, 331
  Distance traveled, 310
  Integrated log-rank, 318
  Kolmogorov type tests, 312
  Log-rank test, 303
  Maximum statistics, 307
  Non-responders, 332
  Reflected Brownian motion, 313
  Restrictive adaptive, 323
  Supremum over cutpoints, 335
  Weighted Kaplan-Meier, 308
  Weighted log-rank, 304

I
Individual risks, 116
Integrals
  Lebesgue, 388
  Riemann-Stieltjes, 388
Intensity functions, 23
  Compartment models, 23

J
Jensen’s inequality, 364
Joint survival-covariate model, 12

K
Kaplan-Meier estimate, 23, 56, 61, 62
  Continuous version, 62
  Greenwood’s formula, 63
  Mean and Median, 67
  Precision, 63
  Redistribution to the right, 66
  Transformations, 63
  Variance, 63, 67

L
Large sample theory, 226
Law of iterated logarithm, 411
Learning and classification, 9
Likelihood
  Conditional, 426
  Cox’s conditional likelihood, 429
  Exponential model, 145
  Marginal, 430
  Nonparametric exponential, 149
  Parametric models, 143
  Partial, 10, 428
Linear models
  Additive, 136
  Transformation models, 137
Log-minus-log transformation, 33
Logistic regression, 102

M
Mantel-Haenszel estimate, 101
Marginal survival, 49
  Kaplan-Meier estimator, 49
Martingales, 385, 392
  Compensator, 393
  Doob-Meyer decomposition, 393
  Predictable variation process, 394
  Stochastic integrals, 386, 392, 395
Mean residual lifetime, 22
Mean value theorem, 352
Multistate models, 25
Multivariate normal distribution, 360

N
Nelson-Aalen estimate, 68
Non-proportional hazards, 77, 119, 161
Normal distribution, 358
  Mill’s ratio, 361

O
Observed information, 52
Order statistics, 366
  Distribution of difference, 367
  Expected values, 370
  Joint distribution, 367
  Markov property, 368
  Maximum of sample, 366
  Minimum of sample, 366
  Normal parent, 370

P
Parametric goodness-of-fit tests, 150
Permutation test, 209
Predictive indices, 12
  Interpretation, 14
Probability integral transform, 181, 361
Probability that Ti > Tj, 193
Prognostic biomarkers, 199
Proportional hazards, 78
  Applications in epidemiology, 98
  Average effect, 172
  Changepoint models, 131
  Cox model, 79
  Explained variation, 270
  Models with intercept, 129
  Partial, 120
  Predictive ability, 270
  Random effects, frailties, 124
  Stratified models, 121
  Time-dependent effects, 131

R
Random variables, 354
Registry data, 107
Regression effect process, 215
  Concave effects, 288