
Beta Survival Models

David Hubbard, Benoît Rostykus, Yves Raimond, Tony Jebara


{dhubbard,brostykus,tjebara,yraimond}@netflix.com
Netflix
Los Gatos, CA
arXiv:1905.03818v1 [cs.LG] 9 May 2019

ABSTRACT
This article analyzes the problem of estimating the time until an event occurs, also known as survival modeling. We observe through substantial experiments on large real-world datasets and use-cases that populations are largely heterogeneous. Sub-populations have different means and variances in their survival rates, requiring flexible models that capture heterogeneity. We leverage a classical extension of the logistic function into the survival setting to characterize unobserved heterogeneity using the beta distribution. This yields insights into the geometry of the problem as well as efficient estimation methods for linear, tree and neural network models that adjust the beta distribution based on observed covariates. We also show that the additional information captured by the beta distribution leads to interesting ranking implications as we determine who is most-at-risk. We show theoretically that the ranking is variable as we forecast forward in time and prove that pairwise comparisons of survival remain transitive. Empirical results using large-scale datasets across two use-cases (online conversions and retention modeling) demonstrate the competitiveness of the method. The simplicity of the method and its ability to capture skew in the data makes it a viable alternative to standard techniques, particularly when we are interested in the time to event and when the underlying probabilities are heterogeneous.

[Figure 1: Heterogeneity gives rise to different survival distributions and rankings. Three panels: Density (over θ), Retention (over time), and Ranking (over time) for three heterogeneous items.]

CCS CONCEPTS
• Mathematics of computing → Survival analysis; • Computing methodologies → Machine learning; Ranking; Classification and regression trees; Model development and analysis.

KEYWORDS
beta distribution, survival regression, ranking, nonlinear, boosting, heterogeneous

1 INTRODUCTION
Survival modeling, customer lifetime value [12] and product ranking [4, 20] are of practical interest when we want to estimate time until a specific event occurs or rank items to estimate which will encounter the event first. Traditionally leveraged in medical applications, today survival regression is extensively used in large-scale business settings such as predicting time to conversion in online advertising and predicting retention (or churn) in subscription services. Standard survival regression involves a maximum likelihood estimation problem over a specified continuous distribution of the time until event (exponential for the Accelerated Failure Time model [14]) or of the hazard function (in the case of Cox Proportional Hazards [7]). In practice, time to event problems are often converted to classification problems by choosing a fixed time horizon which is appropriate for the application at hand. One then has to balance training the model on recent data against a longer labeling horizon which might be more desirable. Survival models avoid this trade-off by relying on right-censoring. This maps it to a missing data problem where not all events are observed due to the horizon over which the data is collected.

There is evidence (which will be further demonstrated in this article) of the importance of heterogeneity in a variety of real-world time to event datasets. Heterogeneity indicates that items in a data-set have different survival means and variances. For instance, heterogeneity in a retention modeling context would be that as time increases the customers with the highest probability to retain are the ones which still remain in the dataset. Without considering this effect, it might appear that the baseline retention probability has increased over time when in fact the first order effect is that there is a mover/stayer bias. Thus methods which don't consider multiple decision points can fail to adequately account for this effect and thus fall victim to the so-called ruse of heterogeneity [22].

Consider the following example inspired by Ben-Porath [2] where we have 2 groups of customers: one in which the customers have a retention probability of 0.5, and another in which the customers are uniformly split between retention probabilities of either 1.0 or 0.0. In this case, after having observed only one decision point, we would observe the retention probabilities of the two groups to be identical. However, if we consider multiple decision points it becomes clear that the latter population has a much higher long term retention rate because some customers therein retain to infinity. In order to capture this unobserved heterogeneity we need a distribution that is flexible enough to capture these dynamics and ideally as simple as possible. To that end we posit a beta prior on our instantaneous event probabilities. The beta has only 2 parameters, yet is flexible enough to capture right/left skewed, U-shaped, or normal distributions.
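To make the mover/stayer effect concrete, here is a small simulation sketch of the two-group example above (our own illustration; the sample size, horizon and seed are arbitrary choices, not from the paper). Both groups retain half of their customers at the first decision point, but their survival curves diverge afterwards:

import numpy as np

rng = np.random.default_rng(0)
n, horizon = 100_000, 5

# Group A: every customer retains with probability 0.5 at each decision point.
theta_a = np.full(n, 0.5)
# Group B: half the customers always retain (p=1.0), half never do (p=0.0).
theta_b = rng.choice([1.0, 0.0], size=n)

for theta, name in [(theta_a, "homogeneous"), (theta_b, "mover/stayer")]:
    alive = np.ones(n, dtype=bool)
    for t in range(1, horizon + 1):
        alive &= rng.random(n) < theta  # each customer flips their retention coin
        print(f"{name}: S({t}) = {alive.mean():.3f}")

Group A's survival decays as 0.5^t while group B's stays flat at 0.5, even though both show S(1) = 0.5.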
Consider the example in Figure 1. A data-set contains three heterogeneous items (green dots, orange plus and blue cross). These items are each characterized by beta distributions (left panel). At each time period, each item samples a Bernoulli distributed coin from its beta distribution and flips it to determine if the item will retain. In the middle panel, we see the retention of the items over time and in the right-most panel we see the ranking of the items over time. Even though the items are sampling from fixed beta distributions, the ranking of which item is most at risk over time changes. Thus, a stationary set of beta distributions leads to non-stationary survival and rankings. Such nuance cannot be captured by summarizing each item with only a point-estimate of survival (as opposed to a 2-parameter beta distribution).

Due to the discrete and repeat nature of the decision problems over time, we leverage a geometric hypothesis to recover survival distributions. We estimate the parameters of this model via an empirical Bayes method which can be efficiently implemented through the use of analytical solutions to the underlying integrals. This model, termed the beta-logistic, was first introduced by Heckman and Willis [13], and was also studied by Fader and Hardie [8]. We find that in practice this model fits the discrete decision data quite well, and that it allows for accurate projections of future decision points.

We extend the beta-logistic model to the case of large-scale trees or neural-network models that adjust the beta distribution given input covariates. These leverage the use of recurrence relationships to efficiently compute the gradient. Through the beta prior underpinning the model, we show empirically that the beta-logistic is able to model a wide range of heterogeneous behaviors that other survival or binary classification models fail to capture, especially in the presence of skew. As we will see, the beta-logistic model outperforms a typical binary logistic regression in real-world examples, and provides tighter estimated posteriors compared to a typical Laplace approximation.

We also present theoretical results on ranking with beta distributions. We show that pairwise comparisons between beta distributions can be summarized by the medians of the two distributions. This makes ranking with beta distributions a provably transitive problem (pairwise distribution comparisons are generally non-transitive [21]). Therefore, using the medians of beta distributions allows us to rank the entire population to see which subscribers or which items are most-at-risk. The results are then extended to the case where we rank items across multiple time horizons by approximating the evolution of the survival distribution over time as a product of beta distributions as in [10]. Thus we obtain a consistent ranking of items which evolves over time and is summarized by medians (rather than means) to improve the accuracy of ranking who is most-at-risk.

This paper is organized as follows. We first show the beta-logistic derivation as well as reference the recursion formulas which make the computation efficient. We also make brief remarks about convexity and observe that in practice we rarely encounter convergence issues. We then present several simulated examples to help motivate the discussion. This is followed by an empirical performance evaluation of the various models across three large real-world datasets: a sparse online conversion dataset and two proprietary datasets from a popular video streaming service involving subscription and viewing behaviors. In all the examples, the beta-logistic outperforms other baseline methods, and seems to perform better in practice regardless of how many attributes are considered. Even though there will always be unobserved variations between individuals that influence the time-to-event, this empirical evidence is consistent.

2 THE BETA-LOGISTIC FOR SURVIVAL REGRESSION

2.1 Model derivation
Denote by (xi, ti, ci) ∈ R^d × N × {0, 1} a dataset where xi are covariates, ti is the discrete time to event for an observed (i.e. uncensored) datapoint (ci = 0) and ti is the right-censoring time for a datapoint for which the event of interest hasn't happened yet (ci = 1). A traditional survival model would posit a parametric distribution p(T |x) and then try to maximize the following empirical likelihood over a class of functions f:

    L = ∏_{∀i, ci=0} P(T = ti | f(xi)) · ∏_{∀i, ci=1} P(T > ti | f(xi)).

Unfortunately, unless constrained to a few popular distributions such as the exponential one, the maximization of such a quantity is usually intractable for most classes of functions f(x).

Let us instead assume that at each discrete decision point, a customer decides to retain with some (point-estimate) probability 1 − θ where θ is some function of the covariates x. Then we further assume that the instantaneous event probability at decision point t is characterized by a shifted geometric distribution as follows:

    P(T = t |θ) = θ(1 − θ)^{t−1}, where θ ∈ [0, 1].

This then gives the following survival equation:

    P(T > t |θ) = 1 − Σ_{i=1}^{t} P(T = i |θ).    (1)

This geometric assumption follows from the discrete nature of the decisions customers need to make in a subscription service, when continuing to next episodes of a show, etc. It admits a simple and straightforward survival estimate that we can also use to project beyond our observed time horizon. Now in order to capture the heterogeneity in the data, we can instead assume that θ follows a conditional beta prior (B) as opposed to being a point-estimate as follows:

    f(θ |α(x), β(x)) = θ^{α(x)−1} (1 − θ)^{β(x)−1} / B(α(x), β(x))

where α(x) and β(x) are some arbitrary positive functions of covariates (e.g. measurements that characterize a specific customer or a specific item within the population).

Consider the Empirical Bayes method [11] (also called Type-II Maximum Likelihood Estimation [3]) as an estimator for α(x) and β(x) given the data:

    max_{α,β} L(α, β)

where

    L(α, β) = ∏_{∀i, ci=0} P(T = ti |α(xi), β(xi)) · ∏_{∀i, ci=1} P(T > ti |α(xi), β(xi)).    (2)

Using the marginal likelihood function we obtain:

    P(T |α(x), β(x)) = ∫_0^1 f(θ |α(x), β(x)) P(T |θ) dθ.
As we will see in the next section, a key property of the beta-logistic model is that it makes the maximization of Equation 2 tractable. Since α and β have to be positive to define valid beta-distributions, we use an exponential reparameterization and aim to estimate functions a(x) and b(x) such that:

    α(x) = e^{a(x)} and β(x) = e^{b(x)}.

Throughout the paper, we will also assume that a and b are twice-differentiable.

The name beta-logistic for such a model was coined by [13] and studied when the predictors a(x) = γa · x and b(x) = γb · x are linear functions. In this case, at T = 1, observe that if we want to estimate the mean, this reduces to an over-parameterized logistic regression:

    P(T = 1|α(x), β(x)) = α(x) / (α(x) + β(x)) = 1 / (1 + e^{(γb − γa)ᵀx}).

2.2 Algorithm
We will now consider the general case where a(x) and b(x) are nonlinear functions and could be defined by the last layer of a neural network. Alternatively, they could be generated by a vector-output Gradient Boosted Regression Tree (GBRT) model. Using properties of the beta function (see A.1), one can show that:

    P(T = 1|α, β) = α / (α + β)

and

    P(T > 1|α, β) = β / (α + β).

Further, the following recursion formulas hold for t > 1:

    P(T = t |α, β) = ((β + t − 2) / (α + β + t − 1)) · P(T = t − 1|α, β)    (3)

and

    P(T > t |α, β) = ((β + t − 1) / (α + β + t − 1)) · P(T > t − 1|α, β).    (4)

If we denote ℓ = − log L as the function we wish to minimize, Equation 3 and Equation 4 allow us to derive (see Appendix A.2) recurrence relationships for individual terms of ∂ℓ/∂· and ∂²ℓ/∂·². This makes it possible, for example, to implement custom loss gradient and Hessian callbacks in popular GBRT libraries such as XGBoost [6] and lightGBM [15]. In this case, the GBRT models have "vector-output" and predict for every row both a = log(α) and b = log(β) jointly from a set of covariates, similarly to how the multinomial logit loss is implemented in these libraries. More precisely, choosing a squared loss for the split criterion in each node as is customary, the model will equally weight how well the boosting residuals (gradients) with respect to a and b are regressed on.

Note that because of the inherent discrete nature of the beta-logistic model, the computational complexity of evaluating its gradient on a given datapoint is proportional to the average value of ti. Therefore, a reasonable time step discretization value needs to be chosen to properly capture the survival dynamics while allowing fast inference. One can similarly implement this loss in deep learning frameworks. One would typically explicitly pad the label vectors ti with zeros up to the max censoring horizon (which would bring average computation per row to O(max_i ti) for the mini-batch) so that computation can be expressed through vectorized operations, or via frameworks such as Vectorflow [19] or Tensorflow [1] ragged tensors that allow for variable-length inputs to bring the computational cost back to O(avg_i ti).
 (α i + 1) (β +u−2) if c i = 0
properties of the beta function (see A.1), one can show that: 

 2
(α +β +u−1)2
ki =

α i i
β i2 −(u−1)(α i +u−1)
i
P(T = 1|α, β) =
α +β α i (βi +u−1)2 (α i +βi +u−1)2 otherwise.



and
β We see that the log-likelihood of the shifted-beta geometric
P(T > 1|α, β) = . model is always convex in α when a is linear. Further we can see
α +β
Further, the following recursion formulas hold for t > 1: that when all points are observed (no censoring), and the maximum
horizon is T = 2 then Equation 5 is also convex in b.
β +t −2
 
P(T = t |α, β) = P(T = t − 1|α, β) (3) Subsequent terms are not convex, however, but despite that
α +β +t −1 in practice we do not encounter significant convexity issues (e.g.
and local minima and saddle points). It seems likely that in practice
β +t −1
  the convex terms of the likelihood dominate the non-convex terms.
P(T > t |α, β) = P(T > t − 1|α, β). (4)
α +β +t −1 Note once again that there is generally no global convexity of the
objective function.
If we denote ℓ = − log L as the function we wish to minimize,
Equation 3 and Equation 4 allow us to derive (see Appendix A.2)
∂ℓ ∂2 ℓ
3 RANKING WITH THE BETA-LOGISTIC
recurrence relationships for individual terms of
∂·
and 2 . This
∂· Given n beta distributions another relevant question for the busi-
makes it possible for example to implement a custom loss gradient ness is how do we rank them from best to worst? This is crucial
and Hessian callbacks in popular GBRT libraries such as XGBoost if, for instance, products need to be ranked to prioritize mainte-
[6] and lightGBM [15]. In this case, the GBRT models have "vector- nance as in [20]. For example, we may be interested in ranking
output" and predict for every row both a = log(α) and b = log(β) beta distributions for news articles where we are estimating the
jointly from a set of covariates, similarly to how the multinomial probability that an article will be clicked on. Or for a video subscrip-
logit loss is implemented in these libraries. More precisely, choosing tion service we could have n beta distributions over n titles each
a squared loss for the split criterion in each node as is customary, of which represents the probability that this title will be watched
the model will equally weight how well the boosting residuals to the final episode (title survival). How do we rank the n titles
(gradients) with respect to a and b are regressed on. (articles) from most watchable (readable) to least watchable? Let
Note that because of the inherent discrete nature of the beta- us abstractly view both problems as interchangeable where we are
logistic model, the computational complexity of evaluating its gra- ranking items.
dient on a given datapoint is proportional to the average value of We have two items (u and v), each with their own α, β parameters
ti . Therefore, a reasonable time step discretization value needs to that define a beta distribution. Each beta distribution samples a
be chosen to properly capture the survival dynamics while allow- coin with a probability 1 − θ of heads (retaining) and θ for tails
ing fast inference. One can similarly implement this loss in deep (churning out). In the first time step, item v retains less than u if
learning frameworks. One would typically explicitly pad the la- it has a higher coin flip probability, e.g. p(θv > θu ) > 0.5 or the
bel vectors ti with zeros up to the max censoring horizon (which probability is larger than 50%. In the case of integer parameters,
3
this is given by the following integral:

    p(θv > θu) = ∫_{θu=0}^{1} ∫_{θv=θu}^{1} (1/B(αu, βu)) θu^{αu−1} (1 − θu)^{βu−1} · (1/B(αv, βv)) θv^{αv−1} (1 − θv)^{βv−1} dθv dθu

which simplifies [18] into

    p(θv > θu) = Σ_{i=0}^{αv−1} B(αu + i, βu + βv) / ((βv + i) B(1 + i, βv) B(αu, βu)).    (6)

If the quantity is larger than 0.5, then item v retains less than item u in the first time step. It turns out, under thousands of simulations, the less likely survivor is also the one with the larger median θ. Figure 2 shows a scatter plot as we randomly compare pairs of beta distributions. It is easy to see that the difference in their medians agrees with the probability p(θv > θu). So, instead of using a complicated formula to test p(θv > θu), we just need to compare the medians via the inverse incomplete beta function (betaincinv), denoted by I⁻¹(), and see if I⁻¹(0.5, αu, βu) > I⁻¹(0.5, αv, βv). A proof of this is below.

[Figure 2: The medians of beta distributions are consistent with the pairwise probabilistic ranking of beta distributions. Scatter plot of median(θv) − median(θu) against p(θv > θu).]

For later time steps, we will leverage a geometric assumption but applied to distributions rather than point estimates. The item which retains longer is the one with the lower product of repeated coin flip probabilities, i.e. p(θv^t > θu^t). In that case, the beta distributions get modified by taking them to powers. Sometimes the product of beta-distributed random variables is beta-distributed [10], but other times it is almost beta distributed. In general, we can easily compute the mean and variance of this power-beta distribution to find a reasonable beta approximation to it with its own α, β parameters. This is done by leveraging the derivation of [10] as follows. Assume we have a beta distribution p(θ |α, β) and define the new random variable z = (1 − θ)θ^{t−1}. We can then derive the first moment as

    S = E_{p(z)}[z] = (β/(α + β)) (α/(α + β))^{t−1}

and the second moment as

    T = E_{p(z)}[z²] = (α(α + 1)/((α + β)(α + β + 1)))^{t−1} (β(β + 1)/((α + β)(α + β + 1))).

Then, we approximate the distribution for the future event probabilities p(z) by a beta distribution where

    α̂ = (S − T)S / (T − S²)
    β̂ = (S − T)(1 − S) / (T − S²).    (7)

Therefore, in order to compare who is more likely to survive in future horizons, we can combine Equation 7 and Equation 1 to find the median of the approximated future survival distribution.
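As a concrete sketch of this ranking recipe (the helper names are ours; scipy's betaincinv is the I⁻¹() referenced above, and the example parameters reuse the Table 1 shapes), the following code moment-matches the power-beta approximation of Equation 7 and then ranks items at a horizon t by the median of their approximated future event-probability distributions:

from scipy.special import betaincinv

def power_beta_approx(alpha, beta, t):
    """Moment-match z = (1 - θ) θ^(t-1) with a beta distribution (Equation 7)."""
    S = (beta / (alpha + beta)) * (alpha / (alpha + beta)) ** (t - 1)
    T = (alpha * (alpha + 1) / ((alpha + beta) * (alpha + beta + 1))) ** (t - 1) \
        * (beta * (beta + 1) / ((alpha + beta) * (alpha + beta + 1)))
    alpha_hat = (S - T) * S / (T - S ** 2)
    beta_hat = (S - T) * (1 - S) / (T - S ** 2)
    return alpha_hat, beta_hat

def rank_most_at_risk(items, t):
    """Sort (name, alpha, beta) items by the median of the approximated future
    event-probability distribution, largest median (most-at-risk) first."""
    def median(item):
        name, alpha, beta = item
        a_hat, b_hat = power_beta_approx(alpha, beta, t)
        return betaincinv(a_hat, b_hat, 0.5)  # I^{-1}(0.5, a_hat, b_hat)
    return sorted(items, key=median, reverse=True)

print(rank_most_at_risk([("A", 4.75, 14.25), ("B", 0.5, 1.5), ("C", 0.083, 0.25)], t=5))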
Theorem 3.1. For random variables θu ∼ Beta(αu, βu) and θv ∼ Beta(αv, βv) for αu, αv, βu, βv ∈ N, p(θv > θu) > 0.5 if and only if I⁻¹(0.5, αv, βv) > I⁻¹(0.5, αu, βu) (the median of θv is larger than the median of θu).

Proof. We first prove that the median gives the correct winner under simplifying assumptions when both beta distributions have the same α or the same β.

First consider when the distributions have the same αu = αv = α and different βu and βv. In that case, Equation 6 simplifies to

    p(θv > θu) = Σ_{i=0}^{αv−1} B(α + i, βu + βv) / ((βv + i) B(1 + i, βv) B(α, βu)).    (8)

The formulas for p(θv > θu) and p(θu > θv) only differ in their denominators. Then, if βv > βu it is easy to show that

    (βv + i) B(1 + i, βv) B(α, βu) > (βu + i) B(1 + i, βu) B(α, βv).    (9)

Therefore, p(θu > θv) > p(θv > θu) if and only if βv > βu. Similarly, if βv > βu, the medians satisfy I⁻¹(0.5, αv, βv) < I⁻¹(0.5, αu, βu). This is true since, if all else is equal, increasing the β parameter reduces the median of a beta distribution. Therefore, for αu = αv, the median ordering is always consistent with the probability test.

An analogous derivation holds when the two distributions have the same βu = βv and different αu and αv. This is obtained by using the property p(θ |α, β) = p(1 − θ |β, α). Therefore, for βu = βv, the median ordering is always consistent with the probability test.

Next we generalize these two statements to show that the median ordering always agrees with the probability test. Consider the situation where median(αu, βu) < median(αu, βv) < median(αv, βv). Due to the scalar nature of the median of the beta distribution, we must have transitivity. We must also have median(αu, βu) < median(αv, βu) < median(αv, βv). Since each pair of inequalities
on medians requires that the corresponding statement on the probability tests also holds, the overall statement median(αu, βu) < median(αv, βv) must also imply that p(θv > θu) > p(θu > θv). □

Therefore, thanks to Theorem 3.1, we can safely rank order beta distributions simply by considering their medians. These rankings are not only pairwise consistent but globally consistent. Recall that pairwise ranking of distributions does not always yield globally consistent rankings, as popularly highlighted through the study of nontransitive dice [21]. Thus, given any beta distribution at a particular horizon, it is straightforward to determine which items are most at risk through a simple sorting procedure on the medians.
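The theorem is also easy to check numerically. Here is a small Monte Carlo sketch (ours, with arbitrary parameter ranges), which skips near-ties that sampling noise cannot resolve:

import numpy as np
from scipy.special import betaincinv

rng = np.random.default_rng(2)
for _ in range(200):
    au, bu, av, bv = rng.integers(1, 20, size=4)  # integer parameters, as in Theorem 3.1
    prob = (rng.beta(av, bv, 100_000) > rng.beta(au, bu, 100_000)).mean()
    med_u = betaincinv(au, bu, 0.5)
    med_v = betaincinv(av, bv, 0.5)
    if abs(prob - 0.5) > 0.01:  # skip ties the Monte Carlo estimate cannot resolve
        assert (prob > 0.5) == (med_v > med_u)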
Given this approach to ranking beta distributions, we can show the performance of our model-based ranking of users or items by holding out data and evaluating the time to event in terms of the AUC (area under the curve) of the receiver operating characteristic (ROC) curve.

4 EMPIRICAL RESULTS

4.1 Synthetic simulations
We present simulation results for the beta-logistic, and compare them to the logistic model. We also show that the beta-logistic successfully recovers the posterior for skewed distributions. In our first simulation we have 3 beta distributions which have very different shapes (see Table 1 below), but with the same mean (this example is inspired by Fader and Hardie [9]). Here, each simulated customer draws a coin from one of these distributions, and then flips that coin repeatedly until they have an event or we reach a censoring horizon (in this particular case we considered 4 decision points).

    shape          α       β       µ
    normal         4.75    14.25   0.25
    right skewed   0.5     1.50    0.25
    u shaped       0.083̄   0.25    0.25

Table 1: Heterogeneous beta distributions with identical means.

It is trivial to show that the logistic model will do no better than random in this case, because it is not observing the dynamics of the survival distribution which reveal the differing levels of heterogeneity underlying the 3 populations. If we allow the beta-logistic model to have a dummy variable for each of these cases then it can recover the posterior of each (see Figure 3). This illustrates an important property of the beta-logistic: it recovers posterior estimates even when the data is very heterogeneous and allows us to fit survival distributions well.

[Figure 3: Survival distributions as a function of time as well as an estimate of Ŝ(t) from the beta-logistic, for the three shapes of Table 1. Using a point-estimate of the mean θ̄ (as in the logistic model) fails to recover the heterogeneity.]

To create a slightly more realistic simulation, we can include another term which increases the homogeneity linearly in α and we add this as another covariate in the models. We also inject exponential noise into the α and β used for our random draws. Now, the logistic model does do better than random when there is homogeneity present (see Figure 4), however it still leaves signal behind by not considering the survival distribution. We additionally show results for a one time step beta-logistic which performs similarly to the logistic model. However it seems to have a lower variance, which perhaps indicates that its posterior estimates are more conservative, a property which will be confirmed in the next set of experiments.

[Figure 4: The level of heterogeneity increases from the left panel to the right panel as we add a linear term in α. Clearly, the mean of the beta-logistic 1 step (magenta plus) and logistic (cyan dot) are nearly identical, but the beta-logistic (orange cross) considers more survival information and outperforms both even when there is considerable homogeneity.]

4.2 Online conversions dataset
4.2.1 Survival modeling. We now evaluate the performance of the beta-logistic model on a large-scale sparse dataset. We use the Criteo online conversions dataset published alongside [5] and publicly available for download (http://labs.criteo.com/2013/12/conversion-logs-dataset/). We consider the problem of modeling the distribution of the time between a click event and a conversion event. We will consider a censoring window of 12 hours (61% of conversions happen within that window). As noted in [5], the exponential distribution fits the data reasonably well, so we will compare the beta-logistic model against the exponential distribution (1 parameter) and the Weibull distribution (2 parameters). Since the temporal integration of the beta-logistic model is intrinsically
discrete, we consider a time-discretization of 5 minute steps. We also add as baselines 2 logistic models: one trained at a horizon of 5 minutes (the shortest interval), and one trained at a horizon of 12 hours (the largest window). All conditional models are implemented as sparse linear models in Vectorflow [19] and trained through stochastic gradient descent. All survival models use an exponential reparameterization of their parameters (since the beta, exponential, and Weibull distributions all require positivity in their parameters). Censored events are down-sampled by a factor of 10x. We use 1M rows for training and 1M (held-out in time) rows for evaluation. The total (covariate) dimensionality of the problem is 102K after one-hot-encoding. Note that covariates are sparse and the overall sparsity of the problem is over 99.98%. Results are presented in Figure 5.

[Figure 5: AUC as a function of censoring horizon for the various models considered.]

The beta-logistic survival model outperforms other baselines at all horizons considered. Even though it is a 2-parameter distribution, the Weibull model is interestingly performing worse than the exponential survival model and the binary logistic classifier. We hypothesize that this is due to the poor conditioning of its loss function as well as the numerical instabilities during gradient and expectation computation (the latter requires function calls to the gamma function, which is numerically difficult to estimate for moderately small values and for large values).

4.2.2 Posterior size comparison. We next consider the problem as a binary classification task (did a conversion happen within the specified time window?). It is interesting to compare the confidence interval sizes of various models. For the conditional beta-logistic model, the prediction variance on datapoint x is given by:

    Var(x) = α(x)β(x) / ((α(x) + β(x))² (α(x) + β(x) + 1)).
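This is simply the variance of the Beta(α(x), β(x)) prediction. A one-function sketch (names ours) that combines it with the exponential reparameterization of Section 2:

import numpy as np

def beta_logistic_prediction(a, b):
    """Given model outputs a = log(alpha), b = log(beta) for one datapoint,
    return the mean P(T = 1 | x) and the prediction variance Var(x)."""
    alpha, beta = np.exp(a), np.exp(b)
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, var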
For a logistic model parameterized by θ ∈ R^d, a standard way to estimate the confidence of a prediction is through the Laplace approximation of its posterior [17]. In the high-dimensional setting, estimating the Hessian or its inverse becomes impractical (storage cost is O(d²) and matrix inversion requires O(d³) computation). In this scenario, it is customary to assume independence of the posterior coordinates and restrict the estimation to the diagonal of the Hessian h = 1/σ² ∈ R^d, which reduces both storage and computation costs to O(d). Hence under this assumption, for a given datapoint x the distribution of possible values for the random variable Y = θᵀx is also Gaussian with parameters:

    N(Σi θi xi, Σi σi² xi²).

If the full Hessian inverse H⁻¹ is estimated, then Y is Gaussian with parameters:

    N(θ · x, xᵀ · H⁻¹ x).

When Y is Gaussian, the logistic model prediction

    P(T = 1|x, θ) = 1 / (1 + exp(−Y))

has a distribution for which the variance v can be conveniently approximated. See [16] for various suggested approximation schemes. We chose to apply the following approximation:

    v = Φ( (πµ/√8 − 1) / √(π − 1 + π²σ²/8) ) − (1 + exp(−µ/√(1 + πσ²/8)))⁻².

Armed with this estimate for the logistic regression posterior variance, we run the following experiment: we random-project (using Gaussian vectors) the original high-dimensional data into a 50-dimensional space, in which we train beta-logistic classifiers and logistic classifiers at various horizons, using 50k training samples every time. We then compare the average posterior variance on a held-out dataset containing 50k samples. Holdout AUCs were comparable in this case for both models at all horizons. Two posterior approximations are reported for the logistic model: one using the full Hessian inverse and the other one using only the diagonalized Hessian. Results are reported in Figure 6.

Note that the beta-logistic model produces much smaller uncertainty estimates (between 20% and 45% smaller) than the logistic model with Laplace approximation. Furthermore, the growth rate as a function of the horizon of the binary classifier is also smaller for the beta-logistic approach. Also note that the Laplace posterior with diagonal Hessian approximation underestimates the posterior obtained using the full Hessian. Gaussian posteriors are obviously unable to appropriately model data skew.

This empirical result is arguably clear evidence of the superiority of the posteriors generated by the beta-logistic model over a standard Laplace approximation estimate layered onto a logistic regression model posterior. The beta-logistic posterior is also much cheaper to recover in terms of computational and storage costs. This also suggests that the beta-logistic model could be a
[Figure 6: The posterior variance of beta-logistic binary classifiers as well as logistic regressions trained on binary labels datasets with increasing censoring windows.]

viable alternative to standard techniques of explore-exploit models in binary classification settings.

4.3 Video streaming subscription dataset
4.3.1 Retention modeling. We now study the problem of modeling customer retention for a subscription business. We leverage a proprietary dataset from a popular video streaming service. In a subscription setting, when a customer chooses not to renew the service, the outcome is explicitly observed and logged. From a practical perspective, it is obviously preferable and meaningful when a customer's tenure with the service is n months rather than 1 month. It is also clearly valuable to be able to estimate and project that tenure accurately across different cohorts of customers. In this particular case the cohorts are highly heterogeneous as shown in Figure 7.

[Figure 7: Estimated churn probabilities f(θ) for 3 different cohorts. The large variations in the shape of the fitted distributions motivate the use of a beta prior on the conditional churn probability.]

In this example the data set had more than 10M rows and 500 columns. We trained 20 models on bootstraps of the data with a 10x downsample on censored customers. We used 4 discrete decision points to fit the model. Evaluation was done on a subset of 3M rows which was held out from the models, and held out in time as well over an additional 5 decision points (9 total months of data). All models are implemented as GBRTs and estimated using lightGBM. In Figure 8, we show the evaluation of the models across two cohorts: one with relatively little data and covariates to describe the customers (which should clearly benefit from modeling the unobserved heterogeneity) and one with much richer data and covariates. Surprisingly, even on the rich data set where one might argue there should be considerable homogeneity within any given subset of customers, we still find accuracy improvements by using the beta-logistic over the standard logistic model. This example illustrates how, regardless of how many covariates are considered, there is still considerable heterogeneity.

[Figure 8: Held-out AUC for two different cohorts of customers.]

4.3.2 Retention within shows. Another problem of importance to a video subscription business is ranking shows that customers are most likely to fully enjoy (i.e. customers watch the shows to completion across all the episodes). Here we model the distribution of survival of watched episodes of a show conditional on the customer having started the show. In Figure 9 we compare the performance of the beta-logistic to logistic models at an early (1 episode) horizon and a late horizon (8 episodes). The dataset contains 2k shows and spans 3M rows and 500 columns. We used a 50/50 train/test split. Model training used bootstrap methods to provide uncertainty estimates. All models are implemented as GBRTs which were estimated in lightGBM. In early horizons, the beta-logistic model once again provides significant AUC improvements over logistic models trained at either the 1 episode or the 8 episode horizon.

[Figure 9: The beta-logistic improves ranking accuracy (in terms of AUC) for early horizons.]

5 CONCLUSION
We noted that heterogeneity in the beta-logistic model can better capture the temporal effects of survival and ranking at multiple horizons. We extended the beta-logistic and its maximum likelihood estimation to linear, tree and neural models as well as characterized the convexity and properties of the learning problem. The resulting survival models give survival estimates and provably consistent rankings of who is most-at-risk at multiple time horizons. Empirical
results demonstrate that the beta-logistic is an effective model in discrete time to event problems, and improves over common baselines. It seems that in practice, regardless of how many attributes are considered, there are still unobserved variations between individuals that influence the time to event. Further, we demonstrated that we can recover posteriors effectively even when the data is very heterogeneous, and due to the speed and ease of implementation we argue that the beta-logistic is a baseline that should be considered in time to event problems in practice.

In future work, we plan to study the potential use of the beta-logistic in explore-exploit scenarios and as a viable option in reinforcement learning to model long-term consequences from near-term decisions and observations.

ACKNOWLEDGMENTS
The authors would like to thank Nikos Vlassis for his helpful comments during the development of this work, and Harald Steck for his thoughtful review and guidance.

REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. [n. d.]. Tensorflow: a system for large-scale machine learning.
[2] Yoram Ben-Porath. 1973. Labor-force participation rates and the supply of labor. Journal of Political Economy 81, 3 (1973), 697–704.
[3] James O Berger. 2013. Statistical decision theory and Bayesian analysis. Springer Science & Business Media.
[4] Allison Chang, Cynthia Rudin, Michael Cavaretta, Robert Thomas, and Gloria Chou. 2012. How to Reverse-Engineer Quality Rankings. Machine Learning 88 (September 2012), 369–398. Issue 3.
[5] Olivier Chapelle. 2014. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1097–1105.
[6] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 785–794.
[7] David R Cox. 1972. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 2 (1972), 187–202.
[8] Peter S Fader and Bruce GS Hardie. 2007. How to project customer retention. Journal of Interactive Marketing 21, 1 (2007), 76–90.
[9] Peter S Fader, Bruce GS Hardie, Yuzhou Liu, Joseph Davin, and Thomas Steenburgh. 2018. "How to Project Customer Retention" Revisited: The Role of Duration Dependence. Journal of Interactive Marketing 43 (2018), 1–16.
[10] Da-Yin Fan. 1991. The distribution of the product of independent beta variables. Communications in Statistics-Theory and Methods 20, 12 (1991), 4043–4052.
[11] Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. Bayesian data analysis. Chapman and Hall/CRC.
[12] Sunil Gupta, Dominique Hanssens, Bruce Hardie, William Kahn, V. Kumar, Nathaniel Lin, Nalini Ravishanker, and S. Sriram. 2006. Modeling customer lifetime value. Journal of Service Research 9, 2 (2006).
[13] James J Heckman and Robert J Willis. 1977. A Beta-logistic Model for the Analysis of Sequential Labor Force Participation by Married Women. Journal of Political Economy 85, 1 (1977), 27–58.
[14] John D Kalbfleisch and Ross L Prentice. 2011. The statistical analysis of failure time data. Vol. 360. John Wiley & Sons.
[15] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3146–3154.
[16] Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. 2012. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2. 19–36.
[17] David JC MacKay. 2003. Information theory, inference and learning algorithms. Cambridge University Press.
[18] Evan Miller. 2015. Bayesian AB Testing. http://www.evanmiller.org/bayesian-ab-testing.html#cite1
[19] Benoît Rostykus and Yves Raimond. 2018. vectorflow: a minimalist neural-network library. SysML (2018).
[20] Cynthia Rudin, David Waltz, Roger N. Anderson, Albert Boulanger, Ansaf Salleb-Aouissi, Maggie Chow, Haimonti Dutta, Philip Gross, Bert Huang, Steve Ierome, Delfina Isaac, Arthur Kressner, Rebecca J. Passonneau, Axinia Radeva, and Leon Wu. 2012. Machine Learning for the New York City Power Grid. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 2 (February 2012), 328–345.
[21] Richard P. Savage. 1994. The Paradox of Nontransitive Dice. The American Mathematical Monthly 101, 5 (1994).
[22] James W. Vaupel and Anatoliy Yashin. 1985. Heterogeneity's Ruses: Some Surprising Effects of Selection on Population Dynamics. The American Statistician 39 (09 1985), 176–185. https://doi.org/10.1080/00031305.1985.10479424

A BETA LOGISTIC FORMULAS

A.1 Recurrence derivation
This derivation is taken from Fader and Hardie [8] where they use it as a cohort model (also called the shifted beta geometric model) that is not conditional on a covariate vector x.

We do not observe θ, but its expectation given the beta prior (also called marginal likelihood) is given by:

    P(T = t |α, β) = ∫_0^1 θ(1 − θ)^{t−1} · (θ^{α−1}(1 − θ)^{β−1} / B(α, β)) dθ = B(α + 1, β + t − 1) / B(α, β).

We can write the above as:

    P(T = t |α, β) = (Γ(α + β) Γ(α + 1) Γ(β + t − 1)) / (Γ(α) Γ(β) Γ(α + β + t)).

Using the property Γ(z + 1) = zΓ(z) leads to equations (3) and (4), and at t = 1 we have

    P(T = 1|α, β) = (Γ(α + β) Γ(α + 1) Γ(β)) / (Γ(α) Γ(β) Γ(α + β + 1)) = α / (α + β).
Statistical Society: Series B (Methodological) 34, 2 (1972), 187–202. A.2 Gradients
[8] Peter S Fader and Bruce GS Hardie. 2007. How to project customer retention.
Journal of Interactive Marketing 21, 1 (2007), 76–90.
Note that for machine learning libraries that do not offer symbolic
[9] Peter S Fader, Bruce GS Hardie, Yuzhou Liu, Joseph Davin, and Thomas Steen- computation and auto-differentiation, taking the −loд of equations
burgh. 2018. "How to Project Customer Retention" Revisited: The Role of Duration (3) and (4) and differentiating leads to the following recurrence for-
Dependence. Journal of Interactive Marketing 43 (2018), 1–16.
[10] Da-Yin Fan. 1991. The distribution of the product of independent beta variables. mulas for the gradient of the loss function on a given data point with
Communications in Statistics-Theory and Methods 20, 12 (1991), 4043–4052. respect to the output parameters ai and bi of the model considered:
8
    ∂log(P(T = 1))/∂ai = (∂a/∂ai) (β/(α + β))

    ∂log(P(T = 1))/∂bi = −(∂b/∂bi) (β/(α + β))

These derivatives expand as follows:

    ∂log(P(T = t))/∂ai = ∂log(P(T = t − 1))/∂ai − (∂a/∂ai) (α/(α + β + t − 1))

    ∂log(P(T = t))/∂bi = ∂log(P(T = t − 1))/∂bi + (∂b/∂bi) ((α + 1)β / ((β + t − 2)(α + β + t − 1)))

We can get a similar recursion for the survival function:

    ∂log(P(T > 1))/∂ai = −(∂a/∂ai) (α/(α + β))

    ∂log(P(T > 1))/∂bi = (∂b/∂bi) (α/(α + β))

    ∂log(P(T > t))/∂ai = ∂log(P(T > t − 1))/∂ai − (∂a/∂ai) (α/(α + β + t − 1))

    ∂log(P(T > t))/∂bi = ∂log(P(T > t − 1))/∂bi + (∂b/∂bi) (αβ / ((β + t − 1)(α + β + t − 1)))

A.3 Diagonal of the Hessian
We obtain the second derivatives for the Hessian as follows:

    ∂²log(P(T = 1))/∂ai² = (∂²a/∂ai²) (β/(α + β)) − (∂a/∂ai)² (αβ/(α + β)²)

    ∂²log(P(T = 1))/∂bi² = −(∂²b/∂bi²) (β/(α + β)) − (∂b/∂bi)² (αβ/(α + β)²)

    ∂²log(P(T = t))/∂ai² = ∂²log(P(T = t − 1))/∂ai² − (∂²a/∂ai²) (α/(α + β + t − 1)) − (∂a/∂ai)² (α(β + t − 1)/(α + β + t − 1)²)

    ∂²log(P(T = t))/∂bi² = ∂²log(P(T = t − 1))/∂bi² + (∂²b/∂bi²) ((α + 1)β / ((β + t − 2)(α + β + t − 1))) + (∂b/∂bi)² ((α + 1)β (β² − (t − 2)(α + t − 1)) / ((β + t − 2)²(α + β + t − 1)²))

The survival counterparts to the above terms are also readily computed as follows:

    ∂²log(P(T > 1))/∂ai² = −(∂²a/∂ai²) (α/(α + β)) − (∂a/∂ai)² (αβ/(α + β)²)

    ∂²log(P(T > 1))/∂bi² = (∂²b/∂bi²) (α/(α + β)) − (∂b/∂bi)² (αβ/(α + β)²)

    ∂²log(P(T > t))/∂ai² = ∂²log(P(T > t − 1))/∂ai² − (∂²a/∂ai²) (α/(α + β + t − 1)) − (∂a/∂ai)² (α(β + t − 1)/(α + β + t − 1)²)

    ∂²log(P(T > t))/∂bi² = ∂²log(P(T > t − 1))/∂bi² + (∂²b/∂bi²) (αβ / ((β + t − 1)(α + β + t − 1))) + (∂b/∂bi)² (αβ (β² − (t − 1)(α + t − 1)) / ((β + t − 1)²(α + β + t − 1)²))

B ALTERNATIVE DERIVATION
Another intuitive derivation of the single-step beta-logistic is obtained by starting from the likelihood for a logistic model and modeling the probabilities with a beta distribution:

    L = ∏_i P(yi = 1|α, β)^{yi} (1 − P(yi = 1|α, β))^{1−yi}
      = ∏_{∀yi=1} P(yi = 1|α, β) · ∏_{∀yi=0} (1 − P(yi = 1|α, β))
      = ∏_{uncensored} P(T = 1|α, β) · ∏_{censored} P(T > 1|α, β).

This is exactly the survival likelihood for a 1 step beta-logistic model.
C REPRODUCIBILITY
We include simple Python implementations of the gradient callbacks that can be passed to XGBoost or lightGBM. Note that efficient implementations of these callbacks in C++ are possible and yield orders of magnitude speedups.
import numpy as np

def grad_BL(alpha, beta, t, is_censored):
    """
    This function computes the gradient of the beta logistic objective.
    Since it is vectorized in practice for performance reasons, here we write
    the non-vectorized version for readability.
    """
    N = len(alpha)
    g = np.zeros((N, 2))
    for j in range(0, N):
        if not is_censored[j]:
            # failed: gradient of log P(T = t) w.r.t. a and b (Appendix A.2)
            g[j, 0] = beta[j] / (alpha[j] + beta[j])
            g[j, 1] = -g[j, 0]
            for i in range(2, int(t[j] + 1)):
                g[j, 0] += -(alpha[j] / (alpha[j] + beta[j] + i - 1))
                g[j, 1] += beta[j] / (beta[j] + i - 2) - beta[j] / (alpha[j] + beta[j] + i - 1)
        else:
            # survived: gradient of log P(T > t) w.r.t. a and b
            g[j, 0] = -alpha[j] / (beta[j] + alpha[j])
            g[j, 1] = -g[j, 0]
            for i in range(2, int(t[j] + 1)):
                g[j, 0] += -(alpha[j] / (alpha[j] + beta[j] + i - 1))
                g[j, 1] += beta[j] / (beta[j] + i - 1) - beta[j] / (alpha[j] + beta[j] + i - 1)
    return g

def hess_BL(alpha, beta, t, is_censored):
    """
    This function computes the diagonal of the Hessian of the beta logistic objective.
    """
    N = len(alpha)
    h = np.zeros((N, 2))
    for j in range(0, N):
        # t = 1 term: -alpha*beta/(alpha+beta)^2 for both a and b (Appendix A.3)
        h[j, :] = -alpha[j] * beta[j] / ((alpha[j] + beta[j]) ** 2)
        if not is_censored[j]:
            # failed: second derivative of log P(T = t)
            for i in range(2, int(t[j] + 1)):
                h[j, 0] += -alpha[j] * ((beta[j] + i - 1) / (alpha[j] + beta[j] + i - 1) ** 2)
                d = ((beta[j] + i - 2) ** 2) * ((alpha[j] + beta[j] + i - 1) ** 2)
                h[j, 1] += beta[j] * (alpha[j] + 1) * (beta[j] ** 2 - (i - 2) * (alpha[j] + i - 1)) / d
        else:
            # survived: second derivative of log P(T > t)
            # note (beta + i - 1)^2 in the denominator, per the survival Hessian of A.3
            for i in range(2, int(t[j] + 1)):
                h[j, 0] += -alpha[j] * ((beta[j] + i - 1) / (alpha[j] + beta[j] + i - 1) ** 2)
                d = ((beta[j] + i - 1) ** 2) * ((alpha[j] + beta[j] + i - 1) ** 2)
                h[j, 1] += beta[j] * alpha[j] * (beta[j] ** 2 - (i - 1) * (alpha[j] + i - 1)) / d

    h.shape = (N * 2,)  # flatten to shape (N*2,)
    return h
def likelihood_BL(alpha, beta, t, is_censored):
    """
    This function computes the beta logistic objective (likelihood: higher = better).
    Since it is heavily vectorized in practice for performance reasons, we write
    here the non-vectorized version for readability.
    """
    p = alpha / (alpha + beta)   # P(T = 1)
    s = 1 - p                    # P(T > 1)
    for j in range(0, len(alpha)):
        for i in range(2, int(t[j] + 1)):
            p[j] = p[j] * (beta[j] + i - 2) / (alpha[j] + beta[j] + i - 1)
            s[j] = s[j] - p[j]
    return p * (1.0 - is_censored) + s * is_censored
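As a usage sketch (ours, not from the paper; it assumes grad_BL as defined above), the callbacks can be sanity-checked by fitting a single global (a, b) = (log α, log β) to data simulated from a known beta prior, via plain gradient ascent on the mean per-row log-likelihood:

import numpy as np

rng = np.random.default_rng(1)
n, horizon = 5_000, 8

# Simulate shifted-geometric event times with θ ~ Beta(2, 6), right-censored at `horizon`.
theta = rng.beta(2.0, 6.0, size=n)
t = rng.geometric(theta)
is_censored = (t > horizon).astype(float)
t = np.minimum(t, horizon)

a, b = 0.0, 0.0  # shared parameters: alpha = e^a, beta = e^b
lr = 0.1
for step in range(300):
    alpha = np.full(n, np.exp(a))
    beta = np.full(n, np.exp(b))
    g = grad_BL(alpha, beta, t, is_censored)  # per-row gradients of the log-likelihood
    a += lr * g[:, 0].mean()
    b += lr * g[:, 1].mean()

# With enough data and steps, these estimates should drift toward the simulated (2, 6).
print(np.exp(a), np.exp(b))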
