Abstract
The individualized treatment rule (ITR), which recommends an optimal treatment based on individual
characteristics, has drawn considerable interest from many areas such as precision medicine, personalized
education, and personalized marketing. Existing ITR estimation methods mainly recommend one of two or more treatments. However, a combination of multiple treatments could be more powerful in various areas. In this
paper, we propose a novel double encoder model (DEM) to estimate the ITR for combination treatments. The
proposed double encoder model is a nonparametric model which not only flexibly incorporates complex
treatment effects and interaction effects among treatments but also improves estimation efficiency via the
parameter-sharing feature. In addition, we tailor the estimated ITR to budget constraints through a multi-
choice knapsack formulation, which enhances our proposed method under restricted-resource scenarios. In
theory, we provide the value reduction bound with or without budget constraints, and an improved
convergence rate with respect to the number of treatments under the DEM. Our simulation studies show
that the proposed method outperforms existing ITR estimation methods in various settings. We also demonstrate
the superior performance of the proposed method in patient-derived xenograft data, recommending optimal combination treatments to shrink the tumour size of colorectal cancer.
Keywords: causal inference, combination therapy, decision-making, multi-choice knapsack, neural network, precision
medicine
1 Introduction
Individualized decision-making has played a prominent role in many fields such as precision medicine, personalized education, and personalized marketing due to the rapid development of personalized data collection. For example, in precision medicine, individualized treatments based on individuals' demographic information and their overall comorbidity improve healthcare quality (Schmieder et al., 2015). However, most existing individualized decision-making approaches select one out of multiple treatments, whereas recent advances in medical research have suggested that applying multiple treatments simultaneously, referred to as combination treatments, could enhance overall healthcare. Specifically, combination treatments are able to reduce treatment failure or fatality rates, and overcome treatment resistance for many chronic diseases (e.g. Bozic et al., 2013; Forrest & Tamura, 2010; Kalra et al., 2010; Korkut et al., 2015; Maruthur et al., 2016; Mokhtari et al., 2017; Möttönen et al., 1999; Tamma et al., 2012). Therefore, it is critical to develop a novel statistical method to recommend individualized combination treatments.
There are various existing methods for estimating the optimal individualized treatment rule
(ITR). The first approach is the model-based approach, which estimates an outcome regression
model given pre-treatment covariates and the treatment. The optimal ITR is derived by maximizing the outcome over possible treatments conditioned on the pre-treatment covariates. Existing
works such as Q-learning (Moodie et al., 2012; Qian & Murphy, 2011), A-learning (Lu et al.,
Received: March 15, 2022. Revised: September 26, 2023. Accepted: December 10, 2023
© The Royal Statistical Society 2024. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
J R Stat Soc Series B: Statistical Methodology, 2024, Vol. 86, No. 3 715
2013; Shi et al., 2018), and D-learning (Meng & Qiao, 2020; Qi et al., 2020; Qi & Liu, 2018) all
belong to this approach. The other approach is known as the direct-search approach, which directly maximizes the expected outcome over a class of decision functions to obtain an optimal ITR. The seminal works of the direct-search approach include outcome weighted learning (OWL) (Huang et al., 2019; Zhao et al., 2012), residual weighted learning (Zhou et al., 2017), and augmented OWL (Liu et al., 2018; Zhao et al., 2019; Zhou & Kosorok, 2017). However, the aforementioned methods in these two categories are designed for selecting one optimal treatment out of multiple candidates, and are not directly applicable to deriving budget-constrained decisions for multi-arm or combination treatment scenarios.
In regard to the theoretical properties of the estimated ITR, we provide the value reduction bound for the ITR for combination treatments with or without budget constraints. Thereafter, we provide a non-asymptotic value reduction bound for the DEM, which guarantees that the value function of the estimated ITR converges to the optimal value function with high probability, and that the proposed method achieves a faster convergence rate with respect to the number of combination treatments.
where d(·) : X → A is an ITR. The value function is defined as the expectation of the potential outcomes over the population distribution of (X, A, Y) under A = d(X), which is estimable when the
following causal assumptions (Rubin, 1974) hold:
Assumption 1 (a) Stable unit treatment value assumption: Y = Y(A); (b) no unmeasured confounders: A ⊥⊥ Y(a) | X, for any a ∈ A; (c) positivity: P(A = a | X) ≥ p_A, ∀a ∈ A, ∀X ∈ X, for some p_A > 0.
Assumption (a) is also referred to as ‘consistency’ in causal inference, which assumes that the
potential outcomes of each subject do not vary with treatments assigned to other subjects. The
treatments are well defined in that the same treatment leads to the same potential outcome.
Assumption (b) states that all confounders are observed in pre-treatment covariates, so that the treatment and potential outcomes are conditionally independent given the pre-treatment covariates. Assumption (c) claims that for any pre-treatment covariates X, each treatment can be assigned with a positive probability.
where I(·) is the indicator function. To maximize the value function, we can first estimate the conditional expectation E(Y|X = x, A = a), namely, the Q-function in the literature (Clifton & Laber,
2020). Then the optimal ITR can be obtained by
From the perspective of the multi-arm treatments, the Q-function (Kosorok & Laber, 2019;
Qi et al., 2020; Qian & Murphy, 2011) can be formulated as
E(Y | X, Ã) = m(X) + Σ_{l=1}^{|A|} δ_l(X) I(Ã = l)    (4)
where m(X) is the treatment-free effect representing a null effect without any treatment, and the functions δ_l(X)'s are treatment effects for the lth treatment. There are two major challenges when (4) is applied to the combination treatments problem: first, if the δ_l(·)'s are imposed to be some parametric model, for example, a linear model (Kosorok & Laber, 2019; Qian & Murphy, 2011), it could have a severe mis-specification issue, especially considering the complex nature of the interaction effects of combination treatments. Second, as the number of treatments K increases, the number of treatment-specific functions δ_l(·)'s could grow exponentially. Therefore, the estimation efficiency of the ITR based on the Q-function (4) could be severely compromised for either parametric or nonparametric models, especially in clinical trials or observational studies with limited sample sizes.
In addition, the combination of multiple treatments expands the treatment space A and provides many more feasible treatment options. Each individual therefore has more choices than the yes-or-no decision in the binary treatment scenario, and it is natural to consider accommodating realistic budget constraints while maintaining an effective outcome. In this paper, we further consider a population-level budget constraint as follows. Suppose the costs of the K treatments are c = (c_1, c_2, . . . , c_K), where c_k denotes the cost of the kth treatment. Then the budget constraint for a population with a sample size n is
C_n(d) := (1/n) Σ_{i=1}^n c^T d(X_i) ≤ B    (5)
where B is the average budget for each subject. This budget constraint is suitable for many policy-
making problems such as welfare programmes (Bhattacharya & Dupas, 2012) and vaccination
distribution problem (Matrajt et al., 2021).
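As a concrete illustration, constraint (5) can be checked directly once each subject's combination assignment d(X_i) is coded as a binary vector over the K treatments. The sketch below is purely illustrative; the costs and assignments are made up:

```python
def budget_feasible(assignments, costs, B):
    """Check the population-level budget constraint (5):
    C_n(d) = (1/n) * sum_i c^T d(X_i) <= B,
    where each assignment d(X_i) is a binary vector over K treatments."""
    n = len(assignments)
    total = sum(sum(c * a for c, a in zip(costs, d_i)) for d_i in assignments)
    return total / n <= B

# Three subjects, K = 2 treatments with costs c = (1.0, 2.0):
costs = [1.0, 2.0]
d = [[1, 0], [1, 1], [0, 0]]   # subject-level combination assignments
print(budget_feasible(d, costs, B=1.5))  # average cost (1 + 3 + 0)/3 = 4/3 <= 1.5
```

The average cost here is 4/3, so the assignment is feasible for B = 1.5 but not for B = 1.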
3 Methodology
In Section 3.1, we introduce the proposed DEM for estimating the optimal ITR for combination
treatments. Section 3.2 considers the optimal assignment of combination treatments under budget
constraints. The estimation procedure and implementation details are provided in Section 3.3.
where m(·) : X → R is the treatment-free effect as in (4), and α(·) : X → R^r is an encoder that represents the covariates in an r-dimensional latent space, so that the treatment effect of the lth treatment can be written as δ_l(X) = Σ_{i=1}^r β^(i)(Ã_l) α^(i)(X).
Note that the model for multi-arm treatments (4) is a special case of the DEM (6) where α(X) = (δ_1(X), . . . , δ_{|A|}(X)) and β(Ã) = (I(Ã = Ã_1), . . . , I(Ã = Ã_{|A|})) if r = |A|. Another special case of (6) is angle-based modelling (Qi et al., 2020; Xue et al., 2021; Zhang et al., 2020), which has been applied to the estimation of the ITR for multi-arm treatments. In the angle-based framework, each treatment is encoded with a fixed vertex in the simplex, and each subject is projected into a latent space of the same dimension as the treatments, so that the optimal treatment is determined by the angle between the treatment vertices and the subject latent factors. However, the dimension of the simplex and latent space is r = |A| − 1, so angle-based modelling suffers from the same inefficiency issue as (4).
Since different combination treatments could contain the same individual treatments, it is over-parameterized to model the treatment effects of each combination treatment independently. For instance, the treatment effect of the combination of drug A and drug B is correlated with the individual treatment effects of drug A and drug B, respectively. Therefore, we seek a low-dimensional function space that incorporates the correlation among combination treatments without over-parametrization. In the DEM (6), the dimension r of the encoder outputs controls the complexity of the function space spanned by α^(1)(·), . . . , α^(r)(·). Empirically, the dimension r is a tuning parameter, which can be determined via the hyper-parameter tuning procedure. In other words, the complexity of the DEM is determined by the data itself, rather than pre-specified. In addition, the reduced dimension also leads to a parsimonious model with fewer parameters, which permits efficient estimation of treatment effects. Furthermore, we do not impose any parametric assumptions on α(·), which allows us to employ flexible nonlinear or nonparametric models with r-dimensional output to avoid potential mis-specification of treatment effects.
Given the double encoder framework in (6), the treatment effects of the combination treatments
share the same function bases α(1) (·),…, α(r) (·). Therefore, the treatment encoder β(·) is necessary to
represent all treatments in A so that α(X)T β(A) can represent treatment effects for all treatments.
Through this modelling strategy, we convert the complexity of the |A| treatment-specific functions δ_l(·)'s in (4) into the representation complexity of β(·), in that β(·) represents the |A| treatments in an r-dimensional latent space. As a result, we can reduce the complexity of the combination treatment problem and achieve efficient estimation if an efficient representation (i.e. r ≪ |A|) of the |A| treatments can be found.
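To make the parameter-sharing idea concrete, the following toy sketch shows how every combination treatment reuses the same r covariate basis functions α^(1)(·), . . . , α^(r)(·), differing only through the coefficients supplied by β(·). The encoders, weights, and dimensions (r = 2, K = 3) here are hypothetical hand-picked values for illustration, not the fitted model:

```python
import math

r = 2  # latent dimension, a tuning parameter (r << |A| for efficiency)

def alpha(x):
    """Toy covariate encoder: maps covariates to r shared basis values.
    (Hypothetical nonlinear features, for illustration only.)"""
    return [math.tanh(x[0]), x[1] ** 2]

def beta(a):
    """Toy treatment encoder: maps a binary combination a in {0,1}^K
    to r-dimensional coefficients; W is shared across all treatments."""
    W = [[0.5, -0.2, 0.1],   # row i holds weights for basis i
         [0.3, 0.4, -0.6]]
    return [sum(w * ak for w, ak in zip(W[i], a)) for i in range(r)]

def delta(x, a):
    """Treatment effect delta(x, a) = alpha(x)^T beta(a): every combination
    reuses the same r basis functions, only the coefficients change."""
    return sum(ai * bi for ai, bi in zip(alpha(x), beta(a)))

print(delta([0.5, 1.0], [1, 1, 0]))
```

Changing the combination a only re-weights the same two basis values of α(x), which is exactly the source of the estimation-efficiency gain described above.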
In summary, the DEM (6) is a promising framework to tackle the two challenges in (4) if the covariates and treatment encoders can provide flexible and powerful representations of covariates and treatments, respectively, which will be elaborated in the following sections. Before we dive into the details of the covariates and treatment encoders, we first show the universal approximation property of the DEM, which guarantees its flexibility in approximating complex treatment effects.
Theorem 1 For any treatment effects δ_l(X) ∈ H_2 = {f : ∫_{x∈X} |f^(2)(x)|² dx < ∞}, and for any ϵ > 0, there exist α(·) : X → R^r and β(·) : A → R^r, where K ≤ r ≤ |A|, such that
The above theorem guarantees that the DEM (6) can represent the function space considered in
(4) sufficiently well given a sufficiently large r.
where β_0(A) and β_1(A) are the additive and interactive treatment encoders, respectively. In particular, β_0(A) is a linear function with respect to A, where W = (W_1, W_2, . . . , W_K) ∈ R^{r×K} and W_k is the latent representation of the kth treatment. As a result, α(X)^T β_0(A) = Σ_{k: A_k = 1} W_k^T α(X) gives the additive treatment effects of the combination treatment A. The constraints on β_1(·) ensure the identifiability of β_0(·) and β_1(·), such that any representation β(A) can be uniquely decoupled into β_0(A) and β_1(A).
The interaction effects are challenging to estimate in combination treatments. A naive solution is to assume that the interaction effects are ignorable, in which case the additive treatment encoder β_0(A) alone suffices to estimate the treatment effects of combination treatments. However, interaction effects are widely perceived in many fields such as medicine (Li et al., 2018; Stader et al., 2020), psychology (Caspi et al., 2010), and public health (Braveman et al., 2011). Statistically, ignoring the interaction effects could lead to inconsistent estimation of the treatment effects (Yu & Ding, 2023; Zhao & Ding, 2023) and the ITR (Liang et al., 2018). Hence, it is critical to incorporate the interaction effects in estimating the ITR for combination treatments.
A straightforward approach to modelling the interactive treatment encoder β_1(A), similar to the additive treatment encoder β_0(A), is what we call the treatment dictionary. Specifically, a matrix V = (V_1, V_2, . . . , V_{|A|}) ∈ R^{r×|A|} is a dictionary that stores the latent representation of each combination treatment, so that β_1(A) is defined as follows:
where e_Ã is the one-hot encoding of the categorical representation of A. Since the number of possible combination treatments |A| could grow exponentially as K increases, the number of parameters in V could also explode. Even worse, each column V_l can be updated only if the associated treatment Ã_l is observed. Given a limited sample size, each treatment may be observed only a few times in combination treatment scenarios, which leads the estimation efficiency of V to be severely compromised. The same puzzle is also observed in other methods. For the Q-function in (4), the
parameters in δ_l(X) can be updated only if Ã_l is observed; in the treatment-agnostic representation network (TARNet) (Shalit et al., 2017) and the Dragonnet (Shi et al., 2019), each treatment is associated with an independent set of regression layers to estimate the treatment-specific treatment effects, which results in inefficient estimation for combination treatment problems.
In order to overcome the above issue, we propose to utilize the feed-forward neural network
(Goodfellow et al., 2016) to learn efficient latent representations in the r-dimensional space.
Specifically, the interactive treatment encoder is defined as
where 𝒰_l(x) = U_l x + b_l is a linear operator with weight matrix U_l ∈ R^{r_l × r_{l−1}} and biases b_l. The activation function is chosen as the ReLU function σ(x) = max(x, 0) in this paper. An illustration of the neural network interactive treatment encoder is shown in Figure 1. Note that all parameters in (9) are shared among all possible treatments, so all of the weight matrices and biases in (9) are updated regardless of the input treatment, which could improve the estimation efficiency, even though (9) may include more parameters than the treatment dictionary (8). As a result, the DEM with (9) not only guarantees a faster convergence rate (with respect to K) of the value function but also improves the empirical performance, especially when K is large or the sample size n is small, which will be shown in the numerical studies and real data analysis. A direct comparison of the neural network interactive treatment encoder (9), the treatment dictionary (8), the additive model (4), TARNet (Shalit et al., 2017), and Dragonnet (Shi et al., 2019) is also shown in Figure 1.
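A minimal sketch of the shared-parameter idea may help: below, a two-layer ReLU network maps any binary combination a ∈ {0,1}^K to R^r with one common set of weights, in contrast to the one-column-per-treatment dictionary (8). All weights here are arbitrary illustrative values, not fitted parameters from the paper:

```python
import itertools

def relu(v):
    return [max(t, 0.0) for t in v]

def affine(U, b, x):
    """Linear operator U x + b, with U given as a list of rows."""
    return [sum(Ui[j] * x[j] for j in range(len(x))) + bi
            for Ui, bi in zip(U, b)]

# A two-layer interactive treatment encoder beta_1: {0,1}^K -> R^r.
# Every weight is shared by all combinations, so each update uses
# information from whichever treatment happens to be observed.
K, hidden, r = 3, 4, 2
U1 = [[0.2, -0.1, 0.4], [0.3, 0.3, -0.2], [-0.5, 0.1, 0.2], [0.1, 0.2, 0.3]]
b1 = [0.0] * hidden
U2 = [[0.5, -0.3, 0.2, 0.1], [-0.2, 0.4, 0.3, -0.1]]
b2 = [0.0] * r

def beta1(a):
    return affine(U2, b2, relu(affine(U1, b1, a)))

# The same parameters produce a representation for every a in {0,1}^K:
for a in itertools.product([0, 1], repeat=K):
    print(a, [round(v, 3) for v in beta1(list(a))])
```

Note that the dictionary (8) would need 2^K columns to cover the same input space, while this network's parameter count is fixed by the layer widths.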
Although the interactive treatment encoder (9) allows efficient estimation, it is not guaranteed to represent up to |A| interaction effects. In the treatment dictionary (8), the columns V_l's are free parameters that represent the |A| treatments without any constraints. However, an 'under-parameterized' neural network is not capable of representing |A| treatments in r-dimensional space. For example, suppose there are three treatments to be combined (K = 3) and the treatment effects are sufficiently captured by a one-dimensional α(X) with different coefficients (r = 1). We use the following one-hidden-layer neural network to represent the treatment in R:
where u_2, b_2, b_1 ∈ R are scalars and U_1 ∈ R^{1×3}. In other words, the hidden layer only includes one node. In the following, we show that this neural network can only represent restricted interaction effects:
Proposition 1 The one-hidden-layer neural network (10) can only represent the following interaction effects: (a) β_1(A) ≥ 0 or β_1(A) ≤ 0 for all A ∈ A, and (b) β_1(A) takes the same value for all combinations of two treatments.
The proof of Proposition 1 is provided in the online supplementary material. Based on the above
observation, it is critical to guarantee the representation power of β1 (A) to incorporate flexible
interaction effects. In the following, we establish a theoretical guarantee of the representation
power of β1 (·) under a mild assumption on the widths of neural networks:
The above result is adapted from recent work on the memorization capacity of neural networks (Bubeck et al., 2020; Yun et al., 2019). Theorem 2 shows that if there are Ω(2^{K/2} r^{1/2}) hidden nodes in the neural network, then it is sufficient to represent all possible interaction effects in R^r. However, obtaining the parameter set {U_l, b_l, l = 1, 2, 3} in Theorem 2 via an optimization algorithm is not guaranteed due to the non-convex loss surface of neural networks. In practice, the neural network widths in Theorem 2 can serve as a guide, and choosing a wider network is recommended to achieve better empirical performance.
In summary, we propose to formulate the treatment encoder as two decoupled parts: the additive treatment encoder and the interactive treatment encoder. We provide two options for the interactive treatment encoder: the treatment dictionary and the neural network, where the neural network can improve the asymptotic convergence rate and empirical performance with guaranteed representation power. In the numerical studies, we use the neural network interactive treatment encoder for our proposed method, and a comprehensive comparison between the treatment dictionary and the neural network is provided in the online supplementary material.
where 𝒯_l(x) = T_l x + c_l is a linear operator with weight matrix T_l ∈ R^{r_l × r_{l−1}} and biases c_l. The activation function is chosen as the ReLU function σ(x) = max(x, 0) in this paper. Note that the depth and width of the covariates encoder α(·) are not necessarily identical to those of the interactive treatment encoder β_1(·); these are all tuning parameters to be determined through hyper-parameter tuning.
Even though neural networks achieve superior performance in many fields, their performance in small-sample-size problems, such as clinical trials or observational studies in medical research, is still deficient. In addition, neural networks lack interpretability due to the nature of their recursive composition; therefore, the adoption of neural networks in medical research is still under review. Here, we propose polynomial and B-spline covariates encoders to incorporate nonlinear treatment effects with better interpretation. For the polynomial covariates encoder, we first expand each covariate x_i into polynomials of a specific order, (x_i, x_i², . . . , x_i^d), where d is a tuning parameter. Then we take linear combinations of all polynomials as the output of the covariates encoder. Figure 2 provides an example of the polynomial covariates encoder with d = 3. Similarly, for the B-spline covariates encoder, we first expand each covariate into B-spline bases, where the number of knots and the spline degree are tuning parameters. Likewise, linear combinations of these B-spline bases are adopted as the output of the encoder. Although both polynomial and B-spline covariates encoders can accommodate interaction terms among the polynomial or B-spline bases for better approximation of multivariate functions, an exponentially increasing number of parameters needs to be estimated as the dimension of covariates or the degree of the bases increases. In the interest of computational feasibility, we do not consider interaction terms in the following discussion.
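The polynomial covariates encoder can be sketched in a few lines; the degree d, the weights, and the covariate values below are illustrative only, and no cross-covariate interaction terms are formed, matching the discussion above:

```python
def poly_encoder(x, d, weights):
    """Polynomial covariates encoder: expand each covariate x_i into
    (x_i, x_i^2, ..., x_i^d), then take one linear combination of the
    expanded basis per output dimension. weights[k] holds one
    coefficient per basis term; no cross-covariate interactions."""
    basis = [xi ** p for xi in x for p in range(1, d + 1)]
    return [sum(w * b for w, b in zip(wk, basis)) for wk in weights]

# Two covariates, degree d = 3 -> 6 basis terms, r = 2 outputs.
x = [2.0, 0.5]
W = [[1, 0, 0, 0, 0, 0],     # output 1 picks out x_1
     [0, 0, 1, 0, 1, 0]]     # output 2 is x_1^3 + x_2^2
print(poly_encoder(x, 3, W))  # -> [2.0, 8.25]
```

A B-spline version would differ only in the basis expansion step, with the knots and degree playing the role of d.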
where V_n and C_n are defined on a pre-specified population with a sample size n. Here, the covariates x_i (i = 1, 2, . . . , n) are treated as fixed. Based on the model formulation (6), maximizing the objective function of (12) is equivalent to
argmax_d V_n(d) = argmax_d (1/n) Σ_{i=1}^n [m(x_i) + α(x_i)^T β(d(x_i))]
                = argmax_d (1/n) Σ_{i=1}^n α(x_i)^T β(d(x_i))
                = argmax_{d_ij} (1/n) Σ_{i=1}^n Σ_{j=1}^{|A|} δ_ij d_ij
where δ_ij = α(x_i)^T β(a_j) denotes the treatment effect of the jth combination treatment on the ith subject, and d_ij = I{d(x_i) = a_j} ∈ {0, 1} indicates whether the ith subject receives the jth combination treatment. Since one subject can only receive one combination treatment, we impose the constraint Σ_j d_ij = 1. Similarly, budget constraints can be formulated as (1/n) Σ_{i=1}^n Σ_{j=1}^{|A|} c_{ã_j} d_ij ≤ B, where c_{ã_j} is the cost of treatment ã_j calculated from the cost vector c. The constrained ITR can be solved as follows:
The above optimization problem is equivalent to a multi-choice knapsack problem (Kellerer et al.,
2004). For a binary treatment setting, the solution of (13) is the quantile of η(X), which is a special
case of our formulation.
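For intuition, a multi-choice knapsack of this form can be solved exactly by dynamic programming when the costs are integers (rescaling costs to integers is a common preprocessing step). The sketch below, with made-up effect estimates, maximizes Σ_i Σ_j δ_ij d_ij subject to one choice per subject and a total budget of nB; it illustrates the combinatorial structure rather than the algorithm used in the paper:

```python
def constrained_value(delta, costs, total_budget):
    """Multi-choice knapsack DP: delta[i][j] is the estimated effect of
    option j for subject i, costs[j] is its (integer) cost, and each
    subject must pick exactly one option. Returns the best total effect
    achievable within total_budget (n * B in the notation of (5))."""
    NEG = float("-inf")
    best = [0.0] + [NEG] * total_budget   # best[b]: max value at total cost b
    for row in delta:                      # one DP sweep per subject
        new = [NEG] * (total_budget + 1)
        for b, v in enumerate(best):
            if v == NEG:
                continue
            for j, c in enumerate(costs):
                if b + c <= total_budget and v + row[j] > new[b + c]:
                    new[b + c] = v + row[j]
        best = new
    return max(v for v in best if v != NEG)

# Two subjects; options: no treatment (cost 0), drug A (1), A + B (2).
delta = [[0.0, 2.0, 3.0],
         [0.0, 1.0, 5.0]]
print(constrained_value(delta, costs=[0, 1, 2], total_budget=2))  # -> 5.0
```

With budget 2 the optimum gives subject 2 the combination (effect 5.0) and subject 1 nothing, rather than splitting the budget evenly.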
To understand the connection between the constrained ITR and the multi-choice knapsack problem, we note that the priority of a treatment is associated with the definition of dominance in the multi-choice knapsack problem: for any i ∈ {1, 2, . . . , n}, if δ_ik > δ_il and c_{a_k} < c_{a_l}, then treatment l is dominated by treatment k. In other words, treatment k achieves a better outcome than treatment l at a lower cost. Thus, the dominance property could be an alternative to the contrast functions in combination treatment settings.
Here, δ_ij denotes the treatment effects, and no parametric assumptions are required, so this framework is also applicable to other methods that provide estimates of treatment effects, such as the L1-penalized least squares (L1-PLS) (Qian & Murphy, 2011) and outcome weighted learning with multinomial deviance (OWL-MD; Huang et al., 2019). However, the objective function in (12) depends on the estimation of δ_ij, and we show in Theorem 4 that the value reduction under budget constraints is bounded by the estimation error of the δ_ij's. Consequently, estimation bias in δ_ij could lead to biased results in (13). Since the proposed model (6) provides efficient and accurate estimation of treatment effects, it also results in a more favourable property for solving the budget-constrained ITR for combination treatments.
where P̂(A_i | X_i) is a working model of the propensity score specifying the probability of treatment assignment given pre-treatment covariates, and m̂(X_i) is a working model of the treatment-free effects. The inverse probability weights given by the propensity scores balance the samples assigned to different combination treatments, which are assumed to be equal under the randomized clinical trial setting. By removing the treatment-free effects m(x) from the responses before we estimate the treatment effects, the numerical stability can be improved and the estimator variance is reduced, as also observed in Zhou et al. (2017) and Fu et al. (2016). Furthermore, the estimator in (14) is doubly robust in that if either P̂( · | · ) or m̂(·) is correctly specified, α̂(·)^T β̂(·) is a consistent estimator of the treatment effects; a detailed proof is provided in the online supplementary material. This result extends the results in Meng and Qiao (2020) from binary and multiple treatments to combination
treatments. Empirically, we minimize the sample average of the loss function (14) with additional penalties: for the additive treatment encoder, an L2 penalty is imposed to avoid overfitting; for the interactive treatment encoder, an L1 penalty is added since interaction effects are usually sparse (Wu & Hamada, 2011).
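Written out, the penalized objective amounts to an inverse-probability-weighted residual loss plus the two penalties. The schematic below uses toy numbers; `effect[i]` stands in for α(x_i)^T(β_0(a_i) + β_1(a_i)), and the two weight lists are hypothetical flattened encoder parameters:

```python
def dem_loss(y, m_hat, prop, effect, w_additive, w_interactive, lam_a, lam_i):
    """Inverse-probability-weighted residual loss with the penalties
    described in the text: L2 on additive-encoder weights, L1 on
    interactive-encoder weights (interactions are usually sparse).
    All inputs are plain per-subject lists."""
    n = len(y)
    fit = sum((y[i] - m_hat[i] - effect[i]) ** 2 / prop[i]
              for i in range(n)) / n
    pen = lam_a * sum(w * w for w in w_additive) \
        + lam_i * sum(abs(w) for w in w_interactive)
    return fit + pen

loss = dem_loss(y=[1.0, -0.5], m_hat=[0.2, 0.1], prop=[0.5, 0.25],
                effect=[0.6, -0.4], w_additive=[0.3, -0.1],
                w_interactive=[0.0, 0.2], lam_a=0.01, lam_i=0.05)
print(round(loss, 4))  # -> 0.131
```

Small propensities inflate individual terms of the fit component, which motivates the stabilization discussed next.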
In this work, the working model of the propensity score is obtained via penalized multinomial logistic regression (Friedman et al., 2010). Specifically, the multinomial logistic model is parameterized by γ_1, γ_2, . . . , γ_{2^K} ∈ R^p:

max_{γ_1, . . . , γ_{2^K}} (1/n) Σ_{i=1}^n [ Σ_{k=1}^{2^K} γ_k^T x_i I(Ã_i = k) − log Σ_{k′=1}^{2^K} exp(γ_{k′}^T x_i) ] − λ Σ_{j=1}^p ( Σ_{k=1}^{2^K} γ_{kj}² )^{1/2}
where the group lasso (Meier et al., 2008) is used to penalize parameters across all treatment groups. A potential issue with propensity score estimation is that the estimated probabilities could be negligible when there are many possible treatments, which leads to unstable estimators of the treatment effects. To alleviate this limitation of inverse probability weighting, we stabilize the propensity scores (Xu et al., 2010) by multiplying the weights by the frequency of the corresponding treatment.
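The stabilization step can be sketched as follows; the treatment labels and propensity values are fabricated for illustration:

```python
from collections import Counter

def stabilized_weights(treatments, p_hat):
    """Stabilized inverse-probability weights: multiply each weight by
    the marginal frequency of the received treatment,
    w_i = freq(a_i) / p_hat(a_i | x_i),
    which keeps weights bounded when some estimated propensities are tiny."""
    n = len(treatments)
    freq = Counter(treatments)
    return [freq[a] / n / p for a, p in zip(treatments, p_hat)]

a = ["AB", "A", "AB", "none"]      # observed combination treatments
p = [0.4, 0.1, 0.5, 0.8]           # estimated propensities P_hat(a_i | x_i)
print([round(w, 4) for w in stabilized_weights(a, p)])  # -> [1.25, 2.5, 1.0, 0.3125]
```

Without the frequency factor, the second subject's weight would be 1/0.1 = 10; stabilization shrinks it to 2.5.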
For the estimation of the treatment-free effects m(·), we adopt a two-layer neural network:
where w_{2m} ∈ R^h and W_{1m} ∈ R^{h×p} are weight matrices, and σ(x) is the ReLU function. The width h controls the complexity of this model. The weight matrices are estimated by minimizing:
min_{w_{2m}, W_{1m}} (1/n) Σ_{i=1}^n [y_i − w_{2m}^T σ(W_{1m} x_i)]²
Given the working models m̂(x) and P̂(a|x) for the treatment-free effects and propensity scores, we propose to optimize the double encoder alternately. The detailed algorithm is listed in Algorithm 1.
Input: Training dataset (x_i, a_i, y_i)_{i=1}^n, working models m̂(x), P̂(a|x), and hyper-parameters, including network-structure hyper-parameters (e.g. network depths L_α, L_β, network widths r_α, r_β, and encoder output dimension r) and optimization hyper-parameters (e.g. additive treatment encoder penalty coefficient λ_a, interactive treatment encoder penalty coefficient λ_i, mini-batch size B, learning rate η, and number of training epochs E).
Initialization: Initialize parameters in α̂^(0)(x), β̂_0^(0)(a), and β̂_1^(0)(a).
Training:
for e in 1 : E do
  for each mini-batch sampled from (x_i, a_i, y_i)_{i=1}^n do
    α̂^(e)(x) = argmin_α (1/B) Σ_i [1/P̂(a_i|x_i)] {y_i − m̂(x_i) − α(x_i)^T (β̂_0^(e−1)(a_i) + β̂_1^(e−1)(a_i))}²
    β̂_0^(e)(a) = argmin_{β_0} (1/B) Σ_i [1/P̂(a_i|x_i)] {y_i − m̂(x_i) − α̂^(e)(x_i)^T (β_0(a_i) + β̂_1^(e−1)(a_i))}² + λ_a ‖β_0‖₂²
    β̂_1^(e)(a) = argmin_{β_1} (1/B) Σ_i [1/P̂(a_i|x_i)] {y_i − m̂(x_i) − α̂^(e)(x_i)^T (β̂_0^(e)(a_i) + β_1(a_i))}² + λ_i ‖β_1‖₁
  end for
end for
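The alternating structure can be illustrated with a deliberately simplified toy: below, r = 1 and both encoders are linear, α(x) = u^T x and β(a) = v^T a, fitted by alternating full-batch gradient steps on u (with v fixed) and on v (with u fixed). The data, step size, and unweighted squared loss are all made up; the paper instead uses Adam on inverse-propensity-weighted mini-batches with neural encoders, so this sketch only illustrates the alternation itself:

```python
def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

def loss(u, v, X, A, y):
    """Mean squared error of the bilinear fit (u^T x)(v^T a)."""
    return sum((yi - dot(u, x) * dot(v, a)) ** 2
               for x, a, yi in zip(X, A, y)) / len(y)

def alternating_fit(X, A, y, lr=0.05, epochs=500):
    u, v = [0.5] * len(X[0]), [0.5] * len(A[0])
    n = len(y)
    for _ in range(epochs):
        # Step on u with v fixed: the model is linear in u.
        g = [0.0] * len(u)
        for x, a, yi in zip(X, A, y):
            r = yi - dot(u, x) * dot(v, a)
            for j in range(len(u)):
                g[j] += -2.0 * r * dot(v, a) * x[j] / n
        u = [uj - lr * gj for uj, gj in zip(u, g)]
        # Step on v with u fixed: the model is linear in v.
        g = [0.0] * len(v)
        for x, a, yi in zip(X, A, y):
            r = yi - dot(u, x) * dot(v, a)
            for k in range(len(v)):
                g[k] += -2.0 * r * dot(u, x) * a[k] / n
        v = [vk - lr * gk for vk, gk in zip(v, g)]
    return u, v

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
A = [[1, 0], [0, 1], [1, 1], [1, 0]]
y = [1.0, 1.0, 4.5, 4.0]   # generated from u* = (1, 2), v* = (1, 0.5)
u, v = alternating_fit(X, A, y)
print(loss(u, v, X, A, y) < loss([0.5, 0.5], [0.5, 0.5], X, A, y))  # loss decreases
```

Each sub-problem is convex given the other block, which is what makes the alternation well behaved even though the joint problem is not convex.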
Specifically, we employ the Adam optimizer (Kingma & Ba, 2014) to optimize each encoder: the covariates encoder α(x), the additive treatment encoder β_0(a), and the interactive treatment encoder β_1(a). To stabilize the optimization over iterations, we utilize an exponential scheduler (Patterson & Gibson, 2017), which decays the learning rate by a constant factor per epoch. In all of our numerical studies, we use 0.95 as the decay constant for the exponential scheduler. To ensure the identifiability of the treatment effects, we also impose a constraint on the treatment encoder β(·) such that Σ_{a∈A} β(a) = 0. To satisfy this constraint, we add an additional normalization layer before the output of the treatment encoder.
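One simple way to realize the constraint Σ_{a∈A} β(a) = 0 is to subtract the mean of β over all 2^K combinations from every output; the sketch below is a stand-in for such a normalization layer, with a hypothetical raw encoder (K = 2, r = 2) chosen purely for illustration:

```python
from itertools import product

def centered_beta(beta, K):
    """Enforce sum_{a in A} beta(a) = 0 by subtracting the mean of beta
    over all 2^K combinations (a simple stand-in for the normalization
    layer described in the text)."""
    all_a = [list(t) for t in product([0, 1], repeat=K)]
    outs = [beta(a) for a in all_a]
    r = len(outs[0])
    mean = [sum(o[i] for o in outs) / len(outs) for i in range(r)]
    return lambda a: [bi - mi for bi, mi in zip(beta(a), mean)]

raw = lambda a: [a[0] + 0.5 * a[1], a[0] * a[1]]   # hypothetical encoder
b = centered_beta(raw, 2)
total = [sum(b(list(a))[i] for a in product([0, 1], repeat=2))
         for i in range(2)]
print([round(t, 12) for t in total])  # -> [0.0, 0.0]
```

Centering leaves all pairwise contrasts β(a) − β(a′) unchanged, which is why it resolves the identifiability issue without altering the implied decision rule.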
4 Theoretical guarantees
In this section, we establish the theoretical properties of ITR estimation for combination treatments and of the proposed method. First, we establish the value reduction bound for combination treatments, either with or without budget constraints. Second, we provide a non-asymptotic excess risk bound for the DEM, which achieves a faster convergence rate compared with existing methods for multi-arm treatment problems.
Assumption 2 For any ϵ > 0, there exist some constants C > 0 and γ > 0 such that

P( max_{A,A′∈A} |δ^*(X, A) − δ^*(X, A′)| ≤ ϵ ) ≤ Cϵ^γ    (16)
Assumption 2 is a margin condition characterizing the behaviour of the boundary between different combination treatments. A larger value of γ indicates that the treatment effects are distinguishable with a higher probability, suggesting that it is easier to find the optimal ITR. Similar assumptions are also required in the literature (Qi et al., 2020; Qian & Murphy, 2011; Zhao et al., 2012) to achieve a faster convergence rate of the value reduction bound.
The following theorem shows that the value reduction is bounded by the estimation error of the
treatment effects, and the convergence rate can be improved if Assumption 2 holds:
Theorem 3 Suppose the treatment effects δ^*( · , · ) ∈ H_2. For any estimator δ̂( · , · ) and the corresponding decision rule d̂ such that d̂(X) ∈ argmax_{A∈A} δ̂(X, A), we have

V(d^*) − V(d̂) ≤ 2 max_{A∈A} {E[δ^*(X, A) − δ̂(X, A)]²}^{1/2}    (17)
Theorem 3 builds a connection between the value reduction and the estimation error of the
treatment effects δ̂( · , · ), which shows that an accurate estimation of treatment effects would
lead the estimated value function V(d̂) to approach the optimal value function V(d∗ ). Based on
Theorem 3, we can further connect the value reduction bound to the excess risk of the estimator
of the proposed model:
Next, we consider the value reduction bound under budget constraints. Since the multi-choice
Theorem 4 For the approximated value function obtained from Algorithm 2, for any B > 0, we have

|Z^*(B) − Ẑ(B)| ≤ (1/n) Σ_{i=1}^n max_{Ã_j∈A} |δ^*(x_i, Ã_j) − δ̂(x_i, Ã_j)|
In other words, the approximated value function under budget constraints converges if δ̂( · , · ) is a consistent estimator of the treatment effects. Note that the proposed estimator is doubly robust: if either the propensity score or the treatment-free effects model is correctly specified, the proposed estimator is consistent, which consequently leads both the value function and the approximated value function under budget constraints to converge.
Lemma 1 For any distribution of (X, A, Y) with E[Y²] ≤ c_1, given a function Q̂ from Q, with probability 1 − 2ϵ,

L(Q̂) − L(Q^*) ≤ 8C R_n(Q) + √(2c_1² log(1/ϵ)/n)    (19)

where C is the Lipschitz constant of L(Q), and R_n(Q) is the Rademacher complexity of Q.
Lemma 1 provides an upper bound on the excess risk in Corollary 1 using the Rademacher complexity of Q. However, the Rademacher complexity of a general neural network is still an open problem in the literature, and existing bounds are mainly established under different types of norm constraints on the weight matrices (Bartlett et al., 2017; Golowich et al., 2018; Neyshabur et al., 2015, 2017). In this work, we focus on the following sub-class of Q with L2 and spectral norm constraints:
Q_{B_m,B_α,B_β} = {Q ∈ Q : ‖w_{2m}‖_2 ≤ B_m, ‖W_{1m}‖_{2,∞} ≤ B_m, ‖T_l‖_2 ≤ B_α, ‖U_l‖_2 ≤ B_β}
where ‖ · ‖_2 denotes the L2-norm for vectors and the spectral norm for matrices. For any matrix X = (X_1, . . . , X_p), where X_i is the ith column of X, we use ‖X‖_{2,∞} = max_i ‖X_i‖_2 to denote the L_{2,∞} norm of X. We then establish an upper bound on the Rademacher complexity of Q_{B_m,B_α,B_β} as follows:
Lemma 2 Suppose E[‖X‖₂²] ≤ c_2². The Rademacher complexity of Q_{B_m,B_α,B_β} is upper bounded by
Lemma 2 provides an upper bound on the Rademacher complexity of Q_{B_m,B_α,B_β} with the rate O(√(1/n)). The first term of (20) is the upper bound for the function class of m(x) in (15), which depends on the width h of the hidden layer. If h is large, the function m(x) can approximate a larger function space, but at the price of a less tight upper bound on the generalization error. The second term of (20) is associated with the function class of the inner product of the double encoders, with a convergence rate of O(K^{1/2} n^{−1/2}). The rate grows with the number of treatments K rather than |A| because of the parameter-sharing feature of the interactive treatment encoder and the linearly growing input dimension of the function β(·) in the proposed method. Specifically, the input of β(·) is the combination treatment A itself, and the parameters in the treatment encoder are shared by all combination treatments. Thus, the model complexity is proportional to K and to the product of the spectral norms of the weight matrices. Based on Lemmas 1 and 2, we derive the value reduction bound for the proposed method as follows:
Theorem 5 For any distribution of (X, A, Y) with E[Y²] ≤ c₁ and E[‖X‖₂²] ≤ c₂², considering neural networks in the subspace Q_{B_m, B_α, B_β}, with probability at least 1 − 2ϵ, we have the following value reduction bound:
$$V(d^*) - V(\hat{d}) \le 2\left\{ 16 C B_m^2 c_2 \sqrt{\frac{h}{n}} + 8 C B_\alpha^{L_\alpha} B_\beta^{L_\beta} c_2 \sqrt{\frac{K}{n}} + \sqrt{\frac{2 c_1^2 \log(1/\epsilon)}{n}} \right\}^{1/2}.$$
Theorem 5 establishes the value reduction bound, showing that the estimated decision rule approaches the optimal value function as the sample size increases. Compared with the existing value reduction bound for multi-arm treatments, the proposed method improves the convergence rate from O(|A|^{1/4}) to O((log₂|A|)^{1/4}). Furthermore, the order of the value reduction bound can approach nearly n^{−1/2} as γ goes to infinity, which is consistent with the convergence rates established in Qian and Murphy (2011) and Qi et al. (2020).
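The route from Lemmas 1 and 2 to Theorem 5 can be sketched as follows, assuming (as is standard for such bounds, e.g. Qian & Murphy, 2011) that the value reduction is controlled by twice the square root of the excess risk:

```latex
V(d^*) - V(\hat{d})
  \le 2\,\bigl\{L(\hat{Q}) - L(Q^*)\bigr\}^{1/2}
  \le 2\left\{8C\,R_n(\mathcal{Q}_{B_m,B_\alpha,B_\beta})
      + \sqrt{\frac{2c_1^2 \log(1/\epsilon)}{n}}\right\}^{1/2},
```

and substituting the Rademacher complexity bound $R_n \le 2B_m^2 c_2 \sqrt{h/n} + B_\alpha^{L_\alpha} B_\beta^{L_\beta} c_2 \sqrt{K/n}$ from Lemma 2 yields the rate displayed in Theorem 5. Since $|\mathcal{A}| \le 2^K$ for $K$ individual treatments, the dependence on $K$ amounts to the stated $O((\log_2|\mathcal{A}|)^{1/4})$ rate.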
5 Simulation studies
In this section, we evaluate the performance of the proposed method in estimating the ITR for
combination treatments. Our numerical studies show that the proposed method achieves
superior performance to competing methods in both unconstrained and budget-constrained
scenarios.
and uniformly sampled from (−1, 1). Four simulation settings are designed to evaluate the performance under varying conditions. In simulation settings 1 and 2, we consider combinations of three treatments, which induces eight possible combinations, six of which are considered as our assigned treatments. Similarly, in simulation settings 3 and 4, we consider combinations of five treatments, and we assume that 20 of all combinations are assigned to subjects. The treatments are assigned either uniformly or following the propensity score model:
Table 1. Simulation settings 1 and 2: treatment effect and interaction effect functions specification

Combination   Treatment effects         Interaction effects
(0, 0, 0)     0                         –
(0, 0, 1)     2X1 + exp(X3 + X4)        –
(0, 1, 0)     2X2 log(X5) + X7          –

Note. Column ‘Treatment effects’ specifies the treatment effect functions of individual treatments adopted in simulation settings 1 and 2. Column ‘Interaction effects’ specifies the interaction effects among individual treatments in setting 2.
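The exact form of the propensity score model (21) referenced above is not reproduced in this excerpt; a generic multinomial-logit sketch of propensity-score-based assignment, with placeholder weights `W`, could look like:

```python
import numpy as np

def assign_treatments(X, W, seed=0):
    """Assign each subject one of J candidate combinations, with
    P(A = A_j | x) proportional to exp(x^T w_j) (a softmax model;
    the paper's actual model (21) may differ)."""
    rng = np.random.default_rng(seed)
    logits = X @ W                               # shape (n, J)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(W.shape[1], p=p) for p in probs])
```

With `W = 0` this reduces to the uniform assignment scheme used in the other half of the simulations.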
Table 2. Simulation settings 3 and 4: treatment effect and interaction effect functions specification

Combination       Treatment effects               Interaction effects
(0, 0, 0, 0, 0)   0                               –
(0, 0, 0, 0, 1)   (X1 − 0.25)^3                   –
(0, 0, 0, 1, 0)   2 log(X3) + 4 log(X8)cos(2πX10) –
(0, 0, 1, 0, 0)   X2 sin(X4) − 1                  –
(0, 0, 1, 0, 1)   –                               exp(2X2)
(0, 1, 0, 0, 0)   (X1 + X5 − X8^2)^3              –
(0, 1, 0, 0, 1)   –                               exp(2X4 + X9)
(0, 1, 0, 1, 1)   –                               −4 log(X6)
(0, 1, 1, 0, 0)   –                               0
(0, 1, 1, 1, 0)   –                               0
(1, 0, 0, 0, 0)   exp(X2 − X5)                    –
(1, 0, 0, 0, 1)   –                               0
(1, 0, 0, 1, 0)   –                               0
(1, 0, 1, 0, 0)   –                               0
(1, 0, 1, 0, 1)   –                               −(3/2)cos(2πX1 + X8^2)
(1, 1, 0, 0, 0)   –                               0
(1, 1, 0, 0, 1)   –                               −4 log(X6)
(1, 1, 0, 1, 1)   –                               X6^2 + (1/2)sin(2π/X7)
(1, 1, 1, 0, 0)   –                               0
(1, 1, 1, 1, 0)   –                               0

Note. Column ‘Treatment effects’ specifies the treatment effect functions of individual treatments adopted in simulation settings 3 and 4. Column ‘Interaction effects’ specifies the interaction effects among individual treatments in setting 4.
accuracy are reported in Tables 3 and 4, where the empirical value function (Qian & Murphy, 2011) is calculated via
$$\hat{V}(d) = \frac{\mathbb{E}_n[Y\, I\{d(X) = A\}]}{\mathbb{E}_n[I\{d(X) = A\}]}.$$
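The empirical value function above can be computed directly; in this sketch `d_x` holds the rule's recommended combination for each subject (illustrative names, not the authors' code):

```python
import numpy as np

def empirical_value(y, a, d_x):
    """V_hat(d) = E_n[Y * 1{d(X)=A}] / E_n[1{d(X)=A}]: the average outcome
    among subjects whose observed treatment matches the recommendation."""
    y = np.asarray(y, dtype=float)
    match = np.array([np.array_equal(ai, di) for ai, di in zip(a, d_x)])
    if not match.any():
        return float("nan")  # the rule matches no observed assignment
    return float(y[match].mean())
```

Under uniform assignment this ratio estimates the value of the rule d; under propensity-based assignment the same ratio is still well defined, as used in the tables below.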
Table 3. Continued. Columns report the value achieved by the proposed method, L1-PLS, OWL-MD, MOWL-linear, OWL-DL, and TARNet for each treatment assignment scheme, setting, and sample size.
Note. Two treatment assignment schemes are presented: all treatments are uniformly assigned to subjects (uniform), and treatments are assigned based on the propensity score model (21, PS-based).
Table 4. Unconstrained simulation study: comparisons of accuracies for the proposed method and existing methods including the L1-penalized least-square (L1-PLS; Qian & Murphy, 2011), the outcome weighted learning with multinomial deviance (OWL-MD; Huang et al., 2019), the multicategory outcome weighted learning with linear decisions (MOWL-linear; Zhang et al., 2020), the outcome weighted learning with deep learning (OWL-DL; Liang et al., 2018), and the treatment-agnostic representation network (TARNet; Shalit et al., 2017). Columns report the accuracy achieved by each method for each treatment assignment scheme, setting, and sample size.
Note. Two treatment assignment schemes are presented: all treatments are uniformly assigned to subjects (uniform), and treatments are assigned based on the propensity score model (21, PS-based).
proposed method can be adaptive to the additive setting with a large λᵢ. Therefore, the proposed method and the OWL-DL outperform the other competing methods in both settings. In contrast, complex interaction effects are considered in simulation settings 2 and 4, where the performance of OWL-DL is inferior, since consistent estimation is not guaranteed for OWL-DL in the presence of interaction effects. Although the other competing methods use saturated models to incorporate interaction effects, their estimation efficiency is still undermined because the decision functions in these methods are all treatment-specific, whereas the proposed method possesses the unique parameter-sharing feature for combination treatments.
Table 5. Constrained simulation study: comparisons of value functions for the proposed method and existing
methods including the L1 -penalized least-square (L1 -PLS; Qian & Murphy, 2011), the outcome weighted learning with
multinomial deviance (OWL-MD; Huang et al., 2019), the multicategory outcome weighted learning with linear
decisions (MOWL-linear; Zhang et al., 2020), and the treatment-agnostic representation network (TARNet; Shalit
et al., 2017)
Figure 4. Illustration of the patient-derived xenograft (PDX) data collection. Tumour samples from a patient are
implanted into multiple mice, where a PDX line is formed by these mice. Different treatments can be applied
simultaneously. Tumour size, which is the primary interest of the outcome, is measured for each mouse. RNA and
DNA sequencing and other features are collected as pre-treatment covariates.
screening by Rashid et al. (2020), and these pre-treatment covariates are inherently balanced since all treatment groups include exactly the same PDX lines. Furthermore, the outcome of interest is measured by the scaled maximum observed shrinkage of tumour size from the baseline size, where a larger value is more desirable.
For the budget-constrained setting, we impose the costs of treatments as follows: $79 for BKM120, $100 for LJC049, $66 for BYL719, $240 for cetuximab, $124 for encorafenib, $500 for LJM716, and $79 for binimetinib, where the prices for LJC049 and LJM716 are hypothetical, while the other prices are based on https://www.goodrx.com for unit dosage. We consider hypothetical budgets for these 37 PDX lines of $21,000, $15,000, $10,000, and $5,000, where $21,000 is equivalent to the unconstrained scenario because it is sufficient to cover the most expensive combination treatment for all PDX lines.
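These unit costs imply simple budget arithmetic; for instance, administering BYL719 + binimetinib to all 37 PDX lines costs (66 + 79) × 37 = $5,365, matching the figure quoted for the best one-size-fits-all rule. A small illustrative check:

```python
# Unit costs per dosage; the prices for LJC049 and LJM716 are hypothetical.
COSTS = {"BKM120": 79, "LJC049": 100, "BYL719": 66, "cetuximab": 240,
         "encorafenib": 124, "LJM716": 500, "binimetinib": 79}

def combo_cost(treatments, n_lines=37):
    """Total cost of giving one combination to every one of n_lines PDX lines."""
    return sum(COSTS[t] for t in treatments) * n_lines

print(combo_cost(["BYL719", "binimetinib"]))  # 145 * 37 = 5365
```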
Table 6. Table of single or combination treatments considered to treat CLR for patient-derived xenograft (PDX) lines
Type     Treatment indicator (one entry per treatment)
Single   1 0 0 0 0 0 0
Single   0 1 0 0 0 0 0
Single   0 0 1 0 0 0 0
Table 7. Mean and standard errors of value function under different budget constraints
Note. The budget $21,000 is equivalent to the non-constrained scenario. L1-PLS = L1-penalized least square; MOWL = multicategory outcome weighted learning; OWL-DL = outcome weighted learning with deep learning; TARNet = treatment-agnostic representation network.
To implement our proposed method and other competing methods, we randomly split the
dataset into training (25 PDX lines), validation (6 PDX lines), and testing (6 PDX lines) sets.
All methods are trained on the training set, while hyper-parameters are tuned based on the
validation set. The value function of the unconstrained scenario is calculated on the testing set
via V̂(d) = En [YI{d(X) = A}]/En [I{d(X) = A}]. For the budget-constrained scenarios, we apply
the selected model to all PDX lines to obtain the estimation of treatment effects, and then
apply the MCKP algorithm to all PDX lines and calculate the value function based on 37
PDX lines. Finally, we repeat the above random splitting 100 times to validate the results
for comparison.
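The repeated random splitting can be sketched as follows (an illustrative helper, not the authors' code):

```python
import numpy as np

def split_pdx_lines(n_lines=37, n_train=25, n_val=6, seed=0):
    """Randomly split PDX line indices into training/validation/testing
    sets of sizes 25/6/6, as in the data analysis."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_lines)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 100 repetitions: vary the seed and recompute the value function each time.
splits = [split_pdx_lines(seed=s) for s in range(100)]
```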
We report the means and standard deviations of the value functions under different budget constraints in Table 7. For the unconstrained scenario ($21,000 budget), the proposed method achieves a substantial improvement in value maximization. As a reference, the value functions under one-size-fits-all rules and under the optimal treatment are shown in Table 8: our proposed method narrows the value function gap between the optimal treatment assignment and one-size-fits-all rules by 23.0–81.1%. In comparison with the competing methods that also estimate the ITR, our proposed method narrows the value function gap between the optimal treatment assignment and competing ITR methods by 40.4–65.8%.
For the budget-constrained scenarios, the proposed method achieves clear advantages over the competing methods: even under the most restrictive budget constraint, its value function remains comparable to those of the competing methods in the unconstrained scenario. Compared with one-size-fits-all rules, the proposed method achieves the best value function
J R Stat Soc Series B: Statistical Methodology, 2024, Vol. 86, No. 3 737
Table 8. Value function under one-size-fits-all rules and optimal treatment assignment
Note. BYL719 + binimetinib (bold value) is the best one-size-fits-all rules; Optimal (italic) is the estimated ITR by the
proposed method.
with about a $5,000 budget, compared to the best one-size-fits-all rule, the combination of BYL719 and binimetinib, which requires a $5,365 budget. In summary, the proposed method controls the tumour size more effectively than any of the competing methods and one-size-fits-all rules. Our approach could have great potential for improving therapy quality for CLR patients.
7 Discussion
In this paper, we broaden the scope of estimating the ITR from binary and multi-arm treatments to combination treatments, where treatments within each combination can interact with each other. We propose the DEM as a nonparametric approach to accommodate the intricate treatment effects of combination treatments. Specifically, our method overcomes the curse of dimensionality by adopting neural network treatment encoders. The parameter-sharing feature of the neural network treatment encoder enhances estimation efficiency, so that the proposed method is able to outperform parametric approaches given a small sample size. In addition, we adapt the estimated ITR to budget-constrained scenarios through the multi-choice knapsack framework, which strengthens our proposed method in situations with limited resources. Theoretically, we offer a value reduction bound with and without budget constraints, and an improved convergence rate with respect to the number of treatments under the DEM.
Several potential research directions are worth exploring further. First, the proposed method employs the propensity score model to achieve the double robustness property. However, the inverse probability weighting method can break down in observational studies with combination treatments, due to the potential violation of the positivity assumption. This phenomenon is also observed in the binary treatment scenario with high-dimensional covariates (D'Amour et al., 2021). Existing works overcome this limitation in the binary and multi-arm treatment settings by substituting overlap weights (Li, 2019; Li et al., 2019) for the propensity score; however, this strategy does not resolve the same issue for combination treatments. Therefore, exploring alternative approaches for combination treatment problems is a worthwhile direction.
Second, compared with binary treatments, combination treatments enable us to optimize multiple outcomes of interest simultaneously. The major challenge with multiple outcomes is that each combination treatment may favour only a few outcomes, so an optimal ITR is expected to achieve a trade-off among them. Some recent works have studied trade-offs between the outcome of interest and risk factors (Huang & Xu, 2020; Wang et al., 2018). However, trade-offs among multiple outcomes could be more challenging.
Furthermore, interpretability is another desirable property of the ITR, especially in medical settings. The proposed method incorporates neural networks for their estimation and theoretical benefits, but the resulting rules are not straightforward to interpret. Some existing works (Laber & Zhao, 2015; Zhang et al., 2015) propose tree-type methods for better interpretability under binary or multi-arm settings, but these may not be applicable to combination treatments. In the explainable machine learning literature, there are post-hoc, model-agnostic approaches (Lundberg & Lee, 2017; Shrikumar et al., 2017) that can be drawn upon. However, more sophisticated adaptation might be needed for the combination treatment problem.
Acknowledgments
The authors are grateful to reviewers, the associate editor, and the editor for their insightful and
constructive comments that have helped to strengthen the content and clarity of the paper.
Funding
This work is supported by National Science Foundation Grants DMS 2210640 and DMS
1952406.
Data availability
The data that support the findings of the study are available at https://www.tandfonline.com/doi/
full/10.1080/01621459.2020.1828091.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series B.
References
Bartlett P., Foster D. J., & Telgarsky M. (2017). ‘Spectrally-normalized margin bounds for neural networks’,
arXiv, arXiv:1706.08498, preprint.
Bhattacharya D., & Dupas P. (2012). Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics, 167(1), 168–196. https://doi.org/10.1016/j.jeconom.2011.11.007
Bozic I., Reiter J. G., Allen B., Antal T., Chatterjee K., Shah P., Moon Y. S., Yaqubie A., Kelly N., & Le D. T.
(2013). Evolutionary dynamics of cancer in response to targeted combination therapy. Elife, 2, e00747.
https://doi.org/10.7554/eLife.00747
Braveman P. A., Egerter S. A., & Mockenhaupt R. E. (2011). Broadening the focus: The need to address the social
determinants of health. American Journal of Preventive Medicine, 40(1), S4–S18. https://doi.org/10.1016/j.
amepre.2010.10.002
Bubeck S., Eldan R., Lee Y. T., & Mikulincer D. (2020). ‘Network size and weights size for memorization with
two-layers neural networks’, arXiv, arXiv:2006.02855, preprint.
Caspi A., Hariri A. R., Holmes A., Uher R., & Moffitt T. E. (2010). Genetic sensitivity to the environment: The
case of the serotonin transporter gene and its implications for studying complex diseases and traits. American
Journal of Psychiatry, 167(5), 509–527. https://doi.org/10.1176/appi.ajp.2010.09101452
Clifton J., & Laber E. (2020). Q-learning: Theory and applications. Annual Review of Statistics and Its
Application, 7(1), 279–301. https://doi.org/10.1146/statistics.2020.7.issue-1
D’Amour A., Ding P., Feller A., Lei L., & Sekhon J. (2021). Overlap in observational studies with high-
dimensional covariates. Journal of Econometrics, 221(2), 644–654. https://doi.org/10.1016/j.jeconom.
2019.10.014
Dudziński K., & Walukiewicz S. (1987). Exact methods for the knapsack problem and its generalizations. European Journal of Operational Research, 28(1), 3–21. https://doi.org/10.1016/0377-2217(87)90165-2
Forrest G. N., & Tamura K. (2010). Rifampin combination therapy for nonmycobacterial infections. Clinical
Microbiology Reviews, 23(1), 14–34. https://doi.org/10.1128/CMR.00034-09
Friedman J., Hastie T., & Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate
descent. Journal of Statistical Software, 33(1), 1. https://doi.org/10.18637/jss.v033.i01
Fu H., Zhou J., & Faries D. E. (2016). Estimating optimal treatment regimes via subgroup identification in
randomized control trials and observational studies. Statistics in Medicine, 35(19), 3285–3302. https://doi.
org/10.1002/sim.v35.19
Gao H., Korn J. M., Ferretti S., Monahan J. E., Wang Y., Singh M., Zhang C., Schnell C., Yang G., Zhang Y., & Balbin O. A. (2015). High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. Nature Medicine, 21(11), 1318–1325. https://doi.org/10.1038/nm.3954
Golowich N., Rakhlin A., & Shamir O. (2018). Size-independent sample complexity of neural networks. In
Conference on learning theory (pp. 297–299). PMLR.
Goodfellow I., Bengio Y., & Courville A. (2016). Deep learning. MIT Press.
Hastie T., Tibshirani R., Friedman J. H., & Friedman J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (Vol. 2). Springer.
Hidalgo M., Amant F., Biankin A. V., Budinská E., Byrne A. T., Caldas C., Clarke R. B., de Jong S., Jonkers J.,
Mælandsmo G. M., & Roman-Roman S. (2014). Patient-derived xenograft models: An emerging platform for
translational cancer research. Cancer Discovery, 4(9), 998–1013. https://doi.org/10.1158/2159-8290.CD-14-
0001
Holland P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396),
945–960. https://doi.org/10.1080/01621459.1986.10478354
Moodie E. E., Chakraborty B., & Kramer M. S. (2012). Q-learning for estimating optimal dynamic treatment
rules from observational data. Canadian Journal of Statistics, 40(4), 629–645. https://doi.org/10.1002/cjs.
v40.4
Möttönen T., Hannonen P., Leirisalo-Repo M., Nissilä M., Kautiainen H., Korpela M., Laasonen L., Julkunen
H., Luukkainen R., Vuori K., & Paimela L. (1999). Comparison of combination therapy with single-drug
therapy in early rheumatoid arthritis: A randomised trial. The Lancet, 353(9164), 1568–1573. https://doi.
org/10.1016/S0140-6736(98)08513-4
Neyshabur B., Bhojanapalli S., & Srebro N. (2017). 'A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks', arXiv, preprint.
Zhang C., Chen J., Fu H., He X., Zhao Y.-Q., & Liu Y. (2020). Multicategory outcome weighted margin-based
learning for estimating individualized treatment rules. Statistica Sinica, 30(4), 1857–1879. https://doi.org/10.
5705/ss.202017.0527
Zhang C., & Liu Y. (2014). Multicategory angle-based large-margin classification. Biometrika, 101(3), 625–640.
https://doi.org/10.1093/biomet/asu017
Zhang Y., Laber E. B., Tsiatis A., & Davidian M. (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics, 71(4), 895–904. https://doi.org/10.1111/biom.v71.4
Zhao A., & Ding P. (2023). Covariate adjustment in multiarmed, possibly factorial experiments. Journal of the