Abstract
The individualized treatment rule (ITR), which recommends an optimal treatment based on individual
characteristics, has drawn considerable interest from many areas such as precision medicine, personalized
education, and personalized marketing. Existing ITR estimation methods mainly recommend one of two or more treatments. However, a combination of multiple treatments could be more powerful in various areas. In this
paper, we propose a novel double encoder model (DEM) to estimate the ITR for combination treatments. The
proposed double encoder model is a nonparametric model which not only flexibly incorporates complex
treatment effects and interaction effects among treatments but also improves estimation efficiency via the
parameter-sharing feature. In addition, we tailor the estimated ITR to budget constraints through a multi-
choice knapsack formulation, which enhances our proposed method under restricted-resource scenarios. In
theory, we provide the value reduction bound with or without budget constraints, and an improved
convergence rate with respect to the number of treatments under the DEM. Our simulation studies show
that the proposed method outperforms existing ITR estimation methods in various settings. We also demonstrate
the superior performance of the proposed method in patient-derived xenograft data, recommending optimal combination treatments to shrink the tumour size of colorectal cancer.
Keywords: causal inference, combination therapy, decision-making, multi-choice knapsack, neural network, precision
medicine
1 Introduction
Individualized decision-making has played a prominent role in many fields such as precision medicine, personalized education, and personalized marketing due to the rapid development of personalized data collection. For example, in precision medicine, individualized treatments based on individuals' demographic information and their overall comorbidity improve healthcare quality (Schmieder et al., 2015). However, most existing individualized decision-making approaches select one out of multiple treatments, whereas recent advances in medical research have suggested that applying multiple treatments simultaneously, referred to as combination treatments, could enhance overall healthcare. Specifically, combination treatments are able to reduce treatment failure or fatality rates, and overcome treatment resistance for many chronic diseases (e.g. Bozic et al., 2013; Forrest & Tamura, 2010; Kalra et al., 2010; Korkut et al., 2015; Maruthur et al., 2016; Mokhtari et al., 2017; Möttönen et al., 1999; Tamma et al., 2012). Therefore, it is critical to develop a novel statistical method to recommend individualized combination treatments.
There are various existing methods for estimating the optimal individualized treatment rule
(ITR). The first approach is the model-based approach, which estimates an outcome regression
model given pre-treatment covariates and the treatment. The optimal ITR is derived by maximizing the outcome over possible treatments conditioned on the pre-treatment covariates. Existing
works such as Q-learning (Moodie et al., 2012; Qian & Murphy, 2011), A-learning (Lu et al.,
Received: March 15, 2022. Revised: September 26, 2023. Accepted: December 10, 2023
© The Royal Statistical Society 2024. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
J R Stat Soc Series B: Statistical Methodology, 2024, Vol. 86, No. 3 715
2013; Shi et al., 2018), and D-learning (Meng & Qiao, 2020; Qi et al., 2020; Qi & Liu, 2018) all
belong to this approach. The other approach is known as the direct-search approach, which directly maximizes the expected outcome over a class of decision functions to obtain an optimal ITR. The seminal works of the direct-search approach include outcome weighted learning (OWL) (Huang et al., 2019; Zhao et al., 2012), residual weighted learning (Zhou et al., 2017), and augmented OWL (Liu et al., 2018; Zhao et al., 2019; Zhou & Kosorok, 2017). However, the aforementioned methods in these two categories are designed for selecting one optimal treatment out of multiple candidates, and are not directly applicable to deriving budget-constrained decisions for multi-arm or combination treatment scenarios.
In regard to the theoretical properties of the estimated ITR, we provide the value reduction bound for the ITR for combination treatments with or without budget constraints. Thereafter, we provide a non-asymptotic value reduction bound for the DEM, which guarantees that the value function of the estimated ITR converges to the optimal value function with high probability, and that the proposed method achieves a faster convergence rate with respect to the number of combination treatments.
where d(·) : X → A is an ITR. The value function is defined as the expectation of the potential outcomes over the population distribution of (X, A, Y) under A = d(X), which is estimable when the
following causal assumptions (Rubin, 1974) hold:
Assumption 1 (a) Stable unit treatment value assumption: Y = Y(A); (b) no unmeasured confounders: A ⊥⊥ Y(a) | X, for any a ∈ A; (c) positivity: P(A = a | X) ≥ p_A, ∀a ∈ A, ∀X ∈ X, for some p_A > 0.
Assumption (a) is also referred to as ‘consistency’ in causal inference, which assumes that the
potential outcomes of each subject do not vary with treatments assigned to other subjects. The
treatments are well defined in that the same treatment leads to the same potential outcome.
Assumption (b) states that all confounders are observed in pre-treatment covariates, so that the treatment and potential outcomes are conditionally independent given the pre-treatment covariates. Assumption (c) claims that for any pre-treatment covariates X, each treatment can be assigned with a positive probability.
where I(·) is the indicator function. To maximize the value function, we can first estimate the conditional expectation E(Y|X = x, A = a), namely, the Q-function in the literature (Clifton & Laber,
2020). Then the optimal ITR can be obtained by
From the perspective of the multi-arm treatments, the Q-function (Kosorok & Laber, 2019;
Qi et al., 2020; Qian & Murphy, 2011) can be formulated as
E(Y | X, Ã) = m(X) + Σ_{l=1}^{|A|} δ_l(X) I(Ã = l)    (4)
where m(X) is the treatment-free effect representing a null effect without any treatment, and the functions δ_l(X)'s are treatment effects for the lth treatment. There are two major challenges when (4) is applied to the combination treatments problem: first, if the δ_l(·)'s are imposed to be some parametric model, for example, a linear model (Kosorok & Laber, 2019; Qian & Murphy, 2011), it could have a severe mis-specification issue, especially considering the complex nature of the interaction effects of combination treatments. Second, as the number of treatments K increases, the number of treatment-specific functions δ_l(·)'s could grow exponentially. Therefore, the estimation efficiency of the ITR based on the Q-function (4) could be severely compromised for either parametric or nonparametric models, especially in clinical trials or observational studies with limited sample sizes.
In addition, the combination of multiple treatments expands the treatment space A and provides many more feasible treatment options. Each individual therefore has more choices than the yes-or-no decision in the binary treatment scenario, and it is natural to consider accommodating realistic budget constraints while maintaining an effective outcome. In this paper, we further consider a population-level budget constraint as follows. Suppose the costs of the K treatments are c = (c_1, c_2, . . . , c_K), where c_k denotes the cost of the kth treatment. Then the budget constraint for a population with a sample size n is
C_n(d) := (1/n) Σ_{i=1}^n c^T d(X_i) ≤ B    (5)
where B is the average budget for each subject. This budget constraint is suitable for many policy-
making problems such as welfare programmes (Bhattacharya & Dupas, 2012) and vaccination
distribution problem (Matrajt et al., 2021).
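As a concrete illustration, constraint (5) can be checked directly once each subject's combination assignment d(X_i) is coded as a binary vector over the K treatments. The sketch below is purely illustrative; the costs and assignments are made up:

```python
def budget_feasible(assignments, costs, B):
    """Check the population-level budget constraint (5):
    C_n(d) = (1/n) * sum_i c^T d(X_i) <= B,
    where each assignment d(X_i) is a binary vector over K treatments."""
    n = len(assignments)
    total = sum(sum(c * a for c, a in zip(costs, d_i)) for d_i in assignments)
    return total / n <= B

# Three subjects, K = 2 treatments with costs c = (1.0, 2.0):
costs = [1.0, 2.0]
d = [[1, 0], [1, 1], [0, 0]]   # subject-level combination assignments
print(budget_feasible(d, costs, B=1.5))  # average cost (1 + 3 + 0)/3 = 4/3 <= 1.5
```

The average cost here is 4/3, so the assignment is feasible for B = 1.5 but not for B = 1.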
3 Methodology
In Section 3.1, we introduce the proposed DEM for estimating the optimal ITR for combination
treatments. Section 3.2 considers the optimal assignment of combination treatments under budget
constraints. The estimation procedure and implementation details are provided in Section 3.3.
where m(·) : X → R is the treatment-free effect as in (4), and α(·) : X → R^r is an encoder that represents the covariates in an r-dimensional latent space, so that the treatment effect of the lth treatment can be written as δ_l(X) = Σ_{i=1}^r β^(i)(Ã_l) α^(i)(X).
Note that the model for multi-arm treatments (4) is a special case of the DEM (6) where α(X) = (δ_1(X), . . . , δ_{|A|}(X)) and β(Ã) = (I(Ã = Ã_1), . . . , I(Ã = Ã_{|A|})) if r = |A|. Another special case of (6) is angle-based modelling (Qi et al., 2020; Xue et al., 2021; Zhang et al., 2020), which has been applied to the estimation of the ITR for multi-arm treatments. In the angle-based framework, each treatment is encoded with a fixed vertex in the simplex, and each subject is projected into a latent space of the same dimension as the treatments, so that the optimal treatment is determined by the angle between the treatment vertices and the subject latent factors. However, the dimension of the simplex and latent space is r = |A| − 1, so angle-based modelling suffers from the same inefficiency issue as (4).
Since different combination treatments could contain the same individual treatments, it is over-parameterized to model the treatment effects of each combination treatment independently. For instance, the treatment effect of the combination of drug A and drug B is correlated with the individual treatment effects of drug A and drug B, respectively. Therefore, we seek a low-dimensional function space that incorporates the correlation among combination treatments without over-parametrization. In the DEM (6), the dimension r of the encoder outputs controls the complexity of the function space spanned by α^(1)(·), . . . , α^(r)(·). Empirically, the dimension r is a tuning parameter, which can be determined via the hyper-parameter tuning procedure. In other words, the complexity of the DEM is determined by the data itself, rather than pre-specified. In addition, the reduced dimension also leads to a parsimonious model with fewer parameters, which permits efficient estimation of treatment effects. Furthermore, we do not impose any parametric assumptions on α(·), which allows us to employ flexible nonlinear or nonparametric models with r-dimensional output to avoid potential mis-specification of treatment effects.
Given the double encoder framework in (6), the treatment effects of the combination treatments
share the same function bases α(1) (·),…, α(r) (·). Therefore, the treatment encoder β(·) is necessary to
represent all treatments in A so that α(X)T β(A) can represent treatment effects for all treatments.
Through this modelling strategy, we convert the complexity of the |A| treatment-specific functions δ_l(·)'s in (4) into the representation complexity of β(·), in that β(·) represents the |A| treatments in an r-dimensional latent space. As a result, we can reduce the complexity of the combination treatment problem and achieve efficient estimation if an efficient representation (i.e. r ≪ |A|) of the |A| treatments can be found.
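To make the parameter-sharing idea concrete, the following toy sketch shows how every combination treatment reuses the same r covariate basis functions α^(1)(·), . . . , α^(r)(·), differing only through the coefficients supplied by β(·). The encoders, weights, and dimensions (r = 2, K = 3) here are hypothetical hand-picked values for illustration, not the fitted model:

```python
import math

r = 2  # latent dimension, a tuning parameter (r << |A| for efficiency)

def alpha(x):
    """Toy covariate encoder: maps covariates to r shared basis values.
    (Hypothetical nonlinear features, for illustration only.)"""
    return [math.tanh(x[0]), x[1] ** 2]

def beta(a):
    """Toy treatment encoder: maps a binary combination a in {0,1}^K
    to r-dimensional coefficients; W is shared across all treatments."""
    W = [[0.5, -0.2, 0.1],   # row i holds weights for basis i
         [0.3, 0.4, -0.6]]
    return [sum(w * ak for w, ak in zip(W[i], a)) for i in range(r)]

def delta(x, a):
    """Treatment effect delta(x, a) = alpha(x)^T beta(a): every combination
    reuses the same r basis functions, only the coefficients change."""
    return sum(ai * bi for ai, bi in zip(alpha(x), beta(a)))

print(delta([0.5, 1.0], [1, 1, 0]))
```

Changing the combination a only re-weights the same two basis values of α(x), which is exactly the source of the estimation-efficiency gain described above.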
In summary, the DEM (6) is a promising framework to tackle the two challenges in (4) if the covariates and treatment encoders can provide flexible and powerful representations of covariates and treatments, respectively, which will be elaborated in the following sections. Before we dive into the details of the covariates and treatment encoders, we first show the universal approximation property of the DEM, which guarantees its flexibility in approximating complex treatment effects.
Theorem 1 For any treatment effects δ_l(X) ∈ H_2 = {f : ∫_{x∈X} |f^(2)(x)|² dx < ∞}, and for any ϵ > 0, there exist α(·) : X → R^r and β(·) : A → R^r, where K ≤ r ≤ |A|, such that
The above theorem guarantees that the DEM (6) can represent the function space considered in
(4) sufficiently well given a sufficiently large r.
where β_0(A) and β_1(A) are the additive and interactive treatment encoders, respectively. In particular, β_0(A) is a linear function with respect to A, where W = (W_1, W_2, . . . , W_K) ∈ R^{r×K} and W_k is the latent representation of the kth treatment. As a result, α(X)^T β_0(A) = Σ_{k: A_k = 1} W_k^T α(X) gives the additive treatment effects of the combination treatment A. The constraints on β_1(·) ensure the identifiability of β_0(·) and β_1(·), such that any representation β(A) can be uniquely decoupled into β_0(A) and β_1(A).
The interaction effects are challenging to estimate in combination treatments. A naive solution is to assume that the interaction effects are ignorable, in which case the additive treatment encoder β_0(A) alone suffices to estimate the treatment effects of combination treatments. However, interaction effects are widely perceived in many fields such as medicine (Li et al., 2018; Stader et al., 2020), psychology (Caspi et al., 2010), and public health (Braveman et al., 2011). Statistically, ignoring the interaction effects could lead to inconsistent estimation of the treatment effects (Yu & Ding, 2023; Zhao & Ding, 2023) and the ITR (Liang et al., 2018). Hence, it is critical to incorporate the interaction effects in estimating the ITR for combination treatments.
A straightforward approach to modelling the interactive treatment encoder β_1(A), similar to the additive treatment encoder β_0(A), is what we call the treatment dictionary. Specifically, a matrix V = (V_1, V_2, . . . , V_{|A|}) ∈ R^{r×|A|} is a dictionary that stores the latent representation of each combination treatment, so that β_1(A) is defined as follows:
where e_Ã is the one-hot encoding of the categorical representation of A. Since the number of possible combination treatments |A| could grow exponentially as K increases, the number of parameters in V could also explode. Even worse, each column V_l can be updated only if the associated treatment Ã_l is observed. Given a limited sample size, each treatment may be observed only a few times in combination treatment scenarios, which leads the estimation efficiency of V to be severely compromised. The same puzzle is also observed in other methods. For the Q-function in (4), the
parameters in δ_l(X) can be updated only if Ã_l is observed; in the treatment-agnostic representation network (TARNet) (Shalit et al., 2017) and the Dragonnet (Shi et al., 2019), each treatment is associated with an independent set of regression layers to estimate the treatment-specific treatment effects, which results in inefficient estimation for combination treatment problems.
In order to overcome the above issue, we propose to utilize the feed-forward neural network
(Goodfellow et al., 2016) to learn efficient latent representations in the r-dimensional space.
Specifically, the interactive treatment encoder is defined as
where 𝒰_l(x) = U_l x + b_l is a linear operator with weight matrix U_l ∈ R^{r_l × r_{l−1}} and biases b_l. The activation function is chosen as the ReLU function σ(x) = max(x, 0) in this paper. An illustration of the neural network interactive treatment encoder is shown in Figure 1. Note that all parameters in (9) are shared among all possible treatments, so all of the weight matrices and biases in (9) are updated regardless of the input treatment, which could improve the estimation efficiency, even though (9) may include more parameters than the treatment dictionary (8). As a result, the DEM with (9) not only guarantees a faster convergence rate (with respect to K) of the value function but also improves the empirical performance, especially when K is large or the sample size n is small, which will be shown in the numerical studies and real data analysis. A direct comparison of the neural network interactive treatment encoder (9), the treatment dictionary (8), the additive model (4), TARNet (Shalit et al., 2017), and Dragonnet (Shi et al., 2019) is also shown in Figure 1.
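A minimal sketch of the shared-parameter idea may help: below, a two-layer ReLU network maps any binary combination a ∈ {0,1}^K to R^r with one common set of weights, in contrast to the one-column-per-treatment dictionary (8). All weights here are arbitrary illustrative values, not fitted parameters from the paper:

```python
import itertools

def relu(v):
    return [max(t, 0.0) for t in v]

def affine(U, b, x):
    """Linear operator U x + b, with U given as a list of rows."""
    return [sum(Ui[j] * x[j] for j in range(len(x))) + bi
            for Ui, bi in zip(U, b)]

# A two-layer interactive treatment encoder beta_1: {0,1}^K -> R^r.
# Every weight is shared by all combinations, so each update uses
# information from whichever treatment happens to be observed.
K, hidden, r = 3, 4, 2
U1 = [[0.2, -0.1, 0.4], [0.3, 0.3, -0.2], [-0.5, 0.1, 0.2], [0.1, 0.2, 0.3]]
b1 = [0.0] * hidden
U2 = [[0.5, -0.3, 0.2, 0.1], [-0.2, 0.4, 0.3, -0.1]]
b2 = [0.0] * r

def beta1(a):
    return affine(U2, b2, relu(affine(U1, b1, a)))

# The same parameters produce a representation for every a in {0,1}^K:
for a in itertools.product([0, 1], repeat=K):
    print(a, [round(v, 3) for v in beta1(list(a))])
```

Note that the dictionary (8) would need 2^K columns to cover the same input space, while this network's parameter count is fixed by the layer widths.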
Although the interactive treatment encoder (9) allows efficient estimation, it is not guaranteed to represent up to |A| interaction effects. In the treatment dictionary (8), the columns V_l's are free parameters that represent the |A| treatments without any constraints. However, an 'under-parameterized' neural network is not capable of representing |A| treatments in r-dimensional space. For example, suppose there are three treatments to be combined (K = 3) and the treatment effects are sufficiently captured by a one-dimensional α(X) with different coefficients (r = 1). We use the following one-hidden-layer neural network to represent the treatment in R:
where u_2, b_2, b_1 ∈ R are scalars and U_1 ∈ R^{1×3}. In other words, the hidden layer only includes one node. In the following, we show that this neural network can only represent restricted interaction effects:
Proposition 1 The one-hidden-layer neural network (10) can only represent the following interaction effects: (a) β_1(A) ≥ 0 or β_1(A) ≤ 0 for all A ∈ A, and (b) β_1(A) takes the same value for all combinations of two treatments.
The proof of Proposition 1 is provided in the online supplementary material. Based on the above
observation, it is critical to guarantee the representation power of β1 (A) to incorporate flexible
interaction effects. In the following, we establish a theoretical guarantee of the representation
power of β1 (·) under a mild assumption on the widths of neural networks:
The above result is adapted from recent work on the memorization capacity of neural networks (Bubeck et al., 2020; Yun et al., 2019). Theorem 2 shows that if there are Ω(2^{K/2} r^{1/2}) hidden nodes in the neural network, then it is sufficient to represent all possible interaction effects in R^r. However, obtaining the parameter set {U_l, b_l, l = 1, 2, 3} in Theorem 2 via an optimization algorithm is not guaranteed due to the non-convex loss surface of neural networks. In practice, the neural network widths in Theorem 2 can serve as a guide, and choosing a wider network is recommended to achieve better empirical performance.
In summary, we propose to formulate the treatment encoder as two decoupled parts: the additive treatment encoder and the interactive treatment encoder. We provide two options for the interactive treatment encoder: the treatment dictionary and the neural network, where the neural network can improve the asymptotic convergence rate and empirical performance with guaranteed representation power. In the numerical studies, we use the neural network interactive treatment encoder for our proposed method, and a comprehensive comparison between the treatment dictionary and the neural network is provided in the online supplementary material.
where 𝒯_l(x) = T_l x + c_l is a linear operator with weight matrix T_l ∈ R^{r_l × r_{l−1}} and biases c_l. The activation function is chosen as the ReLU function σ(x) = max(x, 0) in this paper. Note that the depth and width of the covariates encoder α(·) are not necessarily identical to those of the interactive treatment encoder β_1(·); these are all tuning parameters to be determined through hyper-parameter tuning.
Even though neural networks achieve superior performance in many fields, their performance in small-sample-size problems, such as clinical trials or observational studies in medical research, is still deficient. In addition, neural networks lack interpretability due to the nature of their recursive composition; therefore, the adoption of neural networks in medical research is still under review. Here, we propose polynomial and B-spline covariates encoders to incorporate nonlinear treatment effects with better interpretation. For the polynomial covariates encoder, we first expand each covariate x_i into polynomials of a specific order, (x_i, x_i², . . . , x_i^d), where d is a tuning parameter. Then we take linear combinations of all polynomials as the output of the covariates encoder. Figure 2 provides an example of the polynomial covariates encoder with d = 3. Similarly, for the B-spline covariates encoder, we first expand each covariate into B-spline bases, where the number of knots and the spline degree are tuning parameters. Likewise, linear combinations of these B-spline bases are adopted as the output of the encoder. Although both polynomial and B-spline covariates encoders can accommodate interaction terms among the polynomial or B-spline bases for better approximation of multivariate functions, an exponentially increasing number of parameters needs to be estimated as the dimension of covariates or the degree of the bases increases. In the interest of computational feasibility, we do not consider interaction terms in the following discussion.
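The polynomial covariates encoder can be sketched in a few lines; the degree d, the weights, and the covariate values below are illustrative only, and no cross-covariate interaction terms are formed, matching the discussion above:

```python
def poly_encoder(x, d, weights):
    """Polynomial covariates encoder: expand each covariate x_i into
    (x_i, x_i^2, ..., x_i^d), then take one linear combination of the
    expanded basis per output dimension. weights[k] holds one
    coefficient per basis term; no cross-covariate interactions."""
    basis = [xi ** p for xi in x for p in range(1, d + 1)]
    return [sum(w * b for w, b in zip(wk, basis)) for wk in weights]

# Two covariates, degree d = 3 -> 6 basis terms, r = 2 outputs.
x = [2.0, 0.5]
W = [[1, 0, 0, 0, 0, 0],     # output 1 picks out x_1
     [0, 0, 1, 0, 1, 0]]     # output 2 is x_1^3 + x_2^2
print(poly_encoder(x, 3, W))  # -> [2.0, 8.25]
```

A B-spline version would differ only in the basis expansion step, with the knots and degree playing the role of d.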
where V_n and C_n are defined on a pre-specified population with a sample size n. Here, the covariates x_i (i = 1, 2, . . . , n) are treated as fixed. Based on the model formulation (6), maximizing the objective function of (12) is equivalent to
argmax_d V_n(d) = argmax_d (1/n) Σ_{i=1}^n [m(x_i) + α(x_i)^T β(d(x_i))]
                = argmax_d (1/n) Σ_{i=1}^n α(x_i)^T β(d(x_i))
                = argmax_{d_ij} (1/n) Σ_{i=1}^n Σ_{j=1}^{|A|} δ_ij d_ij
where δ_ij = α(x_i)^T β(a_j) denotes the treatment effect of the jth combination treatment on the ith subject, and d_ij = I{d(x_i) = a_j} ∈ {0, 1} indicates whether the ith subject receives the jth combination treatment. Since one subject can only receive one combination treatment, we impose the constraint Σ_j d_ij = 1. Similarly, budget constraints can be formulated as (1/n) Σ_{i=1}^n Σ_{j=1}^{|A|} c_{ã_j} d_ij ≤ B, where c_{ã_j} is the cost of treatment ã_j calculated from the cost vector c. The constrained ITR can be solved as follows:
The above optimization problem is equivalent to a multi-choice knapsack problem (Kellerer et al.,
2004). For a binary treatment setting, the solution of (13) is the quantile of η(X), which is a special
case of our formulation.
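For intuition, a multi-choice knapsack of this form can be solved exactly by dynamic programming when the costs are integers (rescaling costs to integers is a common preprocessing step). The sketch below, with made-up effect estimates, maximizes Σ_i Σ_j δ_ij d_ij subject to one choice per subject and a total budget of nB; it illustrates the combinatorial structure rather than the algorithm used in the paper:

```python
def constrained_value(delta, costs, total_budget):
    """Multi-choice knapsack DP: delta[i][j] is the estimated effect of
    option j for subject i, costs[j] is its (integer) cost, and each
    subject must pick exactly one option. Returns the best total effect
    achievable within total_budget (n * B in the notation of (5))."""
    NEG = float("-inf")
    best = [0.0] + [NEG] * total_budget   # best[b]: max value at total cost b
    for row in delta:                      # one DP sweep per subject
        new = [NEG] * (total_budget + 1)
        for b, v in enumerate(best):
            if v == NEG:
                continue
            for j, c in enumerate(costs):
                if b + c <= total_budget and v + row[j] > new[b + c]:
                    new[b + c] = v + row[j]
        best = new
    return max(v for v in best if v != NEG)

# Two subjects; options: no treatment (cost 0), drug A (1), A + B (2).
delta = [[0.0, 2.0, 3.0],
         [0.0, 1.0, 5.0]]
print(constrained_value(delta, costs=[0, 1, 2], total_budget=2))  # -> 5.0
```

With budget 2 the optimum gives subject 2 the combination (effect 5.0) and subject 1 nothing, rather than splitting the budget evenly.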
To understand the connection between the constrained ITR and the multi-choice knapsack problem, we note that the priority of a treatment is associated with the definition of dominance in the multi-choice knapsack problem: for any i ∈ {1, 2, . . . , n}, if δ_ik > δ_il and c_{a_k} < c_{a_l}, then treatment l is dominated by treatment k. In other words, treatment k achieves a better outcome than treatment l at a lower cost. Thus, the dominance property could be an alternative to the contrast functions in combination treatment settings.
Here, δ_ij denotes the treatment effects, and no parametric assumptions are required, so this framework is also applicable to other methods that provide estimates of treatment effects, such as the L1-penalized least squares (L1-PLS) (Qian & Murphy, 2011) and outcome weighted learning with multinomial deviance (OWL-MD; Huang et al., 2019). However, the objective function in (12) depends on the estimation of δ_ij, and we show in Theorem 4 that the value reduction under budget constraints is bounded by the estimation error of the δ_ij's. Consequently, estimation bias in δ_ij could lead to biased results in (13). Since the proposed model (6) provides efficient and accurate estimation of treatment effects, it also results in a more favourable property for solving the budget-constrained ITR for combination treatments.
where P̂(A_i | X_i) is a working model of the propensity score specifying the probability of treatment assignment given pre-treatment covariates, and m̂(X_i) is a working model of the treatment-free effects. The inverse probability weights given by the propensity scores balance the samples assigned to different combination treatments, which are assumed to be equal under the randomized clinical trial setting. By removing the treatment-free effects m(x) from the responses before we estimate the treatment effects, the numerical stability can be improved and the estimator variance is reduced, as also observed in Zhou et al. (2017) and Fu et al. (2016). Furthermore, the estimator in (14) is doubly robust in that if either P̂( · | · ) or m̂(·) is correctly specified, α̂(·)^T β̂(·) is a consistent estimator of the treatment effects; a detailed proof is provided in the online supplementary material. This result extends the results in Meng and Qiao (2020) from binary and multiple treatments to combination
treatments. Empirically, we minimize the sample average of the loss function (14) with additional penalties: for the additive treatment encoder, an L2 penalty is imposed to avoid overfitting; for the interactive treatment encoder, an L1 penalty is added since interaction effects are usually sparse (Wu & Hamada, 2011).
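Written out, the penalized objective amounts to an inverse-probability-weighted residual loss plus the two penalties. The schematic below uses toy numbers; `effect[i]` stands in for α(x_i)^T(β_0(a_i) + β_1(a_i)), and the two weight lists are hypothetical flattened encoder parameters:

```python
def dem_loss(y, m_hat, prop, effect, w_additive, w_interactive, lam_a, lam_i):
    """Inverse-probability-weighted residual loss with the penalties
    described in the text: L2 on additive-encoder weights, L1 on
    interactive-encoder weights (interactions are usually sparse).
    All inputs are plain per-subject lists."""
    n = len(y)
    fit = sum((y[i] - m_hat[i] - effect[i]) ** 2 / prop[i]
              for i in range(n)) / n
    pen = lam_a * sum(w * w for w in w_additive) \
        + lam_i * sum(abs(w) for w in w_interactive)
    return fit + pen

loss = dem_loss(y=[1.0, -0.5], m_hat=[0.2, 0.1], prop=[0.5, 0.25],
                effect=[0.6, -0.4], w_additive=[0.3, -0.1],
                w_interactive=[0.0, 0.2], lam_a=0.01, lam_i=0.05)
print(round(loss, 4))  # -> 0.131
```

Small propensities inflate individual terms of the fit component, which motivates the stabilization discussed next.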
In this work, the working model of the propensity score is obtained via penalized multinomial logistic regression (Friedman et al., 2010). Specifically, the multinomial logistic model is parameterized by γ_1, γ_2, . . . , γ_{2^K} ∈ R^p:

max_{γ_1, . . . , γ_{2^K}} (1/n) Σ_{i=1}^n [ Σ_{k=1}^{2^K} γ_k^T x_i I(Ã_i = k) − log Σ_{k′=1}^{2^K} exp(γ_{k′}^T x_i) ] − λ Σ_{j=1}^p ( Σ_{k=1}^{2^K} γ_{kj}² )^{1/2}
where the group lasso (Meier et al., 2008) is used to penalize parameters across all treatment groups. A potential issue with propensity score estimation is that the estimated probabilities could be negligible when there are many possible treatments, which leads to unstable estimators of the treatment effects. To alleviate this limitation of inverse probability weighting, we stabilize the propensity scores (Xu et al., 2010) by multiplying the weights by the frequency of the corresponding treatment.
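The stabilization step can be sketched as follows; the treatment labels and propensity values are fabricated for illustration:

```python
from collections import Counter

def stabilized_weights(treatments, p_hat):
    """Stabilized inverse-probability weights: multiply each weight by
    the marginal frequency of the received treatment,
    w_i = freq(a_i) / p_hat(a_i | x_i),
    which keeps weights bounded when some estimated propensities are tiny."""
    n = len(treatments)
    freq = Counter(treatments)
    return [freq[a] / n / p for a, p in zip(treatments, p_hat)]

a = ["AB", "A", "AB", "none"]      # observed combination treatments
p = [0.4, 0.1, 0.5, 0.8]           # estimated propensities P_hat(a_i | x_i)
print([round(w, 4) for w in stabilized_weights(a, p)])  # -> [1.25, 2.5, 1.0, 0.3125]
```

Without the frequency factor, the second subject's weight would be 1/0.1 = 10; stabilization shrinks it to 2.5.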
For the estimation of the treatment-free effects m(·), we adopt a two-layer neural network:
where w_{2m} ∈ R^h and W_{1m} ∈ R^{h×p} are weight matrices, and σ(x) is the ReLU function. The width h controls the complexity of this model. The weight matrices are estimated by minimizing:
min_{w_{2m}, W_{1m}} (1/n) Σ_{i=1}^n [y_i − w_{2m}^T σ(W_{1m} x_i)]²
Given the working models m̂(x) and P̂(a|x) for the treatment-free effects and propensity scores, we propose to optimize the double encoder alternately. The detailed algorithm is listed in Algorithm 1.
Input: Training dataset (x_i, a_i, y_i)_{i=1}^n, working models m̂(x), P̂(a|x), and hyper-parameters, including network-structure hyper-parameters (e.g. network depths L_α, L_β, network widths r_α, r_β, and encoder output dimension r) and optimization hyper-parameters (e.g. additive treatment encoder penalty coefficient λ_a, interactive treatment encoder penalty coefficient λ_i, mini-batch size B, learning rate η, and number of training epochs E).
Initialization: Initialize parameters in α̂^(0)(x), β̂_0^(0)(a), and β̂_1^(0)(a).
Training:
for e in 1 : E do
  for each mini-batch sampled from (x_i, a_i, y_i)_{i=1}^n do
    α̂^(e)(x) = argmin_α (1/B) Σ_i [1/P̂(a_i|x_i)] {y_i − m̂(x_i) − α(x_i)^T (β̂_0^(e−1)(a_i) + β̂_1^(e−1)(a_i))}²
    β̂_0^(e)(a) = argmin_{β_0} (1/B) Σ_i [1/P̂(a_i|x_i)] {y_i − m̂(x_i) − α̂^(e)(x_i)^T (β_0(a_i) + β̂_1^(e−1)(a_i))}² + λ_a ‖β_0‖₂²
    β̂_1^(e)(a) = argmin_{β_1} (1/B) Σ_i [1/P̂(a_i|x_i)] {y_i − m̂(x_i) − α̂^(e)(x_i)^T (β̂_0^(e)(a_i) + β_1(a_i))}² + λ_i ‖β_1‖₁
  end for
end for
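The alternating structure can be illustrated with a deliberately simplified toy: below, r = 1 and both encoders are linear, α(x) = u^T x and β(a) = v^T a, fitted by alternating full-batch gradient steps on u (with v fixed) and on v (with u fixed). The data, step size, and unweighted squared loss are all made up; the paper instead uses Adam on inverse-propensity-weighted mini-batches with neural encoders, so this sketch only illustrates the alternation itself:

```python
def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

def loss(u, v, X, A, y):
    """Mean squared error of the bilinear fit (u^T x)(v^T a)."""
    return sum((yi - dot(u, x) * dot(v, a)) ** 2
               for x, a, yi in zip(X, A, y)) / len(y)

def alternating_fit(X, A, y, lr=0.05, epochs=500):
    u, v = [0.5] * len(X[0]), [0.5] * len(A[0])
    n = len(y)
    for _ in range(epochs):
        # Step on u with v fixed: the model is linear in u.
        g = [0.0] * len(u)
        for x, a, yi in zip(X, A, y):
            r = yi - dot(u, x) * dot(v, a)
            for j in range(len(u)):
                g[j] += -2.0 * r * dot(v, a) * x[j] / n
        u = [uj - lr * gj for uj, gj in zip(u, g)]
        # Step on v with u fixed: the model is linear in v.
        g = [0.0] * len(v)
        for x, a, yi in zip(X, A, y):
            r = yi - dot(u, x) * dot(v, a)
            for k in range(len(v)):
                g[k] += -2.0 * r * dot(u, x) * a[k] / n
        v = [vk - lr * gk for vk, gk in zip(v, g)]
    return u, v

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
A = [[1, 0], [0, 1], [1, 1], [1, 0]]
y = [1.0, 1.0, 4.5, 4.0]   # generated from u* = (1, 2), v* = (1, 0.5)
u, v = alternating_fit(X, A, y)
print(loss(u, v, X, A, y) < loss([0.5, 0.5], [0.5, 0.5], X, A, y))  # loss decreases
```

Each sub-problem is convex given the other block, which is what makes the alternation well behaved even though the joint problem is not convex.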
Specifically, we employ the Adam optimizer (Kingma & Ba, 2014) to optimize each encoder: the covariates encoder α(x), the additive treatment encoder β_0(a), and the interactive treatment encoder β_1(a). To stabilize the optimization over iterations, we utilize an exponential scheduler (Patterson & Gibson, 2017), which decays the learning rate by a constant factor per epoch. In all of our numerical studies, we use 0.95 as the decay constant for the exponential scheduler. To ensure the identifiability of the treatment effects, we also impose a constraint on the treatment encoder β(·) such that Σ_{a∈A} β(a) = 0. To satisfy this constraint, we add an additional normalization layer before the output of the treatment encoder.
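One simple way to realize the constraint Σ_{a∈A} β(a) = 0 is to subtract the mean of β over all 2^K combinations from every output; the sketch below is a stand-in for such a normalization layer, with a hypothetical raw encoder (K = 2, r = 2) chosen purely for illustration:

```python
from itertools import product

def centered_beta(beta, K):
    """Enforce sum_{a in A} beta(a) = 0 by subtracting the mean of beta
    over all 2^K combinations (a simple stand-in for the normalization
    layer described in the text)."""
    all_a = [list(t) for t in product([0, 1], repeat=K)]
    outs = [beta(a) for a in all_a]
    r = len(outs[0])
    mean = [sum(o[i] for o in outs) / len(outs) for i in range(r)]
    return lambda a: [bi - mi for bi, mi in zip(beta(a), mean)]

raw = lambda a: [a[0] + 0.5 * a[1], a[0] * a[1]]   # hypothetical encoder
b = centered_beta(raw, 2)
total = [sum(b(list(a))[i] for a in product([0, 1], repeat=2))
         for i in range(2)]
print([round(t, 12) for t in total])  # -> [0.0, 0.0]
```

Centering leaves all pairwise contrasts β(a) − β(a′) unchanged, which is why it resolves the identifiability issue without altering the implied decision rule.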
4 Theoretical guarantees
In this section, we establish the theoretical properties of ITR estimation for combination treatments and of the proposed method. First, we establish the value reduction bound for combination treatments, either with or without budget constraints. Second, we provide a non-asymptotic excess risk bound for the DEM, which achieves a faster convergence rate compared with existing methods for multi-arm treatment problems.
Assumption 2 For any ϵ > 0, there exist some constants C > 0 and γ > 0 such that

P( max_{A,A′∈A} |δ^*(X, A) − δ^*(X, A′)| ≤ ϵ ) ≤ Cϵ^γ    (16)
Assumption 2 is a margin condition characterizing the behaviour of the boundary between different combination treatments. A larger value of γ indicates that the treatment effects are distinguishable with a higher probability, suggesting that it is easier to find the optimal ITR. Similar assumptions are also required in the literature (Qi et al., 2020; Qian & Murphy, 2011; Zhao et al., 2012) to achieve a faster convergence rate of the value reduction bound.
The following theorem shows that the value reduction is bounded by the estimation error of the
treatment effects, and the convergence rate can be improved if Assumption 2 holds:
Theorem 3 Suppose the treatment effects δ^*( · , · ) ∈ H_2. For any estimator δ̂( · , · ) and the corresponding decision rule d̂ such that d̂(X) ∈ argmax_{A∈A} δ̂(X, A), we have

V(d^*) − V(d̂) ≤ 2 max_{A∈A} {E[δ^*(X, A) − δ̂(X, A)]²}^{1/2}    (17)
Theorem 3 builds a connection between the value reduction and the estimation error of the
treatment effects δ̂( · , · ), which shows that an accurate estimation of treatment effects would
lead the estimated value function V(d̂) to approach the optimal value function V(d∗ ). Based on
Theorem 3, we can further connect the value reduction bound to the excess risk of the estimator
of the proposed model:
Next, we consider the value reduction bound under budget constraints. Since the multi-choice
Theorem 4 For the approximated value function obtained from Algorithm 2, for any B > 0, we have

|Z^*(B) − Ẑ(B)| ≤ (1/n) Σ_{i=1}^n max_{Ã_j∈A} |δ^*(x_i, Ã_j) − δ̂(x_i, Ã_j)|
In other words, the approximated value function under budget constraints converges if δ̂( · , · ) is a consistent estimator of the treatment effects. Note that the proposed estimator is doubly robust: if either the propensity score or the treatment-free effects model is correctly specified, the proposed estimator is consistent, which consequently leads both the value function and the approximated value function under budget constraints to converge.
Lemma 1 For any distribution of (X, A, Y) with E[Y²] ≤ c_1, given a function Q̂ from Q, with probability 1 − 2ϵ,

L(Q̂) − L(Q^*) ≤ 8C R_n(Q) + √(2c_1² log(1/ϵ)/n)    (19)

where C is the Lipschitz constant of L(Q), and R_n(Q) is the Rademacher complexity of Q.
Lemma 1 provides an upper bound on the excess risk in Corollary 1 using the Rademacher complexity of Q. However, the Rademacher complexity of a general neural network is still an open problem in the literature, and existing bounds are mainly established under different types of norm constraints on the weight matrices (Bartlett et al., 2017; Golowich et al., 2018; Neyshabur et al., 2015, 2017). In this work, we focus on the following sub-class of Q with L2 and spectral norm constraints:
Q_{B_m,B_α,B_β} = {Q ∈ Q : ‖w_{2m}‖_2 ≤ B_m, ‖W_{1m}‖_{2,∞} ≤ B_m, ‖T_l‖_2 ≤ B_α, ‖U_l‖_2 ≤ B_β}
where ‖ · ‖_2 denotes the L2-norm for vectors and the spectral norm for matrices. For any matrix X = (X_1, . . . , X_p), where X_i is the ith column of X, we use ‖X‖_{2,∞} = max_i ‖X_i‖_2 to denote the L_{2,∞} norm of X. We then establish an upper bound on the Rademacher complexity of Q_{B_m,B_α,B_β} as follows:
Lemma 2 Suppose E[‖X‖₂²] ≤ c_2². The Rademacher complexity of Q_{B_m,B_α,B_β} is upper bounded by
Lemma 2 provides an upper bound on the Rademacher complexity of Q_{B_m,B_α,B_β} with the rate O(√(1/n)). The first term of (20) is the upper bound for the function class of m(x) in (15), which depends on the width h of the hidden layer. If h is large, the function m(x) can approximate a larger function space, but at the price of a less tight upper bound on the generalization error. The second term of (20) is associated with the function class of the inner product of the double encoders, with a convergence rate of O(K^{1/2} n^{−1/2}). The rate grows with the number of treatments K rather than |A| because of the parameter-sharing feature of the interactive treatment encoder and the linearly growing input dimension of the function β(·) in the proposed method. Specifically, the input of β(·) is the combination treatment A itself, and the parameters in the treatment encoder are shared by all combination treatments. Thus, the model complexity is proportional to K and to the product of the spectral norms of the weight matrices. Based on Lemmas 1 and 2, we derive the value reduction bound for the proposed method as follows:
Theorem 5 For any distribution of (X, A, Y) with E[Y²] ≤ c₁ and E[‖X‖₂²] ≤ c₂², considering neural networks in the subspace Q_{B_m, B_α, B_β}, with probability at least 1 − 2ϵ, we have the following value reduction bound:
$$V(d^*) - V(\hat{d}) \le 2\left\{ 16 C B_m^2 c_2 \sqrt{\frac{h}{n}} + 8 C B_\alpha^{L_\alpha} B_\beta^{L_\beta} c_2 \sqrt{\frac{K}{n}} + \sqrt{\frac{2 c_1^2 \log(1/\epsilon)}{n}} \right\}^{1/2}.$$
Theorem 5 establishes the value reduction bound, showing that the estimated decision rule approaches the optimal value function as the sample size increases. Compared with the existing value reduction bound for multi-arm treatments, the proposed method improves the convergence rate from O(|A|^{1/4}) to O((log₂|A|)^{1/4}). Furthermore, the order of the value reduction bound can approach nearly n^{−1/2} as γ goes to infinity, which is consistent with the convergence rates established in Qian and Murphy (2011) and Qi et al. (2020).
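The route from Lemmas 1 and 2 to Theorem 5 can be sketched as follows, assuming (as is standard for such bounds, e.g. Qian & Murphy, 2011) that the value reduction is controlled by twice the square root of the excess risk:

```latex
V(d^*) - V(\hat{d})
  \le 2\,\bigl\{L(\hat{Q}) - L(Q^*)\bigr\}^{1/2}
  \le 2\left\{8C\,R_n(\mathcal{Q}_{B_m,B_\alpha,B_\beta})
      + \sqrt{\frac{2c_1^2 \log(1/\epsilon)}{n}}\right\}^{1/2},
```

and substituting the Rademacher complexity bound $R_n \le 2B_m^2 c_2 \sqrt{h/n} + B_\alpha^{L_\alpha} B_\beta^{L_\beta} c_2 \sqrt{K/n}$ from Lemma 2 yields the rate displayed in Theorem 5. Since $|\mathcal{A}| \le 2^K$ for $K$ individual treatments, the dependence on $K$ amounts to the stated $O((\log_2|\mathcal{A}|)^{1/4})$ rate.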
5 Simulation studies
In this section, we evaluate the performance of the proposed method in estimating the ITR for
combination treatments. Our numerical studies show that the proposed method achieves
superior performance to competing methods in both unconstrained and budget-constrained
scenarios.
and uniformly sampled from (−1, 1). Four simulation settings are designed to evaluate the performance under varying conditions. In simulation settings 1 and 2, we consider combinations of three treatments, which induces eight possible combinations, six of which are considered as our assigned treatments. Similarly, in simulation settings 3 and 4, we consider combinations of five treatments, and we assume that 20 of all combinations are assigned to subjects. The treatments are assigned either uniformly or following the propensity score model:
Table 1. Simulation settings 1 and 2: treatment effect and interaction effect functions specification

Combination   Treatment effects         Interaction effects
(0, 0, 0)     0                         –
(0, 0, 1)     2X1 + exp(X3 + X4)        –
(0, 1, 0)     2X2 log(X5) + X7          –

Note. Column ‘Treatment effects’ specifies the treatment effect functions of individual treatments adopted in simulation settings 1 and 2. Column ‘Interaction effects’ specifies the interaction effects among individual treatments in setting 2.
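The exact form of the propensity score model (21) referenced above is not reproduced in this excerpt; a generic multinomial-logit sketch of propensity-score-based assignment, with placeholder weights `W`, could look like:

```python
import numpy as np

def assign_treatments(X, W, seed=0):
    """Assign each subject one of J candidate combinations, with
    P(A = A_j | x) proportional to exp(x^T w_j) (a softmax model;
    the paper's actual model (21) may differ)."""
    rng = np.random.default_rng(seed)
    logits = X @ W                               # shape (n, J)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(W.shape[1], p=p) for p in probs])
```

With `W = 0` this reduces to the uniform assignment scheme used in the other half of the simulations.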
Table 2. Simulation settings 3 and 4: treatment effect and interaction effect functions specification

Combination       Treatment effects               Interaction effects
(0, 0, 0, 0, 0)   0                               –
(0, 0, 0, 0, 1)   (X1 − 0.25)^3                   –
(0, 0, 0, 1, 0)   2 log(X3) + 4 log(X8)cos(2πX10) –
(0, 0, 1, 0, 0)   X2 sin(X4) − 1                  –
(0, 0, 1, 0, 1)   –                               exp(2X2)
(0, 1, 0, 0, 0)   (X1 + X5 − X8^2)^3              –
(0, 1, 0, 0, 1)   –                               exp(2X4 + X9)
(0, 1, 0, 1, 1)   –                               −4 log(X6)
(0, 1, 1, 0, 0)   –                               0
(0, 1, 1, 1, 0)   –                               0
(1, 0, 0, 0, 0)   exp(X2 − X5)                    –
(1, 0, 0, 0, 1)   –                               0
(1, 0, 0, 1, 0)   –                               0
(1, 0, 1, 0, 0)   –                               0
(1, 0, 1, 0, 1)   –                               −(3/2)cos(2πX1 + X8^2)
(1, 1, 0, 0, 0)   –                               0
(1, 1, 0, 0, 1)   –                               −4 log(X6)
(1, 1, 0, 1, 1)   –                               X6^2 + (1/2)sin(2π/X7)
(1, 1, 1, 0, 0)   –                               0
(1, 1, 1, 1, 0)   –                               0

Note. Column ‘Treatment effects’ specifies the treatment effect functions of individual treatments adopted in simulation settings 3 and 4. Column ‘Interaction effects’ specifies the interaction effects among individual treatments in setting 4.
accuracy are reported in Tables 3 and 4, where the empirical value function (Qian & Murphy, 2011) is calculated via
$$\hat{V}(d) = \frac{\mathbb{E}_n[Y\, I\{d(X) = A\}]}{\mathbb{E}_n[I\{d(X) = A\}]}.$$
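The empirical value function above can be computed directly; in this sketch `d_x` holds the rule's recommended combination for each subject (illustrative names, not the authors' code):

```python
import numpy as np

def empirical_value(y, a, d_x):
    """V_hat(d) = E_n[Y * 1{d(X)=A}] / E_n[1{d(X)=A}]: the average outcome
    among subjects whose observed treatment matches the recommendation."""
    y = np.asarray(y, dtype=float)
    match = np.array([np.array_equal(ai, di) for ai, di in zip(a, d_x)])
    if not match.any():
        return float("nan")  # the rule matches no observed assignment
    return float(y[match].mean())
```

Under uniform assignment this ratio estimates the value of the rule d; under propensity-based assignment the same ratio is still well defined, as used in the tables below.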
Table 3. Continued. Columns report the value achieved by the proposed method, L1-PLS, OWL-MD, MOWL-linear, OWL-DL, and TARNet for each treatment assignment scheme, setting, and sample size.
Note. Two treatment assignment schemes are presented: all treatments are uniformly assigned to subjects (uniform), and treatments are assigned based on the propensity score model (21, PS-based).
Table 4. Unconstrained simulation study: comparisons of accuracies for the proposed method and existing methods including the L1-penalized least-square (L1-PLS; Qian & Murphy, 2011), the outcome weighted learning with multinomial deviance (OWL-MD; Huang et al., 2019), the multicategory outcome weighted learning with linear decisions (MOWL-linear; Zhang et al., 2020), the outcome weighted learning with deep learning (OWL-DL; Liang et al., 2018), and the treatment-agnostic representation network (TARNet; Shalit et al., 2017). Columns report the accuracy achieved by each method for each treatment assignment scheme, setting, and sample size.
Note. Two treatment assignment schemes are presented: all treatments are uniformly assigned to subjects (uniform), and treatments are assigned based on the propensity score model (21, PS-based).
proposed method can be adaptive to the additive setting with a large λᵢ. Therefore, the proposed method and the OWL-DL outperform the other competing methods in both settings. In contrast, complex interaction effects are considered in simulation settings 2 and 4, where the performance of OWL-DL is inferior, since consistent estimation is not guaranteed for OWL-DL in the presence of interaction effects. Although the other competing methods use saturated models to incorporate interaction effects, their estimation efficiency is still undermined because the decision functions in these methods are all treatment-specific, whereas the proposed method possesses the unique parameter-sharing feature for combination treatments.
Table 5. Constrained simulation study: comparisons of value functions for the proposed method and existing
methods including the L1 -penalized least-square (L1 -PLS; Qian & Murphy, 2011), the outcome weighted learning with
multinomial deviance (OWL-MD; Huang et al., 2019), the multicategory outcome weighted learning with linear
decisions (MOWL-linear; Zhang et al., 2020), and the treatment-agnostic representation network (TARNet; Shalit
et al., 2017)
Figure 4. Illustration of the patient-derived xenograft (PDX) data collection. Tumour samples from a patient are
implanted into multiple mice, where a PDX line is formed by these mice. Different treatments can be applied
simultaneously. Tumour size, which is the primary interest of the outcome, is measured for each mouse. RNA and
DNA sequencing and other features are collected as pre-treatment covariates.
screening by Rashid et al. (2020), and these pre-treatment covariates are inherently balanced since all treatment groups include exactly the same PDX lines. Furthermore, the outcome of interest is measured by the scaled maximum observed shrinkage of tumour size from the baseline size, where a larger value is more desirable.
For the budget-constrained setting, we impose the costs of treatments as follows: $79 for BKM120, $100 for LJC049, $66 for BYL719, $240 for cetuximab, $124 for encorafenib, $500 for LJM716, and $79 for binimetinib, where the prices for LJC049 and LJM716 are hypothetical, while the other prices are based on https://www.goodrx.com for unit dosage. We consider hypothetical budgets for these 37 PDX lines of $21,000, $15,000, $10,000, and $5,000, where $21,000 is equivalent to the unconstrained scenario because it is sufficient to cover the most expensive combination treatment for all PDX lines.
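These unit costs imply simple budget arithmetic; for instance, administering BYL719 + binimetinib to all 37 PDX lines costs (66 + 79) × 37 = $5,365, matching the figure quoted for the best one-size-fits-all rule. A small illustrative check:

```python
# Unit costs per dosage; the prices for LJC049 and LJM716 are hypothetical.
COSTS = {"BKM120": 79, "LJC049": 100, "BYL719": 66, "cetuximab": 240,
         "encorafenib": 124, "LJM716": 500, "binimetinib": 79}

def combo_cost(treatments, n_lines=37):
    """Total cost of giving one combination to every one of n_lines PDX lines."""
    return sum(COSTS[t] for t in treatments) * n_lines

print(combo_cost(["BYL719", "binimetinib"]))  # 145 * 37 = 5365
```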
Table 6. Table of single or combination treatments considered to treat CLR for patient-derived xenograft (PDX) lines
Type     Treatment indicator (one entry per treatment)
Single   1 0 0 0 0 0 0
Single   0 1 0 0 0 0 0
Single   0 0 1 0 0 0 0
Table 7. Mean and standard errors of value function under different budget constraints
Note. The budget $21,000 is equivalent to the non-constrained scenario. L1-PLS = L1-penalized least square; MOWL = multicategory outcome weighted learning; OWL-DL = outcome weighted learning with deep learning; TARNet = treatment-agnostic representation network.
To implement our proposed method and other competing methods, we randomly split the
dataset into training (25 PDX lines), validation (6 PDX lines), and testing (6 PDX lines) sets.
All methods are trained on the training set, while hyper-parameters are tuned based on the
validation set. The value function of the unconstrained scenario is calculated on the testing set
via V̂(d) = En [YI{d(X) = A}]/En [I{d(X) = A}]. For the budget-constrained scenarios, we apply
the selected model to all PDX lines to obtain the estimation of treatment effects, and then
apply the MCKP algorithm to all PDX lines and calculate the value function based on 37
PDX lines. Finally, we repeat the above random splitting 100 times to validate the results
for comparison.
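The repeated random splitting can be sketched as follows (an illustrative helper, not the authors' code):

```python
import numpy as np

def split_pdx_lines(n_lines=37, n_train=25, n_val=6, seed=0):
    """Randomly split PDX line indices into training/validation/testing
    sets of sizes 25/6/6, as in the data analysis."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_lines)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 100 repetitions: vary the seed and recompute the value function each time.
splits = [split_pdx_lines(seed=s) for s in range(100)]
```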
We report the means and standard deviations of the value functions under different budget constraints in Table 7. For the unconstrained scenario ($21,000 budget), the proposed method achieves a substantial improvement in value maximization. As a reference, the value functions under one-size-fits-all rules and under the optimal treatment are shown in Table 8: our proposed method narrows the value function gap between the optimal treatment assignment and one-size-fits-all rules by 23.0–81.1%. In comparison with the competing methods that also estimate the ITR, our proposed method narrows the value function gap between the optimal treatment assignment and competing ITR methods by 40.4–65.8%.
For the budget-constrained scenarios, the proposed method achieves clear advantages over the competing methods: even under the most restrictive budget constraint, its value function remains comparable to those of the competing methods in the unconstrained scenario. Compared with one-size-fits-all rules, the proposed method achieves the best value function
J R Stat Soc Series B: Statistical Methodology, 2024, Vol. 86, No. 3 737
Table 8. Value function under one-size-fits-all rules and optimal treatment assignment
Note. BYL719 + binimetinib (bold value) is the best one-size-fits-all rules; Optimal (italic) is the estimated ITR by the
proposed method.
with about a $5,000 budget, compared to the best one-size-fits-all rule, the combination of BYL719 and binimetinib, which requires a $5,365 budget. In summary, the proposed method controls the tumour size more effectively than any of the competing methods and one-size-fits-all rules. Our approach could have great potential for improving therapy quality for CLR patients.
7 Discussion
In this paper, we broaden the scope of estimating the ITR from binary and multi-arm treatments to combination treatments, where treatments within each combination can interact with each other. We propose the DEM as a nonparametric approach to accommodate the intricate treatment effects of combination treatments. Specifically, our method overcomes the curse of dimensionality by adopting neural network treatment encoders. The parameter-sharing feature of the neural network treatment encoder enhances estimation efficiency, so that the proposed method is able to outperform parametric approaches given a small sample size. In addition, we adapt the estimated ITR to budget-constrained scenarios through the multi-choice knapsack framework, which strengthens our proposed method in situations with limited resources. Theoretically, we offer a value reduction bound with and without budget constraints, and an improved convergence rate with respect to the number of treatments under the DEM.
Several potential research directions are worth exploring further. First, the proposed method employs the propensity score model to achieve the double robustness property. However, the inverse probability weighting method can break down in observational studies with combination treatments, due to the potential violation of the positivity assumption. This phenomenon is also observed in the binary treatment scenario with high-dimensional covariates (D'Amour et al., 2021). Existing works overcome this limitation in the binary and multi-arm treatment settings by substituting overlap weights (Li, 2019; Li et al., 2019) for the propensity score; however, this strategy does not resolve the same issue for combination treatments. Therefore, exploring alternative approaches for combination treatment problems is a worthwhile direction.
Second, compared with binary treatments, combination treatments enable us to optimize multiple outcomes of interest simultaneously. The major challenge with multiple outcomes is that each combination treatment may favour only a few outcomes, so an optimal ITR is expected to achieve a trade-off among them. Some recent works have studied trade-offs between the outcome of interest and risk factors (Huang & Xu, 2020; Wang et al., 2018). However, trade-offs among multiple outcomes could be more challenging.
Furthermore, interpretability is another desirable property of the ITR, especially in medical settings. The proposed method incorporates neural networks for their estimation and theoretical benefits, but the resulting rules are not straightforward to interpret. Some existing works (Laber & Zhao, 2015; Zhang et al., 2015) propose tree-type methods for better interpretability under binary or multi-arm settings, but these may not be applicable to combination treatments. In the explainable machine learning literature, there are post-hoc, model-agnostic approaches (Lundberg & Lee, 2017; Shrikumar et al., 2017) that can be drawn upon. However, more sophisticated adaptation might be needed for the combination treatment problem.
Acknowledgments
The authors are grateful to reviewers, the associate editor, and the editor for their insightful and
constructive comments that have helped to strengthen the content and clarity of the paper.
Funding
This work is supported by National Science Foundation Grants DMS 2210640 and DMS
1952406.
Data availability
The data that support the findings of the study are available at https://www.tandfonline.com/doi/
full/10.1080/01621459.2020.1828091.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series B.
References
Bartlett P., Foster D. J., & Telgarsky M. (2017). ‘Spectrally-normalized margin bounds for neural networks’,
arXiv, arXiv:1706.08498, preprint.
Bhattacharya D., & Dupas P. (2012). Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics, 167(1), 168–196. https://doi.org/10.1016/j.jeconom.2011.11.007
Bozic I., Reiter J. G., Allen B., Antal T., Chatterjee K., Shah P., Moon Y. S., Yaqubie A., Kelly N., & Le D. T.
(2013). Evolutionary dynamics of cancer in response to targeted combination therapy. Elife, 2, e00747.
https://doi.org/10.7554/eLife.00747
Braveman P. A., Egerter S. A., & Mockenhaupt R. E. (2011). Broadening the focus: The need to address the social
determinants of health. American Journal of Preventive Medicine, 40(1), S4–S18. https://doi.org/10.1016/j.
amepre.2010.10.002
Bubeck S., Eldan R., Lee Y. T., & Mikulincer D. (2020). ‘Network size and weights size for memorization with
two-layers neural networks’, arXiv, arXiv:2006.02855, preprint.
Caspi A., Hariri A. R., Holmes A., Uher R., & Moffitt T. E. (2010). Genetic sensitivity to the environment: The
case of the serotonin transporter gene and its implications for studying complex diseases and traits. American
Journal of Psychiatry, 167(5), 509–527. https://doi.org/10.1176/appi.ajp.2010.09101452
Clifton J., & Laber E. (2020). Q-learning: Theory and applications. Annual Review of Statistics and Its
Application, 7(1), 279–301. https://doi.org/10.1146/statistics.2020.7.issue-1
D’Amour A., Ding P., Feller A., Lei L., & Sekhon J. (2021). Overlap in observational studies with high-
dimensional covariates. Journal of Econometrics, 221(2), 644–654. https://doi.org/10.1016/j.jeconom.
2019.10.014
Dudziński K., & Walukiewicz S. (1987). Exact methods for the knapsack problem and its generalizations. European Journal of Operational Research, 28(1), 3–21. https://doi.org/10.1016/0377-2217(87)90165-2
Forrest G. N., & Tamura K. (2010). Rifampin combination therapy for nonmycobacterial infections. Clinical
Microbiology Reviews, 23(1), 14–34. https://doi.org/10.1128/CMR.00034-09
Friedman J., Hastie T., & Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate
descent. Journal of Statistical Software, 33(1), 1. https://doi.org/10.18637/jss.v033.i01
Fu H., Zhou J., & Faries D. E. (2016). Estimating optimal treatment regimes via subgroup identification in
randomized control trials and observational studies. Statistics in Medicine, 35(19), 3285–3302. https://doi.
org/10.1002/sim.v35.19
Gao H., Korn J. M., Ferretti S., Monahan J. E., Wang Y., Singh M., Zhang C., Schnell C., Yang G., Zhang Y., & Balbin O. A. (2015). High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. Nature Medicine, 21(11), 1318–1325. https://doi.org/10.1038/nm.3954
Golowich N., Rakhlin A., & Shamir O. (2018). Size-independent sample complexity of neural networks. In
Conference on learning theory (pp. 297–299). PMLR.
Goodfellow I., Bengio Y., & Courville A. (2016). Deep learning. MIT Press.
Hastie T., Tibshirani R., Friedman J. H., & Friedman J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (Vol. 2). Springer.
Hidalgo M., Amant F., Biankin A. V., Budinská E., Byrne A. T., Caldas C., Clarke R. B., de Jong S., Jonkers J.,
Mælandsmo G. M., & Roman-Roman S. (2014). Patient-derived xenograft models: An emerging platform for
translational cancer research. Cancer Discovery, 4(9), 998–1013. https://doi.org/10.1158/2159-8290.CD-14-
0001
Holland P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396),
945–960. https://doi.org/10.1080/01621459.1986.10478354
Moodie E. E., Chakraborty B., & Kramer M. S. (2012). Q-learning for estimating optimal dynamic treatment
rules from observational data. Canadian Journal of Statistics, 40(4), 629–645. https://doi.org/10.1002/cjs.
v40.4
Möttönen T., Hannonen P., Leirisalo-Repo M., Nissilä M., Kautiainen H., Korpela M., Laasonen L., Julkunen
H., Luukkainen R., Vuori K., & Paimela L. (1999). Comparison of combination therapy with single-drug
therapy in early rheumatoid arthritis: A randomised trial. The Lancet, 353(9164), 1568–1573. https://doi.
org/10.1016/S0140-6736(98)08513-4
Neyshabur B., Bhojanapalli S., & Srebro N. (2017). 'A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks', arXiv, preprint.
Zhang C., Chen J., Fu H., He X., Zhao Y.-Q., & Liu Y. (2020). Multicategory outcome weighted margin-based
learning for estimating individualized treatment rules. Statistica Sinica, 30(4), 1857–1879. https://doi.org/10.
5705/ss.202017.0527
Zhang C., & Liu Y. (2014). Multicategory angle-based large-margin classification. Biometrika, 101(3), 625–640.
https://doi.org/10.1093/biomet/asu017
Zhang Y., Laber E. B., Tsiatis A., & Davidian M. (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics, 71(4), 895–904. https://doi.org/10.1111/biom.v71.4
Zhao A., & Ding P. (2023). Covariate adjustment in multiarmed, possibly factorial experiments. Journal of the