The Fallacy of Minimizing Local Regret in the Sequential Task Setting

Ziping Xu (Harvard University), Kelly W. Zhang (Columbia University), Susan A. Murphy (Harvard University)

Abstract

In the realm of Reinforcement Learning (RL), online RL is often conceptualized as an optimization problem, where an algorithm interacts with an unknown environment to minimize cumulative regret. In a stationary setting, strong theoretical guarantees, like a sublinear ($\sqrt{T}$) regret bound, can be obtained, which typically implies the convergence to an optimal policy and the cessation of exploration. However, these theoretical setups often oversimplify the complexities encountered in real-world RL implementations, where tasks arrive sequentially with substantial changes between tasks and the algorithm may not be allowed to adaptively learn within certain tasks. We study changes beyond the outcome distributions, encompassing changes in the reward designs (mappings from outcomes to rewards) and the permissible policy spaces. Our results reveal the fallacy of myopically minimizing regret within each task: obtaining optimal regret rates in the early tasks may lead to worse rates in the subsequent ones, even when the outcome distributions stay the same. To realize the optimal cumulative regret bound across all the tasks, the algorithm has to explore more in the earlier tasks than local regret minimization would prescribe. This theoretical insight is practically significant, suggesting that due to unanticipated changes (e.g., rapid technological development or human-in-the-loop involvement) between tasks, the algorithm needs to explore more than it would in the usual stationary setting within each task. This implication resonates with the common practice of using clipped policies in mobile health clinical trials and maintaining a fixed rate of $\epsilon$-greedy exploration in robotic learning.

1 Introduction

Regret minimization has emerged as a central topic of online Reinforcement Learning (RL) research (Auer et al., 2008; Chu et al., 2011; Agrawal and Goyal, 2013; Lattimore and Szepesvári, 2020). In a static environment, prioritizing regret minimization is theoretically sound. A sublinear regret bound directly yields a lower bound on cumulative rewards, and it implies convergence to optimal policies in environments whose suboptimality gap is bounded below. Any algorithm with a sublinear regret bound is designed to cease exploration upon acquiring sufficient information about the underlying environment. Nevertheless, this cessation of exploration, while theoretically ideal, poses challenges in real-world RL implementations.

Real-world RL tasks differ significantly from the static setup in two key aspects. First, many real-world RL tasks arrive sequentially with substantial changes between tasks. This is evident in many applications, including mobile health, where RL algorithms are trained to personalize digital interventions (Liao et al., 2020; Bidargaddi et al., 2020; Trella et al., 2022), and online education, where RL learns an automated pedagogical strategy that optimizes students’ performance (Aleven et al., 2023; Ruan et al., 2023). These applications, along with others such as inventory management (Madeka et al., 2022) and online charitable giving experiments (Athey et al., 2022), often undergo significant changes between tasks. Previous literature often studies changes in outcome distributions (Garivier and Moulines, 2008; Raj and Kalyani, 2017), while our study encompasses changes in reward design, such as those made in healthcare due to emerging side effects, and changes in permissible policy spaces driven by technological advancements and concerns around fairness or privacy. These changes are often the result of high-level human decision-making and are thus unpredictable. Second, in many of these domains, particularly those involving human interactions, it is necessary to deploy static or nonadaptive policies within certain tasks. Though adaptive experimental design has emerged as a powerful tool for collecting data efficiently for subsequent tasks, many funding agencies are more familiar with, or restricted to, static experimental designs (Zanette et al., 2021). Moreover, ethical or safety considerations often mandate the use of nonadaptive approaches (Madeka et al., 2022).

It is prevalent in the field to apply naive regret minimization algorithms within each task (i.e., local regret minimization), often overlooking the sequential nature of real-world problems. A main message of this paper is that unanticipated changes between tasks, such as rapid technological development or human-in-the-loop involvement, necessitate more exploration than is normally sufficient in the stationary setting. Complete exploitation within a task could lead to worse performance in later tasks.

To understand the limitations of local regret minimization, note that sublinear regret often implies the convergence to an optimal policy, leading to complete exploitation. This focus on local regret minimization can result in an online dataset that is inadequate for subsequent tasks, especially in the presence of changes between tasks. We elucidate this concept through Example 1, which involves a sequence of two contextual bandit tasks. The two tasks share the same reward distribution, while the policy spaces are different. In the first task, the algorithm is allowed to make decisions based on the current context, leading to the optimal policy of choosing action $a_{1}$ in context $x_{1}$ and $a_{2}$ in context $x_{2}$. In contrast, the second task has to adopt context-free policies, leading to the optimal policy that takes action $a_{1}$ for all contexts. We further require the algorithm not to learn adaptively in the second task. Therefore, the data collected from the first task is used to estimate an optimal policy to deploy for the second task. The online dataset collected by the first task’s optimal policy lacks visitations in $(x_{1},a_{2})$ and $(x_{2},a_{1})$, hindering good policy learning for the second task. That is, minimizing the regret in the first task leads to worse regret in the second task. Our simulation experiment validates this hypothesis. Figure 2 compares the average regret in the first task and the simple regret in the second task obtained by running the UCB (Upper Confidence Bound) algorithm (Auer, 2002) (in blue) and UCB mixed with random exploration with probability 0.1 (in green) on the tasks described in Example 1. UCB shows a diminishing average regret in the first task with a constant simple regret for the second task. Mixed with 0.1 random exploration, UCB incurs a constant average regret, while the simple regret goes to zero, revealing the expected trade-off between the regrets in the two tasks.

Example 1.

Consider two contextual bandit tasks with two arms and a context space of size two. The two tasks share the same reward distribution, with the mean reward for each context-arm pair reported in the table below. In the first task, the algorithm is allowed to use any policy and can update its policy adaptively. In the second task, the algorithm cannot update its policy and is forced to ignore the context, meaning that the second task is a non-contextual bandit.

Figure 1: Mean reward for Example 1

Mean reward	$x_{1}$	$x_{2}$	Average
$a_{1}$	1	$\epsilon$	$(1+\epsilon)/2$
$a_{2}$	0	1	$1/2$
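The coverage problem in Example 1 can be illustrated with a small simulation. The following is a minimal sketch, not the experiment reported in the paper: it assumes Bernoulli rewards, a uniform context distribution, $\epsilon=0.1$, a horizon of 4000, and a per-context UCB1 index. The names `run_first_task`, `MEAN`, and `EPS` are hypothetical.

```python
import math
import random

EPS = 0.1  # value of epsilon in the table; an assumed setting
MEAN = [[1.0, 0.0], [EPS, 1.0]]  # MEAN[x][a] for contexts x1, x2 and arms a1, a2


def run_first_task(horizon, explore_prob, seed=0):
    """Per-context UCB1 mixed with uniform exploration on Example 1.

    Returns the average regret and the visit counts counts[x][a].
    """
    rng = random.Random(seed)
    counts = [[0, 0], [0, 0]]
    sums = [[0.0, 0.0], [0.0, 0.0]]
    regret = 0.0
    for t in range(1, horizon + 1):
        x = rng.randrange(2)  # uniform context distribution (assumed)

        def ucb(a):
            if counts[x][a] == 0:
                return math.inf  # force one pull of every arm in every context
            bonus = math.sqrt(2.0 * math.log(t) / counts[x][a])
            return sums[x][a] / counts[x][a] + bonus

        if rng.random() < explore_prob:
            a = rng.randrange(2)  # uniform random exploration
        else:
            a = max(range(2), key=ucb)
        reward = 1.0 if rng.random() < MEAN[x][a] else 0.0  # Bernoulli reward
        counts[x][a] += 1
        sums[x][a] += reward
        regret += max(MEAN[x]) - MEAN[x][a]
    return regret / horizon, counts


avg_ucb, counts_ucb = run_first_task(4000, 0.0)
avg_mix, counts_mix = run_first_task(4000, 0.1)
print("UCB only:      avg regret", round(avg_ucb, 4), "visits", counts_ucb)
print("UCB + 0.1 mix: avg regret", round(avg_mix, 4), "visits", counts_mix)
```

Under these assumptions, plain UCB visits $(x_{1},a_{2})$ only logarithmically often, whereas the mixed variant keeps visiting every context-arm pair at a linear rate and pays for it with a higher average regret in the first task.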

We rigorously characterize this trade-off between local regret minimization and global regret minimization when there are changes in reward design and/or the policy space between tasks. We formulate a sequential contextual bandit framework with a discrete context space ${\mathcal{X}}$ and action space ${\mathcal{A}}$. Let $\operatorname{Reg}_{i}$ and $T_{i}$ denote the cumulative regret and number of time-steps in task $i$, respectively. We begin our analysis with a two-task scenario. Theorem 1 establishes a general condition on the changes under which the minimax rate satisfies $\sqrt{\mathbb{E}[\operatorname{Reg}_{1}]}(\mathbb{E}[\operatorname{Reg}_{2}]/T_{2})=\Omega(1)$ when the algorithm is not allowed to be adaptive within the second task. We show that a simple strategy that mixes any no-regret online algorithm with random exploration attains the minimax rate that is optimal in $T_{1}$ and $T_{2}$. Our results show that to minimize the combined regret $\mathbb{E}[\operatorname{Reg}_{1}+\operatorname{Reg}_{2}]$, an algorithm should aim for $\mathbb{E}[\operatorname{Reg}_{1}]=\Theta(T_{1}^{2/3})$ when $T_{1}=T_{2}$, which requires excess exploration compared with the optimal local regret rate $T_{1}^{1/2}$. Subsequent case studies explore how these conditions are satisfied in the presence of nonstationarity in the policy space and reward function.

We continue with three extensions of the two-task scenario. We first extend the results to the case of multiple tasks, showing that the minimax rate satisfies $\sqrt{\sum_{j=1}^{i-1}\mathbb{E}[\operatorname{Reg}_{j}]}(\mathbb{E}[\operatorname{Reg}_{i}]/T_{i})=\Omega(1)$ simultaneously for all tasks $i$ within which adaptivity is not allowed. The maximal sequence length for which this lower bound is valid across all adjacent tasks is determined by the product $|{\mathcal{X}}|\times|{\mathcal{A}}|$. We further extend our discussion to the nonlinear case, where we show that such a sequence of tasks can be exponentially long, suggesting a strong motivation to apply excess exploration in real-world RL applications. The third extension considers a more prevalent notion of changes between tasks: nonstationarity in reward distributions. We show that under a new notion of robust simple regret, the aforementioned tension between local regret minimization and global regret minimization still holds when there are changes in reward distribution.

1.1 Related Work

In a contextual bandit setting, the simple regret is simply the expected cumulative regret divided by the length of the horizon when the policy is fixed. Our two-task case study extends the existing literature on optimizing both cumulative regret and simple regret by introducing non-stationarity. We denote the cumulative regret by $\operatorname{CR}$ and the simple regret by $\operatorname{SR}$. In a multi-armed bandit setting, Bubeck et al. (2011) show a trade-off of $\operatorname{SR}\times\exp(D\cdot\operatorname{CR})\geq\Delta/2$, where $\Delta$ is the minimal gap and $D$ is a constant. This bound is substantially weaker than our bound as $\Delta\leq 1$. Krishnamurthy et al. (2023) study this trade-off under the standard contextual bandit setting, where they show that any learning algorithm that achieves a worst-case $\mathcal{O}(\sqrt{\phi/T})$ simple regret bound has a lower bounded minimax rate of $\sqrt{A^{2}T/\phi}$ for cumulative regret. Beyond the standard stationary setting, Simchi-Levi and Wang (2023) study the trade-off between cumulative regret and the statistical power of inferring the treatment effect in a stationary multi-armed bandit problem. Similar results are shown in Gao et al. (2022) and Dai et al. (2023). They show a similar minimax lower bound, $\sqrt{\operatorname{CR}}\times\text{inference error}=\Omega(1)$. Qin and Russo (2023) study the multi-armed bandit problem with a different cost of exploration for different arms. This is equivalent to having a reward function that differs from the one used for calculating simple regret. From an empirical point of view, Athey et al. (2022) propose the TreeBagging algorithm, which controls the level of exploration in online charitable giving. They observe that the uniform randomization algorithm learned a policy with the lowest simple regret while receiving the highest cumulative regret over a sequence of 10 implementations.

Our framework can also be seen as a generalization of the previous nonstationary bandit setting. Previous works on non-stationary bandits consider only changes in reward distribution (Garivier and Moulines, 2008; Raj and Kalyani, 2017; Wu et al., 2018; Kim and Tewari, 2020; Hong et al., 2023). We introduce a broader definition of non-stationarity that includes changes in the reward function and policy space. When the reward distribution does change, we propose a new metric named robust simple regret to accommodate the fact that no algorithm can minimize simple regret to 0 even with infinite data, which is new in the literature (Section 5).

A main message of this paper is that the RL algorithm should never stop exploring or learning when there is non-stationarity. Conceptually, this is also the main characteristic of a continual learning agent (Abel et al., 2023). The literature on continual learning has primarily focused on learning agents that never forget (Lange et al., 2019; Peng et al., 2023; Wang et al., 2023). However, in an RL context, the algorithm should strategically forget previously learned knowledge when the outcome distribution changes drastically and maintain the right level of continual exploration to obtain new information (Khetarpal et al., 2020).

2 Problem Setups

Notations.

For a set ${\mathcal{X}}$, we denote by $\Delta({\mathcal{X}})$ the set of all distributions over ${\mathcal{X}}$. For a positive integer $N$, we let $[N]=\{1,2,\dots,N\}$. We use ${\mathcal{O}}(\cdot)$, $\Theta(\cdot)$, and $\Omega(\cdot)$ to denote the big-$O$, big-Theta, and big-Omega notations. For a vector $\nu\in{\mathbb{R}}^{d}$, we let $[\nu]_{i}$ be the $i$-th element of the vector. We denote by $D_{\operatorname{KL}}(P\mid Q)$ the KL divergence between two probability measures $P$ and $Q$ with $P\ll Q$.

Sequential multitask contextual bandit framework.

We consider learning on a sequence of $N$ contextual bandit tasks. Different from the standard online contextual bandit setting, each task is a contextual bandit with rich observations and potentially restricted policy space. Specifically, we define each task $i\in[N]$ as a tuple of $({\mathcal{X}},{\mathcal{A}},{\mathcal{Y}},P,\Pi^{(i)},f^{(i)})$ , where the shared elements are the context space ${\mathcal{X}}$ , the action space ${\mathcal{A}}$ , the potentially high-dimensional outcome space ${\mathcal{Y}}$ , and the outcome distribution $P:{\mathcal{X}}\times{\mathcal{A}}\mapsto\Delta({\mathcal{Y}})$ . The task-specific elements include the restricted policy space $\Pi^{(i)}\subset{\mathcal{X}}\mapsto\Delta({\mathcal{A}})$ that is a collection of mappings from the current context to a distribution over the action space. We also allow different tasks to have different reward functions $f^{(i)}:{\mathcal{Y}}\mapsto[0,1]$ . For simplicity, we assume that there is a fixed context distribution $P_{X}$ across all tasks. Since $({\mathcal{X}},{\mathcal{A}},{\mathcal{Y}})$ are assumed to be shared across different tasks, we denote a task setup by $S^{(i)}=(\Pi^{(i)},f^{(i)},P)$ , when $({\mathcal{X}},{\mathcal{A}},{\mathcal{Y}})$ are clear from the context. Note that the restricted policy spaces and the reward functions are assumed to be known by the agent before interacting with the environment, while the outcome distribution $P$ is unknown and has to be learned.

The agent interacts with each task $i$ for $T_{i}$ steps. At the step $t\in[T_{i}]$ , the agent observes context $X_{i,t}\in{\mathcal{X}}$ , and decides a policy $\pi_{i,t}\in\Pi^{(i)}$ . Then the agent samples an action $A_{i,t}\sim\pi_{i,t}(X_{i,t})$ and receives a feedback vector $Y_{i,t}\sim P(\cdot\mid X_{i,t},A_{i,t})$ . The goal of the agent for task $i$ is to maximize the cumulative rewards $\sum_{t=1}^{T_{i}}f^{(i)}(Y_{i,t})$ . The optimal policy of a task $S=(\Pi,f,P)$ is given by

\pi^{\star}_{S}\in\operatorname*{arg\,max}_{\pi\in\Pi}\mathbb{E}_{X\sim P_{X}}\mathbb{E}_{A\sim\pi(X)}\mathbb{E}_{Y\sim P(\cdot\mid X,A)}f(Y).

Note that we consider the same outcome distribution at this moment to focus on the study of changes in policy spaces and reward functions. In Section 5, we extend our discussion to a sequence of tasks where the underlying outcome distribution could shift between tasks.

Motivations for changes between tasks.

The setups for different tasks may change in various ways. We provide the following motivation examples on the change of $(\Pi^{(i)},f^{(i)})$ across tasks.

1. Different tasks may set up different reward functions $f^{(i)}$. For instance, in study one, we are only interested in maximizing cumulative rewards $\sum_{t=1}^{T_{1}}R_{1,t}$, where $R_{1,t}=f^{(1)}(Y_{1,t})\equiv[Y_{1,t}]_{1}$. At the end of the first task, the domain expert may decide that $[Y_{1,t}]_{2}$ is a significant side effect that should be controlled. Therefore, they propose $f^{(2)}(Y)=[Y]_{1}-\alpha[Y]_{2}$ for the second task.

2. Different tasks may set up different target policy classes $\Pi^{(i)}$. In practice, the context can be a high-dimensional vector, due to the complexity of real-world observations. In the first task, we may not have the computational resources to maximize over the space of all policies that take context into account. Hence, the domain expert decides to optimize only in the space of context-independent policies $\Pi^{(1)}=\{\pi:\pi(a\mid x_{1})=\pi(a\mid x_{2}),\text{ for all }x_{1},x_{2}\in{\mathcal{X}}\}$. In the second task, with evidence accumulating, the expert decides that certain components of the context are relevant to the task and should be included in the new policies. In the reverse case, the domain expert may decide that certain features are irrelevant or raise fairness concerns and should be removed from the feature space.

2.1 Performance metric

The cumulative regret of task $i$ is given by Definition 1. The goal of an agent that learns on the sequence of $N$ tasks is to minimize the sum of cumulative regrets $\sum_{i=1}^{N}\operatorname{Reg}_{i}$ over all the tasks. We refer to the cumulative regret of each task $i$ as the local regret and the cumulative regret over all the tasks as the global regret. Throughout the paper, we discuss the tension between the local regret and the global regret under the aforementioned changes in task setups.

Definition 1 (Cumulative regret).

Denote the mean reward of task $i$ given context $x$ and action $a$ by $R^{(i)}(x,a)\coloneqq\mathbb{E}_{Y\sim P(\cdot\mid x,a)}f^{(i)}(Y)$ . The cumulative regret within task $i$ is defined by

\operatorname{Reg}_{i}\coloneqq\sum_{t=1}^{T_{i}}\left[\max_{\pi\in\Pi^{(i)}}\mathbb{E}_{X_{i,t}\sim P_{X}}\mathbb{E}_{A\sim\pi(X_{i,t})}[R^{(i)}(X_{i,t},A)-R^{(i)}(X_{i,t},A_{i,t})]\right].

A significant part of the paper discusses tasks in which the algorithm is not allowed to learn adaptively, thereby requiring a commitment to a fixed policy for the entirety of each such task. In such scenarios, minimizing the cumulative regret is equivalent to minimizing the simple regret defined in Definition 2.

Definition 2 (Simple regret).

Define the simple regret of policy $\pi$ given a task $S=(\Pi,f,P)$ with mean reward function $R(x,a)\coloneqq\mathbb{E}_{Y\sim P(\cdot\mid x,a)}f(Y)$ by

\operatorname{SR}_{S}(\pi\mid x)\coloneqq\sum_{a}\pi^{\star}_{S}(a\mid x)R(x,a)-\sum_{a}\pi(a\mid x)R(x,a),\text{ and }\operatorname{SR}_{S}(\pi)\coloneqq\sum_{x}P_{X}(x)\operatorname{SR}_{S}(\pi\mid x).

Note that $\operatorname{SR}_{S}(\pi\mid x)$ can potentially be negative depending on how the policy space is defined. However, $\operatorname{SR}_{S}(\pi)\geq 0$ for any policy $\pi\in\Pi$ .
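The sign remark above can be checked on Example 1 with a context-free policy class. The sketch below is illustrative only: it assumes $\epsilon=0.1$ and a uniform $P_{X}$, and the helper names (`policy_value`, `simple_regret_given_x`, `simple_regret`) are hypothetical.

```python
def policy_value(policy, R, x):
    """Expected mean reward of `policy` in context x, with policy[x][a] = pi(a | x)."""
    return sum(policy[x][a] * R[x][a] for a in range(len(R[x])))


def simple_regret_given_x(policy, optimal, R, x):
    """SR_S(pi | x): value gap to the optimal policy in context x (may be negative)."""
    return policy_value(optimal, R, x) - policy_value(policy, R, x)


def simple_regret(policy, optimal, R, p_x):
    """SR_S(pi): context-averaged simple regret."""
    return sum(p_x[x] * simple_regret_given_x(policy, optimal, R, x)
               for x in range(len(p_x)))


EPS = 0.1                       # assumed value of epsilon
R = [[1.0, 0.0], [EPS, 1.0]]    # R[x][a] from Example 1
P_X = [0.5, 0.5]                # uniform contexts (assumed)
ALWAYS_A1 = [[1.0, 0.0], [1.0, 0.0]]  # optimal among context-free policies when EPS > 0
ALWAYS_A2 = [[0.0, 1.0], [0.0, 1.0]]

sr_x1 = simple_regret_given_x(ALWAYS_A2, ALWAYS_A1, R, 0)  # 1 - 0
sr_x2 = simple_regret_given_x(ALWAYS_A2, ALWAYS_A1, R, 1)  # EPS - 1, negative
sr = simple_regret(ALWAYS_A2, ALWAYS_A1, R, P_X)           # EPS / 2
print(sr_x1, sr_x2, sr)
```

Here the conditional simple regret in $x_{2}$ is negative because the context-free optimum is not optimal in that context, while the averaged simple regret stays non-negative.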

Definition 3 (Occupancy measure).

We define $\mu_{\pi}(x,a)\coloneqq\mathbb{E}_{X\sim P_{X},A\sim\pi}\mathbbm{1}_{\{X=x,A=a\}}$ as the occupancy measure of a running policy $\pi$ . Note that since we assume the same context distribution $P_{X}$ , occupancy measure of a policy $\pi$ is task independent.

2.2 Learning algorithms

This section is motivated by the practical considerations that some tasks may require nonadaptive algorithms. We introduce a formal definition to clarify what constitutes an algorithm and its nonadaptive nature in this context.

Let $\tau_{i,t}\in({\mathcal{X}},{\mathcal{A}},{\mathcal{Y}})^{\star}$ denote the sequence of observations up to the $t$ -th step in task $i$ . A learning algorithm $L$ is a mapping from the previous observations to a policy in $\Pi^{(i)}$ at any given step $t$ in task $i$ . The agent running algorithm $L$ randomly samples an action $A_{i,t}\sim[L(\tau_{i,t})](X_{i,t})$ . Furthermore, an algorithm is said to be nonadaptive in task $i$ , if it is nonadaptive to any new data collected during task $i$ as described in Definition 4.

Definition 4 (Non-adaptive algorithm).

We call an algorithm non-adaptive in task $i$ if for all steps $t_{1}$ , $t_{2}$ within a task, where $t_{1}<t_{2}\in[T_{i}]$ , and for all sequences of observations $\tau_{i,t_{1}}$ and $\tau_{i,t_{2}}$ , such that $\tau_{i,t_{1}}$ is a prefix of $\tau_{i,t_{2}}$ , the algorithm satisfies $L(\tau_{i,t_{1}})=L(\tau_{i,t_{2}}).$

Denote by ${\mathcal{L}}_{i}$ the set of all learning algorithms that are nonadaptive in task $i$ . For a set of task indices ${\mathcal{I}}$ , we denote by ${\mathcal{L}}_{{\mathcal{I}}}\coloneqq\cap_{i\in{\mathcal{I}}}{\mathcal{L}}_{i}$ the set of algorithms that are simultaneously nonadaptive on all tasks in ${\mathcal{I}}$ .
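Definition 4 can be made concrete with a small sketch. The encoding below is an assumption made for illustration: the history is split into data from earlier tasks and data from the current task, observations are (context, action, reward) triples, and policies are deterministic. The names `greedy`, `freeze`, and `is_nonadaptive` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Observation = Tuple[int, int, float]  # (context, action, reward), a simplification of (x, a, y)
Policy = Tuple[int, ...]              # deterministic policy: one action index per context
Algorithm = Callable[[Sequence[Observation], Sequence[Observation]], Policy]

N_CONTEXTS, N_ACTIONS = 2, 2


def greedy(past: Sequence[Observation], current: Sequence[Observation]) -> Policy:
    """Adaptive rule: act greedily on empirical mean rewards from all data so far."""
    total = [[0.0] * N_ACTIONS for _ in range(N_CONTEXTS)]
    count = [[0] * N_ACTIONS for _ in range(N_CONTEXTS)]
    for x, a, r in list(past) + list(current):
        total[x][a] += r
        count[x][a] += 1

    def mean(x: int, a: int) -> float:
        return total[x][a] / count[x][a] if count[x][a] else 0.0

    return tuple(max(range(N_ACTIONS), key=lambda a: mean(x, a))
                 for x in range(N_CONTEXTS))


def freeze(algorithm: Algorithm) -> Algorithm:
    """Make an algorithm nonadaptive in the current task by hiding within-task data."""
    def frozen(past: Sequence[Observation], current: Sequence[Observation]) -> Policy:
        return algorithm(past, [])
    return frozen


def is_nonadaptive(algorithm: Algorithm, past, current) -> bool:
    """Check L(tau_{t1}) = L(tau_{t2}) for every pair of prefixes of one trajectory."""
    policies = [algorithm(past, current[:t]) for t in range(len(current) + 1)]
    return all(p == policies[0] for p in policies)


past_data = [(0, 0, 0.4), (0, 1, 0.0), (1, 0, 0.0), (1, 1, 1.0)]
current_data = [(0, 1, 1.0), (0, 1, 1.0)]
print(is_nonadaptive(greedy, past_data, current_data))          # greedy reacts to new data
print(is_nonadaptive(freeze(greedy), past_data, current_data))  # frozen version does not
```

Note that the check inspects a single trajectory, whereas Definition 4 quantifies over all sequences of observations, so a `True` result here is only a necessary condition.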

3 Results on Two Tasks

We commence with a two-task case, where the first task involves running an online learning algorithm focused on minimizing the regret within the task. In the second task, we employ a fixed policy that is learned offline from the dataset collected in the first task. This scenario is closely related to the contextual bandit setting in which the algorithm aims to minimize cumulative regret and simple regret simultaneously, with the difference that we allow changes between tasks.

In this section, we show that there is a trade-off between the regrets in the two tasks under various changes in task setups. The trade-off is substantially stronger than in the cases without changes. In fact, this stronger trade-off has been shown in some special cases. For instance, in a multi-armed bandit setting, Simchi-Levi and Wang (2023) simultaneously minimize the cumulative regret and the average treatment effect (ATE) estimation error of the worst arm. This setting is inherently analogous to ours, since the ATEs of all the arms are essential for achieving a low simple regret under arbitrary changes in the policy space. They show that the product of the cumulative regret and the square of the worst-case estimation error is lower bounded by a constant in a minimax sense. We prove a similar lower bound in a more general case with a wider range of changes in $\Pi$ and $f$. Our results provide a more comprehensive view of this tension between cumulative regret and simple regret.

We denote an instance by ${\bm{S}}=(S^{(1)},S^{(2)})$ , a sequence of two tasks. For an instance ${\bm{S}}$ , we denote by ${\mathcal{S}}({\bm{S}})$ the set of all instances that share the same policy spaces and reward functions $(\Pi^{(1)},\Pi^{(2)},f^{(1)},f^{(2)})$ , while having different $P$ . For some instance set ${\mathcal{S}}$ , we study the following minimax multi-objective optimization problem:

\inf_{L\in{\mathcal{L}}_{2}}\sup_{{\bm{S}}\in{\mathcal{S}}}\left(\mathbb{E}[\operatorname{Reg}_{1}],\mathbb{E}[\operatorname{Reg}_{2}]\right).

(1)

In order to show a strong lower bound for the above multi-objective problem, we need the instance set ${\mathcal{S}}$ to be adequately rich. Theorem 1 provides general conditions on ${\mathcal{S}}$ that characterize a strong trade-off between the cumulative regrets in the two tasks.

Theorem 1.

Assume the instance set ${\mathcal{S}}$ is sufficiently large to ensure the existence of an instance ${\bm{S}}=(S^{(1)},S^{(2)})$ such that for all $\epsilon\in[0,1/4]$, we can find some $\bar{{\bm{S}}}=(\bar{S}^{(1)},\bar{S}^{(2)})\in{\mathcal{S}}({\bm{S}})\cap{\mathcal{S}}$ satisfying the following conditions:

1. There exists a unique optimal policy $\pi^{\star}_{S}$ for each $S\in\{S^{(1)},\bar{S}^{(1)},S^{(2)},\bar{S}^{(2)}\}$;

2. $\mu_{\pi_{S^{(1)}}^{\star}}(x_{\star},a_{\star})=\mu_{\pi_{\bar{S}^{(1)}}^{\star}}(x_{\star},a_{\star})=0$ and $\min\{\operatorname{SR}_{S^{(1)}}(\pi),\operatorname{SR}_{\bar{S}^{(1)}}(\pi)\}\geq c_{1}\mu_{\pi}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(1)}$;

3. $\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})-\mu_{{\pi}^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})=c_{2}>0$;

4. $\operatorname{SR}_{S^{(2)}}(\pi)\geq\epsilon(\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})-\mu_{\pi}(x_{\star},a_{\star}))/2$ for all $\pi\in\Pi^{(2)}$;

5. $\operatorname{SR}_{\bar{S}^{(2)}}(\pi)\geq\epsilon(\mu_{\pi}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star}))/2$ for all $\pi\in\Pi^{(2)}$;

6. The outcome distributions $P$ of ${\bm{S}}$ and $\bar{P}$ of $\bar{{\bm{S}}}$ differ only in $(x_{\star},a_{\star})$, and $D_{\operatorname{KL}}(P(\cdot\mid x_{\star},a_{\star})\mid\bar{P}(\cdot\mid x_{\star},a_{\star}))\leq\epsilon^{2}$,

where $c_{1},c_{2}>0$ are constants and $(x_{\star},a_{\star})\in{\mathcal{X}}\times{\mathcal{A}}$ is some context-action pair. Then we have the following lower bound:

\inf_{L\in{\mathcal{L}}_{2}}\sup_{{\bm{S}}\in{\mathcal{S}}}\sqrt{\mathbb{E}\left[\operatorname{Reg}_{1}\right]}\mathbb{E}\left[\operatorname{Reg}_{2}\right]/T_{2}=\Omega\left(c_{1}c_{2}\right).

Discussion of Theorem 1.

For sufficiently large $T_{1}$, a no-regret online algorithm tends to converge to the optimal policy (assuming there exists a unique one and the suboptimality gap is bounded below). The dataset collected during online learning in the first task may not be sufficient for offline policy optimization in the second task when $\max_{x,a}\mu_{\pi^{\star}_{S^{(2)}}}(x,a)/\mu_{\pi^{\star}_{S^{(1)}}}(x,a)=\infty$. This connects closely to the offline learning literature, where it has been shown that offline learning is fundamentally hard if the single-policy concentrability is unbounded (Chen and Jiang, 2019). Single-policy concentrability is the ratio of the occupancy measure of the optimal policy to that of the behavior policy that collects the offline dataset. Condition 2 guarantees that regret is incurred whenever $(x_{\star},a_{\star})$ is visited in the first task, and Conditions 3, 4, and 5 guarantee that it is necessary to visit $(x_{\star},a_{\star})$ to distinguish between $S^{(2)}$ and $\bar{S}^{(2)}$, which creates the tension between the regrets of the two tasks.
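The concentrability ratio in this discussion can be computed directly for the policies of Example 1. This is a minimal sketch under assumed conventions: a uniform $P_{X}$, pairs that the target policy never visits are skipped, and a ratio with zero denominator is reported as infinity. The function names are hypothetical.

```python
import math


def occupancy(policy, p_x):
    """mu[x][a] = P_X(x) * pi(a | x)."""
    return [[p_x[x] * policy[x][a] for a in range(len(policy[x]))]
            for x in range(len(p_x))]


def concentrability(target, behavior, p_x):
    """max over (x, a) of mu_target / mu_behavior, skipping pairs the target never visits."""
    mu_t, mu_b = occupancy(target, p_x), occupancy(behavior, p_x)
    worst = 0.0
    for x in range(len(p_x)):
        for a in range(len(target[x])):
            if mu_t[x][a] == 0.0:
                continue
            ratio = math.inf if mu_b[x][a] == 0.0 else mu_t[x][a] / mu_b[x][a]
            worst = max(worst, ratio)
    return worst


P_X = [0.5, 0.5]
PI_TASK1 = [[1.0, 0.0], [0.0, 1.0]]  # first-task optimum: a1 in x1, a2 in x2
PI_TASK2 = [[1.0, 0.0], [1.0, 0.0]]  # second-task optimum: always a1

ALPHA = 0.1  # illustrative exploration rate (an assumption)
PI_MIXED = [[(1 - ALPHA) * p + ALPHA / 2 for p in row] for row in PI_TASK1]

print(concentrability(PI_TASK2, PI_TASK1, P_X))  # unbounded: (x2, a1) is never visited
print(concentrability(PI_TASK2, PI_MIXED, P_X))  # finite once exploration is mixed in
```

Mixing in uniform exploration at rate $\alpha$ caps the ratio at roughly $2/\alpha$ in this two-arm example, which is the mechanism the upper bound in Section 3.2 relies on.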

3.1 Case studies on Theorem 1 application

In this section, we explore three case studies of potential practical interest to demonstrate how the conditions specified in Theorem 1 can be satisfied by changes in $\Pi$ and $f$. To ease the demonstration, we consider a uniform distribution $P_{X}$ over the context space ${\mathcal{X}}$. The first two cases are two-task contextual bandit problems with ${\mathcal{A}}=\{a_{1},a_{2}\}$ and ${\mathcal{X}}=\{x_{1},x_{2}\}$. The third case is an MAB problem with ${\mathcal{A}}=\{a_{1},a_{2}\}$.

Case I: adding a new feature.

In real-world implementations, some features that were excluded from the input may be added back in later tasks. To conceptualize this, we consider $\Pi^{(1)}$ as the set of policies where decision-making is independent of the current features, formalized as $\Pi^{(1)}=\{\pi:\pi(\cdot\mid x_{1})=\pi(\cdot\mid x_{2}),\text{ for all }x_{1},x_{2}\in{\mathcal{X}}\}$. Conversely, let $\Pi^{(2)}$ encompass all possible policies. The mean rewards induced by the outcome distributions $P$ and $\bar{P}$ are given in Table 2. For all positive $\epsilon$, $\pi^{\star}_{S^{(1)}}(a_{1}\mid\cdot)\equiv 1$. Any policy with a non-zero occupancy measure on $(x_{2},a_{2})$ has the same occupancy measure on $(x_{1},a_{2})$, leading to a simple regret of at least $1/2-\epsilon$ per unit of probability placed on $a_{2}$. However, the optimal policies of $S^{(2)}$ and $\bar{S}^{(2)}$ disagree on $x_{2}$. To distinguish $\bar{S}$ from $S$, the algorithm is forced to visit $(x_{2},a_{2})$ in the first task. The conditions in Theorem 1 are satisfied with $c_{1}=1/2-\epsilon$ and $c_{2}=1/2$.

Table 1: Reward tables for different case studies

Table 2: Case I

$R$	$x_{1}$	$x_{2}$	$\bar{R}$	$x_{1}$	$x_{2}$
$a_{1}$	$1-\epsilon$	$1-\epsilon$	$a_{1}$	$1-\epsilon$	$1-\epsilon$
$a_{2}$	0	1	$a_{2}$	0	$1-2\epsilon$

Table 3: Case II

$R$	$x_{1}$	$x_{2}$	$\bar{R}$	$x_{1}$	$x_{2}$
$a_{1}$	1	0	$a_{1}$	1	$\epsilon$
$a_{2}$	$\epsilon/2$	1	$a_{2}$	$\epsilon/2$	1

Table 4: Case III

	$R^{(1)}$	$R^{(2)}$	$\bar{R}^{(1)}$	$\bar{R}^{(2)}$
$a_{1}$	1	0.5	1	0.5
$a_{2}$	0.8	$0.5-\epsilon/2$	0.8	$0.5+\epsilon/2$

Case II: removing an old feature.

Some features may have to be removed over a sequence of implementations due to potential ethical issues. To model this, we let $\Pi^{(1)}$ be the set of all policies and $\Pi^{(2)}=\{\pi:\pi(\cdot\mid x_{1})=\pi(\cdot\mid x_{2}),\text{ for all }x_{1},x_{2}\in{\mathcal{X}}\}$. Table 3 shows a pair of instances where the optimal policies for the first task always select action $a_{1}$ under context $x_{1}$ and action $a_{2}$ under context $x_{2}$. Any non-zero occupancy measure on $(x_{2},a_{1})$ induces an instant regret of at least $1-\epsilon$ for both $S^{(1)}$ and $\bar{S}^{(1)}$. Nevertheless, the optimal actions for $S^{(2)}$ and $\bar{S}^{(2)}$ are $a_{2}$ and $a_{1}$, respectively, and the first task must cover $(x_{2},a_{1})$ to distinguish between $S$ and $\bar{S}$. The conditions in Theorem 1 are satisfied with $c_{1}=1-\epsilon$ and $c_{2}=1/2$.

Case III: change of reward function.

The reward functions may change over a sequence of implementations. For instance, let the outcome be $Y_{i,t}=(R_{i,t},W_{i,t})$, where $R_{i,t}$ is the primary outcome we intend to maximize and $W_{i,t}$ is a potential side effect. In the first implementation, we aim to maximize the primary outcome $R_{i,t}$ with reward function $f^{(1)}(Y_{1,t})=R_{1,t}$. However, the domain expert may realize that $W_{i,t}$ is a strong side effect that should be controlled in the second task, and thus sets the reward function to $f^{(2)}(Y_{2,t})=R_{2,t}-\alpha W_{2,t}$. With this motivation, we construct a pair of multi-armed bandit instances with no context. Table 4 shows the mean reward of different arms under different combinations of $f^{(1)}$, $f^{(2)}$, $P$ and $\bar{P}$. In the first task, a regret of $0.2$ is incurred whenever $a_{2}$ is pulled, while pulling $a_{2}$ sufficiently often is necessary to distinguish $P$ from $\bar{P}$. It can be verified that the conditions in Theorem 1 hold with $c_{1}=0.2$ and $c_{2}=1$.

A detailed verification of how the three cases satisfy the conditions in Theorem 1 is deferred to Appendix C.
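As a quick numerical sanity check of the $c_{2}$ values quoted in the three cases (not a substitute for the verification in Appendix C), the sketch below computes the optimal deterministic policies from the reward tables and compares their occupancy at $(x_{\star},a_{\star})$. It assumes $\epsilon=0.1$ and a uniform $P_{X}$, and it reports the absolute occupancy gap, since the sign depends on which instance is labeled $S$. All function names are hypothetical.

```python
EPS = 0.1  # illustrative value of epsilon (an assumption)


def optimal_policy(R, context_free):
    """Optimal deterministic policy (action index per context) under uniform P_X."""
    n_x, n_a = len(R), len(R[0])
    if context_free:
        best = max(range(n_a), key=lambda a: sum(R[x][a] for x in range(n_x)))
        return [best] * n_x
    return [max(range(n_a), key=lambda a: R[x][a]) for x in range(n_x)]


def occupancy(policy, x, a):
    """mu_pi(x, a) for a deterministic policy under uniform contexts."""
    return 1.0 / len(policy) if policy[x] == a else 0.0


def occupancy_gap(R, R_bar, context_free, x_star, a_star):
    """|mu_{pi*_S}(x*, a*) - mu_{pi*_Sbar}(x*, a*)| within one policy class."""
    pi = optimal_policy(R, context_free)
    pi_bar = optimal_policy(R_bar, context_free)
    return abs(occupancy(pi, x_star, a_star) - occupancy(pi_bar, x_star, a_star))


# Case I: R[x][a]; first task is context-free, second task allows all policies.
R_I = [[1 - EPS, 0.0], [1 - EPS, 1.0]]
R_I_BAR = [[1 - EPS, 0.0], [1 - EPS, 1 - 2 * EPS]]
case1_task1 = occupancy(optimal_policy(R_I, True), 1, 1)     # Condition 2: zero
case1_c2 = occupancy_gap(R_I, R_I_BAR, False, 1, 1)          # pair (x2, a2)

# Case II: first task allows all policies, second task is context-free.
R_II = [[1.0, EPS / 2], [0.0, 1.0]]
R_II_BAR = [[1.0, EPS / 2], [EPS, 1.0]]
case2_task1 = occupancy(optimal_policy(R_II, False), 1, 0)   # Condition 2: zero
case2_c2 = occupancy_gap(R_II, R_II_BAR, True, 1, 0)         # pair (x2, a1)

# Case III: no context; second-task mean rewards under P and P-bar.
R_III = [[0.5, 0.5 - EPS / 2]]
R_III_BAR = [[0.5, 0.5 + EPS / 2]]
case3_c2 = occupancy_gap(R_III, R_III_BAR, False, 0, 1)      # arm a2

print(case1_c2, case2_c2, case3_c2)
```

The printed gaps agree with the constants $c_{2}=1/2$, $1/2$, and $1$ stated for Cases I, II, and III.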

3.2 Optimal level of exploration

As implied by Theorem 1, any algorithm that achieves an optimal rate in $\operatorname{Reg}_{1}$ is suboptimal in $\operatorname{Reg}_{2}$. To trade off between the two goals, the algorithm needs to employ additional exploration in the first task. In this section, we characterize the optimal level of additional exploration in different regimes of $T_{1}$ and $T_{2}$. Since we primarily focus on the role of horizons, we omit the dependence on $|{\mathcal{X}}|$ and $|{\mathcal{A}}|$ throughout the discussion in this section.

Recall that our primary goal is to minimize global regret, that is, the sum of the cumulative regrets of the two tasks. Proposition 1 gives a minimax lower bound for this sum that is the maximum of three terms: $T_{2}/\sqrt{T_{1}}$, $T_{2}^{2/3}$ and $\sqrt{T_{1}}$. The $T_{2}/\sqrt{T_{1}}$ term corresponds to the case where $\mathbb{E}[\operatorname{Reg}_{2}]$ dominates $\mathbb{E}[\operatorname{Reg}_{1}]$, and the minimax rate of simple regret in the second task for any dataset collected during the first task is $T_{2}/\sqrt{T_{1}}$. The second term corresponds to the rate characterized by Theorem 1. The last term, $\sqrt{T_{1}}$, is the minimax rate of regret minimization in the first task. This corresponds to the case where $\mathbb{E}[\operatorname{Reg}_{1}]$ dominates $\mathbb{E}[\operatorname{Reg}_{2}]$.

Proposition 1.

Following the same conditions on the instance set ${\mathcal{S}}$ as in Theorem 1, the following minimax lower bound holds

\inf_{L_{2}\in{\mathcal{L}}}\sup_{S\in{\mathcal{S}}}\mathbb{E}[\operatorname{Reg}_{1}+\operatorname{Reg}_{2}]=\Omega\left(\max\left\{\frac{T_{2}}{\sqrt{T_{1}}},T_{2}^{2/3},\sqrt{T_{1}}\right\}\right).

(2)

We show that a simple algorithm that mixes a minimax-optimal online learning algorithm with a purely random exploration has upper bounded global regret that matches the lower bound in Theorem 1 up to a factor of $|{\mathcal{X}}||{\mathcal{A}}|$ . This also allows us to achieve any point on the Pareto frontier up to a factor of $|{\mathcal{X}}||{\mathcal{A}}|$ . The parameter $\alpha$ controls the level of additional exploration in the first task.

Theorem 2.

Let $L_{0}$ be an online learning algorithm with a regret bound of $\mathcal{O}(\sqrt{|{\mathcal{X}}||{\mathcal{A}}|T_{1}})$ on the first task. Let the algorithm for the first task be $L_{\alpha}(\tau)=(1-\alpha)L_{0}(\tau)+\alpha\pi_{0}$, where $\tau$ denotes the past observations and $\pi_{0}$ is the uniform random policy. For any choice of $\alpha\in[{|{\mathcal{X}}||{\mathcal{A}}|}/{\sqrt{T_{1}}},1]$, there exists an offline learning algorithm for the second task such that

\mathbb{E}[\operatorname{Reg}_{1}]={\mathcal{O}}\left(\alpha T_{1}\right),\quad\text{and}\quad\mathbb{E}[\operatorname{Reg}_{2}/T_{2}]={\mathcal{O}}\left(\sqrt{\frac{(|{\mathcal{X}}||{\mathcal{A}}|)^{2}}{\alpha T_{1}}}\right).

(3)
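To make the mixture in Theorem 2 concrete, the following minimal sketch computes the action distribution of $L_{\alpha}=(1-\alpha)L_{0}+\alpha\pi_{0}$ for a single context and samples from it. The function names and the representation of the output of $L_{0}$ as a list of action probabilities are assumptions made for illustration; they are not part of the original analysis.

```python
import random

def mixed_action_probs(base_probs, alpha):
    """Action distribution of L_alpha = (1 - alpha) * L_0 + alpha * pi_0.

    base_probs: action probabilities proposed by the base learner L_0
                for the current context (hypothetical representation).
    alpha:      additional exploration rate in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    uniform = 1.0 / len(base_probs)  # pi_0 is the uniform random policy
    return [(1.0 - alpha) * p + alpha * uniform for p in base_probs]

def sample_action(base_probs, alpha, rng=random):
    """Draw one action index from the mixed distribution."""
    probs = mixed_action_probs(base_probs, alpha)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Under this mixture every action is played with probability at least $\alpha/|{\mathcal{A}}|$ in every context, which is what allows an offline learner to estimate all context-action pairs from the first-task data.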

By tuning the exploration rate $\alpha$, we are able to match the minimax lower bound provided in (2). In short, there are three regimes of $(T_{1},T_{2})$, for which we should choose different exploration rates $\alpha$ to balance $\mathbb{E}[\operatorname{Reg}_{1}]$ and $\mathbb{E}[\operatorname{Reg}_{2}]$. The first regime is $T_{1}\leq T_{2}^{2/3}$, where the first task is too short compared to the second task, and the algorithm should employ pure exploration in the first task ($\alpha=1$). This regime leads to a global regret of ${\mathcal{O}}({T_{2}}/{\sqrt{T_{1}}})$. In the intermediate regime with $T_{2}^{2/3}<T_{1}\leq T_{2}^{4/3}$, the algorithm should employ additional exploration compared to algorithms that achieve a minimax optimal rate in a single task. Theorem 2 suggests an additional exploration rate of $\alpha=T_{2}^{2/3}/T_{1}$ and a global regret bound of ${\mathcal{O}}(T_{2}^{2/3})$. Note that in the special case of $T_{1}=T_{2}$, the rate of $\alpha=T_{1}^{-1/3}$ indicates a regret bound of ${\mathcal{O}}(T_{1}^{2/3})$ in the first task. The third regime is $T_{1}>T_{2}^{4/3}$, where one should employ $\alpha=0$, meaning that no excess exploration is needed and the agent in the first task can minimize the local regret as much as possible. In this regime, the local regret in the first task can achieve the minimax optimal rate of $\sqrt{T_{1}}$.
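The three regimes can be summarized in a short sketch. In the intermediate regime the rate comes from balancing the two bounds in (3): setting $\alpha T_{1}=T_{2}/\sqrt{\alpha T_{1}}$ gives $\alpha T_{1}=T_{2}^{2/3}$. The function names below are hypothetical, and the dependence on $|{\mathcal{X}}|$, $|{\mathcal{A}}|$ and constants is dropped, as in the discussion above.

```python
def exploration_rate(t1, t2):
    """Return (regime, alpha) for horizons t1 and t2, following the three regimes.

    Constants and the dependence on |X| and |A| are ignored.
    """
    if t1 <= t2 ** (2.0 / 3.0):
        return "pure exploration", 1.0
    if t1 <= t2 ** (4.0 / 3.0):
        return "additional exploration", t2 ** (2.0 / 3.0) / t1
    return "no additional exploration", 0.0

def lower_bound_rate(t1, t2):
    """Order of the minimax lower bound in (2), constants dropped."""
    return max(t2 / t1 ** 0.5, t2 ** (2.0 / 3.0), t1 ** 0.5)
```

For instance, with $T_{1}=T_{2}=1000$ the sketch selects the intermediate regime with $\alpha\approx 0.1$, in agreement with $\alpha=T_{1}^{-1/3}$.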

In real-world applications, it is often the case that $T_{2}$ is pre-determined and the researcher can decide how many samples to collect in the first task to ensure good learning in the second one. For instance, in an inventory management context (Madeka et al., 2022), the engineering team determines how long a learned policy should be deployed for the second task. In such cases, our theory indicates that one should choose $T_{1}>T_{2}^{4/3}$, so that greedy local regret minimization in the first task is justified.

4 Results on Multiple Tasks

In our two-task study, we highlighted the inherent dilemma between local regrets in the first and second tasks. Now we extend our analysis to a sequence of multiple tasks. A significant property of the two-task scenario is the inability of the algorithm to adaptively learn in the second task. This restriction forces the algorithm to "overly" explore in the first task to propose a good policy for the second task, thereby introducing a tension between the regrets in the first and the second task. In fact, such tension exists between any task $i$ and its preceding tasks whenever the algorithm is not allowed to adaptively learn within task $i$. We will also discuss the maximum number of rounds for which this trade-off can hold simultaneously.

In this section, we denote an instance by a sequence of $N$ task setups and their shared outcome distribution, ${\bm{S}}=(S^{(1)},\dots,S^{(N)})$. Let ${\mathcal{I}}$ be a set of indices such that any task $i\in{\mathcal{I}}$ has to be non-adaptive. Theorem 3 generalizes Theorem 1 by lower bounding the minimax rate of the product between the sum of the cumulative regrets over tasks $j=1,\dots,i-1$ and the regret of task $i$, simultaneously for all indices in ${\mathcal{I}}$.

Theorem 3.

Recall that ${\mathcal{S}}({\bm{S}})$ denotes the set of instances that share the same policy space and reward function with a given instance ${\bm{S}}$. Let ${\mathcal{I}}$ be an index set. Assume the instance set ${\mathcal{S}}$ is sufficiently large to ensure the existence of an instance ${\bm{S}}\in{\mathcal{S}}$ for which we can find some $\bar{{\bm{S}}}\in{\mathcal{S}}({\bm{S}})$ such that for all $i\in{\mathcal{I}}$ and $\epsilon\in[0,1/4]$, the following conditions are satisfied:

1. There exists a unique optimal policy $\pi^{\star}_{S}$ for each $S\in\{S^{(1)},\dots,{S}^{(i)},\bar{S}^{(1)},\dots,\bar{S}^{(i)}\}$;

2. $\mu_{\pi_{S^{(j)}}^{\star}}(x_{\star},a_{\star})=\mu_{\pi_{\bar{S}^{(j)}}^{\star}}(x_{\star},a_{\star})=0$ and $\min\{\operatorname{SR}_{S^{(j)}}(\pi),\operatorname{SR}_{\bar{S}^{(j)}}(\pi)\}\geq c_{1}\mu_{\pi}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(j)}$ and $j\in[i-1]$;

3. $\mu_{\pi^{\star}_{S^{(i)}}}(x_{\star},a_{\star})-\mu_{{\pi}^{\star}_{\bar{S}^{(i)}}}(x_{\star},a_{\star})>c_{2}$;

4. $\operatorname{SR}_{S^{(i)}}(\pi)\geq\epsilon(\mu_{\pi^{\star}_{S^{(i)}}}(x_{\star},a_{\star})-\mu_{\pi}(x_{\star},a_{\star}))/2$ for all $\pi\in\Pi^{(i)}$;

5. $\operatorname{SR}_{\bar{S}^{(i)}}(\pi)\geq\epsilon(\mu_{\pi}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(i)}}}(x_{\star},a_{\star}))/2$ for all $\pi\in\Pi^{(i)}$;

6. $P$ and $\bar{P}$ only differ in $(x_{\star},a_{\star})$ and $D_{\operatorname{KL}}(P(\cdot\mid x_{\star},a_{\star})\,\|\,\bar{P}(\cdot\mid x_{\star},a_{\star}))\leq\epsilon^{2}$,

where $c_{1},c_{2}>0$ are constants and $(x_{\star},a_{\star})\in{\mathcal{X}}\times{\mathcal{A}}$ is some context-action pair. Then we have the following lower bound:

\inf_{L\in{\mathcal{L}}_{{\mathcal{I}}}}\min_{i\in{\mathcal{I}}}\sup_{S\in{\mathcal{S}}}\sqrt{\mathbb{E}\left[\sum_{j=1}^{i-1}\operatorname{Reg}_{j}\right]}\mathbb{E}\left[\operatorname{Reg}_{i}\right]/T_{i}=\Omega\left(c_{1}c_{2}\right).

Theorem 3 provides a strong lower bound that holds simultaneously for the simple regret of all tasks in a set. The construction of the hard instance requires that the optimal policy of the next task $i$ visits a context-action pair $(x_{\star},a_{\star})$ that the optimal policies of the preceding tasks never visit (Conditions 2 and 3). Proposition 2 states that the longest sequence of tasks one can find to ensure that the conditions in Theorem 3 hold is $\Theta(|{\mathcal{X}}||{\mathcal{A}}|)$.

Proposition 2.

Let ${\mathcal{I}}=[N]$ . There exists an instance ${\bm{S}}$ of length $N=|{\mathcal{X}}|(|{\mathcal{A}}|-2)$ that satisfies the conditions in Theorem 3. Any instance that satisfies the conditions in Theorem 3 must have length $N={\mathcal{O}}(|{\mathcal{X}}||{\mathcal{A}}|)$ .

4.1 Discussion on Nonlinear Case

We have shown in the tabular case that we can find at most $\Theta(|{\mathcal{X}}||{\mathcal{A}}|)$ tasks such that there is a trade-off between the simple regret of any task and the cumulative regrets of all its preceding tasks (Proposition 2). It appears that the number of rounds for which this tension can hold is closely connected to the complexity of the underlying outcome distributions. In this section, we extend the tabular bandit to a nonlinear setting, where we show that it is possible to find an exponentially long sequence of tasks with the trade-off described above holding for each of the tasks.

Nonlinear contextual bandit.

For simplicity, we consider the outcome space ${\mathcal{Y}}={\mathbb{R}}$ and a fixed reward function $f^{(i)}=f\equiv x\mapsto x$ for all tasks $i$, i.e., no rich observations, so that we can focus on the changes in the policy spaces. Following the setup in Section 2, we now consider potentially large or continuous context and action spaces ${\mathcal{X}}$ and ${\mathcal{A}}$. Recall that the mean reward is given by $R(x,a)\coloneqq\mathbb{E}_{Y\sim P(x,a)}[f(Y)]$. For contextual bandits with nonlinear reward models, we assume that the mean reward satisfies $R\in{\mathcal{F}}$ for some known function class ${\mathcal{F}}:{\mathcal{X}}\times{\mathcal{A}}\mapsto{\mathbb{R}}$.

Complexity for nonlinear bandit.

Running UCB on nonlinear bandits is generally hard. Russo and Van Roy (2013) proposed to explore by choosing $A_{t}\in\operatorname*{arg\,max}_{a\in{\mathcal{A}}}\sup_{f\in{\mathcal{F}}_{t}}f(X_{t},a)$, where $\sup_{f\in{\mathcal{F}}_{t}}f(X_{t},a)$ is an optimistic estimate of the mean reward $R(X_{t},a)$. A choice of ${\mathcal{F}}_{t}$ given by Russo and Van Roy (2013) is

{\mathcal{F}}_{t}=\left\{f\in{\mathcal{F}}:\|f-\hat{f}_{t}^{LS}\|_{2,E_{t}}\leq\sqrt{\beta_{t}^{\star}}\right\},

(4)

where $\beta_{t}^{\star}$ are constants, $\|g\|_{2,E_{t}}=\sqrt{\sum_{s=1}^{t-1}g^{2}(X_{s},A_{s})}$ is the empirical 2-norm, and $\hat{f}_{t}^{LS}\in\operatorname*{arg\,min}_{f\in{\mathcal{F}}}\sum_{s=1}^{t-1}(f(X_{s},A_{s})-Y_{s})^{2}$ is the empirical risk minimizer. Running UCB with appropriately chosen $\beta_{t}^{\star}$ achieves a regret of order $\sqrt{\operatorname{dim}_{E}({\mathcal{F}},T^{-2})T}$, where $\operatorname{dim}_{E}({\mathcal{F}},T^{-2})$ is the eluder dimension of the function class ${\mathcal{F}}$.
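The optimistic selection rule built on the confidence set (4) can be sketched as follows. This is a minimal illustration under strong assumptions that are not in the original text: the function class is a finite list of Python callables, the empirical 2-norm is taken as the square root of the summed squared differences over past rounds, and all function names are hypothetical.

```python
import math

def least_squares_fit(function_class, history):
    """Empirical risk minimizer over a finite function class.

    history: list of (context, action, reward) triples observed so far.
    """
    def squared_loss(f):
        return sum((f(x, a) - y) ** 2 for x, a, y in history)
    return min(function_class, key=squared_loss)

def confidence_set(function_class, history, beta):
    """Functions within sqrt(beta) of the least-squares fit in empirical 2-norm."""
    f_hat = least_squares_fit(function_class, history)
    def empirical_norm(f):
        return math.sqrt(sum((f(x, a) - f_hat(x, a)) ** 2 for x, a, _ in history))
    return [f for f in function_class if empirical_norm(f) <= math.sqrt(beta)]

def optimistic_action(function_class, history, beta, context, actions):
    """Pick the action with the largest optimistic value over the confidence set."""
    candidates = confidence_set(function_class, history, beta)
    return max(actions, key=lambda a: max(f(context, a) for f in candidates))
```

Functions that agree with the least-squares fit on all past context-action pairs stay in the confidence set no matter how much they differ elsewhere, so an action that has never been tried keeps its optimistic value until it is actually played.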

Definition 5 (Distributional eluder dimension).

Let ${\mathcal{F}}$ be a class of functions from ${\mathcal{X}}$ to ${\mathbb{R}}$. A probability measure $\nu$ over ${\mathcal{X}}$ is said to be $\epsilon$-dependent on a sequence of probability measures $\{\mu_{1},\dots,\mu_{n}\}$ w.r.t. ${\mathcal{F}}$ if any pair of functions $f,\bar{f}\in{\mathcal{F}}$ satisfying $\sqrt{\sum_{i=1}^{n}(\mathbb{E}_{\mu_{i}}[f(x)-\bar{f}(x)])^{2}}\leq\epsilon$ also satisfies $|\mathbb{E}_{\nu}[f(x)-\bar{f}(x)]|\leq\epsilon$. Furthermore, $\nu$ is $\epsilon$-independent of $\{\mu_{1},\dots,\mu_{n}\}$ if it is not $\epsilon$-dependent on the sequence.

The $\epsilon$ -eluder dimension $\operatorname{dim}_{E}({\mathcal{F}},\epsilon)$ is the length of the longest sequence of distributions over ${\mathcal{X}}$ such that for some $\epsilon^{\prime}\geq\epsilon$ , every distribution is $\epsilon^{\prime}$ -independent of its predecessors.

Recall that the construction of our hard instances in Theorem 1 requires that the new task has an optimal policy whose occupancy measure has no overlap with the occupancy measures of the optimal policies in the previous tasks. A generalization of this to the nonlinear case is that a predicted function that minimizes the loss over the dataset collected in the previous tasks may still incur a large loss on a new task. Let the optimal policies of tasks $1,\dots,n$ be $\pi_{1}^{\star},\dots,\pi_{n}^{\star}$. Intuitively, as long as $n$ is smaller than $\operatorname{dim}_{E}({\mathcal{F}},\epsilon)$, we can find a new task with optimal policy $\pi_{n+1}^{\star}$ for which the occupancy measure $\mu_{\pi_{n+1}^{\star}}$ is $\epsilon$-independent of $(\mu_{\pi_{1}^{\star}},\dots,\mu_{\pi_{n}^{\star}})$. By the definition of the eluder dimension, this implies that the function chosen for task $n+1$ based on the dataset collected under $(\mu_{\pi_{1}^{\star}},\dots,\mu_{\pi_{n}^{\star}})$ may still incur a large error. Note that when running a no-regret online algorithm, the dataset collected during a task is asymptotically distributed according to the occupancy measure induced by its optimal policy.

The eluder dimension has been shown to be exponentially large for simple models such as one-layer neural networks with ReLU activation (Dong et al., 2021). It is nontrivial to show a lower bound that depends directly on the eluder dimension. Instead, we provide a concrete example where the UCB algorithm described in (4) fails.

Theorem 4.

Consider the hypothesis set ${\mathcal{F}}$ to be one-hidden-layer neural networks with width $d$. There exist a ground-truth reward function and a sequence of tasks of length $\Omega(\exp(d))$ with different $\Pi^{(i)}$ such that the local regret for each task is lower bounded by a constant, even if each $T_{i}\rightarrow\infty$.

Theorem 4 indicates that even without a change in the outcome distributions, there exists an exponentially long sequence of tasks for which the tension between local regret minimization and global regret minimization persists. A UCB algorithm that greedily minimizes local regret fails to provide good guarantees for later tasks.

5 Study on Changes in $P$

In real-world implementations, the outcome distribution $P$ often undergoes unpredictable shifts. Prior research on non-stationary bandits has typically focused on single-task scenarios with potential reward distribution shifts at any step. To manage these shifts, the literature often limits the total variation in distribution shifts, making it possible to establish sublinear regret bounds. In a sequential task setting, when the algorithm is not allowed to adaptively learn in the second task, the simple regret is always lower bounded by a constant. This is attributed to the uncertainty about the second task's optimal policy, even with full knowledge of the first task. To address this challenge, we introduce the concept of robust simple regret. We show that the robust simple regret and the cumulative regret in the two-task case have a minimax lower bound similar to the one shown in Theorem 1.

For simplicity, we consider no change in the policy space and the reward function. More specifically, we let $\Pi^{(1)}=\Pi^{(2)}=\Pi$, the set of all policies, and $f^{(1)}=f^{(2)}=f$, the identity mapping on ${\mathbb{R}}$. We denote by $P^{(1)}$ and $P^{(2)}$ the outcome distributions of the first and the second task, respectively. We denote a problem instance by ${\bm{P}}=(P^{(1)},P^{(2)})$.

The adversary is allowed to choose $P^{(2)}$ from an $L_{1}$ ball around $P^{(1)}$. This leads to the instance set ${\mathcal{P}}(\Delta)$, parametrized by a constant $\Delta$, such that each ${\bm{P}}=(P^{(1)},P^{(2)})$ satisfies

P^{(2)}\in{\mathcal{P}}(P^{(1)},\Delta)\coloneqq\left\{P:\sum_{a}|P(x,a)-P^{(1)}(x,a)|\leq\Delta\text{ for all }x\in{\mathcal{X}}\right\},

(5)

where we abuse the notation for $P$ and let $P(x,a)$ denote the mean reward for $(x,a)$ .

Robust simple regret.

When $P^{(2)}$ is allowed to deviate from $P^{(1)}$ in a potentially adversarial way, it is not reasonable to compare with the true optimal policy with respect to the underlying true $P^{(2)}$. Instead, we consider a robust regret definition. We first define the worst-case simple regret of a policy $\pi$ on a context $x$:

\operatorname{SR}(\pi\mid P^{(1)},\Delta)\coloneqq\sup_{P^{(2)}\in{\mathcal{P}}(P^{(1)},\Delta)}\left(\max_{a}P^{(2)}(x,a)-\sum_{a}P^{(2)}(x,a)\pi(a\mid x)\right).

(6)
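For a single context with finitely many arms, the supremum in (6) can be evaluated exactly by enumeration. The objective is convex in $P^{(2)}$ (a maximum of linear functions minus a linear function), so its maximum over the $L_{1}$ ball is attained at one of the vertices $P^{(1)}\pm\Delta e_{a}$. The sketch below relies on this observation and assumes that mean rewards are unconstrained real numbers (no clipping to a bounded range); the function names are hypothetical.

```python
def simple_regret(pi, means):
    """Gap between the best arm's mean and the policy's expected mean reward."""
    return max(means) - sum(p * m for p, m in zip(pi, means))

def worst_case_simple_regret(pi, p1_means, delta):
    """Evaluate (6) for one context by checking the 2K vertices of the L1 ball."""
    worst = 0.0  # simple regret is never negative
    for arm in range(len(p1_means)):
        for sign in (1.0, -1.0):
            shifted = list(p1_means)
            shifted[arm] += sign * delta  # vertex P1 +/- delta * e_arm
            worst = max(worst, simple_regret(pi, shifted))
    return worst
```

With two arms, a gap of $0.2$ and $\Delta=0.5$, always playing the first-task optimal arm has worst-case simple regret $\Delta-0.2=0.3$, while always playing the other arm has worst-case simple regret $\Delta+0.2=0.7$.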

We denote by $\tilde{\pi}_{P^{(1)},\Delta}\in\operatorname*{arg\,min}_{\pi^{\prime}\in\Pi^{(2)}}\operatorname{SR}(\pi^{\prime}\mid P^{(1)},\Delta)$ the optimal robust policy given $(P^{(1)},\Delta)$. When it is clear from the context, we drop the subscripts $P^{(1)}$ and $\Delta$.

We further define robust simple regret, which is the gap between the worst-case simple regret of a given policy and the policy that achieves the lowest worst-case simple regret:

\widetilde{\operatorname{SR}}(\pi\mid P^{(1)},\Delta)\coloneqq\operatorname{SR}(\pi\mid P^{(1)},\Delta)-\inf_{\pi^{\prime}\in\Pi^{(2)}}\operatorname{SR}(\pi^{\prime}\mid P^{(1)},\Delta).

(7)

Note that the worst-case regret over some ambiguity set has been studied in the Robust Markov Decision Process literature (Xu and Mannor, 2010; Eysenbach and Levine, 2021; Dong et al., 2022). However, the definition of robust simple regret and the tension between cumulative regret and robust simple regret have not yet been explored.

To understand how the tension between cumulative and simple regret still plays a role, we investigate a simple two-armed, context-free bandit case in Proposition 3. The optimal arm in the first task is $a_{1}$, while the optimal robust policy depends on the gap between the mean rewards of the two arms. Thus, to reduce the robust simple regret in the second task, the algorithm is forced to obtain an accurate estimate of the suboptimal arm in the first task.

Proposition 3.

Consider the following two-armed, context-free bandit, with $\text{Gap}\coloneqq P^{(1)}(a_{1})-P^{(1)}(a_{2})>0$ . Then the worst-case simple regret is given by

\operatorname{SR}(\pi\mid P^{(1)},\Delta)=\max\{(\Delta-\text{Gap})\pi(a_{1}),(\Delta+\text{Gap})\pi(a_{2})\}.

(8)

Assume that $\Delta>\text{Gap}$ . The optimal robust policy $\tilde{\pi}$ w.r.t. the worst-case simple regret has the explicit form of

\tilde{\pi}_{P^{(1)},\Delta}(a_{1})=\frac{\Delta+\text{Gap}}{2\Delta},\quad\text{and}\quad\tilde{\pi}_{P^{(1)},\Delta}(a_{2})=\frac{\Delta-\text{Gap}}{2\Delta}.

(9)
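The closed forms (8) and (9) are easy to check numerically. The optimal robust policy equalizes the two terms inside the maximum in (8): solving $(\Delta-\text{Gap})\pi(a_{1})=(\Delta+\text{Gap})(1-\pi(a_{1}))$ gives $\pi(a_{1})=(\Delta+\text{Gap})/(2\Delta)$. The sketch below encodes these formulas together with the robust simple regret of definition (7) for the two-armed case; the function names are hypothetical.

```python
def worst_case_sr_two_arms(pi_a1, gap, delta):
    """Closed form (8): worst-case simple regret of a two-armed policy."""
    return max((delta - gap) * pi_a1, (delta + gap) * (1.0 - pi_a1))

def optimal_robust_policy(gap, delta):
    """Closed form (9) for the probability of a_1, valid when delta > gap > 0."""
    return (delta + gap) / (2.0 * delta)

def robust_simple_regret(pi_a1, gap, delta):
    """Definition (7): excess worst-case regret over the optimal robust policy."""
    best = worst_case_sr_two_arms(optimal_robust_policy(gap, delta), gap, delta)
    return worst_case_sr_two_arms(pi_a1, gap, delta) - best
```

Because the optimal robust policy depends on $\text{Gap}$, a learner that misestimates the mean of the suboptimal arm in the first task will pick the wrong mixing probability and incur positive robust simple regret.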

Motivated by the instance introduced in Proposition 3, we show the following theorem, which lower bounds the minimax rate of the product between the cumulative regret in the first task and the robust simple regret in the second task. Since the robust simple regret does not depend on the actual choice of $P^{(2)}$, the supremum is taken only over $P^{(1)}$.

Theorem 5.

Assume ${\mathcal{S}}$ is such that $\Pi^{(1)}=\Pi^{(2)}=\Pi$, the set of all policies, and $f^{(1)}=f^{(2)}=f$, the identity mapping on ${\mathbb{R}}$. Assume each $P^{(i)}(\cdot\mid x,a)$ is a binomial distribution with mean $P^{(i)}(x,a)$ for all $i=1,2$ and $(x,a)\in{\mathcal{X}}\times{\mathcal{A}}$, and $P^{(2)}\in{\mathcal{P}}(P^{(1)},\Delta)$. Then there exists some $\Delta$ such that

\inf_{L\in{\mathcal{L}}_{2}}\sup_{P^{(1)}}\sqrt{\mathbb{E}[\operatorname{Reg}_{1}]}\mathbb{E}[\widetilde{\operatorname{SR}}(\pi_{2}\mid P^{(1)},\Delta)]=\Omega(1),

(10)

where $\pi_{2}$ is the random policy chosen by learning algorithm $L$ .

Theorem 5 implies that one should still employ additional exploration for small $T_{1}$ even when the only changes are in the outcome distributions, since a similar trade-off between local regret and global regret still holds.

6 Discussion

In this paper, we study the minimax rate of the sum of local regrets across a sequence of contextual bandit tasks. By showing a lower bound on this rate, we demonstrate a strong trade-off between local regrets in different tasks when there are changes between tasks. These changes include changes in the policy space, the reward function and the outcome distribution, which is a novel aspect of our setting. A main message is that, in the presence of such changes, one should employ additional exploration compared to what is sufficient for single-task cumulative regret minimization. Our work opens many interesting future directions in the area of multitask bandits.

Multiple changes in $P$ .

In this paper, we only studied changes in the outcome distribution in a two-task case, where we propose a new notion of robust simple regret and show that there is a dilemma between cumulative regret minimization in the first task and robust simple regret minimization in the second one. In its current form, it is not clear how to extend the result to a multiple-task case. Intuitively, information from older tasks should be discounted when proposing a policy for a new task. Future work could consider modeling the changes in $P$ by an autoregressive model, which would allow us to characterize how a new $P^{(i)}$ depends on the previous ones.

Instance-dependence results.

We study the minimax rate throughout the paper, which often focuses on the worst case. In reality, some instances are significantly harder to learn than others. An interesting direction is to propose a theoretical measure of the significance of the trade-off studied in this paper and to derive an instance-dependent result.

References

Abel et al., (2023) Abel, D., Barreto, A., Roy, B. V., Precup, D., Hasselt, H. V., and Singh, S. (2023). A definition of continual reinforcement learning. ArXiv, abs/2307.11046.
Agrawal and Goyal, (2013) Agrawal, S. and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pages 127–135. PMLR.
Aleven et al., (2023) Aleven, V., Baraniuk, R., Brunskill, E., Crossley, S. A., Demszky, D., Fancsali, S. E., Gupta, S., Koedinger, K., Piech, C., Ritter, S., Thomas, D. R., Woodhead, S., and Xing, W. (2023). Towards the future of ai-augmented human tutoring in math learning. In International Conference on Artificial Intelligence in Education.
Athey et al., (2022) Athey, S., Byambadalai, U., Hadad, V., Krishnamurthy, S. K., Leung, W., and Williams, J. J. (2022). Contextual bandits in a survey experiment on charitable giving: Within-experiment outcomes versus policy learning. ArXiv, abs/2211.12004.
Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
Auer et al., (2008) Auer, P., Jaksch, T., and Ortner, R. (2008). Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21.
Bidargaddi et al., (2020) Bidargaddi, N., Schrader, G., Klasnja, P., Licinio, J., and Murphy, S. (2020). Designing m-health interventions for precision mental health support. Translational psychiatry, 10(1):222.
Bubeck et al., (2011) Bubeck, S., Munos, R., and Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852.
Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.
Chu et al., (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings.
Dai et al., (2023) Dai, J., Gradu, P., and Harshaw, C. (2023). Clip-ogd: An experimental design for adaptive neyman allocation in sequential experiments. arXiv preprint arXiv:2305.17187.
Dong et al., (2022) Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust mdp. arXiv preprint arXiv:2209.13841.
Dong et al., (2021) Dong, K., Yang, J., and Ma, T. (2021). Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. Advances in neural information processing systems, 34:26168–26182.
Eysenbach and Levine, (2021) Eysenbach, B. and Levine, S. (2021). Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257.
Gao et al., (2022) Gao, D., Liu, Y., and Zeng, D. (2022). Non-asymptotic properties of individualized treatment rules from sequentially rule-adaptive trials. The Journal of Machine Learning Research, 23(1):11362–11403.
Garivier and Moulines, (2008) Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415.
Hong et al., (2023) Hong, K., Li, Y., and Tewari, A. (2023). An optimization-based algorithm for non-stationary kernel bandits without prior knowledge. In International Conference on Artificial Intelligence and Statistics, pages 3048–3085. PMLR.
Khetarpal et al., (2020) Khetarpal, K., Riemer, M., Rish, I., and Precup, D. (2020). Towards continual reinforcement learning: A review and perspectives. J. Artif. Intell. Res., 75:1401–1476.
Kim and Tewari, (2020) Kim, B. and Tewari, A. (2020). Randomized exploration for non-stationary stochastic linear bandits. In Conference on Uncertainty in Artificial Intelligence, pages 71–80. PMLR.
Krishnamurthy et al., (2023) Krishnamurthy, S. K., Zhan, R., Athey, S., and Brunskill, E. (2023). Proportional response: Contextual bandits for simple and cumulative regret minimization. arXiv preprint arXiv:2307.02108.
Lange et al., (2019) Lange, M. D., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G. G., and Tuytelaars, T. (2019). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:3366–3385.
Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
Liao et al., (2020) Liao, P., Greenewald, K., Klasnja, P., and Murphy, S. (2020). Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22.
Madeka et al., (2022) Madeka, D., Torkkola, K., Eisenach, C., Luo, A., Foster, D. P., and Kakade, S. M. (2022). Deep inventory management. arXiv preprint arXiv:2210.03137.
Peng et al., (2023) Peng, L., Giampouras, P. V., and Vidal, R. (2023). The ideal continual learner: An agent that never forgets. In International Conference on Machine Learning.
Qin and Russo, (2023) Qin, C. and Russo, D. (2023). Generalized objectives in adaptive experiments: The frontier between regret and speed. NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World.
Raj and Kalyani, (2017) Raj, V. and Kalyani, S. (2017). Taming non-stationary bandits: A bayesian approach. arXiv preprint arXiv:1707.09727.
Ruan et al., (2023) Ruan, S. S., Nie, A., Steenbergen, W., He, J., Zhang, J., Guo, M., Liu, Y., Nguyen, K. D., Wang, C. Y., Ying, R., Landay, J. A., and Brunskill, E. (2023). Reinforcement learning tutor better supported lower performers in a math task. ArXiv, abs/2304.04933.
Russo and Van Roy, (2013) Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26.
Simchi-Levi and Wang, (2023) Simchi-Levi, D. and Wang, C. (2023). Multi-armed bandit experimental design: Online decision-making and adaptive inference. In International Conference on Artificial Intelligence and Statistics, pages 3086–3097. PMLR.
Trella et al., (2022) Trella, A. L., Zhang, K. W., Nahum-Shani, I., Shetty, V., Doshi-Velez, F., and Murphy, S. A. (2022). Designing reinforcement learning algorithms for digital interventions: pre-implementation guidelines. Algorithms, 15(8):255.
Wang et al., (2023) Wang, L., Zhang, X., Su, H., and Zhu, J. (2023). A comprehensive survey of continual learning: Theory, method and application. ArXiv, abs/2302.00487.
Wu et al., (2018) Wu, Q., Iyer, N., and Wang, H. (2018). Learning contextual bandits in a non-stationary environment. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 495–504.
Xu and Mannor, (2010) Xu, H. and Mannor, S. (2010). Distributionally robust markov decision processes. Advances in Neural Information Processing Systems, 23.
Yin and Wang, (2021) Yin, M. and Wang, Y.-X. (2021). Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34:4065–4078.
Zanette et al., (2021) Zanette, A., Dong, K., Lee, J. N., and Brunskill, E. (2021). Design of experiments for stochastic contextual linear bandits. Advances in Neural Information Processing Systems, 34:22720–22731.

Appendix A Proof of Theorem 1

Theorem 1. Assume the instance set ${\mathcal{S}}$ is sufficiently large to ensure the existence of an instance ${\bm{S}}=(S^{(1)},S^{(2)})$ such that for all $\epsilon\in[0,1/4]$, we can find some $\bar{{\bm{S}}}=(\bar{S}^{(1)},\bar{S}^{(2)})\in{\mathcal{S}}({\bm{S}})\cap{\mathcal{S}}$ satisfying the following conditions:

1. There exists a unique optimal policy $\pi^{\star}_{S}$ for each $S\in\{S^{(1)},\bar{S}^{(1)},S^{(2)},\bar{S}^{(2)}\}$;

2. $\mu_{\pi_{S^{(1)}}^{\star}}(x_{\star},a_{\star})=\mu_{\pi_{\bar{S}^{(1)}}^{\star}}(x_{\star},a_{\star})=0$ and $\min\{\operatorname{SR}_{S^{(1)}}(\pi),\operatorname{SR}_{\bar{S}^{(1)}}(\pi)\}\geq c_{1}\mu_{\pi}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(1)}$;

3. $\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})-\mu_{{\pi}^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})=c_{2}>0$;

4. $\operatorname{SR}_{S^{(2)}}(\pi)/\epsilon\geq\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})-\mu_{\pi}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(2)}$;

5. $\operatorname{SR}_{\bar{S}^{(2)}}(\pi)/\epsilon\geq\mu_{\pi}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(2)}$;

6. $P$ and $\bar{P}$ only differ in $(x_{\star},a_{\star})$ and $D_{\operatorname{KL}}(P(\cdot\mid x_{\star},a_{\star})\,\|\,\bar{P}(\cdot\mid x_{\star},a_{\star}))\leq\epsilon^{2}$,

where $c_{1},c_{2}>0$ are constants and $(x_{\star},a_{\star})\in{\mathcal{X}}\times{\mathcal{A}}$ is some context-action pair. Then we have the following lower bound:

\inf_{L_{2}\in{\mathcal{L}}}\sup_{S\in{\mathcal{S}}}\sqrt{\mathbb{E}\left[\operatorname{Reg}_{1}\right]}\mathbb{E}\left[\operatorname{Reg}_{2}\right]/T_{2}=\Omega\left(\sqrt{{c_{1}^{2}c_{2}^{2}}}\right).

Proof.

Fix a learning algorithm $L$. Throughout the proof, we let $\mathbb{E}_{{\bm{S}}^{\prime}}$ denote the expectation of the random variable of interest when running algorithm $L$ on the underlying instance ${\bm{S}}^{\prime}\in{\mathcal{S}}$.

Let $T(x_{\star},a_{\star})=\sum_{t=1}^{T_{1}}\mathbbm{1}_{\{(X_{1,t},A_{1,t})=(x_{\star},a_{\star})\}}$ be the number of times $(x_{\star},a_{\star})$ is visited in the first task. From Condition 2 and the definition of the cumulative regret, we have

\begin{align}
\mathbb{E}_{{\bm{S}}}[\operatorname{Reg}_{1}]&=\sum_{t=1}^{T_{1}}\mathbb{E}_{{\bm{S}}}\left[\operatorname{SR}_{S^{(1)}}(\pi_{1,t})\right]\tag{11}\\
&\geq\sum_{t=1}^{T_{1}}\mathbb{E}_{{\bm{S}}}\left[c_{1}\mu_{\pi_{1,t}}(x_{\star},a_{\star})\right]\tag{12}\\
&=c_{1}\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})].\tag{13}
\end{align}

The same argument gives $\mathbb{E}_{\bar{{\bm{S}}}}[\operatorname{Reg}_{1}]\geq c_{1}\mathbb{E}_{\bar{{\bm{S}}}}[T(x_{\star},a_{\star})]$.

Denote by $\pi_{2}$ the fixed policy proposed by the algorithm $L$ for task two. Note that

\mathbb{E}_{{\bm{S}}}[\operatorname{Reg}_{2}/T_{2}]=\mathbb{E}_{{\bm{S}}}[\operatorname{SR}_{S^{(2)}}(\pi_{2})]\quad\text{and}\quad\mathbb{E}_{\bar{{\bm{S}}}}[\operatorname{Reg}_{2}/T_{2}]=\mathbb{E}_{\bar{{\bm{S}}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_{2})].

(14)

We further lower bound the sum of the simple regrets of $\pi_{2}$ on $S^{(2)}$ and $\bar{S}^{(2)}$:

\begin{align}
&\mathbb{E}_{{\bm{S}}}[\operatorname{SR}_{S^{(2)}}(\pi_{2})]+\mathbb{E}_{\bar{{\bm{S}}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_{2})]\tag{15}\\
\geq{}&\epsilon\mathbb{E}_{{\bm{S}}}[\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})-\mu_{\pi_{2}}(x_{\star},a_{\star})]+\epsilon\mathbb{E}_{\bar{{\bm{S}}}}[\mu_{\pi_{2}}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})]\tag{16}\\
\geq{}&\frac{c_{2}\epsilon}{2}\left[{\mathbb{P}}_{{\bm{S}}}\left(\mu_{\pi_{2}}(x_{\star},a_{\star})\leq\frac{\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})+\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})}{2}\right)+\right.\tag{17}\\
&\qquad\left.{\mathbb{P}}_{\bar{{\bm{S}}}}\left(\mu_{\pi_{2}}(x_{\star},a_{\star})>\frac{\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})+\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})}{2}\right)\right],\tag{18}
\end{align}

where the first inequality follows from Conditions 4 and 5, and the second inequality follows from Condition 3.

Lemma 1 (Bretagnolle–Huber inequality).

For any two probability distributions $P,Q$ on the same measurable space $({\mathcal{X}},{\mathcal{F}})$ , and any event $A\in{\mathcal{F}}$ , we have

P(A)+Q(\bar{A})\geq\frac{1}{2}\exp(-D_{KL}(P\|Q)).
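
As a quick sanity check of Lemma 1, the following sketch evaluates both sides of the inequality for two Bernoulli distributions and every event on $\{0,1\}$. The function names and the parameter values are illustrative assumptions and do not appear in the paper.

```python
import math
from itertools import chain, combinations

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bh_gap(p, q):
    """Smallest value of P(A) + Q(A^c) - 0.5 * exp(-KL(P || Q)) over all events A."""
    P = {0: 1 - p, 1: p}
    Q = {0: 1 - q, 1: q}
    outcomes = [0, 1]
    # All subsets of {0, 1}: the empty set, the two singletons, the full set.
    events = chain.from_iterable(combinations(outcomes, r) for r in range(3))
    rhs = 0.5 * math.exp(-kl_bernoulli(p, q))
    gaps = []
    for A in events:
        lhs = sum(P[x] for x in A) + sum(Q[x] for x in outcomes if x not in A)
        gaps.append(lhs - rhs)
    return min(gaps)

# Assumed illustrative means 1/2 and 1/2 + eps.
for eps in (0.01, 0.1, 0.25):
    assert bh_gap(0.5, 0.5 + eps) >= 0.0
```

A nonnegative gap for every event is exactly what the lemma asserts; the check does not replace the proof, it only illustrates the statement on a small example.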

It follows from Lemma 1 and Condition 6 that

	$\displaystyle\mathbb{E}_{{\bm{S}}}[\operatorname{SR}_{S^{(2)}}(\pi_{2})]+\mathbb{E}_{\bar{{\bm{S}}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_{2})]$	(19)
$\displaystyle\geq$	$\displaystyle\frac{c_{2}\epsilon}{4}\exp(-D_{\operatorname{KL}}({\mathbb{P}}_{{\bm{S}}}\|{\mathbb{P}}_{\bar{\bm{S}}}))$	(20)
$\displaystyle=$	$\displaystyle\frac{c_{2}\epsilon}{4}\exp\left(-\sum_{t=1}^{T_{1}}\mathbb{E}_{{\bm{S}}}[D_{\operatorname{KL}}(P(\cdot\mid X_{1,t},A_{1,t})\|\bar{P}(\cdot\mid X_{1,t},A_{1,t}))]\right)$	(21)
$\displaystyle=$	$\displaystyle\frac{c_{2}\epsilon}{4}\exp\left(-\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]D_{\operatorname{KL}}(P(\cdot\mid x_{\star},a_{\star})\|\bar{P}(\cdot\mid x_{\star},a_{\star}))\right)$	(22)
$\displaystyle\geq$	$\displaystyle\frac{c_{2}\epsilon}{4}\exp\left(-\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]\epsilon^{2}\right).$	(23)

The same argument gives that

	$\displaystyle\mathbb{E}_{{\bm{S}}}[\operatorname{SR}_{S^{(2)}}(\pi_{2})]+\mathbb{E}_{\bar{{\bm{S}}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_{2})]\geq\frac{c_{2}\epsilon}{4}\exp\left(-\mathbb{E}_{\bar{{\bm{S}}}}[T(x_{\star},a_{\star})]\epsilon^{2}\right).$	(24)

Lemma 2.

For all choices of algorithm $L$, with $\epsilon=\sqrt{1/\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]}$, we have $\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]\leq\gamma\mathbb{E}_{\bar{{\bm{S}}}}[T(x_{\star},a_{\star})]$ for some universal constant $\gamma>0$.

By choosing $\epsilon=\sqrt{1/\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]}$ (assuming that ${1/\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]}\leq 1/4$ ) and applying Lemma 2, it holds that

	$\displaystyle\mathbb{E}_{{\bm{S}}}[\operatorname{SR}_{S^{(2)}}(\pi_{2})]+\mathbb{E}_{\bar{{\bm{S}}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_{2})]\geq\sqrt{\frac{c_{1}c_{2}^{2}}{64\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]}}\geq\sqrt{\frac{c_{1}c_{2}^{2}}{64\gamma\mathbb{E}_{\bar{{\bm{S}}}}[T(x_{\star},a_{\star})]}}.$	(25)

Combined with (13) and (14), we have

	$\displaystyle\inf_{L}\sup_{{\bm{S}}^{\prime}\in{\mathcal{S}}}\sqrt{\mathbb{E}\left[\operatorname{Reg}_{1}\right]}\mathbb{E}\left[\operatorname{Reg}_{2}\right]/T_{2}$	(26)
$\displaystyle\geq$	$\displaystyle\inf_{L}\left(\sqrt{\mathbb{E}_{{\bm{S}}}\left[\operatorname{Reg}_{1}\right]}\mathbb{E}_{{\bm{S}}}\left[\operatorname{Reg}_{2}\right]/T_{2}+\sqrt{\mathbb{E}_{\bar{{\bm{S}}}}\left[\operatorname{Reg}_{1}\right]}\mathbb{E}_{\bar{{\bm{S}}}}\left[\operatorname{Reg}_{2}\right]/T_{2}\right)$	(27)
$\displaystyle=$	$\displaystyle\Omega\left(\sqrt{{c_{1}^{2}c_{2}^{2}}}\right).$	(28)

∎

A.1 Proof of Lemma 2

Lemma 2. For all choices of algorithm $L$, with $\epsilon=\sqrt{1/\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]}$, we have $\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]\leq\gamma\mathbb{E}_{\bar{{\bm{S}}}}[T(x_{\star},a_{\star})]$ for some universal constant $\gamma>0$.

Proof.

For any $\beta>0$ , by applying Lemma 1 and the same argument that obtains (23), we have

	$\displaystyle{\mathbb{P}}_{{\bm{S}}}(T(x_{\star},a_{\star})\leq\beta/\epsilon^{2})+{\mathbb{P}}_{\bar{{\bm{S}}}}(T(x_{\star},a_{\star})>\beta/\epsilon^{2})\geq\frac{1}{2e}.$	(29)

To proceed, we apply Markov's inequality to each term:

	$\displaystyle\frac{\alpha/\epsilon^{2}-\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})]}{\alpha/\epsilon^{2}-\beta/\epsilon^{2}}+\frac{\mathbb{E}_{\bar{{\bm{S}}}}[T(x_{\star},a_{\star})]}{\beta/\epsilon^{2}}$	(30)
$\displaystyle\geq$	$\displaystyle{\mathbb{P}}_{{\bm{S}}}(\alpha/\epsilon^{2}-T(x_{\star},a_{\star})\geq\alpha/\epsilon^{2}-\beta/\epsilon^{2})+{\mathbb{P}}_{\bar{{\bm{S}}}}(T(x_{\star},a_{\star})>\beta/\epsilon^{2})$	(31)
$\displaystyle=$	$\displaystyle{\mathbb{P}}_{{\bm{S}}}(T(x_{\star},a_{\star})\leq\beta/\epsilon^{2})+{\mathbb{P}}_{\bar{{\bm{S}}}}(T(x_{\star},a_{\star})>\beta/\epsilon^{2})$	(32)
$\displaystyle\geq$	$\displaystyle 1/(2e).$	(33)

By choosing $\alpha=2$ and $\beta=11/6$ , we have

	$\displaystyle\mathbb{E}_{\bar{{\bm{S}}}}[T(x_{\star},a_{\star})]\geq\left(\frac{1}{2e}-\frac{\alpha-1}{\alpha-\beta}\right)\frac{\beta}{\epsilon^{2}}\geq 0.01/\epsilon^{2}.$	(34)

∎

Appendix B Proof of Proposition 1

Proposition 1. Under the same conditions on the instance set ${\mathcal{S}}$, the following minimax lower bound holds:

	$\displaystyle\inf_{L_{2}\in{\mathcal{L}}}\sup_{S\in{\mathcal{S}}}\mathbb{E}[\operatorname{Reg}_{1}+\operatorname{Reg}_{2}]=\Omega\left(\max\left\{\frac{T_{2}}{\sqrt{T_{1}}},T_{2}^{2/3},\sqrt{T_{1}}\right\}\right).$	(35)

Proof.

It is well known that the minimax rate of the cumulative regret of a single task with horizon $T_{1}$ is $\Omega(\sqrt{T_{1}})$ (Lattimore and Szepesvári, 2020). By Theorem 1,

	$\displaystyle\inf_{L_{2}\in{\mathcal{L}}}\sup_{S\in{\mathcal{S}}}\mathbb{E}\left[\operatorname{Reg}_{1}+\operatorname{Reg}_{2}\right]$	$\displaystyle=\inf_{L_{2}\in{\mathcal{L}}}\sup_{S\in{\mathcal{S}}}\frac{\mathbb{E}\left[\operatorname{Reg}_{1}\right](\mathbb{E}\left[\operatorname{Reg}_{2}\right])^{2}/T_{2}^{2}}{(\mathbb{E}\left[\operatorname{Reg}_{2}\right])^{2}/T_{2}^{2}}+\mathbb{E}[\operatorname{Reg}_{2}]$
		$\displaystyle=\Omega\left(\frac{1}{(\mathbb{E}\left[\operatorname{Reg}_{2}\right])^{2}/T_{2}^{2}}+\mathbb{E}[\operatorname{Reg}_{2}]\right)\text{ for some $L_{2}$ and $S$}$
		$\displaystyle=\Omega\left(T_{2}^{2/3}\right).$

A minimax lower bound of $\Omega(T_{2}/\sqrt{T_{1}})$ for offline policy optimization is shown in Theorem 4.3 of Yin and Wang (2021). Combining the three lower bounds, we conclude (35). ∎
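
The $T_{2}^{2/3}$ rate in the proof comes from balancing the two terms $T_{2}^{2}/x^{2}+x$, where $x$ stands for $\mathbb{E}[\operatorname{Reg}_{2}]$. The sketch below works this out numerically, under the assumption that all constants hidden by $\Omega(\cdot)$ are set to 1; the function name `balance` is hypothetical.

```python
def balance(T2):
    """Minimize g(x) = T2**2 / x**2 + x over x > 0 (hidden constants set to 1)."""
    # Stationary point: g'(x) = 1 - 2 * T2**2 / x**3 = 0.
    x_star = (2.0 * T2 ** 2) ** (1.0 / 3.0)
    g_star = T2 ** 2 / x_star ** 2 + x_star
    return x_star, g_star

for T2 in (10 ** 2, 10 ** 4, 10 ** 6):
    x_star, g_star = balance(T2)
    # The ratio to T2**(2/3) is the same constant, 3 / 2**(2/3), for every horizon.
    print(T2, g_star / T2 ** (2.0 / 3.0))
```

Because the ratio does not depend on $T_{2}$, the minimum of the two-term expression scales as $T_{2}^{2/3}$, in line with the last display of the proof.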

Appendix C Verifying Conditions in Section 3.1

We verify each of the conditions in Theorem 1 for the three case studies introduced in Section 3.1. Note that the mean rewards are given in Table 4.

C.1 Case I

Let $(x_{\star},a_{\star})=(x_{2},a_{2})$. It follows from the construction that there are unique optimal policies for $S^{(1)},\bar{S}^{(1)},S^{(2)},\bar{S}^{(2)}$. The unique optimal policies for each task are given in Table 5.

1. Condition 2 holds with $c_{1}=1/4$: $\pi^{\star}_{S^{(1)}}(a_{\star}\mid\cdot)=\pi^{\star}_{\bar{S}^{(1)}}(a_{\star}\mid\cdot)=0$, $\operatorname{SR}_{S^{(1)}}(\pi)=(1-2\epsilon)\mu_{\pi}(x_{\star},a_{\star})$, and $\operatorname{SR}_{\bar{S}^{(1)}}(\pi)=(1-3\epsilon)\mu_{\pi}(x_{\star},a_{\star})$. Thus, for $\epsilon\leq 1/4$, $\min\{\operatorname{SR}_{S^{(1)}}(\pi),\operatorname{SR}_{\bar{S}^{(1)}}(\pi)\}\geq(1-3\epsilon)\mu_{\pi}(x_{\star},a_{\star})\geq\frac{1}{4}\mu_{\pi}(x_{\star},a_{\star})$.

2. Condition 3 holds with $c_{2}=1/2$ because $\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})=1/2$, while $\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})=0$ (see Table 5).

3. Conditions 4 and 5 hold since $\operatorname{SR}_{S^{(2)}}(\pi)/\epsilon=\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})-\mu_{\pi}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(2)}$ and $\operatorname{SR}_{\bar{S}^{(2)}}(\pi)/\epsilon=\mu_{\pi}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(2)}$.

4. Condition 6 can be satisfied by choosing $P$ and $\bar{P}$ as normal distributions with means $R$ and $\bar{R}$ and variance $1/2$.
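
To see why variance $1/2$ is a convenient choice for Condition 6, recall that two normal distributions with a common variance $\sigma^{2}$ have KL divergence $(m_{1}-m_{2})^{2}/(2\sigma^{2})$, which reduces to the squared mean difference when $\sigma^{2}=1/2$. The sketch below compares this closed form with a numerical integral. It assumes that the two means differ by exactly $\epsilon$ at $(x_{\star},a_{\star})$; the specific means and the helper names are illustrative and are not taken from Table 4.

```python
import math

def kl_gauss_closed(m1, m2, var):
    """Closed-form KL(N(m1, var) || N(m2, var)) for a shared variance."""
    return (m1 - m2) ** 2 / (2.0 * var)

def kl_gauss_numeric(m1, m2, var, half_width=12.0, n=20_000):
    """Midpoint-rule approximation of the same KL divergence."""
    def log_pdf(x, m):
        return -0.5 * math.log(2.0 * math.pi * var) - (x - m) ** 2 / (2.0 * var)
    lo = m1 - half_width
    h = 2.0 * half_width / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        lp = log_pdf(x, m1)
        total += math.exp(lp) * (lp - log_pdf(x, m2)) * h
    return total

eps = 0.2  # assumed gap between the two means
closed = kl_gauss_closed(0.5, 0.5 + eps, 0.5)
numeric = kl_gauss_numeric(0.5, 0.5 + eps, 0.5)
print(closed, numeric, eps ** 2)  # all three agree up to numerical error
```

With a mean gap of $\epsilon$, the divergence equals $\epsilon^{2}$, so the bound required by Condition 6 holds with equality under this assumption.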

Table 5: Sampling probabilities of $(a_{1},a_{2})$ under the optimal policies, by context.

Context	$\pi_{S^{(1)}}^{\star}$	$\pi_{\bar{S}^{(1)}}^{\star}$	$\pi_{S^{(2)}}^{\star}$	$\pi_{\bar{S}^{(2)}}^{\star}$
$x_{1}$	(1, 0)	(1, 0)	(1, 0)	(1, 0)
$x_{2}$	(1, 0)	(1, 0)	(0, 1)	(1, 0)

C.2 Case II

Let $(x_{\star},a_{\star})=(x_{2},a_{1})$. The unique optimal policies for each task are given in Table 6.

1. Condition 2 holds with $c_{1}=1/4$: $\pi^{\star}_{S^{(1)}}(a_{\star}\mid x_{\star})=\pi^{\star}_{\bar{S}^{(1)}}(a_{\star}\mid x_{\star})=0$. Furthermore, $\operatorname{SR}_{S^{(1)}}(\pi)\geq(1-\epsilon)\mu_{\pi}(x_{\star},a_{\star})$ and $\operatorname{SR}_{\bar{S}^{(1)}}(\pi)\geq(1-\epsilon)\mu_{\pi}(x_{\star},a_{\star})$, so any $\epsilon<1/4$ suffices.

2. Condition 3 holds with $c_{2}=1/2$ because $\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})=0$, while $\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})=1/2$.

3. Conditions 4 and 5 hold because $\operatorname{SR}_{S^{(2)}}(\pi)=\epsilon\mu_{\pi}(x_{\star},a_{\star})/2$ and $\operatorname{SR}_{\bar{S}^{(2)}}(\pi)=\epsilon(1/2-\mu_{\pi}(x_{\star},a_{\star}))$.

4. Condition 6 can be satisfied by choosing $P$ and $\bar{P}$ as normal distributions with means $R$ and $\bar{R}$ and variance $1/2$.

Table 6: Sampling probabilities of $(a_{1},a_{2})$ under the optimal policies, by context.

Context	$\pi_{S^{(1)}}^{\star}$	$\pi_{\bar{S}^{(1)}}^{\star}$	$\pi_{S^{(2)}}^{\star}$	$\pi_{\bar{S}^{(2)}}^{\star}$
$x_{1}$	(1, 0)	(1, 0)	(0, 1)	(1, 0)
$x_{2}$	(0, 1)	(0, 1)	(0, 1)	(1, 0)

C.3 Case III

Note that any MAB can be seen as a contextual bandit with a dummy context $x_{0}$. Let $(x_{\star},a_{\star})=(x_{0},a_{2})$. The optimal policies for $S^{(1)}$ and $\bar{S}^{(1)}$ are $\pi_{S^{(1)}}^{\star}(a_{1})=\pi_{\bar{S}^{(1)}}^{\star}(a_{1})=1$. The optimal policy for $S^{(2)}$ is $\pi_{S^{(2)}}^{\star}(a_{1})=1$, and for $\bar{S}^{(2)}$ it is $\pi_{\bar{S}^{(2)}}^{\star}(a_{1})=0$.

1. Condition 2 holds with $c_{1}=0.2$.

2. Condition 3 holds with $c_{2}=1$.

3. Conditions 4 and 5 hold because $\operatorname{SR}_{S^{(2)}}(\pi)=\epsilon\mu_{\pi}(x_{\star},a_{\star})/2$ and $\operatorname{SR}_{\bar{S}^{(2)}}(\pi)=\epsilon(1-\mu_{\pi}(x_{\star},a_{\star}))/2$.

4. Condition 6 holds by choosing $P$ and $\bar{P}$ as normal distributions with means $R$ and $\bar{R}$ and variance $1/2$.

Appendix D Proof of Theorem 5

Theorem 5. Assume ${\mathcal{S}}$ is such that $\Pi^{(1)}=\Pi^{(2)}=\Pi$, the set of all policies, and $f^{(1)}=f^{(2)}=f$, the identity mapping on ${\mathbb{R}}$. Assume each $P^{(i)}(\cdot\mid x,a)$ is a binomial distribution with mean $P^{(i)}(x,a)$ for all $i=1,2$ and $(x,a)\in{\mathcal{X}}\times{\mathcal{A}}$, and $P^{(2)}\in{\mathcal{P}}(P^{(1)},\Delta)$. Then there exists some $\Delta$ such that

	$\displaystyle\inf_{L\in{\mathcal{L}}_{2}}\sup_{P^{(1)}}\sqrt{\mathbb{E}[\operatorname{Reg}_{1}]}\mathbb{E}[\widetilde{\operatorname{SR}}(\pi_{2}\mid P^{(1)},\Delta)]=\Omega(1),$	(36)

where $\pi_{2}$ is the random policy chosen by learning algorithm $L$ .

Proof.

We construct two hard instances inspired by Proposition 3. Recall that Proposition 3 states that any two-armed, context-free bandit, with $\text{Gap}\coloneqq P^{(1)}(a_{1})-P^{(1)}(a_{2})>0$ has the following explicit form of optimal robust policy:

	$\displaystyle\tilde{\pi}(a_{1})=\frac{\Delta+\text{Gap}}{2\Delta},\text{ and }\tilde{\pi}(a_{2})=\frac{\Delta-\text{Gap}}{2\Delta}.$	(37)

Let the arm space be ${\mathcal{A}}=\{a_{1},a_{2}\}$ . We construct two instances $S$ and $\bar{S}$ with $P^{(1)}$ and $\bar{P}^{(1)}$ , such that $R^{(1)}(a_{1})=\bar{R}^{(1)}(a_{1})=1$ and $R^{(1)}(a_{2})=1/2$ , while $\bar{R}^{(1)}(a_{2})=1/2+\epsilon$ for $\epsilon<1/4$ . Let $P^{(i)}(\cdot\mid a)$ and $\bar{P}^{(i)}(\cdot\mid a)$ be Bernoulli distributions of parameter $R^{(i)}(a)$ and $\bar{R}^{(i)}(a)$ for each $i\in\{1,2\},a\in{\mathcal{A}}$ . Let $\text{Gap}=R^{(1)}(a_{1})-R^{(1)}(a_{2})=1/2$ and $\overline{\text{Gap}}=\bar{R}^{(1)}(a_{1})-\bar{R}^{(1)}(a_{2})=1/2-\epsilon$ .

Following the proof of Theorem 1, we first connect the cumulative regret in the first task, $\operatorname{Reg}_{1}$, with the number of pulls of the suboptimal arm $a_{2}$ in the first task. For each $S^{\prime}\in\{S,\bar{S}\}$, it can be shown that $\mathbb{E}_{S^{\prime}}[\operatorname{Reg}_{1}]\geq\frac{1}{4}\mathbb{E}_{S^{\prime}}[T(a_{2})]$, where $T(a_{2})\coloneqq\sum_{t=1}^{T_{1}}\mathbbm{1}_{A_{1,t}=a_{2}}$.

We consider $\Delta=3/4>\max\{\text{Gap},\overline{\text{Gap}}\}$ . By Proposition 3, we first lower bound the robust simple regret by

	$\displaystyle\widetilde{\operatorname{SR}}(\pi\mid P^{(1)},\Delta)$	(38)
$\displaystyle=$	$\displaystyle\operatorname{SR}(\pi\mid P^{(1)},\Delta)-\inf_{\pi^{\prime}}\operatorname{SR}(\pi^{\prime}\mid P^{(1)},\Delta)$	(39)
$\displaystyle=$	$\displaystyle\max\{(\Delta-\text{Gap})\pi(a_{1}),(\Delta+\text{Gap})\pi(a_{2})\}-\frac{\Delta^{2}-\text{Gap}^{2}}{2\Delta}$	(40)
$\displaystyle=$	$\displaystyle\left(\pi(a_{1})-\frac{\Delta+\text{Gap}}{2\Delta}\right)^{+}(\Delta-\text{Gap})+\left(\pi(a_{2})-\frac{\Delta-\text{Gap}}{2\Delta}\right)^{+}(\Delta+\text{Gap})$	(41)
$\displaystyle\geq$	$\displaystyle(\Delta-\text{Gap})\|\pi-\pi^{*}\|/2\geq\|\pi-\pi^{*}\|/8.$	(42)

Similarly, we also have $\widetilde{\operatorname{SR}}(\pi\mid\bar{P}^{(1)},\Delta)\geq\|\pi-\bar{\pi}^{*}\|/8$. Here $\pi^{*}$ and $\bar{\pi}^{*}$ denote the optimal robust policies for $P^{(2)}\in{\mathcal{P}}(P^{(1)},\Delta)$ and $P^{(2)}\in{\mathcal{P}}(\bar{P}^{(1)},\Delta)$, respectively.

Let $\pi_{2}$ be the random policy proposed by the learning algorithm for the second task. We convert the robust learning problem to a testing problem of two instances. Note that $\pi^{*}(a_{1})=5/6$ and $\bar{\pi}^{*}(a_{1})=5/6-2\epsilon/3$ .
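
The values $5/6$ and $5/6-2\epsilon/3$ follow from plugging $\Delta=3/4$ and the two gaps into the closed form (37). The sketch below recomputes them and cross-checks the closed form against a brute-force grid search. It assumes that the worst-case regret of a policy is the larger of the two linear terms $(\Delta-\text{Gap})\pi(a_{1})$ and $(\Delta+\text{Gap})\pi(a_{2})$, which is the form consistent with (37) and (41); the function names and the value $\epsilon=0.1$ are illustrative.

```python
def robust_regret(pi_a1, gap, delta):
    """Assumed worst-case regret when a1 is played with probability pi_a1."""
    return max((delta - gap) * pi_a1, (delta + gap) * (1.0 - pi_a1))

def robust_policy(gap, delta):
    """Closed form (37): probability of a1 under the optimal robust policy."""
    return (delta + gap) / (2.0 * delta)

def grid_minimizer(gap, delta, n=60_000):
    """Brute-force minimizer of robust_regret over an evenly spaced grid on [0, 1]."""
    return min((k / n for k in range(n + 1)),
               key=lambda p: robust_regret(p, gap, delta))

delta, eps = 0.75, 0.1  # eps = 0.1 is an assumed illustrative value
for gap in (0.5, 0.5 - eps):
    print(gap, robust_policy(gap, delta), grid_minimizer(gap, delta))
```

The two minimizers are separated by $2\epsilon/3$, which is the separation the testing argument exploits.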

The sum of robust simple regrets for two instances can be lower bounded by

	$\displaystyle\mathbb{E}_{P^{(1)}}[\widetilde{\operatorname{SR}}_{2}]+\mathbb{E}_{\bar{P}^{(1)}}[\widetilde{\operatorname{SR}}_{2}]$	(43)
$\displaystyle\geq$	$\displaystyle\frac{\epsilon}{24}{\mathbb{P}}_{P^{(1)}}\left(\pi_{2}(a_{1})\leq\frac{5}{6}-\epsilon/3\right)+\frac{\epsilon}{24}{\mathbb{P}}_{\bar{P}^{(1)}}\left(\pi_{2}(a_{1})>\frac{5}{6}-\epsilon/3\right)$	(44)
$\displaystyle\geq$	$\displaystyle\frac{\epsilon}{24}\exp(-\epsilon^{2}\mathbb{E}_{P^{(1)}}[T(a_{2})]).$	(45)

Choosing $\epsilon=\sqrt{1/\mathbb{E}_{P^{(1)}}[T(a_{2})]}$ and applying Lemma 2 again, we have

	$\displaystyle\inf_{L\in{\mathcal{L}}_{2}}\sup_{P^{(1)}}\sqrt{\mathbb{E}[\operatorname{Reg}_{1}]}\mathbb{E}[\widetilde{\operatorname{SR}}(\pi_{2}\mid P^{(1)},\Delta)]=\Omega(1).$

∎

Appendix E Proof of Theorem 3

Theorem 3. Recall that ${\mathcal{S}}({\bm{S}})$ denotes the set of instances that share the same policy spaces and reward functions as a given instance ${\bm{S}}$. Let ${\mathcal{I}}$ be an index set. Assume the instance set ${\mathcal{S}}$ is sufficiently large that there exist an instance ${\bm{S}}\in{\mathcal{S}}$ and some $\bar{{\bm{S}}}\in{\mathcal{S}}({\bm{S}})$ such that, for all $i\in{\mathcal{I}}$ and $\epsilon\in[0,1/4]$, the following conditions hold:

1. There exists a unique optimal policy $\pi^{\star}_{S}$ for each $S\in\{S^{(1)},\dots,{S}^{(i)},\bar{S}^{(1)},\dots,\bar{S}^{(i)}\}$;

2. $\mu_{\pi_{S^{(j)}}^{\star}}(x_{\star},a_{\star})=\mu_{\pi_{\bar{S}^{(j)}}^{\star}}(x_{\star},a_{\star})=0$ and $\min\{\operatorname{SR}_{S^{(j)}}(\pi),\operatorname{SR}_{\bar{S}^{(j)}}(\pi)\}\geq c_{1}\mu_{\pi}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(j)}$ and $j\in[i-1]$;

3. $\mu_{\pi^{\star}_{S^{(i)}}}(x_{\star},a_{\star})-\mu_{{\pi}^{\star}_{\bar{S}^{(i)}}}(x_{\star},a_{\star})>c_{2}$;

4. $\operatorname{SR}_{S^{(i)}}(\pi)\geq\epsilon(\mu_{\pi^{\star}_{S^{(i)}}}(x_{\star},a_{\star})-\mu_{\pi}(x_{\star},a_{\star}))/2$ for all $\pi\in\Pi^{(i)}$;

5. $\operatorname{SR}_{\bar{S}^{(i)}}(\pi)\geq\epsilon(\mu_{\pi}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(i)}}}(x_{\star},a_{\star}))/2$ for all $\pi\in\Pi^{(i)}$;

6. $P$ and $\bar{P}$ differ only at $(x_{\star},a_{\star})$, and $D_{\operatorname{KL}}(P(\cdot\mid x_{\star},a_{\star})\|\bar{P}(\cdot\mid x_{\star},a_{\star}))\leq\epsilon^{2}$,

where $c_{1},c_{2}>0$ are constants and $(x_{\star},a_{\star})\in{\mathcal{X}}\times{\mathcal{A}}$ is some context-action pair. Then we have the following lower bound:

	$\displaystyle\inf_{L\in{\mathcal{L}}_{{\mathcal{I}}}}\min_{i\in{\mathcal{I}}}\sup_{S\in{\mathcal{S}}}\sqrt{\mathbb{E}\left[\sum_{j=1}^{i-1}\operatorname{Reg}_{j}\right]}\mathbb{E}\left[\operatorname{Reg}_{i}\right]/T_{i}=\Omega\left(c_{1}c_{2}\right).$

Proof.

The proof of Theorem 3 extends that of Theorem 1 by showing the following statement: for each $i\in{\mathcal{I}}$ ,

	$\displaystyle\mathbb{E}_{{\bm{S}}}\left[\sum_{j=1}^{i-1}\operatorname{Reg}_{j}\right]\geq c_{1}\mathbb{E}_{{\bm{S}}}[T(x_{\star},a_{\star})],$	(46)

where $T(x_{\star},a_{\star})\coloneqq\sum_{j=1}^{i-1}\sum_{t=1}^{T_{j}}\mathbbm{1}_{X_{j,t}=x_{\star},A_{j,t}=a_{\star}}$. Following the same steps as in the proof of Theorem 1, we can show that for all $i\in{\mathcal{I}}$ and all learning algorithms $L\in{\mathcal{L}}_{{\mathcal{I}}}$,

	$\displaystyle\max_{{\bm{S}}^{\prime}\in\{{\bm{S}},\bar{{\bm{S}}}\}}\sqrt{\mathbb{E}_{{\bm{S}}^{\prime}}\left[\sum_{j=1}^{i-1}\operatorname{Reg}_{j}\right]}\mathbb{E}_{{\bm{S}}^{\prime}}\left[\operatorname{Reg}_{i}\right]/T_{i}=\Omega(\sqrt{c_{1}^{2}c_{2}^{2}}),$	(47)

from which we immediately obtain Theorem 3. ∎

Appendix F Proof of Proposition 2

Proof.

The conditions in Theorem 3 require that the optimal policy for task $i$ has occupancy measure $\mu_{\pi^{\star}_{S^{(i)}}}(x_{\star},a_{\star})>0$, while the optimal policies of all previous tasks $j$ have occupancy measure $\mu_{\pi^{\star}_{S^{(j)}}}(x_{\star},a_{\star})=0$. The longest sequence of policies whose occupancy measures have non-overlapping supports is no longer than $|{\mathcal{X}}||{\mathcal{A}}|$. A trivial instance of length $|{\mathcal{X}}|(|{\mathcal{A}}|-2)$ can be constructed, although it is of limited practical interest. Index the contexts in ${\mathcal{X}}$ by $x_{1},\dots,x_{|{\mathcal{X}}|}$ and the actions in ${\mathcal{A}}$ by $a_{1},\dots,a_{|{\mathcal{A}}|}$. Split the $|{\mathcal{X}}|(|{\mathcal{A}}|-2)$ tasks into $|{\mathcal{X}}|$ groups of size $|{\mathcal{A}}|-2$. The reward functions in group $j$ are designed such that the optimal arm at every context $x_{l}\in{\mathcal{X}}$ with $l\neq j$ is $a_{1}$. The tasks in group $j$ share the same reward function, and the mean rewards satisfy $R^{(i)}(x_{j},a_{2})>\dots>R^{(i)}(x_{j},a_{|{\mathcal{A}}|})$. The policy spaces are designed such that $\Pi_{j,i}=\{\pi:\pi(x_{j},a_{k})=0,\text{ for all }k<i\}$; that is, the $i$-th task in group $j$ is not allowed to choose the first $i-1$ actions. It can be verified that this construction satisfies the conditions in Theorem 3. ∎

Appendix G Proof of Theorem 4

Proof.

The hard instance is constructed as follows. Consider a nonlinear bandit problem with ${\mathcal{A}}=S^{d-1}$, the unit sphere in ${\mathbb{R}}^{d}$. We first define the reward function as

	$\displaystyle R(\theta_{2})=\alpha_{1}\langle\theta_{1},a\rangle+\alpha_{2}\left(\langle\theta_{2},a\rangle-\epsilon\right)^{+},$

where $\theta_{1}\in S^{d-1}$, $\alpha_{1}>0$ and $\alpha_{2}>0$ are known parameters and $\theta_{2}\in S^{d-1}$ is unknown. Assume that the true parameter $\theta_{2}^{*}$ satisfies $\langle\theta_{1},\theta_{2}^{*}\rangle<0$, i.e., $\theta_{1}$ and $\theta_{2}^{*}$ lie in different hemispheres. Furthermore, let $\alpha_{2}=2\alpha_{1}/(1-\epsilon)$.

Let $\{{\mathcal{A}}_{1},\dots,{\mathcal{A}}_{N}\}$ be an $\epsilon$-packing of the subset $\{a\in{\mathcal{A}}:\langle\theta_{1},a\rangle<0\}$. Let the allowed policy space for task $i$ be $\Pi^{(i)}=\{\pi:\pi\text{ is supported on }{\mathcal{A}}_{i}\cup\{\theta_{1}\}\}$. Specifically, order $\{{\mathcal{A}}_{1},\dots,{\mathcal{A}}_{N}\}$ such that $\theta_{2}^{*}\in{\mathcal{A}}_{N}$.

Note that $R(\theta_{2})$ belongs to the family of one-layer neural networks with ReLU activation.
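
A minimal sketch of this reward function may help to see why the known direction $\theta_{1}$ is the safe choice while the ReLU unit stays inactive. Everything numeric below is an assumption made for illustration: the dimension $d=2$, the angle between $\theta_{1}$ and $\theta_{2}^{*}$, and the scaling $\alpha_{2}=2\alpha_{1}/(1-\epsilon)$, which is one reading of the construction.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))

def reward(a, theta1, theta2, alpha1, alpha2, eps):
    """R = alpha1 * <theta1, a> + alpha2 * relu(<theta2, a> - eps)."""
    return alpha1 * dot(theta1, a) + alpha2 * max(dot(theta2, a) - eps, 0.0)

# Hypothetical instance in d = 2 (all numbers are assumptions).
eps, alpha1 = 0.1, 1.0
alpha2 = 2.0 * alpha1 / (1.0 - eps)      # assumed scaling of the ReLU term
theta1 = (1.0, 0.0)
angle = 2.0 * math.pi / 3.0              # gives <theta1, theta2> = -1/2 < 0
theta2 = (math.cos(angle), math.sin(angle))
far = (math.cos(-angle), math.sin(-angle))  # opposite hemisphere, far from theta2

r_safe = reward(theta1, theta1, theta2, alpha1, alpha2, eps)  # ReLU unit inactive
r_best = reward(theta2, theta1, theta2, alpha1, alpha2, eps)  # ReLU unit active
r_far = reward(far, theta1, theta2, alpha1, alpha2, eps)      # ReLU unit inactive
print(r_safe, r_best, r_far)
```

In this toy instance the action $\theta_{2}^{*}$ beats $\theta_{1}$, whereas an action in the opposite hemisphere that is far from $\theta_{2}^{*}$ is strictly worse than $\theta_{1}$; this mirrors the situation in which exploring a region that does not contain $\theta_{2}^{*}$ incurs regret.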

We first observe that the optimal policy for task $i$ is $\pi^{*}_{i}=\delta(\theta_{1})$ for all $i=1,\dots,N-1$. Note that when running UCB, the algorithm will optimistically choose some $a\in{\mathcal{A}}_{i}$, because it does not know whether $\theta_{2}^{*}\in{\mathcal{A}}_{i}$ for any $i=1,\dots,N-1$, thus incurring a constant regret in every task $i<N$. ∎
