License: CC BY 4.0
arXiv:2403.10946v1 [stat.ML] 16 Mar 2024

The Fallacy of Minimizing Local Regret in the Sequential Task Setting

Ziping Xu Harvard University Kelly W. Zhang Columbia University Susan A. Murphy Harvard University
Abstract

In the realm of Reinforcement Learning (RL), online RL is often conceptualized as an optimization problem, where an algorithm interacts with an unknown environment to minimize cumulative regret. In a stationary setting, strong theoretical guarantees, like a sublinear ($\sqrt{T}$) regret bound, can be obtained, which typically implies the convergence to an optimal policy and the cessation of exploration. However, these theoretical setups often oversimplify the complexities encountered in real-world RL implementations, where tasks arrive sequentially with substantial changes between tasks and the algorithm may not be allowed to adaptively learn within certain tasks. We study the changes beyond the outcome distributions, encompassing changes in the reward designs (mappings from outcomes to rewards) and the permissible policy spaces. Our results reveal the fallacy of myopically minimizing regret within each task: obtaining optimal regret rates in the early tasks may lead to worse rates in the subsequent ones, even when the outcome distributions stay the same. To realize the optimal cumulative regret bound across all the tasks, the algorithm has to overly explore in the earlier tasks. This theoretical insight is practically significant, suggesting that due to unanticipated changes (e.g., rapid technological development or human-in-the-loop involvement) between tasks, the algorithm needs to explore more than it would in the usual stationary setting within each task. Such implication resonates with the common practice of using clipped policies in mobile health clinical trials and maintaining a fixed rate of $\epsilon$-greedy exploration in robotic learning.

1 Introduction

Regret minimization has emerged as a central topic of online Reinforcement Learning (RL) research (Auer et al., 2008; Chu et al., 2011; Agrawal and Goyal, 2013; Lattimore and Szepesvári, 2020). In a static environment, prioritizing regret minimization is theoretically sound. A sublinear regret bound directly leads to a sound lower bound on cumulative rewards, and it implies convergence to optimal policies for environments with a lower-bounded suboptimality gap. Any algorithm with a sublinear regret bound is designed to cease exploration upon acquiring sufficient information about the underlying environment. Nevertheless, this cessation of exploration, while theoretically ideal, poses challenges in real-world RL implementations.

Real-world RL tasks significantly differ from the static setup in two key aspects. First, many real-world RL tasks arrive sequentially with substantial changes between tasks. This observation is evident from many applications including mobile health, where RL algorithms are trained to personalize digital interventions (Liao et al., 2020; Bidargaddi et al., 2020; Trella et al., 2022), and online education, where RL learns an automated pedagogical strategy that optimizes students' performance (Aleven et al., 2023; Ruan et al., 2023). These applications, including others like inventory management (Madeka et al., 2022) and online charitable giving experiments (Athey et al., 2022), often undergo significant changes between tasks. Previous literature often studies changes in outcome distributions (Garivier and Moulines, 2008; Raj and Kalyani, 2017), while our study encompasses changes in reward design in healthcare due to emerging side effects, or changes in permissible policy spaces driven by technological advancements and concerns around fairness or privacy. These changes often result from high-level human decision-making and are thus unpredictable. Second, in many of these domains, particularly those involving human interactions, it is necessary to deploy static or nonadaptive policies within certain tasks. Though adaptive experimental design has emerged as a powerful tool for collecting data efficiently for subsequent tasks, many funding agencies are more familiar with, or restricted to, static experimental designs (Zanette et al., 2021). Moreover, ethical or safety considerations often mandate the use of nonadaptive approaches (Madeka et al., 2022).

It is prevalent in the field to apply naive regret minimization algorithms within each task (i.e., local regret minimization), often overlooking the sequential nature of real-world problems. A main message of this paper is the necessity of additional exploration due to unanticipated changes between tasks, such as rapid technological development or human-in-the-loop involvement, compared to what is normally sufficient in the stationary setting. Complete exploitation within a task could lead to worse performance in later tasks.

To understand the limitations of local regret minimization, note that sublinear regret often implies the convergence to an optimal policy, leading to complete exploitation. This focus on local regret minimization can result in an online dataset that is inadequate for subsequent tasks, especially in the presence of changes between tasks. We elucidate this concept through Example 1, which involves a sequence of two contextual bandit tasks. The two tasks share the same reward distribution, while the policy spaces are different. In the first task, the algorithm is allowed to make decisions based on the current context, leading to the optimal policy of choosing action $a_1$ in context $x_1$ and $a_2$ in context $x_2$. In contrast, the second task has to adopt context-free policies, leading to the optimal policy that takes action $a_1$ for all contexts. We further ask the algorithm not to learn adaptively in the second task. Therefore, the data collected from the first task is used to estimate an optimal policy to deploy for the second task. The online dataset collected by the first task's optimal policy lacks visitations in $(x_1, a_2)$ and $(x_2, a_1)$, which hinders learning a good policy for the second task. That is, minimizing the regret in the first task leads to worse regret in the second task. Our simulation experiment validates this hypothesis. Figure 2 compares the average regret in the first task and the simple regret in the second task, obtained by running the UCB (Upper Confidence Bound) algorithm (Auer, 2002) (in blue) and UCB mixed with probability 0.1 random exploration (in green) on the tasks described in Example 1. UCB shows a diminishing average regret in the first task with a constant simple regret for the second task. Mixed with 0.1 random exploration, UCB receives a constant average regret, while the simple regret goes to zero, revealing the expected trade-off between regrets in the two tasks.

Example 1.

Consider two contextual bandit tasks with two arms and a context space of size two. The two tasks share the same reward distribution, with the mean reward for each context-arm pair reported in Figure 1. In the first task, the algorithm is allowed to use any policy and can update its policy adaptively. In the second task, the algorithm cannot update its policy and is forced to ignore the context, meaning that the second task is a non-contextual bandit.

Figure 1: Mean reward for Example 1

| Mean reward | $x_1$ | $x_2$ | Average |
|---|---|---|---|
| $a_1$ | 1 | $\epsilon$ | $(1+\epsilon)/2$ |
| $a_2$ | 0 | 1 | $1/2$ |

Figure 2: Average regret in task 1 (left) vs. simple regret in task 2 (right) over various time steps. The blue lines are UCB and the green lines are UCB + $0.1$-uniform exploration.
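The first-task side of this experiment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' experimental code: the value `EPS = 0.5`, the uniform context distribution, the Bernoulli reward noise, the per-context UCB1 index, and all function and variable names are choices made here. The quantities to look at are the visit counts of the pairs $(x_1, a_2)$ and $(x_2, a_1)$, which the task-1 optimal policy never plays.

```python
import math
import random

# Mean rewards from the Example 1 table. EPS stands in for the table's
# epsilon; the value 0.5 is an assumption made for this sketch.
EPS = 0.5
MEAN = {("x1", "a1"): 1.0, ("x1", "a2"): 0.0,
        ("x2", "a1"): EPS, ("x2", "a2"): 1.0}
CONTEXTS, ARMS = ("x1", "x2"), ("a1", "a2")


def run_task1(T, explore_prob, seed=0):
    """Run per-context UCB1 for T steps, mixed with uniform exploration.

    Returns (average regret, visit counts per context-arm pair).
    """
    rng = random.Random(seed)
    count = {k: 0 for k in MEAN}
    total = {k: 0.0 for k in MEAN}
    regret = 0.0
    for t in range(1, T + 1):
        x = rng.choice(CONTEXTS)  # uniform context distribution (assumed)
        if rng.random() < explore_prob:
            a = rng.choice(ARMS)  # uniform random exploration
        else:
            def index(arm):
                n = count[(x, arm)]
                if n == 0:
                    return float("inf")  # play every arm once first
                return total[(x, arm)] / n + math.sqrt(2 * math.log(t) / n)
            a = max(ARMS, key=index)
        # Bernoulli rewards: an assumed noise model for this sketch.
        reward = 1.0 if rng.random() < MEAN[(x, a)] else 0.0
        count[(x, a)] += 1
        total[(x, a)] += reward
        regret += max(MEAN[(x, b)] for b in ARMS) - MEAN[(x, a)]
    return regret / T, count


avg_ucb, cnt_ucb = run_task1(20_000, explore_prob=0.0)
avg_mix, cnt_mix = run_task1(20_000, explore_prob=0.1)
print("UCB:       avg regret", round(avg_ucb, 4), cnt_ucb)
print("UCB + 0.1: avg regret", round(avg_mix, 4), cnt_mix)
```

Under these assumptions, plain UCB visits $(x_1, a_2)$ only a logarithmic number of times, whereas the mixed variant keeps visiting it at a constant rate and pays a constant average regret for doing so.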

We rigorously characterize this trade-off between local regret minimization and global regret minimization, when there are changes in reward design and/or the policy space between tasks. We formulate a sequential contextual bandit framework with a discrete context space $\mathcal{X}$ and action space $\mathcal{A}$. Let $\operatorname{Reg}_i$ and $T_i$ denote the cumulative regret and number of time-steps in task $i$, respectively. We begin our analysis with a two-task scenario. Theorem 1 establishes a general condition on the changes under which the minimax rate of $\sqrt{\mathbb{E}[\operatorname{Reg}_1]}\,(\mathbb{E}[\operatorname{Reg}_2]/T_2) = \Omega(1)$ holds when the algorithm is not allowed to be adaptive within the second task. We show that a simple strategy that mixes any no-regret online algorithm with random exploration attains the minimax rate that is optimal in $T_1$ and $T_2$. Our results show that for minimizing the combined regret $\mathbb{E}[\operatorname{Reg}_1 + \operatorname{Reg}_2]$, an algorithm should aim for $\mathbb{E}[\operatorname{Reg}_1] = \Theta(T_1^{2/3})$ in the case of $T_1 = T_2$, demonstrating excess exploration compared with the optimal local regret rate $T_1^{1/2}$. Subsequent case studies explore when these conditions are satisfied in the presence of nonstationarity in the policy space and reward function.
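A back-of-the-envelope calculation may help to see where the $T_1^{2/3}$ rate comes from. It is only a sketch: it assumes the trade-off is tight up to a constant $c > 0$ (a symbol introduced here for illustration, not taken from the paper) and writes $r = \mathbb{E}[\operatorname{Reg}_1]$.

```latex
\begin{align*}
\mathbb{E}[\operatorname{Reg}_2] &\approx \frac{c\,T_2}{\sqrt{r}},
  \qquad r = \mathbb{E}[\operatorname{Reg}_1], \\
g(r) &= r + c\,T_2\, r^{-1/2}
  \qquad \text{(combined regret as a function of } r\text{)}, \\
g'(r) &= 1 - \frac{c\,T_2}{2}\, r^{-3/2} = 0
  \;\Longrightarrow\; r^\star = \left(\frac{c\,T_2}{2}\right)^{2/3}, \\
T_1 = T_2 = T: \quad r^\star &= \Theta\!\left(T^{2/3}\right),
  \qquad \frac{c\,T}{\sqrt{r^\star}} = \Theta\!\left(T^{2/3}\right).
\end{align*}
```

Since $g$ is convex on $r > 0$, $r^\star$ is its minimizer, so under this assumption both tasks incur regret of order $T^{2/3}$, which is larger than the $T^{1/2}$ rate that local regret minimization would target in the first task.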

We continue with three extensions of the two-task scenario. We first extend the results to the multiple-task case, showing that the minimax rate of $\sqrt{\sum_{j=1}^{i-1} \mathbb{E}[\operatorname{Reg}_j]}\,(\mathbb{E}[\operatorname{Reg}_i]/T_i) = \Omega(1)$ holds simultaneously for all tasks $i$ within which adaptivity is not allowed. The maximal sequence length for which this lower bound is valid across all adjacent tasks is determined by the product $|\mathcal{X}| \times |\mathcal{A}|$. We further extend our discussion to the nonlinear case, where we show that such a sequence of tasks can be exponentially long, suggesting a strong motivation to apply excess exploration in real-world RL applications. The third extension considers a more prevalent notion of changes between tasks: nonstationarity in reward distributions. We show that under a new notion of robust simple regret, the aforementioned tension between local regret minimization and global regret minimization still holds when there are changes in reward distribution.

1.1 Related Work

In a contextual bandit setting, the simple regret is simply the expected cumulative regret divided by the length of the horizon, when the policy is fixed. Our two-task case study extends the existing literature on optimizing both cumulative regret and simple regret by introducing non-stationarity. We denote the cumulative regret by $\operatorname{CR}$ and the simple regret by $\operatorname{SR}$. In a multi-armed bandit setting, Bubeck et al. (2011) show a trade-off of $\operatorname{SR} \times \exp(D \cdot \operatorname{CR}) \geq \Delta/2$, where $\Delta$ is the minimal gap and $D$ is a constant. This bound is substantially weaker than our bound as $\Delta \leq 1$. Krishnamurthy et al. (2023) study this trade-off under the standard contextual bandit, where they show that any learning algorithm that achieves a worst-case $\mathcal{O}(\sqrt{\phi/T})$ simple regret bound has a minimax lower bound of $\sqrt{A^2 T/\phi}$ on its cumulative regret. Beyond the standard stationary setting, Simchi-Levi and Wang (2023) study the trade-off between cumulative regret and the statistical power of inferring the treatment effect in a stationary multi-armed bandit problem. Similar results are shown in Gao et al. (2022) and Dai et al. (2023). They show a similar minimax lower bound, $\sqrt{\operatorname{CR}} \times \text{inference error} = \Omega(1)$. Qin and Russo (2023) study the multi-armed bandit problem with a different cost of exploration for different arms. This is equivalent to having a different reward function from the reward for calculating simple regret. From an empirical point of view, Athey et al. (2022) propose the TreeBagging algorithm that controls the level of exploration in online charitable giving. They observe that the uniform randomization algorithm learned a policy with the lowest simple regret while receiving the highest cumulative regret over a sequence of 10 implementations.

Our framework can also be seen as a generalization of the previous nonstationary bandit setting. Previous works on non-stationary bandits consider only changes in the reward distribution (Garivier and Moulines, 2008; Raj and Kalyani, 2017; Wu et al., 2018; Kim and Tewari, 2020; Hong et al., 2023). We introduce a broader definition of non-stationarity that includes changes in the reward function and policy space. When the reward distribution does change, we propose a new metric named robust simple regret to accommodate the fact that no algorithm can minimize simple regret to 0 even with infinite data, which is new in the literature (Section 5).

A main message of this paper is that the RL algorithm should never stop exploring or learning when there is non-stationarity. Conceptually, this is also the main characteristic of a continual learning agent (Abel et al., 2023). The literature on continual learning has primarily focused on learning agents that never forget (Lange et al., 2019; Peng et al., 2023; Wang et al., 2023). However, in an RL context, the algorithm should strategically forget previously learned knowledge when the outcome distribution changes drastically and maintain the right level of continual exploration to obtain new information (Khetarpal et al., 2020).

2 Problem Setups

Notations.

For a set $\mathcal{X}$, we denote by $\Delta(\mathcal{X})$ the set of all distributions over $\mathcal{X}$. For $N \in \mathbb{Z}$, we let $[N] = \{1, 2, \dots, N\}$. We use $\mathcal{O}(\cdot)$, $\Theta(\cdot)$, $\Omega(\cdot)$ to denote the big-$O$, big-Theta and big-Omega notations. For a vector $\nu \in \mathbb{R}^d$, we let $[\nu]_i$ be the $i$-th element of the vector. We denote by $D_{\operatorname{KL}}(P \mid Q)$ the KL divergence between two probability measures $P$ and $Q$ with $P \ll Q$.

Sequential multitask contextual bandit framework.

We consider learning on a sequence of $N$ contextual bandit tasks. Different from the standard online contextual bandit setting, each task is a contextual bandit with rich observations and potentially restricted policy space. Specifically, we define each task $i \in [N]$ as a tuple of $(\mathcal{X}, \mathcal{A}, \mathcal{Y}, P, \Pi^{(i)}, f^{(i)})$, where the shared elements are the context space $\mathcal{X}$, the action space $\mathcal{A}$, the potentially high-dimensional outcome space $\mathcal{Y}$, and the outcome distribution $P: \mathcal{X} \times \mathcal{A} \mapsto \Delta(\mathcal{Y})$. The task-specific elements include the restricted policy space $\Pi^{(i)} \subset \mathcal{X} \mapsto \Delta(\mathcal{A})$ that is a collection of mappings from the current context to a distribution over the action space. We also allow different tasks to have different reward functions $f^{(i)}: \mathcal{Y} \mapsto [0, 1]$. For simplicity, we assume that there is a fixed context distribution $P_X$ across all tasks. Since $(\mathcal{X}, \mathcal{A}, \mathcal{Y})$ are assumed to be shared across different tasks, we denote a task setup by $S^{(i)} = (\Pi^{(i)}, f^{(i)}, P)$, when $(\mathcal{X}, \mathcal{A}, \mathcal{Y})$ are clear from the context. Note that the restricted policy spaces and the reward functions are assumed to be known by the agent before interacting with the environment, while the outcome distribution $P$ is unknown and has to be learned.

The agent interacts with each task $i$ for $T_i$ steps. At step $t \in [T_i]$, the agent observes context $X_{i,t} \in \mathcal{X}$ and decides on a policy $\pi_{i,t} \in \Pi^{(i)}$. Then the agent samples an action $A_{i,t} \sim \pi_{i,t}(X_{i,t})$ and receives a feedback vector $Y_{i,t} \sim P(\cdot \mid X_{i,t}, A_{i,t})$. The goal of the agent for task $i$ is to maximize the cumulative rewards $\sum_{t=1}^{T_i} f^{(i)}(Y_{i,t})$. The optimal policy of a task $S = (\Pi, f, P)$ is given by

$$\pi^{\star}_{S} \in \operatorname*{arg\,max}_{\pi \in \Pi} \mathbb{E}_{X \sim P_X} \mathbb{E}_{A \sim \pi(X)} \mathbb{E}_{Y \sim P(\cdot \mid X, A)} f(Y).$$

Note that, for now, we keep the outcome distribution the same across tasks in order to focus on changes in policy spaces and reward functions. In Section 5, we extend our discussion to a sequence of tasks where the underlying outcome distribution could shift between tasks.
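For finite context and action spaces, the arg max in the optimal-policy definition can be found by enumeration. The sketch below does this for the mean rewards of Example 1 under two policy classes: all deterministic policies, and context-free ones. The value `EPS = 0.5`, the uniform `P_X`, and the helper names are assumptions made for illustration. Restricting to deterministic policies loses nothing here, because the expected reward is linear in the policy for both classes.

```python
# Tabular sketch: contexts (x1, x2) and arms (a1, a2) are indexed 0 and 1.
EPS = 0.5                      # assumed value of the table's epsilon
P_X = [0.5, 0.5]               # assumed uniform context distribution
R = [[1.0, 0.0],               # R[x][a]: mean reward in context x under arm a
     [EPS, 1.0]]


def value(policy):
    """Expected reward of a deterministic policy, where policy[x] is an arm."""
    return sum(P_X[x] * R[x][policy[x]] for x in range(len(P_X)))


def best_policy(policy_class):
    """Brute-force arg max of the expected reward over a finite policy class."""
    return max(policy_class, key=value)


all_policies = [(u, v) for u in range(2) for v in range(2)]
context_free = [(a, a) for a in range(2)]

pi_task1 = best_policy(all_policies)   # a1 in x1, a2 in x2
pi_task2 = best_policy(context_free)   # a1 in every context
print(pi_task1, value(pi_task1))
print(pi_task2, value(pi_task2))
```

The two optima differ even though the mean rewards are identical, which is the kind of change in $\Pi^{(i)}$ that the framework is meant to capture.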

Motivations for changes between tasks.

The setups for different tasks may change in various ways. We provide the following motivating examples of changes in $(\Pi^{(i)}, f^{(i)})$ across tasks.

  1. Different tasks may set up different reward functions $f^{(i)}$. For instance, in study one, we are only interested in maximizing cumulative rewards $\sum_{t=1}^{T_1} R_{1,t}$, where $R_{1,t} = f^{(1)}(Y_{1,t}) \equiv [Y_{1,t}]_1$. At the end of the first task, the domain expert may decide that $[Y_{1,t}]_2$ is a significant side effect that should be controlled. Therefore, they propose $f^{(2)}(Y) = [Y]_1 - \alpha [Y]_2$ for the second task.

  2. Different tasks may set up different target policy classes $\Pi^{(i)}$. In practice, contexts in $\mathcal{X}$ can be high-dimensional vectors, due to the complexity of real-world observations. In the first task, we may not have the computational resources to maximize over the space of all policies that take the context into account. Hence, the domain expert decides to optimize only in the space of context-independent policies $\Pi^{(1)} = \{\pi \in \Pi : \pi(a \mid x_1) = \pi(a \mid x_2), \text{ for all } x_1, x_2 \in \mathcal{X}\}$. In the second task, with evidence accumulating, the expert decides that certain components of $\mathcal{X}$ are relevant to the task and should be included in the new policies. In the reverse case, the domain expert may decide that a certain feature is irrelevant or raises fairness concerns and should be removed from the feature space.
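The reward-design change in the first item can be illustrated with a small Python sketch. Because the agent observes the full outcome vector $Y$ rather than only the reward, outcomes logged in the first task can be re-scored under the second task's reward function. The logged values and the weight `ALPHA = 0.5` below are hypothetical, chosen only so that both rewards stay in $[0, 1]$.

```python
# Hypothetical logged outcomes Y = ([Y]_1, [Y]_2) from the first task.
logged_outcomes = [(0.9, 0.1), (0.6, 0.8), (0.7, 0.0)]
ALPHA = 0.5  # assumed penalty weight on the side effect [Y]_2


def f1(y):
    """Task-1 reward: only the first outcome component matters."""
    return y[0]


def f2(y, alpha=ALPHA):
    """Task-2 reward: first component minus a penalty on the second."""
    return y[0] - alpha * y[1]


rewards_task1 = [f1(y) for y in logged_outcomes]
rewards_task2 = [f2(y) for y in logged_outcomes]
print(rewards_task1)
print(rewards_task2)
```

Re-scoring only helps for context-action pairs that were actually visited in the first task, which is why the amount of exploration there matters for the second task.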

2.1 Performance metric

The cumulative regret of task $i$ is given by Definition 1. The goal of an agent that learns on the sequence of $N$ tasks is to minimize the sum of cumulative regret $\sum_{i=1}^{N} \operatorname{Reg}_i$ over all the tasks. We refer to the cumulative regret of each task $i$ as the local regret and to the cumulative regret over all the tasks as the global regret. Throughout the paper, we discuss the tension between the local regret and the global regret under the aforementioned changes in task setups.

Definition 1 (Cumulative regret).

Denote the mean reward of task $i$ given context $x$ and action $a$ by $R^{(i)}(x, a) \coloneqq \mathbb{E}_{Y \sim P(\cdot \mid x, a)} f^{(i)}(Y)$. The cumulative regret within task $i$ is defined by

$$\operatorname{Reg}_i \coloneqq \sum_{t=1}^{T_i} \left[ \max_{\pi \in \Pi^{(i)}} \mathbb{E}_{X_{i,t} \sim P_X} \mathbb{E}_{A \sim \pi(X_{i,t})} \left[ R^{(i)}(X_{i,t}, A) - R^{(i)}(X_{i,t}, A_{i,t}) \right] \right].$$

A significant portion of the paper discusses tasks in which the algorithm is not allowed to learn adaptively, thereby requiring a commitment to a fixed policy for the entire duration of each task. In such scenarios, minimizing the cumulative regret is equivalent to minimizing the simple regret defined in Definition 2.

Definition 2 (Simple regret).

Define the simple regret of policy $\pi$ given a task $S = (\Pi, f, P)$ with mean reward function $R(x, a) \coloneqq \mathbb{E}_{Y \sim P(\cdot \mid x, a)} f(Y)$ by

$$\operatorname{SR}_S(\pi \mid x) \coloneqq \sum_{a} \pi^{\star}_{S}(a \mid x) R(x, a) - \sum_{a} \pi(a \mid x) R(x, a), \text{ and } \operatorname{SR}_S(\pi) \coloneqq \sum_{x} P_X(x) \operatorname{SR}_S(\pi \mid x).$$

Note that $\operatorname{SR}_S(\pi \mid x)$ can potentially be negative depending on how the policy space is defined. However, $\operatorname{SR}_S(\pi) \geq 0$ for any policy $\pi \in \Pi$.
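The remark about negative per-context simple regret can be checked numerically on Example 1. In the sketch below, the policy class is the context-free one, so $\pi^{\star}_{S}$ always plays $a_1$, and the comparison policy always plays $a_2$. The value `EPS = 0.5`, the uniform `P_X`, and the function names are assumptions introduced for illustration.

```python
EPS = 0.5                      # assumed value of the table's epsilon
P_X = [0.5, 0.5]               # assumed uniform context distribution
R = [[1.0, 0.0], [EPS, 1.0]]   # R[x][a]: mean rewards from Example 1


def simple_regret_given_x(pi, pi_star, x):
    """SR_S(pi | x): value gap between pi_star and pi in context x."""
    return sum((pi_star[x][a] - pi[x][a]) * R[x][a] for a in range(len(R[x])))


def simple_regret(pi, pi_star):
    """SR_S(pi): per-context simple regret averaged over P_X."""
    return sum(P_X[x] * simple_regret_given_x(pi, pi_star, x)
               for x in range(len(P_X)))


# Policies are tables pi[x][a] = probability of arm a in context x.
pi_star = [[1.0, 0.0], [1.0, 0.0]]   # context-free optimum: always a1
pi_alt = [[0.0, 1.0], [0.0, 1.0]]    # context-free alternative: always a2

sr_x1 = simple_regret_given_x(pi_alt, pi_star, 0)
sr_x2 = simple_regret_given_x(pi_alt, pi_star, 1)
sr = simple_regret(pi_alt, pi_star)
print(sr_x1, sr_x2, sr)
```

In context $x_2$ the alternative policy beats the context-free optimum, so the per-context value is negative ($\epsilon - 1$), while the average over contexts is $\epsilon/2 \geq 0$.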

Definition 3 (Occupancy measure).

We define $\mu_{\pi}(x, a) \coloneqq \mathbb{E}_{X \sim P_X, A \sim \pi} \mathbbm{1}_{\{X = x, A = a\}}$ as the occupancy measure of a running policy $\pi$. Note that since we assume the same context distribution $P_X$, the occupancy measure of a policy $\pi$ is task independent.
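For a tabular policy the occupancy measure reduces to $\mu_{\pi}(x, a) = P_X(x)\,\pi(a \mid x)$. The following sketch, with an assumed uniform `P_X` and hypothetical helper names, compares the occupancy of the task-1 optimal policy from Example 1 with the occupancy after mixing in uniform exploration at rate 0.1.

```python
P_X = [0.5, 0.5]  # assumed uniform context distribution


def occupancy(pi):
    """mu_pi[x][a] = P_X(x) * pi(a | x) for a tabular policy pi[x][a]."""
    return [[P_X[x] * p for p in pi[x]] for x in range(len(P_X))]


def mix_with_uniform(pi, rate):
    """Play pi with probability 1 - rate and a uniformly random arm otherwise."""
    return [[(1 - rate) * p + rate / len(row) for p in row] for row in pi]


pi_task1 = [[1.0, 0.0], [0.0, 1.0]]   # a1 in x1, a2 in x2
mu_greedy = occupancy(pi_task1)
mu_mixed = occupancy(mix_with_uniform(pi_task1, 0.1))
print(mu_greedy)
print(mu_mixed)
```

The greedy policy puts zero mass on $(x_1, a_2)$ and $(x_2, a_1)$, while the mixed policy keeps mass $0.025$ on each of them under these assumptions.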

2.2 Learning algorithms

This section is motivated by the practical considerations that some tasks may require nonadaptive algorithms. We introduce a formal definition to clarify what constitutes an algorithm and its nonadaptive nature in this context.

Let $\tau_{i,t}\in(\mathcal{X},\mathcal{A},\mathcal{Y})^{\star}$ denote the sequence of observations up to the $t$-th step in task $i$. A learning algorithm $L$ is a mapping from the previous observations to a policy in $\Pi^{(i)}$ at any given step $t$ in task $i$. The agent running algorithm $L$ randomly samples an action $A_{i,t}\sim[L(\tau_{i,t})](X_{i,t})$. Furthermore, an algorithm is said to be nonadaptive in task $i$ if it is nonadaptive to any new data collected during task $i$, as described in Definition 4.

Definition 4 (Non-adaptive algorithm).

We call an algorithm non-adaptive in task $i$ if for all steps $t_{1}$, $t_{2}$ within the task, where $t_{1}<t_{2}\in[T_{i}]$, and for all sequences of observations $\tau_{i,t_{1}}$ and $\tau_{i,t_{2}}$ such that $\tau_{i,t_{1}}$ is a prefix of $\tau_{i,t_{2}}$, the algorithm satisfies $L(\tau_{i,t_{1}})=L(\tau_{i,t_{2}})$.
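One way to picture Definition 4 is a wrapper that fixes the policy once, from data gathered before the task, and then ignores every observation collected within the task. The sketch below is an illustration under assumptions: the history encoding as `(x, a, y)` tuples, the greedy base learner, and all names are hypothetical and not taken from the paper.

```python
import random

def make_nonadaptive(learner, history_before_task):
    """Wrap `learner` so that its policy is fixed at the start of the task.

    `learner` maps a history (list of (x, a, y) tuples) to a policy, and a
    policy maps a context x to a list of action probabilities.
    """
    frozen_policy = learner(list(history_before_task))

    def nonadaptive_learner(history):
        # Within-task observations in `history` are deliberately ignored.
        return frozen_policy

    return nonadaptive_learner

def sample_action(policy, x, rng):
    """Draw an action index from policy(x)."""
    probs = policy(x)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_learner(history, n_actions=2):
    """Hypothetical adaptive learner: greedy on empirical mean outcomes."""
    totals, counts = [0.0] * n_actions, [0] * n_actions
    for _, a, y in history:
        totals[a] += y
        counts[a] += 1
    means = [totals[a] / counts[a] if counts[a] else 0.0
             for a in range(n_actions)]
    best = max(range(n_actions), key=lambda a: means[a])
    return lambda x: [1.0 if a == best else 0.0 for a in range(n_actions)]

prior = [(0, 0, 1.0), (0, 1, 0.2)]   # data available before the task
L = make_nonadaptive(greedy_learner, prior)

h1 = [(0, 1, 5.0)]                   # within-task history at step t1
h2 = h1 + [(1, 1, 5.0)]              # longer history at step t2; h1 is a prefix

same_policy = L(h1) is L(h2)         # the property required by Definition 4
adaptive_changes = greedy_learner(prior + h1)(0) != greedy_learner(prior)(0)
action = sample_action(L(h2), 0, random.Random(0))
print(same_policy, adaptive_changes, action)
```

The wrapped learner returns the same policy for `h1` and `h2`, whereas the unwrapped greedy learner changes its choice after seeing the new observation.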

Denote by $\mathcal{L}_{i}$ the set of all learning algorithms that are nonadaptive in task $i$. For a set of task indices $\mathcal{I}$, we denote by $\mathcal{L}_{\mathcal{I}}\coloneqq\cap_{i\in\mathcal{I}}\mathcal{L}_{i}$ the set of algorithms that are simultaneously nonadaptive on all tasks in $\mathcal{I}$.

3 Results on Two Tasks

We commence with a two-task case, where the first task involves running an online learning algorithm focused on minimizing the regret within the task. In the second task, we employ a fixed policy that is learned offline from the dataset collected in the first task. This scenario is closely related to the contextual bandit setting in which the algorithm aims to minimize cumulative regret and simple regret simultaneously, with the difference that we allow changes between tasks.

In this section, we show that there is a trade-off between the regrets in the two tasks under various changes in task setups. The trade-off is substantially stronger than in the case without changes. In fact, this stronger trade-off has been shown in some special cases. For instance, in a multi-armed bandit setting, Simchi-Levi and Wang (2023) simultaneously minimize the cumulative regret and the average treatment effect (ATE) estimation error of the worst arm. This setting is inherently analogous to ours, since the ATEs of all the arms are essential for achieving a low simple regret under arbitrary changes in the policy space. They show that the product of the cumulative regret and the square of the worst-case estimation error is lower bounded by a constant in a minimax sense. We prove a similar lower bound in a more general case with a wider range of changes in $\Pi$ and $f$. Our results provide a more comprehensive view of this tension between cumulative regret and simple regret.

We denote an instance by $\bm{S}=(S^{(1)},S^{(2)})$, a sequence of two tasks. For an instance $\bm{S}$, we denote by $\mathcal{S}(\bm{S})$ the set of all instances that share the same policy spaces and reward functions $(\Pi^{(1)},\Pi^{(2)},f^{(1)},f^{(2)})$, while having different $P$. For some instance set $\mathcal{S}$, we study the following minimax multi-objective optimization problem:

$$\inf_{L\in\mathcal{L}_{2}}\sup_{\bm{S}\in\mathcal{S}}\left(\mathbb{E}[\operatorname{Reg}_{1}],\mathbb{E}[\operatorname{Reg}_{2}]\right). \tag{1}$$

To show a strong lower bound for the above multi-objective problem, we need the instance set $\mathcal{S}$ to be adequately rich. Theorem 1 provides general conditions on $\mathcal{S}$ that characterize a strong trade-off between the cumulative regrets in the two tasks.

Theorem 1.

Assume the instance set $\mathcal{S}$ is sufficiently large to ensure the existence of an instance $\bm{S}=(S^{(1)},S^{(2)})$ such that for all $\epsilon\in[0,1/4]$, we can find some $\bar{\bm{S}}=(\bar{S}^{(1)},\bar{S}^{(2)})\in\mathcal{S}(\bm{S})\cap\mathcal{S}$ satisfying the following conditions:

  1. There exists a unique optimal policy $\pi^{\star}_{S}$ for each $S\in\{S^{(1)},\bar{S}^{(1)},S^{(2)},\bar{S}^{(2)}\}$;

  2. $\mu_{\pi_{S^{(1)}}^{\star}}(x_{\star},a_{\star})=\mu_{\pi_{\bar{S}^{(1)}}^{\star}}(x_{\star},a_{\star})=0$ and $\min\{\operatorname{SR}_{S^{(1)}}(\pi),\operatorname{SR}_{\bar{S}^{(1)}}(\pi)\}\geq c_{1}\mu_{\pi}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(1)}$;

  3. $\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})=c_{2}>0$;

  4. $\operatorname{SR}_{S^{(2)}}(\pi)\geq\epsilon\left(\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star},a_{\star})-\mu_{\pi}(x_{\star},a_{\star})\right)/2$ for all $\pi\in\Pi^{(2)}$;

  5. $\operatorname{SR}_{\bar{S}^{(2)}}(\pi)\geq\epsilon\left(\mu_{\pi}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star},a_{\star})\right)/2$ for all $\pi\in\Pi^{(2)}$;

  6. $P$ and $\bar{P}$ only differ in $(x_{\star},a_{\star})$ and $D_{\operatorname{KL}}(P(\cdot\mid x_{\star},a_{\star})\mid\bar{P}(\cdot\mid x_{\star},a_{\star}))\leq\epsilon^{2}$,

where $c_{1},c_{2}>0$ are constants and $(x_{\star},a_{\star})\in\mathcal{X}\times\mathcal{A}$ is some context-action pair. Then we have the following lower bound:

$$\inf_{L\in\mathcal{L}_{2}}\sup_{\bm{S}\in\mathcal{S}}\sqrt{\mathbb{E}\left[\operatorname{Reg}_{1}\right]}\,\mathbb{E}\left[\operatorname{Reg}_{2}\right]/T_{2}=\Omega\left(c_{1}c_{2}\right).$$

Discussion of Theorem 1.

For sufficiently large $T_{1}$, a no-regret online algorithm tends to converge to the optimal policy (assuming a unique optimal policy exists and the suboptimality gap is bounded away from zero). The dataset collected from online learning in the first task may not be sufficient for offline policy optimization in the second task when $\max_{x,a}\mu_{\pi^{\star}_{S^{(2)}}}(x,a)/\mu_{\pi^{\star}_{S^{(1)}}}(x,a)=\infty$. This connects closely to the offline learning literature, where it has been shown that offline learning is fundamentally hard if the single-policy concentrability is unbounded (Chen and Jiang, 2019). Single-policy concentrability is the ratio between the occupancy measure of the optimal policy and that of the behavior policy that collects the offline dataset.
Condition 2 guarantees that the algorithm incurs regret whenever $(x_{\star},a_{\star})$ is visited in the first task, and Conditions 3, 4, and 5 guarantee that it is necessary to visit $(x_{\star},a_{\star})$ to distinguish between $S^{(2)}$ and $\bar{S}^{(2)}$, which creates the tension between the regrets of the two tasks.
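To make the tension concrete, the short derivation below works out what the bound in Theorem 1 implies for the second task when the first task is run at the usual $\sqrt{T_{1}}$ rate. The constants $C$ and $c$ are placeholders introduced here for illustration (the constant hidden in $\Omega(\cdot)$ is written as $c$); this is a sketch of the implication, not a statement taken from the paper.

```latex
% Assumption: the algorithm attains E[Reg_1] <= C * sqrt(T_1) on every instance, for some C > 0.
% Theorem 1 provides a constant c > 0 with
\sup_{\bm{S}\in\mathcal{S}} \sqrt{\mathbb{E}[\operatorname{Reg}_{1}]}\;
  \frac{\mathbb{E}[\operatorname{Reg}_{2}]}{T_{2}} \;\geq\; c\,c_{1}c_{2}.
% Bounding the first factor by its worst case, \sqrt{C}\,T_{1}^{1/4}, gives
\sqrt{C}\,T_{1}^{1/4}\;\sup_{\bm{S}\in\mathcal{S}}
  \frac{\mathbb{E}[\operatorname{Reg}_{2}]}{T_{2}} \;\geq\; c\,c_{1}c_{2}
\quad\Longrightarrow\quad
\sup_{\bm{S}\in\mathcal{S}} \mathbb{E}[\operatorname{Reg}_{2}]
  \;\geq\; \frac{c\,c_{1}c_{2}}{\sqrt{C}}\cdot\frac{T_{2}}{T_{1}^{1/4}}.
```

Under this assumption, the worst-case regret in the second task scales at least as $T_{2}/T_{1}^{1/4}$, so lowering it requires accepting more than $\sqrt{T_{1}}$ regret in the first task.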

3.1 Case studies on Theorem 1 application

In this section, we explore three case studies of potential practical interest to demonstrate how the conditions specified in Theorem 1 can be satisfied by including potential changes in $\Pi$ and $f$. To ease the demonstration, we consider a uniform distribution $P_{X}$ over the context space $\mathcal{X}$. The first two cases are two-task contextual bandit problems with $\mathcal{A}=\{a_{1},a_{2}\}$ and $\mathcal{X}=\{x_{1},x_{2}\}$. The third case is an MAB problem with $\mathcal{A}=\{a_{1},a_{2}\}$.

Case I: adding a new feature.

In real-world implementations, some features that were excluded from the input may be added back in later tasks. To conceptualize this, we consider $\Pi^{(1)}$ as the set of policies whose decisions are independent of the current features, formalized as $\Pi^{(1)}=\{\pi:\pi(\cdot\mid x_{1})=\pi(\cdot\mid x_{2}),\text{ for all }x_{1},x_{2}\in\mathcal{X}\}$. Conversely, let $\Pi^{(2)}$ encompass all possible policies. The mean rewards induced by the outcome distributions $P$ and $\bar{P}$ are given in Table 2. For all positive $\epsilon$, $\pi^{\star}_{S^{(1)}}(a_{1}\mid\cdot)\equiv 1$. Any policy in $\Pi^{(1)}$ with a non-zero occupancy measure on $(x_{2},a_{2})$ has the same occupancy measure on $(x_{1},a_{2})$, leading to a simple regret of at least $1/2-\epsilon$ per unit of probability placed on $a_{2}$.
However, the optimal policies of $S^{(2)}$ and $\bar{S}^{(2)}$ disagree on $x_{2}$. To distinguish $\bar{S}$ from $S$, the algorithm is forced to visit $(x_{2},a_{2})$ in the first task. The conditions in Theorem 1 are satisfied with $c_{1}=1/2-\epsilon$ and $c_{2}=1/2$.
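The claims made for Case I can be checked numerically with a short enumeration. The sketch below hard-codes the Case I reward table with an illustrative value of $\epsilon$, and searches only deterministic policies, which suffices here because the value is linear in the policy; the helper names are hypothetical.

```python
from itertools import product

def value(policy, R):
    """Average reward of a deterministic policy under uniform contexts.

    policy[x] is the action index played in context x; R[a][x] is the mean reward.
    """
    return sum(R[policy[x]][x] for x in range(len(policy))) / len(policy)

eps = 0.1  # illustrative choice
R     = [[1 - eps, 1 - eps], [0.0, 1.0]]           # rows: a1, a2; columns: x1, x2
R_bar = [[1 - eps, 1 - eps], [0.0, 1.0 - 2 * eps]]

all_policies = list(product(range(2), repeat=2))          # Pi^(2): all policies
context_free = [p for p in all_policies if p[0] == p[1]]  # Pi^(1): feature unused

best_task1 = {name: max(context_free, key=lambda p: value(p, tab))
              for name, tab in [("S", R), ("S_bar", R_bar)]}
best_task2 = {name: max(all_policies, key=lambda p: value(p, tab))
              for name, tab in [("S", R), ("S_bar", R_bar)]}

# Regret of "always a2" in task 1 under S, and the occupancy gap on (x2, a2)
# between the two task-2 optimal policies (each context has probability 1/2).
regret_always_a2 = value((0, 0), R) - value((1, 1), R)
c2 = 0.5 * (best_task2["S"][1] == 1) - 0.5 * (best_task2["S_bar"][1] == 1)
print(best_task1, best_task2, regret_always_a2, c2)
```

With these values, both first-task optimal policies always play $a_{1}$, while the second-task optimal policies differ only in context $x_{2}$.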

Table 1: Reward tables for different case studies

Table 2: Case I

| $R$ | $x_{1}$ | $x_{2}$ | $\bar{R}$ | $x_{1}$ | $x_{2}$ |
|---|---|---|---|---|---|
| $a_{1}$ | $1-\epsilon$ | $1-\epsilon$ | $a_{1}$ | $1-\epsilon$ | $1-\epsilon$ |
| $a_{2}$ | $0$ | $1$ | $a_{2}$ | $0$ | $1-2\epsilon$ |

Table 3: Case II

| $R$ | $x_{1}$ | $x_{2}$ | $\bar{R}$ | $x_{1}$ | $x_{2}$ |
|---|---|---|---|---|---|
| $a_{1}$ | $1$ | $0$ | $a_{1}$ | $1$ | $\epsilon$ |
| $a_{2}$ | $\epsilon/2$ | $1$ | $a_{2}$ | $\epsilon/2$ | $1$ |

Table 4: Case III

| | $R^{(1)}$ | $R^{(2)}$ | $\bar{R}^{(1)}$ | $\bar{R}^{(2)}$ |
|---|---|---|---|---|
| $a_{1}$ | $1$ | $0.5$ | $1$ | $0.5$ |
| $a_{2}$ | $0.8$ | $0.5-\epsilon/2$ | $0.8$ | $0.5+\epsilon/2$ |

Case II: removing an old feature.

Some features may have to be removed over a sequence of implementations due to potential ethical issues. To model this, we let $\Pi^{(1)}$ be the set of all policies and $\Pi^{(2)}=\{\pi:\pi(\cdot\mid x_{1})=\pi(\cdot\mid x_{2}),\text{ for all }x_{1},x_{2}\in\mathcal{X}\}$. Table 3 shows a pair of instances where the optimal policies for the first task always select action $a_{1}$ under context $x_{1}$ and action $a_{2}$ under context $x_{2}$. Any non-zero occupancy measure on $(x_{2},a_{1})$ induces an instant regret of at least $1-\epsilon$ for both $S^{(1)}$ and $\bar{S}^{(1)}$.
Nevertheless, the optimal actions for $S^{(2)}$ and $\bar{S}^{(2)}$ are $a_{2}$ and $a_{1}$, respectively, so the first task must cover $(x_{2},a_{1})$ in order to distinguish between $S$ and $\bar{S}$. The conditions in Theorem 1 are satisfied with $c_{1}=1-\epsilon$ and $c_{2}=1/2$.
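The same kind of enumeration can be used to check Case II, where the policy space shrinks instead of growing. The sketch below hard-codes the Case II reward table with an illustrative $\epsilon$ and again restricts attention to deterministic policies; the helper names are hypothetical.

```python
from itertools import product

def value(policy, R):
    """Average reward of a deterministic policy under uniform contexts.

    policy[x] is the action index played in context x; R[a][x] is the mean reward.
    """
    return sum(R[policy[x]][x] for x in range(len(policy))) / len(policy)

eps = 0.1  # illustrative choice
R     = [[1.0, 0.0], [eps / 2, 1.0]]   # rows: a1, a2; columns: x1, x2
R_bar = [[1.0, eps], [eps / 2, 1.0]]

all_policies = list(product(range(2), repeat=2))          # Pi^(1): all policies
context_free = [p for p in all_policies if p[0] == p[1]]  # Pi^(2): feature removed

best_task1 = {name: max(all_policies, key=lambda p: value(p, tab))
              for name, tab in [("S", R), ("S_bar", R_bar)]}
best_task2 = {name: max(context_free, key=lambda p: value(p, tab))
              for name, tab in [("S", R), ("S_bar", R_bar)]}

# Smallest instant regret of playing a1 in context x2 across the two instances.
gap_x2_a1 = min(tab[1][1] - tab[0][1] for tab in (R, R_bar))
print(best_task1, best_task2, gap_x2_a1)
```

Here the first-task optimal policy is the same under $R$ and $\bar{R}$, while the best context-independent action flips from $a_{2}$ to $a_{1}$, and the only entry that separates the two instances is $(x_{2},a_{1})$.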

Case III: change of reward function.

The reward functions may change over a sequence of implementations. For instance, let the outcome be $Y_{i,t}=(R_{i,t},W_{i,t})$, where $R_{i,t}$ is the primary outcome we intend to maximize and $W_{i,t}$ is a potential side effect. In the first implementation, we aim to maximize the primary outcome $R_{i,t}$ with reward function $f^{(1)}(Y_{1,t})=R_{1,t}$. However, the domain expert may realize that $W_{i,t}$ is a strong side effect that should be controlled in the second task, and thus set the reward function to $f^{(2)}(Y_{2,t})=R_{2,t}-\alpha W_{2,t}$. With this motivation, we construct a pair of multi-armed bandit instances with no context.
In Table 4, we report the mean reward of each arm under the different combinations of $f^{(1)}$, $f^{(2)}$, $P$, and $\bar{P}$. In the first task, a regret of $0.2$ is incurred whenever $a_{2}$ is pulled, while sufficiently many pulls of $a_{2}$ are necessary to distinguish $P$ from $\bar{P}$. It can be verified that the conditions in Theorem 1 hold with $c_{1}=0.2$ and $c_{2}=1$.
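A quick check of the Case III numbers is sketched below. It hard-codes the mean rewards from the Case III table with an illustrative $\epsilon$; the dictionary layout and function name are assumptions made for illustration.

```python
eps = 0.1  # illustrative choice

# Mean rewards per arm (a1, a2); task 1 uses f^(1), task 2 uses f^(2).
mean_reward = {
    ("task1", "P"):     [1.0, 0.8],
    ("task1", "P_bar"): [1.0, 0.8],
    ("task2", "P"):     [0.5, 0.5 - eps / 2],
    ("task2", "P_bar"): [0.5, 0.5 + eps / 2],
}

def best_arm(means):
    """Index of the arm with the largest mean reward."""
    return max(range(len(means)), key=lambda a: means[a])

best = {key: best_arm(means) for key, means in mean_reward.items()}

# Regret incurred by one pull of a2 in the first task.
task1_means = mean_reward[("task1", "P")]
regret_per_pull_a2 = task1_means[0] - task1_means[1]
print(best, regret_per_pull_a2)
```

Arm $a_{1}$ is optimal in the first task under both outcome distributions, whereas the optimal arm in the second task flips between $P$ and $\bar{P}$, and only the mean of $a_{2}$ differs between the two.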

A detailed verification of how the three cases satisfy the conditions in Theorem 1 is deferred to Appendix C.

3.2 Optimal level of exploration

As implied by Theorem 1, any algorithm that achieves an optimal rate in $\operatorname{Reg}_{1}$ is suboptimal in $\operatorname{Reg}_{2}$. To trade off between the two goals, the algorithm needs to employ additional exploration in the first task. In this section, we characterize the optimal level of additional exploration in different regimes of $T_{1}$ and $T_{2}$. Since we primarily focus on the role of the horizons, we omit the dependence on $|\mathcal{X}|$ and $|\mathcal{A}|$ throughout the discussion in this section.

Recall that our primary goal is to minimize the global regret, that is, the sum of the cumulative regrets of the two tasks. Proposition 1 gives a minimax lower bound for this sum that is the maximum of three terms: $T_{2}/\sqrt{T_{1}}$, $T_{2}^{2/3}$, and $\sqrt{T_{1}}$. The $T_{2}/\sqrt{T_{1}}$ term corresponds to the case where $\mathbb{E}[\operatorname{Reg}_{2}]$ dominates $\mathbb{E}[\operatorname{Reg}_{1}]$; here $T_{2}/\sqrt{T_{1}}$ is the minimax rate of the second-task regret for any dataset collected during the first task. The second term corresponds to the rate characterized by Theorem 1. The last term, $\sqrt{T_{1}}$, is the minimax rate of regret minimization in the first task. This corresponds to the case where $\mathbb{E}[\operatorname{Reg}_{1}]$ dominates $\mathbb{E}[\operatorname{Reg}_{2}]$.

Proposition 1.

Following the same conditions on the instance set $\mathcal{S}$ as in Theorem 1, the following minimax lower bound holds:

$$\inf_{L_2\in\mathcal{L}}\sup_{S\in\mathcal{S}}\mathbb{E}[\operatorname{Reg}_1+\operatorname{Reg}_2]=\Omega\left(\max\left\{\frac{T_2}{\sqrt{T_1}},\,T_2^{2/3},\,\sqrt{T_1}\right\}\right) \tag{2}$$

We show that a simple algorithm that mixes a minimax-optimal online learning algorithm with purely random exploration has a global regret upper bound that matches the lower bound in Theorem 1 up to a factor of $|\mathcal{X}||\mathcal{A}|$. This also allows us to achieve any point on the Pareto frontier up to a factor of $|\mathcal{X}||\mathcal{A}|$. The parameter $\alpha$ controls the level of additional exploration in the first task.

Theorem 2.

Let $L_0$ be an online learning algorithm with a regret bound of $\mathcal{O}(\sqrt{|\mathcal{X}||\mathcal{A}|T_1})$ on the first task. Let the algorithm for the first task be $L_\alpha(\tau)=(1-\alpha)L_0(\tau)+\alpha\pi_0$, where $\tau$ denotes any past observations and $\pi_0$ is the uniform random policy. For any choice of $\alpha\in[|\mathcal{X}||\mathcal{A}|/\sqrt{T_1},1]$, there exists an offline learning algorithm for the second task such that

$$\mathbb{E}[\operatorname{Reg}_1]=\mathcal{O}\left(\alpha T_1\right),\quad\text{and}\quad\mathbb{E}[\operatorname{Reg}_2/T_2]=\mathcal{O}\left(\sqrt{\frac{(|\mathcal{X}||\mathcal{A}|)^2}{\alpha T_1}}\right). \tag{3}$$
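The mixture $L_\alpha$ in Theorem 2 can be sampled one round at a time: with probability $\alpha$ draw from the uniform policy $\pi_0$, otherwise follow the base learner $L_0$. The following is a minimal Python sketch under that reading. The function names and the way the base learner's action is passed in are assumptions made for illustration and do not come from the paper.

```python
import random


def mixed_action(base_action, actions, alpha, rng=random):
    """Sample one action from L_alpha = (1 - alpha) * L_0 + alpha * pi_0.

    base_action: action already sampled from the base learner L_0
                 (hypothetical input; the paper does not fix an interface).
    actions:     list of all available actions; pi_0 is uniform over it.
    alpha:       probability of taking a uniformly random action.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if rng.random() < alpha:
        return rng.choice(actions)  # exploration round, uniform over actions
    return base_action              # otherwise follow the base learner


def min_alpha(num_contexts, num_actions, t1):
    """Smallest alpha allowed by Theorem 2, |X||A| / sqrt(T1), capped at 1."""
    return min(1.0, num_contexts * num_actions / t1 ** 0.5)
```

Mixing at the level of sampled actions is equivalent to mixing the two policies, since each round draws from $\pi_0$ with probability $\alpha$ and from $L_0(\tau)$ otherwise.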

By tuning the exploration rate $\alpha$, we are able to match the minimax lower bound provided in (2). In short, there are three regimes of $(T_1,T_2)$, for which we should choose different levels of the exploration rate $\alpha$ to balance $\mathbb{E}[\operatorname{Reg}_1]$ and $\mathbb{E}[\operatorname{Reg}_2]$. The first regime is $T_1\leq T_2^{2/3}$, where the first task is too short compared to the second task, and the algorithm should employ pure exploration in the first task ($\alpha=1$). This regime leads to a global regret of $\mathcal{O}(T_2/\sqrt{T_1})$. In the intermediate regime with $T_2^{2/3}<T_1\leq T_2^{4/3}$, the algorithm should employ additional exploration compared to algorithms that achieve a minimax optimal rate in a single task. Theorem 2 suggests an additional exploration rate of $\alpha=T_2^{2/3}/T_1$ and a global regret bound of $\mathcal{O}(T_2^{2/3})$. Note that in the special case of $T_1=T_2$, the rate $\alpha=T_1^{-1/3}$ indicates a regret bound of $\mathcal{O}(T_1^{2/3})$ in the first task. The third regime is $T_1>T_2^{4/3}$, where one should employ $\alpha=0$, meaning that no excess exploration is needed and the agent in the first task can minimize the local regret as much as possible. In this regime, the local regret in the first task can achieve the minimax optimal rate of $\sqrt{T_1}$.
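The intermediate choice of $\alpha$ can be recovered by balancing the two bounds of Theorem 2: setting $\alpha T_1 = T_2/\sqrt{\alpha T_1}$ gives $(\alpha T_1)^{3/2}=T_2$, that is, $\alpha=T_2^{2/3}/T_1$. The Python sketch below turns the three regimes into a lookup. It drops all constants and the $|\mathcal{X}||\mathcal{A}|$ factor, and the helper names `choose_alpha` and `lower_bound_rate` are hypothetical, introduced only for this illustration.

```python
def lower_bound_rate(t1, t2):
    """Rate of the lower bound (2), constants dropped."""
    return max(t2 / t1 ** 0.5, t2 ** (2 / 3), t1 ** 0.5)


def choose_alpha(t1, t2):
    """Return (regime, alpha, global_regret_rate) for horizons (t1, t2).

    Rates ignore constants and the |X||A| factor; this is a sketch of the
    three regimes discussed in the text, not a tuned implementation.
    """
    if t1 <= t2 ** (2 / 3):
        # First task too short: explore uniformly throughout.
        return "pure-exploration", 1.0, t2 / t1 ** 0.5
    if t1 <= t2 ** (4 / 3):
        # Intermediate regime: explore more than a single-task learner would.
        return "extra-exploration", t2 ** (2 / 3) / t1, t2 ** (2 / 3)
    # First task long enough: plain local regret minimization suffices.
    return "no-extra-exploration", 0.0, t1 ** 0.5
```

For example, `choose_alpha(1000, 1000)` falls in the intermediate regime with $\alpha\approx 0.1=T_1^{-1/3}$, which agrees with the special case $T_1=T_2$ mentioned above.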

In real-world applications, $T_2$ is often predetermined, and the researcher can decide how many samples to collect in the first task to ensure good learning in the second one. For instance, in an inventory management context (Madeka et al., 2022), the engineering team determines how long a learned policy should be deployed for the second task. In such cases, our theory indicates that one should choose $T_1>T_2^{4/3}$, so that greedy local regret minimization for the first task is justified.

4 Results on Multiple Tasks

In our two-task study, we highlighted the inherent dilemma between the local regrets in the first and second tasks. Now we extend our analysis to a sequence of multiple tasks. A significant property of the two-task scenario is the inability of the algorithm to adaptively learn in the second task. This restriction forces the algorithm to "overly" explore in the first task to propose a good policy for the second task, thereby introducing a tension between the regrets in the first and the second task. In fact, such tension exists between any task $i$ and its preceding tasks, whenever the algorithm is not allowed to adaptively learn within task $i$. We will also discuss the maximum number of rounds for which this trade-off can hold simultaneously.

In this section, we denote an instance by a sequence of $N$ task setups and their shared outcome distribution, $\bm{S}=(S^{(1)},\dots,S^{(N)})$. Let $\mathcal{I}$ be a set of indices such that any task $i\in\mathcal{I}$ has to be non-adaptive. Theorem 3 generalizes Theorem 1 by lower bounding the minimax rate of the product between the sum of the cumulative regrets over tasks $j=1,\dots,i-1$ and that over task $i$, simultaneously for all the indices in the set $\mathcal{I}$.

Theorem 3.

Recall that $\mathcal{S}(\bm{S})$ denotes the set of instances that share the same policy space and reward function with a given instance $\bm{S}$. Let $\mathcal{I}$ be an index set. Assume the instance set $\mathcal{S}$ is sufficiently large to ensure the existence of an instance $\bm{S}\in\mathcal{S}$ for which we can find some $\bar{\bm{S}}\in\mathcal{S}(\bm{S})$ such that for all $i\in\mathcal{I}$ and $\epsilon\in[0,1/4]$, it satisfies the following conditions:

  1. There exists a unique optimal policy $\pi^\star_S$ for each $S\in\{S^{(1)},\dots,S^{(i)},\bar{S}^{(1)},\dots,\bar{S}^{(i)}\}$;

  2. $\mu_{\pi^\star_{S^{(j)}}}(x_\star,a_\star)=\mu_{\pi^\star_{\bar{S}^{(j)}}}(x_\star,a_\star)=0$ and $\min\{\operatorname{SR}_{S^{(j)}}(\pi),\operatorname{SR}_{\bar{S}^{(j)}}(\pi)\}\geq c_1\mu_\pi(x_\star,a_\star)$ for all $\pi\in\Pi^{(j)}$ and $j\in[i-1]$;

  3. $\mu_{\pi^\star_{S^{(i)}}}(x_\star,a_\star)-\mu_{\pi^\star_{\bar{S}^{(i)}}}(x_\star,a_\star)>c_2$;

  4. $\operatorname{SR}_{S^{(i)}}(\pi)\geq\epsilon\,(\mu_{\pi^\star_{S^{(i)}}}(x_\star,a_\star)-\mu_\pi(x_\star,a_\star))/2$ for all $\pi\in\Pi^{(i)}$;

  5. $\operatorname{SR}_{\bar{S}^{(i)}}(\pi)\geq\epsilon\,(\mu_\pi(x_\star,a_\star)-\mu_{\pi^\star_{\bar{S}^{(i)}}}(x_\star,a_\star))/2$ for all $\pi\in\Pi^{(i)}$;

  6. $P$ and $\bar{P}$ only differ in $(x_\star,a_\star)$ and $D_{\operatorname{KL}}(P(\cdot\mid x_\star,a_\star)\,\|\,\bar{P}(\cdot\mid x_\star,a_\star))\leq\epsilon^2$,

where $c_1,c_2>0$ are some constants and $(x_\star,a_\star)\in\mathcal{X}\times\mathcal{A}$ is some context-action pair. Then we have the following lower bound:

$$\inf_{L\in\mathcal{L}_{\mathcal{I}}}\min_{i\in\mathcal{I}}\sup_{S\in\mathcal{S}}\sqrt{\mathbb{E}\left[\sum_{j=1}^{i-1}\operatorname{Reg}_j\right]}\,\mathbb{E}\left[\operatorname{Reg}_i\right]/T_i=\Omega\left(c_1c_2\right).$$

Theorem 3 provides a strong lower bound that holds simultaneously for the simple regret of all tasks in a set. The construction of the hard instance requires that the optimal policy of the next task $i$ places mass on a context-action pair $(x_\star,a_\star)$ that the optimal policies of the preceding tasks do not visit (conditions 2 and 3). Proposition 2 states that the longest sequence of tasks one can find to ensure that the conditions in Theorem 3 hold is $\Theta(|\mathcal{X}||\mathcal{A}|)$.

Proposition 2.

Let $\mathcal{I}=[N]$. There exists an instance $\bm{S}$ of length $N=|\mathcal{X}|(|\mathcal{A}|-2)$ that satisfies the conditions in Theorem 3. Any instance that satisfies the conditions in Theorem 3 must have length $N=\mathcal{O}(|\mathcal{X}||\mathcal{A}|)$.

4.1 Discussion on Nonlinear Case

We have shown in the tabular case that we can find at most $\Theta(|\mathcal{X}||\mathcal{A}|)$ tasks such that there is a trade-off between the simple regret of any task and the cumulative regrets of all its preceding tasks (Proposition 2). It appears that the number of rounds for which this tension can hold is closely connected to the complexity of the underlying outcome distributions. In this section, we extend the tabular bandit to a nonlinear setting, where we show that it is possible to find an exponentially long sequence of tasks with the trade-off described above holding for each task.

Nonlinear contextual bandit.

For simplicity, we consider the outcome space $\mathcal{Y}=\mathbb{R}$ and a fixed reward function $f^{(i)}=f\equiv x\mapsto x$ for all tasks $i$, i.e., no rich observations, so we focus on the changes in the policy spaces. Following the setups in Section 2, we now consider potentially large or continuous context and action spaces $\mathcal{X}$ and $\mathcal{A}$. Recall that the mean reward is given by $R(x,a)\coloneqq\mathbb{E}_{Y\sim P(x,a)}[f(Y)]$. For contextual bandits with nonlinear reward models, we assume that the mean reward $R\in\mathcal{F}$ for some known function class $\mathcal{F}:\mathcal{X}\times\mathcal{A}\mapsto\mathbb{R}$.

Complexity for nonlinear bandit.

Running UCB on nonlinear bandits is generally hard. Russo and Van Roy (2013) proposed to explore by choosing $A_t\in\operatorname*{arg\,max}_{a\in\mathcal{A}}\sup_{f\in\mathcal{F}_t}f(X_t,a)$, where $\sup_{f\in\mathcal{F}_t}f(a)$ is an optimistic estimate of $f_\theta(a)$. A choice of $\mathcal{F}_t$ given by Russo and Van Roy (2013) is

$$\mathcal{F}_t=\left\{f\in\mathcal{F}:\|f-\hat{f}_t^{LS}\|_{2,E_t}\leq\sqrt{\beta_t^\star}\right\}, \tag{4}$$

where $\beta_t^\star$ are constants, $\|g\|_{2,E_t}=\sum_{t=1}^{T}g^2(X_t,A_t)$ is the empirical 2-norm, and $\hat{f}_t^{LS}\in\inf_{f\in\mathcal{F}}(f(X_t,A_t)-Y_t)^2$ is the empirical risk minimizer. Running UCB with appropriately chosen $\beta_t^\star$ yields a regret of $\sqrt{\operatorname{dim}_E(\mathcal{F},T^{-2})T}$, where $\operatorname{dim}_E(\mathcal{F},T^{-2})$ is the eluder dimension of the function class $\mathcal{F}$.
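To make the optimistic rule around (4) concrete, here is a minimal Python sketch for a finite function class. It assumes that functions are plain callables `f(x, a)`, that the history is a list of `(x, a, y)` triples, and that membership in the confidence set is tested by comparing the summed squared difference on past context-action pairs against a radius `beta`. All names here are hypothetical, and the sketch is not the procedure of Russo and Van Roy (2013) in full generality.

```python
def least_squares_fit(function_class, history):
    """Return the function in a finite class with the smallest squared error."""
    def loss(f):
        return sum((f(x, a) - y) ** 2 for x, a, y in history)
    return min(function_class, key=loss)


def confidence_set(function_class, history, beta):
    """Keep functions close to the least-squares fit on the observed pairs."""
    f_hat = least_squares_fit(function_class, history)

    def sq_dist(f):
        # Summed squared difference to f_hat on past (context, action) pairs.
        return sum((f(x, a) - f_hat(x, a)) ** 2 for x, a, _ in history)

    return [f for f in function_class if sq_dist(f) <= beta]


def ucb_action(function_class, history, beta, context, actions):
    """Pick the action with the largest optimistic value over the set."""
    candidates = confidence_set(function_class, history, beta)
    return max(actions, key=lambda a: max(f(context, a) for f in candidates))
```

A larger `beta` keeps more functions in the set, so actions that have never been tried can still look attractive. This is the mechanism that drives exploration in the rule above.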

Definition 5 (Distributional eluder dimension).

Let $\mathcal{F}:\mathcal{X}\mapsto\mathbb{R}$. A probability measure $\nu$ over $\mathcal{X}$ is said to be $\epsilon$-dependent on a sequence of probability measures $\{\mu_1,\dots,\mu_n\}$ w.r.t. $\mathcal{F}$ if any pair of functions $f,\bar{f}\in\mathcal{F}$ satisfying $\sqrt{\sum_{i=1}^{n}(\mathbb{E}_{\mu_i}[f(x)-\bar{f}(x)])^2}\leq\epsilon$ also satisfies $|\mathbb{E}_\nu[f(x)-\bar{f}(x)]|\leq\epsilon$. Furthermore, $\nu$ is $\epsilon$-independent of $\{\mu_1,\dots,\mu_n\}$ if it is not $\epsilon$-dependent on the sequence.

The $\epsilon$-eluder dimension $\operatorname{dim}_E(\mathcal{F},\epsilon)$ is the length of the longest sequence of distributions over $\mathcal{X}$ such that for some $\epsilon'\geq\epsilon$, every distribution is $\epsilon'$-independent of its predecessors.
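For a finite domain and a finite function class, $\epsilon$-independence can be checked by brute force. The Python sketch below reads Definition 5 in the standard way: $\nu$ is $\epsilon$-independent of $\mu_1,\dots,\mu_n$ when some pair of functions is close in expectation under every $\mu_i$ (in the root-sum-of-squares sense) yet differs by more than $\epsilon$ in expectation under $\nu$. Functions and distributions are represented as equal-length lists over the domain. This representation and the helper names are assumptions made for illustration.

```python
from itertools import combinations


def expect_diff(dist, f, g):
    """Expectation of f(x) - g(x) under a distribution on a finite domain."""
    return sum(p * (fx - gx) for p, fx, gx in zip(dist, f, g))


def is_eps_independent(nu, mus, function_class, eps):
    """Return True if nu is eps-independent of the sequence mus.

    A witness is a pair (f, g) that the past distributions cannot tell
    apart up to eps but that differs by more than eps under nu.
    """
    for f, g in combinations(function_class, 2):
        past_gap = sum(expect_diff(mu, f, g) ** 2 for mu in mus) ** 0.5
        if past_gap <= eps and abs(expect_diff(nu, f, g)) > eps:
            return True
    return False
```

With a two-point domain, the functions `[0, 0]` and `[0, 1]` agree under a point mass on the first point, so a point mass on the second point is independent of it for any $\epsilon<1$.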

Recall that the construction of our hard instances in Theorem 1 requires that the new task has an optimal policy whose occupancy measure has no overlap with the occupancy measures of the optimal policies in the previous tasks. A generalization of this to the nonlinear case is that a predicted function that minimizes the loss over the dataset collected in the previous tasks may still incur a large loss on a new task. Let the optimal policies of tasks $1,\dots,n$ be $\pi_1^\star,\dots,\pi_n^\star$. Intuitively, as long as $n$ is smaller than $\operatorname{dim}_E(\mathcal{F},\epsilon)$, we can find a new task with optimal policy $\pi_{n+1}^\star$ for which the occupancy measure $\mu_{\pi_{n+1}^\star}$ is $\epsilon$-independent of $(\mu_{\pi_1^\star},\dots,\mu_{\pi_n^\star})$. By the definition of the eluder dimension, this implies that the function chosen for task $n+1$ based on the dataset collected by $(\mu_{\pi_1^\star},\dots,\mu_{\pi_n^\star})$ may still incur a large error. Note that by running a no-regret online algorithm, the dataset collected during a task will be asymptotically distributed as the occupancy measure induced by its optimal policy.

The eluder dimension has been shown to be exponentially large for simple models such as one-layer neural networks with ReLU activation functions (Dong et al., 2021). It is not trivial to show a lower bound that depends directly on the eluder dimension. Instead, we provide a concrete example where the UCB algorithm described in (4) fails.

Theorem 4.

Consider the hypothesis set $\mathcal{F}$ to be one-hidden-layer neural networks with width $d$. There exists a ground-truth reward function and a sequence of tasks of length $\Omega(\exp(d))$ with different $\Pi^{(i)}$, such that the local regret for each task is lower bounded by a constant, even if each $T_i\rightarrow\infty$.

Theorem 4 indicates that even without a change in outcome distributions, there exists an exponentially long sequence of tasks for which the tension between local regret minimization and global regret minimization still holds. A UCB algorithm that greedily minimizes local regret fails to provide good guarantees for later tasks.

5 Study on Changes in $P$

In real-world implementations, the outcome distribution $P$ often undergoes unpredictable shifts. Prior research on non-stationary bandits has typically focused on single-task scenarios with potential reward distribution shifts at any step. To manage these shifts, the literature often limits the total variation in distribution shifts, making it possible to establish sublinear regret bounds. In a sequential task setting, when the algorithm is not allowed to adaptively learn in the second task, the simple regret is always lower bounded by a constant. This is attributed to the uncertainty about the second task’s optimal policy, even with full knowledge of the first task. To address this challenge, we introduce the concept of robust simple regret. We show that the robust simple regret and the cumulative regret in the two-task case have a minimax lower bound similar to the one in Theorem 1.

For simplicity, we consider no change in the policy space and the reward function. More specifically, we let $\Pi^{(1)} = \Pi^{(2)} = \Pi$, the set of all policies, and $f^{(1)} = f^{(2)} = f$, the identity mapping on $\mathbb{R}$. We denote by $P^{(1)}$ and $P^{(2)}$ the outcome distributions of the first and the second task, respectively. We denote a problem instance by $\bm{P} = (P^{(1)}, P^{(2)})$.

The adversary is allowed to choose $P^{(2)}$ from an $L_1$ ball around $P^{(1)}$. This leads to the instance set $\mathcal{P}(\Delta)$, parametrized by a constant $\Delta$, such that each $\bm{P} = (P^{(1)}, P^{(2)})$ satisfies

$$P^{(2)} \in \mathcal{P}(P^{(1)}, \Delta) \coloneqq \left\{ P : \sum_{a} |P(x, a) - P^{(1)}(x, a)| \leq \Delta \text{ for all } x \in \mathcal{X} \right\}, \tag{5}$$

where we abuse the notation for $P$ and let $P(x, a)$ denote the mean reward for $(x, a)$.
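To make the constraint in (5) concrete, the following is a minimal sketch of a membership check. It assumes a finite context and action set with mean rewards stored in an array of shape (contexts, actions); the function name `in_l1_ball` and the numerical tolerance are illustrative choices, not part of the paper.

```python
import numpy as np


def in_l1_ball(p2, p1, delta, tol=1e-12):
    """Check whether p2 lies in the set P(p1, delta) defined in (5).

    p1, p2: arrays of shape (num_contexts, num_actions) of mean rewards.
    tol is a hypothetical slack for floating-point rounding.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Sum |P(x, a) - P1(x, a)| over actions a, separately for each context x.
    per_context_shift = np.abs(p2 - p1).sum(axis=1)
    # The bound must hold for every context, not on average.
    return bool(np.all(per_context_shift <= delta + tol))


p1 = np.array([[0.6, 0.4], [0.2, 0.7]])
p2_inside = np.array([[0.5, 0.5], [0.2, 0.9]])   # shifts 0.2 and 0.2
p2_outside = np.array([[0.3, 0.6], [0.2, 0.7]])  # shifts 0.5 and 0.0
print(in_l1_ball(p2_inside, p1, delta=0.25))
print(in_l1_ball(p2_outside, p1, delta=0.25))
```

Note that the budget $\Delta$ applies per context: a shift concentrated on a single context can violate (5) even when every other context is unchanged.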

Robust simple regret.

When $P^{(2)}$ is allowed to deviate from $P^{(1)}$ in a potentially adversarial way, it is not reasonable to compare with the true optimal policy with respect to the underlying true $P^{(2)}$. Instead, we consider a robust regret definition. We first define the worst-case simple regret of a policy $\pi$ on a context $x$:

$$\operatorname{SR}(\pi \mid P^{(1)}, \Delta) \coloneqq \sup_{P^{(2)} \in \mathcal{P}(P^{(1)}, \Delta)} \left( \max_{a} P^{(2)}(x, a) - \sum_{a} P^{(2)}(x, a)\, \pi(a \mid x) \right). \tag{6}$$

We denote by $\tilde{\pi}_{P^{(1)}, \Delta} \coloneqq \operatorname*{arg\,min}_{\pi' \in \Pi^{(2)}} \operatorname{SR}(\pi' \mid P^{(1)}, \Delta)$ the optimal robust policy given $(P^{(1)}, \Delta)$. When it is clear from the context, we drop the subscripts $P^{(1)}$ and $\Delta$.

We further define the robust simple regret, which is the gap between the worst-case simple regret of a given policy and that of the policy achieving the lowest worst-case simple regret:

$$\widetilde{\operatorname{SR}}(\pi \mid P^{(1)}, \Delta) \coloneqq \operatorname{SR}(\pi \mid P^{(1)}, \Delta) - \inf_{\pi' \in \Pi^{(2)}} \operatorname{SR}(\pi' \mid P^{(1)}, \Delta). \tag{7}$$
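Definitions (6) and (7) can be evaluated numerically in small cases. The sketch below handles a context-free bandit and rests on two assumptions that are ours rather than the paper's: mean rewards are not clipped to a bounded range, so the $L_1$ ball is a polytope with vertices $P^{(1)} \pm \Delta e_a$, and the infimum in (7) is approximated by a grid search over two-armed policies. Because the regret inside the supremum is convex in $P^{(2)}$, the supremum is attained at one of those vertices.

```python
import numpy as np


def worst_case_simple_regret(pi, p1, delta):
    """Worst-case simple regret (6) for a context-free bandit.

    pi: action probabilities, shape (K,); p1: first-task means, shape (K,).
    Assumption: means are unconstrained, so it suffices to check the
    2K vertices p1 +/- delta * e_a of the L1 ball.
    """
    pi = np.asarray(pi, dtype=float)
    p1 = np.asarray(p1, dtype=float)
    worst = -np.inf
    for a in range(len(p1)):
        for sign in (1.0, -1.0):
            p2 = p1.copy()
            p2[a] += sign * delta          # move the whole budget onto arm a
            regret = p2.max() - p2 @ pi    # simple regret of pi under p2
            worst = max(worst, regret)
    return worst


def robust_simple_regret_two_arms(pi, p1, delta, grid=1001):
    """Robust simple regret (7) for two arms; the infimum uses a grid search."""
    best = min(
        worst_case_simple_regret([q, 1.0 - q], p1, delta)
        for q in np.linspace(0.0, 1.0, grid)
    )
    return worst_case_simple_regret(pi, p1, delta) - best


p1 = [0.6, 0.4]
greedy = [1.0, 0.0]  # always play the arm that is optimal in the first task
print(worst_case_simple_regret(greedy, p1, delta=0.5))
print(robust_simple_regret_two_arms(greedy, p1, delta=0.5))
```

In this instance the greedy policy, which is optimal for the first task, has strictly positive robust simple regret, so a randomized policy does better in the worst case.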

Note that the worst-case regret over an ambiguity set has been studied in the robust Markov decision process literature (Xu and Mannor, 2010; Eysenbach and Levine, 2021; Dong et al., 2022). However, the definition of robust simple regret and the tension between cumulative regret and robust simple regret have not yet been explored.

To understand how the tension between cumulative and simple regret still plays a role, we investigate a simple two-armed, context-free bandit in Proposition 3. The optimal arm in the first task is $a_1$, while the optimal robust policy depends on the gap between the mean rewards of the two arms. Thus, to reduce the robust simple regret in the second task, the algorithm is forced to obtain an accurate estimate of the suboptimal arm in the first task.

Proposition 3.

Consider the following two-armed, context-free bandit, with $\text{Gap} \coloneqq P^{(1)}(a_1) - P^{(1)}(a_2) > 0$. Then the worst-case simple regret is given by

$$\operatorname{SR}(\pi \mid P^{(1)}, \Delta) = \max\{(\Delta - \text{Gap})\, \pi(a_1),\ (\Delta + \text{Gap})\, \pi(a_2)\}. \tag{8}$$

Assume that $\Delta > \text{Gap}$. The optimal robust policy $\tilde{\pi}$ w.r.t. the worst-case simple regret has the explicit form of

$$\tilde{\pi}_{P^{(1)}, \Delta}(a_1) = \frac{\Delta + \text{Gap}}{2\Delta}, \text{ and } \tilde{\pi}_{P^{(1)}, \Delta}(a_2) = \frac{\Delta - \text{Gap}}{2\Delta}. \tag{9}$$
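The explicit form (9) can be recovered from (8) by a short balancing argument. The sketch below uses the shorthand $q = \pi(a_1)$, which is our notation and not the paper's, and assumes $\Delta > \text{Gap} > 0$ as in the proposition.

```latex
% Shorthand (ours): q = \pi(a_1), hence \pi(a_2) = 1 - q.
\begin{align*}
\operatorname{SR}(\pi \mid P^{(1)}, \Delta)
  &= \max\{(\Delta - \text{Gap})\, q,\ (\Delta + \text{Gap})(1 - q)\}. \\
\intertext{The first term is increasing in $q$ and the second is decreasing,
so the maximum is minimized where the two terms are equal:}
(\Delta - \text{Gap})\, q &= (\Delta + \text{Gap})(1 - q)
  \quad\Longrightarrow\quad q = \frac{\Delta + \text{Gap}}{2\Delta}, \\
\inf_{\pi'} \operatorname{SR}(\pi' \mid P^{(1)}, \Delta)
  &= \frac{(\Delta - \text{Gap})(\Delta + \text{Gap})}{2\Delta}
   = \frac{\Delta^{2} - \text{Gap}^{2}}{2\Delta}.
\end{align*}
```

The minimizer depends on $\text{Gap}$, and therefore on the mean reward of the suboptimal arm $a_2$ in the first task.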

Motivated by the instance introduced in Proposition 3, we show the following theorem, which lower bounds the minimax rate of the product of the square root of the cumulative regret in the first task and the robust simple regret in the second task. Note that the robust simple regret does not depend on the actual choice of $P^{(2)}$, so the supremum is only taken over $P^{(1)}$.

Theorem 5.

Assume $\mathcal{S}$ is such that $\Pi^{(1)} = \Pi^{(2)} = \Pi$, the set of all policies, and $f^{(1)} = f^{(2)} = f$, the identity mapping on $\mathbb{R}$. Assume each $P^{(i)}(\cdot \mid x, a)$ is a binomial distribution with mean $P^{(i)}(x, a)$ for all $i = 1, 2$ and $(x, a) \in \mathcal{X} \times \mathcal{A}$, and $P^{(2)} \in \mathcal{P}(P^{(1)}, \Delta)$. Then there exists some $\Delta$ such that

$$\inf_{L \in \mathcal{L}_2} \sup_{P^{(1)}} \sqrt{\mathbb{E}[\operatorname{Reg}_1]}\ \mathbb{E}[\widetilde{\operatorname{SR}}(\pi_2 \mid P^{(1)}, \Delta)] = \Omega(1), \tag{10}$$

where $\pi_2$ is the random policy chosen by learning algorithm $L$.

Theorem 5 implies that one should still employ additional exploration for small $T_1$ even when only the outcome distributions change: a similar trade-off between local regret and global regret still holds.
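The mechanism behind this trade-off can be illustrated in the two-armed instance of Proposition 3. The sketch below is an illustration under our own assumptions, not the paper's construction: the learner plugs an estimated gap `gap_hat` (a hypothetical quantity standing in for whatever it learned in the first task) into (9), and we evaluate the resulting robust simple regret (7) using the closed form (8). All function names are illustrative.

```python
def worst_case_sr(q, gap, delta):
    """Worst-case simple regret (8) when arm a1 is played with probability q."""
    return max((delta - gap) * q, (delta + gap) * (1.0 - q))


def robust_policy_a1(gap, delta):
    """Probability of a1 under the optimal robust policy (9); needs delta > gap."""
    return (delta + gap) / (2.0 * delta)


def plug_in_robust_sr(gap_hat, gap, delta):
    """Robust simple regret (7) of the policy built from an estimated gap."""
    q_hat = robust_policy_a1(gap_hat, delta)   # policy the learner commits to
    q_star = robust_policy_a1(gap, delta)      # optimal robust policy
    return worst_case_sr(q_hat, gap, delta) - worst_case_sr(q_star, gap, delta)


delta, gap = 0.5, 0.2
for err in (-0.1, -0.05, 0.0, 0.05, 0.1):
    print(f"error {err:+.2f}: robust SR = {plug_in_robust_sr(gap + err, gap, delta):.4f}")
```

In this sketch the robust simple regret grows linearly in the gap estimation error, so it vanishes only if the suboptimal arm was sampled often enough in the first task to pin down the gap, which is exactly what a cumulative-regret-minimizing algorithm avoids.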

6 Discussion

In this paper, we study the minimax rate of the sum of local regrets across a sequence of contextual bandit tasks. By showing a lower bound on this rate, we demonstrate a strong trade-off between local regrets in different tasks when there are changes between tasks. These changes include changes in the policy space, the reward function, and the outcome distribution, which is a significant novelty of this work. A main message is that, in the presence of such changes, one should employ more exploration than is sufficient for single-task cumulative regret minimization. Our work opens many interesting future directions in the area of multitask bandits.

Multiple changes in $P$.

In this paper, we only studied changes in the outcome distribution in the two-task case, where we propose a new notion of robust simple regret and show that there is a dilemma between cumulative regret minimization in the first task and robust simple regret minimization in the second one. It is not yet clear how to extend the result to the multiple-task case. Intuitively, information from older tasks should be discounted when proposing a policy for a new task. Future work could consider modeling the changes in $P$ with an autoregressive model, which would allow us to characterize how a new $P^{(i)}$ depends on the previous ones.

Instance-dependence results.

We study the minimax rate throughout the paper, which focuses on the worst case. In practice, some instances are significantly harder to learn than others. An interesting direction is to propose a theoretical measure of the significance of the trade-off studied in this paper and to derive instance-dependent results.

References

  • Abel et al., (2023) Abel, D., Barreto, A., Roy, B. V., Precup, D., Hasselt, H. V., and Singh, S. (2023). A definition of continual reinforcement learning. ArXiv, abs/2307.11046.
  • Agrawal and Goyal, (2013) Agrawal, S. and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pages 127–135. PMLR.
  • Aleven et al., (2023) Aleven, V., Baraniuk, R., Brunskill, E., Crossley, S. A., Demszky, D., Fancsali, S. E., Gupta, S., Koedinger, K., Piech, C., Ritter, S., Thomas, D. R., Woodhead, S., and Xing, W. (2023). Towards the future of ai-augmented human tutoring in math learning. In International Conference on Artificial Intelligence in Education.
  • Athey et al., (2022) Athey, S., Byambadalai, U., Hadad, V., Krishnamurthy, S. K., Leung, W., and Williams, J. J. (2022). Contextual bandits in a survey experiment on charitable giving: Within-experiment outcomes versus policy learning. ArXiv, abs/2211.12004.
  • Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
  • Auer et al., (2008) Auer, P., Jaksch, T., and Ortner, R. (2008). Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21.
  • Bidargaddi et al., (2020) Bidargaddi, N., Schrader, G., Klasnja, P., Licinio, J., and Murphy, S. (2020). Designing m-health interventions for precision mental health support. Translational psychiatry, 10(1):222.
  • Bubeck et al., (2011) Bubeck, S., Munos, R., and Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852.
  • Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.
  • Chu et al., (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings.
  • Dai et al., (2023) Dai, J., Gradu, P., and Harshaw, C. (2023). Clip-ogd: An experimental design for adaptive neyman allocation in sequential experiments. arXiv preprint arXiv:2305.17187.
  • Dong et al., (2022) Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust mdp. arXiv preprint arXiv:2209.13841.
  • Dong et al., (2021) Dong, K., Yang, J., and Ma, T. (2021). Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. Advances in neural information processing systems, 34:26168–26182.
  • Eysenbach and Levine, (2021) Eysenbach, B. and Levine, S. (2021). Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257.
  • Gao et al., (2022) Gao, D., Liu, Y., and Zeng, D. (2022). Non-asymptotic properties of individualized treatment rules from sequentially rule-adaptive trials. The Journal of Machine Learning Research, 23(1):11362–11403.
  • Garivier and Moulines, (2008) Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415.
  • Hong et al., (2023) Hong, K., Li, Y., and Tewari, A. (2023). An optimization-based algorithm for non-stationary kernel bandits without prior knowledge. In International Conference on Artificial Intelligence and Statistics, pages 3048–3085. PMLR.
  • Khetarpal et al., (2020) Khetarpal, K., Riemer, M., Rish, I., and Precup, D. (2020). Towards continual reinforcement learning: A review and perspectives. J. Artif. Intell. Res., 75:1401–1476.
  • Kim and Tewari, (2020) Kim, B. and Tewari, A. (2020). Randomized exploration for non-stationary stochastic linear bandits. In Conference on Uncertainty in Artificial Intelligence, pages 71–80. PMLR.
  • Krishnamurthy et al., (2023) Krishnamurthy, S. K., Zhan, R., Athey, S., and Brunskill, E. (2023). Proportional response: Contextual bandits for simple and cumulative regret minimization. arXiv preprint arXiv:2307.02108.
  • Lange et al., (2019) Lange, M. D., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G. G., and Tuytelaars, T. (2019). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:3366–3385.
  • Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
  • Liao et al., (2020) Liao, P., Greenewald, K., Klasnja, P., and Murphy, S. (2020). Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22.
  • Madeka et al., (2022) Madeka, D., Torkkola, K., Eisenach, C., Luo, A., Foster, D. P., and Kakade, S. M. (2022). Deep inventory management. arXiv preprint arXiv:2210.03137.
  • Peng et al., (2023) Peng, L., Giampouras, P. V., and Vidal, R. (2023). The ideal continual learner: An agent that never forgets. In International Conference on Machine Learning.
  • Qin and Russo, (2023) Qin, C. and Russo, D. (2023). Generalized objectives in adaptive experiments: The frontier between regret and speed. NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World.
  • Raj and Kalyani, (2017) Raj, V. and Kalyani, S. (2017). Taming non-stationary bandits: A bayesian approach. arXiv preprint arXiv:1707.09727.
  • Ruan et al., (2023) Ruan, S. S., Nie, A., Steenbergen, W., He, J., Zhang, J., Guo, M., Liu, Y., Nguyen, K. D., Wang, C. Y., Ying, R., Landay, J. A., and Brunskill, E. (2023). Reinforcement learning tutor better supported lower performers in a math task. ArXiv, abs/2304.04933.
  • Russo and Van Roy, (2013) Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26.
  • Simchi-Levi and Wang, (2023) Simchi-Levi, D. and Wang, C. (2023). Multi-armed bandit experimental design: Online decision-making and adaptive inference. In International Conference on Artificial Intelligence and Statistics, pages 3086–3097. PMLR.
  • Trella et al., (2022) Trella, A. L., Zhang, K. W., Nahum-Shani, I., Shetty, V., Doshi-Velez, F., and Murphy, S. A. (2022). Designing reinforcement learning algorithms for digital interventions: pre-implementation guidelines. Algorithms, 15(8):255.
  • Wang et al., (2023) Wang, L., Zhang, X., Su, H., and Zhu, J. (2023). A comprehensive survey of continual learning: Theory, method and application. ArXiv, abs/2302.00487.
  • Wu et al., (2018) Wu, Q., Iyer, N., and Wang, H. (2018). Learning contextual bandits in a non-stationary environment. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 495–504.
  • Xu and Mannor, (2010) Xu, H. and Mannor, S. (2010). Distributionally robust markov decision processes. Advances in Neural Information Processing Systems, 23.
  • Yin and Wang, (2021) Yin, M. and Wang, Y.-X. (2021). Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34:4065–4078.
  • Zanette et al., (2021) Zanette, A., Dong, K., Lee, J. N., and Brunskill, E. (2021). Design of experiments for stochastic contextual linear bandits. Advances in Neural Information Processing Systems, 34:22720–22731.

Appendix A Proof of Theorem 1

Theorem 1. Assume the instance set $\mathcal{S}$ is sufficiently large to ensure the existence of an instance $\bm{S} = (S^{(1)}, S^{(2)})$ such that for all $\epsilon \in [0, 1/4]$, we can find some $\bar{\bm{S}} = (\bar{S}^{(1)}, \bar{S}^{(2)}) \in \mathcal{S}(\bm{S}) \cap \mathcal{S}$ satisfying the following conditions:

  1. There exists a unique optimal policy $\pi^{\star}_{S}$ for each $S \in \{S^{(1)}, \bar{S}^{(1)}, S^{(2)}, \bar{S}^{(2)}\}$;

  2. $\mu_{\pi^{\star}_{S^{(1)}}}(x_{\star}, a_{\star}) = \mu_{\pi^{\star}_{\bar{S}^{(1)}}}(x_{\star}, a_{\star}) = 0$ and $\min\{\operatorname{SR}_{S^{(1)}}(\pi), \operatorname{SR}_{\bar{S}^{(1)}}(\pi)\} \geq c_1 \mu_{\pi}(x_{\star}, a_{\star})$ for all $\pi \in \Pi^{(1)}$;

  3. $\mu_{\pi^{\star}_{S^{(2)}}}(x_{\star}, a_{\star}) - \mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star}, a_{\star}) = c_2 > 0$;

  4. $\operatorname{SR}_{S^{(2)}}(\pi) / \epsilon \geq \mu_{\pi^{\star}_{S^{(2)}}}(x_{\star}, a_{\star}) - \mu_{\pi}(x_{\star}, a_{\star})$ for all $\pi \in \Pi^{(2)}$;

  5. $\operatorname{SR}_{\bar{S}^{(2)}}(\pi) / \epsilon \geq \mu_{\pi}(x_{\star}, a_{\star}) - \mu_{\pi^{\star}_{\bar{S}^{(2)}}}(x_{\star}, a_{\star})$ for all $\pi \in \Pi^{(2)}$;

  6. $P$ and $\bar{P}$ only differ in $(x_{\star}, a_{\star})$ and $D_{\operatorname{KL}}(P(\cdot \mid x_{\star}, a_{\star}) \mid \bar{P}(\cdot \mid x_{\star}, a_{\star})) \leq \epsilon^{2}$,

where $c_1, c_2 > 0$ are some constants and $(x_{\star}, a_{\star}) \in \mathcal{X} \times \mathcal{A}$ is some context-action pair. Then we have the following lower bound:

$$\inf_{L_2 \in \mathcal{L}} \sup_{S \in \mathcal{S}} \sqrt{\mathbb{E}\left[\operatorname{Reg}_1\right]}\ \mathbb{E}\left[\operatorname{Reg}_2\right] / T_2 = \Omega\left(\sqrt{c_1^{2} c_2^{2}}\right).$$
Proof.

Fix a learning algorithm $L$. Throughout the proof, we let $\mathbb{E}_{\bm{S}'}$ denote the expectation of the random variable of interest when algorithm $L$ is run on the underlying instance $\bm{S}' \in \mathcal{S}$.

Let $T(x, a) = \sum_{t=1}^{T_1} \mathbb{1}_{\{(X_{1,t}, A_{1,t}) = (x, a)\}}$ be the number of times the pair $(x, a)$ occurs in the first task. From Condition 2 and the definition of the cumulative regret, we have

\begin{align}
\mathbb{E}_{\bm{S}}[\operatorname{Reg}_1] &= \sum_{t=1}^{T_1} \mathbb{E}_{\bm{S}}\left[\operatorname{SR}_{S^{(1)}}(\pi_{1,t})\right] \tag{11} \\
&\geq \sum_{t=1}^{T_1} \mathbb{E}_{\bm{S}}\left[c_1 \mu_{\pi_{1,t}}(x_{\star}, a_{\star})\right] \tag{12} \\
&= c_1 \mathbb{E}_{\bm{S}}[T(x_{\star}, a_{\star})]. \tag{13}
\end{align}

The same argument gives $\mathbb{E}_{\bar{\bm{S}}}[\operatorname{Reg}_1] \geq c_1 \mathbb{E}_{\bar{\bm{S}}}[T(x_\star, a_\star)]$.

Denote by $\pi_2$ the fixed policy proposed by the algorithm $L$ for task two. Note that

\begin{equation}
\mathbb{E}_{\bm{S}}[\operatorname{Reg}_2 / T_2] = \mathbb{E}_{\bm{S}}[\operatorname{SR}_{S^{(2)}}(\pi_2)] \text{ and } \mathbb{E}_{\bar{\bm{S}}}[\operatorname{Reg}_2 / T_2] = \mathbb{E}_{\bar{\bm{S}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_2)]. \tag{14}
\end{equation}

We further lower bound the sum of the expected simple regrets of $\pi_2$ on $S^{(2)}$ and $\bar{S}^{(2)}$:

\begin{align}
& \mathbb{E}_{\bm{S}}[\operatorname{SR}_{S^{(2)}}(\pi_2)] + \mathbb{E}_{\bar{\bm{S}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_2)] \tag{15} \\
\geq{} & \epsilon \mathbb{E}_{\bm{S}}\left[\mu_{\pi^\star_{S^{(2)}}}(x_\star, a_\star) - \mu_{\pi_2}(x_\star, a_\star)\right] + \epsilon \mathbb{E}_{\bar{\bm{S}}}\left[\mu_{\pi^\star_{\bar{S}^{(2)}}}(x_\star, a_\star) - \mu_{\pi_2}(x_\star, a_\star)\right] \tag{16} \\
\geq{} & \frac{c_2 \epsilon}{2} \left[ \mathbb{P}_{\bm{S}}\left( \mu_{\pi_2}(x_\star, a_\star) \leq \frac{\mu_{\pi^\star_{S^{(2)}}}(x_\star, a_\star) + \mu_{\pi^\star_{\bar{S}^{(2)}}}(x_\star, a_\star)}{2} \right) + \right. \tag{17} \\
& \qquad\quad \left. \mathbb{P}_{\bar{\bm{S}}}\left( \mu_{\pi_2}(x_\star, a_\star) > \frac{\mu_{\pi^\star_{S^{(2)}}}(x_\star, a_\star) + \mu_{\pi^\star_{\bar{S}^{(2)}}}(x_\star, a_\star)}{2} \right) \right], \tag{18}
\end{align}

where the first inequality follows from Conditions 4 and 5, and the second inequality follows from Condition 3.

Lemma 1 (Bretagnolle–Huber inequality).

For any two probability distributions $P, Q$ on the same measurable space $(\mathcal{X}, \mathcal{F})$ and any event $A \in \mathcal{F}$, we have

\[
P(A) + Q(\bar{A}) \geq \frac{1}{2} \exp\left(-D_{\operatorname{KL}}(P \,\|\, Q)\right).
\]
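Lemma 1 can be sanity-checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: it takes $P$ and $Q$ to be Bernoulli distributions (a choice made here, not one from the proof), enumerates all four events on $\{0, 1\}$, and compares the smallest value of $P(A) + Q(\bar{A})$ with the right-hand side of the lemma. The function names `kl_bernoulli` and `bh_gap` are hypothetical helpers.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """D_KL(Ber(p) || Ber(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def bh_gap(p: float, q: float) -> float:
    """Smallest P(A) + Q(complement of A) over all events, minus the lemma's bound.

    Assumption for illustration only: P = Ber(p), Q = Ber(q) on {0, 1}.
    A nonnegative return value means the inequality holds for every event.
    """
    candidates = [
        0.0 + 1.0,      # A = empty set: P(A) = 0, Q(A^c) = 1
        (1 - p) + q,    # A = {0}: P(A) = 1 - p, Q(A^c) = Q({1}) = q
        p + (1 - q),    # A = {1}: P(A) = p, Q(A^c) = Q({0}) = 1 - q
        1.0 + 0.0,      # A = {0, 1}: P(A) = 1, Q(A^c) = 0
    ]
    bound = 0.5 * math.exp(-kl_bernoulli(p, q))
    return min(candidates) - bound


if __name__ == "__main__":
    for p, q in [(0.5, 0.6), (0.1, 0.9), (0.3, 0.3)]:
        print(p, q, bh_gap(p, q) >= 0.0)
```

The worst-case event here gives $1 - |p - q|$, so the check is really comparing one minus the total variation distance with $\frac{1}{2}\exp(-D_{\operatorname{KL}})$.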

It follows from Lemma 1 and Condition 6 that

\begin{align}
& \mathbb{E}_{\bm{S}}[\operatorname{SR}_{S^{(2)}}(\pi_2)] + \mathbb{E}_{\bar{\bm{S}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_2)] \tag{19} \\
\geq{} & \frac{c_2 \epsilon}{4} \exp\left(-D_{\operatorname{KL}}(\mathbb{P}_{\bm{S}} \,\|\, \mathbb{P}_{\bar{\bm{S}}})\right) \tag{20} \\
={} & \frac{c_2 \epsilon}{4} \exp\left(-\sum_{t=1}^{T_1} \mathbb{E}_{\bm{S}}\left[D_{\operatorname{KL}}\left(P(\cdot \mid X_{1,t}, A_{1,t}), \bar{P}(\cdot \mid X_{1,t}, A_{1,t})\right)\right]\right) \tag{21} \\
={} & \frac{c_2 \epsilon}{4} \exp\left(-\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)] \, D_{\operatorname{KL}}\left(P(\cdot \mid x_\star, a_\star), \bar{P}(\cdot \mid x_\star, a_\star)\right)\right) \tag{22} \\
\geq{} & \frac{c_2 \epsilon}{4} \exp\left(-\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)] \, \epsilon^2\right). \tag{23}
\end{align}

The same argument gives that

\begin{equation}
\mathbb{E}_{\bm{S}}[\operatorname{SR}_{S^{(2)}}(\pi_2)] + \mathbb{E}_{\bar{\bm{S}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_2)] \geq \frac{c_2 \epsilon}{4} \exp\left(-\mathbb{E}_{\bar{\bm{S}}}[T(x_\star, a_\star)] \, \epsilon^2\right). \tag{24}
\end{equation}

Lemma 2.

For all choices of algorithm $L$, with $\epsilon = \sqrt{1 / \mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]}$, we have $\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)] \leq \gamma \mathbb{E}_{\bar{\bm{S}}}[T(x_\star, a_\star)]$ for some universal constant $\gamma > 0$.

By choosing $\epsilon = \sqrt{1 / \mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]}$ (assuming that $1 / \mathbb{E}_{\bm{S}}[T(x_\star, a_\star)] \leq 1/4$) and applying Lemma 2, it holds that

\begin{equation}
\mathbb{E}_{\bm{S}}[\operatorname{SR}_{S^{(2)}}(\pi_2)] + \mathbb{E}_{\bar{\bm{S}}}[\operatorname{SR}_{\bar{S}^{(2)}}(\pi_2)] \geq \sqrt{\frac{c_1 c_2^2}{64 \, \mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]}} \geq \sqrt{\frac{c_1 c_2^2}{64 \gamma \, \mathbb{E}_{\bar{\bm{S}}}[T(x_\star, a_\star)]}}. \tag{25}
\end{equation}
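To see where the inverse square-root dependence on $\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]$ comes from, one can substitute the chosen $\epsilon$ into the right-hand side of (23). The short calculation below only traces the rate; it does not attempt to reproduce the constant appearing in (25).

```latex
\[
\epsilon = \sqrt{\frac{1}{\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]}}
\quad \Longrightarrow \quad
\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)] \, \epsilon^2 = 1,
\qquad
\frac{c_2 \epsilon}{4} \exp\left(-\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)] \, \epsilon^2\right)
= \frac{c_2}{4e \sqrt{\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]}}.
\]
```

The exponent is pinned at $-1$ by this choice of $\epsilon$, so the bound decays only as fast as $\epsilon$ itself.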

Combined with (13) and (14), we have

\begin{align}
& \inf_{L} \sup_{\bm{S}' \in \mathcal{S}} \sqrt{\mathbb{E}[\operatorname{Reg}_1]} \, \mathbb{E}[\operatorname{Reg}_2] / T_2 \tag{26} \\
\geq{} & \inf_{L} \left( \sqrt{\mathbb{E}_{\bm{S}}[\operatorname{Reg}_1]} \, \mathbb{E}_{\bm{S}}[\operatorname{Reg}_2] / T_2 + \sqrt{\mathbb{E}_{\bar{\bm{S}}}[\operatorname{Reg}_1]} \, \mathbb{E}_{\bar{\bm{S}}}[\operatorname{Reg}_2] / T_2 \right) \tag{27} \\
={} & \Omega\left(\sqrt{c_1^2 c_2^2}\right). \tag{28}
\end{align}

A.1 Proof of Lemma 2

Lemma 2. For all choices of algorithm $L$, with $\epsilon = \sqrt{1 / \mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]}$, we have $\mathbb{E}_{\bm{S}}[T(x_\star, a_\star)] \leq \gamma \mathbb{E}_{\bar{\bm{S}}}[T(x_\star, a_\star)]$ for some universal constant $\gamma > 0$.

Proof.

For any $\beta > 0$, by applying Lemma 1 and the same argument used to obtain (23), we have

\begin{equation}
\mathbb{P}_{\bm{S}}\left(T(x_\star, a_\star) \leq \beta / \epsilon^2\right) + \mathbb{P}_{\bar{\bm{S}}}\left(T(x_\star, a_\star) > \beta / \epsilon^2\right) \geq \frac{1}{2e}. \tag{29}
\end{equation}

To proceed, we apply Markov's inequality. For any $\alpha > \beta$,

\begin{align}
& \frac{\alpha / \epsilon^2 - \mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]}{\alpha / \epsilon^2 - \beta / \epsilon^2} + \frac{\mathbb{E}_{\bar{\bm{S}}}[T(x_\star, a_\star)]}{\beta / \epsilon^2} \tag{30} \\
\geq{} & \mathbb{P}_{\bm{S}}\left(\alpha / \epsilon^2 - T(x_\star, a_\star) \geq \alpha / \epsilon^2 - \beta / \epsilon^2\right) + \mathbb{P}_{\bar{\bm{S}}}\left(T(x_\star, a_\star) > \beta / \epsilon^2\right) \tag{31} \\
={} & \mathbb{P}_{\bm{S}}\left(T(x_\star, a_\star) \leq \beta / \epsilon^2\right) + \mathbb{P}_{\bar{\bm{S}}}\left(T(x_\star, a_\star) > \beta / \epsilon^2\right) \tag{32} \\
\geq{} & 1 / (2e). \tag{33}
\end{align}
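The step from (30) to (31) combines two separate applications of Markov's inequality, one under each instance. Spelled out as a sketch: the first line treats $\alpha / \epsilon^2 - T(x_\star, a_\star)$ as a nonnegative random variable, which is the setting in which Markov's inequality applies; that requirement is made explicit here and is not stated in the original argument.

```latex
\begin{align*}
\mathbb{P}_{\bm{S}}\left(\frac{\alpha}{\epsilon^2} - T(x_\star, a_\star) \geq \frac{\alpha - \beta}{\epsilon^2}\right)
&\leq \frac{\alpha / \epsilon^2 - \mathbb{E}_{\bm{S}}[T(x_\star, a_\star)]}{(\alpha - \beta) / \epsilon^2}, \\
\mathbb{P}_{\bar{\bm{S}}}\left(T(x_\star, a_\star) > \frac{\beta}{\epsilon^2}\right)
&\leq \frac{\mathbb{E}_{\bar{\bm{S}}}[T(x_\star, a_\star)]}{\beta / \epsilon^2}.
\end{align*}
```

Adding the two lines gives (30) on the right and (31) on the left.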

By choosing $\alpha = 2$ and $\beta = 11/6$, we have

\begin{equation}
\mathbb{E}_{\bar{\bm{S}}}[T(x_\star, a_\star)] \geq \left(\frac{1}{2e} - \frac{\alpha - 1}{\alpha - \beta}\right) \frac{\beta}{\epsilon^2} \geq 0.01 / \epsilon^2. \tag{34}
\end{equation}

Appendix B Proof of Proposition 1

Proposition 1. Under the same conditions on the instance set $\mathcal{S}$, the following minimax lower bound holds:

\begin{equation}
\inf_{L_2 \in \mathcal{L}} \sup_{S \in \mathcal{S}} \mathbb{E}[\operatorname{Reg}_1 + \operatorname{Reg}_2] = \Omega\left(\max\left\{\frac{T_2}{\sqrt{T_1}}, T_2^{2/3}, \sqrt{T_1}\right\}\right). \tag{35}
\end{equation}

Proof.

It is well known that the minimax rate of the cumulative regret of a single task of horizon $T_1$ is $\Omega(\sqrt{T_1})$ (Lattimore and Szepesvári, 2020). By Theorem 1,

\begin{align*}
\inf_{L_2 \in \mathcal{L}} \sup_{S \in \mathcal{S}} \mathbb{E}\left[\operatorname{Reg}_1 + \operatorname{Reg}_2\right]
&= \inf_{L_2 \in \mathcal{L}} \sup_{S \in \mathcal{S}} \frac{\mathbb{E}[\operatorname{Reg}_1] \left(\mathbb{E}[\operatorname{Reg}_2]\right)^2 / T_2^2}{\left(\mathbb{E}[\operatorname{Reg}_2]\right)^2 / T_2^2} + \mathbb{E}[\operatorname{Reg}_2] \\
&= \Omega\left(\frac{1}{\left(\mathbb{E}[\operatorname{Reg}_2]\right)^2 / T_2^2} + \mathbb{E}[\operatorname{Reg}_2]\right) \text{ for some $L_2$ and $S$} \\
&= \Omega\left(T_2^{2/3}\right).
\end{align*}
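The final step balances the two terms as a function of the task-two regret. A short worked calculation, writing $r$ for $\mathbb{E}[\operatorname{Reg}_2]$ (shorthand introduced here for illustration) and ignoring constants hidden in the $\Omega$ notation:

```latex
\begin{align*}
f(r) &= \frac{T_2^2}{r^2} + r, \qquad r > 0, \\
f'(r) &= -\frac{2 T_2^2}{r^3} + 1 = 0 \iff r^\star = \left(2 T_2^2\right)^{1/3}, \\
f(r^\star) &= \frac{T_2^2}{\left(2 T_2^2\right)^{2/3}} + \left(2 T_2^2\right)^{1/3}
= \left(2^{-2/3} + 2^{1/3}\right) T_2^{2/3}.
\end{align*}
```

Since $f$ is convex on $r > 0$, $r^\star$ is its minimizer, so $f(r) = \Omega(T_2^{2/3})$ whatever value the task-two regret takes.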

A minimax lower bound of $\Omega(T_2 / \sqrt{T_1})$ for offline policy optimization is shown in Theorem 4.3 of Yin and Wang (2021). Combining the three lower bounds, we conclude (35). ∎
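To get a feel for which of the three terms in (35) drives the bound, the following sketch evaluates them for given horizons with all constants dropped. The function names and the labels `offline`, `tradeoff`, and `single_task` are hypothetical names chosen here for illustration; they are not terminology from the proof.

```python
import math


def lower_bound_terms(t1: int, t2: int) -> dict:
    """Return the three rate terms inside the max of (35), constants ignored."""
    return {
        "offline": t2 / math.sqrt(t1),    # T2 / sqrt(T1)
        "tradeoff": t2 ** (2.0 / 3.0),    # T2^(2/3)
        "single_task": math.sqrt(t1),     # sqrt(T1)
    }


def dominant_term(t1: int, t2: int) -> str:
    """Return the label of the largest of the three rate terms."""
    terms = lower_bound_terms(t1, t2)
    return max(terms, key=terms.get)


if __name__ == "__main__":
    for t1, t2 in [(100, 10**6), (10**6, 10**6), (10**6, 10**3)]:
        print(t1, t2, dominant_term(t1, t2), lower_bound_terms(t1, t2))
```

Comparing the terms pairwise, the $T_2^{2/3}$ term is the largest when $T_2^{2/3} < T_1 < T_2^{4/3}$; a shorter first task makes $T_2 / \sqrt{T_1}$ dominate and a longer one makes $\sqrt{T_1}$ dominate.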

Appendix C Verifying Conditions in Section 3.1

We verify each of the conditions in Theorem 1 for the three case studies introduced in Section 3.1. Note that the mean rewards are given in Table 4.

C.1 Case I

Let $(x_\star, a_\star) = (x_2, a_2)$. It follows from the construction that there is a unique optimal policy for each of $S^{(1)}, \bar{S}^{(1)}, S^{(2)}, \bar{S}^{(2)}$. These unique optimal policies are given in Table 5.

  1. Condition 2 holds with $c_1 = 1/4$: $\pi^\star_{S^{(1)}}(a_\star \mid \cdot) = \pi^\star_{\bar{S}^{(1)}}(a_\star \mid \cdot) = 0$, $\operatorname{SR}_{S^{(1)}}(\pi) = (1-2\epsilon)\mu_\pi(x_\star, a_\star)$, and $\operatorname{SR}_{\bar{S}^{(1)}}(\pi) = (1-3\epsilon)\mu_\pi(x_\star, a_\star)$. Thus, $\min\{\operatorname{SR}_{S^{(1)}}(\pi), \operatorname{SR}_{\bar{S}^{(1)}}(\pi)\} \geq (1-3\epsilon)\mu_\pi(x_\star, a_\star) \geq \frac{1}{4}\mu_\pi(x_\star, a_\star)$, where the last inequality uses $\epsilon \leq 1/4$.

  2. Condition 3 holds with $c_2 = 1/2$ because $\mu_{\pi^\star_{S^{(2)}}}(x_\star, a_\star) = 1/2$, while $\mu_{\pi^\star_{\bar{S}^{(2)}}}(x_\star, a_\star) = 0$.

  3. Conditions 4 and 5 hold since $\operatorname{SR}_{S^{(2)}}(\pi)/\epsilon = \mu_{\pi^\star_{S^{(2)}}}(x_\star, a_\star) - \mu_\pi(x_\star, a_\star)$ and $\operatorname{SR}_{\bar{S}^{(2)}}(\pi)/\epsilon = \mu_\pi(x_\star, a_\star) - \mu_{\pi^\star_{\bar{S}^{(2)}}}(x_\star, a_\star)$ for all $\pi \in \Pi^{(2)}$.

  4. Condition 6 can be satisfied by choosing $P$ and $\bar{P}$ as normal distributions with means $R$ and $\bar{R}$, respectively, and variance $1/2$.

Table 5: Sampling probabilities for $(a_1, a_2)$ in each context under the optimal policies

| Context | $\pi^\star_{S^{(1)}}$ | $\pi^\star_{\bar{S}^{(1)}}$ | $\pi^\star_{S^{(2)}}$ | $\pi^\star_{\bar{S}^{(2)}}$ |
| --- | --- | --- | --- | --- |
| $x_1$ | (1, 0) | (1, 0) | (1, 0) | (1, 0) |
| $x_2$ | (1, 0) | (1, 0) | (0, 1) | (1, 0) |

C.2 Case II

Let $(x_\star, a_\star) = (x_2, a_1)$. The unique optimal policies for each task are given in Table 6.

  1. Condition 2 holds with $c_1 = 1/4$: we have $\pi^\star_{S^{(1)}}(a_\star \mid x_\star) = \pi^\star_{\bar{S}^{(1)}}(a_\star \mid x_\star) = 0$. Furthermore, $\operatorname{SR}_{S^{(1)}}(\pi) \geq (1-\epsilon)\mu_\pi(x_\star, a_\star)$ and $\operatorname{SR}_{\bar{S}^{(1)}}(\pi) \geq (1-\epsilon)\mu_\pi(x_\star, a_\star)$. By choosing $\epsilon < 1/4$, Condition 2 is satisfied with $c_1 = 1/4$.

  2. Condition 3 holds with $c_2 = 1/2$ because $\mu_{\pi^\star_{S^{(2)}}}(x_\star, a_\star) = 0$, while $\mu_{\pi^\star_{\bar{S}^{(2)}}}(x_\star, a_\star) = 1/2$.

  3. Conditions 4 and 5 hold because $\operatorname{SR}_{S^{(2)}}(\pi) = \epsilon \mu_\pi(x_\star, a_\star)/2$ and $\operatorname{SR}_{\bar{S}^{(2)}}(\pi) = \epsilon(1/2 - \mu_\pi(x_\star, a_\star))$.

  4. Condition 6 can be satisfied by choosing $P$ and $\bar{P}$ as normal distributions with means $R$ and $\bar{R}$, respectively, and variance $1/2$.

Table 6: Sampling probabilities for $(a_1, a_2)$ in each context under the optimal policies

| Context | $\pi^\star_{S^{(1)}}$ | $\pi^\star_{\bar{S}^{(1)}}$ | $\pi^\star_{S^{(2)}}$ | $\pi^\star_{\bar{S}^{(2)}}$ |
| --- | --- | --- | --- | --- |
| $x_1$ | (1, 0) | (1, 0) | (0, 1) | (1, 0) |
| $x_2$ | (0, 1) | (0, 1) | (0, 1) | (1, 0) |
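
The entries of Table 6 can be cross-checked against the values used for Conditions 2 to 5 in Case II. The Python sketch below rests on two assumptions that are not stated in this appendix: that $\mu_\pi(x, a)$ denotes the visitation probability $\Pr(x)\,\pi(a \mid x)$, and that the two contexts are equally likely. All dictionary keys and function names are hypothetical.

```python
from fractions import Fraction

# Assumption (not stated in this appendix): both contexts are equally likely.
CONTEXT_PROB = {"x1": Fraction(1, 2), "x2": Fraction(1, 2)}

# Optimal policies transcribed from Table 6: context -> (P(a1), P(a2)).
TABLE_6 = {
    "S1": {"x1": (1, 0), "x2": (0, 1)},
    "S1_bar": {"x1": (1, 0), "x2": (0, 1)},
    "S2": {"x1": (0, 1), "x2": (0, 1)},
    "S2_bar": {"x1": (1, 0), "x2": (1, 0)},
}
ARM_INDEX = {"a1": 0, "a2": 1}


def visitation(policy, context, arm):
    """mu_pi(x, a), assumed here to equal Pr(x) * pi(a | x)."""
    return CONTEXT_PROB[context] * policy[context][ARM_INDEX[arm]]


def case_two_visitations():
    """mu at (x_star, a_star) = (x2, a1) for each optimal policy in Table 6."""
    return {name: visitation(pol, "x2", "a1") for name, pol in TABLE_6.items()}


def simple_regret_s2(mu, eps):
    """SR in S^(2) as stated for Conditions 4 and 5: eps * mu / 2."""
    return eps * mu / 2


def simple_regret_s2_bar(mu, eps):
    """SR in S-bar^(2) as stated for Conditions 4 and 5: eps * (1/2 - mu)."""
    return eps * (Fraction(1, 2) - mu)
```

Under these assumptions the first-task optimal policies never visit $(x_\star, a_\star)$, the two second-task optimal policies visit it with probabilities $0$ and $1/2$, and both simple-regret expressions vanish at their respective optimal policies.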

C.3 Case III

Note that any MAB can be seen as a contextual bandit with a dummy context $x_0$. Let $(x_\star, a_\star) = (x_0, a_2)$. The optimal policies for $S^{(1)}$ and $\bar{S}^{(1)}$ satisfy $\pi^\star_{S^{(1)}}(a_1) = \pi^\star_{\bar{S}^{(1)}}(a_1) = 1$. The optimal policy for $S^{(2)}$ satisfies $\pi^\star_{S^{(2)}}(a_1) = 1$, and the optimal policy for $\bar{S}^{(2)}$ satisfies $\pi^\star_{\bar{S}^{(2)}}(a_1) = 0$.

  1. Condition 2 holds with $c_1 = 0.2$.

  2. Condition 3 holds with $c_2 = 1$.

  3. Conditions 4 and 5 hold because $\operatorname{SR}_{S^{(2)}}(\pi) = \epsilon \mu_\pi(a_\star)/2$ and $\operatorname{SR}_{\bar{S}^{(2)}}(\pi) = \epsilon(1 - \mu_\pi(x_\star, a_\star))/2$.

  4. Condition 6 holds by choosing $P$ and $\bar{P}$ as normal distributions with means $R$ and $\bar{R}$, respectively, and variance $1/2$.

Appendix D Proof of Theorem 5

Theorem 5. Assume $\mathcal{S}$ is such that $\Pi^{(1)} = \Pi^{(2)} = \Pi$, the set of all policies, and the reward function in both tasks is $f$, the identity mapping on $\mathbb{R}$. Assume each $P^{(i)}(\cdot \mid x, a)$ is a binomial distribution with mean $P^{(i)}(x, a)$ for all $i = 1, 2$ and $(x, a) \in \mathcal{X} \times \mathcal{A}$, and $P^{(2)} \in \mathcal{P}(P^{(1)}, \Delta)$. Then there exists some $\Delta$ such that

$$\inf_{L \in \mathcal{L}_2} \sup_{P^{(1)}} \sqrt{\mathbb{E}[\operatorname{Reg}_1]}\, \mathbb{E}[\widetilde{\operatorname{SR}}(\pi_2 \mid P^{(1)}, \Delta)] = \Omega(1), \tag{36}$$

where $\pi_2$ is the random policy chosen by the learning algorithm $L$.

Proof.

We construct two hard instances inspired by Proposition 3. Recall that Proposition 3 states that any two-armed, context-free bandit with $\text{Gap} \coloneqq P^{(1)}(a_1) - P^{(1)}(a_2) > 0$ has the following explicit form of the optimal robust policy:

$$\tilde{\pi}(a_1) = \frac{\Delta + \text{Gap}}{2\Delta}, \text{ and } \tilde{\pi}(a_2) = \frac{\Delta - \text{Gap}}{2\Delta}. \tag{37}$$

Let the arm space be $\mathcal{A} = \{a_1, a_2\}$. We construct two instances $S$ and $\bar{S}$ with $P^{(1)}$ and $\bar{P}^{(1)}$, such that $R^{(1)}(a_1) = \bar{R}^{(1)}(a_1) = 1$ and $R^{(1)}(a_2) = 1/2$, while $\bar{R}^{(1)}(a_2) = 1/2 + \epsilon$ for $\epsilon < 1/4$. Let $P^{(i)}(\cdot \mid a)$ and $\bar{P}^{(i)}(\cdot \mid a)$ be Bernoulli distributions with parameters $R^{(i)}(a)$ and $\bar{R}^{(i)}(a)$, respectively, for each $i \in \{1, 2\}$, $a \in \mathcal{A}$. Let $\text{Gap} = R^{(1)}(a_1) - R^{(1)}(a_2) = 1/2$ and $\overline{\text{Gap}} = \bar{R}^{(1)}(a_1) - \bar{R}^{(1)}(a_2) = 1/2 - \epsilon$.

We follow a proof similar to that of Theorem 1. We first connect the cumulative regret in the first task, $\operatorname{Reg}_1$, with the number of visits to the suboptimal arm $a_2$ in the first task. Each pull of $a_2$ incurs expected regret $\text{Gap}$ under $S$ and $\overline{\text{Gap}}$ under $\bar{S}$, both of which are at least $1/4$ because $\epsilon < 1/4$. Hence $\mathbb{E}_{S'}[\operatorname{Reg}_1] \geq \frac{1}{4} \mathbb{E}_{S'}[T(a_2)]$ for each $S' \in \{S, \bar{S}\}$, where $T(a_2) \coloneqq \sum_{t=1}^{T_1} \mathbbm{1}_{A_{1,t} = a_2}$.

We consider $\Delta = 3/4 > \max\{\text{Gap}, \overline{\text{Gap}}\}$. By Proposition 3, we first lower bound the robust simple regret by

\begin{align}
& \widetilde{\operatorname{SR}}(\pi \mid P^{(1)}, \Delta) \tag{38} \\
={} & \operatorname{SR}(\pi \mid P^{(1)}, \Delta) - \inf_{\pi'} \operatorname{SR}(\pi' \mid P^{(1)}, \Delta) \tag{39} \\
={} & \max\{(\Delta - \text{Gap})\pi(a_1), (\Delta + \text{Gap})\pi(a_2)\} - \frac{\Delta^2 - \text{Gap}^2}{2\Delta} \tag{40} \\
={} & \left(\pi(a_1) - \frac{\Delta + \text{Gap}}{2\Delta}\right)^{+} (\Delta - \text{Gap}) + \left(\pi(a_2) - \frac{\Delta - \text{Gap}}{2\Delta}\right)^{+} (\Delta + \text{Gap}) \tag{41} \\
\geq{} & (\Delta - \text{Gap}) |\pi - \pi^{*}|/2 \geq |\pi - \pi^{*}|/8. \tag{42}
\end{align}

Similarly, we also have $\widetilde{\operatorname{SR}}(\pi \mid \bar{P}^{(1)}, \Delta) \geq |\pi - \bar{\pi}^{*}|/8$. Here $\pi^{*}$ and $\bar{\pi}^{*}$ denote the optimal robust policies for $P^{(2)} \in \mathcal{P}(P^{(1)} \mid \Delta)$ and $P^{(2)} \in \mathcal{P}(\bar{P}^{(1)} \mid \Delta)$, respectively.
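
The step from (41) to (42) can be checked numerically for the instance $S$. The following Python sketch evaluates the expression in (41) on a grid of policies and compares it with $(\Delta - \text{Gap})\,|\pi(a_1) - \tilde{\pi}(a_1)|$. This quantity coincides with $(\Delta - \text{Gap})|\pi - \pi^{*}|/2$ if $|\pi - \pi^{*}|$ is read as the $\ell_1$ distance between the two action distributions, which is an assumption about the notation. The function names are hypothetical.

```python
from fractions import Fraction


def excess_robust_regret(p1, gap, delta):
    """Expression (41) for a policy that plays a_1 with probability p1."""
    opt1 = (delta + gap) / (2 * delta)  # robust-optimal pi(a_1), from (37)
    opt2 = (delta - gap) / (2 * delta)  # robust-optimal pi(a_2), from (37)
    over1 = max(p1 - opt1, 0)           # positive part for arm a_1
    over2 = max((1 - p1) - opt2, 0)     # positive part for arm a_2
    return over1 * (delta - gap) + over2 * (delta + gap)


def lower_bound(p1, gap, delta):
    """(delta - gap) * |pi(a_1) - pi_tilde(a_1)|, i.e. half the l1 distance
    times (delta - gap) under the l1 reading assumed above."""
    opt1 = (delta + gap) / (2 * delta)
    return (delta - gap) * abs(p1 - opt1)


def bound_holds_on_grid(gap, delta, steps=60):
    """Check excess_robust_regret >= lower_bound for p1 in {0, 1/steps, ..., 1}."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    return all(
        excess_robust_regret(p, gap, delta) >= lower_bound(p, gap, delta)
        for p in grid
    )
```

With $\text{Gap} = 1/2$ and $\Delta = 3/4$ the factor $\Delta - \text{Gap}$ equals $1/4$, which is where the constant $1/8$ in (42) comes from under the $\ell_1$ reading.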

Let $\pi_2$ be the random policy proposed by the learning algorithm for the second task. We convert the robust learning problem to a testing problem between the two instances. Note that $\pi^{*}(a_1) = 5/6$ and $\bar{\pi}^{*}(a_1) = 5/6 - 2\epsilon/3$.
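
The two values above follow from plugging $\Delta = 3/4$ and the two gaps into (37). A minimal Python sketch using exact rational arithmetic is given below; the function names are hypothetical, and the midpoint it returns is the threshold $5/6 - \epsilon/3$ that separates the two robust-optimal policies.

```python
from fractions import Fraction

DELTA = Fraction(3, 4)  # the value of Delta fixed in the proof


def robust_policy_a1(gap, delta=DELTA):
    """pi_tilde(a_1) from (37); requires 0 < gap < delta."""
    if not 0 < gap < delta:
        raise ValueError("need 0 < gap < delta")
    return (delta + gap) / (2 * delta)


def policies_and_midpoint(eps):
    """Return (pi*(a_1), pi_bar*(a_1), midpoint) for the instances S and S_bar."""
    p = robust_policy_a1(Fraction(1, 2))            # Gap = 1/2
    p_bar = robust_policy_a1(Fraction(1, 2) - eps)  # Gap_bar = 1/2 - eps
    return p, p_bar, (p + p_bar) / 2
```

For example, `policies_and_midpoint(Fraction(1, 10))` returns `Fraction(5, 6)`, `Fraction(23, 30)`, and `Fraction(4, 5)`, matching $5/6$, $5/6 - 2\epsilon/3$, and $5/6 - \epsilon/3$ at $\epsilon = 1/10$.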

The sum of robust simple regrets for the two instances can be lower bounded by

\begin{align}
& \mathbb{E}_{P^{(1)}}[\widetilde{\operatorname{SR}}_2] + \mathbb{E}_{\bar{P}^{(1)}}[\widetilde{\operatorname{SR}}_2] \tag{43} \\
\geq{} & \frac{\epsilon}{24} \mathbb{P}_{P^{(1)}}\left(\pi_2(a_1) \leq \frac{5}{6} - \epsilon/3\right) + \frac{\epsilon}{24} \mathbb{P}_{\bar{P}^{(1)}}\left(\pi_2(a_1) > \frac{5}{6} - \epsilon/3\right) \tag{44} \\
\geq{} & \frac{\epsilon}{24} \exp\left(-\epsilon^2\, \mathbb{E}_{P^{(1)}}[T(a_2)]\right). \tag{45}
\end{align}

Choosing $\epsilon=\sqrt{1/\mathbb{E}_{P^{(1)}}[T(a_{2})]}$ and applying Lemma 2 again, we have

$$
\inf_{L\in\mathcal{L}}\sup_{P^{(1)}}\sqrt{\mathbb{E}[\operatorname{Reg}_{1}]}\;\mathbb{E}[\widetilde{\operatorname{SR}}(\pi_{2}\mid P^{(1)},\Delta)]=\Omega(1).
$$
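As a quick numerical illustration of this choice of $\epsilon$, the following minimal sketch evaluates the right-hand side of (45) at $\epsilon=\sqrt{1/\mathbb{E}_{P^{(1)}}[T(a_{2})]}$. The function names and the sample values of the expected pull count are assumptions made for illustration only. The sketch shows that the bound then equals $1/(24e)$ divided by $\sqrt{\mathbb{E}_{P^{(1)}}[T(a_{2})]}$, so rescaling by that square root gives a constant.

```python
import math


def rhs_of_45(eps: float, expected_pulls: float) -> float:
    """Right-hand side of (45): (eps / 24) * exp(-eps^2 * E[T(a_2)])."""
    return eps / 24.0 * math.exp(-(eps ** 2) * expected_pulls)


def rhs_at_chosen_eps(expected_pulls: float) -> float:
    """Evaluate (45) at eps = sqrt(1 / E[T(a_2)]), the choice made in the text."""
    eps = math.sqrt(1.0 / expected_pulls)
    return rhs_of_45(eps, expected_pulls)


if __name__ == "__main__":
    # Hypothetical expected pull counts, chosen only for illustration.
    for n in (4.0, 100.0, 10_000.0):
        scaled = math.sqrt(n) * rhs_at_chosen_eps(n)
        print(f"E[T(a_2)] = {n:>8.0f}   sqrt(E[T]) * bound = {scaled:.6f}")
```

The rescaled value is the same for every pull count, which is the constant behind the $\Omega(1)$ statement once Lemma 2 relates the pull count to the first-task regret.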

Appendix E Proof of Theorem 3

Theorem 3. Recall that $\mathcal{S}(\bm{S})$ denotes the set of instances that share the same policy space and reward function with a given instance $\bm{S}$. Let $\mathcal{I}$ be an index set. Assume the instance set $\mathcal{S}$ is sufficiently large to ensure the existence of an instance $\bm{S}\in\mathcal{S}$ for which we can find some $\bar{\bm{S}}\in\mathcal{S}(\bm{S})$ such that, for all $i\in\mathcal{I}$ and $\epsilon\in[0,1/4]$, the following conditions hold:

  1. There exists a unique optimal policy $\pi^{\star}_{S}$ for each $S\in\{S^{(1)},\dots,S^{(i)},\bar{S}^{(1)},\dots,\bar{S}^{(i)}\}$;

  2. $\mu_{\pi_{S^{(j)}}^{\star}}(x_{\star},a_{\star})=\mu_{\pi_{\bar{S}^{(j)}}^{\star}}(x_{\star},a_{\star})=0$ and $\min\{\operatorname{SR}_{S^{(j)}}(\pi),\operatorname{SR}_{\bar{S}^{(j)}}(\pi)\}\geq c_{1}\mu_{\pi}(x_{\star},a_{\star})$ for all $\pi\in\Pi^{(j)}$ and $j\in[i-1]$;

  3. $\mu_{\pi^{\star}_{S^{(i)}}}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(i)}}}(x_{\star},a_{\star})>c_{2}$;

  4. $\operatorname{SR}_{S^{(i)}}(\pi)\geq\epsilon\,(\mu_{\pi^{\star}_{S^{(i)}}}(x_{\star},a_{\star})-\mu_{\pi}(x_{\star},a_{\star}))/2$ for all $\pi\in\Pi^{(i)}$;

  5. $\operatorname{SR}_{\bar{S}^{(i)}}(\pi)\geq\epsilon\,(\mu_{\pi}(x_{\star},a_{\star})-\mu_{\pi^{\star}_{\bar{S}^{(i)}}}(x_{\star},a_{\star}))/2$ for all $\pi\in\Pi^{(i)}$;

  6. $P$ and $\bar{P}$ only differ in $(x_{\star},a_{\star})$ and $D_{\operatorname{KL}}(P(\cdot\mid x_{\star},a_{\star})\mid\bar{P}(\cdot\mid x_{\star},a_{\star}))\leq\epsilon^{2}$,

where $c_{1},c_{2}>0$ are constants and $(x_{\star},a_{\star})\in\mathcal{X}\times\mathcal{A}$ is some context-action pair. Then we have the following lower bound:

$$
\inf_{L\in\mathcal{L}_{\mathcal{I}}}\min_{i\in\mathcal{I}}\sup_{S\in\mathcal{S}}\sqrt{\mathbb{E}\left[\sum_{j=1}^{i-1}\operatorname{Reg}_{j}\right]}\;\mathbb{E}\left[\operatorname{Reg}_{i}\right]/T_{i}=\Omega\left(c_{1}c_{2}\right).
$$
Proof.

The proof of Theorem 3 extends that of Theorem 1 by showing the following statement: for each $i\in\mathcal{I}$,

$$
\mathbb{E}_{\bm{S}}\left[\sum_{j=1}^{i-1}\operatorname{Reg}_{j}\right]\geq c_{1}\,\mathbb{E}_{\bm{S}}[T(x_{\star},a_{\star})], \tag{46}
$$

where $T(x_{\star},a_{\star})\coloneqq\sum_{j=1}^{i-1}\sum_{t=1}^{T_{j}}\mathbbm{1}_{X_{j,t}=x_{\star},A_{j,t}=a_{\star}}$. Following the same steps as in the proof of Theorem 1, we can show that for all $i\in\mathcal{I}$ and every learning algorithm $L\in\mathcal{L}_{\mathcal{I}}$,

$$
\max_{\bm{S}^{\prime}\in\{\bm{S},\bar{\bm{S}}\}}\sqrt{\mathbb{E}_{\bm{S}^{\prime}}\left[\sum_{j=1}^{i-1}\operatorname{Reg}_{j}\right]}\;\mathbb{E}_{\bm{S}^{\prime}}\left[\operatorname{Reg}_{i}\right]/T_{i}=\Omega\left(\sqrt{c_{1}^{2}c_{2}^{2}}\right), \tag{47}
$$

from which we immediately obtain Theorem 3. ∎
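To make the counter $T(x_{\star},a_{\star})$ in (46) concrete, the following minimal sketch computes it from a logged interaction history. The data layout (a list of tasks, each a list of context-action pairs) and the name `count_visits` are assumptions made for illustration and do not come from the paper.

```python
from typing import Hashable, Sequence, Tuple

# One logged step: (context X_{j,t}, action A_{j,t}).
Step = Tuple[Hashable, Hashable]


def count_visits(
    tasks: Sequence[Sequence[Step]],
    i: int,
    x_star: Hashable,
    a_star: Hashable,
) -> int:
    """Count visits to (x_star, a_star) over tasks 1, ..., i-1 (1-indexed).

    This mirrors the double sum over j = 1..i-1 and t = 1..T_j of the
    indicator that X_{j,t} = x_star and A_{j,t} = a_star.
    """
    return sum(
        1
        for task in tasks[: i - 1]
        for (x, a) in task
        if x == x_star and a == a_star
    )


if __name__ == "__main__":
    # Hypothetical history with three tasks of different lengths.
    history = [
        [("x1", "a1"), ("x2", "a2")],
        [("x2", "a2"), ("x2", "a1")],
        [("x2", "a2")],
    ]
    print(count_visits(history, i=3, x_star="x2", a_star="a2"))
```

Only the tasks before task $i$ enter the count, which is why (46) bounds the regret accumulated before task $i$ rather than the regret of task $i$ itself.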

Appendix F Proof of Proposition 2

Proof.

The conditions in Theorem 3 require that the optimal policy for task $i$ has occupancy measure $\mu_{\pi^{\star}_{S^{(i)}}}(x_{\star},a_{\star})>0$, while the optimal policies of all previous tasks $j$ have occupancy measure $\mu_{\pi^{\star}_{S^{(j)}}}(x_{\star},a_{\star})=0$. The longest sequence of policies $\pi$ whose occupancy measures have non-overlapping supports is no longer than $|\mathcal{X}||\mathcal{A}|$. A trivial instance of length $|\mathcal{X}|(|\mathcal{A}|-2)$ can be constructed, although it is of limited practical interest. Index the contexts in $\mathcal{X}$ by $x_{1},\dots,x_{|\mathcal{X}|}$ and the actions in $\mathcal{A}$ by $a_{1},\dots,a_{|\mathcal{A}|}$.
Split the $|\mathcal{X}|(|\mathcal{A}|-2)$ tasks into $|\mathcal{X}|$ groups of size $|\mathcal{A}|-2$. The reward functions in group $j$ are designed such that the optimal policy selects arm $a_{1}$ at every context $x_{l}\in\mathcal{X}$ with $l\neq j$. The tasks in group $j$ share the same reward function, and the mean reward satisfies $R^{(i)}(x_{j},a_{2})>\dots>R^{(i)}(x_{j},a_{|\mathcal{A}|})$. The policy spaces are designed such that $\Pi_{j,i}=\{\pi:\pi(x_{j},a_{k})=0,\text{ for all }k<i\}$, i.e., the $i$-th task in group $j$ is not allowed to choose the first $i-1$ actions. It can be verified that this construction satisfies the conditions in Theorem 3. ∎
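The policy-space part of this construction can be enumerated with a short sketch. It lists, for each of the $|\mathcal{X}|(|\mathcal{A}|-2)$ tasks, the group index $j$, the within-group index $i$, and the actions $a_{k}$ with $k<i$ that are excluded at context $x_{j}$. The function name and the dictionary layout are assumptions made for illustration. The sketch does not build the reward functions or check the conditions of Theorem 3.

```python
from typing import Dict, List


def enumerate_policy_constraints(num_contexts: int, num_actions: int) -> List[Dict]:
    """List the excluded actions for each task in the sequence.

    Tasks are ordered group by group. Task i of group j may not play
    actions a_1, ..., a_{i-1} at context x_j (indices are 1-based).
    """
    if num_actions < 3:
        raise ValueError("need at least three actions so that groups are non-empty")
    tasks = []
    for j in range(1, num_contexts + 1):          # group index, one per context
        for i in range(1, num_actions - 2 + 1):   # task index within the group
            tasks.append(
                {
                    "group": j,
                    "index_in_group": i,
                    "restricted_context": j,
                    "excluded_actions": list(range(1, i)),  # a_k with k < i
                }
            )
    return tasks


if __name__ == "__main__":
    # Hypothetical sizes: 3 contexts and 5 actions give 3 * (5 - 2) = 9 tasks.
    for task in enumerate_policy_constraints(num_contexts=3, num_actions=5):
        print(task)
```

Each successive task in a group loses one more action at the restricted context, so the best remaining action at that context changes from task to task.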

Appendix G Proof of Theorem 4

Proof.

The hard instance is constructed as follows. Consider a nonlinear bandit problem with $\mathcal{A}=S^{d-1}$, the unit sphere in $\mathbb{R}^{d}$. We first define the reward function as

$$
R(\theta_{2})=\alpha_{1}\langle\theta_{1},a\rangle+\alpha_{2}\left(\langle\theta_{2},a\rangle-\epsilon\right)^{+},
$$

where $\theta_{1}\in S^{d-1}$, $\alpha_{1}>0$ and $\alpha_{2}>0$ are known parameters and $\theta_{2}\in S^{d-1}$ is unknown. Assume that the true parameter $\theta_{2}^{*}$ satisfies $\langle\theta_{1},\theta_{2}^{*}\rangle<0$, i.e., $\theta_{1}$ and $\theta_{2}^{*}$ lie in different hemispheres. Furthermore, let $\alpha_{2}=2\alpha_{2}/(1-\epsilon)$.

Let $\{\mathcal{A}_{1},\dots,\mathcal{A}_{N}\}$ be an $\epsilon$-packing of the subset $\{a\in\mathcal{A}:\langle\theta_{1},a\rangle<0\}$. Let the allowed policy space for task $i$ be $\Pi^{(i)}=\{\pi:\pi\text{ is supported on }\mathcal{A}_{i}\cup\{\theta_{1}\}\}$. Specifically, order $\{\mathcal{A}_{1},\dots,\mathcal{A}_{N}\}$ such that $\theta_{2}^{*}\in\mathcal{A}_{N}$.

Note that $R(\theta_{2})$ belongs to the family of one-layer neural networks with ReLU activation.
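The reward function defined above can be written as a linear term plus a single shifted ReLU unit, as in the following minimal sketch. The function names and all numerical values (the dimension, $\alpha_{1}$, $\alpha_{2}$, $\epsilon$, and the vectors) are assumptions chosen for illustration and are not the parameter values used in the proof.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product of two vectors of equal length."""
    return sum(ui * vi for ui, vi in zip(u, v))


def reward(
    a: Sequence[float],
    theta1: Sequence[float],
    theta2: Sequence[float],
    alpha1: float,
    alpha2: float,
    eps: float,
) -> float:
    """Mean reward alpha1 * <theta1, a> + alpha2 * (<theta2, a> - eps)^+."""
    linear_part = alpha1 * dot(theta1, a)
    relu_part = alpha2 * max(dot(theta2, a) - eps, 0.0)  # ReLU with bias -eps
    return linear_part + relu_part


if __name__ == "__main__":
    # Illustrative values only (d = 2); theta2 lies opposite to theta1.
    theta1, theta2 = (1.0, 0.0), (-1.0, 0.0)
    alpha1, alpha2, eps = 1.0, 2.5, 0.2
    for action in (theta1, (0.0, 1.0), theta2):
        print(action, reward(action, theta1, theta2, alpha1, alpha2, eps))
```

For any action $a$ with $\langle\theta_{2},a\rangle\leq\epsilon$ the ReLU term vanishes, so the observed reward does not depend on $\theta_{2}$ at all. Information about $\theta_{2}$ is therefore only available from actions close to it.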

We first observe that the optimal policy for task $i$ is $\pi^{*}_{i}=\delta(\theta_{1})$ for all $i=1,\dots,N-1$. Note that, when running UCB, the algorithm will optimistically choose $a\in\mathcal{A}_{i}$ because it does not know whether $\theta_{2}^{*}\in\mathcal{A}_{i}$ for each $i=1,\dots,N-1$, thus leading to a constant regret for all tasks $i<N$. ∎