The Fallacy of Minimizing Local Regret in the Sequential Task Setting
Abstract
In the realm of Reinforcement Learning (RL), online RL is often conceptualized as an optimization problem, where an algorithm interacts with an unknown environment to minimize cumulative regret. In a stationary setting, strong theoretical guarantees, like a sublinear (√T) regret bound, can be obtained, which typically implies convergence to an optimal policy and the cessation of exploration. However, these theoretical setups often oversimplify the complexities encountered in real-world RL implementations, where tasks arrive sequentially with substantial changes between tasks and the algorithm may not be allowed to adaptively learn within certain tasks. We study changes beyond the outcome distributions, encompassing changes in the reward designs (mappings from outcomes to rewards) and the permissible policy spaces. Our results reveal the fallacy of myopically minimizing regret within each task: obtaining optimal regret rates in the early tasks may lead to worse rates in the subsequent ones, even when the outcome distributions stay the same. To realize the optimal cumulative regret bound across all the tasks, the algorithm has to over-explore in the earlier tasks. This theoretical insight is practically significant, suggesting that due to unanticipated changes (e.g., rapid technological development or human-in-the-loop involvement) between tasks, the algorithm needs to explore more than it would in the usual stationary setting within each task. This implication resonates with the common practice of using clipped policies in mobile health clinical trials and maintaining a fixed rate of ε-greedy exploration in robotic learning.
1 Introduction
Regret minimization has emerged as a central topic of online Reinforcement Learning (RL) research (Auer et al., 2008; Chu et al., 2011; Agrawal and Goyal, 2013; Lattimore and Szepesvári, 2020). In a static environment, prioritizing regret minimization is theoretically sound. A sublinear regret bound directly yields a lower bound on cumulative rewards, and it implies convergence to optimal policies for environments with a lower-bounded suboptimality gap. Any algorithm with a sublinear regret bound is designed to cease exploration upon acquiring sufficient information about the underlying environment. Nevertheless, this cessation of exploration, while theoretically ideal, poses challenges in real-world RL implementations.
Real-world RL tasks differ significantly from the static setup in two key aspects. First, many real-world RL tasks arrive sequentially with substantial changes between tasks. This is evident in many applications, including mobile health, where RL algorithms are trained to personalize digital interventions (Liao et al., 2020; Bidargaddi et al., 2020; Trella et al., 2022), and online education, where RL learns an automated pedagogical strategy that optimizes students’ performance (Aleven et al., 2023; Ruan et al., 2023). These applications, along with others like inventory management (Madeka et al., 2022) and online charitable giving experiments (Athey et al., 2022), often undergo significant changes between tasks. Previous literature often studies changes in outcome distributions (Garivier and Moulines, 2008; Raj and Kalyani, 2017), while our study encompasses changes in reward design in healthcare due to emerging side effects, or changes in permissible policy spaces driven by technological advancements and concerns around fairness or privacy. These changes are often the result of high-level human decision-making and are thus unpredictable. Second, in many of these domains, particularly those involving human interactions, it is necessary to deploy static or nonadaptive policies within certain tasks. Though adaptive experimental design has emerged as a powerful tool for collecting data efficiently for subsequent tasks, many funding agencies are more familiar with, or restricted to, static experimental designs (Zanette et al., 2021). Moreover, ethical or safety considerations often mandate the use of nonadaptive approaches (Madeka et al., 2022).
It is prevalent in the field to apply naive regret minimization algorithms within each task (i.e., local regret minimization), often overlooking the sequential nature of real-world problems. A main message of this paper is that unanticipated changes between tasks, such as rapid technological development or human-in-the-loop involvement, necessitate more exploration than is normally sufficient in the stationary setting. Complete exploitation within a task could lead to worse performance in later tasks.
To understand the limitations of local regret minimization, note that sublinear regret often implies convergence to an optimal policy, leading to complete exploitation. This focus on local regret minimization can result in an online dataset that is inadequate for subsequent tasks, especially in the presence of changes between tasks. We elucidate this concept through Example 1, which involves a sequence of two contextual bandit tasks. The two tasks share the same reward distribution, while the policy spaces are different. In the first task, the algorithm is allowed to make decisions based on the current context, leading to the optimal policy of choosing action in context and in context . In contrast, the second task has to adopt context-free policies, leading to the optimal policy that takes action for all contexts. We further require the algorithm not to learn adaptively in the second task. Therefore, the data collected from the first task is used to estimate an optimal policy to deploy for the second task. The online dataset collected by the first task’s optimal policy lacks visitations in and , hindering good policy learning for the second task. That is, minimizing the regret in the first task leads to worse regret in the second task. Our simulation experiment validates this hypothesis. Figure 2 compares the average regret in the first task and the simple regret in the second task by running the UCB (Upper Confidence Bound) algorithm (Auer, 2002) (in blue) and UCB mixed with probability 0.1 random exploration (in green) on the tasks described in Example 1. UCB shows a diminishing average regret in the first task with a constant simple regret for the second task. Mixed with 0.1 random exploration, UCB receives a constant average regret, while the simple regret goes to zero, revealing the expected trade-off between regrets in the two tasks.
Example 1.
Consider two contextual bandit tasks with two arms and a context space of size two. The two tasks share the same reward distribution, with the mean reward for each context-arm pair reported in Table 2. In the first task, the algorithm is allowed to use any policy and can update its policy adaptively. In the second task, the algorithm cannot update its policy and is forced to ignore the context, meaning that the second task is a non-contextual bandit.
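The two-task setup of Example 1 can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions: the mean rewards in `MU` are hypothetical placeholders (not the values of the paper's table), rewards are assumed Bernoulli, contexts are assumed uniform, and the per-context UCB1 bonus is one common choice rather than necessarily the variant used in the paper's experiment. The function name `run_two_tasks` is introduced here for illustration.

```python
import math
import random

# Hypothetical mean rewards MU[context][arm]; NOT the values from the paper's table.
# Contextual optimum: arm 0 in context 0, arm 1 in context 1.
# Context-free optimum: arm 0 (average 0.55 versus 0.35).
MU = [[0.8, 0.2], [0.3, 0.5]]


def run_two_tasks(T1, eps, seed=0):
    """Run eps-mixed UCB1 on task 1, then commit to a context-free arm for task 2.

    Returns (average regret in task 1, simple regret in task 2, visit counts).
    """
    rng = random.Random(seed)
    counts = [[0, 0], [0, 0]]        # visits per (context, arm)
    sums = [[0.0, 0.0], [0.0, 0.0]]  # summed rewards per (context, arm)
    regret = 0.0
    for t in range(1, T1 + 1):
        x = rng.randrange(2)                      # uniform context (assumption)
        if rng.random() < eps:
            a = rng.randrange(2)                  # forced uniform exploration
        elif 0 in counts[x]:
            a = counts[x].index(0)                # pull each arm once per context
        else:
            ucb = [sums[x][k] / counts[x][k]
                   + math.sqrt(2.0 * math.log(t) / counts[x][k])
                   for k in range(2)]
            a = ucb.index(max(ucb))
        reward = 1.0 if rng.random() < MU[x][a] else 0.0   # Bernoulli reward
        counts[x][a] += 1
        sums[x][a] += reward
        regret += max(MU[x]) - MU[x][a]

    def estimate(x, a):
        # Fall back to 0.5 when a context-arm pair was never visited.
        return sums[x][a] / counts[x][a] if counts[x][a] else 0.5

    # Task 2: deploy the context-free arm that looks best on task-1 data.
    est_value = [0.5 * (estimate(0, a) + estimate(1, a)) for a in range(2)]
    chosen = est_value.index(max(est_value))
    true_value = [0.5 * (MU[0][a] + MU[1][a]) for a in range(2)]
    return regret / T1, max(true_value) - true_value[chosen], counts
```

Calling `run_two_tasks(T1, 0.0)` and `run_two_tasks(T1, 0.1)` over several seeds and horizons gives a rough way to compare plain UCB with its randomized mixture; the returned visit counts show how often the context-arm pairs that are suboptimal in the first task were sampled.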
We rigorously characterize this trade-off between local regret minimization and global regret minimization, when there are changes in reward design and/or the policy space between tasks. We formulate a sequential contextual bandit framework with a discrete context space and action space . Let and denote the cumulative regret and number of time-steps in task , respectively. We begin our analysis with a two-task scenario. Theorem 1 establishes a general condition on the changes under which the minimax rate of when the algorithm is not allowed to be adaptive within the second task. We show that a simple strategy that mixes any no-regret online algorithm with random exploration attains the minimax rate that is optimal in and . Our results show that for minimizing the combined regrets , an algorithm should aim for in case of , demonstrating excess exploration compared with the optimal local regret rate . Subsequent case studies explore the satisfaction of these conditions in the presence of nonstationarity in the policy space and reward function.
We continue with three extensions of the two-task scenario. We first extend the results to the multiple-task case, showing that the minimax rate of simultaneously for all tasks , within which adaptivity is not allowed. The maximal sequence length for which this lower bound is valid across all adjacent tasks is determined by the product . We further extend our discussion to the nonlinear case, where we show that such a sequence of tasks can be exponentially long, suggesting a strong motivation to apply excess exploration in real-world RL applications. The third extension considers a more prevalent notion of changes between tasks: nonstationarity in reward distributions. We show that under a new notion of robust simple regret, the aforementioned tension between local regret minimization and global regret minimization still holds when there are changes in reward distribution.
1.1 Related Work
In a contextual bandit setting, the simple regret is simply the expected cumulative regret divided by the length of the horizon, when the policy is fixed. Our two-task case study extends the existing literature on optimizing both cumulative regret and simple regret by introducing non-stationarity. We denote the cumulative regret by and simple regret by . In a multi-armed bandit setting, Bubeck et al. (2011) show a trade-off of , where is the minimal gap and is a constant. This bound is substantially weaker than our bound as . Krishnamurthy et al. (2023) study this trade-off in the standard contextual bandit setting, where they show that any learning algorithm that achieves a worst-case simple regret bound has a lower bounded minimax rate of for cumulative regret. Beyond the standard stationary setting, Simchi-Levi and Wang (2023) study the trade-off between cumulative regret and the statistical power of inferring the treatment effect in a stationary multi-armed bandit problem. Similar results are shown in Gao et al. (2022) and Dai et al. (2023). They show a similar minimax lower bound, . Qin and Russo (2023) study the multi-armed bandit problem with a different cost of exploration for different arms. This is equivalent to having a different reward function from the reward for calculating simple regret. From an empirical point of view, Athey et al. (2022) propose the TreeBagging algorithm, which controls the level of exploration in online charitable giving. They observe that the uniform randomization algorithm learned a policy with the lowest simple regret while receiving the highest cumulative regret over a sequence of 10 implementations.
Our framework can also be seen as a generalization of the previous nonstationary bandit setting. Previous works on non-stationary bandits consider only changes in reward distribution (Garivier and Moulines, 2008; Raj and Kalyani, 2017; Wu et al., 2018; Kim and Tewari, 2020; Hong et al., 2023). We introduce a broader definition of non-stationarity that includes changes in the reward function and policy space. When the reward distribution does change, we propose a new metric named robust simple regret to accommodate the fact that no algorithm can minimize simple regret to 0 even with infinite data, which is new in the literature (Section 5).
A main message of this paper is that the RL algorithm should never stop exploring or learning when there is non-stationarity. Conceptually, this is also the main characteristic of a continual learning agent (Abel et al., 2023). The literature on continual learning has primarily focused on learning agents that never forget (Lange et al., 2019; Peng et al., 2023; Wang et al., 2023). However, in an RL context, the algorithm should strategically forget previously learned knowledge when the outcome distribution changes drastically and maintain the right level of continual exploration to obtain new information (Khetarpal et al., 2020).
2 Problem Setups
Notations.
For a set , we denote by the set of all distributions over . For , we let . We use , , to denote the big-O, big-Theta and big-Omega notations. For a vector , we let be the -th element of the vector. We denote by , the KL divergence between two probability measures and with .
Sequential multitask contextual bandit framework.
We consider learning on a sequence of contextual bandit tasks. Different from the standard online contextual bandit setting, each task is a contextual bandit with rich observations and potentially restricted policy space. Specifically, we define each task as a tuple of , where the shared elements are the context space , the action space , the potentially high-dimensional outcome space , and the outcome distribution . The task-specific elements include the restricted policy space that is a collection of mappings from the current context to a distribution over the action space. We also allow different tasks to have different reward functions . For simplicity, we assume that there is a fixed context distribution across all tasks. Since are assumed to be shared across different tasks, we denote a task setup by , when are clear from the context. Note that the restricted policy spaces and the reward functions are assumed to be known by the agent before interacting with the environment, while the outcome distribution is unknown and has to be learned.
The agent interacts with each task for steps. At the step , the agent observes context , and decides a policy . Then the agent samples an action and receives a feedback vector . The goal of the agent for task is to maximize the cumulative rewards . The optimal policy of a task is given by
Note that we consider the same outcome distribution at this moment to focus on the study of changes in policy spaces and reward functions. In Section 5, we extend our discussion to a sequence of tasks where the underlying outcome distribution could shift between tasks.
Motivations for changes between tasks.
The setups for different tasks may change in various ways. We provide the following motivating examples on the change of across tasks.
1. Different tasks may set up different reward functions . For instance, in study one, we are only interested in maximizing cumulative rewards , where . At the end of the first task, the domain expert may decide that is a significant side effect that should be controlled. Therefore, they propose for the second task.
2. Different tasks may set up different target policy classes . In practice, the context space can be a high-dimensional vector, due to the complexity of real-world observations. In the first task, we may not have the computational resources to maximize over the space of all policies that take context into account. Hence, the domain expert decides to optimize only in the space of context-independent policies . In the second task, with evidence accumulating, the expert decides that certain components of are relevant to the task, which should be included in the new policies. In a reversed case, the domain expert may decide that a certain feature is irrelevant or raises fairness concerns and should be removed from the feature space.
2.1 Performance metric
The cumulative regret of task is given by Definition 1. The goal of an agent that learns on the sequence of tasks is to minimize the sum of cumulative regret over all the tasks. We refer to the cumulative regret of each task as the local regret and to the cumulative regret over all the tasks as the global regret. Throughout the paper, we discuss the tension between the local regret and the global regret under the aforementioned changes in task setups.
Definition 1 (Cumulative regret).
Denote the mean reward of task given context and action by . The cumulative regret within task is defined by
A significant part of the paper discusses tasks in which the algorithm is not allowed to adaptively learn, thereby requiring a commitment to a fixed policy for the entirety of each task. In such scenarios, minimizing the cumulative regret is equivalent to minimizing the simple regret defined in Definition 2.
Definition 2 (Simple regret).
Define the simple regret of policy given a task with mean reward function by
Note that can potentially be negative depending on how the policy space is defined. However, for any policy .
Definition 3 (Occupancy measure).
We define as the occupancy measure of a running policy . Note that since we assume the same context distribution , occupancy measure of a policy is task independent.
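As a concrete illustration of Definitions 2 and 3 in the tabular case, the following sketch computes an occupancy measure and a simple regret. It rests on assumptions: the occupancy measure is taken to be the product of the context probability and the policy's action probability, the policy class is represented as a finite list of policies, and the simple regret is taken to be the gap between the best value attainable in that class and the value of the evaluated policy. The function names and the numbers in the usage note are hypothetical.

```python
def occupancy(policy, context_dist):
    """Occupancy measure d[x][a] = P(x) * pi(a | x), as nested lists indexed [x][a]."""
    return [[px * pa for pa in policy[x]] for x, px in enumerate(context_dist)]


def value(policy, mean_reward, context_dist):
    """Expected one-step reward of a policy under the context distribution."""
    d = occupancy(policy, context_dist)
    return sum(d[x][a] * mean_reward[x][a]
               for x in range(len(d)) for a in range(len(d[x])))


def simple_regret(policy, policy_class, mean_reward, context_dist):
    """Best value within policy_class minus the value of `policy` (assumed form).

    The result can be negative when `policy` lies outside `policy_class`.
    """
    best = max(value(p, mean_reward, context_dist) for p in policy_class)
    return best - value(policy, mean_reward, context_dist)
```

For example, with a uniform context distribution `[0.5, 0.5]`, hypothetical mean rewards `[[0.8, 0.2], [0.3, 0.5]]`, and a class containing only the two context-free deterministic policies, the contextual policy `[[1.0, 0.0], [0.0, 1.0]]` has negative simple regret relative to that class, while every policy inside the class has non-negative simple regret.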
2.2 Learning algorithms
This section is motivated by the practical considerations that some tasks may require nonadaptive algorithms. We introduce a formal definition to clarify what constitutes an algorithm and its nonadaptive nature in this context.
Let denote the sequence of observations up to the -th step in task . A learning algorithm is a mapping from the previous observations to a policy in at any given step in task . The agent running algorithm randomly samples an action . Furthermore, an algorithm is said to be nonadaptive in task , if it is nonadaptive to any new data collected during task as described in Definition 4.
Definition 4 (Non-adaptive algorithm).
We call an algorithm non-adaptive in task if for all steps , within a task, where , and for all sequences of observations and , such that is a prefix of , the algorithm satisfies
Denote by the set of all learning algorithms that are nonadaptive in task . For a set of task indices , we denote by the set of algorithms that are simultaneously nonadaptive on all tasks in .
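One way to picture a nonadaptive algorithm is as a wrapper that computes a policy once from the data available at the start of a task and then ignores anything collected during the task. The sketch below is only an illustration of that idea: the class `NonAdaptiveWrapper`, the toy `greedy_arm` algorithm, and the representation of a history as a list of `(arm, reward)` pairs are all hypothetical and not part of the paper's formalism.

```python
class NonAdaptiveWrapper:
    """Freeze the policy an algorithm proposes at the start of a task."""

    def __init__(self, algorithm):
        self.algorithm = algorithm   # callable: history (list) -> policy
        self._frozen = None

    def start_task(self, history):
        # Data from earlier tasks may be used; data from the new task is not.
        self._frozen = self.algorithm(list(history))

    def policy(self, history):
        # The argument is deliberately ignored: longer histories within the
        # task map to the same policy as their prefixes.
        if self._frozen is None:
            raise RuntimeError("call start_task before requesting a policy")
        return self._frozen


def greedy_arm(history):
    """Toy algorithm (hypothetical): arm with the highest empirical mean reward."""
    totals, counts = {}, {}
    for arm, reward in history:
        totals[arm] = totals.get(arm, 0.0) + reward
        counts[arm] = counts.get(arm, 0) + 1
    if not counts:
        return 0
    return max(counts, key=lambda arm: totals[arm] / counts[arm])
```

The unwrapped `greedy_arm` would change its output as new rewards arrive, whereas the wrapped version returns the same arm for every step of the task.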
3 Results on Two Tasks
We commence with a two-task case, where the first task involves running an online learning algorithm focused on minimizing the regret within the task. In the second task, we employ a fixed policy that is learned offline from the dataset collected in the first task. This scenario is closely related to the contextual bandit setting, where the algorithm aims at minimizing cumulative regret and simple regret simultaneously, with the difference in allowing changes between tasks.
In this section, we show that there is a trade-off between the regrets in two tasks under various changes in task setups. The trade-off is substantially stronger than in the cases without changes. In fact, this stronger trade-off has been shown in some special cases. For instance, in a multi-armed bandit setting, Simchi-Levi and Wang (2023) simultaneously minimize the cumulative regret and the average treatment effect (ATE) estimation error of the worst arm. This setting is inherently analogous to ours, since the ATEs of all the arms are essential for achieving a low simple regret with arbitrary changes in the policy space. They show that the product of the cumulative regret and the square of the worst-case estimation error is lower bounded by a constant in a minimax sense. We prove a similar lower bound result on a more general case with a wider range of changes in , . Our results provide a more comprehensive view of this tension between cumulative regret and simple regret.
We denote an instance by , a sequence of two tasks. For an instance , we denote by the set of all instances that share the same policy spaces and reward functions , while having different . For some instance set , we study the following minimax multi-objective optimization problem:
(1) |
In order to show a strong lower bound for the above multi-objective problem, we need the instance set to be adequately rich. Theorem 1 provides general conditions for that characterize a strong trade-off between the cumulative regrets in two tasks.
Theorem 1.
Assume the instance set is sufficiently large to ensure the existence of an instance such that for all , we can find some satisfying the following conditions:
1. There exists a unique optimal policy for each ;
2. and for all ;
3. ;
4. for all ;
5. for all ;
6. and only differ in and ,
where are some constants and is some context-action pair. Then we have the following lower bound:
Discussion of Theorem 1.
For sufficiently large , a no-regret online algorithm tends to converge to the optimal policy (assuming there exists a unique one and there is a lower bounded suboptimality gap). The dataset collected from online learning of the first task may not be sufficient for the goal of offline policy optimization for the second task, when . This connects closely to the offline learning literature, where it has been shown that offline learning is fundamentally hard if the single-policy concentrability is unbounded (Chen and Jiang, 2019). Single-policy concentrability is the ratio between the occupancy measures of the behavior policy that collects the offline dataset and the optimal policy. Condition 2 guarantees that regret is incurred whenever is visited in the first task, and Conditions 3, 4 and 5 guarantee that it is necessary to visit to distinguish between and , which creates the tension between the regrets of two tasks.
3.1 Case studies on Theorem 1 application
In this section, we explore three case studies with potential practical interests to demonstrate how the conditions specified in Theorem 1 can be satisfied by including potential changes in and . To ease the demonstration, we consider uniform distribution over context space . The first two cases are two-task contextual bandit problems with and . The third case is an MAB problem with .
Case I: adding a new feature.
In real-world implementations, some features that were excluded from the input may be added back in later tasks. To conceptualize this, we consider as the set of policies where decision-making is independent of current features, formalized as . Conversely, let encompass all possible policies. The mean reward induced by outcome distribution and is given in Table 4. For all positive , . Any policy with a non-zero occupancy measure on will have the same occupancy measure on , leading to simple regret of . However, the optimal policies of and disagree on . To distinguish from , the algorithm is forced to visit in the first task. The conditions in Theorem 1 are satisfied with and .
1- | 1- | 1- | 1- | ||
---|---|---|---|---|---|
0 | 1 | 0 | 1-2 |
1 | 0 | 1 | |||
---|---|---|---|---|---|
1 | 1 |
1 | 0.5 | 1 | 0.5 | |
0.8 | 0.5-/2 | 0.8 | 0.5+/2 |
Case II: removing an old feature.
Some features may have to be removed over a sequence of implementations due to potential ethical issues. To model this, we let be the set of all policies and . Table 4 demonstrates a pair of instances, where the optimal policies for the first task always select action under context and action under context . Any non-zero occupancy measure on induces an instant regret of at least for both and . Nevertheless, the optimal actions for and are and , respectively, and the first task must have coverage on to distinguish between and . The conditions in Theorem 1 are satisfied with and .
Case III: change of reward function.
The reward functions may change over a sequence of implementations. For instance, let outcome , where is the primary outcome we intend to maximize and is a potential side effect. In the first implementation, we aim at maximizing the primary outcome with reward function . However, the domain expert may realize that is a strong side effect, which should be controlled in the second task, and thus, they set the reward function as . With this motivation, we construct a pair of multi-armed bandit instances with no context. In Table 4, we report the mean reward of different arms under different combinations of , , and . In the first task, a regret of is incurred whenever is pulled, while sufficiently many pulls of are necessary to distinguish from . It can be verified that the conditions in Theorem 1 hold with and .
3.2 Optimal level of exploration
As implied by Theorem 1, any algorithm that achieves an optimal rate in is suboptimal in . To trade off between the two goals, the algorithm needs to employ additional exploration in the first task. In this section, we characterize the optimal level of additional exploration in different regimes of and . Since we primarily focus on the role of horizons, we omit the dependence on and throughout the discussions in this section.
Recall that our primary goal is to minimize global regret, that is, the sum of cumulative regrets of two tasks. Proposition 1 suggests a minimax lower bound for the sum of cumulative regrets of two tasks that is the maximum of three terms: and . The term corresponds to the case where dominates , and the minimax rate of simple regret in the second task for any dataset collected during the first task is . The second term corresponds to the rate characterized by Theorem 1. The last term of is the minimax rate of the first task regret minimization. This corresponds to the case when dominates .
Proposition 1.
Following the same conditions on the instance set as in Theorem 1, the following minimax lower bound holds
(2) |
We show that a simple algorithm that mixes a minimax-optimal online learning algorithm with a purely random exploration has upper bounded global regret that matches the lower bound in Theorem 1 up to a factor of . This also allows us to achieve any point on the Pareto frontier up to a factor of . The parameter controls the level of additional exploration in the first task.
Theorem 2.
Let be an online learning algorithm with a regret bound of on the first task. Let the algorithm for the first task be , where is any sequence of past observations and is the uniform random policy. For any choice of , there exists an offline-learning algorithm for the second task such that
(3) |
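The mixing strategy in Theorem 2 can be sketched as follows, assuming the mixture takes the common form of playing the base algorithm's action distribution with probability one minus the exploration rate and a uniformly random action otherwise. The helper names `mix_with_uniform` and `sample_action` and the parameter name `eps` are introduced here for illustration only.

```python
import random


def mix_with_uniform(base_probs, eps):
    """Return (1 - eps) * base_probs + eps * uniform over the same actions."""
    k = len(base_probs)
    return [(1.0 - eps) * p + eps / k for p in base_probs]


def sample_action(probs, rng):
    """Draw one action index from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Example: a base algorithm that has converged to always playing action 0.
mixed = mix_with_uniform([1.0, 0.0], eps=0.1)
action = sample_action(mixed, random.Random(0))
```

Under this form, every action keeps probability at least `eps / k` regardless of what the base algorithm has learned, which is the same floor that clipped policies enforce; setting `eps = 0` recovers the base algorithm and `eps = 1` gives pure uniform exploration.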
By tuning the exploration rate , we are able to match the minimax lower bound provided in (2). In short, there are three regimes of , for which we should choose different levels of exploration rate to balance and . The first regime is when , where the first task is too short compared to the second task, and the algorithm should employ pure exploration in the first task (). This regime leads to a global regret of . In an intermediate regime with , the algorithm should employ additional exploration compared to algorithms that achieve a minimax optimal rate in a single task. Theorem 2 suggests an additional exploration rate of and a global regret bound of . Note that under a special case of , the rate of indicates a regret bound of in the first task. The third regime is , where one should employ , meaning that no excess exploration is needed and the agent in the first task can minimize the local regret as much as possible. In this regime, the local regret in the first task could achieve the minimax optimal rate of .
It is often the case in real-world applications that is pre-determined and the researcher can decide how many samples to collect in the first task to ensure good learning in the second one. For instance, in an inventory management context (Madeka et al., 2022), the engineering team determines how long a learned policy should be deployed for the second task. In such cases, our theory indicates that one should choose , so a greedy local regret minimization for the first task is justified.
4 Results on Multiple Tasks
In our two-task study, we highlighted the inherent dilemma between local regrets in the first and second tasks. Now we extend our analysis to a sequence of multiple tasks. A significant property of the two-task scenario is the inability of the algorithm to adaptively learn in the second task. This restriction forces the algorithm to "overly" explore in the first task to propose a good policy for the second task, thereby introducing a tension between the regrets in the first and the second task. In fact, such tension exists between any task and its preceding tasks, whenever the algorithm is not allowed to adaptively learn within the task . We will also discuss the maximum number of rounds for which this trade-off could hold simultaneously.
In this section, we denote an instance by a sequence of task setups and their shared outcome distribution, . Let be a set of indices, such that any task has to be non-adaptive. Theorem 3 generalizes Theorem 1 by lower bounding the minimax rate of the product between the sum of the cumulative regret over tasks and that over task , simultaneously for all the indices in a set .
Theorem 3.
Recall that denotes the set of instances that share the same policy space and reward function with a given instance . Let be an index set. Assume the instance set is sufficiently large to ensure the existence of an instance for which we can find some such that, for all and , the following conditions hold:
1. There exists a unique optimal policy for each ;
2. and for all and ;
3. ;
4. for all ;
5. for all ;
6. and only differ in and ,
where are some constants and is some context-action pair. Then we have the following lower bound:
Theorem 3 provides a strong lower bound that holds simultaneously for the simple regret of all tasks in a set. The construction of the hard instance requires the optimal of the next task . Proposition 2 states that the longest sequence of tasks one can find to ensure that the conditions in Theorem 3 hold is .
Proposition 2.
4.1 Discussion on Nonlinear Case
We have shown in a tabular case that we can find at most tasks such that there is a trade-off between the simple regret of any task and the cumulative regrets of all its preceding tasks (Proposition 2). It appears that the number of rounds for which this tension could hold connects closely to the complexity of the underlying outcome distributions. In this section, we extend the tabular bandit to a nonlinear setting, where we show that it is possible to find an exponentially long sequence of tasks with the trade-off described above holding for each of the tasks.
Nonlinear contextual bandit.
For simplicity, we consider the outcome and a fixed reward function for all tasks , i.e. no rich observations, so we focus on the changes in the policy spaces. Following the setups in Section 2, we now consider potentially large or continuous context and action space and . Recall that the mean reward is given by . For contextual bandit with nonlinear reward models, we assume that mean reward for some known function class .
Complexity for nonlinear bandit.
Running UCB on nonlinear bandits is generally hard. Russo and Van Roy (2013) proposed to explore by choosing where is an optimistic estimate of . A choice of given by Russo and Van Roy (2013) is
(4) |
where are constants, is the empirical 2-norm, and is the empirical risk minimizer. Running UCB with appropriately chosen incurs regret of , where is the eluder dimension of the function class .
Definition 5 (Distributional eluder dimension).
Let . A probability measure over is said to be -dependent on a sequence of probability measures w.r.t if any pair of functions satisfying also satisfies . Furthermore, is -independent of if it is not -dependent on the sequence.
The -eluder dimension is the length of the longest sequence of distributions over such that for some , every distribution is -independent of its predecessors.
Recall that the construction of our hard instances in Theorem 1 requires that the new task has an optimal policy whose occupancy measure has no overlap with the occupancy measures of the optimal policies in the previous tasks. A generalization of this to the nonlinear case is that a predicted function that minimizes the loss over the dataset collected in the previous tasks may still incur a large loss on a new task. Let the optimal policies of tasks be . Intuitively, as long as is smaller than , we can find a new task with optimal policy for which the occupancy measure is -independent of . By the definition of eluder dimension, this implies that the function chosen for task based on the dataset collected by may still incur a large error. Note that by running a no-regret online algorithm, the dataset collected during a task will be asymptotically distributed as the occupancy measure induced by its optimal policy.
Eluder dimension has been shown to be exponentially large for simple models like a one-layer neural network with ReLU activation function (Dong et al., 2021). It is not trivial to show a lower bound directly depending on the eluder dimension. Instead, we provide a concrete example where the UCB algorithm described in (4) fails.
Theorem 4.
Consider the hypothesis set to be one-hidden-layer neural networks with width . There exists a ground-truth reward function and a sequence of tasks of length with different , such that the local regret for each task is lower bounded by a constant, even if each .
Theorem 4 indicates that even without a change in outcome distributions, there exists an exponentially long sequence of tasks for which the tension between local regret minimization and global regret minimization persists. A UCB algorithm that greedily minimizes local regret fails to provide good guarantees for later tasks.
5 Study on Changes in
In real-world implementations, the outcome distribution often undergoes unpredictable shifts. Prior research on non-stationary bandits has typically focused on single-task scenarios with potential reward distribution shifts at any step. To manage these shifts, the literature often limits the total variation in distribution shifts, making it possible to establish sublinear regret bounds. In a sequential task setting, when the algorithm is not allowed to adaptively learn in the second task, the simple regret is always lower bounded by a constant. This is attributed to the uncertainty of the second task’s optimal policy, even with full knowledge of the first task. To address this challenge, we introduce the concept of robust simple regret. We show that the robust simple regret and the cumulative regret in the two-task case have a minimax lower bound similar to the one in Theorem 1.
For simplicity, we consider no change in the policy space and the reward function. More specifically, we let , the set of all policies, and , the identical mapping in . We denote by and the outcome distribution of the first and the second task. We denote a problem instance by .
The adversary is allowed to choose from a ball around . This leads to the instance set parametrized by constant such that each satisfies
(5) |
where we abuse the notation for and let denote the mean reward for .
Robust simple regret.
When is allowed to change from all in a potentially adversarial way, it is not reasonable to compare with the true optimal policy with respect to the underlying true . Instead, we consider a robust regret definition. We first define the worst-case simple regret of a policy on a context :
(6) |
We denote by the optimal robust policy given . When it is clear from the context, we drop the subscripts for and .
We further define robust simple regret, which is the gap between the worst-case simple regret of a given policy and the policy that achieves the lowest worst-case simple regret:
(7) |
Note that the worst-case regret form over some ambiguity set has been studied in the Robust Markov Decision Process literature (Xu and Mannor, 2010; Eysenbach and Levine, 2021; Dong et al., 2022). However, the definition of robust simple regret and the tension between cumulative regret and robust simple regret have not yet been explored.
To understand how the tension between cumulative and simple regret still plays a role, we investigate a simple two-armed, context-free bandit case in Proposition 3. The optimal arm in the first task is , while the optimal robust policy depends on the gap between the mean rewards of the two arms. Thus, to reduce the robust simple regret in the second task, the algorithm is forced to have an accurate estimate of the suboptimal arm in the first task.
Proposition 3.
Consider the following two-armed, context-free bandit, with . Then the worst-case simple regret is given by
(8) |
Assume that . The optimal robust policy w.r.t. the worst-case simple regret has the explicit form of
(9) |
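The dependence of the optimal robust policy on the gap can be checked numerically. The sketch below assumes that the ambiguity set in (5) is a sup-norm ball of radius `delta` around the first-task mean rewards and ignores any clipping of rewards to a bounded range. The function names and the closed form in `robust_policy` are our own derivation under that assumption and may differ from the exact expressions in (8) and (9).

```python
import numpy as np

def worst_case_simple_regret(p, mu, delta):
    """Worst-case simple regret of playing arm 0 with probability p.

    mu = (mu_0, mu_1) are first-task means; the adversary may move each
    mean by at most delta (an assumed sup-norm ambiguity set).
    """
    gap = mu[0] - mu[1]
    # Adversary makes arm 0 the better arm: regret comes from playing arm 1.
    regret_if_arm0_best = (1 - p) * max(0.0, gap + 2 * delta)
    # Adversary makes arm 1 the better arm: regret comes from playing arm 0.
    regret_if_arm1_best = p * max(0.0, 2 * delta - gap)
    return max(regret_if_arm0_best, regret_if_arm1_best)

def robust_policy(mu, delta):
    """Probability of arm 0 that minimizes the worst-case simple regret."""
    gap = mu[0] - mu[1]
    if abs(gap) >= 2 * delta:
        # The adversary cannot flip the ordering, so play the better arm.
        return 1.0 if gap > 0 else 0.0
    # Otherwise equalize the two worst cases.
    return 0.5 + gap / (4 * delta)
```

Under these assumptions the robust policy randomizes whenever the gap is smaller than `2 * delta`, and the randomization probability is linear in the gap. An error in the estimated gap, which is driven by how often the suboptimal arm was pulled in the first task, therefore translates directly into robust simple regret.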
Motivated by the instance introduced in Proposition 3, we show the following theorem, which lower bounds the minimax rate of the product of the cumulative regret in the first task and the robust simple regret in the second task. Note that robust simple regret does not depend on the actual of choice, so the supremum is taken only over .
Theorem 5.
Assume is such that , the set of all policies, and , the identical mapping in . Assume each is from a binomial distribution with mean for all and , and . Then there exists some such that
(10) |
where is the random policy chosen by learning algorithm .
Theorem 5 implies that a similar trade-off between local regret and global regret still holds when there are only changes in the outcome distributions, and that one should still employ additional exploration for small .
6 Discussion
In this paper, we study the minimax rate of the sum of local regrets across a sequence of contextual bandit tasks. By showing a lower bound on this rate, we demonstrate a strong trade-off between local regrets in different tasks when there are changes between tasks. These changes include changes in policy space, reward function and outcome distribution, which is of significant novelty. A main message is that, in the presence of such changes, one should employ additional exploration compared to what is sufficient for single-task cumulative regret minimization. Our work opens many interesting future directions in the area of multitask bandits.
Multiple changes in .
In this paper, we only studied the change in outcome distribution in a two-task case, where we propose a new notion of robust simple regret and show that there is a dilemma between cumulative regret minimization in the first task and robust simple regret minimization in the second one. It is not clear how to extend the result, in its current form, to a multiple-task case. Intuitively, information from older tasks should be discounted when proposing a policy for a new task. Future work could consider modeling the changes in by an auto-regression model, which allows us to characterize how a new depends on the previous ones.
Instance-dependence results.
We study the minimax rate throughout the paper, which focuses on the worst case. In reality, some instances are significantly harder to learn than others. An interesting direction is to propose a theoretical measure of the significance of the trade-off studied in this paper and derive an instance-dependent result.
References
- Abel et al., (2023) Abel, D., Barreto, A., Roy, B. V., Precup, D., Hasselt, H. V., and Singh, S. (2023). A definition of continual reinforcement learning. ArXiv, abs/2307.11046.
- Agrawal and Goyal, (2013) Agrawal, S. and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pages 127–135. PMLR.
- Aleven et al., (2023) Aleven, V., Baraniuk, R., Brunskill, E., Crossley, S. A., Demszky, D., Fancsali, S. E., Gupta, S., Koedinger, K., Piech, C., Ritter, S., Thomas, D. R., Woodhead, S., and Xing, W. (2023). Towards the future of ai-augmented human tutoring in math learning. In International Conference on Artificial Intelligence in Education.
- Athey et al., (2022) Athey, S., Byambadalai, U., Hadad, V., Krishnamurthy, S. K., Leung, W., and Williams, J. J. (2022). Contextual bandits in a survey experiment on charitable giving: Within-experiment outcomes versus policy learning. ArXiv, abs/2211.12004.
- Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
- Auer et al., (2008) Auer, P., Jaksch, T., and Ortner, R. (2008). Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21.
- Bidargaddi et al., (2020) Bidargaddi, N., Schrader, G., Klasnja, P., Licinio, J., and Murphy, S. (2020). Designing m-health interventions for precision mental health support. Translational psychiatry, 10(1):222.
- Bubeck et al., (2011) Bubeck, S., Munos, R., and Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852.
- Chen and Jiang, (2019) Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.
- Chu et al., (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings.
- Dai et al., (2023) Dai, J., Gradu, P., and Harshaw, C. (2023). Clip-ogd: An experimental design for adaptive neyman allocation in sequential experiments. arXiv preprint arXiv:2305.17187.
- Dong et al., (2022) Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust mdp. arXiv preprint arXiv:2209.13841.
- Dong et al., (2021) Dong, K., Yang, J., and Ma, T. (2021). Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. Advances in neural information processing systems, 34:26168–26182.
- Eysenbach and Levine, (2021) Eysenbach, B. and Levine, S. (2021). Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257.
- Gao et al., (2022) Gao, D., Liu, Y., and Zeng, D. (2022). Non-asymptotic properties of individualized treatment rules from sequentially rule-adaptive trials. The Journal of Machine Learning Research, 23(1):11362–11403.
- Garivier and Moulines, (2008) Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415.
- Hong et al., (2023) Hong, K., Li, Y., and Tewari, A. (2023). An optimization-based algorithm for non-stationary kernel bandits without prior knowledge. In International Conference on Artificial Intelligence and Statistics, pages 3048–3085. PMLR.
- Khetarpal et al., (2020) Khetarpal, K., Riemer, M., Rish, I., and Precup, D. (2020). Towards continual reinforcement learning: A review and perspectives. J. Artif. Intell. Res., 75:1401–1476.
- Kim and Tewari, (2020) Kim, B. and Tewari, A. (2020). Randomized exploration for non-stationary stochastic linear bandits. In Conference on Uncertainty in Artificial Intelligence, pages 71–80. PMLR.
- Krishnamurthy et al., (2023) Krishnamurthy, S. K., Zhan, R., Athey, S., and Brunskill, E. (2023). Proportional response: Contextual bandits for simple and cumulative regret minimization. arXiv preprint arXiv:2307.02108.
- Lange et al., (2019) Lange, M. D., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G. G., and Tuytelaars, T. (2019). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:3366–3385.
- Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
- Liao et al., (2020) Liao, P., Greenewald, K., Klasnja, P., and Murphy, S. (2020). Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22.
- Madeka et al., (2022) Madeka, D., Torkkola, K., Eisenach, C., Luo, A., Foster, D. P., and Kakade, S. M. (2022). Deep inventory management. arXiv preprint arXiv:2210.03137.
- Peng et al., (2023) Peng, L., Giampouras, P. V., and Vidal, R. (2023). The ideal continual learner: An agent that never forgets. In International Conference on Machine Learning.
- Qin and Russo, (2023) Qin, C. and Russo, D. (2023). Generalized objectives in adaptive experiments: The frontier between regret and speed. NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World.
- Raj and Kalyani, (2017) Raj, V. and Kalyani, S. (2017). Taming non-stationary bandits: A bayesian approach. arXiv preprint arXiv:1707.09727.
- Ruan et al., (2023) Ruan, S. S., Nie, A., Steenbergen, W., He, J., Zhang, J., Guo, M., Liu, Y., Nguyen, K. D., Wang, C. Y., Ying, R., Landay, J. A., and Brunskill, E. (2023). Reinforcement learning tutor better supported lower performers in a math task. ArXiv, abs/2304.04933.
- Russo and Van Roy, (2013) Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26.
- Simchi-Levi and Wang, (2023) Simchi-Levi, D. and Wang, C. (2023). Multi-armed bandit experimental design: Online decision-making and adaptive inference. In International Conference on Artificial Intelligence and Statistics, pages 3086–3097. PMLR.
- Trella et al., (2022) Trella, A. L., Zhang, K. W., Nahum-Shani, I., Shetty, V., Doshi-Velez, F., and Murphy, S. A. (2022). Designing reinforcement learning algorithms for digital interventions: pre-implementation guidelines. Algorithms, 15(8):255.
- Wang et al., (2023) Wang, L., Zhang, X., Su, H., and Zhu, J. (2023). A comprehensive survey of continual learning: Theory, method and application. ArXiv, abs/2302.00487.
- Wu et al., (2018) Wu, Q., Iyer, N., and Wang, H. (2018). Learning contextual bandits in a non-stationary environment. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 495–504.
- Xu and Mannor, (2010) Xu, H. and Mannor, S. (2010). Distributionally robust markov decision processes. Advances in Neural Information Processing Systems, 23.
- Yin and Wang, (2021) Yin, M. and Wang, Y.-X. (2021). Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34:4065–4078.
- Zanette et al., (2021) Zanette, A., Dong, K., Lee, J. N., and Brunskill, E. (2021). Design of experiments for stochastic contextual linear bandits. Advances in Neural Information Processing Systems, 34:22720–22731.
Appendix A Proof of Theorem 1
Theorem 1 Assume the instance set is sufficiently large to ensure the existence of an instance such that for all , we can find some satisfying the following conditions:
1. There exists unique optimal policy for each ;
2. and for all ;
3. ;
4. for all ;
5. for all ;
6. and only differ in and ,
where are some constants and is some context-action pair. Then we have the following lower bound:
Proof.
Fix a learning algorithm . Throughout the proof, we let be the expectation of random variable of interest given the underlying instance by running algorithm .
Let . From Condition 2 and the definition of the cumulative regret, we have
(11) | ||||
(12) | ||||
(13) |
The same argument gives .
Denote by the fixed policy proposed by the algorithm for task two. Note that
(14) |
We further lower bound the sum of squared simple regret of on and :
(15) | ||||
(16) | ||||
(17) | ||||
(18) |
where the first inequality is from Conditions 4 and 5, and the second inequality is from Condition 3.
Lemma 1 (Bretagnolle–Huber inequality).
For any two probability distributions on the same measurable space , and any event , we have
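For reference, the Bretagnolle–Huber inequality is usually stated as follows, where we write $P$ and $Q$ for the two distributions, $A$ for the event, $A^c$ for its complement, and $\mathrm{KL}$ for the Kullback–Leibler divergence. These symbol names are our assumption and may differ from the notation used elsewhere in the paper.

```latex
\[
  P(A) + Q(A^{c}) \;\ge\; \frac{1}{2}\,\exp\bigl(-\mathrm{KL}(P \,\|\, Q)\bigr).
\]
```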
It follows from Lemma 1 and Condition 6 that
(19) | ||||
(20) | ||||
(21) | ||||
(22) | ||||
(23) |
The same argument gives that
(24) |
Lemma 2.
For all choices of algorithm , with , we have for some universal constant .
By choosing (assuming that ) and applying Lemma 2, it holds that
(25) |
∎
A.1 Proof of Lemma 2
Lemma 2 For all choices of algorithm , with , we have for some universal constant .
Appendix B Proof of Proposition 1
Proposition 1 Following the same condition on the instance set , the following minimax lower bound holds
(35) |
Proof.
It is well known that the minimax rate of the cumulative regret of a single task of horizon is (Lattimore and Szepesvári, 2020). By Theorem 1,
A minimax lower bound of for offline policy optimization is shown in Theorem 4.3 of Yin and Wang (2021). Combining the three lower bounds, we conclude (35). ∎
Appendix C Verifying Conditions in Section 3.1
We verify each of the conditions in Theorem 1 for the three case studies introduced in Section 3.1. Note that the mean rewards are given in Table 4.
C.1 Case I
Let . It follows from the construction that there are unique optimal policies for . The unique optimal policies for each task are given in Table 5.
1. Condition 2 holds with : and . Thus, .
2. Condition 3 holds with because , while .
3. Conditions 4 and 5 hold since for all and for all ;
4. Condition 6 can be satisfied by choosing and as the normal distribution with mean and and variance 1/2.
Context | ||||
---|---|---|---|---|
(1, 0) | (1, 0) | (1, 0) | (1, 0) | |
(1, 0) | (1, 0) | (0, 1) | (1, 0) |
C.2 Case II
Let . The unique optimal policies for each task are given in Table 6.
1. Condition 2 holds with : both . Furthermore, and . By choosing , Condition 2 is satisfied with .
2. Condition 3 holds with because , while .
3. Conditions 4 and 5 hold because and .
4. Condition 6 can be satisfied by choosing and as the normal distribution with mean and and variance 1/2.
Context | ||||
---|---|---|---|---|
(1, 0) | (1, 0) | (0, 1) | (1, 0) | |
(0, 1) | (0, 1) | (0, 1) | (1, 0) |
C.3 Case III
Note that any MAB can be seen as a contextual bandit with a dummy context . Let . The optimal policies for and are . The optimal policy for is , and for is .
1. Condition 2 holds with .
2. Condition 3 holds with .
3. Conditions 4 and 5 hold because and .
4. Condition 6 holds by choosing and as the normal distribution with mean and and variance 1/2.
Appendix D Proof of Theorem 5
Theorem 5 Assume is such that , the set of all policies, and , the identical mapping in . Assume each is from a binomial distribution with mean for all and , and . Then there exists some such that
(36) |
where is the random policy chosen by learning algorithm .
Proof.
We construct two hard instances inspired by Proposition 3. Recall that Proposition 3 states that any two-armed, context-free bandit, with has the following explicit form of optimal robust policy:
(37) |
Let the arm space be . We construct two instances and with and , such that and , while for . Let and be Bernoulli distributions of parameter and for each . Let and .
The proof follows an argument similar to that of Theorem 1. We first connect the cumulative regret in the first task with the number of visits to the suboptimal arm in the first task. It can be shown that , where for each .
We consider . By Proposition 3, we first lower bound the robust simple regret by
(38) | ||||
(39) | ||||
(40) | ||||
(41) | ||||
(42) |
Similarly, we also have . Here we let , be the optimal robust policy for and , respectively.
Let be the random policy proposed by the learning algorithm for the second task. We convert the robust learning problem to a testing problem of two instances. Note that and .
The sum of robust simple regrets for two instances can be lower bounded by
(43) | ||||
(44) | ||||
(45) |
Appendix E Proof of Theorem 3
Theorem 3 Recall that denotes the set of instances that share the same policy space and reward function with a given instance . Let be an index set. Assume the instance set is sufficiently large to ensure the existence of an instance for which we can find some such that for all and , it satisfies the following conditions:
1. There exists unique optimal policy for each ;
2. and for all and ;
3. ;
4. for all ;
5. for all ;
6. and only differ in and ,
where are some constants and is some context-action pair. Then we have the following lower bound:
Appendix F Proof of Proposition 2
Proof.
The conditions in Theorem 3 require that the optimal policy for task has occupancy measure , while all the previous tasks have occupancy measure . The longest sequence of whose occupancy measures have non-overlapping support sets is no longer than . A trivial instance of length can be constructed, although it is not of strong practical interest. Index the contexts in by , and the actions in by . Split the tasks into groups of size . The reward functions in group are designed such that the optimal policy at all has optimal arm for all . The tasks in group share the same reward function, and the mean reward satisfies . The policy spaces are designed such that , i.e., the -th task in the group is not allowed to choose the first actions. It can be verified that this construction satisfies the conditions in Theorem 3. ∎
Appendix G Proof of Theorem 4
Proof.
The construction of the hard instance is described below. Consider a nonlinear bandit problem with , the -dimensional sphere. We first define the reward function as
where and are known parameters and is unknown. Assume that the true parameter satisfies , i.e., and are on different spheres. Furthermore, let .
Let be an -packing of the subset . Let the allowed policy space for task be . Specifically, order such that .
To verify, is in the family of one-layer neural networks with ReLU activation function.
We first observe that the optimal policy for task is for all . Note that by running UCB, the algorithm will optimistically choose , as it does not know whether for all , thus leading to a constant regret for all tasks . ∎