Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Stochastic Q-learning for Large Discrete Action Spaces

Fares Fourati    Vaneet Aggarwal    Mohamed-Slim Alouini
Abstract

In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, as opposed to optimizing over the entire set of n𝑛nitalic_n actions, only consider a variable stochastic set of a sublinear number of actions, possibly as small as 𝒪(log(n))𝒪𝑛\mathcal{O}(\log(n))caligraphic_O ( roman_log ( italic_n ) ). The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, while an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.

Machine Learning, ICML

1 Introduction

Reinforcement learning (RL), a continually evolving field of machine learning, has achieved notable successes, especially when combined with deep learning (Sutton & Barto, 2018; Wang et al., 2022). While there have been several advances in the field, a significant challenge lies in navigating complex environments with large discrete action spaces (Dulac-Arnold et al., 2015, 2021). In such scenarios, standard RL algorithms suffer in terms of computational efficiency (Akkerman et al., 2023). Identifying the optimal actions might entail cycling through all of them, in general, multiple times within different states, which is computationally expensive and may become prohibitive with large discrete action spaces (Tessler et al., 2019).

Such challenges apply to various domains, including combinatorial optimization (Mazyavkina et al., 2021; Fourati et al., 2023, 2024b, 2024a), natural language processing (He et al., 2015, 2016a, 2016b; Tessler et al., 2019), communications and networking (Luong et al., 2019; Fourati & Alouini, 2021), recommendation systems (Dulac-Arnold et al., 2015), transportation (Al-Abbasi et al., 2019; Haliem et al., 2021; Li et al., 2022), and robotics (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020; Seyde et al., 2021, 2022; Gonzalez et al., 2023; Ireland & Montana, 2024). Although tailored solutions leveraging action space structures and dimensions may suffice in specific contexts, their applicability across diverse problems, possibly unstructured, still needs to be expanded. We complement these works by proposing a general method that addresses a broad spectrum of problems, accommodating structured and unstructured single and multi-dimensional large discrete action spaces.

Value-based and actor-based approaches are both prominent approaches in RL. Value-based approaches, which entail the agent implicitly optimizing its policy by maximizing a value function, demonstrate superior generalization capabilities but demand significant computational resources, particularly in complex settings. Conversely, actor-based approaches, which entail the agent directly optimizing its policy, offer computational efficiency but often encounter challenges in generalizing across multiple and unexplored actions (Dulac-Arnold et al., 2015). While both hold unique advantages and challenges, they represent distinct avenues for addressing the complexities of decision-making in large action spaces. However, comparing them falls outside the scope of this work. While some previous methods have focused on the latter (Dulac-Arnold et al., 2015), our work concentrates on the former. Specifically, we aim to exploit the natural generalization inherent in value-based RL approaches while reducing their per-step computational complexity.

Q-learning, as introduced by Watkins & Dayan (1992), for discrete action and state spaces, stands out as one of the most famous examples of value-based RL methods and remains one of the most widely used ones in the field. As an off-policy learning method, it decouples the learning process from the agent’s current policy, allowing it to leverage past experiences from various sources, which becomes advantageous in complex environments. In each step of Q-learning, the agent updates its action value estimates based on the observed reward and the estimated value of the best action in the next state.

Some approaches have been proposed to apply Q-learning to continuous state spaces, leveraging deep neural networks (Mnih et al., 2013; Van Hasselt et al., 2016). Moreover, several improvements have also been suggested to address its inherent estimation bias (Hasselt, 2010; Van Hasselt et al., 2016; Zhang et al., 2017; Lan et al., 2020; Wang et al., 2021). However, despite the different progress and its numerous advantages, a significant challenge still needs to be solved in Q-learning-like methods when confronted with large discrete action spaces. The computational complexity associated with selecting actions and updating Q-functions increases proportionally with the increasing number of actions, which renders the conventional approach impractical as the number of actions substantially increases. Consequently, we confront a crucial question: Is it possible to mitigate the complexity of the different Q-learning methods while maintaining a good performance?

This work proposes a novel, simple, and practical approach for handling general, possibly unstructured, single-dimensional or multi-dimensional, large discrete action spaces. Our approach targets the computational bottleneck in value-based methods caused by the search for a maximum (max\maxroman_max and argmaxargmax\operatorname*{arg\,max}roman_arg roman_max) in every learning iteration, which scales as 𝒪(n)𝒪𝑛\mathcal{O}(n)caligraphic_O ( italic_n ), i.e., linearly with the number of possible actions n𝑛nitalic_n. Through randomization, we can reduce this linear per-step computational complexity to logarithmic.

We introduce stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max, which, instead of exhaustively searching for the precise maximum across the entire set of actions, rely on at most two random subsets of actions, both of sub-linear sizes, possibly each of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉. The first subset is randomly sampled from the complete set of actions, and the second from the previously exploited actions. These stochastic maximization techniques amortize the computational overhead of standard maximization operations in various Q-learning methods (Watkins & Dayan, 1992; Hasselt, 2010; Mnih et al., 2013; Van Hasselt et al., 2016). Stochastic maximization methods significantly accelerate the agent’s steps, including action selection and value-function updates in value-based RL methods, making them practical for handling challenging, large-scale, real-world problems.

We propose Stochastic Q-learning, Stochastic Double Q-learning, StochDQN, and StochDDQN, which are obtained by changing max\maxroman_max and argmaxargmax\operatorname*{arg\,max}roman_arg roman_max to stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max in the Q-learning (Watkins & Dayan, 1992), the Double Q-learning (Hasselt, 2010), the deep Q-network (DQN) (Mnih et al., 2013) and the double DQN (DDQN) (Van Hasselt et al., 2016), respectively. Furthermore, we observed that our approach works even for the on-policy Sarsa (Rummery & Niranjan, 1994).

We conduct a theoretical analysis of the proposed method, proving the convergence of Stochastic Q-learning, which integrates these techniques for action selection and value updates, and establishing a lower bound on the probability of sampling an optimal action from a random set of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ and analyze the error of stochastic maximization compared to exact maximization. Furthermore, we evaluate the proposed RL algorithms on environments from Gymnasium (Brockman et al., 2016). For the stochastic deep RL algorithms, the evaluations were performed on control tasks within the multi-joint dynamics with contact (MuJoCo) environment (Todorov et al., 2012) with discretized actions (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020). These evaluations demonstrate that the stochastic approaches outperform non-stochastic ones regarding wall time speedup and sometimes rewards. Our key contributions are summarized as follows:

  • We introduce novel stochastic maximization techniques denoted as stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max, offering a compelling alternative to conventional deterministic maximization operations, particularly beneficial for handling large discrete action spaces, ensuring sub-linear complexity concerning the number of actions.

  • We present a suite of value-based algorithms suitable for large discrete actions, including Stochastic Q-learning, Stochastic Sarsa, Stochastic Double Q-learning, StochDQN, and StochDDQN, which integrate stochastic maximization within Q-learning, Sarsa, Double Q-learning, DQN, and DDQN, respectively.

  • We analyze stochastic maximization and demonstrate the convergence of Stochastic Q-learning. Furthermore, we empirically validate our approach to tasks from the Gymnasium and MuJoCO environments, encompassing various dimensional discretized actions.

2 Related Works

While RL has shown promise in diverse domains, practical applications often grapple with real-world complexities. A significant hurdle arises when dealing with large discrete action spaces (Dulac-Arnold et al., 2015, 2021). Previous research has investigated strategies to address this challenge by leveraging the combinatorial or the dimensional structures in the action space (He et al., 2016b; Tavakoli et al., 2018; Tessler et al., 2019; Delarue et al., 2020; Seyde et al., 2021, 2022; Fourati et al., 2023, 2024b, 2024a; Akkerman et al., 2023; Fourati et al., 2024b, a; Ireland & Montana, 2024). For example, He et al. (2016b) leveraged the combinatorial structure of their language problem through sub-action embeddings. Compressed sensing was employed in (Tessler et al., 2019) for text-based games with combinatorial actions. Delarue et al. (2020) formulated the combinatorial action decision of a vehicle routing problem as a mixed-integer program. Moreover, Akkerman et al. (2023) introduced dynamic neighbourhood construction specifically for structured combinatorial large discrete action spaces. Previous works tailored solutions for multi-dimensional spaces such as those in (Seyde et al., 2021, 2022; Ireland & Montana, 2024), among others, while practical in the multi-dimensional spaces, may not be helpful for single-dimensional large action spaces. While relying on the structure of the action space is practical in some settings, not all problems with large action spaces are multi-dimensional or structured. We complement these works by making no assumptions about the structure of the action space.

Some approaches have proposed factorizing the action spaces to reduce their size. For example, these include factorizing into binary subspaces (Lagoudakis & Parr, 2003; Sallans & Hinton, 2004; Pazis & Parr, 2011; Dulac-Arnold et al., 2012), expert demonstration (Tennenholtz & Mannor, 2019), tensor factorization (Mahajan et al., 2021), and symbolic representations (Cui & Khardon, 2016). Additionally, some hierarchical and multi-agent RL approaches employed factorization as well (Zhang et al., 2020; Kim et al., 2021; Peng et al., 2021; Enders et al., 2023). While some of these methods effectively handle large action spaces for certain problems, they necessitate the design of a representation for each discrete action. Even then, for some problems, the resulting space may still be large.

Methods presented in (Van Hasselt & Wiering, 2009; Dulac-Arnold et al., 2015; Wang et al., 2020) combine continuous-action policy gradients with nearest neighbour search to generate continuous actions and identify the nearest discrete actions. These are interesting methods but require continuous-to-discrete mapping and are mainly policy-based rather than value-based approaches. In the works of Kalashnikov et al. (2018) and Quillen et al. (2018), the cross-entropy method (Rubinstein, 1999) was utilized to approximate action maximization. This approach requires multiple iterations (r𝑟ritalic_r) for a single action selection. During each iteration, it samples nsuperscript𝑛n^{\prime}italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT values, where n<nsuperscript𝑛𝑛n^{\prime}<nitalic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT < italic_n, fits a Gaussian distribution to m<n𝑚superscript𝑛m<n^{\prime}italic_m < italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT of these samples, and subsequently draws a new batch of nsuperscript𝑛n^{\prime}italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT samples from this Gaussian distribution. As a result, this approximation remains costly, with a complexity of 𝒪(rn)𝒪𝑟superscript𝑛\mathcal{O}(rn^{\prime})caligraphic_O ( italic_r italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ). Additionally, in the work of Van de Wiele et al. (2020), a neural network was trained to predict the optimal action in combination with a uniform search. This approach involves the use of an expensive autoregressive proposal distribution to generate nsuperscript𝑛n^{\prime}italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT actions and samples a large number of actions (m𝑚mitalic_m), thus remaining computationally expensive, with 𝒪(n+m)𝒪superscript𝑛𝑚\mathcal{O}(n^{\prime}+m)caligraphic_O ( italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + italic_m ). In (Metz et al., 2017), sequential DQN allows the agent to choose sub-actions one by one, which increases the number of steps needed to solve a problem and requires d𝑑ditalic_d steps with a linear complexity of 𝒪(i)𝒪𝑖\mathcal{O}(i)caligraphic_O ( italic_i ) for a discretization granularity i𝑖iitalic_i. Additionally, Tavakoli et al. (2018) employs a branching technique with duelling DQN for combinatorial control problems. Their approach has a complexity of 𝒪(di)𝒪𝑑𝑖\mathcal{O}(di)caligraphic_O ( italic_d italic_i ) for actions with discretization granularity i𝑖iitalic_i and d𝑑ditalic_d dimensions, whereas our method, in a similar setting, achieves 𝒪(dlog(i))𝒪𝑑𝑖\mathcal{O}(d\log(i))caligraphic_O ( italic_d roman_log ( italic_i ) ). Another line of work introduces action elimination techniques, such as the action elimination DQN (Zahavy et al., 2018), which employs an action elimination network guided by an external elimination signal from the environment. However, it requires this domain-specific signal and can be computationally expensive (𝒪(n)𝒪superscript𝑛\mathcal{O}(n^{\prime})caligraphic_O ( italic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) where nnsuperscript𝑛𝑛n^{\prime}\leq nitalic_n start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≤ italic_n are the number of remaining actions). In contrast, curriculum learning, as proposed by Farquhar et al. (2020), initially limits an agent’s action space, gradually expanding it during training for efficient exploration. However, its effectiveness relies on having an informative restricted action space, and as the action space size grows, its complexity scales linearly with its size, eventually reaching 𝒪(n)𝒪𝑛\mathcal{O}(n)caligraphic_O ( italic_n ).

In the context of combinatorial bandits with a single state but large discrete action spaces, previous works have exploited the combinatorial structure of actions, where each action is a subset of main arms. For instance, for submodular reward functions, which imply diminishing returns when adding arms, in (Fourati et al., 2023) and (Fourati et al., 2024b), stochastic greedy algorithms are used to avoid exact search. The former evaluates the marginal gains of adding and removing sub-actions (arms), while the latter assumes monotonic rewards and considers adding the best arm until a cardinality constraint is met. For general reward functions, Fourati et al. (2024a) propose using approximation algorithms to evaluate and add sub-actions. While these methods are practical for bandits, they exploit the combinatorial structure of their problems and consider a single-state scenario, which is different from general RL problems.

While some approaches above are practical for handling specific problems with large discrete action spaces, they often exploit the dimensional or combinatorial structures inherent in their considered problems. In contrast, we complement these approaches by proposing a solution to tackle any general, potentially unstructured, single-dimensional or multi-dimensional, large discrete action space without relying on structure assumptions. Our proposed solution is general, simple, and efficient.

3 Problem Description

In the context of a Markov decision process (MDP), we have specific components: a finite set of actions denoted as 𝒜𝒜\operatorname{\mathcal{A}}caligraphic_A, a finite set of states denoted as 𝒮𝒮\operatorname{\mathcal{S}}caligraphic_S, a transition probability distribution 𝒫:𝒮×𝒜×𝒮[0,1]:𝒫𝒮𝒜𝒮01\mathcal{P}:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\times% \operatorname{\mathcal{S}}\rightarrow[0,1]caligraphic_P : caligraphic_S × caligraphic_A × caligraphic_S → [ 0 , 1 ], a bounded reward function r:𝒮×𝒜:𝑟𝒮𝒜r:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\rightarrow% \operatorname{\mathbb{R}}italic_r : caligraphic_S × caligraphic_A → blackboard_R, and a discount factor γ[0,1]𝛾01\gamma\in[0,1]italic_γ ∈ [ 0 , 1 ]. Furthermore, for time step t𝑡titalic_t, we denote the chosen action as 𝐚tsubscript𝐚𝑡\operatorname{\mathbf{a}}_{t}bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, the current state as 𝐬tsubscript𝐬𝑡\operatorname{\mathbf{s}}_{t}bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, and the received reward as rtr(𝐬t,𝐚t)subscript𝑟𝑡𝑟subscript𝐬𝑡subscript𝐚𝑡r_{t}\triangleq r(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf{a}}_{t})italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≜ italic_r ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). Additionally, for time step t𝑡titalic_t, we define a learning rate function αt:𝒮×𝒜[0,1]:subscript𝛼𝑡𝒮𝒜01\alpha_{t}:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}% \rightarrow[0,1]italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT : caligraphic_S × caligraphic_A → [ 0 , 1 ].

The cumulative reward an agent receives during an episode in an MDP with variable length time T𝑇Titalic_T is the return Rtsubscript𝑅𝑡R_{t}italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. It is calculated as the discounted sum of rewards from time step t𝑡titalic_t until the episode terminates: Rti=tTγitrisubscript𝑅𝑡superscriptsubscript𝑖𝑡𝑇superscript𝛾𝑖𝑡subscript𝑟𝑖R_{t}\triangleq\sum_{i=t}^{T}\gamma^{i-t}r_{i}italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≜ ∑ start_POSTSUBSCRIPT italic_i = italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_i - italic_t end_POSTSUPERSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. RL aims to learn a policy π:𝒮𝒜:𝜋𝒮𝒜\pi:\operatorname{\mathcal{S}}\rightarrow\operatorname{\mathcal{A}}italic_π : caligraphic_S → caligraphic_A mapping states to actions that maximize the expected return across all episodes. The state-action value function, denoted as Qπ(𝐬,𝐚)superscript𝑄𝜋𝐬𝐚Q^{\pi}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( bold_s , bold_a ), represents the expected return when starting from a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, taking action 𝐚𝐚\operatorname{\mathbf{a}}bold_a, and following a policy π𝜋\piitalic_π afterwards. The function Qπsuperscript𝑄𝜋Q^{\pi}italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT can be expressed recursively using the Bellman equation:

Qπ(𝐬,𝐚)=r(𝐬,𝐚)+γ𝐬𝒮𝒫(𝐬|𝐬,𝐚)Qπ(𝐬,π(𝐬)).superscript𝑄𝜋𝐬𝐚𝑟𝐬𝐚𝛾subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚superscript𝑄𝜋superscript𝐬𝜋superscript𝐬Q^{\pi}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=r(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})+\gamma\sum_{\operatorname{\mathbf{s}}^{% \prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(\operatorname{\mathbf{s}}^{% \prime}|\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})Q^{\pi}(% \operatorname{\mathbf{s}}^{\prime},\pi(\operatorname{\mathbf{s}}^{\prime})).italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( bold_s , bold_a ) = italic_r ( bold_s , bold_a ) + italic_γ ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | bold_s , bold_a ) italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_π ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) . (1)

Two main categories of policies are commonly employed in RL systems: value-based and actor-based policies (Sutton & Barto, 2018). This study primarily concentrates on the former type, where the value function directly influences the policy’s decisions. An example of a value-based policy in a state 𝐬𝐬\operatorname{\mathbf{s}}bold_s involves an ε𝐬subscript𝜀𝐬\varepsilon_{\operatorname{\mathbf{s}}}italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT-greedy algorithm, selecting the action with the highest Q-function value with probability (1ε𝐬)1subscript𝜀𝐬(1-\varepsilon_{\operatorname{\mathbf{s}}})( 1 - italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT ), where ε𝐬0subscript𝜀𝐬0\varepsilon_{\operatorname{\mathbf{s}}}\geq 0italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT ≥ 0, function of the state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, requiring the use of argmaxargmax\operatorname*{arg\,max}roman_arg roman_max operation, as follows:

πQ(𝐬)={play randomlywith proba. ϵ𝐬argmax𝐚𝒜Q(𝐬,𝐚)otherwise.subscript𝜋𝑄𝐬casesplay randomlywith proba. subscriptitalic-ϵ𝐬subscriptargmax𝐚𝒜𝑄𝐬𝐚otherwise.\pi_{Q}(\operatorname{\mathbf{s}})=\begin{cases}\text{play randomly}&\text{% with proba. }\epsilon_{\operatorname{\mathbf{s}}}\\ \operatorname*{arg\,max}_{\operatorname{\mathbf{a}}\in\operatorname{\mathcal{A% }}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})&\text{otherwise.}% \end{cases}italic_π start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT ( bold_s ) = { start_ROW start_CELL play randomly end_CELL start_CELL with proba. italic_ϵ start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) end_CELL start_CELL otherwise. end_CELL end_ROW (2)

Furthermore, during the training, to update the Q-function, Q-learning (Watkins & Dayan, 1992), for example, uses the following update rule, which requires a max\maxroman_max operation:

Qt+1(𝐬t,𝐚t)=(1αt(𝐬t,𝐚t))Qt(𝐬t,𝐚t)subscript𝑄𝑡1subscript𝐬𝑡subscript𝐚𝑡1subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝑄𝑡subscript𝐬𝑡subscript𝐚𝑡\displaystyle Q_{t+1}\left(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf% {a}}_{t}\right)=\left(1-\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},% \operatorname{\mathbf{a}}_{t}\right)\right)Q_{t}\left(\operatorname{\mathbf{s}% }_{t},\operatorname{\mathbf{a}}_{t}\right)italic_Q start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
+αt(𝐬t,𝐚t)[rt+γmaxb𝒜Qt(𝐬t+1,b)].subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡delimited-[]subscript𝑟𝑡𝛾subscript𝑏𝒜subscript𝑄𝑡subscript𝐬𝑡1𝑏\displaystyle+\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},\operatorname{% \mathbf{a}}_{t}\right)\left[r_{t}+\gamma\max_{b\in\mathcal{A}}Q_{t}\left(% \operatorname{\mathbf{s}}_{t+1},b\right)\right].+ italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) [ italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_b ) ] . (3)

Therefore, the computational complexity of both the action selections in Eq. (2) and the Q-function updates in Eq. (3) scales linearly with the cardinality n𝑛nitalic_n of the action set 𝒜𝒜\operatorname{\mathcal{A}}caligraphic_A, making this approach infeasible as the number of actions increases significantly. The same complexity issues remain for other Q-learning variants, such as Double Q-learning (Hasselt, 2010), DQN (Mnih et al., 2013), and DDQN (Van Hasselt et al., 2016), among several others.

When representing the value function as a parameterized function, such as a neural network, taking only the current state 𝐬𝐬\operatorname{\mathbf{s}}bold_s as input and outputting the values for all actions, as proposed in DQN (Mnih et al., 2013), the network must accommodate a large number of output nodes, which results in increasing memory overhead and necessitates extensive predictions and maximization over these final outputs in the last layer. A notable point about this approach is that it does not exploit contextual information (representation) of actions, if available, which leads to lower generalization capability across actions with similar features and fails to generalize over new actions.

Previous works have considered generalization over actions by taking the features of an action 𝐚𝐚\operatorname{\mathbf{a}}bold_a and the current state 𝐬𝐬\operatorname{\mathbf{s}}bold_s as inputs to the Q-network and predicting its value (Zahavy et al., 2018; Metz et al., 2017; Van de Wiele et al., 2020). However, it leads to further complications when the value function is modeled as a parameterized function with both state 𝐬𝐬\operatorname{\mathbf{s}}bold_s and action 𝐚𝐚\operatorname{\mathbf{a}}bold_a as inputs. Although this approach allows for improved generalization across the action space by leveraging contextual information from each action and generalizing across similar ones, it requires evaluating the function for each action within the action set 𝒜𝒜\operatorname{\mathcal{A}}caligraphic_A. This results in a linear increase in the number of function calls as the number of actions grows. This scalability issue becomes particularly problematic when dealing with computationally expensive function approximators, such as deep neural networks (Dulac-Arnold et al., 2015). Addressing these challenges forms the motivation behind this work.

4 Proposed Approach

To alleviate the computational burden associated with maximizing a Q-function at each time step, especially when dealing with large action spaces, we introduce stochastic maximization methods with sub-linear complexity relative to the size of the action set 𝒜𝒜\operatorname{\mathcal{A}}caligraphic_A. Then, we integrate these methods into different value-based RL algorithms.

4.1 Stochastic Maximization

We introduce stochastic maximization as an alternative to maximization when dealing with large discrete action spaces. Instead of conducting an exhaustive search for the precise maximum across the entire set of actions 𝒜𝒜\mathcal{A}caligraphic_A, stochastic maximization searches for a maximum within a stochastic subset of actions of sub-linear size relative to the total number of actions. In principle, any size can be used, trading off time complexity and approximation. We mainly focus on 𝒪(log(n))𝒪𝑛\mathcal{O}(\log(n))caligraphic_O ( roman_log ( italic_n ) ) to illustrate the power of the method in recovering Q-learning, even with such a small number of actions, with logarithmic complexity.

We consider two approaches to stochastic maximization: memoryless and memory-based approaches. The memoryless one samples a random subset of actions 𝒜𝒜\mathcal{R}\subseteq\operatorname{\mathcal{A}}caligraphic_R ⊆ caligraphic_A with a sublinear size and seeks the maximum within this subset. On the other hand, the memory-based one expands the randomly sampled set to include a few actions \mathcal{M}caligraphic_M with a sublinear size from the latest exploited actions \mathcal{E}caligraphic_E and uses the combined sets to search for a stochastic maximum. Stochastic maximization, which may miss the exact maximum in both versions, is always upper-bounded by deterministic maximization, which finds the exact maximum. However, by construction, it has sublinear complexity in the number of actions, making it appealing when maximizing over large action spaces becomes impractical.

Formally, given a state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, which may be discrete or continuous, along with a Q-function, a random subset of actions 𝒜𝒜\mathcal{R}\subseteq\operatorname{\mathcal{A}}caligraphic_R ⊆ caligraphic_A, and a memory subset \mathcal{M}\subseteq\mathcal{E}caligraphic_M ⊆ caligraphic_E (empty in the memoryless case), each subset being of sublinear size, such as at most 𝒪(log(n))𝒪𝑛\mathcal{O}(\log(n))caligraphic_O ( roman_log ( italic_n ) ) each, the stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max is the maximum value computed from the union set 𝒞=𝒞\mathcal{C}=\mathcal{R}\cup\mathcal{M}caligraphic_C = caligraphic_R ∪ caligraphic_M, defined as:

stochmaxk𝒜Qt(𝐬,k)maxk𝒞Qt(𝐬,k).subscriptstochmax𝑘𝒜subscript𝑄𝑡𝐬𝑘subscript𝑘𝒞subscript𝑄𝑡𝐬𝑘\operatorname*{stoch\,max}_{k\in\operatorname{\mathcal{A}}}Q_{t}(\operatorname% {\mathbf{s}},k)\triangleq\max_{k\in\mathcal{C}}Q_{t}(\operatorname{\mathbf{s}}% ,k).start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT italic_k ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) ≜ roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) . (4)

Besides, the stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max is computed as follows:

stochargmaxk𝒜Qt(𝐬,k)argmaxk𝒞Qt(𝐬,k).subscriptstochargmax𝑘𝒜subscript𝑄𝑡𝐬𝑘subscriptargmax𝑘𝒞subscript𝑄𝑡𝐬𝑘\operatorname*{stoch\,arg\,max}_{k\in\operatorname{\mathcal{A}}}Q_{t}(% \operatorname{\mathbf{s}},k)\triangleq\operatorname*{arg\,max}_{k\in\mathcal{C% }}Q_{t}(\operatorname{\mathbf{s}},k).start_OPERATOR roman_stoch roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_k ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) ≜ start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_k ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) . (5)

In the analysis of stochastic maximization, we explore both memory-based and memoryless maximization. In the analysis and experiments, we consider the random set \mathcal{R}caligraphic_R to consist of log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions. When memory-based, in our experiments, within a given discrete state, we consider the two most recently exploited actions in that state. For continuous states, where it is impossible to retain the latest exploited actions for each state, we consider a randomly sampled subset \mathcal{M}\subseteq\mathcal{E}caligraphic_M ⊆ caligraphic_E, which includes log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions, even though they were played in different states. We demonstrate that this approach was sufficient to achieve good results in the benchmarks considered; see Section 7.3. Our Stochastic Q-learning convergence analysis considers memoryless stochastic maximization with a random set \mathcal{R}caligraphic_R of any size.

Remark 4.1.

By setting 𝒞𝒞\mathcal{C}caligraphic_C equal to 𝒜𝒜\mathcal{A}caligraphic_A, we essentially revert to standard approaches. Consequently, our method is an extension of non-stochastic maximization. However, in pursuit of our objective to make RL practical for large discrete action spaces, for a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, in our analysis and experiments, we keep the union set 𝒞𝒞\mathcal{C}caligraphic_C limited to at most 2log(n)2𝑛2\lceil\log(n)\rceil2 ⌈ roman_log ( italic_n ) ⌉, ensuring sub-linear (logarithmic) complexity.

4.2 Stochastic Q-learning

  Initialize Q(𝐬,𝐚)𝑄𝐬𝐚Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_Q ( bold_s , bold_a ) for all 𝐬𝒮,𝐚𝒜formulae-sequence𝐬𝒮𝐚𝒜\operatorname{\mathbf{s}}\in\mathcal{S},\operatorname{\mathbf{a}}\in\mathcal{A}bold_s ∈ caligraphic_S , bold_a ∈ caligraphic_A
  for each episode do
     Observe state 𝐬𝐬\operatorname{\mathbf{s}}bold_s.
     for each step of episode do
       Choose 𝐚𝐚\operatorname{\mathbf{a}}bold_a from 𝐬𝐬\operatorname{\mathbf{s}}bold_s with policy πQS(𝐬)superscriptsubscript𝜋𝑄𝑆𝐬\pi_{Q}^{S}(\operatorname{\mathbf{s}})italic_π start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT ( bold_s ).
       Take action 𝐚𝐚\operatorname{\mathbf{a}}bold_a, observe r𝑟ritalic_r, 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
       bstochargmaxb𝒜Q(𝐬,b).superscript𝑏subscriptstochargmax𝑏𝒜𝑄superscript𝐬𝑏b^{*}\leftarrow\operatorname*{stoch\,arg\,max}_{b\in\mathcal{A}}Q(% \operatorname{\mathbf{s}}^{\prime},b).italic_b start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ← start_OPERATOR roman_stoch roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) .
       Δr+γQ(𝐬,b)Q(𝐬,𝐚)Δ𝑟𝛾𝑄superscript𝐬superscript𝑏𝑄𝐬𝐚\Delta\leftarrow r+\gamma Q(\operatorname{\mathbf{s}}^{\prime},b^{*})-Q(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})roman_Δ ← italic_r + italic_γ italic_Q ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_Q ( bold_s , bold_a ).
       Q(𝐬,𝐚)Q(𝐬,𝐚)+α(𝐬,𝐚)Δ𝑄𝐬𝐚𝑄𝐬𝐚𝛼𝐬𝐚ΔQ(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\leftarrow Q(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+\alpha(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})\Deltaitalic_Q ( bold_s , bold_a ) ← italic_Q ( bold_s , bold_a ) + italic_α ( bold_s , bold_a ) roman_Δ.
       𝐬𝐬𝐬superscript𝐬\operatorname{\mathbf{s}}\leftarrow\operatorname{\mathbf{s}}^{\prime}bold_s ← bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
     end for
  end for
Algorithm 1 Stochastic Q-learning

We introduce Stochastic Q-learning, described in Algorithm 1, and Stochastic Double Q-learning, described in Algorithm 2 in Appendix C, that replace the max\maxroman_max and argmaxargmax\operatorname*{arg\,max}roman_arg roman_max operations in Q-learning and Double Q-learning with stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max, respectively. Furthermore, we introduce Stochastic Sarsa, described in Algorithm 3 in Appendix C, which replaces the maximization in the greedy action selection (argmaxargmax\operatorname*{arg\,max}roman_arg roman_max) in Sarsa.

Our proposed solution takes a distinct approach from the conventional method of selecting the action with the highest Q-function value from the complete set of actions 𝒜𝒜\operatorname{\mathcal{A}}caligraphic_A. Instead, it uses stochastic maximization, which finds a maximum within a stochastic subset 𝒞𝒜𝒞𝒜\mathcal{C}\subseteq\mathcal{A}caligraphic_C ⊆ caligraphic_A, constructed as explained in Section 4.1. Our stochastic policy πQS(𝐬)superscriptsubscript𝜋𝑄𝑆𝐬\pi_{Q}^{S}(\operatorname{\mathbf{s}})italic_π start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT ( bold_s ), uses an ε𝐬subscript𝜀𝐬\varepsilon_{\operatorname{\mathbf{s}}}italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT-greedy algorithm, in a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, with a probability of (1ε𝐬)1subscript𝜀𝐬(1-\varepsilon_{\operatorname{\mathbf{s}}})( 1 - italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT ), for ε𝐬>0subscript𝜀𝐬0\varepsilon_{\operatorname{\mathbf{s}}}>0italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT > 0, is defined as follows:

πQS(𝐬){play randomlywith proba. ϵ𝐬stochargmax𝐚𝒜Q(𝐬,𝐚)otherwise.superscriptsubscript𝜋𝑄𝑆𝐬casesplay randomlywith proba. subscriptitalic-ϵ𝐬subscriptstochargmax𝐚𝒜𝑄𝐬𝐚otherwise\pi_{Q}^{S}(\operatorname{\mathbf{s}})\triangleq\begin{cases}\text{play % randomly}&\text{with proba. }\epsilon_{\operatorname{\mathbf{s}}}\\ \operatorname*{stoch\,arg\,max}_{\operatorname{\mathbf{a}}\in\operatorname{% \mathcal{A}}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})&\text{% otherwise}.\end{cases}italic_π start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT ( bold_s ) ≜ { start_ROW start_CELL play randomly end_CELL start_CELL with proba. italic_ϵ start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL start_OPERATOR roman_stoch roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) end_CELL start_CELL otherwise . end_CELL end_ROW (6)

Furthermore, during the training, to update the Q-function, our proposed Stochastic Q-learning uses the following rule:

Qt+1(𝐬t,𝐚t)=(1αt(𝐬t,𝐚t))Qt(𝐬t,𝐚t)subscript𝑄𝑡1subscript𝐬𝑡subscript𝐚𝑡1subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝑄𝑡subscript𝐬𝑡subscript𝐚𝑡\displaystyle Q_{t+1}\left(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf% {a}}_{t}\right)=\left(1-\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},% \operatorname{\mathbf{a}}_{t}\right)\right)Q_{t}\left(\operatorname{\mathbf{s}% }_{t},\operatorname{\mathbf{a}}_{t}\right)italic_Q start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
+αt(𝐬t,𝐚t)[rt+γstochmaxb𝒜Qt(𝐬t+1,b)].subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡delimited-[]subscript𝑟𝑡𝛾subscriptstochmax𝑏𝒜subscript𝑄𝑡subscript𝐬𝑡1𝑏\displaystyle+\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},\operatorname{% \mathbf{a}}_{t}\right)\left[r_{t}+\gamma\operatorname*{stoch\,max}_{b\in% \mathcal{A}}Q_{t}\left(\operatorname{\mathbf{s}}_{t+1},b\right)\right].+ italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) [ italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_b ) ] . (7)

While Stochastic Q-learning, like Q-learning, employs the same values for action selection and action evaluation, Stochastic Double Q-learning, similar to Double Q-learning, learns two separate Q-functions. For each update, one Q-function determines the policy, while the other determines the value of that policy. Both stochastic learning methods remove the maximization bottleneck from exploration and training updates, making these proposed algorithms significantly faster than their deterministic counterparts.

4.3 Stochastic Deep Q-network

We introduce Stochastic DQN (StochDQN), described in Algorithm 4 in Appendix C, and Stochastic DDQN (StochDDQN) as efficient variants of deep Q-networks. These variants substitute the maximization steps in the DQN (Mnih et al., 2013) and DDQN (Van Hasselt et al., 2016) algorithms with the stochastic maximization operations. In these modified approaches, we replace the ε𝐬subscript𝜀𝐬\varepsilon_{\operatorname{\mathbf{s}}}italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT-greedy exploration strategy with the same exploration policy as in Eq. (6).

For StochDQN, we employ a deep neural network as a function approximator to estimate the action-value function, represented as Q(𝐬,𝐚;θ)Q(𝐬,𝐚)𝑄𝐬𝐚𝜃𝑄𝐬𝐚Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}};\theta)\approx Q(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_Q ( bold_s , bold_a ; italic_θ ) ≈ italic_Q ( bold_s , bold_a ), where θ𝜃\thetaitalic_θ denotes the weights of the Q-network. This network is trained by minimizing a series of loss functions denoted as Li(θi)subscript𝐿𝑖subscript𝜃𝑖L_{i}(\theta_{i})italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), with these loss functions changing at each iteration i𝑖iitalic_i as follows:

Li(θi)subscript𝐿𝑖subscript𝜃𝑖\displaystyle L_{i}\left(\theta_{i}\right)italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) 𝔼𝐬,𝐚ρ()[(yiQ(𝐬,𝐚;θi))2],absentsubscript𝔼similar-to𝐬𝐚𝜌delimited-[]superscriptsubscript𝑦𝑖𝑄𝐬𝐚subscript𝜃𝑖2\displaystyle\triangleq\mathbb{E}_{\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}}\sim\rho(\cdot)}\left[\left(y_{i}-Q\left(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}};\theta_{i}\right)\right)^{2}\right],≜ blackboard_E start_POSTSUBSCRIPT bold_s , bold_a ∼ italic_ρ ( ⋅ ) end_POSTSUBSCRIPT [ ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Q ( bold_s , bold_a ; italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] , (8)

where yi𝔼[r+γstochmaxb𝒜Q(𝐬,b;θi1)|𝐬,𝐚]subscript𝑦𝑖𝔼delimited-[]𝑟conditional𝛾subscriptstochmax𝑏𝒜𝑄superscript𝐬𝑏subscript𝜃𝑖1𝐬𝐚y_{i}\triangleq\mathbb{E}\left[r+\gamma\operatorname*{stoch\,max}_{b\in% \operatorname{\mathcal{A}}}Q(\operatorname{\mathbf{s}}^{\prime},b;\theta_{i-1}% )|\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right]italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≜ blackboard_E [ italic_r + italic_γ start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ; italic_θ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ) | bold_s , bold_a ]. In this context, yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the target value for an iteration i𝑖iitalic_i, and ρ(.)\rho(.)italic_ρ ( . ) is a probability distribution that covers states and actions. Like the DQN approach, we keep the parameters fixed from the previous iteration, denoted as θi1subscript𝜃𝑖1\theta_{i-1}italic_θ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT when optimizing the loss function Li(θi)subscript𝐿𝑖subscript𝜃𝑖L_{i}(\theta_{i})italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ).

These target values depend on the network weights, which differ from the fixed targets typically used in supervised learning. We employ stochastic gradient descent for the training. While StochDQN, like DQN, employs the same values for action selection and evaluation, StochDDQN, like DDQN, trains two separate value functions. It does this by randomly assigning each experience to update one of the two value functions, resulting in two sets of weights, θ𝜃\thetaitalic_θ and θsuperscript𝜃\theta^{\prime}italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. For each update, one set of weights determines the policy, while the other set determines the values.

5 Stochastic Maximization Analysis

In the following, we study stochastic maximization with and without memory compared to exact maximization.

5.1 Memoryless Stochastic Maximization

Memoryless stochastic maximization, i.e., 𝒞=𝒞\mathcal{C}=\mathcal{R}\cup\emptysetcaligraphic_C = caligraphic_R ∪ ∅, does not always yield an optimal maximizer. To return an optimal action, this action needs to be randomly sampled from the set of actions. Finding an exact maximizer, without relying on memory \mathcal{M}caligraphic_M, is a random event with a probability p𝑝pitalic_p, representing the likelihood of sampling such an exact maximizer. In the following lemma, we provide a lower bound on the probability of discovering an optimal action within a uniformly randomly sampled subset 𝒞=𝒞\mathcal{C}=\mathcal{R}caligraphic_C = caligraphic_R of log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions, which we prove in Appendix B.1.1.

Lemma 5.1.

For any given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, the probability p𝑝pitalic_p of sampling an optimal action from a uniformly randomly chosen subset 𝒞𝒞\mathcal{C}caligraphic_C of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions is at least log(n)n𝑛𝑛\frac{\lceil\log(n)\rceil}{n}divide start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG start_ARG italic_n end_ARG.

While finding an exact maximizer through sampling may not always occur, the rewards of near-optimal actions can still be similar to those obtained from an optimal action. Therefore, the difference between stochastic maximization and exact maximization might be a more informative metric than just the probability of finding an exact maximizer. Thus, at time step t𝑡titalic_t, given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s and the current estimated Q-function Qtsubscript𝑄𝑡Q_{t}italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, we define the estimation error as βt(𝐬)subscript𝛽𝑡𝐬\beta_{t}(\operatorname{\mathbf{s}})italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ), as follows:

βt(𝐬)max𝐚𝒜Qt(𝐬,𝐚)stochmax𝐚𝒜Qt(𝐬,𝐚).subscript𝛽𝑡𝐬subscript𝐚𝒜subscript𝑄𝑡𝐬𝐚subscriptstochmax𝐚𝒜subscript𝑄𝑡𝐬𝐚\beta_{t}(\operatorname{\mathbf{s}})\triangleq\max_{\operatorname{\mathbf{a}}% \in\mathcal{A}}Q_{t}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}% \right)-\operatorname*{stoch\,max}_{\operatorname{\mathbf{a}}\in\mathcal{A}}Q_% {t}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right).italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ≜ roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) - start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) . (9)

Furthermore, we define the similarity ratio ωt(𝐬)subscript𝜔𝑡𝐬\omega_{t}(\operatorname{\mathbf{s}})italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ), as follows:

ωt(𝐬)stochmax𝐚𝒜Qt(𝐬,𝐚)/max𝐚𝒜Qt(𝐬,𝐚).subscript𝜔𝑡𝐬subscriptstochmax𝐚𝒜subscript𝑄𝑡𝐬𝐚subscript𝐚𝒜subscript𝑄𝑡𝐬𝐚\omega_{t}(\operatorname{\mathbf{s}})\triangleq\operatorname*{stoch\,max}_{% \operatorname{\mathbf{a}}\in\mathcal{A}}Q_{t}\left(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}}\right)/\max_{\operatorname{\mathbf{a}}\in\mathcal{A}% }Q_{t}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right).italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ≜ start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) / roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) . (10)

It can be seen from the definitions that βt(𝐬)0subscript𝛽𝑡𝐬0\beta_{t}(\operatorname{\mathbf{s}})\geq 0italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ≥ 0 and ωt(𝐬)1subscript𝜔𝑡𝐬1\omega_{t}(\operatorname{\mathbf{s}})\leq 1italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ≤ 1. While sampling the exact maximizer is not always possible, near-optimal actions may yield near-optimal values, providing good approximations, i.e., βt(𝐬)0subscript𝛽𝑡𝐬0\beta_{t}(\operatorname{\mathbf{s}})\approx 0italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ≈ 0 and ωt(𝐬)1subscript𝜔𝑡𝐬1\omega_{t}(\operatorname{\mathbf{s}})\approx 1italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ≈ 1. In general, this difference depends on the value distribution over the actions.

While we do not make any specific assumptions about the value distribution in our work, we note that with some simplifying assumptions on the value distributions over the actions, one can derive more specialized guarantees. For example, assuming that the rewards are uniformly distributed over the actions, we demonstrate in Section B.3 that for a given discrete state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, if the values of the sampled actions independently follow a uniform distribution from the interval [Qt(𝐬,𝐚t)bt(𝐬),Qt(𝐬,𝐚t)]subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡[Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{\star})-b_{t}(% \operatorname{\mathbf{s}}),Q_{t}(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}}_{t}^{\star})][ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) , italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) ], where bt(𝐬)subscript𝑏𝑡𝐬b_{t}(\operatorname{\mathbf{s}})italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) represents the range of the Qt(𝐬,.)Q_{t}(\operatorname{\mathbf{s}},.)italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , . ) values over the actions in state 𝐬𝐬\operatorname{\mathbf{s}}bold_s at time step t𝑡titalic_t, then the expected value of βt(𝐬)subscript𝛽𝑡𝐬\beta_{t}(\operatorname{\mathbf{s}})italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ), even without memory, is: 𝔼[βt(𝐬)𝐬]bt(𝐬)log(n)+1.𝔼delimited-[]conditionalsubscript𝛽𝑡𝐬𝐬subscript𝑏𝑡𝐬𝑛1\mathbb{E}\left[\beta_{t}(\operatorname{\mathbf{s}})\mid\operatorname{\mathbf{% s}}\right]\leq\frac{b_{t}(\operatorname{\mathbf{s}})}{\lceil\log(n)\rceil+1}.blackboard_E [ italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ∣ bold_s ] ≤ divide start_ARG italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ + 1 end_ARG . Furthermore, we empirically demonstrate that for the considered control problems, the difference βt(𝐬)subscript𝛽𝑡𝐬\beta_{t}(\operatorname{\mathbf{s}})italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) is not large, and the ratio ωt(𝐬)subscript𝜔𝑡𝐬\omega_{t}(\operatorname{\mathbf{s}})italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) is close to one, as shown in Section 7.4.

5.2 Stochastic Maximization with Memory

While memoryless stochastic maximization could approach the maximum value or find it with the probability p𝑝pitalic_p, lower-bounded in Lemma 5.1, it does not converge to an exact maximization, as it keeps sampling purely at random, as can be seen in Fig. 6 in Appendix E.2.1. However, memory-based stochastic maximization, i.e., 𝒞=𝒞\mathcal{C}=\mathcal{R}\cup\mathcal{M}caligraphic_C = caligraphic_R ∪ caligraphic_M with \mathcal{M}\neq\emptysetcaligraphic_M ≠ ∅, can become an exact maximization when the Q-function becomes stable, as we state in the Corollary 5.3, which we prove in Appendix B.2.1, and as confirmed in Fig. 6.

Definition 5.2.

A Q-function is considered stable for a given time range and state 𝐬𝐬\operatorname{\mathbf{s}}bold_s when its maximizing action in that state remains unchanged for all subsequent steps within that time, even if the Q-function’s values themselves change.

A straightforward example of a stable Q-function occurs during validation periods when no function updates are performed. However, in general, a stable Q-function does not have to be static and might still vary over the rounds; the critical characteristic is that its maximizing action remains the same even when its values are updated. Although the stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max has sub-linear complexity compared to the max\maxroman_max, without any assumption of the value distributions, the following corollary shows that, on average, for a stable Q-function, after a certain number of iterations, the output of the stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max matches precisely the output of max\maxroman_max.

Corollary 5.3.

For a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, assuming a time range where the Q-function becomes stable in that state, βt(𝐬)subscript𝛽𝑡𝐬\beta_{t}(\operatorname{\mathbf{s}})italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) is expected to converge to zero after nlog(n)𝑛𝑛\frac{n}{\lceil\log(n)\rceil}divide start_ARG italic_n end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG iterations.

Recalling the definition of the similarity ratio ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, it follows that ωt(𝐬)=1β(s)/max𝐚𝒜Qt(𝐬,𝐚)subscript𝜔𝑡𝐬1𝛽𝑠subscript𝐚𝒜subscript𝑄𝑡𝐬𝐚\omega_{t}(\operatorname{\mathbf{s}})=1-\beta(s)/\max_{\operatorname{\mathbf{a% }}\in\operatorname{\mathcal{A}}}Q_{t}(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) = 1 - italic_β ( italic_s ) / roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ). Therefore, for a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, where the Q-function becomes stable, given the boundedness of iterates in Q-learning, it is expected that ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT converges to one. This observation was confirmed, even with continuous states and using neural networks as function approximators, in Section 7.4.

6 Stochastic Q-learning Convergence

In this section, we analyze the convergence of the Stochastic Q-learning, described in Algorithm 1. This algorithm employs the policy πQS(𝐬)superscriptsubscript𝜋𝑄𝑆𝐬\pi_{Q}^{S}(\operatorname{\mathbf{s}})italic_π start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT ( bold_s ), as defined in Eq. (6), with ε𝐬>0subscript𝜀𝐬0\varepsilon_{\operatorname{\mathbf{s}}}>0italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT > 0 to guarantee that π[𝐚t=𝐚𝐬t=𝐬]>0subscript𝜋delimited-[]subscript𝐚𝑡conditional𝐚subscript𝐬𝑡𝐬0\mathbb{P}_{\pi}[\operatorname{\mathbf{a}}_{t}=\operatorname{\mathbf{a}}\mid% \operatorname{\mathbf{s}}_{t}=\operatorname{\mathbf{s}}]>0blackboard_P start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT [ bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_a ∣ bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_s ] > 0 for all state-action pairs (𝐬,𝐚)𝒮×𝒜𝐬𝐚𝒮𝒜(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\in\operatorname{\mathcal% {S}}\times\operatorname{\mathcal{A}}( bold_s , bold_a ) ∈ caligraphic_S × caligraphic_A. The value update rule, on the other hand, uses the update rule specified in Eq. (4.2).

In the convergence analysis, we focus on memoryless maximization. While the stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max operator for action selection can be employed with or without memory, we assume a memoryless stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max operator for value updates, which means that value updates are performed by maximizing over a randomly sampled subset of actions from 𝒜𝒜\mathcal{A}caligraphic_A, sampled independently from both the next state 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and the set used for the stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max.

For a stochastic variable subset of actions 𝒞𝒜𝒞𝒜\mathcal{C}\subseteq\mathcal{A}caligraphic_C ⊆ caligraphic_A, following some probability distribution :2𝒜[0,1]:superscript2𝒜01\mathbb{P}:2^{\mathcal{A}}\rightarrow[0,1]blackboard_P : 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT → [ 0 , 1 ], we consider, without loss of generality Q(.,)=0Q(.,\emptyset)=0italic_Q ( . , ∅ ) = 0, and define, according to \mathbb{P}blackboard_P, a target Q-function, denoted as Qsuperscript𝑄Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, as:

Q(𝐬,𝐚)𝔼[r(𝐬,𝐚)+γmaxb𝒞Q(𝐬,b)𝐬,𝐚].superscript𝑄𝐬𝐚𝔼delimited-[]𝑟𝐬𝐚conditional𝛾subscript𝑏𝒞similar-tosuperscript𝑄superscript𝐬𝑏𝐬𝐚Q^{*}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\triangleq\mathbb{E}% \left[r(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+\gamma\max_{b\in% \mathcal{C}\sim\mathbb{P}}Q^{*}(\operatorname{\mathbf{s}}^{\prime},b)\mid% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right].italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) ≜ blackboard_E [ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C ∼ blackboard_P end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ bold_s , bold_a ] . (11)
Remark 6.1.

The Qsuperscript𝑄Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT defined above depends on the sampling distribution \mathbb{P}blackboard_P. Therefore, it does not represent the optimal value function of the original MDP problem; instead, it is optimal under the condition where only a random subset of actions following the distribution \mathbb{P}blackboard_P is available to the agent at each time step. However, as the sampling cardinality increases, it increasingly better approximates the optimal value function of the original MDP and fully recovers the optimal Q-function of the original problem when the sampling distribution becomes (𝒜)=1𝒜1\mathbb{P}(\mathcal{A})=1blackboard_P ( caligraphic_A ) = 1.

The following theorem states the convergence of the iterates Qtsubscript𝑄𝑡Q_{t}italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT of Stochastic Q-learning with memoryless stochastic maximization to the Qsuperscript𝑄Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, defined in Eq. 11, for any sampling distribution \mathbb{P}blackboard_P, regardless of the cardinality.

Theorem 6.2.

For a finite MDP, as described in Section 3, let 𝒞𝒞\mathcal{C}caligraphic_C be a randomly independently sampled subset of actions from 𝒜𝒜\mathcal{A}caligraphic_A, of any cardinality, following any distribution \mathbb{P}blackboard_P, exclusively sampled for the value updates, for the Stochastic Q-learning, as described in Algorithm 1, given by the following update rule:

Qt+1(𝐬t,𝐚t)=(1αt(𝐬t,𝐚t))Qt(𝐬t,𝐚t)subscript𝑄𝑡1subscript𝐬𝑡subscript𝐚𝑡1subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝑄𝑡subscript𝐬𝑡subscript𝐚𝑡\displaystyle Q_{t+1}\left(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf% {a}}_{t}\right)=(1-\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},\operatorname% {\mathbf{a}}_{t}\right))Q_{t}\left(\operatorname{\mathbf{s}}_{t},\operatorname% {\mathbf{a}}_{t}\right)italic_Q start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
+αt(𝐬t,𝐚t)[rt+γmaxb𝒞Qt(𝐬t+1,𝐚)],subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡delimited-[]subscript𝑟𝑡𝛾subscript𝑏𝒞similar-tosubscript𝑄𝑡subscript𝐬𝑡1𝐚\displaystyle+\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},\operatorname{% \mathbf{a}}_{t}\right)\left[r_{t}+\gamma\max_{b\in\mathcal{C}\sim\mathbb{P}}Q_% {t}\left(\operatorname{\mathbf{s}}_{t+1},\operatorname{\mathbf{a}}\right)% \right],+ italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) [ italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C ∼ blackboard_P end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , bold_a ) ] ,

given any initial estimate Q0subscript𝑄0Q_{0}italic_Q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, Qtsubscript𝑄𝑡Q_{t}italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT converges with probability 1 to Qsuperscript𝑄Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, defined in Eq. (11), as long as tαt(𝐬,𝐚)=subscript𝑡subscript𝛼𝑡𝐬𝐚\sum_{t}\alpha_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=\infty∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) = ∞ and tαt2(𝐬,𝐚)<subscript𝑡superscriptsubscript𝛼𝑡2𝐬𝐚\sum_{t}\alpha_{t}^{2}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})<\infty∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_s , bold_a ) < ∞ for all (𝐬,𝐚)𝒮×𝒜𝐬𝐚𝒮𝒜(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\in\operatorname{\mathcal% {S}}\times\operatorname{\mathcal{A}}( bold_s , bold_a ) ∈ caligraphic_S × caligraphic_A.

The theorem’s result demonstrates that for any cardinality of actions, Stochastic Q-learning converges to Qsuperscript𝑄Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, as defined in Eq. (11), which recovers the convergence guarantees of Q-learning when the sampling distribution is (𝒜)=1𝒜1\mathbb{P}(\mathcal{A})=1blackboard_P ( caligraphic_A ) = 1.

Remark 6.3.

In principle, any size can be used, balancing time complexity and approximation. Our empirical experiments focused on log(n)𝑛\log(n)roman_log ( italic_n ) to illustrate the method’s ability to recover Q-learning, even with a few actions. Using n𝑛\sqrt{n}square-root start_ARG italic_n end_ARG will approach the value function of Q-learning more closely compared to using log(n)𝑛\log(n)roman_log ( italic_n ), albeit at the cost of higher complexity than log(n)𝑛\log(n)roman_log ( italic_n ).

The theorem shows that even with memoryless stochastic maximization, using randomly sampled 𝒪(log(n))𝒪𝑛\mathcal{O}(\log(n))caligraphic_O ( roman_log ( italic_n ) ) actions, the convergence is still guaranteed. However, relying on memory-based stochastic maximization helps minimize the approximation error in stochastic maximization, as shown in Corollary 5.3, and outperforms Q-learning as shown in the experiments in Section 7.1.

In the following, we provide a sketch of the proof addressing the extra stochasticity due to stochastic maximization. The full proof is provided in Appendix A.

We tackle the additional stochasticity depending on the sampling distribution \mathbb{P}blackboard_P, by defining an operator function ΦΦ\Phiroman_Φ, which for any q:𝒮×𝒜:𝑞𝒮𝒜q:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\rightarrow% \operatorname{\mathbb{R}}italic_q : caligraphic_S × caligraphic_A → blackboard_R, is as follows:

(Φq)(𝐬,𝐚)Φ𝑞𝐬𝐚\displaystyle(\Phi q)(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})( roman_Φ italic_q ) ( bold_s , bold_a ) 𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)absentsubscript𝒞superscript2𝒜𝒞subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚\displaystyle\triangleq\sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(% \mathcal{C})\sum_{\operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{% S}}}\mathcal{P}(\operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}% },\operatorname{\mathbf{a}})≜ ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a )
[r(𝐬,𝐚)+γmaxb𝒞q(𝐬,b)].delimited-[]𝑟𝐬𝐚𝛾subscript𝑏𝒞𝑞superscript𝐬𝑏\displaystyle\quad\quad\quad\left[r(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})+\gamma\max_{b\in\mathcal{C}}q(\operatorname{\mathbf{s}}^{\prime},% b)\right].[ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ] . (12)

We then demonstrate that it is a contraction in the sup-norm, as shown in Lemma 6.4, which we prove in Appendix A.2.

Lemma 6.4.

The operator ΦΦ\Phiroman_Φ, defined in Eq. (6), is a contraction in the sup-norm, with a contraction factor γ𝛾\gammaitalic_γ, i.e., Φq1Φq2γq1q2.subscriptnormΦsubscript𝑞1Φsubscript𝑞2𝛾subscriptnormsubscript𝑞1subscript𝑞2\left\|\Phi q_{1}-\Phi q_{2}\right\|_{\infty}\leq\gamma\left\|q_{1}-q_{2}% \right\|_{\infty}.∥ roman_Φ italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - roman_Φ italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ≤ italic_γ ∥ italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT .

We then use the above lemma to establish the convergence of Stochastic Q-learning. Given any initial estimate Q0subscript𝑄0Q_{0}italic_Q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, using the considered update rule for Stochastic Q-learning, subtracting from both sides Q(𝐬t,𝐚t)superscript𝑄subscript𝐬𝑡subscript𝐚𝑡Q^{*}\left(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf{a}}_{t}\right)italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) and letting Δt(𝐬,a)Qt(𝐬,a)Q(𝐬,a)subscriptΔ𝑡𝐬𝑎subscript𝑄𝑡𝐬𝑎superscript𝑄𝐬𝑎\Delta_{t}(\operatorname{\mathbf{s}},a)\triangleq Q_{t}(\operatorname{\mathbf{% s}},a)-Q^{*}(\operatorname{\mathbf{s}},a)roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_a ) ≜ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_a ) - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , italic_a ), yields

Δt+1(𝐬t,𝐚t)subscriptΔ𝑡1subscript𝐬𝑡subscript𝐚𝑡\displaystyle\Delta_{t+1}\left(\operatorname{\mathbf{s}}_{t},\operatorname{% \mathbf{a}}_{t}\right)roman_Δ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) =(1αt(𝐬t,𝐚t))Δt(𝐬t,𝐚t)absent1subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscriptΔ𝑡subscript𝐬𝑡subscript𝐚𝑡\displaystyle=\left(1-\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},% \operatorname{\mathbf{a}}_{t}\right)\right)\Delta_{t}\left(\operatorname{% \mathbf{s}}_{t},\operatorname{\mathbf{a}}_{t}\right)= ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
+αt(𝐬t,𝐚t)Ft(𝐬t,𝐚t), withsubscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝐹𝑡subscript𝐬𝑡subscript𝐚𝑡 with\displaystyle\quad+\alpha_{t}(\operatorname{\mathbf{s}}_{t},\operatorname{% \mathbf{a}}_{t})F_{t}(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf{a}}_% {t}),\text{ with }+ italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , with
Ft(𝐬,𝐚)r(𝐬,𝐚)+γmaxb𝒞Qt(𝐬,b)Q(𝐬,𝐚).subscript𝐹𝑡𝐬𝐚𝑟𝐬𝐚𝛾subscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏superscript𝑄𝐬𝐚F_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\triangleq r(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+\gamma\max_{b\in\mathcal{% C}}Q_{t}\left(\operatorname{\mathbf{s}}^{\prime},b\right)-Q^{*}\left(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right).italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ≜ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) . (13)

With tsubscript𝑡\mathcal{F}_{t}caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT representing the past at time t𝑡titalic_t,

𝔼[Ft(𝐬,𝐚)t]𝔼delimited-[]conditionalsubscript𝐹𝑡𝐬𝐚subscript𝑡\displaystyle\mathbb{E}\left[F_{t}(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\mid\mathcal{F}_{t}\right]blackboard_E [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] =𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)[Ft(𝐬,𝐚)]absentsubscript𝒞superscript2𝒜𝒞subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚delimited-[]subscript𝐹𝑡𝐬𝐚\displaystyle=\sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum% _{\operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(% \operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\left[F_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\right]= ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ]
=(Φ(Qt))(𝐬,𝐚)Q(𝐬,𝐚).absentΦsubscript𝑄𝑡𝐬𝐚superscript𝑄𝐬𝐚\displaystyle=\left(\Phi(Q_{t})\right)(\operatorname{\mathbf{s}},\operatorname% {\mathbf{a}})-Q^{*}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}).= ( roman_Φ ( italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ( bold_s , bold_a ) - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) .

Using the fact that Q=ΦQsuperscript𝑄Φsuperscript𝑄Q^{*}=\Phi Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_Φ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and Lemma 6.4,

𝔼[Ft(𝐬,𝐚)t]γQtQ=γΔt.\left\|\mathbb{E}\left[F_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a% }})\mid\mathcal{F}_{t}\right]\right\|_{\infty}\leq\gamma\left\|Q_{t}-Q^{*}% \right\|_{\infty}=\gamma\left\|\Delta_{t}\right\|_{\infty}.∥ blackboard_E [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ≤ italic_γ ∥ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT = italic_γ ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT . (14)

Given that r𝑟ritalic_r is bounded, its variance is bounded by some constant B𝐵Bitalic_B. Thus, as shown in Appendix A.1, for C=max{B+γ2Q2,γ2}𝐶𝐵superscript𝛾2superscriptsubscriptnormsuperscript𝑄2superscript𝛾2C=\max\{B+\gamma^{2}\|Q^{*}\|_{\infty}^{2},\gamma^{2}\}italic_C = roman_max { italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT }, var[Ft(𝐬,𝐚)t]C(1+Δt)2.varconditionalsubscript𝐹𝑡𝐬𝐚subscript𝑡𝐶superscript1subscriptnormsubscriptΔ𝑡2\operatorname{var}\left[F_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{% a}})\mid\mathcal{F}_{t}\right]\leq C(1+\|\Delta_{t}\|_{\infty})^{2}.roman_var [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ≤ italic_C ( 1 + ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . Then, by this inequality, Eq. (14), and Theorem 1 in (Jaakkola et al., 1993), ΔtsubscriptΔ𝑡\Delta_{t}roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT converges to zero with probability 1, i.e., Qtsubscript𝑄𝑡Q_{t}italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT converges to Qsuperscript𝑄Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT with probability 1.

7 Experiments

We compare stochastic maximization to exact maximization and evaluate the proposed RL algorithms in Gymnasium (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) environments. The stochastic tabular Q-learning approaches are tested on CliffWalking-v0, FrozenLake-v1, and a generated MDP environment. Additionally, the stochastic deep Q-network approaches are tested on control tasks and compared against their deterministic counterparts, as well as against DDPG (Lillicrap et al., 2015), A2C (Mnih et al., 2016), and PPO (Schulman et al., 2017), using Stable-Baselines implementations (Hill et al., 2018), which can directly handle continuous action spaces. Further details can be found in Appendix D.

7.1 Stochastic Q-learning Average Return

We test Stochastic Q-learning, Stochastic Double Q-learning, and Stochastic Sarsa in environments with discrete states and actions. Interestingly, as shown in Fig. 1, our stochastic algorithms outperform their deterministic counterparts. Furthermore, we observe that Stochastic Q-learning outperforms all the methods considered regarding the cumulative rewards in the FrozenLake-v1. Moreover, in the CliffWalking-v0 (as shown in Fig. 10), as well as for the generated MDP environment with 256 actions (as shown in Fig. 12), all the stochastic and non-stochastic methods reach the optimal policy in a similar number of steps.

Refer to caption
Figure 1: Comparison of stochastic vs. non-stochastic value-based variants on the FrozenLake-v1, with steps on the x-axis and cumulative rewards on the y-axis.

7.2 Exponential Wall Time Speedup

Stochastic maximization methods exhibit logarithmic complexity regarding the number of actions. Therefore, StochDQN and StochDDQN, which apply these techniques for action selection and updates, have exponentially faster execution times than DQN and DDQN, as confirmed in Fig. 2.

Refer to caption
Figure 2: Comparison of wall time in seconds of stochastic and non-stochastic DQN methods on various action set sizes.

For the time duration of action selection alone, please refer to Appendix E.1. The time analysis results show that the proposed methods are nearly as fast as a random algorithm that selects actions randomly. Specifically, in the experiments with the InvertedPendulum-v4, the stochastic methods took around 0.003 seconds per step for a set of 1000 actions, while the non-stochastic methods took 0.18 seconds, which indicates that the stochastic versions are 60 times faster than their deterministic counterparts. Furthermore, for the HalfCheetah-v4 experiment, we considered 4096 actions, where one (D)DQN step takes 0.6 seconds, needing around 17 hours to run for 100,000 steps, while the Stoch(D)DQN needs around 2 hours to finish the same 100,000 steps. In other words, we can easily run for 10x more steps in the same period (seconds). This makes the stochastic methods more practical, especially with large action spaces.

7.3 Stochastic Deep Q-network Average Return

Fig. 3 shows the performance of various RL algorithms on the InvertedPendulum-v4 task, which has 512 actions. StochDQN achieves the optimal average return in fewer steps than DQN, with a lower per-step time advantage (as shown in Section 7.2). Interestingly, while DDQN struggles, StochDDQN nearly reaches the optimal average return, demonstrating the effectiveness of stochasticity. StochDQN and StochDDQN significantly outperform DDQN, A2C, and PPO by obtaining higher average returns in fewer steps. Similarly, Fig. 9(b) in Section E.3 shows the results for the HalfCheetah-v4 task, which has 4096 actions. Stochastic methods, particularly StochDDQN, achieve results comparable to the non-stochastic methods. Notably, all DQN methods (stochastic and non-stochastic) outperform PPO and A2C, highlighting their efficiency in such scenarios.

Remark 7.1.

While comparing them falls outside the scope of our work, we note that DDQN was proposed to mitigate the inherent overestimation in DQN. However, exchanging overestimation for underestimation bias is not always beneficial, as our results demonstrate and as shown in other studies such as (Lan et al., 2020).

Refer to caption
Figure 3: Comparison of stochastic DQN variants against other RL algorithms on the InvertedPendulum-v4, with 1000 actions, with steps on the x-axis and average returns on the y-axis.

7.4 Stochastic Maximization

This section analyzes stochastic maximization by tracking returned values of ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (Eq. (10)) across the steps. As shown in Fig. 4, for StochDQN, for the InvertedPendulum-v4, ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT approaches one rapidly, similarly for the HalfCheetah-v4, as shown in Appendix E.2.2.

Refer to caption
Figure 4: The stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and max\maxroman_max ratio values tracked over the steps on the InvertedPendulum-v4.

Furthermore, we track the returned values of the difference βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (Eq. (9)) and show that it is bounded by small values in both environments, as illustrated in Appendix E.2.2.

8 Discussion

In this work, we focus on adapting value-based methods, which excel in generalization compared to actor-based approaches (Dulac-Arnold et al., 2015). However, this advantage comes at the cost of lower computational efficiency due to the maximization operation required for action selection and value function updates. Therefore, our primary motivation is to provide a computationally efficient alternative for situations with general large discrete action spaces.

We focus mainly on Q-learning-like methods among value-based approaches due to their off-policy nature and proven success in various applications. We demonstrate that these methods can be applied to large discrete action spaces while achieving exponentially lower complexity and maintaining good performance. Furthermore, our proposed stochastic maximization method performs well even when applied to the on-policy Sarsa algorithm, extending its potential beyond off-policy methods. Consequently, the suggested stochastic approach offers broader applicability to other value-based approaches, resulting in lower complexity and improved efficiency with large discrete action spaces.

While the primary goal of this work is to reduce the complexity and wall time of Q-learning-like algorithms, our experiments revealed that stochastic methods not only achieve shorter step times (in seconds) but also, in some cases, yield higher rewards and exhibit faster convergence in terms of the number of steps compared to other methods. These improvements can be attributed to several factors. Firstly, introducing more stochasticity into the greedy choice through stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max enhances exploration. Secondly, Stochastic Q-learning specifically helps to reduce the inherent overestimation in Q-learning-like methods (Hasselt, 2010; Lan et al., 2020; Wang et al., 2021). This reduction is achieved using stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max, a lower bound to the max\maxroman_max operation.

Q-learning methods, focused initially on discrete actions, can be adapted to tackle continuous problems with discretization techniques and stochastic maximization. Our control experiments show that Q-network methods with discretization achieve superior performance to algorithms with continuous actions, such as PPO, by obtaining higher rewards in fewer steps, which aligns with observations in previous works that highlight the potential of discretization for solving continuous control problems (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020). Notably, the logarithmic complexity of the proposed stochastic methods concerning the number of considered actions makes them well-suited for scenarios with finer-grained discretization, leading to more practical implementations.

9 Conclusion

We propose adapting Q-learning-like methods to mitigate the computational bottleneck associated with the max\maxroman_max and argmaxargmax\operatorname*{arg\,max}roman_arg roman_max operations in these methods. By reducing the maximization complexity from linear to sublinear using stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max, we pave the way for practical and efficient value-based RL for large discrete action spaces. We prove the convergence of Stochastic Q-learning, analyze stochastic maximization, and empirically show that it performs well with significantly low complexity.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

  • Akkerman et al. (2023) Akkerman, F., Luy, J., van Heeswijk, W., and Schiffer, M. Handling large discrete action spaces via dynamic neighborhood construction. arXiv preprint arXiv:2305.19891, 2023.
  • Al-Abbasi et al. (2019) Al-Abbasi, A. O., Ghosh, A., and Aggarwal, V. Deeppool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(12):4714–4727, 2019.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Cui & Khardon (2016) Cui, H. and Khardon, R. Online symbolic gradient-based optimization for factored action mdps. In IJCAI, pp.  3075–3081, 2016.
  • Delarue et al. (2020) Delarue, A., Anderson, R., and Tjandraatmadja, C. Reinforcement learning with combinatorial actions: An application to vehicle routing. Advances in Neural Information Processing Systems, 33:609–620, 2020.
  • Dulac-Arnold et al. (2012) Dulac-Arnold, G., Denoyer, L., Preux, P., and Gallinari, P. Fast reinforcement learning with large action sets using error-correcting output codes for mdp factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp.  180–194. Springer, 2012.
  • Dulac-Arnold et al. (2015) Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
  • Dulac-Arnold et al. (2021) Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419–2468, 2021.
  • Enders et al. (2023) Enders, T., Harrison, J., Pavone, M., and Schiffer, M. Hybrid multi-agent deep reinforcement learning for autonomous mobility on demand systems. In Learning for Dynamics and Control Conference, pp.  1284–1296. PMLR, 2023.
  • Farquhar et al. (2020) Farquhar, G., Gustafson, L., Lin, Z., Whiteson, S., Usunier, N., and Synnaeve, G. Growing action spaces. In International Conference on Machine Learning, pp.  3040–3051. PMLR, 2020.
  • Fourati & Alouini (2021) Fourati, F. and Alouini, M.-S. Artificial intelligence for satellite communication: A review. Intelligent and Converged Networks, 2(3):213–243, 2021.
  • Fourati et al. (2023) Fourati, F., Aggarwal, V., Quinn, C., and Alouini, M.-S. Randomized greedy learning for non-monotone stochastic submodular maximization under full-bandit feedback. In International Conference on Artificial Intelligence and Statistics, pp.  7455–7471. PMLR, 2023.
  • Fourati et al. (2024a) Fourati, F., Alouini, M.-S., and Aggarwal, V. Federated combinatorial multi-agent multi-armed bandits. arXiv preprint arXiv:2405.05950, 2024a.
  • Fourati et al. (2024b) Fourati, F., Quinn, C. J., Alouini, M.-S., and Aggarwal, V. Combinatorial stochastic-greedy bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.  12052–12060, 2024b.
  • Gonzalez et al. (2023) Gonzalez, G., Balakuntala, M., Agarwal, M., Low, T., Knoth, B., Kirkpatrick, A. W., McKee, J., Hager, G., Aggarwal, V., Xue, Y., et al. Asap: A semi-autonomous precise system for telesurgery during communication delays. IEEE Transactions on Medical Robotics and Bionics, 5(1):66–78, 2023.
  • Haliem et al. (2021) Haliem, M., Mani, G., Aggarwal, V., and Bhargava, B. A distributed model-free ride-sharing approach for joint matching, pricing, and dispatching using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 22(12):7931–7942, 2021.
  • Hasselt (2010) Hasselt, H. Double q-learning. Advances in neural information processing systems, 23, 2010.
  • He et al. (2015) He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., and Ostendorf, M. Deep reinforcement learning with an unbounded action space. arXiv preprint arXiv:1511.04636, 5, 2015.
  • He et al. (2016a) He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., and Ostendorf, M. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1621–1630, Berlin, Germany, August 2016a. Association for Computational Linguistics. doi: 10.18653/v1/P16-1153. URL https://aclanthology.org/P16-1153.
  • He et al. (2016b) He, J., Ostendorf, M., He, X., Chen, J., Gao, J., Li, L., and Deng, L. Deep reinforcement learning with a combinatorial action space for predicting popular Reddit threads. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.  1838–1848, Austin, Texas, November 2016b. Association for Computational Linguistics. doi: 10.18653/v1/D16-1189. URL https://aclanthology.org/D16-1189.
  • Hill et al. (2018) Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.
  • Ireland & Montana (2024) Ireland, D. and Montana, G. Revalued: Regularised ensemble value-decomposition for factorisable markov decision processes. arXiv preprint arXiv:2401.08850, 2024.
  • Jaakkola et al. (1993) Jaakkola, T., Jordan, M., and Singh, S. Convergence of stochastic iterative dynamic programming algorithms. Advances in neural information processing systems, 6, 1993.
  • Kalashnikov et al. (2018) Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • Kim et al. (2021) Kim, M., Park, J., et al. Learning collaborative policies to solve np-hard routing problems. Advances in Neural Information Processing Systems, 34:10418–10430, 2021.
  • Lagoudakis & Parr (2003) Lagoudakis, M. G. and Parr, R. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp.  424–431, 2003.
  • Lan et al. (2020) Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin q-learning: Controlling the estimation bias of q-learning. arXiv preprint arXiv:2002.06487, 2020.
  • Li et al. (2022) Li, S., Wei, C., and Wang, Y. Combining decision making and trajectory planning for lane changing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 23(9):16110–16136, 2022.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Luong et al. (2019) Luong, N. C., Hoang, D. T., Gong, S., Niyato, D., Wang, P., Liang, Y.-C., and Kim, D. I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Communications Surveys & Tutorials, 21(4):3133–3174, 2019.
  • Mahajan et al. (2021) Mahajan, A., Samvelyan, M., Mao, L., Makoviychuk, V., Garg, A., Kossaifi, J., Whiteson, S., Zhu, Y., and Anandkumar, A. Reinforcement learning in factored action spaces using tensor decompositions. arXiv preprint arXiv:2110.14538, 2021.
  • Mazyavkina et al. (2021) Mazyavkina, N., Sviridov, S., Ivanov, S., and Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
  • Metz et al. (2017) Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035, 2017.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp.  1928–1937. PMLR, 2016.
  • Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp.  807–814, 2010.
  • Pazis & Parr (2011) Pazis, J. and Parr, R. Generalized value functions for large action sets. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp.  1185–1192, 2011.
  • Peng et al. (2021) Peng, B., Rashid, T., Schroeder de Witt, C., Kamienny, P.-A., Torr, P., Böhmer, W., and Whiteson, S. Facmac: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021.
  • Quillen et al. (2018) Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., and Levine, S. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.  6284–6291. IEEE, 2018.
  • Rubinstein (1999) Rubinstein, R. The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1:127–190, 1999.
  • Rummery & Niranjan (1994) Rummery, G. A. and Niranjan, M. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
  • Sallans & Hinton (2004) Sallans, B. and Hinton, G. E. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063–1088, 2004.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Seyde et al. (2021) Seyde, T., Gilitschenski, I., Schwarting, W., Stellato, B., Riedmiller, M., Wulfmeier, M., and Rus, D. Is bang-bang control all you need? solving continuous control with bernoulli policies. Advances in Neural Information Processing Systems, 34:27209–27221, 2021.
  • Seyde et al. (2022) Seyde, T., Werner, P., Schwarting, W., Gilitschenski, I., Riedmiller, M., Rus, D., and Wulfmeier, M. Solving continuous control via q-learning. arXiv preprint arXiv:2210.12566, 2022.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
  • Tang & Agrawal (2020) Tang, Y. and Agrawal, S. Discretizing continuous action space for on-policy optimization. In Proceedings of the aaai conference on artificial intelligence, volume 34, pp.  5981–5988, 2020.
  • Tavakoli et al. (2018) Tavakoli, A., Pardo, F., and Kormushev, P. Action branching architectures for deep reinforcement learning. In Proceedings of the AAAI conference on Artificial Intelligence, volume 32, 2018.
  • Tennenholtz & Mannor (2019) Tennenholtz, G. and Mannor, S. The natural language of actions. In International Conference on Machine Learning, pp.  6196–6205. PMLR, 2019.
  • Tessler et al. (2019) Tessler, C., Zahavy, T., Cohen, D., Mankowitz, D. J., and Mannor, S. Action assembly: Sparse imitation learning for text based games with combinatorial action spaces. arXiv preprint arXiv:1905.09700, 2019.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp.  5026–5033. IEEE, 2012.
  • Van de Wiele et al. (2020) Van de Wiele, T., Warde-Farley, D., Mnih, A., and Mnih, V. Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116, 2020.
  • Van Hasselt & Wiering (2009) Van Hasselt, H. and Wiering, M. A. Using continuous action spaces to solve discrete problems. In 2009 International Joint Conference on Neural Networks, pp.  1149–1156. IEEE, 2009.
  • Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
  • Wang et al. (2020) Wang, G., Shi, D., Xue, C., Jiang, H., and Wang, Y. Bic-ddpg: Bidirectionally-coordinated nets for deep multi-agent reinforcement learning. In International Conference on Collaborative Computing: Networking, Applications and Worksharing, pp.  337–354. Springer, 2020.
  • Wang et al. (2021) Wang, H., Lin, S., and Zhang, J. Adaptive ensemble q-learning: Minimizing estimation bias via error feedback. Advances in Neural Information Processing Systems, 34:24778–24790, 2021.
  • Wang et al. (2022) Wang, X., Wang, S., Liang, X., Zhao, D., Huang, J., Xu, X., Dai, B., and Miao, Q. Deep reinforcement learning: a survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8:279–292, 1992.
  • Zahavy et al. (2018) Zahavy, T., Haroush, M., Merlis, N., Mankowitz, D. J., and Mannor, S. Learn what not to learn: Action elimination with deep reinforcement learning. Advances in neural information processing systems, 31, 2018.
  • Zhang et al. (2020) Zhang, T., Guo, S., Tan, T., Hu, X., and Chen, F. Generating adjacency-constrained subgoals in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 33:21579–21590, 2020.
  • Zhang et al. (2017) Zhang, Z., Pan, Z., and Kochenderfer, M. J. Weighted double q-learning. In IJCAI, pp.  3455–3461, 2017.

Appendix A Stochastic Q-learning Convergence Proofs

In this section, we prove Theorem 6.2, which states the convergence of Stochastic Q-learning. This algorithm uses a stochastic policy for action selection, employing a stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max with or without memory, possibly dependent on the current state 𝐬𝐬\operatorname{\mathbf{s}}bold_s. For value updates, it utilizes a stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max without memory, independent of the following state 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.

A.1 Proof of Theorem 6.2

Proof.

Stochastic Q-learning employs a stochastic policy in a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, which use stochargmaxstochargmax\operatorname*{stoch\,arg\,max}roman_stoch roman_arg roman_max operation, with or without memory \mathcal{M}caligraphic_M, with probability (1ε𝐬)1subscript𝜀𝐬(1-\varepsilon_{\operatorname{\mathbf{s}}})( 1 - italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT ), for ε𝐬>0subscript𝜀𝐬0\varepsilon_{\operatorname{\mathbf{s}}}>0italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT > 0, which can be summarized by the following equation:

πQS(𝐬)={play randomlywith probability ϵ𝐬stochargmax𝐚𝒜Q(𝐬,𝐚)otherwise .subscriptsuperscript𝜋𝑆𝑄𝐬casesplay randomlywith probability subscriptitalic-ϵ𝐬subscriptstochargmax𝐚𝒜𝑄𝐬𝐚otherwise \pi^{S}_{Q}(\operatorname{\mathbf{s}})=\begin{cases}\text{play randomly}&\text% {with probability }\epsilon_{\operatorname{\mathbf{s}}}\\ \operatorname*{stoch\,arg\,max}_{\operatorname{\mathbf{a}}\in\operatorname{% \mathcal{A}}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})&\text{% otherwise }.\end{cases}italic_π start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT ( bold_s ) = { start_ROW start_CELL play randomly end_CELL start_CELL with probability italic_ϵ start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL start_OPERATOR roman_stoch roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) end_CELL start_CELL otherwise . end_CELL end_ROW (15)

This policy, with ε𝐬>0subscript𝜀𝐬0\varepsilon_{\operatorname{\mathbf{s}}}>0italic_ε start_POSTSUBSCRIPT bold_s end_POSTSUBSCRIPT > 0, ensures that π[𝐚t=𝐚𝐬t=𝐬]>0subscript𝜋delimited-[]subscript𝐚𝑡conditional𝐚subscript𝐬𝑡𝐬0\mathbb{P}_{\pi}[\operatorname{\mathbf{a}}_{t}=\operatorname{\mathbf{a}}\mid% \operatorname{\mathbf{s}}_{t}=\operatorname{\mathbf{s}}]>0blackboard_P start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT [ bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_a ∣ bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = bold_s ] > 0 for all (𝐬,𝐚)𝒮×𝒜𝐬𝐚𝒮𝒜(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\in\operatorname{\mathcal% {S}}\times\operatorname{\mathcal{A}}( bold_s , bold_a ) ∈ caligraphic_S × caligraphic_A.

Furthermore, during the training, to update the Q-function, given any initial estimate Q0subscript𝑄0Q_{0}italic_Q start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, we consider a Stochastic Q-learning which uses stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max operation as in the following stochastic update rule:

Qt+1(𝐬t,𝐚t)=(1αt(𝐬t,𝐚t))Qt(𝐬t,𝐚t)+αt(𝐬t,𝐚t)[rt+γstochmaxb𝒜Qt(𝐬t+1,b)].subscript𝑄𝑡1subscript𝐬𝑡subscript𝐚𝑡1subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝑄𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡delimited-[]subscript𝑟𝑡𝛾subscriptstochmax𝑏𝒜subscript𝑄𝑡subscript𝐬𝑡1𝑏\displaystyle Q_{t+1}\left(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf% {a}}_{t}\right)=\left(1-\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},% \operatorname{\mathbf{a}}_{t}\right)\right)Q_{t}\left(\operatorname{\mathbf{s}% }_{t},\operatorname{\mathbf{a}}_{t}\right)+\alpha_{t}\left(\operatorname{% \mathbf{s}}_{t},\operatorname{\mathbf{a}}_{t}\right)\left[r_{t}+\gamma% \operatorname*{stoch\,max}_{b\in\mathcal{A}}Q_{t}\left(\operatorname{\mathbf{s% }}_{t+1},b\right)\right].italic_Q start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) [ italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_b ) ] . (16)

For the function updates, we consider a stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max without memory, which involves a max\maxroman_max over a random subset of action 𝒞𝒞\mathcal{C}caligraphic_C sampled from a set probability distribution \mathbb{P}blackboard_P defined over the combinatorial space of actions, i.e., :2𝒜[0,1]:superscript2𝒜01\mathbb{P}:2^{\mathcal{A}}\rightarrow[0,1]blackboard_P : 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT → [ 0 , 1 ], which can be a uniform distribution over the action sets of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉.

Hence, for a random subset of actions 𝒞𝒞\mathcal{C}caligraphic_C, the update rule of Stochastic Q-learning can be written as:

Qt+1(𝐬t,𝐚t)=(1αt(𝐬t,𝐚t))Qt(𝐬t,𝐚t)+αt(𝐬t,𝐚t)[rt+γmaxb𝒞Qt(𝐬t+1,b)].subscript𝑄𝑡1subscript𝐬𝑡subscript𝐚𝑡1subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝑄𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡delimited-[]subscript𝑟𝑡𝛾subscript𝑏𝒞subscript𝑄𝑡subscript𝐬𝑡1𝑏\displaystyle Q_{t+1}\left(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf% {a}}_{t}\right)=\left(1-\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},% \operatorname{\mathbf{a}}_{t}\right)\right)Q_{t}\left(\operatorname{\mathbf{s}% }_{t},\operatorname{\mathbf{a}}_{t}\right)+\alpha_{t}\left(\operatorname{% \mathbf{s}}_{t},\operatorname{\mathbf{a}}_{t}\right)\left[r_{t}+\gamma\max_{b% \in\mathcal{C}}Q_{t}\left(\operatorname{\mathbf{s}}_{t+1},b\right)\right].italic_Q start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) [ italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_b ) ] . (17)

We define an optimal Q-function, denoted as Qsuperscript𝑄Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, as follows:

Q(𝐬,𝐚)superscript𝑄𝐬𝐚\displaystyle Q^{*}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) =𝔼[r(𝐬,𝐚)+γ stochmaxb𝒜Q(𝐬,b)𝐬,𝐚]absent𝔼delimited-[]𝑟𝐬𝐚conditional𝛾 stochsubscript𝑏𝒜superscript𝑄superscript𝐬𝑏𝐬𝐚\displaystyle=\mathbb{E}\left[r(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})+\gamma\text{ stoch}\max_{b\in\mathcal{A}}Q^{*}(\operatorname{% \mathbf{s}}^{\prime},b)\mid\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right]= blackboard_E [ italic_r ( bold_s , bold_a ) + italic_γ stoch roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ bold_s , bold_a ] (18)
=𝔼[r(𝐬,𝐚)+γmaxb𝒞Q(𝐬,b)𝐬,𝐚].absent𝔼delimited-[]𝑟𝐬𝐚conditional𝛾subscript𝑏𝒞superscript𝑄superscript𝐬𝑏𝐬𝐚\displaystyle=\mathbb{E}\left[r(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})+\gamma\max_{b\in\mathcal{C}}Q^{*}(\operatorname{\mathbf{s}}^{% \prime},b)\mid\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right].= blackboard_E [ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ bold_s , bold_a ] . (19)

Subtracting from both sides Q(𝐬t,𝐚t)superscript𝑄subscript𝐬𝑡subscript𝐚𝑡Q^{*}\left(\operatorname{\mathbf{s}}_{t},\operatorname{\mathbf{a}}_{t}\right)italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) and letting

Δt(𝐬,𝐚)=Qt(𝐬,𝐚)Q(𝐬,𝐚),subscriptΔ𝑡𝐬𝐚subscript𝑄𝑡𝐬𝐚superscript𝑄𝐬𝐚\displaystyle\Delta_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=Q% _{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})-Q^{*}(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}}),roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) = italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) , (20)

yields

Δt+1(𝐬t,𝐚t)=(1αt(𝐬t,𝐚t))Δt(𝐬t,𝐚t)+αt(𝐬t,𝐚t)Ft(𝐬t,𝐚t),subscriptΔ𝑡1subscript𝐬𝑡subscript𝐚𝑡1subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscriptΔ𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝛼𝑡subscript𝐬𝑡subscript𝐚𝑡subscript𝐹𝑡subscript𝐬𝑡subscript𝐚𝑡\displaystyle\Delta_{t+1}\left(\operatorname{\mathbf{s}}_{t},\operatorname{% \mathbf{a}}_{t}\right)=\left(1-\alpha_{t}\left(\operatorname{\mathbf{s}}_{t},% \operatorname{\mathbf{a}}_{t}\right)\right)\Delta_{t}\left(\operatorname{% \mathbf{s}}_{t},\operatorname{\mathbf{a}}_{t}\right)+\alpha_{t}(\operatorname{% \mathbf{s}}_{t},\operatorname{\mathbf{a}}_{t})F_{t}(\operatorname{\mathbf{s}}_% {t},\operatorname{\mathbf{a}}_{t}),roman_Δ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , (21)

with

Ft(𝐬,𝐚)=r(𝐬,𝐚)+γmaxb𝒞Qt(𝐬,b)Q(𝐬,𝐚).subscript𝐹𝑡𝐬𝐚𝑟𝐬𝐚𝛾subscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏superscript𝑄𝐬𝐚F_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=r(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})+\gamma\max_{b\in\mathcal{C}}Q_{t}\left(% \operatorname{\mathbf{s}}^{\prime},b\right)-Q^{*}\left(\operatorname{\mathbf{s% }},\operatorname{\mathbf{a}}\right).italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) = italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) . (22)

For the transition probability distribution 𝒫:𝒮×𝒜×𝒮[0,1]:𝒫𝒮𝒜𝒮01\mathcal{P}:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\times% \operatorname{\mathcal{S}}\rightarrow[0,1]caligraphic_P : caligraphic_S × caligraphic_A × caligraphic_S → [ 0 , 1 ], the set probability distribution :2𝒜[0,1]:superscript2𝒜01\mathbb{P}:2^{\mathcal{A}}\rightarrow[0,1]blackboard_P : 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT → [ 0 , 1 ], the reward function r:𝒮×𝒜:𝑟𝒮𝒜r:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\rightarrow% \operatorname{\mathbb{R}}italic_r : caligraphic_S × caligraphic_A → blackboard_R, and the discount factor, γ[0,1]𝛾01\gamma\in[0,1]italic_γ ∈ [ 0 , 1 ], we define the following contraction operator ΦΦ\Phiroman_Φ, defined for a function q:𝒮×𝒜:𝑞𝒮𝒜q:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\rightarrow% \operatorname{\mathbb{R}}italic_q : caligraphic_S × caligraphic_A → blackboard_R as

(Φq)(𝐬,𝐚)=𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)[r(𝐬,𝐚)+γmaxb𝒞q(𝐬,b)].Φ𝑞𝐬𝐚subscript𝒞superscript2𝒜𝒞subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚delimited-[]𝑟𝐬𝐚𝛾subscript𝑏𝒞𝑞superscript𝐬𝑏(\Phi q)(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=\sum_{\mathcal{C% }\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum_{\operatorname{\mathbf{s}}^{% \prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(\operatorname{\mathbf{s}}^{% \prime}\mid\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\left[r(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+\gamma\max_{b\in\mathcal{% C}}q(\operatorname{\mathbf{s}}^{\prime},b)\right].( roman_Φ italic_q ) ( bold_s , bold_a ) = ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) [ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ] . (23)

Therefore, with tsubscript𝑡\mathcal{F}_{t}caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT representing the past at time step t𝑡titalic_t,

𝔼[Ft(𝐬,𝐚)t]𝔼delimited-[]conditionalsubscript𝐹𝑡𝐬𝐚subscript𝑡\displaystyle\mathbb{E}\left[F_{t}(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\mid\mathcal{F}_{t}\right]blackboard_E [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] =𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)[r(𝐬,𝐚)+γmaxb𝒞Qt(𝐬,b)Q(𝐬,𝐚)]absentsubscript𝒞superscript2𝒜𝒞subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚delimited-[]𝑟𝐬𝐚𝛾subscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏superscript𝑄𝐬𝐚\displaystyle=\sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum% _{\operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(% \operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\left[r(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+% \gamma\max_{b\in\mathcal{C}}Q_{t}\left(\operatorname{\mathbf{s}}^{\prime},b% \right)-Q^{*}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right)\right]= ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) [ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) ]
=(Φ(Qt))(𝐬,𝐚)Q(𝐬,𝐚).absentΦsubscript𝑄𝑡𝐬𝐚superscript𝑄𝐬𝐚\displaystyle=\left(\Phi(Q_{t})\right)(\operatorname{\mathbf{s}},\operatorname% {\mathbf{a}})-Q^{*}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}).= ( roman_Φ ( italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ( bold_s , bold_a ) - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) .

Using the fact that Q=ΦQsuperscript𝑄Φsuperscript𝑄Q^{*}=\Phi Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_Φ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT,

𝔼[Ft(𝐬,𝐚)t]=(ΦQt)(𝐬,𝐚)(ΦQ)(𝐬,𝐚).𝔼delimited-[]conditionalsubscript𝐹𝑡𝐬𝐚subscript𝑡Φsubscript𝑄𝑡𝐬𝐚Φsuperscript𝑄𝐬𝐚\mathbb{E}\left[F_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\mid% \mathcal{F}_{t}\right]=\left(\Phi Q_{t}\right)(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})-\left(\Phi Q^{*}\right)(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}}).blackboard_E [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] = ( roman_Φ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ( bold_s , bold_a ) - ( roman_Φ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ( bold_s , bold_a ) .

It is now immediate from Lemma 6.4, which we prove in Appendix A.2, that

𝔼[Ft(𝐬,𝐚)t]γQtQ=γΔt.\left\|\mathbb{E}\left[F_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a% }})\mid\mathcal{F}_{t}\right]\right\|_{\infty}\leq\gamma\left\|Q_{t}-Q^{*}% \right\|_{\infty}=\gamma\left\|\Delta_{t}\right\|_{\infty}.∥ blackboard_E [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ≤ italic_γ ∥ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT = italic_γ ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT . (24)

Moreover,

var[Ft(𝐬,𝐚)t]varconditionalsubscript𝐹𝑡𝐬𝐚subscript𝑡\displaystyle\operatorname{var}\left[F_{t}(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})\mid\mathcal{F}_{t}\right]roman_var [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] =𝔼[(r(𝐬,𝐚)+γmaxb𝒞Qt(𝐬,b)Q(𝐬,𝐚)(ΦQt)(𝐬,𝐚)+Q(𝐬,𝐚))2t]absent𝔼delimited-[]conditionalsuperscript𝑟𝐬𝐚𝛾subscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏superscript𝑄𝐬𝐚Φsubscript𝑄𝑡𝐬𝐚superscript𝑄𝐬𝐚2subscript𝑡\displaystyle=\mathbb{E}\left[\left(r(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})+\gamma\max_{b\in\mathcal{C}}Q_{t}(\operatorname{\mathbf{s}}^{% \prime},b)-Q^{*}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})-\left(% \Phi Q_{t}\right)(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+Q^{*}(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\right)^{2}\mid\mathcal{F}% _{t}\right]= blackboard_E [ ( italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) - italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) - ( roman_Φ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ( bold_s , bold_a ) + italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_s , bold_a ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ]
=𝔼[(r(𝐬,𝐚)+γmaxb𝒞Qt(𝐬,b)(ΦQt)(𝐬,𝐚))2t]absent𝔼delimited-[]conditionalsuperscript𝑟𝐬𝐚𝛾subscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏Φsubscript𝑄𝑡𝐬𝐚2subscript𝑡\displaystyle=\mathbb{E}\left[\left(r(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})+\gamma\max_{b\in\mathcal{C}}Q_{t}(\operatorname{\mathbf{s}}^{% \prime},b)-\left(\Phi Q_{t}\right)(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\right)^{2}\mid\mathcal{F}_{t}\right]= blackboard_E [ ( italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) - ( roman_Φ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ( bold_s , bold_a ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ]
=var[r(𝐬,𝐚)+γmaxb𝒞Qt(𝐬,b)t]absentvar𝑟𝐬𝐚conditional𝛾subscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏subscript𝑡\displaystyle=\operatorname{var}\left[r(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})+\gamma\max_{b\in\mathcal{C}}Q_{t}(\operatorname{% \mathbf{s}}^{\prime},b)\mid\mathcal{F}_{t}\right]= roman_var [ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ]
=var[r(𝐬,𝐚)t]+γ2var[maxb𝒞Qt(𝐬,b)t]+2γcov(r(𝐬,𝐚),maxb𝒞Qt(𝐬,b)t)absentvarconditional𝑟𝐬𝐚subscript𝑡superscript𝛾2varconditionalsubscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏subscript𝑡2𝛾cov𝑟𝐬𝐚conditionalsubscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏subscript𝑡\displaystyle=\operatorname{var}\left[r(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})\mid\mathcal{F}_{t}\right]+\gamma^{2}\operatorname{% var}\left[\max_{b\in\mathcal{C}}Q_{t}(\operatorname{\mathbf{s}}^{\prime},b)% \mid\mathcal{F}_{t}\right]+2\gamma\operatorname{cov}(r(\operatorname{\mathbf{s% }},\operatorname{\mathbf{a}}),\max_{b\in\mathcal{C}}Q_{t}(\operatorname{% \mathbf{s}}^{\prime},b)\mid\mathcal{F}_{t})= roman_var [ italic_r ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_var [ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] + 2 italic_γ roman_cov ( italic_r ( bold_s , bold_a ) , roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
=var[r(𝐬,𝐚)t]+γ2var[maxb𝒞Qt(𝐬,b)t].absentvarconditional𝑟𝐬𝐚subscript𝑡superscript𝛾2varconditionalsubscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏subscript𝑡\displaystyle=\operatorname{var}\left[r(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})\mid\mathcal{F}_{t}\right]+\gamma^{2}\operatorname{% var}\left[\max_{b\in\mathcal{C}}Q_{t}(\operatorname{\mathbf{s}}^{\prime},b)% \mid\mathcal{F}_{t}\right].= roman_var [ italic_r ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_var [ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] .

The last line follows from the fact that the randomness of maxb𝒞Qt(𝐬,b)tconditionalsubscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏subscript𝑡\max_{b\in\mathcal{C}}Q_{t}(\operatorname{\mathbf{s}}^{\prime},b)\mid\mathcal{% F}_{t}roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT only depends on the random set 𝒞𝒞\mathcal{C}caligraphic_C and the next state 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. Moreover, we consider the reward r(𝐬,𝐚)𝑟𝐬𝐚r(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_r ( bold_s , bold_a ) independent of the set 𝒞𝒞\mathcal{C}caligraphic_C and the next state 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, by not using the same set 𝒞𝒞\mathcal{C}caligraphic_C for both the action selection and the value update.

Given that r𝑟ritalic_r is bounded, its variance is bounded by some constant B𝐵Bitalic_B. Therefore,

var[Ft(𝐬,𝐚)t]varconditionalsubscript𝐹𝑡𝐬𝐚subscript𝑡\displaystyle\operatorname{var}\left[F_{t}(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})\mid\mathcal{F}_{t}\right]roman_var [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] B+γ2var[maxb𝒞Qt(𝐬,b)t]absent𝐵superscript𝛾2varconditionalsubscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏subscript𝑡\displaystyle\leq B+\gamma^{2}\operatorname{var}\left[\max_{b\in\mathcal{C}}Q_% {t}(\operatorname{\mathbf{s}}^{\prime},b)\mid\mathcal{F}_{t}\right]≤ italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_var [ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ]
=B+γ2𝔼[(maxb𝒞Qt(𝐬,b))2t]γ2𝔼[maxb𝒞Qt(𝐬,b)t]2absent𝐵superscript𝛾2𝔼delimited-[]conditionalsuperscriptsubscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏2subscript𝑡superscript𝛾2𝔼superscriptdelimited-[]conditionalsubscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏subscript𝑡2\displaystyle=B+\gamma^{2}\mathbb{E}\left[(\max_{b\in\mathcal{C}}Q_{t}(% \operatorname{\mathbf{s}}^{\prime},b))^{2}\mid\mathcal{F}_{t}\right]-\gamma^{2% }\mathbb{E}\left[\max_{b\in\mathcal{C}}Q_{t}(\operatorname{\mathbf{s}}^{\prime% },b)\mid\mathcal{F}_{t}\right]^{2}= italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ ( roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] - italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
B+γ2𝔼[(maxb𝒞Qt(𝐬,b))2t]absent𝐵superscript𝛾2𝔼delimited-[]conditionalsuperscriptsubscript𝑏𝒞subscript𝑄𝑡superscript𝐬𝑏2subscript𝑡\displaystyle\leq B+\gamma^{2}\mathbb{E}\left[(\max_{b\in\mathcal{C}}Q_{t}(% \operatorname{\mathbf{s}}^{\prime},b))^{2}\mid\mathcal{F}_{t}\right]≤ italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ ( roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ]
B+γ2𝔼[(max𝐬𝒮maxb𝒜Qt(𝐬,b))2t]absent𝐵superscript𝛾2𝔼delimited-[]conditionalsuperscriptsubscriptsuperscript𝐬𝒮subscript𝑏𝒜subscript𝑄𝑡superscript𝐬𝑏2subscript𝑡\displaystyle\leq B+\gamma^{2}\mathbb{E}\left[(\max_{\operatorname{\mathbf{s}}% ^{\prime}\in\operatorname{\mathcal{S}}}\max_{b\in\mathcal{A}}Q_{t}(% \operatorname{\mathbf{s}}^{\prime},b))^{2}\mid\mathcal{F}_{t}\right]≤ italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ ( roman_max start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ]
B+γ2(max𝐬𝒮maxb𝒜Qt(𝐬,b))2absent𝐵superscript𝛾2superscriptsubscriptsuperscript𝐬𝒮subscript𝑏𝒜subscript𝑄𝑡superscript𝐬𝑏2\displaystyle\leq B+\gamma^{2}(\max_{\operatorname{\mathbf{s}}^{\prime}\in% \operatorname{\mathcal{S}}}\max_{b\in\mathcal{A}}Q_{t}(\operatorname{\mathbf{s% }}^{\prime},b))^{2}≤ italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( roman_max start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=B+γ2Qt2absent𝐵superscript𝛾2superscriptsubscriptnormsubscript𝑄𝑡2\displaystyle=B+\gamma^{2}\|Q_{t}\|_{\infty}^{2}= italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=B+γ2Δt+Q2absent𝐵superscript𝛾2superscriptsubscriptnormsubscriptΔ𝑡superscript𝑄2\displaystyle=B+\gamma^{2}\|\Delta_{t}+Q^{*}\|_{\infty}^{2}= italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
B+γ2Q2+γ2Δt2absent𝐵superscript𝛾2superscriptsubscriptnormsuperscript𝑄2superscript𝛾2superscriptsubscriptnormsubscriptΔ𝑡2\displaystyle\leq B+\gamma^{2}\|Q^{*}\|_{\infty}^{2}+\gamma^{2}\|\Delta_{t}\|_% {\infty}^{2}≤ italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
(B+γ2Q2)(1+Δt2)+γ2(1+Δt2)absent𝐵superscript𝛾2superscriptsubscriptnormsuperscript𝑄21superscriptsubscriptnormsubscriptΔ𝑡2superscript𝛾21superscriptsubscriptnormsubscriptΔ𝑡2\displaystyle\leq(B+\gamma^{2}\|Q^{*}\|_{\infty}^{2})(1+\|\Delta_{t}\|_{\infty% }^{2})+\gamma^{2}(1+\|\Delta_{t}\|_{\infty}^{2})≤ ( italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ( 1 + ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 1 + ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT )
max{B+γ2Q2,γ2}(1+Δt2)absent𝐵superscript𝛾2superscriptsubscriptnormsuperscript𝑄2superscript𝛾21superscriptsubscriptnormsubscriptΔ𝑡2\displaystyle\leq\max\{B+\gamma^{2}\|Q^{*}\|_{\infty}^{2},\gamma^{2}\}(1+\|% \Delta_{t}\|_{\infty}^{2})≤ roman_max { italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT } ( 1 + ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT )
max{B+γ2Q2,γ2}(1+Δt)2.absent𝐵superscript𝛾2superscriptsubscriptnormsuperscript𝑄2superscript𝛾2superscript1subscriptnormsubscriptΔ𝑡2\displaystyle\leq\max\{B+\gamma^{2}\|Q^{*}\|_{\infty}^{2},\gamma^{2}\}(1+\|% \Delta_{t}\|_{\infty})^{2}.≤ roman_max { italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT } ( 1 + ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .

Therefore, for constant C=max{B+γ2Q2,γ2}𝐶𝐵superscript𝛾2superscriptsubscriptnormsuperscript𝑄2superscript𝛾2C=\max\{B+\gamma^{2}\|Q^{*}\|_{\infty}^{2},\gamma^{2}\}italic_C = roman_max { italic_B + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT },

var[Ft(𝐬,𝐚)t]C(1+Δt)2.varconditionalsubscript𝐹𝑡𝐬𝐚subscript𝑡𝐶superscript1subscriptnormsubscriptΔ𝑡2\operatorname{var}\left[F_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{% a}})\mid\mathcal{F}_{t}\right]\leq C(1+\|\Delta_{t}\|_{\infty})^{2}.roman_var [ italic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ caligraphic_F start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ≤ italic_C ( 1 + ∥ roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (25)

Then, by Eq. (24), Eq. (25), and Theorem 1 in (Jaakkola et al., 1993), ΔtsubscriptΔ𝑡\Delta_{t}roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT converges to zero with probability 1, i.e., Qtsubscript𝑄𝑡Q_{t}italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT converges to Qsuperscript𝑄Q^{*}italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT with probability 1. ∎

A.2 Proof of Lemma 6.4

Proof.

For the transition probability distribution 𝒫:𝒮×𝒜×𝒮[0,1]:𝒫𝒮𝒜𝒮01\mathcal{P}:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\times% \operatorname{\mathcal{S}}\rightarrow[0,1]caligraphic_P : caligraphic_S × caligraphic_A × caligraphic_S → [ 0 , 1 ], the set probability distribution \mathbb{P}blackboard_P defined over the combinatorial space of actions, i.e., :2𝒜[0,1]:superscript2𝒜01\mathbb{P}:2^{\mathcal{A}}\rightarrow[0,1]blackboard_P : 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT → [ 0 , 1 ], the reward function r:𝒮×𝒜:𝑟𝒮𝒜r:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\rightarrow% \operatorname{\mathbb{R}}italic_r : caligraphic_S × caligraphic_A → blackboard_R, and the discount factor γ[0,1]𝛾01\gamma\in[0,1]italic_γ ∈ [ 0 , 1 ], for a function q:𝒮×𝒜:𝑞𝒮𝒜q:\operatorname{\mathcal{S}}\times\operatorname{\mathcal{A}}\rightarrow% \operatorname{\mathbb{R}}italic_q : caligraphic_S × caligraphic_A → blackboard_R, the operator ΦΦ\Phiroman_Φ is defined as follows:

(Φq)(𝐬,𝐚)=𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)[r(𝐬,𝐚)+γmaxb𝒞q(𝐬,b)].Φ𝑞𝐬𝐚subscript𝒞superscript2𝒜𝒞subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚delimited-[]𝑟𝐬𝐚𝛾subscript𝑏𝒞𝑞superscript𝐬𝑏\displaystyle(\Phi q)(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=% \sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum_{% \operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(% \operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\left[r(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+% \gamma\max_{b\in\mathcal{C}}q(\operatorname{\mathbf{s}}^{\prime},b)\right].( roman_Φ italic_q ) ( bold_s , bold_a ) = ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) [ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ] . (26)

Therefore,

Φq1Φq2subscriptnormΦsubscript𝑞1Φsubscript𝑞2\displaystyle\left\|\Phi q_{1}-\Phi q_{2}\right\|_{\infty}∥ roman_Φ italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - roman_Φ italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT =max𝐬,𝐚|𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)[r(𝐬,𝐚)+γmaxb𝒞q1(𝐬,b)r(𝐬,𝐚)+γmaxb𝒞q2(𝐬,b)]|\displaystyle=\max_{\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}}\left|% \sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum_{% \operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(% \operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\left[r(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+% \gamma\max_{b\in\mathcal{C}}q_{1}(\operatorname{\mathbf{s}}^{\prime},b)-r(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+\gamma\max_{b\in\mathcal{% C}}q_{2}(\operatorname{\mathbf{s}}^{\prime},b)\right]\right|= roman_max start_POSTSUBSCRIPT bold_s , bold_a end_POSTSUBSCRIPT | ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) [ italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) - italic_r ( bold_s , bold_a ) + italic_γ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ] |
=max𝐬,𝐚γ|𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)[maxb𝒞q1(𝐬,b)maxb𝒞q2(𝐬,b)]|\displaystyle=\max_{\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}}\gamma% \left|\sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum_{% \operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(% \operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\left[\max_{b\in\mathcal{C}}q_{1}(\operatorname{\mathbf{s}}^{% \prime},b)-\max_{b\in\mathcal{C}}q_{2}(\operatorname{\mathbf{s}}^{\prime},b)% \right]\right|= roman_max start_POSTSUBSCRIPT bold_s , bold_a end_POSTSUBSCRIPT italic_γ | ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) [ roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) - roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ] |
max𝐬,𝐚γ𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)|maxb𝒞q1(𝐬,b)maxb𝒞q2(𝐬,b)|absentsubscript𝐬𝐚𝛾subscript𝒞superscript2𝒜𝒞subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚subscript𝑏𝒞subscript𝑞1superscript𝐬𝑏subscript𝑏𝒞subscript𝑞2superscript𝐬𝑏\displaystyle\leq\max_{\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}}% \gamma\sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum_{% \operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(% \operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\left|\max_{b\in\mathcal{C}}q_{1}(\operatorname{\mathbf{s}}^{% \prime},b)-\max_{b\in\mathcal{C}}q_{2}(\operatorname{\mathbf{s}}^{\prime},b)\right|≤ roman_max start_POSTSUBSCRIPT bold_s , bold_a end_POSTSUBSCRIPT italic_γ ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) | roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) - roman_max start_POSTSUBSCRIPT italic_b ∈ caligraphic_C end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) |
max𝐬,𝐚γ𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)maxz,b|q1(z,b)q2(z,b)|absentsubscript𝐬𝐚𝛾subscript𝒞superscript2𝒜𝒞subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚subscript𝑧𝑏subscript𝑞1𝑧𝑏subscript𝑞2𝑧𝑏\displaystyle\leq\max_{\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}}% \gamma\sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum_{% \operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(% \operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\max_{z,b}\left|q_{1}(z,b)-q_{2}(z,b)\right|≤ roman_max start_POSTSUBSCRIPT bold_s , bold_a end_POSTSUBSCRIPT italic_γ ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) roman_max start_POSTSUBSCRIPT italic_z , italic_b end_POSTSUBSCRIPT | italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_z , italic_b ) - italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_z , italic_b ) |
max𝐬,𝐚γ𝒞2𝒜(𝒞)𝐬𝒮𝒫(𝐬𝐬,𝐚)q1q2absentsubscript𝐬𝐚𝛾subscript𝒞superscript2𝒜𝒞subscriptsuperscript𝐬𝒮𝒫conditionalsuperscript𝐬𝐬𝐚subscriptnormsubscript𝑞1subscript𝑞2\displaystyle\leq\max_{\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}}% \gamma\sum_{\mathcal{C}\in 2^{\mathcal{A}}}\mathbb{P}(\mathcal{C})\sum_{% \operatorname{\mathbf{s}}^{\prime}\in\operatorname{\mathcal{S}}}\mathcal{P}(% \operatorname{\mathbf{s}}^{\prime}\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\left\|q_{1}-q_{2}\right\|_{\infty}≤ roman_max start_POSTSUBSCRIPT bold_s , bold_a end_POSTSUBSCRIPT italic_γ ∑ start_POSTSUBSCRIPT caligraphic_C ∈ 2 start_POSTSUPERSCRIPT caligraphic_A end_POSTSUPERSCRIPT end_POSTSUBSCRIPT blackboard_P ( caligraphic_C ) ∑ start_POSTSUBSCRIPT bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT caligraphic_P ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ bold_s , bold_a ) ∥ italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT
=γq1q2.absent𝛾subscriptnormsubscript𝑞1subscript𝑞2\displaystyle=\gamma\left\|q_{1}-q_{2}\right\|_{\infty}.= italic_γ ∥ italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT .

Appendix B Stochastic Maximization

We analyze the proposed stochastic maximization method by comparing its error to that of exact maximization. First, we consider the case without memory, where 𝒞=𝒞\mathcal{C}=\mathcal{R}caligraphic_C = caligraphic_R, and then the case with memory, where \mathcal{M}\neq\emptysetcaligraphic_M ≠ ∅. Finally, we provide a specialized bound for the case where the action values follow a uniform distribution.

B.1 Memoryless Stochastic Maximization

In the following lemma, we give a lower bound on the probability of finding an optimal action within a uniformly sampled subset \mathcal{R}caligraphic_R of log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions. We prove that for a given state s𝑠sitalic_s, the probability p𝑝pitalic_p of sampling an optimal action within the uniformly randomly sampled subset \mathcal{R}caligraphic_R of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions is lower bounded with plog(n)n𝑝𝑛𝑛p\geq\frac{\lceil\log(n)\rceil}{n}italic_p ≥ divide start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG start_ARG italic_n end_ARG.

B.1.1 Proof of Lemma 5.1

Proof.

In the presence of multiple maximizers, we focus on one of them, denoted as 𝐚0subscriptsuperscript𝐚0\operatorname{\mathbf{a}}^{*}_{0}bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, and then the probability p𝑝pitalic_p of sampling at least one maximizer is lower-bounded by the probability p𝐚0subscript𝑝subscriptsuperscript𝐚0p_{\operatorname{\mathbf{a}}^{*}_{0}}italic_p start_POSTSUBSCRIPT bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT of finding 𝐚0subscriptsuperscript𝐚0\operatorname{\mathbf{a}}^{*}_{0}bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, i.e.,

pp𝐚0.𝑝subscript𝑝subscriptsuperscript𝐚0p\geq p_{\operatorname{\mathbf{a}}^{*}_{0}}.italic_p ≥ italic_p start_POSTSUBSCRIPT bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT .

The probability p𝐚0subscript𝑝subscriptsuperscript𝐚0p_{\operatorname{\mathbf{a}}^{*}_{0}}italic_p start_POSTSUBSCRIPT bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT of finding 𝐚0subscriptsuperscript𝐚0\operatorname{\mathbf{a}}^{*}_{0}bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is the probability of sampling 𝐚0subscriptsuperscript𝐚0\operatorname{\mathbf{a}}^{*}_{0}bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT within the random set \mathcal{R}caligraphic_R of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉, which is the fraction of all possible combinations of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ that include 𝐚0subscriptsuperscript𝐚0\operatorname{\mathbf{a}}^{*}_{0}bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT.

This fraction can be calculated as (n1log(n)1)binomial𝑛1𝑛1{n-1\choose\lceil\log(n)\rceil-1}( binomial start_ARG italic_n - 1 end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ - 1 end_ARG ) divided by all possible combinations of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉, which is (nlog(n))binomial𝑛𝑛{n\choose\lceil\log(n)\rceil}( binomial start_ARG italic_n end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG ).

Therefore, p𝐚0=(n1log(n)1)(nlog(n))subscript𝑝subscriptsuperscript𝐚0binomial𝑛1𝑛1binomial𝑛𝑛p_{\operatorname{\mathbf{a}}^{*}_{0}}=\frac{{n-1\choose\lceil\log(n)\rceil-1}}% {{n\choose\lceil\log(n)\rceil}}italic_p start_POSTSUBSCRIPT bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = divide start_ARG ( binomial start_ARG italic_n - 1 end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ - 1 end_ARG ) end_ARG start_ARG ( binomial start_ARG italic_n end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG ) end_ARG.

Consequently,

plog(n)n.𝑝𝑛𝑛p\geq\frac{\lceil\log(n)\rceil}{n}.italic_p ≥ divide start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG start_ARG italic_n end_ARG . (27)

B.2 Stochastic Maximization with Memory

While stochastic maximization without memory could approach the maximum value or find it with the probability p𝑝pitalic_p, lower-bounded in Lemma 5.1, it never converges to an exact maximization, as it keeps sampling purely at random, as can be seen in Fig. 6. However, stochastic maximization with memory can become an exact maximization when the Q-function becomes stable, which we prove in the following Corollary. Although the stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max has sub-linear complexity compared to the max, the following Corollary shows that, on average, for a stable Q-function, after a certain number of iterations, the output of the stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max matches the output of max.

Definition B.1.

A Q-function is considered stable for a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s if its best action in that state remains unchanged for all subsequent steps, even if the Q-function’s values themselves change.

A straightforward example of a stable Q-function occurs during validation periods when no function updates are performed. However, in general, a stable Q-function does not have to be static and might still vary over the rounds; the key characteristic is that its maximizing action remains the same even when its values are updated. Although the stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max has sub-linear complexity compared to the max\maxroman_max, without any assumption of the value distributions, the following Corollary shows that, on average, for a stable Q-function, after a certain number of iterations, the output of the stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max matches exactly the output of max\maxroman_max.

B.2.1 Proof of Corollary 5.3

Proof.

We formalize the problem as a geometric distribution where the success event is the event of sampling a subset of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ that includes at least one maximizer. The geometric distribution gives the probability that the first time to sample a subset that includes an optimal action requires k𝑘kitalic_k independent calls, each with success probability p𝑝pitalic_p. From Lemma 5.1, we have plog(n)n𝑝𝑛𝑛p\geq\frac{\lceil\log(n)\rceil}{n}italic_p ≥ divide start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG start_ARG italic_n end_ARG. Therefore, on an average, success requires: 1pnlog(n)1𝑝𝑛𝑛\frac{1}{p}\leq\frac{n}{\lceil\log(n)\rceil}divide start_ARG 1 end_ARG start_ARG italic_p end_ARG ≤ divide start_ARG italic_n end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG calls.

For a given discrete state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, \mathcal{M}caligraphic_M keeps track of the most recent best action found. For 𝒞=𝒞\mathcal{C}=\mathcal{R}\cup\mathcal{M}caligraphic_C = caligraphic_R ∪ caligraphic_M,

stochmax𝐚𝒜Q(𝐬,𝐚)subscriptstochmax𝐚𝒜𝑄𝐬𝐚\displaystyle\operatorname*{stoch\,max}_{\operatorname{\mathbf{a}}\in% \operatorname{\mathcal{A}}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a% }})start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) =max𝐚𝒞Q(𝐬,𝐚)max𝐚Q(𝐬,𝐚).absentsubscript𝐚𝒞𝑄𝐬𝐚subscript𝐚𝑄𝐬𝐚\displaystyle=\max_{\operatorname{\mathbf{a}}\in\mathcal{C}}Q(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})\geq\max_{\operatorname{\mathbf{a}}\in% \mathcal{M}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}).= roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_C end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) ≥ roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_M end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) . (28)

Therefore, for a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, on average, if the Q-function is stable, then within nlog(n)𝑛𝑛\frac{n}{\lceil\log(n)\rceil}divide start_ARG italic_n end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG, \mathcal{M}caligraphic_M will contain the optimal action 𝐚superscript𝐚\operatorname{\mathbf{a}}^{*}bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. Therefore, on an average, after nlog(n)𝑛𝑛\frac{n}{\lceil\log(n)\rceil}divide start_ARG italic_n end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG time steps,

stochmax𝐚𝒜Q(𝐬,𝐚)subscriptstochmax𝐚𝒜𝑄𝐬𝐚\displaystyle\operatorname*{stoch\,max}_{\operatorname{\mathbf{a}}\in% \operatorname{\mathcal{A}}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a% }})start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) max𝐚Q(𝐬,𝐚)=max𝐚𝒜Q(𝐬,𝐚).absentsubscript𝐚𝑄𝐬𝐚subscript𝐚𝒜𝑄𝐬𝐚\displaystyle\geq\max_{\operatorname{\mathbf{a}}\in\mathcal{M}}Q(\operatorname% {\mathbf{s}},\operatorname{\mathbf{a}})=\max_{\operatorname{\mathbf{a}}\in% \operatorname{\mathcal{A}}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a% }}).≥ roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_M end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) = roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) .

We know that, stochmax𝐚𝒜Q(𝐬,𝐚)max𝐚𝒜Q(𝐬,𝐚).subscriptstochmax𝐚𝒜𝑄𝐬𝐚subscript𝐚𝒜𝑄𝐬𝐚\operatorname*{stoch\,max}_{\operatorname{\mathbf{a}}\in\operatorname{\mathcal% {A}}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\leq\max_{% \operatorname{\mathbf{a}}\in\operatorname{\mathcal{A}}}Q(\operatorname{\mathbf% {s}},\operatorname{\mathbf{a}}).start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) ≤ roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) . Therefore, for a stable Q-function, on an average, after nlog(n)𝑛𝑛\frac{n}{\lceil\log(n)\rceil}divide start_ARG italic_n end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG time steps, stochmax𝐚𝒜Q(𝐬,𝐚)subscriptstochmax𝐚𝒜𝑄𝐬𝐚\operatorname*{stoch\,max}_{\operatorname{\mathbf{a}}\in\operatorname{\mathcal% {A}}}Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ) becomes max𝐚𝒜Q(𝐬,𝐚)subscript𝐚𝒜𝑄𝐬𝐚\max_{\operatorname{\mathbf{a}}\in\operatorname{\mathcal{A}}}Q(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( bold_s , bold_a ). ∎

B.3 Stochastic Maximization with Uniformly Distributed Rewards

While the above corollary outlines an upper-bound on the average number of calls needed to determine the exact optimal action eventually, the following lemma offers insights into the expected maximum value of a randomly sampled subset of actions, comprising log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ elements when their values are uniformly distributed.

Lemma B.2.

For a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s and a uniformly randomly sampled subset \mathcal{R}caligraphic_R of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions, if the values of the sampled actions follow independently a uniform distribution in the interval [Qt(𝐬,𝐚t)bt(𝐬),Qt(𝐬,𝐚t)]subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡[Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{\star})-b_{t}(% \operatorname{\mathbf{s}}),Q_{t}(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}}_{t}^{\star})][ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) , italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) ], then the expected value of the maximum Q-function within this random subset is:

𝔼[maxkQt(𝐬,k)𝐬,𝐚t]=Qt(𝐬,𝐚t)bt(𝐬)log(n)+1.𝔼delimited-[]conditionalsubscript𝑘subscript𝑄𝑡𝐬𝑘𝐬subscriptsuperscript𝐚𝑡subscript𝑄𝑡𝐬subscriptsuperscript𝐚𝑡subscript𝑏𝑡𝐬𝑛1\displaystyle\mathbb{E}\left[\max_{k\in\mathcal{R}}Q_{t}(\operatorname{\mathbf% {s}},k)\mid\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}^{\star}_{t}% \right]=Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}^{\star}_{t})% -\frac{b_{t}(\operatorname{\mathbf{s}})}{\lceil\log(n)\rceil+1}.blackboard_E [ roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_R end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) ∣ bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] = italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - divide start_ARG italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ + 1 end_ARG . (29)
Proof.

For a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s we assume a uniformly randomly sampled subset \mathcal{R}caligraphic_R of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions, and the values of the sampled actions are independent and follow a uniform distribution in the interval [Qt(𝐬,𝐚t)bt(𝐬),Qt(𝐬,𝐚t)]subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡[Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{\star})-b_{t}(% \operatorname{\mathbf{s}}),Q_{t}(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}}_{t}^{\star})][ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) , italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) ]. Therefore, the cumulative distribution function (CDF) for the value of an action 𝐚𝐚\operatorname{\mathbf{a}}bold_a given the state 𝐬𝐬\operatorname{\mathbf{s}}bold_s and the optimal action 𝐚tsuperscriptsubscript𝐚𝑡\operatorname{\mathbf{a}}_{t}^{*}bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is:

G(y;𝐬,𝐚)={0for y<Qt(𝐬,𝐚t)bt(𝐬)yfor y[Qt(𝐬,𝐚t)bt,Qt(𝐬,𝐚)]1for y>Qt(𝐬,𝐚t).𝐺𝑦𝐬𝐚cases0for y<Qt(𝐬,𝐚t)bt(𝐬)𝑦for y[Qt(𝐬,𝐚t)bt,Qt(𝐬,𝐚)]1for y>Qt(𝐬,𝐚t)G(y;\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=\left\{\begin{array}[% ]{ll}0&\text{for $y<Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_% {t}^{*})-b_{t}(\operatorname{\mathbf{s}})$}\\ y&\text{for $y\in[Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t% }^{*})-b_{t},Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}^{*})]$}% \\ 1&\text{for $y>Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{% *})$}\,.\end{array}\right.italic_G ( italic_y ; bold_s , bold_a ) = { start_ARRAY start_ROW start_CELL 0 end_CELL start_CELL for italic_y < italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_CELL end_ROW start_ROW start_CELL italic_y end_CELL start_CELL for italic_y ∈ [ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ] end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL for italic_y > italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) . end_CELL end_ROW end_ARRAY

We define the variable x=(y(Qt(𝐬,𝐚t)bt(𝐬)))/bt(𝐬)𝑥𝑦subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬subscript𝑏𝑡𝐬x=(y-(Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{*})-b_{t}% (\operatorname{\mathbf{s}})))/b_{t}(\operatorname{\mathbf{s}})italic_x = ( italic_y - ( italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ) ) / italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ).

F(x;𝐬,𝐚)={0for x<0xfor x[0,1]1for x>1.𝐹𝑥𝐬𝐚cases0for x<0𝑥for x[0,1]1for x>1F(x;\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=\left\{\begin{array}[% ]{ll}0&\text{for $x<0$}\\ x&\text{for $x\in[0,1]$}\\ 1&\text{for $x>1$}\,.\end{array}\right.italic_F ( italic_x ; bold_s , bold_a ) = { start_ARRAY start_ROW start_CELL 0 end_CELL start_CELL for italic_x < 0 end_CELL end_ROW start_ROW start_CELL italic_x end_CELL start_CELL for italic_x ∈ [ 0 , 1 ] end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL for italic_x > 1 . end_CELL end_ROW end_ARRAY

If we select log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ such actions, the CDF of the maximum of these actions, denoted as Fmaxsubscript𝐹F_{\max}italic_F start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT is the following:

Fmax(x;𝐬,𝐚)subscript𝐹𝑥𝐬𝐚\displaystyle F_{\max}(x;\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_F start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_x ; bold_s , bold_a ) =(maxaQt(𝐬,𝐚)x)absentsubscript𝑎subscript𝑄𝑡𝐬𝐚𝑥\displaystyle=\mathbb{P}\left(\max_{a\in\mathcal{R}}Q_{t}(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})\leq x\right)= blackboard_P ( roman_max start_POSTSUBSCRIPT italic_a ∈ caligraphic_R end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ≤ italic_x )
=a(Qt(𝐬,𝐚)x)absentsubscriptproduct𝑎subscript𝑄𝑡𝐬𝐚𝑥\displaystyle=\prod_{a\in\mathcal{R}}\mathbb{P}\left(Q_{t}(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})\leq x\right)= ∏ start_POSTSUBSCRIPT italic_a ∈ caligraphic_R end_POSTSUBSCRIPT blackboard_P ( italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ≤ italic_x )
=aF(x;𝐬,𝐚)absentsubscriptproduct𝑎𝐹𝑥𝐬𝐚\displaystyle=\prod_{a\in\mathcal{R}}F(x;\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})= ∏ start_POSTSUBSCRIPT italic_a ∈ caligraphic_R end_POSTSUBSCRIPT italic_F ( italic_x ; bold_s , bold_a )
=F(x;𝐬,𝐚)log(n).absent𝐹superscript𝑥𝐬𝐚𝑛\displaystyle=F(x;\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})^{\lceil% \log(n)\rceil}.\,= italic_F ( italic_x ; bold_s , bold_a ) start_POSTSUPERSCRIPT ⌈ roman_log ( italic_n ) ⌉ end_POSTSUPERSCRIPT .

The second line follows from the independence of the values, and the last line follows from the assumption that all actions follow the same uniform distribution.

The CDF of the maximum is therefore given by:

Fmax(x;𝐬,𝐚)={0for x<0xlog(n)for x[0,1]1for x>1.subscript𝐹𝑥𝐬𝐚cases0for x<0superscript𝑥𝑛for x[0,1]1for x>1F_{\max}(x;\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=\left\{\begin{% array}[]{ll}0&\text{for $x<0$}\\ x^{\lceil\log(n)\rceil}&\text{for $x\in[0,1]$}\\ 1&\text{for $x>1$}\,.\end{array}\right.italic_F start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_x ; bold_s , bold_a ) = { start_ARRAY start_ROW start_CELL 0 end_CELL start_CELL for italic_x < 0 end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUPERSCRIPT ⌈ roman_log ( italic_n ) ⌉ end_POSTSUPERSCRIPT end_CELL start_CELL for italic_x ∈ [ 0 , 1 ] end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL for italic_x > 1 . end_CELL end_ROW end_ARRAY

Now, we can determine the desired expected value as

𝔼[max𝐚Qt(𝐬,𝐚)(Qt(𝐬,𝐚t)bt(𝐬))bt(𝐬)]𝔼delimited-[]subscript𝐚subscript𝑄𝑡𝐬𝐚subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬subscript𝑏𝑡𝐬\displaystyle\mathbb{E}\left[\max_{\operatorname{\mathbf{a}}\in\mathcal{R}}% \frac{Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})-(Q_{t}(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{*})-b_{t}(% \operatorname{\mathbf{s}}))}{b_{t}(\operatorname{\mathbf{s}})}\right]blackboard_E [ roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_R end_POSTSUBSCRIPT divide start_ARG italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) - ( italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ) end_ARG start_ARG italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG ] =xdFmax(x;𝐬,𝐚)absentsuperscriptsubscript𝑥dsubscript𝐹𝑥𝐬𝐚\displaystyle=\int_{-\infty}^{\infty}x\,\text{d}F_{\max}(x;\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})= ∫ start_POSTSUBSCRIPT - ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_x d italic_F start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_x ; bold_s , bold_a )
=01xdFmax(x;𝐬,𝐚)absentsuperscriptsubscript01𝑥dsubscript𝐹𝑥𝐬𝐚\displaystyle=\int_{0}^{1}x\,\text{d}F_{\max}(x;\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})= ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_x d italic_F start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_x ; bold_s , bold_a )
=[xFmax(x;𝐬,𝐚)]0101Fmax(x;𝐬,𝐚)dxabsentsuperscriptsubscriptdelimited-[]𝑥subscript𝐹𝑥𝐬𝐚01superscriptsubscript01subscript𝐹𝑥𝐬𝐚d𝑥\displaystyle=\left[xF_{\max}(x;\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}})\right]_{0}^{1}-\int_{0}^{1}F_{\max}(x;\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})\,\text{d}x= [ italic_x italic_F start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_x ; bold_s , bold_a ) ] start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT - ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_F start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_x ; bold_s , bold_a ) d italic_x
=101xlog(n)dxabsent1superscriptsubscript01superscript𝑥𝑛d𝑥\displaystyle=1-\int_{0}^{1}x^{\lceil\log(n)\rceil}\,\text{d}x= 1 - ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_x start_POSTSUPERSCRIPT ⌈ roman_log ( italic_n ) ⌉ end_POSTSUPERSCRIPT d italic_x
=11log(n)+1.absent11𝑛1\displaystyle=1-\frac{1}{\lceil\log(n)\rceil+1}.= 1 - divide start_ARG 1 end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ + 1 end_ARG .

We employed the identity 01xdμ(x)=011μ(x)dxsuperscriptsubscript01𝑥d𝜇𝑥superscriptsubscript011𝜇𝑥d𝑥\int_{0}^{1}x\,\text{d}\mu(x)=\int_{0}^{1}1-\mu(x)\,\text{d}x∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT italic_x d italic_μ ( italic_x ) = ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT 1 - italic_μ ( italic_x ) d italic_x, which can be demonstrated through integration by parts. To return to the original scale, we can first multiply by btsubscript𝑏𝑡b_{t}italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and then add Qt(𝐬,𝐚t)bt(𝐬)subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{*})-b_{t}(% \operatorname{\mathbf{s}})italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ), resulting in:

𝔼[max𝐚Qt(𝐬,𝐚)𝐬,𝐚t]𝔼delimited-[]conditionalsubscript𝐚subscript𝑄𝑡𝐬𝐚𝐬superscriptsubscript𝐚𝑡\displaystyle\mathbb{E}\left[\max_{\operatorname{\mathbf{a}}\in\mathcal{R}}Q_{% t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\mid\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}}_{t}^{*}\right]blackboard_E [ roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_R end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) ∣ bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ] =Qt(𝐬,𝐚t)bt(𝐬)log(n)+1.absentsubscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬𝑛1\displaystyle=Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{*% })-\frac{b_{t}(\operatorname{\mathbf{s}})}{\lceil\log(n)\rceil+1}.= italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - divide start_ARG italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ + 1 end_ARG .

As an example of this setting, for Qt(𝐬,𝐚t)=100subscript𝑄𝑡𝐬subscriptsuperscript𝐚𝑡100Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}^{\star}_{t})=100italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = 100, bt=100subscript𝑏𝑡100b_{t}=100italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 100, for a setting with n=1000𝑛1000n=1000italic_n = 1000 actions, log(n)+1=11𝑛111\lceil\log(n)\rceil+1=11⌈ roman_log ( italic_n ) ⌉ + 1 = 11. Hence the 𝔼[maxkQt(𝐬,k)𝐬,𝐚t]91𝔼delimited-[]conditionalsubscript𝑘subscript𝑄𝑡𝐬𝑘𝐬subscriptsuperscript𝐚𝑡91\mathbb{E}\left[\max_{k\in\mathcal{R}}Q_{t}(\operatorname{\mathbf{s}},k)\mid% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}}^{\star}_{t}\right]\approx 91blackboard_E [ roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_R end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) ∣ bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ≈ 91. This shows that even with a randomly sampled set of actions \mathcal{R}caligraphic_R, the stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max can be close to the max. We simulate this setting in the experiments in Fig. 6.

Our proposed stochastic maximization does not solely rely on the randomly sampled subset of actions \mathcal{R}caligraphic_R but also considers actions from previous experiences through \mathcal{M}caligraphic_M. Therefore, the expected stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max should be higher than the above result, providing an upper bound on the expected βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as described in the following corollary of Lemma B.2.

Corollary B.3.

For a given discrete state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, if the values of the sampled actions follow independently a uniform distribution from the interval [Qt(𝐬,𝐚t)bt(𝐬),Qt(𝐬,𝐚t)]subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡[Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}_{t}^{\star})-b_{t}(% \operatorname{\mathbf{s}}),Q_{t}(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}}_{t}^{\star})][ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) , italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) ], then the expected value of βt(𝐬)subscript𝛽𝑡𝐬\beta_{t}(\operatorname{\mathbf{s}})italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) is:

𝔼[βt(𝐬)𝐬]bt(𝐬)log(n)+1.𝔼delimited-[]conditionalsubscript𝛽𝑡𝐬𝐬subscript𝑏𝑡𝐬𝑛1\displaystyle\mathbb{E}\left[\beta_{t}(\operatorname{\mathbf{s}})\mid% \operatorname{\mathbf{s}}\right]\leq\frac{b_{t}(\operatorname{\mathbf{s}})}{% \lceil\log(n)\rceil+1}.blackboard_E [ italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ∣ bold_s ] ≤ divide start_ARG italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ + 1 end_ARG . (30)
Proof.

At time step t𝑡titalic_t, given a state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, and the current estimated Q-function Qtsubscript𝑄𝑡Q_{t}italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, βt(𝐬)subscript𝛽𝑡𝐬\beta_{t}(\operatorname{\mathbf{s}})italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) is defined as follows:

βt(𝐬)=max𝐚𝒜Qt(𝐬,𝐚)stochmax𝐚𝒜Qt(𝐬,𝐚).subscript𝛽𝑡𝐬subscript𝐚𝒜subscript𝑄𝑡𝐬𝐚subscriptstochmax𝐚𝒜subscript𝑄𝑡𝐬𝐚\beta_{t}(\operatorname{\mathbf{s}})=\max_{\operatorname{\mathbf{a}}\in% \mathcal{A}}Q_{t}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}% \right)-\operatorname*{stoch\,max}_{\operatorname{\mathbf{a}}\in\mathcal{A}}Q_% {t}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right).italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) = roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) - start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) . (31)

For a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s and a uniformly randomly sampled subset \mathcal{R}caligraphic_R of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions and a subset of some previous played actions \mathcal{M}\subset\mathcal{E}caligraphic_M ⊂ caligraphic_E, using the law of total expectation,

𝔼[βt(𝐬)𝐬]𝔼delimited-[]conditionalsubscript𝛽𝑡𝐬𝐬\displaystyle\mathbb{E}\left[\beta_{t}(\operatorname{\mathbf{s}})\mid% \operatorname{\mathbf{s}}\right]blackboard_E [ italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ∣ bold_s ] =𝔼[𝔼[βt(𝐬)𝐬,𝐚t]𝐬]absent𝔼delimited-[]conditional𝔼delimited-[]conditionalsubscript𝛽𝑡𝐬𝐬subscriptsuperscript𝐚𝑡𝐬\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\beta_{t}(\operatorname{\mathbf{% s}})\mid\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}^{\star}_{t}\right]% \mid\operatorname{\mathbf{s}}\right]= blackboard_E [ blackboard_E [ italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ∣ bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ∣ bold_s ]
=𝔼[𝔼[maxk𝒜Qt(𝐬,k)stochmaxk𝒜Qt(𝐬,k)𝐬,𝐚t]𝐬]absent𝔼delimited-[]conditional𝔼delimited-[]subscript𝑘𝒜subscript𝑄𝑡𝐬𝑘conditionalsubscriptstochmax𝑘𝒜subscript𝑄𝑡𝐬𝑘𝐬subscriptsuperscript𝐚𝑡𝐬\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\max_{k\in\operatorname{\mathcal% {A}}}Q_{t}(\operatorname{\mathbf{s}},k)-\operatorname*{stoch\,max}_{k\in% \operatorname{\mathcal{A}}}Q_{t}(\operatorname{\mathbf{s}},k)\mid\operatorname% {\mathbf{s}},\operatorname{\mathbf{a}}^{\star}_{t}\right]\mid\operatorname{% \mathbf{s}}\right]= blackboard_E [ blackboard_E [ roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) - start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT italic_k ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) ∣ bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ∣ bold_s ]
=𝔼[𝔼[maxk𝒜Qt(𝐬,k)maxkQt(𝐬,k)𝐬,𝐚t]𝐬]absent𝔼delimited-[]conditional𝔼delimited-[]subscript𝑘𝒜subscript𝑄𝑡𝐬𝑘conditionalsubscript𝑘subscript𝑄𝑡𝐬𝑘𝐬subscriptsuperscript𝐚𝑡𝐬\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\max_{k\in\operatorname{\mathcal% {A}}}Q_{t}(\operatorname{\mathbf{s}},k)-\max_{k\in\mathcal{R}\cup\mathcal{M}}Q% _{t}(\operatorname{\mathbf{s}},k)\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}}^{\star}_{t}\right]\mid\operatorname{\mathbf{s}}\right]= blackboard_E [ blackboard_E [ roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) - roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_R ∪ caligraphic_M end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) ∣ bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ∣ bold_s ]
𝔼[𝔼[maxk𝒜Qt(𝐬,k)maxkQt(𝐬,k)𝐬,𝐚t]𝐬]absent𝔼delimited-[]conditional𝔼delimited-[]subscript𝑘𝒜subscript𝑄𝑡𝐬𝑘conditionalsubscript𝑘subscript𝑄𝑡𝐬𝑘𝐬subscriptsuperscript𝐚𝑡𝐬\displaystyle\leq\mathbb{E}\left[\mathbb{E}\left[\max_{k\in\operatorname{% \mathcal{A}}}Q_{t}(\operatorname{\mathbf{s}},k)-\max_{k\in\mathcal{R}}Q_{t}(% \operatorname{\mathbf{s}},k)\mid\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}}^{\star}_{t}\right]\mid\operatorname{\mathbf{s}}\right]≤ blackboard_E [ blackboard_E [ roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) - roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_R end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) ∣ bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ∣ bold_s ]
=𝔼[Qt(𝐬,𝐚t)𝔼[maxkQt(𝐬,k)𝐬,𝐚t]𝐬].absent𝔼delimited-[]subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡conditional𝔼delimited-[]conditionalsubscript𝑘subscript𝑄𝑡𝐬𝑘𝐬subscriptsuperscript𝐚𝑡𝐬\displaystyle=\mathbb{E}\left[Q_{t}(\operatorname{\mathbf{s}},\operatorname{% \mathbf{a}}_{t}^{*})-\mathbb{E}\left[\max_{k\in\mathcal{R}}Q_{t}(\operatorname% {\mathbf{s}},k)\mid\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}^{\star}% _{t}\right]\mid\operatorname{\mathbf{s}}\right].= blackboard_E [ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - blackboard_E [ roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_R end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , italic_k ) ∣ bold_s , bold_a start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ∣ bold_s ] .

Therefore by Lemma B.2:

𝔼[βt(𝐬)𝐬]𝔼delimited-[]conditionalsubscript𝛽𝑡𝐬𝐬\displaystyle\mathbb{E}\left[\beta_{t}(\operatorname{\mathbf{s}})\mid% \operatorname{\mathbf{s}}\right]blackboard_E [ italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) ∣ bold_s ] 𝔼[Qt(𝐬,𝐚t)(Qt(𝐬,𝐚t)bt(𝐬)log(n)+1)𝐬]absent𝔼delimited-[]subscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡conditionalsubscript𝑄𝑡𝐬superscriptsubscript𝐚𝑡subscript𝑏𝑡𝐬𝑛1𝐬\displaystyle\leq\mathbb{E}\left[Q_{t}(\operatorname{\mathbf{s}},\operatorname% {\mathbf{a}}_{t}^{*})-(Q_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a% }}_{t}^{*})-\frac{b_{t}(\operatorname{\mathbf{s}})}{\lceil\log(n)\rceil+1})% \mid\operatorname{\mathbf{s}}\right]≤ blackboard_E [ italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - ( italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) - divide start_ARG italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ + 1 end_ARG ) ∣ bold_s ]
=𝔼[bt(𝐬)log(n)+1𝐬]absent𝔼delimited-[]conditionalsubscript𝑏𝑡𝐬𝑛1𝐬\displaystyle=\mathbb{E}\left[\frac{b_{t}(\operatorname{\mathbf{s}})}{\lceil% \log(n)\rceil+1}\mid\operatorname{\mathbf{s}}\right]= blackboard_E [ divide start_ARG italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ + 1 end_ARG ∣ bold_s ]
=bt(𝐬)log(n)+1.absentsubscript𝑏𝑡𝐬𝑛1\displaystyle=\frac{b_{t}(\operatorname{\mathbf{s}})}{\lceil\log(n)\rceil+1}.= divide start_ARG italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ + 1 end_ARG .

Appendix C Pseudocodes

  Initialize QA(𝐬,𝐚)superscript𝑄𝐴𝐬𝐚Q^{A}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ( bold_s , bold_a ) and QB(𝐬,𝐚)superscript𝑄𝐵𝐬𝐚Q^{B}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ( bold_s , bold_a ) for all 𝐬𝒮,𝐚𝒜formulae-sequence𝐬𝒮𝐚𝒜\operatorname{\mathbf{s}}\in\mathcal{S},\operatorname{\mathbf{a}}\in\mathcal{A}bold_s ∈ caligraphic_S , bold_a ∈ caligraphic_A, n=|𝒜|𝑛𝒜n=|\operatorname{\mathcal{A}}|italic_n = | caligraphic_A |
  for each episode do
     Observe state 𝐬𝐬\operatorname{\mathbf{s}}bold_s.
     for each step of episode do
        Choose 𝐚𝐚\operatorname{\mathbf{a}}bold_a from 𝐬𝐬\operatorname{\mathbf{s}}bold_s via QA+QBsuperscript𝑄𝐴superscript𝑄𝐵Q^{A}+Q^{B}italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT + italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT with policy π(QA+QB)S(𝐬)subscriptsuperscript𝜋𝑆superscript𝑄𝐴superscript𝑄𝐵𝐬\pi^{S}_{(Q^{A}+Q^{B})}(\operatorname{\mathbf{s}})italic_π start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT + italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ) end_POSTSUBSCRIPT ( bold_s ) in Eq. (6).
        Take action 𝐚𝐚\operatorname{\mathbf{a}}bold_a, observe r𝑟ritalic_r, 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
        Choose either UPDATE(A) or UPDATE(B), for example randomly.
        if UPDATE(A) then
           ΔAr+γQB(𝐬,stochargmaxb𝒜QA(𝐬,b))QA(𝐬,𝐚)superscriptΔ𝐴𝑟𝛾superscript𝑄𝐵superscript𝐬subscriptstochargmax𝑏𝒜superscript𝑄𝐴superscript𝐬𝑏superscript𝑄𝐴𝐬𝐚\Delta^{A}\leftarrow r+\gamma Q^{B}(\operatorname{\mathbf{s}}^{\prime},% \operatorname*{stoch\,arg\,max}_{b\in\mathcal{A}}Q^{A}(\operatorname{\mathbf{s% }}^{\prime},b))-Q^{A}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})roman_Δ start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ← italic_r + italic_γ italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , start_OPERATOR roman_stoch roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ) - italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ( bold_s , bold_a ).
           QA(𝐬,𝐚)QA(𝐬,𝐚)+α(𝐬,𝐚)ΔAsuperscript𝑄𝐴𝐬𝐚superscript𝑄𝐴𝐬𝐚𝛼𝐬𝐚superscriptΔ𝐴Q^{A}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\leftarrow Q^{A}(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+\alpha(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})\Delta^{A}italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ( bold_s , bold_a ) ← italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ( bold_s , bold_a ) + italic_α ( bold_s , bold_a ) roman_Δ start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT.
        else if UPDATE(B) then
           ΔBr+γQA(𝐬,stochargmaxb𝒜QB(𝐬,b))QB(𝐬,𝐚)superscriptΔ𝐵𝑟𝛾superscript𝑄𝐴superscript𝐬subscriptstochargmax𝑏𝒜superscript𝑄𝐵superscript𝐬𝑏superscript𝑄𝐵𝐬𝐚\Delta^{B}\leftarrow r+\gamma Q^{A}(\operatorname{\mathbf{s}}^{\prime},% \operatorname*{stoch\,arg\,max}_{b\in\mathcal{A}}Q^{B}(\operatorname{\mathbf{s% }}^{\prime},b))-Q^{B}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})roman_Δ start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ← italic_r + italic_γ italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , start_OPERATOR roman_stoch roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_b ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_b ) ) - italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ( bold_s , bold_a ).
           QB(𝐬,𝐚)QB(𝐬,𝐚)+α(𝐬,𝐚)ΔBsuperscript𝑄𝐵𝐬𝐚superscript𝑄𝐵𝐬𝐚𝛼𝐬𝐚superscriptΔ𝐵Q^{B}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\leftarrow Q^{B}(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+\alpha(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})\Delta^{B}italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ( bold_s , bold_a ) ← italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ( bold_s , bold_a ) + italic_α ( bold_s , bold_a ) roman_Δ start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT.
        end if
        𝐬𝐬𝐬superscript𝐬\operatorname{\mathbf{s}}\leftarrow\operatorname{\mathbf{s}}^{\prime}bold_s ← bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
     end for
  end for
Algorithm 2 Stochastic Double Q-learning
  Initialize Q(𝐬,𝐚)𝑄𝐬𝐚Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_Q ( bold_s , bold_a ) for all 𝐬𝒮,𝐚𝒜formulae-sequence𝐬𝒮𝐚𝒜\operatorname{\mathbf{s}}\in\mathcal{S},\operatorname{\mathbf{a}}\in\mathcal{A}bold_s ∈ caligraphic_S , bold_a ∈ caligraphic_A, n=|𝒜|𝑛𝒜n=|\operatorname{\mathcal{A}}|italic_n = | caligraphic_A |
  for each episode do
     Observe state 𝐬𝐬\operatorname{\mathbf{s}}bold_s.
     Choose 𝐚𝐚\operatorname{\mathbf{a}}bold_a from 𝐬𝐬\operatorname{\mathbf{s}}bold_s with policy πQS(𝐬)superscriptsubscript𝜋𝑄𝑆𝐬\pi_{Q}^{S}(\operatorname{\mathbf{s}})italic_π start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT ( bold_s ) in Eq. (6).
     for each step of episode do
        Take action 𝐚𝐚\operatorname{\mathbf{a}}bold_a, observe r𝑟ritalic_r, 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
        Choose 𝐚superscript𝐚\operatorname{\mathbf{a}}^{\prime}bold_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT from 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT with policy πQS(𝐬)superscriptsubscript𝜋𝑄𝑆superscript𝐬\pi_{Q}^{S}(\operatorname{\mathbf{s}}^{\prime})italic_π start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) in Eq. (6).
        Q(𝐬,𝐚)Q(𝐬,𝐚)+α(𝐬,𝐚)[r+γQ(𝐬,𝐚)Q(𝐬,𝐚)]𝑄𝐬𝐚𝑄𝐬𝐚𝛼𝐬𝐚delimited-[]𝑟𝛾𝑄superscript𝐬superscript𝐚𝑄𝐬𝐚Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})\leftarrow Q(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})+\alpha(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}})[r+\gamma Q(\operatorname{\mathbf{s}}^{% \prime},\operatorname{\mathbf{a}}^{\prime})-Q(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}})]italic_Q ( bold_s , bold_a ) ← italic_Q ( bold_s , bold_a ) + italic_α ( bold_s , bold_a ) [ italic_r + italic_γ italic_Q ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , bold_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_Q ( bold_s , bold_a ) ].
        𝐬𝐬𝐬superscript𝐬\operatorname{\mathbf{s}}\leftarrow\operatorname{\mathbf{s}}^{\prime}bold_s ← bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT; 𝐚𝐚𝐚superscript𝐚\operatorname{\mathbf{a}}\leftarrow\operatorname{\mathbf{a}}^{\prime}bold_a ← bold_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
     end for
  end for
Algorithm 3 Stochastic Sarsa
  Algorithm parameters: learning rate α(0,1]𝛼01\alpha\in(0,1]italic_α ∈ ( 0 , 1 ], replay buffer \mathcal{E}caligraphic_E, update rate τ𝜏\tauitalic_τ.
  Initialize: neural network Q(𝐬,𝐚;θ)𝑄𝐬𝐚𝜃Q(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}};\theta)italic_Q ( bold_s , bold_a ; italic_θ ) with random weights θ𝜃\thetaitalic_θ, target network Q^(𝐬,𝐚;θ)^𝑄𝐬𝐚superscript𝜃\hat{Q}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}};\theta^{-})over^ start_ARG italic_Q end_ARG ( bold_s , bold_a ; italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ) with θ=θsuperscript𝜃𝜃\theta^{-}=\thetaitalic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT = italic_θ, set of actions 𝒜𝒜\operatorname{\mathcal{A}}caligraphic_A of size n𝑛nitalic_n.
  for each episode do
     Initialize state 𝐬𝐬\operatorname{\mathbf{s}}bold_s.
     while not terminal state is reached do
        Choose 𝐚𝐚\operatorname{\mathbf{a}}bold_a from 𝐬𝐬\operatorname{\mathbf{s}}bold_s using a stochastic policy as defined in Eq. (15) using Q(𝐬,.;θ)Q(\operatorname{\mathbf{s}},.;\theta)italic_Q ( bold_s , . ; italic_θ ).
        Take action a, observe reward r(𝐬,𝐚)𝑟𝐬𝐚r(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_r ( bold_s , bold_a ) and next state 𝐬superscript𝐬\operatorname{\mathbf{s}}^{\prime}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
        Store (𝐬,𝐚,r(𝐬,𝐚),𝐬)𝐬𝐚𝑟𝐬𝐚superscript𝐬(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}},r(\operatorname{\mathbf{s% }},\operatorname{\mathbf{a}}),\operatorname{\mathbf{s}}^{\prime})( bold_s , bold_a , italic_r ( bold_s , bold_a ) , bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) in replay buffer \mathcal{E}caligraphic_E.
        Compute target values for the mini-batch:
yi={riif 𝐬i is terminalri+γQ^(𝐬i,stochargmax𝐚𝒜Q^(𝐬i,𝐚;θ);θ)otherwise.subscript𝑦𝑖casessubscript𝑟𝑖if subscriptsuperscript𝐬𝑖 is terminalsubscript𝑟𝑖𝛾^𝑄subscriptsuperscript𝐬𝑖subscriptstochargmaxsuperscript𝐚𝒜^𝑄subscriptsuperscript𝐬𝑖superscript𝐚superscript𝜃superscript𝜃otherwisey_{i}=\begin{cases}r_{i}&\text{if }\operatorname{\mathbf{s}}^{\prime}_{i}\text% { is terminal}\\ r_{i}+\gamma\hat{Q}(\operatorname{\mathbf{s}}^{\prime}_{i},\operatorname*{% stoch\,arg\,max}_{\operatorname{\mathbf{a}}^{\prime}\in\mathcal{A}}\hat{Q}(% \operatorname{\mathbf{s}}^{\prime}_{i},\operatorname{\mathbf{a}}^{\prime};% \theta^{-});\theta^{-})&\text{otherwise}.\end{cases}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { start_ROW start_CELL italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_CELL start_CELL if bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is terminal end_CELL end_ROW start_ROW start_CELL italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_γ over^ start_ARG italic_Q end_ARG ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , start_OPERATOR roman_stoch roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_A end_POSTSUBSCRIPT over^ start_ARG italic_Q end_ARG ( bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ; italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ) ; italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ) end_CELL start_CELL otherwise . end_CELL end_ROW
        Perform a gradient descent step on the loss:
(θ)=1log(n)i=1log(n)(yiQ(𝐬i,𝐚i;θ))2.𝜃1𝑛superscriptsubscript𝑖1𝑛superscriptsubscript𝑦𝑖𝑄subscript𝐬𝑖subscript𝐚𝑖𝜃2\mathcal{L}(\theta)=\frac{1}{\lceil\log(n)\rceil}\sum_{i=1}^{\lceil\log(n)% \rceil}(y_{i}-Q(\operatorname{\mathbf{s}}_{i},\operatorname{\mathbf{a}}_{i};% \theta))^{2}.caligraphic_L ( italic_θ ) = divide start_ARG 1 end_ARG start_ARG ⌈ roman_log ( italic_n ) ⌉ end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌈ roman_log ( italic_n ) ⌉ end_POSTSUPERSCRIPT ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Q ( bold_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_θ ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .
        Update the target network weights:
θτθ+(1τ)θ.superscript𝜃𝜏𝜃1𝜏superscript𝜃\theta^{-}\leftarrow\tau\cdot\theta+(1-\tau)\cdot\theta^{-}.italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ← italic_τ ⋅ italic_θ + ( 1 - italic_τ ) ⋅ italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT .
        Update the Q-network weights using gradient descent:
θθ+αθ(θ).𝜃𝜃𝛼subscript𝜃𝜃\theta\leftarrow\theta+\alpha\nabla_{\theta}\mathcal{L}(\theta).italic_θ ← italic_θ + italic_α ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT caligraphic_L ( italic_θ ) .
        𝐬𝐬𝐬superscript𝐬\operatorname{\mathbf{s}}\leftarrow\operatorname{\mathbf{s}}^{\prime}bold_s ← bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.
     end while
  end for
Algorithm 4 Stochastic Deep Q-Network (StochDQN)

Appendix D Experimental Details

D.1 Environments

We test our proposed algorithms on a standardized set of environments using open-source libraries. We compare stochastic maximization to exact maximization and evaluate the proposed stochastic RL algorithms on Gymnasium environments (Brockman et al., 2016). Stochastic Q-learning and Stochastic Double Q-learning are tested on the CliffWalking-v0, the FrozenLake-v1, and a generated MDP environment, while stochastic deep Q-learning approaches are tested on MuJoCo control tasks (Todorov et al., 2012).

D.1.1 Environments with Discrete States and Actions

We generate an MDP environment with 256 actions, with rewards following a normal distribution of mean -50 and standard deviation of 50, with 3 states. Furthermore, while our approach is designed for large discrete action spaces, we tested it in Gymnasium environments (Brockman et al., 2016) with only four discrete actions, such as CliffWalking-v0 and FrozenLake-v1. CliffWalking-v0 involves navigating a grid world from the starting point to the destination without falling off a cliff. FrozenLake-v1 requires moving from the starting point to the goal without stepping into any holes on the frozen surface, which can be challenging due to the slippery nature of the ice.

D.1.2 Environments with Continuous States: Discretizing Control Tasks

We test the stochastic deep Q-learning approaches on MuJoCo (Todorov et al., 2012) for continuous states discretized control tasks. We discretize each action dimension into i𝑖iitalic_i equally spaced values, creating a discrete action space with n=id𝑛superscript𝑖𝑑n=i^{d}italic_n = italic_i start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT d𝑑ditalic_d-dimensional actions. We mainly focused on the inverted pendulum and the half-cheetah. The inverted pendulum involves a cart that can be moved left or right, intending to balance a pole on top using a 1D force, with i=512𝑖512i=512italic_i = 512 resulting in 512 actions. The half-cheetah is a robot with nine body parts aiming to maximize forward speed. It can apply torque to 6 joints, resulting in 6D actions with i=4𝑖4i=4italic_i = 4, which results in 4096 actions.

D.2 Algorithms

D.2.1 Stochastic Maximization

We have two scenarios, one for discrete and the other for continuous states. For discrete states, \mathcal{E}caligraphic_E is a dictionary with the keys as the states in 𝒮𝒮\operatorname{\mathcal{S}}caligraphic_S with corresponding values of the latest played action in every state. In contrast, \mathcal{E}caligraphic_E comprises the actions in the replay buffer for continuous states. Indeed, we do not consider the whole set \mathcal{E}caligraphic_E either. Instead, we only consider a subset \mathcal{M}\subset\mathcal{E}caligraphic_M ⊂ caligraphic_E. For discrete states, for a given state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, \mathcal{M}caligraphic_M includes the latest two exploited actions in state 𝐬𝐬\operatorname{\mathbf{s}}bold_s. For continuous states, where it is impossible to retain the last exploited action for each state, we consider randomly sampled subset \mathcal{M}\subset\mathcal{E}caligraphic_M ⊂ caligraphic_E, which includes log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ actions, even though they were played in different states. In the experiments involving continuous states, we demonstrate that this was sufficient to achieve good results, see Section 7.3.

D.2.2 Tabular Q-learning Methods

We set the training parameters the same for all the Q-learning variants. We follow similar hyper-parameters as in (Hasselt, 2010). We set the discount factor γ𝛾\gammaitalic_γ to 0.95 and apply a dynamical polynomial learning rate α𝛼\alphaitalic_α with αt(𝐬,𝐚)=1/zt(𝐬,𝐚)0.8subscript𝛼𝑡𝐬𝐚1subscript𝑧𝑡superscript𝐬𝐚0.8\alpha_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=1/z_{t}(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})^{0.8}italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) = 1 / italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) start_POSTSUPERSCRIPT 0.8 end_POSTSUPERSCRIPT, where zt(𝐬,𝐚)subscript𝑧𝑡𝐬𝐚z_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) is the number of times the pair (𝐬,𝐚)𝐬𝐚(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})( bold_s , bold_a ) has been visited, initially set to one for all the pairs. For the exploration rate, we use use a decaying ε𝜀\varepsilonitalic_ε, defined as ε(𝐬)=1/(z(𝐬))\varepsilon(\operatorname{\mathbf{s}})=1/\sqrt{(}z(\operatorname{\mathbf{s}}))italic_ε ( bold_s ) = 1 / square-root start_ARG ( end_ARG italic_z ( bold_s ) ) where z(𝐬)𝑧𝐬z(\operatorname{\mathbf{s}})italic_z ( bold_s ) is the number of times state 𝐬𝐬\operatorname{\mathbf{s}}bold_s has been visited, initially set to one for all the states. For Double Q-learning zt(𝐬,𝐚)=ztA(𝐬,𝐚)subscript𝑧𝑡𝐬𝐚subscriptsuperscript𝑧𝐴𝑡𝐬𝐚z_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=z^{A}_{t}(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) = italic_z start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) if QAsuperscript𝑄𝐴Q^{A}italic_Q start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT is updated and zt(𝐬,𝐚)=ztB(𝐬,𝐚)subscript𝑧𝑡𝐬𝐚subscriptsuperscript𝑧𝐵𝑡𝐬𝐚z_{t}(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}})=z^{B}_{t}(% \operatorname{\mathbf{s}},\operatorname{\mathbf{a}})italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) = italic_z start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) if QBsuperscript𝑄𝐵Q^{B}italic_Q start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT is updated, where ztAsubscriptsuperscript𝑧𝐴𝑡z^{A}_{t}italic_z start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and ztBsubscriptsuperscript𝑧𝐵𝑡z^{B}_{t}italic_z start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT store the number of updates for each action for the corresponding value function. We averaged the results over ten repetitions. For Stochastic Q-learning, we track a dictionary 𝒟𝒟\mathcal{D}caligraphic_D with keys being the states, and values being the latest exploited action. Thus, for a state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, the memory =𝒟(𝐬)𝒟𝐬\mathcal{M}=\mathcal{D}(\operatorname{\mathbf{s}})caligraphic_M = caligraphic_D ( bold_s ), thus \mathcal{M}caligraphic_M is the latest exploited action in the same state 𝐬𝐬\operatorname{\mathbf{s}}bold_s.

D.2.3 Deep Q-network Methods

We set the training parameters the same for all the deep Q-learning variants. We set the discount factor γ𝛾\gammaitalic_γ to 0.99 and the learning rate α𝛼\alphaitalic_α to 0.001. Our neural network takes input of a size equal to the sum of the dimensions of states and actions with a single output neuron. The network consists of two hidden linear layers, each with a size of 64, followed by a ReLU activation function (Nair & Hinton, 2010). We keep the exploration rate ε𝜀\varepsilonitalic_ε the same for all states, initialize it at 1, and apply a decay factor of 0.995, with a minimum threshold of 0.01. For n𝑛nitalic_n total number of actions, during training, to train the network, we use stochastic batches of size log(n)𝑛\lceil\log(n)\rceil⌈ roman_log ( italic_n ) ⌉ uniformly sampled from a buffer of size 2log(n)2𝑛2\lceil\log(n)\rceil2 ⌈ roman_log ( italic_n ) ⌉. We averaged the results over five repetitions. For the stochastic methods, we consider the actions in the batch of actions as the memory set \mathcal{M}caligraphic_M. We choose the batch size in this way to keep the complexity of the Stochastic Q-learning within 𝒪(log(n))𝒪𝑛\mathcal{O}(\log(n))caligraphic_O ( roman_log ( italic_n ) ).

D.3 Compute and Implementation

We implement the different Q-learning methods using Python 3.9, Numpy 1.23.4, and Pytorch 2.0.1. For proximal policy optimization (PPO) (Schulman et al., 2017), asynchronous actor-critic (A2C) (Mnih et al., 2016), and deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), we use the implementations of Stable-Baselines (Hill et al., 2018). We test the training time using a CPU 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 1.69 GHz. with 16.0 GB RAM.

Appendix E Additional Results

E.1 Wall Time Speed

Stochastic maximization methods exhibit logarithmic complexity regarding the number of actions, as confirmed in Fig. 5(a). Therefore, both StochDQN and StochDDQN, which apply these techniques for action selection and updates, have exponentially faster execution times compared to both DQN and DDQN, which can be seen in Fig 5(b) which shows the complete step duration for deep Q-learning methods, which include action selection and network update. The proposed methods are nearly as fast as a random algorithm, which samples and selects actions randomly and has no updates.

Refer to caption
(a) Action Selection Time
Refer to caption
(b) Full Step Duration
Figure 5: Comparison results for the stochastic and deterministic methods. The x-axis represents the number of possible actions, and the y-axis represents the time step duration of the agent in seconds.

E.2 Stochastic Maxmization

E.2.1 Stochastic Maxmization vs Maximization with Uniform Rewards

In the setting described in Section B.3 with 5000 uniformly independently distributed action values in the range of [0, 100], as shown in Fig. 6, stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max without memory, i.e., =\mathcal{M}=\emptysetcaligraphic_M = ∅ reaches around 91 in average return, and keeps fluctuating around, while stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max with \mathcal{M}caligraphic_M quickly achieves the optimal reward.

Refer to caption
Figure 6: stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max with (w/) and without (w/o) memory \mathcal{M}caligraphic_M vs. max\maxroman_max on uniformly distributed action values as described in Section B. The x-axis and y-axis represent the steps and the values, respectively.

E.2.2 Stochastic Maximization Analysis

In this section, we analyze stochastic maximization by tracking returned values across rounds, ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (Eq. (10)), and βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (Eq. (9)), which we provide here. At time step t𝑡titalic_t, given a state 𝐬𝐬\operatorname{\mathbf{s}}bold_s, and the current estimated Q-function Qtsubscript𝑄𝑡Q_{t}italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, we define the non-negative underestimation error as βt(𝐬)subscript𝛽𝑡𝐬\beta_{t}(\operatorname{\mathbf{s}})italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ), as follows:

βt(𝐬)=max𝐚𝒜Qt(𝐬,𝐚)stochmax𝐚𝒜Qt(𝐬,𝐚).subscript𝛽𝑡𝐬subscript𝐚𝒜subscript𝑄𝑡𝐬𝐚subscriptstochmax𝐚𝒜subscript𝑄𝑡𝐬𝐚\beta_{t}(\operatorname{\mathbf{s}})=\max_{\operatorname{\mathbf{a}}\in% \mathcal{A}}Q_{t}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}% \right)-\operatorname*{stoch\,max}_{\operatorname{\mathbf{a}}\in\mathcal{A}}Q_% {t}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right).italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) = roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) - start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) . (32)

Furthermore, we define the ratio ωt(𝐬)subscript𝜔𝑡𝐬\omega_{t}(\operatorname{\mathbf{s}})italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ), as follows:

ωt(𝐬)=stochmax𝐚𝒜Qt(𝐬,𝐚)max𝐚𝒜Qt(𝐬,𝐚).subscript𝜔𝑡𝐬subscriptstochmax𝐚𝒜subscript𝑄𝑡𝐬𝐚subscript𝐚𝒜subscript𝑄𝑡𝐬𝐚\omega_{t}(\operatorname{\mathbf{s}})=\frac{\operatorname*{stoch\,max}_{% \operatorname{\mathbf{a}}\in\mathcal{A}}Q_{t}\left(\operatorname{\mathbf{s}},% \operatorname{\mathbf{a}}\right)}{\max_{\operatorname{\mathbf{a}}\in\mathcal{A% }}Q_{t}\left(\operatorname{\mathbf{s}},\operatorname{\mathbf{a}}\right)}.italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) = divide start_ARG start_OPERATOR roman_stoch roman_max end_OPERATOR start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) end_ARG start_ARG roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) end_ARG . (33)

It follows that:

ωt(𝐬)=1βt(𝐬)max𝐚𝒜Qt(𝐬,𝐚).subscript𝜔𝑡𝐬1subscript𝛽𝑡𝐬subscript𝐚𝒜subscript𝑄𝑡𝐬𝐚\omega_{t}(\operatorname{\mathbf{s}})=1-\frac{\beta_{t}(\operatorname{\mathbf{% s}})}{\max_{\operatorname{\mathbf{a}}\in\mathcal{A}}Q_{t}\left(\operatorname{% \mathbf{s}},\operatorname{\mathbf{a}}\right)}.italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) = 1 - divide start_ARG italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s ) end_ARG start_ARG roman_max start_POSTSUBSCRIPT bold_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_s , bold_a ) end_ARG . (34)

For Deep Q-Networks, for the InvertedPendulum-v4, both stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and max return similar values (Fig. 7(a)), ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT approaches one rapidly (Fig. 7(b)) and βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT remains below 0.5 (Fig. 7(c)). In the case of HalfCheetah-v4, both stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and max return similar values (Fig. 8(a)), ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT quickly converges to one (Fig. 8(b)), and βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is upper bounded below eight (Fig. 8(c)).

While the difference βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT remains bounded, the values of both stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max and max increase over the rounds as the agent explores better options. This leads to the ratio ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT converging to one as the error becomes negligible over the rounds, as expected according to Eq. (34).

Refer to caption
(a) stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max vs max
Refer to caption
(b) Ratio ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
Refer to caption
(c) Difference βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
Figure 7: Comparison results for the stochastic and non-stochastic methods for the Inverted Pendulum with 512 actions.
Refer to caption
(a) stochmaxstochmax\operatorname*{stoch\,max}roman_stoch roman_max vs max
Refer to caption
(b) Ratio ωtsubscript𝜔𝑡\omega_{t}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
Refer to caption
(c) Difference βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
Figure 8: Comparison results for the stochastic and non-stochastic methods for the Half Cheetah with 4096 actions.

E.3 Stochastic Q-network Reward Analysis

Refer to caption
(a) Inverted Pendulum
Refer to caption
(b) Half Cheetah
Figure 9: Stochastic vs non-stochastic of deep Q-learning variants on Inverted Pendulum and Half Cheetah, with steps on the x-axis and average returns, smoothed over a size 100 window on the y-axis.

As illustrated in Fig. 9(a) and Fig. 9(b) for the inverted pendulum and half cheetah experiments, which involve 512 and 4096 actions, respectively, both StochDQN and StochDDQN attain the optimal average return in a comparable number of rounds to DQN and DDQN. Additionally, StochDQN exhibits the quickest attainment of optimal rewards for the inverted pendulum. Furthermore, while DDQN did not perform well on the inverted pendulum task, its modification, i.e., StochDDQN, reached the optimal rewards.

E.4 Stochastic Q-learning Reward Analysis

We tested Stochastic Q-learning, Stochastic Double Q-learning, and Stochastic Sarsa in environments with both discrete states and actions. Interestingly, as shown in Fig. 11, our stochastic algorithms outperform their deterministic counterparts in terms of cumulative rewards. Furthermore, we notice that Stochastic Q-learning outperforms all the considered methods regarding the cumulative rewards. Moreover, in the CliffWalking-v0 (as shown in Fig. 10), as well as for the generated MDP environment with 256 possible actions (as shown in Fig. 12), all the stochastic and non-stochastic algorithms reach the optimal policy in a similar number of steps.

Refer to caption
(a) Instantaneous Rewards
Refer to caption
(b) Cumulative Rewards
Figure 10: Comparing stochastic and non-stochastic Q-learning approaches on the Cliff Walking, with steps on the x-axis, instantaneous rewards smoothed over a size 1000 window on the y-axis for plot (a), and cumulative rewards on the y-axis for plot (b).
Refer to caption
(a) Instantaneous Rewards
Refer to caption
(b) Cumulative Rewards
Figure 11: Comparing stochastic and non-stochastic Q-learning approaches on the Frozen Lake, with steps on the x-axis, instantaneous rewards smoothed over a size 1000 window on the y-axis for plot (a), and cumulative rewards on the y-axis for plot (b).
Refer to caption
(a) Instantaneous Rewards
Refer to caption
(b) Cumulative Rewards
Figure 12: Comparing stochastic and non-stochastic Q-learning approaches on the generated MDP environment, with steps on the x-axis, instantaneous rewards smoothed over a size 1000 window on the y-axis for plot (a), and cumulative rewards on the y-axis for plot (b).