Stochastic Q-learning for Large Discrete Action Spaces
Abstract
In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, as opposed to optimizing over the entire set of $n$ actions, only consider a variable stochastic set of a sublinear number of actions, possibly as small as $\mathcal{O}(\log(n))$. The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, while an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.
1 Introduction
Reinforcement learning (RL), a continually evolving field of machine learning, has achieved notable successes, especially when combined with deep learning (Sutton & Barto, 2018; Wang et al., 2022). While there have been several advances in the field, a significant challenge lies in navigating complex environments with large discrete action spaces (Dulac-Arnold et al., 2015, 2021). In such scenarios, standard RL algorithms suffer in terms of computational efficiency (Akkerman et al., 2023). Identifying the optimal actions might entail cycling through all of them, in general, multiple times within different states, which is computationally expensive and may become prohibitive with large discrete action spaces (Tessler et al., 2019).
Such challenges apply to various domains, including combinatorial optimization (Mazyavkina et al., 2021; Fourati et al., 2023, 2024b, 2024a), natural language processing (He et al., 2015, 2016a, 2016b; Tessler et al., 2019), communications and networking (Luong et al., 2019; Fourati & Alouini, 2021), recommendation systems (Dulac-Arnold et al., 2015), transportation (Al-Abbasi et al., 2019; Haliem et al., 2021; Li et al., 2022), and robotics (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020; Seyde et al., 2021, 2022; Gonzalez et al., 2023; Ireland & Montana, 2024). Although tailored solutions leveraging action space structures and dimensions may suffice in specific contexts, their applicability across diverse problems, possibly unstructured, still needs to be expanded. We complement these works by proposing a general method that addresses a broad spectrum of problems, accommodating structured and unstructured single and multi-dimensional large discrete action spaces.
Value-based and actor-based approaches are both prominent in RL. Value-based approaches, in which the agent implicitly optimizes its policy by maximizing a value function, demonstrate superior generalization capabilities but demand significant computational resources, particularly in complex settings. Conversely, actor-based approaches, in which the agent directly optimizes its policy, offer computational efficiency but often encounter challenges in generalizing across multiple and unexplored actions (Dulac-Arnold et al., 2015). While both hold unique advantages and challenges, they represent distinct avenues for addressing the complexities of decision-making in large action spaces. However, comparing them falls outside the scope of this work. While some previous methods have focused on the latter (Dulac-Arnold et al., 2015), our work concentrates on the former. Specifically, we aim to exploit the natural generalization inherent in value-based RL approaches while reducing their per-step computational complexity.
Q-learning, as introduced by Watkins & Dayan (1992), for discrete action and state spaces, stands out as one of the most famous examples of value-based RL methods and remains one of the most widely used ones in the field. As an off-policy learning method, it decouples the learning process from the agent’s current policy, allowing it to leverage past experiences from various sources, which becomes advantageous in complex environments. In each step of Q-learning, the agent updates its action value estimates based on the observed reward and the estimated value of the best action in the next state.
Some approaches have been proposed to apply Q-learning to continuous state spaces, leveraging deep neural networks (Mnih et al., 2013; Van Hasselt et al., 2016). Moreover, several improvements have also been suggested to address its inherent estimation bias (Hasselt, 2010; Van Hasselt et al., 2016; Zhang et al., 2017; Lan et al., 2020; Wang et al., 2021). However, despite this progress and its numerous advantages, a significant challenge remains for Q-learning-like methods when confronted with large discrete action spaces. The computational complexity of selecting actions and updating Q-functions grows in proportion to the number of actions, which renders the conventional approach impractical as the number of actions increases substantially. Consequently, we confront a crucial question: Is it possible to mitigate the complexity of the different Q-learning methods while maintaining a good performance?
This work proposes a novel, simple, and practical approach for handling general, possibly unstructured, single-dimensional or multi-dimensional, large discrete action spaces. Our approach targets the computational bottleneck in value-based methods caused by the search for a maximum ($\max$ and $\arg\max$) in every learning iteration, which scales as $\mathcal{O}(n)$, i.e., linearly with the number of possible actions $n$. Through randomization, we can reduce this linear per-step computational complexity to logarithmic.
We introduce $\operatorname{stoch\,max}$ and $\operatorname{stoch\,arg\,max}$, which, instead of exhaustively searching for the precise maximum across the entire set of actions, rely on at most two random subsets of actions, both of sub-linear sizes, possibly each of size $\mathcal{O}(\log(n))$. The first subset is randomly sampled from the complete set of actions, and the second from the previously exploited actions. These stochastic maximization techniques amortize the computational overhead of standard maximization operations in various Q-learning methods (Watkins & Dayan, 1992; Hasselt, 2010; Mnih et al., 2013; Van Hasselt et al., 2016). Stochastic maximization methods significantly accelerate the agent’s steps, including action selection and value-function updates in value-based RL methods, making them practical for handling challenging, large-scale, real-world problems.
We propose Stochastic Q-learning, Stochastic Double Q-learning, StochDQN, and StochDDQN, which are obtained by changing $\max$ and $\arg\max$ to $\operatorname{stoch\,max}$ and $\operatorname{stoch\,arg\,max}$ in Q-learning (Watkins & Dayan, 1992), Double Q-learning (Hasselt, 2010), the deep Q-network (DQN) (Mnih et al., 2013), and the double DQN (DDQN) (Van Hasselt et al., 2016), respectively. Furthermore, we observed that our approach works even for the on-policy Sarsa (Rummery & Niranjan, 1994).
We conduct a theoretical analysis of the proposed method, proving the convergence of Stochastic Q-learning, which integrates these techniques for action selection and value updates, establishing a lower bound on the probability of sampling an optimal action from a random set of size $\lceil \log(n) \rceil$, and analyzing the error of stochastic maximization compared to exact maximization. Furthermore, we evaluate the proposed RL algorithms on environments from Gymnasium (Brockman et al., 2016). For the stochastic deep RL algorithms, the evaluations were performed on control tasks within the multi-joint dynamics with contact (MuJoCo) environment (Todorov et al., 2012) with discretized actions (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020). These evaluations demonstrate that the stochastic approaches outperform non-stochastic ones in wall-clock time and sometimes in rewards. Our key contributions are summarized as follows:
• We introduce novel stochastic maximization techniques denoted as $\operatorname{stoch\,max}$ and $\operatorname{stoch\,arg\,max}$, offering a compelling alternative to conventional deterministic maximization operations, particularly beneficial for handling large discrete action spaces, ensuring sub-linear complexity in the number of actions.
• We present a suite of value-based algorithms suitable for large discrete action spaces, including Stochastic Q-learning, Stochastic Sarsa, Stochastic Double Q-learning, StochDQN, and StochDDQN, which integrate stochastic maximization within Q-learning, Sarsa, Double Q-learning, DQN, and DDQN, respectively.
• We analyze stochastic maximization and demonstrate the convergence of Stochastic Q-learning. Furthermore, we empirically validate our approach on tasks from the Gymnasium and MuJoCo environments, encompassing discretized actions of various dimensions.
2 Related Works
While RL has shown promise in diverse domains, practical applications often grapple with real-world complexities. A significant hurdle arises when dealing with large discrete action spaces (Dulac-Arnold et al., 2015, 2021). Previous research has investigated strategies to address this challenge by leveraging the combinatorial or the dimensional structures in the action space (He et al., 2016b; Tavakoli et al., 2018; Tessler et al., 2019; Delarue et al., 2020; Seyde et al., 2021, 2022; Fourati et al., 2023, 2024b, 2024a; Akkerman et al., 2023; Ireland & Montana, 2024). For example, He et al. (2016b) leveraged the combinatorial structure of their language problem through sub-action embeddings. Compressed sensing was employed in (Tessler et al., 2019) for text-based games with combinatorial actions. Delarue et al. (2020) formulated the combinatorial action decision of a vehicle routing problem as a mixed-integer program. Moreover, Akkerman et al. (2023) introduced dynamic neighbourhood construction specifically for structured combinatorial large discrete action spaces. Solutions tailored to multi-dimensional spaces, such as those in (Seyde et al., 2021, 2022; Ireland & Montana, 2024), among others, are practical in multi-dimensional spaces but may not be helpful for single-dimensional large action spaces. While relying on the structure of the action space is practical in some settings, not all problems with large action spaces are multi-dimensional or structured. We complement these works by making no assumptions about the structure of the action space.
Some approaches have proposed factorizing the action spaces to reduce their size. For example, these include factorizing into binary subspaces (Lagoudakis & Parr, 2003; Sallans & Hinton, 2004; Pazis & Parr, 2011; Dulac-Arnold et al., 2012), expert demonstration (Tennenholtz & Mannor, 2019), tensor factorization (Mahajan et al., 2021), and symbolic representations (Cui & Khardon, 2016). Additionally, some hierarchical and multi-agent RL approaches employed factorization as well (Zhang et al., 2020; Kim et al., 2021; Peng et al., 2021; Enders et al., 2023). While some of these methods effectively handle large action spaces for certain problems, they necessitate the design of a representation for each discrete action. Even then, for some problems, the resulting space may still be large.
Methods presented in (Van Hasselt & Wiering, 2009; Dulac-Arnold et al., 2015; Wang et al., 2020) combine continuous-action policy gradients with nearest neighbour search to generate continuous actions and identify the nearest discrete actions. These are interesting methods but require continuous-to-discrete mapping and are mainly policy-based rather than value-based approaches. In the works of Kalashnikov et al. (2018) and Quillen et al. (2018), the cross-entropy method (Rubinstein, 1999) was utilized to approximate action maximization. This approach requires multiple iterations ($r$) for a single action selection. During each iteration, it samples $n'$ values, where $n' < n$, fits a Gaussian distribution to $m < n'$ of these samples, and subsequently draws a new batch of $n'$ samples from this Gaussian distribution. As a result, this approximation remains costly, with a complexity of $\mathcal{O}(r n')$. Additionally, in the work of Van de Wiele et al. (2020), a neural network was trained to predict the optimal action in combination with a uniform search. This approach involves the use of an expensive autoregressive proposal distribution to generate $n'$ actions and samples a large number of actions ($m$), thus remaining computationally expensive, with $\mathcal{O}(n' + m)$. In (Metz et al., 2017), sequential DQN allows the agent to choose sub-actions one by one, which increases the number of steps needed to solve a problem and requires $d$ steps with a linear complexity of $\mathcal{O}(i)$ for a discretization granularity $i$. Additionally, Tavakoli et al. (2018) employ a branching technique with duelling DQN for combinatorial control problems. Their approach has a complexity of $\mathcal{O}(d i)$ for actions with discretization granularity $i$ and $d$ dimensions, whereas our method, in a similar setting with $n = i^d$ actions, achieves $\mathcal{O}(d \log(i))$. Another line of work introduces action elimination techniques, such as the action elimination DQN (Zahavy et al., 2018), which employs an action elimination network guided by an external elimination signal from the environment. However, it requires this domain-specific signal and can be computationally expensive ($\mathcal{O}(n')$, where $n'$ is the number of remaining actions). In contrast, curriculum learning, as proposed by Farquhar et al. (2020), initially limits an agent’s action space, gradually expanding it during training for efficient exploration. However, its effectiveness relies on having an informative restricted action space, and as the action space size grows, its complexity scales linearly with its size, eventually reaching $\mathcal{O}(n)$.
In the context of combinatorial bandits with a single state but large discrete action spaces, previous works have exploited the combinatorial structure of actions, where each action is a subset of main arms. For instance, for submodular reward functions, which imply diminishing returns when adding arms, in (Fourati et al., 2023) and (Fourati et al., 2024b), stochastic greedy algorithms are used to avoid exact search. The former evaluates the marginal gains of adding and removing sub-actions (arms), while the latter assumes monotonic rewards and considers adding the best arm until a cardinality constraint is met. For general reward functions, Fourati et al. (2024a) propose using approximation algorithms to evaluate and add sub-actions. While these methods are practical for bandits, they exploit the combinatorial structure of their problems and consider a single-state scenario, which is different from general RL problems.
While some approaches above are practical for handling specific problems with large discrete action spaces, they often exploit the dimensional or combinatorial structures inherent in their considered problems. In contrast, we complement these approaches by proposing a solution to tackle any general, potentially unstructured, single-dimensional or multi-dimensional, large discrete action space without relying on structure assumptions. Our proposed solution is general, simple, and efficient.
3 Problem Description
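In the context of a Markov decision process (MDP), we have specific components: a finite set of actions denoted as $\mathcal{A}$, with $n = |\mathcal{A}|$, a finite set of states denoted as $\mathcal{S}$, a transition probability distribution $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, a bounded reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and a discount factor $\gamma \in [0, 1]$. Furthermore, for time step $t$, we denote the chosen action as $a_t$, the current state as $s_t$, and the received reward as $r_t \triangleq r(s_t, a_t)$. Additionally, for time step $t$, we define a learning rate function $\alpha_t: \mathcal{S} \times \mathcal{A} \to [0, 1]$.
The cumulative reward an agent receives during an episode in an MDP with variable length time $T$ is the return $R_t$. It is calculated as the discounted sum of rewards from time step $t$ until the episode terminates: $R_t \triangleq \sum_{i=t}^{T} \gamma^{i-t} r_i$. RL aims to learn a policy $\pi: \mathcal{S} \to \mathcal{A}$ mapping states to actions that maximizes the expected return across all episodes. The state-action value function, denoted as $Q^{\pi}(s, a)$, represents the expected return when starting from a given state $s$, taking action $a$, and following a policy $\pi$ afterwards. The function $Q^{\pi}$ can be expressed recursively using the Bellman equation:
$$Q^{\pi}(s, a) = \mathbb{E}\left[ r(s, a) + \gamma\, Q^{\pi}(s', \pi(s')) \mid s, a \right]. \tag{1}$$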
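Two main categories of policies are commonly employed in RL systems: value-based and actor-based policies (Sutton & Barto, 2018). This study primarily concentrates on the former type, where the value function directly influences the policy’s decisions. An example of a value-based policy in a state $s$ involves an $\varepsilon$-greedy algorithm, selecting the action with the highest Q-function value with probability $(1 - \varepsilon_s)$, where $\varepsilon_s \geq 0$ is a function of the state $s$, requiring the use of the $\arg\max$ operation, as follows:
$$\pi_Q(s) = \begin{cases} \text{play randomly} & \text{with probability } \varepsilon_s \\ \arg\max_{a \in \mathcal{A}} Q(s, a) & \text{otherwise.} \end{cases} \tag{2}$$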
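Furthermore, during the training, to update the Q-function, Q-learning (Watkins & Dayan, 1992), for example, uses the following update rule, which requires a $\max$ operation:
$$Q_{t+1}(s_t, a_t) = \left(1 - \alpha_t(s_t, a_t)\right) Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_t + \gamma \max_{b \in \mathcal{A}} Q_t(s_{t+1}, b) \right]. \tag{3}$$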
Therefore, the computational complexity of both the action selections in Eq. (2) and the Q-function updates in Eq. (3) scales linearly with the cardinality of the action set $\mathcal{A}$, making this approach infeasible as the number of actions increases significantly. The same complexity issues remain for other Q-learning variants, such as Double Q-learning (Hasselt, 2010), DQN (Mnih et al., 2013), and DDQN (Van Hasselt et al., 2016), among several others.
When representing the value function as a parameterized function, such as a neural network, taking only the current state as input and outputting the values for all actions, as proposed in DQN (Mnih et al., 2013), the network must accommodate a large number of output nodes, which results in increasing memory overhead and necessitates extensive predictions and maximization over these final outputs in the last layer. A notable point about this approach is that it does not exploit contextual information (representation) of actions, if available, which leads to lower generalization capability across actions with similar features and fails to generalize over new actions.
Previous works have considered generalization over actions by taking the features of an action and the current state as inputs to the Q-network and predicting its value (Zahavy et al., 2018; Metz et al., 2017; Van de Wiele et al., 2020). However, modeling the value function as a parameterized function with both state and action as inputs leads to further complications. Although this approach allows for improved generalization across the action space by leveraging contextual information from each action and generalizing across similar ones, it requires evaluating the function for each action within the action set $\mathcal{A}$. This results in a linear increase in the number of function calls as the number of actions grows. This scalability issue becomes particularly problematic when dealing with computationally expensive function approximators, such as deep neural networks (Dulac-Arnold et al., 2015). Addressing these challenges forms the motivation behind this work.
4 Proposed Approach
To alleviate the computational burden associated with maximizing a Q-function at each time step, especially when dealing with large action spaces, we introduce stochastic maximization methods with sub-linear complexity relative to the size of the action set $\mathcal{A}$. Then, we integrate these methods into different value-based RL algorithms.
4.1 Stochastic Maximization
We introduce stochastic maximization as an alternative to maximization when dealing with large discrete action spaces. Instead of conducting an exhaustive search for the precise maximum across the entire set of actions $\mathcal{A}$, stochastic maximization searches for a maximum within a stochastic subset of actions of sub-linear size relative to the total number of actions. In principle, any size can be used, trading off time complexity and approximation. We mainly focus on $\mathcal{O}(\log(n))$ to illustrate the power of the method in recovering Q-learning, even with such a small number of actions, with logarithmic complexity.
We consider two approaches to stochastic maximization: memoryless and memory-based approaches. The memoryless one samples a random subset of actions with a sublinear size and seeks the maximum within this subset. On the other hand, the memory-based one expands the randomly sampled set to include a few actions with a sublinear size from the latest exploited actions and uses the combined sets to search for a stochastic maximum. Stochastic maximization, which may miss the exact maximum in both versions, is always upper-bounded by deterministic maximization, which finds the exact maximum. However, by construction, it has sublinear complexity in the number of actions, making it appealing when maximizing over large action spaces becomes impractical.
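Formally, given a state $s$, which may be discrete or continuous, along with a Q-function, a random subset of actions $\mathcal{R} \subseteq \mathcal{A}$, and a memory subset $\mathcal{M} \subseteq \mathcal{A}$ (empty in the memoryless case), each subset being of sublinear size, such as at most $\mathcal{O}(\log(n))$ each, the $\operatorname{stoch\,max}$ is the maximum value computed from the union set $\mathcal{C} = \mathcal{R} \cup \mathcal{M}$, defined as:
$$\operatorname*{stoch\,max}_{k \in \mathcal{A}} Q(s, k) \triangleq \max_{k \in \mathcal{C}} Q(s, k). \tag{4}$$
Besides, the $\operatorname{stoch\,arg\,max}$ is computed as follows:
$$\operatorname*{stoch\,arg\,max}_{k \in \mathcal{A}} Q(s, k) \triangleq \arg\max_{k \in \mathcal{C}} Q(s, k). \tag{5}$$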
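The two operators can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names `stoch_arg_max` and `stoch_max`, the list-based representation of the Q-values of one state, and the use of the natural logarithm for the subset size are assumptions made for the example.

```python
import math
import random


def stoch_arg_max(q_values, memory=(), rng=random):
    """Return (action, value) maximizing q_values over the union of a random
    subset and a memory subset, instead of over all actions.

    q_values: sequence with q_values[a] = Q(s, a) for one fixed state s.
    memory:   previously exploited actions (leave empty for the memoryless case).
    """
    n = len(q_values)
    # Assumed subset size: ceil(log n), natural logarithm, at least one action.
    k = max(1, math.ceil(math.log(n)))
    candidates = set(rng.sample(range(n), k))  # random subset, without replacement
    candidates.update(memory)                  # union of random subset and memory
    best = max(candidates, key=lambda a: q_values[a])
    return best, q_values[best]


def stoch_max(q_values, memory=(), rng=random):
    """Value counterpart of stoch_arg_max."""
    return stoch_arg_max(q_values, memory, rng)[1]
```

By construction the returned value never exceeds `max(q_values)`, and the work per call is proportional to the number of candidates rather than to the number of actions.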
In the analysis of stochastic maximization, we explore both memory-based and memoryless maximization. In the analysis and experiments, we consider the random set $\mathcal{R}$ to consist of $\lceil \log(n) \rceil$ actions. When memory-based, in our experiments, within a given discrete state, we consider the two most recently exploited actions in that state. For continuous states, where it is impossible to retain the latest exploited actions for each state, we consider a randomly sampled subset $\mathcal{M}$, which includes $\lceil \log(n) \rceil$ actions, even though they were played in different states. We demonstrate that this approach was sufficient to achieve good results in the benchmarks considered; see Section 7.3. Our Stochastic Q-learning convergence analysis considers memoryless stochastic maximization with a random set $\mathcal{R}$ of any size.
Remark 4.1.
By setting $\mathcal{C}$ equal to $\mathcal{A}$, we essentially revert to standard approaches. Consequently, our method is an extension of non-stochastic maximization. However, in pursuit of our objective to make RL practical for large discrete action spaces, for a given state $s$, in our analysis and experiments, we keep the union set $\mathcal{C}$ limited to at most $\mathcal{O}(\log(n))$ actions, ensuring sub-linear (logarithmic) complexity.
4.2 Stochastic Q-learning
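We introduce Stochastic Q-learning, described in Algorithm 1, and Stochastic Double Q-learning, described in Algorithm 2 in Appendix C, that replace the $\max$ and $\arg\max$ operations in Q-learning and Double Q-learning with $\operatorname{stoch\,max}$ and $\operatorname{stoch\,arg\,max}$, respectively. Furthermore, we introduce Stochastic Sarsa, described in Algorithm 3 in Appendix C, which replaces the maximization in the greedy action selection ($\arg\max$) in Sarsa.
Our proposed solution takes a distinct approach from the conventional method of selecting the action with the highest Q-function value from the complete set of actions $\mathcal{A}$. Instead, it uses stochastic maximization, which finds a maximum within a stochastic subset $\mathcal{C} \subseteq \mathcal{A}$, constructed as explained in Section 4.1. Our stochastic policy $\pi^{S}_{Q}(s)$ uses an $\varepsilon_s$-greedy algorithm in a given state $s$, playing randomly with probability $\varepsilon_s$, for $\varepsilon_s > 0$, and is defined as follows:
$$\pi^{S}_{Q}(s) \triangleq \begin{cases} \text{play randomly} & \text{with probability } \varepsilon_s \\ \operatorname*{stoch\,arg\,max}_{a \in \mathcal{A}} Q(s, a) & \text{otherwise.} \end{cases} \tag{6}$$
Furthermore, during the training, to update the Q-function, our proposed Stochastic Q-learning uses the following rule:
$$Q_{t+1}(s_t, a_t) = \left(1 - \alpha_t(s_t, a_t)\right) Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_t + \gamma \operatorname*{stoch\,max}_{b \in \mathcal{A}} Q_t(s_{t+1}, b) \right]. \tag{7}$$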
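A single agent step combining the stochastic policy and the stochastic update can be sketched as below for the tabular case. This is a minimal sketch under stated assumptions, not Algorithm 1 itself: the helper names (`sample_candidates`, `select_action`, `update`), the dictionary-backed Q-table keyed by `(state, action)`, the constant learning rate, and the subset size $\lceil \ln n \rceil$ are all choices made for illustration.

```python
import math
import random
from collections import defaultdict


def sample_candidates(n, memory, rng):
    """Random subset of about ceil(log n) actions, united with the memory set."""
    k = max(1, math.ceil(math.log(n)))
    return set(rng.sample(range(n), k)) | set(memory)


def select_action(Q, s, n, eps, memory, rng):
    """Epsilon-greedy policy whose greedy branch uses a stochastic arg max."""
    if rng.random() < eps:
        return rng.randrange(n)            # explore: uniformly random action
    cands = sample_candidates(n, memory, rng)
    return max(cands, key=lambda a: Q[(s, a)])


def update(Q, s, a, r, s_next, n, alpha, gamma, rng):
    """Q-learning update with the max replaced by a memoryless stochastic max."""
    cands = sample_candidates(n, (), rng)
    target = r + gamma * max(Q[(s_next, b)] for b in cands)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target


# Hypothetical usage on a problem with 16 actions and integer state ids.
Q = defaultdict(float)
rng = random.Random(0)
update(Q, s=0, a=1, r=1.0, s_next=1, n=16, alpha=0.5, gamma=0.9, rng=rng)
```

Both functions touch only the sampled candidates, so neither the action selection nor the update iterates over all `n` actions.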
While Stochastic Q-learning, like Q-learning, employs the same values for action selection and action evaluation, Stochastic Double Q-learning, similar to Double Q-learning, learns two separate Q-functions. For each update, one Q-function determines the policy, while the other determines the value of that policy. Both stochastic learning methods remove the maximization bottleneck from exploration and training updates, making these proposed algorithms significantly faster than their deterministic counterparts.
4.3 Stochastic Deep Q-network
We introduce Stochastic DQN (StochDQN), described in Algorithm 4 in Appendix C, and Stochastic DDQN (StochDDQN) as efficient variants of deep Q-networks. These variants substitute the maximization steps in the DQN (Mnih et al., 2013) and DDQN (Van Hasselt et al., 2016) algorithms with the stochastic maximization operations. In these modified approaches, we replace the $\varepsilon_s$-greedy exploration strategy with the same exploration policy as in Eq. (6).
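For StochDQN, we employ a deep neural network as a function approximator to estimate the action-value function, represented as $Q(s, a; \theta) \approx Q(s, a)$, where $\theta$ denotes the weights of the Q-network. This network is trained by minimizing a series of loss functions denoted as $L_i(\theta_i)$, with these loss functions changing at each iteration $i$ as follows:
$$L_i(\theta_i) \triangleq \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right], \tag{8}$$
where $y_i \triangleq \mathbb{E}\left[ r + \gamma \operatorname*{stoch\,max}_{b \in \mathcal{A}} Q(s', b; \theta_{i-1}) \mid s, a \right]$. In this context, $y_i$ represents the target value for an iteration $i$, and $\rho$ is a probability distribution that covers states and actions. Like the DQN approach, we keep the parameters fixed from the previous iteration, denoted as $\theta_{i-1}$, when optimizing the loss function $L_i(\theta_i)$.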
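The target computation can be illustrated without any deep learning framework. In the sketch below, each row of `next_q_rows` stands in for the outputs of the frozen network (previous-iteration weights) at a next state; the function names, the pure-Python lists, and the subset size $\lceil \ln n \rceil$ are assumptions for illustration, and the sketch is not the authors' Algorithm 4.

```python
import math
import random


def stoch_dqn_targets(rewards, next_q_rows, gamma, memory=(), rng=random):
    """Targets r + gamma * (max of frozen Q-values over sampled actions) per transition.

    rewards:     list of rewards, one per transition in the minibatch.
    next_q_rows: list of lists; next_q_rows[j][b] plays the role of the frozen
                 network's value for action b at the next state of transition j.
    memory:      optional actions added to every random subset.
    """
    targets = []
    for r, q_row in zip(rewards, next_q_rows):
        n = len(q_row)
        k = max(1, math.ceil(math.log(n)))
        cands = set(rng.sample(range(n), k)) | set(memory)
        targets.append(r + gamma * max(q_row[b] for b in cands))
    return targets


def squared_loss(targets, predictions):
    """Minibatch estimate of the squared-error loss between targets and predictions."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)
```

When the network takes a state-action pair as input, the same idea would evaluate it only on the sampled candidates rather than materializing a full row of values.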
These target values depend on the network weights, which differ from the fixed targets typically used in supervised learning. We employ stochastic gradient descent for the training. While StochDQN, like DQN, employs the same values for action selection and evaluation, StochDDQN, like DDQN, trains two separate value functions. It does this by randomly assigning each experience to update one of the two value functions, resulting in two sets of weights, $\theta^{A}$ and $\theta^{B}$. For each update, one set of weights determines the policy, while the other set determines the values.
5 Stochastic Maximization Analysis
In the following, we study stochastic maximization with and without memory compared to exact maximization.
5.1 Memoryless Stochastic Maximization
Memoryless stochastic maximization, i.e., $\mathcal{M} = \emptyset$, does not always yield an optimal maximizer. To return an optimal action, this action needs to be randomly sampled from the set of actions. Finding an exact maximizer, without relying on memory $\mathcal{M}$, is a random event with a probability $p$, representing the likelihood of sampling such an exact maximizer. In the following lemma, we provide a lower bound on the probability of discovering an optimal action within a uniformly randomly sampled subset of actions, which we prove in Appendix B.1.1.
Lemma 5.1.
For any given state $s$, the probability $p$ of sampling an optimal action from a uniformly randomly chosen subset of size $\lceil \log(n) \rceil$ actions is at least $\frac{\lceil \log(n) \rceil}{n}$.
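The bound can be checked with exact arithmetic: a uniformly drawn subset of $k$ actions (without replacement) misses all optimal actions with probability $\binom{n - m}{k} / \binom{n}{k}$ when there are $m$ optimal actions, which gives exactly $k/n$ for $m = 1$ and more for $m > 1$. The helper name `prob_contains_optimal` and the example sizes are assumptions made for illustration.

```python
import math
from fractions import Fraction


def prob_contains_optimal(n, k, num_optimal=1):
    """Exact probability that a uniform size-k subset of n actions
    contains at least one of num_optimal optimal actions."""
    miss = Fraction(math.comb(n - num_optimal, k), math.comb(n, k))
    return 1 - miss


n = 1000
k = math.ceil(math.log(n))                    # natural logarithm assumed
single = prob_contains_optimal(n, k)          # one optimal action: equals k / n
several = prob_contains_optimal(n, k, 3)      # more optimal actions: at least k / n
```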
While finding an exact maximizer through sampling may not always occur, the rewards of near-optimal actions can still be similar to those obtained from an optimal action. Therefore, the difference between stochastic maximization and exact maximization might be a more informative metric than just the probability of finding an exact maximizer. Thus, at time step $t$, given state $s$ and the current estimated Q-function $Q_t$, we define the estimation error $\beta_t(s)$ as follows:
$$\beta_t(s) \triangleq \max_{a \in \mathcal{A}} Q_t(s, a) - \operatorname*{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a). \tag{9}$$
Furthermore, we define the similarity ratio $\omega_t(s)$ as follows:
$$\omega_t(s) \triangleq \frac{\operatorname*{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a)}{\max_{a \in \mathcal{A}} Q_t(s, a)}. \tag{10}$$
It can be seen from the definitions that $\beta_t(s) \geq 0$ and, for positive values, $\omega_t(s) \leq 1$. While sampling the exact maximizer is not always possible, near-optimal actions may yield near-optimal values, providing good approximations, i.e., $\beta_t(s) \approx 0$ and $\omega_t(s) \approx 1$. In general, this difference depends on the value distribution over the actions.
While we do not make any specific assumptions about the value distribution in our work, we note that with some simplifying assumptions on the value distributions over the actions, one can derive more specialized guarantees. For example, assuming that the rewards are uniformly distributed over the actions, we demonstrate in Section B.3 that for a given discrete state $s$, if the values of the sampled actions independently follow a uniform distribution from the interval $[Q_t^{*}(s) - b_t(s),\, Q_t^{*}(s)]$, where $Q_t^{*}(s) = \max_{a \in \mathcal{A}} Q_t(s, a)$ and $b_t(s)$ represents the range of the $Q_t(s, \cdot)$ values over the actions in state $s$ at time step $t$, then the expected value of $\beta_t(s)$, even without memory, satisfies $\mathbb{E}\left[ \beta_t(s) \mid s \right] \leq \frac{b_t(s)}{\lceil \log(n) \rceil + 1}$. Furthermore, we empirically demonstrate that for the considered control problems, the difference $\beta_t(s)$ is not large, and the ratio $\omega_t(s)$ is close to one, as shown in Section 7.4.
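The uniform case is easy to probe numerically: for $k$ independent draws that are uniform on an interval of width $b$, the expected gap between the top of the interval and the largest draw is $b / (k + 1)$. The Monte Carlo sketch below estimates that gap; the function name `mean_gap`, the trial count, and the seed are arbitrary choices for illustration, and the interval is shifted to $[0, b]$ without loss of generality.

```python
import random


def mean_gap(k, b, trials, rng):
    """Average gap between the interval's upper end b and the best of k
    independent uniform draws on [0, b]."""
    total = 0.0
    for _ in range(trials):
        total += b - max(rng.uniform(0.0, b) for _ in range(k))
    return total / trials


k, b = 7, 1.0                       # e.g. k = ceil(ln 1000) sampled actions
estimate = mean_gap(k, b, trials=20000, rng=random.Random(0))
predicted = b / (k + 1)             # closed form for the uniform case
```

With these settings `estimate` should land close to `predicted`, and the gap shrinks as more actions are sampled.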
5.2 Stochastic Maximization with Memory
While memoryless stochastic maximization could approach the maximum value or find it with the probability $p$, lower-bounded in Lemma 5.1, it does not converge to an exact maximization, as it keeps sampling purely at random, as can be seen in Fig. 6 in Appendix E.2.1. However, memory-based stochastic maximization, i.e., with $\mathcal{M} \neq \emptyset$, can become an exact maximization when the Q-function becomes stable, as we state in Corollary 5.3, which we prove in Appendix B.2.1, and as confirmed in Fig. 6.
Definition 5.2.
A Q-function is considered stable for a given time range and state when its maximizing action in that state remains unchanged for all subsequent steps within that time, even if the Q-function’s values themselves change.
A straightforward example of a stable Q-function occurs during validation periods when no function updates are performed. However, in general, a stable Q-function does not have to be static and might still vary over the rounds; the critical characteristic is that its maximizing action remains the same even when its values are updated. Although the $\operatorname{stoch\,max}$ has sub-linear complexity compared to the $\max$, without any assumption on the value distributions, the following corollary shows that, on average, for a stable Q-function, after a certain number of iterations, the output of the $\operatorname{stoch\,max}$ matches precisely the output of $\max$.
Corollary 5.3.
For a given state $s$, assuming a time range where the Q-function becomes stable in that state, $\beta_t(s)$ is expected to converge to zero after $\frac{n}{\lceil \log(n) \rceil}$ iterations.
Recalling the definition of the similarity ratio $\omega_t$, it follows that $\omega_t(s) = 1 - \beta_t(s) / \max_{a \in \mathcal{A}} Q_t(s, a)$. Therefore, for a given state $s$, where the Q-function becomes stable, given the boundedness of iterates in Q-learning, it is expected that $\omega_t$ converges to one. This observation was confirmed, even with continuous states and using neural networks as function approximators, in Section 7.4.
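The expected number of iterations can be reproduced with a small simulation under simplifying assumptions: the Q-values of one state are held fixed (a special case of a stable Q-function), the memory keeps only the most recently exploited action, and the random subset has $\lceil \ln n \rceil$ actions. Once the maximizer is sampled it is exploited and therefore stays in memory, so the waiting time is geometric with success probability $\lceil \ln n \rceil / n$. The function name and the problem sizes are hypothetical.

```python
import math
import random


def iterations_until_exact(q_values, rng):
    """Count iterations until memory-based stochastic arg max returns the
    true maximizer of a fixed vector of Q-values."""
    n = len(q_values)
    k = max(1, math.ceil(math.log(n)))
    best_action = max(range(n), key=lambda a: q_values[a])
    memory = []                                   # last exploited action, if any
    steps = 0
    while True:
        steps += 1
        cands = set(rng.sample(range(n), k)) | set(memory)
        chosen = max(cands, key=lambda a: q_values[a])
        if chosen == best_action:
            return steps                          # from here on the output is exact
        memory = [chosen]


rng = random.Random(0)
q_values = [rng.random() for _ in range(200)]
runs = [iterations_until_exact(q_values, rng) for _ in range(3000)]
average = sum(runs) / len(runs)
expected = 200 / math.ceil(math.log(200))         # n / ceil(log n)
```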
6 Stochastic Q-learning Convergence
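In this section, we analyze the convergence of Stochastic Q-learning, described in Algorithm 1. This algorithm employs the policy $\pi^{S}_{Q}(s)$, as defined in Eq. (6), with $\varepsilon_s > 0$ to guarantee that $\mathbb{P}_{\pi}[a_t = a \mid s_t = s] > 0$ for all state-action pairs $(s, a) \in \mathcal{S} \times \mathcal{A}$. The value update, on the other hand, uses the rule specified in Eq. (7).
In the convergence analysis, we focus on memoryless maximization. While the $\operatorname{stoch\,arg\,max}$ operator for action selection can be employed with or without memory, we assume a memoryless $\operatorname{stoch\,max}$ operator for value updates, which means that value updates are performed by maximizing over a randomly sampled subset of actions from $\mathcal{A}$, sampled independently from both the next state $s'$ and the set used for the $\operatorname{stoch\,arg\,max}$.
For a stochastic variable subset of actions $\mathcal{C} \subseteq \mathcal{A}$, following some probability distribution $\mathbb{P}$, we consider, without loss of generality, $Q(\cdot, \emptyset) = 0$, and define, according to $\mathbb{P}$, a target Q-function, denoted as $Q^{*}$, as:
$$Q^{*}(s, a) \triangleq \mathbb{E}\left[ r(s, a) + \gamma \max_{b \in \mathcal{C} \sim \mathbb{P}} Q^{*}(s', b) \mid s, a \right]. \tag{11}$$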
Remark 6.1.
The $Q^{*}$ defined above depends on the sampling distribution $\mathbb{P}$. Therefore, it does not represent the optimal value function of the original MDP problem; instead, it is optimal under the condition where only a random subset of actions following the distribution $\mathbb{P}$ is available to the agent at each time step. However, as the sampling cardinality increases, it increasingly better approximates the optimal value function of the original MDP and fully recovers the optimal Q-function of the original problem when the sampling distribution becomes $\mathbb{P}(\mathcal{A}) = 1$.
The following theorem states the convergence of the iterates $Q_t$ of Stochastic Q-learning with memoryless stochastic maximization to the $Q^{*}$, defined in Eq. (11), for any sampling distribution $\mathbb{P}$, regardless of the cardinality.
Theorem 6.2.
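For a finite MDP, as described in Section 3, let $\mathcal{C}$ be a randomly and independently sampled subset of actions from $\mathcal{A}$, of any cardinality, following any distribution $\mathbb{P}$, exclusively sampled for the value updates. For Stochastic Q-learning, as described in Algorithm 1, given by the update rule
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_t + \gamma \max_{b \in \mathcal{C}} Q_t(s_{t+1}, b) \right],$$
and given any initial estimate $Q_0$, $Q_t$ converges with probability 1 to $Q^*$, defined in Eq. (11), as long as $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.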
The theorem's result demonstrates that for any cardinality of actions, Stochastic Q-learning converges to $Q^*$, as defined in Eq. (11), which recovers the convergence guarantees of Q-learning when the sampling distribution is $\mathbb{P}(\mathcal{A}) = 1$.
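The update rule in Theorem 6.2 can be sketched in a few lines of tabular code. This is a minimal illustration rather than the authors' implementation: the names `stoch_max` and `stochastic_q_update`, the uniform sampling of a fixed-size subset, and the base-2 logarithm used for the subset size are all assumptions made for the example.

```python
import math
import random

def stoch_max(q_row, k, rng):
    """Memoryless stochastic max: max of Q over k actions drawn without replacement."""
    subset = rng.sample(range(len(q_row)), k)
    return max(q_row[b] for b in subset)

def stochastic_q_update(Q, s, a, r, s_next, alpha, gamma, k, rng):
    """One tabular update: Q[s][a] <- (1 - alpha) * Q[s][a] + alpha * (r + gamma * stoch_max)."""
    target = r + gamma * stoch_max(Q[s_next], k, rng)
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]

n_states, n_actions = 3, 256
k = math.ceil(math.log2(n_actions))  # assumed base-2 logarithm: 8 sampled actions
rng = random.Random(0)
Q = [[0.0] * n_actions for _ in range(n_states)]  # initial estimate Q_0 = 0

# A single hypothetical transition (s=0, a=5, r=1.0, s'=1).
new_value = stochastic_q_update(Q, s=0, a=5, r=1.0, s_next=1,
                                alpha=0.5, gamma=0.9, k=k, rng=rng)
```

Each update touches only `k` entries of the next-state row instead of all `n_actions`, which is where the per-step saving comes from. Setting `k = n_actions` makes `stoch_max` an exact maximum, matching the remark that standard Q-learning is recovered when the full action set is always sampled.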
Remark 6.3.
In principle, any size can be used, balancing time complexity and approximation. Our empirical experiments focused on $\log(n)$ to illustrate the method's ability to recover Q-learning, even with a few actions. Using $\sqrt{n}$ will approach the value function of Q-learning more closely compared to using $\log(n)$, albeit at the cost of higher complexity than $\log(n)$.
The theorem shows that even with memoryless stochastic maximization, using $\mathcal{O}(\log(n))$ randomly sampled actions, convergence is still guaranteed. However, relying on memory-based stochastic maximization helps minimize the approximation error in stochastic maximization, as shown in Corollary 5.3, and outperforms Q-learning as shown in the experiments in Section 7.1.
In the following, we provide a sketch of the proof addressing the extra stochasticity due to stochastic maximization. The full proof is provided in Appendix A.
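We tackle the additional stochasticity, which depends on the sampling distribution $\mathbb{P}$, by defining an operator $\Phi$, which for any $q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is as follows:
$$(\Phi q)(s, a) \triangleq \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in \mathcal{C}} q(s', b) \right]. \tag{12}$$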
We then demonstrate that it is a contraction in the sup-norm, as shown in Lemma 6.4, which we prove in Appendix A.2.
Lemma 6.4.
The operator $\Phi$, defined in Eq. (12), is a contraction in the sup-norm, with contraction factor $\gamma$, i.e., $\|\Phi q_1 - \Phi q_2\|_\infty \le \gamma \|q_1 - q_2\|_\infty$.
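We then use the above lemma to establish the convergence of Stochastic Q-learning. Given any initial estimate $Q_0$, using the considered update rule for Stochastic Q-learning, subtracting $Q^*(s_t, a_t)$ from both sides and letting $\Delta_t(s, a) = Q_t(s, a) - Q^*(s, a)$ yields
$$\Delta_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, \Delta_t(s_t, a_t) + \alpha_t(s_t, a_t)\, F_t(s_t, a_t), \quad \text{with } F_t(s_t, a_t) = r_t + \gamma \max_{b \in \mathcal{C}} Q_t(s_{t+1}, b) - Q^*(s_t, a_t). \tag{13}$$
With $\mathcal{F}_t$ representing the past at time $t$,
$$\mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] = \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in \mathcal{C}} Q_t(s', b) - Q^*(s, a) \right] = (\Phi Q_t)(s, a) - Q^*(s, a).$$
Using the fact that $Q^* = \Phi Q^*$ and Lemma 6.4,
$$\left\| \mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] \right\|_\infty \le \gamma \| Q_t - Q^* \|_\infty = \gamma \| \Delta_t \|_\infty. \tag{14}$$
Given that $r$ is bounded, its variance is bounded by some constant $B$. Thus, as shown in Appendix A.1, for a suitable constant $C$, $\operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right] \le C (1 + \| \Delta_t \|_\infty)^2$. Then, by this inequality, Eq. (14), and Theorem 1 in (Jaakkola et al., 1993), $\Delta_t$ converges to zero with probability 1, i.e., $Q_t$ converges to $Q^*$ with probability 1.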
7 Experiments
We compare stochastic maximization to exact maximization and evaluate the proposed RL algorithms in Gymnasium (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) environments. The stochastic tabular Q-learning approaches are tested on CliffWalking-v0, FrozenLake-v1, and a generated MDP environment. Additionally, the stochastic deep Q-network approaches are tested on control tasks and compared against their deterministic counterparts, as well as against DDPG (Lillicrap et al., 2015), A2C (Mnih et al., 2016), and PPO (Schulman et al., 2017), using Stable-Baselines implementations (Hill et al., 2018), which can directly handle continuous action spaces. Further details can be found in Appendix D.
7.1 Stochastic Q-learning Average Return
We test Stochastic Q-learning, Stochastic Double Q-learning, and Stochastic Sarsa in environments with discrete states and actions. Interestingly, as shown in Fig. 1, our stochastic algorithms outperform their deterministic counterparts. Furthermore, Stochastic Q-learning outperforms all the other methods considered in terms of cumulative rewards in FrozenLake-v1. Moreover, in CliffWalking-v0 (as shown in Fig. 10), as well as in the generated MDP environment with 256 actions (as shown in Fig. 12), all the stochastic and non-stochastic methods reach the optimal policy in a similar number of steps.
7.2 Exponential Wall Time Speedup
Stochastic maximization methods exhibit logarithmic complexity regarding the number of actions. Therefore, StochDQN and StochDDQN, which apply these techniques for action selection and updates, have exponentially faster execution times than DQN and DDQN, as confirmed in Fig. 2.
For the time duration of action selection alone, please refer to Appendix E.1. The time analysis shows that the proposed methods are nearly as fast as a baseline that selects actions at random. Specifically, in the experiments with InvertedPendulum-v4, the stochastic methods took around 0.003 seconds per step for a set of 1000 actions, while the non-stochastic methods took 0.18 seconds, which indicates that the stochastic versions are 60 times faster than their deterministic counterparts. Furthermore, for the HalfCheetah-v4 experiment, we considered 4096 actions, where one (D)DQN step takes 0.6 seconds, needing around 17 hours to run for 100,000 steps, while Stoch(D)DQN needs around 2 hours to finish the same 100,000 steps. In other words, within the same wall-clock time, we can run on the order of 10x more steps. This makes the stochastic methods more practical, especially with large action spaces.
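The figures quoted above can be checked with a short worked calculation. This only restates the reported per-step times; the last line is a derived ratio, not a number from the experiments.

```latex
\begin{align*}
\text{speedup (InvertedPendulum-v4)} &= \frac{0.18\ \text{s}}{0.003\ \text{s}} = 60, \\
\text{(D)DQN wall time (HalfCheetah-v4)} &= 0.6\ \text{s} \times 100{,}000 = 60{,}000\ \text{s} \approx 16.7\ \text{h}, \\
\text{ratio to Stoch(D)DQN} &\approx \frac{16.7\ \text{h}}{2\ \text{h}} \approx 8.3 .
\end{align*}
```

The derived ratio of roughly 8 is of the same order as the stated tenfold gain in steps per unit of wall-clock time, given that both run times are reported only approximately.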
7.3 Stochastic Deep Q-network Average Return
Fig. 3 shows the performance of various RL algorithms on the InvertedPendulum-v4 task, which has 512 actions. StochDQN achieves the optimal average return in fewer steps than DQN, with the additional advantage of a lower per-step time (as shown in Section 7.2). Interestingly, while DDQN struggles, StochDDQN nearly reaches the optimal average return, demonstrating the effectiveness of stochasticity. StochDQN and StochDDQN significantly outperform DDQN, A2C, and PPO by obtaining higher average returns in fewer steps. Similarly, Fig. 9(b) in Section E.3 shows the results for the HalfCheetah-v4 task, which has 4096 actions. Stochastic methods, particularly StochDDQN, achieve results comparable to the non-stochastic methods. Notably, all DQN methods (stochastic and non-stochastic) outperform PPO and A2C, highlighting their efficiency in such scenarios.
Remark 7.1.
While comparing them falls outside the scope of our work, we note that DDQN was proposed to mitigate the inherent overestimation in DQN. However, exchanging overestimation for underestimation bias is not always beneficial, as our results demonstrate and as shown in other studies such as (Lan et al., 2020).
7.4 Stochastic Maximization
8 Discussion
In this work, we focus on adapting value-based methods, which excel in generalization compared to actor-based approaches (Dulac-Arnold et al., 2015). However, this advantage comes at the cost of lower computational efficiency due to the maximization operation required for action selection and value function updates. Therefore, our primary motivation is to provide a computationally efficient alternative for situations with general large discrete action spaces.
We focus mainly on Q-learning-like methods among value-based approaches due to their off-policy nature and proven success in various applications. We demonstrate that these methods can be applied to large discrete action spaces while achieving exponentially lower complexity and maintaining good performance. Furthermore, our proposed stochastic maximization method performs well even when applied to the on-policy Sarsa algorithm, extending its potential beyond off-policy methods. Consequently, the suggested stochastic approach offers broader applicability to other value-based approaches, resulting in lower complexity and improved efficiency with large discrete action spaces.
While the primary goal of this work is to reduce the complexity and wall time of Q-learning-like algorithms, our experiments revealed that stochastic methods not only achieve shorter step times (in seconds) but also, in some cases, yield higher rewards and exhibit faster convergence in terms of the number of steps compared to other methods. These improvements can be attributed to several factors. Firstly, introducing more stochasticity into the greedy choice through $\operatorname{stoch\,arg\,max}$ enhances exploration. Secondly, Stochastic Q-learning specifically helps to reduce the inherent overestimation in Q-learning-like methods (Hasselt, 2010; Lan et al., 2020; Wang et al., 2021). This reduction is achieved using $\operatorname{stoch\,max}$, a lower bound to the $\max$ operation.
Q-learning methods, focused initially on discrete actions, can be adapted to tackle continuous problems with discretization techniques and stochastic maximization. Our control experiments show that Q-network methods with discretization achieve superior performance to algorithms with continuous actions, such as PPO, by obtaining higher rewards in fewer steps, which aligns with observations in previous works that highlight the potential of discretization for solving continuous control problems (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020). Notably, the logarithmic complexity of the proposed stochastic methods concerning the number of considered actions makes them well-suited for scenarios with finer-grained discretization, leading to more practical implementations.
9 Conclusion
We propose adapting Q-learning-like methods to mitigate the computational bottleneck associated with the $\max$ and $\arg\max$ operations in these methods. By reducing the maximization complexity from linear to sublinear using $\operatorname{stoch\,max}$ and $\operatorname{stoch\,arg\,max}$, we pave the way for practical and efficient value-based RL for large discrete action spaces. We prove the convergence of Stochastic Q-learning, analyze stochastic maximization, and empirically show that it performs well with significantly lower complexity.
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
- Akkerman et al. (2023) Akkerman, F., Luy, J., van Heeswijk, W., and Schiffer, M. Handling large discrete action spaces via dynamic neighborhood construction. arXiv preprint arXiv:2305.19891, 2023.
- Al-Abbasi et al. (2019) Al-Abbasi, A. O., Ghosh, A., and Aggarwal, V. Deeppool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(12):4714–4727, 2019.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
- Cui & Khardon (2016) Cui, H. and Khardon, R. Online symbolic gradient-based optimization for factored action mdps. In IJCAI, pp. 3075–3081, 2016.
- Delarue et al. (2020) Delarue, A., Anderson, R., and Tjandraatmadja, C. Reinforcement learning with combinatorial actions: An application to vehicle routing. Advances in Neural Information Processing Systems, 33:609–620, 2020.
- Dulac-Arnold et al. (2012) Dulac-Arnold, G., Denoyer, L., Preux, P., and Gallinari, P. Fast reinforcement learning with large action sets using error-correcting output codes for mdp factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 180–194. Springer, 2012.
- Dulac-Arnold et al. (2015) Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
- Dulac-Arnold et al. (2021) Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419–2468, 2021.
- Enders et al. (2023) Enders, T., Harrison, J., Pavone, M., and Schiffer, M. Hybrid multi-agent deep reinforcement learning for autonomous mobility on demand systems. In Learning for Dynamics and Control Conference, pp. 1284–1296. PMLR, 2023.
- Farquhar et al. (2020) Farquhar, G., Gustafson, L., Lin, Z., Whiteson, S., Usunier, N., and Synnaeve, G. Growing action spaces. In International Conference on Machine Learning, pp. 3040–3051. PMLR, 2020.
- Fourati & Alouini (2021) Fourati, F. and Alouini, M.-S. Artificial intelligence for satellite communication: A review. Intelligent and Converged Networks, 2(3):213–243, 2021.
- Fourati et al. (2023) Fourati, F., Aggarwal, V., Quinn, C., and Alouini, M.-S. Randomized greedy learning for non-monotone stochastic submodular maximization under full-bandit feedback. In International Conference on Artificial Intelligence and Statistics, pp. 7455–7471. PMLR, 2023.
- Fourati et al. (2024a) Fourati, F., Alouini, M.-S., and Aggarwal, V. Federated combinatorial multi-agent multi-armed bandits. arXiv preprint arXiv:2405.05950, 2024a.
- Fourati et al. (2024b) Fourati, F., Quinn, C. J., Alouini, M.-S., and Aggarwal, V. Combinatorial stochastic-greedy bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 12052–12060, 2024b.
- Gonzalez et al. (2023) Gonzalez, G., Balakuntala, M., Agarwal, M., Low, T., Knoth, B., Kirkpatrick, A. W., McKee, J., Hager, G., Aggarwal, V., Xue, Y., et al. Asap: A semi-autonomous precise system for telesurgery during communication delays. IEEE Transactions on Medical Robotics and Bionics, 5(1):66–78, 2023.
- Haliem et al. (2021) Haliem, M., Mani, G., Aggarwal, V., and Bhargava, B. A distributed model-free ride-sharing approach for joint matching, pricing, and dispatching using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 22(12):7931–7942, 2021.
- Hasselt (2010) Hasselt, H. Double q-learning. Advances in neural information processing systems, 23, 2010.
- He et al. (2015) He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., and Ostendorf, M. Deep reinforcement learning with an unbounded action space. arXiv preprint arXiv:1511.04636, 5, 2015.
- He et al. (2016a) He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., and Ostendorf, M. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1621–1630, Berlin, Germany, August 2016a. Association for Computational Linguistics. doi: 10.18653/v1/P16-1153. URL https://aclanthology.org/P16-1153.
- He et al. (2016b) He, J., Ostendorf, M., He, X., Chen, J., Gao, J., Li, L., and Deng, L. Deep reinforcement learning with a combinatorial action space for predicting popular Reddit threads. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1838–1848, Austin, Texas, November 2016b. Association for Computational Linguistics. doi: 10.18653/v1/D16-1189. URL https://aclanthology.org/D16-1189.
- Hill et al. (2018) Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.
- Ireland & Montana (2024) Ireland, D. and Montana, G. Revalued: Regularised ensemble value-decomposition for factorisable markov decision processes. arXiv preprint arXiv:2401.08850, 2024.
- Jaakkola et al. (1993) Jaakkola, T., Jordan, M., and Singh, S. Convergence of stochastic iterative dynamic programming algorithms. Advances in neural information processing systems, 6, 1993.
- Kalashnikov et al. (2018) Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
- Kim et al. (2021) Kim, M., Park, J., et al. Learning collaborative policies to solve np-hard routing problems. Advances in Neural Information Processing Systems, 34:10418–10430, 2021.
- Lagoudakis & Parr (2003) Lagoudakis, M. G. and Parr, R. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 424–431, 2003.
- Lan et al. (2020) Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin q-learning: Controlling the estimation bias of q-learning. arXiv preprint arXiv:2002.06487, 2020.
- Li et al. (2022) Li, S., Wei, C., and Wang, Y. Combining decision making and trajectory planning for lane changing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 23(9):16110–16136, 2022.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Luong et al. (2019) Luong, N. C., Hoang, D. T., Gong, S., Niyato, D., Wang, P., Liang, Y.-C., and Kim, D. I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Communications Surveys & Tutorials, 21(4):3133–3174, 2019.
- Mahajan et al. (2021) Mahajan, A., Samvelyan, M., Mao, L., Makoviychuk, V., Garg, A., Kossaifi, J., Whiteson, S., Zhu, Y., and Anandkumar, A. Reinforcement learning in factored action spaces using tensor decompositions. arXiv preprint arXiv:2110.14538, 2021.
- Mazyavkina et al. (2021) Mazyavkina, N., Sviridov, S., Ivanov, S., and Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
- Metz et al. (2017) Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035, 2017.
- Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. PMLR, 2016.
- Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
- Pazis & Parr (2011) Pazis, J. and Parr, R. Generalized value functions for large action sets. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1185–1192, 2011.
- Peng et al. (2021) Peng, B., Rashid, T., Schroeder de Witt, C., Kamienny, P.-A., Torr, P., Böhmer, W., and Whiteson, S. Facmac: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021.
- Quillen et al. (2018) Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., and Levine, S. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. IEEE, 2018.
- Rubinstein (1999) Rubinstein, R. The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1:127–190, 1999.
- Rummery & Niranjan (1994) Rummery, G. A. and Niranjan, M. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
- Sallans & Hinton (2004) Sallans, B. and Hinton, G. E. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063–1088, 2004.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Seyde et al. (2021) Seyde, T., Gilitschenski, I., Schwarting, W., Stellato, B., Riedmiller, M., Wulfmeier, M., and Rus, D. Is bang-bang control all you need? solving continuous control with bernoulli policies. Advances in Neural Information Processing Systems, 34:27209–27221, 2021.
- Seyde et al. (2022) Seyde, T., Werner, P., Schwarting, W., Gilitschenski, I., Riedmiller, M., Rus, D., and Wulfmeier, M. Solving continuous control via q-learning. arXiv preprint arXiv:2210.12566, 2022.
- Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
- Tang & Agrawal (2020) Tang, Y. and Agrawal, S. Discretizing continuous action space for on-policy optimization. In Proceedings of the aaai conference on artificial intelligence, volume 34, pp. 5981–5988, 2020.
- Tavakoli et al. (2018) Tavakoli, A., Pardo, F., and Kormushev, P. Action branching architectures for deep reinforcement learning. In Proceedings of the AAAI conference on Artificial Intelligence, volume 32, 2018.
- Tennenholtz & Mannor (2019) Tennenholtz, G. and Mannor, S. The natural language of actions. In International Conference on Machine Learning, pp. 6196–6205. PMLR, 2019.
- Tessler et al. (2019) Tessler, C., Zahavy, T., Cohen, D., Mankowitz, D. J., and Mannor, S. Action assembly: Sparse imitation learning for text based games with combinatorial action spaces. arXiv preprint arXiv:1905.09700, 2019.
- Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026–5033. IEEE, 2012.
- Van de Wiele et al. (2020) Van de Wiele, T., Warde-Farley, D., Mnih, A., and Mnih, V. Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116, 2020.
- Van Hasselt & Wiering (2009) Van Hasselt, H. and Wiering, M. A. Using continuous action spaces to solve discrete problems. In 2009 International Joint Conference on Neural Networks, pp. 1149–1156. IEEE, 2009.
- Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
- Wang et al. (2020) Wang, G., Shi, D., Xue, C., Jiang, H., and Wang, Y. Bic-ddpg: Bidirectionally-coordinated nets for deep multi-agent reinforcement learning. In International Conference on Collaborative Computing: Networking, Applications and Worksharing, pp. 337–354. Springer, 2020.
- Wang et al. (2021) Wang, H., Lin, S., and Zhang, J. Adaptive ensemble q-learning: Minimizing estimation bias via error feedback. Advances in Neural Information Processing Systems, 34:24778–24790, 2021.
- Wang et al. (2022) Wang, X., Wang, S., Liang, X., Zhao, D., Huang, J., Xu, X., Dai, B., and Miao, Q. Deep reinforcement learning: a survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8:279–292, 1992.
- Zahavy et al. (2018) Zahavy, T., Haroush, M., Merlis, N., Mankowitz, D. J., and Mannor, S. Learn what not to learn: Action elimination with deep reinforcement learning. Advances in neural information processing systems, 31, 2018.
- Zhang et al. (2020) Zhang, T., Guo, S., Tan, T., Hu, X., and Chen, F. Generating adjacency-constrained subgoals in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 33:21579–21590, 2020.
- Zhang et al. (2017) Zhang, Z., Pan, Z., and Kochenderfer, M. J. Weighted double q-learning. In IJCAI, pp. 3455–3461, 2017.
Appendix A Stochastic Q-learning Convergence Proofs
In this section, we prove Theorem 6.2, which states the convergence of Stochastic Q-learning. This algorithm uses a stochastic policy for action selection, employing a $\operatorname{stoch\,arg\,max}$ with or without memory, possibly dependent on the current state $s$. For value updates, it utilizes a $\operatorname{stoch\,max}$ without memory, independent of the following state $s'$.
A.1 Proof of Theorem 6.2
Proof.
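Stochastic Q-learning employs a stochastic policy in a given state $s$, which uses the $\operatorname{stoch\,arg\,max}$ operation, with or without memory $\mathcal{M}$, with probability $(1 - \varepsilon_s)$, for $\varepsilon_s > 0$, which can be summarized by the following equation:
$$\pi_Q^S(s) = \begin{cases} \text{play randomly} & \text{with probability } \varepsilon_s, \\ \operatorname{stoch\,arg\,max}_{a \in \mathcal{A}} Q(s, a) & \text{otherwise.} \end{cases} \tag{15}$$
This policy, with $\varepsilon_s > 0$, ensures that $\Pr_\pi[a_t = a \mid s_t = s] > 0$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.
Furthermore, during training, to update the Q-function, given any initial estimate $Q_0$, we consider a Stochastic Q-learning that uses the $\operatorname{stoch\,max}$ operation as in the following stochastic update rule:
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_t + \gamma \operatorname{stoch\,max}_{b \in \mathcal{A}} Q_t(s_{t+1}, b) \right]. \tag{16}$$
For the function updates, we consider a $\operatorname{stoch\,max}$ without memory, which involves a $\max$ over a random subset of actions $\mathcal{C}$ sampled from a set probability distribution $\mathbb{P}$ defined over the combinatorial space of actions, i.e., $\mathbb{P}: 2^{\mathcal{A}} \to [0, 1]$, which can be a uniform distribution over the action sets of size $\lceil \log(n) \rceil$.
Hence, for a random subset of actions $\mathcal{C}$, the update rule of Stochastic Q-learning can be written as:
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_t + \gamma \max_{b \in \mathcal{C}} Q_t(s_{t+1}, b) \right]. \tag{17}$$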
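We define an optimal Q-function, denoted as $Q^*$, as follows:
$$Q^*(s, a) = \mathbb{E}\left[ r(s, a) + \gamma \max_{b \in \mathcal{C} \sim \mathbb{P}} Q^*(s', b) \,\middle|\, s, a \right] \tag{18}$$
$$\phantom{Q^*(s, a)} = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \max_{b \in \mathcal{C}} Q^*(s', b). \tag{19}$$
Subtracting $Q^*(s_t, a_t)$ from both sides of the update rule and letting
$$\Delta_t(s_t, a_t) = Q_t(s_t, a_t) - Q^*(s_t, a_t) \tag{20}$$
yields
$$\Delta_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, \Delta_t(s_t, a_t) + \alpha_t(s_t, a_t)\, F_t(s_t, a_t), \tag{21}$$
with
$$F_t(s_t, a_t) = r_t + \gamma \max_{b \in \mathcal{C}} Q_t(s_{t+1}, b) - Q^*(s_t, a_t). \tag{22}$$
For the transition probability distribution $\mathcal{P}$, the set probability distribution $\mathbb{P}$, the reward function $r$, and the discount factor $\gamma$, we define the following contraction operator $\Phi$, defined for a function $q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ as
$$(\Phi q)(s, a) = \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in \mathcal{C}} q(s', b) \right]. \tag{23}$$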
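Therefore, with $\mathcal{F}_t$ representing the past at time step $t$,
$$\mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] = \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in \mathcal{C}} Q_t(s', b) - Q^*(s, a) \right] = (\Phi Q_t)(s, a) - Q^*(s, a).$$
Using the fact that $Q^* = \Phi Q^*$,
$$\mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] = (\Phi Q_t)(s, a) - (\Phi Q^*)(s, a).$$
It is now immediate from Lemma 6.4, which we prove in Appendix A.2, that
$$\left\| \mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] \right\|_\infty \le \gamma \| Q_t - Q^* \|_\infty = \gamma \| \Delta_t \|_\infty. \tag{24}$$
Moreover,
$$\begin{aligned} \operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right] &= \mathbb{E}\left[ \left( r(s, a) + \gamma \max_{b \in \mathcal{C}} Q_t(s', b) - Q^*(s, a) - \left( (\Phi Q_t)(s, a) - Q^*(s, a) \right) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\ &= \mathbb{E}\left[ \left( r(s, a) + \gamma \max_{b \in \mathcal{C}} Q_t(s', b) - (\Phi Q_t)(s, a) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\ &= \operatorname{var}\left[ r(s, a) + \gamma \max_{b \in \mathcal{C}} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] \\ &= \operatorname{var}\left[ r(s, a) \mid \mathcal{F}_t \right] + \gamma^2 \operatorname{var}\left[ \max_{b \in \mathcal{C}} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right]. \end{aligned}$$
The last line follows from the fact that the randomness of $\max_{b \in \mathcal{C}} Q_t(s', b)$ only depends on the random set $\mathcal{C}$ and the next state $s'$. Moreover, we consider the reward $r(s, a)$ independent of the set $\mathcal{C}$ and the next state $s'$, by not using the same set $\mathcal{C}$ for both the action selection and the value update.
Given that $r$ is bounded, its variance is bounded by some constant $B$. Therefore,
$$\begin{aligned} \operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right] &\le B + \gamma^2 \operatorname{var}\left[ \max_{b \in \mathcal{C}} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] \\ &\le B + \gamma^2\, \mathbb{E}\left[ \left( \max_{b \in \mathcal{C}} Q_t(s', b) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\ &\le B + \gamma^2 \| Q_t \|_\infty^2 = B + \gamma^2 \| \Delta_t + Q^* \|_\infty^2 \\ &\le B + 2 \gamma^2 \| Q^* \|_\infty^2 + 2 \gamma^2 \| \Delta_t \|_\infty^2. \end{aligned}$$
Therefore, for constant $C = \max\{ B + 2 \gamma^2 \| Q^* \|_\infty^2,\; 2 \gamma^2 \}$,
$$\operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right] \le C \left( 1 + \| \Delta_t \|_\infty^2 \right) \le C \left( 1 + \| \Delta_t \|_\infty \right)^2. \tag{25}$$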
A.2 Proof of Lemma 6.4
Proof.
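For the transition probability distribution $\mathcal{P}$, the set probability distribution $\mathbb{P}$ defined over the combinatorial space of actions, i.e., $\mathbb{P}: 2^{\mathcal{A}} \to [0, 1]$, the reward function $r$, and the discount factor $\gamma$, for a function $q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, the operator $\Phi$ is defined as follows:
$$(\Phi q)(s, a) = \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in \mathcal{C}} q(s', b) \right]. \tag{26}$$
Therefore,
$$\begin{aligned} \| \Phi q_1 - \Phi q_2 \|_\infty &= \max_{s, a} \left| \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, \gamma \left[ \max_{b \in \mathcal{C}} q_1(s', b) - \max_{b \in \mathcal{C}} q_2(s', b) \right] \right| \\ &\le \gamma \max_{s, a} \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \left| \max_{b \in \mathcal{C}} q_1(s', b) - \max_{b \in \mathcal{C}} q_2(s', b) \right| \\ &\le \gamma \max_{s, a} \sum_{\mathcal{C} \in 2^{\mathcal{A}}} \mathbb{P}(\mathcal{C}) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \max_{b \in \mathcal{C}} \left| q_1(s', b) - q_2(s', b) \right| \\ &\le \gamma \| q_1 - q_2 \|_\infty. \end{aligned}$$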
∎
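The contraction property can also be checked numerically on a small random MDP. The sketch below is illustrative only: the MDP sizes, the uniform distribution over action subsets of a fixed size, and all names (`make_phi`, `sup_norm`, `random_table`) are assumptions chosen for the example, not part of the original analysis.

```python
import itertools
import random

def make_phi(P, R, gamma, subsets):
    """Build the operator Phi for a uniform distribution over the given action subsets."""
    n_s, n_a = len(R), len(R[0])
    p_c = 1.0 / len(subsets)  # probability of each subset under the uniform assumption
    def phi(q):
        out = [[0.0] * n_a for _ in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                for C in subsets:
                    for s2 in range(n_s):
                        best = max(q[s2][b] for b in C)
                        out[s][a] += p_c * P[s][a][s2] * (R[s][a] + gamma * best)
        return out
    return phi

def sup_norm(q1, q2):
    """Sup-norm distance between two Q-tables."""
    return max(abs(x - y) for row1, row2 in zip(q1, q2) for x, y in zip(row1, row2))

rng = random.Random(0)
n_s, n_a, k, gamma = 4, 5, 2, 0.9

def random_table(scale):
    return [[rng.uniform(-scale, scale) for _ in range(n_a)] for _ in range(n_s)]

# Random transition kernel P[s][a][s'], normalized over s'.
P = []
for s in range(n_s):
    P.append([])
    for a in range(n_a):
        w = [rng.random() for _ in range(n_s)]
        total = sum(w)
        P[s].append([x / total for x in w])
R = random_table(1.0)
subsets = list(itertools.combinations(range(n_a), k))  # all subsets of size k
phi = make_phi(P, R, gamma, subsets)

# Ratio ||Phi q1 - Phi q2|| / ||q1 - q2|| for random pairs of Q-tables.
ratios = []
for _ in range(20):
    q1, q2 = random_table(10.0), random_table(10.0)
    ratios.append(sup_norm(phi(q1), phi(q2)) / sup_norm(q1, q2))
```

Every entry of `ratios` should be at most `gamma`, up to floating-point error. A numerical check of this kind does not replace the proof; it only guards against sign or indexing mistakes when implementing the operator.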
Appendix B Stochastic Maximization
We analyze the proposed stochastic maximization method by comparing its error to that of exact maximization. First, we consider the case without memory, where $\mathcal{C} = \mathcal{R}$, and then the case with memory, where $\mathcal{C} = \mathcal{R} \cup \mathcal{M}$. Finally, we provide a specialized bound for the case where the action values follow a uniform distribution.
B.1 Memoryless Stochastic Maximization
In the following lemma, we give a lower bound on the probability of finding an optimal action within a uniformly sampled subset of actions. We prove that for a given state $s$, the probability $p$ of sampling an optimal action within the uniformly randomly sampled subset $\mathcal{R}$ of size $\lceil \log(n) \rceil$ actions is lower bounded by $\lceil \log(n) \rceil / n$.
B.1.1 Proof of Lemma 5.1
Proof.
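In the presence of multiple maximizers, we focus on one of them, denoted as $a_s^*$, and then the probability $p$ of sampling at least one maximizer is lower-bounded by the probability of finding $a_s^*$, i.e., $p \ge \Pr(a_s^* \in \mathcal{R})$.
The probability of finding $a_s^*$ is the probability of sampling $a_s^*$ within the random set $\mathcal{R}$ of size $\lceil \log(n) \rceil$, which is the fraction of all possible combinations of size $\lceil \log(n) \rceil$ that include $a_s^*$.
This fraction can be calculated as $\binom{n-1}{\lceil \log(n) \rceil - 1}$ divided by all possible combinations of size $\lceil \log(n) \rceil$, which is $\binom{n}{\lceil \log(n) \rceil}$.
Therefore, $\Pr(a_s^* \in \mathcal{R}) = \binom{n-1}{\lceil \log(n) \rceil - 1} \Big/ \binom{n}{\lceil \log(n) \rceil} = \lceil \log(n) \rceil / n$.
Consequently,
$$p \ge \frac{\lceil \log(n) \rceil}{n}. \tag{27}$$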
∎
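The counting argument in this proof can be verified exactly with integer arithmetic. The sketch below is a sanity check under one assumption that the text does not fix: the logarithm is taken in base 2 when computing the subset size.

```python
from fractions import Fraction
from math import ceil, comb, log2

def prob_contains(n, k):
    """P(a fixed action lies in a uniformly sampled k-subset of n actions)."""
    return Fraction(comb(n - 1, k - 1), comb(n, k))

# Compare the combinatorial ratio with k / n for a few action-space sizes.
checks = {}
for n in (4, 256, 1000, 4096):
    k = ceil(log2(n))  # assumed base-2 logarithm
    checks[n] = (prob_contains(n, k), Fraction(k, n))
```

For each `n`, the two fractions in `checks[n]` coincide, which is the identity $\binom{n-1}{k-1} / \binom{n}{k} = k/n$ used above.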
B.2 Stochastic Maximization with Memory
While stochastic maximization without memory could approach the maximum value or find it with the probability $p$, lower-bounded in Lemma 5.1, it never converges to an exact maximization, as it keeps sampling purely at random, as can be seen in Fig. 6. However, stochastic maximization with memory can become an exact maximization when the Q-function becomes stable, which we prove in the following corollary.
Definition B.1.
A Q-function is considered stable for a given state if its best action in that state remains unchanged for all subsequent steps, even if the Q-function’s values themselves change.
A straightforward example of a stable Q-function occurs during validation periods when no function updates are performed. However, in general, a stable Q-function does not have to be static and might still vary over the rounds; the key characteristic is that its maximizing action remains the same even when its values are updated. Although the $\operatorname{stoch\,max}$ has sub-linear complexity compared to the $\max$, without any assumption on the value distributions, the following corollary shows that, on average, for a stable Q-function, after a certain number of iterations, the output of the $\operatorname{stoch\,max}$ matches exactly the output of $\max$.
B.2.1 Proof of Corollary 5.3
Proof.
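We formalize the problem as a geometric distribution where the success event is the event of sampling a subset of size $\lceil \log(n) \rceil$ that includes at least one maximizer. The geometric distribution gives the probability that the first subset containing an optimal action is obtained after a given number of independent calls, each with success probability $p$. From Lemma 5.1, we have $p \ge \lceil \log(n) \rceil / n$. Therefore, on average, success requires $1/p \le n / \lceil \log(n) \rceil$ calls.
For a given discrete state $s$, $\mathcal{M}$ keeps track of the most recent best action found. For $\mathcal{C} = \mathcal{R} \cup \mathcal{M}$,
$$\operatorname{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a) = \max_{a \in \mathcal{C}} Q_t(s, a) \ge \max_{a \in \mathcal{M}} Q_t(s, a). \tag{28}$$
Therefore, for a given state $s$, on average, if the Q-function is stable, then within $n / \lceil \log(n) \rceil$ time steps, $\mathcal{M}$ will contain the optimal action $a_t^*$. Therefore, on average, after $n / \lceil \log(n) \rceil$ time steps,
$$\operatorname{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a) \ge \max_{a \in \mathcal{M}} Q_t(s, a) = \max_{a \in \mathcal{A}} Q_t(s, a).$$
We know that $\operatorname{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a) \le \max_{a \in \mathcal{A}} Q_t(s, a)$. Therefore, for a stable Q-function, on average, after $n / \lceil \log(n) \rceil$ time steps, $\operatorname{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a)$ becomes $\max_{a \in \mathcal{A}} Q_t(s, a)$. ∎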
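The geometric waiting-time argument can be illustrated with a small simulation. This is a sketch under stated assumptions: a single maximizer (so the bound holds with equality), uniform sampling without replacement, a base-2 logarithm for the subset size, and hypothetical names such as `calls_until_hit`.

```python
import random

def calls_until_hit(n, k, target, rng):
    """Number of independent k-subset draws until `target` is sampled."""
    calls = 1
    while target not in rng.sample(range(n), k):
        calls += 1
    return calls

n, k = 64, 6  # k = ceil(log2(64)) under the base-2 assumption
rng = random.Random(0)
trials = 5000

# Empirical mean number of calls before the (unique) best action is drawn.
mean_calls = sum(calls_until_hit(n, k, target=0, rng=rng) for _ in range(trials)) / trials
bound = n / k  # expected number of calls when there is exactly one maximizer
```

With a single maximizer the success probability is exactly `k / n`, so `mean_calls` should be close to `bound`. With several maximizers the success probability is larger and the average number of calls falls below the bound.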
B.3 Stochastic Maximization with Uniformly Distributed Rewards
While the above corollary gives an upper bound on the average number of calls needed to eventually determine the exact optimal action, the following lemma offers insight into the expected maximum value of a randomly sampled subset of actions, comprising $\lceil \log(n) \rceil$ elements, when their values are uniformly distributed.
Lemma B.2.
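For a given state $s$ and a uniformly randomly sampled subset $\mathcal{R}$ of size $\lceil \log(n) \rceil$ actions, if the values of the sampled actions independently follow a uniform distribution in the interval $[Q_t(s, a_t^*) - b_t(s),\, Q_t(s, a_t^*)]$, then the expected value of the maximum Q-function within this random subset is:
$$\mathbb{E}\left[ \max_{k \in \mathcal{R}} Q_t(s, k) \,\middle|\, s, a_t^* \right] = Q_t(s, a_t^*) - \frac{b_t(s)}{\lceil \log(n) \rceil + 1}. \tag{29}$$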
Proof.
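For a given state $s$, we assume a uniformly randomly sampled subset $\mathcal{R}$ of size $\lceil \log(n) \rceil$ actions, and the values of the sampled actions are independent and follow a uniform distribution in the interval $[Q_t(s, a_t^*) - b_t(s),\, Q_t(s, a_t^*)]$. Therefore, the cumulative distribution function (CDF) for the value of an action $a$ given the state $s$ and the optimal action $a_t^*$ is:
$$G(y; s, a) = \begin{cases} 0 & \text{for } y < Q_t(s, a_t^*) - b_t(s), \\ \dfrac{y - (Q_t(s, a_t^*) - b_t(s))}{b_t(s)} & \text{for } y \in [Q_t(s, a_t^*) - b_t(s),\, Q_t(s, a_t^*)], \\ 1 & \text{for } y > Q_t(s, a_t^*). \end{cases}$$
We define the variable $x = \left( y - (Q_t(s, a_t^*) - b_t(s)) \right) / b_t(s)$, so that the rescaled value $x_a$ of each action has CDF $F(x; s, a) = x$ for $x \in [0, 1]$.
If we select $\lceil \log(n) \rceil$ such actions, the CDF of the maximum of these actions, denoted as $F_{\max}$, is the following:
$$\begin{aligned} F_{\max}(x; s) &= \Pr\left( \max_{a \in \mathcal{R}} x_a \le x \right) \\ &= \prod_{a \in \mathcal{R}} \Pr\left( x_a \le x \right) \\ &= \prod_{a \in \mathcal{R}} F(x; s, a) \\ &= F(x; s, a)^{\lceil \log(n) \rceil}. \end{aligned}$$
The second line follows from the independence of the values, and the last line follows from the assumption that all actions follow the same uniform distribution.
The CDF of the maximum is therefore given by:
$$F_{\max}(x; s) = \begin{cases} 0 & \text{for } x < 0, \\ x^{\lceil \log(n) \rceil} & \text{for } x \in [0, 1], \\ 1 & \text{for } x > 1. \end{cases}$$
Now, we can determine the desired expected value as
$$\mathbb{E}\left[ \max_{a \in \mathcal{R}} x_a \right] = \int_0^1 x \, dF_{\max}(x; s) = \int_0^1 \left( 1 - F_{\max}(x; s) \right) dx = \int_0^1 \left( 1 - x^{\lceil \log(n) \rceil} \right) dx = 1 - \frac{1}{\lceil \log(n) \rceil + 1}.$$
We employed the identity $\int_0^1 x \, d\mu(x) = \int_0^1 (1 - \mu(x)) \, dx$, which can be demonstrated through integration by parts. To return to the original scale, we can first multiply by $b_t(s)$ and then add $Q_t(s, a_t^*) - b_t(s)$, resulting in:
$$\mathbb{E}\left[ \max_{k \in \mathcal{R}} Q_t(s, k) \,\middle|\, s, a_t^* \right] = Q_t(s, a_t^*) - \frac{b_t(s)}{\lceil \log(n) \rceil + 1}.$$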
As an example of this setting, for , , for a setting with actions, . Hence the . This shows that even with a randomly sampled set of actions , the can be close to the max. We simulate this setting in the experiments in Fig. 6.
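The closed form for the expected maximum of independent uniform values can be compared against a Monte Carlo estimate. The numbers below are illustrative choices, not values taken from the paper, and the function names are hypothetical.

```python
import random

def expected_max_uniform(lo, hi, k):
    """Closed form for E[max of k i.i.d. Uniform(lo, hi)] = hi - (hi - lo) / (k + 1)."""
    return hi - (hi - lo) / (k + 1)

def simulated_max_uniform(lo, hi, k, trials, rng):
    """Monte Carlo estimate of the same expectation."""
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(lo, hi) for _ in range(k))
    return total / trials

rng = random.Random(0)
lo, hi, k = 0.0, 100.0, 10  # illustrative interval and subset size
closed_form = expected_max_uniform(lo, hi, k)
estimate = simulated_max_uniform(lo, hi, k, trials=20000, rng=rng)
```

With these settings the closed form is `100 - 100 / 11`, about 90.9, so the expected gap to the upper end of the interval shrinks as `1 / (k + 1)` of the interval width.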
Our proposed stochastic maximization does not solely rely on the randomly sampled subset of actions $\mathcal{R}$ but also considers actions from previous experiences through $\mathcal{M}$. Therefore, the expected $\operatorname{stoch\,max}$ should be higher than the above result, providing an upper bound on the expected $\beta_t$ as described in the following corollary of Lemma B.2.
Corollary B.3.
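For a given discrete state $s$, if the values of the sampled actions independently follow a uniform distribution in the interval $[Q_t(s, a_t^*) - b_t(s),\, Q_t(s, a_t^*)]$, then the expected value of $\beta_t(s)$ satisfies:
$$\mathbb{E}\left[ \beta_t(s) \mid s \right] \le \frac{b_t(s)}{\lceil \log(n) \rceil + 1}. \tag{30}$$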
Proof.
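At time step $t$, given a state $s$ and the current estimated Q-function $Q_t$, $\beta_t(s)$ is defined as follows:
$$\beta_t(s) = \max_{a \in \mathcal{A}} Q_t(s, a) - \operatorname{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a). \tag{31}$$
For a given state $s$, a uniformly randomly sampled subset $\mathcal{R}$ of size $\lceil \log(n) \rceil$ actions, and a subset of some previously played actions $\mathcal{M} \subset \mathcal{E}$, using the law of total expectation,
$$\begin{aligned} \mathbb{E}\left[ \beta_t(s) \mid s \right] &= \mathbb{E}\left[ \mathbb{E}\left[ \beta_t(s) \mid s, a_t^* \right] \mid s \right] \\ &= \mathbb{E}\left[ Q_t(s, a_t^*) - \mathbb{E}\left[ \max_{k \in \mathcal{R} \cup \mathcal{M}} Q_t(s, k) \,\middle|\, s, a_t^* \right] \,\middle|\, s \right] \\ &\le \mathbb{E}\left[ Q_t(s, a_t^*) - \mathbb{E}\left[ \max_{k \in \mathcal{R}} Q_t(s, k) \,\middle|\, s, a_t^* \right] \,\middle|\, s \right]. \end{aligned}$$
Therefore, by Lemma B.2:
$$\mathbb{E}\left[ \beta_t(s) \mid s \right] \le \mathbb{E}\left[ Q_t(s, a_t^*) - \left( Q_t(s, a_t^*) - \frac{b_t(s)}{\lceil \log(n) \rceil + 1} \right) \,\middle|\, s \right] = \frac{b_t(s)}{\lceil \log(n) \rceil + 1}.$$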
∎
Appendix C Pseudocodes
Appendix D Experimental Details
D.1 Environments
We test our proposed algorithms on a standardized set of environments using open-source libraries. We compare stochastic maximization to exact maximization and evaluate the proposed stochastic RL algorithms on Gymnasium environments (Brockman et al., 2016). Stochastic Q-learning and Stochastic Double Q-learning are tested on the CliffWalking-v0, the FrozenLake-v1, and a generated MDP environment, while stochastic deep Q-learning approaches are tested on MuJoCo control tasks (Todorov et al., 2012).
D.1.1 Environments with Discrete States and Actions
We generate an MDP environment with 3 states and 256 actions, whose rewards follow a normal distribution with mean -50 and standard deviation 50. Furthermore, while our approach is designed for large discrete action spaces, we also tested it in Gymnasium environments (Brockman et al., 2016) with only four discrete actions, namely CliffWalking-v0 and FrozenLake-v1. CliffWalking-v0 involves navigating a grid world from the starting point to the destination without falling off a cliff. FrozenLake-v1 requires moving from the starting point to the goal without stepping into any holes on the frozen surface, which can be challenging due to the slippery nature of the ice.
D.1.2 Environments with Continuous States: Discretizing Control Tasks
We test the stochastic deep Q-learning approaches on MuJoCo (Todorov et al., 2012) control tasks, which have continuous states and whose actions we discretize. We discretize each action dimension into $i$ equally spaced values, creating a discrete action space with $n = i^d$ $d$-dimensional actions. We mainly focus on the inverted pendulum and the half-cheetah. The inverted pendulum involves a cart that can be moved left or right to balance a pole on top using a 1D force; with $i = 512$, this results in 512 actions. The half-cheetah is a robot with nine body parts that aims to maximize forward speed. It can apply torque to 6 joints, giving 6D actions; with $i = 4$, this results in $4^6 = 4096$ actions.
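The discretization described above can be sketched as a Cartesian product of per-dimension grids. This is a minimal illustration, not the implementation used in the experiments; the function name `discretize_actions` is hypothetical, and the action bounds ([-3, 3] for the inverted pendulum and [-1, 1] for the half-cheetah) are assumed from the standard Gymnasium task definitions rather than taken from the text.

```python
import itertools


def discretize_actions(low, high, bins, dims):
    """Return every joint action on a grid with `bins` equally spaced
    values per dimension, i.e. bins ** dims tuples of length dims."""
    if bins < 2:
        raise ValueError("need at least two values per dimension")
    step = (high - low) / (bins - 1)
    levels = [low + k * step for k in range(bins)]
    return list(itertools.product(levels, repeat=dims))


# Assumed bounds; the grid sizes follow the counts stated in the text.
pendulum_actions = discretize_actions(-3.0, 3.0, bins=512, dims=1)
cheetah_actions = discretize_actions(-1.0, 1.0, bins=4, dims=6)
print(len(pendulum_actions), len(cheetah_actions))
```

Enumerating the grid explicitly is only practical for moderate sizes; for larger spaces one would decode a flat action index into its coordinates on demand instead of materializing the list.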
D.2 Algorithms
D.2.1 Stochastic Maximization
We have two scenarios, one for discrete and the other for continuous states. For discrete states, the memory $\mathcal{M}$ is a dictionary whose keys are the states in $\mathcal{S}$ and whose values are the latest played actions in each state. For continuous states, in contrast, $\mathcal{M}$ comprises the actions in the replay buffer. In neither case do we consider the whole set $\mathcal{M}$; instead, we only consider a subset of it. For discrete states, for a given state $s$, this subset includes the latest two exploited actions in state $s$. For continuous states, where it is impossible to retain the last exploited action for each state, we consider a randomly sampled subset that includes $\lceil \log(n) \rceil$ actions, even though they were played in different states. In the experiments involving continuous states, we demonstrate that this is sufficient to achieve good results (see Section 7.3).
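For the discrete-state case, the combination of a random subset with a small memory can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `stoch_argmax` is hypothetical, the logarithm is taken in base 2, the Q-values are held fixed so that the effect of the memory is visible, and the 256 values are drawn from a normal distribution only to mimic the generated MDP.

```python
import math
import random


def stoch_argmax(q_row, memory, rng):
    """Argmax of q_row over a random subset of actions plus remembered ones.

    q_row:  list of Q-values for one state, indexed by action.
    memory: previously exploited action indices for that state.
    """
    n = len(q_row)
    k = max(1, math.ceil(math.log2(n)))  # assumed base-2 logarithm
    candidates = set(rng.sample(range(n), k)) | set(memory)
    return max(candidates, key=lambda a: q_row[a])


rng = random.Random(0)
q_row = [rng.gauss(-50.0, 50.0) for _ in range(256)]

memory = []   # latest two exploited actions in this state, newest last
history = []  # value returned by each stochastic maximization
for _ in range(200):
    action = stoch_argmax(q_row, memory, rng)
    memory = (memory + [action])[-2:]
    history.append(q_row[action])
```

Because the remembered actions are always among the candidates, the returned value never decreases while the Q-values stay fixed, and each call evaluates only about $\lceil \log_2(n) \rceil + 2$ entries instead of all $n$.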
D.2.2 Tabular Q-learning Methods
We use the same training parameters for all the Q-learning variants, following hyper-parameters similar to those in (Hasselt, 2010). We set the discount factor to 0.95 and apply a dynamical polynomial learning rate that depends on the number of times the state-action pair has been visited, a count initially set to one for all pairs. For the exploration rate, we use a decaying value that depends on the number of times the state has been visited, a count initially set to one for all states. For Double Q-learning, the visit count used is that of whichever value function is being updated, with two separate counters storing the number of updates for each action for the corresponding value function. We averaged the results over ten repetitions. For Stochastic Q-learning, we track a dictionary whose keys are the states and whose values are the latest exploited actions. Thus, for a state $s$, the memory is the latest exploited action in that same state.
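Visit-count-based schedules of this kind can be sketched as below. The exact functional forms are not recoverable from the text here, so this sketch assumes the forms used in (Hasselt, 2010): a polynomial learning rate equal to one over the pair's visit count raised to the power 0.8, and an exploration rate equal to one over the square root of the state's visit count. The class and method names are hypothetical.

```python
from collections import defaultdict


class VisitSchedules:
    """Learning-rate and exploration schedules driven by visit counts."""

    def __init__(self, power=0.8):
        self.power = power  # assumed exponent of the polynomial rate
        # All counts start at one, as stated in the text.
        self.pair_visits = defaultdict(lambda: 1)
        self.state_visits = defaultdict(lambda: 1)

    def learning_rate(self, state, action):
        """Assumed form: 1 / (visits of the pair) ** power."""
        return 1.0 / self.pair_visits[(state, action)] ** self.power

    def epsilon(self, state):
        """Assumed form: 1 / sqrt(visits of the state)."""
        return 1.0 / self.state_visits[state] ** 0.5

    def record(self, state, action):
        """Register one visit of the pair and of its state."""
        self.pair_visits[(state, action)] += 1
        self.state_visits[state] += 1


schedules = VisitSchedules()
for _ in range(3):
    schedules.record(state=0, action=1)
print(schedules.learning_rate(0, 1), schedules.epsilon(0))
```

For Double Q-learning, one such counter per value function would be kept, and the learning rate would be read from the counter of the function being updated.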
D.2.3 Deep Q-network Methods
We use the same training parameters for all the deep Q-learning variants. We set the discount factor to 0.99 and the learning rate to 0.001. Our neural network takes an input whose size equals the sum of the state and action dimensions and has a single output neuron. The network consists of two hidden linear layers, each of size 64 and each followed by a ReLU activation function (Nair & Hinton, 2010). We keep the exploration rate the same for all states, initialize it at 1, and apply a decay factor of 0.995, with a minimum threshold of 0.01. With $n$ denoting the total number of actions, we train the network on stochastic batches sampled uniformly from a replay buffer. We averaged the results over five repetitions. For the stochastic methods, we take the actions in the sampled batch as the memory set $\mathcal{M}$. The batch size is chosen to keep the complexity of Stochastic Q-learning within $\mathcal{O}(\log(n))$.
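The exploration schedule described above (initial value 1, decay factor 0.995, floor 0.01) can be written as a one-line update. The sketch below assumes the decay is applied multiplicatively once per update; whether that update happens per step or per episode is not stated in the text, and the function name is hypothetical.

```python
def decay_epsilon(epsilon, decay=0.995, floor=0.01):
    """Apply one multiplicative decay, never going below the floor."""
    return max(floor, epsilon * decay)


epsilon = 1.0
steps_to_floor = 0
while epsilon > 0.01:
    epsilon = decay_epsilon(epsilon)
    steps_to_floor += 1
print(steps_to_floor)
```

Under these values the floor is reached after 919 applications of the decay, since $0.995^{918} > 0.01 > 0.995^{919}$.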
D.3 Compute and Implementation
We implement the different Q-learning methods using Python 3.9, NumPy 1.23.4, and PyTorch 2.0.1. For proximal policy optimization (PPO) (Schulman et al., 2017), advantage actor-critic (A2C) (Mnih et al., 2016), and deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), we use the implementations of Stable-Baselines (Hill et al., 2018). We measure the training time on an 11th Gen Intel(R) Core(TM) i7-1165G7 CPU @ 2.80GHz 1.69 GHz with 16.0 GB of RAM.
Appendix E Additional Results
E.1 Wall Time Speed
Stochastic maximization methods exhibit logarithmic complexity in the number of actions, as confirmed in Fig. 5(a). Therefore, both StochDQN and StochDDQN, which apply these techniques for action selection and updates, have exponentially faster execution times than DQN and DDQN. This can be seen in Fig. 5(b), which shows the complete step duration for the deep Q-learning methods, including action selection and network update. The proposed methods are nearly as fast as a random algorithm, which samples and selects actions randomly and performs no updates.
E.2 Stochastic Maximization
E.2.1 Stochastic Maximization vs Maximization with Uniform Rewards
In the setting described in Section B.3, with 5000 independently and uniformly distributed action values in the range [0, 100], stochastic maximization without memory, i.e., with $\mathcal{M} = \emptyset$, reaches an average return of around 91 and keeps fluctuating around that value, while stochastic maximization with memory quickly achieves the optimal reward, as shown in Fig. 6.
E.2.2 Stochastic Maximization Analysis
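In this section, we analyze stochastic maximization by tracking the returned values across rounds, together with the ratio $\omega_t$ (Eq. (10)) and the error $\beta_t$ (Eq. (9)), whose definitions we restate here. At time step $t$, given a state $s$ and the current estimated Q-function $Q_t$, we define the non-negative underestimation error $\beta_t(s)$ as follows:
$$\beta_t(s) = \max_{a \in \mathcal{A}} Q_t(s, a) - \operatorname*{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a). \tag{32}$$
Furthermore, we define the ratio $\omega_t(s)$ as follows:
$$\omega_t(s) = \frac{\operatorname*{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a)}{\max_{a \in \mathcal{A}} Q_t(s, a)}. \tag{33}$$
It follows that:
$$\omega_t(s) = 1 - \frac{\beta_t(s)}{\max_{a \in \mathcal{A}} Q_t(s, a)}. \tag{34}$$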
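The relation between the two tracked quantities can be illustrated numerically. The sketch below reads the underestimation error as the exact maximum minus the stochastic maximum and the ratio as their quotient, which is an assumed reading of the definitions above; the function names and the toy Q-values are hypothetical, and the exact maximum is assumed to be nonzero.

```python
def underestimation_error(q_row, candidates):
    """Exact maximum minus the maximum over the candidate subset."""
    exact = max(q_row)
    stoch = max(q_row[a] for a in candidates)
    return exact - stoch


def maximization_ratio(q_row, candidates):
    """Maximum over the candidate subset divided by the exact maximum."""
    exact = max(q_row)
    stoch = max(q_row[a] for a in candidates)
    return stoch / exact


# Toy Q-values for one state; the subset deliberately misses the best action.
q_row = [1.0, 4.0, 2.5, 8.0]
candidates = [0, 2]
beta = underestimation_error(q_row, candidates)
omega = maximization_ratio(q_row, candidates)
print(beta, omega, 1.0 - beta / max(q_row))
```

When the error stays bounded while the exact maximum grows, the subtracted term shrinks, so the ratio approaches one.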
For Deep Q-Networks on InvertedPendulum-v4, both stoch max and max return similar values (Fig. 7(a)), $\omega_t$ approaches one rapidly (Fig. 7(b)), and $\beta_t$ remains below 0.5 (Fig. 7(c)). In the case of HalfCheetah-v4, both stoch max and max return similar values (Fig. 8(a)), $\omega_t$ quickly converges to one (Fig. 8(b)), and $\beta_t$ stays below eight (Fig. 8(c)).
While the difference $\beta_t$ remains bounded, the values of both stoch max and max increase over the rounds as the agent explores better options. This leads to the ratio $\omega_t$ converging to one as the error becomes negligible relative to these values over the rounds, as expected according to Eq. (34).
E.3 Stochastic Q-network Reward Analysis
As illustrated in Fig. 9(a) and Fig. 9(b) for the inverted pendulum and half-cheetah experiments, which involve 512 and 4096 actions, respectively, both StochDQN and StochDDQN attain the optimal average return in a number of rounds comparable to DQN and DDQN. Additionally, StochDQN attains the optimal rewards fastest on the inverted pendulum. Furthermore, while DDQN did not perform well on the inverted pendulum task, its modification, StochDDQN, reached the optimal rewards.
E.4 Stochastic Q-learning Reward Analysis
We tested Stochastic Q-learning, Stochastic Double Q-learning, and Stochastic Sarsa in environments with both discrete states and actions. Interestingly, as shown in Fig. 11, our stochastic algorithms outperform their deterministic counterparts in terms of cumulative rewards. Furthermore, we notice that Stochastic Q-learning outperforms all the considered methods regarding the cumulative rewards. Moreover, in the CliffWalking-v0 (as shown in Fig. 10), as well as for the generated MDP environment with 256 possible actions (as shown in Fig. 12), all the stochastic and non-stochastic algorithms reach the optimal policy in a similar number of steps.