Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
11institutetext: School of Computer Science, University of Galway, H91 FYH2, Galway, Ireland 11email: {J.Shianifar1,Michael.Schukat,Karl.Mason}@universityofgalway.ie

Optimizing Deep Reinforcement Learning for Adaptive Robotic Arm Control

Jonaid Shianifar 11 0000-0003-0477-0056    Michael Schukat 11    Karl Mason 11
Abstract

In this paper, we explore the optimization of hyperparameters for the Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) algorithms using the Tree-structured Parzen Estimator (TPE) in the context of robotic arm control with seven Degrees of Freedom (DOF). Our results demonstrate a significant enhancement in algorithm performance, TPE improves the success rate of SAC by 10.48 percentage points and PPO by 34.28 percentage points, where models trained for 50K episodes. Furthermore, TPE enables PPO to converge to a reward within 95% of the maximum reward 76% faster than without TPE, which translates to about 40K fewer episodes of training required for optimal performance. Also, this improvement for SAC is 80% faster than without TPE. This study underscores the impact of advanced hyperparameter optimization on the efficiency and success of deep reinforcement learning algorithms in complex robotic tasks.

Keywords:
Deep Reinforcement Learning Robotic Arm Control Hyperparameter Optimization.

1 Introduction

In the rapidly evolving field of robotics, the development of autonomous and highly adaptable robotic arms represents a significant step towards realizing more efficient, precise, and versatile automated systems [1]. These systems find applications across a spectrum of industries, from intricate tasks in surgical robotics to the repetitive, precision-demanding processes in manufacturing [2]. The control of robotic arms, especially those with a high DOF, poses a substantial challenge due to the complexity of their operation and the need for fine-tuned coordination [3, 4]. A promising answer to these challenges is DRL, bringing a framework for robotic arms to learn and optimize their activity through interaction with the environment. [2, 5, 6]. DRL methods have demonstrated their efficiency in solving a wide spectrum of tasks—from games to the control of robotic systems [7, 8, 9, 10].

Among the various algorithms under the DRL paradigm, SAC and PPO have stood out for their effectiveness in balancing exploration and exploitation, which is critical for the efficient learning of control policies in environments with high-dimensional action spaces. SAC, known for its off-policy learning and entropy regularization, encourages efficient exploration and has demonstrated success in continuous control tasks [11]. On the other hand, PPO, an on-policy algorithm, simplifies the implementation and has shown robust performance across a range of DRL problems [12]. However, the optimal application of these algorithms in robotic arm control, particularly for arms with 7-DOF, heavily depends on the precise tuning of their hyperparameters.

The tuning process, which is often manual and intuitive, can be significantly enhanced using optimization algorithms. The TPE emerges as a potent tool in this aspect, offering a structured and efficient means of hyperparameter optimization. By employing a Bayesian model approach the relationship between hyperparameters and the objective function, TPE facilitates the identification of optimal configurations with fewer evaluations [13].

This study is the first to apply TPE optimization to DRL algorithms, SAC and PPO, for robotic arm control, filling a significant gap in research on hyperparameter impact on DRL performance. Also, This paper makes the following contributions:

  1. 1.

    To optimize the hyperparameters of two DRL algorithms, SAC and PPO, utilizing the TPE for enhanced control of a robotic arm with 7-DOF.

  2. 2.

    To demonstrate through a series of experiments the effectiveness of TPE in significantly improving the precision, convergence speed, and overall task success rate of the SAC and PPO algorithms in a reach target task.

  3. 3.

    To highlight the importance of SAC and PPO hyperparameter optimization in achieving performance in robotic arm control.

The paper is organized as follows: Section 2 details the methodologies, including DRL frameworks, hyperparameter optimization, the defined task, and the experimental setup involving the Franka Emika Panda arm in a simulated environment. Section 3 discusses the results, demonstrating the effectiveness of the TPE in enhancing the performance of algorithms for robotic arm control. The final section presents the conclusions and future work.

2 Materials and Methods

2.1 Deep Reinforcement Learning

DRL’s application in robotics has been marked by its ability to tackle complex, high-dimensional control tasks, a domain where traditional programming methods fall short [14, 15, 3]. DRL is a pivotal area within machine learning that centers on teaching agents to make strategic decisions through interactions with their environment, aiming to maximize rewards. Within the DRL framework, an agent systematically engages with its environment. At each timestep t𝑡titalic_t, the agent observes a current state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, selects an action atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT based on its policy π𝜋\piitalic_π, and receives a corresponding reward rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, thereby transitioning to the next state st+1subscript𝑠𝑡1s_{t+1}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT. This iterative process persists until reaching a terminal state, indicating the end of an episode. The agent’s overarching aim is to devise a policy that optimally increases the aggregate expected reward over time, The process of DRL is depicted in Fig. 1. DRL aligns well with the autonomous decision-making required in robotics [6, 16].

Refer to caption
Figure 1: In DRL, the agent selects actions based on its policy, interacts with the environment, and receives rewards to adjust its policy for future actions.

Recent advancements in DRL have further expanded its capabilities, integrating deep neural networks to approximate complex functions and policies [17]. We explore two prominent algorithms in this context, PPO and SAC.

2.1.1 Proximal Policy Optimization (PPO)

is a popular DRL algorithm known for its stability and sample efficiency. It aims to optimize the policy while ensuring that the updates are not too drastic [12]. The objective function for PPO can be expressed as:

LCLIP(θ)=𝔼^t[min(rt(θ)A^t,clip(rt(θ),1ϵ,1+ϵ)A^t)]superscript𝐿𝐶𝐿𝐼𝑃𝜃subscript^𝔼𝑡delimited-[]subscript𝑟𝑡𝜃subscript^𝐴𝑡clipsubscript𝑟𝑡𝜃1italic-ϵ1italic-ϵsubscript^𝐴𝑡L^{CLIP}(\theta)=\hat{\mathbb{E}}_{t}\left[\min(r_{t}(\theta)\hat{A}_{t},\text% {clip}(r_{t}(\theta),1-\epsilon,1+\epsilon)\hat{A}_{t})\right]italic_L start_POSTSUPERSCRIPT italic_C italic_L italic_I italic_P end_POSTSUPERSCRIPT ( italic_θ ) = over^ start_ARG blackboard_E end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ roman_min ( italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ ) over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , clip ( italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ ) , 1 - italic_ϵ , 1 + italic_ϵ ) over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] (1)

Here, LCLIP(θ)superscript𝐿𝐶𝐿𝐼𝑃𝜃L^{CLIP}(\theta)italic_L start_POSTSUPERSCRIPT italic_C italic_L italic_I italic_P end_POSTSUPERSCRIPT ( italic_θ ) is the clipped objective function, 𝔼^tsubscript^𝔼𝑡\hat{\mathbb{E}}_{t}over^ start_ARG blackboard_E end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT denotes the expected value over a finite batch of samples, rt(θ)=πθ(at|st)πθold(at|st)subscript𝑟𝑡𝜃subscript𝜋𝜃conditionalsubscript𝑎𝑡subscript𝑠𝑡subscript𝜋subscript𝜃𝑜𝑙𝑑conditionalsubscript𝑎𝑡subscript𝑠𝑡r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ ) = divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG is the probability ratio of the current policy πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT over the old policy πθoldsubscript𝜋subscript𝜃𝑜𝑙𝑑\pi_{\theta_{old}}italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT, A^tsubscript^𝐴𝑡\hat{A}_{t}over^ start_ARG italic_A end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is an estimator of the advantage function at time t𝑡titalic_t, and ϵitalic-ϵ\epsilonitalic_ϵ is the clipping parameter [12]. Table 1 lists several key hyperparameters essential for tuning, highlighting the numerous hyperparameters associated with PPO.

Table 1: Hyperparameters for Proximal Policy Optimization (PPO).
Name Description Symbol
learning_rate Learning rate of the optimizer α𝛼\alphaitalic_α
n_steps Number of steps to collect data for each iteration -
batch_size The batch size used for updating the policy network -
gamma The discount factor, which determines the importance of future rewards γ𝛾\gammaitalic_γ
ent_coef The coefficient for the entropy term in the loss function, encouraging exploration α𝛼\alphaitalic_α
vf_coef The coefficient for the value function loss, which determines the balance between policy and value function updates -
max_grad_norm The maximum gradient norm, which can be used to clip gradients during optimization to prevent exploding gradients -
gae_lambda The generalized advantage estimation (GAE) lambda parameter, which controls the trade-off between bias and variance in the advantage estimate λ𝜆\lambdaitalic_λ
clip_range The clipping range for the surrogate loss, which helps in preventing -

2.1.2 Soft Actor-Critic (SAC)

is a DRL algorithm that combines off-policy and actor-critic methods [11]. It is particularly suited for continuous action spaces. The SAC objective includes maximizing the entropy of the policy to encourage exploration. The SAC objective is given by:

J(θ)=𝔼τπθ[t=0γt(r(st,at)+αH(π(|st)))]J(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}% \left(r(s_{t},a_{t})+\alpha H(\pi(\cdot|s_{t}))\right)\right]italic_J ( italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_τ ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_r ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_α italic_H ( italic_π ( ⋅ | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ) ] (2)

Here, J(θ)𝐽𝜃J(\theta)italic_J ( italic_θ ) represents the SAC objective, θ𝜃\thetaitalic_θ are the policy parameters, τ𝜏\tauitalic_τ is a trajectory sampled from the policy, r(st,at)𝑟subscript𝑠𝑡subscript𝑎𝑡r(s_{t},a_{t})italic_r ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is the immediate reward, α𝛼\alphaitalic_α is the temperature parameter, a positive scalar that multiplies the entropy term in the objective function, and H(π(|st))H(\pi(\cdot|s_{t}))italic_H ( italic_π ( ⋅ | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) is the entropy of the policy [11]. The SAC algorithm involves multiple hyperparameters, as Table 2 outlines. Similar to PPO, the efficacy and efficiency of SAC in generating optimal policy decisions are considerably dependent on the precise adjustment of its hyperparameters.

Table 2: Hyperparameters for Soft Actor-Critic (SAC).
Name Description Symbol
buffer_size The size of the replay buffer used for storing past experiences -
learning_starts Number of steps before learning begins -
batch_size The batch size used for sampling from the replay buffer during training -
tau The soft update coefficient, which controls the rate at which the target networks are updated τ𝜏\tauitalic_τ
gamma The discount factor, which determines the importance of future rewards γ𝛾\gammaitalic_γ
learning_rate Learning rate for the optimizer α𝛼\alphaitalic_α
ent_coef The coefficient for the entropy regularization term in the policy loss α𝛼\alphaitalic_α
target_update_interval Interval for target network updates -
gradient_steps Number of gradient steps per update -
use_sde Whether to use state-dependent exploration -

2.2 Tree-structured Parzen Estimator (TPE)

TPE, as named by Bergstra et al. [18], is the technique of using Bayesian optimization heuristics for guiding and speeding up the hyperparameter configuration optimization process. Its innovative approach involves the creation of a binary tree-like model that adeptly maps out the probability distributions for various hyperparameters. This model is particularly adept at navigating the complex landscapes often encountered in high-dimensional spaces, where objective functions can be costly to evaluate, thereby streamlining the optimization trajectory towards optimal solutions with efficiency and precision [18]. Mathematically, TPE optimizes by iteratively selecting hyperparameters based on the following principle:

θ=argminθP(Objective better than current bestθ)P(Objective worse than current bestθ)superscript𝜃subscript𝜃𝑃conditionalObjective better than current best𝜃𝑃conditionalObjective worse than current best𝜃\theta^{*}=\arg\min_{\theta}\frac{P(\text{Objective better than current best}% \mid\theta)}{P(\text{Objective worse than current best}\mid\theta)}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_min start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT divide start_ARG italic_P ( Objective better than current best ∣ italic_θ ) end_ARG start_ARG italic_P ( Objective worse than current best ∣ italic_θ ) end_ARG (3)

where θ𝜃\thetaitalic_θ represents the optimized hyperparameters, and P(|)P(|)italic_P ( | ) denotes the conditional probabilities that the objective function is better or worse than the current best given the chosen hyperparameters. This Bayesian approach enables TPE to efficiently explore the search space by balancing exploration and exploitation, ultimately guiding our evolutionary algorithm towards optimal solutions.

2.3 Task Definition

The reach target task, pivotal in robot manipulation, necessitates the end effector’s precise targeting within Cartesian space constraints across diverse tasks [5]. The targets are randomly generated within this space to assess the arm’s ability to adapt and accurately reach different points, simulating potential real-world applications [19]. Our objective is to optimize DRL for training a 7-DOF robotic arm to reach target tasks, where a policy learns the mapping from current states to subsequent actions for goal achievement. Notably, existing literature predominantly addresses position-only reaching. This study delineates the reach task by explicitly defining state, action, goal, and reward [19].

  • State (Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT): Represents the environment’s current status, comprises a vector that includes joint angles (θ𝜃\thetaitalic_θ), the end-effector position(EEpos𝐸subscript𝐸𝑝𝑜𝑠EE_{pos}italic_E italic_E start_POSTSUBSCRIPT italic_p italic_o italic_s end_POSTSUBSCRIPT), and the goal position (Goalpos𝐺𝑜𝑎subscript𝑙𝑝𝑜𝑠Goal_{pos}italic_G italic_o italic_a italic_l start_POSTSUBSCRIPT italic_p italic_o italic_s end_POSTSUBSCRIPT).

  • Action (Atsubscript𝐴𝑡A_{t}italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT): The action corresponds to the individual motion of each joint, an action provided by the agent to update the environment state.

  • Reward (Rtsubscript𝑅𝑡R_{t}italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT): As a reward function, we use a dense function, shown on Equation .4, where the closer the agent is to completing the task, the higher the reward.

    Rt=(xEExGoal)2+(yEEyGoal)2+(zEEzGoal)2subscript𝑅𝑡superscriptsubscript𝑥𝐸𝐸subscript𝑥𝐺𝑜𝑎𝑙2superscriptsubscript𝑦𝐸𝐸subscript𝑦𝐺𝑜𝑎𝑙2superscriptsubscript𝑧𝐸𝐸subscript𝑧𝐺𝑜𝑎𝑙2R_{t}=-\sqrt{(x_{EE}-x_{Goal})^{2}+(y_{EE}-y_{Goal})^{2}+(z_{EE}-z_{Goal})^{2}}italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = - square-root start_ARG ( italic_x start_POSTSUBSCRIPT italic_E italic_E end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_G italic_o italic_a italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( italic_y start_POSTSUBSCRIPT italic_E italic_E end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT italic_G italic_o italic_a italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( italic_z start_POSTSUBSCRIPT italic_E italic_E end_POSTSUBSCRIPT - italic_z start_POSTSUBSCRIPT italic_G italic_o italic_a italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (4)
  • Done: A binary flag indicating the end of an episode, task is completed when the distance between the end-effector and the goal position is less than 5 cm.

    • Terminated: Episode ends due to goal achievement.

    • Truncated: Episode ends due to reaching a step limit, not because the task was completed, reaching the step limit in the training phase is 50, and for the testing phase is 5 steps.

2.4 Training and Evaluation

The methodology used in this work is depicted in Fig. 2. The process encompasses two primary phases: a warm-up phase for initial hyperparameter exploration and a subsequent phase for focused training and hyperparameter refinement.

Refer to caption
Figure 2: The iterative process of hyperparameter optimization and the DRL model evaluation.

2.4.1 A. Training Process Overview

The training commences with an initial phase dedicated to exploring a range of hyperparameters through random sampling, aiming to identify a foundational set that yields a favorable reward outcome. This stage, referred to as the warm-up, sets the groundwork for more targeted optimization.

Following the initial exploration, the training enters a critical phase where the algorithm focuses on refining the model’s accuracy and responsiveness. Actions are evaluated based on their success in navigating towards a set goal under the constraints of the environment, with each episode offering a fresh challenge. The objective is to systematically improve the model’s performance through iterative learning, with a specific cap on the number of steps (50 steps per episode) to encourage efficiency and deter protracted decision-making paths.

Finally, the models trained with hyperparameters selected with TPE and also default hyperparameters, for 100,000 epochs, where the robot must find a random target in a maximum of 50 steps.

2.4.2 B. Evaluation Methodology

Post-training, the model is tested against 100,000 randomly generated target positions. The evaluation emphasizes not only the success rate in reaching these targets, but also the efficiency as measured by the number of steps required, imposing a stricter limit of 5 steps per episode during this phase. This sharpens the focus on models ability to quickly adapt and accurately respond to new scenarios. Also, models saved in 20K, and 50K training episodes are tested to show learning convergence speed.

2.5 Experimental Setup

This study utilized the Franka Emika Panda arm with 7-DOF in a panda_gym [19] simulation, developed with PyBullet [20] and Gymnasium [21], to conduct risk-free tests (Fig. 3). Our experiments ran on a system with a 12th Gen Intel Core i7-12700 CPU and 64GB of RAM, using Python 3.8.10 for programming. We applied PPO and SAC via the Stable Baseline3 [22] library and optimized hyperparameters with the Optuna [23] framework.

Refer to caption
(a)
Refer to caption
(b)
Figure 3: (a) Franka Emika Panda arm robot with declared joints position, (b) panda_gym simulation environment

3 Results

Our evaluation focused on the performance of two DRL algorithms, SAC and PPO, in conjunction with TPE for hyperparameter optimization, aiming to enhance the control of a robotic arm with 7-DOF. We observed significant improvements in learning efficiency, the utility of TPE in hyperparameter selection, and insights from various data visualizations and analyses. A key metric for evaluation was the success rate in reaching targets across the testing phases, which highlighted the enhanced learning efficiency and the strategic optimization of hyperparameters facilitated by TPE.

3.1 Hyperparameter Optimization

The optimization procedure benefited significantly from the strategic selection of hyperparameters via TPE, as shown in Fig. 4, parallel coordinate plots (PCP) for PPO and SAC elucidated the systematic exploration within the hyperparameter space, charting the trajectories toward optimized performance. This comprehensive view demonstrated how various hyperparameter configurations contributed to the overall efficacy of the learning models.

Refer to caption
Refer to caption
Figure 4: Parallel coordinate plot(PCP) for PPO (left) and SAC (right).

The Hyperparameters Importance Plot, shown in Fig. 5, underscores the relative significance of each hyperparameter in influencing the success of the learning process. This plot is instrumental in identifying the hyperparameters that have the most substantial impact on the model’s ability to learn efficiently.

Refer to caption
Refer to caption
Figure 5: Hyperparameters importance plot, for PPO (left) and SAC (right).

A thorough examination of the SAC and PPO hyperparameters was conducted, detailing the range explored, and comparing the default values with the optimal values as determined by TPE. This comparison highlights TPE’s nuanced approach to navigating the hyperparameter space and pinpointing the most effective configurations. The hyperparameters for PPO and SAC are listed in Table 3 and Table 4, respectively. We selected default values based on recommendations from the Stable Baselines3 library [22].

Table 3: Hyperparameters for Proximal Policy Optimization (PPO).
Name Default Value Search space Best
learning_rate 0.0003 1e-5 - 1e-1  0.0153
n_steps 2048 64 - 4096 559
batch_size 64 16 - 256 193
gamma 0.99 0.95 - 0.999  0.9657
ent_coef 0.0 0.0 - 0.1  0.0548
vf_coef 0.5 0.2 - 1.0  0.3999
max_grad_norm 0.5 0.1 - 10.0  9.4229
gae_lambda 0.95 0.8 - 0.99  0.8543
clip_range 0.2 0.1 - 0.4  0.2865
Table 4: Hyperparameters for Soft Actor-Critic (SAC).
Name Default Value Search space Best
buffer_size 1000000 1000 - 1000000 79709
learning_starts 1000 100 - 10000 7126
batch_size 256 16 - 256 104
tau 0.005 0.001 - 0.1  0.034480
gamma 0.99 0.9 - 0.999  0.920970
learning_rate 0.0003 1e-5 - 1e-1  0.000728
ent_coef 0.2 0.0 - 0.2  0.008345
target_update_interval 1 1 - 100 40
gradient_steps 1 1 - 20 10
use_sde False True - False True

3.2 Enhanced Learning Efficiency

The application of TPE markedly accelerated the learning process for both the SAC and PPO algorithms, as evidenced by the training curves (Fig. 6). The training curves for each algorithm illustrated a faster ascent toward higher mean rewards per episode, indicative of an expedited convergence toward optimal policy decisions post-TPE optimization. With TPE, PPO converges to a reward within 95% of the maximum payout 76.32% faster than it could without it. This means that roughly 40K fewer training episodes are needed for PPO to function at its best. Furthermore, compared to no TPE, this improvement for SAC happens 80.39% faster.

Refer to caption
Refer to caption
Figure 6: Training Curves for PPO (left) and SAC (right) Before and After TPE Optimization showcases this improved learning trajectory, highlighting the efficiency gains in model training facilitated by TPE.

3.3 Comparative Evaluation

Upon evaluating the DRL models across a test set of 100,000 targets, Table 5 showcases a distinct advantage in utilizing TPE-optimized hyperparameters. The results, detailed in Table 5, emphasize the superior performance of the optimized models, marking a significant improvement in success rates, and learning convergence speed for task achievements.

Table 5: Success Rates of SAC and PPO Models Before and After Optimization
Model Success Rate (%)
20K Episodes 50K Episodes 100K Episodes
SAC Before Optimization 3.22 74.64 86.59
SAC After TPE Optimization 79.69 85.12 89.75
PPO Before Optimization 8.72 49.65 87.29
PPO After TPE Optimization 71.69 83.93 89.41

4 Conclusion

This study demonstrates the benefits of using the TPE for hyperparameter optimization in DRL algorithms, SAC and PPO, for controlling robotic arms with 7-DOF. Our results indicate that TPE significantly improves both the learning efficiency and the performance of these algorithms, highlighted by the enhanced success rates and the accelerated convergence towards optimal rewards. Specifically, the application of TPE enabled PPO to achieve convergence to a reward within 95% of the maximum reward 76.32% faster than without TPE, necessitating about 40,630 fewer episodes of training for optimal performance, with similar improvements observed for SAC. These findings underscore the importance of precise hyperparameter tuning in the development of robust and efficient DRL models, particularly in the context of complex robotic tasks.

A notable challenge tackled in this research was the complexity of hyperparameter tuning, where TPE proved to be a valuable tool for optimizing the DRL algorithms’ performance in robotic tasks.

Future research could broaden the scope by applying TPE to various DRL algorithms and robotic tasks such as pick-and-place, thereby unveiling more about its utility and limitations. Exploring other hyperparameter optimization methods could also contribute to refining the efficiency of DRL models further. Additionally, the practical application of these optimized models in real-world robotics scenarios will be crucial for making tangible advancements in the field. Our study paves the way for such explorations, indicating a promising direction for enhancing the performance and efficiency of DRL algorithms in complex, real-world tasks.

4.0.1 Acknowledgements

This work is supported by the University of Galway College of Science and Engineering Postgraduate Scholarship.

References

  • [1] Z. Arif and Y. Fu, “Mix frame visual servo control framework for autonomous assistive robotic arms,” Sensors, vol. 22, no. 2, p. 642, 2022.
  • [2] R. Liu, F. Nageotte, P. Zanne, M. de Mathelin, and B. Dresp-Langley, “Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review,” Robotics, vol. 10, no. 1, p. 22, 2021.
  • [3] A. Imran, D. Hanson, G. Morales, and V. Krisciunas, “Open arms: Open-source arms, hands & control,” in 2022 22nd International Conference on Control, Automation and Systems (ICCAS), pp. 1426–1431, IEEE, 2022.
  • [4] A. Horgan and K. Mason, “Evolving neural networks for robotic arm control,” in International Conference on the Applications of Evolutionary Computation (Part of EvoStar), pp. 591–607, Springer, 2023.
  • [5] D. Han, B. Mulyana, V. Stankovic, and S. Cheng, “A survey on deep reinforcement learning algorithms for robotic manipulation,” Sensors, vol. 23, no. 7, p. 3762, 2023.
  • [6] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
  • [7] Q. Zhang, Y. Kang, Y.-B. Zhao, P. Li, and S. You, “Traded control of human–machine systems for sequential decision-making based on reinforcement learning,” IEEE Transactions on Artificial Intelligence, vol. 3, no. 4, pp. 553–566, 2021.
  • [8] X. He, H. Yang, Z. Hu, and C. Lv, “Robust lane change decision making for autonomous vehicles: An observation adversarial reinforcement learning approach,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 184–193, 2022.
  • [9] R. Dong, J. Du, Y. Liu, A. A. Heidari, and H. Chen, “An enhanced deep deterministic policy gradient algorithm for intelligent control of robotic arms,” Frontiers in Neuroinformatics, 2023.
  • [10] Z. Pan, J. Zhou, Q. Fan, Z. Feng, X. Gao, and M. Su, “Robotic control mechanism based on deep reinforcement learning,” in 2023 2nd International Symposium on Control Engineering and Robotics (ISCER), pp. 70–74, IEEE, 2023.
  • [11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning, pp. 1861–1870, PMLR, 2018.
  • [12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [13] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” Advances in neural information processing systems, vol. 24, 2011.
  • [14] T. Lindner, A. Milecki, and D. Wyrwał, “Positioning of the robotic arm using different reinforcement learning algorithms,” International Journal of Control, Automation and Systems, vol. 19, no. 4, pp. 1661–1676, 2021.
  • [15] A. Lobbezoo and H.-J. Kwon, “Simulated and real robotic reach, grasp, and pick-and-place using combined reinforcement learning and traditional controls,” Robotics, vol. 12, no. 1, p. 12, 2023.
  • [16] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
  • [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [18] J. Bergstra, D. Yamins, D. D. Cox, et al., “Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms.,” SciPy, vol. 13, p. 20, 2013.
  • [19] Q. Gallouédec, N. Cazin, E. Dellandréa, and L. Chen, “panda-gym: Open-source goal-conditioned environments for robotic learning,” arXiv preprint arXiv:2106.13687, 2021.
  • [20] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning.” http://pybullet.org, 2016–2021.
  • [21] M. Towers, J. K. Terry, A. Kwiatkowski, J. U. Balis, G. d. Cola, T. Deleu, M. Goulão, A. Kallinteris, A. KG, M. Krimmel, R. Perez-Vicente, A. Pierré, S. Schulhoff, J. J. Tai, A. T. J. Shen, and O. G. Younis, “Gymnasium,” Mar. 2023.
  • [22] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
  • [23] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2623–2631, 2019.