¹¹institutetext: School of Computer Science, University of Galway, H91 FYH2, Galway, Ireland ¹¹email: {J.Shianifar1,Michael.Schukat,Karl.Mason}@universityofgalway.ie

Optimizing Deep Reinforcement Learning for Adaptive Robotic Arm Control

Jonaid Shianifar 11 0000-0003-0477-0056 Michael Schukat 11 Karl Mason 11

Abstract

In this paper, we explore the optimization of hyperparameters for the Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) algorithms using the Tree-structured Parzen Estimator (TPE) in the context of robotic arm control with seven Degrees of Freedom (DOF). Our results demonstrate a significant enhancement in algorithm performance, TPE improves the success rate of SAC by 10.48 percentage points and PPO by 34.28 percentage points, where models trained for 50K episodes. Furthermore, TPE enables PPO to converge to a reward within 95% of the maximum reward 76% faster than without TPE, which translates to about 40K fewer episodes of training required for optimal performance. Also, this improvement for SAC is 80% faster than without TPE. This study underscores the impact of advanced hyperparameter optimization on the efficiency and success of deep reinforcement learning algorithms in complex robotic tasks.

Keywords:

Deep Reinforcement Learning Robotic Arm Control Hyperparameter Optimization.

1 Introduction

In the rapidly evolving field of robotics, the development of autonomous and highly adaptable robotic arms represents a significant step towards realizing more efficient, precise, and versatile automated systems [1]. These systems find applications across a spectrum of industries, from intricate tasks in surgical robotics to the repetitive, precision-demanding processes in manufacturing [2]. The control of robotic arms, especially those with a high DOF, poses a substantial challenge due to the complexity of their operation and the need for fine-tuned coordination [3, 4]. A promising answer to these challenges is DRL, bringing a framework for robotic arms to learn and optimize their activity through interaction with the environment. [2, 5, 6]. DRL methods have demonstrated their efficiency in solving a wide spectrum of tasks—from games to the control of robotic systems [7, 8, 9, 10].

Among the various algorithms under the DRL paradigm, SAC and PPO have stood out for their effectiveness in balancing exploration and exploitation, which is critical for the efficient learning of control policies in environments with high-dimensional action spaces. SAC, known for its off-policy learning and entropy regularization, encourages efficient exploration and has demonstrated success in continuous control tasks [11]. On the other hand, PPO, an on-policy algorithm, simplifies the implementation and has shown robust performance across a range of DRL problems [12]. However, the optimal application of these algorithms in robotic arm control, particularly for arms with 7-DOF, heavily depends on the precise tuning of their hyperparameters.

The tuning process, which is often manual and intuitive, can be significantly enhanced using optimization algorithms. The TPE emerges as a potent tool in this aspect, offering a structured and efficient means of hyperparameter optimization. By employing a Bayesian model approach the relationship between hyperparameters and the objective function, TPE facilitates the identification of optimal configurations with fewer evaluations [13].

This study is the first to apply TPE optimization to DRL algorithms, SAC and PPO, for robotic arm control, filling a significant gap in research on hyperparameter impact on DRL performance. Also, This paper makes the following contributions:

1.

To optimize the hyperparameters of two DRL algorithms, SAC and PPO, utilizing the TPE for enhanced control of a robotic arm with 7-DOF.
2.

To demonstrate through a series of experiments the effectiveness of TPE in significantly improving the precision, convergence speed, and overall task success rate of the SAC and PPO algorithms in a reach target task.
3.

To highlight the importance of SAC and PPO hyperparameter optimization in achieving performance in robotic arm control.

The paper is organized as follows: Section 2 details the methodologies, including DRL frameworks, hyperparameter optimization, the defined task, and the experimental setup involving the Franka Emika Panda arm in a simulated environment. Section 3 discusses the results, demonstrating the effectiveness of the TPE in enhancing the performance of algorithms for robotic arm control. The final section presents the conclusions and future work.

2 Materials and Methods

2.1 Deep Reinforcement Learning

DRL’s application in robotics has been marked by its ability to tackle complex, high-dimensional control tasks, a domain where traditional programming methods fall short [14, 15, 3]. DRL is a pivotal area within machine learning that centers on teaching agents to make strategic decisions through interactions with their environment, aiming to maximize rewards. Within the DRL framework, an agent systematically engages with its environment. At each timestep $t$ , the agent observes a current state $s_{t}$ , selects an action $a_{t}$ based on its policy $\pi$ , and receives a corresponding reward $r_{t}$ , thereby transitioning to the next state $s_{t+1}$ . This iterative process persists until reaching a terminal state, indicating the end of an episode. The agent’s overarching aim is to devise a policy that optimally increases the aggregate expected reward over time, The process of DRL is depicted in Fig. 1. DRL aligns well with the autonomous decision-making required in robotics [6, 16].

Refer to caption — Figure 1: In DRL, the agent selects actions based on its policy, interacts with the environment, and receives rewards to adjust its policy for future actions.

Recent advancements in DRL have further expanded its capabilities, integrating deep neural networks to approximate complex functions and policies [17]. We explore two prominent algorithms in this context, PPO and SAC.

2.1.1 Proximal Policy Optimization (PPO)

is a popular DRL algorithm known for its stability and sample efficiency. It aims to optimize the policy while ensuring that the updates are not too drastic [12]. The objective function for PPO can be expressed as:

L^{CLIP}(\theta)=\hat{\mathbb{E}}_{t}\left[\min(r_{t}(\theta)\hat{A}_{t},\text% {clip}(r_{t}(\theta),1-\epsilon,1+\epsilon)\hat{A}_{t})\right]

(1)

Here, $L^{CLIP}(\theta)$ is the clipped objective function, $\hat{\mathbb{E}}_{t}$ denotes the expected value over a finite batch of samples, $r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}$ is the probability ratio of the current policy $\pi_{\theta}$ over the old policy $\pi_{\theta_{old}}$ , $\hat{A}_{t}$ is an estimator of the advantage function at time $t$ , and $\epsilon$ is the clipping parameter [12]. Table 1 lists several key hyperparameters essential for tuning, highlighting the numerous hyperparameters associated with PPO.

Table 1: Hyperparameters for Proximal Policy Optimization (PPO).

Name	Description	Symbol
learning_rate	Learning rate of the optimizer	$\alpha$
n_steps	Number of steps to collect data for each iteration	-
batch_size	The batch size used for updating the policy network	-
gamma	The discount factor, which determines the importance of future rewards	$\gamma$
ent_coef	The coefficient for the entropy term in the loss function, encouraging exploration	$\alpha$
vf_coef	The coefficient for the value function loss, which determines the balance between policy and value function updates	-
max_grad_norm	The maximum gradient norm, which can be used to clip gradients during optimization to prevent exploding gradients	-
gae_lambda	The generalized advantage estimation (GAE) lambda parameter, which controls the trade-off between bias and variance in the advantage estimate	$\lambda$
clip_range	The clipping range for the surrogate loss, which helps in preventing	-

2.1.2 Soft Actor-Critic (SAC)

is a DRL algorithm that combines off-policy and actor-critic methods [11]. It is particularly suited for continuous action spaces. The SAC objective includes maximizing the entropy of the policy to encourage exploration. The SAC objective is given by:

J(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}% \left(r(s_{t},a_{t})+\alpha H(\pi(\cdot|s_{t}))\right)\right]

(2)

Here, $J(\theta)$ represents the SAC objective, $\theta$ are the policy parameters, $\tau$ is a trajectory sampled from the policy, $r(s_{t},a_{t})$ is the immediate reward, $\alpha$ is the temperature parameter, a positive scalar that multiplies the entropy term in the objective function, and $H(\pi(\cdot|s_{t}))$ is the entropy of the policy [11]. The SAC algorithm involves multiple hyperparameters, as Table 2 outlines. Similar to PPO, the efficacy and efficiency of SAC in generating optimal policy decisions are considerably dependent on the precise adjustment of its hyperparameters.

Table 2: Hyperparameters for Soft Actor-Critic (SAC).

Name	Description	Symbol
buffer_size	The size of the replay buffer used for storing past experiences	-
learning_starts	Number of steps before learning begins	-
batch_size	The batch size used for sampling from the replay buffer during training	-
tau	The soft update coefficient, which controls the rate at which the target networks are updated	$\tau$
gamma	The discount factor, which determines the importance of future rewards	$\gamma$
learning_rate	Learning rate for the optimizer	$\alpha$
ent_coef	The coefficient for the entropy regularization term in the policy loss	$\alpha$
target_update_interval	Interval for target network updates	-
gradient_steps	Number of gradient steps per update	-
use_sde	Whether to use state-dependent exploration	-

2.2 Tree-structured Parzen Estimator (TPE)

TPE, as named by Bergstra et al. [18], is the technique of using Bayesian optimization heuristics for guiding and speeding up the hyperparameter configuration optimization process. Its innovative approach involves the creation of a binary tree-like model that adeptly maps out the probability distributions for various hyperparameters. This model is particularly adept at navigating the complex landscapes often encountered in high-dimensional spaces, where objective functions can be costly to evaluate, thereby streamlining the optimization trajectory towards optimal solutions with efficiency and precision [18]. Mathematically, TPE optimizes by iteratively selecting hyperparameters based on the following principle:

\theta^{*}=\arg\min_{\theta}\frac{P(\text{Objective better than current best}% \mid\theta)}{P(\text{Objective worse than current best}\mid\theta)}

(3)

where $\theta$ represents the optimized hyperparameters, and $P(|)$ denotes the conditional probabilities that the objective function is better or worse than the current best given the chosen hyperparameters. This Bayesian approach enables TPE to efficiently explore the search space by balancing exploration and exploitation, ultimately guiding our evolutionary algorithm towards optimal solutions.

2.3 Task Definition

The reach target task, pivotal in robot manipulation, necessitates the end effector’s precise targeting within Cartesian space constraints across diverse tasks [5]. The targets are randomly generated within this space to assess the arm’s ability to adapt and accurately reach different points, simulating potential real-world applications [19]. Our objective is to optimize DRL for training a 7-DOF robotic arm to reach target tasks, where a policy learns the mapping from current states to subsequent actions for goal achievement. Notably, existing literature predominantly addresses position-only reaching. This study delineates the reach task by explicitly defining state, action, goal, and reward [19].

•

State ( $S_{t}$ ): Represents the environment’s current status, comprises a vector that includes joint angles ( $\theta$ ), the end-effector position( $EE_{pos}$ ), and the goal position ( $Goal_{pos}$ ).
•

Action ( $A_{t}$ ): The action corresponds to the individual motion of each joint, an action provided by the agent to update the environment state.

•

Reward ( $R_{t}$ ): As a reward function, we use a dense function, shown on Equation .4, where the closer the agent is to completing the task, the higher the reward.

R_{t}=-\sqrt{(x_{EE}-x_{Goal})^{2}+(y_{EE}-y_{Goal})^{2}+(z_{EE}-z_{Goal})^{2}}

(4)

•
Done: A binary flag indicating the end of an episode, task is completed when the distance between the end-effector and the goal position is less than 5 cm.
- –
  
  Terminated: Episode ends due to goal achievement.
- –
  
  Truncated: Episode ends due to reaching a step limit, not because the task was completed, reaching the step limit in the training phase is 50, and for the testing phase is 5 steps.

2.4 Training and Evaluation

The methodology used in this work is depicted in Fig. 2. The process encompasses two primary phases: a warm-up phase for initial hyperparameter exploration and a subsequent phase for focused training and hyperparameter refinement.

2.4.1 A. Training Process Overview

The training commences with an initial phase dedicated to exploring a range of hyperparameters through random sampling, aiming to identify a foundational set that yields a favorable reward outcome. This stage, referred to as the warm-up, sets the groundwork for more targeted optimization.

Following the initial exploration, the training enters a critical phase where the algorithm focuses on refining the model’s accuracy and responsiveness. Actions are evaluated based on their success in navigating towards a set goal under the constraints of the environment, with each episode offering a fresh challenge. The objective is to systematically improve the model’s performance through iterative learning, with a specific cap on the number of steps (50 steps per episode) to encourage efficiency and deter protracted decision-making paths.

Finally, the models trained with hyperparameters selected with TPE and also default hyperparameters, for 100,000 epochs, where the robot must find a random target in a maximum of 50 steps.

2.4.2 B. Evaluation Methodology

Post-training, the model is tested against 100,000 randomly generated target positions. The evaluation emphasizes not only the success rate in reaching these targets, but also the efficiency as measured by the number of steps required, imposing a stricter limit of 5 steps per episode during this phase. This sharpens the focus on models ability to quickly adapt and accurately respond to new scenarios. Also, models saved in 20K, and 50K training episodes are tested to show learning convergence speed.

2.5 Experimental Setup

This study utilized the Franka Emika Panda arm with 7-DOF in a panda_gym [19] simulation, developed with PyBullet [20] and Gymnasium [21], to conduct risk-free tests (Fig. 3). Our experiments ran on a system with a 12th Gen Intel Core i7-12700 CPU and 64GB of RAM, using Python 3.8.10 for programming. We applied PPO and SAC via the Stable Baseline3 [22] library and optimized hyperparameters with the Optuna [23] framework.

3 Results

Our evaluation focused on the performance of two DRL algorithms, SAC and PPO, in conjunction with TPE for hyperparameter optimization, aiming to enhance the control of a robotic arm with 7-DOF. We observed significant improvements in learning efficiency, the utility of TPE in hyperparameter selection, and insights from various data visualizations and analyses. A key metric for evaluation was the success rate in reaching targets across the testing phases, which highlighted the enhanced learning efficiency and the strategic optimization of hyperparameters facilitated by TPE.

3.1 Hyperparameter Optimization

The optimization procedure benefited significantly from the strategic selection of hyperparameters via TPE, as shown in Fig. 4, parallel coordinate plots (PCP) for PPO and SAC elucidated the systematic exploration within the hyperparameter space, charting the trajectories toward optimized performance. This comprehensive view demonstrated how various hyperparameter configurations contributed to the overall efficacy of the learning models.

The Hyperparameters Importance Plot, shown in Fig. 5, underscores the relative significance of each hyperparameter in influencing the success of the learning process. This plot is instrumental in identifying the hyperparameters that have the most substantial impact on the model’s ability to learn efficiently.

A thorough examination of the SAC and PPO hyperparameters was conducted, detailing the range explored, and comparing the default values with the optimal values as determined by TPE. This comparison highlights TPE’s nuanced approach to navigating the hyperparameter space and pinpointing the most effective configurations. The hyperparameters for PPO and SAC are listed in Table 3 and Table 4, respectively. We selected default values based on recommendations from the Stable Baselines3 library [22].

Table 3: Hyperparameters for Proximal Policy Optimization (PPO).

Name	Default Value	Search space	Best
learning_rate	0.0003	1e-5 - 1e-1	0.0153
n_steps	2048	64 - 4096	559
batch_size	64	16 - 256	193
gamma	0.99	0.95 - 0.999	0.9657
ent_coef	0.0	0.0 - 0.1	0.0548
vf_coef	0.5	0.2 - 1.0	0.3999
max_grad_norm	0.5	0.1 - 10.0	9.4229
gae_lambda	0.95	0.8 - 0.99	0.8543
clip_range	0.2	0.1 - 0.4	0.2865

Table 4: Hyperparameters for Soft Actor-Critic (SAC).

Name	Default Value	Search space	Best
buffer_size	1000000	1000 - 1000000	79709
learning_starts	1000	100 - 10000	7126
batch_size	256	16 - 256	104
tau	0.005	0.001 - 0.1	0.034480
gamma	0.99	0.9 - 0.999	0.920970
learning_rate	0.0003	1e-5 - 1e-1	0.000728
ent_coef	0.2	0.0 - 0.2	0.008345
target_update_interval	1	1 - 100	40
gradient_steps	1	1 - 20	10
use_sde	False	True - False	True

3.2 Enhanced Learning Efficiency

The application of TPE markedly accelerated the learning process for both the SAC and PPO algorithms, as evidenced by the training curves (Fig. 6). The training curves for each algorithm illustrated a faster ascent toward higher mean rewards per episode, indicative of an expedited convergence toward optimal policy decisions post-TPE optimization. With TPE, PPO converges to a reward within 95% of the maximum payout 76.32% faster than it could without it. This means that roughly 40K fewer training episodes are needed for PPO to function at its best. Furthermore, compared to no TPE, this improvement for SAC happens 80.39% faster.

3.3 Comparative Evaluation

Upon evaluating the DRL models across a test set of 100,000 targets, Table 5 showcases a distinct advantage in utilizing TPE-optimized hyperparameters. The results, detailed in Table 5, emphasize the superior performance of the optimized models, marking a significant improvement in success rates, and learning convergence speed for task achievements.

Table 5: Success Rates of SAC and PPO Models Before and After Optimization

Model	Success Rate (%)
	20K Episodes	50K Episodes	100K Episodes
SAC Before Optimization	3.22	74.64	86.59
SAC After TPE Optimization	79.69	85.12	89.75
PPO Before Optimization	8.72	49.65	87.29
PPO After TPE Optimization	71.69	83.93	89.41

4 Conclusion

This study demonstrates the benefits of using the TPE for hyperparameter optimization in DRL algorithms, SAC and PPO, for controlling robotic arms with 7-DOF. Our results indicate that TPE significantly improves both the learning efficiency and the performance of these algorithms, highlighted by the enhanced success rates and the accelerated convergence towards optimal rewards. Specifically, the application of TPE enabled PPO to achieve convergence to a reward within 95% of the maximum reward 76.32% faster than without TPE, necessitating about 40,630 fewer episodes of training for optimal performance, with similar improvements observed for SAC. These findings underscore the importance of precise hyperparameter tuning in the development of robust and efficient DRL models, particularly in the context of complex robotic tasks.

A notable challenge tackled in this research was the complexity of hyperparameter tuning, where TPE proved to be a valuable tool for optimizing the DRL algorithms’ performance in robotic tasks.

Future research could broaden the scope by applying TPE to various DRL algorithms and robotic tasks such as pick-and-place, thereby unveiling more about its utility and limitations. Exploring other hyperparameter optimization methods could also contribute to refining the efficiency of DRL models further. Additionally, the practical application of these optimized models in real-world robotics scenarios will be crucial for making tangible advancements in the field. Our study paves the way for such explorations, indicating a promising direction for enhancing the performance and efficiency of DRL algorithms in complex, real-world tasks.

4.0.1 Acknowledgements

This work is supported by the University of Galway College of Science and Engineering Postgraduate Scholarship.

References

[1] Z. Arif and Y. Fu, “Mix frame visual servo control framework for autonomous assistive robotic arms,” Sensors, vol. 22, no. 2, p. 642, 2022.
[2] R. Liu, F. Nageotte, P. Zanne, M. de Mathelin, and B. Dresp-Langley, “Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review,” Robotics, vol. 10, no. 1, p. 22, 2021.
[3] A. Imran, D. Hanson, G. Morales, and V. Krisciunas, “Open arms: Open-source arms, hands & control,” in 2022 22nd International Conference on Control, Automation and Systems (ICCAS), pp. 1426–1431, IEEE, 2022.
[4] A. Horgan and K. Mason, “Evolving neural networks for robotic arm control,” in International Conference on the Applications of Evolutionary Computation (Part of EvoStar), pp. 591–607, Springer, 2023.
[5] D. Han, B. Mulyana, V. Stankovic, and S. Cheng, “A survey on deep reinforcement learning algorithms for robotic manipulation,” Sensors, vol. 23, no. 7, p. 3762, 2023.
[6] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[7] Q. Zhang, Y. Kang, Y.-B. Zhao, P. Li, and S. You, “Traded control of human–machine systems for sequential decision-making based on reinforcement learning,” IEEE Transactions on Artificial Intelligence, vol. 3, no. 4, pp. 553–566, 2021.
[8] X. He, H. Yang, Z. Hu, and C. Lv, “Robust lane change decision making for autonomous vehicles: An observation adversarial reinforcement learning approach,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 184–193, 2022.
[9] R. Dong, J. Du, Y. Liu, A. A. Heidari, and H. Chen, “An enhanced deep deterministic policy gradient algorithm for intelligent control of robotic arms,” Frontiers in Neuroinformatics, 2023.
[10] Z. Pan, J. Zhou, Q. Fan, Z. Feng, X. Gao, and M. Su, “Robotic control mechanism based on deep reinforcement learning,” in 2023 2nd International Symposium on Control Engineering and Robotics (ISCER), pp. 70–74, IEEE, 2023.
[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning, pp. 1861–1870, PMLR, 2018.
[12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[13] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” Advances in neural information processing systems, vol. 24, 2011.
[14] T. Lindner, A. Milecki, and D. Wyrwał, “Positioning of the robotic arm using different reinforcement learning algorithms,” International Journal of Control, Automation and Systems, vol. 19, no. 4, pp. 1661–1676, 2021.
[15] A. Lobbezoo and H.-J. Kwon, “Simulated and real robotic reach, grasp, and pick-and-place using combined reinforcement learning and traditional controls,” Robotics, vol. 12, no. 1, p. 12, 2023.
[16] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
[17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
[18] J. Bergstra, D. Yamins, D. D. Cox, et al., “Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms.,” SciPy, vol. 13, p. 20, 2013.
[19] Q. Gallouédec, N. Cazin, E. Dellandréa, and L. Chen, “panda-gym: Open-source goal-conditioned environments for robotic learning,” arXiv preprint arXiv:2106.13687, 2021.
[20] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning.” http://pybullet.org, 2016–2021.
[21] M. Towers, J. K. Terry, A. Kwiatkowski, J. U. Balis, G. d. Cola, T. Deleu, M. Goulão, A. Kallinteris, A. KG, M. Krimmel, R. Perez-Vicente, A. Pierré, S. Schulhoff, J. J. Tai, A. T. J. Shen, and O. G. Younis, “Gymnasium,” Mar. 2023.
[22] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
[23] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2623–2631, 2019.