Automatic Hyperparameter Optimization Using Genetic Algorithm in Deep Reinforcement Learning For Robotic Manipulation Tasks
Abstract
Learning agents can use Reinforcement Learning (RL) to decide their actions
by using a reward function. However, the learning process is greatly influenced
by the values of the hyperparameters used in the learning algorithm. This work
proposes a method based on Deep Deterministic Policy Gradient (DDPG) and
Hindsight Experience Replay (HER) that uses a Genetic Algorithm (GA) to
fine-tune the hyperparameters' values. This method (GA+DDPG+HER) was
evaluated on six robotic manipulation tasks: FetchReach, FetchSlide, FetchPush,
FetchPick&Place, DoorOpening, and AuboReach. Analysis of the results
demonstrates a significant increase in performance and a decrease in learning
time. We also compare against existing methods and provide evidence that
GA+DDPG+HER outperforms them.
1 Introduction
Reinforcement Learning (RL) [1] has recently been applied to a variety of
applications, including robotic table tennis [2], surgical robot planning [3],
rapid motion planning in bimanual suture needle regrasping [4], and Aquatic
Navigation [5]. Each of these applications employs RL as a compelling way to
automate tasks that would otherwise require manual effort.
Studies have shown that hyperparameter tuning, particularly when uti-
lizing machine learning, can have a significant impact on an algorithm’s
performance [6, 7]. This inspired us to enhance the RL algorithm. While
hill climbers [8] can be useful in specific situations, they are of little use in
complex settings such as robotic tasks that employ RL, because there is no
clear correlation between changes in hyperparameter settings and changes in
performance. The number of epochs it takes the learning
agent to learn a given robotic task can be used to assess its performance. Due
to the non-linear nature of the relationship between the two, we looked into
using a Genetic Algorithm (GA) [9], which may be utilized for hyperparameter
optimization. Although GAs can incur significant computational cost, which
could be an overhead for any RL algorithm, they offer a way to tune the
hyperparameters for similar kinds of problems once and for all: the GA can
adjust the hyperparameters once, and the tuned values can be reused in
subsequent RL trainings, saving computational resources that would other-
wise be spent each time RL is applied to an agent. [10] demonstrates why
GA is preferred over other optimization techniques: it can handle discrete
variables, continuous variables, or both, and it evaluates the n members of a
population simultaneously, allowing multiprocessor machines to run parallel
simulations (a sketch of such parallel evaluation follows this paragraph). GA
is inherently suitable for the
solution of multi-objective optimization problems since it works with a pop-
ulation of solutions [11]. [12] evaluated a small number of cost functions to
compare the performance of eight algorithms in solving simple and complex
building models. They discovered that the GA frequently approached the opti-
mal minimum. This demonstrates that GA is a powerful algorithm for solving
complex problems, making it our top pick for hyperparameter optimization.
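To make the parallel-evaluation point concrete, the sketch below shows how a population's fitness values could be computed simultaneously across worker processes; `evaluate` is a hypothetical stand-in for training an agent with one candidate hyperparameter set and returning its fitness, not the paper's actual routine.

```python
from multiprocessing import Pool

def evaluate(candidate):
    # Hypothetical stand-in: in the real setting this would train the RL
    # agent with the candidate hyperparameters and return a fitness score
    # (e.g., the inverse of the number of epochs needed to learn the task).
    return sum(candidate)  # dummy fitness so the sketch runs as-is

def evaluate_population(population, workers=4):
    """Evaluate every member of the population in parallel."""
    with Pool(processes=workers) as pool:
        return pool.map(evaluate, population)

if __name__ == "__main__":
    toy_population = [[0, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 1]]
    print(evaluate_population(toy_population, workers=2))
```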
[13], [14], [15], and [16] are some of the closely related works. Their findings
add to the growing body of evidence that using a GA to automatically tune
the hyperparameters of DDPG+HER can greatly enhance efficiency: the
difference in hyperparameter values can have a notable impact on how long
it takes a learning agent to learn.
In this paper, we develop a novel automatic hyperparameter tuning approach,
which we apply to the DDPG+HER implementation from [17]. The algorithm is then
applied to four existing robotic manipulator gym environments as well as three
custom-built robotic manipulator gym environments. Furthermore, the entire
algorithm is examined at various stages to determine whether the technique is
effective in increasing the overall efficiency of the learning process. The final
results support our claim and provide sufficient evidence that automating the
hyperparameter tuning procedure is critical, since it considerably reduces learning time.
2 Background
2.1 Genetic Algorithm (GA)
Genetic Algorithms (GAs) [9, 18, 19] were created to explore poorly-
understood areas [20], where an exhaustive search is impossible and other
search methods perform poorly. GAs, when employed as function optimizers,
aim to maximize a fitness measure that is linked to the optimization goal.
Evolutionary computing techniques in general, and GAs in particular, have had
considerable empirical success on a range of difficult design and optimization
problems. They begin
with a population of candidate solutions that have been randomly initialized
and are commonly encoded in a string (chromosome). A selection operator
narrows the search space to the most promising locations, whereas crossover
and mutation operators provide new potential solutions.
To choose parents for crossover and mutation, we employed ranking
selection [21]. Higher-ranked (fitter) individuals are probabilistically selected
through rank selection. Unlike fitness proportionate selection, ranking selection
is concerned with the existence of a fitness difference rather than its magni-
tude. Uniform crossover [22] is used to create children, who are then altered
by flip mutation [19]. Binary coding with concatenated hyperparameters is
used to encode chromosomes. [23] shows one such example of GA paired with
Lidar-monocular visual odometry (LIMO).
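As a concrete illustration of these operators, the sketch below implements ranking selection, uniform crossover, and flip mutation over binary chromosomes; the population size, chromosome length, and rates are illustrative assumptions rather than the paper's exact settings.

```python
import random

# Illustrative settings (not the paper's exact values).
CHROMOSOME_LEN = 28      # total bits across all encoded hyperparameters
POP_SIZE = 20
CROSSOVER_RATE = 0.9
MUTATION_RATE = 1.0 / CHROMOSOME_LEN

def random_chromosome():
    return [random.randint(0, 1) for _ in range(CHROMOSOME_LEN)]

def rank_select(population, fitnesses):
    """Ranking selection: selection probability depends on rank, not fitness magnitude."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    ranks = [0] * len(population)
    for rank, idx in enumerate(order, start=1):   # 1 = worst, N = best
        ranks[idx] = rank
    pick = random.uniform(0, sum(ranks))
    acc = 0.0
    for idx, r in enumerate(ranks):
        acc += r
        if acc >= pick:
            return population[idx]
    return population[-1]

def uniform_crossover(p1, p2):
    """Each bit of the children is copied from either parent with equal probability."""
    if random.random() > CROSSOVER_RATE:
        return p1[:], p2[:]
    c1, c2 = [], []
    for b1, b2 in zip(p1, p2):
        if random.random() < 0.5:
            c1.append(b1); c2.append(b2)
        else:
            c1.append(b2); c2.append(b1)
    return c1, c2

def flip_mutate(chrom):
    """Flip each bit independently with a small probability."""
    return [1 - b if random.random() < MUTATION_RATE else b for b in chrom]

def next_generation(population, fitness_fn):
    """One GA generation: rank-select parents, recombine, and mutate."""
    fitnesses = [fitness_fn(c) for c in population]
    children = []
    while len(children) < len(population):
        p1 = rank_select(population, fitnesses)
        p2 = rank_select(population, fitnesses)
        c1, c2 = uniform_crossover(p1, p2)
        children.extend([flip_mutate(c1), flip_mutate(c2)])
    return children[:len(population)]
```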
control method [35]. DRL examines the environment in more detail than
classical control theory does. When used on robots, this capacity produces
more intelligent and human-like behavior. Robots can thoroughly examine
the environment and discover useful solutions when DRL techniques are
combined with adequate training [36].
There are two types of RL methods: off-policy and on-policy. On-policy
methods seek to evaluate or improve the policy that is used to make decisions,
e.g., SARSA [37]. For real robot systems, off-policy methods [38] such as the
Deep Deterministic Policy Gradient algorithm (DDPG) [39] and the Normal-
ized Advantage Function algorithm (NAF) [42], as well as on-policy methods
such as Proximal Policy Optimization (PPO) [40] and Advantage Actor-Critic
(A2C) [41], are useful. There is also a lot of work
on robotic manipulators [43, 44]. Some of this work relied on fuzzy wavelet
networks [45], while others relied on neural networks [46, 47] to complete their
goals. [48] provides a comprehensive overview of modern deep reinforcement
learning (DRL) algorithms for robot manipulation. Goal-conditioned reinforce-
ment learning frames each task in terms of the intended outcome and, in
theory, has the potential to teach a variety of skills [49]. The robustness and
sample efficiency of goal-reaching methods are frequently enhanced by hind-
sight experience replay (HER) [50]. In our experiments, we employ DDPG in
conjunction with HER. [51] describes recent work on applying experience
ranking to increase the learning pace of DDPG+HER.
Both single-robot [52, 53] and multi-robot [54–58] systems have been
extensively trained using RL. Both model-based and model-free learn-
ing algorithms have been studied previously. Model-based learning algorithms
are heavily reliant on a model-based teacher to train deep network policies in
real-world circumstances.
Similarly, there has been a lot of work on GAs [9, 59] and the GA operators
of crossover and mutation [60], which have been applied to a broad variety of
problems. GA has been used to solve a wide range of RL problems [60–63].
2.5 GA on DDPG+HER
As a function optimizer, GA can be used to solve a variety of optimization
problems. This study concentrates on DDPG+HER, which was briefly dis-
cussed earlier in this paper. GAs can be used to optimize the hyperparameters
of the system based on their fitness values; the GA seeks to maximize fitness.
Various mathematical formulas can be used to convert an objective function
into a fitness function.
Existing DDPG+HER algorithms have a set of hyperparameters that can-
not be changed. When GA is applied to DDPG+HER, it discovers a better set
of hyperparameters, allowing the learning agent to learn more quickly. The fit-
ness value for this problem is the inverse of the number of epochs needed to
learn the task. GA appears to be a promising technique for improving the
system's efficiency.
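A minimal sketch of this fitness definition is shown below; `decode` and `train_ddpg_her` are hypothetical stand-ins for the chromosome decoder and the DDPG+HER training routine, not the paper's actual code.

```python
def fitness(chromosome, decode, train_ddpg_her):
    """GA fitness for one candidate hyperparameter set.

    `decode` maps the binary chromosome to a dict of hyperparameters and
    `train_ddpg_her` runs DDPG+HER with them, returning the number of
    epochs the agent needed to reach the target success rate (both are
    hypothetical stand-ins for the actual routines).
    """
    hyperparams = decode(chromosome)
    epochs_to_success = train_ddpg_her(**hyperparams)
    # Fewer epochs -> higher fitness; the GA maximizes this value.
    return 1.0 / epochs_to_success
```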
Each episode begins with an initial state $s_0$. At each timestep $t$, the agent
takes an action $a_t$ based on the current state $s_t$: $a_t = \pi(s_t)$. The executed
action is rewarded with $r_t = r(s_t, a_t)$, and the next state of the environment
is sampled from the distribution $p(\cdot \mid s_t, a_t)$. The discounted sum of future
rewards is $R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$. The purpose of the agent is to maximize its
expected return $\mathbb{E}[R_t \mid s_t, a_t]$, and an optimal policy can be defined as any
policy $\pi^*$ such that $Q^{\pi^*}(s, a) \geq Q^{\pi}(s, a)$ for every $s \in S$, $a \in A$, and any
policy $\pi$. All optimal policies share the same optimal Q-function $Q^*$, which
fulfills the Bellman equation:
$$Q^*(s, a) = \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[r(s, a) + \gamma \max_{a'} Q^*(s', a')\big]. \tag{1}$$
In DDPG, the target critic and actor networks are updated slowly (soft updates) with rate $\tau$:
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\,\theta^{\mu'}. \tag{2}$$
The critic is trained toward the target
$$y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big), \tag{3}$$
and the temporal-difference update with learning rate $\alpha$ is
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]. \tag{4}$$
Because we have two types of networks, we need two learning rates: one for
the actor network (αactor) and the other for the critic network (αcritic).
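For illustration, the soft target-network update of equation (2) and the critic target of equation (3) can be sketched as below; the constants and the flat parameter vectors are placeholders, not the exact implementation from [17].

```python
import numpy as np

TAU = 0.05     # illustrative value; tau is one of the GA-tuned hyperparameters
GAMMA = 0.98   # illustrative discount factor, also GA-tuned

def soft_update(target_params, source_params, tau=TAU):
    """Polyak averaging of target-network weights, as in equation (2)."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source_params, target_params)]

def td_target(reward, next_q_value, gamma=GAMMA):
    """Critic target y_i from equation (3); next_q_value stands for
    Q'(s_{i+1}, mu'(s_{i+1})) evaluated with the target networks."""
    return reward + gamma * next_q_value

# Toy usage with flat weight vectors standing in for network parameters.
critic_target = soft_update([np.zeros(4)], [np.ones(4)])
```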
Equation (5) describes the fraction of timesteps on which a random action is
executed (the exploration rate ε):
$$a_t = \begin{cases} a_t^{*} & \text{with probability } 1 - \epsilon,\\ \text{random action} & \text{with probability } \epsilon. \end{cases} \tag{5}$$
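A direct transcription of equation (5) is sketched below; `policy` and `action_space` are hypothetical placeholders for the deterministic actor and a gym-style action space.

```python
import numpy as np

def select_action(policy, state, action_space, epsilon):
    """Epsilon-greedy action selection as in equation (5).

    `policy` and `action_space` are hypothetical placeholders for the actor
    network and a gym-style action space with a `sample()` method.
    """
    if np.random.rand() < epsilon:
        return action_space.sample()   # random action with probability epsilon
    return policy(state)               # greedy (deterministic) action otherwise
```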
Fig. 1: Success rate vs. epochs for various τ for FetchPick&Place-v1 task.
4 Experimental Results
4.1 Experimental setup
A chromosome is binary encoded, as previously stated. Each chromosome
string is formed by concatenating the encodings of all of the GA's arguments
(the hyperparameters being tuned). Figure 4 depicts an example chromosome
with four binary-encoded hyperparameters.
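For illustration, decoding such a concatenated chromosome back into hyperparameter values could look like the sketch below; the bit widths and value ranges are assumptions for the example, not the paper's actual encoding.

```python
def bits_to_float(bits, low, high):
    """Map a binary substring to a real value in [low, high]."""
    as_int = int("".join(str(b) for b in bits), 2)
    return low + (high - low) * as_int / (2 ** len(bits) - 1)

# Assumed layout: (name, number of bits, lower bound, upper bound).
LAYOUT = [
    ("tau",          7, 0.001, 0.5),
    ("gamma",        7, 0.90,  0.999),
    ("epsilon",      7, 0.0,   0.5),
    ("alpha_actor",  7, 1e-4,  1e-2),
]

def decode(chromosome):
    """Split the concatenated chromosome and decode each hyperparameter."""
    values, start = {}, 0
    for name, nbits, low, high in LAYOUT:
        values[name] = bits_to_float(chromosome[start:start + nbits], low, high)
        start += nbits
    return values
```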
Fig. 2: Success rate vs. epochs for FetchPush-v1 task when τ and γ are found
using the GA.
Fig. 3: Success rate vs. epochs for FetchSlide-v1 task when τ and γ are found
using the GA.
Fig. 5: DDPG+HER versus GA+DDPG+HER plots when all six hyper-
parameters are identified by the GA. All graphs are averaged over ten runs.
Fig. 6: The environments and the corresponding DDPG+HER versus
GA+DDPG+HER plots when all six hyperparameters are found by the GA.
The graphs are averaged over ten runs.
4.2 Running GA
We ran the GA independently on each of these scenarios to test the effec-
tiveness of our method and compared results to the hyperparameters’ original
values. We would anticipate that, when comparing GA+DDPG+HER with
DDPG+HER, PPO, A2C, and DDPG, the algorithm requiring the fewest
episodes, steps, and average epochs, and the least running time, will perform
best. This would illustrate why optimizing hyperparameters is crucial, as
opposed to using the algorithms' built-in default hyperparameters. Figure 2a
depicts the outcome of our
FetchPush-v1 experiment. We used the GA to run the system and discover
the best values for the hyperparameters τ and γ. We display results from ten
GA runs because the GA is probabilistic, and the results show that optimized
hyperparameters determined by the GA can lead to greater performance: the
learning agent can learn faster and achieve a higher success rate.
Fig. 7: Using the most accurate policy learned via GA+DDPG+HER, the
AuboReach environment performs a task in a real experiment.
Figure 2b
depicts one learning run for the initial hyperparameter set, as well as the aver-
age learning for the ten GA iterations.
Figure 3 (b) shows a comparison of one original experiment with two aver-
aged runs for optimizing hyperparameters τ and γ. Because this operation can
take several hours to complete in a single run, and because it was a compo-
nent of one of our initial tests, we only performed it twice. As we evaluated the
genetic algorithm, the results displayed in Figures 2 and 3 demonstrate changes
when only two hyperparameters are tuned. We can see the potential for per-
formance improvement. Our findings from optimizing all six hyperparameters
back up our optimism, and they are detailed below.
After that, GA was used to optimize all hyperparameters, with the results
presented in Figures 5 and 6 for each task. Table 1 compares the GA-discovered
hyperparameters to the RL algorithm’s initial hyperparameters. All the sim-
ulated environments, with the exception of AuboReach, used the same set
of hyperparameters discovered by GA+DDPG+HER. As seen in Table 1, a
different set of hyperparameters was generated for AuboReach. The learning
rates αactor and αcritic are unchanged from their initial values, while the
other four hyperparameters have different values. Figures 5 and 6 indicate that
the GA-discovered hyperparameters outperformed the original hyperparame-
ters, suggesting that the learning agent was able to learn more quickly. All of
the plots in these figures are averaged over ten runs.
settings. This is because the movement speed of the Aubo i5 robotic manipu-
lator was kept slow in both simulation and real-world studies to avoid any
unexpected sudden movements, which could result in harm. In the AuboReach
setting, there were also planning and execution processes involved in the
successful completion of each action. AuboReach, unlike the other gym environ-
ments covered in this study, could only run on a single CPU. This is because
the other environments were implemented in MuJoCo and could easily run
with the maximum number of CPUs available; MuJoCo can create several
instances for training, which allows for faster learning. AuboReach must
perform only one action at a time, much like a real robot. Because of these
characteristics, training in this setting takes a long time.
(b) Training is done with random initial and target joint states with 1 CPU
Fig. 9: The AuboReach task's success rate against epochs. This graph is
averaged over ten runs.
a success. [-0.503, 0.605, -1.676, 1.391] were chosen as the target joint states.
We were able to find a new set of hyperparameters using these changes to
the algorithm, as shown in Table 1. Figure 9a shows the difference in success
rates between DDPG+HER and GA+DDPG+HER during training. Without
a doubt, the GA+DDPG+HER outperforms the DDPG+HER.
After the GA+DDPG+HER had determined the best hyperparameters for
the AuboReach environment, the training was repeated using four CPUs to
determine the best policy. The robots were then subjected to this policy in
both simulated and real-world testing. The crucial point to remember here
is that for testing purposes, CPU consumption was reduced to one. In both
studies, the robot was able to transition from the training’s beginning joint
space to the goal joint space. Because unpredictability was not added during
training, the environment was constrained to only one possible path. Since
both DDPG+HER and GA+DDPG+HER eventually achieved a 100% suc-
cess rate, there was no discernible difference during testing. The main
distinction is how quickly the agent can learn given a set of
hyperparameters.
The AuboReach environment was updated in another experiment to train
on random joint states. The robot may now start and reach objectives in
various joint states during testing thanks to the update. The GA was run on
this environment, and the hyperparameters discovered by the GA are shown in
table 1. Figure 9b shows that the plot of GA+DDPG+HER is still better than
DDPG+HER. Figures 7 and 8 show the robot in action as it accomplishes the
task of picking the object in real and simulated tests.
The use of GA+DDPG+HER in the AuboReach environment resulted
in automatic DDPG+HER hyperparameter adjustment, which improved the
algorithm’s performance.
hyperparameter was used in the tables to mark the tasks that did not
attain the target during training.
Fig. 11: GA+DDPG+HER training evaluation plots when all six hyperparam-
eters are identified by the GA (panels include, e.g., FetchSlide and DoorOpening:
median success rate vs. times the GA fitness function was evaluated). This
result is from one GA run.
We also generated more data to evaluate the episodes, running time (s),
steps, and epochs that an agent must learn to achieve the desired goal. Tables
2-5 present this information. The information in the tables is an average of
ten runs. Table 2 compares the number of episodes an agent needs to achieve
a goal. The bolded values indicate superior performance, and the majority of
them favor GA+DDPG+HER, which means it takes less time to learn the
task.
Fig. 13: When all six hyperparameters are determined using GA, the
DDPG+HER vs. GA+DDPG+HER efficiency evaluation plots (total reward
vs. episodes). All graphs are averaged over ten runs.
In the majority of the environments, the
GA+DDPG+HER algorithm had the lowest running time, as shown in Table
3. For example, when compared to the DDPG+HER algorithm, FetchPush
with GA+DDPG+HER takes about 57.004% less time.
The average number of steps necessary to achieve the goal is another factor
to consider when analyzing and investigating the GA+DDPG+HER algo-
rithm’s performance. The average number of steps taken by an agent in each
environment is shown in Table 4. Except for the FetchSlide environment, most
of the environments, when employed with GA+DDPG+HER, outperform all
other algorithms. When compared to the DDPG+HER algorithm, FetchPush
with GA+DDPG+HER takes about 54.35% fewer steps.
The number of epochs taken by the agent to attain the goal is the final
metric used to compare the competency of GA+DDPG+HER with the
four other algorithms. The average epochs for all of the environments are
shown in Table 5. In almost all environments, GA+DDPG+HER is the most
effective. FetchPush, for example, uses 54.35% fewer epochs with
GA+DDPG+HER than it does with DDPG+HER. Following that, we give a
comparison of the GA+DDPG+HER algorithm to the other algorithms.
4.6 Analysis
We provided various findings and the mechanism for judging the efficacy of
GA+DDPG+HER versus DDPG+HER, PPO, A2C, and DDPG algorithms
in the previous sub-sections. Overall, GA+DDPG+HER works best, with
the one exception of the FetchSlide environment. The comparison tables
illustrate that each environment can take on different values of the evaluation
metrics, which is determined by the type of task the agent is attempting to
learn. While the majority of the tasks outperformed DDPG+HER with more
than a 50% increase in efficiency, FetchSlide underperformed DDPG+HER.
This performance is attributed to the nature of the task's goal: the end-effector
does not physically go to the desired position to place the box, which makes
this task unique. GA+DDPG+HER was tested using a variety of hyper-
parameters and averaged over ten runs. This is sufficient evidence that
GA+DDPG+HER outperformed several algorithms. Figures 5, 6 and 9b
support our claim by demonstrating that when GA+DDPG+HER is used,
the learning agent learns the task faster.
Declarations
Funding
This work is supported by the U.S. National Science Foundation (NSF) under
grants NSF-CAREER: 1846513 and NSF-PFI-TT: 1919127, and the U.S.
Department of Transportation, Office of the Assistant Secretary for Research
and Technology (USDOT/OST-R) under Grant No. 69A3551747126 through
INSPIRE University Transportation Center, and the Vingroup Joint Stock
Company and supported by Vingroup Innovation Foundation (VINIF) under
project code VINIF.2020.NCUD.DA094. The views, opinions, findings, and
conclusions reflected in this publication are solely those of the authors and do
not represent the official policy or position of the NSF, USDOT/OST-R, and
VINIF.
Code/Data availability
Open source code, and data used in the study is available at https://github.
com/aralab-unr/ga-drl-aubo-ara-lab.
Authors’ contributions
Adarsh Sehgal is the first author and primary contributor to this paper. Adarsh
did the research and wrote this manuscript. Nicholas Ward assisted Adarsh in
code execution and gathering results. Drs. Hung La and Sushil Louis advised
and oversaw this study.
Ethics approval
Not applicable
Consent to participate
All authors are fully informed about the content of this paper, and consent
was given for participation.
References
[1] Sutton, R.S., Barto, A.G., et al.: Introduction to Reinforcement Learning
vol. 135. MIT press, Cambridge (1998)
[2] Tebbe, J., Krauch, L., Gao, Y., Zell, A.: Sample-efficient reinforcement
learning in robotic table tennis. In: 2021 IEEE International Conference
on Robotics and Automation (ICRA), pp. 4171–4178 (2021). https://doi.
org/10.1109/ICRA48506.2021.9560764
[3] Xu, J., Li, B., Lu, B., Liu, Y.-H., Dou, Q., Heng, P.-A.: SurRoL: An open-
source reinforcement learning centered and dVRK compatible platform for
surgical robot learning. In: 2021 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pp. 1821–1828 (2021). https://
doi.org/10.1109/IROS51168.2021.9635867
[4] Chiu, Z.-Y., Richter, F., Funk, E.K., Orosco, R.K., Yip, M.C.: Biman-
ual regrasping for suture needles using reinforcement learning for rapid
motion planning. In: 2021 IEEE International Conference on Robotics
and Automation (ICRA), pp. 7737–7743 (2021). https://doi.org/10.1109/
ICRA48506.2021.9561673
[5] Marchesini, E., Corsi, D., Farinelli, A.: Benchmarking safe deep reinforce-
ment learning in aquatic navigation. In: 2021 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pp. 5590–5595
(2021). https://doi.org/10.1109/IROS51168.2021.9635925
[6] Probst, P., Boulesteix, A.-L., Bischl, B.: Tunability: Importance of hyper-
parameters of machine learning algorithms. The Journal of Machine
Learning Research 20(1), 1934–1965 (2019)
[7] Zhang, B., Rajan, R., Pineda, L., Lambert, N., Biedenkapp, A., Chua, K.,
Hutter, F., Calandra, R.: On the importance of hyperparameter optimiza-
tion for model-based reinforcement learning. In: International Conference
on Artificial Intelligence and Statistics, pp. 4015–4023 (2021). PMLR
[10] Nguyen, A.-T., Reiter, S., Rigo, P.: A review on simulation-based opti-
mization methods applied to building performance analysis. Applied
[13] Sehgal, A., La, H., Louis, S., Nguyen, H.: Deep reinforcement learn-
ing using genetic algorithm for parameter optimization. In: 2019 Third
IEEE International Conference on Robotic Computing (IRC), pp. 596–601
(2019). IEEE
[15] Sehgal, A., Ward, N., La, H.M., Papachristos, C., Louis, S.: Ga-drl:
Genetic algorithm-based function optimizer in deep reinforcement learn-
ing for robotic manipulation tasks. arXiv preprint arXiv:2203.00141
(2022)
[16] Sehgal, A., Singandhupe, A., La, H.M., Tavakkoli, A., Louis, S.J.:
Lidar-monocular visual odometry with genetic algorithm for parameter
optimization. In: Bebis, G., et al.(eds.) Advances in Visual Computing,
pp. 358–370. Springer, Cham (2019)
[17] Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A.,
Schulman, J., Sidor, S., Wu, Y., Zhokhov, P.: OpenAI Baselines. GitHub
(2017)
[19] Goldberg, D.E., Holland, J.H.: Genetic algorithms and machine learning.
Machine learning 3(2), 95–99 (1988)
[23] Sehgal, A., Singandhupe, A., La, H.M., Tavakkoli, A., Louis, S.J.:
Lidar-monocular visual odometry with genetic algorithm for parameter
optimization. In: International Symposium on Visual Computing, pp.
358–370 (2019). Springer
[24] Watkins, C.J., Dayan, P.: Q-learning. Machine learning 8(3-4), 279–292
(1992)
[25] La, H.M., Lim, R., Sheng, W.: Multirobot cooperative learning for preda-
tor avoidance. IEEE Transactions on Control Systems Technology 23(1),
52–63 (2015). https://doi.org/10.1109/TCST.2014.2312392
[27] Doya, K.: Reinforcement learning in continuous time and space. Neural
computation 12(1), 219–245 (2000)
[30] Wei, Q., Lewis, F.L., Sun, Q., Yan, P., Song, R.: Discrete-time deter-
ministic q-learning: A novel convergence analysis. IEEE transactions on
cybernetics 47(5), 1224–1237 (2017)
[31] Kohl, N., Stone, P.: Policy gradient reinforcement learning for fast
quadrupedal locomotion. In: Robotics and Automation, 2004. Proceed-
ings. ICRA’04. 2004 IEEE International Conference On, vol. 3, pp.
2619–2624 (2004). IEEE
[32] Endo, G., Morimoto, J., Matsubara, T., Nakanishi, J., Cheng, G.:
Learning CPG-based biped locomotion with a policy gradient method:
Application to a humanoid robot. The International Journal of Robotics
Research 27(2), 213–228 (2008)
[33] Peters, J., Mülling, K., Altun, Y.: Relative entropy policy search. In:
AAAI, pp. 1607–1612 (2010). Atlanta
[34] Kalakrishnan, M., Righetti, L., Pastor, P., Schaal, S.: Learning force
control policies for compliant manipulation. In: Intelligent Robots and
Systems (IROS), 2011 IEEE/RSJ International Conference On, pp.
4639–4644 (2011). IEEE
[35] Vargas, J.C., Bhoite, M., Barati Farimani, A.: Creativity in robot manip-
ulation with deep reinforcement learning. arXiv e-prints, 1910 (2019)
[36] Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics: A
survey. The International Journal of Robotics Research 32(11), 1238–1274
(2013)
[38] Munos, R., Stepleton, T., Harutyunyan, A., Bellemare, M.: Safe and
efficient off-policy reinforcement learning. In: Advances in Neural Infor-
mation Processing Systems, pp. 1054–1062 (2016)
[39] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y.,
Silver, D., Wierstra, D.: Continuous control with deep reinforcement
learning. arXiv preprint arXiv:1509.02971 (2015)
[40] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
[41] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T.,
Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforce-
ment learning. In: International Conference on Machine Learning, pp.
1928–1937 (2016). PMLR
[42] Gu, S., Lillicrap, T., Sutskever, I., Levine, S.: Continuous deep q-learning
with model-based acceleration. In: International Conference on Machine
Learning, pp. 2829–2838 (2016)
[43] Deisenroth, M.P., Rasmussen, C.E., Fox, D.: Learning to control a low-
cost manipulator using data-efficient reinforcement learning (2011)
[44] Jin, L., Li, S., La, H.M., Luo, X.: Manipulability optimization of redun-
dant manipulators using dynamic neural networks. IEEE Transactions on
Industrial Electronics 64(6), 4710–4720 (2017). https://doi.org/10.1109/
TIE.2017.2674624
(2009)
[46] Miljković, Z., Mitić, M., Lazarević, M., Babić, B.: Neural network rein-
forcement learning for visual control of robot manipulators. Expert
Systems with Applications 40(5), 1721–1736 (2013)
[47] Duguleana, M., Barbuceanu, F.G., Teirelbar, A., Mogan, G.: Obstacle
avoidance of redundant manipulators using neural networks based rein-
forcement learning. Robotics and Computer-Integrated Manufacturing
28(2), 132–146 (2012)
[48] Nguyen, H., La, H.M.: Review of deep reinforcement learning for robot
manipulation. In: 2019 Third IEEE International Conference on Robotic
Computing (IRC), pp. 590–595 (2019). IEEE
[50] Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder,
P., McGrew, B., Tobin, J., Abbeel, O.P., Zaremba, W.: Hindsight experi-
ence replay. In: Advances in Neural Information Processing Systems, pp.
5048–5058 (2017)
[51] Nguyen, H., La, H.M., Deans, M.: Deep learning with experience ranking
convolutional neural network for robot manipulator. arXiv:1809.05819,
cs.RO (2018) https://arxiv.org/abs/1809.05819
[52] Pham, H.X., La, H.M., Feil-Seifer, D., Nguyen, L.V.: Autonomous UAV
navigation using reinforcement learning. arXiv:1801.05086, cs.RO (2018)
https://arxiv.org/abs/1801.05086
[53] Pham, H.X., La, H.M., Feil-Seifer, D., Nguyen, L.V.: Reinforcement learn-
ing for autonomous UAV navigation using function approximation. In: 2018
IEEE International Symposium on Safety, Security, and Rescue Robotics
(SSRR), pp. 1–6 (2018). https://doi.org/10.1109/SSRR.2018.8468611
[54] La, H.M., Lim, R.S., Sheng, W., Chen, J.: Cooperative flocking and
learning in multi-robot systems for predator avoidance. In: 2013 IEEE
International Conference on Cyber Technology in Automation, Control
and Intelligent Systems, pp. 337–342 (2013). https://doi.org/10.1109/
CYBER.2013.6705469
[55] La, H.M., Sheng, W., Chen, J.: Cooperative and active sensing in mobile
sensor networks for scalar field mapping. IEEE Transactions on Systems,
Man, and Cybernetics: Systems 45(1), 1–12 (2015). https://doi.org/10.
1109/TSMC.2014.2318282
[56] Pham, H.X., La, H.M., Feil-Seifer, D., Nefian, A.: Cooperative
and distributed reinforcement learning of drones for field coverage.
arXiv:1803.07250, cs.RO (2018) https://arxiv.org/abs/1803.07250
[57] Dang, A.D., La, H.M., Horn, J.: Distributed formation control for
autonomous robots following desired shapes in noisy environment. In:
2016 IEEE International Conference on Multisensor Fusion and Integra-
tion for Intelligent Systems (MFI), pp. 285–290 (2016). https://doi.org/
10.1109/MFI.2016.7849502
[58] Rahimi, M., Gibb, S., Shen, Y., La, H.M.: A comparison of vari-
ous approaches to reinforcement learning algorithms for multi-robot
box pushing. In: International Conference on Engineering Research and
Applications, pp. 16–30 (2018). Springer
[59] Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist mul-
tiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary
Computation 6(2), 182–197 (2002)
[60] Poon, P.W., Carter, J.N.: Genetic algorithm crossover operators for
ordering applications. Computers & Operations Research 22(1), 135–147
(1995)
[61] Liu, F., Zeng, G.: Study of genetic algorithm with reinforcement learn-
ing to solve the TSP. Expert Systems with Applications 36(3), 6995–7001
(2009)
[63] Mikami, S., Kakazu, Y.: Genetic reinforcement learning for cooperative
traffic signal control. In: Evolutionary Computation, 1994. IEEE World
Congress on Computational Intelligence., Proceedings of the First IEEE
Conference On, pp. 223–228 (1994). IEEE
[64] Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradi-
ent methods for reinforcement learning with function approximation.
Advances in neural information processing systems 12 (1999)
[65] Alagha, A., Singh, S., Mizouni, R., Bentahar, J., Otrok, H.: Target
localization using multi-agent deep reinforcement learning with proximal
policy optimization. Future Generation Computer Systems 136, 342–357
(2022)
[66] Wu, C.-A.: Investigation of Different Observation and Action Spaces for
Reinforcement Learning on Reaching Tasks (2019)
[67] Melo, L.C., Máximo, M.R.O.A.: Learning humanoid robot running skills
through proximal policy optimization. In: 2019 Latin American Robotics
Symposium (LARS), 2019 Brazilian Symposium on Robotics (SBR) and
2019 Workshop on Robotics in Education (WRE), pp. 37–42 (2019).
IEEE
[69] Scholte, N.: Goal, mistake and success learning through resimulation.
Master’s thesis (2022)
[70] Haydari, A., Zhang, M., Chuah, C.-N., Ghosal, D.: Impact of deep rl-
based traffic signal control on air quality. In: 2021 IEEE 93rd Vehicular
Technology Conference (VTC2021-Spring), pp. 1–6 (2021). IEEE
[72] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J.,
Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540
(2016)
[75] Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell,
G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., et al.: Multi-goal
reinforcement learning: Challenging robotics environments and request for
research. arXiv preprint arXiv:1802.09464 (2018)