Reinforcement Learning-Based Multi-AUV Adaptive Trajectory Planning for Under-Ice Field Estimation
Abstract
1. Introduction
- The developed algorithm is non-myopic and designed for multiple AUVs, whereas existing works consider either non-myopic planning for a single vehicle [11,12,13,14] or myopic planning for multiple vehicles [8,10,17,18]. To tackle the high computational cost of non-myopic multi-vehicle planning, instead of using Monte Carlo tree search, we employ a learning algorithm that can be implemented via parallel computation. To further speed up the convergence of the planning algorithm, the decision-making strategy is adjusted on the fly by transferring the knowledge learned in previous epochs.
2. System Model
2.1. System Description
2.2. Autonomous Underwater Vehicles Trajectory Modeling
- Kinematics constraint: Due to the limited travel speed of an AUV, the distance between any two consecutive waypoints for each AUV is bounded from above.
- Communication constraint: In each epoch, since the AUV needs to send its field samples to an access point when it arrives at the last waypoint, the AUV must be within the communication range of at least one of the access points at that waypoint.
- Sensing area constraint: All the AUVs should stay within the area of interest. (A sketch of these three feasibility checks follows this list.)
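The three constraints can be read as simple feasibility checks on each candidate waypoint. Below is a minimal sketch, assuming 2D waypoints, a per-epoch travel bound d_max, a communication radius r_comm, and a rectangular area of interest; all symbols and the Euclidean-distance forms are illustrative, not the paper's notation:

```python
import numpy as np

def feasible(prev_wp, next_wp, access_points, area_min, area_max,
             d_max, r_comm, is_last_waypoint):
    """Illustrative check of the three trajectory constraints for one AUV."""
    prev_wp, next_wp = np.asarray(prev_wp, float), np.asarray(next_wp, float)
    # Kinematics constraint: the limited travel speed bounds the distance
    # between consecutive waypoints.
    if np.linalg.norm(next_wp - prev_wp) > d_max:
        return False
    # Communication constraint: at the last waypoint of an epoch the AUV
    # must be within range of at least one access point.
    if is_last_waypoint:
        dists = np.linalg.norm(np.asarray(access_points, float) - next_wp, axis=1)
        if dists.min() > r_comm:
            return False
    # Sensing-area constraint: stay inside the (rectangular) area of interest.
    return bool(np.all(next_wp >= area_min) and np.all(next_wp <= area_max))
```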
2.3. Unknown Field Modeling
3. Problem Formulation for Adaptive Trajectory Planning
3.1. Gaussian Process Regression for Field Estimation
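The field estimates follow standard Gaussian process regression. In the textbook notation of [20,21] (the symbols here are the standard ones, not necessarily the paper's), the posterior at a test location x* is:

```latex
\mu(x_*) = k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} \mathbf{y}, \qquad
\sigma^2(x_*) = k(x_*, x_*) - k_*^{\top} \left( K + \sigma_n^2 I \right)^{-1} k_*
```

Here K is the covariance matrix of the sampled locations, k_* the vector of covariances between x_* and the sampled locations, y the collected samples, and σ_n² the measurement-noise variance. The posterior variance σ²(x_*) is the quantity that the uncertainty-reduction reward in Section 3.2.1 acts on.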
3.2. Problem Formulation for Optimal Trajectory Planning
3.2.1. Reward Function
- Uncertainty reduction reward: Given the system mission objective of minimizing the field uncertainty over the target locations, a reward is associated with the reduction of the field uncertainty achieved by performing action a at system state s.
- Trajectory cost: Since the AUV energy consumption increases with the travel distance and the turning angle, a mobility cost is associated with action a.
- Trajectory constraint penalty: The kinematics constraint in Equation (1) is addressed in the algorithm design for solving the optimization problem in Equation (14) (made clear in Section 4.2). The constraints in Equations (2) and (3) are tackled by introducing a penalty term into the objective function: zero penalty is applied when both constraints are satisfied, and an extremely large penalty is incurred when either of the two is violated. (A sketch of how the three terms might combine follows this list.)
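Combining the three terms, a hedged sketch of the per-action reward follows; the weights, the variance-sum form of the information term, and the penalty magnitude are all assumptions rather than the paper's definitions:

```python
import numpy as np

BIG_PENALTY = 1e6  # "extremely large" penalty for infeasible actions

def reward(post_var_before, post_var_after, dist, turn_angle,
           constraints_ok, w_dist=1.0, w_turn=1.0):
    """Illustrative reward: uncertainty reduction minus mobility cost
    minus a constraint penalty (weights and forms are assumptions)."""
    # Uncertainty-reduction term: drop in the summed posterior variance
    # over the target locations after incorporating the new samples.
    r_info = np.sum(post_var_before) - np.sum(post_var_after)
    # Mobility cost grows with travel distance and turning angle.
    c_move = w_dist * dist + w_turn * abs(turn_angle)
    # Zero penalty if Equations (2) and (3) hold, huge penalty otherwise.
    c_pen = 0.0 if constraints_ok else BIG_PENALTY
    return r_info - c_move - c_pen
```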
3.2.2. Bellman Optimality Equation
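With the per-epoch reward defined above, the planner's optimality condition takes the standard Bellman form [22,28], written here in textbook notation:

```latex
Q^{*}(s, a) = \mathbb{E}\left[ R(s, a) + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]
```

where γ is the discount factor listed among the algorithm inputs below and s' is the system state after action a is executed.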
4. Reinforcement Learning-Based Adaptive Trajectory Planning
4.1. Deep Deterministic Policy Gradient Basics and Design
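As a reference point for the MDDPG modifications in Section 4.2, here is a minimal sketch of one generic DDPG update in PyTorch, following the standard form of [29]; the network architectures, the replay buffer, and all tensor shapes are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def ddpg_step(actor, critic, target_actor, target_critic,
              batch, gamma, tau, actor_opt, critic_opt):
    """One generic DDPG update on a sampled mini-batch."""
    s, a, r, s_next = batch
    # Critic: regress Q(s, a) toward the bootstrapped one-step target
    # computed with the slowly-moving target networks.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: follow the deterministic policy gradient, i.e., ascend the
    # critic's value of the actor's own actions.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft-update the target networks with rate tau (the "learning rate
    # of the target networks" in Algorithm 1's input list).
    with torch.no_grad():
        for tgt, src in ((target_actor, actor), (target_critic, critic)):
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
```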
4.2. Training for Actions under Constraints
Algorithm 1: Modified deep deterministic policy gradient (MDDPG) algorithm.
Input: Initial epoch, total training episodes, total epochs in an episode, mini-batch size, discount factor, learning rate of the target networks, threshold value, action-adjust variance, the critic network Q with its weights, the actor network with its weights, the target critic network with its weights, the target actor network with its weights, the field hyper-parameters, and the current system state s.
Output: Optimal action set for future epochs; the critic network Q and the actor network with their updated weights; the target critic and target actor networks with their updated weights.
Algorithm 2: Random action adjust.
Input: Action a, threshold value, and action-adjust variance.
Output: Adjusted action a.
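The body of Algorithm 2 does not survive extraction. One plausible reading, given the threshold value and the action-adjust variance in its input list, is that an action failing a threshold test is repeatedly perturbed with zero-mean Gaussian noise of the given variance; the acceptance test score_fn (a feasibility or Q-value check) and the retry cap below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng()

def random_action_adjust(a, threshold, variance, score_fn, max_tries=100):
    """Hypothetical reading of Algorithm 2: perturb an action with
    zero-mean Gaussian noise until its score clears the threshold."""
    a = np.asarray(a, float)
    for _ in range(max_tries):
        if score_fn(a) >= threshold:
            break
        # Resample around the current action with the given variance.
        a = a + rng.normal(0.0, np.sqrt(variance), size=a.shape)
    return a
```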
4.3. Online Learning for Trajectory Planning with Unknown Field Hyper-Parameters
Algorithm 3: Online trajectory planning algorithm in each epoch.
Input: Current epoch, total training episodes, total epochs in an episode, mini-batch size, discount factor, learning rate of the target networks, threshold value, action-adjust variance, the critic network Q with its weights, the actor network with its weights, the target critic network with its weights, and the target actor network with its weights.
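Algorithm 3 interleaves planning with per-epoch re-estimation of the field hyper-parameters from the samples collected so far. The citation of the L-BFGS-B method [27] suggests a bound-constrained maximization of the GP log marginal likelihood; a sketch via scikit-learn, whose default optimizer is L-BFGS-B (the kernel composition below is an assumption), follows:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

def refit_hyperparameters(X, y):
    """Illustrative per-epoch hyper-parameter re-estimation: maximize the
    GP log marginal likelihood over the samples collected so far."""
    kernel = ConstantKernel() * RBF() + WhiteKernel()
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(X, y)            # runs the marginal-likelihood maximization
    return gpr.kernel_, gpr  # the fitted kernel carries the estimates
```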
4.4. Computational Complexity
5. Algorithm Evaluation
- Scheme 1: A clairvoyant method that determines the sampling trajectories through the offline MDDPG algorithm (Algorithm 1), based on perfect knowledge of the field hyper-parameters;
- Scheme 2: The proposed online reinforcement learning algorithm (Algorithm 3), which determines the sampling trajectories epoch by epoch through the MDDPG algorithm, with the field hyper-parameters estimated online in each epoch from the collected samples;
- Scheme 3: All the AUVs sample the water parameter field via a random walk. The presented result is the run with the maximal total reward among 10,000 Monte Carlo runs. (A note on the NMSE metric used in the comparison follows this list.)
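The evaluation tables report the normalized mean square error (NMSE). A common definition normalizes the mean squared estimation error by the variance of the true field; the normalization choice is an assumption here, though it is consistent with Scheme 3's NMSE exceeding 1:

```python
import numpy as np

def nmse(estimate, truth):
    """One common NMSE definition: mean squared estimation error
    over the target locations, divided by the true field's variance."""
    err = np.asarray(estimate, float) - np.asarray(truth, float)
    return float(np.mean(err**2) / np.var(truth))
```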
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
1. Thompson, D.; Caress, D.; Thomas, H.; Conlin, D. MBARI Mapping AUV Operations in the Gulf of California 2015. In Proceedings of the MTS/IEEE OCEANS Conference, Washington, DC, USA, 19–22 October 2015; pp. 1–7.
2. Thompson, D.; Caress, D.; Clague, D.; Conlin, D.; Harvey, J.; Martin, E.; Paduan, J.; Paull, C.; Ryan, J.; Thomas, H.; et al. MBARI Dorado AUV’s Scientific Results. In Proceedings of the MTS/IEEE OCEANS Conference, San Diego, CA, USA, 23–27 September 2013; pp. 1–9.
3. Kukulya, A.; Plueddemann, A.; Austin, T.; Stokey, R.; Purcell, M.; Allen, B.; Littlefield, R.; Freitag, L.; Koski, P.; Gallimore, E.; et al. Under-ice operations with a REMUS-100 AUV in the Arctic. In Proceedings of the IEEE/OES Autonomous Underwater Vehicles, Monterey, CA, USA, 1–3 September 2010.
4. Leonard, N.E.; Paley, D.A.; Lekien, F.; Sepulchre, R.; Fratantoni, D.M.; Davis, R.E. Collective motion, sensor networks, and ocean sampling. Proc. IEEE 2007, 95, 48–74.
5. Yilmaz, N.; Evangelinos, C.; Lermusiaux, P.; Patrikalakis, N. Path planning of autonomous underwater vehicles for adaptive sampling using mixed integer linear programming. IEEE J. Ocean. Eng. 2008, 33, 522–537.
6. Zhu, D.; Huang, H.; Yang, S.X. Dynamic task assignment and path planning of multi-AUV system based on an improved self-organizing map and velocity synthesis method in three-dimensional underwater workspace. IEEE Trans. Cybern. 2013, 43, 504–514.
7. Szwaykowska, K.; Zhang, F. Trend and bounds for error growth in controlled Lagrangian particle tracking. IEEE J. Ocean. Eng. 2014, 39, 10–25.
8. Xu, Y.; Choi, J.; Oh, S. Mobile sensor network navigation using Gaussian processes with truncated observations. IEEE Trans. Robot. 2011, 27, 1118–1131.
9. Marchant, R.; Ramos, F. Bayesian optimisation for informative continuous path planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 6136–6143.
10. Nguyen, L.; Kodagoda, S.; Ranasinghe, R.; Dissanayake, G. Information-driven adaptive sampling strategy for mobile robotic wireless sensor network. IEEE Trans. Control Syst. Technol. 2016, 24, 372–379.
11. Martinez-Cantin, R.; Freitas, N.; Brochu, E.; Castellanos, J.; Doucet, A. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Auton. Robot. 2009, 27, 93–103.
12. Singh, A.; Krause, A.; Kaiser, W. Nonmyopic adaptive informative path planning for multiple robots. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09), Pasadena, CA, USA, 11–17 July 2009; pp. 1843–1850.
13. Marchant, R.; Ramos, F.; Sanner, S. Sequential Bayesian optimization for spatial-temporal monitoring. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Quebec City, QC, Canada, 23–27 July 2014; pp. 553–562.
14. Morere, P.; Marchant, R.; Ramos, F. Sequential Bayesian optimization as a POMDP for environment monitoring with UAVs. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 6381–6388.
15. Binney, J.; Krause, A.; Sukhatme, G.S. Informative path planning for an autonomous underwater vehicle. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK, USA, 3–7 May 2010; pp. 4791–4796.
16. Hollinger, G.; Englot, B.; Hover, F.; Mitra, U.; Sukhatme, G. Uncertainty-driven view planning for underwater inspection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Saint Paul, MN, USA, 14–18 May 2012; pp. 4884–4891.
17. Marino, A.; Antonelli, G.; Aguiar, A.; Pascoal, A.; Chiaverini, S. A decentralized strategy for multirobot sampling/patrolling: Theory and experiments. IEEE Trans. Control Syst. Technol. 2015, 23, 313–322.
18. Kemna, S.; Rogers, J.G.; Nieto-Granda, C.; Young, S.; Sukhatme, G.S. Multi-robot coordination through dynamic Voronoi partitioning for informative adaptive sampling in communication-constrained environments. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2124–2130.
19. Marino, A.; Antonelli, G. Experimental results of coordinated sampling/patrolling by autonomous underwater vehicles. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 6–10 May 2013; pp. 4141–4146.
20. Rasmussen, C.E. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2004; pp. 63–71.
21. Williams, C.K.; Rasmussen, C.E. Gaussian processes for regression. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1996; pp. 514–520.
22. Bellman, R. A Markovian decision process. J. Math. Mech. 1957, 6, 679–684.
23. Karl, H.; Willig, A. Protocols and Architectures for Wireless Sensor Networks, 1st ed.; John Wiley & Sons: Hoboken, NJ, USA, 2005.
24. Brito, M.P.; Lewis, R.S.; Bose, N.; Griffiths, G. Adaptive autonomous underwater vehicles: An assessment of their effectiveness for oceanographic applications. IEEE Trans. Eng. Manage. 2018, 1–14.
25. Mertikas, S.P. Error Distributions and Accuracy Measures in Navigation: An Overview; Technical Report; Geodesy and Geomatics Engineering: Fredericton, NB, Canada, 1985.
26. Kay, S.M. Fundamentals of Statistical Signal Processing: Estimation Theory; Prentice Hall: Upper Saddle River, NJ, USA, 1993; Volume 1.
27. Byrd, R.; Lu, P.; Nocedal, J.; Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 1995, 16, 1190–1208.
28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2017.
29. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
30. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
31. Le Gall, F. Powers of tensors and fast matrix multiplication. In Proceedings of the International Symposium on Symbolic and Algebraic Computation, Kobe, Japan, 23–25 July 2014; pp. 296–303.
32. Leithead, W.E.; Zhang, Y. O(N²)-operation approximation of covariance matrix inverse in Gaussian process regression based on quasi-Newton BFGS method. Commun. Stat. Simul. Comput. 2007, 36, 367–380.
33. Takefuji, Y. Neural Network Parallel Computing; Springer Science & Business Media: Berlin, Germany, 2012.
34. Kroese, D.P.; Botev, Z.I. Spatial process simulation. In Stochastic Geometry, Spatial Statistics and Random Fields: Models and Algorithms; Schmidt, V., Ed.; Springer: Berlin, Germany, 2015; pp. 369–404.
35. McEwen, R.; Thomas, H.; Weber, D.; Psota, F. Performance of an AUV navigation system at Arctic latitudes. IEEE J. Ocean. Eng. 2005, 30, 443–454.
36. Norgren, P.; Skjetne, R. Using autonomous underwater vehicles as sensor platforms for ice-monitoring. Model. Identif. Control 2014, 35, 269–277.
| | Scheme 1 | Scheme 2 | Scheme 3 |
|---|---|---|---|
| Total traveled distance (km) | 74.4 | 77.9 | 78.1 |
| Total traveled angle (rad) | 76.6 | 117.4 | 131.5 |
| Normalized mean square error (NMSE) | 0.17 | 0.26 | 1.35 |
| Epoch duration (minutes) | 30 | 40 | 50 |
|---|---|---|---|
| NMSE | 0.22 | 0.23 | 0.26 |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).