Open access

FPGA acceleration of deep reinforcement learning using on-chip replay management

Published: 17 May 2022

Abstract

A major bottleneck in parallelizing deep reinforcement learning (DRL) is the high latency of the operations used to update the Prioritized Replay Buffer on the CPU. The low arithmetic intensity of these operations leads to severe under-utilization of the SIMT computation power of GPUs. In this work, we propose a high-throughput on-chip accelerator for the Prioritized Replay Buffer and the learner that efficiently allocates computation and memory resources to saturate the FPGA computation power. Our design features hardware pipelining on the FPGA such that the latency of replay operations is completely hidden. Our experimental results show that the performance of the key operations in managing the Prioritized Replay Buffer, including sampling and priority insertion, is improved by a factor of 21X to 40X compared with state-of-the-art implementations on CPU and GPU. In addition, our system design leads to up to 4.3X improvement in overall throughput compared with state-of-the-art CPU-GPU implementations.
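The abstract does not spell out the buffer's internal data structure, but prioritized experience replay is commonly implemented with a sum-tree, which makes both priority insertion and proportional sampling O(log N) tree traversals with little arithmetic per memory access. The Python sketch below is purely illustrative (the class and method names are our own, not the authors' implementation); it shows the CPU-side semantics of the two replay operations the accelerator targets.

```python
# Illustrative sketch (not the paper's implementation): a minimal sum-tree
# prioritized replay buffer showing the two operations the paper accelerates,
# priority insertion/update and proportional sampling.
import random

class SumTree:
    """Binary sum-tree over `capacity` leaf priorities (capacity is a power of two)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # node i stores the priority sum of its subtree

    def update(self, idx, priority):
        """Priority insertion: write leaf `idx`, then propagate partial sums up to the root."""
        node = idx + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self):
        """Sampling: pick a leaf with probability proportional to its priority."""
        value = random.uniform(0.0, self.tree[1])  # tree[1] holds the total priority mass
        node = 1
        while node < self.capacity:  # descend one level per iteration
            left = 2 * node
            if value <= self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        return node - self.capacity  # leaf index == replay-buffer slot

# Both operations are O(log capacity) traversals dominated by dependent memory
# accesses, i.e. the low-arithmetic-intensity pattern the abstract describes.
tree = SumTree(capacity=8)
for i in range(8):
    tree.update(i, priority=float(i + 1))
print(tree.sample())
```

On an FPGA, the per-level updates and descents of such a tree can be pipelined so that consecutive replay operations overlap, which is consistent with the abstract's claim that the latency of replay operations is completely hidden.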


Cited By

  • (2024) PEARL: Enabling Portable, Productive, and High-Performance Deep Reinforcement Learning using Heterogeneous Platforms. Proceedings of the 21st ACM International Conference on Computing Frontiers, 41-50. https://doi.org/10.1145/3649153.3649193. Online publication date: 7 May 2024.
  • (2024) FPGA-Accelerated Sim-to-Real Control Policy Learning for Robotic Arms. IEEE Transactions on Circuits and Systems II: Express Briefs 71(3), 1690-1694. https://doi.org/10.1109/TCSII.2024.3353690. Online publication date: March 2024.
  • (2023) A Framework for Mapping DRL Algorithms With Prioritized Replay Buffer Onto Heterogeneous Platforms. IEEE Transactions on Parallel and Distributed Systems 34(6), 1816-1829. https://doi.org/10.1109/TPDS.2023.3264823. Online publication date: 1 June 2023.
  • (2023) DQN Algorithm Design for Fast Efficient Shortest Path System. 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 254-260. https://doi.org/10.1109/APSIPAASC58517.2023.10317113. Online publication date: 31 October 2023.


    Published In

    CF '22: Proceedings of the 19th ACM International Conference on Computing Frontiers
    May 2022
    321 pages
    ISBN:9781450393386
    DOI:10.1145/3528416
    This work is licensed under a Creative Commons Attribution 4.0 International License.


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 May 2022


    Badges

    • Best Paper

    Author Tags

    1. FPGA
    2. deep reinforcement learning
    3. prioritized replay buffer

    Qualifiers

    • Research-article

    Conference

    CF '22

    Acceptance Rates

    Overall Acceptance Rate 273 of 785 submissions, 35%


    Article Metrics

    • Downloads (last 12 months): 291
    • Downloads (last 6 weeks): 43
    Reflects downloads up to 22 Sep 2024
