GRNN: Low-Latency and Scalable RNN Inference on GPUs

Authors:

Daniel Mawhirter,

Bo Wu

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019

Article No.: 41, Pages 1 - 16

https://doi.org/10.1145/3302424.3303949

Published: 25 March 2019

Abstract

Recurrent neural networks (RNNs) have gained significant attention due to their effectiveness in modeling sequential data, such as text and voice signals. However, because of complex data dependencies and limited parallelism, current inference libraries for RNNs on GPUs suffer from either high latency or poor scalability, leading to inefficient resource utilization. Consequently, companies like Microsoft and Facebook use CPUs to serve RNN models.

This work identifies the root causes of the unsatisfactory performance of existing implementations of RNN inference on GPUs, including poor data reuse, low on-chip resource utilization, and high synchronization overhead. We systematically address these issues and develop a GPU-based RNN inference library, called GRNN, that provides low latency, high throughput, and efficient resource utilization. GRNN minimizes global memory accesses and synchronization overhead, and balances on-chip resource usage, through novel data reorganization, thread mapping, and performance modeling techniques. In evaluations on extensive benchmarks and real-world applications, GRNN outperforms the state-of-the-art CPU inference library by up to 17.5X and state-of-the-art GPU inference libraries by up to 9X in terms of latency reduction.
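
To make the data dependency mentioned in the abstract concrete, the following is a minimal sketch of inference with a plain (Elman-style) recurrent cell, written in pure Python. It is not GRNN's implementation and does not reflect its kernels. The cell equation h_t = tanh(W_x x_t + W_h h_{t-1} + b), the helper names (`matvec`, `rnn_step`, `rnn_inference`), and the toy weights are all assumptions chosen for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(W_x, W_h, bias, x_t, h_prev):
    """One time step of a plain RNN cell: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    wx = matvec(W_x, x_t)
    wh = matvec(W_h, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(wx, wh, bias)]


def rnn_inference(W_x, W_h, bias, inputs, h0):
    """Run the cell over a sequence and return the final hidden state.

    Each iteration needs the hidden state produced by the previous one,
    so the time steps cannot run in parallel. The same weight matrices
    are also read again at every step.
    """
    h = h0
    for x_t in inputs:
        h = rnn_step(W_x, W_h, bias, x_t, h)
    return h


# Toy weights and inputs (hypothetical values, for illustration only).
W_x = [[0.5, -0.5], [0.25, 0.75]]
W_h = [[0.0, 1.0], [1.0, 0.0]]
bias = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

final_h = rnn_inference(W_x, W_h, bias, inputs, [0.0, 0.0])
print(final_h)
```

In this sketch the only parallelism available within a step is inside the matrix-vector products, while the loop over time steps is strictly sequential. On a GPU, a straightforward mapping of such a loop would typically require some form of synchronization between steps and would re-read the weights each time, which is consistent with the synchronization and data-reuse concerns the abstract raises.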

Index Terms

GRNN: Low-Latency and Scalable RNN Inference on GPUs
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems

Published In

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019

March 2019

714 pages

ISBN:9781450362818

DOI:10.1145/3302424

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

EuroSys '19

Sponsor:

SIGOPS

EuroSys '19: Fourteenth EuroSys Conference 2019

March 25 - 28, 2019

Dresden, Germany

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Article Metrics

Total Citations: 39
Total Downloads: 2,187
Downloads (last 12 months): 457
Downloads (last 6 weeks): 61

Reflects downloads up to 08 Feb 2025
