DOI: 10.1145/3302424.3303949

GRNN: Low-Latency and Scalable RNN Inference on GPUs

Published: 25 March 2019

Abstract

Recurrent neural networks (RNNs) have gained significant attention due to their effectiveness in modeling sequential data, such as text and voice signals. However, because of complex data dependencies and limited parallelism, current inference libraries for RNNs on GPUs suffer from either high latency or poor scalability, leading to inefficient resource utilization. Consequently, companies like Microsoft and Facebook use CPUs to serve RNN models.
This work identifies the root causes of the unsatisfactory performance of existing implementations of RNN inference on GPUs, including poor data reuse, low on-chip resource utilization, and high synchronization overhead. We systematically address these issues and develop a GPU-based RNN inference library, called GRNN, that provides low latency, high throughput, and efficient resource utilization. GRNN minimizes global memory accesses and synchronization overhead and balances on-chip resource usage through novel data reorganization, thread mapping, and performance modeling techniques. In evaluations on extensive benchmarks and real-world applications, GRNN outperforms the state-of-the-art CPU inference library by up to 17.5X and state-of-the-art GPU inference libraries by up to 9X in terms of latency reduction.
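
The data dependency and data-reuse problems named in the abstract can be illustrated with a minimal sketch of a vanilla RNN recurrence in plain Python. This is an illustration only and not GRNN's implementation: the function names `rnn_step` and `rnn_forward`, the tanh cell, and the `weight_reads` counter are assumptions introduced here for exposition.

```python
import math

def rnn_step(W, U, b, x_t, h_prev):
    """One vanilla RNN step: h_t = tanh(W x_t + U h_prev + b).

    W is hidden x input, U is hidden x hidden, b has length hidden.
    (Hypothetical helper, not part of any library.)
    """
    hidden = len(b)
    h_t = []
    for i in range(hidden):
        acc = b[i]
        acc += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        acc += sum(U[i][k] * h_prev[k] for k in range(hidden))
        h_t.append(math.tanh(acc))
    return h_t

def rnn_forward(W, U, b, xs, h0):
    """Run the recurrence over a sequence.

    Returns the final hidden state and the number of weight elements
    touched, assuming a naive scheme that re-reads W and U every step.
    """
    h = h0
    weight_reads = 0
    for x_t in xs:
        # Strictly sequential: step t needs the hidden state of step t-1.
        h = rnn_step(W, U, b, x_t, h)
        weight_reads += len(W) * len(W[0]) + len(U) * len(U[0])
    return h, weight_reads
```

Because each step consumes the hidden state produced by the previous one, the loop over time steps cannot be run in parallel, and a naive implementation touches every weight once per step. The counter makes that repeated traffic visible: it grows linearly with sequence length even though the weights never change.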


    Information & Contributors

    Information

    Published In

    EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
    March 2019
    714 pages
    ISBN: 9781450362818
    DOI: 10.1145/3302424
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPUs
    2. deep learning inference
    3. recurrent neural networks

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    EuroSys '19: Fourteenth EuroSys Conference 2019
    March 25 - 28, 2019
    Dresden, Germany

    Acceptance Rates

    Overall Acceptance Rate 241 of 1,308 submissions, 18%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 457
    • Downloads (Last 6 weeks): 61
    Reflects downloads up to 08 Feb 2025

    Cited By

    • (2024) ChameleonAPI. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (365-386). DOI: 10.5555/3691938.3691958. Online publication date: 10-Jul-2024
    • (2024) Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment. IEEE Transactions on Parallel and Distributed Systems 35:11 (2208-2223). DOI: 10.1109/TPDS.2024.3462092. Online publication date: Nov-2024
    • (2024) Building a Lightweight Trusted Execution Environment for Arm GPUs. IEEE Transactions on Dependable and Secure Computing (1-16). DOI: 10.1109/TDSC.2023.3334277. Online publication date: 2024
    • (2024) openLG: A Tunable and Efficient Open-source LSTM on GPUs. 2024 International Joint Conference on Neural Networks (IJCNN) (1-8). DOI: 10.1109/IJCNN60899.2024.10650733. Online publication date: 30-Jun-2024
    • (2024) Visual Servoing using FPGA-based Hardware accelerated Deep Learning Solution. 2024 IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI) (1-6). DOI: 10.1109/IATMSI60426.2024.10502825. Online publication date: 14-Mar-2024
    • (2024) Generalized regression neural networks-based data-driven iterative learning control for nonlinear non-affine discrete-time systems. Expert Systems with Applications 248 (123339). DOI: 10.1016/j.eswa.2024.123339. Online publication date: Aug-2024
    • (2023) Deep Learning Workload Scheduling in GPU Datacenters: A Survey. ACM Computing Surveys. DOI: 10.1145/3638757. Online publication date: 27-Dec-2023
    • (2023) SPLIT: QoS-Aware DNN Inference on Shared GPU via Evenly-Sized Model Splitting. Proceedings of the 52nd International Conference on Parallel Processing (605-614). DOI: 10.1145/3605573.3605627. Online publication date: 7-Aug-2023
    • (2023) Comparison of Long Short-Term Memory Networks and Temporal Convolutional Networks for Sentiment Analysis. Proceedings of the 2023 ACM Southeast Conference (2-9). DOI: 10.1145/3564746.3587000. Online publication date: 12-Apr-2023
    • (2023) SHARP: An Adaptable, Energy-Efficient Accelerator for Recurrent Neural Networks. ACM Transactions on Embedded Computing Systems 22:2 (1-23). DOI: 10.1145/3552513. Online publication date: 24-Jan-2023
