DOI: 10.1145/3293883.3295710

Beyond human-level accuracy: computational challenges in deep learning

Published: 16 February 2019

Abstract

Deep learning (DL) research yields accuracy and product improvements from both model architecture changes and scale: larger data sets and models, and more computation. For hardware design, it is difficult to predict DL model changes. However, recent prior work shows that as dataset sizes grow, DL model accuracy and model size grow predictably. This paper leverages the prior work to project the dataset and model size growth required to advance DL accuracy beyond human-level, to frontier targets defined by machine learning experts. Datasets will need to grow 33--971×, while models will need to grow 6.6--456× to achieve target accuracies.
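
To illustrate the kind of projection described above: the prior scaling work fits power-law learning curves of the form error(m) = a·m^b (with b < 0) and extrapolates them to a target error. The sketch below inverts such a curve to estimate the dataset growth factor needed to reach a lower error; the coefficients and target errors are illustrative assumptions, not values from this paper.

```python
# Illustrative projection of the dataset size needed to hit a target error,
# assuming a power-law learning curve error(m) = a * m**b with b < 0,
# in the spirit of the empirical scaling results this paper builds on.
# The coefficients below are hypothetical, not fitted values from the paper.

def required_dataset_size(a: float, b: float, target_error: float) -> float:
    """Invert error = a * m**b to solve for the dataset size m."""
    return (target_error / a) ** (1.0 / b)

if __name__ == "__main__":
    a, b = 1.0, -0.3                      # hypothetical fitted coefficients
    current_m = required_dataset_size(a, b, target_error=0.10)
    frontier_m = required_dataset_size(a, b, target_error=0.05)
    print(f"dataset growth factor needed: {frontier_m / current_m:.1f}x")
```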
We further characterize and project the computational requirements to train these applications at scale. Our characterization reveals an important segmentation of DL training challenges for recurrent neural networks (RNNs) that contrasts with prior studies of deep convolutional networks. RNNs will have comparatively moderate operational intensities and very large memory footprint requirements. In contrast to emerging accelerator designs, large-scale RNN training characteristics suggest designs with significantly larger memory capacity and on-chip caches.
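
To make the operational-intensity point concrete, here is a minimal back-of-the-envelope sketch in the spirit of a roofline analysis: it estimates FLOPs per byte for the recurrent GEMM of an LSTM layer, assuming each operand crosses off-chip memory exactly once. The layer width, batch sizes, and fp32 precision are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope operational intensity (FLOPs per byte) for the
# recurrent GEMM of an LSTM layer: [B x H] x [H x 4H] -> [B x 4H].
# Assumes fp32 operands and no cache reuse across the GEMM; values are
# illustrative, not figures from the paper.

def lstm_gemm_intensity(batch: int, hidden: int, bytes_per_elem: int = 4) -> float:
    """Estimate FLOPs per byte, assuming each operand moves off-chip once."""
    flops = 2 * batch * hidden * (4 * hidden)      # multiply-accumulate count
    bytes_moved = bytes_per_elem * (
        batch * hidden            # input activations
        + hidden * 4 * hidden     # recurrent weight matrix
        + batch * 4 * hidden      # output pre-activations
    )
    return flops / bytes_moved

if __name__ == "__main__":
    for batch in (1, 16, 64, 256):
        oi = lstm_gemm_intensity(batch, hidden=2048)
        print(f"batch={batch:4d}  OI ~ {oi:6.1f} FLOPs/byte")
```

At small batch sizes the weight matrix dominates memory traffic, so intensity stays low and the GEMM is memory-bound, which is consistent with the abstract's emphasis on larger memory capacity and on-chip caches for RNN training.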


Published In

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages
ISBN:9781450362252
DOI:10.1145/3293883

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. compute graph
  2. compute requirements
  3. data parallelism
  4. deep learning
  5. model parallelism
  6. neural networks

Qualifiers

  • Research-article

Conference

PPoPP '19

Acceptance Rates

PPoPP '19 Paper Acceptance Rate 29 of 152 submissions, 19%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%
