DOI: 10.1145/3293883.3295710

Beyond human-level accuracy: computational challenges in deep learning

Published: 16 February 2019

Abstract

Deep learning (DL) research yields accuracy and product improvements from both model architecture changes and scale: larger data sets and models, and more computation. For hardware design, it is difficult to predict DL model changes. However, recent prior work shows that as dataset sizes grow, DL model accuracy and model size grow predictably. This paper leverages the prior work to project the dataset and model size growth required to advance DL accuracy beyond human-level, to frontier targets defined by machine learning experts. Datasets will need to grow 33--971×, while models will need to grow 6.6--456× to achieve target accuracies.
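
To illustrate the kind of projection described above: the prior scaling work fits power-law learning curves of the form error(m) = a·m^b (with b < 0) and extrapolates them to a target error. The sketch below inverts such a curve to estimate the dataset growth factor needed to reach a lower error; the coefficients and target errors are illustrative assumptions, not values from this paper.

```python
# Illustrative projection of the dataset size needed to hit a target error,
# assuming a power-law learning curve error(m) = a * m**b with b < 0,
# in the spirit of the empirical scaling results this paper builds on.
# The coefficients below are hypothetical, not fitted values from the paper.

def required_dataset_size(a: float, b: float, target_error: float) -> float:
    """Invert error = a * m**b to solve for the dataset size m."""
    return (target_error / a) ** (1.0 / b)

if __name__ == "__main__":
    a, b = 1.0, -0.3                      # hypothetical fitted coefficients
    current_m = required_dataset_size(a, b, target_error=0.10)
    frontier_m = required_dataset_size(a, b, target_error=0.05)
    print(f"dataset growth factor needed: {frontier_m / current_m:.1f}x")
```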
We further characterize and project the computational requirements to train these applications at scale. Our characterization reveals an important segmentation of DL training challenges for recurrent neural networks (RNNs) that contrasts with prior studies of deep convolutional networks. RNNs will have comparatively moderate operational intensities and very large memory footprint requirements. In contrast to emerging accelerator designs, large-scale RNN training characteristics suggest designs with significantly larger memory capacity and on-chip caches.
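
To make the operational-intensity point concrete, here is a minimal back-of-the-envelope sketch in the spirit of a roofline analysis: it estimates FLOPs per byte for the recurrent GEMM of an LSTM layer, assuming each operand crosses off-chip memory exactly once. The layer width, batch sizes, and fp32 precision are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope operational intensity (FLOPs per byte) for the
# recurrent GEMM of an LSTM layer: [B x H] x [H x 4H] -> [B x 4H].
# Assumes fp32 operands and no cache reuse across the GEMM; values are
# illustrative, not figures from the paper.

def lstm_gemm_intensity(batch: int, hidden: int, bytes_per_elem: int = 4) -> float:
    """Estimate FLOPs per byte, assuming each operand moves off-chip once."""
    flops = 2 * batch * hidden * (4 * hidden)      # multiply-accumulate count
    bytes_moved = bytes_per_elem * (
        batch * hidden            # input activations
        + hidden * 4 * hidden     # recurrent weight matrix
        + batch * 4 * hidden      # output pre-activations
    )
    return flops / bytes_moved

if __name__ == "__main__":
    for batch in (1, 16, 64, 256):
        oi = lstm_gemm_intensity(batch, hidden=2048)
        print(f"batch={batch:4d}  OI ~ {oi:6.1f} FLOPs/byte")
```

At small batch sizes the weight matrix dominates memory traffic, so intensity stays low and the GEMM is memory-bound, which is consistent with the abstract's emphasis on larger memory capacity and on-chip caches for RNN training.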


Published In

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages
ISBN:9781450362252
DOI:10.1145/3293883

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. compute graph
  2. compute requirements
  3. data parallelism
  4. deep learning
  5. model parallelism
  6. neural networks

Qualifiers

  • Research-article

Conference

PPoPP '19

Acceptance Rates

PPoPP '19 Paper Acceptance Rate 29 of 152 submissions, 19%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%
