
Crossbow: scaling deep learning with small batch sizes on multi-GPU servers

Published: 01 July 2019

Abstract

    Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate to compensate for this, which is complex and model-specific.
We describe Crossbow, a new single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size---however small---while scaling to multiple GPUs. Crossbow uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. We introduce SMA, a synchronous variant of model averaging in which replicas independently explore the solution space with gradient descent, but adjust their search synchronously based on the trajectory of a globally-consistent average model. Crossbow achieves high hardware efficiency with small batch sizes by potentially training multiple model replicas per GPU, automatically tuning the number of replicas to maximise throughput. Our experiments show that Crossbow improves the training time of deep learning models on an 8-GPU server by 1.3--4X compared to TensorFlow.
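As context for the first paragraph of the abstract, here is a minimal sketch of the gradient-averaging scheme it describes. NumPy stands in for the multi-GPU machinery, and the callback grad_fn is an assumed name rather than any framework's API.

```python
# A minimal NumPy sketch (not Crossbow or TensorFlow code) of parallel
# synchronous SGD: one batch is partitioned across the GPUs, each "GPU"
# computes a partial gradient on its shard, and the averaged gradient
# produces a single update to the global model.
import numpy as np

def sync_sgd_step(weights, batch, grad_fn, num_gpus=8, lr=0.01):
    shards = np.array_split(batch, num_gpus)        # partition the batch across GPUs
    partials = [grad_fn(weights, shard) for shard in shards]
    weights -= lr * np.mean(partials, axis=0)       # average partial gradients, update once
    return weights
```

With a fixed overall batch size, each shard shrinks as num_gpus grows and the GPUs are under-utilised; enlarging the batch to compensate is what hurts statistical efficiency, as the abstract notes.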
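The second paragraph summarises SMA without giving its update rule. The sketch below shows one plausible synchronous model-averaging iteration under the assumptions stated in the comments; the correction rate alpha and the averaging of corrections are illustrative choices, not Crossbow's exact formulation.

```python
# A hedged NumPy sketch of one SMA-style iteration: each replica takes an
# independent SGD step on its own small batch, is then nudged towards a
# globally-consistent average (central) model, and the central model is
# moved along the replicas' average trajectory. `alpha` and `grad_fn` are
# illustrative assumptions, not Crossbow's actual interface.
import numpy as np

def sma_step(replicas, central, batches, grad_fn, lr=0.01, alpha=0.1):
    correction = np.zeros_like(central)
    for w, batch in zip(replicas, batches):
        w -= lr * grad_fn(w, batch)       # independent exploration per replica
        diff = w - central                # divergence from the average model
        w -= alpha * diff                 # synchronously pull the replica back
        correction += alpha * diff        # ...and push the central model forward
    central += correction / len(replicas)
    return replicas, central
```

Because each replica trains on its own small batch, several replicas can share one GPU to keep it busy; Crossbow tunes that replica count at runtime (the sketch leaves replica placement out).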



    Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 11 (July 2019), 543 pages

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 July 2019
    Published in PVLDB Volume 12, Issue 11

    Qualifiers

    • Research-article

Article Metrics

• Downloads (Last 12 months): 35
• Downloads (Last 6 weeks): 0
    Reflects downloads up to 26 Jul 2024

Cited By
• (2024) An Analysis of Collocation on GPUs for Deep Learning Training. Proceedings of the 4th Workshop on Machine Learning and Systems, pages 81-90. DOI: 10.1145/3642970.3655827. Online publication date: 22-Apr-2024.
• (2024) The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format. Proceedings of the ACM on Management of Data, 2(1):1-31. DOI: 10.1145/3639307. Online publication date: 26-Mar-2024.
• (2024) Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications. Proceedings of the Nineteenth European Conference on Computer Systems, pages 1075-1092. DOI: 10.1145/3627703.3629578. Online publication date: 22-Apr-2024.
• (2023) TorchOpt. The Journal of Machine Learning Research, 24(1):17651-17664. DOI: 10.5555/3648699.3649066. Online publication date: 1-Jan-2023.
• (2023) GEAR. Proceedings of the 40th International Conference on Machine Learning, pages 36380-36390. DOI: 10.5555/3618408.3619921. Online publication date: 23-Jul-2023.
• (2023) Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. Proceedings of the VLDB Endowment, 17(2):211-224. DOI: 10.14778/3626292.3626303. Online publication date: 1-Oct-2023.
• (2023) Data Management and Visualization for Benchmarking Deep Learning Training Systems. Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning, pages 1-5. DOI: 10.1145/3595360.3595851. Online publication date: 18-Jun-2023.
• (2023) Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1145/3581784.3607060. Online publication date: 12-Nov-2023.
• (2023) Profiling and Monitoring Deep Learning Training Tasks. Proceedings of the 3rd Workshop on Machine Learning and Systems, pages 18-25. DOI: 10.1145/3578356.3592589. Online publication date: 8-May-2023.
• (2023) Elastic Averaging for Efficient Pipelined DNN Training. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 380-391. DOI: 10.1145/3572848.3577484. Online publication date: 25-Feb-2023.
