DOI: 10.1145/3267809.3267840

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Published: 11 October 2018

Abstract

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high performance multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7x compared to state-of-the-art cloud-based distributed training techniques for image classification workloads, with 25% better throughput per dollar.
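To make the communication pattern the abstract describes more concrete, the sketch below shows the generic synchronous parameter-server exchange that a system like PHub accelerates: each worker computes a gradient on its local minibatch, pushes it to a parameter server, and pulls back the updated model. This is a minimal illustration under stated assumptions, not PHub's software or API; the ParameterServer and worker_gradient names are hypothetical, the "workers" run in a single process, and everything PHub actually optimizes (the network stack, the gradient processing pipeline, hierarchical cross-rack exchange) is omitted.

# Minimal synchronous parameter-server sketch (illustrative only, not PHub's code).
# Workers compute gradients on local minibatches; the server averages them,
# applies one SGD step, and returns the updated parameters to every worker.
import numpy as np


class ParameterServer:
    """Holds the shared model and applies averaged gradients with plain SGD."""

    def __init__(self, num_params, lr=0.1):
        self.params = np.zeros(num_params)
        self.lr = lr

    def aggregate_and_update(self, gradients):
        # Average the per-worker gradients, then take one gradient-descent step.
        avg_grad = np.mean(gradients, axis=0)
        self.params -= self.lr * avg_grad
        return self.params.copy()


def worker_gradient(params, inputs, targets):
    # Toy least-squares gradient for this worker's minibatch.
    preds = inputs @ params
    return 2.0 * inputs.T @ (preds - targets) / len(targets)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=4)          # ground-truth linear model
    ps = ParameterServer(num_params=4)

    for step in range(100):
        # "Push" phase: each of 3 workers computes a gradient on its own shard.
        grads = []
        for _ in range(3):
            x = rng.normal(size=(32, 4))
            y = x @ true_w
            grads.append(worker_gradient(ps.params, x, y))
        # "Pull" phase: the server aggregates and broadcasts updated parameters.
        params = ps.aggregate_and_update(grads)

    print("learned:", np.round(params, 3))
    print("true:   ", np.round(true_w, 3))

In a real rack-scale deployment, the push and pull steps above become network transfers spanning many workers and server cores; the paper's contribution lies in co-designing that communication path, not in the optimizer arithmetic shown here.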


Published In

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
October 2018, 546 pages
ISBN: 9781450360111
DOI: 10.1145/3267809
Publisher

Association for Computing Machinery, New York, NY, United States

Conference

SoCC '18: ACM Symposium on Cloud Computing
October 11-13, 2018, Carlsbad, CA, USA

Acceptance Rates

Overall Acceptance Rate: 169 of 722 submissions, 23%
