DOI: 10.1145/3267809.3267840

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Published: 11 October 2018

Abstract

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high performance multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7x compared to state-of-the-art cloud-based distributed training techniques for image classification workloads, with 25% better throughput per dollar.
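To make the communication pattern the abstract describes more concrete, the sketch below shows the generic synchronous parameter-server exchange that a system like PHub accelerates: each worker computes a gradient on its local minibatch, pushes it to a parameter server, and pulls back the updated model. This is a minimal illustration under stated assumptions, not PHub's software or API; the ParameterServer and worker_gradient names are hypothetical, the "workers" run in a single process, and everything PHub actually optimizes (the network stack, the gradient processing pipeline, hierarchical cross-rack exchange) is omitted.

# Minimal synchronous parameter-server sketch (illustrative only, not PHub's code).
# Workers compute gradients on local minibatches; the server averages them,
# applies one SGD step, and returns the updated parameters to every worker.
import numpy as np


class ParameterServer:
    """Holds the shared model and applies averaged gradients with plain SGD."""

    def __init__(self, num_params, lr=0.1):
        self.params = np.zeros(num_params)
        self.lr = lr

    def aggregate_and_update(self, gradients):
        # Average the per-worker gradients, then take one gradient-descent step.
        avg_grad = np.mean(gradients, axis=0)
        self.params -= self.lr * avg_grad
        return self.params.copy()


def worker_gradient(params, inputs, targets):
    # Toy least-squares gradient for this worker's minibatch.
    preds = inputs @ params
    return 2.0 * inputs.T @ (preds - targets) / len(targets)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=4)          # ground-truth linear model
    ps = ParameterServer(num_params=4)

    for step in range(100):
        # "Push" phase: each of 3 workers computes a gradient on its own shard.
        grads = []
        for _ in range(3):
            x = rng.normal(size=(32, 4))
            y = x @ true_w
            grads.append(worker_gradient(ps.params, x, y))
        # "Pull" phase: the server aggregates and broadcasts updated parameters.
        params = ps.aggregate_and_update(grads)

    print("learned:", np.round(params, 3))
    print("true:   ", np.round(true_w, 3))

In a real rack-scale deployment, the push and pull steps above become network transfers spanning many workers and server cores; the paper's contribution lies in co-designing that communication path, not in the optimizer arithmetic shown here.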


Published In

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
October 2018, 546 pages
ISBN: 9781450360111
DOI: 10.1145/3267809
Publisher

Association for Computing Machinery, New York, NY, United States

Conference

SoCC '18: ACM Symposium on Cloud Computing
October 11-13, 2018, Carlsbad, CA, USA

Acceptance Rates

Overall Acceptance Rate: 169 of 722 submissions, 23%
