DOI: 10.1145/3297858.3304009
Research Article | Public Access

Hop: Heterogeneity-aware Decentralized Training

Published: 04 April 2019

Abstract

Recent work has shown that decentralized algorithms can deliver superior performance over centralized ones in the context of machine learning. The two approaches, whose main difference lies in their distinct communication patterns, are both susceptible to performance degradation in heterogeneous environments. Although considerable effort has been devoted to making centralized algorithms robust to heterogeneity, the problem has been little explored for decentralized algorithms. This paper proposes Hop, the first heterogeneity-aware decentralized training protocol. Based on a unique characteristic of decentralized training that we identify, the iteration gap, we propose a queue-based synchronization mechanism that can efficiently implement backup workers and bounded staleness in the decentralized setting. To cope with deterministic slowdown, we propose skipping iterations so that the effect of slower workers is further mitigated. We build a prototype implementation of Hop on TensorFlow. Experimental results on CNN and SVM workloads show significant speedups over standard decentralized training in heterogeneous settings.
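The queue-based synchronization described in the abstract can be illustrated with a minimal sketch. This is not Hop's actual implementation: the names NeighborQueue and gather_neighbor_updates, the capacity parameter (standing in for the staleness bound), and the needed count (neighbors minus backup workers) are all assumptions introduced for this example.

# Minimal sketch (not Hop's code): per-neighbor update queues that enforce a
# staleness bound, plus a gather step that waits only for the fastest
# `needed` neighbors, so the remaining (backup) neighbors can be skipped.

import threading
import time
from collections import deque


class NeighborQueue:
    """Holds (iteration, update) pairs pushed by one neighbor."""

    def __init__(self, capacity):
        self.capacity = capacity  # queue length acts as the staleness bound
        self.items = deque()
        self.cv = threading.Condition()

    def push(self, iteration, update):
        with self.cv:
            # A producer blocks once it runs `capacity` iterations ahead of
            # the consumer, which bounds the iteration gap between neighbors.
            while len(self.items) >= self.capacity:
                self.cv.wait()
            self.items.append((iteration, update))
            self.cv.notify_all()

    def try_pop(self):
        with self.cv:
            if not self.items:
                return None
            item = self.items.popleft()
            self.cv.notify_all()
            return item


def gather_neighbor_updates(queues, needed, poll_interval=0.01):
    """Collect one update from each of the first `needed` neighbors that
    produce one; slower (backup) neighbors are simply not waited for."""
    collected, pending = [], list(queues)
    while len(collected) < needed:
        for q in list(pending):
            item = q.try_pop()
            if item is not None:
                collected.append(item)
                pending.remove(q)
        if len(collected) < needed:
            time.sleep(poll_interval)  # poll again until enough updates arrive
    return collected

Under these assumptions, a worker with n neighbors and b backup workers would call gather_neighbor_updates(queues, n - b) each iteration and average only the updates it received, while the queue capacity bounds how far any neighbor can run ahead. In the same spirit as the iteration-skipping idea in the abstract, a persistently slow worker could skip an iteration (for example, by pushing a placeholder update) so that its faster neighbors are not blocked.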

Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages
ISBN: 9781450362405
DOI: 10.1145/3297858

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 April 2019

Author Tags

  1. decentralized training
  2. heterogeneity

Qualifiers

  • Research-article

Conference

ASPLOS '19

Acceptance Rates

ASPLOS '19 Paper Acceptance Rate: 74 of 351 submissions, 21%
Overall Acceptance Rate: 535 of 2,713 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 198
  • Downloads (last 6 weeks): 32
Reflects downloads up to 09 Nov 2024

Cited By

  • (2024) Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 499-513. DOI: 10.1145/3620665.3640375. Online publication date: 27-Apr-2024.
  • (2023) Train 'n Trade. Proceedings of the 37th International Conference on Neural Information Processing Systems, 28478-28490. DOI: 10.5555/3666122.3667359. Online publication date: 10-Dec-2023.
  • (2023) Resource Scheduling Techniques in Cloud from a View of Coordination: A Holistic Survey. Frontiers of Information Technology & Electronic Engineering 24(1), 1-40. DOI: 10.1631/FITEE.2100298. Online publication date: 23-Jan-2023.
  • (2023) PSRA-HGADMM: A Communication Efficient Distributed ADMM Algorithm. Proceedings of the 52nd International Conference on Parallel Processing, 82-91. DOI: 10.1145/3605573.3605610. Online publication date: 7-Aug-2023.
  • (2023) Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure. ACM Transactions on Architecture and Code Optimization 20(3), 1-21. DOI: 10.1145/3605149. Online publication date: 19-Jul-2023.
  • (2023) Topology Construction with Minimum Total Time for Geo-Distributed Decentralized Federated Learning. Proceedings of the 2023 9th International Conference on Computing and Artificial Intelligence, 720-726. DOI: 10.1145/3594315.3594397. Online publication date: 17-Mar-2023.
  • (2023) ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 266-280. DOI: 10.1145/3575693.3575721. Online publication date: 27-Jan-2023.
  • (2023) Consensus and Diffusion for First-Order Distributed Optimization Over Multi-Hop Network. IEEE Access 11, 76913-76925. DOI: 10.1109/ACCESS.2023.3297112. Online publication date: 2023.
  • (2022) Decentralized Machine Learning over the Internet. 2022 41st Chinese Control Conference (CCC), 2010-2015. DOI: 10.23919/CCC55666.2022.9901831. Online publication date: 25-Jul-2022.
  • (2022) Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 126-140. DOI: 10.1109/HPCA53966.2022.00018. Online publication date: Apr-2022.
