DOI: 10.1145/2901318.2901331
Research article (open access), EuroSys Conference Proceedings

STRADS: a distributed framework for scheduled model parallel machine learning

Published: 18 April 2016

Abstract

Machine learning (ML) algorithms are commonly applied to big data using distributed systems that partition the data across machines and allow each machine to read and update all ML model parameters --- a strategy known as data parallelism. An alternative and complementary strategy, model parallelism, partitions the model parameters for non-shared parallel access and updates, and may periodically repartition the parameters to facilitate communication. Model parallelism is motivated by two challenges that data parallelism does not usually address: (1) parameters may be dependent, so naive concurrent updates can introduce errors that slow convergence or even cause the algorithm to fail; (2) model parameters converge at different rates, so a small subset of slow-converging parameters can bottleneck ML algorithm completion. We propose scheduled model parallelism (SchMP), a programming approach that improves ML algorithm convergence speed by efficiently scheduling parameter updates, taking into account parameter dependencies and uneven convergence. To support SchMP at scale, we develop a distributed framework, STRADS, which optimizes the throughput of SchMP programs, and benchmark four common ML applications written as SchMP programs: LDA topic modeling, matrix factorization, sparse least-squares (Lasso) regression, and sparse logistic regression. By improving ML progress per iteration through SchMP programming while improving iteration throughput through STRADS, we show that SchMP programs running on STRADS outperform non-model-parallel ML implementations: for example, SchMP LDA and SchMP Lasso respectively achieve 10x and 5x faster convergence than recent, well-established baselines.
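To make the scheduling idea concrete, below is a minimal Python sketch of scheduled model parallelism applied to Lasso coordinate descent. It is not the STRADS implementation or its API: the function names (schmp_lasso, schedule_round, soft_threshold) are illustrative, and the dependency test (column correlation) and the priority rule (size of each parameter's most recent change) are simplified stand-ins for the scheduling machinery described in the paper. The sketch only illustrates the two ideas named in the abstract: do not update strongly dependent parameters in the same round, and spend more updates on parameters that are still changing.

```python
# Toy sketch of scheduled model parallelism (SchMP) for Lasso coordinate
# descent. This is NOT the STRADS implementation; it only illustrates the two
# scheduling ideas from the abstract: (1) avoid concurrently updating
# dependent (here, strongly correlated) parameters, and (2) prioritize
# parameters that are still far from convergence. All names are illustrative.
import numpy as np


def soft_threshold(z, lam):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def schedule_round(X, delta, num_workers, corr_threshold=0.1):
    """Pick up to num_workers coordinates that are safe to update in parallel.

    Candidates are sampled with probability proportional to their most recent
    change (uneven-convergence prioritization), then filtered so that no two
    selected columns of X are strongly correlated (dependency check).
    """
    d = X.shape[1]
    probs = delta + 1e-6
    probs /= probs.sum()
    candidates = np.random.choice(d, size=min(4 * num_workers, d),
                                  replace=False, p=probs)
    chosen = []
    for j in candidates:
        # Dependency check: skip j if its column correlates with any chosen one.
        if all(abs(X[:, j] @ X[:, k]) < corr_threshold for k in chosen):
            chosen.append(j)
        if len(chosen) == num_workers:
            break
    return chosen


def schmp_lasso(X, y, lam=0.1, num_workers=4, num_rounds=200):
    """Serial simulation of model-parallel Lasso: each round a 'scheduler'
    picks a safe, high-priority set of coordinates and 'workers' update them."""
    X = X / np.linalg.norm(X, axis=0)        # unit-norm columns
    d = X.shape[1]
    beta = np.zeros(d)
    delta = np.ones(d)                       # most recent change per coordinate
    residual = y - X @ beta
    for _ in range(num_rounds):
        for j in schedule_round(X, delta, num_workers):
            old = beta[j]
            rho = X[:, j] @ residual + old   # partial residual correlation
            beta[j] = soft_threshold(rho, lam)
            residual -= X[:, j] * (beta[j] - old)
            delta[j] = abs(beta[j] - old)
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    true_beta = np.zeros(50)
    true_beta[:5] = 2.0
    y = X @ true_beta + 0.01 * rng.standard_normal(200)
    print(np.round(schmp_lasso(X, y)[:8], 2))
```

In this sketch the "workers" run serially inside one process; in a real deployment the coordinates chosen by schedule_round would be dispatched to separate machines, which is where the throughput optimizations of a framework such as STRADS apply.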

Published In

EuroSys '16: Proceedings of the Eleventh European Conference on Computer Systems
April 2016
605 pages
ISBN: 9781450342407
DOI: 10.1145/2901318
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 April 2016

Qualifiers

  • Research-article

Conference

EuroSys '16
EuroSys '16: Eleventh EuroSys Conference 2016
April 18 - 21, 2016
London, United Kingdom

Acceptance Rates

EuroSys '16 paper acceptance rate: 38 of 180 submissions (21%)
Overall acceptance rate: 241 of 1,308 submissions (18%)

Article Metrics

  • Downloads (last 12 months): 247
  • Downloads (last 6 weeks): 29
Reflects downloads up to 01 Sep 2024

Cited By

  • (2024) A Memory-Efficient Hybrid Parallel Framework for Deep Neural Network Training. IEEE Transactions on Parallel and Distributed Systems 35(4):577-591, April 2024. DOI: 10.1109/TPDS.2023.3343570
  • (2024) Resource-aware in-edge distributed real-time deep learning. Internet of Things 27:101263, October 2024. DOI: 10.1016/j.iot.2024.101263
  • (2023) Towards accelerating model parallelism in distributed deep learning systems. PLOS ONE 18(11):e0293338, 2 November 2023. DOI: 10.1371/journal.pone.0293338
  • (2023) Enabling All In-Edge Deep Learning: A Literature Review. IEEE Access 11:3431-3460, 2023. DOI: 10.1109/ACCESS.2023.3234761
  • (2023) Cooperative modular reinforcement learning for large discrete action space problem. Neural Networks 161:281-296, April 2023. DOI: 10.1016/j.neunet.2023.01.046
  • (2023) CANTO. Computer Communications 199(C):1-9, 1 February 2023. DOI: 10.1016/j.comcom.2022.12.007
  • (2023) A Memory Optimization Method for Distributed Training. Neural Information Processing, pp. 383-395, 13 November 2023. DOI: 10.1007/978-981-99-8126-7_30
  • (2022) HET. Proceedings of the VLDB Endowment 15(2):312-320, 4 February 2022. DOI: 10.14778/3489496.3489511
  • (2022) NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access. Proceedings of the 2022 International Conference on Management of Data, pp. 481-495, 10 June 2022. DOI: 10.1145/3514221.3517860
  • (2022) CuWide: Towards Efficient Flow-Based Training for Sparse Wide Models on GPUs. IEEE Transactions on Knowledge and Data Engineering 34(9):4119-4132, 1 September 2022. DOI: 10.1109/TKDE.2020.3038109