DOI: 10.1145/2901318.2901331
Research article (open access), EuroSys Conference Proceedings

STRADS: a distributed framework for scheduled model parallel machine learning

Published: 18 April 2016

Abstract

Machine learning (ML) algorithms are commonly applied to big data using distributed systems that partition the data across machines and allow each machine to read and update all ML model parameters --- a strategy known as data parallelism. An alternative and complementary strategy, model parallelism, partitions the model parameters for non-shared parallel access and updates, and may periodically repartition the parameters to facilitate communication. Model parallelism is motivated by two challenges that data parallelism does not usually address: (1) parameters may be dependent, so naive concurrent updates can introduce errors that slow convergence or even cause the algorithm to fail; (2) model parameters converge at different rates, so a small subset of slow-converging parameters can bottleneck ML algorithm completion. We propose scheduled model parallelism (SchMP), a programming approach that improves ML algorithm convergence speed by efficiently scheduling parameter updates, taking into account parameter dependencies and uneven convergence. To support SchMP at scale, we develop a distributed framework, STRADS, which optimizes the throughput of SchMP programs, and benchmark four common ML applications written as SchMP programs: LDA topic modeling, matrix factorization, sparse least-squares (Lasso) regression, and sparse logistic regression. By improving ML progress per iteration through SchMP programming while improving iteration throughput through STRADS, we show that SchMP programs running on STRADS outperform non-model-parallel ML implementations: for example, SchMP LDA and SchMP Lasso respectively achieve 10x and 5x faster convergence than recent, well-established baselines.
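To make the scheduling idea concrete, below is a minimal Python sketch of scheduled model parallelism applied to Lasso coordinate descent. It is not the STRADS implementation or its API: the function names (schmp_lasso, schedule_round, soft_threshold) are illustrative, and the dependency test (column correlation) and the priority rule (size of each parameter's most recent change) are simplified stand-ins for the scheduling machinery described in the paper. The sketch only illustrates the two ideas named in the abstract: do not update strongly dependent parameters in the same round, and spend more updates on parameters that are still changing.

```python
# Toy sketch of scheduled model parallelism (SchMP) for Lasso coordinate
# descent. This is NOT the STRADS implementation; it only illustrates the two
# scheduling ideas from the abstract: (1) avoid concurrently updating
# dependent (here, strongly correlated) parameters, and (2) prioritize
# parameters that are still far from convergence. All names are illustrative.
import numpy as np


def soft_threshold(z, lam):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def schedule_round(X, delta, num_workers, corr_threshold=0.1):
    """Pick up to num_workers coordinates that are safe to update in parallel.

    Candidates are sampled with probability proportional to their most recent
    change (uneven-convergence prioritization), then filtered so that no two
    selected columns of X are strongly correlated (dependency check).
    """
    d = X.shape[1]
    probs = delta + 1e-6
    probs /= probs.sum()
    candidates = np.random.choice(d, size=min(4 * num_workers, d),
                                  replace=False, p=probs)
    chosen = []
    for j in candidates:
        # Dependency check: skip j if its column correlates with any chosen one.
        if all(abs(X[:, j] @ X[:, k]) < corr_threshold for k in chosen):
            chosen.append(j)
        if len(chosen) == num_workers:
            break
    return chosen


def schmp_lasso(X, y, lam=0.1, num_workers=4, num_rounds=200):
    """Serial simulation of model-parallel Lasso: each round a 'scheduler'
    picks a safe, high-priority set of coordinates and 'workers' update them."""
    X = X / np.linalg.norm(X, axis=0)        # unit-norm columns
    d = X.shape[1]
    beta = np.zeros(d)
    delta = np.ones(d)                       # most recent change per coordinate
    residual = y - X @ beta
    for _ in range(num_rounds):
        for j in schedule_round(X, delta, num_workers):
            old = beta[j]
            rho = X[:, j] @ residual + old   # partial residual correlation
            beta[j] = soft_threshold(rho, lam)
            residual -= X[:, j] * (beta[j] - old)
            delta[j] = abs(beta[j] - old)
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    true_beta = np.zeros(50)
    true_beta[:5] = 2.0
    y = X @ true_beta + 0.01 * rng.standard_normal(200)
    print(np.round(schmp_lasso(X, y)[:8], 2))
```

In this sketch the "workers" run serially inside one process; in a real deployment the coordinates chosen by schedule_round would be dispatched to separate machines, which is where the throughput optimizations of a framework such as STRADS apply.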

Published In

EuroSys '16: Proceedings of the Eleventh European Conference on Computer Systems
April 2016
605 pages
ISBN: 9781450342407
DOI: 10.1145/2901318
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 April 2016

Qualifiers

  • Research-article

Conference

EuroSys '16
EuroSys '16: Eleventh EuroSys Conference 2016
April 18 - 21, 2016
London, United Kingdom

Acceptance Rates

EuroSys '16 paper acceptance rate: 38 of 180 submissions (21%)
Overall acceptance rate: 241 of 1,308 submissions (18%)

Article Metrics

  • Downloads (last 12 months): 247
  • Downloads (last 6 weeks): 29
Reflects downloads up to 01 Sep 2024

Cited By

  • (2024) A Memory-Efficient Hybrid Parallel Framework for Deep Neural Network Training. IEEE Transactions on Parallel and Distributed Systems 35(4):577-591, April 2024. DOI: 10.1109/TPDS.2023.3343570
  • (2024) Resource-aware in-edge distributed real-time deep learning. Internet of Things 27:101263, October 2024. DOI: 10.1016/j.iot.2024.101263
  • (2023) Towards accelerating model parallelism in distributed deep learning systems. PLOS ONE 18(11):e0293338, 2 November 2023. DOI: 10.1371/journal.pone.0293338
  • (2023) Enabling All In-Edge Deep Learning: A Literature Review. IEEE Access 11:3431-3460, 2023. DOI: 10.1109/ACCESS.2023.3234761
  • (2023) Cooperative modular reinforcement learning for large discrete action space problem. Neural Networks 161:281-296, April 2023. DOI: 10.1016/j.neunet.2023.01.046
  • (2023) CANTO. Computer Communications 199(C):1-9, 1 February 2023. DOI: 10.1016/j.comcom.2022.12.007
  • (2023) A Memory Optimization Method for Distributed Training. Neural Information Processing, pp. 383-395, 13 November 2023. DOI: 10.1007/978-981-99-8126-7_30
  • (2022) HET. Proceedings of the VLDB Endowment 15(2):312-320, 4 February 2022. DOI: 10.14778/3489496.3489511
  • (2022) NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access. Proceedings of the 2022 International Conference on Management of Data, pp. 481-495, 10 June 2022. DOI: 10.1145/3514221.3517860
  • (2022) CuWide: Towards Efficient Flow-Based Training for Sparse Wide Models on GPUs. IEEE Transactions on Knowledge and Data Engineering 34(9):4119-4132, 1 September 2022. DOI: 10.1109/TKDE.2020.3038109