Research article
DOI: 10.1145/2741948.2741965

MALT: distributed data-parallelism for existing ML applications

Published: 17 April 2015

Abstract

Machine learning methods, such as SVM and neural networks, often improve their accuracy by using models with more parameters trained on large numbers of examples. Building such models on a single machine is often impractical because of the large amount of computation required.
We introduce MALT, a machine learning library that integrates with existing machine learning software and provides data-parallel machine learning. MALT provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs during incremental model updates. MALT allows machine learning developers to specify the dataflow and apply communication and representation optimizations. Through its general-purpose API, MALT can be used to provide data-parallelism to existing ML applications written in C++ and Lua and based on SVM, matrix factorization, and neural networks. In our results, we show that MALT provides fault tolerance, network efficiency, and speedup for these applications.

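To make the data-parallel pattern described above concrete, the following is a minimal, self-contained C++ sketch of the general idea: each model replica trains on its own data shard, periodically scatters its update into per-peer receive buffers, then folds in whatever its peers have deposited. This is an illustration only, not MALT's actual API; every name (RecvBuffers, scatter_update, gather_and_apply, the toy gradient) is hypothetical, and mutex-guarded shared memory stands in for the one-sided RDMA writes the abstract highlights.

#include <array>
#include <cstdio>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

constexpr int kReplicas = 4;   // number of model replicas (one per "machine")
constexpr int kDims     = 8;   // toy model size
constexpr int kIters    = 100; // local SGD steps per replica

// Per-replica receive area with one slot per peer; emulates a remotely
// writable memory region that peers deposit updates into.
struct RecvBuffers {
    std::mutex mu;
    std::vector<std::vector<double>> slots;
    RecvBuffers() : slots(kReplicas, std::vector<double>(kDims, 0.0)) {}
};

std::array<RecvBuffers, kReplicas> g_recv;

// "One-sided" push: the sender writes its update directly into every peer's
// buffer without the peer participating (a stand-in for an RDMA write).
void scatter_update(int self, const std::vector<double>& update) {
    for (int peer = 0; peer < kReplicas; ++peer) {
        if (peer == self) continue;
        std::lock_guard<std::mutex> lk(g_recv[peer].mu);
        g_recv[peer].slots[self] = update;
    }
}

// Fold in whatever peers have deposited so far and take one SGD step.
void gather_and_apply(int self, std::vector<double>& model,
                      const std::vector<double>& local_grad, double lr) {
    std::vector<double> sum = local_grad;
    {
        std::lock_guard<std::mutex> lk(g_recv[self].mu);
        for (int peer = 0; peer < kReplicas; ++peer) {
            if (peer == self) continue;
            for (int d = 0; d < kDims; ++d) sum[d] += g_recv[self].slots[peer][d];
        }
    }
    for (int d = 0; d < kDims; ++d) model[d] -= lr * sum[d] / kReplicas;
}

void replica(int self) {
    std::mt19937 rng(self);
    std::normal_distribution<double> noise(0.0, 0.1);
    std::vector<double> model(kDims, 1.0);
    for (int it = 0; it < kIters; ++it) {
        // Toy gradient of f(w) = ||w||^2 / 2 on this replica's shard, plus
        // noise standing in for the data-dependent part of the gradient.
        std::vector<double> grad(kDims);
        for (int d = 0; d < kDims; ++d) grad[d] = model[d] + noise(rng);
        scatter_update(self, grad);
        gather_and_apply(self, model, grad, /*lr=*/0.1);
    }
    std::printf("replica %d finished: w[0] = %.4f\n", self, model[0]);
}

int main() {
    std::vector<std::thread> workers;
    for (int r = 0; r < kReplicas; ++r) workers.emplace_back(replica, r);
    for (auto& t : workers) t.join();
    return 0;
}

In a real deployment, the mutex-protected copy in scatter_update would be replaced by a one-sided RDMA write into a registered buffer on each peer, which is the mechanism the abstract credits for limiting data movement costs.
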
Supplementary Material

MP4 File (a3-sidebyside.mp4)

Information

Published In

EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems
April 2015
503 pages
ISBN: 9781450332385
DOI: 10.1145/2741948
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2015

Qualifiers

  • Research-article

Conference

EuroSys '15: Tenth EuroSys Conference 2015
April 21-24, 2015
Bordeaux, France

Acceptance Rates

Overall acceptance rate: 241 of 1,308 submissions (18%)

Cited By

  • (2025) From Sancus to Sancus^q: staleness and quantization-aware full-graph decentralized training in graph neural networks. The VLDB Journal, 34(2). DOI: 10.1007/s00778-024-00897-2
  • (2024) Fast and scalable all-optical network architecture for distributed deep learning. Journal of Optical Communications and Networking, 16(3):342. DOI: 10.1364/JOCN.511696
  • (2024) Efficient Parameter Synchronization for Peer-to-Peer Distributed Learning with Selective Multicast. IEEE Transactions on Services Computing, 1-13. DOI: 10.1109/TSC.2024.3506480
  • (2024) Efficient Cross-Cloud Partial Reduce With CREW. IEEE Transactions on Parallel and Distributed Systems, 35(11):2224-2238. DOI: 10.1109/TPDS.2024.3460185
  • (2024) Bandwidth Characterization of DeepSpeed on Distributed Large Language Model Training. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 241-256. DOI: 10.1109/ISPASS61541.2024.00031
  • (2024) Performance enhancement of artificial intelligence: A survey. Journal of Network and Computer Applications, 232:104034. DOI: 10.1016/j.jnca.2024.104034
  • (2024) A privacy-preserving federated learning framework for blockchain networks. Cluster Computing, 27(4):3997-4014. DOI: 10.1007/s10586-024-04273-1
  • (2024) Distributed intelligence for IoT-based smart cities: a survey. Neural Computing and Applications, 36(27):16621-16656. DOI: 10.1007/s00521-024-10136-y
  • (2024) A general framework of high-performance machine learning algorithms: application in structural mechanics. Computational Mechanics, 73(4):705-729. DOI: 10.1007/s00466-023-02386-9
  • (2023) Edge Intelligence with Distributed Processing of DNNs: A Survey. Computer Modeling in Engineering & Sciences, 136(1):5-42. DOI: 10.32604/cmes.2023.023684
