Research article
DOI: 10.1145/2741948.2741965

MALT: distributed data-parallelism for existing ML applications

Published: 17 April 2015

Abstract

Machine learning methods, such as SVM and neural networks, often improve their accuracy by using models with more parameters trained on large numbers of examples. Building such models on a single machine is often impractical because of the large amount of computation required.
We introduce MALT, a machine learning library that integrates with existing machine learning software and provides data-parallel machine learning. MALT provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs during incremental model updates. MALT allows machine learning developers to specify the dataflow and apply communication and representation optimizations. Through its general-purpose API, MALT can be used to provide data-parallelism to existing ML applications written in C++ and Lua and based on SVM, matrix factorization, and neural networks. In our results, we show that MALT provides fault tolerance, network efficiency, and speedup for these applications.

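To make the data-parallel pattern described above concrete, the following is a minimal, self-contained C++ sketch of the general idea: each model replica trains on its own data shard, periodically scatters its update into per-peer receive buffers, then folds in whatever its peers have deposited. This is an illustration only, not MALT's actual API; every name (RecvBuffers, scatter_update, gather_and_apply, the toy gradient) is hypothetical, and mutex-guarded shared memory stands in for the one-sided RDMA writes the abstract highlights.

#include <array>
#include <cstdio>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

constexpr int kReplicas = 4;   // number of model replicas (one per "machine")
constexpr int kDims     = 8;   // toy model size
constexpr int kIters    = 100; // local SGD steps per replica

// Per-replica receive area with one slot per peer; emulates a remotely
// writable memory region that peers deposit updates into.
struct RecvBuffers {
    std::mutex mu;
    std::vector<std::vector<double>> slots;
    RecvBuffers() : slots(kReplicas, std::vector<double>(kDims, 0.0)) {}
};

std::array<RecvBuffers, kReplicas> g_recv;

// "One-sided" push: the sender writes its update directly into every peer's
// buffer without the peer participating (a stand-in for an RDMA write).
void scatter_update(int self, const std::vector<double>& update) {
    for (int peer = 0; peer < kReplicas; ++peer) {
        if (peer == self) continue;
        std::lock_guard<std::mutex> lk(g_recv[peer].mu);
        g_recv[peer].slots[self] = update;
    }
}

// Fold in whatever peers have deposited so far and take one SGD step.
void gather_and_apply(int self, std::vector<double>& model,
                      const std::vector<double>& local_grad, double lr) {
    std::vector<double> sum = local_grad;
    {
        std::lock_guard<std::mutex> lk(g_recv[self].mu);
        for (int peer = 0; peer < kReplicas; ++peer) {
            if (peer == self) continue;
            for (int d = 0; d < kDims; ++d) sum[d] += g_recv[self].slots[peer][d];
        }
    }
    for (int d = 0; d < kDims; ++d) model[d] -= lr * sum[d] / kReplicas;
}

void replica(int self) {
    std::mt19937 rng(self);
    std::normal_distribution<double> noise(0.0, 0.1);
    std::vector<double> model(kDims, 1.0);
    for (int it = 0; it < kIters; ++it) {
        // Toy gradient of f(w) = ||w||^2 / 2 on this replica's shard, plus
        // noise standing in for the data-dependent part of the gradient.
        std::vector<double> grad(kDims);
        for (int d = 0; d < kDims; ++d) grad[d] = model[d] + noise(rng);
        scatter_update(self, grad);
        gather_and_apply(self, model, grad, /*lr=*/0.1);
    }
    std::printf("replica %d finished: w[0] = %.4f\n", self, model[0]);
}

int main() {
    std::vector<std::thread> workers;
    for (int r = 0; r < kReplicas; ++r) workers.emplace_back(replica, r);
    for (auto& t : workers) t.join();
    return 0;
}

In a real deployment, the mutex-protected copy in scatter_update would be replaced by a one-sided RDMA write into a registered buffer on each peer, which is the mechanism the abstract credits for limiting data movement costs.
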
Supplementary Material

MP4 File (a3-sidebyside.mp4)

Information

Published In

EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems
April 2015
503 pages
ISBN: 9781450332385
DOI: 10.1145/2741948
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2015

Qualifiers

  • Research-article

Conference

EuroSys '15: Tenth EuroSys Conference 2015
April 21-24, 2015
Bordeaux, France

Acceptance Rates

Overall acceptance rate: 241 of 1,308 submissions (18%)

Cited By

  • (2025) From Sancus to Sancus^q: staleness and quantization-aware full-graph decentralized training in graph neural networks. The VLDB Journal, 34(2). DOI: 10.1007/s00778-024-00897-2
  • (2024) Fast and scalable all-optical network architecture for distributed deep learning. Journal of Optical Communications and Networking, 16(3):342. DOI: 10.1364/JOCN.511696
  • (2024) Efficient Parameter Synchronization for Peer-to-Peer Distributed Learning with Selective Multicast. IEEE Transactions on Services Computing, 1-13. DOI: 10.1109/TSC.2024.3506480
  • (2024) Efficient Cross-Cloud Partial Reduce With CREW. IEEE Transactions on Parallel and Distributed Systems, 35(11):2224-2238. DOI: 10.1109/TPDS.2024.3460185
  • (2024) Bandwidth Characterization of DeepSpeed on Distributed Large Language Model Training. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 241-256. DOI: 10.1109/ISPASS61541.2024.00031
  • (2024) Performance enhancement of artificial intelligence: A survey. Journal of Network and Computer Applications, 232:104034. DOI: 10.1016/j.jnca.2024.104034
  • (2024) A privacy-preserving federated learning framework for blockchain networks. Cluster Computing, 27(4):3997-4014. DOI: 10.1007/s10586-024-04273-1
  • (2024) Distributed intelligence for IoT-based smart cities: a survey. Neural Computing and Applications, 36(27):16621-16656. DOI: 10.1007/s00521-024-10136-y
  • (2024) A general framework of high-performance machine learning algorithms: application in structural mechanics. Computational Mechanics, 73(4):705-729. DOI: 10.1007/s00466-023-02386-9
  • (2023) Edge Intelligence with Distributed Processing of DNNs: A Survey. Computer Modeling in Engineering & Sciences, 136(1):5-42. DOI: 10.32604/cmes.2023.023684
