DOI: 10.1145/3070607.3070612

Benchmarking Data Flow Systems for Scalable Machine Learning

Published: 14 May 2017

Abstract

Distributed data flow systems such as Apache Spark and Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large-scale machine learning, such as click-through rate prediction, rely on models trained on billions of data points that are both highly sparse and high-dimensional. Existing benchmarks attempt to assess the performance of data flow systems such as Apache Flink, Spark, or Hadoop with non-representative workloads such as WordCount, Grep, or Sort. They evaluate scalability only with respect to data set size and fail to address the crucial requirement of handling high-dimensional data.
We introduce a representative set of distributed machine learning algorithms suitable for large-scale distributed settings that closely resemble industry-relevant applications and provide generalizable insights into system performance. We implement mathematically equivalent versions of these algorithms in Apache Flink and Apache Spark, tune relevant system parameters, and run a comprehensive set of experiments to assess their scalability with respect to both data set size and dimensionality of the data. We evaluate the systems on data sets of up to four billion data points and 100 million dimensions. Additionally, we compare the performance to that of single-node implementations to put the scalability results into perspective.
Our results indicate that while current state-of-the-art data flow systems scale robustly with increasing data set sizes, they are surprisingly inefficient at coping with high-dimensional data, which is a crucial requirement for large-scale machine learning algorithms.
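The data-parallel training pattern the paper benchmarks — each worker computing a partial gradient over its data partition, a reduce step summing the partials, and the driver applying a sparse update — can be sketched in plain Python for logistic regression. This is a minimal single-process illustration of the communication pattern, not the paper's actual Spark or Flink implementations; the function names (`partition_gradient`, `train`) and the dict-based sparse-vector representation are illustrative assumptions.

```python
import math

def sparse_dot(w, x):
    # Dot product of the weight dict with a sparse feature dict
    # (only non-zero dimensions are stored, as in high-dimensional sparse data).
    return sum(w.get(i, 0.0) * v for i, v in x.items())

def partition_gradient(w, partition):
    # "Map" step: one worker sums the logistic-loss gradient over its partition.
    grad = {}
    for x, y in partition:
        p = 1.0 / (1.0 + math.exp(-sparse_dot(w, x)))
        err = p - y
        for i, v in x.items():
            grad[i] = grad.get(i, 0.0) + err * v
    return grad

def train(partitions, iterations=50, lr=0.5):
    # Driver loop: broadcast w, gather per-partition gradients, reduce, update.
    w = {}
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        total = {}
        for part in partitions:  # in a real cluster these run in parallel
            for i, g in partition_gradient(w, part).items():
                total[i] = total.get(i, 0.0) + g
        # Sparse update: only dimensions touched in this pass change.
        for i, g in total.items():
            w[i] = w.get(i, 0.0) - lr * g / n
    return w
```

The cost of broadcasting `w` and reducing the gradient dicts is exactly where dimensionality enters the picture: both grow with the number of non-zero dimensions, independent of the number of data points.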



    Published In

    BeyondMR'17: Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond
    May 2017, 76 pages
    ISBN: 9781450350198
    DOI: 10.1145/3070607

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SIGMOD/PODS'17

    Acceptance Rates

    BeyondMR'17 Paper Acceptance Rate 9 of 17 submissions, 53%;
    Overall Acceptance Rate 19 of 36 submissions, 53%

    Cited By

    • (2023) Deep Representation Learning: Fundamentals, Technologies, Applications, and Open Challenges. IEEE Access, 11:137621-137659. DOI: 10.1109/ACCESS.2023.3335196
    • (2022) Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models. Journal of Big Data, 9(1). DOI: 10.1186/s40537-022-00623-1
    • (2021) An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster. Big Data and Cognitive Computing, 5(4):65. DOI: 10.3390/bdcc5040065
    • (2021) Understanding quality of analytics trade-offs in an end-to-end machine learning-based classification system for building information modeling. Journal of Big Data, 8(1). DOI: 10.1186/s40537-021-00417-x
    • (2021) Model averaging in distributed machine learning: a case study with Apache Spark. The VLDB Journal. DOI: 10.1007/s00778-021-00664-7
    • (2020) Elastic Machine Learning Algorithms in Amazon SageMaker. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 731-737. DOI: 10.1145/3318464.3386126
    • (2020) ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1513-1524. DOI: 10.1109/ICDE48307.2020.00134
    • (2020) ADABench - Towards an Industry Standard Benchmark for Advanced Analytics. Performance Evaluation and Benchmarking for the Era of Cloud(s), pages 47-63. DOI: 10.1007/978-3-030-55024-0_4
    • (2019) Data Management Systems Research at TU Berlin. ACM SIGMOD Record, 47(4):23-28. DOI: 10.1145/3335409.3335415
    • (2019) Benchmarking Distributed Data Processing Systems for Machine Learning Workloads. Performance Evaluation and Benchmarking for the Era of Artificial Intelligence, pages 42-57. DOI: 10.1007/978-3-030-11404-6_4
