DOI: 10.1145/3070607.3070612

Benchmarking Data Flow Systems for Scalable Machine Learning

Published: 14 May 2017

Abstract

Distributed data flow systems such as Apache Spark and Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large-scale machine learning, such as click-through rate prediction, rely on models trained on billions of data points that are both highly sparse and high-dimensional. Existing benchmarks attempt to assess the performance of data flow systems such as Apache Flink, Spark, or Hadoop with non-representative workloads such as WordCount, Grep, or Sort. They evaluate scalability only with respect to data set size and fail to address the crucial requirement of handling high-dimensional data.
We introduce a representative set of distributed machine learning algorithms suitable for large-scale distributed settings that closely resemble industry-relevant applications and provide generalizable insights into system performance. We implement mathematically equivalent versions of these algorithms in Apache Flink and Apache Spark, tune relevant system parameters, and run a comprehensive set of experiments to assess their scalability with respect to both data set size and dimensionality of the data. We evaluate the systems on data sets of up to four billion data points and 100 million dimensions. Additionally, we compare the performance to that of single-node implementations to put the scalability results into perspective.
Our results indicate that while current state-of-the-art data flow systems scale robustly with increasing data set sizes, they are surprisingly inefficient at coping with high-dimensional data, which is a crucial requirement for large-scale machine learning algorithms.
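The data-parallel training pattern the paper benchmarks — each worker computing a partial gradient over its data partition, a reduce step summing the partials, and the driver applying a sparse update — can be sketched in plain Python for logistic regression. This is a minimal single-process illustration of the communication pattern, not the paper's actual Spark or Flink implementations; the function names (`partition_gradient`, `train`) and the dict-based sparse-vector representation are illustrative assumptions.

```python
import math

def sparse_dot(w, x):
    # Dot product of the weight dict with a sparse feature dict
    # (only non-zero dimensions are stored, as in high-dimensional sparse data).
    return sum(w.get(i, 0.0) * v for i, v in x.items())

def partition_gradient(w, partition):
    # "Map" step: one worker sums the logistic-loss gradient over its partition.
    grad = {}
    for x, y in partition:
        p = 1.0 / (1.0 + math.exp(-sparse_dot(w, x)))
        err = p - y
        for i, v in x.items():
            grad[i] = grad.get(i, 0.0) + err * v
    return grad

def train(partitions, iterations=50, lr=0.5):
    # Driver loop: broadcast w, gather per-partition gradients, reduce, update.
    w = {}
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        total = {}
        for part in partitions:  # in a real cluster these run in parallel
            for i, g in partition_gradient(w, part).items():
                total[i] = total.get(i, 0.0) + g
        # Sparse update: only dimensions touched in this pass change.
        for i, g in total.items():
            w[i] = w.get(i, 0.0) - lr * g / n
    return w
```

The cost of broadcasting `w` and reducing the gradient dicts is exactly where dimensionality enters the picture: both grow with the number of non-zero dimensions, independent of the number of data points.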



    Published In

    BeyondMR'17: Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond
    May 2017, 76 pages
    ISBN: 9781450350198
    DOI: 10.1145/3070607

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SIGMOD/PODS'17

    Acceptance Rates

    BeyondMR'17 Paper Acceptance Rate 9 of 17 submissions, 53%;
    Overall Acceptance Rate 19 of 36 submissions, 53%

    Cited By

    • (2023) Deep Representation Learning: Fundamentals, Technologies, Applications, and Open Challenges. IEEE Access, 11:137621-137659. DOI: 10.1109/ACCESS.2023.3335196
    • (2022) Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models. Journal of Big Data, 9(1). DOI: 10.1186/s40537-022-00623-1
    • (2021) An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster. Big Data and Cognitive Computing, 5(4):65. DOI: 10.3390/bdcc5040065
    • (2021) Understanding quality of analytics trade-offs in an end-to-end machine learning-based classification system for building information modeling. Journal of Big Data, 8(1). DOI: 10.1186/s40537-021-00417-x
    • (2021) Model averaging in distributed machine learning: a case study with Apache Spark. The VLDB Journal. DOI: 10.1007/s00778-021-00664-7
    • (2020) Elastic Machine Learning Algorithms in Amazon SageMaker. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 731-737. DOI: 10.1145/3318464.3386126
    • (2020) ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1513-1524. DOI: 10.1109/ICDE48307.2020.00134
    • (2020) ADABench - Towards an Industry Standard Benchmark for Advanced Analytics. Performance Evaluation and Benchmarking for the Era of Cloud(s), pages 47-63. DOI: 10.1007/978-3-030-55024-0_4
    • (2019) Data Management Systems Research at TU Berlin. ACM SIGMOD Record, 47(4):23-28. DOI: 10.1145/3335409.3335415
    • (2019) Benchmarking Distributed Data Processing Systems for Machine Learning Workloads. Performance Evaluation and Benchmarking for the Era of Artificial Intelligence, pages 42-57. DOI: 10.1007/978-3-030-11404-6_4
