DOI: 10.1145/3318265.3318268
research-article
Public Access

A unified scaling model in the era of big data analytics

Published: 08 March 2019

    Abstract

    As scale-out execution of big data analytics has become the predominant datacenter workload, it is of paramount importance to faithfully characterize the scaling properties of such workloads. To date, the most widely cited scaling law for big data analytics is the traditional Amdahl's law, which was proposed well before the era of big data analytics. A key observation made in this paper is that both the system and workload models underlying the traditional scaling laws are too simplistic to fully characterize the scaling properties of big data analytics workloads. In this paper, we put forward a Unified Scaling model for Big data Analytics (USBA), based on a multi-stage system model and a discretized workload model. USBA allows for flexible workload scaling, unifying the fixed-size and fixed-time workload models underlying Amdahl's and Gustafson's laws, respectively, and for flexible system scaling in terms of both the number of stages and the degree of parallelism per stage. Moreover, to faithfully characterize the scaling properties of big data analytics workloads, USBA accounts for variability in task response times and for barrier synchronization. Finally, applying USBA to the scaling analysis of four Spark-based data mining and graph benchmarks demonstrates that USBA can adequately characterize the scaling design space and predict the scaling properties of real-world big data analytics workloads. This makes USBA a useful tool for job resource provisioning for big data analytics in datacenters.
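    For readers unfamiliar with the two classical laws the abstract refers to, the sketch below shows their textbook forms, plus the basic effect of a barrier on a stage with variable task times. It is not the USBA model, whose equations are not given on this page. The function names, the parameter `parallel_fraction`, and the sample task times are illustrative assumptions.

```python
# Illustrative sketch only: textbook Amdahl and Gustafson speedups,
# NOT the USBA model from the paper. All names here are hypothetical.

def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Fixed-size workload: total work stays constant as n grows."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)


def gustafson_speedup(parallel_fraction: float, n: int) -> float:
    """Fixed-time workload: the parallel part of the work grows with n."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * n


def barrier_stage_time(task_times: list[float]) -> float:
    """With a barrier, a stage ends only when its slowest task ends."""
    return max(task_times)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, amdahl_speedup(0.9, n), gustafson_speedup(0.9, n))
    # One straggler (4.0) sets the stage time even though most tasks take about 1.0.
    print(barrier_stage_time([1.0, 1.1, 0.9, 4.0]))
```

    With a parallel fraction of 0.9, the fixed-size speedup stays below 10 however many workers are added, while the fixed-time speedup keeps growing with `n`. The barrier function shows why variability in task response times matters: a single slow task determines how long the whole stage takes.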


    Cited By

    • (2021) Influencing Factors in the Scalability of Distributed Stream Processing Jobs. IEEE Access 9, 109413-109431. DOI: 10.1109/ACCESS.2021.3102645. Online publication date: 2021

    Published In

    HP3C '19: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications
    March 2019
    201 pages
    ISBN:9781450366380
    DOI:10.1145/3318265
    Conference Chair: Steven Guan
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Amdahl's law
    2. Gustafson's law
    3. MapReduce
    4. big data analytics
    5. performance modeling
    6. spark
