A unified scaling model in the era of big data analytics

Authors:

Hao Che

HP3C '19: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications

Pages 67 - 77

https://doi.org/10.1145/3318265.3318268

Published: 08 March 2019

Abstract

As scale-out execution of big data analytics has become the predominant datacenter workload, it is of paramount importance to faithfully characterize the scaling properties of such workloads. To date, the most widely cited scaling law for big data analytics is the traditional Amdahl's law, which was formulated well before the era of big data analytics. A key observation made in this paper is that both the system and workload models underlying the traditional scaling laws are too simplistic to fully characterize the scaling properties of big data analytics workloads. In this paper, we put forward a Unified Scaling model for Big data Analytics (USBA), based on a multi-stage system model and a discretized workload model. USBA allows for flexible workload scaling, unifying the fixed-size and fixed-time workload models underlying Amdahl's and Gustafson's laws, respectively, and for flexible system scaling in terms of both the number of stages and the degree of parallelism per stage. Moreover, to faithfully characterize the scaling properties of big data analytics workloads, USBA accounts for the variability of task response times and for barrier synchronization. Finally, applying USBA to the scaling analysis of four Spark-based data mining and graph benchmarks demonstrates that USBA can adequately characterize the scaling design space and predict the scaling properties of real-world big data analytics workloads. This makes USBA a useful tool for job resource provisioning for big data analytics in datacenters.
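For orientation, the two classical laws that the abstract says USBA unifies can be written down in a few lines. The sketch below is not the USBA model, which is not specified on this page. It shows only the textbook forms of Amdahl's law (fixed-size workload) and Gustafson's law (fixed-time workload), plus a hypothetical `job_time` helper illustrating why barrier synchronization makes task-time variability matter. All function and parameter names are assumptions chosen for illustration.

```python
from typing import Sequence


def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Fixed-size workload: total work stays constant as n grows."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)


def gustafson_speedup(parallel_fraction: float, n: int) -> float:
    """Fixed-time workload: the parallel part grows with n.

    Here parallel_fraction is the share of run time spent in the
    parallel part on the n-way system.
    """
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * n


def job_time(stages: Sequence[Sequence[float]]) -> float:
    """Hypothetical multi-stage job with a barrier after each stage.

    A stage finishes only when its slowest task finishes, so the job
    time is the sum of the per-stage maxima of task response times.
    """
    return sum(max(task_times) for task_times in stages)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, amdahl_speedup(0.9, n), gustafson_speedup(0.9, n))
    # One slow task (3.0) dominates the first stage.
    print(job_time([[1.0, 1.2, 3.0], [2.0, 2.1]]))
```

With a 90% parallel fraction, the fixed-size speedup saturates as `n` grows while the fixed-time speedup keeps growing roughly linearly, which is the gap between the two workload models that a unified model has to span.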

Cited By

Van Dongen G, Van Den Poel D (2021). Influencing Factors in the Scalability of Distributed Stream Processing Jobs. IEEE Access 9, 109413-109431. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3102645

Index Terms

A unified scaling model in the era of big data analytics
1. Computing methodologies
  1. Distributed computing methodologies
2. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published In

HP3C '19: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications

March 2019

201 pages

ISBN:9781450366380

DOI:10.1145/3318265

Conference Chair:
Steven Guan
Xi'an Jiaotong-Liverpool University, China

Copyright © 2019 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

HP3C '19

HP3C '19: 2019 the 3rd International Conference on High Performance Compilation, Computing and Communications

March 8 - 10, 2019

Xi'an, China
