research-article

No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics

Authors:

Herodotos Herodotou,

Shivnath Babu

SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing

Article No.: 18, Pages 1 - 14

https://doi.org/10.1145/2038916.2038934

Published: 26 October 2011

Abstract

Infrastructure-as-a-Service (IaaS) cloud platforms have brought two unprecedented changes to cluster provisioning practices. First, any (nonexpert) user can provision a cluster of any size on the cloud within minutes to run her data-processing jobs. The user can terminate the cluster once her jobs complete, and she pays only for the resources used and the duration of use. Second, cloud platforms enable users to bypass the traditional middleman, the system administrator, in the cluster-provisioning process. These changes give tremendous power to the user, but place a major burden on her shoulders. The user now regularly faces complex cluster sizing problems that involve finding the cluster size, the type of resources to use in the cluster from the large number of choices offered by current IaaS cloud platforms, and the job configurations that best meet the performance needs of her workload.

In this paper, we introduce the Elastisizer, a system to which users can express cluster sizing problems as queries in a declarative fashion. The Elastisizer provides reliable answers to these queries using an automated technique that combines job profiling, estimation with black-box and white-box models, and simulation. We have prototyped the Elastisizer for the Hadoop MapReduce framework, and present a comprehensive evaluation that shows the benefits of the Elastisizer in common scenarios where cluster sizing problems arise.
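
As a rough illustration of the kind of question a cluster sizing query poses, the sketch below enumerates candidate instance types and cluster sizes and picks the cheapest plan that meets a deadline. It is not the Elastisizer's technique, which the abstract describes only at a high level. Everything in it is an assumption made for illustration: the `InstanceType` fields, the prices and speeds, the toy runtime formula in `estimate_hours`, and the `cheapest_plan` helper. Cost follows the pay-per-use rule stated above: number of nodes times price per node-hour times duration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str              # hypothetical label, not a real IaaS offering
    price_per_hour: float  # hypothetical price in dollars per node-hour
    relative_speed: float  # hypothetical per-node speed versus a baseline node


@dataclass(frozen=True)
class Plan:
    instance: InstanceType
    nodes: int
    hours: float
    cost: float


def estimate_hours(work_node_hours: float, instance: InstanceType, nodes: int,
                   serial_hours: float = 0.1) -> float:
    """Toy estimator (an assumption): a fixed serial part plus parallel
    work that divides evenly across nodes."""
    return serial_hours + work_node_hours / (nodes * instance.relative_speed)


def cheapest_plan(work_node_hours: float,
                  instance_types: Iterable[InstanceType],
                  node_counts: Iterable[int],
                  deadline_hours: float) -> Optional[Plan]:
    """Exhaustively search (instance type, cluster size) pairs and return
    the lowest-cost plan that finishes by the deadline, or None."""
    best: Optional[Plan] = None
    for instance, nodes in product(list(instance_types), list(node_counts)):
        hours = estimate_hours(work_node_hours, instance, nodes)
        if hours > deadline_hours:
            continue  # this plan misses the deadline
        cost = nodes * instance.price_per_hour * hours  # pay for use and duration
        if best is None or cost < best.cost:
            best = Plan(instance, nodes, hours, cost)
    return best


if __name__ == "__main__":
    types = [InstanceType("small", 0.10, 1.0), InstanceType("large", 0.50, 4.0)]
    plan = cheapest_plan(40.0, types, range(1, 51), deadline_hours=2.0)
    print(plan)
```

A real answer depends on how accurately the runtime of a job can be predicted for a cluster that has not been provisioned yet, which is where the profiling, modeling, and simulation mentioned in the abstract come in; the closed-form estimator here only stands in for that step.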

Cited By

Osaba E, Benguria G, Lobo J, Diaz-De-Arcaya J, Alonso J, Etxaniz I (2023). Optimizing IaC Configurations: a Case Study Using Nature-inspired Computing. Proceedings of the 2023 6th International Conference on Computational Intelligence and Intelligent Systems, pages 85-90. DOI: 10.1145/3638209.3638223. Online publication date: 25-Nov-2023.
https://dl.acm.org/doi/10.1145/3638209.3638223
Zhu Y, Sen R, Horton R, Agosta J (2023). Runtime Variation in Big Data Analytics. Proceedings of the ACM on Management of Data 1:1, pages 1-20. DOI: 10.1145/3588921. Online publication date: 30-May-2023.
https://dl.acm.org/doi/10.1145/3588921
Li J, Zhang Y, Lu S, Gunawi H, Gu X, Huang F, Li D (2023). Performance Bug Analysis and Detection for Distributed Storage and Computing Systems. ACM Transactions on Storage 19:3, pages 1-33. DOI: 10.1145/3580281. Online publication date: 19-Jun-2023.
https://dl.acm.org/doi/10.1145/3580281

Index Terms

No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage

Information & Contributors

Information

Published In

SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing

October 2011

377 pages

ISBN:9781450309769

DOI:10.1145/2038916

Program Chairs: Jeffrey S. Chase (Duke University), Amr El Abbadi (Univ of California, Santa Barbara)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

SOCC '11

SOCC '11: ACM Symposium on Cloud Computing in conjunction with SOSP 2011

October 26 - 28, 2011

Cascais, Portugal

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 205
Total Downloads: 1,422
Downloads (Last 12 months): 32
Downloads (Last 6 weeks): 5

Reflects downloads up to 27 Jan 2025

Citations

Cited By

Gu R, Chen X, Dai H, Wang S, Wang Z, Tu Y, Huang Y, Chen G (2023). Time and Cost-Efficient Cloud Data Transmission based on Serverless Computing Compression. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, pages 1-10. DOI: 10.1109/INFOCOM53939.2023.10229090. Online publication date: 17-May-2023.
https://doi.org/10.1109/INFOCOM53939.2023.10229090
Nassereldine A, Diab S, Baydoun M, Leach K, Alt M, Milojicic D, El Hajj I (2023). Predicting the Performance-Cost Trade-off of Applications Across Multiple Systems. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pages 216-228. DOI: 10.1109/CCGrid57682.2023.00029. Online publication date: May-2023.
https://doi.org/10.1109/CCGrid57682.2023.00029
Dharmadasa I, Ullah F (2023). Co-Tuning of Cloud Infrastructure and Distributed Data Processing Platforms. 2023 IEEE International Conference on Big Data (BigData), pages 207-214. DOI: 10.1109/BigData59044.2023.10386759. Online publication date: 15-Dec-2023.
https://doi.org/10.1109/BigData59044.2023.10386759
Alamro S, Lan T, Subramaniam S (2023). Forseti: Dynamic chunk-level reshaping for data processing on heterogeneous clusters. Journal of Parallel and Distributed Computing 171, pages 14-23. DOI: 10.1016/j.jpdc.2022.09.003. Online publication date: Jan-2023.
https://doi.org/10.1016/j.jpdc.2022.09.003
Chen Y, Hoque M, Xu P, Lu J, Tarkoma S (2023). SimCost: cost-effective resource provision prediction and recommendation for spark workloads. Distributed and Parallel Databases 42:1, pages 73-102. DOI: 10.1007/s10619-023-07436-y. Online publication date: 22-Jun-2023.
https://doi.org/10.1007/s10619-023-07436-y
Shi J, Lu J (2023). Performance models of data parallel DAG workflows for large scale data analytics. Distributed and Parallel Databases 41:3, pages 299-329. DOI: 10.1007/s10619-023-07425-1. Online publication date: 23-May-2023.
https://doi.org/10.1007/s10619-023-07425-1
Grzegorowski M (2023). Selected Aspects of Interactive Feature Extraction. Transactions on Rough Sets XXIII, pages 121-287. DOI: 10.1007/978-3-662-66544-2_8. Online publication date: 1-Jan-2023.
https://doi.org/10.1007/978-3-662-66544-2_8