research-article

Scale-up vs scale-out for Hadoop: time to rethink?

Authors: Raja Appuswamy, Christos Gkantsidis, Dushyanth Narayanan, Orion Hodson, and Antony Rowstron

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing

October 2013

Article No.: 20, Pages 1 - 13

https://doi.org/10.1145/2523616.2523629

Published: 01 October 2013

Abstract

In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment.

Is this the right approach? Our measurements, as well as other recent work, show that the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing. We claim that a single "scale-up" server can process each of these jobs and do as well as or better than a cluster in terms of performance, cost, power, and server density. We present an evaluation across 11 representative Hadoop jobs that shows scale-up to be competitive with scale-out in all cases and significantly better in some. To achieve that performance, we describe several modifications to the Hadoop runtime that target scale-up configurations. These changes are transparent, do not require any changes to application code, and do not compromise scale-out performance; at the same time, our evaluation shows that they significantly improve Hadoop's scale-up performance.

Published In

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing

October 2013

427 pages

ISBN: 9781450324281

DOI: 10.1145/2523616

General Chair:
Guy Lohman

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SOCC '13: ACM Symposium on Cloud Computing

October 1 - 3, 2013

Santa Clara, California

Acceptance Rates

SOCC '13 Paper Acceptance Rate 23 of 114 submissions, 20%;

Overall Acceptance Rate 169 of 722 submissions, 23%
