DOI: 10.1145/2523616.2523629
research-article

Scale-up vs scale-out for Hadoop: time to rethink?

Published: 01 October 2013

    Abstract

    In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment.
    Is this the right approach? Our measurements, as well as other recent work, show that the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing. We claim that a single "scale-up" server can process each of these jobs and do as well as or better than a cluster in terms of performance, cost, power, and server density. We present an evaluation across 11 representative Hadoop jobs that shows scale-up to be competitive with scale-out in all cases and significantly better in some. To achieve that performance, we describe several modifications to the Hadoop runtime that target scale-up configurations. These changes are transparent, do not require any changes to application code, and do not compromise scale-out performance; at the same time, our evaluation shows that they significantly improve Hadoop's scale-up performance.

    Published In

    SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing
    October 2013
    427 pages
    ISBN:9781450324281
    DOI:10.1145/2523616
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Conference

    SOCC '13: ACM Symposium on Cloud Computing
    October 1 - 3, 2013
    Santa Clara, California

    Acceptance Rates

    SOCC '13 Paper Acceptance Rate 23 of 114 submissions, 20%;
    Overall Acceptance Rate 169 of 722 submissions, 23%

    Cited By

    • (2023) Is there a role for knowledge management in saving the planet from too much data? Knowledge Management Research & Practice 21:3 (427-435). DOI: 10.1080/14778238.2023.2192580. Online publication date: 25-Apr-2023
    • (2022) Leveraging Scale-Up Machines for Swift DBMS Replication on IaaS Platforms Using BalenaDB. IEICE Transactions on Information and Systems E105.D:1 (92-104). DOI: 10.1587/transinf.2020ZDP7505. Online publication date: 1-Jan-2022
    • (2022) VeloxDFS: Streaming Access to Distributed Datasets to Reduce Disk Seeks. 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (31-40). DOI: 10.1109/CCGrid54584.2022.00012. Online publication date: May-2022
    • (2022) Scalability and performance analysis of BDPS in clouds. Computing 104:6 (1425-1460). DOI: 10.1007/s00607-022-01056-7. Online publication date: 14-Feb-2022
    • (2021) Apache Hadoop-MapReduce on YARN framework latency. Procedia Computer Science 184 (803-808). DOI: 10.1016/j.procs.2021.03.100. Online publication date: 2021
    • (2021) Evaluating Geospatial RDF Stores Using the Benchmark Geographica 2. Journal on Data Semantics 10:3-4 (189-228). DOI: 10.1007/s13740-021-00118-x. Online publication date: 23-Apr-2021
    • (2020) High-throughput stream processing with actors. Proceedings of the 10th ACM SIGPLAN International Workshop on Programming Based on Actors, Agents, and Decentralized Control (1-10). DOI: 10.1145/3427760.3428338. Online publication date: 17-Nov-2020
    • (2020) WattsApp: Power-Aware Container Scheduling. 2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC) (79-90). DOI: 10.1109/UCC48980.2020.00027. Online publication date: Dec-2020
    • (2020) Collaborative Accelerators for Streamlining MapReduce on Scale-up Machines with Incremental Data Aggregation. IEEE Transactions on Computers (1-1). DOI: 10.1109/TC.2020.3004169. Online publication date: 2020
    • (2020) Sensitivity analysis of latency to data size in Spark environment. 2020 IEEE 2nd International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS) (1-5). DOI: 10.1109/ICECOCS50124.2020.9314399. Online publication date: 2-Dec-2020
