Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads

Published: 01 August 2013

    Abstract

    We analyze Hadoop workloads from three different research clusters from a user-centric perspective. The goal is to better understand data scientists' use of the system and how well the use of the system matches its design. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools. We see significant diversity in resource usage and application styles, including some interactive and iterative workloads, motivating new tools in the ecosystem. We also observe significant opportunities for optimizations of these workloads. We find that job customization and configuration are used in a narrow scope, suggesting the future pursuit of automatic tuning systems. Overall, we present the first user-centered measurement study of Hadoop and find significant opportunities for improving its efficient use for data scientists.

    Published In

    Proceedings of the VLDB Endowment, Volume 6, Issue 10
    August 2013
    180 pages

    Publisher

    VLDB Endowment
