Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads

Authors:

Magdalena Balazinska,

Bill Howe

Proceedings of the VLDB Endowment, Volume 6, Issue 10

Pages 853 - 864

https://doi.org/10.14778/2536206.2536213

Published: 01 August 2013

Abstract

We analyze Hadoop workloads from three different research clusters from a user-centric perspective. The goal is to better understand data scientists' use of the system and how well the use of the system matches its design. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools. We see significant diversity in resource usage and application styles, including some interactive and iterative workloads, motivating new tools in the ecosystem. We also observe significant opportunities for optimizations of these workloads. We find that job customization and configuration are used in a narrow scope, suggesting the future pursuit of automatic tuning systems. Overall, we present the first user-centered measurement study of Hadoop and find significant opportunities for improving its efficient use for data scientists.

Published In

Proceedings of the VLDB Endowment Volume 6, Issue 10

August 2013

180 pages

ISSN:2150-8097

Publisher

VLDB Endowment

Qualifiers

Article
