Automatic optimization for MapReduce programs

Authors: Michael J. Cafarella, Christopher Ré

Proceedings of the VLDB Endowment, Volume 4, Issue 6

Pages 385 - 396

https://doi.org/10.14778/1978665.1978670

Published: 01 March 2011

Abstract

The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than traditional relational databases to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc.), but existing systems do not apply them, largely because free-form user code obscures the true data operation being performed. For example, a selection in SQL is easily detected, but a selection in a MapReduce program is embedded in Java code alongside much other program logic. We could ask the programmer to provide explicit hints about the program's data semantics, but one of MapReduce's attractions is precisely that it does not ask the user for such information.
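To make the contrast with SQL concrete, the following is a minimal sketch of a map function whose selection is buried among unrelated logic. It is not code from the paper and deliberately avoids the Hadoop API so that it runs on its own; the class name `SelectionInMapSketch`, the record layout `url,rank,content`, and the threshold of 10 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, framework-free stand-in for a MapReduce map phase.
public class SelectionInMapSketch {

    // Assumed record layout (illustration only): "url,rank,content".
    static void map(String line, List<String> output) {
        String[] fields = line.split(",", 3);
        if (fields.length < 3) {
            return; // skip malformed records
        }

        // Unrelated program logic: normalization of the key.
        String url = fields[0].trim().toLowerCase();
        int rank = Integer.parseInt(fields[1].trim());

        // The selection itself. In SQL this would be "WHERE rank > 10";
        // here it is just one branch among other Java statements.
        if (rank > 10) {
            // The projection is also implicit: the content field is never emitted.
            output.add(url + "\t" + rank);
        }
    }

    // Applies map() to every input record, as a framework would.
    static List<String> runMapPhase(List<String> input) {
        List<String> output = new ArrayList<>();
        for (String line : input) {
            map(line, output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a.com,5,foo", "B.org,42,bar", "c.net,11,baz");
        for (String row : runMapPhase(input)) {
            System.out.println(row);
        }
    }
}
```

A system that sees only the compiled `map` method has no declarative statement telling it that records with `rank <= 10` could be skipped, for example by using an index on `rank`. Recovering that fact requires analyzing the code.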

This paper presents Manimal, which automatically analyzes MapReduce programs and applies appropriate data-aware optimizations, requiring no additional help from the programmer. We show that Manimal successfully detects optimization opportunities across a range of data operations, and that it yields speedups of up to 1,121% on previously written MapReduce programs.

Published In

Proceedings of the VLDB Endowment Volume 4, Issue 6

March 2011

71 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
