Automatic optimization for MapReduce programs

Published: 01 March 2011

    Abstract

    The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than traditional relational databases to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc.), but existing systems do not apply them, largely because free-form user code obscures the true data operation being performed. For example, a selection in SQL is easily detected, but a selection in a MapReduce program is embedded in Java code along with lots of other program logic. We could ask the programmer to provide explicit hints about the program's data semantics, but one of MapReduce's attractions is precisely that it does not ask the user for such information.
    This paper describes Manimal, which automatically analyzes MapReduce programs and applies appropriate data-aware optimizations, requiring no additional help from the programmer. We show that Manimal successfully detects optimization opportunities across a range of data operations, and that it yields speedups of up to 1,121% on previously-written MapReduce programs.
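
To make the "hidden selection" problem concrete, here is a minimal sketch, written in Python for brevity rather than the Java the abstract refers to. It is not Manimal's analyzer and not taken from the paper. The record layout, the `RANK_THRESHOLD` value, and the names `map_fn` and `run_map` are all assumptions made for illustration. The map function mixes a filter on `page_rank` with unrelated logic, so a runtime that treats it as opaque code has to scan every record, whereas the equivalent SQL states the predicate outright.

```python
from typing import Iterator, List, Tuple

# Hypothetical record layout (url, page_rank, content); not from the paper.
Record = Tuple[str, int, str]

RANK_THRESHOLD = 10  # assumed cutoff, chosen only for this illustration


def map_fn(key: int, value: Record) -> Iterator[Tuple[str, int]]:
    """User-written map function with a selection buried in ordinary code.

    Roughly equivalent SQL, where the predicate is explicit:
        SELECT url, word_count FROM pages WHERE page_rank > 10
    """
    url, page_rank, content = value
    word_count = len(content.split())  # unrelated program logic
    if page_rank > RANK_THRESHOLD:     # the hidden selection predicate
        yield url, word_count


def run_map(records: List[Record]) -> List[Tuple[str, int]]:
    """Apply map_fn to every record, as a runtime without an index must."""
    out: List[Tuple[str, int]] = []
    for key, value in enumerate(records):
        out.extend(map_fn(key, value))
    return out


records = [
    ("a.example", 3, "low rank page"),
    ("b.example", 42, "high rank page with more words"),
    ("c.example", 11, "another one"),
]
result = run_map(records)
```

If the predicate `page_rank > RANK_THRESHOLD` could be recovered from the code automatically, an index on `page_rank` (such as the B+Tree mentioned above) would let the job skip the first record without reading it.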

    Published In

    Proceedings of the VLDB Endowment  Volume 4, Issue 6
    March 2011
    71 pages

    Publisher

    VLDB Endowment

    Qualifiers

    • Research-article
