DOI: 10.1145/1807128.1807150

Towards automatic optimization of MapReduce programs

Published: 10 June 2010

    Abstract

    Timely and cost-effective processing of large datasets has become a critical ingredient for the success of many academic, government, and industrial organizations. The combination of MapReduce frameworks and cloud computing is an attractive proposition for these organizations. However, even to run a single program in a MapReduce framework, a number of tuning parameters have to be set by users or system administrators. Users often run into performance problems because they don't know how to set these parameters, or because they don't even know that these parameters exist. With MapReduce being a relatively new technology, it is not easy to find qualified administrators. In this position paper, we make a case for techniques to automate the setting of tuning parameters for MapReduce programs. The objective is to provide good out-of-the-box performance for ad hoc MapReduce programs run on large datasets. This feature can go a long way towards improving the productivity of users who lack the skills to optimize programs themselves due to lack of familiarity with MapReduce or with the data being processed.
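To make the idea of automated parameter setting concrete, here is a minimal sketch of a search over candidate settings. It is not the paper's method, which the abstract does not describe. The three parameter names are Hadoop 0.20-era job settings; the candidate values, the `toy_cost` function, and its default arguments are hypothetical placeholders standing in for a real cost model or measured runtimes.

```python
from itertools import product

# Candidate values for three Hadoop 0.20-era job parameters.
# The value ranges are illustrative assumptions, not recommendations.
SEARCH_SPACE = {
    "io.sort.mb": [50, 100, 200],
    "io.sort.record.percent": [0.05, 0.15, 0.30],
    "mapred.reduce.tasks": [1, 8, 32],
}


def toy_cost(config, map_output_mb=150, record_meta_fraction=0.15, ideal_reducers=8):
    """Hypothetical cost model, invented for illustration only.

    A real optimizer would estimate or measure job runtime instead.
    """
    # Penalize a sort buffer that is smaller than the map output (forces spills).
    spill_penalty = max(0, map_output_mb - config["io.sort.mb"])
    # Penalize a record-metadata share that differs from what the data needs.
    meta_penalty = abs(config["io.sort.record.percent"] - record_meta_fraction) * 100
    # Penalize too few or too many reduce tasks.
    reducer_penalty = abs(config["mapred.reduce.tasks"] - ideal_reducers)
    return spill_penalty + meta_penalty + reducer_penalty


def candidate_configs(space):
    """Yield every combination of parameter values as a dict."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def best_config(space, cost=toy_cost):
    """Return the candidate configuration with the lowest cost."""
    return min(candidate_configs(space), key=cost)


if __name__ == "__main__":
    print(best_config(SEARCH_SPACE))
```

Exhaustive enumeration is workable here only because the toy space has 27 combinations. With many interacting parameters the space grows multiplicatively, which is part of why manual tuning is hard.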

    Published In

    SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing
    June 2010
    264 pages
    ISBN:9781450300360
    DOI:10.1145/1807128
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Hadoop
    2. MapReduce
    3. cost-based optimization

    Qualifiers

    • Research-article

    Conference

    SoCC '10
    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%
