DOI: 10.1109/ICDE.2012.58
Article

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Published: 01 April 2012

Abstract

    MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is increasingly used in e-science applications. Unfortunately, the performance of MapReduce systems depends strongly on an even data distribution, while scientific data sets are often highly skewed. The resulting load imbalance, which increases the processing time, is further amplified by the high runtime complexity of the reducer tasks. An adaptive load balancing strategy is required to handle skew appropriately. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. Accurate cost estimation is the basis for adaptive load balancing algorithms and requires gathering statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets, approximating the global data distribution.
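
    The two components described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not say how the "most relevant subset" is chosen or which cost model is used, so the sketch assumes a simple top-k selection of the largest key clusters per mapper and a quadratic reducer cost. The names `local_statistics`, `integrate`, and `estimated_cost` are hypothetical.

```python
from collections import Counter


def local_statistics(keys, k):
    """Mapper side (assumed top-k heuristic): count tuples per key and
    keep only the k largest clusters, plus a compact summary of the rest."""
    hist = Counter(keys)
    head = dict(hist.most_common(k))
    return {
        "head": head,                                             # largest local clusters
        "rest_tuples": sum(hist.values()) - sum(head.values()),   # tuples not in head
        "rest_keys": len(hist) - len(head),                       # keys not in head
    }


def integrate(stats_list):
    """Controller side: merge all mapper summaries in a single pass,
    since the mappers have terminated and cannot be asked again."""
    merged = Counter()
    rest_tuples = 0
    for stats in stats_list:
        merged.update(stats["head"])
        rest_tuples += stats["rest_tuples"]
    return merged, rest_tuples


def estimated_cost(cardinality, exponent=2):
    """Assumed cost model: reducer work grows as cardinality ** exponent."""
    return cardinality ** exponent


# Two mappers, each seeing only part of a skewed key distribution.
mapper_a = ["x"] * 5 + ["y"] * 2 + ["z"]
mapper_b = ["x"] * 4 + ["z"] * 3 + ["w"]

summaries = [local_statistics(m, k=2) for m in (mapper_a, mapper_b)]
global_head, unaccounted = integrate(summaries)

for key, count in global_head.most_common():
    print(key, count, estimated_cost(count))
print("tuples outside the reported heads:", unaccounted)
```

    In this sketch the merged counts are lower bounds on the true global cardinalities: a key that falls outside one mapper's top k (here, "z" on the first mapper) is only counted in `rest_tuples`. How to bound or correct that error within a single round is the kind of question the abstract's challenges (a) to (c) point at.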


    Published In

    ICDE '12: Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
    April 2012
    1432 pages
    ISBN: 9780769547473

    Publisher

    IEEE Computer Society

    United States

    Qualifiers

    • Article

    Cited By

    • (2020) Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services, pp. 446-455. DOI: 10.1145/3428757.3429140. Online publication date: 30-Nov-2020
    • (2020) Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 2455-2469. DOI: 10.1145/3318464.3389713. Online publication date: 11-Jun-2020
    • (2019) Intermediate Data Placement Strategy for Different Data Skew Levels Based on Random Sampling in Spark. Proceedings of the 4th International Conference on Big Data and Computing, pp. 17-23. DOI: 10.1145/3335484.3335495. Online publication date: 10-May-2019
    • (2019) Reducing partition skew on MapReduce. Frontiers of Computer Science: Selected Publications from Chinese Universities 13:5, pp. 960-975. DOI: 10.1007/s11704-018-6586-2. Online publication date: 1-Oct-2019
    • (2019) Load balancing in join algorithms for skewed data in MapReduce systems. The Journal of Supercomputing 75:1, pp. 228-254. DOI: 10.1007/s11227-018-2578-0. Online publication date: 1-Jan-2019
    • (2018) Efficient shuffle management with SCache for DAG computing frameworks. ACM SIGPLAN Notices 53:1, pp. 305-316. DOI: 10.1145/3200691.3178510. Online publication date: 10-Feb-2018
    • (2018) Rock you like a hurricane. Proceedings of the Thirteenth EuroSys Conference, pp. 1-15. DOI: 10.1145/3190508.3190532. Online publication date: 23-Apr-2018
    • (2018) Efficient shuffle management with SCache for DAG computing frameworks. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 305-316. DOI: 10.1145/3178487.3178510. Online publication date: 10-Feb-2018
    • (2018) Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling. The Journal of Supercomputing 74:7, pp. 3415-3440. DOI: 10.1007/s11227-018-2391-9. Online publication date: 1-Jul-2018
    • (2018) Implementation of scalable fuzzy relational operations in MapReduce. Soft Computing - A Fusion of Foundations, Methodologies and Applications 22:9, pp. 3061-3075. DOI: 10.1007/s00500-017-2561-3. Online publication date: 1-May-2018
