DOI: 10.1145/2391229.2391245

Balancing reducer skew in MapReduce workloads using progressive sampling

Published: 14 October 2012

Abstract

The elapsed time of a parallel job depends on the completion time of its longest-running constituent. We present a static load balancing algorithm that distributes work evenly across the reducers in a MapReduce job, resulting in significant elapsed time reductions.
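To make the opening claim concrete, here is a minimal sketch in Python. It assumes each reducer's load is expressed directly as a running time; the function names `elapsed_time` and `imbalance` and the sample numbers are illustrative and do not come from the paper.

```python
def elapsed_time(reducer_loads):
    """The reduce phase finishes only when the slowest reducer finishes."""
    return max(reducer_loads)


def imbalance(reducer_loads):
    """Ratio of the maximum load to the mean load; 1.0 is a perfect balance."""
    mean_load = sum(reducer_loads) / len(reducer_loads)
    return max(reducer_loads) / mean_load


# Hypothetical loads: the same total work (100 units) on four reducers.
skewed = [90, 5, 3, 2]         # one hot reducer dominates
balanced = [25, 25, 25, 25]    # work spread evenly

print(elapsed_time(skewed), elapsed_time(balanced))
print(imbalance(skewed), imbalance(balanced))
```

With the same total work, the skewed assignment takes as long as its single hot reducer, which is why reducing the maximum reducer load is the quantity a balancer targets.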
Taking a user-specified model of reducer performance, our load balancer uses a progressive objective-based cluster sampler to estimate the load associated with each reduce-key. It balances the workload using two techniques: Key Chopping, which splits keys with large loads into sub-keys that can be assigned to different distributive reducers, and Key Packing, which assigns keys with medium loads to reducers so as to minimize the maximum reducer load. Keys with small loads are hashed, as they have little effect on the balance. Sampling and balancing repeat until the user-specified balancing objective and confidence level are achieved.
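The three treatments described above (chop large keys, pack medium keys, hash small keys) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-key loads are taken as already estimated, the "large" threshold is assumed to be the ideal per-reducer load, the "small" threshold `small_fraction` is a hypothetical parameter, and packing uses the standard greedy longest-processing-time-first heuristic as a stand-in for whatever packing method the paper uses.

```python
import heapq
import math


def balance(key_loads, num_reducers, small_fraction=0.01):
    """Return (assignment, hashed_keys, reducer_loads).

    assignment maps (key, part) -> reducer index for chopped and packed keys.
    hashed_keys lists small keys left to an ordinary hash partitioner.
    """
    target = sum(key_loads.values()) / num_reducers  # ideal per-reducer load
    small_cutoff = small_fraction * target           # assumed threshold

    pieces, hashed_keys = [], []
    for key, load in key_loads.items():
        if load > target:
            # Key Chopping: split a large key into equal sub-keys.
            parts = math.ceil(load / target)
            pieces.extend((load / parts, key, p) for p in range(parts))
        elif load >= small_cutoff:
            pieces.append((load, key, 0))            # medium key, kept whole
        else:
            hashed_keys.append(key)                  # small key, hashed later

    # Key Packing (greedy): largest piece first, onto the least-loaded reducer.
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    for load, key, part in sorted(pieces, key=lambda p: p[0], reverse=True):
        reducer_load, r = heapq.heappop(heap)
        assignment[(key, part)] = r
        heapq.heappush(heap, (reducer_load + load, r))

    reducer_loads = [0.0] * num_reducers
    for load, r in heap:
        reducer_loads[r] = load
    return assignment, hashed_keys, reducer_loads


# Hypothetical estimated loads for six reduce-keys.
loads = {"a": 100, "b": 30, "c": 30, "d": 20, "e": 20, "f": 0.1}
assignment, hashed_keys, reducer_loads = balance(loads, num_reducers=4)
print(reducer_loads, hashed_keys)
```

In this example the key `"a"` alone would exceed any even share, so it is split into two sub-keys that land on different reducers. Splitting a key this way is only valid when the reduce function tolerates seeing a key's values in separate groups, which is presumably what the abstract means by distributive reducers. The returned `reducer_loads` exclude the hashed keys, on the assumption that their contribution is negligible.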
The sampler and load balancer have been implemented in the Oracle Loader for Hadoop (OLH), a commercial MapReduce application that employs Apache Hadoop to perform parallel data formatting and data movement into partitioned relational tables. We present the performance improvements we achieve in both OLH and a MapReduce program for inverted index creation. The balancer works for arbitrary IID key distributions, the time used for sampling is small, and our solution is very effective at reducing the elapsed time of the MapReduce jobs we explored.

    Published In

    SoCC '12: Proceedings of the Third ACM Symposium on Cloud Computing
    October 2012
    325 pages
    ISBN:9781450317610
    DOI:10.1145/2391229
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Hadoop
    2. MapReduce
    3. Oracle Loader for Hadoop
    4. load balancing
    5. load model
    6. progressive sampling
    7. skew

    Qualifiers

    • Research-article

    Conference

    SoCC '12: ACM Symposium on Cloud Computing
    October 14-17, 2012
    San Jose, California

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%
