Assignment Problems of Different-Sized Inputs in MapReduce

Published: 03 December 2016

Abstract

A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers such that, for each required output, there exists a reducer that receives all the inputs participating in the computation of that output. Reducers have a capacity that limits the sets of inputs they can be assigned. However, individual inputs may vary in size. We consider, for the first time, mapping schemas in which input sizes are part of the considerations and restrictions. One of the significant parameters to optimize in any MapReduce job is the communication cost between the map and reduce phases, which can be reduced by minimizing the number of copies of inputs sent to the reducers. The communication cost is closely related to the number of capacity-constrained reducers needed to accommodate the inputs so that the requirement on which inputs must meet in a reducer is satisfied. In this work, we consider a family of problems in which each input must meet every other input in at least one reducer. We also consider a slightly different family of problems in which each input of a list, X, must meet each input of another list, Y, in at least one reducer. We prove that finding an optimal mapping schema for these families of problems is NP-hard, and present a bin-packing-based approximation algorithm for finding a near-optimal mapping schema.
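
The all-pairs family described in the abstract can be illustrated with a minimal sketch. It assumes reducer capacity q and that every input has size at most q/2. Inputs are packed into bins of capacity q/2, and each pair of bins is then sent to one reducer, so any two inputs meet somewhere. The function names are hypothetical, and First-Fit Decreasing is used here only as one common bin-packing heuristic; the paper's actual algorithm and its guarantees may differ.

```python
from itertools import combinations


def first_fit_decreasing(sizes, bin_capacity):
    """Pack input indices into bins of the given capacity (First-Fit Decreasing)."""
    bins = []   # each bin is a list of input indices
    loads = []  # total size currently placed in each bin
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for i in order:
        if sizes[i] > bin_capacity:
            raise ValueError(f"input {i} does not fit in a bin of capacity {bin_capacity}")
        for b in range(len(bins)):
            if loads[b] + sizes[i] <= bin_capacity:
                bins[b].append(i)
                loads[b] += sizes[i]
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            loads.append(sizes[i])
    return bins


def all_pairs_schema(sizes, q):
    """Return reducers (lists of input indices) such that every pair of
    inputs shares at least one reducer of capacity q.
    Assumption: every input has size at most q / 2."""
    bins = first_fit_decreasing(sizes, q / 2)
    if len(bins) == 1:
        return [bins[0]]
    # Two bins of load <= q/2 always fit together in one reducer.
    return [a + b for a, b in combinations(bins, 2)]


def is_valid_schema(sizes, q, reducers):
    """Check the capacity constraint and the all-pairs requirement."""
    if any(sum(sizes[i] for i in r) > q for r in reducers):
        return False
    met = {frozenset(p) for r in reducers for p in combinations(r, 2)}
    return all(frozenset(p) in met for p in combinations(range(len(sizes)), 2))


def communication_cost(sizes, reducers):
    """Total size of all input copies sent to reducers."""
    return sum(sizes[i] for r in reducers for i in r)
```

For example, with sizes [4, 3, 3, 2, 2, 1] and q = 10, the sketch packs the inputs into three bins of load 5 and uses three reducers, each filled exactly to capacity. The communication cost is the total size of all copies sent, so fewer bins means fewer reducers and fewer copies of each input.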

Published In

ACM Transactions on Knowledge Discovery from Data  Volume 11, Issue 2
May 2017
419 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3017677

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 December 2016
Accepted: 01 August 2016
Revised: 01 April 2016
Received: 01 July 2015
Published in TKDD Volume 11, Issue 2

Author Tags

  1. Distributed computing
  2. MapReduce algorithms
  3. reducer capacity and communication cost trade-off
  4. mapping schema
  5. reducer capacity

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • project Handling Uncertainty in Data Intensive Applications
  • Operational Program “Education and Lifelong Learning,” under the program THALES
  • European Union (European Social Fund) and Greek national funds
  • Rita Altura Trust Chair in Computer Sciences
  • Infrastructure Research in the Field of Advanced Computing and Cyber Security
  • Israel Science Foundation
  • Lynne and William Frankel Center for Computer Sciences
  • Cabarnit Cyber Security MAGNET Consortium
  • Ministry of Science and Technology

Cited By

  • (2021) New Scheduling Algorithms for Improving Performance and Resource Utilization in Hadoop YARN Clusters. IEEE Transactions on Cloud Computing 9:3 (1158-1171). DOI: 10.1109/TCC.2019.2894779. Online publication date: 1-Jul-2021
  • (2021) Investigating Automatic Parameter Tuning for SQL-on-Hadoop Systems. Big Data Research 25:C. DOI: 10.1016/j.bdr.2021.100204. Online publication date: 29-Dec-2021
  • (2021) Meta-X: A Technique for Reducing Communication in Geographically Distributed Computations. Cyber Security Cryptography and Machine Learning (467-486). DOI: 10.1007/978-3-030-78086-9_34. Online publication date: 1-Jul-2021
  • (2017) A Survey on Geographically Distributed Big-Data Processing using MapReduce. IEEE Transactions on Big Data (1-1). DOI: 10.1109/TBDATA.2017.2723473. Online publication date: 2017
