research-article

Assignment Problems of Different-Sized Inputs in MapReduce

Authors:

Ephraim Korach,

Shantanu Sharma,

Jeffrey D. Ullman

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 11, Issue 2

Article No.: 18, Pages 1 - 35

https://doi.org/10.1145/2987376

Published: 03 December 2016

Abstract

A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers such that, for each required output, there exists a reducer that receives all the inputs participating in the computation of that output. Reducers have a capacity that limits the sets of inputs they can be assigned. However, individual inputs may vary in size. We consider, for the first time, mapping schemas in which input sizes are part of the considerations and restrictions. One of the significant parameters to optimize in any MapReduce job is the communication cost between the map and reduce phases, which can be reduced by minimizing the number of copies of inputs sent to the reducers. The communication cost is closely related to the number of reducers of constrained capacity needed to accommodate the inputs so that the requirement on how the inputs must meet in a reducer is satisfied. In this work, we consider a family of problems in which each input must meet every other input in at least one reducer. We also consider a slightly different family of problems in which each input of a list, X, must meet each input of another list, Y, in at least one reducer. We prove that finding an optimal mapping schema for these families of problems is NP-hard, and we present a bin-packing-based approximation algorithm for finding a near-optimal mapping schema.
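
The bin-packing idea named in the abstract can be illustrated with a minimal Python sketch for the first family of problems, where every input must meet every other input. This is not necessarily the authors' algorithm as published. It assumes that every input has size at most q/2, where q is the reducer capacity, and it uses First-Fit Decreasing as the packing heuristic. All function names (`first_fit_decreasing`, `all_pairs_schema`, `communication_cost`, `is_valid`) are hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def first_fit_decreasing(sizes, capacity):
    """Pack input indices into bins of the given capacity (First-Fit Decreasing)."""
    bins, loads = [], []
    # Visit inputs from largest to smallest; ties keep their original order.
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        if sizes[i] > capacity:
            raise ValueError(f"input {i} (size {sizes[i]}) exceeds bin capacity {capacity}")
        for b, load in enumerate(loads):
            if load + sizes[i] <= capacity:
                bins[b].append(i)
                loads[b] += sizes[i]
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            loads.append(sizes[i])
    return bins


def all_pairs_schema(sizes, q):
    """Return reducers (lists of input indices) in which every two inputs meet.

    Assumption for this sketch: each input has size at most q / 2.
    """
    bins = first_fit_decreasing(sizes, q / 2)
    if len(bins) == 1:
        return [sorted(bins[0])]
    # Two bins of size at most q / 2 always fit together in one reducer.
    return [sorted(a + b) for a, b in combinations(bins, 2)]


def communication_cost(sizes, reducers):
    """Total size of all input copies sent to the reducers."""
    return sum(sizes[i] for r in reducers for i in r)


def is_valid(sizes, q, reducers):
    """Check the capacity constraint and that every pair of inputs meets."""
    fits = all(sum(sizes[i] for i in r) <= q for r in reducers)
    met = {frozenset(p) for r in reducers for p in combinations(r, 2)}
    needed = {frozenset(p) for p in combinations(range(len(sizes)), 2)}
    return fits and needed <= met


if __name__ == "__main__":
    sizes = [3, 2, 2, 1, 1, 1]
    reducers = all_pairs_schema(sizes, q=8)
    print(reducers, communication_cost(sizes, reducers), is_valid(sizes, 8, reducers))
```

With the sample sizes and q = 8, the packing step produces three bins of capacity 4, so the sketch uses three reducers with a total communication cost of 20. Two inputs in the same bin meet in every reducer that holds that bin, and two inputs in different bins meet in the reducer assigned to that pair of bins.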

Published In

ACM Transactions on Knowledge Discovery from Data Volume 11, Issue 2

May 2017

419 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3017677

Editor:
Philip S. Yu
University of Illinois at Chicago, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 December 2016

Accepted: 01 August 2016

Revised: 01 April 2016

Received: 01 July 2015

Published in TKDD Volume 11, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

project Handling Uncertainty in Data Intensive Applications
Operational Program “Education and Lifelong Learning,” under the program THALES
European Union (European Social Fund) and Greek national funds
Rita Altura Trust Chair in Computer Sciences
Infrastructure Research in the Field of Advanced Computing and Cyber Security
Israel Science Foundation
Lynne and William Frankel Center for Computer Sciences
Cabarnit Cyber Security MAGNET Consortium
Ministry of Science and Technology
