10.5555/1251254.1251264
Article

MapReduce: simplified data processing on large clusters

Published: 06 December 2004

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
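The model described above can be illustrated with a minimal single-process sketch of a word count. This is not the paper's implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` and the sample documents are hypothetical, and the grouping step simply stands in for whatever the real system does to collect values that share an intermediate key.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: take one input key/value pair (name, contents) and emit
    intermediate (word, 1) pairs."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share one key."""
    return sum(counts)


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Sequential stand-in for the runtime (an assumption for illustration):
    run the mapper, group values by intermediate key, then run the reducer."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in mapper(key, value):
            groups[ikey].append(ivalue)
    return {ikey: reducer(ikey, values) for ikey, values in groups.items()}


# Hypothetical sample input: two small "documents".
docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result["the"])
```

The user supplies only `map_fn` and `reduce_fn`; everything in `run_mapreduce` is the part a runtime would own.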
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
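The responsibilities listed above (partitioning the input, scheduling work, handling failures) can be sketched on a single machine. This is only an illustration under stated assumptions: threads stand in for machines, a simple retry loop stands in for failure handling, and every name here (`split_input`, `map_task`, `run_with_retry`, `run_job`) is hypothetical rather than taken from the paper.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split_input(records: List[str], num_splits: int) -> List[List[str]]:
    """Partition the input into num_splits pieces, one per map task."""
    return [records[i::num_splits] for i in range(num_splits)]


def map_task(split: List[str]) -> Dict[str, int]:
    """Count words within one split (a map task with local aggregation)."""
    counts: Dict[str, int] = defaultdict(int)
    for line in split:
        for word in line.split():
            counts[word] += 1
    return dict(counts)


def run_with_retry(task: Callable, arg, max_attempts: int = 3):
    """Assumed failure policy: re-execute a failed task up to max_attempts
    times, re-raising the last error if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return task(arg)
        except Exception:
            if attempt == max_attempts - 1:
                raise


def run_job(records: List[str], num_workers: int = 4) -> Dict[str, int]:
    """Schedule one map task per split on a worker pool, then merge results."""
    splits = split_input(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda s: run_with_retry(map_task, s), splits))
    totals: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            totals[word] += n
    return dict(totals)


# Hypothetical sample input.
lines = ["a b a", "b c", "a", "c c c"]
totals = run_job(lines, num_workers=2)
print(totals)
```

The caller of `run_job` never touches splits, workers, or retries, which is the separation of concerns the paragraph describes.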
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

Published In

OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
December 2004
403 pages

Sponsors

  • USENIX Association

Publisher

USENIX Association

United States

Qualifiers

  • Article
