
MapReduce: simplified data processing on large clusters

Published: 01 January 2008

    Abstract

    MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
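
    As an illustration of the model described in the abstract, here is a minimal sketch of the word-count example the paper itself uses, written in Python rather than the paper's C++-style pseudocode. The function names are assumptions chosen for illustration, not Google's API, and the distributed runtime (input partitioning, shuffling of intermediate pairs, fault tolerance) is deliberately left out because the system, not user code, supplies it.

        # Word count expressed as a map and a reduce function.
        # map_fn emits an intermediate (word, 1) pair for every word;
        # reduce_fn sums the counts emitted for a single word.
        def map_fn(doc_name, contents):
            for word in contents.split():
                yield (word, 1)

        def reduce_fn(word, counts):
            return (word, sum(counts))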

    Supplementary Material

    PDF File (p107-dean.jp.pdf)
    Requires Asian language support in Adobe Reader and Japanese language support in your browser.

    References

    [1] Hadoop: Open source implementation of MapReduce. http://lucene.apache.org/hadoop/.
    [2] The Phoenix system for MapReduce programming. http://csl.stanford.edu/~christos/sw/phoenix/.
    [3] Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., Culler, D. E., Hellerstein, J. M., and Patterson, D. A. 1997. High-performance sorting on networks of workstations. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data. Tucson, AZ.
    [4] Barroso, L. A., Dean, J., and Hölzle, U. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23, 2, 22-28.
    [5] Bent, J., Thain, D., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Livny, M. 2004. Explicit control in a batch-aware distributed file system. In Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation (NSDI).
    [6] Blelloch, G. E. 1989. Scans as primitive parallel operations. IEEE Trans. Comput. C-38, 11.
    [7] Chu, C.-T., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A., and Olukotun, K. 2006. Map-Reduce for machine learning on multicore. In Proceedings of the Neural Information Processing Systems Conference (NIPS). Vancouver, Canada.
    [8] Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of Operating Systems Design and Implementation (OSDI). San Francisco, CA. 137-150.
    [9] Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A., and Gauthier, P. 1997. Cluster-based scalable network services. In Proceedings of the 16th ACM Symposium on Operating System Principles. Saint-Malo, France. 78-91.
    [10] Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The Google file system. In 19th Symposium on Operating Systems Principles. Lake George, NY. 29-43.
    [11] Gorlatch, S. 1996. Systematic efficient parallelization of scan and other list homomorphisms. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, Eds. Euro-Par'96. Parallel Processing, Lecture Notes in Computer Science, vol. 1124. Springer-Verlag. 401-408.
    [12] Gray, J. Sort benchmark home page. http://research.microsoft.com/barc/SortBenchmark/.
    [13] Huston, L., Sukthankar, R., Wickremesinghe, R., Satyanarayanan, M., Ganger, G. R., Riedel, E., and Ailamaki, A. 2004. Diamond: A storage architecture for early discard in interactive search. In Proceedings of the 2004 USENIX File and Storage Technologies (FAST) Conference.
    [14] Ladner, R. E. and Fischer, M. J. 1980. Parallel prefix computation. JACM 27, 4. 831-838.
    [15] Rabin, M. O. 1989. Efficient dispersal of information for security, load balancing and fault tolerance. JACM 36, 2. 335-348.
    [16] Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., and Kozyrakis, C. 2007. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA). Phoenix, AZ.
    [17] Riedel, E., Faloutsos, C., Gibson, G. A., and Nagle, D. Active disks for large-scale data processing. IEEE Computer. 68-74.


    Reviews

    Chris A. Mattmann

    Google has revolutionized the way that large-scale data management is engineered, deployed, and evolved over time. In particular, it developed novel methods for file-based data management for rapid indexing and searching of Web pages (PageRank and Google's index structure) and for large-scale data computation. Dean and Ghemawat's article describes Google's distributed computation paradigm, MapReduce, along with its associated infrastructure and deployment at Google.

    MapReduce is a programming paradigm in which developers cast a computational problem in the form of two atomic components: a "map" function (similar to the Lisp map function), in which a set of input data in the form of (key, value) pairs is split into a set of intermediate (key, value) pairs, and a "reduce" function (similar to the Lisp reduce function) that takes an intermediate key and its set of associated values and reduces that set to a smaller one, typically consisting of just a single value.

    Google has found that several of its mission-critical services can be cast as MapReduce-style problems. Specifically, Dean and Ghemawat tout Google's major success in retooling its production crawling/indexing service as a MapReduce program; there are many other examples, including large-scale machine learning problems, clustering problems for Google News and Froogle, identification of popular queries, processing of satellite imagery, and over 10,000 others. The general applicability and simplicity of the MapReduce paradigm has led to publicly available implementation frameworks beyond Google's in-house solution: Apache Hadoop, an open-source, Java-based implementation of MapReduce, and the Phoenix shared-memory MapReduce system developed by the computer science department at Stanford University (both are mentioned in the paper).

    This is a very readable paper that serves as a higher-level summary of Dean and Ghemawat's earlier, more technical paper [1]. The casual practitioner who wants to learn the value added by adopting MapReduce-style programs will find this paper interesting, as will architects who want to understand the core components and architectural style of MapReduce; these readers should focus on sections 2 and 3. For those interested in the specifics of how MapReduce was implemented, optimized, and evaluated at Google, sections 4 and 5 will be of interest. Sections 1 and 6 identify the importance of MapReduce at Google and are valuable in making the business case for adopting it at a particular organization. Overall, this paper is a fast, enjoyable read for any software developer working in the area of data-intensive information systems, as Google has clearly engendered a viable computational paradigm and architectural style for simplifying the construction of software within the domain.

    Online Computing Reviews Service
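
    To make the two phases described above concrete, the following single-machine sketch simulates what the runtime does between them: apply the map function to every input pair, group intermediate values by key (the shuffle), then call the reduce function once per key. It is an illustration of the paradigm under assumed function names only, not the Google, Hadoop, or Phoenix API.

        from collections import defaultdict

        def map_fn(doc_name, contents):
            # Emit an intermediate (word, 1) pair for every word.
            for word in contents.split():
                yield (word, 1)

        def reduce_fn(word, counts):
            # Collapse all counts for one word into a single total.
            return (word, sum(counts))

        def run_mapreduce(inputs, map_fn, reduce_fn):
            # Single-machine stand-in for the distributed runtime:
            # map every input, group intermediate values by key,
            # then reduce each key's value list.
            intermediate = defaultdict(list)
            for key, value in inputs:
                for out_key, out_value in map_fn(key, value):
                    intermediate[out_key].append(out_value)
            return [reduce_fn(k, vs) for k, vs in intermediate.items()]

        docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
        print(run_mapreduce(docs, map_fn, reduce_fn))
        # [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]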



    Published In

    Communications of the ACM  Volume 51, Issue 1
    50th anniversary issue: 1958 - 2008
    January 2008
    106 pages
    ISSN:0001-0782
    EISSN:1557-7317
    DOI:10.1145/1327452
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

