DOI: 10.1145/1559845.1559865
research-article

A comparison of approaches to large-scale data analysis

Published: 29 June 2009

    Abstract

    There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than for the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

    References

    [1] Hadoop. http://hadoop.apache.org/.
    [2] Hive. http://hadoop.apache.org/hive/.
    [3] Vertica. http://www.vertica.com/.
    [4] Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical report, 1998.
    [5] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Proc. VLDB Endow., 1(2):1265-1276, 2008.
    [6] Cisco Systems. Cisco Catalyst 3750-E Series Switches Data Sheet, June 2008.
    [7] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. Under Submission, March 2009.
    [8] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, pages 10-10, 2004.
    [9] D. J. DeWitt and R. H. Gerber. Multiprocessor Hash-based Join Algorithms. In VLDB '85, pages 151-164, 1985.
    [10] D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. In VLDB '86, pages 228-237, 1986.
    [11] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine. In VLDB '86, pages 209-219, 1986.
    [12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29-43, 2003.
    [13] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In EuroSys '07, pages 59-72, 2007.
    [14] E. Meijer, B. Beckman, and G. Bierman. LINQ: Reconciling Object, Relations and XML in the .NET Framework. In SIGMOD '06, pages 706-706, 2006.
    [15] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD '08, pages 1099-1110, 2008.
    [16] J. Ong, D. Fogg, and M. Stonebraker. Implementation of Data Abstraction in the Relational Database System INGRES. SIGMOD Rec., 14(1):1-14, 1983.
    [17] D. A. Patterson. Technical Perspective: The Data Center is the Computer. Commun. ACM, 51(1):105-105, 2008.
    [18] R. Rustin, editor. ACM-SIGMOD Workshop on Data Description, Access and Control, May 1974.
    [19] M. Stonebraker. The Case for Shared Nothing. Database Engineering, 9:4-9, 1986.
    [20] M. Stonebraker and J. Hellerstein. What Goes Around Comes Around. In Readings in Database Systems, pages 2-41. The MIT Press, 4th edition, 2005.
    [21] D. Thomas, D. Hansson, L. Breedt, M. Clark, J. D. Davidson, J. Gehtland, and A. Schwarz. Agile Web Development with Rails. Pragmatic Bookshelf, 2006.

    Information & Contributors

    Information

    Published In

    SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
    June 2009
    1168 pages
    ISBN:9781605585512
    DOI:10.1145/1559845
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. benchmarks
    2. mapreduce
    3. parallel database

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '09: International Conference on Management of Data
    June 29 - July 2, 2009
    Providence, Rhode Island, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 291
    • Downloads (Last 6 weeks): 40

    Citations

    Cited By

    • (2024) Towards a Hierarchical Exascale Framework for Iterative Parallel Data Analysis Algorithms. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (293-296). DOI: 10.1109/PDP62718.2024.00049. Online publication date: 20-Mar-2024
    • (2024) Lossy Compression of Adjacency Matrices by Graph Filter Banks. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (9386-9390). DOI: 10.1109/ICASSP48485.2024.10448045. Online publication date: 14-Apr-2024
    • (2024) A Linear Combination-Based Method to Construct Proxy Benchmarks for Big Data Workloads. Benchmarking, Measuring, and Optimizing (120-136). DOI: 10.1007/978-981-97-0316-6_8. Online publication date: 14-Feb-2024
    • (2023) JQPro: Join Query Processing in a Distributed System for Big RDF Data Using the Hash-Merge Join Technique. Mathematics 11:5 (1275). DOI: 10.3390/math11051275. Online publication date: 6-Mar-2023
    • (2023) Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph. Proceedings of the VLDB Endowment 16:10 (2578-2590). DOI: 10.14778/3603581.3603596. Online publication date: 1-Jun-2023
    • (2023) CoTel: Ontology-Neural Co-Enhanced Text Labeling. Proceedings of the ACM Web Conference 2023 (1897-1906). DOI: 10.1145/3543507.3583533. Online publication date: 30-Apr-2023
    • (2023) Scheduling distributed multiway spatial join queries: optimization models and algorithms. International Journal of Geographical Information Science 37:6 (1388-1419). DOI: 10.1080/13658816.2023.2170380. Online publication date: 6-Feb-2023
    • (2023) Artificial intelligence inspired IoT-fog based framework for generating early alerts while train passengers traveling in dangerous states using surveillance videos. Multimedia Tools and Applications 83:5 (13613-13635). DOI: 10.1007/s11042-023-16107-0. Online publication date: 7-Jul-2023
    • (2023) Pattern-Preserved Normalization Enabled User Profiling. Smart Grid and Innovative Frontiers in Telecommunications (331-341). DOI: 10.1007/978-3-031-31733-0_28. Online publication date: 26-May-2023
    • (2023) Importance of Data Wrangling in Industry 4.0. Data Wrangling (109-121). DOI: 10.1002/9781119879862.ch6. Online publication date: 14-Jun-2023
