DOI: 10.1145/1807167.1807273
research-article

A comparison of join algorithms for log processing in MapReduce

Published: 06 June 2010

Abstract

The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform.
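To make the log-to-reference-data join concrete, here is a minimal in-memory sketch of one common strategy, a reduce-side (repartition) join, written in plain Python rather than against the Hadoop API. The abstract does not name the strategies the paper compares, so this is an illustration under stated assumptions, not the authors' implementation. The record layout, the `user_id` join key, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(log_records, user_records):
    """Tag each record with its source and emit (join_key, tagged_record)."""
    for rec in log_records:
        yield rec["user_id"], ("L", rec)  # "L" marks a log record
    for rec in user_records:
        yield rec["user_id"], ("R", rec)  # "R" marks a reference record


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every log record with every reference record."""
    for key in sorted(groups):
        values = groups[key]
        refs = [rec for tag, rec in values if tag == "R"]
        logs = [rec for tag, rec in values if tag == "L"]
        for log in logs:
            for ref in refs:
                yield {**ref, **log}  # inner join: keys without a match emit nothing


def repartition_join(log_records, user_records):
    """Run the three simulated phases and return the joined rows."""
    return list(reduce_phase(shuffle(map_phase(log_records, user_records))))


logs = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 2, "url": "/b"},
    {"user_id": 1, "url": "/c"},
    {"user_id": 3, "url": "/d"},  # no matching user, dropped by the inner join
]
users = [
    {"user_id": 1, "name": "ann"},
    {"user_id": 2, "name": "bob"},
]
joined = repartition_join(logs, users)
```

In this sketch every record from both inputs passes through the shuffle. On a real cluster that step moves data across the network, which is one reason a simple join of this kind can be costly.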

Published In

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
1286 pages
ISBN: 9781450300322
DOI: 10.1145/1807167
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. analytics
  2. hadoop
  3. join processing
  4. mapreduce

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '10: International Conference on Management of Data
June 6 - 10, 2010
Indianapolis, Indiana, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • (2024) A systematic overview of data federation systems. Semantic Web 15:1 (107-165). DOI: 10.3233/SW-223201. Online publication date: 12-Jan-2024
  • (2023) Toward Network-Aware Query Execution Systems in Large Datacenters. IEEE Transactions on Network and Service Management 20:4 (4494-4504). DOI: 10.1109/TNSM.2023.3273166. Online publication date: Dec-2023
  • (2023) System Log Parsing: A Survey. IEEE Transactions on Knowledge and Data Engineering (1-20). DOI: 10.1109/TKDE.2022.3222417. Online publication date: 2023
  • (2023) SIESTA: A Scalable Infrastructure of Sequential Pattern Analysis. IEEE Transactions on Big Data 9:3 (975-990). DOI: 10.1109/TBDATA.2022.3229092. Online publication date: 1-Jun-2023
  • (2023) Enhancement of CURE algorithm using Map-Reduce Technique with Parallelism. 2023 International Conference on Data Science, Agents & Artificial Intelligence (ICDSAAI) (1-4). DOI: 10.1109/ICDSAAI59313.2023.10452533. Online publication date: 21-Dec-2023
  • (2022) Evaluating Utilization of Cloud Computing for IoT Big Data Systems. International Journal of Distributed Artificial Intelligence 10:1 (34-42). DOI: 10.4018/IJDAI.2018010103. Online publication date: 17-May-2022
  • (2022) SyncSignature. Proceedings of the VLDB Endowment 16:2 (330-342). DOI: 10.14778/3565816.3565833. Online publication date: 1-Oct-2022
  • (2022) Scaling Equi-Joins. Proceedings of the 2022 International Conference on Management of Data (2163-2176). DOI: 10.1145/3514221.3526042. Online publication date: 10-Jun-2022
  • (2022) Network Page Building Methodical Reviews Using Involuntary Manuscript Classification Procedures Founded on Deep Learning. 2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME) (1-8). DOI: 10.1109/ICECCME55909.2022.9988457. Online publication date: 16-Nov-2022
  • (2022) Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance. Parallel Computing 111 (102918). DOI: 10.1016/j.parco.2022.102918. Online publication date: Jul-2022
