research-article

A comparison of join algorithms for log processing in MapReduce

Authors:

Jignesh M. Patel,

Eugene J. Shekita,

Yuanyuan Tian

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 975 - 986

https://doi.org/10.1145/1807167.1807273

Published: 06 June 2010

Abstract

The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform.
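To make the log-to-reference-data join concrete, here is a minimal sketch of one commonly described approach, a repartition (reduce-side) equi-join, simulated in plain Python rather than on Hadoop. The abstract does not name the strategies the paper evaluates, so this is an illustration under assumptions and not the authors' implementation. The function names, the `(user_id, value)` record layout, and the in-memory `shuffle` helper are all hypothetical.

```python
from collections import defaultdict


def map_phase(log_records, user_records):
    """Tag each record with its source and emit (join_key, tagged_value)."""
    for user_id, event in log_records:
        yield user_id, ("log", event)
    for user_id, name in user_records:
        yield user_id, ("user", name)


def shuffle(pairs):
    """Group values by key; a stand-in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every log event with every matching user record."""
    for key, values in groups.items():
        users = [v for tag, v in values if tag == "user"]
        events = [v for tag, v in values if tag == "log"]
        for name in users:
            for event in events:
                yield key, name, event


def repartition_join(log_records, user_records):
    """Inner equi-join on the key; results are sorted for a stable order."""
    pairs = map_phase(log_records, user_records)
    return sorted(reduce_phase(shuffle(pairs)))
```

In this sketch every record from both inputs passes through the shuffle, which is the main cost of a join done this way. Log records whose key has no matching user record are dropped, so it behaves as an inner join.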

Index Terms

1. Information systems

Published In

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

June 2010

1286 pages

ISBN:9781450300322

DOI:10.1145/1807167

General Chair: Ahmed Elmagarmid, Purdue University, USA

Program Chair: Divyakant Agrawal, University of California at Santa Barbara, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '10: International Conference on Management of Data

June 6 - 10, 2010

Indianapolis, Indiana, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
