DOI: 10.1145/1774088.1774174
research-article

Semi-join computation on distributed file systems using map-reduce-merge model

Published: 22 March 2010

Abstract

Semi-join is the most widely used technique for optimizing complex relational queries on distributed architectures. However, the overhead of semi-join computation can be very high because of data skew and the high cost of communication in such architectures. Internet search engines need to process vast amounts of raw data every day. Hence, systems that manage such data must ensure scalability, reliability, and availability while keeping query processing time reasonable. Hadoop and Google's File System are examples of such systems. In this paper, we present a new algorithm based on the Map-Reduce-Merge model and distributed histograms for processing semi-join operations on such systems. A cost analysis of this algorithm shows that our approach is insensitive to data skew while reducing communication and disk input/output costs to a minimum.
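
The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea of a histogram-driven semi-join, not the authors' method. It assumes in-memory Python lists standing in for distributed file partitions, and the function names (`local_histogram`, `merge_histograms`, `semi_join`) are hypothetical. Each partition builds a local histogram of its join-attribute values, the local histograms are summed into global ones, and only the set of matching keys is shared so that every partition can filter its own tuples locally.

```python
from collections import Counter


def local_histogram(partition, key_index=0):
    """Map step (assumed): count each join-attribute value in one partition."""
    return Counter(row[key_index] for row in partition)


def merge_histograms(histograms):
    """Reduce step (assumed): sum local histograms into a global histogram."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


def semi_join(r_partitions, s_partitions, key_index=0):
    """Return, per partition, the tuples of R whose join key also appears in S."""
    hist_r = merge_histograms(local_histogram(p, key_index) for p in r_partitions)
    hist_s = merge_histograms(local_histogram(p, key_index) for p in s_partitions)
    # Merge step (assumed): only keys present in both histograms can match.
    matching_keys = hist_r.keys() & hist_s.keys()
    # Each partition filters its own tuples; only the key set is exchanged,
    # never the tuples themselves.
    return [
        [row for row in partition if row[key_index] in matching_keys]
        for partition in r_partitions
    ]


# Toy data: two partitions per relation, join key in column 0.
r_parts = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
s_parts = [[(1, "x")], [(3, "y"), (4, "z")]]
result = semi_join(r_parts, s_parts)
```

In this toy run, key 2 appears only in R and key 4 only in S, so the tuple `(2, "b")` is dropped and the remaining R tuples stay in their original partitions. The histograms also carry per-key frequencies, which a skew-aware scheme could use to balance work, though this sketch does not do so.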

Published In

SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing
March 2010
2712 pages
ISBN:9781605586397
DOI:10.1145/1774088
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data skew
  2. distributed file systems
  3. map-reduce-merge model
  4. semi-join operations

Qualifiers

  • Research-article

Conference

SAC'10: The 2010 ACM Symposium on Applied Computing
March 22 - 26, 2010
Sierre, Switzerland

Acceptance Rates

SAC '10 Paper Acceptance Rate 364 of 1,353 submissions, 27%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Cited By

  • (2020) Optimization for Large-Scale Fuzzy Joins Using Fuzzy Filters in MapReduce. 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pages 1-8. DOI: 10.1109/FUZZ48607.2020.9177610. Online publication date: Jul-2020
  • (2015) A fair scheduler using cloud computing for digital TV program recommendation system. Telecommunications Systems, 60:1, pages 55-66. DOI: 10.1007/s11235-014-9921-4. Online publication date: 1-Sep-2015
  • (2015) A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce. Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV - Volume 9620, pages 33-70. DOI: 10.1007/978-3-662-49534-6_2. Online publication date: 1-Dec-2015
  • (2013) Massive Parallel Join in NUMA Architecture. Proceedings of the 2013 IEEE International Congress on Big Data, pages 219-226. DOI: 10.1109/BigData.Congress.2013.37. Online publication date: 27-Jun-2013
  • (2013) A cloud-based intelligent TV program recommendation system. Computers and Electrical Engineering, 39:7, pages 2379-2399. DOI: 10.1016/j.compeleceng.2013.04.025. Online publication date: 1-Oct-2013
  • (2012) A cost aware adaptive multiple table join evaluation in MapReduce. 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, pages 2437-2441. DOI: 10.1109/FSKD.2012.6233855. Online publication date: May-2012
  • (2012) Performance and Scalability of XML Query Processing. Proceedings of the 2012 Sixth International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS), pages 841-846. DOI: 10.1109/CISIS.2012.18. Online publication date: 4-Jul-2012
  • (2011) CPRS. Future Generation Computer Systems, 27:6, pages 823-835. DOI: 10.1016/j.future.2010.10.002. Online publication date: 1-Jun-2011
