Lightning fast and space efficient inequality joins

Published: 01 September 2015

Abstract

Inequality joins, which join relational tables on inequality conditions, are used in various applications. Although database systems offer a wide range of join optimizations, from algorithms such as sort-merge join and band join to indices such as B+-tree, R*-tree, and Bitmap, inequality joins have received little attention, and queries containing such joins are usually very slow. In this paper, we introduce fast inequality join algorithms. We put the columns to be joined in sorted arrays and use permutation arrays to encode the positions of tuples in one sorted array with respect to the other sorted array. In contrast to sort-merge join, we use space-efficient bit-arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL and a distributed version on top of Spark SQL. We have compared our solution against well-known optimization techniques for inequality joins and show that it is more scalable and several orders of magnitude faster.
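
The interplay of sorted arrays, a permutation array, and a bit-array can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `inequality_self_join`, the `(x, y)` row layout, the restriction to a self-join, and the specific predicate (`x` larger and `y` smaller) are all assumptions chosen for illustration. The Bloom filter optimization mentioned in the abstract is omitted, and a `bytearray` stands in for a packed bit-array.

```python
from bisect import bisect_left


def inequality_self_join(rows):
    """Return all index pairs (i, j) such that
    rows[i].x > rows[j].x and rows[i].y < rows[j].y.

    rows is a list of (x, y) tuples (hypothetical layout for this sketch).
    """
    n = len(rows)
    # Sorted array on x (ascending) and on y (descending), stored as row indices.
    by_x = sorted(range(n), key=lambda i: rows[i][0])
    by_y = sorted(range(n), key=lambda i: rows[i][1], reverse=True)
    xs = [rows[i][0] for i in by_x]

    # Permutation array: for the k-th row in y order, its position in x order.
    pos_in_x = [0] * n
    for p, i in enumerate(by_x):
        pos_in_x[i] = p
    perm = [pos_in_x[i] for i in by_y]

    # bits[p] == 1 means the row at x-position p has a strictly larger y
    # than the row currently being probed (one byte per flag for simplicity).
    bits = bytearray(n)
    result = []

    k = 0
    while k < n:
        # Rows with equal y form one group: probe them all before marking,
        # so that ties on y never satisfy the strict inequality.
        y = rows[by_y[k]][1]
        end = k
        while end < n and rows[by_y[end]][1] == y:
            end += 1
        for g in range(k, end):
            i = by_y[g]
            # Positions [0, limit) hold rows with x strictly smaller than rows[i].x.
            limit = bisect_left(xs, rows[i][0])
            for p in range(limit):
                if bits[p]:
                    result.append((i, by_x[p]))
        for g in range(k, end):
            bits[perm[g]] = 1
        k = end
    return result


if __name__ == "__main__":
    sample = [(100, 6), (140, 11), (80, 10), (90, 5)]
    print(sorted(inequality_self_join(sample)))
```

Each row is visited once in `y` order, and the bit-array scan replaces the pairwise comparison of a nested-loop join: a set bit at a smaller `x` position is already known to satisfy the `y` condition. The scan over `bits` is still linear per probe in this sketch, which is the part that indexing over the bit-array would aim to shorten.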

Published In

Proceedings of the VLDB Endowment  Volume 8, Issue 13
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
September 2015
144 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Published in PVLDB Volume 8, Issue 13

Qualifiers

  • Research-article
