research-article

Open access

Cabin: A Compressed Adaptive Binned Scan Index

Authors:

Yiyuan Chen and

Shimin ChenAuthors Info & Claims

Proceedings of the ACM on Management of Data, Volume 2, Issue 1

Article No.: 57, Pages 1 - 26

https://doi.org/10.1145/3639312

Published: 26 March 2024 Publication History

Abstract

Scan is a fundamental operation widely used in main-memory analytical database systems. To accelerate scans, previous studies build either record-order or sort-order structures known as scan indices. While achieving good performance, scan indices often incur significant space overhead, limiting their use in main-memory databases. For example, the most recent and best performing scan index, BinDex, consists of a sort-order position array, which is an array of rowIDs in the value order, and a set of record-order bit vectors, representing records in pre-defined value intervals. The structures can be much larger than the base data column size.

In this paper, we propose a novel scan index, Cabin, that exploits the following three techniques for better time-space tradeoff. 1) filter sketches that represent every 2^w-2 value intervals with a w-bit sketched vector, thereby exponentially reducing the space for the bit vectors; 2) selective position array that removes the rowID array for a fraction of intervals in order to lower the space overhead for the position array; and 3) data-aware intervals that judiciously select interval boundaries based on the data characteristics to better support popular values in skewed data distributions or categorical attributes. Experimental results show that compared with state-of-the-art scan solutions, Cabin achieves better time-space tradeoff, and attains 1.70 -- 4.48x improvement for average scan performance given the same space budget.

References

[1]

2023. DBLP Citation Network Dataset. https://www.aminer.cn/citation.

[2]

2023. IMDb Datasets. https://developer.imdb.com/non-commercial-datasets/.

[3]

2023. stx::Btree(tlx::Btree). https://github.com/tlx.

[4]

Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, and Marios Skounakis. 2001. Weaving Relations for Cache Performance. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11--14, 2001, Roma, Italy. Morgan Kaufmann, 169--180.

[5]

Albert-László Barabási and Réka Albert. 1999. Emergence of scaling in random networks. science 286, 5439 (1999), 509--512.

[6]

Ronald Barber, Peter Bendel, Marco Czech, Oliver Draese, Frederick Ho, Namik Hrle, Stratos Idreos, Min-Soo Kim, Oliver Koeth, Jae-Gil Lee, Tianchao Tim Li, Guy M. Lohman, Konstantinos Morfonios, René Müller, Keshava Murthy, Ippokratis Pandis, Lin Qiao, Vijayshankar Raman, Richard Sidle, Knut Stolze, and Sandor Szabo. 2012. Business Analytics in (a) Blink. IEEE Data Eng. Bull. 35, 1 (2012), 9--14.

[7]

Carsten Binnig, Stefan Hildenbrand, and Franz Färber. 2009. Dictionary-based order-preserving string compression for main memory column stores. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009. ACM, 283--296.

Digital Library

[8]

Peter A. Boncz, Martin L. Kersten, and Stefan Manegold. 2008. Breaking the memory wall in MonetDB. Commun. ACM 51, 12 (2008), 77--85.

Digital Library

[9]

David Broneske, Sebastian Breß, and Gunter Saake. 2014. Database Scan Variants on Modern CPUs: A Performance Study. In Proceedings of the 2nd International Workshop on In Memory Data Management and Analytics, IMDM 2014, Hangzhou, China, September 1, 2014. 1--15.

[10]

Chee Yong Chan and Yannis E. Ioannidis. 1998. Bitmap Index Design and Evaluation. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2--4, 1998, Seattle, Washington, USA. ACM Press, 355--366.

[11]

Douglas Comer. 1979. The Ubiquitous B-Tree. ACM Comput. Surv. 11, 2 (1979), 121--137.

Digital Library

[12]

Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg, and Wolfgang Lehner. 2011. SAP HANA database: data management for modern business applications. SIGMOD Rec. 40, 4 (2011), 45--51.

Digital Library

[13]

Ziqiang Feng, Eric Lo, Ben Kao, and Wenjian Xu. 2015. ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. ACM, 31--46.

Digital Library

[14]

Craig Freedman, Erik Ismert, and Per-Åke Larson. 2014. Compilation in the Microsoft SQL Server Hekaton Engine. IEEE Data Eng. Bull. 37, 1 (2014), 22--30.

[15]

John L. Hennessy and David A. Patterson. 2012. Computer Architecture - A Quantitative Approach, 5th Edition. Morgan Kaufmann.

[16]

Brian Hentschel, Michael S. Kester, and Stratos Idreos. 2018. Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10--15, 2018. ACM, 857--872.

Digital Library

[17]

Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database Cracking. In Third Biennial Conference on Innovative Data Systems Research, CIDR 2007, Asilomar, CA, USA, January 7--10, 2007, Online Proceedings. www.cidrdb.org, 68--78.

[18]

Ryan Johnson, Vijayshankar Raman, Richard Sidle, and Garret Swart. 2008. Row-wise parallel predicate evaluation. Proc. VLDB Endow. 1, 1 (2008), 622--634.

Digital Library

[19]

André Kohn, Viktor Leis, and Thomas Neumann. 2018. Adaptive Execution of Compiled Queries. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16--19, 2018. IEEE Computer Society, 197--208.

[20]

Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10--15, 2018. ACM, 489--504.

Digital Library

[21]

Konstantinos Krikellas, Stratis Viglas, and Marcelo Cintra. 2010. Generating code for holistic query evaluation. In Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1--6, 2010, Long Beach, California, USA. IEEE Computer Society, 613--624.

[22]

Tirthankar Lahiri, Shasank Chavan, Maria Colgan, Dinesh Das, Amit Ganesh, Mike Gleeson, Sanket Hase, Allison Holloway, Jesse Kamp, Teck-Hua Lee, Juan Loaiza, Neil MacNaughton, Vineet Marwah, Niloy Mukherjee, Atrayee Mullick, Sujatha Muthulingam, Vivekanandhan Raja, Marty Roth, Ekrem Soylemez, and Mohamed Zaït. 2015. Oracle Database In-Memory: A dual format in-memory database. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13--17, 2015. IEEE Computer Society, 1253--1258.

[23]

Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016. ACM, 311--326.

Digital Library

[24]

Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory databases. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8--12, 2013. IEEE Computer Society, 38--49.

Digital Library

[25]

Linwei Li, Kai Zhang, Jiading Guo, Wen He, Zhenying He, Yinan Jing, Weili Han, and X. Sean Wang. 2020. BinDex: A Two-Layered Index for Fast and Robust Scans. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14--19, 2020. ACM, 909--923.

[26]

Yinan Li and Jignesh M. Patel. 2013. BitWeaving: fast scans for main memory data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22--27, 2013. ACM, 289--300.

Digital Library

[27]

Guido Moerkotte. 1998. Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. In VLDB'98, Proceedings of 24rd International Conference on Very Large Data Bases, August 24--27, 1998, New York City, New York, USA. Morgan Kaufmann, 476--487.

[28]

Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern Hardware. Proc. VLDB Endow. 4, 9 (2011), 539--550.

Digital Library

[29]

Patrick E. O'Neil and Dallan Quass. 1997. Improved Query Performance with Variant Indexes. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13--15, 1997, Tucson, Arizona, USA. ACM Press, 38--49. https://doi.org/10.1145/253260.253268

Digital Library

[30]

Holger Pirk, Stefan Manegold, and Martin L. Kersten. 2014. Waste not... Efficient co-processing of relational data. In IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014. IEEE Computer Society, 508--519.

[31]

Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD Vectorization for In-Memory Databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. ACM, 1493--1508.

Digital Library

[32]

Mark Raasveldt and Hannes Mühleisen. 2019. DuckDB: an Embeddable Analytical Database. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019. ACM, 1981--1984.

Digital Library

[33]

Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter R. Mattson, and John D. Owens. 2000. Memory access scheduling. In 27th International Symposium on Computer Architecture (ISCA 2000), June 10--14, 2000, Vancouver, BC, Canada. IEEE Computer Society, 128--138.

[34]

Lefteris Sidirourgos and Martin L. Kersten. 2013. Column imprints: a secondary index structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22--27, 2013. ACM, 893--904.

Digital Library

[35]

Liwen Sun, Michael J. Franklin, Sanjay Krishnan, and Reynold S. Xin. 2014. Fine-grained partitioning for aggressive data skipping. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22--27, 2014. ACM, 1115--1126.

Digital Library

[36]

Skye Wanderman-Milne and Nong Li. 2014. Runtime Code Generation in Cloudera Impala. IEEE Data Eng. Bull. 37, 1 (2014), 31--37.

[37]

Thomas Willhalm, Ismail Oukid, Ingo Müller, and Franz Faerber. 2013. Vectorizing Database Column Scans with Complex Predicates. In International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures - ADMS 2013, Riva del Garda, Trento, Italy, August 26, 2013. 1--12.

[38]

Marcin Zukowski, Peter A. Boncz, Niels Nes, and Sándor Héman. 2005. MonetDB/X100 - A DBMS In The CPU Cache. IEEE Data Eng. Bull. 28, 2 (2005), 17--22.

Index Terms

Cabin: A Compressed Adaptive Binned Scan Index
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
        Data scans

Recommendations

Increasing buffer-locality for multiple index based scans through intelligent placement and index scan speed control
VLDB '07: Proceedings of the 33rd international conference on Very large data bases

Decision support systems are characterized by large concurrent scan operations. A significant percentage of these scans are executed as index based scans of the data. This is especially true when the data is physically clustered on the index columns ...
Read More
Cost-free scan: a low-overhead scan path design methodology
ICCAD '95: Proceedings of the 1995 IEEE/ACM international conference on Computer-aided design

Conventional scan design imposes considerable area and delay overhead by using larger scan flip-flops and additional scan wires without utilizing the functionality of the combinational logic. We propose a novel low-overhead scan design methodology, ...
Read More
Super-linear time-space tradeoff lower bounds for randomized computation
FOCS '00: Proceedings of the 41st Annual Symposium on Foundations of Computer Science

We prove the first time-space lower bound tradeoffs for randomized computation of decision problems. The bounds hold even in the case that the computation is allowed to have arbitrary probability of error on a small fraction of inputs. Our techniques ...
Read More

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data

Proceedings of the ACM on Management of Data Volume 2, Issue 1

SIGMOD

February 2024

1874 pages

EISSN:2836-6573

DOI:10.1145/3654807

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Issue’s Table of Contents

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 March 2024

Published in PACMMOD Volume 2, Issue 1

Author Tags

Qualifiers

Research-article

Funding Sources

Natural Science Foundation of China

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
223
Total Downloads

Downloads (Last 12 months)223
Downloads (Last 6 weeks)76

Other Metrics

View Author Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents