research-article
Open access

Cabin: A Compressed Adaptive Binned Scan Index

Published: 26 March 2024
    Abstract

    Scan is a fundamental operation widely used in main-memory analytical database systems. To accelerate scans, previous studies build either record-order or sort-order structures known as scan indices. While achieving good performance, scan indices often incur significant space overhead, limiting their use in main-memory databases. For example, the most recent and best-performing scan index, BinDex, consists of a sort-order position array, which is an array of rowIDs in value order, and a set of record-order bit vectors, which represent records in pre-defined value intervals. Together, these structures can be much larger than the base data column.
    In this paper, we propose a novel scan index, Cabin, that exploits the following three techniques for a better time-space tradeoff: 1) filter sketches, which represent every 2^w-2 value intervals with a w-bit sketched vector, thereby exponentially reducing the space for the bit vectors; 2) a selective position array, which removes the rowID array for a fraction of intervals in order to lower the space overhead of the position array; and 3) data-aware intervals, which judiciously select interval boundaries based on the data characteristics to better support popular values in skewed data distributions or categorical attributes. Experimental results show that, compared with state-of-the-art scan solutions, Cabin achieves a better time-space tradeoff and attains a 1.70x to 4.48x improvement in average scan performance given the same space budget.
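
To make the two baseline structures named in the abstract concrete, the following is a minimal Python sketch of a binned scan index with a sort-order position array and record-order bit vectors. It is an illustration only, not the BinDex or Cabin implementation: the function names (`build_index`, `scan_less_than`, `rows_of`), the choice of one bit vector per interval, the use of Python integers as bit vectors, and the auxiliary `sorted_vals` list are all assumptions made for brevity. It covers none of Cabin's three techniques.

```python
from bisect import bisect_left, bisect_right

def build_index(column, boundaries):
    """Build the two structures for one column (illustrative layout).

    boundaries: ascending cut points. Interval i holds values v with
    boundaries[i-1] <= v < boundaries[i] (open-ended at both extremes).
    """
    # Sort-order position array: row IDs arranged in value order.
    positions = sorted(range(len(column)), key=lambda r: column[r])
    # Convenience copy of the values in sorted order (assumption: a real
    # index would look values up through the position array instead).
    sorted_vals = [column[r] for r in positions]
    # Record-order bit vectors: bit r of bitvecs[i] is set when row r
    # falls in interval i. A Python int stands in for a bit vector.
    bitvecs = [0] * (len(boundaries) + 1)
    for row, v in enumerate(column):
        bitvecs[bisect_right(boundaries, v)] |= 1 << row
    return positions, sorted_vals, bitvecs

def scan_less_than(index, boundaries, c):
    """Return a bit mask of the rows whose value is < c."""
    positions, sorted_vals, bitvecs = index
    k = bisect_right(boundaries, c)      # interval that contains c
    result = 0
    for bv in bitvecs[:k]:               # intervals entirely below c
        result |= bv
    # Refine the partially covered interval k with the position array.
    lo = bisect_left(sorted_vals, boundaries[k - 1]) if k > 0 else 0
    hi = bisect_left(sorted_vals, c)
    for row in positions[lo:hi]:
        result |= 1 << row
    return result

def rows_of(mask):
    """Expand a bit mask into a sorted list of row IDs."""
    return [r for r in range(mask.bit_length()) if (mask >> r) & 1]

column = [5, 1, 9, 3, 7, 3]
boundaries = [4, 8]
index = build_index(column, boundaries)
print(rows_of(scan_less_than(index, boundaries, 6)))
```

In this sketch, whole intervals are answered by OR-ing bit vectors, and only the interval that straddles the predicate constant needs per-row work through the position array. Both structures grow with the number of rows, which is the space overhead the abstract describes.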

      Published In

      Proceedings of the ACM on Management of Data, Volume 2, Issue 1
      SIGMOD
      February 2024
      1874 pages
      EISSN:2836-6573
      DOI:10.1145/3654807
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data awareness
      2. filter sketches
      3. scan index
      4. time-space tradeoff

      Qualifiers

      • Research-article

      Funding Sources

      • Natural Science Foundation of China
