DOI: 10.1145/3662010.3663452
research-article
Open access

NULLS!: Revisiting Null Representation in Modern Columnar Formats

Published: 09 June 2024

Abstract

Nulls are common in real-world data sets, yet recent research on columnar formats and encodings rarely addresses Null representations. Popular file formats like Parquet and ORC follow the same design as C-Store from nearly 20 years ago, which stores only non-Null values contiguously. More recent formats instead store both non-Null and Null values, with Nulls set to a placeholder value. In this work, we analyze each approach's pros and cons under different data distributions, encoding schemes (each with a different best-suited SIMD ISA), and implementations. We optimize the bottlenecks in the traditional approach using AVX512. We also propose a Null-filling strategy called SmartNull, which determines at encoding time the Null values that give the best compression ratio. From our micro-benchmarks, we argue that the optimal Null compression depends on several factors: decoding speed, data distribution, and Null ratio. Our analysis shows that the Compact layout performs better when the Null ratio is high, and the Placeholder layout is better when the Null ratio is low or the data is serially correlated.
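To make the two layouts concrete, the following is a minimal Python sketch, not the paper's implementation. It reads "Compact" as the traditional design (a validity flag per row plus only the non-Null values, stored contiguously) and "Placeholder" as the newer design (one slot per row, with Null slots filled by some value). All function names are hypothetical, and the fill rule shown (repeat the previous non-Null value) is an illustrative assumption only. It is not SmartNull, whose algorithm the abstract does not describe.

```python
from typing import List, Optional, Tuple


def to_compact(column: List[Optional[int]]) -> Tuple[List[bool], List[int]]:
    """Compact layout: validity flags plus only the non-Null values."""
    validity = [v is not None for v in column]
    values = [v for v in column if v is not None]
    return validity, values


def from_compact(validity: List[bool], values: List[int]) -> List[Optional[int]]:
    """Expand a compact column back to one slot per row."""
    it = iter(values)
    return [next(it) if valid else None for valid in validity]


def to_placeholder(
    column: List[Optional[int]], default: int = 0
) -> Tuple[List[bool], List[int]]:
    """Placeholder layout: one slot per row, Null slots hold a fill value.

    Assumed fill rule (illustration only): repeat the previous non-Null
    value, or `default` if no non-Null value has been seen yet.
    """
    validity = [v is not None for v in column]
    filled: List[int] = []
    last = default
    for v in column:
        if v is not None:
            last = v
        filled.append(last)
    return validity, filled


def from_placeholder(validity: List[bool], filled: List[int]) -> List[Optional[int]]:
    """Mask the filled slots with the validity flags to recover the Nulls."""
    return [v if valid else None for valid, v in zip(validity, filled)]


column = [10, None, 12, None, None, 15]
c_validity, c_values = to_compact(column)
p_validity, p_filled = to_placeholder(column)
```

In this sketch the compact form stores three values for six rows, so it shrinks as the Null ratio grows, but reading row i requires counting the valid flags before it. The placeholder form keeps row i at slot i and, with this particular fill rule, leaves small or zero deltas between neighboring slots, which is the kind of property a delta-style encoding can exploit on serially correlated data.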


Published In

DaMoN '24: Proceedings of the 20th International Workshop on Data Management on New Hardware
June 2024
123 pages
ISBN:9798400706677
DOI:10.1145/3662010
This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '24

Acceptance Rates

DaMoN '24 Paper Acceptance Rate 14 of 25 submissions, 56%;
Overall Acceptance Rate 94 of 127 submissions, 74%
