An Empirical Evaluation of Columnar Storage Formats

Authors:

Huanchen Zhang

Proceedings of the VLDB Endowment, Volume 17, Issue 2

Pages 148 - 161

https://doi.org/10.14778/3626292.3626298

Published: 01 October 2023

Abstract

Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support for open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed.

In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.
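To make the first of the design decisions listed above concrete, the following is a minimal sketch of dictionary encoding on a single column. It is an illustration only and does not reproduce the on-disk layout of Parquet or ORC; the function names (`dictionary_encode`, `dictionary_decode`, `bits_per_code`) and the sample column are hypothetical and do not come from the paper.

```python
def dictionary_encode(values):
    """Map each value to a small integer code.

    Returns (dictionary, codes): the distinct values in first-seen order
    and one integer code per input row.
    """
    index = {}        # value -> code
    dictionary = []   # code -> value
    codes = []
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[c] for c in codes]


def bits_per_code(dictionary):
    """Smallest fixed bit width that can hold every code (0 if empty)."""
    if not dictionary:
        return 0
    return max(1, (len(dictionary) - 1).bit_length())


# Hypothetical low-cardinality string column.
column = ["US", "DE", "US", "FR", "DE", "US"]
dictionary, codes = dictionary_encode(column)

print(dictionary)
print(codes)
print(bits_per_code(dictionary))
```

In this sketch the repeated strings are stored once in the dictionary, and each row is reduced to a small integer that a format could then pass to an integer encoding step such as bit-packing.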

Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 2

October 2023

185 pages

ISSN: 2150-8097

Editors: Meihui Zhang (Beijing Institute of Technology), Cyrus Shahabi (University of Southern California)

Publisher: VLDB Endowment

Badges

Artifacts Available / v1.1

Qualifiers

Research-article
