research-article

MorphStore: analytical query engine with a holistic compression-enabled processing model

Authors: Annett Ungethüm, Johannes Pietrzyk, Alexander Krause, Wolfgang Lehner

Proceedings of the VLDB Endowment, Volume 13, Issue 12

Pages 2396 - 2410

https://doi.org/10.14778/3407790.3407833

Published: 01 July 2020

Abstract

In this paper, we present MorphStore, an open-source in-memory columnar analytical query engine with a novel holistic compression-enabled processing model. Lightweight integer compression algorithms already play an important role in existing in-memory column-store database systems, but mainly for base data. During query processing, these systems keep the data compressed only until an operator cannot process the compressed data directly; at that point the data is decompressed, but not recompressed. Thus, the full potential of compression during query processing is not exploited. To overcome this limitation, we developed the novel compression-enabled processing model presented in this paper. As we show, the continuous use of compression for all base data and all intermediates is very beneficial both for reducing the overall memory footprint and for improving query performance.
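The idea of keeping intermediates compressed can be illustrated with a minimal Python sketch. It is not MorphStore's implementation or API: the function names (`rle_compress`, `select_equal`, `delta_decompress`) are hypothetical, and the two formats (run-length encoding for the base column, delta encoding for the intermediate position list) are assumptions chosen only for illustration. The selection operator reads a compressed column and emits a compressed intermediate, so the uncompressed column is never materialized.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (value, run length)


def rle_compress(values: List[int]) -> List[Run]:
    """Run-length encode a column (illustrative base-data format)."""
    runs: List[Run] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def select_equal(runs: List[Run], needle: int) -> List[int]:
    """Select positions where the column equals `needle`.

    Input:  run-length encoded column.
    Output: delta-encoded position list (each entry is the gap to the
            previous match; the first entry is the gap to position 0).
    Small gaps could later be bit-packed; that step is omitted here.
    """
    deltas: List[int] = []
    pos = 0          # position of the first element of the current run
    prev_match = 0   # last emitted position (0 before any match)
    for value, length in runs:
        if value == needle:
            for p in range(pos, pos + length):
                deltas.append(p - prev_match)
                prev_match = p
        pos += length
    return deltas


def delta_decompress(deltas: List[int]) -> List[int]:
    """Recover absolute positions from a delta-encoded position list."""
    positions: List[int] = []
    total = 0
    for d in deltas:
        total += d
        positions.append(total)
    return positions


if __name__ == "__main__":
    column = [7, 7, 7, 3, 3, 7, 7, 5]
    runs = rle_compress(column)
    deltas = select_equal(runs, 7)
    print(runs)
    print(deltas)
    print(delta_decompress(deltas))
```

In this sketch, a downstream operator would consume `deltas` directly; `delta_decompress` is needed only when the final result is returned to the user.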

Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 12

August 2020

1710 pages

ISSN: 2150-8097

Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)

Publisher: VLDB Endowment
