MorphStore: analytical query engine with a holistic compression-enabled processing model

Published: 01 July 2020

Abstract

In this paper, we present MorphStore, an open-source in-memory columnar analytical query engine with a novel holistic compression-enabled processing model. Lightweight integer compression algorithms already play an important role in existing in-memory column-store database systems, but mainly for base data. During query processing, these systems keep data compressed only until an operator cannot process the compressed data directly; at that point the data is decompressed and not recompressed. Thus, the full potential of compression during query processing is not exploited. To overcome this limitation, we developed the novel compression-enabled processing model presented in this paper. As we show, the continuous use of compression for all base data and all intermediates reduces the overall memory footprint and improves query performance.
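
The idea of keeping intermediates compressed can be illustrated with a minimal sketch. It is not MorphStore's actual implementation or API: the function names, the choice of run-length encoding for the base column, and the choice of delta encoding for the intermediate are all assumptions made for illustration. The select operator below reads a compressed column and emits its result (a position list) in compressed form again, rather than handing an uncompressed intermediate to the next operator.

```python
from typing import Iterable, Iterator, List, Tuple

Run = Tuple[int, int]  # (value, run_length); hypothetical RLE representation


def rle_compress(values: Iterable[int]) -> List[Run]:
    """Run-length encode a column of integers."""
    runs: List[Run] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decompress(runs: Iterable[Run]) -> Iterator[int]:
    """Expand runs back into the original sequence of values."""
    for value, length in runs:
        for _ in range(length):
            yield value


def delta_compress(sorted_positions: Iterable[int]) -> List[int]:
    """Store each position as the gap to its predecessor (first gap is from 0).

    The gaps are small numbers; a real system would additionally bit-pack
    them, which this sketch leaves out.
    """
    deltas: List[int] = []
    prev = 0
    for p in sorted_positions:
        deltas.append(p - prev)
        prev = p
    return deltas


def delta_decompress(deltas: Iterable[int]) -> List[int]:
    """Rebuild absolute positions from the gaps."""
    positions: List[int] = []
    total = 0
    for d in deltas:
        total += d
        positions.append(total)
    return positions


def select_eq(runs: Iterable[Run], needle: int) -> List[int]:
    """Toy select operator: compressed column in, compressed positions out."""
    def matching_positions() -> Iterator[int]:
        pos = 0
        for value, length in runs:
            if value == needle:
                # The whole run matches; no per-value decompression needed.
                yield from range(pos, pos + length)
            pos += length
    return delta_compress(matching_positions())


if __name__ == "__main__":
    column = [7, 7, 7, 3, 3, 7, 7, 9]
    runs = rle_compress(column)
    intermediate = select_eq(runs, 7)
    print(runs)
    print(intermediate)
    print(delta_decompress(intermediate))
```

In this sketch, both the base column and the operator's output stay in a compressed format, and a downstream operator would only expand the position list where it actually needs absolute positions.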

Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 12
August 2020
1710 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 July 2020
Published in PVLDB Volume 13, Issue 12

Qualifiers

  • Research-article

Cited By

  • (2024) LeCo: Lightweight Compression via Learning Serial Correlations. Proceedings of the ACM on Management of Data 2:1 (1-28). DOI: 10.1145/3639320. Online publication date: 26-Mar-2024
  • (2024) SIMDified Data Processing - Foundations, Abstraction, and Advanced Techniques. Companion of the 2024 International Conference on Management of Data (613-621). DOI: 10.1145/3626246.3654694. Online publication date: 9-Jun-2024
  • (2024) Amethyst - A Generalized on-the-Fly De/Re-compression Framework to Accelerate Data-Intensive Integer Operations on GPUs. Advances in Databases and Information Systems (107-120). DOI: 10.1007/978-3-031-70626-4_8. Online publication date: 28-Aug-2024
  • (2023) AWARE: Workload-aware, Redundancy-exploiting Linear Algebra. Proceedings of the ACM on Management of Data 1:1 (1-28). DOI: 10.1145/3588682. Online publication date: 30-May-2023
  • (2023) BOUNCE: memory-efficient SIMD approach for lightweight integer compression. Distributed and Parallel Databases 41:3 (439-466). DOI: 10.1007/s10619-023-07426-0. Online publication date: 10-May-2023
  • (2022) To use or not to use the SIMD gather instruction? Proceedings of the 18th International Workshop on Data Management on New Hardware (1-5). DOI: 10.1145/3533737.3535089. Online publication date: 12-Jun-2022
  • (2021) The Case for SIMDified Analytical Query Processing on GPUs. Proceedings of the 17th International Workshop on Data Management on New Hardware (1-5). DOI: 10.1145/3465998.3466015. Online publication date: 20-Jun-2021
  • (2021) SIMD-MIMD cocktail in a hybrid memory glass. Proceedings of the 14th ACM International Conference on Systems and Storage (1-12). DOI: 10.1145/3456727.3463782. Online publication date: 14-Jun-2021
  • (2021) Good to the Last Bit: Data-Driven Encoding with CodecDB. Proceedings of the 2021 International Conference on Management of Data (843-856). DOI: 10.1145/3448016.3457283. Online publication date: 9-Jun-2021
