research-article

Homomorphic Compression: Making Text Processing on Compression Unlimited

Authors:

Xiaoyong Du

Proceedings of the ACM on Management of Data, Volume 1, Issue 4

Article No.: 271, Pages 1 - 28

https://doi.org/10.1145/3626765

Published: 12 December 2023

Abstract

Lossless data compression is an effective way to reduce the transmission and storage overhead of massive text data, and its value grows as data volumes increase. Operating directly on compressed data makes text management more efficient, because text processing tasks, chiefly access-oriented ones, can run without decompression. Existing compressed text processing schemes, however, support only a limited set of operations, run inefficiently, and occupy considerable space. We address these problems by proposing a homomorphic compression theory, which generalizes and characterizes algorithms that can process data in compressed form. On this basis, we develop HOCO, an efficient text data management engine that supports a variety of processing tasks on compressed text. We select three representative compression schemes and implement them in HOCO together with their homomorphic operations. Through a modular, object-oriented design, HOCO can be extended with further homomorphic compression schemes, and it offers convenient interfaces for text processing tasks. We evaluate HOCO on six real-world datasets. The three schemes implemented in HOCO show trade-offs in compression ratio, supported operation types, and efficiency. Experiments also show that HOCO achieves higher throughput in random access and modification operations (on average 9.18× that of the state of the art) and lower latency in text analytics tasks (a 7.16× average improvement over processing uncompressed text) without compromising compression efficacy.
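To make the idea of processing without decompression concrete, the following is a minimal sketch that uses run-length encoding as a stand-in scheme. This choice is an assumption made for illustration; the abstract does not name the three schemes implemented in HOCO, and none of the function names below (`rle_compress`, `rle_char_at`, `rle_insert`) come from HOCO. The sketch shows a random access and an insertion that touch only the compressed runs, and the property that editing the compressed form gives the same result as compressing the edited text.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


def rle_compress(text):
    """Encode text as a list of (char, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def rle_decompress(runs):
    """Expand (char, run_length) pairs back into the original text."""
    return "".join(ch * n for ch, n in runs)


def rle_char_at(runs, index):
    """Random access: return the character at `index` without decompressing."""
    ends = list(accumulate(n for _, n in runs))  # exclusive end offset of each run
    total = ends[-1] if ends else 0
    if not 0 <= index < total:
        raise IndexError(index)
    # The first run whose end offset exceeds `index` contains that position.
    return runs[bisect_right(ends, index)][0]


def _merge(runs):
    """Drop empty runs and merge adjacent runs of the same character."""
    merged = []
    for ch, n in runs:
        if n == 0:
            continue
        if merged and merged[-1][0] == ch:
            merged[-1] = (ch, merged[-1][1] + n)
        else:
            merged.append((ch, n))
    return merged


def rle_insert(runs, index, ch):
    """Modification: insert one character at `index` by editing the runs only."""
    total = sum(n for _, n in runs)
    if not 0 <= index <= total:
        raise IndexError(index)
    out, pos, placed = [], 0, False
    for c, n in runs:
        if not placed and index < pos + n:
            left = index - pos  # characters of this run that stay before the insert
            out += [(c, left), (ch, 1), (c, n - left)]
            placed = True
        else:
            out.append((c, n))
        pos += n
    if not placed:  # index == total, so the character is appended at the end
        out.append((ch, 1))
    return _merge(out)


if __name__ == "__main__":
    text = "aaabccdddd"
    runs = rle_compress(text)
    edited = rle_insert(runs, 4, "x")
    # Editing the compressed form matches compressing the edited text.
    print(edited == rle_compress(text[:4] + "x" + text[4:]))
```

A real engine would keep an index over run offsets instead of recomputing prefix sums on every access; the linear scans here are kept only for brevity.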

Index Terms

Information systems → Data management systems → Data structures → Data layout → Data compression
Theory of computation → Design and analysis of algorithms → Data structures design and analysis → Data compression

Published In

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 4, December 2023, 1317 pages

EISSN: 2836-6573

DOI: 10.1145/3637468

Editor: Divyakant Agrawal, UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Beijing Nova Program
National Natural Science Foundation of China
