10.5555/1182635.1164201

How to wring a table dry: entropy compression of relations and querying of compressed relations

Published: 01 September 2006

Abstract

We present a method to compress relations close to their entropy while still allowing efficient queries. Column values are encoded into variable length codes to exploit skew in their frequencies. The codes in each tuple are concatenated and the resulting tuplecodes are sorted and delta-coded to exploit the lack of ordering in a relation. Correlation is exploited either by co-coding correlated columns, or by using a sort order that leverages the correlation. We prove that this method leads to near-optimal compression (within 4.3 bits/tuple of entropy), and in practice, we obtain up to a 40-fold compression ratio on vertical partitions tuned for TPC-H queries.

We also describe initial investigations into efficient querying over compressed data. We present a novel Huffman coding scheme, called segregated coding, that allows range and equality predicates on compressed data, without accessing the full dictionary. We also exploit the delta coding to speed up scans, by reusing computations performed on nearly identical records. Initial results from a prototype suggest that with these optimizations, we can efficiently scan, tokenize and apply predicates on compressed relations.
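
The pipeline in the abstract (per-column variable length codes, concatenation into tuplecodes, sorting, delta coding) can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes plain Huffman codes per column rather than segregated coding, pads tuplecodes with zero bits to a fixed width, and leaves the deltas as raw integers instead of entropy-coding them. The names `huffman_code`, `compress`, and `decompress` and the sample rows are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(values):
    """Build a prefix code (value -> bit string) from value frequencies."""
    freq = Counter(values)
    if len(freq) == 1:  # degenerate column with a single distinct value
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {v: ""}) for v, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {v: "0" + b for v, b in c1.items()}
        merged.update({v: "1" + b for v, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def compress(rows):
    """Return (per-column codes, tuplecode width, sorted-and-delta-coded keys)."""
    ncols = len(rows[0])
    codes = [huffman_code([r[c] for r in rows]) for c in range(ncols)]
    # Concatenate the column codes of each row into one tuplecode.
    tuplecodes = ["".join(codes[c][r[c]] for c in range(ncols)) for r in rows]
    # Assumption of this sketch: pad with zero bits to a common width.
    width = max(len(t) for t in tuplecodes)
    keys = sorted(int(t.ljust(width, "0"), 2) for t in tuplecodes)
    # Row order carries no information in a relation, so store only the gaps.
    deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
    return codes, width, deltas


def decompress(codes, width, deltas):
    """Rebuild the rows (in sorted tuplecode order, not the input order)."""
    decoders = [{bits: v for v, bits in code.items()} for code in codes]
    rows, key = [], 0
    for d in deltas:
        key += d  # undo the delta coding with a running sum
        bits = format(key, "0{}b".format(width))
        row, pos = [], 0
        for dec in decoders:  # prefix codes decode unambiguously left to right
            end = pos + 1
            while bits[pos:end] not in dec:
                end += 1
            row.append(dec[bits[pos:end]])
            pos = end
        rows.append(tuple(row))
    return rows


# Hypothetical sample data: skewed columns with duplicate rows.
rows = [("US", "gold"), ("US", "gold"), ("US", "silver"),
        ("DE", "gold"), ("US", "gold")]
codes, width, deltas = compress(rows)
restored = decompress(codes, width, deltas)
```

Frequent values receive short codes, and duplicate or nearly identical rows produce zero or small deltas after sorting, which is where a subsequent entropy coder for the deltas would gain. The round trip returns the same multiset of rows, but not the original row order.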

Published In

VLDB '06: Proceedings of the 32nd international conference on Very large data bases
September 2006
1269 pages

Sponsors

  • SIGMOD: ACM Special Interest Group on Management of Data
  • K.I.S.S. SIG on Databases
  • AJU Information Technology Co., Ltd
  • US Army ITC-PAC Asian Research Office
  • Google Inc.
  • The Database Society of Japan
  • Samsung SOS
  • Advanced Information Technology Research Center
  • Naver
  • Microsoft
  • Korea Info Sci Society: Korea Information Science Society
  • SK telecom
  • Systems Applications Products
  • ORACLE
  • International Business Management
  • Air Force Office of Scientific Research/Asian Office of Aerospace R&D
  • Kosef
  • Kaist
  • LG Electronics
  • CCF-DBS

Publisher

VLDB Endowment

Qualifiers

  • Article

Article Metrics

  • Downloads (Last 12 months): 18
  • Downloads (Last 6 weeks): 2

Reflects downloads up to 23 Dec 2024

Cited By

  • (2024) LeCo: Lightweight Compression via Learning Serial Correlations. Proceedings of the ACM on Management of Data 2:1 (1-28). 10.1145/3639320. Online publication date: 26-Mar-2024
  • (2018) Compressed linear algebra for large-scale machine learning. The VLDB Journal — The International Journal on Very Large Data Bases 27:5 (719-744). 10.1007/s00778-017-0478-1. Online publication date: 1-Oct-2018
  • (2016) Compressed linear algebra for large-scale machine learning. Proceedings of the VLDB Endowment 9:12 (960-971). 10.14778/2994509.2994515. Online publication date: 1-Aug-2016
  • (2016) Squish. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1575-1584). 10.1145/2939672.2939867. Online publication date: 13-Aug-2016
  • (2015) Efficient Lightweight Compression Alongside Fast Scans. Proceedings of the 11th International Workshop on Data Management on New Hardware (1-6). 10.1145/2771937.2771943. Online publication date: 31-May-2015
  • (2014) Distributed data management using MapReduce. ACM Computing Surveys 46:3 (1-42). 10.1145/2503009. Online publication date: 1-Jan-2014
  • (2013) Query-aware compression of join results. Proceedings of the 16th International Conference on Extending Database Technology (29-40). 10.1145/2452376.2452381. Online publication date: 18-Mar-2013
  • (2010) Speeding up queries in column stores. Proceedings of the 12th international conference on Data warehousing and knowledge discovery (117-129). 10.5555/1881923.1881936. Online publication date: 30-Aug-2010
  • (2010) Cheetah. Proceedings of the VLDB Endowment 3:1-2 (1459-1468). 10.14778/1920841.1921020. Online publication date: 1-Sep-2010
  • (2010) Changing base without losing space. Proceedings of the forty-second ACM symposium on Theory of computing (593-602). 10.1145/1806689.1806771. Online publication date: 5-Jun-2010
