DOI: 10.1145/1353343.1353407

The SBC-tree: an index for run-length compressed sequences

Published: 25 March 2008

Abstract

Run-length encoding (RLE) is a data compression technique used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + |p| + T/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. The SBC-tree is also dynamic and supports insert and delete operations efficiently. The insertion and deletion of all suffixes of a compressed sequence of length m take O(m log_B(N + m)) amortized I/O operations. The SBC-tree index is realized inside PostgreSQL. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, while retaining the optimal search performance achieved by the String B-tree over the uncompressed sequences.
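
To make the idea of searching compressed data without decompressing it concrete, here is a minimal Python sketch. It is not the SBC-tree and does not reproduce the paper's algorithms or I/O bounds: it is a plain linear scan over runs, and the names `rle_encode` and `rle_find` are hypothetical helpers introduced only for illustration. It assumes both text and pattern are in canonical RLE form (adjacent runs carry different symbols).

```python
from itertools import groupby


def rle_encode(s):
    """Run-length encode a string into a list of (symbol, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_find(text_runs, pattern_runs):
    """Return the uncompressed start offsets of every pattern occurrence.

    Only runs are compared; neither input is decompressed.
    Illustrative linear scan (an assumption of this sketch), not an index.
    """
    # Uncompressed offset at which each text run begins.
    starts, pos = [], 0
    for _, length in text_runs:
        starts.append(pos)
        pos += length

    m = len(pattern_runs)
    if m == 0:
        return []

    matches = []
    if m == 1:
        # A single-run pattern can occur several times inside one long run.
        sym, k = pattern_runs[0]
        for (tsym, tlen), start in zip(text_runs, starts):
            if tsym == sym and tlen >= k:
                matches.extend(range(start, start + tlen - k + 1))
        return matches

    first_sym, first_len = pattern_runs[0]
    last_sym, last_len = pattern_runs[-1]
    inner = pattern_runs[1:-1]
    for i in range(len(text_runs) - m + 1):
        tsym, tlen = text_runs[i]
        lsym, llen = text_runs[i + m - 1]
        if (tsym == first_sym and tlen >= first_len          # pattern's first run is a suffix of this run
                and text_runs[i + 1:i + m - 1] == inner      # inner runs must match exactly
                and lsym == last_sym and llen >= last_len):  # pattern's last run is a prefix of that run
            matches.append(starts[i] + tlen - first_len)
    return matches


text = rle_encode("aaabbbbcc")              # three runs instead of nine symbols
hits = rle_find(text, rle_encode("abbbbc"))
```

The key observation the sketch relies on is that an occurrence may start in the middle of a text run and end in the middle of another, so only the first and last pattern runs are compared with "at least as long", while every inner run must match symbol and length exactly.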

References

[1]
A. Amir, G. Benson, and M. Farach. Let sleeping files lie: pattern matching in Z-compressed files. In SODA, pages 705--714, 1994.
[2]
A. Amir, G. Benson, and M. Farach. Optimal two-dimensional compressed matching. In ICALP, pages 215--226, 1994.
[3]
A. Amir, G. M. Landau, and D. Sokol. Inplace run-length 2d compressed search. In SODA, pages 817--818, 2000.
[4]
A. Amir, G. M. Landau, and U. Vishkin. Efficient pattern matching with scaling. Journal of Algorithms, 13(1):2--32, 1992.
[5]
A. Apostolico, G. M. Landau, and S. Skiena. Matching for run-length encoded strings. Journal of Complexity, 15(1):4--16, 1999.
[6]
O. Arbell, G. M. Landau, and J. S. Mitchell. Edit distance of run-length encoded strings. Information Processing Letters, 83(6):307--314, 2002.
[7]
L. Arge, V. Samoladas, and J. S. Vitter. On two-dimensional indexability and optimal range search indexing. In PODS, pages 346--357, 1999.
[8]
R. Bayer and E. M. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 1:173--189, 1972.
[9]
R. Bayer and K. Unterauer. Prefix B-trees. TODS, 2(1):11--26, 1977.
[10]
B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. An asymptotically optimal multiversion B-tree. VLDB Journal, 5(4):264--275, 1996.
[11]
S. J. Bedathur and J. R. Haritsa. Engineering a fast online persistent suffix tree construction. In ICDE, pages 720--731, 2004.
[12]
T. Bell, M. Powell, A. Mukherjee, and D. Adjeroh. Searching BWT compressed text with the Boyer-Moore algorithm and binary search. In DCC, pages 112--121, 2002.
[13]
H. Bunke and J. Csirik. Edit distance of run-length coded strings. In Symposium on Applied computing, pages 137--143, 1992.
[14]
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, 1994.
[15]
Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter. Compressed text indexing and range searching. Technical Report CSD TR06-021, Purdue University, December 2006.
[16]
D. Comer. Ubiquitous B-tree. ACM Computing Surveys, 11(2):121--137, 1979.
[17]
P. Dietz and D. Sleator. Two algorithms for maintaining order in a list. In STOC, pages 365--372, 1987.
[18]
P. Ferragina and R. Grossi. The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236--280, 1999.
[19]
P. Ferragina and G. Manzini. Opportunistic data structures with applications. In FOCS, pages 390--398, 2000.
[20]
W. B. Frakes and R. B. Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
[21]
E. Fredkin. Trie memory. Communications of the ACM, 3(9):490--499, 1960.
[22]
V. Freschi and A. Bogliolo. Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism. Information Processing Letters, 90(4):167--173, 2004.
[23]
S. W. Golomb. Run-length encodings. Trans. on Information Theory, 12:399--401, 1966.
[24]
R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In SODA, pages 841--850, 2003.
[25]
R. Grossi, A. Gupta, and J. S. Vitter. When indexing equals compression: experiments with compressing suffix arrays and applications. In SODA, pages 636--645, 2004.
[26]
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
[27]
E. Hunt, M. P. Atkinson, and R. W. Irving. A database index to large biological sequences. In VLDB, pages 139--148, 2001.
[28]
R. W. Irving and L. Love. The suffix binary search tree and suffix AVL tree. JDA, 1(5--6):387--408, 2003.
[29]
V. Makinen and G. Navarro. Dynamic entropy-compressed sequences and full-text indexes. In CPM, pages 306--317, 2006.
[30]
V. Makinen, G. Navarro, and E. Ukkonen. Approximate matching of run-length compressed strings. In CPM, pages 31--49, 2001.
[31]
U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal, 22(5):935--948, 1993.
[32]
E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262--272, 1976.
[33]
E. M. McCreight. Priority search trees. SIAM Journal, 14(2):257--276, 1985.
[34]
A. Moffat. Implementing the PPM data compression scheme. Trans. on Communications, 38(11):1917--1921, 1990.
[35]
D. R. Morrison. Patricia: Practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514--534, 1968.
[36]
G. Navarro. Regular expression searching on compressed text. JDA, 1(5--6):423--443, 2003.
[37]
M. Patrascu and E. D. Demaine. Tight bounds for the partial-sums problem. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms (SODA), pages 20--29, 2004.
[38]
N. S. Prywes and H. J. Gray. The organization of a multilist-type associative memory. In Transactions on Communication and Electronics, pages 488--492, 1963.
[39]
P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages 232--239, 1995.
[40]
Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. Pattern matching in text compressed by using antidictionaries. In CPM, pages 37--49, 1999.
[41]
M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A column-oriented DBMS. In VLDB, 2005.
[42]
H. Tanaka and A. L. Garcia. Efficient run-length encodings. Trans. on Information Theory, 28(6):880--889, 1982.
[43]
S. Tata, R. A. Hankins, and J. M. Patel. Practical suffix tree construction. In VLDB, pages 36--47, 2004.
[44]
T. E. Tzoreff. Matching patterns in strings subject to multi-linear transformations. TCS, 60(3):231--254, 1988.
[45]
P. J. Varman and R. M. Verma. An efficient multiversion access structure. TKDE, 9(3):391--409, 1997.
[46]
J. S. Vitter. External memory algorithms and data structures: Dealing with MASSIVE DATA. ACM Computing Surveys, 33(2):209--271, 2001.
[47]
P. Weiner. Linear pattern matching algorithms. In Symposium on Switching and Automata Theory, pages 1--11, 1973.
[48]
J. Ziv and A. Lempel. A universal algorithm for sequential data compression. Trans. on Information Theory, 23(3):337--343, 1977.
[49]
J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. Trans. on Information Theory, 24(5):530--536, 1978.

Published In

ACM Other conferences
EDBT '08: Proceedings of the 11th international conference on Extending database technology: Advances in database technology
March 2008
762 pages
ISBN:9781595939265
DOI:10.1145/1353343
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

EDBT '08

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Article Metrics

  • Downloads (last 12 months): 52
  • Downloads (last 6 weeks): 6
Reflects downloads up to 24 Dec 2024.

Cited By

  • (2015) An Opportunistic Text Indexing Structure Based on Run Length Encoding. Proceedings of the 9th International Conference on Algorithms and Complexity - Volume 9079, pages 390-402. DOI: 10.1007/978-3-319-18173-8_29. Online publication date: 20-May-2015
  • (2012) Fast algorithms for computing the constrained LCS of run-length encoded strings. Theoretical Computer Science, 432, pages 1-9. DOI: 10.1016/j.tcs.2012.01.038. Online publication date: 1-May-2012
  • (2011) Compressed indexes for aligned pattern matching. Proceedings of the 18th international conference on String processing and information retrieval, pages 410-419. DOI: 10.5555/2051073.2051113. Online publication date: 17-Oct-2011
  • (2011) Reordering columns for smaller indexes. Information Sciences: an International Journal, 181:12, pages 2550-2570. DOI: 10.1016/j.ins.2011.02.002. Online publication date: 1-Jun-2011
  • (2011) Compressed Indexes for Aligned Pattern Matching. String Processing and Information Retrieval, pages 410-419. DOI: 10.1007/978-3-642-24583-1_40. Online publication date: 2011
  • (2010) A database server for next-generation scientific data management. 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), pages 313-316. DOI: 10.1109/ICDEW.2010.5452723. Online publication date: Mar-2010
  • (2010) Efficient indexing algorithms for one-dimensional discretely-scaled strings. Information Processing Letters, 110:16, pages 730-734. DOI: 10.1016/j.ipl.2010.05.012. Online publication date: 1-Jul-2010
  • (2008) Algorithms and data structures for external memory. Foundations and Trends® in Theoretical Computer Science, 2:4, pages 305-474. DOI: 10.1561/0400000014. Online publication date: 1-Jan-2008
