DOI: 10.1007/978-3-642-02008-7_9
Article

Storage and Retrieval of Individual Genomes

Published: 14 May 2009

Abstract

A repetitive sequence collection is one where portions of a <em>base sequence</em> of length <em>n</em> are repeated many times with small variations, forming a collection of total length <em>N</em>. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies <em>O</em>(<em>N</em> log <em>N</em>) bits, which very soon inhibits in-memory analyses. Recent advances in full-text <em>self-indexing</em> reduce the space of the suffix tree to <em>O</em>(<em>N</em> log <em>σ</em>) bits, where <em>σ</em> is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the human genome. However, this reduction factor remains constant when more sequences are added to the collection.
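As a rough check on the "more than 10-fold" figure, the two asymptotic bounds can be compared directly. The sketch below ignores the constants hidden in the O-notation and assumes a human genome of about 3 × 10^9 bases over a 4-letter DNA alphabet. Both numbers are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
import math

def suffix_tree_bits(N: int) -> float:
    """Leading term of the O(N log N) suffix tree bound, constants ignored."""
    return N * math.log2(N)

def self_index_bits(N: int, sigma: int) -> float:
    """Leading term of the O(N log sigma) self-index bound, constants ignored."""
    return N * math.log2(sigma)

# Assumed values: roughly one human genome over the alphabet {A, C, G, T}.
N = 3_000_000_000
sigma = 4

# The ratio of the two bounds simplifies to log N / log sigma.
factor = suffix_tree_bits(N) / self_index_bits(N, sigma)
print(f"reduction factor: about {factor:.1f}")
```

Under these assumptions the ratio comes out between 15 and 16, which is consistent with the abstract's claim. The ratio log N / log σ involves N only through a logarithm, so it changes little as copies are added, whereas the redundancy of the collection grows with N/n.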
We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length <em>n</em> of the base sequence and the number <em>s</em> of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on <em>N</em> /<em>n</em> .
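The quantities <em>n</em>, <em>s</em>, and <em>N</em> can be made concrete with a toy collection stored as a base sequence plus per-copy edit lists. This is only an illustration of the parameters, not the self-index developed in the paper: it supports no pattern search, it handles substitutions only, and the sequences and the names `Edit` and `apply_edits` are hypothetical.

```python
from typing import List, Tuple

# An edit is (position, new character); only substitutions are modelled here.
Edit = Tuple[int, str]

def apply_edits(base: str, edits: List[Edit]) -> str:
    """Reconstruct one copy of the base sequence from its list of substitutions."""
    seq = list(base)
    for pos, ch in edits:
        seq[pos] = ch
    return "".join(seq)

base = "ACGTACGTAC"                    # base sequence, n = 10
variants: List[List[Edit]] = [
    [(2, "T")],                        # copy 1: one substitution
    [(5, "A"), (9, "G")],              # copy 2: two substitutions
    [],                                # copy 3: identical to the base
]

collection = [apply_edits(base, edits) for edits in variants]

n = len(base)                               # length of the base sequence
s = sum(len(edits) for edits in variants)   # total number of variations
N = sum(len(copy) for copy in collection)   # total length of the collection

print(collection)
print(f"n = {n}, s = {s}, N = {N}, N/n = {N / n:.0f}")
```

Adding another copy increases <em>N</em> by <em>n</em>, but increases the stored information only by the size of that copy's edit list. That is the gap a structure whose space depends on <em>n</em> and <em>s</em> can exploit.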
We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in sequencing technologies.

Published In

RECOMB 2'09: Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology
May 2009
531 pages
ISBN: 9783642020070
Editor: Serafim Batzoglou

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Comparative genomics
  2. compressed data structures
  3. full-text indexing
  4. suffix tree

Qualifiers

  • Article

Cited By

  • (2019) Optimal construction of compressed indexes for highly repetitive texts. Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1344-1357. DOI: 10.5555/3310435.3310517. Online publication date: 6-Jan-2019
  • (2018) Optimal-time text indexing in BWT-runs bounded space. Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1459-1477. DOI: 10.5555/3174304.3175401. Online publication date: 7-Jan-2018
  • (2016) FM-index of alignment. Theoretical Computer Science 638:C, pp. 159-170. DOI: 10.1016/j.tcs.2015.08.008. Online publication date: 25-Jul-2016
  • (2016) CHICO. Proceedings of the 15th International Symposium on Experimental Algorithms - Volume 9685, pp. 326-338. DOI: 10.1007/978-3-319-38851-9_22. Online publication date: 5-Jun-2016
  • (2014) Optimized succinct data structures for massive data. Software—Practice & Experience 44:11, pp. 1287-1314. DOI: 10.1002/spe.2198. Online publication date: 1-Nov-2014
  • (2013) Suffix Array of Alignment. Proceedings of the 20th International Symposium on String Processing and Information Retrieval - Volume 8214, pp. 243-254. DOI: 10.1007/978-3-319-02432-5_27. Online publication date: 7-Oct-2013
  • (2012) Iterative Dictionary Construction for Compression of Large DNA Data Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 9:1, pp. 137-149. DOI: 10.1109/TCBB.2011.82. Online publication date: 1-Jan-2012
  • (2012) Improved grammar-based compressed indexes. Proceedings of the 19th international conference on String Processing and Information Retrieval, pp. 180-192. DOI: 10.1007/978-3-642-34109-0_19. Online publication date: 21-Oct-2012
  • (2010) Unified view of backward backtracking in short read mapping. Algorithms and Applications, pp. 182-195. DOI: 10.5555/2167962.2167975. Online publication date: 1-Jan-2010
  • (2010) Indexing similar DNA sequences. Proceedings of the 6th international conference on Algorithmic aspects in information and management, pp. 180-190. DOI: 10.5555/1880586.1880605. Online publication date: 19-Jul-2010