Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

Advertisement

Accelerating sequence searching: dimensionality reduction method

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Similarity search over long sequence dataset becomes increasingly popular in many emerging applications, such as text retrieval, genetic sequences exploring, etc. In this paper, a novel index structure, namely Sequence Embedding Multiset tree (SEM − tree), has been proposed to speed up the searching process over long sequences. The SEM-tree is a multi-level structure where each level represents the sequence data with different compression level of multiset, and the length of multiset increases towards the leaf level which contains original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order in the sequence, i.e. shorter representation, but can reserve the majority of distance information of sequences. Each level of the tree serves to prune the search space more efficiently as the multisets utilize the predicability to finish the searching process beforehand and reduce the computational cost greatly. A set of comprehensive experiments are conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL (2000) GenBank. Nucleic Acids Res 28(1): 15–18

    Article  Google Scholar 

  2. Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: Proc. 24th VLDB Conference (VLDB’97), pp 194–205

  3. Cormen TH, Leiserson CE, Rivest RL (1990) Introduction to algorithms. MIT Press, Cambridge

    MATH  Google Scholar 

  4. Cormode G, Muthukrishnan S (2002) The string edit distance matching problem with moves. In: Proc. of the 13th annual ACM-SIAM symposium on Discrete Algorithms, pp 667–676

  5. DanTong Yu, Zhang A (2003) ClusterTree: integration of cluster representation and nearest-neighbor search for large data sets with high dimensions. IEEE Trans Knowl Data Eng 15(5): 1316–1337

    Article  Google Scholar 

  6. Faloutsos C, Lin KI (1995) Fast Map: a fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. In: Proc. of the International Conference on Management of Data (SIGMOD’95), pp 163–174

  7. Filho RFS, Traina AJM, Traina C, Faloutsos C (2001) Similarity search without tears: the OMNI family of all-purpose access methods. In: Proc. of the 19th International Conference on Data Engineering (ICDE’01), pp 623–630

  8. Golub GH, Van Loan CF (1989) Matrix computations. The Johns Hopkins University Press, Baltimore

    MATH  Google Scholar 

  9. Gravano L, Ipeirotis PG, Jagadish HV, Tolga Bozkaya N, Ozsoyoglu M (1997) Distance-based indexing for high-dimensional metric spaces. In: Proc. of the International Conference on Management of Data (SIGMOD’97), pp 357–368

  10. Gusfield D (1997) Algorithms on sequences, trees and sequences: computer science and computational biology. Cambridge University Press, Cambridge

    Google Scholar 

  11. Jolliffe IT (1986) Principal component analysis. Springer, Berlin

    Google Scholar 

  12. Kadiyala S, Shiri N (2007) A compact multi-resolution index for variable length queries in time series databases. Knowl Info Syst 15: 131–147

    Article  Google Scholar 

  13. Karp RM, Miller RE, Rosenberg AL (1972) Rapid identification of repeated patterns in strings, trees and arrays. In: Proc. of the 4th Symposium on Theory of Computing, pp 125–136

  14. Keogh E, Ann RC (2005) Exact indexing of dynamic time warping. Knowl Info Syst 7(3): 358–386

    Article  Google Scholar 

  15. Keogh E, Chakrabarti K, Mehrotra S, Pazzani M (2001) Dimensionality reduction for fast similarity search in large databases. Knowl Info Syst 3(3): 263–286

    Article  MATH  Google Scholar 

  16. Lee S, Chun S, Kim D (2000) Similarity search for multidimensional data sequences. In: Proc. of the 16th Int’l Conf. on Data Engineering. IEEE Computer Society, Washington, pp 599–608

  17. Li M, Badger JH, Xin C, Kwong S, Kearney P, Ferragina HP, Grossi R (1999) The string B-tree: a new data structure for string search in external memory and its applications. JACM 46(2): 236–280

    Article  Google Scholar 

  18. McCreight EM (1976) A space-economical suffix tree construction algorithm. JACM 23(2): 262–272

    Article  MATH  MathSciNet  Google Scholar 

  19. Muthukrishnan S, Ahinalp SCS (2000) Approximate nearest neighbors and sequence comparison with block operations. In: Proceedings of the 32nd Symposium on Theory of Computing, pp 416–424

  20. Navarro G (2000) A guided tour to approximate sequence matching. ACM Comput Survey 33(1): 31–88

    Article  Google Scholar 

  21. Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discret Algorithms 1(1): 205–239

    MathSciNet  Google Scholar 

  22. Pearson W, Lipman D (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85: 2444–2488

    Article  Google Scholar 

  23. Popivanov I, Miller RJ (2002) Similarity search over time series data using wavelets. In: Proc. of the 18th Int’l Conf. on Data Engineering, ICDE 2002. IEEE Computer Society, San Jose, pp 212–221

  24. Sahinalp SC, Tasan M, Macker J, Ozsoyoglu ZM (2003) Distance-based indexing for string proximity search. In: Proc. of the 19th International Conference on Data Engineering (ICDE’03), pp 125–136

  25. Song G, Cui B, Zheng B, Xie K, Yang D (2008) Squeezing long sequence data for efficient similarity search. In: Proc. Apweb, 2008, pp 438–449

  26. Traina C, Traina AJM, Seeger B, Faloutsos C (2000) Slim-Trees: high performance metric trees minimizing overlap between nodes. In: proc. of the 7th International Conference on Extended Database Technology (EDBT’00), pp 51–65

  27. Venkateswaran J, Lachwani D, Kahveci T, Jermaine CM (2006) Reference-based indexing of sequence databases. In: Proc. 24th VLDB Conference (VLDB’06), pp 906–917

  28. Vieira MR, Traina C, Chino FJT, Traina AJM (2004) DBM-Tree: a dynamic metric access method sensitive to local density data. Simposio Brasileiro de Bancos de Dados (SBBD’04), pp 163–177

  29. Wang TL, Wang X, Lin KI, Shasha D, Shapiro B, Zhang K (1999) Evaluating a class of distance-mapping algorithms for data mining and clustering. In: Proc. of the 5th ACM International Conference of Knowledge Discovery and Data Mining (SIGKDD’99), pp 307–311

  30. Weiner P (1973) Linear pattern matching algorithms. In: IEEE Symposium on Switching and Automata Theory, pp 1–11

  31. Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: SODA, pp 311–321

  32. Zhang Z, Schwartz S, Wagner L, Miller W (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7: 203–214

    Article  Google Scholar 

  33. http://www-db.stanford.edu/pleiades/SUMATRA.html

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Guojie Song.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Song, G., Cui, B., Zheng, B. et al. Accelerating sequence searching: dimensionality reduction method. Knowl Inf Syst 20, 301–322 (2009). https://doi.org/10.1007/s10115-008-0180-0

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-008-0180-0

Keywords