DOI: 10.5555/1083592.1083632
Article

n-gram/2L: a space and time efficient two-level n-gram inverted index structure

Published: 30 August 2005

Abstract

The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Because of these advantages, it has been widely used in information retrieval and in similar-sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: the index tends to be very large, and query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the index size and improves query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy in position information that exists in the n-gram inverted index. It is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has two excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, and these improvements become more marked as the database grows; 2) the query processing time increases only very slightly as the query length grows. Experimental results using databases of 1 GByte show that the size of the n-gram/2L index is reduced by up to 1.9 to 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with the n-gram inverted index.
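
The two-step construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the in-memory dictionaries, and the choice to overlap consecutive subsequences by n - 1 characters (so that no n-gram straddling a boundary is lost) are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import defaultdict


def extract_subsequences(text, m, n):
    """Step 1: split text into length-m subsequences.

    Assumption of this sketch: consecutive subsequences overlap by n - 1
    characters, so every n-gram of the text lies inside exactly one of them.
    """
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= m")
    step = m - n + 1
    subsequences = []
    for offset in range(0, len(text), step):
        subsequences.append((offset, text[offset:offset + m]))
        if offset + m >= len(text):
            break  # this subsequence already reaches the end of the text
    return subsequences


def build_two_level_index(documents, m, n):
    """Build two hypothetical in-memory indexes.

    back_end:  subsequence -> [(doc_id, offset of subsequence in document)]
    front_end: n-gram      -> [(subsequence, offset of n-gram in subsequence)]
    """
    back_end = defaultdict(list)
    front_end = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for offset, subseq in extract_subsequences(text, m, n):
            if subseq not in back_end:
                # Step 2: extract n-grams once per distinct subsequence,
                # no matter how often that subsequence occurs.
                for i in range(len(subseq) - n + 1):
                    front_end[subseq[i:i + n]].append((subseq, i))
            back_end[subseq].append((doc_id, offset))
    return front_end, back_end


def ngram_positions(front_end, back_end, gram):
    """Recover (doc_id, offset) pairs for an n-gram by joining both levels."""
    positions = set()
    for subseq, i in front_end.get(gram, []):
        for doc_id, offset in back_end[subseq]:
            positions.add((doc_id, offset + i))
    return sorted(positions)


docs = ["ABCDABCD", "XABCDX"]
front_end, back_end = build_two_level_index(docs, m=4, n=2)
print(ngram_positions(front_end, back_end, "AB"))  # [(0, 0), (0, 4), (1, 1)]
```

In this sketch, a subsequence that occurs many times contributes its n-gram offsets to the front-end index only once, while each occurrence adds a single posting to the back-end index. This is one way to picture the removal of redundant position information that the abstract refers to.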

Published In

VLDB '05: Proceedings of the 31st international conference on Very large data bases
August 2005
1392 pages
ISBN:1595931546

Publisher

VLDB Endowment

Conference

ICMI05

Cited By

  • (2014) Extending string similarity join to tolerant fuzzy token matching. ACM Transactions on Database Systems 39:1 (1-45). DOI: 10.1145/2535628. Online publication date: 6-Jan-2014
  • (2013) FPI. Proceedings of the Joint EDBT/ICDT 2013 Workshops (397-403). DOI: 10.1145/2457317.2457390. Online publication date: 18-Mar-2013
  • (2012) A generic framework for efficient and effective subsequence retrieval. Proceedings of the VLDB Endowment 5:11 (1579-1590). DOI: 10.14778/2350229.2350271. Online publication date: 1-Jul-2012
  • (2012) WHAM. ACM Transactions on Database Systems 37:4 (1-39). DOI: 10.1145/2389241.2389247. Online publication date: 1-Dec-2012
  • (2012) Can we beat the prefix filtering? Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (85-96). DOI: 10.1145/2213836.2213847. Online publication date: 20-May-2012
  • (2012) Integration of a secure type-2 fuzzy ontology with a multi-agent platform. Information Sciences: an International Journal 198 (24-47). DOI: 10.1016/j.ins.2012.02.036. Online publication date: 1-Sep-2012
  • (2011) Faerie. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (529-540). DOI: 10.1145/1989323.1989379. Online publication date: 12-Jun-2011
  • (2011) WHAM. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (445-456). DOI: 10.1145/1989323.1989370. Online publication date: 12-Jun-2011
  • (2010) Simple and efficient algorithm for approximate dictionary matching. Proceedings of the 23rd International Conference on Computational Linguistics (851-859). DOI: 10.5555/1873781.1873877. Online publication date: 23-Aug-2010
  • (2010) Trie-join. Proceedings of the VLDB Endowment 3:1-2 (1219-1230). DOI: 10.14778/1920841.1920992. Online publication date: 1-Sep-2010