Article

n-gram/2L: a space and time efficient two-level n-gram inverted index structure

Authors:

Kyu-Young Whang,

Min-Jae Lee

VLDB '05: Proceedings of the 31st international conference on Very large data bases

Pages 325 - 336

Published: 30 August 2005

Abstract

The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Because of these advantages, it has been widely used in information retrieval and in similar-sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: its size tends to be very large, and its query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the index size and improves query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of the position information that exists in the n-gram inverted index. It is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, with these improvements becoming more marked as the database size gets larger; 2) the query processing time increases only very slightly as the query length gets longer. Experimental results using databases of 1 GByte show that the size of the n-gram/2L index is reduced by up to 1.9 to 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram inverted index.
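
The two-step construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`build_two_level_index`, `ngram_positions`), the in-memory dictionaries, and the choice to let consecutive subsequences overlap by n - 1 characters (so that no n-gram spanning a boundary is lost) are assumptions made for illustration; the abstract itself only states the two steps.

```python
from collections import defaultdict


def build_two_level_index(docs, m, n):
    """Illustrative two-level n-gram index (hypothetical helper).

    Step 1: cut each document into subsequences of length m.
    Step 2: extract n-grams from the distinct subsequences.
    """
    if not 1 <= n <= m:
        raise ValueError("require 1 <= n <= m")
    # Assumption: neighbouring subsequences overlap by n - 1 characters,
    # so every n-gram of a document lies inside some subsequence.
    step = m - n + 1
    sub_ids = {}                   # subsequence text -> subsequence id
    back_end = defaultdict(list)   # subsequence id -> [(doc id, offset in doc)]
    for doc_id, text in enumerate(docs):
        for off in range(0, len(text) - n + 1, step):
            sub = text[off:off + m]  # the last subsequence may be shorter than m
            sid = sub_ids.setdefault(sub, len(sub_ids))
            back_end[sid].append((doc_id, off))
    front_end = defaultdict(list)  # n-gram -> [(subsequence id, offset in subsequence)]
    for sub, sid in sub_ids.items():
        for off in range(len(sub) - n + 1):
            front_end[sub[off:off + n]].append((sid, off))
    return sub_ids, dict(back_end), dict(front_end)


def ngram_positions(gram, back_end, front_end):
    """Recover (doc id, offset) pairs of an n-gram by joining both levels."""
    return sorted(
        (doc_id, doc_off + sub_off)
        for sid, sub_off in front_end.get(gram, [])
        for doc_id, doc_off in back_end[sid]
    )


docs = ["ABCDABCD", "ABCD"]
sub_ids, back_end, front_end = build_two_level_index(docs, m=4, n=2)
print(ngram_positions("AB", back_end, front_end))
```

In this sketch the subsequence "ABCD" occurs in both documents but its n-grams are stored only once in the front-end level; the back-end level records where the subsequence itself occurs. That is the kind of repeated position information the abstract says the index eliminates.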

Index Terms

n-gram/2L: a space and time efficient two-level n-gram inverted index structure
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information retrieval
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

VLDB '05: Proceedings of the 31st international conference on Very large data bases

August 2005

1392 pages

ISBN:1595931546

General Chair:
Kjell Bratbergsengen
Norwegian University of Science & Technology, Trondheim, Norway

Publisher

VLDB Endowment

Qualifiers

Article

Conference

ICMI05

ICMI05: Seventh International Conference on Multimodal Interfaces 2005

August 30 - September 2, 2005

Trondheim, Norway
