DOI: 10.1145/800057.808700

Building a complete inverted file for a set of text files in linear time

Published: 01 December 1984

Abstract

Given a finite set of texts S = {ω1, ..., ωk} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(ω), which returns the longest prefix of ω that occurs in S; freq(ω), which returns the number of times ω occurs in S; and locations(ω), which returns the set of positions at which ω occurs. We give a data structure to implement a complete inverted file for S which occupies linear space and can be built in linear time, using the uniform cost RAM model. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, we use techniques from the theory of finite automata to build a deterministic finite automaton which recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions.
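
The three query functions can be illustrated with a minimal Python sketch built on a suffix automaton, a deterministic automaton that accepts the subwords of its input. This is not the authors' construction: the class name `SubwordAutomaton`, the trick of joining the texts with unique separator symbols, and the sort used to order states are assumptions made for illustration, and queries are assumed to be non-empty strings.

```python
class SubwordAutomaton:
    """Sketch of find/freq/locations over a set of texts (hypothetical names)."""

    def __init__(self, texts):
        # Parallel arrays, one entry per state; state 0 is the initial state.
        self.trans = [{}]    # symbol -> target state
        self.link = [-1]     # suffix link
        self.length = [0]    # length of the longest subword reaching the state
        self.count = [0]     # occurrence count, completed in _finish()
        self.end = [None]    # (text index, end offset) for non-clone states
        last = 0
        for i, text in enumerate(texts):
            for j, ch in enumerate(text):
                last = self._extend(last, ch, (i, j + 1))
            # Assumption: a unique tuple separator stops subwords spanning texts.
            last = self._extend(last, ("sep", i), None)
        self._finish()

    def _new_state(self, length, link, trans, count, end):
        self.length.append(length)
        self.link.append(link)
        self.trans.append(trans)
        self.count.append(count)
        self.end.append(end)
        return len(self.trans) - 1

    def _extend(self, last, sym, end):
        cur = self._new_state(self.length[last] + 1, -1, {}, 1, end)
        p = last
        while p != -1 and sym not in self.trans[p]:
            self.trans[p][sym] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
            return cur
        q = self.trans[p][sym]
        if self.length[p] + 1 == self.length[q]:
            self.link[cur] = q
            return cur
        # Split q: the clone takes over the shorter subwords that reach q.
        clone = self._new_state(self.length[p] + 1, self.link[q],
                                dict(self.trans[q]), 0, None)
        while p != -1 and self.trans[p].get(sym) == q:
            self.trans[p][sym] = clone
            p = self.link[p]
        self.link[q] = clone
        self.link[cur] = clone
        return cur

    def _finish(self):
        # Longest states first, so each count is complete before it is passed up.
        # sorted() is O(n log n); a counting sort would keep this step linear.
        order = sorted(range(1, len(self.trans)),
                       key=lambda v: self.length[v], reverse=True)
        self.children = [[] for _ in self.trans]
        for v in order:
            self.count[self.link[v]] += self.count[v]
            self.children[self.link[v]].append(v)

    def _state_of(self, word):
        state = 0
        for ch in word:
            state = self.trans[state].get(ch)
            if state is None:
                return None
        return state

    def find(self, word):
        """Longest prefix of word that occurs in some text."""
        state, k = 0, 0
        for ch in word:
            if ch not in self.trans[state]:
                break
            state = self.trans[state][ch]
            k += 1
        return word[:k]

    def freq(self, word):
        """Number of occurrences of word across all texts."""
        state = self._state_of(word)
        return 0 if state is None else self.count[state]

    def locations(self, word):
        """Set of (text index, start offset) pairs at which word occurs."""
        state = self._state_of(word)
        if state is None:
            return set()
        found, stack = set(), [state]
        while stack:  # walk the suffix-link subtree below the state
            v = stack.pop()
            if self.end[v] is not None:
                i, e = self.end[v]
                found.add((i, e - len(word)))
            stack.extend(self.children[v])
        return found


index = SubwordAutomaton(["abab", "bab"])
print(index.find("abba"))             # ab
print(index.freq("ab"))               # 3
print(sorted(index.locations("ab")))  # [(0, 0), (0, 2), (1, 1)]
```

In this sketch `find` and `freq` take time proportional to the length of the query, and `locations` additionally walks the suffix-link subtree below the query's state. The compaction step mentioned in the abstract is not modeled here.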


Published In

STOC '84: Proceedings of the sixteenth annual ACM symposium on Theory of computing
December 1984
547 pages
ISBN: 0897911334
DOI: 10.1145/800057

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%

Cited By

  • (2005) Presentations of constrained systems with unconstrained positions. IEEE Transactions on Information Theory 51:5 (1891-1900). DOI: 10.1109/TIT.2005.846423. Online publication date: 1-May-2005
  • (2005) Fast approximate matching using suffix trees. Combinatorial Pattern Matching (41-54). DOI: 10.1007/3-540-60044-2_33. Online publication date: 31-May-2005
  • (2005) Fast identification of approximately matching substrings. Combinatorial Pattern Matching (64-74). DOI: 10.1007/3-540-58094-8_6. Online publication date: 7-Jun-2005
  • (2005) Building the minimal DFA for the set of all subwords of a word on-line in linear time. Automata, Languages and Programming (109-118). DOI: 10.1007/3-540-13345-3_9. Online publication date: 28-May-2005
  • (1988) Textual and visual access to a computer by people who know nothing about it. Proceedings of the 6th annual international conference on Systems documentation (121-134). DOI: 10.1145/358922.358943. Online publication date: 16-Oct-1988
  • (1985) The Myriad Virtues of Subword Trees. Combinatorial Algorithms on Words (85-96). DOI: 10.1007/978-3-642-82456-2_6. Online publication date: 1985
