
Access methods for text

Published: 01 March 1985

Abstract

    This paper compares text retrieval methods intended for office systems. The operational requirements of the office environment are discussed, and retrieval methods from database systems and from information retrieval systems are examined. We classify these methods and examine the most interesting representatives of each class. Attempts to speed up retrieval with special purpose hardware are also presented, and issues such as approximate string matching and compression are discussed. A qualitative comparison of the examined methods is presented. The signature file method is discussed in more detail.



    Reviews

    Harold Borko

    Automated text retrieval methods are used in libraries and in many online bibliographic and numerical data files. They provide the library user with fast and accurate access to books and journal papers. Text retrieval is also important in office automation systems where letters, reports, memos, and other documents are created, received, and filed for later retrieval. Office workers could benefit from automated information storage and retrieval systems. When these are combined with word processing and electronic mail, the amount of paper circulating in an office could be reduced. With the objective of applying text retrieval methods to an office environment, Faloutsos reviews various text retrieval techniques, indicates some of the difficulties involved, and describes some methods that he considers to be particularly applicable for use in office automation. This is a well-organized and comprehensive paper which examines different text retrieval access methods: full text scanning, inverted lists of key words, multiattribute hashing (superimposed coding), signature files, and clustering techniques. Each of these methods has a long developmental history, and lengthy explanations would be required for a full description of the method with both its advantages and disadvantages. Faloutsos does an admirable job of summarizing, but unless the reader is already somewhat familiar with these techniques, the exposition may be hard to follow. However, there is an excellent set of references, about 100 of them, which will enable the interested reader to pursue any topic in greater depth.

    In the latter part of the paper, the author discusses the integration of text retrieval systems with database management systems appropriate for an office environment. He analyzes primary-key, secondary-key, and text retrieval access methods and compares these on the following criteria: space or memory utilization; response time for searching; handling of insertions, deletions, and updates; ease of growth; preservation of key-order; and the ability to integrate with other retrieval methods. The author concludes that, “In our opinion the signature file approach seems most promising for archiving documents in an office. It provides a reasonable compromise between the inversion method, which is fast on retrieval but expensive on insertion, and the full text screening method, which requires minimal insertion cost, but is slow.” But still, in the author's words, “There are many open problems in the area.”
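
    The compromise described in the quoted conclusion can be illustrated with a small sketch of a signature file built by superimposed coding. This is not the paper's implementation: the signature width `SIG_BITS`, the number of bits per word `BITS_PER_WORD`, the SHA-256-based hash, and all function names are assumptions chosen for illustration. Inserting a document only requires computing one bit string, and a query scans the short signatures first, falling back to the text itself only for candidates.

```python
import hashlib

SIG_BITS = 64       # assumed signature width in bits
BITS_PER_WORD = 3   # assumed number of bits each word sets


def word_signature(word):
    """Map a word to a bit pattern with at most BITS_PER_WORD bits set."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig


def document_signature(text):
    """Superimpose (bitwise OR) the signatures of all words in the text."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig


def search(documents, signatures, query_word):
    """Return documents containing query_word, using signatures as a filter."""
    query_sig = word_signature(query_word)
    hits = []
    for doc, sig in zip(documents, signatures):
        # Cheap test: every bit of the query must be set in the signature.
        if sig & query_sig == query_sig:
            # The filter can report false drops, so confirm on the text.
            if query_word.lower() in doc.lower().split():
                hits.append(doc)
    return hits


documents = [
    "signature files for office documents",
    "inverted lists are fast on retrieval",
]
signatures = [document_signature(d) for d in documents]
print(search(documents, signatures, "signature"))
```

    The final check against the text is what makes the result exact: a signature match only says the word may be present, because different words can set the same bits.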


    Published In

    ACM Computing Surveys, Volume 17, Issue 1
    Annals of discrete mathematics, 24
    March 1985
    140 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/4078

    Publisher

    Association for Computing Machinery

    New York, NY, United States


