DOI: 10.1145/355214.355221

Hybrid term indexing for different IR models

Published: 01 November 2000

Abstract

Retrieval effectiveness depends on how terms are extracted and indexed. Chinese text (like Japanese and Korean text) has no spaces to delimit words. Indexing with hybrid terms (i.e., words and bigrams) was able to achieve the best precision relative to homogeneous terms, at a lower storage cost than indexing with bigrams. However, this was tested only with conjunctive queries. Here, we extend the weighted Boolean model using fuzzy and p-norm measures, as well as the vector space model using the cosine measure, to process hybrid terms. Our evaluation shows that all IR models using hybrid terms achieve better average precision than those using words. Across different recall values, the weighted Boolean model using fuzzy measures with hybrid terms achieves precision that is consistently about 8% higher than with words. The vector space model using the cosine measure with hybrid terms achieved the best improvement in average recall and precision.
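The three families of measures named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so the code below uses the common textbook forms of the fuzzy (min/max) operators, the p-norm operators with unweighted query terms, and the cosine measure; the paper's own extension for hybrid terms may differ. The function names, the toy document, and its term weights are hypothetical, and the weights are assumed to lie in [0, 1].

```python
import math

def fuzzy_and(weights):
    """Fuzzy conjunction: the smallest document term weight."""
    return min(weights)

def fuzzy_or(weights):
    """Fuzzy disjunction: the largest document term weight."""
    return max(weights)

def pnorm_or(weights, p=2.0):
    """p-norm OR with equally weighted query terms (p >= 1)."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p=2.0):
    """p-norm AND with equally weighted query terms (p >= 1)."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

def cosine(doc, query):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * query[t] for t, w in doc.items() if t in query)
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

# Hypothetical document indexed with hybrid terms: two words
# ("中文", "检索") plus the bigram "文检" that spans the word boundary.
doc = {"中文": 0.8, "检索": 0.6, "文检": 0.3}
query_terms = ["中文", "检索"]

# Document weights of the query terms; absent terms count as 0.
weights = [doc.get(t, 0.0) for t in query_terms]

print("fuzzy AND :", fuzzy_and(weights))
print("fuzzy OR  :", fuzzy_or(weights))
print("p-norm AND:", pnorm_and(weights, p=2.0))
print("p-norm OR :", pnorm_or(weights, p=2.0))
print("cosine    :", cosine(doc, {t: 1.0 for t in query_terms}))
```

With p = 1 both p-norm operators reduce to the arithmetic mean of the weights, and as p grows they approach the fuzzy min and max, which is why the p-norm model is often described as interpolating between vector-style and strict Boolean matching.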

Published In

IRAL '00: Proceedings of the fifth international workshop on Information retrieval with Asian languages
November 2000
220 pages
ISBN: 1581133006
DOI: 10.1145/355214
Chairmen: Kam-Fai Wong, Dik L. Lee, Jong-Hyeok Lee

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Chinese information retrieval
2. IR models
3. evaluation
4. indexing
