
Dictionary-based techniques for cross-language information retrieval

Published: 01 May 2005

Abstract

Cross-language information retrieval (CLIR) systems allow users to find documents written in different languages from that of their query. Simple knowledge structures such as bilingual term lists have proven to be a remarkably useful basis for bridging that language gap. A broad array of dictionary-based techniques have demonstrated utility, but comparison across techniques has been difficult because evaluation results often span only a limited range of conditions. This article identifies the key issues in dictionary-based CLIR, develops unified frameworks for term selection and term translation that help to explain the relationships among existing techniques, and illustrates the effect of those techniques using four contrasting languages for systematic experiments with a uniform query translation architecture. Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.
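
As a rough illustration of the dictionary-based query translation described in the abstract, the sketch below looks up each query term in a bilingual term list and counts how many translation alternatives each term receives. The toy term list, the function names, the pass-through of untranslatable terms, and the use of a mean per-term translation count as "fanout" are all assumptions made for illustration; the article's own query-specific fanout measure is not defined on this page and may differ.

```python
from typing import Dict, List

# Hypothetical toy French-to-English term list (illustration only); a real
# system would load a much larger bilingual dictionary.
TERM_LIST: Dict[str, List[str]] = {
    "banque": ["bank"],
    "avocat": ["lawyer", "avocado"],
    "droit": ["law", "right", "straight"],
}


def translate_query(terms: List[str], term_list: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each query term to its known translations.

    Assumption: terms missing from the term list are kept unchanged, so
    that names and orthographic cognates can still match documents.
    """
    return {t: term_list.get(t, [t]) for t in terms}


def mean_fanout(translated: Dict[str, List[str]]) -> float:
    """Average number of translation alternatives per query term."""
    if not translated:
        return 0.0
    return sum(len(alts) for alts in translated.values()) / len(translated)


query = ["avocat", "droit", "paris"]
translated = translate_query(query, TERM_LIST)
print(translated)
print(mean_fanout(translated))
```

In this sketch a query with highly ambiguous terms yields a larger fanout value, which is one simple way to see why the handling of multiple translation alternatives matters for retrieval.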

Reviews

Alice Louise Davison

Two very interesting topics are addressed in this paper. First, it explains how to relate two active fields: information retrieval from electronic databases of documents, and the use of online dictionary resources to translate from one language to another. The task is to define a topic according to key words in a document, and to retrieve related documents. This study uses electronic newspaper archives in French, Chinese, German, and Arabic, as well as available bilingual dictionary resources.

Another highly interesting and timely aspect of this study is that it relates the translation and retrieval process for documents to the structural features of the languages involved: how words are formed; the prevalence of prefixes, suffixes, and clitics; and the relation between roots and the words of different categories and inflections that are formed from them. These factors affect the identification of word roots or stems, which in turn affects the choice of key words and related words, as well as the matching process in translation. French and German use the same orthography as English, and share many cognates with English. Their inflectional systems are similar to English, though more complex. German and Chinese freely form words by compounding, and in Chinese the words are not delimited, requiring search by matches with strings of letters of a specific length. Arabic has both complex inflection and word derivation from three-consonant roots. The discussion of current research on stem identification in Arabic is particularly insightful.

The paper first lays out a complex architecture of the process combining translation with document expansion and retrieval. Each part of the process is related to a series of contrastive experiments assessing the effectiveness of different procedures at different points in the process, with differential results for the four languages under discussion.

The results should encourage further work in this field, especially because the summary exposes very specific issues for future work on more reliable translation and document retrieval. Much complex information is presented in an exceptionally clear and well-organized fashion.

Online Computing Reviews Service

Published In

Information Processing and Management: an International Journal  Volume 41, Issue 3
Special issue: Cross-language information retrieval
May 2005
303 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. cross-language information retrieval
  2. dictionary-based translation
  3. ranked retrieval

Qualifiers

  • Article

Cited By

  • (2023) Context-based understanding of food-related queries using a culinary knowledge model. Journal of Information Science 49:3 (831-852). doi:10.1177/01655515211022163. Online publication date: 1-Jun-2023
  • (2022) Improving cross-lingual text matching with dual-level collaborative coarse-to-fine filter alignment network. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 43:1 (1299-1314). doi:10.3233/JIFS-213070. Online publication date: 1-Jan-2022
  • (2019) Crosslingual Document Embedding as Reduced-Rank Ridge Regression. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (744-752). doi:10.1145/3289600.3291023. Online publication date: 30-Jan-2019
  • (2019) Zero-Shot Language Transfer for Cross-Lingual Sentence Retrieval Using Bidirectional Attention Model. Advances in Information Retrieval (523-538). doi:10.1007/978-3-030-15712-8_34. Online publication date: 14-Apr-2019
  • (2018) Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (1253-1256). doi:10.1145/3209978.3210157. Online publication date: 27-Jun-2018
  • (2018) Using Communities of Words Derived from Multilingual Word Vectors for Cross-Language Information Retrieval in Indian Languages. ACM Transactions on Asian and Low-Resource Language Information Processing 18:1 (1-27). doi:10.1145/3208358. Online publication date: 17-Dec-2018
  • (2018) Towards a new possibilistic query translation tool for cross-language information retrieval. Multimedia Tools and Applications 77:2 (2423-2465). doi:10.1007/s11042-017-4398-2. Online publication date: 1-Jan-2018
  • (2017) Bilingual lexicon induction from non-parallel data with minimal supervision. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (3379-3385). doi:10.5555/3298023.3298059. Online publication date: 4-Feb-2017
  • (2017) An expectation-maximization algorithm for query translation based on pseudo-relevant documents. Information Processing and Management: an International Journal 53:2 (371-387). doi:10.1016/j.ipm.2016.11.007. Online publication date: 1-Mar-2017
  • (2016) Bilingual distributed word representations from document-aligned comparable data. Journal of Artificial Intelligence Research 55:1 (953-994). doi:10.5555/3013558.3013583. Online publication date: 1-Jan-2016
