Dictionary-based techniques for cross-language information retrieval

Authors: Gina-Anne Levow, Douglas W. Oard, Philip Resnik

Information Processing and Management: an International Journal, Volume 41, Issue 3

Pages 523 - 547

https://doi.org/10.1016/j.ipm.2004.06.012

Published: 01 May 2005

Abstract

Cross-language information retrieval (CLIR) systems allow users to find documents written in different languages from that of their query. Simple knowledge structures such as bilingual term lists have proven to be a remarkably useful basis for bridging that language gap. A broad array of dictionary-based techniques have demonstrated utility, but comparison across techniques has been difficult because evaluation results often span only a limited range of conditions. This article identifies the key issues in dictionary-based CLIR, develops unified frameworks for term selection and term translation that help to explain the relationships among existing techniques, and illustrates the effect of those techniques using four contrasting languages for systematic experiments with a uniform query translation architecture. Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.
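
To make the idea of translating a query with a bilingual term list concrete, the following is a minimal sketch and not the authors' system. The toy English-to-French term list, the function names, and the choice to pass unknown terms through unchanged are all assumptions made for illustration. The fanout computed here is simply the mean number of candidate translations per query term; the article's own query-specific measure is not given on this page and may be defined differently.

```python
from statistics import mean

# Hypothetical toy English-to-French term list (an assumption for this
# sketch); a real bilingual term list would hold many thousands of entries.
TERM_LIST = {
    "bank": ["banque", "rive", "berge"],
    "interest": ["intérêt"],
    "rate": ["taux", "tarif"],
}


def translate_query(query, term_list):
    """Map each query term to its list of candidate translations.

    Terms missing from the term list are passed through unchanged, so that
    proper names and orthographic cognates can still match documents.
    """
    return [(term, term_list.get(term, [term])) for term in query.lower().split()]


def mean_fanout(translated):
    """Average number of candidate translations per query term."""
    return mean(len(candidates) for _, candidates in translated)


translated = translate_query("Bank interest rate Paris", TERM_LIST)

# Unstructured target-language query: every candidate becomes a query term.
flat_query = [t for _, candidates in translated for t in candidates]
```

In this sketch an ambiguous term such as "bank" contributes three target-language terms while "interest" contributes one, so high-fanout terms dominate the flat query. A structured query would instead keep each term's candidates grouped together, which is one way to read the abstract's remark that fanout helps explain the utility of structured query methods.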

Index Terms

Dictionary-based techniques for cross-language information retrieval
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
      2. Machine translation
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

Information Processing and Management: an International Journal Volume 41, Issue 3

Special issue: Cross-language information retrieval

May 2005

303 pages

ISSN:0306-4573

Publisher

Pergamon Press, Inc.

United States
