research-article

Semantic morphological variant selection and translation disambiguation for cross-lingual information retrieval

Authors:

Vijay Kumar Sharma,

Ankit Vidyarthi

Multimedia Tools and Applications, Volume 82, Issue 6

Pages 8197 - 8212

https://doi.org/10.1007/s11042-021-11074-w

Published: 11 June 2021

Abstract

Cross-Lingual Information Retrieval (CLIR) enables a user to query in a language different from that of the target documents. CLIR incorporates a translation technique based on either a manual dictionary or a probabilistic dictionary generated from a parallel corpus. Translation techniques for Hindi suffer from a translation mis-mapping issue caused by the morphological richness of the language. In addition, a word may have multiple translations in a dictionary, which leads to a word translation disambiguation issue. This paper presents two techniques: Semantic Morphological Variant Selection (SMVS) and Hybrid Word Translation Disambiguation (HWTD). The former resolves the translation mis-mapping issue, and the latter disambiguates queries more effectively. The proposed techniques are evaluated on FIRE ad-hoc datasets, where SMVS and HWTD at the word level achieve better evaluation measures than the baseline Statistical Machine Translation.
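
The abstract does not describe how SMVS works internally, so the following is only a rough illustration of the general problem it names: an inflected query word is missing from the dictionary, and a related headword has to be chosen in its place. The sketch assumes word embeddings are available and picks the headword that shares a prefix with the query word and has the highest cosine similarity to it. The function names, the prefix heuristic, the romanized toy words, and the vector values are all hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_variant(word, headwords, vectors, prefix_len=3):
    """Return the headword that shares a prefix with `word` and is closest
    to it in embedding space, or None if no candidate exists.
    (Hypothetical helper; the prefix filter is an assumed heuristic.)"""
    candidates = [h for h in headwords
                  if h[:prefix_len] == word[:prefix_len] and h in vectors]
    if word not in vectors or not candidates:
        return None
    return max(candidates, key=lambda h: cosine(vectors[word], vectors[h]))

# Toy, hand-made vectors (assumed values for illustration only).
vectors = {
    "ladkon": [0.9, 0.1, 0.2],  # inflected form, absent from the dictionary
    "ladka":  [0.8, 0.2, 0.1],
    "ladki":  [0.1, 0.9, 0.2],
}
dictionary = {"ladka": ["boy"], "ladki": ["girl"]}

best = select_variant("ladkon", dictionary, vectors)
translations = dictionary[best] if best else []
print(best, translations)
```

With these toy vectors the inflected form maps to the headword whose vector points in nearly the same direction, so its dictionary translations can be reused for the query word. A real system would need trained embeddings and a more careful notion of morphological relatedness than a fixed-length prefix.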

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval

Information

Published In

Multimedia Tools and Applications Volume 82, Issue 6

Mar 2023

1544 pages

ISSN:1380-7501

© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 11 June 2021

Accepted: 11 May 2021

Revision received: 01 April 2021

Received: 15 October 2020
