research-article

Semantic morphological variant selection and translation disambiguation for cross-lingual information retrieval

Published: 11 June 2021

Abstract

Cross-Lingual Information Retrieval (CLIR) lets a user pose a query in a language different from that of the target documents. CLIR relies on a translation technique based on either a manual dictionary or a probabilistic dictionary generated from a parallel corpus. For Hindi, these translation techniques suffer from a translation mis-mapping issue, which arises from the morphological richness of the language. In addition, a word may have multiple translations in a dictionary, which raises the problem of word translation disambiguation. This paper presents two key contributions: Semantic Morphological Variant Selection (SMVS) and Hybrid Word Translation Disambiguation (HWTD). The former resolves the translation mis-mapping issue, and the latter disambiguates the queries more effectively. The proposed techniques are evaluated on the FIRE ad-hoc datasets, where SMVS and HWTD at the word level achieve better evaluation measures than the baseline Statistical Machine Translation.
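
The abstract does not describe how SMVS or HWTD work internally, so the following Python sketch only illustrates the two problems it names, under loudly stated assumptions. It assumes that a query word missing from the dictionary can be mapped to the dictionary entry that shares a prefix with it and is closest in an embedding space, and that an ambiguous translation can be chosen by its average similarity to the other translated query words. The function names, the prefix threshold, the transliterated Hindi words, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def common_prefix_len(a, b):
    """Number of leading characters shared by two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_variant(word, dictionary, vectors, min_prefix=3):
    """Return the dictionary key to look up for `word`, or None.

    Assumption: morphological variants share a prefix of at least
    `min_prefix` characters, and the right variant is the one whose
    vector is most similar to the query word's vector.
    """
    if word in dictionary:
        return word
    if word not in vectors:
        return None
    candidates = [
        key for key in dictionary
        if key in vectors and common_prefix_len(word, key) >= min_prefix
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda key: cosine(vectors[word], vectors[key]))


def disambiguate(translations, context_words, vectors):
    """Pick the translation with the highest mean similarity to the context."""
    def score(candidate):
        if candidate not in vectors:
            return 0.0
        sims = [cosine(vectors[candidate], vectors[c])
                for c in context_words if c in vectors]
        return sum(sims) / len(sims) if sims else 0.0
    return max(translations, key=score)


# Toy, hand-made data (hypothetical; not from the paper or any real corpus).
hi_en_dictionary = {"ladki": ["girl"], "ladka": ["boy"], "ladai": ["fight"]}
hindi_vectors = {
    "ladkiyon": [0.9, 0.1, 0.0],  # inflected form absent from the dictionary
    "ladki": [1.0, 0.0, 0.0],
    "ladka": [0.6, 0.8, 0.0],
    "ladai": [0.0, 0.0, 1.0],
}
english_vectors = {
    "gold": [1.0, 0.0],
    "sleep": [0.0, 1.0],
    "price": [0.9, 0.1],
    "jewellery": [0.8, 0.2],
}

variant = select_variant("ladkiyon", hi_en_dictionary, hindi_vectors)
choice = disambiguate(["gold", "sleep"], ["price", "jewellery"], english_vectors)
print(variant, hi_en_dictionary[variant], choice)
```

In this toy setting the inflected form is mapped to the closest dictionary entry before lookup, and the ambiguous word is resolved by the company it keeps in the query. A real system would replace the hand-made vectors with embeddings trained on a corpus.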

Published In

Multimedia Tools and Applications, Volume 82, Issue 6
Mar 2023
1544 pages

Publisher

Kluwer Academic Publishers
United States

Publication History

Published: 11 June 2021
Accepted: 11 May 2021
Revision received: 01 April 2021
Received: 15 October 2020

Author Tags

1. Cross-lingual information retrieval
2. Translation
3. Disambiguation
4. Parallel corpus
5. Recurrent neural network
6. Continuous bag of word

Qualifiers

• Research-article