DOI: 10.1145/3209978.3210157
short-paper

Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only

Published: 27 June 2018

Abstract

We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks. Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach. Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced relying on word-level and document-level alignments. We then demonstrate that further improvements can be achieved by unsupervised ensemble CLIR models. We believe that the proposed framework is the first step towards development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.
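
To make the idea of representing queries and documents in one shared space concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not say how word vectors are aggregated or scored, so averaging and cosine similarity are assumptions, and the 3-dimensional vectors, document ids, and function names are invented for illustration. In practice the shared space would be induced from monolingual corpora as described above.

```python
import math

# Hypothetical shared cross-lingual space: English and Dutch words live in
# the same vector space. The values are invented toy numbers, not real embeddings.
SHARED_SPACE = {
    "dog": [0.90, 0.10, 0.00],     # English
    "hond": [0.88, 0.12, 0.02],    # Dutch for "dog"
    "money": [0.00, 0.90, 0.10],   # English
    "geld": [0.05, 0.85, 0.20],    # Dutch for "money"
}


def embed_text(tokens, space):
    """Average the vectors of in-vocabulary tokens (assumed aggregation).

    Returns None when no token is in the vocabulary.
    """
    vectors = [space[t] for t in tokens if t in space]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_tokens, documents, space):
    """Score each document against the query and sort by descending score."""
    query_vec = embed_text(query_tokens, space)
    scored = []
    for doc_id, tokens in documents.items():
        doc_vec = embed_text(tokens, space)
        if query_vec is None or doc_vec is None:
            score = 0.0  # nothing to compare when either side is out of vocabulary
        else:
            score = cosine(query_vec, doc_vec)
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# English query against two toy Dutch "documents".
documents = {"nl_1": ["hond"], "nl_2": ["geld"]}
ranking = rank_documents(["dog"], documents, SHARED_SPACE)
print(ranking)
```

Because the query and the documents are compared only through their vectors, no translation step or bilingual dictionary appears anywhere in the ranking code; all cross-lingual knowledge sits in the shared space itself.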

Published In

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018
1509 pages
ISBN:9781450356572
DOI:10.1145/3209978
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-lingual vector spaces
  2. unsupervised cross-lingual ir

Qualifiers

  • Short-paper

Conference

SIGIR '18
Acceptance Rates

SIGIR '18 Paper Acceptance Rate 86 of 409 submissions, 21%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%
