research-article

Neural Vector Spaces for Unsupervised Information Retrieval

Published: 26 June 2018
    Abstract

    We propose the Neural Vector Space Model (NVSM), a method that learns representations of documents in an unsupervised manner for news article retrieval. In the NVSM paradigm, we learn low-dimensional representations of words and documents from scratch using gradient descent and rank documents according to their similarity with query representations that are composed from word representations. We show that NVSM performs better at document ranking than existing latent semantic vector space methods. The addition of NVSM to a mixture of lexical language models and a state-of-the-art baseline vector space model yields a statistically significant increase in retrieval effectiveness. Consequently, NVSM adds a complementary relevance signal. Next to semantic matching, we find that NVSM performs well in cases where lexical matching is needed. NVSM learns a notion of term specificity directly from the document collection without feature engineering. We also show that NVSM learns regularities related to Luhn significance. Finally, we give advice on how to deploy NVSM in situations where model selection (e.g., cross-validation) is infeasible. We find that an unsupervised ensemble of multiple models trained with different hyperparameter values performs better than a single cross-validated model. Therefore, NVSM can safely be used for ranking documents without supervised relevance judgments.
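    The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how query representations are composed or which similarity is used, so averaging word vectors and scoring with cosine similarity are assumptions here, as is placing words and documents in the same space. The vectors are hand-picked toy values rather than learned ones, and the names `rank_documents`, `word_vectors`, and `doc_vectors` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def average(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_documents(
    query_terms: Sequence[str],
    word_vectors: Dict[str, Vector],
    doc_vectors: Dict[str, Vector],
) -> List[str]:
    """Rank document ids by similarity to a query composed from word vectors.

    Assumption: the query vector is the mean of its known word vectors and
    documents are scored by cosine similarity. Out-of-vocabulary terms are
    ignored; an empty list is returned if no query term is known.
    """
    known = [word_vectors[t] for t in query_terms if t in word_vectors]
    if not known:
        return []
    query_vector = average(known)
    scores = {
        doc_id: cosine(query_vector, vec) for doc_id, vec in doc_vectors.items()
    }
    # Highest similarity first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy, hand-picked vectors (hypothetical; a real model would learn these
# with gradient descent from the document collection).
word_vectors = {
    "stock": [1.0, 0.0],
    "market": [0.8, 0.2],
    "football": [0.0, 1.0],
}
doc_vectors = {
    "doc_finance": [0.9, 0.1],
    "doc_sports": [0.1, 0.9],
    "doc_mixed": [0.5, 0.5],
}

ranking = rank_documents(["stock", "market"], word_vectors, doc_vectors)
print(ranking)
```

    With these toy values the finance document is ranked first and the sports document last, since the composed query vector points in nearly the same direction as the finance document vector.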



    Information & Contributors

    Information

    Published In

    ACM Transactions on Information Systems, Volume 36, Issue 4
    October 2018
    365 pages
    ISSN: 1046-8188
    EISSN: 1558-2868
    DOI: 10.1145/3211967
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 June 2018
    Accepted: 01 March 2018
    Revised: 01 January 2018
    Received: 01 August 2017
    Published in TOIS Volume 36, Issue 4


    Author Tags

    1. Ad-hoc retrieval
    2. document retrieval
    3. latent vector spaces
    4. representation learning
    5. semantic matching

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Ahold Delhaize, Amsterdam Data Science, the Bloomberg Research Grant program
    • Criteo Faculty Research Award program, Elsevier
    • Google Faculty Research Award scheme, the Microsoft Research Ph.D. program, the Netherlands Institute for Sound and Vision, the Netherlands Organisation for Scientific Research (NWO)
    • Yandex
    • European Community's Seventh Framework Programme (FP7/2007-2013)


    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 42
    • Downloads (Last 6 weeks): 4
    Reflects downloads up to 28 Jul 2024


    Citations

    Cited By

    • (2024) A Kernel Measure of Dissimilarity between M Distributions. Journal of the American Statistical Association (1-27). DOI: 10.1080/01621459.2023.2298036. Online publication date: Feb-2024
    • (2024) On Representation Learning-based Methods for Effective, Efficient, and Scalable Code Retrieval. Neurocomputing 600 (128172). DOI: 10.1016/j.neucom.2024.128172. Online publication date: Oct-2024
    • (2024) Conditional variational autoencoder for query expansion in ad-hoc information retrieval. Information Sciences 652 (119764). DOI: 10.1016/j.ins.2023.119764. Online publication date: Jan-2024
    • (2024) Semantic deep learning and adaptive clustering for handling multimodal multimedia information retrieval. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19312-7. Online publication date: 25-May-2024
    • (2023) TAM GAN: Tamil Text to Naturalistic Image Synthesis Using Conventional Deep Adversarial Networks. ACM Transactions on Asian and Low-Resource Language Information Processing 22:5 (1-18). DOI: 10.1145/3584019. Online publication date: 16-Feb-2023
    • (2023) Information Retrieval: Recent Advances and Beyond. IEEE Access 11 (76581-76604). DOI: 10.1109/ACCESS.2023.3295776. Online publication date: 2023
    • (2023) Deep learning-based risk prediction for interventional clinical trials based on protocol design: A retrospective study. Patterns 4:3 (100689). DOI: 10.1016/j.patter.2023.100689. Online publication date: Mar-2023
    • (2023) Deep neural ranking model using distributed smoothing. Expert Systems with Applications: An International Journal 224:C. DOI: 10.1016/j.eswa.2023.119913. Online publication date: 15-Aug-2023
    • (2023) Deep learning modelling techniques: current progress, applications, advantages, and challenges. Artificial Intelligence Review 56:11 (13521-13617). DOI: 10.1007/s10462-023-10466-8. Online publication date: 17-Apr-2023
    • (2022) Feature Transformation Framework for Enhancing Compactness and Separability of Data Points in Feature Space for Small Datasets. Applied Sciences 12:3 (1713). DOI: 10.3390/app12031713. Online publication date: 7-Feb-2022