Embedding-based query language models

H Zamani, WB Croft - Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR), 2016 - dl.acm.org
Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods that use the top retrieved documents to update the original query model are well-known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained by the widely-used cosine similarity with a sigmoid function to have more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.
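
The sigmoid transform of cosine similarity described in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical implementation, not the authors' code: the steepness and threshold parameters `a` and `c`, the function names, and the normalization of aggregated scores into a distribution are all assumptions chosen for demonstration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def discriminative_sim(u, v, a=10.0, c=0.5):
    """Sigmoid-transformed cosine similarity.

    The sigmoid spreads similarity values out around the threshold c,
    making related and unrelated term pairs easier to separate.
    The steepness a and threshold c are illustrative values, not the
    parameters tuned in the paper.
    """
    return 1.0 / (1.0 + np.exp(-a * (cosine(u, v) - c)))

def expansion_weights(query_vecs, vocab_vecs, a=10.0, c=0.5):
    """Weight each vocabulary term by its transformed similarity to the
    query terms, normalized into a distribution: a simplified sketch of
    embedding-based query expansion.

    query_vecs: list of embedding vectors for the query terms.
    vocab_vecs: dict mapping candidate expansion terms to their vectors.
    """
    scores = {}
    for term, vec in vocab_vecs.items():
        # Aggregate transformed similarity of this term to all query terms.
        scores[term] = sum(discriminative_sim(vec, qv, a, c) for qv in query_vecs)
    total = sum(scores.values())
    return {term: s / total for term, s in scores.items()}
```

In the paper, the transformed similarities drive two embedding-based expansion models and an embedding-based relevance model; normalizing aggregated scores into a distribution, as above, is only one plausible way to turn them into a query language model.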