research-article

Modeling Queries with Contextual Snippets for Information Retrieval

Published: 31 January 2018

Abstract

Query expansion under the pseudo-relevance feedback (PRF) framework has been extensively studied in information retrieval. However, most expansion methods rely mainly on single-term statistics, which can generate many irrelevant query terms and degrade retrieval performance. To alleviate this problem, we propose an approach that feeds PRF-based contextual snippets into a context-aware topic model to enhance query representations. Specifically, instead of selecting a series of independent terms, we make full use of the query's contextual information and focus on snippets of length n in the PRF documents. Furthermore, we propose a context-aware topic (CAT) model to mine the topic distributions of the query-relevant snippets, which we call fine contextual snippets. In contrast to traditional topic models, which infer topics from the whole corpus, we establish a bridge between the snippets and the corresponding PRF documents, which allows the topics to be modeled more precisely and efficiently. Finally, the topic distributions of the fine snippets are used to build context-aware and topic-sensitive query representations. To evaluate our approach, we integrate the resulting queries into a topic-based hybrid retrieval model and conduct extensive experiments on various TREC collections. The experimental results show that our query-modeling approach boosts retrieval performance more effectively than state-of-the-art methods.
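
The abstract does not spell out how the length-n snippets are extracted, so the following is only a minimal sketch under stated assumptions: a snippet is taken to be any window of n consecutive tokens in a feedback document that contains at least one query term. The function name `contextual_snippets`, the whitespace tokenization, and the case-insensitive matching are all hypothetical choices for illustration, not the authors' procedure.

```python
from typing import List


def contextual_snippets(doc_tokens: List[str],
                        query_terms: List[str],
                        n: int) -> List[List[str]]:
    """Return every window of n consecutive tokens that contains a query term.

    Assumption: a "contextual snippet" is a sliding window of n tokens;
    the paper itself may define snippets differently.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    query = {t.lower() for t in query_terms}
    snippets = []
    # Slide a window of size n over the document, one token at a time.
    for start in range(max(len(doc_tokens) - n + 1, 0)):
        window = doc_tokens[start:start + n]
        # Keep the window only if it overlaps the query vocabulary.
        if any(tok.lower() in query for tok in window):
            snippets.append(window)
    return snippets


# Hypothetical feedback document and query, tokenized on whitespace.
doc = "query expansion adds related terms to the original query".split()
snippets = contextual_snippets(doc, ["expansion"], n=3)
for s in snippets:
    print(" ".join(s))
```

In a pipeline of the kind the abstract describes, windows like these would then be passed to a topic model; that step is not sketched here because the abstract gives no details of the CAT model.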

Supplementary Material

a47-chen-apndx.pdf (chen.zip)
Supplemental movie, appendix, image, and software files for "Modeling Queries with Contextual Snippets for Information Retrieval"

Cited By

  • (2021) Contextual Text Coding: A Mixed-methods Approach for Large-scale Textual Data. Sociological Methods & Research 52:2 (606-641). DOI: 10.1177/0049124120986191. Online publication date: 8-Feb-2021
  • (2019) Self-Attention based Network For Medical Query Expansion. 2019 International Joint Conference on Neural Networks (IJCNN) (1-9). DOI: 10.1109/IJCNN.2019.8852269. Online publication date: Jul-2019
  • (2018) TAKer: Fine-Grained Time-Aware Microblog Search with Kernel Density Estimation. IEEE Transactions on Knowledge and Data Engineering 30:8 (1602-1615). DOI: 10.1109/TKDE.2018.2794538. Online publication date: 1-Aug-2018

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 9, Issue 4
Research Survey and Regular Papers
July 2018
280 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/3183892
  • Editor: Yu Zheng
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 January 2018
Accepted: 01 November 2017
Revised: 01 October 2017
Received: 01 July 2017
Published in TIST Volume 9, Issue 4

Author Tags

  1. Contextual snippet
  2. query representation
  3. topic modeling

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • High Technology Research and Development Program of China
  • ORF-RE (Ontario Research Fund-Research Excellence) award in BRAIN Alliance
  • Natural Sciences & Engineering Research Council (NSERC) of Canada
  • NSERC CREATE award in ADERSIM
  • York Research Chairs (YRC)

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 11
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 03 Oct 2024
