DOI: 10.1145/3404835.3463242
short-paper

TripClick: The Log Files of a Large Health Web Search Engine

Published: 11 July 2021

Abstract

Click logs are valuable resources for a variety of information retrieval (IR) tasks. These include query understanding and analysis, as well as learning effective IR models, particularly when the models require large amounts of training data. We release a large-scale domain-specific dataset of click logs, obtained from user interactions with the Trip Database health web search engine. Our click log dataset comprises approximately 5.2 million user interactions collected between 2013 and 2020. We use this dataset to create a standard IR evaluation benchmark, TripClick, with around 700,000 unique free-text queries and 1.3 million query-document pairs with relevance signals, where relevance is estimated by two click-through models. As such, the collection is one of the few datasets offering the data richness and scale needed to train neural IR models with a large number of parameters, and notably the first in the health domain. Using TripClick, we conduct experiments to evaluate a variety of IR models, showing the benefits of exploiting this data to train neural architectures. In particular, the evaluation results show that the best-performing neural IR model outperforms classical IR models by a large margin, especially for more frequent queries.
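To illustrate how a click-through model can turn raw click logs into query-document relevance signals, here is a minimal sketch. The abstract does not name the two models used for TripClick, so this example assumes a simple document click-through rate (clicks divided by impressions per query-document pair). The function name `dctr_relevance`, the log layout, and the toy data are all hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def dctr_relevance(log):
    """Estimate relevance of each (query, doc) pair as clicks / impressions.

    `log` is an iterable of (query, shown_docs, clicked_docs) tuples.
    This layout is an assumption for illustration, not the TripClick format.
    """
    shown = defaultdict(int)    # how often a doc was displayed for a query
    clicked = defaultdict(int)  # how often it was clicked when displayed
    for query, results, clicks in log:
        for doc in results:
            shown[(query, doc)] += 1
        # Count at most one click per document per interaction,
        # and ignore clicks on documents that were not displayed.
        for doc in set(clicks) & set(results):
            clicked[(query, doc)] += 1
    return {pair: clicked.get(pair, 0) / n for pair, n in shown.items()}


# Hypothetical toy log: (query, documents shown, documents clicked)
toy_log = [
    ("asthma treatment", ["d1", "d2", "d3"], ["d1"]),
    ("asthma treatment", ["d1", "d2"], ["d2"]),
    ("asthma treatment", ["d1", "d3"], ["d1"]),
]
scores = dctr_relevance(toy_log)
```

A rate like this is biased toward documents shown at top positions, which is one reason the click-model literature proposes position-aware alternatives; the sketch only shows the basic idea of aggregating clicks into a graded signal.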

Supplementary Material

MP4 File (1320.mp4)
Presentation video

Published In

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2021
2998 pages
ISBN:9781450380379
DOI:10.1145/3404835
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. click logs
  2. collection
  3. health information retrieval
  4. medical information retrieval
  5. neural ranking models

Qualifiers

  • Short-paper

Funding Sources

  • NSF

Conference

SIGIR '21
Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 60
  • Downloads (last 6 weeks): 2
Reflects downloads up to 12 Jan 2025.

Citations

Cited By

  • (2024) Navigating Labels and Vectors: A Unified Approach to Filtered Approximate Nearest Neighbor Search. Proceedings of the ACM on Management of Data 2:6 (1-27). DOI: 10.1145/3698822. Online publication date: 20-Dec-2024
  • (2024) Evaluation of Temporal Change in IR Test Collections. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval (3-13). DOI: 10.1145/3664190.3672530. Online publication date: 2-Aug-2024
  • (2024) ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654923. Online publication date: 30-May-2024
  • (2024) Resources for Combining Teaching and Research in Information Retrieval Coursework. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1115-1125). DOI: 10.1145/3626772.3657886. Online publication date: 10-Jul-2024
  • (2024) CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1221-1231). DOI: 10.1145/3626772.3657851. Online publication date: 10-Jul-2024
  • (2024) Validating Synthetic Usage Data in Living Lab Environments. Journal of Data and Information Quality 16:1 (1-33). DOI: 10.1145/3623640. Online publication date: 6-Mar-2024
  • (2024) Enriching Simple Keyword Queries for Domain-Aware Narrative Retrieval. Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries (143-154). DOI: 10.1109/JCDL57899.2023.00029. Online publication date: 26-Jun-2024
  • (2023) Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (139-149). DOI: 10.1145/3624918.3625333. Online publication date: 26-Nov-2023
  • (2023) The Infinite Index: Information Retrieval on Generative Text-To-Image Models. Proceedings of the 2023 Conference on Human Information Interaction and Retrieval (172-186). DOI: 10.1145/3576840.3578327. Online publication date: 19-Mar-2023
  • (2023) One-Shot Labeling for Automatic Relevance Estimation. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2230-2235). DOI: 10.1145/3539618.3592032. Online publication date: 19-Jul-2023
