research-article

ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search

Authors:

Bodo Billerbeck

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management

Pages 2983 - 2989

https://doi.org/10.1145/3340531.3412779

Published: 19 October 2020

Abstract

Users of Web search engines reveal their information needs through queries and clicks, making click logs a useful asset for information retrieval. However, click logs have not been publicly released for academic use, because they can be too revealing of personally or commercially sensitive information. This paper describes a click data release related to the TREC Deep Learning Track document corpus. After aggregation and filtering, including a k-anonymity requirement, we find 1.4 million of the TREC DL URLs have 18 million connections to 10 million distinct queries. Our dataset of these queries and connections to TREC documents is of similar size to proprietary datasets used in previous papers on query mining and ranking. We perform some preliminary experiments using the click data to augment the TREC DL training data, offering by comparison: 28x more queries, with 49x more connections to 4.4x more URLs in the corpus. We present a description of the dataset's generation process, characteristics, use in ranking and other potential uses.
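To make the aggregation and k-anonymity step in the abstract concrete, here is a minimal Python sketch of one possible reading: keep a query-URL pair only if at least k distinct users clicked it. The log layout `(user_id, query, url)`, the function name `k_anonymous_pairs`, the toy rows, and the value of `k` are all assumptions for illustration; this page does not state the threshold or the exact filtering rules used for the release.

```python
from collections import defaultdict


def k_anonymous_pairs(click_log, k):
    """Return (query, url) pairs clicked by at least k distinct users.

    Hypothetical helper: illustrates a k-anonymity style filter, not the
    paper's actual pipeline.
    """
    users_per_pair = defaultdict(set)
    for user_id, query, url in click_log:
        # Aggregate: remember which distinct users clicked each pair.
        users_per_pair[(query, url)].add(user_id)
    # Filter: drop pairs supported by fewer than k users.
    return sorted(
        pair for pair, users in users_per_pair.items() if len(users) >= k
    )


# Made-up log rows: (user_id, query, url)
toy_log = [
    ("u1", "orca facts", "https://example.org/orcas"),
    ("u2", "orca facts", "https://example.org/orcas"),
    ("u3", "orca facts", "https://example.org/orcas"),
    ("u1", "my rare query", "https://example.org/private"),
]

released = k_anonymous_pairs(toy_log, k=3)
print(released)
```

In this toy log the pair clicked by a single user is dropped, so no user identifier and no rare query survives into the released pairs.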

Supplementary Material

MP4 File (3340531.3412779.mp4)

Description of the ORCAS dataset: Open Resource for Click Analysis in Search. This is based on search log data, with aggregation, to identify query-URL pairs that were clicked by many users. The data can be used to improve search or for web mining such as finding related queries.
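As an illustration of the "finding related queries" use mentioned in this description, the following Python sketch ranks queries by the number of clicked URLs they share with a target query. The pair layout `(query, url)`, the function name `related_queries`, and the toy data are assumptions; co-click overlap is just one simple technique, not necessarily the method explored in the paper.

```python
from collections import defaultdict


def related_queries(pairs, query):
    """Rank other queries by how many clicked URLs they share with `query`.

    Hypothetical helper: a plain co-click overlap count over query-URL pairs.
    """
    urls_by_query = defaultdict(set)
    for q, url in pairs:
        urls_by_query[q].add(url)
    target = urls_by_query.get(query, set())
    scores = {
        other: len(target & urls)
        for other, urls in urls_by_query.items()
        if other != query and target & urls
    }
    # Highest overlap first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up query-URL pairs
toy_pairs = [
    ("orca diet", "a"),
    ("orca diet", "b"),
    ("what do killer whales eat", "a"),
    ("what do killer whales eat", "b"),
    ("killer whale size", "b"),
    ("tuna recipes", "c"),
]

print(related_queries(toy_pairs, "orca diet"))
```

Queries that share no clicked URL with the target are omitted rather than returned with a score of zero.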


Index Terms

ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
      1. Query log analysis
    2. Retrieval models and ranking
  2. World Wide Web
    1. Web mining
      1. Web log analysis


Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management

October 2020

3619 pages

ISBN: 9781450368599

DOI: 10.1145/3340531

General Chairs: Mathieu d'Aquin (DSI, Insight, NUI Galway, Ireland), Stefan Dietze (GESIS, Cologne, Germany; Heinrich-Heine-University Düsseldorf, Germany; L3S Research Center, Germany)

Program Chairs: Claudia Hauff (TU Delft, The Netherlands), Edward Curry (DSI, Insight, NUI Galway, Ireland), Philippe Cudre Mauroux (eXascale, University of Fribourg, Switzerland)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM '20

CIKM '20: The 29th ACM International Conference on Information and Knowledge Management

October 19 - 23, 2020

Virtual Event, Ireland

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
