research-article

'It's on the tip of my tongue': A new Dataset for Known-Item Retrieval

Authors:

Samarth Bhargav,

Georgios Sidiropoulos,

Evangelos Kanoulas

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining

Pages 48 - 56

https://doi.org/10.1145/3488560.3498421

Published: 15 February 2022

Abstract

The tip of the tongue known-item retrieval (TOT-KIR) task involves the 'one-off' retrieval of an item for which a user cannot recall a precise identifier. The emergence of several online communities where users pose known-item queries to other users indicates that existing search systems cannot answer such queries. Research in this domain is hampered by the lack of large, open or realistic datasets. Prior datasets relied either on annotation by crowd workers, which can be expensive and time-consuming, or on synthetic queries, which can be unrealistic. Additionally, small datasets make modern (neural) retrieval methods unviable, since these methods require a large number of data points. In this paper, we collect the largest dataset yet, with 15K query-item pairs in two domains, Movies and Books, from an online community using heuristics, rendering expensive annotation unnecessary while ensuring that queries are realistic. We show that our data collection method is accurate by conducting a data study. We further demonstrate that methods like BM25 fall short of answering such queries, corroborating prior research. The size of the dataset makes neural methods feasible, and we show that they outperform lexical baselines, indicating that neural/dense retrieval is superior for the TOT-KIR task.
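As a rough illustration of why a lexical method such as BM25 can fall short on tip of the tongue queries, the sketch below scores a vague, descriptive query against two short item descriptions. It is not the paper's experimental setup: the function name `bm25_scores`, the whitespace tokenizer, the parameter defaults `k1=1.2` and `b=0.75`, the Lucene-style IDF variant, and the toy descriptions and query are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25.

    Tokenization is lowercase whitespace splitting (an assumption made
    for brevity; real systems use stemming and stopword handling).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Lucene-style IDF, which stays positive for every term.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical item descriptions and query, invented for illustration only.
docs = [
    "a thief enters dreams to plant an idea in a sleeping target",
    "a hacker learns that reality is a computer simulation",
]
query = "movie where people go inside someone's mind while he sleeps"
scores = bm25_scores(query, docs)
```

In this toy case the descriptive query shares no exact token with either description ("sleeps" does not match "sleeping"), so both scores are zero, whereas a keyword query such as "thief dreams idea" scores the first description above the second. Dense retrievers compare learned embeddings instead of exact tokens, which is one plausible reason they fare better on such queries.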

Supplementary Material

MP4 File (wsdmfp285.mp4)

Presentation Video for 'It's on the tip of my tongue': A new Dataset for Known-Item Retrieval


Cited By

Borges L, Jha R, Callan J, Martins B, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y. (2024). Generalizable Tip-of-the-Tongue Retrieval with LLM Re-ranking. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2437-2441. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657917

Bhargav S, Schuth A, Hauff C, Chen H, Duh W, Huang H, Kato M, Mothe J, Poblete B. (2023). When the Music Stops: Tip-of-the-Tongue Retrieval for Music. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2506-2510. Online publication date: 19-Jul-2023.
https://dl.acm.org/doi/10.1145/3539618.3592086

Index Terms

'It's on the tip of my tongue': A new Dataset for Known-Item Retrieval
1. Information systems
  1. Information retrieval
  2. World Wide Web
    1. Web searching and information discovery

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining

February 2022

1690 pages

ISBN:9781450391320

DOI:10.1145/3488560

General Chairs:
K. Selcuk Candan, Arizona State University, USA
Huan Liu, Arizona State University, USA

Program Chairs:
Leman Akoglu, Carnegie Mellon University, USA
Xin Luna Dong, Meta Platforms, Inc. (former Facebook), USA
Jiliang Tang, Michigan State University, USA

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WSDM '22: The Fifteenth ACM International Conference on Web Search and Data Mining

February 21 - 25, 2022

AZ, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%


Bibliometrics

Total Citations: 2
Total Downloads: 452
Downloads (Last 12 months): 89
Downloads (Last 6 weeks): 4

Reflects downloads up to 08 Feb 2025
