DOI: 10.1145/3488560.3498421

'It's on the tip of my tongue': A new Dataset for Known-Item Retrieval

Published: 15 February 2022

Abstract

The tip of the tongue known-item retrieval (TOT-KIR) task involves the 'one-off' retrieval of an item for which a user cannot recall a precise identifier. The emergence of several online communities where users pose known-item queries to other users indicates that existing search systems cannot answer such queries. Research in this domain is hampered by the lack of large, open, or realistic datasets. Prior datasets relied either on annotation by crowd workers, which can be expensive and time-consuming, or on synthetic queries, which can be unrealistic. Additionally, small datasets make the application of modern (neural) retrieval methods unviable, since these methods require a large number of data points. In this paper, we collect the largest dataset yet, with 15K query-item pairs in two domains, namely Movies and Books, from an online community using heuristics, rendering expensive annotation unnecessary while ensuring that queries are realistic. We show that our data collection method is accurate by conducting a data study. We further demonstrate that lexical methods like BM25 fall short of answering such queries, corroborating prior research. The size of the dataset makes neural methods feasible, and we show that they outperform lexical baselines, indicating that neural/dense retrieval is superior for the TOT-KIR task.
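
To make the lexical-mismatch problem concrete, the following is a minimal sketch of BM25 scoring in Python. It is not the authors' experimental setup: the query, the two item descriptions, the whitespace tokenizer, and the parameter values k1=1.2 and b=0.75 are all assumptions chosen for illustration, and nothing here is taken from the dataset. The query describes the first item in the user's own words, yet BM25 ranks the second item higher because it shares more surface terms with the query.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real systems also stem and drop stopwords).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the standard BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Invented toy data: the query is a vague recollection of item 0.
query = "movie where a guy relives the same day over and over"
docs = [
    "a weatherman is trapped in a time loop on groundhog day",  # intended item
    "a guy and a girl fall in love over one summer",            # lexical distractor
]
scores = bm25_scores(query, docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

In this toy case the distractor outscores the intended item, since terms such as "relives" and "same day" never appear in the intended item's description. A dense retriever that compares learned embeddings rather than exact terms is not bound by this overlap, which is the motivation the abstract gives for neural methods.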

Supplementary Material

MP4 File (wsdmfp285.mp4)
Presentation Video for 'It's on the tip of my tongue': A new Dataset for Known-Item Retrieval


Cited By

  • (2024) Generalizable Tip-of-the-Tongue Retrieval with LLM Re-ranking. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2437-2441. DOI: 10.1145/3626772.3657917. Online publication date: 10-Jul-2024
  • (2023) When the Music Stops: Tip-of-the-Tongue Retrieval for Music. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2506-2510. DOI: 10.1145/3539618.3592086. Online publication date: 19-Jul-2023

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
February 2022
1690 pages
ISBN: 9781450391320
DOI: 10.1145/3488560

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. known item retrieval
  2. tip of the tongue known item retrieval

Qualifiers

  • Research-article

Conference

WSDM '22

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
