Exposing Query Identification for Search Transparency

Authors:

Asia J. Biega

WWW '22: Proceedings of the ACM Web Conference 2022

Pages 3662 - 3672

https://doi.org/10.1145/3485447.3512262

Published: 25 April 2022

Abstract

Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but also an understanding of the specific searches in which the content is surfaced. Identifying which queries expose a given piece of content in the ranked results is an important and relatively underexplored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization. Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the roles of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25. We then improve upon this approach through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries, and conduct an empirical analysis of various practical aspects of approximate EQI. Overall, our work contributes a novel conception of transparency in search systems and computational means of achieving it.
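
The reversal of roles described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a dual-encoder setting in which queries and documents are already embedded as vectors and scored by dot product, and the function names (`exact_exposing_queries`, `approximate_exposing_queries`) and the toy vectors are hypothetical. The exact routine runs every query against the whole collection, which is the expensive baseline. The approximate routine treats the document as the "query" and retrieves its nearest queries. The metric-learning refinement mentioned in the abstract is omitted.

```python
import numpy as np

def exact_exposing_queries(query_vecs, doc_vecs, doc_id, k):
    """Brute force: score every query against every document and keep
    the queries that place doc_id within their top-k results."""
    scores = query_vecs @ doc_vecs.T  # shape: (num_queries, num_docs)
    # Zero-based rank of doc_id per query = number of documents scoring strictly higher.
    ranks = (scores > scores[:, [doc_id]]).sum(axis=1)
    return set(np.flatnonzero(ranks < k).tolist())

def approximate_exposing_queries(query_vecs, doc_vecs, doc_id, n):
    """Reversed retrieval: use the document embedding as the 'query'
    and return the n most similar query embeddings, best first."""
    sims = query_vecs @ doc_vecs[doc_id]
    return np.argsort(-sims, kind="stable")[:n].tolist()

# Toy data (hypothetical): three queries and two documents in a 2-d embedding space.
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])

exact = exact_exposing_queries(query_vecs, doc_vecs, doc_id=0, k=1)
approx = approximate_exposing_queries(query_vecs, doc_vecs, doc_id=0, n=2)
print(exact, approx)
```

In this toy case the reversed ranking happens to recover the exact set, but in general it need not: a query that is close to a document may still rank other documents above it, which is one way to read the motivation for evaluating and improving the approximation.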

Index Terms

Exposing Query Identification for Search Transparency
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval models and ranking

Published In

WWW '22: Proceedings of the ACM Web Conference 2022

April 2022

3764 pages

ISBN: 9781450390965

DOI: 10.1145/3485447

Editors:
Frédérique Laforest (INSA Lyon, France)
Raphaël Troncy (EURECOM, France)
Elena Simperl (King’s College London, UK)
Deepak Agarwal (Pinterest, USA)
Aristides Gionis (KTH Royal Institute of Technology, Sweden)
Ivan Herman (W3C / retired)
Lionel Médini (Université Lyon 1, France)

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '22

Sponsor:

SIGWEB

WWW '22: The ACM Web Conference 2022

April 25 - 29, 2022

Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
