DOI: 10.1145/3485447.3512262
research-article
Open access

Exposing Query Identification for Search Transparency

Published: 25 April 2022

Abstract

Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but also an understanding of the specific searches in which the content is surfaced. Identifying which queries expose a given piece of content in the ranked results is an important and relatively underexplored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization. Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the roles of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25. We then improve upon this approach through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries, and we conduct an empirical analysis of various practical aspects of approximate EQI. Overall, our work contributes a novel conception of transparency in search systems and computational means of achieving it.
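
The contrast between exact and approximate EQI described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors stand in for dual-encoder embeddings, dot-product scoring is an assumption, and the names `exact_exposing_queries` and `approximate_exposing_queries` are hypothetical. Exact identification runs every query through the ranker and checks whether the target document lands in the top k; the approximate variant reverses the roles and retrieves the queries nearest to the document.

```python
from typing import Dict, List

Vector = List[float]

# Toy embeddings (assumed for illustration; a real system would use a trained encoder).
DOCS: Dict[str, Vector] = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
QUERIES: Dict[str, Vector] = {"q1": [1.0, 0.1], "q2": [0.1, 1.0], "q3": [0.6, 0.6]}


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def top_k_docs(query_vec: Vector, docs: Dict[str, Vector], k: int) -> List[str]:
    """Forward retrieval: rank all documents for one query and keep the top k."""
    ranked = sorted(docs, key=lambda d: dot(query_vec, docs[d]), reverse=True)
    return ranked[:k]


def exact_exposing_queries(doc_id: str, queries: Dict[str, Vector],
                           docs: Dict[str, Vector], k: int) -> List[str]:
    """Exact EQI by brute force: one full retrieval per query in the log."""
    return [q for q, qv in queries.items() if doc_id in top_k_docs(qv, docs, k)]


def approximate_exposing_queries(doc_id: str, queries: Dict[str, Vector],
                                 docs: Dict[str, Vector], n: int) -> List[str]:
    """Approximate EQI: treat the document as the 'query' and rank the queries."""
    doc_vec = docs[doc_id]
    ranked = sorted(queries, key=lambda q: dot(queries[q], doc_vec), reverse=True)
    return ranked[:n]


exact = exact_exposing_queries("d1", QUERIES, DOCS, k=1)
approx = approximate_exposing_queries("d1", QUERIES, DOCS, n=2)
print(exact)   # ['q1']
print(approx)  # ['q1', 'q3']
```

In this toy data the reversed retrieval returns `q3` as a candidate even though `q3` does not expose `d1` at k=1, because similarity to the document ignores the competing documents. This is only meant to show why the reversed ranking is an approximation of the exact set.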

Cited By

  • (2024) Generative Information Systems Are Great If You Can Read. Proceedings of the 2024 Conference on Human Information Interaction and Retrieval, 165–177. DOI: 10.1145/3627508.3638345. Online publication date: 10-Mar-2024
  • (2024) Measuring the retrievability of digital library content using analytics data. Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24886. Online publication date: 19-Mar-2024
  • (2023) Retrievability Bias Estimation Using Synthetically Generated Queries. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 3712–3716. DOI: 10.1145/3583780.3615221. Online publication date: 21-Oct-2023
  • (2023) The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2848–2860. DOI: 10.1145/3539618.3591890. Online publication date: 19-Jul-2023

Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022
3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Exposing queries
2. Privacy
3. Search exposure
4. Transparency

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

WWW '22: The ACM Web Conference 2022
April 25–29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
