DOI: 10.1145/3673791.3698428
research-article
Open access

Pessimistic Evaluation

Published: 08 December 2024

Abstract

Traditional evaluation of information access systems has focused primarily on average utility across a set of information needs (information retrieval) or users (recommender systems). In this work, we argue that evaluating only with average metric measurements assumes utilitarian values not aligned with traditions of information access based on equal access. We advocate for pessimistic evaluation of information access systems, focusing on worst-case utility. These methods are (i) grounded in ethical and pragmatic concepts, (ii) theoretically complementary to existing robustness and fairness methods, and (iii) empirically validated across a set of retrieval and recommendation tasks. These results suggest that pessimistic evaluation should be included in existing experimentation processes to better understand the behavior of systems, especially when concerned with principles of social good.
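The contrast between average and worst-case utility can be illustrated with a minimal sketch. This is not the authors' method or code: the function names, the `fraction` parameter, and the per-query scores below are hypothetical and chosen only to show how the two views can rank the same pair of systems differently.

```python
import math
from statistics import mean


def average_utility(scores):
    """Conventional aggregate: mean metric value over all queries or users."""
    return mean(scores)


def worst_case_utility(scores):
    """Pessimistic aggregate: the metric value of the worst-served query or user."""
    return min(scores)


def worst_fraction_utility(scores, fraction=0.25):
    """Softer pessimistic aggregate: mean over the lowest-scoring fraction.

    `fraction` is a hypothetical knob for this sketch; at least one score
    is always included.
    """
    k = max(1, math.ceil(fraction * len(scores)))
    return mean(sorted(scores)[:k])


# Hypothetical per-query metric values (for example NDCG) for two systems.
system_a = [0.9, 0.9, 0.9, 0.1]  # strong on most queries, fails one badly
system_b = [0.7, 0.7, 0.7, 0.6]  # slightly weaker on average, no failures

for name, scores in (("A", system_a), ("B", system_b)):
    print(
        name,
        round(average_utility(scores), 3),
        round(worst_case_utility(scores), 3),
        round(worst_fraction_utility(scores, fraction=0.5), 3),
    )
```

With these made-up scores, the average favors system A while both pessimistic aggregates favor system B, which is the kind of disagreement an average-only evaluation would hide.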

Published In

SIGIR-AP 2024: Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region
December 2024
328 pages
ISBN: 9798400707247
DOI: 10.1145/3673791
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. evaluation
  2. fairness
  3. information retrieval
  4. recommender systems

Qualifiers

  • Research-article

Conference

SIGIR-AP 2024