research-article

A Framework and Toolkit for Testing the Correctness of Recommendation Algorithms

Published: 07 March 2024

Abstract

Evaluating recommender systems adequately and thoroughly is an important task. Significant efforts are dedicated to proposing metrics, methods, and protocols for doing so. However, there has been little discussion in the recommender systems literature on the topic of testing. In this work, we adopt and adapt concepts from the software testing domain, e.g., code coverage, metamorphic testing, or property-based testing, to help researchers detect and correct faults in recommendation algorithms. We propose a test suite that can be used to validate the correctness of a recommendation algorithm, and thus identify and correct issues that can affect the performance and behavior of these algorithms. Our test suite contains both black box and white box tests at every level of abstraction, i.e., system, integration, and unit. To facilitate adoption, we release RecPack Tests, an open-source Python package containing template test implementations. We use it to test four popular Python packages for recommender systems: RecPack, PyLensKit, Surprise, and Cornac. Despite the high test coverage of each of these packages, we still uncover undocumented functional requirements and even some bugs. This validates our thesis that testing the correctness of recommendation algorithms can complement traditional methods for evaluating recommendation algorithms.
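
To make the idea of metamorphic testing concrete, the following is a minimal sketch and not the RecPack Tests API. It assumes a hypothetical toy recommender, `popularity_top_k`, and two illustrative metamorphic relations chosen for this example: shuffling the interaction log, or duplicating every interaction, should leave a popularity ranking unchanged. All function names and the sample data are assumptions introduced here for illustration.

```python
import random
from collections import Counter


def popularity_top_k(interactions, k):
    """Hypothetical toy recommender: return the k most-interacted-with items.

    Ties are broken by item id so that the output is deterministic.
    """
    counts = Counter(item for _user, item in interactions)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:k]


def check_shuffle_invariance(recommend, interactions, k, seed=0):
    """Relation 1: reordering the interaction log must not change the output."""
    shuffled = list(interactions)
    random.Random(seed).shuffle(shuffled)
    return recommend(interactions, k) == recommend(shuffled, k)


def check_duplication_invariance(recommend, interactions, k):
    """Relation 2: repeating every interaction doubles all counts,
    so the relative order of items must stay the same."""
    doubled = list(interactions) + list(interactions)
    return recommend(interactions, k) == recommend(doubled, k)


# Illustrative (user, item) pairs; not taken from any dataset in the article.
interactions = [
    ("u1", "a"), ("u2", "a"), ("u3", "a"),
    ("u1", "b"), ("u2", "b"),
    ("u3", "c"),
]

shuffle_ok = check_shuffle_invariance(popularity_top_k, interactions, k=2)
duplication_ok = check_duplication_invariance(popularity_top_k, interactions, k=2)
```

Neither check needs to know the correct recommendations in advance; each only compares two runs of the same algorithm on related inputs, which is what makes this style of test useful when no oracle is available.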

Cited By

  • (2024) Introduction to the Special Issue on Perspectives on Recommender Systems Evaluation. ACM Transactions on Recommender Systems 2:1 (1-5). DOI: 10.1145/3648398. Online publication date: 7-Mar-2024
  • (2023) Introducing LensKit-Auto, an Experimental Automated Recommender System (AutoRecSys) Toolkit. Proceedings of the 17th ACM Conference on Recommender Systems (1212-1216). DOI: 10.1145/3604915.3610656. Online publication date: 14-Sep-2023

Published In

ACM Transactions on Recommender Systems, Volume 2, Issue 1
March 2024
346 pages
EISSN:2770-6699
DOI:10.1145/3613520

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 March 2024
Online AM: 20 April 2023
Accepted: 19 March 2023
Revised: 14 March 2023
Received: 15 December 2022
Published in TORS Volume 2, Issue 1

Author Tags

  1. Recommender systems evaluation
  2. automated testing
  3. correctness
  4. toolkit
  5. open-source

Qualifiers

  • Research-article

