3.1 General Overview
Most papers on evaluation in recommender systems are published at RecSys (12 papers), the main conference on recommender systems, and at SIGIR (9 papers), the main conference on the closely related research topic of information retrieval (Figure 2). Notably, as can be seen from Figure 2, papers on the evaluation of recommender systems are published in a wide range of venues (12 conference venues and 13 journal venues), often with only a single paper per venue within the time frame of our review. The majority of papers on evaluation are published at conferences (39 papers), compared to 18 papers published in journals. Further, from Figure 2, we see that there is a clear concentration on two conference venues (RecSys and SIGIR), whereas papers on evaluation are particularly scattered across journal venues.
Concerning the temporal evolution of evaluation papers, we observe an increasing number of papers on the evaluation of recommender systems in the analyzed time frame 2017–2022 (Figure 3). In 2017, only 3 papers on the evaluation of recommender systems were published, while this number peaked in 2021 with 19 papers. While there is a continuous upward trend in conference venues, there is a sharp increase in journal venues (only one journal paper per year in 2017–2020, then 6 and 8 journal papers in 2021 and 2022, respectively). We note that two of the journal papers published in 2021 (Ferrari Dacrema et al. [40] and Mena-Maldonado et al. [71]) are extended versions of previously published conference papers (Ferrari Dacrema et al. [41] from 2019 and Mena-Maldonado et al. [70] from 2020, respectively). Further, the increase of journal papers on evaluation in 2021 and 2022 aligns with the COVID-19 pandemic, during which conferences were either canceled or held online, which may have led researchers to focus on journal submissions instead of conferences.
3.2 Type of Contribution
This section provides a detailed overview of the types of papers included in the literature review. The types as specified in Table
2 (i.e., benchmark, framework, metrics, model, and survey) were inferred according to the description in Section
2.3.
Figure
4 provides an overview of the number of papers per type of contribution in our sample. Most papers in our sample contribute models (19 papers); these provide a conceptual and empirical basis for improved recommendation or evaluation models. Considerably fewer papers (13) investigate metrics. Nine papers provide a survey, another 9 papers provide benchmarks of various approaches, and 7 papers propose frameworks.
Among the model papers, the majority focus on evaluation models, specifically on issues related to off-policy learning [
23,
31,
44,
58,
69,
73,
82,
95], which helps to obtain unbiased estimates for improved offline evaluation [
55]. Cañamares and Castells [
20] propose a probabilistic reformulation of memory-based collaborative filtering. While the core contribution of that work is a recommendation model, it also contributes to evaluation, because the experiments demonstrate that performance measurements may heavily depend on statistical properties of the input data, which the authors discuss in detail. With a probabilistic analysis, Cañamares and Castells [
21] address the question of whether popularity is an effective or misleading signal in recommendation. Their work illustrates the contradictions between the accuracy measured in common biased offline experimental settings and the accuracy measured with unbiased observations. Cañamares and Castells [
22] demonstrate the importance of item sampling in offline experiments. Based on a thorough literature review, Carraro and Bridge [
23] propose a new sampling approach to debiasing offline experiments. A second line of model papers considers user-related aspects as an important ingredient of recommender systems. For example, Frumerman et al. [
42] investigate the meaning of “rejected” recommendations in a more fine-grained manner. Symeonidis et al. [
91] consider short-term intentions to inform models. Jin et al. [
54] rely on a psychometric modeling method to study the key qualities of conversational recommender systems. In a large-scale user study, Chen et al. [
25] investigate how serendipity improves user satisfaction with recommendations; their results inform the modeling for recommendations. Ostendorff et al. [
75] study users’ preferences for link-based versus text-based recommendations using qualitative evaluation methods. Lu et al. [
65] investigate whether and how annotations made by external assessors (thus, not the recommender system’s users) are a viable source for preference labeling. Guo et al. [
47] study order effects in recommendation sequences, which has implications for the design of recommender systems. Said and Bellogín [
80] evaluate and model inconsistencies in user rating behavior to improve the performance of recommendation methods. These papers on user-related aspects have in common that each primarily studies user phenomena to improve recommendation models, while the discussion of the results also contributes to methodological issues regarding the evaluation of recommender systems.
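As background for the off-policy line of work cited above, the following minimal sketch illustrates the inverse propensity scoring (IPS) estimator that underlies many unbiased offline evaluation methods; it is our illustration rather than the formulation of any single surveyed paper, and the function and variable names are hypothetical.

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Estimate the target policy's expected reward from logs collected
    under a different logging policy, via importance weighting.

    rewards: observed feedback for the logged recommendations
    logging_probs: propensities of the logged actions under the logging policy
    target_probs: probabilities of the same actions under the target policy
    """
    weights = np.asarray(target_probs) / np.asarray(logging_probs)
    return float(np.mean(weights * np.asarray(rewards)))

# Toy example: three logged interactions with known propensities.
print(ips_estimate(rewards=[1, 0, 1],
                   logging_probs=[0.5, 0.25, 0.25],
                   target_probs=[0.4, 0.3, 0.3]))
```

The estimate is unbiased only when the logging propensities are known and non-zero for every action the target policy may take, which is why this line of work emphasizes unbiased data collection.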
Among papers focusing on metrics, one set of papers compares metrics (e.g., References [
70,
71,
77]), whereas other papers focus their analysis on a specific type of metric, for instance, sampling metrics (e.g., References [
60,
63]) and folding metrics (e.g., Reference [
94]). In a similar spirit, Bellogín et al. [
17] study biases in information retrieval metrics. Another line of metrics papers aims for harmonization of metrics (e.g., References [
2,
76]) or metric improvements (e.g., Reference [
64]). Balog and Radlinski [
11] propose how to measure the quality of explanations in recommender systems. Saraswat et al. [
84] propose combining both performance and user satisfaction metrics in offline evaluation, leading to improved correlation with desired business metrics. Finally, Diaz and Ferraro [
37] analyze and discuss metrics, leading to the proposal of an altogether metric-free evaluation method.
Papers discussing infrastructural aspects of recommender systems can be categorized into two types of framework papers: those that contribute a recommendation toolkit and those that propose a conceptual framework. The presented toolkits are iRec [
87],
Elliot [
10], LensKit [
39], and librec-auto [
88].
The framework by Bellogín and Said [
19] provides guidelines for reproducibility; their paper also provides an in-depth analysis to support their guidelines. Eftimov et al. [
38] propose a general framework that fuses different evaluation measures and aims at helping users to rank systems. Considering users’ expectations and perceptions, Belavadi et al. [
16] study the relationships between several user evaluation criteria.
Several papers provide an extensive critical evaluation across a (wide) set of approaches (Table
3). Dallmann et al. [
35] study sampling strategies for sequential item recommendation. They compare four methods across five datasets and find that both sampling strategies—uniform random sampling and sampling by popularity—can produce inconsistent rankings compared with the full ranking of the models. Ferrari Dacrema et al. [
41] and its extended version Ferrari Dacrema et al. [
40] perform a reproducibility study, critically analyzing the performance of 12 neural recommendation approaches in comparison to well-tuned, established, non-neural baseline methods. Their work identifies several methodological issues and finds that 11 of the 12 analyzed approaches are outperformed by far simpler, yet well-tuned, methods (e.g., nearest-neighbor or content-based approaches). In a similar vein, Latifi and Jannach [
61] perform a reproducibility study in which they benchmark Graph Neural Networks (GNNs) against an effective session-based nearest-neighbor method. This work, too, finds that the conceptually simpler method outperforms the GNN-based one. Anelli et al. [
9] perform a reproducibility study, systematically comparing 10 collaborative filtering algorithms (including approaches based on nearest-neighbors, matrix factorization, linear models, and techniques based on deep learning). Different to Ferrari Dacrema et al. [
40],
41], Anelli et al. [
9] benchmark all algorithms using the very same datasets (MovieLens-1M [
48], Amazon Digital Music [
74], and epinions [
92]) and the identical evaluation protocol. Based on their study on modest-sized datasets, they conclude—similarly to other works—that the latest models are often not the best-performing ones. Kouki et al. [
59] compare 14 models (8 baselines and 6 deep learning models) for session-based recommendations using 8 different popular evaluation metrics. After an offline evaluation, they selected the 5 best-performing algorithms and ran a second round of evaluation with human experts (a user study). Sun et al. [90] provide benchmarks across several datasets, recommendation approaches, and metrics; beyond that, this work introduces the toolkit daisyRec. Zhu et al. [
99] compare 24 models for click-through rate (CTR) prediction on multiple dataset settings. Their evaluation framework for CTR (including the benchmarking tools, evaluation protocols, and experimental settings) is publicly available. Latifi et al. [
62] focus on sequential recommendation problems, for which they compare the Transformer-based BERT4Rec method [
89] to nearest-neighbor methods, showing that the nearest-neighbor methods achieve comparable performance to BERT4Rec for the smaller datasets, whereas BERT4Rec outperforms the simple methods when the datasets are larger.
Table
4 provides an overview of survey papers on the evaluation of recommender systems. Some of the papers provide an extensive critical evaluation across a (wide) set of datasets and approaches on a specialized topic (e.g., References [
26,
40,
41,
59,
61]). Others provide a (systematic) review of the literature landscape on a specialized topic (e.g., References [
4,
5,
6,
36,
52,
83,
98]). The framework by Zangerle and Bauer [
96] is based on a survey of previous literature on the respective topic. Similarly, Zhao et al. [
98] start with a survey of the literature on aspects related to offline evaluation for top-N recommendation, which forms the basis for their systematic comparison of a selected set of 12 algorithms across eight datasets.
3.4 Datasets
Table
6 provides an overview of the datasets used in the papers. In total, our analysis contains 80 datasets. We distinguish between papers that use pre-collected, established datasets (65 datasets) and papers that propose a custom dataset (15 datasets, see the last row of Table
6). In a graphical overview, Figure
6 presents the number of papers relying on each dataset. Note that in this chart, we have aggregated different versions of a dataset into a single dataset category (for instance, we combined the widely used MovieLens datasets MovieLens 100k, 1M, 10M, 20M, 25M, Latest, and HetRec).
Table
6 and Figure
6 show that the dataset usage distribution for established (pre-collected) datasets is dominated by the MovieLens datasets. MovieLens datasets are used 32 times in the papers investigated, with MovieLens 1M being the most popular dataset (19 usages). Furthermore, the Amazon review datasets are used in 24 papers, followed by the LastFM dataset, appearing in the evaluation of 9 papers. We also observe that 43 (i.e., 66.15%) of the listed datasets are used in only a single paper. A further 8 datasets are used in 2 of the papers in our study, and another 14 datasets are employed in three or more papers.
Generally, the majority of papers relied on existing, pre-collected datasets: Of 146 dataset usages, 15 were custom datasets. These findings are in line with a previous analysis of datasets being used for recommender systems evaluation [
13], with a focus on the use of data pruning methods for the years 2017 and 2018. Generally, the high number of datasets employed at a low rate makes a direct comparison of recommendation approaches hardly possible, particularly given the vastly different characteristics of these datasets. In contrast, we also observe that established datasets, like the MovieLens dataset family, are used frequently, allowing for a better comparison of approaches.
A further aspect to consider regarding the comparability of approaches is dataset pre-processing. Typical pre-processing steps include removing users, items, or sessions with a low number of interactions or converting explicit ratings to binary relevance values. As Ferrari Dacrema et al. [
40] note in their survey on the reproducibility of deep learning recommendation approaches, it is important that all pre-processing steps are clearly stated in the paper and that the removal of data is justified and motivated. Also, pre-processing should be included in the published code. Inspecting the papers of our survey, we find that eight papers mention converting explicit rating data to binary relevance scores or song play counts to explicit ratings [
17,
23,
26,
37,
38,
62,
64,
Furthermore, users, items, or sessions with fewer (or, in some papers, more) interactions than a given threshold are removed in 12 papers [
9,
22,
26,
35,
42,
61,
62,
64,
77,
90,
91,
98]. Zhao et al. [
98] refer to this pre-processing step as n-core filtering. They perform a study on three aspects in the context of evaluating recommender systems: evaluation metrics, dataset construction, and model optimization. For dataset construction, they find that 44% of the papers in their study do not provide any information about pre-processing, and 34% of the papers apply n-core filtering with n set to 5 or 10. Sun et al. [90] also study the impact of different thresholds for filtering users and items. Here, it is important to note that, for instance, the MovieLens datasets are already pre-processed to some extent, as they only include users with at least 20 interactions.
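To make these recurring pre-processing steps concrete, the following minimal pandas sketch illustrates rating binarization and iterative n-core filtering; the column names (user, item, rating), the relevance threshold of 4.0, and n = 5 are illustrative assumptions rather than settings prescribed by any surveyed paper.

```python
import pandas as pd

def binarize_ratings(df: pd.DataFrame, threshold: float = 4.0) -> pd.DataFrame:
    """Convert explicit ratings (e.g., 1-5 stars) to binary relevance."""
    df = df.copy()
    df["relevance"] = (df["rating"] >= threshold).astype(int)
    return df

def n_core_filter(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Iteratively drop users and items with fewer than n interactions.

    Removing sparse users can push items below the threshold (and vice
    versa), so the filter is repeated until the data no longer change.
    """
    while True:
        before = len(df)
        df = df[df.groupby("user")["item"].transform("size") >= n]
        df = df[df.groupby("item")["user"].transform("size") >= n]
        if len(df) == before:
            return df
```

Reporting the threshold n and whether the filter was applied iteratively or in a single pass is exactly the kind of detail that, as noted above, is often missing from papers.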
In the following, we focus our analysis on datasets that have been used at least three times in the surveyed papers. Table
7 provides an overview of these 12 datasets, where we list the domain, the feedback type (i.e., whether the dataset features explicit or implicit feedback; in the case of explicit ratings, we also add the rating scale), the size of the dataset captured by the number of interactions, and the type of side information contained. Notably, 5 of the 12 most popular datasets stem from the movie or music domain. In terms of the type of ratings contained, the citeulike and LastFM datasets provide implicit feedback (0 or 1), while the other datasets provide explicit ratings on a scale from 0 (or 1) to 5 stars. Interestingly, when inspecting the size of the datasets, the most popular datasets appear to be relatively small, with the most popular dataset (MovieLens 1M) holding approximately one million interactions.
Another interesting aspect when investigating the choice of datasets for the evaluation of recommender systems is the number of different datasets used by individual papers. Evaluating a recommender system on diverse datasets is critical to gaining insights into the generalizability and robustness of the proposed system. When inspecting the number of different datasets used in the experiments, we find that 26 papers (45.61% of all papers contained in the study) rely on a single dataset, 5 papers (8.77%) rely on two datasets, 7 papers (12.28%) use three datasets, and another 10 papers (17.54%) use four or more datasets. Of these, 3 papers used more than 10 different datasets: In extensive experiments, Ferrari Dacrema et al. [
41] benchmark deep learning-based recommender systems against a set of relatively simple baselines. Diaz and Ferraro [
37] showcase a metric-free evaluation method for recommendation and retrieval based on a set of 16 datasets. Chin et al. [
26] conduct an empirical study on the impact of datasets on the evaluation outcome and the resulting conclusions. Their study shows a different distribution of dataset popularity in recommender systems evaluation than we observe in the analysis at hand. However, we conjecture that this is due to the differing inclusion criteria of the two studies. For instance, Chin et al.'s study is restricted to implicit-feedback top-k recommendation tasks. Notably, our analysis also contains 9 papers (15.79%) that did not use any dataset. The reason here is that most of these papers are surveys [
4,
5,
6,
36,
52,
83,
96]. Furthermore, Ekstrand [
39] describes the Python LensKit software framework and Sonboli et al. [
88] describe the librec-auto toolkit.
Our analysis contains 13 versions of the Amazon review datasets, seven different versions (or subsets) of the MovieLens dataset, and two versions of the citeulike dataset. Considering the usage of different versions of the same dataset, we find that five papers use different versions of the same aggregated dataset. In their survey on dataset usage, Chin et al. [
26] use eight versions of the Amazon reviews dataset and three versions of the MovieLens dataset (of a total of 15 individual datasets used). In their reproducibility study, Ferrari Dacrema et al. [
40] used four versions of the MovieLens datasets, both versions of the citeulike datasets, and two versions of the Amazon reviews dataset (of 17 individual datasets used). In their prior reproducibility study, Ferrari Dacrema et al. [
41] used two versions of the MovieLens dataset.
We further investigate which datasets are jointly used in evaluations. For this analysis, we analyze the sets of datasets co-used in the papers (note that the co-usage of individual datasets is already presented in Table
6). We employed a frequent itemset approach (i.e., the Apriori algorithm [
3]) and present the results in Table
8. This table shows the sets of datasets employed together and the number of papers that co-use them. The most frequently combined datasets are LastFM and MovieLens 1M (appearing together in seven papers). The MovieLens 1M dataset appears in pairs with the NetflixPrize and the Yelp datasets in five papers. Further down the list, we find not only pairs but also triples of datasets that are jointly used for evaluation in three papers. Unsurprisingly, the MovieLens datasets and other popular datasets are dominant. This aspect has also been raised by Chin et al. [
26] and our results are in line with these previous findings.
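Such a co-usage analysis is straightforward to reproduce. The sketch below applies the Apriori implementation from the mlxtend library to a hypothetical one-hot paper-dataset matrix; the toy data and the support threshold are illustrative assumptions.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

# Hypothetical one-hot matrix: one row per paper, one column per dataset,
# True where the paper uses that dataset in its evaluation.
usage = pd.DataFrame([
    {"MovieLens 1M": True,  "LastFM": True,  "Yelp": False},
    {"MovieLens 1M": True,  "LastFM": True,  "Yelp": True},
    {"MovieLens 1M": True,  "LastFM": False, "Yelp": True},
])

# Mine all dataset sets co-used in at least two of the three papers
# (min_support = 2/3); use_colnames keeps the dataset names in the output.
itemsets = apriori(usage, min_support=2 / 3, use_colnames=True)
print(itemsets.sort_values("support", ascending=False))
```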
Inspecting the papers that use custom datasets, we observe that the majority of these papers feature (or create) a custom dataset for one of three distinct reasons. The first is that user surveys [2, 25] or user studies [11, 47, 54, 75] were conducted, where the result of the study itself is presented as a novel dataset. For instance, Chen et al. [
25] perform a user study to gain a deeper understanding of the impact of serendipity on user satisfaction on a popular mobile e-commerce platform in China. The second reason is the recent trend toward counterfactual (off-policy) learning, which requires an unbiased, missing-at-random dataset [
22,
31,
44,
58,
73]. The third reason is access to proprietary data: several papers perform evaluations based on data provided by private-sector businesses [
44,
59,
69,
73,
77,
91].
3.5 Metrics
The reviewed literature features an extensive range of datasets, as depicted in Section
3.4. This variety is also mirrored in the selection of evaluation metrics. We divide the metrics into two categories: conventional metrics widely utilized in the field and specific metrics proposed for the unique problem addressed within a certain paper; we refer to the latter as custom metrics (see the final row of Table
9). A visual representation of the most frequently used metrics—those employed in at least two papers within our surveyed literature—is provided in Figure
7.
Traditionally, recommender systems research has relied on a standard set of metrics, including Precision, Recall, and normalized Discounted Cumulative Gain (nDCG) [
18,
45]. These metrics have gained significant popularity in the examined literature. However, our analysis also uncovers the existence of a diverse array of less prevalent metrics, as illustrated in Table
9. In essence, a selected group of metrics is featured prominently: Precision is employed in 22 of the 57 reviewed papers (approximately 39%), nDCG in 20 papers (around 35%), and Recall in 17 papers (nearly 30%). These findings resonate with the notion that ranking and relevance metrics align more closely with actual user preferences than a minimized rating prediction error does [34, 45]. Yet, metrics associated with rating prediction, such as RMSE, MAE, and MSE, still figure in a considerable portion of the reviewed literature, appearing in a total of 7 papers (about 12%). While the vast majority of papers do not employ rating prediction metrics, the fact that more than 1 in 10 papers uses them contradicts the general consensus in the recommender systems research field, which holds that rating prediction is an inadequate surrogate for actual user preference [
8].
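For reference, the three dominant metrics are easy to compute under binary relevance; the generic sketch below is our illustration (the function names and the binary-relevance assumption are not taken from any particular paper's protocol).

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: the DCG of the ranking divided by the
    DCG of an ideal ranking that places all relevant items first."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2)
               for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0
```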
Figure
7 portrays the disparity in popularity among various metrics. Precision, nDCG, and Recall are roughly twice as favored as any of the other top metrics. These three metrics epitomize the core characteristics of recommender and information retrieval systems, notably relevance and ranking.
Furthermore, it is worth mentioning that of the 40 metrics employed in the reviewed papers, 23 (approximately 58%) are each applied in just a single paper. Some of these uniquely applied metrics are specific to individual papers that utilize an extensive range of metrics. For example, Silva et al. [
87] introduce metrics such as user-coverage, EPC, EPD, Gini, and Hits, while Anelli et al. [
9] introduce various non-accuracy metrics like Average Coverage of Long Tail, Average Percentage of Long Tail, Expected Free Discovery, and Popularity-based Ranking-based Equal Opportunity, among others. Moreover, five metrics appear in only two papers each, and a single metric is utilized in three papers. The variation in metric usage complicates the comparison and benchmarking across different papers, as emphasized in the discussion on dataset usage (see Section
3.4).
Similarly, we scrutinize the number of metrics utilized per paper. It is crucial to emphasize that the quantity of metrics employed does not necessarily reflect the quality or completeness of a paper or recommender system. Nonetheless, the use of multiple metrics can yield insights into different facets of a system. When analyzing our data, we discover that 18 papers (32%) use only a single metric and, surprisingly, 10 papers (18%) do not use any metrics whatsoever. Although the majority of papers that abstain from using metrics are categorized as literature reviews (refer to Table 4), there are exceptions. Furthermore, 9 papers (16%) apply two metrics, while 5 papers (9%) employ three metrics. In total, 42 papers (74%) utilize three or fewer metrics. With this understanding, we now probe into the variety of metrics. In Table
10, we present a classification of evaluation metrics into overarching categories that correspond to specific recommendation tasks, like ranking, rating prediction, and relevance. Despite the absence of a universally accepted classification of metrics in the recommender systems research field, our categorization resonates with the general application scenarios of recommendations and the desired attributes of a recommender system.
In the context of metrics, it is interesting to explore the combinations of metric types, that is, the characteristics being measured in tandem. Given that recommendations apply across diverse contexts, the extensive array of metrics used mirrors the various goals pursued by recommendation applications and the stakeholders involved. By concentrating on metrics adopted in three or more papers, we examine the employed combinations in the surveyed literature (refer to Table
11). A key observation from this table is that the majority of combinations encompass ranking and relevance metrics, while combinations incorporating other metric types are less prevalent. This observation contrasts with current discussions in the recommender systems community, with the only beyond-accuracy metric appearing in the table being item coverage. This indicates that beyond-accuracy metrics are seldom used in combination with other metrics, including other beyond-accuracy metrics such as novelty, fairness, or any of the metrics in the bottom row of Table
10. A similar comment can be made regarding the utilization of success rate metrics.
Additionally, in agreement with the discourse within the recommender systems community, particularly regarding rating prediction, it is worth mentioning that no rating prediction error metrics are present in this table. This could signal a decrease in the overall usage of these metrics. Even acknowledging that some papers use these metrics (as noted above), they do so without combining them with the more widely accepted evaluation tools and metrics.