research-article
Open access

Revisiting Negative Sampling vs. Non-sampling in Implicit Recommendation

Published: 25 February 2023 Publication History

Abstract

Recommendation systems play an important role in alleviating the information overload issue. Generally, a recommendation model is trained to discern between positive (liked) and negative (disliked) instances for each user. However, under the open-world assumption, there are only positive instances but no negative instances from users’ implicit feedback, which poses the imbalanced learning challenge of lacking negative samples. To address this, two types of learning strategies have been proposed: the negative sampling strategy and the non-sampling strategy. The first strategy samples negative instances from missing data (i.e., unlabeled data), while the non-sampling strategy regards all the missing data as negative. Although learning strategies are known to be essential for algorithm performance, the in-depth comparison of negative sampling and non-sampling has not been sufficiently explored so far. To bridge this gap, we systematically analyze the role of negative sampling and non-sampling for implicit recommendation in this work. Specifically, we first theoretically revisit the objective of negative sampling and non-sampling. Then, with a careful setup of various representative recommendation methods, we explore the performance of negative sampling and non-sampling in different scenarios. Our results empirically show that although negative sampling has been widely applied to recent recommendation models, it is non-trivial for uniform sampling methods to show comparable performance to non-sampling learning methods. Finally, we discuss the scalability and complexity of negative sampling and non-sampling and present some open problems and future research topics that are worth further exploration.

1 Introduction

In the era of exponential information growth, recommender systems have been widely applied in today’s web platforms to alleviate the information overload issue and help users seek desired information and items [4, 5, 55]. The key to personalized recommendation is modeling users’ preferences according to their historical interactions. There are two main challenges: (1) how to design an effective algorithm to model preference and (2) how to train the algorithm with limited user feedback. For the first challenge, the rapid development of neural networks in recent years has produced many new methods that achieve significant improvements. For the second challenge, fewer attempts have been made. In fact, although many advanced recommendation models have been proposed recently, the learning strategy often becomes the bottleneck of recommendation performance.
Generally, it is hard to collect explicit user preference over items in real scenarios, so user implicit feedback is widely adopted to model user interests, like clicks in news portals, purchases in e-commerce, and views on online video platforms. In implicit feedback data, the observed interaction represents a user’s positive preference for an item, while the other unobserved items are unlabeled. Only the positive feedback is observed; the negative feedback is mixed with missing values in unobserved data. Meanwhile, users usually interact with only a small number of items, compared to a huge number of items in real systems. Therefore, due to the lack of reliable negative data, learning a recommender system from implicit feedback data is very challenging. In implicit data, non-interacted items do not necessarily mean the user dislikes the items. For example, an unobserved user-item interaction can be caused by the user not seeing the item or the user seeing but not liking it. Thus, in unobserved data true-negative and unlabeled potential positive examples are mixed together. How to find out and utilize informative unobserved examples becomes the key to optimize the learning performance.
To address the problem of lacking negative samples when learning from implicit feedback data, two representative learning strategies have been proposed. The first strategy, named Negative Sampling [6, 30, 52], samples several negative items from the unlabeled data. The second strategy, named Non-sampling [9, 32, 33], takes all unlabeled items as negative and assigns them lower weights than positive samples. Both strategies have their own advantages and disadvantages: negative sampling is more efficient due to the limited number of training instances, but its performance may be affected by the low quality of sampled negative examples and slow convergence [9, 41, 70, 77]; the non-sampling strategy can generally achieve better performance because all the training data is fully utilized, but inefficiency can be an issue [10, 32].
Many studies have been done to improve negative sampling learning and non-sampling learning, respectively. Previous work on negative sampling mainly focused on replacing uniform sampling with other sampling methods [17, 41, 51, 61] to improve the quality of negative samples, such as drawing negative items from a popularity-based distribution [41] or from Generative Adversarial Network (GAN)-based models [35, 49, 61]. In another line of research, a number of methods have also been developed for non-sampling optimization to improve the training efficiency, including Alternating Least Squares (ALS)-based methods [32, 33] and mini-batch Stochastic Gradient Descent (SGD)-based methods [9, 10, 70].
To the best of our knowledge, although learning strategies are known to be important and there are many studies on improving negative sampling and non-sampling, respectively, the in-depth comparison of negative sampling and non-sampling in implicit recommendation has not been sufficiently explored so far. To this end, we systematically analyze the role of negative sampling and non-sampling in implicit recommendation through both theoretical considerations and experimental evaluations. Specifically, we first revisit the objective of negative sampling and non-sampling. Then, we conduct thorough comparisons between negative sampling and non-sampling strategies with a careful setup when working with various representative methods. Finally, we discuss the scalability and complexity of negative sampling and non-sampling and present some open problems and research topics that are worth further exploration.
Our results empirically show that although negative sampling has been widely applied in recent recommendation models [6, 27, 30, 63, 65], there is still a large gap for the widely used uniform sampling to show comparable performance to non-sampling learning methods. To reduce the gap, it is important to improve the quality of sampling by applying advanced negative samplers. Moreover, it is shown that a uniform weighted non-sampling learning method even outperforms many advanced sampling-based methods. Based on the above observations, we argue that the poor performance of negative sampling can be misleading if one does not recognize that they are biased, as existing neural recommendation methods typically rely on negative sampling for efficient optimization, and recently it has become common in research papers that only sampling-based baselines are used to compare with newly proposed techniques [27, 65, 68, 69]. Previous work in the area of information retrieval and recommender systems has found that improvements achieved with some complex models were only observed because the chosen baseline methods were weak or not well trained [14, 15, 44, 54]. We believe this work presents a sanity check for negative sampling and non-sampling learning in implicit recommendation, providing references for future works related to recommendations with implicit feedback data.
The main contributions of this article are summarized as follows:
(1) We theoretically revisit the objective and risk of negative sampling and non-sampling learning in implicit recommendation.
(2) We conduct thorough comparisons between representative negative sampling and non-sampling training strategies with various recommendation algorithms on their best settings on different datasets.
(3) Experimental results show that popular and representative negative-sampling-based recommendation methods generally do not perform better than simple baselines with non-sampling learning.
(4) We discuss the scalability and complexity of negative sampling and non-sampling and present some open problems and research topics that are worth further exploring.

2 Related Work

2.1 Item Recommendation

Early recommendation methods [36, 37] were mainly designed to model users’ explicit feedback, such as movie ratings. However, implicit feedback is easier to collect in real-world scenarios, such as clicks in news portals, purchases in e-commerce, and views on online video applications. Many methods have been proposed for recommendation with implicit feedback data [10, 30, 33, 52]. Since true negatives are mixed with unlabeled positives, these methods differ from each other in how they use unobserved data. Specifically, Hu et al. [33] proposed a non-sampling-based method, WMF, which treats all unobserved items as negative samples and assigns them a lower constant weight. Several later efforts [32, 42] improved WMF by applying different weighting strategies based on whether the unobserved items are indeed negative. Different from non-sampling methods, Rendle et al. [52] proposed a sampling-based method, BPR, which optimizes the MF model based on the relative preference of a user for positive over negative items. The pairwise learning strategy with negative sampling has been widely adopted to optimize recommendation models [6, 12, 30, 59] and has become a dominant technique in recommendation.
With the development of deep learning techniques, many works have exploited different neural networks for recommender systems. The work [30] presented a Neural Collaborative Filtering (NCF) framework to jointly learn a matrix factorization and a feedforward neural network for implicit Top-K recommendation. NCF has been widely extended for different recommendation scenarios [12, 26, 64]. In recent years, it has become a trend to explore the application of advanced deep learning architectures for recommendation tasks, such as leveraging attention mechanisms [5, 12, 71], Recurrent Neural Networks (RNNs) [46], Convolutional Neural Networks (CNNs) [28, 81], GANs [29, 61], and Graph Neural Networks (GNNs) [20, 65]. Specifically, [65] proposed NGCF to model high-order connections by propagating embeddings on the user-item interaction graph. LightGCN [27] simplifies NGCF by removing the feature transformation and non-linear activation function, which improves recommendation performance. Besides the above methods that mainly use negative sampling for model learning, there are also some neural recommendation models based on non-sampling learning. For example, Chen et al. derived a flexible non-sampling loss and designed several efficient non-sampling neural models for various recommendation scenarios [7, 9, 10, 11].

2.2 Model Training in Recommendation

There are two learning strategies that have been proposed before to learn from implicit feedback: negative sampling strategy [6, 30, 52] and non-sampling strategy [16, 33, 42, 67].
The negative sampling strategy samples negative instances from the unobserved data. Through sampling, the scale of training samples is greatly reduced; therefore, the training process is more efficient [30]. Negative sampling has been widely applied in many recommendation models, including traditional recommendation models like BPR [52] and neural models like NCF [30]. The most popular and widely used sampling strategy is uniform sampling (also called random sampling). However, uniform sampling usually samples uninformative training instances, which contribute little to updating the model. To deal with this problem, many methods have been proposed in recent literature to replace uniform sampling with better samplers [17, 41, 51, 61] that improve the quality of negative samples. For example, [41, 76] proposed to sample negative examples based on the popularity of items. [17, 51] proposed to sample hard negative instances with higher prediction scores. [74] proposed a sampling-bias-corrected algorithm for estimating item frequency from streaming data. [72] proposed to use both batch and uniformly sampled negatives to tackle the selection bias of implicit recommendation. There are also some GAN-based methods in which the sampling probability adaptively evolves by optimizing adversarial objectives [29, 35, 49, 61]. Another kind of method samples the negative instances based on the structure of the graph. For example, [66] incorporated a knowledge graph into the negative sampling process to sample high-quality negative items. [75] sampled negative nodes according to their PageRank scores. [34] proposed a MixGCF method with hop mixing and positive mixing strategies for GNN-based recommendation models. However, the above methods suffer from inefficiency issues since the sampling distribution changes dynamically and usually requires scoring all instances.
The non-sampling strategy treats all the unobserved data as negative, while assigning them a lower weight than positive examples. For example, WMF [33] assigned all unobserved entries a uniform weight. In EALS [32] and ExpoMF [42], the weight of unobserved entries depends on item popularity, based on the assumption that popular items are more likely to be seen by users, so they should be assigned higher weights as negatives. The non-sampling strategy has been shown to leverage the whole data with a potentially better coverage [9, 32, 42, 70], but inefficiency can be an issue. To this end, a number of methods have been developed to accelerate the learning process, including batch-based ALS [1, 32, 33] and mini-batch SGD methods [9, 70]. However, this kind of method is only suitable for recommendation models with a linear prediction layer [1, 10, 70] and a regression loss function. Non-sampling methods have been widely applied in many traditional recommendation models [32, 33, 42] but have few applications in neural recommendation models. Recently, Chen et al. [10] proposed ENMF, which is a representative neural non-sampling recommendation model. To the best of our knowledge, although there are some works on improving negative sampling and non-sampling respectively, the in-depth comparison between them is left insufficiently explored. This is the main concern of this work.
The above methods are all based on discriminative modeling, which explicitly aims to distinguish positive user-item interactions from their negative counterparts. Recently, there has been another line of research that utilizes self-supervised learning for implicit recommendation training [39, 82]. Given a positive user-item interaction \((u, i)\) , the idea is to make the representations of u and i similar to each other to encode the preference information. To address the potential model collapse problem, this kind of method generally applies two distinct encoder networks (i.e., online and target networks). Compared to discriminative modeling methods, self-supervised learning methods are more prone to collapsing into a trivial constant solution, so their hyper-parameters need to be carefully tuned [24, 39]. To make our work more focused, the methods discussed in this article are all based on discriminative modeling. We leave the exploration of self-supervised learning as future work.

2.3 Negative Sampling in Other Domains

Negative sampling has also been widely used in other domains and tasks of machine learning, such as graph embedding [73], word embedding [45], and network embedding [79]. For example, Word2Vec [45] draws negative samples according to word frequency, which is similar to the sampling process in recommendation. Later works on network and graph embedding [25, 50, 58] follow this setting. It has also been observed that negative instances with large scores (hard negatives) are more useful for model training [79]. Another recent work on negative sampling for graph learning shows that the negative sampling distribution should be positively but sub-linearly correlated to the positive sampling distribution [73]. There are also some GAN-based methods for the above tasks [2, 3, 22]. For non-sampling-based methods, [8, 40] proposed efficient non-sampling methods for learning knowledge graph embeddings.
Since the implicit recommendation is a different problem, the reliability of sampled negative instances is much harder to guarantee. In this article, we mainly focus on the comparison of negative sampling and non-sampling for the recommendation task and leave the exploration of other tasks as future work.

3 Understanding Negative Sampling and Non-sampling

In this section, we first present the problem formulation of recommendation with implicit feedback data (implicit recommendation), then revisit negative sampling and non-sampling regarding both objective and risk.

3.1 Implicit Recommendation

Table 1 shows the key notations used in this article. We denote user set U (including M users) and item set I (including N items). The implicit feedback data is denoted as an \(M \times N\) binary matrix Y, where \(y(u,i) \in \lbrace 0,1\rbrace\) denotes whether user u has interacted with item i or not. We use \(\mathcal {Y}\) to denote the set of observed entries in Y. Moreover, \({\bf I}_u\) is used to denote the positive item set of user u. Given a target user u, the task of implicit recommendation is to learn user u’s preference based on the implicit feedback data and recommend items that might be of interest to u.
Table 1. Notation
Symbol | Description
\(M,N\) | number of users and items
\({\bf U}\) | user set
\({\bf I}\) | item set
\({\bf Y}\) | user-item interactions
\({\bf p}_u\) | latent factor of user u
\({\bf q}_i\) | latent factor of item i
\(\mathcal {Y}\) | the user-item pairs whose values are non-zero
\({\bf I}_u\) | the positive item set of user u
\(c(u,i)\) | the weight of entry \(y(u,i)\)
\(d\) | number of latent factors

3.2 Negative Sampling

An implicit recommendation model with negative sampling learning strategy generally has three important components: scoring function \(\hat{y}\) , objective function \(\mathcal {L}\) , and negative sampling strategy \(p_{ns}\) [17]. The scoring function \(\hat{y}({\bf p}_u, {\bf q}_i, \Theta)\) predicts the preference of user \(u \in {\bf U}\) to item \(i \in {\bf I}\) based on u’s latent factor \({\bf p}_u\) and i’s latent factor \({\bf q}_i\) . It has been widely studied in previous work, including various models based on Matrix Factorization (MF) [29, 52, 61] and neural networks [26, 27, 30, 63, 65].
For objective function, Bayesian Personalized Ranking (BPR) [52] is widely used in the negative-sampling-based recommendation model, and is also the most representative learning method. The formulation of BPR is as follows:
\begin{equation} \begin{split}(u, i, j) \in {\bf Y}: \Leftrightarrow i \in {\bf I}_u \wedge j \in {\bf I} \backslash {\bf I}_u\\ \mathcal {L}=-\sum _{(u, i,j) \in {\bf Y}} \ln \sigma (\hat{y}(u, i)-\hat{y}(u, j)), \end{split} \end{equation}
(1)
where \(\sigma (x)= \frac{1}{1+e^{-x}}\) is the sigmoid function and \({\bf Y}\) is the training data. The negative instance \((u, j)\) is sampled from a specific distribution \(p_{ns}(j | u)\) . \(\sigma (\hat{y}(u, i)-\hat{y}(u, j))\) models the likelihood that user u prefers item i to item j, which approaches 1 when \(\hat{y}(u,i) \gg \hat{y}(u,j)\) . Minimizing \(\mathcal {L}\) is equivalent to making the scores of positive user-item interactions larger than those of negative user-item interactions. \(\ln \sigma (x)\) is a differentiable surrogate of the Heaviside function, so BPR approximately optimizes the ranking statistic AUC [41].
As the number of training pairs \(|{\bf Y}|\) is usually very large, learning algorithms are typically based on SGD. For a single triple \((u, i, j)\) , the gradient of the BPR loss with respect to the model parameters is
\begin{equation} \begin{split}\frac{\partial \mathcal {L}(u,i,j)}{\partial \Theta }=-(1-\sigma (\hat{y}(u, i)-\hat{y}(u, j))) \frac{\partial (\hat{y}(u, i)-\hat{y}(u, j))}{\partial \Theta }. \end{split} \end{equation}
(2)
The gradient depends on how well the scoring model discriminates between the positive item i and the negative item j for user u. The multiplicative factor \(1-\sigma (\hat{y}(u, i)-\hat{y}(u, j))\) is a probability, which is close to 0 if \(\hat{y}(u, i)\) is correctly larger than \(\hat{y}(u, j)\) .
From the above gradient, we analyze the influence of negative sampling for both correctly sampled examples (i.e., examples that are really negative to user u) and false examples (i.e., examples that are positive to user u in the test set). For a correctly sampled negative example j, positive item i can be easily distinguished from item j (i.e., \(\sigma (\hat{y}(u, i)-\hat{y}(u, j)) \rightarrow 1\) ) when the model is well trained, so the triple \((u, i, j)\) contributes little to model learning because its gradient vanishes ( \(\frac{\partial \mathcal {L}(u,i,j)}{\partial \Theta } \rightarrow 0\) ). This issue is particularly prominent when the sampling distribution is uniform. In recommender systems, item popularity is typically non-uniformly distributed, and overall positive observations follow a long-tailed distribution. It is very likely that the model score \(\hat{y}(u,j)\) of a uniformly sampled item j is smaller than \(\hat{y}(u,i)\) , and thus the gradient magnitude is also small. Therefore, the widely used uniform sampling strategy generally converges very slowly and struggles to reach optimal performance.
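The vanishing-gradient argument can be checked numerically. The following is a minimal sketch, assuming a dot-product (MF) scoring function \(\hat{y}(u,i)={\bf p}_u^{\top }{\bf q}_i\) and hand-picked toy vectors; the function name `bpr_loss_and_grads` and all numbers are illustrative and not taken from the article. Differentiating the per-triple loss \(-\ln \sigma (x)\) with \(x=\hat{y}(u,i)-\hat{y}(u,j)\) gives \(-(1-\sigma (x))\,\partial x/\partial \Theta\), so the factor \(1-\sigma (x)\) sets the size of every update.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bpr_loss_and_grads(p_u, q_i, q_j):
    """BPR loss and gradients for one triple (u, i, j), assuming y_hat = p_u . q."""
    x = p_u @ q_i - p_u @ q_j        # y_hat(u, i) - y_hat(u, j)
    loss = -np.log(sigmoid(x))
    scale = 1.0 - sigmoid(x)         # factor that vanishes for easy negatives
    grad_p_u = -scale * (q_i - q_j)
    grad_q_i = -scale * p_u
    grad_q_j = scale * p_u
    return loss, scale, (grad_p_u, grad_q_i, grad_q_j)


# Toy vectors (hypothetical values chosen for illustration only).
p_u = np.array([1.0, 2.0])
q_i = np.array([1.0, 2.0])        # positive item, score 5.0
q_easy = np.array([-1.0, -2.0])   # easy negative, score -5.0
q_hard = np.array([1.0, 1.9])     # hard negative, score 4.8

loss_easy, scale_easy, grads_easy = bpr_loss_and_grads(p_u, q_i, q_easy)
loss_hard, scale_hard, grads_hard = bpr_loss_and_grads(p_u, q_i, q_hard)
print(f"easy negative: scale={scale_easy:.2e}")
print(f"hard negative: scale={scale_hard:.2e}")
```

In this toy setting the easy negative yields a gradient scale several orders of magnitude smaller than the hard one, which is the behavior the analysis attributes to uniformly sampled items.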
To avoid this, the prediction scores of items are used for sampling in previous work, e.g., AOBPR and BPR-DNS [49, 51]. These kinds of methods are designed to sample more hard instances, i.e., unobserved items with higher prediction scores. However, since items with higher prediction scores are more likely to be actually positive in the test set, these kinds of methods suffer from false examples more markedly, which will hurt the model’s performance and robustness since they are sampled as negative instances during training.
From the above analyses of negative instances in the training of implicit recommendation, we can conclude that the difficulty of negative sampling is how to sample negative examples with large scores to increase the convergence speed and avoid false-negative instances to maintain robustness.

3.3 Non-sampling

A recommendation model with the non-sampling learning strategy typically creates the training data from \(\mathcal {Y}\) by giving pairs \(y(u, i)\in \mathcal {Y}\) a positive label and all other unobserved interactions \({\bf Y}\backslash \mathcal {Y}\) a negative label:
\begin{equation} y(u,i)=\left\lbrace \begin{array}{ll}{1,} & {\text{ if interaction (user } u, \text{ item } i) \text{ is observed; }} \\ {0,} & {\text{ otherwise. }} \end{array}\right. \end{equation}
(3)
Then the recommendation model is fitted to this data. The widely used non-sampling-based objective function is a weighted regression function, which assigns a training weight to each interaction in the implicit matrix:
\begin{equation} \begin{split}\mathcal {L}=\sum _{u \in \mathbf {U}} \sum _{i \in \mathbf {I}} c(u, i)\left(y(u, i)-\hat{y}(u,i)\right)^{2}, \end{split} \end{equation}
(4)
where \(c(u,i)\) denotes the weight of entry \(y(u,i)\) .
Through the above objective function, the model is optimized to predict the value 1 for elements in \(\mathcal {Y}\) and 0 for the rest. The purpose of this kind of method is to impose the gravity regularizer to penalize the prediction for missing items [41]. One of the weaknesses of this approach is that all candidate ranking items are presented to the recommendation model as negative instances during training, which means that a model with enough expressiveness cannot generate a reasonable ranking list at all as it predicts only 0.
Besides, considering that the amount of missing data can be much larger than the number of items a user has interacted with in real applications, it is more desirable to assign the missing data a lower weight to address the class imbalance issue. A weighting strategy for unobserved examples in non-sampling methods plays a similar role to the negative sampling strategy in sampling-based methods [32, 42]. The difference is that with a weighting strategy for unobserved samples all samples are used, whereas with a sampling strategy some samples might never be used during training, which may have an impact in terms of generalizability or may leave the model not trained for certain types of input, lowering its robustness. The methods with non-uniform weighting can model negative instances with a faster coverage but generally require more computational cost due to the dense weight matrix for all user-item pairs. Although some methods have been proposed to accelerate the learning process from the whole data, one challenge remains in existing non-sampling approaches: the fast non-sampling learning methods are only suitable for recommendation models with a linear prediction layer.
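As an illustration of the weighted regression objective in Equation (4), the sketch below evaluates the loss densely over every user-item pair for an MF scorer. It assumes a tiny hand-made interaction matrix and a hypothetical negative weight `c0 = 0.1`; the function name `weighted_regression_loss` is not from the article. The dense evaluation touches all \(M \times N\) entries, which is exactly the cost that the efficient non-sampling methods cited above try to avoid.

```python
import numpy as np


def weighted_regression_loss(Y, P, Q, C):
    """Sum over all (u, i) of c(u, i) * (y(u, i) - y_hat(u, i))^2 with y_hat = P Q^T."""
    Y_hat = P @ Q.T
    return float(np.sum(C * (Y - Y_hat) ** 2))


# Toy implicit matrix: 2 users, 3 items (illustrative values only).
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])

c0 = 0.1                              # hypothetical weight for unobserved entries
C = np.where(Y == 1.0, 1.0, c0)       # observed entries get weight 1

# A model that predicts 0 everywhere pays the full weight on the 3 positives.
P_zero, Q_zero = np.zeros((2, 2)), np.zeros((3, 2))
loss_zero = weighted_regression_loss(Y, P_zero, Q_zero, C)

# A model that predicts 1 everywhere pays only c0 on each of the 3 unobserved entries.
P_one = np.array([[1.0, 0.0], [1.0, 0.0]])
Q_one = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
loss_one = weighted_regression_loss(Y, P_one, Q_one, C)

print(loss_zero, loss_one)
```

The comparison shows how the lower weight on unobserved entries keeps the many zeros from dominating the objective.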

4 Methods for Empirical Study

To explore the performance of negative sampling and non-sampling in different scenarios, we compare various state-of-the-art and representative learning methods for Top-K recommendation with implicit feedback data. Specifically, we focus our analysis on seven methods, which can be classified into two groups based on whether they utilize negative sampling or non-sampling learning strategies. Negative sampling methods include BPR-Uniform Sampling [52], AOBPR [51], BPR-DNS [78], IRGAN [61], and SRNS [17]. Non-sampling-based methods include WMF [33] and EALS [32]. We will briefly review these methods in this section.

4.1 BPR-Uniform Sampling

Uniform sampling with the BPR pair-wise loss function [52] is one of the most widely used and classical solutions for item recommendation with implicit feedback. Uniform negative sampling over unobserved items is also called random sampling, which is based on the simplest uniform proposal. We use \(Q(j|u,i)\) to represent the probability that a negative item j is sampled for a positive pair \((u, i)\) , which is defined as
\begin{equation} Q(j|u,i)=\left\lbrace \begin{array}{ll}{0,} & {\text{ if j} \in {\bf I}_u} \\ {\frac{1}{N-N_u},} & {\text{ otherwise,}} \end{array}\right. \end{equation}
(5)
where \(N_u=|{\bf I}_u|\) denotes the number of user u’s interacted items.
Although uniform sampling has been successfully applied in numerous recommender applications and for a variety of models, convergence slows down considerably if the number of items is large and the overall item popularity is long-tailed. Both properties are common to most real-world datasets.
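A minimal sketch of the uniform proposal in Equation (5) is given below. It uses rejection sampling, which is one common way to draw uniformly from the \(N-N_u\) unobserved items without materializing them; the function name `sample_uniform_negative` and the toy values are assumptions for illustration. The loop assumes the user has at least one unobserved item.

```python
import numpy as np


def sample_uniform_negative(pos_items, num_items, rng):
    """Draw one item uniformly from the items not in pos_items (rejection sampling)."""
    while True:
        j = int(rng.integers(num_items))   # uniform over {0, ..., num_items - 1}
        if j not in pos_items:             # reject observed (positive) items
            return j


rng = np.random.default_rng(0)
pos_items_u = {0, 2}                       # toy positive set I_u
draws = [sample_uniform_negative(pos_items_u, 5, rng) for _ in range(200)]
print(sorted(set(draws)))
```

Because users usually interact with only a small fraction of the catalog, rejections are rare and each draw costs close to a single random number.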

4.2 AOBPR

AOBPR (Adaptive Oversampling BPR) [51] is designed to improve the uniform sampling strategy by adaptively sampling hard instances, i.e., unobserved items with higher prediction scores, which are hard for the algorithm to discriminate. Intuitively, when a negative item j is to be sampled for a positive pair \((u, i)\) , the closer j is to the top of the ranking, the more informative j is. The sampling distribution of AOBPR is defined as
\begin{equation} Q(j|u,i)=\left\lbrace \begin{array}{ll}{0,} & {\text{ if}\ j \in {\bf I}_u} \\ {\text{exp}(\frac{-\hat{r}(u,j)}{\lambda }),} & {\text{ otherwise,}} \end{array}\right. \end{equation}
(6)
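where \(\hat{r}(u,j)\) is the rank of item j among all items I using the score function \(\hat{y}(u,j)\) for ordering items, and \(\lambda\) is a hyper-parameter to control the skewness of the distribution. The weights in Equation (6) are unnormalized; normalizing them to sum to 1 yields the sampling distribution.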
The limitations of AOBPR are (1) it needs to compute the scores of all items in order to determine their sampling probability, which reduces the advantage of sampling and is time-consuming in the training process, and (2) the items with higher prediction scores are more likely to be positive in the test set, and thus AOBPR is more vulnerable to false samples than uniform sampling.
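The rank-based weighting of Equation (6) can be sketched as follows. This is a naive illustration that scores and ranks every item, which makes limitation (1) visible; it is not the sampling procedure of the original AOBPR work. The function name `aobpr_distribution`, the 0-based rank convention, and the toy scores are assumptions, and the code assumes at least one unobserved item.

```python
import numpy as np


def aobpr_distribution(scores, pos_items, lam):
    """Normalized sampling probabilities proportional to exp(-rank / lam)."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    order = np.argsort(-scores)            # item indices from highest to lowest score
    ranks = np.empty(n)
    ranks[order] = np.arange(n)            # rank 0 = top-scored item (assumed convention)
    weights = np.exp(-ranks / lam)
    weights[np.array(sorted(pos_items), dtype=int)] = 0.0   # never sample observed items
    return weights / weights.sum()


# Toy scores for one user over 4 items; item 1 is an observed positive.
q = aobpr_distribution([0.1, 0.9, 0.5, 0.3], pos_items={1}, lam=1.0)
print(q)
```

A smaller `lam` concentrates the probability mass on the top-ranked unobserved items, while a large `lam` approaches uniform sampling.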

4.3 BPR-DNS

The motivation of BPR-DNS (Dynamic Negative Sampling) [78] is similar to that of AOBPR: it also tries to adaptively sample hard instances. Different from AOBPR, DNS first randomly selects a set of negative samples, then uses the item with the highest prediction score for model optimization. The training process of DNS is shown in Algorithm 1.
The limitation of BPR-DNS is that it is also vulnerable to false samples since items with higher prediction scores are more likely to be positive in the test set. Compared to AOBPR, the advantage of BPR-DNS is that it only needs to compute the scores of n items for each sampling, and a small n (generally no more than 32) is enough [78].
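The selection step described above (draw n random candidates, keep the highest-scored one) can be sketched as below. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 1: `score_fn` is a hypothetical callable that scores only the candidates it is given, candidates are drawn with replacement, and the toy scores are made up.

```python
import numpy as np


def sample_dns_negative(score_fn, pos_items, num_items, n_candidates, rng):
    """Draw n_candidates unobserved items uniformly and return the highest-scored one."""
    unobserved = np.array([j for j in range(num_items) if j not in pos_items])
    candidates = rng.choice(unobserved, size=n_candidates, replace=True)
    cand_scores = score_fn(candidates)     # only n_candidates scores are computed
    return int(candidates[np.argmax(cand_scores)])


# Toy scores for one user over 5 items; item 1 is an observed positive.
toy_scores = np.array([0.2, 0.9, 0.1, 0.7, 0.4])
rng = np.random.default_rng(0)
hard_negative = sample_dns_negative(lambda c: toy_scores[c], {1}, 5, 200, rng)
single_draw = sample_dns_negative(lambda c: toy_scores[c], {1}, 5, 1, rng)
print(hard_negative, single_draw)
```

With `n_candidates=1` the procedure reduces to uniform sampling, and larger values push the selection toward the hardest unobserved items.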

4.4 IRGAN

IRGAN [61], a GAN-style negative sampling method, applies a mini-max game for information retrieval tasks like implicit recommendation. Specifically, IRGAN includes two components: a generator G and a discriminator D. The generator aims to produce increasingly difficult negative samples, and the discriminator aims to minimize the discrimination objective function \(\mathcal {J}(D, G)\) . For recommendation with implicit feedback data, IRGAN optimizes the following objective function:
\begin{equation} \max _{G} \min _{D} \mathcal {J}(D, G)=\sum _{u \in {\bf U}}-\mathbb {E}_{i \sim P_{\text{pos }}(\cdot \mid u)} \log D(i \mid u)-\mathbb {E}_{j \sim P_{G}(\cdot \mid u)} \log (1-D(j \mid u)), \end{equation}
(7)
where \(P_{\text{pos }}(\cdot \mid u)\) is a positive relevance distribution and \(P_{G}(\cdot \mid u)\) is a probability distribution used to generate negative instances. \(D(i\mid u)\) estimates the probability that user u prefers item i. When the discriminator D and the generator G are well trained, G is used to produce the recommendations.
Although the GAN-style methods have shown promising results for discovering informative negative samples, it is usually very time-consuming to generate negative instances from the generator G, which limits their applicability to large-scale datasets.

4.5 SRNS

SRNS (Simplified and Robust Negative Sampling) [17] is a recently proposed negative sampling method, which is based on the finding that false-negative samples (underlying positive) generally have large prediction scores for many training epochs. SRNS captures the dynamic sampling distribution of negative samples through a memory-based component. To evaluate the quality of negative samples, SRNS also proposes a high-variance-based strategy. The training process of SRNS is shown in Algorithm 2.
The variance-based sampling strategy is proposed to avoid false-negative instances by preferring high-variance candidates, which is defined as
\begin{equation} j=\arg \max _{k \in \mathcal {M}_{u}} P_{\mathrm{pos}}(k \mid u, i)+\alpha _{t} \cdot \operatorname{std}\left[P_{\mathrm{pos}}(k \mid u, i)\right], \end{equation}
(8)
where \(P_{\mathrm{pos}}(k \mid u, i)=\mathrm{sigmoid} (\hat{y}(u,i) -\hat{y}(u,k))\) , \(\operatorname{std}[P_{\mathrm{pos}}(k \mid u, i)]\) denotes the standard deviation of this prediction over the latest few epochs, and \(\alpha _{t}\) is a hyper-parameter to control the importance of the variance.
The memory strategy is to dynamically update \(\mathcal {M}_{u}\) while including more hard negative samples. The new \(\mathcal {M}_{u}\) is obtained by sampling \(S_1\) instances from an extended memory that merges the old \(\mathcal {M}_{u}\) and a set \(\mathcal {M}_{u}^{\prime }\) of randomly sampled instances. The sampling probability distribution is as follows:
\begin{equation} Q(j|u,\mathcal {M}_{u} \cup \mathcal {M}_{u}^{\prime })=\exp \left(\hat{y}(u,j) / \tau \right) / \sum _{k^{\prime } \in \mathcal {M}_{u} \cup \mathcal {M}_{u}^{\prime }} \exp \left(\hat{y}(u,k^{\prime }) / \tau \right)\!, \end{equation}
(9)
where \(\tau\) is a temperature parameter and a lower \(\tau\) would make \(Q(j|u,\mathcal {M}_{u} \cup \mathcal {M}_{u}^{\prime })\) pay more attention to large-scored instances.
The main training cost of SRNS comes from the variance-based sampling and the score-based memory update introduced above. Specifically, SRNS needs to compute the scores of \(S_1 + S_2\) candidates for each positive instance and then sample \(S_1\) of them based on the computed scores [17]. According to the original paper, small values of \(S_1\) and \(S_2\) are generally sufficient, which makes SRNS more efficient than methods that require computing the scores of all items.
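The score-based memory update of Equation (9) can be illustrated with the sketch below. It is written under several assumptions that are not taken from the SRNS code: the helper names are hypothetical, the random candidates are drawn uniformly from all items without filtering out observed positives, and the new memory is sampled with replacement.

```python
import math
import random

def sampling_probs(scores, tau):
    """Softmax over scores with temperature tau (Equation (9))."""
    scaled = [s / tau for s in scores]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def update_memory(memory, all_items, score_fn, s1, s2, tau, rng):
    """Resample S_1 memory slots from the old memory plus S_2 random items."""
    # Extended memory: old candidates merged with s2 uniformly drawn items.
    extended = list(memory) + [rng.choice(all_items) for _ in range(s2)]
    probs = sampling_probs([score_fn(k) for k in extended], tau)
    # Assumption: sampling with replacement, so duplicates are possible.
    return rng.choices(extended, weights=probs, k=s1)

probs_warm = sampling_probs([1.0, 2.0, 3.0], tau=1.0)
probs_cold = sampling_probs([1.0, 2.0, 3.0], tau=0.1)
new_memory = update_memory([0, 1], list(range(10)), float, 2, 4, 1.0, random.Random(0))
```

Comparing `probs_warm` with `probs_cold` shows the effect of the temperature: with the lower \(\tau\), almost all probability mass moves to the highest-scored candidate.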

4.6 WMF

Different from sampling-based learning methods that mainly focus on improving the quality of negative samples, the key of non-sampling learning is to assign proper weights for unobserved user-item interactions. WMF [33] is a well-known non-sampling learning method, which assigns all unobserved user-item interactions with the same uniform weight:
\begin{equation} c(u,i)=\left\lbrace \begin{array}{ll}{c_{1}} & {\text{ if } y(u,i)=1 }\\ {c_{0}} & {\text{ if }y(u,i)=0, } \end{array}\right. \end{equation}
(10)
where \(c_0\) and \(c_1\) are hyper-parameters that need to be tuned for each dataset. To simplify the tuning process, \(c_1\) is usually set to 1, while \(c_0\) is set to a value smaller than \(c_1\) to address the imbalanced optimization issue [33]. For instance, in our experiments, \(c_0\) is selected from [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7] and is set to 0.5, 0.05, 0.05, and 0.01 for the Movielens-1m, Pinterest, Yelp2018, and Alibaba datasets, respectively (see Table 3). Generally, this parameter is related to the sparsity of the data: the sparser the data, the smaller the value of \(c_0\) that tends to perform best. Previous studies [18, 33, 62] have demonstrated that this strategy performs better than using the same weight for all user-item interactions in the recommendation task.

4.7 EALS

The uniform weight strategy assumes that all unobserved user-item interactions carry the same level of negative signal, which is too simple to model real-world scenarios. Intuitively, a popular item is more likely to have been seen by users, so it should receive a higher negative weight when a user has not interacted with it [32, 42]. Specifically, the EALS [32] method assigns unobserved user-item interactions non-uniform weights according to item popularity:
\begin{equation} c(u,i)=\left\lbrace \begin{array}{ll}{c_{1}} & {\text{ if } y(u,i)=1 }\\ {c_{i}^-} & {\text{ if }y(u,i)=0, } \end{array}\right. \end{equation}
(11)
where \(c_{i}^-\) is defined as
\begin{equation} \begin{split}c_{i}^{-}=c_{0}\frac{m_i^{x}}{\sum _{j=1}^{N}m_j^{x}}&; \ m_i=\frac{|\mathcal {Y}_i|}{\sum _{j=1}^{N}|\mathcal {Y}_j|}, \end{split} \end{equation}
(12)
where \(\mathcal {Y}_i\) denotes the positive interactions of item i, \(m_i\) denotes the popularity (frequency) of item i in \({\bf Y}\), N is the number of items, \(c_{0}\) determines the overall weight of unobserved data, and x controls the significance level of popular items over unpopular ones. To simplify the tuning process, x is usually set to 0.5, as suggested in [32].
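As a small worked example of Equation (12), the sketch below computes the popularity-aware weights from per-item interaction counts. The function name and the toy counts are hypothetical and serve only as an illustration.

```python
def eals_negative_weights(item_counts, c0, x=0.5):
    """Popularity-aware negative weights c_i^- of Equation (12).

    item_counts[i] is |Y_i|, the number of observed interactions of item i.
    """
    total = sum(item_counts)
    popularity = [count / total for count in item_counts]  # m_i
    powered = [m ** x for m in popularity]                  # m_i^x
    norm = sum(powered)
    return [c0 * p / norm for p in powered]

# Three items with 1, 4, and 4 observed interactions.
weights = eals_negative_weights([1, 4, 4], c0=10.0, x=0.5)
uniform = eals_negative_weights([1, 4, 4], c0=10.0, x=0.0)
```

In this example the weights are approximately 2, 4, and 4, so they always sum to \(c_0\). Setting x to 0 removes the popularity effect and gives every item the same weight \(c_0/N\), which corresponds to a uniform weighting of unobserved data.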

4.8 Efficient Non-sampling Loss

The difficulty of applying non-sampling learning to implicit recommendation lies in its expensive computational cost. For example, the complexity of computing Equation (4) is \(O(|{\bf U}||{\bf I}|d)\) , which is generally unaffordable, since in the real world \(|{\bf U}||{\bf I}|\) can easily reach the billion level. Several methods have been proposed [9, 32, 70, 77] to address the inefficiency issue of non-sampling learning. Specifically, recent studies [9, 10] propose an efficient loss for two-tower-based recommendation models and prove that, for a generalized matrix factorization framework whose prediction function is Equation (13), the gradient of the loss in Equation (4) is exactly equal to that of Equation (14) if the weight \(c(u,i)\) is simplified to \(c_i\) :
\begin{equation} \hat{y}{(u,i)}=\mathbf {h}^{T}\left(\mathbf {p}_{u} \odot \mathbf {q}_{i}\right) \end{equation}
(13)
\begin{equation} \begin{split}\tilde{\mathcal {L}}(\Theta)&=\sum _{u \in \mathbf {U}}\sum _{i \in \mathbf {I}_{u}}\left((c_{i}^{+}-c_{i}^{-})\hat{y}{(u,i)}^2-2c_{i}^{+}\hat{y}{(u,i)}\right)\\ &+\sum _{j=1}^d\sum _{k=1}^d \left((h_{j}h_{k}) \left(\sum _{u \in \mathbf {U}} p_{u,j}p_{u,k}\right) \left(\sum _{i \in \mathbf {I}} c_i^{-}q_{i,j}q_{i,k} \right) \right), \end{split} \end{equation}
(14)
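where \(\mathbf {p}_u \in \mathbb {R}^d\) and \(\mathbf {q}_i\in \mathbb {R}^d\) are the embeddings of user u and item i, \(\odot\) denotes the element-wise product, \(\mathbf {h} \in \mathbb {R}^d\) is the prediction vector, and \(c_i^{+}\) and \(c_i^{-}\) denote the weights of item i in observed and unobserved interactions, respectively.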
The complexity of Equation (14) is \(O((|{\bf U}|+|{\bf I}|)d^2+ |\mathcal {Y}|d),\) while that of Equation (4) is \(O(|{\bf U}||{\bf I}|d)\) . Since \(|\mathcal {Y}|\) is the number of observed user-item interactions and \(|\mathcal {Y}|\ll |{\bf U}||{\bf I}|\) in practice, the complexity is greatly reduced. The proof of this method can be found in [9, 10]. To avoid repetition, it is omitted here.
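To give a sense of scale, the two complexity expressions can be evaluated for the Yelp2018 statistics in Table 2 with d = 64. This is only a rough operation count: constants are ignored, and the full interaction count is used in place of the training split.

```python
def naive_cost(num_users, num_items, d):
    """Order of operations for the full loss: |U| * |I| * d."""
    return num_users * num_items * d

def efficient_cost(num_users, num_items, num_interactions, d):
    """Order of operations for Equation (14): (|U| + |I|) * d^2 + |Y| * d."""
    return (num_users + num_items) * d * d + num_interactions * d

# Yelp2018: 31,668 users, 38,048 items, 1,561,406 interactions (Table 2).
naive = naive_cost(31_668, 38_048, 64)
efficient = efficient_cost(31_668, 38_048, 1_561_406, 64)
speedup = naive / efficient
print(f"naive={naive:.2e} efficient={efficient:.2e} ratio={speedup:.0f}x")
```

Under these assumptions the efficient form needs roughly 200 times fewer operations per pass on Yelp2018.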
Note that this method does not directly calculate the scores of all items. Instead, it reformulates the loss over all negative instances through a partition and a decoupling operation to achieve speedup. As such, it cannot be applied to the above sampling-based methods, which require computing a large number of item scores.
The above efficient loss can also be applied to a common matrix factorization recommendation model (i.e., \(\hat{y}{(u,i)}=\mathbf {p}_{u}^T \mathbf {q}_{i}\) ). It is used in our experiment to train non-sampling methods WMF and EALS.
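The relation between the two losses can be checked numerically on a toy problem. The sketch below assumes that Equation (4) is the weighted squared loss \(\sum _{u}\sum _{i} c(u,i)(y(u,i)-\hat{y}(u,i))^2\) with item-dependent weights; all function names and the random toy data are hypothetical. Under this assumption the two losses differ only by the constant \(\sum _{(u,i) \in \mathcal {Y}} c_i^{+}\), which is why their gradients coincide.

```python
import random

def predict(h, p_u, q_i):
    """Equation (13): h^T (p_u * q_i) with an element-wise product."""
    return sum(hj * pj * qj for hj, pj, qj in zip(h, p_u, q_i))

def naive_loss(P, Q, h, positives, c_pos, c_neg):
    """Weighted squared loss over every user-item pair (assumed Equation (4))."""
    total = 0.0
    for u, p_u in enumerate(P):
        for i, q_i in enumerate(Q):
            y_hat = predict(h, p_u, q_i)
            if i in positives[u]:
                total += c_pos[i] * (1.0 - y_hat) ** 2
            else:
                total += c_neg[i] * y_hat ** 2
    return total

def efficient_loss(P, Q, h, positives, c_pos, c_neg):
    """Equation (14): observed-pair term plus a decoupled all-pairs term."""
    total = 0.0
    for u, items in enumerate(positives):
        for i in items:
            y_hat = predict(h, P[u], Q[i])
            total += (c_pos[i] - c_neg[i]) * y_hat ** 2 - 2.0 * c_pos[i] * y_hat
    d = len(h)
    for j in range(d):
        for k in range(d):
            # User and item sums are computed separately, never per pair.
            user_part = sum(p[j] * p[k] for p in P)
            item_part = sum(c * q[j] * q[k] for c, q in zip(c_neg, Q))
            total += h[j] * h[k] * user_part * item_part
    return total

rng = random.Random(0)
num_users, num_items, d = 5, 7, 3
P = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(num_users)]
Q = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(num_items)]
h = [rng.uniform(-1, 1) for _ in range(d)]
positives = [set(rng.sample(range(num_items), 2)) for _ in range(num_users)]
c_pos = [1.0] * num_items
c_neg = [rng.uniform(0.01, 0.5) for _ in range(num_items)]

constant = sum(c_pos[i] for items in positives for i in items)
gap = naive_loss(P, Q, h, positives, c_pos, c_neg) - (
    efficient_loss(P, Q, h, positives, c_pos, c_neg) + constant
)
```

Here `gap` is zero up to floating-point error, while the efficient form never enumerates the unobserved user-item pairs.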

5 Experiments

In this section, we conduct experiments to explore the performance of negative sampling and non-sampling algorithms in different scenarios. We aim to answer the following research questions:
RQ1: How do negative sampling and non-sampling algorithms perform with the standard matrix factorization method for Top-K recommendation?
RQ2: How efficient are negative sampling and non-sampling methods?
RQ3: How do negative sampling and non-sampling training algorithms boost state-of-the-art recommendation methods?
In the following part, we first introduce the experimental settings, followed by answering the above research questions.

5.1 Experimental Settings

5.1.1 Data.

Four real-world, publicly available datasets that are widely used in previous literature [10, 17, 30, 41, 72] are used in our experiments: Movielens-1m,1 Pinterest,2 Yelp2018,3 and Alibaba.4 The statistics of the four datasets are shown in Table 2. We briefly introduce the four datasets:
Dataset | #User | #Item | #Interaction | Density
Movielens-1m | 6,040 | 3,706 | 1,000,209 | 4.47%
Pinterest | 55,187 | 9,916 | 1,500,809 | 0.27%
Yelp2018 | 31,668 | 38,048 | 1,561,406 | 0.13%
Alibaba | 106,042 | 53,591 | 907,407 | 0.02%
Table 2. Statistical Details of the Evaluation Datasets
Movielens-1m: This is a widely used movie rating dataset, which contains about one million ratings on a scale from 1 to 5. Since we focus on learning from implicit feedback data, we follow the widely used pre-processing method to convert it into implicit feedback. Specifically, each rating is transformed into a value of 0 or 1 indicating whether a user has interacted with an item.
Pinterest: This dataset is constructed by [23] for the image recommendation task and has been used for evaluating the implicit recommendation task [17, 30] in previous work.
Yelp2018: This dataset is adopted from the 2018 edition of the Yelp challenge, where the local businesses like restaurants are viewed as the items. The yelp2018 dataset used in this article is exactly the same as that used in [27, 34, 65].
Alibaba: This dataset is collected from the Alibaba online shopping platform. The authors of [73] organize the purchase record of selected users to construct the bipartite user-item graph. The dataset used in this article is exactly the same as that used in [34, 73].
The datasets vary in the numbers of user-item interactions, density, and item frequency distribution. The frequency statistics of the evaluation datasets are shown in Figure 1.
Fig. 1. Frequency statistics of the evaluation datasets.

5.1.2 Evaluation Metrics.

The personalized ranking list for a user is generated by ranking all items that are not interacted in the training set according to the prediction scores. To evaluate the performance, we closely follow the settings of previous work [8, 27, 65]. Specifically, we randomly select 80% of interactions of each user to construct the training set and treat the remaining as the test set. From the training set, we randomly select 10% of interactions as the validation set to tune hyper-parameters.
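The split described above can be sketched as follows. This is one possible reading rather than the exact protocol: it assumes that the validation interactions are also drawn per user, and the function name and rounding by truncation are hypothetical choices.

```python
import random

def split_user_interactions(user_items, rng, train_ratio=0.8, valid_ratio=0.1):
    """Per-user random split into training, validation, and test interactions."""
    train, valid, test = {}, {}, {}
    for user, items in user_items.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * train_ratio)
        train_part, test[user] = shuffled[:n_train], shuffled[n_train:]
        # The validation set is carved out of the training portion.
        n_valid = int(len(train_part) * valid_ratio)
        valid[user], train[user] = train_part[:n_valid], train_part[n_valid:]
    return train, valid, test

interactions = {"u": list(range(20))}
train, valid, test = split_user_interactions(interactions, random.Random(0))
```

For a user with 20 interactions, this yields 15 training, 1 validation, and 4 test interactions.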
We evaluate the ranking list using two metrics: (1) Recall and (2) Normalized Discounted Cumulative Gain (NDCG). We define the generated recommendation list for user u as \({\bf Rec}_u =\lbrace rec_u^1,rec_u^2,\ldots ,rec_u^K\rbrace\) , where K is the number of recommended items, and \(rec_u^i\) is ranked at the ith position in \({\bf Rec}_u\) according to the predicted score. The set of u’s interacted items in the test data is defined as \({\bf I}_u\) .
Recall@K: Recall measures the fraction of a user’s test items that appear in the top-K recommendation list. It is computed as follows:
\begin{equation} \begin{split}Recall@K&=\frac{1}{|{\bf U}|} \sum _u \frac{\sum _{i=1}^K{f\left(\left| \lbrace rec_u^i\rbrace \cap {\bf I}_u \right|\right)}}{|{\bf I}_u|}, \end{split} \end{equation}
(15)
where \(f (x)\) is an indicator function whose value is 1 when x > 0 and 0 otherwise.
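A minimal sketch of Recall@K is given below, using the conventional normalization by the number of test items \(|{\bf I}_u|\) of each user. The function name and the dictionary-based inputs are hypothetical, and users without test items are skipped, which is an assumption rather than a documented choice.

```python
def recall_at_k(ranked_items, test_items, k):
    """Recall@K averaged over users that have at least one test item."""
    total, users = 0.0, 0
    for user, relevant in test_items.items():
        if not relevant:
            continue  # assumption: users without test items are skipped
        hits = sum(1 for item in ranked_items[user][:k] if item in relevant)
        total += hits / len(relevant)
        users += 1
    return total / users if users else 0.0

ranked = {"u1": [3, 1, 2, 4], "u2": [5, 6, 7, 8]}
held_out = {"u1": {1, 9}, "u2": {8}}
recall_2 = recall_at_k(ranked, held_out, k=2)
recall_4 = recall_at_k(ranked, held_out, k=4)
```

In this toy example, u1 recovers one of its two test items within the top 2, and u2 recovers its single test item only at rank 4.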
Normalized Discounted Cumulative Gain (NDCG)@K: It is widely used in information retrieval and recommendation tasks, measuring the quality of ranking through discounted importance of positions. Formally, it is computed as follows:
\begin{equation} \begin{split}DCG@K&=\frac{1}{|{\bf U}|}\sum _u \sum _{i=1}^K \frac{2^{f\left(\left| \lbrace rec_u^i\rbrace \cap {\bf I}_u \right|\right)}-1}{\log _2(i+1)}\\ NDCG@K&=\frac{DCG@K}{IDCG@K} , \end{split} \end{equation}
(16)
where IDCG is a normalization constant, which is the maximum possible value of \(DCG@K\) coming from the best ranking.
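For a single user, NDCG@K with binary relevance can be sketched as follows. The sketch normalizes per user, which is a common convention and an assumption here; the function name is hypothetical. With binary relevance the gain \(2^{f(\cdot)}-1\) is simply 1 for a hit and 0 otherwise.

```python
import math

def ndcg_at_k(ranked_items, relevant, k):
    """NDCG@K for one user with binary relevance."""
    # enumerate() is 0-based, so rank i = pos + 1 and the discount is log2(i + 1).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # Ideal ranking: all relevant items (at most k) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k([1, 2, 3], {1}, k=3)
second_place = ndcg_at_k([2, 1, 3], {1}, k=3)
miss = ndcg_at_k([2, 1, 3], {9}, k=3)
```

Placing the only relevant item at rank 2 instead of rank 1 lowers the score from 1 to \(1/\log _2 3\), which illustrates the position discount.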
For each user, our evaluation protocol ranks all the items except those in the training set, which is more persuasive than ranking only a random subset of negative items [38]. For each method, we randomly initialize the model and run it five times, and we report the average results. Moreover, an early stopping strategy is applied: training stops if Recall@20 on the validation data does not increase for 50 epochs.
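The early stopping rule can be sketched as a small training loop. The callables `run_epoch` and `evaluate` are hypothetical placeholders for one training pass and for computing Recall@20 on the validation data; `max_epochs` is an added safeguard that is not part of the described protocol.

```python
def train_with_early_stopping(run_epoch, evaluate, patience=50, max_epochs=1000):
    """Stop when the validation score has not improved for `patience` epochs."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        run_epoch(epoch)
        score = evaluate(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score

# Toy run: the validation score improves until epoch 10 and then plateaus.
epochs_run = []
best_epoch, best_score = train_with_early_stopping(
    run_epoch=epochs_run.append,
    evaluate=lambda epoch: min(epoch, 10) * 0.01,
    patience=5,
)
```

In the toy run, the best epoch is 10 and training halts at epoch 15, after five epochs without improvement.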

5.1.3 Hyper-parameter Settings.

The parameters for all methods are initialized according to the corresponding papers and are then carefully tuned to achieve optimal performance. Specifically, by applying the vanilla Uniform sampling strategy, we first use grid search to find the best sampling-independent parameters, such as the learning rate and the regularization coefficient. Then, for each individual method, we fix these parameters and search over the remaining sampling-related parameters. The batch size is set to 512. For a fair comparison, we use the same embedding size d for all methods. This setting has been widely adopted in previous work [10, 27, 30, 34, 65]; it ensures that the basic model has the same modeling ability and that the performance difference is caused only by the different learning methods. Specifically, d is set to 64 in our experiments. To prevent overfitting, we tune the dropout ratio in [0.1, 0.3, 0.5, 0.7, 0.9, 1] and the regularization in [0, 0.0001, 0.001, 0.01, 0.05]. A dropout ratio of 1.0 means that all units are kept (i.e., no dropout), following the keep-probability convention in TensorFlow. By default, we integrate the above-introduced negative sampling methods and non-sampling methods into a traditional matrix factorization model to compare their performance. The detailed hyper-parameter exploration is shown in Table 3. All experiments are run on the same machine (an Intel Xeon 8-core CPU at 2.4 GHz and a single NVIDIA GeForce GTX TITAN X GPU) for a fair comparison.
Methods | Para. | Tuning Range | Movielens-1m | Pinterest | Yelp2018 | Alibaba
BPR-Uniform | lr | [0.001, 0.005, 0.01, 0.02, 0.05] | 0.05 | 0.05 | 0.05 | 0.05
BPR-Uniform | reg | [0, 0.0001, 0.001, 0.01] | 0.01 | 0.01 | 0.01 | 0.01
BPR-Uniform | dropout | [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] | 1.0 | 1.0 | 1.0 | 1.0
AOBPR | lr | [0.001, 0.005, 0.01, 0.02, 0.05] | 0.05 | 0.05 | 0.05 | 0.05
AOBPR | reg | [0, 0.0001, 0.001, 0.01] | 0.01 | 0.01 | 0.01 | 0.01
AOBPR | dropout | [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] | 1.0 | 1.0 | 1.0 | 1.0
AOBPR | \(\lambda\) | [5, 10, 20, 50, 100, 200, 500, 1000, 2000] | 1000 | 1000 | 1000 | 1000
BPR-DNS | lr | [0.001, 0.005, 0.01, 0.02, 0.05] | 0.05 | 0.05 | 0.05 | 0.05
BPR-DNS | reg | [0, 0.0001, 0.001, 0.01] | 0.01 | 0.01 | 0.01 | 0.01
BPR-DNS | dropout | [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] | 1.0 | 1.0 | 1.0 | 1.0
BPR-DNS | k | [2, 4, 8, 16] | 2 | 2 | 4 | 8
IRGAN | lr | [0.001, 0.005, 0.01, 0.02, 0.05] | 0.01 | 0.01 | 0.01 | 0.01
IRGAN | reg | [0, 0.0001, 0.001, 0.01] | 0.01 | 0.01 | 0.01 | 0.01
IRGAN | dropout | [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] | 0.7 | 0.9 | 0.9 | 0.9
IRGAN | \(\tau\) | [0.5, 1, 2] | 1 | 1 | 1 | 1
SRNS | lr | [0.001, 0.005, 0.01, 0.02, 0.05] | 0.05 | 0.05 | 0.05 | 0.05
SRNS | reg | [0, 0.0001, 0.001, 0.01, 0.05] | 0.01 | 0.05 | 0.01 | 0.01
SRNS | dropout | [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] | 1.0 | 1.0 | 1.0 | 1.0
SRNS | \(\tau\) | [0.5, 1, 2, 10] | 10 | 10 | 10 | 10
SRNS | \(\alpha\) | [0.1, 1, 2, 5, 10, 20, 50] | 5 | 5 | 5 | 5
SRNS | \(T_0\) | [25, 50, 100] | 50 | 50 | 50 | 50
SRNS | \(S_1\) | [2, 4, 8, 16, 32] | 8 | 16 | 16 | 16
SRNS | \(S_2 /S_1\) | [1, 2, 4, 8] | 8 | 4 | 8 | 8
WMF | lr | [0.001, 0.005, 0.01, 0.02, 0.05] | 0.05 | 0.05 | 0.05 | 0.05
WMF | reg | [0, 0.0001, 0.001, 0.01] | 0.0 | 0.0 | 0.0 | 0.0
WMF | dropout | [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] | 0.5 | 0.9 | 0.7 | 0.5
WMF | \(c_0\) | [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7] | 0.5 | 0.05 | 0.05 | 0.01
EALS | lr | [0.001, 0.005, 0.01, 0.02, 0.05] | 0.05 | 0.05 | 0.05 | 0.05
EALS | reg | [0, 0.0001, 0.001, 0.01] | 0.0 | 0.0 | 0.0 | 0.0
EALS | dropout | [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] | 0.5 | 0.9 | 0.7 | 0.5
EALS | \(c_0\) | [200, 500, 1000, 2000, 4000, 5000] | 1000 | 500 | 1000 | 500
EALS | x | [0.25, 0.5, 0.75] | 0.5 | 0.5 | 0.5 | 0.5
Table 3. Hyper-parameter Exploration

5.2 Performance Comparison

We first compare the performance of the above-introduced negative sampling and non-sampling learning methods. The method ItemKNN [56] is also added as a basic benchmark. For a fair comparison, all methods are integrated into a common matrix factorization recommendation model (i.e., \(\hat{y}{(u,i)}=\mathbf {p}_{u}^T \mathbf {q}_{i}\) ). We re-implement BPR-Uniform, BPR-DNS, SRNS, WMF, and EALS with TensorFlow. For ItemKNN and AOBPR, we use the implementation in LibRec.5 For IRGAN, we use the authors’ released code.6 For sampling-based methods, we sample one negative instance for each positive instance, which is a widely used setting in previous work [27, 30, 52, 61]. The impact of the number of negative samples is explored in Section 5.4. To evaluate different recommendation lengths, we investigate the top-K performance with K set to 10, 20, and 50 in our experiments. The results of the comparison are shown in Table 4. From the table, we have the following key findings:
Movielens-1m | Recall@10 | Recall@20 | Recall@50 | NDCG@10 | NDCG@20 | NDCG@50
ItemKNN | 0.0849 | 0.1325 | 0.2371 | 0.2317 | 0.2177 | 0.2312
BPR-Uniform | 0.1535 | 0.2424 | 0.4081 | 0.3458 | 0.3418 | 0.3737
AOBPR | 0.1421 | 0.2301 | 0.3917 | 0.3299 | 0.3270 | 0.3679
BPR-DNS | 0.1524 | 0.2430 | 0.4080 | 0.3439 | 0.3408 | 0.3725
IRGAN | 0.1533 | 0.2445 | 0.4082 | 0.3437 | 0.3422 | 0.3741
SRNS | 0.1531 | 0.2420 | 0.4026 | 0.3478 | 0.3428 | 0.3711
WMF | 0.1572 | 0.2449 | 0.4034 | 0.3584 | 0.3518 | 0.3787
EALS | 0.1550 | 0.2427 | 0.4044 | 0.3527 | 0.3465 | 0.3756
Pinterest | Recall@10 | Recall@20 | Recall@50 | NDCG@10 | NDCG@20 | NDCG@50
ItemKNN | 0.0782 | 0.1248 | 0.2351 | 0.0665 | 0.0878 | 0.1252
BPR-Uniform | 0.0783 | 0.1309 | 0.2462 | 0.0670 | 0.0891 | 0.1279
AOBPR | 0.0821 | 0.1381 | 0.2561 | 0.0704 | 0.0937 | 0.1332
BPR-DNS | 0.0799 | 0.1336 | 0.2495 | 0.0683 | 0.0907 | 0.1298
IRGAN | 0.0812 | 0.1399 | 0.2521 | 0.0703 | 0.0919 | 0.1301
SRNS | 0.0893 | 0.1473 | 0.2694 | 0.0770 | 0.1013 | 0.1425
WMF | 0.0833 | 0.1385 | 0.2584 | 0.0712 | 0.0942 | 0.1347
EALS | 0.0853 | 0.1414 | 0.2601 | 0.0738 | 0.0972 | 0.1373
Yelp2018 | Recall@10 | Recall@20 | Recall@50 | NDCG@10 | NDCG@20 | NDCG@50
ItemKNN | 0.0307 | 0.0554 | 0.1058 | 0.0362 | 0.0451 | 0.0646
BPR-Uniform | 0.0323 | 0.0563 | 0.1096 | 0.0370 | 0.0457 | 0.0655
AOBPR | 0.0332 | 0.0568 | 0.1114 | 0.0382 | 0.0468 | 0.0670
BPR-DNS | 0.0350 | 0.0603 | 0.1170 | 0.0402 | 0.0493 | 0.0703
IRGAN | 0.0333 | 0.0578 | 0.1135 | 0.0384 | 0.0464 | 0.0678
SRNS | 0.0351 | 0.0599 | 0.1151 | 0.0400 | 0.0491 | 0.0695
WMF | 0.0365 | 0.0623 | 0.1195 | 0.0418 | 0.0512 | 0.0724
EALS | 0.0382 | 0.0651 | 0.1234 | 0.0440 | 0.0538 | 0.0753
Alibaba | Recall@10 | Recall@20 | Recall@50 | NDCG@10 | NDCG@20 | NDCG@50
ItemKNN | 0.0287 | 0.0394 | 0.0643 | 0.0164 | 0.0198 | 0.0238
BPR-Uniform | 0.0306 | 0.0463 | 0.0731 | 0.0178 | 0.0215 | 0.0278
AOBPR | 0.0341 | 0.0479 | 0.0790 | 0.0193 | 0.0257 | 0.0291
BPR-DNS | 0.0324 | 0.0488 | 0.0759 | 0.0187 | 0.0228 | 0.0283
IRGAN | 0.0312 | 0.0429 | 0.0672 | 0.0181 | 0.0206 | 0.0243
SRNS | 0.0258 | 0.0364 | 0.0554 | 0.0149 | 0.0178 | 0.0218
WMF | 0.0513 | 0.0763 | 0.1192 | 0.0289 | 0.0355 | 0.0444
EALS | 0.0522 | 0.0749 | 0.1169 | 0.0295 | 0.0356 | 0.0442
Table 4. Performance Comparison of Negative Sampling and Non-sampling Methods
For fair comparison, all methods are integrated into a common matrix factorization recommendation model ( \(\hat{y}{(u,i)}=\mathbf {p}_{u}^T \mathbf {q}_{i}\) ).
(1)
Generally, uniform sampling falls short of non-sampling learning methods under the same model architecture. For example, in Table 4, the non-sampling learning methods WMF and EALS achieve significantly better performance (in both Recall and NDCG) than BPR-Uniform on the Pinterest, Yelp2018, and Alibaba datasets (p < 0.01). The main reason lies in the limited effectiveness of uniform negative sampling. This finding is also consistent with many previous works [9, 10, 70, 77]. As we have introduced, after a model is well trained, it can easily distinguish the positive samples from randomly sampled negatives, so the gradient is close to zero and the parameters are hardly updated.
(2)
Effectiveness of negative sampler is important for personalized Top-K recommendation. In the table, the performances of AOBPR, BPR-DNS, IRGAN, and SRNS are generally better than simple uniform sampling BPR. In particular, AOBPR and BPR-DNS place more emphasis on hard negative items with larger preference scores, IRGAN learns a generator to produce hard negative instances based on adversarial sampling, SRNS prefers negative instances with both large prediction scores and high variances. These informative sampling methods can keep ensuring a large gradient along with the progress of training, which leads to better performances than uniform sampling BPR.
(3)
Among all negative sampling methods, SRNS generally performs better since it considers both informativeness and reliability of negative instances. AOBPR and BPR-DNS leverage the negative item with larger preference scores; they are more likely to suffer from the false-negative problem, since high-scored items are also likely to be positive in the test set rather than negative. The performance of IRGAN is not very competitive. This is because of the divergence between the generator and its optimum, and because GAN-style methods are very sensitive to some basic training settings [48].
(4)
We can see that the well-tuned simple non-sampling method WMF performs better than various sampling-based methods, including the state-of-the-art method SRNS, in most cases. Several previous studies [9, 70, 77] have also pointed out that non-sampling learning computes the gradient over all training data (including all unobserved data). Therefore, it can converge to a better optimum in a more stable way. Moreover, our empirical observations suggest that some basic training settings (e.g., the negative weight \(c_0\) ) are very important for the performance of WMF. For example, a different value of the negative weight can reduce the model accuracy significantly. In this situation, the performance of an actually effective method can be underestimated because of improper parameter settings [54]. This may explain why our finding differs somewhat from the findings in [30, 47, 61], where simple sampling-based methods are able to yield performance competitive with non-sampling methods.
(5)
EALS, which has a popularity-based weighting strategy, generally achieves better performance than WMF, most clearly on the Pinterest and Yelp2018 datasets. Items in real-world scenarios usually follow a long-tail distribution, and popular items with which a user does not interact are more likely to be true negatives. This observation is similar to previous work [31, 32], which shows that assigning weights according to item popularity can further boost the performance of non-sampling-based recommendation methods.
(6)
Considering the performance on each dataset, we can see that the relative effectiveness of non-sampling methods and sampling-based methods is related to the sparsity of the dataset. Generally, the sparser the data, the larger the advantage of non-sampling methods over sampling-based methods. For example, the improvements of WMF over BPR-Uniform are 1.7%, 5.7%, 11.4%, and 63.8% for Movielens-1m, Pinterest, Yelp2018, and Alibaba, respectively. The improvements increase as the data density decreases. This makes sense because it is more difficult to sample high-quality instances from sparse data than from dense data.
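The improvement percentages in finding (6) are consistent with averaging the relative gain of WMF over BPR-Uniform across the six metrics of Table 4. This reading is an assumption on our side, since the aggregation is not stated explicitly; the sketch below reproduces the arithmetic with values copied from Table 4.

```python
def average_relative_improvement(method, baseline):
    """Mean relative gain (in percent) across paired metric values."""
    gains = [(m - b) / b for m, b in zip(method, baseline)]
    return 100.0 * sum(gains) / len(gains)

# Rows from Table 4: Recall@10/20/50 followed by NDCG@10/20/50.
bpr_uniform = {
    "Movielens-1m": [0.1535, 0.2424, 0.4081, 0.3458, 0.3418, 0.3737],
    "Pinterest": [0.0783, 0.1309, 0.2462, 0.0670, 0.0891, 0.1279],
    "Yelp2018": [0.0323, 0.0563, 0.1096, 0.0370, 0.0457, 0.0655],
    "Alibaba": [0.0306, 0.0463, 0.0731, 0.0178, 0.0215, 0.0278],
}
wmf = {
    "Movielens-1m": [0.1572, 0.2449, 0.4034, 0.3584, 0.3518, 0.3787],
    "Pinterest": [0.0833, 0.1385, 0.2584, 0.0712, 0.0942, 0.1347],
    "Yelp2018": [0.0365, 0.0623, 0.1195, 0.0418, 0.0512, 0.0724],
    "Alibaba": [0.0513, 0.0763, 0.1192, 0.0289, 0.0355, 0.0444],
}
improvements = {
    name: average_relative_improvement(wmf[name], bpr_uniform[name])
    for name in wmf
}
for name, value in improvements.items():
    print(f"{name}: {value:.1f}%")
```

The computed averages grow monotonically as the density in Table 2 drops from 4.47% to 0.02%.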

5.3 Efficiency Analyses

Many previous studies focused only on obtaining better results and ignored computational efficiency [57]. In real-world systems, training efficiency is also an important factor that needs to be considered. In this section, we conduct an experiment to show the training efficiency of representative negative sampling and non-sampling methods. All the compared methods are re-implemented with TensorFlow and are run on the same machine for a fair comparison. The training time results are shown in Table 5. Note that the compared models all have the same network structure but use different learning strategies. We have the following key observations:
Model | Movielens-1m (S / I / T) | Pinterest (S / I / T) | Yelp2018 (S / I / T) | Alibaba (S / I / T)
BPR-Uniform | 24s / 500 / 250m | 21s / 500 / 175m | 26s / 500 / 216m | 12s / 1500 / 300m
BPR-DNS | 45s / 500 / 375m | 38s / 500 / 317m | 78s / 500 / 11h | 87s / 1500 / 37h
SRNS | 210s / 300 / 18h | 480s / 300 / 40h | 11m / 500 / 92h | 370s / 500 / 52h
WMF | 1s / 300 / 5m | 1.8s / 300 / 9m | 6.5s / 300 / 32m | 10.5s / 300 / 54m
EALS | 1s / 300 / 5m | 1.8s / 300 / 9m | 6.5s / 300 / 32m | 10.5s / 300 / 54m
Table 5. Comparisons of Runtime (Second/Minute/Hour [s/m/h])
“S,” “I,” and “T” represent the training time for a single iteration, the number of iterations to converge, and the total training time, respectively. WMF and EALS are trained through efficient non-sampling loss (Equation (14)).
(1)
The non-sampling methods WMF and EALS train much faster overall than the sampling-based methods. For example, on the four datasets, WMF and EALS need only 5 minutes, 9 minutes, 32 minutes, and 54 minutes, respectively, to reach their optimal performance. This can be attributed to three reasons: first, by leveraging the efficient non-sampling loss [10], the complexity of non-sampling learning is reduced from \(O(|{\bf U}||{\bf I}|d)\) to \(O((|{\bf U}|+|{\bf I}|)d^2+ |\mathcal {Y}|d)\) , which avoids a time-consuming traversal of all items; second, non-sampling methods generally require fewer iterations to reach their optimal performance; third, powerful samplers such as BPR-DNS and SRNS spend much more time on sampling.
(2)
Although stronger samplers can achieve better performance than uniform sampling, they generally need much more sampling time, especially on larger datasets. For example, on the Alibaba dataset, BPR-DNS and SRNS require 37 and 52 hours for training, respectively. Moreover, some samplers need to calculate the prediction scores of all items, which reduces the advantage of sampling and is very time-consuming during training.

5.4 Impact of Sampling Number

BPR pair-wise loss (Equation (1)) is the most widely used sampling-based learning strategy. However, existing BPR learning-based methods generally sample one negative instance for each positive user-item pair [6, 27, 52, 65]. The impact of negative sampling number has been largely ignored by existing studies. Here we investigate how the performance changes as the number of negative samples increases. The BPR pair-wise loss with multiple negative samples is calculated as follows:
\begin{equation} \begin{split}\mathcal {L}=-\sum _{(u, i) \in {\bf Y}}\sum _{j \in {\bf I} \backslash {\bf I}_u}^{K} \ln \sigma (\hat{y}(u, i)-\hat{y}(u, j)), \end{split} \end{equation}
(17)
where \(\sigma (x)= \frac{1}{1+e^{-x}}\) is a sigmoid function, \({\bf Y}\) is the training data, and \({\bf I}_u\) denotes the positive item set of user u. It repeats the positive instance \((u,i)\) multiple times to make it get a higher score than every negative item.
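Equation (17) with uniformly sampled negatives can be sketched as below. The function names and the scoring callback are hypothetical, rejection sampling is one possible way to draw negatives, and the sketch assumes that every user has at least one non-interacted item.

```python
import math
import random

def neg_log_sigmoid(x):
    """Numerically stable -ln(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

def bpr_loss_multi_negative(score, train_pairs, user_pos, num_items, num_neg, rng):
    """BPR loss where each positive pair is contrasted with num_neg negatives."""
    loss = 0.0
    for user, pos_item in train_pairs:
        for _ in range(num_neg):
            # Uniform rejection sampling over items the user has not interacted with.
            neg_item = rng.randrange(num_items)
            while neg_item in user_pos[user]:
                neg_item = rng.randrange(num_items)
            loss += neg_log_sigmoid(score(user, pos_item) - score(user, neg_item))
    return loss

pairs = [(0, 1), (0, 2), (1, 0)]
positives = {0: {1, 2}, 1: {0}}
# With a constant scorer every term equals ln 2, so the loss is 3 * 4 * ln 2.
loss = bpr_loss_multi_negative(lambda u, i: 0.0, pairs, positives, 5, 4, random.Random(0))
```

The positive pair is reused for each of its sampled negatives, which matches the repetition of \((u,i)\) described above.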
Figure 2 shows the performance of BPR-Uniform when varying the number of negative samples on the four datasets. For comparison, we also report the performance at each training epoch. We show the results on the Recall@20 metric in this section. The figure clearly shows that the number of negative samples matters when training a recommendation model with the BPR loss. Generally, it is beneficial to use more negative samples. For example, in Figure 2, sampling multiple negative instances for each positive pair outperforms using only one negative instance, and can even achieve results similar to those of non-sampling methods on the Movielens-1m dataset. Moreover, the results show that sampling more negative instances leads to faster convergence. This is because when more negative items are sampled, hard negative samples are more likely to be included and to provide more valuable gradient updates. Note that in this section we take BPR-Uniform as an example to show the impact of the number of negative samples; similar behavior can also be observed for other negative sampling methods [34]. The results suggest that sampling more negative instances for each positive pair is a promising setting for recommendation models trained with the BPR loss to gain higher performance. This also shows the potential of using non-sampling learning to improve the performance of recommendation systems. We therefore call for more consideration of the above settings when evaluating a newly proposed recommendation model.
Fig. 2. The impact of sampling number for BPR-Uniform on the four datasets.

5.5 Further Comparison

To answer research question 3, we further compare the state-of-the-art recommendation models NGCF [65] and LightGCN [27] with both sampling and non-sampling learning (with efficient loss) strategies to explore how negative sampling and non-sampling learning boost recommendation performance. The compared models are introduced as follows:
Neural Graph Collaborative Filtering (NGCF) [65]: This is one of the state-of-the-art graph-based recommendation models, which learns representations of users and items with a graph neural network. Specifically, each node obtains the transformed representations of its multi-hop neighbors. NGCF adopts the inner product to predict a user’s preference for an item: \(\hat{y}{(u,i)}=\mathbf {p}_{u}^T \mathbf {q}_{i}\) .
Light Graph Convolution Network (LightGCN) [27]: This is a state-of-the-art graph-based recommendation model. It simplifies the design of GCN by omitting the non-linear transformation and applying sum-based pooling. The model prediction is the same as that of NGCF.
We also compare with more advanced sampling-based approaches, including the GAN-based method AdvIR and graph-based methods MCNS and MixGCF, as follows:
AdvIR [49]: AdvIR is an adversarial sampler that incorporates adversarial sampling with adversarial training by adding adversarial perturbation.
MCNS [73]: Markov chain Monte Carlo negative sampling (MCNS) proposes to sample negatives by approximating the positive distribution and accelerates the process with the Metropolis-Hastings algorithm.
MixGCF [34]: MixGCF is designed for the GNN-based recommendation models and integrates multiple negatives to synthesize a hard negative by positive mixing and hop mixing.
To reduce the experimental workload and keep the comparison fair, we closely follow the settings of the MixGCF work [34]. The Yelp2018 and Alibaba datasets are exactly the same as those used in the MixGCF work, so we directly use the results of AdvIR, MCNS, and MixGCF reported in the MixGCF paper.
Table 6 shows the performance of the compared methods. From this table, we can make the following observations:
Method | Yelp2018 Recall@20 | Yelp2018 NDCG@20 | Alibaba Recall@20 | Alibaba NDCG@20
NGCF+Uniform | 0.0577 | 0.0469 | 0.0426 | 0.0197
NGCF+IRGAN | 0.0615 | 0.0502 | 0.0435 | 0.0200
NGCF+AdvIR | 0.0614 | 0.0500 | 0.0440 | 0.0203
NGCF+MCNS | 0.0625 | 0.0501 | 0.0430 | 0.0200
NGCF+MixGCF | 0.0688 | 0.0566 | 0.0544 | 0.0262
NGCF+WMF | 0.0638 | 0.0526 | 0.0755 | 0.0355
NGCF+EALS | 0.0655 | 0.0542 | 0.0742 | 0.0348
LightGCN+Uniform | 0.0628 | 0.0515 | 0.0584 | 0.0275
LightGCN+IRGAN | 0.0641 | 0.0527 | 0.0605 | 0.0280
LightGCN+AdvIR | 0.0624 | 0.0510 | 0.0583 | 0.0273
LightGCN+MCNS | 0.0658 | 0.0529 | 0.0632 | 0.0284
LightGCN+MixGCF | 0.0713 | 0.0589 | 0.0763 | 0.0357
LightGCN+WMF | 0.0627 | 0.0515 | 0.0735 | 0.0349
LightGCN+EALS | 0.0647 | 0.0533 | 0.0747 | 0.0353
Table 6. Performance of Different Recommendation Methods on Yelp2018 and Alibaba Datasets
The datasets are exactly the same as those used in the MixGCF work, so we directly report the results of AdvIR, MCNS, and MixGCF from the MixGCF paper.
(1)
Among the negative-sampling-based methods, MixGCF yields the best performance on the two datasets. This is because MixGCF is designed for GNN-based recommendation models, which augments the negative samples through hop mixing technique to offer a more informative gradient of model training.
(2)
Generally, compared to the widely used uniform sampling learning method, adopting a non-sampling learning strategy boosts the recommendation performance. For example, in Table 6, NGCF+WMF performs better than NGCF+Uniform on both datasets, and LightGCN+WMF clearly outperforms LightGCN+Uniform on Alibaba while performing on par with it on Yelp2018.
(3)
Compared with the results of WMF and EALS reported in Table 4, the improvements of NGCF+WMF and LightGCN+WMF are relatively small. This could be attributed to two reasons. First, GNN-based models are advantageous in learning user-item interactions; through embedding propagation, the collaborative signals are incorporated into the embedding process in an explicit manner. Second, due to the marginal effect, the additional boost from non-sampling learning is somewhat limited for such expressive models.
(4)
From Tables 4 and 6, we also observe that WMF and EALS can beat the state-of-the-art recommendation methods NGCF+Uniform and LightGCN+Uniform in most cases. This is remarkable, since a shallow MF framework has far fewer parameters. This result validates the importance of the choice of learning strategy. In fact, the recommendation model has been well studied by a large number of recent works, while the learning strategy usually becomes the bottleneck of the recommendation performance.

6 Discussions and Future Directions

In this section, we discuss some open issues and present several future directions.

6.1 Evaluation of Recommendation System

Evaluating recommender systems properly is known to be difficult, since it relies heavily on empirical results [54]. Recently, several critical studies have found that the improvement achieved with some complex models was observed only because the chosen baselines were weak or the parameters were not properly optimized [14, 15, 44, 54]. In this article, we conduct thorough comparisons between negative sampling and non-sampling learning with a careful setup of various representative methods. Although our empirical findings may not generalize to other tasks, we reveal that a simple recommendation model with non-sampling learning can outperform many advanced sampling-based methods. This fact is usually neglected in previous studies, since it has recently become common for research papers to compare newly proposed techniques only with sampling-based baselines [27, 65, 68, 69]. These results also encourage our community to revisit previously proposed recommendation models with fine-tuned training settings to better investigate their potential performance.
Due to the difficulty of evaluating recommender systems, different works usually report inconsistent results [15, 53], which makes the performance of existing methods not well understood. However, it is difficult to achieve reliable experiments by authors of a single paper, which requires a community effort. As such, we believe that a suite of benchmark datasets and well-tuned baselines should be further developed by our community.

6.2 Scalability of Negative Sampling and Non-sampling

From previous studies, we can see that existing neural recommendation models generally rely on uniform negative sampling to support efficient training. However, our findings have shown that uniform negative sampling leads to suboptimal performance. In fact, the recommendation model has been well studied by a large number of recent works, while the learning strategy usually becomes the bottleneck that limits the recommendation performance. Although some studies have proposed to replace uniform sampling with other sampling methods [17, 41, 51, 61] to improve the quality of negative samples, they usually integrate the proposed samplers only into simple recommendation models such as matrix factorization. The main reason is that these methods cannot meet the requirements of efficiency and effectiveness at the same time (see Table 5). For example, several state-of-the-art methods use complex structures such as GAN [61] to generate negative instances, which poses a new challenge for model efficiency. In this case, these methods can hardly be incorporated into state-of-the-art recommendation models such as GCN models. How to efficiently sample informative negative instances remains an important research question and deserves further exploration.
Recently, some recommendation studies also tried to apply the non-sampling learning strategy for model optimization. The results reported in those studies are consistent with our experiments, showing the superior ability of non-sampling learning for recommendation tasks. However, existing efficient non-sampling learning methods are only suitable for the recommendation models with a linear prediction layer, as shown in Figure 3, which limits the scalability and flexibility of model design. It would be beneficial if this kind of method can be applied to non-linear structures. Although challenging, it is a promising future work to propose a general efficient non-sampling learning method.
Fig. 3. An illustration of the two-tower recommendation structure.
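To illustrate why a linear (dot-product) prediction layer matters for efficiency, the sketch below checks a standard algebraic identity: the sum of squared dot-product predictions over all user-item pairs can be computed from two small d × d Gram matrices instead of the full prediction matrix. This is a simplified illustration with a uniform weight on all pairs, not the exact loss of any particular method. The matrix names `P` and `Q` (stacking the embeddings \({\bf p}_u\) and \({\bf q}_i\)) and the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 50, 80, 8
P = rng.normal(size=(num_users, dim))  # user embeddings, one row per user
Q = rng.normal(size=(num_items, dim))  # item embeddings, one row per item

# Naive computation: materialize every prediction p_u . q_i,
# which costs O(|U| * |I| * d) time.
naive = np.sum((P @ Q.T) ** 2)

# Decoupled computation: only two d x d Gram matrices are needed,
# which costs O((|U| + |I|) * d^2) time.
decoupled = np.sum((P.T @ P) * (Q.T @ Q))

print(np.isclose(naive, decoupled))
```

The rearrangement relies on the prediction being bilinear in the two embeddings. Once a non-linear layer is applied on top of the dot product, the sum over all pairs no longer factorizes in this way.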
Some recent studies have pointed out the positive effects of linear prediction [53]: (1) it is relevant to industry as it is more applicable, (2) it simplifies the modeling and learning process, and (3) it aligns better with other research areas such as image modeling and natural language processing, where the dot product is usually used. Some state-of-the-art recommendation models also adopt linear prediction for combining embeddings, such as NGCF [65], LightGCN [27], and KGAT [63]. Therefore, in addition to extending non-sampling to non-linear prediction structures, other promising directions for improving non-sampling recommendation models include (1) designing better embedding layers for users and items (f and g in Figure 2) through graph learning or causal modeling, (2) leveraging content features such as context information, reviews, social connections, and knowledge graphs, and (3) optimizing the model with multiple objective functions and multi-task learning.

6.3 Multi-task Learning

Multi-task learning performs joint training on different but correlated tasks in order to obtain a better model for each task [60]. Recently, multi-task learning has been widely applied to learning recommender systems with side information such as social connections, knowledge graphs, and users’ multiple behaviors [8, 11, 21]. However, although negative sampling has been widely applied in previous recommendation work, it is reasonable to argue that existing sampling methods are not well suited to optimizing a multi-task model. Specifically, to generate a training batch, sampling methods need to sample negative instances for each task. This introduces much more randomness than in single-task learning and inevitably leads to information loss. The poor performance of negative sampling for multi-task learning has been observed in previous work [7, 11]. Therefore, it would be interesting and valuable to design sampling methods that are better suited to multi-task learning.

6.4 Application in Realistic Scenarios

Most real-world recommendation systems [80] contain two stages: candidate generation and ranking. For very large systems that contain billions of items and users, both existing negative sampling and non-sampling methods are hard to apply directly because of their huge time or space complexity. In this case, a candidate generation step is usually applied first. For example, [43] uses co-occurrences of items to generate candidates, [19] applies a random walk on a (co-occurrence) graph, and [13] describes a hybrid approach using a mixture of features.
In practical recommender systems, where new users, items, and interactions are continuously streaming in, it is important to update the model in real time to best serve users. For non-sampling methods, a commonly used online learning strategy is incremental learning [32]. Given a new user-item interaction ( \(u,i\) ), incremental learning only performs optimization steps for \({\bf p}_u\) and \({\bf q}_i\) . The assumption is that a new interaction should not change the overall parameters much, but should change the embeddings of u and i significantly. For negative sampling, the advantage is that the parameters of sampling-based models are easy to update continuously with new data. The problem, however, is that existing state-of-the-art sampling methods cannot meet the requirements of effectiveness and efficiency at the same time. This, as we have discussed, deserves further exploration.
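The incremental learning idea above can be sketched as follows. This is a simplified illustration under stated assumptions, not the update rule of [32]: it uses plain gradient steps on a weighted squared loss, stores the interaction matrix densely for brevity, and the function name `incremental_update` and the hyperparameters `c0`, `lr`, `reg`, and `steps` are hypothetical.

```python
import numpy as np


def incremental_update(P, Q, R, u, i, c0=0.1, lr=0.01, reg=0.01, steps=5):
    """Register the new interaction (u, i) and refine only P[u] and Q[i].

    P, Q: user and item embedding matrices (rows are p_u and q_i).
    R: dense 0/1 interaction matrix (an assumption made for brevity).
    Loss per entry: w * (r - p.q)^2, with w = 1 on observed entries and
    w = c0 on missing entries, plus L2 regularization. The constant
    factor 2 of the gradient is folded into the learning rate.
    """
    R[u, i] = 1.0
    for _ in range(steps):
        # Gradient step on p_u over all items; other users are untouched.
        w_u = np.where(R[u] > 0, 1.0, c0)
        err_u = R[u] - Q @ P[u]
        P[u] += lr * ((w_u * err_u) @ Q - reg * P[u])
        # Gradient step on q_i over all users; other items are untouched.
        w_i = np.where(R[:, i] > 0, 1.0, c0)
        err_i = R[:, i] - P @ Q[i]
        Q[i] += lr * ((w_i * err_i) @ P - reg * Q[i])
    return P, Q


rng = np.random.default_rng(0)
P = 0.1 * rng.normal(size=(6, 4))
Q = 0.1 * rng.normal(size=(8, 4))
R = np.zeros((6, 8))
P, Q = incremental_update(P, Q, R, u=2, i=5)
```

Only row `u` of `P` and row `i` of `Q` are modified, which reflects the assumption that a single new interaction should mainly affect the embeddings of the user and item involved.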

7 Conclusion

In this work, we analyze two types of training strategies for recommendation with implicit feedback: negative sampling and non-sampling. Specifically, we first revisit the objectives of negative sampling and non-sampling. We then conduct thorough comparisons between the two with a careful setup of various representative methods. Our results empirically show that, although negative sampling has been widely applied in recent recommendation models, it is almost impossible for the widely used uniform sampling to achieve performance comparable to that of non-sampling learning methods. Moreover, we show that existing state-of-the-art sampling-based methods generally cannot meet the requirements of effectiveness and efficiency at the same time, which limits their applicability to complex recommendation models and online learning scenarios. Overall, while we do not argue that sampling-based methods are always weaker than non-sampling methods, we stress that these results are usually neglected in previous work, because it has recently become common for research papers to compare newly proposed techniques against sampling-based baselines only. We believe this work serves as a sanity check for negative sampling and non-sampling learning in implicit recommendation, and suggests that newly proposed recommendation models should be compared with more suitable baselines before claiming state-of-the-art effectiveness. Finally, we discuss several open problems and future research topics that are worth exploring further. We hope this work will be helpful to researchers and practitioners who are interested in the study of recommender systems and will inspire further research in this field.

References

[1]
Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web. 1341–1350.
[2]
Avishek Joey Bose, Huan Ling, and Yanshuai Cao. 2018. Adversarial contrastive estimation. arXiv preprint arXiv:1805.03642 (2018).
[3]
Liwei Cai and William Yang Wang. 2017. KBGAN: Adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071 (2017).
[4]
Chong Chen, Fei Sun, Min Zhang, and Bolin Ding. 2022. Recommendation unlearning. In Proceedings of the Web Conference 2022.
[5]
Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the Web Conference.
[6]
Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2019. Social attentional memory network: Modeling aspect-and friend-level differences in recommendation. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining. 177–185.
[7]
Chong Chen, Min Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Efficient non-sampling factorization machines for optimal context-aware recommendation. In Proceedings of the Web Conference 2020. 2400–2410.
[8]
Chong Chen, Min Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Jointly non-sampling learning for knowledge graph enhanced recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 189–198.
[9]
Chong Chen, Min Zhang, Chenyang Wang, Weizhi Ma, Minming Li, Yiqun Liu, and Shaoping Ma. 2019. An efficient adaptive transfer neural network for social-aware recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 225–234.
[10]
Chong Chen, Min Zhang, Yongfeng Zhang, Yiqun Liu, and Shaoping Ma. 2020. Efficient neural matrix factorization without sampling for recommendation. ACM Trans. Inf. Syst. 38, 2 (Jan. 2020), Article 14.
[11]
Chong Chen, Min Zhang, Yongfeng Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Efficient heterogeneous collaborative filtering without negative sampling for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 19–26.
[12]
Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval. 335–344.
[13]
Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for Youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[14]
Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems. 101–109.
[15]
Maurizio Ferrari Dacrema, Federico Parroni, Paolo Cremonesi, and Dietmar Jannach. 2020. Critically examining the claimed value of convolutions over user-item embedding maps for recommender systems. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management.
[16]
Robin Devooght, Nicolas Kourtellis, and Amin Mantrach. 2015. Dynamic matrix factorization with priors on unknown values. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 189–198.
[17]
Jingtao Ding, Yuhan Quan, Quanming Yao, Yong Li, and Depeng Jin. 2020. Simplify and robustify negative sampling for implicit collaborative filtering. Advances in Neural Information Processing Systems 33 (2020), 1094–1105.
[18]
Jingtao Ding, Guanghui Yu, Xiangnan He, Yuhan Quan, Yong Li, Tat-Seng Chua, Depeng Jin, and Jiajie Yu. 2018. Improving implicit recommender systems with view data. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3343–3349.
[19]
Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet, Mark Ulrich, and Jure Leskovec. 2018. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 1775–1784.
[20]
Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph neural networks for social recommendation. arXiv preprint arXiv:1902.07243 (2019).
[21]
Chen Gao, Xiangnan He, Dahua Gan, Xiangning Chen, Fuli Feng, Yong Li, Tat-Seng Chua, and Depeng Jin. 2019. Neural multi-task recommendation from multi-behavior data. In International Conference on Data Engineering (ICDE’19).
[22]
Hongchang Gao and Heng Huang. 2018. Self-paced network embedding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1406–1415.
[23]
Xue Geng, Hanwang Zhang, Jingwen Bian, and Tat-Seng Chua. 2015. Learning image and user features for recommendation in social networks. In Proceedings of the IEEE International Conference on Computer Vision. 4274–4282.
[24]
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. 2020. Bootstrap your own latent - A new approach to self-supervised learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (NeurIPS’20).
[25]
Aditya Grover and Jure Leskovec. 2016. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
[26]
Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 355–364.
[27]
Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[28]
Xiangnan He, Xiaoyu Du, Xiang Wang, Feng Tian, Jinhui Tang, and Tat-Seng Chua. 2018. Outer product-based neural collaborative filtering. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2227–2233.
[29]
Xiangnan He, Zhankui He, Xiaoyu Du, and Tat-Seng Chua. 2018. Adversarial personalized ranking for recommendation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 355–364.
[30]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the Web Conference 2017. 173–182.
[31]
Xiangnan He, Jinhui Tang, Xiaoyu Du, Richang Hong, Tongwei Ren, and Tat-Seng Chua. 2019. Fast matrix factorization with nonuniform weights on missing data. IEEE Transactions on Neural Networks and Learning Systems 31, 8 (2019), 2791–2804.
[32]
Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 549–558.
[33]
Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In 2008 8th IEEE International Conference on Data Mining. IEEE, 263–272.
[34]
Tinglin Huang, Yuxiao Dong, Ming Ding, Zhen Yang, Wenzheng Feng, Xinyu Wang, and Jie Tang. 2021. MixGCF: An improved training method for graph neural network-based recommender systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 665–674.
[35]
Binbin Jin, Defu Lian, Zheng Liu, Qi Liu, Jianhui Ma, Xing Xie, and Enhong Chen. 2020. Sampling-decomposable generative adversarial recommender. Advances in Neural Information Processing Systems 33 (2020), 22629–22639.
[36]
Yehuda Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 426–434.
[37]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
[38]
Walid Krichene and Steffen Rendle. 2020. On sampled metrics for item recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1748–1757.
[39]
Dongha Lee, SeongKu Kang, Hyunjun Ju, Chanyoung Park, and Hwanjo Yu. 2021. Bootstrapping user and item representations for one-class collaborative filtering. arXiv preprint arXiv:2105.06323 (2021).
[40]
Zelong Li, Jianchao Ji, Zuohui Fu, Yingqiang Ge, Shuyuan Xu, Chong Chen, and Yongfeng Zhang. 2021. Efficient non-sampling knowledge graph embedding. In Proceedings of the Web Conference 2021. 1727–1736.
[41]
Defu Lian, Qi Liu, and Enhong Chen. 2020. Personalized ranking with importance sampling. In Proceedings of the Web Conference 2020. 1093–1103.
[42]
Dawen Liang, Laurent Charlin, James McInerney, and David M. Blei. 2016. Modeling user exposure in recommendation. In Proceedings of the Web Conference 2016. 951–961.
[43]
David C. Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C. Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. 2017. Related pins at Pinterest: The evolution of a real-world recommender system. In Proceedings of the Web Conference 2017. 583–592.
[44]
Malte Ludewig and Dietmar Jannach. 2018. Evaluation of session-based recommendation algorithms. User Modeling and User-adapted Interaction 28, 4–5 (2018), 331–390.
[45]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[46]
Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based news recommendation for millions of users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1933–1942.
[47]
Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In 2008 8th IEEE International Conference on Data Mining. IEEE, 502–511.
[48]
Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. 2020. Bag of tricks for adversarial training. arXiv preprint arXiv:2010.00467 (2020).
[49]
Dae Hoon Park and Yi Chang. 2019. Adversarial sampling and training for semi-supervised information retrieval. In Proceedings of the Web Conference 2019. 1443–1453.
[50]
Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 701–710.
[51]
Steffen Rendle and Christoph Freudenthaler. 2014. Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. 273–282.
[52]
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. 452–461.
[53]
Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. 2020. Neural collaborative filtering vs. matrix factorization revisited. In 14th ACM Conference on Recommender Systems. 240–248.
[54]
Steffen Rendle, Li Zhang, and Yehuda Koren. 2019. On the difficulty of evaluating baselines: A study on recommender systems. arXiv preprint arXiv:1905.01395 (2019).
[55]
Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer.
[56]
Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the Web Conference 2001. 285–295.
[57]
Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI. arXiv preprint arXiv:1907.10597 (2019).
[58]
Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. 1067–1077.
[59]
Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Latent relational metric learning via memory-based attention for collaborative ranking. In Proceedings of the Web Conference 2018. 729–739.
[60]
Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining 3, 3 (2007), 1–13.
[61]
Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 515–524.
[62]
Menghan Wang, Mingming Gong, Xiaolin Zheng, and Kun Zhang. 2018. Modeling dynamic missingness of implicit feedback for recommendation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 6670–6679.
[63]
Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 950–958.
[64]
Xiang Wang, Xiangnan He, Liqiang Nie, and Tat-Seng Chua. 2017. Item silk road: Recommending items from information domains to social users. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 185–194.
[65]
Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
[66]
Xiang Wang, Yaokun Xu, Xiangnan He, Yixin Cao, Meng Wang, and Tat-Seng Chua. 2020. Reinforced negative sampling over knowledge graph for recommendation. In Proceedings of the Web Conference 2020. 99–109.
[67]
Lin Xiao, Zhang Min, Zhang Yongfeng, Liu Yiqun, and Ma Shaoping. 2017. Learning and transferring social and item visibilities for personalized recommendation. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management. 337–346.
[68]
Xin Xin, Bo Chen, Xiangnan He, Dong Wang, Yue Ding, and Joemon Jose. 2019. CFM: Convolutional factorization machines for context-aware recommendation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence.
[69]
Xin Xin, Xiangnan He, Yongfeng Zhang, Yongdong Zhang, and Joemon Jose. 2019. Relational collaborative filtering: Modeling multiple item relations for recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 125–134.
[70]
Xin Xin, Fajie Yuan, Xiangnan He, and Joemon M. Jose. 2018. Batch IS NOT heavy: Learning word representations from all samples. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1853–1862.
[71]
Feng Xue, Xiangnan He, Xiang Wang, Jiandong Xu, Kai Liu, and Richang Hong. 2019. Deep item-based collaborative filtering for top-N recommendation. ACM Transactions on Information Systems (TOIS) 37, 3 (2019), 33.
[72]
Ji Yang, Xinyang Yi, Derek Zhiyuan Cheng, Lichan Hong, Yang Li, Simon Xiaoming Wang, Taibai Xu, and Ed H. Chi. 2020. Mixed negative sampling for learning two-tower neural networks in recommendations. In Companion Proceedings of the Web Conference 2020. 441–447.
[73]
Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie Tang. 2020. Understanding negative sampling in graph representation learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1666–1676.
[74]
Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. 2019. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems. 269–277.
[75]
Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
[76]
Hsiang-Fu Yu, Mikhail Bilenko, and Chih-Jen Lin. 2017. Selection of negative samples for one-class matrix factorization. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 363–371.
[77]
Fajie Yuan, Xin Xin, Xiangnan He, Guibing Guo, Weinan Zhang, Tat-Seng Chua, and Jose M. Joemon. 2018. fBGD: Learning embeddings from positive unlabeled data with BGD. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence. 198–207.
[78]
Weinan Zhang, Tianqi Chen, Jun Wang, and Yong Yu. 2013. Optimizing top-n collaborative filtering via dynamic negative item sampling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 785–788.
[79]
Yongqi Zhang, Quanming Yao, Yingxia Shao, and Lei Chen. 2019. NSCaching: Simple and efficient negative sampling for knowledge graph embedding. In 2019 IEEE 35th International Conference on Data Engineering (ICDE’19). IEEE, 614–625.
[80]
Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending what video to watch next: A multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems. ACM, 43–51.
[81]
Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining. ACM, 425–434.
[82]
Xin Zhou, Aixin Sun, Yong Liu, Jie Zhang, and Chunyan Miao. 2021. SelfCF: A simple framework for self-supervised collaborative filtering. arXiv preprint arXiv:2107.03019 (2021).

    Published In

    ACM Transactions on Information Systems  Volume 41, Issue 1
    January 2023
    759 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3570137

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 February 2023
    Online AM: 24 March 2022
    Accepted: 24 February 2022
    Revised: 24 December 2021
    Received: 26 May 2021
    Published in TOIS Volume 41, Issue 1

    Author Tags

    1. Recommender systems
    2. non-sampling
    3. negative sampling
    4. implicit feedback

    Qualifiers

    • Research-article

    Funding Sources

    • Natural Science Foundation of China
    • Tsinghua University Guoqiang Research Institute
