Revisiting Negative Sampling vs. Non-sampling in Implicit Recommendation

Authors:

Shaoping Ma

ACM Transactions on Information Systems, Volume 41, Issue 1

Article No.: 12, Pages 1 - 25

https://doi.org/10.1145/3522672

Published: 25 February 2023

Abstract

Recommendation systems play an important role in alleviating the information overload issue. Generally, a recommendation model is trained to discern between positive (liked) and negative (disliked) instances for each user. However, under the open-world assumption, there are only positive instances but no negative instances from users’ implicit feedback, which poses the imbalanced learning challenge of lacking negative samples. To address this, two types of learning strategies have been proposed: the negative sampling strategy and the non-sampling strategy. The first strategy samples negative instances from missing data (i.e., unlabeled data), while the non-sampling strategy regards all the missing data as negative. Although learning strategies are known to be essential for algorithm performance, an in-depth comparison of negative sampling and non-sampling has not been sufficiently explored so far. To bridge this gap, we systematically analyze the role of negative sampling and non-sampling for implicit recommendation in this work. Specifically, we first theoretically revisit the objective of negative sampling and non-sampling. Then, with a careful setup of various representative recommendation methods, we explore the performance of negative sampling and non-sampling in different scenarios. Our results empirically show that although negative sampling has been widely applied to recent recommendation models, it is non-trivial for uniform sampling methods to show comparable performance to non-sampling learning methods. Finally, we discuss the scalability and complexity of negative sampling and non-sampling and present some open problems and future research topics that are worth further exploration.

1 Introduction

In the era of exponential information growth, recommender systems have been widely applied in today’s web platforms to alleviate the information overload issue and help users seek desired information and items [4, 5, 55]. The key to personalized recommendation systems is to model users’ preferences according to their historical interactions. There are mainly two challenges: (1) how to design an effective algorithm to model preference and (2) how to train the algorithm with limited user feedback. For the first challenge, with the rapid development of neural networks in recent years, many new methods have been proposed and have achieved significant improvements. For the second challenge, fewer attempts have been made. In fact, although many advanced recommendation models have been proposed recently, the learning strategy usually becomes the bottleneck of recommendation performance.

Generally, it is hard to collect explicit user preference over items in real scenarios, so user implicit feedback is widely adopted to model user interests, like clicks in news portals, purchases in e-commerce, and views on online video platforms. In implicit feedback data, the observed interaction represents a user’s positive preference for an item, while the other unobserved items are unlabeled. Only the positive feedback is observed; the negative feedback is mixed with missing values in unobserved data. Meanwhile, users usually interact with only a small number of items, compared to a huge number of items in real systems. Therefore, due to the lack of reliable negative data, learning a recommender system from implicit feedback data is very challenging. In implicit data, non-interacted items do not necessarily mean the user dislikes the items. For example, an unobserved user-item interaction can be caused by the user not seeing the item or the user seeing but not liking it. Thus, in unobserved data true-negative and unlabeled potential positive examples are mixed together. How to find out and utilize informative unobserved examples becomes the key to optimize the learning performance.

To address the problem of lacking negative samples when learning from implicit feedback data, two representative learning strategies have been proposed. The first strategy, named Negative Sampling [6, 30, 52], samples several negative items from the unlabeled data. The second strategy, named Non-sampling [9, 32, 33], takes all unlabeled items as negative and assigns them lower weights than positive samples. Both strategies have their own advantages and disadvantages: negative sampling is more efficient due to the limited number of training instances, but its performance may be affected by the low quality of sampled negative examples and slow convergence [9, 41, 70, 77]; the non-sampling strategy can generally achieve better performance as all the training data is fully utilized, but inefficiency can be an issue [10, 32].

Many studies have been done to improve negative sampling learning and non-sampling learning, respectively. Previous work on negative sampling mainly focused on replacing uniform sampling with other sampling methods [17, 41, 51, 61] to improve the quality of negative samples, such as drawing negative items from a popularity-based distribution [41] or from Generative Adversarial Network (GAN)-based models [35, 49, 61]. In another line of research, a number of methods have also been developed for non-sampling optimization to improve the training efficiency, including Alternating Least Squares (ALS)-based methods [32, 33] and mini-batch Stochastic Gradient Descent (SGD)-based methods [9, 10, 70].

To the best of our knowledge, although learning strategies are known to be important and there are many studies on improving negative sampling and non-sampling, respectively, an in-depth comparison of negative sampling and non-sampling in implicit recommendation has not been sufficiently explored so far. To this end, we systematically analyze the role of negative sampling and non-sampling in implicit recommendation through both theoretical considerations and experimental evaluations. Specifically, we first revisit the objective of negative sampling and non-sampling. Then, we conduct thorough comparisons between negative sampling and non-sampling strategies with careful setup when working with various representative methods. Finally, we discuss the scalability and complexity of negative sampling and non-sampling and present some open problems and research topics that are worth further exploration.

Our results empirically show that although negative sampling has been widely applied in recent recommendation models [6, 27, 30, 63, 65], there is still a large gap for the widely used uniform sampling to show comparable performance to non-sampling learning methods. To reduce the gap, it is important to improve the quality of sampling by applying advanced negative samplers. Moreover, it is shown that a uniformly weighted non-sampling learning method even outperforms many advanced sampling-based methods. Based on the above observations, we argue that the poor performance of negative sampling can make reported comparisons misleading if one does not recognize this bias, as existing neural recommendation methods typically rely on negative sampling for efficient optimization, and it has recently become common in research papers to compare newly proposed techniques only against sampling-based baselines [27, 65, 68, 69]. Previous work in the area of information retrieval and recommender systems has found that improvements achieved with some complex models were only observed because the chosen baseline methods were weak or not well trained [14, 15, 44, 54]. We believe this work presents a sanity check for negative sampling and non-sampling learning in implicit recommendation, providing a reference for future work on recommendation with implicit feedback data.

The main contributions of this article are summarized as follows:

(1) We theoretically revisit the objective and risk of negative sampling and non-sampling learning in implicit recommendation.

(2) We conduct thorough comparisons between representative negative sampling and non-sampling training strategies with various recommendation algorithms on their best settings on different datasets.

(3) Experimental results show that popular and representative negative-sampling-based recommendation methods generally do not perform better than simple baselines with non-sampling learning.

(4) We discuss the scalability and complexity of negative sampling and non-sampling and present some open problems and research topics that are worth further exploring.

2 Related Work

2.1 Item Recommendation

Early recommendation methods [36, 37] were mainly designed to model users’ explicit feedback such as ratings of movies. However, implicit feedback is actually easier to collect in real-world scenarios, such as clicks in news portals, purchases in e-commerce, and views on online video applications. For recommendation with implicit feedback data, many methods have been proposed [10, 30, 33, 52]. Since true negatives are mixed with unlabeled positives, these methods differ from each other in how they use unobserved data. Specifically, Hu et al. [33] proposed a non-sampling-based method, WMF, which treats all unobserved items as negative samples and assigns them a lower constant weight. Several later efforts [32, 42] improved WMF by applying different weighting strategies based on whether the unobserved items are likely to be truly negative. Different from non-sampling methods, Rendle et al. [52] proposed a sampling-based method, BPR, which optimizes the MF model based on the relative preference of a user for positive over negative items. The pairwise learning strategy with negative sampling has been widely adopted to optimize recommendation models [6, 12, 30, 59] and has become a dominant technique in recommendation.

With the development of deep learning techniques, much work has exploited different neural networks for recommender systems. The work [30] presented a Neural Collaborative Filtering (NCF) framework to jointly learn a matrix factorization and a feedforward neural network for implicit Top-K recommendation. NCF has been widely extended for different recommendation scenarios [12, 26, 64]. In recent years, it has become a trend to explore the application of advanced deep learning architectures for recommendation tasks, such as leveraging attention mechanisms [5, 12, 71], Recurrent Neural Networks (RNNs) [46], Convolutional Neural Networks (CNNs) [28, 81], GANs [29, 61], and Graph Neural Networks (GNNs) [20, 65]. Specifically, [65] proposed NGCF to model high-order connections by propagating embeddings on the user-item interaction graph. LightGCN [27] extends NGCF by removing the feature transformation and non-linear activation function to improve recommendation performance. Besides the above methods that mainly use negative sampling for model learning, there are also some neural recommendation models based on non-sampling learning. For example, Chen et al. derived a flexible non-sampling loss and designed several efficient non-sampling neural models for various recommendation scenarios [7, 9, 10, 11].

2.2 Model Training in Recommendation

There are two learning strategies that have been proposed before to learn from implicit feedback: negative sampling strategy [6, 30, 52] and non-sampling strategy [16, 33, 42, 67].

The negative sampling strategy samples negative instances from the unobserved data. Through sampling, the scale of training samples is greatly reduced; therefore, the training process is more efficient [30]. Negative sampling has been widely applied in many recommendation models, including traditional recommendation models like BPR [52] and neural models like NCF [30]. The most popular and widely used sampling strategy is uniform sampling (also called random sampling). However, uniform sampling often draws uninformative training instances, which contribute little to updating the model. To deal with this problem, many methods have been proposed in recent literature to replace uniform sampling with better samplers [17, 41, 51, 61] to improve the quality of negative samples. For example, [41, 76] proposed to sample negative examples based on the popularity of items. [17, 51] proposed to sample hard negative instances with higher prediction scores. [74] proposed a sampling-bias-corrected algorithm for estimating item frequency from streaming data. [72] proposed to use both batch and uniformly sampled negatives to tackle the selection bias of implicit recommendation. There are also some GAN-based methods in which the sampling probability adaptively evolves by optimizing adversarial objectives [29, 35, 49, 61]. Another kind of method samples the negative instances based on the structure of the graph. For example, [66] incorporated a knowledge graph into the negative sampling process to sample high-quality negative items. [75] sampled negative nodes according to their PageRank scores. [34] proposed a MixGCF method with the hop mixing and positive mixing strategies for GNN-based recommendation models. However, the above methods suffer from inefficiency issues since the sampling distribution changes dynamically and usually requires computing scores for all instances.

The non-sampling strategy treats all the unobserved data as negative, while assigning them a lower weight than positive examples. For example, WMF [33] assigned all unobserved entries a uniform weight. In EALS [32] and ExpoMF [42], the weight of unobserved entries depends on item popularity, based on the assumption that popular items are more likely to be seen by users, so they should be assigned higher weights as negatives. The non-sampling strategy has been shown to leverage the whole data with potentially better coverage [9, 32, 42, 70], but inefficiency can be an issue. To this end, a number of methods have been developed to accelerate the learning process, including batch-based ALS [1, 32, 33] and mini-batch SGD methods [9, 70]. However, this kind of method is only suitable for recommendation models with a linear prediction layer [1, 10, 70] and a regression loss function. Non-sampling methods have been widely applied in many traditional recommendation models [32, 33, 42] but have few applications in neural recommendation models. Recently, Chen et al. [10] proposed ENMF, which is a representative neural non-sampling recommendation model. To the best of our knowledge, although there are some works on improving negative sampling and non-sampling respectively, the in-depth comparison between them is left insufficiently explored. This is the main concern of this work.

The above methods are all based on discriminative modeling, which explicitly aims to distinguish positive user-item interactions from their negative counterparts. Recently, there has been another line of research that utilizes self-supervised learning for implicit recommendation training [39, 82]. Given a positive user-item interaction \((u, i)\) , the idea is to make representations for u and i similar to each other to encode the preference information. To address the potential model collapse problem, this kind of method generally applies two distinct encoder networks (i.e., online and target networks). Compared to discriminative modeling methods, self-supervised learning methods are more prone to collapsing into a trivial constant solution, so the hyper-parameters need to be carefully tuned [24, 39]. To make our work more focused, the methods discussed in this article are all based on discriminative modeling. We leave the exploration of self-supervised learning as future work.

2.3 Negative Sampling in Other Domains

Negative sampling has also been widely used in other domains and tasks of machine learning, such as graph embedding [73], word embedding [45], and network embedding [79]. For example, Word2Vec [45] draws negative samples according to word frequency, which is similar to the sampling process in recommendation. Later works on network and graph embedding [25, 50, 58] follow this setting. It has also been observed that negative instances with large scores (hard negatives) are more useful for model training [79]. Another recent work on negative sampling for graph learning shows that the negative sampling distribution should be positively but sub-linearly correlated to the positive sampling distribution [73]. There are also some GAN-based methods for the above tasks [2, 3, 22]. For non-sampling-based methods, [8, 40] proposed efficient non-sampling methods for learning knowledge graph embeddings.

Since the implicit recommendation is a different problem, the reliability of sampled negative instances is much harder to guarantee. In this article, we mainly focus on the comparison of negative sampling and non-sampling for the recommendation task and leave the exploration of other tasks as future work.

3 Understanding Negative Sampling and Non-sampling

In this section, we first present the problem formulation of recommendation with implicit feedback data (implicit recommendation), then revisit negative sampling and non-sampling regarding both objective and risk.

3.1 Implicit Recommendation

Table 1 shows the key notations used in this article. We denote user set U (including M users) and item set I (including N items). The implicit feedback data is denoted as an \(M \times N\) binary matrix Y, where \(y(u,i) \in \lbrace 0,1\rbrace\) denotes whether user u has interacted with item i or not. We use \(\mathcal {Y}\) to denote the set of observed entries in Y. Moreover, \({\bf I}_u\) is used to denote the positive item set of user u. Given a target user u, the task of implicit recommendation is to learn user u’s preference based on the implicit feedback data and recommend items that might be of interest to u.

Symbol	Description
\(M,N\)	number of users and items
\({\bf U}\)	user set
\({\bf I}\)	item set
\({\bf Y}\)	user-item interactions
\({\bf p}_u\)	latent factor of user u
\({\bf q}_i\)	latent factor of item i
\(\mathcal {Y}\)	the user-item pairs whose values are non-zero
\({\bf I}_u\)	the positive item set of user u
\(c(u,i)\)	the weight of entry \(y(u,i)\)
\(d\)	number of latent factors

Table 1. Notation

3.2 Negative Sampling

An implicit recommendation model with negative sampling learning strategy generally has three important components: scoring function \(\hat{y}\) , objective function \(\mathcal {L}\) , and negative sampling strategy \(p_{ns}\) [17]. The scoring function \(\hat{y}({\bf p}_u, {\bf q}_i, \Theta)\) predicts the preference of user \(u \in {\bf U}\) to item \(i \in {\bf I}\) based on u’s latent factor \({\bf p}_u\) and i’s latent factor \({\bf q}_i\) . It has been widely studied in previous work, including various models based on Matrix Factorization (MF) [29, 52, 61] and neural networks [26, 27, 30, 63, 65].

For the objective function, Bayesian Personalized Ranking (BPR) [52] is widely used in negative-sampling-based recommendation models and is also the most representative learning method. The formulation of BPR is as follows:

\begin{equation} \begin{split}(u, i, j) \in {\bf Y}: \Leftrightarrow i \in {\bf I}_u \wedge j \in {\bf I} \backslash {\bf I}_u\\ \mathcal {L}=-\sum _{(u, i,j) \in {\bf Y}} \ln \sigma (\hat{y}(u, i)-\hat{y}(u, j)), \end{split} \end{equation}

(1)

where \(\sigma (x)= \frac{1}{1+e^{-x}}\) is an activation function (sigmoid) and \({\bf Y}\) is the training data. The negative instance \((u, j)\) is sampled through a specific distribution \(p_{ns}(j | u)\) . \(\sigma (\hat{y}(u, i)-\hat{y}(u, j))\) is modeled as the likelihood that a user u prefers item i to item j, which approaches 1 when \(\hat{y}(u,i) \gg \hat{y}(u,j)\) . Minimizing \(\mathcal {L}\) is equivalent to making the scores of positive user-item interactions larger than those of negative user-item interactions. \(\ln \sigma (x)\) is actually a differentiable surrogate of the Heaviside function, so BPR approximately optimizes the ranking statistic AUC [41].

As the number of training pairs \(|{\bf Y}|\) is usually very large, learning algorithms are typically based on SGD. The gradient of the per-triple loss \(\mathcal {L}(u,i,j)=-\ln \sigma (\hat{y}(u, i)-\hat{y}(u, j))\) with respect to the model parameters is

\begin{equation} \begin{split}\frac{\partial \mathcal {L}(u,i,j)}{\partial \Theta }=-(1-\sigma (\hat{y}(u, i)-\hat{y}(u, j))) \frac{\partial (\hat{y}(u, i)-\hat{y}(u, j))}{\partial \Theta }. \end{split} \end{equation}

(2)

The gradient depends on how well the scoring model discriminates between the positive item i and the negative item j for user u. The factor \(1-\sigma (\hat{y}(u, i)-\hat{y}(u, j))\) is a probability, which is close to 0 if \(\hat{y}(u, i)\) already correctly receives a larger score than \(\hat{y}(u, j)\) , so the magnitude of \(\frac{\partial \mathcal {L}(u,i,j)}{\partial \Theta }\) is then close to 0 as well.
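The behavior of this gradient can be illustrated with a minimal sketch. It assumes a plain dot-product scoring function \(\hat{y}(u,i)={\bf p}_u^{\top }{\bf q}_i\) (one possible choice of scoring function, not the only one covered by the text) and differentiates the per-triple loss \(-\ln \sigma (\hat{y}(u, i)-\hat{y}(u, j))\) with respect to \({\bf p}_u\) only. The helper names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(p_u: list[float], q_i: list[float], q_j: list[float]) -> float:
    """Per-triple BPR loss -ln sigma(y(u,i) - y(u,j)) with dot-product scores (assumption)."""
    margin = dot(p_u, q_i) - dot(p_u, q_j)
    return -math.log(sigmoid(margin))


def gradient_scale(margin: float) -> float:
    """The factor 1 - sigma(margin) that multiplies the gradient of the score difference."""
    return 1.0 - sigmoid(margin)


def bpr_grad_p_u(p_u: list[float], q_i: list[float], q_j: list[float]) -> list[float]:
    """Gradient of the per-triple loss with respect to the user factor p_u."""
    margin = dot(p_u, q_i) - dot(p_u, q_j)
    scale = gradient_scale(margin)
    # d(margin)/d(p_u) = q_i - q_j; the minus sign comes from the -ln in the loss.
    return [-scale * (a - b) for a, b in zip(q_i, q_j)]
```

For a well-separated pair, `gradient_scale` is tiny (for a margin of 10 it is below 1e-4), whereas for an undecided pair with margin 0 it equals 0.5. This is the vanishing-gradient effect for easy negatives.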

From the above gradient, we analyze the influence of negative sampling for both sampled right examples (i.e., examples that are really negative to user u) and false examples (i.e., examples that are positive to user u in the test set). For a sampled right negative example j, positive item i can be easily distinguished from item j (i.e., \(\sigma (\hat{y}(u, i)-\hat{y}(u, j)) \rightarrow 1\) ) when the model is well trained, so that the triple \((u, i, j)\) will contribute little to the model learning because its gradient vanishes ( \(\frac{\partial \mathcal {L}(u,i,j)}{\partial \Theta } \rightarrow 0\) ). This issue is particularly prominent when the sampling distribution is uniform. In recommender systems, item popularity is typically non-uniformly distributed, and overall positive observations have a tailed distribution. It is very likely that the model score \(\hat{y}(u,j)\) of a uniformly sampled item j is smaller than \(\hat{y}(u,i)\) and thus the gradient magnitude is also small. Therefore, the widely used uniform sampling strategy is generally very slow to converge and struggles to achieve optimal performance.

To avoid this, the prediction scores of items are used for sampling in previous work, e.g., AOBPR and BPR-DNS [49, 51]. These kinds of methods are designed to sample more hard instances, i.e., unobserved items with higher prediction scores. However, since items with higher prediction scores are more likely to be actually positive in the test set, these kinds of methods suffer from false examples more markedly, which will hurt the model’s performance and robustness since they are sampled as negative instances during training.

From the above analyses of negative instances in the training of implicit recommendation, we can conclude that the difficulty of negative sampling is how to sample negative examples with large scores to increase the convergence speed and avoid false-negative instances to maintain robustness.

3.3 Non-sampling

A recommendation model with the non-sampling learning strategy typically creates the training data from \(\mathcal {Y}\) by giving pairs \(y(u, i)\in \mathcal {Y}\) a positive label and all other unobserved interactions \({\bf Y}\backslash \mathcal {Y}\) a negative label:

\begin{equation} y(u,i)=\left\lbrace \begin{array}{ll}{1,} & {\text{ if interaction (user } u, \text{ item } i) \text{ is observed; }} \\ {0,} & {\text{ otherwise. }} \end{array}\right. \end{equation}

(3)

Then the recommendation model is fitted to this data. The widely used non-sampling-based objective function is a weighted regression function, which assigns a training weight to each interaction in the implicit matrix:

\begin{equation} \begin{split}\mathcal {L}=\sum _{u \in \mathbf {U}} \sum _{i \in \mathbf {I}} c(u, i)\left(y(u, i)-\hat{y}(u,i)\right)^{2}, \end{split} \end{equation}

(4)

where \(c(u,i)\) denotes the weight of entry \(y(u,i)\) .

Through the above objective function, the model is optimized to predict the value 1 for elements in \(\mathcal {Y}\) and 0 for the rest. The purpose of this kind of method is to impose the gravity regularizer to penalize the prediction for missing items [41]. One of the weaknesses of this approach is that all candidate ranking items are presented to the recommendation model as negative instances during training, which means that a model with enough expressiveness cannot generate a reasonable ranking list at all as it predicts only 0.

Besides, considering that the amount of missing data can be much larger than the number of user-interacted items in real applications, it is more desirable to assign the missing data a lower weight to address the class imbalance issue. A weighting strategy for unobserved examples in non-sampling methods plays a similar role to the negative sampling strategy in sampling-based methods [32, 42]. The difference is that with a weighting strategy all unobserved samples are used, whereas with a sampling strategy some samples might never be used during training, which may hurt generalizability or leave the model untrained for certain types of input, lowering its robustness. Methods with non-uniform weighting can model negative instances with a faster coverage but generally require more computational cost due to the dense weight matrix over all user-item pairs. Although some methods have been proposed to accelerate learning from the whole data, one challenge remains in existing non-sampling approaches: the fast non-sampling learning methods are only suitable for recommendation models with a linear prediction layer.

4 Methods for Empirical Study

To explore the performance of negative sampling and non-sampling in different scenarios, we compare various state-of-the-art and representative learning methods for Top-K recommendation with implicit feedback data. Specifically, we focus our analysis on seven methods, which can be classified into two groups based on whether they utilize negative sampling or non-sampling learning strategies. Negative sampling methods include BPR-Uniform Sampling [52], AOBPR [51], BPR-DNS [78], IRGAN [61], and SRNS [17]. Non-sampling-based methods include WMF [33] and EALS [32]. We will briefly review these methods in this section.

4.1 BPR-Uniform Sampling

Uniform sampling with the BPR pair-wise loss function [52] is one of the most widely used classical solutions for item recommendation with implicit feedback. Uniform negative sampling over unobserved items is also called random sampling, which is based on the simplest uniform proposal. We use \(Q(j|u,i)\) to represent the probability that a negative item j is sampled for a positive pair \((u, i)\) , which is defined as

\begin{equation} Q(j|u,i)=\left\lbrace \begin{array}{ll}{0,} & {\text{ if}\ j \in {\bf I}_u} \\ {\frac{1}{N-N_u},} & {\text{ otherwise,}} \end{array}\right. \end{equation}

(5)

where \(N_u=|{\bf I}_u|\) denotes the number of user u’s interacted items.
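A minimal sketch of this sampler is given below. It assumes items are identified by integers `0..N-1` and uses rejection sampling, which is one common way to realize the uniform distribution over the \(N-N_u\) unobserved items; the function name and arguments are hypothetical.

```python
import random


def sample_uniform_negative(num_items: int, positives: set[int], rng: random.Random) -> int:
    """Draw one item uniformly from the items the user has not interacted with."""
    if len(positives) >= num_items:
        raise ValueError("user has no unobserved items to sample from")
    while True:
        j = rng.randrange(num_items)  # uniform over all N items
        if j not in positives:        # reject items in I_u, so Q(j|u,i) = 0 for them
            return j
```

Because users typically interact with only a small fraction of the items, the rejection loop rarely repeats, which keeps the per-sample cost close to constant.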

Although uniform sampling has been successfully applied in numerous recommender applications and for a variety of models, it is shown that convergence slows down considerably if the number of items is large and the overall item popularity is tailed. Both properties are common to most real-world datasets.

4.2 AOBPR

AOBPR (Adaptive Oversampling BPR) [51] is designed to improve on the uniform sampling strategy by adaptively sampling hard instances, i.e., unobserved items with higher prediction scores, which are hard for the algorithm to discriminate. Intuitively, when a negative item j is to be sampled for a positive pair \((u, i)\) , the closer j is to the top of the ranking, the more informative j is. The sampling distribution of AOBPR is defined as

\begin{equation} Q(j|u,i) \propto \left\lbrace \begin{array}{ll}{0,} & {\text{ if}\ j \in {\bf I}_u} \\ {\text{exp}(\frac{-\hat{r}(u,j)}{\lambda }),} & {\text{ otherwise,}} \end{array}\right. \end{equation}

(6)

where \(\hat{r}(u,j)\) is the rank of item j among all items I using the score function \(\hat{y}(u,j)\) for ordering items, and \(\lambda\) is a hyper-parameter to control the skewness of the distribution.
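The rank-based distribution can be sketched as follows. This is an illustration under stated assumptions, not the AOBPR implementation: ranks start at 0 for the top-scored item, the weights are explicitly normalized to sum to 1, and all item scores for the user are available as a list. The function name is hypothetical.

```python
import math


def aobpr_distribution(scores: list[float], positives: set[int], lam: float) -> list[float]:
    """Rank-based sampling probabilities over items for one user."""
    # Rank every item by descending score; this needs the scores of all N items.
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    rank = {j: r for r, j in enumerate(order)}  # rank 0 = top of the list (assumption)
    weights = [
        0.0 if j in positives else math.exp(-rank[j] / lam)
        for j in range(len(scores))
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no unobserved items to sample from")
    return [w / total for w in weights]
```

A smaller `lam` concentrates the probability mass on the top-ranked unobserved items, while a large `lam` moves the distribution toward uniform sampling.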

The limitations of AOBPR are (1) it needs to compute the scores of all items in order to determine their sampling probability, which reduces the advantage of sampling and is time-consuming in the training process, and (2) the items with higher prediction scores are more likely to be positive in the test set, and thus AOBPR is more vulnerable to false samples than uniform sampling.

4.3 BPR-DNS

The motivation of BPR-DNS (Dynamic Negative Sampling) [78] is similar to AOBPR, which also tries to adaptively sample hard instances. Different from AOBPR, DNS first randomly selects a set of negative samples, then uses the item with the highest prediction score for model optimization. The training process of DNS is shown in Algorithm 1.

The limitation of BPR-DNS is that it is also vulnerable to false samples since items with higher prediction scores are more likely to be positive in the test set. Compared to AOBPR, the advantage of BPR-DNS is that it only needs to compute the scores of n items for each sampling, and a small number of n (generally no more than 32) is enough [78].
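The selection step described above can be sketched as follows. The sketch follows the textual description only and is not a reproduction of Algorithm 1; it assumes integer item ids, candidates drawn uniformly with replacement, and a hypothetical callable `score_fn` that returns \(\hat{y}(u,j)\) for the current user.

```python
import random
from typing import Callable


def dns_sample(
    score_fn: Callable[[int], float],
    num_items: int,
    positives: set[int],
    n: int,
    rng: random.Random,
) -> int:
    """Pick the highest-scored item among n uniformly drawn unobserved candidates."""
    if len(positives) >= num_items:
        raise ValueError("user has no unobserved items to sample from")
    candidates = []
    while len(candidates) < n:
        j = rng.randrange(num_items)
        if j not in positives:  # keep only unobserved items
            candidates.append(j)
    # Only n scores are computed, instead of the scores of all items.
    return max(candidates, key=score_fn)
```

With `n = 1` this reduces to uniform sampling, and larger `n` makes the selected negative progressively harder.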

4.4 IRGAN

IRGAN [61], a GAN-style negative sampling method, applies a mini-max game for information retrieval tasks like implicit recommendation. Specifically, IRGAN includes two components: a generator G and a discriminator D. The generator is to produce more and more difficult negative samples, and the discriminator is to minimize the discrimination objective function \(\mathcal {J}(D, G)\) . For recommendation with implicit feedback data, IRGAN optimizes the following objective function:

\begin{equation} \max _{G} \min _{D} \mathcal {J}(D, G)=\sum _{u \in {\bf U}}-\mathbb {E}_{i \sim P_{\text{pos }}(\cdot \mid u)} \log D(i \mid u)-\mathbb {E}_{j \sim P_{G}(\cdot \mid u)} \log (1-D(j \mid u)), \end{equation}

(7)

where \(P_{\text{pos }}(\cdot \mid u)\) is a positive relevance distribution and \(P_{G}(\cdot \mid u)\) is a probability distribution used to generate negative instances. \(D(i\mid u)\) estimates the probability that user u prefers item i. When the discriminator D and the generator G are well trained, G is used for the prediction of recommendation.

Although the GAN-style methods have shown promising results for discovering informative negative samples, it is usually very time-consuming to generate negative instances from the generator G, which limits its application ability for large-scale datasets.

4.5 SRNS

SRNS (Simplified and Robust Negative Sampling) [17] is a recently proposed negative sampling method, which is based on the finding that false-negative samples (underlying positive) generally have large prediction scores for many training epochs. SRNS captures the dynamic sampling distribution of negative samples through a memory-based component. To evaluate the quality of negative samples, SRNS also proposes a high-variance-based strategy. The training process of SRNS is shown in Algorithm 2.

The variance-based sampling strategy is proposed to avoid false-negative instances by preferring high-variance candidates, which is defined as

\begin{equation} j=\arg \max _{k \in \mathcal {M}_{u}} P_{\mathrm{pos}}(k \mid u, i)+\alpha _{t} \cdot \operatorname{std}\left[P_{\mathrm{pos}}(k \mid u, i)\right], \end{equation}

(8)

where \(P_{\mathrm{pos}}(k \mid u, i)=\mathrm{sigmoid} (\hat{y}(u,i) -\hat{y}(u,k))\) and \(\operatorname{std}[P_{\mathrm{pos}}(k \mid u, i)]\) denotes the standard deviation of the prediction over the latest few epochs, which measures the prediction variance. \(\alpha _{t}\) is a hyper-parameter to control the importance of the variance.
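The selection rule in Equation (8) can be sketched as below. The sketch takes the recent values of \(P_{\mathrm{pos}}(k \mid u, i)\) for each memory candidate as given input, so it does not depend on how those values are computed. It assumes the last entry of each history is the current value and uses the population standard deviation; both are illustrative choices, and the function name is hypothetical.

```python
import statistics


def srns_select(p_pos_history: dict[int, list[float]], alpha: float) -> int:
    """Return the candidate k maximizing P_pos + alpha * std[P_pos].

    p_pos_history maps each candidate in the memory M_u to its P_pos values
    over the latest few epochs (last entry = current epoch, an assumption).
    """
    def criterion(k: int) -> float:
        history = p_pos_history[k]
        return history[-1] + alpha * statistics.pstdev(history)

    return max(p_pos_history, key=criterion)
```

With `alpha = 0` the rule ignores variance entirely; increasing `alpha` favors candidates whose predictions fluctuate across epochs over candidates with stable predictions.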

The memory strategy is to dynamically update \(\mathcal {M}_{u}\) while including more hard negative samples. The new \(\mathcal {M}_{u}\) is updated by sampling \(S_1\) instances from an extended memory that merges the old \(\mathcal {M}_{u}\) and a set of randomly sampled instances. The sampling probability distribution is as follows:

\begin{equation} Q(j|u,\mathcal {M}_{u} \cup \mathcal {M}_{u}^{\prime })=\exp \left(\hat{y}(u,j) / \tau \right) / \sum _{k^{\prime } \in \mathcal {M}_{u} \cup \mathcal {M}_{u}^{\prime }} \exp \left(\hat{y}(u,k^{\prime }) / \tau \right)\!, \end{equation}

(9)

where \(\tau\) is a temperature parameter; a lower \(\tau\) makes \(Q(j|u,\mathcal {M}_{u} \cup \mathcal {M}_{u}^{\prime })\) concentrate more on instances with large scores.
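The score-based memory update of Equation (9) amounts to a temperature-scaled softmax over the extended memory. The sketch below is illustrative only: the helper names are hypothetical, and drawing the \(S_1\) instances with replacement is an assumption rather than a detail confirmed by the text.

```python
import math
import random


def softmax_probs(scores, tau):
    """Q(j | u, extended memory) proportional to exp(score_j / tau)."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp((s - top) / tau) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


def update_memory(old_memory, random_candidates, score_fn, s1, tau, rng):
    """Draw s1 items from the union of the old memory and fresh random candidates."""
    extended = list(dict.fromkeys(list(old_memory) + list(random_candidates)))
    probs = softmax_probs({j: score_fn(j) for j in extended}, tau)
    weights = [probs[j] for j in extended]
    # Assumption: sampling with replacement.
    return rng.choices(extended, weights=weights, k=s1)


rng = random.Random(0)
scores = {"x": 2.0, "y": 1.0, "z": 0.0, "w": -1.0}  # toy prediction scores
flat = softmax_probs(scores, tau=10.0)   # high temperature: close to uniform
sharp = softmax_probs(scores, tau=0.1)   # low temperature: mass on the top score
new_memory = update_memory(["x", "y"], ["z", "w", "x"], scores.get,
                           s1=2, tau=1.0, rng=rng)
```

Comparing `flat` and `sharp` shows the effect described above: lowering \(\tau\) moves almost all probability mass to the highest-scored candidate.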

The main training cost of SRNS comes from the variance-based sampling and the score-based memory update introduced above. Specifically, SRNS needs to compute the scores of \(S_1 + S_2\) candidates for each positive instance and then sample \(S_1\) of them based on the computed scores [17]. According to the original paper, small values of \(S_1\) and \(S_2\) are generally sufficient, which makes SRNS more efficient than methods that require computing the scores of all items.

4.6 WMF

Different from sampling-based learning methods that mainly focus on improving the quality of negative samples, the key of non-sampling learning is to assign proper weights for unobserved user-item interactions. WMF [33] is a well-known non-sampling learning method, which assigns all unobserved user-item interactions with the same uniform weight:

\begin{equation} c(u,i)=\left\lbrace \begin{array}{ll}{c_{1}} & {\text{ if } y(u,i)=1 }\\ {c_{0}} & {\text{ if }y(u,i)=0, } \end{array}\right. \end{equation}

(10)

where \(c_0\) and \(c_1\) are hyper-parameters that need to be tuned for each dataset. To simplify the tuning process, \(c_1\) is usually set to 1, while \(c_0\) is set to a value smaller than \(c_1\) to address the imbalanced optimization issue [33]. For instance, in our experiments, \(c_0\) is selected from [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7] and is set to 0.5, 0.05, 0.05, and 0.01 for the Movielens-1m, Pinterest, Yelp2018, and Alibaba datasets, respectively (see Table 3). Generally, this parameter is related to the sparsity of the data: the sparser the data, the smaller the value of \(c_0\) that tends to perform best. Previous studies [18, 33, 62] have demonstrated that this strategy performs better than using the same weight for all user-item interactions in the recommendation task.

4.7 EALS

The uniform weight strategy assumes that all unobserved user-item interactions carry the same level of negative signal, which is too simple to model real-world scenarios. Intuitively, a popular item is more likely to have been seen by users, and thus it should receive a higher weight as a negative when a user has not interacted with it [32, 42]. Specifically, the EALS [32] method assigns unobserved user-item interactions non-uniform weights according to item popularity:

\begin{equation} c(u,i)=\left\lbrace \begin{array}{ll}{c_{1}} & {\text{ if } y(u,i)=1 }\\ {c_{i}^-} & {\text{ if }y(u,i)=0, } \end{array}\right. \end{equation}

(11)

where \(c_{i}^-\) is defined as

\begin{equation} \begin{split}c_{i}^{-}=c_{0}\frac{m_i^{x}}{\sum _{j=1}^{N}m_j^{x}}&; \ m_i=\frac{|\mathcal {Y}_i|}{\sum _{j=1}^{N}|\mathcal {Y}_j|}, \end{split} \end{equation}

(12)

where \(\mathcal {Y}_i\) denotes the positive interactions of item i, \(m_i\) denotes the popularity (frequency) of item i in \({\bf Y}\), \(c_{0}\) determines the overall weight of unobserved data, and x controls the significance level of popular items over unpopular ones. To simplify the tuning process, x is usually set to 0.5, as suggested in [32].
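As a small worked example of Equation (12), the sketch below computes the popularity-aware negative weights from per-item interaction counts. The counts and the function name are hypothetical and serve only to illustrate the formula.

```python
def eals_negative_weights(item_counts, c0, x=0.5):
    """c_i^- = c0 * m_i^x / sum_j m_j^x, with m_i = |Y_i| / sum_j |Y_j| (Equation (12))."""
    total = sum(item_counts)
    popularity = [count / total for count in item_counts]   # m_i
    powered = [m ** x for m in popularity]                  # m_i^x
    norm = sum(powered)
    return [c0 * p / norm for p in powered]


counts = [100, 25, 4, 1]  # hypothetical numbers of positive interactions per item
weights = eals_negative_weights(counts, c0=1.0, x=0.5)
uniform = eals_negative_weights(counts, c0=1.0, x=0.0)
```

Because the weights are normalized, they always sum to \(c_0\); with x = 0.5 they are proportional to the square roots of the counts (here 10 : 5 : 2 : 1), and with x = 0 every item receives the same weight, which corresponds to a uniform weighting as in WMF up to the scale of \(c_0\).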

4.8 Efficient Non-sampling Loss

The difficulty of applying non-sampling learning to implicit recommendation lies in its expensive computational cost. For example, the complexity of computing Equation (4) is \(O(|{\bf U}||{\bf I}|d)\), which is generally unaffordable since in the real world \(|{\bf U}||{\bf I}|\) can easily reach the billion level. Several methods have been proposed [9, 32, 70, 77] to address the inefficiency issue of non-sampling learning. Specifically, recent studies [9, 10] propose an efficient loss for two-tower-based recommendation models, and prove that for a generalized matrix factorization framework whose prediction function is Equation (13), the gradient of the loss in Equation (4) is exactly equal to that of Equation (14) if the weight \(c(u,i)\) is simplified to \(c_i\):

\begin{equation} \hat{y}{(u,i)}=\mathbf {h}^{T}\left(\mathbf {p}_{u} \odot \mathbf {q}_{i}\right) \end{equation}

(13)

\begin{equation} \begin{split}\tilde{\mathcal {L}}(\Theta)&=\sum _{u \in \mathbf {U}}\sum _{i \in \mathbf {I}_{u}}\left((c_{i}^{+}-c_{i}^{-})\hat{y}{(u,i)}^2-2c_{i}^{+}\hat{y}{(u,i)}\right)\\ &+\sum _{j=1}^d\sum _{k=1}^d \left((h_{j}h_{k}) \left(\sum _{u \in \mathbf {U}} p_{u,j}p_{u,k}\right) \left(\sum _{i \in \mathbf {I}} c_i^{-}q_{i,j}q_{i,k} \right) \right), \end{split} \end{equation}

(14)

where \(\mathbf {p}_u \in \mathbb {R}^d\) and \(\mathbf {q}_i\in \mathbb {R}^d\) are embeddings of user u and item i, \(\odot\) denotes the element-wise product, and \(\mathbf {h} \in \mathbb {R}^d\) is the prediction vector.

The complexity of Equation (14) is \(O((|{\bf U}|+|{\bf I}|)d^2+ |\mathcal {Y}|d),\) while that of Equation (4) is \(O(|{\bf U}||{\bf I}|d)\) . Since \(|\mathcal {Y}|\) is the number of observed user-item interactions and \(|\mathcal {Y}|\ll |{\bf U}||{\bf I}|\) in practice, the complexity is greatly reduced. The proof of this method can be found in [9, 10]. To avoid repetition, it is omitted here.
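The equivalence can be checked numerically on a toy problem. The sketch below assumes that Equation (4) is the weighted squared loss \(\sum_{u}\sum_{i} c(u,i)(y(u,i)-\hat{y}(u,i))^2\) over all user-item pairs; under that assumption the full loss and Equation (14) differ only by the constant \(\sum_{(u,i)\in \mathcal{Y}} c_i^{+}\), so their gradients coincide. All sizes, weights, and names are illustrative.

```python
import random


def predict(h, p, q):
    # y_hat(u, i) = h^T (p_u * q_i), Equation (13)
    return sum(hj * pj * qj for hj, pj, qj in zip(h, p, q))


def full_loss(P, Q, h, pos, c_pos, c_neg):
    """Assumed form of Equation (4): loops over every (u, i) pair, O(|U||I|d)."""
    loss = 0.0
    for u, p in enumerate(P):
        for i, q in enumerate(Q):
            y_hat = predict(h, p, q)
            if i in pos[u]:
                loss += c_pos[i] * (1.0 - y_hat) ** 2
            else:
                loss += c_neg[i] * y_hat ** 2
    return loss


def efficient_loss(P, Q, h, pos, c_pos, c_neg):
    """Equation (14): a sum over positives plus a d x d term built from cached sums."""
    d = len(h)
    loss = 0.0
    for u, items in enumerate(pos):
        for i in items:
            y_hat = predict(h, P[u], Q[i])
            loss += (c_pos[i] - c_neg[i]) * y_hat ** 2 - 2.0 * c_pos[i] * y_hat
    for j in range(d):
        for k in range(d):
            sum_p = sum(p[j] * p[k] for p in P)
            sum_q = sum(c * q[j] * q[k] for c, q in zip(c_neg, Q))
            loss += h[j] * h[k] * sum_p * sum_q
    return loss


rng = random.Random(0)
n_users, n_items, d = 5, 7, 3
P = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_users)]
Q = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_items)]
h = [rng.uniform(-1, 1) for _ in range(d)]
pos = [set(rng.sample(range(n_items), 2)) for _ in range(n_users)]
c_pos = [1.0] * n_items
c_neg = [0.1 * (i + 1) / n_items for i in range(n_items)]

# The two losses should differ exactly by this parameter-independent constant.
constant = sum(c_pos[i] for items in pos for i in items)
gap = (full_loss(P, Q, h, pos, c_pos, c_neg)
       - efficient_loss(P, Q, h, pos, c_pos, c_neg))
```

The efficient version never iterates over unobserved pairs: the second term only needs the two \(d \times d\) sums over users and over items, which is where the \(O((|{\bf U}|+|{\bf I}|)d^2)\) part of the complexity comes from.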

Note that this method does not directly calculate the scores of all items. Instead, it reformulates the loss over all negative instances through a partition and a decouple operation to achieve speedup. As such, it cannot be applied to the above sampling-based methods, which require computing a high number of item scores.

The above efficient loss can also be applied to a common matrix factorization recommendation model (i.e., \(\hat{y}{(u,i)}=\mathbf {p}_{u}^T \mathbf {q}_{i}\) ). It is used in our experiment to train non-sampling methods WMF and EALS.

5 Experiments

In this section, we conduct experiments to explore the performance of negative sampling and non-sampling algorithms in different scenarios. We aim to answer the following research questions:

•

How do negative sampling and non-sampling algorithms perform with the standard matrix factorization method for Top-K recommendation?

•

How efficient are negative sampling and non-sampling methods?

•

How do negative sampling and non-sampling training algorithms boost state-of-the-art recommendation methods?

In the following part, we first introduce the experimental settings, followed by answering the above research questions.

5.1 Experimental Settings

5.1.1 Data.

Four real-world, publicly available datasets that are widely used in previous literature [10, 17, 30, 41, 72] are used in our experiments: Movielens-1m,¹ Pinterest,² Yelp2018,³ and Alibaba.⁴ The statistics of the four datasets are shown in Table 2. We briefly introduce the four datasets:

Table 2.

Dataset	#User	#Item	#Interaction	Density
Movielens-1m	6,040	3,706	1,000,209	4.47%
Pinterest	55,187	9,916	1,500,809	0.27%
Yelp2018	31,668	38,048	1,561,406	0.13%
Alibaba	106,042	53,591	907,407	0.02%

Table 2. Statistical Details of the Evaluation Datasets

•

Movielens-1m: This is a widely used movie rating dataset, which contains about one million ratings on a scale from 1 to 5. Since we focus on learning from implicit feedback, we follow the widely used pre-processing method to convert the ratings into implicit feedback. Specifically, each rating is transformed into a value of 0 or 1 indicating whether a user has interacted with an item.

•

Pinterest: This dataset is constructed by [23] for the image recommendation task and has been used for evaluating the implicit recommendation task [17, 30] in previous work.

•

Yelp2018: This dataset is adopted from the 2018 edition of the Yelp challenge, where the local businesses like restaurants are viewed as the items. The yelp2018 dataset used in this article is exactly the same as that used in [27, 34, 65].

•

Alibaba: This dataset is collected from the Alibaba online shopping platform. The authors of [73] organize the purchase record of selected users to construct the bipartite user-item graph. The dataset used in this article is exactly the same as that used in [34, 73].

The datasets vary in the numbers of user-item interactions, density, and item frequency distribution. The frequency statistics of the evaluation datasets are shown in Figure 1.

Fig. 1.

5.1.2 Evaluation Metrics.

The personalized ranking list for a user is generated by ranking all items that the user has not interacted with in the training set according to the prediction scores. To evaluate the performance, we closely follow the settings of previous work [8, 27, 65]. Specifically, we randomly select 80% of the interactions of each user to construct the training set and treat the remaining interactions as the test set. From the training set, we randomly select 10% of the interactions as the validation set to tune hyper-parameters.
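The split described above can be sketched as follows. Whether the validation interactions are drawn per user and whether they are removed from the final training set are not specified in the text, so both choices in this sketch are assumptions; the function name and the rounding rule are likewise illustrative.

```python
import random


def split_user_interactions(user_items, train_ratio=0.8, valid_ratio=0.1, seed=0):
    """Per-user random split into train / validation / test item lists."""
    rng = random.Random(seed)
    train, valid, test = {}, {}, {}
    for user, items in user_items.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = int(round(train_ratio * len(shuffled)))
        train_part, test[user] = shuffled[:n_train], shuffled[n_train:]
        # Assumption: validation items are held out from the training part.
        n_valid = int(round(valid_ratio * len(train_part)))
        valid[user], train[user] = train_part[:n_valid], train_part[n_valid:]
    return train, valid, test


# Hypothetical interaction lists for two users.
interactions = {"u1": list(range(50)), "u2": list(range(100, 120))}
train, valid, test = split_user_interactions(interactions)
```

For a user with 50 interactions this yields 10 test items, 4 validation items, and 36 training items.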

We evaluate the ranking list using two metrics: (1) Recall and (2) Normalized Discounted Cumulative Gain (NDCG). We define the generated recommendation list for user u as \({\bf Rec}_u =\lbrace rec_u^1,rec_u^2,\ldots ,rec_u^K\rbrace\) , where K is the number of recommended items, and \(rec_u^i\) is ranked at the ith position in \({\bf Rec}_u\) according to the predicted score. The set of u’s interacted items in the test data is defined as \({\bf I}_u\) .

•

Recall@K: Recall measures the fraction of a user’s test items that appear in the top-K recommendation list. It is computed as follows:

\begin{equation} \begin{split}Recall@K&=\frac{1}{|{\bf U}|} \sum _u \frac{\sum _{i=1}^K{f\left(\left| \lbrace rec_u^i\rbrace \cap {\bf I}_u \right|\right)}}{|{\bf I}_u|}, \end{split} \end{equation}

(15)

where \(f (x)\) is an indicator function whose value is 1 when x > 0 and 0 otherwise.

•

Normalized Discounted Cumulative Gain (NDCG)@K: It is widely used in information retrieval and recommendation tasks, measuring the quality of ranking through discounted importance of positions. Formally, it is computed as follows:

\begin{equation} \begin{split}DCG@K&=\frac{1}{|{\bf U}|}\sum _u \sum _{i=1}^K \frac{2^{f\left(\left| \lbrace rec_u^i\rbrace \cap {\bf I}_u \right|\right)}-1}{\log _{2}(i+1)}\\ NDCG@K&=\frac{DCG@K}{IDCG@K} , \end{split} \end{equation}

(16)

where IDCG is a normalization constant, which is the maximum possible value of \(DCG@K\) coming from the best ranking.
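For a single user, the two metrics can be computed as in the sketch below. It assumes binary relevance (so that \(2^{f}-1\) is either 0 or 1), the conventional normalization of Recall by the number of the user's test items, and a per-user IDCG; averaging over users is left out for brevity. The function names are illustrative.

```python
import math


def recall_at_k(ranked, test_items, k):
    """Fraction of the user's test items found in the top-k list."""
    hits = sum(1 for item in ranked[:k] if item in test_items)
    return hits / len(test_items)


def ndcg_at_k(ranked, test_items, k):
    """DCG of the top-k list divided by the DCG of an ideal ranking."""
    # Position i (1-indexed) is discounted by log2(i + 1); pos below is 0-indexed.
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in test_items)
    ideal_hits = min(k, len(test_items))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg


ranked = ["a", "b", "c", "d", "e"]   # hypothetical ranking for one user
test_items = {"b", "e", "z"}         # hypothetical held-out items
recall = recall_at_k(ranked, test_items, k=5)
ndcg = ndcg_at_k(ranked, test_items, k=5)
```

In this toy case two of the three test items are retrieved, at positions 2 and 5, so the hits are discounted by \(\log_2 3\) and \(\log_2 6\), while the ideal ranking would place three hits at positions 1 to 3.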

For each user, our evaluation protocol ranks all the items except those in the training set, which is more persuasive than ranking a random subset of negative items only [38]. For each method, we randomly initialize the model and run it five times. After that, we report the average results. Moreover, the early stopping strategy is performed, i.e., premature stopping if Recall@20 on the validation data does not increase for 50 epochs.
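The early stopping rule can be sketched as a simple patience counter. The callables `train_one_epoch` and `evaluate` are hypothetical placeholders for the training step and for computing Recall@20 on the validation data; the toy scores below only exercise the control flow.

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs, patience=50):
    """Stop once the validation score has not improved for `patience` epochs."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        score = evaluate(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score


# Toy validation curve: improves until epoch 3, then stagnates and declines.
val_scores = [0.10, 0.12, 0.15, 0.16, 0.16, 0.15, 0.14, 0.13]
visited = []
best_epoch, best_score = train_with_early_stopping(
    train_one_epoch=visited.append,
    evaluate=lambda epoch: val_scores[epoch],
    max_epochs=len(val_scores),
    patience=2,
)
```

With a patience of 2, training stops after epoch 5 and the model from epoch 3 would be reported.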

5.1.3 Hyper-parameter Settings.

The parameters for all methods are initialized according to the corresponding papers and are then carefully tuned to achieve optimal performance. Specifically, by applying the vanilla Uniform sampling strategy, we first use grid search to find the best sampling-independent parameters, such as the learning rate and regularization. Then, for each individual method, we fix the above parameters and search the remaining sampling-related parameters. The batch size is set to 512. For fair comparison, we use the same embedding size d for all methods. This setting has been widely adopted in previous work [10, 27, 30, 34, 65]; it ensures that the basic model has the same modeling ability, so that performance differences are caused only by the different learning methods. Specifically, d is set to 64 in our experiment. To prevent overfitting, we tune the dropout ratio in [0.1, 0.3, 0.5, 0.7, 0.9, 1] and the regularization in [0, 0.0001, 0.001, 0.01, 0.05]. A dropout ratio of 1.0 means that all parameters are used (i.e., no dropout is applied), following the setting in TensorFlow. By default, we integrate the above-introduced negative sampling methods and non-sampling methods into a traditional matrix factorization model to compare the performance. The detailed information of hyper-parameter exploration is shown in Table 3. All experiments are run on the same machine (Intel Xeon 8-Core CPU of 2.4 GHz and a single NVIDIA GeForce GTX TITAN X GPU) for a fair comparison.

Table 3.

Methods	Para.	Tuning Range	Movielens-1m	Pinterest	Yelp2018	Alibaba
BPR-Uniform	lr	[0.001, 0.005, 0.01, 0.02, 0.05]	0.05	0.05	0.05	0.05
	reg	[0, 0.0001, 0.001, 0.01]	0.01	0.01	0.01	0.01
	dropout	[0.1, 0.3, 0.5, 0.7, 0.9, 1.0]	1.0	1.0	1.0	1.0
AOBPR	lr	[0.001, 0.005, 0.01, 0.02, 0.05]	0.05	0.05	0.05	0.05
	reg	[0, 0.0001, 0.001, 0.01]	0.01	0.01	0.01	0.01
	dropout	[0.1, 0.3, 0.5, 0.7, 0.9, 1.0]	1.0	1.0	1.0	1.0
	\(\lambda\)	[5, 10, 20, 50, 100, 200, 500, 1000, 2000]	1000	1000	1000	1000
BPR-DNS	lr	[0.001, 0.005, 0.01, 0.02, 0.05]	0.05	0.05	0.05	0.05
	reg	[0, 0.0001, 0.001, 0.01]	0.01	0.01	0.01	0.01
	dropout	[0.1, 0.3, 0.5, 0.7, 0.9, 1.0]	1.0	1.0	1.0	1.0
	k	[2, 4, 8, 16]	2	2	4	8
IRGAN	lr	[0.001, 0.005, 0.01, 0.02, 0.05]	0.01	0.01	0.01	0.01
	reg	[0, 0.0001, 0.001, 0.01]	0.01	0.01	0.01	0.01
	dropout	[0.1, 0.3, 0.5, 0.7, 0.9, 1.0]	0.7	0.9	0.9	0.9
	\(\tau\)	[0.5, 1, 2]	1	1	1	1
SRNS	lr	[0.001, 0.005, 0.01, 0.02, 0.05]	0.05	0.05	0.05	0.05
	reg	[0, 0.0001, 0.001, 0.01, 0.05]	0.01	0.05	0.01	0.01
	dropout	[0.1, 0.3, 0.5, 0.7, 0.9, 1.0]	1.0	1.0	1.0	1.0
	\(\tau\)	[0.5, 1, 2, 10]	10	10	10	10
	\(\alpha\)	[0.1, 1, 2, 5, 10, 20, 50]	5	5	5	5
	\(T_0\)	[25, 50, 100]	50	50	50	50
	\(S_1\)	[2, 4, 8, 16, 32]	8	16	16	16
	\(S_2 /S_1\)	[1, 2, 4, 8]	8	4	8	8
WMF	lr	[0.001, 0.005, 0.01, 0.02, 0.05]	0.05	0.05	0.05	0.05
	reg	[0, 0.0001, 0.001, 0.01]	0.0	0.0	0.0	0.0
	dropout	[0.1, 0.3, 0.5, 0.7, 0.9, 1.0]	0.5	0.9	0.7	0.5
	\(c_0\)	[0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7]	0.5	0.05	0.05	0.01
EALS	lr	[0.001, 0.005, 0.01, 0.02, 0.05]	0.05	0.05	0.05	0.05
	reg	[0, 0.0001, 0.001, 0.01]	0.0	0.0	0.0	0.0
	dropout	[0.1, 0.3, 0.5, 0.7, 0.9, 1.0]	0.5	0.9	0.7	0.5
	\(c_0\)	[200, 500, 1000, 2000, 4000, 5000]	1000	500	1000	500
	x	[0.25, 0.5, 0.75]	0.5	0.5	0.5	0.5

Table 3. Hyper-parameter Exploration

5.2 Performance Comparison

We first compare the performance of the above-introduced negative sampling and non-sampling learning methods. The method ItemKNN [56] is also added as a basic benchmark. For fair comparison, all methods are integrated into a common matrix factorization recommendation model (i.e., \(\hat{y}{(u,i)}=\mathbf {p}_{u}^T \mathbf {q}_{i}\) ). We re-implement BPR-Uniform, BPR-DNS, SRNS, WMF, and EALS with TensorFlow. For ItemKNN and AOBPR, we use the implementation in LibRec.⁵ For IRGAN, we use the authors’ released code.⁶ For sampling-based methods, we sample one negative instance for each positive instance, which is a widely used setting in previous work [27, 30, 52, 61]. The impact of the number of negative samples is explored in Section 5.4. To evaluate different recommendation lengths, we investigate the top-K performance with K set to 10, 20, and 50 in our experiments. The results of the comparison of different methods are shown in Table 4. From the table, we have the following key findings:

Table 4.

Movielens-1m	Recall@10	Recall@20	Recall@50	NDCG@10	NDCG@20	NDCG@50
ItemKNN	0.0849	0.1325	0.2371	0.2317	0.2177	0.2312
BPR-Uniform	0.1535	0.2424	0.4081	0.3458	0.3418	0.3737
AOBPR	0.1421	0.2301	0.3917	0.3299	0.3270	0.3679
BPR-DNS	0.1524	0.2430	0.4080	0.3439	0.3408	0.3725
IRGAN	0.1533	0.2445	0.4082	0.3437	0.3422	0.3741
SRNS	0.1531	0.2420	0.4026	0.3478	0.3428	0.3711
WMF	0.1572	0.2449	0.4034	0.3584	0.3518	0.3787
EALS	0.1550	0.2427	0.4044	0.3527	0.3465	0.3756
Pinterest	Recall@10	Recall@20	Recall@50	NDCG@10	NDCG@20	NDCG@50
ItemKNN	0.0782	0.1248	0.2351	0.0665	0.0878	0.1252
BPR-Uniform	0.0783	0.1309	0.2462	0.0670	0.0891	0.1279
AOBPR	0.0821	0.1381	0.2561	0.0704	0.0937	0.1332
BPR-DNS	0.0799	0.1336	0.2495	0.0683	0.0907	0.1298
IRGAN	0.0812	0.1399	0.2521	0.0703	0.0919	0.1301
SRNS	0.0893	0.1473	0.2694	0.0770	0.1013	0.1425
WMF	0.0833	0.1385	0.2584	0.0712	0.0942	0.1347
EALS	0.0853	0.1414	0.2601	0.0738	0.0972	0.1373
Yelp2018	Recall@10	Recall@20	Recall@50	NDCG@10	NDCG@20	NDCG@50
ItemKNN	0.0307	0.0554	0.1058	0.0362	0.0451	0.0646
BPR-Uniform	0.0323	0.0563	0.1096	0.0370	0.0457	0.0655
AOBPR	0.0332	0.0568	0.1114	0.0382	0.0468	0.0670
BPR-DNS	0.0350	0.0603	0.1170	0.0402	0.0493	0.0703
IRGAN	0.0333	0.0578	0.1135	0.0384	0.0464	0.0678
SRNS	0.0351	0.0599	0.1151	0.0400	0.0491	0.0695
WMF	0.0365	0.0623	0.1195	0.0418	0.0512	0.0724
EALS	0.0382	0.0651	0.1234	0.0440	0.0538	0.0753
Alibaba	Recall@10	Recall@20	Recall@50	NDCG@10	NDCG@20	NDCG@50
ItemKNN	0.0287	0.0394	0.0643	0.0164	0.0198	0.0238
BPR-Uniform	0.0306	0.0463	0.0731	0.0178	0.0215	0.0278
AOBPR	0.0341	0.0479	0.0790	0.0193	0.0257	0.0291
BPR-DNS	0.0324	0.0488	0.0759	0.0187	0.0228	0.0283
IRGAN	0.0312	0.0429	0.0672	0.0181	0.0206	0.0243
SRNS	0.0258	0.0364	0.0554	0.0149	0.0178	0.0218
WMF	0.0513	0.0763	0.1192	0.0289	0.0355	0.0444
EALS	0.0522	0.0749	0.1169	0.0295	0.0356	0.0442

Table 4. Performance Comparison of Negative Sampling and Non-sampling Methods

For fair comparison, all methods are integrated into a common matrix factorization recommendation model ( \(\hat{y}{(u,i)}=\mathbf {p}_{u}^T \mathbf {q}_{i}\) ).

(1)

Generally, under the same model architecture, uniform sampling falls short of non-sampling learning methods. For example, in Table 4, the non-sampling learning methods WMF and EALS achieve significantly better performance (in both Recall and NDCG) than BPR-Uniform on the Pinterest, Yelp2018, and Alibaba datasets (p < 0.01). This finding is consistent with many previous works [9, 10, 70, 77]. The main reason lies in the limited effectiveness of uniform negative sampling: as we have introduced, once a model is reasonably well trained, it can easily distinguish the positive samples from randomly sampled negatives, so the gradient is close to zero and the parameters are hardly updated.

(2)

The effectiveness of the negative sampler is important for personalized Top-K recommendation. In the table, the performances of AOBPR, BPR-DNS, IRGAN, and SRNS are generally better than that of simple uniform sampling BPR. In particular, AOBPR and BPR-DNS place more emphasis on hard negative items with larger preference scores, IRGAN learns a generator to produce hard negative instances based on adversarial sampling, and SRNS prefers negative instances with both large prediction scores and high variances. These informative sampling methods keep the gradient large as training progresses, which leads to better performance than uniform sampling BPR.

(3)

Among all negative sampling methods, SRNS generally performs better since it considers both the informativeness and the reliability of negative instances. AOBPR and BPR-DNS favor negative items with larger preference scores, so they are more likely to suffer from the false-negative problem: high-scored items are also likely to be positive in the test set rather than negative. The performance of IRGAN is not very competitive. This is because of the divergence between the generator and its optimum, and because GAN-style methods are very sensitive to some basic training settings [48].

(4)

We can see that the well-tuned simple non-sampling method WMF performs better than various sampling-based methods, including the state-of-the-art method SRNS, in most cases. Several previous studies [9, 70, 77] have also pointed out that non-sampling learning computes the gradient over all training data (including all unobserved data). Therefore, it can converge to a better optimum in a more stable way. Moreover, our empirical observations suggest that some basic training settings (e.g., the negative weight \(c_0\) ) are very important for the performance of WMF. For example, an improperly chosen negative weight can reduce the model accuracy significantly. In this situation, the performance of an effective method can be underestimated because of improper parameter settings [54]. This may explain why our finding differs somewhat from the findings in [30, 47, 61], where simple sampling-based methods are able to yield performance competitive with non-sampling methods.

(5)

EALS, which adopts a popularity-based weighting strategy, achieves better performance than WMF on Pinterest and Yelp2018 and comparable performance on Movielens-1m and Alibaba. Items in real-world scenarios usually follow a long-tail distribution, and popular items with which a user does not interact are more likely to be true negatives. This observation is similar to previous work [31, 32], which shows that assigning weights according to item popularity can further boost the performance of non-sampling-based recommendation methods.

(6)

Considering the performance on each dataset, we can see that the relative effectiveness of non-sampling methods and sampling-based methods is related to the sparsity of the dataset. Generally, the sparser the data, the larger the advantage of non-sampling methods over sampling-based methods. For example, the improvements of WMF over BPR-Uniform, averaged over the six metrics in Table 4, are 1.7%, 5.7%, 11.4%, and 63.8% for Movielens-1m, Pinterest, Yelp2018, and Alibaba, respectively. The improvements increase as the data become sparser (i.e., as the density decreases). This makes sense because it is more difficult to sample high-quality instances from sparse data than from dense data.
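The four percentages quoted in finding (6) can be reproduced from Table 4 when they are read as the relative improvement of WMF over BPR-Uniform averaged over the six reported metrics (Recall and NDCG at 10, 20, and 50). This reading is inferred from the numbers rather than stated explicitly; the sketch below carries out the arithmetic.

```python
# Rows copied from Table 4: Recall@10/20/50 followed by NDCG@10/20/50.
TABLE4 = {
    "Movielens-1m": {
        "BPR-Uniform": [0.1535, 0.2424, 0.4081, 0.3458, 0.3418, 0.3737],
        "WMF": [0.1572, 0.2449, 0.4034, 0.3584, 0.3518, 0.3787],
    },
    "Pinterest": {
        "BPR-Uniform": [0.0783, 0.1309, 0.2462, 0.0670, 0.0891, 0.1279],
        "WMF": [0.0833, 0.1385, 0.2584, 0.0712, 0.0942, 0.1347],
    },
    "Yelp2018": {
        "BPR-Uniform": [0.0323, 0.0563, 0.1096, 0.0370, 0.0457, 0.0655],
        "WMF": [0.0365, 0.0623, 0.1195, 0.0418, 0.0512, 0.0724],
    },
    "Alibaba": {
        "BPR-Uniform": [0.0306, 0.0463, 0.0731, 0.0178, 0.0215, 0.0278],
        "WMF": [0.0513, 0.0763, 0.1192, 0.0289, 0.0355, 0.0444],
    },
}


def mean_relative_improvement(method, baseline):
    """Average of (method - baseline) / baseline over all metrics, in percent."""
    gains = [(m - b) / b for m, b in zip(method, baseline)]
    return 100.0 * sum(gains) / len(gains)


improvements = {
    name: round(mean_relative_improvement(rows["WMF"], rows["BPR-Uniform"]), 1)
    for name, rows in TABLE4.items()
}
# improvements -> {'Movielens-1m': 1.7, 'Pinterest': 5.7, 'Yelp2018': 11.4, 'Alibaba': 63.8}
```

The ordering of the improvements follows the ordering of the densities in Table 2 (4.47%, 0.27%, 0.13%, 0.02%), which is the trend discussed in finding (6).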

5.3 Efficiency Analyses

Many previous studies focused only on obtaining better results but ignored computational efficiency [57]. In real-world systems, training efficiency is also an important factor that needs to be considered. In this section, we conduct an experiment to show the training efficiency of representative negative sampling and non-sampling methods. All the compared methods are re-implemented with TensorFlow and are run on the same machine for a fair comparison. The training time results are shown in Table 5. Note that the compared models all have the same network structure but use different learning strategies. We have the following key observations:

Table 5.

Model	Movielens-1m			Pinterest			Yelp2018			Alibaba
Model	S	I	T	S	I	T	S	I	T	S	I	T
BPR-Uniform	24s	500	250m	21s	500	175m	26s	500	216m	12s	1500	300m
BPR-DNS	45s	500	375m	38s	500	317m	78s	500	11h	87s	1500	37h
SRNS	210s	300	18h	480s	300	40h	11m	500	92h	370s	500	52h
WMF	1s	300	5m	1.8s	300	9m	6.5s	300	32m	10.5s	300	54m
EALS	1s	300	5m	1.8s	300	9m	6.5s	300	32m	10.5s	300	54m

Table 5. Comparisons of Runtime (Second/Minute/Hour [s/m/h])

“S,” “I,” and “T” represent the training time for a single iteration, the number of iterations to converge, and the total training time, respectively. WMF and EALS are trained through efficient non-sampling loss (Equation (14)).

(1)

We can clearly observe that the overall training time of the non-sampling methods WMF and EALS is much shorter than that of sampling-based methods. For example, on the four datasets, WMF and EALS need only 5 minutes, 9 minutes, 32 minutes, and 54 minutes, respectively, to achieve the optimal performance. This could be attributed to three reasons: first, by leveraging the efficient non-sampling loss [10], the complexity of non-sampling learning has been reduced from \(O(|{\bf U}||{\bf I}|d)\) to \(O((|{\bf U}|+|{\bf I}|)d^2+ |\mathcal {Y}|d)\) , which avoids time-consuming traversal of all items; second, non-sampling methods generally require fewer iterations to achieve the optimal performance; third, powerful samplers such as BPR-DNS and SRNS spend much more time on sampling.

(2)

Although stronger samplers can achieve better performance than uniform sampling, they generally need much more sampling time, especially on larger datasets. For example, on the Alibaba dataset, BPR-DNS and SRNS require 37 and 52 hours for training, respectively. Moreover, some samplers need to calculate the prediction scores of all items, which reduces the advantage of sampling and is very time-consuming in the training process.
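To give a feeling for the complexity reduction mentioned in observation (1), the sketch below plugs the dataset sizes from Table 2 and d = 64 into the two complexity expressions. These are raw operation counts that ignore constant factors, so the ratios are only indicative and are not measured speedups; Movielens-1m is omitted to keep the example short.

```python
def naive_cost(n_users, n_items, d):
    """Operation count of the original non-sampling loss: |U| * |I| * d."""
    return n_users * n_items * d


def efficient_cost(n_users, n_items, n_interactions, d):
    """Operation count of the efficient loss: (|U| + |I|) * d^2 + |Y| * d."""
    return (n_users + n_items) * d * d + n_interactions * d


# (#User, #Item, #Interaction) as listed in Table 2.
DATASETS = {
    "Pinterest": (55187, 9916, 1500809),
    "Yelp2018": (31668, 38048, 1561406),
    "Alibaba": (106042, 53591, 907407),
}
d = 64  # embedding size used in the experiments

ratios = {
    name: naive_cost(u, i, d) / efficient_cost(u, i, y, d)
    for name, (u, i, y) in DATASETS.items()
}
```

Under these counts the ratio grows as the data become sparser, from Pinterest to Yelp2018 to Alibaba.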

5.4 Impact of Sampling Number

BPR pair-wise loss (Equation (1)) is the most widely used sampling-based learning strategy. However, existing BPR learning-based methods generally sample one negative instance for each positive user-item pair [6, 27, 52, 65]. The impact of negative sampling number has been largely ignored by existing studies. Here we investigate how the performance changes as the number of negative samples increases. The BPR pair-wise loss with multiple negative samples is calculated as follows:

\begin{equation} \begin{split}\mathcal {L}=-\sum _{(u, i) \in {\bf Y}}\sum _{j \in {\bf I} \backslash {\bf I}_u}^{K} \ln \sigma (\hat{y}(u, i)-\hat{y}(u, j)), \end{split} \end{equation}

(17)

where \(\sigma (x)= \frac{1}{1+e^{-x}}\) is a sigmoid function, \({\bf Y}\) is the training data, and \({\bf I}_u\) denotes the positive item set of user u. It repeats the positive instance \((u,i)\) multiple times to make it get a higher score than every negative item.
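A minimal sketch of Equation (17) with uniformly sampled negatives is given below, using a plain inner-product model. Sampling the K negatives with replacement from the items the user has not interacted with is an assumption, and the data structures and names are illustrative.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bpr_loss_multi_neg(P, Q, train_pos, num_neg, rng):
    """-sum over (u, i) and K sampled j of ln sigmoid(y(u, i) - y(u, j))."""
    n_items = len(Q)
    loss = 0.0
    for u, items in train_pos.items():
        candidates = [j for j in range(n_items) if j not in items]
        for i in items:
            # Assumption: K negatives drawn uniformly with replacement.
            for j in rng.choices(candidates, k=num_neg):
                diff = dot(P[u], Q[i]) - dot(P[u], Q[j])
                loss += math.log1p(math.exp(-diff))  # equals -ln sigmoid(diff)
    return loss


rng = random.Random(0)
d, n_users, n_items = 4, 2, 5
P = [[0.0] * d for _ in range(n_users)]    # all-zero embeddings for a sanity check
Q = [[0.0] * d for _ in range(n_items)]
train_pos = {0: {0, 1}, 1: {2}}            # hypothetical positive item sets
loss = bpr_loss_multi_neg(P, Q, train_pos, num_neg=4, rng=rng)
```

With all-zero embeddings every score difference is zero, so each of the 3 positive pairs times 4 negatives contributes \(\ln 2\), which gives a quick sanity check of the implementation.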

Figure 2 shows the performance of BPR-Uniform when varying the number of negative samples on the four datasets. For comparison, we also report the performance at each training epoch. We show the results on the Recall@20 metric in this section. From the figure, we can clearly see that the number of negative samples does matter when training the recommendation model with BPR loss. Generally, it is beneficial to use more negative samples. For example, in Figure 2, sampling multiple negative instances for each positive pair outperforms using only one negative instance, and can even achieve results similar to those of non-sampling methods on the Movielens-1m dataset. Moreover, the results show that sampling more negative instances leads to faster convergence. Note that in this section we take BPR-Uniform as an example to show the impact of the number of negative samples. For other negative sampling methods, similar behavior can also be observed [34]. This is because when more negative items are sampled, hard negative samples are more likely to be included and provide more valuable gradient updates. The results suggest that sampling more negative instances for each positive pair is a promising setting for recommendation models with BPR loss to gain higher performance. This also shows the potential of using non-sampling learning for improving the performance of recommendation systems. We therefore call for more consideration of the above settings when evaluating a newly proposed recommendation model.

Fig. 2.

5.5 Further Comparison

To answer research question 3, we further compare the state-of-the-art recommendation models NGCF [65] and LightGCN [27] with both sampling and non-sampling learning (with efficient loss) strategies to explore how negative sampling and non-sampling learning boost recommendation performance. The compared models are introduced as follows:

•

Neural Graph Collaborative Filtering (NGCF) [65]: This is one of the state-of-the-art graph-based recommendation models, which learns representations of users and items based on graph neural network. Specifically, each node obtains the transformed representations of its multi-hop neighbors. NGCF adopts the inner product to predict the user’s preference on item \(\hat{y}{(u,i)}=\mathbf {p}_{u}^T \mathbf {q}_{i}\) .

•

Light Graph Convolution Network (LightGCN) [27]: This is the state-of-the-art graph-based recommendation model. It simplifies the design of GCN by omitting the non-linear transformation and applying the sum-based pooling. The model prediction is the same as NGCF.

We also compare with more advanced sampling-based approaches, including the GAN-based method AdvIR and graph-based methods MCNS and MixGCF, as follows:

•

AdvIR [49]: AdvIR is an adversarial sampler that incorporates adversarial sampling with adversarial training by adding adversarial perturbation.

•

MCNS [73]: Markov chain Monte Carlo negative sampling (MCNS) proposes to sample negatives by approximating the positive distribution and accelerates the process with the Metropolis-Hastings algorithm.

•

MixGCF [34]: MixGCF is designed for the GNN-based recommendation models and integrates multiple negatives to synthesize a hard negative by positive mixing and hop mixing.

To reduce the experiment workload and keep the comparison fair, we closely follow the settings of the MixGCF work [34]. The datasets Yelp2018 and Alibaba are exactly the same as the MixGCF work used, so we directly use the results of AdvIR, MCNS, and MixGCF in the MixGCF paper.

Table 6 shows the performance of the compared methods. From this table, we can make the following observations:

Table 6.

	Yelp2018		Alibaba
	Recall@20	NDCG@20	Recall@20	NDCG@20
NGCF+Uniform	0.0577	0.0469	0.0426	0.0197
NGCF+IRGAN	0.0615	0.0502	0.0435	0.0200
NGCF+AdvIR	0.0614	0.0500	0.0440	0.0203
NGCF+MCNS	0.0625	0.0501	0.0430	0.0200
NGCF+MixGCF	0.0688	0.0566	0.0544	0.0262
NGCF+WMF	0.0638	0.0526	0.0755	0.0355
NGCF+EALS	0.0655	0.0542	0.0742	0.0348
LightGCN+Uniform	0.0628	0.0515	0.0584	0.0275
LightGCN+IRGAN	0.0641	0.0527	0.0605	0.0280
LightGCN+AdvIR	0.0624	0.0510	0.0583	0.0273
LightGCN+MCNS	0.0658	0.0529	0.0632	0.0284
LightGCN+MixGCF	0.0713	0.0589	0.0763	0.0357
LightGCN+WMF	0.0627	0.0515	0.0735	0.0349
LightGCN+EALS	0.0647	0.0533	0.0747	0.0353

Table 6. Performance of Different Recommendation Methods on Yelp2018 and Alibaba Datasets

The datasets are exactly the same as the MixGCF work used, so we directly report the results of AdvIR, MCNS, and MixGCF in the MixGCF paper.

(1)

Among the negative-sampling-based methods, MixGCF yields the best performance on the two datasets. This is because MixGCF is designed for GNN-based recommendation models and augments the negative samples through the hop mixing technique to provide more informative gradients for model training.

(2)

Generally, compared to the widely used uniform sampling learning method, adopting a non-sampling learning strategy boosts the recommendation performance. For example, in Table 6, NGCF+WMF performs better than NGCF+Uniform on both datasets, and LightGCN+WMF clearly outperforms LightGCN+Uniform on Alibaba while performing on par with it on Yelp2018.

(3)

Compared with the results of WMF and EALS reported in Table 4, we can see that the improvements of NGCF+WMF and LightGCN+WMF are relatively small. This could be attributed to two reasons. First, the GNN-based models are advantageous in learning user-item interactions; through the operation of embedding propagation, the collaborative signals have been incorporated into the embedding process in an explicit manner. Second, due to the marginal effect, the additional boost from non-sampling learning is somewhat limited for such expressive models.

(4)

From Tables 4 and 6, we also observe that WMF and EALS can beat the state-of-the-art recommendation methods NGCF+Uniform and LightGCN+Uniform in most cases. This is remarkable since a shallow MF framework has far fewer parameters. This result validates the importance of the choice of learning strategy. In fact, the recommendation model has been well studied by a large number of recent works, while the learning strategy usually becomes the bottleneck of the recommendation performance.

6 Discussions and Future Directions

In this section, we discuss some open issues and present several future directions.

6.1 Evaluation of Recommendation System

Evaluating recommender systems properly is recognized to be difficult since it relies heavily on empirical results [54]. Recently, several critical studies have found that the improvement achieved with some complex models was only observed because the chosen baselines were weak or the parameters were not properly optimized [14, 15, 44, 54]. In this article, we conduct thorough comparisons between negative sampling and non-sampling learning with a careful setup of various representative methods. Although our empirical findings may not generalize to other tasks, we reveal that a simple recommendation model with non-sampling learning can outperform many advanced sampling-based methods. This fact is usually neglected in previous studies, since it has recently become common for research papers to compare newly proposed techniques only against sampling-based baselines [27, 65, 68, 69]. These results also encourage our community to revisit previously proposed recommendation models with fine-tuned training settings to better investigate their potential performance.

Due to the difficulty of evaluating recommender systems, different works usually report inconsistent results [15, 53], which makes the performance of existing methods not well understood. However, it is difficult to achieve reliable experiments by authors of a single paper, which requires a community effort. As such, we believe that a suite of benchmark datasets and well-tuned baselines should be further developed by our community.

6.2 Scalability of Negative Sampling and Non-sampling

From previous studies, we can see that existing neural recommendation models generally rely on uniform negative sampling to support efficient training. However, our findings show that uniform negative sampling leads to suboptimal performance. In fact, recommendation models have been well studied in a large number of recent works, while the learning strategy often becomes the bottleneck that limits recommendation performance. Although some studies have proposed replacing uniform sampling with other sampling methods [17, 41, 51, 61] to improve the quality of negative samples, they usually integrate the proposed samplers only into simple recommendation models such as matrix factorization. The main reason is that these methods cannot meet the requirements of efficiency and effectiveness at the same time (see Table 5). For example, several state-of-the-art methods use complex structures such as GANs [61] to generate negative instances, which poses a new challenge for model efficiency. As a result, these methods can hardly be incorporated into state-of-the-art recommendation models such as GCN models. How to efficiently sample informative negative instances remains an important research question that deserves further exploration.
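
To make the uniform negative sampling discussed above concrete, the following is a minimal sketch of how (user, positive item, negative item) training triples might be drawn for a pairwise objective such as BPR. The function names, the dictionary-of-sets input format, and the `n_neg` parameter are illustrative assumptions, not taken from any of the compared implementations.

```python
import random

def sample_uniform_negative(user_items, n_items, rng):
    """Draw one item uniformly from the items the user has NOT interacted with."""
    if len(user_items) >= n_items:
        raise ValueError("user has interacted with every item")
    while True:  # rejection sampling; cheap because interaction data is sparse
        j = rng.randrange(n_items)
        if j not in user_items:
            return j

def bpr_triples(interactions, n_items, rng, n_neg=1):
    """Build (user, positive, negative) triples with n_neg negatives per positive."""
    triples = []
    for u, items in interactions.items():
        for i in items:
            for _ in range(n_neg):
                j = sample_uniform_negative(items, n_items, rng)
                triples.append((u, i, j))
    return triples

# Toy data (hypothetical): 3 users, 6 items, 5 observed interactions.
rng = random.Random(0)
interactions = {0: {1, 2}, 1: {0}, 2: {3, 4}}
triples = bpr_triples(interactions, n_items=6, rng=rng, n_neg=2)
```

Every missing item is equally likely to be drawn here, regardless of how informative it is for the current model; this is the property that the more elaborate samplers cited above try to improve on.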

Recently, some recommendation studies have also applied the non-sampling learning strategy for model optimization. The results reported in those studies are consistent with our experiments, showing the superior ability of non-sampling learning for recommendation tasks. However, existing efficient non-sampling learning methods are suitable only for recommendation models with a linear prediction layer, as shown in Figure 3, which limits the scalability and flexibility of model design. It would be beneficial if this kind of method could be applied to non-linear structures. Although challenging, proposing a general and efficient non-sampling learning method is a promising direction for future work.

Fig. 3.
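
The reason a linear prediction layer matters can be illustrated with a small numerical check. The sketch below assumes a plain dot-product model, a weighted squared loss, and a single uniform weight `c0` for all missing entries; these are simplifying assumptions for illustration, and the function names are hypothetical. Under them, the sum of squared predictions over all user-item pairs factorizes into two small Gram matrices, so the whole-data loss can be computed without enumerating the missing pairs.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_loss(P, Q, positives, c_pos=1.0, c0=0.1):
    """Visit every user-item pair; missing entries are treated as label 0."""
    total = 0.0
    for u, p_u in enumerate(P):
        for i, q_i in enumerate(Q):
            y = dot(p_u, q_i)
            if (u, i) in positives:
                total += c_pos * (1.0 - y) ** 2
            else:
                total += c0 * y ** 2
    return total

def gram(M):
    """Return the d x d matrix M^T M for a list of d-dimensional rows."""
    d = len(M[0])
    return [[sum(r[a] * r[b] for r in M) for b in range(d)] for a in range(d)]

def efficient_loss(P, Q, positives, c_pos=1.0, c0=0.1):
    """Same value as naive_loss, touching only the observed pairs explicitly."""
    GP, GQ = gram(P), gram(Q)
    d = len(GP)
    # Sum of y^2 over ALL pairs equals the element-wise product of the Grams.
    all_pairs = c0 * sum(GP[a][b] * GQ[a][b] for a in range(d) for b in range(d))
    # Correct the observed pairs: add their true loss, remove the c0 * y^2 part.
    pos_fix = 0.0
    for u, i in positives:
        y = dot(P[u], Q[i])
        pos_fix += c_pos * (1.0 - y) ** 2 - c0 * y ** 2
    return all_pairs + pos_fix

# Toy embeddings (hypothetical): 6 users, 8 items, dimension 3.
rng = random.Random(0)
P = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]
Q = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
positives = {(0, 1), (2, 5), (3, 3), (5, 7)}
naive = naive_loss(P, Q, positives)
fast = efficient_loss(P, Q, positives)
```

The factorization relies on the prediction being a dot product of the two embeddings; once a non-linear layer sits between them, the squared prediction no longer separates into user-side and item-side terms, which is the limitation noted above.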

Some recent studies have pointed out the positive effects of linear prediction [53]: (1) it is relevant to the industry as it is more applicable, (2) it simplifies the modeling and learning process, and (3) it has better alignment with other research tasks such as image models and natural language processing where the dot product is usually used. Some state-of-the-art recommendation models also adopt linear prediction for combining embeddings, such as NGCF [65], LightGCN [27], and KGAT [63]. Therefore, besides extending non-sampling to non-linear prediction structures, other promising directions to improve non-sampling recommendation models include (1) designing better embedding layers for users and items (f and g in Figure 2) through graph learning or causal modeling; (2) leveraging content features such as context information, reviews, social connections, and knowledge graphs; and (3) optimizing the model with multiple objective functions and multi-task learning.

6.3 Multi-task Learning

Multi-task learning performs joint training on different but correlated tasks in order to obtain a better model for each task [60]. Recently, multi-task learning has been widely applied for learning a recommender system with side information such as social connections, knowledge graphs, and users’ multiple behaviors [8, 11, 21]. However, although negative sampling has been widely applied in previous recommendation work, it is reasonable to argue that existing sampling methods are not well suited to optimizing a multi-task model. Specifically, to generate a training batch, sampling methods need to sample negative instances for each task. This produces much larger randomness than in single-task learning and inevitably leads to information loss. The poor performance of negative sampling for multi-task learning has been observed in previous work [7, 11]. Therefore, it will be interesting and valuable to design sampling methods better suited to multi-task learning.

6.4 Application in Realistic Scenarios

Most real-world recommendation systems [80] contain two stages: candidate generation and ranking. For very large systems that contain billions of items and users, existing negative sampling and non-sampling methods are both hard to apply directly because of their huge time or space complexity. In this case, a pre-processing step of candidate generation is usually applied first. For example, [43] uses co-occurrences of items to generate candidates, [19] applies a random walk on a (co-occurrence) graph, and [13] describes a hybrid approach using a mixture of features.
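
As a rough illustration of the co-occurrence idea mentioned above, the sketch below counts how often two items appear in the same user history and proposes the unseen items that co-occur most with a user's own items. It is a deliberately simplified toy, not the method of [43]; the function names, the scoring rule, and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(histories):
    """Count, for every item pair, the number of users who interacted with both."""
    co = defaultdict(Counter)
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co

def candidates(user_items, co, k=2):
    """Return up to k unseen items with the highest summed co-occurrence score."""
    seen = set(user_items)
    scores = Counter()
    for a in seen:
        for b, n in co[a].items():
            if b not in seen:
                scores[b] += n
    return [item for item, _ in scores.most_common(k)]

# Toy histories (hypothetical).
histories = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c", "d"],
    "u4": ["a"],
}
co = build_cooccurrence(histories)
top = candidates(histories["u4"], co)  # items to pass on to the ranking stage
```

Only the short candidate list would then be scored by the ranking model, which is what keeps either learning strategy tractable at the second stage.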

In practical recommender systems where new users, items, and interactions are continuously streaming in, it is important to update the model in real time to best serve users. For non-sampling methods, a commonly used online learning strategy is incremental learning [32]. Given a new user-item interaction ( \(u,i\) ), incremental learning performs optimization steps only for \({\bf p}_u\) and \({\bf q}_i\) . The assumption is that a new interaction should not change the overall parameters much, but should change the embeddings of \(u\) and \(i\) significantly. For negative sampling, the advantage is that the parameters of sampling-based models are easy to update continuously with new data. However, the problem is that existing state-of-the-art sampling methods cannot meet the requirements on effectiveness and efficiency at the same time. This, as we have discussed, deserves further exploration.
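
The incremental strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions: a dot-product model, a weighted squared loss with uniform weight `c0` on missing entries, and plain gradient steps rather than the update rule of [32]. All function names and hyper-parameter values are hypothetical. For brevity the Gram matrix is recomputed on every step; a practical system would presumably cache it and apply a rank-one correction.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram(M):
    """Return the d x d matrix M^T M for a list of d-dimensional rows."""
    d = len(M[0])
    return [[sum(r[a] * r[b] for r in M) for b in range(d)] for a in range(d)]

def local_loss(vec, others, pos_idx, c_pos=1.0, c0=0.1):
    """Whole-data weighted squared loss restricted to one user (or one item)."""
    total = 0.0
    for j, o in enumerate(others):
        y = dot(vec, o)
        total += c_pos * (1.0 - y) ** 2 if j in pos_idx else c0 * y ** 2
    return total

def local_step(vec, others, gram_others, pos_idx, lr=0.05, c_pos=1.0, c0=0.1):
    """One gradient step on local_loss; the missing-data part uses the Gram matrix."""
    d = len(vec)
    grad = [2.0 * c0 * sum(gram_others[a][b] * vec[b] for b in range(d))
            for a in range(d)]
    for j in pos_idx:
        y = dot(vec, others[j])
        coef = 2.0 * ((c_pos - c0) * y - c_pos)
        for a in range(d):
            grad[a] += coef * others[j][a]
    return [v - lr * g for v, g in zip(vec, grad)]

def incremental_update(P, Q, user_pos, item_pos, u, i, steps=5):
    """Absorb the new interaction (u, i), modifying only P[u] and Q[i]."""
    user_pos.setdefault(u, set()).add(i)
    item_pos.setdefault(i, set()).add(u)
    for _ in range(steps):
        P[u] = local_step(P[u], Q, gram(Q), user_pos[u])
        Q[i] = local_step(Q[i], P, gram(P), item_pos[i])

# Toy model (hypothetical): 4 users, 5 items, dimension 3.
rng = random.Random(0)
P = [[rng.gauss(0, 0.1) for _ in range(3)] for _ in range(4)]
Q = [[rng.gauss(0, 0.1) for _ in range(3)] for _ in range(5)]
user_pos = {0: {1}, 1: {2}}
item_pos = {1: {0}, 2: {1}}
P_before, Q_before = [r[:] for r in P], [r[:] for r in Q]
incremental_update(P, Q, user_pos, item_pos, u=2, i=4)
```

All other rows of the two embedding tables are left untouched, which reflects the assumption that a single new interaction should mainly affect the embeddings of the user and item involved.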

7 Conclusion

In this work, we analyze two types of training strategies, negative sampling and non-sampling, for recommendation with implicit feedback. Specifically, we first revisit the objectives of negative sampling and non-sampling. Then we conduct thorough comparisons between negative sampling and non-sampling with a careful setup of various representative methods. Our results empirically show that although negative sampling has been widely applied in recent recommendation models, it is almost impossible for the widely used uniform sampling to show comparable performance to non-sampling learning methods. Moreover, we show that existing state-of-the-art sampling-based methods generally cannot meet the requirements on effectiveness and efficiency at the same time, which limits their applicability to complex recommendation models and online learning scenarios. Overall, while we do not argue that sampling-based methods are always weaker than non-sampling methods, we stress that these results are usually neglected in previous work, because it has recently become common for research papers to compare newly proposed techniques only against sampling-based baselines. We believe this work presents a sanity check for negative sampling and non-sampling learning in implicit recommendation, suggesting that newly proposed recommendation models should be compared with more appropriate baselines before claiming state-of-the-art effectiveness. Finally, we discuss several open problems and future research topics that are worth exploring further. We hope this work will be helpful to researchers and practitioners who are keen on the study of recommender systems and will inspire further research in this field.

Footnotes

¹ https://grouplens.org/datasets/movielens/1m/.

² https://pinterest.com.

³ https://github.com/kuandeng/LightGCN/tree/master/Data/yelp2018.

⁴ https://github.com/huangtinglin/MixGCF/tree/main/data/ali.

⁵ https://github.com/guoguibing/librec.

⁶ https://github.com/geek-ai/irgan.

References

[1] Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web. 1341–1350.
